diff --git a/docs/configuration/engine_args.md b/docs/configuration/engine_args.md
index fb2689a56..e02c7090d 100644
--- a/docs/configuration/engine_args.md
+++ b/docs/configuration/engine_args.md
@@ -6,7 +6,7 @@ title: Engine Arguments
 Engine arguments control the behavior of the vLLM engine.
 
 - For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class.
-- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
+- For [online serving][serving-openai-compatible-server], they are part of the arguments to `vllm serve`.
 
 You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
diff --git a/docs/design/arch_overview.md b/docs/design/arch_overview.md
index 9bfdab170..b2ef76c0e 100644
--- a/docs/design/arch_overview.md
+++ b/docs/design/arch_overview.md
@@ -74,7 +74,7 @@ python -m vllm.entrypoints.openai.api_server --model
 
 That code can be found in .
 
-More details on the API server can be found in the [OpenAI-Compatible Server][openai-compatible-server] document.
+More details on the API server can be found in the [OpenAI-Compatible Server][serving-openai-compatible-server] document.
 
 ## LLM Engine
diff --git a/docs/features/structured_outputs.md b/docs/features/structured_outputs.md
index b63f344eb..614b0bfe9 100644
--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -21,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
 - `guided_grammar`: the output will follow the context free grammar.
 - `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
 
-You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server][serving-openai-compatible-server] page.
 
 Structured outputs are supported by default in the OpenAI-Compatible Server.
 You may choose to specify the backend to use by setting the
diff --git a/docs/getting_started/installation/intel_gaudi.md b/docs/getting_started/installation/intel_gaudi.md
index a4f13dca4..d1d544c83 100644
--- a/docs/getting_started/installation/intel_gaudi.md
+++ b/docs/getting_started/installation/intel_gaudi.md
@@ -110,7 +110,7 @@ docker run \
 ### Supported features
 
 - [Offline inference][offline-inference]
-- Online serving via [OpenAI-Compatible Server][openai-compatible-server]
+- Online serving via [OpenAI-Compatible Server][serving-openai-compatible-server]
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
diff --git a/docs/models/generative_models.md b/docs/models/generative_models.md
index 355ed506e..fd5c65992 100644
--- a/docs/models/generative_models.md
+++ b/docs/models/generative_models.md
@@ -134,7 +134,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
 
 ## Online Serving
 
-Our [OpenAI-Compatible Server][openai-compatible-server] provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server][serving-openai-compatible-server] provides endpoints that correspond to the offline APIs:
 
 - [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
 - [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template.
diff --git a/docs/models/pooling_models.md b/docs/models/pooling_models.md
index 89a128915..693212e64 100644
--- a/docs/models/pooling_models.md
+++ b/docs/models/pooling_models.md
@@ -113,7 +113,7 @@ A code example can be found here:
-- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][openai-compatible-server].
-- in a local directory, simply pass directory path to `model=` for [offline-inference][offline-inference] or `vllm serve ` for the [openai-compatible-server][openai-compatible-server].
+- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][serving-openai-compatible-server].
+- in a local directory, simply pass directory path to `model=` for [offline-inference][offline-inference] or `vllm serve ` for the [openai-compatible-server][serving-openai-compatible-server].
 
 This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
diff --git a/docs/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md
index a3f1ef9fd..e9c0a902a 100644
--- a/docs/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
@@ -1,7 +1,7 @@
 ---
 title: OpenAI-Compatible Server
 ---
-[](){ #openai-compatible-server }
+[](){ #serving-openai-compatible-server }
 
 vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more!
 This functionality lets you serve models and interact with them using an HTTP client.
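The structured-outputs hunk above notes that the guided parameters "must be added as extra parameters" to the request. A minimal sketch (not part of this patch) of passing one of them through the OpenAI Python client's `extra_body` against a running `vllm serve` instance; the model name and URL are placeholders:

```python
# Sketch only: assumes a vLLM OpenAI-compatible server is listening
# on localhost:8000. Model name and base_url are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Classify the sentiment: vLLM is wonderful!"}],
    # Guided-decoding parameters are vLLM extensions, not part of the
    # OpenAI spec, so they are sent via extra_body.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```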
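The Transformers-backend bullets in the last models hunk distinguish loading from the Hugging Face Model Hub (which requires trusting remote code) from loading a local directory. A hedged sketch of the offline path; the model ID and directory are placeholders:

```python
# Sketch only: the model ID and local path are placeholders.
from vllm import LLM

# Hub model whose implementation lives in its repo: opt in to running
# that code with trust_remote_code=True.
llm = LLM(model="some-org/custom-model", trust_remote_code=True)

# Alternatively, for a local checkout, pass the directory path directly:
# llm = LLM(model="/path/to/custom-model")
```

The online equivalents would be `vllm serve some-org/custom-model --trust-remote-code` and `vllm serve /path/to/custom-model`.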