[Doc] Convert docs to use colon fences (#12471)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Author: Harry Mellor
Date: 2025-01-29 03:38:29 +00:00
Committed by: GitHub
Parent: a7e3eba66f
Commit: dd6a3a02cb

68 changed files with 2352 additions and 2341 deletions


@@ -14,9 +14,9 @@ In short, you should increase the number of GPUs and the number of nodes until y
After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply that number by `16` (the block size) to get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfactory, e.g. you want higher throughput, you can further increase the number of GPUs or nodes until the number of blocks is enough.
-```{note}
+:::{note}
There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
-```
+:::
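The block arithmetic described above can be sketched as a small helper. This is a hypothetical function, not part of vLLM; it just parses the quoted log line and assumes the default block size of 16 tokens:

```python
import re

def max_tokens_from_log(log_line: str, block_size: int = 16) -> int:
    """Parse a '# GPU blocks: N' log line and return roughly the
    maximum number of tokens servable (N * block_size)."""
    match = re.search(r"# GPU blocks: (\d+)", log_line)
    if match is None:
        raise ValueError("log line does not contain a GPU block count")
    return int(match.group(1)) * block_size

print(max_tokens_from_log("INFO ... # GPU blocks: 790"))  # 790 * 16 = 12640
```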
## Running vLLM on a single node
@@ -94,12 +94,12 @@ vllm serve /path/to/the/model/in/the/container \
To make tensor parallelism performant, you should make sure the communication between nodes is efficient, e.g. by using high-speed network cards like InfiniBand. To correctly set up the cluster to use InfiniBand, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm that InfiniBand is working is to run vLLM with the `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...`, and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, NCCL is using a raw TCP socket, which is not efficient for cross-node tensor parallelism. If you find `[send] via NET/IB/GDRDMA` in the logs, NCCL is using InfiniBand with GPU-Direct RDMA, which is efficient.
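As a hypothetical sketch, the log check above could be automated; the substrings are exactly the two NCCL log markers quoted in the paragraph, and the helper name is illustrative:

```python
def nccl_transport(log_text: str) -> str:
    """Classify which transport NCCL picked, based on its TRACE logs.

    Illustrative helper, not part of vLLM: it only looks for the two
    log markers discussed above.
    """
    if "[send] via NET/IB/GDRDMA" in log_text:
        return "InfiniBand with GPU-Direct RDMA (efficient)"
    if "[send] via NET/Socket" in log_text:
        return "raw TCP socket (inefficient for cross-node tensor parallel)"
    return "unknown"

print(nccl_transport("rank 0 ... [send] via NET/Socket ..."))
```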
-```{warning}
+:::{warning}
After you start the Ray cluster, you should also check the GPU-GPU communication between nodes, as it can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for processes on the same node, not for processes on other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
-```
+:::
-```{warning}
+:::{warning}
Please make sure you have downloaded the model to all the nodes (at the same path), or that the model is downloaded to a distributed file system accessible by all nodes.
When you use a Hugging Face repo ID to refer to the model, you should append your Hugging Face token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to it.
-```
+:::


@@ -4,6 +4,7 @@
Below, you can find an explanation of every engine argument for vLLM:
+<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
```{eval-rst}
.. argparse::
:module: vllm.engine.arg_utils
@@ -16,6 +17,7 @@ Below, you can find an explanation of every engine argument for vLLM:
Below are the additional arguments related to the asynchronous engine:
+<!--- pyml disable-num-lines 7 no-space-in-emphasis-->
```{eval-rst}
.. argparse::
:module: vllm.engine.arg_utils


@@ -2,14 +2,14 @@
vLLM uses the following environment variables to configure the system:
-```{warning}
+:::{warning}
Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP address for vLLM's **internal usage**. They are not the port and IP address of the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
-```
+:::
-```{literalinclude} ../../../vllm/envs.py
+:::{literalinclude} ../../../vllm/envs.py
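The Kubernetes pitfall above can be illustrated with a sketch: for a Service named `vllm`, kubelet injects docker-link-style variables such as `VLLM_PORT=tcp://10.0.0.1:80`, which collides with vLLM's own `VLLM_PORT`. The helper below is hypothetical, not part of vLLM:

```python
def find_suspicious_vllm_vars(env: dict) -> list:
    """Flag VLLM_-prefixed variables that look Kubernetes-injected
    (docker-link style 'tcp://host:port' values) rather than user-set."""
    return [name for name, value in env.items()
            if name.startswith("VLLM_") and value.startswith("tcp://")]

# A Service named "vllm" would cause a conflict like this:
print(find_suspicious_vllm_vars({
    "VLLM_PORT": "tcp://10.0.0.1:80",   # injected by Kubernetes
    "VLLM_LOGGING_LEVEL": "DEBUG",      # set by the user
}))
```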
:end-before: end-env-vars-definition
:language: python
:start-after: begin-env-vars-definition
-```
+:::


@@ -1,8 +1,8 @@
# External Integrations
-```{toctree}
+:::{toctree}
:maxdepth: 1
langchain
llamaindex
-```
+:::


@@ -31,8 +31,8 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I
The following metrics are exposed:
-```{literalinclude} ../../../vllm/engine/metrics.py
+:::{literalinclude} ../../../vllm/engine/metrics.py
:end-before: end-metrics-definitions
:language: python
:start-after: begin-metrics-definitions
-```
+:::


@@ -4,10 +4,10 @@
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
-```{note}
+:::{note}
We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
-```
+:::
## Offline Inference
@@ -203,13 +203,13 @@ for o in outputs:
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
-```{important}
+:::{important}
A chat template is **required** to use Chat Completions API.
Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
-```
+:::
### Image
@@ -273,24 +273,25 @@ print("Chat completion output:", chat_response.choices[0].message.content)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
-```{tip}
+:::{tip}
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
and pass the file path as `url` in the API request.
-```
+:::
-```{tip}
+:::{tip}
There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
-```
+:::
-````{note}
+:::{note}
By default, the timeout for fetching images through an HTTP URL is `5` seconds.
You can override this by setting the environment variable:
```console
-$ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
+export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
```
-````
+:::
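A minimal sketch of how an integer timeout variable like `VLLM_IMAGE_FETCH_TIMEOUT` is typically read, assuming the documented default of `5` seconds when unset (this is illustrative, not vLLM's actual implementation):

```python
import os

def image_fetch_timeout() -> int:
    """Return the image-fetch timeout in seconds, defaulting to 5."""
    return int(os.environ.get("VLLM_IMAGE_FETCH_TIMEOUT", "5"))

os.environ.pop("VLLM_IMAGE_FETCH_TIMEOUT", None)
print(image_fetch_timeout())  # 5 (the documented default)
os.environ["VLLM_IMAGE_FETCH_TIMEOUT"] = "30"
print(image_fetch_timeout())  # 30 (the override)
```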
### Video
@@ -345,14 +346,15 @@ print("Chat completion output from image url:", result)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
-````{note}
+:::{note}
By default, the timeout for fetching videos through an HTTP URL is `30` seconds.
You can override this by setting the environment variable:
```console
-$ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
+export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
```
-````
+:::
### Audio
@@ -448,24 +450,25 @@ print("Chat completion output from audio url:", result)
Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for_multimodal.py>
-````{note}
+:::{note}
By default, the timeout for fetching audio through an HTTP URL is `10` seconds.
You can override this by setting the environment variable:
```console
-$ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
+export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
```
-````
+:::
### Embedding
vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
-```{tip}
+:::{tip}
The schema of `messages` is exactly the same as in Chat Completions API.
You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
-```
+:::
Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
Refer to the examples below for illustration.
@@ -477,13 +480,13 @@ vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
```
-```{important}
+:::{important}
Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
and can be found here: <gh-file:examples/template_vlm2vec.jinja>
-```
+:::
Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
@@ -518,16 +521,16 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
```
-```{important}
+:::{important}
Like with VLM2Vec, we have to explicitly pass `--task embed`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
-```
+:::
-```{important}
+:::{important}
`MrLight/dse-qwen2-2b-mrl-v1` also requires a placeholder image of the minimum image size for text query embeddings. See the full code example below for details.
-```
+:::
Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>


@@ -22,9 +22,9 @@ The available APIs depend on the type of model that is being run:
Please refer to the above pages for more details about each API.
-```{seealso}
+:::{seealso}
[API Reference](/api/offline_inference/index)
-```
+:::
## Configuration Options
@@ -70,12 +70,12 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
tensor_parallel_size=2)
```
-```{important}
+:::{important}
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
-```
+:::
#### Quantization


@@ -161,11 +161,11 @@ print(completion._request_id)
The `vllm serve` command is used to launch the OpenAI-compatible server.
-```{argparse}
+:::{argparse}
:module: vllm.entrypoints.openai.cli_args
:func: create_parser_for_docs
:prog: vllm serve
-```
+:::
#### Configuration file
@@ -188,10 +188,10 @@ To use the above config file:
vllm serve SOME_MODEL --config config.yaml
```
-```{note}
+:::{note}
If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
The order of priorities is `command line > config file values > defaults`.
-```
+:::
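The precedence rule above (`command line > config file values > defaults`) can be sketched as a dictionary merge; the option names here are illustrative, not vLLM's internal ones:

```python
# Later merges win, so ordering encodes the priority:
# defaults < config file < command line.
defaults = {"port": 8000, "max-model-len": 2048}
config_file = {"port": 9000}             # e.g. parsed from config.yaml
command_line = {"max-model-len": 4096}   # e.g. parsed from CLI flags

effective = {**defaults, **config_file, **command_line}
print(effective)  # {'port': 9000, 'max-model-len': 4096}
```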
## API Reference
@@ -208,19 +208,19 @@ Code example: <gh-file:examples/online_serving/openai_completion_client.py>
The following [sampling parameters](#sampling-params) are supported.
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-sampling-params
:end-before: end-completion-sampling-params
-```
+:::
The following extra parameters are supported:
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-completion-extra-params
:end-before: end-completion-extra-params
-```
+:::
(chat-api)=
@@ -240,19 +240,19 @@ Code example: <gh-file:examples/online_serving/openai_chat_completion_client.py>
The following [sampling parameters](#sampling-params) are supported.
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-sampling-params
:end-before: end-chat-completion-sampling-params
-```
+:::
The following extra parameters are supported:
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-completion-extra-params
:end-before: end-chat-completion-extra-params
-```
+:::
(embeddings-api)=
@@ -264,9 +264,9 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model.
-```{tip}
+:::{tip}
This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
-```
+:::
Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
@@ -274,27 +274,27 @@ Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
The following [pooling parameters](#pooling-params) are supported.
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-pooling-params
:end-before: end-embedding-pooling-params
-```
+:::
The following extra parameters are supported by default:
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-embedding-extra-params
:end-before: end-embedding-extra-params
-```
+:::
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-chat-embedding-extra-params
:end-before: end-chat-embedding-extra-params
-```
+:::
(tokenizer-api)=
@@ -465,19 +465,19 @@ Response:
The following [pooling parameters](#pooling-params) are supported.
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-score-pooling-params
:end-before: end-score-pooling-params
-```
+:::
The following extra parameters are supported:
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-score-extra-params
:end-before: end-score-extra-params
-```
+:::
(rerank-api)=
@@ -552,16 +552,16 @@ Response:
The following [pooling parameters](#pooling-params) are supported.
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-pooling-params
:end-before: end-rerank-pooling-params
-```
+:::
The following extra parameters are supported:
-```{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
+:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
:language: python
:start-after: begin-rerank-extra-params
:end-before: end-rerank-extra-params
-```
+:::