[Doc] [1/N] Reorganize Getting Started section (#11645)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
37
docs/source/getting_started/faq.md
Normal file
37
docs/source/getting_started/faq.md
Normal file
@@ -0,0 +1,37 @@
|
||||
(faq)=
|
||||
|
||||
# Frequently Asked Questions
|
||||
|
||||
> Q: How can I serve multiple models on a single port using the OpenAI API?
|
||||
|
||||
A: Assuming that you're referring to using OpenAI compatible server to serve multiple models at once, that is not currently supported, you can run multiple instances of the server (each serving a different model) at the same time, and have another layer to route the incoming request to the correct server accordingly.
|
||||
|
||||
______________________________________________________________________
|
||||
|
||||
> Q: Which model to use for offline inference embedding?
|
||||
|
||||
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
|
||||
more are listed [here](#supported-models).
|
||||
|
||||
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
|
||||
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
|
||||
but they are expected be inferior to models that are specifically trained on embedding tasks.
|
||||
|
||||
______________________________________________________________________
|
||||
|
||||
> Q: Can the output of a prompt vary across runs in vLLM?
|
||||
|
||||
A: Yes, it can. vLLM does not guarantee stable log probabilities (logprobs) for the output tokens. Variations in logprobs may occur due to
|
||||
numerical instability in Torch operations or non-deterministic behavior in batched Torch operations when batching changes. For more details,
|
||||
see the [Numerical Accuracy section](https://pytorch.org/docs/stable/notes/numerical_accuracy.html#batched-computations-or-slice-computations).
|
||||
|
||||
In vLLM, the same requests might be batched differently due to factors such as other concurrent requests,
|
||||
changes in batch size, or batch expansion in speculative decoding. These batching variations, combined with numerical instability of Torch operations,
|
||||
can lead to slightly different logit/logprob values at each step. Such differences can accumulate, potentially resulting in
|
||||
different tokens being sampled. Once a different token is sampled, further divergence is likely.
|
||||
|
||||
**Mitigation Strategies**
|
||||
|
||||
- For improved stability and reduced variance, use `float32`. Note that this will require more memory.
|
||||
- If using `bfloat16`, switching to `float16` can also help.
|
||||
- Using request seeds can aid in achieving more stable generation for temperature > 0, but discrepancies due to precision differences may still occur.
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
# Installation for ARM CPUs
|
||||
|
||||
vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the x86 platform documentation covering:
|
||||
vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the [x86 CPU documentation](#installation-x86) covering:
|
||||
|
||||
- CPU backend inference capabilities
|
||||
- Relevant runtime environment variables
|
||||
@@ -1,6 +1,6 @@
|
||||
(installation-cpu)=
|
||||
(installation-x86)=
|
||||
|
||||
# Installation with CPU
|
||||
# Installation for x86 CPUs
|
||||
|
||||
vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM CPU backend supports the following vLLM features:
|
||||
|
||||
@@ -151,4 +151,4 @@ $ python examples/offline_inference.py
|
||||
$ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
|
||||
```
|
||||
|
||||
- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx.md) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
|
||||
- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
|
||||
@@ -1,6 +1,6 @@
|
||||
(installation)=
|
||||
(installation-cuda)=
|
||||
|
||||
# Installation
|
||||
# Installation for CUDA
|
||||
|
||||
vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
(installation-rocm)=
|
||||
|
||||
# Installation with ROCm
|
||||
# Installation for ROCm
|
||||
|
||||
vLLM supports AMD GPUs with ROCm 6.2.
|
||||
|
||||
@@ -1,4 +1,6 @@
|
||||
# Installation with Intel® Gaudi® AI Accelerators
|
||||
(installation-gaudi)=
|
||||
|
||||
# Installation for Intel® Gaudi®
|
||||
|
||||
This README provides instructions on running vLLM with Intel Gaudi devices.
|
||||
|
||||
19
docs/source/getting_started/installation/index.md
Normal file
19
docs/source/getting_started/installation/index.md
Normal file
@@ -0,0 +1,19 @@
|
||||
(installation-index)=
|
||||
|
||||
# Installation
|
||||
|
||||
vLLM supports the following hardware platforms:
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
gpu-cuda
|
||||
gpu-rocm
|
||||
cpu-x86
|
||||
cpu-arm
|
||||
hpu-gaudi
|
||||
tpu
|
||||
xpu
|
||||
openvino
|
||||
neuron
|
||||
```
|
||||
@@ -1,6 +1,6 @@
|
||||
(installation-neuron)=
|
||||
|
||||
# Installation with Neuron
|
||||
# Installation for Neuron
|
||||
|
||||
vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching.
|
||||
Paged Attention and Chunked Prefill are currently in development and will be available soon.
|
||||
@@ -1,8 +1,8 @@
|
||||
(installation-openvino)=
|
||||
|
||||
# Installation with OpenVINO
|
||||
# Installation for OpenVINO
|
||||
|
||||
vLLM powered by OpenVINO supports all LLM models from {doc}`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features:
|
||||
vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features:
|
||||
|
||||
- Prefix caching (`--enable-prefix-caching`)
|
||||
- Chunked prefill (`--enable-chunked-prefill`)
|
||||
@@ -1,6 +1,6 @@
|
||||
(installation-tpu)=
|
||||
|
||||
# Installation with TPU
|
||||
# Installation for TPUs
|
||||
|
||||
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
|
||||
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
|
||||
@@ -1,6 +1,6 @@
|
||||
(installation-xpu)=
|
||||
|
||||
# Installation with XPU
|
||||
# Installation for XPUs
|
||||
|
||||
vLLM initially supports basic model inferencing and serving on Intel GPU platform.
|
||||
|
||||
@@ -23,7 +23,7 @@ $ conda activate myenv
|
||||
$ pip install vllm
|
||||
```
|
||||
|
||||
Please refer to the {ref}`installation documentation <installation>` for more details on installing vLLM.
|
||||
Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.
|
||||
|
||||
(offline-batched-inference)=
|
||||
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
(debugging)=
|
||||
(troubleshooting)=
|
||||
|
||||
# Debugging Tips
|
||||
# Troubleshooting
|
||||
|
||||
This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
|
||||
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
|
||||
|
||||
```{note}
|
||||
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
|
||||
@@ -47,6 +47,7 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
|
||||
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
|
||||
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
|
||||
|
||||
(troubleshooting-incorrect-hardware-driver)=
|
||||
## Incorrect hardware/driver
|
||||
|
||||
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
|
||||
@@ -139,7 +140,7 @@ A multi-node environment is more complicated than a single-node one. If you see
|
||||
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
|
||||
```
|
||||
|
||||
(debugging-python-multiprocessing)=
|
||||
(troubleshooting-python-multiprocessing)=
|
||||
## Python multiprocessing
|
||||
|
||||
### `RuntimeError` Exception
|
||||
@@ -150,7 +151,7 @@ If you have seen a warning in your logs like this:
|
||||
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
|
||||
initialized. We must use the `spawn` multiprocessing start method. Setting
|
||||
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
|
||||
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
|
||||
https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
|
||||
for more information.
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user