[Doc][3/N] Reorganize Serving section (#11766)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
docs/source/serving/offline_inference.md (Normal file, 79 lines)
@@ -0,0 +1,79 @@
(offline-inference)=

# Offline Inference

You can run vLLM in your own code on a list of prompts.

The offline API is based on the {class}`~vllm.LLM` class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.

For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
and runs it in vLLM using the default configuration.

```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```

After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:

- [Generative models](#generative-models) output logprobs which are sampled from to obtain the final output text.
- [Pooling models](#pooling-models) output their hidden states directly.

Please refer to the above pages for more details about each API.
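
For example, here is a minimal sketch of text generation with the `LLM.generate` API (the prompt and sampling settings are illustrative, not recommendations):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

# Sampling settings here are illustrative only.
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], sampling_params)

for output in outputs:
    # Each RequestOutput pairs the original prompt with its completion(s).
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```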

```{seealso}
[API Reference](/dev/offline_inference/offline_index)
```

## Configuration Options

This section lists the most common options for running the vLLM engine.
For a full list, refer to the [Engine Arguments](#engine-args) page.

### Reducing memory usage

Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.

#### Tensor Parallelism (TP)

Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.

The following code splits the model across 2 GPUs.

```python
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```

```{important}
To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
```
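
For example, a minimal sketch of pinning vLLM to specific GPUs by setting the variable from Python before the engine is created (the device IDs and model are illustrative):

```python
import os

# Must be set before vLLM initializes CUDA; "0,1" is an illustrative device choice.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          tensor_parallel_size=2)
```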

#### Quantization

Quantized models take less memory at the cost of lower precision.

Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Neural Magic](https://huggingface.co/neuralmagic))
and used directly without extra configuration.

Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
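
For instance, a minimal sketch of dynamic FP8 quantization (assuming your GPU supports it; the model name is illustrative):

```python
# Quantize the weights on the fly at load time; "fp8" is one supported value.
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
          quantization="fp8")
```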

#### Context length and batch size

You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).

```python
llm = LLM(model="adept/fuyu-8b",
          max_model_len=2048,
          max_num_seqs=2)
```

### Performance optimization and tuning

You can potentially improve the performance of vLLM by tuning various options.
Please refer to [this guide](#optimization-and-tuning) for more details.