[Doc] Update more docs with respect to V1 (#29188)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
We have fully deprecated V0. Please read [RFC #18571](https://github.com/vllm-project/vllm/issues/18571) for more details.
## Why vLLM V1?
If you have a use case that works on the V0 engine but not on V1, please share it on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).
vLLM V0 successfully supported a wide range of models and hardware, but as new features were developed independently, the system grew increasingly complex. This complexity made it harder to integrate new capabilities and introduced technical debt, revealing the need for a more streamlined and unified design.
This living user guide outlines a few known **important changes and limitations** introduced by vLLM V1. It will be updated continually as more features gain support in vLLM V1.
## Differences from V0
This section lists some differences in behavior between V0 and V1.
### Chunked Prefill
Chunked prefill is enabled by default whenever possible, unlike in V0 where it was conditionally enabled based on model characteristics.
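To make the mechanism concrete, here is a minimal toy sketch (not vLLM's actual scheduler code; the function name and chunk size are illustrative) of how a long prompt's prefill can be split into fixed-size chunks so that each engine step stays within a per-step token budget:

```python
# Hypothetical sketch (not vLLM's implementation): split a long prompt's
# prefill into fixed-size chunks so each engine step stays within a budget.

def chunk_prefill(prompt_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return (start, end) token ranges, one per engine step."""
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]

# A 10_000-token prompt processed 2_048 tokens at a time takes 5 steps,
# leaving room in each step to also schedule decode tokens of other requests.
steps = chunk_prefill(10_000, 2_048)
```

Because prefill work is bounded per step, decoding requests are not starved behind one very long prompt.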
### CUDA Graphs
CUDA graph capture takes up more memory in V1 than in V0.
### Semantic Changes to Logprobs
#### Logprobs Calculation
By default, logprobs in V1 are returned immediately once computed from the model's raw output (i.e., before applying any logits post-processing such as temperature scaling or penalty adjustments). As a result, the returned logprobs do not reflect the final adjusted probabilities used during sampling.
You can adjust this behavior by setting the `--logprobs-mode` flag.
Four modes are supported: `raw_logprobs` (default), `processed_logprobs`, `raw_logits`, and `processed_logits`.
"Raw" means the values before applying any logits processors (such as bad-words filtering); "processed" means the values after applying all processors, including temperature and top_k/top_p.
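The raw/processed distinction can be seen with a toy example (plain Python, not vLLM code): temperature scaling changes the distribution that sampling actually uses, so the reported logprobs differ between the two modes.

```python
# Toy illustration of why `raw_logprobs` and `processed_logprobs` differ:
# temperature scaling reshapes the distribution used for sampling.
import math

def log_softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

logits = [2.0, 1.0, 0.0]           # raw model output for three tokens
raw = log_softmax(logits)           # what `raw_logprobs` would report

temperature = 0.5                   # sharpens the distribution
processed = log_softmax([x / temperature for x in logits])  # `processed_logprobs`

# The top token's processed logprob is higher than its raw logprob,
# because a low temperature concentrates probability mass on it.
```

Both modes return valid log-probabilities; they simply answer different questions (model belief vs. sampling distribution).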
#### Prompt Logprobs with Prefix Caching
While V1 supports returning prompt logprobs with prefix caching enabled, it no longer caches the logprobs. For a request requiring prompt logprobs, the engine ignores the prefix cache and recomputes the prefill of the full prompt to generate the logprobs.
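The scheduling consequence can be sketched as follows (a hypothetical helper, not vLLM's code; the function name is illustrative): a cached prefix normally reduces the prefill work, but a request that needs prompt logprobs must run the whole prompt through the model so every position's logprob is available.

```python
# Hypothetical sketch of the behavior described above: prompt logprobs
# force a full-prompt prefill even when a prefix is cached.

def tokens_to_prefill(prompt_len: int, cached_prefix_len: int,
                      needs_prompt_logprobs: bool) -> int:
    """Number of prompt tokens the engine must actually run through the model."""
    if needs_prompt_logprobs:
        # Ignore the prefix cache and recompute the full prompt.
        return prompt_len
    return prompt_len - cached_prefix_len

full = tokens_to_prefill(1000, 800, needs_prompt_logprobs=True)    # 1000
partial = tokens_to_prefill(1000, 800, needs_prompt_logprobs=False)  # 200
```

So enabling prompt logprobs trades prefix-cache savings for complete per-token information.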
## Feature Support
For each item, its support in vLLM V1 falls into one of the following states:
- **🟢 Functional**: Fully operational with optimizations comparable to or better than V0.
- **🟡 In Progress**: Planned to be in vLLM V1, with open PRs/RFCs.
- **🔴 Removed**: Dropped from vLLM V1. Will only consider re-introducing if there is strong demand.
!!! note

    vLLM V1's unified scheduler treats both prompt and output tokens the same
    way, enabling chunked prefill, prefix caching, and speculative decoding
    without a strict separation between prefill and decode phases.
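The unified treatment of prompt and output tokens can be sketched with a toy budget allocator (hypothetical code, not vLLM's scheduler; names are illustrative): each step, a fixed token budget is spread over all requests, whether they need many prefill tokens or a single decode token.

```python
# Hypothetical sketch of V1-style unified scheduling: one token budget per
# step is distributed across requests via a simple {request_id: num_tokens}
# assignment, with no separate prefill/decode phases.

def allocate_budget(waiting_tokens: dict[str, int], budget: int) -> dict[str, int]:
    """Grant each request up to its needed tokens, in order, until budget runs out."""
    allocation: dict[str, int] = {}
    for request_id, needed in waiting_tokens.items():
        if budget == 0:
            break
        granted = min(needed, budget)
        allocation[request_id] = granted
        budget -= granted
    return allocation

# Two decoding requests (1 token each) and one long prefill share a
# 512-token step; the prefill consumes whatever budget remains.
step = allocate_budget({"decode-1": 1, "decode-2": 1, "prefill-0": 2048}, 512)
```

Because everything is just "tokens to run this step", chunked prefill falls out naturally: the long prefill simply continues in later steps.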
### Hardware
| Hardware      | Status          |
|---------------|-----------------|
| **NVIDIA**    | <nobr>🟢</nobr> |
| **AMD**       | <nobr>🟢</nobr> |
| **INTEL GPU** | <nobr>🟢</nobr> |
| **TPU**       | <nobr>🟢</nobr> |
| **CPU**       | <nobr>🟢</nobr> |
### Models
| Model Type                 | Status                                 |
|----------------------------|----------------------------------------|
| **Decoder-only Models**    | <nobr>🟢</nobr>                        |
| **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others)</nobr> |
| **Pooling Models**         | <nobr>🟢</nobr>                        |
| **Mamba Models**           | <nobr>🟢</nobr>                        |
| **Multimodal Models**      | <nobr>🟢</nobr>                        |
See below for the status of models that are not yet supported or have more features planned in V1.
#### Pooling Models
Now fully supported, with prefix caching and chunked prefill newly available for last-pooling models.
We are working on enabling prefix caching and chunked prefill for more categories of pooling models.
#### Mamba Models
Please note that prefix caching is not yet supported for Mamba models.
Whisper is supported. Other models requiring cross-attention between separate encoder and decoder (e.g., `BartForConditionalGeneration`, `MllamaForConditionalGeneration`) are no longer supported.
### Features
| Feature                                     | Status                                                                           |
|---------------------------------------------|----------------------------------------------------------------------------------|
| **Prefix Caching**                          | <nobr>🟢 Functional</nobr>                                                       |
| **Chunked Prefill**                         | <nobr>🟢 Functional</nobr>                                                       |
| **LoRA**                                    | <nobr>🟢 Functional</nobr>                                                       |
| **Logprobs Calculation**                    | <nobr>🟢 Functional</nobr>                                                       |
| **FP8 KV Cache**                            | <nobr>🟢 Functional</nobr>                                                       |
| **Spec Decode**                             | <nobr>🟢 Functional</nobr>                                                       |
| **Prompt Logprobs with Prefix Caching**     | <nobr>🟢 Functional</nobr>                                                       |
| **Structured Output Alternative Backends**  | <nobr>🟢 Functional</nobr>                                                       |
| **Concurrent Partial Prefills**             | <nobr>🟡 [In Progress](https://github.com/vllm-project/vllm/issues/14003)</nobr> |
| **best_of**                                 | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/issues/13361)</nobr>     |
| **Per-Request Logits Processors**           | <nobr>🔴 [Removed](https://github.com/vllm-project/vllm/pull/13360)</nobr>       |
| **GPU <> CPU KV Cache Swapping**            | <nobr>🔴 Removed</nobr>                                                          |
| **Request-level Structured Output Backend** | <nobr>🔴 Removed</nobr>                                                          |
#### Removed Features
As part of the major architectural rework in vLLM V1, several legacy features have been removed.
##### Sampling features
- **best_of**: This feature has been removed due to limited usage. See details at [RFC #13361](https://github.com/vllm-project/vllm/issues/13361).
- **Per-Request Logits Processors**: In V0, users could pass custom processing functions to adjust logits on a per-request basis. In vLLM V1, this feature has been removed. Instead, we now support **global logits processors**, which are set at startup time; see [RFC #17799](https://github.com/vllm-project/vllm/issues/17799).
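To illustrate the shift, here is a toy sketch (not the actual vLLM API; names like `make_ban_token_processor` are invented for illustration): a logits processor is configured once at startup and then applied to every request's logits, rather than being passed with each request.

```python
# Toy illustration of a "global" logits processor: configured once at
# startup, applied to all requests (instead of per-request callables).
from typing import Callable

LogitsProcessor = Callable[[list[float]], list[float]]

def make_ban_token_processor(banned_token_id: int) -> LogitsProcessor:
    def process(logits: list[float]) -> list[float]:
        out = list(logits)
        out[banned_token_id] = float("-inf")  # this token can never be sampled
        return out
    return process

# Configured once, at engine startup:
GLOBAL_PROCESSORS: list[LogitsProcessor] = [make_ban_token_processor(2)]

def apply_processors(logits: list[float]) -> list[float]:
    """Run every request's logits through the startup-configured chain."""
    for proc in GLOBAL_PROCESSORS:
        logits = proc(logits)
    return logits

processed = apply_processors([0.1, 0.5, 3.2, 0.9])
```

Configuring processors globally lets the engine batch and compile the sampling path, which a per-request Python callable would prevent.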
##### KV Cache features
- **GPU <> CPU KV Cache Swapping**: Removed; vLLM V1 instead relies on recomputation to handle request preemptions.
##### Structured Output features
- **Request-level Structured Output Backend**: Removed; alternative backends (outlines, guidance) with fallbacks are supported now.