Allow markdownlint to run locally (#36398)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Author: Harry Mellor
Date: 2026-03-09 03:05:24 +00:00
Committed by: GitHub
Parent: fde4771bbd
Commit: a0f44bb616
47 changed files with 394 additions and 392 deletions

.github/mergify.yml

@@ -38,15 +38,13 @@ pull_request_rules:
 > [!TIP]
 > <details>
-> <summary>Is <code>mypy</code> or <code>markdownlint</code> failing?</summary>
+> <summary>Is <code>mypy</code> failing?</summary>
 > <br/>
-> <code>mypy</code> and <code>markdownlint</code> are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
+> <code>mypy</code> is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
 >
 > ```bash
 > # For mypy (substitute "3.10" with the failing version if needed)
 > pre-commit run --hook-stage manual mypy-3.10
-> # For markdownlint
-> pre-commit run --hook-stage manual markdownlint
 > ```
 > </details>


@@ -24,12 +24,12 @@ repos:
     exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))|vllm/third_party/.*'
     types_or: [c++, cuda]
     args: [--style=file, --verbose]
-- repo: https://github.com/igorshubovych/markdownlint-cli
-  rev: v0.45.0
+- repo: https://github.com/DavidAnson/markdownlint-cli2
+  rev: v0.21.0
   hooks:
-  - id: markdownlint
-    exclude: '.*\.inc\.md'
-    stages: [manual] # Only run in CI
+  - id: markdownlint-cli2
+    language_version: lts
+    args: [--fix]
 - repo: https://github.com/rhysd/actionlint
   rev: v1.7.7
   hooks:


@@ -187,7 +187,7 @@ python benchmark.py \
 ## Hardware Requirements
 
 | Backend | Hardware |
-|---------|----------|
+| ------- | -------- |
 | Flash/Triton/FlashInfer | Any CUDA GPU |
 | CUTLASS MLA | Blackwell (SM100+) |
 | FlashAttn MLA | Hopper (SM90+) |


@@ -41,7 +41,7 @@ MODEL=meta-llama/Llama-3.3-70B-Instruct SYSTEM=TPU TP=8 DOWNLOAD_DIR='' INPUT_LE
 | --- | --- | --- |
 | `BASE` | **Required.** The absolute path to the parent directory of your vLLM repository directory. | `"$HOME"` |
 | `MODEL` | **Required.** The Hugging Face model identifier to be served by vllm. | `"meta-llama/Llama-3.1-8B-Instruct"` |
-| `SYSTEM`| **Required.** The hardware you are running on. Choices: `TPU` or `GPU`. (For other systems, it might not support saving profiles) | `"TPU"` |
+| `SYSTEM` | **Required.** The hardware you are running on. Choices: `TPU` or `GPU`. (For other systems, it might not support saving profiles) | `"TPU"` |
 | `TP` | **Required.** The tensor-parallelism size. | `1` |
 | `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) |
 | `INPUT_LEN` | **Required.** Request input length. | `4000` |


@@ -18,7 +18,7 @@ th {
 </style>
 
 | Dataset | Online | Offline | Data Path |
-|---------|--------|---------|-----------|
+| ------- | ------ | ------- | --------- |
 | ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
 | ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
 | ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
@@ -383,14 +383,14 @@ The `--burstiness` parameter mathematically controls request arrival patterns us
 Load Pattern Recommendations by Use Case:
 
 | Use Case | Burstiness | Request Rate | Max Concurrency | Description |
 | --- | --- | --- | --- | --- |
 | Maximum Throughput | N/A | Infinite | Limited | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |
 | Realistic Testing | 1.0 | Moderate (5-20) | Infinite | Natural Poisson traffic patterns for baseline performance |
 | Stress Testing | 0.1-0.5 | High (20-100) | Infinite | Challenging burst patterns to test resilience |
 | Latency Profiling | 2.0-5.0 | Low (1-10) | Infinite | Uniform load for consistent timing analysis |
 | Capacity Planning | 1.0 | Variable | Limited | Test resource limits with realistic constraints |
 | SLA Validation | 1.0 | Target rate | SLA limit | Production-like constraints for compliance testing |
 
 These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.
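As a rough illustration of the burstiness parameter described above, gamma-distributed inter-arrival times can be sampled in a few lines of Python. This is a hypothetical sketch, not the benchmark's actual implementation; the `inter_arrival_times` helper and its signature are assumptions.

```python
import random

def inter_arrival_times(request_rate: float, burstiness: float, n: int, seed: int = 0):
    """Sample gaps between requests from a Gamma distribution.

    With shape = burstiness and scale = 1 / (request_rate * burstiness), the
    mean gap is always 1 / request_rate. burstiness = 1.0 reduces to a Poisson
    process; lower values give burstier arrivals, higher values more uniform ones.
    """
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]

# Mean gap stays ~0.1 s at 10 req/s whether traffic is bursty or uniform;
# only the variance of the gaps changes.
bursty = inter_arrival_times(request_rate=10.0, burstiness=0.2, n=20_000)
uniform = inter_arrival_times(request_rate=10.0, burstiness=5.0, n=20_000)
mean_bursty = sum(bursty) / len(bursty)
mean_uniform = sum(uniform) / len(uniform)
```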
@@ -941,7 +941,7 @@ Benchmark per-stage latency of the multimodal (MM) input processor pipeline, inc
 The benchmark measures the following stages for each request:
 
 | Stage | Description |
-|-------|-------------|
+| ----- | ----------- |
 | `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
 | `get_cache_missing_items_secs` | Time spent looking up the processor cache |
 | `apply_hf_processor_secs` | Time spent in the HuggingFace processor |


@@ -60,12 +60,12 @@ Here is an example using the script to compare result_a and result_b with max co
 ***Output Tput (tok/s) — Model : [ meta-llama/Llama-3.1-8B-Instruct ] , Dataset Name : [ random ] , Input Len : [ 2048.0 ] , Output Len : [ 2048.0 ]***
 
 | | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
-|----|------|-----|-----------|----------|----------|
+| | -------------------- | --- | -------------------------------- | -------------------------------- | ---------- |
 | 0 | 12 | inf | 24.98 | 186.03 | 7.45 |
-| 1 | 16 | inf| 25.49 | 246.92 | 9.69 |
-| 2 | 24 | inf| 27.74 | 293.34 | 10.57 |
-| 3 | 32 | inf| 28.61 |306.69 | 10.72 |
+| 1 | 16 | inf | 25.49 | 246.92 | 9.69 |
+| 2 | 24 | inf | 27.74 | 293.34 | 10.57 |
+| 3 | 32 | inf | 28.61 |306.69 | 10.72 |
 
 ***compare-json-results.py Command-Line Parameters***
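For reference, the `perf_ratio` column in the table above is simply the second throughput column divided by the first. A quick sanity check of the numbers (this is just the arithmetic the column reflects, not how `compare-json-results.py` is implemented):

```python
# perf_ratio = results_b throughput / results_a throughput, rounded to 2 dp.
a_tput = [24.98, 25.49, 27.74, 28.61]      # results_a/benchmark_results.json
b_tput = [186.03, 246.92, 293.34, 306.69]  # results_b/benchmark_results.json

perf_ratio = [round(b / a, 2) for a, b in zip(a_tput, b_tput)]
# Reproduces the perf_ratio column: [7.45, 9.69, 10.57, 10.72]
```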


@@ -29,7 +29,7 @@ vllm bench mm-processor \
 ## Measured Stages
 
 | Stage | Description |
-|-------|-------------|
+| ----- | ----------- |
 | `get_mm_hashes_secs` | Time spent hashing multimodal inputs |
 | `get_cache_missing_items_secs` | Time spent looking up the processor cache |
 | `apply_hf_processor_secs` | Time spent in the HuggingFace processor |


@@ -1,3 +1,4 @@
+<!-- markdownlint-disable MD041 -->
 When passing JSON CLI arguments, the following sets of arguments are equivalent:
 
 - `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`

@@ -6,4 +7,4 @@ When passing JSON CLI arguments, the following sets of arguments are equivalent:
 Additionally, list elements can be passed individually using `+`:
 
 - `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
 - `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
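The dotted/`+` shorthand in the documentation above can be modeled with a small helper. This is a hypothetical sketch of the equivalence, not vLLM's actual CLI parser; `apply_dotted` is an assumed name.

```python
def apply_dotted(args: dict, dotted_key: str, value: str) -> None:
    """Apply one `--json-arg.<path>`-style assignment to a nested dict."""
    *path, last = dotted_key.split(".")
    node = args
    for part in path:
        node = node.setdefault(part, {})
    if last.endswith("+"):
        # `+` appends; a quoted 'value4,value5' adds several elements at once.
        node.setdefault(last[:-1], []).extend(value.split(","))
    else:
        node[last] = value

args: dict = {}
apply_dotted(args, "key4+", "value3")
apply_dotted(args, "key4+", "value4,value5")
# Equivalent to --json-arg '{"key4": ["value3", "value4", "value5"]}'
```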


@@ -293,7 +293,7 @@ llm = LLM(
 Based on the configuration, the content of the multi-modal caches on `P0` and `P1` are as follows:
 
 | mm_processor_cache_type | Cache Type | `P0` Cache | `P1` Engine Cache | `P1` Worker Cache | Max. Memory |
-|-------------------|-------------|------------|------------|-------------|-------------|
+| ----------------- | ----------- | ---------- | ---------- | ----------- | ----------- |
 | lru | Processor Caching | K + V | N/A | N/A | `mm_processor_cache_gb * data_parallel_size` |
 | lru | Key-Replicated Caching | K | K + V | N/A | `mm_processor_cache_gb * api_server_count` |
 | shm | Shared Memory Caching | K | N/A | V | `mm_processor_cache_gb * api_server_count` |
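The Max. Memory column above follows directly from which processes hold a cache. A sketch of the bound (hypothetical helper, not a vLLM API):

```python
def max_cache_memory_gb(cache_type: str, mm_processor_cache_gb: float,
                        data_parallel_size: int, api_server_count: int) -> float:
    """Worst-case multi-modal cache memory, per the table above."""
    if cache_type == "processor":
        # lru processor caching: one K + V cache per data-parallel rank
        return mm_processor_cache_gb * data_parallel_size
    # key-replicated (lru) and shared-memory (shm) caching scale with API servers
    return mm_processor_cache_gb * api_server_count

assert max_cache_memory_gb("processor", 4.0, data_parallel_size=2, api_server_count=1) == 8.0
assert max_cache_memory_gb("shm", 4.0, data_parallel_size=2, api_server_count=3) == 12.0
```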


@@ -94,7 +94,6 @@ vLLM's `pre-commit` hooks will now run automatically every time you commit.
 Some `pre-commit` hooks only run in CI. If you need to, you can run them locally with:
 
 ```bash
-pre-commit run --hook-stage manual markdownlint
 pre-commit run --hook-stage manual mypy-3.10
 ```


@@ -66,12 +66,12 @@ This complicates the process as we cannot use the out-of-the-box
 - Important indexes at the moment include:
 
   | Platform | `--extra-index-url` |
-  |----------|-----------------|
-  | CUDA 12.8| [https://download.pytorch.org/whl/cu128](https://download.pytorch.org/whl/cu128)|
-  | CPU | [https://download.pytorch.org/whl/cpu](https://download.pytorch.org/whl/cpu)|
+  | -------- | ------------------- |
+  | CUDA 12.8 | [https://download.pytorch.org/whl/cu128](https://download.pytorch.org/whl/cu128) |
+  | CPU | [https://download.pytorch.org/whl/cpu](https://download.pytorch.org/whl/cpu) |
   | ROCm 6.2 | [https://download.pytorch.org/whl/rocm6.2.4](https://download.pytorch.org/whl/rocm6.2.4) |
   | ROCm 6.3 | [https://download.pytorch.org/whl/rocm6.3](https://download.pytorch.org/whl/rocm6.3) |
   | XPU | [https://download.pytorch.org/whl/xpu](https://download.pytorch.org/whl/xpu) |
 - Update the below files to match the CUDA version from step 1. This makes sure that the release vLLM wheel is tested on CI.
   - `.buildkite/release-pipeline.yaml`


@@ -66,7 +66,7 @@ stages will be removed.
 Assume a feature is deprecated in `v0.9.0`.
 
 | Release | Status |
-|---------------|-------------------------------------------------------------------------------------------------|
+| ------------- | ----------------------------------------------------------------------------------------------- |
 | `v0.9.0` | Feature is deprecated with clear removal version listed. |
 | `v0.10.0` | Feature is now off by default, throws an error when used, and can be re-enabled for legacy use. |
 | `v0.11.0` | Feature is removed. |


@@ -49,7 +49,7 @@ chart **including persistent volumes** and deletes the release.
 The following table describes configurable parameters of the chart in `values.yaml`:
 
 | Key | Type | Default | Description |
-|-----|------|---------|-------------|
+| --- | ---- | ------- | ----------- |
 | autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
 | autoscaling.enabled | bool | false | Enable autoscaling |
 | autoscaling.maxReplicas | int | 100 | Maximum replicas |


@@ -6,7 +6,7 @@ A Ray cluster can be declared in YAML, and the operator then handles pod schedul
 ## Why KubeRay instead of manual scripts?
 
 | Feature | Manual scripts | KubeRay |
-|---------|-----------------------------------------------------------|---------|
+| ------- | --------------------------------------------------------- | ------- |
 | Cluster bootstrap | Manually SSH into every node and run a script | One command to create or update the whole cluster: `kubectl apply -f cluster.yaml` |
 | Autoscaling | Manual | Automatically patches CRDs for adjusting cluster size |
 | Upgrades | Tear down & re-create manually | Blue/green deployment updates supported |


@@ -119,7 +119,7 @@ The code can be found in [vllm/v1/engine/coordinator.py](../../vllm/v1/engine/co
 For a deployment with `N` GPUs, `TP` tensor parallel size, `DP` data parallel size, and `A` API server count:
 
 | Process Type | Count | Notes |
-|---|---|---|
+| - | - | - |
 | API Server | `A` (default `DP`) | Handles HTTP requests and input processing |
 | Engine Core | `DP` (default 1) | Scheduler and KV cache management |
 | GPU Worker | `N` (= `DP x PP x TP`) | One per GPU, executes model forward passes |


@@ -101,7 +101,7 @@ Priority is **1 = highest** (tried first).
 **Blackwell (SM 10.x):**
 
 | Priority | Backend |
-|----------|---------|
+| -------- | ------- |
 | 1 | `FLASHINFER` |
 | 2 | `FLASH_ATTN` |
 | 3 | `TRITON_ATTN` |

@@ -110,7 +110,7 @@ Priority is **1 = highest** (tried first).
 **Ampere/Hopper (SM 8.x-9.x):**
 
 | Priority | Backend |
-|----------|---------|
+| -------- | ------- |
 | 1 | `FLASH_ATTN` |
 | 2 | `FLASHINFER` |
 | 3 | `TRITON_ATTN` |

@@ -121,7 +121,7 @@ Priority is **1 = highest** (tried first).
 **Blackwell (SM 10.x):**
 
 | Priority | Backend |
-|----------|---------|
+| -------- | ------- |
 | 1 | `FLASHINFER_MLA` |
 | 2 | `CUTLASS_MLA` |
 | 3 | `FLASH_ATTN_MLA` |

@@ -133,7 +133,7 @@ Priority is **1 = highest** (tried first).
 **Ampere/Hopper (SM 8.x-9.x):**
 
 | Priority | Backend |
-|----------|---------|
+| -------- | ------- |
 | 1 | `FLASH_ATTN_MLA` |
 | 2 | `FLASHMLA` |
 | 3 | `FLASHINFER_MLA` |

@@ -145,7 +145,7 @@ Priority is **1 = highest** (tried first).
 ## Legend
 
 | Column | Description |
-|--------|-------------|
+| ------ | ----------- |
 | **Dtypes** | Supported model data types (fp16, bf16, fp32) |
 | **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
 | **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |

@@ -162,20 +162,20 @@ Priority is **1 = highest** (tried first).
 ## Standard Attention (MHA, MQA, GQA) Backends
 
 | Backend | Version | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | MM Prefix | DCP | Attention Types | Compute Cap. |
-|---------|---------|--------|-----------|-------------|------------|------|-----------|-----|-----------------|--------------|
+| ------- | ------- | ------ | --------- | ----------- | ---------- | ---- | --------- | --- | --------------- | ------------ |
 | `CPU_ATTN` | | fp16, bf16, fp32 | `auto` | Any | 32, 64, 80, 96, 112, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | All | N/A |
 | `FLASHINFER` | Native† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ❌ | ❌ | ✅ | Decoder | 7.x-9.x |
 | `FLASHINFER` | TRTLLM† | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 64 | 64, 128, 256 | ✅ | ❌ | ✅ | Decoder | 10.x |
 | `FLASH_ATTN` | FA2* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥8.0 |
 | `FLASH_ATTN` | FA3* | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ❌ | ✅ | All | 9.x |
 | `FLASH_ATTN` | FA4* | fp16, bf16 | `auto`, `bfloat16` | %16 | Any | ❌ | ❌ | ✅ | All | ≥10.0 |
 | `FLASH_ATTN_DIFFKV` | | fp16, bf16 | `auto` | Any | Any | ❌ | ❌ | ✅ | Decoder | Any |
 | `FLEX_ATTENTION` | | fp16, bf16, fp32 | `auto`, `bfloat16` | Any | Any | ❌ | ✅ | ❌ | Decoder, Encoder Only | Any |
 | `ROCM_AITER_FA` | | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32 | 64, 128, 256 | ❌ | ❌ | ❌ | Decoder, Enc-Dec | N/A |
 | `ROCM_AITER_UNIFIED_ATTN` | | fp16, bf16 | `auto` | %16 | Any | ✅ | ✅ | ❌ | All | N/A |
 | `ROCM_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | 16, 32, 544 | 32, 64, 80, 96, 128, 160, 192, 224, 256 | ✅ | ✅ | ❌ | All | N/A |
 | `TREE_ATTN` | | fp16, bf16 | `auto` | %16 | 32, 64, 96, 128, 160, 192, 224, 256 | ❌ | ❌ | ❌ | Decoder | Any |
 | `TRITON_ATTN` | | fp16, bf16, fp32 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3`, `fp8_e5m2` | %16 | Any | ✅ | ✅ | ❌ | All | Any |
 
 > **†** FlashInfer uses TRTLLM attention on Blackwell (SM100), which supports sinks. Disable via `--attention-config.use_trtllm_attention=0`.
 >

@@ -191,10 +191,10 @@ The prefill backend is selected at runtime based on hardware and
 configuration.
 
 | Backend | Description | Compute Cap. | Enable | Disable | Notes |
-|---------|-------------|--------------|--------|---------|-------|
+| ------- | ----------- | ------------ | ------ | ------- | ----- |
 | TRT-LLM Ragged‡ | TensorRT-LLM ragged attention | 10.x | Default on SM100 | `-ac.use_trtllm_ragged_deepseek_prefill=0` | DeepSeek R1 dims only |
 | FlashInfer | FlashInfer CUTLASS backend | 10.x | `-ac.disable_flashinfer_prefill=0` | `-ac.disable_flashinfer_prefill=1` | DeepSeek R1 dims only |
 | cuDNN | cuDNN-based attention | 10.x | `-ac.use_cudnn_prefill=1` | `-ac.use_cudnn_prefill=0` | |
 | FlashAttention | FlashAttention varlen (FA2/FA3) | Any | Default fallback | Use other backends | FA3 on SM90, FA2 otherwise |
 
 > **‡** TRT-LLM Ragged is the default on Blackwell (SM100).

@@ -203,7 +203,7 @@ configuration.
 ### Decode Backends
 
 | Backend | Dtypes | KV Dtypes | Block Sizes | Head Sizes | Sink | Sparse | MM Prefix | DCP | Attention Types | Compute Cap. |
-|---------|--------|-----------|-------------|------------|------|--------|-----------|-----|-----------------|--------------|
+| ------- | ------ | --------- | ----------- | ---------- | ---- | ------ | --------- | --- | --------------- | ------------ |
 | `CUTLASS_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 128 | Any | ❌ | ❌ | ❌ | ✅ | Decoder | 10.x |
 | `FLASHINFER_MLA` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | Any | ❌ | ❌ | ❌ | ❌ | Decoder | 10.x |
 | `FLASHINFER_MLA_SPARSE` | fp16, bf16 | `auto`, `bfloat16`, `fp8`, `fp8_e4m3` | 32, 64 | 576 | ❌ | ✅ | ❌ | ❌ | Decoder | 10.x |


@@ -174,18 +174,18 @@ Suppose we have hybrid attention backends (e.g., in mamba mixer models). In that
 The following table lists backends that support full CUDA Graphs at the time of writing.
 
 | Attention Backend | cudagraph_support | Comments |
-|:---|:---|:---|
+| :---------------- | :---------------- | :------- |
 | FlashAttention v2 | `UNIFORM_BATCH` | Actually `ALWAYS` but workaround to fallback to `FULL_AND_PIECEWISE` for performance reason |
 | FlashAttention v3 | `ALWAYS` | has unified routine for both batches, so `FULL` mode is good |
 | Triton Attention | `ALWAYS` | prefer `FULL_AND_PIECEWISE` since it has different kernels for prefill/mixed and pure decode batches |
-| AITER FlashAttention | `UNIFORM_BATCH`| |
+| AITER FlashAttention | `UNIFORM_BATCH` | |
 | FlashInfer | `UNIFORM_SINGLE_TOKEN_DECODE` | Will be set to `UNIFORM_BATCH` when using TRTLLM attention on Blackwell |
 | FlashMLA | `UNIFORM_BATCH` | |
 | FlashInferMLA | `UNIFORM_BATCH` | |
 | FlashInferMLASparse | `UNIFORM_BATCH` | |
 | AITER MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | |
 | CUTLASS MLA | `UNIFORM_SINGLE_TOKEN_DECODE` | |
-| Mamba attention| `UNIFORM_SINGLE_TOKEN_DECODE` | |
+| Mamba attention | `UNIFORM_SINGLE_TOKEN_DECODE` | |
 
 Unlisted backends are all declared as `NEVER`.


@@ -5,12 +5,12 @@ TL;DR:
 - use tlparse to acquire torch.compile logs. Include these logs in bug reports and/or support asks.
 - The vLLM-torch.compile integration is multiple pieces. vLLM exposes flags to turn off each piece:
 
 | Online Flag | Offline Flag | Result |
-|----------|----------|-------------|
+| ----------- | ------------ | ------ |
 | --enforce-eager | enforce_eager=True | Turn off torch.compile and CUDAGraphs |
 | -cc.mode=0 | mode=CompilationMode.NONE | Turn off torch.compile only |
 | -cc.cudagraph_mode=NONE | compilation_config=CompilationConfig(cudagraph_mode=CUDAGraphMode.NONE) | Turn off CUDAGraphs only |
 | -cc.backend=eager | compilation_config=CompilationConfig(backend='eager') | Turn off TorchInductor |
 
 ## vLLM-torch.compile overview


@@ -19,7 +19,7 @@ or just on the low or high end.
 If tuning performance by hand, always benchmark your exact use-case with and without the fusion to verify the impact.
 
 | Fusion | `PassConfig` flag | Fused operations | Default at | E2E Speedup | Fullgraph | `num_tokens` |
-|--------------------------------------------------------------------------------|------------------------------|------------------------------------------------|--------------------------------|--------------------|-----------|--------------|
+| ------------------------------------------------------------------------------ | ---------------------------- | ---------------------------------------------- | ------------------------------ | ------------------ | --------- | ------------ |
 | [AllReduce + RMSNorm](#allreduce--rmsnorm-fuse_allreduce_rms) | `fuse_allreduce_rms` | All-reduce → RMSNorm (+residual_add) (→ quant) | O2 (Hopper/Blackwell + TP > 1) | 5-20% | No | Low |
 | [Attention + Quant](#attention--quantization-fuse_attn_quant) | `fuse_attn_quant` | Attention output → FP8/NVFP4 quant | Off by default | 3-7% | Yes | Always |
 | [RoPE + KV-Cache Update](#rope--kv-cache-update-fuse_rope_kvcache) | `fuse_rope_kvcache` | Rotary embedding → KV cache write | O1 (ROCm/AITER only) | TBD | No | Low |

@@ -37,7 +37,7 @@ The table below lists the quantization schemes supported by each fusion on each
 [#36066](https://github.com/vllm-project/vllm/issues/36066)
 
 | Fusion | SM100 (Blackwell) | SM90 (Hopper) | SM89 (Ada) | SM80 (Ampere) | ROCm |
-|------------------------------|------------------------------------------|------------------------------------------|------------------------------------------|---------------|------------------------------------------|
+| ---------------------------- | ---------------------------------------- | ---------------------------------------- | ---------------------------------------- | ------------- | ---------------------------------------- |
 | `fuse_allreduce_rms` | FP16/BF16, FP8 static, NVFP4 | FP16/BF16, FP8 static | — | — | — |
 | `fuse_attn_quant`\* | FP8 static\*, NVFP4\* | FP8 static\* | FP8 static\* | — | FP8 static\* |
 | `fuse_rope_kvcache` | — | — | — | — | FP16/BF16 |
@@ -31,7 +31,7 @@ th {

</style>

| Backend | Output act. format | Quant. types | Quant. format | Async | Apply Weight On Input | Subclass |
| ------- | ------------------ | ------------ | ------------- | ----- | --------------------- | --------- |
| naive | standard | all<sup>1</sup> | G,A,T | N | <sup>6</sup> | [layer.py][vllm.model_executor.layers.fused_moe.layer.FusedMoE] |
| deepep_high_throughput | standard | fp8 | G(128),A,T<sup>2</sup> | Y | Y | [`DeepEPHTPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ht_prepare_finalize.DeepEPHTPrepareAndFinalize] |
| deepep_low_latency | batched | fp8 | G(128),A,T<sup>3</sup> | Y | Y | [`DeepEPLLPrepareAndFinalize`][vllm.model_executor.layers.fused_moe.deepep_ll_prepare_finalize.DeepEPLLPrepareAndFinalize] |

@@ -78,7 +78,7 @@ Most experts flavors include an equivalent modular interface which will be a sub

To be used with a particular `FusedMoEPrepareAndFinalizeModular` subclass, MoE kernels must have compatible activation formats, quantization types and quantization formats.

| Kernel | Input act. format | Quant. types | Quant. format | Activation function | Apply Weight On Input | Modular | Source |
| ------ | ----------------- | ------------ | ------------- | ------------------- | --------------------- | ------- | ------ |
| triton | standard | all<sup>1</sup> | G,A,T | silu, gelu,</br>swigluoai,</br>silu_no_mul,</br>gelu_no_mul | Y | Y | [`fused_experts`][vllm.model_executor.layers.fused_moe.fused_moe.fused_experts],</br>[`TritonExperts`][vllm.model_executor.layers.fused_moe.fused_moe.TritonExperts] |
| triton (batched) | batched | all<sup>1</sup> | G,A,T | silu, gelu | <sup>6</sup> | Y | [`BatchedTritonExperts`][vllm.model_executor.layers.fused_moe.fused_batched_moe.BatchedTritonExperts] |
| deep gemm | standard,</br>batched | fp8 | G(128),A,T | silu, gelu | <sup>6</sup> | Y | </br>[`DeepGemmExperts`][vllm.model_executor.layers.fused_moe.deep_gemm_moe.DeepGemmExperts],</br>[`BatchedDeepGemmExperts`][vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe.BatchedDeepGemmExperts] |

@@ -105,7 +105,7 @@ To be used with a particular `FusedMoEPrepareAndFinalizeModular` subclass, MoE k

The following table shows "families" of modular kernels that are intended to work together. There are some combinations which may work but have not yet been tested, e.g. flashinfer with other fp8 experts. Note that the "naive" backend will work with any non-modular experts.

| backend | `FusedMoEPrepareAndFinalizeModular` subclasses | `FusedMoEExpertsModular` subclasses |
| ------- | ---------------------------------------------- | ----------------------------------- |
| deepep_high_throughput | `DeepEPHTPrepareAndFinalize` | `DeepGemmExperts`,</br>`TritonExperts`,</br>`TritonOrDeepGemmExperts`,</br>`CutlassExpertsFp8`, </br>`MarlinExperts` |
| deepep_low_latency | `DeepEPLLPrepareAndFinalize` | `BatchedDeepGemmExperts`,</br>`BatchedTritonExperts`,</br>`CutlassBatchedExpertsFp8`,</br>`BatchedMarlinExperts` |
| flashinfer | `FlashInferCutlassMoEPrepareAndFinalize` | `FlashInferExperts` |
@@ -37,7 +37,7 @@ th:not(:first-child) {

</style>

| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](speculative_decoding/README.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
| - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| [CP](../configuration/optimization.md#chunked-prefill) | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |

@@ -59,23 +59,23 @@ th:not(:first-child) {

### Feature x Hardware

| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | Intel GPU |
| ------- | ----- | ------ | ------ | --- | ------ | --- | --- | --------- |
| [CP](../configuration/optimization.md#chunked-prefill) | [](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [SD](speculative_decoding/README.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ |
| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [](https://github.com/vllm-project/vllm/issues/26970) |
| [pooling](../models/pooling_models.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| [mm](multimodal_inputs.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [prompt-embeds](prompt_embeds.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ |
| <abbr title="Logprobs">logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Prompt Logprobs">prmpt logP</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| <abbr title="Async Output Processing">async output</abbr> | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [](https://github.com/vllm-project/vllm/issues/8477) | ✅ | ✅ |
| best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

!!! note
    For information on feature support on Google TPU, please refer to the [TPU-Inference Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/) documentation.
@@ -20,9 +20,9 @@ With interleaved thinking, the model can:

vLLM currently supports the following interleaved thinking models:

| Model Series | Reasoning Parser Name |
| ------------ | --------------------- |
| moonshotai/Kimi-K2-Thinking | kimi_k2 |
| MiniMaxAI/MiniMax-M2 | minimax_m2 |

## Example Usage
@@ -44,16 +44,16 @@ th:not(:first-child) {

}
</style>

| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU |
| ------------------------- | ----- | ------ | ------ | --- | ------ | ------- | --------- | ------- |
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ |
| Marlin (GPTQ/AWQ/FP8/FP4) | ❌ | ✅︎* | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |

- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.

@@ -131,7 +131,7 @@ class MyQuantConfig(QuantizationConfig):

Your custom `QuantizationConfig` subclass must implement these abstract methods:

| Method | Description |
| ------ | ----------- |
| `get_name()` | Returns the name of the quantization method |
| `get_supported_act_dtypes()` | Returns list of supported activation dtypes (e.g., `torch.float16`) |
| `get_min_capability()` | Returns minimum GPU compute capability (e.g., 80 for Ampere, -1 for no restriction) |
@@ -114,7 +114,7 @@ Here's an example of the resulting scores:

```text
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
| --- |------:| -------------- |-----:| --------- | - |----:| - |-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.768|± |0.0268|
| | |strict-match | 5|exact_match|↑ |0.768|± |0.0268|
```
@@ -12,7 +12,7 @@ Reasoning models return an additional `reasoning` field in their outputs, which

vLLM currently supports the following reasoning models:

| Model Series | Parser Name | Structured Output Support | Tool Calling |
| ------------ | ----------- | ---------------- | ----------- |
| [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) | `deepseek_r1` | `json`, `regex` | ❌ |
| [DeepSeek-V3.1](https://huggingface.co/collections/deepseek-ai/deepseek-v31-68a491bed32bd77e7fca048f) | `deepseek_v3` | `json`, `regex` | ❌ |
| [ERNIE-4.5-VL series](https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-PT) | `ernie45` | `json`, `regex` | ❌ |
@@ -1,4 +1,5 @@

<!-- markdownlint-disable MD041 -->
--8<-- [start:installation]

vLLM has experimental support for macOS with Apple Silicon. For now, users must build from source to natively run on macOS.
@@ -7,23 +8,23 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.

!!! tip "GPU-Accelerated Inference with vLLM-Metal"
    For GPU-accelerated inference on Apple Silicon using Metal, check out [vllm-metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend.

--8<-- [end:installation]
--8<-- [start:requirements]

- OS: `macOS Sonoma` or later
- SDK: `XCode 15.4` or later with Command Line Tools
- Compiler: `Apple Clang >= 15.0.0`

--8<-- [end:requirements]
--8<-- [start:set-up-using-python]
--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]

Currently, there are no pre-built Apple silicon CPU wheels.

--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]

After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from source.
@@ -36,7 +37,7 @@ uv pip install -e .

!!! tip
    The `--index-strategy unsafe-best-match` flag is needed to resolve dependencies across multiple package indexes (PyTorch CPU index and PyPI). Without this flag, you may encounter `typing-extensions` version conflicts.

    The term "unsafe" refers to the package resolution strategy, not security. By default, `uv` only searches the first index where a package is found to prevent dependency confusion attacks. This flag allows `uv` to search all configured indexes to find the best compatible versions. Since both PyTorch and PyPI are trusted package sources, using this strategy is safe and appropriate for vLLM installation.

!!! note

@@ -77,14 +78,14 @@ uv pip install -e .

```

On Apple Clang 16 you should see: `#define __cplusplus 201703L`
--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]

Currently, there are no pre-built Arm silicon CPU images.

--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
--8<-- [end:build-image-from-source]
--8<-- [start:extra-information]
--8<-- [end:extra-information]
@@ -1,19 +1,20 @@

<!-- markdownlint-disable MD041 -->
--8<-- [start:installation]

vLLM offers basic model inferencing and serving on the Arm CPU platform, with support for NEON and the FP32, FP16 and BF16 data types.

--8<-- [end:installation]
--8<-- [start:requirements]

- OS: Linux
- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
- Instruction Set Architecture (ISA): NEON support is required

--8<-- [end:requirements]
--8<-- [start:set-up-using-python]
--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]

Pre-built vLLM wheels for Arm are available since version 0.11.2. These wheels contain pre-compiled C++ binaries.
@@ -43,13 +44,14 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE

The `uv` approach works for vLLM `v0.6.6` and later. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.

#### Install the latest code

LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides working pre-built Arm CPU wheels for every commit since `v0.11.2` on <https://wheels.vllm.ai/nightly>. For native CPU wheels, this index should be used:

- `https://wheels.vllm.ai/nightly/cpu/vllm`

To install from the nightly index, run:

```bash
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index
```
@@ -64,7 +66,7 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index

pip install https://wheels.vllm.ai/4fa7ce46f31cbd97b4651694caf9991cc395a259/vllm-0.13.0rc2.dev104%2Bg4fa7ce46f.cpu-cp38-abi3-manylinux_2_35_aarch64.whl # current nightly build (the filename will change!)
```

#### Install specific revisions

If you want to access the wheels for previous commits (e.g. to bisect a behavior change or performance regression), you can specify the commit hash in the URL:

@@ -73,8 +75,8 @@ export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit ha

uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index
```

--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]
First, install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:

@@ -133,8 +135,8 @@ Testing has been conducted on AWS Graviton3 instances for compatibility.

export LD_PRELOAD="$TC_PATH:$LD_PRELOAD"
```

--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]

To pull the latest image from Docker Hub:

@@ -170,10 +172,10 @@ export VLLM_COMMIT=6299628d326f429eba78736acb44e76749b281f5 # use full commit ha

docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}-arm64-cpu
```

--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
#### Building for your target ARM CPU

```bash
docker build -f docker/Dockerfile.cpu \

@@ -189,9 +191,9 @@ docker build -f docker/Dockerfile.cpu \

- `VLLM_CPU_ARM_BF16=true` - Force-enable ARM BF16 support (build with BF16 regardless of build system capabilities)
- `VLLM_CPU_ARM_BF16=false` - Rely on auto-detection (default)

##### Examples

###### Auto-detection build (native ARM)

```bash
# Building on ARM64 system - platform auto-detected

@@ -200,7 +202,7 @@ docker build -f docker/Dockerfile.cpu \

--target vllm-openai .
```

###### Cross-compile for ARM with BF16 support

```bash
# Building on ARM64 for newer ARM CPUs with BF16

@@ -210,7 +212,7 @@ docker build -f docker/Dockerfile.cpu \

--target vllm-openai .
```

###### Cross-compile from x86_64 to ARM64 with BF16

```bash
# Requires Docker buildx with ARM emulation (QEMU)

@@ -226,7 +228,7 @@ docker buildx build -f docker/Dockerfile.cpu \

!!! note "ARM BF16 requirements"
    ARM BF16 support requires ARMv8.6-A or later (FEAT_BF16). Supported on AWS Graviton3/4, AmpereOne, and other recent ARM processors.
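Before force-enabling `VLLM_CPU_ARM_BF16=true`, you can check whether the host actually advertises FEAT_BF16: on Linux the feature appears as a `bf16` flag in `/proc/cpuinfo`. A minimal sketch (the file path is a parameter only so the check can be exercised against a captured sample; on a real aarch64 host, omit it):

```shell
# Report whether an ARM cpuinfo listing advertises the bf16 feature flag.
# Defaults to /proc/cpuinfo; pass a path to check a captured file instead.
cpuinfo="${1:-/proc/cpuinfo}"
if grep -qw bf16 "$cpuinfo"; then
    echo "FEAT_BF16 available"
else
    echo "FEAT_BF16 not available"
fi
```

If the flag is absent, leave `VLLM_CPU_ARM_BF16` unset and rely on auto-detection as described above.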
#### Launching the OpenAI server

```bash
docker run --rm \

@@ -245,6 +247,6 @@ docker run --rm \

!!! tip "Alternative to --privileged"
    Instead of `--privileged=true`, use `--cap-add SYS_NICE --security-opt seccomp=unconfined` for better security.

--8<-- [end:build-image-from-source]
--8<-- [start:extra-information]
--8<-- [end:extra-information]
@@ -1,3 +1,7 @@
---
toc_depth: 3
---
# CPU

vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
@@ -1,27 +1,28 @@
-# --8<-- [start:installation]
+<!-- markdownlint-disable MD041 -->
+--8<-- [start:installation]
 vLLM has experimental support for s390x architecture on IBM Z platform. For now, users must build from source to natively run on IBM Z platform.
 Currently, the CPU implementation for s390x architecture supports FP32 datatype only.
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+--8<-- [end:installation]
+--8<-- [start:requirements]
 - OS: `Linux`
 - SDK: `gcc/g++ >= 12.3.0` or later with Command Line Tools
 - Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
 - Build install python packages: `pyarrow`, `torch` and `torchvision`
-# --8<-- [end:requirements]
-# --8<-- [start:set-up-using-python]
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+--8<-- [end:requirements]
+--8<-- [start:set-up-using-python]
+--8<-- [end:set-up-using-python]
+--8<-- [start:pre-built-wheels]
 Currently, there are no pre-built IBM Z CPU wheels.
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+--8<-- [end:pre-built-wheels]
+--8<-- [start:build-wheel-from-source]
 Install the following packages from the package manager before building vLLM. For example on RHEL 9.4:
@@ -65,13 +66,13 @@ Execute the following commands to build and install vLLM from source.
 pip install dist/*.whl
 ```
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:pre-built-images]
+--8<-- [end:build-wheel-from-source]
+--8<-- [start:pre-built-images]
 Currently, there are no pre-built IBM Z CPU images.
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+--8<-- [end:pre-built-images]
+--8<-- [start:build-image-from-source]
 ```bash
 docker build -f docker/Dockerfile.s390x \
@@ -93,6 +94,6 @@ docker run --rm \
 !!! tip
     An alternative of `--privileged true` is `--cap-add SYS_NICE --security-opt seccomp=unconfined`.
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
-# --8<-- [end:extra-information]
+--8<-- [end:build-image-from-source]
+--8<-- [start:extra-information]
+--8<-- [end:extra-information]
View File
@@ -1,9 +1,10 @@
-# --8<-- [start:installation]
+<!-- markdownlint-disable MD041 -->
+--8<-- [start:installation]
 vLLM supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+--8<-- [end:installation]
+--8<-- [start:requirements]
 - OS: Linux
 - CPU flags: `avx512f` (Recommended), `avx512_bf16` (Optional), `avx512_vnni` (Optional)
@@ -11,11 +12,11 @@ vLLM supports basic model inferencing and serving on x86 CPU platform, with data
 !!! tip
     Use `lscpu` to check the CPU flags.
-# --8<-- [end:requirements]
-# --8<-- [start:set-up-using-python]
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+--8<-- [end:requirements]
+--8<-- [start:set-up-using-python]
+--8<-- [end:set-up-using-python]
+--8<-- [start:pre-built-wheels]
 Pre-built vLLM wheels for x86 with AVX512 are available since version 0.13.0. To install release wheels:
@@ -25,6 +26,7 @@ export VLLM_VERSION=$(curl -s https://api.github.com/repos/vllm-project/vllm/rel
 # use uv
 uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl --torch-backend cpu
 ```
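The release-wheel command in the hunk above derives the download URL from `VLLM_VERSION`. As a sketch of how that URL is composed, using `0.13.0` as an assumed example version (nothing is downloaded):

```shell
# Compose the x86 CPU release-wheel URL from a version string (illustration only).
VLLM_VERSION=0.13.0  # the real command queries the GitHub releases API for the latest tag
WHEEL_URL="https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl"
echo "${WHEEL_URL}"
```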
 ??? console "pip"
     ```bash
     # use pip
@@ -46,7 +48,7 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
 export LD_PRELOAD="$TC_PATH:$IOMP_PATH:$LD_PRELOAD"
 ```
-**Install the latest code**
+#### Install the latest code
 To install the wheel built from the latest main branch:
@@ -54,7 +56,7 @@ To install the wheel built from the latest main branch:
 uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cpu --index-strategy first-index --torch-backend cpu
 ```
-**Install specific revisions**
+#### Install specific revisions
 If you want to access the wheels for previous commits (e.g., to bisect a behavior change or performance regression), you can specify the commit hash in the URL:
@@ -63,8 +65,8 @@ export VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519 # use full commit ha
 uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}/cpu --index-strategy first-index --torch-backend cpu
 ```
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+--8<-- [end:pre-built-wheels]
+--8<-- [start:build-wheel-from-source]
 Install the recommended compiler. We recommend using `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.04, you can run:
@@ -158,8 +160,8 @@ uv pip install dist/*.whl
 ]
 ```
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:pre-built-images]
+--8<-- [end:build-wheel-from-source]
+--8<-- [start:pre-built-images]
 You can pull the latest available CPU image from Docker Hub:
@@ -189,10 +191,10 @@ vllm/vllm-openai-cpu:latest-x86_64 <args...>
 !!! warning
     If deploying the pre-built images on machines without `avx512f`, `avx512_bf16`, or `avx512_vnni` support, an `Illegal instruction` error may be raised. See the build-image-from-source section below for build arguments to match your target CPU capabilities.
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
-## Building for your target CPU
+--8<-- [end:pre-built-images]
+--8<-- [start:build-image-from-source]
+#### Building for your target CPU
 ```bash
 docker build -f docker/Dockerfile.cpu \
@@ -212,15 +214,15 @@ docker build -f docker/Dockerfile.cpu \
 - `VLLM_CPU_{ISA}=true` - Force-enable the instruction set (build with ISA regardless of build system capabilities)
 - `VLLM_CPU_{ISA}=false` - Rely on auto-detection (default)
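The `VLLM_CPU_{ISA}` pattern above expands to one build argument per instruction set. A sketch that only prints the resulting `docker build` command; the concrete ISA names here are assumptions derived from the CPU-flag list (check docker/Dockerfile.cpu for the authoritative set):

```shell
# Illustration only: expand VLLM_CPU_{ISA}=false for several assumed ISA names
# into --build-arg flags. Nothing is built; the command is just printed.
args=""
for isa in AVX512F AVX512BF16 AVX512VNNI; do
  args="${args} --build-arg VLLM_CPU_${isa}=false"
done
echo "docker build -f docker/Dockerfile.cpu${args} --target vllm-openai ."
```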
-### Examples
-**Auto-detection build (default)**
+##### Examples
+###### Auto-detection build (default)
 ```bash
 docker build -f docker/Dockerfile.cpu --tag vllm-cpu-env --target vllm-openai .
 ```
-**Cross-compile for AVX512**
+###### Cross-compile for AVX512
 ```bash
 docker build -f docker/Dockerfile.cpu \
@@ -231,7 +233,7 @@ docker build -f docker/Dockerfile.cpu \
   --target vllm-openai .
 ```
-**Cross-compile for AVX2**
+###### Cross-compile for AVX2
 ```bash
 docker build -f docker/Dockerfile.cpu \
@@ -240,7 +242,7 @@ docker build -f docker/Dockerfile.cpu \
   --target vllm-openai .
 ```
-## Launching the OpenAI server
+#### Launching the OpenAI server
 ```bash
 docker run --rm \
@@ -255,6 +257,6 @@ docker run --rm \
 other vLLM OpenAI server arguments
 ```
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:extra-information]
-# --8<-- [end:extra-information]
+--8<-- [end:build-image-from-source]
+--8<-- [start:extra-information]
+--8<-- [end:extra-information]
View File
@@ -1,14 +1,15 @@
-# --8<-- [start:installation]
+<!-- markdownlint-disable MD041 MD051 -->
+--8<-- [start:installation]
 vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+--8<-- [end:installation]
+--8<-- [start:requirements]
 - GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
-# --8<-- [end:requirements]
-# --8<-- [start:set-up-using-python]
+--8<-- [end:requirements]
+--8<-- [start:set-up-using-python]
 !!! note
     PyTorch installed via `conda` will statically link the `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See <https://github.com/vllm-project/vllm/issues/8420> for more details.
@@ -17,8 +18,8 @@ In order to be performant, vLLM has to compile many cuda kernels. The compilatio
 Therefore, it is recommended to install vLLM in a **fresh new** environment. If you have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-wheel-from-source) for more details.
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+--8<-- [end:set-up-using-python]
+--8<-- [start:pre-built-wheels]
 ```bash
 uv pip install vllm --torch-backend=auto
@@ -49,8 +50,8 @@ uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VE
 LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for every commit since `v0.5.3` on <https://wheels.vllm.ai/nightly>. There are multiple indices that can be used:
-* `https://wheels.vllm.ai/nightly`: the default variant (CUDA with version specified in `VLLM_MAIN_CUDA_VERSION`) built with the last commit on the `main` branch. Currently it is CUDA 12.9.
-* `https://wheels.vllm.ai/nightly/<variant>`: all other variants. Now this includes `cu130` and `cpu`. The default variant (`cu129`) also has a subdirectory for consistency.
+- `https://wheels.vllm.ai/nightly`: the default variant (CUDA with version specified in `VLLM_MAIN_CUDA_VERSION`) built with the last commit on the `main` branch. Currently it is CUDA 12.9.
+- `https://wheels.vllm.ai/nightly/<variant>`: all other variants. Now this includes `cu130` and `cpu`. The default variant (`cu129`) also has a subdirectory for consistency.
 To install from the nightly index, run:
@@ -82,8 +83,8 @@ uv pip install vllm \
     --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} # add variant subdirectory here if needed
 ```
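The per-commit index URL in the hunk above is just the commit hash plus an optional variant subdirectory. A sketch composing it; the hash is the docs' example value and `cu130` is an assumed variant (nothing is installed):

```shell
# Compose a per-commit wheel index URL (illustration only).
VLLM_COMMIT=730bd35378bf2a5b56b6d3a45be28b3092d26519
VARIANT=cu130  # assumed example; omit the subdirectory for the default variant
echo "https://wheels.vllm.ai/${VLLM_COMMIT}/${VARIANT}"
```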
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+--8<-- [end:pre-built-wheels]
+--8<-- [start:build-wheel-from-source]
 #### Set up using Python-only build (without compilation) {#python-only-build}
@@ -116,9 +117,9 @@ uv pip install --editable .
 There are more environment variables to control the behavior of the Python-only build:
-* `VLLM_PRECOMPILED_WHEEL_LOCATION`: specify the exact wheel URL or local file path of a pre-compiled wheel to use. All other logic to find the wheel will be skipped.
-* `VLLM_PRECOMPILED_WHEEL_COMMIT`: override the commit hash used to download the pre-compiled wheel. It can be `nightly` to use the last **already built** commit on the main branch.
-* `VLLM_PRECOMPILED_WHEEL_VARIANT`: specify the variant subdirectory to use on the nightly index, e.g., `cu129`, `cu130`, `cpu`. If not specified, the variant is auto-detected based on your system's CUDA version (from PyTorch or nvidia-smi). You can also set `VLLM_MAIN_CUDA_VERSION` to override auto-detection.
+- `VLLM_PRECOMPILED_WHEEL_LOCATION`: specify the exact wheel URL or local file path of a pre-compiled wheel to use. All other logic to find the wheel will be skipped.
+- `VLLM_PRECOMPILED_WHEEL_COMMIT`: override the commit hash used to download the pre-compiled wheel. It can be `nightly` to use the last **already built** commit on the main branch.
+- `VLLM_PRECOMPILED_WHEEL_VARIANT`: specify the variant subdirectory to use on the nightly index, e.g., `cu129`, `cu130`, `cpu`. If not specified, the variant is auto-detected based on your system's CUDA version (from PyTorch or nvidia-smi). You can also set `VLLM_MAIN_CUDA_VERSION` to override auto-detection.
 You can find more information about vLLM's wheels in [Install the latest code](#install-the-latest-code).
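The variant-selection order described above can be sketched as a small shell function: an explicit `VLLM_PRECOMPILED_WHEEL_VARIANT` wins, then `VLLM_MAIN_CUDA_VERSION`, then a default. The `cu129` default and the "12.9 → cu129" style mapping are assumptions for illustration; the real build also probes PyTorch and nvidia-smi.

```shell
# Minimal sketch of precedence among the variant-related environment variables.
pick_variant() {
  if [ -n "${VLLM_PRECOMPILED_WHEEL_VARIANT:-}" ]; then
    echo "${VLLM_PRECOMPILED_WHEEL_VARIANT}"
  elif [ -n "${VLLM_MAIN_CUDA_VERSION:-}" ]; then
    echo "cu$(echo "${VLLM_MAIN_CUDA_VERSION}" | tr -d '.')"
  else
    echo "cu129"  # assumed default, matching the default nightly variant above
  fi
}

VLLM_PRECOMPILED_WHEEL_VARIANT=cpu
pick_variant  # prints: cpu
```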
@@ -236,8 +237,8 @@ export VLLM_TARGET_DEVICE=empty
 uv pip install -e .
 ```
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:pre-built-images]
+--8<-- [end:build-wheel-from-source]
+--8<-- [start:pre-built-images]
 vLLM offers an official Docker image for deployment.
 The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
@@ -314,8 +315,8 @@ docker run --runtime nvidia --gpus all \
 This will automatically configure `LD_LIBRARY_PATH` to point to the compatibility libraries before loading PyTorch and other dependencies.
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+--8<-- [end:pre-built-images]
+--8<-- [start:build-image-from-source]
 You can build and run vLLM from source via the provided [docker/Dockerfile](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile). To build vLLM:
@@ -415,9 +416,9 @@ The argument `vllm/vllm-openai` specifies the image to run, and should be replac
 !!! note
     **For versions 0.4.1 and 0.4.2 only** - the vLLM Docker images under these versions are supposed to be run under the root user, since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`, is required to be loaded during runtime. If you are running the container under a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:supported-features]
+--8<-- [end:build-image-from-source]
+--8<-- [start:supported-features]
 See the [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
-# --8<-- [end:supported-features]
+--8<-- [end:supported-features]
View File
@@ -88,8 +88,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 ### Pre-built images
-<!-- markdownlint-disable MD025 -->
-# --8<-- [start:pre-built-images]
+--8<-- [start:pre-built-images]
 === "NVIDIA CUDA"
@@ -103,15 +102,11 @@ vLLM is a Python library that supports the following GPU variants. Select your G
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
-# --8<-- [end:pre-built-images]
-<!-- markdownlint-enable MD025 -->
-<!-- markdownlint-disable MD001 -->
+--8<-- [end:pre-built-images]
 ### Build image from source
-<!-- markdownlint-enable MD001 -->
-<!-- markdownlint-disable MD025 -->
-# --8<-- [start:build-image-from-source]
+--8<-- [start:build-image-from-source]
 === "NVIDIA CUDA"
@@ -125,8 +120,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
-# --8<-- [end:build-image-from-source]
-<!-- markdownlint-enable MD025 -->
+--8<-- [end:build-image-from-source]
 ## Supported features
View File
@@ -1,23 +1,24 @@
-# --8<-- [start:installation]
+<!-- markdownlint-disable MD041 MD051 -->
+--8<-- [start:installation]
 vLLM supports AMD GPUs with ROCm 6.3 or above. Pre-built wheels are available for ROCm 7.0.
-# --8<-- [end:installation]
-# --8<-- [start:requirements]
+--8<-- [end:installation]
+--8<-- [start:requirements]
 - GPU: MI200s (gfx90a), MI300 (gfx942), MI350 (gfx950), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201), Ryzen AI MAX / AI 300 Series (gfx1151/1150)
 - ROCm 6.3 or above
     - MI350 requires ROCm 7.0 or above
     - Ryzen AI MAX / AI 300 Series requires ROCm 7.0.2 or above
-# --8<-- [end:requirements]
-# --8<-- [start:set-up-using-python]
+--8<-- [end:requirements]
+--8<-- [start:set-up-using-python]
 The vLLM wheel bundles PyTorch and all required dependencies, and you should use the included PyTorch for compatibility. Because vLLM compiles many ROCm kernels to ensure a validated, high-performance stack, the resulting binaries may not be compatible with other ROCm or PyTorch builds.
 If you need a different ROCm version or want to use an existing PyTorch installation, you'll need to build vLLM from source. See [below](#build-wheel-from-source) for more details.
-# --8<-- [end:set-up-using-python]
-# --8<-- [start:pre-built-wheels]
+--8<-- [end:set-up-using-python]
+--8<-- [start:pre-built-wheels]
 To install the latest version of vLLM for Python 3.12, ROCm 7.0 and `glibc >= 2.35`:
@@ -34,7 +35,7 @@ To install a specific version and ROCm variant of vLLM wheel.
 uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
 ```
 !!! warning "Caveats for using `pip`"
     We recommend leveraging `uv` to install vLLM wheels. Using `pip` to install from custom indices is cumbersome, because `pip` combines packages from `--extra-index-url` and the default index and chooses only the latest version, which makes it difficult to install a wheel from a custom index when exact versions of all packages are specified. In contrast, `uv` gives the extra index [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes).
@@ -44,8 +45,8 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
 pip install vllm==0.15.0+rocm700 --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
 ```
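The versioned ROCm index used above follows a simple `rocm/<version>/<variant>` layout. A sketch composing it from the example values in the docs (nothing is installed):

```shell
# Compose the versioned ROCm wheel index URL (illustration only).
VLLM_RELEASE=0.15.0
ROCM_VARIANT=rocm700
echo "https://wheels.vllm.ai/rocm/${VLLM_RELEASE}/${ROCM_VARIANT}"
```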
-# --8<-- [end:pre-built-wheels]
-# --8<-- [start:build-wheel-from-source]
+--8<-- [end:pre-built-wheels]
+--8<-- [start:build-wheel-from-source]
 !!! tip
     - If you find that the following installation steps do not work for you, please refer to [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base). The Dockerfile itself documents the installation steps.
@@ -104,7 +105,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
 !!! note
     - The validated `$FA_BRANCH` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
 3. Optionally, if you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
     ```bash
@@ -120,7 +120,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
     - You will need to configure `$AITER_BRANCH_OR_COMMIT` for your purpose.
     - The validated `$AITER_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
 4. Optionally, if you want to use MORI for EP or PD disaggregation, you can install [MORI](https://github.com/ROCm/mori) using the following steps:
     ```bash
@@ -135,7 +134,6 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
     - You will need to configure `$MORI_BRANCH_OR_COMMIT` for your purpose.
     - The validated `$MORI_BRANCH_OR_COMMIT` can be found in the [docker/Dockerfile.rocm_base](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm_base).
 5. Build vLLM. For example, vLLM on ROCm 7.0 can be built with the following steps:
     ???+ console "Commands"
@@ -171,8 +169,8 @@ uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700
 - For MI300x (gfx942) users, to achieve optimal performance, please refer to the [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips at the system and workflow level.
   For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/vllm-optimization.html).
-# --8<-- [end:build-wheel-from-source]
-# --8<-- [start:pre-built-images]
+--8<-- [end:build-wheel-from-source]
+--8<-- [start:pre-built-images]
 vLLM offers an official Docker image for deployment.
 The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai-rocm](https://hub.docker.com/r/vllm/vllm-openai-rocm/tags).
@@ -217,8 +215,8 @@ rocm/vllm-dev:nightly
 Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
 for instructions on how to use this prebuilt docker image.
-# --8<-- [end:pre-built-images]
-# --8<-- [start:build-image-from-source]
+--8<-- [end:pre-built-images]
+--8<-- [start:build-image-from-source]
 You can build and run vLLM from source via the provided [docker/Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm).
@@ -271,7 +269,6 @@ To build vllm on ROCm 7.0 for MI200 and MI300 series, you can use the default (w
 DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm/vllm-openai-rocm .
 ```
 To run vLLM with the custom-built Docker image:
 ```bash
@@ -308,9 +305,9 @@ To use the docker image as base for development, you can launch it in interactiv
 vllm/vllm-openai-rocm
 ```
-# --8<-- [end:build-image-from-source]
-# --8<-- [start:supported-features]
+--8<-- [end:build-image-from-source]
+--8<-- [start:supported-features]
 See the [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
-# --8<-- [end:supported-features]
+--8<-- [end:supported-features]
View File
@@ -1,29 +1,30 @@
# --8<-- [start:installation] <!-- markdownlint-disable MD041 -->
--8<-- [start:installation]
vLLM initially supports basic model inference and serving on Intel GPU platform. vLLM initially supports basic model inference and serving on Intel GPU platform.
# --8<-- [end:installation] --8<-- [end:installation]
# --8<-- [start:requirements] --8<-- [start:requirements]
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU - Supported Hardware: Intel Data Center GPU, Intel ARC GPU
- OneAPI requirements: oneAPI 2025.3 - OneAPI requirements: oneAPI 2025.3
- Dependency: [vllm-xpu-kernels](https://github.com/vllm-project/vllm-xpu-kernels): a package provide all necessary vllm custom kernel when running vLLM on Intel GPU platform, - Dependency: [vllm-xpu-kernels](https://github.com/vllm-project/vllm-xpu-kernels): a package provide all necessary vllm custom kernel when running vLLM on Intel GPU platform,
- Python: 3.12 - Python: 3.12
!!! warning !!! warning
The provided vllm-xpu-kernels whl is Python3.12 specific so this version is a MUST. The provided vllm-xpu-kernels whl is Python3.12 specific so this version is a MUST.
# --8<-- [end:requirements] --8<-- [end:requirements]
# --8<-- [start:set-up-using-python] --8<-- [start:set-up-using-python]
There is no extra information on creating a new Python environment for this device. There is no extra information on creating a new Python environment for this device.
# --8<-- [end:set-up-using-python] --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels] --8<-- [start:pre-built-wheels]
Currently, there are no pre-built XPU wheels. Currently, there are no pre-built XPU wheels.
# --8<-- [end:pre-built-wheels] --8<-- [end:pre-built-wheels]
# --8<-- [start:build-wheel-from-source] --8<-- [start:build-wheel-from-source]
- First, install required [driver](https://dgpu-docs.intel.com/driver/installation.html#installing-gpu-drivers) and [Intel OneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) 2025.3 or later. - First, install required [driver](https://dgpu-docs.intel.com/driver/installation.html#installing-gpu-drivers) and [Intel OneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) 2025.3 or later.
- Second, install Python packages for vLLM XPU backend building: - Second, install Python packages for vLLM XPU backend building:
@@ -54,13 +55,13 @@ pip install -v -r requirements/xpu.txt
VLLM_TARGET_DEVICE=xpu pip install --no-build-isolation -e . -v
```
--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]
Currently, we release prebuilt XPU images on [Docker Hub](https://hub.docker.com/r/intel/vllm/tags) based on released vLLM versions. For more information, please refer to the release [notes](https://github.com/intel/ai-containers/blob/main/vllm).
--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]
```bash
docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
@@ -74,8 +75,8 @@ docker run -it \
vllm-xpu-env
```
--8<-- [end:build-image-from-source]
--8<-- [start:supported-features]
The XPU platform supports **tensor parallel** inference/serving, and also supports **pipeline parallel** as a beta feature for online serving. **Pipeline parallel** is supported on a single node with `mp` as the backend. A reference execution looks like the following:
@@ -90,9 +91,9 @@ vllm serve facebook/opt-13b \
By default, a Ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equal to `parallel_config.world_size`. We recommend properly starting a Ray cluster before execution, referring to the [examples/online_serving/run_cluster.sh](https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh) helper script.
--8<-- [end:supported-features]
--8<-- [start:distributed-backend]
The XPU platform uses **torch-ccl** as the distributed backend for torch<2.8 and **xccl** for torch>=2.8, since torch 2.8 adds built-in **xccl** support for XPU.
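The version cutoff above can be expressed as a tiny helper. This is only an illustrative sketch of the stated rule, not code from vLLM (the function name and backend strings are assumptions):

```python
def select_xpu_distributed_backend(torch_version: str) -> str:
    """Pick the distributed backend by the rule above:
    torch-ccl for torch < 2.8, built-in xccl for torch >= 2.8."""
    major, minor = (int(part) for part in torch_version.split(".")[:2])
    return "xccl" if (major, minor) >= (2, 8) else "torch-ccl"
```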
--8<-- [end:distributed-backend]

View File

@@ -1,3 +1,4 @@
<!-- markdownlint-disable MD041 -->
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment using the following commands:
```bash

View File

@@ -2,32 +2,32 @@
## Validated Hardware
| Hardware |
| -------- |
| [Intel® Xeon® 6 Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon.html) |
| [Intel® Xeon® 5 Processors](https://www.intel.com/content/www/us/en/products/docs/processors/xeon/5th-gen-xeon-scalable-processors.html) |
## Recommended Models
### Text-only Language Models
| Model | Architecture | Supported |
| ------------------------------------ | ---------------------------------------- | --------- |
| meta-llama/Llama-3.1-8B-Instruct | LlamaForCausalLM | ✅ |
| meta-llama/Llama-3.2-3B-Instruct | LlamaForCausalLM | ✅ |
| ibm-granite/granite-3.2-2b-instruct | GraniteForCausalLM | ✅ |
| Qwen/Qwen3-1.7B | Qwen3ForCausalLM | ✅ |
| Qwen/Qwen3-4B | Qwen3ForCausalLM | ✅ |
| Qwen/Qwen3-8B | Qwen3ForCausalLM | ✅ |
| zai-org/glm-4-9b-hf | GLMForCausalLM | ✅ |
| google/gemma-7b | GemmaForCausalLM | ✅ |
### Multimodal Language Models
| Model | Architecture | Supported |
| ------------------------------------ | ---------------------------------------- | --------- |
| Qwen/Qwen2.5-VL-7B-Instruct | Qwen2VLForConditionalGeneration | ✅ |
| openai/whisper-large-v3 | WhisperForConditionalGeneration | ✅ |
✅ Runs and is optimized.
🟨 Runs and produces correct output, but is not yet optimized to ✅ status.

View File

@@ -2,9 +2,9 @@
## Validated Hardware
| Hardware |
| -------- |
| [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) |
## Recommended Models
@@ -12,53 +12,53 @@
| Model | Architecture | FP16 | Dynamic FP8 | MXFP4 |
| ----------------------------------------- | ---------------------------------------------------- | ---- | ----------- | ----- |
| openai/gpt-oss-20b | GPTForCausalLM | | | ✅ |
| openai/gpt-oss-120b | GPTForCausalLM | | | ✅ |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | LlamaForCausalLM | ✅ | ✅ | |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | QwenForCausalLM | ✅ | ✅ | |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | QwenForCausalLM | ✅ | ✅ | |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | LlamaForCausalLM | ✅ | ✅ | |
| Qwen/Qwen2.5-72B-Instruct | Qwen2ForCausalLM | ✅ | ✅ | |
| Qwen/Qwen3-14B | Qwen3ForCausalLM | ✅ | ✅ | |
| Qwen/Qwen3-32B | Qwen3ForCausalLM | ✅ | ✅ | |
| Qwen/Qwen3-30B-A3B | Qwen3ForCausalLM | ✅ | ✅ | |
| Qwen/Qwen3-30B-A3B-GPTQ-Int4 | Qwen3ForCausalLM | ✅ | ✅ | |
| Qwen/Qwen3-coder-30B-A3B-Instruct | Qwen3ForCausalLM | ✅ | ✅ | |
| Qwen/QwQ-32B | QwenForCausalLM | ✅ | ✅ | |
| deepseek-ai/DeepSeek-V2-Lite | DeepSeekForCausalLM | ✅ | ✅ | |
| meta-llama/Llama-3.1-8B-Instruct | LlamaForCausalLM | ✅ | ✅ | |
| baichuan-inc/Baichuan2-13B-Chat | BaichuanForCausalLM | ✅ | ✅ | |
| THUDM/GLM-4-9B-chat | GLMForCausalLM | ✅ | ✅ | |
| THUDM/CodeGeex4-All-9B | CodeGeexForCausalLM | ✅ | ✅ | |
| chuhac/TeleChat2-35B | LlamaForCausalLM (TeleChat2 based on Llama arch) | ✅ | ✅ | |
| 01-ai/Yi1.5-34B-Chat | YiForCausalLM | ✅ | ✅ | |
| THUDM/CodeGeex4-All-9B | CodeGeexForCausalLM | ✅ | ✅ | |
| deepseek-ai/DeepSeek-Coder-33B-base | DeepSeekCoderForCausalLM | ✅ | ✅ | |
| baichuan-inc/Baichuan2-13B-Chat | BaichuanForCausalLM | ✅ | ✅ | |
| meta-llama/Llama-2-13b-chat-hf | LlamaForCausalLM | ✅ | ✅ | |
| THUDM/CodeGeex4-All-9B | CodeGeexForCausalLM | ✅ | ✅ | |
| Qwen/Qwen1.5-14B-Chat | QwenForCausalLM | ✅ | ✅ | |
| Qwen/Qwen1.5-32B-Chat | QwenForCausalLM | ✅ | ✅ | |
### Multimodal Language Models
| Model | Architecture | FP16 | Dynamic FP8 | MXFP4 |
| ---------------------------- | -------------------------------- | ---- | ----------- | ----- |
| OpenGVLab/InternVL3_5-8B | InternVLForConditionalGeneration | ✅ | ✅ | |
| OpenGVLab/InternVL3_5-14B | InternVLForConditionalGeneration | ✅ | ✅ | |
| OpenGVLab/InternVL3_5-38B | InternVLForConditionalGeneration | ✅ | ✅ | |
| Qwen/Qwen2-VL-7B-Instruct | Qwen2VLForConditionalGeneration | ✅ | ✅ | |
| Qwen/Qwen2.5-VL-72B-Instruct | Qwen2VLForConditionalGeneration | ✅ | ✅ | |
| Qwen/Qwen2.5-VL-32B-Instruct | Qwen2VLForConditionalGeneration | ✅ | ✅ | |
| THUDM/GLM-4v-9B | GLM4vForConditionalGeneration | ✅ | ✅ | |
| openbmb/MiniCPM-V-4 | MiniCPMVForConditionalGeneration | ✅ | ✅ | |
### Embedding and Reranker Language Models
| Model | Architecture | FP16 | Dynamic FP8 | MXFP4 |
| ----------------------- | ------------------------------ | ---- | ----------- | ----- |
| Qwen/Qwen3-Embedding-8B | Qwen3ForTextEmbedding | ✅ | ✅ | |
| Qwen/Qwen3-Reranker-8B | Qwen3ForSequenceClassification | ✅ | ✅ | |
✅ Runs and is optimized.
🟨 Runs and produces correct output, but is not yet optimized to ✅ status.

View File

@@ -31,7 +31,7 @@ vLLM will attempt to automatically convert the model according to the architectu
shown in the table below.
| Architecture | `--convert` | Supported pooling tasks |
| ----------------------------------------------- | ----------- | ------------------------------------- |
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `token_embed`, `embed` |
| `*ForRewardModeling`, `*RewardModel` | `embed` | `token_embed`, `embed` |
| `*For*Classification`, `*ClassificationModel` | `classify` | `token_classify`, `classify`, `score` |
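For illustration, the architecture-name patterns in the table can be matched with shell-style globbing. This is only a sketch of the lookup logic the table implies, not vLLM's actual resolution code (the rule ordering and helper name are assumptions):

```python
from fnmatch import fnmatch

# Pattern groups from the table above, checked in order (illustrative).
_CONVERT_RULES = [
    (("*ForTextEncoding", "*EmbeddingModel", "*Model"), "embed"),
    (("*ForRewardModeling", "*RewardModel"), "embed"),
    (("*For*Classification", "*ClassificationModel"), "classify"),
]

def convert_mode(architecture: str):
    """Return the --convert mode implied by the architecture name, or None."""
    for patterns, mode in _CONVERT_RULES:
        if any(fnmatch(architecture, pattern) for pattern in patterns):
            return mode
    return None
```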
@@ -46,7 +46,7 @@ Each pooling model in vLLM supports one or more of these tasks according to
enabling the corresponding APIs:
| Task | APIs |
| ---------------- | ----------------------------------------------------------------------------- |
| `embed` | `LLM.embed(...)`, `LLM.score(...)`\*, `LLM.encode(..., pooling_task="embed")` |
| `classify` | `LLM.classify(...)`, `LLM.encode(..., pooling_task="classify")` |
| `score` | `LLM.score(...)` |
@@ -69,7 +69,7 @@ If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:
| Task | Pooling Type | Normalization | Softmax |
| ---------- | ------------ | ------------- | ------- |
| `embed` | `LAST` | ✅︎ | ❌ |
| `classify` | `LAST` | ❌ | ✅︎ |
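As a concrete sketch of the default `embed` pooler behaviour in the table (LAST pooling followed by L2 normalization, no softmax), here is illustrative code, not vLLM's implementation:

```python
import math

def embed_pool(hidden_states: list) -> list:
    """LAST pooling + L2 normalization, mirroring the `embed` row above."""
    last = hidden_states[-1]                    # take the last token's hidden state
    norm = math.sqrt(sum(x * x for x in last))  # L2 norm
    return [x / norm for x in last] if norm else last
```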
@@ -314,7 +314,7 @@ An OpenAI client example can be found here: [examples/pooling/embed/openai_embed
vLLM supports ColBERT models with multiple encoder backbones:
| Architecture | Backbone | Example HF Models |
| - | - | - |
| `HF_ColBERT` | BERT | `answerdotai/answerai-colbert-small-v1`, `colbert-ir/colbertv2.0` |
| `ColBERTModernBertModel` | ModernBERT | `lightonai/GTE-ModernColBERT-v1` |
| `ColBERTJinaRobertaModel` | Jina XLM-RoBERTa | `jinaai/jina-colbert-v2` |
@@ -379,7 +379,7 @@ An example can be found here: [examples/pooling/score/colbert_rerank_online.py](
ColQwen3 is based on [ColPali](https://arxiv.org/abs/2407.01449), which extends ColBERT's late interaction approach to **multi-modal** inputs. While ColBERT operates on text-only token embeddings, ColPali/ColQwen3 can embed both **text and images** (e.g. PDF pages, screenshots, diagrams) into per-token L2-normalized vectors and compute relevance via MaxSim scoring. ColQwen3 specifically uses Qwen3-VL as its vision-language backbone.
| Architecture | Backbone | Example HF Models |
| - | - | - |
| `ColQwen3` | Qwen3-VL | `TomoroAI/tomoro-colqwen3-embed-4b`, `TomoroAI/tomoro-colqwen3-embed-8b` |
| `OpsColQwen3Model` | Qwen3-VL | `OpenSearch-AI/Ops-Colqwen3-4B`, `OpenSearch-AI/Ops-Colqwen3-8B` |
| `Qwen3VLNemotronEmbedModel` | Qwen3-VL | `nvidia/nemotron-colembed-vl-4b-v2`, `nvidia/nemotron-colembed-vl-8b-v2` |
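The MaxSim scoring mentioned above can be sketched in a few lines: for each query token embedding, take its maximum dot product over all document token embeddings, then sum over query tokens. A minimal illustration (assumes vectors are already L2-normalized, as in ColBERT-style models; plain lists are used instead of tensors for clarity):

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction relevance: sum over query tokens of the best
    dot-product match among document tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```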
@@ -507,7 +507,7 @@ Llama Nemotron VL Embedding models combine the bidirectional Llama embedding bac
single-vector embeddings from text and/or images.
| Architecture | Backbone | Example HF Models |
| - | - | - |
| `LlamaNemotronVLModel` | Bidirectional Llama + SigLIP | `nvidia/llama-nemotron-embed-vl-1b-v2` |
Start the server:
@@ -567,7 +567,7 @@ Llama Nemotron VL reranker models combine the same bidirectional Llama + SigLIP
backbone with a sequence-classification head for cross-encoder scoring and reranking.
| Architecture | Backbone | Example HF Models |
| - | - | - |
| `LlamaNemotronVLForSequenceClassification` | Bidirectional Llama + SigLIP | `nvidia/llama-nemotron-rerank-vl-1b-v2` |
Start the server:

View File

@@ -179,7 +179,7 @@ class MyConfig(PretrainedConfig):
Some model architectures are supported via vLLM plugins. These plugins extend vLLM's capabilities through the [plugin system](../design/plugin_system.md).
| Architecture | Models | Plugin Repository |
| ------------ | ------ | ----------------- |
| `BartForConditionalGeneration` | BART | [bart-plugin](https://github.com/vllm-project/bart-plugin) |
| `Florence2ForConditionalGeneration` | Florence-2 | [bart-plugin](https://github.com/vllm-project/bart-plugin) |
@@ -363,7 +363,7 @@ th {
</style>
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
| `AfmoeForCausalLM` | Afmoe | TBA | ✅︎ | ✅︎ |
| `ApertusForCausalLM` | Apertus | `swiss-ai/Apertus-8B-2509`, `swiss-ai/Apertus-70B-Instruct-2509`, etc. | ✅︎ | ✅︎ |
| `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ |
@@ -387,7 +387,7 @@ th {
| `Dots1ForCausalLM` | dots.llm1 | `rednote-hilab/dots.llm1.base`, `rednote-hilab/dots.llm1.inst`, etc. | | ✅︎ |
| `DotsOCRForCausalLM` | dots_ocr | `rednote-hilab/dots.ocr` | ✅︎ | ✅︎ |
| `Ernie4_5ForCausalLM` | Ernie4.5 | `baidu/ERNIE-4.5-0.3B-PT`, etc. | ✅︎ | ✅︎ |
| `Ernie4_5_MoeForCausalLM` | Ernie4.5MoE | `baidu/ERNIE-4.5-21B-A3B-PT`, `baidu/ERNIE-4.5-300B-A47B-PT`, etc. | ✅︎ | ✅︎ |
| `ExaoneForCausalLM` | EXAONE-3 | `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. | ✅︎ | ✅︎ |
| `ExaoneMoEForCausalLM` | K-EXAONE | `LGAI-EXAONE/K-EXAONE-236B-A23B`, etc. | | |
| `Exaone4ForCausalLM` | EXAONE-4 | `LGAI-EXAONE/EXAONE-4.0-32B`, etc. | ✅︎ | ✅︎ |
@@ -427,18 +427,18 @@ th {
| `Jais2ForCausalLM` | Jais2 | `inceptionai/Jais-2-8B-Chat`, `inceptionai/Jais-2-70B-Chat`, etc. | | ✅︎ |
| `JambaForCausalLM` | Jamba | `ai21labs/AI21-Jamba-1.5-Large`, `ai21labs/AI21-Jamba-1.5-Mini`, `ai21labs/Jamba-v0.1`, etc. | ✅︎ | ✅︎ |
| `KimiLinearForCausalLM` | Kimi-Linear-48B-A3B-Base, Kimi-Linear-48B-A3B-Instruct | `moonshotai/Kimi-Linear-48B-A3B-Base`, `moonshotai/Kimi-Linear-48B-A3B-Instruct` | | ✅︎ |
| `Lfm2ForCausalLM` | LFM2 | `LiquidAI/LFM2-1.2B`, `LiquidAI/LFM2-700M`, `LiquidAI/LFM2-350M`, etc. | ✅︎ | ✅︎ |
| `Lfm2MoeForCausalLM` | LFM2MoE | `LiquidAI/LFM2-8B-A1B-preview`, etc. | ✅︎ | ✅︎ |
| `LlamaForCausalLM` | Llama 3.1, Llama 3, Llama 2, LLaMA, Yi | `meta-llama/Meta-Llama-3.1-405B-Instruct`, `meta-llama/Meta-Llama-3.1-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `01-ai/Yi-34B`, etc. | ✅︎ | ✅︎ |
| `LongcatFlashForCausalLM` | LongCat-Flash | `meituan-longcat/LongCat-Flash-Chat`, `meituan-longcat/LongCat-Flash-Chat-FP8` | ✅︎ | ✅︎ |
| `MambaForCausalLM` | Mamba | `state-spaces/mamba-130m-hf`, `state-spaces/mamba-790m-hf`, `state-spaces/mamba-2.8b-hf`, etc. | | ✅︎ |
| `Mamba2ForCausalLM` | Mamba2 | `mistralai/Mamba-Codestral-7B-v0.1`, etc. | | ✅︎ |
| `MiMoForCausalLM` | MiMo | `XiaomiMiMo/MiMo-7B-RL`, etc. | ✅︎ | ✅︎ |
| `MiMoV2FlashForCausalLM` | MiMoV2Flash | `XiaomiMiMo/MiMo-V2-Flash`, etc. | | ✅︎ |
| `MiniCPMForCausalLM` | MiniCPM | `openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, `openbmb/MiniCPM-S-1B-sft`, etc. | ✅︎ | ✅︎ |
| `MiniCPM3ForCausalLM` | MiniCPM3 | `openbmb/MiniCPM3-4B`, etc. | ✅︎ | ✅︎ |
| `MiniMaxForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-Text-01-hf`, etc. | | |
| `MiniMaxM2ForCausalLM` | MiniMax-M2, MiniMax-M2.1 | `MiniMaxAI/MiniMax-M2`, etc. | ✅︎ | ✅︎ |
| `MistralForCausalLM` | Ministral-3, Mistral, Mistral-Instruct | `mistralai/Ministral-3-3B-Instruct-2512`, `mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc. | ✅︎ | ✅︎ |
| `MistralLarge3ForCausalLM` | Mistral-Large-3-675B-Base-2512, Mistral-Large-3-675B-Instruct-2512 | `mistralai/Mistral-Large-3-675B-Base-2512`, `mistralai/Mistral-Large-3-675B-Instruct-2512`, etc. | ✅︎ | ✅︎ |
| `MixtralForCausalLM` | Mixtral-8x7B, Mixtral-8x7B-Instruct | `mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc. | ✅︎ | ✅︎ |
@@ -453,9 +453,9 @@ th {
| `OPTForCausalLM` | OPT, OPT-IML | `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc. | ✅︎ | ✅︎ |
| `OrionForCausalLM` | Orion | `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc. | | ✅︎ |
| `OuroForCausalLM` | ouro | `ByteDance/Ouro-1.4B`, `ByteDance/Ouro-2.6B`, etc. | ✅︎ | |
| `PanguEmbeddedForCausalLM` | openPangu-Embedded-7B | `FreedomIntelligence/openPangu-Embedded-7B-V1.1` | ✅︎ | ✅︎ |
| `PanguProMoEV2ForCausalLM` | openpangu-pro-moe-v2 | | ✅︎ | ✅︎ |
| `PanguUltraMoEForCausalLM` | openpangu-ultra-moe-718b-model | `FreedomIntelligence/openPangu-Ultra-MoE-718B-V1.1` | ✅︎ | ✅︎ |
| `PhiForCausalLM` | Phi | `microsoft/phi-1_5`, `microsoft/phi-2`, etc. | ✅︎ | ✅︎ |
| `Phi3ForCausalLM` | Phi-4, Phi-3 | `microsoft/Phi-4-mini-instruct`, `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc. | ✅︎ | ✅︎ |
| `PhiMoEForCausalLM` | Phi-3.5-MoE | `microsoft/Phi-3.5-MoE-instruct`, etc. | ✅︎ | ✅︎ |
@@ -477,7 +477,7 @@ th {
| `StableLMEpochForCausalLM` | StableLM Epoch | `stabilityai/stablelm-zephyr-3b`, etc. | | ✅︎ |
| `Starcoder2ForCausalLM` | Starcoder2 | `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc. | | ✅︎ |
| `Step1ForCausalLM` | Step-Audio | `stepfun-ai/Step-Audio-EditX`, etc. | ✅︎ | ✅︎ |
| `Step3p5ForCausalLM` | Step-3.5-flash | `stepfun-ai/Step-3.5-Flash`, etc. | | ✅︎ |
| `TeleChatForCausalLM` | TeleChat | `chuhac/TeleChat2-35B`, etc. | ✅︎ | ✅︎ |
| `TeleChat2ForCausalLM` | TeleChat2 | `Tele-AI/TeleChat2-3B`, `Tele-AI/TeleChat2-7B`, `Tele-AI/TeleChat2-35B`, etc. | ✅︎ | ✅︎ |
| `TeleFLMForCausalLM` | TeleFLM | `CofeAI/FLM-2-52B-Instruct-2407`, `CofeAI/Tele-FLM`, etc. | ✅︎ | ✅︎ |
@@ -492,7 +492,7 @@ th {
Some models are supported only via the [Transformers modeling backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers modeling backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!
| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
| `SmolLM3ForCausalLM` | SmolLM3 | `HuggingFaceTB/SmolLM3-3B` | ✅︎ | ✅︎ |
!!! note
@@ -511,16 +511,16 @@ See [this page](./pooling_models.md) for more information on how to use pooling
These models primarily support the [`LLM.embed`](./pooling_models.md#llmembed) API.

| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
| `BertModel`<sup>C</sup> | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | |
| `BertSpladeSparseEmbeddingModel` | SPLADE | `naver/splade-v3` | | |
| `Gemma2Model`<sup>C</sup> | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | ✅︎ |
| `Gemma3TextModel`<sup>C</sup> | Gemma 3-based | `google/embeddinggemma-300m`, etc. | ✅︎ | ✅︎ |
| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm` | ✅︎ | ✅︎ |
| `GteModel`<sup>C</sup> | Arctic-Embed-2.0-M | `Snowflake/snowflake-arctic-embed-m-v2.0` | | |
| `GteNewModel`<sup>C</sup> | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | |
| `ModernBertModel`<sup>C</sup> | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | |
| `NomicBertModel`<sup>C</sup> | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | |
| `LlamaBidirectionalModel`<sup>C</sup> | Llama-based with bidirectional attention | `nvidia/llama-nemotron-embed-1b-v2`, etc. | ✅︎ | ✅︎ |
| `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ |
| `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ |
@@ -555,7 +555,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding
These models primarily support the [`LLM.classify`](./pooling_models.md#llmclassify) API.

| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ |
| `GPT2ForSequenceClassification` | GPT2 | `nie3e/sentiment-polish-gpt2-small` | | |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |
@@ -572,7 +572,7 @@ Cross-encoder and reranker models are a subset of classification models that acc
These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) API.

| Architecture | Models | Example HF Models | Score template (see note) | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | ------------------------- | --------------------------- | --------------------------------------- |
| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | N/A | | |
| `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | [bge-reranker-v2-gemma.jinja](../../examples/pooling/score/template/bge-reranker-v2-gemma.jinja) | ✅︎ | ✅︎ |
| `GteNewForSequenceClassification` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-reranker-base`, etc. | N/A | | |
@@ -622,7 +622,7 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
These models primarily support the [`LLM.reward`](./pooling_models.md#llmreward) API.

| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
@@ -637,9 +637,9 @@ These models primarily support the [`LLM.reward`](./pooling_models.md#llmreward)
These models primarily support the [`LLM.encode`](./pooling_models.md#llmencode) API.

| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | --------------------------- | --------------------------------------- |
| `BertForTokenClassification` | BERT-based | `boltuix/NeuroBERT-NER` (see note), etc. | | |
| `ModernBertForTokenClassification` | ModernBERT-based | `disham993/electrical-ner-ModernBERT-base` | | |

!!! note
    For Named Entity Recognition (NER) usage, please refer to [examples/pooling/token_classify/ner_offline.py](../../examples/pooling/token_classify/ner_offline.py) and [examples/pooling/token_classify/ner_online.py](../../examples/pooling/token_classify/ner_online.py).
@@ -678,7 +678,7 @@ See [this page](generative_models.md) for more information on how to use generat
These models primarily accept the [`LLM.generate`](./generative_models.md#llmgenerate) API. Chat/Instruct models additionally support the [`LLM.chat`](./generative_models.md#llmchat) API.

| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ------ | ----------------- | -------------------- | ------------------------- |
| `AriaForConditionalGeneration` | Aria | T + I<sup>+</sup> | `rhymes-ai/Aria` | | |
| `AudioFlamingo3ForConditionalGeneration` | AudioFlamingo3 | T + A | `nvidia/audio-flamingo-3-hf`, `nvidia/music-flamingo-2601-hf` | ✅︎ | ✅︎ |
| `AyaVisionForConditionalGeneration` | Aya Vision | T + I<sup>+</sup> | `CohereLabs/aya-vision-8b`, `CohereLabs/aya-vision-32b`, etc. | | ✅︎ |
@@ -698,7 +698,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
| `GLM4VForCausalLM`<sup>^</sup> | GLM-4V | T + I | `zai-org/glm-4v-9b`, `zai-org/cogagent-9b-20241220`, etc. | ✅︎ | ✅︎ |
| `Glm4vForConditionalGeneration` | GLM-4.1V-Thinking | T + I<sup>E+</sup> + V<sup>E+</sup> | `zai-org/GLM-4.1V-9B-Thinking`, etc. | ✅︎ | ✅︎ |
| `Glm4vMoeForConditionalGeneration` | GLM-4.5V | T + I<sup>E+</sup> + V<sup>E+</sup> | `zai-org/GLM-4.5V`, etc. | ✅︎ | ✅︎ |
| `GlmOcrForConditionalGeneration` | GLM-OCR | T + I<sup>E+</sup> | `zai-org/GLM-OCR`, etc. | ✅︎ | ✅︎ |
| `GraniteSpeechForConditionalGeneration` | Granite Speech | T + A | `ibm-granite/granite-speech-3.3-8b` | ✅︎ | ✅︎ |
| `HCXVisionForCausalLM` | HyperCLOVAX-SEED-Vision-Instruct-3B | T + I<sup>+</sup> + V<sup>+</sup> | `naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B` | | |
| `H2OVLChatModel` | H2OVL | T + I<sup>E+</sup> | `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc. | | ✅︎ |
@@ -714,7 +714,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
| `KeyeVL1_5ForConditionalGeneration` | Keye-VL-1_5-8B | T + I<sup>E+</sup> + V<sup>E+</sup> | `Kwai-Keye/Keye-VL-1_5-8B` | ✅︎ | ✅︎ |
| `KimiVLForConditionalGeneration` | Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking | T + I<sup>+</sup> | `moonshotai/Kimi-VL-A3B-Instruct`, `moonshotai/Kimi-VL-A3B-Thinking` | | ✅︎ |
| `KimiK25ForConditionalGeneration` | Kimi-K2.5 | T + I<sup>+</sup> | `moonshotai/Kimi-K2.5` | | ✅︎ |
| `LightOnOCRForConditionalGeneration` | LightOnOCR-1B | T + I<sup>+</sup> | `lightonai/LightOnOCR-1B`, etc. | ✅︎ | ✅︎ |
| `Lfm2VlForConditionalGeneration` | LFM2-VL | T + I<sup>+</sup> | `LiquidAI/LFM2-VL-450M`, `LiquidAI/LFM2-VL-3B`, `LiquidAI/LFM2-VL-8B-A1B`, etc. | ✅︎ | ✅︎ |
| `Llama4ForConditionalGeneration` | Llama 4 | T + I<sup>+</sup> | `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc. | ✅︎ | ✅︎ |
| `Llama_Nemotron_Nano_VL` | Llama Nemotron Nano VL | T + I<sup>E+</sup> | `nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1` | ✅︎ | ✅︎ |
@@ -731,7 +731,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
| `Molmo2ForConditionalGeneration` | Molmo2 | T + I<sup>+</sup> / V | `allenai/Molmo2-4B`, `allenai/Molmo2-8B`, `allenai/Molmo2-O-7B` | ✅︎ | ✅︎ |
| `NVLM_D_Model` | NVLM-D 1.0 | T + I<sup>+</sup> | `nvidia/NVLM-D-72B`, etc. | | ✅︎ |
| `OpenCUAForConditionalGeneration` | OpenCUA-7B | T + I<sup>E+</sup> | `xlangai/OpenCUA-7B` | ✅︎ | ✅︎ |
| `OpenPanguVLForConditionalGeneration` | openpangu-VL | T + I<sup>E+</sup> + V<sup>E+</sup> | `FreedomIntelligence/openPangu-VL-7B` | ✅︎ | ✅︎ |
| `Ovis` | Ovis2, Ovis1.6 | T + I<sup>+</sup> | `AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc. | | ✅︎ |
| `Ovis2_5` | Ovis2.5 | T + I<sup>+</sup> + V | `AIDC-AI/Ovis2.5-9B`, etc. | | |
| `Ovis2_6ForCausalLM` | Ovis2.6 | T + I<sup>+</sup> + V | `AIDC-AI/Ovis2.6-2B`, etc. | | |
@@ -764,7 +764,7 @@ These models primarily accept the [`LLM.generate`](./generative_models.md#llmgen
Some models are supported only via the [Transformers modeling backend](#transformers). The purpose of the table below is to acknowledge models which we officially support in this way. The logs will say that the Transformers modeling backend is being used, and you will see no warning that this is fallback behaviour. This means that, if you have issues with any of the models listed below, please [make an issue](https://github.com/vllm-project/vllm/issues/new/choose) and we'll do our best to fix it!

| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ------ | ----------------- | --------------------------- | --------------------------------------- |
| `Emu3ForConditionalGeneration` | Emu3 | T + I | `BAAI/Emu3-Chat-hf` | ✅︎ | ✅︎ |

<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.<br/>
@@ -795,7 +795,7 @@ Some models are supported only via the [Transformers modeling backend](#transfor
Speech2Text models trained specifically for Automatic Speech Recognition.

| Architecture | Models | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | -------------------- | ------------------------- |
| `FireRedASR2ForConditionalGeneration` | FireRedASR2 | `allendou/FireRedASR2-LLM-vllm`, etc. | | |
| `FunASRForConditionalGeneration` | FunASR | `allendou/Fun-ASR-Nano-2512-vllm`, etc. | | |
| `Gemma3nForConditionalGeneration` | Gemma3n | `google/gemma-3n-E2B-it`, `google/gemma-3n-E4B-it`, etc. | | |
@@ -823,7 +823,7 @@ These models primarily support the [`LLM.embed`](./pooling_models.md#llmembed) A
The following table lists those that are tested in vLLM.

| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ------ | ----------------- | -------------------- | ------------------------- |
| `CLIPModel` | CLIP | T / I | `openai/clip-vit-base-patch32`, `openai/clip-vit-large-patch14`, etc. | | |
| `ColModernVBertForRetrieval` | ColModernVBERT | T / I | `ModernVBERT/colmodernvbert-merged` | | |
| `LlamaNemotronVLModel` | Llama Nemotron Embedding + SigLIP | T + I | `nvidia/llama-nemotron-embed-vl-1b-v2` | | |
@@ -844,7 +844,7 @@ Cross-encoder and reranker models are a subset of classification models that acc
These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) API.

| Architecture | Models | Inputs | Example HF Models | [LoRA](../features/lora.md) | [PP](../serving/parallelism_scaling.md) |
| ------------ | ------ | ------ | ----------------- | -------------------- | ------------------------- |
| `JinaVLForSequenceClassification` | JinaVL-based | T + I<sup>E+</sup> | `jinaai/jina-reranker-m0`, etc. | ✅︎ | ✅︎ |
| `LlamaNemotronVLForSequenceClassification` | Llama Nemotron Reranker + SigLIP | T + I<sup>E+</sup> | `nvidia/llama-nemotron-rerank-vl-1b-v2` | | |
| `Qwen3VLForSequenceClassification` | Qwen3-VL-Reranker | T + I<sup>E+</sup> + V<sup>E+</sup> | `Qwen/Qwen3-VL-Reranker-2B` (see note), etc. | ✅︎ | ✅︎ |


@@ -17,7 +17,7 @@ Before using EP, you need to install the necessary dependencies. We are actively
vLLM provides multiple communication backends for EP. Use `--all2all-backend` to select one:

| Backend | Use Case | Features | Best For |
| ------- | -------- | -------- | -------- |
| `allgather_reducescatter` | Default backend | Standard all2all using allgather/reducescatter primitives | General purpose, works with any EP+DP configuration |
| `deepep_high_throughput` | Multi-node prefill | Grouped GEMM with continuous layout, optimized for prefill | Prefill-dominated workloads, high-throughput scenarios |
| `deepep_low_latency` | Multi-node decode | CUDA graph support, masked layout, optimized for decode | Decode-dominated workloads, low-latency scenarios |
@@ -48,7 +48,7 @@ Where:
When EP is enabled, different layers in MoE models behave differently:

| Layer Type | Behavior | Parallelism Used |
| ---------- | -------- | ---------------- |
| **Expert (MoE) Layers** | Sharded across all EP ranks | Expert Parallel (EP) of size `TP × DP` |
| **Attention Layers** | Behavior depends on TP size | See below |
@@ -146,9 +146,9 @@ When enabled, vLLM collects load statistics with every forward pass and periodic
Configure EPLB with the `--eplb-config` argument, which accepts a JSON string. The available keys and their descriptions are:

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `window_size` | Number of engine steps to track for rebalancing decisions | `1000` |
| `step_interval` | Frequency of rebalancing (every N engine steps) | `3000` |
| `log_balancedness` | Log balancedness metrics (avg tokens per expert ÷ max tokens per expert) | `false` |
| `num_redundant_experts` | Additional global experts per EP rank beyond equal distribution | `0` |
| `use_async` | Use non-blocking EPLB for reduced latency overhead | `false` |
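As a sketch of how these keys fit together, the JSON string for `--eplb-config` can be assembled programmatically. The key names below come from the table above; the values are just the documented defaults, not tuning recommendations:

```python
import json

# EPLB configuration using the keys documented above.
# Values here are the documented defaults, shown for illustration.
eplb_config = {
    "window_size": 1000,         # engine steps tracked for rebalancing
    "step_interval": 3000,       # rebalance every N engine steps
    "log_balancedness": False,   # log avg/max tokens-per-expert ratio
    "num_redundant_experts": 0,  # extra global experts per EP rank
    "use_async": False,          # non-blocking EPLB
}

# The flag expects a single JSON string, e.g.:
#   vllm serve <model> --eplb-config '<json>'
flag_value = json.dumps(eplb_config)
print(flag_value)
```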


@@ -596,7 +596,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan
#### Client → Server Events

| Event | Description |
| ----- | ----------- |
| `input_audio_buffer.append` | Send base64-encoded audio chunk: `{"type": "input_audio_buffer.append", "audio": "<base64>"}` |
| `input_audio_buffer.commit` | Trigger transcription processing or end: `{"type": "input_audio_buffer.commit", "final": bool}` |
| `session.update` | Configure session: `{"type": "session.update", "model": "model-name"}` |
@@ -604,7 +604,7 @@ Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono chan
#### Server → Client Events

| Event | Description |
| ----- | ----------- |
| `session.created` | Connection established with session ID and timestamp |
| `transcription.delta` | Incremental transcription text: `{"type": "transcription.delta", "delta": "text"}` |
| `transcription.done` | Final transcription with usage stats |
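A minimal client-side sketch of assembling the client → server events above. The helper functions, model name, and placeholder audio bytes are illustrative, not part of vLLM's API; only the event shapes come from the tables:

```python
import base64
import json


def make_append_event(pcm16_bytes: bytes) -> str:
    """Wrap a raw PCM16 (16 kHz, mono) chunk in an append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    })


def make_commit_event(final: bool) -> str:
    """Trigger transcription processing, or end the stream if final."""
    return json.dumps({"type": "input_audio_buffer.commit", "final": final})


# A client would send these frames over the websocket in order
# (model name is an arbitrary example):
frames = [
    json.dumps({"type": "session.update", "model": "openai/whisper-large-v3"}),
    make_append_event(b"\x00\x01" * 160),  # placeholder audio chunk
    make_commit_event(final=True),
]
```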


@@ -83,13 +83,13 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
### Hardware

| Hardware | Status |
| ------------- | --------------- |
| **NVIDIA** | <nobr>🟢</nobr> |
| **AMD** | <nobr>🟢</nobr> |
| **INTEL GPU** | <nobr>🟢</nobr> |
| **TPU** | <nobr>🟢</nobr> |
| **CPU** | <nobr>🟢</nobr> |

!!! note
@@ -104,13 +104,13 @@ based on assigned priority, with FCFS as a tie-breaker), configurable via the
### Models

| Model Type | Status |
| -------------------------- | --------------------------------------- |
| **Decoder-only Models** | <nobr>🟢</nobr> |
| **Encoder-Decoder Models** | <nobr>🟢 (Whisper), 🔴 (Others)</nobr> |
| **Pooling Models** | <nobr>🟢</nobr> |
| **Mamba Models** | <nobr>🟢</nobr> |
| **Multimodal Models** | <nobr>🟢</nobr> |

See below for the status of models that are not yet supported or have more features planned in V1.
@@ -145,7 +145,7 @@ following a similar pattern by implementing support through the [plugin system](
### Features

| Feature | Status |
| ------------------------------------------- | --------------------------------------------------------------------------------- |
| **Prefix Caching** | <nobr>🟢 Functional</nobr> |
| **Chunked Prefill** | <nobr>🟢 Functional</nobr> |
| **LoRA** | <nobr>🟢 Functional</nobr> |


@@ -34,7 +34,7 @@ deployment methods:
Both platforms provide equivalent monitoring capabilities:

| Dashboard | Description |
| --------- | ----------- |
| **Performance Statistics** | Tracks latency, throughput, and performance metrics |
| **Query Statistics** | Monitors request volume, query performance, and KPIs |


@@ -95,7 +95,7 @@ If you enable prefill instance (`--prefill-servers-urls` not disabled), you will
## Proxy Instance Flags (`disagg_epd_proxy.py`)
| Flag | Description |
-|------|-------------|
+| ---- | ----------- |
| `--encode-servers-urls` | Comma-separated list of encoder endpoints. Every multimodal item extracted from the request is fanned out to one of these URLs in a round-robin fashion. |
| `--prefill-servers-urls` | Comma-separated list of prefill endpoints. Set to `disable`, `none`, or `""` to skip the dedicated prefill phase and run E+PD (encoder + combined prefill/decode). |
| `--decode-servers-urls` | Comma-separated list of decode endpoints. Non-stream and stream paths both round-robin over this list. |
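The round-robin fan-out described for these flags can be sketched with Python's `itertools.cycle`. This is an illustrative sketch only — the class and method names below are hypothetical, not the actual implementation in `disagg_epd_proxy.py`:

```python
from itertools import cycle


class RoundRobinPool:
    """Hand out endpoint URLs in round-robin order (illustrative sketch)."""

    def __init__(self, urls: list[str]):
        if not urls:
            raise ValueError("at least one endpoint URL is required")
        # cycle() repeats the URL list indefinitely in order.
        self._it = cycle(urls)

    def next_url(self) -> str:
        return next(self._it)


# Two hypothetical encoder endpoints; requests alternate between them.
pool = RoundRobinPool(["http://enc-0:8000", "http://enc-1:8000"])
picks = [pool.next_url() for _ in range(4)]
# picks == ["http://enc-0:8000", "http://enc-1:8000",
#           "http://enc-0:8000", "http://enc-1:8000"]
```

The same pattern applies independently to the prefill and decode URL lists.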

View File

@@ -34,7 +34,7 @@ python client.py
## 📁 Files
| File | Description |
-|------|-------------|
+| ---- | ----------- |
| `service.sh` | Server startup script with chunked processing enabled |
| `client.py` | Comprehensive test client for long text embedding |
@@ -61,7 +61,7 @@ The key parameters for chunked processing are in the `--pooler-config`:
Chunked processing uses **MEAN aggregation** for cross-chunk combination when input exceeds the model's native maximum length:
| Component | Behavior | Description |
-|-----------|----------|-------------|
+| --------- | -------- | ----------- |
| **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy |
| **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts |
| **Performance** | Optimal | All chunks processed for complete semantic coverage |
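The cross-chunk MEAN aggregation in the table above can be sketched as a token-count-weighted average of per-chunk embeddings. This is a minimal illustration of the idea, not vLLM's implementation; the function name and plain-list representation are assumptions:

```python
def weighted_mean(chunk_embeddings: list[list[float]], token_counts: list[int]) -> list[float]:
    """Combine per-chunk embeddings, weighting each chunk by its token count."""
    total = sum(token_counts)
    dim = len(chunk_embeddings[0])
    out = [0.0] * dim
    for emb, n in zip(chunk_embeddings, token_counts):
        for i, v in enumerate(emb):
            out[i] += v * n / total
    return out


# A 300-token chunk and a 100-token chunk: the longer chunk
# contributes 3/4 of the final embedding.
print(weighted_mean([[1.0, 0.0], [0.0, 1.0]], [300, 100]))  # [0.75, 0.25]
```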
@@ -69,7 +69,7 @@ Chunked processing uses **MEAN aggregation** for cross-chunk combination when in
### Environment Variables
| Variable | Default | Description |
-|----------|---------|-------------|
+| -------- | ------- | ----------- |
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
@@ -106,7 +106,7 @@ With `MAX_EMBED_LEN=3072000`, you can process:
### Chunked Processing Performance
| Aspect | Behavior | Performance |
-|--------|----------|-------------|
+| ------ | -------- | ----------- |
| **Chunk Processing** | All chunks processed with native pooling | Consistent with input length |
| **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead |
| **Memory Usage** | Proportional to number of chunks | Moderate, scalable |

View File

@@ -1153,11 +1153,11 @@ def _render_table(
) -> list[str]:
    """Render a markdown table from column specs and backend data."""
    header = "| " + " | ".join(name for name, _ in columns) + " |"
-    sep = "|" + "|".join("-" * (len(name) + 2) for name, _ in columns) + "|"
+    sep = "| " + " | ".join("-" * len(name) for name, _ in columns) + " |"
    lines = [header, sep]
    for info in sorted(backends, key=_sort_key):
        row = "| " + " | ".join(fmt(info) for _, fmt in columns) + " |"
-        lines.append(row)
+        lines.append(row.replace("  ", " "))
    return lines
@@ -1268,7 +1268,7 @@ def _priority_table(title: str, backends: list[str]) -> list[str]:
        f"**{title}:**",
        "",
        "| Priority | Backend |",
-        "|----------|---------|",
+        "| -------- | ------- |",
        *[f"| {i} | `{b}` |" for i, b in enumerate(backends, 1)],
        "",
    ]
@@ -1317,7 +1317,7 @@ def generate_legend() -> str:
    return """## Legend
| Column | Description |
-|--------|-------------|
+| ------ | ----------- |
| **Dtypes** | Supported model data types (fp16, bf16, fp32) |
| **KV Dtypes** | Supported KV cache data types (`auto`, `fp8`, `fp8_e4m3`, etc.) |
| **Block Sizes** | Supported KV cache block sizes (%N means multiples of N) |
@@ -1348,7 +1348,7 @@ def generate_mla_section(
        "configuration.",
        "",
        "| Backend | Description | Compute Cap. | Enable | Disable | Notes |",
-        "|---------|-------------|--------------|--------|---------|-------|",
+        "| ------- | ----------- | ------------ | ------ | ------- | ----- |",
    ]
    for backend in prefill_backends:
@@ -1360,7 +1360,7 @@ def generate_mla_section(
            backend["disable"],
            backend.get("notes", ""),
        )
-        lines.append(row)
+        lines.append(row.replace("  ", " "))
    lines.extend(
        [
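The new `sep` construction in `_render_table` above can be checked in isolation. This standalone sketch uses a dummy `columns` list of `(name, formatter)` pairs (formatters stubbed as `None`, which the separator logic never calls) to show the markdownlint-friendly output:

```python
# Dummy column specs; only the names matter for the header and separator.
columns = [("Priority", None), ("Backend", None)]

# Mirrors the post-change lines from _render_table above.
header = "| " + " | ".join(name for name, _ in columns) + " |"
sep = "| " + " | ".join("-" * len(name) for name, _ in columns) + " |"

print(header)  # | Priority | Backend |
print(sep)     # | -------- | ------- |
```

Unlike the old `"-" * (len(name) + 2)` form, the dash run now matches the header cell width exactly, with the padding spaces outside the dashes — the style markdownlint's `--fix` produces.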

View File

@@ -43,14 +43,14 @@ Multi-lora shrink/expand Triton kernel tuning follows a similar methodology from
### File Naming
| Kernel Type | File Name Template | Example |
-|---------------------------|--------------------------------------------|---------------------------------------------|
+| ------------------------- | ------------------------------------------- | -------------------------------------------- |
| shrink | `{gpu_name}_SHRINK.json` | `NVIDIA_H200_SHRINK.json` |
| expand | `{gpu_name}_EXPAND_{add_input}.json` | `NVIDIA_H200_EXPAND_TRUE.json` |
| fused_moe_lora_w13_shrink | `{gpu_name}_FUSED_MOE_LORA_W13_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_SHRINK.json` |
| fused_moe_lora_w13_expand | `{gpu_name}_FUSED_MOE_LORA_W13_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W13_EXPAND.json` |
| fused_moe_lora_w2_shrink | `{gpu_name}_FUSED_MOE_LORA_W2_SHRINK.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_SHRINK.json` |
| fused_moe_lora_w2_expand | `{gpu_name}_FUSED_MOE_LORA_W2_EXPAND.json` | `NVIDIA_H200_FUSED_MOE_LORA_W2_EXPAND.json` |
The `gpu_name` can be automatically detected by calling `torch.cuda.get_device_name()`.
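As a hedged illustration of the naming scheme: on a CUDA machine `gpu_name` would come from `torch.cuda.get_device_name()` (e.g. `"NVIDIA H200"`), and the examples in the table suggest spaces map to underscores — that mapping, and the helper name below, are assumptions so the snippet runs anywhere without a GPU:

```python
def shrink_config_filename(gpu_name: str) -> str:
    """Build the tuned-config filename for the shrink kernel.

    Assumes spaces in the device name become underscores, matching the
    NVIDIA_H200_SHRINK.json example in the table above.
    """
    return f"{gpu_name.replace(' ', '_')}_SHRINK.json"


print(shrink_config_filename("NVIDIA H200"))  # NVIDIA_H200_SHRINK.json
```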