# Quantization
Quantization trades off model precision for a smaller memory footprint, allowing large models to run on a wider range of devices.
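To make the trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. This is illustrative only; vLLM's actual quantization methods use fused, hardware-specific kernels, and the function names here are hypothetical.

```python
# Illustrative symmetric per-tensor INT8 quantization.
# Each weight is stored as 1 byte instead of 4 (fp32) or 2 (fp16),
# at the cost of a rounding error bounded by scale / 2.

def quantize_int8(weights):
    """Quantize a list of floats to int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27, -1.27]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
```

The real methods listed below refine this basic idea, e.g. with per-channel or per-group scales, activation quantization, or calibration against sample data.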
Contents:
- [AutoAWQ](auto_awq.md)
- [AutoRound](auto_round.md)
- [BitsAndBytes](bnb.md)
- [BitBLAS](bitblas.md)
- [GGUF](gguf.md)
- [GPTQModel](gptqmodel.md)
- [INC](inc.md)
- [INT4 W4A16](int4.md)
- [INT8 W8A8](int8.md)
- [FP8 W8A8](fp8.md)
- [NVIDIA TensorRT Model Optimizer](modelopt.md)
- [AMD Quark](quark.md)
- [Quantized KV Cache](quantized_kvcache.md)
- [TorchAO](torchao.md)
## Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
<style>
td:not(:first-child) {
text-align: center !important;
}
td {
padding: 0.5rem !important;
white-space: nowrap;
}
th {
padding: 0.5rem !important;
min-width: 0 !important;
}
th:not(:first-child) {
writing-mode: vertical-lr;
transform: rotate(180deg)
}
</style>
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU |
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| BitBLAS               | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌        | ❌          | ❌          | ❌        |
| BitBLAS (GPTQ) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ |
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- ❌ indicates that the quantization method is not supported on the specified hardware.
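The SM numbers in the note above can be mapped to the architecture names used in the table. A small sketch (the mapping simply mirrors the note; `sm_to_arch` is a hypothetical helper, not part of vLLM):

```python
# Map a CUDA compute capability (SM version) to the GPU architecture
# generation named in the compatibility table above.

def sm_to_arch(major: int, minor: int) -> str:
    sm = major * 10 + minor
    if sm == 70:
        return "Volta"
    if sm == 75:
        return "Turing"
    if sm in (80, 86):
        return "Ampere"
    if sm == 89:
        return "Ada"
    if sm == 90:
        return "Hopper"
    return "unknown"
```

On a CUDA machine, the `(major, minor)` pair can be obtained from PyTorch's `torch.cuda.get_device_capability()`.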
!!! note
    For information on quantization support on Google TPU, please refer to the [TPU-Inference Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/) documentation.
!!! note
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to [vllm/model_executor/layers/quantization](../../../vllm/model_executor/layers/quantization) or consult with the vLLM development team.