# Quantization
Quantization trades off model precision for a smaller memory footprint, allowing large models to run on a wider range of devices.
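To make the trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. This is illustrative only; vLLM's actual quantization methods use fused, hardware-specific kernels, and the function names here are hypothetical.

```python
# Illustrative symmetric per-tensor INT8 quantization.
# Each weight is stored as 1 byte instead of 4 (fp32) or 2 (fp16),
# at the cost of a rounding error bounded by scale / 2.

def quantize_int8(weights):
    """Quantize a list of floats to int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 1.27, -1.27]
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)
```

The real methods listed below refine this basic idea, e.g. with per-channel or per-group scales, activation quantization, or calibration against sample data.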
Contents:
- [AutoAWQ](auto_awq.md)
- [AutoRound](auto_round.md)
- [BitsAndBytes](bnb.md)
- [BitBLAS](bitblas.md)
- [GGUF](gguf.md)
- [GPTQModel](gptqmodel.md)
- [INC](inc.md)
- [INT4 W4A16](int4.md)
- [INT8 W8A8](int8.md)
- [FP8 W8A8](fp8.md)
- [NVIDIA TensorRT Model Optimizer](modelopt.md)
- [AMD Quark](quark.md)
- [Quantized KV Cache](quantized_kvcache.md)
- [TorchAO](torchao.md)
## Supported Hardware
The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
<style>
td:not(:first-child) {
text-align: center !important;
}
td {
padding: 0.5rem !important;
white-space: nowrap;
}
th {
padding: 0.5rem !important;
min-width: 0 !important;
}
th:not(:first-child) {
writing-mode: vertical-lr;
transform: rotate(180deg)
}
</style>
| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | Intel Gaudi | x86 CPU |
|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-------------|-----------|
| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ |
| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ❌ | ✅︎ |
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| BitBLAS               | ✅︎      | ✅︎       | ✅︎       | ✅︎    | ✅︎       | ❌        | ❌          | ❌          | ❌        |
| BitBLAS (GPTQ) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| INC (W8A8) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅︎ | ❌ |
- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- ✅︎ indicates that the quantization method is supported on the specified hardware.
- ❌ indicates that the quantization method is not supported on the specified hardware.
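The SM numbers in the note above can be mapped to the architecture names used in the table. A small sketch (the mapping simply mirrors the note; `sm_to_arch` is a hypothetical helper, not part of vLLM):

```python
# Map a CUDA compute capability (SM version) to the GPU architecture
# generation named in the compatibility table above.

def sm_to_arch(major: int, minor: int) -> str:
    sm = major * 10 + minor
    if sm == 70:
        return "Volta"
    if sm == 75:
        return "Turing"
    if sm in (80, 86):
        return "Ampere"
    if sm == 89:
        return "Ada"
    if sm == 90:
        return "Hopper"
    return "unknown"
```

On a CUDA machine, the `(major, minor)` pair can be obtained from PyTorch's `torch.cuda.get_device_capability()`.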
!!! note
    For information on quantization support on Google TPU, please refer to the [TPU-Inference Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/) documentation.
!!! note
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
For the most up-to-date information on hardware support and quantization methods, please refer to [vllm/model_executor/layers/quantization](../../../vllm/model_executor/layers/quantization) or consult with the vLLM development team.