[Doc] Update Marlin support matrix for Turing (#34319)
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com>
@@ -48,7 +48,7 @@ th:not(:first-child) {
 |-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|
 | AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ |
 | GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ |
-| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
+| Marlin (GPTQ/AWQ/FP8/FP4) | ❌ | ✅︎* | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
 | INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ |
 | FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |
 | bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
@@ -59,6 +59,7 @@ th:not(:first-child) {
 - ✅︎ indicates that the quantization method is supported on the specified hardware.
 - ❌ indicates that the quantization method is not supported on the specified hardware.
 - All Intel Gaudi quantization support has been migrated to [vLLM-Gaudi](https://github.com/vllm-project/vllm-gaudi).
+- *Turing does not support Marlin MXFP4.

 !!! note
     For information on quantization support on Google TPU, please refer to the [TPU-Inference Recommended Models and Features](https://docs.vllm.ai/projects/tpu/en/latest/recommended_models_features/) documentation.
@@ -2,7 +2,7 @@
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
 Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
-Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
+Turing/Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
 Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.

 Please visit the HF collection of [quantized FP8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).
@@ -13,8 +13,8 @@ The FP8 types typically supported in hardware have two distinct representations,
 - **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.

 !!! note
-    FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
-    FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
+    FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
+    FP8 models will run on compute capability >= 7.5 (Turing) as weight-only W8A16, utilizing FP8 Marlin.

 ## Installation
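
Reviewer note: the thresholds in the updated FP8 note can be read as a simple decision rule on the NVIDIA compute capability. The sketch below is only an illustration of that rule, not vLLM code; the function name `fp8_mode` and the returned labels are hypothetical, and it encodes nothing beyond the two `>=` thresholds stated in the note.

```python
from typing import Tuple


def fp8_mode(capability: Tuple[int, int]) -> str:
    """Map an NVIDIA (major, minor) compute capability to the FP8 mode
    described in the updated note. Hypothetical helper, not a vLLM API."""
    # >= 8.9 (Ada Lovelace, Hopper): FP8 computation, i.e. W8A8.
    if capability >= (8, 9):
        return "W8A8"
    # >= 7.5 (Turing, Ampere): weight-only W8A16 through FP8 Marlin.
    if capability >= (7, 5):
        return "W8A16 (FP8 Marlin)"
    # Anything older (e.g. Volta, 7.0) is outside what the note covers.
    return "unsupported"


if __name__ == "__main__":
    for cap in [(7, 0), (7, 5), (8, 0), (8, 9), (9, 0)]:
        print(cap, "->", fp8_mode(cap))
```

Tuples compare element by element, so `(8, 6) >= (8, 9)` is false and Ampere falls through to the weight-only branch. On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be passed in directly.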