[Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4 (#21166)
@@ -231,9 +231,9 @@ python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \
--tasks gsm8k
```
## Using OCP MX (MXFP4, MXFP6) models
vLLM supports loading MXFP4 and MXFP6 models quantized offline through AMD Quark, compliant with the [Open Compute Project (OCP) specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
The scheme currently only supports dynamic quantization for activations.
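For intuition, the dynamic MX quantization scheme can be sketched in plain Python: per the OCP MX spec, each 32-element block shares a single power-of-two (E8M0) scale, and each element is rounded to the FP4 e2m1 grid. This is an illustrative sketch only, not vLLM's or Quark's actual kernel; the scale selection and round-to-nearest logic are simplified.

```python
import math

# FP4 e2m1 representable magnitudes per the OCP MX spec (maximum is 6.0).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Dynamically quantize one 32-element block to MXFP4:
    a shared power-of-two (E8M0) scale plus FP4 e2m1 elements."""
    assert len(block) == 32
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * 32  # scale exponent, quantized elements
    # Shared scale: chosen so amax/scale fits in e2m1 (<= 6.0).
    # 2 is the exponent of e2m1's largest value (6.0 = 2**2 * 1.5).
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    quantized = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)  # clip to the format's max
        # Round to the nearest representable e2m1 magnitude.
        q = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        quantized.append(math.copysign(q, v))
    return exp, quantized

def dequantize(exp, quantized):
    """Recover approximate values: element * shared power-of-two scale."""
    scale = 2.0 ** exp
    return [q * scale for q in quantized]
```

Values that land exactly on the scaled e2m1 grid round-trip losslessly; everything else incurs the usual rounding error of a 4-bit format.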
@@ -241,17 +241,21 @@ Example usage, after installing the latest AMD Quark release:
```bash
vllm serve fxmarty/qwen_1.5-moe-a2.7b-mxfp4 --tensor-parallel-size 1
# or, for a model using fp6 activations and fp4 weights:
vllm serve fxmarty/qwen1.5_moe_a2.7b_chat_w_fp4_a_fp6_e2m3 --tensor-parallel-size 1
```
A simulation of the matrix multiplication execution in MXFP4/MXFP6 can be run on devices that do not support OCP MX operations natively (e.g. AMD Instinct MI325, MI300 and MI250), dequantizing weights from FP4/FP6 to half precision on the fly using a fused kernel. This is useful, for example, to evaluate FP4/FP6 models with vLLM, or to benefit from the ~2.5-4x memory savings compared to float16 and bfloat16.
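The quoted savings can be sanity-checked with back-of-the-envelope arithmetic, assuming the standard OCP MX layout of one 8-bit (E8M0) shared scale per 32 elements. This sketch ignores tensors left unquantized in practice, so real end-to-end savings are somewhat lower.

```python
def mx_bits_per_element(element_bits, block_size=32, scale_bits=8):
    """Effective storage cost per element: the element itself plus its
    amortized share of the per-block scale."""
    return element_bits + scale_bits / block_size

for name, bits in [("mxfp4", 4), ("mxfp6", 6)]:
    saving = 16 / mx_bits_per_element(bits)
    print(f"{name}: {saving:.2f}x vs a 16-bit format")
# mxfp4: 3.76x, mxfp6: 2.56x -- matching the ~2.5-4x range above.
```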
To generate offline models quantized with the MXFP4 data type, the easiest approach is to use AMD Quark's [quantization script](https://quark.docs.amd.com/latest/pytorch/example_quark_torch_llm_ptq.html), for example:
```bash
python quantize_quark.py --model_dir Qwen/Qwen1.5-MoE-A2.7B-Chat \
--quant_scheme w_mxfp4_a_mxfp4 \
--output_dir qwen_1.5-moe-a2.7b-mxfp4 \
--skip_evaluation \
--model_export hf_format \
--group_size 32
```
The current integration supports [all combinations of FP4, FP6_E3M2 and FP6_E2M3](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/ocp_mx_utils.py) for either weights or activations. Moreover, some target hardware, such as AMD Instinct MI350/MI355, supports mixed-precision GEMM, for example using FP6 for activations and FP4 for weights.
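For reference, the element formats involved can be enumerated directly from their (exponent bits, mantissa bits, bias) definitions in the OCP MX spec, which define no infinities or NaNs. This illustrative helper (not part of vLLM) reproduces each format's maximum representable magnitude: 6.0 for FP4 e2m1, 7.5 for FP6 e2m3, and 28.0 for FP6 e3m2.

```python
def fp_values(ebits, mbits, bias):
    """Enumerate the non-negative representable values of a tiny float
    format with no infinities or NaNs (as in the OCP MX element formats)."""
    vals = set()
    for e in range(2 ** ebits):
        for m in range(2 ** mbits):
            if e == 0:  # subnormal: 2**(1 - bias) * (m / 2**mbits)
                vals.add(2.0 ** (1 - bias) * (m / 2 ** mbits))
            else:       # normal: 2**(e - bias) * (1 + m / 2**mbits)
                vals.add(2.0 ** (e - bias) * (1 + m / 2 ** mbits))
    return sorted(vals)

ELEMENT_FORMATS = {
    "fp4_e2m1": fp_values(ebits=2, mbits=1, bias=1),
    "fp6_e2m3": fp_values(ebits=2, mbits=3, bias=1),
    "fp6_e3m2": fp_values(ebits=3, mbits=2, bias=3),
}

for name, values in ELEMENT_FORMATS.items():
    print(f"{name}: {len(values)} magnitudes, max = {max(values)}")
```

Picking one of the three formats for weights and one for activations gives the nine combinations the integration supports.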