Feature/silu block quant fusion v1 (#32996)
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
@@ -45,7 +45,7 @@ The table below lists the quantization schemes supported by each fusion on each
 | `enable_sp` | FP16/BF16, FP8 static† | FP16/BF16, FP8 static | FP16/BF16† | FP16/BF16† | — |
 | `fuse_gemm_comms` | FP16/BF16, FP8 static† | FP16/BF16, FP8 static | FP16/BF16† | FP16/BF16† | — |
 | `fuse_norm_quant` | FP8 static, FP8 per-token, FP8 per-group | FP8 static, FP8 per-token, FP8 per-group | FP8 static, FP8 per-token, FP8 per-group | — | FP8 static, FP8 per-token, FP8 per-group |
-| `fuse_act_quant` | FP8 static, NVFP4 | FP8 static | FP8 static | — | FP8 per-group |
+| `fuse_act_quant` | FP8 static, NVFP4 | FP8 static, FP8 per-group (128/64) | FP8 static, FP8 per-group (128/64) | — | FP8 per-group |
 | `fuse_act_padding` | — | — | — | — | FP16/BF16 |
 
 \* `fuse_attn_quant` support depends on the attention backend in use; not all backends support
@@ -305,6 +305,7 @@ Note that AITER fusions are in a separate pass in `vllm.compilation.passes.fusio
 Supported quantization scheme/hardware combinations:
 
 - FP8 static per-tensor: CUDA & HIP kernel
+- FP8 dynamic per-group (128/64): CUDA kernel (sm89+, not active when DeepGemm is used on sm100+)
 - NVFP4 dynamic: CUDA sm100+ only with FlashInfer
 - FP8 per-token-group (128): ROCm AITER only
 
@@ -313,6 +314,7 @@ Supported quantization scheme/hardware combinations:
 - Pass: [`vllm/compilation/passes/fusion/act_quant_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/act_quant_fusion.py)
 - ROCm AITER pass: [`vllm/compilation/passes/fusion/rocm_aiter_fusion.py`](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/passes/fusion/rocm_aiter_fusion.py)
 - CUDA/HIP kernels: [`csrc/quantization/`](https://github.com/vllm-project/vllm/blob/main/csrc/quantization/)
+- Fused SiLU+Mul+BlockQuant kernel: [`csrc/quantization/fused_kernels/fused_silu_mul_block_quant.cu`](https://github.com/vllm-project/vllm/blob/main/csrc/quantization/fused_kernels/fused_silu_mul_block_quant.cu)
 
 ### RMSNorm + Padding (`fuse_act_padding`)
 
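For readers unfamiliar with the operation this change fuses, the following is a minimal reference sketch of SiLU+Mul followed by dynamic per-group quantization. It is not the kernel's implementation: the function name `silu_mul_block_quant`, the `[gate | up]` input layout, the `1e-10` floor on the group maximum, and the use of 448 as the FP8 E4M3 range are assumptions made for illustration, and the values are only scaled and clamped, not rounded to real FP8 bit patterns.

```python
import math

# Assumption: FP8 E4M3 (finite-only variant) has a largest finite value of 448.
FP8_E4M3_MAX = 448.0


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_mul_block_quant(row, group_size=128):
    """Reference semantics for one token (hypothetical helper, not vLLM API).

    `row` is assumed to hold the gate half followed by the up half.
    Returns (scaled_values, scales) with one scale per group of `group_size`.
    """
    half = len(row) // 2
    if len(row) != 2 * half or half % group_size != 0:
        raise ValueError("hidden size must be a multiple of group_size")

    # Step 1: SiLU on the gate half, multiplied elementwise by the up half.
    act = [silu(g) * u for g, u in zip(row[:half], row[half:])]

    # Step 2: per-group dynamic scale, so each group fills the FP8 range.
    scaled, scales = [], []
    for start in range(0, half, group_size):
        group = act[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = max(amax, 1e-10) / FP8_E4M3_MAX  # floor avoids division by zero
        scales.append(scale)
        # Clamp to the representable range; a real kernel would also round to FP8.
        scaled.extend(
            max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in group
        )
    return scaled, scales
```

Calling `silu_mul_block_quant(row, group_size=128)` or `group_size=64` mirrors the two group sizes listed in the table above; multiplying each returned value by its group's scale recovers the unquantized activation up to floating-point error. An unfused pipeline would write the intermediate activation to memory between the two steps, which is the round trip a fused kernel is meant to avoid.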