[Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors (#32257)

Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Tomer Natan <tbarnatan@ipp1-1429.ipp1a1.colossus.nvidia.com>
Author: TomerBN-Nvidia
Date: 2026-01-16 02:15:05 +02:00
Committed by: GitHub
Parent: aca5c51487
Commit: c277fbdf31
17 changed files with 226 additions and 127 deletions


@@ -637,6 +637,7 @@ class Fp8MoEMethod(FusedMoEMethodBase):
             block_quant=self.block_quant,
             tp_size=layer.moe_parallel_config.tp_size,
             with_lora_support=self.moe.is_lora_enabled,
+            is_act_and_mul=self.moe.is_act_and_mul,
         )
         if self.fp8_backend == Fp8MoeBackend.FLASHINFER_CUTLASS:
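The `is_act_and_mul` flag distinguishes gated experts (e.g. SwiGLU, where the first projection packs a gate and an up branch and the hidden state is `silu(gate) * up`) from non-gated experts, which apply the activation directly to a single projection. The sketch below illustrates that distinction with a toy dense expert MLP; the function names and shapes are illustrative only and are not vLLM's actual fused-MoE kernel API.

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def expert_mlp(x, w1, w2, is_act_and_mul=True):
    """Toy single-expert MLP (illustrative, not vLLM's API).

    is_act_and_mul=True  (gated): w1 has shape (d, 2*i), packing the gate
        and up projections; hidden = silu(gate) * up.
    is_act_and_mul=False (non-gated): w1 has shape (d, i); hidden = silu(x @ w1).
    """
    h = x @ w1
    if is_act_and_mul:
        # Split the packed projection into gate and up halves.
        gate, up = np.split(h, 2, axis=-1)
        h = silu(gate) * up
    else:
        h = silu(h)
    return h @ w2

# Gated expert: first weight is twice as wide as the intermediate size.
x = np.random.randn(2, 4)
out_gated = expert_mlp(x, np.random.randn(4, 6), np.random.randn(3, 4),
                       is_act_and_mul=True)
# Non-gated expert: a single (d, i) projection.
out_plain = expert_mlp(x, np.random.randn(4, 3), np.random.randn(3, 4),
                       is_act_and_mul=False)
```

Prior to this change the quantized MoE paths listed in the title assumed the gated (act-and-mul) layout; plumbing the flag through lets the same kernels serve models whose experts use a plain activation.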