nvfp4-megamoe-kernel

Files

biondizzle e1a642452a Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel

1. DeepseekV4MLAAttention.__init__ had a hard assertion that the
   attention backend MUST be FlashMLA. FlashMLA doesn't work on
   Blackwell, so attention is routed through
   _attention_impl_blackwell() instead. Added an _is_blackwell flag
   that skips the assertion and the FlashMLA-specific init
   (fp8_ds_mla cache format conversion).

2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml
   to force CuTeDSL kernel selection for NVFP4 linear layers.

3. Updated register_cutedsl_kernel.py to also register CuTeDSL in
   the _NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).
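
The env var override path in points 2 and 3 can be pictured with a minimal sketch. It is not the actual vLLM or repo code: the kernel classes, the "cutlass" default, and the select_nvfp4_kernel() helper are hypothetical stand-ins, and only the names VLLM_NVFP4_GEMM_BACKEND, cutedsl, and _NVFP4_BACKEND_TO_KERNEL come from the commit message. It assumes the dict maps a backend name to a kernel class.

```python
import os


# Hypothetical stand-ins for the real kernel classes.
class CutlassKernel:
    pass


class CuteDSLKernel:
    pass


# Illustrative mirror of the dict named in the commit; the "cutlass"
# entry and the dict's real contents are assumptions.
_NVFP4_BACKEND_TO_KERNEL = {"cutlass": CutlassKernel}


def register_cutedsl_kernel(registry):
    """Add the CuTeDSL kernel so the env var override can find it."""
    registry["cutedsl"] = CuteDSLKernel


def select_nvfp4_kernel(registry, default="cutlass", env=None):
    """Pick a kernel class by name, preferring the env var if it is set."""
    env = os.environ if env is None else env
    name = env.get("VLLM_NVFP4_GEMM_BACKEND", default).lower()
    if name not in registry:
        # Without the registration step, "cutedsl" would land here.
        raise ValueError(f"unknown NVFP4 GEMM backend: {name!r}")
    return registry[name]


if __name__ == "__main__":
    register_cutedsl_kernel(_NVFP4_BACKEND_TO_KERNEL)
    kernel = select_nvfp4_kernel(
        _NVFP4_BACKEND_TO_KERNEL,
        env={"VLLM_NVFP4_GEMM_BACKEND": "cutedsl"},
    )
    print(kernel.__name__)
```

In this sketch, setting the variable without registering the kernel first fails the lookup, which is presumably why the commit pairs the docker-compose.yml change with the update to register_cutedsl_kernel.py.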

2026-05-19 08:19:23 +00:00

kernels/linear/nvfp4

Fix OOM: add --max-model-len=876544 + revert CPU dummy weight

2026-05-19 07:35:43 +00:00

patches

Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel

2026-05-19 08:19:23 +00:00

cutedsl_quant_method.py

Fix OOM: add --max-model-len=876544 + revert CPU dummy weight

2026-05-19 07:35:43 +00:00

nvfp4_cutedsl.py

Replace autograd.Function with torch.library.custom_op for Dynamo compat

2026-05-19 01:54:48 +00:00