nvfp4-megamoe-kernel

Files

biondizzle 334e95047e Fix: dequantize ALL attention NVFP4 projections to BF16

Root cause of the NaNs appearing from layer 0 onward:
FlashInferCutlassNvFp4LinearKernel quantizes activations with the
checkpoint's input_scale, which produces NaN immediately. Fix:
dequantize all attention NVFP4 weights (wq_a, wq_b, wkv, wo_a, wo_b)
to BF16 at load time, bypassing the broken input_scale entirely. This
reuses the existing _dequant_nvfp4_to_bf16 method.
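
For readers unfamiliar with the format, the load-time dequantization described above can be illustrated with a minimal pure-Python sketch. This is not the repository's _dequant_nvfp4_to_bf16 method, whose body is not shown here. The function names, the low-nibble-first packing order, and the convention that the per-tensor global scale multiplies the per-block scale are all assumptions made for illustration. Block scales are taken as already decoded from FP8 to floats, and the final rounding to BF16 is omitted.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across each block of 16 elements


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequant_nvfp4(packed, block_scales, global_scale):
    """Hypothetical helper: expand packed NVFP4 codes to floats.

    packed       -- bytes, two 4-bit codes per byte (low nibble first: assumed)
    block_scales -- one float per 16 elements (already decoded from FP8)
    global_scale -- per-tensor float, assumed to multiply the block scale
    """
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            scale = block_scales[len(out) // BLOCK_SIZE] * global_scale
            out.append(decode_e2m1(code) * scale)
    return out
```

Once the weights are expanded this way, the projection can run as an ordinary BF16 matmul, so no activation quantization (and therefore no input_scale) is involved.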

This trades memory for correctness: the attention weights are held in
BF16 instead of 4-bit NVFP4. Future optimization: add a warmup pass
for the attention input_global_scale_inv (same as the MoE warmup).
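
A warmup of the kind mentioned above could look roughly like the following sketch, which derives a global activation scale from the largest magnitude observed over a few warmup batches instead of trusting the checkpoint value. The class and method names are hypothetical, and the formula (E4M3 max times E2M1 max, divided by the observed amax) is a commonly used NVFP4 convention rather than something confirmed for this repository. Which of the two returned quantities corresponds to input_global_scale_inv is also an assumption left open here.

```python
import math

E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (block-scale dtype)


class AmaxCalibrator:
    """Hypothetical helper: track the running max |x| over warmup batches."""

    def __init__(self):
        self.amax = 0.0

    def observe(self, activations):
        # Skip NaN/inf so one bad value cannot poison the calibrated scale.
        for v in activations:
            if math.isfinite(v):
                self.amax = max(self.amax, abs(v))

    def quant_scale(self):
        """Scale applied before quantizing (assumed convention)."""
        if self.amax == 0.0:
            raise ValueError("no finite non-zero activations observed")
        return (E2M1_MAX * E4M3_MAX) / self.amax

    def dequant_scale(self):
        """Reciprocal of quant_scale, applied after the matmul."""
        return 1.0 / self.quant_scale()
```

In a real kernel the reduction would run on the GPU over tensors; the scalar loop here only shows the arithmetic.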

2026-05-18 13:09:36 +00:00

patches

Fix: dequantize ALL attention NVFP4 projections to BF16

2026-05-18 13:09:36 +00:00

nvfp4_cutedsl.py

HOTFIX: remove NaN checks from run() — torch.isnan().any() does CPU-GPU sync, breaks cudagraph

2026-05-17 22:28:32 +00:00