nvfp4-megamoe-kernel

Files

biondizzle 0b6ca0df80 P5 integration + B3 q_a_norm fused + gsa scalar fix

P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel
- Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches)
- With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4)
- Both attn and ffn mHC paths now use P5 fused kernel
- Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token
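
The savings figure above is simple multiplication, sketched here as a sanity check. The helper name and parameter names are hypothetical and not part of the repo; the inputs are the approximate counts quoted in the bullets.

```python
# Hypothetical helper, not part of the repo: recomputes the launch savings
# quoted in the commit message from its three approximate factors.
def launches_saved_per_token(saved_per_site: int,
                             sites_per_layer: int,
                             num_layers: int) -> int:
    """Kernel launches saved per token across all layers."""
    return saved_per_site * sites_per_layer * num_layers


# ~5 launches saved per mHC site, 2 sites per layer (attn and ffn), 61 layers.
p5_saved = launches_saved_per_token(saved_per_site=5,
                                    sites_per_layer=2,
                                    num_layers=61)
print(p5_saved)  # 610
```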

B3: Fused rmsnorm+quant for q_a_norm → q_b path
- q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized
- Eliminates BF16 round-trip between q_a_norm and q_b GEMM
- Saves: ~4 kernel launches per layer (rmsnorm 4+ plus quantize 2, i.e. 6+ unfused, vs. 2 fused)

gsa scalar fix in Nvfp4Linear.run_from_quantized:
- CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,))
- Per-row gsa from fused kernels must be reduced to scalar (max) for M>1
- For M=1 decode: already scalar, no reduction needed
- Fixes potential correctness issue at prefill (M>1) when using fused paths
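
A minimal sketch of the reduction described above, in plain Python so it runs without a GPU. The function name is hypothetical, and lists stand in for tensors; the actual code in Nvfp4Linear.run_from_quantized presumably operates on device tensors and may differ in detail.

```python
from typing import List


# Hypothetical stand-in for the reduction in Nvfp4Linear.run_from_quantized.
# `gsa` holds one global scale per row (length M); the GEMM is described as
# expecting a single scalar of shape (1,).
def reduce_gsa_to_scalar(gsa: List[float]) -> List[float]:
    """Collapse per-row global scales to one scalar using max."""
    if not gsa:
        raise ValueError("gsa must contain at least one scale")
    if len(gsa) == 1:
        # M=1 decode: already scalar, nothing to reduce.
        return list(gsa)
    # M>1 prefill: take the max over rows, as the commit message states.
    return [max(gsa)]


print(reduce_gsa_to_scalar([0.5]))              # [0.5]
print(reduce_gsa_to_scalar([0.25, 1.0, 0.5]))   # [1.0]
```

With real tensors the same step would be a single max reduction over the row dimension, kept on device to avoid a host sync.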

Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)

2026-06-02 21:20:34 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

grouped_linear.py

P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs

2026-06-01 22:21:12 +00:00

linear.py

P5 integration + B3 q_a_norm fused + gsa scalar fix

2026-06-02 21:20:34 +00:00

mhc.py

Fix mHC Sinkhorn kernel: remove VLA, remove Python fallback