nvfp4-megamoe-kernel/STAGE_D_REMAINING.md
biondizzle 80b6b79f9e NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels
- fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid,
  NaN guard for exp=15/mant=7, mantissa overflow handling)
- fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32
- half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7)
- quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack
- Verified 0/500 trial failures against Python reference
- Key fixes discovered during validation:
  1. FP8 E4M3 bias is 7, NOT 8
  2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid)
  3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024)
  4. Round-to-nearest-even (not round-half-up) for half_step and mantissa
  5. Mantissa overflow (round to 8) must increment exponent
2026-05-28 03:39:55 +00:00
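The five fixes listed in the commit message can be checked against a small host-side reference. The following is a minimal pure-Python sketch, not the CuTeDSL code from the commit: the function names mirror the commit's for readability, and saturating out-of-range inputs to ±448 is an assumption that the source does not state.

```python
import math

def fp8_e4m3_to_float32(bits: int) -> float:
    """Decode an E4M3 bit pattern (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 15 and mant == 7:
        return math.nan                   # only exp=15/mant=7 is NaN
    if exp == 0:
        return sign * mant / 512.0        # subnormal: m * 2^(-9)
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

def fp8_e4m3_from_float32(x: float) -> int:
    """Encode a float as E4M3 with round-to-nearest-even."""
    if math.isnan(x):
        return 0x7F
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    a = abs(x)
    if a >= 448.0:
        return sign | 0x7E                # ASSUMPTION: saturate to the max finite value
    if a < 2.0 ** -6:
        # Subnormal range. Python's round() is round-half-to-even.
        # m == 8 yields bits 0b0001000, i.e. the smallest normal (2^-6).
        return sign | round(a * 512.0)
    frac, e = math.frexp(a)               # a = frac * 2**e, frac in [0.5, 1)
    exp = (e - 1) + 7                     # biased exponent
    mant = round((frac * 2.0 - 1.0) * 8.0)
    if mant == 8:                         # mantissa overflow: bump the exponent
        mant = 0
        exp += 1
    return sign | (exp << 3) | mant
```

Round-tripping every non-NaN bit pattern through both functions is a cheap sanity check before comparing against the device kernels.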


Here's what remains in those two archived plans:


STAGE_D.md — Remaining Items

NVFP4-0.1 through NVFP4-0.4 (Diagnostics) — ALL DONE

All four print-only diagnostics passed. sf_dtype=E4M3, TMA element type correct, MMA kind correct. No action needed.

NVFP4-3 (use_2cta_instrs) — DONE

Conditional use_2cta_instrs added. 1.7–1.9× prefill speedup. Merged.

NVFP4-1.1 (Fuse FP4 quant into SwiGLU epilogue) — NOT DONE

There is still a separate quantize_activation_nvfp4 kernel launch between L1 and L2. The amax reduction and the FP4 pack should happen in registers in the SwiGLU epilogue, eliminating the BF16 GMEM materialization. No blockers; independent of FMHA. Estimated effort: 1 day.
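For orientation, the amax + FP4 pack step can be written as a host-side reference. This is a minimal sketch under stated assumptions, not the kernel code: it uses the 16-element NVFP4 block size, keeps the block scale as a Python float instead of casting it to E4M3, and packs the even-indexed element into the low nibble. The helper names `quantize_block_nvfp4` and `dequantize_block_nvfp4` are hypothetical.

```python
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes for codes 0-7

def quantize_e2m1_nibble(x: float) -> int:
    """Return a 4-bit code: bit 3 is the sign, bits 0-2 index E2M1_VALUES."""
    sign = 0x8 if x < 0 else 0x0
    a = min(abs(x), 6.0)                  # clamp to the largest E2M1 magnitude
    best = 0
    for idx in range(1, 8):
        d_best = abs(a - E2M1_VALUES[best])
        d_new = abs(a - E2M1_VALUES[idx])
        # Nearest value wins; an exact tie goes to the even code (mantissa bit 0).
        if d_new < d_best or (d_new == d_best and idx % 2 == 0):
            best = idx
    return sign | best

def quantize_block_nvfp4(block):
    """Quantize one block of 16 floats to (scale, 8 packed bytes)."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0   # ASSUMPTION: scale left in full precision
    codes = [quantize_e2m1_nibble(v / scale) for v in block]
    # ASSUMPTION: even element in the low nibble, odd element in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return scale, packed

def dequantize_block_nvfp4(scale, packed):
    """Inverse of quantize_block_nvfp4, for checking round-trips."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 0x7]
            out.append((-mag if code & 0x8 else mag) * scale)
    return out
```

In the fused version this per-block work would run on the SwiGLU outputs while they are still in registers, so only the packed bytes and scales would reach GMEM.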

NVFP4-1.2 (Fuse FP4 quant into invRoPE→wo_a) — NOT DONE

inverse_rope_bf16 produces BF16, then wo_a quantizes. Should fuse FP4 pack into the inverse RoPE epilogue. Blocked on Priority 2 (one-way final epilogue rewrite) — needs the register slot in the new FMHA epilogue.

NVFP4-1.3 (Fuse FP4 quant into mHC mixing) — NOT DONE

mHC post_block (B_l @ X_l + C_l ⊗ F_out) lands in BF16. Should fuse FP4 quant so attention/FFN GEMMs read FP4 directly. Blocked on having the mHC mixing kernel built with FP4 epilogue support.

NVFP4-2 (FP4 KV pipeline depth) — NOT DONE

FP4 KV in SMEM with dequant → deeper pipeline stages. Blocked on Priority 2 and BF16 KV being solid first.

D1.5 (in-kernel O rescale) — CLOSED

TMEM round-trip is fundamentally broken. Python KV merge is the production path. Listed in the plan but already resolved per MEMORY.md.
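The source does not show the Python KV merge itself. As a minimal sketch, assuming it is the standard log-sum-exp-weighted combination of per-chunk attention outputs, the merge for a single query row looks like this (`attention_row` and `merge_partials` are hypothetical names, not functions from the repository):

```python
import math

def attention_row(scores, values):
    """Softmax-weighted sum of value vectors for one query row; returns (out, lse)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)

def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two normalized partial outputs computed over disjoint KV chunks."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    w_a = math.exp(lse_a - lse)           # share of the softmax mass in chunk A
    w_b = math.exp(lse_b - lse)           # share of the softmax mass in chunk B
    return [w_a * a + w_b * b for a, b in zip(out_a, out_b)], lse

# Splitting the KV range in two and merging should match the unsplit result.
scores = [0.1, 2.0, -1.0, 0.5]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
full_out, full_lse = attention_row(scores, values)
out_a, lse_a = attention_row(scores[:2], values[:2])
out_b, lse_b = attention_row(scores[2:], values[2:])
merged_out, merged_lse = merge_partials(out_a, lse_a, out_b, lse_b)
```

Because the rescale happens on the host after each chunk finishes, this path needs only the per-row LSE from the kernel and no in-kernel O rescale.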

D1.4 (hd=512) — BLOCKED

MLIR compilation hang. Same as ROADMAP Priority 9.


STAGE_D2.md — Remaining Items

D2 Per-head launch + Head-packed — DONE

Per-head launch works (n_h=1–128, cos 0.999995). Head-packed M dimension works. MQA/GQA in production.py.

D2 Multi-CTA grid — BLOCKED

flat_divide + epilogue_tma_store layout mismatch. Requires full refactor of tma_partition + epilogue into the kernel. Blocked on Priority 2 (one-way final epilogue rewrite). The CUTLASS reference uses flat_divide + tma_partition inside the kernel with direct TMA bulk copy — no epilogue_tma_store.

D2.1 (num_query_heads/batch in constructor) — ⚠️ PARTIAL

Added as constructor params, but the grid is still a per-head Python loop, not multi-CTA.

D2.9 (LSE for multi-head) — DONE

Per-row LSE verified, row_sums output working.


Summary: What's Actually Left (Unblocked, Actionable)

| Item | Source | Status | Effort | Blocker |
|---|---|---|---|---|
| NVFP4-1.1: FP4 quant in SwiGLU epilogue | STAGE_D | Not done | ~1 day | None. Independent. |
| NVFP4-1.2: FP4 in invRoPE→wo_a | STAGE_D | Not done | ~1 day | Priority 2 |
| NVFP4-1.3: FP4 in mHC mixing | STAGE_D | Not done | ~2 days | mHC kernel |
| NVFP4-2: FP4 KV pipeline | STAGE_D | Not done | ~1 day | Priority 2 + BF16 KV solid |
| D2 Multi-CTA grid | STAGE_D2 | Blocked | 1–2 days | Priority 2 |

NVFP4-1.1 is the only unblocked, independent, high-impact item. Pure MoE-side, no FMHA dependency, eliminates a kernel launch and halves GMEM bandwidth between L1 and L2. That's the easy problem.