nvfp4-megamoe-kernel/STAGE_D_REMAINING.md at 0ecb98daeeecb3b86bbade7aeae5152db4d84c82

biondizzle 80b6b79f9e NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels

- fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid,
  NaN guard for exp=15/mant=7, mantissa overflow handling)
- fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32
- half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7)
- quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack
- Verified 0/500 trial failures against Python reference
- Key fixes discovered during validation:
  1. FP8 E4M3 bias is 7, NOT 8
  2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid)
  3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024)
  4. Round-to-nearest-even (not round-half-up) for half_step and mantissa
  5. Mantissa overflow (round to 8) must increment exponent

2026-05-28 03:39:55 +00:00
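The FP8 E4M3 rules listed in the commit message above (bias 7, NaN only at exp=15/mant=7, subnormals equal to m/512, round-to-nearest-even, mantissa overflow carrying into the exponent) can be checked with a small host-side reference. The sketch below is not the CuTeDSL code from the commit. The names `ref_fp8_e4m3_decode` and `ref_fp8_e4m3_encode` are hypothetical, and saturating out-of-range inputs to ±448 is an assumption.

```python
import math

def ref_fp8_e4m3_decode(bits: int) -> float:
    """Decode one FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 15 and mant == 7:
        return math.nan                    # the only NaN pattern; there is no infinity
    if exp == 0:
        return sign * mant / 512.0         # subnormal: m * 2^-9
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)

def ref_fp8_e4m3_encode(x: float) -> int:
    """Encode a float to an FP8 E4M3 byte with round-to-nearest-even."""
    if math.isnan(x):
        return 0x7F
    sign = 0x80 if math.copysign(1.0, x) < 0.0 else 0x00
    a = abs(x)
    if a >= 448.0:
        return sign | 0x7E                 # assumption: saturate to the max finite value
    if a == 0.0:
        return sign
    frac, e = math.frexp(a)                # a == frac * 2**e with 0.5 <= frac < 1
    exp = e - 1 + 7                        # biased exponent of 1.xxx * 2**(e-1)
    if exp <= 0:
        # Subnormal range: count units of 2^-9. A result of 8 is exactly the
        # bit pattern of the smallest normal (exp=1, mant=0), so no fix-up is needed.
        return sign | round(a * 512.0)     # Python's round() rounds ties to even
    mant = round((frac * 2.0 - 1.0) * 8.0)
    if mant == 8:                          # mantissa overflow: carry into the exponent
        mant, exp = 0, exp + 1
    return sign | (exp << 3) | mant
```

With this sketch, `ref_fp8_e4m3_encode(1.0)` gives `0x38` and `ref_fp8_e4m3_decode(0x01)` gives `1/512`. Every non-NaN byte survives a decode/encode round trip, which is a cheap sanity check to run before comparing against a device kernel.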

Here's what remains in the two archived plans, STAGE_D.md and STAGE_D2.md:

STAGE_D.md — Remaining Items

NVFP4-0.1 through NVFP4-0.4 (Diagnostics) — ✅ ALL DONE

All four print-only diagnostics passed. sf_dtype=E4M3, TMA element type correct, MMA kind correct. No action needed.

NVFP4-3 (use_2cta_instrs) — ✅ DONE

Conditional use_2cta_instrs added. 1.7–1.9× prefill speedup. Merged.

NVFP4-1.1 (Fuse FP4 quant into SwiGLU epilogue) — ❌ NOT DONE

The pipeline still launches a separate quantize_activation_nvfp4 kernel between L1 and L2. The amax reduction and FP4 pack should instead happen in registers in the SwiGLU epilogue, eliminating the BF16 GMEM materialization. No blockers. Independent of FMHA. Estimated effort: 1 day.
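To make the "amax + FP4 pack" step concrete, here is a minimal host-side sketch of quantizing one block of activations to E2M1 nibbles. It is not the epilogue code. It assumes a 16-element block, a scale of amax/6 kept as a plain Python float (NVFP4 would store it as E4M3), ties rounded to the even E2M1 code, and the even-indexed element in the low nibble of each byte. All function names are hypothetical.

```python
# Magnitudes representable in E2M1, indexed by the 3-bit code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def ref_e2m1_index(mag: float) -> int:
    """Nearest E2M1 code for a non-negative magnitude; ties go to the even code."""
    best = 0
    for idx in range(1, 8):
        d_best = abs(mag - E2M1_VALUES[best])
        d_new = abs(mag - E2M1_VALUES[idx])
        if d_new < d_best or (d_new == d_best and idx % 2 == 0):
            best = idx
    return best                            # magnitudes above 6.0 saturate to code 7

def ref_quantize_block(block):
    """Quantize a block of floats to (scale, packed bytes), two nibbles per byte."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0.0 else 1.0   # map amax onto the largest E2M1 value
    nibbles = []
    for v in block:
        sign = 0x8 if v < 0.0 else 0x0
        nibbles.append(sign | ref_e2m1_index(abs(v) / scale))
    # Assumption: even-indexed element in the low nibble, odd-indexed in the high nibble.
    packed = bytes(nibbles[i] | (nibbles[i + 1] << 4) for i in range(0, len(nibbles), 2))
    return scale, packed

def ref_dequantize_block(scale, packed):
    """Inverse of ref_quantize_block, for checking round trips."""
    out = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[nib & 0x7] * scale
            out.append(-mag if nib & 0x8 else mag)
    return out
```

In a fused epilogue the same arithmetic would run on register-resident SwiGLU outputs, so the BF16 values never need to reach GMEM; a host reference like this one is only useful for validating the packed bytes.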

NVFP4-1.2 (Fuse FP4 quant into invRoPE→wo_a) — ❌ NOT DONE

inverse_rope_bf16 produces BF16, then wo_a quantizes. Should fuse FP4 pack into the inverse RoPE epilogue. Blocked on Priority 2 (one-way final epilogue rewrite) — needs the register slot in the new FMHA epilogue.

NVFP4-1.3 (Fuse FP4 quant into mHC mixing) — ❌ NOT DONE

mHC post_block (B_l @ X_l + C_l ⊗ F_out) lands in BF16. Should fuse FP4 quant so attention/FFN GEMMs read FP4 directly. Blocked on having the mHC mixing kernel built with FP4 epilogue support.

NVFP4-2 (FP4 KV pipeline depth) — ❌ NOT DONE

FP4 KV in SMEM with dequant → deeper pipeline stages. Blocked on Priority 2 and BF16 KV being solid first.

D1.5 (in-kernel O rescale) — ❌ CLOSED

TMEM round-trip is fundamentally broken. Python KV merge is the production path. Listed in the plan but already resolved per MEMORY.md.
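For readers unfamiliar with merging attention partials on the host, the sketch below shows the standard log-sum-exp (LSE) merge of two partial outputs computed over disjoint KV chunks. It assumes each partial is already normalized within its chunk and comes with a per-row LSE. Whether the production Python KV merge has this exact form is an assumption, and `attend` and `merge_partials` are hypothetical names.

```python
import math

def attend(q, keys, values):
    """Single-row softmax attention; returns (output, log-sum-exp of the scores)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)

def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two normalized partial outputs using their LSE values."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    w_a = math.exp(lse_a - lse)            # share of the softmax mass in chunk A
    w_b = math.exp(lse_b - lse)            # share of the softmax mass in chunk B
    merged = [w_a * a + w_b * b for a, b in zip(out_a, out_b)]
    return merged, lse

# Usage: attention over the full KV equals the merge of two half-KV partials.
q = [0.3, -1.2, 0.7]
keys = [[1.0, 0.0, 2.0], [0.5, -0.5, 0.1], [-1.0, 2.0, 0.3], [0.2, 0.2, -0.9]]
values = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]
full_out, full_lse = attend(q, keys, values)
out_a, lse_a = attend(q, keys[:2], values[:2])
out_b, lse_b = attend(q, keys[2:], values[2:])
merged_out, merged_lse = merge_partials(out_a, lse_a, out_b, lse_b)
```

Because the merge needs only each chunk's output and LSE, it can run outside the kernel, which avoids the in-kernel rescale of O entirely.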

D1.4 (hd=512) — ❌ BLOCKED

MLIR compilation hang. Same as ROADMAP Priority 9.

STAGE_D2.md — Remaining Items

D2 Per-head launch + Head-packed — ✅ DONE

Per-head launch works (n_h=1–128, cos 0.999995). Head-packed M dimension works. MQA/GQA in production.py.

D2 Multi-CTA grid — ❌ BLOCKED

flat_divide + epilogue_tma_store layout mismatch. Requires full refactor of tma_partition + epilogue into the kernel. Blocked on Priority 2 (one-way final epilogue rewrite). The CUTLASS reference uses flat_divide + tma_partition inside the kernel with direct TMA bulk copy — no epilogue_tma_store.

D2.1 (num_query_heads/batch in constructor) — ⚠️ PARTIAL

Added as params but grid is still per-head Python loop, not multi-CTA.

D2.9 (LSE for multi-head) — ✅ DONE

Per-row LSE verified, row_sums output working.

Summary: What's Actually Left

| Item | Source | Status | Effort | Blocker |
|---|---|---|---|---|
| NVFP4-1.1 — FP4 quant in SwiGLU epilogue | STAGE_D | ❌ Not done | ~1 day | None. Independent. |
| NVFP4-1.2 — FP4 in invRoPE→wo_a | STAGE_D | ❌ | ~1 day | Priority 2 |
| NVFP4-1.3 — FP4 in mHC mixing | STAGE_D | ❌ | ~2 days | mHC kernel |
| NVFP4-2 — FP4 KV pipeline | STAGE_D | ❌ | ~1 day | Priority 2 + BF16 KV solid |
| D2 Multi-CTA grid | STAGE_D2 | ❌ | 1–2 days | Priority 2 |

NVFP4-1.1 is the only unblocked, independent, high-impact item. Pure MoE-side, no FMHA dependency, eliminates a kernel launch and halves GMEM bandwidth between L1 and L2. That's the easy problem.
