- fp8_e4m3_from_float32: manual FP8 E4M3 cast (bias=7, exp 0-15 valid, NaN guard for exp=15/mant=7, mantissa overflow handling)
- fp8_e4m3_to_float32: dequantize FP8 E4M3 bit pattern back to Float32
- half_step_to_e2m1_idx: E2M1 step mapping (0-12 → 0-7)
- quantize_e2m1_nibble: per-element E2M1 quantize + sign + pack
- Verified 0/500 trial failures against Python reference
- Key fixes discovered during validation:
  1. FP8 E4M3 bias is 7, NOT 8
  2. Exponent range is 0-15 (exp=15/mant=7 is NaN; others valid)
  3. Subnormal formula: val = m * 2^(-9) = m/512 (NOT m/1024)
  4. Round-to-nearest-even (not round-half-up) for half_step and mantissa
  5. Mantissa overflow (round to 8) must increment exponent
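The five fixes listed above can be sketched in plain Python. This is a minimal illustration, not the validated implementation: the function names mirror the ones in the list, but the signatures, the saturate-to-448 behaviour on overflow, and the use of Python floats (doubles) in place of Float32 are assumptions made for the example.

```python
import math

E4M3_BIAS = 7          # fix 1: bias is 7, not 8
E4M3_MAX_CODE = 0x7E   # exp=15, mant=6 -> 448.0, the largest finite value
E4M3_NAN_CODE = 0x7F   # exp=15, mant=7 is the only NaN pattern (fix 2)


def fp8_e4m3_to_float32(bits: int) -> float:
    """Decode one FP8 E4M3 bit pattern (0..255) to a Python float."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    mant = bits & 0x7
    if exp == 15 and mant == 7:
        return math.nan
    if exp == 0:
        # fix 3: subnormal value is m * 2^(-9) = m / 512
        return sign * mant / 512.0
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - E4M3_BIAS)


def fp8_e4m3_from_float32(x: float) -> int:
    """Encode a float as FP8 E4M3 with round-to-nearest-even.

    Assumption: out-of-range magnitudes (including infinity) saturate to
    +/-448 instead of producing NaN.
    """
    if math.isnan(x):
        return E4M3_NAN_CODE
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0
    a = abs(x)
    if a >= 448.0:
        return sign | E4M3_MAX_CODE
    if a < 2.0 ** -6:
        # Subnormal range. Python's round() rounds half to even (fix 4).
        # If m rounds up to 8, the bit pattern 0b0001_000 is already the
        # smallest normal number, so the carry needs no special case.
        return sign | round(a * 512.0)
    frac, e = math.frexp(a)               # a = frac * 2**e, 0.5 <= frac < 1
    exp = e - 1 + E4M3_BIAS               # biased exponent of the 1.xxx form
    mant = round((frac * 2.0 - 1.0) * 8.0)
    if mant == 8:
        # fix 5: mantissa overflow carries into the exponent
        mant = 0
        exp += 1
    return sign | (exp << 3) | mant
```

For example, `fp8_e4m3_from_float32(1.96875)` rounds the mantissa up to 8, so the result carries into the exponent and decodes as 2.0.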
Here's what remains in those two archived plans:
STAGE_D.md — Remaining Items
NVFP4-0.1 through NVFP4-0.4 (Diagnostics) — ✅ ALL DONE
All four print-only diagnostics passed. sf_dtype=E4M3, TMA element type correct, MMA kind correct. No action needed.
NVFP4-3 (use_2cta_instrs) — ✅ DONE
Conditional use_2cta_instrs added. 1.7–1.9× prefill speedup. Merged.
NVFP4-1.1 (Fuse FP4 quant into SwiGLU epilogue) — ❌ NOT DONE
Still has a separate quantize_activation_nvfp4 kernel launch between L1 and L2. The amax + FP4 pack should happen in the SwiGLU epilogue registers, eliminating the BF16 GMEM materialization. No blockers. Independent of FMHA. Estimated 1 day.
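As a reference for what the fused epilogue would have to compute, here is a minimal Python sketch of the amax + FP4 pack step applied to one block of post-SwiGLU values. Several details are assumptions for illustration only: the helper names `e2m1_index` and `quantize_block_e2m1`, the scale choice `amax / 6`, the low-nibble-first packing order, and keeping the scale as a Python float instead of encoding it to E4M3. `quantize_e2m1_nibble` reuses the name listed earlier, but its signature here is hypothetical.

```python
# Representable E2M1 magnitudes, indexed by the 3-bit code (0..7).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def e2m1_index(a: float) -> int:
    """Nearest E2M1 code for a magnitude a >= 0.

    Ties go to the even code (round-to-nearest-even); values above 6.0
    saturate to code 7.
    """
    best = 0
    for i in range(1, 8):
        d_best = abs(a - E2M1_VALUES[best])
        d_i = abs(a - E2M1_VALUES[i])
        if d_i < d_best or (d_i == d_best and i % 2 == 0):
            best = i
    return best


def quantize_e2m1_nibble(x: float, scale: float) -> int:
    """Quantize one element to a 4-bit code: sign in bit 3, E2M1 code in bits 0..2."""
    sign = 0x8 if x < 0 else 0
    return sign | e2m1_index(abs(x) / scale)


def quantize_block_e2m1(block):
    """Return (scale, packed_bytes) for an even-length block of floats.

    Assumption: the block scale maps the block amax onto 6.0, the largest
    E2M1 magnitude. A real NVFP4 path would also round this scale to E4M3.
    """
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    nibbles = [quantize_e2m1_nibble(v, scale) for v in block]
    # Assumption: element 2k goes in the low nibble, element 2k+1 in the high nibble.
    packed = bytes(nibbles[i] | (nibbles[i + 1] << 4) for i in range(0, len(nibbles), 2))
    return scale, packed
```

In a fused kernel the same arithmetic would run on register-resident values, so the BF16 tensor never needs to be written to GMEM.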
NVFP4-1.2 (Fuse FP4 quant into invRoPE→wo_a) — ❌ NOT DONE
inverse_rope_bf16 produces BF16, then wo_a quantizes. Should fuse FP4 pack into the inverse RoPE epilogue. Blocked on Priority 2 (one-way final epilogue rewrite) — needs the register slot in the new FMHA epilogue.
NVFP4-1.3 (Fuse FP4 quant into mHC mixing) — ❌ NOT DONE
mHC post_block (B_l @ X_l + C_l ⊗ F_out) lands in BF16. Should fuse FP4 quant so attention/FFN GEMMs read FP4 directly. Blocked on having the mHC mixing kernel built with FP4 epilogue support.
NVFP4-2 (FP4 KV pipeline depth) — ❌ NOT DONE
FP4 KV in SMEM with dequant → deeper pipeline stages. Blocked on Priority 2 and BF16 KV being solid first.
D1.5 (in-kernel O rescale) — ❌ CLOSED
TMEM round-trip is fundamentally broken. Python KV merge is the production path. Listed in the plan but already resolved per MEMORY.md.
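For context, a host-side merge of partial attention outputs is commonly done with per-row log-sum-exp (LSE) values. The sketch below assumes that is what "Python KV merge" refers to; the function names `attend` and `merge_partials` are hypothetical and the code handles a single query row with plain Python lists.

```python
import math


def attend(scores, values):
    """Softmax-weighted sum over one KV chunk for a single query row.

    Returns (output_vector, lse), where lse = log(sum(exp(scores))).
    """
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    out = [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]
    return out, m + math.log(z)


def merge_partials(o_a, lse_a, o_b, lse_b):
    """Combine two normalized partial outputs using their LSE values.

    Each partial output is weighted by exp(lse) of its chunk, which
    recovers the result of a single softmax over both chunks.
    """
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    denom = w_a + w_b
    out = [(w_a * x + w_b * y) / denom for x, y in zip(o_a, o_b)]
    return out, m + math.log(denom)
```

Merging the results of two KV chunks this way should match attention over the concatenated chunks up to floating-point error, which is why no in-kernel O rescale is required.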
D1.4 (hd=512) — ❌ BLOCKED
MLIR compilation hang. Same as ROADMAP Priority 9.
STAGE_D2.md — Remaining Items
D2 Per-head launch + Head-packed — ✅ DONE
Per-head launch works (n_h=1–128, cos 0.999995). Head-packed M dimension works. MQA/GQA in production.py.
D2 Multi-CTA grid — ❌ BLOCKED
flat_divide + epilogue_tma_store layout mismatch. Requires full refactor of tma_partition + epilogue into the kernel. Blocked on Priority 2 (one-way final epilogue rewrite). The CUTLASS reference uses flat_divide + tma_partition inside the kernel with direct TMA bulk copy — no epilogue_tma_store.
D2.1 (num_query_heads/batch in constructor) — ⚠️ PARTIAL
Added as params but grid is still per-head Python loop, not multi-CTA.
D2.9 (LSE for multi-head) — ✅ DONE
Per-row LSE verified, row_sums output working.
Summary: What's Actually Left
| Item | Source | Status | Effort | Blocker |
|---|---|---|---|---|
| NVFP4-1.1 — FP4 quant in SwiGLU epilogue | STAGE_D | ❌ Not done | ~1 day | None. Independent. |
| NVFP4-1.2 — FP4 in invRoPE→wo_a | STAGE_D | ❌ Not done | ~1 day | Priority 2 |
| NVFP4-1.3 — FP4 in mHC mixing | STAGE_D | ❌ Not done | ~2 days | mHC kernel |
| NVFP4-2 — FP4 KV pipeline | STAGE_D | ❌ Not done | ~1 day | Priority 2 + BF16 KV solid |
| D2 Multi-CTA grid | STAGE_D2 | ❌ Blocked | 1–2 days | Priority 2 |
NVFP4-1.1 is the only unblocked, independent, high-impact item. Pure MoE-side, no FMHA dependency, eliminates a kernel launch and halves GMEM bandwidth between L1 and L2. That's the easy problem.