From 0c5a42c0560d76eea04aea9ea56342f760d73e7d Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Sat, 23 May 2026 19:09:56 +0000
Subject: [PATCH] Update STAGE_D.md with current action plan - starting NVFP4-0 verification and D1.3 validation on B200

---
 STAGE_D.md | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/STAGE_D.md b/STAGE_D.md
index e267433a..9748f14b 100644
--- a/STAGE_D.md
+++ b/STAGE_D.md
@@ -473,6 +473,11 @@ The indexer needs a full rewrite from scalar CUDA to tcgen05 MMA + radix-select.
 - Verified on B200
 
 ### 🔨 IN PROGRESS / NEXT UP
+- **D1.3 VERIFICATION**: Running comprehensive tests on B200 to verify SMEM-P fix produces correct results for hd=128,256,512
+  - Need to run `test_fmha_v3_stage_d1.py` and other regression tests
+  - Checking debug prints from `[SMEM-P PROPER]` sections in fmha.py
+  - Verifying cosine similarity against FP32 oracle
+
 - **D1.5**: Correction epilog fix (3% error from TMEM layout mismatch) 🟡 COMPLEX REFACTOR
   - Hand-constructed `Ld32x32bOp`/`St32x32bOp` atoms cause layout mismatch
   - Proper fix: CUTLASS `correction_epilog` pattern with paired atoms
@@ -739,4 +744,22 @@ The following are real potential wins but go beyond what the V4 paper explicitly
 
 **NVFP4-0 results gate the critical path.** If NVFP4-0.1–0.4 find a wrong sf_dtype or wrong MMA kind, the fix comes before D1.3. Everything else is either parallel or post-D1.3.
 
-**NVFP4-3 (use_2cta_instrs) is the fastest win and has no dependencies.** Do it immediately after the NVFP4-0 prints.
\ No newline at end of file
+**NVFP4-3 (use_2cta_instrs) is the fastest win and has no dependencies.** Do it immediately after the NVFP4-0 prints.
+
+---
+
+## ⚡ CURRENT ACTION (2026-05-23 19:10 UTC)
+
+**Starting NVFP4-0 verification and D1.3 validation on B200:**
+
+1. **NVFP4-0.1**: Check sf_dtype in gemm_runner.py - appears to be `torch.float8_e4m3fn` from quantize.py ✅
+2. **NVFP4-0.2**: SF TMEM layout - need to verify packing matches UE4M3 (NVFP4) not UE8M0 (MXFP4)
+3. **NVFP4-0.3**: FP4 TMA element type - check `float4_e2m1fn_x2` in FMHA TMA
+4. **NVFP4-0.4**: MMA kind verification - confirm tcgen05 infers NVFP4 not MXFP4
+
+**D1.3 Verification:**
+- Running `test_fmha_v3_stage_d1.py` on B200 via `fire_b200_test`
+- Checking debug prints from SMEM-P path in fmha.py
+- Verifying hd=128,256,512 produce correct cosine similarity
+
+**Git workflow:** Edit locally → commit → push → test on B200
\ No newline at end of file
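
Reviewer note on the D1.3 check: the plan verifies cosine similarity against an FP32 oracle for hd=128,256,512 but does not say how the comparison is computed or what counts as passing. The sketch below is one way such a check could look, written in plain Python so it runs without a GPU. The names `cosine_similarity` and `check_against_oracle` and the 0.999 threshold are assumptions for illustration; they are not taken from `test_fmha_v3_stage_d1.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("length mismatch between kernel output and oracle")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def check_against_oracle(kernel_out, oracle_out, threshold=0.999):
    """Return (similarity, passed) for a flattened kernel output vs. oracle.

    The threshold is a hypothetical default, not a value from the plan.
    """
    sim = cosine_similarity(kernel_out, oracle_out)
    return sim, sim >= threshold
```

In a real run the two arguments would be the flattened attention output and the FP32 reference for each head dimension, for example `check_against_oracle(out.flatten().tolist(), ref.flatten().tolist())`. Cosine similarity ignores a uniform scale error, so a wrong scale factor could still pass; pairing it with a maximum absolute error check may be worth considering.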