Update STAGE_D.md with current action plan - starting NVFP4-0 verification and D1.3 validation on B200
25
STAGE_D.md
@@ -473,6 +473,11 @@ The indexer needs a full rewrite from scalar CUDA to tcgen05 MMA + radix-select.
- Verified on B200

### 🔨 IN PROGRESS / NEXT UP
- **D1.3 VERIFICATION**: Running comprehensive tests on B200 to verify that the SMEM-P fix produces correct results for hd=128, 256, and 512
  - Need to run `test_fmha_v3_stage_d1.py` and the other regression tests
  - Checking debug prints from the `[SMEM-P PROPER]` sections in `fmha.py`
  - Verifying cosine similarity against the FP32 oracle

- **D1.5**: Correction epilog fix (3% error from TMEM layout mismatch) 🟡 COMPLEX REFACTOR
  - Hand-constructed `Ld32x32bOp`/`St32x32bOp` atoms cause the layout mismatch
  - Proper fix: CUTLASS `correction_epilog` pattern with paired atoms
@@ -739,4 +744,22 @@ The following are real potential wins but go beyond what the V4 paper explicitly

**NVFP4-0 results gate the critical path.** If NVFP4-0.1–0.4 reveal a wrong `sf_dtype` or the wrong MMA kind, that fix comes before D1.3. Everything else either runs in parallel or waits until after D1.3.

**NVFP4-3 (`use_2cta_instrs`) is the fastest win and has no dependencies.** Do it immediately after the NVFP4-0 prints.

---

## ⚡ CURRENT ACTION (2026-05-23 19:10 UTC)

**Starting NVFP4-0 verification and D1.3 validation on B200:**

1. **NVFP4-0.1**: Check `sf_dtype` in `gemm_runner.py`; it appears to be `torch.float8_e4m3fn`, coming from `quantize.py` ✅
2. **NVFP4-0.2**: SF TMEM layout: verify that the scale-factor packing matches UE4M3 (NVFP4), not UE8M0 (MXFP4)
3. **NVFP4-0.3**: FP4 TMA element type: check for `float4_e2m1fn_x2` in the FMHA TMA
4. **NVFP4-0.4**: MMA kind verification: confirm that tcgen05 infers NVFP4, not MXFP4
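
To make the NVFP4-0.1/0.2 concern concrete, here is a minimal, framework-free sketch of how the same scale-factor byte decodes under the two formats named above. It assumes the standard encodings: UE4M3 as the magnitude bits of FP8 E4M3 (bias 7, 3 mantissa bits) and UE8M0 as a pure power of two with bias 127. The function names are hypothetical and are not taken from `gemm_runner.py` or `quantize.py`.

```python
import math


def decode_ue4m3(byte: int) -> float:
    """Decode a UE4M3 scale factor: 4 exponent bits (bias 7), 3 mantissa bits.

    The sign bit (bit 7) is ignored, since scale factors are non-negative.
    """
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # the single NaN pattern of the E4M3 "fn" variant
    if exp == 0:
        return (man / 8.0) * 2.0 ** -6  # subnormal range
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_ue8m0(byte: int) -> float:
    """Decode a UE8M0 scale factor: exponent only, value = 2**(byte - 127)."""
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)


# The same raw byte means very different scales under the two formats.
for raw in (0x38, 0x7E, 0x7F):
    print(f"0x{raw:02X}: ue4m3={decode_ue4m3(raw)!r} ue8m0={decode_ue8m0(raw)!r}")
```

A byte that encodes a scale of 1.0 as UE4M3 (`0x38`) decodes to `2**-71` as UE8M0, so a wrong `sf_dtype` or MMA kind is likely to show up as output that is wrong by many orders of magnitude rather than as a small numerical drift.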

**D1.3 Verification:**

- Running `test_fmha_v3_stage_d1.py` on B200 via `fire_b200_test`
- Checking debug prints from the SMEM-P path in `fmha.py`
- Verifying that hd=128, 256, and 512 produce correct cosine similarity
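
The cosine-similarity check in the list above could look roughly like the sketch below. It is illustrative only: `cosine_similarity`, `check_against_oracle`, and the `0.999` threshold are assumptions, not the contents of `test_fmha_v3_stage_d1.py`, and the "kernel output" here is synthetic data standing in for a real B200 run.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length flat sequences, in float64."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero vectors count as identical; otherwise no similarity.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def check_against_oracle(kernel_out, oracle_out, threshold=0.999):
    """Return (similarity, passed). The threshold is a hypothetical value."""
    sim = cosine_similarity(kernel_out, oracle_out)
    return sim, sim >= threshold


# Synthetic stand-in for one output row per head dimension.
rng = random.Random(0)
for hd in (128, 256, 512):
    oracle = [rng.gauss(0.0, 1.0) for _ in range(hd)]
    kernel = [x + rng.gauss(0.0, 1e-3) for x in oracle]  # small simulated error
    sim, passed = check_against_oracle(kernel, oracle)
    print(f"hd={hd}: cos_sim={sim:.6f} passed={passed}")
```

Cosine similarity is scale-invariant, so it cannot detect an output that is uniformly off by a constant factor. Pairing it with a maximum absolute or relative error check would cover that case.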

**Git workflow:** Edit locally → commit → push → test on B200