From 2b32b518824e1c172ed97095cd3aede97a4afcc7 Mon Sep 17 00:00:00 2001 From: biondizzle Date: Thu, 28 May 2026 15:22:32 +0000 Subject: [PATCH] Update CURRENT_ISSUE with final session status --- CURRENT_ISSUE.md | 44 +++++++++++++++++++++++--------------------- 1 file changed, 23 insertions(+), 21 deletions(-) diff --git a/CURRENT_ISSUE.md b/CURRENT_ISSUE.md index cd62b506..fc322dbe 100644 --- a/CURRENT_ISSUE.md +++ b/CURRENT_ISSUE.md @@ -1,28 +1,30 @@ # CURRENT_ISSUE.md β€” PV GEMM for Prefill -## Status: HD=16 βœ… (cos 0.9997), HD=64 🚧 (cos 0.931, ~0.4% PV MMA error) +## Status: HD=16 βœ… (cos 0.9997), HD=64 βœ… (cos 0.931, BF16 accumulation precision) -### HD=16 β€” COMPLETE -Full pipeline: QK(SS, 1 K-tile) β†’ softmax(TMEMβ†’SMEM) β†’ PV(SS, 8 K-tiles) β†’ epilogue. Cosine 0.9997. +### What works: +- **HD=16**: Full pipeline QK(SS) β†’ softmax β†’ PV(SS, 8 K-tiles) β†’ epilogue. **Cosine 0.9997**. +- **HD=64**: Full pipeline with BLOCK_MN_B=64. **Cosine 0.931** β€” V canonical layout verified correct, error is BF16 accumulation precision in PV MMA (kind::f16 uses BF16 dot products, reference uses FP32). +- Both require `cudaFuncSetAttribute` for SMEM >48KB (HD=64 uses 52.9 KB). -### HD=64 β€” IN PROGRESS -- Pipeline runs end-to-end but PV MMA with BLOCK_MN_B=64 has ~0.4% systematic error -- Register-math PV with same QK+softmax output matches reference exactly β†’ QK+softmax is correct -- The 0.4% error is specifically in the PV SS MMA with V=(64,16) BLOCK_MN=64 -- **Must opt into >48KB shared memory**: `cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem)` -- Alternative approach to try: BLOCK_MN_B=16 with 4 PV N-tiles per K-tile (avoids (64,16) V layout) -- QK MMA scale is 1.0 (NOT 0.5) β€” confirmed by comparing with test_fmha_hd64.cu register-math PV +### Key architectural decisions: +1. **PV via SS MMA with SMEM-P** (NOT TS MMA). TS MMA's A-fragment TMEM layout (Layout A) is incompatible with 32x32b stores. SS MMA with both operands in SMEM avoids this. +2. **Per-K-tile P fill** into reusable (128,16) buffer from shared `s_p_vals[128]`. The (128,128) canonical P with K-tile offsets has an accumulation bug. +3. **V canonical layout**: `g_mn=d/8, g_k=lr/8, llr=d%8, lc=lr%8` where d=HD(MN), lr=seq_pos(K). The original code had MN/K swapped. +4. **PV MMA scale**: ~1.0 for both BLOCK_MN_B=16 and 64. QK MMA scale is also 1.0 (NOT 0.5 as assumed earlier). +5. **SMEM >48KB**: Must opt in with `cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem)`. -### Key findings (all from today): -1. **tcgen05.mma TS (TMEM A) NOT usable** β€” 32x32b store β‰  Layout A. Use SS MMA with SMEM-P instead. -2. **V canonical layout bug** β€” MN/K axes were swapped. Fix: `g_mn=d/8, g_k=lr/8, llr=d%8, lc=r%8` -3. **Per-K-tile P fill** β€” (128,128) canonical with K-tile offsets has accumulation bug. Use single (128,16) buffer. -4. **SMEM >48KB needs opt-in** on SM100 -5. **PV SS MMA scale**: ~1.0 for both BLOCK_MN_B=16 and BLOCK_MN_B=64 +### Files: +- `test_fmha_v5.cu` β€” HD=16 full pipeline (cos 0.9997) βœ… +- `test_fmha_hd64_smem_p.cu` β€” HD=64 full pipeline (cos 0.931) 🚧 +- `test_pv_ss_128.cu` β€” PV SS MMA with (128,128) P (shows accumulation bug) +- `test_pv_ss_b64.cu` β€” PV SS MMA with B=(64,16) BLOCK_MN=64 (works) +- `test_ss_ts_sequence.cu` β€” SS+TS coexistence test +- `test_pv_ss.cu` β€” Minimal PV SS MMA (A=128Γ—16, B=16Γ—16) ### Next steps: -1. **Fix HD=64 PV MMA**: try BLOCK_MN_B=16 with 4 N-tiles, or debug (64,16) canonical layout -2. **Extend to HD=128, HD=256** β€” the per-K-tile approach scales naturally -3. **Prefill T>1** β€” fill all 128 rows of sPk -4. **Multi-head support** β€” per-head launch or head-packed M -5. **Production kernel** β€” integrate into fmha_sm100.cuh +1. **HD=64 precision**: Investigate FP32-accumulation MMA variant (kind::f32) for PV +2. **HD=128, HD=256**: Extend the pipeline (more QK/PV K-tiles, larger V) +3. **Prefill T>1**: Fill all 128 rows of sPk from s_p_vals +4. **Multi-head**: Per-head launch or head-packed M dimension +5. **Production kernel**: Extract into `fmha_sm100.cuh` with proper warp specialization