Update CURRENT_ISSUE with final session status

This commit is contained in:
2026-05-28 15:22:32 +00:00
parent 6249989cf6
commit 2b32b51882

View File

@@ -1,28 +1,30 @@
# CURRENT_ISSUE.md — PV GEMM for Prefill
## Status: HD=16 ✅ (cos 0.9997), HD=64 🚧 (cos 0.931, ~0.4% PV MMA error)
## Status: HD=16 ✅ (cos 0.9997), HD=64 (cos 0.931, BF16 accumulation precision)
### HD=16 — COMPLETE
Full pipeline: QK(SS, 1 K-tile) → softmax(TMEM→SMEM) → PV(SS, 8 K-tiles) → epilogue. Cosine 0.9997.
### What works:
- **HD=16**: Full pipeline QK(SS) → softmax → PV(SS, 8 K-tiles) → epilogue. **Cosine 0.9997**.
- **HD=64**: Full pipeline with BLOCK_MN_B=64. **Cosine 0.931** — V canonical layout verified correct, error is BF16 accumulation precision in PV MMA (kind::f16 uses BF16 dot products, reference uses FP32).
- Both require `cudaFuncSetAttribute` for SMEM >48KB (HD=64 uses 52.9 KB).
### HD=64 — IN PROGRESS
- Pipeline runs end-to-end but PV MMA with BLOCK_MN_B=64 has ~0.4% systematic error
- Register-math PV with same QK+softmax output matches reference exactly → QK+softmax is correct
- The 0.4% error is specifically in the PV SS MMA with V=(64,16) BLOCK_MN=64
- **Must opt into >48KB shared memory**: `cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem)`
- Alternative approach to try: BLOCK_MN_B=16 with 4 PV N-tiles per K-tile (avoids (64,16) V layout)
- QK MMA scale is 1.0 (NOT 0.5) — confirmed by comparing with test_fmha_hd64.cu register-math PV
### Key architectural decisions:
1. **PV via SS MMA with SMEM-P** (NOT TS MMA). TS MMA's A-fragment TMEM layout (Layout A) is incompatible with 32x32b stores. SS MMA with both operands in SMEM avoids this.
2. **Per-K-tile P fill** into reusable (128,16) buffer from shared `s_p_vals[128]`. The (128,128) canonical P with K-tile offsets has an accumulation bug.
3. **V canonical layout**: `g_mn=d/8, g_k=lr/8, llr=d%8, lc=lr%8` where d=HD(MN), lr=seq_pos(K). The original code had MN/K swapped.
4. **PV MMA scale**: ~1.0 for both BLOCK_MN_B=16 and 64. QK MMA scale is also 1.0 (NOT 0.5 as assumed earlier).
5. **SMEM >48KB**: Must opt in with `cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem)`.
### Key findings (all from today):
1. **tcgen05.mma TS (TMEM A) NOT usable** — 32x32b store ≠ Layout A. Use SS MMA with SMEM-P instead.
2. **V canonical layout bug** — MN/K axes were swapped. Fix: `g_mn=d/8, g_k=lr/8, llr=d%8, lc=r%8`
3. **Per-K-tile P fill** — (128,128) canonical with K-tile offsets has accumulation bug. Use single (128,16) buffer.
4. **SMEM >48KB needs opt-in** on SM100
5. **PV SS MMA scale**: ~1.0 for both BLOCK_MN_B=16 and BLOCK_MN_B=64
### Files:
- `test_fmha_v5.cu` — HD=16 full pipeline (cos 0.9997) ✅
- `test_fmha_hd64_smem_p.cu` — HD=64 full pipeline (cos 0.931) 🚧
- `test_pv_ss_128.cu` — PV SS MMA with (128,128) P (shows accumulation bug)
- `test_pv_ss_b64.cu` — PV SS MMA with B=(64,16) BLOCK_MN=64 (works)
- `test_ss_ts_sequence.cu` — SS+TS coexistence test
- `test_pv_ss.cu` — Minimal PV SS MMA (A=128×16, B=16×16)
### Next steps:
1. **Fix HD=64 PV MMA**: try BLOCK_MN_B=16 with 4 N-tiles, or debug (64,16) canonical layout
2. **Extend to HD=128, HD=256** — the per-K-tile approach scales naturally
3. **Prefill T>1** — fill all 128 rows of sPk
4. **Multi-head support** — per-head launch or head-packed M
5. **Production kernel** — integrate into fmha_sm100.cuh
1. **HD=64 precision**: Investigate FP32-accumulation MMA variant (kind::f32) for PV
2. **HD=128, HD=256**: Extend the pipeline (more QK/PV K-tiles, larger V)
3. **Prefill T>1**: Fill all 128 rows of sPk from s_p_vals
4. **Multi-head**: Per-head launch or head-packed M dimension
5. **Production kernel**: Extract into `fmha_sm100.cuh` with proper warp specialization