# DSV4 NVFP4 Workspace

## Status (May 21, 2026, 09:18 UTC)

### Stage A ✅ COMPLETE

Bare Q@K^T via tcgen05.mma → TMEM → GMEM. Cosine 0.999999.

### Stage B 🔨 IN PROGRESS: Bug 4 (non-(128,128) PV MMA)

Two MMAs chained: Q@K^T (SMEM source) → identity softmax in TMEM → P@V (TMEM source).

**Pipeline deadlock: ✅ FIXED. Softmax packing: ✅ CONFIRMED CORRECT.**

---
## Bug 4 (ACTIVE): Non-(128,128) PV MMA, V/B Staging or Output C/D Failure

### Summary

Original hypothesis (since ruled out, see Root Cause): the softmax writes P to TMEM using the **QK C-fragment layout**, while the PV MMA reads P from TMEM using the **PV A-fragment layout**. For (128,128) PV these layouts agree. For (128,16) PV they were thought to disagree, with the PV A-fragment reading from different TMEM columns than the ones the softmax wrote, producing zero output.

**FMHA uses (128,16) PV with the same construction and works.** The root difference has not yet been identified despite exhaustive comparison. FMHA reference: `p_tmem_layout_staged = make_smem_layout_a(pv_mma, pv_mma_tiler, q_dtype, 1)`, which is the same call we make.

### What Works / What Doesn't

- ✅ PV (128,128) output, V=I or random → cosine 1.0 / 0.999999
- ✅ PV (128,128) with zero-padded V (head_dim=16) → cosine 1.0 **WORKAROUND**
- ✅ PV (128,64), all-ones V → cosine 0.999999 (uniform V hides the bug)
- ✅ PV (128,64), single-element V → cosine 1.0 (sparse V hides the bug)
- ❌ PV (128,64), truncated identity V → cosine 0.02
- ❌ PV (128,16), V=I(128,128) → cosine 0.0 (all zeros)
- ❌ PV (128,16) with P at S offset (no softmax) → NaN (FP32→BF16 reinterpret)
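
The "uniform V hides the bug" note can be illustrated with a small pure-Python sketch. It models an operand mismatch as a rotation of P's columns, which is only an assumed fault model and not the confirmed failure mode; `matmul`, `cosine`, and the scaled-down tile sizes are hypothetical helpers for illustration.

```python
import math
import random

def matmul(a, b):
    """Plain nested-list matrix product: (m x k) @ (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cosine(x, y):
    """Cosine similarity between two matrices, flattened."""
    fx = [v for row in x for v in row]
    fy = [v for row in y for v in row]
    dot = sum(p * q for p, q in zip(fx, fy))
    nx = math.sqrt(sum(p * p for p in fx))
    ny = math.sqrt(sum(q * q for q in fy))
    return dot / (nx * ny) if nx and ny else 0.0

random.seed(0)
M, K, N = 8, 8, 4  # scaled-down stand-ins for the real tile sizes
P = [[random.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(M)]

# Assumed fault model: P's columns are read rotated by one position.
P_bad = [[row[(k + 1) % K] for k in range(K)] for row in P]

V_ones = [[1.0] * N for _ in range(K)]
V_eye = [[1.0 if k == n else 0.0 for n in range(N)] for k in range(K)]  # truncated identity

# All-ones V reduces every output to a row sum of P, so column order is invisible.
cos_uniform = cosine(matmul(P, V_ones), matmul(P_bad, V_ones))
# Truncated identity V copies individual columns of P, so column order is exposed.
cos_identity = cosine(matmul(P, V_eye), matmul(P_bad, V_eye))

print(f"all-ones V: {cos_uniform:.6f}, truncated identity V: {cos_identity:.6f}")
```

This is why identity-like or random V is the more trustworthy test input: any test whose output depends only on sums over K cannot distinguish a correct operand from a shuffled one.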
### Root Cause (Updated May 21 09:20 UTC)

**The P/A TMEM alias is NOT the bug.** Diagnostic prints confirm the PV A-fragment layout is IDENTICAL for all PV sizes:

```
(128,128) PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=1.0
(128,32)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.51
(128,16)  PV: tOrP2_s = (2048, 1, 8), size=16384, cosine=0.36
```

The C++ source confirms: the A-fragment TMEM atom depends on M and K, NOT output N. The softmax writes P to the same TMEM columns regardless of PV size.

**The real bug is in the V/B staging or output C/D path.** When using (128,128) PV with zero-padded V (which keeps the V SMEM, O C-fragment, and epilogue at (128,128) dimensions), cosine=1.0. When using native (128,32) PV with V=(32,128), cosine=0.51. The difference is the V SMEM layout and/or the output epilogue.

**Key observations:**

- V smem_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
- O tOtO_size=2048 for (128,16) PV, vs 16384 for (128,128) PV
- cta_tile_shape_mnk=(128,128,64) for BOTH: this is QK's cta tile, not PV's
- epi_tile=(128,16) for (128,16) PV: this IS correct (it comes from PV)
- Swapping cta_tile to PV before the epilogue doesn't fix the issue

**Next steps:**

1. Test V TMA load correctness for (128,32) PV: is V data loaded correctly into SMEM?
2. Test PV MMA output directly (skip epilogue): is the PV MMA producing correct O?
3. Check whether the V B-operand fragment (tCrV) has the right shape for (128,32) PV
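
The sizes quoted in the key observations follow from simple tile arithmetic, sketched below as a sanity check for the shape tests in the next steps. The sketch assumes a PV tile of (M, N) with K=128 per KV tile, which is inferred from the V=I(128,128) and V=(32,128) test inputs; `pv_operand_sizes` is a hypothetical helper, not a CuTeDSL call.

```python
def pv_operand_sizes(m, n, k):
    """Element counts for the operands of an (m x k) @ (k x n) PV tile."""
    return {
        "P (A operand)": m * k,   # depends on M and K only, not on N
        "V (B operand)": k * n,   # shrinks with N
        "O (C/D output)": m * n,  # shrinks with N
    }

# K=128 is an assumption taken from the test shapes in this README.
for n in (128, 32, 16):
    print((128, n), pv_operand_sizes(128, n, 128))
```

For (128,16) this gives 16384 elements for P and 2048 each for V and O, matching the reported `size`, `smem_size`, and `tOtO_size`. Only the operands whose size changes with N are the ones the root-cause analysis points at.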
### Current Workaround

Use **(128,128) PV with zero-padded V**. This wastes compute (8× for head_dim=16, 2× for head_dim=64) but produces correct results (cosine 1.0). For the production kernel, we'll use this initially and optimize to (128,16) PV once Bug 4 is resolved.
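
A minimal sketch of why the workaround is correct but wasteful, written in plain Python rather than CuTeDSL. It treats V as a (K, head_dim) matrix padded with zero columns, which is an assumed orientation for illustration; `padding_overhead`, `pad_columns`, and `matmul` are hypothetical helpers.

```python
def padding_overhead(head_dim, padded_n=128):
    """How many times more output columns are computed than are needed."""
    if not 0 < head_dim <= padded_n:
        raise ValueError("head_dim must be in (0, padded_n]")
    return padded_n / head_dim

def pad_columns(v, padded_n):
    """Right-pad each row of V (k x head_dim) with zeros up to padded_n columns."""
    return [row + [0.0] * (padded_n - len(row)) for row in v]

def matmul(a, b):
    """Plain nested-list matrix product: (m x k) @ (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Tiny stand-in: k=3, head_dim=2, padded to 4 columns.
p = [[0.2, 0.3, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
o_native = matmul(p, v)
o_padded = matmul(p, pad_columns(v, 4))

# The first head_dim columns match; the padded columns are exactly zero.
assert [row[:2] for row in o_padded] == o_native
assert all(x == 0.0 for row in o_padded for x in row[2:])

print(padding_overhead(16), padding_overhead(64))
```

The overhead ratio is just `128 / head_dim`, which reproduces the 8× and 2× figures above.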
### Required Fixes (Not Yet Applied)

Note: this list was drafted before the Root Cause update above. Item 1 assumes the TMEM-alias hypothesis that the update rules out, and the key observations report that swapping the cta tile before the epilogue (item 2) did not fix the issue.

1. **Primary**: Softmax must write P using the PV A-fragment TMEM layout, not the QK C-fragment layout. Requires constructing a `make_tmem_copy` with `tP` (PV layout) as the destination, and rearranging register data from QK partition to PV partition.

2. **Secondary**: `epi_tile` must use PV's cta tile, and `self.cta_tile_shape_mnk` must be swapped before `epilogue_tma_store`. FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly.

3. **Alternative (for later)**: Investigate using `composition()` to create a hybrid layout that both the QK softmax write and PV A-fragment read can agree on.
---

## Bugs 1–3: ✅ FIXED

### Bug 1: V B-Operand Must Be MN-Major

FMHA requires V to be **MN-major** for the PV MMA B-operand. V must be shaped (head_dim, seq) = (64, 128) with strides (1, 64) via `as_strided`.
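
What the (64, 128) shape with strides (1, 64) means in memory can be sketched without any tensor library. The sketch assumes the source V buffer is contiguous as (seq, head_dim) = (128, 64), which the strides imply but the notes do not state; `strided_get` is a hypothetical helper that mimics how a strided view resolves an index.

```python
SEQ, HEAD_DIM = 128, 64

# Assumed source: V stored contiguously as (seq, head_dim), row-major.
# Each element holds its own flat offset so aliasing is easy to check.
storage = [s * HEAD_DIM + d for s in range(SEQ) for d in range(HEAD_DIM)]

def strided_get(buf, strides, index, offset=0):
    """Read one element through a strided view of a flat buffer."""
    return buf[offset + sum(i * st for i, st in zip(index, strides))]

shape = (HEAD_DIM, SEQ)   # (64, 128), as in Bug 1
strides = (1, HEAD_DIM)   # (1, 64): head_dim is the contiguous (stride-1) mode

# view[d, s] aliases the original V[s, d]; no data is copied or moved.
for d in range(shape[0]):
    for s in range(shape[1]):
        assert strided_get(storage, strides, (d, s)) == s * HEAD_DIM + d

# In PyTorch the equivalent view would be torch.as_strided(v, (64, 128), (1, 64)).
print(strided_get(storage, strides, (1, 0)), strided_get(storage, strides, (0, 1)))
```

Stepping along head_dim moves by one element and stepping along seq moves by 64, so the view is a transpose of the stored tensor with the first mode contiguous.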
### Bug 2: C-Fragment Composition Store for P (CONFIRMED CORRECT)

FP32→BF16 packing via C-fragment composition store works. ⛔ `St32x32bOp` MUST use Float32, NOT BFloat16.
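
As background for the packing, the sketch below shows what converting two FP32 values to BF16 and packing them into one 32-bit word looks like at the bit level, and how reading an unconverted FP32 word as a BF16 pair goes wrong. It is not the C-fragment composition store itself; the helper names, the round-to-nearest-even choice, and the low-half-first ordering are all assumptions for illustration, and NaN inputs are not handled.

```python
import math
import struct

def f32_bits(x):
    """Raw 32-bit pattern of a Python float rounded to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def f32_to_bf16_bits(x):
    """FP32 -> BF16 bits with round-to-nearest-even (NaN inputs not handled)."""
    bits = f32_bits(x)
    rounding = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding) >> 16) & 0xFFFF

def bf16_bits_to_f32(h):
    """BF16 bits -> float: BF16 is the upper half of an FP32 pattern."""
    return struct.unpack("<f", struct.pack("<I", (h & 0xFFFF) << 16))[0]

def pack_bf16_pair(lo, hi):
    """Pack two values as BF16 into one 32-bit word (lo in the low half, assumed)."""
    return f32_to_bf16_bits(lo) | (f32_to_bf16_bits(hi) << 16)

def unpack_bf16_pair(word):
    """Read a 32-bit word as two BF16 values (low half first)."""
    return bf16_bits_to_f32(word & 0xFFFF), bf16_bits_to_f32(word >> 16)

# Proper conversion round-trips values that are exactly representable in BF16.
assert unpack_bf16_pair(pack_bf16_pair(0.5, -2.0)) == (0.5, -2.0)

# Reinterpreting a raw FP32 word as a BF16 pair does not: 1.0 reads back as (0.0, 1.0).
assert unpack_bf16_pair(f32_bits(1.0)) == (0.0, 1.0)

# Some FP32 patterns even read back as NaN in the low half.
assert math.isnan(unpack_bf16_pair(0x3F807FC1)[0])
```

The last two checks are one plausible mechanism behind the NaN in the "P at S offset (no softmax)" test, where FP32 scores are read as if they were BF16.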
### Bug 3: First PV Must Use ACCUMULATE=False

If ACCUMULATE=True on the first PV, `O = P@V + old_O` adds uninitialized TMEM. FMHA: `pv_tiled_mma.set(tcgen05.Field.ACCUMULATE, kphase_idx != 0)`.

---

## Pipeline Deadlock: ✅ FIXED (May 21)

Three root causes found and fixed:

1. `PipelineUmmaAsync` for mma_si must NOT pass `cta_layout_vmnk`
2. TMA warp must NOT call `tmem.wait_for_alloc()`
3. Use `pipeline.PipelineTmaStore` (not `TmaStorePipeline`)

---

## ⛔ FOOTGUNS: CUTLASS CuTeDSL Landmines

1. **St32x32bOp with BFloat16 → ILLEGAL MEMORY ACCESS**: must use Float32 + `cute.recast_ptr`
2. **V major ≠ K major**: V must be MN-major, use `as_strided`
3. **C-fragment → A-fragment TMEM alias only works when N_MMA matches**: (128,128) works, (128,64) breaks
4. **PipelineUmmaAsync consumer = thread count, NOT warp count**: `32 * len(warp_ids)`
5. **mma_si pipeline must NOT pass cta_layout_vmnk**
6. **TMA warp excluded from tmem barrier**
7. **First PV ACCUMULATE=False**
8. **TMEM offset: FP32 ptr + 32 = BF16 ptr + 64** (width scaling)
9. **epi_tile must use PV cta_tile, not QK**
10. **CuTe nested layout modes flatten sequentially**: `((128,16),1,(4,2)):((65536,1),0,(16,64))` is sequential
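
The arithmetic behind footguns 4, 8, and 10 can be checked with a few lines of plain Python. This is a sketch under assumptions: the helpers below are hypothetical and reimplement the index math by hand rather than calling CuTeDSL, and the `warp_ids` tuple is an example value, not the kernel's actual warp assignment.

```python
from math import prod

def flatten(x):
    """Flatten a nested tuple of ints into a flat list, left to right."""
    if isinstance(x, int):
        return [x]
    return [v for item in x for v in flatten(item)]

def mode_sizes(shape):
    """Size of each top-level mode of a nested CuTe-style shape."""
    return tuple(prod(flatten(mode)) for mode in shape)

def layout_offset(shape, stride, index):
    """Map a linear index to an offset, with the leftmost mode varying fastest."""
    offset = 0
    for extent, step in zip(flatten(shape), flatten(stride)):
        index, coord = divmod(index, extent)
        offset += coord * step
    return offset

def scale_offset(offset, from_bits, to_bits):
    """Convert an element offset between element widths (footgun 8)."""
    total_bits = offset * from_bits
    if total_bits % to_bits:
        raise ValueError("offset is not a whole number of target elements")
    return total_bits // to_bits

# Footgun 10: the nested layout flattens mode by mode, left to right.
shape = ((128, 16), 1, (4, 2))
stride = ((65536, 1), 0, (16, 64))
print(mode_sizes(shape), prod(flatten(shape)))  # matches tOrP2_s = (2048, 1, 8), size=16384
print(layout_offset(shape, stride, 1), layout_offset(shape, stride, 128))

# Footgun 8: 32 FP32 elements and 64 BF16 elements span the same number of bits.
print(scale_offset(32, from_bits=32, to_bits=16))

# Footgun 4: the consumer count is threads, not warps.
warp_ids = (0, 1, 2, 3)  # example value only
print(32 * len(warp_ids))
```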
---

## Architecture: Per-Tile Flow

```
For each KV tile:
  1. Load warp writes sKV[stage] (paged FP8 gather via indexed cp.async)
  2. MMA warp issues MMA1: sQ @ sKV[stage]^T → tmem_scores (accumulate=False)
     Signals scores_full_mbar (via PipelineUmmaAsync commit)
  3. Epilogue warps wait on mma_si consumer (scores ready), then:
     a. tcgen05.ld scores from TMEM → register fragments
     b. Compute tile_max, new_max, rescale = exp(old_max - new_max)
     c. Apply rescale to tmem_output IN PLACE (tmem_output *= rescale)
     d. tcgen05.st exp(scores - new_max) back to TMEM → P operand
     e. Release mma_si (softmax_done: MMA warp can re-acquire and issue PV MMA)
  4. MMA warp waits on mma_si acquire (softmax done), MMA2: P @ sV → tmem_output
     (accumulate=False on the first tile per Bug 3, accumulate=True afterwards)
  5. Stage released, load warp can refill it

After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast to BF16, store to GMEM
```
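
The numerics of the per-tile flow (steps 3b to 3d, step 4, and the final normalization) can be sketched for a single query row in plain Python. This is a reference for the math only, not the kernel: there are no warps, barriers, or TMEM here, and `attention_reference`, `attention_tiled`, and the tile size are hypothetical names and values chosen for the example. The first tile overwrites the output, mirroring Bug 3.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """softmax(q . K^T) @ V for one query row, computed over all keys at once."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def attention_tiled(q, keys, values, tile):
    """Same result, one KV tile at a time with a running max and row sum."""
    dim = len(values[0])
    out = [0.0] * dim            # stands in for tmem_output
    row_max, row_sum = -math.inf, 0.0
    for start in range(0, len(keys), tile):
        k_tile, v_tile = keys[start:start + tile], values[start:start + tile]
        scores = [dot(q, k) for k in k_tile]            # MMA1
        new_max = max(row_max, max(scores))             # step 3b
        rescale = math.exp(row_max - new_max)           # 0.0 on the first tile
        p = [math.exp(s - new_max) for s in scores]     # step 3d
        first = start == 0
        for j in range(dim):
            pv = sum(pi * v[j] for pi, v in zip(p, v_tile))  # MMA2
            # First tile overwrites (accumulate=False); later tiles add to the
            # rescaled output (step 3c followed by accumulate=True).
            out[j] = pv if first else out[j] * rescale + pv
        row_sum = row_sum * rescale + sum(p)
        row_max = new_max
    return [o / row_sum for o in out]                   # final divide by row_sum

random.seed(1)
q = [random.uniform(-1, 1) for _ in range(8)]
keys = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(32)]
values = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=8)
print(max(abs(a - b) for a, b in zip(ref, tiled)))
```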
---

## Environment

- **Server**: root@45.76.247.107 (B200, 180 GiB HBM3e per GPU)
- **venv**: `source /root/dsv4-nvfp4-workspace/venv/bin/activate`
- **PYTHONPATH**: `/root/dsv4-nvfp4-workspace/kernel`
- **Model**: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4`
- **vLLM repo**: `/root/dsv4-nvfp4-workspace/vllm` (modified for Blackwell)
- **Pseudocode**: `/root/fragile-kernel-example/README.md`
- **fmha.py reference**: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha.py`
- **fmha_bwd.py reference**: `/root/cutlass/examples/python/CuTeDSL/cute/blackwell/kernel/attention/fmha/fmha_bwd.py`