README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns

This commit is contained in:
2026-05-21 06:46:02 +00:00
parent f1c4ee0e4d
commit b9b1b808a5

View File

@@ -2,7 +2,7 @@
CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use `cutlass.cute` (CuTeDSL) with Blackwell tensor cores.
## Status (May 21, 2026 — 05:15 UTC)
## Status (May 21, 2026 — 06:45 UTC)
### ✅ Stage A: Bare Q@K^T via tcgen05.mma → TMEM → GMEM — COMPLETE
@@ -15,9 +15,9 @@ CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use `cutlas
**Bug 1 (V MN-major): ✅ Fix applied.**
**Bug 2 (softmax packing): ✅ Confirmed correct (V=I test: cosine 1.0).**
**Bug 3 (ACCUMULATE): ✅ Fix applied.**
**Bug 4 (non-square PV): 🔨 ROOT CAUSE IDENTIFIED — TMEM layout mismatch.**
**Bug 4 (non-square PV): 🔨 ACTIVE — Two approaches attempted, both blocked.**
#### Bug 4 (CURRENT): PV MMA Broken for (128,64) Output — ROOT CAUSE IDENTIFIED
#### Bug 4 (CURRENT): PV MMA Broken for (128,64) Output
**Root Cause: The (128,64) PV MMA's A-fragment reads P from TMEM with a different layout than the softmax packing writes it.**
@@ -31,23 +31,75 @@ The softmax packing writes P using the **QK C-fragment layout** (MMA atom = (128
**C++ TMEM Fragment Layout (from mma_traits_sm100.hpp):**
```cpp
// For M_MMA = 128, N_MMA varies with the MMA atom's N dimension
Layout tmem_atom = Layout<Shape <_128, Int<N_MMA>>,
Stride< _1, _128>>{};
```
- QK C-fragment: N_MMA=128 → 128 TMEM columns, stride 128
- PV A-fragment (128,64): N_MMA=64 → 64 TMEM columns, stride 128
When the softmax packing writes P at `tmem_p0_offset` using the QK C-fragment layout (N_MMA=128), P's (m,k) elements land at TMEM address `m + 128*k`. But the PV A-fragment (N_MMA=64) reads the same TMEM region as if P were stored with N_MMA=64, so it interprets the data with stride 64 instead of 128, causing the every-other-column effect (O[m,d] ≈ P[m, 2d]).
**Approach 1: (128,64) PV MMA — BLOCKED (v28, deadlocks)**
- Created PV MMA with `pv_mma_tiler[:2] = (128, 64)` as FMHA does
- Kernel compiles but **deadlocks at runtime** inside `epilogue_tma_store`
- The (128,64) PV MMA changes `tOtO` shape from (128,128) to (128,64), affecting TMEM allocation, epilogue partitioning, and `tCgC` partitioning
- The deadlock is NOT in the MMA or softmax — it's specifically in `epilogue_tma_store` which calls `acc_pipeline.consumer_wait()`
- Diagnostics show: MMA warp stuck at `mma_si acquire` (waiting for softmax), EPI warps complete softmax but deadlock in epilogue
- All three known deadlock fixes are applied (no cta_layout_vmnk on mma_si, TMA warp doesn't call wait_for_alloc, PipelineTmaStore)
- **New deadlock root cause unknown** — likely related to epilogue partitioning mismatch with (128,64) output shape
**Fix (not yet applied): The softmax packing must write P using the PV MMA's A-fragment layout, not the QK C-fragment layout.** FMHA does this correctly because its softmax writes P using a composition that matches the PV A-fragment — the `tStS_P` layout is derived from `tStS.layout` (QK C-fragment) but the TMEM store uses a C-fragment composition that's based on the PV MMA's tiling. The key is that FMHA's `tilePlikeFP32` computation adapts the packing width to match the PV output N.
**Approach 2: Pad V to (128,128), use (128,128) PV MMA — BLOCKED (v29, deadlocks)**
- Pad V from (64,128) to (128,128) with zeros, keep (128,128) PV MMA
- Should produce O=(128,128) where first 64 columns are correct
- **Also deadlocks**, even with V=I(128,128) which works in test_pv_diag.py
- test_pv_diag.py uses the exact same pipeline and kernel structure
- Difference between v29 and test_pv_diag: likely a subtle code issue (b_dtype vs q_dtype, pv_mma_tiler, etc.)
- **Needs bisecting** — slowly modify test_pv_diag into v29 to find the breaking change
**Additional fix: `epi_tile` must be computed from PV cta_tile, not QK cta_tile.** Using QK's cta_tile for the epilogue produces `epi_tile=(128,128)` which is wrong for a (128,64) output. Computing from PV's cta_tile gives `epi_tile=(128:1, 32:1)`. This fix alone improved cosine from 0.01 to 0.876, but the TMEM layout mismatch remains.
**Approach 3 (NOT YET TRIED): FMHA-style (128,16) PV MMA with N-tiling**
- FMHA uses pv_mma_tiler = (128, 16, 128) with MN = (128, 16)
- For head_dim=64, FMHA tiles the N dimension 4 times (64/16=4)
- Requires restructuring the kernel to loop over N tiles
- This is the "correct" approach but requires more code changes
**Approach 4 (NOT YET TRIED): Fix softmax packing to write P in PV A-fragment layout**
- Instead of composing tStS.layout → (128, 64), write P using a layout derived from the PV A-fragment
- FMHA does this: `tStS_P_layout = cute.composition(tStS.layout, (128, tilePlikeFP32))`
- The composition writes 64 packed FP32 columns that alias the PV A-fragment's 64 TMEM columns
- This should fix the alias for (128,64) PV MMA, but still blocked by the epilogue deadlock
**V SMEM Layouts (confirmed correct):**
- `PV(128,64) V SMEM: outer=((64,16),1,8,1):((1,64),0,1024,0), inner=S<3,4,3>`
- `PV(128,128) V SMEM: outer=(((64,2),16),1,8,1):(((1,8192),64),0,1024,0), inner=S<3,4,3>`
**FMHA Softmax Packing Bridge (reference trace):**
FMHA's softmax writes P in a packed format that aliases the PV A-fragment:
```python
# P store columns: qk_mma_tiler[1] * p_dtype.width // qk_acc_dtype.width
# For 128 BF16 columns: 128 * 16 / 32 = 64 packed FP32 columns
tilePlikeFP32 = qk_mma_tiler[1] // Float32.width * o_dtype.width # = 64
# Destination: packed-P physical TMEM view (64 FP32 columns, not 128)
tStS_P_layout = cute.composition(tStS.layout, cute.make_layout((128, tilePlikeFP32)))
tStS_P = cute.make_tensor(tStS.iterator + tmem_p0_offset, tStS_P_layout)
# Register packing: Float32 backing → BF16 recast view
tTMEM_STORErS_x4 = cute.make_rmem_tensor(tTMEM_STOREcS.shape, Float32)
tTMEM_STORErS_x4_e = cute.make_tensor(
cute.recast_ptr(tTMEM_STORErS_x4.iterator, dtype=BFloat16),
tTMEM_LOADrS.layout) # 128 BF16 logical elements
```
The consumer (PV A-fragment) reads from the same TMEM columns:
```python
tP = cute.make_tensor(tStS.iterator, p_tmem_layout_staged.outer)
tOrP = pv_thr_mma.make_fragment_A(tP)[None, None, None, 0]
tOrP0 = cute.make_tensor(
tOrP.iterator + Float32.width // BFloat16.width * tmem_p0_offset,
tOrP.layout)
```
The subtle point: `tilePlikeFP32 = 64` means "64 packed FP32 TMEM columns to store 128 logical BF16 P columns", NOT "PV output N=64". It's a coincidence that they're both 64 for head_dim=64.
---
### Bug 1: V B-Operand Must Be MN-Major — ✅ FIX APPLIED
@@ -99,6 +151,18 @@ The tmem allocation barrier has `num_threads = 32 * (mma_warp + epilogue_warps)`
---
## ⛔ NEW DEADLOCK: (128,64) PV MMA Epilogue — UNRESOLVED (May 21, 06:45 UTC)
v28 (128,64 PV MMA) and v29 (padded V 128x128) both deadlock inside `epilogue_tma_store`.
**Not caused by the three known deadlock fixes** — all are applied.
**Symptom**: `epilogue_tma_store` blocks on `acc_pipeline.consumer_wait()`. MMA warp appears stuck at `mma_si_prod.acquire_and_advance()` (2nd acquire, waiting for softmax). But EPI warps complete softmax successfully.
**Bisect needed**: test_pv_diag.py (V=I 128x128, WORKS) → v29 (V=I 128x128, DEADLOCKS). Same input, same MMA configuration, same pipeline. Difference must be a subtle code issue.
---
## ⛔ DEAD TEST: test_stage_b_v21.py — DELETED, DO NOT RECREATE
v21 attempted both Bug 1 and Bug 2 fixes in a hand-rolled pipeline kernel. It deadlocks on GPU. Root cause: pipeline synchronization mismatch. **Do not recreate.** Write from scratch using fmha.py as the reference.
@@ -159,6 +223,14 @@ The softmax packing writes P using the QK C-fragment layout. The PV A-fragment r
`compute_epilogue_tile_shape` must use PV's `cta_tile_shape_mnk`, not QK's. Also, `self.cta_tile_shape_mnk` must be set to PV's cta tile before calling `epilogue_tma_store` (it reads `gemm_kernel.cta_tile_shape_mnk` internally). FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly.
### 11. GPU State Persists After Deadlocked Kernels
After a kernel deadlock, the GPU may remain in a bad state. Kill all python processes using the GPU before running new tests. `nvidia-smi` shows hanging processes. Use `kill -9 <PID>` to clean up. A deadlocked kernel can also cause subsequent runs to fail even if the code is correct.
### 12. V Padded to (128,128) Must Use MN-Major Strides
When padding V from (64,128) to (128,128), the padded tensor MUST use MN-major strides (1, 128), not the default PyTorch row-major strides (128, 1). Use `as_strided` or `transpose()` to get the correct layout. The wrong layout causes `LayoutEnum.ROW_MAJOR` instead of `LayoutEnum.COL_MAJOR`, which the PV MMA's OperandMajorMode.MN does not expect.
---
## Architecture: Per-Tile Flow
@@ -198,6 +270,9 @@ After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast
| `test_diag_layout.py` | (128,64) PV with epi from PV cta_tile | 0.876 | ❌ partial fix — epi correct, TMEM alias still broken |
| `test_diag_smem_layout.py` | Print V SMEM layouts for (128,64) vs (128,128) | N/A | layouts confirmed correct |
| `test_layout_compare.py` | Print TMEM layouts for QK S and PV A-fragment | N/A | layout inspection |
| `test_stage_b_v28.py` | (128,64) PV MMA + epi from PV | — | ❌ DEADLOCK in epilogue_tma_store |
| `test_stage_b_v29.py` | Padded V (128,128) + (128,128) PV MMA | — | ❌ DEADLOCK (likely code bug, not fundamental) |
| `test_stage_b_v30.py` | Copy of test_pv_diag.py | 1.0 | ✅ PASS (sanity check) |
---
@@ -216,6 +291,9 @@ pv_mma_tiler = (qk_mma_tiler[0], qk_mma_tiler[2], qk_mma_tiler[1])
FMHA passes `pv_mma_tiler[:2] = (128, head_dim)` to `make_trivial_tiled_mma`, NOT the QK tiler `(128, 128)`.
### FMHA uses (128,16) PV MMA, NOT (128,64)
FMHA uses pv_mma_tiler = (128, 16, 128) with MN = (128, 16). For head_dim=64, FMHA tiles the N dimension 4 times. The softmax writes P once, and each tile reads the same P with a different V slice.
### make_trivial_tiled_mma — Use New Overload
```python
make_trivial_tiled_mma(a_dtype, b_dtype, a_leading_mode, b_leading_mode,