README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns

2026-05-21 06:46:02 +00:00
parent f1c4ee0e4d
commit b9b1b808a5
1 changed files with 85 additions and 7 deletions
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

 CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use `cutlass.cute` (CuTeDSL) with Blackwell tensor cores.

-## Status (May 21, 2026 — 05:15 UTC)
+## Status (May 21, 2026 — 06:45 UTC)

 ### ✅ Stage A: Bare Q@K^T via tcgen05.mma → TMEM → GMEM — COMPLETE

@@ -15,9 +15,9 @@ CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use `cutlas
 **Bug 1 (V MN-major): ✅ Fix applied.**
 **Bug 2 (softmax packing): ✅ Confirmed correct (V=I test: cosine 1.0).**
 **Bug 3 (ACCUMULATE): ✅ Fix applied.**
-**Bug 4 (non-square PV): 🔨 ROOT CAUSE IDENTIFIED — TMEM layout mismatch.**
+**Bug 4 (non-square PV): 🔨 ACTIVE — Two approaches attempted, both blocked.**

-#### Bug 4 (CURRENT): PV MMA Broken for (128,64) Output — ROOT CAUSE IDENTIFIED
+#### Bug 4 (CURRENT): PV MMA Broken for (128,64) Output

 **Root Cause: The (128,64) PV MMA's A-fragment reads P from TMEM with a different layout than the softmax packing writes it.**

@@ -31,23 +31,75 @@ The softmax packing writes P using the **QK C-fragment layout** (MMA atom = (128

 **C++ TMEM Fragment Layout (from mma_traits_sm100.hpp):**
 ```cpp
-// For M_MMA = 128, N_MMA varies with the MMA atom's N dimension
 Layout tmem_atom = Layout<Shape <_128, Int<N_MMA>>,
                          Stride<  _1,     _128>>{};
 ```
 - QK C-fragment: N_MMA=128 → 128 TMEM columns, stride 128
 - PV A-fragment (128,64): N_MMA=64 → 64 TMEM columns, stride 128

-When the softmax packing writes P at `tmem_p0_offset` using the QK C-fragment layout (N_MMA=128), P's (m,k) elements land at TMEM address `m + 128*k`. But the PV A-fragment (N_MMA=64) reads the same TMEM region as if P were stored with N_MMA=64, so it interprets the data with stride 64 instead of 128, causing the every-other-column effect (O[m,d] ≈ P[m, 2d]).
+**Approach 1: (128,64) PV MMA — BLOCKED (v28, deadlocks)**
+- Created PV MMA with `pv_mma_tiler[:2] = (128, 64)` as FMHA does
+- Kernel compiles but **deadlocks at runtime** inside `epilogue_tma_store`
+- The (128,64) PV MMA changes `tOtO` shape from (128,128) to (128,64), affecting TMEM allocation, epilogue partitioning, and `tCgC` partitioning
+- The deadlock is NOT in the MMA or softmax — it's specifically in `epilogue_tma_store` which calls `acc_pipeline.consumer_wait()`
+- Diagnostics show: MMA warp stuck at `mma_si acquire` (waiting for softmax), EPI warps complete softmax but deadlock in epilogue
+- All three known deadlock fixes are applied (no cta_layout_vmnk on mma_si, TMA warp doesn't call wait_for_alloc, PipelineTmaStore)
+- **New deadlock root cause unknown** — likely related to epilogue partitioning mismatch with (128,64) output shape

-**Fix (not yet applied): The softmax packing must write P using the PV MMA's A-fragment layout, not the QK C-fragment layout.** FMHA does this correctly because its softmax writes P using a composition that matches the PV A-fragment — the `tStS_P` layout is derived from `tStS.layout` (QK C-fragment) but the TMEM store uses a C-fragment composition that's based on the PV MMA's tiling. The key is that FMHA's `tilePlikeFP32` computation adapts the packing width to match the PV output N.
+**Approach 2: Pad V to (128,128), use (128,128) PV MMA — BLOCKED (v29, deadlocks)**
+- Pad V from (64,128) to (128,128) with zeros, keep (128,128) PV MMA
+- Should produce O=(128,128) where first 64 columns are correct
+- **Also deadlocks**, even with V=I(128,128) which works in test_pv_diag.py
+- test_pv_diag.py uses the exact same pipeline and kernel structure
+- Difference between v29 and test_pv_diag: likely a subtle code issue (b_dtype vs q_dtype, pv_mma_tiler, etc.)
+- **Needs bisecting** — slowly modify test_pv_diag into v29 to find the breaking change

-**Additional fix: `epi_tile` must be computed from PV cta_tile, not QK cta_tile.** Using QK's cta_tile for the epilogue produces `epi_tile=(128,128)` which is wrong for a (128,64) output. Computing from PV's cta_tile gives `epi_tile=(128:1, 32:1)`. This fix alone improved cosine from 0.01 to 0.876, but the TMEM layout mismatch remains.
+**Approach 3 (NOT YET TRIED): FMHA-style (128,16) PV MMA with N-tiling**
+- FMHA uses pv_mma_tiler = (128, 16, 128) with MN = (128, 16)
+- For head_dim=64, FMHA tiles the N dimension 4 times (64/16=4)
+- Requires restructuring the kernel to loop over N tiles
+- This is the "correct" approach but requires more code changes
+
+**Approach 4 (NOT YET TRIED): Fix softmax packing to write P in PV A-fragment layout**
+- Instead of composing tStS.layout → (128, 64), write P using a layout derived from the PV A-fragment
+- FMHA does this: `tStS_P_layout = cute.composition(tStS.layout, (128, tilePlikeFP32))`
+- The composition writes 64 packed FP32 columns that alias the PV A-fragment's 64 TMEM columns
+- This should fix the alias for (128,64) PV MMA, but still blocked by the epilogue deadlock

 **V SMEM Layouts (confirmed correct):**
 - `PV(128,64) V SMEM: outer=((64,16),1,8,1):((1,64),0,1024,0), inner=S<3,4,3>`
 - `PV(128,128) V SMEM: outer=(((64,2),16),1,8,1):(((1,8192),64),0,1024,0), inner=S<3,4,3>`

+**FMHA Softmax Packing Bridge (reference trace):**
+
+FMHA's softmax writes P in a packed format that aliases the PV A-fragment:
+```python
+# P store columns: qk_mma_tiler[1] * p_dtype.width // qk_acc_dtype.width
+# For 128 BF16 columns: 128 * 16 / 32 = 64 packed FP32 columns
+tilePlikeFP32 = qk_mma_tiler[1] // Float32.width * o_dtype.width  # = 64
+
+# Destination: packed-P physical TMEM view (64 FP32 columns, not 128)
+tStS_P_layout = cute.composition(tStS.layout, cute.make_layout((128, tilePlikeFP32)))
+tStS_P = cute.make_tensor(tStS.iterator + tmem_p0_offset, tStS_P_layout)
+
+# Register packing: Float32 backing → BF16 recast view
+tTMEM_STORErS_x4 = cute.make_rmem_tensor(tTMEM_STOREcS.shape, Float32)
+tTMEM_STORErS_x4_e = cute.make_tensor(
+    cute.recast_ptr(tTMEM_STORErS_x4.iterator, dtype=BFloat16),
+    tTMEM_LOADrS.layout)  # 128 BF16 logical elements
+```
+
+The consumer (PV A-fragment) reads from the same TMEM columns:
+```python
+tP = cute.make_tensor(tStS.iterator, p_tmem_layout_staged.outer)
+tOrP = pv_thr_mma.make_fragment_A(tP)[None, None, None, 0]
+tOrP0 = cute.make_tensor(
+    tOrP.iterator + Float32.width // BFloat16.width * tmem_p0_offset,
+    tOrP.layout)
+```
+
+The subtle point: `tilePlikeFP32 = 64` means "64 packed FP32 TMEM columns to store 128 logical BF16 P columns", NOT "PV output N=64". It's a coincidence that they're both 64 for head_dim=64.
+
 ---

 ### Bug 1: V B-Operand Must Be MN-Major — ✅ FIX APPLIED
@@ -99,6 +151,18 @@ The tmem allocation barrier has `num_threads = 32 * (mma_warp + epilogue_warps)`

 ---

+## ⛔ NEW DEADLOCK: (128,64) PV MMA Epilogue — UNRESOLVED (May 21, 06:45 UTC)
+
+v28 (128,64 PV MMA) and v29 (padded V 128x128) both deadlock inside `epilogue_tma_store`.
+
+**Not caused by the three known deadlock fixes** — all are applied.
+
+**Symptom**: `epilogue_tma_store` blocks on `acc_pipeline.consumer_wait()`. MMA warp appears stuck at `mma_si_prod.acquire_and_advance()` (2nd acquire, waiting for softmax). But EPI warps complete softmax successfully.
+
+**Bisect needed**: test_pv_diag.py (V=I 128x128, WORKS) → v29 (V=I 128x128, DEADLOCKS). Same input, same MMA configuration, same pipeline. Difference must be a subtle code issue.
+
+---
+
 ## ⛔ DEAD TEST: test_stage_b_v21.py — DELETED, DO NOT RECREATE

 v21 attempted both Bug 1 and Bug 2 fixes in a hand-rolled pipeline kernel. It deadlocks on GPU. Root cause: pipeline synchronization mismatch. **Do not recreate.** Write from scratch using fmha.py as the reference.
@@ -159,6 +223,14 @@ The softmax packing writes P using the QK C-fragment layout. The PV A-fragment r

 `compute_epilogue_tile_shape` must use PV's `cta_tile_shape_mnk`, not QK's. Also, `self.cta_tile_shape_mnk` must be set to PV's cta tile before calling `epilogue_tma_store` (it reads `gemm_kernel.cta_tile_shape_mnk` internally). FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly.

+### 11. GPU State Persists After Deadlocked Kernels
+
+After a kernel deadlock, the GPU may remain in a bad state. Kill all python processes using the GPU before running new tests. `nvidia-smi` shows hanging processes. Use `kill -9 <PID>` to clean up. A deadlocked kernel can also cause subsequent runs to fail even if the code is correct.
+
+### 12. V Padded to (128,128) Must Use MN-Major Strides
+
+When padding V from (64,128) to (128,128), the padded tensor MUST use MN-major strides (1, 128), not the default PyTorch row-major strides (128, 1). Use `as_strided` or `transpose()` to get the correct layout. The wrong layout causes `LayoutEnum.ROW_MAJOR` instead of `LayoutEnum.COL_MAJOR`, which the PV MMA's OperandMajorMode.MN does not expect.
+
 ---

 ## Architecture: Per-Tile Flow
@@ -198,6 +270,9 @@ After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast
 | `test_diag_layout.py` | (128,64) PV with epi from PV cta_tile | 0.876 | ❌ partial fix — epi correct, TMEM alias still broken |
 | `test_diag_smem_layout.py` | Print V SMEM layouts for (128,64) vs (128,128) | N/A | ℹ️ layouts confirmed correct |
 | `test_layout_compare.py` | Print TMEM layouts for QK S and PV A-fragment | N/A | ℹ️ layout inspection |
+| `test_stage_b_v28.py` | (128,64) PV MMA + epi from PV | — | ❌ DEADLOCK in epilogue_tma_store |
+| `test_stage_b_v29.py` | Padded V (128,128) + (128,128) PV MMA | — | ❌ DEADLOCK (likely code bug, not fundamental) |
+| `test_stage_b_v30.py` | Copy of test_pv_diag.py | 1.0 | ✅ PASS (sanity check) |

 ---

@@ -216,6 +291,9 @@ pv_mma_tiler = (qk_mma_tiler[0], qk_mma_tiler[2], qk_mma_tiler[1])

 FMHA passes `pv_mma_tiler[:2] = (128, head_dim)` to `make_trivial_tiled_mma`, NOT the QK tiler `(128, 128)`.

+### FMHA uses (128,16) PV MMA, NOT (128,64)
+FMHA uses pv_mma_tiler = (128, 16, 128) with MN = (128, 16). For head_dim=64, FMHA tiles the N dimension 4 times. The softmax writes P once, and each tile reads the same P with a different V slice.
+
 ### make_trivial_tiled_mma — Use New Overload
 ```python
 make_trivial_tiled_mma(a_dtype, b_dtype, a_leading_mode, b_leading_mode,