diff --git a/README.md b/README.md index 72f31fe5..f6157b4a 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use `cutlass.cute` (CuTeDSL) with Blackwell tensor cores. -## Status (May 21, 2026 — 05:15 UTC) +## Status (May 21, 2026 — 06:45 UTC) ### ✅ Stage A: Bare Q@K^T via tcgen05.mma → TMEM → GMEM — COMPLETE @@ -15,9 +15,9 @@ CuTeDSL kernels for DeepSeek-V4 (Blackwell B200, SM100). All kernels use `cutlas **Bug 1 (V MN-major): ✅ Fix applied.** **Bug 2 (softmax packing): ✅ Confirmed correct (V=I test: cosine 1.0).** **Bug 3 (ACCUMULATE): ✅ Fix applied.** -**Bug 4 (non-square PV): 🔨 ROOT CAUSE IDENTIFIED — TMEM layout mismatch.** +**Bug 4 (non-square PV): 🔨 ACTIVE — Two approaches attempted, both blocked.** -#### Bug 4 (CURRENT): PV MMA Broken for (128,64) Output — ROOT CAUSE IDENTIFIED +#### Bug 4 (CURRENT): PV MMA Broken for (128,64) Output **Root Cause: The (128,64) PV MMA's A-fragment reads P from TMEM with a different layout than the softmax packing writes it.** @@ -31,23 +31,75 @@ The softmax packing writes P using the **QK C-fragment layout** (MMA atom = (128 **C++ TMEM Fragment Layout (from mma_traits_sm100.hpp):** ```cpp -// For M_MMA = 128, N_MMA varies with the MMA atom's N dimension Layout tmem_atom = Layout>, Stride< _1, _128>>{}; ``` - QK C-fragment: N_MMA=128 → 128 TMEM columns, stride 128 - PV A-fragment (128,64): N_MMA=64 → 64 TMEM columns, stride 128 -When the softmax packing writes P at `tmem_p0_offset` using the QK C-fragment layout (N_MMA=128), P's (m,k) elements land at TMEM address `m + 128*k`. But the PV A-fragment (N_MMA=64) reads the same TMEM region as if P were stored with N_MMA=64, so it interprets the data with stride 64 instead of 128, causing the every-other-column effect (O[m,d] ≈ P[m, 2d]). +**Approach 1: (128,64) PV MMA — BLOCKED (v28, deadlocks)** +- Created PV MMA with `pv_mma_tiler[:2] = (128, 64)` as FMHA does +- Kernel compiles but **deadlocks at runtime** inside `epilogue_tma_store` +- The (128,64) PV MMA changes `tOtO` shape from (128,128) to (128,64), affecting TMEM allocation, epilogue partitioning, and `tCgC` partitioning +- The deadlock is NOT in the MMA or softmax — it's specifically in `epilogue_tma_store` which calls `acc_pipeline.consumer_wait()` +- Diagnostics show: MMA warp stuck at `mma_si acquire` (waiting for softmax), EPI warps complete softmax but deadlock in epilogue +- All three known deadlock fixes are applied (no cta_layout_vmnk on mma_si, TMA warp doesn't call wait_for_alloc, PipelineTmaStore) +- **New deadlock root cause unknown** — likely related to epilogue partitioning mismatch with (128,64) output shape -**Fix (not yet applied): The softmax packing must write P using the PV MMA's A-fragment layout, not the QK C-fragment layout.** FMHA does this correctly because its softmax writes P using a composition that matches the PV A-fragment — the `tStS_P` layout is derived from `tStS.layout` (QK C-fragment) but the TMEM store uses a C-fragment composition that's based on the PV MMA's tiling. The key is that FMHA's `tilePlikeFP32` computation adapts the packing width to match the PV output N. +**Approach 2: Pad V to (128,128), use (128,128) PV MMA — BLOCKED (v29, deadlocks)** +- Pad V from (64,128) to (128,128) with zeros, keep (128,128) PV MMA +- Should produce O=(128,128) where first 64 columns are correct +- **Also deadlocks**, even with V=I(128,128) which works in test_pv_diag.py +- test_pv_diag.py uses the exact same pipeline and kernel structure +- Difference between v29 and test_pv_diag: likely a subtle code issue (b_dtype vs q_dtype, pv_mma_tiler, etc.) +- **Needs bisecting** — slowly modify test_pv_diag into v29 to find the breaking change -**Additional fix: `epi_tile` must be computed from PV cta_tile, not QK cta_tile.** Using QK's cta_tile for the epilogue produces `epi_tile=(128,128)` which is wrong for a (128,64) output. Computing from PV's cta_tile gives `epi_tile=(128:1, 32:1)`. This fix alone improved cosine from 0.01 to 0.876, but the TMEM layout mismatch remains. +**Approach 3 (NOT YET TRIED): FMHA-style (128,16) PV MMA with N-tiling** +- FMHA uses pv_mma_tiler = (128, 16, 128) with MN = (128, 16) +- For head_dim=64, FMHA tiles the N dimension 4 times (64/16=4) +- Requires restructuring the kernel to loop over N tiles +- This is the "correct" approach but requires more code changes + +**Approach 4 (NOT YET TRIED): Fix softmax packing to write P in PV A-fragment layout** +- Instead of composing tStS.layout → (128, 64), write P using a layout derived from the PV A-fragment +- FMHA does this: `tStS_P_layout = cute.composition(tStS.layout, (128, tilePlikeFP32))` +- The composition writes 64 packed FP32 columns that alias the PV A-fragment's 64 TMEM columns +- This should fix the alias for (128,64) PV MMA, but still blocked by the epilogue deadlock **V SMEM Layouts (confirmed correct):** - `PV(128,64) V SMEM: outer=((64,16),1,8,1):((1,64),0,1024,0), inner=S<3,4,3>` - `PV(128,128) V SMEM: outer=(((64,2),16),1,8,1):(((1,8192),64),0,1024,0), inner=S<3,4,3>` +**FMHA Softmax Packing Bridge (reference trace):** + +FMHA's softmax writes P in a packed format that aliases the PV A-fragment: +```python +# P store columns: qk_mma_tiler[1] * p_dtype.width // qk_acc_dtype.width +# For 128 BF16 columns: 128 * 16 / 32 = 64 packed FP32 columns +tilePlikeFP32 = qk_mma_tiler[1] // Float32.width * o_dtype.width # = 64 + +# Destination: packed-P physical TMEM view (64 FP32 columns, not 128) +tStS_P_layout = cute.composition(tStS.layout, cute.make_layout((128, tilePlikeFP32))) +tStS_P = cute.make_tensor(tStS.iterator + tmem_p0_offset, tStS_P_layout) + +# Register packing: Float32 backing → BF16 recast view +tTMEM_STORErS_x4 = cute.make_rmem_tensor(tTMEM_STOREcS.shape, Float32) +tTMEM_STORErS_x4_e = cute.make_tensor( + cute.recast_ptr(tTMEM_STORErS_x4.iterator, dtype=BFloat16), + tTMEM_LOADrS.layout) # 128 BF16 logical elements +``` + +The consumer (PV A-fragment) reads from the same TMEM columns: +```python +tP = cute.make_tensor(tStS.iterator, p_tmem_layout_staged.outer) +tOrP = pv_thr_mma.make_fragment_A(tP)[None, None, None, 0] +tOrP0 = cute.make_tensor( + tOrP.iterator + Float32.width // BFloat16.width * tmem_p0_offset, + tOrP.layout) +``` + +The subtle point: `tilePlikeFP32 = 64` means "64 packed FP32 TMEM columns to store 128 logical BF16 P columns", NOT "PV output N=64". It's a coincidence that they're both 64 for head_dim=64. + --- ### Bug 1: V B-Operand Must Be MN-Major — ✅ FIX APPLIED @@ -99,6 +151,18 @@ The tmem allocation barrier has `num_threads = 32 * (mma_warp + epilogue_warps)` --- +## ⛔ NEW DEADLOCK: (128,64) PV MMA Epilogue — UNRESOLVED (May 21, 06:45 UTC) + +v28 (128,64 PV MMA) and v29 (padded V 128x128) both deadlock inside `epilogue_tma_store`. + +**Not caused by the three known deadlock fixes** — all are applied. + +**Symptom**: `epilogue_tma_store` blocks on `acc_pipeline.consumer_wait()`. MMA warp appears stuck at `mma_si_prod.acquire_and_advance()` (2nd acquire, waiting for softmax). But EPI warps complete softmax successfully. + +**Bisect needed**: test_pv_diag.py (V=I 128x128, WORKS) → v29 (V=I 128x128, DEADLOCKS). Same input, same MMA configuration, same pipeline. Difference must be a subtle code issue. + +--- + ## ⛔ DEAD TEST: test_stage_b_v21.py — DELETED, DO NOT RECREATE v21 attempted both Bug 1 and Bug 2 fixes in a hand-rolled pipeline kernel. It deadlocks on GPU. Root cause: pipeline synchronization mismatch. **Do not recreate.** Write from scratch using fmha.py as the reference. @@ -159,6 +223,14 @@ The softmax packing writes P using the QK C-fragment layout. The PV A-fragment r `compute_epilogue_tile_shape` must use PV's `cta_tile_shape_mnk`, not QK's. Also, `self.cta_tile_shape_mnk` must be set to PV's cta tile before calling `epilogue_tma_store` (it reads `gemm_kernel.cta_tile_shape_mnk` internally). FMHA sets `self.epi_tile = self.pv_mma_tiler[:2]` directly. +### 11. GPU State Persists After Deadlocked Kernels + +After a kernel deadlock, the GPU may remain in a bad state. Kill all python processes using the GPU before running new tests. `nvidia-smi` shows hanging processes. Use `kill -9 ` to clean up. A deadlocked kernel can also cause subsequent runs to fail even if the code is correct. + +### 12. V Padded to (128,128) Must Use MN-Major Strides + +When padding V from (64,128) to (128,128), the padded tensor MUST use MN-major strides (1, 128), not the default PyTorch row-major strides (128, 1). Use `as_strided` or `transpose()` to get the correct layout. The wrong layout causes `LayoutEnum.ROW_MAJOR` instead of `LayoutEnum.COL_MAJOR`, which the PV MMA's OperandMajorMode.MN does not expect. + --- ## Architecture: Per-Tile Flow @@ -198,6 +270,9 @@ After all tiles: epilogue warps tcgen05.ld tmem_output, divide by row_sum, cast | `test_diag_layout.py` | (128,64) PV with epi from PV cta_tile | 0.876 | ❌ partial fix — epi correct, TMEM alias still broken | | `test_diag_smem_layout.py` | Print V SMEM layouts for (128,64) vs (128,128) | N/A | ℹ️ layouts confirmed correct | | `test_layout_compare.py` | Print TMEM layouts for QK S and PV A-fragment | N/A | ℹ️ layout inspection | +| `test_stage_b_v28.py` | (128,64) PV MMA + epi from PV | — | ❌ DEADLOCK in epilogue_tma_store | +| `test_stage_b_v29.py` | Padded V (128,128) + (128,128) PV MMA | — | ❌ DEADLOCK (likely code bug, not fundamental) | +| `test_stage_b_v30.py` | Copy of test_pv_diag.py | 1.0 | ✅ PASS (sanity check) | --- @@ -216,6 +291,9 @@ pv_mma_tiler = (qk_mma_tiler[0], qk_mma_tiler[2], qk_mma_tiler[1]) FMHA passes `pv_mma_tiler[:2] = (128, head_dim)` to `make_trivial_tiled_mma`, NOT the QK tiler `(128, 128)`. +### FMHA uses (128,16) PV MMA, NOT (128,64) +FMHA uses pv_mma_tiler = (128, 16, 128) with MN = (128, 16). For head_dim=64, FMHA tiles the N dimension 4 times. The softmax writes P once, and each tile reads the same P with a different V slice. + ### make_trivial_tiled_mma — Use New Overload ```python make_trivial_tiled_mma(a_dtype, b_dtype, a_leading_mode, b_leading_mode,