Commit Graph

258 Commits

Author SHA1 Message Date
4fef047f5c CRITICAL FIX: remove extra scale_log2 in softmax (minus_row_max and acc_scale) 2026-05-22 18:52:58 +00:00
fe58d07b8c FIX: K slice (None,None,0,0) like working diag 2026-05-22 18:51:22 +00:00
12defe923d Diag: identity softmax on example6 pipeline to isolate softmax bug 2026-05-22 18:49:33 +00:00
d7e0aa6168 Quick test: working v3 with n=256 multi-tile 2026-05-22 18:47:54 +00:00
6a7b955c19 DEBUG: add version marker to confirm code changes are running 2026-05-22 18:47:05 +00:00
fb1057eb5f CRITICAL FIX: TMA pre-slice (None,0,None,0) → (None,None,0,0) to keep GMEM tile dim free 2026-05-22 18:45:55 +00:00
2deb28827a Diag: TMA shapes with hardcoded major modes 2026-05-22 18:43:55 +00:00
d2a3b83aa2 Diag: simplified TMA shape analysis 2026-05-22 18:43:30 +00:00
73109dde85 Diag: print TMA partition shapes for multi-tile debugging 2026-05-22 18:43:05 +00:00
375b24304b FIX: Use Python range() in TMA warp for concrete per-iteration GMEM coords 2026-05-22 18:26:31 +00:00
0dd6fefd66 FIX: Force SSA GMEM coord via n_kv_tiles - n_kv_tiles instead of cutlass.range kt 2026-05-22 18:25:13 +00:00
2a4731426e FIX: TMEM offset bug in O rescale/normalize — use tOtO0.iterator not tOtO.iterator 2026-05-22 18:23:46 +00:00
1e0805ad60 Diag: test n=384 (3 tiles) to find crash boundary 2026-05-22 18:07:07 +00:00
1aa4a91d01 Diag: test all sizes 128-1024 2026-05-22 18:06:28 +00:00
8586603280 DEBUG: disable O rescale to isolate NaN cause 2026-05-22 18:05:46 +00:00
45a0fb9971 Add NaN/inf checking to stage C test 2026-05-22 18:01:11 +00:00
147d85a617 CRITICAL FIX: K GMEM slice (None,None,0,0) not (None,0,None,0)
K from QK MMA B-partition has GMEM iter at mode 1, NOT mode 2.
(None,0,None,0) hardcodes mode 1 to 0 → TMA always loads tile 0.
(None,None,0,0) keeps mode 1 free → correct multi-tile loading.

Proof: diag n=256 went from cos 0.711 → 0.999999 with this one change.
2026-05-22 17:59:57 +00:00
bc3e94ff45 Diag: try K slice (None,None,0,0) keeping mode 1 (CUTLASS ref style) 2026-05-22 17:59:01 +00:00
601e662dd4 Diag: try runtime Int32(0+0) for kv_coord with cutlass.range 2026-05-22 17:57:58 +00:00
e5030cbea5 Diag: use Python range() unrolling like stage C test 2026-05-22 17:56:59 +00:00
5ee34d925b Fix diagnostic test: same Int32(kt) + n_kv_tiles fixes 2026-05-22 17:56:15 +00:00
d2bbdd59f6 Try cutlass.range with Int32(kt) — now n_kv_tiles is Python int 2026-05-22 17:51:25 +00:00
bf80fbee99 FIX: n_kv_tiles as Python int (s_k//128) for range() unrolling
cute.size() returns a CuTeDSL symbol, not a Python int.
range() on a symbol can't iterate — the loop never unrolls.
Now n_kv_tiles is computed in __init__ as s_k // 128 (Python int).
2026-05-22 17:50:07 +00:00
0b3bc3a16d Option 2: Python range() with Int32(kt) for TMA GMEM coord
cutlass.range traces once - kv_coord/kt are trace-time values,
not runtime loop-carried state. Python range() fully unrolls at
trace time, emitting distinct Int32(k) constants per iteration.
Int32(1) hardcoded already proved TMA CAN load from tile 1.
2026-05-22 17:47:43 +00:00
083e205aae Add example5: use cutlass.range induction variable as TMA GMEM coord 2026-05-22 17:47:10 +00:00
f80f8eb38f Clean up debug prints, set kv_coord as Int32(0)
Key findings to relay to CUTLASS LLM:
- kv_coord=Int32(1) hardcode CHANGES the output (TMA CAN load from different tiles)
- kv_coord=Int32(0) + kv_coord += 1 does NOT increment at runtime
  (all multi-tile outputs identical to kv_coord=0)
- kv_coord=0 (plain Python int) also doesn't work
- Pipeline handle .count doesn't work either
- The TMA GMEM tile coordinate must be dynamic at kernel runtime,
  but CuTeDSL appears to constant-fold or not propagate the increment
2026-05-22 17:39:27 +00:00
36cf1a363b DEBUG: try plain Python int kv_coord (like CUTLASS ref) 2026-05-22 17:34:30 +00:00
d95e2221c2 DEBUG: hardcode kv_coord=1 to test if TMA uses it 2026-05-22 17:32:53 +00:00
59746b46fc DEBUG: try K slice (None,0,None,0) keeping mode 2 free 2026-05-22 17:30:06 +00:00
b6cefba31c DEBUG: print tBgK/tVgV shapes before/after slice 2026-05-22 17:28:45 +00:00
ba2cefb668 Stage C: manual kv_coord + correct K GMEM slice + O rescale fence
Key fixes:
1. GMEM tile coord: manual Int32 kv_coord (not kvh.count)
2. K GMEM slice: (None,None,0,0) keeps mode 1 free (GMEM iter)
3. V GMEM slice: (None,0,None,0) keeps mode 2 free (GMEM iter)
4. Add fence_view_async_tmem_load before O rescale for visibility
2026-05-22 17:26:56 +00:00
5a6c4d2cd2 Add example4: manual kv_coord Int32 for GMEM tile indexing
Pipeline handle .count is NOT a GMEM coordinate — it's opaque pipeline
state. CUTLASS FMHA reference uses a manual Int32 counter (kv_coord)
for GMEM and handle.index for SMEM. .count is never used as a
coordinate anywhere in the reference.
2026-05-22 17:24:38 +00:00
a950f978d3 run_test.sh: SIGKILL all children of screen session on cleanup
Deadlocked GPU processes ignore SIGHUP from screen -X quit.
Now kills the entire process group with SIGKILL, plus a catch-all
pkill for any python test_ processes.
2026-05-22 17:08:12 +00:00
ebb5d1ea23 Add check_log.sh convenience script 2026-05-22 17:07:23 +00:00
b1a37bd2dd Fix quoting in run_test.sh 2026-05-22 17:06:00 +00:00
6594e31db5 Add run_test.sh harness (screen + log) 2026-05-22 17:05:43 +00:00
4f6853e1ae FIX: only slice GMEM tensors (SMEM already 2D from tma_partition) 2026-05-22 16:57:31 +00:00
c61590ac6d FIX: consistent GMEM/SMEM slicing for K and V TMA partitions
Both GMEM and SMEM sides must be sliced to the same rank for cute.copy.
K (QK MMA B-partition): slice [(None,None,0,0)] keeps modes 0,1
  - mode 1 = GMEM iteration, indexed by kvh.count
V (PV MMA B-partition): slice [(None,0,None,0)] keeps modes 0,2
  - mode 2 = GMEM iteration, indexed by kvh.count
Q: only 1 tile, (None,0,None,0) hardcode is fine.
2026-05-22 16:56:38 +00:00
7aaf9ccbda FIX: keep GMEM iteration dimension FREE in TMA K/V partition slices
Root cause of multi-tile failure: (None,0,None,0) slice hardcodes the
GMEM tile dimension to 0, so TMA always loads from tile 0 regardless
of kvh.count. K from QK MMA has GMEM iter at mode 1, V from PV MMA
has it at mode 2 (different layouts: K,D,L vs D,K,L).

Fix follows CUTLASS FMHA reference:
- K: tBgK[(None,None,None,0)] + tBgK[(None, kvh.count, None)]
- V: tVgV[(None,0,None,0)] + tVgV[(None, kvh.count)]
2026-05-22 16:51:57 +00:00
04da36e18c Add diagnostic test for multi-tile TMA pipeline (identity softmax) 2026-05-22 16:47:08 +00:00
b50968dfaf FIX: acc_scale was double-multiplying by scale_log2
row_max is already in log2(scaled) space (S * scale_log2), so
old_row_max - row_max_safe is the correct exponent for exp2.
The old code computed exp2(scale_log2 * (old_row_max - row_max_safe))
which is exp2(scale_log2^2 * (old_max_S - new_max_S)) — wrong.
2026-05-22 16:42:45 +00:00
c9fe26a5fc Stage C: integrate example3 multi-tile fixes into unit test
- Combined K+V barrier (one acquire per kt, kvh.count == kt)
- O rescale for kt > 0 (online softmax O correction)
- final_o_bar sync (MMA signals before producer_tail)
- s_k as constructor param (compile-time for V layout)
- kv_tx_bytes covers both K and V transfers
- Test covers n=128, 256, 512, 1024
2026-05-22 16:39:45 +00:00
e5c02caed4 FMHA Stage-C multi-tile: combined K+V barrier, final_o_bar, acc_pipe producer
Key changes from Mike:
1. Combined K+V TMA barrier: one acquire per kt, both cute.copys share
   kvh.barrier. kvh.count naturally == kt (no interleaving problem).
   tx_count = K_bytes + V_bytes. Also fixes the sK[0]/sV[1] slot quirk.
2. final_o_bar NamedBarrier: MMA .arrive() after acc_pipe.producer_tail;
   softmax .arrive_and_wait() before reading O for normalize. Prevents
   softmax racing MMA's PV[N-1] on the final O read.
3. acc_pipe producer in MMA: producer_acquire before loop, commit+advance
   after loop, producer_tail after. Consumer in epilogue as before.
4. O rescale re-enabled for kt>0 with acc_scale before softmax_done_bar.
2026-05-22 16:23:36 +00:00
452ba604fc restore tBgK to kh.count indexing (single-tile working), add TODO for multi-tile
CuTeDSL TMA copy API doesn't support dynamic GMEM tile indexing.
kh.count works for single tile. For multi-tile, need to either:
1. Map pipeline count to tile index (kh.count // 2 for interleaved K/V)
2. Separate K and V into non-interleaved TMA loops
3. Use gK/gV layouts that iterate naturally with pipeline count

This is the architectural blocker for multi-tile FMHA.
2026-05-22 15:54:03 +00:00
07817ae82e FIX: use unsliced tBgK with (None, kt, None, 0) for proper GMEM tile indexing
The pre-slice (None,0,None,0) hardcoded GMEM iteration to tile 0.
Instead, keep the original tBgK and index with (None, kt, None, 0)
inside the TMA loop, where kt selects the correct GMEM tile.
This preserves 2D rank matching with the SMEM tensor.
2026-05-22 15:52:56 +00:00
1ad243f095 CRITICAL FIX: keep GMEM iteration dim free in tBgK/tVgV slice
The slice (None,0,None,0) was hardcoding the GMEM iteration dim to 0,
meaning TMA always loaded K/V from tile 0 regardless of kt.
Changed to (None,None,None,0) to keep gmem_iter free,
then index with (None, kt, None) in the TMA copy loop.

This is the root cause of multi-tile failure: TMA was always reading
the first 128 tokens for ALL KV tiles.
2026-05-22 15:52:06 +00:00
32412b2250 add explicit acc_pipe.consumer_wait before final normalize
Race condition: softmax reads O to normalize while MMA may still be
writing PV[N-1]. Single-tile wins by luck; multi-tile drifts.
Move acc_cons_st construction before the wait so epilogue reuses it.
2026-05-22 15:49:48 +00:00
3f7addb83a FMHA Stage-C multi-tile: Fix 1 (s_k=n), Fix 2 (TMA kt indexing), Fix 3 (O rescale)
Fix 1: s_k must equal actual n. With s_k < n, v_fmha layout only spans
first s_k V tokens and TMA reads OOB on later tiles.

Fix 2: TMA producer indexes K and V by kt (loop variable), NOT by the
pipeline's interleaved count. The kv pipeline interleaves K and V, so
pipeline count goes 0,1,2,3 but GMEM tiles should be K[0],V[0],K[1],V[1].

Fix 3: Online O rescale before softmax_done_bar. When row_max grows,
O must be multiplied by exp2(old_max - new_max) before MMA starts next PV.
2026-05-22 15:41:14 +00:00
ad2a494968 Revert "debug: test 12w identity softmax with n=256 to verify multi-tile pipeline"
This reverts commit 6cf8702e3c.
2026-05-22 10:25:48 +00:00
24a807eae2 debug: test 12w identity softmax with n=256 to verify multi-tile pipeline 2026-05-22 10:24:53 +00:00