ebb5d1ea23
Add check_log.sh convenience script
biondizzle
2026-05-22 17:07:23 +00:00
b1a37bd2dd
Fix quoting in run_test.sh
biondizzle
2026-05-22 17:06:00 +00:00
6594e31db5
Add run_test.sh harness (screen + log)
biondizzle
2026-05-22 17:05:43 +00:00
4f6853e1ae
FIX: only slice GMEM tensors (SMEM already 2D from tma_partition)
biondizzle
2026-05-22 16:57:31 +00:00
c61590ac6d
FIX: consistent GMEM/SMEM slicing for K and V TMA partitions
biondizzle
2026-05-22 16:56:38 +00:00
7aaf9ccbda
FIX: keep GMEM iteration dimension FREE in TMA K/V partition slices
biondizzle
2026-05-22 16:51:57 +00:00
04da36e18c
Add diagnostic test for multi-tile TMA pipeline (identity softmax)
biondizzle
2026-05-22 16:47:08 +00:00
b50968dfaf
FIX: acc_scale was double-multiplying by scale_log2
biondizzle
2026-05-22 16:42:45 +00:00
c9fe26a5fc
Stage C: integrate example3 multi-tile fixes into unit test
biondizzle
2026-05-22 16:39:45 +00:00
793e3243d5
README + MEMORY: update Stage C status to single-tile only, document multi-tile blocker
biondizzle
2026-05-22 16:32:31 +00:00
e5c02caed4
FMHA Stage-C multi-tile: combined K+V barrier, final_o_bar, acc_pipe producer
biondizzle
2026-05-22 16:23:36 +00:00
452ba604fc
restore tBgK to kh.count indexing (single-tile working), add TODO for multi-tile
biondizzle
2026-05-22 15:54:03 +00:00
07817ae82e
FIX: use unsliced tBgK with (None, kt, None, 0) for proper GMEM tile indexing
biondizzle
2026-05-22 15:52:56 +00:00
1ad243f095
CRITICAL FIX: keep GMEM iteration dim free in tBgK/tVgV slice
biondizzle
2026-05-22 15:52:06 +00:00
32412b2250
add explicit acc_pipe.consumer_wait before final normalize
biondizzle
2026-05-22 15:49:48 +00:00
3f7addb83a
FMHA Stage-C multi-tile: Fix 1 (s_k=n), Fix 2 (TMA kt indexing), Fix 3 (O rescale)
biondizzle
2026-05-22 15:41:14 +00:00
ad2a494968
Revert "debug: test 12w identity softmax with n=256 to verify multi-tile pipeline"
biondizzle
2026-05-22 10:25:48 +00:00
24a807eae2
debug: test 12w identity softmax with n=256 to verify multi-tile pipeline
biondizzle
2026-05-22 10:24:53 +00:00
572656e79b
debug: disable O rescaling to test multi-tile pipeline baseline
biondizzle
2026-05-22 10:23:37 +00:00
8ce257150e
fix: revert to scaled row_max, use exp2(old_max - new_max) for O rescaling
biondizzle
2026-05-22 10:22:44 +00:00
e85d50dc3b
fix: compute row_max from RAW S values, not scaled
biondizzle
2026-05-22 10:21:50 +00:00
0bcb5aba2b
fix: missing newline after self.s_k = s_k
biondizzle
2026-05-22 10:20:35 +00:00
1982cc4d39
fix: add s_k param to FmhaV3StageC, use self.s_k for V FMHA reconstruction
biondizzle
2026-05-22 10:19:49 +00:00
b80a1ab083
Stage C: add online O rescaling for multi-tile KV + test n=256
biondizzle
2026-05-22 10:19:08 +00:00
55beaeb2a5
fix: add epilogue warp to tmem_bar, restore wait_for_alloc in epilogue
biondizzle
2026-05-22 10:17:02 +00:00
6514888a5c
fix: add softmax_done_bar to synchronize MMA PV with softmax P production
biondizzle
2026-05-22 10:15:26 +00:00
fdea390c71
fix: epilogue warp self-signals acc_pipe producer before consuming
biondizzle
2026-05-22 10:11:55 +00:00
18ab3396b7
fix: remove duplicate tmem free from epilogue (MMA warp handles dealloc)
biondizzle
2026-05-22 10:05:52 +00:00
1994b2ae46
fix: add acc_pipe pipeline for epilogue, matching 12w pattern
biondizzle
2026-05-22 10:03:08 +00:00
925d85820b
fix: epilogue_warp_id must be tuple for epilogue_tma_store, check with [0]
biondizzle
2026-05-22 09:59:20 +00:00
23421bc282
fix: epilogue warp reuse mma_corr_cons pipeline instead of creating new one from st
biondizzle
2026-05-22 09:56:18 +00:00
5b32490b15
fix: define cS and tScS in correction warps (not visible across if blocks)
biondizzle
2026-05-22 09:52:59 +00:00
a5a9413aa5
fix: correct @cute.kernel indentation
biondizzle
2026-05-22 09:49:36 +00:00
cf900a22fe
fix: remove duplicate @cute.kernel decorator
biondizzle
2026-05-22 09:46:09 +00:00
bfc1518046
FMHA Stage-C2: production 12-warp pipeline with correction warps
biondizzle
2026-05-22 09:42:39 +00:00
35d532c742
README: update Stage C status to WORKING, add CuTeDSL constraints and target architecture
biondizzle
2026-05-22 09:39:15 +00:00
347c107394
test: add multiple seeds to verify softmax consistency
biondizzle
2026-05-22 09:32:08 +00:00
d3682b0c33
fix: use plain range loop for row_max (fmax not allowed in vectorized)
biondizzle
2026-05-22 09:31:07 +00:00
235c7850df
fix: add missing old_row_max = row_max before softmax max computation
biondizzle
2026-05-22 09:30:32 +00:00
35056300cb
fix vectorize issue: remove vectorize from exp2 pass, add row_sum accumulation
biondizzle
2026-05-22 09:29:43 +00:00
c5a504d064
fix: use cute.arch.fmax instead of if-else in vectorized loop
biondizzle
2026-05-22 09:28:32 +00:00
6f4bb0842e
softmax: element-wise row_max computation instead of .reduce()
biondizzle
2026-05-22 09:27:36 +00:00
9e145c35f1
fix O normalization: use direct rmem tensor from partition_D shape
biondizzle
2026-05-22 09:23:58 +00:00
9ea5551241
FMHA Stage-C: real softmax + O normalization in 6-warp layout
biondizzle
2026-05-22 09:22:56 +00:00
aaa68634d4
fix: use make_smem_layout_epi not make_epilogue_smem_layout
biondizzle
2026-05-22 09:19:12 +00:00
054bf99436
FMHA v3 Stage-C full: 12-warp pipeline with real softmax + correction + epilogue
biondizzle
2026-05-22 09:18:56 +00:00
fbe1c8ee49
more stuff
biondizzle
2026-05-22 08:57:38 +00:00
187d9e231c
FMHA v3: per-row min test + explicit loop replacements
biondizzle
2026-05-22 07:29:04 +00:00
5c2d9ad312
FMHA v3: per-row patch from Mike + deadlock fix + V layout fix
biondizzle
2026-05-22 07:09:52 +00:00
5f1922da3e
FMHA v3: add debug variants for C9 normalization investigation
biondizzle
2026-05-22 05:52:10 +00:00
7d41f4861a
Fix indexer score kernel: use static shared memory, correct FP4 head offsets
biondizzle
2026-05-22 01:45:05 +00:00
c2f705a21a
Indexer: score+topk kernel, gather KV, compute_valid_lens
biondizzle
2026-05-22 01:20:39 +00:00
0f539e4855
Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation
biondizzle
2026-05-22 00:25:47 +00:00
b4d58df620
KV Cache: schema, allocator, pools, manager, append_swa kernel
biondizzle
2026-05-22 00:08:38 +00:00
4453d7475a
Fix layer construction: match existing API signatures, add RMSNorm impl
biondizzle
2026-05-21 23:31:58 +00:00
d5ec0e5133
Clean up: remove debug/temp files and dangling test kernels
biondizzle
2026-05-21 23:26:50 +00:00
97a1b11f41
10-warp debug: MMA=warp4 TMA=warp5 idle=6-9 still gives cosine 0.29
biondizzle
2026-05-21 23:24:44 +00:00
66a89859ed
Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer
biondizzle
2026-05-21 23:11:09 +00:00
dd364b6d4d
10-warp idle test: no crash but cosine 0.29 (6-warp gives 0.999999)
biondizzle
2026-05-21 22:07:53 +00:00
0d06e55770
Router: Blackwell-native fused decode kernel — real CuTeDSL implementation
biondizzle
2026-05-21 22:04:20 +00:00
9c39f48443
Router: clean up dense_router_decode.py — realistic architecture, no fake code
biondizzle
2026-05-21 21:58:31 +00:00
abfe4485f7
Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill
biondizzle
2026-05-21 21:54:05 +00:00
c97661994e
WIP: correction warp group architecture - compiles, illegal address at runtime
biondizzle
2026-05-21 21:20:39 +00:00
d2a16daf70
BREAKTHROUGH: cosine 0.993 for n=128! PV-partitioned P row sum works.
biondizzle
2026-05-21 20:13:51 +00:00
7189165a67
WIP: TMEM vector bridge not working (same cosine 0.513)
biondizzle
2026-05-21 19:26:15 +00:00
26f6c1ba7f
WIP: confirmed row_sum is wrong (5.5 vs correct 29.22 for row 0)
biondizzle
2026-05-21 19:16:15 +00:00
4251af1f14
WIP: scalar C9 normalization - confirmed inv_row_sum is wrong
biondizzle
2026-05-21 19:09:32 +00:00
8612bc5426
WIP: QK-partitioned C9 normalization (does not work)
biondizzle
2026-05-21 18:59:21 +00:00
d7aa4da686
BREAKTHROUGH: unnormalized P@V cosine 0.999998 for n=128!
biondizzle
2026-05-21 18:55:00 +00:00
a983a8fb41
WIP: TMEM vector for per-row row_sum (not yet working)
biondizzle
2026-05-21 18:45:30 +00:00
331d9e95f3
WIP: Stage C softmax - partial progress
biondizzle
2026-05-21 18:04:21 +00:00
84cd636ba9
Stage C fixes: pv_done_bar sync, acc_scale with scale, fastmath=True
biondizzle
2026-05-21 17:58:04 +00:00
52b46a2dee
Stage C: add validation harness with real softmax reference (C1)
biondizzle
2026-05-21 17:49:26 +00:00
96f900f5f0
README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA)
biondizzle
2026-05-21 17:40:25 +00:00
2ec32eb8da
README: update for new dsv4/ package structure
biondizzle
2026-05-21 17:34:40 +00:00
3fb3c925af
Restructure: cutedsl/ -> dsv4/ with proper layering
biondizzle
2026-05-21 17:30:44 +00:00
99e143dd0e
Fix: add scale_softmax_log2, use O TMEM rescale for C9 normalization
biondizzle
2026-05-21 17:15:15 +00:00
df04ba40ee
Stage C: online softmax kernel (WIP) - test_fmha_v3_softmax.py
biondizzle
2026-05-21 17:10:58 +00:00
20564425ec
README: full roadmap — Stage C (real softmax), D (paged KV), E (production kernel)
biondizzle
2026-05-21 15:43:01 +00:00
ad24792fc7
Update both READMEs: Stage B complete, document TMEM overlap root cause
biondizzle
2026-05-21 15:36:06 +00:00
2030d41e41
Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999
biondizzle
2026-05-21 15:32:49 +00:00
0f4f69907e
STAGE B BUG 4b FIXED: TMEM P/O overlap + FMHA V reconstruction
biondizzle
2026-05-21 15:30:24 +00:00
4564758466
Stage B Bug 4b debugging: P/A alias proven working, V layout issue for (128,64) PV
biondizzle
2026-05-21 15:20:14 +00:00
81d5d8d04c
FMHA v3: KV-tile interleaving pipeline - QK works, Bug 4b blocks PV
biondizzle
2026-05-21 12:52:29 +00:00
73e03cfa6d
Stage B: PV(128,64) test + v2 pipeline fixes
biondizzle
2026-05-21 11:49:06 +00:00
61b23efbcf
stuff and stuff
biondizzle
2026-05-21 10:50:30 +00:00
d72f854efb
FMHA v1: pv_mma_tiler=(128,64,128) works with V=I, fails with real V (SMEM layout bug)
biondizzle
2026-05-21 10:47:46 +00:00
750f1f09c9
README: Bug 4 ROOT CAUSE CONFIRMED - V SMEM 1 K-tile + PV 8 K-phases mismatch. Zero-pad V workaround correct.
biondizzle
2026-05-21 09:59:37 +00:00
dbb240adc9
Root cause FOUND: V SMEM only holds 1 K-tile (2048 BF16), but PV MMA iterates 8 K-phases. For non-(128,128) PV, most K-phases read wrong V data. Zero-padded V works because V is (128,128) covering all 8 K-phases. FMHA interleaves QK+PV per KV-tile to avoid this.
biondizzle
2026-05-21 09:56:54 +00:00
50e9b5da81
README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D.
biondizzle
2026-05-21 09:47:08 +00:00
d4934371d0
Key finding: PV A-fragment layout is IDENTICAL for (128,128)/(128,32)/(128,16) PV. Bug is NOT TMEM alias. cta_tile_shape_mnk wrong for non-(128,128) PV. V SMEM and O C-fragment sizes look correct. Debugging V/epilogue paths.
biondizzle
2026-05-21 09:44:22 +00:00
422af26024
Update README: Bug 4 status, (128,16) PV zero output, (128,128) PV zero-pad workaround (cosine 1.0)
biondizzle
2026-05-21 09:20:09 +00:00
781684dd89
TMEM alias analysis: (128,16) PV broken, (128,128) PV with zero-pad works. Root cause: PV A-fragment layout differs from QK C-fragment layout for (128,16) PV, causing TMEM column mismatch. Using (128,128) PV as workaround.
biondizzle
2026-05-21 09:10:12 +00:00
96e7210db7
Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout.
biondizzle
2026-05-21 09:00:42 +00:00
ad3f63033d
Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment.
biondizzle
2026-05-21 08:45:49 +00:00
5e37ea56e4
FOOTGUN #0 : num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README.
biondizzle
2026-05-21 07:13:14 +00:00
dd8d872bec
v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0
biondizzle
2026-05-21 07:08:29 +00:00
b9b1b808a5
README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns
biondizzle
2026-05-21 06:46:02 +00:00
f1c4ee0e4d
v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV
biondizzle
2026-05-21 06:40:27 +00:00
4968ce064d
even more stuff
biondizzle
2026-05-21 05:55:22 +00:00
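
Several commits in the log above concern online O rescaling across KV tiles: b80a1ab083 adds it, 8ce257150e settles on a scaled row_max with exp2(old_max - new_max), and b50968dfaf fixes acc_scale double-multiplying by scale_log2. The sketch below is not the repository's kernel. It is a pure-Python illustration of that arithmetic for one query row, assuming the standard online-softmax recurrence. The function names and the `tile` parameter are hypothetical; only `row_max`, `row_sum`, `acc_scale`, and `scale_log2` echo names from the commit subjects.

```python
import math


def attention_row_reference(q, K, V, scale):
    """Plain softmax(scale * q.K^T) @ V for a single query row."""
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    denom = sum(p)
    return [sum(p[j] * V[j][c] for j in range(len(V))) / denom
            for c in range(len(V[0]))]


def attention_row_online(q, K, V, scale, tile=2):
    """Same result, computed one KV tile at a time with O rescaling."""
    scale_log2 = scale * math.log2(math.e)  # lets exp2 stand in for exp
    row_max = -math.inf  # running max, kept in the scaled (log2) domain
    row_sum = 0.0        # running sum of unnormalized probabilities
    o = [0.0] * len(V[0])  # unnormalized output accumulator

    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]

        # Scaled scores for this tile: scale_log2 is applied exactly once.
        s = [scale_log2 * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        old_max = row_max
        row_max = max(old_max, max(s))

        # Both maxima are already scaled, so the rescale factor is
        # exp2(old_max - new_max) with no further scale_log2 multiply.
        acc_scale = 0.0 if old_max == -math.inf else 2.0 ** (old_max - row_max)

        p = [2.0 ** (x - row_max) for x in s]

        # Rescale what was accumulated so far, then add this tile.
        row_sum = row_sum * acc_scale + sum(p)
        o = [o[c] * acc_scale + sum(p[j] * v_tile[j][c] for j in range(len(p)))
             for c in range(len(o))]

    # Final normalization by the accumulated row sum.
    return [x / row_sum for x in o]
```

With a single tile covering all keys, `acc_scale` is never applied to a nonzero accumulator, which is consistent with a single-tile test passing while a multi-tile one fails on a rescaling bug.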