nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

ebb5d1ea23 Add check_log.sh convenience script biondizzle 2026-05-22 17:07:23 +00:00
b1a37bd2dd Fix quoting in run_test.sh biondizzle 2026-05-22 17:06:00 +00:00
6594e31db5 Add run_test.sh harness (screen + log) biondizzle 2026-05-22 17:05:43 +00:00
4f6853e1ae FIX: only slice GMEM tensors (SMEM already 2D from tma_partition) biondizzle 2026-05-22 16:57:31 +00:00
c61590ac6d FIX: consistent GMEM/SMEM slicing for K and V TMA partitions biondizzle 2026-05-22 16:56:38 +00:00
7aaf9ccbda FIX: keep GMEM iteration dimension FREE in TMA K/V partition slices biondizzle 2026-05-22 16:51:57 +00:00
04da36e18c Add diagnostic test for multi-tile TMA pipeline (identity softmax) biondizzle 2026-05-22 16:47:08 +00:00
b50968dfaf FIX: acc_scale was double-multiplying by scale_log2 biondizzle 2026-05-22 16:42:45 +00:00
c9fe26a5fc Stage C: integrate example3 multi-tile fixes into unit test biondizzle 2026-05-22 16:39:45 +00:00
793e3243d5 README + MEMORY: update Stage C status to single-tile only, document multi-tile blocker biondizzle 2026-05-22 16:32:31 +00:00
e5c02caed4 FMHA Stage-C multi-tile: combined K+V barrier, final_o_bar, acc_pipe producer biondizzle 2026-05-22 16:23:36 +00:00
452ba604fc restore tBgK to kh.count indexing (single-tile working), add TODO for multi-tile biondizzle 2026-05-22 15:54:03 +00:00
07817ae82e FIX: use unsliced tBgK with (None, kt, None, 0) for proper GMEM tile indexing biondizzle 2026-05-22 15:52:56 +00:00
1ad243f095 CRITICAL FIX: keep GMEM iteration dim free in tBgK/tVgV slice biondizzle 2026-05-22 15:52:06 +00:00
32412b2250 add explicit acc_pipe.consumer_wait before final normalize biondizzle 2026-05-22 15:49:48 +00:00
3f7addb83a FMHA Stage-C multi-tile: Fix 1 (s_k=n), Fix 2 (TMA kt indexing), Fix 3 (O rescale) biondizzle 2026-05-22 15:41:14 +00:00
ad2a494968 Revert "debug: test 12w identity softmax with n=256 to verify multi-tile pipeline" biondizzle 2026-05-22 10:25:48 +00:00
24a807eae2 debug: test 12w identity softmax with n=256 to verify multi-tile pipeline biondizzle 2026-05-22 10:24:53 +00:00
572656e79b debug: disable O rescaling to test multi-tile pipeline baseline biondizzle 2026-05-22 10:23:37 +00:00
8ce257150e fix: revert to scaled row_max, use exp2(old_max - new_max) for O rescaling biondizzle 2026-05-22 10:22:44 +00:00
e85d50dc3b fix: compute row_max from RAW S values, not scaled biondizzle 2026-05-22 10:21:50 +00:00
0bcb5aba2b fix: missing newline after self.s_k = s_k biondizzle 2026-05-22 10:20:35 +00:00
1982cc4d39 fix: add s_k param to FmhaV3StageC, use self.s_k for V FMHA reconstruction biondizzle 2026-05-22 10:19:49 +00:00
b80a1ab083 Stage C: add online O rescaling for multi-tile KV + test n=256 biondizzle 2026-05-22 10:19:08 +00:00
55beaeb2a5 fix: add epilogue warp to tmem_bar, restore wait_for_alloc in epilogue biondizzle 2026-05-22 10:17:02 +00:00
6514888a5c fix: add softmax_done_bar to synchronize MMA PV with softmax P production biondizzle 2026-05-22 10:15:26 +00:00
fdea390c71 fix: epilogue warp self-signals acc_pipe producer before consuming biondizzle 2026-05-22 10:11:55 +00:00
18ab3396b7 fix: remove duplicate tmem free from epilogue (MMA warp handles dealloc) biondizzle 2026-05-22 10:05:52 +00:00
1994b2ae46 fix: add acc_pipe pipeline for epilogue, matching 12w pattern biondizzle 2026-05-22 10:03:08 +00:00
925d85820b fix: epilogue_warp_id must be tuple for epilogue_tma_store, check with [0] biondizzle 2026-05-22 09:59:20 +00:00
23421bc282 fix: epilogue warp reuse mma_corr_cons pipeline instead of creating new one from st biondizzle 2026-05-22 09:56:18 +00:00
5b32490b15 fix: define cS and tScS in correction warps (not visible across if blocks) biondizzle 2026-05-22 09:52:59 +00:00
a5a9413aa5 fix: correct @cute.kernel indentation biondizzle 2026-05-22 09:49:36 +00:00
cf900a22fe fix: remove duplicate @cute.kernel decorator biondizzle 2026-05-22 09:46:09 +00:00
bfc1518046 FMHA Stage-C2: production 12-warp pipeline with correction warps biondizzle 2026-05-22 09:42:39 +00:00
35d532c742 README: update Stage C status to WORKING, add CuTeDSL constraints and target architecture biondizzle 2026-05-22 09:39:15 +00:00
347c107394 test: add multiple seeds to verify softmax consistency biondizzle 2026-05-22 09:32:08 +00:00
d3682b0c33 fix: use plain range loop for row_max (fmax not allowed in vectorized) biondizzle 2026-05-22 09:31:07 +00:00
235c7850df fix: add missing old_row_max = row_max before softmax max computation biondizzle 2026-05-22 09:30:32 +00:00
35056300cb fix vectorize issue: remove vectorize from exp2 pass, add row_sum accumulation biondizzle 2026-05-22 09:29:43 +00:00
c5a504d064 fix: use cute.arch.fmax instead of if-else in vectorized loop biondizzle 2026-05-22 09:28:32 +00:00
6f4bb0842e softmax: element-wise row_max computation instead of .reduce() biondizzle 2026-05-22 09:27:36 +00:00
9e145c35f1 fix O normalization: use direct rmem tensor from partition_D shape biondizzle 2026-05-22 09:23:58 +00:00
9ea5551241 FMHA Stage-C: real softmax + O normalization in 6-warp layout biondizzle 2026-05-22 09:22:56 +00:00
aaa68634d4 fix: use make_smem_layout_epi not make_epilogue_smem_layout biondizzle 2026-05-22 09:19:12 +00:00
054bf99436 FMHA v3 Stage-C full: 12-warp pipeline with real softmax + correction + epilogue biondizzle 2026-05-22 09:18:56 +00:00
fbe1c8ee49 more stuff biondizzle 2026-05-22 08:57:38 +00:00
187d9e231c FMHA v3: per-row min test + explicit loop replacements biondizzle 2026-05-22 07:29:04 +00:00
5c2d9ad312 FMHA v3: per-row patch from Mike + deadlock fix + V layout fix biondizzle 2026-05-22 07:09:52 +00:00
5f1922da3e FMHA v3: add debug variants for C9 normalization investigation biondizzle 2026-05-22 05:52:10 +00:00
7d41f4861a Fix indexer score kernel: use static shared memory, correct FP4 head offsets biondizzle 2026-05-22 01:45:05 +00:00
c2f705a21a Indexer: score+topk kernel, gather KV, compute_valid_lens biondizzle 2026-05-22 01:20:39 +00:00
0f539e4855 Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation biondizzle 2026-05-22 00:25:47 +00:00
b4d58df620 KV Cache: schema, allocator, pools, manager, append_swa kernel biondizzle 2026-05-22 00:08:38 +00:00
4453d7475a Fix layer construction: match existing API signatures, add RMSNorm impl biondizzle 2026-05-21 23:31:58 +00:00
d5ec0e5133 Clean up: remove debug/temp files and dangling test kernels biondizzle 2026-05-21 23:26:50 +00:00
97a1b11f41 10-warp debug: MMA=warp4 TMA=warp5 idle=6-9 still gives cosine 0.29 biondizzle 2026-05-21 23:24:44 +00:00
66a89859ed Layer dispatch: config, schedule, attention/FFN sub-blocks, TransformerLayer biondizzle 2026-05-21 23:11:09 +00:00
dd364b6d4d 10-warp idle test: no crash but cosine 0.29 (6-warp gives 0.999999) biondizzle 2026-05-21 22:07:53 +00:00
0d06e55770 Router: Blackwell-native fused decode kernel — real CuTeDSL implementation biondizzle 2026-05-21 22:04:20 +00:00
9c39f48443 Router: clean up dense_router_decode.py — realistic architecture, no fake code biondizzle 2026-05-21 21:58:31 +00:00
abfe4485f7 Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill biondizzle 2026-05-21 21:54:05 +00:00
c97661994e WIP: correction warp group architecture - compiles, illegal address at runtime biondizzle 2026-05-21 21:20:39 +00:00
d2a16daf70 BREAKTHROUGH: cosine 0.993 for n=128! PV-partitioned P row sum works. biondizzle 2026-05-21 20:13:51 +00:00
7189165a67 WIP: TMEM vector bridge not working (same cosine 0.513) biondizzle 2026-05-21 19:26:15 +00:00
26f6c1ba7f WIP: confirmed row_sum is wrong (5.5 vs correct 29.22 for row 0) biondizzle 2026-05-21 19:16:15 +00:00
4251af1f14 WIP: scalar C9 normalization - confirmed inv_row_sum is wrong biondizzle 2026-05-21 19:09:32 +00:00
8612bc5426 WIP: QK-partitioned C9 normalization (does not work) biondizzle 2026-05-21 18:59:21 +00:00
d7aa4da686 BREAKTHROUGH: unnormalized P@V cosine 0.999998 for n=128! biondizzle 2026-05-21 18:55:00 +00:00
a983a8fb41 WIP: TMEM vector for per-row row_sum (not yet working) biondizzle 2026-05-21 18:45:30 +00:00
331d9e95f3 WIP: Stage C softmax - partial progress biondizzle 2026-05-21 18:04:21 +00:00
84cd636ba9 Stage C fixes: pv_done_bar sync, acc_scale with scale, fastmath=True biondizzle 2026-05-21 17:58:04 +00:00
52b46a2dee Stage C: add validation harness with real softmax reference (C1) biondizzle 2026-05-21 17:49:26 +00:00
96f900f5f0 README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA) biondizzle 2026-05-21 17:40:25 +00:00
2ec32eb8da README: update for new dsv4/ package structure biondizzle 2026-05-21 17:34:40 +00:00
3fb3c925af Restructure: cutedsl/ -> dsv4/ with proper layering biondizzle 2026-05-21 17:30:44 +00:00
99e143dd0e Fix: add scale_softmax_log2, use O TMEM rescale for C9 normalization biondizzle 2026-05-21 17:15:15 +00:00
df04ba40ee Stage C: online softmax kernel (WIP) - test_fmha_v3_softmax.py biondizzle 2026-05-21 17:10:58 +00:00
20564425ec README: full roadmap — Stage C (real softmax), D (paged KV), E (production kernel) biondizzle 2026-05-21 15:43:01 +00:00
ad24792fc7 Update both READMEs: Stage B complete, document TMEM overlap root cause biondizzle 2026-05-21 15:36:06 +00:00
2030d41e41 Fix TMEM overlap in test_pv64_with_softmax.py too — cosine 0.999999 biondizzle 2026-05-21 15:32:49 +00:00
0f4f69907e STAGE B BUG 4b FIXED: TMEM P/O overlap + FMHA V reconstruction biondizzle 2026-05-21 15:30:24 +00:00
4564758466 Stage B Bug 4b debugging: P/A alias proven working, V layout issue for (128,64) PV biondizzle 2026-05-21 15:20:14 +00:00
81d5d8d04c FMHA v3: KV-tile interleaving pipeline - QK works, Bug 4b blocks PV biondizzle 2026-05-21 12:52:29 +00:00
73e03cfa6d Stage B: PV(128,64) test + v2 pipeline fixes biondizzle 2026-05-21 11:49:06 +00:00
61b23efbcf stuff and stuff biondizzle 2026-05-21 10:50:30 +00:00
d72f854efb FMHA v1: pv_mma_tiler=(128,64,128) works with V=I, fails with real V (SMEM layout bug) biondizzle 2026-05-21 10:47:46 +00:00
750f1f09c9 README: Bug 4 ROOT CAUSE CONFIRMED - V SMEM 1 K-tile + PV 8 K-phases mismatch. Zero-pad V workaround correct. biondizzle 2026-05-21 09:59:37 +00:00
dbb240adc9 Root cause FOUND: V SMEM only holds 1 K-tile (2048 BF16), but PV MMA iterates 8 K-phases. For non-(128,128) PV, most K-phases read wrong V data. Zero-padded V works because V is (128,128) covering all 8 K-phases. FMHA interleaves QK+PV per KV-tile to avoid this. biondizzle 2026-05-21 09:56:54 +00:00
50e9b5da81 README: Bug 4 corrected — NOT TMEM alias. A-fragment identical for all PV sizes. Real bug in V/B or output C/D. biondizzle 2026-05-21 09:47:08 +00:00
d4934371d0 Key finding: PV A-fragment layout is IDENTICAL for (128,128)/(128,32)/(128,16) PV. Bug is NOT TMEM alias. cta_tile_shape_mnk wrong for non-(128,128) PV. V SMEM and O C-fragment sizes look correct. Debugging V/epilogue paths. biondizzle 2026-05-21 09:44:22 +00:00
422af26024 Update README: Bug 4 status, (128,16) PV zero output, (128,128) PV zero-pad workaround (cosine 1.0) biondizzle 2026-05-21 09:20:09 +00:00
781684dd89 TMEM alias analysis: (128,16) PV broken, (128,128) PV with zero-pad works. Root cause: PV A-fragment layout differs from QK C-fragment layout for (128,16) PV, causing TMEM column mismatch. Using (128,128) PV as workaround. biondizzle 2026-05-21 09:10:12 +00:00
96e7210db7 Debugging TMEM alias for (128,16) PV: zero output confirmed, PV reads from wrong TMEM columns. Need to align softmax P write with PV A-fragment layout. biondizzle 2026-05-21 09:00:42 +00:00
ad3f63033d Stage B N-tiling: (128,16) PV MMA compiles and runs, cosine 0.36 (TMEM alias mismatch bug). FMHA head_dim=64 passes. Debugging TMEM layout alignment. biondizzle 2026-05-21 08:45:49 +00:00
5e37ea56e4 FOOTGUN #0: num_tma_load_bytes MUST include V bytes. Fix v27, v29, comment all. Update README. biondizzle 2026-05-21 07:13:14 +00:00
dd8d872bec v29: FIX DEADLOCK - add V bytes to num_tma_load_bytes. V=I(128,128) cosine 1.0 biondizzle 2026-05-21 07:08:29 +00:00
b9b1b808a5 README: update with v28/v29 deadlock investigation, FMHA softmax bridge trace, new footguns biondizzle 2026-05-21 06:46:02 +00:00
f1c4ee0e4d v29 (padded V, deadlocks), v30 (diag copy, works) — debugging epilogue deadlock with (128,128) PV biondizzle 2026-05-21 06:40:27 +00:00
4968ce064d even more stuff biondizzle 2026-05-21 05:55:22 +00:00
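
Several of the Stage C commits above (b80a1ab083, 8ce257150e, b50968dfaf) concern online O rescaling across multiple KV tiles. The following is a minimal pure-Python sketch of that general technique for a single query row, not the repository's kernel code. The function names, the `tile` parameter, and the use of plain lists are assumptions made for illustration; the variable names `row_max`, `row_sum`, `acc_scale`, and `scale_log2` only mirror the terms used in the commit messages.

```python
import math


def attention_row_reference(scores, values, scale):
    """Plain softmax(scores * scale) @ values for one query row."""
    m = max(s * scale for s in scores)
    p = [math.exp(s * scale - m) for s in scores]
    denom = sum(p)
    dim = len(values[0])
    return [sum(p[j] * values[j][d] for j in range(len(p))) / denom
            for d in range(dim)]


def attention_row_online(scores, values, scale, tile=4):
    """Same result, computed one KV tile at a time (hypothetical helper).

    The running max is kept in the log2-scaled domain, so the softmax
    scale is folded in exactly once via scale_log2.
    """
    scale_log2 = scale * math.log2(math.e)
    row_max = -math.inf          # running max of scaled scores
    row_sum = 0.0                # running sum of exp2(score - row_max)
    dim = len(values[0])
    acc = [0.0] * dim            # unnormalized output accumulator (O)

    for start in range(0, len(scores), tile):
        s_tile = [s * scale_log2 for s in scores[start:start + tile]]
        v_tile = values[start:start + tile]

        old_max = row_max
        row_max = max(old_max, max(s_tile))

        # Rescale what was accumulated under the old max.
        # On the first tile old_max is -inf, so the factor is 0.0.
        acc_scale = 2.0 ** (old_max - row_max)
        row_sum *= acc_scale
        acc = [a * acc_scale for a in acc]

        for s, v in zip(s_tile, v_tile):
            p = 2.0 ** (s - row_max)
            row_sum += p
            for d in range(dim):
                acc[d] += p * v[d]

    # Final normalization by the accumulated row sum.
    return [a / row_sum for a in acc]
```

Calling `attention_row_online(scores, values, scale, tile=4)` on a row longer than one tile should agree with `attention_row_reference` up to floating-point rounding, which is the property the multi-tile tests in the log appear to check with a cosine metric.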
