nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Fork 0

Commit Graph

Select branches

Hide Pull Requests

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

cd2c028b39 feat: TMA loads for both K and V in 6-warp FMHA kernel biondizzle 2026-05-29 19:34:48 +00:00
523d3838a2 test: HD=128/256 variants for TMA FMHA biondizzle 2026-05-29 19:32:49 +00:00
bd4f09d514 fix: ambiguous MMA_K_BF16 in test biondizzle 2026-05-29 19:32:15 +00:00
4459ddefdd feat: 6-warp TMA FMHA kernel + test — TMA for K loads biondizzle 2026-05-29 19:32:02 +00:00
7a8ba8eeb6 fix: SMEM size calculation — TILE_SZ is in BF16 elements, need *sizeof(bf16_t) for bytes biondizzle 2026-05-29 19:30:50 +00:00
aac1b25442 test: TMA QK diagnostic — 3 variants to isolate failure biondizzle 2026-05-29 19:29:35 +00:00
9dfada6626 test: TMA + canonical + QK GEMM incremental biondizzle 2026-05-29 19:28:23 +00:00
0435e229bd fix: typo cuda_SUCCESS -> cudaSuccess biondizzle 2026-05-29 19:27:30 +00:00
74514e2680 test: TMA sub-tile load — exact pattern from test_qk_softmax biondizzle 2026-05-29 19:26:56 +00:00
e449d6d5e1 test: TMA diagnostic with 192 threads biondizzle 2026-05-29 19:26:09 +00:00
0b36b6047a test: TMA diagnostic with 128 threads biondizzle 2026-05-29 19:25:38 +00:00
a766b488c2 test: minimal TMA diagnostic — isolate multi-warp TMA bug biondizzle 2026-05-29 19:25:01 +00:00
fe3b6b8d13 test: QK+softmax T=1 first biondizzle 2026-05-29 19:12:26 +00:00
a9a87fe7b8 fix: P write with lane stride, use sRowSum biondizzle 2026-05-29 19:11:19 +00:00
fd6a9b00ae test: QK + softmax — verify P values against reference biondizzle 2026-05-29 19:10:08 +00:00
5eff53c145 fix: SMEM layout and printf in PV-only test biondizzle 2026-05-29 19:08:39 +00:00
106f103c83 test: PV-only GEMM — isolate PV from full FMHA pipeline biondizzle 2026-05-29 19:06:52 +00:00
5542a9da00 debug: V loaded directly from GMEM (not TMA) to isolate PV issue biondizzle 2026-05-29 18:57:42 +00:00
2262e10fca fix: PV GEMM — V canonical uses CORES_MN_V=2 (block_mn=16), not 16 biondizzle 2026-05-29 18:54:02 +00:00
90c3372040 refactor: TMA FMHA kernel — 4-warp, proven pattern, full pipeline biondizzle 2026-05-29 18:50:58 +00:00
d5e20b2d42 fix: reference should be raw dot product (MMA is unscaled) biondizzle 2026-05-29 18:48:39 +00:00
2b945f255b test: TMA K-load + QK GEMM — incremental from working pattern biondizzle 2026-05-29 18:47:27 +00:00
f33746f183 test: minimal TMA K-load — no MMA/TMEM, just verify TMA + canonical biondizzle 2026-05-29 18:46:09 +00:00
d64b62bc80 test: simple (128,16) TMA desc for K sub-tile only biondizzle 2026-05-29 18:45:01 +00:00
eaf8a878cf fix: only warp 0 lane 0 issues TMA (not all lane 0 threads) biondizzle 2026-05-29 18:44:18 +00:00
69bf20b09d fix: SMEM alignment in TMA K-only test biondizzle 2026-05-29 18:43:44 +00:00
2c0ee69aea test: TMA K-only — proven gen pattern + TMA for K loads only biondizzle 2026-05-29 18:43:07 +00:00
9fc2d549e4 fix: warp-collective TMEM read/dealloc in minimal QK test biondizzle 2026-05-29 18:42:03 +00:00
c755e6fdde fix: TMEM read/dealloc for 128-thread kernel biondizzle 2026-05-29 18:40:24 +00:00
bd1309ba88 test: minimal QK — 128 threads, tid==0 MMA, match working gen kernel pattern biondizzle 2026-05-29 18:40:11 +00:00
39aef1284f fix: smem size in minimal QK test biondizzle 2026-05-29 18:37:38 +00:00
ce89fe9170 test: minimal QK — separate sQ0/sK0, clean SMEM layout biondizzle 2026-05-29 18:37:20 +00:00
71b353577d fix: QK direct test — per-K-sub-tile Q load (same as working kernel) biondizzle 2026-05-29 18:35:00 +00:00
35d0596893 fix: T=1 for QK direct test (write_q_to_smem only handles row 0) biondizzle 2026-05-29 18:33:35 +00:00
bee7cc5f8f fix: lane vs threadIdx.x in direct QK test biondizzle 2026-05-29 18:32:21 +00:00
670599b754 test: direct QK GEMM — baseline for TMA comparison biondizzle 2026-05-29 18:31:57 +00:00
9a185f0222 test: debug Q SMEM canonical after TMA load biondizzle 2026-05-29 18:30:52 +00:00
1500020593 test: QK-only TMA test — isolate TMA load + canonical + MMA biondizzle 2026-05-29 18:29:49 +00:00
204cc90808 fix: load full Q (128,HD) once before QK loop — not per K-sub-tile biondizzle 2026-05-29 18:28:45 +00:00
bf7cf54a51 fix: align TMA SMEM to 128 bytes in verification test biondizzle 2026-05-29 18:27:07 +00:00
befc2c647b test: TMA load verification — compare against direct GMEM read biondizzle 2026-05-29 18:26:34 +00:00
8e09fae3a1 fix: warp-stride for TMA canonical writes — only load warp calls them biondizzle 2026-05-29 18:25:47 +00:00
3e14a25bb0 fix: don't re-init mbarrier in loop — use phase parity tracking biondizzle 2026-05-29 18:24:47 +00:00
bd169ccb0f fix: smart quote in fmha_tma.cuh biondizzle 2026-05-29 18:22:26 +00:00
345b107f4c fix: TMA mbarrier — add arrive.expect_tx (root cause of multi-warp hang) biondizzle 2026-05-29 18:22:00 +00:00
c69f3668e1 feat: TMA async FMHA kernel — WORKING on B200 biondizzle 2026-05-29 07:02:07 +00:00
a40c05f3f2 archive: TMA driver-API files + CUDA 13 TMA discovery notes biondizzle 2026-05-29 06:52:39 +00:00
55f0c6267b auto: pre-test commit biondizzle 2026-05-29 06:41:58 +00:00
197cac875c fix: CUDA 13 TMA descriptor — 3D rank + byte strides + mbarrier byte count biondizzle 2026-05-29 06:34:58 +00:00
85cd95e609 debug: TMA context fix test biondizzle 2026-05-29 04:45:54 +00:00
76c82ebdcd debug: detailed TMA descriptor debug test biondizzle 2026-05-29 04:45:06 +00:00
0c9245b4d2 fix: add cuInit(0) for CUDA driver API biondizzle 2026-05-29 04:43:24 +00:00
6cc2f61431 debug: TMA descriptor dimension test biondizzle 2026-05-29 04:42:44 +00:00
3412ff1a9b fix: TMA tile strides must match global strides, not tile dimensions biondizzle 2026-05-29 04:41:53 +00:00
409838ace2 refactor: per-sub-tile TMA loads with padded GMEM allocations biondizzle 2026-05-29 04:41:03 +00:00
8c17f65f5b fix: cast typo biondizzle 2026-05-29 04:39:21 +00:00
8908b697dd fix: bool type mismatch biondizzle 2026-05-29 04:39:12 +00:00
b78ebe8a9c debug: add TMA descriptor error reporting biondizzle 2026-05-29 04:38:57 +00:00
c7a6d7d231 fix: tma_mbar_init → tma_mbarrier_init (typo) biondizzle 2026-05-29 04:37:48 +00:00
696462f07a feat: TMA async load infrastructure for FMHA kernel biondizzle 2026-05-29 04:36:52 +00:00
d1c1eaeddc clean: remove debug prints, multirow kernel complete with multi-tile KV merge biondizzle 2026-05-28 23:57:31 +00:00
c65baabcc9 fix: V tile copy — V is (HD, SK_TOTAL) so tile columns are not contiguous biondizzle 2026-05-28 23:55:52 +00:00
869460a932 debug: add LSE verification and merge debug prints biondizzle 2026-05-28 23:54:30 +00:00
2f2259395e fix: always normalize in kernel, correct KV merge with normalized O + LSE biondizzle 2026-05-28 23:53:44 +00:00
914f76d30c multirow: add normalize flag, un-norm + LSE output, multi-tile KV merge test biondizzle 2026-05-28 23:51:23 +00:00
ca5cf0e517 test: add multi-head and batched prefill tests for multirow kernel biondizzle 2026-05-28 23:48:53 +00:00
ac8fa779e2 fix: move epilogue TMEM loads outside my_row_active guard (warp-collective hang) biondizzle 2026-05-28 23:46:46 +00:00
55c0604a71 add fence.sc.gpu between PV and epilogue for TMEM visibility biondizzle 2026-05-28 23:21:53 +00:00
52809b0ec6 fix: tcgen05.wait::ld.sync.aligned (was missing 'sync') biondizzle 2026-05-28 23:19:03 +00:00
0220e51d18 fix: typo cudaErrorCudaSuccess -> cudaSuccess biondizzle 2026-05-28 23:18:21 +00:00
468614a4e2 fmha_multirow: non-interleaved design — softmax first, then PV biondizzle 2026-05-28 23:17:43 +00:00
c768abed95 test: softmax-only kernel (QK + row_max, no PV) biondizzle 2026-05-28 23:15:36 +00:00
43ba672e15 fmha_multirow: add fence.sc.gpu after QK GEMM for TMEM visibility biondizzle 2026-05-28 23:13:31 +00:00
d840fbbf85 test: clean multirow test with proper SMEM calc biondizzle 2026-05-28 23:10:49 +00:00
f2124b9378 fix: SMEM calc in decode test biondizzle 2026-05-28 23:08:54 +00:00
58ff781388 test: simplified decode kernel for debugging multirow biondizzle 2026-05-28 23:08:33 +00:00
be2685e9e3 fmha_multirow: use natural 4-warp TMEM partitioning after UMMA biondizzle 2026-05-28 23:07:31 +00:00
ff8c677486 fix: SMEM size for MMA test — account for both sQ0 and sK0 biondizzle 2026-05-28 23:06:07 +00:00
fee022a485 test: MMA→4-warp read using proven fmha_common+umma_desc infra biondizzle 2026-05-28 23:05:29 +00:00
e1a708a187 test: try 16x256b.x1 with column step=4 (4 cols per read) biondizzle 2026-05-28 23:03:51 +00:00
95003eced2 test: 16x256b.x1 loads with uint32_t regs, matching working pattern biondizzle 2026-05-28 23:03:10 +00:00
fffb493b0e fix: 16x256b.x1 load syntax — single address operand biondizzle 2026-05-28 23:02:23 +00:00
44dcd6e8d0 test: 16x256b.x1 multiple LOADS — do they crash like stores? biondizzle 2026-05-28 23:02:03 +00:00
d54bce6a6d fix: correct SMEM size for MMA 4-warp test biondizzle 2026-05-28 23:01:12 +00:00
be45e87891 test: MMA→4-warp TMEM read — do warps see different rows? biondizzle 2026-05-28 23:00:27 +00:00
6b0d57074a test: TMEM cross-warp visibility with different sync strategies biondizzle 2026-05-28 22:59:31 +00:00
77d190278e test: simpler TMEM 4-warp read — direct store+load biondizzle 2026-05-28 22:58:48 +00:00
91b03bd6bd test: verify 4-warp TMEM read with 32x32b.x8 after MMA biondizzle 2026-05-28 22:57:59 +00:00
28e04a5ea8 fix: use __cvta_generic_to_shared directly for 64-bit compat biondizzle 2026-05-28 22:56:29 +00:00
1d6a95df32 fix: typo in tmem row offset test biondizzle 2026-05-28 22:56:15 +00:00
cf6fe71368 test: verify TMEM 32x32b.x8 row offset addressing biondizzle 2026-05-28 22:56:00 +00:00
4cfb707405 fix: correct SMEM size calculation in multirow test biondizzle 2026-05-28 22:53:46 +00:00
863a030c3b fmha_multirow: rewrite with 32x32b.x8 only, no s_p_vals, row_page addressing biondizzle 2026-05-28 22:52:52 +00:00
1ba304db3e stuff biondizzle 2026-05-28 21:08:13 +00:00
deaa3ec725 CRITICAL FIX: Q/K SMEM canonical layout must use local d (0..15) not full_d — UMMA descriptor reads from sQ0/sK0 start, not offset biondizzle 2026-05-28 20:13:52 +00:00
08694b8136 Fix multi-row softmax v3: 32x32b.x8 with per-lane per-row (no wmax/wsum), per-row sRowMax/sRowSum arrays biondizzle 2026-05-28 20:10:13 +00:00
aaa76c1af1 Rewrite multi-row softmax using 16x256b.x1 TMEM reads for proper multi-row access biondizzle 2026-05-28 20:08:30 +00:00
5e3c61184c Fix multi-row softmax: remove cross-lane wmax/wsum — each lane handles its own row independently biondizzle 2026-05-28 20:06:16 +00:00
bf4dfd131b Fix nvcc goto-bypasses-init: move var decls before goto targets biondizzle 2026-05-28 20:04:59 +00:00
2b09d4f2ef Fix nvcc goto-bypasses-init in multi-row test biondizzle 2026-05-28 20:04:45 +00:00

... 6 7 8 9 10 ...