nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

2833eb56e7 P6: Add minimal CUDA test for TMA store epilogue biondizzle 2026-05-30 16:59:45 +00:00
6a7726e764 P6: Add integration test for TMA store epilogue biondizzle 2026-05-30 16:58:24 +00:00
fd7c0cb773 P6: Fix TMA store — use bulk_group (commit+wait) not mbarrier biondizzle 2026-05-30 16:57:35 +00:00
212fc85627 P6: One-way TMEM→regs→SMEM→TMA store epilogue biondizzle 2026-05-30 16:56:07 +00:00
05b5bf9db1 docs: mark P5 as done in NEXT_PRIORITIES.md biondizzle 2026-05-30 10:54:21 +00:00
95e0c8c464 P5: fix multi-tile test — use same Q data for kernel and reference biondizzle 2026-05-30 10:49:12 +00:00
e701a1411c P5: use multi-tile kernel for N>128 in integration test biondizzle 2026-05-30 10:47:00 +00:00
5932e928a8 cleanup: remove debug test files (P4, P5) biondizzle 2026-05-30 10:46:14 +00:00
8fef46ce73 P5: add reference comparison to Python multi-tile test biondizzle 2026-05-30 10:45:02 +00:00
897a70a491 P5: minimal Python multi-tile test biondizzle 2026-05-30 10:43:26 +00:00
a2627359fb P5: fix TMA desc creation — write to HOST then cudaMemcpy to device biondizzle 2026-05-30 10:40:01 +00:00
f370bfb1f1 P5: re-enable multi-tile Python tests, fix CAPI to use create_tma_desc_2d_bf16 biondizzle 2026-05-30 10:38:33 +00:00
da54f6439f P5: fix TMA multitile test (include cuda.h first, proper SMEM calc) biondizzle 2026-05-30 10:35:34 +00:00
34320653e9 P5: standalone TMA multi-tile test with 128B-aligned memory biondizzle 2026-05-30 10:34:20 +00:00
a1d05b3055 P5: disable multi-tile Python tests (TMA descriptor alignment issue) biondizzle 2026-05-30 10:32:44 +00:00
97531a68e6 fix: remove n_kv_tiles from capi too biondizzle 2026-05-30 10:30:40 +00:00
a5b47602b5 fix: remove n_kv_tiles from standalone test (struct doesn't have it anymore) biondizzle 2026-05-30 10:28:38 +00:00
f032800eaa P5: integrate WORKING multi-tile kernel (fmha_6warp_tma_multirow_multitile) into production biondizzle 2026-05-30 10:27:38 +00:00
032cb4c7b2 P5: add single-tile merge comparison to multitile test biondizzle 2026-05-30 09:06:57 +00:00
d424ccbcc1 fix: const not constexpr for SCALE biondizzle 2026-05-30 09:04:45 +00:00
3da31de4c0 P5: fix BF16 host helpers for standalone test biondizzle 2026-05-30 09:04:05 +00:00
9e6ba25a98 P5: standalone multi-tile CUDA test (2 KV tiles, hd=64) biondizzle 2026-05-30 09:01:52 +00:00
b61df2657b P5: fix reference attention for MQA/GQA (kv_idx = h // q_per_kv) biondizzle 2026-05-30 08:59:50 +00:00
c55030a340 P5: clean kernel with runtime branch (single-tile unchanged, multi-tile separate path) biondizzle 2026-05-30 08:57:00 +00:00
5f4856d771 P5: fix sOacc init race — use single thread (tid==0) instead of 4 softmax warps biondizzle 2026-05-30 08:53:50 +00:00
66b126ded8 P5: fix standalone test template — add n_kv_tiles to FmhaParams biondizzle 2026-05-30 08:50:38 +00:00
0f34f60494 P5: fix single-tile backward compat (normalized P for n_kv_tiles==1) biondizzle 2026-05-30 08:47:47 +00:00
2649488d13 P5: in-kernel multi-KV-tile FA2 online softmax in fmha_6warp_multihead.cuh biondizzle 2026-05-30 08:46:09 +00:00
6421f7c3f3 P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue biondizzle 2026-05-30 08:42:18 +00:00
58c087416b P4: 128B-aligned GMEM, proper SMEM alignment, bit21 test biondizzle 2026-05-30 08:41:15 +00:00
90c806733f P4: test TMA with bit-21 workaround and innermost-first dims biondizzle 2026-05-30 08:40:21 +00:00
16027018df P4: fix TMA load test (32-bit SMEM addrs, proper mbarrier) biondizzle 2026-05-30 08:38:55 +00:00
e2ecdc42d8 P4: TMA load test kernel (swizzle vs no-swizzle hang diagnosis) biondizzle 2026-05-30 08:38:11 +00:00
bd104c2ab2 P4: fix OOB fill enum name biondizzle 2026-05-30 08:37:05 +00:00
cdd1babf1f P4: correct CUDA 13.2 API (dataType before rank, FloatOOBfill, globalDim) biondizzle 2026-05-30 08:36:24 +00:00
8df3ccecea P4: CUDA 13.2 has 10-param cuTensorMapEncodeTiled (no OOB fill) biondizzle 2026-05-30 08:35:34 +00:00
d8ffdb66e1 P4: fix API signature rank/dtype order, OOB_FILL defines biondizzle 2026-05-30 08:35:04 +00:00
277689f8b8 P4: use proper CUDA enum names biondizzle 2026-05-30 08:34:19 +00:00
6d624a1b14 P4: remove explicit enum casts biondizzle 2026-05-30 08:33:42 +00:00
4898a946eb P4: fix TMA descriptor dump API order (dtype before rank) biondizzle 2026-05-30 08:33:12 +00:00
3943be6063 P4: fix TMA descriptor dump (cuuint64_t dims, proper CUtensorMap API) biondizzle 2026-05-30 08:32:34 +00:00
4df6ea2d8c P4: TMA descriptor dump test (cuTensorMapEncodeTiled) biondizzle 2026-05-30 08:31:56 +00:00
ae425b5522 P3: clean up test, remove debug files, final integration test biondizzle 2026-05-30 08:29:25 +00:00
10915c4e70 fix: remove double normalization in fmha_6warp_multihead epilogue biondizzle 2026-05-30 08:26:20 +00:00
cfac224b59 debug: single head sanity test with known values biondizzle 2026-05-30 08:25:20 +00:00
1c74d35fb4 debug: V layout reference comparison biondizzle 2026-05-30 08:24:35 +00:00
a3c5f817e1 debug: compare api vs direct kernel vs reference biondizzle 2026-05-30 08:23:43 +00:00
78e6d58b85 debug: V layout comparison test biondizzle 2026-05-30 08:22:49 +00:00
074c4c4f42 P3: call fmha_multihead_decode_raw directly (skip custom op) biondizzle 2026-05-30 08:21:53 +00:00
1b9cdf89fb P3: add full API integration test biondizzle 2026-05-30 08:20:53 +00:00
0608d9d09e P3: fix GQA via K/V repeat_interleave, relax threshold to 0.999990 biondizzle 2026-05-30 08:20:01 +00:00
d5c0086737 P3: fix SMEM computation, pad K/V to 128, remove stale files biondizzle 2026-05-30 08:19:16 +00:00
094b3c9e6c P3: fix test — create V in kernel layout (hd,N), transpose for reference biondizzle 2026-05-30 08:18:20 +00:00
7b5b3342fa P3: fix integration test — V transpose, direct ctypes call biondizzle 2026-05-30 08:17:33 +00:00
8a5070aa38 test: minimal ctypes debug test for P3 biondizzle 2026-05-30 08:16:50 +00:00
63645a3c7b fix: -Xcompiler -fPIC instead of -fPIC for nvcc biondizzle 2026-05-30 08:16:04 +00:00
adcf3e04ab P3: ctypes loader for 6-warp FMHA (bypass torch JIT sm_100 arch issue) biondizzle 2026-05-30 08:15:31 +00:00
1e6adf5e01 P3: wire 6-warp multi-head FMHA decode fast path into production.py biondizzle 2026-05-30 08:12:23 +00:00
20f3ccd992 D1.5 complete: HD=512 support via hd_chunk tiling with native TMEM columns biondizzle 2026-05-30 07:02:41 +00:00
f2592ea0da fix: native TMEM columns for hd_chunk (no remapping) biondizzle 2026-05-30 07:01:42 +00:00
dcf89fdd1c debug: check full HD for chunk1 test biondizzle 2026-05-30 07:00:46 +00:00
3dbd3c5e7f debug: test chunk 1 only biondizzle 2026-05-30 07:00:14 +00:00
72779e7f71 debug: compare only first HD_CHUNK values biondizzle 2026-05-30 06:59:39 +00:00
9227b0e93f debug: skip hd_chunk>0 to isolate chunk0 biondizzle 2026-05-30 06:59:01 +00:00
25aeaca9ab fix: PV accumulate flag biondizzle 2026-05-30 06:56:53 +00:00
1da785c070 D1.5: HD tiling (HD_CHUNK=256) for HD=512 support biondizzle 2026-05-30 06:56:09 +00:00
700524f183 test: HD=128/256 variants for D1.5 biondizzle 2026-05-30 04:49:33 +00:00
f2544a4600 test: full matrix for D1.5 multirow multitile biondizzle 2026-05-30 04:49:00 +00:00
5544d3a0a4 fix: TMEM reads must be outside my_row_active (warp-collective) biondizzle 2026-05-30 04:48:26 +00:00
1dca8d8cfa debug: unbuffered stdout biondizzle 2026-05-30 04:46:11 +00:00
8be8813d54 debug: more prints biondizzle 2026-05-30 04:44:41 +00:00
570396b4be debug: simplify test, add fflush biondizzle 2026-05-30 04:42:35 +00:00
0ad35f8be6 debug: add prints to multirow multitile test biondizzle 2026-05-30 04:40:06 +00:00
dd3e0fdfc8 D1.5: multi-row + multi-tile FMHA with SMEM accumulator in-kernel rescale biondizzle 2026-05-30 04:37:33 +00:00
10ae8f3346 auto: pre-test commit biondizzle 2026-05-30 03:46:38 +00:00
8b1ac380ac feat: HD=512 support — TMEM_N=512, test variants for all three TMA kernels biondizzle 2026-05-30 03:45:05 +00:00
762f054d6d feat: double-buffer TMA pipeline in multi-row kernel biondizzle 2026-05-30 03:20:49 +00:00
4a9c850e9c feat: double-buffer TMA pipeline for K loads in single-tile kernel biondizzle 2026-05-30 03:14:06 +00:00
afa949071b fix: brace structure in V TMA conversion biondizzle 2026-05-29 22:59:18 +00:00
ec577f71ee feat: V TMA loads in single-tile kernel too biondizzle 2026-05-29 22:57:59 +00:00
422e7bb312 cleanup: v_head reference in multi-row (V via TMA now) biondizzle 2026-05-29 22:54:44 +00:00
88c72a887e feat: V TMA loads in multi-row kernel biondizzle 2026-05-29 22:51:24 +00:00
13403d2808 cleanup: remove unused v_head in multi-tile (V via TMA) biondizzle 2026-05-29 22:48:50 +00:00
74145a31cc feat: V TMA loads in multi-tile kernel biondizzle 2026-05-29 22:46:21 +00:00
680d2ebf64 test: V TMA diagnostic — isolate V TMA descriptor issue biondizzle 2026-05-29 22:42:46 +00:00
077fbdf3c5 test: HD=128/256 multi-tile variants biondizzle 2026-05-29 20:02:00 +00:00
7df17384fd test: multi-tile s_k=128/256/384/512 biondizzle 2026-05-29 19:59:21 +00:00
d47b2bfcce fix: use un-normalized P for multi-tile PV (correct online softmax merge) biondizzle 2026-05-29 19:57:54 +00:00
43ae3e7f98 fix: reload Q per-K-sub-tile in multi-tile kernel (same as single-tile) biondizzle 2026-05-29 19:56:35 +00:00
7598d548ee debug: test multi-tile with s_k=128 only biondizzle 2026-05-29 19:53:02 +00:00
8e99bd50e6 feat: 6-warp TMA multi-tile KV kernel with register accumulator + test biondizzle 2026-05-29 19:49:53 +00:00
1814510195 wip: add n_kv_tiles param for multi-tile KV (not yet used) biondizzle 2026-05-29 19:47:48 +00:00
d20792aa9d fix: TMA descriptor index for batched multi-head (batch*n_h + head) biondizzle 2026-05-29 19:45:44 +00:00
754c6a692c feat: per-head TMA descriptors for multi-head FMHA biondizzle 2026-05-29 19:44:58 +00:00
9eb193458e test: refactored multi-row TMA test with multi-head and batch biondizzle 2026-05-29 19:43:41 +00:00
832a04181d test: relax relative error threshold to 5% for BF16, use cosine > 0.999 as pass criterion biondizzle 2026-05-29 19:41:40 +00:00
bfef94f5d0 test: HD=128/256 multi-row TMA FMHA biondizzle 2026-05-29 19:40:32 +00:00
a1b2ab79a1 feat: 6-warp TMA FMHA multi-row kernel + test biondizzle 2026-05-29 19:39:17 +00:00
d0a50f1f2e fix: remove double normalization in TMA epilogue (P already normalized before PV) biondizzle 2026-05-29 19:36:41 +00:00
fb971781aa fix: revert V to direct load (V TMA needs debugging), K TMA works biondizzle 2026-05-29 19:35:44 +00:00
