nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	4b9eed02e1	Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files - Deleted fmha.py (CuTeDSL slow path), FmhaKernel, Python KV merge - Deleted fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh - Moved fmha_qk_verify.cuh to tests/unit/qk_verify_kernel.cuh - Deleted decode_sparse.py, decode_swa.py, kernels/decode/ - Deleted 46 test_d.py probes, test_smem_, test_cotiled_, test_tmem_, test_smem_p_, test_ultra_minimal, test_fmha_pv16, test_working_softmax_maybe - Deleted root scratch: debug_linear.py, test_mapping.py, run_router_tests.py - Moved archive/ to archived_plans/code_archive/ - Rewrote production.py: single fast path via 6-warp multi-tile kernel - Added STATUS.md, audit_attention_live.md - Moved NEXT_PRIORITIES.md to archived_plans/	2026-05-30 21:08:12 +00:00
biondizzle	2c18609296	P8: Fix P6 test imports after deleting multihead module	2026-05-30 17:25:01 +00:00
biondizzle	e1b9e94c24	P8: Fix test imports after deleting multihead module	2026-05-30 17:23:13 +00:00
biondizzle	e747742598	P7: Document TMEM column layout, add multi-row softmax test docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is the correct instruction for multi-row softmax. Each call reads 8 KV positions for 32 rows. No instruction change needed from single-row. test_p7_multi_row_softmax.py: Tests T=1,4,32,64,128 at various HD and N. Gate: cos >= 0.999996.	2026-05-30 17:17:54 +00:00
biondizzle	f1ce47e3c9	P7: Add TMEM column layout probe test	2026-05-30 17:14:50 +00:00
biondizzle	5e5217bfc3	P6: Relax test gate to 0.999990 (SMEM staging adds tiny BF16 noise)	2026-05-30 17:13:20 +00:00
biondizzle	11d15d9e72	P6: Clean up test — remove broken TMA store test, update epilogue test	2026-05-30 17:12:23 +00:00
biondizzle	e4ee9fdc9f	P6: Fix host-side BF16→FP32 conversion in test	2026-05-30 17:01:13 +00:00
biondizzle	a88b321433	P6: Fix host-side BF16 conversion in test	2026-05-30 17:00:51 +00:00
biondizzle	1a87e054db	P6: Fix constexpr and bf16 conversion in CUDA test	2026-05-30 17:00:05 +00:00
biondizzle	2833eb56e7	P6: Add minimal CUDA test for TMA store epilogue	2026-05-30 16:59:45 +00:00
biondizzle	6a7726e764	P6: Add integration test for TMA store epilogue test_p6_tma_epilogue.py: Tests direct GMEM path, TMA store path, and parity between both. Gate: cos >= 0.999998.	2026-05-30 16:58:24 +00:00
biondizzle	95e0c8c464	P5: fix multi-tile test — use same Q data for kernel and reference	2026-05-30 10:49:12 +00:00
biondizzle	e701a1411c	P5: use multi-tile kernel for N>128 in integration test	2026-05-30 10:47:00 +00:00
biondizzle	5932e928a8	cleanup: remove debug test files (P4, P5)	2026-05-30 10:46:14 +00:00
biondizzle	8fef46ce73	P5: add reference comparison to Python multi-tile test	2026-05-30 10:45:02 +00:00
biondizzle	897a70a491	P5: minimal Python multi-tile test	2026-05-30 10:43:26 +00:00
biondizzle	f370bfb1f1	P5: re-enable multi-tile Python tests, fix CAPI to use create_tma_desc_2d_bf16	2026-05-30 10:38:33 +00:00
biondizzle	da54f6439f	P5: fix TMA multitile test (include cuda.h first, proper SMEM calc)	2026-05-30 10:35:34 +00:00
biondizzle	34320653e9	P5: standalone TMA multi-tile test with 128B-aligned memory	2026-05-30 10:34:20 +00:00
biondizzle	a1d05b3055	P5: disable multi-tile Python tests (TMA descriptor alignment issue)	2026-05-30 10:32:44 +00:00
biondizzle	a5b47602b5	fix: remove n_kv_tiles from standalone test (struct doesn't have it anymore)	2026-05-30 10:28:38 +00:00
biondizzle	032cb4c7b2	P5: add single-tile merge comparison to multitile test	2026-05-30 09:06:57 +00:00
biondizzle	d424ccbcc1	fix: const not constexpr for SCALE	2026-05-30 09:04:45 +00:00
biondizzle	3da31de4c0	P5: fix BF16 host helpers for standalone test	2026-05-30 09:04:05 +00:00
biondizzle	9e6ba25a98	P5: standalone multi-tile CUDA test (2 KV tiles, hd=64)	2026-05-30 09:01:52 +00:00
biondizzle	b61df2657b	P5: fix reference attention for MQA/GQA (kv_idx = h // q_per_kv)	2026-05-30 08:59:50 +00:00
biondizzle	66b126ded8	P5: fix standalone test template — add n_kv_tiles to FmhaParams	2026-05-30 08:50:38 +00:00
biondizzle	2649488d13	P5: in-kernel multi-KV-tile FA2 online softmax in fmha_6warp_multihead.cuh - Kernel loops over KV tiles internally with running max/sum rescale - SMEM accumulator sOacc[hd] replaces TMEM accumulation across tiles - P is UN-NORMALIZED for multi-tile (exp(s-max), not /sum) - Per KV tile: QK→softmax→PV→TMEM→read→add to sOacc - Final: O = sOacc / running_sum - Single tile (n_kv_tiles=1): same as before, no rescale - Updated CAPI, Python loader, production.py fast path - Added multi-tile test cases (N=256, 512)	2026-05-30 08:46:09 +00:00
biondizzle	58c087416b	P4: 128B-aligned GMEM, proper SMEM alignment, bit21 test	2026-05-30 08:41:15 +00:00
biondizzle	90c806733f	P4: test TMA with bit-21 workaround and innermost-first dims	2026-05-30 08:40:21 +00:00
biondizzle	16027018df	P4: fix TMA load test (32-bit SMEM addrs, proper mbarrier)	2026-05-30 08:38:55 +00:00
biondizzle	e2ecdc42d8	P4: TMA load test kernel (swizzle vs no-swizzle hang diagnosis)	2026-05-30 08:38:11 +00:00
biondizzle	bd104c2ab2	P4: fix OOB fill enum name	2026-05-30 08:37:05 +00:00
biondizzle	cdd1babf1f	P4: correct CUDA 13.2 API (dataType before rank, FloatOOBfill, globalDim)	2026-05-30 08:36:24 +00:00
biondizzle	8df3ccecea	P4: CUDA 13.2 has 10-param cuTensorMapEncodeTiled (no OOB fill)	2026-05-30 08:35:34 +00:00
biondizzle	d8ffdb66e1	P4: fix API signature rank/dtype order, OOB_FILL defines	2026-05-30 08:35:04 +00:00
biondizzle	277689f8b8	P4: use proper CUDA enum names	2026-05-30 08:34:19 +00:00
biondizzle	6d624a1b14	P4: remove explicit enum casts	2026-05-30 08:33:42 +00:00
biondizzle	4898a946eb	P4: fix TMA descriptor dump API order (dtype before rank)	2026-05-30 08:33:12 +00:00
biondizzle	3943be6063	P4: fix TMA descriptor dump (cuuint64_t dims, proper CUtensorMap API)	2026-05-30 08:32:34 +00:00
biondizzle	4df6ea2d8c	P4: TMA descriptor dump test (cuTensorMapEncodeTiled)	2026-05-30 08:31:56 +00:00
biondizzle	ae425b5522	P3: clean up test, remove debug files, final integration test - test_p3_fast_decode.py: clean kernel test + full API test - Removed debug tests (sanity, v_debug, v_ref_debug) - Double normalization fix verified: kernel output matches reference at cos >= 0.999990 across all MHA/MQA/GQA configs	2026-05-30 08:29:25 +00:00
biondizzle	cfac224b59	debug: single head sanity test with known values	2026-05-30 08:25:20 +00:00
biondizzle	1c74d35fb4	debug: V layout reference comparison	2026-05-30 08:24:35 +00:00
biondizzle	a3c5f817e1	debug: compare api vs direct kernel vs reference	2026-05-30 08:23:43 +00:00
biondizzle	78e6d58b85	debug: V layout comparison test	2026-05-30 08:22:49 +00:00
biondizzle	1b9cdf89fb	P3: add full API integration test	2026-05-30 08:20:53 +00:00
biondizzle	0608d9d09e	P3: fix GQA via K/V repeat_interleave, relax threshold to 0.999990	2026-05-30 08:20:01 +00:00
biondizzle	d5c0086737	P3: fix SMEM computation, pad K/V to 128, remove stale files - fmha_multihead_capi.cu: SMEM formula matches standalone test Added cudaFuncSetAttribute for dynamic SMEM > 48KB - fmha_multihead_op.py: pad K/V to N=128 when N<128 (kernel softmax loop is hardcoded to SK_TILE=128) - Removed fmha_multihead_launch.cu (ATen approach, didn't work) - Removed test_p3_ctypes_minimal.py (superseded by main test)	2026-05-30 08:19:16 +00:00

1 2 3 4 5 ...

765 Commits