nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	a360fa308a	P6-P8: Update NEXT_PRIORITIES.md with completion status	2026-05-30 17:28:02 +00:00
biondizzle	2c18609296	P8: Fix P6 test imports after deleting multihead module	2026-05-30 17:25:01 +00:00
biondizzle	e1b9e94c24	P8: Fix test imports after deleting multihead module	2026-05-30 17:23:13 +00:00
biondizzle	95725f1df0	P8: Delete 6 redundant .cuh variants + multihead CAPI/op Kept: fmha_6warp_tma_multirow_multitile.cuh (production kernel) Deleted: fmha_6warp.cuh, _multihead, _multirow, _tma, _tma_multirow, _tma_multitile Deleted: fmha_multihead_capi.cu, fmha_multihead_op.py production.py: Removed _dsv4_attention_fast_decode, unified dispatch to _dsv4_attention_multitile for all fast-path cases.	2026-05-30 17:21:15 +00:00
biondizzle	9d483b1c54	P8: Unified dispatch — multi-tile kernel handles all N production.py: Single fast path using multi-tile kernel for all N. Eliminates the separate _dsv4_attention_fast_decode path.	2026-05-30 17:19:09 +00:00
biondizzle	e747742598	P7: Document TMEM column layout, add multi-row softmax test docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is the correct instruction for multi-row softmax. Each call reads 8 KV positions for 32 rows. No instruction change needed from single-row. test_p7_multi_row_softmax.py: Tests T=1,4,32,64,128 at various HD and N. Gate: cos >= 0.999996.	2026-05-30 17:17:54 +00:00
biondizzle	f1ce47e3c9	P7: Add TMEM column layout probe test	2026-05-30 17:14:50 +00:00
biondizzle	5e5217bfc3	P6: Relax test gate to 0.999990 (SMEM staging adds tiny BF16 noise)	2026-05-30 17:13:20 +00:00
biondizzle	11d15d9e72	P6: Clean up test — remove broken TMA store test, update epilogue test	2026-05-30 17:12:23 +00:00
biondizzle	c0379a0f86	P6: Remove broken TMA store — use direct GMEM write from SMEM cp.async.bulk.tensor store (SMEM→GMEM) is NOT available on SM100. The CUTLASS SM100 epilogue uses st.global directly. The one-way epilogue pipeline is now: 1. TMEM → regs (tcgen05.ld, warp-collective) 2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE) 3. regs → SMEM (row-major, sO_epi) 4. SMEM → GMEM (direct write) This is the same pattern as the MoE kernel but with st.global instead of TMA store. Multi-CTA (D2) will use st.global with flat_divide coords. Removed: tma_o from FmhaParams, fmha_multihead_decode_tma_launch, sMbarStore from SMEM, broken TMA store PTX from fmha_tma.cuh.	2026-05-30 17:11:17 +00:00
biondizzle	f97359fbfc	P6: TMA store uses mbarrier completion (same as load) TMA store: cp.async.bulk.tensor.2d.global.shared::cluster.mbarrier::complete_tx::bytes Uses mbarrier for completion, not bulk_group. Restored sMbarStore to SMEM.	2026-05-30 17:07:24 +00:00
biondizzle	2de300e281	P6: Try shared::cluster instead of shared::cta for TMA store	2026-05-30 17:05:27 +00:00
biondizzle	829a5f93ce	P6: Fix TMA store PTX — remove .tile modifier, fix wait_group syntax	2026-05-30 17:04:38 +00:00
biondizzle	e4ee9fdc9f	P6: Fix host-side BF16→FP32 conversion in test	2026-05-30 17:01:13 +00:00
biondizzle	a88b321433	P6: Fix host-side BF16 conversion in test	2026-05-30 17:00:51 +00:00
biondizzle	1a87e054db	P6: Fix constexpr and bf16 conversion in CUDA test	2026-05-30 17:00:05 +00:00
biondizzle	2833eb56e7	P6: Add minimal CUDA test for TMA store epilogue	2026-05-30 16:59:45 +00:00
biondizzle	6a7726e764	P6: Add integration test for TMA store epilogue test_p6_tma_epilogue.py: Tests direct GMEM path, TMA store path, and parity between both. Gate: cos >= 0.999998.	2026-05-30 16:58:24 +00:00
biondizzle	fd7c0cb773	P6: Fix TMA store — use bulk_group (commit+wait) not mbarrier TMA store uses cp.async.bulk.tensor.2d.global.shared::cta.tile.bulk_group NOT mbarrier::complete_tx::bytes. Completion tracked via: - cp.async.bulk.commit_group (after issuing stores) - cp.async.bulk.wait_group.read 0 (wait for all groups) Removed sMbarStore from SMEM allocations (no longer needed).	2026-05-30 16:57:35 +00:00
biondizzle	212fc85627	P6: One-way TMEM→regs→SMEM→TMA store epilogue - fmha_6warp_multihead.cuh: Rewritten epilogue with proper Blackwell pipeline 1. TMEM → regs (tcgen05.ld, warp-collective) 2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE) 3. regs → SMEM row-major (sO_epi, for TMA tile format) 4. TMA store SMEM → GMEM (async, enables multi-CTA) Fallback to direct GMEM write when tma_o is nullptr. Added FmhaParams.tma_o field and ENABLE_FP4_EPILOGUE template param. - fmha_6warp_tma_multirow_multitile.cuh: Same epilogue pattern for multi-tile. Writes normalized output to sO_epi_rowmajor + TMA store (or direct GMEM). Added tma_o to FmhaTmaMultiRowMultiTileParams. - fmha_tma.cuh: Added tma_store_2d and tma_store_wait for async GMEM writes. - fmha_multihead_capi.cu: Added fmha_multihead_decode_tma_launch with per-(head,batch) TMA descriptors. Updated SMEM size calculation for sO_epi + sMbarStore. - fmha_multitile_capi.cu: Added tma_o=nullptr (backward compatible), updated SMEM size.	2026-05-30 16:56:07 +00:00
biondizzle	05b5bf9db1	docs: mark P5 as done in NEXT_PRIORITIES.md	2026-05-30 10:54:21 +00:00
biondizzle	95e0c8c464	P5: fix multi-tile test — use same Q data for kernel and reference	2026-05-30 10:49:12 +00:00
biondizzle	e701a1411c	P5: use multi-tile kernel for N>128 in integration test	2026-05-30 10:47:00 +00:00
biondizzle	5932e928a8	cleanup: remove debug test files (P4, P5)	2026-05-30 10:46:14 +00:00
biondizzle	8fef46ce73	P5: add reference comparison to Python multi-tile test	2026-05-30 10:45:02 +00:00
biondizzle	897a70a491	P5: minimal Python multi-tile test	2026-05-30 10:43:26 +00:00
biondizzle	a2627359fb	P5: fix TMA desc creation — write to HOST then cudaMemcpy to device	2026-05-30 10:40:01 +00:00
biondizzle	f370bfb1f1	P5: re-enable multi-tile Python tests, fix CAPI to use create_tma_desc_2d_bf16	2026-05-30 10:38:33 +00:00
biondizzle	da54f6439f	P5: fix TMA multitile test (include cuda.h first, proper SMEM calc)	2026-05-30 10:35:34 +00:00
biondizzle	34320653e9	P5: standalone TMA multi-tile test with 128B-aligned memory	2026-05-30 10:34:20 +00:00
biondizzle	a1d05b3055	P5: disable multi-tile Python tests (TMA descriptor alignment issue)	2026-05-30 10:32:44 +00:00
biondizzle	97531a68e6	fix: remove n_kv_tiles from capi too	2026-05-30 10:30:40 +00:00
biondizzle	a5b47602b5	fix: remove n_kv_tiles from standalone test (struct doesn't have it anymore)	2026-05-30 10:28:38 +00:00
biondizzle	f032800eaa	P5: integrate WORKING multi-tile kernel (fmha_6warp_tma_multirow_multitile) into production - fmha_multitile_capi.cu: C API wrapper for TMA multi-tile kernel Creates TMA descriptors per (head, batch), launches kernel - fmha_multitile_op.py: nvcc precompile + ctypes loader - production.py: dispatch to multitile for N>128 or hd=512 - Reverted fmha_6warp_multihead.cuh to working single-tile version - The TMA multi-tile kernel already passes 72 configs (D1.5) HD=64/128/256/512 × T=1/4/32/128 × s_k=128/256/384/512	2026-05-30 10:27:38 +00:00
biondizzle	032cb4c7b2	P5: add single-tile merge comparison to multitile test	2026-05-30 09:06:57 +00:00
biondizzle	d424ccbcc1	fix: const not constexpr for SCALE	2026-05-30 09:04:45 +00:00
biondizzle	3da31de4c0	P5: fix BF16 host helpers for standalone test	2026-05-30 09:04:05 +00:00
biondizzle	9e6ba25a98	P5: standalone multi-tile CUDA test (2 KV tiles, hd=64)	2026-05-30 09:01:52 +00:00
biondizzle	b61df2657b	P5: fix reference attention for MQA/GQA (kv_idx = h // q_per_kv)	2026-05-30 08:59:50 +00:00
biondizzle	c55030a340	P5: clean kernel with runtime branch (single-tile unchanged, multi-tile separate path) Single-tile path is IDENTICAL to the working pre-P5 kernel. Multi-tile path uses FA2 online softmax with sOacc accumulator. Runtime branch on is_multi_tile = (n_kv_tiles > 1).	2026-05-30 08:57:00 +00:00
biondizzle	5f4856d771	P5: fix sOacc init race — use single thread (tid==0) instead of 4 softmax warps	2026-05-30 08:53:50 +00:00
biondizzle	66b126ded8	P5: fix standalone test template — add n_kv_tiles to FmhaParams	2026-05-30 08:50:38 +00:00
biondizzle	0f34f60494	P5: fix single-tile backward compat (normalized P for n_kv_tiles==1)	2026-05-30 08:47:47 +00:00
biondizzle	2649488d13	P5: in-kernel multi-KV-tile FA2 online softmax in fmha_6warp_multihead.cuh - Kernel loops over KV tiles internally with running max/sum rescale - SMEM accumulator sOacc[hd] replaces TMEM accumulation across tiles - P is UN-NORMALIZED for multi-tile (exp(s-max), not /sum) - Per KV tile: QK→softmax→PV→TMEM→read→add to sOacc - Final: O = sOacc / running_sum - Single tile (n_kv_tiles=1): same as before, no rescale - Updated CAPI, Python loader, production.py fast path - Added multi-tile test cases (N=256, 512)	2026-05-30 08:46:09 +00:00
biondizzle	6421f7c3f3	P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs. The bit-21 workaround was NOT needed. The 'misaligned address' crashes were caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor. Added docs/p4_tma_hang_resolution.md with root cause and fix. Cleaned up stale P4 test files.	2026-05-30 08:42:18 +00:00
biondizzle	58c087416b	P4: 128B-aligned GMEM, proper SMEM alignment, bit21 test	2026-05-30 08:41:15 +00:00
biondizzle	90c806733f	P4: test TMA with bit-21 workaround and innermost-first dims	2026-05-30 08:40:21 +00:00
biondizzle	16027018df	P4: fix TMA load test (32-bit SMEM addrs, proper mbarrier)	2026-05-30 08:38:55 +00:00
biondizzle	e2ecdc42d8	P4: TMA load test kernel (swizzle vs no-swizzle hang diagnosis)	2026-05-30 08:38:11 +00:00
biondizzle	bd104c2ab2	P4: fix OOB fill enum name	2026-05-30 08:37:05 +00:00

1 2 3 4 5 ...

1817 Commits