nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	abfe9dbaa1	test: only 1 tmem_store to verify single column works	2026-05-28 09:49:21 +00:00
biondizzle	5795589abc	test: TMEM 4 columns, individual store calls + loop load	2026-05-28 09:48:27 +00:00
biondizzle	8a428f6127	test: TMEM column addressing test (128 cols, store+load)	2026-05-28 09:46:49 +00:00
biondizzle	ee3fe6d6b2	test: tmem_load column 1 only	2026-05-28 09:45:34 +00:00
biondizzle	6c38c6e442	test: read 8 TMEM columns individually (no loop)	2026-05-28 09:44:30 +00:00
biondizzle	bcc6ed114d	test: add 8KB padding after sQ to prevent MMA read overrun	2026-05-28 09:43:17 +00:00
biondizzle	764ed01d6f	test: try M=64 in descriptor + idesc to debug 4x factor	2026-05-28 09:41:50 +00:00
biondizzle	4cb656e583	test: try idesc=0 (same as gau-nernst)	2026-05-28 09:40:19 +00:00
biondizzle	cfba8484da	test: try idesc with N=128 (full extent) + 128 TMEM cols	2026-05-28 09:39:19 +00:00
biondizzle	30f0056b11	test: clean rewrite with SMEM Q/K verification and dot product check	2026-05-28 09:38:26 +00:00
biondizzle	7eb85a71fc	test: add Q SMEM verification output + bf16_to_f32_host	2026-05-28 09:37:07 +00:00
biondizzle	8f23c2aaf6	test: verify SMEM Q layout by reading back canonical data	2026-05-28 09:35:58 +00:00
biondizzle	004046a6a8	test: read only 1 TMEM column after MMA	2026-05-28 09:35:02 +00:00
biondizzle	41128122e3	test: clean rewrite, 32 TMEM cols, MMA N=32, tmem_load loop	2026-05-28 09:33:45 +00:00
biondizzle	58be79957d	test: 32 TMEM cols, add MMA call with N=32, read S from TMEM	2026-05-28 09:32:33 +00:00
biondizzle	22fb861447	test: 2 tmem_stores with syncwarp between	2026-05-28 09:30:37 +00:00
biondizzle	a87f20a4ae	test: just 1 tmem_store, no fence, no loop	2026-05-28 09:29:46 +00:00
biondizzle	2b57f28968	test: zero 128 TMEM columns, skip fence	2026-05-28 09:29:14 +00:00
biondizzle	25c9b70591	test: zero 2 TMEM columns	2026-05-28 09:28:31 +00:00
biondizzle	01c4097ccc	test: zero 32 TMEM columns	2026-05-28 09:27:59 +00:00
biondizzle	3694f63ba4	test: re-enable full TMEM zeroing (128 columns)	2026-05-28 09:27:25 +00:00
biondizzle	c3b6c3a5e6	test: minimal tmem_store debug (1 column + sentinels)	2026-05-28 09:26:52 +00:00
biondizzle	f1aaa50326	test: re-enable TMEM zeroing with tmem_base debug	2026-05-28 09:26:16 +00:00
biondizzle	a7f81331f8	test: skip TMEM zeroing again, alloc+dealloc only	2026-05-28 09:25:37 +00:00
biondizzle	3f5dcd481e	test: zero only 32 TMEM columns	2026-05-28 09:25:05 +00:00
biondizzle	2b1c8ce7df	test: re-enable all TMEM ops (alloc, zero, dealloc)	2026-05-28 09:24:28 +00:00
biondizzle	acc7424a48	test: skip TMEM zeroing, just alloc+dealloc	2026-05-28 09:23:48 +00:00
biondizzle	ca419c52f3	test: re-enable TMEM alloc + zero	2026-05-28 09:23:10 +00:00
biondizzle	09e8ea5933	test: fix compile error, skip TMEM read	2026-05-28 09:22:17 +00:00
biondizzle	69bbc21300	test: skip all TMEM ops, just test SMEM layout + descriptor	2026-05-28 09:21:52 +00:00
biondizzle	a6c0ce51a2	test: skip MMA, just test descriptor values	2026-05-28 09:20:59 +00:00
biondizzle	ea6b42e649	test_umma_qk: add descriptor debug output	2026-05-28 09:20:12 +00:00
biondizzle	0f6907b001	UMMA: fix descriptor + idesc — use gau-nernst tutorial values - LBO = BLOCK_MN * 16 (bytes), SBO = 128 (bytes) for K-major NONE - Canonical SMEM layout: column-major interleaving of core matrices - idesc is SEPARATE 32-bit value (was using desc_a>>32 = WRONG) - idesc encodes dtype/atype/btype/MMA_M/MMA_N - This was the root cause of 'misaligned address' errors	2026-05-28 09:18:45 +00:00
biondizzle	9b458d2a6c	test_umma_qk: clean rewrite, hardcoded HD=16, explicit core-matrix layout writes	2026-05-28 09:16:37 +00:00
biondizzle	427410d94a	UMMA: Rewrite fmha_umma_desc.cuh with correct K-major core-matrix layout + minimal QK GEMM test - Core-matrix layout: each 8x8 BF16 tile (128B) contiguous in SMEM - K-major NONE descriptor: LBO=1 (16B), SBO=block_k/8, lbo_mode=0 - MMA K-tiling: tcgen05.mma uses K=16 per call, tile for hd>16 - write_smem_kmajor: converts row-major to core-matrix layout - write_smem_ktile: extracts single K-tile in core-matrix layout - test_umma_qk.cu: minimal hd=16, sk=128 test (single MMA call) - Previous UMMA descriptors were wrong (row-major SMEM, wrong LBO/SBO)	2026-05-28 09:15:40 +00:00
biondizzle	e5ba0ca119	debug: clean QK verify with scalar sanity + MMA result	2026-05-28 08:53:35 +00:00
biondizzle	9a51bfa578	fix: align SMEM layout properly (128B aligned tmem + Q)	2026-05-28 08:46:56 +00:00
biondizzle	2a765be715	fix: correct SMEM size for row-major (not swizzled)	2026-05-28 08:44:55 +00:00
biondizzle	ab84ad0f86	feat: implement canonical UMMA SMEM layout with SWIZZLE_128B Proper implementation of the SMEM layout that tcgen05.mma expects: - SWIZZLE_128B (layout_type=2) for both MN-major A and K-major B - Swizzle<3,4,3> applied to element offsets before SMEM write - MN_SW128 atom: (1024, 8) BF16, stride (1, 1024) - K_SW128 atom: (8, 1024) BF16, stride (1, 8) - umma_smem_write/read functions for both MN and K major - Descriptor with correct leading_byte_offset and stride_byte_offset This is the RIGHT WAY. No shortcuts.	2026-05-28 08:18:47 +00:00
biondizzle	3549a2388b	fix: constexpr HD for template param	2026-05-28 08:01:18 +00:00
biondizzle	7436315309	feat: add tcgen05.mma QK GEMM verification kernel + test Step 1 of tensor-core acceleration: - fmha_umma_desc.cuh: UMMA SMEM descriptor construction (raw bitfield) - fmha_qk_verify.cuh: QK GEMM using tcgen05.mma SS (SMEM A, SMEM B → TMEM C) - test_qk_mma.cu: standalone test comparing MMA output vs CPU reference Key design decisions: - UMMA descriptors built from raw bitfield (no CuTe dependency) - tcgen05.mma called by one lane per warp (elect_one_sync pattern) - Q: (128, HD) MN-major, K: (128, HD) K-major (transposed via descriptor) - S: (128, 128) in TMEM, row 0 read back via tcgen05.ld	2026-05-28 08:00:42 +00:00
biondizzle	9524b674ab	test: enable both reference + TMEM epilogue tests at hd=64/128	2026-05-28 07:49:48 +00:00
biondizzle	146e4f0282	debug: print NaN positions in test	2026-05-28 07:46:57 +00:00
biondizzle	a12607b0bd	test: add NaN counter to FMHA test	2026-05-28 07:45:32 +00:00
biondizzle	53c676c8a6	test: add max_abs_diff to FMHA test output	2026-05-28 07:44:45 +00:00
biondizzle	593bc25afa	test: add TMEM lane mapping diagnostics	2026-05-28 07:42:16 +00:00
biondizzle	0ddcc6bafd	debug: add printf to TMEM kernel to find hang point	2026-05-28 07:39:53 +00:00
biondizzle	44fb04fa1f	test: disable tmem epilogue test (debugging reference hang)	2026-05-28 07:38:47 +00:00
biondizzle	2eb44a00bf	fix(tmem): warp-collective TMEM ops + one-way correction epilogue Key fixes for fmha_epilogue_sm100.cuh hang: - tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute - Old code guarded TMEM ops with if(tid==0) = warp divergence = HANG - tmem_dealloc now uses tmem_base (value from alloc), not SMEM pointer - Compute attention in SMEM, then do one-way TMEM pipeline: SMEM → TMEM (warp-collective store) → regs (warp-collective load) → normalize in regs → BF16 cast → GMEM - This proves the MoE-style one-way correction epilogue on FMHA Also: enable TMEM kernel test + hd=128 in standalone test	2026-05-28 07:27:25 +00:00
biondizzle	bd16e8fa85	fix: use tcgen05.wait::st/ld instead of nonexistent tcgen05.fence ROOT CAUSE of TMEM hang: tcgen05.fence.cta_group::1.sync.aligned is NOT a valid PTX instruction. The correct TMEM ordering primitives are: - tcgen05.wait::st.sync.aligned (wait for TMEM stores to complete) - tcgen05.wait::ld.sync.aligned (wait for TMEM loads to complete) Found in cutlass/arch/barrier.h fence_view_async_tmem_store/load.	2026-05-28 07:12:26 +00:00
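
The K-major, no-swizzle layout described in commit 0f6907b001 (LBO = BLOCK_MN * 16 bytes, SBO = 128 bytes, 8x8 BF16 core matrices of 128 B each) can be sketched on the host as a byte-offset calculation. This is a minimal sketch and not the repository's `write_smem_kmajor`. The name `core_matrix_byte_offset` is hypothetical, and reading LBO as the stride between core matrices along K and SBO as the stride along MN is an assumption taken from the commit message.

```cpp
#include <cstddef>

// Hypothetical helper (not from the repository): byte offset of BF16 element
// (row, k) in a K-major, non-swizzled tile built from 8x8 core matrices.
// Assumption: LBO steps between core matrices along K, SBO along MN.
constexpr std::size_t kBytesPerElem = 2;   // BF16
constexpr std::size_t kCoreRows     = 8;   // rows per core matrix
constexpr std::size_t kCoreCols     = 8;   // 8 BF16 = 16 B per core-matrix row
constexpr std::size_t kCoreRowBytes = kCoreCols * kBytesPerElem;

constexpr std::size_t core_matrix_byte_offset(std::size_t row,
                                              std::size_t k,
                                              std::size_t block_mn) {
    const std::size_t lbo = block_mn * 16;  // value from commit 0f6907b001
    const std::size_t sbo = 128;            // value from commit 0f6907b001
    return (k / kCoreCols) * lbo            // which core-matrix column (K direction)
         + (row / kCoreRows) * sbo          // which core-matrix row (MN direction)
         + (row % kCoreRows) * kCoreRowBytes  // row inside the core matrix
         + (k % kCoreCols) * kBytesPerElem;   // element inside that 16 B row
}

// One 8x8 core matrix occupies 128 contiguous bytes.
static_assert(core_matrix_byte_offset(7, 7, 128) == 126, "last element of first core matrix");
// A 128 x 16 BF16 tile fills exactly 4096 bytes.
static_assert(core_matrix_byte_offset(127, 15, 128) == 4094, "last element of the tile");
```

With BLOCK_MN = 128 and HD = 16, as in the minimal test from commit 9b458d2a6c, the two K groups of eight columns sit 2048 bytes apart, and each group holds sixteen core matrices back to back.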


427 Commits
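
The `Swizzle<3,4,3>` mentioned in commit ab84ad0f86 is an XOR pattern on offset bits. The sketch below models it after CuTe's `Swizzle<BBits, MBase, SShift>` for a non-negative shift. The names `xor_swizzle` and `swizzle_128b` are hypothetical, and whether the offset is counted in bytes or elements depends on the caller, so the sketch makes no claim about how the repository's `umma_smem_write` applies it.

```cpp
#include <cstdint>

// Hypothetical sketch of an XOR swizzle with B bits, base M, shift S (S >= 0):
// take B bits starting at bit (M + S) and XOR them into the B bits starting at bit M.
template <unsigned B, unsigned M, unsigned S>
constexpr std::uint32_t xor_swizzle(std::uint32_t offset) {
    constexpr std::uint32_t src_mask = ((1u << B) - 1u) << (M + S);
    return offset ^ ((offset & src_mask) >> S);
}

// Parameters named in commit ab84ad0f86: bits [7,10) are XORed into bits [4,7).
constexpr std::uint32_t swizzle_128b(std::uint32_t offset) {
    return xor_swizzle<3, 4, 3>(offset);
}

// The source bits are left unchanged, so applying the swizzle twice is the identity.
static_assert(swizzle_128b(swizzle_128b(0x2A4u)) == 0x2A4u, "swizzle is an involution");
```

Because the function is its own inverse, the same routine can serve both a write path and a read-back check.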