nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	c768abed95	test: softmax-only kernel (QK + row_max, no PV)	2026-05-28 23:15:36 +00:00
biondizzle	43ba672e15	fmha_multirow: add fence.sc.gpu after QK GEMM for TMEM visibility	2026-05-28 23:13:31 +00:00
biondizzle	d840fbbf85	test: clean multirow test with proper SMEM calc	2026-05-28 23:10:49 +00:00
biondizzle	f2124b9378	fix: SMEM calc in decode test	2026-05-28 23:08:54 +00:00
biondizzle	58ff781388	test: simplified decode kernel for debugging multirow	2026-05-28 23:08:33 +00:00
biondizzle	be2685e9e3	fmha_multirow: use natural 4-warp TMEM partitioning after UMMA. After UMMA (QK GEMM), 4 warps reading TMEM with 32x32b.x8 each see a different 32-row partition (verified on B200): warp 0 → rows 0-31, warp 1 → rows 32-63, etc. Lane l in warp w reads row w*32 + l. This eliminates the broken row_page<<16 addressing and allows: (1) T<=32: warp 0 only, 32x32b.x8, each lane = one row; (2) T>32: 4 warps, each reads its natural 32-row partition; (3) Epilogue: same partitioning for reading O from TMEM. No s_p_vals buffer; P is streamed per K-tile through sPk.	2026-05-28 23:07:31 +00:00
biondizzle	ff8c677486	fix: SMEM size for MMA test — account for both sQ0 and sK0	2026-05-28 23:06:07 +00:00
biondizzle	fee022a485	test: MMA→4-warp read using proven fmha_common+umma_desc infra	2026-05-28 23:05:29 +00:00
biondizzle	e1a708a187	test: try 16x256b.x1 with column step=4 (4 cols per read)	2026-05-28 23:03:51 +00:00
biondizzle	95003eced2	test: 16x256b.x1 loads with uint32_t regs, matching working pattern	2026-05-28 23:03:10 +00:00
biondizzle	fffb493b0e	fix: 16x256b.x1 load syntax — single address operand	2026-05-28 23:02:23 +00:00
biondizzle	44dcd6e8d0	test: 16x256b.x1 multiple LOADS — do they crash like stores?	2026-05-28 23:02:03 +00:00
biondizzle	d54bce6a6d	fix: correct SMEM size for MMA 4-warp test	2026-05-28 23:01:12 +00:00
biondizzle	be45e87891	test: MMA→4-warp TMEM read — do warps see different rows?	2026-05-28 23:00:27 +00:00
biondizzle	6b0d57074a	test: TMEM cross-warp visibility with different sync strategies	2026-05-28 22:59:31 +00:00
biondizzle	77d190278e	test: simpler TMEM 4-warp read — direct store+load	2026-05-28 22:58:48 +00:00
biondizzle	91b03bd6bd	test: verify 4-warp TMEM read with 32x32b.x8 after MMA	2026-05-28 22:57:59 +00:00
biondizzle	28e04a5ea8	fix: use __cvta_generic_to_shared directly for 64-bit compat	2026-05-28 22:56:29 +00:00
biondizzle	1d6a95df32	fix: typo in tmem row offset test	2026-05-28 22:56:15 +00:00
biondizzle	cf6fe71368	test: verify TMEM 32x32b.x8 row offset addressing	2026-05-28 22:56:00 +00:00
biondizzle	4cfb707405	fix: correct SMEM size calculation in multirow test	2026-05-28 22:53:46 +00:00
biondizzle	863a030c3b	fmha_multirow: rewrite with 32x32b.x8 only, no s_p_vals, row_page addressing. Kill 64KB s_p_vals buffer (P is streamed per K-tile through sPk); all TMEM ops use 32x32b.x8 exclusively (16x256b.x1 crashes on 2nd call); T>32: 4 softmax warps use row_page offset in TMEM address (row<<16); lane l in warp w handles row w*32+l; two-pass softmax: pass 1 row_max, pass 2 exp/sum interleaved with PV; PV: N=16 sub-tiles, SS MMA sPk(128,16) × sV(16,16) → TMEM; epilogue: 32x32b.x8 TMEM read, normalize, BF16 → GMEM; SMEM budget: ~14KB (well within 232KB).	2026-05-28 22:52:52 +00:00
biondizzle	1ba304db3e	stuff	2026-05-28 21:08:13 +00:00
biondizzle	deaa3ec725	CRITICAL FIX: Q/K SMEM canonical layout must use local d (0..15) not full_d — UMMA descriptor reads from sQ0/sK0 start, not offset	2026-05-28 20:13:52 +00:00
biondizzle	08694b8136	Fix multi-row softmax v3: 32x32b.x8 with per-lane per-row (no wmax/wsum), per-row sRowMax/sRowSum arrays	2026-05-28 20:10:13 +00:00
biondizzle	aaa76c1af1	Rewrite multi-row softmax using 16x256b.x1 TMEM reads for proper multi-row access	2026-05-28 20:08:30 +00:00
biondizzle	5e3c61184c	Fix multi-row softmax: remove cross-lane wmax/wsum — each lane handles its own row independently	2026-05-28 20:06:16 +00:00
biondizzle	bf4dfd131b	Fix nvcc goto-bypasses-init: move var decls before goto targets	2026-05-28 20:04:59 +00:00
biondizzle	2b09d4f2ef	Fix nvcc goto-bypasses-init in multi-row test	2026-05-28 20:04:45 +00:00
biondizzle	d8b421ccee	Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax	2026-05-28 20:04:29 +00:00
biondizzle	adc88613fa	Milestone 5 COMPLETE: multi-head FMHA grid launch verified on B200. All HD=16/64/128/256 pass across MHA (4+8 heads), MQA, batched modes; cos 0.999997+, LSE matches reference. Updated CURRENT_ISSUE.md.	2026-05-28 19:35:06 +00:00
biondizzle	3fd302e7a0	Fix nvcc goto-bypasses-init errors in multi-head test	2026-05-28 19:33:04 +00:00
biondizzle	aa41cfa2e5	Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support. fmha_6warp_multihead.cuh: grid=(1, n_h, batch) kernel with FmhaParams; MQA support via k_head_stride=0 / v_head_stride=0; LSE output for multi-segment KV merge composition; test_fmha_6warp_multihead.cu: MHA (4+8 heads), MQA, batched tests; HD-specific wrappers for hd=16/64/128/256; marked E2M1 dequant bug as FIXED in consultant issue file.	2026-05-28 19:32:35 +00:00
biondizzle	6af2feb42a	TMA 5D test: element stride decomposition	2026-05-28 19:18:01 +00:00
biondizzle	96f2f0bb90	auto: pre-test commit	2026-05-28 19:12:23 +00:00
biondizzle	015435b1ab	auto: pre-test commit	2026-05-28 19:09:50 +00:00
biondizzle	41343fdc6b	auto: pre-test commit	2026-05-28 19:08:04 +00:00
biondizzle	a723b524f7	TMA alignment test	2026-05-28 17:00:20 +00:00
biondizzle	c54a83960d	TMA debug: fix globalStrides to tensorRank-1 elements	2026-05-28 16:58:30 +00:00
biondizzle	944e567b6c	TMA debug: test various CUtensorMap configs	2026-05-28 16:55:25 +00:00
biondizzle	55d289c65b	Fix TMA: use CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 not UINT16	2026-05-28 16:51:40 +00:00
biondizzle	0fd3e12a52	Fix TMA test: globalStrides in bytes not elements	2026-05-28 16:46:56 +00:00
biondizzle	ad8050bbad	WIP: TMA load test infrastructure (manual compile needed)	2026-05-28 16:45:04 +00:00
biondizzle	d9df1e6486	auto: pre-test commit	2026-05-28 16:42:24 +00:00
biondizzle	a4211559cf	auto: pre-test commit	2026-05-28 16:40:51 +00:00
biondizzle	3b8fdcc823	auto: pre-test commit	2026-05-28 16:39:45 +00:00
biondizzle	072fbf0b5d	auto: pre-test commit	2026-05-28 16:36:53 +00:00
biondizzle	090f2866ae	Update CURRENT_ISSUE: 6-warp Milestone 1 complete	2026-05-28 16:35:02 +00:00
biondizzle	b3020c2811	6-warp specialized FMHA kernel: ALL HD=16/64/128/256 PASS, cos 0.999997+. Warp layout (192 threads): warps 0-3: softmax + correction + epilogue; warp 4: MMA (QK + PV GEMM); warp 5: data staging (Q/K/V loads, direct GMEM for now). CTA-wide __syncthreads() sync between phases. Fix: removed spurious inv_sum normalization in epilogue (MMA output is already correctly scaled with softmax'd P). Files: fmha_6warp.cuh + test_fmha_6warp*.cu	2026-05-28 16:34:14 +00:00
biondizzle	2a6d72912a	auto: pre-test commit	2026-05-28 16:28:58 +00:00

1930 Commits