nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	2ffbfda47d	test: print SMEM verify data	2026-05-28 11:51:08 +00:00
biondizzle	4fd41365de	test: add SMEM verify for HD=64 K-tile offsets	2026-05-28 11:49:44 +00:00
biondizzle	4483539f01	test: HD=64 random data, 4 K-tiles, accumulate	2026-05-28 11:47:56 +00:00
biondizzle	73bd21ce01	test: force 1 K-tile for HD=64 debug	2026-05-28 11:46:12 +00:00
biondizzle	abe1870429	test: HD=64 all-ones, expected S[0,j]=64 (unscaled) or 8.0 scaled	2026-05-28 11:44:31 +00:00
biondizzle	73f9ff98c9	test: UMMA QK HD=64 (4 K-tiles, accumulate) — multi-K-tile test	2026-05-28 11:42:29 +00:00
biondizzle	df34cae9c6	UMMA QK GEMM WORKING! Update docs: 4x was scale factor, not bug. Major milestone: UMMA QK GEMM produces correct attention scores at HD=16! Details: MMA computes raw dot product, apply 1/sqrt(HD) scaling manually; tcgen05.fence::after_thread_sync for MMA→TMEM fence; 32x32b.x8 TMEM reads for Layout D output; 4 warps (128 threads) required for M=128. Next: HD=64 multi-K-tile, PV GEMM, full FMHA pipeline.	2026-05-28 11:41:19 +00:00
biondizzle	1874a70a6d	test: fix var ref	2026-05-28 11:39:15 +00:00
biondizzle	8426d13285	test: fix comparison — row 0 is S[0,c], rows 1-127 should be zero	2026-05-28 11:38:22 +00:00
biondizzle	6f40fafa91	test: verify ALL 128 rows × 8 cols match scalar reference	2026-05-28 11:36:46 +00:00
biondizzle	3c7d9d9303	test: apply 1/sqrt(HD) scale to MMA output — 4x was the scale factor, not a bug!	2026-05-28 11:34:45 +00:00
biondizzle	013f370046	test: all-ones data, expected S[0,j]=16.0 for every j	2026-05-28 11:32:56 +00:00
biondizzle	f5a0966afc	test: 4 warp leaders (lane==0) call MMA simultaneously	2026-05-28 11:30:19 +00:00
biondizzle	c01d6fddf4	test: gau-nernst pattern — fence::after_thread_sync, 4 warps, 128 threads, 32x32b.x8 loop	2026-05-28 11:28:47 +00:00
biondizzle	a048b56886	test: single-thread MMA + 0.25 scaling for 4× factor	2026-05-28 10:23:06 +00:00
biondizzle	57d67e6b51	test: revert to 64-bit descriptors, 4 warp leaders, 32x32b read	2026-05-28 10:21:06 +00:00
biondizzle	32f7fa7bce	Update CURRENT_ISSUE.md and MEMORY.md with UMMA 4× bug details: MMA produces exactly 4× scalar reference for all output values; SMEM data verified correct, descriptor values correct; 4× persists across different N, warp counts, padding; TMEM multi-store bug documented (16x256b.x1 crashes on 2nd store); Layout D read with 32x32b.x8 works. Next: study CUTLASS FMHA TMEM output layout to fix 4× factor.	2026-05-28 10:15:14 +00:00
biondizzle	3f95f1c5d4	test: try LBO with block_mn=32 (1/4 of M=128)	2026-05-28 10:11:38 +00:00
biondizzle	d03e353972	test: 4 warp leaders call MMA (Layout D requires 4 warps)	2026-05-28 10:10:07 +00:00
biondizzle	8059ed15ad	test: explicitly zero padding between Q and K	2026-05-28 10:08:35 +00:00
biondizzle	9e98c067ab	test: Layout D TMEM read using 32x32b.x8 format, 4 warps	2026-05-28 10:07:15 +00:00
biondizzle	68d1a7920c	test: M=64 in both desc and idesc	2026-05-28 10:04:17 +00:00
biondizzle	0f51fda0da	test: try N=8 in idesc	2026-05-28 10:02:52 +00:00
biondizzle	4f7c9649fd	test: clean UMMA QK test, debug 4x factor, 8KB padding, 128 TMEM cols	2026-05-28 10:01:39 +00:00
biondizzle	ac65ece33b	test: TMEM 2-store with fence outside wid guard, 64 threads	2026-05-28 09:59:43 +00:00
biondizzle	2c89eea6be	test: fence+sync between 2 tmem_stores	2026-05-28 09:58:51 +00:00
biondizzle	24c5afe1dc	test: 64 threads, 2 stores to col 0	2026-05-28 09:57:53 +00:00
biondizzle	987f2c8917	test: 2 tmem_stores to SAME column 0	2026-05-28 09:57:07 +00:00
biondizzle	494149f034	test: 32 threads (1 warp), no guards, all participate	2026-05-28 09:56:17 +00:00
biondizzle	f0cb71da5c	test: TMEM 2-col with fence+sync between stores, separate wid==0 blocks	2026-05-28 09:54:19 +00:00
biondizzle	b69a538ab1	test: add fence+sync between 2 tmem_stores	2026-05-28 09:53:10 +00:00
biondizzle	7a21fa4bd8	test: add 2nd tmem_store to column 1	2026-05-28 09:52:05 +00:00
biondizzle	4b129c146e	test: add 1 tmem_load back	2026-05-28 09:51:21 +00:00
biondizzle	61f19ce891	test: skip tmem_load, only store+dealloc	2026-05-28 09:50:48 +00:00
biondizzle	2513e1a692	test: use 64 threads, fence outside warp guard, 1 store	2026-05-28 09:50:09 +00:00
biondizzle	abfe9dbaa1	test: only 1 tmem_store to verify single column works	2026-05-28 09:49:21 +00:00
biondizzle	5795589abc	test: TMEM 4 columns, individual store calls + loop load	2026-05-28 09:48:27 +00:00
biondizzle	8a428f6127	test: TMEM column addressing test (128 cols, store+load)	2026-05-28 09:46:49 +00:00
biondizzle	ee3fe6d6b2	test: tmem_load column 1 only	2026-05-28 09:45:34 +00:00
biondizzle	6c38c6e442	test: read 8 TMEM columns individually (no loop)	2026-05-28 09:44:30 +00:00
biondizzle	bcc6ed114d	test: add 8KB padding after sQ to prevent MMA read overrun	2026-05-28 09:43:17 +00:00
biondizzle	764ed01d6f	test: try M=64 in descriptor + idesc to debug 4x factor	2026-05-28 09:41:50 +00:00
biondizzle	4cb656e583	test: try idesc=0 (same as gau-nernst)	2026-05-28 09:40:19 +00:00
biondizzle	cfba8484da	test: try idesc with N=128 (full extent) + 128 TMEM cols	2026-05-28 09:39:19 +00:00
biondizzle	30f0056b11	test: clean rewrite with SMEM Q/K verification and dot product check	2026-05-28 09:38:26 +00:00
biondizzle	7eb85a71fc	test: add Q SMEM verification output + bf16_to_f32_host	2026-05-28 09:37:07 +00:00
biondizzle	8f23c2aaf6	test: verify SMEM Q layout by reading back canonical data	2026-05-28 09:35:58 +00:00
biondizzle	004046a6a8	test: read only 1 TMEM column after MMA	2026-05-28 09:35:02 +00:00
biondizzle	41128122e3	test: clean rewrite, 32 TMEM cols, MMA N=32, tmem_load loop	2026-05-28 09:33:45 +00:00
biondizzle	58be79957d	test: 32 TMEM cols, add MMA call with N=32, read S from TMEM	2026-05-28 09:32:33 +00:00

1450 Commits