nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	adc88613fa	Milestone 5 COMPLETE: multi-head FMHA grid launch verified on B200. All HD=16/64/128/256 pass across MHA (4+8 heads), MQA, and batched modes: cos 0.999997+, LSE matches reference. Updated CURRENT_ISSUE.md.	2026-05-28 19:35:06 +00:00
biondizzle	3fd302e7a0	Fix nvcc goto-bypasses-init errors in multi-head test	2026-05-28 19:33:04 +00:00
biondizzle	aa41cfa2e5	Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support. Changes: fmha_6warp_multihead.cuh: grid=(1, n_h, batch) kernel with FmhaParams; MQA support via k_head_stride=0 / v_head_stride=0; LSE output for multi-segment KV merge composition; test_fmha_6warp_multihead.cu: MHA (4+8 heads), MQA, batched tests; HD-specific wrappers for hd=16/64/128/256; marked E2M1 dequant bug as FIXED in consultant issue file.	2026-05-28 19:32:35 +00:00
biondizzle	6af2feb42a	TMA 5D test: element stride decomposition	2026-05-28 19:18:01 +00:00
biondizzle	96f2f0bb90	auto: pre-test commit	2026-05-28 19:12:23 +00:00
biondizzle	015435b1ab	auto: pre-test commit	2026-05-28 19:09:50 +00:00
biondizzle	41343fdc6b	auto: pre-test commit	2026-05-28 19:08:04 +00:00
biondizzle	a723b524f7	TMA alignment test	2026-05-28 17:00:20 +00:00
biondizzle	c54a83960d	TMA debug: fix globalStrides to tensorRank-1 elements	2026-05-28 16:58:30 +00:00
biondizzle	944e567b6c	TMA debug: test various CUtensorMap configs	2026-05-28 16:55:25 +00:00
biondizzle	55d289c65b	Fix TMA: use CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 not UINT16	2026-05-28 16:51:40 +00:00
biondizzle	0fd3e12a52	Fix TMA test: globalStrides in bytes not elements	2026-05-28 16:46:56 +00:00
biondizzle	ad8050bbad	WIP: TMA load test infrastructure (manual compile needed)	2026-05-28 16:45:04 +00:00
biondizzle	d9df1e6486	auto: pre-test commit	2026-05-28 16:42:24 +00:00
biondizzle	a4211559cf	auto: pre-test commit	2026-05-28 16:40:51 +00:00
biondizzle	3b8fdcc823	auto: pre-test commit	2026-05-28 16:39:45 +00:00
biondizzle	072fbf0b5d	auto: pre-test commit	2026-05-28 16:36:53 +00:00
biondizzle	090f2866ae	Update CURRENT_ISSUE: 6-warp Milestone 1 complete	2026-05-28 16:35:02 +00:00
biondizzle	b3020c2811	6-warp specialized FMHA kernel: ALL HD=16/64/128/256 PASS, cos 0.999997+. Warp layout (192 threads): warps 0-3: softmax + correction + epilogue; warp 4: MMA (QK + PV GEMM); warp 5: data staging (Q/K/V loads, direct GMEM for now). CTA-wide __syncthreads() sync between phases. Fix: removed spurious inv_sum normalization in epilogue (MMA output is already correctly scaled with softmax'd P). Files: fmha_6warp.cuh + test_fmha_6warp*.cu	2026-05-28 16:34:14 +00:00
biondizzle	2a6d72912a	auto: pre-test commit	2026-05-28 16:28:58 +00:00
biondizzle	e74c84458c	Clean up E2M1 dequant: use LUT approach (consultant recommendation). Both indexer files now use a constexpr LUT matching Python's E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6]. This is cleaner and more auditable than bit-manipulation.	2026-05-28 16:17:47 +00:00
biondizzle	79ef87f9a9	FIX: E2M1 FP4 dequantization bug in indexer_score_topk.cu. The dequant_fp4_scalar function was treating the magnitude bits as a raw integer (0-7) instead of the E2M1 floating-point format. Old (WRONG): val = (int)(nibble & 0x07) * scale. New (CORRECT): proper E2M1 decode with exponent + mantissa. E2M1 encoding (bias=1): exp=0 (subnormal): 0b000=0, 0b001=0.5; exp=1: 0b010=1, 0b011=1.5; exp=2: 0b100=2, 0b101=3; exp=3: 0b110=4, 0b111=6. Bug found by outside consultant. Affects indexer top-k selection correctness: wrong FP4 key decoding would select wrong CSA blocks. Fixed in both dsv4/kernels/indexer/indexer_score_topk.cu and dsv4/kernels/cuda/indexer_score_topk.cu.	2026-05-28 16:16:24 +00:00
biondizzle	44c4bade5f	Rewrite fmha_sm100_tc.cuh with working N=16 PV sub-tile approach. Production FMHA kernel template for Blackwell SM100: FmhaSm100Kernel<HD>::launch(q, k, v, o, s_k, scale, stream); QK: SS MMA N=128, one K-tile at a time; PV: SS MMA N=16 sub-tiles (HD/16 calls per K-tile); epilogue: TMEM → regs → BF16 → GMEM; ~25KB SMEM for all HD values; all HD=16/64/128/256 pass with cos 0.999997+.	2026-05-28 16:04:11 +00:00
biondizzle	a18d9c1584	Update CURRENT_ISSUE: ALL HD=16/64/128/256 PASS cos 0.999997+. Documented Layout D N=64 bug and N=16 sub-tile workaround.	2026-05-28 16:03:05 +00:00
biondizzle	01319d7247	auto: pre-test commit	2026-05-28 15:59:22 +00:00
biondizzle	43516ed4ec	auto: pre-test commit	2026-05-28 15:55:59 +00:00
biondizzle	1ec3e1ed2c	auto: pre-test commit	2026-05-28 15:55:18 +00:00
biondizzle	babff1f402	auto: pre-test commit	2026-05-28 15:54:05 +00:00
biondizzle	2b007d2008	auto: pre-test commit	2026-05-28 15:53:39 +00:00
biondizzle	84b997881f	auto: pre-test commit	2026-05-28 15:53:04 +00:00
biondizzle	6e5401df3b	auto: pre-test commit	2026-05-28 15:51:55 +00:00
biondizzle	102174fade	auto: pre-test commit	2026-05-28 15:50:52 +00:00
biondizzle	2dcfc0089f	auto: pre-test commit	2026-05-28 15:49:47 +00:00
biondizzle	1cdb90462f	auto: pre-test commit	2026-05-28 15:48:15 +00:00
biondizzle	80fd612132	auto: pre-test commit	2026-05-28 15:47:58 +00:00
biondizzle	9583cbc67a	auto: pre-test commit	2026-05-28 15:46:53 +00:00
biondizzle	1b86860c19	auto: pre-test commit	2026-05-28 15:46:16 +00:00
biondizzle	66cc117e11	auto: pre-test commit	2026-05-28 15:44:45 +00:00
biondizzle	2b32b51882	Update CURRENT_ISSUE with final session status	2026-05-28 15:22:32 +00:00
biondizzle	6249989cf6	Clean up HD=64 test, V layout verified correct	2026-05-28 15:21:33 +00:00
biondizzle	e1daad6955	Verify V SMEM values vs GMEM for HD=64	2026-05-28 15:19:31 +00:00
biondizzle	bafd26707b	FMHA HD=64 with BLOCK_MN_B=16, 4 N-tiles per K-tile	2026-05-28 15:17:40 +00:00
biondizzle	6896d1aebb	Update CURRENT_ISSUE: HD=16 done, HD=64 in progress	2026-05-28 15:16:19 +00:00
biondizzle	6b9b06647a	Clean up HD=64 debug prints, keep register-math PV check	2026-05-28 15:15:22 +00:00
biondizzle	5c9d471162	Add register-math PV reference for HD=64 debug	2026-05-28 15:13:47 +00:00
biondizzle	43e9efbc2b	Fix string literal	2026-05-28 15:12:20 +00:00
biondizzle	906be7ce50	Add filtered cosine (exclude near-zero)	2026-05-28 15:11:14 +00:00
biondizzle	40c83c769a	Fix: remove ×2 QK scale correction (MMA scale is 1.0, not 0.5)	2026-05-28 15:09:57 +00:00
biondizzle	6ea7356fdd	Debug: print P values for HD=64	2026-05-28 15:07:55 +00:00
biondizzle	4b052f22a5	Fix: opt into >48KB shared memory for HD=64	2026-05-28 15:06:37 +00:00
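
Commits 79ef87f9a9 and e74c84458c describe the E2M1 decode both as an exponent/mantissa rule and as an eight-entry lookup table. The host-side C++ sketch below shows the two forms side by side and checks that they agree. It is an illustration, not the code in indexer_score_topk.cu: the names `kE2M1Magnitudes`, `decode_e2m1`, `decode_e2m1_bits`, and `check_decoders_agree` are hypothetical, and treating bit 3 of the nibble as the sign bit is an assumption based on the usual 1-2-1 FP4 layout, since the commit message only mentions the low three magnitude bits.

```cpp
#include <cassert>
#include <cstdint>

// Magnitudes indexed by the low 3 bits of the nibble
// (2 exponent bits, 1 mantissa bit, exponent bias 1), as listed in e74c84458c.
constexpr float kE2M1Magnitudes[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                      2.0f, 3.0f, 4.0f, 6.0f};

// LUT decode. Assumption: bit 3 is the sign bit.
inline float decode_e2m1(std::uint8_t nibble, float scale) {
    const float mag = kE2M1Magnitudes[nibble & 0x07];
    return ((nibble & 0x08) ? -mag : mag) * scale;
}

// Equivalent bit-level decode, following the table in 79ef87f9a9:
// exp == 0 is subnormal (0 or 0.5); otherwise (1 + m/2) * 2^(exp - 1).
inline float decode_e2m1_bits(std::uint8_t nibble, float scale) {
    const int exp = (nibble >> 1) & 0x3;
    const int man = nibble & 0x1;
    const float mag = (exp == 0)
        ? 0.5f * static_cast<float>(man)
        : (1.0f + 0.5f * static_cast<float>(man)) * static_cast<float>(1 << (exp - 1));
    return ((nibble & 0x08) ? -mag : mag) * scale;
}

// Compares the two decoders over all 16 nibble values.
inline void check_decoders_agree() {
    for (int n = 0; n < 16; ++n) {
        const std::uint8_t nib = static_cast<std::uint8_t>(n);
        assert(decode_e2m1(nib, 1.0f) == decode_e2m1_bits(nib, 1.0f));
    }
}
```

For comparison, the old integer interpretation would map nibble 0b101 to 5 rather than 3, which is the kind of mis-scoring the fix commit refers to.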

1600 Commits
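
Commit aa41cfa2e5 mentions an LSE output "for multi-segment KV merge composition". The log does not show how the merge is done, so the following is only a host-side C++ sketch of the standard log-sum-exp merge of two partial attention results. It assumes each partial output is already normalized by its own softmax denominator, that LSE is stored as a natural log (a kernel may well use base 2), and that both LSE values are finite. `Partial` and `merge_partials` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One KV segment's result for a single query row (hypothetical layout).
struct Partial {
    std::vector<float> o;  // attention output, normalized within the segment
    float lse;             // natural-log sum of exp(scores) over the segment
};

// Merge two segments as if softmax had been taken over their union.
inline Partial merge_partials(const Partial& a, const Partial& b) {
    assert(a.o.size() == b.o.size());
    const float m  = std::max(a.lse, b.lse);  // subtract the max for stability
    const float wa = std::exp(a.lse - m);     // relative weight of segment a
    const float wb = std::exp(b.lse - m);     // relative weight of segment b
    const float inv = 1.0f / (wa + wb);

    Partial out;
    out.o.resize(a.o.size());
    for (std::size_t i = 0; i < a.o.size(); ++i) {
        out.o[i] = (wa * a.o[i] + wb * b.o[i]) * inv;
    }
    out.lse = m + std::log(wa + wb);          // LSE over both segments
    return out;
}
```

Because the merged result carries its own LSE, the same function can be applied repeatedly to fold in further segments.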