nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	863a030c3b	fmha_multirow: rewrite with 32x32b.x8 only, no s_p_vals, row_page addressing - Kill 64KB s_p_vals buffer — P is streamed per K-tile through sPk - All TMEM ops use 32x32b.x8 exclusively (16x256b.x1 crashes on 2nd call) - T>32: 4 softmax warps use row_page offset in TMEM address (row<<16) - Lane l in warp w handles row w*32+l - Two-pass softmax: pass 1 row_max, pass 2 exp/sum interleaved with PV - PV: N=16 sub-tiles, SS MMA sPk(128,16) × sV(16,16) → TMEM - Epilogue: 32x32b.x8 TMEM read, normalize, BF16 → GMEM - SMEM budget: ~14KB (well within 232KB)	2026-05-28 22:52:52 +00:00
biondizzle	1ba304db3e	stuff	2026-05-28 21:08:13 +00:00
biondizzle	deaa3ec725	CRITICAL FIX: Q/K SMEM canonical layout must use local d (0..15) not full_d — UMMA descriptor reads from sQ0/sK0 start, not offset	2026-05-28 20:13:52 +00:00
biondizzle	08694b8136	Fix multi-row softmax v3: 32x32b.x8 with per-lane per-row (no wmax/wsum), per-row sRowMax/sRowSum arrays	2026-05-28 20:10:13 +00:00
biondizzle	aaa76c1af1	Rewrite multi-row softmax using 16x256b.x1 TMEM reads for proper multi-row access	2026-05-28 20:08:30 +00:00
biondizzle	5e3c61184c	Fix multi-row softmax: remove cross-lane wmax/wsum — each lane handles its own row independently	2026-05-28 20:06:16 +00:00
biondizzle	d8b421ccee	Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax	2026-05-28 20:04:29 +00:00
biondizzle	aa41cfa2e5	Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support - fmha_6warp_multihead.cuh: grid=(1, n_h, batch) kernel with FmhaParams - MQA support via k_head_stride=0 / v_head_stride=0 - LSE output for multi-segment KV merge composition - test_fmha_6warp_multihead.cu: MHA (4+8 heads), MQA, batched tests - HD-specific wrappers for hd=16/64/128/256 - Marked E2M1 dequant bug as FIXED in consultant issue file	2026-05-28 19:32:35 +00:00
biondizzle	6af2feb42a	TMA 5D test: element stride decomposition	2026-05-28 19:18:01 +00:00
biondizzle	41343fdc6b	auto: pre-test commit	2026-05-28 19:08:04 +00:00
biondizzle	b3020c2811	6-warp specialized FMHA kernel — ALL HD=16/64/128/256 PASS cos 0.999997+ Warp layout (192 threads): - Warps 0-3: Softmax + correction + epilogue - Warp 4: MMA (QK + PV GEMM) - Warp 5: Data staging (Q/K/V loads, direct GMEM for now) CTA-wide __syncthreads() sync between phases. Fix: removed spurious inv_sum normalization in epilogue (MMA output is already correctly scaled with softmax'd P). Files: fmha_6warp.cuh + test_fmha_6warp*.cu	2026-05-28 16:34:14 +00:00
biondizzle	2a6d72912a	auto: pre-test commit	2026-05-28 16:28:58 +00:00
biondizzle	e74c84458c	Clean up E2M1 dequant: use LUT approach (consultant recommendation) Both indexer files now use a constexpr LUT matching Python's E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6]. This is cleaner and more auditable than bit-manipulation.	2026-05-28 16:17:47 +00:00
biondizzle	79ef87f9a9	FIX: E2M1 FP4 dequantization bug in indexer_score_topk.cu The dequant_fp4_scalar function was treating the magnitude bits as a raw integer (0-6) instead of the E2M1 floating-point format: Old (WRONG): val = (int)(nibble & 0x07) * scale New (CORRECT): proper E2M1 decode with exponent + mantissa E2M1 encoding (bias=1): exp=0 subnormal: 0b000=0, 0b001=0.5 exp=1: 0b010=1, 0b011=1.5 exp=2: 0b100=2, 0b101=3 exp=3: 0b110=4, 0b111=6 Bug found by outside consultant. Affects indexer top-k selection correctness — wrong FP4 key decoding would select wrong CSA blocks. Fixed in both: - dsv4/kernels/indexer/indexer_score_topk.cu - dsv4/kernels/cuda/indexer_score_topk.cu	2026-05-28 16:16:24 +00:00
biondizzle	44c4bade5f	Rewrite fmha_sm100_tc.cuh with working N=16 PV sub-tile approach Production FMHA kernel template for Blackwell SM100: - FmhaSm100Kernel<HD>::launch(q, k, v, o, s_k, scale, stream) - QK: SS MMA N=128, one K-tile at a time - PV: SS MMA N=16 sub-tiles (HD/16 calls per K-tile) - Epilogue: TMEM → regs → BF16 → GMEM - ~25KB SMEM for all HD values - All HD=16/64/128/256 pass with cos 0.999997+	2026-05-28 16:04:11 +00:00
biondizzle	0f6907b001	UMMA: fix descriptor + idesc — use gau-nernst tutorial values - LBO = BLOCK_MN * 16 (bytes), SBO = 128 (bytes) for K-major NONE - Canonical SMEM layout: column-major interleaving of core matrices - idesc is SEPARATE 32-bit value (was using desc_a>>32 = WRONG) - idesc encodes dtype/atype/btype/MMA_M/MMA_N - This was the root cause of 'misaligned address' errors	2026-05-28 09:18:45 +00:00
biondizzle	427410d94a	UMMA: Rewrite fmha_umma_desc.cuh with correct K-major core-matrix layout + minimal QK GEMM test - Core-matrix layout: each 8x8 BF16 tile (128B) contiguous in SMEM - K-major NONE descriptor: LBO=1 (16B), SBO=block_k/8, lbo_mode=0 - MMA K-tiling: tcgen05.mma uses K=16 per call, tile for hd>16 - write_smem_kmajor: converts row-major to core-matrix layout - write_smem_ktile: extracts single K-tile in core-matrix layout - test_umma_qk.cu: minimal hd=16, sk=128 test (single MMA call) - Previous UMMA descriptors were wrong (row-major SMEM, wrong LBO/SBO)	2026-05-28 09:15:40 +00:00
biondizzle	e5ba0ca119	debug: clean QK verify with scalar sanity + MMA result	2026-05-28 08:53:35 +00:00
biondizzle	a04d794979	debug: skip TMEM alloc — test SMEM loads only	2026-05-28 08:49:37 +00:00
biondizzle	72c97f2546	debug: minimal UMMA descriptor (just start_addr + version)	2026-05-28 08:48:01 +00:00
biondizzle	9a51bfa578	fix: align SMEM layout properly (128B aligned tmem + Q)	2026-05-28 08:46:56 +00:00
biondizzle	c64bd7b875	debug: read Q/K directly from SMEM	2026-05-28 08:43:39 +00:00
biondizzle	58b610c96c	fix: proper early return for SMEM load test	2026-05-28 08:41:30 +00:00
biondizzle	82bc2c4a49	debug: verify SMEM loads + scalar QK sanity check	2026-05-28 08:40:16 +00:00
biondizzle	53139d24bf	debug: verify TMEM r/w works before MMA	2026-05-28 08:39:12 +00:00
biondizzle	a9d71ff6ab	debug: print TMEM values after MMA	2026-05-28 08:38:08 +00:00
biondizzle	bfb1e177ce	debug: try all-lane MMA + print tmem_base	2026-05-28 08:37:02 +00:00
biondizzle	d3510980e4	feat: SWIZZLE_NONE UMMA descriptors with row-major SMEM Canonical UMMA layout for SWIZZLE_NONE: - MN-major (128, 64): LBO=16, SBO=128 (from logical_divide Tile(1,8)) - K-major (128, 64): LBO=16, SBO=32 (from logical_divide Tile(8,2)) Using simple row-major SMEM layout (no swizzle, no permutation). Data is written directly to SMEM in row-major order. The descriptor strides describe the canonical layout.	2026-05-28 08:35:30 +00:00
biondizzle	ab84ad0f86	feat: implement canonical UMMA SMEM layout with SWIZZLE_128B Proper implementation of the SMEM layout that tcgen05.mma expects: - SWIZZLE_128B (layout_type=2) for both MN-major A and K-major B - Swizzle<3,4,3> applied to element offsets before SMEM write - MN_SW128 atom: (1024, 8) BF16, stride (1, 1024) - K_SW128 atom: (8, 1024) BF16, stride (1, 8) - umma_smem_write/read functions for both MN and K major - Descriptor with correct leading_byte_offset and stride_byte_offset This is the RIGHT WAY. No shortcuts.	2026-05-28 08:18:47 +00:00
biondizzle	ecbc75255c	fix: correct UMMA descriptor format from CUTLASS source The descriptor bitfield is completely different from what I assumed: - [0,14) start_address (smem_ptr >> 4) - [16,30) leading_byte_offset (row stride bytes >> 4) - [32,46) stride_byte_offset - [46,48) version = 1 (Blackwell) - [61,64) layout_type (0=NONE, 2=128B, 4=64B, 6=32B) - idescE = desc >> 32, passed as separate arg to MMA PTX The 64-bit descriptor uses byte offsets (not log2 or element counts). The upper 32 bits are reinterpreted by the MMA hardware as idescE.	2026-05-28 08:07:52 +00:00
biondizzle	fe7d561143	debug: print UMMA descriptor values for diagnosis	2026-05-28 08:03:53 +00:00
biondizzle	c5f7a9a15c	fix: align SMEM buffers to 16 bytes for UMMA descriptors	2026-05-28 08:02:53 +00:00
biondizzle	7436315309	feat: add tcgen05.mma QK GEMM verification kernel + test Step 1 of tensor-core acceleration: - fmha_umma_desc.cuh: UMMA SMEM descriptor construction (raw bitfield) - fmha_qk_verify.cuh: QK GEMM using tcgen05.mma SS (SMEM A, SMEM B → TMEM C) - test_qk_mma.cu: standalone test comparing MMA output vs CPU reference Key design decisions: - UMMA descriptors built from raw bitfield (no CuTe dependency) - tcgen05.mma called by one lane per warp (elect_one_sync pattern) - Q: (128, HD) MN-major, K: (128, HD) K-major (transposed via descriptor) - S: (128, 128) in TMEM, row 0 read back via tcgen05.ld	2026-05-28 08:00:42 +00:00
biondizzle	6fb3d54c02	docs: update here-docs with CuTeDSL rationale for NVIDIA Updated fmha_common.cuh, fmha_sm100.cuh, fmha_epilogue_sm100.cuh, and fmha_sm100_launch.cuh with comprehensive here-docs explaining: 1. The 4 CuTeDSL gaps that forced us to raw CUDA C++: - TMEM round-trip broken (Ld32x32bOp/St32x32bOp column mismatch) - Float→int impossible (arith.fptosi not lowerable) - epilogue_tma_store blocks multi-CTA - hd=512 MLIR optimizer hangs 2. TMEM lane mapping (verified on B200): - Lane i → positions i*4+0..3, 128 FP32 per column - Warp-collective: ALL 32 lanes must call ld/st or HANG - Column address = tmem_base + column_index 3. Key insight for NVIDIA: float→int gap is the single most impactful limitation, blocking ALL quantization-epilogue fusion	2026-05-28 07:54:01 +00:00
biondizzle	446a0ca9fd	refactor(tmem): clean rewrite of TMEM epilogue kernel Removed all dead code from the first (broken) attention loop approach. Clean pipeline: SMEM attention → TMEM write → TMEM read → normalize → GMEM. Also renamed sPvBuf to sO for clarity (same as reference kernel).	2026-05-28 07:49:03 +00:00
biondizzle	c989dc78d9	debug: print sPvBuf[32] value	2026-05-28 07:47:37 +00:00
biondizzle	b50f6a8512	debug: add TMEM read diagnostic	2026-05-28 07:46:15 +00:00
biondizzle	579dd061cd	fix: remove duplicate TMEM_COLS_NEEDED declarations	2026-05-28 07:43:54 +00:00
biondizzle	278f1b34af	fix(tmem): correct lane-to-position mapping for tcgen05.ld/st CRITICAL FIX: tcgen05.st 16x256b.x1.b32 is warp-collective where: - Lane i writes to positions i4+0..i4+3 within the column - 32 lanes × 4 FP32 = 128 FP32 per column - For row 0: lane 0 = positions 0-3, lane 1 = 4-7, ..., lane 31 = 124-127 Old code iterated col = lane; col < N; col += 32, treating each lane as owning a separate column. That was WRONG — all 32 lanes share each column, each owning 4 positions within it. New code: HD values need ceil(HD/128) columns. Lane i writes sPvBuf[i*4+0..3] to column 0 (or column 1 for HD > 128). Verified via test_tmem_lane_mapping.cu on B200.	2026-05-28 07:43:40 +00:00
biondizzle	33cedbee0a	fix(tmem): TMEM ld/st are warp-collective — ALL 32 lanes must call them Root cause of TMEM epilogue hang: tmem_store/tmem_load are warp-collective operations requiring ALL 32 lanes to participate. The loop 'for (col = lane; col < TMEM_O_COLS; col += WARP)' with TMEM_O_COLS=16 and WARP=32 means only lanes 0-15 execute the op. Lanes 16-31 skip it = warp divergence on collective = HANG. Fix: loop over TMEM_N (>= 32, power of 2) so all 32 lanes participate. Columns beyond TMEM_O_COLS write don't-care data to allocated-but-unused TMEM columns.	2026-05-28 07:41:16 +00:00
biondizzle	cea02fe407	fix: add cstdio for printf in TMEM debug	2026-05-28 07:40:04 +00:00
biondizzle	0ddcc6bafd	debug: add printf to TMEM kernel to find hang point	2026-05-28 07:39:53 +00:00
biondizzle	2eb44a00bf	fix(tmem): warp-collective TMEM ops + one-way correction epilogue Key fixes for fmha_epilogue_sm100.cuh hang: - tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute - Old code guarded TMEM ops with if(tid==0) = warp divergence = HANG - tmem_dealloc now uses tmem_base (value from alloc), not SMEM pointer - Compute attention in SMEM, then do one-way TMEM pipeline: SMEM → TMEM (warp-collective store) → regs (warp-collective load) → normalize in regs → BF16 cast → GMEM - This proves the MoE-style one-way correction epilogue on FMHA Also: enable TMEM kernel test + hd=128 in standalone test	2026-05-28 07:27:25 +00:00
biondizzle	4fe9bbab48	add back in the archived code	2026-05-28 07:04:59 +00:00
biondizzle	4336de9372	attention/: Clean up folder, archive backups, add detailed status headers What changed: - Moved fmha_backup_pre_epilog.py, fmha_backup_v2.py, fmha_smem_acc.py to archive/ - Deleted fmha.py.backup (git has history) - Added detailed heredoc headers to ALL files documenting: * WHAT WORKS and WHAT'S BROKEN * WHY each limitation exists (CuTeDSL toolchain gaps) * KEY INSIGHTS FOR NVIDIA (what CuTeDSL is missing) * What each file unblocks if fixed File status: fmha.py — CuTeDSL FMHA, cos 0.999998, D1.5 workaround fmha_common.cuh — Raw CUDA shared defs (BF16, TMEM ops) fmha_sm100.cuh — Raw CUDA reference, cos 0.999999 fmha_epilogue_sm100.cuh — Raw CUDA TMEM epilogue, HANGS (needs debug) fmha_sm100_launch.cu — PyTorch binding (JIT broken, nvcc works) production.py — CuTeDSL production wrapper (partial) archive/ — Historical backups with explanation headers	2026-05-28 07:01:33 +00:00
biondizzle	a391615f60	fix: uint64_t for SMEM pointer	2026-05-28 06:39:19 +00:00
biondizzle	b4779e3f48	fix: cvta.to.shared.u64 for 64-bit SMEM pointers	2026-05-28 06:37:52 +00:00
biondizzle	cf264bd0e2	fix: cvta.shared.u32 (not cvta.to.shared)	2026-05-28 06:36:50 +00:00
biondizzle	771799e112	FMHA SM100: Fix TMEM operations — uint32_t registers, correct PTX syntax TMEM load/store uses b32 (uint32_t) registers, NOT float. Bitcast float↔uint32_t for FP32 TMEM values. TMEM alloc takes SMEM pointer (not a return value). TMEM column addressing: col + row_group * tmem_n.	2026-05-28 06:35:50 +00:00
biondizzle	e173295a3a	FMHA SM100: Refactor into common + reference + TMEM epilogue headers - fmha_common.cuh: BF16, TMEM ops, warp reductions (shared) - fmha_sm100.cuh: Phase 1 reference (SMEM-based, cos 0.999999) - fmha_epilogue_sm100.cuh: Phase 2 TMEM+correction epilogue (Priority 2) - Test both kernels at hd=64 and hd=128	2026-05-28 06:31:05 +00:00

1 2 3 4 5 ...

403 Commits