nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	2b945f255b	test: TMA K-load + QK GEMM — incremental from working pattern	2026-05-29 18:47:27 +00:00
biondizzle	f33746f183	test: minimal TMA K-load — no MMA/TMEM, just verify TMA + canonical	2026-05-29 18:46:09 +00:00
biondizzle	d64b62bc80	test: simple (128,16) TMA desc for K sub-tile only	2026-05-29 18:45:01 +00:00
biondizzle	eaf8a878cf	fix: only warp 0 lane 0 issues TMA (not all lane 0 threads)	2026-05-29 18:44:18 +00:00
biondizzle	69bf20b09d	fix: SMEM alignment in TMA K-only test	2026-05-29 18:43:44 +00:00
biondizzle	2c0ee69aea	test: TMA K-only — proven gen pattern + TMA for K loads only	2026-05-29 18:43:07 +00:00
biondizzle	9fc2d549e4	fix: warp-collective TMEM read/dealloc in minimal QK test	2026-05-29 18:42:03 +00:00
biondizzle	c755e6fdde	fix: TMEM read/dealloc for 128-thread kernel	2026-05-29 18:40:24 +00:00
biondizzle	bd1309ba88	test: minimal QK — 128 threads, tid==0 MMA, match working gen kernel pattern	2026-05-29 18:40:11 +00:00
biondizzle	39aef1284f	fix: smem size in minimal QK test	2026-05-29 18:37:38 +00:00
biondizzle	ce89fe9170	test: minimal QK — separate sQ0/sK0, clean SMEM layout	2026-05-29 18:37:20 +00:00
biondizzle	71b353577d	fix: QK direct test — per-K-sub-tile Q load (same as working kernel)	2026-05-29 18:35:00 +00:00
biondizzle	35d0596893	fix: T=1 for QK direct test (write_q_to_smem only handles row 0)	2026-05-29 18:33:35 +00:00
biondizzle	bee7cc5f8f	fix: lane vs threadIdx.x in direct QK test	2026-05-29 18:32:21 +00:00
biondizzle	670599b754	test: direct QK GEMM — baseline for TMA comparison	2026-05-29 18:31:57 +00:00
biondizzle	9a185f0222	test: debug Q SMEM canonical after TMA load	2026-05-29 18:30:52 +00:00
biondizzle	1500020593	test: QK-only TMA test — isolate TMA load + canonical + MMA	2026-05-29 18:29:49 +00:00
biondizzle	204cc90808	fix: load full Q (128,HD) once before QK loop — not per K-sub-tile The MMA expects Q sub-tiles from a full (128,HD) canonical buffer, but we were only loading (128,16) sub-tiles into a (128,16) buffer. The MMA descriptor with block_mn=128 describes a (128,128) matrix, reading 128 columns from SMEM but only 16 had real data. Now: load all HD/16 TMA tiles of Q into a full (128,HD) canonical buffer before the QK loop. The MMA reads the kt-th sub-tile via descriptor offset kt * 128 * 32 bytes. Also: share single sTmaBuf staging buffer for all TMA loads (Q, K, V). Removed separate sQ_tma, sK_tma, sV_tma buffers.	2026-05-29 18:28:45 +00:00
biondizzle	bf7cf54a51	fix: align TMA SMEM to 128 bytes in verification test	2026-05-29 18:27:07 +00:00
biondizzle	befc2c647b	test: TMA load verification — compare against direct GMEM read	2026-05-29 18:26:34 +00:00
biondizzle	8e09fae3a1	fix: warp-stride for TMA canonical writes — only load warp calls them write_smem_canonical used NTHREADS=192 as the stride, but in the TMA kernel only the load warp (32 threads) calls it. With threadIdx.x in [160,191] and stride 192, only 32 out of 2048 elements got written. Fix: template STRIDE parameter, default 192, TMA kernel uses 32.	2026-05-29 18:25:47 +00:00
biondizzle	3e14a25bb0	fix: don't re-init mbarrier in loop — use phase parity tracking The mbarrier is initialized once before the loop with count=1. Inside the loop: issue TMA → arrive.expect_tx → wait(phase) → flip phase. Re-initializing the mbarrier inside the loop resets the phase, which breaks the parity tracking and causes the wait to hang. This matches the CUTLASS/gau-nernst pattern exactly.	2026-05-29 18:24:47 +00:00
biondizzle	bd169ccb0f	fix: smart quote in fmha_tma.cuh	2026-05-29 18:22:26 +00:00
biondizzle	345b107f4c	fix: TMA mbarrier — add arrive.expect_tx (root cause of multi-warp hang) The TMA cp.async.bulk.tensor with mbarrier::complete_tx::bytes decrements the mbarrier tx_count by the byte count of the transfer. Without calling mbarrier.arrive.expect_tx to increment tx_count first, the count underflows and the phase never completes — causing the wait to hang forever. This was the root cause of the multi-warp TMA hang. With 32 threads it worked by accident (phase parity wrapped around); with 128+ threads the timing was different and the hang was exposed. Also: - Use CUTLASS-style @P1 bra DONE wait pattern (not selp.b32) - Add fence.mbarrier_init.release.cluster after mbarrier init - Track phase parity across the kernel (flip after each wait) - Re-init mbarrier before each TMA transaction (proper phase management) Reference: gau-nernst tcgen05 tutorial	2026-05-29 18:22:00 +00:00
biondizzle	c69f3668e1	feat: TMA async FMHA kernel — WORKING on B200 Three critical CUDA 13 fixes that made TMA work: 1. globalStrides in BYTES not elements (root cause of desc creation failures) 2. BFLOAT16 data type instead of UINT16 3. mbarrier wait: selp.b32 polling pattern (@p bra HANGS on SM100!) Also includes CUTLASS driver workaround (bit 21 clear for drv <= 13.1). Verified: 2D TMA load of (128,16) BF16 tile = 0 mismatches. Kernel: fmha_6warp_tma_kernel with per-sub-tile TMA loads for Q, K, V. Test: test_fmha_tma.cu with padded Q allocations and per-head descriptors.	2026-05-29 07:02:07 +00:00
biondizzle	a40c05f3f2	archive: TMA driver-API files + CUDA 13 TMA discovery notes Key findings documented in docs/cuda13_tma_notes.md: - CUDA 13 globalStrides are in BYTES not elements (root cause of desc creation failures) - BFLOAT16 data type available in CUDA 13 - Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2 - CuTeDSL tma_partition works (production path) Archived (not deleted): - fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu - These will work once driver matches toolkit version	2026-05-29 06:52:39 +00:00
biondizzle	55f0c6267b	auto: pre-test commit	2026-05-29 06:41:58 +00:00
biondizzle	197cac875c	fix: CUDA 13 TMA descriptor — 3D rank + byte strides + mbarrier byte count Five critical fixes for CUDA 13.x on Blackwell: 1. globalStrides are in BYTES not elements (CUDA 13 change) 2. Use 3D descriptors (degenerate 3rd dim=1) — CUDA 13 TMA requires rank >= 2 3. mbarrier init uses expected byte count (4096 for 128x16 BF16 tile) 4. cp.async.bulk.tensor.3d instead of .2d for 3D descriptors 5. BFLOAT16 data type instead of UINT16	2026-05-29 06:34:58 +00:00
biondizzle	85cd95e609	debug: TMA context fix test	2026-05-29 04:45:54 +00:00
biondizzle	76c82ebdcd	debug: detailed TMA descriptor debug test	2026-05-29 04:45:06 +00:00
biondizzle	0c9245b4d2	fix: add cuInit(0) for CUDA driver API	2026-05-29 04:43:24 +00:00
biondizzle	6cc2f61431	debug: TMA descriptor dimension test	2026-05-29 04:42:44 +00:00
biondizzle	3412ff1a9b	fix: TMA tile strides must match global strides, not tile dimensions The tile stride in the outer dimension should be the global row stride (cols), not the tile width. The tile is a window into the global tensor and elements are addressed with global strides.	2026-05-29 04:41:53 +00:00
biondizzle	409838ace2	refactor: per-sub-tile TMA loads with padded GMEM allocations - Q, K, V all loaded per (128,16) sub-tile via TMA - Q GMEM padded to (128, HD) to satisfy TMA tile requirements - Simpler SMEM layout — only (128,16) staging buffers needed - Updated test with padded allocations	2026-05-29 04:41:03 +00:00
biondizzle	8c17f65f5b	fix: cast typo	2026-05-29 04:39:21 +00:00
biondizzle	8908b697dd	fix: bool type mismatch	2026-05-29 04:39:12 +00:00
biondizzle	b78ebe8a9c	debug: add TMA descriptor error reporting	2026-05-29 04:38:57 +00:00
biondizzle	c7a6d7d231	fix: tma_mbar_init → tma_mbarrier_init (typo)	2026-05-29 04:37:48 +00:00
biondizzle	696462f07a	feat: TMA async load infrastructure for FMHA kernel - fmha_tma.cuh: TMA descriptor creation, mbarrier helpers, cp.async.bulk.tensor.2d wrappers - fmha_6warp_tma.cuh: TMA-integrated multirow kernel with async GMEM→SMEM loads - TMA loads Q, K, V tiles to row-major SMEM - Transposes to canonical K-major layout for MMA - Same softmax/epilogue as non-TMA kernel - test_fmha_tma.cu: Test harness for TMA FMHA (HD=64 first)	2026-05-29 04:36:52 +00:00
biondizzle	d1c1eaeddc	clean: remove debug prints, multirow kernel complete with multi-tile KV merge	2026-05-28 23:57:31 +00:00
biondizzle	c65baabcc9	fix: V tile copy — V is (HD, SK_TOTAL) so tile columns are not contiguous	2026-05-28 23:55:52 +00:00
biondizzle	869460a932	debug: add LSE verification and merge debug prints	2026-05-28 23:54:30 +00:00
biondizzle	2f2259395e	fix: always normalize in kernel, correct KV merge with normalized O + LSE	2026-05-28 23:53:44 +00:00
biondizzle	914f76d30c	multirow: add normalize flag, un-norm + LSE output, multi-tile KV merge test	2026-05-28 23:51:23 +00:00
biondizzle	ca5cf0e517	test: add multi-head and batched prefill tests for multirow kernel	2026-05-28 23:48:53 +00:00
biondizzle	ac8fa779e2	fix: move epilogue TMEM loads outside my_row_active guard (warp-collective hang)	2026-05-28 23:46:46 +00:00
biondizzle	55c0604a71	add fence.sc.gpu between PV and epilogue for TMEM visibility	2026-05-28 23:21:53 +00:00
biondizzle	52809b0ec6	fix: tcgen05.wait::ld.sync.aligned (was missing 'sync')	2026-05-28 23:19:03 +00:00
biondizzle	0220e51d18	fix: typo cudaErrorCudaSuccess -> cudaSuccess	2026-05-28 23:18:21 +00:00
biondizzle	468614a4e2	fmha_multirow: non-interleaved design — softmax first, then PV KEY FIX: TMEM is shared between QK output (S) and PV output (O). Cannot interleave softmax reads with PV writes because PV overwrites S. New flow: 1. QK GEMM → S in TMEM 2. Softmax: read ALL S from TMEM, compute P in registers - Pass 1: row_max (4 warps, 32x32b.x8) - Pass 2: exp, sum, store P in p_vals[SK_TILE] registers 3. PV GEMM: write P to sPk per K-tile, accumulate O in TMEM 4. Epilogue: read O from TMEM, normalize, write GMEM P in registers: each lane holds float p_vals[128] = 512 bytes. Register budget: 128 lanes × 512B = 64KB (within B200 256KB register file).	2026-05-28 23:17:43 +00:00
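
Several of the TMA commit messages above (197cac875c, 204cc90808, c69f3668e1) hinge on the same byte arithmetic: global strides given in bytes, the 4096-byte transfer size of a (128,16) BF16 tile, and the `kt * 128 * 32` byte offset of the kt-th Q sub-tile. The following is a minimal host-side sketch of that arithmetic only. The helper names (`row_stride_bytes`, `tile_tx_bytes`, `q_subtile_offset_bytes`, `check_examples`) are hypothetical and do not come from the repository, and the offset formula assumes the sub-tiles sit back to back in the canonical buffer, as the 204cc90808 message describes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// BF16 elements are 2 bytes wide.
constexpr std::size_t kBf16Bytes = 2;

// Row stride of a row-major (rows, cols) BF16 tensor, expressed in BYTES,
// which is the unit the commit messages say globalStrides must use.
constexpr std::uint64_t row_stride_bytes(std::uint64_t cols) {
    return cols * kBf16Bytes;
}

// Bytes delivered by one TMA load of a (rows, cols) BF16 tile. This is the
// byte count the mbarrier has to be told to expect for that transfer.
constexpr std::uint32_t tile_tx_bytes(std::uint32_t rows, std::uint32_t cols) {
    return static_cast<std::uint32_t>(rows * cols * kBf16Bytes);
}

// Byte offset of the kt-th (128,16) sub-tile inside a full (128,HD) buffer,
// assuming sub-tiles are stored contiguously (128 rows x 32 bytes each).
constexpr std::uint32_t q_subtile_offset_bytes(std::uint32_t kt) {
    return kt * 128u * 32u;
}

// Compile-time checks against the numbers quoted in the commit messages.
static_assert(tile_tx_bytes(128, 16) == 4096, "128x16 BF16 tile is 4096 bytes");
static_assert(q_subtile_offset_bytes(1) == tile_tx_bytes(128, 16),
              "consecutive sub-tiles are one tile apart");

// Runtime usage example for HD = 64 (hypothetical head dimension).
inline void check_examples() {
    assert(row_stride_bytes(64) == 128);          // 64 elements, not 64 bytes
    assert(q_subtile_offset_bytes(3) == 12288);   // last of HD/16 = 4 sub-tiles
}
```

Passing an element count where a byte count is expected would halve every stride for BF16, which is consistent with the descriptor-creation failures the messages attribute to that mistake.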


1930 Commits
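
Commits 2f2259395e and 914f76d30c mention merging multiple KV tiles from normalized O plus LSE. The log does not show the kernel's merge code, so the following is only a sketch of the standard log-sum-exp merge for two partial attention results, written as plain host-side C++. `PartialAttn` and `merge_kv_tiles` are hypothetical names used for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Result of attention over one KV tile for a single query row.
struct PartialAttn {
    std::vector<float> o;  // output row, already normalized within the tile
    float lse;             // log(sum(exp(scores))) over that tile
};

// Merge two normalized partial results into the result over both tiles.
// Each tile is weighted by exp(lse_tile - lse_total), computed stably by
// subtracting the larger LSE before exponentiating.
inline PartialAttn merge_kv_tiles(const PartialAttn& a, const PartialAttn& b) {
    assert(a.o.size() == b.o.size());
    const float m = std::max(a.lse, b.lse);
    const float wa = std::exp(a.lse - m);
    const float wb = std::exp(b.lse - m);
    const float denom = wa + wb;

    PartialAttn out;
    out.lse = m + std::log(denom);
    out.o.resize(a.o.size());
    for (std::size_t i = 0; i < a.o.size(); ++i) {
        out.o[i] = (wa * a.o[i] + wb * b.o[i]) / denom;
    }
    return out;
}
```

When both tiles have the same LSE the merge reduces to a plain average of the two output rows, which makes a convenient sanity check for a merge test.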