nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	e1b9e94c24	P8: Fix test imports after deleting multihead module	2026-05-30 17:23:13 +00:00
biondizzle	95e0c8c464	P5: fix multi-tile test — use same Q data for kernel and reference	2026-05-30 10:49:12 +00:00
biondizzle	e701a1411c	P5: use multi-tile kernel for N>128 in integration test	2026-05-30 10:47:00 +00:00
biondizzle	f370bfb1f1	P5: re-enable multi-tile Python tests, fix CAPI to use create_tma_desc_2d_bf16	2026-05-30 10:38:33 +00:00
biondizzle	a1d05b3055	P5: disable multi-tile Python tests (TMA descriptor alignment issue)	2026-05-30 10:32:44 +00:00
biondizzle	b61df2657b	P5: fix reference attention for MQA/GQA (kv_idx = h // q_per_kv)	2026-05-30 08:59:50 +00:00
biondizzle	2649488d13	P5: in-kernel multi-KV-tile FA2 online softmax in fmha_6warp_multihead.cuh - Kernel loops over KV tiles internally with running max/sum rescale - SMEM accumulator sOacc[hd] replaces TMEM accumulation across tiles - P is UN-NORMALIZED for multi-tile (exp(s-max), not /sum) - Per KV tile: QK→softmax→PV→TMEM→read→add to sOacc - Final: O = sOacc / running_sum - Single tile (n_kv_tiles=1): same as before, no rescale - Updated CAPI, Python loader, production.py fast path - Added multi-tile test cases (N=256, 512)	2026-05-30 08:46:09 +00:00
biondizzle	ae425b5522	P3: clean up test, remove debug files, final integration test - test_p3_fast_decode.py: clean kernel test + full API test - Removed debug tests (sanity, v_debug, v_ref_debug) - Double normalization fix verified: kernel output matches reference at cos >= 0.999990 across all MHA/MQA/GQA configs	2026-05-30 08:29:25 +00:00
biondizzle	a3c5f817e1	debug: compare api vs direct kernel vs reference	2026-05-30 08:23:43 +00:00
biondizzle	1b9cdf89fb	P3: add full API integration test	2026-05-30 08:20:53 +00:00
biondizzle	0608d9d09e	P3: fix GQA via K/V repeat_interleave, relax threshold to 0.999990	2026-05-30 08:20:01 +00:00
biondizzle	094b3c9e6c	P3: fix test — create V in kernel layout (hd,N), transpose for reference	2026-05-30 08:18:20 +00:00
biondizzle	7b5b3342fa	P3: fix integration test — V transpose, direct ctypes call	2026-05-30 08:17:33 +00:00
biondizzle	adcf3e04ab	P3: ctypes loader for 6-warp FMHA (bypass torch JIT sm_100 arch issue) - fmha_multihead_capi.cu: pure C API wrapper, no ATen/pybind11 deps - fmha_multihead_op.py: nvcc precompile + ctypes load (sm_100a) - Removed fmha_multihead_launch.cu (ATen approach didn't work) - Updated test to call kernel directly via ctypes API	2026-05-30 08:15:31 +00:00
biondizzle	1e6adf5e01	P3: wire 6-warp multi-head FMHA decode fast path into production.py - fmha_multihead_launch.cu: PyTorch launch wrapper for fmha_6warp_multihead_kernel (c10::BFloat16 boundary, uint16_t bf16_t inside kernel, zero-cost casts) - fmha_multihead_op.py: torch.utils.cpp_extension JIT loader + custom_op registration (dsv4::fmha_multihead_decode for torch.compile) - production.py: fast path dispatch for T=1, n_segments==1, hd in {64,128,256} Falls through to CuTeDSL slow path for multi-segment/prefill - test_p3_fast_decode.py: integration test (MHA/MQA/GQA, cosine >= 0.999998) Architecture: Grid: dim3(1, n_h, batch_size) — one CTA per (head, batch) MQA: k_head_stride=0 so all Q heads share same K/V Single kernel launch, zero cudaDeviceSynchronize on hot path Normalized output for single-segment decode	2026-05-30 08:12:23 +00:00

15 Commits