nvfp4-megamoe-kernel

archive

archive: TMA driver-API files + CUDA 13 TMA discovery notes

2026-05-29 06:52:39 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

cudagraph_test.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

layertest.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

test_cotiled_diag.py

fix: proper v_major from tensor

2026-05-24 01:55:37 +00:00

test_cutedsl.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

test_d1_3_cotiled.py

D1.3: Fix while loop in cotiled diag - precompute num_tmem_alloc_cols

2026-05-24 00:07:22 +00:00

test_d1_3_layout_diag.py

D1.3: Enhanced diagnostic - test QK C-fragment as source for make_tiled_copy_C

2026-05-23 22:24:15 +00:00

test_d1_3_smem_diag.py

D1.3: Add SMEM-P coordinate diagnostic test

2026-05-23 23:23:05 +00:00

test_d1_3_smem_direct.py

D1.5: Replace TMEM round-trip normalize with correction epilog (one-way: TMEM→reg→SMEM→GMEM)

2026-05-24 00:24:24 +00:00

test_d1_3_smem_vs_tmem.py

D1.3: Add SMEM-P vs TMEM-P comparison test

2026-05-24 00:10:18 +00:00

test_d1_3_unnorm_debug.py

D1.3: Add unnormalized debug test to isolate SMEM-P vs O round-trip error

2026-05-24 00:15:41 +00:00

test_d1_3_write_read.py

D1.3: Add SMEM-P write/read diagnostic

2026-05-24 00:13:28 +00:00

test_d1_debug.py

debug: hd=64 with CUDA_LAUNCH_BLOCKING

2026-05-23 03:42:53 +00:00

test_d1_diag2.py

D1: Add diagnostic test (TMEM-P vs SMEM-P at various hd)

2026-05-24 03:22:23 +00:00

test_d1_diag.py

fix: use mV.iterator

2026-05-23 03:25:29 +00:00

test_d1_hd512_merge.py

D1.4: Fix merge test - use use_smem_p=False for hd=256 kernel (SMEM budget)

2026-05-24 16:36:48 +00:00

test_d1_hd512_only.py

D1.4: Remove --opt-level 0 from hd512 test (use default opt level)

2026-05-24 16:42:01 +00:00

test_d1_hd512.py

d1: add hd=512 test

2026-05-23 03:20:46 +00:00

test_d1_kv_merge_v2.py

D1: fix per-row LSE output + add KV merge test v2 with per-row LSE

2026-05-24 22:21:51 +00:00

test_d1_kv_merge_v3.py

D1: corrected KV merge test with proper normalized output formula

2026-05-24 22:24:27 +00:00

test_d1_kv_merge.py

D1: add KV merge test using log-sum-exp (avoids TMEM round-trip)

2026-05-24 22:17:24 +00:00

test_d1_lse_verify.py

fix lse verify

2026-05-24 22:23:08 +00:00

test_d1_lse.py

D1: LSE diagnostic at various hd

2026-05-24 03:23:16 +00:00

test_d1_multi_kv.py

D1: add multi-KV-tile O rescale test (s_k=256,384,512)

2026-05-24 22:00:42 +00:00

test_d1_qk512.py

Fix: add cutlass import to test_d1_qk512

2026-05-24 14:20:32 +00:00

test_d1_raw.py

D1: test raw unnormalized output via epilogue_tma_store

2026-05-23 03:33:59 +00:00

test_d1_regression.py

D1.4: Fix regression test for un-normalized O output (D5a)

2026-05-24 15:13:16 +00:00

test_d1_rescale_debug.py

fix debug test

2026-05-24 22:04:51 +00:00

test_d1_rescale_diag.py

D1: add rescale diagnostic

2026-05-24 22:18:12 +00:00

test_d1_rescale_min.py

D1: add KV merge test using log-sum-exp (avoids TMEM round-trip)

2026-05-24 22:17:24 +00:00

test_d1_smem128.py

D1: SMEM-P test at hd=128

2026-05-24 03:48:37 +00:00

test_d1_sweep.py

D1: paired atoms epilogue (no TMEM round-trip)

2026-05-23 03:29:51 +00:00

test_d1_tmem_only.py

D1: Fix SMEM-P (coordinate store), LSE (FP32), add TMEM-P-only test

2026-05-24 03:27:14 +00:00

test_d1_tmem_trip.py

D1: add KV merge test using log-sum-exp (avoids TMEM round-trip)

2026-05-24 22:17:24 +00:00

test_d2_headpacked.py

D2: comprehensive head-packed test (n_h=1, 64, 128, hd=64, 128)

2026-05-25 17:16:05 +00:00

test_d2_multicta.py

D2: add num_query_heads/batch_size params + head-packed test

2026-05-25 16:50:49 +00:00

test_d2_multihead.py

D1: revert per-row LSE to sfw_idx=0 for now (debugging D2 regression)

2026-05-24 22:28:11 +00:00

test_d2_perhead.py

D2: add per-head launch test

2026-05-24 22:48:22 +00:00

test_d2_regression.py

fix: use reference attn_sum for normalization (kernel LSE per-row may be wrong)

2026-05-25 17:13:34 +00:00

test_d2_scale.py

D2: add scale test (more heads, larger hd)

2026-05-24 22:49:44 +00:00

test_d3_inkernel_mask.py

fix: swa_len as Int32 scalar instead of CuTe tensor

2026-05-26 10:54:41 +00:00

test_d3_swa_mask.py

fix typo: from_dlset → from_dlpack

2026-05-25 17:28:43 +00:00

test_d4_causal_mask.py

fix: D4 test reference computation only applies causal mask when is_causal=True

2026-05-26 10:56:04 +00:00

test_d5b_perrow_lse.py

fix: k_seg is already 3D from slicing, don't add extra unsqueeze(-1)

2026-05-26 11:02:44 +00:00

test_d5c_fused.py

D5c: add apply_sink_bias flag (independent of n_comp)

2026-05-26 15:26:52 +00:00

test_d5c_multitile.py

diag: rewrite multi-tile test with explicit per-segment compile and reference

2026-05-26 15:39:39 +00:00

test_d15_in_kernel_rescale.py

D1.5: Implement in-kernel O rescale via CUTLASS correction_rescale pattern

2026-05-26 20:26:06 +00:00

test_d15_multi_kv.py

D1.5: add multi-KV-tile attention test with Python KV merge

2026-05-25 17:18:50 +00:00

test_d15_noop_rescale.py

D1.5 debug: add NOOP rescale test (acc_scale=1.0) to isolate TMEM round-trip corruption

2026-05-26 20:28:55 +00:00

test_d15_rescale_debug.py

D1.5 debug: add targeted s_k=256 rescale diagnostic test

2026-05-26 20:27:37 +00:00

test_d15_roundtrip_iso.py

D1.5: Add isolated round-trip test comparing s_k=128 vs s_k=256 with NOOP rescale

2026-05-26 20:45:58 +00:00

test_fmha_6warp_hd16.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_6warp_hd64.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_6warp_hd128.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_6warp_hd256.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_6warp_multihead_hd16.cu

Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

2026-05-28 19:32:35 +00:00

test_fmha_6warp_multihead_hd64.cu

Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

2026-05-28 19:32:35 +00:00

test_fmha_6warp_multihead_hd128.cu

Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

2026-05-28 19:32:35 +00:00

test_fmha_6warp_multihead_hd256.cu

Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

2026-05-28 19:32:35 +00:00

test_fmha_6warp_multihead.cu

Fix nvcc goto-bypasses-init errors in multi-head test

2026-05-28 19:33:04 +00:00

test_fmha_6warp_multirow_hd16.cu

Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax

2026-05-28 20:04:29 +00:00

test_fmha_6warp_multirow_hd64.cu

Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax

2026-05-28 20:04:29 +00:00

test_fmha_6warp_multirow_hd128.cu

Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax

2026-05-28 20:04:29 +00:00

test_fmha_6warp_multirow_hd256.cu

Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax

2026-05-28 20:04:29 +00:00

test_fmha_6warp_multirow.cu

clean: remove debug prints, multirow kernel complete with multi-tile KV merge

2026-05-28 23:57:31 +00:00

test_fmha_6warp.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_gen_kernel.cuh

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd16_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd16_v2.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd16.cu

test: debug — just QK+softmax+P read (no PV)

2026-05-28 13:08:06 +00:00

test_fmha_hd64_debug.cu

auto: pre-test commit

2026-05-28 15:46:53 +00:00

test_fmha_hd64_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd64_n16_v2.cu

auto: pre-test commit

2026-05-28 15:55:59 +00:00

test_fmha_hd64_n16.cu

FMHA HD=64 with BLOCK_MN_B=16, 4 N-tiles per K-tile

2026-05-28 15:17:40 +00:00

test_fmha_hd64_smem_p.cu

Clean up HD=64 test, V layout verified correct

2026-05-28 15:21:33 +00:00

test_fmha_hd64.cu

test: merge softmax+PV into single warp0 block (s_vals scope fix)

2026-05-28 13:10:02 +00:00

test_fmha_hd128_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd256_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_pv16.py

auto: pre-test commit

2026-05-28 19:12:23 +00:00

test_fmha_sm100_standalone.cu

test: enable both reference + TMEM epilogue tests at hd=64/128

2026-05-28 07:49:48 +00:00

test_fmha_sm100.py

fix: increase test timeout for TMEM kernel

2026-05-28 06:41:59 +00:00

test_fmha_smem_p.cu

Fix: add back cudaDeviceSynchronize

2026-05-28 14:28:24 +00:00

test_fmha_softmax.cu

test: FMHA softmax (QK→read S→softmax→write P→read P→verify)

2026-05-28 13:00:37 +00:00

test_fmha_tma.cu

feat: TMA async FMHA kernel — WORKING on B200

2026-05-29 07:02:07 +00:00

test_fmha_ts_full.cu

Fix SMEM allocation (was half the needed size) + re-enable full pipeline

2026-05-28 14:16:43 +00:00

test_fmha_ts_hd16.cu

test: properly aligned V SMEM buffer

2026-05-28 13:47:47 +00:00

test_fmha_v3_stage_c.py

fix: revert to composition layout for hand-constructed atoms (matching CUTLASS)

2026-05-23 02:54:54 +00:00

test_fmha_v3_stage_d1.py

D1: Full test with TMEM-P at hd=64,128,256,512

2026-05-24 04:07:40 +00:00

test_fmha_v3_stage_d5b.py

D5b: Fix reference computation - use logsumexp for stable LSE, fix o_unnorm definition

2026-05-23 21:43:04 +00:00

test_fmha_v3.py

FIX: (None,0,None,0) for ALL tma_partition outputs — verified shapes on B200

2026-05-22 23:35:55 +00:00

test_fmha_v4.cu

Move s_p_vals to dynamic SMEM

2026-05-28 14:38:03 +00:00

test_fmha_v5.cu

Milestone: Full FMHA HD=16 with PV SS MMA (SMEM-P) — cosine 0.9997

2026-05-28 14:50:43 +00:00

test_fp4_roundtrip.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

test_int32_cast.py

NVFP4-1.1: try .to(Int32) for float-to-int conversion

2026-05-28 04:02:45 +00:00

test_minimal_pv.cu

Fix tb scope

2026-05-28 14:40:55 +00:00

test_mma_ts_copy.cu

Add systematic SS+TS sequence test to debug MMA coexistence crash

2026-05-28 14:10:07 +00:00

test_mma_ts.cu

Test TS MMA with non-uniform A data

2026-05-28 14:19:45 +00:00

test_nvfp4_1_1_layout.py

NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels

2026-05-28 03:39:55 +00:00

test_nvfp4_1_1_quant.py

NVFP4-1.1: test kernel uses Float32 input (avoids BF16 scalar load issue)

2026-05-28 04:32:08 +00:00

test_nvfp4_diag.py

NVFP4-0.2-0.4: add FP4 primitives diagnostic test

2026-05-25 03:07:53 +00:00

test_nvfp4_gpu_quantize.py

fix test 4: use silu(gate)+swiglu interleaved (matching fused kernel output)

2026-05-25 16:24:04 +00:00

test_nvfp4_primitives.py

diag: add 2-CTA check + fix LayoutEnum in MMA test

2026-05-23 08:45:26 +00:00

test_nvfp4_quant_kernel.py

NVFP4-1.1 Step 3: post-SWiGLU quantization test suite (all PASS)

2026-05-25 09:08:01 +00:00

test_nvfp4_quantize_kernel.py

fix: quantize_activation_nvfp4 returns 2 values, not 3

2026-05-25 03:17:13 +00:00

test_paired_epilog.py

test: paired atoms epilog from old commit 6ee28d8

2026-05-23 03:32:53 +00:00

test_production.py

Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API

2026-05-27 15:15:03 +00:00

test_pv_accum.cu

PV accumulation debug with detailed TMEM read

2026-05-28 14:35:29 +00:00

test_pv_ss_128.cu

Test K-tiles 0-1 accumulated

2026-05-28 14:33:31 +00:00

test_pv_ss_b64.cu

Test PV SS MMA with B=(64,16) BLOCK_MN=64

2026-05-28 14:58:10 +00:00

test_pv_ss.cu

Full FMHA SMEM-P with scale calibration

2026-05-28 14:24:53 +00:00

test_qk_mma.cu

debug: clean QK verify with scalar sanity + MMA result

2026-05-28 08:53:35 +00:00

test_qk_pv_layout.cu

QK→PV layout test: skip softmax to test TMEM layout compatibility

2026-05-28 14:17:37 +00:00

test_smem_acc.py

Add c_simple GMEM tensor (non-dynamic) for SMEM accumulator TMA store

2026-05-27 05:33:30 +00:00

test_smem_budget.py

D1.4: Reduce pv_n_tile to 128 for hd=512 to fit SMEM budget (192KB)

2026-05-24 08:07:32 +00:00

test_smem_p_coord.py

test: add try/except for SMEM-P coord test

2026-05-24 02:15:07 +00:00

test_smem_p_diag.py

shit left dangling

2026-05-23 23:58:57 +00:00

test_smem_p_write.py

test: SMEM-P coordinate verification test

2026-05-24 01:58:32 +00:00

test_softmax_pv.cu

Test softmax→PV with 1 K-tile in isolation

2026-05-28 14:18:39 +00:00

test_ss_ts_sequence.cu

Add systematic SS+TS sequence test to debug MMA coexistence crash

2026-05-28 14:10:07 +00:00

test_sw128_qk.cu

auto: pre-test commit

2026-05-28 16:36:53 +00:00

test_tma_5d.cu

TMA 5D test: element stride decomposition

2026-05-28 19:18:01 +00:00

test_tma_align.cu

TMA alignment test

2026-05-28 17:00:20 +00:00

test_tma_debug.cu

TMA debug: fix globalStrides to tensorRank-1 elements

2026-05-28 16:58:30 +00:00

test_tma_desc_debug2.cu

debug: detailed TMA descriptor debug test

2026-05-29 04:45:06 +00:00

test_tma_desc_debug3.cu

debug: TMA context fix test

2026-05-29 04:45:54 +00:00

test_tma_desc_debug.cu

fix: add cuInit(0) for CUDA driver API

2026-05-29 04:43:24 +00:00

test_tma_driver.cu

Fix TMA: use CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 not UINT16

2026-05-28 16:51:40 +00:00

test_tma_load.cu

auto: pre-test commit

2026-05-28 16:39:45 +00:00

test_tma_proper.cu

auto: pre-test commit

2026-05-28 16:42:24 +00:00

test_tmem_4warp_read.cu

fix: SMEM size for MMA test — account for both sQ0 and sK0

2026-05-28 23:06:07 +00:00

test_tmem_all_lanes.cu

auto: pre-test commit

2026-05-28 15:51:55 +00:00

test_tmem_budget.py

D1.2: fix probe for hd=512 (MMA max N=256, use pv_n_tile)

2026-05-23 06:41:42 +00:00

test_tmem_cols.cu

test: TMEM 2-store with fence outside wid guard, 64 threads

2026-05-28 09:59:43 +00:00

test_tmem_lane_mapping.cu

test: add TMEM lane mapping diagnostics

2026-05-28 07:42:16 +00:00

test_tmem_layout_full.cu

auto: pre-test commit

2026-05-28 15:49:47 +00:00

test_tmem_layout_pv64.cu

auto: pre-test commit

2026-05-28 15:48:15 +00:00

test_tmem_minimal.cu

fix: use tcgen05.wait::st/ld instead of nonexistent tcgen05.fence

2026-05-28 07:12:26 +00:00

test_tmem_roundtrip_minimal.py

D1.5: Implement in-kernel O rescale via CUTLASS correction_rescale pattern

2026-05-26 20:26:06 +00:00

test_tmem_row_offset.cu

fix: use __cvta_generic_to_shared directly for 64-bit compat

2026-05-28 22:56:29 +00:00

test_tmem_zero_pv.cu

auto: pre-test commit

2026-05-28 15:54:05 +00:00

test_ultra_minimal.py

NVFP4-1.1: ultra-minimal test — Float32 comparison + Int32 select

2026-05-28 04:35:06 +00:00

test_umma_qk_hd64.cu

test: separate (128,16) SMEM per K-tile with correct source stride

2026-05-28 12:57:38 +00:00

test_umma_qk.cu

test: fix var ref

2026-05-28 11:39:15 +00:00