nvfp4-megamoe-kernel

archive

archive: TMA driver-API files + CUDA 13 TMA discovery notes

2026-05-29 06:52:39 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

cudagraph_test.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

layertest.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

qk_verify_kernel.cuh

Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files

2026-05-30 21:08:12 +00:00

test_compressor_position_bias.py

test: add compressor position_bias unit test

2026-06-01 05:55:05 +00:00

test_cute_math_api.py

Test: enumerate CuTeDSL math API to check available operations

2026-06-01 09:11:29 +00:00

test_cutedsl.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

test_fmha_6warp_hd16.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_6warp_hd64.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_6warp_hd128.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_6warp_hd256.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_6warp_multihead_hd16.cu

Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

2026-05-28 19:32:35 +00:00

test_fmha_6warp_multihead_hd64.cu

Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

2026-05-28 19:32:35 +00:00

test_fmha_6warp_multihead_hd128.cu

Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

2026-05-28 19:32:35 +00:00

test_fmha_6warp_multihead_hd256.cu

Multi-head FMHA kernel (Milestone 5): grid launch with MHA/MQA/batch support

2026-05-28 19:32:35 +00:00

test_fmha_6warp_multihead.cu

fix: remove n_kv_tiles from standalone test (struct doesn't have it anymore)

2026-05-30 10:28:38 +00:00

test_fmha_6warp_multirow_hd16.cu

Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax

2026-05-28 20:04:29 +00:00

test_fmha_6warp_multirow_hd64.cu

Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax

2026-05-28 20:04:29 +00:00

test_fmha_6warp_multirow_hd128.cu

Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax

2026-05-28 20:04:29 +00:00

test_fmha_6warp_multirow_hd256.cu

Multi-row FMHA kernel (Milestone 4): T>1 prefill support with 4-warp parallel softmax

2026-05-28 20:04:29 +00:00

test_fmha_6warp_multirow.cu

clean: remove debug prints, multirow kernel complete with multi-tile KV merge

2026-05-28 23:57:31 +00:00

test_fmha_6warp_tma_hd128.cu

test: HD=128/256 variants for TMA FMHA

2026-05-29 19:32:49 +00:00

test_fmha_6warp_tma_hd256.cu

test: HD=128/256 variants for TMA FMHA

2026-05-29 19:32:49 +00:00

test_fmha_6warp_tma_hd512.cu

feat: HD=512 support — TMEM_N=512, test variants for all three TMA kernels

2026-05-30 03:45:05 +00:00

test_fmha_6warp_tma_multirow_hd128.cu

test: HD=128/256 multi-row TMA FMHA

2026-05-29 19:40:32 +00:00

test_fmha_6warp_tma_multirow_hd256.cu

test: HD=128/256 multi-row TMA FMHA

2026-05-29 19:40:32 +00:00

test_fmha_6warp_tma_multirow_hd512.cu

feat: HD=512 support — TMEM_N=512, test variants for all three TMA kernels

2026-05-30 03:45:05 +00:00

test_fmha_6warp_tma_multirow_multitile_hd128.cu

test: HD=128/256 variants for D1.5

2026-05-30 04:49:33 +00:00

test_fmha_6warp_tma_multirow_multitile_hd256.cu

test: HD=128/256 variants for D1.5

2026-05-30 04:49:33 +00:00

test_fmha_6warp_tma_multirow_multitile_hd512.cu

D1.5: HD tiling (HD_CHUNK=256) for HD=512 support

2026-05-30 06:56:09 +00:00

test_fmha_6warp_tma_multirow_multitile.cu

D1.5 complete: HD=512 support via hd_chunk tiling with native TMEM columns

2026-05-30 07:02:41 +00:00

test_fmha_6warp_tma_multirow.cu

feat: double-buffer TMA pipeline in multi-row kernel

2026-05-30 03:20:49 +00:00

test_fmha_6warp_tma_multitile_hd128.cu

test: HD=128/256 multi-tile variants

2026-05-29 20:02:00 +00:00

test_fmha_6warp_tma_multitile_hd256.cu

test: HD=128/256 multi-tile variants

2026-05-29 20:02:00 +00:00

test_fmha_6warp_tma_multitile_hd512.cu

feat: HD=512 support — TMEM_N=512, test variants for all three TMA kernels

2026-05-30 03:45:05 +00:00

test_fmha_6warp_tma_multitile.cu

feat: V TMA loads in multi-tile kernel

2026-05-29 22:46:21 +00:00

test_fmha_6warp_tma.cu

auto: pre-test commit

2026-05-30 03:46:38 +00:00

test_fmha_6warp.cu

auto: pre-test commit

2026-05-28 16:28:58 +00:00

test_fmha_gen_kernel.cuh

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd16_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd16_v2.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd16.cu

test: debug — just QK+softmax+P read (no PV)

2026-05-28 13:08:06 +00:00

test_fmha_hd64_debug.cu

auto: pre-test commit

2026-05-28 15:46:53 +00:00

test_fmha_hd64_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd64_n16_v2.cu

auto: pre-test commit

2026-05-28 15:55:59 +00:00

test_fmha_hd64_n16.cu

FMHA HD=64 with BLOCK_MN_B=16, 4 N-tiles per K-tile

2026-05-28 15:17:40 +00:00

test_fmha_hd64_smem_p.cu

Clean up HD=64 test, V layout verified correct

2026-05-28 15:21:33 +00:00

test_fmha_hd64.cu

test: merge softmax+PV into single warp0 block (s_vals scope fix)

2026-05-28 13:10:02 +00:00

test_fmha_hd128_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_hd256_gen.cu

auto: pre-test commit

2026-05-28 15:59:22 +00:00

test_fmha_sink_bias.py

FMHA sink bias in kernel + single_shot production rewrite

2026-05-31 23:10:13 +00:00

test_fmha_sm100_standalone.cu

test: enable both reference + TMEM epilogue tests at hd=64/128

2026-05-28 07:49:48 +00:00

test_fmha_sm100.py

fix: increase test timeout for TMEM kernel

2026-05-28 06:41:59 +00:00

test_fmha_smem_p.cu

Fix: add back cudaDeviceSynchronize

2026-05-28 14:28:24 +00:00

test_fmha_softmax.cu

test: FMHA softmax (QK→read S→softmax→write P→read P→verify)

2026-05-28 13:00:37 +00:00

test_fmha_tma.cu

refactor: TMA FMHA kernel — 4-warp, proven pattern, full pipeline

2026-05-29 18:50:58 +00:00

test_fmha_ts_full.cu

Fix SMEM allocation (was half the needed size) + re-enable full pipeline

2026-05-28 14:16:43 +00:00

test_fmha_ts_hd16.cu

test: properly aligned V SMEM buffer

2026-05-28 13:47:47 +00:00

test_fmha_v3_stage_c.py

fix: revert to composition layout for hand-constructed atoms (matching CUTLASS)

2026-05-23 02:54:54 +00:00

test_fmha_v3.py

FIX: (None,0,None,0) for ALL tma_partition outputs — verified shapes on B200

2026-05-22 23:35:55 +00:00

test_fmha_v4.cu

Move s_p_vals to dynamic SMEM

2026-05-28 14:38:03 +00:00

test_fmha_v5.cu

Milestone: Full FMHA HD=16 with PV SS MMA (SMEM-P) — cosine 0.9997

2026-05-28 14:50:43 +00:00

test_fp4_roundtrip.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

test_fused_rmsnorm_quantize.py

P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+)

2026-06-02 16:26:24 +00:00

test_fused_router.py

Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API

2026-06-01 09:19:48 +00:00

test_fused_swiglu_kernel.py

fix: expert_offsets for 4-expert fused SwiGLU test

2026-06-02 08:24:32 +00:00

test_gemm_shape.py

test: verify GEMM shape with production weight format

2026-06-02 08:43:40 +00:00

test_int32_cast.py

NVFP4-1.1: try .to(Int32) for float-to-int conversion

2026-05-28 04:02:45 +00:00

test_kv_compress_quant.py

KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant

2026-06-02 09:37:53 +00:00

test_kv_diag.py

KV diag test

2026-06-02 09:43:45 +00:00

test_kv_quantize.py

KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format)

2026-06-02 10:08:43 +00:00

test_layer_comparison.py

auto: pre-test commit

2026-06-01 15:04:46 +00:00

test_mhc_comparison.py

auto: pre-test commit

2026-06-01 15:04:46 +00:00

test_mhc_sinkhorn.py

Fix mHC Sinkhorn test: row sums expected to be off (eps after softmax)

2026-06-02 10:46:28 +00:00

test_minimal_pv.cu

Fix tb scope

2026-05-28 14:40:55 +00:00

test_mma_ts_copy.cu

Add systematic SS+TS sequence test to debug MMA coexistence crash

2026-05-28 14:10:07 +00:00

test_mma_ts.cu

Test TS MMA with non-uniform A data

2026-05-28 14:19:45 +00:00

test_nvfp4_1_1_layout.py

NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels

2026-05-28 03:39:55 +00:00

test_nvfp4_1_1_quant.py

NVFP4-1.1: test kernel uses Float32 input (avoids BF16 scalar load issue)

2026-05-28 04:32:08 +00:00

test_nvfp4_cutedsl_compile.py

Fix test: use from_dlpack for c_tensor

2026-06-01 07:55:29 +00:00

test_nvfp4_diag.py

NVFP4-0.2-0.4: add FP4 primitives diagnostic test

2026-05-25 03:07:53 +00:00

test_nvfp4_gpu_quantize.py

fix test 4: use silu(gate)+swiglu interleaved (matching fused kernel output)

2026-05-25 16:24:04 +00:00

test_nvfp4_linear_accuracy.py

Add NVFP4 linear accuracy test: prod vs ref with all-ones input

2026-06-01 14:15:27 +00:00

test_nvfp4_primitives.py

diag: add 2-CTA check + fix LayoutEnum in MMA test

2026-05-23 08:45:26 +00:00

test_nvfp4_quant_kernel.py

NVFP4-1.1 Step 3: post-SwiGLU quantization test suite (all PASS)

2026-05-25 09:08:01 +00:00

test_nvfp4_quantize_kernel.py

fix: quantize_activation_nvfp4 returns 2 values, not 3

2026-05-25 03:17:13 +00:00

test_nvfp4_runtime_gsa.py

test: NVFP4 runtime gsa accuracy vs PyTorch reference

2026-06-02 04:31:18 +00:00

test_p3_fast_decode.py

P8: Fix test imports after deleting multihead module

2026-05-30 17:23:13 +00:00

test_p6_tma_epilogue.py

P8: Fix P6 test imports after deleting multihead module

2026-05-30 17:25:01 +00:00

test_p7_multi_row_softmax.py

P7: Document TMEM column layout, add multi-row softmax test

2026-05-30 17:17:54 +00:00

test_paired_epilog.py

test: paired atoms epilog from old commit 6ee28d8

2026-05-23 03:32:53 +00:00

test_prod_vs_ref_comparison.py

Add prod vs ref GEMM comparison test + gate logits diagnostic

2026-06-01 14:11:37 +00:00

test_production_compress.py

test: production compressor kernel unit test

2026-06-01 05:19:13 +00:00

test_production.py

Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files

2026-05-30 21:08:12 +00:00

test_pv_accum.cu

PV accumulation debug with detailed TMEM read

2026-05-28 14:35:29 +00:00

test_pv_only.cu

fix: SMEM layout and printf in PV-only test

2026-05-29 19:08:39 +00:00

test_pv_ss_128.cu

Test K-tiles 0-1 accumulated

2026-05-28 14:33:31 +00:00

test_pv_ss_b64.cu

Test PV SS MMA with B=(64,16) BLOCK_MN=64

2026-05-28 14:58:10 +00:00

test_pv_ss.cu

Full FMHA SMEM-P with scale calibration

2026-05-28 14:24:53 +00:00

test_q_smem_debug.cu

test: debug Q SMEM canonical after TMA load

2026-05-29 18:30:52 +00:00

test_qk_direct.cu

fix: QK direct test — per-K-sub-tile Q load (same as working kernel)

2026-05-29 18:35:00 +00:00

test_qk_minimal.cu

fix: warp-collective TMEM read/dealloc in minimal QK test

2026-05-29 18:42:03 +00:00

test_qk_mma.cu

debug: clean QK verify with scalar sanity + MMA result

2026-05-28 08:53:35 +00:00

test_qk_pv_layout.cu

QK→PV layout test: skip softmax to test TMEM layout compatibility

2026-05-28 14:17:37 +00:00

test_qk_softmax.cu

fix: SMEM size calculation — TILE_SZ is in BF16 elements, need *sizeof(bf16_t) for bytes

2026-05-29 19:30:50 +00:00

test_qk_tma.cu

test: QK-only TMA test — isolate TMA load + canonical + MMA

2026-05-29 18:29:49 +00:00

test_rope_cuda.py

P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops)

2026-06-02 09:05:22 +00:00

test_softmax_pv.cu

Test softmax→PV with 1 K-tile in isolation

2026-05-28 14:18:39 +00:00

test_ss_ts_sequence.cu

Add systematic SS+TS sequence test to debug MMA coexistence crash

2026-05-28 14:10:07 +00:00

test_sw128_qk.cu

auto: pre-test commit

2026-05-28 16:36:53 +00:00

test_tma_5d.cu

TMA 5D test: element stride decomposition

2026-05-28 19:18:01 +00:00

test_tma_align.cu

TMA alignment test

2026-05-28 17:00:20 +00:00

test_tma_debug.cu

TMA debug: fix globalStrides to tensorRank-1 elements

2026-05-28 16:58:30 +00:00

test_tma_desc_debug2.cu

debug: detailed TMA descriptor debug test

2026-05-29 04:45:06 +00:00

test_tma_desc_debug3.cu

debug: TMA context fix test

2026-05-29 04:45:54 +00:00

test_tma_desc_debug.cu

fix: add cuInit(0) for CUDA driver API

2026-05-29 04:43:24 +00:00

test_tma_driver.cu

Fix TMA: use CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 not UINT16

2026-05-28 16:51:40 +00:00

test_tma_kload.cu

test: minimal TMA K-load — no MMA/TMEM, just verify TMA + canonical

2026-05-29 18:46:09 +00:00

test_tma_konly.cu

test: simple (128,16) TMA desc for K sub-tile only

2026-05-29 18:45:01 +00:00

test_tma_kqk.cu

fix: reference should be raw dot product (MMA is unscaled)

2026-05-29 18:48:39 +00:00

test_tma_load.cu

auto: pre-test commit

2026-05-28 16:39:45 +00:00

test_tma_minimal.cu

test: TMA diagnostic with 192 threads

2026-05-29 19:26:09 +00:00

test_tma_proper.cu

auto: pre-test commit

2026-05-28 16:42:24 +00:00

test_tma_qk_diag.cu

test: TMA QK diagnostic — 3 variants to isolate failure

2026-05-29 19:29:35 +00:00

test_tma_qk.cu

test: TMA + canonical + QK GEMM incremental

2026-05-29 19:28:23 +00:00

test_tma_subtile.cu

fix: typo cuda_SUCCESS -> cudaSuccess

2026-05-29 19:27:30 +00:00

test_tma_verify.cu

fix: align TMA SMEM to 128 bytes in verification test

2026-05-29 18:27:07 +00:00

test_tmem_4warp_read.cu

fix: SMEM size for MMA test — account for both sQ0 and sK0

2026-05-28 23:06:07 +00:00

test_tmem_all_lanes.cu

auto: pre-test commit

2026-05-28 15:51:55 +00:00

test_tmem_cols.cu

test: TMEM 2-store with fence outside wid guard, 64 threads

2026-05-28 09:59:43 +00:00

test_tmem_lane_mapping.cu

test: add TMEM lane mapping diagnostics

2026-05-28 07:42:16 +00:00

test_tmem_layout_full.cu

auto: pre-test commit

2026-05-28 15:49:47 +00:00

test_tmem_layout_pv64.cu

auto: pre-test commit

2026-05-28 15:48:15 +00:00

test_tmem_minimal.cu

fix: use tcgen05.wait::st/ld instead of nonexistent tcgen05.fence

2026-05-28 07:12:26 +00:00

test_tmem_row_offset.cu

fix: use __cvta_generic_to_shared directly for 64-bit compat

2026-05-28 22:56:29 +00:00

test_tmem_zero_pv.cu

auto: pre-test commit

2026-05-28 15:54:05 +00:00

test_umma_qk_hd64.cu

test: separate (128,16) SMEM per K-tile with correct source stride

2026-05-28 12:57:38 +00:00

test_umma_qk.cu

test: fix var ref

2026-05-28 11:39:15 +00:00

test_v_tma.cu

test: V TMA diagnostic — isolate V TMA descriptor issue

2026-05-29 22:42:46 +00:00