STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)

Production Path

One FMHA kernel: fmha_6warp_tma_multirow_multitile.cuh (6-warp, TMA, UMMA, tcgen05.mma SS, in-kernel multi-tile SMEM accumulator, multi-row softmax). It is loaded through fmha_multitile_capi.cu (C API) and fmha_multitile_op.py (ctypes), and dispatched from production.py.
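
For readers unfamiliar with the multi-tile accumulator idea, the following is a minimal pure-Python sketch of tile-by-tile softmax accumulation for a single query row. It is not taken from the kernel: the function names, the tile size, and the list-based data layout are all assumptions made for illustration, and the real kernel works on SMEM tiles with tensor-core MMAs rather than Python lists.

```python
import math


def softmax_attention_reference(scores, values):
    """Single-query attention computed with one softmax over all scores."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(dim)
    ]


def softmax_attention_tiled(scores, values, tile=4):
    """Same result, but the KV sequence is consumed one tile at a time.

    Hypothetical illustration only: keeps a running max, a running
    softmax denominator, and an unnormalized output accumulator.
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated under the previous max.
        # On the first tile this factor is exp(-inf) == 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            for d in range(dim):
                acc[d] += p * v[d]
        running_max = new_max
    # Normalize once, after the last tile.
    return [a / running_sum for a in acc]
```

The tiled version never needs all scores at once, which is why an accumulator of this shape can stay resident while KV tiles stream through.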

Supported head dims: 64, 128, 256, 512. T=1 decode is validated (cosine similarity ≥ 0.999996). T>1 prefill runs through the multi-row path (P5, P7).
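
A check of this kind can be sketched as follows, assuming "cos" means the cosine similarity between the flattened kernel output and a reference output. The helper names and the flat-list inputs are hypothetical; only the 0.999996 threshold comes from this file.

```python
import math

# Threshold quoted in this status file.
COS_THRESHOLD = 0.999996


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_cos_check(kernel_out, reference_out, threshold=COS_THRESHOLD):
    """Hypothetical helper: True if the outputs agree to the threshold."""
    return cosine_similarity(kernel_out, reference_out) >= threshold
```

Cosine similarity ignores a uniform scale error, so a check like this is usually paired with a maximum-absolute-error bound.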

No CuTeDSL runtime dependency. All kernel code is raw CUDA C++. CuTeDSL (fmha.py) deleted; Python KV merge deleted; FmhaKernel deleted.

Live Attention Files

| File | Role |
|---|---|
| `fmha_6warp_tma_multirow_multitile.cuh` | Production kernel |
| `fmha_common.cuh` | Shared types/defs |
| `fmha_tma.cuh` | TMA descriptor helpers |
| `fmha_umma_desc.cuh` | UMMA descriptor creation |
| `fmha_multitile_capi.cu` | C API wrapper (nvcc compiled) |
| `fmha_multitile_op.py` | ctypes loader |
| `production.py` | Public API (dsv4_attention) |
| `__init__.py` | Bridge to layers (sparse/dense/swa) |

Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)

E1: Wire LayerCacheHandle → gather_compressed_kv, gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim ✅
E2: End-to-end smoke test through one full layer ✅ (SWA + CSA + HCA)
E3: Top-level model/dsv4.py ✅
E4: Delete torch.cuda.synchronize() from fast path ✅
E5: Fold batch loop into kernel grid
E6: FP4 output fusion for FMHA → wo_a
E7: Lightning indexer FP4 tensor-core scoring
E8: Multi-CTA grid for prefill
E9: CUDA graph capture
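
As a rough illustration of what E5 could look like, the sketch below models replacing a host-side loop over batch elements with one flat launch grid whose block index is decoded into (batch, head, q_tile). The index order, the function names, and the presence of a q_tile dimension are assumptions for illustration, not the planned kernel layout.

```python
def decode_block_index(block_id, num_heads, num_q_tiles):
    """Map a flat block id to (batch, head, q_tile), q_tile varying fastest.

    Hypothetical layout: one block per (batch, head, q_tile) triple.
    """
    q_tile = block_id % num_q_tiles
    head = (block_id // num_q_tiles) % num_heads
    batch = block_id // (num_q_tiles * num_heads)
    return batch, head, q_tile


def folded_grid(batch_size, num_heads, num_q_tiles):
    """Work items covered by a single launch over the flat grid."""
    total_blocks = batch_size * num_heads * num_q_tiles
    return [
        decode_block_index(block_id, num_heads, num_q_tiles)
        for block_id in range(total_blocks)
    ]


def looped_grid(batch_size, num_heads, num_q_tiles):
    """Work items covered by one launch per batch element (host loop)."""
    items = []
    for batch in range(batch_size):  # one launch per iteration
        for head in range(num_heads):
            for q_tile in range(num_q_tiles):
                items.append((batch, head, q_tile))
    return items
```

Both functions enumerate the same work items in the same order; the folded form issues them in one launch instead of batch_size launches.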

Cleanup Done (C1–C7)

Deleted: fmha.py, fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh, decode_sparse.py, decode_swa.py, kernels/decode/, 46 test_d*.py probes, root scratch files
Moved: fmha_qk_verify.cuh (to tests/unit/), archive/ (to archived_plans/code_archive/)
Removed: FmhaKernel import, CuTeDSL slow path, Python KV merge, torch.cuda.synchronize in _run_fmha_segmented (function deleted)
