STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)
Production Path
One FMHA kernel: fmha_6warp_tma_multirow_multitile.cuh — 6-warp, TMA, UMMA, tcgen05.mma SS, in-kernel multi-tile SMEM accumulator, multi-row softmax. Loaded via fmha_multitile_capi.cu (C API) + fmha_multitile_op.py (ctypes). Dispatched from production.py.
Head dims: 64, 128, 256, 512. T=1 decode verified (cosine similarity ≥ 0.999996). T>1 prefill runs via the multi-row path (P5, P7).
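The cosine figure quoted for T=1 decode can be read as a similarity check between the kernel output and some reference output. The following is a minimal sketch of such a check, assuming both outputs are flattened to plain lists of floats; the function names and the choice of reference are hypothetical and not taken from the repository's test suite.

```python
import math

# Threshold quoted in this status note for T=1 decode.
COS_THRESHOLD = 0.999996


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length float vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_decode_check(kernel_out, reference_out, threshold=COS_THRESHOLD):
    """Hypothetical acceptance test: kernel output vs. a reference output."""
    return cosine_similarity(kernel_out, reference_out) >= threshold
```

Cosine similarity ignores a uniform scale error, so a check like this is usually paired with a max-absolute-error bound; whether the project does so is not stated here.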
No CuTeDSL runtime dependency. All kernel code is raw CUDA C++. CuTeDSL (fmha.py) deleted; Python KV merge deleted; FmhaKernel deleted.
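The multi-tile accumulator and softmax mentioned above are commonly built on the online-softmax recurrence: keep a running max, a running sum, and an unnormalised output accumulator, and rescale them whenever a new KV tile raises the max. The sketch below is a pure-Python reference for one query row under that assumption. It is not a transcription of the CUDA kernel, and the tile size and function names are illustrative only.

```python
import math


def attention_row_reference(scores, values):
    """Plain softmax(scores) @ values for one query row."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(dim)
    ]


def attention_row_tiled(scores, values, tile=4):
    """Same result, visiting KV in tiles with a running max, sum and accumulator."""
    dim = len(values[0])
    run_max = -math.inf
    run_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(run_max, max(s_tile))
        # Rescale earlier partial results to the new max.
        # On the first tile run_max is -inf, so the factor is exp(-inf) == 0.0.
        scale = math.exp(run_max - new_max)
        run_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            run_sum += p
            for d in range(dim):
                acc[d] += p * v[d]
        run_max = new_max
    # Normalise once at the end.
    return [a / run_sum for a in acc]
```

Normalising once at the end keeps the per-tile work to a rescale plus a multiply-accumulate, which is the property that lets an accumulator stay resident across tiles.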
Live Attention Files
| File | Role |
|---|---|
| `fmha_6warp_tma_multirow_multitile.cuh` | Production kernel |
| `fmha_common.cuh` | Shared types/defs |
| `fmha_tma.cuh` | TMA descriptor helpers |
| `fmha_umma_desc.cuh` | UMMA descriptor creation |
| `fmha_multitile_capi.cu` | C API wrapper (nvcc compiled) |
| `fmha_multitile_op.py` | ctypes loader |
| `production.py` | Public API (`dsv4_attention`) |
| `__init__.py` | Bridge to layers (sparse/dense/swa) |
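
For orientation, the split between the C API wrapper and the ctypes loader might look roughly like the sketch below. The entry-point name `fmha_multitile_run`, its argument list, and the helper names are all hypothetical, since the real signature in `fmha_multitile_capi.cu` is not shown in this note. Only the supported head dims are taken from the text above.

```python
import ctypes

# Hypothetical entry-point name and argument layout; the real C API may differ.
ENTRY_POINT = "fmha_multitile_run"
ARGTYPES = [
    ctypes.c_void_p,  # q device pointer
    ctypes.c_void_p,  # kv device pointer
    ctypes.c_void_p,  # output device pointer
    ctypes.c_int,     # batch size
    ctypes.c_int,     # number of query heads
    ctypes.c_int,     # head_dim
    ctypes.c_int,     # T (query rows)
    ctypes.c_void_p,  # CUDA stream handle
]

# Head dims listed in this status note.
SUPPORTED_HEAD_DIMS = (64, 128, 256, 512)


def bind_entry_point(lib):
    """Attach argument and return types to the (assumed) C entry point."""
    fn = getattr(lib, ENTRY_POINT)
    fn.argtypes = ARGTYPES
    fn.restype = ctypes.c_int  # assumed: 0 on success, non-zero on error
    return fn


def check_head_dim(head_dim):
    """Reject head dims the production kernel is not listed as supporting."""
    if head_dim not in SUPPORTED_HEAD_DIMS:
        raise ValueError(f"unsupported head_dim {head_dim}")
    return head_dim


def load_kernel(shared_object_path):
    """Load the nvcc-built shared object and return the bound entry point."""
    return bind_entry_point(ctypes.CDLL(shared_object_path))
```

Declaring `argtypes` up front lets ctypes reject mismatched arguments on the Python side instead of passing malformed values into the kernel launch.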
Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)
- E1: Wire `LayerCacheHandle` → `gather_compressed_kv`, `gather_all_compressed_kv`, `gather_swa_kv`, `num_query_heads`, `head_dim` ✅
- E2: End-to-end smoke test through one full layer ✅ (SWA + CSA + HCA)
- E3: Top-level `model/dsv4.py` ✅
- E4: Delete `torch.cuda.synchronize()` from fast path ✅
- E5: Fold batch loop into kernel grid
- E6: FP4 output fusion for FMHA → wo_a
- E7: Lightning indexer FP4 tensor-core scoring
- E8: Multi-CTA grid for prefill
- E9: CUDA graph capture
Cleanup Done (C1–C7)
- Deleted: fmha.py, fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh, fmha_qk_verify.cuh (moved to tests/unit/), decode_sparse.py, decode_swa.py, kernels/decode/, 46 test_d*.py probes, root scratch files, archive/ (moved to archived_plans/code_archive/)
- Removed: FmhaKernel import, CuTeDSL slow path, Python KV merge, torch.cuda.synchronize in _run_fmha_segmented (function deleted)