STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)

Production Path

One FMHA kernel: fmha_6warp_tma_multirow_multitile.cuh — 6-warp, TMA, UMMA, tcgen05.mma SS, in-kernel multi-tile SMEM accumulator, multi-row softmax. Loaded via fmha_multitile_capi.cu (C API) + fmha_multitile_op.py (ctypes). Dispatched from production.py.
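As a rough illustration of the C API + ctypes loading pattern described above, the sketch below binds an explicit signature to an exported launcher. The symbol name `fmha_multitile_launch`, its argument list, and the return-code convention are all hypothetical; the real exports in fmha_multitile_capi.cu and the real loader in fmha_multitile_op.py may look quite different.

```python
import ctypes

# Hypothetical symbol name and argument list (assumptions for illustration);
# the actual C API exported by fmha_multitile_capi.cu may differ.
SYMBOL = "fmha_multitile_launch"
ARGTYPES = [
    ctypes.c_void_p,  # q device pointer
    ctypes.c_void_p,  # k device pointer
    ctypes.c_void_p,  # v device pointer
    ctypes.c_void_p,  # output device pointer
    ctypes.c_int,     # number of query rows (T)
    ctypes.c_int,     # number of KV positions
    ctypes.c_int,     # head_dim
    ctypes.c_void_p,  # CUDA stream handle
]


def bind_launcher(lib, symbol=SYMBOL):
    """Attach an explicit signature to an exported C function."""
    fn = getattr(lib, symbol)
    fn.argtypes = ARGTYPES
    fn.restype = ctypes.c_int  # assumed convention: 0 on success
    return fn


def load_launcher(so_path):
    """Load a shared library (e.g. one built by nvcc) and bind the launcher.

    ctypes.CDLL raises OSError if the library cannot be loaded.
    """
    return bind_launcher(ctypes.CDLL(so_path))
```

Declaring `argtypes` and `restype` up front lets ctypes reject mismatched arguments on the Python side instead of passing them through to the kernel launcher unchecked.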

Head dims: 64, 128, 256, 512. T=1 decode validated (cosine similarity ≥ 0.999996). T>1 prefill runs through the multi-row path (P5, P7).
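To make the multi-tile accumulation and the cosine check concrete, here is a minimal pure-Python sketch. It assumes the kernel follows the standard online-softmax rescaling scheme (running max, running sum, rescaled accumulator per query row), which this document does not confirm. All function names are hypothetical, and the real kernel works on SMEM/tensor-core tiles rather than Python lists.

```python
import math


def attention_reference(q, keys, values):
    """Single-pass softmax attention for one query row."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_tiled(q, keys, values, tile=4):
    """Same computation, visiting KV in tiles with a running accumulator."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    run_max, run_sum, acc = -math.inf, 0.0, [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(run_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on first tile).
        correction = math.exp(run_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        run_sum = run_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_tile))
               for d, a in enumerate(acc)]
        run_max = new_max
    return [a / run_sum for a in acc]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Usage: two query rows (T=2), 10 KV positions, head_dim 4 (toy sizes).
Q = [[math.sin(i + j) for j in range(4)] for i in range(2)]
K = [[math.cos(0.5 * i + j) for j in range(4)] for i in range(10)]
V = [[math.sin(0.3 * i - j) for j in range(4)] for i in range(10)]
for q in Q:
    ref = attention_reference(q, K, V)
    out = attention_tiled(q, K, V, tile=4)
    print(cosine(ref, out) >= 0.999996)
```

Each query row keeps its own running max and sum, so the multi-row case is the same loop applied per row. The threshold in the usage lines mirrors the one quoted above, but the comparison here is only between the two toy implementations, not against the production kernel.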

No CuTeDSL runtime dependency. All kernel code is raw CUDA C++. CuTeDSL (fmha.py) deleted; Python KV merge deleted; FmhaKernel deleted.

Live Attention Files

| File | Role |
| --- | --- |
| `fmha_6warp_tma_multirow_multitile.cuh` | Production kernel |
| `fmha_common.cuh` | Shared types/defs |
| `fmha_tma.cuh` | TMA descriptor helpers |
| `fmha_umma_desc.cuh` | UMMA descriptor creation |
| `fmha_multitile_capi.cu` | C API wrapper (nvcc compiled) |
| `fmha_multitile_op.py` | ctypes loader |
| `production.py` | Public API (dsv4_attention) |
| `__init__.py` | Bridge to layers (sparse/dense/swa) |

Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)

  • E1: Wire LayerCacheHandle → gather_compressed_kv, gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim
  • E2: End-to-end smoke test through one full layer (SWA + CSA + HCA)
  • E3: Top-level model/dsv4.py
  • E4: Delete torch.cuda.synchronize() from fast path
  • E5: Fold batch loop into kernel grid
  • E6: FP4 output fusion for FMHA → wo_a
  • E7: Lightning indexer FP4 tensor-core scoring
  • E8: Multi-CTA grid for prefill
  • E9: CUDA graph capture
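For E5 and E8, one possible way to fold the host-side batch loop into the launch grid is to give every (batch, head, query tile) triple its own block and recover the triple from a flat block index. The sketch below models only that index arithmetic in Python. The function names and the flat 1-D layout are assumptions, and the actual kernel may instead use a 2-D or 3-D grid.

```python
def grid_size(batch, num_heads, num_q_tiles):
    """Total number of blocks when the batch loop is folded into the grid."""
    return batch * num_heads * num_q_tiles


def unfold_block_index(block_idx, num_heads, num_q_tiles):
    """Map a flat block index to (batch, head, q_tile).

    q_tile varies fastest, then head, then batch (an assumed ordering).
    """
    q_tile = block_idx % num_q_tiles
    head = (block_idx // num_q_tiles) % num_heads
    batch = block_idx // (num_q_tiles * num_heads)
    return batch, head, q_tile


# Usage: batch=2, 3 heads, 2 query tiles per head (toy sizes).
for idx in range(grid_size(2, 3, 2)):
    print(idx, unfold_block_index(idx, num_heads=3, num_q_tiles=2))
```

With this layout a single launch covers what was previously one launch per batch element, which also fits the CUDA graph capture goal in E9, since the launch count no longer depends on a Python loop.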

Cleanup Done (C1–C7)

  • Deleted: fmha.py, fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh, fmha_qk_verify.cuh (moved to tests/unit/), decode_sparse.py, decode_swa.py, kernels/decode/, 46 test_d*.py probes, root scratch files, archive/ (moved to archived_plans/code_archive/)
  • Removed: FmhaKernel import, CuTeDSL slow path, Python KV merge, torch.cuda.synchronize in _run_fmha_segmented (function deleted)