# STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)

## Production Path

**One FMHA kernel:** `fmha_6warp_tma_multirow_multitile.cuh`. It uses 6 warps, TMA loads, UMMA (`tcgen05.mma`, SS variant), an in-kernel multi-tile SMEM accumulator, and a multi-row softmax. It is loaded through `fmha_multitile_capi.cu` (C API) and `fmha_multitile_op.py` (ctypes), and dispatched from `production.py`.

**Head dims:** 64, 128, 256, 512. **T=1 decode** validated (cosine similarity ≥ 0.999996). **T>1 prefill** runs via the multi-row path (P5, P7).
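
The cosine-similarity criterion quoted above can be checked with a few lines of Python. This is a minimal sketch, not the project's test harness: the function names `cosine_similarity` and `passes_decode_check` are hypothetical, and the inputs are assumed to be flattened kernel output and reference output of equal length.

```python
import math

# Threshold taken from the status line above; everything else is illustrative.
DECODE_COS_THRESHOLD = 0.999996

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def passes_decode_check(output, reference, threshold=DECODE_COS_THRESHOLD):
    """True if the (hypothetical) kernel output matches the reference closely enough."""
    return cosine_similarity(output, reference) >= threshold
```

Cosine similarity ignores a uniform scale error, so a check like this is usually paired with an absolute or relative error bound; whether the project does so is not stated here.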

**No CuTeDSL runtime dependency.** All kernel code is raw CUDA C++. The CuTeDSL implementation (`fmha.py`), the Python KV merge, and `FmhaKernel` have all been deleted.

## Live Attention Files

| File | Role |
|---|---|
| `fmha_6warp_tma_multirow_multitile.cuh` | Production kernel |
| `fmha_common.cuh` | Shared types/defs |
| `fmha_tma.cuh` | TMA descriptor helpers |
| `fmha_umma_desc.cuh` | UMMA descriptor creation |
| `fmha_multitile_capi.cu` | C API wrapper (nvcc compiled) |
| `fmha_multitile_op.py` | ctypes loader |
| `production.py` | Public API (`dsv4_attention`) |
| `__init__.py` | Bridge to layers (sparse/dense/swa) |
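
For orientation, the C API plus ctypes loader pairing in the table might look roughly like the sketch below. The shared-library path, the exported symbol name, and the argument list are all assumptions made for illustration; the real names and signature live in `fmha_multitile_capi.cu` and `fmha_multitile_op.py` and are not reproduced here.

```python
import ctypes
import os

# Hypothetical names: neither the library path nor the symbol is given in this file.
LIB_PATH = "libfmha_multitile.so"
ENTRY_POINT = "fmha_multitile_launch"

def load_fmha(path=LIB_PATH, symbol=ENTRY_POINT):
    """Return the C entry point as a Python callable, or None if it is unavailable."""
    if not os.path.exists(path):
        return None
    try:
        lib = ctypes.CDLL(path)
        fn = getattr(lib, symbol)
    except (OSError, AttributeError):
        # OSError: the library failed to load; AttributeError: symbol not exported.
        return None
    # Assumed signature: four device pointers (Q, K, V, O), three integer sizes,
    # and a CUDA stream handle, returning an integer status code.
    fn.argtypes = [ctypes.c_void_p] * 4 + [ctypes.c_int] * 3 + [ctypes.c_void_p]
    fn.restype = ctypes.c_int
    return fn
```

Declaring `argtypes` and `restype` up front lets ctypes reject mismatched arguments on the Python side instead of passing malformed values into the kernel launch.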

## Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)

- [x] **E1:** Wire `LayerCacheHandle` → `gather_compressed_kv`, `gather_all_compressed_kv`, `gather_swa_kv`, `num_query_heads`, `head_dim` ✅
- [x] **E2:** End-to-end smoke test through one full layer ✅ (SWA + CSA + HCA)
- [x] **E3:** Top-level `model/dsv4.py` ✅
- [x] **E4:** Delete `torch.cuda.synchronize()` from fast path ✅
- [ ] **E5:** Fold batch loop into kernel grid
- [ ] **E6:** FP4 output fusion for FMHA → wo_a
- [ ] **E7:** Lightning indexer FP4 tensor-core scoring
- [ ] **E8:** Multi-CTA grid for prefill
- [ ] **E9:** CUDA graph capture
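
As an illustration of what E5 amounts to, the sketch below models the index arithmetic in plain Python. It assumes the current path launches once per batch element with one block per head, which this file does not confirm; the function names and the `(batch, head)` work-item layout are hypothetical.

```python
def per_batch_launches(batch_size, num_heads):
    """Work items visited by a host-side batch loop (one launch per batch element)."""
    work = []
    for b in range(batch_size):        # host loop: one launch per iteration
        for h in range(num_heads):     # blocks inside that launch
            work.append((b, h))
    return work

def folded_launch(batch_size, num_heads):
    """The same work items from a single launch whose grid spans batch * heads."""
    grid = batch_size * num_heads
    # Each block recovers its (batch, head) pair from its linear block index.
    return [divmod(block_idx, num_heads) for block_idx in range(grid)]
```

Both functions enumerate identical `(batch, head)` pairs in the same order, so folding changes only the number of launches and not the work covered.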

## Cleanup Done (C1–C7)

- Deleted: `fmha.py`, `fmha_sm100.cuh`, `fmha_sm100_tc.cuh`, `fmha_sm100_launch.cu`, `fmha_epilogue_sm100.cuh`, `fmha_qk_verify.cuh` (moved to `tests/unit/`), `decode_sparse.py`, `decode_swa.py`, `kernels/decode/`, 46 `test_d*.py` probes, root scratch files, `archive/` (moved to `archived_plans/code_archive/`)
- Removed: `FmhaKernel` import, CuTeDSL slow path, Python KV merge, `torch.cuda.synchronize` in `_run_fmha_segmented` (function deleted)