# STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)
## Production Path
**One FMHA kernel:** `fmha_6warp_tma_multirow_multitile.cuh` — 6-warp, TMA, UMMA, tcgen05.mma SS, in-kernel multi-tile SMEM accumulator, multi-row softmax. Loaded via `fmha_multitile_capi.cu` (C API) + `fmha_multitile_op.py` (ctypes). Dispatched from `production.py`.
**Head dims:** 64, 128, 256, 512. **T=1 decode** proven (cosine similarity ≥ 0.999996). **T>1 prefill** via multi-row path (P5, P7).
**No CuTeDSL runtime dependency.** All kernel code is raw CUDA C++. CuTeDSL (fmha.py) deleted; Python KV merge deleted; `FmhaKernel` deleted.
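
A minimal sketch of what a check like the T=1 decode threshold (cos ≥ 0.999996) can look like. It is pure Python for illustration only; the function names, the `1/sqrt(head_dim)` scale, and the single-head layout are assumptions, not taken from the repository's tests.

```python
import math

COS_THRESHOLD = 0.999996  # threshold quoted in this status file


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def reference_decode_attention(q, keys, values):
    """Single-query (T=1) softmax attention for one head.

    Assumption: scores are scaled by 1/sqrt(head_dim); the kernel's
    actual scale factor is not stated in this file.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out_dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(out_dim)
    ]


def passes(kernel_out, reference_out, threshold=COS_THRESHOLD):
    """True when the kernel output is close enough to the reference."""
    return cosine_similarity(kernel_out, reference_out) >= threshold
```

In practice the kernel output would be copied back from the device and flattened before calling `passes`; the reference would typically be computed in higher precision than the kernel.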
## Live Attention Files
| File | Role |
|---|---|
| `fmha_6warp_tma_multirow_multitile.cuh` | Production kernel |
| `fmha_common.cuh` | Shared types/defs |
| `fmha_tma.cuh` | TMA descriptor helpers |
| `fmha_umma_desc.cuh` | UMMA descriptor creation |
| `fmha_multitile_capi.cu` | C API wrapper (nvcc compiled) |
| `fmha_multitile_op.py` | ctypes loader |
| `production.py` | Public API (dsv4_attention) |
| `__init__.py` | Bridge to layers (sparse/dense/swa) |
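
For orientation, a ctypes loader for a C API wrapper usually declares the argument and return types once and validates inputs before the call. The sketch below shows that pattern only. The symbol name `fmha_multitile_launch`, the argument list, and the library path are hypothetical and are not taken from `fmha_multitile_capi.cu` or `fmha_multitile_op.py`; only the head-dim set comes from this file.

```python
import ctypes

SUPPORTED_HEAD_DIMS = (64, 128, 256, 512)  # from the Production Path section

# Hypothetical argument list for the C entry point (an assumption).
FMHA_ARGTYPES = [
    ctypes.c_void_p,  # q device pointer
    ctypes.c_void_p,  # k device pointer
    ctypes.c_void_p,  # v device pointer
    ctypes.c_void_p,  # output device pointer
    ctypes.c_int,     # num_query_heads
    ctypes.c_int,     # head_dim
    ctypes.c_int,     # kv_len
    ctypes.c_void_p,  # CUDA stream handle
]


def bind_fmha(lib, symbol="fmha_multitile_launch"):
    """Attach argtypes/restype to the entry point of an already loaded library.

    `lib` would normally come from ctypes.CDLL("/path/to/lib.so")
    (hypothetical path); `symbol` is a hypothetical export name.
    """
    fn = getattr(lib, symbol)
    fn.argtypes = FMHA_ARGTYPES
    fn.restype = ctypes.c_int  # assumed: 0 on success, nonzero on error
    return fn


def check_head_dim(head_dim):
    """Reject head dims outside the set this status file lists."""
    if head_dim not in SUPPORTED_HEAD_DIMS:
        raise ValueError(f"unsupported head_dim {head_dim}")
    return head_dim
```

Declaring `argtypes` up front lets ctypes reject mismatched arguments on the Python side instead of passing malformed values to the kernel launcher.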
## Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)
- [x] **E1:** Wire `LayerCacheHandle` → `gather_compressed_kv`, `gather_all_compressed_kv`, `gather_swa_kv`, `num_query_heads`, `head_dim`
- [x] **E2:** End-to-end smoke test through one full layer ✅ (SWA + CSA + HCA)
- [x] **E3:** Top-level `model/dsv4.py`
- [x] **E4:** Delete `torch.cuda.synchronize()` from fast path ✅
- [ ] **E5:** Fold batch loop into kernel grid
- [ ] **E6:** FP4 output fusion for FMHA → wo_a
- [ ] **E7:** Lightning indexer FP4 tensor-core scoring
- [ ] **E8:** Multi-CTA grid for prefill
- [ ] **E9:** CUDA graph capture
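
To illustrate what E5 is after, the sketch below models the two launch shapes in plain Python. It assumes the batch loop currently runs on the host with one launch per batch element, and that the folded version uses a `(heads, batch)` grid; both the grid layout and the function names are assumptions for illustration, not the planned implementation.

```python
def per_batch_launches(batch, heads):
    """Host-side batch loop: one launch per batch element, grid = (heads,)."""
    launches = []
    for b in range(batch):
        # Each launch covers every head of a single batch element.
        launches.append([(b, h) for h in range(heads)])
    return launches


def folded_launch(batch, heads):
    """Single launch with grid = (heads, batch).

    Each block recovers its own (batch, head) pair from its block index,
    modeled here as (block_y, block_x).
    """
    grid = (heads, batch)
    work = []
    for block_y in range(grid[1]):      # plays the role of blockIdx.y
        for block_x in range(grid[0]):  # plays the role of blockIdx.x
            work.append((block_y, block_x))
    return [work]
```

Both variants cover the same set of `(batch, head)` pairs; the folded form does it in one launch, which removes per-launch host overhead and is generally easier to capture in a CUDA graph (E9).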
## Cleanup Done (C1-C7)
- Deleted: fmha.py, fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh, fmha_qk_verify.cuh (moved to tests/unit/), decode_sparse.py, decode_swa.py, kernels/decode/, 46 test_d*.py probes, root scratch files, archive/ (moved to archived_plans/code_archive/)
- Removed: FmhaKernel import, CuTeDSL slow path, Python KV merge, torch.cuda.synchronize in _run_fmha_segmented (function deleted)