Files
nvfp4-megamoe-kernel/archived_plans/STATUS.md
biondizzle 13be3ad443 FMHA sink bias in kernel + single_shot production rewrite
FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh):
- Added sink_bias field to FmhaTmaMultiRowMultiTileParams
- After KV tile loop, sink logit is included in online softmax rescale:
  new_max = max(running_max, sink_bias * scale)
  rescale existing O_unnorm and running_sum
  running_sum += exp(sink_bias * scale - new_max)
  No PV contribution from sink (D5c: single softmax)
- C API: fmha_multitile_decode_launch now takes sink_bias_ptr
- Python: fmha_multitile_decode_raw accepts attn_sink tensor

single_shot_inference.py:
- Full rewrite to use production kernel stack
- mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp)
- Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b
- FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback)
- MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback)
- Router: production dense/hash dispatch
- Compressor/Indexer: reference dequant (not yet on tensor cores)
- NO try/except fallbacks on production paths
2026-05-31 23:10:13 +00:00

43 lines
2.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)
## Production Path
**One FMHA kernel:** `fmha_6warp_tma_multirow_multitile.cuh` — 6-warp, TMA, UMMA, tcgen05.mma SS, in-kernel multi-tile SMEM accumulator, multi-row softmax. Loaded via `fmha_multitile_capi.cu` (C API) + `fmha_multitile_op.py` (ctypes). Dispatched from `production.py`.
**Head dims:** 64, 128, 256, 512. **T=1 decode** proven (cos ≥ 0.999996). **T>1 prefill** via multi-row path (P5, P7).
**No CuTeDSL runtime dependency.** All kernel code is raw CUDA C++. CuTeDSL (fmha.py) deleted; Python KV merge deleted; `FmhaKernel` deleted.
## Live Attention Files
| File | Role |
|---|---|
| `fmha_6warp_tma_multirow_multitile.cuh` | Production kernel |
| `fmha_common.cuh` | Shared types/defs |
| `fmha_tma.cuh` | TMA descriptor helpers |
| `fmha_umma_desc.cuh` | UMMA descriptor creation |
| `fmha_multitile_capi.cu` | C API wrapper (nvcc compiled) |
| `fmha_multitile_op.py` | ctypes loader |
| `production.py` | Public API (dsv4_attention) |
| `__init__.py` | Bridge to layers (sparse/dense/swa) |
## Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)
- [x] **E1:** Wire `LayerCacheHandle` → gather methods ✅
- [x] **E2:** E2E smoke tests (SWA + CSA + HCA) ✅
- [x] **E3:** DSV4Model class ✅
- [x] **E4:** Removed `torch.cuda.synchronize`
- [x] **E5:** Batch loop folded into kernel grid ✅
- [x] **Single-shot inference:** Full 61-layer pipeline runs on B200 ✅
- FMHA kernel verified: hd=512, 128 query heads, all layers correct
- Garbage output expected without mHC/MoE/KV-cache (architecture gaps, not kernel)
- [ ] **E6:** FP4 output fusion for FMHA → wo_a
- [ ] **E7:** Lightning indexer FP4 tensor-core scoring
- [ ] **E8:** Multi-CTA grid for prefill
- [ ] **E9:** CUDA graph capture
## Cleanup Done (C1C7)
- Deleted: fmha.py, fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh, fmha_qk_verify.cuh (moved to tests/unit/), decode_sparse.py, decode_swa.py, kernels/decode/, 46 test_d*.py probes, root scratch files, archive/ (moved to archived_plans/code_archive/)
- Removed: FmhaKernel import, CuTeDSL slow path, Python KV merge, torch.cuda.synchronize in _run_fmha_segmented (function deleted)