FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh): - Added sink_bias field to FmhaTmaMultiRowMultiTileParams - After KV tile loop, sink logit is included in online softmax rescale: new_max = max(running_max, sink_bias * scale) rescale existing O_unnorm and running_sum running_sum += exp(sink_bias * scale - new_max) No PV contribution from sink (D5c: single softmax) - C API: fmha_multitile_decode_launch now takes sink_bias_ptr - Python: fmha_multitile_decode_raw accepts attn_sink tensor single_shot_inference.py: - Full rewrite to use production kernel stack - mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp) - Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b - FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback) - MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback) - Router: production dense/hash dispatch - Compressor/Indexer: reference dequant (not yet on tensor cores) - NO try/except fallbacks on production paths
2.2 KiB
2.2 KiB
STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)
Production Path
One FMHA kernel: fmha_6warp_tma_multirow_multitile.cuh — 6-warp, TMA, UMMA, tcgen05.mma SS, in-kernel multi-tile SMEM accumulator, multi-row softmax. Loaded via fmha_multitile_capi.cu (C API) + fmha_multitile_op.py (ctypes). Dispatched from production.py.
Head dims: 64, 128, 256, 512. T=1 decode proven (cos ≥ 0.999996). T>1 prefill via multi-row path (P5, P7).
No CuTeDSL runtime dependency. All kernel code is raw CUDA C++. CuTeDSL (fmha.py) deleted; Python KV merge deleted; FmhaKernel deleted.
Live Attention Files
| File | Role |
|---|---|
fmha_6warp_tma_multirow_multitile.cuh |
Production kernel |
fmha_common.cuh |
Shared types/defs |
fmha_tma.cuh |
TMA descriptor helpers |
fmha_umma_desc.cuh |
UMMA descriptor creation |
fmha_multitile_capi.cu |
C API wrapper (nvcc compiled) |
fmha_multitile_op.py |
ctypes loader |
production.py |
Public API (dsv4_attention) |
__init__.py |
Bridge to layers (sparse/dense/swa) |
Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)
- E1: Wire
LayerCacheHandle→ gather methods ✅ - E2: E2E smoke tests (SWA + CSA + HCA) ✅
- E3: DSV4Model class ✅
- E4: Removed
torch.cuda.synchronize✅ - E5: Batch loop folded into kernel grid ✅
- Single-shot inference: Full 61-layer pipeline runs on B200 ✅
- FMHA kernel verified: hd=512, 128 query heads, all layers correct
- Garbage output expected without mHC/MoE/KV-cache (architecture gaps, not kernel)
- E6: FP4 output fusion for FMHA → wo_a
- E7: Lightning indexer FP4 tensor-core scoring
- E8: Multi-CTA grid for prefill
- E9: CUDA graph capture
Cleanup Done (C1–C7)
- Deleted: fmha.py, fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh, fmha_qk_verify.cuh (moved to tests/unit/), decode_sparse.py, decode_swa.py, kernels/decode/, 46 test_d*.py probes, root scratch files, archive/ (moved to archived_plans/code_archive/)
- Removed: FmhaKernel import, CuTeDSL slow path, Python KV merge, torch.cuda.synchronize in _run_fmha_segmented (function deleted)