Files
biondizzle 13be3ad443 FMHA sink bias in kernel + single_shot production rewrite
FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh):
- Added sink_bias field to FmhaTmaMultiRowMultiTileParams
- After KV tile loop, sink logit is included in online softmax rescale:
  new_max = max(running_max, sink_bias * scale)
  rescale existing O_unnorm and running_sum
  running_sum += exp(sink_bias * scale - new_max)
  No PV contribution from sink (D5c: single softmax)
- C API: fmha_multitile_decode_launch now takes sink_bias_ptr
- Python: fmha_multitile_decode_raw accepts attn_sink tensor

single_shot_inference.py:
- Full rewrite to use production kernel stack
- mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp)
- Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b
- FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback)
- MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback)
- Router: production dense/hash dispatch
- Compressor/Indexer: reference dequant (not yet on tensor cores)
- NO try/except fallbacks on production paths
2026-05-31 23:10:13 +00:00

2.2 KiB
Raw Permalink Blame History

STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)

Production Path

One FMHA kernel: fmha_6warp_tma_multirow_multitile.cuh — 6-warp, TMA, UMMA, tcgen05.mma SS, in-kernel multi-tile SMEM accumulator, multi-row softmax. Loaded via fmha_multitile_capi.cu (C API) + fmha_multitile_op.py (ctypes). Dispatched from production.py.

Head dims: 64, 128, 256, 512. T=1 decode proven (cos ≥ 0.999996). T>1 prefill via multi-row path (P5, P7).

No CuTeDSL runtime dependency. All kernel code is raw CUDA C++. CuTeDSL (fmha.py) deleted; Python KV merge deleted; FmhaKernel deleted.

Live Attention Files

File Role
fmha_6warp_tma_multirow_multitile.cuh Production kernel
fmha_common.cuh Shared types/defs
fmha_tma.cuh TMA descriptor helpers
fmha_umma_desc.cuh UMMA descriptor creation
fmha_multitile_capi.cu C API wrapper (nvcc compiled)
fmha_multitile_op.py ctypes loader
production.py Public API (dsv4_attention)
__init__.py Bridge to layers (sparse/dense/swa)

Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)

  • E1: Wire LayerCacheHandle → gather methods
  • E2: E2E smoke tests (SWA + CSA + HCA)
  • E3: DSV4Model class
  • E4: Removed torch.cuda.synchronize
  • E5: Batch loop folded into kernel grid
  • Single-shot inference: Full 61-layer pipeline runs on B200
    • FMHA kernel verified: hd=512, 128 query heads, all layers correct
    • Garbage output expected without mHC/MoE/KV-cache (architecture gaps, not kernel)
  • E6: FP4 output fusion for FMHA → wo_a
  • E7: Lightning indexer FP4 tensor-core scoring
  • E8: Multi-CTA grid for prefill
  • E9: CUDA graph capture

Cleanup Done (C1C7)

  • Deleted: fmha.py, fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh, fmha_qk_verify.cuh (moved to tests/unit/), decode_sparse.py, decode_swa.py, kernels/decode/, 46 test_d*.py probes, root scratch files, archive/ (moved to archived_plans/code_archive/)
  • Removed: FmhaKernel import, CuTeDSL slow path, Python KV merge, torch.cuda.synchronize in _run_fmha_segmented (function deleted)