nvfp4-megamoe-kernel/archived_plans/STATUS.md at master

biondizzle 13be3ad443 FMHA sink bias in kernel + single_shot production rewrite

FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh):
- Added sink_bias field to FmhaTmaMultiRowMultiTileParams
- After KV tile loop, sink logit is included in online softmax rescale:
  new_max = max(running_max, sink_bias * scale)
  rescale existing O_unnorm and running_sum
  running_sum += exp(sink_bias * scale - new_max)
  No PV contribution from sink (D5c: single softmax)
- C API: fmha_multitile_decode_launch now takes sink_bias_ptr
- Python: fmha_multitile_decode_raw accepts attn_sink tensor
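
The sink handling described above can be sketched in plain Python. This is a minimal reference model under stated assumptions, not the CUDA kernel: the function name `decode_attention_with_sink`, the list-of-tiles layout, and the single-query-row shape are hypothetical and chosen only for illustration. It follows the commit message in scaling the sink logit as `sink_bias * scale` and in adding the sink to the softmax denominator only, with no PV contribution.

```python
import math


def decode_attention_with_sink(scores, values, sink_bias, scale):
    """Reference model of online softmax over KV tiles plus a sink logit.

    scores: list of tiles, each a list of raw (unscaled) q.k logits
    values: list of tiles, each a list of value vectors (lists of floats)
    Hypothetical layout for illustration; one query row only.
    """
    running_max = -math.inf
    running_sum = 0.0
    o_unnorm = [0.0] * len(values[0][0])

    # KV tile loop: standard online softmax with rescaling.
    for tile_s, tile_v in zip(scores, values):
        new_max = max(running_max, max(s * scale for s in tile_s))
        corr = math.exp(running_max - new_max)  # 0.0 on the first tile
        running_sum *= corr
        o_unnorm = [o * corr for o in o_unnorm]
        for s, v in zip(tile_s, tile_v):
            p = math.exp(s * scale - new_max)
            running_sum += p
            o_unnorm = [o + p * vi for o, vi in zip(o_unnorm, v)]
        running_max = new_max

    # After the tile loop: fold the sink logit into the same softmax.
    sink_logit = sink_bias * scale
    new_max = max(running_max, sink_logit)
    corr = math.exp(running_max - new_max)
    o_unnorm = [o * corr for o in o_unnorm]  # rescale existing O_unnorm
    running_sum = running_sum * corr + math.exp(sink_logit - new_max)
    # The sink adds to the denominator only; it has no value vector.

    return [o / running_sum for o in o_unnorm]
```

Because the sink contributes to `running_sum` but not to `o_unnorm`, the attention weights over real KV positions sum to less than 1, which is the intended effect of a single softmax that includes the sink.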

single_shot_inference.py:
- Full rewrite to use production kernel stack
- mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp)
- Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b
- FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback)
- MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback)
- Router: production dense/hash dispatch
- Compressor/Indexer: reference dequant (not yet on tensor cores)
- NO try/except fallbacks on production paths

| File | Role |
| --- | --- |
| `fmha_6warp_tma_multirow_multitile.cuh` | Production kernel |
| `fmha_common.cuh` | Shared types/defs |
| `fmha_tma.cuh` | TMA descriptor helpers |
| `fmha_umma_desc.cuh` | UMMA descriptor creation |
| `fmha_multitile_capi.cu` | C API wrapper (nvcc compiled) |
| `fmha_multitile_op.py` | ctypes loader |
| `production.py` | Public API (dsv4_attention) |
| `__init__.py` | Bridge to layers (sparse/dense/swa) |


STATUS — DSV4 Inference Kernel (post-cleanup 2026-05-30)

Production Path

Live Attention Files

Stage E Checklist (from ROADMAP/NEXT_PRIORITIES_PART_2)

Cleanup Done (C1–C7)
