Files
nvfp4-megamoe-kernel/dsv4
biondizzle 230c350c77 FMHA SM100: Raw CUDA C++ decode kernel — initial skeleton
6-warp specialization using CUTLASS C++ atoms directly:
- tcgen05.mma for QK (SMEM→SMEM→TMEM) and PV (TMEM→SMEM→TMEM)
- TMEM accumulator with one-way correction epilogue (TMEM→regs→SMEM→GMEM)
- In-kernel O rescale via registers (fixes D1.5 TMEM round-trip!)
- D3/D4/D5c masks, NVFP4 quantize helpers, FP8 E4M3 encode
- PyTorch binding with head_dim template dispatch

This bypasses all CuTeDSL limitations: float→int, TMEM round-trip,
multi-CTA, hd=512 MLIR compilation hang.
2026-05-28 05:04:44 +00:00
..