nvfp4-megamoe-kernel

Files

biondizzle c0379a0f86 P6: Remove broken TMA store — use direct GMEM write from SMEM

cp.async.bulk.tensor store (SMEM→GMEM) is NOT available on SM100.
The CUTLASS SM100 epilogue uses st.global directly.

The one-way epilogue pipeline is now:
  1. TMEM → regs (tcgen05.ld, warp-collective)
  2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE)
  3. regs → SMEM (row-major, sO_epi)
  4. SMEM → GMEM (direct write)

This is the same pattern as the MoE kernel but with st.global instead
of TMA store. Multi-CTA (D2) will use st.global with flat_divide coords.
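
Steps 3 and 4 of the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the kernel from this commit: the tile sizes, the `epilogue_sketch` and `run_epilogue_sketch` names, the `fake_accumulator` stand-in for the TMEM load, and the single `inv_row_sum` scale factor are all assumptions made for the example. Only the name `sO_epi` and the overall regs → SMEM → GMEM order come from the commit message.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tile shape; the real sO_epi layout is not shown in the commit.
constexpr int TILE_M = 32;  // rows per CTA tile (one row per thread here)
constexpr int TILE_N = 64;  // columns per tile

// Stand-in for step 1: the real kernel reads the accumulator from TMEM with
// tcgen05.ld. Here each thread simply synthesizes a value.
__device__ float fake_accumulator(int row, int col) {
    return static_cast<float>(row * TILE_N + col);
}

__global__ void epilogue_sketch(float* __restrict__ out, int ld_out,
                                float inv_row_sum) {
    __shared__ float sO_epi[TILE_M * TILE_N];  // row-major staging buffer

    // Step 2 + 3: apply the epilogue op in registers, then regs -> SMEM.
    // No padding or swizzle is used, so these writes have bank conflicts;
    // a tuned kernel would address that.
    const int row = threadIdx.x;
    for (int col = 0; col < TILE_N; ++col) {
        float acc = fake_accumulator(row, col) * inv_row_sum;
        sO_epi[row * TILE_N + col] = acc;
    }
    __syncthreads();  // all rows must be staged before the cooperative copy

    // Step 4: SMEM -> GMEM with ordinary stores (expected to lower to
    // st.global). Consecutive threads write consecutive addresses.
    for (int idx = threadIdx.x; idx < TILE_M * TILE_N; idx += blockDim.x) {
        const int r = idx / TILE_N;
        const int c = idx % TILE_N;
        out[(blockIdx.x * TILE_M + r) * ld_out + c] = sO_epi[idx];
    }
}

// Host-side check: launches two tiles and verifies every output element.
bool run_epilogue_sketch() {
    const int tiles = 2;
    const int rows = tiles * TILE_M;
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, rows * TILE_N * sizeof(float)) != cudaSuccess)
        return false;

    epilogue_sketch<<<tiles, TILE_M>>>(d_out, TILE_N, 0.5f);

    std::vector<float> h(rows * TILE_N);
    cudaError_t err = cudaMemcpy(h.data(), d_out, h.size() * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < TILE_N; ++c)
            if (h[r * TILE_N + c] !=
                0.5f * static_cast<float>((r % TILE_M) * TILE_N + c))
                return false;
    return true;
}
```

Staging through SMEM lets the final loop issue coalesced global stores even though each thread originally holds a whole row in registers, which is the usual reason for keeping step 3 when the TMA store is dropped.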

Removed: tma_o from FmhaParams, fmha_multihead_decode_tma_launch,
sMbarStore from SMEM, broken TMA store PTX from fmha_tma.cuh.

2026-05-30 17:11:17 +00:00

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

P6: Remove broken TMA store — use direct GMEM write from SMEM

2026-05-30 17:11:17 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering