nvfp4-megamoe-kernel

Files

biondizzle ffb3e736bb D1.5: Revert broken paired-atom O rescale — TMEM round-trip fundamentally broken

Ld32x32bOp and St32x32bOp have different column mappings at the hardware
level. No layout transformation can fix this — the atoms themselves map
TMEM columns differently.

The MoE correction epilogue avoids the problem by doing a ONE-WAY trip
(TMEM→REGS→SMEM→GMEM, never writes back to TMEM). FMHA needs O in TMEM
for PV accumulation between kt iterations, so one-way doesn't help.

Production path for multi-KV-tile: Python KV merge (already verified,
cos 0.999998 for s_k up to 1024). Run kernel per 128-token segment.

Future: restructure PV to accumulate into REGS/SMEM instead of TMEM,
enabling the one-way correction epilogue pattern.

2026-05-26 19:50:11 +00:00

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

D1.5: Revert broken paired-atom O rescale — TMEM round-trip fundamentally broken

2026-05-26 19:50:11 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering