nvfp4-megamoe-kernel

biondizzle 4bb0e063cc D1.5: Replace broken TMEM round-trip with correction epilogue (paired atoms)

Replace hand-constructed Ld32x32bOp/St32x32bOp TMEM round-trip with the
proven correction epilogue pattern from fused_swiglu.py:

1. O rescale (kt>0): TMEM→REGS (paired load), multiply by acc_scale,
   REGS→TMEM (paired store via retile_to_S). No layout mismatch.

2. Final O output: One-way TMEM→REGS→SMEM→GMEM using
   epilogue_tmem_copy_and_partition + epilogue_smem_copy_and_partition
   + TMA partition. Register-level normalization (divide by row_sum)
   or raw BF16 cast for D5a path.

This fixes both D1.5 issues:
- Issue 1: TMEM round-trip corruption (hand-constructed atoms)
- Issue 2: O rescale for multi-KV-tile (kt>0)

Supports normalize=True (normalization in-kernel) and normalize=False (raw
BF16 output, normalized externally on the D5a path). Uses epilog_sync_bar +
c_pipe for the SMEM→GMEM store, replacing epilogue_tma_store.
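
For readers unfamiliar with the rescale step, here is a minimal host-side
sketch in plain Python of the arithmetic the two steps describe for a single
query row: rescale the running O accumulator by acc_scale on every KV tile
after the first, then either divide by row_sum or return the raw accumulator.
It is not the kernel code. The function names, the list-based data layout,
and the choice acc_scale = exp(old_max - new_max) (the usual online-softmax
form) are assumptions made for illustration; the commit does not state how
acc_scale is computed.

```python
import math


def attention_row_tiled(scores_tiles, value_tiles, normalize=True):
    """Hypothetical reference: one query row processed over several KV tiles.

    scores_tiles: list of tiles, each a list of logits for that tile's keys.
    value_tiles:  list of tiles, each a list of value vectors (lists of floats).
    Returns (output_vector, row_sum).
    """
    head_dim = len(value_tiles[0][0])
    row_max = -math.inf
    row_sum = 0.0
    acc = [0.0] * head_dim  # stands in for the O accumulator held in TMEM

    for kt, (scores, values) in enumerate(zip(scores_tiles, value_tiles)):
        new_max = max(row_max, max(scores))
        if kt > 0:
            # O rescale (kt>0): assumed online-softmax correction factor.
            acc_scale = math.exp(row_max - new_max)
            acc = [a * acc_scale for a in acc]
            row_sum *= acc_scale
        probs = [math.exp(s - new_max) for s in scores]
        row_sum += sum(probs)
        for p, v in zip(probs, values):
            for d in range(head_dim):
                acc[d] += p * v[d]
        row_max = new_max

    if normalize:
        # In-kernel path: divide by row_sum before writing out.
        return [a / row_sum for a in acc], row_sum
    # External path: hand back the raw accumulator; the caller normalizes.
    return acc, row_sum


def attention_row_reference(scores, values):
    """Untiled softmax-weighted sum, used only to check the tiled version."""
    m = max(scores)
    probs = [math.exp(s - m) for s in scores]
    total = sum(probs)
    head_dim = len(values[0])
    return [sum(p * v[d] for p, v in zip(probs, values)) / total
            for d in range(head_dim)]
```

With normalize=False the caller is expected to divide the returned vector by
the returned row_sum; doing so should reproduce the normalize=True result up
to floating-point rounding.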

2026-05-26 19:11:19 +00:00

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

D1.5: Replace broken TMEM round-trip with correction epilogue (paired atoms)

2026-05-26 19:11:19 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering