nvfp4-megamoe-kernel

Files

biondizzle 230c350c77 FMHA SM100: Raw CUDA C++ decode kernel — initial skeleton

6-warp specialization using CUTLASS C++ atoms directly:
- tcgen05.mma for QK (SMEM→SMEM→TMEM) and PV (TMEM→SMEM→TMEM)
- TMEM accumulator with one-way correction epilogue (TMEM→regs→SMEM→GMEM)
- In-kernel O rescale via registers (fixes D1.5 TMEM round-trip!)
- D3/D4/D5c masks, NVFP4 quantize helpers, FP8 E4M3 encode
- PyTorch binding with head_dim template dispatch

This bypasses all CuTeDSL limitations: float→int, TMEM round-trip,
multi-CTA, hd=512 MLIR compilation hang.

2026-05-28 05:04:44 +00:00

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

FMHA SM100: Raw CUDA C++ decode kernel — initial skeleton

2026-05-28 05:04:44 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering