nvfp4-megamoe-kernel

Files

biondizzle 468614a4e2 fmha_multirow: non-interleaved design — softmax first, then PV

KEY FIX: the QK output (S) and the PV output (O) share the same TMEM region.
Softmax reads cannot be interleaved with PV writes, because the PV GEMM overwrites S.

New flow:
1. QK GEMM → S in TMEM
2. Softmax: read ALL S from TMEM, compute P in registers
   - Pass 1: row_max (4 warps, 32x32b.x8)
   - Pass 2: exp, sum, store P in p_vals[SK_TILE] registers
3. PV GEMM: write P to sPk per K-tile, accumulate O in TMEM
4. Epilogue: read O from TMEM, normalize, write GMEM
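
The four steps above can be checked against a host-side reference. The following is a minimal sketch for a single query row, not the kernel itself: the function name, the row-major layout, and the absence of a softmax scale factor are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for one query row.
// q: [d], k: [n][d] row-major, v: [n][d] row-major; returns o: [d].
// No 1/sqrt(d) scaling is applied, since the commit message does not mention one.
std::vector<float> attention_row_reference(const std::vector<float>& q,
                                           const std::vector<float>& k,
                                           const std::vector<float>& v,
                                           std::size_t n, std::size_t d) {
    assert(n > 0 && q.size() == d && k.size() == n * d && v.size() == n * d);

    // Step 1: S = q * K^T (the kernel keeps S in TMEM).
    std::vector<float> s(n, 0.0f);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < d; ++i)
            s[j] += q[i] * k[j * d + i];

    // Step 2, pass 1: row max over ALL of S.
    float row_max = s[0];
    for (std::size_t j = 1; j < n; ++j)
        row_max = std::max(row_max, s[j]);

    // Step 2, pass 2: unnormalized P = exp(S - row_max) and its running sum.
    std::vector<float> p(n);
    float row_sum = 0.0f;
    for (std::size_t j = 0; j < n; ++j) {
        p[j] = std::exp(s[j] - row_max);
        row_sum += p[j];
    }

    // Step 3: O = P * V. S is fully consumed by now, so its storage can be reused.
    std::vector<float> o(d, 0.0f);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < d; ++i)
            o[i] += p[j] * v[j * d + i];

    // Step 4: epilogue, normalize by the row sum.
    for (std::size_t i = 0; i < d; ++i)
        o[i] /= row_sum;
    return o;
}
```

Because P is fully materialized before step 3 starts, nothing reads S after the PV accumulation begins, which is the property the non-interleaved design relies on.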

P in registers: each lane holds float p_vals[128] = 128 × 4 B = 512 bytes.
Register budget: 128 lanes (4 warps × 32 lanes) × 512 B = 64 KB (within the B200's 256 KB per-SM register file).
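
The budget arithmetic can be pinned down with compile-time checks. This is a sketch only: the constant names are hypothetical, SK_TILE is assumed to be 128 to match p_vals[128], and the 256 KB figure is taken from the message above.

```cpp
#include <cstddef>

// Hypothetical constants mirroring the numbers in the commit message.
constexpr std::size_t kSkTile = 128;                    // p_vals entries per lane (assumed SK_TILE)
constexpr std::size_t kLanes = 4 * 32;                  // 4 warps x 32 lanes
constexpr std::size_t kBytesPerLane = kSkTile * sizeof(float);
constexpr std::size_t kTotalBytes = kLanes * kBytesPerLane;
constexpr std::size_t kRegisterFileBytes = 256 * 1024;  // figure quoted in the message

static_assert(kBytesPerLane == 512, "each lane holds 512 bytes of P");
static_assert(kTotalBytes == 64 * 1024, "128 lanes x 512 B = 64 KB");
static_assert(kTotalBytes <= kRegisterFileBytes, "P fits in the register file");
```

Note that p_vals occupies 128 of the at most 255 registers a CUDA thread can use, and the compiler can only keep the array in registers if every index is known at compile time (for example, through fully unrolled loops); otherwise it may spill to local memory.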

2026-05-28 23:17:43 +00:00

cache

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation

2026-05-22 00:25:47 +00:00

kernels

fmha_multirow: non-interleaved design — softmax first, then PV

2026-05-28 23:17:43 +00:00

layers

NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring

2026-05-25 16:19:07 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering