Files
nvfp4-megamoe-kernel/dsv4/kernels
biondizzle 468614a4e2 fmha_multirow: non-interleaved design — softmax first, then PV
KEY FIX: TMEM is shared between QK output (S) and PV output (O).
Cannot interleave softmax reads with PV writes because PV overwrites S.

New flow:
1. QK GEMM → S in TMEM
2. Softmax: read ALL S from TMEM, compute P in registers
   - Pass 1: row_max (4 warps, 32x32b.x8)
   - Pass 2: exp, sum, store P in p_vals[SK_TILE] registers
3. PV GEMM: write P to sPk per K-tile, accumulate O in TMEM
4. Epilogue: read O from TMEM, normalize, write GMEM

P in registers: each lane holds float p_vals[128] = 512 bytes.
Register budget: 128 lanes × 512B = 64KB (within B200 256KB register file).
2026-05-28 23:17:43 +00:00
..