nvfp4-megamoe-kernel/tests/unit
biondizzle 468614a4e2 fmha_multirow: non-interleaved design — softmax first, then PV
KEY FIX: TMEM is shared between the QK output (S) and the PV output (O).
Softmax reads cannot be interleaved with PV writes, because PV overwrites S.

New flow:
1. QK GEMM → S in TMEM
2. Softmax: read ALL S from TMEM, compute P in registers
   - Pass 1: row_max (4 warps, 32x32b.x8)
   - Pass 2: exp, sum, store P in p_vals[SK_TILE] registers
3. PV GEMM: write P to sPk per K-tile, accumulate O in TMEM
4. Epilogue: read O from TMEM, normalize, write GMEM
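
For checking the flow above on the host, a minimal reference sketch for a single query row follows. It is not the kernel code: `attend_row` and its parameters are hypothetical names, plain vectors stand in for TMEM, registers and sPk, and the K-tiling of step 3 is collapsed into one loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for one query row (not the kernel).
// s: SK scores for this row (the QK output), v: SK x d values, row-major.
std::vector<float> attend_row(const std::vector<float>& s,
                              const std::vector<float>& v,
                              std::size_t d) {
    const std::size_t sk = s.size();
    assert(sk > 0 && d > 0 && v.size() == sk * d);

    // Pass 1: row max over ALL of S.
    float row_max = s[0];
    for (std::size_t k = 1; k < sk; ++k) row_max = std::fmax(row_max, s[k]);

    // Pass 2: exp, sum, keep unnormalized P (p_vals in the kernel).
    std::vector<float> p(sk);
    float row_sum = 0.0f;
    for (std::size_t k = 0; k < sk; ++k) {
        p[k] = std::exp(s[k] - row_max);
        row_sum += p[k];
    }

    // PV: S is fully consumed by now, so O may reuse its storage.
    std::vector<float> o(d, 0.0f);
    for (std::size_t k = 0; k < sk; ++k)
        for (std::size_t j = 0; j < d; ++j)
            o[j] += p[k] * v[k * d + j];

    // Epilogue: normalize by the row sum.
    for (std::size_t j = 0; j < d; ++j) o[j] /= row_sum;
    return o;
}
```

Because every read of S finishes before the first write to O, the two buffers never need to be live at the same time, which is the property the shared TMEM allocation relies on.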

P in registers: each thread holds float p_vals[128] = 128 registers = 512 bytes.
Register budget: 128 threads (4 warps × 32 lanes) × 512B = 64KB (within B200 256KB register file).
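
The budget arithmetic can be pinned down with compile-time checks. This is a sketch with hypothetical constant names; the 256KB register file figure is taken from the message above, and the 255-registers-per-thread cap is an assumption about the target architecture that should be verified against the compiler's register report.

```cpp
#include <cstddef>

// Hypothetical names; values taken from the commit message.
constexpr std::size_t kSkTile = 128;        // length of p_vals
constexpr std::size_t kThreads = 4 * 32;    // 4 softmax warps x 32 lanes
constexpr std::size_t kBytesPerThread = kSkTile * sizeof(float);
constexpr std::size_t kTotalBytes = kThreads * kBytesPerThread;
constexpr std::size_t kRegFileBytes = 256 * 1024;  // figure quoted above

// Assumption: per-thread cap of 255 32-bit registers.
constexpr std::size_t kMaxRegsPerThread = 255;

static_assert(sizeof(float) == 4, "one float per 32-bit register");
static_assert(kBytesPerThread == 512, "p_vals is 512 bytes per thread");
static_assert(kTotalBytes == 64 * 1024, "64KB across the softmax threads");
static_assert(kTotalBytes <= kRegFileBytes, "fits the register file");
static_assert(kSkTile < kMaxRegsPerThread, "leaves registers for other state");
```

The aggregate budget is the looser of the two limits: p_vals alone takes 128 of each thread's registers, so the per-thread headroom is what decides whether the array stays in registers or spills.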
2026-05-28 23:17:43 +00:00