KEY FIX: TMEM is shared between QK output (S) and PV output (O).
Cannot interleave softmax reads with PV writes because PV overwrites S.
New flow:
1. QK GEMM → S in TMEM
2. Softmax: read ALL S from TMEM, compute P in registers
- Pass 1: row_max (4 warps, 32x32b.x8)
- Pass 2: exp, sum, store P in p_vals[SK_TILE] registers
3. PV GEMM: write P to sPk per K-tile, accumulate O in TMEM
4. Epilogue: read O from TMEM, normalize, write GMEM
P in registers: each lane holds float p_vals[128] = 512 bytes.
Register budget: 128 lanes × 512B = 64KB (within B200 256KB register file).