The MMA loop (cutlass.range) and MMA consumer loop (range) also used
cute.size(gK, mode=[3]) which returns 1 for all n. Fixed all 3 loops:
1. TMA load loop (cutlass.range, line 215)
2. MMA consumer loop (range, line 231)
3. Softmax loop (range, line 324)
This was causing the deadlock — MMA only produced S[0] while softmax
waited for S[1].