The TMA load loop (cutlass.range) and the MMA consumer loop (range) also used cute.size(gK, mode=[3]), which returns 1 for all n. Fixed all 3 loops:

1. TMA load loop (cutlass.range, line 215)
2. MMA consumer loop (range, line 231)
3. Softmax loop (range, line 324)

This was causing the deadlock: MMA only produced S[0] while softmax waited for S[1].
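
The failure mode can be reproduced on the host with a toy model. This is only a sketch of the trip-count mismatch, not the kernel code: `run_pipeline`, the queue, and the timeout are hypothetical stand-ins (the real kernel blocks on a pipeline barrier with no timeout), and the producer and consumer trip counts are passed in by hand rather than read from cute.size.

```python
import queue
import threading


def run_pipeline(producer_trips, consumer_trips, timeout=0.2):
    """Toy model: a producer emits S[i] tiles, a consumer waits for them.

    Returns (received_tiles, stalled). All names here are illustrative
    assumptions and do not correspond to anything in the kernel.
    """
    tiles = queue.Queue()

    def producer():
        # Stands in for the MMA side: one S tile per loop iteration.
        for i in range(producer_trips):
            tiles.put(f"S[{i}]")

    worker = threading.Thread(target=producer)
    worker.start()

    received = []
    stalled = False
    # Stands in for the softmax side: waits for one S tile per iteration.
    for _ in range(consumer_trips):
        try:
            received.append(tiles.get(timeout=timeout))
        except queue.Empty:
            # On the GPU there is no timeout, so this wait never returns.
            stalled = True
            break

    worker.join()
    return received, stalled


# Producer bound evaluates to 1, consumer expects 2 tiles: the consumer stalls.
print(run_pipeline(producer_trips=1, consumer_trips=2))
# Both sides use the same bound: every tile is consumed.
print(run_pipeline(producer_trips=2, consumer_trips=2))
```

With mismatched bounds the consumer receives S[0] and then stalls waiting for S[1]; once both sides iterate the same number of times, the stall disappears.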