- SMEM_SFA/SFB_SIZE_PER_STAGE doubled: group=16 needs 8 SFs per token per BLOCK_K=128 (vs 4 for group=32)
- arrive_and_expect_tx updated to use SMEM_SFA/SFB constants
- Removed stale comments about 8/16 TMEM columns
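
The doubling follows from BLOCK_K / group size: 128 / 16 = 8 SFs per token, against 128 / 32 = 4. A minimal sketch of that arithmetic is below. It is not the kernel's actual code: every identifier is hypothetical, and the 1-byte SF size and 128-row tile are assumptions made for illustration. Only BLOCK_K=128 and the group sizes 16 and 32 come from the notes above.

```cpp
#include <cstdint>

// Hypothetical names; only BLOCK_K = 128 and group sizes 16/32 come from the notes.
constexpr uint32_t kBlockK = 128;

// Number of scale factors (SFs) one token needs to cover a BLOCK_K slice.
constexpr uint32_t sfs_per_token(uint32_t group_size) {
    return kBlockK / group_size;
}

// SF bytes per pipeline stage for one operand tile.
// Assumes block_rows tokens per tile and 1 byte per SF unless overridden.
constexpr uint32_t sf_bytes_per_stage(uint32_t block_rows, uint32_t group_size,
                                      uint32_t bytes_per_sf = 1) {
    return block_rows * sfs_per_token(group_size) * bytes_per_sf;
}

// Combined SFA + SFB bytes per stage, i.e. the SF share of the byte count
// a transaction barrier would need to expect for that stage.
constexpr uint32_t sf_expect_tx_bytes(uint32_t block_m, uint32_t block_n,
                                      uint32_t group_size) {
    return sf_bytes_per_stage(block_m, group_size) +
           sf_bytes_per_stage(block_n, group_size);
}

// Compile-time checks of the counts quoted in the notes.
static_assert(sfs_per_token(32) == 4, "group=32: 4 SFs per token");
static_assert(sfs_per_token(16) == 8, "group=16: 8 SFs per token");
static_assert(sf_bytes_per_stage(128, 16) == 2 * sf_bytes_per_stage(128, 32),
              "halving the group size doubles the per-stage SF bytes");
```

Deriving the expected-transaction byte count from the same constants that size the shared-memory buffers keeps the two from drifting apart when the group size changes, which is presumably the motivation for the arrive_and_expect_tx change.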