- Single amax per warp (16 N-elements = 1 SF group, no warp-pair reduction)
- Single sf_val instead of sf.x/sf.y split
- All 4 warps write SF (k_idx = n_block_idx*4 + warp_idx_in_wg)
- Remove dead SMEM amax storage, reclaim barrier offset space
- Remove dead __syncwarp after register-local amax
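
For reviewers, the new indexing can be sketched as a host-side C++ reference model. This is an illustration, not the kernel code: the helper names (`sf_k_index`, `group_amax`, `group_sf`) and the `qmax` parameter are hypothetical, and the scale formula (amax divided by the largest representable magnitude) is an assumption about the quantization scheme.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Values taken from the change description.
constexpr int kWarpsPerWarpgroup = 4;  // all 4 warps of a warpgroup write an SF
constexpr int kSfGroupSize = 16;       // 16 N-elements form one SF group

// SF index written by a given warp: k_idx = n_block_idx*4 + warp_idx_in_wg.
constexpr int sf_k_index(int n_block_idx, int warp_idx_in_wg) {
    return n_block_idx * kWarpsPerWarpgroup + warp_idx_in_wg;
}

// Absolute maximum over one SF group. In the kernel this is described as
// register-local per warp, so no warp-pair reduction or SMEM staging is needed.
inline float group_amax(const std::array<float, kSfGroupSize>& vals) {
    float amax = 0.0f;
    for (float v : vals) {
        amax = std::max(amax, std::fabs(v));
    }
    return amax;
}

// Single scale value per group (replaces the sf.x/sf.y pair).
// ASSUMPTION: scale = amax / qmax, where qmax is the largest magnitude the
// quantized format can represent. The real kernel may compute it differently.
inline float group_sf(float amax, float qmax) {
    return amax / qmax;
}
```

With this mapping, warps 0..3 of n-block `b` cover SF indices `4*b` through `4*b + 3`, so consecutive n-blocks fill the SF array without gaps or overlap.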