Two fixes: 1. B tensor: permute(0,2,1).contiguous().permute(0,2,1) gives K-major stride (16384,1,128) matching reference 2. scale_b: transpose to (N, K_sf) before swizzling — reference uses (intermediate, hidden//16) not (hidden//16, intermediate)
Two fixes: 1. B tensor: permute(0,2,1).contiguous().permute(0,2,1) gives K-major stride (16384,1,128) matching reference 2. scale_b: transpose to (N, K_sf) before swizzling — reference uses (intermediate, hidden//16) not (hidden//16, intermediate)