Critical bug: checkpoint weights are stored in N-major (N_packed, K_packed) format, but make_b_k_major expects (E, K_packed, N_packed) input. Without the permute, the K and N dimensions are swapped, producing garbage output with the wrong dimensions (e.g., the q_a output was 3584 instead of 1536). Also fix scale assembly: checkpoint scales are (N, K_sf), so they should go through assemble_raw_scales_2d3d_3d_side (no transpose), not assemble_scales_3d_side (which incorrectly transposes K_sf↔N).
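
The layout mismatch can be illustrated with a toy, shape-only sketch. It does not call make_b_k_major or the scale helpers; the names shape_of, to_e_k_n, and matmul_e_k_n are hypothetical, and nested lists stand in for tensors. The assumption is only that an (N, K) matrix must be transposed and given a leading expert axis before being read as (E, K, N).

```python
# Toy stand-in for the layout fix: shapes only, no real packing or quantization.
# Every function name below is hypothetical and is not the project's API.

def shape_of(x):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

def to_e_k_n(weight_n_k):
    """Turn an (N, K) N-major matrix into (E=1, K, N): transpose, then add an expert axis."""
    k_n = [list(col) for col in zip(*weight_n_k)]
    return [k_n]

def matmul_e_k_n(x, b_e_k_n):
    """Multiply x of shape (M, K) by expert 0 of b, where b is read as (E, K, N)."""
    b = b_e_k_n[0]
    if len(x[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    n_out = len(b[0])
    return [[sum(xi * b[k][n] for k, xi in enumerate(row)) for n in range(n_out)]
            for row in x]

N, K = 2, 3
weight = [[n * 10 + k for k in range(K)] for n in range(N)]  # checkpoint layout (N, K)

fixed = to_e_k_n(weight)   # (1, K, N): what the consumer expects
unpermuted = [weight]      # (1, N, K) misread as (E, K, N): K and N swapped

# With the permute, an (M, K) input yields N output columns.
good = matmul_e_k_n([[1, 1, 1]], fixed)

# Without it, the consumer accepts an (M, N) input and yields K columns instead.
bad = matmul_e_k_n([[1, 1]], unpermuted)
```

In this toy case the unpermuted path returns K columns where N were expected, which is the same kind of swap as the 3584 versus 1536 symptom described above.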