float4_e2m1fn_x2 packs two 4-bit values per byte along K, not N, so the GEMM output N dimension is the logical N read directly from mat_b.shape[2], not a 2x-packed size. The previous n_dim*2 was wrong: it accidentally worked in the test because intermediate_size*2 == 2*intermediate_size. A real model with N=9216 exposed the bug.
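
The shape arithmetic can be sketched as below. This is only an illustration, not the patched code: the helper names, the (E, K // 2, N) layout of mat_b, and the example sizes 8 and 2048 are assumptions. Only the packing along K, the use of mat_b.shape[2] for N, and N=9216 come from the description above.

```python
# Hypothetical sketch of the shape logic; names and layout are assumptions.
PACK_FACTOR = 2  # float4_e2m1fn_x2 stores two 4-bit values per byte


def logical_gemm_dims(mat_b_shape):
    """Return (logical_k, logical_n) for a packed FP4 operand.

    Assumes mat_b has shape (E, K // 2, N): only the K axis is packed,
    so shape[2] is already the logical N.
    """
    _, k_packed, n_dim = mat_b_shape
    return k_packed * PACK_FACTOR, n_dim


def buggy_n(mat_b_shape):
    """Reproduce the old behaviour, which doubled N as if N were packed."""
    return mat_b_shape[2] * PACK_FACTOR


shape = (8, 2048, 9216)  # 8 and 2048 are made-up sizes; 9216 is the real-model N
print(logical_gemm_dims(shape))  # K is unpacked, N is taken as-is
print(buggy_n(shape))            # twice the correct output width
```

With N=9216 the old computation asks for an output twice as wide as the operand actually provides, which is the mismatch the real model ran into.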