- F.linear(x, W) computes x @ W.T, which caused a shape mismatch when W_gate was pre-transposed to [E, H]
- Use torch.matmul(x, W_gate) instead: it computes x @ W directly, with no transpose, no FP32 conversion, and full graph capturability
- W_gate stays as [H, E] (original checkpoint shape)
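
A minimal sketch of the shape conventions involved, for reference. It assumes H is the hidden size and E is the number of gate outputs; the sizes T, H, E and the random tensors are illustrative, not taken from the actual model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumptions): T tokens, hidden size H, E gate outputs.
T, H, E = 4, 8, 3

x = torch.randn(T, H)
W_gate = torch.randn(H, E)  # kept in the [H, E] checkpoint layout

# torch.matmul uses W_gate as stored: [T, H] @ [H, E] -> [T, E]
logits = torch.matmul(x, W_gate)

# F.linear(x, W) computes x @ W.T, so it expects W laid out as [E, H].
# To get the same result from the [H, E] weight, transpose it first.
logits_linear = F.linear(x, W_gate.t())

matches = torch.allclose(logits, logits_linear, atol=1e-5)
```

Both calls produce the same [T, E] logits; the matmul form just avoids having to track which layout the weight is currently in.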