nvfp4-megamoe-kernel/tests
biondizzle a0ff8a3278 fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge
The bridge's assemble_scales_3d_side expects (K_sf, N) input and
transposes to (N, K_sf) internally before swizzling. The checkpoint
stores scales as (N, K_sf). Without this transpose, the kernel was
reading completely wrong scale data, and cosine similarity dropped to 0.713.
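
The layout mismatch described above can be illustrated with a minimal sketch. It uses plain nested lists rather than tensors, and the sizes `N` and `K_SF`, the helper `transpose_2d`, and the variable names are hypothetical, chosen only for illustration. It shows the (N, K_sf) to (K_sf, N) transpose applied before the data reaches the bridge, not the bridge code itself.

```python
def transpose_2d(rows):
    """Return the transpose of a rectangular list-of-lists matrix."""
    return [list(col) for col in zip(*rows)]


# Hypothetical sizes, for illustration only.
N, K_SF = 4, 2

# Checkpoint layout (N, K_sf): entry [n][k] is encoded as 10*n + k
# so that each value reveals its original position.
checkpoint_scales = [[10 * n + k for k in range(K_SF)] for n in range(N)]

# Layout the bridge expects as input: (K_sf, N).
bridge_input = transpose_2d(checkpoint_scales)

# Shape is now (K_sf, N), and element [k][n] came from [n][k].
assert len(bridge_input) == K_SF
assert len(bridge_input[0]) == N
assert all(
    bridge_input[k][n] == checkpoint_scales[n][k]
    for n in range(N)
    for k in range(K_SF)
)
```

With a tensor library the same step would typically be a single transpose of the two dimensions.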

Also fixed dual global scale normalization: after the transpose, the
gate/up split lies along dim 1 (columns), not dim 0 (rows).
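
A rough sketch of what normalizing along dim 1 means for a (K_sf, N) matrix follows. It rests on assumptions the commit message does not state: that gate occupies the first N/2 columns and up the remaining ones, and that normalization multiplies each half by its own factor. The function `normalize_gate_up` and its parameters are hypothetical.

```python
def normalize_gate_up(scales_kn, gate_factor, up_factor):
    """Rescale the gate and up halves of a (K_sf, N) matrix along dim 1.

    Assumption: columns [0, N/2) belong to gate, columns [N/2, N) to up.
    """
    n_cols = len(scales_kn[0])
    half = n_cols // 2
    return [
        [
            value * (gate_factor if col < half else up_factor)
            for col, value in enumerate(row)
        ]
        for row in scales_kn
    ]


# Hypothetical (K_sf=2, N=4) block scales after the transpose.
scales_kn = [
    [1.0, 1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
]

normalized = normalize_gate_up(scales_kn, gate_factor=0.5, up_factor=2.0)

# Every row is split column-wise: left half scaled by the gate factor,
# right half by the up factor.
assert normalized == [
    [0.5, 0.5, 2.0, 2.0],
    [1.0, 1.0, 4.0, 4.0],
]
```

Splitting along dim 0 instead would scale whole rows, mixing gate and up entries, which is the kind of error the fix addresses.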
2026-05-16 03:43:30 +00:00