nvfp4-megamoe-kernel/cutedsl
biondizzle 5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM
The grouped GEMM expects mat_a to be laid out contiguously per group:
[all tokens for group0, all tokens for group1, ...]
A plain reshape of (T, G, D) → (T*G, D) yields a token-major, interleaved
layout [t0g0, t0g1, t1g0, ...], which is wrong. Fix: permute to (G, T, D)
before flattening. The output needs the inverse: permute (G, T, R) → (T, G, R).
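
The layout difference can be checked on a toy tensor. The following is a minimal sketch, not the kernel's actual code: it uses NumPy as a stand-in (the same pattern in PyTorch would be `permute` followed by a contiguous reshape), and the sizes and the names `x`, `naive`, `grouped`, and `restored` are illustrative assumptions. T is the token count, G the group count, D the feature dimension, and R the output dimension, as in the commit message.

```python
import numpy as np

T, G, D = 3, 2, 4  # toy sizes (assumed): tokens, groups, feature dim

# Tag every row with 10*group + token so the flattened row order is readable.
x = np.zeros((T, G, D), dtype=np.int64)
for t in range(T):
    for g in range(G):
        x[t, g, :] = 10 * g + t

# Plain reshape: rows come out token-major, so groups are interleaved.
naive = x.reshape(T * G, D)

# Groups-first: permute to (G, T, D), then flatten to (G*T, D).
grouped = np.ascontiguousarray(x.transpose(1, 0, 2)).reshape(G * T, D)

naive_order = naive[:, 0].tolist()      # [0, 10, 1, 11, 2, 12]
grouped_order = grouped[:, 0].tolist()  # [0, 1, 2, 10, 11, 12]

# Inverse for the output side: (G*T, R) -> (G, T, R) -> (T, G, R).
# Here the "output" is just the grouped input again, so R == D.
R = D
restored = grouped.reshape(G, T, R).transpose(1, 0, 2)

print(naive_order)
print(grouped_order)
print(np.array_equal(restored, x))
```

With the groups-first order, each group's T rows form one contiguous block of the flattened matrix, which is the layout the commit message says the grouped GEMM expects.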
2026-05-19 02:41:32 +00:00