nvfp4-megamoe-kernel

Files

biondizzle 5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM

The grouped GEMM expects mat_a to be laid out contiguously per group:
[all tokens for group0, all tokens for group1, ...]
A plain reshape of (T, G, D) → (T*G, D) gives a token-major layout that
interleaves the groups, which is wrong. Fix: permute to (G, T, D) before
flattening to (G*T, D).
Same fix for the output: permute (G, T, R) → (T, G, R).
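
The layout difference described in this commit message can be illustrated with a small pure-Python sketch. It is not the repository's code: the toy sizes, the string tags, and the helper names (flatten_interleaved, flatten_groups_first, unflatten_groups_first) are hypothetical and exist only to make the row order visible.

```python
# Hypothetical toy sizes: T tokens, G groups, D features per row.
T, G, D = 3, 2, 1

# x[t][g] is the D-vector for token t in group g, tagged "t{t}g{g}" so the
# row order after flattening is easy to read.
x = [[[f"t{t}g{g}"] * D for g in range(G)] for t in range(T)]


def flatten_interleaved(x):
    """Plain reshape (T, G, D) -> (T*G, D): rows come out token-major."""
    return [row for token in x for row in token]


def flatten_groups_first(x):
    """Permute to (G, T, D), then flatten to (G*T, D).

    Every row of group 0 comes first, then every row of group 1, and so on.
    """
    num_tokens, num_groups = len(x), len(x[0])
    return [x[t][g] for g in range(num_groups) for t in range(num_tokens)]


def unflatten_groups_first(rows, num_tokens, num_groups):
    """Inverse step for the output: (G*T, R) -> (G, T, R) -> (T, G, R)."""
    return [
        [rows[g * num_tokens + t] for g in range(num_groups)]
        for t in range(num_tokens)
    ]


interleaved = flatten_interleaved(x)
groups_first = flatten_groups_first(x)
restored = unflatten_groups_first(groups_first, T, G)

print([r[0] for r in interleaved])   # groups alternate row by row
print([r[0] for r in groups_first])  # each group's rows are contiguous
print(restored == x)                 # the round trip recovers (T, G, D)
```

With PyTorch tensors, the same reordering would presumably be written as x.permute(1, 0, 2).reshape(G * T, D) for the input and out.reshape(G, T, R).permute(1, 0, 2) for the output. Whether the repository uses exactly these calls is an assumption.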

2026-05-19 02:41:32 +00:00

kernel

refactor: copy CuTeDSL kernel into repo with local imports

2026-05-16 02:57:54 +00:00

__init__.py

refactor: copy CuTeDSL kernel into repo with local imports

2026-05-16 02:57:54 +00:00

bridge.py

Fix torch.compile crash: remove threading.Lock from LUT cache path

2026-05-18 20:54:55 +00:00

custom_ops.py

Replace autograd.Function with torch.library.custom_op for Dynamo compat