nvfp4-megamoe-kernel

Files

biondizzle 05cdde1676 Fix wo_a: scatter each group's data at correct offset in padded buffer

The grouped GEMM expects each group's tokens at their own offset range:
- Group 0: rows [0, padded_T)
- Group 1: rows [padded_T, 2*padded_T)
- etc.

Previously, all groups' data was written contiguously starting at row 0,
so every group after group 0 read zeros from the padding area instead of
its own tokens. Each group's quantized activation is now scattered to
its correct offset.

Also:
- Size buffer for total_max_rows = padded_max * n_groups
- Use assemble_scales_2d_side for multi-group scale assembly
- Extract output per-group at correct offsets
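
The offset layout described in this commit can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's code: `round_up`, `scatter_groups`, `extract_groups`, and the alignment multiple of 4 are hypothetical, and plain lists stand in for the quantized activation tensors.

```python
from typing import List

Row = List[float]


def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (hypothetical alignment helper)."""
    return ((n + multiple - 1) // multiple) * multiple


def scatter_groups(groups: List[List[Row]], padded_t: int, width: int) -> List[Row]:
    """Place group g's rows at [g * padded_t, g * padded_t + len(rows)).

    Rows in a group's range that receive no data stay as zero padding.
    """
    total_max_rows = padded_t * len(groups)  # buffer covers every group's range
    buffer = [[0.0] * width for _ in range(total_max_rows)]
    for g, rows in enumerate(groups):
        if len(rows) > padded_t:
            raise ValueError(f"group {g} has {len(rows)} rows > padded_t={padded_t}")
        base = g * padded_t  # start of this group's offset range
        for i, row in enumerate(rows):
            buffer[base + i] = list(row)
    return buffer


def extract_groups(buffer: List[Row], counts: List[int], padded_t: int) -> List[List[Row]]:
    """Read each group's valid rows back from its own offset range."""
    return [buffer[g * padded_t : g * padded_t + n] for g, n in enumerate(counts)]


groups = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # group 0: 3 tokens
    [[4.0, 4.0]],                          # group 1: 1 token
]
# Assumed alignment of 4 rows; the real kernel's tile size may differ.
padded_t = round_up(max(len(g) for g in groups), 4)
buffer = scatter_groups(groups, padded_t, width=2)
recovered = extract_groups(buffer, [len(g) for g in groups], padded_t)
```

With a contiguous write, group 1's single row would land at row 3, inside group 0's range, and rows [padded_t, 2*padded_t) would hold only zeros. Scattering to `g * padded_t` puts it at row 4, where the grouped GEMM expects it.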

2026-05-19 02:45:57 +00:00

kernel

refactor: copy CuTeDSL kernel into repo with local imports

2026-05-16 02:57:54 +00:00

__init__.py

refactor: copy CuTeDSL kernel into repo with local imports

2026-05-16 02:57:54 +00:00

bridge.py

Fix torch.compile crash: remove threading.Lock from LUT cache path

2026-05-18 20:54:55 +00:00

custom_ops.py

Replace autograd.Function with torch.library.custom_op for Dynamo compat