nvfp4-megamoe-kernel

Files

biondizzle 28788c6f55 fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate

float4_e2m1fn_x2 packs 2 values per byte along K, not N.
The GEMM output N dimension is the logical N from mat_b.shape[2];
it is not 2x packed. The previous n_dim*2 was wrong and only worked
by accident in the test because intermediate_size*2 == 2*intermediate_size.
A real model with N=9216 exposed the bug.
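
The shape reasoning in this commit message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the kernel's code: the weight layout `(num_experts, K_packed, N)` is assumed (the message only confirms that `mat_b.shape[2]` is N), the helper names `logical_gemm_dims` and `l1_weight_n` are hypothetical, and the sizes are made-up example values.

```python
# Sketch only: the layout (num_experts, K_packed, N) and the helper names
# are assumptions for illustration, not taken from the repository.

VALUES_PER_BYTE = 2  # float4_e2m1fn_x2 stores two FP4 values in each byte


def logical_gemm_dims(packed_shape):
    """Return logical (experts, K, N) for a weight packed along K.

    Only K is halved by the packing, so only K is multiplied back.
    N is read directly from shape[2] and must not be doubled.
    """
    experts, k_packed, n = packed_shape
    return experts, k_packed * VALUES_PER_BYTE, n


def l1_weight_n(intermediate_size):
    """N of the first MoE GEMM: gate and up projections fused along N."""
    return 2 * intermediate_size


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
hidden_size = 4096
intermediate_size = 1024

packed_shape = (8, hidden_size // VALUES_PER_BYTE, l1_weight_n(intermediate_size))
experts, k, n = logical_gemm_dims(packed_shape)

print(f"packed shape: {packed_shape}")
print(f"logical dims: experts={experts}, K={k}, N={n}")
```

With these example values K is recovered as 4096 from the 2048 packed bytes, while N stays at 2048, the fused gate+up width.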

2026-05-16 19:07:08 +00:00

cudagraph_test.py

fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate

2026-05-16 19:07:08 +00:00

layertest.py

fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge

2026-05-16 03:43:30 +00:00

requirements.txt

test: add standalone layer 0 comparison test (no vLLM, no Docker)

2026-05-16 02:13:18 +00:00

run_test.sh

fix: use setup.py install for CUTLASS extension build