nvfp4-megamoe-kernel

Files

biondizzle f66d4b69a4 GPU-only scale assembly + cudagraph test harness

- assemble_activation_scales_gpu: builds padded+swizzled scale tensor
  without .item() or .tolist() CPU syncs. Uses GPU index arange + cat
  + single scatter instead of per-expert Python slicing.
- Still has a for e in range(num_experts) loop, but num_experts is a
  compile-time constant, so torch.compile unrolls it.
- Added tests/cudagraph_test.py: attempts CUDA graph capture on the
  MoE runner, diagnoses sync violations with patched torch functions.
- Removed the if total_slots == 0 early return: it is Python control
  flow on GPU data, so evaluating it forces a CPU sync.
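
The sync-free padding idea in the bullets above can be sketched as follows. This is a minimal illustration, not the code in this commit: the function name pad_scales_no_sync, the align parameter, and the assumption that scale rows arrive already grouped by expert are all hypothetical. It also swaps the commit's per-expert arange + cat for torch.searchsorted, and it leaves out the swizzle step entirely. The output is sized from a static upper bound computed from tensor shapes only, so no per-expert count is ever read back on the host.

```python
import torch


def pad_scales_no_sync(scales, counts, align=128):
    """Pad each expert's rows to a multiple of `align` (hypothetical helper).

    scales: (num_rows, k_sf) tensor, rows grouped by expert (assumption).
    counts: (num_experts,) integer tensor with the rows per expert.
    """
    num_rows, k_sf = scales.shape      # shapes are Python ints, no sync
    num_experts = counts.shape[0]

    # Static upper bound on padded rows: each expert adds < align padding.
    upper = num_rows + num_experts * (align - 1)
    max_rows = (upper + align - 1) // align * align

    counts = counts.to(torch.int64)
    padded = (counts + align - 1) // align * align
    src_end = torch.cumsum(counts, dim=0)
    src_start = src_end - counts
    dst_start = torch.cumsum(padded, dim=0) - padded

    # Map every source row to its expert, then to its padded destination row.
    row = torch.arange(num_rows, device=scales.device)
    expert = torch.searchsorted(src_end, row, right=True)
    dst_row = dst_start[expert] + (row - src_start[expert])

    out = scales.new_zeros((max_rows, k_sf))
    out.index_copy_(0, dst_row, scales)   # one scatter, unique indices
    return out


# Example: 3 experts with 2, 0 and 3 rows, padded to multiples of 4.
scales = torch.arange(10, dtype=torch.float32).reshape(5, 2)
counts = torch.tensor([2, 0, 3])
out = pad_scales_no_sync(scales, counts, align=4)
```

In this example expert 0 occupies rows 0-1 of its 4-row slot, expert 1 takes no slot, and expert 2 starts at row 4. The trade-off of the static bound is some unused zero rows at the tail of the output.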

2026-05-16 18:05:13 +00:00
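
One way to diagnose sync violations with patched torch functions, as the commit message describes for tests/cudagraph_test.py, is sketched below. The actual harness may work differently: the context manager record_sync_calls and the choice to patch only Tensor.item and Tensor.tolist are assumptions made for illustration. The recorder logs calls to those methods; it does not itself prove that a sync happened, and on CPU tensors the calls are harmless.

```python
import contextlib

import torch


@contextlib.contextmanager
def record_sync_calls(names=("item", "tolist")):
    """Temporarily wrap Tensor methods and record each call (hypothetical)."""
    calls = []
    originals = {name: getattr(torch.Tensor, name) for name in names}

    def make_wrapper(name, original):
        def wrapper(self, *args, **kwargs):
            calls.append(name)            # note the offending call
            return original(self, *args, **kwargs)
        return wrapper

    for name, original in originals.items():
        setattr(torch.Tensor, name, make_wrapper(name, original))
    try:
        yield calls
    finally:
        # Always restore the unpatched methods.
        for name, original in originals.items():
            setattr(torch.Tensor, name, original)


x = torch.arange(3)
with record_sync_calls() as calls:
    x.sum().item()
    x.tolist()
```

Running the code under test inside the with block and then inspecting calls shows which host-read methods were hit, which narrows down why a CUDA graph capture attempt failed.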

cudagraph_test.py

GPU-only scale assembly + cudagraph test harness

2026-05-16 18:05:13 +00:00

layertest.py

fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge

2026-05-16 03:43:30 +00:00

requirements.txt

test: add standalone layer 0 comparison test (no vLLM, no Docker)

2026-05-16 02:13:18 +00:00

run_test.sh

fix: use setup.py install for CUTLASS extension build