nvfp4-megamoe-kernel

Files

biondizzle 09ff5c5b98 feat: full NVFP4 MoE pipeline (L1→SiLU→L2→scatter)

cutedsl/moe_pipeline.py: complete pipeline
  - stage_activation: BF16 → NVFP4 (keeps data in FP4)
  - L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up)
  - SiLU(gate) * up: BF16 (the only nonlinear step, so BF16 is unavoidable here)
  - Re-quantize: BF16 → NVFP4 (back to the native format)
  - L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
  - Scatter with routing weights → BF16 output
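
The stages above can be sketched as a plain reference computation. This is a minimal NumPy illustration of the data flow only, not the code in cutedsl/moe_pipeline.py: it runs in float32 instead of NVFP4/BF16, and the function name `moe_reference`, the argument layout, and the gate-before-up ordering of the stacked L1 weights are all assumptions made for the example.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def moe_reference(x, w1, w2, topk_ids, topk_weights):
    """Hypothetical float32 reference for the MoE pipeline.

    x:            (T, H)      token activations
    w1:           (E, 2*I, H) per-expert L1 weights, gate rows then up rows (assumed order)
    w2:           (E, H, I)   per-expert L2 (down_proj) weights
    topk_ids:     (T, K)      expert index chosen for each token
    topk_weights: (T, K)      routing weight for each chosen expert
    """
    num_tokens, hidden = x.shape
    inter = w2.shape[2]
    out = np.zeros((num_tokens, hidden), dtype=np.float32)
    for t in range(num_tokens):
        for e, w in zip(topk_ids[t], topk_weights[t]):
            gate_up = w1[e] @ x[t]                  # L1 GEMM (gate+up)
            gate, up = gate_up[:inter], gate_up[inter:]
            act = silu(gate) * up                   # the only nonlinear step
            out[t] += w * (w2[e] @ act)             # L2 GEMM, then weighted scatter
    return out
```

A test like layertest.py could compare the quantized pipeline's output against a reference of this shape, using a tolerance that accounts for FP4 rounding.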

layertest.py: now tests the full MoE pipeline against a BF16 reference.

NVFP4-native: both GEMMs use float4_e2m1fn_x2 for A and B,
float8_e4m3fn for block scales, float32 for global scales.
BF16 is used only for the SiLU activation and the final scatter.
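
To illustrate what a block-scaled FP4 round trip does to values, here is a minimal simulated ("fake") quantizer in NumPy. It is a sketch under stated assumptions, not the kernel's quantizer: the block size of 16 is assumed, the per-block scale is kept in float32 rather than rounded to float8_e4m3fn, the float32 global scale is omitted, and the name `fake_quant_nvfp4` is hypothetical. The only format fact it relies on is the set of magnitudes representable in E2M1.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 value (the sign is a separate bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # assumed number of elements sharing one block scale

def fake_quant_nvfp4(x, block=BLOCK):
    """Quantize to the E2M1 grid per block and dequantize back to float32.

    Simplifications (assumptions): the block scale stays in float32 and no
    global scale is applied. x.size must be a multiple of `block`.
    """
    x = np.asarray(x, dtype=np.float32)
    blocks = x.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Map each block's largest magnitude onto the largest grid value (6.0).
    scale = np.where(amax > 0, amax / 6.0, 1.0).astype(np.float32)
    scaled = blocks / scale
    # Snap every magnitude to the nearest grid point, then restore the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return (q * scale).reshape(x.shape)
```

Applying this to activations and weights before a float32 matmul gives a rough idea of the rounding error that a comparison against a BF16 reference has to tolerate.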

2026-05-16 03:22:43 +00:00

layertest.py

feat: full NVFP4 MoE pipeline (L1→SiLU→L2→scatter)

2026-05-16 03:22:43 +00:00

requirements.txt

test: add standalone layer 0 comparison test (no vLLM, no Docker)

2026-05-16 02:13:18 +00:00

run_test.sh

fix: use setup.py install for CUTLASS extension build

2026-05-16 02:21:17 +00:00

test_b_layout.py

cleanup: move useful tests to tests/, nuke stale debug tests