nvfp4-megamoe-kernel

Files

biondizzle 389453fbf4 feat: direct NVFP4 path — no BF16 round-trip on weights

finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2
directly. Block scales (float8_e4m3fn) and global scales (float32)
pass through unchanged. Zero precision loss on the weights themselves.
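
To illustrate why a view-cast costs no precision, here is a minimal pure-Python sketch of how one uint8 holds two FP4 (E2M1) values. It is not the repository's finalize_weights() code. The helper names and the low-nibble-first ordering are assumptions made for illustration.

```python
# Illustrative only: decode packed FP4 (E2M1) values from raw uint8 bytes.
# The stored bytes are never modified, which is why a view-cast is lossless.

# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_nibble(nibble):
    """Decode one 4-bit E2M1 code into a Python float."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_fp4_pairs(packed, low_nibble_first=True):
    """Decode every byte into two FP4 values.

    ASSUMPTION: low_nibble_first=True is a guess at the packing order;
    check it against the actual checkpoint layout.
    """
    values = []
    for byte in packed:
        low, high = byte & 0xF, byte >> 4
        first, second = (low, high) if low_nibble_first else (high, low)
        values.append(decode_nibble(first))
        values.append(decode_nibble(second))
    return values


packed = bytes([0x21, 0xF7])
values = unpack_fp4_pairs(packed)
```

Every possible byte maps to exactly two of the sixteen E2M1 codes, so reinterpreting the buffer changes only the dtype label, not the bits.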

L1 dual global scale handling: gate and up have different global scales.
Normalize both to max(gate_gs, up_gs) and fold the resulting ratio into
the block scales in float32. The only added rounding is one multiply plus
a float8 round-trip on the ratio-adjusted scales, which is much better
than dequantizing the entire weight matrix.
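
The folding step can be sketched as follows. This is a hedged illustration, not the commit's implementation. It assumes the convention dequantized = fp4 * block_scale * global_scale, uses Python floats in place of float32, and emulates float8 E4M3 rounding with a hypothetical helper round_to_e4m3.

```python
import math

E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def round_to_e4m3(x):
    """Round a float to the nearest float8 E4M3 value (emulation, clamped)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)        # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)          # exponents below -6 are subnormal
    step = 2.0 ** (e - 3)         # 3 mantissa bits: 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def fold_global_scales(gate_bs, up_bs, gate_gs, up_gs):
    """Give gate and up one shared global scale.

    ASSUMPTION: dequantized = fp4 * block_scale * global_scale, so each
    ratio (gs / shared) is <= 1 and the folded scales cannot overflow.
    """
    shared = max(gate_gs, up_gs)
    gate_new = [round_to_e4m3(s * (gate_gs / shared)) for s in gate_bs]
    up_new = [round_to_e4m3(s * (up_gs / shared)) for s in up_bs]
    return gate_new, up_new, shared


gate_new, up_new, shared = fold_global_scales(
    gate_bs=[2.0, 448.0], up_bs=[1.0, 3.0], gate_gs=0.5, up_gs=1.0
)
```

The projection that already holds the larger global scale has ratio 1, so its block scales pass through bit-for-bit. Only the other projection's scales are re-rounded.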

layertest.py: updated to test the direct path. Expect cosine similarity
to improve from 0.989 → 0.995+ (matching the L1-only result).
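
For reference, the cosine figures above correspond to a metric of the following form. This is a generic sketch over flat lists of floats. How layertest.py actually flattens and compares the layer outputs is not shown here, so the function name and input shape are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


reference = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]  # same direction, different magnitude
score = cosine_similarity(reference, scaled)
```

Cosine similarity ignores overall magnitude, so a uniform scaling error in the output would not lower the score.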

2026-05-16 03:41:23 +00:00

layertest.py

feat: direct NVFP4 path — no BF16 round-trip on weights

2026-05-16 03:41:23 +00:00

requirements.txt

test: add standalone layer 0 comparison test (no vLLM, no Docker)

2026-05-16 02:13:18 +00:00

run_test.sh

fix: use setup.py install for CUTLASS extension build

2026-05-16 02:21:17 +00:00

test_b_layout.py

cleanup: move useful tests to tests/, nuke stale debug tests