The bridge's assemble_scales_3d_side expects (K_sf, N) input and
transposes to (N, K_sf) internally before swizzling. The checkpoint
stores scales as (N, K_sf). Without this transpose, the kernel was
reading completely wrong scale data — cosine dropped to 0.713.
Also fixed dual global scale normalization: after transpose, gate/up
are along dim 1 (columns), not dim 0 (rows).
finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2
directly. Block scales (float8_e4m3fn) and global scales (float32)
pass through unchanged. Zero precision loss on the weights themselves.
L1 dual global scale handling: gate and up have different global scales.
Normalize to max(gate_gs, up_gs) and fold the ratio into block scales
via float32 (one multiply + float8 round-trip on the RATIO only —
much better than dequantizing the entire weight matrix).
layertest.py: updated to test direct path. Expect cosine improvement
from 0.989 → 0.995+ (matching the L1-only result).
The layertest dequantizes checkpoint NVFP4→BF16 then re-quantizes
BF16→NVFP4. This double quantization costs ~1% cosine. The kernel
itself is correct — the 0.989 cosine is expected quantization noise.
cutedsl/moe_pipeline.py: complete pipeline
- stage_activation: BF16 → NVFP4 (keeps data in FP4)
- L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up)
- SiLU(gate) * up: BF16 (only nonlinear, can't avoid)
- Re-quantize: BF16 → NVFP4 (back to native)
- L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
- Scatter with routing weights → BF16 output
layertest.py: now tests the FULL MoE pipeline against BF16 reference.
NVFP4-native: both GEMMs use float4_e2m1fn_x2 for A and B,
float8_e4m3fn for block scales, float32 for global scales.
BF16 only for SiLU activation and final scatter.
Tokens must be laid out as [expert0_tokens | expert1_tokens | ...]
for the 2Dx3D grouped GEMM. Each expert gets its own contiguous
block of tokens. Scale factors split by expert offsets.
Copied from CUTLASS examples (no more runtime dependency on
/root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.*
instead of blackwell.kernel.*.
Structure:
cutedsl/__init__.py
cutedsl/kernel/__init__.py
cutedsl/kernel/moe/ (the MoE scaled grouped GEMM)
cutedsl/kernel/blockscaled_gemm/ (dense blockscaled GEMM)
test_cutedsl.py updated to import from our local copy.
Tests the NVIDIA reference kernel with our quantization pipeline:
1. Quantize BF16 → NVFP4 (our stage_activation logic)
2. Pad and swizzle scale factors (to_blocked)
3. Run ScaledGroupedGemmKernel (2Dx3D scenario)
4. Compare against BF16 matmul reference
Also adds cutedsl/moe.py module for the future pipeline integration.
tests/layertest.py:
- Loads layer 0 expert weights from both original (MXFP4) and NVFP4 checkpoints
- Dequantizes both to BF16 for reference comparison
- Runs MoE forward pass in pure BF16 (no kernel)
- Runs same forward pass through our NVFP4 CUTLASS kernel
- Compares cosine similarity: kernel vs BF16 reference
tests/run_test.sh:
- Creates venv, installs deps, builds kernel from source, runs test
Isolates our kernel completely from vLLM's weight loading, tensor
parallelism, and MoE routing. If cosine ≈ 1.0, bug is in vLLM. If
cosine ≈ 0, bug is in our kernel pipeline.