intermediate_size=3072 is the size of gate OR up, not gate+up.
Split L1 output at intermediate_size, not intermediate_size//2.
gate = l1_out[:, :3072], up = l1_out[:, 3072:]
The bridge's assemble_scales_3d_side expects (K_sf, N) input and
transposes to (N, K_sf) internally before swizzling. The checkpoint
stores scales as (N, K_sf). Without this transpose, the kernel was
reading completely wrong scale data — cosine dropped to 0.713.
Also fixed dual global scale normalization: after transpose, gate/up
are along dim 1 (columns), not dim 0 (rows).
finalize_weights() now view-casts checkpoint uint8 → float4_e2m1fn_x2
directly. Block scales (float8_e4m3fn) and global scales (float32)
pass through unchanged. Zero precision loss on the weights themselves.
L1 dual global scale handling: gate and up have different global scales.
Normalize to max(gate_gs, up_gs) and fold the ratio into block scales
via float32 (one multiply + float8 round-trip on the RATIO only —
much better than dequantizing the entire weight matrix).
layertest.py: updated to test direct path. Expect cosine improvement
from 0.989 → 0.995+ (matching the L1-only result).
README.md: full rewrite explaining how we got here, project structure,
plan, and key lessons learned from the C++ CUTLASS disaster.
Removed:
- DEBUG_LOG.md (old debug timeline, no longer relevant)
- REWRITE_PLAN.md (plan is now in README)
- test_gemm.py (C++ extension test)
Added:
- vllm/nvfp4_cutedsl.py: CuTeDSLMoERunner class for vLLM integration
- Replaces nvfp4_mega_moe_full + SymmBuffer with CuTeDSL kernel
- Handles slot-based routing, L1→SiLU→L2→scatter
- prepare_weights_from_dequantized() for weight prep
Tagged the-last-of-cutlass on the old C++ kernel state.
- Renamed misleading _ue8m0_to_float32 to _block_scale_to_float32
(our checkpoint uses float8_e4m3fn, NOT E8M0)
- Removed dead is_scale_e8m0 property (never referenced)
- Removed dead _block_scale_to_float32 copy in MegaMoEExperts class
- Cleaned up stale E8M0/UE8M0/shift-by-23 comments
- Simplified E8M0 assertion to ValueError (not assert False)
- Updated DeepseekV4FP8Config docstring for NVFP4
Using checkpoint input_scale as the normalization scale saturates
FP4 values (all block scales = 448). The input_scale is a calibration
constant, NOT the amax/(6*448) normalization scale.
Reverted to dynamic amax/(6*448) for activation quantization.
The correct use of checkpoint input_scale is still under investigation.
Preserved: _w13_input_scale and _w2_input_scale in finalize_weights
for future use once we understand the correct alpha contract.
Critical fix: the checkpoint's input_scale was used during weight
calibration but we were computing dynamic scale from data (amax/2688).
This was 13x off from the checkpoint value.
Changes:
- stage_activation() accepts optional input_global_scale parameter
- nvfp4_mega_moe_full() accepts l1_input_scale and l2_input_scale
- vLLM patch preserves w13/w2_input_scale in finalize_weights
- L1 activation uses checkpoint w13_input_scale for quantization
- L2 activation uses checkpoint w2_input_scale for quantization
- alpha = input_scale * weight_scale_2 (correct calibration contract)
The fold block_sf (float8) * global_sf (float32) -> float8 loses ~25% precision.
Product of ~56-448 block_sf * ~4.65e-05 global_sf lands in float8 low-precision
zone where step size is 25%. This makes model output garbage despite finite values.
Fix: keep block scales as original float8, return global scales separately as
float32 per-expert vectors. Apply global scale as per-expert GEMM alpha in
cutlass_grouped_nvfp4_gemm (already iterates per-expert). For L1 with separate
gate/up global scales, use gate_gs as alpha and apply up_correction ratio to
the up half post-GEMM.
weight_transform.py: no more _fold_global_scale, returns (w, sf, global_sf)
nvfp4_mega_moe.py: per-expert alpha = activation_gs * weight_gs
kernel.py: per_expert_alpha parameter in grouped GEMM
deepseek_v4.py: updated type hints and comments