stage_activation now returns (x_fp4, x_sf, input_global_scale).
The global scale is applied as the CUTLASS GEMM alpha parameter
in the epilogue: D = alpha * A @ B, avoiding the fp32→UE4M3
round-trip that folding would introduce.
Changes:
- stage_activation: returns global scale as 3rd value
- cutlass_nvfp4_gemm C++ binding: alpha param (was hardcoded 1.0)
- cutlass_grouped_nvfp4_gemm: passes alpha to per-expert GEMM
- nvfp4_mega_moe_l1/l2: accept alpha, pass to grouped GEMM
- nvfp4_moe_full: reads symm_buffer.input_global_scale for L1,
uses stage_activation's returned global scale for L2
- SymmBuffer: added input_global_scale field
- vllm patch: stores global scale from stage_activation
Without a global scale, block scales (block_max / 6.0) could exceed
UE4M3 max (448.0) for large activations, causing saturation and garbage
MoE outputs. The degeneration pattern (positions 1-5 OK, then constant
spaces) is consistent with UE4M3 overflow: first few tokens have small
activations that fit, but once SiLU(mul(gate, up)) produces larger
values, block scales overflow and the GEMM produces zeros/garbage.
Fix: compute input_global_scale = amax / (6.0 * 448.0), normalize
before block quantization, then fold global scale back into block
scales (same as weight_transform.py folds weight_scale_2). This
ensures block scales are always ≤ 448.0 in UE4M3 range.
lm_head lives on DeepseekV4ForCausalLM, not DeepseekV4Model. The inner
load_weights silently drops it (not in params_dict). Extract it in the
outer loader, load it directly, then forward the rest to the inner model.
Two fixes:
1. attn_hc.base → hc_attn_base (underscore not dot before base/fn/scale)
Same for fn, scale, and ffn_hc variants.
2. compressor.position_bias → compressor.ape was never firing because
the .self_attn.compressor. rule matched first (break). Added combined
.self_attn.compressor.position_bias → .attn.mla_attn.compressor.ape.
Three more dropped checkpoint→model mappings:
1. hc_head: checkpoint has hc_head.hc_base/fn/scale, model has
hc_head_base/fn/scale (underscore not dot separator)
2. attn_hc/ffn_hc: checkpoint has .attn_hc. and .ffn_hc., model has
.hc_attn. and .hc_ffn. (word order reversed)
3. compressor.position_bias → compressor.ape: checkpoint name is
position_bias, model attr is ape (absolute position encoding)
All 461 remaining zero params should now be just indexer.k_norm.bias
(legit zero - no bias in checkpoint, only weight).
indexer.weights_proj is uint8 [64,3584] in checkpoint but bf16 [64,7168]
in model. The uint8→bf16 unpack logic only ran in the stacked_params
loop, so non-stacked NVFP4 params hit a size mismatch assertion.
Three categories of missed renames in CKPT_KEY_SUBST:
1. Shared experts: .shared_experts.gate_proj.→.ffn.shared_experts.w1. fired
but break prevented .mlp.→.ffn. from also applying, producing
mlp.ffn.shared_experts.w1. (double prefix). Fixed by including .mlp.
in the pattern. Added missing .shared_experts.down_proj. rule.
2. Indexer (layers 2+): .self_attn.compressor.indexer.* was caught by the
generic .self_attn.compressor.→.attn.mla_attn.compressor. rule, producing
wrong path attn.mla_attn.compressor.indexer.* instead of attn.indexer.*.
Added indexer-specific patterns (q_b_proj→wq_b, kv_norm→k_norm,
position_bias→compressor.ape, gate_proj→compressor.wgate,
kv_proj→compressor.wkv) before the generic compressor rule.
3. Compressor kv_proj/gate_proj: old .compressor.kv_proj.→.compressor.wkv.
pattern could never fire because .self_attn.compressor. matched first
(break). Merged into combined patterns that handle both the
self_attn.compressor→attn.mla_attn.compressor path AND the projection
rename in one step.
Checkpoint keys are model.layers.N.shared_experts.gate_proj.weight
but model params are layers.N.ffn.shared_experts.gate_up_proj.weight.
The .ffn. was missing from the rename, so stacked gate_up_proj
never matched params_dict.
The stacking logic skipped any key containing '.experts.' to avoid
MoE routed expert weights. But 'shared_experts' also matches that
substring, so gate_proj and up_proj were never stacked into
gate_up_proj. Changed to '.ffn.experts.' which only matches the
routed experts path.
Also includes POST-LOAD all-zero param scan.
vLLM's symm_buffer stores topk_ids as GLOBAL expert IDs (0..383).
Our weight tensors are indexed by LOCAL IDs (0..47 per rank).
Each rank r handles experts [r*48, r*48+47]. Without conversion,
topk_ids like 137, 222, 378 would index way out of bounds in the
weight tensor (shape (48, N, K)), producing garbage.
Derive experts_start_idx from the topk_ids and subtract to get
local IDs. This was why all ranks except rank 0 produced zero
expert matches → zero output → garbage text.
DeepSeek-V4-Pro has 384 routed experts, 48 per rank (384/8).
The cross-rank all-reduce happens in the parent DeepseekV4MoE.forward,
not in our kernel. Our kernel writes local output; caller does reduce.
Fixed README, nvfp4_mega_moe.py comments.
weight_transform.py returns float8_e4m3fn scales, NOT packed uint32.
The _pack_ue4m3_to_uint32 function was never called. Removed it.
Updated README data formats to accurately reflect the pipeline:
- Weight scales: float8_e4m3fn (direct to CUTLASS, no unpack)
- Activation scales: uint32 packed (from staging kernel, unpacked to float8)
Added detailed SF remap section with the empirical coordinate dump table
showing flat_rank=8 decomposition. Documented all 5 bugs found/fixed,
the diagnostic trail (constant-scale test, single-element probes), and
the 6 verification probes confirming the extraction formula.
m = f0 + f1*32 + f2*128 (CuTe 'first sub varies fastest')
k_sf = f4 + f5*4
f3 is the Step<2> stride (degenerate, always=total), NOT a coordinate.
Previous formula (f3*2+f2)*128 was catastrophically wrong — mapped
everything to m=0 or m=huge.