torch.stack(packed) held all expert tensors plus the final stacked copy
in memory at once (~3.5 GiB). Now pre-allocate the output and fill it in
place, so only one temporary expert tensor and the final tensor are in
memory at any time.
Added a comment block explaining the cache sizing rationale and the
CUDA graph trap: the default of 2 works for sequential layer execution
but will cause a use-after-free if CUDA graphs capture multiple layers.
In that case, set MEGA_MOE_PREPACK_CACHE_MAX to cover all captured layers.
Cache was growing unbounded — 61 MoE layers × 2 = 122 prepacked SFB
tensors permanently in GPU memory (~1.75 GiB each). With sequential
layer execution, only 2 entries are needed at a time (current L1 + L2).
Added LRU eviction to keep max 2 entries.
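A minimal sketch of the eviction policy described above, assuming an OrderedDict-backed cache. The class name PrepackCache, its methods, and the string placeholders standing in for tensors are hypothetical; only the limit of 2 and the MEGA_MOE_PREPACK_CACHE_MAX variable name come from these notes.

```python
import os
from collections import OrderedDict


class PrepackCache:
    """LRU cache sketch: keeps at most `max_entries` prepacked tensors."""

    def __init__(self, max_entries=None):
        if max_entries is None:
            # Assumption: the limit is read from this environment variable,
            # falling back to 2 (current L1 + L2) when it is unset.
            max_entries = int(os.environ.get("MEGA_MOE_PREPACK_CACHE_MAX", "2"))
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop least recently used

    def __len__(self):
        return len(self._entries)


# Walk 61 layers sequentially; each layer touches an L1 and an L2 entry.
cache = PrepackCache(max_entries=2)
for layer in range(61):
    cache.put(("l1", layer), f"sfb-l1-{layer}")
    cache.put(("l2", layer), f"sfb-l2-{layer}")
```

After the loop only the two entries for the last layer remain. Dropping an entry here only removes the cache's reference, so anything that still needs the tensor has to hold its own.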
The non-prepacked path unpacks uint32-packed scales, but the prepack path
sends weight_sf directly to _C.prepack_sfb(). If the scales aren't float8,
this would silently produce a wrong layout or wrong values.
- Split provided_slot_token vs slot_token_out (returned to caller)
- No gather when slot_token=None (L2 path), no unnecessary alloc
- .contiguous() on gathered tensors for CUTLASS alignment
- Return slot_token_out consistently
Old cache used only tag ('l1'/'l2'), so layer 1 would reuse layer 0's
packed scales if the function object persisted. Now keyed by
(tag, data_ptr, shape, dtype, device, N, K) — safe across layers.
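To illustrate why the tag-only key collided across layers, here is a small sketch of the composite key. FakeTensor is a hypothetical stand-in for a real tensor, and prepack_cache_key is an illustrative name, not the function used in the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in for a tensor; only the fields used in the key."""
    ptr: int
    shape: tuple
    dtype: str
    device: str

    def data_ptr(self):
        return self.ptr


def prepack_cache_key(tag, weight_sf, n, k):
    # Same fields as listed above: (tag, data_ptr, shape, dtype, device, N, K).
    return (tag, weight_sf.data_ptr(), weight_sf.shape,
            weight_sf.dtype, weight_sf.device, n, k)


# Two layers with identical shapes but different storage addresses.
layer0_sf = FakeTensor(ptr=0x1000, shape=(8, 64, 16), dtype="float8_e4m3fn", device="cuda:0")
layer1_sf = FakeTensor(ptr=0x9000, shape=(8, 64, 16), dtype="float8_e4m3fn", device="cuda:0")

key0 = prepack_cache_key("l1", layer0_sf, 64, 256)
key1 = prepack_cache_key("l1", layer1_sf, 64, 256)
```

Both keys share the tag 'l1' but differ in the address field, so layer 1 no longer hits layer 0's entry. As a general property of address-based keys, they stay unambiguous only while the original tensor is alive.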
The shape-based check (x_fp4.shape[0] != num_slots) silently skips the
gather when num_tokens == num_slots in L1 (topk=1), even if the slots are
in a different order than the tokens. Now checks whether slot_token is the
identity mapping and gathers only when slot ordering differs from token
ordering.
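The difference between the two checks can be sketched in plain Python, with lists standing in for tensors. The function names are hypothetical, and the real check presumably runs on device tensors rather than Python lists.

```python
def needs_gather_by_shape(num_rows, num_slots):
    # Old heuristic: gather only if the row count differs from the slot count.
    return num_rows != num_slots


def needs_gather_by_identity(slot_token):
    # New check: gather unless slot i already reads token i for every slot.
    if slot_token is None:
        return False
    return any(tok != slot for slot, tok in enumerate(slot_token))


def gather_rows(rows, slot_token):
    return [rows[tok] for tok in slot_token]


# topk=1 with 3 tokens gives 3 slots, but the slots are not in token order.
x_rows = ["tok0", "tok1", "tok2"]
slot_token = [2, 0, 1]

old_decision = needs_gather_by_shape(len(x_rows), len(slot_token))
new_decision = needs_gather_by_identity(slot_token)
gathered = gather_rows(x_rows, slot_token) if new_decision else x_rows
```

Here the shape check decides no gather is needed because both counts are 3, while the identity check sees the permutation and reorders the rows.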
Both L1 and L2 now pass pre-built 1D slot_expert_ids and slot_token to
cutlass_grouped_nvfp4_gemm instead of the 2D topk_ids.
The 2D path was broken for expert parallelism — local_mask matched ALL
local experts, producing mismatched slot_token/slot_k lengths that caused
vectorized_gather_kernel index out of bounds.
cutlass_grouped_nvfp4_gemm now:
- Takes 1D slot_expert_ids + optional slot_token
- Gathers x_fp4 by slot_token when needed (L1: tokens→slots)
- Skips gather when x_fp4 already has num_slots rows (L2)
The L2 function was rebuilding slot_expert_ids by scanning topk_ids with a
local_mask. This produced mismatched slot_k (all-expert mask) vs slot_token
(rank-local mask), causing vectorized_gather_kernel index out of bounds.
Now slot_expert_local is passed directly from the outer routing logic, matching
the same slot ordering as L1.
stage_activation now returns (x_fp4, x_sf, input_global_scale).
The global scale is applied as the CUTLASS GEMM alpha parameter
in the epilogue: D = alpha * A @ B, avoiding the fp32→UE4M3
round-trip that folding would introduce.
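The algebra behind the alpha approach can be checked with a toy example. This sketch ignores quantization entirely and uses plain Python lists; the matrices and the global scale value are hypothetical.

```python
def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def gemm_with_alpha(a, b, alpha):
    # Epilogue scaling: D = alpha * (A @ B).
    return [[alpha * v for v in row] for row in matmul(a, b)]


global_scale = 4.0                   # hypothetical; a power of two keeps floats exact
x = [[8.0, -16.0], [4.0, 12.0]]      # original activations
w = [[0.5, 1.0], [2.0, -1.0]]        # weights
x_normalized = [[v / global_scale for v in row] for row in x]

reference = matmul(x, w)
rescaled = gemm_with_alpha(x_normalized, w, alpha=global_scale)
```

Because the global scale is a single scalar, applying it once to the GEMM output gives the same result as scaling the inputs, without storing it in the block scales.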
Changes:
- stage_activation: returns global scale as 3rd value
- cutlass_nvfp4_gemm C++ binding: alpha param (was hardcoded 1.0)
- cutlass_grouped_nvfp4_gemm: passes alpha to per-expert GEMM
- nvfp4_mega_moe_l1/l2: accept alpha, pass to grouped GEMM
- nvfp4_moe_full: reads symm_buffer.input_global_scale for L1,
  uses stage_activation's returned global scale for L2
- SymmBuffer: added input_global_scale field
- vllm patch: stores global scale from stage_activation
Without a global scale, block scales (block_max / 6.0) could exceed
UE4M3 max (448.0) for large activations, causing saturation and garbage
MoE outputs. The degeneration pattern (positions 1-5 OK, then constant
spaces) is consistent with UE4M3 overflow: first few tokens have small
activations that fit, but once SiLU(mul(gate, up)) produces larger
values, block scales overflow and the GEMM produces zeros/garbage.
Fix: compute input_global_scale = amax / (6.0 * 448.0), normalize
before block quantization, then fold global scale back into block
scales (same as weight_transform.py folds weight_scale_2). This
ensures block scales are always ≤ 448.0 in UE4M3 range.
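A worked example of the overflow and the normalization step, using the constants quoted above. The activation values are hypothetical and the block is shortened to four elements for readability; the fold-back step is not shown.

```python
FP4_MAX = 6.0       # divisor used for block scales above (block_max / 6.0)
UE4M3_MAX = 448.0   # block-scale limit quoted above


def block_scale(block):
    # Per-block scale as described above: block_max / 6.0.
    return max(abs(v) for v in block) / FP4_MAX


def input_global_scale(values):
    # input_global_scale = amax / (6.0 * 448.0)
    amax = max(abs(v) for v in values)
    return amax / (FP4_MAX * UE4M3_MAX)


activations = [1.5, -3.0, 5376.0, 42.0]   # hypothetical block with one large value

raw_scale = block_scale(activations)          # 896.0, above UE4M3_MAX
gs = input_global_scale(activations)          # 2.0
normalized = [v / gs for v in activations]
normalized_scale = block_scale(normalized)    # 448.0, exactly at the limit
```

The block containing amax lands exactly on 448.0 after normalization, and every other block has a smaller maximum, so its scale is smaller still.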
lm_head lives on DeepseekV4ForCausalLM, not DeepseekV4Model. The inner
load_weights silently drops it (not in params_dict). Extract it in the
outer loader, load it directly, then forward the rest to the inner model.
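A rough sketch of the split performed in the outer loader, assuming weights arrive as (name, tensor) pairs. The helper name split_lm_head, the "lm_head." prefix test, and the example key names are illustrative assumptions.

```python
def split_lm_head(weights, prefix="lm_head."):
    """Separate lm_head entries from the weights meant for the inner model."""
    head, rest = [], []
    for name, tensor in weights:
        (head if name.startswith(prefix) else rest).append((name, tensor))
    return head, rest


# Strings stand in for tensors; key names are hypothetical.
checkpoint = [
    ("model.layers.0.attn.wq_a.weight", "t0"),
    ("lm_head.weight", "t1"),
    ("model.norm.weight", "t2"),
]
head, rest = split_lm_head(checkpoint)
```

The outer loader would load `head` itself and pass `rest` on to the inner model's load_weights.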
Two fixes:
1. attn_hc.base → hc_attn_base (underscore, not dot, before base/fn/scale).
   Same for fn, scale, and the ffn_hc variants.
2. compressor.position_bias → compressor.ape never fired because the
   .self_attn.compressor. rule matched first and the loop breaks after the
   first match. Added a combined rule:
   .self_attn.compressor.position_bias → .attn.mla_attn.compressor.ape.
Three more dropped checkpoint→model mappings:
1. hc_head: checkpoint has hc_head.hc_base/fn/scale, model has
   hc_head_base/fn/scale (underscore, not dot, separator)
2. attn_hc/ffn_hc: checkpoint has .attn_hc. and .ffn_hc., model has
   .hc_attn. and .hc_ffn. (word order reversed)
3. compressor.position_bias → compressor.ape: checkpoint name is
   position_bias, model attr is ape (absolute position encoding)
All 461 remaining zero params should now be just indexer.k_norm.bias
(legit zero - no bias in checkpoint, only weight).
indexer.weights_proj is uint8 [64,3584] in checkpoint but bf16 [64,7168]
in model. The uint8→bf16 unpack logic only ran in the stacked_params
loop, so non-stacked NVFP4 params hit a size mismatch assertion.
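The 3584 → 7168 column change is consistent with two 4-bit values packed per byte. The sketch below assumes FP4 (E2M1) codes with the low nibble first; the nibble order is an assumption, and the block scales that a real unpack would also apply are left out.

```python
# FP4 (E2M1) magnitudes for the 3 low bits; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_fp4(code):
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_row(packed_bytes):
    # Assumption: the low nibble holds the first value, the high nibble the second.
    values = []
    for byte in packed_bytes:
        values.append(decode_fp4(byte & 0x0F))
        values.append(decode_fp4(byte >> 4))
    return values


packed_row = bytes([0x21, 0xF7])
unpacked = unpack_row(packed_row)

# One checkpoint row of 3584 bytes expands to 7168 values.
full_row_len = len(unpack_row(bytes(3584)))
```

Any load path that compares the raw uint8 shape against the model's bf16 shape will see the factor-of-two mismatch unless this expansion runs first.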
Three categories of missed renames in CKPT_KEY_SUBST:
1. Shared experts: .shared_experts.gate_proj.→.ffn.shared_experts.w1. fired,
   but the break prevented .mlp.→.ffn. from also applying, producing
   mlp.ffn.shared_experts.w1. (double prefix). Fixed by including .mlp.
   in the pattern. Added the missing .shared_experts.down_proj. rule.
2. Indexer (layers 2+): .self_attn.compressor.indexer.* was caught by the
   generic .self_attn.compressor.→.attn.mla_attn.compressor. rule, producing
   the wrong path attn.mla_attn.compressor.indexer.* instead of attn.indexer.*.
   Added indexer-specific patterns (q_b_proj→wq_b, kv_norm→k_norm,
   position_bias→compressor.ape, gate_proj→compressor.wgate,
   kv_proj→compressor.wkv) before the generic compressor rule.
3. Compressor kv_proj/gate_proj: the old .compressor.kv_proj.→.compressor.wkv.
   pattern could never fire because .self_attn.compressor. matched first
   and the loop broke. Merged into combined patterns that handle both the
   self_attn.compressor→attn.mla_attn.compressor path and the projection
   rename in one step.
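The first-match-then-break behaviour behind these bugs can be reproduced in a few lines. The rename_key helper and the exact rule tuples are illustrative reconstructions of the patterns named above, not a copy of CKPT_KEY_SUBST.

```python
def rename_key(key, rules):
    # First matching rule wins; the loop breaks after one substitution.
    for old, new in rules:
        if old in key:
            key = key.replace(old, new)
            break
    return key


GENERIC = (".self_attn.compressor.", ".attn.mla_attn.compressor.")
LATE_PROJ = (".compressor.kv_proj.", ".compressor.wkv.")
COMBINED = (".self_attn.compressor.kv_proj.", ".attn.mla_attn.compressor.wkv.")

key = "layers.3.self_attn.compressor.kv_proj.weight"

broken = rename_key(key, [GENERIC, LATE_PROJ])   # projection rename never fires
fixed = rename_key(key, [COMBINED, GENERIC])     # specific rule listed first
```

With the generic rule first, the key keeps its kv_proj suffix. Putting the combined rule ahead of the generic one applies both renames in a single substitution.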
Checkpoint keys are model.layers.N.shared_experts.gate_proj.weight
but model params are layers.N.ffn.shared_experts.gate_up_proj.weight.
The .ffn. was missing from the rename, so stacked gate_up_proj
never matched params_dict.
The stacking logic skipped any key containing '.experts.' to avoid
MoE routed expert weights. But 'shared_experts' also matches that
substring, so gate_proj and up_proj were never stacked into
gate_up_proj. Changed to '.ffn.experts.' which only matches the
routed experts path.
Also includes a POST-LOAD scan for all-zero params.