fix: indexer + shared_experts + compressor checkpoint→model key renames
Three categories of missed renames in CKPT_KEY_SUBST: 1. Shared experts: .shared_experts.gate_proj.→.ffn.shared_experts.w1. fired but break prevented .mlp.→.ffn. from also applying, producing mlp.ffn.shared_experts.w1. (double prefix). Fixed by including .mlp. in the pattern. Added missing .shared_experts.down_proj. rule. 2. Indexer (layers 2+): .self_attn.compressor.indexer.* was caught by the generic .self_attn.compressor.→.attn.mla_attn.compressor. rule, producing wrong path attn.mla_attn.compressor.indexer.* instead of attn.indexer.*. Added indexer-specific patterns (q_b_proj→wq_b, kv_norm→k_norm, position_bias→compressor.ape, gate_proj→compressor.wgate, kv_proj→compressor.wkv) before the generic compressor rule. 3. Compressor kv_proj/gate_proj: old .compressor.kv_proj.→.compressor.wkv. pattern could never fire because .self_attn.compressor. matched first (break). Merged into combined patterns that handle both the self_attn.compressor→attn.mla_attn.compressor path AND the projection rename in one step.
This commit is contained in:
@@ -1270,16 +1270,26 @@ class DeepseekV4Model(nn.Module):
|
||||
".self_attn.sinks": ".attn.attn_sink",
|
||||
".self_attn.kv_proj.": ".attn.wkv.",
|
||||
".self_attn.kv_norm.": ".attn.kv_norm.",
|
||||
# Indexer: self_attn.compressor.indexer → attn.indexer
|
||||
# MUST come before the generic .self_attn.compressor. rule
|
||||
".self_attn.compressor.indexer.q_b_proj.": ".attn.indexer.wq_b.",
|
||||
".self_attn.compressor.indexer.kv_norm.": ".attn.indexer.k_norm.",
|
||||
".self_attn.compressor.indexer.position_bias": ".attn.indexer.compressor.ape",
|
||||
".self_attn.compressor.indexer.gate_proj.": ".attn.indexer.compressor.wgate.",
|
||||
".self_attn.compressor.indexer.kv_proj.": ".attn.indexer.compressor.wkv.",
|
||||
".self_attn.compressor.indexer.": ".attn.indexer.",
|
||||
# Compressor: self_attn.compressor → attn.mla_attn.compressor
|
||||
# Compressor projections for stacking (fused_wkv_wgate)
|
||||
".self_attn.compressor.kv_proj.": ".attn.mla_attn.compressor.wkv.",
|
||||
".self_attn.compressor.gate_proj.": ".attn.mla_attn.compressor.gate.",
|
||||
".self_attn.compressor.kv_norm.": ".attn.kv_norm.",
|
||||
".self_attn.compressor.": ".attn.mla_attn.compressor.",
|
||||
# Compressor projections for stacking (fused_wkv_wgate)
|
||||
".compressor.kv_proj.": ".compressor.wkv.",
|
||||
".compressor.gate_proj.": ".compressor.gate.",
|
||||
# Shared expert projections (stacking into gate_up_proj)
|
||||
# Checkpoint has .shared_experts. but model has .ffn.shared_experts.
|
||||
".shared_experts.gate_proj.": ".ffn.shared_experts.w1.",
|
||||
".shared_experts.up_proj.": ".ffn.shared_experts.w3.",
|
||||
# Must include .mlp. prefix since break prevents .mlp.→.ffn. from
|
||||
# firing on the same key after these patterns match.
|
||||
".mlp.shared_experts.gate_proj.": ".ffn.shared_experts.w1.",
|
||||
".mlp.shared_experts.up_proj.": ".ffn.shared_experts.w3.",
|
||||
".mlp.shared_experts.down_proj.": ".ffn.shared_experts.down_proj.",
|
||||
# modelopt uses mlp, vllm uses ffn internally
|
||||
".mlp.": ".ffn.",
|
||||
}
|
||||
|
||||
Reference in New Issue
Block a user