Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
Weight mapper fixes:
- Reorder substr renames: compressor renames first, then .self_attn.compressor.
→ .attn.mla_attn.compressor., then indexer renames (so indexer keys end up
under mla_attn after the compressor rename already fired)
- Add compressor param renames: kv_proj→wkv, gate_proj→wgate, kv_norm→norm,
position_bias→ape (checkpoint uses NVFP4 naming, model uses internal names)
- Add indexer param renames: q_b_proj→wq_b, kv_proj→compressor.wkv,
gate_proj→compressor.wgate, kv_norm→k_norm, position_bias→compressor.ape,
weights_proj stays (structural: compressor.indexer → indexer.compressor)
- Remove broken suffix renames (already fixed in prior commit)
Model architecture fixes:
- Patch deepseek_compressor.py to pass quant_config (was None, but NVFP4
checkpoint has quantized compressor weights with input_scale/weight_scale)
- Patch deepseek_v4_attention.py indexer: weights_proj now uses quant_config
(was None, but checkpoint has quantized weights)
- Add indexer.compressor.fused_wkv_wgate stacking in load_weights
Infrastructure:
- Add deepseek_compressor.py to Dockerfile
- Force MoE backend to flashinfer_cutedsl (was auto-selecting FLASHINFER_TRTLLM)
- Update unit test to 50 cases (compressor + indexer + quantization scales)