Config patches: document modelopt↔vllm gaps with NVIDIA reference

2026-05-10 08:59:28 +00:00
parent 0d74b97fb2
commit 30608e3834


@@ -16,15 +16,22 @@ Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7T
## ⚠️ Model Config Patches (post-export)
modelopt 0.45.0.dev64's export doesn't fully match what vllm 0.20.2 expects. These changes were made to `DeepSeek-V4-Pro-NVFP4/config.json` and `hf_quant_config.json` after export:
modelopt 0.45.0.dev64's export produces configs that don't match what vllm expects at runtime. **NVIDIA's own published NVFP4 exports have the same gaps** — we compared against `nvidia/DeepSeek-V3.2-NVFP4` and `nvidia/MiniMax-M2.7-NVFP4` on HuggingFace. Neither includes `compress_ratios` or `scale_fmt` either. This is a modelopt ↔ vllm integration gap, not a problem with our quantization.
| Field | modelopt export | vllm expects | Fix |
|-------|----------------|-------------|-----|
| `compress_ratios` | Missing (transformers 5.8.0 uses `compress_rates` dict) | List of 61 ints, indexed by layer_id | Copied from BF16 source model's config.json |
| `quantization_config.scale_fmt` | Missing | `"ue8m0"` string | Added to config.json |
| `hf_quant_config.scale_fmt` | Missing | `"ue8m0"` string | Added to hf_quant_config.json |
All patches below are to `DeepSeek-V4-Pro-NVFP4/config.json` unless noted.
The `compress_rates` dict (`{'compressed_sparse_attention': 4, 'heavily_compressed_attention': 128}`) is the new transformers 5.8.0 format. vllm still expects the old per-layer list. The serve script (`serve_vllm.py`) also monkey-patches `DeepseekV4Config.__init__` to auto-convert when loading.
| # | Field | modelopt export (original) | vllm requires | Patch applied | Why modelopt doesn't export it |
|---|-------|---------------------------|--------------|---------------|-------------------------------|
| 1 | `compress_ratios` | Missing (transformers 5.8.0 renamed to `compress_rates` dict) | List of ints indexed by layer_id | Copied from BF16 source model's `compress_ratios` (62 items) | modelopt doesn't add fields the source config lacks; transformers 5.8.0 renamed the field |
| 2 | `quantization_config.scale_fmt` | Missing | `"ue8m0"` string | Added | modelopt doesn't include vllm-specific runtime fields |
| 3 | `rope_parameters` | Nested dict `{'main': {...}, 'compress': {...}}` (transformers 5.8.0 format) | Flat dict `{'rope_theta': ..., 'rope_type': ..., ...}` | Flattened to `main` sub-dict | transformers 5.8.0 changed rope_parameters from flat → nested per-component |
| 4 | `rope_scaling` | Nested dict `{'main': {...}, 'compress': {...}}` (same as above) | Flat dict | Flattened to `main` sub-dict | Same transformers 5.8.0 schema change |
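
The four patches in the table can be sketched as one small function over the two config dicts. This is a minimal illustration, not the script actually used: the function name `patch_config` is hypothetical, and it assumes both `config.json` files have already been loaded with `json.load` and that the BF16 source config carries a `compress_ratios` list.

```python
import copy


def patch_config(exported: dict, source: dict) -> dict:
    """Return a patched copy of a modelopt-exported config (hypothetical helper).

    `exported` is the NVFP4 export's config.json as a dict; `source` is the
    BF16 source model's config.json as a dict. Neither input is modified.
    """
    cfg = copy.deepcopy(exported)

    # Patch 1: copy the per-layer list from the BF16 source config.
    if "compress_ratios" not in cfg:
        cfg["compress_ratios"] = list(source["compress_ratios"])

    # Patch 2: add scale_fmt to quantization_config if it is absent.
    cfg.setdefault("quantization_config", {}).setdefault("scale_fmt", "ue8m0")

    # Patches 3 and 4: replace the nested {'main': ..., 'compress': ...}
    # dict with its 'main' sub-dict.
    for key in ("rope_parameters", "rope_scaling"):
        value = cfg.get(key)
        if isinstance(value, dict) and isinstance(value.get("main"), dict):
            cfg[key] = dict(value["main"])

    return cfg
```

Working on a deep copy keeps the original export untouched, so the patched result can be diffed against it before anything is written back to disk.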
**NVIDIA's own NVFP4 exports also lack the fields added by patches 1 and 2.** We checked:
- `nvidia/DeepSeek-V3.2-NVFP4` — no `compress_ratios`, no `scale_fmt`, no `quantization_config` in config.json at all (V3.2 doesn't use MLA compression so it sidesteps the issue)
- `nvidia/MiniMax-M2.7-NVFP4` — has `quantization_config` in config.json (same schema as ours) but no `scale_fmt`
The `compress_ratios` → `compress_rates` rename and the `rope_parameters` nesting are transformers 5.8.0 regressions that modelopt doesn't account for. `scale_fmt` is a vllm runtime field that modelopt has never exported.
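
A quick way to see which of these gaps a given export has is to inspect its `config.json` as a dict. The sketch below is an assumption-laden illustration: `find_gaps` is a hypothetical name, and the checks mirror only the four fields discussed in this section, not everything vllm may validate.

```python
def find_gaps(cfg: dict) -> list[str]:
    """List the config fields from this section that still need patching."""
    gaps = []

    # Gap 1: the per-layer compress_ratios list is absent.
    if "compress_ratios" not in cfg:
        gaps.append("compress_ratios")

    # Gap 2: scale_fmt is absent (or quantization_config itself is missing).
    if "scale_fmt" not in (cfg.get("quantization_config") or {}):
        gaps.append("quantization_config.scale_fmt")

    # Gaps 3 and 4: the field is still in the nested per-component format.
    for key in ("rope_parameters", "rope_scaling"):
        value = cfg.get(key)
        if isinstance(value, dict) and "main" in value:
            gaps.append(key)

    return gaps
```

An empty list means none of the four known gaps remain. It does not guarantee that vllm will load the model.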
## Architecture