Config patches: document modelopt↔vllm gaps with NVIDIA reference

2026-05-10 08:59:28 +00:00
parent 0d74b97fb2
commit 30608e3834
1 changed files with 14 additions and 7 deletions
--- a/README.md
+++ b/README.md
@@ -16,15 +16,22 @@ Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7T

 ## ⚠️ Model Config Patches (post-export)

-modelopt 0.45.0.dev64's export doesn't fully match what vllm 0.20.2 expects. These changes were made to `DeepSeek-V4-Pro-NVFP4/config.json` and `hf_quant_config.json` after export:
+modelopt 0.45.0.dev64's export produces configs that don't match what vllm expects at runtime. **NVIDIA's own published NVFP4 exports have the same gaps**: we compared against `nvidia/DeepSeek-V3.2-NVFP4` and `nvidia/MiniMax-M2.7-NVFP4` on HuggingFace, and neither includes `compress_ratios` or `scale_fmt`. This is a modelopt ↔ vllm integration gap, not a problem with our quantization.

-| Field | modelopt export | vllm expects | Fix |
-|-------|----------------|-------------|-----|
-| `compress_ratios` | Missing (transformers 5.8.0 uses `compress_rates` dict) | List of 61 ints, indexed by layer_id | Copied from BF16 source model's config.json |
-| `quantization_config.scale_fmt` | Missing | `"ue8m0"` string | Added to config.json |
-| `hf_quant_config.scale_fmt` | Missing | `"ue8m0"` string | Added to hf_quant_config.json |
+All patches below are to `DeepSeek-V4-Pro-NVFP4/config.json` unless noted.

-The `compress_rates` dict (`{'compressed_sparse_attention': 4, 'heavily_compressed_attention': 128}`) is the new transformers 5.8.0 format. vllm still expects the old per-layer list. The serve script (`serve_vllm.py`) also monkey-patches `DeepseekV4Config.__init__` to auto-convert when loading.
+| # | Field | modelopt export (original) | vllm requires | Patch applied | Why modelopt doesn't export it |
+|---|-------|---------------------------|--------------|---------------|------------------------------ |
+| 1 | `compress_ratios` | Missing (transformers 5.8.0 renamed it to a `compress_rates` dict) | List of ints indexed by layer_id | Copied from BF16 source model's `compress_ratios` (62 items) | modelopt doesn't add fields the source config lacks; transformers 5.8.0 renamed the field |
+| 2 | `quantization_config.scale_fmt` | Missing | `"ue8m0"` string | Added | modelopt doesn't include vllm-specific runtime fields |
+| 3 | `rope_parameters` | Nested dict `{'main': {...}, 'compress': {...}}` (transformers 5.8.0 format) | Flat dict `{'rope_theta': ..., 'rope_type': ..., ...}` | Flattened to `main` sub-dict | transformers 5.8.0 changed rope_parameters from flat → nested per-component |
+| 4 | `rope_scaling` | Nested dict `{'main': {...}, 'compress': {...}}` (same as above) | Flat dict | Flattened to `main` sub-dict | Same transformers 5.8.0 schema change |
+
+**NVIDIA's own NVFP4 exports also lack the fields added by patches 1 and 2.** We checked:
+- `nvidia/DeepSeek-V3.2-NVFP4` — no `compress_ratios`, no `scale_fmt`, no `quantization_config` in config.json at all (V3.2 doesn't use MLA compression so it sidesteps the issue)
+- `nvidia/MiniMax-M2.7-NVFP4` — has `quantization_config` in config.json (same schema as ours) but no `scale_fmt`
+
+The `compress_ratios` → `compress_rates` rename and the `rope_parameters`/`rope_scaling` nesting are transformers 5.8.0 schema changes that modelopt doesn't account for. `scale_fmt` is a vllm runtime field that modelopt has never exported.

 ## Architecture