Config patches: document modelopt↔vllm gaps with NVIDIA reference
21
README.md
@@ -16,15 +16,22 @@ Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7T
## ⚠️ Model Config Patches (post-export)
The export from modelopt 0.45.0.dev64 doesn't fully match what vllm 0.20.2 expects. These changes were made to `DeepSeek-V4-Pro-NVFP4/config.json` and `hf_quant_config.json` after export:

The export from modelopt 0.45.0.dev64 produces configs that don't match what vllm expects at runtime. **NVIDIA's own published NVFP4 exports have the same gaps**: we compared against `nvidia/DeepSeek-V3.2-NVFP4` and `nvidia/MiniMax-M2.7-NVFP4` on Hugging Face, and neither includes `compress_ratios` or `scale_fmt`. This is a modelopt ↔ vllm integration gap, not a problem with our quantization.

| Field | modelopt export | vllm expects | Fix |
|-------|----------------|-------------|-----|
| `compress_ratios` | Missing (transformers 5.8.0 uses `compress_rates` dict) | List of 61 ints, indexed by layer_id | Copied from BF16 source model's config.json |
| `quantization_config.scale_fmt` | Missing | `"ue8m0"` string | Added to config.json |
| `hf_quant_config.scale_fmt` | Missing | `"ue8m0"` string | Added to hf_quant_config.json |

All patches below apply to `DeepSeek-V4-Pro-NVFP4/config.json` unless noted.

The `compress_rates` dict (`{'compressed_sparse_attention': 4, 'heavily_compressed_attention': 128}`) is the new transformers 5.8.0 format, while vllm still expects the old per-layer list. The serve script (`serve_vllm.py`) also monkey-patches `DeepseekV4Config.__init__` to convert the field automatically at load time.
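
The source does not show how `serve_vllm.py` performs this conversion, so the following is only a sketch of the general monkey-patching technique. It uses a hypothetical stand-in class, `DemoConfig`, instead of the real `DeepseekV4Config`, and the helper name `install_compress_ratios_fallback` is invented for illustration. The per-layer list is supplied by the caller (for example, read from the BF16 source config) rather than derived from the `compress_rates` dict, because that mapping is not described here.

```python
import functools


class DemoConfig:
    """Hypothetical stand-in for a config class that needs the old per-layer list."""

    def __init__(self, **kwargs):
        # Fails with KeyError if the old-style field is absent, like a strict loader would.
        self.compress_ratios = kwargs["compress_ratios"]
        self.extra = {k: v for k, v in kwargs.items() if k != "compress_ratios"}


def install_compress_ratios_fallback(config_cls, per_layer_ratios):
    """Wrap config_cls.__init__ so a missing `compress_ratios` is filled in.

    Returns the original __init__ so the patch can be undone.
    """
    original_init = config_cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        # Only inject the list when the caller did not provide one.
        if "compress_ratios" not in kwargs:
            kwargs["compress_ratios"] = list(per_layer_ratios)
        original_init(self, *args, **kwargs)

    config_cls.__init__ = patched_init
    return original_init
```

The wrapper leaves explicit `compress_ratios` values alone, so a config that has already been patched on disk loads unchanged.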

| # | Field | modelopt export (original) | vllm requires | Patch applied | Why modelopt doesn't export it |
|---|-------|---------------------------|--------------|---------------|------------------------------|
| 1 | `compress_ratios` | Missing (transformers 5.8.0 renamed to `compress_rates` dict) | List of ints indexed by layer_id | Copied from BF16 source model's `compress_ratios` (62 items) | modelopt doesn't add fields the source config lacks; transformers 5.8.0 renamed the field |
| 2 | `quantization_config.scale_fmt` | Missing | `"ue8m0"` string | Added | modelopt doesn't include vllm-specific runtime fields |
| 3 | `rope_parameters` | Nested dict `{'main': {...}, 'compress': {...}}` (transformers 5.8.0 format) | Flat dict `{'rope_theta': ..., 'rope_type': ..., ...}` | Flattened to `main` sub-dict | transformers 5.8.0 changed rope_parameters from flat → nested per-component |
| 4 | `rope_scaling` | Nested dict `{'main': {...}, 'compress': {...}}` (same as above) | Flat dict | Flattened to `main` sub-dict | Same transformers 5.8.0 schema change |
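
The four patches in the table can be sketched as one function over the parsed JSON. This is a minimal illustration, not the script used for this repo: the function name `patch_config` is hypothetical, and it assumes the BF16 source config still carries the old `compress_ratios` list and that the nested rope dicts have a `main` key, as described in the table.

```python
import copy


def patch_config(exported: dict, source: dict, scale_fmt: str = "ue8m0") -> dict:
    """Return a patched copy of the exported config; the inputs are not modified."""
    patched = copy.deepcopy(exported)

    # Patch 1: copy the per-layer list from the BF16 source model's config.
    if "compress_ratios" not in patched:
        patched["compress_ratios"] = list(source["compress_ratios"])

    # Patch 2: add scale_fmt under quantization_config if it is missing.
    patched.setdefault("quantization_config", {}).setdefault("scale_fmt", scale_fmt)

    # Patches 3 and 4: replace the nested {'main': ..., 'compress': ...} dict
    # with its 'main' sub-dict; already-flat dicts are left as they are.
    for key in ("rope_parameters", "rope_scaling"):
        value = patched.get(key)
        if isinstance(value, dict) and isinstance(value.get("main"), dict):
            patched[key] = copy.deepcopy(value["main"])

    return patched
```

To apply it, load both `config.json` files with `json.load`, call `patch_config(exported, source)`, and write the result back with `json.dump`. Keeping the function pure makes it easy to diff the patched config against the original before overwriting anything.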

**NVIDIA's own NVFP4 exports also lack the fields added by patches 1 and 2.** We checked:
- `nvidia/DeepSeek-V3.2-NVFP4`: no `compress_ratios`, no `scale_fmt`, and no `quantization_config` in config.json at all (V3.2 doesn't use MLA compression, so it sidesteps the issue)
- `nvidia/MiniMax-M2.7-NVFP4`: has `quantization_config` in config.json (same schema as ours) but no `scale_fmt`
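
A comparison like the one above can be repeated with a small helper that reports which fields a config lacks. This is a sketch under assumptions: the names `has_path`, `missing_fields`, and `REQUIRED` are hypothetical, and the config is expected as an already-parsed dict (for example, from a downloaded `config.json`).

```python
REQUIRED = (
    "compress_ratios",
    "quantization_config",
    "quantization_config.scale_fmt",
)


def has_path(config: dict, dotted: str) -> bool:
    """Return True if every key along a dotted path exists in nested dicts."""
    node = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def missing_fields(config: dict, required=REQUIRED) -> list:
    """List the required dotted paths that are absent from the config."""
    return [path for path in required if not has_path(config, path)]
```

A config with no `quantization_config` block reports all three paths as missing, while one that has the block but no `scale_fmt` reports only `compress_ratios` and `quantization_config.scale_fmt`.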

The `compress_ratios` → `compress_rates` rename and the `rope_parameters` nesting are transformers 5.8.0 schema changes that modelopt doesn't account for. `scale_fmt` is a vllm runtime field that modelopt has never exported.
## Architecture