diff --git a/README.md b/README.md
index 76eb34e..2ce94a7 100644
--- a/README.md
+++ b/README.md
@@ -16,15 +16,22 @@ Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7T
 
 ## ⚠️ Model Config Patches (post-export)
 
-modelopt 0.45.0.dev64's export doesn't fully match what vllm 0.20.2 expects. These changes were made to `DeepSeek-V4-Pro-NVFP4/config.json` and `hf_quant_config.json` after export:
+modelopt 0.45.0.dev64's export produces configs that don't match what vllm expects at runtime. **NVIDIA's own published NVFP4 exports have the same gaps** — we compared against `nvidia/DeepSeek-V3.2-NVFP4` and `nvidia/MiniMax-M2.7-NVFP4` on HuggingFace. Neither includes `compress_ratios` or `scale_fmt` either. This is a modelopt ↔ vllm integration gap, not a problem with our quantization.
 
-| Field | modelopt export | vllm expects | Fix |
-|-------|----------------|-------------|-----|
-| `compress_ratios` | Missing (transformers 5.8.0 uses `compress_rates` dict) | List of 61 ints, indexed by layer_id | Copied from BF16 source model's config.json |
-| `quantization_config.scale_fmt` | Missing | `"ue8m0"` string | Added to config.json |
-| `hf_quant_config.scale_fmt` | Missing | `"ue8m0"` string | Added to hf_quant_config.json |
+All patches below are to `DeepSeek-V4-Pro-NVFP4/config.json` unless noted.
 
-The `compress_rates` dict (`{'compressed_sparse_attention': 4, 'heavily_compressed_attention': 128}`) is the new transformers 5.8.0 format. vllm still expects the old per-layer list. The serve script (`serve_vllm.py`) also monkey-patches `DeepseekV4Config.__init__` to auto-convert when loading.
+| # | Field | modelopt export (original) | vllm requires | Patch applied | Why modelopt doesn't export it |
+|---|-------|---------------------------|--------------|---------------|------------------------------ |
+| 1 | `compress_ratios` | Missing (transformers 5.8.0 renamed to `compress_rates` dict) | List of ints indexed by layer_id | Copied from BF16 source model's `compress_ratios` (62 items) | modelopt doesn't add fields the source config lacks; transformers 5.8.0 renamed the field |
+| 2 | `quantization_config.scale_fmt` | Missing | `"ue8m0"` string | Added | modelopt doesn't include vllm-specific runtime fields |
+| 3 | `rope_parameters` | Nested dict `{'main': {...}, 'compress': {...}}` (transformers 5.8.0 format) | Flat dict `{'rope_theta': ..., 'rope_type': ..., ...}` | Flattened to `main` sub-dict | transformers 5.8.0 changed rope_parameters from flat → nested per-component |
+| 4 | `rope_scaling` | Nested dict `{'main': {...}, 'compress': {...}}` (same as above) | Flat dict | Flattened to `main` sub-dict | Same transformers 5.8.0 schema change |
+
+**NVIDIA's own NVFP4 exports confirmed to also lack patches 1 and 2.** We checked:
+- `nvidia/DeepSeek-V3.2-NVFP4` — no `compress_ratios`, no `scale_fmt`, no `quantization_config` in config.json at all (V3.2 doesn't use MLA compression so it sidesteps the issue)
+- `nvidia/MiniMax-M2.7-NVFP4` — has `quantization_config` in config.json (same schema as ours) but no `scale_fmt`
+
+The `compress_rates` → `compress_ratios` rename and `rope_parameters` nesting are transformers 5.8.0 regressions that modelopt doesn't account for. `scale_fmt` is a vllm runtime field that modelopt has never exported.
 
 ## Architecture
 
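
The four patches in the table above can be sketched as a small Python helper. This is an illustration only, not the script used in this repository: the function name `patch_config_for_vllm` is hypothetical, the toy dict values are placeholders rather than the real config contents, and it assumes the BF16 source `config.json` still carries a per-layer `compress_ratios` list, as the "Patch applied" column states.

```python
import copy


def patch_config_for_vllm(exported: dict, source: dict) -> dict:
    """Return a patched copy of the exported config (hypothetical helper).

    `exported` is the parsed modelopt config.json, `source` the parsed
    BF16 source config.json. Neither input is modified.
    """
    cfg = copy.deepcopy(exported)

    # Patch 1: copy the per-layer list from the BF16 source config.
    if "compress_ratios" not in cfg and "compress_ratios" in source:
        cfg["compress_ratios"] = list(source["compress_ratios"])

    # Patch 2: add scale_fmt to quantization_config if it is absent.
    quant_cfg = cfg.get("quantization_config")
    if isinstance(quant_cfg, dict):
        quant_cfg.setdefault("scale_fmt", "ue8m0")

    # Patches 3 and 4: replace the nested {'main': ..., 'compress': ...}
    # dict with its 'main' sub-dict. Already-flat dicts are left alone.
    for key in ("rope_parameters", "rope_scaling"):
        value = cfg.get(key)
        if isinstance(value, dict) and isinstance(value.get("main"), dict):
            cfg[key] = dict(value["main"])

    return cfg


# Toy inputs with placeholder values, only to show the shape of each patch.
toy_source = {"compress_ratios": [1, 4, 128]}
toy_exported = {
    "quantization_config": {"quant_algo": "NVFP4"},
    "rope_parameters": {
        "main": {"rope_theta": 10000.0, "rope_type": "default"},
        "compress": {"rope_theta": 10000.0, "rope_type": "default"},
    },
}
toy_patched = patch_config_for_vllm(toy_exported, toy_source)
```

In practice the two dicts would come from `json.load` on the respective `config.json` files, and the result would be written back with `json.dump`. Working on a deep copy keeps the original export available for comparison.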