# NVIDIA Model Optimizer
The [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer) is a library designed to optimize models for inference with NVIDIA GPUs. It includes tools for Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) of Large Language Models (LLMs), Vision Language Models (VLMs), and diffusion models.
We recommend installing the library with:
```bash
pip install nvidia-modelopt
```
## Supported ModelOpt checkpoint formats
vLLM detects ModelOpt checkpoints via `hf_quant_config.json` and supports the
following `quantization.quant_algo` values:
- `FP8`: per-tensor weight scale (+ optional static activation scale).
- `FP8_PER_CHANNEL_PER_TOKEN`: per-channel weight scale and dynamic per-token activation quantization.
- `FP8_PB_WO` (ModelOpt may emit `fp8_pb_wo`): block-scaled FP8 weight-only (typically 128×128 blocks).
- `NVFP4`: ModelOpt NVFP4 checkpoints (use `quantization="modelopt_fp4"`).
- `MXFP8`: ModelOpt MXFP8 checkpoints (use `quantization="modelopt_mxfp8"`).
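
The exported checkpoint records the algorithm under the `quantization` key of `hf_quant_config.json` in the checkpoint directory. As a quick sanity check you can read that field directly; the sketch below is illustrative, and the checkpoint path is a placeholder:

```python
import json
from pathlib import Path

# Placeholder path: point this at a ModelOpt-exported checkpoint directory.
ckpt_dir = Path("<path_to_exported_checkpoint>")
quant_config = json.loads((ckpt_dir / "hf_quant_config.json").read_text())
print(quant_config["quantization"]["quant_algo"])  # e.g. "FP8", "NVFP4", "MXFP8"
```
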
## Quantizing HuggingFace Models with PTQ
You can quantize HuggingFace models using the example scripts provided in the Model Optimizer repository. The primary script for LLM PTQ is typically found within the `examples/llm_ptq` directory.
Below is an example showing how to quantize a model using modelopt's PTQ API:
??? code

    ```python
    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForCausalLM

    # Load the model from HuggingFace
    model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")

    # Select the quantization config, for example, FP8
    config = mtq.FP8_DEFAULT_CFG

    # Define a forward loop function for calibration
    def forward_loop(model):
        for data in calib_set:
            model(data)

    # PTQ with in-place replacement of quantized modules
    model = mtq.quantize(model, config, forward_loop)
    ```
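
The `calib_set` used in the forward loop above is any iterable of calibration batches you provide; ModelOpt does not define it for you. Below is a minimal, illustrative sketch of building one from a HuggingFace dataset; the dataset name, sample count, and sequence length are assumptions, not requirements:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative calibration data: ~512 text samples tokenized to input_ids.
tokenizer = AutoTokenizer.from_pretrained("<path_or_model_id>")
samples = load_dataset("cnn_dailymail", "3.0.0", split="validation[:512]")["article"]
calib_set = [
    tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids.to(model.device)
    for text in samples
]
```
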
After the model is quantized, you can export it to a quantized checkpoint using the export API:
```python
import torch
from modelopt.torch.export import export_hf_checkpoint

with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir,  # The directory where the exported files will be stored.
    )
```
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`:
??? code

    ```python
    from vllm import LLM, SamplingParams

    def main():
        model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
        # Ensure you specify quantization="modelopt" when loading the modelopt checkpoint
        llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)

        sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

        prompts = [
            "Hello, my name is",
            "The president of the United States is",
            "The capital of France is",
            "The future of AI is",
        ]

        outputs = llm.generate(prompts, sampling_params)

        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    if __name__ == "__main__":
        main()
    ```
## Running the OpenAI-compatible server
To serve a local ModelOpt checkpoint via the OpenAI-compatible API:
```bash
vllm serve <path_to_exported_checkpoint> \
--quantization modelopt \
--host 0.0.0.0 --port 8000
```
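
Once the server is running, any OpenAI-compatible client can query it. Below is a minimal sketch using the `openai` Python package; the base URL and model name assume the serve command above and are placeholders:

```python
from openai import OpenAI

# The api_key is unused by vLLM but required by the client library.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="<path_to_exported_checkpoint>",  # must match the model name vLLM is serving
    prompt="The capital of France is",
    max_tokens=32,
)
print(completion.choices[0].text)
```
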
## Testing (local checkpoints)
vLLM's ModelOpt unit tests are gated by local checkpoint paths and are skipped
by default in CI. To run the tests locally:
```bash
export VLLM_TEST_MODELOPT_FP8_PC_PT_MODEL_PATH=<path_to_fp8_pc_pt_checkpoint>
export VLLM_TEST_MODELOPT_FP8_PB_WO_MODEL_PATH=<path_to_fp8_pb_wo_checkpoint>
pytest -q tests/quantization/test_modelopt.py
```