# NVIDIA TensorRT Model Optimizer

The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) is a library designed to optimize models for inference with NVIDIA GPUs. It includes tools for Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) of Large Language Models (LLMs), Vision Language Models (VLMs), and diffusion models.

We recommend installing the library with:

```bash
pip install nvidia-modelopt
```
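
To confirm that the install succeeded, a quick check like the one below can help. This is a minimal sketch using only the Python standard library; the helper name `installed_version` is chosen here for illustration and is not part of ModelOpt.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "nvidia-modelopt" is the distribution name used in the pip command above.
    version = installed_version("nvidia-modelopt")
    print(version or "nvidia-modelopt is not installed")
```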

## Quantizing HuggingFace Models with PTQ

You can quantize Hugging Face models with the example scripts in the TensorRT Model Optimizer repository. The primary script for LLM PTQ is typically found in the `examples/llm_ptq` directory.

Below is an example showing how to quantize a model using modelopt's PTQ API:

??? code

    ```python
    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForCausalLM

    # Load the model from HuggingFace
    model = AutoModelForCausalLM.from_pretrained("<path_or_model_id>")

    # Select the quantization config, for example, FP8
    config = mtq.FP8_DEFAULT_CFG

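    # Define a forward loop function for calibration.
    # `calib_set` is a placeholder: supply your own iterable of calibration
    # batches (for example, tokenized input IDs on the model's device).
    def forward_loop(model):
        for data in calib_set:
            model(data)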

    # PTQ with in-place replacement of quantized modules
    model = mtq.quantize(model, config, forward_loop)
    ```
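
The `forward_loop` in the example iterates over `calib_set`, which you must supply. One way to assemble it is sketched below, assuming the calibration data starts as a list of text samples. `build_calib_set` and its `tokenize` parameter are hypothetical names used for illustration. In practice, `tokenize` would likely wrap a Hugging Face tokenizer and return tensors on the model's device.

```python
from typing import Callable, List, Sequence, TypeVar

Batch = TypeVar("Batch")


def build_calib_set(
    texts: Sequence[str],
    tokenize: Callable[[List[str]], Batch],
    batch_size: int = 8,
) -> List[Batch]:
    """Group raw text samples into tokenized calibration batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        batches.append(tokenize(chunk))
    return batches


if __name__ == "__main__":
    # Stand-in tokenizer so the sketch runs without downloading a model:
    # it maps each text to a list of word lengths.
    def toy_tokenize(chunk):
        return [[len(word) for word in text.split()] for text in chunk]

    calib_set = build_calib_set(
        ["hello world", "vLLM", "a b c"], toy_tokenize, batch_size=2
    )
    print(len(calib_set))  # 2 batches: two samples, then one
```

Depending on what `tokenize` returns, the forward loop may need to call `model(**data)` rather than `model(data)`, for example when each batch is a dictionary of keyword inputs.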

After the model is quantized, you can export it to a quantized checkpoint using the export API:

```python
import torch
from modelopt.torch.export import export_hf_checkpoint

export_dir = "<path_to_export_dir>"  # Placeholder: choose your own output directory.

with torch.inference_mode():
    export_hf_checkpoint(
        model,  # The quantized model.
        export_dir,  # The directory where the exported files will be stored.
    )
```
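
Before deploying, it can be useful to check what was written to `export_dir`. The sketch below assumes the export produces an `hf_quant_config.json` file with a `quantization` section that contains a `quant_algo` field. That layout is an assumption and may differ across ModelOpt versions, so treat this as a starting point rather than a definitive check. The helper name `read_quant_algo` is hypothetical.

```python
import json
from pathlib import Path


def read_quant_algo(export_dir):
    """Return the `quant_algo` recorded in hf_quant_config.json, or None.

    Assumes (not guaranteed) a layout like:
    {"quantization": {"quant_algo": "FP8", ...}}
    """
    config_path = Path(export_dir) / "hf_quant_config.json"
    if not config_path.is_file():
        return None
    with config_path.open() as f:
        config = json.load(f)
    return config.get("quantization", {}).get("quant_algo")
```

A return value of `None` means either that the file is missing or that it does not follow the assumed layout. In that case, inspect the directory contents directly.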

The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:

??? code

    ```python
    from vllm import LLM, SamplingParams

    def main():
        model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
        # Ensure you specify quantization='modelopt' when loading the modelopt checkpoint
        llm = LLM(model=model_id, quantization="modelopt", trust_remote_code=True)

        sampling_params = SamplingParams(temperature=0.8, top_p=0.9)

        prompts = [
            "Hello, my name is",
            "The president of the United States is",
            "The capital of France is",
            "The future of AI is",
        ]

        outputs = llm.generate(prompts, sampling_params)

        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

    if __name__ == "__main__":
        main()
    ```