# TorchAO

TorchAO is an architecture optimization library for PyTorch. It provides high-performance dtypes, optimization techniques, and kernels for inference and training, and it composes with native PyTorch features such as `torch.compile` and FSDP. Benchmark numbers are available in the [torchao quantization README](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks).

We recommend installing the latest torchao nightly build:

```bash
# Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
# Quote the requirement so the shell does not treat ">" as an output redirect
pip install \
    --pre "torchao>=0.10.0" \
    --index-url https://download.pytorch.org/whl/nightly/cu126
```
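
To confirm that the install landed in the active environment, a small check along these lines may help. It is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is introduced here for illustration and is not part of torchao.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical helper defined above; "torchao" is the distribution name on the index.
torchao_version = installed_version("torchao")
if torchao_version is None:
    print("torchao is not installed in this environment")
else:
    print(f"torchao {torchao_version} is installed")
```

Nightly builds typically carry a `.dev` suffix in the version string, which is a quick way to tell them apart from stable releases.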

## Quantizing HuggingFace Models

You can quantize your own Hugging Face model with torchao through its integrations in [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), then save the checkpoint to the Hugging Face Hub, as in [this example checkpoint](https://huggingface.co/jerryzh168/llama3-8b-int8wo). The following code applies int8 weight-only quantization to a model and pushes the result to the Hub:

??? code

    ```Python
    import torch
    from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
    from torchao.quantization import Int8WeightOnlyConfig

    model_name = "meta-llama/Meta-Llama-3-8B"
    quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
    quantized_model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype="auto",
        device_map="auto",
        quantization_config=quantization_config
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    input_text = "What are we having for dinner?"
    input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

    hub_repo = "YOUR_HUB_REPO_ID"  # replace with your Hub repo ID
    tokenizer.push_to_hub(hub_repo)
    quantized_model.push_to_hub(hub_repo, safe_serialization=False)
    ```

Alternatively, you can use the [TorchAO Quantization space](https://huggingface.co/spaces/medmekk/TorchAO_Quantization) for quantizing models with a simple UI.
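
For intuition about what int8 weight-only quantization (the `Int8WeightOnlyConfig` used above) does to a linear layer, the following pure-Python sketch may be useful. It is a conceptual illustration under stated assumptions (symmetric quantization with one scale per weight row), not torchao's implementation; all function names are hypothetical.

```python
# Conceptual sketch only: symmetric, per-row int8 weight-only quantization.
# This is NOT torchao's implementation; every name here is illustrative.


def quantize_row(row):
    """Map a row of float weights to int8 values plus a single float scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from int8 values and their scale."""
    return [v * scale for v in q]


def linear(weight_rows, x):
    """Compute y = W @ x for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]


weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
quantized = [quantize_row(row) for row in weights]
restored = [dequantize_row(q, s) for q, s in quantized]

x = [1.0, 2.0, 3.0]
print("float weights:   ", linear(weights, x))
print("int8 round trip: ", linear(restored, x))
```

Only the weights are stored in int8; activations stay in floating point, which is why the scheme is called weight-only. The round-trip error per weight is bounded by about half of that row's scale.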