Make distinct code and console admonitions so readers are less likely to miss them (#20585)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
@@ -15,7 +15,7 @@ pip install autoawq
|
||||
|
||||
After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from awq import AutoAWQForCausalLM
|
||||
@@ -51,7 +51,7 @@ python examples/offline_inference/llm_engine_example.py \
|
||||
|
||||
AWQ models are also supported directly through the LLM entrypoint:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
@@ -43,7 +43,7 @@ llm = LLM(
|
||||
|
||||
## Read gptq format checkpoint
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
|
||||
@@ -58,7 +58,7 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
|
||||
|
||||
Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from llmcompressor.transformers import oneshot
|
||||
|
||||
@@ -41,7 +41,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
|
||||
|
||||
You can also use the GGUF model directly through the LLM entrypoint:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
@@ -31,7 +31,7 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
|
||||
|
||||
Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
@@ -69,7 +69,7 @@ python examples/offline_inference/llm_engine_example.py \
|
||||
|
||||
GPTQModel quantized models are also supported directly through the LLM entrypoint:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
@@ -53,7 +53,7 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
|
||||
It's best to use calibration data that closely matches your deployment data.
|
||||
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
@@ -78,7 +78,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
|
||||
|
||||
Now, apply the quantization algorithms:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from llmcompressor.transformers import oneshot
|
||||
@@ -141,7 +141,7 @@ lm_eval --model vllm \
|
||||
|
||||
The following is an example of an expanded quantization recipe you can tune to your own use case:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from compressed_tensors.quantization import (
|
||||
|
||||
@@ -54,7 +54,7 @@ When quantizing activations to INT8, you need sample data to estimate the activa
|
||||
It's best to use calibration data that closely matches your deployment data.
|
||||
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
@@ -81,7 +81,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
|
||||
|
||||
Now, apply the quantization algorithms:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from llmcompressor.transformers import oneshot
|
||||
|
||||
@@ -14,7 +14,7 @@ You can quantize HuggingFace models using the example scripts provided in the Te
|
||||
|
||||
Below is an example showing how to quantize a model using modelopt's PTQ API:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
import modelopt.torch.quantization as mtq
|
||||
@@ -50,7 +50,7 @@ with torch.inference_mode():
|
||||
|
||||
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
@@ -35,7 +35,7 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades
|
||||
|
||||
Here is an example of how to enable FP8 quantization:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
# To calculate kv cache scales on the fly enable the calculate_kv_scales
|
||||
@@ -73,7 +73,7 @@ pip install llmcompressor
|
||||
|
||||
Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
|
||||
@@ -42,7 +42,7 @@ The Quark quantization process can be listed for 5 steps as below:
|
||||
Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
|
||||
to fetch model and tokenizer.
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
@@ -65,7 +65,7 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
|
||||
to load calibration data. For more details about how to use calibration datasets efficiently, please refer
|
||||
to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from datasets import load_dataset
|
||||
@@ -98,7 +98,7 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
|
||||
AutoSmoothQuant config file for Llama is
|
||||
`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from quark.torch.quantization import (Config, QuantizationConfig,
|
||||
@@ -145,7 +145,7 @@ HuggingFace `safetensors`, you can refer to
|
||||
[HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
|
||||
for more exporting format details.
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
import torch
|
||||
@@ -176,7 +176,7 @@ for more exporting format details.
|
||||
|
||||
Now, you can load and run the Quark quantized model directly through the LLM entrypoint:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```python
|
||||
from vllm import LLM, SamplingParams
|
||||
|
||||
@@ -15,7 +15,7 @@ pip install \
|
||||
## Quantizing HuggingFace Models
|
||||
You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
|
||||
|
||||
??? Code
|
||||
??? code
|
||||
|
||||
```Python
|
||||
import torch
|
||||
|
||||
Reference in New Issue
Block a user