Make distinct code and console admonitions so readers are less likely to miss them (#20585)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-08 03:55:28 +01:00
parent 31c5d0a1b7
commit af107d5a0e
52 changed files with 192 additions and 162 deletions
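Every hunk in this excerpt makes the same one-word change: the admonition type on a collapsible `??? Code` opener becomes lowercase `??? code`, while the indentation and the fenced block underneath stay the same. A rename like this can be scripted. The sketch below shows one way to do it. It is not necessarily how this commit was produced, and `normalize_code_admonitions` and `rewrite_docs` are hypothetical helper names, not part of vLLM.

```python
import re
from pathlib import Path

# Matches an admonition opener (collapsible "???", expanded "???+", or static "!!!")
# whose type is "code" in any capitalisation. The regex captures indentation and the
# marker so they can be written back; any title after the type is left untouched.
_OPENER = re.compile(
    r"^(?P<indent>[ \t]*)(?P<marker>\?\?\?\+?|!!!)[ \t]+code\b",
    re.IGNORECASE | re.MULTILINE,
)


def normalize_code_admonitions(text: str) -> str:
    """Rewrite openers such as '??? Code' to the lowercase '??? code'."""
    return _OPENER.sub(lambda m: f"{m.group('indent')}{m.group('marker')} code", text)


def rewrite_docs(root: str = "docs") -> list[Path]:
    """Apply the rename to every Markdown file under `root` (hypothetical helper).

    Returns the paths that were actually modified.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.md")):
        original = path.read_text(encoding="utf-8")
        updated = normalize_code_admonitions(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

Calling `rewrite_docs("docs")` from the repository root would produce edits of the shape shown in the hunks. The pattern is deliberately naive. It also rewrites a matching line that sits inside a fenced code block, so review the resulting diff before committing.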
--- a/docs/features/quantization/auto_awq.md
+++ b/docs/features/quantization/auto_awq.md
@@ -15,7 +15,7 @@ pip install autoawq

 After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:

-??? Code
+??? code

    ```python
    from awq import AutoAWQForCausalLM
@@ -51,7 +51,7 @@ python examples/offline_inference/llm_engine_example.py \

 AWQ models are also supported directly through the LLM entrypoint:

-??? Code
+??? code

    ```python
    from vllm import LLM, SamplingParams
--- a/docs/features/quantization/bitblas.md
+++ b/docs/features/quantization/bitblas.md
@@ -43,7 +43,7 @@ llm = LLM(

 ## Read gptq format checkpoint

-??? Code
+??? code

    ```python
    from vllm import LLM
--- a/docs/features/quantization/fp8.md
+++ b/docs/features/quantization/fp8.md
@@ -58,7 +58,7 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r

 Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.

-??? Code
+??? code

    ```python
    from llmcompressor.transformers import oneshot
--- a/docs/features/quantization/gguf.md
+++ b/docs/features/quantization/gguf.md
@@ -41,7 +41,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \

 You can also use the GGUF model directly through the LLM entrypoint:

-??? Code
+??? code

      ```python
      from vllm import LLM, SamplingParams
--- a/docs/features/quantization/gptqmodel.md
+++ b/docs/features/quantization/gptqmodel.md
@@ -31,7 +31,7 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t

 Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:

-??? Code
+??? code

    ```python
    from datasets import load_dataset
@@ -69,7 +69,7 @@ python examples/offline_inference/llm_engine_example.py \

 GPTQModel quantized models are also supported directly through the LLM entrypoint:

-??? Code
+??? code

    ```python
    from vllm import LLM, SamplingParams
--- a/docs/features/quantization/int4.md
+++ b/docs/features/quantization/int4.md
@@ -53,7 +53,7 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:

-??? Code
+??? code

    ```python
    from datasets import load_dataset
@@ -78,7 +78,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra

 Now, apply the quantization algorithms:

-??? Code
+??? code

    ```python
    from llmcompressor.transformers import oneshot
@@ -141,7 +141,7 @@ lm_eval --model vllm \

 The following is an example of an expanded quantization recipe you can tune to your own use case:

-??? Code
+??? code

    ```python
    from compressed_tensors.quantization import (
--- a/docs/features/quantization/int8.md
+++ b/docs/features/quantization/int8.md
@@ -54,7 +54,7 @@ When quantizing activations to INT8, you need sample data to estimate the activa
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:

-??? Code
+??? code

    ```python
    from datasets import load_dataset
@@ -81,7 +81,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra

 Now, apply the quantization algorithms:

-??? Code
+??? code

    ```python
    from llmcompressor.transformers import oneshot
--- a/docs/features/quantization/modelopt.md
+++ b/docs/features/quantization/modelopt.md
@@ -14,7 +14,7 @@ You can quantize HuggingFace models using the example scripts provided in the Te

 Below is an example showing how to quantize a model using modelopt's PTQ API:

-??? Code
+??? code

    ```python
    import modelopt.torch.quantization as mtq
@@ -50,7 +50,7 @@ with torch.inference_mode():

 The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:

-??? Code
+??? code

    ```python
    from vllm import LLM, SamplingParams
--- a/docs/features/quantization/quantized_kvcache.md
+++ b/docs/features/quantization/quantized_kvcache.md
@@ -35,7 +35,7 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades

 Here is an example of how to enable FP8 quantization:

-??? Code
+??? code

    ```python
    # To calculate kv cache scales on the fly enable the calculate_kv_scales
@@ -73,7 +73,7 @@ pip install llmcompressor

 Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):

-??? Code
+??? code

    ```python
    from datasets import load_dataset
--- a/docs/features/quantization/quark.md
+++ b/docs/features/quantization/quark.md
@@ -42,7 +42,7 @@ The Quark quantization process can be listed for 5 steps as below:
 Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
 to fetch model and tokenizer.

-??? Code
+??? code

    ```python
    from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -65,7 +65,7 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
 to load calibration data. For more details about how to use calibration datasets efficiently, please refer
 to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).

-??? Code
+??? code

    ```python
    from datasets import load_dataset
@@ -98,7 +98,7 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
    AutoSmoothQuant config file for Llama is
    `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.

-??? Code
+??? code

    ```python
    from quark.torch.quantization import (Config, QuantizationConfig,
@@ -145,7 +145,7 @@ HuggingFace `safetensors`, you can refer to
 [HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
 for more exporting format details.

-??? Code
+??? code

    ```python
    import torch
@@ -176,7 +176,7 @@ for more exporting format details.

 Now, you can load and run the Quark quantized model directly through the LLM entrypoint:

-??? Code
+??? code

    ```python
    from vllm import LLM, SamplingParams
--- a/docs/features/quantization/torchao.md
+++ b/docs/features/quantization/torchao.md
@@ -15,7 +15,7 @@ pip install \
 ## Quantizing HuggingFace Models
 You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:

-??? Code
+??? code

    ```Python
    import torch