Make distinct code and console admonitions so readers are less likely to miss them (#20585)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
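For context: these docs are built with MkDocs, where a line beginning with `???` opens a collapsible admonition; the first word selects the admonition type (and so its icon and colour), and an optional quoted string sets the title. The old pages used capitalized pseudo-types such as `Code` and `Command`, which match no registered type, so every block fell back to the same generic styling and was easy to skim past. This commit switches them to lowercase `code` and `console` types (presumably registered as custom admonitions in vLLM's MkDocs configuration, which is outside this diff) so the two kinds of block render distinctly. A minimal sketch of the pattern being changed, reusing a command that appears in the diff below:

````markdown
<!-- Before: "Command" is not a registered admonition type, so the block renders generically -->
??? Command

    ```bash
    curl localhost:8000/v1/models | jq .
    ```

<!-- After: the lowercase "console" type gets its own styling, and the quoted string sets the title -->
??? console "Command"

    ```bash
    curl localhost:8000/v1/models | jq .
    ```
````

Where no quoted title is given (as with the plain `??? code` blocks below), MkDocs Material falls back to the type name in title case, so readers should still see "Code" as the heading while gaining the distinct styling.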
@@ -29,7 +29,7 @@ We can now submit the prompts and call `llm.generate` with the `lora_request` pa
 of `LoRARequest` is a human identifiable name, the second parameter is a globally unique ID for the adapter and
 the third parameter is the path to the LoRA adapter.
 
-??? Code
+??? code
 
     ```python
     sampling_params = SamplingParams(

@@ -70,7 +70,7 @@ The server entrypoint accepts all other LoRA configuration parameters (`max_lora
 etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
 with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it.):
 
-??? Command
+??? console "Command"
 
     ```bash
     curl localhost:8000/v1/models | jq .

@@ -172,7 +172,7 @@ Alternatively, follow these example steps to implement your own plugin:
 
 1. Implement the LoRAResolver interface.
 
-    ??? Example of a simple S3 LoRAResolver implementation
+    ??? code "Example of a simple S3 LoRAResolver implementation"
 
         ```python
         import os

@@ -238,7 +238,7 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
 - The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
 - The `root` field points to the artifact location of the lora adapter.
 
-??? Command output
+??? console "Command output"
 
     ```bash
     $ curl http://localhost:8000/v1/models

@@ -20,7 +20,7 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
 
 You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -68,7 +68,7 @@ Full example: <gh-file:examples/offline_inference/vision_language.py>
 
 To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -146,7 +146,7 @@ for o in outputs:
 
 Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -193,7 +193,7 @@ Full example: <gh-file:examples/offline_inference/audio_language.py>
 To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
 pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -220,7 +220,7 @@ pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the cor
 
 For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
 
-??? Code
+??? code
 
     ```python
     # Construct the prompt based on your model

@@ -288,7 +288,7 @@ vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
 
 Then, you can use the OpenAI client as follows:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -366,7 +366,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
 
 Then, you can use the OpenAI client as follows:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -430,7 +430,7 @@ vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b
 
 Then, you can use the OpenAI client as follows:
 
-??? Code
+??? code
 
     ```python
     import base64

@@ -486,7 +486,7 @@ Then, you can use the OpenAI client as follows:
 
 Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input:
 
-??? Code
+??? code
 
     ```python
     chat_completion_from_url = client.chat.completions.create(

@@ -531,7 +531,7 @@ pass a tensor of shape to the corresponding field of the multi-modal dictionary.
 For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field.
 The following example demonstrates how to pass image embeddings to the OpenAI server:
 
-??? Code
+??? code
 
     ```python
     image_embedding = torch.load(...)

@@ -15,7 +15,7 @@ pip install autoawq
 
 After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
 
-??? Code
+??? code
 
     ```python
     from awq import AutoAWQForCausalLM

@@ -51,7 +51,7 @@ python examples/offline_inference/llm_engine_example.py \
 
 AWQ models are also supported directly through the LLM entrypoint:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -43,7 +43,7 @@ llm = LLM(
 
 ## Read gptq format checkpoint
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -58,7 +58,7 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
 
 Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
 
-??? Code
+??? code
 
     ```python
     from llmcompressor.transformers import oneshot

@@ -41,7 +41,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
 
 You can also use the GGUF model directly through the LLM entrypoint:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -31,7 +31,7 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
 
 Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -69,7 +69,7 @@ python examples/offline_inference/llm_engine_example.py \
 
 GPTQModel quantized models are also supported directly through the LLM entrypoint:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -53,7 +53,7 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -78,7 +78,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
 
 Now, apply the quantization algorithms:
 
-??? Code
+??? code
 
     ```python
     from llmcompressor.transformers import oneshot

@@ -141,7 +141,7 @@ lm_eval --model vllm \
 
 The following is an example of an expanded quantization recipe you can tune to your own use case:
 
-??? Code
+??? code
 
     ```python
     from compressed_tensors.quantization import (

@@ -54,7 +54,7 @@ When quantizing activations to INT8, you need sample data to estimate the activa
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -81,7 +81,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
 
 Now, apply the quantization algorithms:
 
-??? Code
+??? code
 
     ```python
     from llmcompressor.transformers import oneshot

@@ -14,7 +14,7 @@ You can quantize HuggingFace models using the example scripts provided in the Te
 
 Below is an example showing how to quantize a model using modelopt's PTQ API:
 
-??? Code
+??? code
 
     ```python
     import modelopt.torch.quantization as mtq

@@ -50,7 +50,7 @@ with torch.inference_mode():
 
 The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -35,7 +35,7 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades
 
 Here is an example of how to enable FP8 quantization:
 
-??? Code
+??? code
 
     ```python
     # To calculate kv cache scales on the fly enable the calculate_kv_scales

@@ -73,7 +73,7 @@ pip install llmcompressor
 
 Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -42,7 +42,7 @@ The Quark quantization process can be listed for 5 steps as below:
 Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
 to fetch model and tokenizer.
 
-??? Code
+??? code
 
     ```python
     from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -65,7 +65,7 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
 to load calibration data. For more details about how to use calibration datasets efficiently, please refer
 to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -98,7 +98,7 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
 AutoSmoothQuant config file for Llama is
 `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
 
-??? Code
+??? code
 
     ```python
     from quark.torch.quantization import (Config, QuantizationConfig,

@@ -145,7 +145,7 @@ HuggingFace `safetensors`, you can refer to
 [HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
 for more exporting format details.
 
-??? Code
+??? code
 
     ```python
     import torch

@@ -176,7 +176,7 @@ for more exporting format details.
 
 Now, you can load and run the Quark quantized model directly through the LLM entrypoint:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -15,7 +15,7 @@ pip install \
 ## Quantizing HuggingFace Models
 You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
 
-??? Code
+??? code
 
     ```Python
     import torch

@@ -33,7 +33,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
 
 Next, make a request to the model that should return the reasoning content in the response.
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -70,7 +70,7 @@ The `reasoning_content` field contains the reasoning steps that led to the final
 
 Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
 
-??? Json
+??? console "Json"
 
     ```json
     {

@@ -95,7 +95,7 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
 
 OpenAI Python client library does not officially support `reasoning_content` attribute for streaming output. But the client supports extra attributes in the response. You can use `hasattr` to check if the `reasoning_content` attribute is present in the response. For example:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -152,7 +152,7 @@ Remember to check whether the `reasoning_content` exists in the response before
 
 The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`.
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -200,7 +200,7 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
 
 You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
 
-??? Code
+??? code
 
     ```python
     # import the required packages

@@ -258,7 +258,7 @@ You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_
 
 Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
 
-??? Code
+??? code
 
     ```python
     @dataclass

@@ -18,7 +18,7 @@ Speculative decoding is a technique which improves inter-token latency in memory
 
 The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -62,7 +62,7 @@ python -m vllm.entrypoints.openai.api_server \
 
 Then use a client:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -103,7 +103,7 @@ Then use a client:
 The following code configures vLLM to use speculative decoding where proposals are generated by
 matching n-grams in the prompt. For more information read [this thread.](https://x.com/joao_gante/status/1747322413006643259)
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -137,7 +137,7 @@ draft models that conditioning draft predictions on both context vectors and sam
 For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
 [this technical report](https://arxiv.org/abs/2404.19124).
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -185,7 +185,7 @@ A variety of speculative models of this type are available on HF hub:
 The following code configures vLLM to use speculative decoding where proposals are generated by
 an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -33,7 +33,7 @@ text.
 
 Now let´s see an example for each of the cases, starting with the `guided_choice`, as it´s the easiest one:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -55,7 +55,7 @@ Now let´s see an example for each of the cases, starting with the `guided_choic
 
 The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
 
-??? Code
+??? code
 
     ```python
     completion = client.chat.completions.create(

@@ -79,7 +79,7 @@ For this we can use the `guided_json` parameter in two different ways:
 
 The next example shows how to use the `guided_json` parameter with a Pydantic model:
 
-??? Code
+??? code
 
     ```python
     from pydantic import BaseModel

@@ -127,7 +127,7 @@ difficult to use, but it´s really powerful. It allows us to define complete
 languages like SQL queries. It works by using a context free EBNF grammar.
 As an example, we can use to define a specific format of simplified SQL queries:
 
-??? Code
+??? code
 
     ```python
     simplified_sql_grammar = """

@@ -169,7 +169,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r
 
 Note that you can use reasoning with any provided structured outputs feature. The following uses one with JSON schema:
 
-??? Code
+??? code
 
     ```python
     from pydantic import BaseModel

@@ -212,7 +212,7 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3.
 
 Here is a simple example demonstrating how to get structured output using Pydantic models:
 
-??? Code
+??? code
 
     ```python
     from pydantic import BaseModel

@@ -248,7 +248,7 @@ Age: 28
 
 Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
 
-??? Code
+??? code
 
     ```python
     from typing import List

@@ -308,7 +308,7 @@ These parameters can be used in the same way as the parameters from the Online
 Serving examples above. One example for the usage of the `choice` parameter is
 shown below:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -15,7 +15,7 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \
 
 Next, make a request to the model that should result in it using the available tools:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -320,7 +320,7 @@ A tool parser plugin is a Python file containing one or more ToolParser implemen
 
 Here is a summary of a plugin file:
 
-??? Code
+??? code
 
     ```python
 