Make distinct code and console admonitions so readers are less likely to miss them (#20585)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
Harry Mellor
2025-07-08 03:55:28 +01:00
committed by GitHub
parent 31c5d0a1b7
commit af107d5a0e
52 changed files with 192 additions and 162 deletions

View File

@@ -29,7 +29,7 @@ We can now submit the prompts and call `llm.generate` with the `lora_request` pa
of `LoRARequest` is a human-identifiable name, the second parameter is a globally unique ID for the adapter, and
the third parameter is the path to the LoRA adapter.
-??? Code
+??? code
```python
sampling_params = SamplingParams(
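
For orientation, a minimal end-to-end sketch of the call this hunk builds toward (the model, adapter name, ID, and path below are illustrative placeholders) might look like:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# LoRARequest(name, id, path): a human-identifiable name, a globally
# unique integer ID, and the path to the adapter weights.
lora_request = LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter")

outputs = llm.generate(
    ["Write a SQL query listing all users."],
    sampling_params,
    lora_request=lora_request,
)
print(outputs[0].outputs[0].text)
```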
@@ -70,7 +70,7 @@ The server entrypoint accepts all other LoRA configuration parameters (`max_lora
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it):
-??? Command
+??? console "Command"
```bash
curl localhost:8000/v1/models | jq .
@@ -172,7 +172,7 @@ Alternatively, follow these example steps to implement your own plugin:
1. Implement the LoRAResolver interface.
-??? Example of a simple S3 LoRAResolver implementation
+??? code "Example of a simple S3 LoRAResolver implementation"
```python
import os
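
Before reading the full S3 version, it may help to see the shape of the interface on a simpler backend. The following is a hypothetical filesystem-backed resolver, sketched on the assumption that `LoRAResolver` exposes an async `resolve_lora(base_model_name, lora_name)` method returning an optional `LoRARequest`:

```python
import os
from typing import Optional

from vllm.lora.request import LoRARequest
from vllm.lora.resolver import LoRAResolver


class LocalLoRAResolver(LoRAResolver):
    """Hypothetical resolver that looks up adapters in a local directory."""

    def __init__(self, lora_root: str):
        self.lora_root = lora_root

    async def resolve_lora(
        self, base_model_name: str, lora_name: str
    ) -> Optional[LoRARequest]:
        lora_path = os.path.join(self.lora_root, lora_name)
        if not os.path.isdir(lora_path):
            return None  # adapter not available on this resolver
        return LoRARequest(
            lora_name=lora_name,
            lora_int_id=abs(hash(lora_name)) % (2**31),
            lora_path=lora_path,
        )
```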
@@ -238,7 +238,7 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
- The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
- The `root` field points to the artifact location of the LoRA adapter.
-??? Command output
+??? console "Command output"
```bash
$ curl http://localhost:8000/v1/models

View File

@@ -20,7 +20,7 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
-??? Code
+??? code
```python
from vllm import LLM
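
A compact version of this pattern, using LLaVA as one illustrative model (the prompt template and image path are placeholders):

```python
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Pass the PIL image under the 'image' key of multi_modal_data.
image = Image.open("example.jpg")
outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this image?\nASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```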
@@ -68,7 +68,7 @@ Full example: <gh-file:examples/offline_inference/vision_language.py>
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
-??? Code
+??? code
```python
from vllm import LLM
@@ -146,7 +146,7 @@ for o in outputs:
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
-??? Code
+??? code
```python
from vllm import LLM
@@ -193,7 +193,7 @@ Full example: <gh-file:examples/offline_inference/audio_language.py>
To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
-??? Code
+??? code
```python
from vllm import LLM
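
As an illustrative sketch, with dimensions invented to match a hypothetical 576-patch vision encoder feeding a language model with hidden size 4096:

```python
import torch
from vllm import LLM

# Shape (num_items, feature_size, hidden_size of LM):
# 1 image, 576 features, 4096 hidden dims.
image_embeds = torch.randn(1, 576, 4096)

llm = LLM(model="llava-hf/llava-1.5-7b-hf")
outputs = llm.generate({
    "prompt": "USER: <image>\nDescribe the image.\nASSISTANT:",
    "multi_modal_data": {"image": image_embeds},
})
print(outputs[0].outputs[0].text)
```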
@@ -220,7 +220,7 @@ pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the cor
For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
-??? Code
+??? code
```python
# Construct the prompt based on your model
@@ -288,7 +288,7 @@ vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
Then, you can use the OpenAI client as follows:
-??? Code
+??? code
```python
from openai import OpenAI
@@ -366,7 +366,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
Then, you can use the OpenAI client as follows:
-??? Code
+??? code
```python
from openai import OpenAI
@@ -430,7 +430,7 @@ vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b
Then, you can use the OpenAI client as follows:
-??? Code
+??? code
```python
import base64
@@ -486,7 +486,7 @@ Then, you can use the OpenAI client as follows:
Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input:
-??? Code
+??? code
```python
chat_completion_from_url = client.chat.completions.create(
@@ -531,7 +531,7 @@ pass a tensor of shape to the corresponding field of the multi-modal dictionary.
For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field.
The following example demonstrates how to pass image embeddings to the OpenAI server:
-??? Code
+??? code
```python
image_embedding = torch.load(...)

View File

@@ -15,7 +15,7 @@ pip install autoawq
After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
-??? Code
+??? code
```python
from awq import AutoAWQForCausalLM
@@ -51,7 +51,7 @@ python examples/offline_inference/llm_engine_example.py \
AWQ models are also supported directly through the LLM entrypoint:
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams

View File

@@ -43,7 +43,7 @@ llm = LLM(
## Read gptq format checkpoint
-??? Code
+??? code
```python
from vllm import LLM

View File

@@ -58,7 +58,7 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
-??? Code
+??? code
```python
from llmcompressor.transformers import oneshot
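
A representative flow, sketched after llm-compressor's published FP8 dynamic example (treat the exact module paths and scheme name as assumptions to verify against the library version you install):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# RTN FP8 weights plus dynamic FP8 activations: no calibration data required.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe, output_dir="Meta-Llama-3-8B-Instruct-FP8-Dynamic")
```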

View File

@@ -41,7 +41,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
You can also use the GGUF model directly through the LLM entrypoint:
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams

View File

@@ -31,7 +31,7 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
-??? Code
+??? code
```python
from datasets import load_dataset
@@ -69,7 +69,7 @@ python examples/offline_inference/llm_engine_example.py \
GPTQModel quantized models are also supported directly through the LLM entrypoint:
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams

View File

@@ -53,7 +53,7 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
-??? Code
+??? code
```python
from datasets import load_dataset
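
For instance, the calibration split might be prepared like this (the `HuggingFaceH4/ultrachat_200k` repo id and the sample/sequence budgets are assumptions; swap in whatever matches your deployment):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"
NUM_SAMPLES, MAX_LEN = 512, 2048

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")

# Render each chat through the model's template so the calibration text
# matches what the model will actually see at deployment time.
def preprocess(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, max_length=MAX_LEN, truncation=True,
                     add_special_tokens=False)

ds = ds.shuffle(seed=42).map(preprocess, remove_columns=ds.column_names)
```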
@@ -78,7 +78,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
Now, apply the quantization algorithms:
-??? Code
+??? code
```python
from llmcompressor.transformers import oneshot
@@ -141,7 +141,7 @@ lm_eval --model vllm \
The following is an example of an expanded quantization recipe you can tune to your own use case:
-??? Code
+??? code
```python
from compressed_tensors.quantization import (

View File

@@ -54,7 +54,7 @@ When quantizing activations to INT8, you need sample data to estimate the activa
It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
-??? Code
+??? code
```python
from datasets import load_dataset
@@ -81,7 +81,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
Now, apply the quantization algorithms:
-??? Code
+??? code
```python
from llmcompressor.transformers import oneshot

View File

@@ -14,7 +14,7 @@ You can quantize HuggingFace models using the example scripts provided in the Te
Below is an example showing how to quantize a model using modelopt's PTQ API:
-??? Code
+??? code
```python
import modelopt.torch.quantization as mtq
@@ -50,7 +50,7 @@ with torch.inference_mode():
The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams

View File

@@ -35,7 +35,7 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades
Here is an example of how to enable FP8 quantization:
-??? Code
+??? code
```python
# To calculate kv cache scales on the fly enable the calculate_kv_scales
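
Spelled out as a runnable sketch (the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# kv_cache_dtype="fp8" stores the KV cache in FP8; calculate_kv_scales=True
# computes the scaling factors on the fly instead of loading precomputed ones.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",
    calculate_kv_scales=True,
)
out = llm.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```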
@@ -73,7 +73,7 @@ pip install llmcompressor
Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):
-??? Code
+??? code
```python
from datasets import load_dataset

View File

@@ -42,7 +42,7 @@ The Quark quantization process can be listed for 5 steps as below:
Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
to fetch the model and tokenizer.
-??? Code
+??? code
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -65,7 +65,7 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
to load calibration data. For more details about how to use calibration datasets efficiently, please refer
to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).
-??? Code
+??? code
```python
from datasets import load_dataset
@@ -98,7 +98,7 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
The AutoSmoothQuant config file for Llama is
`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
-??? Code
+??? code
```python
from quark.torch.quantization import (Config, QuantizationConfig,
@@ -145,7 +145,7 @@ HuggingFace `safetensors`, you can refer to
[HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
for more exporting format details.
-??? Code
+??? code
```python
import torch
@@ -176,7 +176,7 @@ for more exporting format details.
Now, you can load and run the Quark quantized model directly through the LLM entrypoint:
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams

View File

@@ -15,7 +15,7 @@ pip install \
## Quantizing HuggingFace Models
You can quantize your own Hugging Face model with torchao, e.g. via [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to the Hugging Face Hub like [this one](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
-??? Code
+??? code
```Python
import torch

View File

@@ -33,7 +33,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
Next, make a request to the model that should return the reasoning content in the response.
-??? Code
+??? code
```python
from openai import OpenAI
@@ -70,7 +70,7 @@ The `reasoning_content` field contains the reasoning steps that led to the final
Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
-??? Json
+??? console "Json"
```json
{
@@ -95,7 +95,7 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
The OpenAI Python client library does not officially support the `reasoning_content` attribute for streaming output, but the client supports extra attributes in the response. You can use `hasattr` to check whether the `reasoning_content` attribute is present in the response. For example:
-??? Code
+??? code
```python
from openai import OpenAI
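
A compact version of that check (the server URL and model name are placeholders for whatever you are serving):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    # The official client doesn't declare reasoning_content, so probe for it.
    if hasattr(delta, "reasoning_content") and delta.reasoning_content:
        print(delta.reasoning_content, end="", flush=True)
    elif delta.content:
        print(delta.content, end="", flush=True)
```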
@@ -152,7 +152,7 @@ Remember to check whether the `reasoning_content` exists in the response before
The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`.
-??? Code
+??? code
```python
from openai import OpenAI
@@ -200,7 +200,7 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
-??? Code
+??? code
```python
# import the required packages
@@ -258,7 +258,7 @@ You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_
Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
-??? Code
+??? code
```python
@dataclass

View File

@@ -18,7 +18,7 @@ Speculative decoding is a technique which improves inter-token latency in memory
The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams
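
In outline, the configuration might look like this (the model pairing is illustrative, and the `speculative_config` argument shape should be checked against your vLLM version):

```python
from vllm import LLM, SamplingParams

# The small draft model proposes 5 tokens per step; the target model verifies.
llm = LLM(
    model="facebook/opt-6.7b",
    speculative_config={
        "model": "facebook/opt-125m",
        "num_speculative_tokens": 5,
    },
)
outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```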
@@ -62,7 +62,7 @@ python -m vllm.entrypoints.openai.api_server \
Then use a client:
-??? Code
+??? code
```python
from openai import OpenAI
@@ -103,7 +103,7 @@ Then use a client:
The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information, read [this thread](https://x.com/joao_gante/status/1747322413006643259).
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams
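
Sketched out, with the same caveat that the exact `speculative_config` keys are an assumption to verify against your vLLM version:

```python
from vllm import LLM, SamplingParams

# Proposals come from n-gram lookup in the prompt itself: no draft model needed.
llm = LLM(
    model="facebook/opt-6.7b",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,
        "prompt_lookup_max": 4,
    },
)
outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.8))
print(outputs[0].outputs[0].text)
```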
@@ -137,7 +137,7 @@ draft models that conditioning draft predictions on both context vectors and sam
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
[this technical report](https://arxiv.org/abs/2404.19124).
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams
@@ -185,7 +185,7 @@ A variety of speculative models of this type are available on HF hub:
The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams

View File

@@ -33,7 +33,7 @@ text.
Now let's see an example for each of the cases, starting with `guided_choice`, as it's the easiest one:
-??? Code
+??? code
```python
from openai import OpenAI
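
In miniature (the served model name and base URL are whatever you launched the server with):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    # The model may only answer with one of these two strings.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```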
@@ -55,7 +55,7 @@ Now let´s see an example for each of the cases, starting with the `guided_choic
The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
-??? Code
+??? code
```python
completion = client.chat.completions.create(
@@ -79,7 +79,7 @@ For this we can use the `guided_json` parameter in two different ways:
The next example shows how to use the `guided_json` parameter with a Pydantic model:
-??? Code
+??? code
```python
from pydantic import BaseModel
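
A condensed sketch of that approach (the schema and prompt are invented for illustration):

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


class CarDescription(BaseModel):
    brand: str
    model: str
    car_type: str


completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[
        {"role": "user", "content": "Generate a JSON for the most iconic car of the 90s."}
    ],
    # Constrain generation to the JSON schema derived from the Pydantic model.
    extra_body={"guided_json": CarDescription.model_json_schema()},
)
print(completion.choices[0].message.content)
```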
@@ -127,7 +127,7 @@ difficult to use, but it´s really powerful. It allows us to define complete
languages like SQL queries. It works by using a context-free EBNF grammar.
As an example, we can use it to define a specific format of simplified SQL queries:
-??? Code
+??? code
```python
simplified_sql_grammar = """
@@ -169,7 +169,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r
Note that you can use reasoning with any provided structured outputs feature. The following uses one with JSON schema:
-??? Code
+??? code
```python
from pydantic import BaseModel
@@ -212,7 +212,7 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3.
Here is a simple example demonstrating how to get structured output using Pydantic models:
-??? Code
+??? code
```python
from pydantic import BaseModel
@@ -248,7 +248,7 @@ Age: 28
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
-??? Code
+??? code
```python
from typing import List
@@ -308,7 +308,7 @@ These parameters can be used in the same way as the parameters from the Online
Serving examples above. One example of using the `choice` parameter is
shown below:
-??? Code
+??? code
```python
from vllm import LLM, SamplingParams

View File

@@ -15,7 +15,7 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \
Next, make a request to the model that should result in it using the available tools:
-??? Code
+??? code
```python
from openai import OpenAI
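
As a sketch of such a request (the `get_weather` tool is a made-up example, not one provided by vLLM):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather like in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```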
@@ -320,7 +320,7 @@ A tool parser plugin is a Python file containing one or more ToolParser implemen
Here is a summary of a plugin file:
-??? Code
+??? code
```python