Make distinct code and console admonitions so readers are less likely to miss them (#20585)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
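For context: these docs are built with MkDocs, where a line beginning with `???` opens a collapsible admonition; the first word selects the admonition type (and so its icon and colour), and an optional quoted string sets the title. The old pages used capitalized pseudo-types such as `Code` and `Command`, which match no registered type, so every block fell back to the same generic styling and was easy to skim past. This commit switches them to lowercase `code` and `console` types (presumably registered as custom admonitions in vLLM's MkDocs configuration, which is outside this diff) so the two kinds of block render distinctly. A minimal sketch of the pattern being changed, reusing a command that appears in the diff below:

````markdown
<!-- Before: "Command" is not a registered admonition type, so the block renders generically -->
??? Command

    ```bash
    curl localhost:8000/v1/models | jq .
    ```

<!-- After: the lowercase "console" type gets its own styling, and the quoted string sets the title -->
??? console "Command"

    ```bash
    curl localhost:8000/v1/models | jq .
    ```
````

Where no quoted title is given (as with the plain `??? code` blocks below), MkDocs Material falls back to the type name in title case, so readers should still see "Code" as the heading while gaining the distinct styling.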
@@ -29,7 +29,7 @@ We can now submit the prompts and call `llm.generate` with the `lora_request` pa
 of `LoRARequest` is a human identifiable name, the second parameter is a globally unique ID for the adapter and
 the third parameter is the path to the LoRA adapter.
 
-??? Code
+??? code
 
     ```python
     sampling_params = SamplingParams(

@@ -70,7 +70,7 @@ The server entrypoint accepts all other LoRA configuration parameters (`max_lora
 etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
 with its base model (if `jq` is not installed, you can follow [this guide](https://jqlang.org/download/) to install it.):
 
-??? Command
+??? console "Command"
 
     ```bash
     curl localhost:8000/v1/models | jq .

@@ -172,7 +172,7 @@ Alternatively, follow these example steps to implement your own plugin:
 
 1. Implement the LoRAResolver interface.
 
-    ??? Example of a simple S3 LoRAResolver implementation
+    ??? code "Example of a simple S3 LoRAResolver implementation"
 
         ```python
         import os

@@ -238,7 +238,7 @@ The new format of `--lora-modules` is mainly to support the display of parent mo
 - The `parent` field of LoRA model `sql-lora` now links to its base model `meta-llama/Llama-2-7b-hf`. This correctly reflects the hierarchical relationship between the base model and the LoRA adapter.
 - The `root` field points to the artifact location of the lora adapter.
 
-??? Command output
+??? console "Command output"
 
     ```bash
     $ curl http://localhost:8000/v1/models

@@ -20,7 +20,7 @@ To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
 
 You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -68,7 +68,7 @@ Full example: <gh-file:examples/offline_inference/vision_language.py>
 
 To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -146,7 +146,7 @@ for o in outputs:
 
 Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -193,7 +193,7 @@ Full example: <gh-file:examples/offline_inference/audio_language.py>
 To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
 pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -220,7 +220,7 @@ pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the cor
 
 For Qwen2-VL and MiniCPM-V, we accept additional parameters alongside the embeddings:
 
-??? Code
+??? code
 
     ```python
     # Construct the prompt based on your model

@@ -288,7 +288,7 @@ vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
 
 Then, you can use the OpenAI client as follows:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -366,7 +366,7 @@ vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model
 
 Then, you can use the OpenAI client as follows:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -430,7 +430,7 @@ vllm serve fixie-ai/ultravox-v0_5-llama-3_2-1b
 
 Then, you can use the OpenAI client as follows:
 
-??? Code
+??? code
 
     ```python
     import base64

@@ -486,7 +486,7 @@ Then, you can use the OpenAI client as follows:
 
 Alternatively, you can pass `audio_url`, which is the audio counterpart of `image_url` for image input:
 
-??? Code
+??? code
 
     ```python
     chat_completion_from_url = client.chat.completions.create(

@@ -531,7 +531,7 @@ pass a tensor of shape to the corresponding field of the multi-modal dictionary.
 For image embeddings, you can pass the base64-encoded tensor to the `image_embeds` field.
 The following example demonstrates how to pass image embeddings to the OpenAI server:
 
-??? Code
+??? code
 
     ```python
     image_embedding = torch.load(...)

@@ -15,7 +15,7 @@ pip install autoawq
 
 After installing AutoAWQ, you are ready to quantize a model. Please refer to the [AutoAWQ documentation](https://casper-hansen.github.io/AutoAWQ/examples/#basic-quantization) for further details. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
 
-??? Code
+??? code
 
     ```python
     from awq import AutoAWQForCausalLM

@@ -51,7 +51,7 @@ python examples/offline_inference/llm_engine_example.py \
 
 AWQ models are also supported directly through the LLM entrypoint:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -43,7 +43,7 @@ llm = LLM(
 
 ## Read gptq format checkpoint
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM

@@ -58,7 +58,7 @@ For FP8 quantization, we can recover accuracy with simple RTN quantization. We r
 
 Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
 
-??? Code
+??? code
 
     ```python
     from llmcompressor.transformers import oneshot

@@ -41,7 +41,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
 
 You can also use the GGUF model directly through the LLM entrypoint:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -31,7 +31,7 @@ After installing GPTQModel, you are ready to quantize a model. Please refer to t
 
 Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -69,7 +69,7 @@ python examples/offline_inference/llm_engine_example.py \
 
 GPTQModel quantized models are also supported directly through the LLM entrypoint:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -53,7 +53,7 @@ When quantizing weights to INT4, you need sample data to estimate the weight upd
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -78,7 +78,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
 
 Now, apply the quantization algorithms:
 
-??? Code
+??? code
 
     ```python
     from llmcompressor.transformers import oneshot

@@ -141,7 +141,7 @@ lm_eval --model vllm \
 
 The following is an example of an expanded quantization recipe you can tune to your own use case:
 
-??? Code
+??? code
 
     ```python
     from compressed_tensors.quantization import (

@@ -54,7 +54,7 @@ When quantizing activations to INT8, you need sample data to estimate the activa
 It's best to use calibration data that closely matches your deployment data.
 For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -81,7 +81,7 @@ For a general-purpose instruction-tuned model, you can use a dataset like `ultra
 
 Now, apply the quantization algorithms:
 
-??? Code
+??? code
 
     ```python
     from llmcompressor.transformers import oneshot

@@ -14,7 +14,7 @@ You can quantize HuggingFace models using the example scripts provided in the Te
 
 Below is an example showing how to quantize a model using modelopt's PTQ API:
 
-??? Code
+??? code
 
     ```python
     import modelopt.torch.quantization as mtq

@@ -50,7 +50,7 @@ with torch.inference_mode():
 
 The quantized checkpoint can then be deployed with vLLM. As an example, the following code shows how to deploy `nvidia/Llama-3.1-8B-Instruct-FP8`, which is the FP8 quantized checkpoint derived from `meta-llama/Llama-3.1-8B-Instruct`, using vLLM:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -35,7 +35,7 @@ Studies have shown that FP8 E4M3 quantization typically only minimally degrades
 
 Here is an example of how to enable FP8 quantization:
 
-??? Code
+??? code
 
     ```python
     # To calculate kv cache scales on the fly enable the calculate_kv_scales

@@ -73,7 +73,7 @@ pip install llmcompressor
 
 Here's a complete example using `meta-llama/Llama-3.1-8B-Instruct` (most models can use this same pattern):
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -42,7 +42,7 @@ The Quark quantization process can be listed for 5 steps as below:
 Quark uses [Transformers](https://huggingface.co/docs/transformers/en/index)
 to fetch model and tokenizer.
 
-??? Code
+??? code
 
     ```python
     from transformers import AutoTokenizer, AutoModelForCausalLM

@@ -65,7 +65,7 @@ Quark uses the [PyTorch Dataloader](https://pytorch.org/tutorials/beginner/basic
 to load calibration data. For more details about how to use calibration datasets efficiently, please refer
 to [Adding Calibration Datasets](https://quark.docs.amd.com/latest/pytorch/calibration_datasets.html).
 
-??? Code
+??? code
 
     ```python
     from datasets import load_dataset

@@ -98,7 +98,7 @@ kv-cache and the quantization algorithm is AutoSmoothQuant.
 AutoSmoothQuant config file for Llama is
 `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
 
-??? Code
+??? code
 
     ```python
     from quark.torch.quantization import (Config, QuantizationConfig,

@@ -145,7 +145,7 @@ HuggingFace `safetensors`, you can refer to
 [HuggingFace format exporting](https://quark.docs.amd.com/latest/pytorch/export/quark_export_hf.html)
 for more exporting format details.
 
-??? Code
+??? code
 
     ```python
     import torch

@@ -176,7 +176,7 @@ for more exporting format details.
 
 Now, you can load and run the Quark quantized model directly through the LLM entrypoint:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -15,7 +15,7 @@ pip install \
 ## Quantizing HuggingFace Models
 You can quantize your own huggingface model with torchao, e.g. [transformers](https://huggingface.co/docs/transformers/main/en/quantization/torchao) and [diffusers](https://huggingface.co/docs/diffusers/en/quantization/torchao), and save the checkpoint to huggingface hub like [this](https://huggingface.co/jerryzh168/llama3-8b-int8wo) with the following example code:
 
-??? Code
+??? code
 
     ```Python
     import torch

@@ -33,7 +33,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
 
 Next, make a request to the model that should return the reasoning content in the response.
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -70,7 +70,7 @@ The `reasoning_content` field contains the reasoning steps that led to the final
 
 Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).
 
-??? Json
+??? console "Json"
 
     ```json
     {

@@ -95,7 +95,7 @@ Streaming chat completions are also supported for reasoning models. The `reasoni
 
 OpenAI Python client library does not officially support `reasoning_content` attribute for streaming output. But the client supports extra attributes in the response. You can use `hasattr` to check if the `reasoning_content` attribute is present in the response. For example:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -152,7 +152,7 @@ Remember to check whether the `reasoning_content` exists in the response before
 
 The reasoning content is also available when both tool calling and the reasoning parser are enabled. Additionally, tool calling only parses functions from the `content` field, not from the `reasoning_content`.
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -200,7 +200,7 @@ For more examples, please refer to <gh-file:examples/online_serving/openai_chat_
 
 You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
 
-??? Code
+??? code
 
     ```python
     # import the required packages

@@ -258,7 +258,7 @@ You can add a new `ReasoningParser` similar to <gh-file:vllm/reasoning/deepseek_
 
 Additionally, to enable structured output, you'll need to create a new `Reasoner` similar to the one in <gh-file:vllm/reasoning/deepseek_r1_reasoning_parser.py>.
 
-??? Code
+??? code
 
     ```python
     @dataclass

@@ -18,7 +18,7 @@ Speculative decoding is a technique which improves inter-token latency in memory
 
 The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time.
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -62,7 +62,7 @@ python -m vllm.entrypoints.openai.api_server \
 
 Then use a client:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -103,7 +103,7 @@ Then use a client:
 The following code configures vLLM to use speculative decoding where proposals are generated by
 matching n-grams in the prompt. For more information read [this thread.](https://x.com/joao_gante/status/1747322413006643259)
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -137,7 +137,7 @@ draft models that conditioning draft predictions on both context vectors and sam
 For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
 [this technical report](https://arxiv.org/abs/2404.19124).
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -185,7 +185,7 @@ A variety of speculative models of this type are available on HF hub:
 The following code configures vLLM to use speculative decoding where proposals are generated by
 an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model. A more detailed example for offline mode, including how to extract request level acceptance rate, can be found [here](gh-file:examples/offline_inference/eagle.py).
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -33,7 +33,7 @@ text.
 
 Now let´s see an example for each of the cases, starting with the `guided_choice`, as it´s the easiest one:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -55,7 +55,7 @@ Now let´s see an example for each of the cases, starting with the `guided_choic
 
 The next example shows how to use the `guided_regex`. The idea is to generate an email address, given a simple regex template:
 
-??? Code
+??? code
 
     ```python
     completion = client.chat.completions.create(

@@ -79,7 +79,7 @@ For this we can use the `guided_json` parameter in two different ways:
 
 The next example shows how to use the `guided_json` parameter with a Pydantic model:
 
-??? Code
+??? code
 
     ```python
     from pydantic import BaseModel

@@ -127,7 +127,7 @@ difficult to use, but it´s really powerful. It allows us to define complete
 languages like SQL queries. It works by using a context free EBNF grammar.
 As an example, we can use to define a specific format of simplified SQL queries:
 
-??? Code
+??? code
 
     ```python
     simplified_sql_grammar = """

@@ -169,7 +169,7 @@ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r
 
 Note that you can use reasoning with any provided structured outputs feature. The following uses one with JSON schema:
 
-??? Code
+??? code
 
     ```python
     from pydantic import BaseModel

@@ -212,7 +212,7 @@ For the following examples, vLLM was setup using `vllm serve meta-llama/Llama-3.
 
 Here is a simple example demonstrating how to get structured output using Pydantic models:
 
-??? Code
+??? code
 
     ```python
     from pydantic import BaseModel

@@ -248,7 +248,7 @@ Age: 28
 
 Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
 
-??? Code
+??? code
 
     ```python
     from typing import List

@@ -308,7 +308,7 @@ These parameters can be used in the same way as the parameters from the Online
 Serving examples above. One example for the usage of the `choice` parameter is
 shown below:
 
-??? Code
+??? code
 
     ```python
     from vllm import LLM, SamplingParams

@@ -15,7 +15,7 @@ vllm serve meta-llama/Llama-3.1-8B-Instruct \
 
 Next, make a request to the model that should result in it using the available tools:
 
-??? Code
+??? code
 
     ```python
     from openai import OpenAI

@@ -320,7 +320,7 @@ A tool parser plugin is a Python file containing one or more ToolParser implemen
 
 Here is a summary of a plugin file:
 
-??? Code
+??? code
 
     ```python
 