# Structured Outputs
vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or
[guidance](https://github.com/guidance-ai/llguidance) as backends.
This document shows you some examples of the different options that are
available to generate structured outputs.
!!! warning

    If you are still using the following deprecated API fields, which were removed in v0.12.0, please update your code to use `structured_outputs` as demonstrated in the rest of this document:

    - `guided_json` -> `{"structured_outputs": {"json": ...}}` or `StructuredOutputsParams(json=...)`
    - `guided_regex` -> `{"structured_outputs": {"regex": ...}}` or `StructuredOutputsParams(regex=...)`
    - `guided_choice` -> `{"structured_outputs": {"choice": ...}}` or `StructuredOutputsParams(choice=...)`
    - `guided_grammar` -> `{"structured_outputs": {"grammar": ...}}` or `StructuredOutputsParams(grammar=...)`
    - `guided_whitespace_pattern` -> `{"structured_outputs": {"whitespace_pattern": ...}}` or `StructuredOutputsParams(whitespace_pattern=...)`
    - `structural_tag` -> `{"structured_outputs": {"structural_tag": ...}}` or `StructuredOutputsParams(structural_tag=...)`
    - `guided_decoding_backend` -> Remove this field from your request
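If you have many call sites, the rename is mechanical. A small sketch of a helper that rewrites a request body (the function and variable names here are ours, not part of vLLM):

```python
# Sketch: migrate a request body from the removed `guided_*` fields to
# the `structured_outputs` field, following the mapping listed above.
RENAMES = {
    "guided_json": "json",
    "guided_regex": "regex",
    "guided_choice": "choice",
    "guided_grammar": "grammar",
    "guided_whitespace_pattern": "whitespace_pattern",
    "structural_tag": "structural_tag",
}

def migrate(body: dict) -> dict:
    """Move deprecated keys under `structured_outputs`; drop `guided_decoding_backend`."""
    new_body = {
        k: v
        for k, v in body.items()
        if k not in RENAMES and k != "guided_decoding_backend"
    }
    structured = {RENAMES[k]: v for k, v in body.items() if k in RENAMES}
    if structured:
        new_body["structured_outputs"] = structured
    return new_body

print(migrate({"guided_choice": ["positive", "negative"]}))
# {'structured_outputs': {'choice': ['positive', 'negative']}}
```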
## Online Serving (OpenAI API)
You can generate structured outputs using OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.
The following parameters are supported, which must be added as extra parameters:
- `choice`: the output will be exactly one of the choices.
- `regex`: the output will follow the regex pattern.
- `json`: the output will follow the JSON schema.
- `grammar`: the output will follow the context-free grammar.
- `structural_tag`: the output will follow a JSON schema within a set of specified tags in the generated text.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server](../serving/openai_compatible_server.md) page.
Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by passing the
`--structured-outputs-config.backend` flag to `vllm serve`. The default backend is `auto`,
which tries to choose an appropriate backend based on the details of the
request. You may also choose a specific backend, along with
some options. A full set of options is available in the `vllm serve --help`
text.
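For example, to pin the backend explicitly when launching the server (the model name here is only an illustration; the backend names come from the list at the top of this page):

```shell
# Serve a model with the xgrammar structured-outputs backend pinned.
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --structured-outputs-config.backend xgrammar
```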
Now let's see an example for each of the cases, starting with `choice`, as it's the easiest one:
??? code

    ```python
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="-",
    )

    model = client.models.list().data[0].id

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
        ],
        extra_body={"structured_outputs": {"choice": ["positive", "negative"]}},
    )
    print(completion.choices[0].message.content)
    ```
The next example shows how to use the `regex` parameter. The idea is to generate an email address, given a simple regex template:
??? code

    ```python
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate an example email address for Alan Turing, who works in Enigma. End in .com and new line. Example result: alan.turing@enigma.com\n",
            }
        ],
        extra_body={"structured_outputs": {"regex": r"\w+@\w+\.com\n"}, "stop": ["\n"]},
    )
    print(completion.choices[0].message.content)
    ```
One of the most relevant features in structured text generation is the option to generate valid JSON with pre-defined fields and formats.
For this we can use the `json` parameter in two different ways:

- Directly using a [JSON Schema](https://json-schema.org/)
- Defining a [Pydantic model](https://docs.pydantic.dev/latest/) and then extracting the JSON Schema from it (which is normally the easier option).
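For the first option, the `json` parameter accepts a plain JSON Schema object. A minimal sketch of a hand-written schema (the field names are illustrative):

```python
# A hand-written JSON Schema describing an object with two required
# fields; equivalent to a small Pydantic model with `name: str` and
# `age: int`.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

# It would then be passed as, e.g.:
# extra_body={"structured_outputs": {"json": person_schema}}
```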
The next example shows how to use the `response_format` parameter with a Pydantic model:
??? code

    ```python
    from enum import Enum

    from pydantic import BaseModel

    class CarType(str, Enum):
        sedan = "sedan"
        suv = "SUV"
        truck = "Truck"
        coupe = "Coupe"

    class CarDescription(BaseModel):
        brand: str
        model: str
        car_type: CarType

    json_schema = CarDescription.model_json_schema()

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate a JSON with the brand, model and car_type of the most iconic car from the 90's",
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "car-description",
                "schema": json_schema,
            },
        },
    )
    print(completion.choices[0].message.content)
    ```
!!! tip
    While not strictly necessary, it's usually better to indicate in the prompt the
    JSON schema and how the fields should be populated. This can notably improve
    results in most cases.
Finally we have the `grammar` option, which is probably the most
difficult to use, but it's really powerful. It allows us to define complete
languages like SQL queries. It works by using a context-free EBNF grammar.
As an example, we can use it to define a specific format of simplified SQL queries:
??? code

    ```python
    simplified_sql_grammar = """
    root ::= select_statement

    select_statement ::= "SELECT " column " from " table " where " condition

    column ::= "col_1 " | "col_2 "

    table ::= "table_1 " | "table_2 "

    condition ::= column "= " number

    number ::= "1 " | "2 "
    """

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate an SQL query to show the 'username' and 'email' from the 'users' table.",
            }
        ],
        extra_body={"structured_outputs": {"grammar": simplified_sql_grammar}},
    )
    print(completion.choices[0].message.content)
    ```
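Because every production in this grammar is a choice between literals, it describes a small finite language. A quick sketch that enumerates every query the grammar can produce, mirroring each rule:

```python
from itertools import product

# Alternatives copied from the grammar's productions above.
columns = ["col_1 ", "col_2 "]      # column ::= "col_1 " | "col_2 "
tables = ["table_1 ", "table_2 "]   # table ::= "table_1 " | "table_2 "
numbers = ["1 ", "2 "]              # number ::= "1 " | "2 "

# select_statement ::= "SELECT " column " from " table " where " condition
# condition        ::= column "= " number
queries = [
    "SELECT " + col + " from " + tab + " where " + cond_col + "= " + num
    for col, tab, cond_col, num in product(columns, tables, columns, numbers)
]
print(len(queries))  # 16
print(queries[0])
```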
See also: [full example](../examples/online_serving/structured_outputs.md)
## Reasoning Outputs
You can also use structured outputs with <project:#reasoning-outputs> for reasoning models.
```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --reasoning-parser deepseek_r1
```
Note that you can combine reasoning with any of the structured outputs features. The following example uses a JSON schema:
??? code

    ```python
    from pydantic import BaseModel

    class People(BaseModel):
        name: str
        age: int

    completion = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": "Generate a JSON with the name and age of one random person.",
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "people",
                "schema": People.model_json_schema(),
            },
        },
    )
    print("reasoning: ", completion.choices[0].message.reasoning)
    print("content: ", completion.choices[0].message.content)
    ```
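Since the `content` field is constrained to the schema, it can be parsed directly with the standard library. A minimal sketch, using an illustrative response string in place of a live completion:

```python
import json

# Stand-in for completion.choices[0].message.content, which the schema
# guarantees is a JSON object with a string `name` and integer `age`.
content = '{"name": "Ada Lovelace", "age": 36}'
person = json.loads(content)
assert isinstance(person["name"], str)
assert isinstance(person["age"], int)
```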
See also: [full example](../examples/online_serving/structured_outputs.md)
## Experimental Automatic Parsing (OpenAI API)
This section covers the OpenAI beta wrapper over the `client.chat.completions.create()` method that provides richer integrations with Python-specific types.
At the time of writing (`openai==1.54.4`), this is a "beta" feature in the OpenAI client library. Code reference can be found [here](https://github.com/openai/openai-python/blob/52357cff50bee57ef442e94d78a0de38b4173fc2/src/openai/resources/beta/chat/completions.py#L100-L104).
For the following examples, vLLM was set up using `vllm serve meta-llama/Llama-3.1-8B-Instruct`.
Here is a simple example demonstrating how to get structured output using Pydantic models:
??? code

    ```python
    from openai import OpenAI
    from pydantic import BaseModel

    class Info(BaseModel):
        name: str
        age: int

    client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
    model = client.models.list().data[0].id

    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "My name is Cameron, I'm 28. What's my name and age?"},
        ],
        response_format=Info,
    )

    message = completion.choices[0].message
    print(message)
    assert message.parsed
    print("Name:", message.parsed.name)
    print("Age:", message.parsed.age)
    ```
Output:

```console
ParsedChatCompletionMessage[Info](content='{"name": "Cameron", "age": 28}', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=Info(name='Cameron', age=28))
Name: Cameron
Age: 28
```
Here is a more complex example using nested Pydantic models to handle a step-by-step math solution:
??? code

    ```python
    from pydantic import BaseModel

    class Step(BaseModel):
        explanation: str
        output: str

    class MathResponse(BaseModel):
        steps: list[Step]
        final_answer: str

    completion = client.beta.chat.completions.parse(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful expert math tutor."},
            {"role": "user", "content": "Solve 8x + 31 = 2."},
        ],
        response_format=MathResponse,
    )

    message = completion.choices[0].message
    print(message)
    assert message.parsed
    for i, step in enumerate(message.parsed.steps):
        print(f"Step #{i}:", step)
    print("Answer:", message.parsed.final_answer)
    ```
Output:
```console
ParsedChatCompletionMessage[MathResponse](content='{ "steps": [{ "explanation": "First, let\'s isolate the term with the variable \'x\'. To do this, we\'ll subtract 31 from both sides of the equation.", "output": "8x + 31 - 31 = 2 - 31"}, { "explanation": "By subtracting 31 from both sides, we simplify the equation to 8x = -29.", "output": "8x = -29"}, { "explanation": "Next, let\'s isolate \'x\' by dividing both sides of the equation by 8.", "output": "8x / 8 = -29 / 8"}], "final_answer": "x = -29/8" }', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], parsed=MathResponse(steps=[Step(explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation.", output='8x + 31 - 31 = 2 - 31'), Step(explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.', output='8x = -29'), Step(explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8.", output='8x / 8 = -29 / 8')], final_answer='x = -29/8'))
Step #0: explanation="First, let's isolate the term with the variable 'x'. To do this, we'll subtract 31 from both sides of the equation." output='8x + 31 - 31 = 2 - 31'
Step #1: explanation='By subtracting 31 from both sides, we simplify the equation to 8x = -29.' output='8x = -29'
Step #2: explanation="Next, let's isolate 'x' by dividing both sides of the equation by 8." output='8x / 8 = -29 / 8'
Answer: x = -29/8
```
An example of using `structural_tag` can be found here: [examples/online_serving/structured_outputs](../../examples/online_serving/structured_outputs)
## Offline Inference
Offline inference allows for the same types of structured outputs.
To use it, we need to configure structured outputs using the `StructuredOutputsParams` class inside `SamplingParams`.
The main available options inside `StructuredOutputsParams` are:
- `json`
- `regex`
- `choice`
- `grammar`
- `structural_tag`
These parameters can be used in the same way as the parameters from the Online
Serving examples above. An example using the `choice` parameter is
shown below:
??? code

    ```python
    from vllm import LLM, SamplingParams
    from vllm.sampling_params import StructuredOutputsParams

    llm = LLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

    structured_outputs_params = StructuredOutputsParams(choice=["Positive", "Negative"])
    sampling_params = SamplingParams(structured_outputs=structured_outputs_params)
    outputs = llm.generate(
        prompts="Classify this sentiment: vLLM is wonderful!",
        sampling_params=sampling_params,
    )
    print(outputs[0].outputs[0].text)
    ```
See also: [full example](../examples/online_serving/structured_outputs.md)