Fix various typos found in docs (#32212)

Signed-off-by: Andrew Bennett <potatosaladx@meta.com>
Andrew Bennett
2026-01-12 21:41:47 -06:00
committed by GitHub
parent 60b77e1463
commit f243abc92d
21 changed files with 26 additions and 26 deletions

@@ -68,7 +68,7 @@ Here is a figure illustrating disaggregate encoder flow:
![Disaggregated Encoder Flow](../assets/features/disagg_encoder/disagg_encoder_flow.png)
For the PD disaggregation part, the Prefill instance receive cache exactly the same as the disaggregate encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfer KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execute of the PDinstance.
For the PD disaggregation part, the Prefill instance receives cache exactly the same as the disaggregated encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfers KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execution of the PD instance.
`docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)

@@ -1,6 +1,6 @@
# Disaggregated Prefilling (experimental)
This page introduces you the disaggregated prefilling feature in vLLM.
This page introduces you to the disaggregated prefilling feature in vLLM.
!!! note
This feature is experimental and subject to change.

@@ -19,7 +19,7 @@ Once you've completed the model calibration process and collected the measuremen
```bash
export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor_paralel_size 8
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8
```
!!! tip

@@ -173,7 +173,7 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s
## Speculating using MLP speculators
The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that conditioning draft predictions on both context vectors and sampled tokens.
draft models that condition draft predictions on both context vectors and sampled tokens.
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
[this technical report](https://arxiv.org/abs/2404.19124).

@@ -39,7 +39,7 @@ request. You may also choose a specific backend, along with
some options. A full set of options is available in the `vllm serve --help`
text.
Now let´s see an example for each of the cases, starting with the `choice`, as it´s the easiest one:
Now let's see an example for each of the cases, starting with the `choice`, as it's the easiest one:
??? code
@@ -126,12 +126,12 @@ The next example shows how to use the `response_format` parameter with a Pydanti
```
!!! tip
While not strictly necessary, normally it´s better to indicate in the prompt the
While not strictly necessary, normally it's better to indicate in the prompt the
JSON schema and how the fields should be populated. This can improve the
results notably in most cases.
Finally we have the `grammar` option, which is probably the most
difficult to use, but it´s really powerful. It allows us to define complete
difficult to use, but it's really powerful. It allows us to define complete
languages like SQL queries. It works by using a context free EBNF grammar.
As an example, we can use to define a specific format of simplified SQL queries:
@@ -303,7 +303,7 @@ An example of using `structural_tag` can be found here: [examples/online_serving
## Offline Inference
Offline inference allows for the same types of structured outputs.
To use it, we´ll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
To use it, we'll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
The main available options inside `StructuredOutputsParams` are:
- `json`