Fix various typos found in docs (#32212)
Signed-off-by: Andrew Bennett <potatosaladx@meta.com>
@@ -68,7 +68,7 @@ Here is a figure illustrating disaggregate encoder flow:



-For the PD disaggregation part, the Prefill instance receive cache exactly the same as the disaggregate encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfer KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execute of the PDinstance.
+For the PD disaggregation part, the Prefill instance receives cache exactly the same as the disaggregated encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfers KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execution of the PD instance.

 `docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)

@@ -1,6 +1,6 @@
 # Disaggregated Prefilling (experimental)

-This page introduces you the disaggregated prefilling feature in vLLM.
+This page introduces you to the disaggregated prefilling feature in vLLM.

 !!! note
     This feature is experimental and subject to change.
@@ -19,7 +19,7 @@ Once you've completed the model calibration process and collected the measuremen

 ```bash
 export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
-vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor_paralel_size 8
+vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8
 ```

 !!! tip
@@ -173,7 +173,7 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s
 ## Speculating using MLP speculators

 The following code configures vLLM to use speculative decoding where proposals are generated by
-draft models that conditioning draft predictions on both context vectors and sampled tokens.
+draft models that condition draft predictions on both context vectors and sampled tokens.
 For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
 [this technical report](https://arxiv.org/abs/2404.19124).

@@ -39,7 +39,7 @@ request. You may also choose a specific backend, along with
 some options. A full set of options is available in the `vllm serve --help`
 text.

-Now let´s see an example for each of the cases, starting with the `choice`, as it´s the easiest one:
+Now let's see an example for each of the cases, starting with the `choice`, as it's the easiest one:

 ??? code

@@ -126,12 +126,12 @@ The next example shows how to use the `response_format` parameter with a Pydanti
 ```

 !!! tip
-    While not strictly necessary, normally it´s better to indicate in the prompt the
+    While not strictly necessary, normally it's better to indicate in the prompt the
     JSON schema and how the fields should be populated. This can improve the
     results notably in most cases.

 Finally we have the `grammar` option, which is probably the most
-difficult to use, but it´s really powerful. It allows us to define complete
+difficult to use, but it's really powerful. It allows us to define complete
 languages like SQL queries. It works by using a context free EBNF grammar.
 As an example, we can use to define a specific format of simplified SQL queries:

@@ -303,7 +303,7 @@ An example of using `structural_tag` can be found here: [examples/online_serving
 ## Offline Inference

 Offline inference allows for the same types of structured outputs.
-To use it, we´ll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
+To use it, we'll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
 The main available options inside `StructuredOutputsParams` are:

 - `json`