Fix various typos found in docs (#32212)

Signed-off-by: Andrew Bennett <potatosaladx@meta.com>
This commit is contained in:
Andrew Bennett
2026-01-12 21:41:47 -06:00
committed by GitHub
parent 60b77e1463
commit f243abc92d
21 changed files with 26 additions and 26 deletions

View File

@@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0").
- GitHub Issue (RFC) for feedback
- Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs
### 2.Deprecated (Off By Default)
### 2. Deprecated (Off By Default)
- **Action**: Feature is disabled by default, but can still be re-enabled via a
CLI flag or environment variable. Feature throws an error when used without

View File

@@ -118,7 +118,7 @@ To support a model with interleaving sliding windows, we need to take care of th
- Make sure the model's `config.json` contains `layer_types`.
- In the modeling code, parse the correct sliding window value for every layer, and pass it to the attention layer's `per_layer_sliding_window` argument. For reference, check [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/model_executor/models/llama.py#L171).
With these two steps, interleave sliding windows should work with the model.
With these two steps, interleaved sliding windows should work with the model.
### How to support models that use Mamba?

View File

@@ -59,7 +59,7 @@ Then, run the following code to deploy it to the cloud:
cerebrium deploy
```
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case `/run`)
??? console "Command"

View File

@@ -70,7 +70,7 @@ This method applies to models with the [`transformers` library tag](https://hugg
![Locate deploy button](../../assets/deployment/hf-inference-endpoints-locate-deploy-button.png)
3. Click to **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment.
3. Click the **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment.
![Click deploy button](../../assets/deployment/hf-inference-endpoints-click-deploy-button.png)

View File

@@ -10,7 +10,7 @@ If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](h
## Pre-requisite
Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-medal GPU machine).
Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).
## Deployment using vLLM production stack

View File

@@ -40,9 +40,9 @@ Furthermore, vLLM decides whether to enable or disable a `CustomOp` based on `co
By default, if `compilation_config.backend == "inductor"` and `compilation_config.mode != CompilationMode.NONE`, a `none` will be appended into `compilation_config.custom_ops`, otherwise a `all` will be appended. In other words, this means `CustomOp` will be disabled in some platforms (i.e., those use `inductor` as dafault backend for `torch.compile`) when running with torch compile mode. In this case, Inductor generates (fused) Triton kernels for those disabled custom ops.
!!! note
For multi-modal models, vLLM has enforece enabled some custom ops to use device-specific deep-optimized kernels for better performance in ViT part, such as `MMEncoderAttention` and `ApplyRotaryEmb`. We can also pass a `enforce_enable=True` param to the `__init__()` method of the `CustomOp` to enforce enable itself at object-level.
For multi-modal models, vLLM has enforced the enabling of some custom ops to use device-specific deep-optimized kernels for better performance in ViT part, such as `MMEncoderAttention` and `ApplyRotaryEmb`. We can also pass a `enforce_enable=True` param to the `__init__()` method of the `CustomOp` to enforce enable itself at object-level.
Note that this `enforce_enable` mechanism will be removed after we adding a separate `compilation_config` for multi-modal part.
Note that this `enforce_enable` mechanism will be removed after we add a separate `compilation_config` for multi-modal part.
## How to Customise Your Configuration for CustomOp

View File

@@ -2,7 +2,7 @@
## Introduction
FusedMoEModularKernel is implemented [here](../..//vllm/model_executor/layers/fused_moe/modular_kernel.py)
FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py)
Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types.

View File

@@ -138,7 +138,7 @@ Note that the sampler will access the logits processors via `SamplingMetadata.lo
# ...return sampler output data structure...
def sample(self, logits, sampling_metadta)
def sample(self, logits, sampling_metadata)
...

View File

@@ -68,7 +68,7 @@ Here is a figure illustrating disaggregate encoder flow:
![Disaggregated Encoder Flow](../assets/features/disagg_encoder/disagg_encoder_flow.png)
For the PD disaggregation part, the Prefill instance receive cache exactly the same as the disaggregate encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfer KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execute of the PDinstance.
For the PD disaggregation part, the Prefill instance receives cache exactly the same as the disaggregated encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfers KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execution of the PD instance.
`docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)

View File

@@ -1,6 +1,6 @@
# Disaggregated Prefilling (experimental)
This page introduces you the disaggregated prefilling feature in vLLM.
This page introduces you to the disaggregated prefilling feature in vLLM.
!!! note
This feature is experimental and subject to change.

View File

@@ -19,7 +19,7 @@ Once you've completed the model calibration process and collected the measuremen
```bash
export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor_paralel_size 8
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8
```
!!! tip

View File

@@ -173,7 +173,7 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s
## Speculating using MLP speculators
The following code configures vLLM to use speculative decoding where proposals are generated by
draft models that conditioning draft predictions on both context vectors and sampled tokens.
draft models that condition draft predictions on both context vectors and sampled tokens.
For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
[this technical report](https://arxiv.org/abs/2404.19124).

View File

@@ -39,7 +39,7 @@ request. You may also choose a specific backend, along with
some options. A full set of options is available in the `vllm serve --help`
text.
Now let´s see an example for each of the cases, starting with the `choice`, as it´s the easiest one:
Now let's see an example for each of the cases, starting with the `choice`, as it's the easiest one:
??? code
@@ -126,12 +126,12 @@ The next example shows how to use the `response_format` parameter with a Pydanti
```
!!! tip
While not strictly necessary, normally it´s better to indicate in the prompt the
While not strictly necessary, normally it's better to indicate in the prompt the
JSON schema and how the fields should be populated. This can improve the
results notably in most cases.
Finally we have the `grammar` option, which is probably the most
difficult to use, but it´s really powerful. It allows us to define complete
difficult to use, but it's really powerful. It allows us to define complete
languages like SQL queries. It works by using a context free EBNF grammar.
As an example, we can use to define a specific format of simplified SQL queries:
@@ -303,7 +303,7 @@ An example of using `structural_tag` can be found here: [examples/online_serving
## Offline Inference
Offline inference allows for the same types of structured outputs.
To use it, we´ll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
To use it, we'll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
The main available options inside `StructuredOutputsParams` are:
- `json`

View File

@@ -1,6 +1,6 @@
# --8<-- [start:installation]
vLLM offers basic model inferencing and serving on Arm CPU platform, with support NEON, data types FP32, FP16 and BF16.
vLLM offers basic model inferencing and serving on Arm CPU platform, with support for NEON, data types FP32, FP16 and BF16.
# --8<-- [end:installation]
# --8<-- [start:requirements]

View File

@@ -75,7 +75,7 @@ This guide will help you quickly get started with vLLM to perform:
For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).
!!! note
For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
For more detail and non-CUDA platforms, please refer to the [installation guide](installation/README.md) for specific instructions on how to install vLLM.
## Offline Batched Inference

View File

@@ -18,7 +18,7 @@ For features that you intend to maintain, please feel free to add yourself in [`
If you use vLLM, we recommend you making the model work with vLLM by following the [model registration](../contributing/model/registration.md) process before you release it publicly.
The vLLM team helps with new model architectures not supported by vLLM, especially models pushing architectural frontiers.
Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate.
Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. Model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate.
Once we establish the connection between the vLLM team and model provider:
@@ -30,7 +30,7 @@ The vLLM team works with model providers on features, integrations, and release
The vLLM maintainers will not publicly share details about model architecture, release timelines, or upcoming releases. We maintain model weights on secure servers with security measures (though we can work with security reviews and testing without certification). We delete pre-release weights or artifacts upon request.
The vLLM team collaborates on marketing and promotional efforts for model releases. model providers can use vLLM's trademark and logo in publications and materials.
The vLLM team collaborates on marketing and promotional efforts for model releases. Model providers can use vLLM's trademark and logo in publications and materials.
## Adding New Hardware

View File

@@ -1,4 +1,4 @@
Loading Model weights with fastsafetensors
Loading model weights with fastsafetensors
===================================================================
Using fastsafetensors library enables loading model weights to GPU memory by leveraging GPU direct storage. See [their GitHub repository](https://github.com/foundation-model-stack/fastsafetensors) for more details.

View File

@@ -2,7 +2,7 @@
vLLM provides first-class support for generative models, which covers most of LLMs.
In vLLM, generative models implement the[VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through [Sampler][vllm.v1.sample.sampler.Sampler] to obtain the final text.

View File

@@ -874,7 +874,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
Full example:
- [examples/pooling/score/vision_score_api_online.py](../../examples/pooling/score/vision_score_api_online.py)
- examples/pooling/score/vision_rerank_api_online.py](../../examples/pooling/score/vision_rerank_api_online.py)
- [examples/pooling/score/vision_rerank_api_online.py](../../examples/pooling/score/vision_rerank_api_online.py)
#### Extra parameters

View File

@@ -338,7 +338,7 @@ If you use triton kernels with cuda 13, you might see an error like `ptxas fatal
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
```
It means that the ptxas in triton bundle not compatible with your device. You need to set `TRITON_PTXAS_PATH` environment variable to use cuda toolkit's ptxas manually instead:
It means that the ptxas in the triton bundle is not compatible with your device. You need to set `TRITON_PTXAS_PATH` environment variable to use cuda toolkit's ptxas manually instead:
```shell
export CUDA_HOME=/usr/local/cuda

View File

@@ -123,7 +123,7 @@ We are working on enabling prefix caching and chunked prefill for more categorie
#### Mamba Models
Models using selective state-space mechanisms instead of standard transformer attention are supported.
Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`,`FalconMambaForCausalLM`) are supported.
Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`, `FalconMambaForCausalLM`) are supported.
Hybrid models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
`Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`, `Plamo2ForCausalLM`).