Fix various typos found in docs (#32212)

Signed-off-by: Andrew Bennett <potatosaladx@meta.com>
2026-01-12 21:41:47 -06:00
parent 60b77e1463
commit f243abc92d
21 changed files with 26 additions and 26 deletions
--- a/docs/contributing/deprecation_policy.md
+++ b/docs/contributing/deprecation_policy.md
@@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0").
    - GitHub Issue (RFC) for feedback
    - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs

-### 2.Deprecated (Off By Default)
+### 2. Deprecated (Off By Default)

 - **Action**: Feature is disabled by default, but can still be re-enabled via a
 CLI flag or environment variable. Feature throws an error when used without
--- a/docs/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
@@ -118,7 +118,7 @@ To support a model with interleaving sliding windows, we need to take care of th
 - Make sure the model's `config.json` contains `layer_types`.
 - In the modeling code, parse the correct sliding window value for every layer, and pass it to the attention layer's `per_layer_sliding_window` argument. For reference, check [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/model_executor/models/llama.py#L171).

-With these two steps, interleave sliding windows should work with the model.
+With these two steps, interleaved sliding windows should work with the model.

 ### How to support models that use Mamba?

--- a/docs/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -59,7 +59,7 @@ Then, run the following code to deploy it to the cloud:
 cerebrium deploy
 ```

-If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
+If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case `/run`)

 ??? console "Command"

--- a/docs/deployment/frameworks/hf_inference_endpoints.md
+++ b/docs/deployment/frameworks/hf_inference_endpoints.md
@@ -70,7 +70,7 @@ This method applies to models with the [`transformers` library tag](https://hugg

    ![Locate deploy button](../../assets/deployment/hf-inference-endpoints-locate-deploy-button.png)

-3. Click to **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment.
+3. Click the **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment.

    ![Click deploy button](../../assets/deployment/hf-inference-endpoints-click-deploy-button.png)

--- a/docs/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -10,7 +10,7 @@ If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](h

 ## Pre-requisite

-Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-medal GPU machine).
+Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).

 ## Deployment using vLLM production stack

--- a/docs/design/custom_op.md
+++ b/docs/design/custom_op.md
@@ -40,9 +40,9 @@ Furthermore, vLLM decides whether to enable or disable a `CustomOp` based on `co
 By default, if `compilation_config.backend == "inductor"` and `compilation_config.mode != CompilationMode.NONE`, a `none` will be appended into `compilation_config.custom_ops`, otherwise a `all` will be appended. In other words, this means `CustomOp` will be disabled in some platforms (i.e., those use `inductor` as dafault backend for `torch.compile`) when running with torch compile mode. In this case, Inductor generates (fused) Triton kernels for those disabled custom ops.

 !!! note
-    For multi-modal models, vLLM has enforece enabled some custom ops to use device-specific deep-optimized kernels for better performance in ViT part, such as `MMEncoderAttention` and `ApplyRotaryEmb`. We can also pass a `enforce_enable=True` param to the `__init__()` method of the `CustomOp` to enforce enable itself at object-level.
+    For multi-modal models, vLLM has enforced the enabling of some custom ops to use device-specific deep-optimized kernels for better performance in ViT part, such as `MMEncoderAttention` and `ApplyRotaryEmb`. We can also pass a `enforce_enable=True` param to the `__init__()` method of the `CustomOp` to enforce enable itself at object-level.

-    Note that this `enforce_enable` mechanism will be removed after we adding a separate `compilation_config` for multi-modal part.
+    Note that this `enforce_enable` mechanism will be removed after we add a separate `compilation_config` for multi-modal part.

 ## How to Customise Your Configuration for CustomOp

--- a/docs/design/fused_moe_modular_kernel.md
+++ b/docs/design/fused_moe_modular_kernel.md
@@ -2,7 +2,7 @@

 ## Introduction

-FusedMoEModularKernel is implemented [here](../..//vllm/model_executor/layers/fused_moe/modular_kernel.py)
+FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py)

 Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types.

--- a/docs/design/logits_processors.md
+++ b/docs/design/logits_processors.md
@@ -138,7 +138,7 @@ Note that the sampler will access the logits processors via `SamplingMetadata.lo
            # ...return sampler output data structure...


-        def sample(self, logits, sampling_metadta)
+        def sample(self, logits, sampling_metadata)

            ...

--- a/docs/features/disagg_encoder.md
+++ b/docs/features/disagg_encoder.md
@@ -68,7 +68,7 @@ Here is a figure illustrating disaggregate encoder flow:

 ![Disaggregated Encoder Flow](../assets/features/disagg_encoder/disagg_encoder_flow.png)

-For the PD disaggregation part, the Prefill instance receive cache exactly the same as the disaggregate encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfer KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execute of the PDinstance.
+For the PD disaggregation part, the Prefill instance receives cache exactly the same as the disaggregated encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfers KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execution of the PD instance.

 `docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)

--- a/docs/features/disagg_prefill.md
+++ b/docs/features/disagg_prefill.md
@@ -1,6 +1,6 @@
 # Disaggregated Prefilling (experimental)

-This page introduces you the disaggregated prefilling feature in vLLM.
+This page introduces you to the disaggregated prefilling feature in vLLM.

 !!! note
    This feature is experimental and subject to change.
--- a/docs/features/quantization/inc.md
+++ b/docs/features/quantization/inc.md
@@ -19,7 +19,7 @@ Once you've completed the model calibration process and collected the measuremen

 ```bash
 export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
-vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor_paralel_size 8
+vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8
 ```

 !!! tip
--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode.md
@@ -173,7 +173,7 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s
 ## Speculating using MLP speculators

 The following code configures vLLM to use speculative decoding where proposals are generated by
-draft models that conditioning draft predictions on both context vectors and sampled tokens.
+draft models that condition draft predictions on both context vectors and sampled tokens.
 For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
 [this technical report](https://arxiv.org/abs/2404.19124).

--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -39,7 +39,7 @@ request. You may also choose a specific backend, along with
 some options. A full set of options is available in the `vllm serve --help`
 text.

-Now let´s see an example for each of the cases, starting with the `choice`, as it´s the easiest one:
+Now let's see an example for each of the cases, starting with the `choice`, as it's the easiest one:

 ??? code

@@ -126,12 +126,12 @@ The next example shows how to use the `response_format` parameter with a Pydanti
    ```

 !!! tip
-    While not strictly necessary, normally it´s better to indicate in the prompt the
+    While not strictly necessary, normally it's better to indicate in the prompt the
    JSON schema and how the fields should be populated. This can improve the
    results notably in most cases.

 Finally we have the `grammar` option, which is probably the most
-difficult to use, but it´s really powerful. It allows us to define complete
+difficult to use, but it's really powerful. It allows us to define complete
 languages like SQL queries. It works by using a context free EBNF grammar.
 As an example, we can use to define a specific format of simplified SQL queries:

@@ -303,7 +303,7 @@ An example of using `structural_tag` can be found here: [examples/online_serving
 ## Offline Inference

 Offline inference allows for the same types of structured outputs.
-To use it, we´ll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
+To use it, we'll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
 The main available options inside `StructuredOutputsParams` are:

 - `json`
--- a/docs/getting_started/installation/cpu.arm.inc.md
+++ b/docs/getting_started/installation/cpu.arm.inc.md
@@ -1,6 +1,6 @@
 # --8<-- [start:installation]

-vLLM offers basic model inferencing and serving on Arm CPU platform, with support NEON, data types FP32, FP16 and BF16.
+vLLM offers basic model inferencing and serving on Arm CPU platform, with support for NEON, data types FP32, FP16 and BF16.

 # --8<-- [end:installation]
 # --8<-- [start:requirements]
--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -75,7 +75,7 @@ This guide will help you quickly get started with vLLM to perform:
        For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).

 !!! note
-    For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
+    For more detail and non-CUDA platforms, please refer to the [installation guide](installation/README.md) for specific instructions on how to install vLLM.

 ## Offline Batched Inference

--- a/docs/governance/collaboration.md
+++ b/docs/governance/collaboration.md
@@ -18,7 +18,7 @@ For features that you intend to maintain, please feel free to add yourself in [`
 If you use vLLM, we recommend you making the model work with vLLM by following the [model registration](../contributing/model/registration.md) process before you release it publicly.

 The vLLM team helps with new model architectures not supported by vLLM, especially models pushing architectural frontiers.
-Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate.
+Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. Model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate.

 Once we establish the connection between the vLLM team and model provider:

@@ -30,7 +30,7 @@ The vLLM team works with model providers on features, integrations, and release

 The vLLM maintainers will not publicly share details about model architecture, release timelines, or upcoming releases. We maintain model weights on secure servers with security measures (though we can work with security reviews and testing without certification). We delete pre-release weights or artifacts upon request.

-The vLLM team collaborates on marketing and promotional efforts for model releases. model providers can use vLLM's trademark and logo in publications and materials.
+The vLLM team collaborates on marketing and promotional efforts for model releases. Model providers can use vLLM's trademark and logo in publications and materials.

 ## Adding New Hardware

--- a/docs/models/extensions/fastsafetensor.md
+++ b/docs/models/extensions/fastsafetensor.md
@@ -1,4 +1,4 @@
-Loading Model weights with fastsafetensors
+Loading model weights with fastsafetensors
 ===================================================================

 Using fastsafetensors library enables loading model weights to GPU memory by leveraging GPU direct storage. See [their GitHub repository](https://github.com/foundation-model-stack/fastsafetensors) for more details.
--- a/docs/models/generative_models.md
+++ b/docs/models/generative_models.md
@@ -2,7 +2,7 @@

 vLLM provides first-class support for generative models, which covers most of LLMs.

-In vLLM, generative models implement the[VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
+In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
 which are then passed through [Sampler][vllm.v1.sample.sampler.Sampler] to obtain the final text.

--- a/docs/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
@@ -874,7 +874,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
 Full example:

 - [examples/pooling/score/vision_score_api_online.py](../../examples/pooling/score/vision_score_api_online.py)
- examples/pooling/score/vision_rerank_api_online.py](../../examples/pooling/score/vision_rerank_api_online.py)
+- [examples/pooling/score/vision_rerank_api_online.py](../../examples/pooling/score/vision_rerank_api_online.py)

 #### Extra parameters

--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -338,7 +338,7 @@ If you use triton kernels with cuda 13, you might see an error like `ptxas fatal
 vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
 ```

-It means that the ptxas in triton bundle not compatible with your device. You need to set `TRITON_PTXAS_PATH` environment variable to use cuda toolkit's ptxas manually instead:
+It means that the ptxas in the triton bundle is not compatible with your device. You need to set `TRITON_PTXAS_PATH` environment variable to use cuda toolkit's ptxas manually instead:

 ```shell
 export CUDA_HOME=/usr/local/cuda
--- a/docs/usage/v1_guide.md
+++ b/docs/usage/v1_guide.md
@@ -123,7 +123,7 @@ We are working on enabling prefix caching and chunked prefill for more categorie
 #### Mamba Models

 Models using selective state-space mechanisms instead of standard transformer attention are supported.
-Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`,`FalconMambaForCausalLM`) are supported.
+Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`, `FalconMambaForCausalLM`) are supported.

 Hybrid models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`,
 `Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`, `Plamo2ForCausalLM`).