diff --git a/docs/contributing/deprecation_policy.md b/docs/contributing/deprecation_policy.md
index 904ef4ca0..99b7c382d 100644
--- a/docs/contributing/deprecation_policy.md
+++ b/docs/contributing/deprecation_policy.md
@@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0").
 - GitHub Issue (RFC) for feedback
 - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs

-### 2.Deprecated (Off By Default)
+### 2. Deprecated (Off By Default)

 - **Action**: Feature is disabled by default, but can still be re-enabled via a
   CLI flag or environment variable. Feature throws an error when used without
diff --git a/docs/contributing/model/basic.md b/docs/contributing/model/basic.md
index 28f6f960a..915fe1495 100644
--- a/docs/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
@@ -118,7 +118,7 @@ To support a model with interleaving sliding windows, we need to take care of th
 - Make sure the model's `config.json` contains `layer_types`.
 - In the modeling code, parse the correct sliding window value for every layer, and pass it to the attention layer's `per_layer_sliding_window` argument. For reference, check [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/model_executor/models/llama.py#L171).

-With these two steps, interleave sliding windows should work with the model.
+With these two steps, interleaved sliding windows should work with the model.

 ### How to support models that use Mamba?
diff --git a/docs/deployment/frameworks/cerebrium.md b/docs/deployment/frameworks/cerebrium.md
index 960347d95..1b7c5d5a9 100644
--- a/docs/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -59,7 +59,7 @@ Then, run the following code to deploy it to the cloud:
 cerebrium deploy
 ```

-If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
+If successful, you should be returned a CURL command that you can use to call inference. Just remember to end the URL with the function name you are calling (in our case `/run`).

 ??? console "Command"
diff --git a/docs/deployment/frameworks/hf_inference_endpoints.md b/docs/deployment/frameworks/hf_inference_endpoints.md
index 05df0dacd..6217dc062 100644
--- a/docs/deployment/frameworks/hf_inference_endpoints.md
+++ b/docs/deployment/frameworks/hf_inference_endpoints.md
@@ -70,7 +70,7 @@ This method applies to models with the [`transformers` library tag](https://hugg

     ![Locate deploy button](../../assets/deployment/hf-inference-endpoints-locate-deploy-button.png)

-3. Click to **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment.
+3. Click the **Deploy** button > **HF Inference Endpoints**. You will be taken to the Inference Endpoints interface to configure the deployment.

     ![Click deploy button](../../assets/deployment/hf-inference-endpoints-click-deploy-button.png)
diff --git a/docs/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md
index 624e98a08..4db595164 100644
--- a/docs/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -10,7 +10,7 @@ If you are new to Kubernetes, don't worry: in the vLLM production stack [repo](h

 ## Pre-requisite

-Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-medal GPU machine).
+Ensure that you have a running Kubernetes environment with GPU (you can follow [this tutorial](https://github.com/vllm-project/production-stack/blob/main/tutorials/00-install-kubernetes-env.md) to install a Kubernetes environment on a bare-metal GPU machine).

 ## Deployment using vLLM production stack
diff --git a/docs/design/custom_op.md b/docs/design/custom_op.md
index fd298a149..13c2915ab 100644
--- a/docs/design/custom_op.md
+++ b/docs/design/custom_op.md
@@ -40,9 +40,9 @@ Furthermore, vLLM decides whether to enable or disable a `CustomOp` based on `co
 By default, if `compilation_config.backend == "inductor"` and `compilation_config.mode != CompilationMode.NONE`, a `none` will be appended into `compilation_config.custom_ops`, otherwise a `all` will be appended. In other words, this means `CustomOp` will be disabled in some platforms (i.e., those use `inductor` as dafault backend for `torch.compile`) when running with torch compile mode. In this case, Inductor generates (fused) Triton kernels for those disabled custom ops.

 !!! note
-    For multi-modal models, vLLM has enforece enabled some custom ops to use device-specific deep-optimized kernels for better performance in ViT part, such as `MMEncoderAttention` and `ApplyRotaryEmb`. We can also pass a `enforce_enable=True` param to the `__init__()` method of the `CustomOp` to enforce enable itself at object-level.
+    For multi-modal models, vLLM force-enables some custom ops, such as `MMEncoderAttention` and `ApplyRotaryEmb`, to use device-specific, deeply optimized kernels for better performance in the ViT part. We can also pass an `enforce_enable=True` param to the `__init__()` method of the `CustomOp` to force-enable it at the object level.

-    Note that this `enforce_enable` mechanism will be removed after we adding a separate `compilation_config` for multi-modal part.
+    Note that this `enforce_enable` mechanism will be removed after we add a separate `compilation_config` for the multi-modal part.

 ## How to Customise Your Configuration for CustomOp
diff --git a/docs/design/fused_moe_modular_kernel.md b/docs/design/fused_moe_modular_kernel.md
index e1a96be6c..975df8ba2 100644
--- a/docs/design/fused_moe_modular_kernel.md
+++ b/docs/design/fused_moe_modular_kernel.md
@@ -2,7 +2,7 @@

 ## Introduction

-FusedMoEModularKernel is implemented [here](../..//vllm/model_executor/layers/fused_moe/modular_kernel.py)
+FusedMoEModularKernel is implemented [here](../../vllm/model_executor/layers/fused_moe/modular_kernel.py)

 Based on the format of the input activations, FusedMoE implementations are broadly classified into 2 types.
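The `custom_op.md` hunk above describes toggling `CustomOp` implementations through `compilation_config.custom_ops`. As a hedged illustration only (the dict form of the `compilation_config` argument and the model name are assumptions, not taken from the patched docs), an offline sketch might look like:

```python
from vllm import LLM

# Assumption: recent vLLM versions accept a dict for compilation_config.
# "none" disables the CustomOp implementations (Inductor then generates fused
# Triton kernels for them), while "+rms_norm" force-enables the custom RMSNorm op.
llm = LLM(
    model="facebook/opt-125m",
    compilation_config={"custom_ops": ["none", "+rms_norm"]},
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```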
diff --git a/docs/design/logits_processors.md b/docs/design/logits_processors.md
index 8eadeb386..af1d7b6bb 100644
--- a/docs/design/logits_processors.md
+++ b/docs/design/logits_processors.md
@@ -138,7 +138,7 @@ Note that the sampler will access the logits processors via `SamplingMetadata.lo

         # ...return sampler output data structure...

-    def sample(self, logits, sampling_metadta)
+    def sample(self, logits, sampling_metadata)

         ...
diff --git a/docs/features/disagg_encoder.md b/docs/features/disagg_encoder.md
index f18a0e85e..d95427464 100644
--- a/docs/features/disagg_encoder.md
+++ b/docs/features/disagg_encoder.md
@@ -68,7 +68,7 @@ Here is a figure illustrating disaggregate encoder flow:

 ![Disaggregated Encoder Flow](../assets/features/disagg_encoder/disagg_encoder_flow.png)

-For the PD disaggregation part, the Prefill instance receive cache exactly the same as the disaggregate encoder flow above. Prefill instance executes 1 step (prefill -> 1 token output) and then transfer KV cache to the Decode instance for the remaining execution. The KV transfer part purely happens after the execute of the PDinstance.
+For the PD disaggregation part, the Prefill instance receives the cache exactly as in the disaggregated encoder flow above. The Prefill instance executes one step (prefill -> 1 token output) and then transfers the KV cache to the Decode instance for the remaining execution. The KV transfer happens entirely after the execution of the PD instance.

 `docs/features/disagg_prefill.md` shows the brief idea about the disaggregated prefill (v0)
diff --git a/docs/features/disagg_prefill.md b/docs/features/disagg_prefill.md
index 7b8280c4d..df69849bb 100644
--- a/docs/features/disagg_prefill.md
+++ b/docs/features/disagg_prefill.md
@@ -1,6 +1,6 @@
 # Disaggregated Prefilling (experimental)

-This page introduces you the disaggregated prefilling feature in vLLM.
+This page introduces you to the disaggregated prefilling feature in vLLM.

 !!! note
     This feature is experimental and subject to change.
diff --git a/docs/features/quantization/inc.md b/docs/features/quantization/inc.md
index 9875bc44c..f2bbca498 100644
--- a/docs/features/quantization/inc.md
+++ b/docs/features/quantization/inc.md
@@ -19,7 +19,7 @@ Once you've completed the model calibration process and collected the measuremen

 ```bash
 export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json
-vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor_paralel_size 8
+vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8
 ```

 !!! tip
diff --git a/docs/features/spec_decode.md b/docs/features/spec_decode.md
index 6097500ca..bd525ae33 100644
--- a/docs/features/spec_decode.md
+++ b/docs/features/spec_decode.md
@@ -173,7 +173,7 @@ Suffix Decoding can achieve better performance for tasks with high repetition, s
 ## Speculating using MLP speculators

 The following code configures vLLM to use speculative decoding where proposals are generated by
-draft models that conditioning draft predictions on both context vectors and sampled tokens.
+draft models that condition draft predictions on both context vectors and sampled tokens.
 For more information see [this blog](https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/) or
 [this technical report](https://arxiv.org/abs/2404.19124).
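Since the `spec_decode.md` hunk above only touches the prose around the MLP-speculator example, here is a minimal offline sketch of that setup for context. The model names and the exact `speculative_config` fields are illustrative assumptions; the patched document's own code block remains the authoritative example.

```python
from vllm import LLM, SamplingParams

# Assumed target/speculator pair; substitute the models from spec_decode.md.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "ibm-ai-platform/llama3-70b-accelerator",
        "draft_tensor_parallel_size": 1,
    },
)
outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```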
diff --git a/docs/features/structured_outputs.md b/docs/features/structured_outputs.md
index 3ac987559..a1f789111 100644
--- a/docs/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -39,7 +39,7 @@ request. You may also choose a specific backend, along with some options.

 A full set of options is available in the `vllm serve --help` text.

-Now let´s see an example for each of the cases, starting with the `choice`, as it´s the easiest one:
+Now let's see an example for each of the cases, starting with the `choice`, as it's the easiest one:

 ??? code
@@ -126,12 +126,12 @@ The next example shows how to use the `response_format` parameter with a Pydanti
 ```

 !!! tip
-    While not strictly necessary, normally it´s better to indicate in the prompt the
+    While not strictly necessary, normally it's better to indicate in the prompt the
     JSON schema and how the fields should be populated. This can improve the
     results notably in most cases.

 Finally we have the `grammar` option, which is probably the most
-difficult to use, but it´s really powerful. It allows us to define complete
+difficult to use, but it's really powerful. It allows us to define complete
 languages like SQL queries. It works by using a context free EBNF grammar.
 As an example, we can use to define a specific format of simplified SQL queries:
@@ -303,7 +303,7 @@ An example of using `structural_tag` can be found here: [examples/online_serving
 ## Offline Inference

 Offline inference allows for the same types of structured outputs.
-To use it, we´ll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
+To use it, we'll need to configure the structured outputs using the class `StructuredOutputsParams` inside `SamplingParams`.
 The main available options inside `StructuredOutputsParams` are:

 - `json`
diff --git a/docs/getting_started/installation/cpu.arm.inc.md b/docs/getting_started/installation/cpu.arm.inc.md
index 611e6edf6..b5eb777b7 100644
--- a/docs/getting_started/installation/cpu.arm.inc.md
+++ b/docs/getting_started/installation/cpu.arm.inc.md
@@ -1,6 +1,6 @@
 # --8<-- [start:installation]

-vLLM offers basic model inferencing and serving on Arm CPU platform, with support NEON, data types FP32, FP16 and BF16.
+vLLM offers basic model inferencing and serving on the Arm CPU platform, with support for NEON and the FP32, FP16, and BF16 data types.

 # --8<-- [end:installation]
 # --8<-- [start:requirements]
diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md
index e3974354d..01025c43e 100644
--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -75,7 +75,7 @@ This guide will help you quickly get started with vLLM to perform:
 For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).

 !!! note
-    For more detail and non-CUDA platforms, please refer [here](installation/README.md) for specific instructions on how to install vLLM.
+    For more details and non-CUDA platforms, please refer to the [installation guide](installation/README.md) for specific instructions on how to install vLLM.

 ## Offline Batched Inference
diff --git a/docs/governance/collaboration.md b/docs/governance/collaboration.md
index 5b3d2beff..7f4d3c0dc 100644
--- a/docs/governance/collaboration.md
+++ b/docs/governance/collaboration.md
@@ -18,7 +18,7 @@ For features that you intend to maintain, please feel free to add yourself in [`
 If you use vLLM, we recommend you making the model work with vLLM by following the [model registration](../contributing/model/registration.md) process before you release it publicly. The vLLM team helps with new model architectures not supported by vLLM, especially models pushing architectural frontiers.

-Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate.
+Here's how the vLLM team works with model providers. The vLLM team includes all [committers](./committers.md) of the project. Model providers can exclude certain members but shouldn't, as this may harm release timelines due to missing expertise. Contact [project leads](./process.md) if you want to collaborate.

 Once we establish the connection between the vLLM team and model provider:
@@ -30,7 +30,7 @@ The vLLM team works with model providers on features, integrations, and release

 The vLLM maintainers will not publicly share details about model architecture, release timelines, or upcoming releases. We maintain model weights on secure servers with security measures (though we can work with security reviews and testing without certification). We delete pre-release weights or artifacts upon request.

-The vLLM team collaborates on marketing and promotional efforts for model releases. model providers can use vLLM's trademark and logo in publications and materials.
+The vLLM team collaborates on marketing and promotional efforts for model releases. Model providers can use vLLM's trademark and logo in publications and materials.

 ## Adding New Hardware
diff --git a/docs/models/extensions/fastsafetensor.md b/docs/models/extensions/fastsafetensor.md
index 0f30d4e2f..03c673f69 100644
--- a/docs/models/extensions/fastsafetensor.md
+++ b/docs/models/extensions/fastsafetensor.md
@@ -1,4 +1,4 @@
-Loading Model weights with fastsafetensors
+Loading model weights with fastsafetensors
 ===================================================================

 Using fastsafetensors library enables loading model weights to GPU memory by leveraging GPU direct storage. See [their GitHub repository](https://github.com/foundation-model-stack/fastsafetensors) for more details.
diff --git a/docs/models/generative_models.md b/docs/models/generative_models.md
index be2f25bf0..99914327e 100644
--- a/docs/models/generative_models.md
+++ b/docs/models/generative_models.md
@@ -2,7 +2,7 @@

 vLLM provides first-class support for generative models, which covers most of LLMs.

-In vLLM, generative models implement the[VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
+In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, which are then passed through [Sampler][vllm.v1.sample.sampler.Sampler] to obtain the final text.
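To make the `generative_models.md` description above concrete, here is a minimal offline example of the generate-then-sample flow (the model name and sampling values are arbitrary placeholders):

```python
from vllm import LLM, SamplingParams

# Any supported text-generation model works here.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# The model produces log probabilities over next tokens; the Sampler turns
# them into the generated text returned below.
outputs = llm.generate(["The capital of France is"], params)
for output in outputs:
    print(output.outputs[0].text)
```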
diff --git a/docs/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md
index 9d2b1e1ff..8c3cfe46a 100644
--- a/docs/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
@@ -874,7 +874,7 @@ You can pass multi-modal inputs to scoring models by passing `content` including
 Full example:

 - [examples/pooling/score/vision_score_api_online.py](../../examples/pooling/score/vision_score_api_online.py)
-- examples/pooling/score/vision_rerank_api_online.py](../../examples/pooling/score/vision_rerank_api_online.py)
+- [examples/pooling/score/vision_rerank_api_online.py](../../examples/pooling/score/vision_rerank_api_online.py)

 #### Extra parameters
diff --git a/docs/usage/troubleshooting.md b/docs/usage/troubleshooting.md
index c326fab1c..128c36b78 100644
--- a/docs/usage/troubleshooting.md
+++ b/docs/usage/troubleshooting.md
@@ -338,7 +338,7 @@ If you use triton kernels with cuda 13, you might see an error like `ptxas fatal
 vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
 ```

-It means that the ptxas in triton bundle not compatible with your device. You need to set `TRITON_PTXAS_PATH` environment variable to use cuda toolkit's ptxas manually instead:
+It means that the ptxas in the triton bundle is not compatible with your device. You need to set the `TRITON_PTXAS_PATH` environment variable to use the CUDA toolkit's ptxas manually instead:

 ```shell
 export CUDA_HOME=/usr/local/cuda
diff --git a/docs/usage/v1_guide.md b/docs/usage/v1_guide.md
index 5f647aafd..8506e01b9 100644
--- a/docs/usage/v1_guide.md
+++ b/docs/usage/v1_guide.md
@@ -123,7 +123,7 @@ We are working on enabling prefix caching and chunked prefill for more categorie
 #### Mamba Models

 Models using selective state-space mechanisms instead of standard transformer attention are supported.
-Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`,`FalconMambaForCausalLM`) are supported.
+Models that use Mamba-2 and Mamba-1 layers (e.g., `Mamba2ForCausalLM`, `MambaForCausalLM`, `FalconMambaForCausalLM`) are supported.
 Hybrid models that combine Mamba-2 and Mamba-1 layers with standard attention layers are also supported (e.g., `BambaForCausalLM`, `Zamba2ForCausalLM`, `NemotronHForCausalLM`, `FalconH1ForCausalLM` and `GraniteMoeHybridForCausalLM`, `JambaForCausalLM`, `Plamo2ForCausalLM`).
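The `openai_compatible_server.md` hunk above fixes a link to the vision scoring examples. For the plain-text case, a hedged client sketch against a locally running server (the endpoint path and the `text_1`/`text_2` fields follow the Score API described in that document; the reranker model name is only an example, and the multi-modal `content` form differs):

```python
import requests

# Assumes a server started with a scoring/reranker model, e.g.:
#   vllm serve BAAI/bge-reranker-v2-m3
response = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "text_1": "What is the capital of France?",
        "text_2": [
            "Paris is the capital of France.",
            "The weather is nice today.",
        ],
    },
)
print(response.json())
```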