diff --git a/docs/deployment/frameworks/skypilot.md b/docs/deployment/frameworks/skypilot.md
index f4a984a64..e9b0d5f06 100644
--- a/docs/deployment/frameworks/skypilot.md
+++ b/docs/deployment/frameworks/skypilot.md
@@ -4,7 +4,7 @@ vLLM

-vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
+vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).

 ## Prerequisites
diff --git a/docs/design/prefix_caching.md b/docs/design/prefix_caching.md
index bd4070f38..48536a877 100644
--- a/docs/design/prefix_caching.md
+++ b/docs/design/prefix_caching.md
@@ -1,6 +1,6 @@
 # Automatic Prefix Caching

-Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The core idea is simple – we cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests. Since prefix caching is almost a free lunch and won’t change model outputs, it has been widely used by many public endpoints (e.g., OpenAI, Anthropic, etc) and most open source LLM inference frameworks (e.g., SGLang).
+Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations. The core idea is simple – we cache the kv-cache blocks of processed requests, and reuse these blocks when a new request comes in with the same prefix as previous requests. Since prefix caching is almost a free lunch and won’t change model outputs, it has been widely used by many public endpoints (e.g., OpenAI, Anthropic, etc.) and most open source LLM inference frameworks (e.g., SGLang).

 While there are many ways to implement prefix caching, vLLM chooses a hash-based approach. Specifically, we hash each kv-cache block by the tokens in the block and the tokens in the prefix before the block:
diff --git a/docs/features/nixl_connector_usage.md b/docs/features/nixl_connector_usage.md
index 1ce038f4d..f0e25e31a 100644
--- a/docs/features/nixl_connector_usage.md
+++ b/docs/features/nixl_connector_usage.md
@@ -158,7 +158,7 @@ python tests/v1/kv_connector/nixl_integration/toy_proxy_server.py \

 ## Experimental Feature

-### Heterogenuous KV Layout support
+### Heterogeneous KV Layout support

 Support use case: Prefill with 'HND' and decode with 'NHD' with experimental configuration
diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md
index cfc8b4d98..9e86f785b 100644
--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -286,7 +286,7 @@ If desired, you can also manually set the backend of your choice by configuring
 - On NVIDIA CUDA: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
 - On AMD ROCm: `TRITON_ATTN`, `ROCM_ATTN`, `ROCM_AITER_FA` or `ROCM_AITER_UNIFIED_ATTN`.

-For AMD ROCm, you can futher control the specific Attention implementation using the following variables:
+For AMD ROCm, you can further control the specific Attention implementation using the following variables:

 - Triton Unified Attention: `VLLM_ROCM_USE_AITER=0 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
 - AITER Unified Attention: `VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 VLLM_V1_USE_PREFILL_DECODE_ATTENTION=0 VLLM_ROCM_USE_AITER_MHA=0`
diff --git a/tests/v1/ec_connector/integration/README.md b/tests/v1/ec_connector/integration/README.md
index 30426e055..2dbcb307f 100644
--- a/tests/v1/ec_connector/integration/README.md
+++ b/tests/v1/ec_connector/integration/README.md
@@ -113,7 +113,7 @@ Quick sanity check:

 - Outputs differ between baseline and disagg
 - Server startup fails
-- Encoder cache not found (should fallback to local execution)
+- Encoder cache not found (should fall back to local execution)
 - Proxy routing errors

 ## Notes
diff --git a/vllm/multimodal/evs.py b/vllm/multimodal/evs.py
index 4a288d2d2..8a36ea415 100644
--- a/vllm/multimodal/evs.py
+++ b/vllm/multimodal/evs.py
@@ -185,7 +185,7 @@ def recompute_mrope_positions(

     Args:
         input_ids: (N,) All input tokens of the prompt (entire sequence).
-        multimodal_positions: List of mrope positsions for each media.
+        multimodal_positions: List of mrope positions for each media.
         mrope_positions: Existing mrope positions (4, N) for entire sequence.
         num_computed_tokens: A number of computed tokens so far.
         vision_start_token_id: Token indicating start of vision media.
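A note on the prefix-caching hunk above: the paragraph it touches describes vLLM's hash-based approach, where each kv-cache block is keyed by its own tokens together with the tokens of the prefix before it. The following is only a minimal, illustrative sketch of that chaining idea; the block size, hash function, and helper name are assumptions for the example, not vLLM's actual implementation.

```python
# Minimal sketch of chained block hashing (illustration only, not vLLM code).
# Assumptions: a fixed block size of 16 tokens and SHA-256 as the hash.
import hashlib

BLOCK_SIZE = 16


def hash_blocks(token_ids: list[int]) -> list[str]:
    """Hash each full block by its tokens plus the hash of the preceding prefix."""
    hashes: list[str] = []
    prefix_hash = ""  # empty prefix for the first block
    num_full_blocks = len(token_ids) // BLOCK_SIZE
    for i in range(num_full_blocks):
        block = token_ids[i * BLOCK_SIZE : (i + 1) * BLOCK_SIZE]
        payload = (prefix_hash + "|" + ",".join(map(str, block))).encode()
        prefix_hash = hashlib.sha256(payload).hexdigest()
        hashes.append(prefix_hash)
    return hashes


# Two prompts sharing a 32-token prefix get identical hashes for their first
# two blocks, so those kv-cache blocks could be reused; the third block differs.
a = hash_blocks(list(range(48)))
b = hash_blocks(list(range(32)) + [999] * 16)
assert a[:2] == b[:2] and a[2] != b[2]
```

Chaining the prefix hash into each block's key is what makes a cache hit imply that the entire preceding context, not just the block's own tokens, is identical.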