Migrate docs from Sphinx to MkDocs (#18145)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
87
docs/contributing/deprecation_policy.md
Normal file
87
docs/contributing/deprecation_policy.md
Normal file
@@ -0,0 +1,87 @@
|
||||
# Deprecation Policy
|
||||
|
||||
This document outlines the official policy and process for deprecating features
|
||||
in the vLLM project.
|
||||
|
||||
## Overview
|
||||
|
||||
vLLM uses a structured "deprecation pipeline" to guide the lifecycle of
|
||||
deprecated features. This policy ensures that users are given clear and
|
||||
sufficient notice when a feature is deprecated and that deprecations proceed in
|
||||
a consistent and predictable manner.
|
||||
|
||||
We aim to strike a balance between continued innovation and respecting users’
|
||||
reliance on existing functionality. Deprecations are tied to our **minor (Y)
|
||||
releases** following semantic versioning (X.Y.Z), where:
|
||||
|
||||
- **X** is a major version (rare)
|
||||
- **Y** is a minor version (used for significant changes, including deprecations/removals)
|
||||
- **Z** is a patch version (used for fixes and safer enhancements)
|
||||
|
||||
Features that fall under this policy include (at a minimum) the following:
|
||||
|
||||
- CLI flags
|
||||
- Environment variables
|
||||
- Configuration files
|
||||
- APIs in the OpenAI-compatible API server
|
||||
- Public Python APIs for the `vllm` library
|
||||
|
||||
## Deprecation Pipeline
|
||||
|
||||
The deprecation process consists of several clearly defined stages that span
|
||||
multiple Y releases:
|
||||
|
||||
**1. Deprecated (Still On By Default)**
|
||||
|
||||
- **Action**: Feature is marked as deprecated.
|
||||
- **Timeline**: A removal version is explicitly stated in the deprecation
|
||||
warning (e.g., "This will be removed in v0.10.0").
|
||||
- **Communication**: Deprecation is noted in the following, as applicable:
|
||||
- Help strings
|
||||
- Log output
|
||||
- API responses
|
||||
- `/metrics` output (for metrics features)
|
||||
- User-facing documentation
|
||||
- Release notes
|
||||
- GitHub Issue (RFC) for feedback
|
||||
- Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs
|
||||
|
||||
**2.Deprecated (Off By Default)**
|
||||
|
||||
- **Action**: Feature is disabled by default, but can still be re-enabled via a
|
||||
CLI flag or environment variable. Feature throws an error when used without
|
||||
re-enabling.
|
||||
- **Purpose**: Allows users who missed earlier warnings a temporary escape hatch
|
||||
while signaling imminent removal. Ensures any remaining usage is clearly
|
||||
surfaced and blocks silent breakage before full removal.
|
||||
|
||||
**3. Removed**
|
||||
|
||||
- **Action**: Feature is completely removed from the codebase.
|
||||
- **Note**: Only features that have passed through the previous deprecation
|
||||
stages will be removed.
|
||||
|
||||
## Example Timeline
|
||||
|
||||
Assume a feature is deprecated in `v0.9.0`.
|
||||
|
||||
| Release | Status |
|
||||
|---------------|-------------------------------------------------------------------------------------------------|
|
||||
| `v0.9.0` | Feature is deprecated with clear removal version listed. |
|
||||
| `v0.10.0` | Feature is now off by default, throws an error when used, and can be re-enabled for legacy use. |
|
||||
| `v0.11.0` | Feature is removed. |
|
||||
|
||||
## Important Guidelines
|
||||
|
||||
- **No Removals in Patch Releases**: Removing deprecated features in patch
|
||||
(`.Z`) releases is disallowed to avoid surprising users.
|
||||
- **Grace Period for Existing Deprecations**: Any feature deprecated **before
|
||||
this policy** will have its grace period start **now**, not retroactively.
|
||||
- **Documentation is Critical**: Ensure every stage of the pipeline is
|
||||
documented clearly for users.
|
||||
|
||||
## Final Notes
|
||||
|
||||
This policy is a living document and may evolve as the needs of the project and
|
||||
its users change. Community feedback is welcome and encouraged as we refine the
|
||||
process.
|
||||
48
docs/contributing/dockerfile/dockerfile.md
Normal file
48
docs/contributing/dockerfile/dockerfile.md
Normal file
@@ -0,0 +1,48 @@
|
||||
# Dockerfile
|
||||
|
||||
We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
|
||||
More information about deploying with Docker can be found [here][deployment-docker].
|
||||
|
||||
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
|
||||
|
||||
- All build stages
|
||||
- The default build target (highlighted in grey)
|
||||
- External images (with dashed borders)
|
||||
|
||||
The edges of the build graph represent:
|
||||
|
||||
- `FROM ...` dependencies (with a solid line and a full arrow head)
|
||||
|
||||
- `COPY --from=...` dependencies (with a dashed line and an empty arrow head)
|
||||
|
||||
- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)
|
||||
|
||||
> <figure markdown="span">
|
||||
> { align="center" alt="query" width="100%" }
|
||||
> </figure>
|
||||
>
|
||||
> Made using: <https://github.com/patrickhoefler/dockerfilegraph>
|
||||
>
|
||||
> Commands to regenerate the build graph (make sure to run it **from the \`root\` directory of the vLLM repository** where the dockerfile is present):
|
||||
>
|
||||
> ```bash
|
||||
> dockerfilegraph -o png --legend --dpi 200 --max-label-length 50 --filename docker/Dockerfile
|
||||
> ```
|
||||
>
|
||||
> or in case you want to run it directly with the docker image:
|
||||
>
|
||||
> ```bash
|
||||
> docker run \
|
||||
> --rm \
|
||||
> --user "$(id -u):$(id -g)" \
|
||||
> --workdir /workspace \
|
||||
> --volume "$(pwd)":/workspace \
|
||||
> ghcr.io/patrickhoefler/dockerfilegraph:alpine \
|
||||
> --output png \
|
||||
> --dpi 200 \
|
||||
> --max-label-length 50 \
|
||||
> --filename docker/Dockerfile \
|
||||
> --legend
|
||||
> ```
|
||||
>
|
||||
> (To run it for a different file, you can pass in a different argument to the flag `--filename`.)
|
||||
23
docs/contributing/model/README.md
Normal file
23
docs/contributing/model/README.md
Normal file
@@ -0,0 +1,23 @@
|
||||
---
|
||||
title: Adding a New Model
|
||||
---
|
||||
[](){ #new-model }
|
||||
|
||||
This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
|
||||
|
||||
Contents:
|
||||
|
||||
- [Basic](basic.md)
|
||||
- [Registration](registration.md)
|
||||
- [Tests](tests.md)
|
||||
- [Multimodal](multimodal.md)
|
||||
|
||||
!!! note
|
||||
The complexity of adding a new model depends heavily on the model's architecture.
|
||||
The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
|
||||
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
|
||||
|
||||
!!! tip
|
||||
If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
|
||||
or ask on our [developer slack](https://slack.vllm.ai).
|
||||
We will be happy to help you out!
|
||||
122
docs/contributing/model/basic.md
Normal file
122
docs/contributing/model/basic.md
Normal file
@@ -0,0 +1,122 @@
|
||||
---
|
||||
title: Implementing a Basic Model
|
||||
---
|
||||
[](){ #new-model-basic }
|
||||
|
||||
This guide walks you through the steps to implement a basic vLLM model.
|
||||
|
||||
## 1. Bring your model code
|
||||
|
||||
First, clone the PyTorch model code from the source repository.
|
||||
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
|
||||
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
|
||||
|
||||
!!! warning
|
||||
Make sure to review and adhere to the original code's copyright and licensing terms!
|
||||
|
||||
## 2. Make your code compatible with vLLM
|
||||
|
||||
To ensure compatibility with vLLM, your model must meet the following requirements:
|
||||
|
||||
### Initialization Code
|
||||
|
||||
All vLLM modules within the model must include a `prefix` argument in their constructor. This `prefix` is typically the full name of the module in the model's state dictionary and is crucial for:
|
||||
|
||||
- Runtime support: vLLM's attention operators are registered in a model's state by their full names. Each attention operator must have a unique prefix as its layer name to avoid conflicts.
|
||||
- Non-uniform quantization support: A quantized checkpoint can selectively quantize certain layers while keeping others in full precision. By providing the `prefix` during initialization, vLLM can match the current layer's `prefix` with the quantization configuration to determine if the layer should be initialized in quantized mode.
|
||||
|
||||
The initialization code should look like this:
|
||||
|
||||
```python
|
||||
from torch import nn
|
||||
from vllm.config import VllmConfig
|
||||
from vllm.attention import Attention
|
||||
|
||||
class MyAttention(nn.Module):
|
||||
def __init__(self, vllm_config: VllmConfig, prefix: str):
|
||||
super().__init__()
|
||||
self.attn = Attention(prefix=f"{prefix}.attn")
|
||||
|
||||
class MyDecoderLayer(nn.Module):
|
||||
def __init__(self, vllm_config: VllmConfig, prefix: str):
|
||||
super().__init__()
|
||||
self.self_attn = MyAttention(prefix=f"{prefix}.self_attn")
|
||||
|
||||
class MyModel(nn.Module):
|
||||
def __init__(self, vllm_config: VllmConfig, prefix: str):
|
||||
super().__init__()
|
||||
self.layers = nn.ModuleList(
|
||||
[MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
|
||||
)
|
||||
|
||||
class MyModelForCausalLM(nn.Module):
|
||||
def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
|
||||
super().__init__()
|
||||
self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
|
||||
```
|
||||
|
||||
### Computation Code
|
||||
|
||||
- Add a `get_input_embeddings` method inside `MyModel` module that returns the text embeddings given `input_ids`. This is equivalent to directly calling the text embedding layer, but provides a unified interface in case `MyModel` is used within a composite multimodal model.
|
||||
|
||||
```python
|
||||
class MyModel(nn.Module):
|
||||
...
|
||||
|
||||
def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
|
||||
...
|
||||
```
|
||||
|
||||
- Rewrite the [forward][torch.nn.Module.forward] method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
|
||||
|
||||
```python
|
||||
def forward(
|
||||
self,
|
||||
input_ids: torch.Tensor,
|
||||
positions: torch.Tensor,
|
||||
) -> torch.Tensor:
|
||||
...
|
||||
```
|
||||
|
||||
!!! note
|
||||
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
|
||||
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
|
||||
|
||||
For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.
|
||||
|
||||
## 3. (Optional) Implement tensor parallelism and quantization support
|
||||
|
||||
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
|
||||
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
|
||||
For the embedding layer, you can simply replace [torch.nn.Embedding][] with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
|
||||
When it comes to the linear layers, we provide the following options to parallelize them:
|
||||
|
||||
- `ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
|
||||
- `RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
|
||||
- `ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
|
||||
- `MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
|
||||
- `QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When number of key/value heads are less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.
|
||||
|
||||
Note that all the linear layers above take `linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
|
||||
|
||||
## 4. Implement the weight loading logic
|
||||
|
||||
You now need to implement the `load_weights` method in your `*ForCausalLM` class.
|
||||
This method should load the weights from the HuggingFace's checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
|
||||
|
||||
## 5. Register your model
|
||||
|
||||
See [this page][new-model-registration] for instructions on how to register your new model to be used by vLLM.
|
||||
|
||||
## Frequently Asked Questions
|
||||
|
||||
### How to support models with interleaving sliding windows?
|
||||
|
||||
For models with interleaving sliding windows (e.g. `google/gemma-2-2b-it` and `mistralai/Ministral-8B-Instruct-2410`), the scheduler will treat the model as a full-attention model, i.e., kv-cache of all tokens will not be dropped. This is to make sure prefix caching works with these models. Sliding window only appears as a parameter to the attention kernel computation.
|
||||
|
||||
To support a model with interleaving sliding windows, we need to take care of the following details:
|
||||
|
||||
- Make sure the model's `config.json` contains `sliding_window_pattern`. vLLM then sets `self.hf_text_config.interleaved_sliding_window` to the value of `self.hf_text_config.sliding_window` and deletes `sliding_window` from `self.hf_text_config`. The model will then be treated as a full-attention model.
|
||||
- In the modeling code, parse the correct sliding window value for every layer, and pass it to the attention layer's `per_layer_sliding_window` argument. For reference, check [this line](https://github.com/vllm-project/vllm/blob/996357e4808ca5eab97d4c97c7d25b3073f46aab/vllm/model_executor/models/llama.py#L171).
|
||||
|
||||
With these two steps, interleave sliding windows should work with the model.
|
||||
803
docs/contributing/model/multimodal.md
Normal file
803
docs/contributing/model/multimodal.md
Normal file
@@ -0,0 +1,803 @@
|
||||
---
|
||||
title: Multi-Modal Support
|
||||
---
|
||||
[](){ #supports-multimodal }
|
||||
|
||||
This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs][multimodal-inputs].
|
||||
|
||||
## 1. Update the base vLLM model
|
||||
|
||||
It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic].
|
||||
Further update the model as follows:
|
||||
|
||||
- Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example:
|
||||
|
||||
```diff
|
||||
def forward(
|
||||
self,
|
||||
input_ids: torch.Tensor,
|
||||
positions: torch.Tensor,
|
||||
+ pixel_values: torch.Tensor,
|
||||
) -> SamplerOutput:
|
||||
```
|
||||
|
||||
More conveniently, you can simply pass `**kwargs` to the [forward][torch.nn.Module.forward] method and retrieve the keyword parameters for multimodal inputs from it.
|
||||
|
||||
- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
|
||||
|
||||
```python
|
||||
class YourModelForImage2Seq(nn.Module):
|
||||
...
|
||||
|
||||
def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
|
||||
|
||||
assert self.vision_encoder is not None
|
||||
image_features = self.vision_encoder(image_input)
|
||||
return self.multi_modal_projector(image_features)
|
||||
|
||||
def get_multimodal_embeddings(
|
||||
self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
|
||||
|
||||
# Validate the multimodal input keyword arguments
|
||||
image_input = self._parse_and_validate_image_input(**kwargs)
|
||||
if image_input is None:
|
||||
return None
|
||||
|
||||
# Run multimodal inputs through encoder and projector
|
||||
vision_embeddings = self._process_image_input(image_input)
|
||||
return vision_embeddings
|
||||
```
|
||||
|
||||
!!! warning
|
||||
The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request.
|
||||
|
||||
- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
|
||||
|
||||
```python
|
||||
from .utils import merge_multimodal_embeddings
|
||||
|
||||
class YourModelForImage2Seq(nn.Module):
|
||||
...
|
||||
|
||||
def get_input_embeddings(
|
||||
self,
|
||||
input_ids: torch.Tensor,
|
||||
multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
|
||||
) -> torch.Tensor:
|
||||
|
||||
# `get_input_embeddings` should already be implemented for the language
|
||||
# model as one of the requirements of basic vLLM model implementation.
|
||||
inputs_embeds = self.language_model.get_input_embeddings(input_ids)
|
||||
|
||||
if multimodal_embeddings is not None:
|
||||
inputs_embeds = merge_multimodal_embeddings(
|
||||
input_ids=input_ids,
|
||||
inputs_embeds=inputs_embeds,
|
||||
multimodal_embeddings=multimodal_embeddings,
|
||||
placeholder_token_id=self.config.image_token_index)
|
||||
|
||||
return inputs_embeds
|
||||
```
|
||||
|
||||
- Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model.
|
||||
|
||||
```python
|
||||
class YourModelForImage2Seq(nn.Module):
|
||||
...
|
||||
|
||||
def get_language_model(self) -> torch.nn.Module:
|
||||
# Change `language_model` according to your implementation.
|
||||
return self.language_model
|
||||
```
|
||||
|
||||
- Once the above steps are done, update the model class with the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
|
||||
|
||||
```diff
|
||||
+ from vllm.model_executor.models.interfaces import SupportsMultiModal
|
||||
|
||||
- class YourModelForImage2Seq(nn.Module):
|
||||
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
|
||||
```
|
||||
|
||||
!!! note
|
||||
The model class does not have to be named `*ForCausalLM`.
|
||||
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
|
||||
|
||||
## 2. Specify processing information
|
||||
|
||||
Next, create a subclass of [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo]
|
||||
to provide basic information related to HF processing.
|
||||
|
||||
### Maximum number of input items
|
||||
|
||||
You need to override the abstract method [get_supported_mm_limits][vllm.multimodal.processing.BaseProcessingInfo.get_supported_mm_limits]
|
||||
to return the maximum number of input items for each modality supported by the model.
|
||||
|
||||
For example, if the model supports any number of images but only one video per prompt:
|
||||
|
||||
```python
|
||||
def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
|
||||
return {"image": None, "video": 1}
|
||||
```
|
||||
|
||||
## 3. Specify dummy inputs
|
||||
|
||||
Then, inherit [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] to construct dummy inputs for
|
||||
HF processing as well as memory profiling.
|
||||
|
||||
### For memory profiling
|
||||
|
||||
Override the abstract methods [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text] and [get_dummy_mm_data][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_mm_data] to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model so that vLLM can reserve the correct amount of memory for it.
|
||||
|
||||
Assuming that the memory usage increases with the number of tokens, the dummy inputs can be constructed to maximize the number of output embeddings, which is the same number as placeholder feature tokens.
|
||||
|
||||
=== "Basic example: LLaVA"
|
||||
|
||||
Looking at the code of HF's `LlavaForConditionalGeneration`:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
|
||||
n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
|
||||
n_image_features = image_features.shape[0] * image_features.shape[1]
|
||||
|
||||
if n_image_tokens != n_image_features:
|
||||
raise ValueError(
|
||||
f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
|
||||
)
|
||||
special_image_mask = (
|
||||
(input_ids == self.config.image_token_index)
|
||||
.unsqueeze(-1)
|
||||
.expand_as(inputs_embeds)
|
||||
.to(inputs_embeds.device)
|
||||
)
|
||||
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
|
||||
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
|
||||
```
|
||||
|
||||
The number of placeholder feature tokens per image is `image_features.shape[1]`.
|
||||
`image_features` is calculated inside the `get_image_features` method:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
|
||||
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
|
||||
|
||||
selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
|
||||
if vision_feature_select_strategy == "default":
|
||||
selected_image_feature = selected_image_feature[:, 1:]
|
||||
elif vision_feature_select_strategy == "full":
|
||||
selected_image_feature = selected_image_feature
|
||||
else:
|
||||
raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
|
||||
image_features = self.multi_modal_projector(selected_image_feature)
|
||||
return image_features
|
||||
```
|
||||
|
||||
We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
|
||||
(`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
|
||||
Moreover, we only need the sequence length (the second dimension of the tensor) to get `image_features.shape[1]`.
|
||||
The sequence length is determined by the initial hidden states in `CLIPVisionTransformer` since the attention
|
||||
mechanism doesn't change the sequence length of the output hidden states.
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
|
||||
hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
|
||||
hidden_states = self.pre_layrnorm(hidden_states)
|
||||
|
||||
encoder_outputs = self.encoder(
|
||||
inputs_embeds=hidden_states,
|
||||
output_attentions=output_attentions,
|
||||
output_hidden_states=output_hidden_states,
|
||||
return_dict=return_dict,
|
||||
)
|
||||
```
|
||||
|
||||
To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
|
||||
target_dtype = self.patch_embedding.weight.dtype
|
||||
patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
|
||||
patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
|
||||
|
||||
class_embeds = self.class_embedding.expand(batch_size, 1, -1)
|
||||
embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
|
||||
if interpolate_pos_encoding:
|
||||
embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
|
||||
else:
|
||||
embeddings = embeddings + self.position_embedding(self.position_ids)
|
||||
return embeddings
|
||||
```
|
||||
|
||||
We can infer that `embeddings.shape[1] == self.num_positions`, where
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
|
||||
self.num_patches = (self.image_size // self.patch_size) ** 2
|
||||
self.num_positions = self.num_patches + 1
|
||||
```
|
||||
|
||||
Overall, the number of placeholder feature tokens for an image can be calculated as:
|
||||
|
||||
```python
|
||||
def get_num_image_tokens(
|
||||
self,
|
||||
*,
|
||||
image_width: int,
|
||||
image_height: int,
|
||||
) -> int:
|
||||
hf_config = self.get_hf_config()
|
||||
hf_processor = self.get_hf_processor()
|
||||
|
||||
image_size = hf_config.vision_config.image_size
|
||||
patch_size = hf_config.vision_config.patch_size
|
||||
|
||||
num_image_tokens = (image_size // patch_size) ** 2 + 1
|
||||
if hf_processor.vision_feature_select_strategy == "default":
|
||||
num_image_tokens -= 1
|
||||
|
||||
return num_image_tokens
|
||||
```
|
||||
|
||||
Notice that the number of image tokens doesn't depend on the image width and height.
|
||||
We can simply use a dummy `image_size` to calculate the multimodal profiling data:
|
||||
|
||||
```python
|
||||
# NOTE: In actuality, this is usually implemented as part of the
|
||||
# model's subclass of `BaseProcessingInfo`, but we show it as is
|
||||
# here for simplicity.
|
||||
def get_image_size_with_most_features(self) -> ImageSize:
|
||||
hf_config = self.get_hf_config()
|
||||
width = height = hf_config.image_size
|
||||
return ImageSize(width=width, height=height)
|
||||
|
||||
def get_dummy_mm_data(
|
||||
self,
|
||||
seq_len: int,
|
||||
mm_counts: Mapping[str, int],
|
||||
) -> MultiModalDataDict:
|
||||
num_images = mm_counts.get("image", 0)
|
||||
|
||||
target_width, target_height = \
|
||||
self.info.get_image_size_with_most_features()
|
||||
|
||||
return {
|
||||
"image":
|
||||
self._get_dummy_images(width=target_width,
|
||||
height=target_height,
|
||||
num_images=num_images)
|
||||
}
|
||||
```
|
||||
|
||||
For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
|
||||
|
||||
```python
|
||||
def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
|
||||
num_images = mm_counts.get("image", 0)
|
||||
|
||||
processor = self.info.get_hf_processor()
|
||||
image_token = processor.image_token
|
||||
|
||||
return image_token * num_images
|
||||
```
|
||||
|
||||
=== "No input placeholders: Fuyu"
|
||||
|
||||
Looking at the code of HF's `FuyuForCausalLM`:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
|
||||
if image_patches is not None and past_key_values is None:
|
||||
patch_embeddings = [
|
||||
self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
|
||||
.squeeze(0)
|
||||
.to(inputs_embeds.device)
|
||||
for patch in image_patches
|
||||
]
|
||||
inputs_embeds = self.gather_continuous_embeddings(
|
||||
word_embeddings=inputs_embeds,
|
||||
continuous_embeddings=patch_embeddings,
|
||||
image_patch_input_indices=image_patches_indices,
|
||||
)
|
||||
```
|
||||
|
||||
The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
|
||||
which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
|
||||
|
||||
Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information?
|
||||
Considering that the model input comes from the output of `FuyuProcessor`, let's **look at the preprocessing files**.
|
||||
|
||||
The image outputs are obtained by calling `FuyuImageProcessor.preprocess` and then
|
||||
`FuyuImageProcessor.preprocess_with_tokenizer_info` inside `FuyuProcessor`.
|
||||
|
||||
In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
|
||||
returning the dimensions after resizing (but before padding) as metadata.
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
|
||||
image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
|
||||
batch_images = image_encoding["images"]
|
||||
image_unpadded_heights = image_encoding["image_unpadded_heights"]
|
||||
image_unpadded_widths = image_encoding["image_unpadded_widths"]
|
||||
|
||||
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
|
||||
if do_resize:
|
||||
batch_images = [
|
||||
[self.resize(image, size=size, input_data_format=input_data_format) for image in images]
|
||||
for images in batch_images
|
||||
]
|
||||
|
||||
image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
|
||||
image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
|
||||
image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
|
||||
|
||||
if do_pad:
|
||||
batch_images = [
|
||||
[
|
||||
self.pad_image(
|
||||
image,
|
||||
size=size,
|
||||
mode=padding_mode,
|
||||
constant_values=padding_value,
|
||||
input_data_format=input_data_format,
|
||||
)
|
||||
for image in images
|
||||
]
|
||||
for images in batch_images
|
||||
]
|
||||
```
|
||||
|
||||
In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
|
||||
model_image_input = self.image_processor.preprocess_with_tokenizer_info(
|
||||
image_input=tensor_batch_images,
|
||||
image_present=image_present,
|
||||
image_unpadded_h=image_unpadded_heights,
|
||||
image_unpadded_w=image_unpadded_widths,
|
||||
image_placeholder_id=image_placeholder_id,
|
||||
image_newline_id=image_newline_id,
|
||||
variable_sized=True,
|
||||
)
|
||||
|
||||
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
|
||||
image_height, image_width = image.shape[1], image.shape[2]
|
||||
if variable_sized: # variable_sized=True
|
||||
new_h = min(
|
||||
image_height,
|
||||
math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
|
||||
)
|
||||
new_w = min(
|
||||
image_width,
|
||||
math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
|
||||
)
|
||||
image = image[:, :new_h, :new_w]
|
||||
image_height, image_width = new_h, new_w
|
||||
|
||||
num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
|
||||
tensor_of_image_ids = torch.full(
|
||||
[num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
|
||||
)
|
||||
patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
|
||||
assert num_patches == patches.shape[0]
|
||||
```
|
||||
|
||||
The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
|
||||
patch_size = patch_size if patch_size is not None else self.patch_size
|
||||
patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
|
||||
|
||||
if image_height % patch_height != 0:
|
||||
raise ValueError(f"{image_height=} must be divisible by {patch_height}")
|
||||
if image_width % patch_width != 0:
|
||||
raise ValueError(f"{image_width=} must be divisible by {patch_width}")
|
||||
|
||||
num_patches_per_dim_h = image_height // patch_height
|
||||
num_patches_per_dim_w = image_width // patch_width
|
||||
num_patches = num_patches_per_dim_h * num_patches_per_dim_w
|
||||
```
|
||||
|
||||
These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
|
||||
to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
|
||||
|
||||
```python
|
||||
def get_image_size_with_most_features(self) -> ImageSize:
|
||||
image_processor = self.get_image_processor()
|
||||
return ImageSize(width=image_processor.size["width"],
|
||||
height=image_processor.size["height"])
|
||||
```
|
||||
|
||||
Fuyu does not expect image placeholders in the inputs to HF processor, so
|
||||
the dummy prompt text is empty regardless of the number of images.
|
||||
|
||||
```python
|
||||
def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
|
||||
return ""
|
||||
```
|
||||
|
||||
For the multimodal image profiling data, the logic is very similar to LLaVA:
|
||||
|
||||
```python
|
||||
def get_dummy_mm_data(
|
||||
self,
|
||||
seq_len: int,
|
||||
mm_counts: Mapping[str, int],
|
||||
) -> MultiModalDataDict:
|
||||
target_width, target_height = \
|
||||
self.info.get_image_size_with_most_features()
|
||||
num_images = mm_counts.get("image", 0)
|
||||
|
||||
return {
|
||||
"image":
|
||||
self._get_dummy_images(width=target_width,
|
||||
height=target_height,
|
||||
num_images=num_images)
|
||||
}
|
||||
```
|
||||
|
||||
## 4. Specify processing details
|
||||
|
||||
Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]
|
||||
to fill in the missing details about HF processing.
|
||||
|
||||
!!! info
|
||||
[Multi-Modal Data Processing][mm-processing]
|
||||
|
||||
### Multi-modal fields
|
||||
|
||||
Override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] to
|
||||
return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.
|
||||
|
||||
=== "Basic example: LLaVA"
|
||||
|
||||
The output of `CLIPImageProcessor` is a simple tensor with shape
|
||||
`(num_images, num_channels, image_height, image_width)`:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
|
||||
images = [
|
||||
to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
|
||||
for image in all_images
|
||||
]
|
||||
|
||||
data = {"pixel_values": images}
|
||||
return BatchFeature(data=data, tensor_type=return_tensors)
|
||||
```
|
||||
|
||||
So, we override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
|
||||
|
||||
```python
|
||||
def _get_mm_fields_config(
|
||||
self,
|
||||
hf_inputs: BatchFeature,
|
||||
hf_processor_mm_kwargs: Mapping[str, object],
|
||||
) -> Mapping[str, MultiModalFieldConfig]:
|
||||
return dict(
|
||||
pixel_values=MultiModalFieldConfig.batched("image"),
|
||||
)
|
||||
```
|
||||
|
||||
!!! note
|
||||
Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
|
||||
pre-computed image embeddings, which can be passed to be model via the `image_embeds` argument.
|
||||
|
||||
=== "With postprocessing: Fuyu"
|
||||
|
||||
The `image_patches` output of `FuyuImageProcessor.preprocess_with_tokenizer_info` concatenates
|
||||
the patches from each image belonging to an item in the batch:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
|
||||
image_input_ids.append(tensor_of_image_ids)
|
||||
image_patches.append(patches)
|
||||
else:
|
||||
image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
|
||||
|
||||
batch_image_input_ids.append(image_input_ids)
|
||||
batch_image_patches.append(image_patches)
|
||||
```
|
||||
|
||||
The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
|
||||
`(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
|
||||
|
||||
In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
|
||||
we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
|
||||
|
||||
```python
|
||||
def _call_hf_processor(
|
||||
self,
|
||||
prompt: str,
|
||||
mm_data: Mapping[str, object],
|
||||
mm_kwargs: Mapping[str, object],
|
||||
) -> BatchFeature:
|
||||
processed_outputs = super()._call_hf_processor(
|
||||
prompt=prompt,
|
||||
mm_data=mm_data,
|
||||
mm_kwargs=mm_kwargs,
|
||||
)
|
||||
|
||||
image_patches = processed_outputs.get("image_patches")
|
||||
if image_patches is not None:
|
||||
images = mm_data["images"]
|
||||
assert isinstance(images, list)
|
||||
|
||||
# Original output: (1, num_images, Pn, Px * Py * C)
|
||||
# New output: (num_images, Pn, Px * Py * C)
|
||||
assert (isinstance(image_patches, list)
|
||||
and len(image_patches) == 1)
|
||||
assert (isinstance(image_patches[0], torch.Tensor)
|
||||
and len(image_patches[0]) == len(images))
|
||||
|
||||
processed_outputs["image_patches"] = image_patches[0]
|
||||
|
||||
return processed_outputs
|
||||
```
|
||||
|
||||
!!! note
|
||||
Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
|
||||
for text-only inputs to prevent unnecessary warnings from HF processor.
|
||||
|
||||
This lets us override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
|
||||
|
||||
```python
|
||||
def _get_mm_fields_config(
|
||||
self,
|
||||
hf_inputs: BatchFeature,
|
||||
hf_processor_mm_kwargs: Mapping[str, object],
|
||||
) -> Mapping[str, MultiModalFieldConfig]:
|
||||
return dict(image_patches=MultiModalFieldConfig.batched("image"))
|
||||
```
|
||||
|
||||
### Prompt updates
|
||||
|
||||
Override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] to
|
||||
return a list of [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instances.
|
||||
|
||||
Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies an update operation
|
||||
(e.g.: insertion, replacement) performed by the HF processor.
|
||||
|
||||
=== "Basic example: LLaVA"
|
||||
|
||||
Looking at HF's `LlavaProcessor`:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
|
||||
prompt_strings = []
|
||||
for sample in text:
|
||||
sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
|
||||
prompt_strings.append(sample)
|
||||
```
|
||||
|
||||
It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
|
||||
Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:
|
||||
|
||||
```python
|
||||
def _get_prompt_updates(
|
||||
self,
|
||||
mm_items: MultiModalDataItems,
|
||||
hf_processor_mm_kwargs: Mapping[str, object],
|
||||
out_mm_kwargs: MultiModalKwargs,
|
||||
) -> Sequence[PromptUpdate]:
|
||||
hf_config = self.info.get_hf_config()
|
||||
image_token_id = hf_config.image_token_index
|
||||
|
||||
def get_replacement(item_idx: int):
|
||||
images = mm_items.get_items("image", ImageProcessorItems)
|
||||
|
||||
image_size = images.get_image_size(item_idx)
|
||||
num_image_tokens = self.info.get_num_image_tokens(
|
||||
image_width=image_size.width,
|
||||
image_height=image_size.height,
|
||||
)
|
||||
|
||||
return [image_token_id] * num_image_tokens
|
||||
|
||||
return [
|
||||
PromptReplacement(
|
||||
modality="image",
|
||||
target=[image_token_id],
|
||||
replacement=get_replacement,
|
||||
),
|
||||
]
|
||||
```
|
||||
|
||||
=== "Handling additional tokens: Fuyu"
|
||||
|
||||
Recall the layout of feature tokens from Step 2:
|
||||
|
||||
```
|
||||
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
|
||||
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
|
||||
...
|
||||
|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
|
||||
```
|
||||
|
||||
We define a helper function to return `ncols` and `nrows` directly:
|
||||
|
||||
```python
|
||||
def get_image_feature_grid_size(
|
||||
self,
|
||||
*,
|
||||
image_width: int,
|
||||
image_height: int,
|
||||
) -> tuple[int, int]:
|
||||
image_processor = self.get_image_processor()
|
||||
target_width = image_processor.size["width"]
|
||||
target_height = image_processor.size["height"]
|
||||
patch_width = image_processor.patch_size["width"]
|
||||
patch_height = image_processor.patch_size["height"]
|
||||
|
||||
if not (image_width <= target_width and image_height <= target_height):
|
||||
height_scale_factor = target_height / image_height
|
||||
width_scale_factor = target_width / image_width
|
||||
optimal_scale_factor = min(height_scale_factor, width_scale_factor)
|
||||
|
||||
image_height = int(image_height * optimal_scale_factor)
|
||||
image_width = int(image_width * optimal_scale_factor)
|
||||
|
||||
ncols = math.ceil(image_width / patch_width)
|
||||
nrows = math.ceil(image_height / patch_height)
|
||||
return ncols, nrows
|
||||
```
|
||||
|
||||
Based on this, we can initially define our replacement tokens as:
|
||||
|
||||
```python
|
||||
def get_replacement(item_idx: int):
|
||||
images = mm_items.get_items("image", ImageProcessorItems)
|
||||
image_size = images.get_image_size(item_idx)
|
||||
|
||||
ncols, nrows = self.info.get_image_feature_grid_size(
|
||||
image_width=image_size.width,
|
||||
image_height=image_size.height,
|
||||
)
|
||||
|
||||
# `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
|
||||
# `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
|
||||
return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
|
||||
```
|
||||
|
||||
However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
|
||||
a BOS token (`<s>`) is also added to the promopt:
|
||||
|
||||
```python
|
||||
# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
|
||||
model_image_input = self.image_processor.preprocess_with_tokenizer_info(
|
||||
image_input=tensor_batch_images,
|
||||
image_present=image_present,
|
||||
image_unpadded_h=image_unpadded_heights,
|
||||
image_unpadded_w=image_unpadded_widths,
|
||||
image_placeholder_id=image_placeholder_id,
|
||||
image_newline_id=image_newline_id,
|
||||
variable_sized=True,
|
||||
)
|
||||
prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
|
||||
tokenizer=self.tokenizer,
|
||||
prompts=prompts,
|
||||
scale_factors=scale_factors,
|
||||
max_tokens_to_generate=self.max_tokens_to_generate,
|
||||
max_position_embeddings=self.max_position_embeddings,
|
||||
add_BOS=True,
|
||||
add_beginning_of_answer_token=True,
|
||||
)
|
||||
```
|
||||
|
||||
To assign the vision embeddings to only the image tokens, instead of a string
|
||||
you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:
|
||||
|
||||
```python
|
||||
hf_config = self.info.get_hf_config()
|
||||
bos_token_id = hf_config.bos_token_id # `<s>`
|
||||
assert isinstance(bos_token_id, int)
|
||||
|
||||
def get_replacement_fuyu(item_idx: int):
|
||||
images = mm_items.get_items("image", ImageProcessorItems)
|
||||
image_size = images.get_image_size(item_idx)
|
||||
|
||||
ncols, nrows = self.info.get_image_feature_grid_size(
|
||||
image_width=image_size.width,
|
||||
image_height=image_size.height,
|
||||
)
|
||||
image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
|
||||
[_NEWLINE_TOKEN_ID]) * nrows
|
||||
|
||||
return PromptUpdateDetails.select_token_id(
|
||||
image_tokens + [bos_token_id],
|
||||
embed_token_id=_IMAGE_TOKEN_ID,
|
||||
)
|
||||
```
|
||||
|
||||
Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
|
||||
we can search for it to conduct the replacement at the start of the string:
|
||||
|
||||
```python
|
||||
def _get_prompt_updates(
|
||||
self,
|
||||
mm_items: MultiModalDataItems,
|
||||
hf_processor_mm_kwargs: Mapping[str, object],
|
||||
out_mm_kwargs: MultiModalKwargs,
|
||||
) -> Sequence[PromptUpdate]:
|
||||
hf_config = self.info.get_hf_config()
|
||||
bos_token_id = hf_config.bos_token_id
|
||||
assert isinstance(bos_token_id, int)
|
||||
|
||||
tokenizer = self.info.get_tokenizer()
|
||||
eot_token_id = tokenizer.bos_token_id
|
||||
assert isinstance(eot_token_id, int)
|
||||
|
||||
def get_replacement_fuyu(item_idx: int):
|
||||
images = mm_items.get_items("image", ImageProcessorItems)
|
||||
image_size = images.get_image_size(item_idx)
|
||||
|
||||
ncols, nrows = self.info.get_image_feature_grid_size(
|
||||
image_width=image_size.width,
|
||||
image_height=image_size.height,
|
||||
)
|
||||
image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
|
||||
[_NEWLINE_TOKEN_ID]) * nrows
|
||||
|
||||
return PromptUpdateDetails.select_token_id(
|
||||
image_tokens + [bos_token_id],
|
||||
embed_token_id=_IMAGE_TOKEN_ID,
|
||||
)
|
||||
|
||||
return [
|
||||
PromptReplacement(
|
||||
modality="image",
|
||||
target=[eot_token_id],
|
||||
replacement=get_replacement_fuyu,
|
||||
)
|
||||
]
|
||||
```
|
||||
|
||||
## 5. Register processor-related classes
|
||||
|
||||
After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
|
||||
[BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
|
||||
and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
|
||||
decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_processor <vllm.multimodal.registry.MultiModalRegistry.register_processor>`
|
||||
to register them to the multi-modal registry:
|
||||
|
||||
```diff
|
||||
from vllm.model_executor.models.interfaces import SupportsMultiModal
|
||||
+ from vllm.multimodal import MULTIMODAL_REGISTRY
|
||||
|
||||
+ @MULTIMODAL_REGISTRY.register_processor(YourMultiModalProcessor,
|
||||
+ info=YourProcessingInfo,
|
||||
+ dummy_inputs=YourDummyInputsBuilder)
|
||||
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
### Inserting feature tokens without replacement
|
||||
|
||||
Some HF processors directly insert feature tokens without replacing anything in the original prompt. In that case, you can use [PromptInsertion][vllm.multimodal.processing.PromptInsertion] instead of [PromptReplacement][vllm.multimodal.processing.PromptReplacement] inside [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
|
||||
|
||||
Examples:
|
||||
|
||||
- BLIP-2 (insert at start of prompt): <gh-file:vllm/model_executor/models/blip2.py>
|
||||
- Florence2 (insert at start of prompt): <gh-file:vllm/model_executor/models/florence2.py>
|
||||
- Molmo (insert after `<|endoftext|>` token): <gh-file:vllm/model_executor/models/molmo.py>
|
||||
|
||||
### Handling prompt updates unrelated to multi-modal data
|
||||
|
||||
[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design][mm-processing].
|
||||
|
||||
Examples:
|
||||
|
||||
- Chameleon (appends `sep_token`): <gh-file:vllm/model_executor/models/chameleon.py>
|
||||
- Fuyu (appends `boa_token`): <gh-file:vllm/model_executor/models/fuyu.py>
|
||||
- Molmo (applies chat template which is not defined elsewhere): <gh-file:vllm/model_executor/models/molmo.py>
|
||||
|
||||
### Custom HF processor
|
||||
|
||||
Some models don't define a HF processor class on HF Hub. In that case, you can define a custom HF processor that has the same call signature as HF processors and pass it to [_call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor].
|
||||
|
||||
Examples:
|
||||
|
||||
- DeepSeek-VL2: <gh-file:vllm/model_executor/models/deepseek_vl2.py>
|
||||
- InternVL: <gh-file:vllm/model_executor/models/internvl.py>
|
||||
- Qwen-VL: <gh-file:vllm/model_executor/models/qwen_vl.py>
|
||||
52
docs/contributing/model/registration.md
Normal file
52
docs/contributing/model/registration.md
Normal file
@@ -0,0 +1,52 @@
|
||||
---
|
||||
title: Registering a Model to vLLM
|
||||
---
|
||||
[](){ #new-model-registration }
|
||||
|
||||
vLLM relies on a model registry to determine how to run each model.
|
||||
A list of pre-registered architectures can be found [here][supported-models].
|
||||
|
||||
If your model is not on this list, you must register it to vLLM.
|
||||
This page provides detailed instructions on how to do so.
|
||||
|
||||
## Built-in models
|
||||
|
||||
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
|
||||
This gives you the ability to modify the codebase and test your model.
|
||||
|
||||
After you have implemented your model (see [tutorial][new-model-basic]), put it into the <gh-dir:vllm/model_executor/models> directory.
|
||||
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
|
||||
Finally, update our [list of supported models][supported-models] to promote your model!
|
||||
|
||||
!!! warning
|
||||
The list of models in each section should be maintained in alphabetical order.
|
||||
|
||||
## Out-of-tree models
|
||||
|
||||
You can load an external model using a plugin without modifying the vLLM codebase.
|
||||
|
||||
!!! info
|
||||
[vLLM's Plugin System][plugin-system]
|
||||
|
||||
To register the model, use the following code:
|
||||
|
||||
```python
|
||||
from vllm import ModelRegistry
|
||||
from your_code import YourModelForCausalLM
|
||||
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
|
||||
```
|
||||
|
||||
If your model imports modules that initialize CUDA, consider lazy-importing it to avoid errors like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`:
|
||||
|
||||
```python
|
||||
from vllm import ModelRegistry
|
||||
|
||||
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
|
||||
```
|
||||
|
||||
!!! warning
|
||||
If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
|
||||
Read more about that [here][supports-multimodal].
|
||||
|
||||
!!! note
|
||||
Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
|
||||
62
docs/contributing/model/tests.md
Normal file
62
docs/contributing/model/tests.md
Normal file
@@ -0,0 +1,62 @@
|
||||
---
|
||||
title: Writing Unit Tests
|
||||
---
|
||||
[](){ #new-model-tests }
|
||||
|
||||
This page explains how to write unit tests to verify the implementation of your model.
|
||||
|
||||
## Required Tests
|
||||
|
||||
These tests are necessary to get your PR merged into vLLM library.
|
||||
Without them, the CI for your PR will fail.
|
||||
|
||||
### Model loading
|
||||
|
||||
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
|
||||
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
|
||||
|
||||
!!! warning
|
||||
The list of models in each section should be maintained in alphabetical order.
|
||||
|
||||
!!! tip
|
||||
If your model requires a development version of HF Transformers, you can set
|
||||
`min_transformers_version` to skip the test in CI until the model is released.
|
||||
|
||||
## Optional Tests
|
||||
|
||||
These tests are optional to get your PR merged into vLLM library.
|
||||
Passing these tests provides more confidence that your implementation is correct, and helps avoid future regressions.
|
||||
|
||||
### Model correctness
|
||||
|
||||
These tests compare the model outputs of vLLM against [HF Transformers](https://github.com/huggingface/transformers). You can add new tests under the subdirectories of <gh-dir:tests/models>.
|
||||
|
||||
#### Generative models
|
||||
|
||||
For [generative models][generative-models], there are two levels of correctness tests, as defined in <gh-file:tests/models/utils.py>:
|
||||
|
||||
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
|
||||
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
|
||||
|
||||
#### Pooling models
|
||||
|
||||
For [pooling models][pooling-models], we simply check the cosine similarity, as defined in <gh-file:tests/models/embedding/utils.py>.
|
||||
|
||||
[](){ #mm-processing-tests }
|
||||
|
||||
### Multi-modal processing
|
||||
|
||||
#### Common tests
|
||||
|
||||
Adding your model to <gh-file:tests/models/multimodal/processing/test_common.py> verifies that the following input combinations result in the same outputs:
|
||||
|
||||
- Text + multi-modal data
|
||||
- Tokens + multi-modal data
|
||||
- Text + cached multi-modal data
|
||||
- Tokens + cached multi-modal data
|
||||
|
||||
#### Model-specific tests
|
||||
|
||||
You can add a new file under <gh-dir:tests/models/multimodal/processing> to run tests that only apply to your model.
|
||||
|
||||
For example, if the HF processor for your model accepts user-specified keyword arguments, you can verify that the keyword arguments are being applied correctly, such as in <gh-file:tests/models/multimodal/processing/test_phi3v.py>.
|
||||
189
docs/contributing/overview.md
Normal file
189
docs/contributing/overview.md
Normal file
@@ -0,0 +1,189 @@
|
||||
# Contributing to vLLM
|
||||
|
||||
Thank you for your interest in contributing to vLLM! Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large. There are several ways you can contribute to the project:
|
||||
|
||||
- Identify and report any issues or bugs.
|
||||
- Request or add support for a new model.
|
||||
- Suggest or implement new features.
|
||||
- Improve documentation or contribute a how-to guide.
|
||||
|
||||
We also believe in the power of community support; thus, answering queries, offering PR reviews, and assisting others are also highly regarded and beneficial contributions.
|
||||
|
||||
Finally, one of the most impactful ways to support us is by raising awareness about vLLM. Talk about it in your blog posts and highlight how it's driving your incredible projects. Express your support on social media if you're using vLLM, or simply offer your appreciation by starring our repository!
|
||||
|
||||
## Job Board
|
||||
|
||||
Unsure on where to start? Check out the following links for tasks to work on:
|
||||
|
||||
- [Good first issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22)
|
||||
- [Selected onboarding tasks](gh-project:6)
|
||||
- [New model requests](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22new-model%22)
|
||||
- [Models with multi-modal capabilities](gh-project:10)
|
||||
|
||||
## License
|
||||
|
||||
See <gh-file:LICENSE>.
|
||||
|
||||
## Developing
|
||||
|
||||
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
|
||||
Check out the [building from source][build-from-source] documentation for details.
|
||||
|
||||
### Building the docs
|
||||
|
||||
Install the dependencies:
|
||||
|
||||
```bash
|
||||
pip install -r requirements/docs.txt
|
||||
```
|
||||
|
||||
Start the autoreloading MkDocs server:
|
||||
|
||||
```bash
|
||||
mkdocs serve
|
||||
```
|
||||
|
||||
## Testing
|
||||
|
||||
```bash
|
||||
pip install -r requirements/dev.txt
|
||||
|
||||
# Linting, formatting and static type checking
|
||||
pre-commit install --hook-type pre-commit --hook-type commit-msg
|
||||
|
||||
# You can manually run pre-commit with
|
||||
pre-commit run --all-files
|
||||
|
||||
# To manually run something from CI that does not run
|
||||
# locally by default, you can run:
|
||||
pre-commit run mypy-3.9 --hook-stage manual --all-files
|
||||
|
||||
# Unit tests
|
||||
pytest tests/
|
||||
```
|
||||
|
||||
!!! tip
|
||||
Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
|
||||
|
||||
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
|
||||
|
||||
!!! note
|
||||
Currently, the repository is not fully checked by `mypy`.
|
||||
|
||||
!!! note
|
||||
Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
|
||||
platform to run unit tests locally, rely on the continuous integration system to run the tests for
|
||||
now.
|
||||
|
||||
## Issues
|
||||
|
||||
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
|
||||
|
||||
!!! warning
|
||||
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
|
||||
|
||||
## Pull Requests & Code Reviews
|
||||
|
||||
Thank you for your contribution to vLLM! Before submitting the pull request,
|
||||
please ensure the PR meets the following criteria. This helps vLLM maintain the
|
||||
code quality and improve the efficiency of the review process.
|
||||
|
||||
### DCO and Signed-off-by
|
||||
|
||||
When contributing changes to this project, you must agree to the <gh-file:DCO>.
|
||||
Commits must include a `Signed-off-by:` header which certifies agreement with
|
||||
the terms of the DCO.
|
||||
|
||||
Using `-s` with `git commit` will automatically add this header.
|
||||
|
||||
### PR Title and Classification
|
||||
|
||||
Only specific types of PRs will be reviewed. The PR title is prefixed
|
||||
appropriately to indicate the type of change. Please use one of the following:
|
||||
|
||||
- `[Bugfix]` for bug fixes.
|
||||
- `[CI/Build]` for build or continuous integration improvements.
|
||||
- `[Doc]` for documentation fixes and improvements.
|
||||
- `[Model]` for adding a new model or improving an existing model. Model name
|
||||
should appear in the title.
|
||||
- `[Frontend]` For changes on the vLLM frontend (e.g., OpenAI API server,
|
||||
`LLM` class, etc.)
|
||||
- `[Kernel]` for changes affecting CUDA kernels or other compute kernels.
|
||||
- `[Core]` for changes in the core vLLM logic (e.g., `LLMEngine`,
|
||||
`AsyncLLMEngine`, `Scheduler`, etc.)
|
||||
- `[Hardware][Vendor]` for hardware-specific changes. Vendor name should
|
||||
appear in the prefix (e.g., `[Hardware][AMD]`).
|
||||
- `[Misc]` for PRs that do not fit the above categories. Please use this
|
||||
sparingly.
|
||||
|
||||
!!! note
|
||||
If the PR spans more than one category, please include all relevant prefixes.
|
||||
|
||||
### Code Quality
|
||||
|
||||
The PR needs to meet the following code quality standards:
|
||||
|
||||
- We adhere to [Google Python style guide](https://google.github.io/styleguide/pyguide.html) and [Google C++ style guide](https://google.github.io/styleguide/cppguide.html).
|
||||
- Pass all linter checks. Please use `pre-commit` to format your code. See
|
||||
<https://pre-commit.com/#usage> if `pre-commit` is new to you.
|
||||
- The code needs to be well-documented to ensure future contributors can easily
|
||||
understand the code.
|
||||
- Include sufficient tests to ensure the project stays correct and robust. This
|
||||
includes both unit tests and integration tests.
|
||||
- Please add documentation to `docs/source/` if the PR modifies the
|
||||
user-facing behaviors of vLLM. It helps vLLM users understand and utilize the
|
||||
new features or changes.
|
||||
|
||||
### Adding or Changing Kernels
|
||||
|
||||
Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.
|
||||
|
||||
- Make sure custom ops are registered following PyTorch guidelines:
|
||||
[Custom C++ and CUDA Operators](https://pytorch.org/tutorials/advanced/cpp_custom_ops.html#cpp-custom-ops-tutorial)
|
||||
and [The Custom Operators Manual](https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU).
|
||||
- Custom operations that return `Tensors` require meta-functions.
|
||||
Meta-functions should be implemented and registered in Python so that dynamic
|
||||
dims can be handled automatically. See above documents for a description of
|
||||
meta-functions.
|
||||
- Use [torch.library.opcheck()](https://pytorch.org/docs/stable/library.html#torch.library.opcheck)
|
||||
to test the function registration and meta-function for any registered ops.
|
||||
See `tests/kernels` for examples.
|
||||
- When changing the C++ signature of an existing op, the schema must be updated
|
||||
to reflect the changes.
|
||||
- If a new custom type is needed, see the following document:
|
||||
[Custom Class Support in PT2](https://docs.google.com/document/d/18fBMPuOJ0fY5ZQ6YyrHUppw9FA332CpNtgB6SOIgyuA).
|
||||
|
||||
### Notes for Large Changes
|
||||
|
||||
Please keep the changes as concise as possible. For major architectural changes
|
||||
(>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue
|
||||
(RFC) discussing the technical design and justification. Otherwise, we will tag
|
||||
it with `rfc-required` and might not go through the PR.
|
||||
|
||||
### What to Expect for the Reviews
|
||||
|
||||
The goal of the vLLM team is to be a *transparent reviewing machine*. We would
|
||||
like to make the review process transparent and efficient and make sure no
|
||||
contributor feels confused or frustrated. However, the vLLM team is small, so we
|
||||
need to prioritize some PRs over others. Here is what you can expect from the
|
||||
review process:
|
||||
|
||||
- After the PR is submitted, the PR will be assigned to a reviewer. Every
|
||||
reviewer will pick up the PRs based on their expertise and availability.
|
||||
- After the PR is assigned, the reviewer will provide status updates every 2-3
|
||||
days. If the PR is not reviewed within 7 days, please feel free to ping the
|
||||
reviewer or the vLLM team.
|
||||
- After the review, the reviewer will put an `action-required` label on the PR
|
||||
if there are changes required. The contributor should address the comments and
|
||||
ping the reviewer to re-review the PR.
|
||||
- Please respond to all comments within a reasonable time frame. If a comment
|
||||
isn't clear or you disagree with a suggestion, feel free to ask for
|
||||
clarification or discuss the suggestion.
|
||||
- Note that not all CI checks will be executed due to limited computational
|
||||
resources. The reviewer will add `ready` label to the PR when the PR is
|
||||
ready to merge or a full CI run is needed.
|
||||
|
||||
## Thank You
|
||||
|
||||
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM.
|
||||
All of your contributions help make vLLM a great tool and community for everyone!
|
||||
172
docs/contributing/profiling.md
Normal file
172
docs/contributing/profiling.md
Normal file
@@ -0,0 +1,172 @@
|
||||
# Profiling vLLM
|
||||
|
||||
!!! warning
|
||||
Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
|
||||
|
||||
## Profile with PyTorch Profiler
|
||||
|
||||
We support tracing vLLM workers using the `torch.profiler` module. You can enable tracing by setting the `VLLM_TORCH_PROFILER_DIR` environment variable to the directory where you want to save the traces: `VLLM_TORCH_PROFILER_DIR=/mnt/traces/`
|
||||
|
||||
The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` environment variable set.
|
||||
|
||||
When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.
|
||||
|
||||
Traces can be visualized using <https://ui.perfetto.dev/>.
|
||||
|
||||
!!! tip
|
||||
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
|
||||
|
||||
!!! tip
|
||||
To stop the profiler - it flushes out all the profile trace files to the directory. This takes time, for example for about 100 requests worth of data for a llama 70b, it takes about 10 minutes to flush out on a H100.
|
||||
Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the server. Say something like 30 minutes.
|
||||
`export VLLM_RPC_TIMEOUT=1800000`
|
||||
|
||||
### Example commands and usage
|
||||
|
||||
#### Offline Inference
|
||||
|
||||
Refer to <gh-file:examples/offline_inference/simple_profiling.py> for an example.
|
||||
|
||||
#### OpenAI Server
|
||||
|
||||
```bash
|
||||
VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
|
||||
```
|
||||
|
||||
benchmark_serving.py:
|
||||
|
||||
```bash
|
||||
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Meta-Llama-3-70B --dataset-name sharegpt --dataset-path sharegpt.json --profile --num-prompts 2
|
||||
```
|
||||
|
||||
## Profile with NVIDIA Nsight Systems
|
||||
|
||||
Nsight systems is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions and low-level CUDA APIs and events.
|
||||
|
||||
[Install nsight-systems](https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html) using your package manager.
|
||||
The following block is an example for Ubuntu.
|
||||
|
||||
```bash
|
||||
apt update
|
||||
apt install -y --no-install-recommends gnupg
|
||||
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
|
||||
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
|
||||
apt update
|
||||
apt install nsight-systems-cli
|
||||
```
|
||||
|
||||
### Example commands and usage
|
||||
|
||||
#### Offline Inference
|
||||
|
||||
For basic usage, you can just append `nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node` before any existing script you would run for offline inference.
|
||||
|
||||
The following is an example using the `benchmarks/benchmark_latency.py` script:
|
||||
|
||||
```bash
|
||||
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node python benchmarks/benchmark_latency.py --model meta-llama/Llama-3.1-8B-Instruct --num-iters-warmup 5 --num-iters 1 --batch-size 16 --input-len 512 --output-len 8
|
||||
```
|
||||
|
||||
#### OpenAI Server
|
||||
|
||||
To profile the server, you will want to prepend your `vllm serve` command with `nsys profile` just like for offline inference, however you must specify `--delay XX --duration YY` parameters according to the needs of your benchmark. After the duration time has been used up, the server will be killed.
|
||||
|
||||
```bash
|
||||
# server
|
||||
nsys profile -o report.nsys-rep --trace-fork-before-exec=true --cuda-graph-trace=node --delay 30 --duration 60 vllm serve meta-llama/Llama-3.1-8B-Instruct
|
||||
|
||||
# client
|
||||
python benchmarks/benchmark_serving.py --backend vllm --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1 --dataset-name random --random-input 1024 --random-output 512
|
||||
```
|
||||
|
||||
In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run:
|
||||
|
||||
```
|
||||
nsys sessions list
|
||||
```
|
||||
|
||||
to get the session id in the form of `profile-XXXXX`, then run:
|
||||
|
||||
```
|
||||
nsys stop --session=profile-XXXXX
|
||||
```
|
||||
|
||||
to manually kill the profiler and generate your `nsys-rep` report.
|
||||
|
||||
#### Analysis
|
||||
|
||||
You can view these profiles either as summaries in the CLI, using `nsys stats [profile-file]`, or in the GUI by installing Nsight [locally following the directions here](https://developer.nvidia.com/nsight-systems/get-started).
|
||||
|
||||
CLI example:
|
||||
|
||||
```bash
|
||||
nsys stats report1.nsys-rep
|
||||
...
|
||||
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
|
||||
|
||||
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
|
||||
-------- --------------- --------- ----------- ----------- -------- --------- ----------- ----------------------------------------------------------------------------------------------------
|
||||
46.3 10,327,352,338 17,505 589,965.9 144,383.0 27,040 3,126,460 944,263.8 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize128x128x64_warpgroupsize1x1x1_execute_segment_k_of…
|
||||
14.8 3,305,114,764 5,152 641,520.7 293,408.0 287,296 2,822,716 867,124.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize256x128x64_warpgroupsize2x1x1_execute_segment_k_of…
|
||||
12.1 2,692,284,876 14,280 188,535.4 83,904.0 19,328 2,862,237 497,999.9 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x128x64_warpgroupsize1x1x1_execute_segment_k_off…
|
||||
9.5 2,116,600,578 33,920 62,399.8 21,504.0 15,326 2,532,285 290,954.1 sm90_xmma_gemm_bf16bf16_bf16f32_f32_tn_n_tilesize64x64x64_warpgroupsize1x1x1_execute_segment_k_off_…
|
||||
5.0 1,119,749,165 18,912 59,208.4 9,056.0 6,784 2,578,366 271,581.7 void vllm::act_and_mul_kernel<c10::BFloat16, &vllm::silu_kernel<c10::BFloat16>, (bool)1>(T1 *, cons…
|
||||
4.1 916,662,515 21,312 43,011.6 19,776.0 8,928 2,586,205 199,790.1 void cutlass::device_kernel<flash::enable_sm90_or_later<flash::FlashAttnFwdSm90<flash::CollectiveMa…
|
||||
2.6 587,283,113 37,824 15,526.7 3,008.0 2,719 2,517,756 139,091.1 std::enable_if<T2>(int)0&&vllm::_typeConvert<T1>::exists, void>::type vllm::fused_add_rms_norm_kern…
|
||||
1.9 418,362,605 18,912 22,121.5 3,871.0 3,328 2,523,870 175,248.2 void vllm::rotary_embedding_kernel<c10::BFloat16, (bool)1>(const long *, T1 *, T1 *, const T1 *, in…
|
||||
0.7 167,083,069 18,880 8,849.7 2,240.0 1,471 2,499,996 101,436.1 void vllm::reshape_and_cache_flash_kernel<__nv_bfloat16, __nv_bfloat16, (vllm::Fp8KVCacheDataType)0…
|
||||
...
|
||||
```
|
||||
|
||||
GUI example:
|
||||
|
||||
<img width="1799" alt="Screenshot 2025-03-05 at 11 48 42 AM" src="https://github.com/user-attachments/assets/c7cff1ae-6d6f-477d-a342-bd13c4fc424c" />
|
||||
|
||||
## Profiling vLLM Python Code
|
||||
|
||||
The Python standard library includes
|
||||
[cProfile](https://docs.python.org/3/library/profile.html) for profiling Python
|
||||
code. vLLM includes a couple of helpers that make it easy to apply it to a section of vLLM.
|
||||
Both the `vllm.utils.cprofile` and `vllm.utils.cprofile_context` functions can be
|
||||
used to profile a section of code.
|
||||
|
||||
### Example usage - decorator
|
||||
|
||||
The first helper is a Python decorator that can be used to profile a function.
|
||||
If a filename is specified, the profile will be saved to that file. If no filename is
|
||||
specified, profile data will be printed to stdout.
|
||||
|
||||
```python
|
||||
import vllm.utils
|
||||
|
||||
@vllm.utils.cprofile("expensive_function.prof")
|
||||
def expensive_function():
|
||||
# some expensive code
|
||||
pass
|
||||
```
|
||||
|
||||
### Example Usage - context manager
|
||||
|
||||
The second helper is a context manager that can be used to profile a block of
|
||||
code. Similar to the decorator, the filename is optional.
|
||||
|
||||
```python
|
||||
import vllm.utils
|
||||
|
||||
def another_function():
|
||||
# more expensive code
|
||||
pass
|
||||
|
||||
with vllm.utils.cprofile_context("another_function.prof"):
|
||||
another_function()
|
||||
```
|
||||
|
||||
### Analyzing Profile Results
|
||||
|
||||
There are multiple tools available that can help analyze the profile results.
|
||||
One example is [snakeviz](https://jiffyclub.github.io/snakeviz/).
|
||||
|
||||
```bash
|
||||
pip install snakeviz
|
||||
snakeviz expensive_function.prof
|
||||
```
|
||||
60
docs/contributing/vulnerability_management.md
Normal file
60
docs/contributing/vulnerability_management.md
Normal file
@@ -0,0 +1,60 @@
|
||||
# Vulnerability Management
|
||||
|
||||
## Reporting Vulnerabilities
|
||||
|
||||
As mentioned in the [security
|
||||
policy](https://github.com/vllm-project/vllm/tree/main/SECURITY.md), security
|
||||
vulnerabilities may be reported privately to the project via
|
||||
[GitHub](https://github.com/vllm-project/vllm/security/advisories/new).
|
||||
|
||||
## Vulnerability Management Team
|
||||
|
||||
Once a vulnerability has been reported to the project, the Vulnerability
|
||||
Management Team (VMT) is responsible for managing the vulnerability. The VMT is
|
||||
responsible for:
|
||||
|
||||
- Triaging the vulnerability.
|
||||
- Coordinating with reporters and project maintainers on vulnerability analysis
|
||||
and resolution.
|
||||
- Drafting of security advisories for confirmed vulnerabilities, as appropriate.
|
||||
- Coordination with project maintainers on a coordinated release of the fix and
|
||||
security advisory.
|
||||
|
||||
### Security Advisories
|
||||
|
||||
Advisories are published via GitHub through the same system used to report
|
||||
vulnerabilities. More information on the process can be found in the [GitHub
|
||||
documentation](https://docs.github.com/en/code-security/security-advisories/working-with-repository-security-advisories/about-repository-security-advisories).
|
||||
|
||||
### Team Members
|
||||
|
||||
We prefer to keep all vulnerability-related communication on the security report
|
||||
on GitHub. However, if you need to contact the VMT directly for an urgent issue,
|
||||
you may contact the following individuals:
|
||||
|
||||
- Simon Mo - simon.mo@hey.com
|
||||
- Russell Bryant - rbryant@redhat.com
|
||||
|
||||
## Slack Discussion
|
||||
|
||||
You may use the `#security` channel in the [vLLM Slack](https://slack.vllm.ai)
|
||||
to discuss security-related topics. However, please do not disclose any
|
||||
vulnerabilities in this channel. If you need to report a vulnerability, please
|
||||
use the GitHub security advisory system or contact a VMT member privately.
|
||||
|
||||
## Vulnerability Disclosure
|
||||
|
||||
The process for disclosing vulnerabilities is the following:
|
||||
|
||||
- The VMT will work with the project maintainers to develop a fix for the
|
||||
vulnerability.
|
||||
- The VMT will coordinate with the reporter and project maintainers to prepare a
|
||||
security advisory that adequately describes the vulnerability and its impact.
|
||||
- The VMT will coordinate with the project maintainers to publish a fix and
|
||||
release an update that includes that fix.
|
||||
- The VMT will publish the security advisory on GitHub. Release notes will be
|
||||
updated to include a reference to the security advisory.
|
||||
|
||||
The VMT and project maintainers will work to minimize the amount of time in
|
||||
between disclosing any public information about the vulnerability and making a
|
||||
release and advisory available.
|
||||
Reference in New Issue
Block a user