diff --git a/docs/contributing/model/multimodal.md b/docs/contributing/model/multimodal.md index 9e839a8e4..2ec83fa27 100644 --- a/docs/contributing/model/multimodal.md +++ b/docs/contributing/model/multimodal.md @@ -23,29 +23,32 @@ Further update the model as follows: raise ValueError("Only image modality is supported") ``` -- Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example: +- Inside the `__init__` method, initialize the language components of the model inside the [_mark_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal._mark_language_model] context manager, and the multimodal components inside the [_mark_tower_model][vllm.model_executor.models.interfaces.SupportsMultiModal._mark_tower_model] context manager, e.g.: - ```diff - def forward( - self, - input_ids: torch.Tensor, - positions: torch.Tensor, - + pixel_values: torch.Tensor, - ) -> SamplerOutput: - ``` - - More conveniently, you can simply pass `**kwargs` to the [forward][torch.nn.Module.forward] method and retrieve the keyword parameters for multimodal inputs from it. + ```python + def __init__(self, *, vllm_config: VllmConfig, prefix: str = "") -> None: + super().__init__() + + config = vllm_config.model_config.hf_config + + with self._mark_tower_model(vllm_config, "image"): + self.vision_encoder = ... + self.multi_modal_projector = ... + + with self._mark_language_model(vllm_config): + self.language_model = init_vllm_registered_model( + vllm_config=vllm_config, + hf_config=config.text_config, + prefix=maybe_prefix(prefix, "language_model"), + ) + ``` - Implement [embed_multimodal][vllm.model_executor.models.interfaces.SupportsMultiModal.embed_multimodal] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs. ??? code ```python - class YourModelForImage2Seq(nn.Module): - ... - def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor: - assert self.vision_encoder is not None image_features = self.vision_encoder(image_input) return self.multi_modal_projector(image_features) @@ -71,18 +74,7 @@ Further update the model as follows: [PlaceholderRange][vllm.multimodal.inputs.PlaceholderRange] from input processing. This logic can be found at [embed_input_ids][vllm.model_executor.models.interfaces.SupportsMultiModal.embed_input_ids]. - You may override this method if additional logic is required for your model when merging embeddings. - -- Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model. - - ```python - class YourModelForImage2Seq(nn.Module): - ... - - def get_language_model(self) -> torch.nn.Module: - # Change `language_model` according to your implementation. - return self.language_model - ``` + You may override this method if additional logic is required for your model when merging embeddings. - Once the above steps are done, update the model class with the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
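As background for the embedding-merge behaviour referenced in the hunk above (text embeddings at placeholder positions being replaced with multimodal embeddings), here is a self-contained sketch of the idea in plain PyTorch. It is not vLLM's implementation, which lives in `embed_input_ids` and uses the `PlaceholderRange`s recorded during input processing; the placeholder token ID, shapes, and values below are made up for illustration.

```python
# Standalone illustration of merging image embeddings into text embeddings
# at placeholder positions. This is NOT vLLM's actual code; vLLM performs
# the merge in `embed_input_ids` using `PlaceholderRange`s from input
# processing. All IDs and shapes below are made-up example values.
import torch

hidden_size = 8
image_placeholder_id = 32000  # hypothetical placeholder token ID
vocab_size = 32001

# Toy embedding table standing in for the language model's token embeddings.
embed_table = torch.randn(vocab_size, hidden_size)

# A toy prompt in which two placeholder tokens mark where an image goes.
input_ids = torch.tensor([1, 15, image_placeholder_id, image_placeholder_id, 42])

# Embeddings produced by the vision encoder + projector: one row per
# placeholder token, already projected into the language model's space.
image_embeds = torch.randn(2, hidden_size)

# Start from ordinary text embeddings, then overwrite the placeholder
# positions with the image embeddings.
inputs_embeds = embed_table[input_ids]
inputs_embeds[input_ids == image_placeholder_id] = image_embeds

print(inputs_embeds.shape)  # torch.Size([5, 8])
```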
diff --git a/examples/online_serving/disaggregated_encoder/README.md b/examples/online_serving/disaggregated_encoder/README.md index bd3294760..b4735bea7 100644 --- a/examples/online_serving/disaggregated_encoder/README.md +++ b/examples/online_serving/disaggregated_encoder/README.md @@ -38,7 +38,7 @@ Encoder engines should be launched with the following flags: - `--max-num-batched-tokens=` **(default: 2048)** – This flag controls the token scheduling budget per decoding step and is irrelevant to encoder-only instances. **Set it to a very high value (effectively unlimited) to bypass scheduler limitations.** The actual token budget is managed by the encoder cache manager. -- `--mm-encoder-only` **(Optional)** - The language model is skipped during initialization to reduce device memory usage. **Models using this option must initialize the language component inside the context of `SupportsMultiModal._mark_language_model`.** +- `--mm-encoder-only` **(Optional)** – Skips the language model during initialization, when the model supports it, to reduce device memory usage. ## Local media inputs
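For illustration only, a hypothetical launch command for an encoder-only instance combining the flags above; the model name, port, and token budget are placeholder values, and any connector-specific flags required by this example are omitted here.

```bash
# Hypothetical encoder-only launch; the model, port, and token budget are
# placeholders -- see the rest of this example for the actual commands.
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
    --mm-encoder-only \
    --max-num-batched-tokens 1000000 \
    --port 8101
```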