[Doc] Convert docs to use colon fences (#12471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -17,11 +17,11 @@ The edges of the build graph represent:
 - `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)

-> ```{figure} /assets/contributing/dockerfile-stages-dependency.png
+> :::{figure} /assets/contributing/dockerfile-stages-dependency.png
 > :align: center
 > :alt: query
 > :width: 100%
-> ```
+> :::
 >
 > Made using: <https://github.com/patrickhoefler/dockerfilegraph>
 >
@@ -10,9 +10,9 @@ First, clone the PyTorch model code from the source repository.
 For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
 HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.

-```{warning}
+:::{warning}
 Make sure to review and adhere to the original code's copyright and licensing terms!
-```
+:::

 ## 2. Make your code compatible with vLLM
@@ -80,10 +80,10 @@ def forward(
     ...
 ```

-```{note}
+:::{note}
 Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
 If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
-```
+:::

 For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.
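To make the attention note above more concrete, here is a rough sketch of how a model's attention block is typically rewritten against vLLM's `Attention` layer. The import path and the constructor/`forward` signatures vary between vLLM versions, so treat the names below as assumptions to be checked against the Llama implementation rather than as the definitive API:

```python
import torch
from torch import nn

# Assumed import path; see vllm/attention/__init__.py in your checkout.
from vllm.attention import Attention, AttentionMetadata


class YourAttention(nn.Module):
    """Sketch: swap the HF attention block for vLLM's paged-attention layer."""

    def __init__(self, hidden_size: int, num_heads: int) -> None:
        super().__init__()
        self.head_dim = hidden_size // num_heads
        self.qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        # vLLM's Attention handles the KV cache and the attention kernels, so
        # the HF-style causal mask and past_key_values plumbing goes away.
        self.attn = Attention(num_heads, self.head_dim, scale=self.head_dim**-0.5)

    def forward(
        self,
        hidden_states: torch.Tensor,
        kv_cache: torch.Tensor,
        attn_metadata: AttentionMetadata,
    ) -> torch.Tensor:
        q, k, v = self.qkv_proj(hidden_states).chunk(3, dim=-1)
        attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
        return self.o_proj(attn_output)
```

In a real port you would also use vLLM's tensor-parallel linear layers (as the Llama implementation does) rather than plain `nn.Linear`.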
@@ -4,7 +4,7 @@
 This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.

-```{toctree}
+:::{toctree}
 :caption: Contents
 :maxdepth: 1
@@ -12,16 +12,16 @@ basic
 registration
 tests
 multimodal
-```
+:::

-```{note}
+:::{note}
 The complexity of adding a new model depends heavily on the model's architecture.
 The process is considerably more straightforward if the model shares a similar architecture with an existing model in vLLM.
 However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
-```
+:::

-```{tip}
+:::{tip}
 If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
 or ask on our [developer slack](https://slack.vllm.ai).
 We will be happy to help you out!
-```
+:::
@@ -48,9 +48,9 @@ Further update the model as follows:
     return vision_embeddings
 ```

-```{important}
+:::{important}
 The returned `multimodal_embeddings` must be either a **3D {class}`torch.Tensor`** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D {class}`torch.Tensor`'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request.
-```
+:::

 - Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings` to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
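As a sketch of how that merge typically looks in a model's `get_input_embeddings`, assuming the `merge_multimodal_embeddings` helper in `vllm/model_executor/models/utils.py` (check its exact signature in your version) and an `image_token_index` field on the config:

```python
from typing import Optional, Union

import torch
from torch import nn

# Assumed import path for the merging utility mentioned above.
from vllm.model_executor.models.utils import merge_multimodal_embeddings


class YourModelForImage2Seq(nn.Module):

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[Union[torch.Tensor, list[torch.Tensor]]] = None,
    ) -> torch.Tensor:
        # Start from the ordinary text-token embeddings.
        inputs_embeds = self.language_model.get_input_embeddings(input_ids)
        if multimodal_embeddings is not None:
            # Scatter each item's embeddings into the positions occupied by the
            # placeholder tokens (image_token_index is an assumed config field).
            inputs_embeds = merge_multimodal_embeddings(
                input_ids, inputs_embeds, multimodal_embeddings,
                self.config.image_token_index)
        return inputs_embeds
```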
@@ -89,10 +89,10 @@ Further update the model as follows:
 + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
 ```

-```{note}
+:::{note}
 The model class does not have to be named {code}`*ForCausalLM`.
 Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
-```
+:::

 ## 2. Specify processing information
@@ -120,8 +120,8 @@ When calling the model, the output embeddings from the visual encoder are assign
 containing placeholder feature tokens. Therefore, the number of placeholder feature tokens should be equal
 to the size of the output embeddings.

-::::{tab-set}
-:::{tab-item} Basic example: LLaVA
+:::::{tab-set}
+::::{tab-item} Basic example: LLaVA
 :sync: llava

 Looking at the code of HF's `LlavaForConditionalGeneration`:
@@ -254,12 +254,12 @@ def get_mm_max_tokens_per_item(self, seq_len: int) -> Mapping[str, int]:
     return {"image": self.get_max_image_tokens()}
 ```

-```{note}
+:::{note}
 Our [actual code](gh-file:vllm/model_executor/models/llava.py) is more abstracted to support vision encoders other than CLIP.
-```
 :::

 ::::
+:::::

 ## 3. Specify dummy inputs
@@ -315,17 +315,17 @@ def get_dummy_processor_inputs(
 Afterwards, create a subclass of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`
 to fill in the missing details about HF processing.

-```{seealso}
+:::{seealso}
 [Multi-Modal Data Processing](#mm-processing)
-```
+:::
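Schematically, the subclass only overrides a handful of hooks, which the next two sections walk through. The method names below match those sections, but the exact signatures are assumptions; check {class}`~vllm.multimodal.processing.BaseMultiModalProcessor` in your vLLM version:

```python
# Sketch only; signatures are indicative, not the definitive API.
class YourMultiModalProcessor(BaseMultiModalProcessor):

    def _get_mm_fields_config(self, hf_inputs, hf_processor_mm_kwargs):
        # Map each tensor returned by the HF processor to the multimodal item
        # it belongs to (see "Multi-modal fields" below).
        ...

    def _get_prompt_replacements(self, mm_items, hf_processor_mm_kwargs, out_mm_kwargs):
        # Describe how each placeholder token is expanded into feature tokens
        # (see "Prompt replacements" below).
        ...
```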
 ### Multi-modal fields

 Override {class}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` to
 return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.

-::::{tab-set}
-:::{tab-item} Basic example: LLaVA
+:::::{tab-set}
+::::{tab-item} Basic example: LLaVA
 :sync: llava

 Looking at the model's `forward` method:
@@ -367,13 +367,13 @@ def _get_mm_fields_config(
 )
 ```

-```{note}
+:::{note}
 Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
 pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
-```
 :::

 ::::
+:::::

 ### Prompt replacements
@@ -17,17 +17,17 @@ After you have implemented your model (see [tutorial](#new-model-basic)), put it
 Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
 Finally, update our [list of supported models](#supported-models) to promote your model!

-```{important}
+:::{important}
 The list of models in each section should be maintained in alphabetical order.
-```
+:::
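An in-tree registration is a single entry mapping the architecture name (as it appears in the model's `config.json`) to the module and class. The shape of the entry below is an assumption modelled on the existing entries in <gh-file:vllm/model_executor/models/registry.py>, so mirror those rather than this sketch:

```python
# Sketch of a hypothetical new entry in registry.py:
_VLLM_MODELS = {
    # ... existing entries, kept in alphabetical order ...
    # "<architecture name>": ("<module under vllm/model_executor/models/>", "<class name>"),
    "YourModelForCausalLM": ("your_model", "YourModelForCausalLM"),
}
```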
 ## Out-of-tree models

 You can load an external model using a plugin without modifying the vLLM codebase.

-```{seealso}
+:::{seealso}
 [vLLM's Plugin System](#plugin-system)
-```
+:::

 To register the model, use the following code:
@@ -45,11 +45,11 @@ from vllm import ModelRegistry
 ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
 ```

-```{important}
+:::{important}
 If your model is a multimodal model, ensure the model class implements the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
 Read more about that [here](#supports-multimodal).
-```
+:::

-```{note}
+:::{note}
 Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
-```
+:::
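A minimal sketch of such a plugin, assuming the `vllm.general_plugins` entry-point group described in the plugin system docs (package and function names here are placeholders):

```python
# your_code/__init__.py  (hypothetical out-of-tree package)
def register() -> None:
    """Called by vLLM via the 'vllm.general_plugins' entry point."""
    from vllm import ModelRegistry

    # The lazy "module:class" string form avoids importing the model
    # (and initializing CUDA) at plugin-load time.
    ModelRegistry.register_model("YourModelForCausalLM",
                                 "your_code.your_model:YourModelForCausalLM")


# In the plugin package's setup.py / pyproject.toml, wire up the entry point, e.g.
# entry_points={"vllm.general_plugins": ["register_your_model = your_code:register"]}.
```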
@@ -14,14 +14,14 @@ Without them, the CI for your PR will fail.
 Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
 This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.

-```{important}
+:::{important}
 The list of models in each section should be maintained in alphabetical order.
-```
+:::

-```{tip}
+:::{tip}
 If your model requires a development version of HF Transformers, you can set
 `min_transformers_version` to skip the test in CI until the model is released.
-```
+:::
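A hypothetical entry could look like the following. `_HfExamplesInfo` and its keyword arguments are assumptions about the helper used in <gh-file:tests/models/registry.py>; copy the pattern from the existing entries rather than this sketch:

```python
# tests/models/registry.py (sketch of a hypothetical entry)
"YourModelForCausalLM": _HfExamplesInfo(
    "your-org/your-model-7b",           # example HF repository
    min_transformers_version="4.49.0",  # placeholder: skip in CI until released
),
```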
 ## Optional Tests
@@ -35,17 +35,17 @@ pre-commit run --all-files
 pytest tests/
 ```

-```{note}
+:::{note}
 Currently, the repository is not fully checked by `mypy`.
-```
+:::

 ## Issues

 If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

-```{important}
+:::{important}
 If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
-```
+:::

 ## Pull Requests & Code Reviews
@@ -81,9 +81,9 @@ appropriately to indicate the type of change. Please use one of the following:
 - `[Misc]` for PRs that do not fit the above categories. Please use this
   sparingly.

-```{note}
+:::{note}
 If the PR spans more than one category, please include all relevant prefixes.
-```
+:::

 ### Code Quality
@@ -6,21 +6,21 @@ The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` en
 When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.

-```{warning}
+:::{warning}
 Only enable profiling in a development environment.
-```
+:::

 Traces can be visualized using <https://ui.perfetto.dev/>.

-```{tip}
+:::{tip}
 Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, there is no need to untar the traces; they can be viewed directly.
-```
+:::

-```{tip}
+:::{tip}
 When the profiler stops, it flushes all the profile trace files to the directory. This takes time: for example, about 100 requests' worth of data for a Llama 70B model takes about 10 minutes to flush on an H100.
 Set the environment variable `VLLM_RPC_TIMEOUT` to a large value before you start the server, e.g. 30 minutes:
 `export VLLM_RPC_TIMEOUT=1800000`
-```
+:::
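For offline workloads driven by `vllm.LLM`, a minimal profiling sketch looks like the following, assuming `start_profile()`/`stop_profile()` are available on `LLM` in your vLLM version and that `VLLM_TORCH_PROFILER_DIR` is set before the engine is constructed:

```python
import os

# Must be set before the engine is created so the profiler knows where to write traces.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

llm.start_profile()
# Keep the profiled workload small; traces grow quickly.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
llm.stop_profile()
```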
 ## Example commands and usage