[Doc] Convert docs to use colon fences (#12471)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor
2025-01-29 03:38:29 +00:00
committed by GitHub
parent a7e3eba66f
commit dd6a3a02cb
68 changed files with 2352 additions and 2341 deletions
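
As a rough illustration of the mechanical change in this commit, the sketch below rewrites backtick-fenced MyST directives as colon fences while leaving ordinary code blocks untouched. The function name `to_colon_fences` is hypothetical and not part of the commit. It assumes MyST's `colon_fence` extension is enabled in the docs build, and it does not lengthen outer blocks that are already colon-fenced (such as `tab-set`), which the diff below handles by adding one colon to them.

```python
import re

# Matches any line that starts (after optional indentation) with 3+ backticks.
FENCE = re.compile(r"^(\s*)(`{3,})(.*)$")


def to_colon_fences(text: str) -> str:
    """Rewrite backtick-fenced directives as colon-fenced directives.

    Hypothetical helper for illustration only. Plain code fences such as
    a fence followed by 'python' are kept exactly as they are.
    """
    out = []
    stack = []  # entries are (fence_length, is_directive) for open fences
    for line in text.splitlines():
        m = FENCE.match(line)
        if not m:
            out.append(line)
            continue
        indent, ticks, rest = m.groups()
        if stack and rest.strip() == "" and len(ticks) >= stack[-1][0]:
            # A bare fence at least as long as the innermost open one closes it.
            _, is_directive = stack.pop()
            out.append((indent + ":" * len(ticks)) if is_directive else line)
        elif stack and not stack[-1][1]:
            # Inside a plain code block everything is literal content.
            out.append(line)
        else:
            # Opening fence: only directives (info string starts with "{") change.
            is_directive = rest.startswith("{")
            stack.append((len(ticks), is_directive))
            out.append((indent + ":" * len(ticks) + rest) if is_directive else line)
    return "\n".join(out)
```

Keeping the fence length unchanged preserves the nesting order of fences that were already nested with backticks, since an outer fence still has to be longer than the ones it contains.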


@@ -17,11 +17,11 @@ The edges of the build graph represent:
- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)
> ```{figure} /assets/contributing/dockerfile-stages-dependency.png
> :::{figure} /assets/contributing/dockerfile-stages-dependency.png
> :align: center
> :alt: query
> :width: 100%
> ```
> :::
>
> Made using: <https://github.com/patrickhoefler/dockerfilegraph>
>


@@ -10,9 +10,9 @@ First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
```{warning}
:::{warning}
Make sure to review and adhere to the original code's copyright and licensing terms!
```
:::
## 2. Make your code compatible with vLLM
@@ -80,10 +80,10 @@ def forward(
...
```
```{note}
:::{note}
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
```
:::
For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.


@@ -4,7 +4,7 @@
This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
```{toctree}
:::{toctree}
:caption: Contents
:maxdepth: 1
@@ -12,16 +12,16 @@ basic
registration
tests
multimodal
```
:::
```{note}
:::{note}
The complexity of adding a new model depends heavily on the model's architecture.
The process is fairly straightforward if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
```
:::
```{tip}
:::{tip}
If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
or ask on our [developer slack](https://slack.vllm.ai).
We will be happy to help you out!
```
:::


@@ -48,9 +48,9 @@ Further update the model as follows:
return vision_embeddings
```
```{important}
:::{important}
The returned `multimodal_embeddings` must be either a **3D {class}`torch.Tensor`** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D {class}`torch.Tensor`'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request.
```
:::
- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings` to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
@@ -89,10 +89,10 @@ Further update the model as follows:
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
```
```{note}
:::{note}
The model class does not have to be named {code}`*ForCausalLM`.
Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
```
:::
## 2. Specify processing information
@@ -120,8 +120,8 @@ When calling the model, the output embeddings from the visual encoder are assign
containing placeholder feature tokens. Therefore, the number of placeholder feature tokens should be equal
to the size of the output embeddings.
::::{tab-set}
:::{tab-item} Basic example: LLaVA
:::::{tab-set}
::::{tab-item} Basic example: LLaVA
:sync: llava
Looking at the code of HF's `LlavaForConditionalGeneration`:
@@ -254,12 +254,12 @@ def get_mm_max_tokens_per_item(self, seq_len: int) -> Mapping[str, int]:
return {"image": self.get_max_image_tokens()}
```
```{note}
:::{note}
Our [actual code](gh-file:vllm/model_executor/models/llava.py) is more abstracted to support vision encoders other than CLIP.
```
:::
::::
:::::
## 3. Specify dummy inputs
@@ -315,17 +315,17 @@ def get_dummy_processor_inputs(
Afterwards, create a subclass of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`
to fill in the missing details about HF processing.
```{seealso}
:::{seealso}
[Multi-Modal Data Processing](#mm-processing)
```
:::
### Multi-modal fields
Override {class}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` to
return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.
::::{tab-set}
:::{tab-item} Basic example: LLaVA
:::::{tab-set}
::::{tab-item} Basic example: LLaVA
:sync: llava
Looking at the model's `forward` method:
@@ -367,13 +367,13 @@ def _get_mm_fields_config(
)
```
```{note}
:::{note}
Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
```
:::
::::
:::::
### Prompt replacements


@@ -17,17 +17,17 @@ After you have implemented your model (see [tutorial](#new-model-basic)), put it
Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
Finally, update our [list of supported models](#supported-models) to promote your model!
```{important}
:::{important}
The list of models in each section should be maintained in alphabetical order.
```
:::
## Out-of-tree models
You can load an external model using a plugin without modifying the vLLM codebase.
```{seealso}
:::{seealso}
[vLLM's Plugin System](#plugin-system)
```
:::
To register the model, use the following code:
@@ -45,11 +45,11 @@ from vllm import ModelRegistry
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
```
```{important}
:::{important}
If your model is a multimodal model, ensure the model class implements the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
Read more about that [here](#supports-multimodal).
```
:::
```{note}
:::{note}
Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
```
:::


@@ -14,14 +14,14 @@ Without them, the CI for your PR will fail.
Include an example HuggingFace repository for your model in <gh-file:tests/models/registry.py>.
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
```{important}
:::{important}
The list of models in each section should be maintained in alphabetical order.
```
:::
```{tip}
:::{tip}
If your model requires a development version of HF Transformers, you can set
`min_transformers_version` to skip the test in CI until the model is released.
```
:::
## Optional Tests


@@ -35,17 +35,17 @@ pre-commit run --all-files
pytest tests/
```
```{note}
:::{note}
Currently, the repository is not fully checked by `mypy`.
```
:::
## Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
```{important}
:::{important}
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
```
:::
## Pull Requests & Code Reviews
@@ -81,9 +81,9 @@ appropriately to indicate the type of change. Please use one of the following:
- `[Misc]` for PRs that do not fit the above categories. Please use this
sparingly.
```{note}
:::{note}
If the PR spans more than one category, please include all relevant prefixes.
```
:::
### Code Quality


@@ -6,21 +6,21 @@ The OpenAI server also needs to be started with the `VLLM_TORCH_PROFILER_DIR` en
When using `benchmarks/benchmark_serving.py`, you can enable profiling by passing the `--profile` flag.
```{warning}
:::{warning}
Only enable profiling in a development environment.
```
:::
Traces can be visualized using <https://ui.perfetto.dev/>.
```{tip}
:::{tip}
Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, there is no need to untar the traces; they can be viewed directly.
```
:::
```{tip}
:::{tip}
Stopping the profiler flushes all the profile trace files to the directory. This takes time: for example, flushing about 100 requests' worth of data for a Llama 70B model takes about 10 minutes on an H100.
Set the environment variable `VLLM_RPC_TIMEOUT` to a large value before you start the server, for example 30 minutes:
`export VLLM_RPC_TIMEOUT=1800000`
```
:::
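
The value `1800000` in the tip above matches 30 minutes if `VLLM_RPC_TIMEOUT` is read as milliseconds. That unit is an assumption inferred from the two numbers in the tip, not something this diff states. A minimal sketch of the conversion, using a hypothetical helper name:

```python
from datetime import timedelta


def rpc_timeout_ms(minutes: float) -> int:
    """Convert a timeout in minutes to milliseconds.

    Hypothetical helper; assumes VLLM_RPC_TIMEOUT is expressed in milliseconds.
    """
    return int(timedelta(minutes=minutes).total_seconds() * 1000)


# Build the shell line for a 30 minute timeout.
export_line = f"export VLLM_RPC_TIMEOUT={rpc_timeout_ms(30)}"
print(export_line)
```

Scaling the timeout with the expected flush time (roughly 10 minutes per 100 requests in the example above) leaves headroom for larger profiling runs.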
## Example commands and usage