[Doc] Improve GitHub links (#11491)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Author: Cyrus Leung
Date: 2024-12-26 06:49:26 +08:00
Committed by: GitHub
Parent: b689ada91e
Commit: 6ad909fdda

31 changed files with 147 additions and 136 deletions

@@ -5,7 +5,7 @@
This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
```{note}
-We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,
+We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
```
@@ -60,7 +60,7 @@ for o in outputs:
print(generated_text)
```
-A code example can be found in [examples/offline_inference_vision_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py).
+Full example: <gh-file:examples/offline_inference_vision_language.py>
To substitute multiple images inside the same text prompt, you can pass in a list of images instead:
@@ -91,7 +91,7 @@ for o in outputs:
print(generated_text)
```
-A code example can be found in [examples/offline_inference_vision_language_multi_image.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py).
+Full example: <gh-file:examples/offline_inference_vision_language_multi_image.py>
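
For quick orientation, here is a minimal sketch of the multi-image pattern (the model name, file names, prompt format, and `limit_mm_per_prompt` value are illustrative assumptions; the linked example has the tested version):

```python
from vllm import LLM
from PIL import Image

# Illustrative model choice; any multi-image-capable model works similarly.
llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 2},  # allow up to two images per prompt
)

image_1 = Image.open("duck.jpg")
image_2 = Image.open("lion.jpg")

outputs = llm.generate({
    # Prompt format is model-specific; this one follows Phi-3.5-vision conventions.
    "prompt": "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is shown in each image?<|end|>\n<|assistant|>\n",
    # A list of images replaces the single image used in the earlier example.
    "multi_modal_data": {"image": [image_1, image_2]},
})

for o in outputs:
    print(o.outputs[0].text)
```
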
Multi-image input can be extended to perform video captioning. We show this with [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) as it supports videos:
@@ -125,13 +125,13 @@ for o in outputs:
You can pass a list of NumPy arrays directly to the {code}`'video'` field of the multi-modal dictionary
instead of using multi-image input.
-Please refer to [examples/offline_inference_vision_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language.py) for more details.
+Full example: <gh-file:examples/offline_inference_vision_language.py>
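
As a rough sketch of the NumPy-array path just described (the model, prompt format, and frame-array layout are assumptions; see the linked example for the tested version):

```python
import numpy as np
from vllm import LLM

# Illustrative video-capable model; Qwen2-VL is the one shown earlier on this page.
llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")

# Stand-in for decoded video frames: (num_frames, height, width, channels), uint8.
# Replace with frames decoded from a real file (e.g. via OpenCV or decord).
video = np.random.randint(0, 256, size=(16, 224, 224, 3), dtype=np.uint8)

outputs = llm.generate({
    # Prompt format is model-specific; this follows Qwen2-VL's video placeholder tokens.
    "prompt": (
        "<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>"
        "Describe this video.<|im_end|>\n<|im_start|>assistant\n"
    ),
    # The NumPy data goes straight into the 'video' field.
    "multi_modal_data": {"video": video},
})

for o in outputs:
    print(o.outputs[0].text)
```
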
### Audio
You can pass a tuple {code}`(array, sampling_rate)` to the {code}`'audio'` field of the multi-modal dictionary.
-Please refer to [examples/offline_inference_audio_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_audio_language.py) for more details.
+Full example: <gh-file:examples/offline_inference_audio_language.py>
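
A minimal sketch of the `(array, sampling_rate)` pattern (the model name and the audio placeholder in the prompt are illustrative assumptions; see the linked example for a tested version):

```python
import librosa
from vllm import LLM

# Illustrative audio-capable model; substitute whichever audio model you use.
llm = LLM(model="fixie-ai/ultravox-v0_3")

# librosa returns exactly the (array, sampling_rate) tuple the 'audio' field expects.
audio, sampling_rate = librosa.load("speech.wav", sr=None)

outputs = llm.generate({
    # Prompt format and audio placeholder are model-specific; this string is illustrative only.
    "prompt": "<|user|>\n<|audio|>\nWhat is being said in this clip?<|end|>\n<|assistant|>\n",
    "multi_modal_data": {"audio": (audio, sampling_rate)},
})

for o in outputs:
    print(o.outputs[0].text)
```
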
### Embedding
@@ -208,7 +208,7 @@ A chat template is **required** to use Chat Completions API.
Although most models come with a chat template, for others you have to define one yourself.
The chat template can be inferred based on the documentation on the model's HuggingFace repo.
-For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/template_llava.jinja).
+For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
```
### Image
@@ -271,7 +271,7 @@ chat_response = client.chat.completions.create(
print("Chat completion output:", chat_response.choices[0].message.content)
```
-A full code example can be found in [examples/openai_chat_completion_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py).
+Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
```{tip}
Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
@@ -296,7 +296,7 @@ $ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
Instead of {code}`image_url`, you can pass a video file via {code}`video_url`.
-You can use [these tests](https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/test_video.py) as reference.
+You can use [these tests](gh-file:entrypoints/openai/test_video.py) as reference.
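
An illustrative client-side sketch of the `video_url` content type (the server address, model name, and video URL are placeholders):

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally, e.g.
# launched with `vllm serve Qwen/Qwen2-VL-2B-Instruct`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-2B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video."},
            # 'video_url' takes the place of 'image_url' for video inputs.
            {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)
```
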
````{note}
By default, the timeout for fetching videos through HTTP URL is `30` seconds.
@@ -399,7 +399,7 @@ result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from audio url:", result)
```
-A full code example can be found in [examples/openai_chat_completion_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client_for_multimodal.py).
+Full example: <gh-file:examples/openai_chat_completion_client_for_multimodal.py>
````{note}
By default, the timeout for fetching audios through HTTP URL is `10` seconds.
@@ -435,7 +435,7 @@ Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to expl
to run this model in embedding mode instead of text generation mode.
The custom chat template is completely different from the original one for this model,
-and can be found [here](https://github.com/vllm-project/vllm/blob/main/examples/template_vlm2vec.jinja).
+and can be found here: <gh-file:examples/template_vlm2vec.jinja>
```
Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:
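
A hedged sketch of such a raw `requests` call (the endpoint path and payload fields are assumptions modeled on the full example linked at the end of this page; check it for the exact schema):

```python
import requests

# Assumes the VLM2Vec server from the command above is running locally.
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
print("Embedding output:", response.json()["data"][0]["embedding"][:8], "...")
```
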
@@ -475,7 +475,7 @@ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
Like with VLM2Vec, we have to explicitly pass `--task embed`.
Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
-by [this custom chat template](https://github.com/vllm-project/vllm/blob/main/examples/template_dse_qwen2_vl.jinja).
+by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
```
```{important}
@@ -483,4 +483,4 @@ Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of th
example below for details.
```
-A full code example can be found in [examples/openai_chat_embedding_client_for_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_embedding_client_for_multimodal.py).
+Full example: <gh-file:examples/openai_chat_embedding_client_for_multimodal.py>
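
For reference, an illustrative sketch of the placeholder-image requirement for text-only queries (the placeholder size, query text, and payload fields are assumptions; the full example linked above has the exact values):

```python
import base64
import io

import requests
from PIL import Image

# A tiny dummy image stands in for "no image" when embedding a text-only query;
# 28x28 is illustrative only, the exact minimum size is given in the full example.
buffer = io.BytesIO()
Image.new("RGB", (28, 28)).save(buffer, format="PNG")
placeholder_b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "MrLight/dse-qwen2-2b-mrl-v1",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{placeholder_b64}"}},
                {"type": "text", "text": "Query: What is the capital of France?"},
            ],
        }],
        "encoding_format": "float",
    },
)
print(response.json()["data"][0]["embedding"][:8], "...")
```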