[Model][VLM] Add Qwen2.5-Omni model support (thinker only) (#15130)

Signed-off-by: fyabc <suyang.fy@alibaba-inc.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Xiong Wang <wangxiongts@163.com>
Commit 2c1bd848a6 (parent 5c9121203c) by Yang Fan, committed by GitHub on 2025-04-19 14:14:36 +08:00.
23 changed files with 1855 additions and 85 deletions.

@@ -1040,6 +1040,13 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
* ✅︎
* ✅︎
- * `Qwen2_5OmniThinkerForConditionalGeneration`
* Qwen2.5-Omni
* T + I<sup>E+</sup> + V<sup>E+</sup> + A<sup>+</sup>
* `Qwen/Qwen2.5-Omni-7B`
*
* ✅︎
* ✅︎\*
- * `SkyworkR1VChatModel`
* Skywork-R1V-38B
* T + I
@@ -1109,6 +1116,14 @@ For more details, please see: <gh-pr:4087#issuecomment-2250397630>
Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
:::
:::{note}
To use Qwen2.5-Omni, you have to install a fork of the Hugging Face Transformers library from source via
`pip install git+https://github.com/BakerBunker/transformers.git@qwen25omni`.
Reading audio from video during pre-processing is currently only supported on V0 (not V1), because overlapping modalities are not yet supported in V1. You can enable it by passing
`--mm-processor-kwargs '{"use_audio_in_video": True}'`.
:::
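When wiring the flag above into a launch script, it is easy to mangle the JSON value with shell quoting. A small sketch (the `mm_processor_flag` helper is hypothetical, not part of vLLM) that builds the `--mm-processor-kwargs` argument from a Python dict so the JSON is always well-formed and safely quoted:

```python
import json
import shlex

def mm_processor_flag(**kwargs) -> str:
    """Build a safely quoted --mm-processor-kwargs CLI fragment.

    Hypothetical helper: serializes the kwargs dict to JSON, then
    shell-quotes the result so it survives the shell intact.
    """
    return "--mm-processor-kwargs " + shlex.quote(json.dumps(kwargs))

# Example: enable reading the audio track from video inputs.
print(mm_processor_flag(use_audio_in_video=True))
# → --mm-processor-kwargs '{"use_audio_in_video": true}'
```

Note that `json.dumps` emits lowercase `true`; any JSON parser on the receiving end accepts this, whereas a hand-typed Python-literal `True` inside the JSON string would not be valid JSON.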
### Pooling Models
See [this page](pooling-models) for more information on how to use pooling models.