[Model][VLM] Add Qwen2.5-Omni model support (thinker only) (#15130)
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Xiong Wang <wangxiongts@163.com>
@@ -1040,6 +1040,13 @@ See [this page](#generative-models) for more information on how to use generativ
  * ✅︎
  * ✅︎
  * ✅︎
- * `Qwen2_5OmniThinkerForConditionalGeneration`
  * Qwen2.5-Omni
  * T + I<sup>E+</sup> + V<sup>E+</sup> + A<sup>+</sup>
  * `Qwen/Qwen2.5-Omni-7B`
  *
  * ✅︎
  * ✅︎\*
- * `SkyworkR1VChatModel`
  * Skywork-R1V-38B
  * T + I
@@ -1109,6 +1116,14 @@ For more details, please see: <gh-pr:4087#issuecomment-2250397630>
Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
:::

:::{note}
To use Qwen2.5-Omni, you have to install a fork of the Hugging Face Transformers library from source via
`pip install git+https://github.com/BakerBunker/transformers.git@qwen25omni`.

Reading audio from video during pre-processing is currently supported on V0 (but not V1), because overlapping modalities are not yet supported in V1. Enable it via
`--mm-processor-kwargs '{"use_audio_in_video": true}'`.
:::
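Putting the two steps together, a minimal setup might look like the following sketch (the server invocation and flags are illustrative, not a verified recipe; adjust to your deployment):

```shell
# Install the Transformers fork required for Qwen2.5-Omni (thinker-only support).
pip install git+https://github.com/BakerBunker/transformers.git@qwen25omni

# Launch the model, enabling audio-from-video pre-processing (V0 only).
vllm serve Qwen/Qwen2.5-Omni-7B \
    --mm-processor-kwargs '{"use_audio_in_video": true}'
```

Note that `--mm-processor-kwargs` takes a JSON object, so the boolean must be lowercase `true`.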

### Pooling Models

See [this page](pooling-models) for more information on how to use pooling models.