[Model] Add support for Gemma 3 (#14660)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
@@ -263,10 +263,15 @@ See [this page](#generative-models) for more information on how to use generativ
   * ✅︎
   * ✅︎
 - * `Gemma2ForCausalLM`
-  * Gemma2
+  * Gemma 2
   * `google/gemma-2-9b`, `google/gemma-2-27b`, etc.
   * ✅︎
   * ✅︎
+- * `Gemma3ForCausalLM`
+  * Gemma 3
+  * `google/gemma-3-1b-it`, etc.
+  * ✅︎
+  * ✅︎
 - * `GlmForCausalLM`
   * GLM-4
   * `THUDM/glm-4-9b-chat-hf`, etc.
@@ -504,7 +509,7 @@ you should explicitly specify the task type to ensure that the model is used in
   *
   *
 - * `Gemma2Model`
-  * Gemma2-based
+  * Gemma 2-based
   * `BAAI/bge-multilingual-gemma2`, etc.
   *
   * ✅︎
@@ -752,6 +757,13 @@ See [this page](#generative-models) for more information on how to use generativ
   *
   * ✅︎
   * ✅︎
+- * `Gemma3ForConditionalGeneration`
+  * Gemma 3
+  * T + I<sup>+</sup>
+  * `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc.
+  * ✅︎
+  * ✅︎
+  * ✅︎\*
 - * `GLM4VForCausalLM`<sup>^</sup>
   * GLM-4V
   * T + I
@@ -937,6 +949,31 @@ For more details, please see: <gh-pr:4087#issuecomment-2250397630>
 To use Qwen2.5-VL series models, you have to install the Hugging Face Transformers library from source via `pip install git+https://github.com/huggingface/transformers`.
 :::
 
+:::{note}
+To use Gemma 3 series models, you have to install the Hugging Face Transformers library from source via
+`pip install git+https://github.com/huggingface/transformers`.
+The earliest commit that supports this is [`50d3530aa04e7a7d003e6b255a98f79fd0447357`](https://github.com/huggingface/transformers/commit/50d3530aa04e7a7d003e6b255a98f79fd0447357).
+
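+As a quick sanity check after installing, here is a minimal offline-inference sketch with vLLM (the prompt and sampling parameters are illustrative, not prescriptive):
+
+```python
+from vllm import LLM, SamplingParams
+
+# Text-only generation with a Gemma 3 checkpoint from the table above.
+llm = LLM(model="google/gemma-3-4b-it")
+params = SamplingParams(temperature=0.7, max_tokens=64)
+outputs = llm.generate(["Briefly explain what Gemma 3 is."], params)
+print(outputs[0].outputs[0].text)
+```
+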
+Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
+However, there are differences in how they handle text + image inputs:
+
+V0 correctly implements the model's attention pattern (a toy sketch follows the list below):
+- Uses bidirectional attention between the image tokens corresponding to the same image
+- Uses causal attention for other tokens
+- Implemented via (naive) PyTorch SDPA with masking tensors
+- Note: May use significant memory for long prompts with images
+
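+For illustration only, a toy sketch of a mixed mask of this kind fed to PyTorch SDPA; this is not vLLM's actual implementation, and the `image_spans` input format is an assumption:
+
+```python
+import torch
+import torch.nn.functional as F
+
+def mixed_attention_mask(seq_len, image_spans):
+    # Boolean mask where True means "query may attend to key".
+    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal default
+    for start, end in image_spans:  # half-open [start, end) ranges of image tokens
+        mask[start:end, start:end] = True  # bidirectional within one image
+    return mask
+
+# Toy example: 8 tokens, where positions 2..5 belong to a single image.
+seq_len = 8
+mask = mixed_attention_mask(seq_len, [(2, 6)])
+q = k = v = torch.randn(1, 1, seq_len, 16)  # (batch, heads, seq, head_dim)
+out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # mask broadcasts over batch/heads
+```
+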
+V1 currently uses a simplified attention pattern:
+- Uses causal attention for all tokens, including image tokens
+- Generates reasonable outputs but does not match the original model's attention for text + image inputs
+- Will be updated in the future to support the correct behavior
+
+This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
+
+Additionally, vLLM's current Gemma 3 implementation does not support the pan-and-scan image pre-processing algorithm, which helps handle images with skewed aspect ratios by intelligently cropping them into multiple views (see the sketch below).
+Without this feature, model performance may degrade when processing images that deviate significantly from square dimensions.
+
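+Roughly, the pan-and-scan idea is: when an image is far from square, encode the full view plus several near-square crops. A toy sketch of that general idea (not Gemma 3's actual algorithm; the crop heuristic here is invented for illustration):
+
+```python
+from PIL import Image
+
+def naive_pan_and_scan(img, max_ratio=1.5):
+    # Return the full image, plus near-square crops if the aspect ratio is skewed.
+    w, h = img.size
+    views = [img]
+    if max(w, h) / min(w, h) <= max_ratio:
+        return views  # already close to square; no extra views needed
+    n = round(max(w, h) / min(w, h))  # number of crops along the long side
+    if w > h:
+        step = w // n
+        views += [img.crop((i * step, 0, min((i + 1) * step, w), h)) for i in range(n)]
+    else:
+        step = h // n
+        views += [img.crop((0, i * step, w, min((i + 1) * step, h))) for i in range(n)]
+    return views
+```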
+:::
+
 ### Pooling Models
 
 See [this page](pooling-models) for more information on how to use pooling models.