[Model] Add support for Gemma 3 (#14660)

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
Woosuk Kwon
2025-03-12 08:36:33 -07:00
committed by GitHub
parent 45f3f3f59e
commit c0c25e25fa
10 changed files with 1071 additions and 9 deletions


@@ -263,10 +263,15 @@ See [this page](#generative-models) for more information on how to use generativ
* ✅︎
* ✅︎
- * `Gemma2ForCausalLM`
* Gemma2
* Gemma 2
* `google/gemma-2-9b`, `google/gemma-2-27b`, etc.
* ✅︎
* ✅︎
- * `Gemma3ForCausalLM`
* Gemma 3
* `google/gemma-3-1b-it`, etc.
* ✅︎
* ✅︎
- * `GlmForCausalLM`
* GLM-4
* `THUDM/glm-4-9b-chat-hf`, etc.
@@ -504,7 +509,7 @@ you should explicitly specify the task type to ensure that the model is used in
*
*
- * `Gemma2Model`
* Gemma2-based
* Gemma 2-based
* `BAAI/bge-multilingual-gemma2`, etc.
*
* ✅︎
@@ -752,6 +757,13 @@ See [this page](#generative-models) for more information on how to use generativ
*
* ✅︎
* ✅︎
- * `Gemma3ForConditionalGeneration`
* Gemma 3
* T + I<sup>+</sup>
* `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc.
* ✅︎
* ✅︎
* ✅︎\*
- * `GLM4VForCausalLM`<sup>^</sup>
* GLM-4V
* T + I
@@ -937,6 +949,31 @@ For more details, please see: <gh-pr:4087#issuecomment-2250397630>
To use Qwen2.5-VL series models, you have to install the Hugging Face Transformers library from source via `pip install git+https://github.com/huggingface/transformers`.
:::
:::{note}
To use Gemma 3 series models, you have to install the Hugging Face Transformers library from source via
`pip install git+https://github.com/huggingface/transformers`.
The earliest commit that supports this is [`50d3530aa04e7a7d003e6b255a98f79fd0447357`](https://github.com/huggingface/transformers/commit/50d3530aa04e7a7d003e6b255a98f79fd0447357).
Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
However, there are differences in how they handle text + image inputs:
V0 correctly implements the model's attention pattern:
- Uses bidirectional attention between the image tokens corresponding to the same image
- Uses causal attention for other tokens
- Implemented via (naive) PyTorch SDPA with masking tensors
- Note: May use significant memory for long prompts with images
V1 currently uses a simplified attention pattern:
- Uses causal attention for all tokens, including image tokens
- Generates reasonable outputs but does not match the original model's attention for text + image inputs
- Will be updated in the future to support the correct behavior
This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
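The mixed pattern described above can be sketched as follows. This is an illustrative toy, not vLLM's actual backend code; `build_gemma3_attn_mask` is a hypothetical helper name, and the real implementation operates on batched, padded tensors:

```python
import torch

def build_gemma3_attn_mask(image_token_spans, seq_len):
    """Boolean attention mask where True means "may attend".

    image_token_spans: list of (start, end) half-open index ranges,
    one per image, marking that image's placeholder tokens.
    """
    # Start from a causal mask: each token attends to itself and the past.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Allow bidirectional attention among the tokens of the same image.
    for start, end in image_token_spans:
        mask[start:end, start:end] = True
    return mask

# Example: a 7-token prompt whose tokens 2..4 belong to one image.
mask = build_gemma3_attn_mask([(2, 5)], seq_len=7)
# Such a boolean mask can be passed to
# torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask),
# which is the "(naive) PyTorch SDPA with masking tensors" approach:
# materializing the full seq_len x seq_len mask is what makes long
# image prompts memory-hungry.
```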
Additionally, vLLM's current Gemma 3 implementation does not support the pan-and-scan image pre-processing algorithm, which helps handle images with skewed aspect ratios by intelligently cropping them into multiple views.
Without this feature, model performance may degrade when processing images that deviate significantly from square dimensions.
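To make the missing feature concrete, here is a rough sketch of what pan-and-scan style cropping does: split a strongly non-square image into several roughly square views along its long axis. This is a simplified illustration under assumed parameters (`ratio_threshold`, `max_crops` are made-up names), not the actual Gemma 3 preprocessing algorithm:

```python
import math

def pan_and_scan_crops(width, height, ratio_threshold=1.5, max_crops=4):
    """Return (left, top, right, bottom) crop boxes covering the image.

    Roughly square images are kept as a single view; skewed images are
    split into square-ish tiles along the longer dimension.
    """
    long_side, short_side = max(width, height), min(width, height)
    if long_side / short_side < ratio_threshold:
        return [(0, 0, width, height)]  # close enough to square
    n = min(max_crops, math.ceil(long_side / short_side))
    step = long_side / n
    crops = []
    for i in range(n):
        lo, hi = round(i * step), round((i + 1) * step)
        if width >= height:
            crops.append((lo, 0, hi, height))   # tile left-to-right
        else:
            crops.append((0, lo, width, hi))    # tile top-to-bottom
    return crops

# A 1200x400 panorama is split into three 400x400 views.
crops = pan_and_scan_crops(1200, 400)
```

Each crop would then be resized and encoded as its own image view; without this step, a panorama is squashed into a single view and fine detail is lost.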
:::
### Pooling Models
See [this page](pooling-models) for more information on how to use pooling models.