[Model] Enable quantization support for transformers backend (#12960)

Isotr0py
2025-02-18 11:52:47 +08:00
committed by GitHub
parent efbe854448
commit 67ef8f666a
3 changed files with 66 additions and 23 deletions


@@ -42,7 +42,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
### Transformers fallback
After the merge of <gh-pr:11330>, `vllm` can fall back to models that are available in `transformers`. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
`vllm` can fall back to models that are available in `transformers`. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
To check if the backend is `transformers`, you can simply do this:
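The snippet the doc refers to falls outside this hunk; a minimal sketch of such a check, assuming a placeholder model name and using vLLM's `LLM.apply_model` helper, could look like this:

```python
from vllm import LLM

# Placeholder model name; any decoder LM without a native vLLM
# implementation would be routed to the transformers fallback.
llm = LLM(model="your-org/your-model", task="generate")

# Print the class of the underlying model implementation;
# if it reports TransformersModel, the transformers fallback is in use.
llm.apply_model(lambda model: print(type(model)))
```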
@@ -56,9 +56,13 @@ If it is `TransformersModel` then it means it's based on `transformers`!
#### Supported features
##### LORA and quantization
##### Quantization
Both are not supported yet! Make sure to open an issue and we'll work on this together with the `transformers` team!
The Transformers fallback supports most of the quantization methods available in vLLM (except GGUF). See the [Quantization page](#quantization-index) for more information about supported quantization methods in vLLM.
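As a rough illustration, loading a quantized checkpoint through the fallback looks the same as for natively supported models; a sketch assuming a hypothetical pre-quantized GPTQ checkpoint that is served via the transformers fallback:

```python
from vllm import LLM, SamplingParams

# Hypothetical GPTQ-quantized checkpoint; vLLM reads the quantization
# config from the checkpoint, so no extra flags are needed here.
llm = LLM(model="your-org/custom-model-GPTQ", task="generate")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```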
##### LoRA
LoRA isn't supported on the transformers fallback yet! Make sure to open an issue and we'll work on this together with the `transformers` team!
Usually `transformers` models load adapter weights via the `load_adapters` API, which depends on PEFT. We need to work a bit to either use this API (for now this would result in some weights not being marked as loaded) or replace modules accordingly.
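For context, this is roughly what the PEFT-backed adapter loading path in plain `transformers` looks like (in recent releases the corresponding method is named `load_adapter`); the model and adapter IDs below are placeholders:

```python
# Sketch of the PEFT-dependent adapter loading that plain `transformers` relies on.
# Requires `peft` to be installed; model/adapter IDs are illustrative only.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/base-model")
model.load_adapter("your-org/base-model-lora")  # attaches the LoRA adapter via PEFT
```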