[Misc] Auto detect bitsandbytes pre-quantized models (#16027)
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com>
This commit is contained in:
@@ -19,17 +19,20 @@ And usually, these repositories have a config.json file that includes a quantiza
|
||||
|
||||
## Read quantized checkpoint
|
||||
|
||||
For pre-quantized checkpoints, vLLM will try to infer the quantization method from the config file, so you don't need to explicitly specify the quantization argument.
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
import torch
|
||||
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
|
||||
model_id = "unsloth/tinyllama-bnb-4bit"
|
||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
|
||||
quantization="bitsandbytes")
|
||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True)
|
||||
```
|
||||
|
||||
## Inflight quantization: load as 4bit quantization
|
||||
|
||||
For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify the quantization argument.
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
import torch
|
||||
@@ -40,7 +43,7 @@ quantization="bitsandbytes")
|
||||
|
||||
## OpenAI Compatible Server
|
||||
|
||||
Append the following to your 4bit model arguments:
|
||||
Append the following to your model arguments for 4bit inflight quantization:
|
||||
|
||||
```console
|
||||
--quantization bitsandbytes
|
||||
|
||||
Reference in New Issue
Block a user