[Misc] Auto detect bitsandbytes pre-quantized models (#16027)

Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com>
2025-04-05 08:30:45 +02:00
parent 63375f0cdb
commit 4285e423a6
3 changed files with 16 additions and 7 deletions
--- a/docs/source/features/quantization/bnb.md
+++ b/docs/source/features/quantization/bnb.md
@@ -19,17 +19,20 @@ And usually, these repositories have a config.json file that includes a quantiza

 ## Read quantized checkpoint

+For pre-quantized checkpoints, vLLM tries to infer the quantization method from the checkpoint's `config.json`, so you don't need to specify the `quantization` argument explicitly.
+
 ```python
 from vllm import LLM
 import torch
 # unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
 model_id = "unsloth/tinyllama-bnb-4bit"
-llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
-quantization="bitsandbytes")
+llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True)
 ```

 ## Inflight quantization: load as 4bit quantization

+For inflight 4bit quantization with BitsAndBytes, you must specify the `quantization` argument explicitly.
+
 ```python
 from vllm import LLM
 import torch
@@ -40,7 +43,7 @@ quantization="bitsandbytes")

 ## OpenAI Compatible Server

-Append the following to your 4bit model arguments:
+Append the following to your model arguments for 4bit inflight quantization:

 ```console
 --quantization bitsandbytes