[Quantization][Deprecation] Remove BitBlas (#32683)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
@@ -6,7 +6,6 @@ Contents:
- [AutoAWQ](auto_awq.md)
- [BitsAndBytes](bnb.md)
- [BitBLAS](bitblas.md)
- [GGUF](gguf.md)
- [GPTQModel](gptqmodel.md)
- [Intel Neural Compressor](inc.md)
@@ -49,8 +48,6 @@ th:not(:first-child) {
| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ |
| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |
| BitBLAS | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| BitBLAS (GPTQ) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ |
| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ |
@@ -1,58 +0,0 @@
# BitBLAS

vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.

!!! note
    Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
    Most recent NVIDIA GPUs support `float16`, while `bfloat16` is only supported on newer architectures such as Ampere or Hopper.
    For details, see [supported hardware](README.md#supported-hardware).

Below are the steps to use BitBLAS with vLLM.

```bash
pip install "bitblas>=0.1.0"
```
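
As a quick sanity check that the package is importable in the environment where vLLM will run, something like the following can help (a minimal sketch; it only confirms the import succeeds, and the version lookup falls back to "unknown" if the attribute is absent):

```python
# Verify that BitBLAS imports; if this raises ImportError,
# revisit the pip install step above.
import bitblas

print(getattr(bitblas, "__version__", "unknown"))
```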

vLLM reads the model's config file and supports pre-quantized checkpoints.

You can find pre-quantized models on:

- [Hugging Face (BitBLAS)](https://huggingface.co/models?search=bitblas)
- [Hugging Face (GPTQ)](https://huggingface.co/models?search=gptq)

Usually, these repositories have a `quantize_config.json` file that includes a `quantization_config` section.
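
To inspect what such a file contains before loading the model, one option is to download and print it with `huggingface_hub`. A minimal sketch follows; the repository is the pre-quantized checkpoint used below, and the exact filename and fields vary by checkpoint, so treat this as illustrative:

```python
import json

from huggingface_hub import hf_hub_download

# Illustrative: some repositories keep the quantization settings in
# config.json instead, so adjust the filename if this file is missing.
config_path = hf_hub_download(
    repo_id="hxbgsyxh/llama-13b-4bit-g-1-bitblas",
    filename="quantize_config.json",
)
with open(config_path) as f:
    print(json.dumps(json.load(f), indent=2))
```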

## Read a BitBLAS format checkpoint

```python
from vllm import LLM
import torch

# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
llm = LLM(
    model=model_id,
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitblas",
)
```
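
Once loaded, the quantized model is used like any other vLLM model. A minimal sketch continuing from the `llm` object above (the prompt and sampling settings are purely illustrative):

```python
from vllm import SamplingParams

# Illustrative prompt and sampling settings.
prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```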

## Read a GPTQ format checkpoint

??? code

    ```python
    from vllm import LLM
    import torch

    # "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
    model_id = "hxbgsyxh/llama-13b-4bit-g-1"
    llm = LLM(
        model=model_id,
        dtype=torch.float16,
        trust_remote_code=True,
        quantization="bitblas",
        max_model_len=1024,
    )
    ```