[Docs] Add supported quantization methods to docs (#2135)

2023-12-15 13:29:22 -08:00
parent 0fbfc4b81b
commit b81a6a6bb3
2 changed files with 4 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -35,6 +35,7 @@ vLLM is fast with:
 - State-of-the-art serving throughput
 - Efficient management of attention key and value memory with **PagedAttention**
 - Continuous batching of incoming requests
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [SqueezeLLM](https://arxiv.org/abs/2306.07629)
 - Optimized CUDA kernels

 vLLM is flexible and easy to use with:
@@ -44,7 +45,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
-- Support NVIDIA CUDA and AMD ROCm.
+- Support NVIDIA GPUs and AMD GPUs.

 vLLM seamlessly supports many Hugging Face models, including the following architectures: