[Doc][2/N] Reorganize Models and Usage sections (#11755)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
docs/source/features/quantization/auto_awq.md (new file, 78 lines)
@@ -0,0 +1,78 @@
(auto-awq)=

# AutoAWQ

```{warning}
Please note that AWQ support in vLLM is currently under-optimized. We recommend using the unquantized version of the model for better
accuracy and higher throughput. For now, you can use AWQ as a way to reduce memory footprint. It is better suited to low-latency
inference with a small number of concurrent requests, as vLLM's AWQ implementation has lower throughput than the unquantized version.
```

To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantizing reduces the model's precision from FP16 to INT4, which effectively reduces the file size by ~70%.
The main benefits are lower latency and memory usage.

You can quantize your own models by installing AutoAWQ, or pick one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).

```console
$ pip install autoawq
```

After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```
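Once saved, the quantized checkpoint can be loaded directly by vLLM. The following is a minimal sketch that assumes the `mistral-instruct-v0.2-awq` directory produced above:

```python
from vllm import LLM

# Point vLLM at the locally saved AWQ checkpoint produced by the script above.
llm = LLM(model="mistral-instruct-v0.2-awq", quantization="awq")

output = llm.generate("Hello, my name is")[0]
print(output.outputs[0].text)
```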
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:

```console
$ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
```

AWQ models are also supported directly through the LLM entrypoint:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="AWQ")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
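The OpenAI-compatible server can serve AWQ checkpoints with the same quantization flag; a brief sketch, assuming the model above:

```console
$ vllm serve TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
```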
docs/source/features/quantization/bnb.md (new file, 46 lines)
@@ -0,0 +1,46 @@
(bits-and-bytes)=

# BitsAndBytes

vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
Compared to other quantization methods, BitsAndBytes eliminates the need for calibrating the quantized model with input data.

Below are the steps to utilize BitsAndBytes with vLLM.

```console
$ pip install "bitsandbytes>=0.45.0"
```

vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoints.

You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
These repositories usually have a `config.json` file that includes a `quantization_config` section.

## Read quantized checkpoint

```python
from vllm import LLM
import torch

# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes", load_format="bitsandbytes")
```

## In-flight quantization: load as 4-bit quantization

```python
from vllm import LLM
import torch

model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes", load_format="bitsandbytes")
```

## OpenAI Compatible Server

Append the following to your 4-bit model arguments:

```
--quantization bitsandbytes --load-format bitsandbytes
```
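For example, a complete launch command might look like the following (a sketch using the pre-quantized checkpoint from above):

```console
$ vllm serve unsloth/tinyllama-bnb-4bit --quantization bitsandbytes --load-format bitsandbytes
```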
docs/source/features/quantization/fp8.md (new file, 192 lines)
@@ -0,0 +1,192 @@
(fp8)=

# FP8 W8A8

vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
Ampere GPUs are supported for W8A16 (weight-only FP8) using Marlin kernels.
Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.

Please visit the HF collection of [quantized FP8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).

The FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:

- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.

```{note}
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
FP8 models will run on GPUs with compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
```

## Quick Start with Online Dynamic Quantization

Dynamic quantization of an original-precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data. You can enable the feature by specifying `--quantization="fp8"` on the command line or setting `quantization="fp8"` in the LLM constructor.

In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.

```python
from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
```

```{warning}
Currently, we load the model at original precision before quantizing down to 8 bits, so you need enough memory to load the whole model.
```
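Online dynamic quantization can likewise be enabled when launching the OpenAI-compatible server; a minimal sketch using the same model:

```console
$ vllm serve facebook/opt-125m --quantization fp8
```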
## Installation

To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
$ pip install llmcompressor
```

## Quantization Process

The quantization process involves three main steps:

1. Loading the model
2. Applying quantization
3. Evaluating accuracy in vLLM

### 1. Loading the Model

Use `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM`, for saving and loading quantized models:

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 2. Applying Quantization

For FP8 quantization, we can recover accuracy with simple RTN quantization. We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which uses:

- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations

Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Save the model.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
### 3. Evaluating Accuracy

Install `vllm` and `lm-evaluation-harness`:

```console
$ pip install vllm lm-eval==0.4.4
```

Load and run the model in `vllm`:

```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
model.generate("Hello my name is")
```

Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):

```{note}
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
```

```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
$ lm_eval \
    --model vllm \
    --model_args pretrained=$MODEL,add_bos_token=True \
    --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
```

Here's an example of the resulting scores:

```text
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
|     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|
```

## Troubleshooting and Support

If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.

## Deprecated Flow

```{note}
The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
```

For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8).

```bash
git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8
```

This package introduces the `AutoFP8ForCausalLM` and `BaseQuantizeConfig` objects for managing how your model will be compressed.

## Offline Quantization with Static Activation Scaling Factors

You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the `activation_scheme="static"` argument.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load and tokenize 512 dataset samples for calibration of activation scales
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```

Your model checkpoint with quantized weights and activations should be available at `Meta-Llama-3-8B-Instruct-FP8/`.
Finally, you can load the quantized model checkpoint directly in vLLM.

```python
from vllm import LLM
model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
result = model.generate("Hello, my name is")
```
docs/source/features/quantization/fp8_e4m3_kvcache.md (new file, 44 lines)
@@ -0,0 +1,44 @@
(fp8-e4m3-kvcache)=

# FP8 E4M3 KV Cache

Quantizing the KV cache to FP8 reduces its memory footprint. This increases the number of tokens that can be stored in the cache,
improving throughput. OCP (Open Compute Project, www.opencompute.org) specifies two common 8-bit floating point data formats: E5M2
(5 exponent bits and 2 mantissa bits) and E4M3FN (4 exponent bits and 3 mantissa bits), often shortened to E4M3. One benefit of
the E4M3 format over E5M2 is that floating point numbers are represented in higher precision. However, the small dynamic range of
FP8 E4M3 (±240.0 can be represented) typically necessitates the use of a higher-precision (typically FP32) scaling factor alongside
each quantized tensor. For now, only per-tensor (scalar) scaling factors are supported. Development is ongoing to support scaling
factors of a finer granularity (e.g. per-channel).

These scaling factors can be specified by passing an optional quantization param JSON to the LLM engine at load time. If
this JSON is not specified, scaling factors default to 1.0. These scaling factors are typically obtained when running an
unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).

To install AMMO (AlgorithMic Model Optimization):

```console
$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
```

Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
offerings (e.g. AMD MI300, NVIDIA Hopper or later) support native hardware conversion to and from fp32, fp16, bf16, etc.
Thus, LLM inference is greatly accelerated with minimal accuracy loss.

Here is an example of how to enable this feature:

```python
# Two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv; please refer to
# https://github.com/vllm-project/vllm/blob/main/examples/fp8/README.md to generate kv_cache_scales.json of your own.

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          kv_cache_dtype="fp8",
          quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json")
prompt = "London is the capital of"
out = llm.generate(prompt, sampling_params)[0].outputs[0].text
print(out)

# output w/ scaling factors:  England, the United Kingdom, and one of the world's leading financial,
# output w/o scaling factors:  England, located in the southeastern part of the country. It is known
```
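The same feature can be enabled when launching a server. Below is a sketch; the `--kv-cache-dtype` and `--quantization-param-path` flags are assumed to be the command-line counterparts of the constructor arguments used above:

```console
$ vllm serve meta-llama/Llama-2-7b-chat-hf \
    --kv-cache-dtype fp8 \
    --quantization-param-path ./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json
```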
docs/source/features/quantization/fp8_e5m2_kvcache.md (new file, 31 lines)
@@ -0,0 +1,31 @@
(fp8-kv-cache)=

# FP8 E5M2 KV Cache

Int8/int4 quantization schemes require additional GPU memory to store the scaling factors, which reduces the expected memory benefits.
The FP8 data format, by contrast, retains 2~3 mantissa bits and can be converted to and from float/fp16/bfloat16 directly.

Here is an example of how to enable this feature:

```python
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
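When launching a server, the corresponding engine argument can be passed on the command line; a sketch (the `fp8_e5m2` value selects the E5M2 format explicitly):

```console
$ vllm serve facebook/opt-125m --kv-cache-dtype fp8_e5m2
```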
docs/source/features/quantization/gguf.md (new file, 72 lines)
@@ -0,0 +1,72 @@
(gguf)=

# GGUF

```{warning}
Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
```

```{warning}
Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
```
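For instance, merging a sharded GGUF checkpoint might look like the following sketch (the shard filenames are placeholders; `gguf-split` is built as part of llama.cpp):

```console
$ # Merge all shards into a single GGUF file, starting from the first shard.
$ gguf-split --merge model-00001-of-00003.gguf model-merged.gguf
```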
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:

```console
$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
$ # We recommend using the tokenizer from the base model to avoid slow and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

```console
$ # We recommend using the tokenizer from the base model to avoid slow and buggy tokenizer conversion.
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```

```{warning}
We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary size.
```

You can also use the GGUF model directly through the LLM entrypoint:

```python
from vllm import LLM, SamplingParams

# In this script, we demonstrate how to pass input to the chat method:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
          tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
docs/source/features/quantization/index.md (new file, 19 lines)
@@ -0,0 +1,19 @@
(quantization-index)=

# Quantization

Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

```{toctree}
:caption: Contents
:maxdepth: 1

supported_hardware
auto_awq
bnb
gguf
int8
fp8
fp8_e5m2_kvcache
fp8_e4m3_kvcache
```
docs/source/features/quantization/int8.md (new file, 136 lines)
@@ -0,0 +1,136 @@
(int8)=

# INT8 W8A8

vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.

Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).

```{note}
INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper).
```

## Prerequisites

To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
$ pip install llmcompressor
```

## Quantization Process

The quantization process involves four main steps:

1. Loading the model
2. Preparing calibration data
3. Applying quantization
4. Evaluating accuracy in vLLM

### 1. Loading the Model

Use `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM`, for saving and loading quantized models:

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

### 2. Preparing Calibration Data

When quantizing activations to INT8, you need sample data to estimate the activation scales.
It's best to use calibration data that closely matches your deployment data.
For a general-purpose instruction-tuned model, you can use a dataset like `ultrachat`:

```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load and preprocess the dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
```
### 3. Applying Quantization

Now, apply the quantization algorithms:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Configure the quantization algorithms
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

This process creates a W8A8 model with weights and activations quantized to 8-bit integers.

### 4. Evaluating Accuracy

After quantization, you can load and run the model in vLLM:

```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
```

To evaluate accuracy, you can use `lm_eval`:

```console
$ lm_eval --model vllm \
    --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
    --tasks gsm8k \
    --num_fewshot 5 \
    --limit 250 \
    --batch_size 'auto'
```

```{note}
Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
```

## Best Practices

- Start with 512 samples for calibration data (increase if accuracy drops)
- Use a sequence length of 2048 as a starting point
- Employ the chat template or instruction template that the model was trained with
- If you've fine-tuned a model, consider using a sample of your training data for calibration

## Troubleshooting and Support

If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.
docs/source/features/quantization/supported_hardware.md (new file, 131 lines)
@@ -0,0 +1,131 @@
(quantization-supported-hardware)=

# Supported Hardware

The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:

```{list-table}
:header-rows: 1
:widths: 20 8 8 8 8 8 8 8 8 8 8

* - Implementation
  - Volta
  - Turing
  - Ampere
  - Ada
  - Hopper
  - AMD GPU
  - Intel GPU
  - x86 CPU
  - AWS Inferentia
  - Google TPU
* - AWQ
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✅︎
  - ✅︎
  - ✗
  - ✗
* - GPTQ
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✅︎
  - ✅︎
  - ✗
  - ✗
* - Marlin (GPTQ/AWQ/FP8)
  - ✗
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - INT8 (W8A8)
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✅︎
  - ✗
  - ✗
* - FP8 (W8A8)
  - ✗
  - ✗
  - ✗
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
* - AQLM
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - bitsandbytes
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - DeepSpeedFP
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
* - GGUF
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✅︎
  - ✗
  - ✗
  - ✗
  - ✗
  - ✗
```

- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
- "✅︎" indicates that the quantization method is supported on the specified hardware.
- "✗" indicates that the quantization method is not supported on the specified hardware.

```{note}
This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.

For the most up-to-date information on hardware support and quantization methods, please refer to <gh-dir:vllm/model_executor/layers/quantization> or consult with the vLLM development team.
```