[Doc][2/N] Reorganize Models and Usage sections (#11755)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
docs/source/features/quantization/fp8.md
(fp8)=

# FP8 W8A8

vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as NVIDIA H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
Ampere GPUs are supported for W8A16 (weight-only FP8) utilizing Marlin kernels.
Quantization of models with FP8 allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.

Please visit the HF collection of [quantized FP8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).

The FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:

- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.

```{note}
FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
FP8 models will run on GPUs with compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
```
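If you want to check these ranges yourself, recent PyTorch builds (2.1 or newer) expose both FP8 formats as dtypes. The following is a small sketch, assuming such a build is installed:

```python
import torch

# Inspect the numeric limits of the two FP8 formats (assumes PyTorch >= 2.1).
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, min={info.min}, smallest normal={info.tiny}")

# torch.float8_e4m3fn: max=448.0, min=-448.0, smallest normal=0.015625
# torch.float8_e5m2: max=57344.0, min=-57344.0, smallest normal=6.103515625e-05
```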
## Quick Start with Online Dynamic Quantization

Dynamic quantization of an original precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data required. You can enable the feature by specifying `--quantization="fp8"` in the command line or setting `quantization="fp8"` in the LLM constructor.

In this mode, all Linear modules (except for the final `lm_head`) have their weights quantized down to FP8_E4M3 precision with a per-tensor scale. Activations have their minimum and maximum values calculated during each forward pass to provide a dynamic per-tensor scale for high accuracy. As a result, latency improvements are limited in this mode.
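To make the idea of a dynamic per-tensor scale concrete, here is a purely illustrative sketch in plain PyTorch (not vLLM's actual kernels) of how an activation tensor could be scaled and cast to FP8 E4M3 on each forward pass; the quick start example below then enables this mode end to end:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_activations_per_tensor(x: torch.Tensor):
    # One scale for the whole tensor, recomputed from the values seen in this forward pass.
    amax = x.abs().max().clamp(min=1e-12)
    scale = amax / FP8_MAX
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale  # downstream kernels consume the FP8 tensor plus its scale

x = torch.randn(16, 4096)
x_fp8, scale = quantize_activations_per_tensor(x)
x_approx = x_fp8.to(torch.float32) * scale  # dequantized approximation of x
```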
```python
from vllm import LLM
model = LLM("facebook/opt-125m", quantization="fp8")
# INFO 06-10 17:55:42 model_runner.py:157] Loading model weights took 0.1550 GB
result = model.generate("Hello, my name is")
```

```{warning}
Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
```
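The same dynamic quantization can be enabled when serving the model online; for example, with a recent vLLM installation that provides the `vllm serve` entrypoint:

```console
$ vllm serve facebook/opt-125m --quantization fp8
```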
## Installation

To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

```console
$ pip install llmcompressor
```
## Quantization Process

The quantization process involves three main steps:

1. Loading the model
2. Applying quantization
3. Evaluating accuracy in vLLM
### 1. Loading the Model

Use `SparseAutoModelForCausalLM`, which wraps `AutoModelForCausalLM`, for saving and loading quantized models:

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```
### 2. Applying Quantization

For FP8 quantization, we can recover accuracy with simple RTN quantization. We recommend targeting all `Linear` layers using the `FP8_DYNAMIC` scheme, which uses:

- Static, per-channel quantization on the weights
- Dynamic, per-token quantization on the activations

Since simple RTN does not require data for weight quantization and the activations are quantized dynamically, we do not need any calibration data for this quantization flow.
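The difference between the two granularities can be seen in a small, purely illustrative sketch of the scale shapes (not llm-compressor internals); the actual quantization recipe follows below:

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
out_features, in_features, num_tokens = 4096, 4096, 8

# Static, per-channel weight scales: one scale per output channel,
# computed once from the weights at quantization time.
weight = torch.randn(out_features, in_features)
weight_scales = weight.abs().amax(dim=1, keepdim=True) / FP8_MAX  # shape [out_features, 1]

# Dynamic, per-token activation scales: one scale per token,
# recomputed from the live activations on every forward pass.
activations = torch.randn(num_tokens, in_features)
activation_scales = activations.abs().amax(dim=1, keepdim=True) / FP8_MAX  # shape [num_tokens, 1]
```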
```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Configure the simple PTQ quantization
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)

# Save the model.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
### 3. Evaluating Accuracy

Install `vllm` and `lm-evaluation-harness`:

```console
$ pip install vllm lm-eval==0.4.4
```

Load and run the model in `vllm`:

```python
from vllm import LLM
model = LLM("./Meta-Llama-3-8B-Instruct-FP8-Dynamic")
model.generate("Hello my name is")
```
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):

```{note}
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
```

```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
$ lm_eval \
  --model vllm \
  --model_args pretrained=$MODEL,add_bos_token=True \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
```
Here's an example of the resulting scores:

```text
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.768|±  |0.0268|
|     |       |strict-match    |     5|exact_match|↑  |0.768|±  |0.0268|
```
## Troubleshooting and Support

If you encounter any issues or have feature requests, please open an issue on the `vllm-project/llm-compressor` GitHub repository.

## Deprecated Flow

```{note}
The following information is preserved for reference and search purposes.
The quantization method described below is deprecated in favor of the `llmcompressor` method described above.
```
For static per-tensor offline quantization to FP8, please install the [AutoFP8 library](https://github.com/neuralmagic/autofp8):

```bash
git clone https://github.com/neuralmagic/AutoFP8.git
pip install -e AutoFP8
```

This package introduces the `AutoFP8ForCausalLM` and `BaseQuantizeConfig` objects for managing how your model will be compressed.
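For instance, a minimal sketch based on the AutoFP8 README that quantizes with dynamic activation scales (so no calibration data is required) might look like the following; treat the directory names as illustrative, and see the next section for the static-scale flow with calibration data:

```python
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8-Dynamic-AutoFP8"  # illustrative output path

# Dynamic activation scales: weights get static scales, activations are scaled at runtime.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize([])  # no calibration samples are needed for the dynamic scheme
model.save_quantized(quantized_model_dir)
```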
## Offline Quantization with Static Activation Scaling Factors

You can use AutoFP8 with calibration data to produce per-tensor static scales for both the weights and activations by enabling the `activation_scheme="static"` argument.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Load and tokenize 512 dataset samples for calibration of activation scales
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Define quantization config with static activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```
Your model checkpoint with quantized weights and activations should be available at `Meta-Llama-3-8B-Instruct-FP8/`.
Finally, you can load the quantized model checkpoint directly in vLLM.

```python
from vllm import LLM
model = LLM(model="Meta-Llama-3-8B-Instruct-FP8/")
# INFO 06-10 21:15:41 model_runner.py:159] Loading model weights took 8.4596 GB
result = model.generate("Hello, my name is")
```