[CI/Build] Add markdown linter (#11857)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
This commit is contained in:
@@ -15,7 +15,7 @@ The main benefits are lower latency and memory usage.
|
||||
You can quantize your own models by installing AutoAWQ or picking one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).
|
||||
|
||||
```console
|
||||
$ pip install autoawq
|
||||
pip install autoawq
|
||||
```
|
||||
|
||||
After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
|
||||
@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
|
||||
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
|
||||
|
||||
```console
|
||||
$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
|
||||
python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
|
||||
```
|
||||
|
||||
AWQ models are also supported directly through the LLM entrypoint:
|
||||
|
||||
@@ -9,7 +9,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
|
||||
Below are the steps to utilize BitsAndBytes with vLLM.
|
||||
|
||||
```console
|
||||
$ pip install bitsandbytes>=0.45.0
|
||||
pip install bitsandbytes>=0.45.0
|
||||
```
|
||||
|
||||
vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
|
||||
@@ -17,7 +17,7 @@ vLLM reads the model's config file and supports both in-flight quantization and
|
||||
You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
|
||||
And usually, these repositories have a config.json file that includes a quantization_config section.
|
||||
|
||||
## Read quantized checkpoint.
|
||||
## Read quantized checkpoint
|
||||
|
||||
```python
|
||||
from vllm import LLM
|
||||
@@ -37,10 +37,11 @@ model_id = "huggyllama/llama-7b"
|
||||
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
|
||||
quantization="bitsandbytes", load_format="bitsandbytes")
|
||||
```
|
||||
|
||||
## OpenAI Compatible Server
|
||||
|
||||
Append the following to your 4bit model arguments:
|
||||
|
||||
```
|
||||
```console
|
||||
--quantization bitsandbytes --load-format bitsandbytes
|
||||
```
|
||||
|
||||
@@ -41,7 +41,7 @@ Currently, we load the model at original precision before quantizing down to 8-b
|
||||
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
|
||||
|
||||
```console
|
||||
$ pip install llmcompressor
|
||||
pip install llmcompressor
|
||||
```
|
||||
|
||||
## Quantization Process
|
||||
@@ -98,7 +98,7 @@ tokenizer.save_pretrained(SAVE_DIR)
|
||||
Install `vllm` and `lm-evaluation-harness`:
|
||||
|
||||
```console
|
||||
$ pip install vllm lm-eval==0.4.4
|
||||
pip install vllm lm-eval==0.4.4
|
||||
```
|
||||
|
||||
Load and run the model in `vllm`:
|
||||
|
||||
@@ -17,7 +17,7 @@ unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
|
||||
To install AMMO (AlgorithMic Model Optimization):
|
||||
|
||||
```console
|
||||
$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
|
||||
pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
|
||||
```
|
||||
|
||||
Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon
|
||||
|
||||
@@ -13,16 +13,16 @@ Currently, vllm only supports loading single-file GGUF models. If you have a mul
|
||||
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
|
||||
|
||||
```console
|
||||
$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
|
||||
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
|
||||
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
|
||||
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
|
||||
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
|
||||
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
|
||||
```
|
||||
|
||||
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
|
||||
|
||||
```console
|
||||
$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
|
||||
$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
|
||||
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
|
||||
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
|
||||
```
|
||||
|
||||
```{warning}
|
||||
|
||||
@@ -16,7 +16,7 @@ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turi
|
||||
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
|
||||
|
||||
```console
|
||||
$ pip install llmcompressor
|
||||
pip install llmcompressor
|
||||
```
|
||||
|
||||
## Quantization Process
|
||||
|
||||
Reference in New Issue
Block a user