[Docs] Fix syntax highlighting of shell commands (#19870)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@@ -9,7 +9,7 @@ The main benefits are lower latency and memory usage.

You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq).

-```console
+```bash
pip install autoawq
```
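For readers who want the quantization step itself, a minimal AutoAWQ sketch might look like the following (the model ID, output path, and `quant_config` values are illustrative assumptions, and the exact API can differ between AutoAWQ releases):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # example model, not prescribed by the docs
quant_path = "mistral-7b-instruct-v0.2-awq"        # example output directory

# Typical AutoAWQ 4-bit settings; tune for your model.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```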
@@ -43,7 +43,7 @@ After installing AutoAWQ, you are ready to quantize a model. Please refer to the

To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:

-```console
+```bash
python examples/offline_inference/llm_engine_example.py \
    --model TheBloke/Llama-2-7b-Chat-AWQ \
    --quantization awq
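Roughly the same thing can be done through vLLM's offline Python API; this is only a sketch, with the prompt and sampling settings made up for illustration:

```python
from vllm import LLM, SamplingParams

# Load the AWQ checkpoint referenced above through the offline API.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is AWQ quantization?"], params)
print(outputs[0].outputs[0].text)
```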
@@ -12,7 +12,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic

Below are the steps to utilize BitBLAS with vLLM.

-```console
+```bash
pip install "bitblas>=0.1.0"
```
@@ -9,7 +9,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal

Below are the steps to utilize BitsAndBytes with vLLM.

-```console
+```bash
pip install "bitsandbytes>=0.45.3"
```
@@ -54,6 +54,6 @@ llm = LLM(

Append the following to your model arguments for 4bit inflight quantization:

-```console
+```bash
--quantization bitsandbytes
```
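The offline Python equivalent would presumably pass the same option to the `LLM` constructor; a minimal sketch, assuming an unquantized base checkpoint as the placeholder model:

```python
from vllm import LLM

# In-flight 4-bit quantization of an unquantized checkpoint (placeholder model ID).
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", quantization="bitsandbytes")

print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```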
@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations,

To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

-```console
+```bash
pip install llmcompressor
```
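As a rough sketch of what dynamic FP8 quantization with llm-compressor looks like (import paths and scheme names are assumptions that may vary between llm-compressor releases; the model ID is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases expose this as llmcompressor.transformers.oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dynamic FP8 (W8A8) needs no calibration data.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```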
@@ -81,7 +81,7 @@ Since simple RTN does not require data for weight quantization and the activatio

Install `vllm` and `lm-evaluation-harness` for evaluation:

-```console
+```bash
pip install vllm lm-eval==0.4.4
```
@@ -99,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):

!!! note
    Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.

-```console
-$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
-$ lm_eval \
+```bash
+MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
+lm_eval \
    --model vllm \
    --model_args pretrained=$MODEL,add_bos_token=True \
    --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
@@ -11,7 +11,7 @@ title: GGUF

To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:

-```console
+```bash
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from the base model to avoid slow and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
@@ -20,7 +20,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \

You can also add `--tensor-parallel-size 2` to enable tensor-parallel inference with 2 GPUs:

-```console
+```bash
# We recommend using the tokenizer from the base model to avoid slow and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
@@ -32,7 +32,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \

GGUF assumes that Hugging Face can convert the metadata to a config file. If Hugging Face doesn't support your model, you can manually create a config and pass it as `hf-config-path`:

-```console
+```bash
# If your model is not supported by Hugging Face, you can manually provide a Hugging Face-compatible config path.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
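The same GGUF file can also be used from the offline Python API; a minimal sketch reusing the base-model tokenizer recommended above (prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```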
@@ -21,7 +21,7 @@ for more details on this and other advanced features.

You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).

-```console
+```bash
pip install -U gptqmodel --no-build-isolation -v
```
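A minimal GPTQModel quantization sketch might look like the following (the calibration dataset, model ID, and output path are placeholders, and argument names may differ between GPTQModel releases):

```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Small calibration set pulled from C4 (placeholder choice).
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(256))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration, batch_size=1)
model.save("Llama-3.2-1B-Instruct-gptqmodel-4bit")
```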
@@ -60,7 +60,7 @@ Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:

To run a GPTQModel-quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:

-```console
+```bash
python examples/offline_inference/llm_engine_example.py \
    --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```
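The offline Python API can load the same checkpoint; a sketch, assuming vLLM detects the GPTQ config from the checkpoint and with an illustrative prompt:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")

outputs = llm.generate(["Explain GPTQ in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```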
@@ -14,13 +14,13 @@ Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs re

To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

-```console
+```bash
pip install llmcompressor
```

Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

-```console
+```bash
pip install vllm lm-eval==0.4.4
```
@@ -116,8 +116,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")

To evaluate accuracy, you can use `lm_eval`:

-```console
-$ lm_eval --model vllm \
+```bash
+lm_eval --model vllm \
    --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \
    --tasks gsm8k \
    --num_fewshot 5 \
@@ -15,13 +15,13 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re

To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

-```console
+```bash
pip install llmcompressor
```

Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

-```console
+```bash
pip install vllm lm-eval==0.4.4
```
@@ -122,8 +122,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")

To evaluate accuracy, you can use `lm_eval`:

-```console
-$ lm_eval --model vllm \
+```bash
+lm_eval --model vllm \
    --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
    --tasks gsm8k \
    --num_fewshot 5 \
@@ -4,7 +4,7 @@ The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-O

We recommend installing the library with:

-```console
+```bash
pip install nvidia-modelopt
```
@@ -65,7 +65,7 @@ For optimal model quality when using FP8 KV Cache, we recommend using calibrated

First, install the required dependencies:

-```console
+```bash
pip install llmcompressor
```
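For context, enabling the FP8 KV cache from the offline Python API is a matter of setting `kv_cache_dtype`; a minimal sketch with a placeholder model:

```python
from vllm import LLM

# Run with an FP8 KV cache; calibrated scales are picked up from the checkpoint when present.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")

print(llm.generate(["Hello"])[0].outputs[0].text)
```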
@@ -13,7 +13,7 @@ AWQ, GPTQ, Rotation and SmoothQuant.

Before quantizing models, you need to install Quark. The latest release of Quark can be installed with pip:

-```console
+```bash
pip install amd-quark
```
@@ -22,13 +22,13 @@ for more installation details.

Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

-```console
+```bash
pip install vllm lm-eval==0.4.4
```

## Quantization Process

After installing Quark, we will use an example to illustrate how to use Quark.
The Quark quantization process consists of the following 5 steps:

1. Load the model
@@ -209,8 +209,8 @@ Now, you can load and run the Quark quantized model directly through the LLM ent

Or, you can use `lm_eval` to evaluate accuracy:

-```console
-$ lm_eval --model vllm \
+```bash
+lm_eval --model vllm \
    --model_args pretrained=Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant,kv_cache_dtype='fp8',quantization='quark' \
    --tasks gsm8k
```
@@ -222,7 +222,7 @@ to quantize large language models more conveniently. It supports quantizing mode

of different quantization schemes and optimization algorithms. It can export the quantized model
and run evaluation tasks on the fly. With the script, the example above can be:

-```console
+```bash
python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \
    --output_dir /path/to/output \
    --quant_scheme w_fp8_a_fp8 \
@@ -4,7 +4,7 @@ TorchAO is an architecture optimization library for PyTorch, it provides high pe

We recommend installing the latest torchao nightly with

-```console
+```bash
# Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
pip install \