[Docs] Fix syntax highlighting of shell commands (#19870)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
This commit is contained in:
Lukas Geiger
2025-06-23 18:59:09 +01:00
committed by GitHub
parent 53243e5c42
commit c3649e4fee
53 changed files with 220 additions and 220 deletions

View File

@@ -9,7 +9,7 @@ The main benefits are lower latency and memory usage.
You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq).
```console
```bash
pip install autoawq
```
@@ -43,7 +43,7 @@ After installing AutoAWQ, you are ready to quantize a model. Please refer to the
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console
```bash
python examples/offline_inference/llm_engine_example.py \
--model TheBloke/Llama-2-7b-Chat-AWQ \
--quantization awq

View File

@@ -12,7 +12,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic
Below are the steps to utilize BitBLAS with vLLM.
```console
```bash
pip install bitblas>=0.1.0
```

View File

@@ -9,7 +9,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
Below are the steps to utilize BitsAndBytes with vLLM.
```console
```bash
pip install bitsandbytes>=0.45.3
```
@@ -54,6 +54,6 @@ llm = LLM(
Append the following to your model arguments for 4bit inflight quantization:
```console
```bash
--quantization bitsandbytes
```

View File

@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations,
To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console
```bash
pip install llmcompressor
```
@@ -81,7 +81,7 @@ Since simple RTN does not require data for weight quantization and the activatio
Install `vllm` and `lm-evaluation-harness` for evaluation:
```console
```bash
pip install vllm lm-eval==0.4.4
```
@@ -99,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
!!! note
Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
$ lm_eval \
```bash
MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
lm_eval \
--model vllm \
--model_args pretrained=$MODEL,add_bos_token=True \
--tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

View File

@@ -11,7 +11,7 @@ title: GGUF
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
```console
```bash
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
@@ -20,7 +20,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
```console
```bash
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
@@ -32,7 +32,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path
```console
```bash
# If you model is not supported by huggingface you can manually provide a huggingface compatible config path
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \

View File

@@ -21,7 +21,7 @@ for more details on this and other advanced features.
You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).
```console
```bash
pip install -U gptqmodel --no-build-isolation -v
```
@@ -60,7 +60,7 @@ Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:
To run an GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
```console
```bash
python examples/offline_inference/llm_engine_example.py \
--model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```

View File

@@ -14,13 +14,13 @@ Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs re
To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console
```bash
pip install llmcompressor
```
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console
```bash
pip install vllm lm-eval==0.4.4
```
@@ -116,8 +116,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")
To evaluate accuracy, you can use `lm_eval`:
```console
$ lm_eval --model vllm \
```bash
lm_eval --model vllm \
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \
--tasks gsm8k \
--num_fewshot 5 \

View File

@@ -15,13 +15,13 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re
To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```console
```bash
pip install llmcompressor
```
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console
```bash
pip install vllm lm-eval==0.4.4
```
@@ -122,8 +122,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")
To evaluate accuracy, you can use `lm_eval`:
```console
$ lm_eval --model vllm \
```bash
lm_eval --model vllm \
--model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
--tasks gsm8k \
--num_fewshot 5 \

View File

@@ -4,7 +4,7 @@ The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-O
We recommend installing the library with:
```console
```bash
pip install nvidia-modelopt
```

View File

@@ -65,7 +65,7 @@ For optimal model quality when using FP8 KV Cache, we recommend using calibrated
First, install the required dependencies:
```console
```bash
pip install llmcompressor
```

View File

@@ -13,7 +13,7 @@ AWQ, GPTQ, Rotation and SmoothQuant.
Before quantizing models, you need to install Quark. The latest release of Quark can be installed with pip:
```console
```bash
pip install amd-quark
```
@@ -22,13 +22,13 @@ for more installation details.
Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:
```console
```bash
pip install vllm lm-eval==0.4.4
```
## Quantization Process
After installing Quark, we will use an example to illustrate how to use Quark.
After installing Quark, we will use an example to illustrate how to use Quark.
The Quark quantization process can be listed for 5 steps as below:
1. Load the model
@@ -209,8 +209,8 @@ Now, you can load and run the Quark quantized model directly through the LLM ent
Or, you can use `lm_eval` to evaluate accuracy:
```console
$ lm_eval --model vllm \
```bash
lm_eval --model vllm \
--model_args pretrained=Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant,kv_cache_dtype='fp8',quantization='quark' \
--tasks gsm8k
```
@@ -222,7 +222,7 @@ to quantize large language models more conveniently. It supports quantizing mode
of different quantization schemes and optimization algorithms. It can export the quantized model
and run evaluation tasks on the fly. With the script, the example above can be:
```console
```bash
python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \
--output_dir /path/to/output \
--quant_scheme w_fp8_a_fp8 \

View File

@@ -4,7 +4,7 @@ TorchAO is an architecture optimization library for PyTorch, it provides high pe
We recommend installing the latest torchao nightly with
```console
```bash
# Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
pip install \