[UX] Use gguf repo_id:quant_type syntax for examples and docs (#33371)

Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-30 23:14:54 -05:00
parent 9df152bbf6
commit 29fba76781
4 changed files with 79 additions and 28 deletions
--- a/docs/features/quantization/gguf.md
+++ b/docs/features/quantization/gguf.md
@@ -6,34 +6,38 @@
 !!! warning
    Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.

-To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
+To run a GGUF model with vLLM, you can use the `repo_id:quant_type` format to load directly from HuggingFace. For example, to load a Q4_K_M quantized model from [unsloth/Qwen3-0.6B-GGUF](https://huggingface.co/unsloth/Qwen3-0.6B-GGUF):

 ```bash
-wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
 # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
-   --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
+vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B
 ```

 You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

 ```bash
-# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
-   --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
+vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
+   --tokenizer Qwen/Qwen3-0.6B \
   --tensor-parallel-size 2
 ```

+Alternatively, you can download and use a local GGUF file:
+
+```bash
+wget https://huggingface.co/unsloth/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q4_K_M.gguf
+vllm serve ./Qwen3-0.6B-Q4_K_M.gguf --tokenizer Qwen/Qwen3-0.6B
+```
+
 !!! warning
    We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.

-GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path
+GGUF assumes that HuggingFace can convert the metadata to a config file. If HuggingFace doesn't support your model, you can manually create a config and pass it via `--hf-config-path`:

 ```bash
-# If you model is not supported by huggingface you can manually provide a huggingface compatible config path
-vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
-   --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
-   --hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
+# If your model is not supported by HuggingFace, you can manually provide a HuggingFace-compatible config path.
+vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
+   --tokenizer Qwen/Qwen3-0.6B \
+   --hf-config-path Qwen/Qwen3-0.6B
 ```

 You can also use the GGUF model directly through the LLM entrypoint:
@@ -66,10 +70,10 @@ You can also use the GGUF model directly through the LLM entrypoint:
      # Create a sampling params object.
      sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

-      # Create an LLM.
+      # Create an LLM using repo_id:quant_type format.
      llm = LLM(
-         model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
-         tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+         model="unsloth/Qwen3-0.6B-GGUF:Q4_K_M",
+         tokenizer="Qwen/Qwen3-0.6B",
      )
      # Generate texts from the prompts. The output is a list of RequestOutput objects
      # that contain the prompt, generated text, and other information.