docs/features/quantization/gguf.md

# GGUF

!!! warning
    Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.

!!! warning
    Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.

To run a GGUF model with vLLM, you can use the `repo_id:quant_type` format to load directly from HuggingFace. For example, to load a Q4_K_M quantized model from [unsloth/Qwen3-0.6B-GGUF](https://huggingface.co/unsloth/Qwen3-0.6B-GGUF):

```bash
# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B
```

You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

```bash
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
   --tokenizer Qwen/Qwen3-0.6B \
   --tensor-parallel-size 2
```

Alternatively, you can download and use a local GGUF file:

```bash
wget https://huggingface.co/unsloth/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q4_K_M.gguf
vllm serve ./Qwen3-0.6B-Q4_K_M.gguf --tokenizer Qwen/Qwen3-0.6B
```

!!! warning
    We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.

GGUF assumes that HuggingFace can convert the metadata to a config file. In case HuggingFace doesn't support your model you can manually create a config and pass it as hf-config-path

```bash
# If your model is not supported by HuggingFace you can manually provide a HuggingFace compatible config path
vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
   --tokenizer Qwen/Qwen3-0.6B \
   --hf-config-path Qwen/Qwen3-0.6B
```

You can also use the GGUF model directly through the LLM entrypoint:

??? code

      ```python
      from vllm import LLM, SamplingParams

      # In this script, we demonstrate how to pass input to the chat method:
      conversation = [
         {
            "role": "system",
            "content": "You are a helpful assistant",
         },
         {
            "role": "user",
            "content": "Hello",
         },
         {
            "role": "assistant",
            "content": "Hello! How can I assist you today?",
         },
         {
            "role": "user",
            "content": "Write an essay about the importance of higher education.",
         },
      ]

      # Create a sampling params object.
      sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

      # Create an LLM using repo_id:quant_type format.
      llm = LLM(
         model="unsloth/Qwen3-0.6B-GGUF:Q4_K_M",
         tokenizer="Qwen/Qwen3-0.6B",
      )
      # Generate texts from the prompts. The output is a list of RequestOutput objects
      # that contain the prompt, generated text, and other information.
      outputs = llm.chat(conversation, sampling_params)

      # Print the outputs.
      for output in outputs:
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
      ```
Stop using title frontmatter and fix doc that can only be reached by search (#20623) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-07-08 11:27:40 +01:00			`# GGUF`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
Migrate docs from Sphinx to MkDocs (#18145) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-05-23 11:09:53 +02:00			`!!! warning`
			`Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
Migrate docs from Sphinx to MkDocs (#18145) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-05-23 11:09:53 +02:00			`!!! warning`
			`Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
[UX] Use gguf `repo_id:quant_type` syntax for examples and docs (#33371) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-01-30 23:14:54 -05:00			To run a GGUF model with vLLM, you can use the `repo_id:quant_type` format to load directly from HuggingFace. For example, to load a Q4_K_M quantized model from [unsloth/Qwen3-0.6B-GGUF](https://huggingface.co/unsloth/Qwen3-0.6B-GGUF):
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
[Docs] Fix syntax highlighting of shell commands (#19870) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com> 2025-06-23 18:59:09 +01:00			```bash
[CI/Build] Add markdown linter (#11857) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2025-01-12 03:17:13 -05:00			`# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.`
[UX] Use gguf `repo_id:quant_type` syntax for examples and docs (#33371) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-01-30 23:14:54 -05:00			`vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M --tokenizer Qwen/Qwen3-0.6B`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			```

			You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

[Docs] Fix syntax highlighting of shell commands (#19870) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com> 2025-06-23 18:59:09 +01:00			```bash
[UX] Use gguf `repo_id:quant_type` syntax for examples and docs (#33371) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-01-30 23:14:54 -05:00			`vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \`
			`--tokenizer Qwen/Qwen3-0.6B \`
[doc] improve readability (#18675) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-05-25 16:40:31 +08:00			`--tensor-parallel-size 2`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			```

[UX] Use gguf `repo_id:quant_type` syntax for examples and docs (#33371) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-01-30 23:14:54 -05:00			`Alternatively, you can download and use a local GGUF file:`

			```bash
			`wget https://huggingface.co/unsloth/Qwen3-0.6B-GGUF/resolve/main/Qwen3-0.6B-Q4_K_M.gguf`
			`vllm serve ./Qwen3-0.6B-Q4_K_M.gguf --tokenizer Qwen/Qwen3-0.6B`
			```

Migrate docs from Sphinx to MkDocs (#18145) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-05-23 11:09:53 +02:00			`!!! warning`
			`We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.`
[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00
[UX] Use gguf `repo_id:quant_type` syntax for examples and docs (#33371) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-01-30 23:14:54 -05:00			`GGUF assumes that HuggingFace can convert the metadata to a config file. In case HuggingFace doesn't support your model you can manually create a config and pass it as hf-config-path`
[Model] Deepseek GGUF support (#13167) 2025-02-27 11:08:35 +01:00
[Docs] Fix syntax highlighting of shell commands (#19870) Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com> 2025-06-23 18:59:09 +01:00			```bash
[UX] Use gguf `repo_id:quant_type` syntax for examples and docs (#33371) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-01-30 23:14:54 -05:00			`# If your model is not supported by HuggingFace you can manually provide a HuggingFace compatible config path`
			`vllm serve unsloth/Qwen3-0.6B-GGUF:Q4_K_M \`
			`--tokenizer Qwen/Qwen3-0.6B \`
			`--hf-config-path Qwen/Qwen3-0.6B`
[Model] Deepseek GGUF support (#13167) 2025-02-27 11:08:35 +01:00			```

[Docs] Convert rST to MyST (Markdown) (#11145) Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com> 2024-12-23 17:35:38 -05:00			`You can also use the GGUF model directly through the LLM entrypoint:`

Make distinct `code` and `console` admonitions so readers are less likely to miss them (#20585) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> 2025-07-08 03:55:28 +01:00			`??? code`
[doc] Fold long code blocks to improve readability (#19926) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-06-23 13:24:23 +08:00
			```python
			`from vllm import LLM, SamplingParams`

			`# In this script, we demonstrate how to pass input to the chat method:`
			`conversation = [`
			`{`
			`"role": "system",`
[Doc] ruff format remaining Python examples (#26795) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-10-15 16:25:49 +08:00			`"content": "You are a helpful assistant",`
[doc] Fold long code blocks to improve readability (#19926) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-06-23 13:24:23 +08:00			`},`
			`{`
			`"role": "user",`
[Doc] ruff format remaining Python examples (#26795) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-10-15 16:25:49 +08:00			`"content": "Hello",`
[doc] Fold long code blocks to improve readability (#19926) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-06-23 13:24:23 +08:00			`},`
			`{`
			`"role": "assistant",`
[Doc] ruff format remaining Python examples (#26795) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-10-15 16:25:49 +08:00			`"content": "Hello! How can I assist you today?",`
[doc] Fold long code blocks to improve readability (#19926) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-06-23 13:24:23 +08:00			`},`
			`{`
			`"role": "user",`
			`"content": "Write an essay about the importance of higher education.",`
			`},`
			`]`

			`# Create a sampling params object.`
			`sampling_params = SamplingParams(temperature=0.8, top_p=0.95)`

[UX] Use gguf `repo_id:quant_type` syntax for examples and docs (#33371) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-01-30 23:14:54 -05:00			`# Create an LLM using repo_id:quant_type format.`
[Doc] ruff format remaining Python examples (#26795) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-10-15 16:25:49 +08:00			`llm = LLM(`
[UX] Use gguf `repo_id:quant_type` syntax for examples and docs (#33371) Signed-off-by: mgoin <mgoin64@gmail.com> 2026-01-30 23:14:54 -05:00			`model="unsloth/Qwen3-0.6B-GGUF:Q4_K_M",`
			`tokenizer="Qwen/Qwen3-0.6B",`
[Doc] ruff format remaining Python examples (#26795) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> 2025-10-15 16:25:49 +08:00			`)`
[doc] Fold long code blocks to improve readability (#19926) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com> 2025-06-23 13:24:23 +08:00			`# Generate texts from the prompts. The output is a list of RequestOutput objects`
			`# that contain the prompt, generated text, and other information.`
			`outputs = llm.chat(conversation, sampling_params)`

			`# Print the outputs.`
			`for output in outputs:`
			`prompt = output.prompt`
			`generated_text = output.outputs[0].text`
			`print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")`
			```