Stop using title frontmatter and fix doc that can only be reached by search (#20623)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -1,6 +1,4 @@
----
-title: Quantization
----
+# Quantization
 
 Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
 
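The hunk above quotes the doc's claim that quantization trades precision for memory footprint. As a back-of-envelope illustration (not taken from the docs being edited), the weight memory of a 7B-parameter model at the precisions these pages cover:

```python
# Rough weight-memory footprint of a 7B-parameter model at different
# precisions. Illustrative arithmetic only; real deployments also need
# memory for activations, the KV cache, and framework overhead.

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def weight_gib(num_params: float, dtype: str) -> float:
    """Weight memory in GiB for a given parameter count and precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

for dtype in BYTES_PER_PARAM:
    print(f"7B @ {dtype}: {weight_gib(7e9, dtype):.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why the INT4 pages below emphasize fitting larger models on smaller devices.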
@@ -1,6 +1,4 @@
----
-title: AutoAWQ
----
+# AutoAWQ
 
 To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
 Quantization reduces the model's precision from BF16/FP16 to INT4 which effectively reduces the total model memory footprint.
@@ -1,6 +1,4 @@
----
-title: BitBLAS
----
+# BitBLAS
 
 vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
 
@@ -1,6 +1,4 @@
----
-title: BitsAndBytes
----
+# BitsAndBytes
 
 vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
 BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
@@ -1,6 +1,4 @@
----
-title: FP8 W8A8
----
+# FP8 W8A8
 
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
 Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
@@ -1,6 +1,4 @@
----
-title: GGUF
----
+# GGUF
 
 !!! warning
     Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it may be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
@@ -1,6 +1,4 @@
----
-title: GPTQModel
----
+# GPTQModel
 
 To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
 
@@ -1,6 +1,4 @@
----
-title: INT4 W4A16
----
+# INT4 W4A16
 
 vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
 
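The INT4 W4A16 page above describes 4-bit weights with 16-bit activations. As an illustrative sketch of the underlying idea only (a group-wise symmetric quantizer, not vLLM's kernel implementation):

```python
# Minimal sketch of group-wise symmetric INT4 weight quantization
# (the W4A16 idea: 4-bit weights, activations kept in 16-bit).
# Pure-Python illustration, not vLLM's actual implementation.

def quantize_group(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Quantize one weight group to signed ints in [-2**(bits-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

group = [0.12, -0.55, 0.30, 0.91, -0.04, 0.77, -0.88, 0.20]
q, scale = quantize_group(group)
approx = dequantize_group(q, scale)
err = max(abs(a - b) for a, b in zip(group, approx))
```

Each group stores 4-bit integers plus one scale, so the per-group rounding error is bounded by half the scale; smaller group sizes trade a little extra scale storage for better accuracy.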
@@ -1,6 +1,4 @@
----
-title: INT8 W8A8
----
+# INT8 W8A8
 
 vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
 This quantization method is particularly useful for reducing model size while maintaining good performance.
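The W8A8 page above quantizes both weights and activations. A hedged sketch of why that enables acceleration (the matmul accumulates in integers, with one rescale at the end); this is illustrative arithmetic, not vLLM code:

```python
# Sketch of the W8A8 idea: weights AND activations are quantized to INT8,
# the dot product runs on integers, and a single combined scale restores
# the floating-point result. Pure-Python illustration only.

def q8(xs: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor INT8 quantization: returns (int values, scale)."""
    scale = max(abs(x) for x in xs) / 127 or 1.0
    return [round(x / scale) for x in xs], scale

w = [0.4, -1.2, 0.9]       # toy weights
a = [2.0, 0.5, -1.5]       # toy activations
qw, sw = q8(w)
qa, sa = q8(a)

int_dot = sum(wi * ai for wi, ai in zip(qw, qa))   # integer-only accumulate
approx = int_dot * sw * sa                          # rescale once at the end
exact = sum(wi * ai for wi, ai in zip(w, a))
```

Because the inner loop is pure integer multiply-accumulate, it can map onto fast INT8 tensor-core paths; the scales `sw * sa` are applied once per output.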
@@ -1,6 +1,4 @@
----
-title: Quantized KV Cache
----
+# Quantized KV Cache
 
 ## FP8 KV Cache
 
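To motivate the FP8 KV cache section above: KV-cache memory grows linearly with context length, so halving bytes per element halves it. A rough calculation using assumed Llama-2-7B-like shapes (32 layers, 32 KV heads, head dim 128 — check your model's config for real values):

```python
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
# Illustrative numbers only; model shapes here are assumptions.

def kv_cache_gib(seq_len: int, layers: int = 32, kv_heads: int = 32,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token / 2**30

fp16 = kv_cache_gib(4096, bytes_per_elem=2)   # FP16 cache
fp8 = kv_cache_gib(4096, bytes_per_elem=1)    # FP8 cache
print(f"4k context: {fp16:.1f} GiB (FP16) vs {fp8:.1f} GiB (FP8)")
```

For this shape a 4k-token context costs 2 GiB in FP16 and 1 GiB in FP8, which is memory that can instead hold more concurrent sequences.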
@@ -1,6 +1,4 @@
----
-title: AMD Quark
----
+# AMD Quark
 
 Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
 throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
@@ -1,6 +1,4 @@
----
-title: Supported Hardware
----
+# Supported Hardware
 
 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
 