[CI/Build] Auto-fix Markdown files (#12941)

This commit is contained in:
Cyrus Leung
2025-02-08 20:25:15 +08:00
committed by GitHub
parent 4c8dd12ef3
commit 8a69e0e20e
20 changed files with 158 additions and 141 deletions


@@ -1,17 +1,19 @@
# CUTLASS Epilogues
## Introduction
This document describes the various CUTLASS epilogues implemented for fusing de-quantization operations onto GEMMs.
Currently, we only support symmetric quantization for weights,
and symmetric and asymmetric quantization for activations.
Both can be quantized per-tensor or per-channel (weights) / per-token (activations).
There are 4 epilogues:
1. `ScaledEpilogue`: symmetric quantization for activations, no bias.
1. `ScaledEpilogueBias`: symmetric quantization for activations, supports bias.
1. `ScaledEpilogueAzp`: asymmetric per-tensor quantization for activations, supports bias.
1. `ScaledEpilogueAzpPerToken`: asymmetric per-token quantization for activations, supports bias.
To reduce the final binary size, we do not provide epilogues for asymmetric quantization of activations without bias.
Instead, if no bias is passed, the epilogue uses 0 as the bias.
@@ -26,12 +28,15 @@ If $` \widehat X `$ is the quantized $` X `$, our matrices become the following
```math
A = s_a (\widehat A - J_a z_a)
```
```math
B = s_b \widehat B
```
```math
D = A B + C
```
```math
D = s_a s_b \widehat D + C
```
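As a quick sanity check, the identity above can be verified numerically. The following sketch (plain `torch`, with illustrative shapes and names, not the kernel API) reproduces $` D = s_a s_b \widehat D + C `$ for per-tensor scales:
```python
import torch

M, N, K = 4, 8, 16
s_a, z_a = 0.05, 3.0  # per-tensor activation scale and zero-point
s_b = 0.02            # per-tensor weight scale (symmetric)

A_hat = torch.randint(-128, 128, (M, K)).float()  # quantized activations
B_hat = torch.randint(-128, 128, (K, N)).float()  # quantized weights
C = torch.randn(M, N)                             # bias

# Dequantize, then GEMM: D = A @ B + C
A = s_a * (A_hat - z_a)  # J_a z_a is just z_a broadcast across A_hat
B = s_b * B_hat
D = A @ B + C

# Fused form: D = s_a s_b D_hat + C, where the zero-point adjustment
# z_a J_a B_hat has z_a times the column sums of B_hat in every row.
D_hat = A_hat @ B_hat - z_a * B_hat.sum(dim=0, keepdim=True)
assert torch.allclose(D, s_a * s_b * D_hat + C, atol=1e-2)
```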
@@ -48,9 +53,11 @@ Expanding further, we can calculate $` \widehat D `$ as follows:
```math
A B = s_a ( \widehat A - J_a z_a ) s_b \widehat B
```
```math
A B = s_a s_b \left( \widehat A \widehat B - J_a z_a \widehat B \right)
```
```math
\widehat D = \widehat A \widehat B - z_a J_a \widehat B
```
@@ -61,16 +68,19 @@ Each row of it is equal to $` \mathbf 1 \widehat B `$, which is a row-vector of
## Epilogues
### `ScaledEpilogue`
This epilogue computes the symmetric quantization for activations without bias, meaning $` C = 0 `$ and $` z_a = 0 `$.
The output of the GEMM is:
```math
\widehat D = \widehat A \widehat B
```
```math
D = s_a s_b \widehat D
```
```math
D = s_a s_b \widehat A \widehat B
```
@@ -79,44 +89,51 @@ Epilogue parameters:
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
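For reference, these epilogues are typically reached through the `cutlass_scaled_mm` custom op rather than called directly. A hedged usage sketch follows; the wrapper name and exact signature may differ across vLLM versions, so treat this as illustrative:
```python
import torch
from vllm import _custom_ops as ops

M, N, K = 16, 32, 64
a = torch.randint(-128, 128, (M, K), dtype=torch.int8, device="cuda")
# The kernel expects the weight matrix in column-major layout.
b = torch.randint(-128, 128, (K, N), dtype=torch.int8, device="cuda").t().contiguous().t()
scale_a = torch.rand(M, 1, dtype=torch.float32, device="cuda")  # per-token
scale_b = torch.rand(1, N, dtype=torch.float32, device="cuda")  # per-channel

# No bias selects ScaledEpilogue; passing bias= selects ScaledEpilogueBias.
out = ops.cutlass_scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
```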
### `ScaledEpilogueBias`
This epilogue computes the symmetric quantization for activations with bias, meaning $` z_a = 0 `$.
The output of the GEMM is:
```math
\widehat D = \widehat A \widehat B
```
```math
D = s_a s_b \widehat D + C
```
```math
D = s_a s_b \widehat A \widehat B + C
```
Epilogue parameters:
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
- `bias` is the bias; it is always per-channel (row-vector).
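In plain `torch`, the computation this epilogue performs can be sketched as follows (names are illustrative):
```python
def scaled_mm_bias_ref(a_hat, b_hat, scale_a, scale_b, bias):
    # scale_a: (M, 1) or scalar; scale_b: (1, N) or scalar; bias: (1, N)
    return scale_a * scale_b * (a_hat.float() @ b_hat.float()) + bias
```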
### `ScaledEpilogueAzp`
This epilogue computes the asymmetric per-tensor quantization for activations with bias.
The output of the GEMM is:
```math
\widehat D = \widehat A \widehat B - z_a J_a \widehat B
```
```math
D = s_a s_b \widehat D + C
```
```math
D = s_a s_b \left( \widehat A \widehat B - z_a J_a \widehat B \right) + C
```
Because $` z_a `$ is a scalar, the zero-point term $` z_a J_a \widehat B `$ has every row equal to $` z_a \mathbf 1 \widehat B `$.
That is precomputed and stored in `azp_with_adj` as a row-vector.
Epilogue parameters:
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- Generally this will be per-tensor as the zero-points are per-tensor.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
@@ -125,13 +142,15 @@ Epilogue parameters:
To use these kernels efficiently, users must precompute the `azp_with_adj` term offline and pass it to the kernel.
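A hedged sketch of that offline step in plain `torch` (illustrative names): since $` z_a `$ is a scalar, every row of $` z_a J_a \widehat B `$ equals $` z_a `$ times the column sums of $` \widehat B `$, so a single row-vector suffices.
```python
import torch

def compute_azp_with_adj(b_hat: torch.Tensor, z_a: float) -> torch.Tensor:
    """b_hat: (K, N) quantized weights; returns a (1, N) row-vector."""
    return z_a * b_hat.float().sum(dim=0, keepdim=True)
```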
### `ScaledEpilogueAzpPerToken`
This epilogue computes the asymmetric per-token quantization for activations with bias.
The output of the GEMM is the same as above, but $` z_a `$ is a column-vector.
That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product of $` z_a `$ and $` \mathbf 1 \widehat B `$.
Epilogue parameters:
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
- Generally this will be per-token as the zero-points are per-token.
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
@@ -142,6 +161,7 @@ Epilogue parameters:
To use these kernels efficiently, users must precompute the `azp_adj` term offline and pass it to the kernel.
The epilogue performs the following computation (where `Dq` is the raw quantized output of the GEMM):
```math
out = scale_a * scale_b * (Dq - azp_adj * azp) + bias
```
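Equivalently, in plain `torch` (illustrative names; `Dq` is the raw int32 GEMM output):
```python
def azp_per_token_ref(dq, scale_a, scale_b, azp, azp_adj, bias):
    # azp: (M, 1) per-token zero-points; azp_adj: (1, N) column sums of the
    # quantized weights; azp * azp_adj broadcasts to the (M, N) outer product.
    return scale_a * scale_b * (dq.float() - azp * azp_adj) + bias
```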


@@ -6,25 +6,25 @@ Machete is a spiritual successor to the Marlin kernel but optimized for Hopper a
Machete effectively performs
```python
scale_type = w_s.dtype
compute_type = a.dtype
out = (w_q.to(scale_type) * w_s - w_z.to(scale_type)) @ a
```
Where `w_q` is a quantized weight matrix, `w_s` is the quantization scales, and
`w_z` is the quantization zeropoints.
> **_NOTE:_** `w_z` is added after the scales so we can
use FMA operations, but this means they must have the scales pre-applied if the
supplied zeropoints assume that they will be subtracted before the scales are
applied.
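A minimal sketch of that pre-application, assuming zeropoints defined for the conventional `(w_q - w_z) * w_s` dequantization (names taken from the snippet above):
```python
# If the zeropoints assume they are subtracted before scaling, pre-apply the
# scales so that w_q * w_s - w_z_scaled == (w_q - w_z) * w_s.
w_z_scaled = w_z.to(w_s.dtype) * w_s
```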
## API
The main optimization within Machete is prepacking the weight matrix to more closely match the tensor core layouts, allowing for wider shared memory loads when loading the weight matrix. This means that the weight matrix must be prepacked before calling `machete_gemm`. The flow looks something like:
```python
from vllm import _custom_ops as ops
...
@@ -40,6 +40,6 @@ output = ops.machete_gemm(
## Code Generation
Since Machete is based on Cutlass, we can generate multiple type pairs and different tile shapes using the same kernel template. We generate multiple instantiations of this template using `generate.py`.
New type pairs (`TypeConfig`s) can be appended to `impl_configs` (in `generate()`), and these will get automatically generated (assuming they can be supported without issues). For each `TypeConfig`, you must also provide an `ImplConfig`, which bundles a `TypeConfig` with a list of `ScheduleConfig`s, `Specialization`s, and a default heuristic. The `ScheduleConfig`s (which contain info on tile shapes, tile scheduler, etc.) can perform differently for different problem shapes, and there is almost never one `ScheduleConfig` that works well for all problem shapes, so it is generally beneficial to generate different `ScheduleConfig`s for different potential problem shapes. This is where the heuristic comes in. For each `TypeConfig`, a default heuristic should be provided. This maps different problem shapes to different `ScheduleConfig`s and is used when the user does not provide the `schedule` parameter to `machete_gemm`. The `Specialization`s define what feature combinations to generate, i.e., `with_zeropoints`, `with_scales`, etc. We can reduce compile times and the final binary size by limiting the set of feature combinations we generate.
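For illustration only, appending a new type pair might look roughly like the following; the actual dataclass names, fields, and argument order in `generate.py` may differ, so treat every name here as hypothetical:
```python
impl_configs.append(
    ImplConfig(
        my_type_config,        # hypothetical TypeConfig for the new type pair
        my_schedule_configs,   # candidate ScheduleConfigs (tile shapes, etc.)
        my_specializations,    # feature combinations, e.g. with_zeropoints
        my_default_heuristic,  # maps problem shapes to ScheduleConfigs
    )
)
```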