[CI/Build] Auto-fix Markdown files (#12941)
This commit is contained in:
@@ -1,17 +1,19 @@
|
||||
# CUTLASS Epilogues
|
||||
|
||||
## Introduction
|
||||
This document describes the various CUTLASS epilogues implemented for fusing de-quantization operations onto GEMMs.
|
||||
|
||||
This document describes the various CUTLASS epilogues implemented for fusing de-quantization operations onto GEMMs.
|
||||
|
||||
Currently, we only support symmetric quantization for weights,
|
||||
and symmetric and asymmetric quantization for activations.
|
||||
Both can be quantized per-tensor or per-channel (weights) / per-token (activations).
|
||||
|
||||
There are 4 epilogues:
|
||||
1. ScaledEpilogue: symmetric quantization for activations, no bias.
|
||||
1. ScaledEpilogueBias: symmetric quantization for activations, supports bias.
|
||||
1. ScaledEpilogueAzp: asymmetric per-tensor quantization for activations, supports bias.
|
||||
1. ScaledEpilogueAzpPerToken: asymmetric per-token quantization for activations, supports bias.
|
||||
|
||||
1. `ScaledEpilogue`: symmetric quantization for activations, no bias.
|
||||
1. `ScaledEpilogueBias`: symmetric quantization for activations, supports bias.
|
||||
1. `ScaledEpilogueAzp`: asymmetric per-tensor quantization for activations, supports bias.
|
||||
1. `ScaledEpilogueAzpPerToken`: asymmetric per-token quantization for activations, supports bias.
|
||||
|
||||
We do not have epilogues for asymmetric quantization of activations without bias in order to reduce final binary size.
|
||||
Instead, if no bias is passed, the epilogue will use 0 as the bias.
|
||||
@@ -26,12 +28,15 @@ If $` \widehat X `$ is the quantized $` X `$, our matrices become the following
|
||||
```math
|
||||
A = s_a (\widehat A - J_a z_a)
|
||||
```
|
||||
|
||||
```math
|
||||
B = s_b \widehat B
|
||||
```
|
||||
|
||||
```math
|
||||
D = A B + C
|
||||
```
|
||||
|
||||
```math
|
||||
D = s_a s_b \widehat D + C
|
||||
```
|
||||
@@ -48,9 +53,11 @@ Expanding further, we can calculate $` \widehat D `$ as follows:
|
||||
```math
|
||||
A B = s_a ( \widehat A - J_a z_a ) s_b \widehat B
|
||||
```
|
||||
|
||||
```math
|
||||
A B = s_a s_b \left( \widehat A \widehat B - J_a z_a \widehat B \right)
|
||||
```
|
||||
|
||||
```math
|
||||
\widehat D = \widehat A \widehat B - z_a J_a \widehat B
|
||||
```
|
||||
@@ -61,16 +68,19 @@ Each row of it is equal to $` \mathbf 1 \widehat B `$, which is a row-vector of
|
||||
|
||||
## Epilogues
|
||||
|
||||
### ScaledEpilogue
|
||||
### `ScaledEpilogue`
|
||||
|
||||
This epilogue computes the symmetric quantization for activations without bias, meaning $` C = 0 `$ and $` z_a = 0 `$.
|
||||
The output of the GEMM is:
|
||||
|
||||
```math
|
||||
\widehat D = \widehat A \widehat B
|
||||
```
|
||||
|
||||
```math
|
||||
D = s_a s_b \widehat D
|
||||
```
|
||||
|
||||
```math
|
||||
D = s_a s_b \widehat A \widehat B
|
||||
```
|
||||
@@ -79,44 +89,51 @@ Epilogue parameters:
|
||||
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
|
||||
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
|
||||
|
||||
### ScaledEpilogueBias
|
||||
### `ScaledEpilogueBias`
|
||||
|
||||
This epilogue computes the symmetric quantization for activations with bias, meaning $` z_a = 0 `$.
|
||||
The output of the GEMM is:
|
||||
|
||||
```math
|
||||
\widehat D = \widehat A \widehat B
|
||||
```
|
||||
|
||||
```math
|
||||
D = s_a s_b \widehat D + C
|
||||
```
|
||||
|
||||
```math
|
||||
D = s_a s_b \widehat A \widehat B + C
|
||||
```
|
||||
|
||||
|
||||
Epilogue parameters:
|
||||
|
||||
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
|
||||
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
|
||||
- `bias` is the bias, is always per-channel (row-vector).
|
||||
|
||||
### ScaledEpilogueAzp
|
||||
### `ScaledEpilogueAzp`
|
||||
|
||||
This epilogue computes the asymmetric per-tensor quantization for activations with bias.
|
||||
The output of the GEMM is:
|
||||
|
||||
```math
|
||||
\widehat D = \widehat A \widehat B - z_a J_a \widehat B
|
||||
```
|
||||
|
||||
```math
|
||||
D = s_a s_b \widehat D + C
|
||||
```
|
||||
|
||||
```math
|
||||
D = s_a s_b \left( \widehat A \widehat B - z_a J_a \widehat B \right) + C
|
||||
```
|
||||
|
||||
Because $` z_a `$ is a scalar, the zero-point term $` z_a J_a \widehat B `$ has every row equal to $` z_a \mathbf 1 B `$.
|
||||
Because $` z_a `$ is a scalar, the zero-point term $` z_a J_a \widehat B `$ has every row equal to $` z_a \mathbf 1 B `$.
|
||||
That is precomputed and stored in `azp_with_adj` as a row-vector.
|
||||
|
||||
Epilogue parameters:
|
||||
|
||||
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
|
||||
- Generally this will be per-tensor as the zero-points are per-tensor.
|
||||
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
|
||||
@@ -125,13 +142,15 @@ Epilogue parameters:
|
||||
|
||||
To use these kernels efficiently, users must precompute the `azp_with_adj` term offline and pass it to the kernel.
|
||||
|
||||
### ScaledEpilogueAzpPerToken
|
||||
### `ScaledEpilogueAzpPerToken`
|
||||
|
||||
This epilogue computes the asymmetric per-token quantization for activations with bias.
|
||||
|
||||
The output of the GEMM is the same as above, but the $` z_a `$ is a column-vector.
|
||||
That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product of $` z_a `$ and $` \mathbf 1 \widehat B `$.
|
||||
|
||||
Epilogue parameters:
|
||||
|
||||
- `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector).
|
||||
- Generally this will be per-token as the zero-points are per-token.
|
||||
- `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector).
|
||||
@@ -142,6 +161,7 @@ Epilogue parameters:
|
||||
To use these kernels efficiently, users must precompute the `azp_adj` term offline and pass it to the kernel.
|
||||
|
||||
The epilogue performs the following computation (where `Dq` is the raw quantized output of the GEMM):
|
||||
```
|
||||
|
||||
```math
|
||||
out = scale_a * scale_b * (Dq - azp_adj * azp) + bias
|
||||
```
|
||||
|
||||
Reference in New Issue
Block a user