Update README

2025-09-29 17:10:12 +08:00
parent 59f2c07cf2
commit 0ed3b949d0
1 changed file with 41 additions and 14 deletions
--- a/README.md
+++ b/README.md
@@ -8,11 +8,13 @@ Despite its lightweight design, DeepGEMM's performance matches or exceeds expert

 ## News

+- 2025.09.28: DeepGEMM now supports scoring kernels (weighted ReLU MQA logits) for the lightning indexer for DeepSeek v3.2.
+  - Please see [#200](https://github.com/deepseek-ai/DeepGEMM/pull/200) for more details.
 - 2025.07.20: DeepGEMM now supports both SM90/SM100, and has a full refactor with a low-CPU-overhead JIT CPP module.
-  - NVRTC and post-compilation SASS optimization are all disabled
-  - NVRTC will be supported later
-  - As NVCC 12.9 will automatically do the FFMA interleaving, all post optimizations will be no longer supported
-  - Please see [#112](https://github.com/deepseek-ai/DeepGEMM/pull/112) for more details
+  - NVRTC and post-compilation SASS optimization are both disabled.
+  - NVRTC will be supported later.
+  - As NVCC 12.9 automatically performs FFMA interleaving, post-compilation optimizations will no longer be supported.
+  - Please see [#112](https://github.com/deepseek-ai/DeepGEMM/pull/112) for more details.
 - 2025.05.14: DeepGEMM now offers weight gradient kernels for dense and MoE backward! See [#95](https://github.com/deepseek-ai/DeepGEMM/pull/95) for details.
 - 2025.05.07: DeepGEMM now supports NVRTC with up to 10x compilation speedup! See [#94](https://github.com/deepseek-ai/DeepGEMM/pull/94) for details. Please use `DG_JIT_USE_NVRTC=1` to enable it (may have performance loss with some cases).
 - 2025.04.18: DeepGEMM now achieves up to **1550 TFLOPS** on H800! See [#74](https://github.com/deepseek-ai/DeepGEMM/pull/74), [#78](https://github.com/deepseek-ai/DeepGEMM/pull/78), [#81](https://github.com/deepseek-ai/DeepGEMM/pull/81), [#86](https://github.com/deepseek-ai/DeepGEMM/pull/86) and [340d988](https://github.com/deepseek-ai/DeepGEMM/commit/340d9880f4a418d943d34260d20a79f41f4c0526) for details.
@@ -46,9 +48,9 @@ Despite its lightweight design, DeepGEMM's performance matches or exceeds expert
 - Python 3.8 or higher
 - Compilers with C++20 support
 - CUDA Toolkit:
-    - CUDA 12.3 or higher for SM90
-        - **We highly recommend 12.9 or higher for the best performance**
-    - CUDA 12.9 or higher for SM100
+  - CUDA 12.3 or higher for SM90
+    - **We highly recommend 12.9 or higher for the best performance**
+  - CUDA 12.9 or higher for SM100
 - PyTorch 2.1 or higher
 - CUTLASS 4.0 or higher (could be cloned by Git submodule)
 - `{fmt}` library (could be cloned by Git submodule)
@@ -66,6 +68,7 @@ cat develop.sh

 # Test all GEMM implements
 python tests/test_layout.py
+python tests/test_attention.py
 python tests/test_bf16.py
 python tests/test_fp8.py
 python tests/test_lazy_init.py
@@ -109,6 +112,30 @@ During the inference decoding phase, when CUDA graph is enabled and the CPU is u

 Use `m_grouped_fp8_gemm_nt_masked` for this purpose and consult the relevant documentation. An example usage is to use the output of low-latency kernels from [DeepEP](https://github.com/deepseek-ai/DeepEP) as input.

+#### V3.2 MQA kernels for the indexer
+
+The kernel family has two versions, non-paged (for prefilling) and paged (for decoding).
+Take the non-paged version `fp8_mqa_logits` as an example. It has 6 inputs:
+
+- `q`, an E4M3 tensor with shape `[seq_len, num_heads, head_dim]`
+- `kv`, a pair of an E4M3 tensor with shape `[seq_len_kv, head_dim]` and its float scaling factors (SF) with shape `[seq_len_kv]`
+- `weights`, a float tensor with shape `[seq_len, num_heads]`
+- `cu_seq_len_k_start` and `cu_seq_len_k_end`, int tensors, each with shape `[seq_len]`
+- `clean_logits`, whether to fill the unfilled logits with `-inf`
+
+The output tensor is shaped as `[seq_len, seq_len_kv]`, indicating token-to-token logits.
+For each token `i` in `q`, the kernel iterates over all tokens `j` in `[cu_seq_len_k_start[i], cu_seq_len_k_end[i])`
+and calculates the logit `out[i, j]` as:
+
+```python
+kv_j = kv[0][j, :] * kv[1][j]  # [head_dim], row j scaled by its scalar SF
+out_ij = q[i, :, :] @ kv_j  # [num_heads]
+out_ij = out_ij.relu() * weights[i, :]  # [num_heads]
+out_ij = out_ij.sum()  # Scalar
+```
+
+For more details and the paged version `fp8_paged_mqa_logits`, please refer to `tests/test_attention.py`.
+
 #### Utilities

 The library provides some utility functions besides the above kernels:
@@ -127,17 +154,17 @@ The library provides some utility functions besides the above kernels:
 The library also provides some environment variables, which may be useful:

 - General
-    - `DG_JIT_DEBUG`: `0` or `1`, print more JIT debugging information, `0` by default
+  - `DG_JIT_DEBUG`: `0` or `1`, print more JIT debugging information, `0` by default
 - JIT cache related
-    - `DG_JIT_CACHE_DIR`: string, the cache directory to store compiled kernels, `$HOME/.deep_gemm` by default
+  - `DG_JIT_CACHE_DIR`: string, the cache directory to store compiled kernels, `$HOME/.deep_gemm` by default
 - NVCC/NVRTC selections
-    - `DG_JIT_USE_NVRTC`: `0` or `1`, use NVRTC instead of NVCC, faster compilation but maybe have lower performance for some cases, `0` by default
-    - `DG_JIT_NVCC_COMPILER`: string, specified NVCC compiler path; will find in `torch.utils.cpp_extension.CUDA_HOME` by default
+  - `DG_JIT_USE_NVRTC`: `0` or `1`, use NVRTC instead of NVCC, faster compilation but possibly lower performance in some cases, `0` by default
+  - `DG_JIT_NVCC_COMPILER`: string, specified NVCC compiler path; will find in `torch.utils.cpp_extension.CUDA_HOME` by default
 - Compiler options
-    - `DG_JIT_PTXAS_VERBOSE`: `0` or `1`, show detailed PTXAS compiler output, `0` by default
-    - `DG_JIT_PRINT_COMPILER_COMMAND`: `0` or `1`, print NVCC compilation command, `0` by default
+  - `DG_JIT_PTXAS_VERBOSE`: `0` or `1`, show detailed PTXAS compiler output, `0` by default
+  - `DG_JIT_PRINT_COMPILER_COMMAND`: `0` or `1`, print NVCC compilation command, `0` by default
 - Heuristic selection
-    - `DG_PRINT_CONFIGS`: `0` or `1`, print selected configs for each shape, `0` by default
+  - `DG_PRINT_CONFIGS`: `0` or `1`, print selected configs for each shape, `0` by default

 For additional examples and details, please refer to [the test code](tests/test_core.py) or review the corresponding Python documentation.
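
The per-pair logit formula in the "V3.2 MQA kernels for the indexer" section added by this commit can be sketched as a vectorized PyTorch reference. This is a minimal sketch and not DeepGEMM's API: the function name `ref_mqa_logits` and its argument names are hypothetical, it runs in float32 rather than E4M3, and it assumes that "unfilled" means every `j` outside `[cu_seq_len_k_start[i], cu_seq_len_k_end[i])`.

```python
import torch


def ref_mqa_logits(q, kv, kv_sf, weights, k_start, k_end, clean_logits=True):
    """Float reference of the weighted-ReLU MQA logits (hypothetical helper).

    q:        [seq_len, num_heads, head_dim]
    kv:       [seq_len_kv, head_dim]
    kv_sf:    [seq_len_kv], per-token scaling factors
    weights:  [seq_len, num_heads]
    k_start, k_end: [seq_len], integer bounds of the valid KV range per query
    """
    seq_len_kv = kv.shape[0]
    # Dequantize: scale each KV row by its per-token factor.
    kv_deq = kv.float() * kv_sf.float().unsqueeze(1)  # [seq_len_kv, head_dim]
    # scores[i, h, j] = dot(q[i, h, :], kv_deq[j, :])
    scores = torch.einsum("ihd,jd->ihj", q.float(), kv_deq)
    # ReLU, weight each head, then sum over heads.
    logits = (scores.relu() * weights.float().unsqueeze(-1)).sum(dim=1)
    if clean_logits:
        # Assumption: entries outside [k_start[i], k_end[i]) count as unfilled.
        j = torch.arange(seq_len_kv).unsqueeze(0)
        valid = (j >= k_start.unsqueeze(1)) & (j < k_end.unsqueeze(1))
        logits = logits.masked_fill(~valid, float("-inf"))
    return logits  # [seq_len, seq_len_kv]
```

Unlike a real kernel, this sketch computes every `(i, j)` pair and masks afterwards, so with `clean_logits=False` it returns defined values where a kernel may leave the output untouched. It is meant only for checking the formula on small CPU tensors.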