Update B1 docs with test results and bug fix
@@ -1,44 +1,39 @@
# B1 Mixed FP8/BF16 FMHA first pass

Implemented a decode-only DeepSeek-V4 attention path that keeps the cache in the paper/native storage format:
# B1 Mixed FP8/BF16 FMHA — DONE ✅
Implementation of storage-native DeepSeek-V4 attention that keeps KV in the paper format:
- noPE KV: FP8_E4M3 bytes plus per-row FP32 scale
- RoPE KV: BF16
- Q noPE: quantized BF16 -> FP8_E4M3 immediately before FMHA
- Q noPE: quantized BF16 → FP8_E4M3 immediately before FMHA
- Q RoPE: BF16

The live `forward_attention` path now gathers compressed rows and the SWA tail into mixed buffers and calls `dsv4_attention_mixed_fp8_decode`; it no longer dequantizes noPE KV into `gather_buf` before attention.
The live `forward_attention` path gathers compressed rows and the SWA tail into mixed buffers and calls `dsv4_attention_mixed_fp8_decode`; it no longer dequantizes noPE KV into `gather_buf` before attention.
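
The noPE storage format listed above (FP8_E4M3 values plus one FP32 scale per row) can be illustrated with a small pure-Python emulation. This is only a sketch under stated assumptions: the scale rule `amax / 448` and the helper names `round_to_e4m3`, `quantize_row`, and `dequantize_row` are hypothetical and are not taken from `fp8_attention_io.cu`.

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0.0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with m in [0.5, 1), so floor(log2(a)) = ex - 1.
    # Values below 2**-6 are subnormal and share the exponent -6.
    exponent = max(math.frexp(a)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_row(row):
    """Return (e4m3_values, scale) using one scale for the whole row."""
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0.0 else 1.0  # assumed scale rule
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize_row(values, scale):
    """Undo the per-row scaling."""
    return [v * scale for v in values]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# One synthetic 448-wide noPE row, quantized and restored.
row = [math.sin(0.37 * i) * (1.0 + 0.01 * i) for i in range(448)]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)
print(f"scale={scale:.6f} cos={cosine(row, restored):.6f}")
```

Because E4M3 keeps three mantissa bits, each normal value carries a relative rounding error of at most 1/16, which is why a round-trip cosine close to, but below, 1.0 is expected.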
## New files

- `dsv4/kernels/cuda/fp8_attention_io.cu`
  - `quantize_q_fp8_split`
  - `gather_mixed_selective_`
  - `gather_mixed_all_`
  - `gather_mixed_swa_only_`
- `dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh`
  - decode kernel, specialized for `HD=512`, `NOPE=448`, `ROPE=64`
- `dsv4/kernels/attention/fmha_mixed_fp8_capi.cu`
  - C ABI launcher
- `dsv4/kernels/attention/fmha_mixed_fp8_op.py`
  - Python ctypes/nvcc bridge
- `dsv4/kernels/cuda/fp8_attention_io.cu` — quantize_q_fp8_split, gather_mixed_{selective,all,swa_only}
- `dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh` — decode kernel, HD=512/NOPE=448/ROPE=64
- `dsv4/kernels/attention/fmha_mixed_fp8_capi.cu` — C ABI launcher
- `dsv4/kernels/attention/fmha_mixed_fp8_op.py` — Python ctypes/nvcc bridge
## Modified files
## Unit Test

- `dsv4/kernels/attention/fmha_umma_desc.cuh`
  - added `.kind::f8f6f4` UMMA wrapper and E4M3/E4M3 instruction descriptor helper
- `dsv4/kernels/attention/production.py`
  - added `dsv4_attention_mixed_fp8_decode`
- `dsv4/kernels/attention/__init__.py`
  - exported mixed FP8 API
- `single_shot_inference.py`
  - added mixed gather buffers/methods to `KVCache`
  - changed step 5 gather to preserve FP8 noPE globally
  - changed step 6 FMHA to call the mixed FP8 decode path
`tests/unit/test_b1_mixed_fp8_fmha.py` — comprehensive test at production values (HD=512, H=128, N=128..2048):
1. `quantize_q_fp8_split` round-trip: cos=0.9997
2. `gather_mixed` kernels: exact copy for compressed rows, cos=0.9997 for SWA quantization
3. FMHA decode cosine vs FP32 SDPA: cos=0.999972 (N=128) to cos=0.999923 (N=2048)
4. Attention sink bias: verified effect on output
5. GQA/MQA with 128 Q heads: verified output magnitudes
6. Weight loading dtype/shape verification
7. Batch sizes B=1,2,4
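
The cosine figures above compare kernel output against a reference SDPA. A minimal sketch of that kind of check, for a single head and a single decode query, might look like the following. It assumes plain softmax attention with a `1/sqrt(d)` scale and no sink bias, runs in Python floats (FP64, not FP32), and uses hypothetical helper names; it is not the code in `tests/unit/test_b1_mixed_fp8_fmha.py`.

```python
import math


def sdpa_decode(q, keys, values):
    """Reference softmax attention for one query vector (decode, T == 1)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# Synthetic inputs: 16 cached positions, head dimension 8.
q = [math.sin(0.5 * (d + 1)) for d in range(8)]
keys = [[math.cos(0.1 * (n + 1) * (d + 1)) for d in range(8)] for n in range(16)]
values = [[math.sin(0.2 * (n + 1) + d) for d in range(8)] for n in range(16)]

reference = sdpa_decode(q, keys, values)

# Stand-in for a lower-precision kernel: the same math on coarsely rounded K/V.
candidate = sdpa_decode(
    q,
    [[round(x, 2) for x in k] for k in keys],
    [[round(x, 2) for x in v] for v in values],
)
print(f"cos={cosine(reference, candidate):.6f}")
```

Cosine similarity is insensitive to a global scale error, so a check like this is usually paired with a magnitude comparison, which matches the "verified output magnitudes" item in the list.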
## Intentional first-pass limits
## Bug Fix: V matrix canonical layout (commit 4fe7f9d)

- Decode only (`T == 1`). The launcher hard-errors for prefill.
- Specialized to DeepSeek-V4 attention dimensions (`512/448/64`).

The call `canon_idx_bf16_16x16(kk, dd)` passed its arguments in the wrong order; the correct call is `canon_idx_bf16_16x16(dd, kk)`.
The swapped call produced cos=0.158 against the BF16 reference; after the fix, cos=0.999972.
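
To see why a swapped index call is so damaging, consider a toy 16x16 tile. The sketch below is illustrative only: `toy_idx` is a hypothetical row-major stand-in and does not reproduce the real `canon_idx_bf16_16x16` layout. It shows that writing with swapped arguments makes the consumer read the transpose of the intended tile.

```python
TILE = 16


def toy_idx(row, col):
    """Hypothetical row-major index for a 16x16 tile (not the real canonical layout)."""
    return row * TILE + col


def store_tile(v, swapped):
    """Write V[kk][dd] into a flat buffer; `swapped` reproduces the argument-order bug."""
    buf = [0.0] * (TILE * TILE)
    for kk in range(TILE):
        for dd in range(TILE):
            idx = toy_idx(kk, dd) if swapped else toy_idx(dd, kk)
            buf[idx] = v[kk][dd]
    return buf


def load_tile(buf):
    """Consumer side: always expects element (kk, dd) at toy_idx(dd, kk)."""
    return [[buf[toy_idx(dd, kk)] for dd in range(TILE)] for kk in range(TILE)]


v = [[float(kk * TILE + dd) for dd in range(TILE)] for kk in range(TILE)]
good = load_tile(store_tile(v, swapped=False))
bad = load_tile(store_tile(v, swapped=True))
transposed = [list(col) for col in zip(*v)]

print(good == v, bad == transposed)  # prints: True True
```

Every element is still present in the buffer, so shapes and value ranges look healthy; only a numerical comparison against a reference exposes the error.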
## Known Limitations

- **Decode only (T==1)**. The launcher hard-errors for multi-token prefill calls, so prefill runs one token at a time.
- Specialized to DSV4 attention dimensions (HD=512/NOPE=448/ROPE=64).
- noPE QK uses Blackwell FP8 tensor cores; RoPE QK and PV use BF16 tensor cores.
- noPE V is dequantized only inside shared memory immediately before the PV BF16 tensor-core multiply. There is no global BF16 KV staging.
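
The first two limits can be expressed as a caller-side guard plus a token-by-token prefill loop. This is a sketch only: `check_decode_shapes`, `prefill_token_by_token`, and the `decode_step` callback are hypothetical names, and the real checks live in the C ABI launcher, not in code like this.

```python
HEAD_DIM, NOPE_DIM, ROPE_DIM = 512, 448, 64  # dimensions named in this document


def check_decode_shapes(num_query_tokens, head_dim, nope_dim, rope_dim):
    """Reject calls the specialized decode kernel does not cover."""
    if num_query_tokens != 1:
        raise ValueError(f"decode only: expected T == 1, got T == {num_query_tokens}")
    if (head_dim, nope_dim, rope_dim) != (HEAD_DIM, NOPE_DIM, ROPE_DIM):
        raise ValueError(
            f"unsupported dims {head_dim}/{nope_dim}/{rope_dim}; "
            f"expected {HEAD_DIM}/{NOPE_DIM}/{ROPE_DIM}"
        )


def prefill_token_by_token(tokens, decode_step):
    """Run prefill as repeated single-token decode calls (hypothetical driver)."""
    outputs = []
    for position, token in enumerate(tokens):
        check_decode_shapes(1, HEAD_DIM, NOPE_DIM, ROPE_DIM)
        outputs.append(decode_step(token, position))
    return outputs


# Toy decode step that just records what it was called with.
calls = prefill_token_by_token([11, 22, 33], lambda tok, pos: (pos, tok))
print(calls)  # prints: [(0, 11), (1, 22), (2, 33)]
```

Prefill cost under this scheme grows with one kernel launch per prompt token, which is the practical consequence of the decode-only limit.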