From 8df5de5477b7b5f801aaa603d441ebf6e24be5c2 Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Wed, 3 Jun 2026 01:50:59 +0000
Subject: [PATCH] Update B1 docs with test results and bug fix

---
 docs/B1_MIXED_FP8_FMHA.md | 55 ++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 30 deletions(-)

diff --git a/docs/B1_MIXED_FP8_FMHA.md b/docs/B1_MIXED_FP8_FMHA.md
index b358946d..2eaf1f03 100644
--- a/docs/B1_MIXED_FP8_FMHA.md
+++ b/docs/B1_MIXED_FP8_FMHA.md
@@ -1,44 +1,39 @@
-# B1 Mixed FP8/BF16 FMHA first pass
-
-Implemented a decode-only DeepSeek-V4 attention path that keeps the cache in the paper/native storage format:
+# B1 Mixed FP8/BF16 FMHA — DONE ✅
+Implementation of storage-native DeepSeek-V4 attention that keeps KV in the paper format:
 
 - noPE KV: FP8_E4M3 bytes plus per-row FP32 scale
 - RoPE KV: BF16
-- Q noPE: quantized BF16 -> FP8_E4M3 immediately before FMHA
+- Q noPE: quantized BF16 → FP8_E4M3 immediately before FMHA
 - Q RoPE: BF16
 
-The live `forward_attention` path now gathers compressed rows and the SWA tail into mixed buffers and calls `dsv4_attention_mixed_fp8_decode`; it no longer dequantizes noPE KV into `gather_buf` before attention.
+The live `forward_attention` path gathers compressed rows and the SWA tail into mixed buffers and calls `dsv4_attention_mixed_fp8_decode`; it no longer dequantizes noPE KV into `gather_buf` before attention.
 
 ## New files
 
-- `dsv4/kernels/cuda/fp8_attention_io.cu`
-  - `quantize_q_fp8_split`
-  - `gather_mixed_selective_`
-  - `gather_mixed_all_`
-  - `gather_mixed_swa_only_`
-- `dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh`
-  - decode kernel, specialized for `HD=512`, `NOPE=448`, `ROPE=64`
-- `dsv4/kernels/attention/fmha_mixed_fp8_capi.cu`
-  - C ABI launcher
-- `dsv4/kernels/attention/fmha_mixed_fp8_op.py`
-  - Python ctypes/nvcc bridge
+- `dsv4/kernels/cuda/fp8_attention_io.cu` — quantize_q_fp8_split, gather_mixed_{selective,all,swa_only}
+- `dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh` — decode kernel, HD=512/NOPE=448/ROPE=64
+- `dsv4/kernels/attention/fmha_mixed_fp8_capi.cu` — C ABI launcher
+- `dsv4/kernels/attention/fmha_mixed_fp8_op.py` — Python ctypes/nvcc bridge
 
-## Modified files
+## Unit Test
 
-- `dsv4/kernels/attention/fmha_umma_desc.cuh`
-  - added `.kind::f8f6f4` UMMA wrapper and E4M3/E4M3 instruction descriptor helper
-- `dsv4/kernels/attention/production.py`
-  - added `dsv4_attention_mixed_fp8_decode`
-- `dsv4/kernels/attention/__init__.py`
-  - exported mixed FP8 API
-- `single_shot_inference.py`
-  - added mixed gather buffers/methods to `KVCache`
-  - changed step 5 gather to preserve FP8 noPE globally
-  - changed step 6 FMHA to call the mixed FP8 decode path
+`tests/unit/test_b1_mixed_fp8_fmha.py` — comprehensive test at production values (HD=512, H=128, N=128..2048):
+1. quantize_q_fp8_split round-trip: cos=0.9997
+2. gather_mixed kernels: exact copy for compressed, cos=0.9997 for SWA quantization
+3. FMHA decode cosine vs FP32 SDPA: cos=0.999972 (N=128) to cos=0.999923 (N=2048)
+4. Attention sink bias: verified effect on output
+5. GQA/MQA with 128 Q heads: verified output magnitudes
+6. Weight loading dtype/shape verification
+7. Batch sizes B=1,2,4
 
-## Intentional first-pass limits
+## Bug Fix: V matrix canonical layout (commit 4fe7f9d)
 
-- Decode only (`T == 1`). The launcher hard-errors for prefill.
-- Specialized to DeepSeek-V4 attention dimensions (`512/448/64`).
+`canon_idx_bf16_16x16(kk, dd)` had arguments swapped. The correct call is `canon_idx_bf16_16x16(dd, kk)`.
+This produced cos=0.158 vs BF16 reference. After fix: cos=0.999972.
+
+## Known Limitations
+
+- **Decode only (T==1)**. The launcher hard-errors for prefill. Prefill runs one token at a time.
+- Specialized to DSV4 attention dimensions (HD=512/NOPE=448/ROPE=64).
 - noPE QK uses Blackwell FP8 tensor cores; RoPE QK and PV use BF16 tensor cores.
 - noPE V is dequantized only inside shared memory immediately before the PV BF16 tensor-core multiply. There is no global BF16 KV staging.
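
The `cos=` figures quoted in this patch read as cosine similarities between a flattened kernel output and a flattened reference output. The patch does not show how `tests/unit/test_b1_mixed_fp8_fmha.py` computes them, so the following is only a minimal sketch of that metric in plain Python. The function name `cosine_similarity` and the sample vectors are hypothetical and not taken from the repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats.

    Hypothetical helper: assumed to mirror the 'cos=' metric in the doc,
    not copied from the actual test file.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # fsum keeps the accumulation accurate for long vectors.
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative data only: a reference vector and a slightly perturbed copy.
reference = [0.5, -1.25, 2.0, 0.75]
candidate = [0.5, -1.25, 2.0, 0.70]
score = cosine_similarity(reference, candidate)
```

Under this reading, a score close to 1.0 means the two outputs point in nearly the same direction. A layout bug that scrambles elements, such as the swapped `canon_idx_bf16_16x16` arguments described above, would be expected to pull the score far below 1.0, which is consistent with the reported drop to cos=0.158.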