Update B1 docs with test results and bug fix

2026-06-03 01:50:59 +00:00
parent 3e3b352e7e
commit 8df5de5477
1 changed file with 25 additions and 30 deletions
--- a/docs/B1_MIXED_FP8_FMHA.md
+++ b/docs/B1_MIXED_FP8_FMHA.md
@@ -1,44 +1,39 @@
-# B1 Mixed FP8/BF16 FMHA first pass
-
-Implemented a decode-only DeepSeek-V4 attention path that keeps the cache in the paper/native storage format:
+# B1 Mixed FP8/BF16 FMHA — DONE ✅

+Implementation of storage-native DeepSeek-V4 attention that keeps KV in the paper format:
 - noPE KV: FP8_E4M3 bytes plus per-row FP32 scale
 - RoPE KV: BF16
-- Q noPE: quantized BF16 -> FP8_E4M3 immediately before FMHA
+- Q noPE: quantized BF16 → FP8_E4M3 immediately before FMHA
 - Q RoPE: BF16

-The live `forward_attention` path now gathers compressed rows and the SWA tail into mixed buffers and calls `dsv4_attention_mixed_fp8_decode`; it no longer dequantizes noPE KV into `gather_buf` before attention.
+The live `forward_attention` path gathers compressed rows and the SWA tail into mixed buffers and calls `dsv4_attention_mixed_fp8_decode`; it no longer dequantizes noPE KV into `gather_buf` before attention.

 ## New files

-- `dsv4/kernels/cuda/fp8_attention_io.cu`
-  - `quantize_q_fp8_split`
-  - `gather_mixed_selective_`
-  - `gather_mixed_all_`
-  - `gather_mixed_swa_only_`
-- `dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh`
-  - decode kernel, specialized for `HD=512`, `NOPE=448`, `ROPE=64`
-- `dsv4/kernels/attention/fmha_mixed_fp8_capi.cu`
-  - C ABI launcher
-- `dsv4/kernels/attention/fmha_mixed_fp8_op.py`
-  - Python ctypes/nvcc bridge
+- `dsv4/kernels/cuda/fp8_attention_io.cu` — quantize_q_fp8_split, gather_mixed_{selective,all,swa_only}
+- `dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh` — decode kernel, HD=512/NOPE=448/ROPE=64
+- `dsv4/kernels/attention/fmha_mixed_fp8_capi.cu` — C ABI launcher
+- `dsv4/kernels/attention/fmha_mixed_fp8_op.py` — Python ctypes/nvcc bridge

-## Modified files
+## Unit Test

-- `dsv4/kernels/attention/fmha_umma_desc.cuh`
-  - added `.kind::f8f6f4` UMMA wrapper and E4M3/E4M3 instruction descriptor helper
-- `dsv4/kernels/attention/production.py`
-  - added `dsv4_attention_mixed_fp8_decode`
-- `dsv4/kernels/attention/__init__.py`
-  - exported mixed FP8 API
-- `single_shot_inference.py`
-  - added mixed gather buffers/methods to `KVCache`
-  - changed step 5 gather to preserve FP8 noPE globally
-  - changed step 6 FMHA to call the mixed FP8 decode path
+`tests/unit/test_b1_mixed_fp8_fmha.py` — comprehensive test at production values (HD=512, H=128, N=128..2048):
+1. quantize_q_fp8_split round-trip: cos=0.9997
+2. gather_mixed kernels: exact copy for compressed, cos=0.9997 for SWA quantization
+3. FMHA decode cosine vs FP32 SDPA: cos=0.999972 (N=128) to cos=0.999923 (N=2048)
+4. Attention sink bias: verified effect on output
+5. GQA/MQA with 128 Q heads: verified output magnitudes
+6. Weight loading dtype/shape verification
+7. Batch sizes B=1,2,4

-## Intentional first-pass limits
+## Bug Fix: V matrix canonical layout (commit 4fe7f9d)

-- Decode only (`T == 1`). The launcher hard-errors for prefill.
-- Specialized to DeepSeek-V4 attention dimensions (`512/448/64`).
+The call `canon_idx_bf16_16x16(kk, dd)` had its arguments swapped; the correct call is `canon_idx_bf16_16x16(dd, kk)`.
+With the swapped arguments the output reached only cos=0.158 against the BF16 reference. After the fix: cos=0.999972.
+
+## Known Limitations
+
+- **Decode only (T==1)**. The launcher hard-errors for T > 1, so prefill runs one token at a time.
+- Specialized to DSV4 attention dimensions (HD=512/NOPE=448/ROPE=64).
 - noPE QK uses Blackwell FP8 tensor cores; RoPE QK and PV use BF16 tensor cores.
 - noPE V is dequantized only inside shared memory immediately before the PV BF16 tensor-core multiply. There is no global BF16 KV staging.
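
The round-trip cosine figures quoted for `quantize_q_fp8_split` and the SWA gather can be sanity-checked on the host with a small reference model. The sketch below is not the CUDA kernel: it assumes the per-row scale is chosen as `amax / 448` (the doc only states that a per-row FP32 scale exists), and the helper names `round_to_e4m3`, `quantize_row`, `dequantize_row`, and `cosine` are hypothetical, introduced here for illustration only.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3 (finite-only variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0.0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    # Exponents below -6 fall into the subnormal range, which shares one step size.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_row(row):
    """Quantize one row with a single scale (assumed here: amax / 448)."""
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0.0 else 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize_row(q, scale):
    """Undo the per-row scaling; the rounding error stays in the values."""
    return [v * scale for v in q]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One synthetic noPE-sized row (448 elements) of Gaussian data.
random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(448)]
q, scale = quantize_row(row)
cos = cosine(row, dequantize_row(q, scale))
print(f"round-trip cosine: {cos:.6f}")
```

With three mantissa bits the relative rounding error per element is at most about 6%, so a round-trip cosine slightly below 1.0 is expected from the number format alone; a value far below that, such as the 0.158 seen before the layout fix, points to an indexing or layout bug rather than quantization noise.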