B1 Mixed FP8/BF16 FMHA — DONE ✅
Implementation of storage-native DeepSeek-V4 attention that keeps KV in the paper format:
- noPE KV: FP8_E4M3 bytes plus per-row FP32 scale
- RoPE KV: BF16
- Q noPE: quantized BF16 → FP8_E4M3 immediately before FMHA
- Q RoPE: BF16
The live forward_attention path gathers compressed rows and the SWA tail into mixed buffers and calls dsv4_attention_mixed_fp8_decode; it no longer dequantizes noPE KV into gather_buf before attention.
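The noPE storage format listed above (FP8_E4M3 bytes plus a per-row FP32 scale) can be illustrated with a small software model. This is a minimal sketch, not the repository's kernel. The function names are hypothetical, the rounding is simulated with Python floats rather than real FP8 hardware, and choosing the scale as amax/448 is a common convention that is assumed here, not confirmed by the source.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 (fn) format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the E4M3 grid (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)       # saturate instead of overflowing
    _, e = math.frexp(mag)            # mag = m * 2**e with m in [0.5, 1)
    exp = max(e - 1, -6)              # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)           # spacing between neighbours at this exponent
    q = min(round(mag / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_row(row):
    """Return (fp8_values, scale) so that row[i] ~= fp8_values[i] * scale.

    Assumption: one scale per row, chosen so the row maximum maps to 448.
    """
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0.0 else 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize_row(q, scale):
    """Undo quantize_row up to E4M3 rounding error."""
    return [v * scale for v in q]
```

With 3 mantissa bits the relative rounding error per element is at most 1/16 in the normal range, which is why a round trip is close to the input but not exact.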
New files
- dsv4/kernels/cuda/fp8_attention_io.cu: quantize_q_fp8_split, gather_mixed_{selective,all,swa_only}
- dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh: decode kernel, HD=512/NOPE=448/ROPE=64
- dsv4/kernels/attention/fmha_mixed_fp8_capi.cu: C ABI launcher
- dsv4/kernels/attention/fmha_mixed_fp8_op.py: Python ctypes/nvcc bridge
Unit Test
tests/unit/test_b1_mixed_fp8_fmha.py — comprehensive test at production values (HD=512, H=128, N=128..2048):
- quantize_q_fp8_split round-trip: cos=0.9997
- gather_mixed kernels: exact copy for compressed, cos=0.9997 for SWA quantization
- FMHA decode cosine vs FP32 SDPA: cos=0.999972 (N=128) to cos=0.999923 (N=2048)
- Attention sink bias: verified effect on output
- GQA/MQA with 128 Q heads: verified output magnitudes
- Weight loading dtype/shape verification
- Batch sizes B=1,2,4
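The cos= figures in the list above are cosine similarities between a kernel output and a reference output, both flattened to vectors. A minimal sketch of that metric follows. The function name is hypothetical and the actual test may compute it with a tensor library instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores a uniform rescaling of either vector, so it catches layout and direction errors but not a global magnitude error. That is one plausible reason to check output magnitudes separately.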
Bug Fix: V matrix canonical layout (commit 4fe7f9d)
The V layout code called canon_idx_bf16_16x16(kk, dd) with its arguments swapped. The correct call is canon_idx_bf16_16x16(dd, kk).
With the swapped arguments the kernel reached only cos=0.158 vs the BF16 reference. After the fix: cos=0.999972.
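Why a swapped argument pair is so damaging can be seen with a toy model. The sketch below uses a hypothetical row-major tile_index as a stand-in. It is not the real canon_idx_bf16_16x16 layout, which the source does not describe, but any (row, col) index function has the same failure mode: swapping the arguments on the write side stores the transposed tile.

```python
TILE = 16


def tile_index(row, col):
    """Hypothetical row-major index into a flat 16x16 tile (stand-in layout)."""
    return row * TILE + col


def store_tile(values, swapped=False):
    """Write a 16x16 tile into a flat buffer, optionally with swapped index arguments."""
    flat = [0.0] * (TILE * TILE)
    for r in range(TILE):
        for c in range(TILE):
            idx = tile_index(c, r) if swapped else tile_index(r, c)
            flat[idx] = values[r][c]
    return flat


def load_tile(flat):
    """Read the flat buffer back with the correct argument order."""
    return [[flat[tile_index(r, c)] for c in range(TILE)] for r in range(TILE)]


values = [[float(r * TILE + c) for c in range(TILE)] for r in range(TILE)]
good = load_tile(store_tile(values))               # round-trips exactly
bad = load_tile(store_tile(values, swapped=True))  # comes back transposed
```

A transposed tile still contains the right set of numbers, so magnitudes look plausible while the direction of the result is wrong. That pattern is consistent with a low cosine against the reference.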
Known Limitations
- Decode only (T==1). The launcher hard-errors on any other T, so prefill has to run one token at a time.
- Specialized to DSV4 attention dimensions (HD=512/NOPE=448/ROPE=64).
- noPE QK uses Blackwell FP8 tensor cores; RoPE QK and PV use BF16 tensor cores.
- noPE V is dequantized only inside shared memory immediately before the PV BF16 tensor-core multiply. There is no global BF16 KV staging.
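The FP8/BF16 split in the list above implies that each QK logit is the sum of two partial dot products: a noPE part over 448 dimensions computed on FP8 values and rescaled, and a RoPE part over 64 dimensions computed directly. The sketch below shows that arithmetic for one query row and one key row. It assumes Q noPE also carries a per-row scale, which the source states only for KV, and it omits the softmax scaling factor. The function name is hypothetical.

```python
def mixed_logit(q_nope_fp8, q_scale, k_nope_fp8, k_scale, q_rope, k_rope):
    """QK logit for one (query, key) pair with a quantized noPE part.

    q_nope_fp8 / k_nope_fp8: noPE values as stored (already on the FP8 grid).
    q_scale / k_scale: per-row scales (the Q scale is an assumption).
    q_rope / k_rope: RoPE values kept in higher precision.
    """
    nope = sum(a * b for a, b in zip(q_nope_fp8, k_nope_fp8))
    rope = sum(a * b for a, b in zip(q_rope, k_rope))
    # Scales factor out of the dot product, so they are applied once per pair.
    return q_scale * k_scale * nope + rope
```

Because the scales are per row, they multiply the accumulated noPE dot product once instead of being applied to every element, which is what lets the noPE part stay in FP8 through the multiply.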