B1 Mixed FP8/BF16 FMHA — DONE ✅

Implementation of storage-native DeepSeek-V4 attention that keeps KV in the paper format:

noPE KV: FP8_E4M3 bytes plus per-row FP32 scale
RoPE KV: BF16
Q noPE: quantized BF16 → FP8_E4M3 immediately before FMHA
Q RoPE: BF16
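
The noPE storage format above can be illustrated with a minimal pure-Python sketch of per-row FP8_E4M3 quantization. It is not the project's CUDA kernel: the function names are hypothetical, and the `scale = amax / 448` convention is an assumption, since the note only states that each row carries an FP32 scale. Values are kept as Python floats rounded to the E4M3 grid rather than packed into bytes.

```python
import math
import struct

E4M3_MAX = 448.0     # largest finite FP8 E4M3 magnitude
E4M3_MIN_EXP = -6    # exponent of the smallest normal E4M3 value


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)              # mag = m * 2**e with m in [0.5, 1)
    exp = max(e - 1, E4M3_MIN_EXP)      # clamp into the subnormal range
    step = 2.0 ** (exp - 3)             # 3 mantissa bits: 8 steps per binade
    return sign * round(mag / step) * step   # round() is round-half-to-even


def quantize_row(row):
    """Assumed convention: map the row's largest magnitude onto E4M3_MAX."""
    amax = max(abs(v) for v in row)
    scale = to_fp32(amax / E4M3_MAX)
    if scale == 0.0:                    # all-zero (or vanishingly small) row
        scale = 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize_row(q, scale):
    return [v * scale for v in q]
```

With three mantissa bits, the relative rounding error of a normal value is at most 1/16, which is the order of error the cosine checks in the unit test have to tolerate.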

The live forward_attention path gathers compressed rows and the SWA tail into mixed buffers and calls dsv4_attention_mixed_fp8_decode; it no longer dequantizes noPE KV into gather_buf before attention.

New files

dsv4/kernels/cuda/fp8_attention_io.cu — quantize_q_fp8_split, gather_mixed_{selective,all,swa_only}
dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh — decode kernel, HD=512/NOPE=448/ROPE=64
dsv4/kernels/attention/fmha_mixed_fp8_capi.cu — C ABI launcher
dsv4/kernels/attention/fmha_mixed_fp8_op.py — Python ctypes/nvcc bridge

Unit Test

tests/unit/test_b1_mixed_fp8_fmha.py — comprehensive test at production values (HD=512, H=128, N=128..2048):

quantize_q_fp8_split round-trip: cos=0.9997
gather_mixed kernels: exact copy for compressed, cos=0.9997 for SWA quantization
FMHA decode cosine vs FP32 SDPA: cos=0.999972 (N=128) to cos=0.999923 (N=2048)
Attention sink bias: verified effect on output
GQA/MQA with 128 Q heads: verified output magnitudes
Weight loading dtype/shape verification
Batch sizes B=1,2,4
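
The `cos=` figures above are cosine similarities between a kernel output and a reference. A minimal sketch of that metric follows, assuming both tensors are flattened to 1-D sequences first; the function name is hypothetical and the actual test may compute it differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length flat vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Cosine similarity ignores overall scale: an output that is exactly twice the reference still scores 1.0, so a separate magnitude check is needed to catch scaling errors.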

Bug Fix: V matrix canonical layout (commit `4fe7f9d`)

canon_idx_bf16_16x16 was called with its arguments swapped, as canon_idx_bf16_16x16(kk, dd); the correct call is canon_idx_bf16_16x16(dd, kk). The swapped call produced cos=0.158 against the BF16 reference. After the fix: cos=0.999972.
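
To see why swapped index arguments corrupt values without causing any out-of-bounds access, here is a small sketch using a plain row-major index function. `idx_row_major` is a hypothetical stand-in and does not reproduce the real `canon_idx_bf16_16x16` layout; it only shows that swapping the two arguments reads the transposed element of a 16x16 tile.

```python
TILE = 16


def idx_row_major(row, col):
    """Hypothetical stand-in layout, NOT the real canonical layout."""
    return row * TILE + col


def store_tile(values):
    """Write a 16x16 nested list into a flat buffer using the stand-in layout."""
    buf = [0.0] * (TILE * TILE)
    for r in range(TILE):
        for c in range(TILE):
            buf[idx_row_major(r, c)] = values[r][c]
    return buf


def load(buf, r, c, swapped=False):
    """Read element (r, c); swapped=True mimics the argument-order bug."""
    offset = idx_row_major(c, r) if swapped else idx_row_major(r, c)
    return buf[offset]


# Each element encodes its own coordinates so a wrong read is easy to spot.
tile = [[float(r * 100 + c) for c in range(TILE)] for r in range(TILE)]
buf = store_tile(tile)
```

In this stand-in, `load(buf, 2, 5)` returns 205.0 while `load(buf, 2, 5, swapped=True)` returns 502.0. Every offset stays inside the tile and diagonal elements still match, so a bug of this kind tends to surface only in a numerical comparison against a reference.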

Known Limitations

Decode only (T==1). The launcher hard-errors on prefill-shaped input, so prefill currently runs one token at a time.
Specialized to DSV4 attention dimensions (HD=512/NOPE=448/ROPE=64).
noPE QK uses Blackwell FP8 tensor cores; RoPE QK and PV use BF16 tensor cores.
noPE V is dequantized only inside shared memory immediately before the PV BF16 tensor-core multiply. There is no global BF16 KV staging.
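
The split between FP8 and BF16 QK work can be sketched as two partial dot products summed into one logit, since a dot product over the concatenated head dimension (NOPE=448 plus ROPE=64) equals the sum of the dot products over each part. The sketch below is plain Python, not the kernel. The per-row Q scale, the `softmax_scale` parameter, and all function names are assumptions for illustration; the note itself only documents a per-row scale for noPE KV.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mixed_qk_logit(q_nope_q, q_scale, k_nope_q, k_scale,
                   q_rope, k_rope, softmax_scale=1.0):
    """Attention logit for one (query, key) pair with a split head dimension.

    q_nope_q / k_nope_q: quantized noPE values (FP8 in the real kernel).
    q_scale / k_scale:   assumed per-row dequantization scales.
    q_rope / k_rope:     unquantized RoPE values (BF16 in the real kernel).
    """
    # The scales factor out of the sum, so they are applied once per pair
    # after the low-precision dot product rather than once per element.
    nope_part = dot(q_nope_q, k_nope_q) * q_scale * k_scale
    rope_part = dot(q_rope, k_rope)
    return (nope_part + rope_part) * softmax_scale


def reference_logit(q_nope_q, q_scale, k_nope_q, k_scale,
                    q_rope, k_rope, softmax_scale=1.0):
    """Same logit computed by dequantizing first and using one full dot product."""
    q_full = [v * q_scale for v in q_nope_q] + list(q_rope)
    k_full = [v * k_scale for v in k_nope_q] + list(k_rope)
    return dot(q_full, k_full) * softmax_scale
```

Applying the scales after the dot product is what lets the noPE part stay in FP8 through the multiply; dequantizing first, as in `reference_logit`, would require a higher-precision copy of K.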
