B1 Mixed FP8/BF16 FMHA — DONE

Storage-native implementation of DeepSeek-V4 attention that keeps KV in the format described in the paper:

  • noPE KV: FP8_E4M3 bytes plus per-row FP32 scale
  • RoPE KV: BF16
  • Q noPE: quantized BF16 → FP8_E4M3 immediately before FMHA
  • Q RoPE: BF16
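
The noPE storage format listed above can be illustrated with a minimal pure-Python sketch of per-row E4M3 quantization. It assumes the scale is chosen as the row's absolute maximum divided by 448 (the largest finite E4M3 value) and that rounding is round-to-nearest-even; this document does not state how the kernels pick the scale, and the helper names `round_to_e4m3`, `quantize_row`, and `dequantize_row` are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite FP8_E4M3 magnitude


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on the FP8_E4M3 grid, saturating at +-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0.0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp returns a = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    _, e = math.frexp(a)
    exp = max(e - 1, -6)      # -6 is the smallest normal exponent; below it the grid is subnormal
    step = 2.0 ** (exp - 3)   # 3 mantissa bits give 8 steps per binade
    # Python's round() rounds halves to even, matching round-to-nearest-even here.
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_row(row):
    """Return (values on the E4M3 grid, per-row scale). Assumed policy: scale = amax / 448."""
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0.0 else 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize_row(q, scale):
    """Undo the scaling; the rounding error from quantize_row remains."""
    return [v * scale for v in q]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


# One synthetic row with the noPE width used in this document (448 values).
row = [math.sin(0.37 * i) * (1.0 + 0.01 * i) for i in range(448)]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)
print(f"scale={scale:.6f} cos={cosine(row, restored):.6f}")
```

Python floats are doubles, so the scale here is only a stand-in for the FP32 scale stored alongside each row.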

The live forward_attention path gathers compressed rows and the SWA tail into mixed buffers and calls dsv4_attention_mixed_fp8_decode; it no longer dequantizes noPE KV into gather_buf before attention.
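
As a rough illustration of the gather step, the sketch below copies selected compressed rows and a sliding-window tail into three parallel buffers (FP8 bytes, per-row scales, RoPE values) without dequantizing anything. It assumes compressed rows are already stored quantized while SWA rows are quantized during the gather, which is one reading of the unit test results in this document. The names `MixedBuffers`, `gather_mixed`, and `toy_quantize` are hypothetical, and the quantizer is a stand-in that is not a real E4M3 encoder.

```python
from dataclasses import dataclass, field


@dataclass
class MixedBuffers:
    nope_fp8: list = field(default_factory=list)    # one bytes object per gathered row
    nope_scale: list = field(default_factory=list)  # one scale per gathered row
    rope_bf16: list = field(default_factory=list)   # RoPE part of each gathered row


def gather_mixed(compressed, selected, swa_tail, quantize_row):
    """Gather rows into mixed buffers; noPE data is never widened back to BF16."""
    out = MixedBuffers()
    # Compressed rows are assumed to be stored as (fp8_bytes, scale, rope): copy verbatim.
    for idx in selected:
        fp8_bytes, scale, rope = compressed[idx]
        out.nope_fp8.append(fp8_bytes)
        out.nope_scale.append(scale)
        out.rope_bf16.append(rope)
    # SWA rows are assumed to arrive unquantized as (nope, rope): quantize on the way in.
    for nope, rope in swa_tail:
        fp8_bytes, scale = quantize_row(nope)
        out.nope_fp8.append(fp8_bytes)
        out.nope_scale.append(scale)
        out.rope_bf16.append(rope)
    return out


def toy_quantize(row):
    """Stand-in quantizer producing signed 8-bit codes; NOT a real E4M3 encoder."""
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / 127.0
    return bytes((round(v / scale) & 0xFF) for v in row), scale


# Four tiny compressed rows and a one-row SWA tail.
compressed = [(bytes([i, i + 1]), 0.5 * (i + 1), [float(i)]) for i in range(4)]
swa_tail = [([0.5, -1.0], [9.0])]
bufs = gather_mixed(compressed, [2, 0], swa_tail, toy_quantize)
print(len(bufs.nope_fp8), bufs.nope_scale[:2])
```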

New files

  • dsv4/kernels/cuda/fp8_attention_io.cu — quantize_q_fp8_split, gather_mixed_{selective,all,swa_only}
  • dsv4/kernels/attention/fmha_mixed_fp8_decode.cuh — decode kernel, HD=512/NOPE=448/ROPE=64
  • dsv4/kernels/attention/fmha_mixed_fp8_capi.cu — C ABI launcher
  • dsv4/kernels/attention/fmha_mixed_fp8_op.py — Python ctypes/nvcc bridge
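
For readers unfamiliar with a ctypes/nvcc bridge, the following is a minimal sketch of the general pattern: compile a `.cu` file into a shared library and load it with `ctypes`. It is not the code in fmha_mixed_fp8_op.py. The function names, the output path, and the `sm_100a` architecture flag are assumptions; the right `-arch` value depends on the target GPU.

```python
import ctypes
import shutil
import subprocess
from pathlib import Path


def nvcc_command(source: str, output: str, arch: str = "sm_100a") -> list:
    """Build the nvcc argument list for a position-independent shared library."""
    return [
        "nvcc",
        "-shared",                # emit a shared object instead of an executable
        "-Xcompiler", "-fPIC",    # forward -fPIC to the host compiler
        f"-arch={arch}",          # assumed target; adjust for the actual GPU
        "-O3",
        "-o", output,
        source,
    ]


def build_and_load(source: str, output: str) -> ctypes.CDLL:
    """Compile once if the library is missing, then load it through ctypes."""
    if not Path(output).exists():
        if shutil.which("nvcc") is None:
            raise RuntimeError("nvcc not found on PATH")
        subprocess.run(nvcc_command(source, output), check=True)
    # Resolve to an absolute path so the loader does not search system directories.
    return ctypes.CDLL(str(Path(output).resolve()))


cmd = nvcc_command("fmha_mixed_fp8_capi.cu", "libfmha_mixed_fp8.so")
print(" ".join(cmd))
```

After loading, each exported C function still needs its `argtypes` and `restype` set to match the C ABI launcher before it is called.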

Unit Test

tests/unit/test_b1_mixed_fp8_fmha.py exercises the path at production shapes (HD=512, H=128 Q heads, N=128..2048):

  1. quantize_q_fp8_split round-trip: cos=0.9997
  2. gather_mixed kernels: exact copy for compressed, cos=0.9997 for SWA quantization
  3. FMHA decode cosine vs FP32 SDPA: cos=0.999972 (N=128) to cos=0.999923 (N=2048)
  4. Attention sink bias: verified effect on output
  5. GQA/MQA with 128 Q heads: verified output magnitudes
  6. Weight loading dtype/shape verification
  7. Batch sizes B=1,2,4
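
The cosine figures above compare kernel output against a reference. The sketch below shows a single-query FP32-style reference and a cosine helper of the kind such a test could use; it is not the test's actual code. It assumes the attention sink is one extra logit that joins only the softmax denominator, which this document does not spell out. Under that assumption the sink rescales the output without changing its direction, so cosine alone cannot detect it and a magnitude check is needed as well.

```python
import math


def sdpa_decode(q, keys, values, sink=None):
    """Single-query softmax attention. `sink` (assumed semantics) adds a logit
    to the softmax denominator only and contributes no value vector."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    extras = [] if sink is None else [sink]
    m = max(logits + extras)                      # subtract the max for stability
    weights = [math.exp(x - m) for x in logits]
    denom = sum(weights) + sum(math.exp(s - m) for s in extras)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / denom
        for d in range(len(values[0]))
    ]


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


# Small deterministic inputs: 16 keys/values of width 8.
q = [math.sin(0.5 * i) for i in range(8)]
keys = [[math.cos(0.3 * i + j) for i in range(8)] for j in range(16)]
values = [[math.sin(0.2 * i - j) for i in range(8)] for j in range(16)]

plain = sdpa_decode(q, keys, values)
sunk = sdpa_decode(q, keys, values, sink=1.0)
print(f"cos={cosine(plain, sunk):.6f} norm_ratio={norm(sunk) / norm(plain):.6f}")
```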

Bug Fix: V matrix canonical layout (commit 4fe7f9d)

The V matrix layout code called canon_idx_bf16_16x16(kk, dd) with its arguments swapped; the correct call is canon_idx_bf16_16x16(dd, kk). With the swapped arguments the output reached only cos=0.158 against the BF16 reference. After the fix: cos=0.999972.
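
A swapped index pair of this kind fails silently: both argument orders map the tile onto the same set of offsets, so nothing goes out of bounds and the only symptom is a low cosine. The sketch below shows the effect with a plain row-major index function. `tile_idx` is a hypothetical stand-in; the real canon_idx_bf16_16x16 layout is not shown in this document and is likely different.

```python
import math
import random

TILE = 16


def tile_idx(row, col):
    """Hypothetical stand-in for a tile index helper: plain row-major offset."""
    return row * TILE + col


rng = random.Random(0)
# v[kk][dd]: one 16x16 tile of V, indexed by key position kk and head dimension dd.
v = [[rng.uniform(-1.0, 1.0) for _ in range(TILE)] for _ in range(TILE)]


def store(swapped):
    """Write the tile into a flat buffer using either argument order."""
    buf = [0.0] * (TILE * TILE)
    for kk in range(TILE):
        for dd in range(TILE):
            off = tile_idx(kk, dd) if swapped else tile_idx(dd, kk)
            buf[off] = v[kk][dd]
    return buf


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


good = store(swapped=False)   # order the consumer expects: (dd, kk)
bad = store(swapped=True)     # buggy order: (kk, dd), a transposed tile

print(f"cos(good, bad)={cosine(good, bad):.3f}")
```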

Known Limitations

  • Decode only (T==1). The launcher hard-errors on prefill shapes, so prefill currently runs one token at a time.
  • Specialized to DSV4 attention dimensions (HD=512/NOPE=448/ROPE=64).
  • noPE QK uses Blackwell FP8 tensor cores; RoPE QK and PV use BF16 tensor cores.
  • noPE V is dequantized only inside shared memory immediately before the PV BF16 tensor-core multiply. There is no global BF16 KV staging.
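
The split between FP8 and BF16 tensor cores works because a per-row scale factors out of a dot product: the noPE part of a logit can be accumulated on raw FP8 values and scaled once afterwards, then added to the RoPE part. The sketch below checks this identity in plain Python. It assumes Q noPE also carries one scale per row and that the softmax scale is 1/sqrt(HD); neither is stated in this document, and `mixed_logit` is a hypothetical name.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mixed_logit(q8, q_scale, k8, k_scale, q_rope, k_rope, head_dim):
    """One attention logit from an FP8 noPE dot product plus a BF16 RoPE dot product."""
    nope = q_scale * k_scale * dot(q8, k8)   # scales applied once, after accumulation
    rope = dot(q_rope, k_rope)               # RoPE part stays unquantized
    return (nope + rope) / math.sqrt(head_dim)  # assumed softmax scale


# Toy values chosen to be exactly representable in E4M3.
q8, q_scale = [1.0, -2.0, 0.5, 4.0], 0.5
k8, k_scale = [2.0, 1.0, -8.0, 0.25], 0.25
q_rope, k_rope = [1.5, -0.5], [2.0, 4.0]
head_dim = len(q8) + len(q_rope)

fused = mixed_logit(q8, q_scale, k8, k_scale, q_rope, k_rope, head_dim)

# Reference: dequantize first, then take one dot product over the full head.
q_full = [v * q_scale for v in q8] + q_rope
k_full = [v * k_scale for v in k8] + k_rope
reference = dot(q_full, k_full) / math.sqrt(head_dim)

print(math.isclose(fused, reference))
```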