CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4

Status: KV CACHE PIPELINE VERIFIED

What's Fixed

  • Root cause identified: vLLM's _attention_impl_blackwell never writes KV to the paged cache, so decode produces garbage because it can't access prior tokens' KV.
  • Solution built and tested: cutedsl/blackwell_attention.py + vllm/patches/layers/csa_attention.py — KV cache write/read pipeline using fp8 quantization.
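The failure mode above can be illustrated with a toy paged cache. This is a minimal sketch, not the code in cutedsl/blackwell_attention.py: the `PagedKVCache` class, the `prefill` helper, the `write_cache` flag, and the block size are all hypothetical names and values chosen for illustration. It only shows why decode has nothing to attend to when prefill skips the cache write.

```python
BLOCK_SIZE = 4  # tokens per cache block (illustrative value, not from the repo)


class PagedKVCache:
    """Toy paged cache: fixed-size blocks addressed through a per-sequence block table."""

    def __init__(self, num_blocks, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = [[None] * block_size for _ in range(num_blocks)]

    def write(self, block_table, position, kv_row):
        # Map the logical token position to (physical block, slot within block).
        block = block_table[position // self.block_size]
        self.blocks[block][position % self.block_size] = kv_row

    def read(self, block_table, seq_len):
        # Gather the first seq_len rows of a sequence back into logical order.
        rows = []
        for pos in range(seq_len):
            block = block_table[pos // self.block_size]
            rows.append(self.blocks[block][pos % self.block_size])
        return rows


def prefill(cache, block_table, kv_rows, write_cache=True):
    """Hypothetical prefill step; write_cache=False mimics the bug described above."""
    if write_cache:
        for pos, row in enumerate(kv_rows):
            cache.write(block_table, pos, row)
    return len(kv_rows)


kv_rows = [[float(t), float(t) + 0.5] for t in range(6)]  # 6 tokens, tiny latent
block_table = [2, 0]  # logical blocks 0 and 1 live in physical blocks 2 and 0

buggy_cache = PagedKVCache(num_blocks=3)
seq_len = prefill(buggy_cache, block_table, kv_rows, write_cache=False)
buggy_rows = buggy_cache.read(block_table, seq_len)  # every slot is still empty

fixed_cache = PagedKVCache(num_blocks=3)
seq_len = prefill(fixed_cache, block_table, kv_rows, write_cache=True)
fixed_rows = fixed_cache.read(block_table, seq_len)  # prior tokens are recoverable
```

In the buggy path the decode step reads only empty slots, which corresponds to the garbage output described above; in the fixed path it recovers every prior token's KV row in order.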

Test Results (B200 venv, all passing)

| Test | Result |
|---|---|
| KV cache roundtrip (fp8 quant → dequant) | 0.999+ cosine |
| Decode attention (1 query vs N cached KVs) | 0.9998 cosine |
| Full pipeline (inv RoPE + o_a + o_b) | 0.996-0.999 cosine |
| All 5 layer types (C128A, C4A, SWA) | ≥0.996 cosine |
| E2E 61-layer model (shared experts) | Healthy logits, consistent tokens |
| Multi-step decode (3 steps) | 0.999+ cosine each step |
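The cosine figures above compare a candidate output against a reference. As a sketch of that metric, assuming the tensors are flattened to plain Python lists (the function name and sample values are illustrative, not taken from the test suite):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an all-zero input
    return dot / (norm_a * norm_b)


reference = [0.5, -1.25, 2.0, 0.75]
candidate = [0.5, -1.25, 2.0, 0.8125]  # last element slightly perturbed
score = cosine_similarity(reference, candidate)
print(f"cosine = {score:.6f}")
```

Cosine similarity ignores overall scale, so a check like this catches direction errors (wrong tokens attended, broken dequantization) but not a uniform gain error; a separate magnitude check covers that case.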

What's Next

  1. Test in vLLM container (build_and_run.sh)
  2. Handle CSA/HCA sparse attention in the Blackwell path (currently using full attention for all layers)
  3. Add routed MoE experts (currently shared experts only)
  4. Performance optimization (vectorized paged KV, Triton kernels)

Architecture

  • KV latent: (T, HD=512) shared across 128 Q heads
  • KV Cache: fp8_e4m3 paged cache with per-token inverse scale
  • Attention: BF16 (NVFP4 too lossy for Q×K^T)
  • Prefill: causal SDPA on raw KV
  • Decode: read all cached KV → fp8 dequant → SDPA → output
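The decode path listed above (cached fp8 KV → dequant → SDPA) can be sketched in pure Python. Several things here are assumptions rather than facts from the repo: fp8 e4m3 rounding is simulated in software, the stored per-token scale is taken to be the dequantization multiplier amax / 448, a single head is shown, and the latent row is used as both key and value. All function names are hypothetical.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite magnitude in fp8 e4m3 (the "fn" variant)


def round_to_e4m3(x):
    """Round a float to the nearest e4m3-representable value (software simulation)."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), FP8_E4M3_MAX)  # saturate instead of overflowing
    _, e = math.frexp(mag)           # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)             # clamp to the smallest normal exponent
    step = 2.0 ** (exp - 3)          # 3 mantissa bits -> 8 steps per binade
    return math.copysign(min(round(mag / step) * step, FP8_E4M3_MAX), x)


def quantize_token(row):
    """Per-token quantization: returns (codes, scale), where value ~= code * scale."""
    amax = max(abs(v) for v in row)
    scale = amax / FP8_E4M3_MAX if amax > 0.0 else 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize_token(codes, scale):
    return [c * scale for c in codes]


def attention(query, kv_rows):
    """Single-query softmax attention; each row acts as both key and value (assumption)."""
    inv_sqrt_d = 1.0 / math.sqrt(len(query))
    scores = [inv_sqrt_d * sum(q * k for q, k in zip(query, row)) for row in kv_rows]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * row[d] for w, row in zip(weights, kv_rows)) / total
            for d in range(len(query))]


def decode_attention(query, cached):
    """Decode step: dequantize every cached token, then attend over all of them."""
    return attention(query, [dequantize_token(codes, s) for codes, s in cached])


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
dim, num_tokens = 16, 8  # tiny sizes for illustration; the real latent is HD=512
raw_kv = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]

cached = [quantize_token(row) for row in raw_kv]  # what a prefill write would store
out = decode_attention(query, cached)             # decode reads the cache back
ref = attention(query, raw_kv)                    # same attention on unquantized KV
print(f"decode vs. reference cosine = {cosine(out, ref):.4f}")
```

Scaling each token so its largest element maps to 448 keeps the relative rounding error of every normal-range element within 1/16, which is why the roundtrip stays close to the original in cosine terms. The attention math itself runs in higher precision, consistent with the note above that attention stays in BF16.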