CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4
Status: KV CACHE PIPELINE VERIFIED ✅
What's Fixed
- Root cause identified: vLLM's `_attention_impl_blackwell` never writes KV to the paged cache, so decode produces garbage because it cannot access prior tokens' KV.
- Solution built and tested: `cutedsl/blackwell_attention.py` + `vllm/patches/layers/csa_attention.py`, a KV cache write/read pipeline using fp8 quantization.
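To illustrate the failure mode, here is a minimal NumPy sketch of a paged cache write and read. It is not vLLM's actual cache layout or API: `BLOCK_SIZE`, `write_kv`, `read_kv`, and the block table are hypothetical names chosen for the example. The point is only that decode can recover earlier tokens' KV if, and only if, prefill scattered them into the cache first.

```python
import numpy as np

BLOCK_SIZE = 4  # tokens per cache block (hypothetical toy value)
HEAD_DIM = 8    # toy size; the real latent is much wider

def write_kv(cache, block_table, start_pos, kv):
    """Scatter rows of kv (T, HEAD_DIM) into the paged cache,
    starting at logical token position start_pos."""
    for i, row in enumerate(kv):
        pos = start_pos + i
        block = block_table[pos // BLOCK_SIZE]  # logical -> physical block
        cache[block, pos % BLOCK_SIZE] = row

def read_kv(cache, block_table, seq_len):
    """Gather the first seq_len tokens back into a contiguous (seq_len, HEAD_DIM) array."""
    pos = np.arange(seq_len)
    blocks = np.asarray(block_table)[pos // BLOCK_SIZE]
    return cache[blocks, pos % BLOCK_SIZE]

rng = np.random.default_rng(0)
cache = np.zeros((6, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
block_table = [5, 2, 0]  # physical blocks assigned to this sequence

prefill_kv = rng.standard_normal((6, HEAD_DIM)).astype(np.float32)
write_kv(cache, block_table, 0, prefill_kv)  # the step the broken path skipped

step_kv = rng.standard_normal((1, HEAD_DIM)).astype(np.float32)
write_kv(cache, block_table, 6, step_kv)     # one decode step appends its own KV

history = read_kv(cache, block_table, 7)     # what decode attention will see
```

If the prefill `write_kv` call is removed, `history[:6]` is all zeros, which is the toy analogue of decode attending to garbage.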
Test Results (B200 venv, all passing)
| Test | Result |
|---|---|
| KV cache roundtrip (fp8 quant → dequant) | 0.999+ cosine |
| Decode attention (1 query vs N cached KVs) | 0.9998 cosine |
| Full pipeline (inv RoPE + o_a + o_b) | 0.996-0.999 cosine |
| All 5 layer types (C128A, C4A, SWA) | ≥0.996 cosine |
| E2E 61-layer model (shared experts) | Healthy logits, consistent tokens |
| Multi-step decode (3 steps) | 0.999+ cosine each step |
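The cosine figures in the table compare a test output against a reference output. The exact helper used by the test suite is not shown here, so the following is a sketch of the usual definition, with the hypothetical name `cosine_similarity`, flattening both tensors and accumulating in float64.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two tensors, treated as flat vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: a reference output versus the same output with small added noise.
rng = np.random.default_rng(0)
reference = rng.standard_normal((4, 512))
noisy = reference + 0.01 * rng.standard_normal((4, 512))
score = cosine_similarity(reference, noisy)
```

Note that cosine similarity ignores overall scale, so a result that is off by a constant factor still scores close to 1.0; it is a check on direction, not magnitude.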
What's Next
- Test in vLLM container (build_and_run.sh)
- Handle CSA/HCA sparse attention in the Blackwell path (currently using full attention for all layers)
- Add routed MoE experts (currently shared experts only)
- Performance optimization (vectorized paged KV, Triton kernels)
Architecture
- KV latent: (T, HD=512) shared across 128 Q heads
- KV Cache: fp8_e4m3 paged cache with per-token inverse scale
- Attention: BF16 (NVFP4 too lossy for Q×K^T)
- Prefill: causal SDPA on raw KV
- Decode: read all cached KV → fp8 dequant → SDPA → output
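The decode path above can be sketched in NumPy as follows. Several things here are assumptions rather than the repo's implementation: e4m3 rounding is emulated in software (`round_to_e4m3`) instead of using a real fp8 dtype, the "inverse scale" is taken to mean the per-token factor `amax / 448` that dequantization multiplies by, and the shared latent is used as both key and value for every query head. All function names are hypothetical.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value in fp8 e4m3

def round_to_e4m3(x):
    """Emulate rounding to the e4m3 grid: 3 mantissa bits, saturate at +/-448."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    _, e = np.frexp(x)             # x = m * 2**e with 0.5 <= |m| < 1
    e = np.maximum(e, -5)          # below 2**-6 the grid spacing stays at 2**-9
    step = np.exp2(e - 4.0)        # spacing between neighbouring fp8 values
    return np.round(x / step) * step

def quantize_per_token(kv):
    """Return emulated fp8 values and one inverse scale per token (row)."""
    amax = np.abs(kv).max(axis=-1, keepdims=True)
    inv_scale = np.maximum(amax, 1e-12) / E4M3_MAX
    return round_to_e4m3(kv / inv_scale), inv_scale

def dequantize(q, inv_scale):
    return q * inv_scale

def decode_attention(q_heads, kv):
    """q_heads: (H, D) for one new token; kv: (N, D) latent shared by all heads."""
    scores = q_heads @ kv.T / np.sqrt(kv.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ kv

rng = np.random.default_rng(0)
kv = rng.standard_normal((16, 512)).astype(np.float32)        # N cached tokens
q_heads = rng.standard_normal((128, 512)).astype(np.float32)  # 128 query heads

q8, inv_scale = quantize_per_token(kv)  # what would be written to the cache
kv_hat = dequantize(q8, inv_scale)      # what decode reads back
out_ref = decode_attention(q_heads, kv)
out_fp8 = decode_attention(q_heads, kv_hat)
```

Scaling each token by its own `amax` keeps every row inside the representable fp8 range regardless of how token magnitudes vary, at the cost of storing one extra scalar per token.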