CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4

Status: KV CACHE PIPELINE VERIFIED ✅

Root cause identified: vLLM's _attention_impl_blackwell never writes KV to the paged cache, so decode steps cannot read prior tokens' KV and produce garbage output.
Solution built and tested: cutedsl/blackwell_attention.py and vllm/patches/layers/csa_attention.py implement a KV cache write/read pipeline using fp8 quantization.
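
The write/read path can be illustrated with a minimal, framework-free sketch. This is not the code in cutedsl/blackwell_attention.py: the `PagedKVCache` class, the `slot_for` helper, the per-token scale, and the simulated E4M3 rounding in `round_to_e4m3` are all assumptions made for illustration. It shows why decode depends on the write step: `read` can only return what `write` previously stored in a slot.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite value in the fp8 E4M3 (fn) format


def round_to_e4m3(x: float) -> float:
    """Simulate rounding x to the nearest fp8 E4M3 value (3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    # Exponents below -6 are subnormal and share the step of the smallest binade.
    exponent = max(math.frexp(a)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)
    q = min(round(a / step) * step, FP8_E4M3_MAX)
    return math.copysign(q, x)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def slot_for(block_table, pos, block_size):
    """Map a token position to a physical slot through the block table."""
    return block_table[pos // block_size] * block_size + pos % block_size


class PagedKVCache:
    """Toy paged cache: one quantized vector plus one scale per slot (assumed layout)."""

    def __init__(self, num_blocks, block_size, head_dim):
        self.block_size = block_size
        self.data = [[[0.0] * head_dim for _ in range(block_size)]
                     for _ in range(num_blocks)]
        self.scales = [[1.0] * block_size for _ in range(num_blocks)]

    def write(self, slot, vec):
        block, offset = divmod(slot, self.block_size)
        amax = max(abs(v) for v in vec)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        self.data[block][offset] = [round_to_e4m3(v / scale) for v in vec]
        self.scales[block][offset] = scale

    def read(self, slot):
        block, offset = divmod(slot, self.block_size)
        scale = self.scales[block][offset]
        return [q * scale for q in self.data[block][offset]]


def roundtrip_cosine(num_tokens=8, head_dim=64, seed=0):
    """Write random KV vectors, read them back, return the worst cosine."""
    rng = random.Random(seed)
    cache = PagedKVCache(num_blocks=4, block_size=4, head_dim=head_dim)
    block_table = [2, 0]  # logical block i lives in physical block block_table[i]
    worst = 1.0
    for pos in range(num_tokens):
        kv = [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
        slot = slot_for(block_table, pos, cache.block_size)
        cache.write(slot, kv)  # skipping this leaves zeros for later decode steps
        worst = min(worst, cosine(kv, cache.read(slot)))
    return worst
```

A slot that was never written reads back as zeros in this sketch, which is the toy analogue of decode attending over missing KV.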

| Test | Result |
| --- | --- |
| KV cache roundtrip (fp8 quant → dequant) | 0.999+ cosine |
| Decode attention (1 query vs N cached KVs) | 0.9998 cosine |
| Full pipeline (inv RoPE + o_a + o_b) | 0.996-0.999 cosine |
| All 5 layer types (C128A, C4A, SWA) | ≥0.996 cosine |
| E2E 61-layer model (shared experts) | Healthy logits, consistent tokens |
| Multi-step decode (3 steps) | 0.999+ cosine each step |
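
A check of the decode-attention kind might look like the sketch below, assuming the cosine scores compare an output computed from cached KV against an unquantized reference. The function names are hypothetical and not taken from the actual test suite, the attention is plain softmax over all cached tokens, and the small relative noise is only a stand-in for fp8 quantization error.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def decode_attention(q, keys, values):
    """Softmax attention for one query vector over N cached key/value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, vi in enumerate(v):
            out[i] += (w / total) * vi
    return out


def decode_check(num_cached=16, head_dim=32, noise=0.02, seed=0):
    """Cosine between reference decode output and output from perturbed KV."""
    rng = random.Random(seed)

    def vec():
        return [rng.gauss(0.0, 1.0) for _ in range(head_dim)]

    def perturb(rows):
        # Hypothetical stand-in for fp8 error: small relative noise per element.
        return [[x * (1.0 + rng.uniform(-noise, noise)) for x in row] for row in rows]

    q = vec()
    keys = [vec() for _ in range(num_cached)]
    values = [vec() for _ in range(num_cached)]
    ref = decode_attention(q, keys, values)
    approx = decode_attention(q, perturb(keys), perturb(values))
    return cosine(ref, approx)
```

Calling `decode_check(noise=0.0)` should return a value equal to 1 up to floating-point rounding, which is a quick sanity check on the harness itself before comparing real cached KV.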

Next steps:
- Test in the vLLM container (build_and_run.sh).
- Handle CSA/HCA sparse attention in the Blackwell path (full attention is currently used for all layers).
- Add routed MoE experts (currently shared experts only).
- Optimize performance (vectorized paged KV, Triton kernels).