CURRENT_BUG.md
Status: Container starts, model generates tokens, but output is GARBAGE (empty/NaN)
THE FUNDAMENTAL PROBLEM
Mike was right — we need our own kernels. Not just for the NVFP4 GEMMs, but for the ENTIRE attention pipeline. The current approach of patching individual vLLM functions is a house of cards.
Here's what happened: we spent hours patching vLLM to "work" on Blackwell. We patched:
- `VLLM_NVFP4_GEMM_BACKEND=cutedsl` → invalid, removed env var
- KV cache page size assertion → patched `kv_cache_utils.py`
- 91 missing compressor cache layers → patched alignment in 3 cache specs
- `softmax_scale` AttributeError → fixed to `self.scale`
- NaN from missing RoPE on KV → added `_apply_rope_kv()`
- Shape mismatch in `apply_gptj_rope` → rewrote as inline RoPE
After ALL of that, the container starts and runs. But the model output is GARBAGE — empty strings, NaN logprobs, zero meaningful text. Because the attention pipeline is fundamentally broken:
- The KV cache is never written to (the C++ kernel is FlashMLA-only)
- The attention reads from raw projection output, not the cache
- The compressor/indexer runs but the Blackwell path doesn't integrate with it
- Everything is deeply coupled — patch one thing, three more break
THE ONLY FIX: Build CuTeDSL kernels for the ENTIRE attention pipeline.
Do NOT try to patch vLLM's FlashMLA code. Do NOT use pure PyTorch as a workaround. Do NOT skip the KV cache write. BUILD THE KERNELS.
Container Crash History (May 19)
Each crash was "fixed" with a patch. Each patch led to the next crash. This is the house of cards:
- `VLLM_NVFP4_GEMM_BACKEND=cutedsl`: invalid choice in `envs.py` → removed env var
- `assert max(sm_page_sizes) <= max(all_page_sizes)`: KV cache page size mismatch → patched `kv_cache_utils.py`
- `Some layers are not correctly initialized`: 91 missing compressor cache layers (alignment=576 wrong on Blackwell) → patched SWA, indexer, compressor cache specs
- `AttributeError: softmax_scale`: wrapper uses `self.scale` not `self.softmax_scale` → fixed
- 200 GiB KV cache for 512 tokens → reduced max_model_len to 256, patched cache specs to remove FlashMLA alignment
- NaN output (logprobs) → KV wasn't getting RoPE → added `_apply_rope_kv()`
- Shape mismatch in `apply_gptj_rope` → rewrote as inline 2D RoPE
- Garbage/empty output: the attention pipeline is fundamentally broken
What Actually Works (standalone B200 venv tests)
Every single kernel works when tested individually. The problem is ONLY in the vLLM integration.
| Kernel | Test File | Result |
|---|---|---|
| CuTeDSL NVFP4 Linear | test_full_layer_b200.py | cosine 0.994+ ✅ |
| CuTeDSL NVFP4 MoE | layertest.py | cosine 0.988 ✅ |
| FP8 KV quantize/dequant | test_kv_cache_b200.py | cosine 0.9997 ✅ |
| NVFP4 KV quantize/dequant | test_kv_cache_b200.py | cosine 0.9943 ✅ |
| Paged KV cache read/write | test_kv_cache_b200.py | cosine 1.0 ✅ |
| FP8 KV → full attention | test_kv_cache_b200.py | cosine 0.9997 ✅ |
| CSA sparse attention (cr=4) | test_sparse_attn_b200.py | works, no NaN ✅ |
| HCA sparse attention (cr=128) | test_sparse_attn_b200.py | works, no NaN ✅ |
| Merged CSA+SWA attention | test_sparse_attn_b200.py | works, no NaN ✅ |
| Full pipeline (all layer types) | test_v4_attention_b200.py | cosine 0.981-0.995 ✅ |
| NVFP4 Q×K^T GEMM | test_nvfp4_attn_gemm_b200.py | cosine 0.86 ❌ (too lossy) |
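The test files themselves are not reproduced here, so the exact metric they compute is an assumption. A minimal sketch of the kind of cosine check the table reports, treating both tensors as flat lists of floats (the function name and sample values are hypothetical):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")  # undefined when either input is all zeros
    return dot / (norm_a * norm_b)

# Hypothetical sample data: a BF16 reference and a slightly perturbed kernel output.
reference = [0.5, -1.25, 2.0, 0.75]
kernel_out = [0.51, -1.2, 1.95, 0.8]
score = cosine_similarity(reference, kernel_out)
passed = score >= 0.99
```

Cosine similarity is invariant to a uniform scale factor, so a kernel whose output is off by a constant multiplier still scores 1.0. Pairing it with a max-absolute-error check and an explicit NaN check covers that gap.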
Key Lessons (READ THESE OR REPEAT THE SAME MISTAKES)
- NVFP4 is NOT suitable for attention Q×K^T. The per-element dot products are too sensitive. Cosine 0.86. Keep attention in BF16, use NVFP4 only for weight GEMMs.
- DeepSeek-V4 is NOT MLA. It uses CSA (Compressed Sparse Attention) + HCA (Heavily Compressed Attention). vLLM misnames everything "MLA" internally; don't be confused by class names like `DeepseekV4MLAAttention`.
- The fp8_ds_mla format is FlashMLA-specific. 584 bytes per token (448 NoPE FP8 + 128 RoPE FP8 + 8 scale). This is NOT a standard fp8 tensor. You can't just `view()` it as `[slot, 512]` uint8.
- The SWA cache, indexer cache, and compressor cache all use `alignment=576` for FlashMLA. On Blackwell, this must be `None` (no FlashMLA). There are 4 separate classes that set this, and you must patch ALL of them.
- `DeepseekV4MultiHeadLatentAttentionWrapper` registers ITSELF (not the inner MLA attention) in `static_forward_context`. The custom op `deepseek_v4_attention` looks up the wrapper. So `attention_impl` must be on the WRAPPER, and it must use `self.scale` (not `self.softmax_scale`).
- The Triton compressor and indexer DO work on Blackwell. They're not the problem. The problem is that the Blackwell attention path doesn't integrate with them.
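To make the fp8_ds_mla lesson concrete, here is a rough sketch of the per-token byte arithmetic. The three sizes come from the note above. The field ORDER (NoPE, then RoPE, then scale) and the helper names are assumptions for illustration and should be verified against the FlashMLA layout before relying on them.

```python
NOPE_BYTES = 448   # NoPE part, per the lesson above
ROPE_BYTES = 128   # RoPE part, per the lesson above
SCALE_BYTES = 8    # scale data, per the lesson above
TOKEN_BYTES = NOPE_BYTES + ROPE_BYTES + SCALE_BYTES  # 584 bytes per token

def token_offset(slot: int) -> int:
    """Byte offset of a token record in a flat cache buffer (hypothetical helper)."""
    return slot * TOKEN_BYTES

def split_token_record(record: bytes):
    """Split one token record into (nope, rope, scale) byte fields.

    ASSUMPTION: the fields are stored in the order NoPE, RoPE, scale.
    """
    if len(record) != TOKEN_BYTES:
        raise ValueError(f"expected {TOKEN_BYTES} bytes, got {len(record)}")
    nope = record[:NOPE_BYTES]
    rope = record[NOPE_BYTES:NOPE_BYTES + ROPE_BYTES]
    scale = record[NOPE_BYTES + ROPE_BYTES:]
    return nope, rope, scale

# Dummy buffer holding two token records.
buffer = bytes(2 * TOKEN_BYTES)
start = token_offset(1)
nope, rope, scale = split_token_record(buffer[start:start + TOKEN_BYTES])
```

Because each record is 584 bytes rather than 512, a `[slot, 512]` view would run across record boundaries and mix scale bytes into the data.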
THE PLAN: Build CuTeDSL Attention Backend
STOP. Do NOT touch the vLLM container. Build and test kernels on the B200 venv first.
Step 1: KV Cache Write Kernel
- BF16 KV → apply RoPE → fp8 quantize → write to paged cache
- Test in `tests/test_kv_cache_write_b200.py`:
  - Write KV for N tokens, read it back, compare against BF16 reference
- Must handle: slot mapping, block_size, fp8 per-token scale
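The RoPE stage of the write path can be checked against a small reference. The sketch below assumes the GPT-J (interleaved-pair) convention suggested by the name `apply_gptj_rope`, and a base of 10000. Which dimensions of the KV latent actually receive RoPE, and the model's real base, are not stated in this note, so treat both as placeholders.

```python
import math

def rope_interleaved(x, position, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by position * base**(-2i/d).

    ASSUMPTION: GPT-J style interleaved pairing and base=10000.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = [0.0] * d
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def rope_inverse(x, position, base=10000.0):
    """Undo rope_interleaved by rotating through the negated angle."""
    return rope_interleaved(x, -position, base)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 0.25, 2.0]
# Scores depend only on the relative offset between query and key positions.
score_a = dot(rope_interleaved(q, 5), rope_interleaved(k, 3))
score_b = dot(rope_interleaved(q, 2), rope_interleaved(k, 0))
```

Two properties make useful unit tests for the kernel: position 0 leaves the vector unchanged, and the Q·K score depends only on the position difference.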
Step 2: KV Cache Read Kernel
- Paged cache → fp8 dequant → BF16 KV with RoPE
- Test: write then read, cosine >= 0.99
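A CPU-side model of the write-then-read round trip is useful as an oracle for both kernels. This sketch assumes a slot maps to `(block, offset)` via `divmod(slot, block_size)` and that the per-token scale is `amax / 448` (448 is the largest finite FP8 E4M3 value). The class name is hypothetical, and the rounding helper only approximates FP8: it keeps 3 explicit mantissa bits and ignores subnormals.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def fake_fp8_e4m3(x):
    """Approximate an FP8 E4M3 cast: 3 explicit mantissa bits, clamp to +-448."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x = mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    rounded = math.ldexp(round(mantissa * 16) / 16, exponent)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, rounded))

class PagedKVCache:
    """Toy paged cache storing quantized values plus one scale per token."""

    def __init__(self, num_blocks, block_size, dim):
        self.block_size = block_size
        self.values = [[[0.0] * dim for _ in range(block_size)] for _ in range(num_blocks)]
        self.scales = [[1.0] * block_size for _ in range(num_blocks)]

    def write(self, slot, vec):
        block, offset = divmod(slot, self.block_size)
        amax = max(abs(v) for v in vec)
        scale = amax / FP8_E4M3_MAX if amax > 0.0 else 1.0
        self.values[block][offset] = [fake_fp8_e4m3(v / scale) for v in vec]
        self.scales[block][offset] = scale

    def read(self, slot):
        block, offset = divmod(slot, self.block_size)
        scale = self.scales[block][offset]
        return [q * scale for q in self.values[block][offset]]

cache = PagedKVCache(num_blocks=3, block_size=4, dim=4)
tokens = [[0.1, -2.0, 0.33, 1.5], [4.0, 0.0, -0.7, 0.25], [-1.1, 0.9, 0.05, 3.2]]
slot_mapping = [5, 0, 9]  # non-contiguous slots, as a scheduler might assign
for slot, vec in zip(slot_mapping, tokens):
    cache.write(slot, vec)
restored = [cache.read(slot) for slot in slot_mapping]
```

Writing to scattered slots rather than 0, 1, 2 is deliberate: it catches kernels that silently assume contiguous tokens.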
Step 3: BF16 Attention Kernel
- Q (with RoPE) × K^T → causal mask → softmax → attn × V
- Keep in BF16 (NVFP4 too lossy for attention scores)
- Handle CSA sparse (gather top-k positions from compressed cache)
- Handle HCA sparse (gather from 1/128 positions)
- Handle SWA (sliding window, full causal within window)
- Test: compare against PyTorch SDPA, cosine >= 0.99
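Alongside the PyTorch SDPA comparison, a tiny dependency-free reference helps when debugging masks. The sketch below covers only the dense causal and sliding-window cases for a single head. The CSA/HCA top-k gather is not modelled, and the function name and window semantics (the last `window` keys, including the current token) are assumptions.

```python
import math

def reference_attention(q, k, v, scale, window=None):
    """Single-head causal attention over lists of vectors.

    window=None gives full causal attention; otherwise each query sees only
    the most recent `window` keys, itself included (assumed SWA semantics).
    """
    out = []
    for i in range(len(q)):
        start = 0 if window is None else max(0, i - window + 1)
        keys = list(range(start, i + 1))
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        denom = sum(weights)
        row = [0.0] * len(v[0])
        for w, j in zip(weights, keys):
            for d in range(len(row)):
                row[d] += (w / denom) * v[j][d]
        out.append(row)
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scale = 1.0 / math.sqrt(2.0)
full = reference_attention(q, k, v, scale)
local = reference_attention(q, k, v, scale, window=1)
```

With `window=1` every query attends only to itself, so the output must equal `v` exactly; that makes a quick mask sanity test before moving to cosine comparisons.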
Step 4: Full Pipeline Integration
- KV cache read → attention → inverse RoPE → o_a BMM → o_b NVFP4 projection
- Wire CSA/HCA/SWA with sink weight merge
- Test: compare full pipeline against BF16 reference, cosine >= 0.98
- Test: run through ALL 61 layers, verify logits are reasonable (std between 0.5 and 50)
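The "logits are reasonable" criterion can be turned into a small gate that also rejects the NaN failure mode seen in the container. A minimal sketch, assuming the logits for one position are available as a flat list of floats and that population standard deviation is the intended statistic (the function name is hypothetical):

```python
import math
import statistics

def logits_look_sane(logits, min_std=0.5, max_std=50.0):
    """Return True when logits are finite and their std falls in [min_std, max_std]."""
    if not logits:
        return False
    if any(math.isnan(x) or math.isinf(x) for x in logits):
        return False  # NaN or inf anywhere means the pipeline is broken
    std = statistics.pstdev(logits)
    return min_std <= std <= max_std

healthy = logits_look_sane([-3.0, -1.0, 0.0, 2.0, 4.0])
collapsed = logits_look_sane([0.0] * 8)           # all-equal logits have zero spread
poisoned = logits_look_sane([1.0, float("nan")])  # NaN logprobs case
```

Running the same check after each of the 61 layers on the hidden states, with thresholds adjusted for that tensor, would help localize which layer first goes wrong.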
Step 5: vLLM Attention Backend
- Create a proper `AttentionBackend` subclass (e.g., `CuTeDSLBlackwellBackend`)
- Override `DeepseekSparseSWABackend` on Blackwell
- Handle metadata, slot mapping, cache format properly
- ONLY THEN test in the container
Step 6: Test in Container
- Build container with the new backend
- Test with real prompts
- If output is garbage, DO NOT declare success. Fix it.
Architecture: CSA + HCA + mHC (NOT MLA!)
- CSA (Compress Ratio 4): Compressed Sparse Attention — KV compressed 4x with overlap (coff=2). Indexer finds per-layer top-k.
- HCA (Compress Ratio 128): Heavily Compressed Attention — KV compressed 128x.
- mHC: Manifold-Constrained Hyper-Connections — replaces standard residual connections.
- SWA: Sliding Window Attention — local window (compress_ratio=0, last layer only)
- KV latent: (T, 512) shared across all 128 heads. After kv_norm, it's a single 512-dim vector per token.
- Sink weights: Merge sparse + SWA attention outputs. `sigmoid(sinks)` controls the mix.
B200 Server Details
Server: root@45.76.247.107 (password: <B200_PASSWORD>)
Repo: /root/nvfp4-megamoe-kernel (branch: proper-nvfp4-integration)
Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
Venv: /root/nvfp4-megamoe-kernel/tests/venv/
Test command: PYTHONPATH=/root/nvfp4-megamoe-kernel tests/venv/bin/python tests/<test>.py
Container: docker compose up (takes ~14 min to build, ~10 min to warm up)
PLEASE LISTEN
If you're the next session reading this: BUILD THE KERNELS. Test them standalone. Don't waste time patching vLLM. The container will still be there when you're ready. The kernels won't build themselves.
Mike said it best: "The only way to do this is to do our own kernels."
Just make the fucking kernel.