CURRENT_BUG.md

Status: Container starts, model generates tokens, but output is GARBAGE (empty/NaN)

THE FUNDAMENTAL PROBLEM

Mike was right — we need our own kernels. Not just for the NVFP4 GEMMs, but for the ENTIRE attention pipeline. The current approach of patching individual vLLM functions is a house of cards.

Here's what happened: we spent hours patching vLLM to "work" on Blackwell. We patched:

  1. VLLM_NVFP4_GEMM_BACKEND=cutedsl → invalid, removed env var
  2. KV cache page size assertion → patched kv_cache_utils.py
  3. 91 missing compressor cache layers → patched alignment in 3 cache specs
  4. softmax_scale AttributeError → fixed to self.scale
  5. NaN from missing RoPE on KV → added _apply_rope_kv()
  6. Shape mismatch in apply_gptj_rope → rewrote as inline RoPE

After ALL of that, the container starts and runs. But the model output is GARBAGE: empty strings, NaN logprobs, zero meaningful text. The reason is that the attention pipeline is fundamentally broken:

  • The KV cache is never written to (the C++ kernel is FlashMLA-only)
  • The attention reads from raw projection output, not the cache
  • The compressor/indexer runs but the Blackwell path doesn't integrate with it
  • Everything is deeply coupled — patch one thing, three more break

THE ONLY FIX: Build CuTeDSL kernels for the ENTIRE attention pipeline.

Do NOT try to patch vLLM's FlashMLA code. Do NOT use pure PyTorch as a workaround. Do NOT skip the KV cache write. BUILD THE KERNELS.

Container Crash History (May 19)

Each crash was "fixed" with a patch. Each patch led to the next crash. This is the house of cards:

  1. VLLM_NVFP4_GEMM_BACKEND=cutedsl — invalid choice in envs.py → removed env var
  2. assert max(sm_page_sizes) <= max(all_page_sizes) — KV cache page size mismatch → patched kv_cache_utils.py
  3. Some layers are not correctly initialized — 91 missing compressor cache layers (alignment=576 wrong on Blackwell) → patched SWA, indexer, compressor cache specs
  4. AttributeError: softmax_scale — wrapper uses self.scale not self.softmax_scale → fixed
  5. 200 GiB KV cache for 512 tokens → reduced max_model_len to 256, patched cache specs to remove FlashMLA alignment
  6. NaN output (logprobs) → KV wasn't getting RoPE → added _apply_rope_kv()
  7. Shape mismatch in apply_gptj_rope → rewrote as inline 2D RoPE
  8. Garbage/empty output — the attention pipeline is fundamentally broken

What Actually Works (standalone B200 venv tests)

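Every kernel runs correctly when tested in isolation. The one numerical failure is the NVFP4 Q×K^T GEMM in the last row, which runs but is too lossy. The remaining problem is ONLY in the vLLM integration.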

| Kernel | Test File | Result |
| --- | --- | --- |
| CuTeDSL NVFP4 Linear | test_full_layer_b200.py | cosine 0.994+ |
| CuTeDSL NVFP4 MoE | layertest.py | cosine 0.988 |
| FP8 KV quantize/dequant | test_kv_cache_b200.py | cosine 0.9997 |
| NVFP4 KV quantize/dequant | test_kv_cache_b200.py | cosine 0.9943 |
| Paged KV cache read/write | test_kv_cache_b200.py | cosine 1.0 |
| FP8 KV → full attention | test_kv_cache_b200.py | cosine 0.9997 |
| CSA sparse attention (cr=4) | test_sparse_attn_b200.py | works, no NaN |
| HCA sparse attention (cr=128) | test_sparse_attn_b200.py | works, no NaN |
| Merged CSA+SWA attention | test_sparse_attn_b200.py | works, no NaN |
| Full pipeline (all layer types) | test_v4_attention_b200.py | cosine 0.981-0.995 |
| NVFP4 Q×K^T GEMM | test_nvfp4_attn_gemm_b200.py | cosine 0.86 (too lossy) |
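The "cosine" figures above are cosine similarities between a kernel's output and a reference. The test files themselves are not reproduced here, so the following is only a minimal sketch of the metric over flattened outputs; the function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length flat sequences of floats.

    Hypothetical helper, not copied from the test files. A NaN anywhere in
    either input propagates to the result, so a NaN output can never pass
    a `>= 0.99` threshold check.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with an all-zero vector as 0
    return dot / (norm_a * norm_b)
```

Cosine similarity is scale-invariant: an output that is wrong by a constant factor still scores 1.0, so it may be worth pairing it with a magnitude check.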

Key Lessons (READ THESE OR REPEAT THE SAME MISTAKES)

  1. NVFP4 is NOT suitable for attention Q×K^T. Attention scores are too sensitive to 4-bit quantization error: the standalone Q×K^T test reaches only cosine 0.86. Keep attention in BF16, use NVFP4 only for weight GEMMs.

  2. DeepSeek-V4 is NOT MLA. It uses CSA (Compressed Sparse Attention) + HCA (Heavily Compressed Attention). vLLM misnames everything "MLA" internally — don't be confused by class names like DeepseekV4MLAAttention.

  3. The fp8_ds_mla format is FlashMLA-specific. 584 bytes per token (448 NoPE FP8 + 128 RoPE FP8 + 8 scale). This is NOT a standard fp8 tensor. You can't just view() it as [slot, 512] uint8.

  4. The SWA cache, indexer cache, and compressor cache all use alignment=576 for FlashMLA. On Blackwell, this must be None (no FlashMLA). There are 4 separate classes that set this, and you must patch ALL of them.

  5. DeepseekV4MultiHeadLatentAttentionWrapper registers ITSELF (not the inner MLA attention) in static_forward_context. The custom op deepseek_v4_attention looks up the wrapper. So attention_impl must be on the WRAPPER, and it must use self.scale (not self.softmax_scale).

  6. The Triton compressor and indexer DO work on Blackwell. They're not the problem. The problem is that the Blackwell attention path doesn't integrate with them.
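To illustrate lesson 3, here is a sketch of why a 584-byte record cannot be addressed with a 512-byte stride. It ASSUMES the three fields are stored contiguously per token, in the order the lesson lists them (NoPE, RoPE, scale). The real fp8_ds_mla layout may order or group them differently, so treat the offsets as illustrative only. All names are hypothetical.

```python
# Field sizes from lesson 3. The field ORDER and per-token contiguity below
# are assumptions made for illustration, not the confirmed FlashMLA layout.
NOPE_BYTES = 448
ROPE_BYTES = 128
SCALE_BYTES = 8
RECORD_BYTES = NOPE_BYTES + ROPE_BYTES + SCALE_BYTES  # 584

def split_record(record):
    """Split one token's record into (nope, rope, scale) byte strings."""
    if len(record) != RECORD_BYTES:
        raise ValueError(f"expected {RECORD_BYTES} bytes, got {len(record)}")
    nope = record[:NOPE_BYTES]
    rope = record[NOPE_BYTES:NOPE_BYTES + ROPE_BYTES]
    scale = record[NOPE_BYTES + ROPE_BYTES:]
    return nope, rope, scale

def record_at(cache_bytes, slot):
    """Fetch the record for `slot` using the 584-byte stride."""
    start = slot * RECORD_BYTES
    end = start + RECORD_BYTES
    if slot < 0 or end > len(cache_bytes):
        raise IndexError(f"slot {slot} is out of range")
    return split_record(cache_bytes[start:end])
```

Under these assumptions slot 1 starts at byte 584. Reading it at byte 512 lands inside slot 0's RoPE field, which is the failure mode of viewing the buffer as [slot, 512].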

THE PLAN: Build CuTeDSL Attention Backend

STOP. Do NOT touch the vLLM container. Build and test kernels on the B200 venv first.

Step 1: KV Cache Write Kernel

  • BF16 KV → apply RoPE → fp8 quantize → write to paged cache
  • Test in tests/test_kv_cache_write_b200.py:
    • Write KV for N tokens, read it back, compare against BF16 reference
    • Must handle: slot mapping, block_size, fp8 per-token scale
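The write path in Step 1 can be sketched on the CPU before any kernel exists. This is a reference model only, under several assumptions: a slot maps to `(slot // block_size, slot % block_size)`, negative slots mark padding and are skipped, there is one scale per token, and FP8 means E4M3 with a maximum finite value of 448. FP8 rounding itself is NOT modeled, and all names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value; assumes E4M3 is the target format

def make_cache(num_blocks, block_size):
    """Allocate an empty paged cache and a matching per-token scale table."""
    cache = [[None] * block_size for _ in range(num_blocks)]
    scales = [[None] * block_size for _ in range(num_blocks)]
    return cache, scales

def per_token_scale(vec):
    """Scale that maps the token's largest magnitude onto the FP8 range."""
    amax = max(abs(v) for v in vec)
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0

def write_kv(cache, scales, kv, slot_mapping, block_size):
    """Scale each token and store it at its slot. A real kernel would also
    apply RoPE first and round the scaled values to FP8."""
    for token, slot in zip(kv, slot_mapping):
        if slot < 0:
            continue  # assumed padding convention
        block, offset = divmod(slot, block_size)
        s = per_token_scale(token)
        cache[block][offset] = [v / s for v in token]
        scales[block][offset] = s

def read_kv(cache, scales, slot_mapping, block_size):
    """Inverse of write_kv: gather by slot and undo the scaling."""
    out = []
    for slot in slot_mapping:
        block, offset = divmod(slot, block_size)
        s = scales[block][offset]
        out.append([v * s for v in cache[block][offset]])
    return out
```

A write-then-read round trip through this model gives the expected values to compare a real kernel against, up to FP8 rounding error.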

Step 2: KV Cache Read Kernel

  • Paged cache → fp8 dequant → BF16 KV with RoPE
  • Test: write then read, cosine >= 0.99
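Steps 1 and 2 both depend on RoPE being applied consistently, and the pipeline later needs its inverse. The sketch below is a reference for GPT-J-style (interleaved-pair) rotation. It assumes the common frequency schedule `base ** (-2i / d)` with `base = 10000`; the model's actual base and any frequency scaling are not stated in this document and may differ.

```python
import math

def rope_interleaved(x, pos, base=10000.0, inverse=False):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by the angle pos * base**(-2i/d).

    GPT-J-style interleaved layout. `base` is an assumed default, not a
    value taken from the model config. `inverse=True` rotates by the
    negated angle, which undoes a previous forward rotation.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE dimension must be even")
    sign = -1.0 if inverse else 1.0
    out = [0.0] * d
    for i in range(d // 2):
        theta = sign * pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Two properties are cheap to assert in a kernel test: the rotation preserves the vector norm, and forward followed by inverse at the same position returns the input.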

Step 3: BF16 Attention Kernel

  • Q (with RoPE) × K^T → causal mask → softmax → attn × V
  • Keep in BF16 (NVFP4 too lossy for attention scores)
  • Handle CSA sparse (gather top-k positions from compressed cache)
  • Handle HCA sparse (gather from 1/128 positions)
  • Handle SWA (sliding window, full causal within window)
  • Test: compare against PyTorch SDPA, cosine >= 0.99
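For the dense parts of Step 3 (causal mask, softmax, optional sliding window), a tiny single-head reference is useful for hand-checkable cases alongside the SDPA comparison. This sketch does NOT model the CSA/HCA top-k gather or the sink weights, and the function name and `window` parameter are hypothetical.

```python
import math

def reference_attention(q, k, v, scale, window=None):
    """Single-head causal attention over lists of T vectors.

    Token i attends to tokens j <= i. If `window` is given, only the last
    `window` tokens (including i itself) are visible. Pure-Python reference,
    meant for small shapes only.
    """
    out = []
    for i in range(len(q)):
        start = 0 if window is None else max(0, i - window + 1)
        visible = range(start, i + 1)
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in visible]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        denom = sum(exps)
        row = [0.0] * len(v[0])
        for e, j in zip(exps, visible):
            weight = e / denom
            for d in range(len(row)):
                row[d] += weight * v[j][d]
        out.append(row)
    return out
```

Two edge cases are worth pinning down in tests: the first token can only see itself, so its output equals `v[0]`, and with `window=1` every output row equals the corresponding row of `v`.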

Step 4: Full Pipeline Integration

  • KV cache read → attention → inverse RoPE → o_a BMM → o_b NVFP4 projection
  • Wire CSA/HCA/SWA with sink weight merge
  • Test: compare full pipeline against BF16 reference, cosine >= 0.98
  • Test: run through ALL 61 layers, verify logits are reasonable (std between 0.5 and 50)
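The "logits are reasonable" criterion above can be written down as an explicit check. This is a minimal sketch over a flat list of logits; it uses the population standard deviation, which is an assumption (a sample standard deviation would give slightly different values on short inputs), and the function name is hypothetical.

```python
import math
import statistics

def logits_look_reasonable(logits, lo=0.5, hi=50.0):
    """Return True if all logits are finite and their std lies in [lo, hi].

    The default bounds come from the Step 4 criterion. Any NaN or inf
    fails the check outright rather than silently skewing the std.
    """
    if not logits:
        return False
    if any(not math.isfinite(x) for x in logits):
        return False
    return lo <= statistics.pstdev(logits) <= hi
```

A near-zero std suggests the signal collapsed somewhere in the stack, while a very large one suggests a scale blew up; both would previously have surfaced only as garbage text.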

Step 5: vLLM Attention Backend

  • Create a proper AttentionBackend subclass (e.g., CuTeDSLBlackwellBackend)
  • Override DeepseekSparseSWABackend on Blackwell
  • Handle metadata, slot mapping, cache format properly
  • ONLY THEN test in the container

Step 6: Test in Container

  • Build container with the new backend
  • Test with real prompts
  • If output is garbage, DO NOT declare success. Fix it.
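To keep "garbage" from being judged by eye, the symptoms listed in the status line (empty output, NaN logprobs) can be turned into a check. The sketch assumes the caller has already extracted the generated text and a list of per-token logprobs from the response; how to obtain them is not shown, and the function name is hypothetical.

```python
import math

def is_garbage(text, logprobs):
    """Detect the failure symptoms described in this document.

    Flags empty or whitespace-only text, a missing logprob list, and any
    non-finite logprob. Passing this check is necessary but not sufficient:
    fluent-looking nonsense would still pass.
    """
    if not text or not text.strip():
        return True
    if not logprobs:
        return True
    return any(not math.isfinite(lp) for lp in logprobs)
```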

Architecture: CSA + HCA + mHC (NOT MLA!)

  • CSA (Compress Ratio 4): Compressed Sparse Attention — KV compressed 4x with overlap (coff=2). Indexer finds per-layer top-k.
  • HCA (Compress Ratio 128): Heavily Compressed Attention — KV compressed 128x.
  • mHC: Manifold-Constrained Hyper-Connections — replaces standard residual connections.
  • SWA: Sliding Window Attention — local window (compress_ratio=0, last layer only)
  • KV latent: (T, 512) shared across all 128 heads. After kv_norm, it's a single 512-dim vector per token.
  • Sink weights: Merge sparse + SWA attention outputs. sigmoid(sinks) controls the mix.

B200 Server Details

Server: root@45.76.247.107 (password: <B200_PASSWORD>)
Repo: /root/nvfp4-megamoe-kernel (branch: proper-nvfp4-integration)
Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
Venv: /root/nvfp4-megamoe-kernel/tests/venv/
Test command: PYTHONPATH=/root/nvfp4-megamoe-kernel tests/venv/bin/python tests/<test>.py
Container: docker compose up (takes ~14 min to build, ~10 min to warm up)

PLEASE LISTEN

If you're the next session reading this: BUILD THE KERNELS. Test them standalone. Don't waste time patching vLLM. The container will still be there when you're ready. The kernels won't build themselves.

Mike said it best: "The only way to do this is to do our own kernels."

Just make the fucking kernel.