biondizzle/nvfp4-megamoe-kernel

biondizzle 31b9cfbdbd Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.

2026-05-19 15:19:55 +00:00

CURRENT_BUG.md

Status: Container starts and the model generates tokens, but the output is GARBAGE (empty strings, NaN logprobs)

THE FUNDAMENTAL PROBLEM

Mike was right — we need our own kernels. Not just for the NVFP4 GEMMs, but for the ENTIRE attention pipeline. The current approach of patching individual vLLM functions is a house of cards.

Here's what happened: we spent hours patching vLLM to "work" on Blackwell. We patched:

VLLM_NVFP4_GEMM_BACKEND=cutedsl → invalid, removed env var
KV cache page size assertion → patched kv_cache_utils.py
91 missing compressor cache layers → patched alignment in 3 cache specs
softmax_scale AttributeError → fixed to self.scale
NaN from missing RoPE on KV → added _apply_rope_kv()
Shape mismatch in apply_gptj_rope → rewrote as inline RoPE

After ALL of that, the container starts and runs, but the model output is GARBAGE: empty strings, NaN logprobs, zero meaningful text. The reason is that the attention pipeline is fundamentally broken:

The KV cache is never written to (the C++ kernel is FlashMLA-only)
The attention reads from raw projection output, not the cache
The compressor/indexer runs but the Blackwell path doesn't integrate with it
Everything is deeply coupled — patch one thing, three more break

THE ONLY FIX: Build CuTeDSL kernels for the ENTIRE attention pipeline.

Do NOT try to patch vLLM's FlashMLA code. Do NOT use pure PyTorch as a workaround. Do NOT skip the KV cache write. BUILD THE KERNELS.

Container Crash History (May 19)

Each crash was "fixed" with a patch. Each patch led to the next crash. This is the house of cards:

VLLM_NVFP4_GEMM_BACKEND=cutedsl — invalid choice in envs.py → removed env var
assert max(sm_page_sizes) <= max(all_page_sizes) — KV cache page size mismatch → patched kv_cache_utils.py
Some layers are not correctly initialized — 91 missing compressor cache layers (alignment=576 wrong on Blackwell) → patched SWA, indexer, compressor cache specs
AttributeError: softmax_scale — wrapper uses self.scale not self.softmax_scale → fixed
200 GiB KV cache for 512 tokens → reduced max_model_len to 256, patched cache specs to remove FlashMLA alignment
NaN output (logprobs) → KV wasn't getting RoPE → added _apply_rope_kv()
Shape mismatch in apply_gptj_rope → rewrote as inline 2D RoPE
Garbage/empty output — the attention pipeline is fundamentally broken
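
For reference when debugging the RoPE-related crashes above, here is a minimal pure-Python sketch of interleaved RoPE, which is what "GPT-J style" commonly refers to: adjacent pairs (x[2i], x[2i+1]) are rotated by a position-dependent angle. The function names and the base of 10000 are assumptions for illustration, not values taken from the model config or from vLLM.

```python
import math

def rope_interleaved(vec, position, base=10000.0):
    """Rotate adjacent pairs (vec[2i], vec[2i+1]) by position-dependent angles.

    ASSUMPTION: base=10000.0 is the common default, not read from the model config.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    out = [0.0] * dim
    for i in range(dim // 2):
        # Frequency decreases geometrically with the pair index.
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

def rope_interleaved_inverse(vec, position, base=10000.0):
    """Undo rope_interleaved by rotating with the negated position."""
    return rope_interleaved(vec, -position, base)
```

Two cheap sanity checks for any kernel version of this: position 0 must be the identity, and the rotation must preserve the vector norm.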

What Actually Works (standalone B200 venv tests)

Every kernel except the NVFP4 Q×K^T GEMM works when tested individually. The problem is ONLY in the vLLM integration.

| Kernel | Test File | Result |
|---|---|---|
| CuTeDSL NVFP4 Linear | `test_full_layer_b200.py` | cosine 0.994+ ✅ |
| CuTeDSL NVFP4 MoE | `layertest.py` | cosine 0.988 ✅ |
| FP8 KV quantize/dequant | `test_kv_cache_b200.py` | cosine 0.9997 ✅ |
| NVFP4 KV quantize/dequant | `test_kv_cache_b200.py` | cosine 0.9943 ✅ |
| Paged KV cache read/write | `test_kv_cache_b200.py` | cosine 1.0 ✅ |
| FP8 KV → full attention | `test_kv_cache_b200.py` | cosine 0.9997 ✅ |
| CSA sparse attention (cr=4) | `test_sparse_attn_b200.py` | works, no NaN ✅ |
| HCA sparse attention (cr=128) | `test_sparse_attn_b200.py` | works, no NaN ✅ |
| Merged CSA+SWA attention | `test_sparse_attn_b200.py` | works, no NaN ✅ |
| Full pipeline (all layer types) | `test_v4_attention_b200.py` | cosine 0.981-0.995 ✅ |
| NVFP4 Q×K^T GEMM | `test_nvfp4_attn_gemm_b200.py` | cosine 0.86 ❌ (too lossy) |
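
The cosine numbers above compare a kernel's flattened output against a reference. A minimal sketch of that check in pure Python follows; the helper names and the 0.99 default threshold are illustrative assumptions, and the actual test files may compute it differently (for example with torch).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero output counts as a mismatch
    return dot / (norm_a * norm_b)

def passes(output, reference, threshold=0.99):
    """True if the output is NaN-free and close enough in direction to the reference."""
    sim = cosine_similarity(output, reference)
    return (not math.isnan(sim)) and sim >= threshold
```

Cosine similarity ignores overall scale: an output that is off by a constant factor still scores 1.0, so a separate magnitude or NaN check is worth keeping alongside it.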

Key Lessons (READ THESE OR REPEAT THE SAME MISTAKES)

NVFP4 is NOT suitable for attention Q×K^T. The dot products behind the attention scores are too sensitive to 4-bit quantization error: the standalone test reached only cosine 0.86. Keep attention in BF16, use NVFP4 only for weight GEMMs.
DeepSeek-V4 is NOT MLA. It uses CSA (Compressed Sparse Attention) + HCA (Heavily Compressed Attention). vLLM misnames everything "MLA" internally — don't be confused by class names like DeepseekV4MLAAttention.
The fp8_ds_mla format is FlashMLA-specific. 584 bytes per token (448 NoPE FP8 + 128 RoPE FP8 + 8 scale). This is NOT a standard fp8 tensor. You can't just view() it as [slot, 512] uint8.
The SWA cache, indexer cache, and compressor cache all use alignment=576 for FlashMLA. On Blackwell, this must be None (no FlashMLA). There are 4 separate classes that set this, and you must patch ALL of them.
DeepseekV4MultiHeadLatentAttentionWrapper registers ITSELF (not the inner MLA attention) in static_forward_context. The custom op deepseek_v4_attention looks up the wrapper. So attention_impl must be on the WRAPPER, and it must use self.scale (not self.softmax_scale).
The Triton compressor and indexer DO work on Blackwell. They're not the problem. The problem is that the Blackwell attention path doesn't integrate with them.
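
To make the fp8_ds_mla point concrete, here is a small sketch that slices one 584-byte token record into its three parts. The byte counts come from the lesson above. The field order (NoPE, then RoPE, then scale) and the assumption that each token's record is contiguous are guesses for illustration and must be checked against the real FlashMLA layout before being relied on.

```python
NOPE_BYTES = 448   # NoPE part, FP8 (byte count from the lesson above)
ROPE_BYTES = 128   # RoPE part (byte count from the lesson above)
SCALE_BYTES = 8    # scale bytes (byte count from the lesson above)
TOKEN_BYTES = NOPE_BYTES + ROPE_BYTES + SCALE_BYTES  # 584 bytes per token

def split_token_record(cache, slot):
    """Return (nope, rope, scale) byte slices for one cache slot.

    ASSUMPTION: records are contiguous per token and packed in the order
    NoPE, RoPE, scale. The real format may order or group fields differently.
    """
    if slot < 0:
        raise IndexError("slot must be non-negative")
    start = slot * TOKEN_BYTES
    record = cache[start:start + TOKEN_BYTES]
    if len(record) != TOKEN_BYTES:
        raise IndexError("slot is outside the cache buffer")
    nope = record[:NOPE_BYTES]
    rope = record[NOPE_BYTES:NOPE_BYTES + ROPE_BYTES]
    scale = record[NOPE_BYTES + ROPE_BYTES:]
    return nope, rope, scale
```

Whatever the true field order, the stride is 584 bytes, so indexing the buffer as `[slot, 512]` lands on the wrong bytes for every slot after the first.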

THE PLAN: Build CuTeDSL Attention Backend

STOP. Do NOT touch the vLLM container. Build and test kernels on the B200 venv first.

Step 1: KV Cache Write Kernel

BF16 KV → apply RoPE → fp8 quantize → write to paged cache
Test in tests/test_kv_cache_write_b200.py:
- Write KV for N tokens, read it back, compare against BF16 reference
- Must handle: slot mapping, block_size, fp8 per-token scale
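
As a reference for what the write test needs to cover, the sketch below models slot mapping and per-token scaling in pure Python. It assumes the usual paged-cache convention `slot = block * block_size + offset` and a per-token scale of `amax / 448` (448 is the largest finite FP8 E4M3 value). RoPE and the actual rounding to FP8 are deliberately left out, and all function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def slot_to_block_offset(slot, block_size):
    """Map a flat slot index to (block index, offset inside the block)."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    return divmod(slot, block_size)

def per_token_scale(values):
    """Scale that maps the token's largest magnitude onto the FP8 range."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0

def write_tokens(cache, scales, kv_rows, slot_mapping, block_size):
    """Store scaled rows and their scales at the slots given by slot_mapping."""
    for row, slot in zip(kv_rows, slot_mapping):
        block, offset = slot_to_block_offset(slot, block_size)
        scale = per_token_scale(row)
        # A real kernel would round each scaled value to FP8 here.
        cache[block][offset] = [v / scale for v in row]
        scales[block][offset] = scale

def read_token(cache, scales, slot, block_size):
    """Dequantize one token by multiplying the stored values by its scale."""
    block, offset = slot_to_block_offset(slot, block_size)
    scale = scales[block][offset]
    return [q * scale for q in cache[block][offset]]
```

A write-then-read round trip through this model is exact up to float error. A real FP8 path adds rounding error on top, which is why the kernel tests compare by cosine against a threshold rather than by equality.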

Step 2: KV Cache Read Kernel

Paged cache → fp8 dequant → BF16 KV with RoPE
Test: write then read, cosine >= 0.99

Step 3: BF16 Attention Kernel

Q (with RoPE) × K^T → causal mask → softmax → attn × V
Keep in BF16 (NVFP4 too lossy for attention scores)
Handle CSA sparse (gather top-k positions from compressed cache)
Handle HCA sparse (gather from 1/128 positions)
Handle SWA (sliding window, full causal within window)
Test: compare against PyTorch SDPA, cosine >= 0.99
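
For small shapes, a slow but transparent reference can make mismatches easier to localize than a call into SDPA. The sketch below is a single-head, pure-Python reference for Q×K^T → causal mask → softmax → attn × V, with an optional sliding window. The window definition (the last `window` keys, including the query's own position) is an assumption, and the CSA/HCA gather is not modeled.

```python
import math

def reference_attention(q, k, v, scale=None, window=None):
    """Single-head causal attention over T tokens.

    q, k: lists of T float vectors of equal length; v: list of T float vectors.
    ASSUMPTION: window=W lets query i see keys i-W+1 .. i (inclusive).
    """
    head_dim = len(q[0])
    if scale is None:
        scale = 1.0 / math.sqrt(head_dim)
    out = []
    for i in range(len(q)):
        first = 0 if window is None else max(0, i - window + 1)
        keys = range(first, i + 1)  # causal mask: only keys j <= i
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        row = [0.0] * len(v[0])
        for w, j in zip(weights, keys):
            for c in range(len(row)):
                row[c] += (w / total) * v[j][c]
        out.append(row)
    return out
```

Two properties are handy as kernel tests: the first token's output must equal `v[0]`, and with `window=1` every output row must equal the matching row of `v`.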

Step 4: Full Pipeline Integration

KV cache read → attention → inverse RoPE → o_a BMM → o_b NVFP4 projection
Wire CSA/HCA/SWA with sink weight merge
Test: compare full pipeline against BF16 reference, cosine >= 0.98
Test: run through ALL 61 layers, verify logits are reasonable (std between 0.5 and 50)
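
The "logits are reasonable" check can be written down explicitly so it is not judged by eye. A minimal sketch, using the 0.5 to 50 bounds from the step above, follows. The function name and the choice of population standard deviation are assumptions.

```python
import math
import statistics

def logits_look_reasonable(logits, min_std=0.5, max_std=50.0):
    """Reject NaN/Inf logits and logits whose spread is collapsed or exploded."""
    if len(logits) < 2:
        return False
    if any(math.isnan(x) or math.isinf(x) for x in logits):
        return False
    # ASSUMPTION: population std; sample std differs only slightly at vocab size.
    std = statistics.pstdev(logits)
    return min_std <= std <= max_std
```

This catches the two failure modes seen so far (NaN logprobs and empty output from collapsed logits). It does not prove the text is meaningful, so real prompts still need to be checked in Step 6.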

Step 5: vLLM Attention Backend

Create a proper AttentionBackend subclass (e.g., CuTeDSLBlackwellBackend)
Override DeepseekSparseSWABackend on Blackwell
Handle metadata, slot mapping, cache format properly
ONLY THEN test in the container

Step 6: Test in Container

Build container with the new backend
Test with real prompts
If output is garbage, DO NOT declare success. Fix it.

Architecture: CSA + HCA + mHC (NOT MLA!)

CSA (Compress Ratio 4): Compressed Sparse Attention — KV compressed 4x with overlap (coff=2). Indexer finds per-layer top-k.
HCA (Compress Ratio 128): Heavily Compressed Attention — KV compressed 128x.
mHC: Manifold-Constrained Hyper-Connections — replaces standard residual connections.
SWA: Sliding Window Attention — local window (compress_ratio=0, last layer only)
KV latent: (T, 512) shared across all 128 heads. After kv_norm, it's a single 512-dim vector per token.
Sink weights: Merge sparse + SWA attention outputs. sigmoid(sinks) controls the mix.

B200 Server Details

Server: root@45.76.247.107 (password: <B200_PASSWORD>)
Repo: /root/nvfp4-megamoe-kernel (branch: proper-nvfp4-integration)
Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
Venv: /root/nvfp4-megamoe-kernel/tests/venv/
Test command: PYTHONPATH=/root/nvfp4-megamoe-kernel tests/venv/bin/python tests/<test>.py
Container: docker compose up (takes ~14 min to build, ~10 min to warm up)

PLEASE LISTEN

If you're the next session reading this: BUILD THE KERNELS. Test them standalone. Don't waste time patching vLLM. The container will still be there when you're ready. The kernels won't build themselves.

Mike said it best: "The only way to do this is to do our own kernels."

Just make the fucking kernel.
