Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.
Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot,
not 512). Our naive FP8 quant + write had a shape mismatch.
Fix: skip the SWA cache write entirely. The compressor (Triton)
handles the compressed cache. For full SDPA attention, we use the
raw kv tensor directly — we don't need the paged cache at all
during prefill.
1. DeepseekV4MLAAttention.__init__ had a hard assertion that the
attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't
work, so we bypass it via _attention_impl_blackwell(). Added an
_is_blackwell flag that skips the assertion and the FlashMLA-specific
init (fp8_ds_mla cache format conversion).
2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml
to force CuTeDSL kernel selection for NVFP4 linear layers.
3. Updated register_cutedsl_kernel.py to also register CuTeDSL in
_NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).
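
The env var override path from items 2 and 3 might look roughly like the
sketch below. Only VLLM_NVFP4_GEMM_BACKEND, the value cutedsl, and the dict
name _NVFP4_BACKEND_TO_KERNEL come from the notes above. The kernel class and
the register_backend / select_kernel helpers are hypothetical stand-ins, not
vLLM's actual API.

```python
import os

# Hypothetical stand-in for the real CuTeDSL NVFP4 linear kernel class.
class CuTeDSLNvfp4Kernel:
    name = "cutedsl"

# Mirrors the dict named in the notes; its real contents live in vLLM.
_NVFP4_BACKEND_TO_KERNEL = {}

def register_backend(name, kernel_cls):
    """Make a kernel selectable through the env var override (hypothetical helper)."""
    _NVFP4_BACKEND_TO_KERNEL[name.lower()] = kernel_cls

def select_kernel(default=None):
    """Return the kernel forced by VLLM_NVFP4_GEMM_BACKEND, else the default."""
    forced = os.environ.get("VLLM_NVFP4_GEMM_BACKEND", "").strip().lower()
    if not forced:
        return default
    if forced not in _NVFP4_BACKEND_TO_KERNEL:
        raise ValueError(f"unknown NVFP4 GEMM backend: {forced!r}")
    return _NVFP4_BACKEND_TO_KERNEL[forced]

register_backend("cutedsl", CuTeDSLNvfp4Kernel)
```

Failing loudly on an unregistered name is a design choice in this sketch: it
surfaces the case where the env var is set but the registration step never ran.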
The previous approach called _forward_blackwell() BEFORE the
torch.ops.vllm.deepseek_v4_attention custom op, which broke
torch.compile (dynamo can't trace the Python functions).
Fix: instead of modifying forward(), modify attention_impl() which
runs INSIDE the custom op boundary. Detect SM100+ and dispatch to
_attention_impl_blackwell() which uses:
- fused_qnorm_rope_kv_insert_py() instead of C++ kernel
- full_sdpa_attention() instead of FlashMLA
Removed the now-dead _forward_blackwell() path from forward().
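
A minimal sketch of the dispatch described above, assuming the compute
capability arrives as a (major, minor) tuple such as the one returned by
torch.cuda.get_device_capability(). The function names and the callables
passed in are illustrative only and do not match the real method signatures.

```python
def is_blackwell(capability):
    """True for compute capability 10.0 (SM100) and newer."""
    major, minor = capability
    return major * 10 + minor >= 100

def attention_impl(capability, blackwell_path, flashmla_path, *args):
    # The choice is made inside the custom op boundary, so the compiler
    # sees one opaque call and never has to trace the Python fallback.
    impl = blackwell_path if is_blackwell(capability) else flashmla_path
    return impl(*args)
```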
Replaces vLLM's FlashMLA sparse attention, which doesn't work on
SM100 (Blackwell), with torch.nn.functional.scaled_dot_product_attention,
which runs on any GPU that PyTorch supports.
Architecture:
- CSA (C128A): Batched sparse gather + SDPA on top-k positions
- HCA (C4A): Same with compressed KV + per-layer indexer
- SWA: Sliding window attention
- Full reference: standard SDPA for testing without compression
Also adds test_csa_attention_b200.py to verify the full attention path.
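
The "sparse gather + SDPA on top-k positions" step can be illustrated for a
single query vector in plain Python. This is only a sketch of the math under
simplifying assumptions (one head, no batching, no masking). The real path is
batched and calls torch.nn.functional.scaled_dot_product_attention, and the
function name here is hypothetical.

```python
import math

def sparse_topk_attention(q, keys, values, topk_idx):
    """Attend one query vector over only the selected KV positions."""
    k_sel = [keys[i] for i in topk_idx]      # gather top-k keys
    v_sel = [values[i] for i in topk_idx]    # gather the matching values
    scale = 1.0 / math.sqrt(len(q))          # standard 1/sqrt(d) scaling
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_sel]
    m = max(scores)                          # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(v_sel[0])
    return [sum(w * v[j] for w, v in zip(weights, v_sel)) / total
            for j in range(dim)]
```

Positions outside topk_idx never enter the softmax, so they contribute nothing
to the output regardless of their scores.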
The CPU dummy weight broke torch.mm(compressor.weight.T) which expects
GPU tensors. Instead, reduce max_model_len to fit KV cache within
available memory (876544 instead of 1048576).
The CuTeDSL kernel never reads layer.weight — it uses the runner's
pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for
vLLM model introspection. Moving it to CPU saves massive VRAM:
- q_b_proj alone: 65536*1536*2 = 192 MiB on GPU → ~0 MiB
- All layers combined: ~5-8 GiB saved
This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).
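
The figures above can be checked with a few lines of arithmetic. The helper
name is made up for illustration; the shapes and GiB values are the ones
quoted in this message.

```python
MIB = 1024 ** 2

def bf16_weight_bytes(out_features, in_features):
    """Size of a dense BF16 weight: 2 bytes per element."""
    return out_features * in_features * 2

q_b_proj_mib = bf16_weight_bytes(65536, 1536) / MIB   # 192.0
kv_cache_shortfall_gib = 10.28 - 9.36                  # about 0.92
```

The shortfall is under 1 GiB, so an estimated 5-8 GiB saving would cover it
with room to spare.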
8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3M elements in total (a few
MiB), so the warmup sample itself is small.
But the dummy weight param (out_features * in_features * 2 bytes BF16) was
the real killer: each layer allocated a BF16 dummy of its full weight shape.
With 1 token the warmup still gets a valid gs, and empty_cache frees the
sample tensor before KV cache allocation.
The checkpoint's input_scale is a calibration-time value that doesn't
match what quantize_activation_nvfp4 expects at runtime. Using it as
the activation global scale produces garbage output (empty EOS tokens).
The fix: run a warmup forward pass with sample data and compute the
activation global scale from the actual activation distribution, exactly
like our standalone test does (which passes with cosine similarity >= 0.994).
This is the root cause of the vLLM server returning empty content.
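
A rough sketch of deriving the activation global scale from warmup data.
It assumes the common NVFP4 convention of global scale = (FP8 E4M3 max *
FP4 E2M1 max) / amax; whether quantize_activation_nvfp4 expects exactly this
form is an assumption, and both function names are hypothetical. The cosine
helper mirrors the kind of check the standalone test applies.

```python
import math

FP8_E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
FP4_E2M1_MAX = 6.0     # largest FP4 E2M1 magnitude

def activation_global_scale(sample):
    """Global scale from the observed absolute maximum of warmup activations."""
    amax = max(abs(x) for x in sample)
    if amax == 0.0:
        raise ValueError("warmup sample is all zeros")
    return FP8_E4M3_MAX * FP4_E2M1_MAX / amax

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Because the scale depends only on the observed amax, a calibration-time value
taken from a different activation distribution can be off by a large factor,
which is consistent with the garbage output described above.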