Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.
Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot,
not 512). Our naive FP8 quant + write had a shape mismatch.
Fix: skip the SWA cache write entirely. The compressor (Triton)
handles the compressed cache. For full SDPA attention, we use the
raw kv tensor directly — we don't need the paged cache at all
during prefill.
1. DeepseekV4MLAAttention.__init__ had a hard assertion that the
attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't
work but we bypass it via _attention_impl_blackwell(). Added
_is_blackwell flag to skip FlashMLA-specific init (fp8_ds_mla
cache format conversion).
2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml
to force CuTeDSL kernel selection for NVFP4 linear layers.
3. Updated register_cutedsl_kernel.py to also register CuTeDSL in
_NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).
The previous approach called _forward_blackwell() BEFORE the
torch.ops.vllm.deepseek_v4_attention custom op, which broke
torch.compile (dynamo can't trace the Python functions).
Fix: instead of modifying forward(), modify attention_impl() which
runs INSIDE the custom op boundary. Detect SM100+ and dispatch to
_attention_impl_blackwell() which uses:
- fused_qnorm_rope_kv_insert_py() instead of C++ kernel
- full_sdpa_attention() instead of FlashMLA
Removed dead _forward_blackwell method from forward().
Replaces vLLM's broken FlashMLA sparse attention which doesn't work on
SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention
which works on all GPUs.
Architecture:
- CSA (C128A): Batched sparse gather + SDPA on top-k positions
- HCA (C4A): Same with compressed KV + per-layer indexer
- SWA: Sliding window attention
- Full reference: standard SDPA for testing without compression
Also adds test_csa_attention_b200.py to verify the full attention path.
The CPU dummy weight broke torch.mm(compressor.weight.T) which expects
GPU tensors. Instead, reduce max_model_len to fit KV cache within
available memory (876544 instead of 1048576).
The CuTeDSL kernel never reads layer.weight — it uses the runner's
pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for
vLLM model introspection. Moving it to CPU saves massive VRAM:
- q_b_proj alone: 65536*1536*2 = 192 MiB on GPU → ~0 MiB
- All layers combined: ~5-8 GiB saved
This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).
8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3 MiB per layer * 40 = 92 MiB
But the dummy weight param (out_features * in_features * 2 bytes BF16) was
the real killer — each layer allocated a BF16 dummy of its full weight shape.
With 1 token the warmup still gets a valid gs, and empty_cache frees the
sample tensor before KV cache allocation.
The checkpoint's input_scale is a calibration-time value that doesn't
match what quantize_activation_nvfp4 expects at runtime. Using it as
the activation global scale produces garbage output (empty EOS tokens).
The fix: run a warmup forward pass with sample data and compute the
activation global scale from the actual activation distribution, exactly
like our standalone test does (which passes with cosine >= 0.994).
This is the root cause of the vLLM server returning empty content.
Empty output still happening. Documented what's been tried, what works
standalone, what we don't know, and the plan to bypass vLLM's kernel
selection entirely by calling our runners directly.
The file at ffc2264 already had our BF16 wo_a path (_apply_inv_rope_bf16 +
BMM + all-gather) with FP8 fallback. I was replacing it from the wrong
vllm source, losing all prior work. Restored to the known-good version.
Previous version copied the entire file from our local vllm clone which
had imports (breakable_cudagraph) missing from the Docker image's vllm.
Now we start from the Docker image's original file and only patch the
DeepseekV4MultiHeadLatentAttentionWrapper.forward method.
The original attention forward uses fused_inv_rope_fp8_quant +
deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights
and weight_scale_inv. Our checkpoint has wo_a in BF16, so the
original path crashes (produces empty output).
Replace O projection with:
1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8)
2. BMM grouped linear for wo_a (BF16)
3. NVFP4 wo_b via CuTeDSL
Also fixes activation global scale bug from previous commit:
- input_global_scale_inv IS the activation gs, don't re-invert
- w13_input_scale_orig (after undoing convert) IS the MoE gs
Test: tests/test_o_projection.py validates inv RoPE roundtrip
and wo_a BMM correctness.
The activation global scale = amax / (6.0 * 448.0). Both the linear
kernel and MoE kernel were taking 1.0 / (value that's already the
correct gs), inverting it and producing garbage quantization.
Linear kernel: input_global_scale_inv IS the gs, so use it directly.
MoE kernel: w13_input_scale_orig (after undoing convert inversion) IS
the gs, so use it directly.