Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.
Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot,
not 512). Our naive FP8 quant + write had a shape mismatch.
Fix: skip the SWA cache write entirely. The compressor (Triton)
handles the compressed cache. For full SDPA attention, we use the
raw kv tensor directly — we don't need the paged cache at all
during prefill.
1. DeepseekV4MLAAttention.__init__ had a hard assertion that the
attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't
work but we bypass it via _attention_impl_blackwell(). Added
_is_blackwell flag to skip FlashMLA-specific init (fp8_ds_mla
cache format conversion).
2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml
to force CuTeDSL kernel selection for NVFP4 linear layers.
3. Updated register_cutedsl_kernel.py to also register CuTeDSL in
_NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).
The previous approach called _forward_blackwell() BEFORE the
torch.ops.vllm.deepseek_v4_attention custom op, which broke
torch.compile (dynamo can't trace the Python functions).
Fix: instead of modifying forward(), modify attention_impl() which
runs INSIDE the custom op boundary. Detect SM100+ and dispatch to
_attention_impl_blackwell() which uses:
- fused_qnorm_rope_kv_insert_py() instead of C++ kernel
- full_sdpa_attention() instead of FlashMLA
Removed dead _forward_blackwell method from forward().
Replaces vLLM's broken FlashMLA sparse attention which doesn't work on
SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention
which works on all GPUs.
Architecture:
- CSA (C128A): Batched sparse gather + SDPA on top-k positions
- HCA (C4A): Same with compressed KV + per-layer indexer
- SWA: Sliding window attention
- Full reference: standard SDPA for testing without compression
Also adds test_csa_attention_b200.py to verify the full attention path.
The CPU dummy weight broke torch.mm(compressor.weight.T) which expects
GPU tensors. Instead, reduce max_model_len to fit KV cache within
available memory (876544 instead of 1048576).
The CuTeDSL kernel never reads layer.weight — it uses the runner's
pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for
vLLM model introspection. Moving it to CPU saves massive VRAM:
- q_b_proj alone: 65536*1536*2 = 192 MiB on GPU → ~0 MiB
- All layers combined: ~5-8 GiB saved
This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).
8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3 MiB per layer * 40 = 92 MiB
But the dummy weight param (out_features * in_features * 2 bytes BF16) was
the real killer — each layer allocated a BF16 dummy of its full weight shape.
With 1 token the warmup still gets a valid gs, and empty_cache frees the
sample tensor before KV cache allocation.
The checkpoint's input_scale is a calibration-time value that doesn't
match what quantize_activation_nvfp4 expects at runtime. Using it as
the activation global scale produces garbage output (empty EOS tokens).
The fix: run a warmup forward pass with sample data and compute the
activation global scale from the actual activation distribution, exactly
like our standalone test does (which passes with cosine >= 0.994).
This is the root cause of the vLLM server returning empty content.
Empty output still happening. Documented what's been tried, what works
standalone, what we don't know, and the plan to bypass vLLM's kernel
selection entirely by calling our runners directly.
The file at ffc2264 already had our BF16 wo_a path (_apply_inv_rope_bf16 +
BMM + all-gather) with FP8 fallback. I was replacing it from the wrong
vllm source, losing all prior work. Restored to the known-good version.
Previous version copied the entire file from our local vllm clone which
had imports (breakable_cudagraph) missing from the Docker image's vllm.
Now we start from the Docker image's original file and only patch the
DeepseekV4MultiHeadLatentAttentionWrapper.forward method.
The original attention forward uses fused_inv_rope_fp8_quant +
deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights
and weight_scale_inv. Our checkpoint has wo_a in BF16, so the
original path crashes (produces empty output).
Replace O projection with:
1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8)
2. BMM grouped linear for wo_a (BF16)
3. NVFP4 wo_b via CuTeDSL
Also fixes activation global scale bug from previous commit:
- input_global_scale_inv IS the activation gs, don't re-invert
- w13_input_scale_orig (after undoing convert) IS the MoE gs
Test: tests/test_o_projection.py validates inv RoPE roundtrip
and wo_a BMM correctness.
The activation global scale = amax / (6.0 * 448.0). Both the linear
kernel and MoE kernel were taking 1.0 / (value that's already the
correct gs), inverting it and producing garbage quantization.
Linear kernel: input_global_scale_inv IS the gs, so use it directly.
MoE kernel: w13_input_scale_orig (after undoing convert inversion) IS
the gs, so use it directly.
The nightly vLLM image puts ALL MHC code in layers/mhc.py (not kernels/mhc/).
It imports tilelang at top level and JIT-compiles kernels.
Replace the entire file with pure PyTorch implementations using
direct_register_custom_op for mhc_pre, mhc_post, mhc_fused_post_pre,
and hc_head_fused_kernel. No tilelang dependency at all.
Also removes the separate mhc_torch_ops.py and kernels/mhc/ patches
which don't apply to the nightly image layout.
The layers/mhc.py was trying to import kernels.mhc.torch which
failed because our __init__.py was breaking the package. Instead,
just import our mhc_torch_ops which has everything we need.
Also fix __init__.py to explicitly import mhc_pre_torch and
mhc_post_torch from .torch instead of using import *.
The original layers/mhc.py forward_cuda calls
torch.ops.vllm.mhc_pre_tilelang which triggers TileLang JIT.
Replace with our torch implementations in forward_cuda.
This is what the CustomOp dispatch routes through.
Previous approach used @CustomOp.register which doesn't create
torch.ops.vllm.mhc_pre. The model code calls torch.ops.vllm.mhc_pre()
directly, which requires direct_register_custom_op.
Use direct_register_custom_op to register mhc_pre, mhc_post,
mhc_fused_post_pre, and hc_head_fused_kernel as PyTorch custom ops
with torch (eager) implementations.
Patch kernels/mhc/__init__.py to import from both .torch (original)
and .mhc_torch_ops (our replacements), skipping tilelang import.
TileLang kernels (mhc_pre_big_fuse_tilelang, mhc_fused_tilelang) don't
work correctly on Blackwell SM100 and cause empty model output.
Replace with pure PyTorch implementations:
- mhc_pre_torch: Sinkhorn-normalized HC residual mixing
- mhc_post_torch: HC post block (einsum residual + post layer mix)
- mhc_fused_post_pre_torch: Fused post+pre (composition of above)
- hc_head_fused_torch: RMS norm + linear + sigmoid + weighted sum
Patch both layers/mhc.py (CustomOp dispatch) and kernels/mhc/__init__.py
(no tilelang import). Also remove tilelang from pyproject.toml deps.
The framework's deep_gemm_warmup calls get_fused_moe_quant_config
which accesses w13_input_scale etc. Setting them to None caused
TypeError: float / NoneType. Keep scales (small tensors) and only
free the large weight tensors.
K comes from hidden_states.size(-1) which is the full BF16 dimension
(7168), not the packed weight dimension. K*2=14336 is wrong.
The MoE output is always hidden_dim (7168).