Commit Graph

719 Commits

Author SHA1 Message Date
ea771ff70b Reduce max_model_len to 512 for initial container test 2026-05-19 09:23:10 +00:00
bcfbd1e25b Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache) 2026-05-19 09:13:33 +00:00
e91421f06e Fix KV cache page size patch: separate groups for large SWA pages 2026-05-19 09:05:14 +00:00
dd7f2627e8 Add full model forward test (WIP), sparse attention test passes 2026-05-19 09:04:19 +00:00
9781953509 Add CSA/HCA sparse attention kernel test 2026-05-19 09:02:12 +00:00
d60673864a Fix kv_ref transpose in KV cache test 2026-05-19 08:58:46 +00:00
c1099d76d2 Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression 2026-05-19 08:57:31 +00:00
c54ddbdae1 Fix NVFP4 attention: slice output to actual N after 128-padding 2026-05-19 08:55:31 +00:00
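Note on the fix above: the pad-then-slice pattern can be sketched as follows. This is a minimal illustration, not the repository's code. `round_up` and `run_padded` are hypothetical helpers, and the doubling step merely stands in for the real GEMM kernel.

```python
def round_up(n: int, multiple: int = 128) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def run_padded(values, multiple=128):
    """Pad to the tile size, run a stand-in kernel, slice back to the true length."""
    n = len(values)
    padded = values + [0.0] * (round_up(n, multiple) - n)
    result = [2.0 * v for v in padded]  # stand-in for the real kernel (assumption)
    return result[:n]  # keep only the first n entries, dropping the padding


out = run_padded([1.0, 2.0, 3.0])
```

Without the final slice, the caller would receive 128 entries for a 3-element input, which is the kind of mismatch the commit title describes.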
42285b6c24 Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM 2026-05-19 08:54:59 +00:00
9465929e6e Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA) 2026-05-19 08:51:16 +00:00
fa71fbe909 Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes 2026-05-19 08:45:44 +00:00
d08a457829 Fix cos_sin cache shape in NVFP4 attention test 2026-05-19 08:38:55 +00:00
7dd8871e84 Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM 2026-05-19 08:38:25 +00:00
2672e98e4c Remove VLLM_NVFP4_GEMM_BACKEND env var - CuTeDSL auto-selects on Blackwell 2026-05-19 08:35:40 +00:00
914d27fee7 Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks
Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.

Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
2026-05-19 08:26:16 +00:00
7d5c093c99 Fix KV cache crash: skip SWA cache write on Blackwell
The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot,
not 512). Our naive FP8 quant + write had a shape mismatch.

Fix: skip the SWA cache write entirely. The compressor (Triton)
handles the compressed cache. For full SDPA attention, we use the
raw kv tensor directly — we don't need the paged cache at all
during prefill.
2026-05-19 08:21:57 +00:00
e1a642452a Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel
1. DeepseekV4MLAAttention.__init__ had a hard assertion that the
   attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't
   work but we bypass it via _attention_impl_blackwell(). Added
   _is_blackwell flag to skip FlashMLA-specific init (fp8_ds_mla
   cache format conversion).

2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml
   to force CuTeDSL kernel selection for NVFP4 linear layers.

3. Updated register_cutedsl_kernel.py to also register CuTeDSL in
   _NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).
2026-05-19 08:19:23 +00:00
2856323360 Fix torch.compile crash: move Blackwell path inside custom op boundary
The previous approach called _forward_blackwell() BEFORE the
torch.ops.vllm.deepseek_v4_attention custom op, which broke
torch.compile (dynamo can't trace the Python functions).

Fix: instead of modifying forward(), modify attention_impl() which
runs INSIDE the custom op boundary. Detect SM100+ and dispatch to
_attention_impl_blackwell() which uses:
- fused_qnorm_rope_kv_insert_py() instead of C++ kernel
- full_sdpa_attention() instead of FlashMLA

Removed dead _forward_blackwell method from forward().
2026-05-19 08:11:58 +00:00
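Note on the dispatch described above: a minimal sketch of an SM100+ check, assuming the device capability arrives as a `(major, minor)` tuple (with PyTorch this would typically come from `torch.cuda.get_device_capability()`). The function names and the returned labels are hypothetical; how the repository actually detects Blackwell is not shown in this log.

```python
def is_sm100_or_newer(capability):
    """True when the (major, minor) compute capability is SM100 or above."""
    major, minor = capability
    return major * 10 + minor >= 100


def select_attention_path(capability):
    """Pick an attention path label; the labels are illustrative only."""
    if is_sm100_or_newer(capability):
        return "sdpa_blackwell"  # pure-PyTorch path, per the commit message
    return "flashmla"            # default path on older architectures


path_b200 = select_attention_path((10, 0))
path_h100 = select_attention_path((9, 0))
```

Keeping the check inside the function that the custom op calls, rather than in `forward()`, is what the commit credits for restoring `torch.compile` compatibility.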
a782ac00ce Integrate CSA/SDPA attention into vLLM for Blackwell
- Add vllm/patches/layers/csa_attention.py: pure PyTorch replacement
  for FlashMLA + fused CUDA kernels that don't work on SM100
- Patch deepseek_v4_attention.py: detect SM100+ and dispatch to
  _forward_blackwell() which uses:
  1. fused_qnorm_rope_kv_insert_py() instead of C++ kernel
  2. full_sdpa_attention() instead of FlashMLA
  3. BF16 inverse RoPE + BMM for wo_a (same as existing BF16 path)
- Add csa_attention.py to Dockerfile

The Blackwell path:
  GEMM projections (CuTeDSL) → RMS norm → q_b → RoPE (PyTorch) →
  SDPA attention → inverse RoPE + wo_a BMM → wo_b → output
2026-05-19 08:04:07 +00:00
81931614e9 Update CURRENT_BUG: CSA kernel works, plan vLLM integration 2026-05-19 08:02:00 +00:00
9d067add90 Fix device reference in full_attention_reference 2026-05-19 08:01:31 +00:00
3e3e998578 Fix attention: manual causal mask for batched single-query 2026-05-19 08:01:08 +00:00
1e675ccc9a Fix causal mask shape for SDPA: (1,1,T,T) broadcast 2026-05-19 08:00:39 +00:00
57615029a4 Fix KV expand for SDPA: (T,HD) → (T*NH, T, HD) 2026-05-19 08:00:08 +00:00
dd3a12bbda Fix full_attention_reference: broadcast KV to all heads+positions 2026-05-19 07:59:28 +00:00
910015c47e Fix kv shape: expand to (T, NH, HD) before reshape 2026-05-19 07:58:42 +00:00
3de75c4e37 Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe)
Replaces vLLM's broken FlashMLA sparse attention which doesn't work on
SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention
which works on all GPUs.

Architecture:
- CSA (C128A): Batched sparse gather + SDPA on top-k positions
- HCA (C4A): Same with compressed KV + per-layer indexer
- SWA: Sliding window attention
- Full reference: standard SDPA for testing without compression

Also adds test_csa_attention_b200.py to verify the full attention path.
2026-05-19 07:58:10 +00:00
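Note on "sparse gather + SDPA on top-k positions": the idea can be sketched for a single query in plain Python. This is an illustration under assumptions, not the kernel from the commit. `gathered_attention` is a hypothetical name, and the selected `indices` are taken as given because the log does not say how they are produced.

```python
import math


def gathered_attention(q, keys, values, indices):
    """Single-query attention over only the key/value rows listed in `indices`."""
    scale = 1.0 / math.sqrt(len(q))
    # Sparse gather: keep only the selected positions.
    k_sel = [keys[i] for i in indices]
    v_sel = [values[i] for i in indices]
    # Scaled dot-product scores, then a numerically stable softmax.
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_sel]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(v_sel[0])
    return [sum(w / z * v[d] for w, v in zip(weights, v_sel)) for d in range(dim)]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 5.0]]
values = [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]
out = gathered_attention(q, keys, values, indices=[0, 1])
```

Position 2 is never read because it is not in `indices`; the two selected keys score identically, so the output is the average of their value rows.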
65f48be38c Add attention path test: pinpoint FlashMLA failure 2026-05-19 07:54:01 +00:00
90d1098935 Update CURRENT_BUG: warmup gs is irrelevant, bug is in vLLM pipeline 2026-05-19 07:51:10 +00:00
04ad6409e5 Rewrite test: diagnose whether warmup gs matters at inference time 2026-05-19 07:49:41 +00:00
496848e158 Fix ffn_hc.scale key name 2026-05-19 07:48:09 +00:00
5a4e355d3a Add model forward test: reproduce vLLM empty output outside container 2026-05-19 07:47:48 +00:00
f5ce728ef2 Fix OOM: add --max-model-len=876544 + revert CPU dummy weight
The CPU dummy weight broke torch.mm(compressor.weight.T) which expects
GPU tensors. Instead, reduce max_model_len to fit KV cache within
available memory (876544 instead of 1048576).
2026-05-19 07:35:43 +00:00
79a41d9197 Save ~5-8 GiB GPU VRAM: move dummy weight to CPU
The CuTeDSL kernel never reads layer.weight — it uses the runner's
pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for
vLLM model introspection. Moving it to CPU saves massive VRAM:
- q_b_proj alone: 65536*1536*2 = 192 MiB on GPU → ~0 MiB
- All layers combined: ~5-8 GiB saved

This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).
2026-05-19 07:29:38 +00:00
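Note on the figures above: the arithmetic in this message can be checked with a short sketch. `bf16_weight_bytes` is a hypothetical helper; the shapes and GiB figures are the ones quoted in the commit.

```python
MIB = 1024 ** 2


def bf16_weight_bytes(out_features: int, in_features: int) -> int:
    """Size of a dense BF16 weight: 2 bytes per element."""
    return out_features * in_features * 2


# q_b_proj dummy weight, using the shape quoted in the commit message.
q_b_proj_mib = bf16_weight_bytes(65536, 1536) / MIB

# KV cache shortfall quoted in the commit: needed 10.28 GiB, had 9.36 GiB.
shortfall_gib = 10.28 - 9.36
```

The shortfall is just under 1 GiB, so freeing even a fraction of the estimated 5-8 GiB would cover it if the estimate holds.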
cebc586014 Fix OOM: use 1-token warmup sample + free immediately
8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3M sample elements in total, only a few MiB.
The dummy weight param (out_features * in_features * 2 bytes BF16) was
the real killer: each layer allocated a BF16 dummy of its full weight shape.
With 1 token the warmup still gets a valid gs, and empty_cache frees the
sample tensor before KV cache allocation.
2026-05-19 07:28:57 +00:00
5122cadc94 Update CURRENT_BUG.md: root cause found + fix committed 2026-05-19 07:21:30 +00:00
6e6f95dfa8 FIX: Use warmup-based activation global scale in CuTeDSL linear kernel
The checkpoint's input_scale is a calibration-time value that doesn't
match what quantize_activation_nvfp4 expects at runtime. Using it as
the activation global scale produces garbage output (empty EOS tokens).

The fix: run a warmup forward pass with sample data and compute the
activation global scale from the actual activation distribution, exactly
like our standalone test does (which passes with cosine >= 0.994).

This is the root cause of the vLLM server returning empty content.
2026-05-19 07:21:07 +00:00
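Note on the "cosine >= 0.994" criterion: a check of this kind can be sketched as below. The two vectors are made-up sample values, not outputs from the repository's standalone test, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up values standing in for a BF16 reference and a quantized output.
reference = [0.10, -0.52, 0.33, 0.91]
candidate = [0.11, -0.50, 0.30, 0.93]
passes = cosine_similarity(reference, candidate) >= 0.994
```

Cosine similarity ignores overall magnitude, so a check like this catches direction errors in the output (such as garbage from a wrong scale) rather than a uniform gain error.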
0a7769972f Fix garbled shared_expert_pipeline.py: imports/class were merged 2026-05-19 07:18:10 +00:00
87453a53b0 Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj 2026-05-19 07:17:37 +00:00
f97762cc9f Fix full layer test: use correct checkpoint key names
Checkpoint uses q_a_proj/q_b_proj/kv_proj/q_a_norm — NOT the vLLM
fused names (fused_wqa_wkv, wq_b, q_norm).
2026-05-19 07:16:33 +00:00
cc48a5715e Add full layer 0 B200 test: CuTeDSL vs BF16 reference
Tests each attention/FFN projection individually against BF16 dequantized
reference, then runs full layer forward. Identifies exactly where garbage
enters the pipeline.

Key finding: checkpoint uses different names than vLLM:
- q_a_proj, q_b_proj, kv_proj (not fused_wqa_wkv)
- q_a_norm (not q_norm)
- compressor.* (C4A layers only)
- sinks (attn_sink)
2026-05-19 07:14:58 +00:00
dbaa3d6fe6 Update CURRENT_BUG.md and README with current state
Empty output still happening. Documented what's been tried, what works
standalone, what we don't know, and the plan to bypass vLLM's kernel
selection entirely by calling our runners directly.
2026-05-19 07:05:45 +00:00
62abf41b03 Revert deepseek_v4_attention.py to ffc2264 — don't nuke existing patches
The file at ffc2264 already had our BF16 wo_a path (_apply_inv_rope_bf16 +
BMM + all-gather) with FP8 fallback. I was replacing it from the wrong
vllm source, losing all prior work. Restored to the known-good version.
2026-05-19 06:52:40 +00:00
4c2effa2be Fix attention patch: source from v0.21.0 stable, not local clone
The local vllm clone has different imports (breakable_cudagraph) that
don't exist in the Docker image. Now sourced from v0.21.0 tag.
2026-05-19 06:44:59 +00:00
284b6a5d57 Fix attention patch: use original vllm imports, only patch forward method
Previous version copied the entire file from our local vllm clone which
had imports (breakable_cudagraph) missing from the Docker image's vllm.
Now we start from the Docker image's original file and only patch the
DeepseekV4MultiHeadLatentAttentionWrapper.forward method.
2026-05-19 06:40:58 +00:00
199efe0871 Fix dims: o_groups=16, o_lora_rank=1024 from config 2026-05-19 06:37:25 +00:00
b4fee70151 Fix device mismatch in test 2026-05-19 06:36:22 +00:00
6b4b9774d1 Add B200 test: prove O-projection root cause + validate fix 2026-05-19 06:32:54 +00:00
77baca668e Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b
The original attention forward uses fused_inv_rope_fp8_quant +
deepseek_v4_fp8_einsum which requires wo_a to have FP8 weights
and weight_scale_inv. Our checkpoint has wo_a in BF16, so the
original path crashes (produces empty output).

Replace O projection with:
1. _apply_inv_rope_bf16: pure PyTorch inverse RoPE (no FP8)
2. BMM grouped linear for wo_a (BF16)
3. NVFP4 wo_b via CuTeDSL

Also fixes activation global scale bug from previous commit:
- input_global_scale_inv IS the activation gs, don't re-invert
- w13_input_scale_orig (after undoing convert) IS the MoE gs

Test: tests/test_o_projection.py validates inv RoPE roundtrip
and wo_a BMM correctness.
2026-05-19 06:30:18 +00:00
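Note on the inverse RoPE roundtrip: the property that a test like tests/test_o_projection.py is said to validate can be sketched in plain Python. This assumes an interleaved pair layout and a base of 10000, neither of which is confirmed by the log; `rope` is a hypothetical function, not `_apply_inv_rope_bf16`.

```python
import math


def rope(x, pos, base=10000.0, inverse=False):
    """Rotate consecutive pairs of x by position-dependent angles.

    inverse=True applies the opposite rotation, undoing a forward call.
    """
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    sign = -1.0 if inverse else 1.0
    out = [0.0] * d
    for i in range(d // 2):
        angle = sign * pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


x = [0.5, -1.25, 2.0, 0.75]
roundtrip = rope(rope(x, pos=7), pos=7, inverse=True)
max_err = max(abs(a - b) for a, b in zip(x, roundtrip))
```

In float64 the roundtrip error is at rounding level; in BF16 a looser tolerance would be needed.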
ffc2264c41 Fix activation global scale: don't double-invert input_global_scale_inv
The activation global scale = amax / (6.0 * 448.0). Both the linear
kernel and MoE kernel were taking 1.0 / (value that's already the
correct gs), inverting it and producing garbage quantization.

Linear kernel: input_global_scale_inv IS the gs, so use it directly.
MoE kernel: w13_input_scale_orig (after undoing convert inversion) IS
the gs, so use it directly.
2026-05-19 06:03:08 +00:00
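Note on the double inversion: a small numeric sketch shows why it is destructive. The formula is the one stated in the commit message; the function names and the sample `amax` are hypothetical.

```python
FP4_MAX = 6.0    # largest FP4 (E2M1) magnitude, as used in the commit's formula
FP8_MAX = 448.0  # largest FP8 (E4M3) magnitude, as used in the commit's formula


def activation_global_scale(amax: float) -> float:
    """Global scale as stated in the commit: amax / (6.0 * 448.0)."""
    return amax / (FP4_MAX * FP8_MAX)


def buggy_scale(stored_gs: float) -> float:
    """The bug: inverting a value that is already the global scale."""
    return 1.0 / stored_gs


def fixed_scale(stored_gs: float) -> float:
    """The fix: use the stored value directly."""
    return stored_gs


amax = 13.44  # made-up activation maximum
gs = activation_global_scale(amax)
ratio = buggy_scale(gs) / fixed_scale(gs)
```

For this sample the inverted value is off by a factor of tens of thousands, consistent with the "garbage quantization" the commit reports.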