431 Commits

SHA1 Message Date
4b85605edf Fix fp8 amax in decode test 2026-05-19 15:28:17 +00:00
4f23055450 Add decode attention pipeline test — reproduces KV cache bug 2026-05-19 15:27:55 +00:00
31b9cfbdbd Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM. 2026-05-19 15:19:55 +00:00
dca8bfc3a8 Fix _apply_rope_kv: use inline RoPE instead of 3D apply_gptj_rope 2026-05-19 10:36:21 +00:00
8e6721917e Fix syntax in RoPE KV test 2026-05-19 10:31:07 +00:00
cbf440f75a Add RoPE KV test 2026-05-19 10:28:15 +00:00
a5fabbdf66 Apply RoPE to KV in Blackwell attention path - fix NaN output 2026-05-19 10:27:15 +00:00
7e97551fd3 Fix: use self.scale instead of self.softmax_scale in Blackwell attention path 2026-05-19 10:04:46 +00:00
39310c357d Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers 2026-05-19 09:52:23 +00:00
d9cd8fa165 Add debug patch to print layer name mismatch 2026-05-19 09:45:10 +00:00
9a0b015aac Reduce max_model_len to 256 2026-05-19 09:37:38 +00:00
de1fb839f0 Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment) 2026-05-19 09:29:57 +00:00
ea771ff70b Reduce max_model_len to 512 for initial container test 2026-05-19 09:23:10 +00:00
bcfbd1e25b Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache) 2026-05-19 09:13:33 +00:00
e91421f06e Fix KV cache page size patch: separate groups for large SWA pages 2026-05-19 09:05:14 +00:00
dd7f2627e8 Add full model forward test (WIP), sparse attention test passes 2026-05-19 09:04:19 +00:00
9781953509 Add CSA/HCA sparse attention kernel test 2026-05-19 09:02:12 +00:00
d60673864a Fix kv_ref transpose in KV cache test 2026-05-19 08:58:46 +00:00
c1099d76d2 Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression 2026-05-19 08:57:31 +00:00
c54ddbdae1 Fix NVFP4 attention: slice output to actual N after 128-padding 2026-05-19 08:55:31 +00:00
42285b6c24 Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM 2026-05-19 08:54:59 +00:00
9465929e6e Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA) 2026-05-19 08:51:16 +00:00
fa71fbe909 Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes 2026-05-19 08:45:44 +00:00
d08a457829 Fix cos_sin cache shape in NVFP4 attention test 2026-05-19 08:38:55 +00:00
7dd8871e84 Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM 2026-05-19 08:38:25 +00:00
2672e98e4c Remove VLLM_NVFP4_GEMM_BACKEND env var - CuTeDSL auto-selects on Blackwell 2026-05-19 08:35:40 +00:00
914d27fee7 Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks
Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.

Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
2026-05-19 08:26:16 +00:00
7d5c093c99 Fix KV cache crash: skip SWA cache write on Blackwell
The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot,
not 512). Our naive FP8 quant + write had a shape mismatch.

Fix: skip the SWA cache write entirely. The compressor (Triton)
handles the compressed cache. For full SDPA attention, we use the
raw kv tensor directly — we don't need the paged cache at all
during prefill.
2026-05-19 08:21:57 +00:00
e1a642452a Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel
1. DeepseekV4MLAAttention.__init__ had a hard assertion that the
   attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't
   work but we bypass it via _attention_impl_blackwell(). Added
   _is_blackwell flag to skip FlashMLA-specific init (fp8_ds_mla
   cache format conversion).

2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml
   to force CuTeDSL kernel selection for NVFP4 linear layers.

3. Updated register_cutedsl_kernel.py to also register CuTeDSL in
   _NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).
2026-05-19 08:19:23 +00:00
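
A flag like `_is_blackwell` could be derived from the CUDA compute capability. The following is a minimal sketch, not the repository's code: it assumes the capability is a `(major, minor)` tuple of the kind `torch.cuda.get_device_capability()` returns, and that "SM100+" means any capability at or above 10.0. The helper name `is_sm100_or_newer` is hypothetical.

```python
def is_sm100_or_newer(capability):
    """Return True when a (major, minor) compute capability is SM100 or later.

    Hypothetical helper: the repository's actual detection logic is not shown
    in the commit message.
    """
    major, minor = capability
    return major * 10 + minor >= 100


# With PyTorch and a GPU present, the tuple would come from
# torch.cuda.get_device_capability(); fixed values are used here instead.
for cap in [(10, 0), (9, 0)]:
    print(cap, is_sm100_or_newer(cap))
```

Keeping the check a pure function of the tuple makes it testable on machines without a GPU.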
2856323360 Fix torch.compile crash: move Blackwell path inside custom op boundary
The previous approach called _forward_blackwell() BEFORE the
torch.ops.vllm.deepseek_v4_attention custom op, which broke
torch.compile (dynamo can't trace the Python functions).

Fix: instead of modifying forward(), modify attention_impl() which
runs INSIDE the custom op boundary. Detect SM100+ and dispatch to
_attention_impl_blackwell() which uses:
- fused_qnorm_rope_kv_insert_py() instead of C++ kernel
- full_sdpa_attention() instead of FlashMLA

Removed dead _forward_blackwell method from forward().
2026-05-19 08:11:58 +00:00
a782ac00ce Integrate CSA/SDPA attention into vLLM for Blackwell
- Add vllm/patches/layers/csa_attention.py: pure PyTorch replacement
  for FlashMLA + fused CUDA kernels that don't work on SM100
- Patch deepseek_v4_attention.py: detect SM100+ and dispatch to
  _forward_blackwell() which uses:
  1. fused_qnorm_rope_kv_insert_py() instead of C++ kernel
  2. full_sdpa_attention() instead of FlashMLA
  3. BF16 inverse RoPE + BMM for wo_a (same as existing BF16 path)
- Add csa_attention.py to Dockerfile

The Blackwell path:
  GEMM projections (CuTeDSL) → RMS norm → q_b → RoPE (PyTorch) →
  SDPA attention → inverse RoPE + wo_a BMM → wo_b → output
2026-05-19 08:04:07 +00:00
81931614e9 Update CURRENT_BUG: CSA kernel works, plan vLLM integration 2026-05-19 08:02:00 +00:00
9d067add90 Fix device reference in full_attention_reference 2026-05-19 08:01:31 +00:00
3e3e998578 Fix attention: manual causal mask for batched single-query 2026-05-19 08:01:08 +00:00
1e675ccc9a Fix causal mask shape for SDPA: (1,1,T,T) broadcast 2026-05-19 08:00:39 +00:00
57615029a4 Fix KV expand for SDPA: (T,HD) → (T*NH, T, HD) 2026-05-19 08:00:08 +00:00
dd3a12bbda Fix full_attention_reference: broadcast KV to all heads+positions 2026-05-19 07:59:28 +00:00
910015c47e Fix kv shape: expand to (T, NH, HD) before reshape 2026-05-19 07:58:42 +00:00
3de75c4e37 Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe)
Replaces vLLM's broken FlashMLA sparse attention which doesn't work on
SM100 (Blackwell). Uses torch.nn.functional.scaled_dot_product_attention
which works on all GPUs.

Architecture:
- CSA (C128A): Batched sparse gather + SDPA on top-k positions
- HCA (C4A): Same with compressed KV + per-layer indexer
- SWA: Sliding window attention
- Full reference: standard SDPA for testing without compression

Also adds test_csa_attention_b200.py to verify the full attention path.
2026-05-19 07:58:10 +00:00
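
The "sparse gather + SDPA on top-k positions" idea can be illustrated for a single query and a single head. This is a pure-Python sketch rather than the commit's `torch.nn.functional.scaled_dot_product_attention` call; the function names, the list-based tensors, and the default `1/sqrt(d)` scale are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_topk_attention(q, keys, values, topk_idx, scale=None):
    """Attend one query vector to only the KV rows listed in topk_idx."""
    if scale is None:
        scale = 1.0 / math.sqrt(len(q))
    # Gather: keep only the selected key/value rows.
    k_sel = [keys[i] for i in topk_idx]
    v_sel = [values[i] for i in topk_idx]
    # Scaled dot-product scores against the gathered keys.
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_sel]
    weights = softmax(scores)
    # Weighted sum of the gathered values.
    dim = len(v_sel[0])
    return [sum(w * v[d] for w, v in zip(weights, v_sel)) for d in range(dim)]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [9.0, 9.0]]
# Only rows 0 and 2 are selected, so row 3 cannot influence the result.
print(sparse_topk_attention(q, keys, values, topk_idx=[0, 2]))
```

Because attention runs only over the gathered rows, the cost scales with the number of selected positions rather than the full sequence length.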
65f48be38c Add attention path test: pinpoint FlashMLA failure 2026-05-19 07:54:01 +00:00
90d1098935 Update CURRENT_BUG: warmup gs is irrelevant, bug is in vLLM pipeline 2026-05-19 07:51:10 +00:00
04ad6409e5 Rewrite test: diagnose whether warmup gs matters at inference time 2026-05-19 07:49:41 +00:00
496848e158 Fix ffn_hc.scale key name 2026-05-19 07:48:09 +00:00
5a4e355d3a Add model forward test: reproduce vLLM empty output outside container 2026-05-19 07:47:48 +00:00
f5ce728ef2 Fix OOM: add --max-model-len=876544 + revert CPU dummy weight
The CPU dummy weight broke torch.mm(compressor.weight.T) which expects
GPU tensors. Instead, reduce max_model_len to fit KV cache within
available memory (876544 instead of 1048576).
2026-05-19 07:35:43 +00:00
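
The relationship between `max_model_len` and KV cache size can be estimated roughly. The sketch below takes the one data point reported elsewhere in this log (876544 tokens requiring 204 GiB) and assumes the cache grows linearly with context length. That linearity is an assumption: sliding-window and compressed cache groups will not scale exactly this way, so treat the result as a first-order estimate only. The function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(total_gib, max_model_len):
    """Average KV cache bytes per token implied by one measured data point."""
    return total_gib * GIB / max_model_len


def estimate_kv_gib(max_model_len, bytes_per_token):
    """Linear estimate of KV cache size in GiB (assumption: no fixed overhead)."""
    return max_model_len * bytes_per_token / GIB


# Data point from the log: max_model_len=876544 needs about 204 GiB.
per_token = kv_bytes_per_token(204, 876544)
for length in (1048576, 876544, 32768):
    print(length, round(estimate_kv_gib(length, per_token), 2))
```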
79a41d9197 Save ~5-8 GiB GPU VRAM: move dummy weight to CPU
The CuTeDSL kernel never reads layer.weight — it uses the runner's
pre-processed fp4/sf/gs tensors. The dummy BF16 weight exists only for
vLLM model introspection. Moving it to CPU saves massive VRAM:
- q_b_proj alone: 65536*1536*2 = 192 MiB on GPU → ~0 MiB
- All layers combined: ~5-8 GiB saved

This should fix the KV cache OOM (needed 10.28 GiB, had 9.36 GiB).
2026-05-19 07:29:38 +00:00
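
The `q_b_proj` figure above can be reproduced with a short calculation. This is a sketch of the arithmetic only; the helper name is hypothetical, and 2 bytes per element corresponds to BF16.

```python
MIB = 1024 ** 2


def dense_weight_mib(out_features, in_features, bytes_per_element=2):
    """Size in MiB of a dense weight of the given shape (default: BF16)."""
    return out_features * in_features * bytes_per_element / MIB


# q_b_proj shape quoted in the commit message: 65536 x 1536 in BF16.
print(dense_weight_mib(65536, 1536))  # 192.0
```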
cebc586014 Fix OOM: use 1-token warmup sample + free immediately
8 tokens * 7168 hidden * ~40 NVFP4 layers = ~2.3 MiB per layer * 40 = 92 MiB
But the dummy weight param (out_features * in_features * 2 bytes BF16) was
the real killer — each layer allocated a BF16 dummy of its full weight shape.
With 1 token the warmup still gets a valid gs, and empty_cache frees the
sample tensor before KV cache allocation.
2026-05-19 07:28:57 +00:00
5122cadc94 Update CURRENT_BUG.md: root cause found + fix committed 2026-05-19 07:21:30 +00:00
6e6f95dfa8 FIX: Use warmup-based activation global scale in CuTeDSL linear kernel
The checkpoint's input_scale is a calibration-time value that doesn't
match what quantize_activation_nvfp4 expects at runtime. Using it as
the activation global scale produces garbage output (empty EOS tokens).

The fix: run a warmup forward pass with sample data and compute the
activation global scale from the actual activation distribution, exactly
like our standalone test does (which passes with cosine >= 0.994).

This is the root cause of the vLLM server returning empty content.
2026-05-19 07:21:07 +00:00
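
A warmup-derived activation global scale, and the cosine check used to validate it, might look like the sketch below. It assumes the common NVFP4 convention in which the global scale is `(FP8 E4M3 max * FP4 E2M1 max) / amax`, that is `448 * 6 / amax`. Whether `quantize_activation_nvfp4` in this repository uses exactly that formula is not stated in the commit, and both function names here are hypothetical.

```python
import math

FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3


def activation_global_scale(samples):
    """Global scale from the observed absolute maximum of warmup activations.

    Assumed convention: gs = FP8_E4M3_MAX * FP4_E2M1_MAX / amax.
    """
    amax = max(abs(x) for x in samples)
    if amax == 0.0:
        return 1.0  # avoid division by zero on an all-zero sample
    return FP8_E4M3_MAX * FP4_E2M1_MAX / amax


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


warmup = [0.5, -2.0, 1.0]
print(activation_global_scale(warmup))            # 2688 / 2.0
print(cosine_similarity([1.0, 2.0], [1.0, 2.1]))  # close to 1 for similar outputs
```

Comparing the quantized layer's output against a reference with cosine similarity, as the standalone test does with its 0.994 threshold, catches a mismatched scale without requiring exact equality.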
0a7769972f Fix garbled shared_expert_pipeline.py: imports/class were merged 2026-05-19 07:18:10 +00:00