835e1a0590
Fix f-string syntax
2026-05-19 17:26:40 +00:00
9c30168202
Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode)
2026-05-19 17:26:10 +00:00
8f80991fdf
CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale
2026-05-19 17:08:58 +00:00
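A note on the fix above: with per-token quantization, each cached token carries its own scale, so decode has to multiply each token's stored values by that token's factor rather than by one shared constant. The following is a minimal sketch under stated assumptions: `inv_scale` is taken to be the factor that maps stored values back to real values (the commit does not define the name), integer rounding stands in for real fp8 e4m3 rounding, and both helper functions are hypothetical, not vLLM APIs.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in fp8 e4m3


def quantize_per_token(tokens):
    """Quantize each token (a list of floats) with its own scale.

    Integer rounding is a stand-in for fp8 rounding (assumption).
    Returns the quantized tokens and one inv_scale per token.
    """
    q_tokens, inv_scales = [], []
    for tok in tokens:
        amax = max(abs(x) for x in tok)
        # Assumed convention: real_value ~= stored_value * inv_scale.
        inv_scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, round(x / inv_scale)))
             for x in tok]
        q_tokens.append(q)
        inv_scales.append(inv_scale)
    return q_tokens, inv_scales


def dequantize_per_token(q_tokens, inv_scales):
    """Recover approximate real values using each token's own inv_scale."""
    return [[q * s for q in tok] for tok, s in zip(q_tokens, inv_scales)]


tokens = [[0.1, -0.2, 0.05], [100.0, -50.0, 25.0]]
q_tokens, inv_scales = quantize_per_token(tokens)
restored = dequantize_per_token(q_tokens, inv_scales)
```

The two example tokens differ in magnitude by roughly a factor of 500, which is the case where a single shared scale would lose the small token almost entirely.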
d67d8613af
FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode
2026-05-19 16:55:44 +00:00
3b204c4772
Fix UnboundLocalError: move num_decode_tokens before debug print
2026-05-19 16:43:28 +00:00
30890b621d
CRITICAL FIX: Skip compressor fused attention kernel on Blackwell — it bypasses our attention path
2026-05-19 16:35:07 +00:00
b8e2cf61ad
Add debug logging to Blackwell attention path
2026-05-19 16:31:55 +00:00
d7f686bcfc
Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn
2026-05-19 16:19:28 +00:00
114da83090
Add CSA/HCA decode + prefill attention to Blackwell path
2026-05-19 16:06:24 +00:00
2cc1910c45
Fix N for C128A (need 128 tokens)
2026-05-19 16:04:53 +00:00
cea453cbab
Fix compressor key name
2026-05-19 16:04:38 +00:00
04f2b2d8d4
Add CSA sparse attention test (compressed KV gather + SWA merge)
2026-05-19 16:04:19 +00:00
4c6464e7e0
Update CURRENT_BUG: KV cache pipeline verified, all tests passing
2026-05-19 16:01:10 +00:00
be8566a443
Add decode vs prefill consistency test
2026-05-19 16:00:33 +00:00
2ddd3d0702
Test with all 61 layers (shared experts only)
2026-05-19 15:55:41 +00:00
842e6e1381
Fix view→reshape for non-contiguous tensor
2026-05-19 15:54:40 +00:00
f0f8d8211b
Add e2e decode test (3 layers: C128A, C4A, SWA)
2026-05-19 15:53:29 +00:00
255913fba4
Vectorize paged KV cache read/write, kill container
2026-05-19 15:48:16 +00:00
8b2cb41160
Fix KV cache: write to paged cache, handle uint8→fp8 conversion, fix RoPE bug
2026-05-19 15:34:09 +00:00
6ceb05327f
Add blackwell_attention module and comprehensive test
2026-05-19 15:30:29 +00:00
85c74e5932
Fix attention for decode (1 query vs N cached KVs)
2026-05-19 15:28:52 +00:00
85099c7e75
Fix fp8 amax in decode test
2026-05-19 15:28:17 +00:00
c66b0b88c0
Add decode attention pipeline test — reproduces KV cache bug
2026-05-19 15:27:55 +00:00
836fa75b93
Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.
2026-05-19 15:19:55 +00:00
dca8bfc3a8
Fix _apply_rope_kv: use inline RoPE instead of 3D apply_gptj_rope
2026-05-19 10:36:21 +00:00
8e6721917e
Fix syntax in RoPE KV test
2026-05-19 10:31:07 +00:00
cbf440f75a
Add RoPE KV test
2026-05-19 10:28:15 +00:00
a5fabbdf66
Apply RoPE to KV in Blackwell attention path - fix NaN output
2026-05-19 10:27:15 +00:00
7e97551fd3
Fix: use self.scale instead of self.softmax_scale in Blackwell attention path
2026-05-19 10:04:46 +00:00
39310c357d
Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers
2026-05-19 09:52:23 +00:00
d9cd8fa165
Add debug patch to print layer name mismatch
2026-05-19 09:45:10 +00:00
9a0b015aac
Reduce max_model_len to 256
2026-05-19 09:37:38 +00:00
de1fb839f0
Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment)
2026-05-19 09:29:57 +00:00
ea771ff70b
Reduce max_model_len to 512 for initial container test
2026-05-19 09:23:10 +00:00
bcfbd1e25b
Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache)
2026-05-19 09:13:33 +00:00
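The reduction in the commit above can be sanity-checked with simple scaling. This is a rough sketch that uses only the two figures from the commit subject and assumes the KV cache grows linearly with `max_model_len` (an assumption: sliding-window and compressed layers will not scale uniformly in practice). The helper names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(total_gib: float, max_model_len: int) -> float:
    """Average KV cache bytes per token implied by a total size and length."""
    return total_gib * GIB / max_model_len


def estimate_kv_gib(max_model_len: int, bytes_per_token: float) -> float:
    """Linear estimate of KV cache size in GiB (assumption: linear growth)."""
    return max_model_len * bytes_per_token / GIB


# Figures from the commit subject: 876544 tokens -> 204 GiB.
per_token = kv_bytes_per_token(204, 876_544)
reduced = estimate_kv_gib(32_768, per_token)
print(f"{per_token / 1024:.0f} KiB per token, {reduced:.2f} GiB at 32768 tokens")
```

Under that linear assumption the 32768-token setting needs a few GiB rather than hundreds, which is consistent with the direction of the change.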
e91421f06e
Fix KV cache page size patch: separate groups for large SWA pages
2026-05-19 09:05:14 +00:00
dd7f2627e8
Add full model forward test (WIP), sparse attention test passes
2026-05-19 09:04:19 +00:00
9781953509
Add CSA/HCA sparse attention kernel test
2026-05-19 09:02:12 +00:00
d60673864a
Fix kv_ref transpose in KV cache test
2026-05-19 08:58:46 +00:00
c1099d76d2
Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression
2026-05-19 08:57:31 +00:00
c54ddbdae1
Fix NVFP4 attention: slice output to actual N after 128-padding
2026-05-19 08:55:31 +00:00
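The pattern behind the fix above: when a kernel only accepts lengths that are a multiple of 128, the input is padded up to the next multiple and the output must be cut back to the real length N. A minimal sketch in plain Python follows; `tiled_kernel` is a hypothetical stand-in for the real NVFP4 kernel, and only the tile size of 128 comes from the commit.

```python
TILE = 128  # alignment named in the commit subject


def round_up(n: int, multiple: int = TILE) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


def tiled_kernel(row):
    """Hypothetical stand-in: rejects inputs that are not tile-aligned."""
    if len(row) % TILE != 0:
        raise ValueError("length must be a multiple of 128")
    return [2.0 * x for x in row]


def run_padded(row):
    """Pad to the tile size, run the kernel, slice back to the real N."""
    n = len(row)
    padded = row + [0.0] * (round_up(n) - n)
    out = tiled_kernel(padded)
    return out[:n]  # without this slice the caller sees padded entries


result = run_padded([1.0, 2.0, 3.0])
```

Forgetting the final slice leaves the padded tail in the result, which then shows up downstream as a shape mismatch or as spurious zero entries.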
42285b6c24
Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM
2026-05-19 08:54:59 +00:00
9465929e6e
Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)
2026-05-19 08:51:16 +00:00
fa71fbe909
Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes
2026-05-19 08:45:44 +00:00
d08a457829
Fix cos_sin cache shape in NVFP4 attention test
2026-05-19 08:38:55 +00:00
7dd8871e84
Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM
2026-05-19 08:38:25 +00:00
2672e98e4c
Remove VLLM_NVFP4_GEMM_BACKEND env var - CuTeDSL auto-selects on Blackwell
2026-05-19 08:35:40 +00:00
914d27fee7
Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks
Mike's directive: build the full thing with NVFP4/CuTeDSL.
No more 'optimize later' or 'just make it work' workarounds.
Key updates:
- README: full architecture docs (CSA/HCA/mHC), current status, NVFP4 coverage
- CURRENT_BUG: detailed plan for CuTeDSL NVFP4 attention, KV cache, RoPE
- Both files document: checkpoint key names, compress ratios, config issues
- Removed all 'TODO: optimize later' hedging — we build it right the first time
2026-05-19 08:26:16 +00:00
7d5c093c99
Fix KV cache crash: skip SWA cache write on Blackwell
The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot,
not 512). Our naive FP8 quant + write had a shape mismatch.
Fix: skip the SWA cache write entirely. The compressor (Triton)
handles the compressed cache. For full SDPA attention, we use the
raw kv tensor directly — we don't need the paged cache at all
during prefill.
2026-05-19 08:21:57 +00:00
e1a642452a
Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel
1. DeepseekV4MLAAttention.__init__ had a hard assertion that the
attention backend MUST be FlashMLA. On Blackwell, FlashMLA doesn't
work but we bypass it via _attention_impl_blackwell(). Added
_is_blackwell flag to skip FlashMLA-specific init (fp8_ds_mla
cache format conversion).
2. Added VLLM_NVFP4_GEMM_BACKEND=cutedsl env var to docker-compose.yml
to force CuTeDSL kernel selection for NVFP4 linear layers.
3. Updated register_cutedsl_kernel.py to also register CuTeDSL in
_NVFP4_BACKEND_TO_KERNEL dict (for the env var override path).
2026-05-19 08:19:23 +00:00
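Item 1 of the commit above describes gating a backend assertion behind a platform flag. The sketch below shows that shape only. The class is a hypothetical simplification, not the vLLM source: the constructor arguments, the `kv_cache_format` attribute, and `_attention_impl_flashmla` are assumptions, while `_is_blackwell`, `_attention_impl_blackwell`, and `fp8_ds_mla` are names taken from the commit message.

```python
class DeepseekV4MLAAttentionSketch:
    """Hypothetical simplification of the init logic the commit describes."""

    def __init__(self, backend_name: str, is_blackwell: bool):
        self._is_blackwell = is_blackwell
        if self._is_blackwell:
            # Blackwell: skip FlashMLA-specific init entirely.
            self.kv_cache_format = None
        else:
            # Other GPUs keep the hard requirement on FlashMLA.
            if backend_name != "FlashMLA":
                raise AssertionError("attention backend must be FlashMLA")
            self.kv_cache_format = "fp8_ds_mla"

    def forward(self, q):
        # Dispatch to the platform-specific attention path.
        if self._is_blackwell:
            return self._attention_impl_blackwell(q)
        return self._attention_impl_flashmla(q)

    def _attention_impl_blackwell(self, q):
        return ("blackwell", q)  # placeholder for the custom path

    def _attention_impl_flashmla(self, q):
        return ("flashmla", q)  # placeholder for the FlashMLA path


layer = DeepseekV4MLAAttentionSketch("SomeOtherBackend", is_blackwell=True)
path, _ = layer.forward("q")
```

Keeping the check active on non-Blackwell hardware means the original guarantee still holds everywhere the FlashMLA path is actually used.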