Commit Graph

465 Commits

SHA1 Message Date
2e6559402c Add full layer NaN test (attention + MoE, multi-layer chain) 2026-05-19 18:36:49 +00:00
cca145e35c Use 16 experts for MoE runner test (fits in memory) 2026-05-19 18:35:40 +00:00
7893e7514d Add MoE runner NaN test (grouped GEMM with real weights) 2026-05-19 18:34:56 +00:00
7b432da754 Fix intermediate size: 3072 not 18432 2026-05-19 18:34:12 +00:00
293f14a179 Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM 2026-05-19 18:33:57 +00:00
62f2395e30 Fix MoE weight key names, add fallback 2026-05-19 18:32:49 +00:00
9455466648 Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan 2026-05-19 18:32:14 +00:00
0316cec6fb Add input NaN debug to trace where NaN starts 2026-05-19 18:15:53 +00:00
4c45d73b82 Add prefill inputs NaN debug 2026-05-19 18:04:18 +00:00
0773c9608c Add prefill attention value debug check 2026-05-19 17:55:35 +00:00
4f02113aa0 Use module-level Blackwell flag in compressor (works during torch.compile) 2026-05-19 17:37:26 +00:00
8cf6ac3e8c CRITICAL FIX: Remove double Q normalization and fix RoPE sin slice 2026-05-19 17:27:33 +00:00
a94ad73c64 Fix imports in vLLM codepaths test 2026-05-19 17:26:50 +00:00
f3f9674810 Fix f-string syntax 2026-05-19 17:26:40 +00:00
6cc2312e61 Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode) 2026-05-19 17:26:10 +00:00
aade8593f7 CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale 2026-05-19 17:08:58 +00:00
2f811bc8bd FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode 2026-05-19 16:55:44 +00:00
da6fa2f1d6 Fix UnboundLocalError: move num_decode_tokens before debug print 2026-05-19 16:43:28 +00:00
76fff5fc8b CRITICAL FIX: Skip compressor fused attention kernel on Blackwell — it bypasses our attention path 2026-05-19 16:35:07 +00:00
0554332352 Add debug logging to Blackwell attention path 2026-05-19 16:31:55 +00:00
f9a09df81a Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn 2026-05-19 16:19:28 +00:00
b95e934703 Add CSA/HCA decode + prefill attention to Blackwell path 2026-05-19 16:06:24 +00:00
abff942edd Fix N for C128A (need 128 tokens) 2026-05-19 16:04:53 +00:00
49c2e088d4 Fix compressor key name 2026-05-19 16:04:38 +00:00
7d89ede9f9 Add CSA sparse attention test (compressed KV gather + SWA merge) 2026-05-19 16:04:19 +00:00
51a7a89c5c Update CURRENT_BUG: KV cache pipeline verified, all tests passing 2026-05-19 16:01:10 +00:00
696a890df7 Add decode vs prefill consistency test 2026-05-19 16:00:33 +00:00
359654f08e Test with all 61 layers (shared experts only) 2026-05-19 15:55:41 +00:00
3e6041d752 Fix view→reshape for non-contiguous tensor 2026-05-19 15:54:40 +00:00
ff9f373633 Add e2e decode test (3 layers: C128A, C4A, SWA) 2026-05-19 15:53:29 +00:00
a5870fa05c Vectorize paged KV cache read/write, kill container 2026-05-19 15:48:16 +00:00
9e428b83c7 Fix KV cache: write to paged cache, handle uint8→fp8 conversion, fix RoPE bug 2026-05-19 15:34:09 +00:00
0023fee706 Add blackwell_attention module and comprehensive test 2026-05-19 15:30:29 +00:00
142a4a1ad4 Fix attention for decode (1 query vs N cached KVs) 2026-05-19 15:28:52 +00:00
4b85605edf Fix fp8 amax in decode test 2026-05-19 15:28:17 +00:00
4f23055450 Add decode attention pipeline test — reproduces KV cache bug 2026-05-19 15:27:55 +00:00
31b9cfbdbd Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM. 2026-05-19 15:19:55 +00:00
dca8bfc3a8 Fix _apply_rope_kv: use inline RoPE instead of 3D apply_gptj_rope 2026-05-19 10:36:21 +00:00
8e6721917e Fix syntax in RoPE KV test 2026-05-19 10:31:07 +00:00
cbf440f75a Add RoPE KV test 2026-05-19 10:28:15 +00:00
a5fabbdf66 Apply RoPE to KV in Blackwell attention path - fix NaN output 2026-05-19 10:27:15 +00:00
7e97551fd3 Fix: use self.scale instead of self.softmax_scale in Blackwell attention path 2026-05-19 10:04:46 +00:00
39310c357d Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers 2026-05-19 09:52:23 +00:00
d9cd8fa165 Add debug patch to print layer name mismatch 2026-05-19 09:45:10 +00:00
9a0b015aac Reduce max_model_len to 256 2026-05-19 09:37:38 +00:00
de1fb839f0 Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment) 2026-05-19 09:29:57 +00:00
ea771ff70b Reduce max_model_len to 512 for initial container test 2026-05-19 09:23:10 +00:00
bcfbd1e25b Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache) 2026-05-19 09:13:33 +00:00
e91421f06e Fix KV cache page size patch: separate groups for large SWA pages 2026-05-19 09:05:14 +00:00
dd7f2627e8 Add full model forward test (WIP), sparse attention test passes 2026-05-19 09:04:19 +00:00