| 02b57071be | Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean | 2026-05-19 20:22:10 +00:00 |
| 7070fadf72 | Add full layer NaN test (attention + MoE, multi-layer chain) | 2026-05-19 18:36:49 +00:00 |
| 152b0749df | Use 16 experts for MoE runner test (fits in memory) | 2026-05-19 18:35:40 +00:00 |
| daa59a7c75 | Add MoE runner NaN test (grouped GEMM with real weights) | 2026-05-19 18:34:56 +00:00 |
| 9308634e65 | Fix intermediate size: 3072 not 18432 | 2026-05-19 18:34:12 +00:00 |
| 2b91bb1b71 | Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM | 2026-05-19 18:33:57 +00:00 |
| 8904d409f8 | Fix MoE weight key names, add fallback | 2026-05-19 18:32:49 +00:00 |
| e45ceb2226 | Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan | 2026-05-19 18:32:14 +00:00 |
| 22ec43e685 | Add input NaN debug to trace where NaN starts | 2026-05-19 18:15:53 +00:00 |
| b86d0d2dee | Add prefill inputs NaN debug | 2026-05-19 18:04:18 +00:00 |
| 45a2d8851d | Add prefill attention value debug check | 2026-05-19 17:55:35 +00:00 |
| 1589b79137 | Use module-level Blackwell flag in compressor (works during torch.compile) | 2026-05-19 17:37:26 +00:00 |
| 658b12cb3d | CRITICAL FIX: Remove double Q normalization and fix RoPE sin slice | 2026-05-19 17:27:33 +00:00 |
| facc6509e7 | Fix imports in vLLM codepaths test | 2026-05-19 17:26:50 +00:00 |
| 835e1a0590 | Fix f-string syntax | 2026-05-19 17:26:40 +00:00 |
| 9c30168202 | Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode) | 2026-05-19 17:26:10 +00:00 |
| 8f80991fdf | CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale | 2026-05-19 17:08:58 +00:00 |
| d67d8613af | FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode | 2026-05-19 16:55:44 +00:00 |
| 3b204c4772 | Fix UnboundLocalError: move num_decode_tokens before debug print | 2026-05-19 16:43:28 +00:00 |
| 30890b621d | CRITICAL FIX: Skip compressor fused attention kernel on Blackwell - it bypasses our attention path | 2026-05-19 16:35:07 +00:00 |
| b8e2cf61ad | Add debug logging to Blackwell attention path | 2026-05-19 16:31:55 +00:00 |
| d7f686bcfc | Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn | 2026-05-19 16:19:28 +00:00 |
| 114da83090 | Add CSA/HCA decode + prefill attention to Blackwell path | 2026-05-19 16:06:24 +00:00 |
| 2cc1910c45 | Fix N for C128A (need 128 tokens) | 2026-05-19 16:04:53 +00:00 |
| cea453cbab | Fix compressor key name | 2026-05-19 16:04:38 +00:00 |
| 04f2b2d8d4 | Add CSA sparse attention test (compressed KV gather + SWA merge) | 2026-05-19 16:04:19 +00:00 |
| 4c6464e7e0 | Update CURRENT_BUG: KV cache pipeline verified, all tests passing | 2026-05-19 16:01:10 +00:00 |
| be8566a443 | Add decode vs prefill consistency test | 2026-05-19 16:00:33 +00:00 |
| 2ddd3d0702 | Test with all 61 layers (shared experts only) | 2026-05-19 15:55:41 +00:00 |
| 842e6e1381 | Fix view→reshape for non-contiguous tensor | 2026-05-19 15:54:40 +00:00 |
| f0f8d8211b | Add e2e decode test (3 layers: C128A, C4A, SWA) | 2026-05-19 15:53:29 +00:00 |
| 255913fba4 | Vectorize paged KV cache read/write, kill container | 2026-05-19 15:48:16 +00:00 |
| 8b2cb41160 | Fix KV cache: write to paged cache, handle uint8→fp8 conversion, fix RoPE bug | 2026-05-19 15:34:09 +00:00 |
| 6ceb05327f | Add blackwell_attention module and comprehensive test | 2026-05-19 15:30:29 +00:00 |
| 85c74e5932 | Fix attention for decode (1 query vs N cached KVs) | 2026-05-19 15:28:52 +00:00 |
| 85099c7e75 | Fix fp8 amax in decode test | 2026-05-19 15:28:17 +00:00 |
| c66b0b88c0 | Add decode attention pipeline test - reproduces KV cache bug | 2026-05-19 15:27:55 +00:00 |
| 836fa75b93 | Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM. | 2026-05-19 15:19:55 +00:00 |
| dca8bfc3a8 | Fix _apply_rope_kv: use inline RoPE instead of 3D apply_gptj_rope | 2026-05-19 10:36:21 +00:00 |
| 8e6721917e | Fix syntax in RoPE KV test | 2026-05-19 10:31:07 +00:00 |
| cbf440f75a | Add RoPE KV test | 2026-05-19 10:28:15 +00:00 |
| a5fabbdf66 | Apply RoPE to KV in Blackwell attention path - fix NaN output | 2026-05-19 10:27:15 +00:00 |
| 7e97551fd3 | Fix: use self.scale instead of self.softmax_scale in Blackwell attention path | 2026-05-19 10:04:46 +00:00 |
| 39310c357d | Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers | 2026-05-19 09:52:23 +00:00 |
| d9cd8fa165 | Add debug patch to print layer name mismatch | 2026-05-19 09:45:10 +00:00 |
| 9a0b015aac | Reduce max_model_len to 256 | 2026-05-19 09:37:38 +00:00 |
| de1fb839f0 | Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment) | 2026-05-19 09:29:57 +00:00 |
| ea771ff70b | Reduce max_model_len to 512 for initial container test | 2026-05-19 09:23:10 +00:00 |
| bcfbd1e25b | Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache) | 2026-05-19 09:13:33 +00:00 |
| e91421f06e | Fix KV cache page size patch: separate groups for large SWA pages | 2026-05-19 09:05:14 +00:00 |