15c987244f
v28 attempt: PV MMA (128,64) - cosine 0.004, debugging
biondizzle
2026-05-21 05:41:44 +00:00
a7fd2761df
README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write
biondizzle
2026-05-21 05:17:12 +00:00
c20518332e
more stuff
biondizzle
2026-05-21 05:08:57 +00:00
0dc6fe4a7d
Stage B progress: PV works for square (128,128), broken for (128,64)
biondizzle
2026-05-21 04:40:28 +00:00
7a8945eb76
Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage
biondizzle
2026-05-21 04:10:07 +00:00
467ade37b2
Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed
biondizzle
2026-05-21 00:12:47 +00:00
97656a5cd1
Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong
biondizzle
2026-05-20 20:26:25 +00:00
a5b48be7d5
stuff
biondizzle
2026-05-20 07:15:01 +00:00
9f0528f150
Update README: reflect current state, add C128A/C4A topk + warmup fixes
biondizzle
2026-05-20 06:51:12 +00:00
67d5e26080
Fix warmup compilation + add sparse topk metadata kernels
biondizzle
2026-05-20 06:43:43 +00:00
bbba289bd8
feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL)
biondizzle
2026-05-20 05:46:15 +00:00
06bf4f482d
README: comprehensive update with current kernel status
biondizzle
2026-05-20 04:42:57 +00:00
a30d9eb523
Update README with final kernel status
biondizzle
2026-05-20 04:39:57 +00:00
04eca7c6da
Custom CUDA kernel for de-interleave plus NVFP4 quantize
biondizzle
2026-05-20 04:39:47 +00:00
061d5692a9
Remove debug print statements from pipeline
biondizzle
2026-05-20 04:20:46 +00:00
aa8563c626
Fused SwiGLU epilogue with granularity-8 weight interleave
biondizzle
2026-05-20 04:13:52 +00:00
57d4cb714f
docs: rewrite README.md with current project state
biondizzle
2026-05-20 03:30:35 +00:00
6c04155167
wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL
biondizzle
2026-05-20 03:26:20 +00:00
9f0c1b8c5d
wip: Step 1 SiLU validation complete, Step 2 gate/up pairing planning
biondizzle
2026-05-20 03:16:34 +00:00
b84f2f7bf9
fix: cutlass.Float32 not cutlass.float32_t in fused epilogue
biondizzle
2026-05-20 03:12:23 +00:00
08992b818d
wip: add run_fused_swiglu_grouped_gemm bridge + step1 test
biondizzle
2026-05-20 03:10:56 +00:00
9c43c69a4c
wip: fused SwiGLU Stage 1 - SiLU in registers (full acc_vec)
biondizzle
2026-05-20 03:07:02 +00:00
2f053f674e
wip: fused SwiGLU kernel scaffold + bridge interleave + plan
biondizzle
2026-05-20 03:04:38 +00:00
4f178d6e9c
chore: remove unused _expert_id_range after bincount migration
biondizzle
2026-05-20 02:17:44 +00:00
84a2f6d441
perf: replace expert counting O(n*E) comparison with torch.bincount O(n)
biondizzle
2026-05-20 02:17:23 +00:00
4882d8553c
fix: zero out x_norm for underflow blocks before division in NVFP4 quantization
biondizzle
2026-05-20 02:16:49 +00:00
e653712598
fix: detect zero blocks in NVFP4 quantization, force FP4+FP8 to exact zero
biondizzle
2026-05-20 02:14:50 +00:00
1857bdedc3
chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct
biondizzle
2026-05-20 02:11:40 +00:00
ef398006a7
fix: correct scale factor dimensions in warmup (K_sf = ceil_div(K_packed,8) not ceil_div(K_packed,16))
biondizzle
2026-05-20 02:08:26 +00:00
8f1a20562f
fix: root-cause JIT memory corruption myth, add eager warmup, remove _needs_token_refill
biondizzle
2026-05-20 02:08:01 +00:00
6ec0afc318
fix: handle 3D swa_indices and correct kv_bf16 expand dims
biondizzle
2026-05-20 01:36:27 +00:00
aa593361e7
feat: add native CuTeDSL SWA decode attention kernel stub + batched SDPA fallback
biondizzle
2026-05-20 01:28:05 +00:00
3599b44c0f
fix: replace _allocate_buffers with _ensure_buffer_size for dynamic sizing
biondizzle
2026-05-20 00:02:10 +00:00
1d5e70adfb
fix: dynamic buffer sizing in nvfp4_linear for varying token counts
biondizzle
2026-05-19 23:59:55 +00:00
1901bf585e
nuke vllm because this keeps confusing people
biondizzle
2026-05-19 23:04:36 +00:00
5fb70b4cd2
Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean
biondizzle
2026-05-19 20:22:10 +00:00
2e6559402c
Add full layer NaN test (attention + MoE, multi-layer chain)
biondizzle
2026-05-19 18:36:49 +00:00
cca145e35c
Use 16 experts for MoE runner test (fits in memory)
biondizzle
2026-05-19 18:35:40 +00:00
7893e7514d
Add MoE runner NaN test (grouped GEMM with real weights)
biondizzle
2026-05-19 18:34:56 +00:00
7b432da754
Fix intermediate size: 3072 not 18432
biondizzle
2026-05-19 18:34:12 +00:00
293f14a179
Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM
biondizzle
2026-05-19 18:33:57 +00:00
62f2395e30
Fix MoE weight key names, add fallback
biondizzle
2026-05-19 18:32:49 +00:00
9455466648
Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan
biondizzle
2026-05-19 18:32:14 +00:00
0316cec6fb
Add input NaN debug to trace where NaN starts
biondizzle
2026-05-19 18:15:53 +00:00
4c45d73b82
Add prefill inputs NaN debug
biondizzle
2026-05-19 18:04:18 +00:00
0773c9608c
Add prefill attention value debug check
biondizzle
2026-05-19 17:55:35 +00:00
4f02113aa0
Use module-level Blackwell flag in compressor (works during torch.compile)
biondizzle
2026-05-19 17:37:26 +00:00
8cf6ac3e8c
CRITICAL FIX: Remove double Q normalization and fix RoPE sin slice
biondizzle
2026-05-19 17:27:33 +00:00
a94ad73c64
Fix imports in vLLM codepaths test
biondizzle
2026-05-19 17:26:50 +00:00
f3f9674810
Fix f-string syntax
biondizzle
2026-05-19 17:26:40 +00:00
6cc2312e61
Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode)
biondizzle
2026-05-19 17:26:10 +00:00
aade8593f7
CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale
biondizzle
2026-05-19 17:08:58 +00:00
2f811bc8bd
FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode
biondizzle
2026-05-19 16:55:44 +00:00
da6fa2f1d6
Fix UnboundLocalError: move num_decode_tokens before debug print
biondizzle
2026-05-19 16:43:28 +00:00
76fff5fc8b
CRITICAL FIX: Skip compressor fused attention kernel on Blackwell — it bypasses our attention path
biondizzle
2026-05-19 16:35:07 +00:00
0554332352
Add debug logging to Blackwell attention path
biondizzle
2026-05-19 16:31:55 +00:00
f9a09df81a
Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn
biondizzle
2026-05-19 16:19:28 +00:00
b95e934703
Add CSA/HCA decode + prefill attention to Blackwell path
biondizzle
2026-05-19 16:06:24 +00:00
abff942edd
Fix N for C128A (need 128 tokens)
biondizzle
2026-05-19 16:04:53 +00:00
49c2e088d4
Fix compressor key name
biondizzle
2026-05-19 16:04:38 +00:00
7d89ede9f9
Add CSA sparse attention test (compressed KV gather + SWA merge)
biondizzle
2026-05-19 16:04:19 +00:00
51a7a89c5c
Update CURRENT_BUG: KV cache pipeline verified, all tests passing
biondizzle
2026-05-19 16:01:10 +00:00
696a890df7
Add decode vs prefill consistency test
biondizzle
2026-05-19 16:00:33 +00:00
359654f08e
Test with all 61 layers (shared experts only)
biondizzle
2026-05-19 15:55:41 +00:00
3e6041d752
Fix view→reshape for non-contiguous tensor
biondizzle
2026-05-19 15:54:40 +00:00
ff9f373633
Add e2e decode test (3 layers: C128A, C4A, SWA)
biondizzle
2026-05-19 15:53:29 +00:00
a5870fa05c
Vectorize paged KV cache read/write, kill container
biondizzle
2026-05-19 15:48:16 +00:00
9e428b83c7
Fix KV cache: write to paged cache, handle uint8→fp8 conversion, fix RoPE bug
biondizzle
2026-05-19 15:34:09 +00:00
0023fee706
Add blackwell_attention module and comprehensive test
biondizzle
2026-05-19 15:30:29 +00:00
142a4a1ad4
Fix attention for decode (1 query vs N cached KVs)
biondizzle
2026-05-19 15:28:52 +00:00
4b85605edf
Fix fp8 amax in decode test
biondizzle
2026-05-19 15:28:17 +00:00
4f23055450
Add decode attention pipeline test — reproduces KV cache bug
biondizzle
2026-05-19 15:27:55 +00:00
31b9cfbdbd
Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM.
biondizzle
2026-05-19 15:19:55 +00:00
dca8bfc3a8
Fix _apply_rope_kv: use inline RoPE instead of 3D apply_gptj_rope
biondizzle
2026-05-19 10:36:21 +00:00
8e6721917e
Fix syntax in RoPE KV test
biondizzle
2026-05-19 10:31:07 +00:00
cbf440f75a
Add RoPE KV test
biondizzle
2026-05-19 10:28:15 +00:00
a5fabbdf66
Apply RoPE to KV in Blackwell attention path - fix NaN output
biondizzle
2026-05-19 10:27:15 +00:00
7e97551fd3
Fix: use self.scale instead of self.softmax_scale in Blackwell attention path
biondizzle
2026-05-19 10:04:46 +00:00
39310c357d
Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers
biondizzle
2026-05-19 09:52:23 +00:00
d9cd8fa165
Add debug patch to print layer name mismatch
biondizzle
2026-05-19 09:45:10 +00:00
9a0b015aac
Reduce max_model_len to 256
biondizzle
2026-05-19 09:37:38 +00:00
de1fb839f0
Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment)
biondizzle
2026-05-19 09:29:57 +00:00
ea771ff70b
Reduce max_model_len to 512 for initial container test
biondizzle
2026-05-19 09:23:10 +00:00
bcfbd1e25b
Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache)
biondizzle
2026-05-19 09:13:33 +00:00
e91421f06e
Fix KV cache page size patch: separate groups for large SWA pages
biondizzle
2026-05-19 09:05:14 +00:00
dd7f2627e8
Add full model forward test (WIP), sparse attention test passes
biondizzle
2026-05-19 09:04:19 +00:00
9781953509
Add CSA/HCA sparse attention kernel test
biondizzle
2026-05-19 09:02:12 +00:00
d60673864a
Fix kv_ref transpose in KV cache test
biondizzle
2026-05-19 08:58:46 +00:00
c1099d76d2
Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression
biondizzle
2026-05-19 08:57:31 +00:00
c54ddbdae1
Fix NVFP4 attention: slice output to actual N after 128-padding
biondizzle
2026-05-19 08:55:31 +00:00
42285b6c24
Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM
biondizzle
2026-05-19 08:54:59 +00:00
9465929e6e
Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA)
biondizzle
2026-05-19 08:51:16 +00:00
fa71fbe909
Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes
biondizzle
2026-05-19 08:45:44 +00:00
d08a457829
Fix cos_sin cache shape in NVFP4 attention test
biondizzle
2026-05-19 08:38:55 +00:00
7dd8871e84
Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM
biondizzle
2026-05-19 08:38:25 +00:00
2672e98e4c
Remove VLLM_NVFP4_GEMM_BACKEND env var - CuTeDSL auto-selects on Blackwell
biondizzle
2026-05-19 08:35:40 +00:00
914d27fee7
Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks
biondizzle
2026-05-19 08:26:16 +00:00
7d5c093c99
Fix KV cache crash: skip SWA cache write on Blackwell
biondizzle
2026-05-19 08:21:57 +00:00
e1a642452a
Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel
biondizzle
2026-05-19 08:19:23 +00:00
2856323360
Fix torch.compile crash: move Blackwell path inside custom op boundary
biondizzle
2026-05-19 08:11:58 +00:00