-
a782ac00ce
Integrate CSA/SDPA attention into vLLM for Blackwell
biondizzle
2026-05-19 08:04:07 +00:00
-
81931614e9
Update CURRENT_BUG: CSA kernel works, plan vLLM integration
biondizzle
2026-05-19 08:02:00 +00:00
-
9d067add90
Fix device reference in full_attention_reference
biondizzle
2026-05-19 08:01:31 +00:00
-
3e3e998578
Fix attention: manual causal mask for batched single-query
biondizzle
2026-05-19 08:01:08 +00:00
-
1e675ccc9a
Fix causal mask shape for SDPA: (1,1,T,T) broadcast
biondizzle
2026-05-19 08:00:39 +00:00
-
57615029a4
Fix KV expand for SDPA: (T,HD) → (T*NH, T, HD)
biondizzle
2026-05-19 08:00:08 +00:00
-
dd3a12bbda
Fix full_attention_reference: broadcast KV to all heads+positions
biondizzle
2026-05-19 07:59:28 +00:00
-
910015c47e
Fix kv shape: expand to (T, NH, HD) before reshape
biondizzle
2026-05-19 07:58:42 +00:00
-
3de75c4e37
Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe)
biondizzle
2026-05-19 07:58:10 +00:00
-
65f48be38c
Add attention path test: pinpoint FlashMLA failure
biondizzle
2026-05-19 07:54:01 +00:00
-
90d1098935
Update CURRENT_BUG: warmup gs is irrelevant, bug is in vLLM pipeline
biondizzle
2026-05-19 07:51:10 +00:00
-
04ad6409e5
Rewrite test: diagnose whether warmup gs matters at inference time
biondizzle
2026-05-19 07:49:41 +00:00
-
496848e158
Fix ffn_hc.scale key name
biondizzle
2026-05-19 07:48:09 +00:00
-
5a4e355d3a
Add model forward test: reproduce vLLM empty output outside container
biondizzle
2026-05-19 07:47:48 +00:00
-
f5ce728ef2
Fix OOM: add --max-model-len=876544 + revert CPU dummy weight
biondizzle
2026-05-19 07:35:43 +00:00
-
79a41d9197
Save ~5-8 GiB GPU VRAM: move dummy weight to CPU
biondizzle
2026-05-19 07:29:38 +00:00
-
cebc586014
Fix OOM: use 1-token warmup sample + free immediately
biondizzle
2026-05-19 07:28:57 +00:00
-
5122cadc94
Update CURRENT_BUG.md: root cause found + fix committed
biondizzle
2026-05-19 07:21:30 +00:00
-
6e6f95dfa8
FIX: Use warmup-based activation global scale in CuTeDSL linear kernel
biondizzle
2026-05-19 07:21:07 +00:00
-
0a7769972f
Fix garbled shared_expert_pipeline.py: imports/class were merged
biondizzle
2026-05-19 07:18:10 +00:00
-
87453a53b0
Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj
biondizzle
2026-05-19 07:17:37 +00:00
-
f97762cc9f
Fix full layer test: use correct checkpoint key names
biondizzle
2026-05-19 07:16:33 +00:00
-
cc48a5715e
Add full layer 0 B200 test: CuTeDSL vs BF16 reference
biondizzle
2026-05-19 07:14:58 +00:00
-
dbaa3d6fe6
Update CURRENT_BUG.md and README with current state
biondizzle
2026-05-19 07:05:45 +00:00
-
62abf41b03
Revert deepseek_v4_attention.py to ffc2264 (don't nuke existing patches)
biondizzle
2026-05-19 06:52:40 +00:00
-
4c2effa2be
Fix attention patch: source from v0.21.0 stable, not local clone
biondizzle
2026-05-19 06:44:59 +00:00
-
284b6a5d57
Fix attention patch: use original vllm imports, only patch forward method
biondizzle
2026-05-19 06:40:58 +00:00
-
199efe0871
Fix dims: o_groups=16, o_lora_rank=1024 from config
biondizzle
2026-05-19 06:37:25 +00:00
-
b4fee70151
Fix device mismatch in test
biondizzle
2026-05-19 06:36:22 +00:00
-
6b4b9774d1
Add B200 test: prove O-projection root cause + validate fix
biondizzle
2026-05-19 06:32:54 +00:00
-
77baca668e
Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b
biondizzle
2026-05-19 06:30:18 +00:00
-
ffc2264c41
Fix activation global scale: don't double-invert input_global_scale_inv
biondizzle
2026-05-19 06:03:08 +00:00
-
918342feeb
MHC: replace monolithic layers/mhc.py with pure PyTorch
biondizzle
2026-05-19 05:41:55 +00:00
-
dfd9c10ae9
Fix MHC import: don't import .torch from layers/mhc.py
biondizzle
2026-05-19 05:36:35 +00:00
-
e404e18efb
Also replace layers/mhc.py CustomOp dispatch
biondizzle
2026-05-19 05:31:05 +00:00
-
5e6d459145
Fix MHC custom op registration
biondizzle
2026-05-19 05:19:48 +00:00
-
9ff1679064
Replace MHC TileLang kernels with pure PyTorch
biondizzle
2026-05-19 05:07:41 +00:00
-
5c770c68ca
Keep MoE scale tensors: framework warmup needs them
biondizzle
2026-05-19 04:50:31 +00:00
-
e0f385ac45
Fix workspace_shapes: output dim is hidden_dim, not K*2
biondizzle
2026-05-19 04:42:22 +00:00
-
cfd8ec741d
Debug: add shape mismatch logging in MoE apply
biondizzle
2026-05-19 04:35:58 +00:00
-
ffc1a5c6a8
Fix workspace_shapes: remove wrong assertion, compute output dim from K
biondizzle
2026-05-19 04:28:04 +00:00
-
f023b3b2c6
Fix: wrap dummy MoE weights in nn.Parameter
biondizzle
2026-05-19 04:21:35 +00:00
-
b06dcb40dc
Fix MoE w1=None crash: keep shape-preserving dummy weights on CPU
biondizzle
2026-05-19 04:17:10 +00:00
-
c289c44920
Fix BF16 wo_a: per-group BMM instead of flat linear
biondizzle
2026-05-19 04:10:02 +00:00
-
6f9a400ae0
Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn
biondizzle
2026-05-19 03:58:25 +00:00
-
909a2710e4
Fix double lm_head mapping: NVFP4 checkpoint already uses correct names
biondizzle
2026-05-19 03:54:14 +00:00
-
4cf5b8b751
Fix compressor path: attn.mla_attn.compressor (not attn.compressor)
biondizzle
2026-05-19 03:47:26 +00:00
-
9d41419e9f
Debug: print compressor params to diagnose KeyError
biondizzle
2026-05-19 03:44:40 +00:00
-
db5192fe41
Patch from Docker image's vLLM (0.20.2rc1) instead of newer upstream
biondizzle
2026-05-19 03:35:15 +00:00
-
df5a496f5d
Fix: make eager_break_during_capture import conditional for older vLLM
biondizzle
2026-05-19 03:29:05 +00:00
-
4ed91b81d0
Fix inverse RoPE formula: swap signs on cross terms
biondizzle
2026-05-19 03:22:10 +00:00
-
fece06f746
Add unit tests for NVFP4 weight mapper and inverse RoPE BF16
biondizzle
2026-05-19 03:22:00 +00:00
-
b0b5113467
Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj
biondizzle
2026-05-19 03:20:41 +00:00
-
396a83ea56
Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper
biondizzle
2026-05-19 03:13:38 +00:00
-
b856ee9315
Clean up debug scripts
biondizzle
2026-05-19 02:47:29 +00:00
-
05cdde1676
Fix wo_a: scatter each group's data at correct offset in padded buffer
biondizzle
2026-05-19 02:45:57 +00:00
-
8fe5546bb3
Fix debug script
biondizzle
2026-05-19 02:43:17 +00:00
-
788f0aa65a
Add step-by-step debug for wo_a
biondizzle
2026-05-19 02:43:05 +00:00
-
5f5b997fc3
Fix wo_a: permute to groups-first layout for grouped GEMM
biondizzle
2026-05-19 02:41:32 +00:00
-
77e4970d93
Add debug script for wo_a quantization
biondizzle
2026-05-19 02:40:43 +00:00
-
80122b850b
Add debug script for wo_a
biondizzle
2026-05-19 02:39:55 +00:00
-
ae233ab648
Fix test: cos_sin_cache on CUDA device
biondizzle
2026-05-19 02:37:50 +00:00
-
882d4996ff
Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)
biondizzle
2026-05-19 02:36:30 +00:00
-
bab1f75f29
Fix gs None error in legacy _ensure_stacked path
biondizzle
2026-05-19 02:17:53 +00:00
-
48fa64dfda
Eliminate weight copies: pass stacked checkpoint tensors directly
biondizzle
2026-05-19 02:16:43 +00:00
-
0612c1ab54
use proper backend
biondizzle
2026-05-19 02:08:18 +00:00
-
00fe63b56f
Fix compile test: add warmup for activation global scales
biondizzle
2026-05-19 01:57:16 +00:00
-
bba3bca4d3
Add torch.compile + custom op integration test
biondizzle
2026-05-19 01:56:46 +00:00
-
35fab6cff3
Replace autograd.Function with torch.library.custom_op for Dynamo compat
biondizzle
2026-05-19 01:54:48 +00:00
-
98153002c0
Go back to torch.library.custom_op with correct GEMM impl
biondizzle
2026-05-19 01:24:41 +00:00
-
02c500bbb1
Switch to allow_in_graph for Dynamo opacity instead of custom op
biondizzle
2026-05-19 01:20:07 +00:00
-
581d87f9a6
Remove warmup forward from process_weights_after_loading
biondizzle
2026-05-19 01:18:54 +00:00
-
5d49849156
Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype
biondizzle
2026-05-19 01:15:24 +00:00
-
e1fcfc4f01
Add CuTeDSL warmup + CUDA sync after JIT compilation
biondizzle
2026-05-19 01:11:44 +00:00
-
1d9c0f996c
Fix expert_offsets dtype: CuTeDSL expects int32 not int64
biondizzle
2026-05-19 01:05:20 +00:00
-
b81200f427
Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op
biondizzle
2026-05-19 01:01:42 +00:00
-
e0eb436914
Fix custom_op registration: use as decorator with proper type hints
biondizzle
2026-05-19 00:54:30 +00:00
-
c609e9ba3c
Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM
biondizzle
2026-05-19 00:50:43 +00:00
-
c043a11bcc
Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers
biondizzle
2026-05-19 00:44:44 +00:00
-
358830925a
Fix unpack error: handle both tuple and tensor returns from NVFP4 forward()
biondizzle
2026-05-19 00:33:43 +00:00
-
d9dc042ff7
Fix compressor kv_score: use forward() for NVFP4 quantized weights
biondizzle
2026-05-19 00:29:43 +00:00
-
10c14ddb49
Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm
biondizzle
2026-05-19 00:24:26 +00:00
-
540e7ee8fc
Fix: layer.self_attn → layer.attn (model uses attn, not self_attn)
biondizzle
2026-05-19 00:14:09 +00:00
-
201a40e6c4
Fix zero-dim tensor concatenation in compressor scale buffer
biondizzle
2026-05-19 00:10:13 +00:00
-
d41a48aa1f
Fix KeyError for missing stacked params (indexer.compressor)
biondizzle
2026-05-18 23:54:02 +00:00
-
4b0d8263f6
Fix NameError: use print instead of logger (not imported)
biondizzle
2026-05-18 23:49:42 +00:00
-
e3c24769e2
Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint)
biondizzle
2026-05-18 23:41:39 +00:00
-
9d016aa1c0
Use print instead of logger for weight load debug
biondizzle
2026-05-18 23:30:58 +00:00
-
a6f61bda5d
Add debug logging for weight loading failures
biondizzle
2026-05-18 23:28:15 +00:00
-
eef0ef76af
Fix NVFP4 compressor scale loading: buffer and concatenate scale shards
biondizzle
2026-05-18 23:24:08 +00:00
-
f74447bfd0
Proper NVFP4 integration: quantized compressor/indexer + mapper fixes
biondizzle
2026-05-18 23:20:13 +00:00
-
17496b2615
Fix NVFP4 weights mapper: add prefix mappings, fix substr order
biondizzle
2026-05-18 23:03:34 +00:00
-
b039123207
Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames
biondizzle
2026-05-18 22:53:09 +00:00
-
ea648a9bc2
Fix NVFP4 mapper: keep model. prefix (model params use it)
biondizzle
2026-05-18 22:49:40 +00:00
-
1528d4e182
Fix NVFP4 mapper: strip model. prefix from checkpoint keys
biondizzle
2026-05-18 22:46:04 +00:00
-
5d37674fb1
Add cutedsl to MoEBackend type in kernel config
biondizzle
2026-05-18 22:38:41 +00:00
-
7409204d71
Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper
biondizzle
2026-05-18 22:33:51 +00:00
-
a19ed4a18e
Remove breakable_cudagraph import (not in nightly)
biondizzle
2026-05-18 22:29:24 +00:00
-
b007937a68
Fix garbled imports in cutedsl/runner.py
biondizzle
2026-05-18 22:22:52 +00:00
-
a7ed8faec6
Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework
biondizzle
2026-05-18 22:19:23 +00:00