Commit Graph

  • a782ac00ce Integrate CSA/SDPA attention into vLLM for Blackwell biondizzle 2026-05-19 08:04:07 +00:00
  • 81931614e9 Update CURRENT_BUG: CSA kernel works, plan vLLM integration biondizzle 2026-05-19 08:02:00 +00:00
  • 9d067add90 Fix device reference in full_attention_reference biondizzle 2026-05-19 08:01:31 +00:00
  • 3e3e998578 Fix attention: manual causal mask for batched single-query biondizzle 2026-05-19 08:01:08 +00:00
  • 1e675ccc9a Fix causal mask shape for SDPA: (1,1,T,T) broadcast biondizzle 2026-05-19 08:00:39 +00:00
  • 57615029a4 Fix KV expand for SDPA: (T,HD) → (T*NH, T, HD) biondizzle 2026-05-19 08:00:08 +00:00
  • dd3a12bbda Fix full_attention_reference: broadcast KV to all heads+positions biondizzle 2026-05-19 07:59:28 +00:00
  • 910015c47e Fix kv shape: expand to (T, NH, HD) before reshape biondizzle 2026-05-19 07:58:42 +00:00
  • 3de75c4e37 Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe) biondizzle 2026-05-19 07:58:10 +00:00
  • 65f48be38c Add attention path test: pinpoint FlashMLA failure biondizzle 2026-05-19 07:54:01 +00:00
  • 90d1098935 Update CURRENT_BUG: warmup gs is irrelevant, bug is in vLLM pipeline biondizzle 2026-05-19 07:51:10 +00:00
  • 04ad6409e5 Rewrite test: diagnose whether warmup gs matters at inference time biondizzle 2026-05-19 07:49:41 +00:00
  • 496848e158 Fix ffn_hc.scale key name biondizzle 2026-05-19 07:48:09 +00:00
  • 5a4e355d3a Add model forward test: reproduce vLLM empty output outside container biondizzle 2026-05-19 07:47:48 +00:00
  • f5ce728ef2 Fix OOM: add --max-model-len=876544 + revert CPU dummy weight biondizzle 2026-05-19 07:35:43 +00:00
  • 79a41d9197 Save ~5-8 GiB GPU VRAM: move dummy weight to CPU biondizzle 2026-05-19 07:29:38 +00:00
  • cebc586014 Fix OOM: use 1-token warmup sample + free immediately biondizzle 2026-05-19 07:28:57 +00:00
  • 5122cadc94 Update CURRENT_BUG.md: root cause found + fix committed biondizzle 2026-05-19 07:21:30 +00:00
  • 6e6f95dfa8 FIX: Use warmup-based activation global scale in CuTeDSL linear kernel biondizzle 2026-05-19 07:21:07 +00:00
  • 0a7769972f Fix garbled shared_expert_pipeline.py: imports/class were merged biondizzle 2026-05-19 07:18:10 +00:00
  • 87453a53b0 Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj biondizzle 2026-05-19 07:17:37 +00:00
  • f97762cc9f Fix full layer test: use correct checkpoint key names biondizzle 2026-05-19 07:16:33 +00:00
  • cc48a5715e Add full layer 0 B200 test: CuTeDSL vs BF16 reference biondizzle 2026-05-19 07:14:58 +00:00
  • dbaa3d6fe6 Update CURRENT_BUG.md and README with current state biondizzle 2026-05-19 07:05:45 +00:00
  • 62abf41b03 Revert deepseek_v4_attention.py to ffc2264 — don't nuke existing patches biondizzle 2026-05-19 06:52:40 +00:00
  • 4c2effa2be Fix attention patch: source from v0.21.0 stable, not local clone biondizzle 2026-05-19 06:44:59 +00:00
  • 284b6a5d57 Fix attention patch: use original vllm imports, only patch forward method biondizzle 2026-05-19 06:40:58 +00:00
  • 199efe0871 Fix dims: o_groups=16, o_lora_rank=1024 from config biondizzle 2026-05-19 06:37:25 +00:00
  • b4fee70151 Fix device mismatch in test biondizzle 2026-05-19 06:36:22 +00:00
  • 6b4b9774d1 Add B200 test: prove O-projection root cause + validate fix biondizzle 2026-05-19 06:32:54 +00:00
  • 77baca668e Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b biondizzle 2026-05-19 06:30:18 +00:00
  • ffc2264c41 Fix activation global scale: don't double-invert input_global_scale_inv biondizzle 2026-05-19 06:03:08 +00:00
  • 918342feeb MHC: replace monolithic layers/mhc.py with pure PyTorch biondizzle 2026-05-19 05:41:55 +00:00
  • dfd9c10ae9 Fix MHC import: don't import .torch from layers/mhc.py biondizzle 2026-05-19 05:36:35 +00:00
  • e404e18efb Also replace layers/mhc.py CustomOp dispatch biondizzle 2026-05-19 05:31:05 +00:00
  • 5e6d459145 Fix MHC custom op registration biondizzle 2026-05-19 05:19:48 +00:00
  • 9ff1679064 Replace MHC TileLang kernels with pure PyTorch biondizzle 2026-05-19 05:07:41 +00:00
  • 5c770c68ca Keep MoE scale tensors: framework warmup needs them biondizzle 2026-05-19 04:50:31 +00:00
  • e0f385ac45 Fix workspace_shapes: output dim is hidden_dim, not K*2 biondizzle 2026-05-19 04:42:22 +00:00
  • cfd8ec741d Debug: add shape mismatch logging in MoE apply biondizzle 2026-05-19 04:35:58 +00:00
  • ffc1a5c6a8 Fix workspace_shapes: remove wrong assertion, compute output dim from K biondizzle 2026-05-19 04:28:04 +00:00
  • f023b3b2c6 Fix: wrap dummy MoE weights in nn.Parameter biondizzle 2026-05-19 04:21:35 +00:00
  • b06dcb40dc Fix MoE w1=None crash: keep shape-preserving dummy weights on CPU biondizzle 2026-05-19 04:17:10 +00:00
  • c289c44920 Fix BF16 wo_a: per-group BMM instead of flat linear biondizzle 2026-05-19 04:10:02 +00:00
  • 6f9a400ae0 Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn biondizzle 2026-05-19 03:58:25 +00:00
  • 909a2710e4 Fix double lm_head mapping: NVFP4 checkpoint already uses correct names biondizzle 2026-05-19 03:54:14 +00:00
  • 4cf5b8b751 Fix compressor path: attn.mla_attn.compressor (not attn.compressor) biondizzle 2026-05-19 03:47:26 +00:00
  • 9d41419e9f Debug: print compressor params to diagnose KeyError biondizzle 2026-05-19 03:44:40 +00:00
  • db5192fe41 Patch from Docker image's vLLM (0.20.2rc1) instead of newer upstream biondizzle 2026-05-19 03:35:15 +00:00
  • df5a496f5d Fix: make eager_break_during_capture import conditional for older vLLM biondizzle 2026-05-19 03:29:05 +00:00
  • 4ed91b81d0 Fix inverse RoPE formula: swap signs on cross terms biondizzle 2026-05-19 03:22:10 +00:00
  • fece06f746 Add unit tests for NVFP4 weight mapper and inverse RoPE BF16 biondizzle 2026-05-19 03:22:00 +00:00
  • b0b5113467 Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj biondizzle 2026-05-19 03:20:41 +00:00
  • 396a83ea56 Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper biondizzle 2026-05-19 03:13:38 +00:00
  • b856ee9315 Clean up debug scripts biondizzle 2026-05-19 02:47:29 +00:00
  • 05cdde1676 Fix wo_a: scatter each group's data at correct offset in padded buffer biondizzle 2026-05-19 02:45:57 +00:00
  • 8fe5546bb3 Fix debug script biondizzle 2026-05-19 02:43:17 +00:00
  • 788f0aa65a Add step-by-step debug for wo_a biondizzle 2026-05-19 02:43:05 +00:00
  • 5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM biondizzle 2026-05-19 02:41:32 +00:00
  • 77e4970d93 Add debug script for wo_a quantization biondizzle 2026-05-19 02:40:43 +00:00
  • 80122b850b Add debug script for wo_a biondizzle 2026-05-19 02:39:55 +00:00
  • ae233ab648 Fix test: cos_sin_cache on CUDA device biondizzle 2026-05-19 02:37:50 +00:00
  • 882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj) biondizzle 2026-05-19 02:36:30 +00:00
  • bab1f75f29 Fix gs None error in legacy _ensure_stacked path biondizzle 2026-05-19 02:17:53 +00:00
  • 48fa64dfda Eliminate weight copies: pass stacked checkpoint tensors directly biondizzle 2026-05-19 02:16:43 +00:00
  • 0612c1ab54 use proper backend biondizzle 2026-05-19 02:08:18 +00:00
  • 00fe63b56f Fix compile test: add warmup for activation global scales biondizzle 2026-05-19 01:57:16 +00:00
  • bba3bca4d3 Add torch.compile + custom op integration test biondizzle 2026-05-19 01:56:46 +00:00
  • 35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat biondizzle 2026-05-19 01:54:48 +00:00
  • 98153002c0 Go back to torch.library.custom_op with correct GEMM impl biondizzle 2026-05-19 01:24:41 +00:00
  • 02c500bbb1 Switch to allow_in_graph for Dynamo opacity instead of custom op biondizzle 2026-05-19 01:20:07 +00:00
  • 581d87f9a6 Remove warmup forward from process_weights_after_loading biondizzle 2026-05-19 01:18:54 +00:00
  • 5d49849156 Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype biondizzle 2026-05-19 01:15:24 +00:00
  • e1fcfc4f01 Add CuTeDSL warmup + CUDA sync after JIT compilation biondizzle 2026-05-19 01:11:44 +00:00
  • 1d9c0f996c Fix expert_offsets dtype: CuTeDSL expects int32 not int64 biondizzle 2026-05-19 01:05:20 +00:00
  • b81200f427 Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op biondizzle 2026-05-19 01:01:42 +00:00
  • e0eb436914 Fix custom_op registration: use as decorator with proper type hints biondizzle 2026-05-19 00:54:30 +00:00
  • c609e9ba3c Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM biondizzle 2026-05-19 00:50:43 +00:00
  • c043a11bcc Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers biondizzle 2026-05-19 00:44:44 +00:00
  • 358830925a Fix unpack error: handle both tuple and tensor returns from NVFP4 forward() biondizzle 2026-05-19 00:33:43 +00:00
  • d9dc042ff7 Fix compressor kv_score: use forward() for NVFP4 quantized weights biondizzle 2026-05-19 00:29:43 +00:00
  • 10c14ddb49 Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm biondizzle 2026-05-19 00:24:26 +00:00
  • 540e7ee8fc Fix: layer.self_attn → layer.attn (model uses attn, not self_attn) biondizzle 2026-05-19 00:14:09 +00:00
  • 201a40e6c4 Fix zero-dim tensor concatenation in compressor scale buffer biondizzle 2026-05-19 00:10:13 +00:00
  • d41a48aa1f Fix KeyError for missing stacked params (indexer.compressor) biondizzle 2026-05-18 23:54:02 +00:00
  • 4b0d8263f6 Fix NameError: use print instead of logger (not imported) biondizzle 2026-05-18 23:49:42 +00:00
  • e3c24769e2 Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint) biondizzle 2026-05-18 23:41:39 +00:00
  • 9d016aa1c0 Use print instead of logger for weight load debug biondizzle 2026-05-18 23:30:58 +00:00
  • a6f61bda5d Add debug logging for weight loading failures biondizzle 2026-05-18 23:28:15 +00:00
  • eef0ef76af Fix NVFP4 compressor scale loading: buffer and concatenate scale shards biondizzle 2026-05-18 23:24:08 +00:00
  • f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes biondizzle 2026-05-18 23:20:13 +00:00
  • 17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order biondizzle 2026-05-18 23:03:34 +00:00
  • b039123207 Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames biondizzle 2026-05-18 22:53:09 +00:00
  • ea648a9bc2 Fix NVFP4 mapper: keep model. prefix (model params use it) biondizzle 2026-05-18 22:49:40 +00:00
  • 1528d4e182 Fix NVFP4 mapper: strip model. prefix from checkpoint keys biondizzle 2026-05-18 22:46:04 +00:00
  • 5d37674fb1 Add cutedsl to MoEBackend type in kernel config biondizzle 2026-05-18 22:38:41 +00:00
  • 7409204d71 Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper biondizzle 2026-05-18 22:33:51 +00:00
  • a19ed4a18e Remove breakable_cudagraph import (not in nightly) biondizzle 2026-05-18 22:29:24 +00:00
  • b007937a68 Fix garbled imports in cutedsl/runner.py biondizzle 2026-05-18 22:22:52 +00:00
  • a7ed8faec6 Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework biondizzle 2026-05-18 22:19:23 +00:00
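
Note on 4ed91b81d0 (inverse RoPE): the commit message says the fix was to swap the signs on the cross terms. The sketch below is not the repository's code. It is a minimal pure-Python illustration that assumes the standard pairwise-rotation form of RoPE. The function names rope_rotate and rope_rotate_inverse are hypothetical. It shows that the inverse differs from the forward rotation only in the signs of the sin (cross) terms.

```python
import math


def rope_rotate(x1, x2, theta):
    """Forward rotation of one (x1, x2) pair by +theta (assumed RoPE convention)."""
    c, s = math.cos(theta), math.sin(theta)
    return x1 * c - x2 * s, x1 * s + x2 * c


def rope_rotate_inverse(y1, y2, theta):
    """Inverse rotation (by -theta): same cos terms, cross-term signs swapped."""
    c, s = math.cos(theta), math.sin(theta)
    return y1 * c + y2 * s, -y1 * s + y2 * c


# Round trip: applying the inverse after the forward rotation recovers the input.
x = (0.75, -1.25)
y = rope_rotate(x[0], x[1], 0.3)
x_back = rope_rotate_inverse(y[0], y[1], 0.3)
```

If the cross-term signs in the inverse are left the same as in the forward pass, the round trip rotates by 2*theta instead of returning the input. A round-trip check like the one above detects that mistake.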