nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

a782ac00ce Integrate CSA/SDPA attention into vLLM for Blackwell biondizzle 2026-05-19 08:04:07 +00:00
81931614e9 Update CURRENT_BUG: CSA kernel works, plan vLLM integration biondizzle 2026-05-19 08:02:00 +00:00
9d067add90 Fix device reference in full_attention_reference biondizzle 2026-05-19 08:01:31 +00:00
3e3e998578 Fix attention: manual causal mask for batched single-query biondizzle 2026-05-19 08:01:08 +00:00
1e675ccc9a Fix causal mask shape for SDPA: (1,1,T,T) broadcast biondizzle 2026-05-19 08:00:39 +00:00
57615029a4 Fix KV expand for SDPA: (T,HD) → (T*NH, T, HD) biondizzle 2026-05-19 08:00:08 +00:00
dd3a12bbda Fix full_attention_reference: broadcast KV to all heads+positions biondizzle 2026-05-19 07:59:28 +00:00
910015c47e Fix kv shape: expand to (T, NH, HD) before reshape biondizzle 2026-05-19 07:58:42 +00:00
3de75c4e37 Add CSA/HCA attention kernel (PyTorch SDPA, Blackwell-safe) biondizzle 2026-05-19 07:58:10 +00:00
65f48be38c Add attention path test: pinpoint FlashMLA failure biondizzle 2026-05-19 07:54:01 +00:00
90d1098935 Update CURRENT_BUG: warmup gs is irrelevant, bug is in vLLM pipeline biondizzle 2026-05-19 07:51:10 +00:00
04ad6409e5 Rewrite test: diagnose whether warmup gs matters at inference time biondizzle 2026-05-19 07:49:41 +00:00
496848e158 Fix ffn_hc.scale key name biondizzle 2026-05-19 07:48:09 +00:00
5a4e355d3a Add model forward test: reproduce vLLM empty output outside container biondizzle 2026-05-19 07:47:48 +00:00
f5ce728ef2 Fix OOM: add --max-model-len=876544 + revert CPU dummy weight biondizzle 2026-05-19 07:35:43 +00:00
79a41d9197 Save ~5-8 GiB GPU VRAM: move dummy weight to CPU biondizzle 2026-05-19 07:29:38 +00:00
cebc586014 Fix OOM: use 1-token warmup sample + free immediately biondizzle 2026-05-19 07:28:57 +00:00
5122cadc94 Update CURRENT_BUG.md: root cause found + fix committed biondizzle 2026-05-19 07:21:30 +00:00
6e6f95dfa8 FIX: Use warmup-based activation global scale in CuTeDSL linear kernel biondizzle 2026-05-19 07:21:07 +00:00
0a7769972f Fix garbled shared_expert_pipeline.py: imports/class were merged biondizzle 2026-05-19 07:18:10 +00:00
87453a53b0 Fix checkpoint keys: attn_hc.*, compressor.*, q_a_proj/q_b_proj/kv_proj biondizzle 2026-05-19 07:17:37 +00:00
f97762cc9f Fix full layer test: use correct checkpoint key names biondizzle 2026-05-19 07:16:33 +00:00
cc48a5715e Add full layer 0 B200 test: CuTeDSL vs BF16 reference biondizzle 2026-05-19 07:14:58 +00:00
dbaa3d6fe6 Update CURRENT_BUG.md and README with current state biondizzle 2026-05-19 07:05:45 +00:00
62abf41b03 Revert deepseek_v4_attention.py to ffc2264 — don't nuke existing patches biondizzle 2026-05-19 06:52:40 +00:00
4c2effa2be Fix attention patch: source from v0.21.0 stable, not local clone biondizzle 2026-05-19 06:44:59 +00:00
284b6a5d57 Fix attention patch: use original vllm imports, only patch forward method biondizzle 2026-05-19 06:40:58 +00:00
199efe0871 Fix dims: o_groups=16, o_lora_rank=1024 from config biondizzle 2026-05-19 06:37:25 +00:00
b4fee70151 Fix device mismatch in test biondizzle 2026-05-19 06:36:22 +00:00
6b4b9774d1 Add B200 test: prove O-projection root cause + validate fix biondizzle 2026-05-19 06:32:54 +00:00
77baca668e Patch attention forward: BF16 inv RoPE + BMM wo_a + NVFP4 wo_b biondizzle 2026-05-19 06:30:18 +00:00
ffc2264c41 Fix activation global scale: don't double-invert input_global_scale_inv biondizzle 2026-05-19 06:03:08 +00:00
918342feeb MHC: replace monolithic layers/mhc.py with pure PyTorch biondizzle 2026-05-19 05:41:55 +00:00
dfd9c10ae9 Fix MHC import: don't import .torch from layers/mhc.py biondizzle 2026-05-19 05:36:35 +00:00
e404e18efb Also replace layers/mhc.py CustomOp dispatch biondizzle 2026-05-19 05:31:05 +00:00
5e6d459145 Fix MHC custom op registration biondizzle 2026-05-19 05:19:48 +00:00
9ff1679064 Replace MHC TileLang kernels with pure PyTorch biondizzle 2026-05-19 05:07:41 +00:00
5c770c68ca Keep MoE scale tensors: framework warmup needs them biondizzle 2026-05-19 04:50:31 +00:00
e0f385ac45 Fix workspace_shapes: output dim is hidden_dim, not K*2 biondizzle 2026-05-19 04:42:22 +00:00
cfd8ec741d Debug: add shape mismatch logging in MoE apply biondizzle 2026-05-19 04:35:58 +00:00
ffc1a5c6a8 Fix workspace_shapes: remove wrong assertion, compute output dim from K biondizzle 2026-05-19 04:28:04 +00:00
f023b3b2c6 Fix: wrap dummy MoE weights in nn.Parameter biondizzle 2026-05-19 04:21:35 +00:00
b06dcb40dc Fix MoE w1=None crash: keep shape-preserving dummy weights on CPU biondizzle 2026-05-19 04:17:10 +00:00
c289c44920 Fix BF16 wo_a: per-group BMM instead of flat linear biondizzle 2026-05-19 04:10:02 +00:00
6f9a400ae0 Fix hc_head mapping: checkpoint uses hc_head.hc_fn, model params are flat hc_head_fn biondizzle 2026-05-19 03:58:25 +00:00
909a2710e4 Fix double lm_head mapping: NVFP4 checkpoint already uses correct names biondizzle 2026-05-19 03:54:14 +00:00
4cf5b8b751 Fix compressor path: attn.mla_attn.compressor (not attn.compressor) biondizzle 2026-05-19 03:47:26 +00:00
9d41419e9f Debug: print compressor params to diagnose KeyError biondizzle 2026-05-19 03:44:40 +00:00
db5192fe41 Patch from Docker image's vLLM (0.20.2rc1) instead of newer upstream biondizzle 2026-05-19 03:35:15 +00:00
df5a496f5d Fix: make eager_break_during_capture import conditional for older vLLM biondizzle 2026-05-19 03:29:05 +00:00
4ed91b81d0 Fix inverse RoPE formula: swap signs on cross terms biondizzle 2026-05-19 03:22:10 +00:00
fece06f746 Add unit tests for NVFP4 weight mapper and inverse RoPE BF16 biondizzle 2026-05-19 03:22:00 +00:00
b0b5113467 Fix weight mapper: compressor → attn.compressor (not mla_attn), quant weights_proj biondizzle 2026-05-19 03:20:41 +00:00
396a83ea56 Clean vLLM integration: use official paths, BF16 wo_a, proper weight mapper biondizzle 2026-05-19 03:13:38 +00:00
b856ee9315 Clean up debug scripts biondizzle 2026-05-19 02:47:29 +00:00
05cdde1676 Fix wo_a: scatter each group's data at correct offset in padded buffer biondizzle 2026-05-19 02:45:57 +00:00
8fe5546bb3 Fix debug script biondizzle 2026-05-19 02:43:17 +00:00
788f0aa65a Add step-by-step debug for wo_a biondizzle 2026-05-19 02:43:05 +00:00
5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM biondizzle 2026-05-19 02:41:32 +00:00
77e4970d93 Add debug script for wo_a quantization biondizzle 2026-05-19 02:40:43 +00:00
80122b850b Add debug script for wo_a biondizzle 2026-05-19 02:39:55 +00:00
ae233ab648 Fix test: cos_sin_cache on CUDA device biondizzle 2026-05-19 02:37:50 +00:00
882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj) biondizzle 2026-05-19 02:36:30 +00:00
bab1f75f29 Fix gs None error in legacy _ensure_stacked path biondizzle 2026-05-19 02:17:53 +00:00
48fa64dfda Eliminate weight copies: pass stacked checkpoint tensors directly biondizzle 2026-05-19 02:16:43 +00:00
0612c1ab54 use proper backend biondizzle 2026-05-19 02:08:18 +00:00
00fe63b56f Fix compile test: add warmup for activation global scales biondizzle 2026-05-19 01:57:16 +00:00
bba3bca4d3 Add torch.compile + custom op integration test biondizzle 2026-05-19 01:56:46 +00:00
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat biondizzle 2026-05-19 01:54:48 +00:00
98153002c0 Go back to torch.library.custom_op with correct GEMM impl biondizzle 2026-05-19 01:24:41 +00:00
02c500bbb1 Switch to allow_in_graph for Dynamo opacity instead of custom op biondizzle 2026-05-19 01:20:07 +00:00
581d87f9a6 Remove warmup forward from process_weights_after_loading biondizzle 2026-05-19 01:18:54 +00:00
5d49849156 Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype biondizzle 2026-05-19 01:15:24 +00:00
e1fcfc4f01 Add CuTeDSL warmup + CUDA sync after JIT compilation biondizzle 2026-05-19 01:11:44 +00:00
1d9c0f996c Fix expert_offsets dtype: CuTeDSL expects int32 not int64 biondizzle 2026-05-19 01:05:20 +00:00
b81200f427 Fix CuTeDSL NVFP4 linear: correct scale assembly in custom op biondizzle 2026-05-19 01:01:42 +00:00
e0eb436914 Fix custom_op registration: use as decorator with proper type hints biondizzle 2026-05-19 00:54:30 +00:00
c609e9ba3c Use torch.library.custom_op for CuTeDSL NVFP4 linear GEMM biondizzle 2026-05-19 00:50:43 +00:00
c043a11bcc Register CuTeDSL as proper NvFp4LinearKernel for NVFP4 linear layers biondizzle 2026-05-19 00:44:44 +00:00
358830925a Fix unpack error: handle both tuple and tensor returns from NVFP4 forward() biondizzle 2026-05-19 00:33:43 +00:00
d9dc042ff7 Fix compressor kv_score: use forward() for NVFP4 quantized weights biondizzle 2026-05-19 00:29:43 +00:00
10c14ddb49 Fix NVFP4 mapper: layer norms, hc params, indexer path, q_a_norm biondizzle 2026-05-19 00:24:26 +00:00
540e7ee8fc Fix: layer.self_attn → layer.attn (model uses attn, not self_attn) biondizzle 2026-05-19 00:14:09 +00:00
201a40e6c4 Fix zero-dim tensor concatenation in compressor scale buffer biondizzle 2026-05-19 00:10:13 +00:00
d41a48aa1f Fix KeyError for missing stacked params (indexer.compressor) biondizzle 2026-05-18 23:54:02 +00:00
4b0d8263f6 Fix NameError: use print instead of logger (not imported) biondizzle 2026-05-18 23:49:42 +00:00
e3c24769e2 Handle wo_a as bfloat16 (unquantized in NVFP4 checkpoint) biondizzle 2026-05-18 23:41:39 +00:00
9d016aa1c0 Use print instead of logger for weight load debug biondizzle 2026-05-18 23:30:58 +00:00
a6f61bda5d Add debug logging for weight loading failures biondizzle 2026-05-18 23:28:15 +00:00
eef0ef76af Fix NVFP4 compressor scale loading: buffer and concatenate scale shards biondizzle 2026-05-18 23:24:08 +00:00
f74447bfd0 Proper NVFP4 integration: quantized compressor/indexer + mapper fixes biondizzle 2026-05-18 23:20:13 +00:00
17496b2615 Fix NVFP4 weights mapper: add prefix mappings, fix substr order biondizzle 2026-05-18 23:03:34 +00:00
b039123207 Fix NVFP4 mapper: add attention projection renames, remove norm_gate renames biondizzle 2026-05-18 22:53:09 +00:00
ea648a9bc2 Fix NVFP4 mapper: keep model. prefix (model params use it) biondizzle 2026-05-18 22:49:40 +00:00
1528d4e182 Fix NVFP4 mapper: strip model. prefix from checkpoint keys biondizzle 2026-05-18 22:46:04 +00:00
5d37674fb1 Add cutedsl to MoEBackend type in kernel config biondizzle 2026-05-18 22:38:41 +00:00
7409204d71 Use nightly's deepseek_v4.py + attention as base, add only NVFP4 mapper biondizzle 2026-05-18 22:33:51 +00:00
a19ed4a18e Remove breakable_cudagraph import (not in nightly) biondizzle 2026-05-18 22:29:24 +00:00
b007937a68 Fix garbled imports in cutedsl/runner.py biondizzle 2026-05-18 22:22:52 +00:00
a7ed8faec6 Proper NVFP4 integration: use ModelOptNvFp4Config + FusedMoE framework biondizzle 2026-05-18 22:19:23 +00:00
