nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

Branches

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

15c987244f v28 attempt: PV MMA (128,64) - cosine 0.004, debugging biondizzle 2026-05-21 05:41:44 +00:00
a7fd2761df README: Bug 4 root cause — TMEM layout mismatch (128,64) PV A-fragment vs softmax P write biondizzle 2026-05-21 05:17:12 +00:00
c20518332e more stuff biondizzle 2026-05-21 05:08:57 +00:00
0dc6fe4a7d Stage B progress: PV works for square (128,128), broken for (128,64) biondizzle 2026-05-21 04:40:28 +00:00
7a8945eb76 Stage B: pipeline deadlock fixed, V MN-major applied, PV output garbage biondizzle 2026-05-21 04:10:07 +00:00
467ade37b2 Stage B: C-fragment vs A-fragment TMEM layout mismatch diagnosed biondizzle 2026-05-21 00:12:47 +00:00
97656a5cd1 Stage B: two MMAs + identity softmax — crash fixed, softmax output still wrong biondizzle 2026-05-20 20:26:25 +00:00
a5b48be7d5 stuff biondizzle 2026-05-20 07:15:01 +00:00
9f0528f150 Update README: reflect current state, add C128A/C4A topk + warmup fixes biondizzle 2026-05-20 06:51:12 +00:00
67d5e26080 Fix warmup compilation + add sparse topk metadata kernels biondizzle 2026-05-20 06:43:43 +00:00
bbba289bd8 feat: GPU-native SWA + sparse decode attention kernels (CuTeDSL) biondizzle 2026-05-20 05:46:15 +00:00
06bf4f482d README: comprehensive update with current kernel status biondizzle 2026-05-20 04:42:57 +00:00
a30d9eb523 Update README with final kernel status biondizzle 2026-05-20 04:39:57 +00:00
04eca7c6da Custom CUDA kernel for de-interleave plus NVFP4 quantize biondizzle 2026-05-20 04:39:47 +00:00
061d5692a9 Remove debug print statements from pipeline biondizzle 2026-05-20 04:20:46 +00:00
aa8563c626 Fused SwiGLU epilogue with granularity-8 weight interleave biondizzle 2026-05-20 04:13:52 +00:00
57d4cb714f docs: rewrite README.md with current project state biondizzle 2026-05-20 03:30:35 +00:00
6c04155167 wip: Step 2 gate/up pairing — SiLU validated, runtime conditionals blocked by CuTeDSL biondizzle 2026-05-20 03:26:20 +00:00
9f0c1b8c5d wip: Step 1 SiLU validation complete, Step 2 gate/up pairing planning biondizzle 2026-05-20 03:16:34 +00:00
b84f2f7bf9 fix: cutlass.Float32 not cutlass.float32_t in fused epilogue biondizzle 2026-05-20 03:12:23 +00:00
08992b818d wip: add run_fused_swiglu_grouped_gemm bridge + step1 test biondizzle 2026-05-20 03:10:56 +00:00
9c43c69a4c wip: fused SwiGLU Stage 1 - SiLU in registers (full acc_vec) biondizzle 2026-05-20 03:07:02 +00:00
2f053f674e wip: fused SwiGLU kernel scaffold + bridge interleave + plan biondizzle 2026-05-20 03:04:38 +00:00
4f178d6e9c chore: remove unused _expert_id_range after bincount migration biondizzle 2026-05-20 02:17:44 +00:00
84a2f6d441 perf: replace expert counting O(n*E) comparison with torch.bincount O(n) biondizzle 2026-05-20 02:17:23 +00:00
4882d8553c fix: zero out x_norm for underflow blocks before division in NVFP4 quantization biondizzle 2026-05-20 02:16:49 +00:00
e653712598 fix: detect zero blocks in NVFP4 quantization, force FP4+FP8 to exact zero biondizzle 2026-05-20 02:14:50 +00:00
1857bdedc3 chore: deprecate prepare_weights_from_dequantized and prepare_weights_direct biondizzle 2026-05-20 02:11:40 +00:00
ef398006a7 fix: correct scale factor dimensions in warmup (K_sf = ceil_div(K_packed,8) not ceil_div(K_packed,16)) biondizzle 2026-05-20 02:08:26 +00:00
8f1a20562f fix: root-cause JIT memory corruption myth, add eager warmup, remove _needs_token_refill biondizzle 2026-05-20 02:08:01 +00:00
6ec0afc318 fix: handle 3D swa_indices and correct kv_bf16 expand dims biondizzle 2026-05-20 01:36:27 +00:00
aa593361e7 feat: add native CuTeDSL SWA decode attention kernel stub + batched SDPA fallback biondizzle 2026-05-20 01:28:05 +00:00
3599b44c0f fix: replace _allocate_buffers with _ensure_buffer_size for dynamic sizing biondizzle 2026-05-20 00:02:10 +00:00
1d5e70adfb fix: dynamic buffer sizing in nvfp4_linear for varying token counts biondizzle 2026-05-19 23:59:55 +00:00
1901bf585e nuke vllm because this keep confusing people biondizzle 2026-05-19 23:04:36 +00:00
5fb70b4cd2 Update README.md and CURRENT_BUG.md: eliminate stale issues, document NaN investigation, clarify our kernels are clean biondizzle 2026-05-19 20:22:10 +00:00
2e6559402c Add full layer NaN test (attention + MoE, multi-layer chain) biondizzle 2026-05-19 18:36:49 +00:00
cca145e35c Use 16 experts for MoE runner test (fits in memory) biondizzle 2026-05-19 18:35:40 +00:00
7893e7514d Add MoE runner NaN test (grouped GEMM with real weights) biondizzle 2026-05-19 18:34:56 +00:00
7b432da754 Fix intermediate size: 3072 not 18432 biondizzle 2026-05-19 18:34:12 +00:00
293f14a179 Rewrite MoE NaN test: per-expert format, activation quantization, grouped GEMM biondizzle 2026-05-19 18:33:57 +00:00
62f2395e30 Fix MoE weight key names, add fallback biondizzle 2026-05-19 18:32:49 +00:00
9455466648 Add MoE NaN reproduction test, update CURRENT_BUG.md with NaN tracing and test plan biondizzle 2026-05-19 18:32:14 +00:00
0316cec6fb Add input NaN debug to trace where NaN starts biondizzle 2026-05-19 18:15:53 +00:00
4c45d73b82 Add prefill inputs NaN debug biondizzle 2026-05-19 18:04:18 +00:00
0773c9608c Add prefill attention value debug check biondizzle 2026-05-19 17:55:35 +00:00
4f02113aa0 Use module-level Blackwell flag in compressor (works during torch.compile) biondizzle 2026-05-19 17:37:26 +00:00
8cf6ac3e8c CRITICAL FIX: Remove double Q normalization and fix RoPE sin slice biondizzle 2026-05-19 17:27:33 +00:00
a94ad73c64 Fix imports in vLLM codepaths test biondizzle 2026-05-19 17:26:50 +00:00
f3f9674810 Fix f-string syntax biondizzle 2026-05-19 17:26:40 +00:00
6cc2312e61 Add test for exact vLLM codepaths (fused_qnorm, kv_write, decode) biondizzle 2026-05-19 17:26:10 +00:00
aade8593f7 CRITICAL FIX: Properly dequantize fp8 KV in decode using per-token inv_scale biondizzle 2026-05-19 17:08:58 +00:00
2f811bc8bd FIX: Use vLLM's decode_swa_indices for correct paged KV cache access during decode biondizzle 2026-05-19 16:55:44 +00:00
da6fa2f1d6 Fix UnboundLocalError: move num_decode_tokens before debug print biondizzle 2026-05-19 16:43:28 +00:00
76fff5fc8b CRITICAL FIX: Skip compressor fused attention kernel on Blackwell — it bypasses our attention path biondizzle 2026-05-19 16:35:07 +00:00
0554332352 Add debug logging to Blackwell attention path biondizzle 2026-05-19 16:31:55 +00:00
f9a09df81a Fix wrapper attribute access: kv_cache, attn_sink, max_model_len via mla_attn biondizzle 2026-05-19 16:19:28 +00:00
b95e934703 Add CSA/HCA decode + prefill attention to Blackwell path biondizzle 2026-05-19 16:06:24 +00:00
abff942edd Fix N for C128A (need 128 tokens) biondizzle 2026-05-19 16:04:53 +00:00
49c2e088d4 Fix compressor key name biondizzle 2026-05-19 16:04:38 +00:00
7d89ede9f9 Add CSA sparse attention test (compressed KV gather + SWA merge) biondizzle 2026-05-19 16:04:19 +00:00
51a7a89c5c Update CURRENT_BUG: KV cache pipeline verified, all tests passing biondizzle 2026-05-19 16:01:10 +00:00
696a890df7 Add decode vs prefill consistency test biondizzle 2026-05-19 16:00:33 +00:00
359654f08e Test with all 61 layers (shared experts only) biondizzle 2026-05-19 15:55:41 +00:00
3e6041d752 Fix view→reshape for non-contiguous tensor biondizzle 2026-05-19 15:54:40 +00:00
ff9f373633 Add e2e decode test (3 layers: C128A, C4A, SWA) biondizzle 2026-05-19 15:53:29 +00:00
a5870fa05c Vectorize paged KV cache read/write, kill container biondizzle 2026-05-19 15:48:16 +00:00
9e428b83c7 Fix KV cache: write to paged cache, handle uint8→fp8 conversion, fix RoPE bug biondizzle 2026-05-19 15:34:09 +00:00
0023fee706 Add blackwell_attention module and comprehensive test biondizzle 2026-05-19 15:30:29 +00:00
142a4a1ad4 Fix attention for decode (1 query vs N cached KVs) biondizzle 2026-05-19 15:28:52 +00:00
4b85605edf Fix fp8 amax in decode test biondizzle 2026-05-19 15:28:17 +00:00
4f23055450 Add decode attention pipeline test — reproduces KV cache bug biondizzle 2026-05-19 15:27:55 +00:00
31b9cfbdbd Update README and CURRENT_BUG: BUILD YOUR OWN KERNELS. Stop patching vLLM. biondizzle 2026-05-19 15:19:55 +00:00
dca8bfc3a8 Fix _apply_rope_kv: use inline RoPE instead of 3D apply_gptj_rope biondizzle 2026-05-19 10:36:21 +00:00
8e6721917e Fix syntax in RoPE KV test biondizzle 2026-05-19 10:31:07 +00:00
cbf440f75a Add RoPE KV test biondizzle 2026-05-19 10:28:15 +00:00
a5fabbdf66 Apply RoPE to KV in Blackwell attention path - fix NaN output biondizzle 2026-05-19 10:27:15 +00:00
7e97551fd3 Fix: use self.scale instead of self.softmax_scale in Blackwell attention path biondizzle 2026-05-19 10:04:46 +00:00
39310c357d Patch compressor cache for Blackwell (no FlashMLA alignment) - fixes 91 missing layers biondizzle 2026-05-19 09:52:23 +00:00
d9cd8fa165 Add debug patch to print layer name mismatch biondizzle 2026-05-19 09:45:10 +00:00
9a0b015aac Reduce max_model_len to 256 biondizzle 2026-05-19 09:37:38 +00:00
de1fb839f0 Patch SWA and Indexer cache specs for Blackwell (no FlashMLA alignment) biondizzle 2026-05-19 09:29:57 +00:00
ea771ff70b Reduce max_model_len to 512 for initial container test biondizzle 2026-05-19 09:23:10 +00:00
bcfbd1e25b Reduce max_model_len to 32768 (876544 requires 204 GiB KV cache) biondizzle 2026-05-19 09:13:33 +00:00
e91421f06e Fix KV cache page size patch: separate groups for large SWA pages biondizzle 2026-05-19 09:05:14 +00:00
dd7f2627e8 Add full model forward test (WIP), sparse attention test passes biondizzle 2026-05-19 09:04:19 +00:00
9781953509 Add CSA/HCA sparse attention kernel test biondizzle 2026-05-19 09:02:12 +00:00
d60673864a Fix kv_ref transpose in KV cache test biondizzle 2026-05-19 08:58:46 +00:00
c1099d76d2 Add KV cache kernel test - fp8 quantize/dequant, paged cache, CSA/HCA compression biondizzle 2026-05-19 08:57:31 +00:00
c54ddbdae1 Fix NVFP4 attention: slice output to actual N after 128-padding biondizzle 2026-05-19 08:55:31 +00:00
42285b6c24 Add CuTeDSL NVFP4 attention kernel test - Q×K^T GEMM biondizzle 2026-05-19 08:54:59 +00:00
9465929e6e Add DeepSeek-V4 CSA/HCA attention pipeline test (not MLA) biondizzle 2026-05-19 08:51:16 +00:00
fa71fbe909 Patch KV cache utils: handle DeepseekV4 SWA page sizes > MLA page sizes biondizzle 2026-05-19 08:45:44 +00:00
d08a457829 Fix cos_sin cache shape in NVFP4 attention test biondizzle 2026-05-19 08:38:55 +00:00
7dd8871e84 Add NVFP4 attention test - quantize Q and K for Q×K^T GEMM biondizzle 2026-05-19 08:38:25 +00:00
2672e98e4c Remove VLLM_NVFP4_GEMM_BACKEND env var - CuTeDSL auto-selects on Blackwell biondizzle 2026-05-19 08:35:40 +00:00
914d27fee7 Update README + CURRENT_BUG: full CuTeDSL NVFP4 plan, no more PyTorch fallbacks biondizzle 2026-05-19 08:26:16 +00:00
7d5c093c99 Fix KV cache crash: skip SWA cache write on Blackwell biondizzle 2026-05-19 08:21:57 +00:00
e1a642452a Fix Blackwell: skip FlashMLA assertion + force CuTeDSL kernel biondizzle 2026-05-19 08:19:23 +00:00
2856323360 Fix torch.compile crash: move Blackwell path inside custom op boundary biondizzle 2026-05-19 08:11:58 +00:00
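
Two of the fixes named in the log above are small enough to illustrate in isolation: the warmup scale-factor dimension (ef398006a7, `K_sf = ceil_div(K_packed,8)`) and the move from per-expert comparison to `torch.bincount` (84a2f6d441). The sketch below is not the repository's code. It assumes two FP4 values per packed byte and one scale factor per 16-element block, which is what makes the divisor 8 rather than 16, and it uses a pure-Python counter as a stand-in for `torch.bincount` so it runs without PyTorch. All function names are hypothetical, and the value 3072 is borrowed from commit 7b432da754 purely as an example size.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


# Assumptions (not stated in the log): two FP4 values are packed per byte,
# and one scale factor covers a block of 16 elements.
FP4_PER_BYTE = 2
BLOCK_ELEMS = 16


def scale_factor_cols(k_packed: int) -> int:
    """Scale factors along K for a row stored as k_packed bytes.

    A 16-element block occupies 16 / 2 = 8 packed bytes, so the divisor
    is 8 when starting from the packed width.
    """
    bytes_per_block = BLOCK_ELEMS // FP4_PER_BYTE
    return ceil_div(k_packed, bytes_per_block)


def count_by_comparison(expert_ids, num_experts):
    """O(n * E): one full pass over the token list per expert."""
    return [sum(1 for e in expert_ids if e == x) for x in range(num_experts)]


def count_single_pass(expert_ids, num_experts):
    """O(n): one pass over the tokens.

    Stand-in for torch.bincount(expert_ids, minlength=num_experts).
    """
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return counts


if __name__ == "__main__":
    k = 3072                      # example K, in elements
    k_packed = k // FP4_PER_BYTE  # bytes per row
    print("K_sf:", scale_factor_cols(k_packed))

    ids = [0, 2, 2, 1, 2, 0]      # hypothetical routed expert ids
    print(count_by_comparison(ids, 4), count_single_pass(ids, 4))
```

Dividing the packed width by 16 would halve the number of scale factors under these assumptions, which is consistent with the dimension mismatch the commit message describes.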
