biondizzle/nvfp4-megamoe-kernel

Commit Graph

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

3cd910193c Rewrite scale assembly: no .item() calls, no Python loops, fully GPU biondizzle 2026-05-17 09:59:12 +00:00
4f6217acb9 Fix padded_cols calculation in scale assembly biondizzle 2026-05-17 09:58:09 +00:00
918aa8aede Fix scale assembly output shape: reshape to 2D for GEMM biondizzle 2026-05-17 09:57:27 +00:00
d9bae6d770 Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks biondizzle 2026-05-17 09:56:28 +00:00
55ac60eb91 Add detailed debug prints for OOB investigation biondizzle 2026-05-17 09:39:42 +00:00
fed3c417ba Add debug OOB check for sorted_token_ids biondizzle 2026-05-17 09:19:10 +00:00
eb7d4f099b Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync) biondizzle 2026-05-17 09:01:24 +00:00
ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync biondizzle 2026-05-17 08:58:43 +00:00
1330e2b2cf cleanup: remove debug prints, ready for testing biondizzle 2026-05-17 08:30:41 +00:00
d635dcbbb6 fix: keep token_indices on CPU, index with CPU sort_idx biondizzle 2026-05-17 08:29:18 +00:00
235d5b314f fix: fallback token indices allocation with verify+rebuild biondizzle 2026-05-17 08:27:47 +00:00
dd0b3fd4f9 debug: print sorted_token_ids in warmup biondizzle 2026-05-17 08:25:25 +00:00
04999d86cf fix: add quantize_to_nvfp4 import biondizzle 2026-05-17 08:24:57 +00:00
33e28100ee test: use runner's built-in warmup method biondizzle 2026-05-17 08:24:27 +00:00
7073daaffa fix: allocate token_indices on CPU, move to GPU AFTER JIT compilation biondizzle 2026-05-17 08:22:51 +00:00
0e7b06b55c debug: clone + sync token indices before JIT biondizzle 2026-05-17 08:22:11 +00:00
70c0618361 fix: allocate token_indices before CuTeDSL JIT compilation biondizzle 2026-05-17 08:20:41 +00:00
2bbe04efd8 debug: remove assert, test token corruption biondizzle 2026-05-17 08:19:45 +00:00
66627926c5 debug: int32 token indices with sync verify biondizzle 2026-05-17 08:18:37 +00:00
da02a5dc11 debug: assert token indices are correct after allocation biondizzle 2026-05-17 08:16:09 +00:00
c0d016a472 feat: compute_activation_global_scales warmup method biondizzle 2026-05-17 08:11:01 +00:00
8c9a51e006 fix: call _ensure_stacked in warmup test biondizzle 2026-05-17 08:07:09 +00:00
5ba77e355f test: warmup gs computation with safety margin sweep biondizzle 2026-05-17 08:06:27 +00:00
ae6b879d38 fix: pass expert_offsets without leading 0 to GEMM (matches pipeline) biondizzle 2026-05-17 07:59:00 +00:00
a1e6f5f891 fix: searchsorted right=True for correct expert assignment biondizzle 2026-05-17 07:57:00 +00:00
ddffb7d8df docs: current bug analysis — scale_a layout vs expert_offsets mismatch biondizzle 2026-05-17 07:53:58 +00:00
ed90341ea9 fix: scatter+per-expert-swizzle scale assembly (cudagraph-safe) biondizzle 2026-05-17 07:47:14 +00:00
37fecb588f fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls biondizzle 2026-05-17 07:43:05 +00:00
b824b838a9 fix: 128-row-align each expert's scales in padded buffer biondizzle 2026-05-17 07:39:49 +00:00
8dadd9a723 test: scale assembly debug biondizzle 2026-05-17 07:37:47 +00:00
8642946274 fix: padded x_sf buffer for fixed-shape scale assembly biondizzle 2026-05-17 07:37:04 +00:00
418e29f7f5 fix: per-expert scale assembly (match assemble_scales_2d_side) biondizzle 2026-05-17 07:35:49 +00:00
7b95e76723 test: runner vs pipeline comparison + scale assembly comparison biondizzle 2026-05-17 07:33:20 +00:00
366a0240a5 vllm tweaks biondizzle 2026-05-17 07:14:58 +00:00
34c43958d0 vllm tweaks biondizzle 2026-05-17 07:10:16 +00:00
48e4cb625d fix: default activation global_scale so runner works without finalize_weights biondizzle 2026-05-17 06:24:15 +00:00
d2965b432d fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch biondizzle 2026-05-17 03:35:20 +00:00
b382a7a528 fix: handle input_scale as 1D or 2D (EP splits change the shape) biondizzle 2026-05-16 22:49:30 +00:00
139c9c37cd fix: read input_scale from nn.Parameter before it's freed biondizzle 2026-05-16 22:23:24 +00:00
152648789d fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688) biondizzle 2026-05-16 21:46:00 +00:00
af087e655e docs: update README — vLLM cudagraph inference running, output quality in progress biondizzle 2026-05-16 21:40:59 +00:00
0a5cfe0433 add kernel compile caching — compile once, invoke on subsequent calls biondizzle 2026-05-16 20:45:46 +00:00
3465b9d471 remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe) biondizzle 2026-05-16 20:42:49 +00:00
5e245bc0c6 fix: missing newline biondizzle 2026-05-16 20:40:18 +00:00
288e179f88 add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale) biondizzle 2026-05-16 20:39:37 +00:00
521e11e468 test: old bridge + LUT quantization only (step 1 of cudagraph migration) biondizzle 2026-05-16 20:37:42 +00:00
f51be76e8f temp: restore EXACT old bridge.py from b685112 biondizzle 2026-05-16 20:34:45 +00:00
58dc36e21c fix: compile fresh each call — cached compile produces wrong TMA descriptors biondizzle 2026-05-16 20:28:15 +00:00
98cc6ac1f3 fix: invert cache check logic (compile when NOT in cache) biondizzle 2026-05-16 20:25:16 +00:00
e337ec86a3 debug: test with cache enabled biondizzle 2026-05-16 20:24:04 +00:00
bc56452be8 debug: disable kernel cache to test fresh compilation biondizzle 2026-05-16 20:22:51 +00:00
647c03b2ee fix: make_b_k_major must preserve shape — use double-permute trick biondizzle 2026-05-16 20:19:21 +00:00
ed4f501bba fix: make_b_k_major stride check — K-major means stride[1]==1, not stride[2]==1 biondizzle 2026-05-16 20:18:18 +00:00
2162cee4ad fix: restore proper quantize_weight_to_nvfp4 — K is the packed dim, not N biondizzle 2026-05-16 20:16:28 +00:00
10f1dca982 fix: import ceil_div from correct module biondizzle 2026-05-16 20:09:02 +00:00
81632e2f21 fix: correct cutlass_torch import (cutlass.torch, not top-level) biondizzle 2026-05-16 20:08:21 +00:00
16c4fad025 fix: remove cutlass.cute.backend import biondizzle 2026-05-16 20:06:38 +00:00
44b40d41fe fix: compile CuTeDSL kernel with real tensors, not dummy shapes biondizzle 2026-05-16 20:05:59 +00:00
79281b6fda fix: compute K_packed/N_packed before passing to _get_compiled_kernel biondizzle 2026-05-16 20:00:35 +00:00
caf93d6c45 fix: pass K_packed/N_packed to _get_compiled_kernel biondizzle 2026-05-16 19:59:43 +00:00
ecc7b83334 fix: compile CuTeDSL kernel with actual tensor shapes, not dummy 256x256 biondizzle 2026-05-16 19:58:13 +00:00
cc75a55bd9 restore: new bridge/moe_pipeline/layertest biondizzle 2026-05-16 19:55:19 +00:00
0c878b3a9e temp: restore old layertest+bridge for cosine comparison biondizzle 2026-05-16 19:54:04 +00:00
0069769d12 debug: print global scales biondizzle 2026-05-16 19:38:31 +00:00
84589fe984 debug: more prints biondizzle 2026-05-16 19:31:54 +00:00
fa2d5708c5 debug: add L1 GEMM and SiLU output debug prints biondizzle 2026-05-16 19:29:42 +00:00
4c06c51ec3 fix: moe_pipeline.py gate/up split — L1 output is 2*intermediate, not intermediate biondizzle 2026-05-16 19:28:15 +00:00
da31ce7e1a allow for cuda graphs again biondizzle 2026-05-16 19:23:41 +00:00
d15c43294b fix: test L2 weight N dim should be hidden_size, not hidden_size//2 biondizzle 2026-05-16 19:07:36 +00:00
28788c6f55 fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate biondizzle 2026-05-16 19:07:08 +00:00
f7e29fdf1e docs: update README with cudagraph compatibility work and decisions biondizzle 2026-05-16 18:55:47 +00:00
103fd451ce fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph) biondizzle 2026-05-16 18:50:35 +00:00
2f68c7ba77 fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward) biondizzle 2026-05-16 18:48:31 +00:00
6c298be842 fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy) biondizzle 2026-05-16 18:47:39 +00:00
53c25bee0b rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices biondizzle 2026-05-16 18:44:25 +00:00
4300775bfe fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead biondizzle 2026-05-16 18:29:12 +00:00
5a79065b2b fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element) biondizzle 2026-05-16 18:27:44 +00:00
95a1345b92 fix: return 2D scale tensor from _assemble_scales_cudagraph_safe biondizzle 2026-05-16 18:26:57 +00:00
533089c9d2 fix: token_indices slice bug + torch.zeros for float4/float8 dtypes biondizzle 2026-05-16 18:21:27 +00:00
54c470e535 fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8) biondizzle 2026-05-16 18:13:14 +00:00
f2de95c526 fix: use randint for float4 dummy weights in cudagraph test biondizzle 2026-05-16 18:08:45 +00:00
f66d4b69a4 GPU-only scale assembly + cudagraph test harness biondizzle 2026-05-16 18:05:13 +00:00
5121074782 cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly biondizzle 2026-05-16 18:01:47 +00:00
ab126b0c0d fix: revert to .item() based scale assembly (fixes index OOB) biondizzle 2026-05-16 17:55:32 +00:00
7594968482 WIP: cudagraph-compatible CuTeDSL MoE runner biondizzle 2026-05-16 16:36:19 +00:00
f0c1be3ced fix: remove broken hc_head warmup (wrong tensor shape) biondizzle 2026-05-16 10:11:34 +00:00
c803180706 fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe biondizzle 2026-05-16 09:16:24 +00:00
cdd813cf7e fix: free per-expert weight lists after stacking in CuTeDSL runner biondizzle 2026-05-16 08:54:52 +00:00
99c11c218d fucken a biondizzle 2026-05-16 08:39:13 +00:00
906ee80a42 Add tilelang kernel warmup in load_weights biondizzle 2026-05-16 08:28:39 +00:00
a51ef3d2cf fucken a biondizzle 2026-05-16 08:23:27 +00:00
72bf750a0b fix: revert to eager mode — CUDA graphs OOM with 175GB model biondizzle 2026-05-16 08:07:44 +00:00
baf44c92f8 fix: memory-efficient E2M1 quantization — no 32x distance tensor biondizzle 2026-05-16 07:49:38 +00:00
a2cac7a7fe fix: remove CuTeDSL warmup — OOM with 175GB model loaded biondizzle 2026-05-16 07:32:17 +00:00
e0814eb54e fix: cast expert_offsets to int32 for CuTeDSL kernel biondizzle 2026-05-16 07:15:57 +00:00
4b0a9557f0 fix: rewrite CuTeDSLMoERunner for CUDA graph compatibility biondizzle 2026-05-16 07:03:08 +00:00
dab31b0961 fix: missing tqdm import in weight_loader biondizzle 2026-05-16 06:31:14 +00:00
8496ac99bc dang clonkurs biondizzle 2026-05-16 06:28:16 +00:00
e7c6274107 Revert "feat: auto-warmup in build_and_run.sh" biondizzle 2026-05-16 06:14:28 +00:00
f792537719 feat: auto-warmup in build_and_run.sh biondizzle 2026-05-16 06:11:38 +00:00
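
Commits a1e6f5f891 ("searchsorted right=True for correct expert assignment") and ae6b879d38 ("pass expert_offsets without leading 0") both concern mapping sorted token rows to experts through cumulative offsets. The sketch below is a pure-Python analogue using `bisect`, not the repository's code: the function name `assign_experts`, the example offsets, and the assumption that offsets are exclusive end positions are all hypothetical and only illustrate why the right-sided search matters at block boundaries.

```python
from bisect import bisect_left, bisect_right


def assign_experts(expert_end_offsets, num_rows, right=True):
    """Map each sorted row index to the expert whose block contains it.

    expert_end_offsets holds the exclusive end row of each expert's block
    (cumulative counts without a leading 0). This layout is an assumption
    made for illustration.
    """
    find = bisect_right if right else bisect_left
    return [find(expert_end_offsets, row) for row in range(num_rows)]


# Hypothetical layout: 3 experts owning rows [0, 3), [3, 5), [5, 9).
offsets = [3, 5, 9]

# Right-sided search: a row equal to an end offset belongs to the next expert.
correct = assign_experts(offsets, 9, right=True)

# Left-sided search: rows 3 and 5 are wrongly assigned to the previous expert.
off_by_one = assign_experts(offsets, 9, right=False)

print(correct)
print(off_by_one)
```

With a left-sided search, the first row of every expert block after the first is attributed to the preceding expert, which is the kind of boundary error a `right=True` fix would address under these assumptions.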
