Commit Graph

  • 3cd910193c Rewrite scale assembly: no .item() calls, no Python loops, fully GPU biondizzle 2026-05-17 09:59:12 +00:00
  • 4f6217acb9 Fix padded_cols calculation in scale assembly biondizzle 2026-05-17 09:58:09 +00:00
  • 918aa8aede Fix scale assembly output shape: reshape to 2D for GEMM biondizzle 2026-05-17 09:57:27 +00:00
  • d9bae6d770 Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks biondizzle 2026-05-17 09:56:28 +00:00
  • 55ac60eb91 Add detailed debug prints for OOB investigation biondizzle 2026-05-17 09:39:42 +00:00
  • fed3c417ba Add debug OOB check for sorted_token_ids biondizzle 2026-05-17 09:19:10 +00:00
  • eb7d4f099b Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync) biondizzle 2026-05-17 09:01:24 +00:00
  • ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync biondizzle 2026-05-17 08:58:43 +00:00
  • 1330e2b2cf cleanup: remove debug prints, ready for testing biondizzle 2026-05-17 08:30:41 +00:00
  • d635dcbbb6 fix: keep token_indices on CPU, index with CPU sort_idx biondizzle 2026-05-17 08:29:18 +00:00
  • 235d5b314f fix: fallback token indices allocation with verify+rebuild biondizzle 2026-05-17 08:27:47 +00:00
  • dd0b3fd4f9 debug: print sorted_token_ids in warmup biondizzle 2026-05-17 08:25:25 +00:00
  • 04999d86cf fix: add quantize_to_nvfp4 import biondizzle 2026-05-17 08:24:57 +00:00
  • 33e28100ee test: use runner's built-in warmup method biondizzle 2026-05-17 08:24:27 +00:00
  • 7073daaffa fix: allocate token_indices on CPU, move to GPU AFTER JIT compilation biondizzle 2026-05-17 08:22:51 +00:00
  • 0e7b06b55c debug: clone + sync token indices before JIT biondizzle 2026-05-17 08:22:11 +00:00
  • 70c0618361 fix: allocate token_indices before CuTeDSL JIT compilation biondizzle 2026-05-17 08:20:41 +00:00
  • 2bbe04efd8 debug: remove assert, test token corruption biondizzle 2026-05-17 08:19:45 +00:00
  • 66627926c5 debug: int32 token indices with sync verify biondizzle 2026-05-17 08:18:37 +00:00
  • da02a5dc11 debug: assert token indices are correct after allocation biondizzle 2026-05-17 08:16:09 +00:00
  • c0d016a472 feat: compute_activation_global_scales warmup method biondizzle 2026-05-17 08:11:01 +00:00
  • 8c9a51e006 fix: call _ensure_stacked in warmup test biondizzle 2026-05-17 08:07:09 +00:00
  • 5ba77e355f test: warmup gs computation with safety margin sweep biondizzle 2026-05-17 08:06:27 +00:00
  • ae6b879d38 fix: pass expert_offsets without leading 0 to GEMM (matches pipeline) biondizzle 2026-05-17 07:59:00 +00:00
  • a1e6f5f891 fix: searchsorted right=True for correct expert assignment biondizzle 2026-05-17 07:57:00 +00:00
  • ddffb7d8df docs: current bug analysis — scale_a layout vs expert_offsets mismatch biondizzle 2026-05-17 07:53:58 +00:00
  • ed90341ea9 fix: scatter+per-expert-swizzle scale assembly (cudagraph-safe) biondizzle 2026-05-17 07:47:14 +00:00
  • 37fecb588f fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls biondizzle 2026-05-17 07:43:05 +00:00
  • b824b838a9 fix: 128-row-align each expert's scales in padded buffer biondizzle 2026-05-17 07:39:49 +00:00
  • 8dadd9a723 test: scale assembly debug biondizzle 2026-05-17 07:37:47 +00:00
  • 8642946274 fix: padded x_sf buffer for fixed-shape scale assembly biondizzle 2026-05-17 07:37:04 +00:00
  • 418e29f7f5 fix: per-expert scale assembly (match assemble_scales_2d_side) biondizzle 2026-05-17 07:35:49 +00:00
  • 7b95e76723 test: runner vs pipeline comparison + scale assembly comparison biondizzle 2026-05-17 07:33:20 +00:00
  • 366a0240a5 vllm tweaks biondizzle 2026-05-17 07:14:58 +00:00
  • 34c43958d0 vllm tweaks biondizzle 2026-05-17 07:10:16 +00:00
  • 48e4cb625d fix: default activation global_scale so runner works without finalize_weights biondizzle 2026-05-17 06:24:15 +00:00
  • d2965b432d fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch biondizzle 2026-05-17 03:35:20 +00:00
  • b382a7a528 fix: handle input_scale as 1D or 2D (EP splits change the shape) biondizzle 2026-05-16 22:49:30 +00:00
  • 139c9c37cd fix: read input_scale from nn.Parameter before it's freed biondizzle 2026-05-16 22:23:24 +00:00
  • 152648789d fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688) biondizzle 2026-05-16 21:46:00 +00:00
  • af087e655e docs: update README — vLLM cudagraph inference running, output quality in progress biondizzle 2026-05-16 21:40:59 +00:00
  • 0a5cfe0433 add kernel compile caching — compile once, invoke on subsequent calls biondizzle 2026-05-16 20:45:46 +00:00
  • 3465b9d471 remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe) biondizzle 2026-05-16 20:42:49 +00:00
  • 5e245bc0c6 fix: missing newline biondizzle 2026-05-16 20:40:18 +00:00
  • 288e179f88 add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale) biondizzle 2026-05-16 20:39:37 +00:00
  • 521e11e468 test: old bridge + LUT quantization only (step 1 of cudagraph migration) biondizzle 2026-05-16 20:37:42 +00:00
  • f51be76e8f temp: restore EXACT old bridge.py from b685112 biondizzle 2026-05-16 20:34:45 +00:00
  • 58dc36e21c fix: compile fresh each call — cached compile produces wrong TMA descriptors biondizzle 2026-05-16 20:28:15 +00:00
  • 98cc6ac1f3 fix: invert cache check logic (compile when NOT in cache) biondizzle 2026-05-16 20:25:16 +00:00
  • e337ec86a3 debug: test with cache enabled biondizzle 2026-05-16 20:24:04 +00:00
  • bc56452be8 debug: disable kernel cache to test fresh compilation biondizzle 2026-05-16 20:22:51 +00:00
  • 647c03b2ee fix: make_b_k_major must preserve shape — use double-permute trick biondizzle 2026-05-16 20:19:21 +00:00
  • ed4f501bba fix: make_b_k_major stride check — K-major means stride[1]==1, not stride[2]==1 biondizzle 2026-05-16 20:18:18 +00:00
  • 2162cee4ad fix: restore proper quantize_weight_to_nvfp4 — K is the packed dim, not N biondizzle 2026-05-16 20:16:28 +00:00
  • 10f1dca982 fix: import ceil_div from correct module biondizzle 2026-05-16 20:09:02 +00:00
  • 81632e2f21 fix: correct cutlass_torch import (cutlass.torch, not top-level) biondizzle 2026-05-16 20:08:21 +00:00
  • 16c4fad025 fix: remove cutlass.cute.backend import biondizzle 2026-05-16 20:06:38 +00:00
  • 44b40d41fe fix: compile CuTeDSL kernel with real tensors, not dummy shapes biondizzle 2026-05-16 20:05:59 +00:00
  • 79281b6fda fix: compute K_packed/N_packed before passing to _get_compiled_kernel biondizzle 2026-05-16 20:00:35 +00:00
  • caf93d6c45 fix: pass K_packed/N_packed to _get_compiled_kernel biondizzle 2026-05-16 19:59:43 +00:00
  • ecc7b83334 fix: compile CuTeDSL kernel with actual tensor shapes, not dummy 256x256 biondizzle 2026-05-16 19:58:13 +00:00
  • cc75a55bd9 restore: new bridge/moe_pipeline/layertest biondizzle 2026-05-16 19:55:19 +00:00
  • 0c878b3a9e temp: restore old layertest+bridge for cosine comparison biondizzle 2026-05-16 19:54:04 +00:00
  • 0069769d12 debug: print global scales biondizzle 2026-05-16 19:38:31 +00:00
  • 84589fe984 debug: more prints biondizzle 2026-05-16 19:31:54 +00:00
  • fa2d5708c5 debug: add L1 GEMM and SiLU output debug prints biondizzle 2026-05-16 19:29:42 +00:00
  • 4c06c51ec3 fix: moe_pipeline.py gate/up split — L1 output is 2*intermediate, not intermediate biondizzle 2026-05-16 19:28:15 +00:00
  • da31ce7e1a allow for cuda graphs again biondizzle 2026-05-16 19:23:41 +00:00
  • d15c43294b fix: test L2 weight N dim should be hidden_size, not hidden_size//2 biondizzle 2026-05-16 19:07:36 +00:00
  • 28788c6f55 fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate biondizzle 2026-05-16 19:07:08 +00:00
  • f7e29fdf1e docs: update README with cudagraph compatibility work and decisions biondizzle 2026-05-16 18:55:47 +00:00
  • 103fd451ce fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph) biondizzle 2026-05-16 18:50:35 +00:00
  • 2f68c7ba77 fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward) biondizzle 2026-05-16 18:48:31 +00:00
  • 6c298be842 fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy) biondizzle 2026-05-16 18:47:39 +00:00
  • 53c25bee0b rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices biondizzle 2026-05-16 18:44:25 +00:00
  • 4300775bfe fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead biondizzle 2026-05-16 18:29:12 +00:00
  • 5a79065b2b fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element) biondizzle 2026-05-16 18:27:44 +00:00
  • 95a1345b92 fix: return 2D scale tensor from _assemble_scales_cudagraph_safe biondizzle 2026-05-16 18:26:57 +00:00
  • 533089c9d2 fix: token_indices slice bug + torch.zeros for float4/float8 dtypes biondizzle 2026-05-16 18:21:27 +00:00
  • 54c470e535 fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8) biondizzle 2026-05-16 18:13:14 +00:00
  • f2de95c526 fix: use randint for float4 dummy weights in cudagraph test biondizzle 2026-05-16 18:08:45 +00:00
  • f66d4b69a4 GPU-only scale assembly + cudagraph test harness biondizzle 2026-05-16 18:05:13 +00:00
  • 5121074782 cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly biondizzle 2026-05-16 18:01:47 +00:00
  • ab126b0c0d fix: revert to .item() based scale assembly (fixes index OOB) biondizzle 2026-05-16 17:55:32 +00:00
  • 7594968482 WIP: cudagraph-compatible CuTeDSL MoE runner biondizzle 2026-05-16 16:36:19 +00:00
  • f0c1be3ced fix: remove broken hc_head warmup (wrong tensor shape) biondizzle 2026-05-16 10:11:34 +00:00
  • c803180706 fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe biondizzle 2026-05-16 09:16:24 +00:00
  • cdd813cf7e fix: free per-expert weight lists after stacking in CuTeDSL runner biondizzle 2026-05-16 08:54:52 +00:00
  • 99c11c218d fucken a biondizzle 2026-05-16 08:39:13 +00:00
  • 906ee80a42 Add tilelang kernel warmup in load_weights biondizzle 2026-05-16 08:28:39 +00:00
  • a51ef3d2cf fucken a biondizzle 2026-05-16 08:23:27 +00:00
  • 72bf750a0b fix: revert to eager mode — CUDA graphs OOM with 175GB model biondizzle 2026-05-16 08:07:44 +00:00
  • baf44c92f8 fix: memory-efficient E2M1 quantization — no 32x distance tensor biondizzle 2026-05-16 07:49:38 +00:00
  • a2cac7a7fe fix: remove CuTeDSL warmup — OOM with 175GB model loaded biondizzle 2026-05-16 07:32:17 +00:00
  • e0814eb54e fix: cast expert_offsets to int32 for CuTeDSL kernel biondizzle 2026-05-16 07:15:57 +00:00
  • 4b0a9557f0 fix: rewrite CuTeDSLMoERunner for CUDA graph compatibility biondizzle 2026-05-16 07:03:08 +00:00
  • dab31b0961 fix: missing tqdm import in weight_loader biondizzle 2026-05-16 06:31:14 +00:00
  • 8496ac99bc dang clonkurs biondizzle 2026-05-16 06:28:16 +00:00
  • e7c6274107 Revert "feat: auto-warmup in build_and_run.sh" biondizzle 2026-05-16 06:14:28 +00:00
  • f792537719 feat: auto-warmup in build_and_run.sh biondizzle 2026-05-16 06:11:38 +00:00