-
3cd910193c
Rewrite scale assembly: no .item() calls, no Python loops, fully GPU
biondizzle
2026-05-17 09:59:12 +00:00
-
4f6217acb9
Fix padded_cols calculation in scale assembly
biondizzle
2026-05-17 09:58:09 +00:00
-
918aa8aede
Fix scale assembly output shape: reshape to 2D for GEMM
biondizzle
2026-05-17 09:57:27 +00:00
-
d9bae6d770
Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks
biondizzle
2026-05-17 09:56:28 +00:00
-
55ac60eb91
Add detailed debug prints for OOB investigation
biondizzle
2026-05-17 09:39:42 +00:00
-
fed3c417ba
Add debug OOB check for sorted_token_ids
biondizzle
2026-05-17 09:19:10 +00:00
-
eb7d4f099b
Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync)
biondizzle
2026-05-17 09:01:24 +00:00
-
ca3cba5bbd
Fix global→local expert ID remapping for EP and remove .cpu() sync
biondizzle
2026-05-17 08:58:43 +00:00
-
1330e2b2cf
cleanup: remove debug prints, ready for testing
biondizzle
2026-05-17 08:30:41 +00:00
-
d635dcbbb6
fix: keep token_indices on CPU, index with CPU sort_idx
biondizzle
2026-05-17 08:29:18 +00:00
-
235d5b314f
fix: fallback token indices allocation with verify+rebuild
biondizzle
2026-05-17 08:27:47 +00:00
-
dd0b3fd4f9
debug: print sorted_token_ids in warmup
biondizzle
2026-05-17 08:25:25 +00:00
-
04999d86cf
fix: add quantize_to_nvfp4 import
biondizzle
2026-05-17 08:24:57 +00:00
-
33e28100ee
test: use runner's built-in warmup method
biondizzle
2026-05-17 08:24:27 +00:00
-
7073daaffa
fix: allocate token_indices on CPU, move to GPU AFTER JIT compilation
biondizzle
2026-05-17 08:22:51 +00:00
-
0e7b06b55c
debug: clone + sync token indices before JIT
biondizzle
2026-05-17 08:22:11 +00:00
-
70c0618361
fix: allocate token_indices before CuTeDSL JIT compilation
biondizzle
2026-05-17 08:20:41 +00:00
-
2bbe04efd8
debug: remove assert, test token corruption
biondizzle
2026-05-17 08:19:45 +00:00
-
66627926c5
debug: int32 token indices with sync verify
biondizzle
2026-05-17 08:18:37 +00:00
-
da02a5dc11
debug: assert token indices are correct after allocation
biondizzle
2026-05-17 08:16:09 +00:00
-
c0d016a472
feat: compute_activation_global_scales warmup method
biondizzle
2026-05-17 08:11:01 +00:00
-
8c9a51e006
fix: call _ensure_stacked in warmup test
biondizzle
2026-05-17 08:07:09 +00:00
-
5ba77e355f
test: warmup gs computation with safety margin sweep
biondizzle
2026-05-17 08:06:27 +00:00
-
ae6b879d38
fix: pass expert_offsets without leading 0 to GEMM (matches pipeline)
biondizzle
2026-05-17 07:59:00 +00:00
-
a1e6f5f891
fix: searchsorted right=True for correct expert assignment
biondizzle
2026-05-17 07:57:00 +00:00
-
ddffb7d8df
docs: current bug analysis — scale_a layout vs expert_offsets mismatch
biondizzle
2026-05-17 07:53:58 +00:00
-
ed90341ea9
fix: scatter+per-expert-swizzle scale assembly (cudagraph-safe)
biondizzle
2026-05-17 07:47:14 +00:00
-
37fecb588f
fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls
biondizzle
2026-05-17 07:43:05 +00:00
-
b824b838a9
fix: 128-row-align each expert's scales in padded buffer
biondizzle
2026-05-17 07:39:49 +00:00
-
8dadd9a723
test: scale assembly debug
biondizzle
2026-05-17 07:37:47 +00:00
-
8642946274
fix: padded x_sf buffer for fixed-shape scale assembly
biondizzle
2026-05-17 07:37:04 +00:00
-
418e29f7f5
fix: per-expert scale assembly (match assemble_scales_2d_side)
biondizzle
2026-05-17 07:35:49 +00:00
-
7b95e76723
test: runner vs pipeline comparison + scale assembly comparison
biondizzle
2026-05-17 07:33:20 +00:00
-
366a0240a5
vllm tweaks
biondizzle
2026-05-17 07:14:58 +00:00
-
34c43958d0
vllm tweaks
biondizzle
2026-05-17 07:10:16 +00:00
-
48e4cb625d
fix: default activation global_scale so runner works without finalize_weights
biondizzle
2026-05-17 06:24:15 +00:00
-
d2965b432d
fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch
biondizzle
2026-05-17 03:35:20 +00:00
-
b382a7a528
fix: handle input_scale as 1D or 2D (EP splits change the shape)
biondizzle
2026-05-16 22:49:30 +00:00
-
139c9c37cd
fix: read input_scale from nn.Parameter before it's freed
biondizzle
2026-05-16 22:23:24 +00:00
-
152648789d
fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688)
biondizzle
2026-05-16 21:46:00 +00:00
-
af087e655e
docs: update README — vLLM cudagraph inference running, output quality in progress
biondizzle
2026-05-16 21:40:59 +00:00
-
0a5cfe0433
add kernel compile caching — compile once, invoke on subsequent calls
biondizzle
2026-05-16 20:45:46 +00:00
-
3465b9d471
remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe)
biondizzle
2026-05-16 20:42:49 +00:00
-
5e245bc0c6
fix: missing newline
biondizzle
2026-05-16 20:40:18 +00:00
-
288e179f88
add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale)
biondizzle
2026-05-16 20:39:37 +00:00
-
521e11e468
test: old bridge + LUT quantization only (step 1 of cudagraph migration)
biondizzle
2026-05-16 20:37:42 +00:00
-
f51be76e8f
temp: restore EXACT old bridge.py from b685112
biondizzle
2026-05-16 20:34:45 +00:00
-
58dc36e21c
fix: compile fresh each call — cached compile produces wrong TMA descriptors
biondizzle
2026-05-16 20:28:15 +00:00
-
98cc6ac1f3
fix: invert cache check logic (compile when NOT in cache)
biondizzle
2026-05-16 20:25:16 +00:00
-
e337ec86a3
debug: test with cache enabled
biondizzle
2026-05-16 20:24:04 +00:00
-
bc56452be8
debug: disable kernel cache to test fresh compilation
biondizzle
2026-05-16 20:22:51 +00:00
-
647c03b2ee
fix: make_b_k_major must preserve shape — use double-permute trick
biondizzle
2026-05-16 20:19:21 +00:00
-
ed4f501bba
fix: make_b_k_major stride check — K-major means stride[1]==1, not stride[2]==1
biondizzle
2026-05-16 20:18:18 +00:00
-
2162cee4ad
fix: restore proper quantize_weight_to_nvfp4 — K is the packed dim, not N
biondizzle
2026-05-16 20:16:28 +00:00
-
10f1dca982
fix: import ceil_div from correct module
biondizzle
2026-05-16 20:09:02 +00:00
-
81632e2f21
fix: correct cutlass_torch import (cutlass.torch, not top-level)
biondizzle
2026-05-16 20:08:21 +00:00
-
16c4fad025
fix: remove cutlass.cute.backend import
biondizzle
2026-05-16 20:06:38 +00:00
-
44b40d41fe
fix: compile CuTeDSL kernel with real tensors, not dummy shapes
biondizzle
2026-05-16 20:05:59 +00:00
-
79281b6fda
fix: compute K_packed/N_packed before passing to _get_compiled_kernel
biondizzle
2026-05-16 20:00:35 +00:00
-
caf93d6c45
fix: pass K_packed/N_packed to _get_compiled_kernel
biondizzle
2026-05-16 19:59:43 +00:00
-
ecc7b83334
fix: compile CuTeDSL kernel with actual tensor shapes, not dummy 256x256
biondizzle
2026-05-16 19:58:13 +00:00
-
cc75a55bd9
restore: new bridge/moe_pipeline/layertest
biondizzle
2026-05-16 19:55:19 +00:00
-
0c878b3a9e
temp: restore old layertest+bridge for cosine comparison
biondizzle
2026-05-16 19:54:04 +00:00
-
0069769d12
debug: print global scales
biondizzle
2026-05-16 19:38:31 +00:00
-
84589fe984
debug: more prints
biondizzle
2026-05-16 19:31:54 +00:00
-
fa2d5708c5
debug: add L1 GEMM and SiLU output debug prints
biondizzle
2026-05-16 19:29:42 +00:00
-
4c06c51ec3
fix: moe_pipeline.py gate/up split — L1 output is 2*intermediate, not intermediate
biondizzle
2026-05-16 19:28:15 +00:00
-
da31ce7e1a
allow for cuda graphs again
biondizzle
2026-05-16 19:23:41 +00:00
-
d15c43294b
fix: test L2 weight N dim should be hidden_size, not hidden_size//2
biondizzle
2026-05-16 19:07:36 +00:00
-
28788c6f55
fix: L1 weight N dimension is 2*intermediate (gate+up), not intermediate
biondizzle
2026-05-16 19:07:08 +00:00
-
f7e29fdf1e
docs: update README with cudagraph compatibility work and decisions
biondizzle
2026-05-16 18:55:47 +00:00
-
103fd451ce
fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph)
biondizzle
2026-05-16 18:50:35 +00:00
-
2f68c7ba77
fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward)
biondizzle
2026-05-16 18:48:31 +00:00
-
6c298be842
fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy)
biondizzle
2026-05-16 18:47:39 +00:00
-
53c25bee0b
rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices
biondizzle
2026-05-16 18:44:25 +00:00
-
4300775bfe
fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead
biondizzle
2026-05-16 18:29:12 +00:00
-
5a79065b2b
fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element)
biondizzle
2026-05-16 18:27:44 +00:00
-
95a1345b92
fix: return 2D scale tensor from _assemble_scales_cudagraph_safe
biondizzle
2026-05-16 18:26:57 +00:00
-
533089c9d2
fix: token_indices slice bug + torch.zeros for float4/float8 dtypes
biondizzle
2026-05-16 18:21:27 +00:00
-
54c470e535
fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8)
biondizzle
2026-05-16 18:13:14 +00:00
-
f2de95c526
fix: use randint for float4 dummy weights in cudagraph test
biondizzle
2026-05-16 18:08:45 +00:00
-
f66d4b69a4
GPU-only scale assembly + cudagraph test harness
biondizzle
2026-05-16 18:05:13 +00:00
-
5121074782
cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly
biondizzle
2026-05-16 18:01:47 +00:00
-
ab126b0c0d
fix: revert to .item() based scale assembly (fixes index OOB)
biondizzle
2026-05-16 17:55:32 +00:00
-
7594968482
WIP: cudagraph-compatible CuTeDSL MoE runner
biondizzle
2026-05-16 16:36:19 +00:00
-
f0c1be3ced
fix: remove broken hc_head warmup (wrong tensor shape)
biondizzle
2026-05-16 10:11:34 +00:00
-
c803180706
fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe
biondizzle
2026-05-16 09:16:24 +00:00
-
cdd813cf7e
fix: free per-expert weight lists after stacking in CuTeDSL runner
biondizzle
2026-05-16 08:54:52 +00:00
-
99c11c218d
fucken a
biondizzle
2026-05-16 08:39:13 +00:00
-
906ee80a42
Add tilelang kernel warmup in load_weights
biondizzle
2026-05-16 08:28:39 +00:00
-
a51ef3d2cf
fucken a
biondizzle
2026-05-16 08:23:27 +00:00
-
72bf750a0b
fix: revert to eager mode — CUDA graphs OOM with 175GB model
biondizzle
2026-05-16 08:07:44 +00:00
-
baf44c92f8
fix: memory-efficient E2M1 quantization — no 32x distance tensor
biondizzle
2026-05-16 07:49:38 +00:00
-
a2cac7a7fe
fix: remove CuTeDSL warmup — OOM with 175GB model loaded
biondizzle
2026-05-16 07:32:17 +00:00
-
e0814eb54e
fix: cast expert_offsets to int32 for CuTeDSL kernel
biondizzle
2026-05-16 07:15:57 +00:00
-
4b0a9557f0
fix: rewrite CuTeDSLMoERunner for CUDA graph compatibility
biondizzle
2026-05-16 07:03:08 +00:00
-
dab31b0961
fix: missing tqdm import in weight_loader
biondizzle
2026-05-16 06:31:14 +00:00
-
8496ac99bc
dang clonkurs
biondizzle
2026-05-16 06:28:16 +00:00
-
e7c6274107
Revert "feat: auto-warmup in build_and_run.sh"
biondizzle
2026-05-16 06:14:28 +00:00
-
f792537719
feat: auto-warmup in build_and_run.sh
biondizzle
2026-05-16 06:11:38 +00:00