337 Commits

Author SHA1 Message Date
7073daaffa fix: allocate token_indices on CPU, move to GPU AFTER JIT compilation
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Tensors allocated on GPU before/during compilation get zeroed.
Fix: create token_indices on CPU, then .to(device) after JIT is done.
2026-05-17 08:22:51 +00:00
0e7b06b55c debug: clone + sync token indices before JIT 2026-05-17 08:22:11 +00:00
70c0618361 fix: allocate token_indices before CuTeDSL JIT compilation
CuTeDSL's cute.compile appears to corrupt GPU memory state,
causing torch.arange to produce zero-filled tensors when allocated
after the JIT compilation. Moving token_indices allocation before
the weight stacking operations fixes the corruption.
2026-05-17 08:20:41 +00:00
2bbe04efd8 debug: remove assert, test token corruption 2026-05-17 08:19:45 +00:00
66627926c5 debug: int32 token indices with sync verify 2026-05-17 08:18:37 +00:00
da02a5dc11 debug: assert token indices are correct after allocation 2026-05-17 08:16:09 +00:00
c0d016a472 feat: compute_activation_global_scales warmup method
Uses quantize_to_nvfp4 during warmup to get exact gs values for L1 and L2.
L1 gs comes from slot_hidden, L2 gs from the actual L1 GEMM output.
These values are then used with quantize_activation_nvfp4 (cudagraph-safe)
during inference.
2026-05-17 08:11:01 +00:00
8c9a51e006 fix: call _ensure_stacked in warmup test 2026-05-17 08:07:09 +00:00
5ba77e355f test: warmup gs computation with safety margin sweep 2026-05-17 08:06:27 +00:00
ae6b879d38 fix: pass expert_offsets without leading 0 to GEMM (matches pipeline) 2026-05-17 07:59:00 +00:00
a1e6f5f891 fix: searchsorted right=True for correct expert assignment 2026-05-17 07:57:00 +00:00
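The `right=True` fix can be illustrated with a small stdlib sketch. It assumes `expert_offsets` holds cumulative end offsets with no leading 0, and uses `bisect` as a stand-in for `torch.searchsorted`; the offsets and the helper names `expert_of` and `expert_of_left` are invented for illustration and are not taken from the repository.

```python
from bisect import bisect_left, bisect_right

# Hypothetical cumulative end offsets for 3 experts holding 3, 2 and 4 rows.
# No leading 0: expert e owns rows [ends[e-1], ends[e]), with ends[-1] read as 0.
expert_ends = [3, 5, 9]

def expert_of(row, ends=expert_ends):
    """Expert that owns `row`; same semantics as searchsorted(..., right=True)."""
    return bisect_right(ends, row)

def expert_of_left(row, ends=expert_ends):
    """Same lookup with right=False semantics, shown only for contrast."""
    return bisect_left(ends, row)

rows = list(range(9))
assignment = [expert_of(r) for r in rows]            # [0, 0, 0, 1, 1, 2, 2, 2, 2]
assignment_left = [expert_of_left(r) for r in rows]  # rows 3 and 5 land one expert too low
```

With right=False semantics, a row index that equals an end offset is assigned to the expert that just ended, which is the kind of off-by-one this commit appears to address.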
ddffb7d8df docs: current bug analysis — scale_a layout vs expert_offsets mismatch 2026-05-17 07:53:58 +00:00
ed90341ea9 fix: scatter+per-expert-swizzle scale assembly (cudagraph-safe) 2026-05-17 07:47:14 +00:00
37fecb588f fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls 2026-05-17 07:43:05 +00:00
b824b838a9 fix: 128-row-align each expert's scales in padded buffer 2026-05-17 07:39:49 +00:00
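As a rough sketch of the 128-row alignment named in this commit: each expert's scale rows start at a multiple of 128 in the padded buffer. The helper names `round_up` and `padded_offsets` and the per-expert row counts are assumptions made for illustration; the real buffer layout may differ.

```python
ALIGN = 128  # row alignment named in the commit message

def round_up(n, multiple=ALIGN):
    """Smallest multiple of `multiple` that is >= n."""
    return (n + multiple - 1) // multiple * multiple

def padded_offsets(rows_per_expert):
    """Start row of each expert in the padded buffer, plus the total row count."""
    starts, cursor = [], 0
    for rows in rows_per_expert:
        starts.append(cursor)
        cursor += round_up(rows)
    return starts, cursor

# Hypothetical token counts per expert, including an empty expert.
starts, total = padded_offsets([5, 130, 0, 128])
```

Here an expert with zero rows consumes no padded rows, which is one possible convention; a fixed-shape buffer could equally reserve a full 128-row slot per expert.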
8dadd9a723 test: scale assembly debug 2026-05-17 07:37:47 +00:00
8642946274 fix: padded x_sf buffer for fixed-shape scale assembly 2026-05-17 07:37:04 +00:00
418e29f7f5 fix: per-expert scale assembly (match assemble_scales_2d_side) 2026-05-17 07:35:49 +00:00
7b95e76723 test: runner vs pipeline comparison + scale assembly comparison 2026-05-17 07:33:20 +00:00
366a0240a5 vllm tweaks 2026-05-17 07:14:58 +00:00
34c43958d0 vllm tweaks 2026-05-17 07:10:16 +00:00
48e4cb625d fix: default activation global_scale so runner works without finalize_weights 2026-05-17 06:24:15 +00:00
d2965b432d fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch 2026-05-17 03:35:20 +00:00
b382a7a528 fix: handle input_scale as 1D or 2D (EP splits change the shape) 2026-05-16 22:49:30 +00:00
139c9c37cd fix: read input_scale from nn.Parameter before it's freed 2026-05-16 22:23:24 +00:00
152648789d fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688)
The checkpoint stores input_scale per projection — the pre-computed
activation normalization factor. Using 1/2688 was wrong for most layers
(e.g. down_proj input_scale=0.031 vs 1/2688=0.000372 — 83x off).
This caused under-quantized activations and garbage output.
2026-05-16 21:46:00 +00:00
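The "83x off" figure in this message can be checked directly from the two numbers it quotes. The variable names below are illustrative only.

```python
hardcoded_scale = 1 / 2688      # old hardcoded activation global scale
checkpoint_scale = 0.031        # down_proj input_scale quoted in the commit message

# How far the hardcoded value was from the checkpoint value.
ratio = checkpoint_scale / hardcoded_scale

print(f"hardcoded={hardcoded_scale:.6f} ratio={ratio:.1f}x")
```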
af087e655e docs: update README — vLLM cudagraph inference running, output quality in progress 2026-05-16 21:40:59 +00:00
0a5cfe0433 add kernel compile caching — compile once, invoke on subsequent calls
First call: cute.compile() with real tensors (warmup).
Subsequent calls: just invoke compiled() with new CuTe views.
No cute.compile() in the forward path = cudagraph-safe.
2026-05-16 20:45:46 +00:00
3465b9d471 remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe) 2026-05-16 20:42:49 +00:00
5e245bc0c6 fix: missing newline 2026-05-16 20:40:18 +00:00
288e179f88 add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale) 2026-05-16 20:39:37 +00:00
521e11e468 test: old bridge + LUT quantization only (step 1 of cudagraph migration) 2026-05-16 20:37:42 +00:00
f51be76e8f temp: restore EXACT old bridge.py from b685112 2026-05-16 20:34:45 +00:00
58dc36e21c fix: compile fresh each call — cached compile produces wrong TMA descriptors
The CuTeDSL kernel's TMA descriptors are bound to the
compilation-time tensor addresses. Caching the compiled kernel
and reusing it with different tensor allocations produces wrong
memory access patterns (cosine 0.5 instead of 0.99).

Fresh compilation is proven correct (cosine 0.989). We can
optimize later with proper TMA descriptor reinitialization.
2026-05-16 20:28:15 +00:00
98cc6ac1f3 fix: invert cache check logic (compile when NOT in cache) 2026-05-16 20:25:16 +00:00
e337ec86a3 debug: test with cache enabled 2026-05-16 20:24:04 +00:00
bc56452be8 debug: disable kernel cache to test fresh compilation 2026-05-16 20:22:51 +00:00
647c03b2ee fix: make_b_k_major must preserve shape — use double-permute trick
permute(K,N).contiguous().permute(K,N) gives same (E,K,N) shape
but with K-contiguous memory. Single permute changes the shape.
2026-05-16 20:19:21 +00:00
ed4f501bba fix: make_b_k_major stride check — K-major means stride[1]==1, not stride[2]==1
For (E, K, N): stride[2]==1 is N-major (columns contiguous).
K-major requires stride[1]==1 (rows contiguous).
2026-05-16 20:18:18 +00:00
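The two `make_b_k_major` commits above come down to stride arithmetic, which can be modelled without a tensor library. The sketch below tracks only shapes and element strides, and assumes that `permute(K,N)` in the commit message means swapping the last two dimensions (`permute(0, 2, 1)` in PyTorch terms). The helpers `contiguous_strides` and `permute` are hypothetical, not repository code.

```python
def contiguous_strides(shape):
    """Element strides of a freshly allocated row-major tensor."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def permute(shape, strides, order):
    """Reorder dims without moving data, as a permute view does."""
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)

E, K, N = 2, 4, 6                       # small illustrative sizes
shape = (E, K, N)
strides = contiguous_strides(shape)     # (24, 6, 1): stride[2] == 1

# Swap K and N, materialize that layout (.contiguous()), then swap back.
swapped_shape, _ = permute(shape, strides, (0, 2, 1))        # (E, N, K)
swapped_strides = contiguous_strides(swapped_shape)          # (24, 4, 1)
k_major_shape, k_major_strides = permute(
    swapped_shape, swapped_strides, (0, 2, 1)
)                                                            # (E, K, N), (24, 1, 4)
```

The result keeps the logical `(E, K, N)` shape while `stride[1] == 1`, which is the condition the stride check tests for. A single permute would satisfy the stride condition but leave the shape as `(E, N, K)`.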
2162cee4ad fix: restore proper quantize_weight_to_nvfp4 — K is the packed dim, not N
quantize_to_nvfp4() only packs the last dimension, but for weight
matrices (K, N), K is the packed dimension. The weight quantizer
reshapes (k_blocks, block_size, N) and computes block scales along
the K block dimension. This was accidentally replaced with a simple
delegation to quantize_to_nvfp4, producing wrong tensor shapes.
2026-05-16 20:16:28 +00:00
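To make the "blocks run along K" point concrete, here is a minimal pure-Python sketch that computes one absolute-maximum value per `(k_block, column)` for a `(K, N)` weight given as nested lists. It does no FP4 packing or scale encoding, the block size of 2 is chosen only to keep the example small, and `block_amax_along_k` is a hypothetical name rather than the repository's function.

```python
def block_amax_along_k(weight, block_size):
    """Per-block absolute maximum along K for a (K, N) weight.

    Returns a (k_blocks, N) nested list: one value per K-block and column.
    """
    K, N = len(weight), len(weight[0])
    if K % block_size != 0:
        raise ValueError("K must be a multiple of block_size")
    k_blocks = K // block_size
    return [
        [
            max(abs(weight[kb * block_size + i][n]) for i in range(block_size))
            for n in range(N)
        ]
        for kb in range(k_blocks)
    ]

# Illustrative 4x2 weight: K=4 rows, N=2 columns.
weight = [
    [1.0, -8.0],
    [-3.0, 2.0],
    [0.5, 0.25],
    [-0.75, 4.0],
]
amax = block_amax_along_k(weight, block_size=2)  # shape (2, 2)
```

A quantizer that blocks only along the last dimension would instead produce one value per row and N-block, which is the shape mismatch the commit message describes.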
10f1dca982 fix: import ceil_div from correct module 2026-05-16 20:09:02 +00:00
81632e2f21 fix: correct cutlass_torch import (cutlass.torch, not top-level) 2026-05-16 20:08:21 +00:00
16c4fad025 fix: remove cutlass.cute.backend import 2026-05-16 20:06:38 +00:00
44b40d41fe fix: compile CuTeDSL kernel with real tensors, not dummy shapes
The kernel's TMA descriptors are sized from compilation-time shapes.
Dummy 256x256 caused wrong memory access for real 3584x6144 data.
Now compiles with actual runtime tensors on first use, cached by
(num_experts, K, N). Compilation happens once during warmup.
Forward call remains cudagraph-safe.
2026-05-16 20:05:59 +00:00
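The compile-once pattern described here, with a cache keyed by `(num_experts, K, N)`, might look roughly like the sketch below. `fake_compile` is a hypothetical stand-in for the expensive JIT step and does not call CuTeDSL; `get_compiled_kernel` is likewise an illustrative name, not the repository's `_get_compiled_kernel`.

```python
_compiled = {}  # (num_experts, K, N) -> compiled callable

def fake_compile(num_experts, K, N):
    """Hypothetical stand-in for an expensive JIT compile step."""
    fake_compile.calls += 1

    def kernel(*tensors):
        # A real kernel would launch GPU work; this only echoes its configuration.
        return (num_experts, K, N, len(tensors))

    return kernel

fake_compile.calls = 0

def get_compiled_kernel(num_experts, K, N):
    """Compile on first use for a given shape key, then reuse the cached result."""
    key = (num_experts, K, N)
    if key not in _compiled:  # compile only when NOT in the cache
        _compiled[key] = fake_compile(*key)
    return _compiled[key]
```

Calling `get_compiled_kernel` during warmup keeps compilation out of the forward path. Whether a cached kernel can safely be reused with different tensor allocations is a separate question that other commits in this log go back and forth on.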
79281b6fda fix: compute K_packed/N_packed before passing to _get_compiled_kernel 2026-05-16 20:00:35 +00:00
caf93d6c45 fix: pass K_packed/N_packed to _get_compiled_kernel 2026-05-16 19:59:43 +00:00
ecc7b83334 fix: compile CuTeDSL kernel with actual tensor shapes, not dummy 256x256
The compiled kernel's TMA descriptors are sized based on compilation
shapes. Using dummy 256x256 shapes caused wrong memory access patterns
for the real 3584x6144 data. Now uses actual K_packed and N_packed
from the runtime tensors.
2026-05-16 19:58:13 +00:00
cc75a55bd9 restore: new bridge/moe_pipeline/layertest 2026-05-16 19:55:19 +00:00
0c878b3a9e temp: restore old layertest+bridge for cosine comparison 2026-05-16 19:54:04 +00:00
0069769d12 debug: print global scales 2026-05-16 19:38:31 +00:00