199 Commits

Author SHA1 Message Date
918aa8aede Fix scale assembly output shape: reshape to 2D for GEMM 2026-05-17 09:57:27 +00:00
d9bae6d770 Fix OOB in scale assembly: size padded_x_sf for max tokens, fix top_k/max_num_tokens passing, support variable-size expert blocks
Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens
and top_k=6, the actual padded row count can exceed 6144. Also:
- Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192)
- Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks)
- Remove debug prints
2026-05-17 09:56:28 +00:00
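The sizing problem described in d9bae6d770 can be sketched in a few lines. This is a rough illustration, not the repository's code: `padded_rows` and `worst_case_padded_rows` are hypothetical helpers, and the expert count of 48 is chosen only because 48 * 128 equals the 6144 quoted in the message.

```python
ALIGN = 128  # per-expert row alignment named in the commit message


def padded_rows(tokens_per_expert, align=ALIGN):
    """Rows needed once each expert's block is rounded up to a multiple of align."""
    return sum(-(-n // align) * align for n in tokens_per_expert)


def worst_case_padded_rows(max_num_tokens, top_k, num_experts, align=ALIGN):
    """Upper bound on padded_rows over every possible routing.

    Each expert can waste at most (align - 1) rows of padding, so the bound is
    the total slot count plus that slack for every expert.
    """
    return max_num_tokens * top_k + num_experts * (align - 1)


# Hypothetical configuration: 48 experts, so the old buffer had 48 * 128 rows.
old_capacity = 48 * ALIGN
# 129 tokens per expert already needs two 128-row chunks per expert.
needed = padded_rows([129] * 48)
bound = worst_case_padded_rows(max_num_tokens=8192, top_k=6, num_experts=48)
print(old_capacity, needed, bound)
```

Sizing the buffer from the bound instead of from `num_experts * 128` keeps the allocation fixed-shape while covering experts that receive more than one 128-row chunk.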
55ac60eb91 Add detailed debug prints for OOB investigation 2026-05-17 09:39:42 +00:00
fed3c417ba Add debug OOB check for sorted_token_ids 2026-05-17 09:19:10 +00:00
eb7d4f099b Update CURRENT_BUG.md with Bug 8 (global→local expert ID) and Bug 8b (.cpu() sync) 2026-05-17 09:01:24 +00:00
ca3cba5bbd Fix global→local expert ID remapping for EP and remove .cpu() sync
Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255) but runner treated them
  as local IDs (0-31 with EP=8). Tokens for non-local experts got
  wrong expert assignments, causing out-of-bounds scatter indices
  in _assemble_scales_cudagraph_safe.

Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL
   JIT GPU memory corruption (refill after first GEMM call)
2026-05-17 08:58:43 +00:00
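Fix 2 in ca3cba5bbd (remap global→local IDs and zero the weights of non-local experts) might look roughly like the sketch below. It uses plain lists rather than tensors, `remap_to_local` is a hypothetical name, and parking non-local picks on local expert 0 is an assumption made here so that every index stays in range; the runner may handle them differently.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Map global expert IDs to local ones; zero the weight of non-local picks."""
    local_ids, local_weights = [], []
    for id_row, weight_row in zip(topk_ids, topk_weights):
        ids_out, weights_out = [], []
        for global_id, weight in zip(id_row, weight_row):
            local_id = global_id - experts_start_idx
            if 0 <= local_id < num_local_experts:
                ids_out.append(local_id)
                weights_out.append(weight)
            else:
                # Assumption: keep the index valid and cancel its contribution.
                ids_out.append(0)
                weights_out.append(0.0)
        local_ids.append(ids_out)
        local_weights.append(weights_out)
    return local_ids, local_weights


# 256 global experts with EP=8 gives 32 per rank; rank 1 owns IDs 32..63.
ids, weights = remap_to_local(
    topk_ids=[[33, 200], [31, 63]],
    topk_weights=[[0.7, 0.3], [0.6, 0.4]],
    experts_start_idx=32,
    num_local_experts=32,
)
print(ids)      # [[1, 0], [0, 31]]
print(weights)  # [[0.7, 0.0], [0.0, 0.4]]
```

Without the subtraction, a global ID such as 200 would be used directly as an index into 32 local experts, which is the out-of-bounds scatter the message describes.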
1330e2b2cf cleanup: remove debug prints, ready for testing
Current state:
- Token indices on CPU (avoids CuTeDSL GPU memory corruption)
- Scale assembly uses per-expert swizzle + scatter (matches reference)
- compute_activation_global_scales warmup gets ~0.97 cosine
- expert_offsets passed without leading 0 (matches pipeline)
- layertest + cudagraph_test pass
2026-05-17 08:30:41 +00:00
d635dcbbb6 fix: keep token_indices on CPU, index with CPU sort_idx
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Keeping token_indices on CPU and using sort_idx.cpu() for indexing
avoids the corruption. The .to(device) call after indexing moves the
result back to GPU for the hidden_states indexing.
2026-05-17 08:29:18 +00:00
235d5b314f fix: fallback token indices allocation with verify+rebuild 2026-05-17 08:27:47 +00:00
dd0b3fd4f9 debug: print sorted_token_ids in warmup 2026-05-17 08:25:25 +00:00
04999d86cf fix: add quantize_to_nvfp4 import 2026-05-17 08:24:57 +00:00
33e28100ee test: use runner's built-in warmup method 2026-05-17 08:24:27 +00:00
7073daaffa fix: allocate token_indices on CPU, move to GPU AFTER JIT compilation
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Tensors allocated on GPU before/during compilation get zeroed.
Fix: create token_indices on CPU, then .to(device) after JIT is done.
2026-05-17 08:22:51 +00:00
0e7b06b55c debug: clone + sync token indices before JIT 2026-05-17 08:22:11 +00:00
70c0618361 fix: allocate token_indices before CuTeDSL JIT compilation
CuTeDSL's cute.compile appears to corrupt GPU memory state,
causing torch.arange to produce zero-filled tensors when allocated
after the JIT compilation. Moving token_indices allocation before
the weight stacking operations fixes the corruption.
2026-05-17 08:20:41 +00:00
2bbe04efd8 debug: remove assert, test token corruption 2026-05-17 08:19:45 +00:00
66627926c5 debug: int32 token indices with sync verify 2026-05-17 08:18:37 +00:00
da02a5dc11 debug: assert token indices are correct after allocation 2026-05-17 08:16:09 +00:00
c0d016a472 feat: compute_activation_global_scales warmup method
Uses quantize_to_nvfp4 during warmup to get exact gs values for L1 and L2.
L1 gs comes from slot_hidden, L2 gs from the actual L1 GEMM output.
These values are then used with quantize_activation_nvfp4 (cudagraph-safe)
during inference.
2026-05-17 08:11:01 +00:00
8c9a51e006 fix: call _ensure_stacked in warmup test 2026-05-17 08:07:09 +00:00
5ba77e355f test: warmup gs computation with safety margin sweep 2026-05-17 08:06:27 +00:00
ae6b879d38 fix: pass expert_offsets without leading 0 to GEMM (matches pipeline) 2026-05-17 07:59:00 +00:00
a1e6f5f891 fix: searchsorted right=True for correct expert assignment 2026-05-17 07:57:00 +00:00
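Why `right=True` matters for a1e6f5f891 can be shown with the standard library, since `bisect_right` has the same tie-breaking rule as a right-sided searchsorted. The sketch assumes `expert_ends` holds cumulative end offsets without a leading 0; the name and the sample values are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def expert_for_row(row, expert_ends):
    """Return the expert whose half-open row range [start, end) contains row."""
    return bisect_right(expert_ends, row)


# Three experts owning rows [0, 3), [3, 5) and [5, 9).
expert_ends = [3, 5, 9]
assignments = [expert_for_row(r, expert_ends) for r in range(9)]
print(assignments)  # [0, 0, 0, 1, 1, 2, 2, 2, 2]

# The left-sided search puts boundary row 3 in expert 0, which is off by one.
print(bisect_left(expert_ends, 3), bisect_right(expert_ends, 3))  # 0 1
```

The two variants differ only on rows that coincide with an expert boundary, which is exactly the first row of every expert after the first.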
ddffb7d8df docs: current bug analysis — scale_a layout vs expert_offsets mismatch 2026-05-17 07:53:58 +00:00
ed90341ea9 fix: scatter+per-expert-swizzle scale assembly (cudagraph-safe) 2026-05-17 07:47:14 +00:00
37fecb588f fix: separate L1/L2 scale buffers (different K_sf), fix assembly calls 2026-05-17 07:43:05 +00:00
b824b838a9 fix: 128-row-align each expert's scales in padded buffer 2026-05-17 07:39:49 +00:00
8dadd9a723 test: scale assembly debug 2026-05-17 07:37:47 +00:00
8642946274 fix: padded x_sf buffer for fixed-shape scale assembly 2026-05-17 07:37:04 +00:00
418e29f7f5 fix: per-expert scale assembly (match assemble_scales_2d_side) 2026-05-17 07:35:49 +00:00
7b95e76723 test: runner vs pipeline comparison + scale assembly comparison 2026-05-17 07:33:20 +00:00
366a0240a5 vllm tweaks 2026-05-17 07:14:58 +00:00
34c43958d0 vllm tweaks 2026-05-17 07:10:16 +00:00
48e4cb625d fix: default activation global_scale so runner works without finalize_weights 2026-05-17 06:24:15 +00:00
d2965b432d fix: set _l1_activation_global_scale (with underscore) — attribute name mismatch 2026-05-17 03:35:20 +00:00
b382a7a528 fix: handle input_scale as 1D or 2D (EP splits change the shape) 2026-05-16 22:49:30 +00:00
139c9c37cd fix: read input_scale from nn.Parameter before it's freed 2026-05-16 22:23:24 +00:00
152648789d fix: use checkpoint input_scale for activation global scale (not hardcoded 1/2688)
The checkpoint stores input_scale per projection — the pre-computed
activation normalization factor. Using 1/2688 was wrong for most layers
(e.g. down_proj input_scale=0.031 vs 1/2688=0.000372 — 83x off).
This caused under-quantized activations and garbage output.
2026-05-16 21:46:00 +00:00
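The 83x figure in 152648789d follows directly from the two numbers quoted in the message. The check below uses only those values; the variable names are illustrative.

```python
hardcoded_scale = 1.0 / 2688.0   # value used before the fix
checkpoint_scale = 0.031         # down_proj input_scale quoted in the message

ratio = checkpoint_scale / hardcoded_scale
print(f"{hardcoded_scale:.6f}")  # 0.000372
print(f"{ratio:.1f}x")           # 83.3x
```

The constant 2688 equals 448 * 6, the product of the largest FP8 E4M3 and FP4 E2M1 magnitudes, which is presumably where the hardcoded value came from.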
af087e655e docs: update README — vLLM cudagraph inference running, output quality in progress 2026-05-16 21:40:59 +00:00
0a5cfe0433 add kernel compile caching — compile once, invoke on subsequent calls
First call: cute.compile() with real tensors (warmup).
Subsequent calls: just invoke compiled() with new CuTe views.
No cute.compile() in the forward path = cudagraph-safe.
2026-05-16 20:45:46 +00:00
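The compile-once pattern from 0a5cfe0433 can be sketched generically as follows. `CompiledKernelCache` and `fake_compile` are hypothetical stand-ins; the real `cute.compile()` call and its arguments are not shown in this log, so nothing here should be read as its signature.

```python
class CompiledKernelCache:
    """Compile on first use per key, then reuse the compiled callable."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._cache = {}
        self.compile_calls = 0

    def get(self, key, *example_args):
        kernel = self._cache.get(key)
        if kernel is None:  # compile only when the key is NOT cached
            kernel = self._compile_fn(*example_args)
            self.compile_calls += 1
            self._cache[key] = kernel
        return kernel


def fake_compile(scale):
    """Stand-in for an expensive JIT compile; returns a cheap callable."""
    return lambda x: x * scale


cache = CompiledKernelCache(fake_compile)
warmup = cache.get(("gemm", 128), 2)(10)   # first call compiles
steady = cache.get(("gemm", 128), 2)(21)   # later calls only invoke
print(warmup, steady, cache.compile_calls)  # 20 42 1
```

Doing the first `get` during warmup keeps compilation out of the forward path. Per 58dc36e21c, a cached kernel must still receive views of the current tensors on each call, because anything bound to compile-time addresses goes stale.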
3465b9d471 remove torch.cuda.synchronize() from run_nvfp4_grouped_gemm (cudagraph-safe) 2026-05-16 20:42:49 +00:00
5e245bc0c6 fix: missing newline 2026-05-16 20:40:18 +00:00
288e179f88 add quantize_activation_nvfp4 (cudagraph-safe, fixed global scale) 2026-05-16 20:39:37 +00:00
521e11e468 test: old bridge + LUT quantization only (step 1 of cudagraph migration) 2026-05-16 20:37:42 +00:00
f51be76e8f temp: restore EXACT old bridge.py from b685112 2026-05-16 20:34:45 +00:00
58dc36e21c fix: compile fresh each call — cached compile produces wrong TMA descriptors
The CuTeDSL kernel's TMA descriptors are bound to the
compilation-time tensor addresses. Caching the compiled kernel
and reusing it with different tensor allocations produces wrong
memory access patterns (cosine 0.5 instead of 0.99).

Fresh compilation is proven correct (cosine 0.989). We can
optimize later with proper TMA descriptor reinitialization.
2026-05-16 20:28:15 +00:00
98cc6ac1f3 fix: invert cache check logic (compile when NOT in cache) 2026-05-16 20:25:16 +00:00
e337ec86a3 debug: test with cache enabled 2026-05-16 20:24:04 +00:00
bc56452be8 debug: disable kernel cache to test fresh compilation 2026-05-16 20:22:51 +00:00
647c03b2ee fix: make_b_k_major must preserve shape — use double-permute trick
permute(K,N).contiguous().permute(K,N) gives same (E,K,N) shape
but with K-contiguous memory. Single permute changes the shape.
2026-05-16 20:19:21 +00:00
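The double-permute trick in 647c03b2ee can be followed with shapes and strides alone. The sketch below models them as tuples instead of real tensors and reads the message's `permute(K,N)` as swapping the last two axes of an (E, K, N) tensor, which is an assumption; all helper names are hypothetical.

```python
def contiguous_strides(shape):
    """Row-major element strides for a freshly packed tensor of this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """Reorder axes without moving data, as a view-style permute does."""
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)


def make_k_major(shape):
    """Return (shape, strides) with the original shape but K contiguous."""
    swap = (0, 2, 1)
    swapped_shape, _ = permute(shape, contiguous_strides(shape), swap)
    packed = contiguous_strides(swapped_shape)  # models .contiguous()
    return permute(swapped_shape, packed, swap)


E, K, N = 2, 3, 4
single_shape, _ = permute((E, K, N), contiguous_strides((E, K, N)), (0, 2, 1))
double_shape, double_strides = make_k_major((E, K, N))
print(single_shape)                  # (2, 4, 3): the shape changed
print(double_shape, double_strides)  # (2, 3, 4) (12, 1, 3): K has stride 1
```

The second permute restores the logical (E, K, N) shape while keeping the memory order produced by the intermediate copy, so downstream code sees the same shape with K as the fastest-varying axis.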