Root cause: capping max_num_tokens to 512 made buffers too small for the
actual 8192-token warmup. slot_hidden had 49152 rows (8192 tokens ×
top_k=6) but padded_hidden only had 6144.
Fix: Revert the 512 cap. Use SHARED padded buffers (not per-layer) to
avoid OOM. Only 72 MB total (not 4.3 GB) since layers run sequentially
and reuse the same buffer. Cudagraph-safe since capture and replay both
run layers sequentially on the same tensor.
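
The footprint difference can be checked with a quick calculation. This is
a minimal sketch using only the figures quoted above (72 MB per buffer
set, 60 layers); the function name is hypothetical and not part of the
runner.

```python
MB_PER_BUFFER_SET = 72   # size quoted in this message
NUM_LAYERS = 60          # layer count quoted in this message


def buffer_footprint_mb(per_set_mb, num_layers, shared):
    """Total padded-buffer memory in MB.

    shared=True models one buffer set reused by every layer, which is
    valid only because layers run one after another.
    """
    return per_set_mb if shared else per_set_mb * num_layers


per_layer_total_mb = buffer_footprint_mb(MB_PER_BUFFER_SET, NUM_LAYERS, shared=False)
shared_total_mb = buffer_footprint_mb(MB_PER_BUFFER_SET, NUM_LAYERS, shared=True)
print(per_layer_total_mb, shared_total_mb)
```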
padded_max_slots was computed from max_tokens*top_k (3072) but
total_padded_slots in run() is num_experts*max_chunks*128 (6144).
The buffer was too small, causing index out of bounds.
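
The mismatch can be reproduced with plain arithmetic. A sketch, assuming
max_tokens=512 and top_k=6 (which give the 3072 above) and 48 local
experts with one 128-row chunk each (one factorization that gives 6144;
the exact split between num_experts and max_chunks is an assumption).

```python
CHUNK_ROWS = 128


def slots_from_tokens(max_tokens, top_k):
    """Old sizing: one slot per (token, selected expert) pair."""
    return max_tokens * top_k


def slots_from_layout(num_experts, max_chunks, chunk_rows=CHUNK_ROWS):
    """Sizing that run() actually indexes: fixed chunks per expert."""
    return num_experts * max_chunks * chunk_rows


allocated = slots_from_tokens(512, 6)   # assumed max_tokens and top_k
required = slots_from_layout(48, 1)     # assumed num_experts and max_chunks
overflow_rows = required - allocated    # rows indexed past the end of the buffer
print(allocated, required, overflow_rows)
```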
padded_hidden/activated buffers were sized for max_num_tokens=8192,
which is 72 MB per layer × 60 layers = 4.3 GB → OOM on 178 GB GPUs
(already almost full from model + KV cache).
Now cap at max cudagraph capture size (512 tokens). Eager-mode runs
with >512 tokens will need dynamic allocation, but vLLM always uses
cudagraph for inference after warmup.
Removed torch.zeros() call that created padded_expert_offsets during
scale assembly. Now uses fixed layout computed from Python constants.
Also removed dead reference to padded_expert_offsets variable.
The while loop had a variable trip count (GPU scalar in the condition),
requiring a CPU-GPU sync. Replaced with a fixed number of iterations
(max_chunks_per_expert). Unused chunks are zero buffers (harmless for GEMM).
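
The fixed-trip-count idea can be sketched in plain Python. The helper
below is hypothetical and works on row counts rather than tensors; it
only illustrates that every expert is walked for the same number of
chunks and that chunks past the real data come out empty.

```python
CHUNK_ROWS = 128


def chunk_row_counts(num_rows, max_chunks_per_expert):
    """Real rows in each fixed chunk; unused chunks hold 0 rows."""
    counts = []
    # The loop bound is a Python constant, so it never depends on a
    # value that would have to be read back from the GPU.
    for chunk in range(max_chunks_per_expert):
        start = chunk * CHUNK_ROWS
        remaining = max(num_rows - start, 0)
        counts.append(min(remaining, CHUNK_ROWS))
    return counts


print(chunk_row_counts(300, 4))
```

An expert with 300 rows and max_chunks_per_expert=4 yields three chunks
with data and one empty chunk, which stays zero-filled.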
- Compute padded_expert_offsets from real expert_offsets (ceil to 128)
- Scatter x_sf into padded positions matching those offsets
- Per-expert swizzle in 128-row chunks (supports >128 tokens per expert)
- Pad slot_hidden/activated using same padded offsets for GEMM input
- Pre-allocated buffers sized for max_tokens*top_k (not num_experts*128)
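
The offset computation in the first two bullets can be sketched as
follows. This assumes expert_offsets is a cumulative list with
num_experts+1 entries and that an empty expert gets zero padded rows;
both are assumptions, and the function names are illustrative only.

```python
ALIGN = 128


def padded_expert_offsets(expert_offsets, align=ALIGN):
    """Round each expert's row count up to a multiple of `align`."""
    padded = [0]
    for e in range(len(expert_offsets) - 1):
        count = expert_offsets[e + 1] - expert_offsets[e]
        padded_count = -(-count // align) * align  # ceiling to align
        padded.append(padded[-1] + padded_count)
    return padded


def padded_row(expert_offsets, padded_offsets, expert, row):
    """Map a row of the unpadded layout to its padded position."""
    return padded_offsets[expert] + (row - expert_offsets[expert])


offsets = [0, 500, 1024, 1024, 1100]
padded = padded_expert_offsets(offsets)
print(padded)
print(padded_row(offsets, padded, 1, 500))
```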
Root cause of garbage output: the GEMM reads scale_a according to
expert_offsets (e.g. [0, 500, 1024, ...]) but scale_a had data at
fixed e*128 offsets. When expert 0 has 500 tokens, the GEMM reads
scale_a[0:500] but only rows 0-127 had valid data.
Fix: pad slot_hidden to num_experts*128 rows (128 per expert) and
pass padded_expert_offsets=[0, 128, 256, ...] to the GEMM. Scale
assembly's fixed 128-row layout now matches the GEMM's expectations.
Padding tokens' GEMM output is discarded (scatter_add only uses
sorted_token_ids for real tokens).
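
The layout mismatch and the fixed layout can be illustrated with a small
sketch. It assumes the offsets list carries one trailing entry for the
end of the last expert; the helper names are hypothetical.

```python
ROWS_PER_EXPERT = 128


def fixed_padded_offsets(num_experts, rows=ROWS_PER_EXPERT):
    """Offsets for a layout with exactly `rows` rows per expert."""
    return [e * rows for e in range(num_experts + 1)]


def rows_without_valid_scales(expert_offsets, expert, rows=ROWS_PER_EXPERT):
    """Rows the GEMM would read for `expert` beyond its valid scale rows."""
    count = expert_offsets[expert + 1] - expert_offsets[expert]
    return max(count - rows, 0)


print(fixed_padded_offsets(3))
# With expert_offsets = [0, 500, 1024], expert 0 spans 500 rows but
# only the first 128 rows of its scale data were populated.
print(rows_without_valid_scales([0, 500, 1024], 0))
```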
compute_activation_global_scales expects local IDs (0..num_experts-1),
not global IDs. EP5/EP7 were getting L2 gs=0 because global IDs (240+,
336+) didn't match expert_id_range (0..47), so no tokens matched any
expert → L1 GEMM got zero inputs → L2 gs=0 → NaN/crash.
Also removed _warmup_done guard since each layer needs its own warmup
(different weights, different gs values).
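
The ID mismatch can be shown with a small sketch. It assumes 48 local
experts per rank and a rank whose experts start at global ID 240, as the
numbers above suggest; count_matches is a hypothetical stand-in for the
matching step, not the real compute_activation_global_scales.

```python
def count_matches(expert_ids, num_local_experts):
    """Count tokens per local expert; IDs outside the range match nothing."""
    counts = [0] * num_local_experts
    for eid in expert_ids:
        if 0 <= eid < num_local_experts:
            counts[eid] += 1
    return counts


NUM_LOCAL = 48
START = 240  # assumed first global ID owned by this rank

global_ids = [240, 241, 241, 287]
matched_global = sum(count_matches(global_ids, NUM_LOCAL))
local_ids = [g - START for g in global_ids]
matched_local = sum(count_matches(local_ids, NUM_LOCAL))
print(matched_global, matched_local)
```

With global IDs no token matches any expert, which is the zero-input
case described above; after subtracting the start index every token
matches.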
- Reverted from full-buffer swizzle to per-expert 128-row slots
- Scatter into e*128 fixed positions (cudagraph-compatible, fixed shape)
- Clamp local_row to 127 for experts with >128 tokens (GEMM uses expert_offsets)
- Buffer sized for num_experts*128 rows (not max_tokens*top_k)
- Add _warmup_done guard to only run warmup once (not 60x)
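
The fixed-position scatter with clamping can be sketched as below. The
function name is illustrative; it only shows how the destination row is
derived so that it always stays inside the num_experts*128 buffer.

```python
ROWS_PER_EXPERT = 128


def scatter_row(expert_id, local_row, rows=ROWS_PER_EXPERT):
    """Destination row in the fixed e*128 layout, clamped to the slot."""
    clamped = min(local_row, rows - 1)  # rows past 127 share the last row
    return expert_id * rows + clamped


num_experts = 4
destinations = [
    scatter_row(e, r) for e in range(num_experts) for r in (0, 127, 400)
]
print(destinations)
```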
The checkpoint input_scale is a calibration value that produces a wrong gs
at runtime (too small → block scales saturate → garbage output → EOS).
Now calls compute_activation_global_scales() with sample data during weight
finalization, before cudagraph capture. This observes actual activation
magnitudes and computes correct L1 and L2 gs values.
Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens
and top_k=6, the actual padded row count can exceed 6144. Also:
- Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192)
- Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks)
- Remove debug prints
Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255) but the runner treated
  them as local IDs (0-31 with EP=8). Tokens for non-local experts got
  wrong expert assignments, causing out-of-bounds scatter indices
  in _assemble_scales_cudagraph_safe.
Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL
JIT GPU memory corruption (refill after first GEMM call)
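
Fix 2 can be sketched on plain lists. This is an illustration under
assumptions, not the code in run(): the function name is hypothetical,
and mapping non-local entries to local ID 0 is an assumed choice (the
message only states that their weights are zeroed).

```python
def remap_topk(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Convert global expert IDs to local ones and mask non-local entries."""
    local_ids, local_weights = [], []
    for row_ids, row_weights in zip(topk_ids, topk_weights):
        ids_out, weights_out = [], []
        for gid, weight in zip(row_ids, row_weights):
            lid = gid - experts_start_idx
            is_local = 0 <= lid < num_local_experts
            # Non-local entries keep a valid index (assumed: 0) so that
            # scatter indices stay in bounds; weight 0 removes their effect.
            ids_out.append(lid if is_local else 0)
            weights_out.append(weight if is_local else 0.0)
        local_ids.append(ids_out)
        local_weights.append(weights_out)
    return local_ids, local_weights


ids, weights = remap_topk(
    [[5, 40], [63, 64]], [[0.7, 0.3], [0.6, 0.4]],
    experts_start_idx=32, num_local_experts=32,
)
print(ids, weights)
```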
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Keeping token_indices on CPU and using sort_idx.cpu() for indexing
avoids the corruption. The .to(device) call after indexing moves the
result back to GPU for the hidden_states indexing.
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Tensors allocated on GPU before/during compilation get zeroed.
Fix: create token_indices on CPU, then .to(device) after JIT is done.
CuTeDSL's cute.compile appears to corrupt GPU memory state,
causing torch.arange to produce zero-filled tensors when allocated
after the JIT compilation. Moving token_indices allocation before
the weight stacking operations fixes the corruption.
Uses quantize_to_nvfp4 during warmup to get exact gs values for L1 and L2.
L1 gs comes from slot_hidden, L2 gs from the actual L1 GEMM output.
These values are then used with quantize_activation_nvfp4 (cudagraph-safe)
during inference.