DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which clamps:
- silu(gate) to max 10.0
- up to [-10.0, 10.0]
Our runner was doing plain F.silu(gate) * up without clamping.
Large gate values could produce unbounded SiLU output, causing
numerical issues in the L2 GEMM. This is likely contributing to
garbage model output.
Dynamic slicing with GPU scalars (e.g. buf[:gpu_scalar]) is a CUDA
operation not permitted during stream capture. Use full pre-allocated
buffers instead of dynamic slices. The GEMM only reads rows indicated
by expert_offsets, ignoring the zero padding.
Also pass x_sf[:num_slots] (Python int slicing, cudagraph-safe) to
scale assembly so it only processes real token scale data.
Removed .cpu().tolist() and per-expert Python loops. Apply the
Blackwell 32_4_4 swizzle to the entire padded_x_sf buffer at once.
The buffer is already 128-row aligned (padded per expert) and 4-col
aligned, so the full-buffer swizzle produces the correct layout.
The GEMM reads scale_a using padded_expert_offsets, which matches
the scatter layout. Fully GPU, zero CPU syncs, cudagraph-safe.
GPU scalars can't be used for Python indexing (requires sync).
Compute padded_expert_offsets on CPU via .cpu().tolist() for
the Python loop. This is OK for cudagraph: Python code only
runs during capture, not replay. The GPU kernel launches
recorded during capture are deterministic.
Root cause of garbage output: fixed-layout padding with
max_chunks=ceil(avg) was too small for uneven expert assignment.
Tokens beyond max_chunks*128 per expert were silently dropped
(clamped_local overwrote the same row).
Fix: compute padded_expert_offsets from actual tokens_per_expert
(padded to 128). No clamping needed — each expert gets exactly
the space it needs. Pass padded_expert_offsets to scale assembly
and GEMM.
Root cause: capping max_num_tokens to 512 made buffers too small for the
actual 8192-token warmup. slot_hidden had 49152 rows but padded_hidden
only had 6144.
Fix: Revert the 512 cap. Use SHARED padded buffers (not per-layer) to
avoid OOM. Only 72 MB total (not 4.3 GB) since layers run sequentially
and reuse the same buffer. Cudagraph-safe since capture and replay both
run layers sequentially on the same tensor.
padded_max_slots was computed from max_tokens*top_k (3072) but
total_padded_slots in run() is num_experts*max_chunks*128 (6144).
The buffer was too small, causing index out of bounds.
padded_hidden/activated buffers were sized for max_num_tokens=8192,
which is 72 MB per layer × 60 layers = 4.3 GB → OOM with 178 GB GPUs
(almost full from model + KV cache).
Now cap at max cudagraph capture size (512 tokens). Eager-mode runs
with >512 tokens will need dynamic allocation, but vLLM always uses
cudagraph for inference after warmup.
Removed torch.zeros() call that created padded_expert_offsets during
scale assembly. Now uses fixed layout computed from Python constants.
Also removed dead reference to padded_expert_offsets variable.
The while loop had variable trip count (GPU scalar in condition),
requiring CPU-GPU sync. Replaced with fixed max_chunks_per_expert
iterations. Unused chunks are zero buffers (harmless for GEMM).
- Compute padded_expert_offsets from real expert_offsets (ceil to 128)
- Scatter x_sf into padded positions matching those offsets
- Per-expert swizzle in 128-row chunks (supports >128 tokens per expert)
- Pad slot_hidden/activated using same padded offsets for GEMM input
- Pre-allocated buffers sized for max_tokens*top_k (not num_experts*128)
Root cause of garbage output: the GEMM reads scale_a according to
expert_offsets (e.g. [0, 500, 1024, ...]) but scale_a had data at
fixed e*128 offsets. When expert 0 has 500 tokens, the GEMM reads
scale_a[0:500] but only rows 0-127 had valid data.
Fix: pad slot_hidden to num_experts*128 rows (128 per expert) and
pass padded_expert_offsets=[0, 128, 256, ...] to the GEMM. Scale
assembly's fixed 128-row layout now matches the GEMM's expectations.
Padding tokens' GEMM output is discarded (scatter_add only uses
sorted_token_ids for real tokens).
compute_activation_global_scales expects local IDs (0..num_experts-1),
not global IDs. EP5/EP7 were getting L2 gs=0 because global IDs (240+,
336+) didn't match expert_id_range (0..47), so no tokens matched any
expert → L1 GEMM got zero inputs → L2 gs=0 → NaN/crash.
Also removed _warmup_done guard since each layer needs its own warmup
(different weights, different gs values).
- Reverted from full-buffer swizzle to per-expert 128-row slots
- Scatter into e*128 fixed positions (cudagraph-compatible, fixed shape)
- Clamp local_row to 127 for experts with >128 tokens (GEMM uses expert_offsets)
- Buffer sized for num_experts*128 rows (not max_tokens*top_k)
- Add _warmup_done guard to only run warmup once (not 60x)
The checkpoint input_scale is a calibration value that produces wrong gs
at runtime (too small → block scales saturate → garbage output → EOS).
Now calls compute_activation_global_scales() with sample data during weight
finalization, before cudagraph capture. This observes actual activation
magnitudes and computes correct L1 and L2 gs values.
Bug 9: padded_x_sf was sized for num_experts*128 rows, but with 8192 tokens
and top_k=6, the actual padded row count can exceed 6144. Also:
- Pass top_k and max_num_tokens from deepseek_v4.py (was defaulting to 8/8192)
- Phase 2 of scale assembly now handles experts with >128 tokens (multiple 128-row chunks)
- Remove debug prints
Root cause of CUDA_ERROR_ASSERT index out of bounds:
- topk_ids contains GLOBAL expert IDs (0-255) but runner treated them
as local IDs (0-31 with EP=8). Tokens for non-local experts got
wrong expert assignments, causing out-of-bounds scatter indices
in _assemble_scales_cudagraph_safe.
Fixes:
1. Add experts_start_idx param to CuTeDSLMoERunner
2. In run(), remap global→local IDs and zero weights for non-local experts
3. Move _token_indices from CPU to GPU (remove sort_idx.cpu() sync)
4. Add _fill_token_indices() and _needs_token_refill to handle CuTeDSL
JIT GPU memory corruption (refill after first GEMM call)
CuTeDSL's cute.compile corrupts GPU memory during JIT compilation.
Keeping token_indices on CPU and using sort_idx.cpu() for indexing
avoids the corruption. The .to(device) call after indexing moves the
result back to GPU for the hidden_states indexing.