Current Bug: CuTeDSLMoERunner — Status & Debug History
Current Status (May 17, 2026 09:01 UTC)
Bug 8 fixed. Ready for vLLM container test.
- ✅ `layertest.py`: 0.988 cosine
- ✅ `cudagraph_test.py`: capture + replay works
- ✅ `test_warmup_gs.py`: warmup gs computation works (test script has a pre-existing NameError in safety margin section, not a runner bug)
- ❌ vLLM server: not yet tested with these fixes
Fixed in this round:
- Bug 8: Global→local expert ID remapping (was causing CUDA_ERROR_ASSERT)
- Removed `.cpu()` sync from `run()`; `_token_indices` is now on GPU and cudagraph-safe
- Added `_needs_token_refill` flag to handle CuTeDSL JIT GPU memory corruption after first GEMM call
Bugs Found & Fixed
Bug 1: Scale Assembly — Global Swizzle vs Per-Expert Swizzle
Symptom: GEMM produced all zeros even with correct global_scale.
Root cause: The original _assemble_scales_cudagraph_safe called pad_and_swizzle_single() on the ENTIRE padded buffer (all experts concatenated). But the kernel expects each expert's 128-row block to be swizzled independently (matching assemble_scales_2d_side which pads+swizzles each expert separately before concatenation).
Fix: Two-phase approach:
- Scatter x_sf rows into 128-aligned positions in a padded buffer (GPU-only, no CPU sync)
- Per-expert: copy 128 rows from padded buffer, `pad_and_swizzle_single()` each expert's block independently, then concatenate
Key insight from torch_scaled_grouped_mm.py line ~1115: The kernel computes padded offsets internally when consistent_token_padding=False:
padded_size = round_up(offs[expert_idx] - offs[expert_idx-1], pad_granularity) # 128
So the kernel knows each expert's scale data is in a 128-row block.
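A minimal sketch of the scatter bookkeeping from phase one, written with plain Python lists so it runs without a GPU. `padded_row_map` is a hypothetical helper, not part of the runner. It assumes `expert_offsets` is the cumulative token count without a leading 0 and applies the `round_up(..., 128)` formula quoted above.

```python
def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_row_map(expert_offsets, pad_granularity=128):
    """Map each source row to its row in the 128-aligned padded buffer.

    expert_offsets: cumulative token counts per expert, no leading 0,
    e.g. [4, 8, 8] for token counts [4, 4, 0].
    Returns (dst_rows, total_padded_rows).
    """
    dst_rows = []
    padded_start = 0  # first padded row owned by the current expert
    prev_end = 0
    for end in expert_offsets:
        count = end - prev_end
        dst_rows.extend(padded_start + i for i in range(count))
        # Under the quoted formula an expert with 0 tokens gets 0 padded rows.
        padded_start += round_up(count, pad_granularity)
        prev_end = end
    return dst_rows, padded_start


dst_rows, total_rows = padded_row_map([4, 8, 8])
print(dst_rows, total_rows)
```

Each expert's rows start on a 128-row boundary, which is what lets the second phase swizzle one expert's block at a time.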
Bug 2: searchsorted(right=False) — Wrong Expert Assignment
Symptom: Scale data in wrong positions after scatter.
Root cause: torch.searchsorted([4, 8, 8], 4, right=False) returns 0, assigning row 4 (expert 1's first token) to expert 0.
Fix: Changed to right=True:
expert_assign = torch.searchsorted(expert_offsets[1:], row_indices, right=True)
Verified: Row 4 → expert 1 (correct), rows 0-3 → expert 0 (correct).
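The boundary behavior can be reproduced without torch. The sketch below uses the standard library's `bisect_left` and `bisect_right` as stand-ins for `torch.searchsorted` with `right=False` and `right=True`; the two pairs follow the same left/right insertion-point convention.

```python
from bisect import bisect_left, bisect_right

# expert_offsets[1:] for token counts [4, 4, 0]: rows 0-3 -> expert 0, rows 4-7 -> expert 1
expert_ends = [4, 8, 8]
rows = range(8)

# right=False equivalent: row 4 lands on expert 0 (the bug)
assign_left = [bisect_left(expert_ends, r) for r in rows]

# right=True equivalent: row 4 lands on expert 1 (the fix)
assign_right = [bisect_right(expert_ends, r) for r in rows]

print("right=False:", assign_left)
print("right=True: ", assign_right)
```

The two results differ only on the row that equals a boundary value, which is why the bug moved exactly one row per expert boundary.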
Bug 3: CuTeDSL cute.compile GPU Memory Corruption — CRITICAL
Symptom: _token_indices was all zeros, making every token map to token 0.
Root cause: CuTeDSL's cute.compile (JIT compilation) corrupts GPU memory. Tensors allocated on GPU before or during JIT compilation get zeroed. Pre-existing tensors allocated before the JIT survive. This is a bug in the CuTeDSL library.
Impact: _token_indices (int32 on GPU) was zeroed, causing hidden_states[sorted_token_ids] to return hidden_states[0] for all 8 slots. Every expert saw the same input.
Fix: Allocate _token_indices on CPU, keep it there. In run() and compute_activation_global_scales(), index with sort_idx.cpu() then move result to GPU:
sorted_token_ids = token_indices[sort_idx.cpu()].to(device)
Warning: This introduces a CPU-GPU sync (.cpu()) that interferes with cudagraph capture. The workaround was later replaced; see Bug 8b.
Bug 4: expert_offsets With Leading 0
Symptom: GEMM produced wrong output with correct scale data.
Root cause: The runner passed expert_offsets[:num_experts + 1] = [0, 4, 8, 8] (4 elements with leading 0) but the kernel expects compute_expert_offsets([4, 4, 0], 3) = [4, 8, 8] (3 elements, cumulative sum without leading 0).
Fix: Pass expert_offsets[1:num_experts + 1] to the GEMM.
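The two offset formats can be compared side by side in plain Python. `offsets_with_leading_zero` and `offsets_for_kernel` are hypothetical stand-ins for the runner's buffer and for `compute_expert_offsets`; only the shapes and values come from the description above.

```python
from itertools import accumulate


def offsets_with_leading_zero(token_counts):
    """Runner-style buffer: [0, c0, c0+c1, ...] (num_experts + 1 entries)."""
    return [0, *accumulate(token_counts)]


def offsets_for_kernel(token_counts):
    """Kernel-style offsets: cumulative sum without the leading 0."""
    return list(accumulate(token_counts))


token_counts = [4, 4, 0]
num_experts = len(token_counts)

runner_buf = offsets_with_leading_zero(token_counts)
kernel_arg = runner_buf[1:num_experts + 1]  # the slice used in the fix

print(runner_buf, kernel_arg)
```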
Bug 5: Checkpoint input_scale Is Wrong for Activation Global Scale
Symptom: Block scales all saturate at float8 max (448), producing garbage quantization.
Root cause: The checkpoint's input_scale (~0.000286) is a calibration value computed from a different input magnitude (amax ≈ 0.77) than what runtime produces (amax ≈ 8.17). Too-small gs → x/gs has values up to ~13000 → block_amax/6 ≈ 2174 → overflows float8_e4m3fn max of 448 → saturated block scales → garbage.
Fix: compute_activation_global_scales() warmup method that runs quantize_to_nvfp4 (dynamic gs with .max()) before cudagraph capture to get the exact gs values for L1 and L2.
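The saturation arithmetic can be checked with a few lines of Python. This is a sketch under assumptions: the block scale is modeled as `(block_amax / gs) / 6`, following the description above, and `dynamic_gs` assumes the global scale is `amax / (6 * 448)`. That formula is not confirmed to be what `quantize_to_nvfp4` computes, although it does reproduce the quoted checkpoint value from the calibration amax of 0.77.

```python
FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8_e4m3fn


def block_scale(block_amax, gs):
    """Block scale before it is cast to FP8, per the description above."""
    return (block_amax / gs) / FP4_E2M1_MAX


def saturates(block_amax, gs):
    """True if the block scale exceeds the FP8 range and would clamp to 448."""
    return block_scale(block_amax, gs) > FP8_E4M3_MAX


def dynamic_gs(amax):
    """Assumed dynamic global scale: maps the tensor amax to the top of the range."""
    return amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)


checkpoint_gs = 0.000286  # calibration value quoted above
runtime_amax = 8.17       # runtime amax quoted above

print(saturates(runtime_amax, checkpoint_gs))                  # calibration gs
print(saturates(0.5 * runtime_amax, dynamic_gs(runtime_amax))) # warmup-style gs
```

With the checkpoint value the runtime block scale is far above 448, while a gs derived from the runtime amax keeps a block at half the amax well inside the range.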
Bug 6: L1 and L2 Need Separate Activation Global Scales
Symptom: L2 output was garbage even with correct L1 gs.
Root cause: After SiLU(gate)*up, the activation has amax ~286. The L1 gs (from input amax ~8) is 30x too small for L2, causing even worse block scale saturation.
Fix: compute_activation_global_scales() computes L1 gs from the input, runs the L1 GEMM, then computes L2 gs from the actual L1 output (after SiLU*up).
Bug 7: L1 and L2 Need Separate Padded Scale Buffers
Symptom: IndexError when quantizing L2 activation — K_sf differs between L1 (448) and L2 (192).
Root cause: padded_x_sf_buf was allocated with L1's K_sf (448). L2's x_sf has K_sf=192, so writing it into the L1-sized buffer fails with the IndexError above.
Fix: Separate _padded_x_sf_buf_l1 and _padded_x_sf_buf_l2, plus separate _per_expert_scale_bufs_l1 and _per_expert_scale_bufs_l2.
Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
Symptom: IndexKernel.cu:111 assertion -sizes[i] <= index && index < sizes[i] failed, cascading into CUDA_ERROR_ASSERT (710) across all workers. vLLM server crash on first inference.
Root cause: With expert parallelism (EP=8), topk_ids contains global expert IDs (0-255), but CuTeDSLMoERunner treated them as local IDs (0-31). Each rank only owns 32 experts (num_experts=32), so tokens assigned to experts 32-255 produced:
- Wrong `expert_offsets` computation (tokens matched no local expert → zero counts for many experts)
- Out-of-bounds scatter indices in `_assemble_scales_cudagraph_safe` (`dst_rows` exceeded `padded_x_sf` buffer size)
- CUDA device-side assert → all subsequent CUDA calls fail with error 710
The layertest never hit this because it uses local expert IDs directly (no EP).
Fix:
- Added `experts_start_idx` param to `CuTeDSLMoERunner`
- In `run()`: remap global→local via `local_ids = topk_ids - experts_start_idx`, mask non-local tokens with zero weight, clamp IDs to valid range
- Pass `experts_start_idx` from `deepseek_v4.py` (which already stores it from EP setup)
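The remap, mask, and clamp steps can be sketched as follows. Plain Python lists stand in for the `topk_ids` and weight tensors so the example runs anywhere; `remap_to_local` is a hypothetical helper, and the real `run()` presumably does the same thing with element-wise tensor ops.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Convert global expert IDs to local ones for a single EP rank.

    Non-local entries keep a valid (clamped) ID so later indexing stays in
    bounds, but their weight is zeroed so they contribute nothing.
    """
    local_ids, weights = [], []
    for global_id, weight in zip(topk_ids, topk_weights):
        local_id = global_id - experts_start_idx
        is_local = 0 <= local_id < num_local_experts
        weights.append(weight if is_local else 0.0)
        local_ids.append(min(max(local_id, 0), num_local_experts - 1))
    return local_ids, weights


# Rank 1 of EP=8 with 256 experts owns global IDs 32..63.
ids, w = remap_to_local(
    topk_ids=[5, 32, 63, 200],
    topk_weights=[0.1, 0.2, 0.3, 0.4],
    experts_start_idx=32,
    num_local_experts=32,
)
print(ids, w)
```

Clamping without masking would silently route foreign tokens to local experts 0 or 31, so the two steps only make sense together.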
Bug 8b: .cpu() Sync Breaking Cudagraph Compatibility
Symptom: sort_idx.cpu() in run() is a CPU-GPU synchronization point that cudagraph cannot capture.
Root cause: _token_indices was kept on CPU to avoid CuTeDSL JIT GPU memory corruption (Bug 3). But cudagraph requires all ops to be GPU-only.
Fix:
- Moved `_token_indices` to GPU
- Added `_fill_token_indices()` method to refill the tensor after potential corruption
- Added `_needs_token_refill` flag: set after `_ensure_stacked()` (weight JIT), checked/cleared after first `run()` call (GEMM JIT). After both JITs have fired, the tensor is stable.
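The flag lifecycle described above might look roughly like the sketch below. Everything here is a hypothetical stand-in: `TokenIndexGuard` is not the runner, a Python list replaces the GPU tensor, the slot layout (`slot // top_k`) is an assumption, and the exact points at which the real code refills are not spelled out in these notes. The refill is done in place so the buffer keeps its identity, which is what cudagraph replay would need.

```python
class TokenIndexGuard:
    """Hypothetical model of the _token_indices refill logic."""

    def __init__(self, num_tokens, top_k):
        self.top_k = top_k
        self.token_indices = [0] * (num_tokens * top_k)
        self.needs_token_refill = False
        self.fill_token_indices()

    def fill_token_indices(self):
        # Assumed layout: flat slot i belongs to token i // top_k.
        for slot in range(len(self.token_indices)):
            self.token_indices[slot] = slot // self.top_k

    def ensure_stacked(self):
        # The weight JIT would run here and may zero the buffer.
        self.needs_token_refill = True

    def run(self, gemm):
        if self.needs_token_refill:
            self.fill_token_indices()  # repair damage from the weight JIT
        result = gemm(list(self.token_indices))
        if self.needs_token_refill:
            self.fill_token_indices()  # repair damage from the first GEMM JIT
            self.needs_token_refill = False
        return result


guard = TokenIndexGuard(num_tokens=4, top_k=2)
guard.ensure_stacked()
guard.token_indices[:] = [0] * 8  # simulate the weight JIT zeroing the buffer


def corrupting_gemm(indices):
    guard.token_indices[:] = [0] * 8  # simulate the first GEMM JIT zeroing it again
    return indices


seen_by_gemm = guard.run(corrupting_gemm)
print(seen_by_gemm, guard.token_indices, guard.needs_token_refill)
```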
Debug Methodology — How We Got Here
Step 1: Identified the CuTeDSL kernel works (layertest = 0.988)
The layertest uses moe_pipeline.run_nvfp4_moe with quantize_to_nvfp4 (dynamic gs) and assemble_scales_2d_side (per-expert split). This is the reference implementation.
Step 2: Wrote test_runner_vs_pipeline.py
Compared runner.run() vs run_nvfp4_moe() with same weights and inputs. Found runner produces all zeros.
Step 3: Wrote test_scale_assembly.py
Compared _assemble_scales_cudagraph_safe vs assemble_scales_2d_side. Found data mismatch (global vs per-expert swizzle).
Step 4: Fixed scale assembly
Rewrote _assemble_scales_cudagraph_safe with scatter + per-expert swizzle. Scale data now matches reference.
Step 5: Found GEMM still produces zeros with correct scales
Isolated the issue: GEMM with the exact same inputs gives cosine 1.0, but runner gives 0.18. The problem was expert_offsets format (leading 0).
Step 6: Fixed expert_offsets, found token_indices corruption
After fixing expert_offsets, cosine improved to 0.35. Traced to _token_indices being all zeros (CuTeDSL GPU corruption).
Step 7: Found and fixed the GPU corruption
Moved _token_indices to CPU. Cosine jumped to 0.46 with default gs, 0.97 with warmup gs.
Step 8: Wrote test_warmup_gs.py
Verified warmup gs computation, tested safety margins, tested different inputs. Found 1.0x safety (no margin) gives best results.
Test Files
| File | Purpose |
|---|---|
| `tests/layertest.py` | Reference: moe_pipeline with dynamic gs, 3 experts, layer 0. Must pass (≥0.98 cosine). |
| `tests/cudagraph_test.py` | CuTeDSLMoERunner cudagraph capture + replay. Must pass. |
| `tests/test_runner_vs_pipeline.py` | Compare runner.run() vs moe_pipeline. With correct gs should be ≥0.97. |
| `tests/test_scale_assembly.py` | Compare cudagraph-safe vs reference scale assembly. Data must match. |
| `tests/test_warmup_gs.py` | Warmup gs computation, safety margin sweep, different input test. |
| `tests/test_scale_debug.py` | Byte-level scale debug (can be cleaned up). |
Run order after any code change:
- `python3 tests/layertest.py`: must pass
- `python3 tests/cudagraph_test.py`: must pass
- `python3 tests/test_warmup_gs.py`: should show ≥0.97 cosine
Files Modified
| File | Changes |
|---|---|
| `vllm/nvfp4_cutedsl.py` | All 7 bug fixes, compute_activation_global_scales() warmup, CPU token_indices |
| `vllm/patches/deepseek_v4.py` | Removed checkpoint input_scale → activation global_scale mapping |
Next Steps for vLLM Integration
- Add warmup call in `deepseek_v4.py`: After `finalize_weights()`, call `runner.compute_activation_global_scales()` with a sample input (e.g., 1 token of random data). This must happen before cudagraph capture.
- Verify cudagraph compatibility: The `sort_idx.cpu()` call in `run()` was a CPU-GPU sync that cudagraph cannot capture. The Bug 8b fix removed it by keeping `_token_indices` on GPU, and `cudagraph_test.py` passes; capture and replay still need to be confirmed inside vLLM.
- Test the vLLM container: Spin up the server and test with a simple prompt. The output should be mostly correct (0.97 cosine ≈ near-perfect output).
- Optimize warmup: The current warmup runs a full forward pass (L1 + L2 GEMM). This is slow (~minutes due to JIT). Consider caching the gs values or computing them more efficiently.
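One way the gs caching idea could look, as a rough sketch only. `load_or_compute_gs`, the JSON file, and the `l1_gs`/`l2_gs` key names are all hypothetical; `compute_fn` stands in for whatever wraps `compute_activation_global_scales()`. Note that a cached gs is only valid while the activation magnitudes match the ones it was computed from, which is the same failure mode as the stale checkpoint input_scale in Bug 5.

```python
import json
from pathlib import Path


def load_or_compute_gs(cache_path, compute_fn):
    """Return cached global scales if present, else compute and store them.

    compute_fn: zero-argument callable returning a JSON-serializable dict,
    e.g. {"l1_gs": ..., "l2_gs": ...} (hypothetical key names).
    """
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text())
    scales = compute_fn()  # the slow warmup path
    path.write_text(json.dumps(scales))
    return scales
```

A real version would likely key the cache on the model checkpoint and layer index, and fall back to the warmup when the key does not match.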