Current Bug: CuTeDSLMoERunner — Status & Debug History

Current Status (May 17, 2026 09:01 UTC)

Bug 8 fixed. Ready for vLLM container test.

  • layertest.py — 0.988 cosine
  • cudagraph_test.py — capture + replay works
  • test_warmup_gs.py — warmup gs computation works (test script has a pre-existing NameError in safety margin section, not a runner bug)
  • vLLM server — not yet tested with these fixes

Fixed in this round:

  • Bug 8: Global→local expert ID remapping (was causing CUDA_ERROR_ASSERT)
  • Removed the .cpu() sync from run(): _token_indices is now on GPU and cudagraph-safe
  • Added _needs_token_refill flag to handle CuTeDSL JIT GPU memory corruption after first GEMM call

Bugs Found & Fixed

Bug 1: Scale Assembly — Global Swizzle vs Per-Expert Swizzle

Symptom: GEMM produced all zeros even with correct global_scale.

Root cause: The original _assemble_scales_cudagraph_safe called pad_and_swizzle_single() on the ENTIRE padded buffer (all experts concatenated). But the kernel expects each expert's 128-row block to be swizzled independently (matching assemble_scales_2d_side which pads+swizzles each expert separately before concatenation).

Fix: Two-phase approach:

  1. Scatter x_sf rows into 128-aligned positions in a padded buffer (GPU-only, no CPU sync)
  2. Per-expert: copy 128 rows from padded buffer, pad_and_swizzle_single() each expert's block independently, then concatenate

Key insight from torch_scaled_grouped_mm.py line ~1115: The kernel computes padded offsets internally when consistent_token_padding=False:

padded_size = round_up(offs[expert_idx] - offs[expert_idx-1], pad_granularity)  # 128

So the kernel knows each expert's scale data is in a 128-row block.
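The destination-row arithmetic for phase 1 can be sketched in plain Python (rather than torch, so it runs anywhere). This is a minimal illustration, assuming expert_offsets is the cumulative token count without a leading 0 and pad_granularity is 128; the helper names round_up and padded_destination_rows are hypothetical, not the runner's actual functions.

```python
def round_up(n: int, multiple: int) -> int:
    """Round n up to the next multiple of `multiple` (0 stays 0)."""
    return (n + multiple - 1) // multiple * multiple


def padded_destination_rows(expert_offsets, pad_granularity=128):
    """Map each sorted token row to its row in the 128-aligned padded buffer.

    expert_offsets: cumulative token counts per expert, no leading 0,
    e.g. [4, 8, 8] for per-expert counts [4, 4, 0].
    """
    dst_rows = []
    padded_start = 0  # first padded row of the current expert's block
    prev_end = 0      # end of the previous expert in the unpadded layout
    for end in expert_offsets:
        count = end - prev_end
        # Each expert's rows start at its own padded block boundary.
        dst_rows.extend(padded_start + i for i in range(count))
        padded_start += round_up(count, pad_granularity)
        prev_end = end
    return dst_rows


rows = padded_destination_rows([4, 8, 8])
print(rows)  # [0, 1, 2, 3, 128, 129, 130, 131]
```

Expert 1's four rows land at 128-131, so each expert's block can then be swizzled on its own.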

Bug 2: searchsorted(right=False) — Wrong Expert Assignment

Symptom: Scale data in wrong positions after scatter.

Root cause: torch.searchsorted([4, 8, 8], 4, right=False) returns 0, assigning row 4 (expert 1's first token) to expert 0.

Fix: Changed to right=True:

expert_assign = torch.searchsorted(expert_offsets[1:], row_indices, right=True)

Verified: Row 4 → expert 1 (correct), rows 0-3 → expert 0 (correct).
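The boundary behaviour can be reproduced without torch: Python's bisect_left and bisect_right follow the same left/right convention as torch.searchsorted with right=False and right=True. A small sketch, using the [4, 8, 8] boundaries from the example above:

```python
from bisect import bisect_left, bisect_right

# expert_offsets[1:] for per-expert token counts [4, 4, 0]
boundaries = [4, 8, 8]

# right=False equivalent: row 4 is wrongly assigned to expert 0
wrong = [bisect_left(boundaries, row) for row in range(8)]
# right=True equivalent: row 4 is assigned to expert 1
fixed = [bisect_right(boundaries, row) for row in range(8)]

print(wrong)  # [0, 0, 0, 0, 0, 1, 1, 1]
print(fixed)  # [0, 0, 0, 0, 1, 1, 1, 1]
```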

Bug 3: CuTeDSL cute.compile GPU Memory Corruption — CRITICAL

Symptom: _token_indices was all zeros, making every token map to token 0.

Root cause: CuTeDSL's cute.compile (JIT compilation) corrupts GPU memory. Tensors allocated on the GPU shortly before or during JIT compilation get zeroed, while tensors that already existed well before the JIT survive. This is a bug in the CuTeDSL library.

Impact: _token_indices (int32 on GPU) was zeroed, causing hidden_states[sorted_token_ids] to return hidden_states[0] for all 8 slots. Every expert saw the same input.

Fix: Allocate _token_indices on CPU, keep it there. In run() and compute_activation_global_scales(), index with sort_idx.cpu() then move result to GPU:

sorted_token_ids = token_indices[sort_idx.cpu()].to(device)

Warning: This introduces a CPU-GPU sync (.cpu()) that interferes with cudagraph capture. It was later removed; see Bug 8b.

Bug 4: expert_offsets With Leading 0

Symptom: GEMM produced wrong output with correct scale data.

Root cause: The runner passed expert_offsets[:num_experts + 1] = [0, 4, 8, 8] (4 elements with leading 0) but the kernel expects compute_expert_offsets([4, 4, 0], 3) = [4, 8, 8] (3 elements, cumulative sum without leading 0).

Fix: Pass expert_offsets[1:num_experts + 1] to the GEMM.
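The two formats side by side, as a plain-Python sketch. The counts_from_ends helper is hypothetical and only illustrates how a consumer that treats each entry as a cumulative end would misread the leading-0 format; the kernel's actual handling of the extra element is not documented here.

```python
from itertools import accumulate

tokens_per_expert = [4, 4, 0]
num_experts = len(tokens_per_expert)

# What the runner stores: cumulative sum with a leading 0.
expert_offsets = [0, *accumulate(tokens_per_expert)]  # [0, 4, 8, 8]

# What the GEMM expects: cumulative sum without the leading 0.
kernel_offsets = expert_offsets[1:num_experts + 1]    # [4, 8, 8]


def counts_from_ends(ends):
    """Recover per-expert counts when each entry is a cumulative end."""
    return [end - start for start, end in zip([0, *ends[:-1]], ends)]


print(counts_from_ends(kernel_offsets))               # [4, 4, 0]
# Reading the first num_experts entries of the leading-0 format as ends
# shifts every expert by one slot:
print(counts_from_ends(expert_offsets[:num_experts]))  # [0, 4, 4]
```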

Bug 5: Checkpoint input_scale Is Wrong for Activation Global Scale

Symptom: Block scales all saturate at float8 max (448), producing garbage quantization.

Root cause: The checkpoint's input_scale (~0.000286) is a calibration value computed from a different input magnitude (amax ≈ 0.77) than what runtime produces (amax ≈ 8.17). With a gs that small, x/gs reaches values up to ~13000, so block_amax/6 ≈ 2174, which overflows the float8_e4m3fn max of 448. The block scales saturate and the quantized output is garbage.

Fix: compute_activation_global_scales() warmup method that runs quantize_to_nvfp4 (dynamic gs with .max()) before cudagraph capture to get the exact gs values for L1 and L2.
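The saturation condition is simple arithmetic and can be checked in isolation. A minimal sketch, using the block_amax/6 rule and the 448 limit quoted above (6 is the largest FP4 E2M1 magnitude); the function names are illustrative only:

```python
FP4_E2M1_MAX = 6.0      # largest magnitude representable in FP4 (E2M1)
FP8_E4M3FN_MAX = 448.0  # largest finite float8_e4m3fn value


def block_scale(scaled_block_amax: float) -> float:
    """Block scale needed for a block whose amax (after x/gs) is given."""
    return scaled_block_amax / FP4_E2M1_MAX


def saturates(scaled_block_amax: float) -> bool:
    """True if the block scale no longer fits in float8_e4m3fn."""
    return block_scale(scaled_block_amax) > FP8_E4M3FN_MAX


# Largest x/gs magnitude whose block scale still fits.
max_safe_scaled_amax = FP4_E2M1_MAX * FP8_E4M3FN_MAX

print(max_safe_scaled_amax)             # 2688.0
print(saturates(13000.0))               # True
print(saturates(max_safe_scaled_amax))  # False
```

Any gs that leaves x/gs above 2688 in some block will saturate that block's scale.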

Bug 6: L1 and L2 Need Separate Activation Global Scales

Symptom: L2 output was garbage even with correct L1 gs.

Root cause: After SiLU(gate)*up, the activation has amax ~286. The L1 gs (from input amax ~8) is roughly 35x too small for L2, causing even worse block scale saturation.

Fix: compute_activation_global_scales() computes L1 gs from the input, runs the L1 GEMM, then computes L2 gs from the actual L1 output (after SiLU*up).
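The ordering of the two-stage warmup can be sketched as follows. This is a toy in plain Python, not the runner's code: it assumes gs = amax / (6 * 448), which is one common NVFP4 convention and may differ from what quantize_to_nvfp4 computes, and it takes the L1 GEMM as a caller-supplied function (l1_gemm is a hypothetical stand-in).

```python
import math

FP4_E2M1_MAX = 6.0
FP8_E4M3FN_MAX = 448.0


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def global_scale(values) -> float:
    # ASSUMPTION: gs = amax / (6 * 448); the real formula may differ.
    amax = max(abs(v) for v in values)
    return amax / (FP4_E2M1_MAX * FP8_E4M3FN_MAX)


def warmup_global_scales(x, l1_gemm):
    """Return (gs_l1, gs_l2), measuring L2's scale on the real L1 output."""
    gs_l1 = global_scale(x)                  # from the layer input
    gate, up = l1_gemm(x)                    # run L1 before measuring L2
    act = [silu(g) * u for g, u in zip(gate, up)]
    gs_l2 = global_scale(act)                # from SiLU(gate) * up
    return gs_l1, gs_l2


# Toy L1 that amplifies its input, so L2 sees a much larger amax.
def toy_l1(x):
    return [10.0 * v for v in x], [10.0 * v for v in x]


gs_l1, gs_l2 = warmup_global_scales([1.0, -2.0, 0.5], toy_l1)
print(gs_l2 > gs_l1)
```

The point of the ordering is that gs_l2 depends on the activation L1 actually produces, so it cannot be derived from the input alone.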

Bug 7: L1 and L2 Need Separate Padded Scale Buffers

Symptom: IndexError when quantizing L2 activation — K_sf differs between L1 (448) and L2 (192).

Root cause: padded_x_sf_buf was allocated with L1's K_sf (448). L2's x_sf has K_sf=192, so writing it into the L1-sized buffer fails with the IndexError above.

Fix: Separate _padded_x_sf_buf_l1 and _padded_x_sf_buf_l2, plus separate _per_expert_scale_bufs_l1 and _per_expert_scale_bufs_l2.

Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT

Symptom: IndexKernel.cu:111 assertion -sizes[i] <= index && index < sizes[i] failed, cascading into CUDA_ERROR_ASSERT (710) across all workers. vLLM server crash on first inference.

Root cause: With expert parallelism (EP=8), topk_ids contains global expert IDs (0-255), but CuTeDSLMoERunner treated them as local IDs (0-31). Each rank only owns 32 experts (num_experts=32), so tokens assigned to experts 32-255 produced:

  1. Wrong expert_offsets computation (tokens matched no local expert → zero counts for many experts)
  2. Out-of-bounds scatter indices in _assemble_scales_cudagraph_safe (dst_rows exceeded padded_x_sf buffer size)
  3. CUDA device-side assert → all subsequent CUDA calls fail with error 710

The layertest never hit this because it uses local expert IDs directly (no EP).

Fix:

  1. Added experts_start_idx param to CuTeDSLMoERunner
  2. In run(): remap global→local via local_ids = topk_ids - experts_start_idx, mask non-local tokens with zero weight, clamp IDs to valid range
  3. Pass experts_start_idx from deepseek_v4.py (which already stores it from EP setup)
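The remap, mask, and clamp steps can be illustrated with plain Python lists. This is a sketch of the logic only, assuming 256 global experts split across EP=8 ranks; remap_to_local is a hypothetical helper, and the real runner operates on GPU tensors.

```python
def remap_to_local(topk_ids, topk_weights, experts_start_idx, num_local_experts):
    """Convert global expert IDs to local ones for a single EP rank.

    Tokens routed to experts owned by other ranks keep a valid (clamped)
    local ID but get zero weight, so they contribute nothing.
    """
    local_ids, local_weights = [], []
    for global_id, weight in zip(topk_ids, topk_weights):
        local_id = global_id - experts_start_idx
        is_local = 0 <= local_id < num_local_experts
        local_weights.append(weight if is_local else 0.0)
        # Clamp so downstream indexing never goes out of bounds.
        local_ids.append(min(max(local_id, 0), num_local_experts - 1))
    return local_ids, local_weights


# Rank 1 of 8 owns global experts 32..63.
ids, weights = remap_to_local(
    [31, 32, 63, 64, 255], [0.5] * 5,
    experts_start_idx=32, num_local_experts=32,
)
print(ids)      # [0, 0, 31, 31, 31]
print(weights)  # [0.0, 0.5, 0.5, 0.0, 0.0]
```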

Bug 8b: .cpu() Sync Breaking Cudagraph Compatibility

Symptom: sort_idx.cpu() in run() is a CPU-GPU synchronization point that cudagraph cannot capture.

Root cause: _token_indices was kept on CPU to avoid CuTeDSL JIT GPU memory corruption (Bug 3). But cudagraph requires all ops to be GPU-only.

Fix:

  1. Moved _token_indices to GPU
  2. Added _fill_token_indices() method to refill the tensor after potential corruption
  3. Added _needs_token_refill flag — set after _ensure_stacked() (weight JIT), checked/cleared after first run() call (GEMM JIT). After both JITs have fired, the tensor is stable.

Debug Methodology — How We Got Here

Step 1: Identified the CuTeDSL kernel works (layertest = 0.988)

The layertest uses moe_pipeline.run_nvfp4_moe with quantize_to_nvfp4 (dynamic gs) and assemble_scales_2d_side (per-expert split). This is the reference implementation.

Step 2: Wrote test_runner_vs_pipeline.py

Compared runner.run() vs run_nvfp4_moe() with same weights and inputs. Found runner produces all zeros.

Step 3: Wrote test_scale_assembly.py

Compared _assemble_scales_cudagraph_safe vs assemble_scales_2d_side. Found data mismatch (global vs per-expert swizzle).

Step 4: Fixed scale assembly

Rewrote _assemble_scales_cudagraph_safe with scatter + per-expert swizzle. Scale data now matches reference.

Step 5: Found GEMM still produces zeros with correct scales

Isolated the issue: calling the GEMM directly with the exact same inputs gives cosine 1.0, but the runner gives 0.18. The problem was the expert_offsets format (leading 0).

Step 6: Fixed expert_offsets, found token_indices corruption

After fixing expert_offsets, cosine improved to 0.35. Traced to _token_indices being all zeros (CuTeDSL GPU corruption).

Step 7: Found and fixed the GPU corruption

Moved _token_indices to CPU. Cosine jumped to 0.46 with default gs, 0.97 with warmup gs.

Step 8: Wrote test_warmup_gs.py

Verified warmup gs computation, tested safety margins, tested different inputs. Found 1.0x safety (no margin) gives best results.


Test Files

| File | Purpose |
| --- | --- |
| tests/layertest.py | Reference: moe_pipeline with dynamic gs, 3 experts, layer 0. Must pass (≥0.98 cosine). |
| tests/cudagraph_test.py | CuTeDSLMoERunner cudagraph capture + replay. Must pass. |
| tests/test_runner_vs_pipeline.py | Compare runner.run() vs moe_pipeline. With correct gs should be ≥0.97. |
| tests/test_scale_assembly.py | Compare cudagraph-safe vs reference scale assembly. Data must match. |
| tests/test_warmup_gs.py | Warmup gs computation, safety margin sweep, different input test. |
| tests/test_scale_debug.py | Byte-level scale debug (can be cleaned up). |
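The cosine thresholds above compare a flattened output against a flattened reference. The tests' own implementation is not reproduced here; a minimal stdlib sketch of the metric, assuming both tensors are flattened to equal-length lists, looks like this:

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero output (the Bug 1 symptom) has no direction.
        return 0.0
    return dot / (norm_a * norm_b)


reference = [1.0, 2.0, 3.0]
print(cosine_similarity(reference, [2.0, 4.0, 6.0]) >= 0.98)  # True
print(cosine_similarity(reference, [0.0, 0.0, 0.0]))          # 0.0
```

Cosine ignores overall magnitude, so an output that is off by a constant factor still scores close to 1.0.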

Run order after any code change:

  1. python3 tests/layertest.py — must pass
  2. python3 tests/cudagraph_test.py — must pass
  3. python3 tests/test_warmup_gs.py — should show ≥0.97 cosine
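The run order can be automated with a small wrapper that stops at the first failing script. This is a convenience sketch, not part of the repository; run_in_order is a hypothetical helper, and it assumes each test script signals failure through a non-zero exit code.

```python
import subprocess
import sys

# Test scripts in the order listed above, relative to the repo root.
TEST_ORDER = [
    "tests/layertest.py",
    "tests/cudagraph_test.py",
    "tests/test_warmup_gs.py",
]


def run_in_order(commands):
    """Run each command in turn; return the index of the first failure.

    Returns None if every command exits with status 0.
    """
    for index, command in enumerate(commands):
        if subprocess.run(command).returncode != 0:
            return index
    return None


# Usage, from the repo root:
#   failed = run_in_order([[sys.executable, path] for path in TEST_ORDER])
#   if failed is not None:
#       print(f"stopped at {TEST_ORDER[failed]}")
```

Because test_warmup_gs.py currently has a NameError in its safety margin section, it may exit non-zero even when the warmup gs computation itself is fine.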

Files Modified

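| File | Changes |
| --- | --- |
| vllm/nvfp4_cutedsl.py | Fixes for Bugs 1-8 and 8b, compute_activation_global_scales() warmup, GPU-resident _token_indices with _needs_token_refill flag |
| vllm/patches/deepseek_v4.py | Removed checkpoint input_scale → activation global_scale mapping; passes experts_start_idx to CuTeDSLMoERunner |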

Next Steps for vLLM Integration

  1. Add warmup call in deepseek_v4.py: After finalize_weights(), call runner.compute_activation_global_scales() with a sample input (e.g., 1 token of random data). This must happen before cudagraph capture.

  2. Verify cudagraph compatibility inside vLLM: The sort_idx.cpu() sync in run() was removed in Bug 8b and cudagraph_test.py passes capture + replay, but this has not yet been exercised in the vLLM server. Confirm that _token_indices stays intact on GPU once both JITs have fired.

  3. Test the vLLM container: Spin up the server and test with a simple prompt. The output should be mostly correct (0.97 cosine ≈ near-perfect output).

  4. Optimize warmup: The current warmup runs a full forward pass (L1 + L2 GEMM). This is slow (~minutes due to JIT). Consider caching the gs values or computing them more efficiently.