Current Bug: CuTeDSLMoERunner — Status & Debug History

Current Status (May 17, 2026 21:30 UTC)

Bug 26 FIXED. All tests pass.

  • layertest.py — 0.988 cosine
  • cudagraph_test.py — capture + replay works
  • test_pipeline_real_weights.py — 0.988 cosine (8 tokens, 48 experts)
  • vLLM container — needs rebuild + test with Bug 26 fix

Latest fix: Missing swiglu_limit=10.0 activation clamping (Bug 25). DeepSeek-V4 uses SiluAndMulWithClamp(10.0), which clamps silu(gate) to a maximum of 10.0 and clamps up to [-10, 10]. Our runner was doing plain F.silu(gate) * up without clamping. Large gate values → unbounded SiLU output → corrupted L2 GEMM input → garbage logits → model outputs BOS/thinking tokens.

vLLM launch config:

--gpu_memory_utilization=0.9
--compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], "cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}'
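Because the compilation config is JSON embedded in a shell-quoted flag, a quick parse before launch can catch quoting or consistency slips. The sketch below only uses the values shown above; the check that the largest capture size does not exceed max_cudagraph_capture_size is an assumption about what a consistent config looks like, not a documented vLLM rule.

```python
import json

# The JSON payload exactly as passed to --compilation-config above.
compilation_config = (
    '{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], '
    '"cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}'
)

cfg = json.loads(compilation_config)  # raises ValueError if the JSON is malformed
sizes = cfg["cudagraph_capture_sizes"]

# Assumed consistency checks (not taken from vLLM documentation).
sizes_sorted = sizes == sorted(sizes)
within_max = max(sizes) <= cfg["max_cudagraph_capture_size"]
print(cfg["cudagraph_mode"], sizes, sizes_sorted, within_max)
```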

Bugs Found & Fixed

Bug 1: Scale Assembly — Global vs Per-Expert Swizzle

Fix: Two-phase scatter + per-expert swizzle.

Bug 2: searchsorted(right=False)

Fix: Changed to right=True.
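The note does not say which tensor searchsorted ran over. Assuming it maps a sorted slot index to its expert through cumulative token counts, the pure-Python sketch below shows why the side matters. bisect_left behaves like torch.searchsorted(..., right=False) and bisect_right like right=True; the counts are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical cumulative end offsets for 4 experts:
# expert 0 owns slots 0-2, expert 1 owns slots 3-4,
# expert 2 owns nothing, expert 3 owns slots 5-8.
cumulative_ends = [3, 5, 5, 9]
slots = list(range(9))

# right=False semantics: a slot equal to a boundary lands in the previous expert.
left = [bisect_left(cumulative_ends, s) for s in slots]
# right=True semantics: boundary slots go to the expert that actually starts there.
right = [bisect_right(cumulative_ends, s) for s in slots]

print(left)   # slot 3 is wrongly assigned to expert 0, slot 5 to expert 1
print(right)  # slot 3 goes to expert 1, slot 5 skips empty expert 2
```

With right=False, every slot that sits exactly on an expert boundary is attributed to the wrong expert, which is the off-by-one this fix removes under the stated assumption.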

Bug 3: CuTeDSL cute.compile GPU Memory Corruption — CRITICAL

Symptom: _token_indices is all zeros after JIT. Root cause: cute.compile corrupts GPU memory. Fix: _fill_token_indices() builds the indices on the CPU and copies them to the GPU; the _needs_token_refill flag marks when a refill is required.

Bug 4: expert_offsets With Leading 0

Fix: Pass expert_offsets[1:] to GEMM.

Bug 5: Checkpoint input_scale Wrong for Runtime gs

Root cause: The checkpoint's input_scale is a calibration value. Used as the runtime gs it is too small, which overflows the block scales. Fix: compute_activation_global_scales() warmup method.

Bug 6: L1/L2 Need Separate gs

Fix: Compute L2 gs from L1 output after SiLU*up.

Bug 7: L1/L2 Need Separate Scale Buffers

Fix: Separate _padded_x_sf_buf_l1/_l2, separate per-expert bufs.

Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT

Symptom: IndexKernel.cu:111 OOB, cascading CUDA_ERROR_ASSERT (710). Root cause: topk_ids contains global IDs (0-255), runner treated as local. Fix: experts_start_idx, remap global→local, mask non-local tokens.
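The remap can be sketched in plain Python as follows. This is an illustration of the index arithmetic only: the function name remap_to_local and the choice to park non-local tokens on index 0 are assumptions, and the runner's actual masking is not shown in this note.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Convert global expert IDs to rank-local IDs plus an ownership mask.

    Hypothetical helper: non-local entries get a placeholder local ID of 0 so
    they can never index out of bounds; callers must ignore them via the mask.
    """
    local_ids, is_local = [], []
    for global_id in topk_ids:
        local_id = global_id - experts_start_idx
        owned = 0 <= local_id < num_local_experts
        is_local.append(owned)
        local_ids.append(local_id if owned else 0)
    return local_ids, is_local


# Example: a rank that owns global experts 32..63.
local_ids, is_local = remap_to_local([5, 32, 63, 64, 255], experts_start_idx=32,
                                     num_local_experts=32)
print(local_ids, is_local)
```

Treating a global ID such as 255 as a local index is exactly what produces the out-of-bounds access in the symptom; after the remap every index stays inside the local range.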

Bug 8b: .cpu() Sync Breaking Cudagraph

Fix: _token_indices on GPU, _fill_token_indices() CPU→GPU copy.

Bugs 9–11: Buffer Sizing and Swizzle Layout

See previous versions for details.

Bug 12: torch.full() During Cudagraph Capture

Symptom: cudaErrorStreamCaptureUnsupported. Fix: Pre-allocated buffers, .fill_() instead of torch.full().

Bug 13: Warmup Passed Global Expert IDs

Fix: Pass local IDs (0..num_experts-1).

Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable

Symptom: BOS token repeat (garbage logits). Root cause: Scale assembly at e*128, GEMM reads by real expert_offsets. Expert with 500 tokens → GEMM reads 500 scale rows but only 128 have data. Fix: Variable padded expert offsets, scatter into real padded positions.

Bug 15: OOM — Per-Layer Padded Buffers (4.3 GB)

Root cause: 72 MB × 60 layers = 4.3 GB. Not enough room for KV cache. Fix: Shared buffers (Bug 21).

Bug 16: padded_max_slots Mismatch

Fix: Size for num_experts * max_chunks * 128.

Bug 17: Shape Mismatch (49152 vs 3072)

Root cause: Capping max_num_tokens at 512 made the buffers too small for the 8192-token warmup. Fix: Reverted the cap; use shared buffers.

Bugs 18–20: Cudagraph Capture Failures (dynamic allocs, variable loops, GPU scalars)

Fix: Pre-allocate everything, fixed loop counts, Python constants for offsets.

Bug 21: OOM — Shared Padded Buffers

Fix: Class-level shared buffers dict keyed by device. padded_hidden, padded_activated, padded_xsf_l1/l2, output all shared. ~150 MB total instead of ~4.3 GB.

Bug 22: Token Dropping via clamped_local

Symptom: Garbage model output (empty/invisible tokens).

Root cause: local_row.clamp(max=max_rows_per_expert-1) silently dropped tokens when an expert got more than max_chunks*128 tokens. max_chunks was computed as an average (ceil(total_slots / (num_experts*128))), not the worst case. MoE routing is uneven: some experts get 200+ tokens while others get 10.

Fix: Use real padded expert offsets (variable per expert, padded to 128). No clamping is needed because each expert gets exactly the space it needs.
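A small pure-Python model of the failure, under made-up token counts: it computes the average-based max_chunks from the formula above, then counts how many tokens per expert lose their own row once local_row is clamped. The helper names are hypothetical.

```python
import math

CHUNK = 128  # rows per chunk, as in the formula above


def average_max_chunks(tokens_per_expert, chunk=CHUNK):
    """Average-based sizing: ceil(total_slots / (num_experts * chunk))."""
    total_slots = sum(tokens_per_expert)
    return math.ceil(total_slots / (len(tokens_per_expert) * chunk))


def tokens_losing_their_row(num_tokens, max_rows):
    """Tokens that collide on the last row after clamp(max=max_rows - 1)."""
    clamped = [min(local_row, max_rows - 1) for local_row in range(num_tokens)]
    return num_tokens - len(set(clamped))


tokens_per_expert = [300, 10, 10, 10]  # illustrative uneven routing
max_chunks = average_max_chunks(tokens_per_expert)
max_rows = max_chunks * CHUNK
lost = [tokens_losing_their_row(n, max_rows) for n in tokens_per_expert]

# The fix: size each expert individually, rounded up to a multiple of CHUNK.
padded_sizes = [math.ceil(n / CHUNK) * CHUNK for n in tokens_per_expert]
print(max_chunks, lost, padded_sizes)
```

In this example the average suggests one chunk per expert, so 172 of the hot expert's 300 tokens are overwritten on row 127, while per-expert padding gives that expert 384 rows and loses nothing.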

Bug 23: cudaErrorStreamCaptureUnsupported from Dynamic GPU Slicing

Symptom: All 8 workers fail during cudagraph capture.

Root cause: buf[:total_padded_slots] where total_padded_slots is a GPU scalar. Dynamic tensor slicing with a GPU index is a CUDA operation not permitted during stream capture.

Fix: Use full pre-allocated buffers, no dynamic GPU slicing. Pass x_sf[:num_slots] (Python int) to scale assembly.

Bug 24: Scale Assembly .cpu().tolist() Breaks Cudagraph

Symptom: cudaErrorStreamCaptureInvalidated during capture.

Root cause: Per-expert Python loops with GPU-derived offsets required .cpu().tolist() for slicing, and the CPU-GPU sync invalidates stream capture.

Fix: Full-buffer Blackwell 32_4_4 swizzle. Apply the to_blocked transform to the entire padded_x_sf buffer at once. No CPU syncs, no Python loops. The buffer is already 128-row aligned per expert and 4-col aligned, so the full-buffer swizzle produces the correct layout. GEMM reads scale_a using padded_expert_offsets, matching the scatter layout.
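Why the full-buffer swizzle matches the per-expert one can be checked with a pure-Python model. The sketch assumes the commonly described 128×4 block-scale tile layout, where element (r, c) of a tile lands at offset (r % 32) * 16 + (r // 32) * 4 + c and tiles are stored row-block major; the runner's own to_blocked is not shown in this note, so treat swizzle_32_4_4 as a hypothetical stand-in.

```python
def swizzle_32_4_4(rows):
    """Swizzle a row-major scale matrix into flat 128x4 tiles (assumed layout)."""
    n_rows, n_cols = len(rows), len(rows[0])
    assert n_rows % 128 == 0 and n_cols % 4 == 0
    col_tiles = n_cols // 4
    out = [None] * (n_rows * n_cols)
    for r in range(n_rows):
        for c in range(n_cols):
            tile = (r // 128) * col_tiles + (c // 4)  # row-block major tile order
            rr = r % 128
            within = (rr % 32) * 16 + (rr // 32) * 4 + (c % 4)
            out[tile * 512 + within] = rows[r][c]  # 128 * 4 = 512 entries per tile
    return out


# Two experts padded to 128 and 256 rows, 8 scale columns, unique values per cell.
n_cols = 8
buffer_rows = [[r * n_cols + c for c in range(n_cols)] for r in range(384)]

full = swizzle_32_4_4(buffer_rows)
per_expert = swizzle_32_4_4(buffer_rows[:128]) + swizzle_32_4_4(buffer_rows[128:])
print(full == per_expert)
```

Because every tile covers exactly 128 rows and each expert region starts on a 128-row boundary, no tile straddles two experts, so one pass over the whole buffer yields the same bytes as looping over experts.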

Bug 25: Missing swiglu_limit=10.0 Activation Clamping — LIKELY CAUSE OF GARBAGE OUTPUT

Symptom: Model generates 30 tokens of empty/invisible content (BOS or thinking token). Not meaningful text. Root cause: DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which:

  • Clamps silu(gate) to max 10.0
  • Clamps up to [-10.0, 10.0]

Our runner did plain F.silu(gate) * up without clamping. Large gate values produce unbounded SiLU output (silu(20) ≈ 20, silu(50) ≈ 50). These large values get multiplied by the up projection, producing activations with amax >> 10. This:

  1. Corrupts the L2 GEMM input (quantized with wrong gs)
  2. Produces garbage L2 output
  3. Final logits are wrong → model collapses to most frequent token (BOS)

Fix: Added set_swiglu_limit(limit) to runner. In run(), apply clamping:

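# Inside run(); F is torch.nn.functional.
gate_silu = F.silu(gate)
if self._swiglu_limit is not None:
    # Match SiluAndMulWithClamp: cap silu(gate) from above, clamp up on both sides.
    gate_silu = gate_silu.clamp(max=self._swiglu_limit)
    up = up.clamp(min=-self._swiglu_limit, max=self._swiglu_limit)
activated = gate_silu * up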

Called from deepseek_v4.py after warmup: self._cutedsl_runner.set_swiglu_limit(float(self.swiglu_limit)).
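The effect of the clamp is easy to see on scalars. The sketch below is a stdlib-only stand-in for the tensor code, with a hypothetical clamped_swiglu helper; it reproduces the silu(20) ≈ 20 and silu(50) ≈ 50 figures quoted above and shows how the limit bounds the product.

```python
import math


def silu(x):
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def clamped_swiglu(gate, up, limit=None):
    """Scalar model of silu(gate) * up with optional SwiGLU clamping."""
    gate_silu = silu(gate)
    if limit is not None:
        gate_silu = min(gate_silu, limit)    # upper clamp only on the gate branch
        up = max(-limit, min(up, limit))     # symmetric clamp on the up branch
    return gate_silu * up


unclamped = clamped_swiglu(50.0, 30.0)            # roughly 50 * 30
clamped = clamped_swiglu(50.0, 30.0, limit=10.0)  # bounded by limit * limit
print(silu(20.0), silu(50.0), unclamped, clamped)
```

With limit=10.0 the activation magnitude can never exceed 100, whereas the unclamped product grows with both inputs.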

Bug 26: Padded Buffer x_sf Mismatch — Experts 1+ Get Zero Output — FIXED

Symptom: Runner produces cosine 0.285 vs BF16. Some tokens get exactly zero output. Expert 0 L1 cosine 0.996, experts 1+ get cosine 0.0.

Root cause: Runner quantized padded_hidden (4096 rows with zero padding) → quantize_activation_nvfp4 returns x_sf with 4096 rows. Then x_sf[:num_slots] (first 48 rows) only covers expert 0's tokens (padded rows 0-127). Expert 1's tokens are at padded row 128, but x_sf[4] corresponds to padded row 64 (still expert 0's padding), not expert 1's data. The scale assembly scattered wrong scales for experts 1+, producing zero GEMM output.

Fix: Quantize slot_hidden (sorted tokens, num_slots rows) instead of padded_hidden. This gives x_sf with num_slots rows (one per token), which the scale assembly correctly scatters into padded layout. Scatter x_fp4 into a new hidden_fp4 padded buffer (uint8→view float4). Same fix for L2 with activated_fp4 buffer.
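The row mismatch can be modeled by tracking which sorted slot each scale row came from. This is a simplified pure-Python sketch with two experts and one scale row per buffer row (the real x_sf layout is not reproduced); None marks rows that hold only zero padding.

```python
ALIGN = 128
tokens_per_expert = [3, 2]           # illustrative counts
num_slots = sum(tokens_per_expert)   # 5 sorted slots

# Padded destination of each sorted slot: expert 0 starts at row 0, expert 1 at row 128.
padded_dst = []
start = 0
for count in tokens_per_expert:
    padded_dst.extend(start + row for row in range(count))
    start += -(-count // ALIGN) * ALIGN  # round this expert up to a multiple of ALIGN
padded_rows = start

# Provenance of every padded row (None means zero padding).
padded_source = [None] * padded_rows
for slot, dst in enumerate(padded_dst):
    padded_source[dst] = slot

# Buggy path: scales computed per padded row, then truncated to num_slots rows.
buggy_scale_rows = padded_source[:num_slots]
# Fixed path: scales computed per sorted slot, one row per token.
fixed_scale_rows = list(range(num_slots))
print(padded_dst, buggy_scale_rows, fixed_scale_rows)
```

In the buggy path, the rows handed to the scale assembly for expert 1's tokens are padding rows from expert 0's region, which is consistent with experts 1+ producing zero output.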


Current Architecture: Variable Padded Expert Offsets

Each expert padded to next multiple of 128 tokens.
padded_expert_offsets computed from real tokens_per_expert (GPU).

Scatter: padded_dst = padded_expert_offsets[expert_assign] + local_row
GEMM input: padded_hidden (full pre-allocated buffer, not sliced)
GEMM offsets: padded_expert_offsets[1:] (GPU tensor)
GEMM output: full buffer size; extract via l1_out[padded_dst]

Scale assembly:
  Phase 1: Scatter x_sf into padded_x_sf at padded_expert_offsets
  Phase 2: Full-buffer Blackwell 32_4_4 swizzle (no CPU syncs)
  Zero CPU syncs, zero Python loops

Activation:
  SiLU(gate) clamped to swiglu_limit (10.0)
  up clamped to [-swiglu_limit, swiglu_limit]
  activated = clamped_silu * clamped_up

Shared buffers (class-level, ~150 MB total):
  padded_hidden, padded_activated, padded_xsf_l1, padded_xsf_l2, output
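The offset and scatter arithmetic from the architecture summary can be sketched on the CPU as below. The real runner keeps these values on the GPU; the function names are hypothetical, and giving an empty expert zero padded rows is an assumption rather than something this note states.

```python
from itertools import accumulate

ALIGN = 128


def padded_expert_offsets(tokens_per_expert, align=ALIGN):
    """Prefix sums of per-expert token counts, each rounded up to a multiple of align."""
    padded = [-(-n // align) * align for n in tokens_per_expert]
    return [0] + list(accumulate(padded))


def scatter_destinations(expert_assign, offsets):
    """padded_dst = padded_expert_offsets[expert_assign] + local_row."""
    next_local_row = [0] * (len(offsets) - 1)
    dst = []
    for expert in expert_assign:
        dst.append(offsets[expert] + next_local_row[expert])
        next_local_row[expert] += 1
    return dst


tokens_per_expert = [500, 10, 0, 130]
offsets = padded_expert_offsets(tokens_per_expert)
gemm_offsets = offsets[1:]  # the GEMM takes end offsets, without the leading 0 (Bug 4)
padded_dst = scatter_destinations([0, 0, 1, 3, 3], offsets)
print(offsets, gemm_offsets, padded_dst)
```

An expert with 500 tokens gets 512 rows here instead of a fixed 128, which is the difference between this layout and the one described in Bug 14.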

Cudagraph Constraints (All Resolved)

  • No .item(), .cpu(), .tolist()
  • No torch.zeros/ones/full/empty/arange() during capture — pre-allocate everything
  • No dynamic GPU slicing (buf[:gpu_scalar]) — use full buffers
  • No Python loops with GPU-derived values — full-buffer ops instead
  • No torch.full() — pre-allocated .fill_()
  • Shared buffers OK (layers sequential during capture and replay)
  • F.silu().clamp() and .clamp() are GPU ops — cudagraph-safe

EP Configuration (DeepSeek-V4-Pro on 8×B200)

  • 256 total experts, top_k=6, swiglu_limit=10.0
  • EP=8 → 48 local experts per rank (n_routed_experts / ep_size = 256/8 = 32, but logs show 48)
  • experts_start_idx = rank × 32
  • max_num_tokens = 8192
  • max_chunks_per_expert = ceil(8192 × 6 / (48 × 128)) = 8
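The arithmetic in this list can be reproduced directly. The sketch uses only the numbers above and keeps the 32 versus 48 local-expert discrepancy visible rather than resolving it; the padded_max_slots line applies the Bug 16 sizing formula.

```python
import math

total_experts, ep_size, top_k = 256, 8, 6
max_num_tokens, chunk = 8192, 128
local_experts_logged = 48  # value reported by the logs

experts_per_rank = total_experts // ep_size          # from n_routed_experts / ep_size
start_indices = [rank * experts_per_rank for rank in range(ep_size)]

total_slots = max_num_tokens * top_k                  # one slot per (token, expert) pair
max_chunks_per_expert = math.ceil(total_slots / (local_experts_logged * chunk))
padded_max_slots = local_experts_logged * max_chunks_per_expert * chunk

print(experts_per_rank, start_indices, total_slots, max_chunks_per_expert, padded_max_slots)
```

Note that 8192 × 6 = 49152 and 512 × 6 = 3072, which line up with the two shapes named in Bug 17.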

Shared Expert Path (verified correct)

DeepseekV4MoE.forward():
  1. gate → fused_topk_bias → topk_weights, topk_ids
  2. self.experts(hidden_states, topk_weights, topk_ids) → routed_output
  3. EP all-reduce across ranks
  4. self.shared_experts(hidden_states) → shared_output
  5. final = routed_output + shared_output
  • Shared experts: DeepseekV4MLP (not NVFP4, uses standard quantization)
  • routed_scaling_factor: Applied in fused_topk_bias to topk_weights
  • renormalize: Top-k weights normalized to sum to 1
  • scoring_func=sqrtsoftplus: Applied in routing

Test Files

| File | Purpose |
| --- | --- |
| tests/layertest.py | Reference vs runner, 3 experts. Must pass ≥0.98 cosine. |
| tests/cudagraph_test.py | Cudagraph capture + replay. Must pass. |
| tests/test_pipeline_real_weights.py | Full runner vs BF16 reference, 48 experts, 8 tokens. Must pass ≥0.98 cosine. |

Run order after any code change:

  1. python3 tests/layertest.py — must pass
  2. python3 tests/cudagraph_test.py — must pass

Repo Info

  • Kernel: sweetapi.com/biondizzle/nvfp4-megamoe-kernel (master)
  • Local: ~/dev/nvfp4-megamoe-kernel/
  • B200: /root/nvfp4-megamoe-kernel/
  • Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 (read-only)
  • Never edit on B200 directly — edit locally → commit → push → pull on B200