Current Bug: CuTeDSLMoERunner — Status & Debug History
Current Status (May 17, 2026 21:30 UTC)
Bug 26 FIXED. All tests pass.
- ✅ layertest.py: 0.988 cosine
- ✅ cudagraph_test.py: capture + replay works
- ✅ test_pipeline_real_weights.py: 0.988 cosine (8 tokens, 48 experts)
- ⏳ vLLM container: needs rebuild + test with Bug 26 fix
Latest fix: Missing swiglu_limit=10.0 activation clamping (Bug 25). DeepSeek-V4 uses SiluAndMulWithClamp(10.0), which clamps silu(gate) to at most 10.0 and clamps up to the range [-10, 10]. Our runner computed plain F.silu(gate) * up without clamping. Large gate values → unbounded SiLU output → corrupted L2 GEMM input → garbage logits → model outputs BOS/thinking tokens.
vLLM launch config:
--gpu_memory_utilization=0.9
--compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], "cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}'
Bugs Found & Fixed
Bug 1: Scale Assembly — Global vs Per-Expert Swizzle
Fix: Two-phase scatter + per-expert swizzle.
Bug 2: searchsorted(right=False)
Fix: Changed to right=True.
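The note above does not say which lookup used searchsorted. Assuming it maps a sorted slot index to its expert through cumulative per-expert end offsets, the two modes differ exactly at expert boundaries and for empty experts. A minimal sketch with stdlib bisect as a stand-in (bisect_left behaves like right=False, bisect_right like right=True); the routing counts are hypothetical:

```python
from bisect import bisect_left, bisect_right

# Hypothetical routing: expert 0 has 3 slots, expert 1 has 0, expert 2 has 2.
tokens_per_expert = [3, 0, 2]

# Cumulative end offsets (exclusive): expert e owns slots [ends[e-1], ends[e]).
ends = []
total = 0
for n in tokens_per_expert:
    total += n
    ends.append(total)

def expert_of_slot(slot, right):
    """Stand-in for torch.searchsorted(ends, slot, right=right)."""
    return bisect_right(ends, slot) if right else bisect_left(ends, slot)

left = [expert_of_slot(s, right=False) for s in range(total)]
right = [expert_of_slot(s, right=True) for s in range(total)]

print("right=False:", left)   # slot 3 is assigned to expert 0, which is full
print("right=True: ", right)  # slot 3 skips the empty expert 1 and lands in expert 2
```

With right=False, a slot index equal to an end offset is assigned to the expert that just ended; right=True moves it to the next non-empty expert.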
Bug 3: CuTeDSL cute.compile GPU Memory Corruption — CRITICAL
Symptom: _token_indices all zeros after JIT.
Root cause: cute.compile corrupts GPU memory.
Fix: _fill_token_indices() builds on CPU, copies to GPU. _needs_token_refill flag.
Bug 4: expert_offsets With Leading 0
Fix: Pass expert_offsets[1:] to GEMM.
Bug 5: Checkpoint input_scale Wrong for Runtime gs
Root cause: Calibration value, too-small gs → block scale overflow.
Fix: compute_activation_global_scales() warmup method.
Bug 6: L1/L2 Need Separate gs
Fix: Compute L2 gs from L1 output after SiLU*up.
Bug 7: L1/L2 Need Separate Scale Buffers
Fix: Separate _padded_x_sf_buf_l1/_l2, separate per-expert bufs.
Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
Symptom: IndexKernel.cu:111 OOB, cascading CUDA_ERROR_ASSERT (710).
Root cause: topk_ids contains global IDs (0-255), runner treated as local.
Fix: experts_start_idx, remap global→local, mask non-local tokens.
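The remap can be sketched on plain Python lists as below. This is not the runner's code: the function name and the choice of local ID 0 as a placeholder for non-local tokens are assumptions for illustration.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Return (local_ids, is_local) for a flat list of global expert IDs.

    Non-local entries get local ID 0 as a placeholder and is_local=False,
    so downstream indexing never goes out of bounds.
    """
    local_ids, is_local = [], []
    for gid in topk_ids:
        lid = gid - experts_start_idx
        ok = 0 <= lid < num_local_experts
        local_ids.append(lid if ok else 0)
        is_local.append(ok)
    return local_ids, is_local

# Hypothetical rank that owns global experts 32..63.
local_ids, is_local = remap_to_local([5, 32, 40, 63, 64, 255], 32, 32)
print(local_ids)
print(is_local)
```

In a tensor version the mask would presumably also be used to zero the contribution of the placeholder entries, since they index a valid but unrelated expert.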
Bug 8b: .cpu() Sync Breaking Cudagraph
Fix: _token_indices on GPU, _fill_token_indices() CPU→GPU copy.
Bug 9–11: Buffer sizing and swizzle layout
See previous versions for details.
Bug 12: torch.full() During Cudagraph Capture
Symptom: cudaErrorStreamCaptureUnsupported.
Fix: Pre-allocated buffers, .fill_() instead of torch.full().
Bug 13: Warmup Passed Global Expert IDs
Fix: Pass local IDs (0..num_experts-1).
Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable
Symptom: BOS token repeat (garbage logits).
Root cause: Scale assembly at e*128, GEMM reads by real expert_offsets. Expert with 500 tokens → GEMM reads 500 scale rows but only 128 have data.
Fix: Variable padded expert offsets, scatter into real padded positions.
Bug 15: OOM — Per-Layer Padded Buffers (4.3 GB)
Root cause: 72 MB × 60 layers = 4.3 GB. Not enough room for KV cache.
Fix: Shared buffers (Bug 21).
Bug 16: padded_max_slots Mismatch
Fix: Size for num_experts * max_chunks * 128.
Bug 17: Shape Mismatch (49152 vs 3072)
Root cause: Cap max_num_tokens to 512 made buffers too small for 8192-token warmup.
Fix: Reverted cap, use shared buffers.
Bug 18–20: Cudagraph Capture Failures (dynamic allocs, variable loops, GPU scalars)
Fix: Pre-allocate everything, fixed loop counts, Python constants for offsets.
Bug 21: OOM — Shared Padded Buffers
Fix: Class-level shared buffers dict keyed by device. padded_hidden, padded_activated, padded_xsf_l1/l2, output all shared. ~150 MB total instead of ~4.3 GB.
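A rough sketch of the class-level cache pattern, with bytearray standing in for pre-allocated GPU tensors. The class name, constructor arguments and buffer size are hypothetical; only the buffer names and the "dict keyed by device" idea come from the note above.

```python
class SharedBufferRunner:
    # Class-level cache: one buffer set per device, shared by every layer.
    _shared = {}

    BUFFER_NAMES = ("padded_hidden", "padded_activated",
                    "padded_xsf_l1", "padded_xsf_l2", "output")

    def __init__(self, device, nbytes_per_buffer):
        bufs = SharedBufferRunner._shared.get(device)
        if bufs is None:
            # First runner on this device allocates; later ones reuse.
            bufs = {name: bytearray(nbytes_per_buffer)
                    for name in self.BUFFER_NAMES}
            SharedBufferRunner._shared[device] = bufs
        self.bufs = bufs

# 60 layers on one device share a single buffer set; a second device gets its own.
layers = [SharedBufferRunner("cuda:0", 1024) for _ in range(60)]
other = SharedBufferRunner("cuda:1", 1024)
print(len(SharedBufferRunner._shared))
```

The sketch ignores the case where a later runner needs a larger buffer than the first one allocated; a real implementation would have to size for the worst case up front or grow the shared set.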
Bug 22: Token Dropping via clamped_local
Symptom: Garbage model output (empty/invisible tokens).
Root cause: local_row.clamp(max=max_rows_per_expert-1) silently dropped tokens when an expert got more than max_chunks*128 tokens. max_chunks was computed as average (ceil(total_slots / (num_experts*128))), not worst-case. MoE routing is uneven — some experts get 200+ tokens while others get 10.
Fix: Use real padded expert offsets (variable per expert, padded to 128). No clamping needed — each expert gets exactly the space it needs.
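To make the failure concrete, the sketch below replays the average-based capacity formula on hypothetical numbers (600 slots, 48 experts, one hot expert with 200 tokens) and counts how many rows collapse onto the last allowed row under the old clamp.

```python
import math

def average_max_rows(total_slots, num_experts, chunk=128):
    """Capacity per expert when max_chunks is derived from the average load."""
    max_chunks = math.ceil(total_slots / (num_experts * chunk))
    return max_chunks * chunk

# Hypothetical uneven routing: 600 slots over 48 experts, one hot expert with 200.
max_rows = average_max_rows(total_slots=600, num_experts=48)
hot_expert_tokens = 200

# Old behaviour: local_row.clamp(max=max_rows - 1)
clamped_rows = [min(r, max_rows - 1) for r in range(hot_expert_tokens)]
dropped = hot_expert_tokens - len(set(clamped_rows))

# Fix: pad each expert to the next multiple of 128 based on its real count.
padded_rows = math.ceil(hot_expert_tokens / 128) * 128

print(max_rows, dropped, padded_rows)
```

Every row past max_rows - 1 is written to the same destination, so all but one of those tokens are overwritten.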
Bug 23: cudaErrorStreamCaptureUnsupported from Dynamic GPU Slicing
Symptom: All 8 workers fail during cudagraph capture.
Root cause: buf[:total_padded_slots] where total_padded_slots is a GPU scalar — dynamic tensor slicing with a GPU index is a CUDA operation not permitted during stream capture.
Fix: Use full pre-allocated buffers, no dynamic GPU slicing. Pass x_sf[:num_slots] (Python int) to scale assembly.
Bug 24: Scale Assembly .cpu().tolist() Breaks Cudagraph
Symptom: cudaErrorStreamCaptureInvalidated during capture.
Root cause: Per-expert Python loops with GPU-derived offsets required .cpu().tolist() for slicing — CPU-GPU sync invalidates stream capture.
Fix: Full-buffer Blackwell 32_4_4 swizzle. Apply to_blocked transform to entire padded_x_sf buffer at once. No CPU syncs, no Python loops. The buffer is already 128-row aligned per expert and 4-col aligned, so the full-buffer swizzle produces the correct layout. GEMM reads scale_a using padded_expert_offsets, matching the scatter layout.
Bug 25: Missing swiglu_limit=10.0 Activation Clamping — LIKELY CAUSE OF GARBAGE OUTPUT
Symptom: Model generates 30 tokens of empty/invisible content (BOS or thinking token). Not meaningful text.
Root cause: DeepSeek-V4 uses SiluAndMulWithClamp(10.0) which:
- Clamps silu(gate) to max 10.0
- Clamps up to [-10.0, 10.0]
Our runner did plain F.silu(gate) * up without clamping. Large gate values produce unbounded SiLU output (silu(20) ≈ 20, silu(50) ≈ 50). These large values get multiplied by the up projection, producing activations with amax >> 10. This:
- Corrupts the L2 GEMM input (quantized with wrong gs)
- Produces garbage L2 output
- Final logits are wrong → model collapses to most frequent token (BOS)
Fix: Added set_swiglu_limit(limit) to runner. In run(), apply clamping:
    gate_silu = F.silu(gate)
    if self._swiglu_limit is not None:
        gate_silu = gate_silu.clamp(max=self._swiglu_limit)
        up = up.clamp(min=-self._swiglu_limit, max=self._swiglu_limit)
    activated = gate_silu * up
Called from deepseek_v4.py after warmup: self._cutedsl_runner.set_swiglu_limit(float(self.swiglu_limit)).
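A scalar, torch-free sketch of the same clamping logic, useful for sanity-checking the magnitudes quoted above. The helper names are illustrative and not part of the runner.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu(gate, up, limit=None):
    """Scalar SiLU(gate) * up with optional clamping, mirroring the fix above."""
    g = silu(gate)
    if limit is not None:
        g = min(g, limit)                 # clamp silu(gate) to at most limit
        up = max(-limit, min(up, limit))  # clamp up to [-limit, limit]
    return g * up

unclamped = swiglu(20.0, 50.0)             # about 20 * 50
clamped = swiglu(20.0, 50.0, limit=10.0)   # 10 * 10

print(unclamped, clamped)
```

For inputs already inside the limit, the clamped and unclamped paths give the same result, so the fix only changes behaviour for outlier activations.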
Bug 26: Padded Buffer x_sf Mismatch — Experts 1+ Get Zero Output — FIXED
Symptom: Runner produces cosine 0.285 vs BF16. Some tokens get exactly zero output. Expert 0 L1 cosine 0.996, experts 1+ get cosine 0.0.
Root cause: Runner quantized padded_hidden (4096 rows with zero padding) → quantize_activation_nvfp4 returns x_sf with 4096 rows. Then x_sf[:num_slots] (first 48 rows) only covers expert 0's tokens (padded rows 0-127). Expert 1's tokens are at padded row 128, but x_sf[4] corresponds to padded row 64 (still expert 0's padding), not expert 1's data. The scale assembly scattered wrong scales for experts 1+, producing zero GEMM output.
Fix: Quantize slot_hidden (sorted tokens, num_slots rows) instead of padded_hidden. This gives x_sf with num_slots rows (one per token), which the scale assembly correctly scatters into padded layout. Scatter x_fp4 into a new hidden_fp4 padded buffer (uint8→view float4). Same fix for L2 with activated_fp4 buffer.
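The row-coverage problem can be illustrated with index arithmetic alone. The sketch below assumes, as a simplification, one scale row per activation row, and uses a hypothetical split of the 48 slots over 3 experts; it only shows which padded rows hold real tokens and how few of them fall inside the first num_slots rows.

```python
def padded_destinations(tokens_per_expert, chunk=128):
    """Padded row index of every sorted slot, expert by expert."""
    dst, offset = [], 0
    for n in tokens_per_expert:
        dst.extend(offset + r for r in range(n))
        offset += -(-n // chunk) * chunk  # round up to a multiple of chunk
    return dst

# Hypothetical split of 48 slots (8 tokens x top_k 6) over 3 local experts.
tokens_per_expert = [20, 16, 12]
num_slots = sum(tokens_per_expert)
dst = padded_destinations(tokens_per_expert)

# Old path: scales taken from the padded buffer, truncated to num_slots rows.
covered_old = [d for d in dst if d < num_slots]
# New path: quantize the num_slots sorted rows directly, one scale row per slot.
covered_new = list(range(num_slots))

print(len(covered_old), "of", num_slots, "slots covered by the old slice")
print(len(covered_new), "of", num_slots, "slots covered after the fix")
```

Only expert 0's tokens sit below row num_slots in the padded layout, which matches the symptom of experts 1+ producing zero output.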
Current Architecture: Variable Padded Expert Offsets
Each expert padded to next multiple of 128 tokens.
padded_expert_offsets computed from real tokens_per_expert (GPU).
Scatter: padded_dst = padded_expert_offsets[expert_assign] + local_row
GEMM input: padded_hidden (full pre-allocated buffer, not sliced)
GEMM offsets: padded_expert_offsets[1:] (GPU tensor)
GEMM output: full buffer size; extract via l1_out[padded_dst]
Scale assembly:
Phase 1: Scatter x_sf into padded_x_sf at padded_expert_offsets
Phase 2: Full-buffer Blackwell 32_4_4 swizzle (no CPU syncs)
Zero CPU syncs, zero Python loops
Activation:
SiLU(gate) clamped to swiglu_limit (10.0)
up clamped to [-swiglu_limit, swiglu_limit]
activated = clamped_silu * clamped_up
Shared buffers (class-level, ~150 MB total):
padded_hidden, padded_activated, padded_xsf_l1, padded_xsf_l2, output
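The offset and scatter arithmetic above can be sketched in pure Python as follows. The function name is hypothetical, the real runner does this with GPU tensor ops, and treating an expert with zero tokens as occupying zero padded rows is an assumption.

```python
def build_padded_layout(expert_assign, num_experts, chunk=128):
    """expert_assign: local expert ID for each slot."""
    tokens_per_expert = [0] * num_experts
    for e in expert_assign:
        tokens_per_expert[e] += 1

    # Leading 0, then the padded end offset of each expert.
    padded_expert_offsets = [0]
    for n in tokens_per_expert:
        padded = -(-n // chunk) * chunk  # next multiple of chunk
        padded_expert_offsets.append(padded_expert_offsets[-1] + padded)

    # padded_dst = padded_expert_offsets[expert_assign] + local_row
    seen = [0] * num_experts
    padded_dst = []
    for e in expert_assign:
        padded_dst.append(padded_expert_offsets[e] + seen[e])
        seen[e] += 1
    return padded_expert_offsets, padded_dst

offsets, padded_dst = build_padded_layout([0, 0, 1, 2, 2, 2], num_experts=3)
gemm_offsets = offsets[1:]  # the GEMM takes end offsets without the leading 0

print(offsets, gemm_offsets, padded_dst)
```

The same padded_dst indices are then reused to gather results out of the full-size GEMM output, as in l1_out[padded_dst].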
Cudagraph Constraints (All Resolved)
- No .item(), .cpu(), .tolist()
- No torch.zeros/ones/full/empty/arange() during capture: pre-allocate everything
- No dynamic GPU slicing (buf[:gpu_scalar]): use full buffers
- No Python loops with GPU-derived values: use full-buffer ops instead
- No torch.full(): use pre-allocated buffers with .fill_()
- Shared buffers OK (layers run sequentially during capture and replay)
- F.silu().clamp() and .clamp() are GPU ops: cudagraph-safe ✅
EP Configuration (DeepSeek-V4-Pro on 8×B200)
- 256 total experts, top_k=6, swiglu_limit=10.0
- EP=8 → 48 local experts per rank (n_routed_experts / ep_size = 256/8 = 32, but logs show 48)
- experts_start_idx = rank × 32
- max_num_tokens = 8192
- max_chunks_per_expert = ceil(8192 × 6 / (48 × 128)) = 8
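The sizing arithmetic can be checked quickly. The sketch takes both the 48 local experts from the logs and the stride of 32 for experts_start_idx exactly as stated above, and does not try to reconcile them.

```python
import math

max_num_tokens = 8192
top_k = 6
local_experts = 48   # value reported in the logs
chunk = 128

total_slots = max_num_tokens * top_k
max_chunks_per_expert = math.ceil(total_slots / (local_experts * chunk))

# experts_start_idx per rank, using the stride of 32 quoted above.
experts_start_idx = [rank * 32 for rank in range(8)]

print(total_slots, max_chunks_per_expert, experts_start_idx)
```

The total of 49152 slots is the same figure that appears in the Bug 17 shape mismatch.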
Shared Expert Path (verified correct)
DeepseekV4MoE.forward():
1. gate → fused_topk_bias → topk_weights, topk_ids
2. self.experts(hidden_states, topk_weights, topk_ids) → routed_output
3. EP all-reduce across ranks
4. self.shared_experts(hidden_states) → shared_output
5. final = routed_output + shared_output
- Shared experts: DeepseekV4MLP (not NVFP4, uses standard quantization)
- routed_scaling_factor: Applied in fused_topk_bias to topk_weights ✅
- renormalize: Top-k weights normalized to sum to 1 ✅
- scoring_func=sqrtsoftplus: Applied in routing ✅
Test Files
| File | Purpose |
|---|---|
| tests/layertest.py | Reference vs runner, 3 experts. Must pass ≥0.98 cosine. |
| tests/cudagraph_test.py | Cudagraph capture + replay. Must pass. |
| tests/test_pipeline_real_weights.py | Full runner vs BF16 reference, 48 experts, 8 tokens. Must pass ≥0.98 cosine. |
Run order after any code change:
1. python3 tests/layertest.py: must pass
2. python3 tests/cudagraph_test.py: must pass
Repo Info
- Kernel: sweetapi.com/biondizzle/nvfp4-megamoe-kernel (master)
- Local: ~/dev/nvfp4-megamoe-kernel/
- B200: /root/nvfp4-megamoe-kernel/
- Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 (read-only)
- Never edit on B200 directly: edit locally → commit → push → pull on B200