# Current Bug: CuTeDSLMoERunner Status & Debug History

## Current Status (May 17, 2026 17:54 UTC)

**Build #11 in progress (includes swiglu_limit fix). Previous builds produced empty/invisible token output.**

- ✅ `layertest.py`: 0.988 cosine
- ✅ `cudagraph_test.py`: capture + replay works
- ✅ vLLM container starts, loads weights, warmup gs computed, cudagraph capture succeeds
- ❌ Model output was empty content (30 invisible tokens); **swiglu_limit fix not yet tested in container**

**Latest fix: Missing swiglu_limit=10.0 activation clamping (Bug 25).** DeepSeek-V4 uses `SiluAndMulWithClamp(10.0)` which clamps `silu(gate)` to max 10.0 and `up` to [-10, 10]. Our runner was doing plain `F.silu(gate) * up` without clamping. Large gate values → unbounded SiLU output → corrupted L2 GEMM input → garbage logits → model outputs BOS/thinking tokens.

**vLLM launch config:**
```
--gpu_memory_utilization=0.9
--compilation-config='{"cudagraph_mode": "FULL_DECODE_ONLY", "custom_ops": ["all"], "cudagraph_capture_sizes": [1, 2, 4, 8], "max_cudagraph_capture_size": 8}'
```

---

## Bugs Found & Fixed

### Bug 1: Scale Assembly: Global vs Per-Expert Swizzle
**Fix:** Two-phase scatter + per-expert swizzle.

### Bug 2: `searchsorted(right=False)`
**Fix:** Changed to `right=True`.

### Bug 3: CuTeDSL `cute.compile` GPU Memory Corruption (CRITICAL)
**Symptom:** `_token_indices` all zeros after JIT.
**Root cause:** `cute.compile` corrupts GPU memory.
**Fix:** `_fill_token_indices()` builds on CPU, copies to GPU. `_needs_token_refill` flag.

### Bug 4: `expert_offsets` With Leading 0
**Fix:** Pass `expert_offsets[1:]` to GEMM.

### Bug 5: Checkpoint `input_scale` Wrong for Runtime gs
**Root cause:** Calibration value, too-small gs → block scale overflow.
**Fix:** `compute_activation_global_scales()` warmup method.

### Bug 6: L1/L2 Need Separate gs
**Fix:** Compute L2 gs from L1 output after SiLU*up.
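Bugs 2 and 4 above both concern how cumulative expert offsets are read. The notes do not show the call site, so the following is only a minimal CPU sketch under the assumption that `searchsorted` maps a slot index in the expert-sorted layout to its expert. It uses the standard-library `bisect` functions as stand-ins for `torch.searchsorted(..., right=False/True)`; the helper names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def expert_offsets(tokens_per_expert):
    """Cumulative offsets with a leading 0: [0, n0, n0+n1, ...]."""
    return [0] + list(accumulate(tokens_per_expert))


def slot_to_expert(offsets_with_zero, slot, right=True):
    """Map a slot index to its expert by searching the end offsets.

    The leading 0 is dropped first (the same slice Bug 4 passes to the GEMM).
    right=True mirrors searchsorted(right=True); right=False mirrors the
    original, buggy setting.
    """
    ends = offsets_with_zero[1:]
    search = bisect_right if right else bisect_left
    return search(ends, slot)


tokens_per_expert = [3, 0, 2]                 # expert 1 receives no tokens
offsets = expert_offsets(tokens_per_expert)   # [0, 3, 3, 5]

right_true = [slot_to_expert(offsets, s, right=True) for s in range(5)]
right_false = [slot_to_expert(offsets, s, right=False) for s in range(5)]

print(right_true)   # [0, 0, 0, 2, 2]
print(right_false)  # [0, 0, 0, 0, 2]  (slot 3 wrongly lands in expert 0)
```

With `right=False`, a slot index equal to an expert's end offset is attributed to that expert instead of the next non-empty one, which is the off-by-one this sketch is meant to illustrate.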
### Bug 7: L1/L2 Need Separate Scale Buffers
**Fix:** Separate `_padded_x_sf_buf_l1`/`_l2`, separate per-expert bufs.

### Bug 8: Global→Local Expert ID Mismatch (CUDA_ERROR_ASSERT)
**Symptom:** `IndexKernel.cu:111` OOB, cascading CUDA_ERROR_ASSERT (710).
**Root cause:** `topk_ids` contains global IDs (0-255), runner treated as local.
**Fix:** `experts_start_idx`, remap global→local, mask non-local tokens.

### Bug 8b: `.cpu()` Sync Breaking Cudagraph
**Fix:** `_token_indices` on GPU, `_fill_token_indices()` CPU→GPU copy.

### Bug 9–11: Buffer sizing and swizzle layout
See previous versions for details.

### Bug 12: `torch.full()` During Cudagraph Capture
**Symptom:** `cudaErrorStreamCaptureUnsupported`.
**Fix:** Pre-allocated buffers, `.fill_()` instead of `torch.full()`.

### Bug 13: Warmup Passed Global Expert IDs
**Fix:** Pass local IDs (0..num_experts-1).

### Bug 14: GEMM Scale Layout Mismatch: Fixed 128-Row vs Variable
**Symptom:** BOS token repeat (garbage logits).
**Root cause:** Scale assembly at `e*128`, GEMM reads by real expert_offsets. Expert with 500 tokens → GEMM reads 500 scale rows but only 128 have data.
**Fix:** Variable padded expert offsets, scatter into real padded positions.

### Bug 15: OOM: Per-Layer Padded Buffers (4.3 GB)
**Root cause:** 72 MB × 60 layers = 4.3 GB. Not enough room for KV cache.
**Fix:** Shared buffers (Bug 21).

### Bug 16: `padded_max_slots` Mismatch
**Fix:** Size for `num_experts * max_chunks * 128`.

### Bug 17: Shape Mismatch (49152 vs 3072)
**Root cause:** Cap `max_num_tokens` to 512 made buffers too small for 8192-token warmup.
**Fix:** Reverted cap, use shared buffers.

### Bug 18–20: Cudagraph Capture Failures (dynamic allocs, variable loops, GPU scalars)
**Fix:** Pre-allocate everything, fixed loop counts, Python constants for offsets.

### Bug 21: OOM: Shared Padded Buffers
**Fix:** Class-level shared buffers dict keyed by device. `padded_hidden`, `padded_activated`, `padded_xsf_l1`/`l2`, `output` all shared. ~150 MB total instead of ~4.3 GB.

### Bug 22: Token Dropping via `clamped_local`
**Symptom:** Garbage model output (empty/invisible tokens).
**Root cause:** `local_row.clamp(max=max_rows_per_expert-1)` silently dropped tokens when an expert got more than `max_chunks*128` tokens. `max_chunks` was computed as average (ceil(total_slots / (num_experts*128))), not worst-case. MoE routing is uneven: some experts get 200+ tokens while others get 10.
**Fix:** Use real padded expert offsets (variable per expert, padded to 128). No clamping needed: each expert gets exactly the space it needs.

### Bug 23: cudaErrorStreamCaptureUnsupported from Dynamic GPU Slicing
**Symptom:** All 8 workers fail during cudagraph capture.
**Root cause:** `buf[:total_padded_slots]` where `total_padded_slots` is a GPU scalar. Dynamic tensor slicing with a GPU index is a CUDA operation not permitted during stream capture.
**Fix:** Use full pre-allocated buffers, no dynamic GPU slicing. Pass `x_sf[:num_slots]` (Python int) to scale assembly.

### Bug 24: Scale Assembly `.cpu().tolist()` Breaks Cudagraph
**Symptom:** `cudaErrorStreamCaptureInvalidated` during capture.
**Root cause:** Per-expert Python loops with GPU-derived offsets required `.cpu().tolist()` for slicing. CPU-GPU sync invalidates stream capture.
**Fix:** Full-buffer Blackwell 32_4_4 swizzle. Apply `to_blocked` transform to entire `padded_x_sf` buffer at once. No CPU syncs, no Python loops. The buffer is already 128-row aligned per expert and 4-col aligned, so the full-buffer swizzle produces the correct layout. GEMM reads `scale_a` using `padded_expert_offsets`, matching the scatter layout.

### Bug 25: Missing `swiglu_limit=10.0` Activation Clamping (LIKELY CAUSE OF GARBAGE OUTPUT)
**Symptom:** Model generates 30 tokens of empty/invisible content (BOS or thinking token). Not meaningful text.
**Root cause:** DeepSeek-V4 uses `SiluAndMulWithClamp(10.0)` which:
- Clamps `silu(gate)` to max 10.0
- Clamps `up` to [-10.0, 10.0]

Our runner did plain `F.silu(gate) * up` without clamping. Large gate values produce unbounded SiLU output (silu(20) ≈ 20, silu(50) ≈ 50). These large values get multiplied by the up projection, producing activations with amax >> 10. This:
1. Corrupts the L2 GEMM input (quantized with wrong gs)
2. Produces garbage L2 output
3. Makes the final logits wrong → model collapses to most frequent token (BOS)

**Fix:** Added `set_swiglu_limit(limit)` to runner. In `run()`, apply clamping:
```python
# Excerpt from run(); F is assumed to be torch.nn.functional.
gate_silu = F.silu(gate)
if self._swiglu_limit is not None:
    # Cap silu(gate) from above only; clamp up symmetrically.
    gate_silu = gate_silu.clamp(max=self._swiglu_limit)
    up = up.clamp(min=-self._swiglu_limit, max=self._swiglu_limit)
activated = gate_silu * up
```
Called from `deepseek_v4.py` after warmup: `self._cutedsl_runner.set_swiglu_limit(float(self.swiglu_limit))`.

---

## Current Architecture: Variable Padded Expert Offsets

```
Each expert padded to next multiple of 128 tokens.
padded_expert_offsets computed from real tokens_per_expert (GPU).

Scatter: padded_dst = padded_expert_offsets[expert_assign] + local_row
GEMM input: padded_hidden (full pre-allocated buffer, not sliced)
GEMM offsets: padded_expert_offsets[1:] (GPU tensor)
GEMM output: full buffer size; extract via l1_out[padded_dst]

Scale assembly:
  Phase 1: Scatter x_sf into padded_x_sf at padded_expert_offsets
  Phase 2: Full-buffer Blackwell 32_4_4 swizzle (no CPU syncs)
  Zero CPU syncs, zero Python loops

Activation:
  SiLU(gate) clamped to swiglu_limit (10.0)
  up clamped to [-swiglu_limit, swiglu_limit]
  activated = clamped_silu * clamped_up

Shared buffers (class-level, ~150 MB total):
  padded_hidden, padded_activated, padded_xsf_l1, padded_xsf_l2, output
```

### Cudagraph Constraints (All Resolved)
- No `.item()`, `.cpu()`, `.tolist()`
- No `torch.zeros/ones/full/empty/arange()` during capture: pre-allocate everything
- No dynamic GPU slicing (`buf[:gpu_scalar]`): use full buffers
- No Python loops with GPU-derived values: full-buffer ops instead
- No `torch.full()`: use pre-allocated buffers with `.fill_()`
- Shared buffers OK (layers run sequentially during capture and replay)
- `F.silu().clamp()` and `.clamp()` are GPU ops, so they are cudagraph-safe ✅

### EP Configuration (DeepSeek-V4-Pro on 8×B200)
- 256 total experts, top_k=6, swiglu_limit=10.0
- EP=8 → 48 local experts per rank according to the logs (n_routed_experts / ep_size = 256/8 = 32 would be expected)
- `experts_start_idx` = rank × 32
- `max_num_tokens` = 8192
- `max_chunks_per_expert` = ceil(8192 × 6 / (48 × 128)) = 8

---

## Shared Expert Path (verified correct)

```
DeepseekV4MoE.forward():
  1. gate → fused_topk_bias → topk_weights, topk_ids
  2. self.experts(hidden_states, topk_weights, topk_ids) → routed_output
  3. EP all-reduce across ranks
  4. self.shared_experts(hidden_states) → shared_output
  5. final = routed_output + shared_output
```

- Shared experts: `DeepseekV4MLP` (not NVFP4, uses standard quantization)
- `routed_scaling_factor`: Applied in `fused_topk_bias` to topk_weights ✅
- `renormalize`: Top-k weights normalized to sum to 1 ✅
- `scoring_func=sqrtsoftplus`: Applied in routing ✅

---

## Test Files

| File | Purpose |
|------|---------|
| `tests/layertest.py` | Reference vs runner, 3 experts. Must pass ≥0.98 cosine. |
| `tests/cudagraph_test.py` | Cudagraph capture + replay. Must pass. |

**Run order after any code change:**
1. `python3 tests/layertest.py` (must pass)
2. `python3 tests/cudagraph_test.py` (must pass)

---

## Repo Info

- **Kernel:** `sweetapi.com/biondizzle/nvfp4-megamoe-kernel` (master)
- **Local:** `~/dev/nvfp4-megamoe-kernel/`
- **B200:** `/root/nvfp4-megamoe-kernel/`
- **Model:** `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4` (read-only)
- **Never edit on B200 directly**: edit locally → commit → push → pull on B200
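
---

## Sketch: Variable Padded Expert Offsets (CPU illustration)

The layout rule from the architecture section (each expert padded to the next multiple of 128, `padded_dst = padded_expert_offsets[expert_assign] + local_row`) can be checked by hand with a small CPU sketch. This is not the runner's code: the runner computes these values on the GPU without Python loops, and the function names below are hypothetical. It also assumes that an expert with zero tokens occupies zero padded rows, which the notes do not state.

```python
from itertools import accumulate

ALIGN = 128  # per-expert row alignment stated in the architecture notes


def padded_expert_offsets(tokens_per_expert, align=ALIGN):
    """Round each expert's token count up to a multiple of `align`,
    then return cumulative offsets with a leading 0."""
    padded = [-(-n // align) * align for n in tokens_per_expert]
    return [0] + list(accumulate(padded))


def scatter_destinations(expert_assign, offsets):
    """Compute padded_dst = offsets[expert] + local_row for each slot,
    where local_row counts earlier slots routed to the same expert."""
    next_row = [0] * (len(offsets) - 1)
    dst = []
    for e in expert_assign:
        dst.append(offsets[e] + next_row[e])
        next_row[e] += 1
    return dst


# Uneven routing, as in Bugs 14 and 22: one expert gets 500 tokens, one gets 10.
tokens_per_expert = [500, 10, 0, 129]
offsets = padded_expert_offsets(tokens_per_expert)
print(offsets)  # [0, 512, 640, 640, 896]

# Five example slots and the experts they are routed to.
expert_assign = [1, 0, 1, 3, 0]
dst = scatter_destinations(expert_assign, offsets)
print(dst)  # [512, 0, 513, 640, 1]
```

Because every expert's region starts on a 128-row boundary and is sized from its real token count, no row index needs clamping, which is the property the Bug 22 fix relies on.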