Current Bug: CuTeDSLMoERunner — Status & Debug History
Current Status (May 17, 2026 15:45 UTC)
vLLM container crashes during cudagraph warmup with shape mismatch. Debug build in progress.
- ✅ layertest.py: 0.988 cosine
- ✅ cudagraph_test.py: capture + replay works
- ✅ Container builds, loads weights, warmup gs computed (no L2 gs=0)
- ❌ Container crashes during cudagraph warmup: shape mismatch [49152, 7168] vs [3072, 7168]
Active investigation: The GEMM output has 49152 rows (48 experts × 8 chunks × 128) but padded_dst only indexes 3072 rows. This means max_chunks_per_expert = 8 instead of the expected 1 (capped at 512 tokens). Likely the max_num_tokens cap to 512 isn't reaching the runner. Debug print added to verify.
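The row counts in this investigation can be reproduced with a little arithmetic. The sketch below assumes that max_chunks_per_expert comes from spreading max_num_tokens * top_k slots evenly over the local experts and rounding up to 128-row chunks. That formula is inferred from the reported numbers, not taken from the runner, and both function names are hypothetical.

```python
import math

BLOCK_ROWS = 128  # rows per chunk in the fixed layout


def max_chunks_per_expert(max_num_tokens: int, top_k: int, num_experts: int) -> int:
    """Chunks per expert, ASSUMING slots are budgeted evenly across experts."""
    slots_per_expert = math.ceil(max_num_tokens * top_k / num_experts)
    return max(1, math.ceil(slots_per_expert / BLOCK_ROWS))


def padded_rows(max_num_tokens: int, top_k: int, num_experts: int) -> int:
    """Total GEMM rows in the fixed layout: num_experts * max_chunks * 128."""
    chunks = max_chunks_per_expert(max_num_tokens, top_k, num_experts)
    return num_experts * chunks * BLOCK_ROWS


# Uncapped budget: 8192 tokens, top_k=6, 48 local experts.
print(max_chunks_per_expert(8192, 6, 48), padded_rows(8192, 6, 48))  # 8 49152
# Capped budget: 512 tokens.
print(max_chunks_per_expert(512, 6, 48), padded_rows(512, 6, 48))    # 1 6144
```

Under this assumption the 49152-row GEMM output matches an uncapped 8192-token budget, while 3072 equals 512 * 6, so the indexing side appears to use the capped budget and the GEMM side does not.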
Bugs Found & Fixed
Bug 1: Scale Assembly — Global Swizzle vs Per-Expert Swizzle
Symptom: GEMM produced all zeros even with correct global_scale.
Root cause: _assemble_scales_cudagraph_safe called pad_and_swizzle_single() on the ENTIRE padded buffer. The kernel expects each expert's 128-row block swizzled independently.
Fix: Two-phase approach: scatter into 128-aligned positions, then per-expert swizzle and concatenate.
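A toy illustration of why the two approaches differ. The real swizzle is not shown in these notes, so toy_swizzle below is a placeholder permutation (a plain reversal) and the block size is shrunk from 128 to 4. Only the structure is meant to carry over: scatter into block-aligned slots, then swizzle each expert's block on its own.

```python
BLOCK_ROWS = 4  # toy stand-in for the 128-row block


def toy_swizzle(rows):
    """Placeholder permutation (reversal); NOT the real pad_and_swizzle_single()."""
    return rows[::-1]


def swizzle_global(buf):
    # Buggy pattern: one swizzle over the whole padded buffer.
    return toy_swizzle(buf)


def swizzle_per_expert(buf, num_experts):
    # Fixed pattern: swizzle each expert's block independently, then concatenate.
    out = []
    for e in range(num_experts):
        block = buf[e * BLOCK_ROWS:(e + 1) * BLOCK_ROWS]
        out.extend(toy_swizzle(block))
    return out


# Phase 1: scatter each expert's rows into block-aligned positions (0 = padding).
num_experts = 2
rows_by_expert = {0: ["a0", "a1"], 1: ["b0", "b1", "b2"]}
buf = [0] * (num_experts * BLOCK_ROWS)
for e, rows in rows_by_expert.items():
    for i, row in enumerate(rows):
        buf[e * BLOCK_ROWS + i] = row

# Phase 2: compare the two swizzle strategies.
global_out = swizzle_global(buf)
per_expert_out = swizzle_per_expert(buf, num_experts)
print(global_out)
print(per_expert_out)
```

With the global swizzle, expert 1's rows land in expert 0's block. With the per-expert version, every row stays inside its own expert's block.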
Bug 2: searchsorted(right=False) — Wrong Expert Assignment
Fix: Changed to right=True.
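The notes do not record how searchsorted is called, so the following is only one plausible reading: a slot index is mapped to an expert by searching the exclusive end offsets of each expert's range. The stdlib bisect_left and bisect_right functions behave like searchsorted with right=False and right=True respectively. The counts are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Exclusive end offset of each expert's slot range (hypothetical counts 3, 2, 4).
expert_ends = [3, 5, 9]


def expert_of(slot: int, right: bool) -> int:
    """bisect_left / bisect_right mirror searchsorted(right=False / right=True)."""
    find = bisect_right if right else bisect_left
    return find(expert_ends, slot)


slot = 3  # first slot owned by expert 1 (expert 0 owns slots 0..2)
print(expert_of(slot, right=False))  # 0, wrong expert
print(expert_of(slot, right=True))   # 1, correct expert
```

Only slots that sit exactly on a boundary are affected, which would make this kind of bug easy to miss on small inputs.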
Bug 3: CuTeDSL cute.compile GPU Memory Corruption — CRITICAL
Symptom: _token_indices was all zeros.
Root cause: CuTeDSL's cute.compile (JIT) corrupts GPU memory. Tensors allocated on GPU before/during JIT get zeroed.
Fix: _fill_token_indices() builds the indices on the CPU, then copies them to the GPU. A _needs_token_refill flag marks them for refill after the GEMM JIT.
Bug 4: expert_offsets With Leading 0
Fix: Pass expert_offsets[1:num_experts + 1] to the GEMM.
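A minimal sketch of the slicing, assuming the GEMM wants one exclusive end offset per expert (inferred from the fix, not confirmed by the kernel source). The token counts are hypothetical.

```python
from itertools import accumulate

tokens_per_expert = [3, 0, 2, 4]  # hypothetical counts
num_experts = len(tokens_per_expert)

# Cumulative offsets with a leading 0: length num_experts + 1.
expert_offsets = [0, *accumulate(tokens_per_expert)]

# Slice handed to the GEMM per the fix: drop the leading 0.
gemm_offsets = expert_offsets[1:num_experts + 1]

print(expert_offsets)  # [0, 3, 3, 5, 9]
print(gemm_offsets)    # [3, 3, 5, 9]
```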
Bug 5: Checkpoint input_scale Is Wrong for Activation Global Scale
Root cause: Checkpoint input_scale (~0.000286) is a calibration value. Too-small gs → block scale overflow → garbage.
Fix: compute_activation_global_scales() warmup method.
Bug 6: L1 and L2 Need Separate Activation Global Scales
Fix: compute_activation_global_scales() computes L2 gs from L1 output after SiLU*up.
Bug 7: L1 and L2 Need Separate Padded Scale Buffers
Fix: Separate _padded_x_sf_buf_l1 and _padded_x_sf_buf_l2, plus separate per-expert scale bufs.
Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
Symptom: IndexKernel.cu:111 OOB assertion, cascading CUDA_ERROR_ASSERT (710).
Root cause: topk_ids contains global IDs (0-255), runner treated as local (0-31/48).
Fix: Added experts_start_idx, remap global→local, mask non-local tokens.
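The remap can be sketched in plain Python as follows. The function name is hypothetical, the contiguous per-rank ID layout is an assumption, and using local ID 0 as the placeholder for non-local tokens is an illustrative choice, not necessarily what the runner does.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Return (local_ids, is_local) for a flat list of global expert IDs.

    Non-local entries get local ID 0 as a safe placeholder and is_local=False,
    so callers can mask them out instead of indexing out of bounds.
    """
    local_ids, is_local = [], []
    for gid in topk_ids:
        lid = gid - experts_start_idx
        ok = 0 <= lid < num_local_experts
        local_ids.append(lid if ok else 0)
        is_local.append(ok)
    return local_ids, is_local


# Assumed layout: a rank with 48 local experts starting at global ID 336.
local_ids, is_local = remap_to_local([336, 12, 383, 200], 336, 48)
print(local_ids)  # [0, 0, 47, 0]
print(is_local)   # [True, False, True, False]
```

Every returned local ID is a valid index, and the mask tells downstream code which entries to ignore.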
Bug 8b: .cpu() Sync Breaking Cudagraph Compatibility
Fix: Moved _token_indices to GPU, _fill_token_indices() (CPU→GPU copy).
Bug 9: padded_x_sf Buffer Too Small — Index Out of Bounds
Root cause: Buffer sized for num_experts * 128 rows, but scatter positions exceeded this.
Fix (iterative): Multiple iterations of sizing and layout fixes. See Bugs 11, 14.
Bug 10: Wrong top_k and max_num_tokens Defaults
Root cause: Runner defaulted to top_k=8 and max_num_tokens=8192, but vLLM uses top_k=6.
Fix: Pass values from deepseek_v4.py.
Bug 11: Full-Buffer Swizzle Produced Wrong GEMM Input
Symptom: L2 gs=0.0 on EP5/EP7.
Root cause: Applied swizzle to entire buffer at once; GEMM expects per-expert swizzled blocks.
Fix: Reverted to per-expert swizzle with fixed 128-row slots.
Bug 12: torch.full() During Cudagraph Capture
Symptom: cudaErrorStreamCaptureUnsupported on all 8 workers.
Root cause: torch.full() allocates new tensor during stream capture.
Fix: Pre-allocated _l1_gsa_buf, _l2_gsa_buf, _output_buf, _row_indices_buf. Use .fill_() instead of torch.full().
Bug 13: Warmup Passed Global Expert IDs Instead of Local
Symptom: L2 gs=0.0 on EP5/EP7.
Root cause: Warmup passed global IDs (336+) to compute_activation_global_scales() which matches against local range (0..47).
Fix: Pass local IDs (0..num_experts-1).
Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable
Symptom: Model generates BOS token repeatedly (garbage logits).
Root cause: Scale assembly placed data at fixed e*128 offsets, but GEMM reads scale_a[expert_offsets[e]:...] where expert_offsets reflects real token counts (e.g., 500 for expert 0). Only 128 rows of scale data per expert → GEMM reads zeros beyond row 128.
Fix: Pad slot_hidden to num_experts * max_chunks * 128 rows with fixed layout. Pass padded_expert_offsets=[0, max_rows, 2*max_rows, ...] to GEMM. Scatter real tokens into padded positions. GEMM processes padded 128-row blocks. Extract real token outputs via l1_out[padded_dst].
Bug 15: OOM — Padded Buffers Sized for 8192 Tokens
Symptom: torch.OutOfMemoryError trying to allocate 1008 MiB.
Root cause: padded_hidden_buf + padded_activated_buf sized for max_num_tokens=8192 → 72 MB per layer × 60 layers = 4.3 GB. With model+KV at 175 GB on 178 GB GPUs, no room.
Fix: Cap max_num_tokens at cudagraph max capture size (512) for buffer pre-allocation. Reduces per-layer overhead to ~9 MB, total ~540 MB.
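The memory budget works out as follows. This only re-multiplies the figures quoted in this entry, treating MB and GB as decimal units. The per-layer sizes are taken as given rather than derived from tensor shapes.

```python
LAYERS = 60
GPU_GB = 178
MODEL_PLUS_KV_GB = 175


def total_mb(per_layer_mb, layers=LAYERS):
    """Padded-buffer overhead summed over all layers, in MB."""
    return per_layer_mb * layers


uncapped = total_mb(72)  # buffers sized for max_num_tokens = 8192
capped = total_mb(9)     # buffers sized for max_num_tokens = 512

print(uncapped, capped)                              # 4320 540
print(MODEL_PLUS_KV_GB + uncapped / 1000 > GPU_GB)   # True: does not fit
print(MODEL_PLUS_KV_GB + capped / 1000 > GPU_GB)     # False: fits
```

The 8x drop from 72 MB to 9 MB per layer is in line with the padded layout shrinking from 8 chunks per expert to 1.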
Bug 16: padded_max_slots Mismatch — Buffer Sized for max_tokens*top_k vs num_experts*max_chunks*128
Symptom: Index out of bounds during cudagraph warmup.
Root cause: padded_max_slots computed from max_tokens*top_k (3072) but total_padded_slots in run() is num_experts*max_chunks*128 (6144). Buffer too small.
Fix: Size buffers for num_experts * max_chunks * 128.
Bug 17 (ACTIVE): Shape Mismatch — GEMM Output 49152 vs Expected 3072
Symptom: RuntimeError: shape mismatch: value tensor of shape [49152, 7168] cannot be broadcast to indexing result of shape [3072, 7168]
Root cause (under investigation): GEMM output has 49152 rows = 48 experts × 8 chunks × 128. This means max_chunks_per_expert = 8, which implies the runner's max_num_tokens is still 8192 (not capped to 512). The _cudagraph_max_capture_size getattr fallback to 512 should cap it, but the GEMM output suggests otherwise. Debug print added to verify.
Hypothesis: Either (1) the min(self.max_num_tokens, 512) cap isn't working as expected, or (2) the padded_hidden buffer is somehow sized at the original 8192 budget despite the cap.
Bug 18: Cudagraph Capture — Dynamic Tensor Allocation in Scale Assembly
Symptom: cudaErrorStreamCaptureInvalidated — "capture failure must be from kernel launch".
Root cause: _assemble_scales_cudagraph_safe created torch.zeros() for padded_expert_offsets during the forward pass, which allocates during cudagraph capture.
Fix: Removed dynamic tensor creation. Use fixed layout offsets computed from Python constants.
Bug 19: Variable-Trip while Loop in Scale Assembly
Symptom: cudaErrorStreamCaptureInvalidated during cudagraph capture.
Root cause: Inner while remaining > 0 loop with variable trip count based on GPU scalar padded_rows_per_expert[e]. Python control flow using GPU values requires CPU sync.
Fix: Replaced with fixed for c in range(max_chunks) loop. Unused chunks are zero (harmless).
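A toy comparison of the two loop shapes, with the block size shrunk from 128 to 4 and a placeholder reversal standing in for the real swizzle. It shows the property the fix relies on: when unused chunks are all zeros, running a fixed max_chunks iterations gives the same result as the variable-trip loop. All names here are hypothetical.

```python
BLOCK_ROWS = 4  # toy stand-in for 128


def toy_swizzle(chunk):
    """Placeholder permutation (reversal); the real swizzle is not shown in these notes."""
    return chunk[::-1]


def assemble_variable(padded, rows_used):
    """Trip count depends on rows_used (a GPU scalar in the runner, hence the sync)."""
    out = [0] * len(padded)
    start, remaining = 0, rows_used
    while remaining > 0:
        out[start:start + BLOCK_ROWS] = toy_swizzle(padded[start:start + BLOCK_ROWS])
        start += BLOCK_ROWS
        remaining -= BLOCK_ROWS
    return out


def assemble_fixed(padded, max_chunks):
    """Trip count is a Python constant; all-zero chunks swizzle to zeros."""
    out = []
    for c in range(max_chunks):
        out.extend(toy_swizzle(padded[c * BLOCK_ROWS:(c + 1) * BLOCK_ROWS]))
    return out


max_chunks = 3
rows_used = 5  # real rows for this expert; the rest is zero padding
padded = list(range(1, rows_used + 1)) + [0] * (max_chunks * BLOCK_ROWS - rows_used)

same = assemble_variable(padded, rows_used) == assemble_fixed(padded, max_chunks)
print(same)  # True
```

The fixed version never reads rows_used, so nothing in the loop depends on a value that lives on the GPU.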
Bug 20: torch.zeros() in Scale Assembly Phase 1
Symptom: cudaErrorStreamCaptureInvalidated.
Root cause: padded_expert_offsets = torch.zeros(...) created during forward pass (inside _assemble_scales_cudagraph_safe).
Fix: Removed the computation entirely. Use fixed e * max_chunks * 128 + c * 128 offsets computed from Python constants.
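Because the offsets depend only on Python constants, they can be computed once outside the captured region. A minimal sketch of the formula, with a hypothetical function name:

```python
BLOCK_ROWS = 128


def fixed_chunk_offsets(num_experts: int, max_chunks: int):
    """Row offset of every (expert, chunk) slot: e * max_chunks * 128 + c * 128."""
    return [
        [e * max_chunks * BLOCK_ROWS + c * BLOCK_ROWS for c in range(max_chunks)]
        for e in range(num_experts)
    ]


offsets = fixed_chunk_offsets(num_experts=3, max_chunks=2)
print(offsets)  # [[0, 128], [256, 384], [512, 640]]
```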
vLLM Integration Status
| Component | Status | Notes |
|---|---|---|
| Weight loading | ✅ | Direct NVFP4 path, no BF16 round-trip |
| Weight stacking | ✅ | make_b_k_major + assemble_scales_3d_side |
| Global→local ID remap | ✅ | experts_start_idx, mask non-local tokens |
| Warmup gs computation | ✅ | Per-layer, local expert IDs, L1+L2 gs |
| Scale assembly | ⚠️ | Fixed max_chunks layout, pending GEMM shape fix |
| Cudagraph capture | ⚠️ | Works in test, fails in vLLM (shape mismatch) |
| Model output | ❌ | Previously BOS repeat; now crashes before serving |
Key Architecture: Fixed-Layout Padding
Current Design
Each expert gets max_chunks * 128 rows at fixed offset (e * max_chunks * 128).
padded_hidden: [exp0_128rows][exp0_128rows]...[exp1_128rows]...
               chunk0        chunk1           chunk0
Scatter: padded_dst = expert_assign * max_rows_per_expert + clamped_local_row
GEMM input: padded_hidden (total = num_experts * max_chunks * 128 rows)
GEMM offsets: [0, max_rows, 2*max_rows, ...] (fixed, pre-computed)
GEMM output: same total rows
Extract: l1_out[padded_dst] → only real token rows
Scale assembly:
Phase 1: Scatter x_sf into padded_x_sf at same fixed offsets
Phase 2: Per-expert, per-chunk swizzle (fixed loop: max_chunks iterations)
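The scatter and extract steps of this design can be sketched in plain Python, with an identity function standing in for the GEMM. How the runner derives local_row is not recorded here, so arrival order is assumed. The clamp to the last row of an expert's slot is likewise an assumption about what clamped_local_row means. Function names are hypothetical.

```python
BLOCK_ROWS = 128


def padded_destinations(expert_assign, max_chunks):
    """padded_dst = expert_assign * max_rows_per_expert + clamped_local_row."""
    max_rows = max_chunks * BLOCK_ROWS
    next_row = {}  # running row counter per expert (assumed arrival order)
    dst = []
    for e in expert_assign:
        local_row = next_row.get(e, 0)
        next_row[e] = local_row + 1
        clamped = min(local_row, max_rows - 1)  # assumed: never spill into the next expert
        dst.append(e * max_rows + clamped)
    return dst


def gemm_offsets(num_experts, max_chunks):
    """Fixed start offsets: [0, max_rows, 2*max_rows, ...]."""
    max_rows = max_chunks * BLOCK_ROWS
    return [e * max_rows for e in range(num_experts)]


num_experts, max_chunks = 3, 1
tokens = ["t0", "t1", "t2", "t3", "t4"]
expert_assign = [2, 0, 2, 1, 0]

padded_dst = padded_destinations(expert_assign, max_chunks)
padded_hidden = [None] * (num_experts * max_chunks * BLOCK_ROWS)
for tok, d in zip(tokens, padded_dst):
    padded_hidden[d] = tok  # scatter real tokens into padded positions

l1_out = list(padded_hidden)               # identity stand-in for the GEMM
extracted = [l1_out[d] for d in padded_dst]  # l1_out[padded_dst]

print(padded_dst)                        # [256, 0, 257, 128, 1]
print(gemm_offsets(num_experts, max_chunks))  # [0, 128, 256]
print(extracted == tokens)               # True
```

The GEMM output must have the same number of rows as padded_hidden for the final indexing to be valid, which is the invariant that Bug 17 violates.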
Cudagraph Constraints (All Resolved)
- No .item(), .cpu(), .tolist(): zero CPU-GPU syncs
- No torch.zeros/ones/full/empty/arange() during capture: pre-allocate everything
- No dynamic Python control flow from GPU values: fixed loop counts
- Per-expert Python loops OK (fixed num_experts, unrolled at capture time)
Repo Info
- Kernel: sweetapi.com/biondizzle/nvfp4-megamoe-kernel (master)
- Local: ~/dev/nvfp4-megamoe-kernel/
- B200: /root/nvfp4-megamoe-kernel/
- Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 (read-only)
- Never edit on B200 directly: edit locally → commit → push → pull on B200