Current Bug: CuTeDSLMoERunner — Status & Debug History
Current Status (May 17, 2026 15:51 UTC)
vLLM container build in progress. Previous crash was from OOM + shape mismatch. Both now fixed.
- ✅ layertest.py: 0.988 cosine
- ✅ cudagraph_test.py: capture + replay works
- ✅ Container builds, loads weights, warmup gs computed (no L2 gs=0)
- 🔧 Build #7 in progress on B200 (shared buffer fix)
- ❌ Haven't gotten to serving yet (crashes were during init/capture)
Latest fixes (Bugs 17→21):
- Bug 17 (shape mismatch 49152 vs 3072): Root cause was capping max_num_tokens to 512 for buffer sizing, but the actual warmup runs with 8192 tokens. Reverted the cap.
- Bug 21 (OOM): Instead of per-layer padded buffers (4.3 GB for 60 layers), use SHARED buffers across all runners. Only 72 MB total since layers run sequentially.
Bugs Found & Fixed
Bug 1: Scale Assembly — Global Swizzle vs Per-Expert Swizzle
Fix: Two-phase scatter + per-expert swizzle.
Bug 2: searchsorted(right=False) — Wrong Expert Assignment
Fix: Changed to right=True.
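The call site is not shown here, so this is only a sketch of why the flag matters, assuming each slot index is searched against the cumulative end offsets of the experts. The standard-library bisect module stands in for torch.searchsorted: bisect_left behaves like right=False and bisect_right like right=True. The token counts are made up.

```python
from bisect import bisect_left, bisect_right

# Hypothetical per-expert token counts: expert 0 owns slots 0-2, expert 1 owns slots 3-4.
counts = [3, 2]

# Cumulative end offsets with no leading 0 -> [3, 5].
ends = []
total = 0
for c in counts:
    total += c
    ends.append(total)

slots = list(range(total))

# right=False analogue: a slot equal to a boundary stays with the earlier expert.
assign_left = [bisect_left(ends, s) for s in slots]

# right=True analogue: a slot equal to a boundary starts the next expert.
assign_right = [bisect_right(ends, s) for s in slots]

print(assign_left)   # slot 3 is assigned to expert 0, which does not own it
print(assign_right)  # slot 3 is assigned to expert 1
```

Only the boundary slot differs between the two results, which is why this kind of bug tends to show up as a few misrouted tokens rather than a crash.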
Bug 3: CuTeDSL cute.compile GPU Memory Corruption — CRITICAL
Symptom: _token_indices all zeros after JIT.
Root cause: cute.compile corrupts GPU memory. Tensors allocated before/during JIT get zeroed.
Fix: _fill_token_indices() builds on CPU, copies to GPU. _needs_token_refill for GEMM JIT.
Bug 4: expert_offsets With Leading 0
Fix: Pass expert_offsets[1:num_experts + 1] to the GEMM.
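A small sketch of the slice, assuming expert_offsets is a prefix sum of per-expert token counts with a leading 0 (num_experts + 1 entries) and that the GEMM wants one end offset per expert. The counts are illustrative.

```python
from itertools import accumulate

num_experts = 4
tokens_per_expert = [5, 0, 3, 2]  # illustrative counts

# Prefix sum with a leading 0: entry e is where expert e starts.
expert_offsets = [0] + list(accumulate(tokens_per_expert))

# Drop the leading 0 so entry e is where expert e ends.
gemm_offsets = expert_offsets[1:num_experts + 1]

print(expert_offsets)
print(gemm_offsets)
```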
Bug 5: Checkpoint input_scale Is Wrong for Activation Global Scale
Root cause: Checkpoint input_scale (~0.000286) is a calibration value. Too-small gs → block scale overflow → garbage.
Fix: compute_activation_global_scales() warmup method.
Bug 6: L1 and L2 Need Separate Activation Global Scales
Fix: Compute L2 gs from actual L1 output after SiLU*up.
Bug 7: L1 and L2 Need Separate Padded Scale Buffers
Fix: Separate _padded_x_sf_buf_l1/_l2, separate per-expert scale bufs.
Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT
Symptom: IndexKernel.cu:111 OOB, cascading CUDA_ERROR_ASSERT (710) on all workers.
Root cause: topk_ids contains global IDs (0-255), runner treated as local (0-31/48).
Fix: Added experts_start_idx, remap global→local, mask non-local tokens.
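The remap can be sketched in plain Python, assuming each rank owns one contiguous block of expert IDs. The function name remap_to_local, the -1 sentinel, and the example numbers are hypothetical; the real runner does this with tensor ops on the GPU.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Return (local_ids, mask) for one rank.

    local_ids[i] is topk_ids[i] - experts_start_idx when that expert lives on
    this rank, otherwise -1 (hypothetical sentinel). mask[i] marks the entries
    that are safe to use as indices into local buffers.
    """
    local_ids, mask = [], []
    for gid in topk_ids:
        lid = gid - experts_start_idx
        is_local = 0 <= lid < num_local_experts
        local_ids.append(lid if is_local else -1)
        mask.append(is_local)
    return local_ids, mask


# Illustrative rank that owns global experts 32..63.
local_ids, mask = remap_to_local([3, 32, 63, 200], experts_start_idx=32, num_local_experts=32)
print(local_ids, mask)
```

Indexing with the unmasked global IDs is what produces the out-of-bounds access described in the symptom.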
Bug 8b: .cpu() Sync Breaking Cudagraph Compatibility
Fix: Moved _token_indices to GPU, _fill_token_indices() (CPU→GPU copy).
Bug 9: padded_x_sf Buffer Too Small — Index Out of Bounds
Root cause: Buffer sized for num_experts * 128 rows, but scatter positions exceeded this with real token distributions.
Fix: Iterative — see Bugs 11, 14, 16 for the final solution.
Bug 10: Wrong top_k and max_num_tokens Defaults
Root cause: Runner defaulted to top_k=8, vLLM uses top_k=6.
Fix: Pass values from deepseek_v4.py.
Bug 11: Full-Buffer Swizzle Produced Wrong GEMM Input
Symptom: L2 gs=0.0 on EP5/EP7.
Root cause: Swizzled entire buffer at once; GEMM expects per-expert swizzled blocks.
Fix: Reverted to per-expert swizzle.
Bug 12: torch.full() During Cudagraph Capture
Symptom: cudaErrorStreamCaptureUnsupported on all 8 workers.
Root cause: torch.full() allocates new tensor during stream capture.
Fix: Pre-allocated _l1_gsa_buf, _l2_gsa_buf, _output_buf, _row_indices_buf. Use .fill_().
Bug 13: Warmup Passed Global Expert IDs Instead of Local
Symptom: L2 gs=0.0 on EP5/EP7.
Root cause: Warmup passed global IDs (336+) against local range (0..47).
Fix: Pass local IDs (0..num_experts-1).
Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable
Symptom: Model generates BOS token repeatedly (garbage logits).
Root cause: Scale assembly placed data at fixed e*128 offsets, but GEMM reads scale_a according to real expert_offsets. When expert 0 has 500 tokens, GEMM reads scale_a[0:500] but only rows 0-127 have valid data.
Fix: Fixed-layout padding: each expert gets max_chunks * 128 rows at offset e * max_chunks * 128. Pad slot_hidden into this layout. Pass fixed padded_expert_offsets to GEMM. Extract real outputs via l1_out[padded_dst].
Bug 15: OOM — Padded Buffers Sized for 8192 Tokens (per-layer)
Symptom: torch.OutOfMemoryError trying to allocate 1008 MiB.
Root cause: padded_hidden_buf + padded_activated_buf at 72 MB per layer × 60 layers = 4.3 GB. Model+KV already at 175 GB on 178 GB GPUs.
Fix (attempt 1 — wrong): Cap max_num_tokens at 512. Caused Bug 17.
Fix (attempt 2 — correct): Shared buffers. See Bug 21.
Bug 16: padded_max_slots Mismatch
Root cause: Computed from max_tokens*top_k (3072) but total_padded_slots is num_experts*max_chunks*128 (6144).
Fix: Size for num_experts * max_chunks * 128.
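The two sizes can be checked with a few lines of arithmetic. This sketch assumes 48 local experts, top_k=6, 128-row chunks, and the capped max_tokens=512 that was in effect at the time, and it reuses the max_chunks formula from the EP Configuration section. slot_counts is a hypothetical helper.

```python
import math

CHUNK_ROWS = 128


def slot_counts(num_experts, max_tokens, top_k):
    """Return (real_slots, max_chunks, total_padded_slots)."""
    real_slots = max_tokens * top_k
    max_chunks = math.ceil(real_slots / (num_experts * CHUNK_ROWS))
    total_padded_slots = num_experts * max_chunks * CHUNK_ROWS
    return real_slots, max_chunks, total_padded_slots


real_slots, max_chunks, total_padded_slots = slot_counts(48, 512, 6)
print(real_slots, max_chunks, total_padded_slots)
```

Every expert gets at least one 128-row chunk, so the padded layout can be larger than max_tokens * top_k when the token count is small.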
Bug 17: Shape Mismatch — slot_hidden 49152 vs padded_dst 3072
Symptom: RuntimeError: shape mismatch: [49152, 7168] cannot be broadcast to [3072, 7168]
Root cause: Bug 15 fix capped max_num_tokens to 512, making _token_indices and buffers sized for 3072 slots. But the actual warmup/cudagraph forward pass uses 8192 tokens → sorted_token_ids has 49152 elements → slot_hidden has 49152 rows → doesn't fit in 3072-slot buffer.
Fix: Reverted the 512 cap. Use shared buffers (Bug 21) instead.
Bug 18: Dynamic Tensor Allocation in Scale Assembly
Symptom: cudaErrorStreamCaptureInvalidated.
Root cause: torch.zeros() for padded_expert_offsets inside _assemble_scales_cudagraph_safe.
Fix: Use fixed offsets from Python constants.
Bug 19: Variable-Trip while Loop in Scale Assembly
Symptom: cudaErrorStreamCaptureInvalidated.
Root cause: while remaining > 0 loop with GPU scalar in condition → CPU sync.
Fix: Fixed for c in range(max_chunks) loop.
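A plain-Python sketch of the loop change, assuming the body handles one 128-row chunk per iteration. Both function names are hypothetical. In the real code num_rows lives on the GPU, so the fixed version would express the valid-row clamp with tensor ops instead of Python min/max.

```python
CHUNK_ROWS = 128


def chunks_while(num_rows):
    """Data-dependent loop: the trip count depends on num_rows."""
    out, start, remaining = [], 0, num_rows
    while remaining > 0:
        valid = min(remaining, CHUNK_ROWS)
        out.append((start, valid))
        start += CHUNK_ROWS
        remaining -= valid
    return out


def chunks_fixed(num_rows, max_chunks):
    """Fixed trip count: always max_chunks iterations, empty chunks get valid=0."""
    out = []
    for c in range(max_chunks):
        start = c * CHUNK_ROWS
        valid = max(0, min(CHUNK_ROWS, num_rows - start))
        out.append((start, valid))
    return out


dynamic = chunks_while(300)
fixed = chunks_fixed(300, max_chunks=4)
print(dynamic)
print(fixed)
```

The fixed loop does some wasted work on empty chunks, but the sequence of launched operations no longer depends on a value that has to be read back from the GPU.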
Bug 20: Another torch.zeros() in Scale Assembly
Fix: Removed. Use fixed e * max_chunks * 128 + c * 128 offsets.
Bug 21: OOM (correct fix) — Shared Padded Buffers
Symptom: Same as Bug 15 (4.3 GB for per-layer padded buffers).
Root cause: Per-layer allocation of padded_hidden_buf and padded_activated_buf at 72 MB × 60 layers.
Fix: Single shared set of padded buffers across all runners. Layers execute sequentially during both capture and replay, so the same buffer is reused. Total: 72 MB (not 4.3 GB). Stored as class-level dict keyed by device.
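A minimal sketch of the class-level cache, assuming every runner on a device asks for the same buffer size. The Runner class and attribute names are hypothetical, and a bytearray stands in for the GPU tensor.

```python
class Runner:
    # Class-level cache shared by every instance: device -> buffer.
    _shared_padded_bufs = {}

    def __init__(self, device, nbytes):
        cache = Runner._shared_padded_bufs
        if device not in cache:
            # bytearray is a stand-in for a GPU tensor allocation.
            cache[device] = bytearray(nbytes)
        self.padded_buf = cache[device]


# 60 layers on one device end up with a single allocation.
runners = [Runner("cuda:0", 1024) for _ in range(60)]

per_layer_mb = 72
num_layers = 60
per_layer_total_mb = per_layer_mb * num_layers  # 4320 MB, about 4.3 GB
shared_total_mb = per_layer_mb                  # one buffer set per device

print(len(Runner._shared_padded_bufs), per_layer_total_mb, shared_total_mb)
```

Sharing is only safe because no two layers need the buffer contents at the same time; anything that overlapped layer execution would need one buffer per concurrent layer.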
vLLM Integration Status
| Component | Status | Notes |
|---|---|---|
| Weight loading | ✅ | Direct NVFP4 path, no BF16 round-trip |
| Weight stacking | ✅ | make_b_k_major + assemble_scales_3d_side |
| Global→local ID remap | ✅ | experts_start_idx, mask non-local tokens |
| Warmup gs computation | ✅ | Per-layer, local expert IDs, L1+L2 gs |
| Scale assembly | ✅ | Fixed max_chunks layout, no dynamic allocs |
| Cudagraph compatibility | ✅ | No dynamic allocs, no CPU syncs, fixed loops |
| Buffer sizing | ✅ | Shared buffers avoid OOM |
| Model output | ❓ | Build #7 in progress — never reached serving without crash |
Key Architecture: Fixed-Layout Padding
Current Design
Each expert gets max_chunks * 128 rows at fixed offset (e * max_chunks * 128).
padded_hidden: [exp0_chunk0][exp0_chunk1]...[exp1_chunk0]...
                 128 rows     128 rows        128 rows
Scatter: padded_dst = expert_assign * max_rows_per_expert + clamped_local_row
GEMM input: padded_hidden (total = num_experts * max_chunks * 128 rows)
GEMM offsets: [0, max_rows, 2*max_rows, ...] (fixed, pre-computed in _allocate_buffers)
GEMM output: same total rows
Extract: l1_out[padded_dst] → only real token rows
Scale assembly:
Phase 1: Scatter x_sf into padded_x_sf at same fixed offsets
Phase 2: Per-expert, per-chunk swizzle (fixed loop: max_chunks iterations)
No dynamic tensor allocation, no GPU→CPU syncs
Shared buffers:
padded_hidden and padded_activated are class-level (not per-layer).
72 MB total instead of 4.3 GB. Layers run sequentially → safe to share.
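The scatter and extract steps above can be sketched with Python lists. This assumes local rows are handed out in arrival order per expert and that overflow is clamped to the expert's last row. scatter_fixed_layout is a hypothetical helper, and the example uses 4-row chunks instead of 128 to keep the numbers readable.

```python
def scatter_fixed_layout(expert_assign, num_experts, max_chunks, chunk_rows=128):
    """Return (fixed expert start offsets, padded_dst row index for each slot)."""
    max_rows = max_chunks * chunk_rows
    offsets = [e * max_rows for e in range(num_experts)]
    next_row = [0] * num_experts
    padded_dst = []
    for e in expert_assign:
        # Clamp so a slot can never spill into the next expert's region.
        local_row = min(next_row[e], max_rows - 1)
        padded_dst.append(e * max_rows + local_row)
        next_row[e] += 1
    return offsets, padded_dst


expert_assign = [0, 0, 0, 1, 2, 2]
values = ["a", "b", "c", "d", "e", "f"]
offsets, padded_dst = scatter_fixed_layout(expert_assign, num_experts=3, max_chunks=2, chunk_rows=4)

# Scatter into the padded buffer, then gather the real rows back out.
padded = [None] * (3 * 2 * 4)
for dst, v in zip(padded_dst, values):
    padded[dst] = v
extracted = [padded[d] for d in padded_dst]

print(offsets, padded_dst, extracted)
```

Because the offsets depend only on num_experts and max_chunks, they can be computed once at allocation time and never change between graph replays.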
Cudagraph Constraints (All Resolved)
- No .item(), .cpu(), .tolist(): zero CPU-GPU syncs
- No torch.zeros/ones/full/empty/arange() during capture: pre-allocate everything
- No dynamic Python control flow from GPU values: fixed loop counts
- Per-expert Python loops OK (fixed num_experts, unrolled at capture time)
- Shared buffers OK (layers execute sequentially during capture and replay)
EP Configuration (DeepSeek-V4-Pro on 8×B200)
- 256 total experts, top_k=6
- EP=8 → 48 local experts per rank
- experts_start_idx = rank × 32
- max_num_tokens = 8192 (from scheduler_config.max_num_batched_tokens)
- max_chunks_per_expert = ceil(8192 × 6 / (48 × 128)) = 8
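The max_chunks_per_expert figure can be reproduced directly from the values listed above. The variable names are illustrative; the derived row counts follow from the fixed-layout design (max_chunks * 128 rows per expert).

```python
import math

max_num_tokens = 8192
top_k = 6
num_local_experts = 48
chunk_rows = 128

total_slots = max_num_tokens * top_k                # 49152 token-expert slots
rows_per_round = num_local_experts * chunk_rows     # 6144 rows if each expert gets one chunk
max_chunks_per_expert = math.ceil(total_slots / rows_per_round)

max_rows_per_expert = max_chunks_per_expert * chunk_rows
total_padded_rows = num_local_experts * max_rows_per_expert

# Fixed start offsets of the first few experts in the padded layout.
first_offsets = [e * max_rows_per_expert for e in range(3)]

print(max_chunks_per_expert, max_rows_per_expert, total_padded_rows, first_offsets)
```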
Test Files
| File | Purpose |
|---|---|
| tests/layertest.py | Reference: moe_pipeline with dynamic gs, 3 experts, layer 0. Must pass (≥0.98 cosine). |
| tests/cudagraph_test.py | CuTeDSLMoERunner cudagraph capture + replay. Must pass. |
| tests/test_warmup_gs.py | Warmup gs computation. |
| tests/test_runner_vs_pipeline.py | Compare runner.run() vs moe_pipeline. |
| tests/test_scale_assembly.py | Compare cudagraph-safe vs reference scale assembly. |
Run order after any code change:
1. python3 tests/layertest.py: must pass
2. python3 tests/cudagraph_test.py: must pass
Repo Info
- Kernel: sweetapi.com/biondizzle/nvfp4-megamoe-kernel (master)
- Local: ~/dev/nvfp4-megamoe-kernel/
- B200: /root/nvfp4-megamoe-kernel/
- Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 (read-only)
- Never edit on B200 directly: edit locally → commit → push → pull on B200