Current Bug: CuTeDSLMoERunner — Status & Debug History

Current Status (May 17, 2026 15:51 UTC)

vLLM container build in progress. Previous crash was from OOM + shape mismatch. Both now fixed.

  • layertest.py — 0.988 cosine
  • cudagraph_test.py — capture + replay works
  • Container builds, loads weights, warmup gs computed (no L2 gs=0)
  • 🔧 Build #7 in progress on B200 (shared buffer fix)
  • Haven't gotten to serving yet (crashes were during init/capture)

Latest fixes (Bugs 17→21):

  • Bug 17 (shape mismatch 49152 vs 3072): Root cause was capping max_num_tokens to 512 for buffer sizing, but the actual warmup runs with 8192 tokens. Reverted the cap.
  • Bug 21 (OOM): Instead of per-layer padded buffers (4.3 GB for 60 layers), use SHARED buffers across all runners. Only 72 MB total since layers run sequentially.

Bugs Found & Fixed

Bug 1: Scale Assembly — Global Swizzle vs Per-Expert Swizzle

Fix: Two-phase scatter + per-expert swizzle.

Bug 2: searchsorted(right=False) — Wrong Expert Assignment

Fix: Changed to right=True.
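
A minimal sketch of why the boundary side matters, assuming the runner maps sorted slot positions to experts by searching cumulative end-offsets (the source does not show the call site). `bisect_left` and `bisect_right` from the stdlib mirror `torch.searchsorted(..., right=False)` and `right=True`; `expert_ends` and `assign_expert` are hypothetical names.

```python
from bisect import bisect_left, bisect_right

# Hypothetical layout: 3 experts with 3, 2, and 4 tokens.
# Expert e owns slots up to (but not including) expert_ends[e].
expert_ends = [3, 5, 9]

def assign_expert(slot: int, right: bool) -> int:
    """Mirror torch.searchsorted(expert_ends, slot, right=right) for one value."""
    find = bisect_right if right else bisect_left
    return find(expert_ends, slot)

slots = list(range(9))
wrong = [assign_expert(s, right=False) for s in slots]
correct = [assign_expert(s, right=True) for s in slots]

print(wrong)    # slots 3 and 5 land one expert too early
print(correct)
```

With `right=False`, a slot that sits exactly on a boundary (slot 3 or 5 here) is assigned to the previous expert. With `right=True`, it is assigned to the expert that starts there.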

Bug 3: CuTeDSL cute.compile GPU Memory Corruption — CRITICAL

Symptom: _token_indices all zeros after JIT.

Root cause: cute.compile corrupts GPU memory. Tensors allocated before/during JIT get zeroed.

Fix: _fill_token_indices() builds the indices on CPU and copies them to GPU. _needs_token_refill handles the GEMM JIT case.

Bug 4: expert_offsets With Leading 0

Fix: Pass expert_offsets[1:num_experts + 1] to the GEMM.
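
The slice can be illustrated with plain lists. This sketch assumes expert_offsets is a cumulative sum with a leading 0 and that the GEMM wants one end-offset per expert; `tokens_per_expert` and `gemm_offsets` are hypothetical names.

```python
from itertools import accumulate

tokens_per_expert = [3, 2, 4]              # hypothetical counts
num_experts = len(tokens_per_expert)

# Cumulative sum with a leading 0: one more entry than there are experts.
expert_offsets = [0, *accumulate(tokens_per_expert)]

# Drop the leading 0 so the GEMM sees exactly num_experts end-offsets.
gemm_offsets = expert_offsets[1:num_experts + 1]

print(expert_offsets)  # [0, 3, 5, 9]
print(gemm_offsets)    # [3, 5, 9]
```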

Bug 5: Checkpoint input_scale Is Wrong for Activation Global Scale

Root cause: The checkpoint input_scale (~0.000286) is a calibration value, not a usable activation global scale (gs). A gs that is too small overflows the block scales and produces garbage output.

Fix: compute_activation_global_scales() warmup method.
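
A rough sketch of the overflow, under an assumed convention: gs maps the tensor amax onto the product of the FP4 E2M1 maximum (6.0) and the FP8 E4M3 maximum (448.0), and each block scale is block_amax / (6.0 * gs) before it is cast to FP8. The kernel's actual convention may differ (for example, it may store the reciprocal). The function names and activation values below are hypothetical.

```python
FP4_E2M1_MAX = 6.0     # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0   # largest finite magnitude in FP8 E4M3

def global_scale_from_amax(amax: float) -> float:
    """Assumed convention: gs maps the tensor amax onto FP4_MAX * FP8_MAX."""
    return amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)

def block_scale(block_amax: float, gs: float) -> float:
    """Per-block scale before the FP8 cast; it must stay <= FP8_E4M3_MAX."""
    return block_amax / (FP4_E2M1_MAX * gs)

tensor_amax = 12.0       # hypothetical amax measured during warmup
block_amax = 8.0         # hypothetical amax of one 16-element block
checkpoint_gs = 0.000286 # calibration value from the checkpoint

warmup_gs = global_scale_from_amax(tensor_amax)

overflow_with_checkpoint = block_scale(block_amax, checkpoint_gs) > FP8_E4M3_MAX
overflow_with_warmup = block_scale(block_amax, warmup_gs) > FP8_E4M3_MAX

print(overflow_with_checkpoint, overflow_with_warmup)
```

Under these assumptions the checkpoint value pushes the block scale into the thousands, far past what FP8 E4M3 can hold, while a gs derived from the measured amax keeps it in range.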

Bug 6: L1 and L2 Need Separate Activation Global Scales

Fix: Compute L2 gs from actual L1 output after SiLU*up.

Bug 7: L1 and L2 Need Separate Padded Scale Buffers

Fix: Separate _padded_x_sf_buf_l1/_l2, separate per-expert scale bufs.

Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT

Symptom: IndexKernel.cu:111 OOB, cascading CUDA_ERROR_ASSERT (710) on all workers.

Root cause: topk_ids contains global IDs (0-255), but the runner treated them as local IDs (0-31/48).

Fix: Added experts_start_idx, remap global→local, mask non-local tokens.
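
The remap can be sketched on plain lists. In the runner this would be tensor arithmetic on topk_ids; the function name `remap_to_local`, the clamp-to-0 choice for non-local entries, and the example values are assumptions for illustration only.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Return (local_ids, mask). Non-local entries get local id 0 and mask False."""
    local_ids, mask = [], []
    for gid in topk_ids:
        lid = gid - experts_start_idx
        is_local = 0 <= lid < num_local_experts
        # Clamp non-local ids so later indexing stays in bounds;
        # the mask is what excludes them from the computation.
        local_ids.append(lid if is_local else 0)
        mask.append(is_local)
    return local_ids, mask

# Hypothetical rank that owns global experts 32..63.
local_ids, mask = remap_to_local(
    [0, 31, 32, 63, 64, 255], experts_start_idx=32, num_local_experts=32
)

print(local_ids)  # [0, 0, 0, 31, 0, 0]
print(mask)       # [False, False, True, True, False, False]
```

Without the subtraction and the mask, an ID such as 255 would index past the end of a 32-entry local table, which is the kind of out-of-bounds access reported in the symptom.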

Bug 8b: .cpu() Sync Breaking Cudagraph Compatibility

Fix: Moved _token_indices to GPU, _fill_token_indices() (CPU→GPU copy).

Bug 9: padded_x_sf Buffer Too Small — Index Out of Bounds

Root cause: Buffer sized for num_experts * 128 rows, but scatter positions exceeded this with real token distributions.

Fix: Iterative; see Bugs 11, 14, and 16 for the final solution.

Bug 10: Wrong top_k and max_num_tokens Defaults

Root cause: The runner defaulted to top_k=8, but vLLM uses top_k=6.

Fix: Pass values from deepseek_v4.py.

Bug 11: Full-Buffer Swizzle Produced Wrong GEMM Input

Symptom: L2 gs=0.0 on EP5/EP7.

Root cause: The entire buffer was swizzled at once, but the GEMM expects per-expert swizzled blocks.

Fix: Reverted to per-expert swizzle.

Bug 12: torch.full() During Cudagraph Capture

Symptom: cudaErrorStreamCaptureUnsupported on all 8 workers.

Root cause: torch.full() allocates a new tensor during stream capture.

Fix: Pre-allocated _l1_gsa_buf, _l2_gsa_buf, _output_buf, _row_indices_buf. Use .fill_() on them instead.

Bug 13: Warmup Passed Global Expert IDs Instead of Local

Symptom: L2 gs=0.0 on EP5/EP7.

Root cause: Warmup passed global IDs (336+) against the local range (0..47).

Fix: Pass local IDs (0..num_experts-1).

Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable

Symptom: Model generates the BOS token repeatedly (garbage logits).

Root cause: Scale assembly placed data at fixed e*128 offsets, but the GEMM reads scale_a according to the real expert_offsets. When expert 0 has 500 tokens, the GEMM reads scale_a[0:500], but only rows 0-127 hold valid data.

Fix: Fixed-layout padding. Each expert gets max_chunks * 128 rows at offset e * max_chunks * 128. Pad slot_hidden into this layout, pass fixed padded_expert_offsets to the GEMM, and extract the real outputs via l1_out[padded_dst].

Bug 15: OOM — Padded Buffers Sized for 8192 Tokens (per-layer)

Symptom: torch.OutOfMemoryError while trying to allocate 1008 MiB.

Root cause: padded_hidden_buf + padded_activated_buf take 72 MB per layer × 60 layers = 4.3 GB. Model + KV cache already occupy 175 GB on 178 GB GPUs.

Fix (attempt 1, wrong): Cap max_num_tokens at 512. This caused Bug 17.

Fix (attempt 2, correct): Shared buffers. See Bug 21.

Bug 16: padded_max_slots Mismatch

Root cause: Computed from max_tokens*top_k (3072), but total_padded_slots is num_experts*max_chunks*128 (6144).

Fix: Size for num_experts * max_chunks * 128.

Bug 17: Shape Mismatch — slot_hidden 49152 vs padded_dst 3072

Symptom: RuntimeError: shape mismatch: [49152, 7168] cannot be broadcast to [3072, 7168]

Root cause: The Bug 15 fix capped max_num_tokens to 512, so _token_indices and the buffers were sized for 3072 slots (512 × 6). The actual warmup/cudagraph forward pass uses 8192 tokens, so sorted_token_ids has 49152 elements (8192 × 6) and slot_hidden has 49152 rows, which do not fit in a 3072-slot buffer.

Fix: Reverted the 512 cap. Use shared buffers (Bug 21) instead.

Bug 18: Dynamic Tensor Allocation in Scale Assembly

Symptom: cudaErrorStreamCaptureInvalidated.

Root cause: torch.zeros() allocated padded_expert_offsets inside _assemble_scales_cudagraph_safe.

Fix: Use fixed offsets from Python constants.

Bug 19: Variable-Trip while Loop in Scale Assembly

Symptom: cudaErrorStreamCaptureInvalidated.

Root cause: A while remaining > 0 loop whose condition reads a GPU scalar, which forces a CPU sync.

Fix: Fixed-trip for c in range(max_chunks) loop.
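
The loop rewrite can be sketched with plain integers. In the runner, `count` would be a GPU value and the clamp would be a tensor op, so nothing is read back to the host. Both function names are hypothetical, and the 128-row chunk size is taken from the layout described in this document.

```python
CHUNK = 128

def chunk_rows_while(count):
    """Data-dependent loop: the trip count depends on `count`."""
    rows, remaining = [], count
    while remaining > 0:
        rows.append(min(remaining, CHUNK))
        remaining -= CHUNK
    return rows

def chunk_rows_fixed(count, max_chunks):
    """Fixed-trip loop: always max_chunks iterations; empty chunks get 0 rows."""
    return [min(max(count - c * CHUNK, 0), CHUNK) for c in range(max_chunks)]

variable = chunk_rows_while(300)
fixed = chunk_rows_fixed(300, max_chunks=4)

print(variable)  # [128, 128, 44]
print(fixed)     # [128, 128, 44, 0]
```

The fixed version does the same work for the occupied chunks and processes zero rows for the rest, so the sequence of operations no longer depends on a runtime value.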

Bug 20: Another torch.zeros() in Scale Assembly

Fix: Removed. Use fixed e * max_chunks * 128 + c * 128 offsets.

Bug 21: OOM (correct fix) — Shared Padded Buffers

Symptom: Same as Bug 15 (4.3 GB for per-layer padded buffers).

Root cause: Per-layer allocation of padded_hidden_buf and padded_activated_buf at 72 MB × 60 layers.

Fix: Single shared set of padded buffers across all runners. Layers execute sequentially during both capture and replay, so the same buffer is reused. Total: 72 MB (not 4.3 GB). Stored as a class-level dict keyed by device.
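
A minimal sketch of the class-level cache pattern, with `bytearray` standing in for GPU tensors so it runs anywhere. The class name `SharedBufferRunner`, the attribute names, and the sizes are hypothetical; only the "class-level dict keyed by device" idea comes from the fix above.

```python
class SharedBufferRunner:
    # Class-level cache shared by every instance: device -> buffers.
    _shared_bufs: dict = {}

    def __init__(self, device: str, nbytes: int):
        self.device = device
        cache = SharedBufferRunner._shared_bufs
        if device not in cache:
            # Allocate once per device. A real runner would allocate
            # tensors on `device` here instead of bytearrays.
            cache[device] = {
                "padded_hidden": bytearray(nbytes),
                "padded_activated": bytearray(nbytes),
            }
        self.bufs = cache[device]

# 60 layers on one device reuse a single buffer set.
layers = [SharedBufferRunner("cuda:0", nbytes=1024) for _ in range(60)]
other = SharedBufferRunner("cuda:1", nbytes=1024)

print(len(SharedBufferRunner._shared_bufs))  # 2
```

Sharing is only safe because each layer finishes with the buffers before the next layer writes to them. Concurrent layers would need separate buffers.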


vLLM Integration Status

| Component | Status | Notes |
|---|---|---|
| Weight loading | | Direct NVFP4 path, no BF16 round-trip |
| Weight stacking | | make_b_k_major + assemble_scales_3d_side |
| Global→local ID remap | | experts_start_idx, mask non-local tokens |
| Warmup gs computation | | Per-layer, local expert IDs, L1+L2 gs |
| Scale assembly | | Fixed max_chunks layout, no dynamic allocs |
| Cudagraph compatibility | | No dynamic allocs, no CPU syncs, fixed loops |
| Buffer sizing | | Shared buffers avoid OOM |
| Model output | | Build #7 in progress; never reached serving without crash |

Key Architecture: Fixed-Layout Padding

Current Design

Each expert gets max_chunks * 128 rows at fixed offset (e * max_chunks * 128).

padded_hidden: [exp0_chunk0][exp0_chunk1]...[exp1_chunk0]...
                   128 rows    128 rows       128 rows

Scatter: padded_dst = expert_assign * max_rows_per_expert + clamped_local_row
GEMM input: padded_hidden (total = num_experts * max_chunks * 128 rows)
GEMM offsets: [0, max_rows, 2*max_rows, ...] (fixed, pre-computed in _allocate_buffers)
GEMM output: same total rows
Extract: l1_out[padded_dst] → only real token rows

Scale assembly:
  Phase 1: Scatter x_sf into padded_x_sf at same fixed offsets
  Phase 2: Per-expert, per-chunk swizzle (fixed loop: max_chunks iterations)
  No dynamic tensor allocation, no GPU→CPU syncs

Shared buffers:
  padded_hidden and padded_activated are class-level (not per-layer).
  72 MB total instead of 4.3 GB. Layers run sequentially → safe to share.
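
The scatter/extract round trip above can be sketched with plain lists. This is an illustration under stated assumptions, not the runner's code: `padded_destinations` and `fixed_offsets` are hypothetical helpers, rows are strings instead of hidden vectors, and the identity "GEMM" just leaves the padded buffer unchanged.

```python
CHUNK = 128

def padded_destinations(expert_assign, max_chunks):
    """Map each expert-sorted slot to a row in the fixed-layout padded buffer."""
    max_rows = max_chunks * CHUNK
    placed = {}                      # rows already placed per expert
    dst = []
    for e in expert_assign:
        local_row = placed.get(e, 0)
        placed[e] = local_row + 1
        clamped = min(local_row, max_rows - 1)   # never spill into the next expert
        dst.append(e * max_rows + clamped)
    return dst

def fixed_offsets(num_experts, max_chunks):
    """Offsets [0, max_rows, 2*max_rows, ...], independent of the token distribution."""
    max_rows = max_chunks * CHUNK
    return [e * max_rows for e in range(num_experts)]

num_experts, max_chunks = 3, 2
expert_assign = [0, 0, 0, 2, 2]      # expert 1 receives no tokens
hidden = ["t0", "t1", "t2", "t3", "t4"]

padded_dst = padded_destinations(expert_assign, max_chunks)
offsets = fixed_offsets(num_experts, max_chunks)

# Scatter into the padded buffer, then extract only the real rows.
padded = [None] * (num_experts * max_chunks * CHUNK)
for row, d in zip(hidden, padded_dst):
    padded[d] = row
extracted = [padded[d] for d in padded_dst]

print(padded_dst)  # [0, 1, 2, 512, 513]
print(offsets)     # [0, 256, 512]
```

Because the offsets depend only on num_experts and max_chunks, they can be computed once at allocation time and reused for every batch.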

Cudagraph Constraints (All Resolved)

  • No .item(), .cpu(), .tolist() — zero CPU-GPU syncs
  • No torch.zeros/ones/full/empty/arange() during capture — pre-allocate everything
  • No dynamic Python control flow from GPU values — fixed loop counts
  • Per-expert Python loops OK (fixed num_experts, unrolled at capture time)
  • Shared buffers OK (layers execute sequentially during capture and replay)

EP Configuration (DeepSeek-V4-Pro on 8×B200)

  • 256 total experts, top_k=6
  • EP=8 → 48 local experts per rank
  • experts_start_idx = rank × 32
  • max_num_tokens = 8192 (from scheduler_config.max_num_batched_tokens)
  • max_chunks_per_expert = ceil(8192 × 6 / (48 × 128)) = 8
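
The sizing arithmetic can be checked with a few lines. The helper name `padded_layout` is hypothetical; the inputs are the values listed above, plus the 512-token cap from Bugs 15 to 17 for comparison.

```python
import math

CHUNK = 128

def padded_layout(max_num_tokens, top_k, num_local_experts):
    """Return (slots, max_chunks_per_expert, total_padded_slots)."""
    slots = max_num_tokens * top_k
    max_chunks = math.ceil(slots / (num_local_experts * CHUNK))
    total_padded_slots = num_local_experts * max_chunks * CHUNK
    return slots, max_chunks, total_padded_slots

full = padded_layout(8192, 6, 48)
capped = padded_layout(512, 6, 48)

print(full)    # (49152, 8, 49152)
print(capped)  # (3072, 1, 6144)
```

The capped case reproduces the 3072 versus 6144 mismatch from Bug 16: the padded layout rounds up to whole 128-row chunks per expert, so it can be larger than max_tokens * top_k.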

Test Files

| File | Purpose |
|---|---|
| tests/layertest.py | Reference: moe_pipeline with dynamic gs, 3 experts, layer 0. Must pass (≥0.98 cosine). |
| tests/cudagraph_test.py | CuTeDSLMoERunner cudagraph capture + replay. Must pass. |
| tests/test_warmup_gs.py | Warmup gs computation. |
| tests/test_runner_vs_pipeline.py | Compare runner.run() vs moe_pipeline. |
| tests/test_scale_assembly.py | Compare cudagraph-safe vs reference scale assembly. |

Run order after any code change:

  1. python3 tests/layertest.py — must pass
  2. python3 tests/cudagraph_test.py — must pass

Repo Info

  • Kernel: sweetapi.com/biondizzle/nvfp4-megamoe-kernel (master)
  • Local: ~/dev/nvfp4-megamoe-kernel/
  • B200: /root/nvfp4-megamoe-kernel/
  • Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 (read-only)
  • Never edit on B200 directly — edit locally → commit → push → pull on B200