Current Bug: CuTeDSLMoERunner — Status & Debug History

Current Status (May 17, 2026 15:45 UTC)

vLLM container crashes during cudagraph warmup with shape mismatch. Debug build in progress.

  • layertest.py — 0.988 cosine
  • cudagraph_test.py — capture + replay works
  • Container builds, loads weights, warmup gs computed (no L2 gs=0)
  • Container crashes during cudagraph warmup: shape mismatch [49152, 7168] vs [3072, 7168]

Active investigation: The GEMM output has 49152 rows (48 experts × 8 chunks × 128 rows), but padded_dst only indexes 3072 rows (512 tokens × top_k 6). This means max_chunks_per_expert = 8 instead of the expected 1 (with max_num_tokens capped at 512). The cap of max_num_tokens to 512 is likely not reaching the runner. A debug print has been added to verify.


Bugs Found & Fixed

Bug 1: Scale Assembly — Global Swizzle vs Per-Expert Swizzle

Symptom: GEMM produced all zeros even with correct global_scale.

Root cause: _assemble_scales_cudagraph_safe called pad_and_swizzle_single() on the ENTIRE padded buffer. The kernel expects each expert's 128-row block swizzled independently.

Fix: Two-phase approach: scatter into 128-aligned positions, then per-expert swizzle and concatenate.
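Why the two orders of operation disagree can be shown with a stand-in permutation. The sketch below is not the real pad_and_swizzle_single(): `toy_swizzle` is a hypothetical row permutation whose result depends on how many rows it receives, which is the only property needed to illustrate the bug. All names other than the 128-row block size are assumptions.

```python
BLOCK = 128  # rows per expert block, as in the notes above


def toy_swizzle(rows):
    """Hypothetical stand-in for the real swizzle: view the rows as a
    (len/4, 4) grid and read it out column by column."""
    n = len(rows)
    assert n % 4 == 0
    stride = n // 4
    out = [None] * n
    for i, row in enumerate(rows):
        out[(i % 4) * stride + i // 4] = row
    return out


def swizzle_whole_buffer(rows):
    # Buggy path: one swizzle over every expert's rows at once.
    return toy_swizzle(rows)


def swizzle_per_expert(rows, block=BLOCK):
    # Fixed path: swizzle each expert's block on its own, then concatenate.
    out = []
    for start in range(0, len(rows), block):
        out.extend(toy_swizzle(rows[start:start + block]))
    return out


num_experts = 2
buffer = [(e, r) for e in range(num_experts) for r in range(BLOCK)]

whole = swizzle_whole_buffer(buffer)
per_expert = swizzle_per_expert(buffer)

# Per-expert swizzle keeps every expert's rows inside that expert's block.
blocks_ok = all(per_expert[i][0] == i // BLOCK for i in range(len(per_expert)))
# Whole-buffer swizzle moves rows of one expert into another expert's block.
mixed = any(whole[i][0] != i // BLOCK for i in range(len(whole)))
print(blocks_ok, mixed)  # True True
```

With the toy permutation, the whole-buffer variant moves expert 0's rows into expert 1's block, so a kernel reading block e for expert e would see the wrong scales.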

Bug 2: searchsorted(right=False) — Wrong Expert Assignment

Fix: Changed to right=True.
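The notes do not record the symptom, but a boundary lookup of this kind typically fails as sketched below. This is an illustration under assumptions: `expert_ends` (the exclusive end offset of each expert's slots) is a hypothetical name, and the standard-library bisect functions stand in for torch.searchsorted, with bisect_left behaving like right=False and bisect_right like right=True.

```python
from bisect import bisect_left, bisect_right

# Hypothetical layout: expert 0 owns slots 0-2, expert 1 owns 3-4, expert 2 owns 5-8.
expert_ends = [3, 5, 9]  # cumulative end offset (exclusive) per expert


def expert_of_slot_left(slot):
    # right=False behaviour: a slot equal to a boundary is assigned to the
    # expert that just ended.
    return bisect_left(expert_ends, slot)


def expert_of_slot_right(slot):
    # right=True behaviour: a slot equal to a boundary is assigned to the
    # expert that starts there.
    return bisect_right(expert_ends, slot)


slots = list(range(9))
print([expert_of_slot_left(s) for s in slots])   # [0, 0, 0, 0, 1, 1, 2, 2, 2]
print([expert_of_slot_right(s) for s in slots])  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```

Only the first slot of each expert after the first differs between the two variants, which is why this kind of error can pass casual inspection.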

Bug 3: CuTeDSL cute.compile GPU Memory Corruption — CRITICAL

Symptom: _token_indices was all zeros.

Root cause: CuTeDSL's cute.compile (JIT) corrupts GPU memory. Tensors allocated on GPU before/during JIT get zeroed.

Fix: _fill_token_indices() builds on CPU, copies to GPU. _needs_token_refill flag for GEMM JIT.

Bug 4: expert_offsets With Leading 0

Fix: Pass expert_offsets[1:num_experts + 1] to the GEMM.

Bug 5: Checkpoint input_scale Is Wrong for Activation Global Scale

Root cause: Checkpoint input_scale (~0.000286) is a calibration value. Too-small gs → block scale overflow → garbage.

Fix: compute_activation_global_scales() warmup method.

Bug 6: L1 and L2 Need Separate Activation Global Scales

Fix: compute_activation_global_scales() computes L2 gs from L1 output after SiLU*up.

Bug 7: L1 and L2 Need Separate Padded Scale Buffers

Fix: Separate _padded_x_sf_buf_l1 and _padded_x_sf_buf_l2, plus separate per-expert scale bufs.

Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT

Symptom: IndexKernel.cu:111 OOB assertion, cascading CUDA_ERROR_ASSERT (710).

Root cause: topk_ids contains global IDs (0-255), but the runner treated them as local IDs (0-31/48).

Fix: Added experts_start_idx, remap global→local, mask non-local tokens.
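A minimal sketch of the remap-and-mask step, written with plain lists rather than tensors. The function name `remap_to_local`, the example rank range, and the choice of local ID 0 as the placeholder for non-local entries are assumptions for illustration; only experts_start_idx comes from the notes.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Return (local_ids, is_local) for a flat list of global expert IDs.

    Non-local entries get local ID 0 as a safe placeholder (an assumption)
    and is_local=False, so they can be masked out instead of indexing
    out of bounds.
    """
    local_ids, is_local = [], []
    for gid in topk_ids:
        lid = gid - experts_start_idx
        ok = 0 <= lid < num_local_experts
        local_ids.append(lid if ok else 0)
        is_local.append(ok)
    return local_ids, is_local


# Hypothetical rank owning global experts 96..143 (48 local experts).
local_ids, is_local = remap_to_local([95, 96, 120, 143, 144, 7], 96, 48)
print(local_ids)  # [0, 0, 24, 47, 0, 0]
print(is_local)   # [False, True, True, True, False, False]
```

In a tensor version the same logic would be a subtraction, a range comparison, and a masked select, none of which needs a CPU sync.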

Bug 8b: .cpu() Sync Breaking Cudagraph Compatibility

Fix: Keep _token_indices on the GPU, filled by _fill_token_indices() (CPU→GPU copy).

Bug 9: padded_x_sf Buffer Too Small — Index Out of Bounds

Root cause: Buffer sized for num_experts * 128 rows, but scatter positions exceeded this.

Fix (iterative): Multiple iterations of sizing and layout fixes. See Bugs 11, 14.

Bug 10: Wrong top_k and max_num_tokens Defaults

Root cause: The runner defaulted to top_k=8 and max_num_tokens=8192, but vLLM uses top_k=6.

Fix: Pass values from deepseek_v4.py.

Bug 11: Full-Buffer Swizzle Produced Wrong GEMM Input

Symptom: L2 gs=0.0 on EP5/EP7.

Root cause: Applied swizzle to entire buffer at once; GEMM expects per-expert swizzled blocks.

Fix: Reverted to per-expert swizzle with fixed 128-row slots.

Bug 12: torch.full() During Cudagraph Capture

Symptom: cudaErrorStreamCaptureUnsupported on all 8 workers.

Root cause: torch.full() allocates a new tensor during stream capture.

Fix: Pre-allocated _l1_gsa_buf, _l2_gsa_buf, _output_buf, _row_indices_buf. Use .fill_() instead of torch.full().

Bug 13: Warmup Passed Global Expert IDs Instead of Local

Symptom: L2 gs=0.0 on EP5/EP7.

Root cause: Warmup passed global IDs (336+) to compute_activation_global_scales() which matches against local range (0..47).

Fix: Pass local IDs (0..num_experts-1).

Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable

Symptom: Model generates BOS token repeatedly (garbage logits).

Root cause: Scale assembly placed data at fixed e*128 offsets, but GEMM reads scale_a[expert_offsets[e]:...] where expert_offsets reflects real token counts (e.g., 500 for expert 0). Only 128 rows of scale data per expert → GEMM reads zeros beyond row 128.

Fix: Pad slot_hidden to num_experts * max_chunks * 128 rows with fixed layout. Pass padded_expert_offsets=[0, max_rows, 2*max_rows, ...] to GEMM. Scatter real tokens into padded positions. GEMM processes padded 128-row blocks. Extract real token outputs via l1_out[padded_dst].

Bug 15: OOM — Padded Buffers Sized for 8192 Tokens

Symptom: torch.OutOfMemoryError trying to allocate 1008 MiB.

Root cause: padded_hidden_buf + padded_activated_buf sized for max_num_tokens=8192 → 72 MB per layer × 60 layers = 4.3 GB. With model+KV at 175 GB on 178 GB GPUs, no room.

Fix: Cap max_num_tokens at cudagraph max capture size (512) for buffer pre-allocation. Reduces per-layer overhead to ~9 MB, total ~540 MB.
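The budget arithmetic above can be checked in a few lines. This is a rough sketch: it uses decimal units to match the rounded figures in the notes and ignores every other allocation on the device; `padded_buffer_total_gb` is a hypothetical helper.

```python
def padded_buffer_total_gb(per_layer_mb, num_layers):
    # Decimal units (1 GB = 1000 MB), matching the rounded figures in the notes.
    return per_layer_mb * num_layers / 1000


GPU_GB = 178           # per-GPU memory quoted above
MODEL_AND_KV_GB = 175  # model + KV cache quoted above
NUM_LAYERS = 60

headroom_gb = GPU_GB - MODEL_AND_KV_GB
before = padded_buffer_total_gb(72, NUM_LAYERS)  # max_num_tokens = 8192
after = padded_buffer_total_gb(9, NUM_LAYERS)    # capped at 512

print(f"headroom {headroom_gb} GB, before {before} GB, after {after} GB")
print(before <= headroom_gb, after <= headroom_gb)  # False True
```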

Bug 16: padded_max_slots Mismatch — Buffer Sized for max_tokens*top_k vs num_experts*max_chunks*128

Symptom: Index out of bounds during cudagraph warmup.

Root cause: padded_max_slots computed from max_tokens*top_k (3072) but total_padded_slots in run() is num_experts*max_chunks*128 (6144). Buffer too small.

Fix: Size buffers for num_experts * max_chunks * 128.

Bug 17 (ACTIVE): Shape Mismatch — GEMM Output 49152 vs Expected 3072

Symptom: RuntimeError: shape mismatch: value tensor of shape [49152, 7168] cannot be broadcast to indexing result of shape [3072, 7168]

Root cause (under investigation): GEMM output has 49152 rows = 48 experts × 8 chunks × 128. This means max_chunks_per_expert = 8, which implies the runner's max_num_tokens is still 8192 (not capped to 512). The _cudagraph_max_capture_size getattr fallback to 512 should cap it, but the GEMM output suggests otherwise. Debug print added to verify.

Hypothesis: Either (1) the min(self.max_num_tokens, 512) cap isn't working as expected, or (2) the padded_hidden buffer is somehow sized at the original 8192 budget despite the cap.
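The row counts in this bug can be reproduced with a short calculation. The formula for max_chunks_per_expert below is an assumption: it is one formula that reproduces both figures in these notes (8 chunks at 8192 tokens, 1 chunk at 512 tokens, with 48 experts and top_k=6), and the runner's actual formula is not shown here.

```python
import math

ROWS_PER_CHUNK = 128


def max_chunks_per_expert(max_num_tokens, top_k, num_experts):
    # ASSUMPTION: token slots are budgeted evenly across experts and rounded
    # up to whole 128-row chunks. The real runner may compute this differently.
    rows_per_expert = math.ceil(max_num_tokens * top_k / num_experts)
    return max(1, math.ceil(rows_per_expert / ROWS_PER_CHUNK))


def gemm_rows(max_num_tokens, top_k, num_experts):
    chunks = max_chunks_per_expert(max_num_tokens, top_k, num_experts)
    return num_experts * chunks * ROWS_PER_CHUNK


num_experts, top_k = 48, 6
print(gemm_rows(8192, top_k, num_experts))  # 49152
print(gemm_rows(512, top_k, num_experts))   # 6144
print(512 * top_k)                          # 3072
```

Under this assumption, an uncapped 8192-token budget gives exactly the 49152 rows in the error, while a working 512-token cap would give 6144 rows, the total_padded_slots value quoted in Bug 16.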

Bug 18: Cudagraph Capture — Dynamic Tensor Allocation in Scale Assembly

Symptom: cudaErrorStreamCaptureInvalidated — "capture failure must be from kernel launch".

Root cause: _assemble_scales_cudagraph_safe created torch.zeros() for padded_expert_offsets during the forward pass, which allocates during cudagraph capture.

Fix: Removed dynamic tensor creation. Use fixed layout offsets computed from Python constants.

Bug 19: Variable-Trip while Loop in Scale Assembly

Symptom: cudaErrorStreamCaptureInvalidated during cudagraph capture.

Root cause: Inner while remaining > 0 loop with variable trip count based on GPU scalar padded_rows_per_expert[e]. Python control flow using GPU values requires CPU sync.

Fix: Replaced with fixed for c in range(max_chunks) loop. Unused chunks are zero (harmless).

Bug 20: torch.zeros() in Scale Assembly Phase 1

Symptom: cudaErrorStreamCaptureInvalidated.

Root cause: padded_expert_offsets = torch.zeros(...) created during forward pass (inside _assemble_scales_cudagraph_safe).

Fix: Removed the computation entirely. Use fixed e * max_chunks * 128 + c * 128 offsets computed from Python constants.
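The fixed offsets from Bugs 19 and 20 depend only on Python integers, so they can be computed once outside capture with a fixed trip count. A minimal sketch, with `fixed_chunk_offsets` as a hypothetical helper name:

```python
ROWS_PER_CHUNK = 128


def fixed_chunk_offsets(num_experts, max_chunks):
    """Row offset of every (expert, chunk) slot, from Python ints only.

    Both loops have a fixed trip count, so nothing here reads a GPU value
    or allocates a tensor.
    """
    return [
        [e * max_chunks * ROWS_PER_CHUNK + c * ROWS_PER_CHUNK for c in range(max_chunks)]
        for e in range(num_experts)
    ]


offsets = fixed_chunk_offsets(num_experts=3, max_chunks=2)
print(offsets)  # [[0, 128], [256, 384], [512, 640]]
```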


vLLM Integration Status

Component | Status | Notes
Weight loading | | Direct NVFP4 path, no BF16 round-trip
Weight stacking | | make_b_k_major + assemble_scales_3d_side
Global→local ID remap | | experts_start_idx, mask non-local tokens
Warmup gs computation | | Per-layer, local expert IDs, L1+L2 gs
Scale assembly | ⚠️ | Fixed max_chunks layout, pending GEMM shape fix
Cudagraph capture | ⚠️ | Works in test, fails in vLLM (shape mismatch)
Model output | | Previously BOS repeat; now crashes before serving

Key Architecture: Fixed-Layout Padding

Current Design

Each expert gets max_chunks * 128 rows at fixed offset (e * max_chunks * 128).

padded_hidden: [exp0_128rows][exp0_128rows]...[exp1_128rows]...
                chunk0        chunk1           chunk0

Scatter: padded_dst = expert_assign * max_rows_per_expert + clamped_local_row
GEMM input: padded_hidden (total = num_experts * max_chunks * 128 rows)
GEMM offsets: [0, max_rows, 2*max_rows, ...] (fixed, pre-computed)
GEMM output: same total rows
Extract: l1_out[padded_dst] → only real token rows

Scale assembly:
  Phase 1: Scatter x_sf into padded_x_sf at same fixed offsets
  Phase 2: Per-expert, per-chunk swizzle (fixed loop: max_chunks iterations)
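The scatter, GEMM, and extract steps of the layout above can be walked through with plain lists. This is a sketch under assumptions: the routing data is made up, the grouped GEMM is replaced by a row-preserving stand-in, and clamping to the last row of the expert's slot is a guess at what "clamped_local_row" means.

```python
ROWS_PER_CHUNK = 128


def padded_destinations(expert_assign, local_row, max_chunks):
    """padded_dst = expert_assign * max_rows_per_expert + clamped_local_row."""
    max_rows = max_chunks * ROWS_PER_CHUNK
    # ASSUMPTION: the clamp keeps each row inside its expert's slot.
    return [e * max_rows + min(r, max_rows - 1) for e, r in zip(expert_assign, local_row)]


num_experts, max_chunks = 3, 1
total_rows = num_experts * max_chunks * ROWS_PER_CHUNK

# Hypothetical routing: four token slots, two on expert 0, two on expert 2.
expert_assign = [0, 2, 0, 2]
local_row = [0, 0, 1, 1]           # position of each slot within its expert
hidden = ["t0", "t1", "t2", "t3"]  # stand-ins for hidden-state rows

padded_dst = padded_destinations(expert_assign, local_row, max_chunks)

# Scatter into the pre-sized buffer; untouched rows stay as padding.
padded_hidden = [None] * total_rows
for dst, row in zip(padded_dst, hidden):
    padded_hidden[dst] = row

# Stand-in for the grouped GEMM: same number of rows out as in.
l1_out = [None if row is None else row.upper() for row in padded_hidden]

# Extract only the real token rows.
extracted = [l1_out[d] for d in padded_dst]
print(padded_dst)  # [0, 256, 1, 257]
print(extracted)   # ['T0', 'T1', 'T2', 'T3']
```

Note that the extracted result has one row per entry of padded_dst, not one row per padded slot, so the GEMM output and the extracted output have different row counts by design.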

Cudagraph Constraints (All Resolved)

  • No .item(), .cpu(), .tolist() — zero CPU-GPU syncs
  • No torch.zeros/ones/full/empty/arange() during capture — pre-allocate everything
  • No dynamic Python control flow from GPU values — fixed loop counts
  • Per-expert Python loops OK (fixed num_experts, unrolled at capture time)

Repo Info

  • Kernel: sweetapi.com/biondizzle/nvfp4-megamoe-kernel (master)
  • Local: ~/dev/nvfp4-megamoe-kernel/
  • B200: /root/nvfp4-megamoe-kernel/
  • Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 (read-only)
  • Never edit on B200 directly — edit locally → commit → push → pull on B200