biondizzle/nvfp4-megamoe-kernel


biondizzle 3d0b1408b4 Update CURRENT_BUG.md: Bug 21 (shared buffers), clean up status

2026-05-17 15:52:06 +00:00


Current Bug: CuTeDSLMoERunner — Status & Debug History

Current Status (May 17, 2026 15:51 UTC)

vLLM container build in progress. Previous crash was from OOM + shape mismatch. Both now fixed.

✅ layertest.py — 0.988 cosine
✅ cudagraph_test.py — capture + replay works
✅ Container builds, loads weights, warmup gs computed (no L2 gs=0)
🔧 Build #7 in progress on B200 (shared buffer fix)
❌ Haven't gotten to serving yet (crashes were during init/capture)

Latest fixes (Bugs 17→21):

Bug 17 (shape mismatch 49152 vs 3072): Root cause was capping max_num_tokens to 512 for buffer sizing, but the actual warmup runs with 8192 tokens. Reverted the cap.
Bug 21 (OOM): Instead of per-layer padded buffers (4.3 GB for 60 layers), use SHARED buffers across all runners. Only 72 MB total since layers run sequentially.

Bugs Found & Fixed

Bug 1: Scale Assembly — Global Swizzle vs Per-Expert Swizzle

Fix: Two-phase scatter + per-expert swizzle.

Bug 2: `searchsorted(right=False)` — Wrong Expert Assignment

Fix: Changed to right=True.
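The two modes differ only when a slot index lands exactly on an offset boundary. The sketch below uses Python's `bisect` module as a stand-in for `torch.searchsorted` (`bisect_left` behaves like `right=False`, `bisect_right` like `right=True`). It assumes the lookup runs over cumulative end offsets, which this log does not state, and the offsets are made-up values.

```python
from bisect import bisect_left, bisect_right

# Hypothetical cumulative end offsets: expert 0 owns slots 0-2,
# expert 1 is empty, expert 2 owns slots 3-6.
expert_ends = [3, 3, 7]

def assign_left(slot):
    # Same boundary behaviour as searchsorted(..., right=False)
    return bisect_left(expert_ends, slot)

def assign_right(slot):
    # Same boundary behaviour as searchsorted(..., right=True)
    return bisect_right(expert_ends, slot)

slots = list(range(7))
left = [assign_left(s) for s in slots]    # slot 3 lands on expert 0 (wrong)
right = [assign_right(s) for s in slots]  # slot 3 lands on expert 2 (correct)
```

With these offsets, only slot 3 differs between the two modes. That kind of boundary error is easy to miss in small tests.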

Bug 3: CuTeDSL `cute.compile` GPU Memory Corruption — CRITICAL

Symptom: _token_indices is all zeros after JIT.
Root cause: cute.compile corrupts GPU memory; tensors allocated before or during JIT get zeroed.
Fix: _fill_token_indices() builds the indices on the CPU and copies them to the GPU. _needs_token_refill triggers a refill after the GEMM JIT.

Bug 4: `expert_offsets` With Leading 0

Fix: Pass expert_offsets[1:num_experts + 1] to the GEMM.
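As a rough illustration of the slice, the sketch below builds an offsets list with a leading 0 from per-expert token counts and then drops that leading entry. The helper name `gemm_offsets` and the counts are hypothetical. That the GEMM wants one end offset per expert is inferred from the fix above, not confirmed elsewhere in this log.

```python
from itertools import accumulate

def gemm_offsets(tokens_per_expert):
    num_experts = len(tokens_per_expert)
    # Prefix sum with a leading 0: length is num_experts + 1
    expert_offsets = [0] + list(accumulate(tokens_per_expert))
    # Drop the leading 0 so the GEMM sees exactly one end offset per expert
    return expert_offsets, expert_offsets[1:num_experts + 1]

full, for_gemm = gemm_offsets([3, 0, 4])
```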

Bug 5: Checkpoint `input_scale` Is Wrong for Activation Global Scale

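Root cause: The checkpoint's input_scale (~0.000286) is a calibration value, not a usable activation global scale (gs). A gs that is too small overflows the block scales and produces garbage output.
Fix: Compute gs at warmup with the compute_activation_global_scales() method.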
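The overflow can be shown with plain arithmetic. The sketch below assumes the common NVFP4 convention in which the global scale maps the tensor's absolute maximum onto FP4 max (6) times FP8 E4M3 max (448), and each block scale is stored in FP8. Both the convention and the `observed_amax` value are assumptions for illustration; the runner's actual formula may differ.

```python
FP4_MAX = 6.0      # largest magnitude representable in FP4 E2M1
FP8_MAX = 448.0    # largest magnitude representable in FP8 E4M3

def global_scale_from_amax(amax):
    # Assumed convention: gs maps amax onto FP4_MAX * FP8_MAX
    return amax / (FP4_MAX * FP8_MAX)

def block_scale(block_amax, gs):
    # Per-block scale, stored in FP8, so it must stay <= FP8_MAX
    return block_amax / (FP4_MAX * gs)

checkpoint_gs = 0.000286   # calibration value quoted in this bug
observed_amax = 20.0       # hypothetical activation amax seen at warmup

overflowing = block_scale(observed_amax, checkpoint_gs)  # far above 448
warmup_gs = global_scale_from_amax(observed_amax)
safe = block_scale(observed_amax, warmup_gs)             # lands at the FP8 limit
```

Under these assumptions, the checkpoint value pushes the block scale past what FP8 can hold. A gs derived from the observed activations keeps it in range.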

Bug 6: L1 and L2 Need Separate Activation Global Scales

Fix: Compute L2 gs from actual L1 output after SiLU*up.

Bug 7: L1 and L2 Need Separate Padded Scale Buffers

Fix: Separate _padded_x_sf_buf_l1/_l2, separate per-expert scale bufs.

Bug 8: Global→Local Expert ID Mismatch — CUDA_ERROR_ASSERT

Symptom: IndexKernel.cu:111 OOB, cascading CUDA_ERROR_ASSERT (710) on all workers.
Root cause: topk_ids contains global IDs (0-255), but the runner treated them as local IDs (0-31 or 0-47).
Fix: Added experts_start_idx, remap global→local, mask non-local tokens.
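The remap-and-mask step can be sketched in plain Python. `remap_to_local` is a hypothetical helper, not the runner's code, which operates on GPU tensors. Clamping masked entries to 0 is an assumption made here so that later indexing stays in bounds.

```python
def remap_to_local(topk_ids, experts_start_idx, num_local_experts):
    """Return (local_ids, mask) for one rank; non-local entries are masked."""
    local_ids, mask = [], []
    for gid in topk_ids:
        lid = gid - experts_start_idx
        is_local = 0 <= lid < num_local_experts
        mask.append(is_local)
        # Clamp masked entries to 0 so later indexing cannot go out of bounds
        local_ids.append(lid if is_local else 0)
    return local_ids, mask

# Rank 1 of a 256-expert model with 32 local experts owns global IDs 32-63.
local_ids, mask = remap_to_local(
    [5, 40, 255, 63], experts_start_idx=32, num_local_experts=32
)
```

The mask must also zero out the contribution of the clamped rows. Otherwise they would be computed as if routed to local expert 0.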

Bug 8b: `.cpu()` Sync Breaking Cudagraph Compatibility

Fix: Keep _token_indices on the GPU and populate it with _fill_token_indices() (a CPU→GPU copy) instead of calling .cpu().

Bug 9: `padded_x_sf` Buffer Too Small — Index Out of Bounds

Root cause: Buffer sized for num_experts * 128 rows, but scatter positions exceeded this with real token distributions.
Fix: Iterative; see Bugs 11, 14, 16 for the final solution.

Bug 10: Wrong `top_k` and `max_num_tokens` Defaults

Root cause: Runner defaulted to top_k=8, but vLLM uses top_k=6.
Fix: Pass values from deepseek_v4.py.

Bug 11: Full-Buffer Swizzle Produced Wrong GEMM Input

Symptom: L2 gs=0.0 on EP5/EP7.
Root cause: Swizzled the entire buffer at once; the GEMM expects per-expert swizzled blocks.
Fix: Reverted to per-expert swizzle.

Bug 12: `torch.full()` During Cudagraph Capture

Symptom: cudaErrorStreamCaptureUnsupported on all 8 workers.
Root cause: torch.full() allocates a new tensor during stream capture.
Fix: Pre-allocated _l1_gsa_buf, _l2_gsa_buf, _output_buf, _row_indices_buf. Use .fill_() on them instead.

Bug 13: Warmup Passed Global Expert IDs Instead of Local

Symptom: L2 gs=0.0 on EP5/EP7.
Root cause: Warmup passed global IDs (336+) against local range (0..47).
Fix: Pass local IDs (0..num_experts-1).

Bug 14: GEMM Scale Layout Mismatch — Fixed 128-Row vs Variable

Symptom: Model generates the BOS token repeatedly (garbage logits).
Root cause: Scale assembly placed data at fixed e*128 offsets, but the GEMM reads scale_a according to the real expert_offsets. When expert 0 has 500 tokens, the GEMM reads scale_a[0:500], but only rows 0-127 hold valid data.
Fix: Fixed-layout padding. Each expert gets max_chunks * 128 rows at offset e * max_chunks * 128. Pad slot_hidden into this layout, pass the fixed padded_expert_offsets to the GEMM, and extract the real outputs via l1_out[padded_dst].

Bug 15: OOM — Padded Buffers Sized for 8192 Tokens (per-layer)

Symptom: torch.OutOfMemoryError trying to allocate 1008 MiB.
Root cause: padded_hidden_buf + padded_activated_buf at 72 MB per layer × 60 layers = 4.3 GB. Model+KV already at 175 GB on 178 GB GPUs.
Fix (attempt 1, wrong): Cap max_num_tokens at 512. Caused Bug 17.
Fix (attempt 2, correct): Shared buffers. See Bug 21.

Bug 16: `padded_max_slots` Mismatch

Root cause: Computed from max_tokens*top_k (3072), but total_padded_slots is num_experts*max_chunks*128 (6144).
Fix: Size for num_experts * max_chunks * 128.

Bug 17: Shape Mismatch — slot_hidden 49152 vs padded_dst 3072

Symptom: RuntimeError: shape mismatch: [49152, 7168] cannot be broadcast to [3072, 7168].
Root cause: The Bug 15 fix capped max_num_tokens to 512, so _token_indices and the buffers were sized for 3072 slots (512 × 6). The actual warmup/cudagraph forward pass uses 8192 tokens, so sorted_token_ids has 49152 elements (8192 × 6) and slot_hidden has 49152 rows, which do not fit in the 3072-slot buffer.
Fix: Reverted the 512 cap. Use shared buffers (Bug 21) instead.

Bug 18: Dynamic Tensor Allocation in Scale Assembly

Symptom: cudaErrorStreamCaptureInvalidated.
Root cause: torch.zeros() for padded_expert_offsets inside _assemble_scales_cudagraph_safe.
Fix: Use fixed offsets from Python constants.

Bug 19: Variable-Trip `while` Loop in Scale Assembly

Symptom: cudaErrorStreamCaptureInvalidated.
Root cause: A while remaining > 0 loop whose condition reads a GPU scalar, which forces a CPU sync.
Fix: Fixed for c in range(max_chunks) loop.

Bug 20: Another `torch.zeros()` in Scale Assembly

Fix: Removed. Use fixed e * max_chunks * 128 + c * 128 offsets.
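The offset formula in this fix, together with the fixed-trip loop from Bug 19, can be sketched as below. `fixed_chunk_offsets` is a hypothetical helper and the sizes are toy values. The point is that every offset is a Python integer known before capture, so nothing is allocated or synced inside the graph.

```python
CHUNK_ROWS = 128

def fixed_chunk_offsets(num_experts, max_chunks):
    # Every (expert, chunk) pair gets a constant row offset, so the loop
    # trip count never depends on a value that lives on the GPU.
    offsets = []
    for e in range(num_experts):
        for c in range(max_chunks):
            offsets.append(e * max_chunks * CHUNK_ROWS + c * CHUNK_ROWS)
    return offsets

offsets = fixed_chunk_offsets(num_experts=3, max_chunks=2)
```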

Bug 21: OOM (correct fix) — Shared Padded Buffers

Symptom: Same as Bug 15 (4.3 GB for per-layer padded buffers).
Root cause: Per-layer allocation of padded_hidden_buf and padded_activated_buf at 72 MB × 60 layers.
Fix: Single shared set of padded buffers across all runners. Layers execute sequentially during both capture and replay, so the same buffer is reused. Total: 72 MB (not 4.3 GB). Stored as a class-level dict keyed by device.
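A minimal sketch of the class-level dict keyed by device follows. The class names `SharedPaddedBuffers` and `Runner` are hypothetical, and a `bytearray` stands in for the GPU allocation. The real runner presumably stores tensors and may handle size changes differently.

```python
class SharedPaddedBuffers:
    # Class-level cache: one buffer per device, shared by every runner
    _by_device = {}

    @classmethod
    def get(cls, device, nbytes):
        buf = cls._by_device.get(device)
        if buf is None:
            # bytearray stands in for a GPU allocation in this sketch
            buf = bytearray(nbytes)
            cls._by_device[device] = buf
        return buf

class Runner:
    def __init__(self, layer_idx, device, nbytes):
        self.layer_idx = layer_idx
        # Every layer on the same device receives the same buffer object
        self.padded_hidden = SharedPaddedBuffers.get(device, nbytes)

runners = [Runner(i, "cuda:0", 1024) for i in range(60)]
```

This sketch ignores `nbytes` after the first allocation. Sharing is only safe because no two layers need the buffer's contents at the same time.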

vLLM Integration Status

| Component | Status | Notes |
| --- | --- | --- |
| Weight loading | ✅ | Direct NVFP4 path, no BF16 round-trip |
| Weight stacking | ✅ | `make_b_k_major` + `assemble_scales_3d_side` |
| Global→local ID remap | ✅ | `experts_start_idx`, mask non-local tokens |
| Warmup gs computation | ✅ | Per-layer, local expert IDs, L1+L2 gs |
| Scale assembly | ✅ | Fixed max_chunks layout, no dynamic allocs |
| Cudagraph compatibility | ✅ | No dynamic allocs, no CPU syncs, fixed loops |
| Buffer sizing | ✅ | Shared buffers avoid OOM |
| Model output | ❓ | Build #7 in progress; no run has reached serving without a crash yet |

Key Architecture: Fixed-Layout Padding

Current Design

Each expert gets max_chunks * 128 rows at fixed offset (e * max_chunks * 128).

padded_hidden: [exp0_chunk0][exp0_chunk1]...[exp1_chunk0]...
                   128 rows    128 rows       128 rows

Scatter: padded_dst = expert_assign * max_rows_per_expert + clamped_local_row
GEMM input: padded_hidden (total = num_experts * max_chunks * 128 rows)
GEMM offsets: [0, max_rows, 2*max_rows, ...] (fixed, pre-computed in _allocate_buffers)
GEMM output: same total rows
Extract: l1_out[padded_dst] → only real token rows

Scale assembly:
  Phase 1: Scatter x_sf into padded_x_sf at same fixed offsets
  Phase 2: Per-expert, per-chunk swizzle (fixed loop: max_chunks iterations)
  No dynamic tensor allocation, no GPU→CPU syncs

Shared buffers:
  padded_hidden and padded_activated are class-level (not per-layer).
  72 MB total instead of 4.3 GB. Layers run sequentially → safe to share.
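The scatter and extract steps above can be sketched in plain Python. `scatter_fixed_layout` is a hypothetical helper. It uses a toy region of 4 rows per expert in place of max_chunks * 128, and strings in place of hidden-state rows.

```python
def scatter_fixed_layout(rows, expert_assign, num_experts, max_rows_per_expert):
    """Place each row in its expert's fixed region; return (padded, padded_dst)."""
    padded = [None] * (num_experts * max_rows_per_expert)
    next_local_row = [0] * num_experts
    padded_dst = []
    for row, e in zip(rows, expert_assign):
        # Clamp so an overfull expert cannot spill into its neighbour's region
        local = min(next_local_row[e], max_rows_per_expert - 1)
        next_local_row[e] += 1
        dst = e * max_rows_per_expert + local
        padded[dst] = row
        padded_dst.append(dst)
    return padded, padded_dst

rows = ["t0", "t1", "t2", "t3"]
expert_assign = [0, 0, 2, 2]          # expert 1 receives no tokens
padded, padded_dst = scatter_fixed_layout(
    rows, expert_assign, num_experts=3, max_rows_per_expert=4
)
gemm_offsets = [e * 4 for e in range(3)]      # fixed: [0, max_rows, 2*max_rows]
extracted = [padded[d] for d in padded_dst]   # analogous to l1_out[padded_dst]
```

Expert 1's region stays empty but still occupies rows 4-7. That wasted space is what keeps the GEMM offsets constant across batches.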

Cudagraph Constraints (All Resolved)

No .item(), .cpu(), .tolist() — zero CPU-GPU syncs
No torch.zeros/ones/full/empty/arange() during capture — pre-allocate everything
No dynamic Python control flow from GPU values — fixed loop counts
Per-expert Python loops OK (fixed num_experts, unrolled at capture time)
Shared buffers OK (layers execute sequentially during capture and replay)

EP Configuration (DeepSeek-V4-Pro on 8×B200)

384 total experts (8 × 48; consistent with the global IDs 336+ seen on EP7 in Bug 13), top_k=6
EP=8 → 48 local experts per rank
experts_start_idx = rank × 48
max_num_tokens = 8192 (from scheduler_config.max_num_batched_tokens)
max_chunks_per_expert = ceil(8192 × 6 / (48 × 128)) = 8
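The sizing arithmetic can be checked with a few lines of Python. The helper names are hypothetical. The inputs are the values listed above, plus the 512-token cap from Bugs 15 to 17 for comparison.

```python
import math

CHUNK_ROWS = 128

def max_chunks_per_expert(max_num_tokens, top_k, num_local_experts):
    slots = max_num_tokens * top_k
    return math.ceil(slots / (num_local_experts * CHUNK_ROWS))

def total_padded_slots(max_num_tokens, top_k, num_local_experts):
    chunks = max_chunks_per_expert(max_num_tokens, top_k, num_local_experts)
    return num_local_experts * chunks * CHUNK_ROWS

# (token slots, chunks per expert, padded slots) for each configuration
full = (8192 * 6, max_chunks_per_expert(8192, 6, 48), total_padded_slots(8192, 6, 48))
capped = (512 * 6, max_chunks_per_expert(512, 6, 48), total_padded_slots(512, 6, 48))
```

With the 512-token cap, the padded layout needs 6144 rows even though only 3072 slots exist. That is the mismatch described in Bug 16.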

Test Files

| File | Purpose |
| --- | --- |
| `tests/layertest.py` | Reference: moe_pipeline with dynamic gs, 3 experts, layer 0. Must pass (≥0.98 cosine). |
| `tests/cudagraph_test.py` | CuTeDSLMoERunner cudagraph capture + replay. Must pass. |
| `tests/test_warmup_gs.py` | Warmup gs computation. |
| `tests/test_runner_vs_pipeline.py` | Compare runner.run() vs moe_pipeline. |
| `tests/test_scale_assembly.py` | Compare cudagraph-safe vs reference scale assembly. |

Run order after any code change:

python3 tests/layertest.py — must pass
python3 tests/cudagraph_test.py — must pass

Repo Info

Kernel: sweetapi.com/biondizzle/nvfp4-megamoe-kernel (master)
Local: ~/dev/nvfp4-megamoe-kernel/
B200: /root/nvfp4-megamoe-kernel/
Model: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4 (read-only)
Never edit on B200 directly — edit locally → commit → push → pull on B200
