nvfp4-megamoe-kernel/CURRENT_BUG.md
biondizzle 9e7639fba4 Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)
When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer.
Must run with --enforce-eager (data-dependent prints break Dynamo).
Gated by os.environ so dead-code-eliminated during compilation.
2026-05-18 12:51:51 +00:00

Current Bug: vLLM produces empty/garbage output with NaN logits

Status: Active debugging
Date: 2026-05-18

Symptom

  • vLLM server starts successfully, loads model, captures cudagraph
  • Chat completions return content: "" with finish_reason: "length"
  • Raw completions API returns: Out of range float values are not JSON compliant: nan
  • 50 completion tokens are generated, but every one of them has NaN logits
  • Model: DeepSeek-V4-Pro-NVFP4 on 8x B200 (TP=8)
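
The error text from the raw completions API matches what Python's standard json module raises when it is asked to serialize NaN with allow_nan=False. The sketch below reproduces that message and locates the offending field. It assumes the server serializes responses this way; the note does not say which layer raises the error. The payload shape and the first_non_finite helper are hypothetical.

```python
import json
import math

# Hypothetical response fragment: one token whose logprob came from NaN logits.
payload = {"token": "", "logprob": float("nan")}

try:
    json.dumps(payload, allow_nan=False)  # strict JSON: NaN/Inf are rejected
    error = None
except ValueError as exc:
    error = str(exc)

print(error)


def first_non_finite(obj, path="$"):
    """Return the path of the first NaN/Inf float in a nested structure, or None."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return path
    if isinstance(obj, dict):
        for key, value in obj.items():
            found = first_non_finite(value, f"{path}.{key}")
            if found:
                return found
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            found = first_non_finite(value, f"{path}[{i}]")
            if found:
                return found
    return None


print(first_non_finite(payload))
```

If this is the mechanism, the JSON error is only a downstream symptom: the chat endpoint hides the same NaN values behind an empty content string.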

Known Good

  • layertest.py passes (cosine 0.988 with BF16 reference) — MoE kernel math is correct
  • cudagraph_test.py passes — no CPU-GPU syncs, capture + replay works
  • Model weights load successfully (281K tensors)
  • Kernel compiles and runs without CUDA errors

Hypotheses

H1: Activation global scale (gs) is wrong

  • compute_activation_global_scales is called during init with random data (torch.randn)
  • Random data may produce gs values that don't represent real token distributions
  • If gs is too small, activation quantization clips and information is lost
  • If gs is too large, quantization noise dominates
  • Test: Run layertest with the exact gs the vLLM init computes, compare vs dynamic gs
  • Test: Run runner on real token data outside vLLM, check for NaN/garbage
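
The two failure modes in H1 can be illustrated with a toy quantizer. This is a sketch under strong assumptions, not the kernel's actual scheme. It uses a single scale that divides the input (real NVFP4 also applies per-block scales), rounds to the eight magnitudes representable in FP4 E2M1, and scores the result by cosine similarity against the unquantized input. The names fake_quant and cosine, and the scale factors, are hypothetical.

```python
import math
import random

# Magnitudes representable in FP4 E2M1.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant(values, scale):
    """Divide by `scale`, round to the nearest E2M1 magnitude, multiply back."""
    out = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at 6.0 (clipping)
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, v))
    return out


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in for activations
fitted = max(abs(v) for v in x) / E2M1_GRID[-1]  # scale that maps amax to 6.0

for label, scale in [
    ("fitted", fitted),
    ("64x too small", fitted / 64),  # almost everything saturates
    ("16x too large", fitted * 16),  # almost everything rounds to zero
]:
    print(label, round(cosine(x, fake_quant(x, scale)), 3))
```

In this toy setup a mis-sized scale lowers the cosine but never yields NaN, so a bad gs on its own may explain degraded output more readily than NaN logits.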

H2: Attention layer produces bad hidden states before MoE

  • If attention output is NaN or garbage, the MoE layer propagates it and may amplify it
  • The MoE kernel may be fine but receives bad input
  • Test: Hook into layer 0 forward, inspect hidden_states before MoE
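
A minimal sketch of the per-layer statistics such an inspection could print, gated on the CLAWMINE_DEBUG variable named in the commit message. To stay dependency-free it works on a plain list of floats. In the real model the values would come from a tensor, for example via hidden_states.float().flatten().tolist(), which is an assumption about how the hook is wired. The names summarize and debug_print are hypothetical.

```python
import math
import os

# Read once at import time so the check is a constant during the forward pass.
DEBUG = os.environ.get("CLAWMINE_DEBUG") == "1"


def summarize(values):
    """Return amax and mean over finite entries, plus NaN and Inf counts."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "amax": max((abs(v) for v in finite), default=0.0),
        "mean": sum(finite) / len(finite) if finite else 0.0,
        "nan": sum(1 for v in values if math.isnan(v)),
        "inf": sum(1 for v in values if math.isinf(v)),
    }


def debug_print(layer_idx, tag, values):
    """Print one line of statistics, only when CLAWMINE_DEBUG=1."""
    if DEBUG:
        print(f"[layer {layer_idx}] {tag}: {summarize(values)}")


# Example: the state entering the MoE block of layer 0 (made-up numbers).
debug_print(0, "pre-MoE", [0.5, -1.25, float("nan")])
```

Counting NaN and Inf separately keeps amax and mean meaningful: a single NaN would otherwise poison both and hide how large the remaining values are.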

H3: Weight loading mismatch between vLLM and test runner

  • vLLM loads weights via its own pipeline (DeepseekV4ForCausalLM weight_loader)
  • Test scripts load directly from safetensors
  • The weight loading patches (model. prefix strip, CKPT_KEY_SUBST) may have bugs
  • Test: Compare weights loaded by vLLM vs direct safetensors load
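
The comparison can be sketched as a key-normalization step followed by a per-tensor fingerprint check. Everything here is illustrative: the single CKPT_KEY_SUBST entry is made up, the direction of the renaming (checkpoint keys are rewritten to match vLLM parameter names) is an assumption, and tensors are stood in for by lists of floats.

```python
import hashlib
import struct

# Hypothetical substitution table; the real CKPT_KEY_SUBST entries are not shown in this note.
CKPT_KEY_SUBST = {".ffn.": ".mlp."}


def normalize_key(key):
    """Strip the 'model.' prefix, then apply the substring substitutions."""
    if key.startswith("model."):
        key = key[len("model."):]
    for old, new in CKPT_KEY_SUBST.items():
        key = key.replace(old, new)
    return key


def fingerprint(values):
    """Hash the float32 byte representation of a flat list of values."""
    return hashlib.sha256(struct.pack(f"<{len(values)}f", *values)).hexdigest()


def compare(vllm_params, ckpt_tensors):
    """Report names present on one side only and names whose contents differ."""
    a = {name: fingerprint(v) for name, v in vllm_params.items()}
    b = {normalize_key(k): fingerprint(v) for k, v in ckpt_tensors.items()}
    return {
        "only_vllm": sorted(a.keys() - b.keys()),
        "only_direct": sorted(b.keys() - a.keys()),
        "mismatch": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


vllm_params = {"layers.0.mlp.w": [1.0, 2.0], "layers.0.attn.q": [0.5]}
ckpt_tensors = {
    "model.layers.0.ffn.w": [1.0, 2.0],
    "model.layers.0.attn.q": [0.25],
    "model.extra": [0.0],
}
print(compare(vllm_params, ckpt_tensors))
```

With TP=8 each rank likely holds only a shard of most weights, so the checkpoint tensors would need to be sliced the same way before fingerprinting.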

H4: Expert routing / topk_ids mismatch

  • vLLM uses global expert IDs, runner expects local expert IDs
  • If routing is wrong, wrong experts process tokens
  • Test: Log topk_ids in vLLM vs test, verify they match expected patterns
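
The global-to-local translation that H4 is concerned with can be sketched as follows. It assumes experts are split into contiguous, equal-sized blocks per rank, which this note does not confirm. The function name and the example sizes (384 experts over 8 ranks) are hypothetical.

```python
def global_to_local(topk_ids, num_experts, num_ranks, rank):
    """Map global expert IDs to IDs local to `rank`; -1 marks experts on other ranks.

    Assumes contiguous sharding: rank r owns experts [r * per_rank, (r + 1) * per_rank).
    """
    if num_experts % num_ranks:
        raise ValueError("num_experts must divide evenly across ranks")
    per_rank = num_experts // num_ranks
    start = rank * per_rank
    local = []
    for row in topk_ids:
        for e in row:
            if not 0 <= e < num_experts:
                raise ValueError(f"expert id {e} out of range")
        local.append([e - start if start <= e < start + per_rank else -1 for e in row])
    return local


# One token routed to four experts, seen from rank 1 (hypothetical sizes).
print(global_to_local([[0, 47, 48, 383]], num_experts=384, num_ranks=8, rank=1))
```

If the runner received global IDs where it expects local ones, any ID at or above the per-rank expert count would index outside its weight tables, so a range check on topk_ids at the runner boundary is a cheap first test.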

H5: Residual connection scale issue

  • vLLM adds MoE output to residual: hidden = residual + MoE(hidden)
  • If MoE output scale is wrong, residual connection can amplify error across layers
  • Test: Run test_multilayer.py to check error accumulation
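
To see why a small scale error in the MoE output can matter, consider a toy scalar model of the residual update hidden = residual + MoE(hidden), in which the MoE is replaced by multiplication with a fixed gain. This is a deliberate simplification: the real block is nonlinear, and the gain, error size, and layer counts below are hypothetical.

```python
def relative_error_after(num_layers, gain, scale_error):
    """Relative drift between a reference stack and one whose MoE output is mis-scaled.

    Toy model: h <- h + s * gain * h, with s = 1 for the reference
    and s = 1 + scale_error for the mis-scaled stack.
    """
    ref = 1.0
    bad = 1.0
    for _ in range(num_layers):
        ref = ref + gain * ref
        bad = bad + (1.0 + scale_error) * gain * bad
    return abs(bad - ref) / abs(ref)


for layers in (1, 10, 60):
    print(layers, relative_error_after(layers, gain=0.1, scale_error=0.05))
```

In this model the drift compounds geometrically with depth, so a single-layer test that looks fine does not rule out H5; that is the case a multi-layer run is meant to cover.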

H6: Checkpoint input_scale is being used somewhere

  • MEMORY.md says checkpoint input_scale is wrong
  • The code comment says gs default is 1/2688, overridden by warmup
  • Open question: does finalize_weights set gs to the checkpoint input_scale somewhere?
  • Test: Verify which gs value is actually used at runtime

H7: DeepSeek V4 attention / RoPE bug

  • The cos_sin_cache fix and float32 patch are applied
  • But maybe attention still produces garbage for real token positions
  • Test: Single-layer test with real token positions (not random)

Test Plan (ordered by ease and likelihood)

  1. Quick: Run layertest.py on B200 — baseline, confirm kernel still works
  2. Standalone runner test with real-ish data — use runner outside vLLM, check output
  3. Inspect gs values — print the gs computed by warmup, compare with dynamic gs
  4. Multi-layer accumulation test — test_multilayer.py
  5. Weight loading comparison — dump vLLM loaded weights vs direct load
  6. Full pipeline test — test_pipeline_real_weights.py with 48 experts
  7. Attention output inspection — check hidden_states before MoE in vLLM

Progress

  • Removed NaN check (Dynamo incompatible)
  • vLLM container starts and loads model
  • Confirmed NaN logits from completions API
  • H1 (gs is wrong): not the problem when the warmup gs is used. Warmup gs gives cosine 0.988 against the BF16 reference.
    • Default gs (1/2688) gives cosine 0.621, but vLLM calls warmup during init
    • Open question: does vLLM call warmup before every forward, or only once? If gs is computed once from random data and never updated, it may not generalize to real tokens.
  • New lead: MoE kernel is fine, problem is upstream (attention, embeddings, or weight loading in vLLM path)