When CLAWMINE_DEBUG=1, amax/mean/NaN/Inf statistics are printed after each layer. This mode must run with --enforce-eager, because data-dependent prints break Dynamo. The prints are gated by os.environ, so they are dead-code-eliminated during compilation when the variable is unset.
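A minimal sketch of an environment-gated statistics helper of this kind. It uses only the standard library and treats a layer output as a flat list of floats; the real code presumably operates on tensors. The names `layer_stats` and `debug_layer` are hypothetical, and computing amax and mean over finite values only is a choice made here for illustration.

```python
import math
import os

# Read once at import time so the gate is a plain Python constant.
DEBUG = os.environ.get("CLAWMINE_DEBUG") == "1"


def layer_stats(values):
    """Return amax, mean, NaN count and Inf count for a flat list of floats.

    amax and mean are computed over finite values only (an illustrative
    choice), so a single NaN does not hide the rest of the distribution.
    """
    finite = [v for v in values if math.isfinite(v)]
    return {
        "amax": max((abs(v) for v in finite), default=0.0),
        "mean": sum(finite) / len(finite) if finite else 0.0,
        "nan": sum(1 for v in values if math.isnan(v)),
        "inf": sum(1 for v in values if math.isinf(v)),
    }


def debug_layer(name, values):
    """Print stats for one layer when CLAWMINE_DEBUG=1, otherwise do nothing."""
    if not DEBUG:
        return None
    stats = layer_stats(values)
    print(f"[{name}] amax={stats['amax']:.4g} mean={stats['mean']:.4g} "
          f"nan={stats['nan']} inf={stats['inf']}")
    return stats
```

A nonzero `nan` or `inf` count at the first layer where it appears narrows the search to that layer's inputs and weights.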
Current Bug: vLLM produces empty/garbage output with NaN logits
Status: Active debugging
Date: 2026-05-18
Symptom
- vLLM server starts successfully, loads model, captures cudagraph
- Chat completions return content: "" with finish_reason: "length"
- Raw completions API returns: Out of range float values are not JSON compliant: nan
- 50 completion tokens generated but all produce NaN logits
- Model: DeepSeek-V4-Pro-NVFP4 on 8x B200 (TP=8)
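The error text in the raw completions symptom matches what Python's `json` module raises when asked to serialize a NaN with `allow_nan=False`. The sketch below reproduces that message in isolation. That the server serializes its response this way is an assumption, and the `logprob` field name and the `sanitize` helper are hypothetical.

```python
import json
import math


def serialize_logprob(value):
    """Serialize a tiny payload with strict JSON, which rejects NaN and Inf."""
    return json.dumps({"logprob": value}, allow_nan=False)


def sanitize(value):
    """Replace a non-finite float with None so the payload stays valid JSON."""
    return value if math.isfinite(value) else None


try:
    serialize_logprob(float("nan"))
except ValueError as exc:
    # Same wording as the error seen from the raw completions API.
    error_message = str(exc)

# For debugging only: sanitizing hides the NaN, it does not fix its cause.
safe_payload = serialize_logprob(sanitize(float("nan")))
```

The serialization error is therefore a downstream effect of NaN values reaching the response, not a separate bug.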
Known Good
- layertest.py passes (cosine 0.988 with BF16 reference): MoE kernel math is correct
- cudagraph_test.py passes: no CPU-GPU syncs, capture + replay works
- Model weights load successfully (281K tensors)
- Kernel compiles and runs without CUDA errors
Hypotheses
H1: Activation global scale (gs) is wrong
- compute_activation_global_scales is called during init with random data (torch.randn)
- Random data may produce gs values that don't represent real token distributions
- If gs is too small: activation quantization clips, info loss
- If gs is too large: quantization noise dominates
- Test: Run layertest with the exact gs the vLLM init computes, compare vs dynamic gs
- Test: Run runner on real token data outside vLLM, check for NaN/garbage
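The two failure modes in H1 can be illustrated with a toy quantizer. This is a simplified sketch, not the kernel's actual scheme: it uses one global scale with the FP4 (E2M1) magnitude grid and ignores per-block scales. It assumes gs acts as a multiplier on the quantized value (x ≈ q * gs), which is consistent with "too small clips, too large adds noise" above. All names are hypothetical.

```python
import random

# Representable magnitudes of an E2M1 (FP4) value.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quant(x, scale):
    """Round x / scale to the nearest grid magnitude (clipping at 6), then rescale."""
    mag = min(abs(x) / scale, E2M1_GRID[-1])
    q = min(E2M1_GRID, key=lambda v: abs(v - mag))
    return (q if x >= 0 else -q) * scale


def rel_error(xs, scale):
    """Relative RMS error between xs and their quantize-dequantize round trip."""
    num = sum((x - fake_quant(x, scale)) ** 2 for x in xs)
    den = sum(x * x for x in xs)
    return (num / den) ** 0.5


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(4096)]
good = max(abs(x) for x in xs) / E2M1_GRID[-1]  # amax maps to the top of the grid

errors = {
    "good": rel_error(xs, good),
    "too_small": rel_error(xs, good / 16),  # most values clip at 6 * scale
    "too_large": rel_error(xs, good * 16),  # most values round to zero
}
```

Comparing `errors["good"]` against the other two entries shows how sharply a gs calibrated on the wrong distribution degrades the result, which is the concern with calibrating on torch.randn data.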
H2: Attention layer produces bad hidden states before MoE
- If attention output is NaN/garbage, MoE amplifies it
- The MoE kernel may be fine but receives bad input
- Test: Hook into layer 0 forward, inspect hidden_states before MoE
H3: Weight loading mismatch between vLLM and test runner
- vLLM loads weights via its own pipeline (DeepseekV4ForCausalLM weight_loader)
- Test scripts load directly from safetensors
- The weight loading patches (model. prefix strip, CKPT_KEY_SUBST) may have bugs
- Test: Compare weights loaded by vLLM vs direct safetensors load
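The comparison in the H3 test could start with a key-level diff before comparing values. The sketch below assumes the only key normalization is stripping a leading "model." prefix. The CKPT_KEY_SUBST rules are not reproduced here, and plain lists of floats stand in for tensors. All function names are hypothetical.

```python
def normalize_key(key, prefix="model."):
    """Strip the leading prefix, if present, so both key sets are comparable."""
    return key.removeprefix(prefix)


def diff_keys(vllm_keys, ckpt_keys):
    """Report parameter names present on only one side after normalization."""
    a = {normalize_key(k) for k in vllm_keys}
    b = {normalize_key(k) for k in ckpt_keys}
    return {
        "only_in_vllm": sorted(a - b),
        "only_in_checkpoint": sorted(b - a),
    }


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two flat weight lists."""
    if len(x) != len(y):
        raise ValueError(f"shape mismatch: {len(x)} vs {len(y)}")
    return max((abs(p - q) for p, q in zip(x, y)), default=0.0)
```

A non-empty `only_in_checkpoint` list would point at weights that were silently skipped during loading. A nonzero `max_abs_diff` on a matching key would point at a transformation applied on one path only.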
H4: Expert routing / topk_ids mismatch
- vLLM uses global expert IDs, runner expects local expert IDs
- If routing is wrong, wrong experts process tokens
- Test: Log topk_ids in vLLM vs test, verify they match expected patterns
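The mismatch described in H4 can be made concrete with a toy remapping. This sketch assumes experts are split across ranks in contiguous, equal-sized blocks. Whether the runner partitions experts this way is not confirmed by these notes. The function names and the -1 sentinel for "not on this rank" are hypothetical.

```python
def global_to_local(global_id, rank, experts_per_rank):
    """Map a global expert ID to a local index on `rank`, or -1 if not owned."""
    start = rank * experts_per_rank
    if start <= global_id < start + experts_per_rank:
        return global_id - start
    return -1


def remap_topk(topk_ids, rank, experts_per_rank):
    """Apply global_to_local to every entry of a [tokens][top_k] ID table."""
    return [
        [global_to_local(e, rank, experts_per_rank) for e in row]
        for row in topk_ids
    ]


# Toy case: 8 experts over 2 ranks; rank 1 owns global IDs 4..7.
local_ids = remap_topk([[0, 3, 4, 7]], rank=1, experts_per_rank=4)
```

If global IDs were passed through without such a remap, an ID like 7 would either index out of range or select the wrong local expert. Either outcome matches the "wrong experts process tokens" failure above.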
H5: Residual connection scale issue
- vLLM adds MoE output to residual: hidden = residual + MoE(hidden)
- If MoE output scale is wrong, residual connection can amplify error across layers
- Test: Run test_multilayer.py to check error accumulation
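A toy model of the compounding described in H5. It treats each layer as a scalar update h = h + MoE(h), with MoE(h) = gain * h in the reference path and scale_error * gain * h in the faulty path. Real layers include normalization and attention, so this only illustrates the direction of the effect. The gain and scale_error values are arbitrary.

```python
def run_layers(num_layers, moe_gain, scale_error=1.0, h0=1.0):
    """Apply h = h + scale_error * moe_gain * h for num_layers layers."""
    h = h0
    for _ in range(num_layers):
        h = h + scale_error * moe_gain * h
    return h


def relative_divergence(num_layers, moe_gain=0.1, scale_error=1.5):
    """Relative gap between the faulty and the reference hidden state."""
    ref = run_layers(num_layers, moe_gain)
    bad = run_layers(num_layers, moe_gain, scale_error)
    return abs(bad - ref) / abs(ref)


# Divergence after 1, 8 and 32 layers under the same per-layer scale error.
divergence_by_depth = {n: relative_divergence(n) for n in (1, 8, 32)}
```

In this toy the gap grows geometrically with depth, which is why a single-layer test can pass while a full-model run does not.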
H6: Input_scale from checkpoint is being used somewhere
- MEMORY.md says checkpoint input_scale is wrong
- The code comment says gs default is 1/2688, overridden by warmup
- But maybe finalize_weights sets it to checkpoint input_scale somewhere?
- Test: Verify which gs value is actually used at runtime
H7: DeepSeek V4 attention / RoPE bug
- The cos_sin_cache fix and float32 patch are applied
- But maybe attention still produces garbage for real token positions
- Test: Single-layer test with real token positions (not random)
Test Plan (ordered by ease and likelihood)
- Quick: Run layertest.py on B200 — baseline, confirm kernel still works
- Standalone runner test with real-ish data — use runner outside vLLM, check output
- Inspect gs values — print the gs computed by warmup, compare with dynamic gs
- Multi-layer accumulation test — test_multilayer.py
- Weight loading comparison — dump vLLM loaded weights vs direct load
- Full pipeline test — test_pipeline_real_weights.py with 48 experts
- Attention output inspection — check hidden_states before MoE in vLLM
Progress
- Removed NaN check (Dynamo incompatible)
- vLLM container starts and loads model
- Confirmed NaN logits from completions API
- H1 (gs is wrong): Warmup gs produces cosine 0.988 with BF16 ref. gs is NOT the problem when warmup is used.
- Default gs (1/2688) gives cosine 0.621, but vLLM calls warmup during init
- BUT: Does vLLM actually call warmup before every forward, or just once? If gs is computed from random data once and never updated, it may not generalize.
- New lead: MoE kernel is fine, problem is upstream (attention, embeddings, or weight loading in vLLM path)