nvfp4-megamoe-kernel/CURRENT_BUG.md at a83c332059bd950c09fa61e36ebec2edc6a1a8bc

biondizzle 9e7639fba4 Add layer-by-layer diagnostic prints (CLAWMINE_DEBUG=1, enforce-eager)

When CLAWMINE_DEBUG=1, prints amax/mean/NaN/Inf after each layer.
Must run with --enforce-eager (data-dependent prints break Dynamo).
Gated by os.environ, so the prints are dead-code-eliminated during compilation.
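
The gating pattern described in this commit message might look roughly like the sketch below. It is not the code from the commit: `layer_stats` and `debug_layer` are hypothetical names, the inputs are flat Python lists rather than the tensors a real model layer would produce, and computing amax and mean over finite values only is an assumption made here so that a single NaN does not hide the magnitude of the rest.

```python
import math
import os

# Read once at import time so the flag is a plain Python constant afterwards.
DEBUG = os.environ.get("CLAWMINE_DEBUG", "0") == "1"


def layer_stats(values):
    """Return amax, mean, and NaN/Inf counts for a flat sequence of floats."""
    values = list(values)
    # Assumption: amax and mean are taken over finite entries only.
    finite = [v for v in values if math.isfinite(v)]
    return {
        "amax": max((abs(v) for v in finite), default=0.0),
        "mean": sum(finite) / len(finite) if finite else float("nan"),
        "nan": sum(1 for v in values if math.isnan(v)),
        "inf": sum(1 for v in values if math.isinf(v)),
    }


def debug_layer(name, values, enabled=DEBUG):
    """Print per-layer statistics when enabled; otherwise do nothing."""
    if not enabled:
        return None
    stats = layer_stats(values)
    print(
        f"[{name}] amax={stats['amax']:.4g} mean={stats['mean']:.4g} "
        f"nan={stats['nan']} inf={stats['inf']}"
    )
    return stats
```

Calling `debug_layer("layer_0", hidden_values)` after each layer would then print one line per layer when `CLAWMINE_DEBUG=1` is set, and return immediately otherwise.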


Current Bug: vLLM produces empty/garbage output with NaN logits

Symptom

Known Good

Hypotheses

H1: Activation global scale (gs) is wrong

H2: Attention layer produces bad hidden states before MoE

H3: Weight loading mismatch between vLLM and test runner

H4: Expert routing / topk_ids mismatch

H5: Residual connection scale issue

H6: Input_scale from checkpoint is being used somewhere

H7: DeepSeek V4 attention / RoPE bug

Test Plan (ordered by ease and likelihood)

Progress
