CURRENT_BUG.md
Status: CuTeDSL kernels confirmed correct. Bug is in vLLM's attention/FFN pipeline.
Key Findings from test_model_forward_b200.py
- Warmup gs is IRRELEVANT: CuTeDSL `runner.run()` recomputes gs internally per call. Changing gs by 10x has no effect on output (cosine 0.9993).
- CuTeDSL kernels are correct: cosine 0.999 vs BF16 for q_a_proj with both warmup and dynamic gs.
- BF16 reference produces reasonable logits: logit std 3.05, and the top-5 token IDs are valid.
- The bug is NOT in our NVFP4 kernels — it's in vLLM's pipeline.
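The cosine figures above compare a kernel output against the BF16 reference. A minimal sketch of that metric in plain Python follows, assuming both outputs have already been copied to the host and flattened to lists of floats. The function name `cosine_similarity` and the sample values are illustrative and are not taken from test_model_forward_b200.py.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

# Illustrative values only (not measured data).
reference = [0.50, -1.25, 2.00, 0.75]   # stands in for the BF16 output
candidate = [0.52, -1.20, 1.95, 0.80]   # stands in for the NVFP4 output
scaled = [10.0 * x for x in reference]  # same direction, 10x magnitude

print(cosine_similarity(reference, candidate))
print(cosine_similarity(reference, scaled))
```

Cosine similarity ignores uniform scaling, so the `scaled` vector scores essentially 1.0 against the reference. If a pure magnitude error is a concern, a norm ratio or max absolute difference would have to be checked alongside it.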
Most Likely Causes
- FlashMLA kernel on Blackwell (SM100): `fused_deepseek_v4_qnorm_rope_kv_rope_quant_insert` is a C++ CUDA kernel. If it doesn't work on B200, attention output is garbage.
- Weight sharding with TP=8: The model is loaded with TP=8. If weight sharding is wrong, all projections produce garbage. But our standalone test uses the full (non-sharded) weights, which works.
- MoE produces garbage — The MoE path (384 experts, top-6) is complex. If expert routing or grouped GEMM is wrong, the output is dominated by MoE noise.
Next Steps
- Write a test that runs the FULL model (all 61 layers) in BF16 and checks the final output
- Add hooks/logging to the vLLM container to capture layer-by-layer outputs
- Test if the FlashMLA C++ kernel works on B200
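Once per-layer outputs have been captured from both the BF16 reference and the vLLM run, the first layer where they diverge narrows down which of the causes above applies. A rough sketch of that comparison follows, assuming the captures are stored as dicts mapping layer name to a flat list of floats in execution order. The layer names, the 0.99 threshold, and the helper `first_divergent_layer` are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length flat sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def first_divergent_layer(reference, candidate, threshold=0.99):
    """Return (layer_name, cosine) for the first layer whose cosine falls
    below `threshold`, or None if every layer matches.

    Relies on dict insertion order matching execution order.
    """
    for name, ref_out in reference.items():
        cos = cosine_similarity(ref_out, candidate[name])
        if cos < threshold:
            return name, cos
    return None

# Illustrative captures only (not measured data).
ref_outputs = {
    "layer_0.attn": [1.0, 2.0, 3.0],
    "layer_0.ffn": [0.5, -1.0, 2.0],
    "layer_1.attn": [1.0, 0.0, -1.0],
}
vllm_outputs = {
    "layer_0.attn": [1.01, 1.99, 3.02],
    "layer_0.ffn": [0.49, -1.02, 2.01],
    "layer_1.attn": [-0.3, 0.9, 0.2],  # deliberately wrong
}

print(first_divergent_layer(ref_outputs, vllm_outputs))
```

With these sample values the search stops at `layer_1.attn`. On the real model, errors propagate to every later layer, so only the first layer reported is informative.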