CURRENT_BUG.md

Status: CuTeDSL kernels confirmed correct. Bug is in vLLM's attention/FFN pipeline.

Warmup gs is irrelevant: CuTeDSL runner.run() recomputes gs internally on every call, so changing the warmup gs by 10x has no effect on the output (cosine 0.9993).
CuTeDSL kernels are correct: cosine 0.999 vs. the BF16 reference for q_a_proj, with both warmup and dynamic gs.
BF16 reference produces reasonable logits: logit std 3.05, and the top-5 entries are valid token IDs.
The bug is NOT in our NVFP4 kernels; it is in vLLM's pipeline.
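
For reference, the two checks quoted above (cosine against the BF16 reference, and logit std plus top-k indices) can be sketched without any framework. This is a minimal illustration, not the script that produced the numbers above: the helper names are hypothetical, inputs are assumed to be flattened lists of floats, and the use of the population standard deviation is an assumption (a framework's default std may be the sample one).

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences of floats."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: undefined cosine maps to 0
    return dot / (norm_a * norm_b)


def logit_summary(logits, k=5):
    """Return (population std, indices of the k largest logits, largest first)."""
    std = statistics.pstdev(logits)
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return std, order[:k]


# Toy data standing in for a BF16 reference output and a quantized output.
reference = [0.5, -1.0, 2.0, 0.25]
candidate = [0.51, -0.98, 1.97, 0.26]
print("cosine:", cosine_similarity(reference, candidate))
print("std, top-2:", logit_summary(reference, k=2))
```

A cosine close to 1 only says the two outputs point the same way; it is scale-invariant, which fits the observation that a 10x change in gs leaves it unchanged.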

FlashMLA kernel on Blackwell (SM100): fused_deepseek_v4_qnorm_rope_kv_rope_quant_insert is a C++ CUDA kernel. If it misbehaves on B200, the attention output is garbage.
Weight sharding with TP=8: the model is loaded with TP=8. If the sharding is wrong, every projection produces garbage. Our standalone test uses the full (non-sharded) weights and passes, so it would not catch a sharding error.
MoE produces garbage: the MoE path (384 experts, top-6) is complex. If expert routing or the grouped GEMM is wrong, MoE noise dominates the output.
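
To make the sharding hypothesis testable, one option is to check that sharded projections reproduce the full-weight result. The sketch below assumes the common column-parallel and row-parallel linear layouts and simulates all ranks in one process. The function names, shapes, and float64 dtype are illustrative assumptions, not vLLM's loader code.

```python
import torch


def column_parallel(x, weight, tp):
    """Shard weight along the output dim; concatenate the per-rank outputs."""
    shards = torch.chunk(weight, tp, dim=0)
    return torch.cat([x @ w.T for w in shards], dim=-1)


def row_parallel(x, weight, tp):
    """Shard weight along the input dim; sum the per-rank partial outputs."""
    w_shards = torch.chunk(weight, tp, dim=1)
    x_shards = torch.chunk(x, tp, dim=-1)
    return sum(xs @ ws.T for xs, ws in zip(x_shards, w_shards))


torch.manual_seed(0)
x = torch.randn(4, 64, dtype=torch.float64)   # (tokens, in_features)
w = torch.randn(32, 64, dtype=torch.float64)  # (out_features, in_features)
full = x @ w.T                                # what the standalone test computes

print("column-parallel matches:", torch.allclose(column_parallel(x, w, 8), full))
print("row-parallel matches:", torch.allclose(row_parallel(x, w, 8), full))
```

Running the same comparison on the real per-rank weights, rather than simulated chunks, would show whether a shard is mis-sliced or mis-ordered.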

Write a test that runs the FULL model (all 61 layers) in BF16 and checks the final output.
Add hooks/logging inside the vLLM container to capture layer-by-layer outputs.
Test whether the FlashMLA C++ kernel works on B200.
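
A possible starting point for the layer-by-layer capture is PyTorch's `register_forward_hook`. The sketch below records summary statistics per module on a toy model. `attach_capture_hooks` and `name_filter` are hypothetical names, and it assumes the model runs eagerly; under CUDA graph capture or compilation the hooks may not fire on every step.

```python
import torch
import torch.nn as nn


def attach_capture_hooks(model, name_filter=lambda name: True):
    """Register forward hooks that record per-module output statistics.

    Returns (stats, handles). stats is filled in as the model runs;
    call handle.remove() on each handle when done.
    """
    stats = {}
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Some modules return tuples or lists; inspect the first element.
            out = output[0] if isinstance(output, (tuple, list)) else output
            if not torch.is_tensor(out):
                return
            t = out.detach().float()
            stats[name] = {
                "mean": t.mean().item(),
                "std": t.std().item(),
                "absmax": t.abs().max().item(),
                "has_nan": bool(torch.isnan(t).any().item()),
            }
        return hook

    for name, module in model.named_modules():
        if name and name_filter(name):  # skip the root module (empty name)
            handles.append(module.register_forward_hook(make_hook(name)))
    return stats, handles


# Toy stand-in for the real model; the real module names will differ.
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
stats, handles = attach_capture_hooks(toy)
toy(torch.randn(2, 8))
for name, s in stats.items():
    print(name, s)
for h in handles:
    h.remove()
```

Collecting the same statistics from the BF16 run and the failing run, then diffing them in layer order, should point at the first module whose output diverges.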