CURRENT_BUG.md

Status: CuTeDSL kernels confirmed correct. Bug is in vLLM's attention/FFN pipeline.

Key Findings from test_model_forward_b200.py

  1. Warmup gs is IRRELEVANT — CuTeDSL runner.run() recomputes gs internally per-call. Changing gs by 10x has no effect on output (cosine 0.9993).
  2. CuTeDSL kernels are correct — cosine 0.999 vs BF16 for q_a_proj with both warmup and dynamic gs.
  3. BF16 reference produces reasonable logits — logit std 3.05, top5 valid token IDs.
  4. The bug is NOT in our NVFP4 kernels — it's in vLLM's pipeline.
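
The two metrics quoted above (cosine similarity against a BF16 reference, and logit standard deviation) can be sketched in plain Python. This is a minimal illustration rather than the code in test_model_forward_b200.py: the helper names and the toy vectors are hypothetical, and real tensors would need to be flattened to CPU floats first. The standard deviation here is the population form (divides by N), which may differ slightly from what the test script reports.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def population_std(values):
    """Population standard deviation (divides by N), e.g. for a logit vector."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Toy data standing in for a BF16 reference and a quantized kernel output.
reference = [1.0, -2.0, 3.0, 0.5]
quantized = [1.02, -1.97, 2.95, 0.52]
score = cosine_similarity(reference, quantized)
```

A score close to 1.0 means the two outputs point in nearly the same direction, which is the sense in which the kernels are called correct above.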

Most Likely Causes

  1. FlashMLA kernel on Blackwell (SM100): fused_deepseek_v4_qnorm_rope_kv_rope_quant_insert is a C++ CUDA kernel. If it doesn't work on B200, attention output is garbage.
  2. Weight sharding with TP=8: the model is loaded with TP=8. If weight sharding is wrong, all projections produce garbage. However, our standalone test uses the full (non-sharded) weights and works, so it does not exercise the sharding path.
  3. MoE produces garbage — The MoE path (384 experts, top-6) is complex. If expert routing or grouped GEMM is wrong, the output is dominated by MoE noise.
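
One way to probe the sharding hypothesis is a round-trip check: split a full weight into TP shards, gather them back in rank order, and compare against the original. The sketch below is a toy version under stated assumptions. It assumes a column-parallel split, which may not match how vLLM shards any given projection, and `shard_columns` / `gather_columns` are hypothetical helpers, not vLLM APIs.

```python
def shard_columns(matrix, tp_size):
    """Split a row-major matrix (list of rows) into tp_size column shards."""
    cols = len(matrix[0])
    if cols % tp_size != 0:
        raise ValueError("column count must be divisible by tp_size")
    width = cols // tp_size
    return [
        [row[rank * width:(rank + 1) * width] for row in matrix]
        for rank in range(tp_size)
    ]

def gather_columns(shards):
    """Concatenate column shards back into one matrix, in rank order."""
    rows = len(shards[0])
    return [[v for shard in shards for v in shard[i]] for i in range(rows)]

# Toy 2x16 weight, split across 8 ranks (assumed column-parallel layout).
full = [[float(r * 16 + c) for c in range(16)] for r in range(2)]
shards = shard_columns(full, tp_size=8)
roundtrip_ok = gather_columns(shards) == full

# Swapping two ranks simulates a mis-ordered shard; the check should catch it.
swapped = [shards[1], shards[0]] + shards[2:]
swap_detected = gather_columns(swapped) != full
```

Applied to the real checkpoint, the same idea would compare each rank's loaded shard against the matching slice of the full weight used by the standalone test.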

Next Steps

  • Write a test that runs the FULL model (all 61 layers) in BF16 and checks the final output
  • Add hooks/logging to the vLLM container to capture layer-by-layer outputs
  • Test if the FlashMLA C++ kernel works on B200
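
Once per-layer outputs are captured from both the BF16 reference and the vLLM run, the first diverging layer can be located with a simple scan. The sketch below assumes the captures are already available as dicts mapping layer name to a flattened list of floats, in forward order. With PyTorch modules such captures could come from `register_forward_hook`, but how to attach hooks inside the vLLM container is left open here. The layer names, the threshold, and `first_divergent_layer` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def first_divergent_layer(reference, candidate, threshold=0.99):
    """Return (layer_name, score) for the first layer whose cosine similarity
    falls below threshold, or None if every layer matches.

    Dicts preserve insertion order, so layers are scanned in capture order.
    """
    for name, ref_out in reference.items():
        score = cosine(ref_out, candidate[name])
        if score < threshold:
            return name, score
    return None

# Toy captures: layer 0 agrees, layers 1 and 2 do not.
reference = {
    "layers.0": [1.0, 2.0, 3.0],
    "layers.1": [0.5, -1.0, 2.0],
    "layers.2": [2.0, 0.0, -1.0],
}
candidate = {
    "layers.0": [1.01, 1.99, 3.02],
    "layers.1": [2.0, 1.0, -0.5],
    "layers.2": [-1.0, 3.0, 0.2],
}
result = first_divergent_layer(reference, candidate)
```

Reporting only the first failing layer matters because every layer after it inherits the corrupted input, so later mismatches say little about where the bug lives.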