Files
nvfp4-megamoe-kernel/CURRENT_BUG.md

1.7 KiB

CURRENT_BUG.md

Status: Fix committed, needs vLLM container rebuild + test

Root Cause Found: Wrong Activation Global Scale

All CuTeDSL NVFP4 kernels are correct (verified with standalone test, cosine 0.989-0.995 vs BF16 reference). The bug was in the vLLM integration, NOT our kernels.

The CuTeDSLNvFp4LinearKernel.process_weights_after_loading (in vllm/kernels/linear/nvfp4/cutedsl.py) was using the checkpoint's input_global_scale_inv as the activation global scale. This is a calibration-time value that doesn't match what quantize_activation_nvfp4 expects at runtime, producing garbage output.

Fix

Changed cutedsl.py to use warmup-based activation global scale computation (same as standalone test):

# BEFORE (broken):
runner._activation_global_scale = input_global_scale_inv  # wrong!

# AFTER (fixed):
runner.compute_activation_global_scale(sample)  # warmup-based, correct

Next Steps

  1. Rebuild vLLM container on B200 with this fix
  2. Run build_and_run.sh
  3. Test with curl chat completions
  4. If still broken, also fix MoE runner warmup (currently using checkpoint input_scale mean)

Standalone Test Results (test_full_layer_b200.py)

q_a_proj:     cosine=0.994599 ✅
kv_proj:      cosine=0.994777 ✅
q_b_proj:     cosine=0.994834 ✅
wo_b_proj:    cosine=0.994768 ✅
comp.kv_proj: cosine=0.994152 ✅
comp.gate:    cosine=0.994766 ✅
shared_expert: cosine=0.989745 ✅

Remaining Issues

  1. MoE warmup: compute_activation_global_scales is never called on the MoE runner. Currently uses checkpoint input_scale mean. Needs warmup too.
  2. Shared expert in vLLM: Check if the vLLM shared expert path uses CuTeDSL or falls through to broken vLLM kernels.