1.7 KiB
1.7 KiB
CURRENT_BUG.md
Status: Fix committed, needs vLLM container rebuild + test
Root Cause Found: Wrong Activation Global Scale
All CuTeDSL NVFP4 kernels are correct (verified with standalone test, cosine 0.989-0.995 vs BF16 reference). The bug was in the vLLM integration, NOT our kernels.
The CuTeDSLNvFp4LinearKernel.process_weights_after_loading (in vllm/kernels/linear/nvfp4/cutedsl.py) was using the checkpoint's input_global_scale_inv as the activation global scale. This is a calibration-time value that doesn't match what quantize_activation_nvfp4 expects at runtime, producing garbage output.
Fix
Changed cutedsl.py to use warmup-based activation global scale computation (same as standalone test):
# BEFORE (broken):
runner._activation_global_scale = input_global_scale_inv # wrong!
# AFTER (fixed):
runner.compute_activation_global_scale(sample) # warmup-based, correct
Next Steps
- Rebuild vLLM container on B200 with this fix
- Run
build_and_run.sh - Test with curl chat completions
- If still broken, also fix MoE runner warmup (currently using checkpoint input_scale mean)
Standalone Test Results (test_full_layer_b200.py)
q_a_proj: cosine=0.994599 ✅
kv_proj: cosine=0.994777 ✅
q_b_proj: cosine=0.994834 ✅
wo_b_proj: cosine=0.994768 ✅
comp.kv_proj: cosine=0.994152 ✅
comp.gate: cosine=0.994766 ✅
shared_expert: cosine=0.989745 ✅
Remaining Issues
- MoE warmup:
compute_activation_global_scalesis never called on the MoE runner. Currently uses checkpoint input_scale mean. Needs warmup too. - Shared expert in vLLM: Check if the vLLM shared expert path uses CuTeDSL or falls through to broken vLLM kernels.