Server running on B200 port 8000 with full NVFP4→vLLM bridge. All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.
2026-05-10
DeepSeek V4 Pro NVFP4 — vLLM Serving Debug Session
- Quantization completed successfully (Run 11, 881GB NVFP4)
- Spent the day debugging vLLM serving of the modelopt NVFP4 checkpoint
- Key finding: modelopt and vllm were never integrated for NVFP4 on DeepSeek V4
- NVIDIA themselves haven't gotten this far — we're in uncharted territory
What we fixed:
- Expert weight name mapping (gate_proj→w1, up_proj→w3, down_proj→w2)
- mlp→ffn module naming
- Attention: self_attn→attn.mla_attn, kv_proj→wkv, etc.
- Compressor: kv_proj→wkv, gate_proj→wgate
- kv_norm moved from compressor to attention level
- Class attribute patching (hf_to_vllm_mapper)
- Source file patching (workers are separate processes)
- E2M1 FP4→BF16 unpacking for stacked attention params
- Skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear
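The renames listed above can be sketched as an ordered list of substring rules. This is a minimal illustration, not the actual `hf_to_vllm_mapper` code: the `RENAME_RULES` table, the `remap_name` helper, and the example expert key are hypothetical, and placing the compressor rules first (so `compressor.gate_proj` is not caught by the expert `gate_proj`→`w1` rule) is an assumption of this sketch.

```python
# Ordered (substring, replacement) rules taken from the list above.
# Compressor rules run first so compressor.gate_proj is not rewritten
# by the expert gate_proj -> w1 rule (ordering is an assumption here).
RENAME_RULES = [
    ("compressor.kv_proj", "compressor.wkv"),
    ("compressor.gate_proj", "compressor.wgate"),
    (".mlp.", ".ffn."),
    ("gate_proj", "w1"),
    ("up_proj", "w3"),
    ("down_proj", "w2"),
]


def remap_name(name: str) -> str:
    """Apply every rename rule in order to a checkpoint tensor name."""
    for old, new in RENAME_RULES:
        name = name.replace(old, new)
    return name


# Hypothetical expert key, for illustration only.
print(remap_name("model.layers.3.mlp.experts.7.gate_proj.weight"))
print(remap_name("model.layers.0.self_attn.compressor.gate_proj.weight"))
```

Rule order matters whenever one source name is a substring of another, which is the case for the compressor and expert `gate_proj` tensors.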
What we abandoned:
- mega_moe: No NVFP4 kernel exists, format mismatch (16-col vs 32-col blocks)
- Runtime monkey-patching: Workers don't inherit patches
Open issues (stop point):
- MergedColumnParallelLinear + NVFP4 incompatibility — ModelOptNvFp4Config only handles Linear, not MergedColumn. Weight param is bf16 (should be uint8), no weight_scale registered for stacked params
- Unknown params from modelopt (compressor.position_bias) crash loading
- Current approach (unpack uint8→bf16, skip scales) loses calibration-optimized scales for attention weights
Repo state:
- All code/patches/docker-compose synced and committed on modelopt-nvfp4 branch
- README fully updated with vLLM serving run history, open issues, bug list
- B200 node at 45.76.247.107, weights at /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4
vLLM NVFP4 Serving — Second Session (16:28–19:35 UTC)
Mike gave autonomous work instructions. Key directive: use weights AS-IS (NVFP4), do NOT convert to MXFP4. Try FusedMoE first, then if stuck, build a mega_moe NVFP4 kernel from scratch.
Major breakthroughs (S11→S14 progress):
Key insight: vLLM attention forward bypasses quant_method, uses deepseek_v4_fp8_einsum directly
- The attention code reads `self.wo_a.weight` (expects fp8) and `self.wo_a.weight_scale_inv` directly
- NVFP4 uint8 weights are incompatible with this FP8 kernel
- Solution: NVFP4→bf16→FP8 dequantize/requant at load time for attention layers
S12 fixes applied (weight loading now succeeds to 94%):
- Substr mapping fix: Removed `.mla_attn.` prefix from attention projections. The model has `fused_wqa_wkv`, `wq_b`, `wo_a`, `wo_b` at `attn.*` level, not `attn.mla_attn.*`. The stacking code then correctly maps `attn.wq_a` → `attn.fused_wqa_wkv`.
- Skip patterns fix: Only skip compressor scale tensors (compressor uses `UnquantizedLinearMethod` with `quant_config=None`). Attention and shared expert scales now correctly load via stacking logic.
- Suffix mapping fix: Removed `"head.weight": "lm_head.weight"` which caused `lm_head.weight` → `lm_lm_head.weight` doubling.
- Resilient loading: Unknown params (e.g., `compressor.position_bias`) silently skipped.
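The suffix doubling and the resilient-loading behavior can be reproduced in a few lines. This is an illustrative sketch only: `apply_suffix_map` and `split_known_params` are hypothetical helpers showing how a plain suffix replacement yields `lm_lm_head.weight`, not the mapper or loader code in vLLM.

```python
def apply_suffix_map(name, suffix_map):
    """Replace the first matching suffix, as a naive suffix mapper would."""
    for old, new in suffix_map.items():
        if name.endswith(old):
            return name[: len(name) - len(old)] + new
    return name


def split_known_params(checkpoint_names, model_params):
    """Separate names the model knows from names to skip instead of raising."""
    loaded, skipped = [], []
    for name in checkpoint_names:
        (loaded if name in model_params else skipped).append(name)
    return loaded, skipped


# "lm_head.weight" already ends with "head.weight", so the rule fires on it.
buggy_map = {"head.weight": "lm_head.weight"}
print(apply_suffix_map("lm_head.weight", buggy_map))  # lm_lm_head.weight
print(apply_suffix_map("lm_head.weight", {}))         # lm_head.weight
```

Removing the rule is enough here because the checkpoint key is already the name the model expects.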
S13 — Weight loading SUCCESS (32 seconds!)
- All 95 safetensors loaded without KeyError
- New error: `MergedColumnParallelLinear` has no `weight_scale_inv` (FP8 attribute)
S13.5 — o_a_proj discovery:
- modelopt did NOT quantize `o_a_proj`: it's bf16 in the checkpoint (no scales)
- But vLLM creates `wo_a` with NVFP4 quant (uint8 weight + scales)
- Fix: convert bf16→FP8 directly at load time, set weight_scale_inv
S14 — NVFP4→FP8 post-load conversion approach:
- Added `_convert_nvfp4_attention_to_fp8()` and `_convert_nvfp4_module_to_fp8()` methods to `DeepseekV4Model`
- Converts all uint8 NVFP4 attention weights (fused_wqa_wkv, wq_b, wo_a, wo_b, gate_up_proj) to FP8 at load time
- Steps: unpack E2M1 FP4→bf16, dequantize with block/global scales, requantize to FP8 e4m3, set weight_scale_inv
- For o_a_proj (bf16, no scales): convert directly bf16→FP8
- For compressor fused_wkv_wgate: stays bf16 (UnquantizedLinearMethod)
- For MoE experts: handled natively by ModelOptNvFp4FusedMoE
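The dequantize/requantize steps can be sketched in plain Python on one row of already-decoded E2M1 values. Several things here are assumptions rather than the patch's actual code: the helper names, the use of a single per-tensor FP8 scale (the real kernel may expect block-wise scales), and the convention that `weight_scale_inv` is the multiplier that recovers the weight (w ≈ q × weight_scale_inv). The 16-element block size and the e4m3 maximum of 448 are properties of the formats.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
NVFP4_BLOCK = 16      # NVFP4 shares one block scale per 16 values


def dequant_nvfp4_row(values, block_scales, global_scale):
    """values: decoded E2M1 floats; one block scale per 16 consecutive values."""
    assert len(values) == len(block_scales) * NVFP4_BLOCK
    return [
        v * block_scales[i // NVFP4_BLOCK] * global_scale
        for i, v in enumerate(values)
    ]


def requant_fp8_per_tensor(weights):
    """Return (scale, scaled values clamped to the e4m3 range).

    A real implementation would then cast the scaled values to
    float8_e4m3fn; that rounding step is omitted in this sketch.
    """
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]
    return scale, scaled


row = dequant_nvfp4_row([6.0] * 16 + [1.0] * 16, [0.5, 2.0], 0.25)
scale, scaled = requant_fp8_per_tensor(row)
```

Under the stated convention, `scale` is the value that would be stored as `weight_scale_inv`.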
Bug found: E2M1 LUT indexing out of range
- FP4 4-bit values are 0-15 (bit 3 = sign, bits 0-2 = magnitude)
- LUT has 8 entries (magnitudes 0-7), but code was indexing with full 4-bit value (0-15) → CUDA assert
- Fix: mask with `& 0x07` for magnitude index, apply sign from bit 3 separately
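A minimal sketch of the corrected decode, in plain Python rather than the CUDA-side tensor code. The eight magnitudes are the standard E2M1 values; the helper names and the packing order (low nibble holds the first element) are assumptions that should be checked against the checkpoint.

```python
# Magnitudes for the 3 low bits of an E2M1 value (8 entries, indices 0-7).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1_nibble(nibble):
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 index the LUT."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x07]  # never index with the full 0-15 value
    return -magnitude if nibble & 0x08 else magnitude


def unpack_fp4_bytes(packed):
    """Unpack two FP4 values per byte, low nibble first (assumed order)."""
    out = []
    for byte in packed:
        out.append(decode_e2m1_nibble(byte & 0x0F))
        out.append(decode_e2m1_nibble(byte >> 4))
    return out


print(unpack_fp4_bytes(bytes([0x2F])))  # [-6.0, 1.0]
```

Indexing the 8-entry table with the unmasked nibble is what goes out of bounds for any negative value (codes 8-15).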
Bug found: method placement inside Python class
- `_convert_nvfp4_attention_to_fp8` was being placed at top level (0 indent) instead of inside `DeepseekV4Model`
- The class actually ends at `finalize_mega_moe_weights()` (line ~1600), followed by top-level `hc_head` function
- Had to insert methods BEFORE the `@torch.compile` decorator that marks the class boundary
Bug found: logger not available in method
- `logger.info_once()` isn't accessible inside the conversion methods
- Replaced with `print(f"...")` for now
Current status (as of 19:35 UTC):
- Weight loading + NVFP4→FP8 conversion code is in place
- Last test was running (loading 880GB checkpoint)
- E2M1 sign handling fix applied but NOT YET TESTED
- Need to fix `logger`→`print` issue
- After load succeeds: FusedMoE expert weight handling needs verification
- If FusedMoE fails: need to build mega_moe NVFP4 kernel
Key files on B200 node:
- Patch: `/root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py`
- Docker: `docker compose up -d` (TP=8, no mega_moe, FLASHINFER_TRTLLM attn)
- Weights: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/`
Architecture decisions:
- NVFP4→FP8 for attention/shared_experts (requant, preserves FP8 kernel compat)
- BF16 for compressor (UnquantizedLinearMethod, no quant_config)
- Native NVFP4 for MoE experts (ModelOptNvFp4FusedMoE handles it)
- UnquantizedLinearMethod as no-op quant_method (attention forward bypasses it anyway)
vLLM NVFP4 Serving — Third Session (23:05+ UTC)
Current state of the B200 node:
- Docker container ran 27 min ago and crashed with `BFloat16 != Float8_e4m3fn`
- Uncommitted changes to `patches/deepseek_v4.py` (the _convert_nvfp4_post_load methods)
- Repo on `modelopt-nvfp4` branch, last commit `db16be8`
Crash analysis (S15 — BFloat16 != Float8_e4m3fn):
Weight loading succeeds (95/95, 330s). Post-load conversion reports: 122 layers → FP8, 183 → BF16. MoE setup runs. Crash during profile_run/_dummy_run.
Root cause: _convert_nvfp4_post_load converts fused_wqa_wkv to FP8 and sets quant_method = UnquantizedLinearMethod(). The attention forward calls self.fused_wqa_wkv(hidden_states) which goes through UnquantizedLinearMethod.forward() → F.linear(bf16_input, fp8_weight) → dtype mismatch.
Key insight about the attention forward paths:
- `wo_a`: Attention code reads `self.wo_a.weight` and `self.wo_a.weight_scale_inv` DIRECTLY, passes to `deepseek_v4_fp8_einsum`. This bypasses `quant_method`. FP8 conversion works here.
- `fused_wqa_wkv`: Called via `self.fused_wqa_wkv(hidden_states)` → `MergedColumnParallelLinear.forward()` → `quant_method.forward()`. Cannot be FP8 with UnquantizedLinearMethod.
- `wq_b`, `wo_b`: Called via normal `.forward()`. Need BF16 + UnquantizedLinearMethod.
- `compressor.fused_wkv_wgate`: Called via `torch.mm(hidden_states, weight.T, out_dtype=torch.float32)` DIRECTLY. Needs BF16 weight; currently uint8 (not in any conversion set!).
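One reading of the forward-path analysis above is a per-projection target dtype table. The sketch below is hypothetical: `CONVERSION_PLAN`, `target_dtype`, and the example parameter paths are illustrative names, not identifiers from the patch.

```python
# Target weight dtype per projection, following the forward-path notes above.
CONVERSION_PLAN = {
    "wo_a": "fp8",                         # read directly by the FP8 einsum
    "fused_wqa_wkv": "bf16",               # goes through quant_method.forward()
    "wq_b": "bf16",
    "wo_b": "bf16",
    "compressor.fused_wkv_wgate": "bf16",  # used directly in torch.mm
}


def target_dtype(param_name):
    """Return the planned dtype for a weight, or None if it is left untouched."""
    # Check longer module names first so nested names win over shorter ones.
    for module in sorted(CONVERSION_PLAN, key=len, reverse=True):
        if param_name.endswith("." + module + ".weight"):
            return CONVERSION_PLAN[module]
    return None


# Hypothetical parameter paths, for illustration only.
print(target_dtype("layers.0.attn.wo_a.weight"))                        # fp8
print(target_dtype("layers.0.attn.compressor.fused_wkv_wgate.weight"))  # bf16
print(target_dtype("layers.0.ffn.experts.w13_weight"))                  # None
```

Keeping the plan in one table makes it harder for a projection such as `compressor.fused_wkv_wgate` to fall outside every conversion set.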
Critical finding from safetensors: o_a_proj.weight is BF16 (modelopt did NOT quantize it). So wo_a weight is already BF16, not NVFP4. The post-load conversion code's dtype != uint8 check skips it. This means wo_a.weight stays BF16 and wo_a.weight_scale_inv is never created. When deepseek_v4_fp8_einsum tries to read it as FP8 → crash.
Wait, but the log says 122 → FP8. 61 layers × 2 (fused_wqa_wkv + wo_a) = 122. If wo_a.weight is BF16 and gets skipped, only 61 → FP8. The 122 count means wo_a IS being converted somehow. Hypothesis: ModelOptNvFp4LinearMethod.create_weights() creates wo_a.weight as uint8. When the BF16 checkpoint data is loaded into the uint8 param, the weight_loader might be casting it, or the param might be updated to BF16. Need to verify.
Unfixed bugs from S14 (still present):
- E2M1 sign handling fix applied but NOT TESTED
- `logger`→`print` issue in conversion methods
Compressor fused_wkv_wgate — PENDING CRASH:
- NOT in any conversion set (fp8_proj_names, bf16_proj_names, bf16_shared_names)
- Weight is uint8 after loading (NVFP4 packed)
- Forward uses `torch.mm(hidden_states, weight.T, out_dtype=torch.float32)` directly
- uint8 × BF16 would crash with a different error than the current one
- Needs BF16 dequantization in post-load conversion
Checkpoint key format (verified from safetensors):
- `model.layers.0.self_attn.q_a_proj.weight`: uint8
- `model.layers.0.self_attn.q_a_proj.weight_scale`: float8_e4m3fn (block scale)
- `model.layers.0.self_attn.q_a_proj.weight_scale_2`: float32 (per-tensor)
- `model.layers.0.self_attn.q_a_proj.input_scale`: float32
- `model.layers.0.self_attn.o_a_proj.weight`: BF16 (NOT quantized by modelopt)
- `model.layers.0.self_attn.o_b_proj.weight`: uint8
- `model.layers.0.self_attn.kv_proj.weight`: uint8
- `model.layers.0.self_attn.compressor.kv_proj.weight`: uint8
- `model.layers.0.self_attn.compressor.gate_proj.weight`: uint8
- `model.layers.0.self_attn.compressor.position_bias`: BF16 (unknown param, skipped)
- Expert scales: `.weight_scale`, `.weight_scale_2`, `.input_scale` (NOT `.scale`)
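Dtypes like these can be checked without loading any tensor data, since a safetensors file starts with an 8-byte little-endian header length followed by a JSON header. The sketch below uses only the standard library; `read_safetensors_dtypes` is a hypothetical helper name and the shard filename in the usage comment is a placeholder.

```python
import json
import struct


def read_safetensors_dtypes(path):
    """Return {tensor_name: dtype_string} from a safetensors file header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    return {
        name: meta["dtype"]
        for name, meta in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
    }


# Usage (placeholder filename):
# for name, dtype in read_safetensors_dtypes("some-shard.safetensors").items():
#     if ".layers.0.self_attn." in name:
#         print(name, dtype)
```

The header reports safetensors dtype names such as `U8`, `F8_E4M3`, `F32`, and `BF16` rather than the torch spellings used in the list above.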
FusedMoE NVFP4 status:
- `ModelOptNvFp4FusedMoE` creates proper uint8 weights + float8_e4m3fn block scales + float32 per-tensor/input scales
- `process_weights_after_loading` calls `convert_to_nvfp4_moe_kernel_format` then `make_nvfp4_moe_kernel`
- Uses `cutlass_fp4_gemm` via nvfp4 backend
- Warning: `w1_weight_scale_2 must match w3_weight_scale_2`. modelopt gives different global scales to w1 and w3, but FusedMoE uses a single w13_weight_scale_2 (takes w1's). Minor accuracy impact.
- `expert_dtype: fp4` in config: causes weight mapper to use `.scale` → `.weight_scale` regex, but checkpoint already uses `.weight_scale` directly, so regex is a no-op. Correct behavior.
- `scale_fmt: "ue8m0"` in config: used by attention FP8 einsum. Correct for NVFP4.
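The size of the w1/w3 scale mismatch can be estimated per expert. Assuming dequantization is the product of the FP4 value, the block scale, and the global scale, reusing w1's global scale for w3 rescales every w3 value by the ratio of the two scales. `shared_scale_error` is a hypothetical helper and the scale values below are made up for illustration.

```python
def shared_scale_error(w1_scale_2, w3_scale_2):
    """Relative error on dequantized w3 values when w1's global scale is reused.

    Assumes dequant = fp4_value * block_scale * global_scale, so swapping the
    global scale multiplies every w3 value by w1_scale_2 / w3_scale_2.
    """
    return abs(w1_scale_2 / w3_scale_2 - 1.0)


# Made-up scales: a 5% difference in global scale gives a 5% error on w3.
print(shared_scale_error(1.05, 1.0))
```

Running this over the real `weight_scale_2` pairs in the checkpoint would show whether "minor" holds for every expert or only on average.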
Config verification:
- `compress_ratios` ✅ (copied from BF16 source)
- `scale_fmt: "ue8m0"` ✅ (added by us)
- `rope_parameters` ✅ (flattened)
- `expert_dtype: fp4` ✅ (original, correct for weight mapper regex)