deepseek-v4-quant/memory/2026-05-10.md
biondizzle 02b8ea536f Update MEMORY.md and memory files with vLLM NVFP4 serving progress
Server running on B200 port 8000 with full NVFP4→vLLM bridge.
All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.
2026-05-11 02:02:49 +00:00

2026-05-10

DeepSeek V4 Pro NVFP4 — vLLM Serving Debug Session

  • Quantization completed successfully (Run 11, 881GB NVFP4)
  • Spent the day debugging vLLM serving of the modelopt NVFP4 checkpoint
  • Key finding: modelopt and vllm were never integrated for NVFP4 on DeepSeek V4
  • NVIDIA themselves haven't gotten this far — we're in uncharted territory

What we fixed:

  • Expert weight name mapping (gate_proj→w1, up_proj→w3, down_proj→w2)
  • mlp→ffn module naming
  • Attention: self_attn→attn.mla_attn, kv_proj→wkv, etc.
  • Compressor: kv_proj→wkv, gate_proj→wgate
  • kv_norm moved from compressor to attention level
  • Class attribute patching (hf_to_vllm_mapper)
  • Source file patching (workers are separate processes)
  • E2M1 FP4→BF16 unpacking for stacked attention params
  • Skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear
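
The expert, compressor, and mlp→ffn renames above can be sketched as a small regex table. This is only an illustration: `remap_key` and `RULES` are hypothetical names, the example key layout (`model.layers.N.mlp.experts.M...`) is assumed, and the real loader goes through hf_to_vllm_mapper rather than a function like this. Attention renames are left out because they were revised later in the day.

```python
import re

# Each rule is anchored on its parent module, because "gate_proj" means
# w1 under an expert but wgate under the compressor.
RULES = [
    (re.compile(r"\.compressor\.kv_proj\."), ".compressor.wkv."),
    (re.compile(r"\.compressor\.gate_proj\."), ".compressor.wgate."),
    (re.compile(r"\.experts\.(\d+)\.gate_proj\."), r".experts.\1.w1."),
    (re.compile(r"\.experts\.(\d+)\.up_proj\."), r".experts.\1.w3."),
    (re.compile(r"\.experts\.(\d+)\.down_proj\."), r".experts.\1.w2."),
    (re.compile(r"\.mlp\."), ".ffn."),
]


def remap_key(name: str) -> str:
    """Apply every rename rule to one checkpoint key (hypothetical helper)."""
    for pattern, replacement in RULES:
        name = pattern.sub(replacement, name)
    return name


print(remap_key("model.layers.3.mlp.experts.7.gate_proj.weight"))
print(remap_key("model.layers.0.self_attn.compressor.gate_proj.weight"))
```

Anchoring on the parent module is the design point: a bare `gate_proj` substring rule would rename the compressor gate to w1 as well.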

What we abandoned:

  • mega_moe: No NVFP4 kernel exists, format mismatch (16-col vs 32-col blocks)
  • Runtime monkey-patching: Workers don't inherit patches
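
Why runtime patches did not reach the workers can be reproduced outside vLLM. The sketch below assumes workers start as fresh interpreters (spawn-style) rather than forks; `PATCHED_BY_PARENT` is a made-up attribute used only for the demonstration.

```python
import json
import subprocess
import sys


def child_sees_patch() -> bool:
    """Patch a module in this process, then ask a fresh interpreter about it."""
    # Runtime patch, visible only in the parent's copy of the module.
    json.PATCHED_BY_PARENT = True
    code = "import json; print(hasattr(json, 'PATCHED_BY_PARENT'))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


print("child sees patch:", child_sees_patch())
```

A fresh interpreter re-imports modules from disk, which is consistent with source-file patching working where in-memory patching did not.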

Open issues (stop point):

  1. MergedColumnParallelLinear + NVFP4 incompatibility — ModelOptNvFp4Config only handles Linear, not MergedColumn. Weight param is bf16 (should be uint8), no weight_scale registered for stacked params
  2. Unknown params from modelopt (compressor.position_bias) crash loading
  3. Current approach (unpack uint8→bf16, skip scales) loses calibration-optimized scales for attention weights

Repo state:

  • All code/patches/docker-compose synced and committed on modelopt-nvfp4 branch
  • README fully updated with vLLM serving run history, open issues, bug list
  • B200 node at 45.76.247.107, weights at /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4

vLLM NVFP4 Serving — Second Session (16:28–19:35 UTC)

Mike gave autonomous work instructions. Key directive: use weights AS-IS (NVFP4), do NOT convert to MXFP4. Try FusedMoE first, then if stuck, build a mega_moe NVFP4 kernel from scratch.

Major breakthroughs (S11→S14 progress):

Key insight: vLLM attention forward bypasses quant_method, uses deepseek_v4_fp8_einsum directly

  • The attention code reads self.wo_a.weight (expects fp8) and self.wo_a.weight_scale_inv directly
  • NVFP4 uint8 weights are incompatible with this FP8 kernel
  • Solution: NVFP4→bf16→FP8 dequantize/requant at load time for attention layers

S12 fixes applied (weight loading now succeeds to 94%):

  1. Substr mapping fix: Removed .mla_attn. prefix from attention projections. The model has fused_wqa_wkv, wq_b, wo_a, wo_b at attn.* level, not attn.mla_attn.*. The stacking code then correctly maps attn.wq_a → attn.fused_wqa_wkv.
  2. Skip patterns fix: Only skip compressor scale tensors (compressor uses UnquantizedLinearMethod with quant_config=None). Attention and shared expert scales now correctly load via stacking logic.
  3. Suffix mapping fix: Removed "head.weight": "lm_head.weight" which caused lm_head.weight → lm_lm_head.weight doubling.
  4. Resilient loading: Unknown params (e.g., compressor.position_bias) silently skipped.
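
Two of the S12 fixes are easy to reproduce in isolation. The sketch below is a stand-in, not vLLM's mapper: `apply_suffix_map` and `load_known` are hypothetical helpers that mimic suffix replacement and skip-unknown loading.

```python
def apply_suffix_map(name: str, suffix_map: dict) -> str:
    """Replace the first matching suffix of name (stand-in for the real mapper)."""
    for old, new in suffix_map.items():
        if name.endswith(old):
            return name[: len(name) - len(old)] + new
    return name


def load_known(params: dict, checkpoint_items) -> list:
    """Copy tensors for known params and return the names that were skipped."""
    skipped = []
    for name, tensor in checkpoint_items:
        if name not in params:
            skipped.append(name)  # e.g. compressor.position_bias
            continue
        params[name] = tensor
    return skipped


# "lm_head.weight" already ends with "head.weight", so the rule fires on it.
print(apply_suffix_map("lm_head.weight", {"head.weight": "lm_head.weight"}))
```

Returning the skipped names instead of dropping them silently is a choice made for this sketch; it keeps a typo in a mapping rule from hiding behind the resilient path.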

S13 — Weight loading SUCCESS (32 seconds!)

  • All 95 safetensors loaded without KeyError
  • New error: MergedColumnParallelLinear has no weight_scale_inv (FP8 attribute)

S13.5 — o_a_proj discovery:

  • modelopt did NOT quantize o_a_proj — it's bf16 in the checkpoint (no scales)
  • But vLLM creates wo_a with NVFP4 quant (uint8 weight + scales)
  • Fix: convert bf16→FP8 directly at load time, set weight_scale_inv

S14 — NVFP4→FP8 post-load conversion approach:

  • Added _convert_nvfp4_attention_to_fp8() and _convert_nvfp4_module_to_fp8() methods to DeepseekV4Model
  • Converts all uint8 NVFP4 attention weights (fused_wqa_wkv, wq_b, wo_a, wo_b, gate_up_proj) to FP8 at load time
  • Steps: unpack E2M1 FP4→bf16, dequantize with block/global scales, requantize to FP8 e4m3, set weight_scale_inv
  • For o_a_proj (bf16, no scales): convert directly bf16→FP8
  • For compressor fused_wkv_wgate: stays bf16 (UnquantizedLinearMethod)
  • For MoE experts: handled natively by ModelOptNvFp4FusedMoE
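
The scale arithmetic behind the dequantize/requantize steps can be sketched in plain Python. Assumptions, none of them confirmed by the patch itself: one block scale covers 16 consecutive elements, the dequantized value is fp4 × block scale × global scale, and a single FP8 scale is used for the whole row (the real kernel may expect block-wise weight_scale_inv). The final cast to float8_e4m3fn is omitted, and both function names are hypothetical.

```python
E4M3_MAX = 448.0  # largest finite float8_e4m3fn magnitude
BLOCK = 16        # assumed NVFP4 block size along the input dimension


def dequantize_row(fp4_values, block_scales, global_scale):
    """fp4_values: already-decoded E2M1 values; one block scale per 16 elements."""
    assert len(fp4_values) == BLOCK * len(block_scales)
    return [
        value * block_scales[i // BLOCK] * global_scale
        for i, value in enumerate(fp4_values)
    ]


def fp8_scale_and_scaled(values):
    """Return (weight_scale_inv, values rescaled into the e4m3 range).

    weight_scale_inv is treated as the dequantization multiplier:
    original ~= fp8_value * weight_scale_inv.
    """
    amax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    scale_inv = amax / E4M3_MAX
    scaled = [max(-E4M3_MAX, min(E4M3_MAX, v / scale_inv)) for v in values]
    return scale_inv, scaled


row = dequantize_row([6.0] * 16 + [1.0] * 16, [0.5, 2.0], 0.25)
scale_inv, scaled = fp8_scale_and_scaled(row)
print(row[0], row[16], scale_inv)
```

In the actual conversion this would be vectorized over tensors; the list version is only meant to make the order of the multiplications explicit.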

Bug found: E2M1 LUT indexing out of range

  • FP4 4-bit values are 0-15 (bit 3 = sign, bits 0-2 = magnitude)
  • LUT has 8 entries (magnitudes 0-7), but code was indexing with full 4-bit value (0-15) → CUDA assert
  • Fix: mask with & 0x07 for magnitude index, apply sign from bit 3 separately
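
A minimal sketch of the corrected decode, using the standard E2M1 magnitude table. The nibble order in `unpack_byte` (low nibble first) is an assumption about the checkpoint packing, and both function names are illustrative.

```python
# The eight E2M1 magnitudes, indexed by the low three bits of the code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 index the table."""
    magnitude = E2M1_MAGNITUDES[code & 0x07]  # mask keeps the index in 0..7
    return -magnitude if code & 0x08 else magnitude


def unpack_byte(byte: int) -> tuple:
    """Split one packed byte into two values (assumed: low nibble first)."""
    return decode_e2m1(byte & 0x0F), decode_e2m1(byte >> 4)


print([decode_e2m1(code) for code in range(16)])
```

Indexing the 8-entry table with the unmasked code would go out of bounds for codes 8 to 15, which matches the CUDA assert described above.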

Bug found: method placement inside Python class

  • _convert_nvfp4_attention_to_fp8 was being placed at top level (0 indent) instead of inside DeepseekV4Model
  • The class actually ends at finalize_mega_moe_weights() (line ~1600), followed by top-level hc_head function
  • Had to insert methods BEFORE the @torch.compile decorator that marks the class boundary
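
A cheap guard against this kind of misplacement is to parse the patched file and list what actually sits inside the class. The helper below is a sketch using only the stdlib `ast` module; `methods_of` is a hypothetical name, and the sample source is a toy stand-in for deepseek_v4.py.

```python
import ast


def methods_of(source: str, class_name: str) -> set:
    """Return the names of functions defined directly in a module-level class."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            }
    raise ValueError(f"class {class_name!r} not found at module level")


SAMPLE = '''
class DeepseekV4Model:
    def finalize_mega_moe_weights(self):
        pass

def _convert_nvfp4_attention_to_fp8(self):  # misplaced: zero indent
    pass
'''

print("_convert_nvfp4_attention_to_fp8" in methods_of(SAMPLE, "DeepseekV4Model"))
```

Running a check like this right after patching would flag the problem before a long checkpoint load.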

Bug found: logger not available in method

  • logger.info_once() isn't accessible inside the conversion methods
  • Replaced with print(f"...") for now
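
Why the logger was not reachable is not established in these notes. One possible replacement for the print() calls, assuming the goal is just log-once behavior, is a stdlib logger with a cached wrapper. The logger name and `info_once` helper below are illustrative and are not the vLLM logger API.

```python
import functools
import logging

# Hypothetical logger name; a real patch would more likely reuse the module's logger.
logger = logging.getLogger("nvfp4_conversion")


@functools.lru_cache(maxsize=None)
def info_once(message: str) -> None:
    """Log message at INFO level the first time it is seen, then stay quiet."""
    logger.info(message)


logging.basicConfig(level=logging.INFO)
info_once("converting NVFP4 attention weights to FP8")
info_once("converting NVFP4 attention weights to FP8")  # suppressed by the cache
```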

Current status (as of 19:35 UTC):

  • Weight loading + NVFP4→FP8 conversion code is in place
  • Last test was running (loading 880GB checkpoint)
  • E2M1 sign handling fix applied but NOT YET TESTED
  • Need to fix logger→print issue
  • After load succeeds: FusedMoE expert weight handling needs verification
  • If FusedMoE fails: need to build mega_moe NVFP4 kernel

Key files on B200 node:

  • Patch: /root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py
  • Docker: docker compose up -d (TP=8, no mega_moe, FLASHINFER_TRTLLM attn)
  • Weights: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/

Architecture decisions:

  • NVFP4→FP8 for attention/shared_experts (requant, preserves FP8 kernel compat)
  • BF16 for compressor (UnquantizedLinearMethod, no quant_config)
  • Native NVFP4 for MoE experts (ModelOptNvFp4FusedMoE handles it)
  • UnquantizedLinearMethod as no-op quant_method (attention forward bypasses it anyway)

vLLM NVFP4 Serving — Third Session (23:05+ UTC)

Current state of the B200 node:

  • Docker container ran 27 min ago and crashed with BFloat16 != Float8_e4m3fn
  • Uncommitted changes to patches/deepseek_v4.py (the _convert_nvfp4_post_load methods)
  • Repo on modelopt-nvfp4 branch, last commit db16be8

Crash analysis (S15 — BFloat16 != Float8_e4m3fn):

Weight loading succeeds (95/95, 330s). Post-load conversion reports: 122 layers → FP8, 183 → BF16. MoE setup runs. Crash during profile_run/_dummy_run.

Root cause: _convert_nvfp4_post_load converts fused_wqa_wkv to FP8 and sets quant_method = UnquantizedLinearMethod(). The attention forward calls self.fused_wqa_wkv(hidden_states), which goes through UnquantizedLinearMethod.forward() → F.linear(bf16_input, fp8_weight) → dtype mismatch.

Key insight about the attention forward paths:

  • wo_a: Attention code reads self.wo_a.weight and self.wo_a.weight_scale_inv DIRECTLY, passes to deepseek_v4_fp8_einsum. This bypasses quant_method. FP8 conversion works here.
  • fused_wqa_wkv: Called via self.fused_wqa_wkv(hidden_states) → MergedColumnParallelLinear.forward() → quant_method.forward(). Cannot be FP8 with UnquantizedLinearMethod.
  • wq_b, wo_b: Called via normal .forward(). Need BF16 + UnquantizedLinearMethod.
  • compressor.fused_wkv_wgate: Called via torch.mm(hidden_states, weight.T, out_dtype=torch.float32) DIRECTLY. Needs BF16 weight — currently uint8 (not in any conversion set!).
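
The four call paths above imply a target dtype per projection, and that can be written down as a table and checked before the profile run. This is a sketch: the table and `find_dtype_mismatches` are hypothetical, dtypes are plain strings rather than torch dtypes, and BF16 for fused_wqa_wkv is assumed by analogy with wq_b and wo_b.

```python
# Target weight dtype per projection, derived from how each one is consumed.
REQUIRED_DTYPE = {
    "wo_a": "float8_e4m3fn",                   # read directly by the FP8 einsum
    "fused_wqa_wkv": "bfloat16",               # assumed: goes through quant_method
    "wq_b": "bfloat16",                        # normal .forward()
    "wo_b": "bfloat16",                        # normal .forward()
    "compressor.fused_wkv_wgate": "bfloat16",  # consumed by torch.mm directly
}


def find_dtype_mismatches(actual: dict) -> dict:
    """Map each mismatched projection to (actual_dtype, required_dtype)."""
    return {
        name: (actual[name], required)
        for name, required in REQUIRED_DTYPE.items()
        if name in actual and actual[name] != required
    }


# Illustrative state: fused_wqa_wkv converted to FP8, compressor left packed.
state = {
    "wo_a": "float8_e4m3fn",
    "fused_wqa_wkv": "float8_e4m3fn",
    "compressor.fused_wkv_wgate": "uint8",
}
print(find_dtype_mismatches(state))
```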

Critical finding from safetensors: o_a_proj.weight is BF16 (modelopt did NOT quantize it). So wo_a weight is already BF16, not NVFP4. The post-load conversion code's dtype != uint8 check skips it. This means wo_a.weight stays BF16 and wo_a.weight_scale_inv is never created. When deepseek_v4_fp8_einsum tries to read it as FP8 → crash.

Wait, but the log says 122 → FP8. 61 layers × 2 (fused_wqa_wkv + wo_a) = 122. If wo_a.weight is BF16 and gets skipped, only 61 → FP8. The 122 count means wo_a IS being converted somehow. Hypothesis: ModelOptNvFp4LinearMethod.create_weights() creates wo_a.weight as uint8. When the BF16 checkpoint data is loaded into the uint8 param, the weight_loader might be casting it, or the param might be updated to BF16. Need to verify.
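
The total of 122 cannot say which projections were converted; a per-projection tally can. The sketch below assumes the conversion code could emit one (layer, projection, target dtype) record per converted module. `count_conversions` and the sample records are hypothetical, not output from the real run.

```python
from collections import Counter


def count_conversions(events) -> Counter:
    """events: iterable of (layer_index, projection_name, target_dtype)."""
    return Counter((projection, dtype) for _, projection, dtype in events)


# Made-up records for the case where both projections convert in all 61 layers.
events = [
    (layer, projection, "fp8")
    for layer in range(61)
    for projection in ("fused_wqa_wkv", "wo_a")
]
tally = count_conversions(events)
print(tally, sum(tally.values()))
```

If wo_a were skipped by the dtype check, its entry would be missing from the tally and the sum would be 61 instead of 122.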

Unfixed bugs from S14 (still present):

  1. E2M1 sign handling fix applied but NOT TESTED
  2. logger→print issue in conversion methods

Compressor fused_wkv_wgate — PENDING CRASH:

  • NOT in any conversion set (fp8_proj_names, bf16_proj_names, bf16_shared_names)
  • Weight is uint8 after loading (NVFP4 packed)
  • Forward uses torch.mm(hidden_states, weight.T, out_dtype=torch.float32) directly
  • uint8 × BF16 would crash with a different error than the current one
  • Needs BF16 dequantization in post-load conversion

Checkpoint key format (verified from safetensors):

  • model.layers.0.self_attn.q_a_proj.weight — uint8
  • model.layers.0.self_attn.q_a_proj.weight_scale — float8_e4m3fn (block scale)
  • model.layers.0.self_attn.q_a_proj.weight_scale_2 — float32 (per-tensor)
  • model.layers.0.self_attn.q_a_proj.input_scale — float32
  • model.layers.0.self_attn.o_a_proj.weight — BF16 (NOT quantized by modelopt)
  • model.layers.0.self_attn.o_b_proj.weight — uint8
  • model.layers.0.self_attn.kv_proj.weight — uint8
  • model.layers.0.self_attn.compressor.kv_proj.weight — uint8
  • model.layers.0.self_attn.compressor.gate_proj.weight — uint8
  • model.layers.0.self_attn.compressor.position_bias — BF16 (unknown param, skipped)
  • Expert scales: .weight_scale, .weight_scale_2, .input_scale (NOT .scale)
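
A dtype listing like the one above can be produced without loading any tensor data, because a safetensors file begins with an 8-byte little-endian header length followed by a JSON header. The sketch below uses only the stdlib. `read_safetensors_dtypes` is a hypothetical helper, and the dtype strings it returns are the format's own (for example U8, F8_E4M3, F32, BF16), not torch names.

```python
import json
import struct


def read_safetensors_dtypes(path: str) -> dict:
    """Return {tensor_name: dtype_string} from a safetensors header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    return {
        name: meta["dtype"]
        for name, meta in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
    }
```

Usage would be a loop over the checkpoint shards, filtering keys such as `o_a_proj.weight` to confirm which projections were left unquantized.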

FusedMoE NVFP4 status:

  • ModelOptNvFp4FusedMoE creates proper uint8 weights + float8_e4m3fn block scales + float32 per-tensor/input scales
  • process_weights_after_loading calls convert_to_nvfp4_moe_kernel_format then make_nvfp4_moe_kernel
  • Uses cutlass_fp4_gemm via nvfp4 backend
  • Warning: w1_weight_scale_2 must match w3_weight_scale_2 — modelopt gives different global scales to w1 and w3, but FusedMoE uses a single w13_weight_scale_2 (takes w1's). Minor accuracy impact.
  • expert_dtype: fp4 in config causes the weight mapper to apply the .scale → .weight_scale regex, but the checkpoint already uses .weight_scale directly, so the regex is a no-op. Correct behavior.
  • scale_fmt: "ue8m0" in config — used by attention FP8 einsum. Correct for NVFP4.
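
The size of the w1/w3 global-scale mismatch can be estimated directly. Assuming dequantization multiplies by the global scale, decoding w3 with w1's scale multiplies every w3 weight by w1_scale_2 / w3_scale_2. The helper name and the example numbers below are made up for illustration.

```python
def w3_relative_error(w1_scale_2: float, w3_scale_2: float) -> float:
    """Uniform relative error on w3 weights when w1's global scale is reused."""
    return abs(w1_scale_2 / w3_scale_2 - 1.0)


# Made-up scales: w3 values would come out 20% too small in this case.
print(w3_relative_error(0.020, 0.025))
```

Computing this ratio per layer and expert from the checkpoint's weight_scale_2 tensors would show whether "minor" holds across the model.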

Config verification:

  • compress_ratios (copied from BF16 source)
  • scale_fmt: "ue8m0" (added by us)
  • rope_parameters (flattened)
  • expert_dtype: fp4 (original, correct for weight mapper regex)