biondizzle 02b8ea536f Update MEMORY.md and memory files with vLLM NVFP4 serving progress

Server running on B200 port 8000 with full NVFP4→vLLM bridge.
All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.

2026-05-11 02:02:49 +00:00

2026-05-10

DeepSeek V4 Pro NVFP4 — vLLM Serving Debug Session

Quantization completed successfully (Run 11, 881GB NVFP4)
Spent the day debugging vLLM serving of the modelopt NVFP4 checkpoint
Key finding: modelopt and vllm were never integrated for NVFP4 on DeepSeek V4
NVIDIA themselves haven't gotten this far — we're in uncharted territory

What we fixed:

Expert weight name mapping (gate_proj→w1, up_proj→w3, down_proj→w2)
mlp→ffn module naming
Attention: self_attn→attn.mla_attn, kv_proj→wkv, etc.
Compressor: kv_proj→wkv, gate_proj→wgate
kv_norm moved from compressor to attention level
Class attribute patching (hf_to_vllm_mapper)
Source file patching (workers are separate processes)
E2M1 FP4→BF16 unpacking for stacked attention params
Skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear
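
The expert and compressor renames above can be sketched as a small key-rewriting function. This is a minimal illustration, not the actual mapper: the `remap` helper and the example key layout (`mlp.experts.N`) are assumptions, and only the rename pairs themselves come from the list above.

```python
# Rename pairs taken from the notes; everything else here is hypothetical.
EXPERT_RENAMES = {"gate_proj": "w1", "up_proj": "w3", "down_proj": "w2"}
COMPRESSOR_RENAMES = {"kv_proj": "wkv", "gate_proj": "wgate"}


def remap(name: str) -> str:
    """Rewrite a checkpoint key into the naming the serving side expects."""
    # gate_proj means different things in the compressor and in the experts,
    # so the rename table is chosen from the surrounding module path.
    if ".compressor." in name:
        table = COMPRESSOR_RENAMES
    elif ".experts." in name:
        table = EXPERT_RENAMES
    else:
        table = {}
    for old, new in table.items():
        name = name.replace(f".{old}.", f".{new}.")
    # mlp -> ffn module naming applies to every key.
    return name.replace(".mlp.", ".ffn.")


print(remap("model.layers.3.mlp.experts.7.gate_proj.weight"))
print(remap("model.layers.0.self_attn.compressor.gate_proj.weight"))
```

Selecting the table by context avoids the ambiguity where gate_proj maps to w1 for experts but to wgate for the compressor.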

What we abandoned:

mega_moe: No NVFP4 kernel exists, format mismatch (16-col vs 32-col blocks)
Runtime monkey-patching: Workers don't inherit patches

Open issues (stop point):

MergedColumnParallelLinear + NVFP4 incompatibility — ModelOptNvFp4Config only handles Linear, not MergedColumn. Weight param is bf16 (should be uint8), no weight_scale registered for stacked params
Unknown params from modelopt (compressor.position_bias) crash loading
Current approach (unpack uint8→bf16, skip scales) loses calibration-optimized scales for attention weights

Repo state:

All code/patches/docker-compose synced and committed on modelopt-nvfp4 branch
README fully updated with vLLM serving run history, open issues, bug list
B200 node at 45.76.247.107, weights at /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4

vLLM NVFP4 Serving — Second Session (16:28–19:35 UTC)

Mike gave autonomous work instructions. Key directive: use weights AS-IS (NVFP4), do NOT convert to MXFP4. Try FusedMoE first, then if stuck, build a mega_moe NVFP4 kernel from scratch.

Major breakthroughs (S11→S14 progress):

Key insight: vLLM attention forward bypasses quant_method, uses deepseek_v4_fp8_einsum directly

The attention code reads self.wo_a.weight (expects fp8) and self.wo_a.weight_scale_inv directly
NVFP4 uint8 weights are incompatible with this FP8 kernel
Solution: NVFP4→bf16→FP8 dequantize/requant at load time for attention layers

S12 fixes applied (weight loading now succeeds to 94%):

Substr mapping fix: Removed .mla_attn. prefix from attention projections. The model has fused_wqa_wkv, wq_b, wo_a, wo_b at attn.* level, not attn.mla_attn.*. The stacking code then correctly maps attn.wq_a → attn.fused_wqa_wkv.
Skip patterns fix: Only skip compressor scale tensors (compressor uses UnquantizedLinearMethod with quant_config=None). Attention and shared expert scales now correctly load via stacking logic.
Suffix mapping fix: Removed "head.weight": "lm_head.weight" which caused lm_head.weight → lm_lm_head.weight doubling.
Resilient loading: Unknown params (e.g., compressor.position_bias) silently skipped.
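
The lm_head doubling from the suffix mapping can be reproduced in isolation. The sketch below assumes a plain "replace the matching suffix" rule; `apply_suffix_map` is a hypothetical stand-in for the real mapper, which may differ in detail.

```python
def apply_suffix_map(name: str, suffix_map: dict[str, str]) -> str:
    """Replace the first matching suffix of `name` (hypothetical helper)."""
    for old, new in suffix_map.items():
        if name.endswith(old):
            return name[: -len(old)] + new
    return name


bad_map = {"head.weight": "lm_head.weight"}

# "lm_head.weight" also ends with "head.weight", so the rule fires on a key
# that is already correct and prepends a second "lm_".
print(apply_suffix_map("lm_head.weight", bad_map))  # lm_lm_head.weight

# With the rule removed, the key passes through untouched.
print(apply_suffix_map("lm_head.weight", {}))  # lm_head.weight
```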

S13 — Weight loading SUCCESS (32 seconds!)

All 95 safetensors loaded without KeyError
New error: MergedColumnParallelLinear has no weight_scale_inv (FP8 attribute)

S13.5 — o_a_proj discovery:

modelopt did NOT quantize o_a_proj — it's bf16 in the checkpoint (no scales)
But vLLM creates wo_a with NVFP4 quant (uint8 weight + scales)
Fix: convert bf16→FP8 directly at load time, set weight_scale_inv

S14 — NVFP4→FP8 post-load conversion approach:

Added _convert_nvfp4_attention_to_fp8() and _convert_nvfp4_module_to_fp8() methods to DeepseekV4Model
Converts all uint8 NVFP4 attention weights (fused_wqa_wkv, wq_b, wo_a, wo_b, gate_up_proj) to FP8 at load time
Steps: unpack E2M1 FP4→bf16, dequantize with block/global scales, requantize to FP8 e4m3, set weight_scale_inv
For o_a_proj (bf16, no scales): convert directly bf16→FP8
For compressor fused_wkv_wgate: stays bf16 (UnquantizedLinearMethod)
For MoE experts: handled natively by ModelOptNvFp4FusedMoE
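
The dequantize and requantize steps can be sketched on a single row in pure Python. Assumptions, none of them confirmed by the patch: one block scale per 16 consecutive elements, a single global scale multiplied in, and a per-row FP8 scale chosen as amax / 448 (448 being the largest finite e4m3 value). Rounding to the e4m3 grid is left out, and how the resulting scale maps onto the real weight_scale_inv layout is not shown.

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
BLOCK = 16            # assumed NVFP4 block size along the input dimension


def dequantize_row(values, block_scales, global_scale):
    """values: already-decoded E2M1 floats; one block scale per BLOCK values."""
    assert len(values) == BLOCK * len(block_scales)
    return [v * block_scales[i // BLOCK] * global_scale
            for i, v in enumerate(values)]


def fp8_scale_and_clamp(row):
    """Pick a scale so the row fits the e4m3 range, then scale and clamp."""
    amax = max(abs(x) for x in row) or 1.0  # avoid a zero scale on all-zero rows
    scale = amax / FP8_E4M3_MAX
    scaled = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale)) for x in row]
    return scaled, scale


row = dequantize_row([6.0] * 16 + [1.0] * 16, [0.5, 2.0], 0.25)
scaled, scale = fp8_scale_and_clamp(row)
# scaled[i] * scale recovers row[i] up to float rounding.
```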

Bug found: E2M1 LUT indexed out of range

FP4 4-bit values are 0-15 (bit 3 = sign, bits 0-2 = magnitude)
LUT has 8 entries (magnitudes 0-7), but code was indexing with full 4-bit value (0-15) → CUDA assert
Fix: mask with & 0x07 for magnitude index, apply sign from bit 3 separately
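
A minimal sketch of the corrected decode, using the standard E2M1 magnitude table. The nibble order within a packed byte (low nibble first) is an assumption and should be checked against the checkpoint's packing.

```python
# The 8 representable E2M1 magnitudes, indexed by bits 0-2.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value: bit 3 is the sign, bits 0-2 the magnitude."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x07]  # never index with the full 0-15 value
    return -magnitude if nibble & 0x08 else magnitude


def unpack_byte(byte: int) -> tuple[float, float]:
    """Split a packed uint8 into two values (assumed order: low nibble first)."""
    return decode_e2m1(byte & 0x0F), decode_e2m1(byte >> 4)


print(unpack_byte(0x9F))  # (-6.0, -0.5)
```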

Bug found: method placement inside Python class

_convert_nvfp4_attention_to_fp8 was being placed at top level (0 indent) instead of inside DeepseekV4Model
The class actually ends at finalize_mega_moe_weights() (line ~1600), followed by top-level hc_head function
Had to insert methods BEFORE the @torch.compile decorator that marks the class boundary
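
One way to avoid guessing the class boundary when patching source files is to ask the parser for it. The sketch below uses the standard ast module to find the last line of a class and inserts new methods there. `class_end_line` and `insert_methods` are hypothetical helpers, and a 4-space class body indent is assumed.

```python
import ast
import textwrap


def class_end_line(source: str, class_name: str) -> int:
    """Return the 1-indexed last line of the named class (Python 3.8+)."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return node.end_lineno
    raise ValueError(f"class {class_name!r} not found")


def insert_methods(source: str, class_name: str, method_src: str) -> str:
    """Append method_src to the class body, indented one level."""
    lines = source.splitlines()
    end = class_end_line(source, class_name)
    body = textwrap.indent(textwrap.dedent(method_src).strip("\n"), "    ")
    patched = lines[:end] + [""] + body.splitlines() + lines[end:]
    return "\n".join(patched) + "\n"
```

Because end_lineno stops at the class's final statement, anything that follows it, including a decorator on the next top-level function, stays outside the inserted span.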

Bug found: logger not available in method

logger.info_once() isn't accessible inside the conversion methods
Replaced with print(f"...") for now

Current status (as of 19:35 UTC):

Weight loading + NVFP4→FP8 conversion code is in place
Last test was running (loading 880GB checkpoint)
E2M1 sign handling fix applied but NOT YET TESTED
Need to fix logger → print issue
After load succeeds: FusedMoE expert weight handling needs verification
If FusedMoE fails: need to build mega_moe NVFP4 kernel

Key files on B200 node:

Patch: /root/nvidia-meeting/deepseek-v4-quant/patches/deepseek_v4.py
Docker: docker compose up -d (TP=8, no mega_moe, FLASHINFER_TRTLLM attn)
Weights: /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/

Architecture decisions:

NVFP4→FP8 for attention/shared_experts (requant, preserves FP8 kernel compat)
BF16 for compressor (UnquantizedLinearMethod, no quant_config)
Native NVFP4 for MoE experts (ModelOptNvFp4FusedMoE handles it)
UnquantizedLinearMethod as no-op quant_method (attention forward bypasses it anyway)

vLLM NVFP4 Serving — Third Session (23:05+ UTC)

Current state of the B200 node:

Docker container ran 27 min ago and crashed with BFloat16 != Float8_e4m3fn
Uncommitted changes to patches/deepseek_v4.py (the _convert_nvfp4_post_load methods)
Repo on modelopt-nvfp4 branch, last commit db16be8

Crash analysis (S15 — `BFloat16 != Float8_e4m3fn`):

Weight loading succeeds (95/95, 330s). Post-load conversion reports: 122 layers → FP8, 183 → BF16. MoE setup runs. Crash during profile_run/_dummy_run.

Root cause: _convert_nvfp4_post_load converts fused_wqa_wkv to FP8 and sets quant_method = UnquantizedLinearMethod(). The attention forward calls self.fused_wqa_wkv(hidden_states) which goes through UnquantizedLinearMethod.forward() → F.linear(bf16_input, fp8_weight) → dtype mismatch.

Key insight about the attention forward paths:

wo_a: Attention code reads self.wo_a.weight and self.wo_a.weight_scale_inv DIRECTLY, passes to deepseek_v4_fp8_einsum. This bypasses quant_method. FP8 conversion works here.
fused_wqa_wkv: Called via self.fused_wqa_wkv(hidden_states) → MergedColumnParallelLinear.forward() → quant_method.forward(). Cannot be FP8 with UnquantizedLinearMethod.
wq_b, wo_b: Called via normal .forward(). Need BF16 + UnquantizedLinearMethod.
compressor.fused_wkv_wgate: Called via torch.mm(hidden_states, weight.T, out_dtype=torch.float32) DIRECTLY. Needs BF16 weight — currently uint8 (not in any conversion set!).

Critical finding from safetensors: o_a_proj.weight is BF16 (modelopt did NOT quantize it). So wo_a weight is already BF16, not NVFP4. The post-load conversion code's dtype != uint8 check skips it. This means wo_a.weight stays BF16 and wo_a.weight_scale_inv is never created. When deepseek_v4_fp8_einsum tries to read it as FP8 → crash.

However, the log reports 122 → FP8, and 61 layers × 2 (fused_wqa_wkv + wo_a) = 122. If wo_a.weight were BF16 and skipped, only 61 would be converted, so wo_a IS being converted somehow. Hypothesis: ModelOptNvFp4LinearMethod.create_weights() creates wo_a.weight as uint8; when the BF16 checkpoint data is loaded into that uint8 param, the weight_loader may be casting it, or the param may be replaced with BF16. Needs verification.

Unfixed bugs from S14 (still present):

E2M1 sign handling fix applied but NOT TESTED
logger → print issue in conversion methods

Compressor `fused_wkv_wgate` — PENDING CRASH:

NOT in any conversion set (fp8_proj_names, bf16_proj_names, bf16_shared_names)
Weight is uint8 after loading (NVFP4 packed)
Forward uses torch.mm(hidden_states, weight.T, out_dtype=torch.float32) directly
uint8 × BF16 would crash with a different error than the current one
Needs BF16 dequantization in post-load conversion
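
A coverage check run before the conversion would surface gaps like this one early. The sketch below flags every param that is still uint8 but whose module is in none of the conversion sets. The set names come from the notes; their contents here and the example param names are illustrative assumptions, not the patch's actual values.

```python
# Set names from the notes; contents are assumed for illustration only.
fp8_proj_names = {"wo_a"}
bf16_proj_names = {"fused_wqa_wkv", "wq_b", "wo_b"}
bf16_shared_names = {"gate_up_proj", "down_proj"}


def uncovered_uint8(param_dtypes: dict[str, str], conversion_sets) -> list[str]:
    """Return uint8 params whose owning module is in no conversion set."""
    covered = set().union(*conversion_sets)
    missing = []
    for name, dtype in param_dtypes.items():
        parts = name.split(".")
        module = parts[-2] if len(parts) >= 2 else name  # e.g. "wo_a" in "...wo_a.weight"
        if dtype == "uint8" and module not in covered:
            missing.append(name)
    return sorted(missing)


params = {
    "layers.0.attn.wo_a.weight": "uint8",
    "layers.0.attn.wq_b.weight": "uint8",
    "layers.0.attn.compressor.fused_wkv_wgate.weight": "uint8",
}
print(uncovered_uint8(params, [fp8_proj_names, bf16_proj_names, bf16_shared_names]))
```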

Checkpoint key format (verified from safetensors):

model.layers.0.self_attn.q_a_proj.weight — uint8
model.layers.0.self_attn.q_a_proj.weight_scale — float8_e4m3fn (block scale)
model.layers.0.self_attn.q_a_proj.weight_scale_2 — float32 (per-tensor)
model.layers.0.self_attn.q_a_proj.input_scale — float32
model.layers.0.self_attn.o_a_proj.weight — BF16 (NOT quantized by modelopt)
model.layers.0.self_attn.o_b_proj.weight — uint8
model.layers.0.self_attn.kv_proj.weight — uint8
model.layers.0.self_attn.compressor.kv_proj.weight — uint8
model.layers.0.self_attn.compressor.gate_proj.weight — uint8
model.layers.0.self_attn.compressor.position_bias — BF16 (unknown param, skipped)
Expert scales: .weight_scale, .weight_scale_2, .input_scale (NOT .scale)
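
The dtypes above can be rechecked without loading any tensor data, since a safetensors file starts with an 8-byte little-endian header length followed by a JSON header. A minimal stdlib sketch; `safetensors_dtypes` is a hypothetical helper name. The header reports dtypes as strings such as U8, F8_E4M3, F32, and BF16.

```python
import json
import struct


def safetensors_dtypes(path: str) -> dict[str, str]:
    """Map tensor name -> dtype string by reading only the file header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {name: info["dtype"] for name, info in header.items()
            if name != "__metadata__"}
```

Running this over each shard and filtering for keys ending in o_a_proj.weight is a cheap way to confirm which projections modelopt left in BF16.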

FusedMoE NVFP4 status:

ModelOptNvFp4FusedMoE creates proper uint8 weights + float8_e4m3fn block scales + float32 per-tensor/input scales
process_weights_after_loading calls convert_to_nvfp4_moe_kernel_format then make_nvfp4_moe_kernel
Uses cutlass_fp4_gemm via nvfp4 backend
Warning: w1_weight_scale_2 must match w3_weight_scale_2 — modelopt gives different global scales to w1 and w3, but FusedMoE uses a single w13_weight_scale_2 (takes w1's). Minor accuracy impact.
expert_dtype: fp4 in config — causes weight mapper to use .scale → .weight_scale regex, but checkpoint already uses .weight_scale directly, so regex is a no-op. Correct behavior.
scale_fmt: "ue8m0" in config — used by attention FP8 einsum. Correct for NVFP4.

Config verification:

compress_ratios ✅ (copied from BF16 source)
scale_fmt: "ue8m0" ✅ (added by us)
rope_parameters ✅ (flattened)
expert_dtype: fp4 ✅ (original, correct for weight mapper regex)
