input_scale is for ACTIVATIONS, not weights. The correct NVFP4 weight
dequantization formula is: weight_bf16 = e2m1_value * block_scale * global_scale
Including input_scale made weights ~5000x too small, causing completely
garbled output (multilingual gibberish with repeating patterns).
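A minimal pure-Python sketch of the weight formula above, for illustration only. The low-nibble-first packing order, the helper names, and the plain floats standing in for the stored scale tensors are assumptions, not taken from the patched vllm file.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one block scale per 16 elements


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_fp4(packed):
    """Split each byte into two 4-bit codes (low nibble first: an assumption)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_weight(packed, block_scales, global_scale):
    """weight = e2m1_value * block_scale * global_scale; input_scale is not used."""
    codes = unpack_fp4(packed)
    return [
        decode_e2m1(code) * block_scales[i // BLOCK_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]
```

Multiplying by an additional activation scale at this point would shrink every element by that factor, which is the failure described above.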
- Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4
weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
resilient loading for unknown params.
- Update docker-compose.yml: copy patched deepseek_v4.py over original at
container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).
- Update patches/patch_vllm_weights.py: mark as the legacy runtime monkey-patch
approach (doesn't work with worker processes); kept for reference.
- Update README.md: add vLLM serving run history table (S1-S10), document
all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
resilient loading), add vLLM-specific bug list and key notes.
- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
- Run 10: calibration succeeded but export crashed in get_weight_scaling_factor
(stale GPU weight, not just amax). Patch 4 forces weight to CPU at
_export_quantized_weight entry point, covering the entire export chain.
- Updated Key Lessons with Run 10 analysis
- Updated Runtime Patches section to document all 8 patches
- Added Bug #8 (stale GPU weight tensors)
- Updated Do NOT Repeat list
Traced the full execution chain from _process_quantized_modules through
every function that reads stale GPU tensors:
_process_quantized_modules
→ _export_quantized_weight (Patch 4: force weight to CPU at entry point)
→ get_weight_scaling_factor (Patch 7: belt-and-suspenders)
→ get_weights_scaling_factor_from_quantizer (safe: weight now CPU)
→ NVFP4QTensor.get_weights_scaling_factor (safe: input is CPU)
→ get_weight_scaling_factor_2 (Patch 8: force quantizer to CPU)
→ get_activation_scaling_factor (Patch 3: CPU + clamp)
→ to_quantized_weight (Patch 6: force all tensors to CPU)
→ weight.to(dtype) (safe: weight is CPU)
→ _export_fused_experts (Patch 5: force expert weights + quantizer to CPU)
Patch 4 is the key: it moves weight to CPU at the earliest possible point,
so ALL downstream .to(weight.device) calls resolve to CPU.
Patches 5-8 are belt-and-suspenders for alternative code paths.
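The entry-point idea behind Patch 4 can be sketched as a generic wrapper. patch_entry_point and the prepare callback are hypothetical names; the real patch targets modelopt internals that are not reproduced here.

```python
import functools


def patch_entry_point(namespace, func_name, prepare):
    """Rebind namespace.func_name so prepare(module) runs before the original.

    With PyTorch, prepare could be something like:
        module.weight.data = module.weight.data.cpu()
    (an assumption about how the move is done, not the actual patch).
    Returns the original function so the patch can be undone.
    """
    original = getattr(namespace, func_name)

    @functools.wraps(original)
    def patched(module, *args, **kwargs):
        # Everything downstream of the entry point sees the prepared module.
        prepare(module)
        return original(module, *args, **kwargs)

    setattr(namespace, func_name, patched)
    return original
```

Rebinding an attribute this way only affects callers that look the function up through the patched namespace at call time; code that imported the function by name earlier keeps the original.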
Run 10 completed calibration (128/128) but crashed at export in
get_weight_scaling_factor: the weight tensor on GPU was stale after
5+ hours of calibration, and weight_scaling_factor_2.to(weight.device)
triggered cudaErrorIllegalAddress.
Patches 4+5 force weight and quantizer state to CPU before computing
scaling factors. This follows the same pattern as Patch 3
(get_activation_scaling_factor).
Calibrated state saved successfully (721.4 GB, 47,696 amax tensors).
Amax snapshot saved (15.4 MB). Re-running with new patches.
- Architecture section: call hf_main() directly instead of rewriting the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
When calling hf_main(args) directly, the __main__ block conversions that
run between parse_args() and main() are skipped. calib_size stays the
string '128' instead of becoming the list [128], causing a TypeError on
list concatenation.
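A minimal sketch of re-applying the skipped conversion before calling hf_main(args). normalize_calib_size is a hypothetical helper, and the comma-split-to-ints behaviour is an assumption about what the __main__ block does.

```python
import argparse


def normalize_calib_size(args):
    """Turn calib_size into a list of ints; safe to call more than once."""
    value = args.calib_size
    if isinstance(value, str):
        # "128" -> [128], "128,64" -> [128, 64]
        args.calib_size = [int(part) for part in value.split(",")]
    elif isinstance(value, int):
        args.calib_size = [value]
    return args


# Without the conversion, list concatenation on the raw string fails:
try:
    "128" + [0]
except TypeError:
    pass  # str + list is rejected by Python

args = normalize_calib_size(argparse.Namespace(calib_size="128"))
```

Making the helper idempotent means it does no harm if a later version of the pipeline performs the conversion itself.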
The previous approach tried to reconstruct hf_ptq's pipeline by importing
individual functions and building a fake argparse.Namespace. This caused
repeated crashes from missing args (KV_QUANT_CFG_CHOICES, dataset,
calib_with_images, etc.).
New approach:
- Call hf_ptq.parse_args() with sys.argv replaced, which picks up ALL defaults
- Call hf_main(args) — the exact same entry point the shell script uses
- Hook export_quantized to add amax snapshot + state save before export
- No more missing args. No more diverging from the example script.
The only changes from the stock pipeline:
1. Runtime patches (load_calib_amax CPU, export_amax CPU, clamp)
2. Post-calibration hook (snapshot amax, save state, force CPU)
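The argv substitution and the pre-export hook described above could be sketched roughly as follows. replaced_argv, add_pre_hook, and the stand-in parse_args with its two flags are hypothetical; the real parser and export function come from the example script.

```python
import argparse
import contextlib
import functools
import sys


@contextlib.contextmanager
def replaced_argv(argv):
    """Temporarily swap sys.argv so an unmodified parse_args() sees our flags."""
    saved = sys.argv
    sys.argv = list(argv)
    try:
        yield
    finally:
        sys.argv = saved


def add_pre_hook(func, hook):
    """Return a wrapper that runs hook(*args, **kwargs) before func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        hook(*args, **kwargs)  # e.g. snapshot amax and save state
        return func(*args, **kwargs)

    return wrapper


def parse_args():
    """Hypothetical stand-in for the example script's parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--calib_size", default="512")
    parser.add_argument("--dataset", default=None)
    return parser.parse_args()  # reads sys.argv[1:]


# Usage: parse with a substituted command line; unspecified flags keep defaults.
with replaced_argv(["hf_ptq.py", "--calib_size", "128"]):
    args = parse_args()
```

Because the parser itself runs unchanged, any flag not passed explicitly still receives its default, which is what removes the missing-argument crashes.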
- get_model/get_tokenizer imported from example_utils (not hf_ptq)
- KV_QUANT_CFG_CHOICES imported from hf_ptq (not mtq)
- Removed the dead _FORCE_AMAX_CPU global and the reference to it in run_export_only
- Fixed stale comments
- All 16 imports and references verified against the actual B200 runtime
- Zero divergences from modelopt example path except get_model()
Raw from_pretrained OOMs during weight conversion: torch.cat on expert
gate_up_proj tries to allocate 31.5 GB on a GPU with only 25.9 GB free.
modelopt's get_model() handles max_memory/device_map properly for models
that need sequential device mapping.
Key changes:
- snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves
to disk (~50 MB) right after mtq.quantize() returns, before any other
GPU operation can corrupt them
- force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and
_global_amax to CPU too
- _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any
future amax writes go to CPU instead of GPU
- --validate-only mode to check saved state without running anything
- restore_amax_from_snapshot() for --export-only recovery
- torch.cuda.empty_cache() + gc.collect() between steps
- Patches: export_amax CPU fallback, get_activation_scaling_factor
clamp instead of assert
- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
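The snapshot step listed under "Key changes" (snapshot_amax_to_cpu, with force_all_amax_to_cpu covering the extra attributes) might look roughly like this. It is a duck-typed sketch, not the code in scripts/quantize_nvfp4.py: it assumes the model exposes named_modules() and that each value supports detach(), cpu(), and clone(), as PyTorch modules and tensors do. Writing the result to disk is left to the caller.

```python
def snapshot_amax_to_cpu(model, attr_names=("_amax",)):
    """Collect independent CPU copies of the selected quantizer attributes.

    Returns a dict keyed by "<module name>.<attribute>". Pass
    attr_names=("_amax", "_pre_quant_scale", "_global_amax") to cover the
    wider set of attributes as well.
    """
    snapshot = {}
    for module_name, module in model.named_modules():
        for attr in attr_names:
            value = getattr(module, attr, None)
            if value is not None:
                # detach + cpu + clone: the copy shares no storage with the
                # original, so later GPU-side corruption cannot reach it.
                snapshot[f"{module_name}.{attr}"] = value.detach().cpu().clone()
    return snapshot
```

Taking the snapshot immediately after calibration returns keeps the copy independent of whatever happens to the GPU tensors during export.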