Commit Graph

73 Commits

Author SHA1 Message Date
653e2d7a50 vLLM NVFP4 serving: full end-to-end pipeline working
Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention.
Server loads and serves tokens on 8x B200 with TP=8, EP=8.

Key changes:
- wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum
  Uses deepgemm_post_process_fp8_weight_block for correct scale layout
  weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar)
  Block scale filled with fp8_scale (NOT all-ones -- causes garbage output)
- Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod
- Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj
  Fixed indexer path: compressor.indexer.kv_proj (was loading main compressor)
- MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend

Bugs fixed:
1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor
2. Block scale dtype: float32 (not float8_e4m3fn)
3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0
4. Compressor indexer shape mismatch: wrong checkpoint key prefix
5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale

Updated README with full technical documentation of all fixes.
2026-05-11 02:01:46 +00:00
db16be8e5d S11: Fixed substr mapping, stacking, suffix, and o_a_proj - loads weights but attention forward uses FP8 einsum incompatible with NVFP4 2026-05-10 17:45:53 +00:00
6fd03a0aa0 vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs
- Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4
  weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
  compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
  skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
  resilient loading for unknown params.

- Update docker-compose.yml: copy patched deepseek_v4.py over original at
  container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).

- Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach
  (doesn't work with worker processes), kept for reference.

- Update README.md: added vLLM serving run history table (S1-S10), documented
  all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
  resilient loading), added vLLM-specific bug list and key notes.

- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
2026-05-10 16:14:17 +00:00
d88793dee6 Add vllm weight mapper patch and docker-compose 2026-05-10 09:33:48 +00:00
30608e3834 Config patches: document modelopt↔vllm gaps with NVIDIA reference 2026-05-10 08:59:28 +00:00
0d74b97fb2 Config patches doc + compress_ratios runtime patch in serve script 2026-05-10 08:23:11 +00:00
f65d4ab99f Run 11 SUCCESS: 881GB NVFP4 exported, add vLLM serve script 2026-05-10 07:54:34 +00:00
eb80bd6f80 README + memory: Run 10 result (export crash in get_weight_scaling_factor), Run 11 running
- Run 10: calibration succeeded but export crashed in get_weight_scaling_factor
  (stale GPU weight, not just amax). Patch 4 forces weight to CPU at
  _export_quantized_weight entry point, covering the entire export chain.
- Updated Key Lessons with Run 10 analysis
- Updated Runtime Patches section to document all 8 patches
- Added Bug #8 (stale GPU weight tensors)
- Updated Do NOT Repeat list
2026-05-09 23:00:17 +00:00
07cd50e823 8 patches covering full export chain — no more whack-a-mole
Traced the full execution chain from _process_quantized_modules through
every function that reads stale GPU tensors:

  _process_quantized_modules
    → _export_quantized_weight (Patch 4: force weight to CPU at entry point)
      → get_weight_scaling_factor (Patch 7: belt-and-suspenders)
        → get_weights_scaling_factor_from_quantizer (safe: weight now CPU)
        → NVFP4QTensor.get_weights_scaling_factor (safe: input is CPU)
      → get_weight_scaling_factor_2 (Patch 8: force quantizer to CPU)
      → get_activation_scaling_factor (Patch 3: CPU + clamp)
      → to_quantized_weight (Patch 6: force all tensors to CPU)
      → weight.to(dtype) (safe: weight is CPU)
    → _export_fused_experts (Patch 5: force expert weights + quantizer to CPU)

Patch 4 is the key: it moves weight to CPU at the earliest possible point,
so ALL downstream .to(weight.device) calls resolve to CPU.
Patches 5-8 are belt-and-suspenders for alternative code paths.
2026-05-09 22:50:58 +00:00
efc111a11f Add Patch 4+5: get_weight_scaling_factor and get_weight_scaling_factor_2 CPU safety
Run 10 completed calibration (128/128) but crashed at export in
get_weight_scaling_factor — the weight tensor on GPU was stale after
5+ hours of calibration, and weight_scaling_factor_2.to(weight.device)
triggered cudaErrorIllegalAddress.

Patches 4+5 force weight and quantizer state to CPU before computing
scaling factors. This mirrors the same pattern as Patch 3
(get_activation_scaling_factor).

Calibrated state saved successfully (721.4 GB, 47,696 amax tensors).
Amax snapshot saved (15.4 MB). Re-running with new patches.
2026-05-09 22:43:48 +00:00
ce9056d259 README overhaul: reflect current architecture (hf_main, run history through Run 10)
- Architecture section: call hf_main() directly, not rewrite the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
2026-05-09 16:09:09 +00:00
5a72da7193 Fix: apply hf_ptq __main__ post-parse conversions (dataset split, calib_size int list)
When calling hf_main(args) directly, the __main__ block conversions that
run between parse_args() and main() are skipped. calib_size stays as
string '128' instead of [128], causing TypeError on list concatenation.
2026-05-09 15:58:36 +00:00
8612914169 Update run history: Runs 7-8, Run 9 running on a300302 2026-05-09 15:00:23 +00:00
a300302486 Fix: use hf_ptq.py arg names (--pyt_ckpt_path, --qformat, --inference_tensor_parallel) 2026-05-09 14:57:28 +00:00
1a36a655ea Fix: use full argparse flag names (--calib_size, --kv_cache_qformat) 2026-05-09 14:54:51 +00:00
b2849a8944 Fundamental rewrite: call hf_main() instead of rewriting the pipeline
The previous approach tried to reconstruct hf_ptq's pipeline by importing
individual functions and building a fake argparse.Namespace. This caused
repeated crashes from missing args (KV_QUANT_CFG_CHOICES, dataset,
calib_with_images, etc.).

New approach:
- Call hf_ptq.parse_args() with sys.argv replaced — gets ALL defaults
- Call hf_main(args) — the exact same entry point the shell script uses
- Hook export_quantized to add amax snapshot + state save before export
- No more missing args. No more diverging from the example script.

The only changes from the stock pipeline:
1. Runtime patches (load_calib_amax CPU, export_amax CPU, clamp)
2. Post-calibration hook (snapshot amax, save state, force CPU)
2026-05-09 14:52:02 +00:00
a70593d886 Update run history: Run 6 (dataloader crash), Run 7 running on 25b4d8d 2026-05-09 13:40:00 +00:00
25b4d8da06 Fix: add missing args for make_calib_dataloader (dataset, calib_with_images, auto_quantize, specdec) 2026-05-09 13:37:24 +00:00
d1e15178b2 Update run history: Runs 4-5 (import bugs), Run 6 running on 6c1bff6 2026-05-09 09:29:20 +00:00
6c1bff6997 Clean rewrite: verified all imports against runtime, removed dead code
- get_model/get_tokenizer imported from example_utils (not hf_ptq)
- KV_QUANT_CFG_CHOICES imported from hf_ptq (not mtq)
- Removed dead _FORCE_AMAX_CPU global and global reference in run_export_only
- Fixed stale comments
- All 16 imports and references verified against the actual B200 runtime
- Zero divergences from modelopt example path except get_model()
2026-05-09 09:26:23 +00:00
86dd8df302 Fix: KV_QUANT_CFG_CHOICES is in hf_ptq, not mtq 2026-05-09 09:17:12 +00:00
99f861f48a Update README and memory: Run 3 OOM crash, Run 4 running on f9bbef8
- Added Run 3 to table (model loading OOM, fixed with get_model())
- Added Run 4 (current, commit f9bbef8)
- Added bug #7 (model loading OOM during expert weight concat)
- Added 'do NOT repeat' for AutoModelForCausalLM.from_pretrained
- Documented all 5 runtime patches
- Noted only divergence from modelopt example: get_model()
2026-05-09 08:10:04 +00:00
f9bbef8e91 Fix: patch load_calib_amax instead of amax property setter (can't patch readonly descriptor)
Also remove _FORCE_AMAX_CPU global — load_calib_amax patch handles it.
2026-05-09 08:04:03 +00:00
94179ed9d0 Fix typo: store_only → store_true 2026-05-09 08:02:09 +00:00
03c10ab3b6 Fix model loading: use modelopt get_model() instead of raw AutoModelForCausalLM
Raw from_pretrained OOMs during weight conversion — torch.cat on expert
gate_up_proj tries to allocate 31.5GB on a GPU with only 25.9GB free.
modelopt's get_model() handles max_memory/device_map properly for models
that need sequential device mapping.
2026-05-09 08:00:50 +00:00
9438af5a8c Add commit hashes to run history table 2026-05-09 06:47:26 +00:00
d7593fc1dd Update README: run history table, bug #1 already fixed, cost note, don't-repeat mistakes 2026-05-09 06:44:17 +00:00
6eaba26914 Defensive quantization: snapshot amax to CPU immediately after calibration
Key changes:
- snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves
  to disk (~50MB) right after mtq.quantize() returns, before any other
  GPU operation can corrupt them
- force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and
  _global_amax to CPU too
- _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any
  future amax writes go to CPU instead of GPU
- --validate-only mode to check saved state without running anything
- restore_amax_from_snapshot() for --export-only recovery
- torch.cuda.empty_cache() + gc.collect() between steps
- Patches: export_amax CPU fallback, get_activation_scaling_factor
  clamp instead of assert
2026-05-09 06:31:08 +00:00
3907838409 Remove ModuleList patch (already fixed in modelopt 0.45), fix numbering 2026-05-09 06:10:18 +00:00
382c1d872f Fix quant_module import path 2026-05-09 06:09:17 +00:00
9291165ba0 Fix imports: QUANT_CFG_CHOICES is in hf_ptq, not modelopt config 2026-05-09 06:08:35 +00:00
a0bacb3cf6 Replace shell wrapper with in-process quantize script
- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
2026-05-09 06:07:22 +00:00
04304fdae6 Add export crash fix patches, update README with bug #5 (repr CUDA crash) 2026-05-08 23:28:32 +00:00
50348989b2 Clarify: V4 is NOT BF16, dequantize first 2026-05-08 17:31:35 +00:00
24e3b3745d Pin modelopt and transformers versions in README 2026-05-08 17:23:10 +00:00
b08afea425 remove weird session dump crap 2026-05-08 17:21:18 +00:00
a2370006f7 Update README: document full pipeline, BF16 verification, calib 128 constraint 2026-05-08 17:17:48 +00:00
f1d21900ea Remove upcast_to_bf16.py — superseded by dequant_fp8_to_bf16.py 2026-05-08 17:13:39 +00:00
ca9a4f5eaa Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore 2026-05-08 17:09:59 +00:00
eeba101cc4 Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline 2026-05-08 17:02:07 +00:00
075da675dc fix: update HF token, echo it at runtime, export both HF_TOKEN and HUGGING_FACE_HUB_TOKEN 2026-05-08 16:57:32 +00:00
36e1342270 nvfp4_full: pass HF_TOKEN env var for gated calibration dataset 2026-05-08 13:33:45 +00:00
3d38e1d5cd nvfp4_full: drop calib to 128, gpu_max_mem to 0.7 for VRAM headroom 2026-05-08 06:24:45 +00:00
d0fc5338fe model_opt_nvfp4_full: add use_seq_device_map, fix source for /bin/sh 2026-05-08 05:50:16 +00:00
b70a04696e Add resume capability to dequant script (skip already-done shards)
Verified our FP4 dequant is byte-identical to official transformers
MXFP4 implementation. Max diff = 0.0 across all values.
2026-05-08 02:58:24 +00:00
f63eed5cfd Purge INT4 references — expert weights are FP4 (E2M1), not INT4
All docs and scripts updated. Historical memory entries annotated.
2026-05-08 02:33:46 +00:00
f8533197f2 Fix: expert weights are FP4 (E2M1), not INT4 - verified with nibble analysis
Nibble index 0 vs 8 ratio = 0.996 (FP4 -0.0 ≈ +0.0), NOT INT4 where -8 would be rare.
FP4 dequant uses E2M1 LUT lookup × E8M0 scale (MXFP4 microscaling).
Also adds model_opt_nvfp4_full.py for full model NVFP4 quantization.
2026-05-08 02:25:43 +00:00
b5d569218c Add full nvfp4 quantization script + complete dequant script
- model_opt_nvfp4_full.py: Full NVFP4 quantization (not experts-only)
  Uses --gpu_max_mem_percentage 0.9 instead of --use_seq_device_map
- dequant_fp8_to_bf16.py: Now handles INT4-packed experts + FP8 shared
  experts + FP8 attention. Complete dequant to pure BF16.
2026-05-08 01:50:53 +00:00
db6beb5b76 Complete dequant script: handles INT4 experts, FP8 attention, FP8 shared experts
INT4 expert weights are packed 2-per-byte into int8 with float8_e8m0fnu
per-row 32-column block scales. Unpacking: lower nibble first, upper second.
Output dimensions are 2x the stored dimensions (e.g. [3072,3584] → [3072,7168]).

Also adds progress output with ETA per shard so screen sessions stay alive.
2026-05-08 01:39:50 +00:00
cbfc5a9afb Update nvfp4_experts_only to use dequantized BF16 model 2026-05-07 16:34:37 +00:00