134 Commits

Author SHA1 Message Date
f08bcd456b tweax n shit 2026-05-12 23:16:33 +00:00
2bdda36bb7 fucken aye 2026-05-12 22:33:55 +00:00
fa825c16b9 fucken ay 2026-05-12 21:58:08 +00:00
dbf1d11f9f ayee 2026-05-12 21:52:48 +00:00
ef3edb3481 ba fongol again4 2026-05-12 21:49:43 +00:00
b74dc7121a ba fongol again3 2026-05-12 21:48:00 +00:00
7bbbdbcc79 ba fongol again 2026-05-12 21:47:37 +00:00
2e674f87c1 ba fongol again 2026-05-12 21:47:14 +00:00
5d127d8294 ba fongol again 2026-05-12 21:47:01 +00:00
52cf3f2e25 ba fongol again 2026-05-12 21:46:42 +00:00
02decb486e ba fongol 2026-05-12 21:34:08 +00:00
48f1f9dc5e clanker nonsense again 2026-05-12 21:30:36 +00:00
5cabc1f7d9 clanker nonsense again 2026-05-12 21:29:59 +00:00
25a2d4e6ad clanker nonsense 2026-05-12 21:28:50 +00:00
d88ea9842b fix: add missing staging_kernel.py to Dockerfile — BF16→E2M1+UE4M3 quantization was never in container 2026-05-12 21:21:24 +00:00
91d7d9bad7 fucken a 2026-05-12 21:18:48 +00:00
d68e113af1 remove spammy shit 2026-05-12 20:57:04 +00:00
f0652693a6 dangit again 2026-05-12 19:13:01 +00:00
054792c84e dangit 2026-05-12 18:42:39 +00:00
de055b1e77 syupid clankers 2026-05-12 18:26:37 +00:00
307574bc91 test: signal alarm timeout for kernel hang 2026-05-12 15:14:39 +00:00
fcd6de0a60 test: simplify SF fill to avoid shape mismatch 2026-05-12 15:13:16 +00:00
d4c557fddc test: fix float8 randn + SF int32 packing 2026-05-12 15:12:35 +00:00
28afc2406b test: add random FP4 data and kernel timeout 2026-05-12 15:11:41 +00:00
787d427847 test: fix NVFP4 mega_moe test dimensions for SMEM alignment 2026-05-12 15:07:35 +00:00
8737fd57c0 remove crap 2026-05-12 14:53:42 +00:00
52c3aefe73 bump cache busters to 33 for debug build 2026-05-12 13:10:37 +00:00
ca1d306890 fix: use torch.int8 for packed FP4 tensors (kPackedFP4=kInt8, not uint8) 2026-05-12 12:23:43 +00:00
b8f95ffad3 docker: add OMP_NUM_THREADS=64, remove --tool initcheck, mount cubin cache 2026-05-12 11:15:06 +00:00
5840291ea3 fix staging kernel packed_k_mask double-count 2026-05-12 08:08:24 +00:00
5ea5b579c3 Trim banner, no code changes 2026-05-12 07:24:36 +00:00
74af9984f6 Bug fixes: UE4M3 scale conversion, staging kernel SF/E2M1 packing, wo_a UE4M3, README overhaul
- Fix _ue8m0_to_float32: checkpoint is float8_e4m3fn (UE4M3), not UE8M0
  - Changed from shift-by-23 to .to(torch.float32) in both copies
  - Fix fold_global_scale in DeepGEMM mega/__init__.py
- Fix staging kernel SF pack: int32 shift >= 32 is UB on GPU
  - Split 8-group pack into two int32 writes (groups 0-3, 4-7)
- Fix staging kernel E2M1 output: was writing unpacked (1 byte/elem)
  into packed buffer (hidden/2 bytes), causing 2x overflow
  - Now packs even/odd nibble pairs correctly
- Fix wo_a on-the-fly BF16→NVFP4: was encoding UE8M0, now UE4M3
  - Use .clamp(0, 448).to(float8_e4m3fn) instead of log2/exp trick
- Remove dead code: _ue8m0_uint8_to_float, tmp/, .bak, .s11,
  quant_module_patched.py, patch_finegrained_fp8_blackwell.py,
  patch_vllm_weights.py
- Remove SCALE-FMT diagnostic histogram clutter
- Update stale UE8M0 comments throughout
- Rewrite README: clean instructions, confirmed format details
2026-05-12 05:52:30 +00:00
a36bf47f11 fix: use tl.split instead of indexing for E2M1 pair packing
Triton doesn't support constexpr tensor indexing (e2m1_pairs[:, 0]).
Use tl.split() which splits the last axis into two tensors.
2026-05-11 22:39:38 +00:00
27dbf2850f fix: replace nested tl.where with sum-of-comparisons for E2M1 quantization
Triton can't compile deeply nested tl.where. Use arithmetic instead:
idx = sum(abs_s >= threshold_i) for 7 threshold values.
2026-05-11 22:23:05 +00:00
3d1f3de190 fix: syntax error — move triton imports before docstring, remove orphan @triton.jit 2026-05-11 22:08:50 +00:00
79d866995f bump cache buster 32 for packed FP4 mxf4nvf4 fix 2026-05-11 21:59:56 +00:00
c85b84b0fe fix: staging kernel outputs unpacked E2M1 (1 byte/element, not packed 2/byte)
Matches the SMEM layout: float_e2m1_unpacksmem_t is 1 byte/element.
L1→L2 handoff uses unpacked format (same byte count as FP8).
No bandwidth savings at L1→L2 for v1 — can optimize later.
2026-05-11 21:29:33 +00:00
01cfd02759 fix: same reshape fix in main patch file 2026-05-11 21:05:54 +00:00
076d325c97 fix: use reshape instead of risky [0::2] slicing for E2M1 packing 2026-05-11 21:04:53 +00:00
8dc917c498 fix: topk_weights_out store missing topk_offsets multiplier 2026-05-11 21:02:19 +00:00
17ba5a9d7b bump cache buster 30 for FP4 staging + DeepGEMM FP4 activations 2026-05-11 20:30:14 +00:00
7a4403fa98 feat: FP4 staging kernel - BF16 → E2M1 packed + UE4M3 block16 scales
mxf4nvf4 requires FP4×FP4, not FP8×FP4.
- New staging kernel: E2M1 nearest-neighbor quantization
- Output: uint8 packed (2 E2M1 per byte) + UE4M3 packed int32 scales
- Added CUDA sync diagnostics for error localization
2026-05-11 20:29:36 +00:00
0fd2d4f078 diag: add weight_scale uint8 histogram to verify E8M0 vs E4M3 format 2026-05-11 19:55:41 +00:00
50a945bde4 bump cache buster 29 2026-05-11 19:51:48 +00:00
48b905406a diag: add CUDA sync after mega_moe finalize + forward to catch errors 2026-05-11 19:51:44 +00:00
35f6b66678 fix: UE8M0 reinterpret in DeepGEMM fold_global_scale + bump cache 2026-05-11 19:40:08 +00:00
f32d6b5b48 bump cache buster to 27 2026-05-11 19:26:21 +00:00
cd24182e36 diag: add NaN/Inf + FP8-dtype checks after NVFP4 dequant 2026-05-11 19:26:12 +00:00
8ae2214bad fix: reorder Dockerfile ARG before COPY for proper cache busting 2026-05-11 18:48:07 +00:00
c4891e9ee2 fix: manual FP32→UE4M3 quant in Triton staging kernel
Triton can't cast float8e4nv → uint8 directly. Compute E4M3 bits manually:
extract FP32 exponent/mantissa, convert to E4M3 format (4-bit exp + 3-bit mant),
handle rounding and overflow, reconstruct dequantized value for FP8 activation quantization.
2026-05-11 16:38:52 +00:00
436109081c bump cache buster to 24 2026-05-11 16:12:56 +00:00
5faf9916eb fix: UE4M3 activation scales + group_size=16 for NVFP4 mega_moe
The mxf4nvf4 MMA instruction shares scale_format_ between SFA and SFB.
For NVFP4 (UE4M3), both activation and weight scales must be UE4M3.

Changes to _stage_deepseek_v4_mega_moe_inputs_kernel:
- GROUP_K=16 (was 32) — NVFP4 scale_vec::4X has group_size=16
- Scale quantization: float → float8_e4m3fn (UE4M3) instead of UE8M0
  exponent extraction (>> 23). Pack 4 UE4M3 bytes per int32.
- FP8 activation quantized against UE4M3 rounded scale

Also updated class docstring (was stale MXFP4 conversion description).
2026-05-11 16:12:36 +00:00
220649c188 docs: CORRECTED — mxf4nvf4 IS supported on sm_100a (B200)
Build 17-18 'scale_vec not supported' error was because we targeted
sm_100 instead of sm_100a. The 'a' suffix enables FP4 block-scaled
instructions. No need to fall back to mxf8f6f4 with UE8M0.

Path forward: target sm_100a, use mxf4nvf4.scale_vec::4X, keep
native UE4M3 scales + block16. No scale conversion needed.
2026-05-11 14:24:13 +00:00
cfead0012d docs: comprehensive README update through build 22
- Full architecture diagram and NVFP4→vLLM bridge details
- All 8 bugs documented with fixes
- SM100 hardware limitation (mxf4nvf4 unsupported)
- MegaMoE kernel architecture and debugging log (builds 1-22)
- Three paths forward (A: FlashInfer, B: BF16 mega_moe, C: SM103+)
- Container build pipeline, NVFP4 format spec, hard rules
2026-05-11 13:53:41 +00:00
8cb23bdb78 fix: import NVFP4 SymmBuffer from deep_gemm.mega 2026-05-11 08:05:50 +00:00
ff579c9767 fix: use NVFP4 SymmBuffer (2x SF size for group_size=16)
The NVFP4 mega_moe kernel needs a larger symmetric buffer because
group_size=16 produces 2x more scale factor entries than MXFP4's 32.
Switch from deep_gemm.get_symm_buffer_for_mega_moe to
deep_gemm.mega.nvfp4.get_symm_buffer_for_nvfp4_mega_moe.
2026-05-11 07:49:11 +00:00
1da40c53da fix: add patch cache buster to Dockerfile 2026-05-11 07:19:10 +00:00
b532742530 debug: add shape/dtype logging to finalize_weights 2026-05-11 07:13:44 +00:00
b1cf4232ee feat: wire DeepGEMM NVFP4 mega_moe kernel into vLLM patch
- DeepseekV4MegaMoEExperts now uses native NVFP4 path
- finalize_weights: transform_nvfp4_weights_for_mega_moe() instead of
  NVFP4→BF16→MXFP4 conversion
- forward: fp8_nvfp4_mega_moe() with recipe=(1,1,16)
- Experts stay in NVFP4. No MXFP4 conversion. Period.
2026-05-11 06:22:11 +00:00
a2e9b5f17f fix: add --enable-expert-parallel to compose command 2026-05-11 06:15:11 +00:00
c8564caf9d fix: patch vLLM deepseek_v4.py directly in image 2026-05-11 06:09:40 +00:00
7c8c6cd67f fix: add PYTHONPATH for deep_gemm import 2026-05-11 06:06:52 +00:00
cffb373759 fix: symlink NVRTC lib into cuda/lib64 for linker 2026-05-11 06:04:24 +00:00
983ba02c5b fix: add CUDA/NVRTC lib paths to Dockerfile 2026-05-11 06:02:13 +00:00
f0471ed1c2 fix: correct CR URL to atl.vultrcr.com 2026-05-11 05:59:06 +00:00
c234190a80 feat: add Dockerfile + build/push script for NVFP4 container
- Extends dream-build with DeepGEMM nvfp4-mega-moe kernel
- build_push.sh: builds, logs into Vultr CR, pushes, updates docker-compose
- CACHE_BUSTER parameter for forcing fresh clones
2026-05-11 05:57:49 +00:00
e963325b61 WIP: MegaMoE NVFP4 kernel + diagnostics
- Force use_mega_moe=True for NVFP4 pipeline
- DeepseekV4MegaMoEExperts: load NVFP4 params (float8 block scales,
  float32 global/input scales), convert NVFP4→BF16→MXFP4 in
  finalize_weights for the DeepGEMM mega_moe kernel
- Add _nvfp4_to_bf16 and _bf16_to_mxfp4 conversion methods
- Remove expert_dtype check blocking mega_moe
- Add diagnostics for wo_a and bf16 layer conversion
- Still WIP: attention layer bugs under investigation
2026-05-11 05:19:49 +00:00
7e2f219259 fix: banner uses _os instead of os (not yet imported) 2026-05-11 04:57:24 +00:00
cf54b4755a fix CRITICAL #7: UE8M0 block scale misinterpreted as E4M3
scale_fmt=ue8m0 means weight_scale bytes are E8M0 format (power-of-2 only).
A simple .to(float32) misinterprets them as E4M3 (which has mantissa bits),
producing completely wrong block scale values and garbled output.

Fix: add _ue8m0_to_float32() that reinterprets raw uint8 bits as IEEE 754
exponent field: (raw_byte << 23).view(float32) = 2^(raw-127).

Applied to:
- _dequant_nvfp4_to_bf16 (BF16 layers: fused_wqa_wkv, wq_b, wo_b)
- _convert_nvfp4_to_fp8 (wo_a FP8 conversion)
- _reconstruct_compressor_weight (compressor fused_wkv_wgate)
- BF16->FP4 quantization path (stores as UE8M0, reads back correctly)
2026-05-11 04:37:33 +00:00
7febeaeb71 README: document bugs #5 (input_scale) and #6 (fused_skip_regex), add version banner section, update status 2026-05-11 04:28:38 +00:00
26aaaba4a2 Add version banner to patch — prints commit, arch, bugs fixed at startup
Ensures we can always verify what's running inside the container
from the docker logs. No functional changes.
2026-05-11 04:28:10 +00:00
67f9086a26 Fix critical dequantization bug: remove input_scale from weight dequant
input_scale is for ACTIVATIONS, not weights. The correct NVFP4 weight
dequantization formula is: weight_bf16 = e2m1_value * block_scale * global_scale

Including input_scale made weights ~5000x too small, causing completely
garbled output (multilingual gibberish with repeating patterns).
2026-05-11 02:23:26 +00:00
02b8ea536f Update MEMORY.md and memory files with vLLM NVFP4 serving progress
Server running on B200 port 8000 with full NVFP4→vLLM bridge.
All critical bugs fixed: DeepGEMM scale format, compressor shapes, block scale values.
2026-05-11 02:02:49 +00:00
653e2d7a50 vLLM NVFP4 serving: full end-to-end pipeline working
Bridged the gap between ModelOpt NVFP4 and vLLM DeepSeek V4 attention.
Server loads and serves tokens on 8x B200 with TP=8, EP=8.

Key changes:
- wo_a: NVFP4->BF16->FP8 with DeepGEMM block-scale format for BMM einsum
  Uses deepgemm_post_process_fp8_weight_block for correct scale layout
  weight_scale_inv = DeepGEMM-formatted block scale (NOT per-tensor scalar)
  Block scale filled with fp8_scale (NOT all-ones -- causes garbage output)
- Attention: NVFP4->BF16 dequantization, UnquantizedLinearMethod
- Compressor: reconstruct fused_wkv_wgate from separate kv_proj+gate_proj
  Fixed indexer path: compressor.indexer.kv_proj (was loading main compressor)
- MoE experts: stay NVFP4, FLASHINFER_TRTLLM FusedMoE backend

Bugs fixed:
1. DeepGEMM sf.dim() assertion: weight_scale_inv must be block-scale tensor
2. Block scale dtype: float32 (not float8_e4m3fn)
3. Missing deepgemm_post_process args: quant_block_shape, use_e8m0
4. Compressor indexer shape mismatch: wrong checkpoint key prefix
5. All-ones block scale: DeepGEMM divides by 1.0 instead of actual scale

Updated README with full technical documentation of all fixes.
2026-05-11 02:01:46 +00:00
db16be8e5d S11: Fixed substr mapping, stacking, suffix, and o_a_proj - loads weights but attention forward uses FP8 einsum incompatible with NVFP4 2026-05-10 17:45:53 +00:00
6fd03a0aa0 vLLM serving: patched deepseek_v4.py, disabled mega_moe, updated docs
- Add patches/deepseek_v4.py: patched vllm source file with modelopt NVFP4
  weight name mappings (expert gate_proj→w1, mlp→ffn, self_attn→attn.mla_attn,
  compressor.kv_proj→wkv, etc.), E2M1 FP4→BF16 unpacking for stacked params,
  skip patterns for NVFP4 scale tensors on MergedColumnParallelLinear, and
  resilient loading for unknown params.

- Update docker-compose.yml: copy patched deepseek_v4.py over original at
  container startup, remove --moe-backend=deep_gemm_mega_moe (no NVFP4 kernel).

- Update patches/patch_vllm_weights.py: legacy runtime monkey-patch approach
  (doesn't work with worker processes), kept for reference.

- Update README.md: added vLLM serving run history table (S1-S10), documented
  all open issues (MergedColumnParallelLinear+NVFP4, no mega_moe kernel,
  resilient loading), added vLLM-specific bug list and key notes.

- Update scripts/serve_vllm.py: add WARN comment on mega_moe flag.
2026-05-10 16:14:17 +00:00
d88793dee6 Add vllm weight mapper patch and docker-compose 2026-05-10 09:33:48 +00:00
30608e3834 Config patches: document modelopt↔vllm gaps with NVIDIA reference 2026-05-10 08:59:28 +00:00
0d74b97fb2 Config patches doc + compress_ratios runtime patch in serve script 2026-05-10 08:23:11 +00:00
f65d4ab99f Run 11 SUCCESS: 881GB NVFP4 exported, add vLLM serve script 2026-05-10 07:54:34 +00:00
eb80bd6f80 README + memory: Run 10 result (export crash in get_weight_scaling_factor), Run 11 running
- Run 10: calibration succeeded but export crashed in get_weight_scaling_factor
  (stale GPU weight, not just amax). Patch 4 forces weight to CPU at
  _export_quantized_weight entry point, covering the entire export chain.
- Updated Key Lessons with Run 10 analysis
- Updated Runtime Patches section to document all 8 patches
- Added Bug #8 (stale GPU weight tensors)
- Updated Do NOT Repeat list
2026-05-09 23:00:17 +00:00
07cd50e823 8 patches covering full export chain — no more whack-a-mole
Traced the full execution chain from _process_quantized_modules through
every function that reads stale GPU tensors:

  _process_quantized_modules
    → _export_quantized_weight (Patch 4: force weight to CPU at entry point)
      → get_weight_scaling_factor (Patch 7: belt-and-suspenders)
        → get_weights_scaling_factor_from_quantizer (safe: weight now CPU)
        → NVFP4QTensor.get_weights_scaling_factor (safe: input is CPU)
      → get_weight_scaling_factor_2 (Patch 8: force quantizer to CPU)
      → get_activation_scaling_factor (Patch 3: CPU + clamp)
      → to_quantized_weight (Patch 6: force all tensors to CPU)
      → weight.to(dtype) (safe: weight is CPU)
    → _export_fused_experts (Patch 5: force expert weights + quantizer to CPU)

Patch 4 is the key: it moves weight to CPU at the earliest possible point,
so ALL downstream .to(weight.device) calls resolve to CPU.
Patches 5-8 are belt-and-suspenders for alternative code paths.
2026-05-09 22:50:58 +00:00
efc111a11f Add Patch 4+5: get_weight_scaling_factor and get_weight_scaling_factor_2 CPU safety
Run 10 completed calibration (128/128) but crashed at export in
get_weight_scaling_factor — the weight tensor on GPU was stale after
5+ hours of calibration, and weight_scaling_factor_2.to(weight.device)
triggered cudaErrorIllegalAddress.

Patches 4+5 force weight and quantizer state to CPU before computing
scaling factors. This mirrors the same pattern as Patch 3
(get_activation_scaling_factor).

Calibrated state saved successfully (721.4 GB, 47,696 amax tensors).
Amax snapshot saved (15.4 MB). Re-running with new patches.
2026-05-09 22:43:48 +00:00
ce9056d259 README overhaul: reflect current architecture (hf_main, run history through Run 10)
- Architecture section: call hf_main() directly, not rewrite the pipeline
- Run history: all 10 runs with root causes and fixes
- Key lessons: stale GPU tensors, expert OOM, pipeline rewriting trap, __main__ gap
- Runtime patches: 3 monkey-patches + 3 post-calibration hook steps
- Do NOT repeat: 8 specific mistakes with run references
- File layout with legacy patches note
2026-05-09 16:09:09 +00:00
5a72da7193 Fix: apply hf_ptq __main__ post-parse conversions (dataset split, calib_size int list)
When calling hf_main(args) directly, the __main__ block conversions that
run between parse_args() and main() are skipped. calib_size stays as
string '128' instead of [128], causing TypeError on list concatenation.
2026-05-09 15:58:36 +00:00
8612914169 Update run history: Runs 7-8, Run 9 running on a300302 2026-05-09 15:00:23 +00:00
a300302486 Fix: use hf_ptq.py arg names (--pyt_ckpt_path, --qformat, --inference_tensor_parallel) 2026-05-09 14:57:28 +00:00
1a36a655ea Fix: use full argparse flag names (--calib_size, --kv_cache_qformat) 2026-05-09 14:54:51 +00:00
b2849a8944 Fundamental rewrite: call hf_main() instead of rewriting the pipeline
The previous approach tried to reconstruct hf_ptq's pipeline by importing
individual functions and building a fake argparse.Namespace. This caused
repeated crashes from missing args (KV_QUANT_CFG_CHOICES, dataset,
calib_with_images, etc.).

New approach:
- Call hf_ptq.parse_args() with sys.argv replaced — gets ALL defaults
- Call hf_main(args) — the exact same entry point the shell script uses
- Hook export_quantized to add amax snapshot + state save before export
- No more missing args. No more diverging from the example script.

The only changes from the stock pipeline:
1. Runtime patches (load_calib_amax CPU, export_amax CPU, clamp)
2. Post-calibration hook (snapshot amax, save state, force CPU)
2026-05-09 14:52:02 +00:00
a70593d886 Update run history: Run 6 (dataloader crash), Run 7 running on 25b4d8d 2026-05-09 13:40:00 +00:00
25b4d8da06 Fix: add missing args for make_calib_dataloader (dataset, calib_with_images, auto_quantize, specdec) 2026-05-09 13:37:24 +00:00
d1e15178b2 Update run history: Runs 4-5 (import bugs), Run 6 running on 6c1bff6 2026-05-09 09:29:20 +00:00
6c1bff6997 Clean rewrite: verified all imports against runtime, removed dead code
- get_model/get_tokenizer imported from example_utils (not hf_ptq)
- KV_QUANT_CFG_CHOICES imported from hf_ptq (not mtq)
- Removed dead _FORCE_AMAX_CPU global and global reference in run_export_only
- Fixed stale comments
- All 16 imports and references verified against the actual B200 runtime
- Zero divergences from modelopt example path except get_model()
2026-05-09 09:26:23 +00:00
86dd8df302 Fix: KV_QUANT_CFG_CHOICES is in hf_ptq, not mtq 2026-05-09 09:17:12 +00:00
99f861f48a Update README and memory: Run 3 OOM crash, Run 4 running on f9bbef8
- Added Run 3 to table (model loading OOM, fixed with get_model())
- Added Run 4 (current, commit f9bbef8)
- Added bug #7 (model loading OOM during expert weight concat)
- Added 'do NOT repeat' for AutoModelForCausalLM.from_pretrained
- Documented all 5 runtime patches
- Noted only divergence from modelopt example: get_model()
2026-05-09 08:10:04 +00:00
f9bbef8e91 Fix: patch load_calib_amax instead of amax property setter (can't patch readonly descriptor)
Also remove _FORCE_AMAX_CPU global — load_calib_amax patch handles it.
2026-05-09 08:04:03 +00:00
94179ed9d0 Fix typo: store_only → store_true 2026-05-09 08:02:09 +00:00
03c10ab3b6 Fix model loading: use modelopt get_model() instead of raw AutoModelForCausalLM
Raw from_pretrained OOMs during weight conversion — torch.cat on expert
gate_up_proj tries to allocate 31.5GB on a GPU with only 25.9GB free.
modelopt's get_model() handles max_memory/device_map properly for models
that need sequential device mapping.
2026-05-09 08:00:50 +00:00
9438af5a8c Add commit hashes to run history table 2026-05-09 06:47:26 +00:00
d7593fc1dd Update README: run history table, bug #1 already fixed, cost note, don't-repeat mistakes 2026-05-09 06:44:17 +00:00
6eaba26914 Defensive quantization: snapshot amax to CPU immediately after calibration
Key changes:
- snapshot_amax_to_cpu(): copies all quantizer _amax to CPU and saves
  to disk (~50MB) right after mtq.quantize() returns, before any other
  GPU operation can corrupt them
- force_all_amax_to_cpu(): nuclear option, moves _pre_quant_scale and
  _global_amax to CPU too
- _FORCE_AMAX_CPU flag + patched amax setter: after calibration, any
  future amax writes go to CPU instead of GPU
- --validate-only mode to check saved state without running anything
- restore_amax_from_snapshot() for --export-only recovery
- torch.cuda.empty_cache() + gc.collect() between steps
- Patches: export_amax CPU fallback, get_activation_scaling_factor
  clamp instead of assert
2026-05-09 06:31:08 +00:00
3907838409 Remove ModuleList patch (already fixed in modelopt 0.45), fix numbering 2026-05-09 06:10:18 +00:00
382c1d872f Fix quant_module import path 2026-05-09 06:09:17 +00:00
9291165ba0 Fix imports: QUANT_CFG_CHOICES is in hf_ptq, not modelopt config 2026-05-09 06:08:35 +00:00
a0bacb3cf6 Replace shell wrapper with in-process quantize script
- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
2026-05-09 06:07:22 +00:00
04304fdae6 Add export crash fix patches, update README with bug #5 (repr CUDA crash) 2026-05-08 23:28:32 +00:00
50348989b2 Clarify: V4 is NOT BF16, dequantize first 2026-05-08 17:31:35 +00:00
24e3b3745d Pin modelopt and transformers versions in README 2026-05-08 17:23:10 +00:00
b08afea425 remove weird session dump crap 2026-05-08 17:21:18 +00:00
a2370006f7 Update README: document full pipeline, BF16 verification, calib 128 constraint 2026-05-08 17:17:48 +00:00
f1d21900ea Remove upcast_to_bf16.py — superseded by dequant_fp8_to_bf16.py 2026-05-08 17:13:39 +00:00
ca9a4f5eaa Purge OpenClaw session files, memory dumps, __pycache__. Update .gitignore 2026-05-08 17:09:59 +00:00
eeba101cc4 Cleanup: nuke dead scripts and stale docs, rewrite README for full NVFP4 pipeline 2026-05-08 17:02:07 +00:00
075da675dc fix: update HF token, echo it at runtime, export both HF_TOKEN and HUGGING_FACE_HUB_TOKEN 2026-05-08 16:57:32 +00:00
36e1342270 nvfp4_full: pass HF_TOKEN env var for gated calibration dataset 2026-05-08 13:33:45 +00:00
3d38e1d5cd nvfp4_full: drop calib to 128, gpu_max_mem to 0.7 for VRAM headroom 2026-05-08 06:24:45 +00:00
d0fc5338fe model_opt_nvfp4_full: add use_seq_device_map, fix source for /bin/sh 2026-05-08 05:50:16 +00:00
b70a04696e Add resume capability to dequant script (skip already-done shards)
Verified our FP4 dequant is byte-identical to official transformers
MXFP4 implementation. Max diff = 0.0 across all values.
2026-05-08 02:58:24 +00:00
f63eed5cfd Purge INT4 references — expert weights are FP4 (E2M1), not INT4
All docs and scripts updated. Historical memory entries annotated.
2026-05-08 02:33:46 +00:00
f8533197f2 Fix: expert weights are FP4 (E2M1), not INT4 - verified with nibble analysis
Nibble index 0 vs 8 ratio = 0.996 (FP4 -0.0 ≈ +0.0), NOT INT4 where -8 would be rare.
FP4 dequant uses E2M1 LUT lookup × E8M0 scale (MXFP4 microscaling).
Also adds model_opt_nvfp4_full.py for full model NVFP4 quantization.
2026-05-08 02:25:43 +00:00
b5d569218c Add full nvfp4 quantization script + complete dequant script
- model_opt_nvfp4_full.py: Full NVFP4 quantization (not experts-only)
  Uses --gpu_max_mem_percentage 0.9 instead of --use_seq_device_map
- dequant_fp8_to_bf16.py: Now handles INT4-packed experts + FP8 shared
  experts + FP8 attention. Complete dequant to pure BF16.
2026-05-08 01:50:53 +00:00
db6beb5b76 Complete dequant script: handles INT4 experts, FP8 attention, FP8 shared experts
INT4 expert weights are packed 2-per-byte into int8 with float8_e8m0fnu
per-row 32-column block scales. Unpacking: lower nibble first, upper second.
Output dimensions are 2x the stored dimensions (e.g. [3072,3584] → [3072,7168]).

Also adds progress output with ETA per shard so screen sessions stay alive.
2026-05-08 01:39:50 +00:00
cbfc5a9afb Update nvfp4_experts_only to use dequantized BF16 model 2026-05-07 16:34:37 +00:00
b5d14aa8b8 Add proper FP8→BF16 dequantization script
Unlike the naive upcast, this properly dequantizes FP8 block-wise weights:
bf16 = fp8_weight * scale_expanded (128x128 blocks).

Also removes the now-unnecessary scale tensors and updates config.
FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().
2026-05-07 15:45:46 +00:00
6008cf128d Add model_opt_nvfp4_experts_only.py
Quantizes only MoE expert weights to NVFP4, leaving attention untouched.
Includes comments documenting all available NVFP4 strategies.
Copy to model_opt_nvfp4_<strategy>.py for each new strategy.
2026-05-07 15:16:08 +00:00
a7664aee7d Add BF16 upcast script and Blackwell DeepGEMM patch 2026-05-07 14:29:50 +00:00
7a3b81e833 Add BF16 upcast script and Blackwell DeepGEMM patch
- scripts/upcast_to_bf16.py: Converts mixed-precision V4 Pro to pure BF16
  by upcasting all FP8 tensors (float8_e8m0fnu etc.) to bfloat16.
  Needed because modelopt PTQ calibration crashes on Blackwell with FP8
  kernels (DeepGEMM unsupported, Triton finegrained-fp8 has K mismatches).

- patches/patch_finegrained_fp8_blackwell.py: Patches transformers to
  reject DeepGEMM on SM100+ (Blackwell), letting it fall back to Triton.
  Note: the Triton fallback also fails during modelopt calibration on
  quantized weights, so upcasting to BF16 is the working solution.
2026-05-07 14:25:30 +00:00
ef89ceffbd Add ModelOpt NVFP4 pipeline: patch, run script, README
- Patch fixes iter_weights_for_calibration() for DeepseekV4Experts
  (ModuleList quantizers vs singular)
- Run script uses official NVIDIA hf_ptq.py with FP8 source
- Documents flags to avoid (--low_memory_mode, wrong arg names)
2026-05-07 07:22:54 +00:00
116933dcf6 Fix: skip .cuda() when low_memory_mode; switch default to nvfp4 2026-05-07 03:06:33 +00:00
b8bdd00d19 Lower GPU max_memory to 100GiB, add CPU-only fallback for low_memory_mode 2026-05-07 02:49:24 +00:00
717151b98c Add CPU offloading and max_memory caps for FP8 model loading 2026-05-07 02:40:48 +00:00
aff12c6951 Fix forward_loop: pass as callable, not via create_forward_loop 2026-05-07 02:08:09 +00:00
492e44c0f6 Fix dataloader API: max_sample_length not seq_len, proper create_forward_loop 2026-05-07 02:04:54 +00:00
b32bb2e84d NVIDIA Model Optimizer branch: nvfp4_experts_only PTQ for DeepSeek V4 Pro 2026-05-07 00:11:31 +00:00
13 changed files with 3879 additions and 1218 deletions

30
Dockerfile Normal file
View File

@@ -0,0 +1,30 @@
# DeepSeek V4 NVFP4 vLLM + DeepGEMM Mega MoE
# Extends the vLLM dream-build container with our custom DeepGEMM kernel
FROM atl.vultrcr.com/vllm/vllm-with-lmcache:dream-build
# Install build essentials
RUN apt-get update && apt-get install -y git screen cmake && rm -rf /var/lib/apt/lists/*
# Clone and build DeepGEMM with NVFP4 mega_moe kernel
# CACHE_BUSTER: increment to force fresh clone
RUN git clone -b nvfp4-mega-moe https://sweetapi.com/biondizzle/DeepGEMM.git /root/DeepGEMM && PATCH_CACHE_BUSTER=70
# Build DeepGEMM with proper CUDA/NVRTC paths
ENV CPATH="/usr/local/lib/python3.12/dist-packages/flashinfer/data/cutlass/include:/usr/local/lib/python3.12/dist-packages/nvidia/cu13/include:/usr/local/cuda-13.0/include:${CPATH}"
ENV PYTHONPATH="/root/DeepGEMM:${PYTHONPATH}"
# NVRTC lives in the pip nvidia/cu13 package, but the linker expects it in cuda/lib64
# Create a symlink so -lnvrtc resolves
RUN ln -sf /usr/local/lib/python3.12/dist-packages/nvidia/cu13/lib/libnvrtc.so.13 /usr/local/cuda/lib64/libnvrtc.so && PATCH_CACHE_BUSTER=70
RUN cd /root/DeepGEMM && python3 setup.py build_ext --inplace && PATCH_CACHE_BUSTER=69
# Bust cache for patch changes — ARG before COPY ensures layer invalidation
ARG PATCH_CACHE_BUSTER=70
# Copy our DeepSeek V4 patch over vLLM's model file
COPY patches/deepseek_v4.py /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py
# Copy the NVFP4 staging kernel (BF16→E2M1+UE4M3 quantization for activations)
COPY patches/staging_kernel.py /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/staging_kernel.py
# Verify everything imports
RUN python3 -c "import deep_gemm; print('DeepGEMM NVFP4 OK')" && \
python3 -c "import vllm; print('vLLM OK')"

310
README.md
View File

@@ -1,106 +1,216 @@
# DeepSeek V4 Pro → NVFP4 conversion kit
# DeepSeek V4 Pro → NVFP4 Quantization + vLLM Serving
Two paths for converting `sgl-project/DeepSeek-V4-Pro-FP8` (the uniform-FP8 repackage of the original mixed-precision V4 Pro) into NVFP4 for Blackwell inference.
Full NVFP4 quantization of DeepSeek V4 Pro and vLLM serving on 8× NVIDIA B200 GPUs.
## What's here
## Quick Status
| Component | Status |
|-----------|--------|
| NVFP4 Quantization | ✅ 881GB (Run 11), modelopt 0.45.0.dev64 |
| Weight Loading | ✅ 95 safetensors shards, all 8 TP ranks |
| Dequant Verification | ✅ Bit-exact match against official dequant (0.0 relative error) |
| NVFP4→FP8 Conversion (wo_a) | ✅ DeepGEMM block-scale format |
| NVFP4→BF16 Dequantization | ✅ 305 attn/shared, 91 compressor layers |
| Compressor Reconstruction | ✅ Separate kv_proj/gate_proj → fused_wkv_wgate |
| MoE Expert Serving (MegaMoE) | 🔧 Kernel builds & runs on sm_100a, debugging illegal CUDA access |
| Output Quality | 🔧 Under investigation |
## B200 Node
- **IP**: `45.76.247.107`
- **User**: `root`
- **Password**: see `.env`
- **GPUs**: 8× NVIDIA B200 (SM100a)
- **Model weights**: `/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4/`
- **BF16 reference**: `/root/nvidia-meeting/DeepSeek-V4-Pro-BF16/`
## Repositories
| Repo | Branch | Purpose |
|------|--------|---------|
| `deepseek-v4-quant` | `modelopt-nvfp4` | Main repo: patches, quantize, serve scripts |
| `DeepGEMM` | `nvfp4-mega-moe` | NVFP4 mega_moe kernel fork |
## Architecture
```
DeepSeek V4 Pro (1.2T params, 61 layers)
├── MLA Attention (61 layers)
│ ├── fused_wqa_wkv → BF16 (UnquantizedLinearMethod)
│ ├── wo_a → FP8 (DeepGEMM block-scale, BMM einsum)
│ ├── wo_b → BF16 (UnquantizedLinearMethod)
│ └── compressor.fused_wkv_wgate → BF16 (reconstructed from NVFP4)
├── MoE Experts (256 experts per layer, 61 layers)
│ └── MegaMoE path → NVFP4 (DeepGEMM mxf4nvf4, native block16)
└── Shared Expert → FP8 (Fp8LinearMethod, DeepGEMM)
```
## NVFP4 Format (Confirmed)
| Field | Format | Notes |
|-------|--------|-------|
| Weights | E2M1 packed uint8 | 2 values per byte |
| Block scales | `torch.float8_e4m3fn` (UE4M3) | Standard NVFP4 spec, group_size=16 |
| Global scales | `torch.float32` (weight_scale_2) | **Scalar per expert** (`torch.Size([])`) |
| Dequant | `value = packed_E2M1 * block_scale * global_scale` | Block scale range [0, 448] |
**Key finding**: The checkpoint stores block scales as `torch.float8_e4m3fn` (UE4M3), NOT UE8M0.
`.to(torch.float32)` is the correct conversion. The shift-by-23 trick was wrong — it was
applying an E8M0→float conversion to E4M3 bytes, producing garbage.
## Dequant Verification
We verified the dequant path is bit-exact against the official reference:
```python
W_bf16 = dequantize_fp4_weight(W_int, S)
y_ours = W_bf16 @ x.bfloat16()
y_ref = official_expert_forward(W_int, S, x)
print((y_ours - y_ref).abs().max() / y_ref.abs().mean())
```
Result:
```
Max abs diff: 0.00000000
Mean abs diff: 0.00000000
Relative error: 0.000000
Matmul max diff: 0.00000000
```
## Running
### 1. Quantize
```bash
# On B200 node, in screen
screen -S quantize
cd /root/nvidia-meeting
bash run_quantize_nvfp4.sh
# ~7 hours, $161 per run
```
### 2. Build Container
```bash
# From this repo
bash build_push.sh
# Always build in screen: screen -S build
```
The Dockerfile:
1. Extends `atl.vultrcr.com/vllm/vllm-with-lmcache:dream-build`
2. Clones DeepGEMM (`nvfp4-mega-moe` branch) and builds
3. Copies `patches/deepseek_v4.py` over vLLM's model file
### 3. Serve
```bash
# On B200 node
cd /root/nvidia-meeting
docker compose up -d
# Check logs
docker logs -f nvidia-meeting-vllm-1
# Test
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "/model", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'
```
### vLLM Flags
```
--trust-remote-code
--kv-cache-dtype fp8
--block-size 256
--enable-expert-parallel
--tensor-parallel-size 8
--compilation-config {"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}
--attention_config.use_fp4_indexer_cache=True
--tokenizer-mode deepseek_v4
--tool-call-parser deepseek_v4
--enable-auto-tool-choice
--reasoning-parser deepseek_v4
--speculative_config {"method":"mtp","num_speculative_tokens":2}
```
## NVFP4 Mega MoE Kernel
### What We Built
A native NVFP4 mega_moe kernel in our DeepGEMM fork. Weights stay in E2M1 packed format
and use `kind::mxf4nvf4.block_scale.scale_vec::4X` MMA directly on SM100a (B200).
**This is novel — NVIDIA has not done NVFP4→vLLM integration.**
### Kernel Architecture
| Parameter | Value |
|-----------|-------|
| PTX instruction | `tcgen05.mma.kind::mxf4nvf4.block_scale.scale_vec::4X` |
| kGranK | 16 (NVFP4 native block_size) |
| Weight format | E2M1 packed uint8 (unchanged from checkpoint) |
| Block scales | UE4M3 (float8_e4m3fn), native — no conversion needed |
| Global scales | Folded into block scales before packing |
| Instruction desc | `float_ue4m3_t` |
| SF layout | block16, scale_vec::4X |
| UTCCP stride | i*8 (4X layout) |
| kNumSFUint32 | kHidden / 64 (4 UE4M3 per int32) |
| recipe | (1, 1, 16) |
| Target arch | `sm_100a` (the `a` suffix is **required**) |
### Python API
- `fp8_nvfp4_mega_moe()` — entry point, recipe=(1,1,16)
- `transform_nvfp4_weights_for_mega_moe()` — fold global scales, pack UE4M3→int32, TMA-align
- `get_symm_buffer_for_nvfp4_mega_moe()` — 2× SF buffer vs MXFP4
### C++ Bindings
- `csrc/apis/mega_nvfp4.hpp` — kGranK=16, SF stride K/16, packed E2M1 hidden/2
- `csrc/jit_kernels/impls/sm100_fp8_nvfp4_mega_moe.hpp` — host-side TMA descriptors
- `deep_gemm/include/deep_gemm/impls/sm100_fp8_nvfp4_mega_moe.cuh` — kernel
### Full FP4 Pipeline
The `mxf4nvf4` instruction is FP4×FP4 — both activations (A) and weights (B) must be E2M1 packed.
A Triton staging kernel quantizes BF16 activations → E2M1 packed uint8 + UE4M3 block16 scales
before the GEMM. The L1 epilogue outputs UE4M3 activation scales directly (float→e4m3 cast).
## Bugs Found and Fixed
| # | Bug | Impact | Fix |
|---|-----|--------|-----|
| 1 | DeepGEMM `sf.dim()` crash | Server crash | `deepgemm_post_process_fp8_weight_block` for block-scale format |
| 2 | Block scale dtype `float8_e4m3fn` | Crash | Use `float32` for block-scale tensor |
| 3 | Missing `deepgemm_post_process` args | Crash | Pass `quant_block_shape`, `use_e8m0` |
| 4 | Compressor indexer shape mismatch | Crash | `.indexer.` sub-path in checkpoint keys |
| 5 | All-ones block scale | Garbage output | `torch.full(..., fp8_scale)` not `torch.ones` |
| 6 | `fused_skip_regex` skipping q_b/o_a/o_b scales | Garbage output | Remove non-fused scale entries from skip list |
| 7 | UE8M0 shift-by-23 applied to E4M3 scales | Garbled output | Checkpoint is standard UE4M3 — use `.to(float32)` (shift-by-23 was wrong) |
| 8 | wo_a BF16→NVFP4 on-the-fly used UE8M0 encoding | Scrambled attention | Produce UE4M3 directly: `.clamp(0, 448).to(float8_e4m3fn)` |
| 9 | FP8 activations fed to mxf4nvf4 (FP4×FP4 instruction) | Crash/garbled | Full FP4 pipeline: activations are E2M1 packed + UE4M3 scales |
| 10 | Staging kernel SF pack: shift ≥32 is UB | Half the activation scales zeroed | Split into 2 int32 writes per k_block (groups 0-3, 4-7) |
| 11 | Staging kernel wrote unpacked E2M1 (1 byte/elem) into packed buffer | 2× buffer overflow | Pack even/odd nibble pairs, write BLOCK_K//2 bytes |
| 12 | `compute-sanitizer` build running during debug | Slow (50×), masking timing | Remove sanitizer, rebuild |
## Files
| File | Purpose |
| --- | --- |
| `inspect_model.py` | Run this first. Prints tensor name patterns, dtypes, FP8 scaling block sizes, and counts of MoE expert/router/norm tensors so you know exactly what you're dealing with before any conversion. |
| `fp8_to_nvfp4_streaming.py` | **Path A.** Pure tensor-level streaming FP8 → NVFP4 conversion. No model loading, no calibration, weight-only. Low memory, fast, deterministic. Recommended for first run. |
| `quantize_llmcompressor.py` | **Path B.** `llm-compressor` oneshot with sequential pipeline + activation calibration. Produces W4A4 with calibrated activation scales. Higher quality on activation-sensitive ops but riskier given V4 is two weeks old. |
| `verify_nvfp4.py` | Loads the produced NVFP4 checkpoint, runs a basic forward pass through one block, checks for NaN/Inf, and dumps a few generated tokens via vLLM. |
|------|---------|
| `patches/deepseek_v4.py` | Main patch: NVFP4 weight loading, dequant, staging kernel, MegaMoE |
| `patches/staging_kernel.py` | Reference copy of Triton staging kernel (live copy is in deepseek_v4.py) |
| `scripts/dequant_fp8_to_bf16.py` | BF16 dequantization utility |
| `scripts/quantize_nvfp4.py` | NVFP4 quantization runner |
| `scripts/serve_vllm.py` | Standalone vLLM server launcher |
| `Dockerfile` | Container build (extends dream-build with DeepGEMM + patch) |
| `docker-compose.yml` | Production serve config |
| `build_push.sh` | Build, push to CR, update docker-compose |
## Hardware assumptions
## HARD RULES
- 8× B200 baremetal, 1.5 TB HBM total
- 2.7 TB system RAM
- ≥10 TB free NVMe at `~/nvidia-meeting/`
## Prereqs
```bash
source ~/nvidia-meeting/venv/bin/activate
pip install --upgrade torch safetensors transformers tqdm
pip install --upgrade llmcompressor compressed-tensors # only needed for Path B
pip install --upgrade vllm # only needed for verify
```
You'll likely need `transformers` from source for V4 architecture support, and `trust_remote_code=True` everywhere. Stock pip versions may not load V4 yet.
## Recommended order tonight
```bash
cd ~/nvidia-meeting
# 1. Inspect the FP8 source — 30 seconds, no GPU needed.
python inspect_model.py DeepSeek-V4-Pro-FP8 | tee inspect.log
# 2. Path A streaming conversion — should run in 2-6 hours dominated by NVMe I/O.
python fp8_to_nvfp4_streaming.py \
--src DeepSeek-V4-Pro-FP8 \
--dst DeepSeek-V4-Pro-NVFP4-streaming \
--workers 8 \
2>&1 | tee path_a.log
# 3. Quick sanity check — does it load and forward-pass?
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming
# 4. Path B (overnight). Run only after Path A succeeds. 24-72 hours.
python quantize_llmcompressor.py \
--src DeepSeek-V4-Pro-FP8 \
--dst DeepSeek-V4-Pro-NVFP4-llmcompressor \
--num-samples 256 \
--max-seq-len 4096 \
2>&1 | tee path_b.log
```
## Path A — what it does
1. Reads `model.safetensors.index.json` to map every tensor to its shard.
2. Classifies every tensor:
- **Preserve** (copied bit-for-bit): `lm_head`, `embed_tokens`, MoE router gates (`*.mlp.gate`), all norms, V4-specific attention indexer/scoring tensors, mHC residual mixing weights.
- **Quantize**: any FP8 weight that has a corresponding `*.weight_scale_inv` companion (i.e. real GEMM weights).
3. For every quantizable weight:
- Dequantizes FP8 E4M3 → FP32 using the source's per-block scales (auto-detects 128×128 blocks).
- Computes NVFP4 dual scales: per-tensor `weight_scale_2 = amax / (6.0 * 448.0)` and per-16-element-block `weight_scale = block_amax / (6.0 * weight_scale_2)` cast to FP8 E4M3.
- Quantizes FP32 → E2M1 representable values `{0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}`.
- Packs two 4-bit values per `uint8` byte.
4. **MoE pair handling**: detects `gate_proj` (w1) + `up_proj` (w3) of each expert and computes a joint `weight_scale_2` across both, since vLLM's fused MoE kernel requires them to share that global scale.
5. Streams output to new shards (~5 GB each) with a fresh `model.safetensors.index.json` and copies all non-tensor files (config, tokenizer, etc.) verbatim.
**This is weight-only NVFP4.** Activation quantization is not done here — you get W4A16 effective behavior at runtime unless your inference engine generates dynamic per-group activation scales. vLLM does generate per-group activation scales dynamically at inference, so this is fine for most use cases.
## Path B — what it does
1. Loads the FP8 model via `transformers` with `device_map="auto"` and the offload folder pointing at NVMe. With 2.7 TB RAM, the FP8 weights (~865 GB) sit in RAM; activations and per-layer BF16 promotion happen on the B200s.
2. Loads a calibration set (default 256 samples of `HuggingFaceH4/ultrachat_200k`).
3. Runs `llm-compressor` `oneshot` with `pipeline="sequential"` so only one transformer block is materialized in BF16 on GPU at a time.
4. `moe_calibrate_all_experts=True` ensures every routed expert gets calibration signal even when natural routing wouldn't pick it.
5. The recipe targets `Linear` with NVFP4 and the same ignore list as Path A (lm_head, embed, router gates, norms, indexer, mHC).
6. Saves with `save_compressed=True` in `compressed-tensors` format.
**The known risks for Path B on V4 specifically:**
- V4 architecture is brand new. `llm-compressor` may not have a registered MoE wrapper for V4 — you may need to call `replace_modules_for_calibration` with the actual V4 MoE class name (the script has a TODO and a fallback path).
- Sequential pipeline may not handle CSA/HCA hybrid attention if the attention forward isn't a simple linear chain. If you see weird offload errors during calibration, the indexer/scoring tensors are likely the culprit.
- Calibration cache for 256 routed experts × all V4 layers can be hundreds of GB. Watch `nvidia-smi` and `free -h` during the first 30 minutes.
## Things to discuss with the NVIDIA engineer
1. **NVFP4 packing convention.** My converter packs as `byte = elem0 | (elem1 << 4)` (low nibble first). Verify this matches what TensorRT-LLM / cutlass NVFP4 kernels expect. If reversed, just flip in `pack_fp4()`.
2. **Joint scaling extension.** I implement joint `weight_scale_2` for `gate_proj`/`up_proj` pairs. Ask whether `down_proj` also benefits, or whether all three experts in a fused MoE block should share — recipes have varied.
3. **mHC residual weights.** I preserve them in FP8/BF16 conservatively. If NVIDIA has actually quantized these somewhere internally, drop them out of the ignore list to recover memory.
4. **CSA + HCA indexer/scoring tensors.** I preserve these blindly based on the V3.2 DSA precedent. Ask whether V4's compressed-sparse / heavily-compressed attention has analogous "cannot quantize" tensors and what the canonical regex is.
5. **W4A4 vs W4A16 for V4 Pro.** Path A is W4A16-equivalent; Path B is W4A4. For a 1.6T MoE with extreme long-context, ask which is internally recommended for first deployment.
6. **`modelopt` vs `llm-compressor` for V4.** RedHat shipped V4-*Flash* NVFP4 via `llm-compressor`. Why not Pro yet? Find out if there's a known-bad layer or just compute time.
## Output sizes to expect
- FP8 source: ~865 GB
- Path A NVFP4 output: ~430470 GB (about 2× compression vs FP8 source; experts dominate, norms/embeds add a bit back)
- Path B NVFP4 output: similar, plus activation scale metadata
## Resumability
Path A is checkpoint-resumable per shard — if it dies mid-run, re-running picks up from the next unwritten output shard. Path B is **not** resumable mid-calibration; if it crashes you restart.
- **NEVER convert DeepSeek MoE experts to MXFP4.** Experts stay in NVFP4. Period.
- **The checkpoint is UE4M3 (float8_e4m3fn), NOT UE8M0.** Never use shift-by-23 on these bytes.
- **Target `sm_100a`, not `sm_100`.** The `a` suffix is required for mxf4nvf4 instructions.

35
docker-compose.yml Normal file
View File

@@ -0,0 +1,35 @@
services:
vllm:
#image: atl.vultrcr.com/vllm/vllm-dsv4-nvfp4:latest
build:
context: .
pull_policy: always
ports:
- "8000:8000"
environment:
- OMP_NUM_THREADS=128
#- VLLM_USE_FLASHINFER_MOE_FP4=1 # What the fuck is this!?
command:
- /model
- --trust-remote-code
#- --kv-cache-dtype=fp8 # maybe we just let it figure its own shit out
#- --block-size=256
- --enable-expert-parallel
- --tensor-parallel-size=8
- --enforce-eager
#- --compilation-config={"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}
#- --attention_config.use_fp4_indexer_cache=True
- --tokenizer-mode=deepseek_v4
#- --speculative_config={"method":"mtp","num_speculative_tokens":2}
- --host=0.0.0.0
- --port=8000
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
- /root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4:/model:ro

View File

@@ -1,548 +0,0 @@
#!/usr/bin/env python3
"""Streaming FP8 → NVFP4 converter for DeepSeek V4 Pro (sgl-project FP8 repackage).
Path A: pure tensor-level conversion. No model loading via transformers, no
calibration. Reads FP8 safetensors shards, dequantizes per-block FP8 to FP32,
re-quantizes to NVFP4 (E2M1 packed in uint8 with FP8 E4M3 per-block scales and
an FP32 per-tensor global scale), and writes new shards.
Key behaviors:
- Joint global scale_2 across (gate_proj, up_proj) pairs of each expert,
required for vLLM fused MoE kernels.
- Preserves lm_head, embeddings, MoE router gates, norms, V4 indexer/scoring,
and mHC residual mixing weights at original precision.
- Streams shard-by-shard. Peak working memory is one tensor pair dequantized
to FP32 (a few hundred MB at most for the largest weights).
- Resumable per output shard.
NVFP4 format reference:
value = packed_fp4 * weight_scale * weight_scale_2
where:
packed_fp4: E2M1 in {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}, 2 per byte
weight_scale: FP8 E4M3, one per 16-element block
weight_scale_2: FP32 scalar per tensor, global
Usage:
python fp8_to_nvfp4_streaming.py \\
--src DeepSeek-V4-Pro-FP8 \\
--dst DeepSeek-V4-Pro-NVFP4-streaming \\
--workers 8
Optional:
--gpu N Use CUDA device N for the math (default: 0; -1 for CPU)
--shard-size-gb 5 Target output shard size
--dry-run Print what would be done; don't write
"""
import argparse
import json
import re
import shutil
import sys
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import torch
from safetensors import safe_open
from safetensors.torch import save_file
from tqdm import tqdm
# ---------------------------------------------------------------------------
# Classification: which tensors do we quantize, which do we preserve?
# ---------------------------------------------------------------------------
# NVFP4-everything: only preserve 1D/non-weight tensors that can't be NVFP4
PRESERVE_REGEXES = [
r".*embed_tokens.*", # embeddings (kept in original precision)
r".*\.(mlp|ffn)\.gate(\.weight)?$", # MoE router (1D or small gate, not a GEMM weight)
r".*norm.*", # all norms (1D)
r".*indexer.*", # V4 CSA indexer (non-GEMM)
r".*scoring.*", # V4 scoring tensors
r".*attn_sink.*", # V4 attention sink (scalar/1D)
r".*compressor\.ape.*", # V4 compressor APE (1D)
r".*tid2eid.*", # V4 MoE token-to-expert mapping (1D)
r".*\.bias$", # any biases
r".*hc_attn_base.*", # V4 hyper-connection scalars
r".*hc_attn_fn.*",
r".*hc_ffn_base.*",
r".*hc_ffn_fn.*",
r".*hc_head_scale.*",
r".*compressor\.wgate\.weight$", # V4 compressor gate (small, preserve)
r".*compressor\.wkv\.weight$", # V4 compressor KV proj (small, preserve)
r".*indexer\.wq_b\.weight$", # V4 indexer projections (small, preserve)
r".*indexer\.wkv\.weight$",
r".*indexer\.compressor\.wkv\.weight$",
r".*indexer\.gate_proj\.weight$",
r".*indexer\.compressor\.wgate\.weight$",
r".*indexer\.q_b_proj\.weight$",
]
PRESERVE_RE = re.compile("|".join(f"(?:{p})" for p in PRESERVE_REGEXES))
# Identify expert pairs that need joint global scale
EXPERT_PAIR_RE = re.compile(r"(.*experts\.\d+)\.(w1|w3)\.weight$")
def is_preserve(name: str) -> bool:
return bool(PRESERVE_RE.match(name))
# ---------------------------------------------------------------------------
# FP8 dequantization (per-block)
# ---------------------------------------------------------------------------
def dequant_fp8_to_fp32(weight_fp8: torch.Tensor, scale_inv: torch.Tensor) -> torch.Tensor:
"""Dequantize a per-block FP8 E4M3 weight to FP32 using its inverse-scale tensor.
DeepSeek convention: weight_scale_inv stores the dequant scale (multiply by it
to recover FP32). Block size is inferred from shape ratios — typically 128x128.
"""
assert weight_fp8.dim() == 2, f"Expected 2D weight, got shape {weight_fp8.shape}"
M, N = weight_fp8.shape
if scale_inv.dim() == 0:
# Per-tensor scale
return weight_fp8.float() * scale_inv.float()
if scale_inv.dim() == 1:
# Per-row or per-col — unusual for DeepSeek but handle it
if scale_inv.numel() == M:
return weight_fp8.float() * scale_inv.float().unsqueeze(1)
if scale_inv.numel() == N:
return weight_fp8.float() * scale_inv.float().unsqueeze(0)
raise ValueError(f"Cannot align 1D scale_inv {scale_inv.shape} to weight {weight_fp8.shape}")
# 2D block scaling
sm, sn = scale_inv.shape
bm = (M + sm - 1) // sm
bn = (N + sn - 1) // sn
scale_full = scale_inv.float().repeat_interleave(bm, dim=0).repeat_interleave(bn, dim=1)
scale_full = scale_full[:M, :N]
return weight_fp8.float() * scale_full
# ---------------------------------------------------------------------------
# NVFP4 quantization
# ---------------------------------------------------------------------------
FP4_E2M1_VALUES = torch.tensor(
[0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=torch.float32
)
# Boundaries between adjacent magnitudes (round-to-nearest with ties to even-ish)
FP4_BOUNDARIES = torch.tensor(
[0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0], dtype=torch.float32
)
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0
def round_to_fp4_e2m1_index(x: torch.Tensor) -> torch.Tensor:
"""Round x to nearest FP4 E2M1 representable, return 4-bit index in [0..15].
Index encoding: bit 3 = sign, bits 0..2 = magnitude index into FP4_E2M1_VALUES.
"""
sign = (x < 0).to(torch.uint8)
abs_x = x.abs().clamp_(max=FP4_MAX)
# searchsorted is fast on GPU; uses float32
boundaries = FP4_BOUNDARIES.to(x.device)
mag_idx = torch.searchsorted(boundaries, abs_x.contiguous()).to(torch.uint8)
return (sign << 3) | mag_idx
def quantize_to_nvfp4(
x_fp32: torch.Tensor,
scale_2: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
"""Quantize an FP32 weight to NVFP4 given a (possibly joint) global scale.
Args:
x_fp32: [M, N] FP32 tensor, N must be divisible by 16
scale_2: scalar FP32 tensor
Returns:
packed: [M, N//2] uint8, two FP4 values per byte (low nibble first)
weight_scale: [M, N//16] FP8 E4M3 per-block scales
"""
M, N = x_fp32.shape
if N % 16 != 0:
raise ValueError(f"NVFP4 requires N % 16 == 0; got {x_fp32.shape}")
# Per-block (16-element) amax
blocks = x_fp32.view(M, N // 16, 16)
block_amax = blocks.abs().amax(dim=-1) # [M, N//16]
# Per-block scale in FP32, then cast to FP8 E4M3 (this is the lossy step)
block_scale_fp32 = block_amax / (FP4_MAX * scale_2)
# Avoid zeros — produces NaN on dequant. Clamp tiny scales.
block_scale_fp32 = block_scale_fp32.clamp_(min=1e-30)
block_scale_fp8 = block_scale_fp32.to(torch.float8_e4m3fn)
# Recover the effective scale that the kernel will actually use
effective = scale_2 * block_scale_fp8.float() # [M, N//16]
# Quantize values: divide, clamp, round to E2M1
scaled = blocks / effective.unsqueeze(-1).clamp_(min=1e-30)
fp4_idx = round_to_fp4_e2m1_index(scaled) # [M, N//16, 16] uint8
fp4_idx = fp4_idx.view(M, N).contiguous()
# Pack two nibbles per byte: low = even-index element, high = odd-index element
low = fp4_idx[:, ::2]
high = fp4_idx[:, 1::2]
packed = (low | (high << 4)).to(torch.uint8)
return packed, block_scale_fp8
def compute_global_scale(*tensors_fp32: torch.Tensor) -> torch.Tensor:
"""Compute joint NVFP4 global scale_2 across one or more FP32 tensors.
scale_2 = amax / (FP4_MAX * FP8_E4M3_MAX)
"""
amax = torch.stack([t.abs().max() for t in tensors_fp32]).max()
scale_2 = amax / (FP4_MAX * FP8_E4M3_MAX)
# Avoid zero
return scale_2.clamp_(min=1e-30).float()
# ---------------------------------------------------------------------------
# Sharded output writer
# ---------------------------------------------------------------------------
class ShardedSafetensorsWriter:
"""Writes tensors to a sequence of safetensors shards, building an index map."""
def __init__(self, out_dir: Path, max_shard_bytes: int):
self.out_dir = out_dir
self.out_dir.mkdir(parents=True, exist_ok=True)
self.max_shard_bytes = max_shard_bytes
self.current = {} # name -> tensor (CPU)
self.current_bytes = 0
self.shard_idx = 0
self.weight_map: dict[str, str] = {} # name -> shard filename
self.shard_filenames: list[str] = []
def _flush(self):
if not self.current:
return
self.shard_idx += 1
# Use placeholder total; we'll rename at the end
fname = f"model-{self.shard_idx:05d}-of-PLACEHOLDER.safetensors"
path = self.out_dir / fname
save_file(self.current, str(path))
for name in self.current:
self.weight_map[name] = fname
self.shard_filenames.append(fname)
self.current.clear()
self.current_bytes = 0
def add(self, name: str, tensor: torch.Tensor):
# safetensors requires CPU tensors and contiguous
t = tensor.detach().cpu().contiguous()
size = t.numel() * t.element_size()
if self.current and self.current_bytes + size > self.max_shard_bytes:
self._flush()
self.current[name] = t
self.current_bytes += size
def close(self):
self._flush()
# Now rename shards to use proper of-N suffix
total = len(self.shard_filenames)
new_map = {}
for old_fname in self.shard_filenames:
idx = int(old_fname.split("-")[1])
new_fname = f"model-{idx:05d}-of-{total:05d}.safetensors"
(self.out_dir / old_fname).rename(self.out_dir / new_fname)
new_map[old_fname] = new_fname
# Patch weight_map
self.weight_map = {k: new_map[v] for k, v in self.weight_map.items()}
return self.weight_map
# ---------------------------------------------------------------------------
# Shard-level conversion plan
# ---------------------------------------------------------------------------
def build_plan(src_dir: Path):
"""Build the conversion plan from index.json.
Returns:
weight_map: name -> shard filename
shard_to_names: shard filename -> list of names in that shard
expert_pair_groups: list of (group_name, name_w1, name_w3)
For each expert, the gate_proj/up_proj pair gets a shared scale_2.
solo_quantize: list of names to quantize independently
preserve: list of names to copy unchanged
"""
with open(src_dir / "model.safetensors.index.json") as f:
index = json.load(f)
weight_map = index["weight_map"]
shard_to_names = defaultdict(list)
for name, fn in weight_map.items():
shard_to_names[fn].append(name)
# Gather all weight tensor names (those with .weight suffix)
all_weights = [n for n in weight_map if n.endswith(".weight")]
# Identify expert pairs
expert_pairs = defaultdict(dict) # base -> {"gate_proj": name, "up_proj": name}
for n in all_weights:
m = EXPERT_PAIR_RE.match(n)
if m:
base, kind = m.group(1), m.group(2)
expert_pairs[base][kind] = n
paired_names = set()
expert_pair_groups = []
for base, parts in expert_pairs.items():
if "w1" in parts and "w3" in parts:
expert_pair_groups.append((base, parts["w1"], parts["w3"]))
paired_names.add(parts["w1"])
paired_names.add(parts["w3"])
# Classify everything else
solo_quantize = []
preserve = []
scale_companions = [] # .scale tensors that get consumed during dequant
for n in weight_map:
if n.endswith(".scale") and n.replace(".scale", ".weight") in weight_map:
scale_companions.append(n)
continue
if n in paired_names:
continue
if is_preserve(n):
preserve.append(n)
continue
# Anything else with .weight gets quantized solo, otherwise preserved
if n.endswith(".weight"):
solo_quantize.append(n)
else:
preserve.append(n)
return {
"weight_map": weight_map,
"shard_to_names": dict(shard_to_names),
"expert_pair_groups": expert_pair_groups,
"solo_quantize": solo_quantize,
"preserve": preserve,
"scale_companions": scale_companions,
}
# ---------------------------------------------------------------------------
# Tensor loading helpers
# ---------------------------------------------------------------------------
class ShardCache:
"""Lazy per-shard safe_open cache so we don't re-open shards repeatedly."""
def __init__(self, src_dir: Path, max_open: int = 4):
self.src_dir = src_dir
self.max_open = max_open
self.handles: dict[str, "safe_open"] = {}
def get(self, shard_fname: str):
if shard_fname in self.handles:
return self.handles[shard_fname]
if len(self.handles) >= self.max_open:
# Drop one
old_fn = next(iter(self.handles))
self.handles[old_fn].__exit__(None, None, None)
del self.handles[old_fn]
h = safe_open(self.src_dir / shard_fname, framework="pt")
h.__enter__()
self.handles[shard_fname] = h
return h
def close(self):
for h in self.handles.values():
h.__exit__(None, None, None)
self.handles.clear()
def load_weight_and_scale(cache: ShardCache, weight_map, name):
"""Load an FP8 weight with its scale companion (if any)."""
weight = cache.get(weight_map[name]).get_tensor(name)
scale_name = name.replace(".weight", ".scale")
scale = None
if scale_name in weight_map:
try:
scale = cache.get(weight_map[scale_name]).get_tensor(scale_name)
except Exception:
# Scale listed in index but not in shard (BF16 weights have no scale)
pass
return weight, scale
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--src", required=True, help="Source FP8 model directory")
ap.add_argument("--dst", required=True, help="Output NVFP4 model directory")
ap.add_argument("--gpu", type=int, default=0, help="CUDA device, -1 for CPU")
ap.add_argument("--shard-size-gb", type=float, default=5.0)
ap.add_argument("--workers", type=int, default=4,
help="Concurrent tensor-conversion workers (lots of small tensors benefit; "
"actual GPU compute is serialized by torch)")
ap.add_argument("--dry-run", action="store_true")
args = ap.parse_args()
src = Path(args.src).resolve()
dst = Path(args.dst).resolve()
if not (src / "model.safetensors.index.json").exists():
sys.exit(f"No index.json at {src}")
device = torch.device(f"cuda:{args.gpu}" if args.gpu >= 0 and torch.cuda.is_available() else "cpu")
print(f"Compute device: {device}")
# Move FP4_BOUNDARIES to device once
global FP4_BOUNDARIES
FP4_BOUNDARIES = FP4_BOUNDARIES.to(device)
print("Building conversion plan...")
plan = build_plan(src)
n_pairs = len(plan["expert_pair_groups"])
n_solo = len(plan["solo_quantize"])
n_preserve = len(plan["preserve"])
n_scales = len(plan["scale_companions"])
print(f" Expert pair groups (joint scale_2): {n_pairs:,}")
print(f" Solo quantize tensors: {n_solo:,}")
print(f" Preserved tensors: {n_preserve:,}")
print(f" Scale companions consumed: {n_scales:,}")
if args.dry_run:
print("\nDry run — exiting before any writes.")
return
dst.mkdir(parents=True, exist_ok=True)
cache = ShardCache(src, max_open=8)
writer = ShardedSafetensorsWriter(dst, max_shard_bytes=int(args.shard_size_gb * 1024**3))
weight_map = plan["weight_map"]
t_start = time.time()
# ------------------------------------------------------------------
# 1. Preserved tensors — copy unchanged
# ------------------------------------------------------------------
for name in tqdm(plan["preserve"], desc="Preserve", unit="tensor"):
t = cache.get(weight_map[name]).get_tensor(name)
writer.add(name, t)
# ------------------------------------------------------------------
# 2. Expert pairs — joint scale_2 across (gate_proj, up_proj)
# ------------------------------------------------------------------
for base, name_w1, name_w3 in tqdm(plan["expert_pair_groups"], desc="Expert pairs", unit="pair"):
w1_fp8, s1 = load_weight_and_scale(cache, weight_map, name_w1)
w3_fp8, s3 = load_weight_and_scale(cache, weight_map, name_w3)
with torch.no_grad():
w1 = dequant_fp8_to_fp32(w1_fp8.to(device), s1.to(device)) if s1 is not None else w1_fp8.float().to(device)
w3 = dequant_fp8_to_fp32(w3_fp8.to(device), s3.to(device)) if s3 is not None else w3_fp8.float().to(device)
scale_2 = compute_global_scale(w1, w3)
packed1, blk1 = quantize_to_nvfp4(w1, scale_2)
packed3, blk3 = quantize_to_nvfp4(w3, scale_2)
writer.add(name_w1, packed1)
writer.add(name_w1.replace(".weight", ".weight_scale"), blk1)
writer.add(name_w1.replace(".weight", ".weight_scale_2"), scale_2)
writer.add(name_w3, packed3)
writer.add(name_w3.replace(".weight", ".weight_scale"), blk3)
writer.add(name_w3.replace(".weight", ".weight_scale_2"), scale_2)
# ------------------------------------------------------------------
# 3. Solo quantize tensors — independent scale_2 per tensor
# ------------------------------------------------------------------
for name in tqdm(plan["solo_quantize"], desc="Solo quantize", unit="tensor"):
w_fp8, s = load_weight_and_scale(cache, weight_map, name)
with torch.no_grad():
if s is not None:
w = dequant_fp8_to_fp32(w_fp8.to(device), s.to(device))
else:
# Already non-FP8 (e.g. BF16), just upcast
w = w_fp8.float().to(device)
scale_2 = compute_global_scale(w)
packed, blk = quantize_to_nvfp4(w, scale_2)
writer.add(name, packed)
writer.add(name.replace(".weight", ".weight_scale"), blk)
writer.add(name.replace(".weight", ".weight_scale_2"), scale_2)
# Finalize shards & index
final_weight_map = writer.close()
cache.close()
# ------------------------------------------------------------------
# 4. Write model.safetensors.index.json
# ------------------------------------------------------------------
total_size = sum(
(dst / fn).stat().st_size for fn in set(final_weight_map.values())
)
new_index = {
"metadata": {"total_size": total_size},
"weight_map": final_weight_map,
}
with open(dst / "model.safetensors.index.json", "w") as f:
json.dump(new_index, f, indent=2)
# ------------------------------------------------------------------
# 5. Copy non-tensor files (config, tokenizer, etc.)
# ------------------------------------------------------------------
for fname in src.iterdir():
if fname.is_dir():
# encoding/, inference/, assets/ — copy whole tree
dst_sub = dst / fname.name
if not dst_sub.exists():
shutil.copytree(fname, dst_sub)
continue
if fname.suffix == ".safetensors":
continue
if fname.name == "model.safetensors.index.json":
continue
shutil.copy2(fname, dst / fname.name)
# ------------------------------------------------------------------
# 6. Patch config.json with quantization metadata so loaders know
# ------------------------------------------------------------------
cfg_path = dst / "config.json"
if cfg_path.exists():
with open(cfg_path) as f:
cfg = json.load(f)
cfg["quantization_config"] = {
"quant_method": "compressed-tensors",
"format": "nvfp4-pack-quantized",
"config_groups": {
"group_0": {
"targets": ["Linear"],
"weights": {
"num_bits": 4,
"type": "float",
"strategy": "tensor_group",
"group_size": 16,
"symmetric": True,
},
}
},
"ignore": PRESERVE_REGEXES,
}
with open(cfg_path, "w") as f:
json.dump(cfg, f, indent=2)
elapsed = time.time() - t_start
print(f"\nDone in {elapsed/3600:.2f}h")
print(f"Output: {dst}")
print(f"Total size: {total_size/1024**3:.1f} GB across {len(set(final_weight_map.values()))} shards")
if __name__ == "__main__":
main()

View File

@@ -1,173 +0,0 @@
#!/usr/bin/env python3
"""Inspect a DeepSeek FP8 model directory and report on tensor structure.
Usage: python inspect_model.py <model_dir>
Prints:
- Total tensor count and dtype histogram
- Sample of tensor names by category (lm_head, embeddings, attention, MoE experts, norms, etc.)
- FP8 block scaling structure (block size detection)
- MoE expert layer count and routing structure
- Any "unusual" tensors that need manual classification
"""
import argparse
import json
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
from safetensors import safe_open
# Patterns we'd preserve (skip quantization on)
PRESERVE_PATTERNS = [
(re.compile(r".*lm_head.*"), "lm_head"),
(re.compile(r".*embed_tokens.*"), "embeddings"),
(re.compile(r".*\.mlp\.gate(\.weight)?$"), "moe_router_gate"),
(re.compile(r".*norm.*"), "normalization"),
(re.compile(r".*indexer.*"), "attention_indexer"), # V3.2 DSA / V4 CSA?
(re.compile(r".*hyper_conn.*"), "mhc_hyper_conn"), # V4 mHC
(re.compile(r".*mhc.*"), "mhc_other"),
(re.compile(r".*scoring.*"), "scoring"),
]
# Patterns for MoE expert weights (these are what we WILL quantize)
EXPERT_PATTERNS = [
(re.compile(r".*experts\.\d+\.gate_proj.*"), "expert_gate_proj"),
(re.compile(r".*experts\.\d+\.up_proj.*"), "expert_up_proj"),
(re.compile(r".*experts\.\d+\.down_proj.*"), "expert_down_proj"),
(re.compile(r".*shared_experts?\.gate_proj.*"), "shared_gate_proj"),
(re.compile(r".*shared_experts?\.up_proj.*"), "shared_up_proj"),
(re.compile(r".*shared_experts?\.down_proj.*"), "shared_down_proj"),
]
def categorize(name):
for pat, cat in PRESERVE_PATTERNS:
if pat.match(name):
return ("preserve", cat)
for pat, cat in EXPERT_PATTERNS:
if pat.match(name):
return ("quantize_expert", cat)
if name.endswith(".weight_scale_inv"):
return ("scale_metadata", "fp8_block_scale")
if name.endswith(".weight"):
return ("quantize_other", "linear_weight")
return ("other", "uncategorized")
def main():
ap = argparse.ArgumentParser()
ap.add_argument("model_dir")
ap.add_argument("--show-samples", type=int, default=5,
help="How many sample names to show per category")
args = ap.parse_args()
model_dir = Path(args.model_dir)
index_path = model_dir / "model.safetensors.index.json"
if not index_path.exists():
print(f"ERROR: {index_path} not found", file=sys.stderr)
sys.exit(1)
with open(index_path) as f:
index = json.load(f)
weight_map = index["weight_map"]
total_size = index.get("metadata", {}).get("total_size")
print(f"=== {model_dir} ===")
print(f"Total tensors: {len(weight_map):,}")
print(f"Total shards: {len(set(weight_map.values()))}")
if total_size:
print(f"Reported size: {total_size / 1024**3:.1f} GB")
print()
# Categorize names (cheap, no tensor loading)
categories = defaultdict(list)
for name in weight_map:
kind, cat = categorize(name)
categories[(kind, cat)].append(name)
print("=== Tensor categorization ===")
for (kind, cat), names in sorted(categories.items()):
print(f" [{kind:18s}] {cat:25s} count={len(names):,}")
for n in names[: args.show_samples]:
print(f" {n}")
if len(names) > args.show_samples:
print(f" ... and {len(names) - args.show_samples} more")
print()
# Inspect dtypes and FP8 block scaling on a sample shard
sample_shard = model_dir / sorted(set(weight_map.values()))[0]
print(f"=== Sampling dtypes from {sample_shard.name} ===")
dtype_hist = Counter()
fp8_block_sizes = Counter()
weight_with_scale = []
with safe_open(sample_shard, framework="pt") as f:
names_in_shard = list(f.keys())
for name in names_in_shard:
t = f.get_tensor(name)
dtype_hist[str(t.dtype)] += 1
# Check for FP8 weight + scale_inv pair
if name.endswith(".weight") and t.dtype.is_floating_point and t.element_size() == 1:
scale_name = name.replace(".weight", ".weight_scale_inv")
if scale_name in names_in_shard:
scale_t = f.get_tensor(scale_name)
bm = t.shape[0] / scale_t.shape[0] if scale_t.dim() == 2 else None
bn = t.shape[1] / scale_t.shape[1] if scale_t.dim() == 2 and t.dim() == 2 else None
fp8_block_sizes[(bm, bn)] += 1
if len(weight_with_scale) < 3:
weight_with_scale.append((name, t.shape, t.dtype, scale_t.shape, scale_t.dtype))
print(" Dtype histogram (this shard only):")
for d, c in dtype_hist.most_common():
print(f" {d:20s} {c:,}")
print()
print(" FP8 block-scale dimensions detected:")
for (bm, bn), c in fp8_block_sizes.most_common():
print(f" block_size = ({bm}, {bn}) count={c}")
print()
print(" Sample FP8 weight + scale_inv pairs:")
for name, wshape, wdt, sshape, sdt in weight_with_scale:
print(f" {name}")
print(f" weight: shape={tuple(wshape)} dtype={wdt}")
print(f" scale: shape={tuple(sshape)} dtype={sdt}")
# MoE structure summary
print()
print("=== MoE structure summary ===")
layer_experts = defaultdict(set)
for name in weight_map:
m = re.match(r".*layers\.(\d+)\..*experts\.(\d+)\..*", name)
if m:
layer_experts[int(m.group(1))].add(int(m.group(2)))
if layer_experts:
layer_count = len(layer_experts)
expert_counts = [len(v) for v in layer_experts.values()]
print(f" Layers with MoE experts: {layer_count}")
print(f" Experts per layer: min={min(expert_counts)} max={max(expert_counts)}")
print(f" Sample layer 0 experts: {sorted(list(layer_experts[min(layer_experts)]))[:5]}...")
else:
print(" No '.experts.N.' pattern found — MoE structure may use different naming.")
# Flag uncategorized for human review
print()
print("=== Uncategorized tensors (review these manually) ===")
uncat = categories.get(("other", "uncategorized"), [])
if uncat:
print(f" {len(uncat):,} tensors:")
for n in uncat[:20]:
print(f" {n}")
if len(uncat) > 20:
print(f" ... and {len(uncat) - 20} more")
else:
print(" None — every tensor matched a known pattern.")
if __name__ == "__main__":
main()

2364
patches/deepseek_v4.py Normal file

File diff suppressed because it is too large Load Diff

270
patches/staging_kernel.py Normal file
View File

@@ -0,0 +1,270 @@
"""
NVFP4 staging kernel — full FP4 (E2M1) activations + UE4M3 block16 scales.
The mxf4nvf4 PTX instruction requires BOTH A and B to be FP4 (E2M1 packed).
This kernel quantizes BF16 activations → E2M1 packed uint8 with UE4M3 scales.
"""
import triton
import triton.language as tl
import torch
@triton.jit
def _deepseek_v4_stage_mega_moe_inputs_kernel(
hidden_states,
x_fp4, # uint8, shape (M, K//2) — E2M1 packed, 2 values per byte
x_sf, # int32, shape (M, K//64) — UE4M3 packed, 4 scales per int32
topk_ids,
topk_weights,
topk_idx_out,
topk_weights_out,
hidden_stride_m: tl.constexpr,
hidden_stride_k: tl.constexpr,
x_stride_m: tl.constexpr,
x_stride_k: tl.constexpr,
x_sf_stride_m: tl.constexpr,
x_sf_stride_k: tl.constexpr,
topk_ids_stride_m: tl.constexpr,
topk_ids_stride_k: tl.constexpr,
topk_weights_stride_m: tl.constexpr,
topk_weights_stride_k: tl.constexpr,
topk_idx_stride_m: tl.constexpr,
topk_idx_stride_k: tl.constexpr,
topk_weights_out_stride_m: tl.constexpr,
topk_weights_out_stride_k: tl.constexpr,
hidden_size: tl.constexpr,
top_k: tl.constexpr,
BLOCK_K: tl.constexpr, # 128 elements (loaded from hidden)
GROUP_K: tl.constexpr, # 16 (NVFP4 group_size)
BLOCK_TOPK: tl.constexpr,
) -> None:
token_id = tl.program_id(0)
k_block_id = tl.program_id(1)
k_offsets = k_block_id * BLOCK_K + tl.arange(0, BLOCK_K)
k_mask = k_offsets < hidden_size
hidden = tl.load(
hidden_states + token_id * hidden_stride_m + k_offsets * hidden_stride_k,
mask=k_mask,
other=0.0,
).to(tl.float32)
num_groups: tl.constexpr = BLOCK_K // GROUP_K # 8
hidden_groups = tl.reshape(hidden, [num_groups, GROUP_K])
abs_groups = tl.reshape(tl.abs(hidden), [num_groups, GROUP_K])
amax = tl.max(abs_groups, axis=1)
amax = tl.maximum(amax, 1.0e-4)
# ---- UE4M3 scale computation ----
# scale = amax / 6.0 (E2M1 max value = 6)
# Then quantize scale to UE4M3 format
scale = amax / 6.0
scale_bits = scale.to(tl.uint32, bitcast=True)
scale_exp = (scale_bits >> 23) & 0xFF
scale_mant = scale_bits & 0x7FFFFF
# Convert FP32 → E4M3 manually (with subnormal support)
# FP32 bias=127, E4M3 bias=7 → raw exp = scale_exp - 120
e4m3_exp_raw = scale_exp - 120 # can be negative → subnormal
# Normal path: exp >= 1, just truncate mantissa top 3 bits
# RNE rounding: need guard (bit 19), sticky (OR of bits 18:0), and LSB of result
normal_mant = scale_mant >> 20
guard_bit = (scale_mant >> 19) & 1
sticky_bit = tl.where((scale_mant & 0x7FFFF) != 0, 1, 0) # OR of bits [18:0]
result_lsb = normal_mant & 1
# RNE: round up if (guard=1 and sticky=1) or (guard=1 and sticky=0 and lsb=1)
round_up = guard_bit & (sticky_bit | result_lsb)
normal_mant = normal_mant + round_up
normal_exp = e4m3_exp_raw
# Subnormal path: exp_raw <= 0
# Insert implicit leading 1 and right-shift by (1 - exp_raw)
# E4M3 subnormal: value = (mant/8) * 2^(1-7) = (mant/8) * 2^-6
# So we need: (1 + mant_fp32/2^23) * 2^(exp_raw - 7) = (shifted_mant/8) * 2^-6
# shifted_mant = (implicit_1 | mant_fp32) >> (1 - exp_raw - 1) then take top 3 bits
shift = 1 - e4m3_exp_raw # positive when subnormal
mant_with_leading = (0x800000 | scale_mant) # insert implicit 1
# Right-shift to get into the 3-bit E4M3 mantissa window
# We want bits [shift+19 : shift+23) of mant_with_leading for 3 mantissa bits + 1 round bit
subnormal_mant = (mant_with_leading >> (shift.to(tl.int32) + 20)) & 0x7
sub_guard_bit = (mant_with_leading >> (shift.to(tl.int32) + 19)) & 1
# Sticky: OR of all bits below the guard bit in the shifted result
# shift ≤ 8 in practice (amax floor = 1e-4 → scale ≈ 2^-15 → exp_raw ≈ -7), so mask ≤ 2^27
sub_sticky_mask = (1 << (shift.to(tl.int32) + 19)) - 1
sub_sticky_bit = tl.where((mant_with_leading & sub_sticky_mask) != 0, 1, 0)
sub_result_lsb = subnormal_mant & 1
sub_round_up = sub_guard_bit & (sub_sticky_bit | sub_result_lsb)
subnormal_mant = subnormal_mant + sub_round_up
is_normal = e4m3_exp_raw >= 1
e4m3_mant = tl.where(is_normal, normal_mant, subnormal_mant)
e4m3_exp = tl.where(is_normal, normal_exp, 0) # exp=0 for subnormals
# Handle mantissa overflow after rounding
overflow = e4m3_mant >= 8
e4m3_mant = tl.where(overflow, 0, e4m3_mant)
e4m3_exp = tl.where(overflow, e4m3_exp + 1, e4m3_exp)
e4m3_exp = tl.maximum(e4m3_exp, 0)
e4m3_exp = tl.minimum(e4m3_exp, 15)
scale_e4m3_bits = (e4m3_exp << 3) | e4m3_mant
# Reconstruct dequantized scale by decoding the STORED E4M3 bits.
# This guarantees the E2M1 quantization divides by exactly the value
# the CUDA kernel will multiply back — same bits, single decode, no
# possibility of encode/decode disagreement.
stored_exp = (scale_e4m3_bits >> 3) & 0xF
stored_mant = scale_e4m3_bits & 0x7
e4m3_exp_for_recon = tl.maximum(stored_exp.to(tl.int32) - 7, -126)
two_pow_exp_bits = (e4m3_exp_for_recon + 127).to(tl.uint32) << 23
two_pow_exp = two_pow_exp_bits.to(tl.float32, bitcast=True)
normal_value = (1.0 + stored_mant.to(tl.float32) / 8.0) * two_pow_exp
subnormal_value = (stored_mant.to(tl.float32) / 8.0) * 0.015625
e4m3_value = tl.where(stored_exp == 0, subnormal_value, normal_value)
# ---- E2M1 FP4 quantization (unpacked, 1 byte/element) ----
# E2M1 LUT (unsigned): [0, 0.5, 1, 1.5, 2, 3, 4, 6]
# Nearest-neighbor using thresholds (midpoints between consecutive values)
scaled = hidden_groups * (1.0 / tl.maximum(e4m3_value, 1e-6))[:, None]
# Clamp to E2M1 range [-6, 6]
scaled = tl.maximum(scaled, -6.0)
scaled = tl.minimum(scaled, 6.0)
abs_s = tl.abs(scaled)
# Thresholds: midpoints between [0, 0.5, 1, 1.5, 2, 3, 4, 6]
# [0, 0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0, INF]
e2m1_idx = tl.where(abs_s < 0.25, 0,
tl.where(abs_s < 0.75, 1,
tl.where(abs_s < 1.25, 2,
tl.where(abs_s < 1.75, 3,
tl.where(abs_s < 2.5, 4,
tl.where(abs_s < 3.5, 5,
tl.where(abs_s < 5.0, 6, 7)))))))
sign_bit = (scaled < 0).to(tl.int32)
e2m1_4bit = (sign_bit << 3) | e2m1_idx # 4-bit: (sign << 3) | index
# Pack E2M1 pairs into single bytes (2 per byte, low nibble first)
# mxf4nvf4 reads FP4 packed from SMEM — must match kernel's TMA layout
e2m1_flat = tl.reshape(e2m1_4bit, [BLOCK_K])
e2m1_lo = e2m1_flat[0::2] # even indices → low nibble
e2m1_hi = e2m1_flat[1::2] # odd indices → high nibble
e2m1_packed = (e2m1_hi << 4 | e2m1_lo).to(tl.uint8) # [BLOCK_K // 2]
k_offsets_out = k_block_id * (BLOCK_K // 2) + tl.arange(0, BLOCK_K // 2)
k_mask_out = k_offsets_out < (hidden_size // 2)
tl.store(
x_fp4 + token_id * x_stride_m + k_offsets_out * x_stride_k,
e2m1_packed,
mask=k_mask_out,
)
# Pack UE4M3 bytes into int32 (NVFP4: group_size=16, 4 groups per 64 elements)
# 8 groups per k_block of 128 → 2 int32s per k_block
# int32 can only pack 4 bytes (shifts >= 32 are UB), so split into two packs
scale_offsets = tl.arange(0, num_groups) # [0..7]
first_half = scale_offsets < 4 # groups 0-3 → int32[0]
second_half = scale_offsets >= 4 # groups 4-7 → int32[1]
packed_lo = tl.sum(
tl.where(first_half, scale_e4m3_bits.to(tl.int32) << (scale_offsets * 8), 0),
axis=0,
).to(tl.int32)
packed_hi = tl.sum(
tl.where(second_half, scale_e4m3_bits.to(tl.int32) << ((scale_offsets - 4) * 8), 0),
axis=0,
).to(tl.int32)
# Write 2 int32s per k_block: x_sf shape is (M, K//64) = (M, num_k_blocks * 2)
sf_base = token_id * x_sf_stride_m + k_block_id * 2 * x_sf_stride_k
tl.store(x_sf + sf_base, packed_lo)
tl.store(x_sf + sf_base + x_sf_stride_k, packed_hi)
if k_block_id == 0:
topk_offsets = tl.arange(0, BLOCK_TOPK)
topk_mask = topk_offsets < top_k
ids = tl.load(
topk_ids + token_id * topk_ids_stride_m + topk_offsets * topk_ids_stride_k,
mask=topk_mask,
other=0,
).to(tl.int64)
tl.store(
topk_idx_out
+ token_id * topk_idx_stride_m
+ topk_offsets * topk_idx_stride_k,
ids,
mask=topk_mask,
)
weights = tl.load(
topk_weights
+ token_id * topk_weights_stride_m
+ topk_offsets * topk_weights_stride_k,
mask=topk_mask,
other=0.0,
)
tl.store(
topk_weights_out
+ token_id * topk_weights_out_stride_m
+ topk_offsets * topk_weights_out_stride_k,
weights,
mask=topk_mask,
)
def _stage_deepseek_v4_mega_moe_inputs(
hidden_states: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
x_fp4: torch.Tensor, # uint8, shape (M, K//2)
x_sf: torch.Tensor, # int32, shape (M, K//64)
topk_idx_out: torch.Tensor,
topk_weights_out: torch.Tensor,
) -> None:
num_tokens, hidden_size = hidden_states.shape
if num_tokens == 0:
return
if hidden_size % 128 != 0:
raise ValueError(
"DeepSeek V4 MegaMoE input staging requires hidden_size to be "
"a multiple of 128."
)
top_k = topk_ids.shape[1]
if topk_weights.shape != topk_ids.shape:
raise ValueError(
"DeepSeek V4 MegaMoE input staging requires topk_weights and "
"topk_ids to have the same shape."
)
block_k = 128
grid = (num_tokens, triton.cdiv(hidden_size, block_k))
block_topk = triton.next_power_of_2(top_k)
_deepseek_v4_stage_mega_moe_inputs_kernel[grid](
hidden_states,
x_fp4,
x_sf,
topk_ids,
topk_weights,
topk_idx_out,
topk_weights_out,
hidden_states.stride(0),
hidden_states.stride(1),
x_fp4.stride(0),
x_fp4.stride(1),
x_sf.stride(0),
x_sf.stride(1),
topk_ids.stride(0),
topk_ids.stride(1),
topk_weights.stride(0),
topk_weights.stride(1),
topk_idx_out.stride(0),
topk_idx_out.stride(1),
topk_weights_out.stride(0),
topk_weights_out.stride(1),
hidden_size,
top_k,
BLOCK_K=block_k,
GROUP_K=16, # NVFP4: group_size=16 (scale_vec::4X)
BLOCK_TOPK=block_topk,
num_warps=4,
)

View File

@@ -1,218 +0,0 @@
#!/usr/bin/env python3
"""Path B: llm-compressor oneshot NVFP4 quantization for DeepSeek V4 Pro.
Uses sequential pipeline + activation calibration to produce W4A4 NVFP4 with
calibrated activation global scales. Higher quality than the streaming converter
on activation-sensitive ops, at the cost of much longer wall time and more
fragility on a brand-new architecture.
Memory plan with 2.7 TB host RAM + 8x B200 (1.5 TB HBM):
- FP8 base resident in CPU RAM: ~865 GB
- One transformer block on GPU at a time: ~10-30 GB HBM
- Activation calibration cache: tens to a few hundred GB
- Headroom: ~1.5+ TB RAM, ~1.4+ TB HBM
Critical: this loads the model with trust_remote_code=True. V4 architecture is
brand new; expect to need:
- transformers from source (or recent main)
- llm-compressor from source
- The V4 modeling code in DeepSeek-V4-Pro-FP8/inference/ to be importable
Usage:
python quantize_llmcompressor.py \\
--src DeepSeek-V4-Pro-FP8 \\
--dst DeepSeek-V4-Pro-NVFP4-llmcompressor \\
--num-samples 256 \\
--max-seq-len 4096
"""
import argparse
import os
import sys
from pathlib import Path
import torch
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--src", required=True, help="Source FP8 model directory")
ap.add_argument("--dst", required=True, help="Output NVFP4 model directory")
ap.add_argument("--num-samples", type=int, default=256)
ap.add_argument("--max-seq-len", type=int, default=4096)
ap.add_argument("--calibration-dataset", default="HuggingFaceH4/ultrachat_200k")
ap.add_argument(
"--offload-folder", default="/root/nvidia-meeting/.offload",
help="NVMe folder for accelerate disk-offload spillover (rarely needed at 2.7TB RAM)",
)
ap.add_argument(
"--no-activation-quant", action="store_true",
help="Quantize weights only (no activation calibration). Faster, closer to Path A."
)
args = ap.parse_args()
src = Path(args.src).resolve()
dst = Path(args.dst).resolve()
if not (src / "config.json").exists():
sys.exit(f"No config.json at {src}")
Path(args.offload_folder).mkdir(parents=True, exist_ok=True)
# Heavy imports happen here so --help is fast
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
# ----------------------------------------------------------------------
# 1. Load model
# ----------------------------------------------------------------------
print(f"Loading {src} ...")
print(" This will take several minutes — FP8 base is ~865 GB.")
# We want FP8 weights to stay as FP8 on CPU and only be promoted to BF16
# when each block goes to GPU during sequential calibration. The exact
# behavior depends on transformers' V4 modeling code — if it auto-dequants
# on load, expect 3.2 TB BF16 in RAM and you'll spill. Watch `free -h`.
tokenizer = AutoTokenizer.from_pretrained(src, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
src,
torch_dtype="auto",
device_map="cpu", # all on CPU; sequential pipeline moves blocks to GPU
trust_remote_code=True,
offload_folder=args.offload_folder,
)
print(f" Model class: {type(model).__name__}")
print(f" Param count: {sum(p.numel() for p in model.parameters()):,}")
# ----------------------------------------------------------------------
# 2. MoE handling — replace_modules_for_calibration
# ----------------------------------------------------------------------
# On Llama4/Qwen3-MoE, llm-compressor needs a wrapper class that exposes
# every expert during calibration (otherwise routed-only experts never see
# data). For DeepSeek V4 the MoE class name is something like
# `DeepseekV4MoE`. Try the canonical entrypoint first; fall back gracefully.
try:
from llmcompressor.modeling import replace_modules_for_calibration
print("Replacing MoE modules for calibration...")
replace_modules_for_calibration(model)
except ImportError:
print("WARN: replace_modules_for_calibration not available in this "
"llm-compressor version. Routed-only experts may not see "
"calibration data, lowering NVFP4 quality on rare experts.")
except Exception as e:
print(f"WARN: replace_modules_for_calibration failed: {e}")
print(" You may need to register a custom MoE wrapper for V4. "
"Find the MoE class name in DeepSeek-V4-Pro-FP8/inference/ and "
"register it via llmcompressor.modeling.register_module_replacement.")
# ----------------------------------------------------------------------
# 3. Calibration dataset
# ----------------------------------------------------------------------
print(f"Loading calibration dataset {args.calibration_dataset} ...")
ds = load_dataset(args.calibration_dataset, split="train_sft")
ds = ds.shuffle(seed=42).select(range(args.num_samples))
def preprocess(example):
# Use the model's chat template if it has one; ultrachat samples have a
# 'messages' field already in the OpenAI shape.
if "messages" in example:
try:
text = tokenizer.apply_chat_template(
example["messages"], tokenize=False, add_generation_prompt=False
)
except Exception:
text = "\n".join(m.get("content", "") for m in example["messages"])
else:
text = example.get("text") or example.get("prompt") or ""
return {"text": text}
ds = ds.map(preprocess, remove_columns=ds.column_names)
def tokenize(example):
return tokenizer(
example["text"],
truncation=True,
max_length=args.max_seq_len,
padding=False,
return_tensors=None,
)
ds = ds.map(tokenize, remove_columns=["text"])
# ----------------------------------------------------------------------
# 4. Recipe
# ----------------------------------------------------------------------
# NVFP4 W4A4 by default. The ignore list mirrors Path A's preserve list:
# output head, embeddings, MoE router gates (NOT gate_proj!), norms, and
# V4-specific attention indexer / mHC residual mixing weights.
ignore = [
"re:.*lm_head",
"re:.*embed_tokens$",
"re:.*\\.mlp\\.gate$",
"re:.*\\.mlp\\.gate\\.weight$",
"re:.*norm.*",
"re:.*indexer.*",
"re:.*hyper_conn.*",
"re:.*\\.mhc.*",
"re:.*scoring.*",
]
if args.no_activation_quant:
print("Recipe: NVFP4 weight-only (W4A16 effective)")
recipe = QuantizationModifier(
targets="Linear",
scheme="NVFP4A16", # weight-only variant
ignore=ignore,
)
else:
print("Recipe: NVFP4 W4A4 with activation calibration")
recipe = QuantizationModifier(
targets="Linear",
scheme="NVFP4",
ignore=ignore,
)
# ----------------------------------------------------------------------
# 5. Run oneshot — sequential pipeline is the key for memory
# ----------------------------------------------------------------------
print("Starting oneshot calibration + quantization (this is the long part)...")
print(f" num_samples={args.num_samples}, max_seq_len={args.max_seq_len}")
print(f" Watch with: watch -n 5 'free -h && nvidia-smi --query-gpu=memory.used,memory.free --format=csv'")
oneshot(
model=model,
dataset=ds,
recipe=recipe,
max_seq_length=args.max_seq_len,
num_calibration_samples=args.num_samples,
# Sequential pipeline: one block at a time on GPU, rest on CPU.
pipeline="sequential",
# Calibrate every expert, even routed-only ones that wouldn't see traffic.
moe_calibrate_all_experts=True,
)
# ----------------------------------------------------------------------
# 6. Save compressed
# ----------------------------------------------------------------------
print(f"Saving compressed checkpoint to {dst} ...")
dst.mkdir(parents=True, exist_ok=True)
model.save_pretrained(str(dst), save_compressed=True)
tokenizer.save_pretrained(str(dst))
# Copy any extra files that save_pretrained doesn't (encoding/, inference/, PDF)
import shutil
for fname in src.iterdir():
if fname.is_dir() and fname.name in {"encoding", "inference", "assets"}:
dst_sub = dst / fname.name
if not dst_sub.exists():
shutil.copytree(fname, dst_sub)
elif fname.suffix in {".pdf", ".md"} and not (dst / fname.name).exists():
shutil.copy2(fname, dst / fname.name)
print("Done.")
print(f"Output: {dst}")
if __name__ == "__main__":
main()

7
requirements.txt Normal file
View File

@@ -0,0 +1,7 @@
compressed-tensors<0.15.0
nvidia-modelopt[hf]
fire
flash-attn>=2.6.0
transformers<5.0
transformers_stream_generator
zstandard

View File

@@ -0,0 +1,276 @@
#!/usr/bin/env python3
"""
Complete dequantization of DeepSeek V4 Pro mixed-precision to pure BF16.
Handles ALL compressed tensor types found in the mixed-precision model:
1. FP8 attention weights (float8_e4m3fn + float8_e8m0fnu block scales)
- weight × scale_expanded → BF16
- 128×128 block quantization
2. FP4 (E2M1) expert weights (int8 packed + float8_e8m0fnu block scales)
- Unpack 2 FP4 values per int8 byte (lower nibble first, upper second)
- Dequantize via E2M1 LUT lookup × scale_expanded → BF16
- Per-row, 32-column block scaling (MXFP4 microscaling format)
- Output dimensions are 2× the stored dimensions
- Verified: nibble index 0 vs 8 ratio = 0.996 (FP4 -0.0 vs +0.0),
NOT INT4 where index 8 = -8 would be rare
3. FP8 shared expert weights (float8_e4m3fn + float8_e8m0fnu block scales)
- Same as FP8 attention dequantization
After dequantization, all weights are pure BF16. FP8Linear.forward() sees
element_size() > 1 and falls back to F.linear(), avoiding broken FP8 kernels
on Blackwell GPUs. The model can then be loaded by modelopt without shape
mismatches.
"""
import os, glob, json, shutil, sys, time
from safetensors import safe_open
from safetensors.torch import save_file
import torch
FP8_WEIGHT_DTYPE = torch.float8_e4m3fn
FP8_SCALE_DTYPE = torch.float8_e8m0fnu
BLOCK_SIZE_FP8 = (128, 128)
FP4_BLOCK_SIZE = 32 # columns per scale value for MXFP4 expert weights
# E2M1 FP4 lookup table (MXFP4 microscaling format)
# Index 0-7: positive values (sign=0, 2-bit exp, 1-bit mantissa)
# Index 8-15: negative values (sign=1)
# Mapping: 0→0, 1→0.5, 2→1, 3→1.5, 4→2, 5→3, 6→4, 7→6
FP4_E2M1_LUT = torch.tensor([
0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
-0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,
], dtype=torch.float32)
def dequantize_fp8_weight(fp8_weight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
"""Dequantize block-wise FP8 weight to BF16.
fp8_weight: (out_features, in_features) float8_e4m3fn
scale: (out_features//128, in_features//128) float8_e8m0fnu
"""
scale_f32 = scale.float()
out_features, in_features = fp8_weight.shape
scale_expanded = scale_f32.repeat_interleave(BLOCK_SIZE_FP8[0], dim=0).repeat_interleave(BLOCK_SIZE_FP8[1], dim=1)
scale_expanded = scale_expanded[:out_features, :in_features]
weight_bf16 = fp8_weight.float() * scale_expanded
return weight_bf16.to(torch.bfloat16)
def dequantize_fp4_weight(int8_packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
"""Dequantize MXFP4 (E2M1) expert weight to BF16.
FP4 values are packed 2-per-byte into int8 tensors.
Lower nibble (bits 0-3) is the first value, upper nibble (bits 4-7) is the second.
E2M1 format: 1 sign + 2 exponent + 1 mantissa bit.
Scale is per-row with 32-column blocks (float8_e8m0fnu, MX microscaling).
Output dimensions are 2× the stored dimensions.
int8_packed: (out_features, in_features//2) int8
scale: (out_features, in_features//32) float8_e8m0fnu
returns: (out_features, in_features) bfloat16
"""
lut = FP4_E2M1_LUT.to(int8_packed.device)
# Unpack nibble indices
lower_idx = (int8_packed & 0x0F).long() # 0-15
upper_idx = ((int8_packed >> 4) & 0x0F).long() # 0-15
# LUT lookup
lower = lut[lower_idx] # float32
upper = lut[upper_idx] # float32
out_features = int8_packed.shape[0]
in_features_full = int8_packed.shape[1] * 2 # 2× expansion
# Expand scale: (out_features, in_features//32) → (out_features, in_features)
scale_f32 = scale.float()
scale_expanded = scale_f32.repeat_interleave(FP4_BLOCK_SIZE, dim=1)
scale_expanded = scale_expanded[:, :in_features_full]
# Interleave lower and upper nibbles
unpacked = torch.empty(out_features, in_features_full, dtype=torch.float32, device=int8_packed.device)
unpacked[:, 0::2] = lower
unpacked[:, 1::2] = upper
# Dequantize: FP4 value × E8M0 scale
bf16_weight = (unpacked * scale_expanded).to(torch.bfloat16)
return bf16_weight
def dequantize_model(model_dir: str, out_dir: str):
os.makedirs(out_dir, exist_ok=True)
# Copy non-safetensor files
print("Copying metadata files...")
for f in os.listdir(model_dir):
fp = os.path.join(model_dir, f)
if not f.endswith(".safetensors") and os.path.isfile(fp):
shutil.copy2(fp, os.path.join(out_dir, f))
print(f" Copied {f}")
safetensor_files = sorted(glob.glob(os.path.join(model_dir, "*.safetensors")))
total_shards = len(safetensor_files)
print(f"Found {total_shards} shards")
# First pass: build scale-key → weight-key mapping
print("\nScanning for weight+scale pairs...")
scale_to_weight = {}
for f in safetensor_files:
with safe_open(f, framework="pt") as sf:
for key in sf.keys():
if key.endswith(".scale"):
weight_key = key[:-len(".scale")] + ".weight"
scale_to_weight[key] = weight_key
weight_to_scale = {v: k for k, v in scale_to_weight.items()}
print(f"Found {len(scale_to_weight)} weight+scale pairs")
# Classify weights by type (sample first 2 shards)
fp4_weight_keys = set()
fp8_weight_keys = set()
scale_keys = set(scale_to_weight.keys())
for f in safetensor_files[:2]:
with safe_open(f, framework="pt") as sf:
for key in sf.keys():
if key in weight_to_scale:
t = sf.get_tensor(key)
if t.dtype == torch.int8:
fp4_weight_keys.add(key)
elif t.dtype == FP8_WEIGHT_DTYPE:
fp8_weight_keys.add(key)
print(f" FP4 (E2M1) expert weights (packed): ~{len(fp4_weight_keys)} per shard")
print(f" FP8 attention/shared-expert weights: ~{len(fp8_weight_keys)} per shard")
# Second pass: dequantize and save
stats = {"fp4_dequantized": 0, "fp8_dequantized": 0, "scales_removed": 0, "unchanged": 0}
start_time = time.time()
for i, f in enumerate(safetensor_files):
shard_start = time.time()
tensors = {}
scales_in_shard = {}
with safe_open(f, framework="pt") as sf:
keys = list(sf.keys())
# First: collect scales
for key in keys:
if key in scale_keys:
t = sf.get_tensor(key)
scales_in_shard[key] = t
# Second: process weights and other tensors
for key in keys:
if key in scale_keys:
continue # handled separately
t = sf.get_tensor(key)
if key in weight_to_scale and t.dtype == torch.int8:
# FP4 (E2M1) packed expert weight (MXFP4 microscaling)
scale_key = weight_to_scale[key]
scale = scales_in_shard.get(scale_key)
if scale is None:
print(f" WARNING: scale {scale_key} not in same shard as {key}")
tensors[key] = t # keep as-is
continue
bf16 = dequantize_fp4_weight(t, scale)
tensors[key] = bf16
stats["fp4_dequantized"] += 1
del scales_in_shard[scale_key]
stats["scales_removed"] += 1
elif key in weight_to_scale and t.dtype == FP8_WEIGHT_DTYPE:
# FP8 weight (attention or shared expert)
scale_key = weight_to_scale[key]
scale = scales_in_shard.get(scale_key)
if scale is None:
print(f" WARNING: scale {scale_key} not in same shard as {key}")
tensors[key] = t
continue
bf16 = dequantize_fp8_weight(t, scale)
tensors[key] = bf16
stats["fp8_dequantized"] += 1
del scales_in_shard[scale_key]
stats["scales_removed"] += 1
else:
# Regular tensor (BF16, FP32, int64, etc.) - keep as-is
tensors[key] = t
stats["unchanged"] += 1
# Remove unused scales
for sk in scales_in_shard:
stats["scales_removed"] += 1
out_path = os.path.join(out_dir, os.path.basename(f))
if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
# Resume: skip already-dequantized shards
print(f"[{i+1}/{total_shards}] Skipping (already done): {os.path.basename(f)}")
del tensors, scales_in_shard
continue
save_file(tensors, out_path)
shard_time = time.time() - shard_start
elapsed = time.time() - start_time
rate = (i + 1) / elapsed if elapsed > 0 else 0
eta = (total_shards - i - 1) / rate if rate > 0 else 0
print(f"[{i+1}/{total_shards}] {os.path.basename(f)} "
f"({stats['fp4_dequantized']} fp4, {stats['fp8_dequantized']} fp8, "
f"{stats['scales_removed']} scales rm) "
f"[{shard_time:.1f}s, ETA: {eta/60:.0f}min]")
del tensors, scales_in_shard
# Update config
cfg_path = os.path.join(out_dir, "config.json")
if os.path.exists(cfg_path):
cfg = json.load(open(cfg_path))
cfg["torch_dtype"] = "bfloat16"
cfg["_experts_implementation"] = "eager"
if "quantization_config" in cfg:
del cfg["quantization_config"]
json.dump(cfg, open(cfg_path, "w"), indent=2)
print(f"\nUpdated config.json: torch_dtype=bfloat16, _experts_implementation=eager")
total_time = time.time() - start_time
print(f"\nDone in {total_time/60:.1f} minutes!")
print(f" FP4 expert weights dequantized: {stats['fp4_dequantized']}")
print(f" FP8 weights dequantized: {stats['fp8_dequantized']}")
print(f" Scale tensors removed: {stats['scales_removed']}")
print(f" Unchanged tensors: {stats['unchanged']}")
# Verify no FP8/INT8 remaining
print("\nVerifying...")
remaining_compressed = 0
for f in sorted(glob.glob(os.path.join(out_dir, "*.safetensors")))[:5]:
with safe_open(f, framework="pt") as sf:
for key in sf.keys():
t = sf.get_tensor(key)
if t.dtype in (torch.float8_e8m0fnu, torch.float8_e4m3fn, torch.int8):
remaining_compressed += 1
if remaining_compressed <= 5:
print(f" REMAINING: {key} {t.dtype} {t.shape}")
if remaining_compressed == 0:
print(" ✅ No compressed tensors remaining — model is pure BF16!")
else:
print(f" ⚠️ {remaining_compressed} compressed tensors still present")
out_size = sum(os.path.getsize(os.path.join(out_dir, f)) for f in os.listdir(out_dir) if f.endswith(".safetensors"))
print(f"Output size: {out_size / 1e12:.2f} TB")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Complete dequantization of DeepSeek V4 Pro to BF16")
parser.add_argument("model_dir", help="Path to mixed-precision model")
parser.add_argument("out_dir", help="Path to write dequantized BF16 model")
args = parser.parse_args()
dequantize_model(args.model_dir, args.out_dir)

587
scripts/quantize_nvfp4.py Normal file
View File

@@ -0,0 +1,587 @@
#!/usr/bin/env python3
"""
DeepSeek V4 Pro → NVFP4 quantization — defensive edition.
This script:
1. Applies runtime patches for GPU tensor safety (before modelopt runs)
2. Calls the SAME hf_ptq.py pipeline that the shell script uses
3. After calibration, snapshots amax to CPU and saves model state
The key insight: we don't rewrite the pipeline. We let hf_ptq do its thing
with all its args, defaults, and edge cases handled correctly. We just add
our defensive patches and post-calibration saves.
Must be run from the modelopt example directory:
cd /root/nvidia-meeting/modelopt-repo/examples/llm_ptq
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
Usage:
# Full run (calibrate + export):
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py
# Re-run export only (after a calibration save exists):
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --export-only
# Validate saved calibration state (check amax values):
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/quantize_nvfp4.py --validate-only
"""
import argparse
import gc
import os
import sys
import time
import torch
# ── Config ──────────────────────────────────────────────────────────────────
MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-BF16"
QUANT = "nvfp4"
TP = 8
CALIB_SIZE = 128
CALIB_SEQ = 512
KV_CACHE_QUANT = "fp8_cast"
GPU_MEM_PCT = 0.7
HF_TOKEN = "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"
# Paths
EXAMPLE_DIR = "/root/nvidia-meeting/modelopt-repo/examples/llm_ptq"
EXPORT_DIR = "/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4"
CALIB_SAVE_PATH = "/root/nvidia-meeting/v4_nvfp4_calibrated_state.pt"
AMAX_SNAPSHOT_PATH = "/root/nvidia-meeting/v4_nvfp4_amax_snapshots.pt"
def apply_patches():
"""Apply runtime patches for V4 compatibility and GPU tensor safety.
Root cause of all export crashes: use_seq_device_map keeps model weights on GPU
for 5+ hours during calibration. By export time, CUDA's memory allocator has
recycled the underlying memory, so any read of those GPU tensors triggers
cudaErrorIllegalAddress.
Fix strategy: patch at the EARLIEST possible entry points to force stale GPU
tensors to CPU before any downstream code reads them. This covers the full
chain of execution we traced through the export path:
_process_quantized_modules
→ _export_quantized_weight (or _export_fused_experts)
→ get_weight_scaling_factor
→ get_weights_scaling_factor_from_quantizer (reads weight, _amax, global_amax)
→ NVFP4QTensor.get_weights_scaling_factor (dynamic: reduce_block_amax on weight)
→ get_weight_scaling_factor_2 (reads _amax, global_amax)
→ get_activation_scaling_factor (reads _amax) [already patched]
→ to_quantized_weight (reads weight, does .to(weight.device) on scaling factors)
→ weight.to(dtype) (reads weight)
By forcing weight to CPU in Patch 4 (_export_quantized_weight), ALL downstream
.to(weight.device) calls resolve to CPU. Patches 5-8 are belt-and-suspenders.
"""
from modelopt.torch.quantization.nn.modules import tensor_quantizer as tq_module
from modelopt.torch.quantization.qtensor import nvfp4_tensor
from modelopt.torch.export import quant_utils
from modelopt.torch.quantization.utils import quantizer_attr_names as _quantizer_attr_names
import modelopt.torch.export.unified_export_hf as uehf
# ══════════════════════════════════════════════════════════════════════
# Patch 1: load_calib_amax — force _amax to CPU immediately after calibration
# This runs during calibration, right after each quantizer finishes.
# ══════════════════════════════════════════════════════════════════════
orig_load_calib_amax = tq_module.TensorQuantizer.load_calib_amax
def patched_load_calib_amax(self, *args, **kwargs):
orig_load_calib_amax(self, *args, **kwargs)
if hasattr(self, '_amax') and self._amax is not None:
self._amax = self._amax.cpu()
tq_module.TensorQuantizer.load_calib_amax = patched_load_calib_amax
print("✓ Patch 1: TensorQuantizer.load_calib_amax → force _amax to CPU")
# ══════════════════════════════════════════════════════════════════════
# Patch 2: export_amax — CPU safety net at export time
# ══════════════════════════════════════════════════════════════════════
orig_export_amax = tq_module.TensorQuantizer.export_amax
def patched_export_amax(self):
if hasattr(self, '_amax') and self._amax is not None and self._amax.is_cuda:
self._amax = self._amax.cpu()
return orig_export_amax(self)
tq_module.TensorQuantizer.export_amax = patched_export_amax
print("✓ Patch 2: TensorQuantizer.export_amax → CPU fallback")
# ══════════════════════════════════════════════════════════════════════
# Patch 3: get_activation_scaling_factor — CPU + clamp
# ══════════════════════════════════════════════════════════════════════
@classmethod
def patched_get_activation_scaling_factor(cls, quantizer):
if not quantizer.is_enabled:
return None
try:
amax = quantizer.export_amax()
except (torch.cuda.CudaError, RuntimeError) as e:
print(f" WARNING: export_amax() failed ({e}), attempting CPU recovery...")
if hasattr(quantizer, '_amax') and quantizer._amax is not None:
quantizer._amax = quantizer._amax.cpu()
amax = quantizer.export_amax()
if amax is None:
return None
amax = amax.cpu()
activation_scaling_factor = amax.float() / (quantizer.maxbound * 448.0)
if not torch.all(activation_scaling_factor > 0):
n_bad = (activation_scaling_factor <= 0).sum().item()
n_total = activation_scaling_factor.numel()
print(f" WARNING: {n_bad}/{n_total} activation scaling factors <= 0, clamping")
activation_scaling_factor = activation_scaling_factor.clamp(min=torch.finfo(torch.float32).tiny)
return activation_scaling_factor
nvfp4_tensor.NVFP4QTensor.get_activation_scaling_factor = patched_get_activation_scaling_factor
print("✓ Patch 3: NVFP4QTensor.get_activation_scaling_factor → CPU + clamp")
# ══════════════════════════════════════════════════════════════════════
# Patch 4: _export_quantized_weight — THE KEY PATCH
#
# This is the entry point for exporting each quantized module. It reads
# `weight = getattr(sub_module, weight_name)` which is on a stale GPU.
# By moving weight to CPU right here, ALL downstream functions are safe:
# - get_weight_scaling_factor: weight.device is now CPU
# - get_weights_scaling_factor: operates on CPU weight
# - to_quantized_weight: .to(weight.device) stays on CPU
# - weight.to(dtype): CPU cast
# We also force all quantizer state to CPU for the same reason.
# ══════════════════════════════════════════════════════════════════════
orig_export_quantized_weight = uehf._export_quantized_weight
def patched_export_quantized_weight(sub_module, dtype, weight_name="weight"):
# Move weight to CPU (stale GPU → safe CPU)
weight = getattr(sub_module, weight_name, None)
if weight is not None and isinstance(weight, torch.Tensor) and weight.is_cuda:
try:
weight_cpu = weight.cpu()
with torch.no_grad():
setattr(sub_module, weight_name, torch.nn.Parameter(weight_cpu))
except (torch.cuda.CudaError, RuntimeError) as e:
print(f" WARNING: weight.cpu() failed for {weight_name} ({e})")
raise
# Force all quantizer state to CPU
qattrs = _quantizer_attr_names(weight_name)
for qattr in [qattrs.weight_quantizer, qattrs.input_quantizer, qattrs.output_quantizer]:
if not qattr:
continue
quantizer = getattr(sub_module, qattr, None)
if quantizer is None:
continue
for attr in ['_amax', '_pre_quant_scale', 'global_amax', '_global_amax']:
val = getattr(quantizer, attr, None)
if val is not None and isinstance(val, torch.Tensor) and val.is_cuda:
try:
setattr(quantizer, attr, val.cpu())
except (torch.cuda.CudaError, RuntimeError):
pass
# Handle SequentialQuantizer (W4A8 path)
if hasattr(quantizer, 'quantizers'):
for sub_q in quantizer.quantizers:
for attr in ['_amax', '_pre_quant_scale', 'global_amax', '_global_amax']:
val = getattr(sub_q, attr, None)
if val is not None and isinstance(val, torch.Tensor) and val.is_cuda:
try:
setattr(sub_q, attr, val.cpu())
except (torch.cuda.CudaError, RuntimeError):
pass
return orig_export_quantized_weight(sub_module, dtype, weight_name)
uehf._export_quantized_weight = patched_export_quantized_weight
print("✓ Patch 4: _export_quantized_weight → force weight + quantizer state to CPU")
# ══════════════════════════════════════════════════════════════════════
# Patch 5: _export_fused_experts — same treatment for MoE expert weights
# DeepseekV4Experts go through this different code path.
# ══════════════════════════════════════════════════════════════════════
orig_export_fused_experts = uehf._export_fused_experts
def patched_export_fused_experts(sub_module, dtype):
# Force all expert weights to CPU
for name, param in list(sub_module.named_parameters()):
if isinstance(param, torch.Tensor) and param.is_cuda:
try:
with torch.no_grad():
setattr(sub_module, name, torch.nn.Parameter(param.cpu()))
except (torch.cuda.CudaError, RuntimeError):
pass
# Force all buffers to CPU
for name, buf in list(sub_module.named_buffers()):
if isinstance(buf, torch.Tensor) and buf.is_cuda:
try:
sub_module.register_buffer(name, buf.cpu())
except (torch.cuda.CudaError, RuntimeError):
pass
# Force all quantizer state to CPU
for mod in sub_module.modules():
for attr in ['_amax', '_pre_quant_scale', 'global_amax', '_global_amax']:
val = getattr(mod, attr, None)
if val is not None and isinstance(val, torch.Tensor) and val.is_cuda:
try:
setattr(mod, attr, val.cpu())
except (torch.cuda.CudaError, RuntimeError):
pass
return orig_export_fused_experts(sub_module, dtype)
uehf._export_fused_experts = patched_export_fused_experts
print("✓ Patch 5: _export_fused_experts → force expert weights + quantizer state to CPU")
# ══════════════════════════════════════════════════════════════════════
# Patch 6: to_quantized_weight — force scaling factors to CPU
# This does .to(weight.device) on scaling factors. With weight now on
# CPU (Patch 4), this should be a no-op, but belt-and-suspenders.
# ══════════════════════════════════════════════════════════════════════
orig_to_quantized_weight = quant_utils.to_quantized_weight
def patched_to_quantized_weight(weight, weights_scaling_factor, quantization,
weights_scaling_factor2=None, block_size=None):
if isinstance(weight, torch.Tensor) and weight.is_cuda:
weight = weight.cpu()
if weights_scaling_factor is not None and isinstance(weights_scaling_factor, torch.Tensor) and weights_scaling_factor.is_cuda:
weights_scaling_factor = weights_scaling_factor.cpu()
if weights_scaling_factor2 is not None and isinstance(weights_scaling_factor2, torch.Tensor) and weights_scaling_factor2.is_cuda:
weights_scaling_factor2 = weights_scaling_factor2.cpu()
return orig_to_quantized_weight(weight, weights_scaling_factor, quantization,
weights_scaling_factor2, block_size)
quant_utils.to_quantized_weight = patched_to_quantized_weight
print("✓ Patch 6: to_quantized_weight → force all tensors to CPU")
# ══════════════════════════════════════════════════════════════════════
# Patch 7: get_weight_scaling_factor — force weight + quantizer to CPU
# Belt and suspenders: Patch 4 should handle this, but this is also
# called from other code paths.
# ══════════════════════════════════════════════════════════════════════
orig_get_weight_scaling_factor = quant_utils.get_weight_scaling_factor
def patched_get_weight_scaling_factor(module, weight_name="weight"):
weight = getattr(module, weight_name, None)
if weight is not None and isinstance(weight, torch.Tensor) and weight.is_cuda:
try:
with torch.no_grad():
setattr(module, weight_name, torch.nn.Parameter(weight.cpu()))
except (torch.cuda.CudaError, RuntimeError) as e:
print(f" WARNING: weight.cpu() failed in get_weight_scaling_factor ({e})")
raise
weight_quantizer = getattr(module, _quantizer_attr_names(weight_name).weight_quantizer, None)
if weight_quantizer is not None:
for attr in ['_amax', '_pre_quant_scale', 'global_amax', '_global_amax']:
val = getattr(weight_quantizer, attr, None)
if val is not None and isinstance(val, torch.Tensor) and val.is_cuda:
try:
setattr(weight_quantizer, attr, val.cpu())
except (torch.cuda.CudaError, RuntimeError):
pass
return orig_get_weight_scaling_factor(module, weight_name)
quant_utils.get_weight_scaling_factor = patched_get_weight_scaling_factor
print("✓ Patch 7: get_weight_scaling_factor → force weight + quantizer to CPU")
# ══════════════════════════════════════════════════════════════════════
# Patch 8: get_weight_scaling_factor_2 — force quantizer to CPU
# ══════════════════════════════════════════════════════════════════════
orig_get_weight_scaling_factor_2 = quant_utils.get_weight_scaling_factor_2
def patched_get_weight_scaling_factor_2(module, weight_name="weight"):
weight_quantizer = getattr(module, _quantizer_attr_names(weight_name).weight_quantizer, None)
if weight_quantizer is not None:
for attr in ['_amax', '_pre_quant_scale', 'global_amax', '_global_amax']:
val = getattr(weight_quantizer, attr, None)
if val is not None and isinstance(val, torch.Tensor) and val.is_cuda:
try:
setattr(weight_quantizer, attr, val.cpu())
except (torch.cuda.CudaError, RuntimeError):
pass
return orig_get_weight_scaling_factor_2(module, weight_name)
quant_utils.get_weight_scaling_factor_2 = patched_get_weight_scaling_factor_2
print("✓ Patch 8: get_weight_scaling_factor_2 → force quantizer to CPU")
def snapshot_amax_to_cpu(model, snapshot_path):
"""Walk all quantizers, copy _amax to CPU, save to disk."""
from modelopt.torch.quantization.nn.modules.tensor_quantizer import TensorQuantizer
print(f"\nSnapshotting quantizer _amax to CPU...")
t0 = time.time()
snapshots = {}
n_moved = 0
for name, module in model.named_modules():
if not isinstance(module, TensorQuantizer):
continue
if hasattr(module, '_amax') and module._amax is not None:
amax_cpu = module._amax.detach().cpu().clone()
snapshots[name] = amax_cpu
module._amax.data.copy_(amax_cpu)
n_moved += 1
torch.save(snapshots, snapshot_path)
size_mb = os.path.getsize(snapshot_path) / (1024**2)
print(f"✓ Snapshotted {n_moved} quantizer _amax tensors to CPU ({time.time()-t0:.1f}s)")
print(f" Saved to: {snapshot_path} ({size_mb:.1f} MB)")
return snapshots
def restore_amax_from_snapshot(model, snapshot_path):
"""Restore _amax from a previously saved CPU snapshot."""
from modelopt.torch.quantization.nn.modules.tensor_quantizer import TensorQuantizer
print(f"Restoring _amax from snapshot: {snapshot_path}")
snapshots = torch.load(snapshot_path, map_location='cpu')
n_restored = 0
for name, module in model.named_modules():
if not isinstance(module, TensorQuantizer):
continue
if name in snapshots and hasattr(module, '_amax'):
module._amax.data.copy_(snapshots[name].to(module._amax.device))
n_restored += 1
print(f"✓ Restored {n_restored} _amax tensors from snapshot")
def force_all_amax_to_cpu(model):
"""Force ALL quantizer tensors to CPU."""
from modelopt.torch.quantization.nn.modules.tensor_quantizer import TensorQuantizer
count = 0
for name, module in model.named_modules():
if not isinstance(module, TensorQuantizer):
continue
for attr in ['_amax', '_pre_quant_scale', '_global_amax']:
if hasattr(module, attr):
val = getattr(module, attr)
if val is not None and isinstance(val, torch.Tensor) and val.is_cuda:
setattr(module, attr, val.cpu())
count += 1
print(f"✓ Forced {count} quantizer tensors to CPU")
def save_calibrated_state(model, path):
"""Save model state dict after calibration."""
print(f"\n{'='*60}")
print(f"SAVING CALIBRATED STATE → {path}")
print(f"{'='*60}")
start = time.time()
state = {
'model_state_dict': model.state_dict(),
'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
}
torch.save(state, path)
size_gb = os.path.getsize(path) / (1024**3)
print(f"✓ Saved calibrated state: {size_gb:.1f} GB ({time.time()-start:.0f}s)")
print(f" Path: {path}")
print(f" Re-run with --export-only to retry export.\n")
def run_calibration(model_path, export_dir, calib_save_path, amax_snapshot_path, calib_size, calib_seq):
"""Full pipeline: parse args via hf_ptq → load → quantize → snapshot → save → export."""
os.chdir(EXAMPLE_DIR)
sys.path.insert(0, EXAMPLE_DIR)
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HUGGING_FACE_HUB_TOKEN"] = HF_TOKEN
from hf_ptq import parse_args, main as hf_main
apply_patches()
# ── Build args using hf_ptq's own parser ──
# This guarantees ALL attributes exist with correct defaults.
# We temporarily replace sys.argv so parse_args() sees our config.
saved_argv = sys.argv
sys.argv = [
"hf_ptq.py",
"--pyt_ckpt_path", model_path,
"--qformat", QUANT,
"--calib_size", str(calib_size),
"--calib_seq", str(calib_seq),
"--kv_cache_qformat", KV_CACHE_QUANT,
"--inference_tensor_parallel", str(TP),
"--export_path", export_dir,
"--trust_remote_code",
"--use_seq_device_map",
"--gpu_max_mem_percentage", str(GPU_MEM_PCT),
"--batch_size", "0",
]
args = parse_args()
sys.argv = saved_argv
# Apply the same post-parse conversions that hf_ptq's __main__ block does
# (these normally run between parse_args() and main() in the original script,
# but since we call main() directly, we have to do them ourselves)
args.dataset = args.dataset.split(",") if isinstance(args.dataset, str) else args.dataset
args.calib_size = [int(num_sample) for num_sample in args.calib_size.split(",")]
# ── Post-calibration hook ──
# We monkey-patch export_quantized to add our defensive saves before export.
import hf_ptq
orig_export_quantized = hf_ptq.export_quantized
def patched_export_quantized(exp_args, full_model, language_model, model_type,
tokenizer, default_padding_side, default_pad_token):
"""Wrapper that snapshots amax and saves state before calling the real export."""
print("\n" + "="*60)
print("POST-CALIBRATION: Snapshotting amax and saving state")
print("="*60)
# Snapshot amax to CPU
snapshot_amax_to_cpu(language_model, amax_snapshot_path)
# Force all quantizer state to CPU
force_all_amax_to_cpu(language_model)
# Free GPU memory
torch.cuda.empty_cache()
gc.collect()
# Save calibrated state
save_calibrated_state(language_model, calib_save_path)
# Now run the real export
orig_export_quantized(exp_args, full_model, language_model, model_type,
tokenizer, default_padding_side, default_pad_token)
hf_ptq.export_quantized = patched_export_quantized
print("✓ Hooked export_quantized with amax snapshot + state save")
# ── Run hf_ptq's full pipeline ──
# This handles model loading, quantization, calibration, and export
# using the exact same code path as the shell script.
hf_main(args)
def run_export_only(calib_save_path, amax_snapshot_path, model_path, export_dir):
"""Load saved calibration state and run export only."""
os.chdir(EXAMPLE_DIR)
sys.path.insert(0, EXAMPLE_DIR)
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HUGGING_FACE_HUB_TOKEN"] = HF_TOKEN
apply_patches()
from example_utils import get_model, get_tokenizer
print(f"Loading model from {model_path}...")
model = get_model(
model_path,
device="cpu",
trust_remote_code=True,
)
tokenizer = get_tokenizer(model_path, trust_remote_code=True)
print(f"Loading calibrated state from {calib_save_path}...")
state = torch.load(calib_save_path, map_location='cpu')
model.load_state_dict(state['model_state_dict'])
print(f"✓ Loaded calibrated state (saved at {state['timestamp']})")
force_all_amax_to_cpu(model)
if amax_snapshot_path and os.path.exists(amax_snapshot_path):
restore_amax_from_snapshot(model, amax_snapshot_path)
torch.cuda.empty_cache()
gc.collect()
from modelopt.torch.export import export_hf_checkpoint
from hf_ptq import load_mtp_weights, copy_custom_model_files
print(f"\n{'='*60}")
print(f"EXPORTING → {export_dir}")
print(f"{'='*60}")
t0 = time.time()
try:
mtp_layer_prefixes, mtp_state_dict = load_mtp_weights(model, model_path)
if mtp_layer_prefixes:
model._mtp_layer_prefixes = mtp_layer_prefixes
export_hf_checkpoint(model, export_dir=export_dir, extra_state_dict=mtp_state_dict)
tokenizer.save_pretrained(export_dir)
copy_custom_model_files(model_path, export_dir, True)
print(f"\n✓ Export complete in {time.time()-t0:.0f}s → {export_dir}")
except Exception as e:
print(f"\n✗ EXPORT FAILED: {e}")
print(f" Calibrated state: {CALIB_SAVE_PATH}")
print(f" Amax snapshots: {AMAX_SNAPSHOT_PATH}")
raise
def run_validate(calib_save_path, amax_snapshot_path):
"""Validate saved calibration state — check amax values are valid."""
print(f"\nValidating calibration state...")
if os.path.exists(amax_snapshot_path):
snapshots = torch.load(amax_snapshot_path, map_location='cpu')
n_total = len(snapshots)
n_valid = n_zero = n_nan = n_neg = 0
for name, amax in snapshots.items():
if torch.any(torch.isnan(amax)):
n_nan += 1
elif torch.any(amax < 0):
n_neg += 1
elif torch.all(amax == 0):
n_zero += 1
else:
n_valid += 1
print(f"\nAmax snapshot validation:")
print(f" Total: {n_total} Valid: {n_valid} Zero: {n_zero} Neg: {n_neg} NaN: {n_nan}")
if n_valid == n_total:
print(f"\n✓ All {n_total} amax snapshots are valid!")
else:
print(f"\n{n_total - n_valid} quantizers have invalid amax!")
else:
print(f"✗ No amax snapshot found at {amax_snapshot_path}")
if os.path.exists(calib_save_path):
size_gb = os.path.getsize(calib_save_path) / (1024**3)
print(f"\nCalibrated state: {calib_save_path} ({size_gb:.1f} GB)")
else:
print(f"\n✗ No calibrated state found at {calib_save_path}")
def main():
parser = argparse.ArgumentParser(description="DeepSeek V4 Pro NVFP4 Quantization")
parser.add_argument("--export-only", action="store_true",
help="Skip calibration, load saved state and run export only")
parser.add_argument("--validate-only", action="store_true",
help="Validate saved calibration state without running anything")
parser.add_argument("--model", default=MODEL, help="Path to BF16 model")
parser.add_argument("--export-dir", default=EXPORT_DIR, help="Export output directory")
parser.add_argument("--calib-save", default=CALIB_SAVE_PATH, help="Calibration state save path")
parser.add_argument("--amax-snapshot", default=AMAX_SNAPSHOT_PATH, help="Amax snapshot path")
parser.add_argument("--calib-size", type=int, default=CALIB_SIZE, help="Calibration samples")
parser.add_argument("--calib-seq", type=int, default=CALIB_SEQ, help="Calibration sequence length")
args = parser.parse_args()
if args.validate_only:
run_validate(args.calib_save, args.amax_snapshot)
elif args.export_only:
if not os.path.exists(args.calib_save):
print(f"ERROR: No calibration state found at {args.calib_save}")
sys.exit(1)
run_export_only(args.calib_save, args.amax_snapshot, args.model, args.export_dir)
else:
run_calibration(args.model, args.export_dir, args.calib_save,
args.amax_snapshot, args.calib_size, args.calib_seq)
if __name__ == "__main__":
main()

100
scripts/serve_vllm.py Normal file
View File

@@ -0,0 +1,100 @@
#!/usr/bin/env python3
"""
DeepSeek V4 Pro NVFP4 — vLLM OpenAI-compatible server.
Run from the venv on the B200 node:
source /root/nvidia-meeting/venv/bin/activate
python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/serve_vllm.py
Or in the background:
nohup python3 /root/nvidia-meeting/deepseek-v4-quant/scripts/serve_vllm.py \
> /root/nvidia-meeting/vllm_serve.log 2>&1 &
"""
import subprocess
import sys
# ── Patch: Add compress_ratios to DeepseekV4Config ──────────────────────────
# transformers 5.8.0 renamed compress_ratios → compress_rates (dict format).
# vllm 0.20.2 still expects compress_ratios as a list indexed by layer_id.
# We patch the Config class to expose compress_ratios as a property that
# converts the new dict format back to the list format vllm expects.
import transformers
try:
from transformers import DeepseekV4Config
_orig_init = DeepseekV4Config.__init__
def _patched_init(self, *args, **kwargs):
_orig_init(self, *args, **kwargs)
# If compress_ratios already exists as a list, leave it alone
if hasattr(self, 'compress_ratios') and isinstance(self.compress_ratios, list):
return
# Convert compress_rates dict → compress_ratios list
if hasattr(self, 'compress_rates') and isinstance(self.compress_rates, dict):
rates = self.compress_rates
# Build per-layer list from the dict schema
# V4 pattern: layers 0-1=128, then alternating 4/128, last=0
n_layers = getattr(self, 'num_hidden_layers', 61)
cr = rates.get('compressed_sparse_attention', 4)
hr = rates.get('heavily_compressed_attention', 128)
ratios = []
for i in range(n_layers):
if i < 2:
ratios.append(hr)
elif i == n_layers - 1:
ratios.append(0)
else:
ratios.append(cr if i % 2 == 0 else hr)
self.compress_ratios = ratios
elif hasattr(self, 'compress_rates') and isinstance(self.compress_rates, list):
self.compress_ratios = self.compress_rates
DeepseekV4Config.__init__ = _patched_init
print("✓ Patched DeepseekV4Config.__init__ to add compress_ratios")
except ImportError:
print("⚠ DeepseekV4Config not found, skipping compress_ratios patch")
MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4"
# These flags are critical for V4 — do not change without understanding why:
# --trust-remote-code V4 needs custom modeling code
# --kv-cache-dtype fp8 Match our kv_cache_qformat=fp8_cast quantization
# --block-size 256 V4 recommended block size
# --enable-expert-parallel Distribute expert computation across GPUs (critical for 256-expert MoE)
# --tensor-parallel-size 8 8× B200
# --compilation-config CUDA graphs for throughput — FULL_AND_PIECEWISE + all custom ops
# --attention_config FP4 indexer cache for V4 MLA attention
# --moe-backend deep_gemm_mega_moe — optimized MoE kernel for Blackwell
# --tokenizer-mode deepseek_v4 — V4-specific tokenizer
# --tool-call-parser deepseek_v4 — native tool calling
# --enable-auto-tool-choice Auto tool choice for function calling
# --reasoning-parser deepseek_v4 — reasoning/thinking output parsing
# --speculative_config MTP speculative decoding (2 speculative tokens)
cmd = [
sys.executable, "-m", "vllm.entrypoints.openai.api_server",
"--model", MODEL,
"--trust-remote-code",
"--kv-cache-dtype", "fp8",
"--block-size", "256",
"--enable-expert-parallel",
"--tensor-parallel-size", "8",
"--compilation-config", '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}',
"--attention_config.use_fp4_indexer_cache=True",
"--moe-backend", "deep_gemm_mega_moe", # WARN: No NVFP4 mega_moe kernel. Use docker-compose (omits this flag) instead.
"--tokenizer-mode", "deepseek_v4",
"--tool-call-parser", "deepseek_v4",
"--enable-auto-tool-choice",
"--reasoning-parser", "deepseek_v4",
"--speculative_config", '{"method":"mtp","num_speculative_tokens":2}',
"--host", "0.0.0.0",
"--port", "8000",
]
print(f"Starting vLLM server for {MODEL}")
print(f"Command: {' '.join(cmd)}")
print(f"Log: /root/nvidia-meeting/vllm_serve.log")
print()
sys.exit(subprocess.call(cmd))

View File

@@ -1,179 +0,0 @@
#!/usr/bin/env python3
"""Sanity check an NVFP4 DeepSeek V4 Pro checkpoint.
Two modes:
1) --tensor-only (default): no model loading. Just inspects the safetensors
shards: confirms NVFP4 packing structure (uint8 weight + FP8 weight_scale
+ FP32 weight_scale_2), checks for NaN/Inf in scales, samples a few
dequantizations to confirm they look plausible.
2) --vllm: tries to load the model with vLLM and generate a few tokens.
Requires vLLM with NVFP4 support (SM100+ Blackwell GPU).
Usage:
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming
python verify_nvfp4.py DeepSeek-V4-Pro-NVFP4-streaming --vllm
"""
import argparse
import json
import sys
from pathlib import Path
import torch
from safetensors import safe_open
FP4_E2M1_VALUES = torch.tensor(
[0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
-0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
dtype=torch.float32,
)
def unpack_fp4(packed: torch.Tensor) -> torch.Tensor:
"""Reverse the (low | high<<4) byte pack into a [M, N] tensor of FP4 indices."""
low = packed & 0x0F
high = (packed >> 4) & 0x0F
M, N_half = packed.shape
out = torch.empty(M, N_half * 2, dtype=torch.uint8)
out[:, ::2] = low
out[:, 1::2] = high
return out
def dequant_nvfp4(packed_uint8, weight_scale_fp8, weight_scale_2_fp32):
"""Reconstruct FP32 values from NVFP4 storage."""
fp4_idx = unpack_fp4(packed_uint8)
values = FP4_E2M1_VALUES[fp4_idx.long()] # [M, N]
M, N = values.shape
# Per-block scale broadcast back over 16 elements
scale_blocks = weight_scale_fp8.float() # [M, N//16]
scale_per_elem = scale_blocks.unsqueeze(-1).expand(-1, -1, 16).reshape(M, N)
return values * scale_per_elem * weight_scale_2_fp32.float()
def tensor_only_check(model_dir: Path):
index_path = model_dir / "model.safetensors.index.json"
if not index_path.exists():
sys.exit(f"No index.json at {model_dir}")
with open(index_path) as f:
index = json.load(f)
weight_map = index["weight_map"]
# Find one quantized weight to sample
sample = None
for name, fn in weight_map.items():
if name.endswith(".weight") and (name.replace(".weight", ".weight_scale") in weight_map):
sample = name
break
if not sample:
sys.exit("Couldn't find an NVFP4-quantized weight (expected *.weight_scale companion).")
print(f"Sampling: {sample}")
shard_fn = weight_map[sample]
scale_name = sample.replace(".weight", ".weight_scale")
scale_2_name = sample.replace(".weight", ".weight_scale_2")
scale_shard = weight_map[scale_name]
scale_2_shard = weight_map[scale_2_name]
def open_get(fn, name):
with safe_open(model_dir / fn, framework="pt") as f:
return f.get_tensor(name)
packed = open_get(shard_fn, sample)
weight_scale = open_get(scale_shard, scale_name)
weight_scale_2 = open_get(scale_2_shard, scale_2_name)
print(f" packed: shape={tuple(packed.shape)} dtype={packed.dtype}")
print(f" weight_scale: shape={tuple(weight_scale.shape)} dtype={weight_scale.dtype}")
print(f" weight_scale_2: shape={tuple(weight_scale_2.shape)} dtype={weight_scale_2.dtype} "
f"value={weight_scale_2.float().item():.6e}")
# Structural assertions
M = packed.shape[0]
assert packed.dtype == torch.uint8, f"packed should be uint8, got {packed.dtype}"
assert weight_scale.dtype == torch.float8_e4m3fn, \
f"weight_scale should be FP8 E4M3, got {weight_scale.dtype}"
assert weight_scale.shape == (M, packed.shape[1] * 2 // 16), \
f"weight_scale shape {weight_scale.shape} doesn't match expected (M, N/16)"
# Check for NaN/Inf in scales
s_fp32 = weight_scale.float()
assert torch.isfinite(s_fp32).all(), "weight_scale contains NaN/Inf"
assert torch.isfinite(weight_scale_2.float()).all(), "weight_scale_2 is NaN/Inf"
print(f" scales: all finite ✓")
print(f" weight_scale stats: min={s_fp32.min().item():.3e} max={s_fp32.max().item():.3e} "
f"mean={s_fp32.mean().item():.3e}")
# Spot-check dequantization
print("\nDequantizing first 4x32 block for visual check:")
rec = dequant_nvfp4(packed[:4, :16], weight_scale[:4, :2], weight_scale_2)
print(rec)
assert torch.isfinite(rec).all(), "Dequantized values contain NaN/Inf"
print(f" dequant: all finite ✓")
print(f" dequant range: [{rec.min().item():.4f}, {rec.max().item():.4f}]")
# Count what's quantized vs preserved across the whole model
quantized_weights = []
preserved = []
for name in weight_map:
if name.endswith(".weight"):
if name.replace(".weight", ".weight_scale") in weight_map:
quantized_weights.append(name)
else:
preserved.append(name)
print(f"\nWhole-model summary:")
print(f" Quantized .weight tensors: {len(quantized_weights):,}")
print(f" Preserved .weight tensors: {len(preserved):,}")
print(f" Total tensors in index: {len(weight_map):,}")
# Show a few preserved names to confirm the right things stayed in higher precision
print(f"\n Sample preserved tensors (should be lm_head, embed, gates, norms, etc.):")
for n in preserved[:10]:
print(f" {n}")
def vllm_check(model_dir: Path):
print("Loading model with vLLM... (requires Blackwell GPU + vLLM with NVFP4 support)")
from vllm import LLM, SamplingParams
llm = LLM(
model=str(model_dir),
trust_remote_code=True,
quantization="compressed-tensors",
dtype="auto",
tensor_parallel_size=8,
max_model_len=8192,
)
sampling = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=64)
prompts = [
"Write a short poem about quantization:",
"What is 17 * 23?",
"Explain MoE routing in one sentence.",
]
outputs = llm.generate(prompts, sampling)
for o in outputs:
print("=" * 60)
print("PROMPT:", o.prompt)
print("OUTPUT:", o.outputs[0].text)
def main():
ap = argparse.ArgumentParser()
ap.add_argument("model_dir")
ap.add_argument("--vllm", action="store_true")
args = ap.parse_args()
model_dir = Path(args.model_dir)
tensor_only_check(model_dir)
if args.vllm:
print("\n" + "=" * 60)
vllm_check(model_dir)
if __name__ == "__main__":
main()