Replace shell wrapper with in-process quantize script

- New scripts/quantize_nvfp4.py: runs full ModelOpt pipeline in-process
- Saves calibrated state after calibration (insurance against export crashes)
- Patches modelopt for V4: ModuleList quantizers, stale GPU tensor safety
- --export-only flag to retry export from saved calibration state
- Removed old model_opt_nvfp4_full.py (shell wrapper)
- Updated README with new pipeline docs and bug #5/#6
This commit is contained in:
2026-05-09 06:07:22 +00:00
parent 04304fdae6
commit a0bacb3cf6
4 changed files with 398 additions and 125 deletions

16
.gitignore vendored
View File

@@ -1,14 +1,10 @@
# OpenClaw session files
AGENTS.md
BOOTSTRAP.md
HEARTBEAT.md
IDENTITY.md
SOUL.md
USER.md
TOOLS.md
.openclaw/
memory/
# Dequantized BF16 weights (3TB)
DeepSeek-V4-Pro-BF16/
# Calibration state (huge, not for git)
*.pt
# Python
__pycache__/
*.pyc
.venv/

View File

@@ -1,6 +1,6 @@
# DeepSeek V4 Pro → NVFP4 Quantization
Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe).
Full NVFP4 quantization of DeepSeek V4 Pro on a single B200 node (8× B200, 2.7TB RAM, 13TB NVMe). Target: ~600GB.
## Pipeline
@@ -18,44 +18,49 @@ This is not a blind upcast — it applies the actual scale factors:
W_bf16 = dequantize_fp4_weight(W_int, S) # per-tensor scale dequant, not .to(bfloat16)
```
**We verified byte-exact correctness** by dequantizing a single expert and running a matmul against the official inference path:
**Byte-exact verified** — matmul diff is 0.000000 against the official inference path.
```python
W_bf16 = dequantize_fp4_weight(W_int, S)
y_ours = W_bf16 @ x.bfloat16()
y_ref = official_expert_forward(W_int, S, x)
print((y_ours - y_ref).abs().max() / y_ref.abs().mean())
```
Results:
```
Max abs diff: 0.00000000
Mean abs diff: 0.00000000
Relative error: 0.000000
Matmul max diff: 0.00000000
```
Byte-exact. Zero drift from BF16 rounding noise — ruled out as a potential issue in the final quant.
### Step 2: Run ModelOpt NVFP4 Full Quantization
### Step 2: Run NVFP4 Quantization
```bash
python3 scripts/model_opt_nvfp4_full.py
python3 scripts/quantize_nvfp4.py
```
Runs NVIDIA's official ModelOpt PTQ pipeline (`hf_ptq.py`) with full `nvfp4` quantization (attention + experts + shared MLP). Output target: ~600GB.
This script runs the full pipeline in-process (not wrapping the shell script):
1. **Load** BF16 model with sequential device map (3TB model, CPU offload)
2. **Patch** modelopt for V4 compatibility (ModuleList quantizers, GPU tensor safety)
3. **Quantize + Calibrate** (5-6 hours, 128 samples)
4. **SAVE** model state to disk ← insurance against export crashes
5. **Export** to HF safetensors
If the export crashes (and it will — modelopt's export reads stale GPU tensors after hours of calibration):
```bash
python3 scripts/quantize_nvfp4.py --export-only
```
This loads the saved calibration state and retries just the export step.
**Config:**
- `--quant nvfp4` (full model, not experts-only)
- `--calib 128` — 128 calibration samples. The B200 node has 2.7TB RAM; the 3TB BF16 model doesn't fit in GPU VRAM (~1.4TB total), so it runs with `--use_seq_device_map` (CPU offload). 256 calibration samples OOMs. 128 is the max that fits.
- `--calib 128` — 128 calibration samples. 256 OOMs with 3TB BF16 on CPU offload.
- `--kv_cache_quant fp8_cast`
- `--use_seq_device_map` — sequential device mapping, loads model into CPU RAM, moves layers to GPU for forward passes
- `--use_seq_device_map` — sequential device mapping (CPU offload)
- `--gpu_max_mem_percentage 0.7` — VRAM headroom
**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated — requires HF token). The script exports `HF_TOKEN` and `HUGGING_FACE_HUB_TOKEN`; the token must also be set via `hf auth login` on the node.
**Calibration datasets:** `abisee/cnn_dailymail` + `nvidia/Nemotron-Post-Training-Dataset-v2` (gated — requires HF token).
**Runtime:** Model loading takes ~53 minutes. Quantization + calibration takes several hours. Total expect 6-12 hours.
**Runtime:** Model loading ~53 min. Calibration ~5.5 hours. Export ~30-60 min. Total 7-8 hours.
## Bugs Found (V4 + modelopt)
1. `QuantDeepseekV4Experts` AttributeError — V4 uses `nn.ModuleList` for per-expert quantizers, modelopt expected singular `TensorQuantizer`. Patched in `quantize_nvfp4.py`.
2. `--low_memory_mode` → meta device error. Don't use with V4.
3. Missing `kernels` package for FP8 ops. `pip install -U kernels`.
4. `--calib` not `--calib_size`, `--quant` not `--qformat` (shell script arg names — no longer relevant, we run in-process).
5. **Export crash — stale GPU tensors.** After 5+ hours of calibration, modelopt's export step reads quantizer amax tensors that have been sitting in VRAM for hours. CUDA illegal memory access. Fixed by moving quantizer tensors to CPU before export.
6. **Export crash — `assert torch.all(activation_scaling_factor > 0)`.** Related to #5. The amax values from stale GPU reads are garbage. Fixed by clamping instead of asserting.
## Dependencies (pinned versions)
@@ -64,19 +69,11 @@ Runs NVIDIA's official ModelOpt PTQ pipeline (`hf_ptq.py`) with full `nvfp4` qua
- **kernels:** latest (`pip install -U kernels` — needed for finegrained FP8 ops)
- **Python:** 3.10
The `quant_module_patched.py` fix is for **modelopt 0.45.0.dev64** specifically. Later versions may include the fix natively — check before applying. Using a different modelopt version may cause patches to fail or V4 quantization to break.
The patches in `quantize_nvfp4.py` are for **modelopt 0.45.0.dev64** specifically. Later versions may include fixes natively.
## Key Notes
- V4 is NOT BF16 — it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1). The raw FP8 source has kernel problems on Blackwell; the mixed-precision source causes modelopt errors
- `--low_memory_mode` causes meta device errors with V4 — don't use
- modelopt has no explicit V4 support — relies on auto-detection of fused experts
- The `quant_module_patched.py` patch fixes `iter_weights_for_calibration()` for V4's `nn.ModuleList` expert quantizers — already applied in the venv
## Bugs Found (V4 + modelopt)
1. `QuantDeepseekV4Experts` AttributeError — patched `iter_weights_for_calibration()` for ModuleList quantizers
2. `--low_memory_mode` → meta device error
3. Missing `kernels` package for FP8 ops
4. `--calib` not `--calib_size`, `--quant` not `--qformat` (shell script arg names)
5. **Export crash — `repr()` triggers CUDA illegal memory access.** After 5+ hours of calibration, modelopt's export step calls `repr(input_quantizer)` to check if a quantizer is disabled. This triggers `_short_amax()``tensor.item()` on a GPU tensor that's been sitting in VRAM for hours. CUDA says no. The fix: replace `"disabled" not in repr(input_quantizer)` with `not getattr(input_quantizer, '_disabled', False)`. One line. NVIDIA wrote a string-matching check on a full object repr instead of checking the boolean attribute directly. This is the kind of thing that makes you wonder if anyone at NVIDIA actually tested their export path on a model larger than 7B. Patched in `patches/unified_export_hf_patched.py` and `patches/tensor_quantizer_patched.py` (the latter wraps `_short_tensor` in a try/except as insurance).
- V4 is NOT BF16 — it ships as mixed-precision FP8/FP4. You MUST dequantize to BF16 first (Step 1).
- `--low_memory_mode` causes meta device errors with V4 — don't use.
- modelopt has no explicit V4 support — relies on auto-detection of fused experts.
- The calibration state save (`v4_nvfp4_calibrated_state.pt`) is ~1.5TB. It lives on NVMe, not in git.

View File

@@ -1,75 +0,0 @@
#!/usr/bin/env python3
"""
ModelOpt NVFP4 quantization — full model.
Quantizes ALL weights (attention + experts + shared MLP) to NVFP4.
Requires a pure BF16 source model (from scripts/dequant_fp8_to_bf16.py)
to avoid FP8/FP4 kernel issues on Blackwell GPUs.
Available NVFP4 quantization strategies (from modelopt huggingface_example.sh):
- nvfp4 : Full model NVFP4 quantization (this script)
- nvfp4_experts_only : Only MoE expert weights
- nvfp4_mlp_only : Only MLP layers (experts + shared MLP)
- nvfp4_omlp_only : Only output + MLP layers
- nvfp4_awq : NVFP4 with AWQ calibration
- nvfp4_mse : NVFP4 with MSE calibration
- w4a8_nvfp4_fp8 : W4A8 NVFP4 weights + FP8 activations
- w4a8_mxfp4_fp8 : W4A8 MXFP4 weights + FP8 activations
- nvfp4_svdquant : NVFP4 with SVDQuant
- nvfp4_local_hessian : NVFP4 with local Hessian calibration
Strategy: Copy this file to model_opt_nvfp4_<strategy>.py and tweak as needed.
By the end, we'll have working quantized weights for each successful strategy.
Output dir naming: DeepSeek-V4-Pro_NVFP4-<strategy>_kv_fp8_cast
"""
import subprocess
import sys
import os
# ── Config ──────────────────────────────────────────────────────────────────
MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-BF16" # Dequantized BF16 (from scripts/dequant_fp8_to_bf16.py)
QUANT = "nvfp4"
TP = 8
CALIB = 128
KV_CACHE_QUANT = "fp8_cast"
# 3TB BF16 model can't fit on 8×B200 VRAM (~1.4TB total)
# Use seq_device_map: loads model into CPU RAM, moves layers to GPU for forward passes
# 2.8TB RAM is enough for the 3TB model (with memory-mapped loading)
EXTRA_FLAGS = "--trust_remote_code --use_seq_device_map --gpu_max_mem_percentage 0.7"
# HF token for gated calibration datasets (nvidia/Nemotron-Post-Training-Dataset-v2)
HF_TOKEN = "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"
# Output dir follows modelopt convention: <model>_<quant>_kv_<kv_quant>
# We override the model name to make the strategy clear
OUTPUT_NAME = f"DeepSeek-V4-Pro_NVFP4-{QUANT}_kv_{KV_CACHE_QUANT}"
SCRIPT_DIR = "/root/nvidia-meeting/modelopt-repo/examples/llm_ptq"
LOG_FILE = f"/root/nvidia-meeting/modelopt_{QUANT}.log"
# ── Run ─────────────────────────────────────────────────────────────────────
cmd = f"""cd {SCRIPT_DIR} && \\
. /root/nvidia-meeting/venv/bin/activate && \\
export HF_TOKEN={HF_TOKEN} && \\
export HUGGING_FACE_HUB_TOKEN={HF_TOKEN} && \\
echo "HF_TOKEN=$HF_TOKEN" && \\
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \\
bash scripts/huggingface_example.sh \\
--model {MODEL} \\
--quant {QUANT} \\
--tp {TP} \\
--calib {CALIB} \\
--kv_cache_quant {KV_CACHE_QUANT} \\
{EXTRA_FLAGS} 2>&1 | tee {LOG_FILE}"""
print(f"Running: {QUANT} quantization on {MODEL}")
print(f"Output: {OUTPUT_NAME}")
print(f"Log: {LOG_FILE}")
print(f"HF_TOKEN: {HF_TOKEN}")
print(f"Command:\n{cmd}\n")
ret = subprocess.call(cmd, shell=True)
sys.exit(ret)

355
scripts/quantize_nvfp4.py Normal file
View File

@@ -0,0 +1,355 @@
#!/usr/bin/env python3
"""
DeepSeek V4 Pro → NVFP4 quantization.
Runs the full ModelOpt PTQ pipeline in-process (not wrapping the shell script),
saves model state after calibration (so we don't lose 6 hours of work to an
export crash), and patches the export path to handle stale GPU tensors.
Usage:
# Full run (calibrate + export):
python3 scripts/quantize_nvfp4.py
# Re-run export only (after a calibration save exists):
python3 scripts/quantize_nvfp4.py --export-only
Pipeline:
1. Load BF16 model with sequential device map
2. Patch modelopt for V4 compatibility
3. Quantize + calibrate (5-6 hours)
4. SAVE model state to disk ← checkpoint so export failures don't waste calibration
5. Export to HF safetensors
"""
import argparse
import copy
import os
import sys
import time
import warnings
import torch
# ── Config ──────────────────────────────────────────────────────────────────
MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-BF16"
QUANT = "nvfp4"
TP = 8
CALIB_SIZE = 128
CALIB_SEQ = 512
KV_CACHE_QUANT = "fp8_cast"
GPU_MEM_PCT = 0.7
HF_TOKEN = "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"
# Output paths
SCRIPT_DIR = "/root/nvidia-meeting/modelopt-repo/examples/llm_ptq" # needed for example_utils imports
EXPORT_DIR = "/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4"
CALIB_SAVE_PATH = "/root/nvidia-meeting/v4_nvfp4_calibrated_state.pt"
def apply_patches():
"""Apply runtime patches for V4 compatibility."""
# 1. Patch quant_module.py for V4's ModuleList expert quantizers
from modelopt.torch.quantization.nn import quant_module
orig_iter = quant_module._QuantFusedExperts.iter_weights_for_calibration
def patched_iter_weights_for_calibration(self, **kwargs):
"""Handle V4's nn.ModuleList expert quantizers (vs singular TensorQuantizer)."""
for name, quantizer in self.named_modules():
if not isinstance(quantizer, quant_module.TensorQuantizer):
continue
if quantizer.is_enabled:
yield name, quantizer
quant_module._QuantFusedExperts.iter_weights_for_calibration = patched_iter_weights_for_calibration
print("✓ Patched _QuantFusedExperts.iter_weights_for_calibration for V4 ModuleList")
# 2. Patch nvfp4_tensor.get_activation_scaling_factor to move amax to CPU first
from modelopt.torch.quantization.qtensor import nvfp4_tensor
orig_get_asf = nvfp4_tensor.NVFP4QTensor.get_activation_scaling_factor
@classmethod
def patched_get_activation_scaling_factor(cls, quantizer):
"""Move amax to CPU before export to avoid stale GPU tensor reads."""
if not quantizer.is_enabled:
return None
try:
amax = quantizer.export_amax()
except (torch.cuda.CudaError, RuntimeError) as e:
# GPU tensor is corrupted — try moving _amax to CPU first then retry
print(f" WARNING: export_amax() failed ({e}), attempting CPU recovery...")
if hasattr(quantizer, '_amax') and quantizer._amax is not None:
quantizer._amax = quantizer._amax.cpu()
amax = quantizer.export_amax()
if amax is None:
return None
# Move to CPU for safety
amax = amax.cpu()
activation_scaling_factor = amax.float() / (quantizer.maxbound * 448.0)
# Replace hard assert with warning + clamp (invalid values from GPU corruption)
if not torch.all(activation_scaling_factor > 0):
n_bad = (activation_scaling_factor <= 0).sum().item()
n_total = activation_scaling_factor.numel()
print(f" WARNING: {n_bad}/{n_total} activation scaling factors <= 0, clamping to tiny")
activation_scaling_factor = activation_scaling_factor.clamp(min=torch.finfo(torch.float32).tiny)
return activation_scaling_factor
nvfp4_tensor.NVFP4QTensor.get_activation_scaling_factor = patched_get_activation_scaling_factor
print("✓ Patched NVFP4QTensor.get_activation_scaling_factor (CPU safety + graceful degradation)")
# 3. Patch tensor_quantizer.export_amax to move _amax to CPU before reading
from modelopt.torch.quantization.nn.modules import tensor_quantizer as tq_module
orig_export_amax = tq_module.TensorQuantizer.export_amax
def patched_export_amax(self):
"""Move _amax to CPU before export to prevent CUDA illegal memory access."""
if self.amax is not None and self.amax.is_cuda:
self._amax = self._amax.cpu()
return orig_export_amax(self)
tq_module.TensorQuantizer.export_amax = patched_export_amax
print("✓ Patched TensorQuantizer.export_amax (CPU safety)")
def move_quantizers_to_cpu(model):
"""Move all quantizer amax tensors to CPU to prevent stale GPU reads during export."""
count = 0
for name, module in model.named_modules():
if hasattr(module, '_amax') and module._amax is not None:
if module._amax.is_cuda:
module._amax = module._amax.cpu()
count += 1
print(f"✓ Moved {count} quantizer _amax tensors to CPU")
def save_calibrated_state(model, path):
"""Save model state dict + quantizer metadata after calibration.
This is the insurance policy: if export crashes, we can reload
and retry export without re-running 6 hours of calibration.
"""
print(f"\n{'='*60}")
print(f"SAVING CALIBRATED STATE → {path}")
print(f"{'='*60}")
start = time.time()
# Move quantizers to CPU first
move_quantizers_to_cpu(model)
state = {
'model_state_dict': model.state_dict(),
'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
}
torch.save(state, path)
size_gb = os.path.getsize(path) / (1024**3)
print(f"✓ Saved calibrated state: {size_gb:.1f} GB ({time.time()-start:.0f}s)")
print(f" Path: {path}")
print(f" This allows re-running export without re-calibrating.\n")
def load_calibrated_state(model, path):
"""Load previously saved calibrated state into model."""
print(f"Loading calibrated state from {path}...")
state = torch.load(path, map_location='cpu')
model.load_state_dict(state['model_state_dict'])
print(f"✓ Loaded calibrated state (saved at {state['timestamp']})")
def run_calibration(model_path, export_dir, calib_save_path):
"""Full pipeline: load → quantize → calibrate → save → export."""
# Must be in the example dir for the relative imports (example_utils, etc.)
os.chdir(SCRIPT_DIR)
sys.path.insert(0, SCRIPT_DIR)
from hf_ptq import get_model, get_tokenizer, make_calib_dataloader, pre_quantize
from modelopt.torch import quantization as mtq
from modelopt.torch.quantization.config import need_calibration, QUANT_CFG_CHOICES
from modelopt.torch.utils.dataset_utils import get_max_batch_size
from hf_ptq import build_quant_cfg
# Apply patches before loading model
apply_patches()
# ── Load model ──
print(f"\nLoading model from {model_path}...")
t0 = time.time()
# Set HF token for gated datasets
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HUGGING_FACE_HUB_TOKEN"] = HF_TOKEN
from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import infer_auto_device_map
# Load with sequential device map (model doesn't fit in GPU VRAM alone)
model = AutoModelForCausalLM.from_pretrained(
model_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map="sequential",
offload_folder="offload",
)
print(f"✓ Model loaded in {time.time()-t0:.0f}s")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# ── Setup quantization config ──
quant_cfg = copy.deepcopy(QUANT_CFG_CHOICES[QUANT])
quant_cfg = build_quant_cfg(QUANT, quant_cfg, None, None, None)
# KV cache quantization
if KV_CACHE_QUANT != "none":
quant_cfg = mtq.update_quant_cfg_with_kv_cache_quant(
quant_cfg,
getattr(mtq, mtq.KV_QUANT_CFG_CHOICES[KV_CACHE_QUANT])["quant_cfg"],
)
print(f"✓ KV cache quantization: {KV_CACHE_QUANT}")
# ── Detect batch size ──
print("\nDetecting max calibration batch size...")
batch_size = get_max_batch_size(
model,
max_sample_length=CALIB_SEQ,
sample_memory_usage_ratio=1.1,
)
batch_size = min(batch_size, CALIB_SIZE)
print(f"✓ Using calibration batch_size={batch_size}")
# ── Prepare dataloader ──
calib_dataloader, _ = make_calib_dataloader(
argparse.Namespace(
calib_size=[CALIB_SIZE],
calib_seq=CALIB_SEQ,
calib_dataset="",
batch_size=batch_size,
calib_batch_size=0,
),
model, None, tokenizer, torch.device("cuda"), None,
)
# ── Quantize + Calibrate ──
print(f"\n{'='*60}")
print(f"QUANTIZING: {QUANT} with {CALIB_SIZE} calibration samples")
print(f"{'='*60}")
t0 = time.time()
model = mtq.quantize(model, quant_cfg, forward_loop=calib_dataloader)
print(f"✓ Quantization + calibration complete in {time.time()-t0:.0f}s")
# ── SAVE STATE (the whole point of this script) ──
save_calibrated_state(model, calib_save_path)
# ── Export ──
run_export(model, tokenizer, model_path, export_dir)
def run_export(model, tokenizer, model_path, export_dir):
"""Export the quantized model to HF safetensors format."""
from modelopt.torch.export import export_hf_checkpoint
from hf_ptq import load_mtp_weights
print(f"\n{'='*60}")
print(f"EXPORTING → {export_dir}")
print(f"{'='*60}")
# Move quantizers to CPU before export
move_quantizers_to_cpu(model)
t0 = time.time()
try:
# Load MTP weights if present
mtp_layer_prefixes, mtp_state_dict = load_mtp_weights(model, model_path)
if mtp_layer_prefixes:
model._mtp_layer_prefixes = mtp_layer_prefixes
export_hf_checkpoint(
model,
export_dir=export_dir,
extra_state_dict=mtp_state_dict,
)
# Save tokenizer
tokenizer.save_pretrained(export_dir)
# Copy custom model files
from hf_ptq import copy_custom_model_files
copy_custom_model_files(model_path, export_dir, True)
elapsed = time.time() - t0
print(f"\n✓ Export complete in {elapsed:.0f}s → {export_dir}")
except Exception as e:
print(f"\n✗ EXPORT FAILED: {e}")
print(f" Calibrated state is saved at: {CALIB_SAVE_PATH}")
print(f" Re-run with --export-only to retry export")
raise
def run_export_only(calib_save_path, model_path, export_dir):
"""Load previously saved calibration state and run export only."""
os.chdir(SCRIPT_DIR)
sys.path.insert(0, SCRIPT_DIR)
apply_patches()
from transformers import AutoModelForCausalLM, AutoTokenizer
os.environ["HF_TOKEN"] = HF_TOKEN
os.environ["HUGGING_FACE_HUB_TOKEN"] = HF_TOKEN
# Load a fresh model (we just need the architecture, then overlay the state)
print(f"Loading model skeleton from {model_path}...")
model = AutoModelForCausalLM.from_pretrained(
model_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
device_map="cpu", # Don't load onto GPU yet
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Load the calibrated state
load_calibrated_state(model, calib_save_path)
# Export
run_export(model, tokenizer, model_path, export_dir)
def main():
parser = argparse.ArgumentParser(description="DeepSeek V4 Pro NVFP4 Quantization")
parser.add_argument("--export-only", action="store_true",
help="Skip calibration, load saved state and run export only")
parser.add_argument("--model", default=MODEL, help="Path to BF16 model")
parser.add_argument("--export-dir", default=EXPORT_DIR, help="Export output directory")
parser.add_argument("--calib-save", default=CALIB_SAVE_PATH, help="Calibration state save path")
parser.add_argument("--calib-size", type=int, default=CALIB_SIZE, help="Calibration samples")
parser.add_argument("--calib-seq", type=int, default=CALIB_SEQ, help="Calibration sequence length")
args = parser.parse_args()
if args.export_only:
if not os.path.exists(args.calib_save):
print(f"ERROR: No calibration state found at {args.calib_save}")
print("Run without --export-only first to calibrate.")
sys.exit(1)
run_export_only(args.calib_save, args.model, args.export_dir)
else:
run_calibration(args.model, args.export_dir, args.calib_save)
if __name__ == "__main__":
main()