deepseek-v4-quant/scripts/quantize_nvfp4.py
biondizzle 07cd50e823 8 patches covering full export chain — no more whack-a-mole
Traced the full execution chain from _process_quantized_modules through
every function that reads stale GPU tensors:

  _process_quantized_modules
    → _export_quantized_weight (Patch 4: force weight to CPU at entry point)
      → get_weight_scaling_factor (Patch 7: belt-and-suspenders)
        → get_weights_scaling_factor_from_quantizer (safe: weight now CPU)
        → NVFP4QTensor.get_weights_scaling_factor (safe: input is CPU)
      → get_weight_scaling_factor_2 (Patch 8: force quantizer to CPU)
      → get_activation_scaling_factor (Patch 3: CPU + clamp)
      → to_quantized_weight (Patch 6: force all tensors to CPU)
      → weight.to(dtype) (safe: weight is CPU)
    → _export_fused_experts (Patch 5: force expert weights + quantizer to CPU)

Patch 4 is the key: it moves weight to CPU at the earliest possible point,
so ALL downstream .to(weight.device) calls resolve to CPU.
Patches 5-8 are belt-and-suspenders for alternative code paths.
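
The entry-point idea behind Patch 4 can be sketched as a wrapper that moves the first argument to CPU before the wrapped function runs. This is only an illustration under stated assumptions, not the actual patch: `force_first_arg_cpu`, `FakeTensor`, and `export_weight` are hypothetical names, and `FakeTensor` is a stand-in that tracks nothing but a device string so the sketch runs without a GPU or PyTorch installed.

```python
import functools


def force_first_arg_cpu(fn):
    """Wrap fn so its first positional argument is moved to CPU on entry."""
    @functools.wraps(fn)
    def wrapper(first, *args, **kwargs):
        # Only objects exposing .cpu() are moved; anything else passes through.
        if hasattr(first, "cpu"):
            first = first.cpu()
        return fn(first, *args, **kwargs)
    return wrapper


class FakeTensor:
    """Hypothetical stand-in for a tensor: records only its device."""

    def __init__(self, device):
        self.device = device

    def cpu(self):
        return FakeTensor("cpu")

    def to(self, device):
        return FakeTensor(device)


def export_weight(weight, scale):
    # Downstream code follows weight.device, as in the chain above.
    scale = scale.to(weight.device)
    return weight.device, scale.device


# Unpatched: everything stays on the device the weight arrived on.
unpatched = export_weight(FakeTensor("cuda:0"), FakeTensor("cuda:0"))

# Patched at the entry point: the weight is on CPU, so the later
# .to(weight.device) call resolves to CPU without further changes.
patched = force_first_arg_cpu(export_weight)(
    FakeTensor("cuda:0"), FakeTensor("cuda:0")
)
```

Only the weight is moved explicitly here; the scale ends up on CPU because it follows `weight.device`, which is why a single patch at the earliest point covers the calls after it.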
2026-05-09 22:50:58 +00:00
