Traced the full execution chain from _process_quantized_modules through
every function that reads stale GPU tensors:
_process_quantized_modules
→ _export_quantized_weight (Patch 4: force weight to CPU at entry point)
→ get_weight_scaling_factor (Patch 7: belt-and-suspenders)
→ get_weights_scaling_factor_from_quantizer (safe: weight now CPU)
→ NVFP4QTensor.get_weights_scaling_factor (safe: input is CPU)
→ get_weight_scaling_factor_2 (Patch 8: force quantizer to CPU)
→ get_activation_scaling_factor (Patch 3: CPU + clamp)
→ to_quantized_weight (Patch 6: force all tensors to CPU)
→ weight.to(dtype) (safe: weight is CPU)
→ _export_fused_experts (Patch 5: force expert weights + quantizer to CPU)
Patch 4 is the key: it moves weight to CPU at the earliest possible point,
so ALL downstream .to(weight.device) calls resolve to CPU.
Patches 5-8 are belt-and-suspenders: they cover alternative code paths that
might not pass through Patch 4.
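
A minimal sketch of why moving the weight at the entry point is enough. It is
not the actual patch: export_weight and scale_like are hypothetical stand-ins
for _export_quantized_weight and one downstream helper, and the only behavior
assumed is that the helper follows the weight with .to(weight.device).

```python
import torch


def scale_like(weight: torch.Tensor, amax: torch.Tensor) -> torch.Tensor:
    # Hypothetical downstream helper: it follows the weight's device,
    # mirroring the `.to(weight.device)` pattern described above.
    return amax.to(weight.device)


def export_weight(weight: torch.Tensor, amax: torch.Tensor):
    # Entry-point move (the Patch 4 idea): copy the weight to CPU first,
    # so every later `.to(weight.device)` targets CPU.
    weight = weight.detach().cpu()
    scale = scale_like(weight, amax)
    return weight, scale


# Use the GPU when one is present; the sketch also runs on a CPU-only host.
device = "cuda" if torch.cuda.is_available() else "cpu"
w = torch.randn(4, 8, device=device)
amax = w.abs().amax()

weight_cpu, scale = export_weight(w, amax)
print(weight_cpu.device, scale.device)
```

Because the helper reads the device from the weight instead of hard-coding
one, a single move at the top of the chain redirects it without touching the
helper itself.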