process_weights_after_loading sets input_global_scale_inv AFTER _convert_nvfp4_post_load runs, so the fix ran too early and couldn't find the attrs it needed. Going back to the BF16 dequant approach. The zeros in the dummy run are expected: attention_impl returns early with out.zero_(), so they don't indicate a bug. Still need to test with a real request under cudagraph_mode=NONE.
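
Toy sketch of the ordering problem, not the real code: ToyLayer, convert_post_load and process_weights are hypothetical stand-ins for the layer, _convert_nvfp4_post_load and process_weights_after_loading, and the scale value is arbitrary. It only shows why a conversion step that reads input_global_scale_inv finds nothing when it runs before the step that sets it.

```python
class ToyLayer:
    """Hypothetical stand-in for a quantized layer; not the real class."""


def convert_post_load(layer):
    # Stand-in for _convert_nvfp4_post_load (assumed behavior): it needs
    # the inverse scale, so it first checks whether the attribute exists.
    scale_inv = getattr(layer, "input_global_scale_inv", None)
    if scale_inv is None:
        return "missing"
    layer.converted = True
    return "converted"


def process_weights(layer):
    # Stand-in for process_weights_after_loading (assumed behavior):
    # this is the step that creates the attribute.
    layer.input_global_scale_inv = 0.5  # arbitrary example value


layer = ToyLayer()
first = convert_post_load(layer)   # runs before the attribute is set
process_weights(layer)
second = convert_post_load(layer)  # runs after the attribute is set
print(first, second)
```

The first call takes the "missing" path and the second succeeds, which matches what happened to the fix: same conversion logic, wrong point in the load sequence.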