Compares the forward_layer output against a step-by-step PyTorch reference to identify where the residual blowup originates. Uses our own NVFP4 dequantization, so there is no HF dependency.
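The stage-by-stage comparison can be illustrated with a minimal sketch. It is not the script's actual code: the helper names (`max_abs_diff`, `first_divergence`), the stage names, the tolerances, and the sample values are all hypothetical, and plain Python lists stand in for flattened tensors so the example runs without PyTorch. The idea is to record each intermediate result in execution order on both paths and report the first stage where the two disagree, since every later stage inherits the error.

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergence(candidate, reference, atol=1e-2, rtol=1e-2):
    """Return (stage_name, max_abs_diff) for the first stage that disagrees.

    Both arguments are lists of (stage_name, flat_values) pairs in execution
    order. Returns None when every stage is within tolerance.
    """
    for (name_c, vals_c), (name_r, vals_r) in zip(candidate, reference):
        if name_c != name_r:
            raise ValueError(f"stage order mismatch: {name_c!r} vs {name_r!r}")
        # A NaN or inf in the candidate is an immediate failure at this stage.
        if any(not math.isfinite(v) for v in vals_c):
            return name_c, math.inf
        diff = max_abs_diff(vals_c, vals_r)
        ref_mag = max((abs(v) for v in vals_r), default=0.0)
        if diff > atol + rtol * ref_mag:
            return name_c, diff
    return None


# Hypothetical stage names and made-up values, for illustration only.
reference = [
    ("attn_out", [0.5, -1.0, 0.25]),
    ("residual_1", [1.5, -0.5, 0.75]),
    ("mlp_out", [0.1, 0.2, -0.3]),
    ("residual_2", [1.6, -0.3, 0.45]),
]
candidate = [
    ("attn_out", [0.5, -1.0, 0.25]),
    ("residual_1", [1.5, -0.5, 0.75]),
    ("mlp_out", [0.1, 40.0, -0.3]),     # the blowup starts here
    ("residual_2", [1.6, 39.5, 0.45]),  # and propagates into the residual
]

result = first_divergence(candidate, reference)
if result is None:
    print("all stages match within tolerance")
else:
    stage, diff = result
    print(f"first divergence at {stage}: max abs diff = {diff:.4g}")
```

Reporting only the first failing stage is deliberate: once one stage is wrong, mismatches in the stages after it say little about the root cause. With real tensors the same loop would apply after flattening each intermediate, or with a tensor-aware comparison in place of `max_abs_diff`.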