Unlike the naive upcast, this properly dequantizes FP8 block-wise weights:
bf16 = fp8_weight * scale_expanded (128x128 blocks).
Also removes the now-unnecessary scale tensors and updates config.
FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().