Nibble value 0 vs 8 count ratio = 0.996, consistent with FP4 (codes 0 and 8 are +0.0 and -0.0), NOT INT4, where code 8 would be -8 and rare.
FP4 dequant is E2M1 LUT value × E8M0 block scale (MXFP4 microscaling).
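
A minimal sketch of that dequant in plain Python (no torch), assuming the
standard E2M1 code table and an E8M0 scale of 2^(byte - 127). The names
E2M1_LUT, e8m0_to_float and dequant_mxfp4_block are illustrative and not taken
from the script:

```python
# E2M1 code table: bit 3 is the sign, so codes 0 and 8 are +0.0 and -0.0.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def e8m0_to_float(scale_byte):
    """Decode an E8M0 scale: exponent only, bias 127, 0xFF reserved for NaN."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def dequant_mxfp4_block(codes, scale_byte):
    """Dequantize one block of 4-bit codes that share a single E8M0 scale."""
    scale = e8m0_to_float(scale_byte)
    return [E2M1_LUT[c] * scale for c in codes]


# Scale byte 128 decodes to 2.0, so code 7 (6.0) becomes 12.0.
block = dequant_mxfp4_block([1, 7, 9], 128)
```

In a real checkpoint each block would hold 32 codes; the short list here only
keeps the example readable.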
Also adds model_opt_nvfp4_full.py for full model NVFP4 quantization.
INT4 expert weights are packed 2-per-byte into int8, with one float8_e8m0fnu
scale per 32-column block in each row. Unpacking: lower nibble first, upper second.
The unpacked last dimension is 2x the stored one (e.g. [3072,3584] → [3072,7168]).
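
The unpacking order can be sketched as follows, in plain Python on one row of
bytes. unpack_nibbles is a hypothetical helper name, and the mask assumes the
int8 storage may come back signed:

```python
def unpack_nibbles(packed_row):
    """Split each packed byte into two 4-bit codes: lower nibble, then upper."""
    codes = []
    for b in packed_row:
        b &= 0xFF              # int8 values may be negative; view as unsigned
        codes.append(b & 0x0F)  # lower nibble comes first
        codes.append(b >> 4)    # upper nibble comes second
    return codes


# Two stored bytes become four codes, so the row length doubles.
row = unpack_nibbles([0x21, -1])
```

Applied to every row, this turns a stored [3072,3584] tensor into
[3072,7168] codes, which are then scaled per 32-column block.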
Also adds progress output with ETA per shard so screen sessions stay alive.
Unlike the naive upcast, this properly dequantizes FP8 block-wise weights:
bf16 = fp8_weight * scale_expanded (128x128 blocks).
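
A rough sketch of the block-wise multiply, in plain Python on nested lists. It
assumes the FP8 values are already decoded to floats and that scale holds one
entry per 128x128 tile; dequant_blockwise is an illustrative name, and the
block parameter exists only so the example can use a small tile:

```python
def dequant_blockwise(weight, scale, block=128):
    """Multiply each weight by the scale of the block x block tile it lies in."""
    return [
        [w * scale[i // block][j // block] for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


# 2x4 weight with 2x2 tiles: the left tile uses scale 2.0, the right 3.0.
weight = [[1.0, 1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0, 1.0]]
scale = [[2.0, 3.0]]
out = dequant_blockwise(weight, scale, block=2)
```

Indexing by integer division also covers a ragged last tile, as long as the
scale grid has ceil(rows/block) x ceil(cols/block) entries.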
Also removes the now-unnecessary scale tensors and updates config.
FP8Linear.forward() sees element_size() > 1 and falls back to F.linear().