CRITICAL FIX: FP4 LUT was up to 4x too large!

E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6], NOT [0, 2, 3, 4, 6, 8, 12, 24].
The old LUT entries were roughly 2.7x to 4x the correct values (exactly 4x
only for 0.5 and 6), so every nonzero NVFP4 dequantized weight came out
several times too large. The error compounded across 61 layers, making
the residual stream explode and producing gibberish output.

This is the root cause of the residual growth and incoherent generation.
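
The corrected magnitudes can be derived from the bit layout instead of being typed by hand. The sketch below assumes the standard E2M1 encoding: 2 exponent bits, 1 mantissa bit, exponent bias 1, and a subnormal at exponent 0. The helper `e2m1_magnitude` is hypothetical and not part of this repository. It only covers the 3 magnitude bits and ignores the sign bit and how the checkpoint packs codes.

```python
def e2m1_magnitude(code: int) -> float:
    """Decode the 3 magnitude bits of an FP4 E2M1 code (assumed layout: EEM, bias 1)."""
    exp = (code >> 1) & 0b11   # upper two bits: exponent field
    man = code & 0b1           # lowest bit: mantissa field
    if exp == 0:
        # Subnormal range: no implicit leading 1, so the value is 0 or 0.5.
        return 0.5 * man
    # Normal range: implicit leading 1, mantissa bit is worth 0.5, bias is 1.
    return (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Magnitudes for codes 0..7, in the same order as the lookup table.
E2M1_MAGNITUDES = [e2m1_magnitude(code) for code in range(8)]
print(E2M1_MAGNITUDES)
```

A check of this kind could sit next to the table definition, for example by asserting that `FP4_LUT.tolist()` equals the derived list, so that a mistyped table fails at import time rather than surfacing as degraded generation.
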
2026-05-31 04:16:13 +00:00
parent b8c8da91fe
commit aafe2eee12

@@ -53,7 +53,7 @@ NUM_GPUS = 8
# NVFP4 dequantization — matches checkpoint format exactly
# =====================================================================
-FP4_LUT = torch.tensor([0., 2., 3., 4., 6., 8., 12., 24.])
+FP4_LUT = torch.tensor([0., 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
def dequant_nvfp4_weight(weight, weight_scale, weight_scale_2):
"""Dequantize NVFP4 weight to BF16.