The layertest dequantizes the checkpoint from NVFP4 to BF16, then re-quantizes from BF16 back to NVFP4. This double quantization costs about 1% in cosine similarity, so the observed 0.989 cosine is expected quantization noise and the kernel itself is correct.
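As a rough illustration of how quantization noise alone pulls the cosine below 1.0, here is a minimal NumPy sketch. It is not the layertest. `fake_quant_fp4` and `cosine` are hypothetical helpers, the block scale is kept as a plain float (real NVFP4 stores block scales in FP8, which is not modeled here), and the random matrices are stand-ins for real weights and activations, so the number it prints should not be expected to match 0.989.

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) value; the sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x, block=16):
    """Round x to an FP4-like grid using one float scale per block of values.

    Assumption: this is a simplified stand-in for NVFP4, with no FP8 scale
    rounding and no per-tensor scale. x.size must be divisible by `block`.
    """
    flat = x.reshape(-1, block)
    amax = np.abs(flat).max(axis=1, keepdims=True)
    # Map the largest magnitude in each block to the top of the grid (6.0).
    scale = np.where(amax > 0, amax / FP4_GRID[-1], 1.0)
    scaled = np.abs(flat) / scale
    # Pick the nearest grid point for every element.
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(flat) * FP4_GRID[idx] * scale).reshape(x.shape)


def cosine(a, b):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))   # stand-in for a weight matrix
x = rng.standard_normal((8, 64))    # stand-in for layer inputs

ref = x @ w.T                       # full-precision reference output
out = x @ fake_quant_fp4(w).T       # output with quantized weights

print(f"cosine(ref, quantized) = {cosine(ref, out):.4f}")
```

Comparing the layer output against a full-precision reference, rather than against the kernel's own inputs, is what separates quantization noise from an actual kernel bug: a kernel defect typically drops the cosine far more than rounding to the FP4 grid does.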