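The softmax scales logits by scale_log2 = scale * log2(e) before taking exp2. Adding sink_val to the raw logits means it gets scaled too, so the weights are multiplied by exp(sink_val * scale) rather than exp(sink_val). Fix: add sink_val/scale instead, so that after scaling: (sink_val/scale) * scale_log2 = sink_val * log2(e). Since exp2(sink_val * log2(e)) = exp(sink_val), this correctly multiplies the unnormalized attention weights by exp(sink_val).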
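A quick numeric check of the reasoning above, as a minimal sketch rather than the actual kernel code. It assumes the kernel computes unnormalized weights as exp2(logit * scale_log2); the helper weight_exp2 and the values of scale, logit and sink_val are hypothetical and chosen only for illustration.

```python
import math

LOG2_E = math.log2(math.e)


def weight_exp2(logit, scale):
    """Unnormalized softmax weight, computed the way an exp2-based kernel would
    (assumption: weight = 2 ** (logit * scale * log2(e)))."""
    scale_log2 = scale * LOG2_E
    return 2.0 ** (logit * scale_log2)


# Hypothetical values for illustration only.
scale = 0.125      # e.g. 1/sqrt(head_dim) with head_dim = 64
logit = 3.0
sink_val = 0.7

base = weight_exp2(logit, scale)

# Before the fix: sink_val is added to the raw logit and gets scaled with it.
buggy_factor = weight_exp2(logit + sink_val, scale) / base

# After the fix: sink_val is pre-divided by scale, so the scaling cancels.
fixed_factor = weight_exp2(logit + sink_val / scale, scale) / base

print("buggy factor:", buggy_factor, "expected exp(sink_val * scale):", math.exp(sink_val * scale))
print("fixed factor:", fixed_factor, "expected exp(sink_val):", math.exp(sink_val))
```

With these inputs the buggy factor matches exp(sink_val * scale) and the fixed factor matches exp(sink_val), up to floating-point rounding. The sketch assumes scale is nonzero, since the fix divides by it.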