The sink bias from the checkpoint is already in the scaled domain: the reference softmax adds it to QK*scale, not to the raw QK scores. The kernel's running_max tracks max(QK*scale), so the sink should be compared against running_max directly, without multiplying it by scale again.
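
A minimal NumPy sketch of the point above, not the kernel itself. It assumes the reference treats the sink as one extra logit column appended to QK*scale, and that the kernel uses a blockwise online softmax. The function names, the `block` size, and the `rescale_sink` flag are hypothetical and exist only to contrast the two behaviours.

```python
import numpy as np

def reference_probs(q, k, sink, scale):
    """Assumed reference: the sink joins the logits after scaling."""
    logits = (q @ k.T) * scale                       # scaled domain
    sink_col = np.full((logits.shape[0], 1), sink)   # sink is NOT scaled
    full = np.concatenate([logits, sink_col], axis=1)
    full = full - full.max(axis=1, keepdims=True)    # stabilise exp
    p = np.exp(full)
    p /= p.sum(axis=1, keepdims=True)
    return p[:, :-1]                                 # drop the sink column

def online_probs(q, k, sink, scale, block=4, rescale_sink=False):
    """Blockwise online softmax; running_max tracks max(QK*scale)."""
    n_q = q.shape[0]
    running_max = np.full(n_q, -np.inf)
    running_sum = np.zeros(n_q)
    for start in range(0, k.shape[0], block):
        s = (q @ k[start:start + block].T) * scale
        new_max = np.maximum(running_max, s.max(axis=1))
        # Rescale the old sum to the new max, then add this block.
        running_sum = (running_sum * np.exp(running_max - new_max)
                       + np.exp(s - new_max[:, None]).sum(axis=1))
        running_max = new_max
    # rescale_sink=True reproduces the bug: scale is applied twice.
    sink_logit = sink * scale if rescale_sink else sink
    final_max = np.maximum(running_max, sink_logit)
    denom = (running_sum * np.exp(running_max - final_max)
             + np.exp(sink_logit - final_max))
    s_all = (q @ k.T) * scale
    return np.exp(s_all - final_max[:, None]) / denom[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((3, 8))
    k = rng.standard_normal((10, 8))
    scale, sink = 1.0 / np.sqrt(8), 1.5
    ref = reference_probs(q, k, sink, scale)
    print(np.allclose(ref, online_probs(q, k, sink, scale)))
    print(np.allclose(ref, online_probs(q, k, sink, scale, rescale_sink=True)))
```

Under these assumptions, the unscaled-sink path should agree with the reference and the `rescale_sink=True` path should not, because the sink's contribution to both the max and the denominator shrinks by a factor of scale in the exponent.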