From 7e42b5e090217404daf9653e8cf37e78377dbf14 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Tue, 2 Jun 2026 20:23:47 +0000
Subject: [PATCH] =?UTF-8?q?A1:=20Add=20=E2=97=87=20(think=5Fstart)=20primi?=
 =?UTF-8?q?ng=20after=20Assistant=20token?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

DSV4 is a reasoning model. The standard prompt format is:
  BOS <|User|> prompt <|Assistant|> ◇
Without the trailing ◇ (think_start), the prompt is out-of-distribution:
the model expects to be inside a thinking block but never receives the
sentinel. This causes degenerate output from step 0 (e.g. "France" where
"Paris" is expected, or looping on newlines/repeated tokens).
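
As a rough sketch of the prompt format above (not the code in
single_shot_inference.py), the layout can be written as a small helper.
The function name, the stub encoder, and every token id below are
hypothetical placeholders; the real ids come from the model's tokenizer.

```python
from typing import Callable, List


def build_prompt_ids(
    encode: Callable[[str], List[int]],
    prompt: str,
    bos: int,
    user_token: int,
    assistant_token: int,
    think_start: int,
) -> List[int]:
    """Return BOS <|User|> prompt <|Assistant|> think_start as token ids."""
    ids = [bos, user_token]
    ids += encode("\n\n" + prompt)
    ids.append(assistant_token)
    ids.append(think_start)  # the priming token this patch adds
    return ids


def fake_encode(text: str) -> List[int]:
    """Stand-in encoder for illustration only (one id per character)."""
    return [1000 + ord(c) for c in text]


# All ids here are placeholders, not the real DSV4 vocabulary ids.
ids = build_prompt_ids(
    fake_encode, "Hi", bos=0, user_token=10, assistant_token=11, think_start=12
)
```

The only property that matters is that the last two ids are the
Assistant token followed by think_start.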

With ◇ (think_start) appended, the model is expected to:
1. Generate thinking content (reasoning)
2. Emit ◇ (think_end=128822) to close the thinking block
3. Produce the actual answer
4. Emit EOS (token 1)
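
A minimal sketch of how a caller might separate the thinking block from
the answer, assuming the four steps above hold. Only the think_end id
(128822) and the EOS id (1) are taken from this message; the helper name
and the fallback for an unclosed thinking block are assumptions.

```python
from typing import List, Tuple

THINK_END = 128822  # think_end id, as stated in this commit message
EOS = 1             # EOS id, as stated in this commit message


def split_generation(new_tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Split tokens generated after the prompt into (thinking, answer)."""
    # Drop EOS and anything generated after it.
    if EOS in new_tokens:
        new_tokens = new_tokens[: new_tokens.index(EOS)]
    if THINK_END not in new_tokens:
        # Thinking block never closed: treat everything as thinking.
        return new_tokens, []
    cut = new_tokens.index(THINK_END)
    return new_tokens[:cut], new_tokens[cut + 1 :]
```

An empty answer list is then a cheap signal that generation stopped
before the thinking block was closed.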

This matches the pattern described in the Kimi K2 accuracy blog
(https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy): malformed prompt
formatting is the #1 cause of degenerate output in chat-tuned reasoning
models.
---
 single_shot_inference.py | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/single_shot_inference.py b/single_shot_inference.py
index 62d4786f..18d6bcde 100644
--- a/single_shot_inference.py
+++ b/single_shot_inference.py
@@ -1361,6 +1361,13 @@ def main():
         input_ids = [bos, USER_TOKEN]
         input_ids += tokenizer.encode('\n\n' + PROMPT, add_special_tokens=False)
         input_ids.append(ASSISTANT_TOKEN)
+        # DSV4 reasoning model: prime with ◇ (THINK_START) after the Assistant token.
+        # Without it the prompt is out-of-distribution: the model expects to be
+        # inside a thinking block but never received the think-start sentinel.
+        # Symptom: degenerate output from step 0 (e.g. "France" instead of "Paris",
+        # looping on newlines/repeated tokens). With ◇, the model generates thinking
+        # content, emits the think_end token, then produces the actual answer.
+        input_ids.append(THINK_START)
         generated = input_ids
     all_tokens = generated.copy()
     print(f"Input: {len(generated)} tokens")