Generative Scoring (#34539)

Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
This commit is contained in:
Vedant V Jhaveri
2026-03-31 16:02:11 -07:00
committed by GitHub
parent 36f1dc19ae
commit 2e56975657
13 changed files with 1265 additions and 3 deletions


@@ -73,8 +73,11 @@ In addition, we have the following custom APIs:
- [Cohere Embed API](../models/pooling_models/embed.md#cohere-embed-api) (`/v2/embed`)
- Compatible with [Cohere's Embed API](https://docs.cohere.com/reference/embed)
- Works with any [embedding model](../models/pooling_models/embed.md#supported-models), including multimodal models.
- [Score API](../models/pooling_models/scoring.md#score-api) (`/score`, `/v1/score`)
- Applicable to [score models](../models/pooling_models/scoring.md) (cross-encoder, bi-encoder, late-interaction).
- [Generative Scoring API](#generative-scoring-api) (`/generative_scoring`)
- Applicable to [CausalLM models](../models/generative_models.md) (task `"generate"`).
- Computes next-token probabilities for specified `label_token_ids`.
- [Rerank API](../models/pooling_models/scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 rerank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 rerank APIs](https://docs.cohere.com/v2/reference/rerank)
@@ -481,6 +484,71 @@ This approach is more robust than index-based access (`messages[0]`, `messages[1
Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../examples/pooling/score/template/nemotron-rerank.jinja)
### Generative Scoring API
The `/generative_scoring` endpoint uses a CausalLM model (e.g., Llama, Qwen, Mistral) to compute the probability of specified token IDs appearing as the next token. Each item (document) is concatenated with the query to form a prompt, and the model predicts how likely each label token is as the next token after that prompt. This lets you score items against a query: for example, ask "Is this the capital of France?" and score each city by how likely the model is to answer "Yes".
This endpoint is automatically available when the server is started with a generative model (task `"generate"`). It is separate from the pooling-based [Score API](#score-api), which uses cross-encoder, bi-encoder, or late-interaction models.
**Requirements:**
- The `label_token_ids` parameter is **required** and must contain **at least 1 token ID**.
- When 2 label tokens are provided, the score equals `P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))` (softmax over the two labels).
- When more than two labels are provided, the score is the softmax-normalized probability of the first label token, computed over all label tokens.
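The scoring rule above can be sketched numerically. This is a minimal illustration of the softmax-over-labels math, not vLLM's internal code, and the logprob values are made up:

```python
import math

def generative_score(label_logprobs: list[float]) -> float:
    """Softmax over the label tokens only; return the first label's share."""
    probs = [math.exp(lp) for lp in label_logprobs]
    return probs[0] / sum(probs)

# Two labels ("Yes", "No"): equivalent to P(yes) / (P(yes) + P(no)).
score = generative_score([-0.1, -2.3])
```

Because the softmax runs over the label tokens only, the score is well defined even when both labels are individually unlikely continuations.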
#### Example
```bash
curl -X POST http://localhost:8000/generative_scoring \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"query": "Is this city the capital of France?",
"items": ["Paris", "London", "Berlin"],
"label_token_ids": [9454, 2753]
}'
```
Here, each item is appended to the query to form prompts like `"Is this city the capital of France? Paris"`, `"... London"`, etc. The model then predicts the next token, and the score reflects the probability of "Yes" (token 9454) vs "No" (token 2753).
??? console "Response"
    ```json
    {
      "id": "generative-scoring-abc123",
      "object": "list",
      "created": 1234567890,
      "model": "Qwen/Qwen3-0.6B",
      "data": [
        {"index": 0, "object": "score", "score": 0.95},
        {"index": 1, "object": "score", "score": 0.12},
        {"index": 2, "object": "score", "score": 0.08}
      ],
      "usage": {"prompt_tokens": 45, "total_tokens": 48, "completion_tokens": 3}
    }
    ```
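Scores come back in request order, so ranking the items means pairing each one with its `data` entry via the returned `index` and sorting. A small sketch using the sample response above (field names as shown; score values are illustrative):

```python
response = {
    "data": [
        {"index": 0, "object": "score", "score": 0.95},
        {"index": 1, "object": "score", "score": 0.12},
        {"index": 2, "object": "score", "score": 0.08},
    ],
}
items = ["Paris", "London", "Berlin"]

# Pair each item with its score via the returned index, highest score first.
ranked = sorted(
    ((items[d["index"]], d["score"]) for d in response["data"]),
    key=lambda pair: pair[1],
    reverse=True,
)
# ranked[0] is the best match: ("Paris", 0.95)
```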
#### How it works
1. **Prompt Construction**: For each item, builds `prompt = query + item` (or `item + query` if `item_first=true`)
2. **Forward Pass**: Runs the model on each prompt to get next-token logits
3. **Probability Extraction**: Extracts logprobs for the specified `label_token_ids`
4. **Softmax Normalization**: Applies softmax over only the label tokens (when `apply_softmax=true`)
5. **Score**: Returns the normalized probability of the first label token
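Step 1 can be sketched as plain string handling. This is an illustration of the documented behavior, not the server's actual implementation; the single-space separator mirrors the `"... Paris"` example above and is an assumption:

```python
def build_prompt(query: str, item: str, item_first: bool = False) -> str:
    # Step 1: concatenate query and item (order controlled by item_first).
    # The space separator follows the example prompts in this doc; the
    # exact separator used by the server is an implementation detail.
    return f"{item} {query}" if item_first else f"{query} {item}"

prompts = [
    build_prompt("Is this city the capital of France?", city)
    for city in ["Paris", "London", "Berlin"]
]
```

Each prompt then gets one forward pass (step 2), and steps 3-5 reduce the resulting next-token logits to a single score per item.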
#### Finding Token IDs
To find the token IDs for your labels, use the tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# add_special_tokens=False keeps BOS/EOS tokens out of the result.
# Note: "Yes" and " Yes" (leading space) can map to different IDs in many
# tokenizers, so encode the label exactly as it would follow your prompt.
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]
print(f"Yes: {yes_id}, No: {no_id}")
```
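The `[0]` indexing above silently drops tokens when a label encodes to more than one ID. A defensive helper (hypothetical, not part of vLLM) makes that failure explicit; the fake vocabulary below stands in for `tokenizer.encode(..., add_special_tokens=False)`, and the `"Maybe"` IDs are invented for illustration:

```python
def first_token_id(encode, text: str) -> int:
    """Return the single token ID for `text`, failing loudly otherwise."""
    ids = encode(text)
    if len(ids) != 1:
        raise ValueError(
            f"{text!r} encodes to {len(ids)} tokens {ids}; "
            "pick a label that is a single token"
        )
    return ids[0]

# Stand-in vocabulary; real code would pass the tokenizer's encode function.
fake_vocab = {"Yes": [9454], "No": [2753], "Maybe": [21390, 318]}
yes_id = first_token_id(fake_vocab.get, "Yes")   # 9454
```

A multi-token label like `"Maybe"` raises instead of silently scoring only its first piece.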
## Ray Serve LLM
Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.