Generative Scoring (#34539)
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
@@ -73,8 +73,11 @@ In addition, we have the following custom APIs:
 - [Cohere Embed API](../models/pooling_models/embed.md#cohere-embed-api) (`/v2/embed`)
     - Compatible with [Cohere's Embed API](https://docs.cohere.com/reference/embed)
     - Works with any [embedding model](../models/pooling_models/embed.md#supported-models), including multimodal models.
-- [Score API](../models/pooling_models/scoring.md#score-api) (`/score`)
-    - Applicable to [score models](../models/pooling_models/scoring.md).
+- [Score API](../models/pooling_models/scoring.md#score-api) (`/score`, `/v1/score`)
+    - Applicable to [score models](../models/pooling_models/scoring.md) (cross-encoder, bi-encoder, late-interaction).
+- [Generative Scoring API](#generative-scoring-api) (`/generative_scoring`)
+    - Applicable to [CausalLM models](../models/generative_models.md) (task `"generate"`).
+    - Computes next-token probabilities for specified `label_token_ids`.
 - [Rerank API](../models/pooling_models/scoring.md#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
     - Implements [Jina AI's v1 rerank API](https://jina.ai/reranker/)
     - Also compatible with [Cohere's v1 & v2 rerank APIs](https://docs.cohere.com/v2/reference/rerank)
@@ -481,6 +484,71 @@ This approach is more robust than index-based access (`messages[0]`, `messages[1]`).

Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../examples/pooling/score/template/nemotron-rerank.jinja)
### Generative Scoring API

The `/generative_scoring` endpoint uses a CausalLM model (e.g., Llama, Qwen, Mistral) to compute the probability of specified token IDs appearing as the next token. Each item (document) is concatenated with the query to form a prompt, and the model predicts how likely each label token is as the next token after that prompt. This lets you score items against a query, for example asking "Is this the capital of France?" and scoring each city by how likely the model is to answer "Yes".

This endpoint is automatically available when the server is started with a generative model (task `"generate"`). It is separate from the pooling-based [Score API](#score-api), which uses cross-encoder, bi-encoder, or late-interaction models.
**Requirements:**

- The `label_token_ids` parameter is **required** and must contain **at least 1 token ID**.
- When 2 label tokens are provided, the score equals `P(label_token_ids[0]) / (P(label_token_ids[0]) + P(label_token_ids[1]))` (softmax over the two labels).
- When more labels are provided, the score is the softmax-normalized probability of the first label token across all label tokens.
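As a sanity check on the formula above: the two-label case is an ordinary softmax restricted to the label tokens, which on the probability scale reduces to simple normalization. A minimal sketch in plain Python (illustration only, not vLLM code):

```python
def first_label_score(label_probs: list[float]) -> float:
    """Normalized probability of the first label token.

    Softmax over the label logprobs reduces to dividing each label's
    probability by the sum of the label probabilities.
    """
    return label_probs[0] / sum(label_probs)

# Two labels: P(yes) = 0.6, P(no) = 0.1 -> 0.6 / (0.6 + 0.1) ≈ 0.857
print(first_label_score([0.6, 0.1]))

# More labels: the score is still the first label's share of the total.
print(first_label_score([0.5, 0.25, 0.25]))  # 0.5
```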
#### Example

```bash
curl -X POST http://localhost:8000/generative_scoring \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "query": "Is this city the capital of France?",
    "items": ["Paris", "London", "Berlin"],
    "label_token_ids": [9454, 2753]
  }'
```
Here, each item is appended to the query to form prompts like `"Is this city the capital of France? Paris"`, `"... London"`, etc. The model then predicts the next token, and the score reflects the probability of "Yes" (token 9454) vs "No" (token 2753).

??? console "Response"

    ```json
    {
      "id": "generative-scoring-abc123",
      "object": "list",
      "created": 1234567890,
      "model": "Qwen/Qwen3-0.6B",
      "data": [
        {"index": 0, "object": "score", "score": 0.95},
        {"index": 1, "object": "score", "score": 0.12},
        {"index": 2, "object": "score", "score": 0.08}
      ],
      "usage": {"prompt_tokens": 45, "total_tokens": 48, "completion_tokens": 3}
    }
    ```
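On the client side, the `index` field ties each score back to the corresponding item, so ranking the items is straightforward. A sketch using the response shown above (hardcoded here instead of fetched from the server):

```python
# Response shape copied from the example above.
response = {
    "data": [
        {"index": 0, "object": "score", "score": 0.95},
        {"index": 1, "object": "score", "score": 0.12},
        {"index": 2, "object": "score", "score": 0.08},
    ]
}
items = ["Paris", "London", "Berlin"]

# Pair each item with its score via the returned index, best first.
ranked = sorted(
    ((items[entry["index"]], entry["score"]) for entry in response["data"]),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)  # [('Paris', 0.95), ('London', 0.12), ('Berlin', 0.08)]
```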
#### How it works

1. **Prompt Construction**: For each item, builds `prompt = query + item` (or `item + query` if `item_first=true`)
2. **Forward Pass**: Runs the model on each prompt to get next-token logits
3. **Probability Extraction**: Extracts logprobs for the specified `label_token_ids`
4. **Softmax Normalization**: Applies softmax over only the label tokens (when `apply_softmax=true`)
5. **Score**: Returns the normalized probability of the first label token
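The five steps can be sketched in plain Python. This is an illustrative re-implementation, not vLLM's actual code: the forward pass is abstracted behind a stand-in function so the sketch stays self-contained, and only the parameter names (`item_first`, `apply_softmax`, `label_token_ids`) come from the list above.

```python
import math

def generative_scores(next_token_logits, query, items, label_token_ids,
                      item_first=False, apply_softmax=True):
    """Illustrative sketch of the five steps above.

    `next_token_logits(prompt)` stands in for a model forward pass and
    must return a mapping from token ID to logit.
    """
    scores = []
    for item in items:
        # 1. Prompt construction
        prompt = (item + query) if item_first else (query + item)
        # 2. "Forward pass" via the stand-in function
        logits = next_token_logits(prompt)
        # 3. Extract the logits of the requested label tokens
        label_logits = [logits[t] for t in label_token_ids]
        # 4. Softmax over only the label tokens
        if apply_softmax:
            m = max(label_logits)
            exps = [math.exp(x - m) for x in label_logits]
            label_logits = [e / sum(exps) for e in exps]
        # 5. Score = (normalized) value of the first label token
        scores.append(label_logits[0])
    return scores

# Toy stand-in: prompts mentioning "Paris" favor token 9454 ("Yes").
fake_model = lambda p: {9454: 3.0 if "Paris" in p else -1.0, 2753: 1.0}
print(generative_scores(fake_model, "Is this the capital of France? ",
                        ["Paris", "London"], [9454, 2753]))
```

With `apply_softmax=False` the raw logit of the first label is returned instead of a probability.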
#### Finding Token IDs

To find the token IDs for your labels, use the tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode("No", add_special_tokens=False)[0]
print(f"Yes: {yes_id}, No: {no_id}")
```
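Note that the snippet above takes element `[0]` of the encoding, which silently drops tokens if a label splits into more than one; also, many BPE tokenizers give `"Yes"` and `" Yes"` (leading space) different IDs, so encode the variant that will actually follow your prompt. A small hypothetical guard; the toy vocabulary stands in for a real tokenizer, and only IDs 9454 and 2753 come from the example above:

```python
def single_token_id(encode, label: str) -> int:
    """Return the sole token ID for `label`, raising if the tokenizer
    splits it into several tokens. `encode` should behave like
    lambda s: tokenizer.encode(s, add_special_tokens=False)."""
    ids = encode(label)
    if len(ids) != 1:
        raise ValueError(f"{label!r} tokenizes to {ids}; pick a single-token label")
    return ids[0]

# Toy vocabulary; the " Yes" ID and the "Maybe" split are invented.
toy_vocab = {"Yes": [9454], " Yes": [7414], "No": [2753], "Maybe": [44, 603]}
print(single_token_id(toy_vocab.get, "Yes"))  # 9454
```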
## Ray Serve LLM

Ray Serve LLM enables scalable, production-grade serving of the vLLM engine. It integrates tightly with vLLM and extends it with features such as auto-scaling, load balancing, and back-pressure.