# Reward Usage

A reward model (RM) is designed to evaluate and score the quality of outputs generated by a language model, acting as a proxy for human preferences.

## Summary

- Model Usage: reward
- Pooling Tasks: see the table below.

| Model Types | Pooling Tasks |
|------------------------------------|------------------|
| (sequence) (outcome) reward models | `classify` |
| token (outcome) reward models | `token_classify` |
| process reward models | `token_classify` |

- Offline APIs:
    - `LLM.encode(..., pooling_task="...")`
- Online APIs:
    - Pooling API (`/pooling`)

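The mapping above can be sketched as a small lookup; the dictionary and helper below are illustrative only, not part of the vLLM API:

```python
# Illustrative mapping from reward-model type to vLLM pooling task
# (mirrors the table above; not part of the vLLM API).
POOLING_TASK_BY_MODEL_TYPE = {
    "sequence_reward": "classify",       # one score per sequence
    "token_reward": "token_classify",    # one score per token
    "process_reward": "token_classify",  # one score per reasoning step
}

def pooling_task_for(model_type: str) -> str:
    """Return the pooling task to pass to LLM.encode(..., pooling_task=...)."""
    try:
        return POOLING_TASK_BY_MODEL_TYPE[model_type]
    except KeyError:
        raise ValueError(f"Unknown reward model type: {model_type!r}")

print(pooling_task_for("process_reward"))  # token_classify
```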
## Supported Models

### Reward Models

Using sequence classification models as (sequence) (outcome) reward models, the usage and supported features are the same as for normal [classification models](classify.md).

--8<-- [start:supported-sequence-reward-models]

| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | ------------------------------ | ------------------------------------------ |
| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ |
| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `Skywork/Skywork-Reward-V2-Qwen3-0.6B`, etc. | ✅︎ | ✅︎ |
| `LlamaForSequenceClassification`<sup>C</sup> | Llama-based | `Skywork/Skywork-Reward-V2-Llama-3.2-1B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))

If your model is not in the above list, we will try to automatically convert the model using [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.

--8<-- [end:supported-sequence-reward-models]
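Last-token pooling with softmax, as described above, can be pictured with a framework-free sketch; the logits below are made-up values for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-token classification logits for a 4-token sequence
# with 2 classes, e.g. produced by a linear head over hidden states.
per_token_logits = [
    [0.2, -0.1],
    [1.5, 0.3],
    [-0.7, 0.9],
    [2.0, 0.5],  # last token
]

# Sequence-level (outcome) reward: softmax the logits of the LAST token only.
class_probs = softmax(per_token_logits[-1])
print(class_probs)
```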

### Token Reward Models

The key distinction between (sequence) classification and token classification lies in their output granularity: (sequence) classification produces a single result for an entire input sequence, whereas token classification yields a result for each individual token within the sequence.
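The difference in output granularity can be sketched in plain Python; the scores below are hypothetical, not real model output:

```python
# Hypothetical per-token reward scores for a 5-token input.
per_token_scores = [0.1, 0.4, 0.2, 0.8, 0.6]

# token_classify: one result per token -- the whole list is returned.
token_level_output = per_token_scores

# classify: one result for the entire sequence -- sketched here as
# taking the last token's score (last-token pooling).
sequence_level_output = per_token_scores[-1]

print(len(token_level_output))  # 5
print(sequence_level_output)    # 0.6
```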

Using token classification models as token (outcome) reward models, the usage and supported features are the same as for normal [token classification models](token_classify.md).

--8<-- [start:supported-token-reward-models]

| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | ------------------------------ | ------------------------------------------ |
| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
| `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |

<sup>C</sup> Automatically converted into a classification model via `--convert classify`. ([details](./README.md#model-conversion))

If your model is not in the above list, we will try to automatically convert the model using [as_seq_cls_model][vllm.model_executor.models.adapters.as_seq_cls_model].

--8<-- [end:supported-token-reward-models]


### Process Reward Models

Process reward models evaluate the intermediate steps of a response rather than only its final outcome; scoring these intermediate steps is crucial to achieving the desired result.

| Architecture | Models | Example HF Models | [LoRA](../../features/lora.md) | [PP](../../serving/parallelism_scaling.md) |
| ------------ | ------ | ----------------- | ------------------------------ | ------------------------------------------ |
| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ |
| `Qwen2ForProcessRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-PRM-7B`, etc. | ✅︎ | ✅︎ |

!!! important
    For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
    e.g.: `--pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.

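Conceptually, `STEP` pooling keeps only the positions of the step tag token and, at each such position, returns the probabilities of the configured token IDs, yielding one score per reasoning step. A framework-free sketch, reusing the placeholder IDs from the flag above (all logits are made up):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Placeholder config, matching the example flag above.
step_tag_id = 123
returned_token_ids = [456, 789]  # e.g. the "good"/"bad" step labels

# Hypothetical input token IDs and per-token logits
# (vocab shown as a dict {token_id: logit} for brevity).
token_ids = [10, 11, 123, 12, 13, 123]
per_token_logits = [
    {456: 0.0, 789: 0.0},
    {456: 0.1, 789: -0.2},
    {456: 2.0, 789: -1.0},   # first step tag
    {456: 0.3, 789: 0.3},
    {456: -0.5, 789: 0.4},
    {456: -1.5, 789: 1.5},   # second step tag
]

# STEP pooling: at every step-tag position, softmax the logits of the
# returned token IDs -> one probability pair per reasoning step.
step_scores = [
    softmax([logits[t] for t in returned_token_ids])
    for tok, logits in zip(token_ids, per_token_logits)
    if tok == step_tag_id
]
print(len(step_scores))  # 2 steps
```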
## Offline Inference

### Pooling Parameters

The following [pooling parameters][vllm.PoolingParams] are supported.

```python
--8<-- "vllm/pooling_params.py:common-pooling-params"
--8<-- "vllm/pooling_params.py:classify-pooling-params"
```

### `LLM.encode`

The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.

- Reward Models

    Set `pooling_task="classify"` when using `LLM.encode` for (sequence) (outcome) reward models:

    ```python
    from vllm import LLM

    llm = LLM(model="Skywork/Skywork-Reward-V2-Qwen3-0.6B", runner="pooling")
    (output,) = llm.encode("Hello, my name is", pooling_task="classify")

    data = output.outputs.data
    print(f"Data: {data!r}")
    ```

- Token Reward Models

    Set `pooling_task="token_classify"` when using `LLM.encode` for token (outcome) reward models:

    ```python
    from vllm import LLM

    llm = LLM(model="internlm/internlm2-1_8b-reward", runner="pooling", trust_remote_code=True)
    (output,) = llm.encode("Hello, my name is", pooling_task="token_classify")

    data = output.outputs.data
    print(f"Data: {data!r}")
    ```

- Process Reward Models

    Set `pooling_task="token_classify"` when using `LLM.encode` for process reward models:

    ```python
    from vllm import LLM

    llm = LLM(model="Qwen/Qwen2.5-Math-PRM-7B", runner="pooling")
    (output,) = llm.encode("Hello, my name is<extra_0><extra_0><extra_0>", pooling_task="token_classify")

    data = output.outputs.data
    print(f"Data: {data!r}")
    ```

## Online Serving

Please refer to the [pooling API](README.md#pooling-api). For the pooling task corresponding to each reward model type, refer to the [table above](#summary).
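As a sketch of what a `/pooling` request looks like, the helper below builds the JSON body; the payload shape follows the pooling API, while the model name (taken from the table above) and the local server URL in the comment are assumptions for illustration:

```python
import json

def build_pooling_request(model: str, text: str) -> dict:
    """Build the JSON body for a POST to the server's /pooling endpoint:
    a model name plus the input text to score."""
    return {"model": model, "input": text}

payload = build_pooling_request(
    "internlm/internlm2-1_8b-reward",  # example model from the table above
    "Hello, my name is",
)
print(json.dumps(payload))

# To actually send it (server URL assumed to be the local default):
#   import requests
#   resp = requests.post("http://localhost:8000/pooling", json=payload)
#   print(resp.json())
```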