[Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 (#35735)

Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
This commit is contained in:
Jakub Zakrzewski
2026-03-03 01:32:14 +01:00
committed by GitHub
parent 18c29c746b
commit c8b678e53e
9 changed files with 503 additions and 149 deletions

@@ -498,7 +498,9 @@ curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
- Multi-vector retrieval: [examples/pooling/token_embed/colqwen3_token_embed_online.py](../../examples/pooling/token_embed/colqwen3_token_embed_online.py)
- Reranking (text + multi-modal): [examples/pooling/score/colqwen3_rerank_online.py](../../examples/pooling/score/colqwen3_rerank_online.py)
-### Llama Nemotron Multimodal Embedding Models
+### Llama Nemotron Multimodal
+#### Embedding Model
Llama Nemotron VL Embedding models combine the bidirectional Llama embedding backbone
(from `nvidia/llama-nemotron-embed-1b-v2`) with SigLIP as the vision encoder to produce
@@ -559,6 +561,70 @@ curl -s http://localhost:8000/v1/embeddings -H "Content-Type: application/json"
}'
```
#### Reranker Model
Llama Nemotron VL reranker models combine the same bidirectional Llama + SigLIP
backbone with a sequence-classification head for cross-encoder scoring and reranking.
| Architecture | Backbone | Example HF Models |
|---|---|---|
| `LlamaNemotronVLForSequenceClassification` | Bidirectional Llama + SigLIP | `nvidia/llama-nemotron-rerank-vl-1b-v2` |
Start the server:
```shell
vllm serve nvidia/llama-nemotron-rerank-vl-1b-v2 \
--runner pooling \
--trust-remote-code \
--chat-template examples/pooling/score/template/nemotron-vl-rerank.jinja
```
!!! note
    The chat template bundled with this checkpoint's tokenizer is not suitable
    for the Score/Rerank APIs. Use the provided override template when serving:
    `examples/pooling/score/template/nemotron-vl-rerank.jinja`.
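The `<BASE64>` placeholders in the requests below stand for base64-encoded image bytes. A minimal sketch for producing such a data URL from a local file (the helper name is illustrative, not part of vLLM):

```python
import base64
from pathlib import Path


def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Encode an image file as a data URL usable in `image_url` fields."""
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```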
Score a text query against an image document:
```shell
curl -s http://localhost:8000/score -H "Content-Type: application/json" -d '{
"model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
"data_1": "Find diagrams about autonomous robots",
"data_2": [
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}},
{"type": "text", "text": "Robotics workflow diagram."}
]
}
]
}'
```
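The same Score request can be assembled and sent from Python using only the standard library; a sketch assuming the server started above is reachable at `localhost:8000` (the `build_score_payload` helper is illustrative, not part of vLLM):

```python
import json
import urllib.request


def build_score_payload(query: str, image_data_url: str, caption: str) -> dict:
    """Assemble the /score body: one text query vs. one multimodal document."""
    return {
        "model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
        "data_1": query,
        "data_2": [
            {
                "content": [
                    {"type": "image_url", "image_url": {"url": image_data_url}},
                    {"type": "text", "text": caption},
                ]
            }
        ],
    }


if __name__ == "__main__":
    payload = build_score_payload(
        "Find diagrams about autonomous robots",
        "data:image/png;base64,<BASE64>",  # substitute a real data URL
        "Robotics workflow diagram.",
    )
    req = urllib.request.Request(
        "http://localhost:8000/score",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```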
Rerank image documents by a text query:
```shell
curl -s http://localhost:8000/rerank -H "Content-Type: application/json" -d '{
"model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
"query": "Find diagrams about autonomous robots",
"documents": [
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_1>"}},
{"type": "text", "text": "Robotics workflow diagram."}
]
},
{
"content": [
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_2>"}},
{"type": "text", "text": "General skyline photo."}
]
}
],
"top_n": 2
}'
```
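The rerank request follows the same pattern for an arbitrary list of documents; a Python sketch (the `build_rerank_payload` helper is illustrative, not part of vLLM) that builds the body from (image data URL, caption) pairs:

```python
import json
import urllib.request


def build_rerank_payload(query: str, docs: list[tuple[str, str]], top_n: int) -> dict:
    """Assemble the /rerank body from (image_data_url, caption) pairs."""
    return {
        "model": "nvidia/llama-nemotron-rerank-vl-1b-v2",
        "query": query,
        "documents": [
            {
                "content": [
                    {"type": "image_url", "image_url": {"url": url}},
                    {"type": "text", "text": caption},
                ]
            }
            for url, caption in docs
        ],
        "top_n": top_n,
    }


if __name__ == "__main__":
    payload = build_rerank_payload(
        "Find diagrams about autonomous robots",
        [
            ("data:image/png;base64,<BASE64_1>", "Robotics workflow diagram."),
            ("data:image/png;base64,<BASE64_2>", "General skyline photo."),
        ],
        top_n=2,
    )
    req = urllib.request.Request(
        "http://localhost:8000/rerank",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```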
### BAAI/bge-m3
The `BAAI/bge-m3` model comes with extra weights for sparse and colbert embeddings but unfortunately in its `config.json`