Support bge-m3 sparse embeddings and colbert embeddings (#14526)

Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
2026-01-22 12:52:57 -03:00
parent 444e2e7e1f
commit ff365eea94
9 changed files with 393 additions and 19 deletions
--- a/docs/models/pooling_models.md
+++ b/docs/models/pooling_models.md
@@ -305,6 +305,44 @@ Expected output:

 An OpenAI client example can be found here: [examples/pooling/embed/openai_embedding_matryoshka_fy_client.py](../../examples/pooling/embed/openai_embedding_matryoshka_fy_client.py)

+## Specific models
+
+### BAAI/bge-m3
+
+The `BAAI/bge-m3` model comes with extra weights for sparse and colbert embeddings but unfortunately in its `config.json`
+the architecture is declared as `XLMRobertaModel`, which makes `vLLM` load it as a vanilla ROBERTA model without the
+extra weights. To load the full model weights, override its architecture like this:
+
+```shell
+vllm serve BAAI/bge-m3 --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
+```
+
+Then you obtain the sparse embeddings like this:
+
+```shell
+curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
+     "model": "BAAI/bge-m3",
+     "task": "token_classify",
+     "input": ["What is BGE M3?", "Defination of BM25"]
+}'
+```
+
+Due to limitations in the the output schema, the output consists of a list of
+token scores for each token for each input. This means that you'll have to call
+`/tokenize` as well to be able to pair tokens with scores.
+Refer to the tests in  `tests/models/language/pooling/test_bge_m3.py` to see how
+to do that.
+
+You can obtain the colbert embeddings like this:
+
+```shell
+curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
+     "model": "BAAI/bge-m3",
+     "task": "token_embed",
+     "input": ["What is BGE M3?", "Defination of BM25"]
+}'
+```
+
 ## Deprecated Features

 ### Encode task