Replace "online inference" with "online serving" (#11923)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
Harry Mellor
2025-01-10 12:05:56 +00:00
committed by GitHub
parent ef725feafc
commit d85c47d6ad
11 changed files with 16 additions and 16 deletions

View File

@@ -5,7 +5,7 @@
vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines), [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer), or [xgrammar](https://github.com/mlc-ai/xgrammar) as backends for the guided decoding.
This document shows you some examples of the different options that are available to generate structured outputs.
## Online Inference (OpenAI API)
## Online Serving (OpenAI API)
You can generate structured outputs using the OpenAI's [Completions](https://platform.openai.com/docs/api-reference/completions) and [Chat](https://platform.openai.com/docs/api-reference/chat) API.
@@ -239,7 +239,7 @@ The main available options inside `GuidedDecodingParams` are:
- `backend`
- `whitespace_pattern`
These parameters can be used in the same way as the parameters from the Online Inference examples above.
These parameters can be used in the same way as the parameters from the Online Serving examples above.
One example for the usage of the `choices` parameter is shown below:
```python

View File

@@ -83,7 +83,7 @@ $ python setup.py develop
## Supported Features
- [Offline inference](#offline-inference)
- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
- Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -385,5 +385,5 @@ the below:
completely. With HPU Graphs disabled, you are trading latency and
throughput at lower batches for potentially higher throughput on
higher batches. You can do that by adding `--enforce-eager` flag to
server (for online inference), or by passing `enforce_eager=True`
server (for online serving), or by passing `enforce_eager=True`
argument to LLM constructor (for offline inference).

View File

@@ -5,7 +5,7 @@
This guide will help you quickly get started with vLLM to perform:
- [Offline batched inference](#quickstart-offline)
- [Online inference using OpenAI-compatible server](#quickstart-online)
- [Online serving using OpenAI-compatible server](#quickstart-online)
## Prerequisites

View File

@@ -118,7 +118,7 @@ print("Loaded chat template:", custom_template)
outputs = llm.chat(conversation, chat_template=custom_template)
```
## Online Inference
## Online Serving
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:

View File

@@ -127,7 +127,7 @@ print(f"Score: {score}")
A code example can be found here: <gh-file:examples/offline_inference/offline_inference_scoring.py>
## Online Inference
## Online Serving
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:

View File

@@ -552,7 +552,7 @@ See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the mod
````{important}
To enable multiple multi-modal items per text prompt, you have to set `limit_mm_per_prompt` (offline inference)
or `--limit-mm-per-prompt` (online inference). For example, to enable passing up to 4 images per text prompt:
or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
Offline inference:
```python
@@ -562,7 +562,7 @@ llm = LLM(
)
```
Online inference:
Online serving:
```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt image=4
```

View File

@@ -199,7 +199,7 @@ for o in outputs:
print(generated_text)
```
## Online Inference
## Online Serving
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).