Replace "online inference" with "online serving" (#11923)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Author: Harry Mellor
Date: 2025-01-10 12:05:56 +00:00
Committed by: GitHub
Parent: ef725feafc
Commit: d85c47d6ad
11 changed files with 16 additions and 16 deletions


@@ -83,7 +83,7 @@ $ python setup.py develop
 ## Supported Features
 - [Offline inference](#offline-inference)
-- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
+- Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
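
For readers of the hunk above, a brief client-side sketch of what "online serving via the OpenAI-Compatible Server" looks like in practice. This assumes a vLLM server is already running locally; the model name, host, and port are placeholders for illustration, not anything specified by this commit.

```python
# Query a running vLLM OpenAI-compatible server with the official
# openai client. Assumes a server started with something like
# `vllm serve facebook/opt-125m`; model name and URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="facebook/opt-125m",
    prompt="Hello, my name is",
    max_tokens=32,
)
print(completion.choices[0].text)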
@@ -385,5 +385,5 @@ the below:
 completely. With HPU Graphs disabled, you are trading latency and
 throughput at lower batches for potentially higher throughput on
 higher batches. You can do that by adding `--enforce-eager` flag to
-server (for online inference), or by passing `enforce_eager=True`
+server (for online serving), or by passing `enforce_eager=True`
 argument to LLM constructor (for offline inference).
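
The hunk above distinguishes the two ways eager mode is enabled: `--enforce-eager` on the server for online serving, and `enforce_eager=True` on the `LLM` constructor for offline inference. A minimal hedged sketch of the offline path follows; the model name is a placeholder, not part of this commit.

```python
# Offline inference with eager mode forced via the LLM constructor,
# which disables graph capture. Model name below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)

# Online serving equivalent (shell, not Python):
#   vllm serve facebook/opt-125m --enforce-eager
```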