Merge similar examples in offline_inference into single basic example (#12737)
@@ -147,7 +147,7 @@ class Example:
             return content

         content += "## Example materials\n\n"
-        for file in self.other_files:
+        for file in sorted(self.other_files):
             include = "include" if file.suffix == ".md" else "literalinclude"
             content += f":::{{admonition}} {file.relative_to(self.path)}\n"
             content += ":class: dropdown\n\n"
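Note on the hunk above: wrapping the collection in `sorted(...)` makes the order of the generated dropdowns deterministic. A minimal sketch of the effect, assuming `other_files` holds `pathlib` path objects (suggested by the `.suffix` and `.relative_to` calls); the file names and the `directive_for` helper are hypothetical and used only for illustration:

```python
from pathlib import PurePosixPath

# Hypothetical stand-ins for Example.other_files; the real contents are not in the diff.
other_files = [
    PurePosixPath("basic/embed.py"),
    PurePosixPath("basic/README.md"),
    PurePosixPath("basic/chat.py"),
]


def directive_for(file):
    # Same rule as the diff: Markdown is included, everything else is literal-included.
    return "include" if file.suffix == ".md" else "literalinclude"


# sorted() orders paths component by component, so the output no longer
# depends on the order in which the files were discovered.
ordered = [(str(f), directive_for(f)) for f in sorted(other_files)]
for name, directive in ordered:
    print(f"{name}: {directive}")
```

On POSIX-style paths the comparison is case-sensitive, so `README.md` sorts ahead of the lowercase script names.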
@@ -194,7 +194,7 @@ def generate_examples():
             path=EXAMPLE_DOC_DIR / "examples_offline_inference_index.md",
             title="Offline Inference",
             description=
-            "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches.", # noqa: E501
+            "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches. We recommend starting with <project:basic.md>.", # noqa: E501
             caption="Examples",
         ),
     }
@@ -170,7 +170,7 @@ vLLM CPU backend supports the following vLLM features:
 sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
 find / -name *libtcmalloc* # find the dynamic link library path
 export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
-python examples/offline_inference/basic.py # run vLLM
+python examples/offline_inference/basic/basic.py # run vLLM
 ```

 - When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:
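Note on the bullet above: the core split it describes can be computed rather than hard-coded. A minimal sketch, assuming cores are numbered contiguously from 0 and the highest-numbered cores are the ones reserved for the serving framework; `split_cores` is a hypothetical helper, not part of vLLM:

```python
def split_cores(total_cores, reserved):
    """Return (openmp_range, framework_range) as 'first-last' strings."""
    if not 0 < reserved < total_cores:
        raise ValueError("reserved must be between 1 and total_cores - 1")

    def as_range(first, last):
        # A single core is written as "N", several as "first-last".
        return str(first) if first == last else f"{first}-{last}"

    last_omp = total_cores - reserved - 1
    return as_range(0, last_omp), as_range(last_omp + 1, total_cores - 1)


# The 32-core example from the text: 2 cores reserved for the framework.
omp_range, framework_range = split_cores(total_cores=32, reserved=2)
print(f"export VLLM_CPU_OMP_THREADS_BIND={omp_range}")
print(f"cores left for the serving framework: {framework_range}")
```

The sketch only builds the range strings; whether a contiguous split suits a given machine depends on its topology, which is worth checking first.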
@@ -207,7 +207,7 @@ CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ MHZ

 # On this platform, it is recommend to only bind openMP threads on logical CPU cores 0-7 or 8-15
 $ export VLLM_CPU_OMP_THREADS_BIND=0-7
-$ python examples/offline_inference/basic.py
+$ python examples/offline_inference/basic/basic.py
 ```

 - If using vLLM CPU backend on a multi-socket machine with NUMA, be aware to set CPU cores using `VLLM_CPU_OMP_THREADS_BIND` to avoid cross NUMA node memory access.
@@ -40,7 +40,7 @@ For non-CUDA platforms, please refer [here](#installation-index) for specific in

 ## Offline Batched Inference

-With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic.py>
+With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script: <gh-file:examples/offline_inference/basic/basic.py>

 The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:

@@ -46,7 +46,7 @@ for output in outputs:
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```

-A code example can be found here: <gh-file:examples/offline_inference/basic.py>
+A code example can be found here: <gh-file:examples/offline_inference/basic/basic.py>

 ### `LLM.beam_search`

@@ -103,7 +103,7 @@ for output in outputs:
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```

-A code example can be found here: <gh-file:examples/offline_inference/chat.py>
+A code example can be found here: <gh-file:examples/offline_inference/basic/chat.py>

 If the model doesn't have a chat template or you want to specify another one,
 you can explicitly pass a chat template:
@@ -88,7 +88,7 @@ embeds = output.outputs.embedding
 print(f"Embeddings: {embeds!r} (size={len(embeds)})")
 ```

-A code example can be found here: <gh-file:examples/offline_inference/embedding.py>
+A code example can be found here: <gh-file:examples/offline_inference/basic/embed.py>

 ### `LLM.classify`

@@ -103,7 +103,7 @@ probs = output.outputs.probs
 print(f"Class Probabilities: {probs!r} (size={len(probs)})")
 ```

-A code example can be found here: <gh-file:examples/offline_inference/classification.py>
+A code example can be found here: <gh-file:examples/offline_inference/basic/classify.py>

 ### `LLM.score`

@@ -125,7 +125,7 @@ score = output.outputs.score
 print(f"Score: {score}")
 ```

-A code example can be found here: <gh-file:examples/offline_inference/scoring.py>
+A code example can be found here: <gh-file:examples/offline_inference/basic/score.py>

 ## Online Serving
