[Doc] Move examples into categories (#11840)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Authored by: Harry Mellor on 2025-01-08 13:09:53 +00:00
Committed by: GitHub
Parent: 2a0596bc48
Commit: aba8d6ee00
116 changed files with 153 additions and 124 deletions


@@ -21,7 +21,7 @@ Disaggregated prefill DOES NOT improve throughput.
## Usage example
-Please refer to `examples/disaggregated_prefill.sh` for the example usage of disaggregated prefilling.
+Please refer to `examples/online_serving/disaggregated_prefill.sh` for the example usage of disaggregated prefilling.
## Benchmarks
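
For context, the referenced script wires together two vLLM instances: a prefill ("producer") instance that fills the KV cache and a decode ("consumer") instance that receives it. A minimal offline sketch of the producer side, assuming the `KVTransferConfig` fields used by the companion Python example (the model name and JSON fields are illustrative, not taken from this commit):

```python
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Prefill ("producer") instance; a second instance configured with
# kv_role="kv_consumer" performs the decode phase and receives the KV cache.
ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
    '"kv_rank":0,"kv_parallel_size":2}')
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", kv_transfer_config=ktc)

# max_tokens=1: the producer only needs to run the prefill pass.
llm.generate(["San Francisco is a"], SamplingParams(temperature=0, max_tokens=1))
```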


@@ -47,7 +47,7 @@ outputs = llm.generate(
)
```
-Check out <gh-file:examples/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
+Check out <gh-file:examples/offline_inference/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
## Serving LoRA Adapters
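
As a rough illustration of what that example covers, here is a minimal sketch of LoRA with the async engine (the model name, adapter id, and adapter path are placeholders, not taken from this commit):

```python
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams
from vllm.lora.request import LoRARequest


async def main():
    # LoRA support must be enabled when the engine is built.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="meta-llama/Llama-2-7b-hf", enable_lora=True))
    # Attach an adapter per request; the integer is a unique adapter id.
    lora = LoRARequest("sql_adapter", 1, "yard1/llama-2-7b-sql-lora-test")
    final = None
    async for output in engine.generate(
            "Write a SQL query listing all users.",
            SamplingParams(temperature=0, max_tokens=64),
            request_id="req-0",
            lora_request=lora):
        final = output  # results stream in; keep the latest snapshot
    print(final.outputs[0].text)


asyncio.run(main())
```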


@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console
-$ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
+$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
```
AWQ models are also supported directly through the LLM entrypoint:
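
The hunk ends just before that entrypoint snippet; it looks roughly like this (the prompt and sampling values are illustrative):

```python
from vllm import LLM, SamplingParams

# quantization="awq" tells vLLM to load the pre-quantized AWQ checkpoint.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)
```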


@@ -28,7 +28,7 @@ Here is an example of how to enable this feature:
```python
# two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv, please refer to
-# https://github.com/vllm-project/vllm/blob/main/examples/fp8/README.md to generate kv_cache_scales.json of your own.
+# https://github.com/vllm-project/vllm/blob/main/examples/other/fp8/README.md to generate kv_cache_scales.json of your own.
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
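# (The hunk cuts the documented example off here. A hedged sketch of how it
# continues, assuming the engine still accepts quantization_param_path; the
# model name and scales path are illustrative.)
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf",
          kv_cache_dtype="fp8",
          quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json")
print(llm.generate("London is the capital of", sampling_params)[0].outputs[0].text)
```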


@@ -131,7 +131,7 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
-Full example: <gh-file:examples/openai_chat_completion_structured_outputs.py>
+Full example: <gh-file:examples/online_serving/openai_chat_completion_structured_outputs.py>
## Experimental Automatic Parsing (OpenAI API)
@@ -257,4 +257,4 @@ outputs = llm.generate(
print(outputs[0].outputs[0].text)
```
-Full example: <gh-file:examples/offline_inference_structured_outputs.py>
+Full example: <gh-file:examples/offline_inference/offline_inference_structured_outputs.py>
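
For reference, a minimal sketch of the offline structured-outputs flow that this hunk truncates, using `GuidedDecodingParams` (the model and choice values are illustrative):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
# Constrain generation to exactly one of the listed strings.
guided = GuidedDecodingParams(choice=["Positive", "Negative"])
outputs = llm.generate(
    "Classify this sentiment: vLLM is wonderful!",
    SamplingParams(guided_decoding=guided),
)
print(outputs[0].outputs[0].text)
```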