[Doc] Move examples into categories (#11840)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -21,7 +21,7 @@ Disaggregated prefill DOES NOT improve throughput.
 
 ## Usage example
 
-Please refer to `examples/disaggregated_prefill.sh` for the example usage of disaggregated prefilling.
+Please refer to `examples/online_serving/disaggregated_prefill.sh` for the example usage of disaggregated prefilling.
 
 ## Benchmarks
 
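As rough orientation for the relocated script, the offline flavor of disaggregated prefilling wires a KV-transfer config into the engine. This is a hedged, assumption-laden sketch, not the script itself: the connector name, role/rank fields, and model are placeholders, and the authoritative two-process setup is `examples/online_serving/disaggregated_prefill.sh`.

```python
# Hypothetical single-role sketch; the real example launches a prefill
# ("kv_producer") and a decode ("kv_consumer") instance and connects them.
# All configuration values below are assumptions.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

ktc = KVTransferConfig.from_cli(
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer",'
    '"kv_rank":0,"kv_parallel_size":2}'
)
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed model
    kv_transfer_config=ktc,
    gpu_memory_utilization=0.8,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=1))
```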
@@ -47,7 +47,7 @@ outputs = llm.generate(
 )
 ```
 
-Check out <gh-file:examples/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
+Check out <gh-file:examples/offline_inference/multilora_inference.py> for an example of how to use LoRA adapters with the async engine and how to use more advanced configuration options.
 
 ## Serving LoRA Adapters
 
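For readers who just want the core pattern behind the pointer above, here is a minimal synchronous sketch of attaching a LoRA adapter via `LoRARequest`; the relocated `multilora_inference.py` covers the async engine and the more advanced options. The base model and adapter repo used here are assumptions.

```python
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora=True reserves capacity for adapter weights at engine start-up.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# Assumed adapter; any locally available LoRA checkpoint works the same way.
sql_lora_path = snapshot_download(repo_id="yard1/llama-2-7b-sql-lora-test")

outputs = llm.generate(
    ["Write a SQL query listing all users older than 30."],
    SamplingParams(temperature=0, max_tokens=64),
    # (name, integer id, path) identify the adapter to apply for this request.
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path),
)
print(outputs[0].outputs[0].text)
```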
@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
 To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
 
 ```console
-$ python examples/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
+$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
 ```
 
 AWQ models are also supported directly through the LLM entrypoint:
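The hunk above cuts off just before the doc's own snippet for the LLM entrypoint. A minimal sketch of that idea, reusing the model named in the hunk (prompt and sampling settings are illustrative), looks like this:

```python
from vllm import LLM, SamplingParams

# quantization="awq" tells vLLM to load the AWQ-quantized weights and kernels.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

outputs = llm.generate(
    ["What is AWQ quantization?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```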
@@ -28,7 +28,7 @@ Here is an example of how to enable this feature:
 
 ```python
 # two float8_e4m3fn kv cache scaling factor files are provided under tests/fp8_kv, please refer to
-# https://github.com/vllm-project/vllm/blob/main/examples/fp8/README.md to generate kv_cache_scales.json of your own.
+# https://github.com/vllm-project/vllm/blob/main/examples/other/fp8/README.md to generate kv_cache_scales.json of your own.
 
 from vllm import LLM, SamplingParams
 sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
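The doc's snippet continues past what the hunk shows. As a hedged completion of the idea, enabling an FP8 KV cache with pre-computed scaling factors looks roughly like the following; the model and the scales-file path are assumptions, and `quantization_param_path` is assumed to be the relevant engine argument here.

```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=1.3, top_p=0.8)
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # assumed model
    kv_cache_dtype="fp8",                   # store the KV cache in FP8
    # kv_cache_scales.json generated per examples/other/fp8/README.md (assumed path)
    quantization_param_path="./tests/fp8_kv/llama2-7b-fp8-kv/kv_cache_scales.json",
)
out = llm.generate("San Francisco is a", sampling_params)
print(out[0].outputs[0].text)
```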
@@ -131,7 +131,7 @@ completion = client.chat.completions.create(
 print(completion.choices[0].message.content)
 ```
 
-Full example: <gh-file:examples/openai_chat_completion_structured_outputs.py>
+Full example: <gh-file:examples/online_serving/openai_chat_completion_structured_outputs.py>
 
 ## Experimental Automatic Parsing (OpenAI API)
 
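For the server-side structured outputs the relocated example covers, the core pattern is an ordinary OpenAI client call with vLLM's guided-decoding knobs passed through `extra_body`. A small sketch follows; the server URL and served model name are assumptions.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",  # assumed served model
    messages=[
        {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
    ],
    # vLLM-specific structured-output option: constrains the reply to one choice.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```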
@@ -257,4 +257,4 @@ outputs = llm.generate(
 print(outputs[0].outputs[0].text)
 ```
 
-Full example: <gh-file:examples/offline_inference_structured_outputs.py>
+Full example: <gh-file:examples/offline_inference/offline_inference_structured_outputs.py>
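The offline counterpart expresses the same constraint through `GuidedDecodingParams` inside `SamplingParams`. A minimal sketch of that pattern (model and prompt are illustrative):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct", max_model_len=1024)  # assumed model

# Constrain generation to one of the listed choices.
guided = GuidedDecodingParams(choice=["Positive", "Negative"])
outputs = llm.generate(
    ["Classify this sentiment: vLLM is wonderful!"],
    SamplingParams(guided_decoding=guided),
)
print(outputs[0].outputs[0].text)
```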