[docs][torch.compile] Add fusions.md — kernel/operator fusion reference page (#35538)
Signed-off-by: ProExpertProg <luka.govedic@gmail.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com> Co-authored-by: ProExpertProg <luka.govedic@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
This commit is contained in:
@@ -5,6 +5,17 @@ This guide covers optimization strategies and performance tuning for vLLM V1.
|
||||
!!! tip
|
||||
Running out of memory? Consult [this guide](./conserving_memory.md) on how to conserve memory.
|
||||
|
||||
## Optimization Levels
|
||||
|
||||
vLLM provides 4 optimization levels (`-O0`, `-O1`, `-O2`, `-O3`) that allow users to trade off startup time for performance:
|
||||
|
||||
- `-O0`: No optimizations. Fastest startup time, but lowest performance.
|
||||
- `-O1`: Fast optimization. Simple compilation and fast fusions, and PIECEWISE cudagraphs.
|
||||
- `-O2`: Default optimization. Additional compilation ranges, additional fusions, FULL_AND_PIECEWISE cudagraphs.
|
||||
- `-O3`: Aggressive optimization. Currently equal to `-O2`, but may include additional time-consuming or experimental optimizations in the future.
|
||||
|
||||
For more information, see the [optimization level documentation](../design/optimization_levels.md).
|
||||
|
||||
## Preemption
|
||||
|
||||
Due to the autoregressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
|
||||
|
||||
Reference in New Issue
Block a user