# Optimization Levels

## Overview

vLLM provides 4 optimization levels (`-O0`, `-O1`, `-O2`, `-O3`) that allow users to trade off startup time for performance:

- `-O0`: No optimization. Fastest startup time, but lowest performance.
- `-O1`: Fast optimization. Simple compilation, fast fusions, and PIECEWISE cudagraphs.
- `-O2`: Default optimization. Additional compilation ranges, additional fusions, FULL_AND_PIECEWISE cudagraphs.
- `-O3`: Aggressive optimization. Currently equal to `-O2`, but may include additional time-consuming or experimental optimizations in the future.

All optimization level defaults can be achieved by manually setting the underlying flags. User-set flags take precedence over optimization level defaults.

## Level Summaries and Usage Examples

```bash
# CLI usage
python -m vllm.entrypoints.api_server --model RedHatAI/Llama-3.2-1B-FP8 -O1
```

```python
# Python API usage
from vllm.entrypoints.llm import LLM

llm = LLM(
    model="RedHatAI/Llama-3.2-1B-FP8",
    optimization_level=2,  # equivalent to -O2
)
```

### `-O0`: No Optimization

Start up as fast as possible: no autotuning, no compilation, and no cudagraphs. This level is good for initial phases of development and debugging.

Settings:

- `-cc.cudagraph_mode=NONE`
- `-cc.mode=NONE` (also resulting in `-cc.custom_ops=["none"]`)
- `-cc.pass_config.fuse_...=False` (all fusions disabled)
- `--kernel-config.enable_flashinfer_autotune=False`

### `-O1`: Fast Optimization

Prioritize fast startup, but still enable basic optimizations like compilation and cudagraphs. This level is a good balance for most development scenarios where you want faster startup but still want to make sure your code does not break cudagraphs or compilation.
Settings:

- `-cc.cudagraph_mode=PIECEWISE`
- `-cc.mode=VLLM_COMPILE`
- `--kernel-config.enable_flashinfer_autotune=True`

Fusions:

- `-cc.pass_config.fuse_norm_quant=True`*
- `-cc.pass_config.fuse_act_quant=True`*
- `-cc.pass_config.fuse_act_padding=True`†
- `-cc.pass_config.fuse_rope_kvcache=True`† (will be moved to O2)

\* These fusions are only enabled when either op is using a custom kernel; otherwise, Inductor fusion is better.
† These fusions are ROCm-only and require AITER.

### `-O2`: Full Optimization (Default)

Prioritize performance at the expense of additional startup time. This level is recommended for production workloads and is therefore the default. Fusions in this level _may_ take longer due to additional compile ranges.

Settings (on top of `-O1`):

- `-cc.cudagraph_mode=FULL_AND_PIECEWISE`
- `-cc.pass_config.fuse_allreduce_rms=True`

### `-O3`: Aggressive Optimization

This level is currently the same as `-O2`, but may include additional optimizations in the future that are more time-consuming or experimental.

## Troubleshooting

### Common Issues

1. **Startup Time Too Long**: Use `-O0` or `-O1` for faster startup.
2. **Compilation Errors**: Use `debug_dump_path` for additional debugging information.
3. **Performance Issues**: Ensure you are using `-O2` for production.
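
### Sketch: How Level Defaults and User Flags Combine

The precedence rule from the Overview (user-set flags take precedence over optimization level defaults) can be illustrated in plain Python. This is a simplified model and not vLLM's implementation: `LEVEL_DEFAULTS` and `resolve_settings` are hypothetical names, the dictionary keys are shortened forms of the flags listed in the level summaries, and the conditional and ROCm-only fusions are left out for brevity.

```python
from typing import Any, Optional

# Defaults transcribed from the "Settings" lists in this document.
# Illustrative only: these dictionaries are NOT vLLM's internal config objects.
_O0 = {
    "cudagraph_mode": "NONE",
    "mode": "NONE",
    "enable_flashinfer_autotune": False,
}
_O1 = {
    "cudagraph_mode": "PIECEWISE",
    "mode": "VLLM_COMPILE",
    "enable_flashinfer_autotune": True,
}
# -O2 is described as a set of changes on top of -O1.
_O2 = {
    **_O1,
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "fuse_allreduce_rms": True,
}
# -O3 is currently the same as -O2.
_O3 = dict(_O2)

LEVEL_DEFAULTS = {0: _O0, 1: _O1, 2: _O2, 3: _O3}


def resolve_settings(
    level: int, user_flags: Optional[dict[str, Any]] = None
) -> dict[str, Any]:
    """Return the level defaults with user-set flags applied on top."""
    if level not in LEVEL_DEFAULTS:
        raise ValueError(f"unknown optimization level: {level}")
    resolved = dict(LEVEL_DEFAULTS[level])  # copy so the defaults stay untouched
    resolved.update(user_flags or {})       # user-set flags win over defaults
    return resolved


if __name__ == "__main__":
    # -O2 defaults, but with cudagraphs switched off by the user.
    print(resolve_settings(2, {"cudagraph_mode": "NONE"}))
```

In this model, picking a level only chooses a starting point. Any flag the user sets explicitly replaces the corresponding default, and all other defaults for that level are kept.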