Reduce kernel overhead when the number of active LoRAs is smaller than max LoRAs. Multiple CUDA graphs are captured, one per active-LoRA count. (#32005)

Signed-off-by: Yu Gong <yu3.gong@gmail.com>
This commit is contained in:
yugong333
2026-02-02 09:30:06 -08:00
committed by GitHub
parent 8b7346d5f1
commit ffe1fc7a28
15 changed files with 323 additions and 66 deletions


@@ -60,6 +60,13 @@ class LoRAConfig:
of multimodal models will be enabled. This is an experimental feature and
currently only supports some MM models such as the Qwen VL series. The default
is False."""
specialize_active_lora: bool = False
"""Whether to construct the LoRA kernel grid based on the number of active
LoRA adapters. When set to True, separate CUDA graphs are captured for
different counts of active LoRAs (powers of 2 up to max_loras), which can
improve performance for variable LoRA usage patterns at the cost of increased
startup time and memory usage. Only takes effect when
cudagraph_specialize_lora is True.
"""
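To illustrate the capture strategy described in the docstring, here is a minimal, hypothetical sketch of how active-LoRA counts could be bucketed into power-of-2 specialization sizes up to `max_loras`. The function names and the handling of a non-power-of-2 `max_loras` are illustrative assumptions, not vLLM's actual implementation.

```python
def lora_capture_sizes(max_loras: int) -> list[int]:
    # Powers of 2 below max_loras, plus max_loras itself as the final bucket.
    # (Assumption: max_loras is always included, even if not a power of 2.)
    sizes = []
    n = 1
    while n < max_loras:
        sizes.append(n)
        n *= 2
    sizes.append(max_loras)
    return sizes


def specialized_bucket(num_active: int, max_loras: int) -> int:
    # Pick the smallest captured graph size that covers the active-LoRA count,
    # so a request batch with few active LoRAs replays a smaller, cheaper graph.
    for size in lora_capture_sizes(max_loras):
        if size >= num_active:
            return size
    return max_loras


print(lora_capture_sizes(8))     # [1, 2, 4, 8]
print(specialized_bucket(3, 8))  # 4
```

With `max_loras=8`, four graphs would be captured (for 1, 2, 4, and 8 active LoRAs), and a step with 3 active adapters would dispatch to the size-4 graph instead of always paying for the max-LoRA kernel grid.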
def compute_hash(self) -> str:
"""