[Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)

Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Author: Varun Sundar Rabindranath
Date: 2025-03-18 05:47:53 -04:00 (committed by GitHub)
Parent: d1695758b2
Commit: 400d483e87
15 changed files with 245 additions and 2092 deletions


@@ -62,9 +62,10 @@ class LoRAModelRunnerMixin:
         if not self.lora_manager:
             raise RuntimeError("LoRA is not enabled.")
-        # Set is_prefill to True, so we always use the SGMV kernels.
-        # For cuda platforms, we have specialized triton kernels, and
-        # the cuda path ignores `is_prefill`.
+        # Set is_prefill to True, so we always use the SGMV kernels on
+        # non-cuda platforms.
+        # On cuda platforms we use the same kernels for prefill and
+        # decode and this flag is generally ignored.
         lora_mapping = LoRAMapping(token_lora_mapping,
                                    prompt_lora_mapping,
                                    is_prefill=True)
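
The comment change above encodes the dispatch semantics after this commit: `is_prefill` still selects between SGMV (prefill) and BGMV (decode) paths on non-CUDA platforms, while the CUDA path uses one unified set of triton kernels and ignores the flag. A minimal sketch of that dispatch logic, with illustrative names only (`select_kernel` and this simplified `LoRAMapping` are not vLLM's actual API):

```python
from dataclasses import dataclass


@dataclass
class LoRAMapping:
    # Per-token and per-prompt LoRA index mappings (simplified).
    token_lora_mapping: tuple
    prompt_lora_mapping: tuple
    is_prefill: bool = False


def select_kernel(mapping: LoRAMapping, platform: str) -> str:
    """Hypothetical kernel selection mirroring the comment above."""
    if platform == "cuda":
        # CUDA: the same triton kernels serve prefill and decode,
        # so `is_prefill` is ignored.
        return "triton_unified"
    # Non-CUDA platforms: the flag picks SGMV (prefill) or BGMV (decode).
    return "sgmv" if mapping.is_prefill else "bgmv"


mapping = LoRAMapping((0, 0, 1), (0, 1), is_prefill=True)
print(select_kernel(mapping, "cuda"))  # triton_unified
print(select_kernel(mapping, "cpu"))   # sgmv
```

Setting `is_prefill=True` unconditionally, as the diff does, is therefore harmless on CUDA and keeps the SGMV path active on platforms that still distinguish the two phases.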