[torch.compile] Add an option to force-enable the MOE cold start optimization (#33735)
Signed-off-by: Richard Zou <zou3519@gmail.com>
@@ -593,7 +593,7 @@ class CompilationConfig:
     local_cache_dir: str = field(default=None, init=False)  # type: ignore
     """local cache dir for each rank"""
 
-    fast_moe_cold_start = True
+    fast_moe_cold_start: bool | None = None
     """Optimization for fast MOE cold start.
 
     This is a bit of a hack that assumes that:
@@ -604,8 +604,14 @@ class CompilationConfig:
     When the above two conditions hold, this option greatly decreases cold start
     time for MOE models.
 
-    If the above two conditions don't hold, then this option will lead to silent
-    incorrectness. The only condition in which this doesn't hold is speculative
+    The options are:
+    - True: optimization is always on
+    - False: optimization is always off
+    - None: optimization is on usually but off for speculative decoding
+
+    If conditions 1&2 don't hold then this option will lead to silent
+    incorrectness.
+    The only condition in which this doesn't hold is speculative
     decoding, where there is a draft model that may have MOEs in them.
 
     NB: We're working on a longer-term solution that doesn't need these assumptions.