[Perf] Disable inductor runtime asserts by default for serving perfor… (#37485)

Signed-off-by: tianrengao <terrygao87@gmail.com>
Co-authored-by: Tianren Gao <tianren@fb.com>
Author: Terry Gao
Date: 2026-03-24 16:37:51 -07:00 (committed via GitHub)
parent a0d487b2e1
commit 82580b10ac
3 changed files with 95 additions and 0 deletions

@@ -233,6 +233,26 @@ that may call 1+ triton kernels. On rare (but unfortunate) occasions, it may
produce an incorrect triton kernel. This may manifest as silent incorrectness,
CUDA illegal memory accesses, or loud errors.
### Inductor runtime assertions
By default (on torch < 2.12), vLLM disables Inductor's runtime assertions
(`assert_size_stride`, `assert_alignment`) to avoid ~2ms overhead per forward
pass on large models. Setting `VLLM_LOGGING_LEVEL=DEBUG` automatically
re-enables them so debugging sessions get full shape/stride validation:
```sh
VLLM_LOGGING_LEVEL=DEBUG vllm serve <model>
```
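Conceptually, each suppressed assertion is a cheap but per-call guard that the compiled code emits for every tensor argument. The sketch below illustrates the kind of check `assert_size_stride` performs; `FakeTensor` and this implementation are illustrative stand-ins, not PyTorch's actual guard machinery:

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    # Minimal stand-in for a torch.Tensor (illustrative only).
    shape: tuple
    strides: tuple

def assert_size_stride(t, expected_shape, expected_strides):
    # Sketch of the per-call guard Inductor emits: verify that the
    # runtime tensor still has the shape/stride layout the kernel
    # was compiled for. The real check lives in PyTorch's C++ guards.
    assert t.shape == expected_shape, f"size mismatch: {t.shape}"
    assert t.strides == expected_strides, f"stride mismatch: {t.strides}"

x = FakeTensor(shape=(4, 8), strides=(8, 1))  # contiguous row-major layout
assert_size_stride(x, (4, 8), (8, 1))         # passes silently
```

Individually these checks are trivial, but a large model can emit thousands of them per forward pass, which is where the ~2ms overhead comes from.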
You can also override them explicitly via `--compilation-config` (`-cc`):
```sh
vllm serve <model> -cc.inductor_compile_config='{"size_asserts": true, "alignment_asserts": true, "scalar_asserts": true}'
```
On torch >= 2.12, PyTorch uses an efficient assert-once strategy, so vLLM no
longer suppresses these flags.
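The assert-once idea is that a guard only needs to fire the first time a compiled region runs; later calls skip the check body entirely. A minimal sketch in plain Python (the names here are illustrative, not PyTorch's actual API):

```python
def make_assert_once(check):
    """Wrap a runtime check so its body executes only on the first call.

    Conceptual sketch of the assert-once strategy; illustrative only.
    """
    done = False

    def guard(*args, **kwargs):
        nonlocal done
        if not done:
            check(*args, **kwargs)  # full validation, first call only
            done = True

    return guard

calls = []
check_shape = make_assert_once(lambda shape: calls.append(shape))
for _ in range(1000):
    check_shape((8, 128))  # validation body runs exactly once

print(len(calls))  # -> 1
```

Because the steady-state cost is a single boolean test per guard, the assertions no longer need to be disabled for serving performance.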
To check whether TorchInductor itself is at fault, you can disable it by passing
`backend='eager'` to the compilation config: