Commit Graph

  • b2f78cbad4 [small][batch invariance] Rename the env and internal flags to simplify usage (#26855) Bram Wasti 2025-10-16 14:40:25 -07:00
  • 23583ee28c [Bug] Add Assertion for random-input-len / random-output-len (#26834) Wentao Ye 2025-10-16 17:36:39 -04:00
  • 01c977e96d [CI] Prune Quantization Tests and skip compilation (#27038) Michael Goin 2025-10-16 17:26:35 -04:00
  • b3dda72c23 [Feature] Migrate DeepGEMM API from get_m_alignment_for_contiguous_layout to get_mk_alignment_for_contiguous_layout (#26935) Wentao Ye 2025-10-16 16:46:48 -04:00
  • fb0571b077 [GPTOSS][DP/EP][Marlin] Enable GPTOSS Batched DP/EP using Marlin kernels (#25997) Varun Sundar Rabindranath 2025-10-16 15:53:11 -04:00
  • 2ed8b6b3d0 [Bug] Fix batch invariant test has to is (#27032) Wentao Ye 2025-10-16 15:45:14 -04:00
  • 013abde6ef Adding Warmup to Benchmark Serving (#26943) kimbochen 2025-10-16 15:44:32 -04:00
  • a5464dcf92 [Compressed Tensors] Always clone output for compile robustness (#26849) Kyle Sayers 2025-10-16 15:29:59 -04:00
  • ac3ed5a815 Support block size of 256 used by Intel HPU (#26883) Mandy Li 2025-10-16 12:10:57 -07:00
  • e6ba2000ae [gpt-oss][1/N] EZ: refactor serving_responses for modularity (#26948) Andrew Xia 2025-10-16 11:44:06 -07:00
  • aa255ff55a Support set in the CLI generation (#27031) Harry Mellor 2025-10-16 19:07:18 +01:00
  • 7bb736d00e Fix Qwen2.5 VL image grid docstring (#27033) ZiTian Zhao 2025-10-17 00:57:36 +08:00
  • 9f4e30904b [Model] Fix Qwen3VL mm mapping (#27027) Jee Jee Li 2025-10-17 00:45:59 +08:00
  • 5afd3276df [Feature] Add process_weights_after_loading to AttentionImpl (#26870) rongfu.leng 2025-10-16 23:02:30 +08:00
  • 43721bc67f [CI] Replace large models with tiny alternatives in tests (#24057) Tahsin Tunan 2025-10-16 20:51:27 +06:00
  • 02d709a6f1 [docs] standardize Hugging Face env var to HF_TOKEN (deprecates HUGGING_FACE_HUB_TOKEN) (#27020) Kay Yan 2025-10-16 22:31:02 +08:00
  • 4a510ab487 [NIXL] Improve request_finished() debug logs (#25665) Mark McLoughlin 2025-10-16 14:55:17 +01:00
  • 314fa8abbf [Attention] Tune CUTLASS MLA num_splits (#26846) Matthew Bonanni 2025-10-16 09:36:09 -04:00
  • 334535b6fb [Benchmark] Show E2EL by default for pooling models (#27014) Cyrus Leung 2025-10-16 20:47:09 +08:00
  • dcbb3f1871 [Bugfix] Correct LayerNorm epsilon parameter in modernbert.py (#27008) bogdanm 2025-10-16 17:27:44 +05:00
  • 00417f4e44 [MISC] fix import violations for re and triton modules (#26654) Sungjae Lee 2025-10-16 19:38:27 +09:00
  • ed344f4116 Cleanup code after Python 3.10 upgrade (#26520) Lukas Geiger 2025-10-16 11:38:23 +01:00
  • e51928793e [Model][Bugfix] fix ernie45 vl run failed from shared experts optimization (#26885) CSWYF3634076 2025-10-16 18:37:35 +08:00
  • d2740fafbf [Chore] Separate out vllm.utils.collections (#26990) Cyrus Leung 2025-10-16 16:35:35 +08:00
  • 17838e50ef [Benchmark] Use truncation by default for pooling benchmarks (#26992) Cyrus Leung 2025-10-16 16:02:39 +08:00
  • 44c8555621 [CI/Build] Fix AMD import failures in CI (#26841) Zhewen Li 2025-10-16 00:28:20 -07:00
  • f7d318de2b [Hardware][CPU][PowerPC]Disable torch.compile() in toptopk sampling (#26987) Akash kaothalkar 2025-10-16 11:06:59 +05:30
  • 76f0d05bc6 [CI/Build] Update expected beam search output for Phi3V (#26978) Cyrus Leung 2025-10-16 13:12:44 +08:00
  • 7d8975de84 Deepseek-v3 Batch Invariant on 8xH100 (#26609) Bram Wasti 2025-10-15 22:06:02 -07:00
  • 785d8b6410 [PERF] Qwen3-next MTP speedup (change bool mask indexing to index_select / index_copy to reduce d2h) (#26437) Vadim Gimpelson 2025-10-16 08:18:31 +04:00
  • f6cdc9a02f [Chore] Rename utils submodules (#26920) Cyrus Leung 2025-10-16 11:58:13 +08:00
  • 509cdc0370 [DOC][XPU]update feature parity with Intel GPU (#26954) Chendi.Xue 2025-10-15 22:07:10 -05:00
  • 9b6504c307 [BugFix] Work around graph partition x torch.compile cache issue (#26956) Richard Zou 2025-10-15 23:06:11 -04:00
  • e19b16dde6 [bugfix] Fix SP + PP without specifying compile size (#26955) Angela Yi 2025-10-15 20:05:33 -07:00
  • 582f2c6be7 [BUG] Allow runai_streamer_sharded in config check (#26958) ahao-anyscale 2025-10-15 20:05:14 -07:00
  • f8a0acbdbe [CI] Enable Blackwell Llama4 MoE tests (#26731) Michael Goin 2025-10-15 23:02:57 -04:00
  • 1317034379 [ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops (#24097) kliuae 2025-10-16 10:41:34 +08:00
  • 0ecc553ee6 [Bugfix] reasoning_parser parameter handling in run_batch.py (#26225) InChang Jeong 2025-10-16 11:24:05 +09:00
  • f96bc3649c [Qwen3-Next] Add tuned MoE config for Qwen3-Next FP8 on H100 tp2 (#26887) felixzhu555 2025-10-15 18:55:05 -07:00
  • 938c43ea7f [ci] Adjusting AMD test composition 2025-10-14 (#26852) Alexei-V-Ivanov-AMD 2025-10-15 18:52:13 -05:00
  • 0a9ef0cfce Move query quantization to attention layer for Flashinfer & Triton. (#26534) Adrian Abeyta 2025-10-15 18:01:38 -05:00
  • e5b438a247 [Bug] Temporally Disable VLLM_ALLREDUCE_USE_SYMM_MEM by Default (#26925) Wentao Ye 2025-10-15 16:18:50 -04:00
  • 0b99f5d302 support flashinfer_fp4 moe for 5090 gpu (#26669) XiaobingZhang 2025-10-16 03:06:47 +08:00
  • 1f491aa0c8 Vectorize RMS norm variance using vectorize_read_with_alignment (#26234) Benji Beck 2025-10-15 11:54:41 -07:00
  • de92d916fe [NVIDIA] Add support for cudnn fp4 gemm via flashinfer (#26107) Kaixi Hou 2025-10-15 10:53:00 -07:00
  • a1063628a4 [Chore] Clean up CODEOWNERS (#26923) Woosuk Kwon 2025-10-15 10:52:54 -07:00
  • d796375258 [ModelOpt] Remove NVFP4 MoE K%16==0 constraint (#26891) XiaobingZhang 2025-10-16 01:06:17 +08:00
  • 14f8456344 [Feature]: Use pydantic validation in observability.py config (#26637) Sam/Samuel 2025-10-16 01:44:03 +09:00
  • 4794c2bd92 Olmo 3 tool parser and tests (#26143) Pradeep Dasigi 2025-10-15 09:36:12 -07:00
  • d3cbaa08dc Lower sevarity of log when model info cache misses due to exception (#26917) Harry Mellor 2025-10-15 17:01:09 +01:00
  • 828523ad8e [Chore] Separate out vllm.utils.async_utils (#26913) Cyrus Leung 2025-10-15 23:33:00 +08:00
  • 136a17fe6e [Chore] Separate out vllm.utils.func (#26904) Cyrus Leung 2025-10-15 21:03:58 +08:00
  • f57438338d [BugFix] Patch inductor memory plan logic (#26878) Boyuan Feng 2025-10-15 05:51:45 -07:00
  • 5d598680e3 chore: remove unused marker (#26890) Max Wittig 2025-10-15 14:40:33 +02:00
  • 8f4b313c37 [Misc] rename torch_dtype to dtype (#26695) wangxiyuan 2025-10-15 20:11:48 +08:00
  • f93e348010 [Misc] Remove isort and yapf ignores (#26888) Cyrus Leung 2025-10-15 20:09:03 +08:00
  • f54f85129e [Model][2/N] Improve all pooling task | Support multi-vector retrieval (#25370) wang.yuqi 2025-10-15 19:14:41 +08:00
  • d4d1a6024f [Lora]Load tuned multi-lora kernel configs from json files (#26319) li2haipeng 2025-10-15 02:45:14 -07:00
  • db1764e4e0 [Platform] allow platform to init dp group (#22243) wangxiyuan 2025-10-15 17:32:17 +08:00
  • 7f83b4ee8e [Easy] Get rid of unnecessary paraenthesis in kv_cache_manager (#26842) Jialin Ouyang 2025-10-15 02:17:43 -07:00
  • 5c3bae1a6a [Fix] Remove divisibility requirement between num_kv_heads and tp_size in bailing_moe (#26876) ant-yy 2025-10-15 16:44:04 +08:00
  • 5210dc3940 [Misc] Update TritonLanguagePlaceholder to have attributes that are used by Flash Linear Attention ops. (#26853) Xudong Ma 2025-10-15 01:37:49 -07:00
  • 650b51f9f9 [doc] add Context Parallel Deployment doc (#26877) youkaichao 2025-10-15 16:33:52 +08:00
  • 6256697997 [Doc] ruff format remaining Python examples (#26795) Cyrus Leung 2025-10-15 16:25:49 +08:00
  • 71557a5f7c [CI] Fix mypy for vllm/executor (#26845) Wentao Ye 2025-10-15 04:23:33 -04:00
  • f3c378ffa7 [CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI (#21810) Zhewen Li 2025-10-15 01:09:56 -07:00
  • f5ed68ef63 [Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather (#26456) Yongye Zhu 2025-10-15 04:05:01 -04:00
  • efdef57b1f [bugfix] Lazy import cv2 (#26869) Angela Yi 2025-10-15 00:47:50 -07:00
  • b8a4572157 [Misc] Use helper function to generate dummy messages in OpenAI MM tests (#26875) Cyrus Leung 2025-10-15 15:17:37 +08:00
  • 302ef403a2 [DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends (#26656) Mengqing Cao 2025-10-15 15:16:44 +08:00
  • 8865da157b [Bugfix][Multi Modal] Fix incorrect Molmo token processing (#26873) sangho.lee 2025-10-15 02:13:59 -05:00
  • f0862eae43 [Graph Partition] pass tests for decorator (#26831) Boyuan Feng 2025-10-14 23:39:48 -07:00
  • 8c851f6d04 [Bugfix] Fix qwen3-omni audio truncation issue (#26815) Isotr0py 2025-10-15 13:38:36 +08:00
  • 7cfa420f49 [BugFix] Patch inductor partitioning logic (#26735) Angela Yi 2025-10-14 22:04:32 -07:00
  • a27b288e4a [Feature] default --extra-body param to disable thinking in vllm bench serve (#26784) rongfu.leng 2025-10-15 12:23:44 +08:00
  • e471d7ca7e [CI/Build][Bugfix] fix qutlass cmake error when set QUTLASS_SRC_DIR (#26773) zhrrr 2025-10-15 12:09:44 +08:00
  • c43ca8259e [Docs] Move build.inc into arm.inc (#26862) Michael Yao 2025-10-15 11:35:08 +08:00
  • 85a65e7f51 [Model] Add DeepSeek-V3.1 reasoning parser (split from PR #24972) (#25589) Tao Hui 2025-10-15 11:09:52 +08:00
  • a2986b3e33 [Bugfix] Fixes prefix-repetition benchmark script (#26828) kourosh hakhamaneshi 2025-10-14 19:54:43 -07:00
  • 96b9aa5aa0 [Frontend][torch.compile] CompilationConfig Overhaul (#20283): name change compilation level to compilation mode, deprecation compilation level (#26355) Morrison Turnansky 2025-10-14 22:51:16 -04:00
  • e66d787bce Disable FlashInfer sampler by default (#26859) Michael Goin 2025-10-14 22:35:18 -04:00
  • bfad142e25 [BUGFIX][NIXL] quick fix for 'assert self.connector_worker is not None' in get_kv_connector_stats (#26851) Chendi.Xue 2025-10-14 21:33:25 -05:00
  • 9354660036 [Bugfix]fix Qwen3 xml tool parser (#26345) Zhikaiiii 2025-10-15 09:50:30 +08:00
  • 07ca70af8d [Core][Easy] Use envs.__getattr__ for all Unify to environment variable access (#26810) Jialin Ouyang 2025-10-14 18:41:18 -07:00
  • 2dcd12d357 [torch.compile] Fix tests for torch==2.9 inductor partition (#26116) Luka Govedič 2025-10-14 19:55:02 -04:00
  • 579d2e5458 [WideEP][P/D] Add usage stats for DP+EP and KV Connector (#26836) Tyler Michael Smith 2025-10-14 19:51:54 -04:00
  • 0512c04aee [frontend][gptoss] Add per turn stats into Harmony Context (#25061) Ye Hu 2025-10-14 16:48:13 -07:00
  • 7e0ef4084a [CI Failure] Fix torchao dep failure for Quantization Test (#26824) Michael Goin 2025-10-14 19:41:43 -04:00
  • 4aed506b65 [Core] Streamline some structured output related code (#26737) Nick Hill 2025-10-14 16:27:44 -07:00
  • a86b4c58e8 remove attn output view kernel (#26680) Boyuan Feng 2025-10-14 15:53:10 -07:00
  • ff4810ba73 [Minor] Group async_scheduling related fields in model runner init (#26736) Nick Hill 2025-10-14 14:46:37 -07:00
  • 9d6964926e fix: response_format for completion (#23212) Nan Qin 2025-10-14 16:23:22 -05:00
  • 0e65818910 Added MoE configs for llama 4, H200 device with tp=4/8 tuning (#26837) Dhruvil Bhatt 2025-10-14 14:21:03 -07:00
  • 380f17527c [Perf] Cache vllm.env.__getattr__ result to avoid recomputation (#26146) Jialin Ouyang 2025-10-14 14:03:21 -07:00
  • b92ab3deda Notice for deprecation of AutoAWQ (#26820) HDCharles 2025-10-14 16:39:59 -04:00
  • acaa2c0a4a [Core] Reuse empty block lists whenever possible in KVCacheBlocks to mitigate GC costs (#24964) Jialin Ouyang 2025-10-14 12:58:43 -07:00
  • 82af928c41 [Attention][Spec Decode] FlashMLA spec decode support (#26541) Matthew Bonanni 2025-10-14 15:38:20 -04:00
  • 87efc681db llama4_vision_rope: add HIP override to accept (q, k) and avoid (positions, q, k) mismatch (#26790) Huamin Li 2025-10-14 11:54:12 -07:00
  • c3a722fcb2 [CI Failure] Fix tests with missing TinyLlama-1.1B-Chat-v1.0-FP8-e2e (#26816) (tag: v0.11.1rc1) Michael Goin 2025-10-14 14:38:59 -04:00
  • aba48f7db1 [Kernel][MoE] Add MoE tunings for GLM 4.6-FP8 and GLM 4.5 Air on NVidia B200 (#26818) Ze'ev Klapow 2025-10-14 14:20:39 -04:00