Commit Graph

  • 60af7b967b [Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm (#37283) TJian 2026-03-27 00:32:25 +08:00
  • bdc1719eb9 [ROCm][CI] Fix AITER state leak in shared_fused_moe_routed_transform test (#38137) Andreas Karatzas 2026-03-26 11:26:46 -05:00
  • 0aac2048bf [Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode (#35175) haosdent 2026-03-27 00:13:39 +08:00
  • cb2263218e [Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos (#35886) Chuan (Richard) Li 2026-03-26 08:59:24 -07:00
  • e054f152fa [CI] Add batch invariant test for b200 (#38014) Wentao Ye 2026-03-26 11:54:54 -04:00
  • 0f5b526040 [Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility (#38232) zhang-prog 2026-03-26 23:34:49 +08:00
  • be1a85b7a2 Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" (#38050) (#38169) Zhewen Li 2026-03-26 07:59:09 -07:00
  • 2e225f7bd2 [Renderer] Consolidate factory methods (#38218) Cyrus Leung 2026-03-26 20:19:22 +08:00
  • 757eafcf37 [bug-fix] GLM OCR Patch Merger context_dim (#37962) Jared Wen 2026-03-26 20:11:21 +08:00
  • dcdc145893 [CI] Reorganize scoring tests (#38207) wang.yuqi 2026-03-26 20:07:01 +08:00
  • f2d16207c7 [ROCm][CI] Fix flaky GPTQ compile correctness test (#38161) Andreas Karatzas 2026-03-26 06:57:00 -05:00
  • 37a83007fe [ROCm][CI] Fix wvSplitKrc mock argument order in test_rocm_unquantized_gemm (#38167) Andreas Karatzas 2026-03-26 06:54:59 -05:00
  • 9fdc0f3aeb merge khluu 2026-03-26 02:17:52 -07:00
  • bf5eec638d [Refactor] Remove unused utils (#38153) Wentao Ye 2026-03-26 05:08:19 -04:00
  • b1cb1d3d2c DOC: Documentation pages fixes (#38125) Mateusz Sokół 2026-03-26 09:55:42 +01:00
  • 6ae8bbd0c2 [XPU] Disable xpu graph by default (#38193) Kunshang Ji 2026-03-26 16:53:45 +08:00
  • a9213c0ffe [Doc] Fix outdated reference to CUDAGraphManager (#38209) Cyrus Leung 2026-03-26 16:52:38 +08:00
  • 502c41a8f6 [Model] Use helper function to run MM processors with token inputs (where applicable) (#38018) Cyrus Leung 2026-03-26 16:44:04 +08:00
  • 05d96d7991 merge Vadim Gimpelson 2026-03-26 12:21:47 +04:00
  • 52069012fe [Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083) Vadim Gimpelson 2026-03-26 12:21:47 +04:00
  • 71161e8b63 [cpu][ci] remove soft-fail for Arm CI and add quant model tests (#37691) Fadi Arafeh 2026-03-26 07:03:31 +00:00
  • 38de822310 [Model] Add torch.compile support for InternVL vision encoder (#38049) Terry Gao 2026-03-25 23:52:29 -07:00
  • 2bfbdca23c [Bugfix] Fix benchmark_fused_collective.py (#38082) Jee Jee Li 2026-03-26 14:51:00 +08:00
  • 2908094567 Add /v1/chat/completions/batch endpoint for batched chat completions (#38011) Matej Rojec 2026-03-26 05:13:33 +01:00
  • e6bf9f15ec [Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format (#38092) BadrBasowid 2026-03-26 12:11:43 +08:00
  • 144030c84e Relocate Encoder CUDA graph manager (#38116) Woosuk Kwon 2026-03-25 20:52:12 -07:00
  • e2db2b4234 [Tool Parser][1/3] Pass tools to ToolParser constructor (#38029) Flora Feng 2026-03-25 22:29:06 -04:00
  • 87f05d6880 [Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder (#38076) Chauncey 2026-03-26 09:43:51 +08:00
  • 36f6aede23 [Misc] Optimized check to encapsulate both CUDA and ROCm platforms (#34549) Andreas Karatzas 2026-03-25 20:43:07 -05:00
  • 9704a5c310 Disable dual stream execution of input projection for Qwen3 (#38152) Xin Yang 2026-03-25 18:20:39 -07:00
  • 74056039b7 Fix minimax m2.5 nvfp4 kv scales weight loading (#37214) Wei Zhao 2026-03-25 20:48:06 -04:00
  • d7d51a7ee5 [Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU (#37348) Jacob Platin 2026-03-25 17:46:01 -07:00
  • 3c3c084240 Various Transformers v5 fixes (#38127) Harry Mellor 2026-03-26 00:10:08 +00:00
  • 7b54f60db0 [Cohere] Enable Cohere-Transcribe (#38120) Ekagra Ranjan 2026-03-25 19:13:51 -04:00
  • a0e8c74005 [ROCm]: Update rope+kvcache fusion conditions and disable custom op by default (#36716) Rohan Potdar 2026-03-25 15:58:44 -05:00
  • 70a2152830 [MultiModal] add support for numpy array embeddings (#38119) Guillaume Guy 2026-03-25 15:13:04 -05:00
  • 978fc18bf0 [ROCm] Utilize persistent MLA kernel from AITER (#36574) Sathish Sanjeevi 2026-03-25 12:00:42 -07:00
  • 7d6917bef5 [ROCm] Fix MoE kernel test failures on gfx950 (#37833) Andreas Karatzas 2026-03-25 13:46:40 -05:00
  • e38817fadb [Core][KV Connector] Remove use of num_cached_tokens in error handling (#38096) Mark McLoughlin 2026-03-25 18:20:48 +00:00
  • 72cad44d3c [Frontend] Move APIServerProcessManager target server fn (#38115) Nick Hill 2026-03-25 11:14:41 -07:00
  • ba2f0acc2d [Misc] Reorganize inputs (#35182) Cyrus Leung 2026-03-26 01:22:54 +08:00
  • 678b3c99e8 [MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration (#38050) Yongye Zhu 2026-03-25 13:16:40 -04:00
  • bf4cc9ed2d [2/n] Migrate per_token_group_quant to torch stable ABI (#36058) mikaylagawarecki 2026-03-25 13:15:13 -04:00
  • 1ac2ef2e53 [CI/Docs] Improve aarch64/DGX Spark support for dev setup (#38057) Ben Browning 2026-03-25 12:24:42 -04:00
  • 6e37c46b35 [compile] Add some more startup tests for top models (#38046) Richard Zou 2026-03-25 12:02:22 -04:00
  • 1bf2ddd0ee [Refactor] Rename WAITING_FOR_FSM to WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR (#38048) Wentao Ye 2026-03-25 11:41:44 -04:00
  • e7221180e1 [Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM (#37970) Necofish 2026-03-25 23:20:04 +08:00
  • 4a76ad12e0 [Bugfix] Preserve CUDA arch suffix (a/f) for SM12x — fixes NVFP4 NaN on desktop Blackwell (#37725) RobTand 2026-03-25 11:18:25 -04:00
  • d7e93e13fb [Feature] EPLB Support for GPU Model Runner v2 (#37488) Wentao Ye 2026-03-25 11:16:39 -04:00
  • cd7643015e [Feature] Support per-draft-model MoE backend via --speculative-config (#37880) Andrii Skliar 2026-03-25 15:31:52 +01:00
  • a1a2566447 [Docs] Add guide for editing agent instruction files (#37819) Ben Browning 2026-03-25 09:54:09 -04:00
  • b745e8b5d3 [KVTransfer][Mooncake] Add heterogeneous TP support for disaggregated P/D in MooncakeConnector (#36869) yjz 2026-03-25 21:24:07 +08:00
  • d215d1efca [Mypy] Better fixes for the mypy issues in vllm/config (#37902) Harry Mellor 2026-03-25 13:14:43 +00:00
  • 34d317dcec [CPU][UX][Perf] Enable tcmalloc by default (#37607) Fadi Arafeh 2026-03-25 12:39:57 +00:00
  • 7ac48fd357 [Model] Add AutoWeightsLoader support for jais (#38074) grYe99 2026-03-25 20:38:40 +08:00
  • d6bb2a9d9a Fix Plamo 2/3 & LFM2 for Transformers v5 (#38090) Harry Mellor 2026-03-25 12:29:49 +00:00
  • 1e673a43ce Better weight tying check for multimodal models (#38035) Harry Mellor 2026-03-25 12:07:23 +00:00
  • 04417ecd5f [ROCm][CI] Rename filepath test to point to correct file (#38102) Andreas Karatzas 2026-03-25 07:05:46 -05:00
  • 242c93f744 [Docs] Adds vllm-musa to custom_op.md (#37840) R0CKSTAR 2026-03-25 19:54:36 +08:00
  • a889b7f584 [Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 (#37280) Matthias Gehre 2026-03-25 12:42:58 +01:00
  • ba2910f73a Fix offline mode test for Transformers v5 (#38095) Harry Mellor 2026-03-25 11:39:48 +00:00
  • f262a62aa1 [ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test (#37616) Andreas Karatzas 2026-03-25 05:55:51 -05:00
  • 9ac2fcafbb [CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors (#37483) Andreas Karatzas 2026-03-25 05:24:33 -05:00
  • e9ae3f8077 [Hardware][XPU] Align memory usage with cuda on xpu (#37029) Kunshang Ji 2026-03-25 18:14:29 +08:00
  • 04cec4f927 [ROCm][CI] Increase OpenAPI schema test timeouts (#38088) Andreas Karatzas 2026-03-25 05:06:58 -05:00
  • 14771f7150 [XPU] support MLA model on Intel GPU (#37143) Kunshang Ji 2026-03-25 17:43:42 +08:00
  • 189ddefbfd [ROCm] Attention selector reordering (#36702) Gregory Shtrasberg 2026-03-25 04:42:56 -05:00
  • 09c3dc9186 [Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function (#37968) Chauncey 2026-03-25 14:19:37 +08:00
  • 42e9547976 [ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test (#37640) vllmellm 2026-03-25 13:06:15 +08:00
  • a32783bb35 [Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser (#37958) Chauncey 2026-03-25 12:06:21 +08:00
  • 9d0351c91d [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc (#37914) Baorun (Lauren) Mu 2026-03-24 22:53:24 -04:00
  • ccbc5ac449 [Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158) Dimitrios Bariamis 2026-03-17 21:13:06 +01:00
  • a93a53f8a1 [Performance] Auto-enable prefetch on NFS with RAM guard (#37673) Artem Perevedentsev 2026-03-25 02:31:14 +02:00
  • 679c6a3ecc [Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 (#37787) Andreas Karatzas 2026-03-24 19:17:33 -05:00
  • 8bbb7c7f20 [ROCm][CI][PD] Add Hybrid SSM integration tests to CI (#37924) Andreas Karatzas 2026-03-24 18:58:39 -05:00
  • af945615b5 [release] Move the rest of release jobs to release queue (#38044) Kevin H. Luu 2026-03-24 16:40:58 -07:00
  • 82580b10ac [Perf] Disable inductor runtime asserts by default for serving perfor… (#37485) Terry Gao 2026-03-24 16:37:51 -07:00
  • a0d487b2e1 nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths (#37903) Netanel Haber 2026-03-25 01:25:56 +02:00
  • b73b5b0629 Make microbatch optimization (DBO) work with general models (#37926) Junhao 2026-03-24 17:40:08 -04:00
  • 0f0e03890e [UX] Add flashinfer-cubin as CUDA default dep (#37233) Michael Goin 2026-03-24 22:13:08 +01:00
  • 4b53740d7f [MRV2] Fix for DS v3.2 (#38030) Woosuk Kwon 2026-03-24 14:03:24 -07:00
  • 4e824d1c83 [Model Runner V2][Minor] Simplify PP logic (#38031) Nick Hill 2026-03-24 13:57:17 -07:00
  • 0c1809c806 Add Ubuntu 24.04 support for Docker builds (#35386) amey asgaonkar 2026-03-24 13:34:44 -07:00
  • 8c47fdfdb1 [FlexAttention] allow custom mask mod (#37692) liangel-02 2026-03-24 16:03:24 -04:00
  • 54b0578ada [Bugfix] Pass hf_token through config loading paths for gated model support (#37920) Javier De Jesus 2026-03-24 20:22:05 +01:00
  • 89f572dbc0 [BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 (#38015) Richard Zou 2026-03-24 15:08:26 -04:00
  • 71a4a2fbd0 [BugFix] Fix order of compile logging (#38012) Richard Zou 2026-03-24 14:58:18 -04:00
  • 935c46dd9b [Model] Add Granite 4.0 1B speech to supported models (#38019) Nick Cao 2026-03-24 14:23:41 -04:00
  • 057fc94cbd [Bugfix] Fix structured output crash on CPU due to pin_memory=True (#37706) Willy Hardy 2026-03-24 13:44:17 -04:00
  • b58c5f28aa docs: fix broken offline inference paths in documentation (#37998) Vineeta Tiwari 2026-03-24 23:05:14 +05:30
  • c07e2ca6e0 Fix Mamba state corruption from referencing stale block table entries (#37728) Ming Yang 2026-03-24 10:29:59 -07:00
  • 4df5fa7439 [Bugfix] Force continuous usage stats when CLI override is enabled (#37923) Dhruv Singal 2026-03-24 10:29:50 -07:00
  • a5416bc52e [XPU] Support Intel XPU hardware information collection in usage stats (#37964) sihao_li 2026-03-25 01:29:17 +08:00
  • b3601da6e7 [Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) (#37904) Harry Mellor 2026-03-24 17:14:01 +00:00
  • dc78c2c933 [Core] add option to schedule requests based on full ISL (#37307) Dan Blanaru 2026-03-24 18:01:12 +01:00
  • 4731884796 [Feature] limit thinking tokens (hard limit) (#20859) Sungjae Lee 2026-03-25 01:53:07 +09:00
  • 8de5261e69 Update new contributor message (#37999) Harry Mellor 2026-03-24 16:01:41 +00:00
  • 1b6cb920e6 [Deprecate] Deprecate pooling multi task support. (#37956) wang.yuqi 2026-03-24 22:07:47 +08:00
  • 352b90c4a4 [Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU (#37987) Li, Jiang 2026-03-24 22:00:20 +08:00
  • 1c0aabdeb0 [Bugfix] Suppress spurious CPU KV cache warning in launch render (#37911) Sage 2026-03-24 14:36:18 +02:00