khluu
b31e9326a7
Bound openai to under 2.25.0
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-06 13:04:15 -08:00
Doug Smith
e346c08560
[Release] Include source distribution (sdist) in PyPI uploads ( #35136 )
Signed-off-by: dougbtv <dosmith@redhat.com >
Co-authored-by: Daniele Trifirò <dtrifiro@redhat.com >
(cherry picked from commit 0bfa229bf1 )
2026-03-06 13:03:53 -08:00
Avery Miao
b7a423cb01
[BUGFIX]Fix Qwen-Omni models audio max_token_per_item estimation error leading to encoder_cache_size is 0 ( #35994 )
Signed-off-by: Miao, Avery <avery.miao@intel.com >
(cherry picked from commit e998fa76b9 )
2026-03-06 13:03:40 -08:00
Cyrus Leung
fa78ec8a72
[Bugfix] Fix Qwen-VL tokenizer implementation ( #36140 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
(cherry picked from commit 7196348157 )
2026-03-06 13:03:26 -08:00
Kunshang Ji
9a474ce7a4
[XPU] bump vllm-xpu-kernels to v0.1.3 ( #35984 )
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
(cherry picked from commit a8f66cbde8 )
2026-03-06 13:03:05 -08:00
lailoo
097eb544e9
[Bugfix] Improve engine ready timeout error message ( #35616 )
Signed-off-by: damaozi <1811866786@qq.com >
2026-03-04 05:54:32 +00:00
ShiJie Zhong
7cdba98edf
[BugFix] Support tool_choice=none in the Anthropic API ( #35835 )
Signed-off-by: ZhongsJie <zhongsjie@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-04 05:24:46 +00:00
Charlie Fu
3c85cd9d74
[Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) ( #35913 )
Signed-off-by: charlifu <charlifu@amd.com >
2026-03-04 04:50:13 +00:00
Andreas Karatzas
edba15045a
[Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions ( #35711 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 04:12:51 +00:00
Cyrus Leung
e379396167
[Refactor] Clean up processor kwargs extraction ( #35872 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-03 19:53:53 -08:00
Isotr0py
6e9f21e8a2
[Chore] Remove debug code in model implementation ( #35883 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 19:50:58 -08:00
AllenDou
c1d963403c
[model] support FireRedASR2 ( #35727 )
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 19:41:30 -08:00
Shanshan Shen
77e6dcbbfa
[PluggableLayer][MM] Add PluggableLayer for RelPosAttention ( #33753 )
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-03-03 19:41:27 -08:00
William Zhang
70c73df69e
[Bugfix] Fix EVS implementation for Qwen3 VL ( #33607 )
Signed-off-by: 2ez4bz <133824995+2ez4bz@users.noreply.github.com >
2026-03-04 02:18:11 +00:00
xjx
9a9d442464
Enable bnb for multiple indices weight ( #35838 )
Signed-off-by: xjx <493337577@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-04 01:46:47 +00:00
Andreas Karatzas
f7da9cdffc
[ROCm][CI] Support async weight transfer example with platform-aware determinism ( #35710 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 09:44:14 +08:00
Jaewon
f22ff2958c
[Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode ( #35916 )
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-04 00:10:11 +00:00
Nick Hill
d15c3b90fc
[Core] Move save_tensorized_model logic to Worker ( #35825 )
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-03 15:31:59 -08:00
zhrrr
97286a20ed
[Model Runner V2] support dp & ep for spec decoding ( #35294 )
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-03 15:19:45 -08:00
Amr Mahdi
12b38c0f45
[CI/Build] Allow mounting AWS credentials for sccache S3 auth ( #35912 )
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-03-03 14:30:47 -08:00
Woosuk Kwon
467886a0c4
[Model Runner V2] Fix inputs_embeds=None bug for MM models ( #35917 )
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-03 13:47:45 -08:00
bnellnm
a9b8b13e5c
[Bugfix] Fix misnamed parameter in compressed_tensors_moe.py ( #35813 )
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-03 16:29:57 -05:00
Micah Williamson
e7213003cb
[ROCm][CI] Fix TP size issue for test_gpt_oss ( #35887 )
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-03 20:57:34 +00:00
Rohan Potdar
3a8eef5869
[ROCm][Bugfix]: Disable AITER Triton ROPE by default ( #35601 )
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-03 13:43:56 -06:00
Robert Shaw
97995f6376
[MoE Refactor] Create MK for TRTLLM Kernels ( #32564 )
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2026-03-03 10:39:50 -08:00
Robert Shaw
881a6b011b
[CI] Temporarily Disable Llama4 MoE Refactor Test ( #35870 )
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-03 10:36:15 -08:00
Matthew Bonanni
8e1fd5baf0
[CI] Bump num_speculative_tokens to 3 in nightly DeepSeek tests ( #35882 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-03 09:26:44 -08:00
JasonCohere
ae88468bcc
fix: Ensure invalid audio files return 400 error ( #34715 )
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-03 08:47:39 -08:00
ojhaanshika
e05cb3b93e
TRTLLM gen-full attn Test Coverage ( #34986 )
Signed-off-by: Anshika Ojha <anshikao@nvidia.com >
Co-authored-by: Anshika Ojha <anshikao@gb-nvl-059-compute09.nvidia.com >
2026-03-03 11:35:34 -05:00
Lucas Wilkinson
28ef9ba399
[BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA ( #34552 )
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-03 07:21:57 -08:00
TJian
fb7fdc49c4
[ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops ( #34307 )
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-03 06:24:21 -08:00
wang.yuqi
ea463978bb
[Frontend][1/n] Improve pooling entrypoints | classify. ( #35604 )
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-03 06:05:36 -08:00
Li, Jiang
440f0e7dc6
[Bugfix] Avoid src/dst as None in irecv/isend_tensor_dict ( #35754 )
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-03 05:56:08 -08:00
wang.yuqi
fd4a90f337
[CI] And PPL test for Qwen3.5. ( #35853 )
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-03 13:15:51 +00:00
Thomas Parnell
ad9d09e2b8
[Perf] [Hybrid] Copy num_accepted_tokens in non-blocking way when not using prefix caching ( #35442 )
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-03-03 04:15:43 -08:00
Szymon Reginis
4beebfd146
[CI/Build][Intel] Add new performance benchmarks for Intel Gaudi 3 ( #31025 )
Signed-off-by: Szymon Reginis <sreginis@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-03 19:48:24 +08:00
hallerite
b8401cde0e
add regression test ( #35834 )
Signed-off-by: hallerite <git@hallerite.com >
2026-03-03 07:32:15 +00:00
TJian
5dfc5abe94
[ROCm] [Release] Change the package from aiter to amd-aiter ( #35198 )
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-02 23:13:39 -08:00
lin-shh
8fa68a8ce4
Fix TYPE_CHECKING stub defaults in envs.py to match actual runtime defaults ( #35645 )
2026-03-02 21:59:43 -08:00
lin-shh
35a6f0bfe2
[Misc] Fix typos in comments: explict→explicit, paramaters→parameters ( #35648 )
2026-03-02 21:59:14 -08:00
Taneem Ibrahim
3a6cbf16e2
[MISC] Removed unused function find_all_indices() from tool_parsers/utils.py ( #35683 )
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-03 13:58:42 +08:00
Lucas Wilkinson
f44d1ddc8c
[BugFix] Fix cmake based incremental install (wrong vllm install dir) ( #35773 )
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-03-02 21:58:16 -08:00
Cyrus Leung
48a54c1e0d
[CI/Build] Trigger processor tests on registry update ( #35824 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-03 13:55:57 +08:00
Micah Williamson
8b9e8b7454
[ROCm][CI] Fix Assertion Logic For test_gpt_oss ( #35806 )
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-03 05:08:04 +00:00
Wentao Ye
c21d0039ec
[Refactor] Fix maxsim cuda platform and add cli to control it ( #35427 )
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-03 12:48:31 +08:00
Isotr0py
7d8bbe6f42
[CI/Build] Automatically patch video metadata for multimodal processor test ( #35822 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 04:27:45 +00:00
aykoppol
25e02647c2
[Core] Add optional flags to check for repetitive token patterns in engine output ( #35451 )
Signed-off-by: aykoppol <aykoppol+git@gmail.com >
2026-03-03 12:23:25 +08:00
Woosuk Kwon
a0a5178ab4
[Model Runner V2] Use ModelState.prepare_attn() for cuda graph capture [5/N] ( #35774 )
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-02 20:06:27 -08:00
Isotr0py
8ea8ba275e
[V0 deprecation] Remove Swin model ( #35821 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-02 20:03:41 -08:00
Woosuk Kwon
4f85bae9d6
[Docs][Model Runner V2] Add Design Docs ( #35819 )
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-02 19:58:14 -08:00
Andy Lo
0a7165fd71
[ModelRunnerV2] Rename sampler functions and variables for clarity ( #35459 )
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-02 19:48:56 -08:00
Robert Shaw
6521ccf286
[CI] Temporarily Disable Nightly Failures ( #35770 )
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-03 01:49:13 +00:00
Martin Vit
8ebd872f50
[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode ( #35615 )
Signed-off-by: Martin Vit <martin@voipmonitor.org >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-03 09:40:37 +08:00
zhrrr
168ee03e1c
[Model Runner V2][Perf] align dummy_run tokens to uniform decode for dp cudagraph ( #35376 )
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-03-02 17:10:47 -08:00
liuzhenwei
9dd656f0ea
[XPU][NIXL] Add GPUDirect RDMA support for XPU ( #35270 )
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-03 08:42:49 +08:00
Jakub Zakrzewski
c8b678e53e
[Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 ( #35735 )
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-03-03 08:32:14 +08:00
Andreas Karatzas
18c29c746b
[ROCm][CI] Fix backslash-continuation in pytest marker re-quoting and treat exit code 5 as success ( #35798 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 16:07:51 -08:00
Hanjie Qiu
96fc09503a
[All Reduce] Change default backend of Flashinfer All Reduce to trtllm ( #35793 )
Signed-off-by: hjjq <hanjieq@nvidia.com >
2026-03-02 18:57:38 -05:00
Roger Wang
1b82b433fc
[Bugfix] Fix MM processor test for Qwen3.5 ( #35797 )
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-03-02 23:05:08 +00:00
Robert Shaw
9319044ee9
[MoE][Perf] Wrap DSV3 QKVAProj GEMM in custom op for torch.compile ( #35751 )
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-02 23:03:49 +00:00
Boyuan Feng
c42dc402c1
clean unused cudagraph_batch_sizes ( #35552 )
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2026-03-02 22:00:16 +00:00
Ye (Charlotte) Qi
fa6a6be519
[Bugfix] Fix missing sequence_lengths in qwen3_omni_moe_thinker ( #35741 )
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2026-03-02 21:11:56 +00:00
Aaron Hao
cad21918e3
[BUG] Fix rlhf_async example ( #35788 )
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-02 20:36:40 +00:00
Jeffrey Wang
53700bf49b
[ci] Add Ray compatibility check informational CI job ( #34672 )
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-03-02 12:06:16 -08:00
Yashwant Bezawada
a13d8c03c9
[KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops ( #31057 )
Signed-off-by: Yashwant Bezawada <yashwant_b@me.com >
2026-03-02 15:04:47 -05:00
Fynn Schmitt-Ulms
9433acb8df
[Spec Decode] Add hidden states extraction system ( #33736 )
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com >
2026-03-02 14:29:09 -05:00
Richard Zou
d1a6e96d9e
[torch.compile] Improve cold and warm start compile tests ( #35709 )
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-02 19:27:06 +00:00
CSWYF3634076
2a9e3347e9
[BugFix][Model]Fix the garbled code in Ernie4.5-VL caused by fast_moe_cold_start ( #35587 )
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2026-03-02 18:56:33 +00:00
Isotr0py
cc0d565f40
[CI/Build] Enable Qwen3.5 tests on CI ( #35763 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-02 17:43:53 +00:00
Patryk Wolsza
358e4d5ba7
[CI][HPU] Pin vllm commit compatible with vllm-gaudi - HPU tests ( #35307 )
Signed-off-by: PatrykWo <patryk.wolsza@intel.com >
2026-03-02 17:02:26 +00:00
Cyrus Leung
792a74b973
[Doc] Improve UX of --enable-log-requests ( #35723 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-02 08:24:09 -08:00
Turner Jabbour
4034c3d32e
[Core] Move test utility to test file ( #35672 )
Signed-off-by: Turner Jabbour <doubleujabbour@gmail.com >
2026-03-02 10:56:03 -05:00
Martin Hickey
7560d674c9
[CI] Fix mypy for vllm/device allocator ( #35518 )
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 15:53:18 +00:00
ElizaWszola
d9c7730877
[Performance] Extract kv update ops from MLA attention backends ( #34627 )
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Di Wu <dw2761@nyu.edu >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-02 10:43:19 -05:00
Runkai Tao
ada4f4fadd
[Fix Bug]num_active_loras always equals to zero ( #34119 )
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-02 23:17:46 +08:00
Harry Mellor
7e9149d9a9
[Docs] Add breadcrumbs for better UX ( #35749 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 14:31:54 +00:00
Martin Hickey
87c98b0236
[MyPy][BugFix] Check profiler is assigned before calling start() on it ( #35505 )
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 13:23:42 +00:00
Tyler Michael Smith
de7dd634b9
Fix unresolved-import errors when using Astral's ty by removing src.root ( #35681 )
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-03-02 10:26:47 +00:00
Chauncey
9a87b0578f
[Feat] Supports Anthropic Messages count_tokens API ( #35588 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-02 09:48:54 +00:00
wangxiyuan
510bc9e1df
[Misc] Cleanup useless current_platform import ( #35715 )
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-03-02 09:36:54 +00:00
Charles Ashby
cbd361fd46
[CPU][Distributed] Fix Enable _CPUSHMDistributed only when TP/PP ranks share the same SHM group name ( #34169 )
Signed-off-by: Charles Ashby <charlesa.l@hotmail.com >
2026-03-02 09:34:35 +00:00
Nicolò Lucchesi
c212202d93
[Misc] Bound NIXL upper bound version ( #35495 )
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-02 16:57:07 +08:00
Andreas Karatzas
ec27b36b4b
[CI] Defining extended V1 e2e + engine tests ( #35580 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 08:10:54 +00:00
Charlie Fu
3fd1d4ec2c
[Rocm][CI] Fix LM Eval Large Models (H100) test group ( #34750 )
Signed-off-by: charlifu <charlifu@amd.com >
2026-03-02 07:43:38 +00:00
EdalatiAli
cb21972a97
[Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels ( #34448 )
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-01 23:31:19 -08:00
Andreas Karatzas
c34963f138
[ROCm][CI] Disable skinny GEMMs in language model standard tests to fix non-determinism ( #35152 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 15:04:18 +08:00
Hongxia Yang
f26650d649
[ROCm] add amd-quark package in requirements for rocm to use quantized models ( #35658 )
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-02 06:02:43 +00:00
Kunshang Ji
92f5d0f070
[XPU] fix mxfp4 activation type ( #35691 )
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-02 11:48:39 +08:00
Jesse Cai
a60985b07e
Fix deprecated v1 config tests ( #35327 )
Signed-off-by: Jesse Cai <jessecai@fb.com >
2026-03-01 20:32:03 -05:00
Lucas Wilkinson
8b5014d3dd
[Attention] FA4 integration ( #32974 )
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-01 23:44:57 +00:00
zhanqiuhu
57a96e26c9
Revert "[Bugfix] Disable TRTLLM attention with KV transfer enabled ( #33192 )" ( #34832 )
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-03-01 22:32:37 +00:00
Richard Zou
e82fbeec7b
[torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 ( #35475 )
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-01 21:44:22 +00:00
haosdent
6290470843
[Bugfix] Fix dtype mismatch in RMSNormGated.forward_native() during torch.compile ( #35256 )
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-01 15:14:46 -05:00
Woosuk Kwon
72f4d16262
[Model Runner V2] Use block table apis for capture inputs ( #35671 )
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-01 10:31:13 -08:00
Seungho Yoon
5a435507d8
fix(mxfp4): return is_monolithic=False when LoRA is enabled for Triton backend ( #35382 )
Signed-off-by: Seungho Yoon <yoonsnowdev@gmail.com >
2026-03-01 09:59:30 -05:00
Taneem Ibrahim
59d7af9c6c
[MISC] Fixing a null reference by removing parallel_utils from mypy EXCLUDE ( #35630 )
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-01 09:26:44 -05:00
Asaf Gardin
bbf81f9a92
[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching ( #34798 )
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-03-01 20:40:23 +08:00
Woosuk Kwon
da543d1abe
[Model Runner V2] Minor refactoring for EncoderRunner ( #35628 )
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-01 00:15:39 -08:00
Ryan Rock
87d319c52f
[AMD][CI] Support Triton attention with ExampleConnector ( #34931 )
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-03-01 09:58:07 +02:00
lin-shh
a9ec392c86
Fix typo: implictly -> implicitly in isaac.py docstring ( #35646 )
2026-02-28 23:34:37 -08:00
lailoo
afd089f231
[Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag on older GPUs ( #35617 )
2026-03-01 03:27:37 +00:00
gnovack
3ecd0bf9fc
Add TMA support to fused_moe_lora kernel ( #32195 )
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-01 10:55:25 +08:00
Woosuk Kwon
e3eb146f7a
[Model Runner V2] Add ModelStateInterface [4/N] ( #35621 )
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-28 13:19:45 -08:00
Martin Vit
95a395dbec
[Bugfix] Fix Anthropic API base64 image handling in Messages endpoint ( #35557 )
Signed-off-by: Martin Vit <martin@voipmonitor.org >
2026-02-28 20:57:08 +00:00
Isotr0py
e94b263bd6
[Chore] Cleanup BNB utilization dead code ( #35620 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-28 19:22:41 +00:00
Wentao Ye
e113a30113
[Deprecation] Deprecate code in 0.17 as scheduled ( #35441 )
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-28 17:32:37 +00:00
Cyrus Leung
1dafb29f91
[Benchmark] Avoid unnecessary video download in MMVU ( #35618 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-28 09:07:02 -08:00
emricksini-h
49b9ae32e9
[Fix] Avoid sending image input to other PP ranks ( #35405 )
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-01 00:14:29 +08:00
cwazai
63d7972f13
Fix Qwen3_5MTP packed_modules_mapping for gate_up_proj ( #35581 )
2026-02-28 14:50:55 +00:00
flutist
c68e69f144
custom dataset img support base64 ( #35280 )
Signed-off-by: xjx <493337577@qq.com >
2026-02-28 11:49:52 +00:00
Chauncey
7e08c22b8c
[Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function ( #35271 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 10:12:00 +00:00
Augusto Yao
8e75d88554
add io_process_plugin for sparse embedding ( #34214 )
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Signed-off-by: Augusto Yao <augusto.yjh@antgroup.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-28 09:16:37 +00:00
Mario Hong
0892d1ab1f
[Feature]Supports Anthropic Thinking Block ( #33671 )
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: zetaohong <i-hongzetao@stepfun.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-28 09:02:33 +00:00
Hashem Hashemi
7600642eae
Add padding support to wvSplitK solution for skinny GEMMs ( #33762 )
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-28 09:02:05 +00:00
Andreas Karatzas
1e69c04887
[ROCm][CI] Parametrize vision score tests across attention backends with per-backend tolerances ( #35571 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 08:59:26 +00:00
Cyrus Leung
4292e3b807
[Benchmark] Improve UX of sweep scripts ( #35600 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-28 00:36:02 -08:00
Cyrus Leung
24d6ea8afd
[Benchmark] Rename SLA Finder to Workload Explorer ( #35586 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-27 23:31:55 -08:00
Chauncey
57c86c0741
[Misc] Change logging level from info to debug for tool parser import ( #35575 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 14:51:35 +08:00
Chauncey
06254d4cbb
[CI] add trainer_send_weights for MockWeightTransferEngine ( #35589 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 06:47:43 +00:00
Andreas Karatzas
f5d1281c9d
[ROCm][CI] Expose tests to AMD production CI and fix amdsmi heap corruption ( #35071 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:57:31 +08:00
Andreas Karatzas
94029ffaf0
[ROCm] Derive device capability from GCN arch string without CUDA init ( #35069 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:55:28 +08:00
Andreas Karatzas
88e8525f2e
[ROCm][CI] Adding infiniband mappings for moriio tests ( #35170 )
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:53:28 +08:00
Ilya Markov
b2d8b422b2
[EPLB] Enforce sync eplb for NCCL-based all2all backend ( #35212 )
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-28 05:47:12 +00:00
Umut Polat
1d5ab5d603
[Bugfix] Move chat completion response_format validation to Pydantic model_validator ( #35510 )
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-27 21:26:19 -08:00
Huy Do
7b346ba8ed
[Bugfix] Propagate compilation_time from workers to main process for TP>1 ( #35503 )
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-02-28 05:03:22 +00:00
Itay Alroy
dea268336f
[1/N] Elastic EP Milestone 2 ( #34861 )
Signed-off-by: Yongji Wu <wuyongji317@gmail.com >
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: Ron Tourgeman <rtourgeman@nvidia.com >
Co-authored-by: Yongji Wu <wuyongji317@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-02-28 04:46:42 +00:00
Ma Jian
90805ff464
[CI/Build] CPU release supports both of AVX2 and AVX512 ( #35466 )
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: jiang1.li <jiang1.li@intel.com >
2026-02-28 04:35:21 +00:00
Matthew Bonanni
2562e0271e
[MTP] Validate that MTP weights are actually loaded ( #35548 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-28 12:27:40 +08:00
Cyrus Leung
fd68cd132b
[Bugfix] Fixes for SLA finder ( #35537 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-27 20:20:55 -08:00
Micah Williamson
0edf101d2b
[ROCm] Add stablelm Head Size 80 To Supported Head Sizes For ROCM_ATTN ( #35527 )
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-28 12:16:34 +08:00
Douglas Lehr
d5b6f3ba36
[ROCm][Quantization] Add Composable Kernel (CK) backend support for M… ( #34301 )
Signed-off-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
2026-02-28 03:37:01 +00:00
Woosuk Kwon
1a014a0a93
[Model Runner V2] Move MM encoder to Model States [3/N] ( #35564 )
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-27 18:32:38 -08:00
Woosuk Kwon
86ac7bcf84
[Model Runner V2] Support pooling models ( #35120 )
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-27 18:03:01 -08:00
Umut Polat
405f28d38d
[Misc] Clean up ResponsesRequest model validators ( #35531 )
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-28 01:19:21 +00:00
youkaichao
5323672bc2
[misc] cleanup one level of error stack when nixl fails to initialize ( #35517 )
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-02-28 08:42:37 +08:00
Roberto L. Castro
a201ad72d8
[Refactor][Kernel] Add global helper to deduplicate vectorized memory ops ( #35105 )
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
2026-02-27 16:28:17 -08:00
Rohan Potdar
e3691988d0
[ROCm]: fix aiter rope functionalization ( #35533 )
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-27 22:42:30 +00:00
Gregory Shtrasberg
9fa6c68fa6
[ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends ( #35334 )
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-27 21:32:55 +00:00
Aaron Hao
2ce6f3cf67
[Feat][RL][2/2] Native Weight Syncing API: IPC ( #34171 )
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-27 13:45:21 -07:00
Jakub Zakrzewski
1f3dbd95fd
[Bugfix][Model] Fix gpt-oss batch invariance ( #35404 )
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-02-27 20:41:24 +00:00
Lucas Wilkinson
1d532f9d8f
[DP] Only use DP padding when cudagraphs are actually used ( #34102 )
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-27 15:14:31 -05:00
Lucas Kabela
234a65b781
[Bugfix] Add monkeypatch to prevent race condition from writing ( #35420 )
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-02-27 14:51:36 -05:00
SteadfastAsArt
2decec9856
[Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 ( #34888 )
Signed-off-by: SteadfastAsArt <695488173@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-27 19:39:23 +00:00
Zhengxu Chen
29b35477b0
[compile] Fix caching error over pytree slice node. ( #35308 )
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-27 19:34:16 +00:00
Nick Hill
b1d9f5372d
[Model Runner V2] Warmup kernels ( #35172 )
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 10:43:30 -08:00
Raushan Turganbay
fd6de37fca
[BugFix] Fix 3D rope in transformers backend ( #35097 )
Signed-off-by: raushan <raushan@huggingface.co >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-27 18:34:49 +00:00
Netanel Haber
c8aca0c9e1
Support parakeet as audio encoder for nemotron-nano-vl ( #35100 )
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 11:07:38 -07:00
Martin Hickey
b602e4f299
[Doc] Fix link to Llama chat template for usability ( #35525 )
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-27 17:51:09 +00:00
Huamin Li
157722da75
[perf] Use pinned memory for async H2D transfer in do_mamba_copy_block ( #35480 )
Signed-off-by: Huamin Li <3ericli@gmail.com >
2026-02-28 01:50:37 +08:00
Nick Hill
1d897ff04f
[Misc] Fill in some v1 CODEOWNERS gaps ( #35524 )
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 09:34:37 -08:00
fort726
905d76b51d
[Model] Add huggingface skt/A.X-K1 model ( #32407 )
Signed-off-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com >
Signed-off-by: fort726 <38447663+fort726@users.noreply.github.com >
Co-authored-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-27 09:26:02 -08:00
Yanan Cao
9098ce690c
[Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching ( #34390 )
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-27 09:21:35 -08:00
Nick Hill
876312f0b5
[Core] Fix gpu_worker.py pre-commit errors ( #35312 )
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 07:54:24 -08:00
Boyuan Feng
5de98abc12
Add @BoyuanFeng to CODEOWNERS ( #35317 )
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2026-02-27 15:53:47 +00:00
Koushik Dutta
9251ed5c4f
[Bugfix] Handle case when kimi ends reasoning with a tool call ( #33646 )
Signed-off-by: Koushik Dutta <koushd@gmail.com >
Co-authored-by: mondaylord <20212010046@fudan.edu.cn >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 14:58:28 +00:00
Yueqian Lin
e8249378e4
[Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests ( #35487 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 06:48:25 -08:00
haosdent
6d4f9d3ad5
[Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp ( #35082 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-27 22:27:06 +08:00
Harry Mellor
fbe3f0120a
Revert "Add GlmOcrConfig for GLM-OCR model type recognition" ( #35512 )
2026-02-27 06:13:27 -08:00
Jason Li
66c1751d13
[compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism ( #35410 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
2026-02-27 08:36:37 -05:00
Tib
6467b635b6
[Bugfix] Add missing activation attr to RMSNormGated ( #35423 )
...
Signed-off-by: tibG <naps@qubes.milou >
Co-authored-by: tibG <naps@qubes.milou >
2026-02-27 12:53:35 +00:00
Max Hu
9c3fe9936b
Flashinfer cuDNN backend for Qwen3 VL ViT attention ( #34580 )
...
Signed-off-by: Max Hu <maxhu@nvidia.com >
Signed-off-by: Max Hu <hyoung2991@gmail.com >
Co-authored-by: Max Hu <maxhu@nvidia.com >
Co-authored-by: Shang Wang <shangw@nvidia.com >
2026-02-27 20:20:23 +08:00
Umut Polat
b66a74649e
[Bugfix] Replace assert with ValueError for response_format validation in completions endpoint ( #35456 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-27 08:01:06 +00:00
Wang Xingran
07bdabef03
[Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter ( #33088 )
...
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Signed-off-by: Hongjian Zhang <hirokenovo@gmail.com >
Co-authored-by: Hongjian Zhang <hirokenovo@gmail.com >
2026-02-27 07:06:08 +00:00
Chengyi Nie
a572baff5e
[Model Performance] Add Qwen3MoE tuned MoE configs for H200 ( #35457 )
...
Signed-off-by: Chengyi Nie <cnie@roblox.com >
Co-authored-by: Chengyi Nie <cnie@roblox.com >
2026-02-27 13:51:14 +08:00
zofia
516cf26698
[Bug] correct out dtype of rms_norm_gated native path ( #35369 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-27 05:19:51 +00:00
Jiangyun Zhu
487e5c51f7
[Bugfix] disable allreduce_rms_fusion by default when pp size > 1 ( #35424 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-27 04:18:52 +00:00
Daniel Huang
1a8c71674e
[BugFix] Repo utils debug print patch ( #35434 )
...
Signed-off-by: Daniel Huang <daniel1.huang@intel.com >
2026-02-27 03:50:56 +00:00
Wentao Ye
062b789632
[Bug] Fix outdated links in source code ( #35314 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-27 03:50:46 +00:00
gnovack
a532c83849
use 'max_active_experts' for moe lora input size ( #33197 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2026-02-27 03:50:43 +00:00
Jee Jee Li
1e5ad9b74f
[Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping ( #35413 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-26 19:46:30 -08:00
Nicolò Lucchesi
cabdaa7619
[Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils ( #35400 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-27 11:42:51 +08:00
Chenyaaang
06be53563b
[Core]Extract is_last_rank in Ray for tpu to override ( #33012 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2026-02-27 03:18:52 +00:00
Angela Yi
c29ee9c326
[compile] Invalidate cache for cpu flags ( #35119 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-02-27 02:54:11 +00:00
daniel-salib
d43048ce05
[Bugfix] Emit reasoning_part events in simple streaming path for Resp… ( #35184 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-02-27 09:49:06 +08:00
Michael Goin
4fec53cfcb
[CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI ( #34274 )
2026-02-26 17:58:03 -07:00
roikoren755
38c498b8e3
[Performance] Cublas Bf16 Gate with Fp32 Output ( #35121 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-26 16:51:28 -08:00
Andrii Skliar
56a6371706
[Update] Use FlashInfer fast_decode_plan directly instead of replication ( #34687 )
...
Signed-off-by: Andrii <askliar@nvidia.com >
Co-authored-by: Andrii <askliar@nvidia.com >
2026-02-26 16:31:43 -08:00
Pavani Majety
6283021142
[Bugfix] Fix KV Scale loading for MLA Models ( #35430 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-26 23:38:19 +00:00
Aleksandr Malyshev
01923eec70
[ROCm][Quantization] GPT OSS Upstream MoE wmxfp4_afp8 with static scales ( #30357 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2026-02-26 16:50:16 -06:00
pkousha
31fb6f43da
[Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds ( #33839 )
...
Signed-off-by: pkousha <43781676+pkousha@users.noreply.github.com >
Co-authored-by: Pouya Kousha <pkousha@login-eos01.eos.clusters.nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-26 14:35:58 -08:00
Tyler Michael Smith
eb19955c37
[WideEP] Remove pplx all2all backend ( #33724 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-26 14:30:10 -08:00
Lucia Fang
0f2f24c8b2
[Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism ( #35429 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-26 22:08:16 +00:00
sychen52
d0105b84f0
add mixed precision support for modelopt ( #35047 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com >
2026-02-26 21:56:24 +00:00
danielafrimi
832a780f3a
Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models ( #35396 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
2026-02-26 16:55:19 -05:00
ElizaWszola
98217b09f9
[Performance] Extract KV cache update op from flashinfer forward ( #35422 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2026-02-26 21:29:01 +00:00
不做了睡大觉
967572dd5f
fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning ( #35230 )
...
Signed-off-by: stakeswky <stakeswky@users.noreply.github.com >
Co-authored-by: stakeswky <stakeswky@users.noreply.github.com >
2026-02-26 20:30:45 +00:00
Woosuk Kwon
3d66502e1b
[Model Runner V2] Prepare attn metadata in ModelState [2/N] ( #35383 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-26 11:47:02 -08:00
Woosuk Kwon
c66aa48e99
[Model Runner V2] Add model states [1/N] ( #35350 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-26 11:20:35 -08:00
Nick Hill
b6d5a17298
[Model Runner V2] Fix error-handling ( #35063 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-26 11:00:19 -08:00
Lucas Wilkinson
5e58bdc711
[Bugfix] Remove erroneous lower bound on LoRA vocab size constraint ( #35354 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-26 18:44:50 +00:00
Runkai Tao
a1f53addb1
[BugFix] Align fused MoE-LoRA kernel config with actual weight shapes ( #34396 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-26 18:03:10 +00:00
Wentao Ye
05970c772c
[Refactor] Remove dead code for attention benchmark script ( #35418 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 09:53:46 -08:00
Yiliu Dong
d940607629
[Core] Support min_tokens with speculative decoding ( #32642 )
...
Signed-off-by: qianlihuang <yiliu.dong@qq.com >
Co-authored-by: qianlihuang <yiliu.dong@qq.com >
2026-02-26 12:31:28 -05:00
Wentao Ye
99c7892c5b
[Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement ( #35330 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 17:14:54 +00:00
hujia177
ec8f943db1
Add GlmOcrConfig for GLM-OCR model type recognition ( #34982 )
2026-02-26 17:04:42 +00:00
Or Ozeri
f2ad952f40
[BugFix][kv_offload]: Fix kernel block size detection ( #35125 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-02-26 16:29:34 +00:00
Sage Moore
9e2cabdf9c
[ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release ( #34387 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-02-26 16:28:45 +00:00
Douglas Lehr
ec8ab9d254
[ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers ( #34157 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
2026-02-26 10:00:49 -06:00
Wentao Ye
05972ea7e5
[Refactor] Remove dead or duplicate func utils or variables ( #35318 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 10:57:56 -05:00
Jakub Zakrzewski
111d869069
[Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model ( #35297 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-02-26 14:17:17 +00:00
stingoChen
7fea7250a4
[Bug] Fix missing <think> tag after tool call in MiniMax 2.1 ( #35352 )
...
Signed-off-by: 冬马 <chenxinke@cai-inc.com >
Co-authored-by: 冬马 <chenxinke@cai-inc.com >
2026-02-26 22:11:07 +08:00
Cyrus Leung
845ee348ef
[Misc] Standardize handling of mm_processor_kwargs.size ( #35284 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-26 13:05:46 +00:00
Asaf Gardin
ec13e549d3
[Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic ( #35275 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-26 12:22:06 +00:00
Li-Yongwen
c6ca51598a
[Bugfix] fix device_name for routing replay ( #34336 )
...
Signed-off-by: liyongwen <1310439159@qq.com >
2026-02-26 12:18:38 +00:00
Yueqian Lin
c0615a296d
[Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression ( #35368 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
2026-02-26 11:58:23 +00:00
Harry Mellor
01914445b0
Remove bc-lint ( #35274 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-26 03:01:01 -08:00
Kunshang Ji
5281713e11
[XPU] use fixed UMD version in dockerfile.xpu ( #35392 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 18:54:55 +08:00
HZY
32693db8ce
[Bugfix] [Qwen3.5]Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading ( #35289 )
...
Signed-off-by: daowu.hzy <daowu.hzy@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 18:26:15 +08:00
Akash kaothalkar
e03ddcfbd4
[Hardware][Powerpc]Enable prefix caching and chunked prefill for ppc64le ( #35081 )
...
Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash kaothalkar <akash.kaothalkar@ibm.com >
2026-02-26 10:21:24 +00:00
Sophie du Couédic
02acd16861
[Benchmarks] Plot benchmark timeline and requests statistics ( #35220 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-26 02:17:43 -08:00
Jiangyun Zhu
ab87f85231
[Model] Ring 2.5 ( #35102 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-26 02:17:11 -08:00
Krish Gupta
3827c8c55a
[Test] Add tests for n parameter in chat completions API ( #35283 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-26 09:14:07 +00:00
Kevin McKay
ade81f17fe
[Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash ( #35250 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-02-26 16:11:07 +08:00
Gregory Shtrasberg
6042e66cd5
[ROCm] Add extra step in config initialization to populate custom ops before compilation config init ( #34848 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-26 16:05:40 +08:00
Chaojun Zhang
9f9a675b23
[XPU][8/N] Fix kernel bugs in XPU LoRA and MOE LORA ( #34115 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 15:46:44 +08:00
Ofir Zafrir
a07c4c5939
[BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 ( #35298 )
...
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 07:15:16 +00:00
Cyrus Leung
d3a51da92a
[Benchmark] Simplify SLA scan ( #35306 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-25 22:35:41 -08:00
Flora Feng
186ea22efe
[Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py ( #35339 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-26 14:35:16 +08:00
Daniele
4a9c07a0a2
[BugFix] anthropic/serving_messages: fix tool call arguments streaming ( #34887 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-26 05:39:48 +00:00
Jason Li
9d37941017
[torch.compile] Sequence Parallelism threshold compile ranges ( #28672 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-26 05:00:12 +00:00
Fadi Arafeh
4171ff6dd9
[CPU][Feat] Enable KleidiAI INT8_W4A8 for all input dtypes ( #34890 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-26 05:00:10 +00:00
Woosuk Kwon
13025e71e8
[Model Runner V2] Add coding style guide ( #35325 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-25 20:42:40 -08:00
Hanjie Qiu
71dfce6aa6
[Kernel] Refactor FlashInfer allreduce for mnnvl backend ( #34109 )
...
Signed-off-by: hjjq <50634613+hjjq@users.noreply.github.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com >
2026-02-26 03:17:20 +00:00
hujiaxin0
2aa4140402
openpangu-vl support video input ( #34134 )
...
Signed-off-by: hujiaxin <524446785@qq.com >
Signed-off-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 03:08:09 +00:00
Roberto L. Castro
86c3b5a808
[BugFix] Fix fp4 quant kernel on CUDA 12.8 ( #35210 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-02-25 18:32:50 -08:00
Seungmin Kim
160424a937
[Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs ( #33992 )
...
Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com >
Signed-off-by: Andrew Mello <19512127+88plug@users.noreply.github.com >
Co-authored-by: 88plug <19512127+88plug@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 18:15:51 -08:00
Lucas Wilkinson
9511a3f8ee
[Bugfix] Fix AttributeError in SMControlContextManager ( #35338 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-25 18:01:10 -08:00
Michael Goin
de527e1cec
[UX] Add --moe-backend arg for explicit kernel selection ( #33807 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-25 17:44:44 -08:00
Yongye Zhu
1976356ee6
[MoE Refactor] MXFP4 Cutlass Experts to MK ( #34542 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2026-02-25 17:32:39 -08:00
Michael Goin
cbf8f7028c
[UX] Add --performance-mode {balanced,interactivity,throughput} ( #34936 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-25 17:28:31 -08:00
Ming Yang
6831650c40
[offloader] v2: Hide weight onloading latency via prefetching ( #29941 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 17:20:59 -08:00
Andreas Karatzas
ed42507f6d
[ROCm][CI] Amending deletion of AMD mirror ( #35322 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:17:56 -08:00
Andreas Karatzas
9571e99945
[ROCm][CI] Extending attention backend coverage for Eagle spec decode tests ( #35265 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:16:18 -08:00
Elizabeth Thomas
c97234c08b
fix(mxfp4): Disable monolithic path for TRITON backend with EP ( #34270 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 13:33:42 -08:00
rasmith
b188bab441
[CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor ( #34985 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-25 19:18:00 +00:00
Lucas Wilkinson
15d76f74e2
Revert "[Misc] Enable weights loading tracking for quantized models" ( #35309 )
2026-02-25 09:20:15 -08:00
Andreas Karatzas
8fd6975479
[ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results ( #35049 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 16:48:37 +00:00
pushkar
5d18bf8b32
[Bugfix] Fix Harmony preamble visibility in Responses API ( #32114 )
...
Signed-off-by: Pushkar Patel <git@thepushkarp.com >
Signed-off-by: pupa <pupa@users.noreply.github.com >
2026-02-25 08:08:16 -08:00
haosdent
0788ff0a15
[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support ( #35085 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-25 07:31:45 -08:00
Chendi.Xue
d72b0be33c
[XPU]Fix for Qwen-OMNI crash ( #35249 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-02-25 07:31:07 -08:00
Bhoomit
42489e43c2
[Misc][LoRA] Increase max vocab size limit to 258048 in logits processor ( #34773 )
...
Signed-off-by: Bhoomit Vasani <vbhoomit@amazon.com >
2026-02-25 23:30:55 +08:00
Mario Hong
af5e6afa0a
[Bugfix] Fix step3p5 reasoning with interleaved thinking ( #34211 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-25 15:13:01 +00:00
Benjamin Chislett
ee59a7c615
[Tests] Add GSM8k check to SpecDec E2E tests ( #34772 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-25 07:51:14 -05:00
Joao Gante
709eadbb0b
Doc link typo ( #35281 )
...
Signed-off-by: Joao Gante <joaofranciscocardosogante@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-25 03:00:31 -08:00
Harry Mellor
90fc7f9109
Fix custom processors that use deleted behaviour for Transformers v5 ( #35107 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 02:36:21 -08:00
Yanwen Lin
675ec59aa9
[Bugfix][CPU] Fix basic unit tests failing in CPU platforms ( #34677 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:36:15 +00:00
Yanwen Lin
80e60a6133
[Doc] Suggest "--managed-python" flag when installing python using uv ( #33069 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
2026-02-25 08:19:43 +00:00
jonoillar
26e722f906
[DOC][BugFix] Specify build dependency installation ( #34513 )
...
Signed-off-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
Co-authored-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
2026-02-25 08:04:06 +00:00
lichuang
2c619e5e3f
[Docs]Fix documentation formatting in architecture overview ( #34679 )
...
Signed-off-by: codedump <lichuang1982@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:00:15 +00:00
Simon Mo
8a685be8d9
docs: document committer proposal process in governance ( #35225 )
...
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-25 07:58:48 +00:00
Laura Wang
2465071510
[Perf] Add opt-in SM100 Oink RMSNorm custom-op path ( #31828 )
...
Signed-off-by: Laura Wang <3700467+Laurawly@users.noreply.github.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-24 23:01:53 -08:00
wenshuai
cd43673668
[Perf] Optimize FP8 gemm of sm120. ( #34424 )
...
Signed-off-by: wenshuai <wenshuai@xiaomi.com >
2026-02-24 22:25:24 -08:00
Xinyu Chen
35d44b4557
[XPU]Support CUDAGraph on XPU Platform ( #34482 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:52 -08:00
Kunshang Ji
8ad54a991b
[Platform] Add current_platform.num_compute_units interface ( #35042 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-02-24 22:22:49 -08:00
Kunshang Ji
92510edc32
remove cuda check in top_k_top_p_triton kernel ( #35011 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:31 -08:00
Isotr0py
a6c137521c
[Misc] Add shard_id validation for MergedColumnLinear ( #35055 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:12:28 -08:00
Isotr0py
4572a06afe
[Misc] Enable weights loading tracking for quantized models ( #35074 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:11:03 -08:00
Zhengxu Chen
5cc29cfb8b
[compile] Improve error message during artifacts load failure. ( #35115 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 22:01:09 -08:00
Chen Zhang
8fae54faff
[Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache ( #35157 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-24 22:00:19 -08:00
Harry Mellor
f7967577f5
Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM ( #35203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 22:00:06 -08:00
pks
af770b8e7b
[Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest ( #35237 )
...
Signed-off-by: Patrick Simianer <patrick@lilt.com >
2026-02-24 22:00:03 -08:00
Andreas Karatzas
2ff3e436ad
[Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors ( #35231 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 05:52:44 +00:00
Jhao-Ting Chen
c2c4c4611a
[FIX] fused moe with lora shared expert dual stream (1.07x otps) ( #34933 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 04:40:45 +00:00
Rohan Potdar
f38f8c9742
[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE ( #35180 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-25 04:36:40 +00:00
Flora Feng
ec1d30c0f6
[Responses] Decouple SSE event helpers from Harmony context ( #35148 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-24 20:05:25 -08:00
Pooya Davoodi
e3b2324ec4
[Frontend] Use init_app_state and FrontendArgs in run_batch ( #32967 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-24 19:40:39 -08:00
Nick Hill
dbf0da817a
[Core] Cleanup engine pause/sleep logic ( #34528 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-24 19:33:34 -08:00
Xin Yang
3bbb2046ff
[Bugfix] Fix expert_ids padding values in moe_align_block_size kernel ( #35161 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-24 17:14:24 -08:00
yugong333
576fe50333
Adding Nemotron fp8 Triton MoE Config ( #34674 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-24 15:56:38 -08:00
Hashem Hashemi
a0e50a4260
Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. ( #34100 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-24 23:35:21 +00:00
Benjamin Chislett
9fa5b25a23
[Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention ( #35075 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-24 14:55:22 -08:00
Robert Shaw
ea97750414
[CI] Fix Distributed Tests ( #35236 )
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-24 22:31:56 +00:00
Andreas Karatzas
067c5d9ad1
[ROCm][CI] Added MI325 mirrors ( #34923 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-24 13:37:15 -08:00
Benjamin Chislett
f5972a872f
[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support ( #33726 )
...
Signed-off-by: Shahar Mor <smor@nvidia.com >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Shahar Mor <smor@nvidia.com >
Co-authored-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-24 09:49:56 -08:00
Matthew Bonanni
a9e15e040d
Add @MatthewBonanni to CODEOWNERS ( #35207 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-24 10:45:10 -07:00
Lucas Wilkinson
542ca66357
Revert "[CI/Build] Remove redundant OpenTelemetry pip install from CI configs" ( #35211 )
2026-02-24 09:26:42 -08:00
Cyrus Leung
fc8456c336
[CI/Build] Fix kernels test location ( #35205 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 09:20:34 -08:00
Wentao Ye
9ce8fad2a9
[Perf] Optimize Python Slice for Structured Output using islice instead of [:] ( #33593 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-24 09:02:36 -08:00
Harry Mellor
c38b8d5a31
Remove padding_index from models that don't use it for better Transformers v5 compatibility ( #35189 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 08:04:46 -08:00
Robert Shaw
60da0e1544
[CI] Remove Duplicated Tests ( #35199 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-24 23:53:30 +08:00
danisereb
9609b1f18d
Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 ( #35053 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 08:45:13 -07:00
danisereb
a0c7081695
Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe ( #35088 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 07:25:44 -08:00
R3hankhan
34ce0ffd1f
[CPU][Perf] Accelerate Attention head for s390x using vector intrinsics ( #34434 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-24 07:25:39 -08:00
Robin Nabel
0de5333989
Fix GLM4 parser tests ( #34905 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-24 22:27:42 +08:00
Eldar Kurtić
a87cc50859
[Attn,KV-cache] Use per-head scales in the attention selector ( #34281 )
...
Signed-off-by: Your Name <you@example.com >
Signed-off-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Your Name <you@example.com >
2026-02-24 09:02:43 -05:00
Cyrus Leung
761e63e541
[Frontend] Always pass supported_tasks to validation ( #35186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 04:16:33 -08:00
Isotr0py
d12d201409
[Bugfix] Fix failing FunASR processor test ( #35111 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 04:13:45 -08:00
eustlb
b3ad37c5db
[glm-asr] change defaults dummy audio size ( #35108 )
...
Signed-off-by: Eustache Le Bihan <eulebihan@gmail.com >
2026-02-24 04:13:33 -08:00
Wentao Ye
14561fabfd
[Perf] Optimize pooling model redundant copy, 1.8% throughput improvement ( #35127 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-24 04:13:11 -08:00
Zhengxu Chen
c77f3e1207
[compile] Save aot compile artifacts atomically. ( #35117 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 04:11:01 -08:00
Dor Huri
012dee9233
[Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) ( #35147 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-24 04:10:32 -08:00
Tugsbayasgalan Manlaibaatar
f1c664545b
Make voxtral compile friendly ( #33959 )
...
Signed-off-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-24 09:33:35 +01:00
Xin Yang
c870eb9e0f
[LoRA] Update LoRA expand kernel block_n calculation ( #32621 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 23:17:53 -08:00
BadrBasowid
6af03f2394
[Refactor] [1/N] Reorganize kernel abstraction directory ( #34055 )
...
Signed-off-by: BadrBasowid <badr.basowid@gmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-24 06:47:22 +00:00
Vlad Tiberiu Mihailescu
1a6cf39dec
[CI/Build] Remove redundant OpenTelemetry pip install from CI configs ( #35032 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-23 22:24:11 -08:00
Nicolò Lucchesi
f91808ae0d
[MM] Allow audio chunking for offline LLM ( #34628 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 21:04:28 -08:00
Vadim Gimpelson
33a0d43c71
[BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable ( #35156 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-23 19:42:24 -08:00
pschlan-amd
80d93fd6da
gpu_model_runner: Cache is_encoder_decoder from model config ( #35099 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-02-23 19:08:34 -08:00
Jia Guo
ec85340531
[Quantization] Support FP8 MoE bias for models like GPT-OSS ( #34906 )
...
Signed-off-by: jasperjiaguo <jasperg662@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-23 19:07:47 -08:00
Rohan Potdar
2ff4e51152
[ROCm] AITER fused RoPE+KVCache ( #33443 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: charlifu <charlifu@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
2026-02-23 19:06:00 -08:00
Asaf Gardin
95642441d0
[Mamba1] - Change supports_update_block_table to True ( #35054 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-23 19:05:57 -08:00
Xin Yang
a7c9f7b7ec
[Bugfix] Fix lora_ids in FusedMoE LoRA test ( #35135 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 21:49:25 -05:00
Michael Goin
a4bd661fb3
[Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default ( #34924 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 17:34:41 -08:00
Michael Goin
3ef9fd0f98
[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches ( #35123 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-23 17:11:27 -08:00
Michael Goin
22a97e6613
[Perf] Improve default triton fused moe configs ( #34846 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 16:01:28 -08:00
Aaron Hao
596ed1f02e
[RL] Validation for pause_mode='keep' ( #34992 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-23 16:30:56 -05:00
Nicolò Lucchesi
b8d8b7e934
[Misc] Monitor interface changes ( #35113 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 17:14:51 +00:00
Harry Mellor
28c5e69ba0
Enforce that model is the first positional arg when --served-model-name is used ( #34973 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:05 -08:00
Harry Mellor
864167d376
Fix custom processors that use deleted import for Transformers v5 ( #35101 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:00 -08:00
haosdent
a2ba6a5244
[Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) ( #34874 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 17:31:51 +01:00
Harry Mellor
c4f38696f7
Use Xet high performance mode for Transformers v5 ( #35098 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:19:30 -08:00
haosdent
a7f341c323
[Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling ( #35080 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 16:05:52 +00:00
Robert Shaw
d13ece38d7
[CI] Skip Responses API ( #34990 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 07:46:45 -08:00
Mark McLoughlin
5cc7c4452e
[Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) ( #30950 )
...
Export the existing Model FLOPs Utilization (MFU) metrics via Prometheus.
`--enable-mfu-metrics` is required for these to be exposed.
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-23 15:01:07 +00:00
Eldar Kurtić
b95bb6927f
[kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies ( #34254 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-23 07:37:55 -07:00
Cyrus Leung
392645454b
[Refactor] Decouple TimingContext from InputProcessingContext ( #35083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-23 14:15:50 +00:00
Eldar Kurtić
1e8438a89a
[Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests ( #35033 )
...
Signed-off-by: Eldar Kurtic <you@example.com >
Co-authored-by: Eldar Kurtic <you@example.com >
2026-02-23 09:04:34 -05:00
Robert Shaw
8435b2e049
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) ( #34302 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 14:02:26 +00:00
Yan Ma
b1b5e045df
[XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend ( #35010 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-02-23 05:06:44 -08:00
Andreas Karatzas
5f68464f92
[ROCm][CI] Fix spec decode profile assertion and logprob test determinism ( #35043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-23 05:05:54 -08:00
Vincent Gimenes
aa08a30fc9
[CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig ( #35060 )
...
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com >
2026-02-23 05:05:36 -08:00
Wentao Ye
7f40e9e516
[Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item ( #35068 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-23 05:05:20 -08:00
Harry Mellor
103e614b14
Fix pipeline parallel with embed scaling in the Transformers modelling backend ( #35094 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 05:04:47 -08:00
Neil Schemenauer
54e2f83d0a
[Feature] Lazy import for the "mistral" tokenizer module. ( #34651 )
...
Signed-off-by: Neil Schemenauer <nas@arctrix.com >
2026-02-23 00:43:01 -08:00
Gabe Goodhart
e631f8e78e
fix: Apply embedding_multiplier to inputs_embeds ( #34813 )
...
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-23 00:42:46 -08:00
Martin Hickey
e97c46a92d
[BugFix]: Fix local mypy issues ( #34739 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 00:40:29 -08:00
Jee Jee Li
7291d1b288
[Bugfix] Fix kernel benchmark ( #33752 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-22 21:18:08 -08:00
Cyrus Leung
987506bca6
[Refactor] Simplify dummy data generation ( #35025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-22 20:55:27 -08:00
Woosuk Kwon
c645e9a214
[Model Runner V2] Remove propose_draft method ( #35070 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 18:27:12 -08:00
Nick Hill
944ffb5968
[Model Runner V2][Minor] Remove redundant do_spec_decode field ( #35039 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 16:18:04 -08:00
qizixi
2bcf71b9c0
[Spec Decode] Reduce TP communication for speculative decoding draft token generation ( #34049 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 14:59:16 -08:00
tacos8me
b7892a3bef
[Model] Add NVFP4 quantization support for Step3.5-Flash ( #34478 )
...
Signed-off-by: tacos8me <ian@cloudhabit.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-22 12:30:46 -07:00
Benjamin Chislett
682566b18e
[Bug] Refactor max_num_batched_tokens to account for drafting ( #34898 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-22 11:18:46 -05:00
qizixi
b9c2a565cc
[Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup ( #34529 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 08:08:32 -08:00
Andreas Karatzas
dd8c3a7fb2
[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays ( #35052 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 10:07:18 +00:00
Andreas Karatzas
a8a47c17b6
[ROCm][CI] Fix flaky embedding chat test by using tolerance-based comparison ( #35050 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 09:03:44 +00:00
Roger Wang
40f88d8318
[Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser ( #34779 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-21 23:15:35 -08:00
Woosuk Kwon
2cbf9656ce
[Model Runner V2] Enable CUDA graph for Eagle3 ( #35040 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 21:42:50 -08:00
Xiao Li
30132cd144
Fix apply_top_k_top_p_triton called by non-cuda logits Tensor ( #35030 )
...
Signed-off-by: Xiao Li <ilx@meta.com >
2026-02-21 21:11:54 -08:00
Cyrus Leung
cbd95a2dd1
[Benchmark] Use sns.relplot for plotting ( #35027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 20:26:48 -08:00
Athrael Soju
970861ac0c
[New Model] Add ColModernVBERT ( #34558 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Signed-off-by: athrael-soju <athrael-soju@users.noreply.github.com >
2026-02-22 12:23:41 +08:00
Wentao Ye
d24bdd7c4b
[CI] Bump mteb version to mteb[bm25s]>=2, <3 for pooling model unit tests ( #34961 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-21 20:23:24 -08:00
Andreas Karatzas
d403c1da1c
[CI] Stabilizing ROCm amd-ci signal and minor name fix in upstream ( #35008 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 04:01:10 +00:00
Woosuk Kwon
b71fbd06e2
[Model Runner V2] Support attention group ( #35036 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 16:42:53 -08:00
Vadim Gimpelson
74d90b1ce4
[Model Bash][DSR1] Add selective dynamic shape marking for CustomOp ( #34900 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-21 19:28:01 -05:00
Woosuk Kwon
a4047d4ea9
[Model Runner V2] Support Eagle3 (no CUDA graph) ( #35029 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 12:55:24 -08:00
Cyrus Leung
965fe45935
[CI/Build] Fix gRPC version mismatch ( #35013 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 12:14:41 -07:00
Roman
98b0205c3c
[Frontend] Add automatic language detection for Whisper transcription ( #34342 )
...
Signed-off-by: space_check <roman.vuskov@rwth-aachen.de >
Signed-off-by: Roman <45857014+spacecheck@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-21 04:49:41 -08:00
Huy Do
272b535ab3
[Bugfix] Gate 256-bit instructions to CUDA 12.9+ ( #34791 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 04:48:14 -08:00
Cyrus Leung
f74f1572ca
[Benchmark] Improve benchmarks ( #35012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 10:31:58 +00:00
petrpechman
bebfe55b1c
[Doc] Fix example of eagle3 ( #34960 )
...
Signed-off-by: Petr Pechman <petr.pechman@firma.seznam.cz >
Co-authored-by: Petr Pechman <petr.pechman@firma.seznam.cz >
2026-02-21 09:57:53 +00:00
Nick Hill
820d7815eb
[Core] Minor structured-output related scheduler optimization ( #34765 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:38:28 -08:00
Nicolò Lucchesi
ab6f3487a6
[PD] Change kv_load_failure_policy Default from "recompute" to "fail" ( #34896 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:34:57 -08:00
BADAOUI Abdennacer
8dc8a99b56
[ROCm] Enable bitsandbytes quantization support on ROCm ( #34688 )
...
Signed-off-by: badaoui <abdennacerbadaoui0@gmail.com >
2026-02-21 00:34:55 -08:00
jennyyyyzhen
2aab2bb543
[ROCM] Optimize ROCM_AITER_FA spec decode eagle performance ( #34541 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
2026-02-20 20:32:05 -08:00
Andreas Karatzas
54254f7a61
[ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends ( #34599 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:23 -08:00
Andreas Karatzas
cf93c1a128
[ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64 ( #34570 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:07 -08:00
Andreas Karatzas
89358f0d35
[CI] Fix ColBERT HF comparison tests on AMD CI + refactor ( #34567 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:12:05 -08:00
zhongdaor-nv
a0fe7ea2f0
[feat] Add per-block extra_keys to KV events ( #33304 )
...
Signed-off-by: zhongdaor-nv <zhongdaor@nvidia.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 20:11:40 -08:00
Andreas Karatzas
991d6bff38
[CI][MCP][Harmony] Heavy refactoring Harmony & MCP response tests and stabilizing with deterministic test infrastructure ( #33949 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:03:32 -08:00
Kata Coder
5719a4e4e6
[Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) + NewModel: nvidia/nemotron-colembed ( #34574 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
2026-02-20 20:01:40 -08:00
pougetat
11be2c74dc
[Realtime] Add Qwen3-ASR realtime streaming support ( #34613 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-20 19:59:42 -08:00
Xin Yang
7a5adad480
[Kernel] Optimize sample_recovered_tokens_kernel ( #34974 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 19:59:06 -08:00
Li
59c6233297
Support prompt_embeds for pooling requests in output processor ( #34904 )
...
Signed-off-by: Li Zhang <lzhanga@amazon.com >
Co-authored-by: Li Zhang <lzhanga@amazon.com >
2026-02-20 19:57:38 -08:00
Taneem Ibrahim
d38cd3dde5
[Misc] Fix mypy errors in vllm/profiler and remove from exclude list ( #34959 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-02-20 19:56:33 -08:00
Rohan Potdar
ded333fb9b
[ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion ( #34636 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-20 19:56:16 -08:00
Yanan Cao
9d7577b2bd
[Kernel] [Helion] [9/N] Canonicalize GPU variant names to base model names ( #34928 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-20 19:55:51 -08:00
Vlad Tiberiu Mihailescu
e739c29ea4
[CI/Build] Add opentelemetry libs in default vllm build (requirements/common.txt) ( #34466 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-20 19:54:55 -08:00
yugong333
a55caf6ae9
[LoRA] Support Quantized Adapters ( #30286 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Signed-off-by: wz1qqx <ziqi.wang@novita.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: wz1qqx <55830058+wz1qqx@users.noreply.github.com >
Co-authored-by: wz1qqx <ziqi.wang@novita.ai >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 19:54:35 -08:00
Lucas Wilkinson
0e22cd618b
Revert "[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers" ( #34997 )
2026-02-20 17:19:19 -08:00
Wei Zhao
ea5f903f80
Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion ( #34899 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 13:37:31 -08:00
Ryan Rock
0632ed8778
[AMD][CI] Fix test_custom_allreduce for A100 testgroup ( #34735 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-02-20 21:33:04 +00:00
Lucas Wilkinson
aaefc58ee0
[CI] Revert PRs 34818 and 33600 ( #34979 )
2026-02-20 13:25:50 -08:00
Wei Zhao
f24b2de3d3
[Test] Add FP8 KV Cache Testing for MLA Backends ( #34473 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-20 18:51:58 +00:00
Michael Goin
fac1507f03
[CI] Remove failing prime-rl integration test ( #34843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-20 10:17:42 -08:00
Zhengxu Chen
f863994084
[compile] Fix torch.compile time discrepancy in logging. ( #34912 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 08:47:14 -08:00
Zhengxu Chen
e4a5d8c653
[compile] Move torch_aot_compile directory under torch_compile_cache ( #34831 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-20 08:46:45 -08:00
Yanan Cao
a6d0299c75
[Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching ( #34185 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-20 08:36:51 -08:00
Harry Mellor
6ce80f7071
Ensure that MkDocs v2 does not get installed ( #34958 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-20 15:38:11 +00:00
Huamin Li
1fe462168c
[perf] Avoid dtype promotion sync in mamba_get_block_table_tensor ( #34870 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:21:56 -08:00
Flora Feng
ed31a020ee
[Refactor] Extract Harmony streaming SSE event builders into streaming_events.py ( #34909 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:20:46 -08:00
Cyrus Leung
f9ac19204f
[V0 Deprecation] Remove unused MM placeholders in request output ( #34944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-20 06:19:23 -08:00
Vadim Gimpelson
59965affbd
[BUGFIX] Fix _dummy_run missing prepare_inputs_event synchronization ( #34866 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-20 05:54:27 -08:00
Xin Yang
b1c4f0b265
[Kernel] Optimize grouped topk kernel ( #34206 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 01:34:45 -08:00
Kevin McKay
8de7c636cc
[Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support ( #32877 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-19 22:25:46 -08:00
Frank Wang
059779231f
[Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend ( #34916 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-19 22:07:57 -08:00
tianshu-Michael-yu
ea37530b47
[Models] LFM2: Support LoRA ( #34921 )
...
Co-authored-by: Piotr Mazurek <piotr635@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-19 22:07:23 -08:00
Micah Williamson
f5432e35a3
[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout ( #34922 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-20 05:37:49 +00:00
杨朱 · Kiki
07cab212f0
[Misc] Add deprecated environment variable utilities ( #33677 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-19 21:33:25 -08:00
rasmith
0c1dc42748
[CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass ( #33739 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-19 21:32:40 -08:00
Varun Chawla
676f82ae81
Add validation to reject non-text content in system messages ( #34072 )
...
Signed-off-by: Varun Chawla <varun_6april@hotmail.com >
2026-02-19 21:30:33 -08:00
Elizabeth Thomas
81bfc21a6a
[Model Bash]: Improve FP8 Oracle for Config Specific Kernel Selection ( #34260 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Signed-off-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-19 21:29:08 -08:00
Matthias Gehre
4e2c7caf2d
[Bugfix] Add regression test for MoE quant_config under torch.compile ( #34335 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-20 13:27:26 +08:00
Bowen Bao
d9e62c03eb
[Quark] Fix MoE fp8 activation scale handling on mi300 ( #34386 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-02-19 21:27:14 -08:00
Kevin H. Luu
a1a2d79442
[ci] Use the right tag for CPU arm64 image ( #34915 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 19:59:15 -08:00
Cyrus Leung
ac900c89bb
[Refactor] Implement output type check in LLM ( #34794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:57:55 -08:00
Mark McLoughlin
76df6072ff
[Core] Fix state names in pause_scheduler() ( #34840 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-19 17:21:46 -08:00
Michael Goin
16f24e8797
[CI] Add GPT-OSS Eval job for H100 ( #34359 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 17:14:54 -08:00
Nick Hill
40b2f1c3d9
[Model Runner V2] Minor CPU optimizations ( #34856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-19 16:05:37 -08:00
Mayank Ketkar
648951a9c3
[Bugfix] Fix benchmark_fused_collective crash on CustomOp init ( #34665 )
...
Signed-off-by: Mayank Ketkar <mketkar@zoox.com >
Signed-off-by: Mayank Ketkar <mayket04@gmail.com >
Co-authored-by: Mayank Ketkar <mketkar@zoox.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-19 19:01:00 -05:00
Michael Goin
f72061a19a
[UX] More descriptive reasons in is_supported_config for MoE ( #34908 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-19 15:20:52 -08:00
Matthew Bonanni
662205d34e
[Bugfix] Fix Basic Models Test ( #34818 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 14:49:07 -08:00
Roger Wang
4fb8beefaa
[Bugfix] Fix cutlass fp8 kernel on hopper for Qwen3.5 ( #34914 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-19 13:34:55 -08:00
Alexei-V-Ivanov-AMD
304319c4ed
Change targets for AMD build in the "CI" pipeline ( #34918 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-19 21:26:53 +00:00
Wentao Ye
c683d11c94
[Refactor] Deprecate head_first for chunk_gated_delta_rule ( #34263 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-19 13:23:49 -05:00
roikoren755
3eff45d793
Revert "[NemotronH] Do not force router to run in fp32 ( #34582 )" ( #34808 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 09:47:05 -08:00
Robert Shaw
4685a630a2
[Model Bash][DeepSeekR1] Remove Shared Expert Clone ( #34344 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-19 07:56:14 -08:00
Eldar Kurtić
ee1d25f199
[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers ( #34471 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 07:55:41 -08:00
Linda
6fff24f30f
[Bugfix] Qwen3.5 kv-scale weight remapping ( #34719 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
2026-02-19 04:13:37 -08:00
Cyrus Leung
23210a911e
[CI/Build] Try to make beam search test less flaky ( #34885 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:16:58 +08:00
Cyrus Leung
1391378861
[Bugfix] Fix edge case in UUID data parsing ( #34884 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 02:24:30 -08:00
Andreas Karatzas
f6220f9877
[ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker ( #34878 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 08:25:26 +00:00
Andreas Karatzas
2df2bb27b0
[ROCm][CI] Removing all blocking labels from MI355 until stable infra ( #34879 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 07:53:08 +00:00
Tal Nir
f75b61a9e9
[Voxtral Realtime] Fix engine crash on empty multimodal embeddings ( #34862 )
...
Signed-off-by: Tal Nir <tal@nervexneurotech.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-18 23:21:47 -08:00
Wei Zhao
7f51e93864
[Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix ( #34876 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-18 23:20:30 -08:00
Alex Brooks
4611af1663
[Bugfix] Add Quant Config to Llava Next Projector ( #34847 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-02-18 23:18:23 -08:00
Manrique Vargas
ad5aa6bd9f
fix(docs): fix typos in comments and docstrings ( #34836 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-02-18 23:17:41 -08:00
Jaeyeon Kim(김재연)
9681068cf9
[Frontend] Fix reasoning_tokens for text-based parsers in Responses API ( #33513 )
...
Signed-off-by: Jaeyeon Kim <anencore94@gmail.com >
2026-02-18 23:16:41 -08:00
Kevin H. Luu
b6101d384d
Deprecate test-pipeline.yaml ( #34864 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 02:15:27 +00:00
Woosuk Kwon
5fcb0cdd68
[Model Runner V2] Use FP32 for Gumbel Noise ( #34854 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 17:07:37 -08:00
Woosuk Kwon
c878b43b64
[Model Runner V2] Remove unnecessary copies in PW CUDA graph capture ( #34849 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 15:52:50 -08:00
rasmith
2b84ac669c
[CI][AMD][BugFix] Use torch.testing.assert_close instead of assert torch.allclose in test_rocm_skinny_gemms.py ( #34181 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 23:10:19 +00:00
zhrrr
11d3976b88
[Model Runner V2] support piecewise & mixed cudagraph ( #32771 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-02-18 15:03:17 -08:00
Yongye Zhu
40da9625a1
[MoE Refactor] Convert mxfp4 marlin into modular kernel format ( #34588 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 14:37:14 -08:00
Flora Feng
8d9babd4de
Fix empty tool_call_id in Anthropic messages API tool result conversion ( #34745 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Flora Feng <sfeng33@h100-01.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-18 14:31:59 -08:00
Aaron Hao
e99ba957ec
[BUG] Fixing Weight Sync unit test ( #34841 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-18 17:20:10 -05:00
Kyle Sayers
64ac1395e8
[Docs] Clean up speculators docs ( #34065 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-02-18 13:48:11 -08:00
Cyrus Leung
61cf087680
[Bugfix] Fix lora tests ( #34834 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-18 13:22:31 -08:00
Wenlong Wang
847a57cd12
[Bugfix][MoE Kernel] Fix incorrect routing selection for models without expert groups (e.g., MiniMax-M2.1) ( #34673 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 13:03:24 -08:00
rasmith
fcd6ac97ed
[CI][AMD][BugFix] Skip tests in test_unquantized_backend_selection that should not run on ROCm ( #34655 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 15:00:40 -05:00
Woosuk Kwon
95be2a7f22
[Model Runner V2] Minor simplification for DCP ( #34786 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 11:04:53 -08:00
Jaden Mathias
0e60c925cf
[Bugfix] Remove assert causing hipErrorStreamCaptureUnsupported ( #34455 )
...
Signed-off-by: Jaden Mathias <jaden.mathias@amd.com >
2026-02-18 18:54:54 +00:00
Teng Ma
d7ff22204a
[Misc] Add mooncake-transfer-engine to kv_connectors requirements ( #34826 )
...
Signed-off-by: Teng Ma <teng-ma@linux.alibaba.com >
2026-02-18 18:26:24 +00:00
Isotr0py
c0bd8b13da
[Bugfix] Redo Qwen3.5/Qwen3-Next GDN projector fusion ( #34697 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
2026-02-18 09:46:53 -08:00
Michael Goin
caeb887bf6
[Bugfix] Fix NVFP4 TRTLLM MoE non-gated support; add gsm8k for Nemotron-3-Nano FP8+NVFP4 ( #34725 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-18 09:39:22 -08:00
Ilya Markov
6b3166a7c7
[CI][Bugfix] Fix multinode test script ( #34820 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-18 11:45:10 -05:00
Robert Shaw
25e2e136ef
[CI] temporarily disable multi-node tests ( #34825 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 11:32:44 -05:00
Robert Shaw
6874638bc4
[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) ( #34758 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 07:42:36 -08:00
Burkhard Ringlein
e24663c5a9
Add unit tests for fp8 output fusion of triton_attn ( #34228 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-18 06:22:49 -05:00
Nick Hill
c50e105a88
[Model Runner V2] Avoid prepare prefill kernel launch overhead ( #34780 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-18 00:49:21 -08:00
Cyrus Leung
a766b30349
[Renderer] Deprecate code paths for old input processing ( #34775 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 00:35:04 -08:00
Asaf Joseph Gardin
1faa8cb73c
[Quantization] - Added uses_meta_device_weights to quant config ( #34645 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-17 23:43:44 -08:00
Marek Michalowski
e89a91d927
[Bugfix] fix activation in cpu_fused_moe_torch call ( #34696 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-17 23:39:46 -08:00
Michael Goin
909b147197
[Bugfix] Fix prefix creation for Qwen3.5 ( #34723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-17 23:39:15 -08:00
ElizaWszola
a88b3be7c4
[Bugfix] Fix quant RMS norm fusion for quantization with TMA-aligned scales ( #33255 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-17 23:35:04 -08:00
Nick Hill
a49ea5a58f
[Model Runner V2] A bit more PP simplification ( #34766 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 21:39:07 -08:00
Cyrus Leung
30ebe0dc3c
[CI/Build] Remove use of skip_v1 ( #34699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 12:19:11 +08:00
Andreas Karatzas
cef65f0715
[ROCm][CI] Removed hard-coded attn backend requirement for Qwen VL ( #34753 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-18 03:59:53 +00:00
Russell Bryant
6f3b2047ab
[Core] Fix SSRF bypass via backslash-@ URL parsing inconsistency ( #34743 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: isotr0py <2037008807@qq.com >
2026-02-18 03:53:35 +00:00
Luka Govedič
02e8f26cea
[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ ( #34718 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-18 03:29:15 +00:00
Hongxia Yang
4a00a511bb
[BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site ( #34653 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-02-17 19:19:41 -08:00
Cyrus Leung
a0d8d944e2
[Renderer] Move MM Hash parsing into Renderer ( #34711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 19:18:55 -08:00
Amr Mahdi
df3f537a66
[CI] Remove unused precompiled wheel args from image build ( #34767 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 18:58:18 -08:00
Matthew Bonanni
7743152957
[Attention] Refactor check_and_update_config ( #33600 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-17 17:06:54 -08:00
Wentao Ye
ab33d2a629
[Feature] Decode Context Parallel support for GPU model runner v2 ( #34179 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-17 16:27:15 -08:00
Woosuk Kwon
be3af2d29e
[Model Runner V2] Further simplification for PP ( #34724 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-17 15:18:18 -08:00
Jongseok Park
c656ba3b4d
[Kernel] Triton-based Top-k and Top-p sampler kernels ( #33538 )
...
Signed-off-by: js_park <cakeng@naver.com >
Signed-off-by: Jongseok Park <37990712+cakeng@users.noreply.github.com >
Signed-off-by: Sunga Kim <sunga.kim@berkeley.edu >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Sunga Kim <sunga.kim@berkeley.edu >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 23:14:30 +00:00
Matthew Bonanni
dc5fa77a4e
[Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs ( #34457 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-17 14:01:27 -05:00
Flora Feng
1e4a084c8e
[CI] Fix flaky test_parsable_context ( #34717 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-17 18:42:52 +00:00
Richard Zou
7967e854da
[BugFix] Fix sp tests ( #34716 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-17 17:07:56 +00:00
almayne
6bd6d0c3c1
Fixed whisper CPU test that does not spawn properly. ( #34324 )
...
Signed-off-by: Anna Mayne <anna.mayne@arm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-17 06:46:23 -08:00
Nicolò Lucchesi
8e962fef5f
[CI][Nixl] Add CrossLayer KV layout tests ( #34615 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-17 21:35:40 +08:00
Cyrus Leung
574fe75245
[Renderer] Move InputPreprocessor into Renderer (2/2) ( #34560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 05:29:01 -08:00
junuxyz
c61a98f529
[CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior ( #34514 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-17 12:22:56 +00:00
Harry Mellor
28bffe9466
Fix docs build warning ( #34686 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 02:31:40 -08:00
ChenqianCao
ad65177a19
[Bugfix] Fix 'remove_instance_endpoint' method logic in disagg_proxy_demo ( #32922 )
...
Signed-off-by: ChenqianCao <39755070+ChenqianCao@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 10:06:53 +00:00
Tim Dettmers
d44a5b6c47
Remove dead bitsandbytes CxB code from 8-bit inference path ( #34633 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-17 01:49:14 -08:00
Jiangyun Zhu
1d65283e95
Revert "[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj" ( #34683 )
2026-02-17 01:29:27 -08:00
kourosh hakhamaneshi
c464b57374
[Ray] Propagate third-party env vars to Ray workers via prefix matching ( #34383 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-17 01:08:42 -08:00
Amr Mahdi
c5c38e152a
[CI] Fix bake config artifact path for AMI rebuild pipeline ( #34656 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 06:39:44 +00:00
Woosuk Kwon
d00df624f3
[Model Runner V2] Minor refactoring for penalties ( #34662 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:43:00 -08:00
Woosuk Kwon
9752da9d9c
[Model Runner V2] Minor simplification for BadWordsState ( #34669 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:27:24 -08:00
Woosuk Kwon
04925b2202
[Model Runner V2] Minor cleanup for PP ( #34666 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:15:31 -08:00
Woosuk Kwon
d74278fb67
[Model Runner V2] Fix unintended CPU-GPU sync in make_dummy ( #34667 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:00:29 -08:00
haosdent
b68fd899d1
[Bugfix] Fix fused MoE int32 overflow in stride*offset without perf regression ( #34507 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-16 17:58:49 -08:00
Aneesh Puttur
0b5f9b7204
[CI] Enable mypy import following for vllm/v1/kv_offload ( #34639 )
...
Signed-off-by: Aneesh Puttur <aneeshputtur@gmail.com >
2026-02-17 09:58:15 +08:00
zhanqiuhu
9a8853f781
[Core] Pipeline Parallel support for Model Runner V2 ( #33960 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-16 17:48:16 -08:00
zhrrr
387a1898d9
[Model Runner V2] support bad_words sampling param ( #33433 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 16:36:06 -08:00
roikoren755
3b30e61507
[NemotronH] Do not force router to run in fp32 ( #34582 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-16 10:15:32 -08:00
Alexei-V-Ivanov-AMD
824f9e8f3c
Targeting the MI355 agent pool with all existing tests ( #34629 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-16 17:02:27 +00:00
Nicolò Lucchesi
6cc403e67d
[Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] ( #34624 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-16 16:11:07 +00:00
Almog Tavor
72d5951d02
[Bugfix] Treat generation_config max_tokens as default not ceiling ( #34063 )
...
Signed-off-by: almogtavor <almogtavor@gmail.com >
2026-02-16 07:58:24 -08:00
Lucas Kabela
a3205beffb
[CI] Enable mypy coverage for individual excluded files ( #34292 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 07:34:29 -08:00
Christian Pinto
6930becd45
(bugfix): Fixed encode in LLM entrypoint for IOProcessr plugin prompts ( #34618 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-02-16 07:33:55 -08:00
Andreas Karatzas
03a8770a6d
[ROCm][CI] Fix plugins test group; updating terratorch and dependencies ( #34589 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 07:33:42 -08:00
Yiqi Xue
bc56a1d56e
[Bugfix] Fix ARC touch KeyError for non-ready T1 blocks in kv offload ( #34576 )
...
Signed-off-by: Yiqi Xue <xuey666@gmail.com >
2026-02-16 07:33:19 -08:00
danisereb
ec7d9e6745
Fix call to moe_mk in modelopt MoE modules (required for LoRA) ( #34575 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-16 07:33:09 -08:00
Isotr0py
3bb4e4311c
[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj ( #34492 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-16 07:32:51 -08:00
Amr Mahdi
08f8c198ae
[CI] Disable precompiled wheel path in CI image builds ( #34606 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-16 15:14:43 +00:00
Harry Mellor
a21cedf4ff
Bump lm-eval version for Transformers v5 compatibility ( #33994 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 05:24:35 -08:00
emricksini-h
3ef74cde5d
[CI][Tracing] Fix race condition by adding server readiness check ( #34364 )
...
Attempt to resolve #34284: "Metrics Tracing (2GPU)" fails with a
segmentation fault.
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
2026-02-16 12:57:39 +00:00
Ekagra Ranjan
cd81cdb399
[Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs ( #31058 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-16 11:08:44 +00:00
Andreas Karatzas
1e828573b4
[CI][Metrics] Stabilize tests with polling and subprocess guards ( #34566 )
...
test_abort_metrics_reset is flaky due to hardware-dependent
fixed sleeps: replace fixed sleeps with polling.
test_metrics_exist_run_batch passes even when the engine crashes
on startup (false positive): add subprocess lifecycle guards.
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 10:52:02 +00:00
Samu Tamminen
a5ccc85c8c
[Bugfix] Fix Dynamo unexpected keyword argument ( #34320 )
...
Signed-off-by: Samu Tamminen <stammine@amd.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-16 01:32:30 -08:00
Roger Wang
b5475d0534
Revert "[Misc] fix qwen3.5 config" ( #34610 )
2026-02-16 01:06:05 -08:00
JJJYmmm
9521002f0a
[Misc] fix qwen3.5 config ( #34604 )
2026-02-16 00:25:38 -08:00
Cyrus Leung
ec17bdd894
[Renderer] Move InputPreprocessor into Renderer (1.5/2) ( #34598 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-15 23:46:33 -08:00
Amr Mahdi
bb59c90248
[CI] Write bake config to temp directory instead of repo root ( #34569 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-15 22:15:47 -08:00
bnellnm
5bff999d12
[Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues ( #34453 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-15 20:10:50 -08:00
Lucas Wilkinson
bb85929aa6
[BugFix] Fix Python 3.13 FlashMLA import error ( #34548 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-15 20:09:18 -08:00
Parth Bansal
5653021094
[Doc] Add Mistral-7b-v0.3 model to the batch invariance validated model ( #34584 )
...
Signed-off-by: Parth Bansal <parthbansal127@gmail.com >
2026-02-16 12:09:00 +08:00
Andreas Karatzas
974d829b05
[CI][Frontend] Return 422 instead of 500 for invalid Anthropic tool_choice ( #34590 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-15 20:06:48 -08:00
Isotr0py
91ac5d9bfd
[CI/Build] Enable tests for recent day-0 new models ( #34585 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 18:17:04 -08:00
Luka Govedič
23d825aba1
[torch.compile] Disable ar-rms fusion for ds3-fp4 & DP, fix CI test ( #34392 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-15 06:33:57 -08:00
Maryam Tahhan
f07a128413
[CPU][ARM] Add ARM BF16 cross-compilation support and improve documen… ( #33079 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-15 06:33:08 -08:00
Isotr0py
71cd89264f
[MM Encoder] Add Triton ViT attention backend ( #32183 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 06:32:47 -08:00
Isotr0py
19fab44152
[Doc] Update Encoder-Decoder models support doc with Florence-2 ( #34581 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 04:18:57 -08:00
Seiji Eicher
79c7e09235
[KV Connector] Add temporary, off-by-default VLLM_DISABLE_REQUEST_ID_RANDOMIZATION workaround ( #34415 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-14 23:26:10 -08:00
haosdent
79f3fab05a
[Bugfix] Handle num_expert_group=None in flashinfer block-scale FP8 MoE ( #34494 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-14 23:25:46 -08:00
Vadim Gimpelson
604b9eaec5
[BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 ( #34476 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-14 23:25:17 -08:00
Stanislav Kirillov
50dbd6c9e6
[bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used ( #34516 )
...
Signed-off-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-14 23:24:25 -08:00
Andreas Karatzas
98bcc6ca59
[CI][Entrypoints] Validate detokenize token IDs to prevent int64 overflow causing 500 ( #34468 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 23:08:38 -08:00
Andreas Karatzas
f13e86d8dd
[Kernels] Fix Helion GPU utils to use platform-agnostic device name API ( #34537 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 20:29:23 -08:00
Woosuk Kwon
9ca768c740
[Model Runner V2] Minor cleanup for Sampler ( #34563 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-14 18:29:03 -08:00
Thomas Parnell
d5fe3f702c
[Hybrid] Enable mamba prefix cache "align" mode with async scheduling ( #33997 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-02-14 13:15:56 -08:00
Cyrus Leung
73391a1baa
[Renderer] Move InputPreprocessor into Renderer (1/2) ( #34510 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-14 10:14:21 -08:00
Andreas Karatzas
b3c14229b0
[ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests ( #34538 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 07:32:09 -08:00
Roger Wang
2f186635cb
[Bugfix] Fix Qwen3.5 config loading ( #34554 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-14 03:56:11 -08:00
Christian Pinto
342a7cda2d
[Misc] Update tests and examples for Prithvi/Terratorch models ( #34416 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 23:03:51 -08:00
Kata Coder
d1ea65d0a1
[new model] add COLQwen3 code & Inference ( #34398 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
Signed-off-by: katacoder <craftsangjae@gmail.com >
2026-02-14 12:15:19 +08:00
Andreas Karatzas
de42abb366
[CI] Heavy refactoring of Voxtral multimodal audio model tests ( #34294 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:04:29 -08:00
Julien Denize
60ca7981bc
Add explicit validation error for tool calls. ( #34438 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-02-13 20:04:01 -08:00
Christian S. Perone
0ef5b9147b
fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection ( #34527 )
...
Signed-off-by: Christian S. Perone <christian.perone@gmail.com >
Signed-off-by: Christian S. Perone <perone@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-13 20:03:37 -08:00
Shiyan Deng
ed242652d7
[bug] Make sure get_modality_with_max_tokens is deterministic ( #34533 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-02-13 20:02:59 -08:00
Wei Zhao
b37b679770
[Feature][Perf] Support Selective CPU Weight Offloading ( #34535 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-13 20:02:24 -08:00
Andreas Karatzas
a0638d052d
[Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993 ( #34543 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:01:42 -08:00
Harry Huang
c027541eaf
[Hybrid] Enable spec decoding in mamba cache align mode ( #33705 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 13:02:28 -08:00
Ben Browning
fd267bc7b7
[Bugfix]: Fix structured output in multi-turn gpt-oss ( #34454 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 11:12:48 -08:00
Michael Goin
bfaa559305
Revert "[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides" ( #34530 )
2026-02-13 10:35:29 -08:00
Richard Zou
87789c8364
[Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only ( #34523 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-13 09:52:20 -08:00
Pushpinder Singh
bcd65c1f6a
[Bugfix] Replace c10::optional with std::optional in topk kernel ( #34467 )
...
Signed-off-by: Pushpinder Singh <pushpindersingh135@gmail.com >
2026-02-13 08:30:23 -08:00
Wei Zhao
59d53066d8
[Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation ( #32993 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-13 08:11:26 -08:00
LoganJane
4a9952ec1b
[Bugfix] Add quant_config in ViT of Kimi-K2.5 ( #34501 )
...
Signed-off-by: LoganJane <LoganJane73@hotmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-13 16:05:34 +00:00
Roger Wang
1dae7b7843
[Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one ( #34508 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 13:59:00 +00:00
Roger Wang
5885e330ef
[Misc] Port Qwen3.5 Configs ( #34512 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 05:24:25 -08:00
Ilya Boytsov
071d863e20
Extend ColBERT support to non-standard BERT backbones ( #34170 )
...
Signed-off-by: Ilya Boytsov <ilya.boytsov@aleph-alpha.com >
2026-02-13 09:53:09 +00:00
Woosuk Kwon
0916e7960b
[GDN] Use CPU tensors to build GDN metadata ( #34498 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-13 01:24:45 -08:00
Wentao Ye
3d2a026fd0
[Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement ( #33368 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-13 16:38:16 +08:00
Aaron Hao
dddbff4624
[Core] Move pause and resume functions into engine ( #34125 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-13 00:15:10 -08:00
Martin Hickey
47e9b63e1a
[KVConnector] Clean up redundant code in KV connectors ( #34147 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-02-13 00:14:30 -08:00
Matthias Gehre
934acddef9
[Perf] fused_moe: add int4_w4a16 benchmark support and tuning config ( #34130 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-13 00:14:27 -08:00
Marek Michalowski
742d214d6e
[Bugfix] fix the import path in moe test utils.py ( #34245 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-13 00:13:45 -08:00
haosdent
4137c5dfa7
[Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode ( #34418 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-13 00:13:22 -08:00
Harry Huang
7a8a46ddcb
[BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec ( #34440 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 00:13:14 -08:00
myselvess
bcf0731aa0
[New Model] support new model ovis2.6 ( #34426 )
...
Signed-off-by: myselvess <23743269+myselvess@users.noreply.github.com >
2026-02-13 00:12:45 -08:00
Cyrus Leung
ec090c2429
[Refactor] Call renderer for online IO processor request ( #34490 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 22:48:45 -08:00
Roger Wang
eea3024f43
[Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 ( #34489 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-12 22:48:42 -08:00
Cyrus Leung
2f308214c0
[Refactor] Pass full VllmConfig to Renderer ( #34485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 22:48:38 -08:00
Cyrus Leung
1b4e8e53f8
[CI/Build] Fix CUDA re-initialization error in distributed model tests ( #34491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-13 06:43:53 +00:00
haosdent
dcf6ee8592
[Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image ( #34483 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-12 21:04:06 -08:00
Cyrus Leung
372b2e762a
[Bugfix] Standardize getting number of image patches/tokens ( #34358 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:47:01 -08:00
Andreas Karatzas
6afa587d31
[ROCm][CI] Fix serving tokens test failures ( #34047 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 11:27:53 +08:00
Cyrus Leung
94ed6cf6ea
Add new sections to CODEOWNERS ( #34309 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:39:28 -08:00
Harry Huang
bf37812ca7
[Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode ( #33706 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:21:52 -08:00
Frank Wang
b86bf4417e
[Bugfix] Fix Random Dataset Prefix Length Inaccuracy ( #33907 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-12 18:21:19 -08:00
Yanan Cao
de13dd781f
[Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure ( #34025 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:21:05 -08:00
LoganJane
62788f99a4
[Bugfix] Delete unused redundant code in Kimi-K2.5 ( #34427 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 18:18:42 -08:00
Cyrus Leung
ea5ff3a1f6
[Refactor] Simplify BOS/EOS token handling ( #34435 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:18:24 -08:00
bnellnm
04ea31baab
[Bugfix] Remove assert that's no longer valid ( #34443 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-12 18:18:15 -08:00
Harry Huang
6f019e6e0a
[BugFix] Add block_size validation for mamba cache align mode ( #34445 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:18:07 -08:00
Zhuohan Li
d707678dfb
Fix num_logprobs parameter description in sampler.py ( #34451 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2026-02-12 18:18:03 -08:00
Cyrus Leung
fc22cae4ac
[CI/Build] Update video URLs for testing ( #34446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:15:36 -08:00
Yanan Cao
96161fe978
[Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel ( #33373 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:13:12 -08:00
Jaewon
4453ba8d9e
[Core] Profiler improvements and lazy initialization ( #33198 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:38 -08:00
Jaewon
aa181c923b
[Core] Add sleep level 0 mode with enqueue/wait pattern ( #33195 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:25 -08:00
Alec S
be7370daf3
[Frontend] Enable generic structured_outputs for responses API ( #33709 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
2026-02-12 16:15:48 -08:00
Mengtao (Martin) Yuan
9ea1f598ce
Use paged_attention_v1 for sliding window decode in rocm_aiter_fa ( #34378 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-02-12 16:14:43 -08:00
amitz-nv
f120bd42d3
[Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 ( #33506 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-02-12 13:06:58 -08:00
Hashem Hashemi
fac4e96940
small adjustment to wvSplitKrc ( #34410 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-12 20:26:36 +00:00
Michael Goin
6d4e27ce29
[Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA ( #34374 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 12:08:06 -08:00
Andreas Karatzas
4c078fa546
[ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility ( #34447 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-12 18:47:34 +00:00
Patrick von Platen
6c0baee610
[Voxtral Realtime] Refactor & Improve buffering logic ( #34428 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 09:46:43 -08:00
Patrick von Platen
1100a97621
[Voxstral Realtime] Enable tests ( #33803 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-02-12 09:43:24 -08:00
xuebwang-amd
766e167821
[ROCm][quantization] improve OCP weight quant parser robust ( #34431 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 09:40:19 -08:00
Isotr0py
becbe24808
[Bugfix] Remove broken raw url GGUF model loading support ( #34433 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 09:40:01 -08:00
Harry Mellor
679ca5d8d3
Fix MoE for the Transformers modelling backend ( #34436 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-12 09:29:42 -08:00
Matthew Bonanni
f2c47886fd
[Attention] Add FlashInfer Sparse MLA backend ( #33451 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-02-12 17:21:54 +00:00
Nicolò Lucchesi
334c715e0f
[Docs] Spec decoding docs warning removal ( #34439 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-12 09:01:51 -08:00
Aaron Hao
7b5a8b4a9d
[BUG] Reset running requests when clearing cache for pause/resume ( #34382 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-02-12 16:19:13 +00:00
danisereb
dea63512bb
Add config file for fused MoE for Nemotron (TP4, B200) ( #34411 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-12 06:09:55 -08:00
Douglas Lehr
8a798be929
[ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter ( #34192 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2026-02-12 05:06:33 -08:00
Cyrus Leung
fb455ed547
[V0 Deprecation] Remove code related to per-request logits processors ( #34400 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:44:28 +08:00
baonudesifeizhai
f5897613fb
Fix Mistral config remap to accept compressed-tensors quantization #34028 ( #34104 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-02-12 08:22:06 +00:00
Louie Tsai
55a1a9563a
Vllm CPU benchmark suite improvement ( #34128 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-02-12 16:04:44 +08:00
AllenDou
386bfe5d08
[bugfix] refactor FunASR's _get_data_parser ( #34397 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-12 07:26:49 +00:00
Kyle Sayers
e9cd691132
[Bugfix] Fix Sparse24 Compressed Tensors models ( #33446 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 23:15:16 -08:00
Yichuan Wang
80f2ba6ea6
Fix DeepSeek-OCR tensor validation for all size variants ( #34085 )
...
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-11 22:50:23 -08:00
Lucas Wilkinson
136b0bfa59
[BugFix] Fix DP chunking ( #34379 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Bill Nell <bnell@redhat.com >
2026-02-12 06:44:03 +00:00
Cyrus Leung
b96f7314b4
[Refactor] Pass Renderer to Input Processor ( #34329 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:38:11 -08:00
Cyrus Leung
ced2a92f40
[Refactor] Move validation to params definitions ( #34362 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:33:15 -08:00
Runkai Tao
e1d97c38f8
[Bug Fix] Fix naive_block_assignment always defaulting to False due to arg misalignment ( #33848 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-12 11:30:57 +08:00
Michael Goin
ec12d39d44
[Bugfix] Fix MTP accuracy for GLM-5 ( #34385 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 11:08:19 +08:00
Michael Goin
ff1f83b056
[Refactor] Replace activation: str with MoEActivation enum ( #33843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 17:29:32 -08:00
Kevin H. Luu
83b47f67b1
[ci] Integrate AMD tests into CI ( #33626 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 08:54:17 +08:00
Micah Williamson
fb7b30c716
[ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI ( #34384 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 15:52:34 -08:00
bnellnm
31d992d215
[Bugfix] Fix some issues with MoERunner PR #32344 ( #34371 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-11 14:33:14 -08:00
Wei Zhao
5aff2699bd
Fix CI failure - Flashinfer Kernel tests ( #34316 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-11 14:17:16 -08:00
Raushan Turganbay
527ca32197
[Bugfix] Fix more multimodal tests for transformers V5 ( #34334 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-02-11 22:02:05 +01:00
Junseo Park
5458eb835d
[Bugfix] send None sentinel on final commit so server properly sends transcription.done ( #33963 )
...
Signed-off-by: pjs102793 <pjs102793@naver.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 21:01:53 +00:00
Tomas Ruiz
144d9b7cc8
[Benchmarks] Reduce ready checker log verbosity ( #34349 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2026-02-11 20:57:57 +00:00
elvischenv
83e26c834e
[GPT-OSS] Remove unnecessary contiguous ( #34337 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-02-11 15:29:29 -05:00
TJian
5001211369
[ROCm] [CI] fix test_unrecognized_env ( #34350 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-02-11 18:50:44 +00:00
Eldar Kurtić
11c7ace340
[Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) ( #34243 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-11 13:24:22 -05:00
Xinyu Dong
be7f3d5d20
[Bugfix] fix default is_neox_style is True for deepseek ( #34353 )
...
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com >
2026-02-11 18:20:45 +00:00
Isotr0py
0ab06100f4
[Multimodal] Expose mm_processor_kwargs for DummyInputsBuilder ( #34330 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-11 09:37:40 -08:00
Xinyu Chen
ffb3d553cc
[Model Runner V2] Init cuda graph pool when necessary ( #33217 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-02-11 09:12:13 -08:00
junuxyz
fa7e0bfacf
[CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… ( #32458 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-11 17:03:48 +00:00
SorenDreano
48134a2c22
[Docs] Fix typo ("defult") and double spacing ( #34348 )
...
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com >
Co-authored-by: Soren Dreano <soren@numind.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-11 09:02:27 -08:00
kliuae
64f570ab56
[ROCm] [aiter] Split KV cache update for AiterFlashAttention ( #33681 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2026-02-11 16:26:44 +00:00
Rohan Potdar
fd618871b4
[Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache ( #33948 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-11 11:12:05 -05:00
Harry Mellor
67a42b5a44
Don't try and run GLM-ASR with remote code ( #34352 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 08:09:40 -08:00
Lucas Wilkinson
c7914d30f9
Reapply [Attention][FA3] Update FA3 to include new swizzle optimization ( #34043 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-11 07:07:56 -08:00
Adam Binford
1b8756562e
Responses harmony system message structured ( #34268 )
...
Signed-off-by: Adam Binford <adamq43@gmail.com >
2026-02-11 05:14:28 -08:00
Linda
275e0d2a99
[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE ( #33715 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-02-11 12:38:11 +00:00
Harry Mellor
0f5e55e7a8
Make JAIS compatible with Transformers v5 ( #34264 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 12:30:37 +00:00
Harry Mellor
1e9204bff3
Make Qwen3VL compatible with Transformers v5 ( #34262 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-11 04:13:23 -08:00
Li, Jiang
05339a7b20
[Bugfix][CPU] Fix llama4 inference on CPU ( #34321 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-11 19:07:23 +08:00
Harry Mellor
40b8f55358
[Docs] Reduce time spent generating API docs ( #34255 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 02:56:02 -08:00
Seiji Eicher
5045d5c983
Patch protobuf for CVE-2026-0994 ( #34253 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-11 02:25:04 -08:00
Nick Hill
e09546cf05
[Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer ( #34217 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 11:03:24 +01:00
Tianqi Ren
786806dd44
[Doc] Update Marlin support matrix for Turing ( #34319 )
...
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com >
2026-02-11 09:03:41 +00:00
Nick Hill
79504027ef
[Misc] Bump fastsafetensors version for latest fixes ( #34273 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 00:30:09 -08:00
Luka Govedič
addac0e653
[torch.compile] Enable AR+rms fusion by default available for -O2 ( #34299 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-11 00:30:00 -08:00
Cyrus Leung
675a22ed66
[Chore] Move BaseRenderer to base.py ( #34308 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 00:29:51 -08:00
Kunshang Ji
cb9574eb85
[XPU][9/N] clean up existing ipex code/doc ( #34111 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-11 00:27:15 -08:00
AllenDou
21dfb842d7
[model] support FunASR model ( #33247 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-11 07:37:09 +00:00
R3hankhan
d1b837f0ae
[CPU] Enable FP16 (Half dtype) support for s390x ( #34116 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-11 14:41:42 +08:00
Roger Wang
0b20469c62
[Bugfix] Fix weight naming in Qwen3.5 ( #34313 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 21:37:14 -08:00
Tyler Michael Smith
d7982daff5
[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides ( #34279 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-11 05:15:52 +00:00
Robert Shaw
9b17c57460
[ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast ( #34298 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-11 05:00:00 +00:00
Hashem Hashemi
1b3540e6c6
Threshold fix wvSplitk for occasional CI fails ( #34013 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-11 03:59:14 +00:00
Matthias Gehre
7a048ee65f
[Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 ( #34149 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-11 03:58:56 +00:00
Cyrus Leung
c9a1923bb4
[Plugin] Simplify IO Processor Plugin interface ( #34236 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:47:39 -08:00
zofia
b482f71e9f
[XPU][7/N] enable xpu fp8 moe ( #34202 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-11 03:33:59 +00:00
Дзержи́нский
1485396abb
[Kernel] Apply 256bit LDG/STG To Activation Kernels ( #33022 )
...
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com >
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-10 19:31:51 -08:00
Kebe
5ee5c86eeb
[Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast ( #33884 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2026-02-10 19:31:36 -08:00
Cyrus Leung
b5dcb372e4
[Misc] Clean up validation logic in input processor ( #34144 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:29:29 -08:00
Tyler Michael Smith
066c6da6a0
[WideEP] Fix nvfp4 DeepEP High Throughput All2All backend ( #33738 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 19:15:43 -08:00
Richard Zou
e30cedd44b
[torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter ( #34093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 19:15:40 -08:00
Cyrus Leung
3bcd494ef4
[Redo] Add --trust-remote-code to dataset bench args ( #34251 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 11:10:12 +08:00
tianshu-Michael-yu
0e725a7d22
[Bugfix] Fix Worker.load_model context-manager composition for sleep mode ( #34021 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-02-11 11:07:51 +08:00
Lucas Wilkinson
ba0511fd80
[Misc] Add run one batch script that supports profiling ( #32968 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-10 18:29:49 -08:00
Micah Williamson
4a1550d22d
[ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline ( #34280 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 01:08:11 +00:00
bnellnm
d1481ba783
[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner ( #32344 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-10 19:51:07 -05:00
7. Sun
dc6de33c3d
[CI] Add pip caching to cleanup_pr_body workflow ( #32979 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-02-11 00:45:28 +00:00
Tyler Michael Smith
c4b9e6778f
[Misc] Add pre-commit hook to catch boolean ops in with-statements ( #34271 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-10 15:13:20 -08:00
Richard Zou
341eed3d30
[torch.compile] Disable recursive pre_grad_passes ( #34092 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 18:02:31 -05:00
Zhengkai Zhang
6f2f59f2b3
[Misc][Spec Decode] support different load config for draft model ( #34022 )
...
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
2026-02-10 14:52:43 -08:00
Ilya Markov
bb2fc8b5e7
[BugFix] Fix async EPLB hang with DeepEP LL all2all backend ( #32860 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-10 22:34:47 +00:00
Ilya Markov
67132945bb
[Perf] Move eplb rebalance algo to async thread ( #30888 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-10 22:19:10 +00:00
Gregory Shtrasberg
f0ca0671c7
[Feature] Warn about unrecognized environment variables ( #33581 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-10 15:45:38 -06:00
Pavani Majety
578977bb5e
[SM100] Resubmit FMHA FP8 prefill for MLA ( #31195 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-10 16:18:43 -05:00
Roger Wang
9615575afc
[Bugfix] Fix mamba cache dtype for Qwen3.5 ( #34200 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:12:31 -08:00
Matthew Bonanni
4293c00b84
[Benchmarks] Fix attention benchmark smoke test ( #34269 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-10 16:04:07 -05:00
J Seppänen
506ad7d7c1
[Bugfix] Fix weights offloading for sleep mode ( #32947 )
...
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-02-10 20:38:17 +00:00
Reagan Lee
fdd6f2ad58
Convert online APIs to use Renderer ( #34084 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
2026-02-10 19:44:31 +00:00
Qi Wang
33bcd3dc3b
[Misc] Introduce ec_both role EC (encoder cache) connector ( #34182 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-02-10 18:55:35 +00:00
Michael Goin
1f5febb4b8
[UX nit] Fix non-default api_server_count message ( #34152 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-10 10:35:58 -08:00
Andy Lo
ae871ca923
Minor cleanup for Voxtral ( #34247 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-02-10 18:18:30 +00:00
Woosuk Kwon
a2443de5fa
[Model Runner V2] Use pinned memory for write_contents ( #34222 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-10 08:55:22 -08:00
Harry Mellor
f84a2a8f31
[Docs] Speed up build environment set-up ( #34240 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 16:34:43 +00:00
Vadim Gimpelson
000214c4bb
[BUGFIX] Fix accuracy bugs in Qwen3-Next MTP ( #34077 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-10 10:57:11 -05:00
junuxyz
c5a66d1697
[Core][BugFix] Fix PP KV cache sharding memory validation ( #33698 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-10 10:46:24 -05:00
Roberto L. Castro
afdce12c89
[Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention ( #33680 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-02-10 10:29:52 -05:00
Zhengxu Chen
82e11973cc
[compile] Enable AOT compile with 2.10 in trunk. ( #34155 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com >
2026-02-10 23:24:42 +08:00
xuebwang-amd
b129136c7a
[ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations ( #29008 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 10:08:05 -05:00
mgazz
599e4335a4
Support benchmarking of Geospatial models ( #33922 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
2026-02-10 07:04:16 -08:00
Fan Yang
a1946570d8
add --insecure arg to the vllm bench to skip TLS ( #34026 )
...
Signed-off-by: Fan Yang <yan9fan@meta.com >
Co-authored-by: Fan Yang <yan9fan@meta.com >
2026-02-10 22:23:52 +08:00
Harry Mellor
d0bc520569
Bump mamba-ssm version in CI for Transformers v5 compatibility ( #34233 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 14:46:01 +01:00
Krish Gupta
748625cdaf
[V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input ( #34220 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-10 13:05:32 +00:00
Harry Mellor
61413973e8
Stop testing for slow tokenizers as they will not exist soon ( #34235 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 12:08:20 +00:00
Phúc H. Lê Khắc
94de871546
[Misc] allow specify is_mm_prefix_lm in hf_config ( #34215 )
2026-02-10 11:16:21 +00:00
tc-mb
e042d7e685
Add flagos in MiniCPM-o ( #34126 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
2026-02-10 02:51:48 -08:00
Roger Wang
ae4e280602
[Bugfix] Fix FI kernel chunk_gated_delta_rule output shape for Qwen3.5 ( #34219 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 10:41:24 +00:00
zzaebok
cbea11c9f0
[Docs] Fix format error in KV load failure recovery doc ( #34137 )
...
Signed-off-by: Jaebok Lee <jaebok9541@naver.com >
2026-02-10 02:16:26 -08:00
Cyrus Leung
2c32558a3c
[Bugfix] Fix --trust-remote-code conflict ( #34218 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 00:29:10 -08:00
Zetong Li
5f970120f0
[Bugfix] Fix memory inconsistency in cross-process shared memory ( #32022 )
...
Signed-off-by: Zetong Li <slippersss@126.com >
2026-02-10 08:22:03 +00:00
Cyrus Leung
998e2d91f8
Revert #34208 ( #34216 )
2026-02-09 23:59:04 -08:00
Wentao Ye
e1060a71a1
[Perf] Optimize detokenizer python logic ( #32975 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-09 23:54:41 -08:00
Chen Zhang
97fa8f6590
[BugFix] Avoid prefix cache hit in the same schedule step for mamba layers ( #29387 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-10 07:41:16 +00:00
wang.yuqi
dab1de9f38
[Frontend][CI] Consolidate instrumentator entrypoints ( #34123 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-10 07:30:19 +00:00
Balaxxe
8d48d0a9d9
[Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 ( #34190 )
...
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com >
2026-02-09 23:06:30 -08:00
Andrew Xia
9608844f96
[responsesAPI] fix simpleContext streaming output_messages ( #34188 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-09 22:53:07 -08:00
Cyrus Leung
f69b903b4c
[Bugfix] Add --trust-remote-code to dataset bench args ( #34208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 22:37:50 -08:00
Lucas Wilkinson
81e217fe6b
[Bugfix] Fix DP Attention Padding in Dummy Run ( #34187 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-10 05:29:39 +00:00
Cyrus Leung
ab97bcf662
[CI/Build] Relax test_mcp_tool_call ( #34204 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 05:18:57 +00:00
Cyrus Leung
25e48a3aae
[Doc] Update usage of --limit-mm-per-prompt ( #34148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 21:12:13 -08:00
Roger Wang
8a5e0e2b2b
[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching ( #34183 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:03:32 +08:00
Andreas Karatzas
4cde2e0159
[ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection ( #34108 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 20:50:20 -08:00
Roger Wang
047a457fa4
[Bugfix] Adopt ChunkGatedDeltaRule for Qwen3.5 ( #34198 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 03:47:54 +00:00
Yuwei An
e94ec59733
[LMCache] Token Base IPC API ( #34175 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-02-10 01:18:42 +00:00
Ning Xie
13397841ab
[structured output] validate unsupported json features first ( #33233 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2026-02-09 23:49:09 +00:00
Gregory Shtrasberg
c60f8e3b49
[Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available ( #34153 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-09 17:38:54 -06:00
Michael Goin
5e75a14a66
[Doc] Add DCP support to attention backend doc ( #33936 )
2026-02-09 18:33:43 -05:00
Nick Hill
e7e52781ff
[ModelRunner V2][BugFix] Fix max_query_len calculation ( #34167 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-09 21:47:17 +00:00
Charlie Fu
bb9f97308d
[torch.compile][Fusion] Fix attention fusion pass removing kv_update op. ( #33945 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-02-09 16:15:43 -05:00
Hongxia Yang
4d39650961
[ROCm] update triton branch to support gpt-oss models for gfx11xx devices ( #34032 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2026-02-09 19:36:30 +00:00
Artus Krohn-Grimberghe
8fd31f6245
[Bugfix] Voxtral prompt/audio placeholder alignment ( #34140 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:30:38 +00:00
Artus Krohn-Grimberghe
eadb4e868b
[Bugfix] Avoid duplicate k-proj weight emission in helper ( #34142 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:17:44 +00:00
Jiangyun Zhu
285bab4752
[Kernel] use flashinfer for gdn prefill ( #32846 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-09 12:17:25 -05:00
TomerBN-Nvidia
995bbf38f1
[Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) ( #34087 )
...
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-09 16:44:18 +00:00
Mohammad Miadh Angkad
d4f123cc48
[Kernel] FlashInfer: switch allreduce fusion to unified API ( #33985 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-02-09 15:43:24 +00:00
ZhengHongming888
cb62e86f83
Add NUMA Core binding in nixl_connector for CPU xPyD ( #32365 )
...
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com >
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-09 15:39:12 +00:00
Luka Govedič
781ddf7868
[CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 ( #34031 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-09 10:05:14 -05:00
Roger Wang
64a9c2528b
[UX] Add --language-model-only for hybrid models ( #34120 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-09 14:57:33 +00:00
Lucas Wilkinson
d0d97e2974
[Misc] Fix up attention benchmarks ( #33810 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-09 09:42:03 -05:00
JJJYmmm
9562912cea
[MODEL] Adding Support for Qwen3.5 Models ( #34110 )
...
Signed-off-by: JJJYmmm <1650675829@qq.com >
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: wulipc <wulipc@users.noreply.github.com >
Co-authored-by: ywang96 <ywang96@users.noreply.github.com >
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-09 21:12:58 +08:00
zofia
9bdb06b436
[XPU][6/N] add xpu scaled_mm kernel ( #34117 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-09 20:17:35 +08:00
Nikhil Gupta
caad9f1e01
[Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul ( #33901 )
...
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com >
2026-02-09 18:04:41 +08:00
Ekagra Ranjan
1d5922fade
[ASR] Fix audio benchmark and add RTFx metric ( #32300 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-02-09 10:02:37 +00:00
Andreas Karatzas
3025b3cebb
[CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr ( #34107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 17:37:04 +08:00
Jee Jee Li
978a37c823
[Model] GLM adaptation ( #34124 )
2026-02-09 17:32:52 +08:00
ihb2032
5a5c43511a
fix(cpu): fix mla_decode compilation on x86 without AVX512 ( #34052 )
...
Signed-off-by: ihb2032 <hebome@foxmail.com >
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain >
2026-02-09 08:55:41 +00:00
Nick Hill
d9bede0314
[BugFix] Fix fastsafetensors TP all procs using all GPUs ( #34070 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-09 15:15:46 +08:00
wang.yuqi
22b64948f6
[Frontend][last/5] Make pooling entrypoints request schema consensus. ( #31127 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-09 06:42:38 +00:00
Reagan Lee
7c233dbb36
[Tiny] Rename encoder budget file to more specific name ( #34103 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
2026-02-09 03:48:19 +00:00
kourosh hakhamaneshi
a75a5b54c7
[bug-fix] supported_tasks is breaking backward compatibility at init_app_state ( #34027 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-09 09:46:46 +08:00
Andrey Talman
f97ca67176
[Release 2.10] Update to Torch 2.10 - final release ( #30525 )
2026-02-08 13:51:09 -08:00
danisereb
084aa19f02
Add support for ModelOpt MXFP8 dense models ( #33786 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-08 11:16:48 -08:00
navmarri14
1ecfabe525
glm 4.6 fused tuned inference config for B200 ( #32958 )
2026-02-08 18:55:47 +00:00
Richard Zou
4df841fe75
[torch.compile] Add an option to force-enable the MOE cold start optimization ( #33735 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-08 18:42:56 +00:00
TomerBN-Nvidia
a263aa6140
[BugFix] Change support no act and mul for marlin ( #34088 )
...
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
2026-02-08 17:18:22 +00:00
aabbccddwasd
179ae7da8f
[Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate ( #33771 )
...
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com >
2026-02-08 08:13:24 -08:00
Reagan Lee
c4df59ad43
Add embedding input functionality for disabled modalities [remake] ( #32493 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com >
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-08 04:57:16 -08:00
TJian
785cf28fff
[ROCm] [CI] Reduce Resource of two test groups ( #34059 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-02-08 15:17:26 +08:00
Nick Hill
a96197f564
[Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used ( #33855 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-08 07:16:34 +00:00
Andreas Karatzas
ab10d79855
[ROCm][Bugfix] fix act_quant_fusion module import error ( #34069 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-07 19:21:12 -08:00
Cyrus Leung
7fcb705b80
[CI/Build] Skip GCS test ( #34057 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 08:52:38 -08:00
Cyrus Leung
b956cdf818
[Doc] Fix run_batch docs ( #34056 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 06:18:16 -08:00
Hashem Hashemi
ed17f54c8b
Perf tuning and expansion of cases covered for wvSplitKrc ( #33493 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-07 05:33:11 -08:00
Jiang Wu
860981d8d8
Make directory exist ok for ray spinning up multiple replicas on a single instance ( #33604 )
...
Signed-off-by: Jiang Wu <jwu@cclgroup.com >
2026-02-07 05:30:49 -08:00
zifeitong
52181baaea
Update DeepGEMM version pin in Dockerfile to match #32479 ( #33935 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-07 05:30:22 -08:00
Rohan Potdar
de3869bb4d
move checks out of unified_kv_cache_update custom op ( #33943 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-07 05:30:09 -08:00
whx
ce9b3cd3e9
[PluggableLayer][3/N] Apply PluggableLayer to mamba layers. ( #33660 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-02-07 05:26:05 -08:00
Jee Jee Li
db4ede9743
[Model] Enable Step3p5ForCausalLM testing ( #33755 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-07 05:25:24 -08:00
Pooya Davoodi
2cb2340f7a
[Frontend]Add support for transcriptions and translations to run_batch ( #33934 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-07 05:24:57 -08:00
TundeAtSN
4df44c16ba
Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 ( #33939 )
...
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com >
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-07 05:24:52 -08:00
Richard Zou
81fe69cae5
[torch.compile] Stop compiling identical artifacts ( #34003 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-07 05:24:48 -08:00
Mohammad Miadh Angkad
dd6a6e1190
[Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune ( #34006 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-07 05:24:44 -08:00
Cyrus Leung
edb359cce4
[Renderer] Define render_cmpl and render_chat ( #34039 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 05:24:40 -08:00
wang.yuqi
6ed5eda300
[CI][Build] Pin grpcio-tools==1.78.0 ( #34048 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-07 05:24:35 -08:00
Cyrus Leung
11a4c9d30d
[Misc] Simplify get_max_tokens ( #34036 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 00:59:49 -08:00
lukec
15a0b9e570
Fix spelling errors ( #33978 )
2026-02-06 23:58:50 -08:00
Andreas Karatzas
c490d8cc73
[ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug ( #34038 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-06 22:21:08 -08:00
Cyrus Leung
48312e579a
[Misc] Make PlaceholderRange.get_num_embeds a method ( #34035 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 05:30:17 +00:00
Vel
bc32444b23
[Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support ( #33517 )
...
Signed-off-by: code4me2 <velvetmoon222999@gmail.com >
2026-02-06 20:28:01 -08:00
Wentao Ye
18e8545297
[Revert] Add util handle_deprecated back ( #33998 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-07 04:14:45 +00:00
果冻虾仁
6f7adc533a
fix description in plugin_system.md ( #33999 )
2026-02-06 19:37:02 -08:00
Nick Hill
40218a82ba
[ModelRunner V2] Revert token rank comparison difference for now ( #34017 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-07 11:11:05 +08:00
kourosh hakhamaneshi
1c3b22058f
[Misc] Add backward-compatible import aliases for renamed translations module ( #34015 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-07 11:01:41 +08:00
Xin Yang
3920cafdd6
[Bugfix] Fix _fused_moe_lora_expand signature mismatch ( #33821 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-07 10:45:59 +08:00
rasmith
ec28784fdc
[CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion ( #34007 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-07 02:43:25 +00:00
Nicolò Lucchesi
55aeec04f5
[Bugfix] Fix Whisper tokenization ( #34011 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-07 10:42:52 +08:00
Ikenna
906077181b
[Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 ( #33967 )
...
Signed-off-by: Ikenna <ikennachifo@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-07 02:27:33 +00:00
Aaron Hao
89a385d79f
[Feat][RL] Pause and Resume with keep requests for single engine ( #32351 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-07 00:08:58 +00:00
kourosh hakhamaneshi
4a2d00eafd
[bugfix] [ROCm] Fix premature CUDA initialization in platform detection ( #33941 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2026-02-06 16:17:55 -06:00
Dimitrios Bariamis
207c3a0c20
Fix RoutingMethodType logic ( #33919 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-02-06 14:03:34 -08:00
Sumanth R Hegde
ae2e93f89b
[Fix] Fix logprobs=0 handling for /inference/v1/generate endpoint ( #34010 )
...
Signed-off-by: SumanthRH <sumanthrh99@gmail.com >
2026-02-06 20:33:40 +00:00
xuebwang-amd
9e9acce577
[Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) ( #33993 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-02-06 19:11:32 +00:00
Charlie Fu
fe5438200b
[Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op ( #33734 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-02-06 19:09:59 +00:00
Wentao Ye
77c09e1130
[Refactor] Remove align block size logic in moe_permute ( #33449 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-06 10:57:06 -08:00
zhrrr
16786da735
[Model Runner V2] support apply penalty for spec decode ( #33251 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-02-06 10:56:48 -08:00
vllmellm
aaa2efbe98
[DOC] [ROCm] Update docker deployment doc ( #33971 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 10:05:35 -08:00
Seiji Eicher
aca5967416
[KV Connector] Add missing method overrides to MultiConnector ( #33292 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-06 12:58:21 -05:00
Wentao Ye
67a746e87f
[Log] Optimize duplicate startup log ( #33944 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-06 17:49:56 +00:00
Chauncey
7bec435130
[Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 ( #33964 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-06 09:23:44 -08:00
Eldar Kurtić
5c52644b10
[Docs] Update link to Benchmark CLI documentation ( #33254 )
...
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com >
2026-02-06 16:00:59 +00:00
zofia
2ce9fe4ad0
[XPU][5/N] add wna16 xpu kernel ( #33973 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-06 15:59:53 +00:00
Cyrus Leung
cd8b405bd0
[Refactor] Consolidate sequence normalization and enc-dec parsing ( #33928 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-06 15:43:47 +00:00
tc-mb
4707f7ebb4
[Model] Support MiniCPM-o 4.5 ( #33431 )
...
Signed-off-by: caitianchi <caitianchi@modelbest.cn >
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Co-authored-by: mslv <mslv@baai.ac.cn >
2026-02-06 15:29:10 +00:00
Michael Goin
c39ee9ee2b
[Docs] Add sections on process architecture and minimum CPU resources ( #33940 )
...
It seems users can be confused about vLLM's performance when running
with very few CPU cores available. We are missing a clear
overview of vLLM's process architecture, so I added one, along with
some diagrams, in arch_overview.md and included a section on CPU resource
recommendations in optimization.md.
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-06 15:26:43 +00:00
Andreas Karatzas
350ca72c04
[ROCm][AITER] Fix AITER import regression for explicit backend selection ( #33749 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-06 15:08:16 +00:00
FredericOdermatt
1fb0495a72
[FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab ( #33509 )
...
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch >
2026-02-06 14:23:03 +00:00
Raushan Turganbay
85ee1d962b
[Bugfix] Fix models and tests for transformers v5 ( #33977 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 21:47:41 +08:00
Harry Mellor
51a7bda625
Update WeightTransferConfig to be more standard like the others ( #33989 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 13:15:00 +00:00
SorenDreano
6e7b1c4b59
[Docs] Improve documentation ( #33799 )
...
Co-authored-by: Soren Dreano <soren@numind.ai >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-06 12:57:09 +00:00
Kurt Shuster
2991dd3d22
[Bugfix][Model] Support LoRA on Qwen3 Output Embedding ( #29816 )
...
Signed-off-by: kurt <kurt@thinkingmachines.ai >
2026-02-06 20:25:31 +08:00
Luka Govedič
ac32e66cf9
[torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) ( #33731 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-06 04:19:49 -08:00
Fadi Arafeh
f79d9dce16
[CPU][BugFix] Fix loading of w8a8int models with bias ( #33582 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-02-06 11:59:20 +00:00
Harry Mellor
ba5cbbf107
Bump HF Hub client to get bug fix ( #33984 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 11:25:33 +00:00
zhang-prog
233b26ab35
[PaddleOCR-VL] Add BC for transformers 5.0 config ( #33976 )
...
Signed-off-by: zhangyue66 <zhangyue66@baidu.com >
2026-02-06 10:33:49 +00:00
Harry Mellor
791a94bed0
Consolidate and fix forbidden import pre-commit checks ( #33982 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 01:47:41 -08:00
Xinyu Chen
e969a169ef
support view_from_cpu_tensor on XPU ( #33868 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-02-06 08:34:20 +00:00
Harry Mellor
6d8d34be6d
Fix main pre-commit ( #33975 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 00:08:05 -08:00
Gassan Salama
1363e3d6d5
[cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation ( #32263 )
...
Signed-off-by: Gassan <gassan.salama@arm.com >
2026-02-06 15:01:48 +08:00
chengchengpei
965525667b
Onboard voyage-4-nano ( #33720 )
...
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com >
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com >
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-06 06:23:34 +00:00
sihao_li
6550815c3a
[XPU]Replace pip in docker.xpu with uv pip ( #31112 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-02-06 14:02:33 +08:00
Kunshang Ji
7439e4f41b
[XPU][4/N] add mxfp4 moe model support ( #33679 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-06 13:03:59 +08:00
R3hankhan
ac04dd374f
[CPU] Add BF16 Kernel type for s390x ( #33788 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-06 04:57:02 +00:00
Cyrus Leung
035a6cb09a
[Misc] Update code for encoder-decoder models ( #33900 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-06 11:38:39 +08:00
Mingliang Li
a32cb49b60
feat(frontend): early-fail tokenization guard for user requests ( #31366 )
...
Signed-off-by: limingliang <limingliang@stepfun.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: limingliang <limingliang@stepfun.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 19:38:02 -08:00
Rabi Mishra
20d7454c9b
fix(ROCm): Make flash_attn import optional in MLA attention ( #33511 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-02-06 02:22:53 +00:00
Simon Mo
5819ca8944
[Docs] Add reo analytics ( #33957 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2026-02-05 17:42:22 -08:00
Xin Yang
79028d4388
[Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel ( #33568 )
2026-02-05 20:34:00 -05:00
emricksini-h
325ab6b0a8
[Feature] OTEL tracing during loading ( #31162 )
2026-02-05 16:59:28 -08:00
Wei Zhao
91a07ff618
[Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue ( #33832 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-05 23:50:49 +00:00
Hashem Hashemi
d5c4800112
Adds padding and perf improvements to wvSplitK_fp8 ( #33527 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-05 22:16:02 +00:00
Lumosis
42d5d705f9
[Minor] Sort safetensors files to ensure deterministic loading order ( #33491 )
...
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-02-05 17:05:09 -05:00
Cyrus Leung
116880a5a0
[Bugfix] Make MM batching more robust ( #33817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 20:40:58 +00:00
Matthew Bonanni
4145e50d85
[Bugfix] Fix DSV3.2 NVFP4 ( #33932 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-05 19:22:19 +00:00
Nicolò Lucchesi
20f5d185a6
[Misc] Rename translations to speech_to_text for OAI serving component ( #33904 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 19:16:52 +00:00
Harry Mellor
1887acca9e
Fix tokenizer test for renamed attr on Transformers v5 ( #33902 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-05 19:16:20 +00:00
Tsukasa OI
92e7562a99
[Bugfix] Suppress non-TTY color output on the process name part of the log ( #29714 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
2026-02-05 18:47:09 +00:00
Isotr0py
87d0d17ab5
[Models] Consolidate Deepseek-OCR2 processor ( #33909 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-05 18:29:20 +00:00
bnellnm
a57c8228ff
[Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor ( #33375 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-05 18:07:18 +00:00
zackyoray
1ee95841bd
[Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path ( #33795 )
...
Signed-off-by: Yoray Zack <yorayz@nvidia.com >
2026-02-05 17:51:58 +00:00
Nicolò Lucchesi
7d8c6804e2
[Misc] Add debug logs ( #33931 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 09:42:40 -08:00
Benjamin Chislett
af3162d3aa
[Spec Decode] Unified Parallel Drafting ( #32887 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-05 12:37:18 -05:00
danisereb
5b2a9422f0
[BugFix] Fix LoRA Fp8 ( #33879 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-05 17:25:55 +00:00
Aaron Hao
c1858b7ec8
[Feat][RL][1/2] Native Weight Syncing API: NCCL ( #31943 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Co-authored-by: SumanthRH <sumanthrh99@gmail.com >
2026-02-05 12:13:23 -05:00
Mario Hong
82914d2ae8
[Bugfix] Fix step3p5 parser when using mtp ( #33690 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
2026-02-05 16:04:04 +00:00
Nicolò Lucchesi
81a90e5277
[Docs] Add bart-plugin to docs ( #33905 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 12:20:25 +00:00
wang.yuqi
1c3a221d3b
[Bugfix] Fix corner case of sparse embedding ( #33886 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 02:51:22 -08:00
Cyrus Leung
7bd42e609d
[Refactor] Clean up input preprocessing ( #33687 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 18:43:42 +08:00
Isotr0py
a2522839d8
[Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading ( #33876 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-05 10:29:54 +00:00
jiahanc
59a5cb387a
[perf] Integrate flashinfer concat_mla_k ( #31171 )
2026-02-05 05:23:11 -05:00
liranschour
8322d4e47f
Enable Cross layers KV cache layout at NIXL Connector V2 ( #33339 )
...
Signed-off-by: Liran Schour <lirans@il.ibm.com >
Signed-off-by: liranschour <liranschour@users.noreply.github.com >
Co-authored-by: Or Ozeri <or@ozery.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-05 02:17:02 -08:00
Andreas Karatzas
3e472e81f9
[ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) ( #32710 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-02-05 10:01:23 +00:00
Cyrus Leung
038914b7c8
[Refactor] Move task outside of PoolingParams.verify ( #33796 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 09:33:11 +00:00
Pavani Majety
d2f4a71cd5
[Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. ( #33858 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-05 09:32:10 +00:00
Mark McLoughlin
2abd97592f
[KV Connector][Metrics] Do not count local prefix cache hits in connector queries ( #30522 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-05 09:57:27 +02:00
Chauncey
6abb0454ad
[Perf] Optimize the performance of structured output + reasoning ( #33557 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-05 15:45:29 +08:00
Li, Jiang
db6f71d4c9
[CI/Build] Fix CPU CI test case title ( #33870 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-05 15:07:14 +08:00
Fadi Arafeh
fd03538bf9
[CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs ( #33727 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-02-05 06:26:09 +00:00
Andreas Karatzas
1f70313e59
[Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result ( #33837 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 06:17:00 +00:00
Li, Jiang
07daee132b
[CI/Build] Parallelize CPU CI tests ( #33778 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-05 13:53:48 +08:00
Andrew Xia
9595afda18
[2/N] move responses/serving _make_response_output_items logic to parser ( #33281 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Signed-off-by: Andrew Xia <axia@meta.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-05 13:46:15 +08:00
rasmith
c1395f72cd
[CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly ( #33840 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-05 05:05:48 +00:00
rinbaro
007b183d74
[docs] fix unintentional misspellings ( #33863 )
...
Signed-off-by: rinbaro <ilgomishra@gmail.com >
2026-02-04 20:50:59 -08:00
Nick Hill
add9f1fbd9
[Minor] Include StreamingInput in inputs package ( #33856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-05 04:38:20 +00:00
Luka Govedič
e3bf79ffa0
Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" ( #33841 )
2026-02-04 19:54:27 -08:00
Andreas Karatzas
fb1270f1f8
[CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode ( #32762 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-05 11:14:06 +08:00
Kevin H. Luu
72bb24e2db
[release] Minor fixes to release annotation ( #33849 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-05 02:07:35 +00:00
Chauncey
a7be77beef
[Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 ( #33637 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-05 01:28:36 +00:00
zhanqiuhu
bbe0574d8e
[Bugfix] Disable TRTLLM attention when KV transfer is enabled ( #33192 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-05 00:49:18 +00:00
Luka Govedič
4d9513537d
[CI][torch.compile] Reduce e2e fusion test time ( #33293 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-04 19:09:03 -05:00
Ilya Boytsov
439afa4eea
feat: Add ColBERT late interaction model support ( #33686 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
Signed-off-by: Ilya Boytsov <boytsovpanamera@mail.ru >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 08:05:13 +08:00
Nick Hill
fa4e0fb028
[Core] Don't schedule spec tokens with prefill chunks ( #33652 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-04 23:40:22 +00:00
Sage Moore
ce498a6d61
Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] ( #33573 )
...
Signed-off-by: Sage Moore <sagmoore@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-04 17:02:46 -05:00
Richard Zou
9f14c9224d
Revert "[torch.compile] Significantly speed up cold start times" ( #33820 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-04 21:59:59 +00:00
Muhammad Hashmi
535de06cb1
[Model] Add transcription support for Qwen3-Omni ( #29828 )
...
Signed-off-by: Muhammad Hashmi <mhashmi@berkeley.edu >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: NickLucche <nlucches@redhat.com >
2026-02-04 21:17:47 +00:00
Simon Danielsson
4292c90a2a
[Bugfix] Support RotaryEmbedding CustomOp for gpt-oss ( #33800 )
...
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com >
2026-02-04 20:17:41 +00:00
Taeksang Kim
6e98f6d8b6
Implement zero-copy GQA for multimodal and CPU ( #33732 )
...
Signed-off-by: Taeksang Kim <ts.kim@hyperaccel.ai >
2026-02-04 20:11:39 +00:00
kourosh hakhamaneshi
2f6d17cb2f
[rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm ( #33308 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2026-02-04 10:09:14 -08:00
Isotr0py
192ad4648b
[Bugfix] Fix interns1-pro initialization and PP ( #33793 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-04 17:54:45 +00:00
Lucas Wilkinson
0e92298622
[Misc] Delay deprecation of CommonAttentionMetadata properties ( #33801 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-04 08:41:57 -08:00
jiangkuaixue123
87d9a26166
[Bugfix] Fix ubatch wrapper num_tokens calculate ( #33694 )
...
Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com >
2026-02-04 16:41:45 +00:00
Cyrus Leung
80f921ba4b
[Bugfix] Fix normalize still being passed to PoolerConfig ( #33794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-04 23:56:02 +08:00
Wentao Ye
711edaf0d0
[Perf] Optimize spec decoding + async scheduling, 1.5% Throughput improvement ( #33612 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-04 09:34:32 -05:00
Micah Williamson
1d367a738e
[Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching ( #33713 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-04 05:36:29 -08:00
Cyrus Leung
32a02c7ca2
Apply #33621 to main ( #33758 )
...
Signed-off-by: Zachary Aristei <zaristei@nvidia.com >
Co-authored-by: zaristei2 <zaristei2@gmail.com >
Co-authored-by: Zachary Aristei <zaristei@nvidia.com >
2026-02-04 05:35:39 -08:00
Chauncey
f67ee8b859
[Perf] Optimize chat completion streaming performance ( #33782 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-04 12:30:36 +00:00
Cyrus Leung
e57ef99b40
[Model] Apply #32631 for recent models ( #33785 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-04 12:23:01 +00:00
Yueqian Lin
f8516a1ab9
[Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni ( #33605 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-04 12:15:29 +00:00
Vadim Gimpelson
824058076c
[PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] ( #33291 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-04 11:20:52 +00:00
Or Ozeri
8e32690869
[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads ( #32255 )
...
Fixes a not-yet-reported case where it was possible for blocks to be
freed by an abort before an async transfer completed, resulting
in corrupted KV data.
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-02-04 11:16:34 +00:00
Zhengxu Chen
a208439537
[compile] Remove runner type from ignored caching factor list. ( #33712 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-04 10:56:45 +00:00
Zhengxu Chen
bcd2f74c0d
[compile] Clean up AOT compile bypass on evaluate_guards. ( #33578 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-04 02:12:53 -08:00
Kunshang Ji
f79f777803
[XPU][2/N] add support unquantized moe support for xpu ( #33659 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-04 02:12:25 -08:00
Augusto Yao
4c8d1bf361
use ORJSONResponse when available to improve the efficiency of request process ( #33548 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-02-04 10:04:11 +00:00
Kunshang Ji
061da6bcf7
[XPU] remove common path warning log ( #33769 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-04 16:40:17 +08:00
zhanqiuhu
4403e3ed4c
[Metrics] Add labeled prompt token metrics for P/D disaggregation ( #33290 )
...
Add labeled Prometheus metrics to distinguish where prompt tokens come
from in P/D disaggregated deployments.
In P/D disaggregation, decode instances receive KV cache from prefill instances.
Currently, decode reports inflated prompt throughput because it counts all
prompt tokens as "computed", even though most were transferred.
This PR adds labeled metrics so users can understand actual compute work vs
transferred work:
vllm:prompt_tokens_by_source_total{source="local_compute"} # Tokens prefilled locally
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer
vllm:prompt_tokens_by_source_total{source="local_cache_hit"} # Tokens from local prefix cache
vllm:prompt_tokens_cached_total # Total cached (local + external, -1 when all
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-04 07:46:48 +00:00
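As an illustration of how the labeled counters in the commit above might be consumed, the sketch below parses Prometheus text-format output and computes the share of prompt tokens per source. Only the metric name and the three `source` label values come from the commit message; the sample counter values, the `prompt_token_shares` helper, and the regular expression are assumptions made for this example, not part of vLLM.

```python
import re
from collections import defaultdict

# Illustrative scrape text. The counter values are made up for this sketch;
# a real /metrics endpoint would typically carry extra labels as well.
SAMPLE = '''\
vllm:prompt_tokens_by_source_total{source="local_compute"} 1200
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} 8000
vllm:prompt_tokens_by_source_total{source="local_cache_hit"} 800
'''

# Matches one sample line of the labeled counter and captures the source
# label and the value. Other labels inside the braces are tolerated.
LINE_RE = re.compile(
    r'^vllm:prompt_tokens_by_source_total\{[^}]*source="([^"]+)"[^}]*\}\s+(\S+)'
)


def prompt_token_shares(metrics_text):
    """Return {source: fraction of all prompt tokens} (hypothetical helper)."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Sum across series that differ only in their other labels.
            totals[match.group(1)] += float(match.group(2))
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {source: count / grand_total for source, count in totals.items()}


shares = prompt_token_shares(SAMPLE)
for source, share in sorted(shares.items()):
    print(f"{source}: {share:.1%}")
```

With numbers like these, a decode instance would show most prompt tokens arriving via `external_kv_transfer` and only a small `local_compute` share, which is the distinction the commit says the unlabeled prompt throughput hides.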
Matt
08e094997e
[Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism ( #32745 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-02-04 14:51:33 +08:00
Wentao Ye
d88a1df699
[Deprecation] Deprecate profiling envs ( #33722 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-04 05:58:21 +00:00
Cyrus Leung
90d74ebaa4
[Deprecation] Remove _get_data_parser in MM processor ( #33757 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-04 05:51:52 +00:00
Frank Wang
45f8fd6f97
[Feature] Enable TRITON_ATTN for Batch Invariance ( #33688 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2026-02-04 13:27:34 +08:00
Wentao Ye
5e1e0a0fbd
[Refactor] Remove unused dead code ( #33718 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-03 21:25:11 -08:00
Michael Goin
eb5ed20743
[Bugfix] Define router_logits_dtype for remaining MoE models ( #33737 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-04 13:24:14 +08:00
Huy Do
2647163674
Save startup benchmark results as a list of values ( #33629 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-02-03 20:37:51 -08:00
Shanshan Shen
9fb27dd3b3
[MM] Align the prefix of MMEncoderAttention with Attention ( #33750 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-02-04 04:07:30 +00:00
R3hankhan
4dffc5e044
[CPU] Split attention dispatch by head_dim alignment ( #32161 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-03 19:37:15 -08:00
Andrew Xia
e1bf04b6c2
[1/N] Initial Implementation of Parser for ResponsesAPI ( #32712 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-04 10:59:03 +08:00
Isotr0py
02080179a3
[Bugfix] Fix torchrun PP broadcast deadlock with async scheduling ( #33701 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-04 02:17:37 +00:00
wang.yuqi
1b8fe6f7c4
[Frontend][4/n] Make pooling entrypoints request schema consensus | ScoreRequest ( #33060 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-04 01:48:40 +00:00
Nick Hill
52ee21021a
[BugFix][Spec Decoding] Fix negative accepted tokens metric crash ( #33729 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-03 23:34:41 +00:00
Wentao Ye
655efb3e69
[Dependency] Remove comments of ray in dependency files ( #33351 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-03 15:30:47 -08:00
Matthew Bonanni
bd8da29a66
[Bugfix] Fix sparse MLA metadata building ( #33579 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-03 15:29:48 -08:00
Michael Goin
2a99c5a6c8
[Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3 ( #33613 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-03 13:26:51 -08:00
Patrick von Platen
3f7662d650
[Voxtral Realtime] Change name ( #33716 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-02-03 13:03:28 -08:00
Vadim Gimpelson
a372f3f40a
[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 ( #33257 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-03 15:10:31 -05:00
Harry Mellor
61e632aea1
Turn @config into a dataclass_transform ( #31541 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 17:40:59 +00:00
Richard Zou
b1bb18de8d
[torch.compile] Significantly speed up cold start times ( #33641 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-03 09:12:11 -08:00
Lucas Wilkinson
2267cb1cfd
[Attention][FA3] Update FA3 to include new swizzle optimization ( #23465 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-03 08:08:47 -08:00
dtc
0d6ccf68fa
[P/D] rework mooncake connector and introduce its bootstrap server ( #31034 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-02-03 08:08:25 -08:00
Cyrus Leung
18e7cbbb15
[Bugfix] Fix startup hang for Granite Speech ( #33699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 15:57:56 +00:00
Patrick von Platen
f0d5251715
[Voxtral models] Skip warm-up to skip confusing error message in warm-up ( #33576 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-03 07:22:34 -08:00
Shanshan Shen
5c4f2dd6ef
[MM] Pass prefix parameter to MMEncoderAttention ( #33674 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-02-03 06:47:41 -08:00
wang.yuqi
f3d8a34671
[Bugfix] Do not add extra \n for image-only cases when constructing multimodal text prompts. ( #33647 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-03 06:43:47 -08:00
shaharmor98
4bc913aeec
Feat/add nemotron nano v3 tests ( #33345 )
2026-02-03 08:52:49 -05:00
Kuntai Du
fbb3cf6981
[Bugfix][Async][Connector] avoid vllm-side double free during async scheduling + request abort + async KV cache transfer ( #33377 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2026-02-03 21:50:15 +08:00
Krish Gupta
2df2b3499d
Document NixlConnector backend selection via kv_connector_extra_config ( #33552 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-03 05:49:59 -08:00
Harry Mellor
2a8d84e66d
Fix Gemma3n audio encoder for Transformers v5 ( #33673 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 05:49:49 -08:00
zxy
a3acfa1071
[Models] Intern-S1-Pro ( #33636 )
...
Signed-off-by: zxy <zhou0493@e.ntu.edu.sg >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-03 05:49:45 -08:00
Harry Mellor
be8168ff88
Fix Gemma3 GGUF for Transformers v5 ( #33683 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 12:36:53 +00:00
Harry Mellor
f6af34626d
Fix offline test for Transformers v5 ( #33682 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 12:07:24 +00:00
Song Zhixin
ceab70c89d
[Bugfix] fix qwen3-asr response error ( #33644 )
...
Signed-off-by: jesse <szxfml@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-03 03:33:56 -08:00
Cyrus Leung
52683ccbe1
[Misc] Update default image format of encode_base64 ( #33656 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 03:13:16 -08:00
Michael Goin
e346e2d056
[Bugfix] Disable RoutingMethodType.[Renormalize,RenormalizeNaive] TRTLLM per-tensor FP8 MoE ( #33620 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-03 10:37:15 +00:00
Cyrus Leung
83449a5ff0
[Refactor] Clean up pooling serial utils ( #33665 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 10:29:18 +00:00
Lucas Hänke de Cansino
dad2d6a590
[Bugfix][Model] Fix DeepSeek-OCR-2 chat template to include BOS token ( #33642 )
...
Signed-off-by: l4b4r4b4b4 <lucas.cansino@mail.de >
2026-02-03 00:35:58 -08:00
Isotr0py
32e84fa1ff
[CI/Build] Investigate torchrun distributed tests hanging issue ( #33650 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-03 15:49:17 +08:00
Richard Zou
fd9c83d0e0
[torch.compile] Document the workaround to standalone_compile failing ( #33571 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-03 07:16:55 +00:00
杨朱 · Kiki
b95cc5014d
[Misc] Remove deprecated VLLM_ALL2ALL_BACKEND environment variable ( #33535 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-03 15:01:59 +08:00
Nick Hill
61397891ce
[Minor] Some code simplification in scheduler.py ( #33597 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-03 15:00:00 +08:00
杨朱 · Kiki
ef248ff740
[Misc] Remove deprecated profiler environment variables ( #33536 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-03 14:58:44 +08:00
Kunshang Ji
e10604480b
[XPU][1/N] Deprecate ipex and switch to vllm-xpu-kernels for xpu platform ( #33379 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-02 22:46:10 -08:00
Chauncey
bf001da4bf
[Bugfix] Interleaved thinking keeps compatibility with reasoning_content ( #33635 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Koushik Dutta <koushd@gmail.com >
2026-02-03 06:46:05 +00:00
杨朱 · Kiki
a0a984ac2e
[CI/Build] Remove hardcoded America/Los_Angeles timezone from Dockerfiles ( #33553 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-02 22:32:39 -08:00
Shengliang Xu
f1cb9b5544
Fix quantized Falcon-H1 model loading issues ( #32728 )
...
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-02 22:31:27 -08:00
Daniel Mescheder
4c4b6f7a97
[Frontend] Add sampling parameters to Responses API ( #32609 )
...
Signed-off-by: Daniel Mescheder <dmesch@amazon.com >
Co-authored-by: Daniel Mescheder <dmesch@amazon.com >
2026-02-03 13:51:10 +08:00
Roger Wang
10546f925a
[Bugfix] Fix mm budget setting for Qwen Omni models ( #33634 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-03 04:56:25 +00:00
Radu Salavat
e69c990c21
[Feature][CPU Backend]: Optimize ARM vectorization backend ( #30329 )
...
Signed-off-by: Radu Salavat <radu.salavat@arm.com >
2026-02-02 20:17:56 -08:00
Richard Zou
5eac9a1b34
[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding ( #33624 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-03 03:38:49 +00:00
Nathan Weinberg
1b60b45d0d
[CI/Build] add directions for CPU image upload to Docker Hub ( #32032 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
Signed-off-by: Nathan Weinberg <31703736+nathan-weinberg@users.noreply.github.com >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2026-02-03 02:48:06 +00:00
Dezhan
4b3803d180
[BugFix] DPMetadata raises assert error for dense model ( #32739 )
...
Co-authored-by: Dezhan Tu <dztu@meta.com >
2026-02-03 00:56:44 +00:00
Patrick von Platen
5019c59dd2
[Voxtral Realtime] Introduce global log mel max ( #33574 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 17:01:47 -05:00
Lain
089cd4f002
fix cutlass_3x_gemm_fp8_blockwise on sm103a ( #32224 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-02-02 11:47:46 -08:00
Vasiliy Kuznetsov
0130223bd9
fix memory for online fp8 quantization with streaming weight load ( #31914 )
...
Signed-off-by: vasiliy <vasiliy@fb.com >
2026-02-02 14:17:42 -05:00
Matthew Bonanni
5d1aef3004
[UX] Format attention backend log line ( #33570 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-02 18:57:12 +00:00
yugong333
ffe1fc7a28
Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. ( #32005 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
2026-02-02 12:30:06 -05:00
Harry Mellor
8b7346d5f1
Update huggingface-hub again ( #33567 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-02 09:20:54 -08:00
Harry Mellor
6141ebe0dd
Remove incorrect tokenizer info test ( #33565 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-02 17:11:44 +00:00
Yang Liu
199e3cb476
[Model] Use mm_position to compute mrope positions for GLM-4.xV ( #33039 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
2026-02-02 16:55:48 +00:00
Matthew Bonanni
9f8cb81b44
[CI] Add DeepSeek V3.2 nightly eval ( #33566 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-02 16:10:02 +00:00
Cyrus Leung
d7e17aaacd
[Refactor] Move profiling methods to MM budget ( #33559 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-02 23:27:00 +08:00
Kebe
528e9b1490
[Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series ( #33540 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Thomas Vegas <tvegas@nvidia.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2026-02-02 22:55:46 +08:00
shanjiaz
d95b4be47a
move spec decode slow test to test_areas.yaml ( #33365 )
...
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com >
2026-02-02 06:28:36 -08:00
Isotr0py
4061dcf4c5
[Bugfix] Enable Kimi k25 processor test ( #33562 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-02 14:25:25 +00:00
danielafrimi
0aca8b8c62
[MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) ( #32790 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
2026-02-02 09:18:50 -05:00
Rabi Mishra
9eb58f8cf1
fix[ROCm]: Remove unconditional aiter import ( #32902 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-02-02 22:10:02 +08:00
Cyrus Leung
b10d05b8a8
[Model] Use explicit types in get_generation_prompt ( #33551 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-02 12:38:49 +00:00
Borushiki
b398e5c819
Update get_expert_mapping to include self parameter ( #33525 )
...
Signed-off-by: Borushiki <38628261+Otsutsukii@users.noreply.github.com >
2026-02-02 20:29:07 +08:00
Grzegorz K. Karch
78061ef584
Fix accessing hidden_act from model config ( #32686 )
...
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com >
2026-02-02 11:11:33 +00:00
Nicolò Lucchesi
528b3076af
[CI][Bugfix] Fix flaky tests/v1/kv_connector/unit/test_multi_connector.py::test_multi_example_connector_consistency ( #33555 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-02 03:01:29 -08:00
Cyrus Leung
a502831d36
[Chore] Remove redundant input parsing methods ( #33542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-02 10:50:47 +00:00
Komal Kumar Teru
ba871fb788
[Misc] support arbitrary MM datasets in spec dec bench ( #33486 )
...
Signed-off-by: kkt-cohere <komal@cohere.com >
Signed-off-by: Komal Kumar Teru <162363718+kkt-cohere@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-02 08:49:48 +00:00
R3hankhan
ab374786c7
[CPU][IBM Z][Dockerfile] Fix IBM Z builds ( #33243 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-01 23:41:29 -08:00
RED
808dd87b30
[Model] Support DeepSeek-OCR-2 ( #33165 )
...
Signed-off-by: liuli <ll407707@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: liuli <ll407707@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-02 06:24:10 +00:00
Andy Lo
beb8899482
Fix mistral sliding window parsing ( #33521 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-02-02 05:08:04 +00:00
Sawyer Bowerman
ce88756b96
[Doc]: update paths for Offline/Online/Others example sections ( #33494 )
...
Signed-off-by: Sawyer Bowerman <sbowerma@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 03:56:53 +00:00
Paco Xu
a3154a6092
[Doc] add missing model entries in supported_models.md ( #33220 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-02-02 03:37:25 +00:00
jack
7c036432fc
[Bugfix] GLM-4 tool parser: incremental string streaming ( #33218 )
...
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com >
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com >
2026-02-02 11:13:31 +08:00
Robert Shaw
318b120766
[Nightly CI] Remove CT Model ( #33530 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-01 19:09:09 -08:00
csy0225
c3b40dc3e7
[Models] Step-3.5-Flash ( #33523 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: i-zhangmingming <i-zhangmingming@stepfun.com >
Co-authored-by: xiewuxun <xiewuxun@stepfun.com >
Co-authored-by: zetaohong <i-hongzetao@stepfun.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-02 10:21:18 +08:00
Yifan Qiao
a01ef3fa51
[Fix] prefix cache hit rate == 0 bug with gpt-oss style models ( #33524 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2026-02-02 01:59:58 +00:00
Runkai Tao
7320ca3942
Add unpermute-aware fused MoE LoRA path ( #32655 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-02 09:46:09 +08:00
Nick Hill
cf0a99f84d
[ModelRunner V2] Support spec decode with structured outputs ( #33374 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-02 00:19:59 +00:00
Nick Hill
e535d90deb
[ModelRunner V2] Misc minor simplifications and optimizations ( #33467 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-01 22:17:14 +00:00
Komal Kumar Teru
0b225fb7b2
[Misc] skip target model mm emb in draft proposal step when draft is text-only ( #33437 )
...
Signed-off-by: kkt-cohere <komal@cohere.com >
2026-02-01 21:13:35 +00:00
will b.
46b4a02794
Fix DeepSeek V2 RoPE initialization error ( #33501 )
...
Signed-off-by: Eduardo Salinas <edus@microsoft.com >
Signed-off-by: catswe <212922539+catswe@users.noreply.github.com >
Co-authored-by: Eduardo Salinas <edus@microsoft.com >
2026-02-01 21:00:56 +00:00
shaharmor98
8869cd8ec1
Add MoE config for Super B200 TP2 ( #33510 )
2026-02-01 18:48:37 +00:00
JartX
cd86fff38f
[BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling allow inference Omni on ROCM ( #33077 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-02-01 13:36:25 +00:00
Maral
b5f8c3092d
[W8A8 Block Linear Refactor][1/N] Keep all quantization types into QuantFP8 class. ( #33047 )
...
Signed-off-by: maral <maralbahari.98@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-01 09:28:01 +00:00
Cyrus Leung
21997f45b1
[Redo] #33110 with threading limit ( #33502 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: YunzhuLu <lucia.yunzhu@gmail.com >
2026-02-01 09:18:11 +00:00
Luka Govedič
672023877b
Change defaults for vllm bench startup ( #33489 )
2026-01-31 23:46:01 -08:00
Zack Yu
754a8ca942
fix: only include Authorization header when OPENAI_API_KEY is set ( #33488 )
...
Signed-off-by: zack041 <zackyu041@gmail.com >
2026-01-31 23:35:09 -08:00
Eduardo Salinas
302ecf64ff
[Models]: lfm2_siglip2 return intermediate encoder layers ( #33370 )
...
Signed-off-by: Eduardo Salinas <edus@microsoft.com >
2026-02-01 06:17:49 +00:00
Cyrus Leung
b6bb2842cf
[Critical] Revert #33110 ( #33500 )
2026-01-31 21:06:42 -08:00
Cyrus Leung
79b6ec6aab
[Bugfix] Fix inconsistent handling of cache reset ( #33481 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 20:23:41 -08:00
Greg Pereira
d6416fdde9
pin LMCache to v0.3.9 or greater with vLLM v0.15.0 ( #33440 )
...
Signed-off-by: greg pereira <grpereir@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-31 20:50:38 -07:00
Andreas Karatzas
0fb3157267
[ROCm][CI] Update huggingface-hub pin ( #33492 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-01 02:51:54 +00:00
Cyrus Leung
a358e4dffe
[Refactor] Make Renderer an abstract class ( #33479 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-01 10:36:30 +08:00
René Honig
079781177a
fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels ( #33417 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-01-31 14:06:42 -08:00
Roy Wang
63c0889416
[Misc] Fix flashinfer related tests ( #33462 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-01-31 16:10:24 -05:00
smashyalts
1e86c802d4
Fix grammar ( #33121 )
...
Signed-off-by: smashyalts <smashyalts@gmail.com >
2026-01-31 09:59:34 -08:00
linhaifeng
fedf64332e
[Bugfix]: Fix display errors in TORCH_CHECK messages ( #32942 )
...
Signed-off-by: linhaifeng <1371675203@qq.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-31 09:48:48 -08:00
Xiao Yang
2238a12c13
[Misc] support collect_env for endpoint /server_info ( #33246 )
...
Signed-off-by: yang.xiao <yang.xiao@daocloud.io >
2026-02-01 01:42:59 +08:00
Harry Mellor
ce0afe2451
Update huggingface-hub pin for the last time before Transformers v5 ( #33473 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-31 09:14:24 -08:00
Cyrus Leung
88c3e114d8
[Refactor] Move MM data parsing outside processor ( #33408 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 16:46:14 +00:00
Cyrus Leung
92924b2ddd
[Deprecation] Remove deprecated items related to pooling ( #33477 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 08:44:40 -08:00
YunzhuLu
27cb2f678f
[Bugfix] Early-reject requests with MM data longer than encode cache capacity ( #33110 )
...
Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-31 08:41:13 -08:00
jma99_2333
22d9a056d5
Support clear mm and encoder cache ( #33452 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-31 15:22:25 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
13b842f271
[BugFix][Router Replay] Capture Logical Experts with EPLB ( #33013 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2026-01-31 10:12:17 -05:00
Luka Govedič
15f40b20aa
[fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops ( #33441 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Richard Zou <zou3519@gmail.com >
2026-01-31 06:48:34 -08:00
Cyrus Leung
793af538a3
[Doc] Update plugin deprecation notices ( #33476 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 22:48:28 +08:00
cmunley1
6f5e7cda57
support return prompt token ids in responses ( #33378 )
2026-01-31 06:04:20 -08:00
Roy Wang
68feb76a6f
[Misc] Replace deprecated interface seed_everything ( #33474 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-01-31 05:38:39 -08:00
Cyrus Leung
4cb59dea6a
[Bugfix] Fix incompatibility between #33372 and #32863 ( #33475 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 05:21:32 -08:00
Angela Yi
608b556507
[ez] Add structured torch.compile logs ( #33213 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-31 21:00:54 +08:00
Cyrus Leung
f0a1c8453a
[Frontend] Use new Renderer for Completions and Tokenize API ( #32863 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 04:51:15 -08:00
caozuoba
8980001c93
[perf] v1/spec_decode: skip softmax for all-greedy rejection sampling ( #32852 )
...
Signed-off-by: hdj <1293066020@qq.com >
2026-01-31 09:51:26 +00:00
jennyyyyzhen
527bcd14d4
[ROCM] Enable aiter attn backend for qwen3-next model ( #32492 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
2026-01-31 17:03:57 +08:00
Jinwu
f68e3ea4e1
[BugFix] Add synchronize in CutlassW4A8LinearKernel to ensure data is ready for use. ( #33078 )
...
Co-authored-by: jinwuguo <jinwuguo@tencent.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-31 08:14:54 +00:00
Yanan Cao
d5c41db35b
[Kernel] [Helion] [3/N] Helion kernel registry ( #33203 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-31 15:38:46 +08:00
Fadi Arafeh
1618e25492
[CPU][Feat] Enable KleidiAI accelerated int4 dynamic quant with BF16 activations on Arm CPUs ( #33122 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-31 07:16:22 +00:00
AutumnAurelium
f3888aca83
Add EAGLE3 support for AFMoE ( #33111 )
...
Signed-off-by: AutumnAurelium <88015631+AutumnAurelium@users.noreply.github.com >
2026-01-31 06:53:08 +00:00
Dimitrios Bariamis
f0bca83ee4
Add support for Mistral Large 3 inference with Flashinfer MoE ( #33174 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-01-30 22:48:27 -08:00
Matthias Gehre
73419abfae
[Bugfix] Handle Asym W4A16 (ConchLinearKernel) for CT ( #33200 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-01-31 06:21:51 +00:00
Nicolò Lucchesi
e77f162cf5
[Bugfix] Fix Qwen3ASR language asr tag in output ( #33410 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-31 05:24:49 +00:00
Yanan Cao
8ecd213c0b
[Kernel] [Helion] [2/N] Helion kernel wrapper ( #32964 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-31 12:53:01 +08:00
Francesco Fusco
5b55c0bea7
[Attention] Clarify comment explaining attn_logits +1 dimension ( #33427 )
...
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com >
2026-01-31 04:50:30 +00:00
Patrick von Platen
15e0bb9c42
[Streaming -> Realtime] Rename all voxtral related classes, fn, files ( #33415 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-01-31 04:49:00 +00:00
Micah Williamson
6c64c41b4a
[ROCm][CI] Force max_num_seqs=1 on ROCm In test_sharded_state_loader to reduce flakiness ( #33277 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-31 12:28:29 +08:00
Russell Bryant
a2ef06e1b3
[Misc] offest -> offset in comments and variable names ( #33444 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-01-30 20:19:22 -08:00
Lucas Wilkinson
0a3c71e7e5
[BugFix] Fix whisper FA2 + full cudagraphs ( #33360 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-31 12:15:06 +08:00
Michael Goin
29fba76781
[UX] Use gguf repo_id:quant_type syntax for examples and docs ( #33371 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-31 12:14:54 +08:00
Isotr0py
9df152bbf6
[Misc] Align Qwen3-VL-embedding image example outputs with HF repo example ( #33419 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-30 19:36:56 -08:00
Nick Hill
876a16f4fb
[ModelRunner V2] Fix spec decoding + logprobs ( #33391 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-31 03:33:26 +00:00
Matthew Bonanni
aaa901ad55
[Attention] Move MLA forward from backend to layer ( #33284 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-30 19:30:00 -08:00
Wentao Ye
010ec0c30e
[Deprecation] Deprecate seed_everything and scatter_mm_placeholders in v0.15 ( #33362 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-31 02:54:16 +00:00
Alberto Ferrer
64a40a7ab4
[Bugfix] Fix typo in read_offset variable name ( #33426 )
...
Signed-off-by: Alberto Ferrer <albertof@barrahome.org >
2026-01-31 01:26:15 +00:00
Gregory Shtrasberg
31aedfe7d6
[Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 ( #33366 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-01-30 19:05:23 -06:00
Michael Goin
67ebaff528
Refactor NVFP4 Linear utils for ModelOpt and CT ( #33201 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-30 16:37:42 -08:00
Chendi.Xue
2b465570e6
[CI][HPU]accelerate hpu test by skip python re-install and clean container name ( #33286 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-01-30 21:36:29 +00:00
Huy Do
9ca66ecc10
Indicate compile mode in the benchmark results ( #32990 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-01-30 15:34:36 -05:00
Pavani Majety
c3a9752b0c
[Hardware][SM100] Add TRTLLM Kernel for INT4 W4A16 Kernel. ( #32437 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-01-30 10:30:46 -08:00
xuebwang-amd
f451b4558b
[Quantization][ROCm] Fix MoE weight loading to be robust (Qwen3_MoE/Qwen3_next as example models) ( #33173 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-01-30 17:50:23 +00:00
Vasiliy Kuznetsov
3f96fcf646
fix QERL attention import path ( #33432 )
...
Signed-off-by: vasiliy <vasiliy@fb.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-30 09:29:09 -08:00
Yanan Cao
6c1f9e4c18
[Kernel] [Helion] [1/N] Add Helion ConfigManager ( #32740 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-30 12:19:19 -05:00
Harry Mellor
67239c4c42
Fix encoder-decoder model disabling mm processor cache ( #33236 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 16:30:10 +00:00
Nicolò Lucchesi
8ece60768f
[CI] Qwen3-ASR transcriptions tests ( #33414 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-30 16:17:56 +00:00
Michael Goin
fd0e377244
Support FP8 block quant for CompressedTensorsW8A16Fp8 ( #33280 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-30 11:15:20 -05:00
Kyle Sayers
f857a03f6b
[QeRL] Layerwise Reloading ( #32133 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-01-30 08:50:05 -07:00
Danielle Robinson
74898a7015
[BugFix][LoRA] TritonExperts is ModularMoEPath for FP8 models ( #33393 )
...
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com >
Co-authored-by: Danielle Robinson <dmmaddix@amazon.com >
2026-01-30 15:27:42 +00:00
Frank Wang
8f5d51203b
Disable Cascade Attention for Batch Invariance ( #32561 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-30 10:00:46 -05:00
Julien Denize
ae5b7aff2b
Improve Mistral format checks. ( #33253 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-30 06:23:33 -08:00
Harry Mellor
a11bc12d53
Fix test_moe.py for Transformers v5 ( #33413 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 14:03:25 +00:00
Nathan Weinberg
58cb55e4de
[Doc] Enhance documentation around CPU container images ( #32286 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
2026-01-30 13:36:20 +00:00
杨朱 · Kiki
cf896ae0e3
[Misc] Clean up HIDDEN_DEPRECATED_METRICS after metric removal ( #33323 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-30 13:31:17 +00:00
Harry Mellor
c5113f60f2
Remove deprecated reasoning_content message field ( #33402 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 11:48:15 +00:00
vllmellm
174f16700b
[Doc] [ROCm] Update Documentation to reflect v0.15.0 release ( #33388 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-30 19:06:08 +08:00
Julien Denize
8e2ad97ad0
[BUGFIX] Pixtral cannot be loaded with --limit-mm-per-prompt 0 ( #33406 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-01-30 02:52:02 -08:00
Patrick von Platen
10152d2194
[Realtime API] Adds minimal realtime API based on websockets ( #33187 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-30 18:41:29 +08:00
杨朱 · Kiki
1a7894dbdf
[Misc] Replace Optional[X] with X | None syntax ( #33332 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-30 01:56:59 -08:00
Cyrus Leung
c87eac18f7
[Refactor] Move MM item count validation outside of processor ( #33396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-30 09:27:31 +00:00
tianshu-Michael-yu
f45870b53f
fix: allow LFM2 MoE prefix caching (align) ( #33376 )
...
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
2026-01-30 08:23:14 +00:00
hujiaxin0
ba45bedfd1
[model] Add support for openPangu7B-VL ( #32449 )
...
Signed-off-by: hujiaxin <524446785@qq.com >
Signed-off-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
2026-01-30 15:54:27 +08:00
Harry Mellor
9432ed8c7e
Explicitly set return_dict for apply_chat_template ( #33372 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 07:27:04 +00:00
Lucas Kabela
726d89720c
[CI] Enable mypy import following for vllm/spec_decode ( #33282 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-30 06:43:32 +00:00
Harry Mellor
d334dd26c4
Move decode context parallel validation to ParallelConfig ( #33239 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 06:18:41 +00:00
Ryan Rock
070c811d6f
[CI][AMD] Skip 4 GPUs testgroup ray tests ( #33305 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-01-29 21:39:53 -08:00
Isotr0py
8bfc8d5600
[Models] Refactor Kimi-K2.5 weight loading ( #33346 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-30 05:31:20 +00:00
Harry Huang
ec51831a22
[BugFix] Disable async scheduling for Mamba prefix caching ( #33352 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-01-30 04:40:19 +00:00
Harry Mellor
80b918f2bd
Fix tie_word_embeddings for multimodal models in Transformers v5 ( #33359 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 03:37:39 +00:00
Wang Haoyu
c46b0cd0af
[Model][Multimodal] Add explicit MusicFlamingo adapter ( #32696 )
...
Signed-off-by: WangHaoyuuu <mailwhaoyu@gmail.com >
2026-01-30 11:01:29 +08:00
Aidan Reilly
133765760b
[Docs] Adding links and intro to Speculators and LLM Compressor ( #32849 )
...
Signed-off-by: Aidan Reilly <aireilly@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-29 14:12:35 -08:00
Michael Goin
bfb9bdaf3f
[Bugfix] Enable Triton MoE for FP8 per-tensor dynamic ( #33300 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-29 12:15:17 -08:00
Kevin H. Luu
2284461d02
[release] Minor fixes to release annotation and wheel upload ( #33129 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-29 12:09:35 -08:00
danisereb
8e2a469b3b
Add Triton fused MoE config for B200 (Nemotron Nano) ( #32804 )
2026-01-29 19:21:33 +00:00
CarstyYou
23591e631e
[Bugfix][Kernel] Fix negative memory offset in GDN Triton kernel ( #33326 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com >
2026-01-29 10:40:11 -08:00
Linda
0493d897c4
[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe ( #32954 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
2026-01-29 10:00:13 -08:00
Chendi.Xue
8c8ebeb941
[BUGFIX][XPU] fix memory check after XPU reuse GPU_worker ( #33358 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-01-29 09:56:30 -08:00
Cyrus Leung
831453fcef
[Chore] Move MediaConnector to vllm.multimodal.media ( #33324 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-29 16:54:31 +00:00
Angela Yi
5a66c9cc76
[ez] Delete torch25_custom_graph_pass ( #33287 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-29 16:47:05 +00:00
Isotr0py
5e73e4900c
[Bugfix] Fix broken GLM-OCR initialization ( #33350 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-29 07:56:05 -08:00
Cyrus Leung
c6e7404cc5
[Multimodal] Simplify MM input definitions ( #33331 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-29 13:32:04 +00:00
sthWrong
17b17c0684
[Backport] [Kimi-K2.5] Replace torch.cuda with current_platform for d… ( #33320 )
2026-01-29 12:29:17 +00:00
Kunshang Ji
8bb6271c77
[Intel GPU] refine xpu worker ( #32894 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-01-29 12:26:52 +00:00
Roger Wang
8b3f0a99dd
[Models] Qwen3-ASR ( #33312 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-29 19:27:15 +08:00
Li, Jiang
8311f083bd
[Bugfix][CPU] Fix thread num for shared memory communication ( #33317 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 03:26:58 -08:00
Patrick von Platen
40c35038d2
[Voxtral] Streaming example ( #33042 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-29 03:22:49 -08:00
zofia
a5aa4d5c0f
[Quantization][Refactor] use platform dict to choose kernel ( #33130 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
Signed-off-by: zofia <110436990+zufangzhu@users.noreply.github.com >
2026-01-29 10:44:58 +00:00
andrii.pasternak
615e8033e5
[Bug Fix] Handle variable-length tensors in MultiModalFlatField batching ( #31751 )
...
Signed-off-by: Andrii Pasternak <andriipasternak31@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-29 10:42:59 +00:00
Ilya Markov
d09135fbd0
[BugFix] Async Eplb fix potential race condition ( #32881 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-01-29 10:31:40 +00:00
daniel-salib
8688c3d460
[fix] test mcp_tool_calling_streaming with a more complex math question ( #32769 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-01-29 10:25:58 +00:00
Isotr0py
5400014d55
[Chore] Remove use_data_parallel kwargs from ViT implementation ( #33310 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-29 10:20:52 +00:00
Isotr0py
3a92c6f3b5
[Misc] Cleanup Kimi-K2.5's vision chunk modality entrypoints ( #33157 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-29 09:46:02 +00:00
amirkl94
e01ff5c070
Bugfix: Pass router logits dtype in nemotron shared experts ( #32669 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2026-01-29 09:36:34 +00:00
Harry Mellor
fb946a7f89
Make mypy opt-out instead of opt-in ( #33205 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-29 09:12:26 +00:00
Lucas Wilkinson
a650ad1588
[Misc] Remove missed pad_for_cudagraph ( #33283 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-29 09:12:05 +00:00
graftim
d697581a7c
[Doc] Update outdated link to Ray documentation ( #32660 )
...
Signed-off-by: graftim <38649219+graftim@users.noreply.github.com >
2026-01-29 00:56:06 -08:00
shanjiaz
5eeba80c74
Adding optional speculator tests for larger models ( #32943 )
...
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com >
2026-01-29 16:54:02 +08:00
whx
08b1195e62
[PluggableLayer][2/N] Apply PluggableLayer to linear layers ( #33152 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-01-29 16:53:15 +08:00
cmunley1
3bba2edb0f
support returning tokenids in responses api ( #33212 )
...
Signed-off-by: Christian Munley <cmunley@nvidia.com >
2026-01-29 16:52:39 +08:00
Ilya Markov
53fc166402
[BugFix] Fix EPLB fail for MoeFP4 model with Marlin backend ( #33262 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-01-29 16:52:11 +08:00
Didier Durand
31b25f6516
[Doc]: fixing multiple typos in diverse files ( #33256 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Didier Durand <2927957+didier-durand@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 16:52:03 +08:00
wang.yuqi
abb34ac43a
[Bugfix] Fix Qwen3-VL-Reranker load. ( #33298 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 08:42:53 +00:00
Pengchao Wang
2515bbd027
[CI/Build][BugFix] fix cuda/compat loading order issue in docker build ( #33116 )
...
Signed-off-by: Pengchao Wang <wpc@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2026-01-29 00:19:05 -08:00
TJian
c487a8eef4
[Release] [ROCm] Remove old build step ( #33316 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-28 23:35:51 -08:00
Kiersten Stokes
9e138cb01d
[Misc][Build] Lazy load cv2 in nemotron_parse.py ( #33189 )
...
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com >
2026-01-29 06:55:50 +00:00
TJian
f9d03599ef
[Release] [CI] Optim release pipeline ( #33156 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-28 22:45:42 -08:00
wangln19
39037d258e
Fix tool call indexing double-counting ( #33141 )
...
Signed-off-by: wanglinian <wanglinian@stu.pku.edu.cn >
2026-01-29 05:57:09 +00:00
Cyrus Leung
51550179fc
[Refactor] Define MM data parser in processing info instead of processor itself ( #33260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-29 13:55:17 +08:00
Angela Yi
07ea184f00
[ez] Delete more torch version checks <= 2.8 ( #33288 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-29 05:28:46 +00:00
Or Ozeri
a663b218ae
[Misc] Add orozery to CODEOWNERS (core, kv_transfer, kv_offload) ( #33227 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-29 04:24:20 +00:00
Michael Goin
1bd47d6e5a
[Bugfix] Register fp8 cutlass_group_gemm as supported for only SM90+SM100 ( #33285 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-28 18:40:59 -08:00
Michael Goin
141cd43967
[UX] Remove noisy CT UnquantizedLinearMethod warn ( #33273 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-28 16:09:30 -08:00
Nick Hill
6bf3b46d78
[ModelRunner V2] Misc code simplification and cleanup ( #33266 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-28 14:41:23 -08:00
Matthew Bonanni
77c4f45c6c
[7/N][Attention][Docs] Add documentation for attention backends ( #32477 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-28 17:20:22 -05:00
Michael Goin
ca1969186d
[UX] Enable nested configs in config yaml files ( #33193 )
2026-01-28 16:54:25 -05:00
Gregory Shtrasberg
ab597c869a
[Bugfix] Add missing encoder only guard for do_kv_cache_update ( #33269 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-01-28 21:25:07 +00:00
Angela Yi
4197168ea5
[ez] Remove checks for torch version <= 2.8 ( #33209 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-28 16:03:56 -05:00
Rohan Potdar
59bcc5b6f2
Use aiter triton fused_add_rmsnorm_pad for gpt-oss ( #30976 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-01-28 20:47:47 +00:00
Wentao Ye
3e440786af
[Feature] Fully support for async scheduling + PP, 30.8% E2E throughput improvement, 31.8% TPOT improvement ( #32618 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-28 20:30:32 +00:00
Kevin H. Luu
8bdd3979d8
[CI] Change GPU key to device key for B200 test ( #33275 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-28 19:14:29 +00:00
Wentao Ye
c4e744dbd4
[Perf] Optimize moe_permute for CUTLASS FP8 ( #32892 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-28 10:15:24 -08:00
Nicolò Lucchesi
8ebf372e9d
[CI] Whisper tests enforce_eager=False ( #33098 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-28 09:36:56 -08:00
cwazai
f210f0b7b1
[lora/moe] Avoid extra intermediate buffer & Python slicing in expand phase when split_k == 1 ( #32774 )
...
Signed-off-by: 陈建华 <1647430658@qq.com >
2026-01-29 00:22:45 +08:00
Bin Bao
392c5af4fe
[Benchmark] Add startup benchmarking to buildkite run ( #33183 )
...
Signed-off-by: Bin Bao <binbao@meta.com >
2026-01-28 16:03:07 +00:00
Robert Shaw
af9b69f977
[Quantization][Deprecation] Remove Marlin 24 ( #32688 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-28 15:54:59 +00:00
Chauncey
8e5e40daf4
[Misc] Provide a DeepSeek ReasoningParser with thinking enabled by default ( #33221 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-28 21:16:53 +08:00
Or Ozeri
2e8de86777
Revert "Enable Cross layers KV cache layout at NIXL Connector ( #30207 )" ( #33241 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-01-28 04:36:00 -08:00
Robert Shaw
247d1a32ea
[Quantization][Deprecation] Remove BitBlas ( #32683 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-28 11:06:22 +00:00
Kevin H. Luu
ecb4f82209
[CI] Update job dependency syntax for Intel and AMD jobs ( #33240 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-28 01:33:59 -08:00
Kevin H. Luu
5914090765
[CI] Update job dependency for hardware and CPU jobs ( #33237 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-28 01:10:05 -08:00
Harry Mellor
f1acbd68c5
[CI] Enable mypy import following for vllm/compilation ( #33199 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-28 08:59:54 +00:00
Yan Ma
9581185d51
[XPU]disable test_acceptance_length UT ( #33226 )
2026-01-28 15:24:13 +08:00
Maryam Tahhan
2dd359f953
[Docs] Simplify CPU x86 Docker build documentation ( #33071 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
2026-01-28 06:37:09 +00:00
Gregory Shtrasberg
22ad649501
[ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends ( #33106 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-01-28 14:36:14 +08:00
ramos
36d450e3b8
Adds FunAudioChat multimodal audio model support ( #2 ) ( #33058 )
...
Signed-off-by: ramos <49182011+nemoramo@users.noreply.github.com >
Signed-off-by: mayufeng <mayufeng@example.com >
Co-authored-by: mayufeng <mayufeng@example.com >
2026-01-28 05:18:09 +00:00
22quinn
a2b877df6c
[Bugfix] Lazy import NgramProposer in GPU model runner ( #32821 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2026-01-27 21:07:16 -08:00
Harry Mellor
35fb0b8613
Don't use min_pixels/max_pixels from Qwen2VL's processor ( #33208 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-28 05:02:08 +00:00
Harry Mellor
2eb673a088
Add flake8-implicit-str-concat rules to Ruff ( #33191 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-28 04:56:10 +00:00
Jeffrey Wang
a97b5e206d
Relax protobuf library version constraints ( #33202 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-01-28 04:15:53 +00:00
Micah Williamson
911b51b69f
[ROCm][CI] Add TORCH_NCCL_BLOCKING_WAIT For Distributed Tests (A100) ( #32891 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-28 11:32:31 +08:00
Xinan Miao
604e3b87e8
[Feature]: Container image WORKDIR consistency ( #33159 )
...
Signed-off-by: SouthWest7 <am1ao@qq.com >
Co-authored-by: SouthWest7 <am1ao@qq.com >
2026-01-28 11:06:48 +08:00
Harry Mellor
706f123b23
[Docs] Use definition lists for CLI reference docs ( #33186 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Ashwin Phadke <23502062+ashwin-phadke@users.noreply.github.com >
2026-01-28 02:22:48 +00:00
Angela Yi
fb7abfc1d0
[docs] Improve tlparse section ( #33211 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-28 02:07:37 +00:00
Kevin H. Luu
5d3d6e44e8
[CI] minor fixes to pipeline generator and tests ( #33151 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-27 17:04:02 -08:00
Woosuk Kwon
46ec6d71c7
[Model Runner V2] Use a different stream for grammar bitmask h2d copy ( #33059 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-01-27 16:37:43 -08:00
Matthew Bonanni
e82fa448c4
Add attention benchmarking tools ( #26835 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-01-28 00:09:20 +00:00
Richard Zou
d9aa39a3bb
[torch.compile] Speed up MOE handling in forward_context ( #33184 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-27 15:17:54 -08:00
Wentao Ye
3a6d5cbefd
[Perf] Optimize dcp allocate tensor ( #33102 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-27 17:24:41 -05:00
linhaifeng
f5d7049cc1
[Bugfix] Fix display error (inconsistent with context) ( #33020 )
...
Signed-off-by: linhaifeng <1371675203@qq.com >
2026-01-27 20:33:29 +00:00
Alexei-V-Ivanov-AMD
3c3c547ce0
Enabling "2 node" distributed tests in the AMD CI pipeline. ( #32719 )
...
Signed-off-by: DCCS-4560 <alivanov@chi-mi325x-pod1-112.ord.vultr.cpe.ice.amd.com >
Co-authored-by: DCCS-4560 <alivanov@chi-mi325x-pod1-112.ord.vultr.cpe.ice.amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-01-27 19:13:21 +00:00
Matthew Bonanni
1cbccb6dba
[Attention] Use has_flashinfer helper ( #33177 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-27 18:33:17 +00:00
Iris
bd92089d33
feature: support eagle3 for HunyuanVL & Hunyuan ( #33035 )
...
Signed-off-by: irisliu10 <601012173@qq.com >
Signed-off-by: Iris <38269816+irisliu10@users.noreply.github.com >
2026-01-27 17:55:48 +00:00
Karan Bansal
a6760f1525
[Doc] Improve serve parameter documentation with meaningful defaults ( #33082 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-27 09:19:37 -08:00
IriKa
66e601ef79
Support compress-tensors with nvfp4 or fp8 weights and modelopt with nvfp4 weights on Turing ( #33076 )
...
Signed-off-by: IriKa Qiu <qiujie.jq@gmail.com >
2026-01-27 11:04:05 -05:00
Nick Hill
0cd259b2d8
[BugFix] Fix P/D with non-MoE DP ( #33037 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-27 08:03:47 -08:00
danielafrimi
83fb2d09e8
Support heterogeneous NemotronHPuzzle model ( #32549 )
...
Signed-off-by: <dafrimi@nvidia.com >
Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com >
Signed-off-by: root <dafrimi@nvidia.com >
2026-01-27 10:55:54 -05:00
danisereb
f3a5ee705f
[LoRA][Spec Decode] Support LoRA for Nemotron-H MTP models ( #32265 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-27 07:53:26 -08:00
wang.yuqi
7cbbca9aaa
[Frontend] Cleanup api server ( #33158 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
2026-01-27 15:18:10 +00:00
omkhalil
5ec44056f7
[Metrics][MFU] Fix UnembedMetrics FLOP overcounting for prefill ( #33045 )
...
Fix UnembedMetrics to correctly count FLOPs for the unembedding (LM head) layer.
The bug: UnembedMetrics used total_num_tokens(), which counts every token in the
batch, to compute projection FLOPs. In the autoregressive use case, the vocab
projection runs on just the last token, so prefill FLOPs were overcounted.
Co-authored-by: Omar Mohamed Khalil <omarkhalil@meta.com >
2026-01-27 15:16:49 +00:00
Nicolò Lucchesi
492a7983dd
[Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 ( #33090 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-27 15:03:20 +00:00
Matthew Bonanni
a608b4c6c2
[5/N][Attention] Finish eliminating vllm/attention folder ( #32064 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-27 10:02:51 -05:00
Nicolò Lucchesi
1f3a2c2944
[Bugfix] Disable CG for Whisper+FA2 ( #33164 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-27 21:46:51 +08:00
omerpaz95
7227d06156
[Metrics] [KVConnector] Add Offloading Connector metrics ( #27942 )
...
Added queries and hits metrics for the Offloading Connector.
Also added timing metrics for store and load operations, which report the
average per-token time of each load and store.
The metrics are available from Prometheus and from the StatLogger.
Signed-off-by: omerpaz95 <omerpaz95@gmail.com >
Co-authored-by: Omer Paz <Omer.Paz@ibm.com >
2026-01-27 13:34:49 +00:00
Harry Mellor
14385c80fc
Fix weight mapping test for Transformers v5 ( #33162 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-27 12:30:14 +00:00
wang.yuqi
76139d0801
[Frontend] Frontend will only attach supported tasks corresponding entrypoints. ( #33139 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-27 12:15:43 +00:00
Lifan Shen
da8d0c441a
[AMD][QWEN3-NEXT] FP8 Tunings ( #32042 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2026-01-27 09:34:13 +00:00
rasmith
58996f3589
[AMD][Kernel][BugFix] Use correct scale in concat_and_cache_ds_mla_kernel when on gfx942 ( #32976 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2026-01-27 07:16:43 +00:00
Roger Wang
b539f988e1
[Models] Kimi-K2.5 ( #33131 )
...
Signed-off-by: wanglinian <wanglinian@stu.pku.edu.cn >
Signed-off-by: wangln19 <96399074+wangln19@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: wanglinian <wanglinian@stu.pku.edu.cn >
Co-authored-by: wangln19 <96399074+wangln19@users.noreply.github.com >
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-27 14:50:31 +08:00
Andreas Karatzas
6c00645712
[CI][Pooling] Stabilize ModernBERT test ( #32909 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-27 05:26:48 +00:00
Ning Xie
b781eeaa15
[code clean] remove duplicate code ( #33135 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-27 04:57:16 +00:00
Cyrus Leung
e0b005d9cf
[Frontend] Cleanup serving engine ( #33103 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 20:47:26 -08:00
Richard Zou
3b8f0fe59e
[torch.compile] Stop assuming 32 bit indexing ( #33113 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-27 04:25:02 +00:00
Cyrus Leung
c831911be2
[Frontend] Reduce mixin usage in serving pooling ( #33101 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-27 11:50:37 +08:00
Paco Xu
157caf511b
[Perf] avoid duplicate mem_get_info() call in get_current_memory_usage ( #33064 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-01-27 03:45:45 +00:00
Vincent Gimenes
0b53bec60b
[DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled ( #33109 )
...
Signed-off-by: Vincent Gimenes <147169146+VincentG1234@users.noreply.github.com >
2026-01-27 03:05:02 +00:00
Strahinja Stamenkovic
c568581ff3
Fix IndexError with encoder-decoder models when using Custom Paged Attention ( #33112 )
...
Signed-off-by: sstamenk <strahinja.stamenkovic@amd.com >
2026-01-27 10:33:37 +08:00
wangln19
2d7053438a
fix: preserve native tool call ID in multi-turn tool calling ( #32768 )
...
Signed-off-by: wanglinian <wanglinian@stu.pku.edu.cn >
Signed-off-by: wangln19 <96399074+wangln19@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Isotr0py <2037008807@qq.com >
2026-01-27 10:22:35 +08:00
Robert Shaw
5a93b9162b
[MoE Refactor] Integrate Naive Prepare Finalize into MK ( #32567 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: amirkl94 <203507526+amirkl94@users.noreply.github.com >
2026-01-27 01:28:02 +00:00
Woosuk Kwon
6d86fde09c
[Model Runner V2] Remove UvaBufferPool for cpu->gpu copy ( #33055 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-01-26 16:47:35 -08:00
XiongfeiWei
510ed1e8d3
[Bugfix][TPU] Return a Default fp8 MoE Backend ( #32908 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-26 18:46:11 -05:00
Pengchao Wang
8caffd92df
[Bugfix][MXFP4] Call trtllm_fp4_block_scale_moe with kwargs ( #33104 )
...
Signed-off-by: Pengchao Wang <wpc@fb.com >
2026-01-26 15:13:18 -08:00
dolpm
58a05b0ca1
[fix] CPUDNNLGEMMHandler pointer baked into inductor artifact ( #32913 )
...
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
2026-01-26 16:59:44 -05:00
Jared Wen
6ee7f18f33
[Logging] add --disable-access-log-for-endpoints CLI option ( #30011 )
...
Add a new CLI option --disable-access-log-for-endpoints to suppress
uvicorn access logs for specified endpoints (e.g., /health, /metrics, /ping).
This addresses the need to reduce log noise in production environments
where health check endpoints are frequently polled by load balancers or
monitoring systems, generating excessive log entries that obscure
meaningful request logs.
Fixes #29982
Signed-off-by: JaredforReal <w13431838023@gmail.com >
2026-01-26 21:49:03 +00:00
Wentao Ye
8f987883cb
[Refactor] Remove unused _moe_permute function ( #33108 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-26 16:06:45 -05:00
Kevin H. Luu
ebe0ba91db
[ci] Sync test areas with test-pipeline.yaml and enable new pipeline generator ( #33080 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: Kevin Luu <khluu@Kevins-MacBook-Pro.local >
2026-01-26 12:28:20 -08:00
Robert Shaw
43a013c3a2
[Bugfix] Fix Dtypes for Pynccl Wrapper ( #33030 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-26 20:09:32 +00:00
Cyrus Leung
c25dbee40d
[Model] Bump transformers version for test registry ( #33100 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 18:53:22 +00:00
Nicolò Lucchesi
19ab0f7ce5
[Bugfix] Fix Voxtral streaming slot_mapping ( #33073 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-26 10:40:40 -08:00
danielafrimi
67fe677c53
[FIX] Always support TP > 4 for FP4 Gemm ( #31099 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
Co-authored-by: root <root@gpu-51.slurm-workers-slurm.slurm.svc.cluster.local >
2026-01-26 11:04:20 -07:00
Andy Lo
d56afd45fd
Remove unused logic in models/mistral.py ( #33095 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-01-26 09:01:52 -08:00
Chauncey
a2393ed496
[CI] Fix AssertionError: MCP tool call not found in output_messages ( #33093 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-26 15:19:57 +00:00
Pleaplusone
be6931ee27
[ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp ( #33018 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-01-26 23:19:04 +08:00
Chauncey
9ef3b718d9
[Bugfix] Fix Can't instantiate abstract class DeepseekV32IndexerBackend ( #33052 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-26 06:44:02 -08:00
Yuxuan Zhang
bb17e8f11c
[GLM-OCR] GLM-OCR with MTP Support ( #33005 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-26 06:24:43 -08:00
Cyrus Leung
dcd80206b7
[Chore] Update type annotation of input_ids in model forward ( #33063 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 06:02:10 -08:00
danisereb
f4a0921c9c
[Performance] Tune Mamba selective scan kernel for B200 ( #32873 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-26 05:56:54 -08:00
VihaanThat
208c56256f
[Feature] Add LoRA support for Gemma3 vision components ( #32764 )
2026-01-26 13:56:40 +00:00
Alex Brooks
9ac818a551
[Misc] HF Hub LoRA Resolver ( #20320 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2026-01-26 13:56:32 +00:00
Itay Etelis
6ca2c91b96
[Model] Use mm_position to compute mrope positions for Qwen3-Omni ( #33010 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
2026-01-26 13:48:07 +00:00
cwazai
e33192b269
[lora/moe] Improve fused MoE‑LoRA kernel indexing and memory access ( #32770 )
...
Signed-off-by: 陈建华 <1647430658@qq.com >
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Signed-off-by: kimheesu <wlskaka4@gmail.com >
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: ganyi <ygan@amd.com >
Signed-off-by: whx-sjtu <2952154980@qq.com >
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com >
Signed-off-by: Ifta khairul Alam Adil <25082512+ikaadil@users.noreply.github.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Kebe <mail@kebe7jun.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Alex Sun <alex.s@amd.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Liran Schour <lirans@il.ibm.com >
Signed-off-by: liranschour <liranschour@users.noreply.github.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Signed-off-by: Richard Zou <zou3519@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: AuYang <459461160@qq.com >
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com >
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: RishabhSaini <rishabhsaini01@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: Karan Bansal <karanb192@gmail.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
Signed-off-by: raushan <raushan@huggingface.co >
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: sangbumlikeagod <oironese@naver.com >
Signed-off-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com >
Signed-off-by: Matteo Fari <matteofari06@gmail.com >
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com >
Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com >
Signed-off-by: marksverdhei <marksverdhei@hotmail.com >
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Randall Smith <ransmith@amd.com >
Signed-off-by: jon <joninco@bullpoint.org >
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Luka Govedič <luka.govedic@gmail.com >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
Signed-off-by: esmeetu <jasonailu87@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Reagan <reaganjlee@gmail.com >
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Signed-off-by: Hongjian Zhang <zhanghongjian@xiaohongshu.com >
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Signed-off-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com >
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Signed-off-by: Joshua Deng <joshuakdeng@gmail.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: cwazai <38356712+cwazai@users.noreply.github.com >
Co-authored-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Kim Hee Su <wlskaka4@gmail.com >
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Pleaplusone <ygan@amd.com >
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com >
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: danisereb <daserebrenik@nvidia.com >
Co-authored-by: Yanan Cao <gmagogsfm@users.noreply.github.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Matt <156021403+mawong-amd@users.noreply.github.com >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Lucain <lucainp@gmail.com >
Co-authored-by: Ifta khairul Alam Adil <25082512+ikaadil@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: Andreas Karatzas <akaratza@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Kebe <mail@kebe7jun.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Alex Sun <minchsun@amd.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: liranschour <liranschour@users.noreply.github.com >
Co-authored-by: Or Ozeri <or@ozery.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Shengqi Chen <harry-chen@outlook.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
Co-authored-by: Xu Jinyang <72930776+AuYang261@users.noreply.github.com >
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: David Ramon Prados <davidramon3@hotmail.es >
Co-authored-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com >
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com >
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com >
Co-authored-by: Rishabh Saini <rishabhsaini01@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Karan Bansal <karanb192@users.noreply.github.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tianshu-Michael-yu <101950379+tianshu-Michael-yu@users.noreply.github.com >
Co-authored-by: Raushan Turganbay <raushan@huggingface.co >
Co-authored-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com >
Co-authored-by: Matteo Fari <matteofari06@gmail.com >
Co-authored-by: Harry Huang <vastrockhuang162@gmail.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Orion Reblitz-Richardson <orionr@gmail.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
Co-authored-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: rasmith <Randall.Smith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
Co-authored-by: joninco <joninco@bullpoint.org >
Co-authored-by: dolpm <34420038+dolpm@users.noreply.github.com >
Co-authored-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Luka Govedič <luka.govedic@gmail.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: monajafi-amd <mohammad.najafi@amd.com >
Co-authored-by: ruizcrp <ruiz.crp@gmail.com >
Co-authored-by: Shengqi Chen <i@harrychen.xyz >
Co-authored-by: 7. Sun <jhao.sun@gmail.com >
Co-authored-by: Roy Wang <jasonailu87@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Co-authored-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com >
Co-authored-by: Xingran Wang <wangxingran123456@outlook.com >
Co-authored-by: david guan <102001211+Chenhao-Guan@users.noreply.github.com >
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Joshua Deng <91448271+joshuadeng@users.noreply.github.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-26 04:56:34 -08:00
Cyrus Leung
61274bdef5
[Doc] Further update multi-modal impl doc ( #33065 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 10:54:20 +00:00
ltd0924
b40db4dfec
[StepVL] add step vl offline example ( #33054 )
...
Signed-off-by: luotingdan <luotingdan@stepfun.com >
Co-authored-by: luotingdan <luotingdan@stepfun.com >
2026-01-26 01:00:32 -08:00
Cyrus Leung
11b556878b
[Refactor] Use data parser for matching data items to multi-modal UUIDs ( #32955 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 15:00:28 +08:00
Danielle Robinson
ee484b3f4b
Set splitk=1 for fused-moe-lora expand kernel ( #32882 )
...
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com >
Co-authored-by: Danielle Robinson <dmmaddix@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-25 22:52:34 -08:00
Woosuk Kwon
a9b53dd435
[Model Runner V2] Add LoRAState to consolidate lora logic ( #33062 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-01-25 22:21:12 -08:00
Robert Shaw
254db42ede
[Tests] Remove Duplicates ( #33032 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-26 05:23:54 +00:00
ltd0924
105d104576
[StepVL] support close img patch ( #32923 )
...
Signed-off-by: luotingdan <luotingdan@stepfun.com >
Signed-off-by: ltd0924 <32387785+ltd0924@users.noreply.github.com >
Co-authored-by: luotingdan <luotingdan@stepfun.com >
2026-01-25 20:56:39 -08:00
Lucas Wilkinson
566cdb6cfb
[CI] Fix MHA attention test failure (AttributeError when model_config is None in ViT attention backend) ( #33033 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-25 19:49:53 -08:00
Woosuk Kwon
2f0d3ba745
[Model Runner V2] Minor simplification for finish_requests ( #33048 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-01-25 18:35:02 -08:00
Woosuk Kwon
edf927bc9f
[Model Runner V2] Fix slot_mapping after #25954 ( #33046 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-01-25 18:29:49 -08:00
Andreas Karatzas
22aeb43007
[Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling ( #32969 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-26 08:34:05 +08:00
Itay Etelis
a698e8e7ad
[Model] Use mm_position to compute mrope positions for Qwen2.5-Omni ( #32772 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
2026-01-25 20:15:53 +08:00
zhanqiuhu
151e5451c2
[Doc] Add Qwen2.5 models to batch invariance tested models ( #33016 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-01-25 09:20:46 +00:00
Jee Jee Li
73b243463b
[BugFix] Add env variable to control PDL in LoRA ( #32836 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-25 16:32:30 +08:00
JJJYmmm
7e67df5570
[Bugfix] fix encoder cache hang in Qwen3VL ( #32684 )
...
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-25 05:17:31 +00:00
7. Sun
ff6c1da4e6
[Docs] Fix Apple silicon include path in CPU installation docs ( #32977 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-25 01:51:49 +00:00
Roberto L. Castro
fcb9df99bd
[Perf][Kernel] Optimize FP4 quantization kernels (SM100F) ( #32520 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-01-24 18:45:27 -07:00
TJian
1ebdff412a
[DOC] [ROCm] Update doc for v0.14.1 ( #32998 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-25 09:13:21 +08:00
Joshua Deng
91601ff478
[Feature] add session based streaming input support to v1 ( #28973 )
...
Signed-off-by: Joshua Deng <joshuakdeng@gmail.com >
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-24 12:06:28 -08:00
yugong333
d4dbb7af63
Using max_loras + 1 to construct grid in fused_moe_lora ( #32277 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
2026-01-24 12:39:30 -05:00
Maryam Tahhan
203d0bc0c2
[CPU] Improve CPU Docker build ( #30953 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-01-24 17:08:24 +00:00
Fadi Arafeh
17ab54de81
[CPU Backend][BugFix] Fix failing Darwin pipelines ( #33002 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-24 17:02:22 +00:00
7. Sun
cd775bdbe0
[Tests] Replace flaky sleep with polling in test_background_cancel ( #32986 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-24 16:39:07 +00:00
Lucas Wilkinson
da5e7b12be
[MLA] Fuse cat and quant for fp8 kv-cache ( #32950 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-24 16:03:02 +00:00
Louie Tsai
719ac592ed
Update CPU doc according to feedback ( #32963 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-24 16:02:44 +00:00
Hiroken.
1209b784f2
[Bugfix]: resolve torch.compile cache conflict between mm_encoder_tp_modes ( #32842 )
...
Signed-off-by: Hongjian Zhang <zhanghongjian@xiaohongshu.com >
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Co-authored-by: Xingran Wang <wangxingran123456@outlook.com >
2026-01-24 14:45:14 +00:00
Lukas Geiger
5fa0f6efa9
[EncoderCacheManager] Remove unnecessary copy ( #32800 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2026-01-24 14:28:57 +00:00
david guan
bc0d291bfe
feat: Complete LoRA support for MiniMaxM2 Fixes #32736 ( #32763 )
...
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-01-24 20:48:46 +08:00
Isotr0py
9ad7f89f55
[Models]: Make Multimodal config implicit in ViT implementation ( #31972 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-24 20:34:26 +08:00
Hiroken.
6450b536a6
[Bugfix] Fix E2E latency calculation and add warmup support in mm_processor benchmark ( #32646 )
...
Signed-off-by: Hongjian Zhang <zhanghongjian@xiaohongshu.com >
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Signed-off-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com >
Co-authored-by: Xingran Wang <wangxingran123456@outlook.com >
2026-01-24 10:31:41 +00:00
7. Sun
0f19427db5
[Perf] Cache exc.errors() result in validation exception handler ( #32984 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-24 02:01:35 -08:00
Cyrus Leung
51931c5c9a
[UX] Deduplicate sampling parameter startup logs ( #32953 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-24 17:37:28 +08:00
Reagan Lee
06b557ecd9
feat(benchmark): add encoder forward pass benchmarking to mm-processor ( #31655 )
...
Signed-off-by: Reagan <reaganjlee@gmail.com >
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Co-authored-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com >
2026-01-24 08:24:44 +00:00
Roger Wang
81c2a889ce
[Doc] Ignore typo check on doc ( #32999 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-23 23:52:22 -08:00
Isotr0py
8edaf38570
[Models] Add SharedFusedMoE support to Qwen3MoE ( #32082 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-23 23:36:31 -08:00
Roy Wang
5c86a89805
[docs] Update governance process links ( #32995 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-23 23:32:44 -08:00
7. Sun
0ccecf8833
[Tests] Standardize RNG seed utility across test files ( #32982 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-24 06:47:14 +00:00
7. Sun
0b9a735e11
[Tests] Clarify pytest skip reasons with actionable context ( #32981 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-24 06:38:50 +00:00
7. Sun
14d03b8ddb
[Perf] Cache xpu_get_mem_info() result to avoid duplicate calls ( #32983 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-23 20:56:23 -08:00
Michael Goin
d0cbac5827
[Dev UX] Add auto-detection for VLLM_PRECOMPILED_WHEEL_VARIANT during install ( #32948 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Shengqi Chen <i@harrychen.xyz >
2026-01-23 19:15:17 -08:00
ruizcrp
c0d820457a
Auth_token added in documentation as it is required ( #32988 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-24 03:03:05 +00:00
monajafi-amd
97ef11dd34
[ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 ( #32944 )
...
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com >
2026-01-24 10:03:07 +08:00
Xin Yang
ecc3dd66cc
[Bugfix] Fix FusedMoE LoRA kernel offs_token out of bound value ( #32279 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-24 01:41:35 +00:00
Joe Runde
7e1f10d562
[Core][Bugfix] allow graceful worker termination ( #32965 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2026-01-23 17:28:45 -08:00
ElizaWszola
a28b94e6ef
[Performance] Split FlashAttn attention and cache update ( #25954 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Luka Govedič <luka.govedic@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <luka.govedic@gmail.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2026-01-23 17:28:06 -08:00
dolpm
0118cdcc02
[fix] add VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to compile factors ( #32912 )
...
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
2026-01-23 22:53:10 +00:00
Shengqi Chen
136c499f6e
[CI] fix version comparison and exclusion patterns in upload-release-wheels.sh ( #32971 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2026-01-23 22:21:49 +00:00
joninco
ebd0a17e0e
[Bugfix] Fix missing is_layer_skipped check for FusedMoE in AWQConfig ( #32935 )
...
Signed-off-by: jon <joninco@bullpoint.org >
2026-01-23 17:19:56 -05:00
Wentao Ye
37c9859fab
[Refactor] Clean up unused variables & func ( #32692 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-23 17:04:25 -05:00
Michael Goin
4561f13985
[Refactor] Rename gptq_marlin to marlin to match MoE ( #32952 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-23 16:48:12 -05:00
rasmith
6cc6d92be5
[CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to ensure tensors passed will be made contiguous for the kernel ( #32831 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2026-01-23 13:35:48 -08:00
Wentao Ye
dfab5f3764
[Bug] Fix benchmark script moe_permute_unpermute ( #32949 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-23 16:18:56 -05:00
Markus / Mark
586a57ad7e
fix: Add glm4_moe_lite to MLA detection ( #32614 )
...
Signed-off-by: marksverdhei <marksverdhei@hotmail.com >
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-01-23 12:38:57 -08:00
Lucas Wilkinson
3a41459501
[cudagraphs] Refactor cudagraph capture loop ( #32946 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-23 13:22:20 -07:00
Nick Hill
8518b30447
[Model Runner V2] Add KV Connector support ( #32742 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-23 10:49:17 -08:00
Matthew Bonanni
2d6b537157
[Bugfix][CI] Fix pre-commit ( #32956 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-23 10:26:56 -08:00
Orion Reblitz-Richardson
68b0a6c1ba
[CI][torch nightlies] Use main Dockerfile with flags for nightly torch tests ( #30443 )
...
Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com >
Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-01-23 10:22:56 -08:00
Harry Huang
5206e5e28c
[V1][Hybrid] Mamba Prefix Caching with align mode ( #30877 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2026-01-23 09:56:48 -08:00
Matteo Fari
fec9da0af4
[Model] Enable LoRA support for internvl2 ( #32397 )
...
Signed-off-by: Matteo Fari <matteofari06@gmail.com >
2026-01-24 01:39:01 +08:00
Luka Govedič
bbbd696af9
[torch.compile][CI] Add back attn fusion on hopper/ada ( #32940 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-01-23 16:49:20 +00:00
sangbumlikeagod
9b77bb790d
[Frontend] add logprob, compression_rate to 'verbose_json' features ( #31059 )
...
Signed-off-by: sangbumlikeagod <oironese@naver.com >
Signed-off-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com >
2026-01-23 16:35:13 +00:00
Matt
305e53ade8
[Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test ( #32904 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-23 16:24:26 +00:00
Mark McLoughlin
1cb4341fbc
[ROCm][PD] Remove unused moriio connector proxy code ( #32939 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-01-23 15:59:04 +00:00
baonudesifeizhai
1fb648bf10
[Bugfix] Fix FP8 MoE EP Weight Loading for ModelOpt Llama4 ( #32886 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-01-23 10:31:48 -05:00
Nicolò Lucchesi
7e22309755
[Misc] Postpone torch_profiler deprecation ( #32867 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-23 14:39:48 +00:00
Xin Yang
90c2007932
[Bugfix] Disable tma_aligned_scales in test_fusions_e2e ( #32916 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-23 14:34:30 +00:00
Raushan Turganbay
d95d650762
[Bugfix] Fix getting vision features in Transformer Multimodal backend ( #32933 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-01-23 13:34:48 +00:00
tianshu-Michael-yu
13d8746c54
[Feature]: Remove DtoH Copy for lfm2_vl On Default Stream ( #32815 )
...
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
2026-01-23 13:20:30 +00:00
Fadi Arafeh
10e94c84f6
[CPU][Feat] Update PyTorch to v2.10 for CPU Backend ( #32869 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-01-23 21:13:06 +08:00
Isotr0py
243e78c20f
[Benchmark][Bugfix] Fix race condition when starting server for sweep benchmark ( #32927 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-23 12:11:18 +00:00
Fadi Arafeh
aac0b817fa
[CPU Backend][BugFix] Fix failing CPU MoE test ( #32876 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-23 12:06:51 +00:00
wang.yuqi
05f3d714db
[Frontend][3/n] Make pooling entrypoints request schema consensus | EmbedRequest & ClassifyRequest ( #32905 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 12:03:44 +00:00
Patrick von Platen
3f3f89529d
[Voxtral] Add new streaming arch ( #32861 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 12:41:52 +01:00
Li, Jiang
5da4c7d789
[CI/Build][CPU] Fix failed pooling tests and macos smoke test ( #32907 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 10:48:20 +00:00
Nicolò Lucchesi
160c6fa387
[Misc] Add get_name to missing AttentionBackends ( #32698 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-23 10:35:44 +00:00
Andreas Karatzas
a8eb1182f1
[CI][Models] Add VLM Support for Sequence Classification Conversion ( #32885 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-23 16:22:51 +08:00
Karan Bansal
fa6e599a61
[Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set ( #32777 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
2026-01-23 08:22:37 +00:00
Wentao Ye
7ef5873752
[CI] Fix mypy for vllm/v1/structured_output ( #32722 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-23 11:55:51 +08:00
Luka Govedič
5e4e0e51f4
[torch.compile] Compile CustomOp.forward_native for SiluAndMul and QuantFP8 to avoid raw torch ops inside opaque custom ops ( #32806 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-22 19:52:26 -08:00
Rishabh Saini
f61c9da711
[BugFix] deepseek_v32_encoding: Replace asserts with proper exceptions ( #32884 )
...
Signed-off-by: RishabhSaini <rishabhsaini01@gmail.com >
2026-01-23 03:44:11 +00:00
Nick Hill
7fe255889e
[Misc] Log vLLM logo when starting server ( #32796 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-23 11:15:12 +08:00
bnellnm
dc917cceb8
[MoE Refactor] Move select_experts from FusedMoEQuantMethod -> FusedMoE ( #31996 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-01-22 18:21:35 -05:00
Fadi Arafeh
fc56f4a071
[BugFix] Fix invalid flashinfer_fused_moe_blockscale_fp8 op registration ( #32855 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-22 22:27:40 +00:00
Xin Yang
d08b356ee0
[Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper ( #32619 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-22 15:47:04 -05:00
Wentao Ye
f744810184
[Refactor] Remove unused tpu files ( #32610 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-22 15:35:18 -05:00
Eldar Kurtić
44f08af3a7
Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) ( #30141 )
...
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com >
2026-01-22 13:29:57 -07:00
Matthew Bonanni
955b43a5a5
[Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 ( #32795 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-22 19:05:18 +00:00
Fadi Arafeh
744ef30484
[CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm ( #32792 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-22 18:55:23 +00:00
Matthew Bonanni
300622e609
[CI][Attention] Add more CI dependencies for attention tests ( #32487 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-22 18:44:56 +00:00
RickyChen / 陳昭儒
69d09fdd6c
[Feature] Add --ssl-ciphers CLI argument for TLS cipher control ( #30937 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
2026-01-22 09:53:24 -08:00
David Ramon Prados
3a63be0faa
Support custom URI schemes and trace handlers for profiler ( #32393 )
2026-01-22 09:45:40 -08:00
Tyler Michael Smith
803e3f3f68
[UX] Default api_server_count to dp_size if not specified ( #32525 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-01-22 17:35:35 +00:00
Vadim Gimpelson
70917b1c55
[MISC] Add .cursor to .gitignore ( #32868 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-22 17:27:13 +00:00
Matt
c517d8c934
[Hardware][AMD][CI][Bugfix] Fix regressions from deprecated env vars ( #32837 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-23 00:59:15 +08:00
Xu Jinyang
fc37187a51
[Bugfix] ModelScope is supported when downloading LORA models. ( #32844 )
...
Signed-off-by: AuYang <459461160@qq.com >
2026-01-22 16:33:21 +00:00
Maximilien de Bayser
ff365eea94
Support bge-m3 sparse embeddings and colbert embeddings ( #14526 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2026-01-22 23:52:57 +08:00
Isotr0py
444e2e7e1f
[Misc] Bump opencv-python dependency version to 4.13 ( #32668 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-22 15:51:15 +00:00
Nick Hill
bc14663e6a
[Cleanup] Move scheduler get_routed_experts logic to separate method ( #32706 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-22 10:46:00 -05:00
Richard Zou
654a71fc3c
[torch.compile] Improve Cold Start for MoEs ( #32805 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-22 10:44:40 -05:00
Lucas Kabela
15e302dfce
[Misc][BE] Turn on strict type coverage for vllm/compilation ( #31756 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-22 15:12:26 +00:00
Cyrus Leung
d117a4d1a9
[Frontend] Introduce Renderer for processing chat messages (using ModelConfig) ( #30200 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-22 12:44:22 +00:00
Or Ozeri
421012b63a
OffloadingConnector: Support kernel_block_size != block_size ( #30692 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-22 12:30:04 +00:00
Chauncey
841d53aaa8
[Frontend] add prompt_cache_key for openresponses ( #32824 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-22 11:34:14 +00:00
Shengqi Chen
1752262e96
[CI] refactor release pipeline config into groups ( #32833 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2026-01-22 11:27:21 +00:00
Nicolò Lucchesi
ea6102b85d
[Bugfix] Fix Whisper/encoder-decoder GPU memory leak ( #32789 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-22 10:50:37 +00:00
wang.yuqi
328cbb2773
[Frontend][2/n] Make pooling entrypoints request schema consensus | ChatRequest ( #32574 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-22 10:32:44 +00:00
liranschour
64e3d67ac0
Enable Cross layers KV cache layout at NIXL Connector ( #30207 )
...
Signed-off-by: Liran Schour <lirans@il.ibm.com >
Signed-off-by: liranschour <liranschour@users.noreply.github.com >
Co-authored-by: Or Ozeri <or@ozery.com >
2026-01-22 10:12:58 +00:00
Nick Hill
098b2d66fe
[Benchmark] Don't default to temperature==0 in vllm bench serve ( #32723 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-22 10:03:15 +00:00
Isotr0py
8ebf271bb6
[Misc] Replace urllib's urlparse with urllib3's parse_url ( #32746 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-22 16:37:15 +08:00
Alex Sun
49a1262267
[AMD][ROCm] MoRI EP: a high-performance all2all backend ( #28664 )
...
Signed-off-by: Alex Sun <alex.s@amd.com >
2026-01-22 16:33:18 +08:00
Cyrus Leung
2b8a38b6d6
[Model] Extend collect_children and no_init_weights contexts ( #32757 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-22 08:20:27 +00:00
Kebe
1bf1a34b19
[bench] add start_times field to vllm bench serve json result ( #32667 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2026-01-22 07:10:14 +00:00
Andreas Karatzas
a810299838
[ROCm][CI][Docs] Add comment explaining TRITON_ATTN fallback for ROCm ( #32835 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-21 22:11:09 -08:00
Andreas Karatzas
eb1629da24
[ROCm][CI] Fix AITER test flakiness by using explicit attention backend ( #32346 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-22 13:55:25 +08:00
Micah Williamson
019e2c3b7c
[ROCm][CI] Lower Acceptance Len Threshold For test_draft_model_quantization ( #32731 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-22 05:47:33 +00:00
Huy Do
f5fdec8ce2
Upgrade transformers-4.57.5 ( #32287 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-01-22 05:19:19 +00:00
Patrick von Platen
1579c9b5fd
[Llama.py -> mistral.py] Extract mistral-only relevant code into separate file ( #32780 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-01-22 05:14:57 +00:00
Lucas Wilkinson
889722f3bf
[FlashMLA] Update FlashMLA to expose new arguments ( #32810 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-21 22:02:39 -07:00
Divakar Verma
49d9653852
[ROCm][CI] fix get_valid_backends ( #32787 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-22 04:27:47 +00:00
Ifta khairul Alam Adil
a1d82466ea
[Docs] Remove outdated async_scheduling limitation with speculative decoding ( #32775 )
...
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com >
Signed-off-by: Ifta khairul Alam Adil <25082512+ikaadil@users.noreply.github.com >
2026-01-21 20:19:25 -08:00
Lucain
24a163ed77
Cleanup some huggingface_hub-related stuff ( #32788 )
2026-01-22 03:38:17 +00:00
knlnguyen1802
378385b90c
[EC Connector] Optimize remote cache check in scheduler ( #32585 )
...
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
2026-01-22 03:30:59 +00:00
Matt
c5487e2b96
[Bugfix] Fix potential EAGLE spec decode segfault during graph capture ( #32818 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-22 03:11:55 +00:00
Wentao Ye
6437ff1fb9
[Deprecation] Remove deprecated environment variables ( #32812 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-22 02:25:16 +00:00
Woosuk Kwon
5e00b561cd
[Model Runner V2] Do not error on attention backends ( #32820 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-21 17:02:48 -08:00
Woosuk Kwon
408195ec59
[Model Runner V2] Refactor Prompt Logprobs ( #32811 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-21 15:12:20 -08:00
Xin Yang
63227accf5
[Kernel] Add topk_sigmoid kernel ( #31246 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-21 22:49:51 +00:00
Yanan Cao
e675dda67b
[Misc] Add Helion version check to collect_env ( #32797 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-21 21:54:46 +00:00
Nick Hill
24dc30f7ff
[ModelRunner V2] Don't pin reused flashinfer tensors ( #32799 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-21 13:17:43 -08:00
Divakar Verma
180fba653e
[ROCm] fix import for on_gfx9 ( #32783 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-21 18:41:11 +00:00
danisereb
f999539869
Add missing import of fused_topk to benchmark_moe ( #32784 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-01-21 18:30:10 +00:00
Woosuk Kwon
e1da249c93
[Model Runner V2] Minor refactor for compute_slot_mappings ( #32794 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-21 10:24:35 -08:00
Nick Hill
9b693d023c
[Misc] Omit "disable NCCL for DP sync" startup log when not applicable ( #32707 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-21 17:03:39 +00:00
elvischenv
808d6fd7b9
Bump Flashinfer to v0.6.1 ( #30993 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-01-21 08:49:50 -08:00
whx
1861ae8aae
[PluggableLayer][1/N] Define PluggableLayer (Fix ci) ( #32744 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-01-21 11:38:04 -05:00
Robert Shaw
4e31b7f228
[Quantization][Deprecation] Remove RTN ( #32697 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-21 16:34:42 +00:00
Pleaplusone
6c20e89c02
[ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp ( #29287 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-01-21 23:16:30 +08:00
Robert Shaw
85f55c943c
[Quantization][Deprecation] Deprecate HQQ ( #32681 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-21 09:32:40 -05:00
Robert Shaw
cea3c754c4
[Quantization][Deprecation] Remove DeepSpeedFp8 ( #32679 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-21 09:32:12 -05:00
Robert Shaw
42135d6898
[MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority ( #32414 )
2026-01-21 08:22:33 -05:00
Divakar Verma
e14467be43
[bugfix] Aria model ( #32727 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-21 05:11:31 -08:00
Kim Hee Su
7727ce35c2
[Model] Add Eagle2.5-8B Vision-Language Model support ( #32456 )
...
Signed-off-by: kimheesu <wlskaka4@gmail.com >
2026-01-21 09:39:53 +00:00
Yanwen Lin
6bb2bc71e2
[Bugfix] Force using spawn multiprocess method when it's the WSL platform ( #32749 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
2026-01-21 09:35:55 +00:00
Lucas Kabela
c80f92c14d
[Documentation] Fix typo in docs/design/torch_compile_multimodal.md ( #32741 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-20 23:54:20 -08:00
RickyChen / 陳昭儒
f23fb5a7c1
[Bugfix] Support HF sharded weights for Mistral3/Pixtral models ( #32673 )
...
Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com >
Signed-off-by: vllm-dev <ricky.chen@infinirc.com >
2026-01-20 23:27:30 -08:00
Paco Xu
360aa93f8f
[Docs] Fix GitHub handle in governance process ( #32582 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-01-21 07:07:50 +00:00
Netanel Haber
27ca95b3c9
[Bugfix] Fix Nemotron-Nano-v2-vlm static resolution ( #32682 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-01-21 06:28:21 +00:00
Lucas Wilkinson
b4f64e5b02
Update FlashMLA ( #32491 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-21 13:03:37 +08:00
shanjiaz
7ab80a8e37
Added qwen3 vision language moe support for speculative decoding ( #32048 )
...
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com >
Signed-off-by: shanjiaz <43143795+shanjiaz@users.noreply.github.com >
2026-01-21 03:24:05 +00:00
gopalsarda
0900cedb3f
Enable Eagle3 speculative decoding for Pixtral (LlavaForConditionalGeneration) ( #32542 )
...
Signed-off-by: gopalsarda <gopal.sarda@servicenow.com >
2026-01-21 11:18:05 +08:00
Nick Hill
6f067b1fb7
[Cleanup] Remove unused KVConnectorModelRunnerMixin methods ( #32077 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-21 11:16:37 +08:00
Alex Brooks
27b81e010d
[Bugfix] Fix Granite Vision / Don't use Siglip Pooling Head Nested Models by Default ( #32299 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2026-01-21 11:11:52 +08:00
Or Ozeri
7013e9ac8f
OffloadingConnector: Prevent redundant loads ( #29087 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-21 01:15:42 +00:00
Robert Shaw
c78ee240b3
Revert "[PluggableLayer][1/N] Define PluggableLayer" ( #32725 )
2026-01-21 00:21:06 +00:00
Vasiliy Kuznetsov
d2389c1262
fp8 online quant: split out Fp8OnlineLinearMethod ( #32189 )
2026-01-20 18:13:22 -05:00
Micah Williamson
22375f8d13
[ROCm][CI] Remove DS async eplb accuracy test from AMD CI ( #32717 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-20 13:40:48 -08:00
TJian
9b67338b78
[Bugfix] Suppress log on non-ROCm platform ( #32703 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-20 13:38:20 -08:00
Lucas Wilkinson
2261340806
[Misc] Remove pad_for_cudagraphs from config ( #30143 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-20 15:05:48 -05:00
Shinichi Hemmi
86c69dc54c
[Bugfix] Fix byte fallback handling when using outlines ( #31391 )
...
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
Co-authored-by: Kenichi Maehashi <maehashi@preferred.jp >
2026-01-20 19:48:08 +00:00
dolpm
7c5dedc247
[AOT compilation] support torch.compile inductor artifacts in VllmCompiledFunction ( #25205 )
...
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
2026-01-20 19:45:59 +00:00
Cyrus Leung
193069d129
[5/N] Initialize MM components in context managers (Q-Z) ( #32695 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 19:10:23 +00:00
Rahul Tuli
f0feb1cf81
Test: added acceptance length tests ( #32030 )
...
Signed-off-by: rahul-tuli <rtuli@redhat.com >
2026-01-20 18:55:15 +00:00
Cyrus Leung
09194b90a5
[Doc] Update docs for MM model development with context usage ( #32691 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 10:37:35 -08:00
Woosuk Kwon
9ab4388cd3
[Model Runner V2] Support FLASHINFER_MLA backend ( #32709 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-20 10:26:17 -08:00
JJJYmmm
04a9e064db
[Bugfix] fix the ima issue of qwen-vit ( #32687 )
...
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
2026-01-20 17:21:25 +00:00
TJian
c025263ddd
[Doc] [ROCm] Update ROCm getting started doc ( #32580 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 09:20:08 -08:00
Wentao Ye
6c97b9b9b6
[Perf] Only clone when needed for moe_permute ( #32273 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-20 11:34:39 -05:00
whx
4ca62a0dbd
[PluggableLayer][1/N] Define PluggableLayer ( #32331 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-01-20 16:19:21 +00:00
linhaifeng
7901109ea5
[Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculation ( #32603 )
...
Signed-off-by: linhaifeng <1371675203@qq.com >
2026-01-20 11:13:39 -05:00
YiSheng5
13f6630a9e
[XPU]Support AgRsAll2AllManager on XPU device ( #32654 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2026-01-20 14:27:24 +00:00
Cyrus Leung
fda3f03eb2
[4/N] Initialize MM components in context managers (M-P) ( #32663 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 14:06:32 +00:00
杨朱 · Kiki
bb9172030e
[Metrics] Complete removal of deprecated vllm:time_per_output_token_seconds metric ( #32661 )
...
This PR completes the removal of the vllm:time_per_output_token_seconds metric,
which was deprecated in v0.11, hidden in v0.12, and scheduled for removal in v0.13,
but delayed until v0.15.
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com >
2026-01-20 12:28:41 +00:00
Chauncey
c4e5bdf61b
[Bugfix] Fix the fp8_mqa_logits dim mismatch ( #32652 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-20 18:48:07 +08:00
Cyrus Leung
7f1bcd18ff
[3/N] Initialize MM components in context managers (I-L) ( #32650 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 10:21:56 +00:00
Walter Beller-Morales
8be263c3fb
[Core] Cleanup shm based object store on engine shutdown ( #32429 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-01-20 08:53:37 +00:00
Cyrus Leung
e1a34c3a5d
[2/N] Initialize MM components in context managers (E-H) ( #32641 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 08:12:56 +00:00
vllmellm
148117ea2e
[Refactor] Make FP8 Linear Ops use kernel abstraction ( #27814 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-20 14:48:20 +08:00
Woosuk Kwon
e9c83cdc51
[Model Runner V2] Skip kernel launch for penalties & logit_bias ( #32634 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 22:20:19 -08:00
Cyrus Leung
b75e85dede
[1/N] Initialize MM components in context managers (A-D) ( #32632 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 14:12:42 +08:00
Cyrus Leung
4753f3bf69
[Model] Use context managers for encoder- and LM-only mode ( #32605 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 11:43:38 +08:00
Woosuk Kwon
6c01ffb897
[Model Runner V2] Decouple temperature from penalties ( #32629 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 19:13:24 -08:00
Woosuk Kwon
7b7cdce968
[Model Runner V2] Refactor get_cudagraph_and_dp_padding ( #32625 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 18:25:02 -08:00
Jackmin801
12dab78f49
[Feat] allow inplace loading lora ( #31326 )
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com >
Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-20 10:15:20 +08:00
Woosuk Kwon
05dc4bfab6
[Model Runner V2] Initialized communication buffer for DP ( #32624 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 17:27:06 -08:00
Matthew Bonanni
1a1fc3bbc0
[Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill ( #32615 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-19 18:41:34 -05:00
Woosuk Kwon
43fada5360
[Model Runner V2] Refactor dummy_run ( #32533 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 14:50:59 -08:00
Tomas Ruiz
4a5299c93f
feat: spec decode with draft models ( #24322 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2026-01-19 16:05:46 -05:00
lon
73f2a81c75
docs: prefix caching seems quite outdated ( #28784 )
...
Signed-off-by: lon <114724657+longregen@users.noreply.github.com >
Signed-off-by: Russell Bryant <russell.bryant@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <russell.bryant@gmail.com >
2026-01-19 11:49:52 -08:00
jiahanc
7350331718
[BugFix] Fix TRT-LLM NVFP4 DP/EP ( #32349 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-19 14:32:24 -05:00
Yanan Cao
9d1e611f0e
[CI] Add Helion as an optional dependency ( #32482 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-19 19:09:56 +00:00
Vadim Gimpelson
0727cc9ecf
[BUGFIX] Fix test_mla_backends.py. Scale MLA projection weights to prevent numerical instability ( #32529 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-19 13:49:29 -05:00
qli88
a0490be8f1
[CI][amd] Revert NIXL connector change to avoid crash ( #32570 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-19 18:39:16 +00:00
Netanel Haber
cd3ac5b797
support dynamic resolution image encoding for Nemotron Nano VL ( #32121 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-01-19 18:15:58 +00:00
Jee Jee Li
2636d76257
[Misc] Remove unused ModelKeys ( #32608 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-19 17:34:59 +00:00
danisereb
aa7f37ccfa
Add support for LoRA adapters in Nemotron-H models ( #30802 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-01-19 22:30:44 +08:00
wang.yuqi
c88860d759
[Frontend] Score entrypoint support data_1 & data_2 and queries & documents as inputs ( #32577 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-19 14:07:46 +00:00
Nicolò Lucchesi
758df5afe7
[NIXL][Metrics] Track nixl_num_kv_expired_reqs metric in Prometheus ( #32340 )
...
Add a new metric to track the number of requests that had their KV blocks
expire. The scenario is particularly important to surface and track as it is a
vital indicator of the health of the deployment.
Currently we resort to tracking these failures through unstructured log
parsing (which is, among other things, dependent on the error string); current main:
> Releasing expired KV blocks for request cmpl-071d which were retrieved by 0 decode worker(s) within 0 seconds.
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-19 12:28:27 +00:00
Daniel Mescheder
cdd03d25d3
[CI/Build] Fix dependency conflict between model-hosting-container-standards and starlette ( #32560 )
...
Signed-off-by: Daniel Mescheder <dmesch@amazon.com >
Co-authored-by: Daniel Mescheder <dmesch@amazon.com >
2026-01-19 03:27:08 -08:00
Nicolò Lucchesi
74c583bc50
[Core] Whisper support torch.compile ( #30385 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-19 10:02:31 +00:00
Andreas Karatzas
c0a350ca73
[ROCm][CI] Add ROCm attention backend support for EAGLE DP tests ( #32363 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-19 09:57:54 +00:00
Yuxuan Zhang
71832ba71e
[GLM-4.7] GLM Model support for GLM-Lite ( #31386 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Yuxuan Zhang <2448370773@qq.com >
2026-01-19 01:18:38 -08:00
Matt
11bbf86f6a
[CI][Hardware][AMD] Fix test_rotary_embedding_mla_cache_fused ( #32408 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-19 08:25:47 +00:00
Hyunkyun Moon
3c8740aacb
[Frontend] Add render endpoints for prompt preprocessing ( #32473 )
...
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com >
Signed-off-by: Hyunkyun Moon <mhg5303@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-19 12:21:46 +08:00
Alex Brooks
7518a3dc65
[CI/Build] Use Common Event Map Fixture in Harmony / MCP Server Tests ( #32531 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2026-01-19 04:05:51 +00:00
honglyua
976af2f314
[BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration ( #32462 )
2026-01-19 03:06:02 +00:00
Woosuk Kwon
9a1f16da1e
[Model Runner V2] Refactor update_states ( #32562 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-18 17:32:42 -08:00
Woosuk Kwon
bb1848cd62
[Model Runner V2] Support VLM ( #32546 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-18 16:58:51 -08:00
Vadim Gimpelson
6101a26dc9
[BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 ( #32417 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-18 16:57:32 -08:00
Iryna Boiko
f5d1740030
[Bugfix] Add OOT backend option ( #32471 )
...
Signed-off-by: Iryna Boiko <iboiko@habana.ai >
2026-01-18 22:20:39 +00:00
Wentao Ye
eebc58df0c
[Refactor] Remove unused cutlass moe problem size function ( #32047 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-18 12:46:59 -08:00
Wentao Ye
16de822c71
[Refactor] Remove unused file pallas_kv_cache_update.py ( #32433 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-18 12:46:39 -08:00
Deming
5480c6b1fa
[Doc] Correct comment for _jobs dict in OffloadingConnectorWorker ( #32556 )
2026-01-18 12:46:00 -08:00
Andrey Khalyavin
ba29ab441e
Use the same memory for workspace13 and fused_output. ( #31531 )
...
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru >
2026-01-18 19:14:22 +00:00
Robert Shaw
afc3622602
[CI] Move Distributed Tests from H200 -> H100 ( #32555 )
2026-01-18 10:25:23 -08:00
bnellnm
327a02d8db
[MoE Refactor] Separate Router into OO Classes ( #30623 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-01-18 11:40:49 -05:00
tjp_zju
2f03035a61
"refactor: refactor_repeated_interfaces" ( #32486 )
...
Signed-off-by: tom-zju <tanjianpingzju1990@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-18 22:07:01 +08:00
Isotr0py
38bf2ffb21
[Bugfix] Fix GLM-ASR audio encoder RoPE dim ( #32540 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-18 19:17:59 +08:00
Li Xie
c826c72a96
[Model] Support Step1 Model ( #32511 )
...
Signed-off-by: xieli <xieli@stepfun.com >
2026-01-18 10:20:46 +00:00
Canlin Guo
fe36bf5e80
[Model] Remove the unnecessary dtype conversion in MiniCPM ( #32523 )
...
Signed-off-by: gcanlin <canlinguosdu@gmail.com >
2026-01-18 08:07:28 +00:00
Woosuk Kwon
963dc0b865
[Model Runner V2] Minor optimization for eagle input processing ( #32535 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-17 21:55:17 -08:00
Isotr0py
8cc26acd8b
[Performance] Improve Triton prefill attention kernel's performance ( #32403 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-17 20:19:59 -08:00
Robert Shaw
4a6af8813f
[MoE Refactor] Move Test Impl into Test Dirs ( #32129 )
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2026-01-18 12:16:59 +08:00
Woosuk Kwon
4147910f1e
[Model Runner V2] Move mrope_positions buffer to MRopeState ( #32532 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-17 20:09:48 -08:00
Karan Bansal
3055232ba0
[Feature] Add FIPS 140-3 compliant hash algorithm option for multimodal hashing ( #32386 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
2026-01-18 11:02:01 +08:00
Shengqi Chen
965765aef9
[build] fix cu130 related release pipeline steps and publish as nightly image ( #32522 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2026-01-17 18:36:11 -08:00
Mritunjay Kumar Sharma
9e078d0582
[CI/Build][Docker] Add centralized version manifest for Docker builds ( #31492 )
...
Signed-off-by: Mritunjay Sharma <mritunjay.sharma@chainguard.dev >
2026-01-17 13:45:30 +00:00
Guofang.Tang
2b99f210f5
[Misc] Fix typo: seperator -> separator in flashmla_sparse.py ( #32411 )
...
Signed-off-by: Guofang Tang <tinggofun@gmail.com >
Co-authored-by: Guofang Tang <tinggofun@gmail.com >
2026-01-17 12:18:30 +00:00
Kim Hee Su
1646fea672
[Model] Molmo2: Enable quantized weight mapping for vision backbone ( #32385 )
...
Signed-off-by: kimheesu <wlskaka4@gmail.com >
2026-01-17 09:33:05 +00:00
Paul Pak
d3317bbba4
[Models] Lfm2Moe: minor name changes for resolving lora conflicts ( #29063 )
...
Signed-off-by: Paul Pak <paulpak58@gmail.com >
2026-01-16 22:12:55 -08:00
Shengqi Chen
8e61425ee6
[CI] Implement uploading to PyPI and GitHub in the release pipeline, enable release image building for CUDA 13.0 ( #31032 )
2026-01-17 04:52:33 +00:00
Matthew Bonanni
2e7c89e708
Revert "[Attention][MLA] Make FLASHINFER_MLA the default MLA backen… ( #32484 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-17 04:42:39 +00:00
vanshil shah
037a6487af
apply _validate_input to MistralTokenizer token-id chat prompts ( #32448 )
...
Signed-off-by: Vanshil Shah <vanshilshah@gmail.com >
2026-01-17 03:23:45 +00:00
Simon Mo
5a3050a089
[Docs][Governance] Add @robertshaw2-redhat to lead maintainers group ( #32498 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-01-16 18:35:49 -08:00
Chenyaaang
484e22bc18
[TPU][Core] Enable Pipeline Parallelism on TPU backend ( #28506 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2026-01-16 15:29:20 -08:00
Lucas Wilkinson
ca21288080
[CI] Fix OOM in Hopper Fusion E2E Tests (H100) ( #32489 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-16 21:27:16 +00:00
Andrew Xia
4c82b6fac7
[responsesAPI] allow tuning include_stop_str_in_output ( #32383 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-01-16 21:14:40 +00:00
Xin Yang
a884bc62d6
[LoRA] Update LoRA expand kernel heuristic ( #32425 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-16 18:38:07 +00:00
Hashem Hashemi
7a1030431a
Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. ( #29843 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-01-16 11:45:04 -06:00
Wentao Ye
9fd918e510
[CI] Update deepgemm to newer version ( #32479 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-17 01:18:05 +08:00
Ilya Markov
c9a533079c
[EPLB][BugFix]Possible deadlock fix ( #32418 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-01-16 09:11:01 -05:00
rasmith
6ca4f400d8
[CI][AMD] Skip test_permute_cols since the kernel is not used and not built for ROCm ( #32444 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
2026-01-16 16:22:53 +08:00
Cyrus Leung
180e981d56
[Chore] Replace swish with silu ( #32459 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-16 08:22:45 +00:00
Micah Williamson
b84c426a8c
[ROCm][CI] Skip Qwen3-30B-A3B-MXFP4A16 Eval Test On Non-CUDA Platforms ( #32460 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-16 00:17:44 -08:00
Rabi Mishra
b66b0d6abb
fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROCm ( #32244 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-16 15:31:10 +08:00
Hongxin Xu
03da3b52ef
[Bugfix] Refactor to support DP parallel in R3 ( #32306 )
...
Signed-off-by: xhx1022 <1737006628@qq.com >
Co-authored-by: arlenxu <arlenxu@tencent.com >
2026-01-16 15:13:58 +08:00
Lucas Wilkinson
14ce524249
[CI] Breakup h200 tests ( #30499 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-16 06:23:22 +00:00
wang.yuqi
4ae77dfd42
[Frontend][1/n] Make pooling entrypoints request schema consensus | CompletionRequest ( #32395 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-16 06:17:04 +00:00
XiongfeiWei
73f635a75f
[Bug] Add TPU backend option ( #32438 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2026-01-16 05:17:12 +00:00
cjackal
35bf5d08e8
[bugfix] Fix online serving crash when text type response_format is received ( #26822 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
Signed-off-by: j0shuajun <59368606+j0shuajun@users.noreply.github.com >
Co-authored-by: j0shuajun <59368606+j0shuajun@users.noreply.github.com >
2026-01-16 12:23:54 +08:00
Kebe
5de6dd0662
[Bugfix] [DeepSeek-V3.2] fix sparse_attn_indexer padding ( #32175 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-16 03:21:55 +00:00
ltd0924
709502558c
[Model] Add Step3vl 10b ( #32329 )
...
Signed-off-by: luotingdan <luotingdan@stepfun.com >
Signed-off-by: ltd0924 <32387785+ltd0924@users.noreply.github.com >
Co-authored-by: luotingdan <luotingdan@stepfun.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-15 19:04:16 -08:00
Micah Williamson
46f8a982b1
[ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test ( #32431 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-16 00:55:57 +00:00
Matthew Bonanni
bcf2333cd6
[CI] Fix LM Eval Large Models (H100) ( #32423 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-16 00:52:49 +00:00
Michael Goin
83239ff19a
Add thread_n=64 support to Marlin MoE ( #32360 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-15 16:45:44 -08:00
TomerBN-Nvidia
c277fbdf31
[Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors ( #32257 )
...
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Tomer Natan <tbarnatan@ipp1-1429.ipp1a1.colossus.nvidia.com >
2026-01-15 16:15:05 -08:00
Wentao Ye
aca5c51487
[Refactor] Remove unused file ( #32422 )
2026-01-15 15:59:38 -07:00
Yongye Zhu
31c29257c8
[MoE Refactor][17/N] Apply Refactor to Bf16 ( #31827 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-15 12:53:40 -08:00
Aleksandr Malyshev
8c11001ba2
[ROCM] DSfp4 mla projection gemms weight dynamic quantization ( #32238 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2026-01-15 14:13:08 -06:00
Richard Zou
bd292be0c0
[BugFix] Python file source reading can fail on UnicodeDecodeError ( #32416 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-15 20:01:41 +00:00
TJian
41c544f78a
[ROCm] [CI] [Release] Rocm wheel pipeline with sccache ( #32264 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-16 02:56:18 +08:00
Michael Goin
1be5a73571
[UX] Use kv_offloading_backend=native by default ( #32421 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-15 18:55:11 +00:00
Lucas Wilkinson
c36ba69bda
[BugFix] Fix assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1] in Blackwell Quantized MoE Test ( #32362 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-15 10:19:12 -08:00
Matthias Gehre
047413375c
[Attention][AMD] Make flash-attn optional ( #30361 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-01-15 17:18:24 +00:00
smit kadvani
74e4bb1c5a
fixing podman build issue ( #32131 )
...
Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com >
Co-authored-by: Smit Shaileshbhai Kadvani <kadvani@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-01-15 11:07:08 -06:00
Wentao Ye
b34474bf2c
[Feature] Support async scheduling + PP ( #32359 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-15 12:06:23 -05:00
Woosuk Kwon
6218034dd7
[Model Runner V2] Support FlashInfer backend & Fix CUDA Graph bug [1/2] ( #32348 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-15 08:59:23 -08:00
Pleaplusone
77c16df31d
[ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm ( #32413 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-01-15 16:35:47 +00:00
Pleaplusone
130d6c9514
[ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for AiterFlashAttentionBackend ( #29887 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-01-15 15:29:53 +00:00
Dipika Sikka
361dfdc9d8
[Quant] Support MXFP4 W4A16 for compressed-tensors MoE models ( #32285 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-15 07:25:55 -08:00
Matthew Bonanni
8ebfacaa75
[Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill ( #32339 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-15 09:49:57 -05:00
brian033
b89275d018
[ROCm] Improve error handling while loading quantized model on gfx120… ( #31715 )
...
Signed-off-by: brian033 <85883730+brian033@users.noreply.github.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-01-15 04:16:00 -08:00
Cyrus Leung
28459785ff
[3/N] Group together media-related code ( #32406 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-15 11:52:12 +00:00
rasmith
8853a50af2
[CI][BugFix][AMD][FP8] Fix test_rms_norm so it runs correctly on ROCm ( #32372 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2026-01-15 19:05:54 +08:00
Douglas Lehr
c5891b5430
[ROCM] Add ROCm image build to release pipeline ( #31995 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
2026-01-15 19:01:40 +08:00
Chauncey
707b44cc28
[Refactor] [11/N] to simplify the mcp architecture ( #32396 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-15 18:49:31 +08:00
rongfu.leng
3a4e10c847
[Benchmark] [Feature] add vllm bench sweep startup command ( #32337 )
...
Signed-off-by: lengrongfu <lenronfu@gmail.com >
2026-01-15 09:25:46 +00:00
Cyrus Leung
cbbae38f93
[2/N] Move cache factories to MM registry ( #32382 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-15 01:02:30 -08:00
Cyrus Leung
cdba4c74b3
[Model] Avoid token selection in SigLIP pooling head ( #32389 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-15 17:01:59 +08:00
seeksky
a52d1396a7
fix: avoid crash on zero-arg tool calls in glm4 parser ( #32321 )
...
Signed-off-by: seekskyworld <djh1813553759@gmail.com >
2026-01-15 08:45:59 +00:00
dtc
1e584823f8
[Bugfix] Strengthen the check of X-data-parallel-rank in Hybrid LB mode ( #32314 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
2026-01-15 16:31:16 +08:00
Chauncey
4c1c501a7e
[Refactor] [10/N] to simplify the vLLM openai completion serving architecture ( #32369 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-15 07:41:34 +00:00
Andreas Karatzas
ae1eba6a9a
[ROCm][CI] Pin transformers 4.57.3 to fix jina test failures ( #32350 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-15 15:19:34 +08:00
Ofir Zafrir
e9ec2a72d8
[Bugfix] Fix stale common_attn_metadata.max_seq_len in speculative decoding with Eagle ( #32312 )
...
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com >
2026-01-15 06:39:37 +00:00
Lucas Wilkinson
2c9b4cf5bf
[BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes ( #32361 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com >
2026-01-15 06:32:22 +00:00
Ning Xie
9d7ae3fcdb
[code clean] remove duplicate check ( #32376 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-15 05:29:34 +00:00
rasmith
3c2685645e
[CI][AMD][Quantization][BugFix] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol ( #32201 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
2026-01-15 05:04:34 +00:00
Micah Williamson
773d7073ae
[ROCm][CI] Disable async scheduling on ROCm for test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] ( #32355 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-15 04:53:43 +00:00
kzwrime
edadca109c
[Bugfix] Add CpuCommunicator.dispatch and combine to fix DP+MoE inference ( #31867 )
...
Signed-off-by: kunzh <zhikun.wu@outlook.com >
2026-01-15 04:50:48 +00:00
Li Wang
d86fc23bdd
[Misc] Remove redundant line ( #32366 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2026-01-15 04:29:56 +00:00
Shiyan Deng
375e5984fe
Support configure skip_special_tokens in openai response api ( #32345 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-01-15 04:07:26 +00:00
baonudesifeizhai
19b251fe3d
Fix optional parameter parsing in MiniMax M2 tool parser #32278 ( #32342 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-01-15 04:05:48 +00:00
Ryan Rock
15422ed3f7
[CI/Build][Hardware][AMD] Fix v1/shutdown ( #31997 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-01-15 04:01:42 +00:00
dolpm
8471b27df9
[compile] raise on compile_size implicit padding ( #32343 )
...
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
2026-01-14 20:46:56 +00:00
Lumosis
66652e8082
[BugFix] Assign page_size_padded when unifying kv cache spec. ( #32283 )
...
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com >
2026-01-14 20:10:01 +00:00
vllmellm
e27078ea80
[Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten ( #32336 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-14 19:32:48 +00:00
Aleksandr Samarin
d084e9fca7
[MODEL] Fix handling of multiple channels for gpt-oss with speculative decoding ( #26291 )
...
Signed-off-by: Aleksandr Samarin <astrlrd@nebius.com >
Signed-off-by: southfreebird <yvorott@gmail.com >
Co-authored-by: southfreebird <yvorott@gmail.com >
2026-01-14 13:20:52 -05:00
qli88
3a612322eb
[CI] Move rixl/ucx from Dockerfile.rocm_base to Dockerfile.rocm ( #32295 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2026-01-14 16:53:36 +00:00
Cyrus Leung
9ea07b41da
[1/N] Reorganize multimodal processing code ( #32327 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-14 15:25:31 +00:00
Ning Xie
552b262936
rename tokenize serving api request id prefix to tokenize ( #32328 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-14 14:52:20 +00:00
Chauncey
00e6402d56
[Frontend] track responsesAPI server_load ( #32323 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-14 12:00:37 +00:00
Shanshan Shen
ce0946249d
[Misc] Make mem utils can be reused by other platforms ( #32322 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-01-14 03:46:01 -08:00
Cyrus Leung
3f28174c6a
[Frontend] Standardize use of create_error_response ( #32319 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-14 11:22:26 +00:00
Chauncey
769d0629e1
[Refactor] [9/N] to simplify the vLLM openai translations serving architecture ( #32313 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-14 10:20:58 +00:00
Cyrus Leung
90db5b31e4
[Refactor] Move top-level dummy data generation to registry ( #32310 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-14 02:17:46 -08:00
Roger Wang
b8199f6049
[Model] Re-implement Qwen3Omni Audio Encoder ( #32167 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-14 15:40:30 +08:00
sangho.lee
7e6f123810
Add Molmo2 multimodal model support ( #30997 )
...
Signed-off-by: sanghol <sanghol@allenai.org >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-14 15:33:09 +08:00
Chauncey
9312a6c03a
[Refactor] [8/N] to simplify the vLLM openai responsesapi_serving architecture ( #32260 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-14 07:26:24 +00:00
Michael Goin
6388b50058
[Docs] Add docs about OOT Quantization Plugins ( #32035 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-14 15:25:45 +08:00
Hongxia Yang
048bb59728
AMD CI Test - unskip moe_sum test and moe_align_block_size tests ( #32039 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2026-01-13 23:25:10 -08:00
Angela Yi
7933638051
[misc] Remove is_torch_equal_or_newer(2.4) cases ( #32296 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-13 23:22:07 -08:00
David
6b176095e3
[Build] Relax anthropic version pin from ==0.71.0 to >=0.71.0 ( #32289 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-13 23:21:39 -08:00
Andreas Karatzas
9d0d7f48d5
[ROCm][CI] Handle missing vision_config in Isaac model attention patch ( #32281 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-14 07:21:26 +00:00
Yi Liu
50632adc58
Consolidate Intel Quantization Toolkit Integration in vLLM ( #31716 )
...
Signed-off-by: yiliu30 <yi4.liu@intel.com >
2026-01-14 07:11:30 +00:00
Micah Williamson
6fa6e7ef0c
[ROCm][CI] Disable Async Scheduling For Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test ( #32275 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-14 13:29:42 +08:00
Woosuk Kwon
90c0836902
[Model Runner V2] Refactor Sampler ( #32245 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-13 17:58:12 -08:00
Roberto L. Castro
8ef50d9a6b
[Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding ( #30885 )
...
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-01-13 15:22:53 -08:00
emricksini-h
2a60ac91d0
[Improvement] Persist CUDA compat libraries paths to prevent reset on apt-get ( #30784 )
...
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
2026-01-13 14:35:05 -08:00
Michael Goin
9e65bb4ef4
Add mergify label job for "bug" in PR titles ( #31980 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-13 14:28:19 -08:00
Simon Mo
0db574b185
[Build] Add scripts for cherry-picking and trigger build ( #32282 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-01-13 13:21:05 -08:00
HappyAmazonian
2f4a71daf2
[Misc] Add In-Container restart capability through supervisord for sagemaker entrypoint ( #28502 )
...
Signed-off-by: Shen Teng <sheteng@amazon.com >
Signed-off-by: HappyAmazonian <91216626+HappyAmazonian@users.noreply.github.com >
2026-01-13 13:06:10 -08:00
Rabi Mishra
69f8a0ea37
fix(rocm): Use refresh_env_variables() for rocm_aiter_ops in test_moe ( #31711 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-13 19:11:54 +00:00
Wentao Ye
f28125d87b
[Perf] Optimize grouped topk kernel, 1.2%~2% E2E Throughput improvement ( #32058 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-13 10:58:18 -08:00
Dmitry Tokarev
46f8c6b725
Fix CUDA 13 wheel installation doc ( #32276 )
...
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com >
2026-01-13 10:48:37 -08:00
Andrew Xia
af54d2e2d0
[responseAPI] support partial message generation ( #32100 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Signed-off-by: Andrew Xia <mitandrewxia@gmail.com >
Signed-off-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-01-13 10:41:26 -08:00
Sage Moore
6beef12b9b
[EPLB][Cleanup] Remove is_async_enabled from EplbModelState ( #32050 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-01-13 18:19:03 +00:00
Mark McLoughlin
ab74b2a27a
[Trivial] Remove duplicate enable_mfu_metrics ( #32246 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-01-14 01:09:23 +08:00
Matthew Bonanni
2263d44b68
[4/N][Attention] Move MLA common to model_executor ( #32060 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-13 09:08:45 -08:00
Mathis Felardos
4f3676e726
nixl_connector: export UCX_MEM_MMAP_HOOK_MODE=none to avoid a UCX memory leak ( #32181 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2026-01-13 16:21:10 +00:00
Martin Hickey
510265472c
[BugFix] [KVConnector] Fix KV events for LMCache connector ( #32169 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-13 15:50:34 +00:00
Chauncey
4f02cb2eac
[Refactor] [7/N] to simplify the vLLM lora serving architecture ( #32251 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-13 15:37:34 +00:00
Cyrus Leung
252c011012
[Refactor] Remove MultiModalProfiler ( #32254 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 15:10:20 +00:00
Matthew Bonanni
98f60e5acb
[6/N][Attention] Move utils to more appropriate locations ( #32215 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-13 05:38:52 -08:00
Chauncey
fefce49807
[Refactor] [6/N] to simplify the vLLM openai chat_completion serving architecture ( #32240 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-13 13:01:39 +00:00
Mickaël Seznec
a5bbbd2f24
[Quantization] fix: overflow with static per-tensor scaling ( #29867 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-13 12:56:01 +00:00
Nicolò Lucchesi
8c8653b672
[Docs] Nixl Usage recommend fail kv_load_failure_policy ( #32198 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-13 12:51:57 +00:00
Cyrus Leung
232214b2ae
[Bugfix] Replace PoolingParams.normalize with use_activation ( #32243 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 10:45:42 +00:00
Cyrus Leung
eb28e8068d
[Refactor] Remove get_encoder_dummy_data ( #32241 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 09:21:23 +00:00
YunzhuLu
542a4059b2
[Model] Use mm_position to compute mrope positions for Qwen2-VL/2.5-VL ( #32126 )
...
Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-13 09:04:29 +00:00
Andreas Karatzas
df7e12715f
[ROCm][CI] Fix engine core client tests for ROCm spawn multiprocessing ( #32061 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-13 15:14:30 +08:00
Roy Wang
44c34f22d9
[Doc] Update installation from source command ( #32239 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-01-12 23:10:27 -08:00
Xingyu Liu
80221e1884
[BugFix]Fix eagle draft_model_config and add tests ( #31753 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
2026-01-12 23:09:36 -08:00
Andreas Karatzas
5e714f7ff4
[ROCm][CI] Fix HuggingFace flash_attention_2 accuracy issue in Isaac vision encoder ( #32233 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-12 22:33:59 -08:00
Andreas Karatzas
11b6af5280
[ROCm][Bugfix] Fix Mamba batched decode producing incorrect output ( #32099 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-13 05:46:53 +00:00
Wentao Ye
2a719e0865
[Perf] Optimize requests abort ( #32211 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-13 04:11:37 +00:00
Andrew Bennett
f243abc92d
Fix various typos found in docs ( #32212 )
...
Signed-off-by: Andrew Bennett <potatosaladx@meta.com >
2026-01-13 03:41:47 +00:00
Sanghoon Yoon
60b77e1463
[Frontend] Add reasoning_effort to OpenAIServing._preprocess_chat() ( #31956 )
...
Signed-off-by: Sanghoon Yoon <seanyoon@kakao.com >
2026-01-13 03:21:49 +00:00
cjackal
15b33ff064
[Misc] improve warning/assert messages ( #32226 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-01-13 03:11:23 +00:00
Nick Hill
c6bb5b5603
[BugFix] Fix engine crash caused by chat tools + response_format ( #32127 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-13 10:33:14 +08:00
Nick Hill
9273a427b5
[Misc] Allow enabling NCCL for DP sync when async scheduling ( #32197 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-13 02:03:08 +00:00
Cyrus Leung
78d13ea9de
[Model] Handle trust_remote_code for transformers backend ( #32194 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 09:30:12 +08:00
Andrew Xia
a307ac0734
[responsesAPI] add unit test for optional function tool call id ( #32036 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-01-12 16:14:54 -08:00
Divakar Verma
a28d9f4470
[ROCm][CI] Handle pytest status code 5 when a shard isn't allocated any tests ( #32040 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-12 17:35:49 -05:00
xuebwang-amd
629584bfc9
[Kernel][MoE] fix computation order of MoE weight multiplication and improve flow ( #31962 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-01-12 17:17:30 -05:00
Woosuk Kwon
0a7dd23754
[Model Runner V2] Add support for M-RoPE ( #32143 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-12 13:37:43 -08:00
Woosuk Kwon
dec28688c5
[Model Runner V2] Minor refactor for logit_bias ( #32209 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-12 13:08:30 -08:00
Vadim Gimpelson
9f430c94bd
[BUGFIX] Add missed remaping of the names of fp8 kv-scale ( #32199 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-12 20:42:06 +00:00
Nicolò Lucchesi
f8bd8394e3
[NIXL][Bugfix] Failure logging overhaul + early metadata free on failure ( #32031 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-12 20:38:49 +00:00
Woosuk Kwon
ca81811bfe
[Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens ( #32163 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-12 11:31:10 -08:00
Lucas Kabela
ad8818bb5e
[Misc][BE] Type coverage for vllm/compilation [3/3] ( #31748 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-12 19:24:38 +00:00
Nicolò Lucchesi
08e8e99ce7
[Misc] Change log level for batch queue log ( #32192 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-12 18:59:31 +00:00
Or Ozeri
2be765b68a
[BugFix] scheduler: Fix ordering preserving of skipped requests ( #32173 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-12 18:39:38 +00:00
Roger Wang
16abe6b85a
[Misc] Set default torch num threads for input processing ( #31879 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-12 10:28:16 -08:00
Ilya Markov
1eb61ab34b
[Refactor] EPLB rebalance algo to NumPy ( #30697 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-01-12 18:13:23 +00:00
Kyungmin Lee
3d962d72ab
[BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE ( #32196 )
...
Signed-off-by: lkm2835 <lkm2835@gmail.com >
2026-01-12 10:00:45 -08:00
Matthew Bonanni
20228cb851
[3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py ( #32054 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-12 09:13:56 -08:00
Cyrus Leung
7c0d3c5152
[Benchmark] Share data between SLA runs ( #32184 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 01:12:22 +08:00
Nicolò Lucchesi
5b68107411
[Misc][PD] Fix get_attn_backend usage in transfer connectors ( #31988 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-12 18:10:05 +01:00
Asaf Joseph Gardin
8fb2c135be
[Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode ( #32118 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-01-12 17:02:38 +00:00
Cyrus Leung
8863c2b25c
[Model] Standardize pooling heads ( #32148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-12 17:01:49 +00:00
danielafrimi
3f72639d36
[FIX] Add NO_MUL activation support for modular kernel path ( #31528 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local >
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: root <root@pool0-01777.cm.cluster >
2026-01-12 11:55:49 -05:00
Jaehyun An
6bc9c8473e
[MODEL] New model support for kakaocorp/kanana-1.5-v-3b-instruct ( #29384 )
...
Signed-off-by: Jaehyun An <steve.ai@kakaocorp.com >
2026-01-12 16:39:02 +00:00
Kyungmin Lee
63ed2409e8
Add K-EXAONE-236B-A23B ( #31621 )
...
Signed-off-by: lkm2835 <lkm2835@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: lgai-exaone <exaonemodels@lgresearch.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-12 16:30:50 +00:00
Andy Zhang
95e53d907c
doc: Update model references in supported_models.md ( #32188 )
2026-01-12 08:15:28 -08:00
TJian
0346396e94
[ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base ( #32179 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-12 15:33:21 +00:00
Andy Zhang
e68b0dad8b
doc: Update model name for Qwen3-Coder in documentation ( #32185 )
...
Signed-off-by: Andy Zhang <xiazhang@microsoft.com >
2026-01-12 07:10:50 -08:00
Or Ozeri
9cddbdba6d
OffloadingConnector: Add cpu_bytes_to_use configuration ( #24498 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-12 15:00:43 +00:00
Hongxin Xu
49e6b86c91
[Feature] Support recording expert indices for rollout router replay ( #28284 )
...
Signed-off-by: xhx1022 <1737006628@qq.com >
Signed-off-by: Hongxin Xu <70438206+xhx1022@users.noreply.github.com >
Signed-off-by: arlenxu <arlenxu@tencent.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: arlenxu <arlenxu@tencent.com >
2026-01-12 06:23:04 -08:00
dtc
0565f1fdec
[P/D] Refactor mooncake connector sender thread using async coroutines ( #31573 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-01-12 12:35:35 +00:00
Isotr0py
9dbe1fe960
[Bugfix] Fix missing scale passing for encoder Triton Attention implementation ( #32149 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-12 11:13:41 +00:00
RickyChen / 陳昭儒
a5f89ae296
[Doc] Add documentation for offline API docs feature ( #32134 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
2026-01-12 10:33:48 +00:00
Jee Jee Li
05e8981234
[Doc] Improve LoRA docs ( #32159 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-12 02:19:17 -08:00
XlKsyt
899541bdb1
[doc] fix broken links ( #32158 )
...
Signed-off-by: minimAluminiumalism <caixuesen@outlook.com >
2026-01-12 10:18:38 +00:00
daniel-salib
d7b2e57097
[Frontend] Fix Flaky MCP Streaming Test ( #32153 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-01-12 18:03:32 +08:00
Andika Rachman
5e034f2e3d
[cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend ( #32092 )
...
Signed-off-by: andikarachman <andika.rachman.y@gmail.com >
2026-01-12 10:03:28 +00:00
Nicolò Lucchesi
22970c1626
[Misc] Disable default --ready-check-timeout-sec extra call in vllm bench ( #30975 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-12 01:58:21 -08:00
Cyrus Leung
600aaab8d6
[Model] Remove incorrect SupportsPP from MTP models ( #32150 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-12 01:19:30 -08:00
wang.yuqi
60446cd684
[Model] Improve multimodal pooling examples ( #32085 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-12 07:54:09 +00:00
Cyrus Leung
9101dc756c
[Model] Avoid hardcoding pooling type ( #32119 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-11 21:28:12 -08:00
Woosuk Kwon
025a32f9ed
[Model Runner V2] Remove async barrier ( #32083 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-11 20:24:30 -08:00
Woosuk Kwon
19504ac07f
[Model Runner V2] Skip building deprecated fields in attn metadata ( #32132 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-11 14:31:04 -08:00
Jiangyun Zhu
3df619ac94
[CI] fix test_concat_and_cache_mla_rope_fused ( #32117 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-01-11 15:11:11 +00:00
Ning Xie
d74132ca3b
fix offline inference chat response prompt ( #32088 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-11 14:01:18 +00:00
maang
a34abc49b7
[FixBug] Improve exception string in tensorizer.py ( #31680 )
...
Signed-off-by: maang <maang_h@163.com >
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-11 05:01:53 -08:00
rongfu.leng
d70249e2e9
[Misc] fix this log format not space ( #32112 )
...
Signed-off-by: lengrongfu <lenronfu@gmail.com >
2026-01-11 05:01:16 -08:00
Cyrus Leung
a374532111
[CI/Build] Separate out flaky responses API tests ( #32110 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-11 05:01:12 -08:00
Isotr0py
cee7436a26
[Misc] Make scipy as optional audio/benchmark dependency ( #32096 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-11 00:18:57 -08:00
Or Ozeri
4c16ba617f
[KVConnector] OffloadingConnector: Fix bug in handling of preemptions ( #29870 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-11 08:05:36 +00:00
Matt
bde57ab2ed
[Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group ( #31713 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-10 23:19:46 -08:00
Fadi Arafeh
9103ed1696
[CPU][BugFix] Disable AOT Compile for CPU ( #32037 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-10 23:15:49 -08:00
Laith Sakka
46eb30f519
make assume_32_bit_indexing configurable ( #32044 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-01-10 23:15:46 -08:00
Andy Liu
0dd63639be
[MTP][GLM][Bugfix] Fixed .weight_scale loading logic that dropped MTP prediction accuracy with fp8+mtp ( #32101 )
...
Signed-off-by: Andy Liu <andyliu@roblox.com >
2026-01-10 23:14:54 -08:00
Cyrus Leung
ef96fa3f1f
[Benchmark][2/2] Use spline interpolation to tune SLA variables ( #32095 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-10 20:27:27 -08:00
Or Ozeri
2a4dbe24ea
[BugFix] Wait for compute before offloading KV to CPU ( #31341 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-10 22:25:08 +00:00
RickyChen / 陳昭儒
8020a60402
[Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification ( #32089 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-10 12:40:09 -08:00
Vadim Gimpelson
e15a5ff07b
[MISC] Add strict contiguity check for FlashInfer attention tensors ( #32008 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
2026-01-10 12:40:05 -08:00
Vensen
6ea001cfb7
[Bugfix][Quantization] Ensure input contiguity in per_token_quant_int8 ( #31637 )
...
Signed-off-by: vensen <vensenmu@gmail.com >
2026-01-10 12:40:02 -08:00
shyeh25
1c46dea001
Revert "[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… ( #31617 )
...
Signed-off-by: shyeh25 <206795756+shyeh25@users.noreply.github.com >
2026-01-10 12:39:59 -08:00
Or Ozeri
028599739d
[BugFix] scheduler: Fix resuming of preempted requests after async load ( #31583 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-10 12:39:25 -08:00
gnovack
d1fd802fa3
fused_moe_kernel - cast accumulator after applying router weights ( #32002 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2026-01-11 04:36:45 +08:00
Xin Yang
543c23be78
[LoRA][Perf] Improve FusedMoE LoRA performance for small rank ( #32019 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-10 11:04:18 -08:00
jvlunteren
b8bf5c45bb
[Kernel] Optimize Sliding Window Attention in 3D Triton Kernel ( #31984 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2026-01-10 18:13:44 +00:00
Michael Goin
e6c6f2c79d
[Quant] Support MXFP4 W4A16 for compressed-tensors dense models ( #31926 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-01-10 06:44:35 -08:00
Jeremy Teboul
07286ec5a6
[Bugfix] Fix integer overflow in Gemma3n audio processing ( #31657 )
...
Signed-off-by: Jeremy Teboul <jeremyte@meta.com >
2026-01-10 17:52:53 +08:00
Ning Xie
14fc7a68c7
[Bugfix] fix offline chat output prompt ( #32076 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-10 07:50:57 +00:00
Cyrus Leung
5f2385a4c8
[Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins ( #32075 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-10 07:11:03 +00:00
Frelam
a01a1c0d69
[Bugfix] fix encoder cache leak of waiting requests in scheduler to solve stuck in CPU scheduling ( #31857 )
...
Signed-off-by: frelam <frelam112233@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-10 06:27:58 +00:00
Lucas Wilkinson
da6709c9fe
[Misc] Delay deprecation of CommonAttentionMetadata properties ( #32074 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-09 21:06:44 -08:00
Andreas Karatzas
d83becd503
[ROCm][CI] Fix flaky test_function_calling_with_stream and reduce schema test examples ( #32063 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-10 05:02:35 +00:00
roikoren755
0c9614876e
Update modelopt KV cache quantization resolution to new scheme ( #31895 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-01-10 04:54:13 +00:00
Cyrus Leung
583a90e005
[Refactor] Separate sequence and token pooling types ( #32026 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-10 04:53:24 +00:00
maang
52d428295d
[Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward ( #31939 )
...
Signed-off-by: maang <maang_h@163.com >
2026-01-10 04:19:49 +00:00
Kevin McKay
c60578de0a
[Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process ( #31295 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-01-10 03:57:38 +00:00
PatrykSaffer
80fead8bf6
Fuse RoPE and MLA KV-cache write ( #25774 )
...
Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com >
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai >
Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-09 19:18:37 -08:00
Akshat Shrivastava
e45946bd91
feature/issac 0.2 ( #31550 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-10 03:18:05 +00:00
Lucas Kabela
ea6d067a2a
[Misc][LLaMa4] Compile LLaMa Vision Encoder ( #30709 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-09 22:01:38 -05:00
Ning Xie
abd9224280
resolve pydantic error in startup benchmark ( #31348 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-10 02:41:27 +00:00
Kevin McKay
4dc0d606b7
[Bugfix] Narrow broad exceptions in compilation backends ( #31616 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-01-09 21:39:22 -05:00
Micah Williamson
ac0675ff6b
[CI] Allow Deprecated Quantization For LM Eval Tests ( #32065 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-09 19:10:47 -07:00
Wentao Ye
e18464a57d
[Perf] Optimize async scheduling placeholder using empty ( #32056 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-10 00:46:11 +00:00
Russell Bryant
1963245ed1
[Core] Use weights_only=True with torch.load ( #32045 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-01-10 00:28:57 +00:00
Matthew Bonanni
0308901975
[2/N][Attention] Fix pre-commit errors ( #32052 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-10 00:27:15 +00:00
Lucas Kabela
aaf4b70aae
[Misc][BE] Type coverage for vllm/compilation [2/3] ( #31744 )
2026-01-09 18:30:38 -05:00
Nick Hill
3adffd5b90
[Misc] Enable async scheduling by default with spec decoding ( #31998 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-09 23:09:34 +00:00
zhrrr
97ba96fbe9
[perf][async] support non cpu sync get logprob tensors for spec ( #31336 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-01-09 21:24:51 +00:00
Chendi.Xue
94578127a4
[NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout ( #30275 )
2026-01-09 21:22:19 +00:00
Matthew Bonanni
2612ba9285
[1/N][Attention] Restructure attention: move files ( #31916 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-09 13:10:24 -08:00
Andrew Xia
1f8b7c536b
[responsesAPI] fix incomplete_messages for simple/parsable context ( #31836 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-01-09 21:00:57 +00:00
Lucas Wilkinson
0a0aa07747
[Quant] Make static quant support all group shapes ( #30833 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-09 12:49:27 -08:00
jiahanc
f9e2a75a1e
[fix] add cutedsl to global sf ( #32001 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2026-01-09 12:03:02 -08:00
Runkai Tao
a4d5d663e2
Add unpermute-aware fused MoE path and small-batch fallback ( #29354 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-09 12:58:39 -07:00
Jeremy Teboul
657e9c0e18
[Fix] Introduce audio channels spec ( #31595 )
...
Signed-off-by: Jeremy Teboul <jeremyte@meta.com >
2026-01-09 19:34:51 +00:00
Wentao Ye
308feab33f
[Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement ( #31830 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-09 11:13:43 -08:00
Wentao Ye
28ae32a5d3
[Refactor] Remove numpy split in async scheduling ( #32034 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-09 19:09:02 +00:00
Andrew Xia
f32c629eb4
[Frontend][gpt-oss] Allow system message to overwrite model identity ( #31737 )
...
Signed-off-by: lacora <hyelacora@gmail.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: lacora <hyelacora@gmail.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-01-09 14:03:57 -05:00
Yifan Qiao
cd4a95e3aa
[Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator ( #31707 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2026-01-09 10:53:20 -08:00
Michael Goin
d5ec6c056f
[UX] Add vLLM model inspection view ( #29450 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-09 10:12:35 -07:00
Shanshan Shen
08d954f036
[Doc] Add developer guide for CustomOp ( #30886 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-01-09 16:21:11 +00:00
Kevin Šuc
ac9f9330e6
Rename --exclude-log-deltas to --enable-log-deltas ( #32020 )
...
Signed-off-by: Catacomba <kevinsuc16@gmail.com >
2026-01-09 15:30:40 +00:00
Isotr0py
2d0c5b630e
[Doc] Remove hardcoded Whisper in example openai translation client ( #32027 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-09 14:44:52 +00:00
Michael Goin
34cd32fe30
[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe ( #31832 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-01-09 07:40:33 -07:00
R3hankhan
8e27663b6a
[CPU] Add head sizes 80 and 112 with vec16 fallback ( #31968 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-01-09 22:14:46 +08:00
maang
7cdf7e2fe0
[Model] Remove redundant None check in DeepSeekOCR image input processing ( #32016 )
...
Signed-off-by: maang <maang_h@163.com >
2026-01-09 06:12:44 -08:00
Adolfo Victoria
bbf80ede43
Fix type error ( #31999 )
...
Signed-off-by: Adolfo Victoria <adolfokarim@gmail.com >
Co-authored-by: Adolfo Victoria <adovi@meta.com >
2026-01-09 22:03:32 +08:00
inkcherry
4505849b30
[ROCm][PD] add moriio kv connector. ( #29304 )
...
Signed-off-by: inkcherry <mingzhi.liu@amd.com >
2026-01-09 14:01:57 +00:00
Roger Wang
db07433ce5
[Misc] Skip hashing kwargs if value is None ( #32025 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-09 13:20:59 +00:00
Andreas Karatzas
e02706d2d2
[ROCm][CI][V1] Fix nixl_connector test failure and achieve CUDA parity in test_async_scheduling ( #32000 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-09 20:48:32 +08:00
Sophie du Couédic
b474782ad7
[Feature][Benchmarks] Custom dataset: read output length from dataset ( #31881 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
2026-01-09 12:40:59 +00:00
Bofeng Xue
55212c1404
fix: remove duplicate engine_id check in nixl_connector ( #31948 )
...
Signed-off-by: Bofeng BF1 Xue <xuebf1@Lenovo.com >
Co-authored-by: Bofeng BF1 Xue <xuebf1@Lenovo.com >
2026-01-09 12:13:17 +00:00
Xin Yang
e7b68f4d6c
[Bugfix] Fix Triton FusedMoE LoRA ( #30585 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-09 11:46:59 +00:00
vllmellm
1a19e9cd87
[Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten ( #31380 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-09 19:28:02 +08:00
Cyrus Leung
c8ed39b9dd
[Model] Reorganize pooling layers ( #31973 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-09 11:02:14 +00:00
Andreas Karatzas
020732800c
[Bugfix] Fix OpenAPI schema test failures ( #31921 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-09 10:56:20 +00:00
Alex Brooks
dc77cb7129
[Bugfix] Fix Var Length Batched Padding in Granite Speech ( #31906 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2026-01-09 10:28:43 +00:00
gnovack
bde38c11df
fix lora moe sharding when rank < max_lora_rank ( #31994 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-09 14:43:25 +08:00
Xin Yang
707b240d7e
[Bugfix] Fix FusedMoE LoRA w2_output_size ( #31949 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-09 00:54:05 -05:00
Nick Hill
29ce48221c
[Cleanup] Remove obsolete spec decoding compatibility logic ( #32003 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-09 05:44:18 +00:00
TJian
7a05d2dc65
[CI] [ROCm] Fix tests/entrypoints/test_grpc_server.py on ROCm ( #31970 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-09 12:54:20 +08:00
Divakar Verma
a1648c4045
[ROCm][CI] Fix test_token_classification.py::test_bert_models ( #31993 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-09 04:04:33 +00:00
RioS
e2d49ec2a4
[Bugfix] missing tokens occur in harmony streaming ( #30437 )
...
Signed-off-by: RioS <aa248424@gmail.com >
Signed-off-by: Ri0S <aa248424@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-01-09 03:59:34 +00:00
Xin Yang
8413868dab
[Bugfix] Fix typo in FusedMoE LoRA reshape comment ( #31992 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-08 18:46:05 -08:00
zhrrr
8ff4a99566
[Async][Feat] support apply penalty or bad_words for async + spec ( #30495 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-09 02:31:50 +00:00
daniel-salib
a4ec0c5595
[Frontend] Add MCP tool streaming support to Responses API ( #31761 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-01-09 09:19:34 +08:00
Robert Shaw
0fa8dd24d2
[Bugfix] Fix Typo from NVFP4 Refactor ( #31977 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-08 16:18:50 -08:00
Max Hu
6ebe34d6fa
[Feature] Add iteration level logging and enhance nvtx marker ( #31193 )
...
Signed-off-by: Max Hu <maxhu@nvidia.com >
Signed-off-by: Max Hu <hyoung2991@gmail.com >
Co-authored-by: Max Hu <maxhu@nvidia.com >
2026-01-09 00:13:39 +00:00
Nick Hill
11cec296dd
[BugFix] Add spec-decode-incompatible request param validation ( #31982 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-09 00:08:21 +00:00
Robert Shaw
5825bbc1f7
[Quantization] Deprecate Long Tail of Schemes ( #31688 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-08 19:07:45 -05:00
Yongye Zhu
d62cfe546d
[MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection ( #31752 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-08 19:01:30 -05:00
Lucas Wilkinson
6cdf015c3c
[Misc] Fix Current vLLM config is not set. warnings, assert to avoid issues in the future ( #31747 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-08 15:20:49 -08:00
Dipika Sikka
5d3b6097ad
[Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs ( #30881 )
2026-01-08 17:45:17 -05:00
bnellnm
e74698c27a
[Misc][Refactor] Add FusedMoERouter object ( #30519 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-01-08 20:52:55 +00:00
Cyrus Leung
aa125ecf0e
[Frontend] Improve error message ( #31987 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-08 20:07:03 +00:00
Lucas Kabela
f16bfbe5bc
[Documentation][torch.compile] Add documentation for torch.compile + multimodal encoders ( #31627 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-08 14:33:24 -05:00
Michael Goin
87e07a6b46
Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE kernels" ( #31978 )
2026-01-08 11:31:53 -08:00
Woosuk Kwon
7508243249
[Model Runner V2] Simplify BlockTables with UVA ( #31965 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-08 10:24:26 -08:00
Nicolò Lucchesi
83e1c76dbe
[CI][ROCm] Fix NIXL tests on ROCm ( #31728 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-09 01:34:43 +08:00
Nishidha Panpaliya
a563866b48
Fix ijson build for Power. ( #31702 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2026-01-08 17:12:33 +00:00
Nick Hill
a3d909ad2b
[Misc] Tidy up some spec decode logic in GPUModelRunner ( #31591 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-08 09:10:07 -08:00
Jee Jee Li
49568d5cf9
[Doc] Improve MM models LoRA notes ( #31979 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-08 08:55:22 -08:00
danisereb
b8112c1d85
[Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 ( #31960 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-01-08 16:08:37 +00:00
Chauncey
eaba8ece77
[Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming ( #31969 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-08 15:28:13 +00:00
yxing-bj
fe86be66c5
[Model] Support IQuestCoder model ( #31575 )
...
Signed-off-by: yxing <yxing@iquestlab.com >
2026-01-08 14:42:57 +00:00
Chauncey
1da3a5441a
[Docs]: update claude code url ( #31971 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-08 14:04:55 +00:00
TJian
72c068b8e0
[CI] [Bugfix] Fix unbounded variable in run-multi-node-test.sh ( #31967 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-08 05:42:01 -08:00
Mary
7645bc524b
[OpenAI] Fix tool_choice=required streaming when output has trailing extra data ( #31610 )
...
Signed-off-by: maylikenoother <ogedengbemary19@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-01-08 21:01:42 +08:00
Ce Zhao
1123a87892
[Model] Enable LoRA support for Pixtral ( #31724 )
...
Signed-off-by: 赵策 <alcor@zhaocedeMacBook-Air.local >
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com >
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com >
2026-01-08 05:00:57 -08:00
tianshu-Michael-yu
03fd76c570
[Model] Add LFM2-VL model support ( #31758 )
...
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-08 05:00:27 -08:00
Bijaya Dangol
59d260f5e4
[Model] Add Grok-2 ( #31847 )
...
Signed-off-by: dangoldbj <dangoldbj23@gmail.com >
2026-01-08 04:59:48 -08:00
Patrick von Platen
18d4e481d0
[Voxtral] Fix speech transcription api ( #31388 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: bk-201 <joy25810@foxmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: bk-201 <joy25810@foxmail.com >
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com >
Co-authored-by: Anexdeus <5142168@mail.ru >
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-01-08 18:34:19 +08:00
Isotr0py
2972a05473
[MM Encoder]: Make MMEncoderAttention's scale takes effect properly ( #31950 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-08 02:33:48 -08:00
Cyrus Leung
5576227bc1
[Model] Standardize common vision encoders ( #31947 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-08 02:33:16 -08:00
Cyrus Leung
d1b6fe007f
[Chore] Further cleanup pooler ( #31951 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-08 02:16:21 -08:00
omer-dayan
04a49669d1
RayLLM Bugfix - Preserve obj store URL for multi engine_config creation ( #30803 )
...
Signed-off-by: Omer Dayan <omdayan@nvidia.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-08 10:00:25 +00:00
BingjiaWang
96fcd3c267
[Misc] Support qwen3-next lora ( #31719 )
2026-01-08 09:27:50 +00:00
DevByteAI
1f214290d6
fix(compile): apply partition wrapper when loading AOT cached functions ( #31536 )
...
Signed-off-by: Devbyteai <abud6673@gmail.com >
Signed-off-by: DevByteAI <161969603+devbyteai@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-08 17:27:26 +08:00
Ryan Rock
8cbdc7eb94
[CI/Build] Enable test_kv_cache_events_dp for AMD ( #31834 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-01-08 09:00:24 +00:00
Lumosis
b634e619bb
Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. ( #31635 )
...
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com >
Signed-off-by: Lumosis <30372757+Lumosis@users.noreply.github.com >
2026-01-08 09:00:07 +00:00
Isotr0py
eac3b96ec0
[Models] Allow converting Qwen3-VL into Reranker model ( #31890 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-08 08:10:15 +00:00
Zhiwei
573a1d1119
[ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch ( #31905 )
...
Signed-off-by: ZhiweiYan-96 <zhiwei.yan@amd.com >
2026-01-08 15:47:44 +08:00
Shang Wang
33156f56e0
[docker] A follow-up patch to fix #30913 : [docker] install cuda13 version of lmcache and nixl ( #31775 )
...
Signed-off-by: Shang Wang <shangw@nvidia.com >
2026-01-07 23:47:02 -08:00
Rabi Mishra
107cf8e92f
fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN ( #31712 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-08 15:46:07 +08:00
Zyyeric
63baa28cf5
[Model] Enable LoRA support for tower and connector in GLM4-V ( #31652 )
...
Signed-off-by: Zyyeric <eric1976808123@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-08 15:45:53 +08:00
Andy Liu
e5173d3bac
[Bugfix] Remove the num_hidden_layers override for glm4_moe ( #31745 )
2026-01-08 15:45:10 +08:00
prashanth058
d3235cb503
[Fix] Enable mm_processor_cache with vision LoRA ( #31927 )
...
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com >
2026-01-08 15:31:51 +08:00
Nick Hill
287b37cda4
[BugFix] Fix spec decoding edge case bugs ( #31944 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-08 15:31:03 +08:00
Chang Su
791b2fc30a
[grpc] Support gRPC server entrypoint ( #30190 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
Signed-off-by: njhill <nickhill123@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: njhill <nickhill123@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2026-01-07 23:24:46 -08:00
Lucas Wilkinson
be6a81f31b
[chore] Update FA commit ( #30460 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-07 23:24:18 -08:00
Ronald
2ab441befe
[platform] add dp_metadata arg to set_additional_forward_context ( #31942 )
...
Signed-off-by: Ronald1995 <ronaldautomobile@163.com >
2026-01-08 06:56:44 +00:00
ShaanveerS
9572f74f15
[Model] Enable LoRA support for tower and connector in DotsOCR ( #31825 )
...
Signed-off-by: ShaanveerS <shaanver.singh@gmail.com >
2026-01-08 14:50:16 +08:00
Andreas Karatzas
5f2a473ff3
[ROCm][CI] v1 cpu offloading attention backend fix ( #31833 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-08 14:37:50 +08:00
Michael Goin
6b2a672e47
[Doc] Add Claude code usage example ( #31188 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-08 13:50:23 +08:00
rasmith
f1b1bea5c3
[CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 ( #31873 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2026-01-08 13:06:09 +08:00
Charlie Fu
cddbc2b4b2
[ROCm][CI] Add rocm support for run-multi-node-test.sh ( #31922 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-08 04:36:39 +00:00
Andreas Karatzas
087a138963
[ROCm][CI] Fix attention backend test flakiness from uninitialized KV cache memory ( #31928 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-08 04:35:25 +00:00
Andreas Karatzas
c4041f37a4
[ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling ( #31931 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-08 04:17:56 +00:00
Richard Zou
a79079feef
[BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 ( #31915 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-08 04:04:58 +00:00
Robert Shaw
9f6dcb71ae
[MoE Refactor][16/N] Apply Refactor to NVFP4 ( #31692 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-01-08 03:46:27 +00:00
Andreas Karatzas
8dd2419fa9
[CI] Skip Qwen-VL in multimodal processing tests due to flaky external dependency ( #31932 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-08 02:58:01 +00:00
Rabi Mishra
39d82005f7
fix(rocm): add early return in get_flash_attn_version for ROCm ( #31286 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-08 10:28:07 +08:00
Rabi Mishra
25eef3dc2e
feat(moe): Add is_act_and_mul=False support for Triton MoE kernels ( #31645 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-08 10:27:09 +08:00
Matthew Bonanni
0d7667419f
[0/N][Attention] Fix miscellaneous pre-commit issues ( #31924 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-08 01:15:17 +00:00
Robert Shaw
5dcd7ef1f2
[MoE Refactor][15/N] Apply Refactor to Fp8 ( #31415 )
2026-01-07 19:42:33 -05:00
Elvir Crnčević
ffc0a2798b
Add back missing DeepEP LL params ( #31911 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
2026-01-07 17:47:54 -05:00
Nick Hill
10ef65eded
[BugFix] Fix bad words with speculative decoding ( #31908 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-07 15:46:42 -05:00
Ilya Markov
6170d47d22
[EPLB] Optimize EPLB with numpy ( #29499 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-01-07 15:21:35 -05:00
Xin Yang
0ada960a20
[Kernel] Support bias type in grouped_topk kernel ( #31781 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-07 12:16:32 -08:00
Ning Xie
c907d22158
[refactor] refactor memory constants usage ( #31865 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-07 18:37:31 +00:00
Michael Goin
f347ac6c34
[Perf] Fuse stride preparation for NVFP4 cutlass_moe ( #31837 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-07 13:31:26 -05:00
Festus Ayobami Owumi
05f47bd8d2
[Doc] Fix: Correct vLLM announcing blog post link in docs ( #31868 )
...
Signed-off-by: enfinity <festusowumi@gmail.com >
2026-01-07 10:06:42 -08:00
roikoren755
bf184a6621
Enable quantized attention in NemotronH models ( #31898 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-01-07 17:37:19 +00:00
Jee Jee Li
30399cc725
UX: add vLLM env info in '/server_info' ( #31899 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-07 17:13:02 +00:00
Kfir Toledo
b89443b8d9
[KVConnector]: Enable Cross-layers KV cache layout for MultiConnector ( #30761 )
...
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com >
2026-01-07 16:59:43 +00:00
Marko Rosenmueller
1d9e9ae8a4
[Bugfix]: prevent leaking tokens in crash log ( #30751 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2026-01-07 16:15:19 +00:00
Cyrus Leung
b7036c87a1
[Refactor] Clean up pooler modules ( #31897 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-08 00:07:43 +08:00
Kate Cheng
cc6dafaef2
[Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 Linear Op) ( #29213 )
...
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com >
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
2026-01-07 10:53:54 -05:00
R3hankhan
1ab055efe6
[OpenAI] Extend VLLMValidationError to additional validation parameters ( #31870 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-01-07 14:45:49 +00:00
Cyrus Leung
b665bbc2d4
[Chore] Migrate V0 attention utils ( #31891 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-07 13:44:36 +00:00
Jared Wen
974138751b
[Refactor] GLM-ASR Modeling ( #31779 )
...
Signed-off-by: JaredforReal <w13431838023@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-07 13:08:29 +00:00
vllmellm
41cfa50632
[ROCm][AITER] fix wrong argument passed to AITER flash_attn_varlen_func ( #31880 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-07 11:25:03 +00:00
Andy Liu
d111bc53ad
[Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on ( #31757 )
...
Signed-off-by: Andy Liu <andyliu@roblox.com >
2026-01-07 09:18:52 +00:00
BlankR
0790f07695
[Misc] Improve error messages for unsupported types and parameters ( #30593 )
...
Signed-off-by: BlankR <hjyblanche@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-07 09:00:16 +00:00
maang
1f33e38e81
[Model] Cleanup: Remove redundant manual definition of make_empty_intermediate_tensors in GLM-4-MoE ( #31869 )
...
Signed-off-by: maang <maang_h@163.com >
2026-01-07 08:18:28 +00:00
sihao_li
59fe6f298e
[XPU]fallback to TRITON_ATTN on xpu when use float32 dtype ( #31762 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-01-07 08:10:29 +00:00
weiyu
e7596371a4
[Refactor][TPU] Remove torch_xla path and use tpu-inference ( #30808 )
...
Signed-off-by: Wei-Yu Lin <weiyulin@google.com >
Signed-off-by: weiyu <62784299+weiyu0824@users.noreply.github.com >
2026-01-07 16:07:16 +08:00
xuebwang-amd
0dd5dee9b9
[Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe ( #31676 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-01-07 07:36:13 +00:00
Kevin McKay
4614c5a539
[Bugfix][Hardware][AMD] Consolidate FP8 min/max values helper function ( #31106 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Signed-off-by: Kevin McKay <kevin@example.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-01-07 06:55:03 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
482914849c
[BugFix] LoRA: Support loading base_layer of experts ( #31104 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2026-01-07 14:49:39 +08:00
tianshu-Michael-yu
efeaac92f2
[Bugfix] Fix race condition in async-scheduling for vlm model ( #31841 )
...
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
2026-01-07 06:45:10 +00:00
tjp_zju
55caa6051d
refactor: find_loaded_library ( #31866 )
...
Signed-off-by: tjp_zju <tanjianpingzju1990@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-07 06:42:20 +00:00
Lucas Wilkinson
c7a79d41a0
[Attention][3/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties ( #31850 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-07 13:31:34 +08:00
vllmellm
6409004b26
[ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend ( #31816 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-07 05:04:53 +00:00
Cyrus Leung
aafd4d2354
[Chore] Try remove init_cached_hf_modules ( #31786 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-07 12:34:04 +08:00
Jack Yang
0a2c2dc3f1
fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround ( #31465 )
...
Signed-off-by: Zhuohao Yang <zy242@cornell.edu >
Co-authored-by: Zhuohao Yang <zy242@cornell.edu >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-07 04:08:47 +00:00
Tyler Michael Smith
f09c5feb7c
Change warning in get_current_vllm_config to report caller's line number ( #31855 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-01-07 03:48:13 +00:00
Cyrus Leung
1b8af957f6
[Doc] Update release docs ( #31799 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-07 03:27:40 +00:00
Ce Zhao
a051525e07
[Model] Enable LoRA support for PaliGemma ( #31656 )
...
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com >
Signed-off-by: Alcor <alcor_zhao@outlook.com >
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com >
2026-01-07 10:09:32 +08:00
Yihua Cheng
5b833be49e
[1/2][lmcache connector] clean up lmcache multi-process adapter ( #31838 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2026-01-07 02:02:42 +00:00
Lucas Kabela
873480d133
[Misc][BE] Type coverage for vllm/compilation [1/3] ( #31554 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-06 20:37:51 -05:00
vSeamar
6f351548b2
[Frontend] Implement robust video frame recovery for corrupted videos ( #29197 )
...
Signed-off-by: cmartinez <cmartinez@roblox.com >
Signed-off-by: vSeamar <cmartinez@roblox.com >
2026-01-07 01:13:24 +00:00
Andreas Karatzas
364a8bc6dc
[ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing VLLM_FLOAT32_MATMUL_PRECISION from all ROCm tests ( #31829 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-07 01:12:23 +00:00
Angela Yi
9a1d20a89c
[CI] Add warmup run in test_fusion_attn ( #31183 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-07 00:31:52 +00:00
Cyrus Leung
309a8f66ee
[Bugfix] Handle mistral tokenizer in get_hf_processor ( #31817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-07 07:46:56 +08:00
Andreas Karatzas
e5d427e93a
[ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) ( #31835 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-06 23:23:11 +00:00
Andreas Karatzas
2a42ae790d
[ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm ( #31820 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-06 23:21:15 +00:00
Matthew Bonanni
d49899732e
[Spec Decode][UX] Add acceptance stats to vllm bench serve report ( #31739 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-01-06 21:21:42 +00:00
Elvir Crnčević
dba95378a6
Report error log after vllm bench serve ( #31808 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
2026-01-06 20:24:19 +00:00
Nikhil G
ada6f91d56
Fix RecursionError in MediaWithBytes unpickling ( #31191 )
...
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com >
2026-01-06 20:11:26 +00:00
Li, Jiang
8becf146bd
[Quantization][Refactor] Move CPU GPTQ kernel into MP linear ( #31801 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-06 19:10:18 +00:00
Charlie Fu
c07163663d
[ROCm][CI] Fix tests/compile unit tests ( #28895 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-06 18:50:43 +00:00
Benjamin Chislett
f7008ce1c4
[Perf] Async Scheduling + Speculative Decoding + Structured Outputs ( #29821 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-06 18:50:37 +00:00
Yakine Tahtah
4e67a8f616
[Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking ( #31055 )
...
Signed-off-by: ReinforcedKnowledge <reinforced.knowledge@gmail.com >
2026-01-06 17:57:56 +00:00
Masataro Asai
142c4d1738
make 500: InternalServerError more informative ( #20610 )
...
Signed-off-by: Masataro Asai <guicho2.71828@gmail.com >
2026-01-06 17:36:24 +00:00
Ning Xie
6f5e653383
[Log] add log about gpu worker init snapshot and requested memory ( #29493 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-06 17:32:55 +00:00
Vadim Gimpelson
22dffca982
[PERF] Speed-up of GDN attention decode part (Qwen3-Next) ( #31722 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-06 17:32:46 +00:00
Lucas Wilkinson
4c73be14e0
[Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties ( #31774 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-06 17:32:14 +00:00
Jinzhen Lin
2f4bdee61e
[Quantization][MoE] remove unused ep logic from moe marlin ( #31571 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-06 09:07:19 -08:00
roikoren755
28c94770ad
[NemotronH] Use ReplicatedLinear for fc1_latent_proj ( #31807 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-01-06 16:00:40 +00:00
Robert Shaw
af8fd73051
[MoE Refactor][14/N] Clean Up FI Quant Config Smuggling ( #31593 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-06 15:47:04 +00:00
Robert Shaw
d3e477c013
[MoE Refactor] Add Temporary Integration Tests - H100/B200 ( #31759 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-06 10:34:17 -05:00
Isotr0py
02809af1e7
[Bugfix]: Fix cross attention backend selection for Turing GPU ( #31806 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-06 23:15:56 +08:00
Jee Jee Li
cbd4690a03
[LoRA]Disable linear LoRA kernel PDL ( #31777 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-06 23:12:25 +08:00
wang.yuqi
96860af655
[Model] rename use_pad_token to use_sep_token ( #31784 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-06 14:16:04 +00:00
Chauncey
0202971a48
[Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false ( #31788 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-06 13:53:21 +00:00
Jzz1943
2c1a4f2488
[Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) ( #31790 )
...
Signed-off-by: Zhongze Jiang <jiangzhongze.jzz@ant-intl.com >
2026-01-06 12:59:17 +00:00
Cyrus Leung
6444824873
[Misc] Implement TokenizerLike.convert_tokens_to_ids ( #31796 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-06 12:08:22 +00:00
kzwrime
bf0f3a4638
[Bugfix] Fix torch.compile error for DP + MoE on CPU Backend ( #31650 )
...
Signed-off-by: kunzh <zhikun.wu@outlook.com >
2026-01-06 12:06:20 +00:00
Lucas Wilkinson
e0327c9db2
[Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties ( #31773 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-06 04:05:17 -08:00
Cyrus Leung
14df02b4e1
[Chore] Cleanup mem_utils.py ( #31793 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-06 19:55:59 +08:00
BlankR
6ebb66ccea
[Doc] Fix format of multimodal_inputs.md ( #31800 )
...
Signed-off-by: BlankR <hjyblanche@gmail.com >
2026-01-06 03:30:24 -08:00
wang.yuqi
43d384bab4
[CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. ( #31797 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-06 19:30:05 +08:00
Cyrus Leung
db318326a5
[Misc] Use deprecated for seed_everything ( #31780 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-06 11:29:55 +00:00
Fadi Arafeh
799b5721f6
[cpu][bench] Add CPU paged attention benchmarks ( #31720 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-06 10:57:57 +00:00
Cyrus Leung
97ca4c3b60
[Chore] Remove more V0 dead code from sequence.py ( #31783 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-06 10:25:14 +00:00
Isotr0py
ee2e69d6cd
[Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff ( #31776 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-06 00:44:22 -08:00
Isotr0py
7101e0851f
[Models]: Use MMEncoderAttention for MoonViT ( #31738 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: h100 <h100@inferact.ai >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: h100 <h100@inferact.ai >
2026-01-06 08:00:25 +00:00
vllmellm
e9717801bd
[Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py ( #31714 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-06 07:53:22 +00:00
Cyrus Leung
da71d44410
[Doc] Show that use_audio_in_video is supported in docs ( #30837 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-05 23:27:19 -08:00
Kevin McKay
1fb0209bbc
[Bugfix][Hardware][AMD] Fix exception types in AITER MLA FP8 check ( #31177 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-06 14:10:59 +08:00
Robert Shaw
81323ea221
[CI] Fix CPU MM PRocessor Test ( #31764 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-06 04:22:18 +00:00
Michael Goin
e1cd7a5faf
[Bugfix] Add init_workspace_manager to moe kernel benchmarks ( #31042 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-05 19:14:33 -08:00
Michael Goin
a68e703c32
[UX] Add -ep shorthand for --enable-expert-parallel ( #30890 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-05 19:13:36 -08:00
maang
cd1245a184
[Cleanup] Remove redundant decoder_layer_type assignment in Qwen2 ( #31760 )
...
Signed-off-by: maang <maang_h@163.com >
2026-01-05 18:09:18 -08:00
Wentao Ye
ffec815422
[Perf] Optimize additional fill(0) in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement ( #31754 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-05 18:01:13 -08:00
maang
d386ab1412
[Docs] Improve malformed exception caused by backslash line continuations ( #31694 )
...
Signed-off-by: maang <maang_h@163.com >
Signed-off-by: maang <55082429+maang-h@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-05 17:51:54 -08:00
Michael Goin
ccb309a964
Revert "[CI Failure] Disable B200 tests while runner is broken" ( #31750 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-01-05 17:26:33 -08:00
John Calderon
2f4e6548ef
[Bugfix] vLLM produces invalid UTF-8 tokens and “�” ( #28874 )
...
Signed-off-by: John Calderon <jcalderon@nvidia.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-01-06 00:23:00 +00:00
Seiji Eicher
3c98c2d21b
[CI/Build] Allow user to configure NVSHMEM version via ENV or command line ( #30732 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-01-05 15:56:08 -08:00
Michael Goin
9513029898
[Bugfix] Properly apply v_scale for mimo_v2_flash ( #31175 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-05 23:20:46 +00:00
Robert Shaw
f6c0009afa
[Bugfix] Fix Broken ModelOpt NVFP4 MoE ( #31742 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-05 23:18:38 +00:00
Yongye Zhu
776ca1e187
[MoE Refactor] Aiter Experts for BF16 MoE ( #31542 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-05 14:52:59 -08:00
Wentao Ye
af9a7ec255
[Bug] Revert torch warning fix ( #31585 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-05 22:31:21 +00:00
Matthew Bonanni
276e03b92c
[CI][DeepSeek] Add nightly DeepSeek R1 lm_eval tests on H200 ( #30356 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-05 17:17:59 -05:00
Nick Hill
32f4e4db00
[Cleanup] Remove deprecated fields from CachedRequestData class ( #31734 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-05 21:07:14 +00:00
amitz-nv
ee21291825
[Model] Nemotron Parse 1.1 Support ( #30864 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-05 13:00:14 -08:00
Qidong Su
af1b07b0c5
[docker] install cuda13 version of lmcache and nixl ( #30913 )
...
Signed-off-by: Qidong Su <soodoshll@gmail.com >
2026-01-05 12:50:39 -08:00
gnovack
c77a993cc2
pin lora_b moe weights on cpu ( #31317 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2026-01-05 12:15:40 -08:00
Roberto L. Castro
fdcc5176be
[BugFix] Fix architecture flags to prevent issues on SM103 ( #31150 )
...
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com >
2026-01-05 20:11:35 +00:00
Wang Kunpeng
5708297e4e
[Misc][Model][Refactor] Pass the prefix into Linear layers ( #31669 )
...
Signed-off-by: Wang Kunpeng <1289706727@qq.com >
2026-01-05 20:03:18 +00:00
baonudesifeizhai
02dbb933cb
Fix GLM-4.6v flash tool calling in transformers 5.x ( #31622 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-01-05 11:32:43 -08:00
Isotr0py
51e38a8e30
[Misc] Enable Paligemma's PrefixLM attention mask computation ( #31725 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-06 03:31:49 +08:00
Or Ozeri
d8e38d4939
Triton Attention: Support cross-layers blocks ( #30687 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-05 19:29:16 +00:00
kzwrime
21156ff199
[Bugfix] Add missing extra_tensors arg to DeviceCommunicatorBase.disp… ( #31644 )
...
Signed-off-by: kunzh <zhikun.wu@outlook.com >
2026-01-06 01:26:09 +08:00
RickyChen / 陳昭儒
c455b771fd
[Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager ( #31643 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
2026-01-06 01:25:38 +08:00
Michael Goin
eefa713a66
[CI Failure] Disable B200 tests while runner is broken ( #31732 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-05 08:50:51 -08:00
Kevin Šuc
79ed460dd5
[Frontend] [Doc] Exclude log deltas feature ( #30322 )
...
Signed-off-by: Catacomba <kevinsuc16@gmail.com >
Signed-off-by: Kevin Šuc <kevinsuc16@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-05 16:34:35 +00:00
Isotr0py
6aa5b18e1d
[v1] Add encoder-only/cross attention support to Triton Attention backend ( #31406 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-06 00:00:23 +08:00
wang.yuqi
911d38ed99
[Model] Let more models to support the score template. ( #31335 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-05 11:54:26 +00:00
zzzzwwjj
caaa482aca
[platform] Support additional forward context for OOT ( #31674 )
...
Signed-off-by: zzzzwwjj <1183291235@qq.com >
Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-05 10:25:13 +00:00
Yihua Cheng
b471aad41f
[KVconnector][LMCache] remove the import of legacy LMCache code ( #31704 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2026-01-05 10:11:01 +00:00
Jee Jee Li
d5503ca7f9
[LoRA] LoRA PDL improvement ( #31660 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-05 08:28:46 +00:00
Qiping Pan
a2ad15c070
[Model] Enable LoRA support for BLIP2 ( #31620 )
...
Signed-off-by: Qiping Pan <panqiping@outlook.com >
2026-01-05 08:02:24 +00:00
Tres
3133c192a3
[ROCM] Reorder arguments and rename parameters for rope_cached_thd_positions_2c_fwd_inplace ( #29993 )
...
Signed-off-by: Tres Popp <tres.popp@amd.com >
2026-01-05 15:37:57 +08:00
wang.yuqi
76fd458aa7
[CI] Bump sentence-transformer from 3.2.1 to 5.2.0 ( #31664 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-04 21:45:01 -08:00
cjackal
e2701cc525
[Frontend] [Bugfix] respect server-level default chat template kwargs in reasoning parser ( #31581 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-01-05 05:42:47 +00:00
Tyler Michael Smith
fe8a9fbd2e
[Bugfix] Fix EPLB state logging error ( #31455 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-01-05 04:06:28 +00:00
Ning Xie
98b8b3abaa
[log] enable max_log_len trim only when needed ( #31482 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-05 03:55:43 +00:00
CHENYUE
346e56455a
Add chat prefix completion feature to DeepSeek v3.2 ( #31147 )
2026-01-05 11:20:25 +08:00
wang.yuqi
8be6432bda
[CI Failure] Fix NomicBert max_model_len validation ( #31662 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-05 11:06:52 +08:00
Nick Hill
43e3f8e4a9
[Misc] Various code simplifications ( #31666 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-04 18:35:56 -08:00
wangxiyuan
bb4337b34c
[Platform] Deprecate seed_everything ( #31659 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-01-04 18:34:04 -08:00
Isotr0py
367856de14
[CI/Build] Revive skipped reward models e2e test ( #31665 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-05 02:33:46 +00:00
Nick Hill
da436f868a
[Minor] Small pooler output processing optimization ( #31667 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-04 18:33:12 -08:00
Jee Jee Li
f099cd557a
[Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' ( #31663 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-05 02:31:31 +00:00
Andreas Karatzas
f2b6dfd237
[ROCm][CI] Fix language generation test accuracy by disabling HF flash_sdp and mem_efficient_sdp ( #31597 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-05 02:17:05 +00:00
Andreas Karatzas
89f1f25310
[CI] Skip Phi-MoE test due to old API util ( #31632 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-05 08:52:07 +08:00
Nick Hill
b53b89fdb3
[BugFix] Async scheduling: handle model forward errors more cleanly ( #31611 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-04 11:04:37 -08:00
Ning Xie
6522721d17
[misc] Sort uvicorn log level description according to verbosity ( #31137 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-04 18:45:37 +00:00
Yuxuan Zhang
0d4044edd8
fix no think of GLM-4.5 / GLM-4.7 ( #31449 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2026-01-04 11:43:00 +08:00
Reagan Lee
41ab179738
[Docs] Fix argparse include path for mm-processor benchmark ( #31654 )
...
Signed-off-by: Reagan <reaganjlee@gmail.com >
2026-01-04 03:31:29 +00:00
Robert Shaw
268b1c55ad
[MoE Refactor][13/N] Convert FI to Use PFNoEP ( #31533 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-03 12:26:36 -08:00
Andreas Karatzas
4f9ce35afe
[CI][Bugfix] Fix token counting in chunked prefill compl test ( #31630 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-03 14:28:49 +08:00
jeremyteboul
97a01308e9
Improve HF qwen3_omni: preserve audio_sample_rate in kwargs restructuring ( #29255 )
...
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com >
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com >
2026-01-03 04:31:09 +00:00
Xingyu Liu
0eee877f67
[Core] Parse vLLM engine required fields from hf_config to model_arch_config ( #28454 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Signed-off-by: Xingyu Liu <38244988+charlotte12l@users.noreply.github.com >
2026-01-02 15:13:15 -08:00
Alfred
a0e9ee83c7
[Benchmark] Fix OOM during MoE kernel tuning for large models ( #31604 )
...
Signed-off-by: Alfred <massif0601@gmail.com >
2026-01-02 22:24:51 +00:00
Yongye Zhu
a3f2f40947
[MoE Refactor] Explicit construct mk for flashinfer bf16 kernel ( #31504 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-02 13:54:50 -08:00
Yongye Zhu
5a468ff7c7
[MoE Refactor] Split invoke_fused_moe_kernel ( #31050 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-02 13:47:15 -08:00
Andreas Karatzas
6ef770df7c
[MoE] Fix output_shape calculation in Attention layer to handle 3D query inputs ( #31596 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-02 15:46:23 +00:00
Nick Hill
bd877162eb
[BugFix] Support online dense model DP without overhead ( #30739 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-02 23:36:38 +08:00
Xinyu Chen
08f425bad1
CustomOp: test forward dispatch for grouped_topk ( #31530 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-01-02 10:04:01 -05:00
labAxiaoming
a01f2faedf
Add multimodal input method in the documentation ( #31601 )
...
Signed-off-by: xiaoming <1259730330@qq.com >
2026-01-02 12:43:30 +00:00
Kyuyeun Kim
cc410e8644
[Bugfix] Fix weight_loader v1 block scale ( #31103 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2026-01-02 13:14:10 +08:00
Kevin McKay
825c2dc133
[Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode ( #31282 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-01-01 21:14:00 -08:00
Vaibhav Sourirajan
1f43c121d5
Remove unused use_marlin variable in Mxfp4MoEMethod ( #31549 )
...
Signed-off-by: vaibhav sourirajan <vs2787@columbia.edu >
2026-01-01 21:13:36 -08:00
Tmn07
ca179d0f64
[Bugfix] Fix activation quantization for compressed-tensors W4A16 ( #31572 )
...
Signed-off-by: Tmn07 <tmn0796@gmail.com >
2026-01-01 21:13:22 -08:00
Andreas Karatzas
013b54088c
[ROCm][CI] Fix ModernBERT token classification test ( #31612 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-02 04:19:08 +00:00
Jay Hemnani
5ac55eb30f
[Model] Enable LoRA support for tower and connector in LLaVA ( #31513 )
...
Signed-off-by: Jay Hemnani <jayhemnani9910@gmail.com >
Co-authored-by: Jay Hemnani <jayhemnani9910@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-01 19:32:39 -08:00
Benjamin Chislett
ea53ca5e85
[Bugfix] Fix block size used in EAGLE slot mapping ( #31540 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-01-01 19:32:30 -08:00
zhima771
27864a851c
feat: support LoRA for DeepSeek-OCR(Language Model part) ( #31569 )
...
Signed-off-by: zhima771 <15836938703@163.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-01 19:32:11 -08:00
Andreas Karatzas
5cc4876630
[ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by reducing agent pool size ( #31553 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-01 19:29:42 -08:00
Kevin McKay
5fff44064b
[Bugfix] Replace BaseException with specific exceptions in FLA utils ( #31590 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-01-01 19:27:54 -08:00
Reagan Lee
1f5b7c41c3
Add Multimodal Processor Benchmark ( #29105 )
...
Signed-off-by: Reagan Lee <reaganjlee@gmail.com >
Signed-off-by: Reagan <reaganjlee@gmail.com >
2026-01-01 19:26:53 -08:00
Ekagra Ranjan
adcf682fc7
[Audio] Improve Audio Inference Scripts (offline/online) ( #29279 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-12-31 23:34:18 +00:00
Andreas Karatzas
21de6d4b02
[CI][Bugfix] Fix token counting in chunked prefill streaming test ( #31565 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-31 23:05:14 +00:00
Nick Hill
6c2cfb62ff
[BugFix] Fix async scheduling for pooling models ( #31584 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2025-12-31 14:48:51 -08:00
Fanjiang Ye
d8da76f3b7
[Bugfix] Fix BAGEL online serving for text and image understanding ( #31546 )
...
Signed-off-by: Dylan1229 <yvanphys@gmail.com >
Signed-off-by: UED <zxr3611244710@gmail.com >
Signed-off-by: mr-ye-cao <yecaoyc2019@gmail.com >
Co-authored-by: UED <zxr3611244710@gmail.com >
Co-authored-by: mr-ye-cao <yecaoyc2019@gmail.com >
Co-authored-by: Mr-Ye-Cao <60802056+Mr-Ye-Cao@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-31 14:46:10 -08:00
baonudesifeizhai
d722e9e614
Add GLM-ASR multimodal support ( #31436 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-31 23:12:24 +08:00
Andreas Karatzas
cf16342d43
[ROCm][CI] Update MiniCPM model test: MiniCPM3-4B to MiniCPM4.1-8B and simplify attention backend testing ( #31551 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-31 00:12:01 -08:00
Wentao Ye
357d435c54
[Bug] Fix log issue with \n ( #31390 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-12-30 21:16:55 -08:00
danisereb
108a2728f7
Add get_expert_mapping to NemotronHModel (for LoRA support) ( #31539 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2025-12-30 21:09:03 -08:00
TJian
578c8f51f6
[CI] [Critical] [CUDA] Fix duplicated test name ( #31562 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-30 21:01:09 -08:00
maang-h
b4bb5f312f
[Core] Remove unused num_tokens parameter from _init_model_kwargs ( #31517 )
...
Signed-off-by: maang <maang_h@163.com >
2025-12-30 20:47:23 -08:00
SameerAsal
70e1acefcd
[BugFix] Fix NUMA node validation in CPU platform ( #31520 )
...
Signed-off-by: SameerAsal <SameerAsal@users.noreply.github.com >
Co-authored-by: SameerAsal <SameerAsal@users.noreply.github.com >
2025-12-31 04:06:49 +00:00
Qiu
84f6cd741b
[Misc] add pcp basic support to MoE model ( #31003 )
2025-12-30 20:01:29 -08:00
B-201
ecd49ce7e6
[Fix] Align fused moe lora_b shape with peft ( #31534 )
...
Signed-off-by: bk-201 <joy25810@foxmail.com >
2025-12-31 09:44:59 +08:00
Amr Mahdi
e1ee11b2a5
Add docker buildx bake configuration ( #31477 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2025-12-31 01:08:54 +00:00
vintipandey
04147dcfa7
[Bugfix]Fix pooling model always disabled due to incorrect PP rank check ( #31505 )
...
Signed-off-by: vintipandey <vinti.pandey@gmail.com >
2025-12-30 11:27:10 -08:00
JartX
07728bf5cd
[BugFix] add select_gemm_impl on CompressedTensorsWNA16MoEMethod to support LoRA ( #31453 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2025-12-30 11:20:15 -08:00
yt0428
3f52fa5aa2
[Model] Add support for openPangu moe model ( #28775 )
...
Signed-off-by: yuantao <2422264527@qq.com >
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-30 08:11:38 -08:00
Li, Jiang
7157596103
[CPU] Disable async schedule on CPU ( #31525 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-30 12:34:08 +00:00
Nicolò Lucchesi
ab1af6aa3e
[CI][NIXL] Split DPEP tests ( #31491 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-30 07:26:12 -05:00
Pleaplusone
1a834df2d4
[ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled ( #31523 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-12-30 09:21:49 +00:00
Kevin
51085c2aeb
[Frontend] add continue_final_message parameter to /embeddings endpoint ( #31497 )
...
Signed-off-by: Kevin P-W <140451262+kevin-pw@users.noreply.github.com >
2025-12-30 07:21:13 +00:00
Roger Feng
3d973764ce
[xpu] [bugfix] upgrade to latest oneccl in dockerfile ( #31522 )
...
Signed-off-by: roger feng <roger.feng@intel.com >
2025-12-30 14:52:28 +08:00
Nick Hill
3b312fb792
[Minor] Various small code cleanups/simplifications ( #31508 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2025-12-29 22:42:06 -08:00
ZT-AIA
f84bf7d79b
Add Loraconfig parameter to get_punica_wrapper function ( #31408 )
...
Signed-off-by: ZT-AIA <1028681969@qq.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-29 22:27:31 -08:00
Roy Wang
99dcf5dcc5
Migrate meetups & sponsors [2/N] ( #31500 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2025-12-30 04:26:15 +00:00
Hojin Yang
dc837bc23e
feat(frontend): add --default-chat-template-kwargs CLI argument ( #31343 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
2025-12-30 03:38:47 +00:00
Nick Hill
e54ee3ea33
[Core] Deduplicate generate/encode logic in AsyncLLM ( #31510 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2025-12-30 10:42:45 +08:00
wangln19
358bfd315c
fix: update kimi k2 tool parser logic ( #31207 )
...
Signed-off-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local >
Signed-off-by: Wang Linian <wanglinian@stu.pku.edu.cn >
Co-authored-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-30 10:01:58 +08:00
Sage
39512aba72
[Prefix Cache] Include lora_name in BlockStored event for deterministic KV-cache reconstruction ( #27577 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
Co-authored-by: Sage <80211083+sagiahrac@users.noreply.github.com >
2025-12-30 00:17:16 +00:00
qli88
0f35429a0c
[CI]Test Group 'NixlConnector PD accuracy tests' is fixed ( #31460 )
...
Signed-off-by: qli88 <qiang.li2@amd.com >
2025-12-29 23:48:56 +00:00
Alexei-V-Ivanov-AMD
d63b969675
[CI/ROCm] Fixing "V1 Test attention (H100)" test group. ( #31187 )
...
Signed-off-by: DCCS-4560 <alivanov@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com >
Signed-off-by: <>
Co-authored-by: DCCS-4560 <alivanov@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com >
Co-authored-by: root <root@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com >
2025-12-29 16:53:59 -05:00
Robert Shaw
56f516254c
[Bugfix][ROCm] Fix Static Quant Issue ( #31502 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-12-29 13:27:55 -08:00
Robert Shaw
9152a30d8f
[MoE Refactor][12/N] Marlin Fp8 MoE Pure Function ( #31499 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-29 13:27:00 -08:00
Nick Hill
c2ff33cc8c
[Core] Enable async scheduling by default ( #27614 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2025-12-29 13:20:55 -07:00
chunxiaozheng
b12cb38398
implements register kv caches in lmcache connector ( #31397 )
...
Signed-off-by: idellzheng <idellzheng@tencent.com >
2025-12-29 11:13:42 -08:00
Roger Young
5bc664110f
Optimize QKNorm for MiniMax-M2/M2.1 ( #31493 )
...
Signed-off-by: xuebi <xuebi@minimaxi.com >
Co-authored-by: xuebi <xuebi@minimaxi.com >
2025-12-29 16:30:18 +00:00
RickyChen / 陳昭儒
b3a2bdf1ac
[Feature] Add offline FastAPI documentation support for air-gapped environments ( #30184 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-29 16:22:39 +00:00
Harry Mellor
e37e7349e6
Replace nn.ConvNd with vLLM's ConvNdLayer for Transformers modeling backend ( #31498 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-29 16:20:01 +00:00
Roy Wang
b5d2d71d26
Migrate doc to website: Hardware Plugins (1/N) ( #31496 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2025-12-29 15:55:20 +00:00
Harry Mellor
decc244767
[Docs] Use relative md links instead of absolute html links for cross referencing ( #31494 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-29 13:33:44 +00:00
amittell
9c884faa95
[Bugfix] Preserve tool call id/type/name in streaming finish chunk ( #31438 )
...
Signed-off-by: amittell <mittell@me.com >
Signed-off-by: Alex Mittell <mittell@me.com >
2025-12-29 21:10:52 +08:00
Chauncey
48d5ca4e8b
[CI] fix test_chat_truncation_content_not_null test ( #31488 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-29 12:47:08 +00:00
twj
bf73a3e4d7
[Bugfix][Frontend] Fix Jina reranker multimodal input compatibility ( #31445 )
...
Signed-off-by: tianwenjing <tianwenjing@jfgenius.com >
Signed-off-by: twj <151701930+twjww@users.noreply.github.com >
Co-authored-by: tianwenjing <tianwenjing@jfgenius.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-29 01:13:18 -08:00
Andreas Karatzas
3ecfdc3776
[ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition ( #30719 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-29 01:13:14 -08:00
Andreas Karatzas
45c1ca1ca1
[ROCm][CI] Skip DeepGemm-dependent test on ROCm platform ( #31462 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-29 16:31:10 +09:00
Li, Jiang
17347daaa2
[CI/Build][CPU] Update CPU CI test cases ( #31466 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-29 14:17:52 +08:00
Mamy Ratsimbazafy
b9793e6a8c
Add Fused MoE Triton kernels for GLM-4.5-Air, GLM-4.5v, GLM-4.6v on 2x RTX Pro 6000 ( #31407 )
...
Signed-off-by: Mamy Ratsimbazafy <mamy_github@numforge.co >
2025-12-28 08:38:33 -08:00
Jzz1943
0b6b701050
[Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 ( #31448 )
...
Signed-off-by: Zhongze Jiang <jiangzhongze.jzz@ant-intl.com >
2025-12-28 08:38:07 -08:00
Nick Hill
094fcce250
[BugFix] Re-fix async multimodal cpu tensor race condition ( #31373 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: njhill <nickhill123@gmail.com >
2025-12-28 03:05:08 -08:00
Andreas Karatzas
573dd0e6f0
[ROCm] Migrate xgrammar to upstream release ( #31327 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-28 00:08:29 -08:00
Andreas Karatzas
f70368867e
[ROCm][CI] Add TorchCodec source build for transcription tests ( #31323 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-28 16:06:05 +08:00
Andreas Karatzas
96142f2094
[ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test ( #31441 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-28 04:15:14 +00:00
Boyuan Feng
62def07d67
[BugFix] register quant scale tensors as buffer ( #31395 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-28 11:20:02 +08:00
yitingdc
b326598e97
add tip for VLLM_USE_PRECOMPILED arg to reduce docker build time ( #31385 )
...
Signed-off-by: yiting.jiang <yiting.jiang@daocloud.io >
2025-12-28 03:19:47 +00:00
Robert Shaw
727c41f3fd
[MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading ( #31169 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-27 20:22:48 +00:00
Boyuan Feng
2f12cd32c0
[BugFix] Fix cache issue in compilation_config ( #31376 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-27 09:30:39 -05:00
Isotr0py
40a8756224
[Chore]: Remove HF format Phi4-MM examples ( #31405 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-27 13:42:02 +00:00
Isotr0py
3d024985ab
[CI/Build] Ignore max transformers version for more common tests ( #31401 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-27 13:06:26 +00:00
baonudesifeizhai
8711b21676
Fix/get raw stream patch #30905 ( #30912 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-12-26 20:08:47 -08:00
Yifan Qiao
52bf066516
[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector ( #30166 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Co-authored-by: KuntaiDu <kuntai@uchicago.edu >
2025-12-26 18:25:46 -08:00
Kunshang Ji
5326c89803
[XPU][CI]skip test_preprocess_error_handling due to fork/spawn issue ( #31381 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-12-26 21:40:44 +00:00
Xinyu Chen
87f1b8ca2c
CustomOp: Unify aiter impl into GroupedTopk ( #31221 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2025-12-26 12:44:29 -05:00
rongfu.leng
887e900b77
[Docs] Add profiler user docs for http request ( #31370 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-12-26 23:48:15 +08:00
Patrick von Platen
48e744976c
[Mistral common] Ensure all functions are imported from the top & only use public methods ( #31138 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-26 04:48:24 -08:00
Jee Jee Li
ce1eafd1a5
[Core] Initialize LoRA support for tower and connector in multi-modal models ( #26674 )
...
Signed-off-by: bk-201 <joy25810@foxmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com >
Co-authored-by: bk-201 <joy25810@foxmail.com >
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com >
Co-authored-by: Anexdeus <5142168@mail.ru >
2025-12-26 04:48:20 -08:00
Harry Mellor
0b544e6476
[Docs] Fix some snippets ( #31378 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-26 12:47:41 +00:00
Jee Jee Li
c3666f56fd
[Misc] Fix Qwen2-MoE shared_expert_gate ( #31339 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-26 05:10:39 +00:00
Andreas Karatzas
c79dbfa9ad
[CI] Fix flaky vision beam search test with flexible semantic validation ( #31324 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-26 04:39:32 +00:00
Shinichi Hemmi
9ee05cbe7f
Support LoRA and GPTQModel for PLaMo 2/3 ( #31322 )
...
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
2025-12-26 11:41:33 +08:00
Ning Xie
3b8f31b362
[benchmark] use model card root instead of id ( #31329 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-12-26 10:55:56 +08:00
Isotr0py
2cd94259c8
[CI/Build] Ignore max transformers version skipping for initialization tests ( #30619 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-26 10:50:32 +08:00
oscardev256
b7165d53c6
Feature/isaac 0.1 ( #28367 )
...
Signed-off-by: oscardev256 <42308241+oscardev256@users.noreply.github.com >
Signed-off-by: Oscar Gonzalez <ogonzal6@alumni.jh.edu >
Signed-off-by: Yang <lymailforjob@gmail.com >
Co-authored-by: Yang <lymailforjob@gmail.com >
2025-12-25 18:49:11 -08:00
Nick Hill
81786c8774
[BugFix] Fix async scheduling + reasoning with struct output ( #31332 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2025-12-25 23:01:02 +00:00
Stan Wozniak
f1531d9f2a
[Hybrid] Mamba2 prefix cache blocks freeing for running requests ( #28047 )
...
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-12-25 20:54:06 +00:00
SongHe
2d6001f491
[Model][Ernie4.5-VL] Support video metadata for timestamp rendering ( #31274 )
...
Signed-off-by: dengsonghe <dengsonghe@baidu.com >
Co-authored-by: dengsonghe <dengsonghe@baidu.com >
2025-12-25 14:07:15 +00:00
Amir Samani
030fc44914
use the same stream for cuda graph capture and replay for NCCL ( #29207 )
...
Signed-off-by: Amir Samani <asamani@nvidia.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-12-25 19:10:03 +08:00
Isotr0py
2532f437ee
[Doc] Add troubleshooting for Triton PTX error about undefined gpu-name ( #31338 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-12-25 02:26:34 -08:00
Louie Tsai
f15185fbdb
[Benchmark Suite] improve cpu Benchmark Suite tests and comparison report for 0.12.0 ( #30994 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-12-25 08:51:45 +00:00
Mark Gatere
ba25a65992
[Frontend] add FunctionGemma tool parser support ( #31218 )
...
Signed-off-by: gateremark <gateremg@gmail.com >
2025-12-25 15:29:25 +08:00
Amith KK
42826bbccd
[Doc] Add tool call parser documentation for GPT-OSS models ( #31212 )
...
Signed-off-by: Amith KK <amithkumaran@gmail.com >
2025-12-25 05:29:10 +00:00
Richard Zou
254f6b9867
[Bugfix] Fix eagle dp tests on A100 ( #31241 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-12-25 00:05:04 +00:00
Michael Goin
bc5ef333e0
[Perf] Add skip_clone to SamplingParams for internal request handling ( #31041 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-24 14:35:57 -08:00
Cyrus Leung
09dc7c690c
[Chore][1/2] Drop v0.14 deprecations ( #31285 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 09:54:01 -08:00
ゆり
506eb0f454
[Bugfix] Remove dead block_quant_to_tensor_quant function ( #31294 )
...
Co-authored-by: yurekami <yurekami@users.noreply.github.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-24 17:22:48 +00:00
Ning Xie
5d93089686
[cli] complete vllm cli help message ( #31226 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-12-24 15:45:47 +00:00
Kevin McKay
66c9887440
[Bugfix][Hardware][AMD] Fix FP8 dtype in silu_mul quantization ( #31179 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-24 10:37:11 -05:00
wang.yuqi
1ff67df182
[CI] Reorganization pooling_mteb_test ( #31265 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-12-24 23:36:20 +08:00
skaraban3807
7cd288a4b3
[PERF] Add interleaved memory allocation to NUMA module ( #30800 )
2025-12-24 13:47:49 +00:00
Cyrus Leung
d201807339
[Chore] Bump lm-eval version ( #31264 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 05:39:13 -08:00
Cyrus Leung
aa3868ecfe
[Chore] Remove unused noqas ( #31263 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 05:38:46 -08:00
Cyrus Leung
7adeb4bfa8
[Bugfix] Fix max_model_len="auto" handling ( #31260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 19:15:27 +08:00
wang.yuqi
bd89ce16d2
[Model] Introduce verify_and_update_model_config for VerifyAndUpdateConfig. ( #31131 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
2025-12-24 09:54:57 +00:00
Pleaplusone
b41aeb3468
[Bugfix][ROCm] Fix load issue on deepseek quark quantization when shared expert enabled ( #31261 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-12-24 16:47:44 +08:00
Ryan Rock
ddfac7034e
[CI/Build] Ignore data_parallel_size_local ( #30281 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2025-12-24 07:40:54 +00:00
Micah Williamson
6559d96796
[ROCm][CI] Set TORCH_NCCL_BLOCKING_WAIT Distributed Tests On ROCm ( #31259 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-24 07:19:07 +00:00
kliuae
1c74150bca
[ROCm][CI] Fix "Distributed Tests (H200)" Test ( #31227 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2025-12-24 06:56:30 +00:00
Andreas Karatzas
0247a91e00
[ROCm][CI] Fix entrypoints tests and Python-only installation test on ROCm ( #28979 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-23 22:42:30 -08:00
Michael Goin
8ee90c83f8
Add --max-model-len auto to auto-fit context to available memory ( #29431 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-23 21:37:14 -08:00
Nick Cao
d7e05ac743
[docker] Fix downloading sccache on aarch64 platform ( #30070 )
...
Signed-off-by: Nick Cao <nickcao@nichi.co >
2025-12-23 21:36:33 -08:00
sihao_li
471ddb99a0
[XPU] Remove distributed_executor_backend check ( #30760 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-12-23 21:34:33 -08:00
Xiong Wang
bb24592d13
[Qwen3-Omni] fixed _get_feat_extract_output_lengths function ( #31007 )
...
Signed-off-by: Xiong Wang <wangxiongts@163.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-12-23 21:33:54 -08:00
Matthew Bonanni
369f47aa0f
[DeepSeek v3.2] Remove unnecessary syncwarps ( #31047 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-23 21:33:30 -08:00
zejunchen-zejun
dabff12ed3
[Bugfix][ROCm][Dynamo][DS 3.1][FP8] fix unsupported hasattr call when Dynamo tracing for ROCm device ( #31149 )
...
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com >
2025-12-23 21:32:19 -08:00
Ming Yang
3bb9561928
Revert "[bench] Support common prefix len config (for decode-only bench)" ( #31240 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-23 21:17:23 -08:00
Micah Williamson
3ce791ac77
[ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION="tf32" For terratorch Tests In AMD CI ( #31242 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-24 03:21:50 +00:00
Andreas Karatzas
e42894f5b5
[ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch and InternVL video test tolerance ( #31235 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-24 02:56:58 +00:00
Wentao Ye
76e6a95192
[Bug] Fix Number of dimensions of tensors must match. for Deepseek V3.2 ( #31160 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-24 10:41:09 +08:00
Chao Lei
8b59753cdb
[P/D] Mooncake connector support more protocols ( #30133 )
...
Signed-off-by: LCAIZJ <leichao139636@163.com >
2025-12-24 10:24:07 +08:00
Chen Zhang
538e830caa
[KVEvent] Use request.block_hash for parent block_hash ( #30544 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Co-authored-by: Yifan Qiao <yifanqiao@berkeley.edu >
2025-12-23 18:23:43 -08:00
rongfu.leng
4ed11105d7
[Misc] Remove unused custom ops copy_blocks and copy_blocks_mla ( #30967 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-12-23 18:22:35 -08:00
Cyrus Leung
dd424571c8
[Bugfix] Enable dynamic_dims for different embeds shape ( #31223 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 10:15:47 +08:00
Cyrus Leung
ca6a95ba25
[Chore] Simplify logic of _execute_mm_encoder ( #31222 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-23 18:15:16 -08:00
Vadim Gimpelson
bc0a5a0c08
[CI] Add Qwen3-Next-FP8 to Blackwell model tests ( #31049 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-12-23 17:21:50 -08:00
Andreas Karatzas
bfa2c0bbb9
[ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() ( #31203 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-23 21:48:01 +00:00
Mark McLoughlin
f790068600
[Core] Add a random suffix to frontend-provided request IDs ( #27987 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-12-23 13:05:39 -08:00
Asaf Joseph Gardin
34916ae37f
[Mamba] - Consolidate Mambas Attention Logic ( #28133 )
2025-12-23 21:57:00 +01:00
Yuan Tang
0736f901e7
docs: Add llm-d integration to the website ( #31234 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-12-23 20:27:22 +00:00
Harry Mellor
c016c95b45
Use helper function instead of looping through attribute names ( #29788 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-23 17:31:56 +00:00
Harry Mellor
1339878e13
Only patch original_max_position_embeddings for Transformers v4 ( #31214 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-23 16:46:32 +00:00
danielafrimi
b94f80ffb8
[FIX] FP4 quantization kernel padding initialization bug ( #31097 )
...
Signed-off-by: <>
Co-authored-by: root <root@gpu-193.slurm-workers-slurm.slurm.svc.cluster.local >
Co-authored-by: root <root@gpu-951.slurm-workers-slurm.slurm.svc.cluster.local >
2025-12-23 08:45:18 -08:00
Joachim Studnia
38c361f99d
Fix edge case Mistral tool parser ( #30724 )
...
Signed-off-by: Joachim Studnia <joachim@mistral.ai >
Signed-off-by: Joachim Studnia <studniajoachim@gmail.com >
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: juliendenize <julien.denize@mistral.ai >
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-12-23 14:19:58 +00:00
Cyrus Leung
bb62dda2c3
[Misc] Introduce encode_*_url utility function ( #31208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-23 13:45:21 +00:00
Patrick von Platen
3faa8bee57
adapt voxtral ( #31095 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-12-23 05:31:55 -08:00
Harry Mellor
b10d47e0e0
Add util function for checking nesting of rope parameters ( #31146 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-23 11:41:49 +00:00
R3hankhan
769f27e701
[OpenAI] Add parameter metadata to validation errors ( #30134 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2025-12-23 11:30:12 +00:00
Jakub Zakrzewski
23daef548d
[Frontend] Support using chat template as custom score template for reranking models ( #30550 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-12-23 11:19:16 +00:00
Jee Jee Li
27c6c2f98c
[Bugfix] Fix MoE LoRA bin/pt loading ( #31161 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-23 19:09:15 +08:00
Weida Hong
73cfb7a722
Correct position of docstring of class attributes ( #31209 )
...
Signed-off-by: Weida Hong <wdhongtw@google.com >
2025-12-23 02:08:58 -08:00
vllmellm
f32cfd7d97
[ROCm][FEAT] Support AITER RMSNorm quantization fusion pass ( #26575 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-12-23 02:07:54 -08:00
Jee Jee Li
6b16fff01b
[Bugfix] Fix Jais2ForCausalLM ( #31198 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-23 07:44:01 +00:00
Yan Ma
f1c2c20136
[XPU] decrease IGC_ForceOCLSIMDWidth for speculative decoding triton-xpu kernel compilation ( #30538 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-12-23 05:22:15 +00:00
Cyrus Leung
8cef137689
[Chore] Update more locations to use attention_config.backend ( #31153 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-22 19:19:50 -08:00
quanliu
a37328fc5c
[Feature] Batch invariant: Lora ( #30097 )
...
Signed-off-by: quanliu <18646313696@163.com >
2025-12-23 10:32:47 +08:00
Pavani Majety
3e10262356
Revert "[SM100] Enable fp8 compute for prefill MLA ( #30746 )" ( #31197 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-12-22 18:15:33 -08:00
Angela Yi
612d5ffdab
[ci] Fix Pytorch compilation test oom in 2.10 ( #31194 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2025-12-23 01:56:47 +00:00
Divakar Verma
78e5e62bbf
[AMD][CI] fix v1/engine test_preprocess_error_handling ( #31192 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-12-23 01:28:19 +00:00
Robert Shaw
b57b967386
[MoE Refactor][7/N] AITER MK ( #31102 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-22 16:42:58 -07:00
Michael Goin
6d518ffbaa
[CI Failure] Disable mosaicml/mpt-7b and databricks/dbrx-instruct tests ( #31182 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-22 15:40:35 -08:00
Benjamin Chislett
85aff45e24
[Perf] Remove blocking copy in GDN Attention ( #31167 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2025-12-22 14:25:22 -08:00
Wentao Ye
5312a7284e
[Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' ( #31173 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-22 14:24:27 -08:00
Lucas Wilkinson
de71747655
[SpecDecode] Simplified alternative padded-speculation acceptance rate fix ( #29845 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-22 13:06:10 -08:00
Michael Goin
9586354053
[Doc] Add vllm-metal to hardware plugin documentation ( #31174 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-22 20:06:29 +00:00
Pavani Majety
b10f41c894
[SM100] Enable fp8 compute for prefill MLA ( #30746 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-12-22 19:15:57 +00:00
Yongye Zhu
7b926e8901
[MoE Refactor][9/N] Use modular kernel for unquantized Triton MoE ( #31052 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-12-22 17:34:19 +00:00
Gregory Shtrasberg
ab3a85fd68
[ROCm][CI/Build] Fix triton version to one that has triton_kernels required for gpt-oss to run ( #31159 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-12-22 17:19:27 +00:00
Boyuan Feng
8dd0db687b
[UX] improve profiler error message ( #31125 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-22 08:45:59 -08:00
TJian
022f3cea53
[ROCm] [Critical]: Remove unused variable ( #31156 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-22 08:28:22 -08:00
Micah Williamson
a5bc77c253
[AMD][CI] Add "V1 Test e2e + engine" to mi325_8 Agent Pool ( #31040 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-22 10:41:56 -05:00
Nicolò Lucchesi
b1c3f96ae3
[CI][Bugfix] Fix entrypoints/openai/test_audio.py ( #31151 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-22 07:21:40 -08:00
dengyunyang
8f8f469b1b
[BugFix] skip language model in Encoder ( #30242 )
...
Signed-off-by: dengyunyang <584797741@qq.com >
2025-12-22 05:25:59 -08:00
Shengqi Chen
2cf91c2ea4
[CI] add polling for precompiled wheel in python_only_compile.sh, fix index generation for releases ( #30781 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-22 13:24:21 +00:00
AlonKejzman
bd6d5a7475
[gpt-oss] Fix harmony parser in streaming responses ( #30205 )
...
Signed-off-by: AlonKejzman <alonkeizman@gmail.com >
2025-12-22 20:56:06 +08:00
Li Wang
256a33ecb4
[Model] Fix bagel failed to run ( #31132 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-12-22 02:15:54 -08:00
Roger Young
c02a2705f9
Update MiniMax-M2 ToolCall and add MiniMax-M2.1 in Docs ( #31083 )
...
Signed-off-by: xuebi <xuebi@minimaxi.com >
Co-authored-by: xuebi <xuebi@minimaxi.com >
2025-12-22 05:28:40 +00:00
Kevin McKay
cf8eed7bef
[Bugfix][ROCm] Fix typo: is_linear_fp8_enaled -> is_linear_fp8_enabled ( #31109 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-21 21:14:58 -08:00
Kevin McKay
44ae85f725
[Misc] Fix typo: 'occured' -> 'occurred' ( #31120 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:14:27 -08:00
Kevin McKay
14c3e6ade3
[Misc] Fix spelling typos in model comments ( #31117 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:14:14 -08:00
Kevin McKay
42b42824ae
[Misc] Fix grammar errors in comments and messages ( #31115 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:14:02 -08:00
Kevin McKay
ec58c10ce1
[Misc] Fix quantization-related typos ( #31116 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:13:48 -08:00
Kevin McKay
8c084de59d
[Misc] Fix spelling typos in comments ( #31114 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:13:14 -08:00
CedricHuang
19cc9468fd
[Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM ( #30957 )
2025-12-21 22:34:49 -05:00
Jee Jee Li
097978a15d
[Kernel] Enable fused_qknorm_rope_kernel supports partial rope ( #30821 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-21 18:39:22 -08:00
Lucas Wilkinson
7e065eba59
[CI] Fix "2 Node Tests (4 GPUs in total)" ( #31090 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-22 10:32:40 +08:00
Steve Westerhouse
9d701e90d8
[Doc] Clarify FP8 KV cache computation workflow ( #31071 )
...
Signed-off-by: westers <steve.westerhouse@origami-analytics.com >
2025-12-22 08:41:37 +08:00
Michael Goin
06d490282f
[NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size ( #30897 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-21 09:41:57 -08:00
Robert Shaw
b471092d3a
[MoE Refactor][4/N] Marlin Fp8 Mk ( #31036 )
2025-12-21 12:37:42 -05:00
Ameen Patel
93cabc417c
ci: add nvidia-smi warmup before Prime-RL integration test ( #31093 )
...
Signed-off-by: AmeenP <ameenp360@gmail.com >
2025-12-21 15:43:01 +00:00
Chauncey
bb80f69bc9
add aarnphm and chaunceyjiang to the new tool_parser directory ( #31088 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-21 03:24:34 +00:00
汪志鹏
3e92b2b7ac
[BugFix]fix gpt-oss v1/completions response bug ( #30608 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: bbrowning <bbrownin@redhat.com >
2025-12-21 10:39:31 +08:00
Jinzhen Lin
7c73ceb581
[Quantization] add marlin w4a8/w8a8 check ( #31061 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-12-20 21:58:11 +00:00
Lucas Wilkinson
ae0770fa6b
[CI] Fix H200 Distributed test ( #31054 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-20 16:48:49 -05:00
Jinzhen Lin
ee52d9901d
[Quantization] support logical_widths for fp8 marlin ( #30962 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-20 12:02:57 -08:00
baonudesifeizhai
54c8924384
[MoE Refactor][5/N] Isolate zero expert to LongCatFlash ( #28891 )
...
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com >
Signed-off-by: Dongjie Zou <85092850+baonudesifeizhai@users.noreply.github.com >
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com >
2025-12-20 18:22:04 +00:00
Yan Ma
560ae9638c
[XPU] enable fp8 online streaming quantization ( #30944 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-12-20 13:45:27 +00:00
Jeffrey Wang
1501a4070e
[Bugfix] Read truncate_prompt_tokens from pooling_params in AsyncLLM.encode() ( #31013 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2025-12-20 10:29:31 +00:00
Lucas Wilkinson
ff2168bca3
[CI] Fix fixture 'siglip_attention_config' not found ( #31053 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-20 03:46:15 +00:00
Gregory Shtrasberg
0be149524c
[ROCm][CI/Build] Update ROCm dockerfiles ( #30991 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-12-20 03:19:12 +00:00
zejunchen-zejun
d52c5096d7
[Bugfix] fix the alias bug of AttentionBackendEnum when register CUSTOM attention backend to vllm ( #30869 )
...
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com >
2025-12-20 09:03:35 +08:00
Yuxuan Zhang
8a7a414374
GLM-4.7 Tool Parser and Doc Update ( #30876 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-12-20 00:09:58 +00:00
Robert Shaw
95befecc18
[MoE Refactor][2/N] Use Modular Kernels for Fp8 ( #30825 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-19 23:36:38 +00:00
Wentao Ye
4cf9429897
[Bug] Fix error 'Dynamo failed to run FX node with fake tensors' for Deepseek V3.2 ( #31046 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-19 23:31:31 +00:00
Robert Shaw
83a317f650
[MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) ( #30990 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-19 13:09:54 -08:00
Lucas Wilkinson
5f6477d1d0
[BugFix] Fix TypeError: unhashable type: 'dict' when serving deepseek32 ( #30924 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-19 16:07:54 -05:00
Wentao Ye
3bd8335bd0
[Refactor] Refactor for DeepGemmQuantScaleFMT using cache ( #30898 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-19 13:50:39 -07:00
Seiji Eicher
1ab5213531
Make engine core client handshake timeout configurable ( #27444 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-12-19 20:38:30 +00:00
Zhonghua Deng
969bbc7c61
[Model] Add MiMo-V2-Flash support ( #30836 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
Signed-off-by: Jumiar <liuanqim10@126.com >
Signed-off-by: Zyann7 <zyann7@outlook.com >
Co-authored-by: Jumiar <liuanqim10@126.com >
Co-authored-by: Zyann7 <zyann7@outlook.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-19 17:17:03 +00:00
Andrey Talman
268a972c62
Update Pytorch version update docs ( #30982 )
2025-12-19 16:08:53 +00:00
Jinzhen Lin
5fbfa8d9ef
[Quantization] fix marlin w8a8 check ( #30961 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-12-19 07:33:22 -08:00
Shanshan Shen
23a1946e3b
[CustomOp][Refactor] Extract common methods for ApplyRotaryEmb CustomOp ( #31021 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-12-19 22:16:09 +08:00
Thomas Parnell
b5545d9d5c
[Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window ( #30887 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-12-19 21:39:54 +08:00
Nishidha Panpaliya
bd2b52fc2d
[CPU][Bugfix] Fix ppc64le CPU build ( #30871 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-12-19 12:26:35 +00:00
Li, Jiang
420ba2dbb6
Enable aarch64 CPU performance benchmarks ( #26494 )
...
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com >
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Ioana Ghiban <ioana.ghiban@arm.com >
Co-authored-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-19 12:16:18 +00:00
Marko Rosenmueller
455949675d
[Frontend][Bug] allow tool calls in analysis channel ( #28139 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-19 10:47:44 +00:00
lif
086b96339f
[Bugfix] Add validation for tool requests when tool_parser is unavailable ( #30613 )
...
Signed-off-by: majiayu000 <1835304752@qq.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-19 18:23:28 +08:00
Jinzhen Lin
9187de9fac
[Quantization] enable compressed-tensors marlin support for turing (2) ( #31008 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-12-19 08:56:35 +00:00
Isotr0py
ac1c934276
[Bugfix] Fix incorrect tiles creation for mm prefix triton attention ( #30974 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-19 16:00:33 +08:00
Wenqi Glantz
4924ac582c
Add hidden dimension validation for multimodal embedding inputs ( #30968 )
...
Signed-off-by: Wenqi Glantz <wglantz@nvidia.com >
2025-12-19 07:59:36 +00:00
Li, Jiang
096b25c9ed
[Doc][CPU] Fix index link for CPU regular release wheels ( #31015 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-19 07:29:52 +00:00
Jinzhen Lin
de08b8f61b
[Quantization] enable compressed-tensors marlin support for turing ( #31000 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-12-18 20:29:48 -08:00
Nick Hill
2ac85a4544
[BugFix] Fix logprobs with spec decode and modified logits ( #30846 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-18 19:58:28 -08:00
Andreas Karatzas
7b43db210c
[ROCm][CI][Bugfix] Multi-Modal Model Support Fixes and Attention Backend Improvements ( #30270 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-19 02:17:27 +00:00
PlatinumGod
6a09612b2e
[Bugfix] Fix tool_choice="none" being ignored by GPT-OSS/harmony models ( #30867 )
...
Signed-off-by: yujiepu <pyjapple@gmail.com >
Signed-off-by: PlatinumGod <pyjapple@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-19 09:34:27 +08:00
Nick Hill
45c0526ac9
[BugFix] Handle errors when preprocessing added requests ( #30895 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-19 01:29:11 +00:00
Benjamin Chislett
d6b3d39b6d
[Cleanup] Refactor FlashInferMetadataBuilder ( #29128 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-18 14:45:30 -08:00
Chendi.Xue
6ca74bc11a
[NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA ( #30419 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2025-12-18 22:10:02 +00:00
Harry Mellor
19c583398a
Check for truthy rope_parameters not the existence of it ( #30983 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-18 13:59:10 -08:00
Nick Hill
b0b77c4655
[BugFix] Fix spec decode + structured outputs + preemption edge case ( #30916 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-18 12:59:55 -08:00
Kayvan Mivehnejad
634a14bd7d
Strengthen input validation and tests for 'parse_raw_prompts'. ( #30652 )
...
Signed-off-by: Kayvan Mivehnejad <K.Mivehnejad@gmail.com >
2025-12-18 19:51:58 +00:00
Chen Zhang
24b65eff0d
[BugFix] Spec decode with VLLM_ENABLE_V1_MULTIPROCESSING=0 ( #30319 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-12-18 19:47:56 +00:00
Elizabeth Thomas
41b6f9200f
Remove all2all backend envvar ( #30363 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-18 19:46:28 +00:00
Wentao Ye
97000a2be7
[Bug] Fix compressed tensor not using deepgemm ( #30820 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-18 14:45:55 -05:00
Isotr0py
d2dc5dfc6e
[Bugfix] Remove tile_size=64 for mm_prefix triton attention ( #30973 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-18 20:42:32 +01:00
navmarri14
b8c477c115
tuned fused configs for B300 ( #30629 )
2025-12-18 11:41:59 -08:00
jiahanc
53ad423f26
[Perf] enable flashinfer rotary_embedding custom ops in DeepSeek rotary ( #30729 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2025-12-18 14:31:18 -05:00
wz1qqx
889f8bb250
[BugFix]Reclaim resources to prevent memory leaks when use LMCacheMPConnector ( #30745 )
...
Signed-off-by: wz1qqx <ziqi.wang@novita.ai >
Co-authored-by: wz1qqx <ziqi.wang@novita.ai >
2025-12-18 19:09:51 +00:00
Fanli Lin
058926d48c
[XPU] allow custom workers (e.g. vllm-omni workers) to be used on XPU ( #30935 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
2025-12-18 10:16:36 -08:00
Isotr0py
700a5ad6c6
[MM Encoder]: Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface ( #30684 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-19 02:04:19 +08:00
Alec
62be3670cb
[BugFix] Add sleep to fix tight loop and release GIL ( #29476 )
...
Signed-off-by: alec-flowers <aflowers@nvidia.com >
Signed-off-by: Alec <35311602+alec-flowers@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-12-18 09:52:55 -08:00
inkcherry
500f26e6d3
[Bugfix] fix DP-aware routing in OpenAI API requests ( #29002 )
...
Signed-off-by: inkcherry <mingzhi.liu@amd.com >
2025-12-18 09:50:42 -08:00
Nick Hill
686cbaac64
[Cleanup] Remove unused ModelRunner V1 InputBatch.num_tokens field ( #30218 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-18 09:17:00 -08:00
Vasiliy Kuznetsov
f4ee2c3d90
fix fp8 online quantization streaming with tp > 1 ( #30900 )
...
Signed-off-by: vasiliy <vasiliy@fb.com >
2025-12-18 11:45:15 -05:00
Xin Yang
9a5e96523b
[LoRA] Set default MXFP4 LoRA backend to Marlin ( #30598 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-18 08:42:22 -08:00
wzyrrr
326e7c3105
[Doc] Add Sophgo TPU Support ( #30949 )
...
Co-authored-by: zhaoyang.wang <zhaoyang.wang@sophgo.com >
2025-12-18 16:29:33 +00:00
Lucas Kabela
0db5439ded
[Bugfix][torch2.10] Fix test_qwen2_5_vl_compilation with 2.10 RC ( #30822 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-18 08:23:31 -08:00
sarathc-cerebras
28d15ab56b
adds jais 2 support ( #30188 )
...
Signed-off-by: sarathc-cerebras <sarath.chandran@cerebras.net >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-12-18 15:46:58 +00:00
Wentao Ye
6628758233
[Bug] Fix batch invariant in torch 2.10 ( #30907 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-18 07:27:51 -08:00
zhrrr
eee600c34f
[Misc] support nsys profile for bench latency ( #29776 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2025-12-18 14:52:20 +00:00
Michael Goin
100f93d2be
Filter safetensors files to download if .safetensors.index.json exists ( #30537 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-18 14:51:17 +00:00
vllmellm
96bf50a2c0
[ROCm] Serving Fails on Radeon Due to AITER Dtype Import ( #30952 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-12-18 11:47:46 +00:00
Li, Jiang
f90d3636e2
[Bugfix][CPU] Fix Mac CPU build ( #30955 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-18 01:38:22 -08:00
Ming Yang
8372be2828
[moe] Use enable_chunking func (to support disabling chunking) ( #29935 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-18 09:02:38 +00:00
Andreas Karatzas
8da6ae49c3
[ROCm][Bugfix] Fix fa_version argument error in flash_attn_maxseqlen_wrapper for ROCm without aiter ( #30909 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-18 16:45:51 +08:00
Lucas Wilkinson
30bb19a760
[BugFix] Partial revert of #29558 (DeepEP HT + PIECEWISE CG support) ( #30910 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-17 23:50:15 -08:00
Chauncey
aa7e836055
[Bugfix] Fix Unicode issues in GLM-4 tool calling ( #30920 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-18 07:12:17 +00:00
Andreas Karatzas
be2ad5f920
[ROCm][Bugfix] fix(structured_output): Skip guidance backend for schemas with patternProperties ( #30730 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-18 07:04:57 +00:00
wangxiyuan
a85724bd6e
[Platform] Let EPD work with non-cuda platform ( #30225 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-12-18 06:45:29 +00:00
Yifan Qiao
11a89cf95c
[Fix][FlexAttention] return max logical block index to handle reused blocks ( #30915 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2025-12-18 06:42:21 +00:00
Li, Jiang
e3ab93c896
[CPU] Refactor CPU fused MOE ( #30531 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-18 14:36:49 +08:00
Nathan Price
fc2ae6d617
fix: add warmup for audio preprocessing ( #30706 )
...
Signed-off-by: Nathan Price <nathan@abridge.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-18 06:12:29 +00:00
Yihua Cheng
ec965569d9
[KV connector][LMCache] Only record the cuda event when there are request to store/load ( #30814 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-12-18 05:31:34 +00:00
Divakar Verma
82dc338ad6
[AMD][CI] fix lm eval ci arg ( #30911 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-12-18 13:18:26 +08:00
Vadim Gimpelson
717ac33d9c
[PERF] Qwen3-next. Add fp8 cutlass MoE tuned configs. chmod -x *MI308X.json ( #29553 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-12-18 13:16:04 +08:00
Li, Jiang
cfb7e55515
[Doc][CPU] Update CPU doc ( #30765 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-18 04:59:09 +00:00
zzhxxx
b166ef20e1
[refactor] Add prefix support to embed_tokens in DeepSeek MTP ( #30788 )
...
Signed-off-by: zzhx1 <zzh_201018@outlook.com >
2025-12-18 04:45:56 +00:00
Zhengxu Chen
5f2f3fba1d
[compile] Fix CI for test_gpt2_cache_hit ( #30902 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-12-17 20:22:23 -08:00
Matthew Bonanni
4a8412f773
[UX] Reduce DeepGEMM warmup log output to single progress bar ( #30903 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-17 20:21:51 -08:00
Bowen Bao
0c738b58bc
[Quantization] Support Quark int4-fp8 w4a8 for MoE ( #30071 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2025-12-18 04:20:42 +00:00
gnovack
5a3adf581e
fused_moe_lora PDL improvements ( #30716 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-17 19:55:00 -08:00
Isotr0py
6fe5887652
[Chore] Remove v0 dead code for Qwen2.5-omni ( #30883 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-17 19:54:39 -08:00
Nicolò Lucchesi
bc3700e0cd
[NIXL] Support P tensor-parallel-size > D tensor-parallel-size ( #27274 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-18 11:53:30 +08:00
Micah Williamson
fd8afdf38d
[ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 ( #30811 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-18 10:27:37 +08:00
SungMinCho
a0b782f9cc
[Metrics] Model FLOPs Utilization estimation ( #30738 )
...
Signed-off-by: SungMinCho <tjdals4565@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-12-18 01:40:51 +00:00
Rafael Vasquez
ed2897f336
[CI][Feature] Adds auto-rebase PR rule ( #30875 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2025-12-18 00:46:44 +00:00
Isotr0py
74a1ac38b0
[v1] Add PrefixLM support to TritonAttention backend ( #30386 )
2025-12-17 16:05:24 -08:00
Nathan Price
05a83dc6ee
feat(api): Eager chat template warmup to eliminate first-request latency ( #30700 )
...
Signed-off-by: Nathan Price <nathan@abridge.com >
2025-12-18 00:01:29 +00:00
Varun Sundar Rabindranath
e3fc374a9a
[BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM ( #30899 )
2025-12-17 15:00:59 -08:00
Andrey Talman
e06d0bf0aa
2.9.1 PyTorch release update ( #28495 )
2025-12-17 12:20:22 -08:00
Xunzhuo
e3a0f21e6c
[docs]: add ecosystem projects sr in docs/governance ( #30844 )
...
Signed-off-by: bitliu <bitliu@tencent.com >
2025-12-17 18:45:56 +00:00
Matthew Bonanni
7eb6cb6c18
[Attention] Update tests to remove deprecated env vars ( #30563 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-17 09:49:59 -08:00
Nicolò Lucchesi
9ca8cb38fd
[CI][Bugfix] Fix flaky tests/entrypoints/openai/test_audio.py::test_chat_streaming_audio ( #30878 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-17 18:49:56 +01:00
Cyrus Leung
2497228ad4
[Chore] Factor out logic for requesting initial memory ( #30868 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-17 07:32:17 -08:00
KimHyemin
196cdc3224
[Model] Gemma3: Support untied word embeddings ( #30827 )
...
Signed-off-by: www-spam <panmahm@naver.com >
2025-12-17 07:11:18 -08:00
高鑫崧
b7b6a60aca
Adapt the old parameter enable_thinking in chat_template_kwargs ( #30852 )
...
Signed-off-by: xinsong.gao <1418762819@qq.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-17 07:10:59 -08:00
rongfu.leng
9e67c4ce98
[Docs] fix function name ( #30748 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-12-17 12:14:45 +00:00
Jialin Ouyang
6e9dbcc50e
[Fix] uniform decode batch check ( #30747 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-12-17 19:58:43 +08:00
Hank_
6482e3895b
chores: adjust the attn register param order ( #30688 )
...
Signed-off-by: Hank <hcc.mayday@gmail.com >
2025-12-17 19:58:16 +08:00
Harry Mellor
fb980eb2fd
Fix lazy import ( #30858 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-17 03:33:50 -08:00
baoqian426
84896fda22
[Bugfix] deepseek-V3.2 self.weights_proj has no bias ( #30841 )
...
Signed-off-by: baoqian <1354987947@qq.com >
Signed-off-by: baoqian426 <1354987947@qq.com >
2025-12-17 03:32:34 -08:00
Kevin H. Luu
4bf6c23668
[ci] Sync test areas yaml file with test-pipeline ( #30862 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2025-12-17 02:30:56 -08:00
Chauncey
9ad5b21710
[Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory ( #30749 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-17 02:27:30 -08:00
Wentao Ye
f284d7bd0c
[Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute weight_scale_inv ( #30823 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-17 02:00:35 -08:00
Zhengxu Chen
53cd7f868b
[compile] Recompile graph module during Dynamo cache loading. ( #30743 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@fb.com >
2025-12-17 02:00:12 -08:00
danielafrimi
7b966ae2ba
[Fix]Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) ( #30785 )
...
Signed-off-by: <>
Co-authored-by: root <root@gpu-937.slurm-workers-slurm.slurm.svc.cluster.local >
2025-12-17 01:56:38 -08:00
Zhengxu Chen
9db1db5949
[compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors ( #30809 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-12-17 01:56:24 -08:00
Zhengxu Chen
177c391db2
[compile] Disable aot when eager backend is used. ( #30810 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-12-17 01:55:56 -08:00
Michael Goin
519ef9a911
[UX] Make vllm bench serve discover model by default and use --input-len ( #30816 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-17 01:55:30 -08:00
Ye (Charlotte) Qi
a100152288
[Kernels][FI] Skip trtllm attention when num_kv_heads=1 ( #30842 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-12-17 01:54:21 -08:00
Andrew Xia
4c054d89aa
[Doc][ResponsesAPI] add documentation ( #30840 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-12-17 01:53:02 -08:00
Sheng Lin
f4e884f222
[NIXL][Bugfix] Fix NIXL/RDMA registration failure over CuMemAllocator ( #29569 )
...
Signed-off-by: Somoku <linsh0@protonmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2025-12-17 01:52:58 -08:00
Xinyu Chen
3b1d440ede
CustomOp: grouped topk ( #29575 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2025-12-17 17:43:00 +08:00
Asaf Joseph Gardin
a9e15c21ef
[Mamba] Removed disable cascade attn in MambaModelConfig ( #30712 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-12-17 08:48:53 +00:00
Robin
20fda43151
[Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction ( #30555 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-12-17 16:37:57 +08:00
Yan Ma
4f735babb7
[XPU] fix broken fp8 online quantization for XPU platform ( #30831 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-12-17 00:28:13 -08:00
Li, Jiang
0cd5353644
[Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models ( #30829 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-16 23:25:12 -08:00
Michael Goin
d4d2751732
Update note comment for flashinfer attention warmup ( #30711 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-16 21:29:03 -08:00
shanjiaz
009a773828
bump up compressed tensors version to 0.13.0 ( #30799 )
...
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-12-16 21:01:04 -08:00
Cyrus Leung
44d3b1df3d
[CI/Build] Fix compatibility between #30244 and #30396 ( #30787 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-16 20:21:19 -08:00
Fadi Arafeh
bb5ac1fe38
[CPU] Add action to automatically label CPU related PRs ( #30678 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-17 04:21:07 +00:00
Michael Goin
811cdf5197
Update model-hosting-container-standards to 0.1.10 ( #30815 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-12-16 17:52:14 -08:00
Grzegorz K. Karch
f5db6385a1
Fix nemotron_nas intermediate_size computation ( #30795 )
...
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com >
2025-12-17 01:06:28 +00:00
Amr Mahdi
c0a88df7f7
[docker] Allow kv_connectors install to fail on arm64 ( #30806 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2025-12-16 16:41:57 -08:00
Nicolò Lucchesi
e087fbc393
[MM] Pass FA version in ViT Attn ( #30756 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-17 07:54:45 +08:00
Michael Goin
e80455ca8b
Replace deprecated enable_fusion with fuse_norm_quant in test_rms_group_quant ( #30817 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-16 23:40:47 +00:00
TJian
2410132bb1
[ROCm] [Bugfix] Fix torch sdpa hallucination ( #30789 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-16 15:32:43 -08:00
Michael Goin
0a1ab1e565
[Perf][Kernels] Vectorize csrc/activations_kernels.cu ( #29512 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-16 14:56:02 -08:00
Wentao Ye
b6ec077e05
[CI] Skip ci failure test ( #30804 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-16 22:47:53 +00:00
Jinzhen Lin
ce96857fdd
[Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) ( #29901 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-12-16 14:35:28 -08:00
Daniel Cámpora
eaa82a709a
[Bugfix][DSV32] Fix overflow in topk. ( #30754 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-12-16 14:21:17 -08:00
Roger Wang
f5f51e5931
[Core][MM] Optimize encoder cache manager by operating with embeddings only ( #30475 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Sun Kim <sunytokki@gmail.com >
2025-12-16 14:18:17 -08:00
Lucas Wilkinson
9fec0e13d5
[Attention] Cache attention metadata builds across hybrid KV-cache groups ( #29627 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com >
2025-12-16 17:10:16 -05:00
jiahanc
254a7f8fd6
[Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE ( #30014 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2025-12-16 13:01:48 -08:00
Wentao Ye
f21f5ea38c
[Refactor] Small refactor for group topk ( #30562 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-12-16 14:50:59 -05:00
Nicolò Lucchesi
ca702a14dc
[Frontend] Add max-completion-token option to transcription/translation endpoints ( #30769 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-16 19:36:49 +00:00
Michael Goin
10ee1c64cf
[CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test ( #30723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-16 14:28:34 -05:00
Mark McLoughlin
66c3537e5d
[Docs][API] Remove warning about LoRARequest being internal-only ( #30774 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-12-16 08:35:46 -08:00
Harry Mellor
e1625498f4
Update where bytes_to_unicode is imported from ( #30771 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-16 08:05:01 -08:00
Harry Mellor
0b0acc758e
Remove head_mask from Ultravox and Swin ( #30764 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-16 08:02:41 -08:00
Harry Mellor
af506fd76a
Fix instantiation of HfHubHTTPError in LoRA test ( #30768 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-16 08:02:24 -08:00
Ming Yang
ce12b407f2
[TRTLLM] Remove the MoE GEMM weight name change ( #30713 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-16 11:01:38 -05:00
Wentao Ye
59bd5f6a71
[Feat] Enable eplb with default all2all backend ( #30559 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-16 10:33:52 -05:00
Lucas Wilkinson
00a8d7628c
[BugFix] Fix memory spike in workspace allocation ( #30744 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-16 06:46:22 -08:00
Isotr0py
4de08ad698
[CI/Build] Skip broken ViT backend functionality test temporarily ( #30782 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-16 06:45:25 -08:00
Nicolò Lucchesi
75eb302a2e
[Bugfix] Whisper fix number of allocated CrossAttn blocks per-request ( #30772 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-16 14:20:19 +00:00
Pleaplusone
9dbbc59b15
[ROCm][MTP] Support MTP for AITER MLA backend ( #28624 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-12-16 14:10:26 +00:00
Boyuan Feng
104003dc77
update piecewise cudagraph warning when splitting_ops=[] ( #30728 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-16 06:09:34 -08:00
TJian
d0fb572929
[ROCm] [AITER] [DOC] Add usage description about check functions in _aiter_ops ( #30586 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-16 13:50:47 +00:00
Harry Mellor
6f15ac5de7
Don't assume position_embedding_type will be present for BERT and RoBERTa models ( #30770 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-16 13:40:26 +00:00
Junru Shen
676db55eec
[Bugfix] Fix prefix_repetition routing in bench throughput ( #29663 )
...
Signed-off-by: Junru Shen <jrshen.sjr@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-16 01:37:15 -08:00
Jee Jee Li
0e391e7570
[Bugfix] Fix RequestOutput missing lora_request ( #30636 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-16 01:36:35 -08:00
Andrew Xia
0d0c929f23
[responsesAPI][8] input/output messages for ResponsesParser ( #30158 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Signed-off-by: Andrew Xia <axia@meta.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-16 13:54:59 +08:00
Isotr0py
e94384bbad
[Bugfix] Fix broken ViT attention selection for Blackwell device ( #30731 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-16 05:24:32 +00:00
jiangkuaixue123
b9ff4f2a8d
[feature] extend DBO to XBO ( #30120 )
...
Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com >
Co-authored-by: root <root@hk01dgx028.cm.cluster >
2025-12-16 00:04:01 -05:00
Boyuan Feng
c881db364e
improve lazy import test ( #30733 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-16 03:12:05 +00:00
Shanshan Shen
3bd9c49158
[CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic ( #29873 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
Co-authored-by: gcanlin <canlinguosdu@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-12-15 19:08:16 -08:00
Amr Mahdi
ff21a0fc85
[docker] Restructure Dockerfile for more efficient and cache-friendly builds ( #30626 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2025-12-15 18:52:19 -08:00
penfree
bbd850e597
[Bugfix] fix streaming final output for non harmony ( #30237 )
...
Signed-off-by: penfree <qiupengfei@baidu.com >
Co-authored-by: penfree <qiupengfei@baidu.com >
2025-12-16 09:03:11 +08:00
Shengqi Chen
511e81e7c9
[BUILD] use sm_100f when compiling flashmla to fix support on sm103 ( #30705 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-15 14:48:01 -08:00
Matthew Bonanni
a182be4308
[UX][Attention] Add attention_config argument to LLM() ( #30710 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-15 17:29:09 -05:00
Kevin Musgrave
c01d589813
[Benchmarks] auto_tune.sh: Use hostname variable for server requests ( #30529 )
...
Signed-off-by: Kevin Musgrave <kevin.musgrave@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-15 22:00:29 +00:00
Matthew Bonanni
60dbf7d8f1
Update batch invariant to use attention config ( #30704 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-15 15:24:16 -05:00
Michael Goin
a450c64a30
[Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args ( #30708 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-15 20:18:02 +00:00
Fadi Arafeh
b2191abdca
[docs][fix] Update Arm CPU vLLM wheel installation docs ( #30594 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-15 19:46:25 +00:00
Matthew Bonanni
51e5b3e3c4
[Bugfix] Fix ViT with FlashAttention on ROCm ( #30703 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-15 19:45:21 +00:00
Isotr0py
ec154c36ee
[Platform] Refactor Platform attention backend selection to avoid breakpoint for OOT platform ( #30212 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-15 17:36:07 +00:00
Harry Mellor
970713d4a4
Remove SkipValidation from ModelConfig ( #30695 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-15 17:34:08 +00:00
mondaylord
17fec3af09
[Bugfix] Fix missing first token in tool calls during reasoning-to-tool transition ( #30671 )
...
Signed-off-by: mondaylord <20212010046@fudan.edu.cn >
2025-12-15 16:13:37 +00:00
yjc9696
855b101d75
[Frontend] add tools for dsv32 developer role ( #30040 )
...
Signed-off-by: pridejcyang <pridejcyang@tencent.com >
Co-authored-by: pridejcyang <pridejcyang@tencent.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-15 15:08:47 +00:00
Robert Shaw
d0502b4928
[MoE][Refactor 1/N] Separate Online Quantization ( #30627 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-15 06:54:53 -08:00
Max Hu
3f175f18a2
[Bugfix] Fix multimodal configuration for Qwen3VL MOE model ( #30670 )
...
Signed-off-by: Max Hu <hyoung2991@gmail.com >
2025-12-15 14:06:01 +00:00
Cyrus Leung
ed586e7724
[Refactor] [3/N] Move tool parser tests and run on CPU ( #30693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-15 13:45:36 +00:00
Chauncey
2a1776b7ac
[Refactor] [2/N] Move tool parsers into the vLLM main directory ( #30675 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-15 12:54:52 +00:00
Nicolò Lucchesi
185c22bf2f
[Misc][Hybrid allocator + kv connector] Optionally enable hybrid allocator + KV cache connector ( #29805 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-15 11:17:58 +00:00
duke
e4806d973a
[BugFix] Add embed_input_ids method to make QWenLMHeadModel a vllm model ( #30674 )
...
Signed-off-by: root <iwzbi@zju.edu.cn >
Co-authored-by: root <iwzbi@zju.edu.cn >
2025-12-15 10:38:29 +00:00
wang.yuqi
4429d934de
[Model] Automatic conversion of TokenClassification model ( #30666 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-12-15 08:13:00 +00:00
ゆり
33278073d6
typing: Add type hints to TurnMetrics class in context.py ( #30552 )
...
Co-authored-by: zkexorability <zkexorability@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-14 23:00:39 -08:00
汪志鹏
1adeb3b84c
[New Model] BAGEL support (AR only) ( #28439 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-15 14:58:23 +08:00
Kunshang Ji
e3a1cd1c59
[XPU] fix Dockerfile.xpu, avoid wheel conflicts ( #30662 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-12-15 13:32:06 +08:00
Wentao Ye
3778673ea8
[Feat] Refactor for parallel_config in FusedMoEModularKernel ( #30282 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-12-15 04:21:36 +00:00
Seokhyun An
b337647aa0
[Bugfix] Drop empty tool_calls lists to keep assistant replies in chat template ( #30648 )
...
Signed-off-by: Seokhyun An <iamseokhyun@gmail.com >
2025-12-15 04:21:12 +00:00
Jee Jee Li
a524d1ba0a
[Bugfix] Fix deepseek_v32 tokenizer_mode ( #30658 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-15 04:20:31 +00:00
Shanshan Shen
87b4d1557d
[CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. ( #30125 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-15 11:13:32 +08:00
Wenqi Glantz
84e23d103d
additional protection for CVE-2025-62164 ( #30649 )
...
Signed-off-by: Wenqi Glantz <wglantz@nvidia.com >
2025-12-15 03:07:10 +00:00
Shanshan Shen
738648fb81
[CustomOp] Support object-level enable for CustomOp ( #30547 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-12-15 11:02:09 +08:00
Boyuan Feng
917fdae5b2
[Log] Skip piecewise cudagraph warn when using full cudagraph ( #30657 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-15 02:49:45 +00:00
Robert Shaw
e2ed238885
Revert "[Fix]Load kv-cache dtype from hf_quant_config.json automatically" ( #30653 )
2025-12-14 19:33:41 -05:00
Or Ozeri
174e39ead7
CPU KV Offloading: Use more CUDA streams ( #29013 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-12-14 23:50:45 +00:00
RioS
9ccbf6b692
[responsesAPI]add extra body parameters ( #30532 )
...
Signed-off-by: Ri0S <aa248424@gmail.com >
2025-12-14 19:25:45 +00:00
Chendi.Xue
ae2e503dda
[NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665 ( #30420 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-12-14 15:38:28 +00:00
Tsukasa OI
9e33a1a75b
[Model][Quantization] Override HF defaults to GGUF ones (incl. Qwen3 MoE) ( #30118 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
2025-12-14 15:01:42 +00:00
Vensen
add4b0ca44
[Bugfix][benchmarks] Fix input token calculation for rerank benchmark metrics ( #30596 )
...
Signed-off-by: vensen <vensenmu@gmail.com >
2025-12-14 14:57:15 +00:00
ZiTian Zhao
ae88aada38
[Feature]Add EVS (Efficient Video Sampling) Support for Qwen3-VL ( #29752 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
Co-authored-by: deitxfge <huhaibo1990@126.com >
2025-12-14 05:24:56 -08:00
yifant-code
5ccf0efa84
[Bugfix] Improve error messages in ModelConfig validation ( #30213 )
...
Signed-off-by: ytian218 <ytian218@bloomberg.net >
Co-authored-by: ytian218 <ytian218@bloomberg.net >
2025-12-14 21:23:37 +08:00
ElizaWszola
994acec0cc
[Bugfix] Fix fusion for VL models ( #30244 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-12-14 21:22:37 +08:00
zifeitong
48b8456ff9
[Bugfix] Revert Qwen2-VL part of change in #28271 ( #30542 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2025-12-14 05:20:08 -08:00
Drew Botwinick
5b64ac21f9
[Bugfix] Update get_processor_data to use get_all method ( #30583 )
...
Signed-off-by: Drew Botwinick <6953152+dbotwinick@users.noreply.github.com >
2025-12-14 21:19:20 +08:00
Bin Bao
a8ec486592
[Misc] Add a script to benchmark compilation time ( #29919 )
...
Signed-off-by: Bin Bao <binbao@meta.com >
2025-12-14 13:02:39 +00:00
tjp_zju
6ecc1e411b
[Bugfix] fix _get_quant_method of FusedMoE for deepseekV3.2 on non-NV… ( #30057 )
...
Signed-off-by: tjp_zju <tanjianpingzju1990@gmail.com >
2025-12-14 02:20:51 -08:00
Shengliang Xu
0bb0bae436
Nvidia ModelOpt workaround for issue 28072 ( #30164 )
...
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2025-12-14 18:18:31 +08:00
Johannes F
060893654d
fix: Update json features supported by xGrammar ( #30390 )
...
Signed-off-by: Johannes Flommersfeld <johannes.flommersfeld@tngtech.com >
Signed-off-by: Johannes F <johannesflommersfeld@users.noreply.github.com >
Co-authored-by: Johannes Flommersfeld <johannes.flommersfeld@tngtech.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-14 02:16:06 -08:00
Matthias Gehre
e9add129ad
[Bugfix] awq_gemm: fix argument order swap ( #30364 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-12-14 18:15:37 +08:00
Ilya Markov
3224ea9915
[torch.compile] Add encoder tag for compilation ( #30489 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2025-12-14 18:15:11 +08:00
Lasha Koroshinadze
3a20450d31
Add AudioFlamingo3 model support ( #30539 )
...
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com >
Signed-off-by: Lasha Koroshinadze <26011196+lashahub@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-14 02:14:55 -08:00
Didier Durand
1a55cfafcb
[Doc]: fixing typos in various files ( #30540 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Didier Durand <2927957+didier-durand@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-12-14 02:14:37 -08:00
drslark
add1b9d3de
[main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring ( #30632 )
...
Signed-off-by: drslark <slarksblood@qq.com >
2025-12-14 01:32:16 -08:00
Cyrus Leung
dcb31196da
[Chore] Remove redundant RequestPrompt ( #30612 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-14 09:22:37 +00:00
Laith Sakka
f569c654e1
enable unbacked with aot_compile ( #30462 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2025-12-14 08:14:06 +00:00
Micah Williamson
97f2f160fd
[ROCm][CI] Add "Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test" Back Into AMD CI ( #30590 )
...
Signed-off-by: David Chen <530634352@qq.com >
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-14 06:56:26 +00:00
Kayvan Mivehnejad
29f7d97715
Improve parse_raw_prompt test cases for invalid input .v2 ( #30512 )
...
Signed-off-by: Kayvan Mivehnejad <K.Mivehnejad@gmail.com >
2025-12-14 11:18:41 +08:00
Qier Li
dc7fb5bebe
[Bug][KVConnector][Metrics] Remove a vacuous assertion breaking external-launcher ( #30577 )
...
Co-authored-by: Qier Li <qier@fb.com >
2025-12-14 01:23:08 +00:00
Qidong Su
24429d5924
[Doc] Add instructions for building docker image on GB300 with CUDA13 ( #30414 )
...
Signed-off-by: Qidong Su <soodoshll@gmail.com >
2025-12-13 21:56:53 +00:00
Wentao Ye
6e78ed6ba7
[Logs] Optimize startup logs 4 ( #29903 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-13 16:12:53 -05:00
Isotr0py
7c16f3fbcc
[Doc] Add documents for multi-node distributed serving with MP backend ( #30509 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-13 18:02:29 +00:00
lif
ddbfbe5278
[Docs] Clarify Expert Parallel behavior for attention and MoE layers ( #30615 )
...
Signed-off-by: majiayu000 <1835304752@qq.com >
2025-12-13 08:37:59 -09:00
Laith Sakka
763963aa73
set assume_32bit_indexing and pass unbacked hints ( #30459 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2025-12-13 15:36:53 +00:00
Cyrus Leung
39cefbdf17
[Refactor] TokenizerRegistry only uses lazy imports ( #30609 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-13 23:16:22 +08:00
Chen Zhang
ace34e3783
[Bugfix] Qwen3-next with --hf-overrides \{\"num_hidden_layers\":8\} ( #30433 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-12-13 22:12:45 +08:00
Isotr0py
e5db3e2774
[CI/Build] Fix broken mm processor test Mistral-3-large ( #30597 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-13 04:43:01 -08:00
Cyrus Leung
64251f48df
[Chore] Adjust tokenizer import to avoid circular imports ( #30601 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-13 04:42:39 -08:00
Nick Hill
1cec5b7ea9
[Scheduler] Simplify stop checking for pooling models ( #30591 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-13 09:45:26 +00:00
Cyrus Leung
b09806e28f
[Bugfix] Dictionary MM embeddings for online chat ( #30507 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-13 15:48:56 +08:00
Tsukasa OI
fdc135d768
[Misc][Quantization] Clarify the intent of GGUF FusedMoE weight materialization ( #30310 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
2025-12-13 13:55:14 +08:00
Roberto L. Castro
4fa7ce46f3
[Feature] Add SM103 (Blackwell Ultra) Support to vLLM ( #30484 )
...
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-12-12 19:34:23 -08:00
Nicolò Lucchesi
57e9bf1864
[CI] Whisper logprobs tests ( #30504 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-13 10:49:11 +08:00
Michael Goin
2f32a68d75
[CI] Update several models in registry that are available online now ( #30514 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-12-12 18:28:13 -08:00
Matthew Bonanni
f5dfbbd8e9
[Docs] Remove references to VLLM_ATTENTION_BACKEND ( #30564 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-13 10:20:15 +08:00
Michael Goin
fc0119425c
Add IBM and Red Hat to compute resources sponsors ( #30581 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-12-13 01:34:23 +00:00
Matthew Bonanni
86a3261525
[Bugfix] Pass FA version in MultiHeadAttention ( #30575 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-13 00:02:11 +00:00
rasmith
08f8a5627e
[CI/Build][Kernel][BugFix][AMD] Fix per_token_group_quant_fp8 to use correct fp8 min/max values and update atol/rtol in test_quantfp8_group_functionality ( #30292 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-12 18:41:56 -05:00
Kevin H. Luu
b4039c08b5
[ci] Mark PrimeRL integration test as soft fail ( #30578 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2025-12-12 14:13:09 -08:00
Wentao Ye
1e6b115300
[Refactor] Reduce duplicate code in per_token_group_quant cuda kernels ( #30496 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-12 16:45:23 -05:00
danielafrimi
13618626df
[MoE-FP8-modelopt] Add FlashInfer alignment padding for intermediate dimensions ( #29748 )
...
Signed-off-by: Daniel Afrimi <dafrimi@pool0-00589.cm.cluster >
Signed-off-by: dafrimi <dafrimi@nvidia.com >
Co-authored-by: Daniel Afrimi <dafrimi@pool0-00589.cm.cluster >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-12-12 20:42:32 +00:00
danielafrimi
6ec0d8dbe4
[Fix]Load kv-cache dtype from hf_quant_config.json automatically ( #29980 )
...
Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com >
2025-12-12 11:27:47 -08:00
Li, Jiang
9693dd0fe3
[CI/Build] Add x86 CPU wheel release pipeline ( #28848 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-12 19:21:35 +00:00
Xin Yang
1f19d8f899
[Perf] Set split_k to 1 for triton_kernels ( #30528 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2025-12-12 14:07:57 -05:00
shivampr
cd7740ac5c
[ROCm] Enable Triton ScaledMM fallback + kernel selection fix ( #26668 )
...
Signed-off-by: Shivam <shivampr.dev@gmail.com >
Signed-off-by: Shivam <shivamprasad91@gmail.com >
2025-12-12 13:28:20 -05:00
Wentao Ye
02a5880394
[CI] Fix mypy for vllm/v1/executor ( #30517 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-12 18:05:34 +00:00
realliujiaxu
d2c919dcc2
[bugfix] fix bug when top_logprobs=0 with spec decoding ( #30059 )
...
Signed-off-by: realliujiaxu <realliujiaxu@163.com >
2025-12-12 09:03:35 -08:00
Benjamin Bartels
f3237f3f6b
[Frontend] Fixes anthropic streaming message_start usage nesting ( #30266 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
2025-12-12 16:28:54 +00:00
jvlunteren
9c0ee995a8
[Kernel] Support CUDA Graphs in 3D Triton Attention Kernel ( #28306 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
Signed-off-by: jvlunteren <161835099+jvlunteren@users.noreply.github.com >
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-12-12 16:55:40 +01:00
Michael Goin
09ad3b76b3
[Bug] Fix attention_backend arg string parsing ( #30534 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-12 08:40:50 -07:00
Christina Norman
dc13c99eed
fix(gguf): Disable bfloat16 for GGUF on blackwell device ( #30408 )
...
Signed-off-by: Christina <truffle@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Christina Norman <christina@example.com >
Co-authored-by: Isotr0py <isotr0py@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-12 23:10:12 +08:00
Vladislav Nosivskoy
3e34adcdfb
[DeepSeek V3.2] Proper drop_thinking logic ( #30490 )
...
Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com >
2025-12-12 15:01:06 +00:00
Lucas Wilkinson
3e41992fec
[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 ( #27532 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-12 05:57:47 -08:00
吴坎
91401c7a26
[Bugfix] Fix CMakeLists Environment Variable ( #21804 )
...
Signed-off-by: wu-kan <github@wu-kan.com >
Signed-off-by: 吴坎 <github@wu-kan.cn >
Signed-off-by: wu-kan <github@wu-kan.cn >
2025-12-12 10:54:52 +00:00
Jaehwang Jung
f90319d5d1
[Bugfix] Schedule failure due to wrong get_image_size_with_most_features ( #29692 )
2025-12-12 02:27:20 -08:00
rasmith
302b2c1eb9
[CI/Build][AMD] Fix ref_dynamic_per_token_quant reference implementation on ROCm. ( #30291 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-12 09:30:23 +00:00
Ben Browning
8f8fda261a
[Bugfix] Multiple fixes for gpt-oss Chat Completion prompting ( #28729 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-12 12:59:53 +08:00
Zhengxu Chen
fe1787107e
[compile] Parse compile range cache keys as Range during cache loading. ( #30516 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-12-12 04:30:51 +00:00
Andreas Karatzas
783644e4ac
[ROCm][CI] Skip multi-GPU speculative decoding tests when insufficient GPUs available ( #30527 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-12 03:54:56 +00:00
Ryan Rock
197473c4e7
[CI/Build] Use spawn subprocess for ROCm ( #30272 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2025-12-12 03:33:17 +00:00
Nick Hill
947dfda9c2
[LMCache] Relax lmcache version requirement ( #30425 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-11 18:18:47 -09:00
Michael Goin
9f2fc16a69
[Bugfix][Model] Fix Afmoe rope_parameters issue ( #30505 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-12 02:53:57 +00:00
Bhanu Prakash Voutharoja
6a6fc41c79
gptq marlin quantization support for fused moe with lora ( #30254 )
...
Signed-off-by: Bhanu068 <voutharoja.bhanu06@gmail.com >
2025-12-12 02:27:22 +00:00
Fadi Arafeh
f355ad5412
[CPU][FIX] Fix build failures on Arm CPUs with torch nightly ( #30481 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-12 02:09:25 +00:00
Lucas Wilkinson
042da73244
[Core] Refactor _build_attention_metadata ( #29628 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-11 17:54:12 -08:00
Andreas Karatzas
b5945d49c0
[ROCm][CI] Use mi325_4 agent pool for V1 e2e tests ( #30526 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-12 01:37:24 +00:00
rasmith
ba80926681
[CI/Build][AMD] Skip test_cutlass_w4a8_moe tests on ROCm since they require cutlass_pack_scale_fp8 ( #30508 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-12 01:02:19 +00:00
jiahanc
0ab23c2b2b
[fix] fix SM check for Flashinfer TRTLLM MOE ( #30314 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2025-12-12 01:00:58 +00:00
rasmith
48661d275f
[CI/Build][AMD] Skip tests in test_fusions_e2e and test_dbo_dp_ep_gsm8k that require non-existing imports for ROCm ( #30417 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-12 00:24:20 +00:00
Ev Lacey
d527cf0b3d
[FIX]Patch run-cluster.sh (fix for #28328 ) ( #30002 )
...
Signed-off-by: elacey <elacey@nvidia.com >
Signed-off-by: Ev Lacey <github@everettlacey.com >
2025-12-11 23:36:31 +00:00
Concurrensee
2cc5affc38
[ROCM][CI] Fix AMD Examples Test Group ( #30276 )
...
Signed-off-by: Yida Wu <yida.wu@amd.com >
Signed-off-by: Yida <yida.wu@amd.com >
2025-12-11 18:03:54 -05:00
Andrew Briand
a00d88973d
[EPLB] Support EPLB w/ NVFP4 ( #29804 )
...
Signed-off-by: Andrew Briand <abriand@nvidia.com >
Co-authored-by: Andrew Briand <abriand@nvidia.com >
2025-12-11 22:59:40 +00:00
Wentao Ye
61249b177d
[Refactor] Remove useless syncwarp ( #30510 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-11 17:43:41 -05:00
Wentao Ye
c817b14151
[Perf] Optimize deepgemm experts initialization, 3.9% TTFT improvement ( #30494 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: li-jinpeng <3332126450@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-12-11 17:28:34 -05:00
ioana ghiban
3efdc3feae
[Docs][CPU backend] Add pre-built Arm CPU Docker images ( #30491 )
...
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com >
2025-12-11 22:03:29 +00:00
Nicolò Lucchesi
0efd9f867c
[Core] Whisper Enable Encoder Batching ( #29421 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-11 21:06:51 +00:00
Xingyu Liu
90d6cf921f
[BugFix][MM]support VLLM_RANDOMIZE_DP_DUMMY_INPUTS ( #30472 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-11 21:00:15 +00:00
Harry Mellor
cf3eacfe58
Standardise get_rope to use rope_parameters["partial_rotary_factor"], not rotary_dim ( #30389 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-11 20:45:23 +00:00
Zhengxu Chen
92fea56fd1
[compile] Stop one-off setting enable_aot_compile and use context manager instead. ( #30503 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-12-11 20:28:03 +00:00
Ye (Charlotte) Qi
e458270a95
[Misc] Add mcp to requirements ( #30474 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-12-11 20:06:09 +00:00
Andreas Karatzas
72aaac5b66
[ROCm][Bugfix] Add MLACommonMetadata to allowed attention types for speculative decoding ( #30430 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-11 19:25:01 +00:00
汪志鹏
0e71eaa644
[Feature] AWQ marlin quantization support for fused moe with lora ( #30442 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
2025-12-11 18:03:32 +00:00
Harry Mellor
8781cd6b88
Add Eagle and Eagle3 support to Transformers modeling backend ( #30340 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-11 17:02:10 +00:00
Julien Denize
aa3c250c48
[IMPROVEMENT] Change MistralReasoningParser behavior ( #30391 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-12-11 17:53:26 +01:00
Shengqi Chen
305b168a9f
[CI] refine more logic when generating and using nightly wheels & indices, add cuda130 build for aarch64, specify correct manylinux version ( #30341 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-12 00:42:30 +08:00
Harry Mellor
93db3256a4
Give pooling examples better names ( #30488 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-11 16:22:58 +00:00
ioana ghiban
17cb540248
[Docs][CPU Backend] Add nightly and per revision pre-built Arm CPU wheels ( #30402 )
...
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-11 15:57:10 +00:00
Harry Mellor
97a042f3bc
Make the httpx logger less annoying when Transformers v5 is installed ( #30480 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-11 15:44:56 +00:00
Cyrus Leung
3a3b06ee70
[Misc] Improve error message for is_multimodal ( #30483 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-11 06:39:51 -08:00
Martin Hickey
f4417f8449
[KVConnector] Add KV events to KV Connectors ( #28309 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2025-12-11 15:30:29 +01:00
Qiu
a11f4a81e0
[Misc][PCP&DCP] relocate PCP feature check ( #30050 )
...
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-11 03:36:18 -08:00
Kenichi Maehashi
853611bb18
Fix typo of endpoint name in CLI args docs ( #30473 )
...
Signed-off-by: Kenichi Maehashi <maehashi@preferred.jp >
2025-12-11 11:07:56 +00:00
Cyrus Leung
d917747c95
[Bugfix] Fix task still being passed in tests/benchmarks ( #30476 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-11 10:33:55 +00:00
wang.yuqi
a5f9fb5960
[Deprecation] Deprecate --convert reward, use --convert embed instead. ( #30463 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-12-11 10:18:25 +00:00
jeremyteboul
4515eb1a0b
[Fix] Update lazy loading of video loader backend ( #30444 )
...
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com >
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com >
2025-12-11 10:14:57 +00:00
Cyrus Leung
13d63b65e0
[Deprecation] Remove missed fallback for embed_input_ids ( #30469 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-11 10:06:36 +00:00
wz1qqx
b4e8b91278
[Fix]fix import error from lmcache ( #30376 )
...
Signed-off-by: wz1qqx <ziqi.wang@novita.ai >
Co-authored-by: wz1qqx <ziqi.wang@novita.ai >
2025-12-11 09:23:52 +00:00
Rei.
6299628d32
[bugfix] fix MiniMaxM2ReasoningParser streaming output not separating reasoning_content. ( #29882 )
...
Signed-off-by: Rei <1477174254@qq.com >
2025-12-11 09:05:08 +00:00
Ming Yang
fba8906930
[perf] Use direct copy (broadcast) instead of cat for k_nope/k_pe in MLA prefill ( #29710 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-11 08:20:45 +00:00
Ning Xie
d02d1043de
fix: enhance human_readable_int function ( #30337 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-12-10 23:30:33 -08:00
Cyrus Leung
979f50efd0
[Deprecation] Remove fallbacks for embed_input_ids and embed_multimodal ( #30458 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-11 06:58:23 +00:00
gh-wf
36c9ce2554
Ensure minimum frames for GLM 4.6V compatibility ( #30285 )
...
Signed-off-by: Wayne Ferguson <wayneferguson@gmail.com >
2025-12-11 05:26:49 +00:00
xyDong0223
1a516557e1
[Doc] Add Baidu Kunlun XPU support ( #30455 )
...
Signed-off-by: xyDong0223 <dongxinyu23@gmail.com >
2025-12-11 04:52:17 +00:00
Wentao Ye
d6464f2679
[Chore] Fix torch precision warning ( #30428 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-11 04:05:56 +00:00
Cyrus Leung
7e24e5d4d6
[Deprecation] Remove deprecated task, seed and MM settings ( #30397 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-10 19:59:39 -08:00
Cyrus Leung
5a87d8b9b1
[Deprecation] Remove deprecated plugin and compilation fields for v0.13 release ( #30396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-10 19:59:35 -08:00
Divakar Verma
d1e1fb4363
[Bugfix] Fix grouped_topk pytorch impl when num_experts can't be grouped properly ( #29439 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-12-10 19:47:18 -08:00
Andreas Karatzas
b51255f369
[ROCm] Fix broken import in platform attention backend dispatching ( #30432 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-11 01:12:58 +00:00
Sage Moore
b4054c8ab4
Revert "[CI] Add Async Eplb nightly CI tests ( #29385 )" ( #30431 )
2025-12-11 00:48:35 +00:00
Xu Song
25221b44bb
Add more docs for regex ( #30106 )
...
Signed-off-by: Xu Song <xusong.vip@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-11 00:12:21 +00:00
shivampr
8580919ac3
[Bugfix] fix confusing OOM errors during v1 init ( #28051 )
...
Signed-off-by: Shivam <shivamprasad91@gmail.com >
Signed-off-by: shivampr <shivampr.dev@gmail.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-12-10 23:17:41 +00:00
Christina Norman
166ac3c94d
fix(shm): Add memory barriers for cross-process shared memory visibility ( #30407 )
...
Signed-off-by: Christina Holland <hey@christinaholland.com >
Signed-off-by: Christina <truffle@gmail.com >
2025-12-10 23:01:19 +00:00
Seiji Eicher
b9e0951f96
[docs] Improve wide-EP performance + benchmarking documentation ( #27933 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-12-10 22:15:54 +00:00
Michael Goin
fcb894222f
[Docs] Update EPLB docs ( #30426 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-10 11:56:51 -09:00
Nick Hill
6ccb7baeb1
[LMCache] Fix breakage due to new LMCache version ( #30216 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-10 11:52:01 -08:00
Po-Han Huang (NVIDIA)
eea41804a4
[bug] Fix "Current vLLM config is not set." warnings when FlashInfer attention is used ( #30241 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-12-10 11:18:51 -08:00
Jialin Ouyang
9f042ba26b
[Perf] Enable environment cache in EngineCore to enable the feature for UniProcExecutor as well ( #29289 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-12-10 14:13:01 -05:00
Cyrus Leung
e72d65b959
[Deprecation] Remove tokenizer setter ( #30400 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-10 19:10:58 +00:00
Will Eaton
a9e4106f28
[P/D] KV Load Failure Recovery/Abort Configuration ( #26813 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
Signed-off-by: Will Eaton <me@wseaton.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-10 11:00:52 -08:00
Anker
e8e8cd73e5
[Bugfix] Fix HunyuanOCR cross-image contamination in batch processing ( #30344 )
...
Signed-off-by: Lennart Borg <lennart.borg@list-ag.de >
Signed-off-by: Anker <20343812+anker-c2@users.noreply.github.com >
2025-12-10 18:09:31 +00:00
Cyrus Leung
253305d5b2
[Chore] Delay recent deprecations ( #30398 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-10 17:48:38 +00:00
Matthew Bonanni
794a7875ee
[Misc] Consistent case for vllm bench serve results ( #30403 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-10 09:44:02 -08:00
Mark McLoughlin
2dcbac9077
[Docs] Generate full list of metrics in user docs ( #30388 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-10 16:09:34 +00:00
Lucas Wilkinson
aacf0abf8b
[BugFix] Fix AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight_scale' ( #30399 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-10 07:59:23 -08:00
Nicolò Lucchesi
c756fb6781
[Core] Whisper enable FULL_DECODE_ONLY CudaGraph ( #30072 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-10 06:14:24 -08:00
Roger Young
d017bceb08
[BugFix] Fix minimax m2 model rotary_dim ( #30384 )
...
Signed-off-by: xuebi <xuebi@minimaxi.com >
Co-authored-by: xuebi <xuebi@minimaxi.com >
2025-12-10 04:58:50 -08:00
Aditya Tewari
cebda2a4af
[CPU] Support for Whisper ( #30062 )
...
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com >
2025-12-10 04:58:42 -08:00
Daniele
53d2420b44
[Bugfix] tpu_model_runner: set vllm config context when calling reset_dynamo_cache() ( #30331 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2025-12-10 04:58:35 -08:00
Chauncey
9db78f34dc
[Bugfix] Fix the issue where DeepSeek v3.2 cannot use structured_output ( #30371 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-10 08:30:16 +00:00
Fadi Arafeh
434ac76a7c
[cpu][ci] Add CPU Attention Tests for Neon Backend ( #30347 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-10 05:37:35 +00:00
Andreas Karatzas
ed7af3178a
[ROCm][CI] Attempt to fix the failures under a subgroup of the e2e test group ( #29358 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2025-12-10 05:33:13 +00:00
Radu Salavat
180345807f
[CMake][Build]: Remove unused ACL CMake env variables ( #30339 )
...
Signed-off-by: Radu Salavat <radu.salavat@arm.com >
2025-12-10 04:27:19 +00:00
Mingliang Li
d007387aa7
[Bugfix] Cache added_vocab to avoid per-token overhead ( #30351 )
...
Signed-off-by: limingliang <limingliang@stepfun.com >
Co-authored-by: limingliang <limingliang@stepfun.com >
2025-12-10 12:05:51 +08:00
Wilson Wu
3bdd426636
Fix typos in comments across multiple files ( #30345 )
...
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-12-09 20:05:28 -08:00
haoyangli-amd
06462392e4
[bugfix][quantization] fix quark qwen3 kv_cache quantization ( #30308 )
...
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com >
2025-12-10 03:24:12 +00:00
Micah Williamson
7d80c73d42
[CI] Reduce Flakiness For test_spec_decode.py::test_suffix_decoding_acceptance ( #30367 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-10 02:35:49 +00:00
rasmith
b75f826fca
[CI/Build][AMD] Skip quantization kernels tests that require CUTLASS or e4m3fn when not supported by platform ( #30020 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-10 02:28:37 +00:00
Andrew Xia
c3487aca34
[responsesAPI][6] Fix multi turn MCP tokenization ( #30230 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-12-10 10:13:13 +08:00
Lucas Wilkinson
abe93bce59
[Attention] Make seq_lens_cpu optional in CommonAttentionMetadata to enable true async spec-decode ( #29624 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
2025-12-09 17:18:10 -08:00
ElizaWszola
2e7035dd8c
[Bugfix] Fix fp8 DeepGemm compilation issues ( #30336 )
2025-12-09 20:17:25 -05:00
PatrykSaffer
4c2e10ea19
[Bugfix] Fix cuda graph sizes when running with speculative decoding ( #30330 )
...
Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com >
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai >
Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com >
2025-12-10 00:47:07 +00:00
dongbo910220
03b5f940fd
[V1][Spec Decode] Optimize Medusa proposer to avoid GPU-CPU sync ( #29723 )
...
Signed-off-by: dongbo910220 <1275604947@qq.com >
2025-12-10 00:15:01 +00:00
Hashem Hashemi
2e7054da06
Improve wvsplitK tile and balance heuristics. ( #29937 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2025-12-09 23:51:32 +00:00
Charlie Fu
3c680f4a17
[Rocm][torch.compile] Adding layernorm + fp8 block quant and silu + fp8 block quant for Aiter ( #25693 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: wuhuikx <hattie.wu@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
2025-12-09 22:39:26 +00:00
Kyle Sayers
fccd532587
[Quantization] FP8 Weight Reloading for Quantized RL Rollout ( #28480 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-12-09 13:54:32 -08:00
bnellnm
00e5cbb967
[MoE][Refactor] Remove most arguments to FusedMoEMethodBase.apply ( #29066 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-12-09 13:48:25 -08:00
rasmith
7618dc973d
[CI/Build] Make test_mha_attn.py run on correct platform only and check for flash_attn_varlen_func in layer.py ( #29145 )
2025-12-09 20:18:17 +00:00
dependabot[bot]
f8dacc66b6
Bump actions/stale from 10.1.0 to 10.1.1 ( #30234 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-09 20:12:14 +00:00
dependabot[bot]
7cab92fd45
Bump actions/checkout from 6.0.0 to 6.0.1 ( #30233 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-09 20:03:16 +00:00
Tsukasa OI
73a484caa1
[Model][Quantization] Fix / Add GGUF support for Qwen2 MoE models ( #30307 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
2025-12-09 19:13:10 +00:00
Lucas Wilkinson
b37bf51e75
[CI/Test] Fix FP8 per-tensor quant test reference scale shape ( #30352 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-09 12:52:20 -06:00
Lucas Wilkinson
95501a70ec
[BugFix] Fix DeepSeek-R1 hang with DP and MTP ( #30119 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-09 18:51:19 +00:00
Benjamin Chislett
e858bfe051
[Cleanup] Refactor profiling env vars into a CLI config ( #29912 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-09 13:29:33 -05:00
Woosuk Kwon
d471b2aff0
[Model Runner V2] Support num NaNs in logits ( #30187 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-12-09 10:00:49 -08:00
Woosuk Kwon
9e6562a3f6
[Model Runner V2] Fix Triton warning on tl.where ( #30355 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-12-09 09:59:54 -08:00
Ilya Markov
0b6a8a304c
[BugFix] Fix non detected failing tests ( #30277 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2025-12-09 17:57:55 +00:00
Alexei-V-Ivanov-AMD
804e3468c0
Update AMD test definitions (2025-12-08) ( #30298 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-12-09 17:31:30 +00:00
Wentao Ye
83319b44c2
[Compile] Fix torch warning TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled ( #29897 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-09 10:40:37 -05:00
Lucas Wilkinson
56037dfa2f
[BugFix] Fix assert batch_descriptor.num_tokens == num_tokens_padded ( #30173 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-09 10:36:12 -05:00
quanliu
5dcd593baf
[Feature] Batch-Invariant Support for FA2 and LoRA ( #30018 )
...
Signed-off-by: quanliu <18646313696@163.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-12-09 10:01:38 -05:00
Julien Denize
5c213d2899
[BUGFIX] Mistral tool call parser v11+ ( #30332 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2025-12-09 14:55:38 +00:00
vllmellm
ee14644ba9
[ROCm] Aiter Quant Kernels ( #25552 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-12-09 14:27:37 +00:00
Dongjie Zou
1166c31cc7
[Bugfix]: Fix glm46 awq marlin moe wna16 compatibility ( #30210 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2025-12-09 12:20:21 +00:00
haoyangli-amd
03416eada6
[bugfix][quantization] Fix fp8 per_tensor scale shape ( #30257 )
...
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com >
2025-12-09 19:28:50 +08:00
Hubert de La Jonquiere
c72ea10723
[Structured Output][Reasoning] Improves decoding throughput for models using single-token reasoning endings. ( #30056 )
2025-12-09 18:54:08 +08:00
Jaya Yuan
67475a6e81
[DCP][Bugfix][CI] Fix accuracy issue of DCP when using FLASH_ATTN_MLA ( #30309 )
...
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com >
2025-12-09 08:22:14 +00:00
wang.yuqi
9c32df6101
[Bugfix] Qwen 3 VL Embedding loading ( #30303 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-09 08:04:02 +00:00
Micah Williamson
aeb82b1930
[CI] Fix Flaky test_eagle_max_len Test ( #30306 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-09 07:33:34 +00:00
Lucas Wilkinson
aed846917f
[Attention] Make split_decodes_and_prefills(..., require_uniform=True) support padding ( #29644 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
2025-12-09 07:24:01 +00:00
Yongtao Huang
e4605d225e
[Misc] Fix safetensors import for safe_open ( #30300 )
...
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com >
2025-12-09 06:50:06 +00:00
Tsukasa OI
58d5b3f514
[Model][Quantization] Restore MoE + GGUF models support (incl. Qwen3 MoE) by allowing Sideload Parameters ( #30116 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-09 05:30:05 +00:00
Fanli Lin
c2e1987a6e
[Doc] update Intel GPU MM status in Feature x Hardware matrix ( #30294 )
...
Signed-off-by: Lin, Fanli <fanli.lin@intel.com >
2025-12-09 05:16:44 +00:00
Fadi Arafeh
e130845984
[CPU][CI] Enable fused MoE tests in Arm CI ( #30132 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-09 04:55:39 +00:00
liangel-02
4b03b50211
update torchao safetensors impl ( #30155 )
...
Signed-off-by: Angel Li <liangel@meta.com >
2025-12-09 12:46:35 +08:00
Or Ozeri
4c6fd25880
kv_transfer: Rename the shared storage connectors ( #30201 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-12-08 20:46:09 -08:00
Michael Goin
03b91f7262
[Bugfix] Fix compressed-tensors models failing to load with transformers backend ( #30287 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-08 20:44:28 -08:00
czhu-cohere
f6227c22ab
[Kernel]Support W4A8 Grouped GEMM on Hopper ( #29691 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-12-08 19:29:06 -08:00
gnovack
ea657f2078
Lora MoE Align Improvements ( #29257 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2025-12-09 10:35:16 +08:00
Kevin H. Luu
db14f61f2d
[ci] Refactor CI file structure ( #29343 )
2025-12-08 17:25:43 -09:00
Micah Williamson
78c7503364
[ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI ( #29420 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-09 02:14:02 +00:00
Christina Norman
e41312a2f5
[Bugfix] Skip generation config fallback for GGUF to prevent multi-process hang ( #30209 )
...
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-09 01:52:43 +00:00
Yanan Cao
7b35011ad1
Mark qwen2_5_vl as xfail ( #30283 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2025-12-09 01:14:10 +00:00
Zhewen Li
ae339b1a67
[Bugfix] Fix DeepGEMM after #29546 ( #30267 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
Signed-off-by: Zhewen Li <zhewenli@meta.com >
2025-12-09 01:05:27 +00:00
Wentao Ye
0ee6416f67
[Perf] Optimize group_topk kernel, 1.9% Throughput improvement, 2.1% TPOT improvement ( #30159 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-08 19:44:01 -05:00
Wentao Ye
d9417096d1
[Feature] Batch invariant: Enable TRITON_MLA without prefix-caching ( #29125 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-08 19:31:57 -05:00
Ming Yang
9d6235ca9a
[moe] Allow disabling DP chunking ( #29936 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-09 00:29:36 +00:00
Victor Ziliang Peng
f1599ca55d
feat(metrics): Add prefill KV compute metric excluding cached tokens ( #30189 )
...
Signed-off-by: Ziliang Peng <ziliang@character.ai >
2025-12-09 00:08:48 +00:00
Ming Yang
60d17251c9
[Disagg] Support large batch size in proxy server and update NixlConnector doc for DP ( #28782 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-09 00:01:08 +00:00
Lain
1fb632fdb6
[Perf] Improve fp8 quant in mla; replace ReduceSum with ReduceScatterSum ( #29795 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
2025-12-08 15:02:34 -08:00
Charlie Fu
6af70e11a0
[ROCm][CI] Fix test_max_len.py for Rocm ( #29916 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com >
2025-12-08 16:58:30 -05:00
roikoren755
ae0f69b16a
Add SpecDec support to selective_state_update ( #29488 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2025-12-08 16:45:18 -05:00
Dmitry Tokarev
799804d140
Bump nvshmem to 3.3.24 and fix CUDA 13 installation ( #30149 )
...
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-08 20:24:34 +00:00
Vasiliy Kuznetsov
0d402d2600
online fp8 quant with streaming weight post-processing ( #29196 )
...
Signed-off-by: vasiliy <vasiliy@fb.com >
2025-12-08 20:15:10 +00:00
Johnny Yang
d1b5e7afbf
[TPU] Bump tpu-inference to 0.12.0 ( #30221 )
...
Signed-off-by: Johnny Yang <johnnyyang@google.com >
2025-12-08 20:10:10 +00:00
shaharmor98
fcd5306f65
Add latent MoE support ( #30203 )
...
Signed-off-by: Shahar Mor <smor@nvidia.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-12-08 17:35:01 +00:00
weiguihua2
398a596ed2
[MP executor] fix get device count for multi node of mp executor feature ( #30042 )
...
Signed-off-by: weiguihua2 <weiguihua2@huawei.com >
2025-12-09 01:33:48 +08:00
Jee Jee Li
67312cad11
[Misc] Split the LoRA code ( #30253 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-09 00:59:31 +08:00
Laith Sakka
87aee9ed2b
Add evaluate_guards option to DynamicShapesConfig ( #27432 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2025-12-08 10:46:15 -05:00
Daniel Cámpora
184076c3fe
[DeepSeek v3.2] Make top-k work for any logit values. ( #27568 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-08 06:55:58 -08:00
Ye (Charlotte) Qi
eb1051fb95
[ROCm] Guard group quant RMS norm fusion patterns ( #30239 )
2025-12-08 14:44:48 +00:00
Jee Jee Li
80433e225e
[LoRA] Reduce the loading time of MoE LoRA ( #30243 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-08 13:29:47 +00:00
Harry Mellor
5c2433a6f3
Add tip for mypy and markdownlint to the pre-commit comment ( #30259 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-08 13:11:51 +00:00
Simon Mo
77072e93b3
[docs] governance documents ( #24801 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-12-08 12:06:20 +00:00
wang.yuqi
2e660c2434
[Frontend] Binary embedding response does not return metadata by setting encoding_format to bytes_only. ( #30249 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-08 12:01:21 +00:00
Shiming Zhang
408cf42f67
[CI] Prevents triggering of an inactive issue/PR check for forked repository. ( #29654 )
...
Signed-off-by: Shiming Zhang <wzshiming@hotmail.com >
2025-12-08 10:29:14 +00:00
wang.yuqi
9e77ffca3f
[Model][7/N] Improve all pooling task | Deprecation as_reward_model. Extract hidden states prefer using new multi-vector retrieval API ( #26686 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-12-08 08:10:09 +00:00
Dazhi Jiang
bcb6f5947f
[Perf] Remove sync point in vit torch sdpa attn backend ( #30232 )
...
Signed-off-by: Dazhi Jiang <dazhi_jiang@163.com >
2025-12-08 07:12:42 +00:00
Zhiyu
cd00c443d2
[Misc] Rename TensorRT Model Optimizer to Model Optimizer ( #30091 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-12-08 07:05:27 +00:00
Jiangyun Zhu
d143271234
[Bugfix] fix fuse_allreduce_rms when tp =1 ( #30178 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-12-08 06:43:47 +00:00
Zhiwei
c6df05ebb4
[ROCm] [Fused Moe EP] Use binary expert mask for aiter fused moe kernel ( #29773 )
...
Signed-off-by: ZhiweiYan-96 <zhiwei.yan@amd.com >
2025-12-08 05:23:46 +00:00
Nick Hill
d726a7b0ed
[BugFix] Unblock use of LoRA with data parallel mode ( #30220 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-08 12:21:05 +08:00
Zhijian Jiang
344b50d525
Address comment to mergify.yml in #30117 ( #30219 )
...
Signed-off-by: Zhijian Jiang <Zhijian.Jiang@outlook.com >
2025-12-08 11:26:25 +08:00
Andrew Xia
735284ed86
[responsesAPI][7] Browser, Container MCP tools for non harmony models ( #29989 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-08 10:04:03 +08:00
daniel-salib
444f0e3f33
[Frontend] Add MCP type support infrastructure to Responses API ( #30054 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2025-12-08 10:02:52 +08:00
ElizaWszola
af0444bf40
[Performance] Fused blockwise quant RMS norm ( #27883 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
2025-12-07 16:38:04 +00:00
Lucas Wilkinson
0044c4038c
[BugFix][DeepSeek-V3.2] Fix backend selection logic for Blackwell ( #30195 )
2025-12-07 10:53:51 -05:00
Isotr0py
b952f4d3c3
[v1] Add PrefixLM support to FlexAttention backend ( #27938 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-07 15:51:36 +00:00
Wentao Ye
541a2ef892
[Perf] Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. ( #29546 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-07 20:31:14 +08:00
Jee Jee Li
b0f4866a77
[CI/Build]Temporary workaround for test_default_mm_loras timeout ( #30202 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-07 20:27:11 +08:00
Jinzhen Lin
879ddb09c3
[Kernel][MoE] optimize moe_align_block_size ( #29642 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-07 01:58:47 -08:00
Yifan Qiao
1b0482b9d1
[Misc][Core] Remove unused req_index increment in scheduler ( #30176 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2025-12-07 08:39:21 +00:00
Cyrus Leung
e83b7e379c
Revert "[Renderer] Separate out RendererConfig from ModelConfig ( #30145 )" ( #30199 )
2025-12-07 00:00:22 -08:00
Cyrus Leung
27f4c2fd46
[Renderer] Separate out RendererConfig from ModelConfig ( #30145 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-06 23:15:42 -08:00
Luke
a49d813fa8
Lazy loading to avoid importing all files ( #29716 )
...
Signed-off-by: Luke <yq0536@gmail.com >
2025-12-07 07:13:14 +00:00
Wentao Ye
17eb25e327
[Perf] Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement ( #29558 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-07 04:44:50 +00:00
jeremyteboul
dce6d229f7
Support multiple image/audio embeddings per requests ( #29988 )
...
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com >
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com >
2025-12-07 04:34:24 +00:00
Yanan Cao
cbedb703cc
[Frontend] Remove confusing -O.xx flag error ( #30169 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2025-12-07 02:53:42 +00:00
AuruTus
8d3da4c79d
[MISC]: change NIXL compatibility hash logging level to debug ( #30182 )
2025-12-07 00:21:03 +00:00
Andrew Xia
421125d03a
[ez] move harmony utils to parser folder ( #30117 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-12-06 17:34:34 -05:00
Cyrus Leung
671427efbf
[Model] Move multimodal_cpu_fields definition to field config ( #30181 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-06 13:40:02 +00:00
Viacheslav
21bb323542
Gigachat 3 tool parser and tests ( #29905 )
...
Signed-off-by: Viacheslav Barinov <viacheslav.teh@gmail.com >
2025-12-06 12:04:14 +00:00
Chukwuma Nwaugha
17a9abec2b
simplify requires_files list creation ( #29656 )
...
Signed-off-by: Chukwuma Nwaugha <nwaughac@gmail.com >
2025-12-06 09:42:41 +00:00
Ye (Charlotte) Qi
92c35abb24
[Misc] Fix circular import in vllm.transformers_utils.config ( #30179 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-12-06 09:24:03 +00:00
Yu Jiaqi
43e7593031
Support tokenization_kwargs override ( #29794 )
...
Signed-off-by: piood <2477084691@qq.com >
2025-12-06 09:12:53 +00:00
Cyrus Leung
c46b932df2
[Chore] Deprecate SupportsMultiModal.merge_by_field_config ( #30170 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-06 07:57:28 +00:00
redwrasse
6476382384
prefix caching design doc sha256 now default ( #29261 )
...
Signed-off-by: redwrasse <mail@redwrasse.io >
2025-12-06 07:39:56 +00:00
kx
d6aeaddf4a
[bugfix] fix type[AttentionBackend] bug in kv_connector_base_v1 ( #30051 )
...
Signed-off-by: 01267596 <xiongkai123@cmbchina.com >
Co-authored-by: 01267596 <xiongkai123@cmbchina.com >
2025-12-06 07:11:31 +00:00
Woosuk Kwon
a238cbd89d
[Model Runner V2] Support min-p sampling ( #30171 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-12-05 21:42:47 -08:00
Nick Hill
4026ae31e9
[Misc] Move disable_nccl_for_dp_synchronization init logic into VllmConfig ( #30161 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-05 20:59:04 -08:00
rasmith
b12f4a9830
[CI/Build][AMD] Use ROCM_ATTN instead of FLASH_ATTN test for test_register_kv_caches for ROCm and update test for TRITON_ATTN ( #29985 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-12-05 20:57:38 -08:00
Rohan Potdar
40a046cd82
[Bugfix]: Fix TokenizerLike interface ( #30009 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2025-12-05 20:56:40 -08:00
Peter Salas
e858bc4d14
[Model] Add support for transformer-based Ultravox v0.7 projector ( #30089 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2025-12-05 20:55:43 -08:00
Dongjie Zou
e3fbb6f152
fix#30092 Kimi-Linear model loading failure with missing indexer_rotary_emb ( #30093 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2025-12-05 20:55:09 -08:00
yuttian1
c4d62618ca
Fix AWQ MoE marlin check issue in marlin_utils.py for AMD backend ( #30102 )
...
Signed-off-by: yuttian1 <yuttian@amd.com >
2025-12-05 20:54:38 -08:00
rasmith
62079d8600
[CI/Build][AMD] Skip marlin, machete, and hadacore tests since these require _C functions not defined for ROCm ( #30109 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-06 12:54:17 +08:00
Harry Mellor
bf4a901af9
Better error when world size is larger than node and distributed_executor_backend is not set ( #30140 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-05 20:53:52 -08:00
Samuel Shen
7e31c3a3f6
[CI]: Remove unnecessary imports from test_lmcache_integration ( #30157 )
...
Signed-off-by: Samuel Shen <slshen@uchicago.edu >
Co-authored-by: Samuel Shen <slshen@uchicago.edu >
2025-12-06 12:53:34 +08:00
rasmith
dc839ad03d
[CI/Build][AMD][Quantization] Fix test_int8_kernel.py by updating int8_utils to use hip.libdevice.round ( #30151 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-05 20:52:11 -08:00
Deboleina
02a4169193
[Tests] Tool call tests for openai/gpt-oss-20b ( #26237 )
...
Signed-off-by: Debolina Roy <debroy@redhat.com >
2025-12-05 19:03:29 -08:00
Wentao Ye
7b5575fa7d
[Bug] Fix vLLM config is not set error ( #29999 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-05 16:42:12 -05:00
Bangsheng Tang
77e4472809
let draft model follow target model's config_format ( #30152 )
2025-12-05 13:33:42 -08:00
Divakar Verma
962d703818
[Bugfix][llama4_eagle] Fix missing 'lm_head' attribute ( #29926 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-12-05 19:57:26 +00:00
Nicolò Lucchesi
e23ca3a0e8
[CI] Re-use whisper_client for all tests ( #30148 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-05 19:47:37 +00:00
Russell Bryant
3633035a3f
[Misc] Rename CohereForAI references to CohereLabs ( #30147 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-12-05 19:41:40 +00:00
Nicolò Lucchesi
bff78310d9
[Enc-Dec] Fix OOT tokenizer issue ( #30144 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-05 19:23:33 +00:00
Tova Movshovitz
adb315060c
[KVConnector][Feature] Support KV connector cache reset via /reset_prefix_cache ( #27170 )
...
Signed-off-by: tovam <tovam@pliops.com >
Signed-off-by: Tova Movshovitz <tovam@pliops.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-05 18:33:26 +00:00
Ilya Markov
4e26d3b09e
[Compile] Conditional compilation. Introduce compile_ranges ( #24252 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Luka Govedič <luka.govedic@gmail.com >
Signed-off-by: ProExpertProg <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Luka Govedič <luka.govedic@gmail.com >
2025-12-05 18:17:32 +00:00
Matthew Bonanni
66e674cdd5
[Attention][UX][1/N] Add AttentionConfig and change attention env vars to CLI arguments ( #26315 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2025-12-05 09:48:43 -08:00
Mark McLoughlin
dff0a2b394
[NIXL] Add remote_request_id to kv_transfer_params ( #29665 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-12-05 09:43:48 -08:00
Nick Hill
dc264bcea1
[BugFix] Eagerly abort cancelled final-step requests ( #29987 )
...
Currently, when requests are cancelled while executing their final
step, "completion" is handled based on normal stop processing (e.g.
length or stop token), so the abort has no effect. This is typically
not a problem, but when a kv connector is involved it thinks the
request completed successfully rather than being aborted.
This is problematic for disaggregated prefill which will free kv
cache blocks if the request was aborted but not if it completed
successfully—since the cancelled request will never be sent to
the decode side, kv cache blocks remain pinned until the fall-back
timeout expires. The problem is exacerbated when many requests
are cancelled and/or there are large prefills whose forward pass
takes a long time (since the window is bigger).
This PR fixes the problem by processing pending aborts
immediately prior to processing model output each step; we process
only aborts, not new requests, since it's preferable for latency to
process model outputs before new incoming requests.
Fixes #26400 .
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-05 17:28:32 +00:00
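Note on #29987: the ordering described in that commit message (drain pending aborts first, then process model output, and only then admit new requests) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not vLLM's actual implementation: `ToyEngineCore`, its fields, and the `step` signature are all hypothetical names invented for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyEngineCore:
    """Hypothetical stand-in for an engine loop; not vLLM's real classes."""
    running: dict = field(default_factory=dict)  # request_id -> tokens so far
    pending_aborts: deque = field(default_factory=deque)
    pending_new: deque = field(default_factory=deque)
    events: list = field(default_factory=list)

    def _drain_aborts(self):
        # Handle only abort messages; new requests stay queued.
        while self.pending_aborts:
            req_id = self.pending_aborts.popleft()
            if self.running.pop(req_id, None) is not None:
                self.events.append((req_id, "aborted"))

    def step(self, model_output, max_tokens):
        # 1. Process aborts that arrived while the forward pass was running.
        self._drain_aborts()
        # 2. Process model output for the requests that are still alive.
        for req_id, token in model_output.items():
            if req_id not in self.running:
                continue  # aborted above, so it is never marked finished
            self.running[req_id].append(token)
            if len(self.running[req_id]) >= max_tokens:
                del self.running[req_id]
                self.events.append((req_id, "finished"))
        # 3. Only now admit newly arrived requests.
        while self.pending_new:
            self.running[self.pending_new.popleft()] = []


core = ToyEngineCore(running={"a": [1], "b": [1]})
core.pending_aborts.append("a")  # "a" is cancelled during its final step
core.pending_new.append("c")
core.step(model_output={"a": 2, "b": 2}, max_tokens=2)
print(core.events)
```

In this toy model, request "a" would have reached its length limit in the same step, but because the abort is drained first it is recorded as aborted rather than finished, which is the distinction a KV connector would need in order to free blocks promptly.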
Nicolò Lucchesi
78c44fd722
[NIXL] Small cleanup of unused variables ( #29618 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-05 18:17:36 +01:00
Angela Yi
e7296b08da
[bugfix] Pass globals to aot_compiled function ( #29428 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2025-12-05 16:54:26 +00:00
Andrew Xia
da7bc54ea8
[responsesAPI][5] ResponsesParser with tools for full MCP python loop ( #29798 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Signed-off-by: Andrew Xia <axia@meta.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-12-05 11:11:50 -05:00
Mark McLoughlin
949a6a19d2
[NIXL] Add compatibility checking to NIXL KV connector handshake ( #29503 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-12-05 15:52:45 +01:00
Alec S
2c174420f5
Reduce validation to a warning ( #28749 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-05 14:02:49 +00:00
Yi Liu
0d8a7d8a26
[Compressed Tensors] Add XPU wNa16 support ( #29484 )
...
Signed-off-by: yiliu30 <yi4.liu@intel.com >
2025-12-05 22:02:09 +08:00
Elham
9843e332da
[CPU][Perf] Add fast vectorized exp impl from Arm Optimized Routines ( #30068 )
...
Signed-off-by: Ubuntu <ubuntu@ip-10-252-30-150.eu-west-1.compute.internal >
Signed-off-by: Elham Harirpoush <elham.harirpoush@arm.com >
Co-authored-by: Ubuntu <ubuntu@ip-10-252-30-150.eu-west-1.compute.internal >
2025-12-05 13:09:20 +00:00
Harry Mellor
b7d85cf25c
[CI] Have pre-commit comment on a PR if pre-commit was not used ( #30077 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-05 13:03:45 +00:00
Max Hu
c2894d3883
[Feature] Add Layer-wise NVTX Support ( #29990 )
...
Signed-off-by: Max Hu <hyoung2991@gmail.com >
Signed-off-by: Max Hu <maxhu@nvidia.com >
Co-authored-by: Max Hu <maxhu@nvidia.com >
2025-12-05 11:20:07 +00:00
Zhiwei
3628bcaaf2
[ROCm][MXFP4] Infer w4a4 quant method in rocm aiter fused moe ( #29775 )
...
Signed-off-by: ZhiweiYan-96 <zhiwei.yan@amd.com >
2025-12-05 11:01:16 +00:00
strinczer
b73b158ab0
[Bugfix] Fix parse_output_message crash on commentary with no recipient ( #29972 )
...
Signed-off-by: Shai Trinczer <strinczer@icloud.com >
Signed-off-by: strinczer <strinczer@icloud.com >
2025-12-05 10:51:12 +00:00
Ning Xie
7ae13c66ba
[typing] fix type ( #29964 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-12-05 10:46:08 +00:00
Ming Yang
f16356fe36
[bench] Support common prefix len config (for decode-only bench) ( #29934 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-05 10:26:52 +00:00
Alec S
65ee97288a
[BugFix] Adding env variable to disable async grammar compilation ( #29996 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Signed-off-by: Alec S <10566873+alecsolder@users.noreply.github.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-12-05 00:49:37 -08:00
Yanan Cao
62b3333448
[Frontend] Remove deprecated -O.xx flag ( #29991 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2025-12-05 00:47:22 -08:00
rasmith
feecba09af
[CI/Build][AMD] Use float16 in test_reset_prefix_cache_e2e to avoid accuracy issues ( #29997 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-05 08:42:25 +00:00
amitz-nv
6038b1b04b
[Frontend][Model] Add 'float16' to possible mamba cache dtype values, override mamba SSM cache dtype value for NemotronH ( #29978 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2025-12-05 00:34:33 -08:00
Tiger Xu / Zhonghu Xu
60a66ea2dc
[DOC]: Add kthena to integrations ( #29931 )
...
Signed-off-by: Zhonghu Xu <xuzhonghu@huawei.com >
2025-12-05 08:11:03 +00:00
Micah Williamson
06579f9a82
[AMD][CI] Add ray[default] Dependency On ROCm To Pass v1/metrics/test_engine_logger_apis.py ( #30110 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-05 06:48:23 +00:00
Chukwuma Nwaugha
6e865b6a83
Refactor example prompts fixture ( #29854 )
...
Signed-off-by: nwaughac@gmail.com
2025-12-05 06:44:32 +00:00
Jingchun Gao
d698bb382d
[Bugfix] Correct num_q_heads on DCP for Flashinfer backends ( #29487 )
...
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com >
Signed-off-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com >
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com >
2025-12-05 05:54:31 +00:00
Charlie Fu
2c22c4ca2d
[ROCm][CI] Increase the memory threshold for test_deep_sleep_fp8_kvcache ( #30104 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-12-05 04:51:44 +00:00
Laith Sakka
5867819eaf
Do not guard during noop elimination pass ( #30095 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2025-12-05 04:10:12 +00:00
Charlie Fu
7c9b2c8f81
[ROCm][CI] Add jiwer dependency for testing ( #30081 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-12-05 03:34:51 +00:00
Qiu
0098a6e3da
[PCP&DCP] move CUDAGraph check for PCP&DCP to the check func of platforms ( #29952 )
...
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-04 21:40:51 -05:00
Hubert de La Jonquiere
befb59e5b1
[Model] Add Holo2 reasoning parser ( #30048 )
...
Signed-off-by: hdlj-h <hubert@hcompany.ai >
2025-12-05 10:38:45 +08:00
Shengqi Chen
aaddc9c82a
[CI] fix silent error in nightly wheel index generation script, add generation time to HTML index ( #30060 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-05 00:48:59 +00:00
Zhewen Li
263c38d74d
[CI/Build] Update batch invariant test trigger ( #30080 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-12-05 00:42:37 +00:00
Zhewen Li
bcf43ab1f3
[CI/Build][AMD] Add Llama4 Maverick FP8 to AMD CI ( #28695 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-12-04 16:07:20 -08:00
Alexander Matveev
4470ee2f90
[Perf] Enable separate shared_experts stream only for CUDA ( #30085 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-12-05 00:03:17 +00:00
TimWang
690cc3ef20
docs: update metrics design doc to use new vllm:kv_cache_usage_perc ( #30041 )
...
Signed-off-by: Tim <tim.wang03@sap.com >
2025-12-04 23:37:14 +00:00
Laith Sakka
1f0d184590
[aot_compile]change VLLM backend to read fake args from example_value ( #29104 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2025-12-04 17:33:45 -05:00
Lucas Wilkinson
c8ab988b15
[BugFix] Fix DBO assert assert B_block_table == B_q ( #29933 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-04 14:48:54 -05:00
Peng-YM
48a5fff66e
[Bugfix] Missing tokens in return_token_ids when tool parsers is enabled in streaming mode ( #29074 )
...
Signed-off-by: Peng-YM <1048217874pengym@gmail.com >
2025-12-04 19:09:39 +00:00
Mercykid-bash
1119f6e47a
Abstract eplb algo ( #26471 )
...
Signed-off-by: Che Ruan <cr623@ic.ac.uk >
Signed-off-by: mengxingkongzhouhan <117415539+mengxingkongzhouhan@users.noreply.github.com >
Signed-off-by: Mercykid-bash <ruanche0218@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Che Ruan <cr623@ic.ac.uk >
Co-authored-by: mengxingkongzhouhan <117415539+mengxingkongzhouhan@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-04 19:09:09 +00:00
Harry Mellor
e10c84e06a
Access partial_rotary_factor from rope_parameters ( #29966 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-04 18:42:49 +00:00
Kuntai Du
ece2825a29
[KVConnector] Remove v0-related kv connector components such as kv pipe and kv lookup buffer ( #29705 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-12-04 18:20:48 +00:00
Jee Jee Li
652ba93da3
[Bugfix] Fix FP8 MoE LoRA ( #29890 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-04 18:17:49 +00:00
Tao Yun
6dcb07f676
support qwen3-vl handle requests with embeddings ( #30037 )
...
Signed-off-by: taoyun <1069423820@qq.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-12-04 17:34:06 +00:00
Qiu
46cbbca05c
[CI][DCP][Perf] reduce DCP CI execution time ( #29858 )
...
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com >
2025-12-04 17:28:21 +00:00
Cyrus Leung
b286a311c2
[Chore] Deprecate merge_by_field_config arg ( #30035 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-04 17:21:24 +00:00
Shengqi Chen
990f806473
[Doc] clarify nightly builds in developer docs ( #30019 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-05 00:28:37 +08:00
Doug Smith
5b4b42c0b6
Mark DBO test as flaky on b200 for Distributed B200 test ( #29913 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-12-04 10:38:03 -05:00
Woosuk Kwon
cc050558f4
[Model Runner V2] Implement get_num_sampled_and_rejected kernel ( #30029 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-12-04 07:19:42 -08:00
Harry Mellor
5c32a06a04
Use Transformers v5 RoPE standardisation and validation ( #30046 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-04 14:54:28 +00:00
Yongtao Huang
dd97e047e0
Fix broken multiline assert in LoRAModelManager.register_module ( #30032 )
...
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com >
2025-12-04 22:04:42 +08:00
Harry Mellor
9998ea5b57
Delete HF version of Phi 4 MM ( #30049 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-04 13:44:50 +00:00
wang.yuqi
74c4d80c6c
[Model][6/N] Improve all pooling task | Support chunked prefill with ALL pooling ( #27145 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-04 13:44:15 +00:00
Kevin H. Luu
1b7c7f5159
[release] install regex ( #30008 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-04 03:18:29 -08:00
Chauncey
6796ce8bdb
[Bugfix] Fix the issue with interleaved thinking when using streaming ( #30033 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-12-04 11:11:59 +00:00
Andreas Karatzas
e96a6a6dca
[ROCm][CI][Bugfix] Fixing the Multi-Modal Models Test (Extended) 1 group ( #30013 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-04 11:00:16 +00:00
Noa Neria
6366c098d7
Validating Runai Model Streamer Integration with S3 Object Storage ( #29320 )
...
Signed-off-by: Noa Neria <noa@run.ai >
2025-12-04 18:04:43 +08:00
dtc
842aba501d
[P/D] Introduce Mooncake Transfer Engine as kv_connector ( #24718 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
Signed-off-by: dtc <dtcccc@linux.alibaba.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-12-04 09:51:36 +00:00
rasmith
f2f4cea6cc
[CI/Build][AMD] Skip test on test_hybrid_attention_mamba_tensor_shapes on ROCm, requires FLASHINFER ( #29995 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-04 09:30:22 +00:00
Arpit Khandelwal
dfdda96747
[Core] Remove forced None assignment for deprecated PassConfig flags ( #29994 )
...
Signed-off-by: arpitkh101 <arpit5khandelwal@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-04 09:15:04 +00:00
Xu Wenqing
ffdd18111b
Add DeepSeek-V3.2 tool parser. ( #29848 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-12-04 08:46:34 +00:00
Ye (Charlotte) Qi
b8a6ae4158
[ROCm] add fallback for aiter fp8 decode mla ( #30005 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-12-04 08:45:57 +00:00
Mark McLoughlin
899e2ef558
[Core] Fix standalone runs of test_reset_prefix_cache_e2e ( #29899 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-12-04 16:22:03 +08:00
Cyrus Leung
68eb5c8d97
[Misc] Move functions into PoolingMetadata ( #30027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-04 08:21:19 +00:00
Micah Williamson
5430e110c0
[CI][AMD] Match Main CI Behavior By Skipping test_eplb_spec_decode In AMD CI ( #30006 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-04 16:20:54 +08:00
TJian
3f1b03739a
[ROCm] [Bugfix] compute_attn_mask_seqlen for qwen3 omni ( #29974 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-04 08:20:24 +00:00
Charlie Fu
9aa33a74b0
[Rocm][CI] Fix test_speculator_eagle3 by skipping the CompressedTensorw4a16 Model ( #30001 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
2025-12-04 07:52:28 +00:00
CYJiang
fd68e909db
[docs] Remove _total from counter metrics names ( #30028 )
...
In Prometheus, counters always expose their numeric value under a metric name that ends in _total. We should document the base name, as this is what appears in the get_metrics() API.
Signed-off-by: CYJiang <86391540+googs1025@users.noreply.github.com >
2025-12-04 07:46:15 +00:00
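Note on #30028: the naming rule in that commit message can be shown with a small helper that maps an exposed counter sample name to the base name that would be documented. This is only a sketch: `counter_base_name` and the `example:` metric names are hypothetical and are not taken from vLLM.

```python
def counter_base_name(sample_name: str) -> str:
    """Strip the Prometheus counter suffix to get the documented base name."""
    # Counter samples are exposed with a "_total" suffix; other metric
    # names are returned unchanged.
    return sample_name.removesuffix("_total")


print(counter_base_name("example:requests_total"))
print(counter_base_name("example:queue_depth"))
```

`str.removesuffix` requires Python 3.9 or later.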
daniel-salib
404fc4bfc0
[Frontend] refactor harmony utils output message parsing ( #29820 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2025-12-04 15:36:57 +08:00
Chauncey
82a64b3d8f
[Bugfix] fixed deepseekv32 tool calling error ( #30025 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-04 15:12:12 +08:00
Cyrus Leung
9ae2f60374
[Misc] Various cleanups for MM input processing ( #29970 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-04 06:22:20 +00:00
Jianwei Mao
80f8af4b2f
Fix error while downloading dependencies for CPU backend ( #29797 )
...
Signed-off-by: Jianwei Mao <maojianwei2016@126.com >
2025-12-04 06:04:44 +00:00
Kuntai Du
8aaa81b35f
[KVConnector] remove unused code (the model aware kv ops class) ( #29709 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-12-04 06:00:52 +00:00
Benjamin Bartels
fca3f46658
[Frontend] Fixes anthropic /v1/messages streaming not containing input_tokens on first chunk ( #29971 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
2025-12-04 05:50:27 +00:00
gausah01
28097d5638
[Bugfix][CPU] Fix CPU KV cache fallback memory allocation ( #29604 )
...
Signed-off-by: Gauri Sahnan <gauri.sahnan@arm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-12-04 13:01:15 +08:00
Jee Jee Li
dd38ba3a26
[Bugfix] Fix adapter_enabled IMA ( #29977 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-04 12:51:15 +08:00
Li Wang
5f91cdda75
[Misc] Add docker build env for Ascend NPU ( #30015 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-12-03 19:53:00 -08:00
Iceber Gu
33a3d6c798
fix LoRA-related examples ( #29956 )
...
Signed-off-by: Iceber Gu <caiwei95@hotmail.com >
2025-12-04 11:48:30 +08:00
Zhewen Li
c493b9d092
[CI/Build] Add MM code path to Examples Test ( #29986 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-12-03 19:21:45 -08:00
Xieyang Xu
ad32e3e19c
enable multi-node in external launcher mode ( #29833 )
2025-12-03 17:02:02 -08:00
Shengqi Chen
1109f98288
[CI] fix docker image build by specifying merge-base commit id when downloading pre-compiled wheels ( #29930 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-03 14:08:19 -08:00
Elizabeth Thomas
b5407869c8
[Bugfix] Respect VLLM_CONFIGURE_LOGGING value ( #28671 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Jane Xu <janeyx@meta.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Johnny Yang <johnnyyang@google.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: bruceszchen <bruceszchen@tencent.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Johnny Yang <24908445+jcyang43@users.noreply.github.com >
2025-12-03 22:00:52 +00:00
bnellnm
2902c34826
[Kernels] Remove BatchedTritonOrDeepGemmExperts and default fallback to Triton ( #29929 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: bnellnm <49004751+bnellnm@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-12-03 20:49:00 +00:00
Wentao Ye
ac1886588f
[CI] Fix re import error ( #29973 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-03 15:16:54 -05:00
Yongtao Huang
2fc5d6e0d7
Fix LLMEngine.del dp_group cleanup condition ( #29954 )
...
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com >
2025-12-03 12:14:44 -08:00
elvischenv
afe9eb408e
[Bugfix] Fix flashinfer ar+norm kernel not available issue ( #29960 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-12-03 18:50:53 +00:00
Varun Sundar Rabindranath
19bee6d12d
[Performance][DP/EP] Add silu_mul_per_token_group_quant_fp8_colmajor kernel ( #29470 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-12-03 18:04:59 +00:00
avigny
dd5d1ef780
[Bugfix] Mistral tool parser streaming update ( #19425 )
...
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com >
Signed-off-by: Chauncey <chaunceyjiang@gmail.com >
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Jeff Cook <jeff@jeffcook.io >
Co-authored-by: sfbemerk <benjaminmerkel@mail.de >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-03 17:45:31 +00:00
Micah Williamson
d1f7392c5f
[ROCm][CI] Fix v1/logits_processors failure on ROCm ( #29927 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-04 01:17:07 +08:00
Yu Jiaqi
9ae3c55b10
SigLIP example add chat_template ( #29902 )
...
Signed-off-by: piood <2477084691@qq.com >
2025-12-03 16:12:58 +00:00
Lumis Chen
9bcf92295a
[Core] Add xxHash as a high-performance hash option for accelerating prefix caching ( #29163 )
...
Signed-off-by: LuminolT <lumischen01@gmail.com >
Signed-off-by: Lumis Chen <lumischen01@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-12-03 16:06:57 +00:00
rasmith
5aa9b09040
[CI/Build][AMD] Skip test_shared_storage_connector_hashes in test_shared_storage_connector.py due to hipErrorLaunchFailure when calling .cpu() ( #29839 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-12-03 22:56:35 +08:00
ioana ghiban
1bb17ecb39
[CPU Backend] [Doc]: Update Installation Docs for CPUs ( #29868 )
...
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com >
2025-12-03 13:33:50 +00:00
ioana ghiban
15b1511a15
[GPU Backend] [Doc]: Remove duplicate statements on missing GPU wheels. ( #29962 )
...
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com >
2025-12-03 12:56:47 +00:00
Chauncey
b78772c433
[Frontend] supports deepseekv32 chat template ( #29837 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-03 20:53:44 +08:00
Amr Mahdi
f5d3d93c40
[docker] Build CUDA kernels in separate Docker stage for faster rebuilds ( #29452 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2025-12-03 11:41:53 +00:00
Fadi Arafeh
78f4bb0ba8
[DOC] Add Arm to list of compute resources providers ( #29894 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-03 11:36:58 +00:00
HDCharles
b294e28db2
[refactor] CTMoEMethods to use QuantizationArgs ( #28871 )
...
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-03 11:00:56 +00:00
Roger Wang
787b84a9fc
[Bugfix] Follow-up fix on MediaWithBytes ( #29951 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-12-03 10:42:49 +00:00
Tsukasa OI
42c1949643
[Bugfix][Quantization] Support BF16 tensors on GGUF ( #29948 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
2025-12-03 10:33:46 +00:00
Isotr0py
cc4e296ea6
[CI/Build] Avoid duplicate empty inputs test for common multimodal generation tests ( #29907 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-03 10:27:36 +00:00
Isotr0py
a21cd9ed23
[Bugfix] Fix incorrect image_grid_thw rank for HunyuanOCR from missing merge_by_field_config=True ( #29950 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-03 10:05:10 +00:00
WeiQing Chen
7fe9c1a223
[CI] Add Async Eplb nightly CI tests ( #29385 )
...
Signed-off-by: David Chen <530634352@qq.com >
Signed-off-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-03 09:51:08 +00:00
Chauncey
3f42b05fbc
[Refactor] [1/N] to simplify the vLLM serving architecture ( #28040 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-03 01:26:39 -08:00
Yong Hoon Shin
69520bc695
Add logging for cudagraph related info ( #29825 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-12-03 01:01:48 -08:00
Andrew Xia
3a7751485b
[responsesAPI] support input output messages for non harmony models ( #29549 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-12-02 23:59:23 -08:00
Cyrus Leung
bbfb55c29e
[Misc] Allow fetch_* utils to access local files by default ( #29932 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-03 15:49:34 +08:00
JackieWu
0bec63fa31
[BugFix] fix imgs_pos in hunyuan_vl ( #29879 )
...
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-03 06:20:37 +00:00
elvischenv
c719c40540
[Bugfix] Defunctionalize TRTLLM AR+Norm op for avoiding extra clone kernel before it ( #29631 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-12-03 05:15:50 +00:00
Russell Bryant
b08025a83b
[Docs] Discuss api key limitations in security guide ( #29922 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-12-02 20:57:28 -08:00
Arpit Khandelwal
d7284a2604
[Core] Rename PassConfig flags as per RFC #27995 ( #29646 )
...
Signed-off-by: arpitkh101 <arpit5khandelwal@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-12-03 03:38:55 +00:00
Andreas Karatzas
506ed87e87
[ROCm][CI][Bugfix] Disable Flash/MemEfficient SDP on ROCm to avoid HF Transformers accuracy issues ( #29909 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-03 10:36:49 +08:00
Roger Wang
4dd7978374
[Bugfix] Fix regression on pooling models from PR#29621 ( #29921 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-03 10:33:45 +08:00
Lucas Wilkinson
5cdd664509
[BugFix] Fix assert in build_for_cudagraph_capture ( #29893 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-02 16:56:54 -08:00
Alexei-V-Ivanov-AMD
5f67361fd1
Reverting re-direction to amd_mi355_X. ( #29914 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-12-03 00:40:02 +00:00
maang-h
5d91d2b292
[Doc] Add allocate_slots parameter docs ( #29777 )
...
Signed-off-by: maang <maang_h@163.com >
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-12-02 23:23:09 +00:00
Micah Williamson
c014de1ec7
[ROCm][CI] Fix test_cudagraph_mode.py Failure For AMD CI ( #29808 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-02 22:54:36 +00:00
Julien Denize
1b1e35aaf9
[BUGFIX] Fix regex pattern for Mistral Tool Call ( #29918 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2025-12-02 14:51:58 -08:00
Julien Denize
5e5646e206
[BUGFIX] llama_4_scaling wrongly passed to DeepseekAttention ( #29908 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2025-12-02 14:51:20 -08:00
Chauncey
0a9caca9f5
[Bugfix] fix --scheduling-policy=priority & n>1 crashes engine ( #29764 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-12-02 22:42:28 +00:00
Sage Moore
e6f114ac25
[Bugfix][EPLB] Prevent user-provided EPLB config from being overwritten with defaults ( #29911 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-12-02 13:20:22 -09:00
Harry Mellor
6fc5841db1
Fix some more Transformers nightly tests ( #29872 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-02 21:49:44 +00:00
dependabot[bot]
3ff5b53bc2
Bump actions/setup-python from 6.0.0 to 6.1.0 ( #29768 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-02 21:29:32 +00:00
jthomson04
1528e079e2
[Perf] Avoid pageable HtoD transfer in MinTokensLogitsProcessor ( #29826 )
...
Signed-off-by: jthomson04 <jwillthomson19@gmail.com >
2025-12-02 21:25:52 +00:00
Divakar Verma
afb1e5b380
[CI][ROCm][tests/v1/e2e] Fix multiprocessing launch for the test ( #29123 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-12-02 20:46:10 +00:00
Copilot
1c593e117d
Fix boolean nested params, add dict format support, and enhance plotting for vllm bench sweep ( #29025 )
...
Signed-off-by: Luka Govedič <luka.govedic@gmail.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <luka.govedic@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-12-02 20:40:56 +00:00
Navanit Dubey
a2b053dc85
feat(model): Add BitsAndBytes quantization support for Qwen3-Omni-MoE ( #29896 )
...
Signed-off-by: navanit-git <navanitdubey@gmail.com >
2025-12-02 19:28:35 +00:00
Matthew Bonanni
1d93f11675
[Attention][CUDAGraph] Remove CG padding from attention backends ( #29352 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-02 13:48:08 -05:00
Benjamin Bartels
2d613de9ae
[CI/Build] Fixes missing runtime dependencies ( #29822 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
2025-12-02 10:21:49 -08:00
Alexei-V-Ivanov-AMD
c77b9929a0
Update AMD-CI testing mirror (as of 2025-12-02) ( #29898 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-12-02 08:52:54 -09:00
Isotr0py
63b1da76ba
[Chore]: Reorganize gguf utils functions under transformers_utils ( #29891 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-02 17:33:23 +00:00
Andrew Xia
52cb349fc0
[responsesAPI][3] ResponsesParser to set up non harmony MCP ( #29413 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-12-02 11:24:45 -05:00
Isotr0py
0ec8422171
[Bugfix] Fix incorrect channel order for idefics3 in edge case ( #29881 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-02 16:03:52 +00:00
wang.yuqi
2eb4fe9129
[examples] Resettle pooling examples. ( #29365 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-02 15:54:28 +00:00
Matthew Bonanni
51c57b51dd
[Bugfix] Fix DeepSeek R1 MTP weight loading ( #29545 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2025-12-02 15:52:18 +00:00
ImaGoodFella
60c3d413af
[Multimodal][Core] Optimize multimodal preprocessing cache by hashing image bytes instead of pixel values ( #29621 )
...
Signed-off-by: Rahul Steiger <rasteiger@ethz.ch >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-02 21:49:02 +08:00
Cyrus Leung
68ffbca7e4
[Chore] Use tokenizer.encode and tokenizer.decode directly ( #29851 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-02 12:30:40 +00:00
Harry Mellor
951445a52d
Remove default values from InitVars so that they're not stored ( #29859 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-02 12:16:37 +00:00
Julien Denize
d8c6210eea
Add Mistral Large 3 and Ministral 3 ( #29757 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Mickael Seznec <mickael@mistral.ai >
2025-12-02 10:29:00 +00:00
Louie Tsai
8bbcf8b6e7
[vLLM Benchmark Suite] Add default parameters section and update CPU benchmark cases ( #29381 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-12-02 09:00:23 +00:00
Boyuan Feng
70fb77b4dc
[BugFix] add max-num-batched-token to scheduler hash ( #29829 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-02 08:55:02 +00:00
杰兮
48d15a32aa
[CI] Fix Bad_words test for tokenizer encode/decode asymmetry ( #28193 )
...
Signed-off-by: zhyajie <yajizhan@amd.com >
Co-authored-by: zhyajie <yajizhan@amd.com >
2025-12-02 00:02:12 -08:00
Boyuan Feng
3b221cb661
[BugFix] respect VLLM_LOGGING_LEVEL in logger ( #29761 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-02 07:49:16 +00:00
Wushi Dong
0037b5746a
[Core] Eliminate redundant is_encoder_decoder lookups (20-40us/step) ( #29800 )
...
Signed-off-by: Wushi Dong <dongws@meta.com >
2025-12-02 07:08:07 +00:00
Harry Mellor
f5b0846ba0
Fix some Transformers nightly tests ( #29802 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-02 07:05:27 +00:00
Zhang Xiangze
13ea39bc09
[CPU]Parallelize over tokens in int4 moe ( #29600 )
...
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com >
2025-12-02 06:21:39 +00:00
Shengqi Chen
4b612664fd
[CI] Renovation of nightly wheel build & generation (take 2) ( #29838 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-01 22:17:10 -08:00
Cyrus Leung
653591d5e7
[Chore] Move tokenizer initialization methods ( #29793 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-02 13:33:37 +08:00
Divakar Verma
e2fbfc955e
[CI][AMD] spec_decode:eagle skip FLASH_ATTN for deepseek on ROCm ( #29827 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-12-02 05:27:46 +00:00
Divakar Verma
a690fb5bd6
[CI][ROCm] Fix test_correctness_sliding_window ( #29243 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-02 04:53:27 +00:00
usberkeley
81fe3f82af
[BugFix] Fix index error in ngram_proposer ( #29779 )
...
Signed-off-by: Bradley <bradley.b.pitt@gmail.com >
2025-12-02 04:48:11 +00:00
Zuyi Zhao
53bf71b0f0
[Misc] Update conftest for entrypoints/sagemaker test folder ( #29799 )
...
Signed-off-by: Zuyi Zhao <zhaozuy@amazon.com >
2025-12-01 18:56:39 -09:00
Johnny Yang
f441d36cee
Add missing return in _check_vllm_model_embed_input_ids ( #29834 )
...
Signed-off-by: Johnny Yang <johnnyyang@google.com >
2025-12-01 19:22:50 -08:00
Seiji Eicher
22274b2184
[Misc] Add ReplicaId to Ray metrics ( #24267 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: rongfu.leng <1275177125@qq.com >
2025-12-02 03:21:44 +00:00
Wei Wei
fc95521ba5
[Misc] Throw error on unintended access to scheduler_config.max_model_len ( #29771 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-12-02 10:58:44 +08:00
Zhuohan Li
d0cd728907
[Core] Support reseting all running requests' KV while calling reset_prefix_cache ( #28827 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-12-02 02:25:05 +00:00
Andrew Xia
fa8804ad9c
[responsesAPI][4] fix responseOutputItem Kimi K2 thinking bug ( #29555 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-12-02 02:11:35 +00:00
Divakar Verma
4b40924998
[ROCm] Fallback pytorch GELU with tanh approximation to GELU() ( #29244 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Signed-off-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-02 02:02:22 +00:00
Hendrik Holtmann
c0dfc89485
SM120 / NVFP4: add device guard and runtime SM dispatch to cutlass_scaled_fp4_mm ( #29711 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-12-01 17:24:18 -08:00
Nick Hill
44822d7ff2
[BugFix] Preserve spec decoding uniform decode when scheduling ( #29759 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-01 17:15:52 -08:00
Alexei-V-Ivanov-AMD
342c4f1472
Updated CI mirror 2025-11-25 ( #29434 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Signed-off-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2025-12-01 23:44:33 +00:00
Kevin H. Luu
1336a1ea24
Revert #29787 and #29690 ( #29815 )
2025-12-01 13:42:03 -08:00
Nengjun Ma
eaf81485ed
[Ascend]: Fixed the issue where OOT Platform vllm-ascend could not enable SP in Eager mode ( #28935 )
...
Signed-off-by: leo-pony <nengjunma@outlook.com >
2025-12-01 15:02:18 -05:00
Finbarr Timbers
38caf7fa1a
Update FAQ on interleaving sliding windows support ( #29796 )
...
Signed-off-by: Finbarr Timbers <finbarrtimbers@gmail.com >
2025-12-01 19:15:19 +00:00
shivampr
cabc77cc86
[Core][Observability] Add KV cache residency metrics ( #27793 )
...
Introduces three new Prometheus histograms for fine-grained observability of KV cache residency behavior:
vllm:kv_block_lifetime_seconds — total lifetime from allocation to free
vllm:kv_block_idle_before_evict_seconds — idle duration before eviction
vllm:kv_block_reuse_gap_seconds — time between consecutive reuses of the same block
These metrics help operators analyze KV cache efficiency, reuse patterns, and eviction timing beyond simple utilization rates.
Implementation uses monotonic timestamps for accuracy, 1% sampling for minimal overhead (~48 bytes/block), and is fully thread-safe with zero runtime cost when disabled.
Two new runtime flags are introduced:
--kv-cache-metrics – enable KV cache residency metrics
--kv-cache-metrics-sample – control sampling ratio (default: 0.01)
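The sampling scheme described above can be sketched roughly as follows. This is an illustrative model only, not the code from the commit: the class name, the `on_alloc`/`on_access`/`on_free` hooks, and the plain lists standing in for Prometheus histograms are all assumptions made for the example. Locking is omitted for brevity, so unlike the real implementation this sketch is not thread-safe.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BlockTimes:
    """Timestamps kept for one sampled block (hypothetical structure)."""
    allocated_at: float
    last_access_at: float


@dataclass
class KVResidencyTracker:
    """Illustrative sampler; names and hooks are assumptions, not vLLM's API."""
    sample_rate: float = 0.01  # mirrors the documented 0.01 default ratio
    # Monotonic clock, so wall-clock adjustments cannot skew durations.
    clock: Callable[[], float] = time.monotonic
    rng: random.Random = field(default_factory=random.Random)
    sampled: dict = field(default_factory=dict)
    # Plain lists stand in for the three histograms.
    lifetimes: list = field(default_factory=list)
    idle_before_evict: list = field(default_factory=list)
    reuse_gaps: list = field(default_factory=list)

    def on_alloc(self, block_id: int) -> None:
        # Track only a fraction of blocks to bound memory and CPU overhead.
        if self.rng.random() < self.sample_rate:
            now = self.clock()
            self.sampled[block_id] = BlockTimes(now, now)

    def on_access(self, block_id: int) -> None:
        times = self.sampled.get(block_id)
        if times is None:
            return  # unsampled block: nothing to record
        now = self.clock()
        self.reuse_gaps.append(now - times.last_access_at)
        times.last_access_at = now

    def on_free(self, block_id: int) -> None:
        times = self.sampled.pop(block_id, None)
        if times is None:
            return
        now = self.clock()
        self.lifetimes.append(now - times.allocated_at)
        self.idle_before_evict.append(now - times.last_access_at)
```

Injecting the clock keeps the sketch testable with a fake time source, and setting `sample_rate` to 0 means no block is ever tracked.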
Signed-off-by: Shivam <shivamprasad91@gmail.com >
2025-12-01 18:27:53 +00:00
Kevin H. Luu
ec7035c9d4
[ci] Make distributed 8 gpus test optional ( #29801 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2025-12-01 10:22:05 -08:00
knlnguyen1802
fc6acc88ca
[Bugfix] Missing cached item in the MultiModalReceiverCache ( #28525 )
...
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Chenguang Zheng <645327136@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-01 10:18:07 -08:00
BADAOUI Abdennacer
d0985c5feb
[Hardware][AMD] Remove ROCm skip conditions for transformers backend tests ( #29782 )
...
Signed-off-by: badaoui <abdennacerbadaoui0@gmail.com >
2025-12-02 02:03:13 +08:00
sangbumlikeagod
092bb73b8a
[Frontend] add 'verbose_json' and 'timestamp' feature on Whisper Transcription/Translation ( #24209 )
...
Signed-off-by: sangbumlikeagod <oironese@naver.com >
Signed-off-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com >
2025-12-01 18:19:17 +01:00
FredericOdermatt
5d43f7372e
[Doc] Update description disable_any_whitespace ( #29784 )
...
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch >
2025-12-01 16:48:33 +00:00
Shengqi Chen
37593deb02
[CI] fix url-encoding behavior in nightly metadata generation ( #29787 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-01 23:17:20 +08:00
Liu Jinyi
f5516039c5
[Doc] fix heading levels ( #29783 )
...
Signed-off-by: KKKZOZ <kkkzoz@qq.com >
2025-12-01 14:49:22 +00:00
Shengqi Chen
36db0a35e4
[CI] Renovation of nightly wheel build & generation ( #29690 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-01 21:25:39 +08:00
Marcin Ostrowski
5cfa967efa
[Bugfix] TypeError: 'NoneType' object is not callable ( #29414 )
...
Signed-off-by: Marcin Ostrowski <marcinx.ostrowski@intel.com >
2025-12-01 13:16:44 +00:00
Isotr0py
b95db244ee
[v1] Add real sliding window calculation to FlexAttention direct BlockMask building ( #26015 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Co-authored-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2025-12-01 13:12:51 +00:00
Zhengxu Chen
ad9d656bfa
[multimodal][test] Reduce memory utilization for test_siglip to avoid OOM ( #29504 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-01 20:41:48 +08:00
Fanli Lin
f37e8938d2
[XPU] Fix AWQ skipped layer detection in IPEX quantization ( #29774 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
2025-12-01 12:00:52 +00:00
Cyrus Leung
f0a28bf661
[Misc] Unify tokenizer registration ( #29767 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-01 11:34:58 +00:00
Mickaël Seznec
86e178f7c4
[crashfix] Eagle + multimodal can crash on mm cache miss ( #29750 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-12-01 17:29:33 +08:00
daniel-salib
014ece97c7
[Frontend] Add tool filtering support to ToolServer ( #29224 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-01 08:03:57 +00:00
wang.yuqi
62de4f4257
[Frontend] Resettle pooling entrypoints ( #29634 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-12-01 15:30:43 +08:00
Huamin Li
83805a6078
[CI] Skip paddleocr_vl for transformer 4.57.3 ( #29758 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2025-12-01 04:38:06 +00:00
Yifei Zhang
1ab8fc8197
Make PyTorch profiler gzip and CUDA time dump configurable ( #29568 )
...
Signed-off-by: Yifei Zhang <yifei.zhang1992@outlook.com >
2025-12-01 04:30:46 +00:00
Shu Wang
f72a817bdf
[MoE] CuteDSL MoE with Nvfp4 DeepEP dispatch ( #27141 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: root <root@umbriel-b200-017.ipp4a1.colossus.nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-11-30 16:05:32 -08:00
Woosuk Kwon
ec38a7368d
[Model Runner V2] Use packed mask for prompt bin counts ( #29756 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-30 14:15:42 -08:00
Xingyu Liu
21c2627934
[Misc]Remove redundant hidden_size property in ModelConfig ( #29749 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-30 17:14:23 +00:00
Omer Ullman Argov
39d28108f4
[Feat] Support non-gated activations in NVFP4 modelopt path ( #29004 )
2025-11-30 11:02:40 -05:00
Harry Mellor
cd719de5cb
Fix RoPE failures in Transformers nightly ( #29700 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-30 14:29:32 +00:00
Pleaplusone
8c363ed666
[ROCm][Attention] Sliding window support for AiterFlashAttentionBackend ( #29234 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-11-30 11:31:50 +00:00
Cyrus Leung
64bc09ba27
[Core] Enable inputs_embeds_size separate from hidden_size ( #29741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-30 17:31:12 +08:00
Isotr0py
47539cfd3e
[Bugfix] Fix mismatched nvfp4 gemm output shape ( #29742 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-30 09:15:01 +00:00
Cyrus Leung
2afcec4dec
[Misc] Update TokenizerLike interface and move get_cached_tokenizer ( #29730 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-30 14:59:47 +08:00
朝
9381b5cde0
[Doc]: Fix typo in fused_moe layer ( #29731 )
...
Signed-off-by: BowTen <bowten@qq.com >
2025-11-29 22:29:13 -08:00
Vensen
66b5840287
[Bugfix][sleepmode][fp8 kv cache]: Fix FP8 KV cache + sleep(level=2) gibberish output ( #28783 )
...
Signed-off-by: vensen <vensenmu@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-11-30 14:24:25 +08:00
Huamin Li
82c795d6f2
Fix AttributeError about _use_fi_prefill ( #29734 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2025-11-30 06:04:55 +00:00
Isotr0py
e1464c3a08
[Quantization] Enable compressed-tensors AWQ for Turing GPU ( #29732 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-30 06:04:28 +00:00
Xin Yang
a491b0911b
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 ( #29708 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-30 10:37:25 +08:00
Jee Jee Li
b9d0504a36
[Bugfix] Revert test_tokenization.py ( #29729 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-29 16:35:15 +00:00
Jinzhen Lin
1656ad3704
[Kernel][Quantization] add w4a8 support for marlin kernel ( #24722 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
2025-11-29 07:19:33 -08:00
Cyrus Leung
fa59fe417f
[Chore] Move detokenizer_utils to vllm/tokenizers ( #29727 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-29 06:25:17 -08:00
Cyrus Leung
fe3398fab2
[Chore] Enable passing tokenizer=None into MM processor ( #29724 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-29 06:25:10 -08:00
Chukwuma Nwaugha
ad7f714d62
hfrunner.classify should return list[list[float]] not list[str] ( #29671 )
...
Signed-off-by: Chukwuma Nwaugha <nwaughac@gmail.com >
2025-11-29 13:57:00 +00:00
dublc
f4341f45d3
[Doc]: fix code block rendering ( #29728 )
...
Signed-off-by: dublc <jdublc0x@gmail.com >
2025-11-29 13:46:48 +00:00
Cyrus Leung
34a984274e
[Misc] Refactor tokenizer interface ( #29693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-29 04:02:21 -08:00
Woosuk Kwon
f223ed4181
[Model Runner V2] Fuse penalties and temperature into single kernel ( #29720 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-29 02:29:16 -08:00
Didier Durand
04a797cd0e
[Doc]: fixing typos in various files. ( #29717 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-11-29 01:15:39 -08:00
Woosuk Kwon
6afc0ffaf6
[Model Runner V2] Add sample/ directory and reorganize files ( #29719 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-29 00:41:01 -08:00
Jee Jee Li
39e63dec7c
[LoRA] Cleanup LoRA unused code ( #29611 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-28 22:52:58 -08:00
Woosuk Kwon
4a80ad0a25
[Model Runner V2] Don't use UVA buffer for prefill_len ( #29713 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-28 20:27:16 -08:00
Angela Yi
4b17ce6815
Add gpu memory wait before test_async_tp ( #28893 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-11-28 20:19:05 -08:00
Lucas Wilkinson
e23f665d83
[BugFix] Fix DBO failing with TypeError: 'NoneType' object is not iterable ( #29698 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-11-28 20:19:01 -08:00
Woosuk Kwon
ca1b1e7296
[Model Runner V2] Refactor prefill token preparation ( #29712 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-28 19:49:17 -08:00
Tsukasa OI
762a4a6ca9
[Frontend] Perform offline path replacement to tokenizer ( #29706 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
2025-11-28 18:32:08 -08:00
Cyrus Leung
b2c50eda50
[Bugfix] Fix wrong mock attribute ( #29704 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-29 10:30:41 +08:00
Woosuk Kwon
1dcafb3dea
[Model Runner V2] Support penalties using bin counts ( #29703 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-28 17:53:17 -08:00
Andreas Karatzas
ea3370b428
[ROCm][Bugfix] Patch for the Multi-Modal Processor Test group ( #29702 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-11-29 01:31:44 +00:00
Mert Unsal
c625d7b1c6
[Bugfix] Fix O(n²) multimodal string prompt processing ( #29667 )
...
Signed-off-by: mertunsall <mertunsal1905@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-11-28 16:10:39 -08:00
Zhengxu Chen
6173682b6e
[compile] Include enable_sleep_mode into caching factors. ( #29696 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-11-29 07:58:38 +08:00
Augusto Yao
9726e64530
bugfix: correct attn output with base 2 or e ( #28840 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2025-11-29 07:52:12 +08:00
Huamin Li
3fd1fb0b60
Revert "[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 ( #28971 )" ( #29697 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2025-11-28 15:26:52 -08:00
Jiangyun Zhu
a51f4186f2
[Bugfix] fix dots.llm1.inst ( #29687 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-11-28 15:25:26 -08:00
Cyrus Leung
7675ba30de
[Misc] Remove redundant ClassRegistry ( #29681 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-28 15:24:47 -08:00
Ralf Gommers
7c1ed45848
[CI/Build]: make it possible to build with a free-threaded interpreter ( #29241 )
...
Signed-off-by: Ralf Gommers <ralf.gommers@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-28 15:21:46 -08:00
Benjamin Chislett
1986de1375
[Perf] Optimize EAGLE prepare_inputs_padded with triton kernels ( #28597 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
2025-11-28 22:25:05 +00:00
Yanan Cao
3461e7efd8
[Frontend] Remap -O to -cc commandline flag ( #29557 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
2025-11-28 21:51:12 +00:00
Harry Mellor
fecae12cd7
Remove all_special_tokens_extended from tokenizer code ( #29686 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-28 20:26:51 +00:00
Cyrus Leung
8d9338fae4
[Chore] Rename Processor to InputProcessor ( #29682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-28 09:35:41 -08:00
Isotr0py
d40c854009
[CI/Build] Rework CPU multimodal processor test ( #29684 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-28 17:10:29 +00:00
Harry Mellor
4332955602
[Docs] Add CLI reference doc for vllm bench sweep plot_pareto ( #29689 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-28 08:10:08 -09:00
Isotr0py
f946a8d743
[Chore]: Reorganize model repo operating functions in transformers_utils ( #29680 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-28 08:46:51 -08:00
Isotr0py
6f9d81d03b
[V0 deprecation] Clean up legacy paged attention helper functions ( #28043 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-28 16:44:33 +00:00
Didier Durand
fae6943068
[Doc]: fixing typos in multiple files. ( #29685 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-11-28 08:41:41 -08:00
果冻虾仁
3bcbb30cbf
add add_truncate_prompt_tokens in repr for PoolingParams ( #29683 )
2025-11-28 08:41:05 -08:00
Cyrus Leung
9e6bcda3ac
[mypy] Enable type checking for more directories ( #29674 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-28 08:39:27 -08:00
Harry Mellor
9eec282cb5
Guard FlashInfer sampler using the same check as FlashInfer attention backend ( #29415 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-28 08:34:48 -08:00
Cyrus Leung
0808eb813b
[Misc] Remove yapf directives ( #29675 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-28 15:07:23 +00:00
Mingyuan Ma
460d8bbf2d
Remove upstream fa checks ( #29471 )
...
Signed-off-by: mingyuanm <mingyuanm@nvidia.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-11-28 05:52:42 -08:00
Li, Jiang
e2f56c309d
[CPU] Update torch 2.9.1 for CPU backend ( #29664 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-11-28 13:37:54 +00:00
HappyAmazonian
f8151b66fa
Revert "Supress verbose logs from model_hosting_container_standards (… ( #29335 )
...
Signed-off-by: Shen Teng <sheteng@amazon.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-28 05:29:05 -08:00
Cyrus Leung
1168768a2d
[Optimization] Early return for _apply_matches and _iter_placeholders ( #29668 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-28 13:26:47 +00:00
Nick Hill
8e7a891602
[BugFix] Fix spec decoding max_tokens scheduling perf issue ( #29542 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-28 20:52:23 +08:00
Cyrus Leung
953d9c820b
[mypy] Pass type checking for vllm/utils and vllm/v1/pool ( #29666 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-28 20:40:47 +08:00
Cyrus Leung
33b06a6f24
[Misc] Remove redundant attention var constants ( #29650 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-28 04:35:19 -08:00
Wilson Wu
5c2b5cb422
[Docs] Add SPLADE and Ultravox models to supported models documentation ( #29659 )
...
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-11-28 01:29:28 -09:00
杰兮
3cb32e5d6e
[Rocm] Set VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS default is disabled ( #28985 )
...
Signed-off-by: zhyajie <yajizhan@amd.com >
Co-authored-by: zhyajie <yajizhan@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-11-28 02:08:42 -08:00
Cyrus Leung
ccbdf51bd5
[Doc] Reorganize benchmark docs ( #29658 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-28 17:19:25 +08:00
Filipp Fisin
5f5521bd5d
Fix parameter order in GPT-OSS weight loading function for non-MXFP4 weights ( #29506 )
...
Signed-off-by: Filipp Fisin <48059208+qGentry@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-28 00:45:10 -08:00
Julien Denize
b2c1d294fa
[BUGFIX] MistralTokenizer._call__ adds an invalid EOS token ( #29607 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-28 16:44:47 +08:00
maang-h
cc0f2a0e19
[Doc] Improve abnormal information string ( #29655 )
...
Signed-off-by: maang <maang_h@163.com >
2025-11-28 00:12:20 -08:00
rongfu.leng
480598958e
[Feature][Bench] Add pareto visualization ( #29477 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-11-27 23:53:20 -08:00
Cyrus Leung
b34e8775a3
Revert "[CPU]Update CPU PyTorch to 2.9.0 ( #29589 )" ( #29647 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-27 22:43:18 -08:00
wang.yuqi
f4b76056ee
Improve enable chunked_prefill & prefix_caching logic. ( #26623 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-27 22:05:48 -08:00
EanWang211123
37b15e97e8
[Multimodal][Speculative Decoding]Eagle3 mm support, enablement on qwen3vl ( #29594 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: EanWang211123 <wangyiheng@sangfor.com.cn >
Co-authored-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-11-27 22:05:45 -08:00
maang-h
c7ba1f6bc7
[BugFix] Fix ValueError in NewRequestData repr methods ( #29392 )
...
Signed-off-by: maang <maang_h@163.com >
2025-11-28 13:42:30 +08:00
Wilson Wu
18523b87f6
[Docs] Update supported models for Olmo 3 in tool calling documentation ( #29411 )
...
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com >
2025-11-28 02:53:55 +00:00
Xin Yang
745a3bae1a
[LoRA] Support FusedMoE LoRA Triton kernel for mxfp4 ( #28971 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-28 10:48:28 +08:00
scydas
35657bcd7a
[CPU]Update CPU PyTorch to 2.9.0 ( #29589 )
...
Signed-off-by: scyda <scyda@outlook.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-11-28 09:34:33 +08:00
Lucas Wilkinson
be493e0b3c
[BugFix] Fix new nightly failures ( #29578 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-11-27 13:45:38 -08:00
Woosuk Kwon
ae0ce1be27
[Model Runner V2][BugFix] Keep reference to GPU tensors in AsyncOutput ( #29623 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-27 12:38:53 -08:00
Andrii Skliar
a5345bf49d
[BugFix] Fix plan API Mismatch when using latest FlashInfer ( #29426 )
...
Signed-off-by: Andrii Skliar <askliar@askliar-mlt.client.nvidia.com >
Co-authored-by: Andrii Skliar <askliar@askliar-mlt.client.nvidia.com >
2025-11-27 11:34:59 -08:00
Nicolò Lucchesi
e5a621b724
[CI] Add batched audios Whisper test ( #29308 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-11-27 19:31:52 +00:00
Isotr0py
38658ec6f3
[Bugfix][MM encoder] Fix ViT attention backend resolving for Turing GPU ( #29614 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-27 19:17:37 +00:00
Cyrus Leung
a24ea5414b
[Deprecation] Advance deprecation status ( #29617 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-27 19:04:58 +00:00
Cyrus Leung
ea228b4491
[Misc] Remove unused code from protocol.py ( #29616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-27 18:39:59 +00:00
果冻虾仁
d45269b378
add skip_reading_prefix_cache in repr for PoolingParams ( #29620 )
2025-11-27 09:21:00 -08:00
Cyrus Leung
ee9841daa9
[Bugfix] Fix doc build on main ( #29619 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-27 09:08:08 -08:00
Injae Ryou
0840abdd24
[BugFix] Optional tokenizer argument when loading GGUF models ( #29582 )
...
Signed-off-by: Injae Ryou <injaeryou@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-27 16:53:10 +00:00
Harry Mellor
e1f262337b
Update Transformers pin in CI to 4.57.3 ( #29418 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-27 08:42:14 -08:00
Matthew Bonanni
fc1d8be3dc
[Attention] Update attention imports ( #29540 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-11-27 11:19:09 -05:00
Mathis Felardos
cd007a53b4
[bugfix] avoid NIXL_ERR_REMOTE_DISCONNECT in nixl_connector when Prefill dies ( #28120 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2025-11-27 15:32:38 +00:00
Didier Durand
66d3d5422c
[Doc]: fixing typos in diverse files ( #29492 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-11-27 07:15:50 -08:00
Ryan Rock
bab438ff3e
[CI/Build] Skip ray tests on ROCm ( #29556 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2025-11-27 07:01:37 -08:00
Li, Jiang
882851dc81
[CI/Build][Bugfix] Fix auto label issues for CPU ( #29610 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-11-27 14:51:26 +00:00
Jee Jee Li
2f5f9acd55
[LoRA] Continue optimizing MoE LoRA weight loading ( #29322 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-27 05:56:28 -08:00
Roger Wang
cf348c8d27
[Bugfix] Fix HunyuanVL XD-RoPE ( #29593 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: grider-transwithai <grider@transwith.ai >
2025-11-27 12:36:24 +00:00
Li, Jiang
a5abd1d384
[CI] Auto label CPU related issues ( #29602 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-11-27 11:33:19 +00:00
Cyrus Leung
e6d4f3c254
[Bugfix] Fix pre-commit ( #29601 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-27 02:23:06 -08:00
maang-h
51906c8c55
[Docs] Improve priority parameter documentation ( #29572 )
...
Signed-off-by: maang <maang_h@163.com >
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-11-27 02:09:24 -08:00
Morrison Turnansky
0838b52e2e
[Frontend][torch.compile] CompilationConfig Overhaul ( #20283 ): Set up -O infrastructure ( #26847 )
...
Signed-off-by: morrison-turnansky <mturnans@redhat.com >
Signed-off-by: adabeyta <aabeyta@redhat.com >
Signed-off-by: Morrison Turnansky <mturnans@redhat.com >
Co-authored-by: adabeyta <aabeyta@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-27 01:55:58 -08:00
Cyrus Leung
00d3310d2d
[Bugfix] Update Ultravox compatibility ( #29588 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-27 01:36:18 -08:00
Woosuk Kwon
da3222f371
[Model Runner V2] Implement multi-step Eagle with CUDA graph ( #29559 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-27 00:09:41 -08:00
Micah Williamson
43c5792592
[ROCm][CI] Fix test_cpu_offloading for ROCm ( #29548 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-11-27 07:54:44 +00:00
Johnny Yang
3ecabd06ee
Fix tpu-inference platform path ( #29554 )
...
Signed-off-by: Johnny Yang <johnnyyang@google.com >
2025-11-26 23:25:21 -08:00
Jee Jee Li
c069086b9c
[Bugfix] Fix getting device for MoE LoRA ( #29475 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-26 23:16:07 -08:00
Woosuk Kwon
11ea5ec1ff
[Model Runner V2] Refactor CudaGraphManager ( #29583 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-26 21:37:59 -08:00
Fadi Arafeh
ecb1952378
[cpu][fix] Fix Arm CI tests ( #29552 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-11-27 13:09:41 +08:00
TJian
da8e1a1bf9
[DOC] Add vLLM Bangkok Meetup info ( #29561 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-11-27 04:42:50 +00:00
Woosuk Kwon
ee80aee1ca
[Model Runner V2] Minor cleanup for build_attn_metadata ( #29576 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-26 20:10:12 -08:00
Woosuk Kwon
0aeb698b77
[Model Runner V2] Minor code cleanup ( #29570 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-26 19:47:17 -08:00
Louie Tsai
9bb33c8919
add xpu supported model and model id for cpu ( #29380 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-11-27 11:30:50 +08:00
Jinzhen Lin
a67dec7cba
[Bugfix] fix IMA issue in certain cases of the moe marlin kernel ( #28619 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-11-26 19:02:21 -08:00
Matthew Bonanni
77740191de
[Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1 ( #29449 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-11-26 18:48:43 -08:00
HDCharles
df01eda4dc
[Bugfix] Make compressed-tensors MoEs respect ignored layers ( #28878 )
...
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com >
2025-11-26 21:35:13 -05:00
Johnny Yang
ba1fcd84a7
[TPU] add tpu_inference ( #27277 )
...
Signed-off-by: Johnny Yang <johnnyyang@google.com >
2025-11-26 14:46:36 -08:00
Lucas Wilkinson
56539cddac
[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building ( #28579 )
2025-11-26 14:07:13 -05:00
Matthew Bonanni
430dd4d9eb
[Attention] Remove imports from vllm/attention/__init__.py ( #29342 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-11-26 10:53:15 -07:00
Alec
c4c0354eec
[CI/Build] allow user modify pplx and deepep ref by ENV or command line ( #29131 )
...
Signed-off-by: alec-flowers <aflowers@nvidia.com >
2025-11-26 17:41:16 +00:00
HDCharles
e603129505
[refactor] CTConfig methods to static/class methods ( #28870 )
...
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-26 17:21:58 +00:00
Wentao Ye
0b0aa874e8
[Perf] Optimize batch invariant BMM, 18.1% Throughput improvement, 10.7% TTFT improvement ( #29345 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-26 09:38:52 -07:00
Huamin Li
70d5953f82
Revert "[Bugfix] Fix GPT-OSS AR+NORM fusion ( #28841 )" ( #29483 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2025-11-26 22:27:26 +08:00
yxt
3650a74ed8
Optimize the wording of the document and unify the terminology and th… ( #29491 )
2025-11-26 05:16:12 -08:00
Yejing Lai
bb706d6048
Fix TeleChatForCausalLM not register issue ( #29473 )
...
Signed-off-by: Lai, Yejing <yejing.lai@intel.com >
2025-11-26 05:15:00 -08:00
Cyrus Leung
e30859dff3
[Bugfix] Fix handling of image embeds in models ( #29480 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-26 05:00:15 -08:00
Roger Wang
452a7c9f7c
[Misc] Allow LM only loading for Pixtral ( #29451 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-11-26 05:00:00 -08:00
Pleaplusone
d9d342d214
[Performance][MLA][ROCm] Remove redundant D2D copy in deepseek ( #27457 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-11-26 12:45:28 +08:00
Xin Yang
53d7f1f601
[Kernel] Use pre-allocated output buffer for triton kernel fused_experts ( #29219 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2025-11-26 10:21:00 +08:00
dependabot[bot]
c5ee430328
Bump actions/checkout from 4 to 6 ( #29293 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-26 01:57:08 +00:00
Michael Goin
8d6a89dffd
[UX] Suppress gloo log spam ( #29250 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-11-25 17:19:35 -08:00
George D. Torres
56531b79cc
[Misc] Add backup hash algorithm for FIPS constrained environments ( #28795 )
...
Signed-off-by: George D. Torres <gdavtor@gmail.com >
Signed-off-by: George D. Torres <41129492+geodavic@users.noreply.github.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-11-26 00:50:22 +00:00
Xieyang Xu
12866af748
dummy run corner case ( #29433 )
2025-11-26 00:20:35 +00:00
Lucia Fang
d8819c88eb
fix assertion for single world use case (uni) ( #29429 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-11-26 00:14:23 +00:00
Andrey Khalyavin
de75b0bb70
[BugFix] Fix initialization of draft model. ( #29319 )
...
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-11-25 18:45:58 -05:00
Michael Goin
7df0289782
Change warning logs to debug for unimplemented MXFP4 Linear/Attention ( #29441 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-11-25 22:52:31 +00:00
Zhengxu Chen
0abc79482a
[caching] Add enable_prompt_embeds and cpu_offload_gb to compile hashes. ( #29435 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-11-25 21:46:41 +00:00
Nick Hill
4e57c6587f
[Core] Support logprobs with spec decode + async scheduling ( #29223 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-25 12:55:24 -08:00
Ilya Markov
e7d776273d
[Compile] Refactor. Move PostGradPassManager out of Compilation config ( #29340 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2025-11-25 19:58:56 +00:00
Eldar Kurtić
c32a18cbe7
Attempt to fix GPU OOM in a spec-decoding test ( #29419 )
...
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
2025-11-25 14:23:36 -05:00
Andrew Xia
b07555d26f
[responsesAPI][2] parse ResponseFunctionToolCallOutputItem ( #29383 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-11-25 10:27:26 -08:00
Harry Mellor
0353d2e162
Fix RoPE related failures in Transformers nightly tests ( #29333 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-25 16:23:45 +00:00
Harry Mellor
a1f2676879
Scheduled removal of override_pooler_config and disable_log_requests ( #29402 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-25 16:08:57 +00:00
Yifan Qiao
48ddb02b79
[Hybrid Allocator] Support KV cache groups with different block_size ( #29143 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-11-25 10:30:57 -05:00
Michael Goin
e502098643
[Kernel] Add NVFP4 MoE CUTLASS support for SM120 ( #29242 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-11-25 06:59:07 -08:00
Michael Goin
dbc3d9991a
[UX] Put CUDA attention backend selection log into one line ( #29337 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-11-25 06:46:18 -08:00
Injae Ryou
794029f012
[Feature]: Improve GGUF loading from HuggingFace user experience like repo_id:quant_type ( #29137 )
...
Signed-off-by: Injae Ryou <injaeryou@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-25 14:28:53 +00:00
Eldar Kurtić
0231ce836a
Revert back to torch.equal over torch.allclose from #28819 ( #29086 )
...
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
2025-11-25 14:23:38 +00:00
Thomas Parnell
516c3f7847
[Bugfix] Fix logic for choosing default prefix caching setting ( #29393 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-11-25 14:05:10 +00:00
Harry Mellor
51fc9e017a
Scheduled removal of CompilationConfig.use_inductor ( #29323 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-25 12:55:42 +00:00
Harry Mellor
bf0c75cd4f
Make Transformers Nightly tests soft-fail and enable all tests ( #29401 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-25 12:41:15 +00:00
Roger Wang
c2c661af9b
[Bugfix] Fix overallocation in MM profiling ( #29386 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-11-25 12:38:36 +00:00
Nicolò Lucchesi
798e87db5c
[Core] Generalize Encoder-Decoder seq_lens computation to avoid Whisper hardcoded logic ( #29268 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-11-25 11:32:11 +00:00
wang.yuqi
de6889946b
[Misc] Suppress log outputs when constructing the default vllm config. ( #29291 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-25 03:00:44 -08:00
wang.yuqi
7a80b01889
[CI] Resettle pooling entrypoints tests. ( #29370 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-11-25 10:39:10 +00:00
Ben Browning
e1dd706cd1
[Frontend] Respect Chat Completion parallel_tool_calls param ( #26233 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-11-25 09:56:15 +00:00
Andrew Xia
a685b47c57
[responsesAPI] refactor construct_input_messages ( #29359 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-11-25 09:47:10 +00:00
Avishek Goswami
32c40b95e0
[BugFix] bad_words filtering ineffective when n > 1 ( #29313 )
...
Signed-off-by: GOavi101 <1704178@kiit.ac.in >
2025-11-25 09:36:34 +00:00
Nick Hill
db2906108a
[Misc] Streamline unique id generation ( #29375 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-25 08:30:11 +00:00
wang.yuqi
67fc16cd8c
[Bugfix] If chunked_prefill is disabled, end the scheduling early. ( #28911 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-11-25 16:06:09 +08:00
elvischenv
6330f9477d
[Bugfix] Fix GPT-OSS AR+NORM fusion ( #28841 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-11-25 07:59:40 +00:00
Micah Williamson
ef1f7030f0
[ROCm][CI] Fix test_cudagraph_mode failure in AMD CI ( #29367 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-11-25 07:55:09 +00:00
Rémi Delacourt
12c007e288
EAGLE Support DP>1 ( #26086 )
...
Signed-off-by: Rémi Delacourt <remi@mistral.ai >
Signed-off-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com >
Signed-off-by: remi <remi@mistral.ai >
2025-11-25 07:32:21 +00:00
zhrrr
f242cfcdd5
[Perf] use cpu all reduce to avoid sync when async_scheduling & dp > 1 ( #29311 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2025-11-25 15:31:07 +08:00
Icey
888152bf87
Allow oot custom compiler extension via CompilerInterface ( #28623 )
...
Signed-off-by: wxsIcey <1790571317@qq.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Icey <1790571317@qq.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
2025-11-25 15:25:15 +08:00
Ryan Rock
fe3a4f5b34
[CI/Build] Pin torchgeo dependency for AMD ( #29353 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2025-11-25 07:14:59 +00:00
Fadi Arafeh
98caeadd54
[fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up wei ( #29273 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-11-25 15:11:11 +08:00
vllmellm
64deead719
[Bugfix] [ROCm] [UX]: revert Flex attention backend ( #29371 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-11-25 06:56:06 +00:00
Nick Hill
7992324f23
[BugFix] Use unique ids for different transcription prompts ( #29372 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-25 06:55:16 +00:00
Inoki
40a6f53f6c
Display warning only when ROCm version is less than Pytorch required version ( #29200 )
...
Signed-off-by: Inoki <inoki@inoki.cc >
2025-11-25 14:40:06 +08:00
kflu
ce58fdc1c3
Fix PoolingParams.skip_reading_prefix_cache type ( #29364 )
...
Signed-off-by: KFL <kludev@gmail.com >
2025-11-25 06:39:29 +00:00
Fanli Lin
a21256c463
Add TP CLI argument to multimodal inference examples ( #29301 )
...
Signed-off-by: Lin, Fanli <fanli.lin@intel.com >
2025-11-25 06:03:20 +00:00
Harry Mellor
316c8492bf
Scheduled removal of guided_* config fields ( #29326 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-25 05:24:05 +00:00
Lucas Wilkinson
2d9ee28cab
[CI/Test Fix] Fix CP tests on Blackwell ( #29338 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-11-24 20:55:57 -08:00
Jiangyun Zhu
81db702ed2
[Attention] add _cudagraph_support for linear attention ( #28934 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-11-25 12:25:20 +08:00
Isotr0py
92effb07a4
[Model] Add HunyuanOCR support ( #29327 )
...
Signed-off-by: manayang <jackmanayang@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: sergeywang <sergeywang@tencent.com >
Co-authored-by: manayang <jackmanayang@gmail.com >
Co-authored-by: manayang <manayang@tencent.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-11-25 03:28:51 +00:00
Maryam Tahhan
87185c88d5
[Bugfix] Make deprecated --task embedding consistent with `--runner… ( #29312 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
2025-11-25 03:19:52 +00:00
Mark McLoughlin
9cf4edae6e
[Metrics] Scheduled removal of deprecated metrics ( #29330 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-11-25 11:15:13 +08:00
汪志鹏
7012d8b45e
[Docker] Optimize Dockerfile: consolidate apt-get and reduce image size by ~200MB ( #29060 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
2025-11-24 19:54:00 -07:00
Divakar Verma
22b42b5402
[CI][ROCm] Install arctic-inference on ROCm tests ( #29344 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-11-25 02:15:39 +00:00
gbyu-amd
cb7214d8ea
[ROCm][MLA] enable fp8 MLA decode on ROCm ( #28032 )
...
Signed-off-by: guanbao <gyu@amd.com >
Signed-off-by: Guanbao Yu <gyu@amd.com >
Signed-off-by: gbyu-amd <Guanbao.Yu@amd.com >
Co-authored-by: guanbao <gyu@amd.com >
2025-11-25 10:15:02 +08:00
Pleaplusone
77e10c9cab
[Perf][Deepseek] optimize gather_and_maybe_dequant_cache kernel's perf for extremely long sequence ( #28029 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-11-24 19:05:46 -07:00
Michael Goin
6f1355a1b7
[Perf] Disable DeepGEMM MoE by default when TP=8 is used ( #29346 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-11-24 19:01:40 -07:00
Harry Mellor
a4ad43ad5a
Scheduled removal of ParallelConfig's direct child EPLB fields ( #29324 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-25 01:58:58 +00:00
Nick Hill
a178a0b40b
[BugFix] Fix duplicate id tool-call race condition ( #29355 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-25 01:54:26 +00:00
Kunshang Ji
b8328b49fb
[XPU] upgrade torch & ipex 2.9 on XPU platform ( #29307 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-11-25 09:34:47 +08:00
Hanjie Qiu
5f9679a43b
[Spec Decode] Add support for EAGLE3 heads that do not use_aux_hidden_states ( #27688 )
...
Signed-off-by: hjjq <hanjieq@nvidia.com >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2025-11-24 20:13:12 -05:00
Wentao Ye
699bca76c0
[UX] Raise error for attn backend of batch invariant ( #29348 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-24 17:49:01 -07:00
Michael Goin
c17610e2ba
[Bugfix] Only use triton_kernels for MXFP4 on SM90 and SM100 ( #29339 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-11-24 18:22:46 -05:00
Chen Zhang
71df2a57ef
[Hybrid Allocator] Better layer padding strategy for gpt-oss eagle ( #29303 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-11-24 14:28:32 -08:00
Tyler Michael Smith
4dd42db566
Remove VLLM_SKIP_WARMUP tip ( #29331 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-11-24 22:16:05 +00:00
Nick Hill
84371daf75
[Tests] Verify gpt_oss package is installed in harmony tests ( #29336 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-24 22:04:31 +00:00
Woosuk Kwon
f32c7d6f54
[Model Runner V2] Simplify Eagle bookkeeping with num_rejected ( #29347 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-24 13:54:59 -08:00
Yan Ma
3cfa63ad99
[XPU]fix Kimi-VL-A3B-thinking on xpu ( #29309 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-11-24 21:02:21 +00:00
Benjamin Bartels
4d6afcaddc
[CI/Build] Moves to cuda-base runtime image while retaining minimal JIT dependencies ( #29270 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
Signed-off-by: Benjamin Bartels <benjamin@bartels.dev >
2025-11-24 11:40:54 -08:00
Woosuk Kwon
97588c4d12
[Model Runner V2] Add minor clarification comments for Eagle ( #29332 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-24 11:28:56 -08:00
Chenheli Hua
839c6b7b72
[Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inputs in V1 engine. ( #27721 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-11-24 19:24:37 +00:00
bnellnm
8f066146c3
[MoE][Refactor] Make select_experts a non-static method ( #29067 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-11-24 13:38:04 -05:00
Woosuk Kwon
cec418b5df
[Model Runner V2] Change Numba AoT to JIT ( #29328 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-24 09:34:37 -08:00
Woosuk Kwon
cc313cb73d
[Model Runner V2] Implement Single-step Eagle 1 ( #29300 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-24 09:32:27 -08:00
Nicolò Lucchesi
26a465584a
[NIXL] Use config to enable telemetry + NIXL version bump ( #29305 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-11-24 17:18:04 +00:00
Varun Sundar Rabindranath
e924bbb4f4
[Build/CI][DP/EP] Add QWen/Qwen3-30B-A3B-FP8 + EPLB tests to Nightly H100 and B200 ( #29195 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-11-24 16:06:17 +00:00
Aydin Abiar
656516c315
[Bugfix] properly handle nested json with llama3 tool parser ( #27701 )
...
Signed-off-by: Aydin Abiar <aydin@anyscale.com >
Signed-off-by: Aydin Abiar <62435714+Aydin-ab@users.noreply.github.com >
Co-authored-by: Aydin Abiar <aydin@anyscale.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-11-24 15:28:51 +00:00
vllmellm
e48b2e6848
[Bugfix] [ROCm] [UX] Reorganize ROCm Backend Selection Logic ( #26980 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-11-24 15:24:49 +00:00
Laith Sakka
7a228b5305
Add option to use unbacked, and backed size obl dynamic shapes for more sound compilation. ( #26199 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2025-11-24 10:12:41 -05:00
Yuan Tang
f716a15372
Update KServe guide link in documentation ( #29258 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-11-24 14:40:05 +00:00
WeiQing Chen
2601f18a82
[EPLB] Optimize EPLB for Async Rearrange Experts ( #22179 )
...
Signed-off-by: David Chen <530634352@qq.com >
Co-authored-by: SunChenxiang123 <1291824390@qq.com >
2025-11-24 09:08:29 -05:00
R3hankhan
4de87866a8
[CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x ( #28926 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2025-11-24 12:08:09 +00:00
Didier Durand
eca7a8fb59
[Doc]: fix typos in various files ( #29230 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-24 11:10:48 +00:00
杰兮
8005e606bf
[Bugfix][Rocm] Fix shared expert weight loading failure in DeepSeek-MTP ( #27563 )
...
Signed-off-by: zhyajie <yajizhan@amd.com >
Co-authored-by: zhyajie <yajizhan@amd.com >
2025-11-24 10:16:52 +00:00
rongfu.leng
68dfe28eae
[Feature][Benchmark] add --link-vars can filter when serve_param equal bench_param ( #28909 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-11-24 02:02:28 -08:00
Fanli Lin
ed40d85929
[BugFix] Fix R-VL model loading error ( #29299 )
...
Signed-off-by: Lin, Fanli <fanli.lin@intel.com >
2025-11-23 22:48:45 -08:00
Roger Wang
0ff70821c9
[Core] Deprecate xformers ( #29262 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-11-24 04:18:55 +00:00
tongqiu
5253f4276f
[ROCm] Support for Whisper v1 with Aiter Unified Attention and Aiter Flash Attention ( #28376 )
...
Signed-off-by: apinge <Tong.Qiu2@amd.com >
2025-11-24 03:26:00 +00:00
Zero
30854783ad
[Model] Add OpenCUA-7B support ( #29068 )
...
Signed-off-by: lim4349 <rockmanzero@naver.com >
Signed-off-by: Zero <rockmanzero@naver.com >
Co-authored-by: Cloud User <ubuntu@a100-80g-4.novalocal >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-24 10:27:55 +08:00
Jee Jee Li
1073ba68b0
[LoRA] Optimize 3D MoE logic ( #29222 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-24 10:27:23 +08:00
Josh Moore
c309bb5245
[Bugfix] Update Gradio OpenAI Chatbot Webserver example to new Gradio message history format ( #29249 )
...
Signed-off-by: joshiemoore <joshiemoore98@gmail.com >
2025-11-24 00:47:54 +00:00
Woosuk Kwon
3e1ad40655
[Model Runner V2] Add apply_temperature option to gumbel_sample ( #29276 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-23 14:13:00 -08:00
Woosuk Kwon
62d54ba46d
[Model Runner V2] Optimize CUDA graph capture time ( #29275 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-23 11:15:32 -08:00
Woosuk Kwon
b004c00418
[Model Runner V2] Support spec decoding [1/N] ( #29274 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-23 10:09:06 -08:00
Woosuk Kwon
7f12c82fa6
[Model Runner V2] Change bookkeeping logic in preparation for spec decoding ( #29194 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-23 09:42:52 -08:00
Luke
6fb0215eee
[Bugfix] Use lazy string reference for DeepseekV3Config in config registry ( #28958 )
...
Signed-off-by: Luke <yq0536@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-23 11:43:21 +00:00
Micah Williamson
55c21c8836
[ROCm][CI] Fix "Cannot re-initialize CUDA in forked subprocess" in test_pynccl.py ( #29119 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-11-23 13:05:00 +08:00
rasmith
3999442f1c
[CI/Build][AMD] Add check for flash_att_varlen_func to test_tree_attention.py ( #29252 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-23 04:45:08 +00:00
rasmith
71362ffab4
[CI/Build][AMD] Skip test_multi_shared_storage_connector_consistency in test_multi_connector.py due to hipErrorLaunchFailure when calling .cpu() ( #29253 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-23 04:42:49 +00:00
Woosuk Kwon
20ee418adc
[Model Runner V2] Minor fix for cudagraph_utils ( #29256 )
2025-11-22 20:12:50 -08:00
Cyrus Leung
389aa1b2eb
[Doc] Update more docs with respect to V1 ( #29188 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-23 10:58:48 +08:00
Michael Act
3ed767ec06
docs: fixes distributed executor backend config for multi-node vllm ( #29173 )
...
Signed-off-by: Michael Act <michael.a.c.tulenan@gdplabs.id >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-11-23 10:58:28 +08:00
jiahanc
5f96c00c55
[Fix] Add SM check to flashinfer MOE backend ( #29144 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-11-23 00:39:30 +00:00
Qidong Su
4587063267
Patch DeepEP when building docker image with CUDA 13 ( #29154 )
...
Signed-off-by: Qidong Su <soodoshll@gmail.com >
2025-11-22 23:25:13 +00:00
Wentao Ye
472fdee974
[Chore] Update batch invariant code owner ( #29246 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-22 13:50:02 -08:00
Yizhou
df78aeef08
Refactor: Move CUDA graph dispatch logic earlier ( #27382 )
...
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-11-22 16:10:31 -05:00
Nick Hill
7df331c66b
[BugFix] Fix chunked prompt logprobs + preemption ( #29071 )
2025-11-22 16:07:18 -05:00
Benjamin Bartels
eb5352a770
[CI/build] Removes source compilation from runtime image ( #26966 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
2025-11-22 10:23:09 -08:00
Cyrus Leung
d1cf8214e5
[Bugfix] Use HF config fields as fallback when loading Mistral config ( #29239 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-22 11:22:48 -07:00
Fadi Arafeh
730bd35378
[perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON ( #29193 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-11-22 09:04:36 -08:00
Federico
f55c76c2b3
chore: add RTX_PRO_6000 GLM4.6-FP8 kernel tuning ( #29240 )
2025-11-22 08:42:48 -08:00
ZiTian Zhao
d84d8f4429
Fix EVS crash when using video_embeds inputs in Qwen2.5-VL ( #29232 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-22 06:48:59 -08:00
Cyrus Leung
ae66818379
[Misc] Fix pre-commit ( #29238 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-22 06:48:01 -08:00
Nick Hill
d44a63c6d6
[BugFix] Fix returned logprobs with spec decode + prefill chunking ( #29216 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-22 22:41:25 +08:00
Nicolò Lucchesi
066209a045
[Attention] Refactor FA block_size limitations to hybrid models only ( #29084 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-11-22 06:38:44 -08:00
Bram Wasti
5f7209a793
[tiny] Remove unsupported TRITON_MLA backend from batch invariance ( #28832 )
...
Signed-off-by: Bram Wasti <bwasti@meta.com >
Signed-off-by: Bram Wasti <bwasti@fb.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-11-22 21:00:50 +08:00
yihong
2d4978a57e
fix: clean up function never use in setup.py ( #29061 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-11-22 05:00:04 -08:00
Nandan Vallamdasu
6965a392a4
Fix: Resolve circular import in model_loader/utils.py ( #29189 )
...
Signed-off-by: nandan2003 <nandan.vallamdasu@outlook.com >
Signed-off-by: Nandan Vallamdasu <nandan.vallamdasu@outlook.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-22 04:58:22 -08:00
Cyrus Leung
5a4802588e
[Misc] Further clean up chunked prefill and prefix caching init ( #29186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-22 19:34:15 +08:00
rasmith
8e22da1d7f
[CI/Build] Don't add FLASHINFER backend in test_cpu_offloading.py ( #29229 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-22 11:00:54 +00:00
rasmith
a4fdf2405c
[CI/Build] Skip tests that require libcudart in test_lmcache_integration.py ( #29228 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-22 10:59:39 +00:00
Jane (Yuan) Xu
e6309acdba
Simplify from_blob usage in get_cuda_view_from_cpu_tensor ( #29027 )
...
Signed-off-by: Jane Xu <janeyx@meta.com >
2025-11-22 10:35:32 +00:00
jinghanhu
988ee66b0d
Handle triton kernel import exception ( #29062 )
2025-11-22 10:07:50 +00:00
Mads Kildegård
ea38474ac5
[Frontend][Responses API] Multi-turn (with type: "output_text") support for non-harmony requests ( #29175 )
...
Signed-off-by: Mads Kildegård <mkildegaard99@gmail.com >
2025-11-22 09:58:22 +00:00
Andrew Xia
742e9ff6b3
[responsesAPI] parse reasoning item input ( #28248 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-22 15:42:11 +08:00
Woosuk Kwon
e9056056fb
[Model Runner V2] Limit cudagraph size to max decode batch size ( #29221 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-21 20:21:35 -08:00
Jee Jee Li
1489902b53
[LoRA] Cleanup FusedMoEWithLoRA ( #29187 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-22 04:01:30 +00:00
Yanan Cao
933f67ecd8
[Bugfix]Fix a conditional to not check zero value ( #28754 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2025-11-21 19:59:07 -08:00
rasmith
fd65015a14
[CI/Build] Only use supported types and features on ROCm in MoE kernel tests ( #29149 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-21 20:34:33 -07:00
Yihua Cheng
77e1c035d0
[chore][LMCache connector] Remove useless logs from lmcache connector ( #29069 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-11-22 03:18:00 +00:00
rasmith
6f403501a0
[CI/Build][AMD] Enable Entrypoints Integration Test (Pooling) to run without error on ROCm ( #29212 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-22 02:13:18 +00:00
FlintyLemming
052950e5b3
Add fused MoE config for H200 E160 N192 fp8 ( #29182 )
...
Signed-off-by: FlintyLemming <admin@flinty.moe >
2025-11-21 17:37:51 -08:00
qli88
1ef9c9e294
[CI/Build] Disable test_gptoss_tp.py in 'LoRA TP Test' group for ROCm platform ( #29204 )
...
Signed-off-by: qli88 <qiang.li2@amd.com >
2025-11-21 17:36:19 -08:00
Jie Luo
5c8f2adf50
[Bugfix] Fix block size in block_table with PCP ( #29094 )
...
Signed-off-by: Livinfly <luojie3m@gmail.com >
2025-11-22 01:34:28 +00:00
Ryan Rock
ed8e6843cc
[CI/Build] Add terratorch for AMD ( #29205 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2025-11-21 17:31:22 -08:00
Lukas Geiger
d045e22dfe
[Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s ( #29217 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-11-21 17:30:55 -08:00
Wentao Ye
1d34eb11e0
[CI] Bug: Fix triton import issue ( #29202 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-21 17:14:49 -08:00
Charlie Fu
9a3101b2ba
[Rocm][CI] Fix DeepSeek V2-Lite Accuracy CI ( #29135 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-11-21 17:11:02 -08:00
Angela Yi
d5dbdbfcb2
[docs] Fix cudagraph mode config ( #29170 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2025-11-21 17:10:27 -08:00
Lucas Wilkinson
30d6466238
[BugFix] Fix Eagle IndexError: list index out of range for even num_speculative_tokens ( #29102 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-11-22 00:47:05 +00:00
Woosuk Kwon
e9af6ba62a
[Model Runner V2] Optimize Gumbel Sampling Kernel ( #29210 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-21 15:52:28 -08:00
Mark McLoughlin
c6fa3895e9
[KV Connector] Fix async connector prefix cache metrics ( #28585 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2025-11-21 17:45:00 -05:00
Varun Sundar Rabindranath
3137991f55
[BugFix] EPLB + B200 + DeepGEMM : Handle column-major scales tensor ( #29162 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-11-21 14:28:17 -08:00
Julien Denize
57430fc95c
Default model load/config/tokenizer to mistral format if relevant files exist ( #28659 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-11-21 13:58:59 -08:00
Lucas Wilkinson
c68c7b403d
[BugFix] Fix missing symbol triggering FA2 fallback on Hopper ( #29107 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-11-21 13:58:32 -08:00
Ning Xie
53a1ba6ec5
[log] add weights loading time log to sharded_state loader ( #28628 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-11-21 21:06:09 +00:00
Lucas Wilkinson
1840c5cb18
[BugFix] Make sure to allocate worst case MoE workspace during profile run in the DP + EP case ( #27426 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-11-21 11:41:52 -08:00
Woosuk Kwon
1bed891f72
[Chore] Fix pre-commit error after #25266 ( #29190 )
2025-11-21 10:21:40 -08:00
Cyrus Leung
ceca060501
[Deprecation] Deprecate seed=None ( #29185 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-21 18:19:25 +00:00
Charlie Fu
75648b16dd
[ROCm][CI] Fix config/test_config_generation.py ( #29142 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-11-21 17:12:16 +00:00
Chendi.Xue
460d02a417
[NIXL] Fix after virtual block_size for host_buffer with heter kv_layout ( #29122 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-11-21 08:55:27 -08:00
Mingyuan Ma
b4c8fbaae2
Add TRTLLM MoE NVFP4 kernel to CompressedTensorsW4A4MoeMethod ( #28892 )
...
Signed-off-by: mingyuanm <mingyuanm@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-11-21 09:54:11 -07:00
rasmith
e99e467384
[CI/Build][Kernel][AMD] Move extra dim to after load in _fwd_kv_parallel in lighting_attn.py ( #29132 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-21 11:53:09 -05:00
Wentao Ye
a42ab317ac
[Log] Optimize startup log ( #28948 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-11-21 08:46:20 -08:00
Aleksandr Malyshev
b7f1f490a6
Upstream triton fp4 weight preshuffle ( #28888 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-11-21 11:34:46 -05:00
Woosuk Kwon
30b44a1598
GPU Model Runner V2 ( #25266 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-11-21 08:20:55 -08:00
Wentao Ye
1f400c58b8
[CI] Add batch invariant test to ci ( #27842 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-21 09:20:33 -07:00
rasmith
711241c13c
[CI/Build] Fix illegal memory access and unsupported test in kernels/attention/test_cache.py ( #29118 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-21 10:58:38 -05:00
Cyrus Leung
d7219bcda3
[Misc] Move dynamic seed initialization to EngineArgs ( #29165 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-21 15:27:44 +00:00
wangxiyuan
4050bae417
[Doc] Update plugin doc ( #28532 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-11-21 14:57:26 +00:00
skaraban3807
f1805db1a6
[Perf] These changes enhance the NUMA functionality of vllm for systems with more than one NUMA node per socket ( #25559 )
...
Signed-off-by: Siddappa Karabannavar <siddappa.karabannavar@amd.com >
2025-11-21 14:13:52 +00:00
Julien Denize
434f3d3eb8
Fix mistral config ( #29172 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-11-21 14:01:20 +00:00
sfbemerk
2092ce8c39
Tool Call Parser logs should not contain user input / model output except on DEBUG ( #29160 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-11-21 20:57:19 +08:00
who who who
fc9f821d20
fix cross attention ( #28346 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com >
2025-11-21 04:55:43 -08:00
Cyrus Leung
9452863088
Revert "Revert #28875 ( #29159 )" ( #29179 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-21 04:27:43 -08:00
Bhagyashri
2b1b3dfa4b
Update Dockerfile to use gcc-toolset-14 and fix test case failures on power (ppc64le) ( #28957 )
...
Signed-off-by: Bhagyashri <Bhagyashri.Gaikwad2@ibm.com >
2025-11-21 12:24:09 +00:00
Russell Bryant
cca2d2cdbe
[Core] Align whisper closer to other multimodal models ( #27292 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-11-21 12:01:54 +00:00
Cyrus Leung
aab0102a26
[V0 deprecation] Remove more V0 references ( #29088 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-21 11:56:59 +00:00
WeiQing Chen
b34129bf8e
[Misc] remove useless v1 env ( #29164 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-11-21 01:41:20 -08:00
Cyrus Leung
4d7231e774
Revert #28875 ( #29159 )
2025-11-21 01:40:17 -08:00
Huamin Li
8ac3a41487
[CI Failure] Fix Gemma3 RoPE configuration for sliding attention layers ( #29111 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-20 23:53:30 -08:00
Canlin Guo
7d6da483b0
[Minor][Clean] Remove the legacy assertion in video ( #29150 )
...
Signed-off-by: gcanlin <canlinguosdu@gmail.com >
2025-11-20 23:52:34 -08:00
Chenheli Hua
e4c3182c68
[Small] Capture AttributeError when checking ray dependency. ( #29024 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-11-20 22:54:10 -08:00
Alex Brooks
b4734b9550
[Bugfix] Fix default MM LoRA alignment for single str prompts ( #29140 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-11-21 13:32:30 +08:00
Jialin Ouyang
30b9c67743
Revert "[Redo] #26368 ( #28771 )" ( #29121 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-11-20 21:27:45 -08:00
Matthew Bonanni
11857a00b0
[Attention] Add ROCM_AITER_MLA_SPARSE to attention backend registry ( #29103 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-11-20 20:24:43 -08:00
Boyuan Feng
8c25f9cfb6
[BugFix] skip combo kernel on cpu ( #29129 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-11-21 11:50:59 +08:00
Cyrus Leung
56e96b37e4
[V0 Deprecation] Remove best_of ( #29090 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-11-21 11:40:40 +08:00
Qidong Su
698024ecce
[Doc] update installation guide regarding aarch64+cuda pytorch build ( #28875 )
...
Signed-off-by: Qidong Su <soodoshll@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-11-20 19:40:25 -08:00
jeremyteboul
0730414999
[Core] Add audio_embeds support to chat completions ( #29059 )
...
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com >
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com >
2025-11-21 11:39:47 +08:00
zhrrr
a982f5b5ea
[kernel][perf] support uncontiguous input for rms_norm kernel ( #28103 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-11-20 19:39:09 -08:00
Cyrus Leung
0e741c12e3
[Bugfix] Fix Plamo3 rope handling ( #29092 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-21 11:38:35 +08:00
Wentao Ye
56669c1f29
[CI] Fix mypy for vllm/v1/worker ( #29037 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-21 11:36:07 +08:00
Hongxia Yang
3f5f36da3f
[ROCm] Fix for import when building with upstream triton for gfx1100 for gpt-oss serving ( #29127 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-11-21 03:30:07 +00:00
Wentao Ye
e1eefa4c40
[Bug] Fix torch warning of tf32 usage ( #29112 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-21 01:54:59 +00:00
Xiao Li
ed6ae1e36a
[AITER] [ROCm] Fix crash when loading llama4 model with old aiter version installed, fallback to forward_native implementation ( #29124 )
...
Signed-off-by: Xiao Li <ilx@meta.com >
2025-11-20 17:54:35 -08:00
Jee Jee Li
9875be6431
[LoRA][2/2]Remove LoRA extra vocab ( #28545 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-21 09:46:43 +08:00
Wentao Ye
df44df0143
[Feature] Shared Experts Overlap with FI deepgemm swap kernel, 2.2% throughput improvement and 3.6% TTFT improvement ( #28879 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-20 18:41:49 -07:00
Michael Goin
87cbbdff63
Update model references for OLMo3 ( #29099 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-11-21 09:16:52 +08:00
Michael Goin
986ab5db63
[CI Bugfix] Fix Kernels DeepGEMM Test (H100) ( #29106 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-11-20 16:42:33 -08:00
Rob Mulla
dd39f91edb
[Doc] cleanup TPU documentation and remove outdated examples ( #29048 )
...
Signed-off-by: Rob Mulla <rob.mulla@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-21 00:05:59 +00:00
rasmith
c7a29d2c8d
[CI/Build] Remove skip global cleanup in test_struct_output_generate.py ( #29022 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-20 21:44:37 +00:00
rasmith
8237ab8a2b
[CI/Build] Skip lm-format-enforcer tests in test_struct_output_generate.py for now ( #29021 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-20 21:35:14 +00:00
Driss Guessous
3fd74189db
Fixes bench ( #29058 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-11-20 21:21:54 +00:00
rasmith
5e5a7eb16f
[CI/Build] Make test_attention_selector.py run tests on correct platform ( #29064 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Signed-off-by: rasmith <Randall.Smith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-20 20:45:56 +00:00
rasmith
3d84ef9054
[CI/Build][AMD] Skip if flash_attn_varlen_func not available in test_aiter_flash_attn.py ( #29043 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-20 20:39:49 +00:00
Software Developer
4d01b64284
[Bugfix] - Add Trace Headers to Beam Search Path ( #29100 )
...
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com >
2025-11-20 20:00:33 +00:00
Kevin H. Luu
114b0e2500
[chore] Update annotate release scripts ( #29077 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2025-11-20 10:22:40 -08:00
Or Ozeri
647464719b
[KVConnector][Core] Support cross-layer KV blocks ( #27743 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-11-20 19:09:59 +01:00
Pan Li
e5bfcb6a88
[BugFix][PD]: make example proxy usable with P2pNcclConnector ( #26628 )
...
Signed-off-by: PAN <1162953505@qq.com >
2025-11-20 17:38:31 +00:00
Alexei-V-Ivanov-AMD
22924383e1
Updating the mirror of test-amd.yaml as of 2025-11-18 ( #29016 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-11-20 12:07:06 -05:00
rookie
56f45eddaf
[Frontend] Optimize beam search loop by sorting and then splicing ( #19347 )
...
Signed-off-by: zhangguozhu <zhangguozhu@360.cn >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: zhangguozhu <zhangguozhu@360.cn >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-11-20 09:02:30 -08:00
TJian
82b05b15e6
[BugFix] [FEAT] Enable fastsafetensors for ROCm platform ( #28225 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-11-20 16:34:11 +00:00
Fanli Lin
a2e9ebe9e2
[BugFix] Fix flash_attn import in siglip2navit.py ( #29082 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
2025-11-20 12:14:29 +00:00
Zhewen Li
93c8672ceb
[Bugfix] Fix spec decode memory regression after #28549 ( #28819 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-11-20 19:05:50 +08:00
Samit
371b1d4c61
[RL] Add Pause and Resume Generation for Asynchronous RL Training ( #28037 )
...
Signed-off-by: SamitHuang <285365963@qq.com >
Signed-off-by: Samit <285365963@qq.com >
Signed-off-by: samithuang <285365963@qq.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-11-20 03:01:03 -08:00
Shinichi Hemmi
c9e093116c
[MODEL] Implement plamo3 ( #28834 )
...
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
2025-11-20 03:00:19 -08:00
Or Ozeri
c0c2dd1e0b
[BugFix] kv_offloading: Fix bug in loading of partial cpu blocks ( #28951 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-20 18:55:10 +08:00
Pleaplusone
06c20c9904
[ROCm] Add AMD GPU support on Deepseek v3.2 and SparseMLA ( #26670 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-11-20 02:54:01 -08:00
Anna Shors
6eb745d9bd
Add truncate arg to yarn to match openai implementation of gpt-oss ( #28244 )
...
Signed-off-by: ashors1 <ashors@nvidia.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-11-20 18:53:50 +08:00
cjackal
66483a9d00
[Chore] Update xgrammar version from 0.1.25 to 0.1.27 ( #28221 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-11-20 02:53:09 -08:00
Jinzhen Lin
edfe867208
[Misc] don't cache CUTLASS_REVISION var in CMakeLists.txt ( #28518 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2025-11-20 02:52:53 -08:00
Dezhan
dc45efc8ef
[BugFix] Fix Llama4 Pipeline Parallelism Assert Error ( #28577 )
...
Co-authored-by: Dezhan Tu <dztu@meta.com >
2025-11-20 02:52:36 -08:00
Vensen
fb8851f254
[Bugfix][cache_kernels]: Fix OOB in cache_kernels.cu ( #28760 )
...
Signed-off-by: vensen <vensenmu@gmail.com >
Signed-off-by: Vensenmu <vensenmu@gmail.com >
2025-11-20 02:52:02 -08:00
Boyuan Feng
a903d59ffa
cleanup at::Tag::needs_fixed_stride_order ( #28974 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-20 02:51:36 -08:00
rasmith
322cb02872
[CI/Build][AMD] Fix import errors in tests/kernels/attention ( #29032 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2025-11-20 17:48:09 +08:00
Wentao Ye
2c52c7fd9a
[Bug] Fix torch dynamo warning Dynamo detected a call to a functools.lru_cache ( #29038 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-20 16:52:23 +08:00
Bradley D
1e1c06789e
[ci][amd] fix EPLB execution test ( #28742 )
...
Signed-off-by: Bradley Davis <bradleyhd@meta.com >
2025-11-20 14:53:38 +07:00
Pleaplusone
7218f83992
[ROCm][BugFix] Fix shared expert loading error when disable VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS ( #28633 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-11-20 14:50:23 +07:00
Cyrus Leung
20e4497be2
[V0 Deprecation] Remove num_lookahead_slots ( #29000 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-11-20 06:39:10 +00:00
Quentin Gallouédec
1c7bcc55b8
[Frontend] Allow parsed tool arguments ( #28820 )
...
Signed-off-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 22:20:12 -08:00
Lukas Geiger
a9705a290a
[Model][QwenVL] Replace torch.repeat_interleave with faster np.repeat ( #28964 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-11-19 22:04:23 -08:00
Isotr0py
64192d5624
[Bugfix] Revert custom attention mask for gemma3-mm ( #28995 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-20 13:23:22 +08:00
Canlin Guo
fe25772aa9
[Bugfix] Handle broken frames in video loading ( #29001 )
...
Signed-off-by: gcanlin <canlinguosdu@gmail.com >
Signed-off-by: 凌葭 <lvjiang.lj@alibaba-inc.com >
Co-authored-by: 凌葭 <lvjiang.lj@alibaba-inc.com >
2025-11-20 04:38:12 +00:00
prashanth058
0cca9b4d13
[Bugfix] Fix precision loss in LoRA-wrapped RowParallelLinear by fusing bias into GEMM ( #28972 )
...
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com >
2025-11-20 03:50:37 +00:00
Shengliang Xu
a8c536829c
Consolidate Nvidia ModelOpt quant config handling for all quantization methods ( #28076 )
...
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com >
2025-11-19 22:39:36 -05:00
Benjamin Chislett
fcbcba6c70
[Feat] Iteration-level profiling for Torch and CUDA profiler ( #28987 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 19:17:48 -08:00
Fadi Arafeh
3168285fca
[cpu][ci] Add initial set of tests for Arm CPUs ( #28657 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-11-20 02:37:09 +00:00
Qiang Zhang
3fb0d90999
[AMD] Use Decoupled Kernel Block Size to Support AITER MLA block_size=1 ( #27715 )
...
Signed-off-by: chiangzhang <chiangzhang@tencent.com >
2025-11-20 02:11:52 +00:00
Kuntai Du
05c2dee7e9
[DeepSeek + LMCache Multiprocess] handle MLA for deepseek model + LMCache Multiprocess connector ( #29039 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-11-20 01:40:49 +00:00
liangel-02
1d642872a2
[torchao] fix safetensors for sharding ( #28169 )
...
Signed-off-by: Angel Li <liangel@meta.com >
2025-11-19 16:39:45 -08:00
Nick Hill
9ccef8e333
[Misc] Colorize logs ( #29017 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-19 19:26:04 -05:00
Jialin Ouyang
537cc635c7
[GC Debugger] Simplify and improve GC Debugger Utils ( #29029 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-11-20 00:10:22 +00:00
Wentao Ye
5031cd5d55
[Refactor] Optimize select_experts ( #28069 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-19 18:53:15 -05:00
Alexander Matveev
3aaa94ac99
[Performance] Reduce DeepGEMM N dim restriction from 128 to 64 multiplier ( #28687 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-11-19 15:47:13 -08:00
JartX
8e38e99829
[Feature] EPLB on Qwen3VLMoe and CompressedTensorsWNA16MoEMethod ( #28849 )
2025-11-19 18:30:08 -05:00
Wentao Ye
0075bfffd4
[CI] Fix precommit rope_theta issue ( #29040 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-19 14:22:43 -08:00
Max Hu
cb0a7b4bea
[Bugfix] Move flashinfer kernel check into `__init__` function of `FusedMoE` ( #29018 )
...
Signed-off-by: Max Hu <hyoung2991@gmail.com >
2025-11-19 21:54:15 +00:00
Lucas Wilkinson
8f4f77a727
[BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 ( #29036 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-11-19 13:43:54 -08:00
Micah Williamson
22e44ad589
[ROCm][CI] Fix Weight Loading With Multiple GPU Tests on ROCm ( #28984 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-11-19 21:31:33 +00:00
Yongye Zhu
88f5b19f0b
[DeepSeek] Fix DeepSeek V3.2 Rope Embedding ( #28968 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-11-19 16:30:04 -05:00
Shu Wang
613abb50d5
[MoE] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked ( #25990 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-11-19 13:29:06 -08:00
Julien Denize
cdeec2e606
[BugFix] Ray with multiple nodes ( #28873 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2025-11-19 21:20:58 +00:00
Wentao Ye
1607e664f0
[Bug] Fix Batch Invariant MLA test ( #28967 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-11-19 21:18:32 +00:00
Ryan Rock
68d7231991
[CI/Build] Fix test_prefix_prefill for AMD ( #28905 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2025-11-19 16:04:36 -05:00
Qiu
2fd893b4ce
[Feature] Prefill Context Parallel (PCP) basic support ( #28718 )
...
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com >
Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com >
Signed-off-by: LookAround <lixushi@huawei.com >
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com >
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com >
Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com >
Co-authored-by: LookAround <lixushi@huawei.com >
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com >
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com >
Co-authored-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com >
2025-11-19 15:52:44 -05:00
Izzy Putterman
02f5903b84
Eagle: MM Cuda Graphs with MRope ( #28896 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-19 15:01:05 -05:00
Aleksandr Malyshev
ac10fd3c69
Upstreaming aiter triton attention backend as a new backend ( #28701 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-11-19 19:59:30 +00:00
杰兮
9d2d561257
[Bugfix] Fix precision corruption when shared_experts_stream=None ( #28942 )
...
Signed-off-by: zhyajie <yajizhan@amd.com >
Co-authored-by: zhyajie <yajizhan@amd.com >
2025-11-19 19:30:57 +00:00
Robert Shaw
fe69f331f8
[Kernels] Improve H200 Fused MoE Config ( #28992 )
...
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-11-19 19:23:54 +00:00
Jialin Ouyang
3319a493fc
[Core] Reuse created spec tokens lists to mitigate GC cost ( #28917 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-11-19 19:20:22 +00:00
Copilot
61728cd1df
Re-enable FlashInfer for Llama4 on Blackwell in e2e fusion tests ( #28966 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-11-19 13:32:19 -05:00
Yuxuan Zhang
0c80efd94f
GLM-V video segmentation solution adjustment ( #28941 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-11-19 17:32:55 +00:00
Harry Mellor
a8b70304d6
Update rope_scaling to rope_parameters in preparation for Transformers v5 ( #28542 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-19 09:06:36 -08:00
Shanshan Shen
d44e9df7d4
[Model][Mamba] Add selector for mamba attention backend and make it pluggable for other device ( #26487 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-11-19 16:24:55 +00:00
Lucas Wilkinson
48fc8b1e59
[BugFix] Fix async-scheduling + FlashAttn MLA ( #28990 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-11-19 10:04:07 -05:00
vnadathur
1ffe934c8a
[torch.compile] caching of config fields should be opt-out by default ( #26468 )
...
Signed-off-by: vnadathur <glvikramn@gmail.com >
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com >
Signed-off-by: Srreyansh Sethi <srreyansh.sethi@gmail.com >
Signed-off-by: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com >
Co-authored-by: WorldExplored <srreyansh.sethi@gmail.com >
Co-authored-by: Srreyansh Sethi <107075589+worldexplored@users.noreply.github.com >
Co-authored-by: vnadathur <236933696+vnadathur@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-11-19 06:13:54 -08:00
Yanan Cao
2c8b9182b5
[CI] Reorganize compile tests so new tests are automatically included in CI ( #28625 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2025-11-19 06:13:50 -08:00
Harry Mellor
4f5299f717
Relax Transformers modeling backend MoE experts check ( #28952 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-19 21:50:30 +08:00
Didier Durand
09540cd918
[Doc]: fix typos in various files ( #29010 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-11-19 04:56:21 -08:00
Chen Bruce
da2f6800e0
[Feat][Perf] Enable deepep-low-latency with round-robin expert placement. ( #28449 )
...
Signed-off-by: bruceszchen <bruceszchen@tencent.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-19 13:46:24 +01:00
Tova Movshovitz
ba558c029a
[config] Expose get_total_num_hidden_layers() in ModelConfig ( #28961 )
...
Signed-off-by: tovam <tovam@pliops.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-19 11:37:11 +00:00
Harry Mellor
97cfa99d59
[Docs] Take env var definition out of folded admonition ( #29005 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-11-19 03:32:04 -08:00
j20120307
bbc6c2f1e5
[CI/Build] Fix broken build on Apple M1 ( #28999 )
...
Signed-off-by: Kan Zhu <j20120307@gmail.com >
2025-11-19 11:07:22 +00:00
ihb2032
8151609583
refactor(cpu_types_scalar.hpp): Unify scalar loop implementations using unroll_loop ( #28847 )
...
Signed-off-by: ihb2032 <1355790728@qq.com >
Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn >
2025-11-19 11:05:44 +00:00
Michael Yao
fdf93486d6
[Docs] Clean up moe_kernel_features.md ( #28530 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-11-19 02:35:29 -08:00
gnovack
d69062c67a
add support for --fully-sharded-loras in fused_moe ( #28761 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-11-19 16:32:00 +08:00
Louie Tsai
ae4821a108
Add CPU support model ( #28697 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-11-18 23:47:57 -08:00
Didier Durand
7ed27f3cb5
[Doc]: fix typos in various files ( #28945 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-11-18 22:52:30 -08:00
Michael Goin
a4511e38db
Speed up macOS smoke test ( #28954 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-11-18 22:46:32 -08:00
Roman Solomatin
71d0ae1c54
[Misc] Update embedding/cross encoder tests to use mteb v2 ( #27329 )
...
Signed-off-by: Roman Solomatin <36135455+Samoed@users.noreply.github.com >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: wang.yuqi <noooop@126.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-11-18 22:28:40 -08:00
Lukas Geiger
3d4e7d34be
[Model][QwenVL] Simplify cos/sin rotary embedding indexing ( #28962 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-11-19 05:43:01 +00:00
Uranus
6a25ea5f0e
[Docs] Update oneshot imports ( #28188 )
...
Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com >
2025-11-19 05:30:08 +00:00
Gleb Kurchanov
73ff872db0
[Bugfix] Fix typo in Qwen3 Next model executor ( #28960 )
...
Signed-off-by: Gleb Kurchanov <nepherpitou@gmail.com >
2025-11-19 05:21:02 +00:00
Xin Yang
468a8d72ba
[Bugfix] Fix FusedMoEModularKernel for triton backend ( #28913 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2025-11-19 13:05:22 +08:00
Matthew Bonanni
4c23690f43
[Attention] FlashAttention ViT support, make default backend ( #28763 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-11-18 20:06:21 -08:00
Strahinja Stamenkovic
814843e021
Enable bitsandbytes quantization on AMD GPUs that use warp size 32 ( #27307 )
...
Signed-off-by: sstamenk <strahinja.stamenkovic@amd.com >
2025-11-19 03:12:31 +00:00
Li, Jiang
20852c8f4c
[CPU] Refactor CPU WNA16 ( #28826 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-11-19 10:32:00 +08:00
Jialin Ouyang
40b6b38f2c
[Core] Switch Flat logprob control from environment variable to SamplingParams ( #28914 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-11-19 02:10:02 +00:00
Jerry Zhang
da94c7c0eb
Move online quantization to model.load_weights ( #26327 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-11-18 16:52:41 -08:00
tomeras91
1395461f5f
[Hybrid][torch.compile] Refactor mamba2 forward to avoid obscuring linear projections under custom op ( #28587 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2025-11-18 16:49:36 -08:00
Varun Sundar Rabindranath
9912b8ccb8
[Build] Add OpenAI triton_kernels ( #28788 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-11-18 16:45:20 -08:00
Johnny
49ef847aa8
[NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 ( #28938 )
...
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: Johnny <johnnynuca14@gmail.com >
2025-11-18 16:44:27 -08:00
Michael Goin
67745d189f
Suppress verbose logs from model_hosting_container_standards ( #28949 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-11-18 12:29:06 -08:00
Kunshang Ji
2a2d5d2780
Replace torch.cuda.Event with torch.Event for better hardware compatibility ( #26985 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-11-18 11:34:36 -08:00
Chendi.Xue
c3e2978620
[NIXL] fix cpu PD after physical <> logical block_size PR ( #28904 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-11-18 14:03:23 -05:00
Isotr0py
e4bb2684bc
[Models] Replace all nn.Conv2d with vLLM's Conv2dLayer ( #28842 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-18 18:56:04 +00:00
Kevin H. Luu
c64c0b78de
[chore] Move the rest of wikimedia url to S3 ( #28921 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-18 09:44:18 -08:00
vllmellm
0af3d4f0df
[FEAT] [AITER] [ROCm] integrate aiter sampling ops ( #26084 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-11-18 17:28:34 +00:00
Nick Hill
da8dadf68b
[Minor] Rename ec_producer field to is_ec_producer ( #28884 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-11-18 17:26:07 +00:00
Nicolò Lucchesi
f226a3f0c1
[CI][NIXL] Change default block_size for tests ( #28927 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-11-18 09:22:30 -08:00
Luciano Martins
c2612371ad
[Model] Add Gemma3 GGUF multimodal support ( #27772 )
...
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-11-18 08:56:29 -08:00
Ido Segev
49a986ecd4
[Benchmark] multi_turn: Report warmup-inclusive runtime ( #28937 )
...
Signed-off-by: Ido Segev <idos@pliops.com >
2025-11-18 16:38:22 +00:00
Alex
f6aa122698
[CI Sprint] Quantization CI Cleanup ( #24130 )
...
Signed-off-by: Alex Yun <alexyun04@gmail.com >
2025-11-18 09:21:48 -05:00
Nicolò Lucchesi
184b12fdc6
[Bugfix][NIXL] Fix block_size_ratio when logical !=physical blocks ( #28925 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-11-18 22:07:50 +08:00
Canlin Guo
b9489f51e1
[Model][Perf] Use cos and sin cache in QwenVL ( #28798 )
...
Signed-off-by: gcanlin <canlinguosdu@gmail.com >
2025-11-18 11:51:54 +00:00
Song Zhixin
285eaa4285
[Bugfix] Safeguard against missing backend in AttentionBackendEnum ( #28846 )
...
Signed-off-by: jesse <szxfml@gmail.com >
Signed-off-by: Song Zhixin <szxfml@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-18 10:53:44 +00:00