whx
4ca62a0dbd
[PluggableLayer][1/N] Define PluggableLayer (#32331)
...
Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-01-20 16:19:21 +00:00
linhaifeng
7901109ea5
[Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculation (#32603)
...
Signed-off-by: linhaifeng <1371675203@qq.com>
2026-01-20 11:13:39 -05:00
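The title does not say where the off-by-one was. As a hedged illustration only, assuming `_num_tokens_to_min_blocks` converts a token count into the minimum number of fixed-size KV-cache blocks, the standard integer form is ceiling division. Both functions below are hypothetical and are not vLLM's code; the second shows one common way such a calculation goes wrong.

```python
def num_blocks_for_tokens(num_tokens: int, block_size: int) -> int:
    """Hypothetical helper: smallest number of blocks that holds num_tokens."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    # Ceiling division in pure integers: (n + b - 1) // b.
    return (num_tokens + block_size - 1) // block_size


def num_blocks_off_by_one(num_tokens: int, block_size: int) -> int:
    # Hypothetical buggy variant: allocates one extra block whenever
    # num_tokens is an exact multiple of block_size (and for num_tokens == 0).
    return num_tokens // block_size + 1


for n in (0, 15, 16, 17, 32):
    print(n, num_blocks_for_tokens(n, 16), num_blocks_off_by_one(n, 16))
```

The two variants agree except at exact multiples of the block size, which is why this kind of bug tends to surface only for specific prompt lengths.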
YiSheng5
13f6630a9e
[XPU]Support AgRsAll2AllManager on XPU device (#32654)
...
Signed-off-by: yisheng <yi.sheng@intel.com>
2026-01-20 14:27:24 +00:00
Cyrus Leung
fda3f03eb2
[4/N] Initialize MM components in context managers (M-P) (#32663)
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 14:06:32 +00:00
杨朱 · Kiki
bb9172030e
[Metrics] Complete removal of deprecated vllm:time_per_output_token_seconds metric (#32661)
...
This PR completes the removal of the deprecated vllm:time_per_output_token_seconds
metric that was deprecated in v0.11, hidden in v0.12, scheduled for removal in v0.13,
but delayed until v0.15.
Signed-off-by: carlory <baofa.fan@daocloud.io>
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
2026-01-20 12:28:41 +00:00
Chauncey
c4e5bdf61b
[Bugfix] Fix the fp8_mqa_logits dim mismatch (#32652)
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-20 18:48:07 +08:00
Cyrus Leung
7f1bcd18ff
[3/N] Initialize MM components in context managers (I-L) (#32650)
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 10:21:56 +00:00
Walter Beller-Morales
8be263c3fb
[Core] Cleanup shm based object store on engine shutdown (#32429)
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com>
2026-01-20 08:53:37 +00:00
Cyrus Leung
e1a34c3a5d
[2/N] Initialize MM components in context managers (E-H) (#32641)
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 08:12:56 +00:00
vllmellm
148117ea2e
[Refactor] Make FP8 Linear Ops use kernel abstraction (#27814)
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-20 14:48:20 +08:00
Woosuk Kwon
e9c83cdc51
[Model Runner V2] Skip kernel launch for penalties & logit_bias (#32634)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-19 22:20:19 -08:00
Cyrus Leung
b75e85dede
[1/N] Initialize MM components in context managers (A-D) (#32632)
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 14:12:42 +08:00
Cyrus Leung
4753f3bf69
[Model] Use context managers for encoder- and LM-only mode (#32605)
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-20 11:43:38 +08:00
Woosuk Kwon
6c01ffb897
[Model Runner V2] Decouple temperature from penalties (#32629)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-19 19:13:24 -08:00
Woosuk Kwon
7b7cdce968
[Model Runner V2] Refactor get_cudagraph_and_dp_padding (#32625)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-19 18:25:02 -08:00
Jackmin801
12dab78f49
[Feat] allow inplace loading lora (#31326)
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com>
Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-20 10:15:20 +08:00
Woosuk Kwon
05dc4bfab6
[Model Runner V2] Initialized communication buffer for DP (#32624)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-19 17:27:06 -08:00
Matthew Bonanni
1a1fc3bbc0
[Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill (#32615)
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-19 18:41:34 -05:00
Woosuk Kwon
43fada5360
[Model Runner V2] Refactor dummy_run (#32533)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-19 14:50:59 -08:00
Tomas Ruiz
4a5299c93f
feat: spec decode with draft models (#24322)
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
2026-01-19 16:05:46 -05:00
lon
73f2a81c75
docs: prefix caching seems quite outdated (#28784)
...
Signed-off-by: lon <114724657+longregen@users.noreply.github.com>
Signed-off-by: Russell Bryant <russell.bryant@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <russell.bryant@gmail.com>
2026-01-19 11:49:52 -08:00
jiahanc
7350331718
[BugFix] Fix TRT-LLM NVFP4 DP/EP (#32349)
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-19 14:32:24 -05:00
Yanan Cao
9d1e611f0e
[CI] Add Helion as an optional dependency (#32482)
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2026-01-19 19:09:56 +00:00
Vadim Gimpelson
0727cc9ecf
[BUGFIX] Fix test_mla_backends.py. Scale MLA projection weights to prevent numerical instability (#32529)
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-01-19 13:49:29 -05:00
qli88
a0490be8f1
[CI][amd] Revert NIXL connector change to avoid crash (#32570)
...
Signed-off-by: Qiang Li <qiang.li2@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
2026-01-19 18:39:16 +00:00
Netanel Haber
cd3ac5b797
support dynamic resolution image encoding for Nemotron Nano VL (#32121)
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2026-01-19 18:15:58 +00:00
Jee Jee Li
2636d76257
[Misc] Remove unused ModelKeys (#32608)
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-19 17:34:59 +00:00
danisereb
aa7f37ccfa
Add support for LoRA adapters in Nemotron-H models (#30802)
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
2026-01-19 22:30:44 +08:00
wang.yuqi
c88860d759
[Frontend] Score entrypoint support data_1 & data_2 and queries & documents as inputs (#32577)
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-19 14:07:46 +00:00
Nicolò Lucchesi
758df5afe7
[NIXL][Metrics] Track nixl_num_kv_expired_reqs metric in Prometheus (#32340)
...
Add a new metric to track the number of requests whose KV blocks
expired. This scenario is important to surface and track because it is a
vital indicator of the health of the deployment.
Currently we resort to tracking these failures through unstructured log
parsing (which is, among other things, dependent on the exact error string); on current main:
> Releasing expired KV blocks for request cmpl-071d which were retrieved by 0 decode worker(s) within 0 seconds.
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-19 12:28:27 +00:00
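A counter turns the log line quoted in this commit into a value a scraper can read directly, with no dependence on the message text. The sketch below is a stdlib-only stand-in for a Prometheus client-library `Counter`; the class, the `on_kv_blocks_expired` hook, and the unprefixed metric name are hypothetical and may not match what the commit actually exports.

```python
import threading


class ExpiredKVCounter:
    """Hypothetical monotonic counter rendered in Prometheus text format."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        # Counters only go up, so negative increments are rejected.
        if amount < 0:
            raise ValueError("counter increments must be non-negative")
        with self._lock:
            self._value += amount

    @property
    def value(self) -> int:
        with self._lock:
            return self._value

    def expose(self) -> str:
        # HELP line, TYPE line, then one sample, as a scraper would read it.
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self.value}\n"
        )


expired = ExpiredKVCounter(
    "nixl_num_kv_expired_reqs",
    "Requests whose KV blocks expired before a decode worker retrieved them.",
)


def on_kv_blocks_expired(request_id: str) -> None:
    # Hypothetical hook: called where the connector logs
    # "Releasing expired KV blocks for request ...".
    expired.inc()


on_kv_blocks_expired("cmpl-071d")
print(expired.expose())
```

An alert can then fire on the counter's rate of increase instead of on a regex over log output.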
Daniel Mescheder
cdd03d25d3
[CI/Build] Fix dependency conflict between model-hosting-container-standards and starlette (#32560)
...
Signed-off-by: Daniel Mescheder <dmesch@amazon.com>
Co-authored-by: Daniel Mescheder <dmesch@amazon.com>
2026-01-19 03:27:08 -08:00
Nicolò Lucchesi
74c583bc50
[Core] Whisper support torch.compile (#30385)
...
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-19 10:02:31 +00:00
Andreas Karatzas
c0a350ca73
[ROCm][CI] Add ROCm attention backend support for EAGLE DP tests (#32363)
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-19 09:57:54 +00:00
Yuxuan Zhang
71832ba71e
[GLM-4.7] GLM Model support for GLM-Lite (#31386)
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Yuxuan Zhang <2448370773@qq.com>
2026-01-19 01:18:38 -08:00
Matt
11bbf86f6a
[CI][Hardware][AMD] Fix test_rotary_embedding_mla_cache_fused (#32408)
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
2026-01-19 08:25:47 +00:00
Hyunkyun Moon
3c8740aacb
[Frontend] Add render endpoints for prompt preprocessing (#32473)
...
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com>
Signed-off-by: Hyunkyun Moon <mhg5303@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-19 12:21:46 +08:00
Alex Brooks
7518a3dc65
[CI/Build] Use Common Event Map Fixture in Harmony / MCP Server Tests (#32531)
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2026-01-19 04:05:51 +00:00
honglyua
976af2f314
[BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration (#32462)
2026-01-19 03:06:02 +00:00
Woosuk Kwon
9a1f16da1e
[Model Runner V2] Refactor update_states (#32562)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-18 17:32:42 -08:00
Woosuk Kwon
bb1848cd62
[Model Runner V2] Support VLM (#32546)
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-18 16:58:51 -08:00
Vadim Gimpelson
6101a26dc9
[BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 (#32417)
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-01-18 16:57:32 -08:00
Iryna Boiko
f5d1740030
[Bugfix] Add OOT backend option (#32471)
...
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
2026-01-18 22:20:39 +00:00
Wentao Ye
eebc58df0c
[Refactor] Remove unused cutlass moe problem size function (#32047)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-18 12:46:59 -08:00
Wentao Ye
16de822c71
[Refactor] Remove unused file pallas_kv_cache_update.py (#32433)
...
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-18 12:46:39 -08:00
Deming
5480c6b1fa
[Doc] Correct comment for _jobs dict in OffloadingConnectorWorker (#32556)
2026-01-18 12:46:00 -08:00
Andrey Khalyavin
ba29ab441e
Use the same memory for workspace13 and fused_output. (#31531)
...
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru>
2026-01-18 19:14:22 +00:00
Robert Shaw
afc3622602
[CI] Move Distributed Tests from H200 -> H100 (#32555)
2026-01-18 10:25:23 -08:00
bnellnm
327a02d8db
[MoE Refactor] Separate Router into OO Classes (#30623)
...
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-01-18 11:40:49 -05:00
tjp_zju
2f03035a61
"refactor: refactor_repeated_interfaces" ( #32486 )
...
Signed-off-by: tom-zju <tanjianpingzju1990@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-18 22:07:01 +08:00
Isotr0py
38bf2ffb21
[Bugfix] Fix GLM-ASR audio encoder RoPE dim (#32540)
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-18 19:17:59 +08:00