Or Ozeri
8e32690869
[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads ( #32255 )
...
Fixes a not-yet-reported case where it was possible for blocks to be
freed by an abort before an async transfer completed, resulting
in corrupted KV data.
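The idea in the title (delay freeing the blocks of an aborted request until its async load has finished) can be sketched roughly as below. This is a toy model under stated assumptions, not the scheduler code from the PR: the class `DeferredFreeScheduler`, its method names, and the use of plain integers as block IDs are all hypothetical.

```python
class DeferredFreeScheduler:
    """Toy model of deferred freeing; does not mirror vLLM's real classes."""

    def __init__(self):
        self.free_blocks = set()   # block ids available for reuse
        self.blocks_of = {}        # request id -> list of block ids
        self.loading = set()       # requests with an async load in flight
        self.pending_free = set()  # aborted while loading; freed on completion

    def start_async_load(self, req_id, blocks):
        # Record the blocks an in-flight transfer is still writing into.
        self.blocks_of[req_id] = list(blocks)
        self.loading.add(req_id)

    def abort(self, req_id):
        if req_id in self.loading:
            # The transfer may still write to these blocks, so freeing them
            # now could let another request reuse them and read corrupt data.
            self.pending_free.add(req_id)
            return
        self._free(req_id)

    def on_load_finished(self, req_id):
        self.loading.discard(req_id)
        if req_id in self.pending_free:
            # The abort arrived earlier; it is now safe to release the blocks.
            self.pending_free.discard(req_id)
            self._free(req_id)

    def _free(self, req_id):
        self.free_blocks.update(self.blocks_of.pop(req_id, []))
```

In this sketch an abort that arrives mid-transfer only marks the request; the blocks return to the free pool when the completion callback fires.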
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-02-04 11:16:34 +00:00
Zhengxu Chen
a208439537
[compile] Remove runner type from ignored caching factor list. ( #33712 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-04 10:56:45 +00:00
Zhengxu Chen
bcd2f74c0d
[compile] Clean up AOT compile bypass on evaluate_guards. ( #33578 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-04 02:12:53 -08:00
Kunshang Ji
f79f777803
[XPU][2/N] add support unquantized moe support for xpu ( #33659 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-04 02:12:25 -08:00
Augusto Yao
4c8d1bf361
use ORJSONResponse when available to improve the efficiency of request process ( #33548 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-02-04 10:04:11 +00:00
Kunshang Ji
061da6bcf7
[XPU] remove common path warning log ( #33769 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-04 16:40:17 +08:00
zhanqiuhu
4403e3ed4c
[Metrics] Add labeled prompt token metrics for P/D disaggregation ( #33290 )
...
Add labeled Prometheus metrics to distinguish where prompt tokens come
from in P/D disaggregated deployments.
In P/D disaggregation, decode instances receive KV cache from prefill instances.
Currently, decode reports inflated prompt throughput because it counts all
prompt tokens as "computed", even though most were transferred.
This PR adds labeled metrics so users can understand actual compute work vs
transferred work:
vllm:prompt_tokens_by_source_total{source="local_compute"} # Tokens prefilled locally
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer
vllm:prompt_tokens_by_source_total{source="local_cache_hit"} # Tokens from local prefix cache
vllm:prompt_tokens_cached_total # Total cached (local + external, -1 when all
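As a rough illustration of how the labeled counter above might be consumed, the sketch below parses Prometheus text-format output and reports the share of prompt tokens per source. Only the metric name and the three `source` label values come from the commit message; the sample values, the helper names `tokens_by_source` and `source_shares`, and the assumption that each sample sits on one line without a timestamp are hypothetical.

```python
import re

# Hypothetical scrape output; the numbers are made up for illustration.
SAMPLE = """\
vllm:prompt_tokens_by_source_total{source="local_compute"} 200
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} 700
vllm:prompt_tokens_by_source_total{source="local_cache_hit"} 100
"""

# Matches one sample line of the labeled counter: name, {labels}, value.
LINE = re.compile(r'^vllm:prompt_tokens_by_source_total\{([^}]*)\}\s+(\S+)\s*$')
SOURCE = re.compile(r'source="([^"]*)"')


def tokens_by_source(text):
    """Sum counter values per `source` label, ignoring any other labels."""
    totals = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        source = SOURCE.search(match.group(1))
        if not source:
            continue
        key = source.group(1)
        totals[key] = totals.get(key, 0.0) + float(match.group(2))
    return totals


def source_shares(totals):
    """Convert per-source totals into fractions of all prompt tokens."""
    total = sum(totals.values())
    return {k: v / total for k, v in totals.items()} if total else {}
```

On a decode instance, a large `external_kv_transfer` share would indicate that most prompt tokens arrived via KV transfer rather than local prefill, which is the distinction the commit message describes.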
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-04 07:46:48 +00:00
Matt
08e094997e
[Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism ( #32745 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-02-04 14:51:33 +08:00
Wentao Ye
d88a1df699
[Deprecation] Deprecate profiling envs ( #33722 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-04 05:58:21 +00:00
Cyrus Leung
90d74ebaa4
[Deprecation] Remove _get_data_parser in MM processor ( #33757 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-04 05:51:52 +00:00
Frank Wang
45f8fd6f97
[Feature] Enable TRITON_ATTN for Batch Invariance ( #33688 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2026-02-04 13:27:34 +08:00
Wentao Ye
5e1e0a0fbd
[Refactor] Remove unused dead code ( #33718 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-03 21:25:11 -08:00
Michael Goin
eb5ed20743
[Bugfix] Define router_logits_dtype for remaining MoE models ( #33737 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-04 13:24:14 +08:00
Huy Do
2647163674
Save startup benchmark results as a list of values ( #33629 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-02-03 20:37:51 -08:00
Shanshan Shen
9fb27dd3b3
[MM] Align the prefix of MMEncoderAttention with Attention ( #33750 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-02-04 04:07:30 +00:00
R3hankhan
4dffc5e044
[CPU] Split attention dispatch by head_dim alignment ( #32161 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-03 19:37:15 -08:00
Andrew Xia
e1bf04b6c2
[1/N] Initial Implementation of Parser for ResponsesAPI ( #32712 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-04 10:59:03 +08:00
Isotr0py
02080179a3
[Bugfix] Fix torchrun PP broadcast deadlock with async scheduling ( #33701 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-04 02:17:37 +00:00
wang.yuqi
1b8fe6f7c4
[Frontend][4/n] Make pooling entrypoints request schema consensus | ScoreRequest ( #33060 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-04 01:48:40 +00:00
Nick Hill
52ee21021a
[BugFix][Spec Decoding] Fix negative accepted tokens metric crash ( #33729 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-03 23:34:41 +00:00
Wentao Ye
655efb3e69
[Dependency] Remove comments of ray in dependency files ( #33351 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-03 15:30:47 -08:00
Matthew Bonanni
bd8da29a66
[Bugfix] Fix sparse MLA metadata building ( #33579 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-03 15:29:48 -08:00
Michael Goin
2a99c5a6c8
[Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3 ( #33613 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-03 13:26:51 -08:00
Patrick von Platen
3f7662d650
[Voxtral Realtime] Change name ( #33716 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-02-03 13:03:28 -08:00
Vadim Gimpelson
a372f3f40a
[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 ( #33257 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-03 15:10:31 -05:00
Harry Mellor
61e632aea1
Turn @config into a dataclass_transform ( #31541 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 17:40:59 +00:00
Richard Zou
b1bb18de8d
[torch.compile] Significantly speed up cold start times ( #33641 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-03 09:12:11 -08:00
Lucas Wilkinson
2267cb1cfd
[Attention][FA3] Update FA3 to include new swizzle optimization ( #23465 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-03 08:08:47 -08:00
dtc
0d6ccf68fa
[P/D] rework mooncake connector and introduce its bootstrap server ( #31034 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-02-03 08:08:25 -08:00
Cyrus Leung
18e7cbbb15
[Bugfix] Fix startup hang for Granite Speech ( #33699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 15:57:56 +00:00
Patrick von Platen
f0d5251715
[Voxtral models] Skip warm-up to skip confusing error message in warm-up ( #33576 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-03 07:22:34 -08:00
Shanshan Shen
5c4f2dd6ef
[MM] Pass prefix parameter to MMEncoderAttention ( #33674 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-02-03 06:47:41 -08:00
wang.yuqi
f3d8a34671
[Bugfix] Do not add extra \n for image-only cases when constructing multimodal text prompts. ( #33647 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-03 06:43:47 -08:00
shaharmor98
4bc913aeec
Feat/add nemotron nano v3 tests ( #33345 )
2026-02-03 08:52:49 -05:00
Kuntai Du
fbb3cf6981
[Bugfix][Async][Connector] avoid vllm-side double free during async scheduling + request abort + async KV cache transfer ( #33377 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2026-02-03 21:50:15 +08:00
Krish Gupta
2df2b3499d
Document NixlConnector backend selection via kv_connector_extra_config ( #33552 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-03 05:49:59 -08:00
Harry Mellor
2a8d84e66d
Fix Gemma3n audio encoder for Transformers v5 ( #33673 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 05:49:49 -08:00
zxy
a3acfa1071
[Models] Intern-S1-Pro ( #33636 )
...
Signed-off-by: zxy <zhou0493@e.ntu.edu.sg >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-03 05:49:45 -08:00
Harry Mellor
be8168ff88
Fix Gemma3 GGUF for Transformers v5 ( #33683 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 12:36:53 +00:00
Harry Mellor
f6af34626d
Fix offline test for Transformers v5 ( #33682 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 12:07:24 +00:00
Song Zhixin
ceab70c89d
[Bugfix] fix qwen3-asr response error ( #33644 )
...
Signed-off-by: jesse <szxfml@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-03 03:33:56 -08:00
Cyrus Leung
52683ccbe1
[Misc] Update default image format of encode_base64 ( #33656 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 03:13:16 -08:00
Michael Goin
e346e2d056
[Bugfix] Disable RoutingMethodType.[Renormalize,RenormalizeNaive] TRTLLM per-tensor FP8 MoE ( #33620 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-03 10:37:15 +00:00
Cyrus Leung
83449a5ff0
[Refactor] Clean up pooling serial utils ( #33665 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 10:29:18 +00:00
Lucas Hänke de Cansino
dad2d6a590
[Bugfix][Model] Fix DeepSeek-OCR-2 chat template to include BOS token ( #33642 )
...
Signed-off-by: l4b4r4b4b4 <lucas.cansino@mail.de >
2026-02-03 00:35:58 -08:00
Isotr0py
32e84fa1ff
[CI/Build] Investigate torchrun distributed tests hanging issue ( #33650 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-03 15:49:17 +08:00
Richard Zou
fd9c83d0e0
[torch.compile] Document the workaround to standalone_compile failing ( #33571 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-03 07:16:55 +00:00
杨朱 · Kiki
b95cc5014d
[Misc] Remove deprecated VLLM_ALL2ALL_BACKEND environment variable ( #33535 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-03 15:01:59 +08:00
Nick Hill
61397891ce
[Minor] Some code simplification in scheduler.py ( #33597 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-03 15:00:00 +08:00
杨朱 · Kiki
ef248ff740
[Misc] Remove deprecated profiler environment variables ( #33536 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-03 14:58:44 +08:00