Cyrus Leung
80f921ba4b
[Bugfix] Fix normalize still being passed to PoolerConfig (#33794)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-04 23:56:02 +08:00
Or Ozeri
8e32690869
[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads (#32255)
Fixes a not-yet-reported case where it was possible for blocks to be
freed by an abort before an async transfer completed, resulting
in corrupted KV data.
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-02-04 11:16:34 +00:00
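The commit message describes the fix only at a high level. Below is a minimal, hypothetical sketch of the general idea: when an aborted request still has an async KV load in flight, its blocks are parked rather than freed, and they are released only once the load completes. `Request`, `BlockPool`, `Scheduler`, `async_load_pending` and `on_load_finished` are illustrative names chosen for this sketch. They are not vLLM's actual scheduler API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record: the KV block ids it holds, and whether an
    # async KV load into those blocks is still in flight.
    req_id: str
    block_ids: list
    async_load_pending: bool = False


class BlockPool:
    """Toy free-list of KV cache block ids (illustrative only)."""

    def __init__(self, num_blocks: int):
        self.free = set(range(num_blocks))

    def allocate(self, n: int) -> list:
        if n > len(self.free):
            raise RuntimeError("not enough free blocks")
        ids = sorted(self.free)[:n]
        self.free.difference_update(ids)
        return ids

    def release(self, ids) -> None:
        self.free.update(ids)


class Scheduler:
    """Hypothetical scheduler that delays freeing blocks of aborted async loads."""

    def __init__(self, pool: BlockPool):
        self.pool = pool
        # Aborted requests whose blocks must not be reused yet.
        self.deferred_free = {}

    def abort(self, req: Request) -> None:
        if req.async_load_pending:
            # The transfer may still write into these blocks. Freeing them now
            # could hand them to another request and corrupt its KV data.
            self.deferred_free[req.req_id] = req
        else:
            self.pool.release(req.block_ids)

    def on_load_finished(self, req_id: str) -> None:
        # Called when the connector reports the async load as complete.
        # Only aborted requests parked by abort() are freed here.
        req = self.deferred_free.pop(req_id, None)
        if req is not None:
            req.async_load_pending = False
            self.pool.release(req.block_ids)


pool = BlockPool(4)
sched = Scheduler(pool)
req = Request("a", pool.allocate(2), async_load_pending=True)
sched.abort(req)             # blocks 0 and 1 stay allocated
sched.on_load_finished("a")  # now all four blocks are free again
```

The key design choice here is that `abort()` never frees blocks that might still be written by an in-flight transfer. The freeing step is moved to the transfer-completion path instead.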
zhanqiuhu
4403e3ed4c
[Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290)
Add labeled Prometheus metrics to distinguish where prompt tokens come
from in P/D disaggregated deployments.
In P/D disaggregation, decode instances receive KV cache from prefill instances.
Currently, decode reports inflated prompt throughput because it counts all
prompt tokens as "computed", even though most were transferred.
This PR adds labeled metrics so users can understand actual compute work vs
transferred work:
vllm:prompt_tokens_by_source_total{source="local_compute"} # Tokens prefilled locally
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer
vllm:prompt_tokens_by_source_total{source="local_cache_hit"} # Tokens from local prefix cache
vllm:prompt_tokens_cached_total # Total cached (local + external, -1 when all
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
2026-02-04 07:46:48 +00:00
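Assuming that the three `source` labels partition a request's prompt tokens, the locally computed count is what remains after prefix-cache hits and KV-transferred tokens are subtracted. Below is a minimal, hypothetical sketch of that accounting. It is not vLLM's implementation. `split_prompt_tokens` and `render` are illustrative names, and the 1000/200/790 token counts are made up. `vllm:prompt_tokens_cached_total` is left out because its definition is cut off in the message above. Note that `collections.Counter` is used only as a running tally; it is not the Prometheus client's `Counter`.

```python
from collections import Counter

# Label values taken from the commit message above.
SOURCES = ("local_compute", "external_kv_transfer", "local_cache_hit")


def split_prompt_tokens(num_prompt_tokens: int,
                        num_local_cache_hit: int,
                        num_external_kv: int) -> dict:
    """Attribute one request's prompt tokens to a source label (hypothetical helper)."""
    local_compute = num_prompt_tokens - num_local_cache_hit - num_external_kv
    if local_compute < 0:
        raise ValueError("cached + transferred tokens exceed the prompt length")
    return {
        "local_compute": local_compute,
        "external_kv_transfer": num_external_kv,
        "local_cache_hit": num_local_cache_hit,
    }


def render(totals: Counter) -> str:
    """Render running totals as Prometheus-style sample lines (no HELP/TYPE)."""
    return "\n".join(
        f'vllm:prompt_tokens_by_source_total{{source="{src}"}} {totals[src]}'
        for src in SOURCES
    )


totals = Counter()
# A decode instance that received most of a 1000-token prompt via KV transfer.
totals.update(split_prompt_tokens(1000, num_local_cache_hit=200, num_external_kv=790))
print(render(totals))
```

With the labels split this way, a decode instance's throughput can be computed from the `local_compute` series alone. Transferred and cached tokens are still visible, but under their own labels.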
Frank Wang
45f8fd6f97
[Feature] Enable TRITON_ATTN for Batch Invariance (#33688)
Signed-off-by: frankwang28 <frank.wbb@hotmail.com>
2026-02-04 13:27:34 +08:00
R3hankhan
4dffc5e044
[CPU] Split attention dispatch by head_dim alignment (#32161)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2026-02-03 19:37:15 -08:00
Andrew Xia
e1bf04b6c2
[1/N] Initial Implementation of Parser for ResponsesAPI (#32712)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2026-02-04 10:59:03 +08:00
Isotr0py
02080179a3
[Bugfix] Fix torchrun PP broadcast deadlock with async scheduling (#33701)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-04 02:17:37 +00:00
wang.yuqi
1b8fe6f7c4
[Frontend][4/n] Make pooling entrypoints request schema consensus | ScoreRequest (#33060)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-02-04 01:48:40 +00:00
Nick Hill
52ee21021a
[BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-02-03 23:34:41 +00:00
Patrick von Platen
3f7662d650
[Voxtral Realtime] Change name (#33716)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
2026-02-03 13:03:28 -08:00
Harry Mellor
61e632aea1
Turn @config into a dataclass_transform (#31541)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-02-03 17:40:59 +00:00
Richard Zou
b1bb18de8d
[torch.compile] Significantly speed up cold start times (#33641)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2026-02-03 09:12:11 -08:00
shaharmor98
4bc913aeec
Feat/add nemotron nano v3 tests (#33345)
2026-02-03 08:52:49 -05:00
zxy
a3acfa1071
[Models] Intern-S1-Pro (#33636)
Signed-off-by: zxy <zhou0493@e.ntu.edu.sg>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-03 05:49:45 -08:00
Harry Mellor
f6af34626d
Fix offline test for Transformers v5 (#33682)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-02-03 12:07:24 +00:00
Cyrus Leung
83449a5ff0
[Refactor] Clean up pooling serial utils (#33665)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-03 10:29:18 +00:00
Isotr0py
32e84fa1ff
[CI/Build] Investigate torchrun distributed tests hanging issue (#33650)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-03 15:49:17 +08:00
杨朱 · Kiki
b95cc5014d
[Misc] Remove deprecated VLLM_ALL2ALL_BACKEND environment variable (#33535)
Signed-off-by: carlory <baofa.fan@daocloud.io>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 15:01:59 +08:00
Daniel Mescheder
4c4b6f7a97
[Frontend] Add sampling parameters to Responses API (#32609)
Signed-off-by: Daniel Mescheder <dmesch@amazon.com>
Co-authored-by: Daniel Mescheder <dmesch@amazon.com>
2026-02-03 13:51:10 +08:00
Patrick von Platen
5019c59dd2
[Voxtral Realtime] Introduce global log mel max (#33574)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 17:01:47 -05:00
Vasiliy Kuznetsov
0130223bd9
fix memory for online fp8 quantization with streaming weight load (#31914)
Signed-off-by: vasiliy <vasiliy@fb.com>
2026-02-02 14:17:42 -05:00
yugong333
ffe1fc7a28
Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. (#32005)
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
2026-02-02 12:30:06 -05:00
Harry Mellor
6141ebe0dd
Remove incorrect tokenizer info test (#33565)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-02-02 17:11:44 +00:00
Matthew Bonanni
9f8cb81b44
[CI] Add DeepSeek V3.2 nightly eval (#33566)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-02-02 16:10:02 +00:00
shanjiaz
d95b4be47a
move spec decode slow test to test_areas.yaml (#33365)
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>
2026-02-02 06:28:36 -08:00
Isotr0py
4061dcf4c5
[Bugfix] Enable Kimi k25 processor test (#33562)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-02 14:25:25 +00:00
danielafrimi
0aca8b8c62
[MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) (#32790)
Signed-off-by: dafrimi <dafrimi@nvidia.com>
2026-02-02 09:18:50 -05:00
Nicolò Lucchesi
528b3076af
[CI][Bugfix] Fix flaky tests/v1/kv_connector/unit/test_multi_connector.py::test_multi_example_connector_consistency (#33555)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-02-02 03:01:29 -08:00
Cyrus Leung
a502831d36
[Chore] Remove redundant input parsing methods (#33542)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-02-02 10:50:47 +00:00
RED
808dd87b30
[Model] Support DeepSeek-OCR-2 (#33165)
Signed-off-by: liuli <ll407707@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: liuli <ll407707@alibaba-inc.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-02 06:24:10 +00:00
jack
7c036432fc
[Bugfix] GLM-4 tool parser: incremental string streaming (#33218)
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-02-02 11:13:31 +08:00
Robert Shaw
318b120766
[Nightly CI] Remove CT Model (#33530)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-02-01 19:09:09 -08:00
csy0225
c3b40dc3e7
[Models] Step-3.5-Flash (#33523)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: i-zhangmingming <i-zhangmingming@stepfun.com>
Co-authored-by: xiewuxun <xiewuxun@stepfun.com>
Co-authored-by: zetaohong <i-hongzetao@stepfun.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-02-02 10:21:18 +08:00
Yifan Qiao
a01ef3fa51
[Fix] prefix cache hit rate == 0 bug with gpt-oss style models (#33524)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2026-02-02 01:59:58 +00:00
Runkai Tao
7320ca3942
Add unpermute-aware fused MoE LoRA path (#32655)
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
2026-02-02 09:46:09 +08:00
Roy Wang
63c0889416
[Misc] Fix flashinfer related tests (#33462)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
2026-01-31 16:10:24 -05:00
Cyrus Leung
88c3e114d8
[Refactor] Move MM data parsing outside processor (#33408)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-31 16:46:14 +00:00
jma99_2333
22d9a056d5
Support clear mm and encoder cache (#33452)
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-31 15:22:25 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
13b842f271
[BugFix][Router Replay] Capture Logical Experts with EPLB (#33013)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2026-01-31 10:12:17 -05:00
Luka Govedič
15f40b20aa
[fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops (#33441)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Richard Zou <zou3519@gmail.com>
2026-01-31 06:48:34 -08:00
Angela Yi
608b556507
[ez] Add structured torch.compile logs (#33213)
Signed-off-by: angelayi <yiangela7@gmail.com>
2026-01-31 21:00:54 +08:00
Cyrus Leung
f0a1c8453a
[Frontend] Use new Renderer for Completions and Tokenize API (#32863)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-31 04:51:15 -08:00
Yanan Cao
d5c41db35b
[Kernel] [Helion] [3/N] Helion kernel registry (#33203)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2026-01-31 15:38:46 +08:00
Dimitrios Bariamis
f0bca83ee4
Add support for Mistral Large 3 inference with Flashinfer MoE (#33174)
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-30 22:48:27 -08:00
Yanan Cao
8ecd213c0b
[Kernel] [Helion] [2/N] Helion kernel wrapper (#32964)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2026-01-31 12:53:01 +08:00
Patrick von Platen
15e0bb9c42
[Streaming -> Realtime] Rename all voxtral related classes, fn, files (#33415)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
2026-01-31 04:49:00 +00:00
Micah Williamson
6c64c41b4a
[ROCm][CI] Force max_num_seqs=1 on ROCm In test_sharded_state_loader to reduce flakiness (#33277)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2026-01-31 12:28:29 +08:00
Michael Goin
29fba76781
[UX] Use gguf repo_id:quant_type syntax for examples and docs (#33371)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-31 12:14:54 +08:00
Nick Hill
876a16f4fb
[ModelRunner V2] Fix spec decoding + logprobs (#33391)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-31 03:33:26 +00:00
Matthew Bonanni
aaa901ad55
[Attention] Move MLA forward from backend to layer (#33284)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-30 19:30:00 -08:00