biondizzle/vllm - vllm

Author	SHA1	Message	Date
jiangkuaixue123	87d9a26166	[Bugfix] Fix ubatch wrapper num_tokens calculate (#33694 ) Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com>	2026-02-04 16:41:45 +00:00
Cyrus Leung	80f921ba4b	[Bugfix] Fix `normalize` still being passed to `PoolerConfig` (#33794 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-04 23:56:02 +08:00
Wentao Ye	711edaf0d0	[Perf] Optimize spec decoding + async scheduling, 1.5% Throughput improvement (#33612 ) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2026-02-04 09:34:32 -05:00
Micah Williamson	1d367a738e	[Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching (#33713 ) Signed-off-by: Micah Williamson <micah.williamson@amd.com>	2026-02-04 05:36:29 -08:00
Cyrus Leung	32a02c7ca2	Apply #33621 to main (#33758 ) Signed-off-by: Zachary Aristei <zaristei@nvidia.com> Co-authored-by: zaristei2 <zaristei2@gmail.com> Co-authored-by: Zachary Aristei <zaristei@nvidia.com>	2026-02-04 05:35:39 -08:00
Chauncey	f67ee8b859	[Perf] Optimize chat completion streaming performance (#33782 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2026-02-04 12:30:36 +00:00
Cyrus Leung	e57ef99b40	[Model] Apply #32631 for recent models (#33785 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-04 12:23:01 +00:00
Yueqian Lin	f8516a1ab9	[Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni (#33605 ) Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.io>	2026-02-04 12:15:29 +00:00
Vadim Gimpelson	824058076c	[PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] (#33291 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-02-04 11:20:52 +00:00
Or Ozeri	8e32690869	[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads (#32255 ) Fixes a not-yet-reported case where it was possible for blocks to be freed by an abort before an async transfer completed, resulting in corrupted KV data. Signed-off-by: Or Ozeri <oro@il.ibm.com>	2026-02-04 11:16:34 +00:00
Zhengxu Chen	a208439537	[compile] Remove runner type from ignored caching factor list. (#33712 ) Signed-off-by: zhxchen17 <zhxchen17@fb.com>	2026-02-04 10:56:45 +00:00
Zhengxu Chen	bcd2f74c0d	[compile] Clean up AOT compile bypass on evaluate_guards. (#33578 ) Signed-off-by: zhxchen17 <zhxchen17@fb.com>	2026-02-04 02:12:53 -08:00
Kunshang Ji	f79f777803	[XPU][2/N] add support unquantized moe support for xpu (#33659 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-02-04 02:12:25 -08:00
Augusto Yao	4c8d1bf361	use ORJSONResponse when available to improve the efficiency of request process (#33548 ) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>	2026-02-04 10:04:11 +00:00
Kunshang Ji	061da6bcf7	[XPU] remove common path warning log (#33769 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-02-04 16:40:17 +08:00
zhanqiuhu	4403e3ed4c	[Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290 ) Add labeled Prometheus metrics to distinguish where prompt tokens come from in P/D disaggregated deployments. In P/D disaggregation, decode instances receive KV cache from prefill instances. Currently, decode reports inflated prompt throughput because it counts all prompt tokens as "computed", even though most were transferred. This PR adds labeled metrics so users can understand actual compute work vs transferred work: vllm:prompt_tokens_by_source_total{source="local_compute"} # Tokens prefilled locally vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer vllm:prompt_tokens_by_source_total{source="local_cache_hit"} # Tokens from local prefix cache vllm:prompt_tokens_cached_total # Total cached (local + external, -1 when all Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>	2026-02-04 07:46:48 +00:00
Matt	08e094997e	[Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism (#32745 ) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>	2026-02-04 14:51:33 +08:00
Wentao Ye	d88a1df699	[Deprecation] Deprecate profiling envs (#33722 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-02-04 05:58:21 +00:00
Cyrus Leung	90d74ebaa4	[Deprecation] Remove `_get_data_parser` in MM processor (#33757 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-04 05:51:52 +00:00
Frank Wang	45f8fd6f97	[Feature] Enable `TRITON_ATTN` for Batch Invariance (#33688 ) Signed-off-by: frankwang28 <frank.wbb@hotmail.com>	2026-02-04 13:27:34 +08:00
Wentao Ye	5e1e0a0fbd	[Refactor] Remove unused dead code (#33718 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-02-03 21:25:11 -08:00
Michael Goin	eb5ed20743	[Bugfix] Define router_logits_dtype for remaining MoE models (#33737 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-02-04 13:24:14 +08:00
Huy Do	2647163674	Save startup benchmark results as a list of values (#33629 ) Signed-off-by: Huy Do <huydhn@gmail.com>	2026-02-03 20:37:51 -08:00
Shanshan Shen	9fb27dd3b3	[MM] Align the prefix of MMEncoderAttention with Attention (#33750 ) Signed-off-by: shen-shanshan <467638484@qq.com>	2026-02-04 04:07:30 +00:00
R3hankhan	4dffc5e044	[CPU] Split attention dispatch by head_dim alignment (#32161 ) Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>	2026-02-03 19:37:15 -08:00
Andrew Xia	e1bf04b6c2	[1/N] Initial Implementation of Parser for ResponsesAPI (#32712 ) Signed-off-by: Andrew Xia <axia@fb.com> Co-authored-by: Andrew Xia <axia@fb.com>	2026-02-04 10:59:03 +08:00
Isotr0py	02080179a3	[Bugfix] Fix torchrun PP broadcast deadlock with async scheduling (#33701 ) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2026-02-04 02:17:37 +00:00
wang.yuqi	1b8fe6f7c4	[Frontend][4/n] Make pooling entrypoints request schema consensus | ScoreRequest (#33060 ) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>	2026-02-04 01:48:40 +00:00
Nick Hill	52ee21021a	[BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-02-03 23:34:41 +00:00
Wentao Ye	655efb3e69	[Dependency] Remove comments of ray in dependency files (#33351 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-02-03 15:30:47 -08:00
Matthew Bonanni	bd8da29a66	[Bugfix] Fix sparse MLA metadata building (#33579 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-02-03 15:29:48 -08:00
Michael Goin	2a99c5a6c8	[Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3 (#33613 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-02-03 13:26:51 -08:00
Patrick von Platen	3f7662d650	[Voxtral Realtime] Change name (#33716 ) Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>	2026-02-03 13:03:28 -08:00
Vadim Gimpelson	a372f3f40a	[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 (#33257 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-02-03 15:10:31 -05:00
Harry Mellor	61e632aea1	Turn `@config` into a `dataclass_transform` (#31541 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-03 17:40:59 +00:00
Richard Zou	b1bb18de8d	[torch.compile] Significantly speed up cold start times (#33641 ) Signed-off-by: Richard Zou <zou3519@gmail.com>	2026-02-03 09:12:11 -08:00
Lucas Wilkinson	2267cb1cfd	[Attention][FA3] Update FA3 to include new swizzle optimization (#23465 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-02-03 08:08:47 -08:00
dtc	0d6ccf68fa	[P/D] rework mooncake connector and introduce its bootstrap server (#31034 ) Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>	2026-02-03 08:08:25 -08:00
Cyrus Leung	18e7cbbb15	[Bugfix] Fix startup hang for Granite Speech (#33699 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-03 15:57:56 +00:00
Patrick von Platen	f0d5251715	[Voxtral models] Skip warm-up to skip confusing error message in warm-up (#33576 ) Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2026-02-03 07:22:34 -08:00
Shanshan Shen	5c4f2dd6ef	[MM] Pass `prefix` parameter to MMEncoderAttention (#33674 ) Signed-off-by: shen-shanshan <467638484@qq.com>	2026-02-03 06:47:41 -08:00
wang.yuqi	f3d8a34671	[Bugfix] Do not add extra \n for image-only cases when constructing multimodal text prompts. (#33647 ) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>	2026-02-03 06:43:47 -08:00
shaharmor98	4bc913aeec	Feat/add nemotron nano v3 tests (#33345 )	2026-02-03 08:52:49 -05:00
Kuntai Du	fbb3cf6981	[Bugfix][Async][Connector] avoid vllm-side double free during async scheduling + request abort + async KV cache transfer (#33377 ) Signed-off-by: KuntaiDu <kuntai@uchicago.edu>	2026-02-03 21:50:15 +08:00
Krish Gupta	2df2b3499d	Document NixlConnector backend selection via kv_connector_extra_config (#33552 ) Signed-off-by: KrxGu <krishom70@gmail.com>	2026-02-03 05:49:59 -08:00
Harry Mellor	2a8d84e66d	Fix Gemma3n audio encoder for Transformers v5 (#33673 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-03 05:49:49 -08:00
zxy	a3acfa1071	[Models] Intern-S1-Pro (#33636 ) Signed-off-by: zxy <zhou0493@e.ntu.edu.sg> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2026-02-03 05:49:45 -08:00
Harry Mellor	be8168ff88	Fix Gemma3 GGUF for Transformers v5 (#33683 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-03 12:36:53 +00:00
Harry Mellor	f6af34626d	Fix offline test for Transformers v5 (#33682 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-03 12:07:24 +00:00
Song Zhixin	ceab70c89d	[Bugfix] fix qwen3-asr response error (#33644 ) Signed-off-by: jesse <szxfml@gmail.com> Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2026-02-03 03:33:56 -08:00

13601 Commits
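
The commit message for #33290 (4403e3ed4c) in the table above describes attributing prompt tokens to one of three labeled sources. The following is a minimal sketch of that accounting using only the Python standard library. It is not vLLM's implementation: the helper names `split_prompt_tokens` and `render` and their parameters are hypothetical, and only the metric name and the three `source` label values come from the commit message.

```python
from collections import Counter

# Metric name and label values are taken from the commit message for #33290.
METRIC = "vllm:prompt_tokens_by_source_total"
SOURCES = ("local_compute", "external_kv_transfer", "local_cache_hit")


def split_prompt_tokens(num_prompt_tokens, num_local_cached, num_external):
    """Hypothetical helper: attribute each prompt token to exactly one source."""
    local_compute = num_prompt_tokens - num_local_cached - num_external
    if min(num_prompt_tokens, num_local_cached, num_external, local_compute) < 0:
        raise ValueError("counts must be non-negative and fit within the prompt")
    return {
        "local_compute": local_compute,
        "external_kv_transfer": num_external,
        "local_cache_hit": num_local_cached,
    }


def render(totals):
    """Format running totals as Prometheus text-exposition sample lines."""
    return [f'{METRIC}{{source="{s}"}} {totals[s]}' for s in SOURCES]


totals = Counter()
# A decode instance that received most of a 1000-token prompt via KV transfer.
totals.update(split_prompt_tokens(1000, num_local_cached=64, num_external=900))
# A request that was prefilled entirely locally.
totals.update(split_prompt_tokens(200, num_local_cached=0, num_external=0))

for line in render(totals):
    print(line)
```

With a split like this, the `local_compute` series reflects only tokens that were actually prefilled on the instance, which is the distinction the commit message says is lost when a decode instance counts every prompt token as computed.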