biondizzle/vllm - vllm

Author	SHA1	Message	Date
liranschour	8322d4e47f	Enable Cross layers KV cache layout at NIXL Connector V2 (#33339 ) Signed-off-by: Liran Schour <lirans@il.ibm.com> Signed-off-by: liranschour <liranschour@users.noreply.github.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>	2026-02-05 02:17:02 -08:00
Andreas Karatzas	3e472e81f9	[ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) (#32710 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>	2026-02-05 10:01:23 +00:00
Cyrus Leung	038914b7c8	[Refactor] Move `task` outside of `PoolingParams.verify` (#33796 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>	2026-02-05 09:33:11 +00:00
Pavani Majety	d2f4a71cd5	[Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. (#33858 ) Signed-off-by: Pavani Majety <pmajety@nvidia.com>	2026-02-05 09:32:10 +00:00
Mark McLoughlin	2abd97592f	[KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2026-02-05 09:57:27 +02:00
Chauncey	6abb0454ad	[Perf] Optimize the performance of structured output + reasoning (#33557 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2026-02-05 15:45:29 +08:00
Li, Jiang	db6f71d4c9	[CI/Build] Fix CPU CI test case title (#33870 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2026-02-05 15:07:14 +08:00
Fadi Arafeh	fd03538bf9	[CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs (#33727 ) Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>	2026-02-05 06:26:09 +00:00
Andreas Karatzas	1f70313e59	[Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result (#33837 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>	2026-02-05 06:17:00 +00:00
Li, Jiang	07daee132b	[CI/Build] Parallelize CPU CI tests (#33778 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2026-02-05 13:53:48 +08:00
Andrew Xia	9595afda18	[2/N] move responses/serving _make_response_output_items logic to parser (#33281 ) Signed-off-by: Andrew Xia <axia@fb.com> Signed-off-by: Andrew Xia <axia@meta.com> Co-authored-by: Andrew Xia <axia@fb.com>	2026-02-05 13:46:15 +08:00
rasmith	c1395f72cd	[CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly (#33840 ) Signed-off-by: Randall Smith <Randall.Smith@amd.com>	2026-02-05 05:05:48 +00:00
rinbaro	007b183d74	[docs] fix unintentional misspellings (#33863 ) Signed-off-by: rinbaro <ilgomishra@gmail.com>	2026-02-04 20:50:59 -08:00
Nick Hill	add9f1fbd9	[Minor] Include `StreamingInput` in inputs package (#33856 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-02-05 04:38:20 +00:00
Luka Govedič	e3bf79ffa0	Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" (#33841 )	2026-02-04 19:54:27 -08:00
Andreas Karatzas	fb1270f1f8	[CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode (#32762 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-02-05 11:14:06 +08:00
Kevin H. Luu	72bb24e2db	[release] Minor fixes to release annotation (#33849 ) Signed-off-by: Kevin H. Luu <khluu000@gmail.com>	2026-02-05 02:07:35 +00:00
Chauncey	a7be77beef	[Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 (#33637 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2026-02-05 01:28:36 +00:00
zhanqiuhu	bbe0574d8e	[Bugfix] Disable TRTLLM attention when KV transfer is enabled (#33192 ) Signed-off-by: Zhanqiu Hu <zh338@cornell.edu> v0.15.2rc0	2026-02-05 00:49:18 +00:00
Luka Govedič	4d9513537d	[CI][torch.compile] Reduce e2e fusion test time (#33293 ) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: ProExpertProg <luka.govedic@gmail.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-02-04 19:09:03 -05:00
Ilya Boytsov	439afa4eea	feat: Add ColBERT late interaction model support (#33686 ) Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com> Signed-off-by: Ilya Boytsov <boytsovpanamera@mail.ru> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>	2026-02-05 08:05:13 +08:00
Nick Hill	fa4e0fb028	[Core] Don't schedule spec tokens with prefill chunks (#33652 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-02-04 23:40:22 +00:00
Sage Moore	ce498a6d61	Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] (#33573 ) Signed-off-by: Sage Moore <sagmoore@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-02-04 17:02:46 -05:00
Richard Zou	9f14c9224d	Revert "[torch.compile] Significantly speed up cold start times" (#33820 ) Signed-off-by: Richard Zou <zou3519@gmail.com>	2026-02-04 21:59:59 +00:00
Muhammad Hashmi	535de06cb1	[Model] Add transcription support for Qwen3-Omni (#29828 ) Signed-off-by: Muhammad Hashmi <mhashmi@berkeley.edu> Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: NickLucche <nlucches@redhat.com>	2026-02-04 21:17:47 +00:00
Simon Danielsson	4292c90a2a	[Bugfix] Support `RotaryEmbedding` CustomOp for gpt-oss (#33800 ) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>	2026-02-04 20:17:41 +00:00
Taeksang Kim	6e98f6d8b6	Implement zero-copy GQA for multimodal and CPU (#33732 ) Signed-off-by: Taeksang Kim <ts.kim@hyperaccel.ai>	2026-02-04 20:11:39 +00:00
kourosh hakhamaneshi	2f6d17cb2f	[rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm (#33308 ) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>	2026-02-04 10:09:14 -08:00
Isotr0py	192ad4648b	[Bugfix] Fix interns1-pro initialization and PP (#33793 ) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2026-02-04 17:54:45 +00:00
Lucas Wilkinson	0e92298622	[Misc] Delay deprecation of CommonAttentionMetadata properties (#33801 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-02-04 08:41:57 -08:00
jiangkuaixue123	87d9a26166	[Bugfix] Fix ubatch wrapper num_tokens calculate (#33694 ) Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com>	2026-02-04 16:41:45 +00:00
Cyrus Leung	80f921ba4b	[Bugfix] Fix `normalize` still being passed to `PoolerConfig` (#33794 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-04 23:56:02 +08:00
Wentao Ye	711edaf0d0	[Perf] Optimize spec decoding + async scheduling, 1.5% Throughput improvement (#33612 ) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Nick Hill <nhill@redhat.com>	2026-02-04 09:34:32 -05:00
Micah Williamson	1d367a738e	[Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching (#33713 ) Signed-off-by: Micah Williamson <micah.williamson@amd.com>	2026-02-04 05:36:29 -08:00
Cyrus Leung	32a02c7ca2	Apply #33621 to main (#33758 ) Signed-off-by: Zachary Aristei <zaristei@nvidia.com> Co-authored-by: zaristei2 <zaristei2@gmail.com> Co-authored-by: Zachary Aristei <zaristei@nvidia.com>	2026-02-04 05:35:39 -08:00
Chauncey	f67ee8b859	[Perf] Optimize chat completion streaming performance (#33782 ) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>	2026-02-04 12:30:36 +00:00
Cyrus Leung	e57ef99b40	[Model] Apply #32631 for recent models (#33785 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-04 12:23:01 +00:00
Yueqian Lin	f8516a1ab9	[Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni (#33605 ) Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.io>	2026-02-04 12:15:29 +00:00
Vadim Gimpelson	824058076c	[PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] (#33291 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-02-04 11:20:52 +00:00
Or Ozeri	8e32690869	[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads (#32255 ) Fixes a not-yet-reported case where it was possible for blocks to be freed by an abort before an async transfer completed, resulting in corrupted KV data. Signed-off-by: Or Ozeri <oro@il.ibm.com>	2026-02-04 11:16:34 +00:00
Zhengxu Chen	a208439537	[compile] Remove runner type from ignored caching factor list. (#33712 ) Signed-off-by: zhxchen17 <zhxchen17@fb.com>	2026-02-04 10:56:45 +00:00
Zhengxu Chen	bcd2f74c0d	[compile] Clean up AOT compile bypass on evaluate_guards. (#33578 ) Signed-off-by: zhxchen17 <zhxchen17@fb.com>	2026-02-04 02:12:53 -08:00
Kunshang Ji	f79f777803	[XPU][2/N] add support unquantized moe support for xpu (#33659 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-02-04 02:12:25 -08:00
Augusto Yao	4c8d1bf361	use ORJSONResponse when available to improve the efficiency of request process (#33548 ) Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com>	2026-02-04 10:04:11 +00:00
Kunshang Ji	061da6bcf7	[XPU] remove common path warning log (#33769 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2026-02-04 16:40:17 +08:00
zhanqiuhu	4403e3ed4c	[Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290 ) Add labeled Prometheus metrics to distinguish where prompt tokens come from in P/D disaggregated deployments. In P/D disaggregation, decode instances receive KV cache from prefill instances. Currently, decode reports inflated prompt throughput because it counts all prompt tokens as "computed", even though most were transferred. This PR adds labeled metrics so users can understand actual compute work vs transferred work: vllm:prompt_tokens_by_source_total{source="local_compute"} # Tokens prefilled locally vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} # Tokens received via KV transfer vllm:prompt_tokens_by_source_total{source="local_cache_hit"} # Tokens from local prefix cache vllm:prompt_tokens_cached_total # Total cached (local + external, -1 when all Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>	2026-02-04 07:46:48 +00:00
Matt	08e094997e	[Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism (#32745 ) Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>	2026-02-04 14:51:33 +08:00
Wentao Ye	d88a1df699	[Deprecation] Deprecate profiling envs (#33722 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-02-04 05:58:21 +00:00
Cyrus Leung	90d74ebaa4	[Deprecation] Remove `_get_data_parser` in MM processor (#33757 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-04 05:51:52 +00:00
Frank Wang	45f8fd6f97	[Feature] Enable `TRITON_ATTN` for Batch Invariance (#33688 ) Signed-off-by: frankwang28 <frank.wbb@hotmail.com>	2026-02-04 13:27:34 +08:00
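
The message for commit 4403e3ed4c (#33290) describes splitting prompt tokens by where they came from, so that a decode instance in a P/D disaggregated deployment does not report transferred tokens as computed. A minimal sketch of that bookkeeping follows. Only the three `source` label values are taken from the commit message; the function names, arguments, and the plain `Counter` standing in for a Prometheus counter are hypothetical and are not vLLM's implementation.

```python
from collections import Counter

# Label values quoted from the commit message for #33290.
SOURCES = ("local_compute", "local_cache_hit", "external_kv_transfer")


def split_prompt_tokens(num_prompt_tokens, num_local_cached, num_external):
    """Hypothetical helper: attribute each prompt token to exactly one source."""
    if min(num_prompt_tokens, num_local_cached, num_external) < 0:
        raise ValueError("token counts must be non-negative")
    # Whatever was neither a local prefix-cache hit nor transferred
    # must have been prefilled locally.
    computed = num_prompt_tokens - num_local_cached - num_external
    if computed < 0:
        raise ValueError("cached + transferred tokens exceed the prompt length")
    return {
        "local_compute": computed,
        "local_cache_hit": num_local_cached,
        "external_kv_transfer": num_external,
    }


def record(totals, breakdown):
    """Add one request's breakdown to running per-source totals."""
    for source in SOURCES:
        totals[source] += breakdown[source]


# Stand-in for a labeled counter such as vllm:prompt_tokens_by_source_total.
totals = Counter()
# Decode-side request: most of the prompt arrived via KV transfer.
record(totals, split_prompt_tokens(1000, 200, 700))
# Request with no cache hits and no transfer: everything is computed locally.
record(totals, split_prompt_tokens(500, 0, 0))
```

Under these assumptions the per-source totals always sum to the total number of prompt tokens seen, which is what lets actual compute work be separated from transferred work.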


13631 Commits