biondizzle/vllm - vllm

Author	SHA1	Message	Date
Yashwant Bezawada	a13d8c03c9	[KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops (#31057 ) Signed-off-by: Yashwant Bezawada <yashwant_b@me.com>	2026-03-02 15:04:47 -05:00
Fynn Schmitt-Ulms	9433acb8df	[Spec Decode] Add hidden states extraction system (#33736 ) Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>	2026-03-02 14:29:09 -05:00
ElizaWszola	d9c7730877	[Performance] Extract kv update ops from MLA attention backends (#34627 ) Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Di Wu <dw2761@nyu.edu> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-03-02 10:43:19 -05:00
wangxiyuan	510bc9e1df	[Misc] Cleanup useless `current_platform` import (#35715 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2026-03-02 09:36:54 +00:00
Lucas Wilkinson	8b5014d3dd	[Attention] FA4 integration (#32974 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>	2026-03-01 23:44:57 +00:00
Richard Zou	e82fbeec7b	[torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 (#35475 ) Signed-off-by: Richard Zou <zou3519@gmail.com>	2026-03-01 21:44:22 +00:00
Ilya Markov	b2d8b422b2	[EPLB] Enforce sync eplb for NCCL-based all2all backend (#35212 ) Signed-off-by: ilmarkov <markovilya197@gmail.com>	2026-02-28 05:47:12 +00:00
Itay Alroy	dea268336f	[1/N] Elastic EP Milestone 2 (#34861 ) Signed-off-by: Yongji Wu <wuyongji317@gmail.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: Ron Tourgeman <rtourgeman@nvidia.com> Co-authored-by: Yongji Wu <wuyongji317@gmail.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>	2026-02-28 04:46:42 +00:00
Aaron Hao	2ce6f3cf67	[Feat][RL][2/2] Native Weight Syncing API: IPC (#34171 ) Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Signed-off-by: ahao-anyscale <ahao@anyscale.com>	2026-02-27 13:45:21 -07:00
Lucas Wilkinson	1d532f9d8f	[DP] Only use DP padding when cudagraphs are actually used (#34102 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-02-27 15:14:31 -05:00
Jason Li	66c1751d13	[compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism (#35410 ) Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com>	2026-02-27 08:36:37 -05:00
Jiangyun Zhu	487e5c51f7	[Bugfix] disable allreduce_rms_fusion by default when pp size > 1 (#35424 ) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>	2026-02-27 04:18:52 +00:00
Tyler Michael Smith	eb19955c37	[WideEP] Remove pplx all2all backend (#33724 ) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 14:30:10 -08:00
sychen52	d0105b84f0	add mixed precision support for modelopt (#35047 ) Signed-off-by: Shiyang Chen <shiychen@nvidia.com>	2026-02-26 21:56:24 +00:00
Gregory Shtrasberg	6042e66cd5	[ROCm] Add extra step in config initialization to populate custom ops before compilation config init (#34848 ) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>	2026-02-26 16:05:40 +08:00
Jason Li	9d37941017	[torch.compile] Sequence Parallelism threshold compile ranges (#28672 ) Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com> Signed-off-by: Jason Li <jasonlizhengjian@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-02-26 05:00:12 +00:00
Michael Goin	de527e1cec	[UX] Add `--moe-backend` arg for explicit kernel selection (#33807 ) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-02-25 17:44:44 -08:00
Michael Goin	cbf8f7028c	[UX] Add `--performance-mode {balanced,interactivity,throughput}` (#34936 ) Signed-off-by: mgoin <mgoin64@gmail.com>	2026-02-25 17:28:31 -08:00
Ming Yang	6831650c40	[offloader] v2: Hide weight onloading latency via prefetching (#29941 ) Signed-off-by: Ming Yang <minos.future@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-02-25 17:20:59 -08:00
Rohan Potdar	f38f8c9742	[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (#35180 ) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>	2026-02-25 04:36:40 +00:00
Benjamin Chislett	f5972a872f	[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support (#33726 ) Signed-off-by: Shahar Mor <smor@nvidia.com> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Shahar Mor <smor@nvidia.com> Co-authored-by: Roi Koren <roik@nvidia.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-02-24 09:49:56 -08:00
Nicolò Lucchesi	f91808ae0d	[MM] Allow audio chunking for offline LLM (#34628 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2026-02-23 21:04:28 -08:00
pschlan-amd	80d93fd6da	gpu_model_runner: Cache is_encoder_decoder from model config (#35099 ) Signed-off-by: Patrick Schlangen <pschlan@amd.com>	2026-02-23 19:08:34 -08:00
Rohan Potdar	2ff4e51152	[ROCm] AITER fused RoPE+KVCache (#33443 ) Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: charlifu <charlifu@amd.com> Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com> Co-authored-by: charlifu <charlifu@amd.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>	2026-02-23 19:06:00 -08:00
Vincent Gimenes	aa08a30fc9	[CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig (#35060 ) Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>	2026-02-23 05:05:36 -08:00
Cyrus Leung	987506bca6	[Refactor] Simplify dummy data generation (#35025 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-22 20:55:27 -08:00
qizixi	2bcf71b9c0	[Spec Decode] Reduce TP communication for speculative decoding draft token generation (#34049 ) Signed-off-by: qizixi <qizixi@meta.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>	2026-02-22 14:59:16 -08:00
Benjamin Chislett	682566b18e	[Bug] Refactor max_num_batched_tokens to account for drafting (#34898 ) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>	2026-02-22 11:18:46 -05:00
Nicolò Lucchesi	ab6f3487a6	[PD] Change kv_load_failure_policy Default from "recompute" to "fail" (#34896 ) Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2026-02-21 01:34:57 -08:00
Lucas Wilkinson	aaefc58ee0	[CI] Revert PRs 34818 and 33600 (#34979 )	2026-02-20 13:25:50 -08:00
杨朱 · Kiki	07cab212f0	[Misc] Add deprecated environment variable utilities (#33677 ) Signed-off-by: carlory <baofa.fan@daocloud.io> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-19 21:33:25 -08:00
Matthew Bonanni	662205d34e	[Bugfix] Fix Basic Models Test (#34818 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-02-19 14:49:07 -08:00
Kyle Sayers	64ac1395e8	[Docs] Clean up speculators docs (#34065 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-02-18 13:48:11 -08:00
Luka Govedič	02e8f26cea	[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ (#34718 ) Signed-off-by: Luka Govedič <lgovedic@redhat.com>	2026-02-18 03:29:15 +00:00
Matthew Bonanni	7743152957	[Attention] Refactor `check_and_update_config` (#33600 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-02-17 17:06:54 -08:00
Lucas Kabela	a3205beffb	[CI] Enable mypy coverage for individual excluded files (#34292 ) Signed-off-by: Lucas Kabela <lucaskabela@meta.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-16 07:34:29 -08:00
Luka Govedič	23d825aba1	[torch.compile] Disable ar-rms fusion for ds3-fp4 & DP, fix CI test (#34392 ) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-02-15 06:33:57 -08:00
Thomas Parnell	d5fe3f702c	[Hybrid] Enable mamba prefix cache "align" mode with async scheduling (#33997 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2026-02-14 13:15:56 -08:00
Wei Zhao	b37b679770	[Feature][Perf] Support Selective CPU Weight Offloading (#34535 ) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>	2026-02-13 20:02:24 -08:00
Richard Zou	87789c8364	[Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only (#34523 ) Signed-off-by: Richard Zou <zou3519@gmail.com>	2026-02-13 09:52:20 -08:00
Roger Wang	1dae7b7843	[Bugfix] Exclude `language_model_only` key from MM AOT compile hash but include in model one (#34508 ) Signed-off-by: Roger Wang <hey@rogerw.io>	2026-02-13 13:59:00 +00:00
Harry Huang	6f019e6e0a	[BugFix] Add block_size validation for mamba cache align mode (#34445 ) Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>	2026-02-12 18:18:07 -08:00
Cyrus Leung	fb455ed547	[V0 Deprecation] Remove code related to per-request logits processors (#34400 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-12 20:44:28 +08:00
Cyrus Leung	b96f7314b4	[Refactor] Pass Renderer to Input Processor (#34329 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-11 19:38:11 -08:00
SorenDreano	48134a2c22	[Docs] Fix typo ("defult") and double spacing (#34348 ) Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com> Co-authored-by: Soren Dreano <soren@numind.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-02-11 09:02:27 -08:00
Harry Mellor	40b8f55358	[Docs] Reduce time spent generating API docs (#34255 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-11 02:56:02 -08:00
Luka Govedič	addac0e653	[torch.compile] Enable AR+rms fusion by default available for `-O2` (#34299 ) Signed-off-by: Luka Govedič <lgovedic@redhat.com>	2026-02-11 00:30:00 -08:00
Zhengkai Zhang	6f2f59f2b3	[Misc][Spec Decode] support different load config for draft model (#34022 ) Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com> Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>	2026-02-10 14:52:43 -08:00
Pavani Majety	578977bb5e	[SM100] Resubmit FMHA FP8 prefill for MLA (#31195 ) Signed-off-by: Pavani Majety <pmajety@nvidia.com>	2026-02-10 16:18:43 -05:00
Qi Wang	33bcd3dc3b	[Misc] Introduce ec_both role EC (encoder cache) connector (#34182 ) Signed-off-by: Qi Wang <qiwa@nvidia.com>	2026-02-10 18:55:35 +00:00

538 Commits