biondizzle/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Xin Yang	7a5adad480	[Kernel] Optimize sample_recovered_tokens_kernel (#34974 ) Signed-off-by: Xin Yang <xyangx@amazon.com>	2026-02-20 19:59:06 -08:00
Lucas Wilkinson	aaefc58ee0	[CI] Revert PRs 34818 and 33600 (#34979 )	2026-02-20 13:25:50 -08:00
Wei Zhao	f24b2de3d3	[Test] Add FP8 KV Cache Testing for MLA Backends (#34473 ) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>	2026-02-20 18:51:58 +00:00
rasmith	0c1dc42748	[CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass (#33739 ) Signed-off-by: Randall Smith <Randall.Smith@amd.com>	2026-02-19 21:32:40 -08:00
Matthew Bonanni	662205d34e	[Bugfix] Fix Basic Models Test (#34818 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-02-19 14:49:07 -08:00
Kyle Sayers	64ac1395e8	[Docs] Clean up speculators docs (#34065 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-02-18 13:48:11 -08:00
Jongseok Park	c656ba3b4d	[Kernel] Triton-based Top-k and Top-p sampler kernels (#33538 ) Signed-off-by: js_park <cakeng@naver.com> Signed-off-by: Jongseok Park <37990712+cakeng@users.noreply.github.com> Signed-off-by: Sunga Kim <sunga.kim@berkeley.edu> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Sunga Kim <sunga.kim@berkeley.edu> Co-authored-by: Nick Hill <nickhill123@gmail.com>	2026-02-17 23:14:30 +00:00
Cyrus Leung	574fe75245	[Renderer] Move InputPreprocessor into Renderer (2/2) (#34560 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-17 05:29:01 -08:00
junuxyz	c61a98f529	[CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior (#34514 ) Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>	2026-02-17 12:22:56 +00:00
Ekagra Ranjan	cd81cdb399	[Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs (#31058 ) Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>	2026-02-16 11:08:44 +00:00
Thomas Parnell	d5fe3f702c	[Hybrid] Enable mamba prefix cache "align" mode with async scheduling (#33997 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2026-02-14 13:15:56 -08:00
Cyrus Leung	73391a1baa	[Renderer] Move InputPreprocessor into Renderer (1/2) (#34510 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2026-02-14 10:14:21 -08:00
Andreas Karatzas	b3c14229b0	[ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests (#34538 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-02-14 07:32:09 -08:00
Harry Huang	c027541eaf	[Hybrid] Enable spec decoding in mamba cache align mode (#33705 ) Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>	2026-02-13 13:02:28 -08:00
Ben Browning	fd267bc7b7	[Bugfix]: Fix structured output in multi-turn gpt-oss (#34454 ) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2026-02-13 11:12:48 -08:00
Aaron Hao	dddbff4624	[Core] Move pause and resume functions into engine (#34125 ) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>	2026-02-13 00:15:10 -08:00
haosdent	4137c5dfa7	[Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode (#34418 ) Signed-off-by: haosdent <haosdent@gmail.com>	2026-02-13 00:13:22 -08:00
Cyrus Leung	ea5ff3a1f6	[Refactor] Simplify BOS/EOS token handling (#34435 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-12 18:18:24 -08:00
Matthew Bonanni	f2c47886fd	[Attention] Add FlashInfer Sparse MLA backend (#33451 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>	2026-02-12 17:21:54 +00:00
Cyrus Leung	fb455ed547	[V0 Deprecation] Remove code related to per-request logits processors (#34400 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-12 20:44:28 +08:00
Cyrus Leung	b96f7314b4	[Refactor] Pass Renderer to Input Processor (#34329 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-11 19:38:11 -08:00
Lucas Wilkinson	c7914d30f9	Reapply [Attention][FA3] Update FA3 to include new swizzle optimization (#34043 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-02-11 07:07:56 -08:00
Cyrus Leung	b5dcb372e4	[Misc] Clean up validation logic in input processor (#34144 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-10 19:29:29 -08:00
Pavani Majety	578977bb5e	[SM100] Resubmit FMHA FP8 prefill for MLA (#31195 ) Signed-off-by: Pavani Majety <pmajety@nvidia.com>	2026-02-10 16:18:43 -05:00
junuxyz	c5a66d1697	[Core][BugFix] Fix PP KV cache sharding memory validation (#33698 ) Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com>	2026-02-10 10:46:24 -05:00
Krish Gupta	748625cdaf	[V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input (#34220 ) Signed-off-by: KrxGu <krishom70@gmail.com>	2026-02-10 13:05:32 +00:00
Chen Zhang	97fa8f6590	[BugFix] Avoid prefix cache hit in the same schedule step for mamba layers (#29387 ) Signed-off-by: Chen Zhang <zhangch99@outlook.com>	2026-02-10 07:41:16 +00:00
Roger Wang	8a5e0e2b2b	[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching (#34183 ) Signed-off-by: Roger Wang <hey@rogerw.io>	2026-02-10 13:03:32 +08:00
Cyrus Leung	48312e579a	[Misc] Make `PlaceholderRange.get_num_embeds` a method (#34035 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-02-07 05:30:17 +00:00
Aaron Hao	89a385d79f	[Feat][RL] Pause and Resume with keep requests for single engine (#32351 ) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Aaron Hao <ahao@anyscale.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2026-02-07 00:08:58 +00:00
Seiji Eicher	aca5967416	[KV Connector] Add missing method overrides to MultiConnector (#33292 ) Signed-off-by: Seiji Eicher <seiji@anyscale.com>	2026-02-06 12:58:21 -05:00
emricksini-h	325ab6b0a8	[Feature] OTEL tracing during loading (#31162 )	2026-02-05 16:59:28 -08:00
Benjamin Chislett	af3162d3aa	[Spec Decode] Unified Parallel Drafting (#32887 ) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>	2026-02-05 12:37:18 -05:00
liranschour	8322d4e47f	Enable Cross layers KV cache layout at NIXL Connector V2 (#33339 ) Signed-off-by: Liran Schour <lirans@il.ibm.com> Signed-off-by: liranschour <liranschour@users.noreply.github.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>	2026-02-05 02:17:02 -08:00
Mark McLoughlin	2abd97592f	[KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522 ) Signed-off-by: Mark McLoughlin <markmc@redhat.com>	2026-02-05 09:57:27 +02:00
Nick Hill	add9f1fbd9	[Minor] Include `StreamingInput` in inputs package (#33856 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-02-05 04:38:20 +00:00
Nick Hill	fa4e0fb028	[Core] Don't schedule spec tokens with prefill chunks (#33652 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-02-04 23:40:22 +00:00
Or Ozeri	8e32690869	[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads (#32255 ) Fixes a not-yet-reported case where it was possible for blocks to be freed by an abort before an async transfer completed, resulting in corrupted KV data. Signed-off-by: Or Ozeri <oro@il.ibm.com>	2026-02-04 11:16:34 +00:00
zhanqiuhu	4403e3ed4c	[Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290 ) Add labeled Prometheus metrics to distinguish where prompt tokens come from in P/D disaggregated deployments. In P/D disaggregation, decode instances receive KV cache from prefill instances. Currently, decode reports inflated prompt throughput because it counts all prompt tokens as "computed", even though most were transferred. This PR adds labeled metrics so users can understand actual compute work vs transferred work: `vllm:prompt_tokens_by_source_total{source="local_compute"}`: tokens prefilled locally; `vllm:prompt_tokens_by_source_total{source="external_kv_transfer"}`: tokens received via KV transfer; `vllm:prompt_tokens_by_source_total{source="local_cache_hit"}`: tokens from local prefix cache; `vllm:prompt_tokens_cached_total`: total cached (local + external, -1 when all Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>	2026-02-04 07:46:48 +00:00
Frank Wang	45f8fd6f97	[Feature] Enable `TRITON_ATTN` for Batch Invariance (#33688 ) Signed-off-by: frankwang28 <frank.wbb@hotmail.com>	2026-02-04 13:27:34 +08:00
Nick Hill	52ee21021a	[BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-02-03 23:34:41 +00:00
Harry Mellor	61e632aea1	Turn `@config` into a `dataclass_transform` (#31541 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2026-02-03 17:40:59 +00:00
yugong333	ffe1fc7a28	Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. (#32005 ) Signed-off-by: Yu Gong <yu3.gong@gmail.com>	2026-02-02 12:30:06 -05:00
shanjiaz	d95b4be47a	move spec decode slow test to test_areas.yaml (#33365 ) Signed-off-by: shanjiaz <zsjwpianpian@gmail.com>	2026-02-02 06:28:36 -08:00
Nicolò Lucchesi	528b3076af	[CI][Bugfix] Fix flaky `tests/v1/kv_connector/unit/test_multi_connector.py::test_multi_example_connector_consistency` (#33555 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2026-02-02 03:01:29 -08:00
Yifan Qiao	a01ef3fa51	[Fix] prefix cache hit rate == 0 bug with gpt-oss style models (#33524 ) Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>	2026-02-02 01:59:58 +00:00
jma99_2333	22d9a056d5	Support clear mm and encoder cache (#33452 ) Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.io>	2026-01-31 15:22:25 +00:00
Nick Hill	876a16f4fb	[ModelRunner V2] Fix spec decoding + logprobs (#33391 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-01-31 03:33:26 +00:00
Matthew Bonanni	aaa901ad55	[Attention] Move MLA `forward` from backend to layer (#33284 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-30 19:30:00 -08:00
Kyle Sayers	f857a03f6b	[QeRL] Layerwise Reloading (#32133 ) Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>	2026-01-30 08:50:05 -07:00


990 Commits