Benjamin Chislett
ee59a7c615
[Tests] Add GSM8k check to SpecDec E2E tests ( #34772 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-25 07:51:14 -05:00
Chen Zhang
8fae54faff
[Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache ( #35157 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-24 22:00:19 -08:00
Harry Mellor
f7967577f5
Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM ( #35203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 22:00:06 -08:00
Nick Hill
dbf0da817a
[Core] Cleanup engine pause/sleep logic ( #34528 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-24 19:33:34 -08:00
Andreas Karatzas
067c5d9ad1
[ROCm][CI] Added MI325 mirrors ( #34923 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-24 13:37:15 -08:00
Benjamin Chislett
f5972a872f
[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support ( #33726 )
...
Signed-off-by: Shahar Mor <smor@nvidia.com >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Shahar Mor <smor@nvidia.com >
Co-authored-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-24 09:49:56 -08:00
Rohan Potdar
2ff4e51152
[ROCm] AITER fused RoPE+KVCache ( #33443 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: charlifu <charlifu@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
2026-02-23 19:06:00 -08:00
haosdent
a2ba6a5244
[Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) ( #34874 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 17:31:51 +01:00
Andreas Karatzas
5f68464f92
[ROCm][CI] Fix spec decode profile assertion and logprob test determinism ( #35043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-23 05:05:54 -08:00
Nicolò Lucchesi
ab6f3487a6
[PD] Change kv_load_failure_policy Default from "recompute" to "fail" ( #34896 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:34:57 -08:00
Andreas Karatzas
54254f7a61
[ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends ( #34599 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:23 -08:00
zhongdaor-nv
a0fe7ea2f0
[feat] Add per-block extra_keys to KV events ( #33304 )
...
Signed-off-by: zhongdaor-nv <zhongdaor@nvidia.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 20:11:40 -08:00
Xin Yang
7a5adad480
[Kernel] Optimize sample_recovered_tokens_kernel ( #34974 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 19:59:06 -08:00
Lucas Wilkinson
aaefc58ee0
[CI] Revert PRs 34818 and 33600 ( #34979 )
2026-02-20 13:25:50 -08:00
Wei Zhao
f24b2de3d3
[Test] Add FP8 KV Cache Testing for MLA Backends ( #34473 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-20 18:51:58 +00:00
rasmith
0c1dc42748
[CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass ( #33739 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-19 21:32:40 -08:00
Matthew Bonanni
662205d34e
[Bugfix] Fix Basic Models Test ( #34818 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 14:49:07 -08:00
Kyle Sayers
64ac1395e8
[Docs] Clean up speculators docs ( #34065 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-02-18 13:48:11 -08:00
Jongseok Park
c656ba3b4d
[Kernel] Triton-based Top-k and Top-p sampler kernels ( #33538 )
...
Signed-off-by: js_park <cakeng@naver.com >
Signed-off-by: Jongseok Park <37990712+cakeng@users.noreply.github.com >
Signed-off-by: Sunga Kim <sunga.kim@berkeley.edu >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Sunga Kim <sunga.kim@berkeley.edu >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 23:14:30 +00:00
Cyrus Leung
574fe75245
[Renderer] Move InputPreprocessor into Renderer (2/2) ( #34560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 05:29:01 -08:00
junuxyz
c61a98f529
[CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior ( #34514 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-17 12:22:56 +00:00
Ekagra Ranjan
cd81cdb399
[Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs ( #31058 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-16 11:08:44 +00:00
Thomas Parnell
d5fe3f702c
[Hybrid] Enable mamba prefix cache "align" mode with async scheduling ( #33997 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-02-14 13:15:56 -08:00
Cyrus Leung
73391a1baa
[Renderer] Move InputPreprocessor into Renderer (1/2) ( #34510 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-14 10:14:21 -08:00
Andreas Karatzas
b3c14229b0
[ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests ( #34538 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 07:32:09 -08:00
Harry Huang
c027541eaf
[Hybrid] Enable spec decoding in mamba cache align mode ( #33705 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 13:02:28 -08:00
Ben Browning
fd267bc7b7
[Bugfix]: Fix structured output in multi-turn gpt-oss ( #34454 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 11:12:48 -08:00
Aaron Hao
dddbff4624
[Core] Move pause and resume functions into engine ( #34125 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-13 00:15:10 -08:00
haosdent
4137c5dfa7
[Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode ( #34418 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-13 00:13:22 -08:00
Cyrus Leung
ea5ff3a1f6
[Refactor] Simplify BOS/EOS token handling ( #34435 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:18:24 -08:00
Matthew Bonanni
f2c47886fd
[Attention] Add FlashInfer Sparse MLA backend ( #33451 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-02-12 17:21:54 +00:00
Cyrus Leung
fb455ed547
[V0 Deprecation] Remove code related to per-request logits processors ( #34400 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:44:28 +08:00
Cyrus Leung
b96f7314b4
[Refactor] Pass Renderer to Input Processor ( #34329 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:38:11 -08:00
Lucas Wilkinson
c7914d30f9
Reapply [Attention][FA3] Update FA3 to include new swizzle optimization ( #34043 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-11 07:07:56 -08:00
Cyrus Leung
b5dcb372e4
[Misc] Clean up validation logic in input processor ( #34144 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:29:29 -08:00
Pavani Majety
578977bb5e
[SM100] Resubmit FMHA FP8 prefill for MLA ( #31195 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-10 16:18:43 -05:00
junuxyz
c5a66d1697
[Core][BugFix] Fix PP KV cache sharding memory validation ( #33698 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-10 10:46:24 -05:00
Krish Gupta
748625cdaf
[V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input ( #34220 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-10 13:05:32 +00:00
Chen Zhang
97fa8f6590
[BugFix] Avoid prefix cache hit in the same schedule step for mamba layers ( #29387 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-10 07:41:16 +00:00
Roger Wang
8a5e0e2b2b
[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching ( #34183 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:03:32 +08:00
Cyrus Leung
48312e579a
[Misc] Make PlaceholderRange.get_num_embeds a method ( #34035 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 05:30:17 +00:00
Aaron Hao
89a385d79f
[Feat][RL] Pause and Resume with keep requests for single engine ( #32351 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-07 00:08:58 +00:00
Seiji Eicher
aca5967416
[KV Connector] Add missing method overrides to MultiConnector ( #33292 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-06 12:58:21 -05:00
emricksini-h
325ab6b0a8
[Feature] OTEL tracing during loading ( #31162 )
2026-02-05 16:59:28 -08:00
Benjamin Chislett
af3162d3aa
[Spec Decode] Unified Parallel Drafting ( #32887 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-05 12:37:18 -05:00
liranschour
8322d4e47f
Enable Cross layers KV cache layout at NIXL Connector V2 ( #33339 )
...
Signed-off-by: Liran Schour <lirans@il.ibm.com >
Signed-off-by: liranschour <liranschour@users.noreply.github.com >
Co-authored-by: Or Ozeri <or@ozery.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-05 02:17:02 -08:00
Mark McLoughlin
2abd97592f
[KV Connector][Metrics] Do not count local prefix cache hits in connector queries ( #30522 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-05 09:57:27 +02:00
Nick Hill
add9f1fbd9
[Minor] Include StreamingInput in inputs package ( #33856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-05 04:38:20 +00:00
Nick Hill
fa4e0fb028
[Core] Don't schedule spec tokens with prefill chunks ( #33652 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-04 23:40:22 +00:00
Or Ozeri
8e32690869
[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads ( #32255 )
...
Fixes a not-yet-reported case in which an abort could free blocks
before an async transfer completed, resulting in corrupted KV data.
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-02-04 11:16:34 +00:00