Yashwant Bezawada
|
a13d8c03c9
|
[KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops (#31057)
Signed-off-by: Yashwant Bezawada <yashwant_b@me.com>
|
2026-03-02 15:04:47 -05:00 |
|
Fynn Schmitt-Ulms
|
9433acb8df
|
[Spec Decode] Add hidden states extraction system (#33736)
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
|
2026-03-02 14:29:09 -05:00 |
|
ElizaWszola
|
d9c7730877
|
[Performance] Extract kv update ops from MLA attention backends (#34627)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Di Wu <dw2761@nyu.edu>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2026-03-02 10:43:19 -05:00 |
|
wangxiyuan
|
510bc9e1df
|
[Misc] Cleanup useless current_platform import (#35715)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2026-03-02 09:36:54 +00:00 |
|
Lucas Wilkinson
|
8b5014d3dd
|
[Attention] FA4 integration (#32974)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2026-03-01 23:44:57 +00:00 |
|
Richard Zou
|
e82fbeec7b
|
[torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 (#35475)
Signed-off-by: Richard Zou <zou3519@gmail.com>
|
2026-03-01 21:44:22 +00:00 |
|
Ilya Markov
|
b2d8b422b2
|
[EPLB] Enforce sync eplb for NCCL-based all2all backend (#35212)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
|
2026-02-28 05:47:12 +00:00 |
|
Itay Alroy
|
dea268336f
|
[1/N] Elastic EP Milestone 2 (#34861)
Signed-off-by: Yongji Wu <wuyongji317@gmail.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Ron Tourgeman <rtourgeman@nvidia.com>
Co-authored-by: Yongji Wu <wuyongji317@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
|
2026-02-28 04:46:42 +00:00 |
|
Aaron Hao
|
2ce6f3cf67
|
[Feat][RL][2/2] Native Weight Syncing API: IPC (#34171)
Signed-off-by: hao-aaron <ahao@anyscale.com>
Signed-off-by: Aaron Hao <ahao@anyscale.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
|
2026-02-27 13:45:21 -07:00 |
|
Lucas Wilkinson
|
1d532f9d8f
|
[DP] Only use DP padding when cudagraphs are actually used (#34102)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-02-27 15:14:31 -05:00 |
|
Jason Li
|
66c1751d13
|
[compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism (#35410)
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com>
|
2026-02-27 08:36:37 -05:00 |
|
Jiangyun Zhu
|
487e5c51f7
|
[Bugfix] disable allreduce_rms_fusion by default when pp size > 1 (#35424)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
|
2026-02-27 04:18:52 +00:00 |
|
Tyler Michael Smith
|
eb19955c37
|
[WideEP] Remove pplx all2all backend (#33724)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
|
2026-02-26 14:30:10 -08:00 |
|
sychen52
|
d0105b84f0
|
add mixed precision support for modelopt (#35047)
Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
|
2026-02-26 21:56:24 +00:00 |
|
Gregory Shtrasberg
|
6042e66cd5
|
[ROCm] Add extra step in config initialization to populate custom ops before compilation config init (#34848)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
|
2026-02-26 16:05:40 +08:00 |
|
Jason Li
|
9d37941017
|
[torch.compile] Sequence Parallelism threshold compile ranges (#28672)
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com>
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2026-02-26 05:00:12 +00:00 |
|
Michael Goin
|
de527e1cec
|
[UX] Add --moe-backend arg for explicit kernel selection (#33807)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2026-02-25 17:44:44 -08:00 |
|
Michael Goin
|
cbf8f7028c
|
[UX] Add --performance-mode {balanced,interactivity,throughput} (#34936)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2026-02-25 17:28:31 -08:00 |
|
Ming Yang
|
6831650c40
|
[offloader] v2: Hide weight onloading latency via prefetching (#29941)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2026-02-25 17:20:59 -08:00 |
|
Rohan Potdar
|
f38f8c9742
|
[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE (#35180)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
|
2026-02-25 04:36:40 +00:00 |
|
Benjamin Chislett
|
f5972a872f
|
[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support (#33726)
Signed-off-by: Shahar Mor <smor@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Shahar Mor <smor@nvidia.com>
Co-authored-by: Roi Koren <roik@nvidia.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-02-24 09:49:56 -08:00 |
|
Nicolò Lucchesi
|
f91808ae0d
|
[MM] Allow audio chunking for offline LLM (#34628)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2026-02-23 21:04:28 -08:00 |
|
pschlan-amd
|
80d93fd6da
|
gpu_model_runner: Cache is_encoder_decoder from model config (#35099)
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
|
2026-02-23 19:08:34 -08:00 |
|
Rohan Potdar
|
2ff4e51152
|
[ROCm] AITER fused RoPE+KVCache (#33443)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: charlifu <charlifu@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>
|
2026-02-23 19:06:00 -08:00 |
|
Vincent Gimenes
|
aa08a30fc9
|
[CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig (#35060)
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com>
|
2026-02-23 05:05:36 -08:00 |
|
Cyrus Leung
|
987506bca6
|
[Refactor] Simplify dummy data generation (#35025)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2026-02-22 20:55:27 -08:00 |
|
qizixi
|
2bcf71b9c0
|
[Spec Decode] Reduce TP communication for speculative decoding draft token generation (#34049)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
|
2026-02-22 14:59:16 -08:00 |
|
Benjamin Chislett
|
682566b18e
|
[Bug] Refactor max_num_batched_tokens to account for drafting (#34898)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
|
2026-02-22 11:18:46 -05:00 |
|
Nicolò Lucchesi
|
ab6f3487a6
|
[PD] Change kv_load_failure_policy Default from "recompute" to "fail" (#34896)
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2026-02-21 01:34:57 -08:00 |
|
Lucas Wilkinson
|
aaefc58ee0
|
[CI] Revert PRs 34818 and 33600 (#34979)
|
2026-02-20 13:25:50 -08:00 |
|
杨朱 · Kiki
|
07cab212f0
|
[Misc] Add deprecated environment variable utilities (#33677)
Signed-off-by: carlory <baofa.fan@daocloud.io>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-02-19 21:33:25 -08:00 |
|
Matthew Bonanni
|
662205d34e
|
[Bugfix] Fix Basic Models Test (#34818)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2026-02-19 14:49:07 -08:00 |
|
Kyle Sayers
|
64ac1395e8
|
[Docs] Clean up speculators docs (#34065)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
|
2026-02-18 13:48:11 -08:00 |
|
Luka Govedič
|
02e8f26cea
|
[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ (#34718)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
|
2026-02-18 03:29:15 +00:00 |
|
Matthew Bonanni
|
7743152957
|
[Attention] Refactor check_and_update_config (#33600)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-02-17 17:06:54 -08:00 |
|
Lucas Kabela
|
a3205beffb
|
[CI] Enable mypy coverage for individual excluded files (#34292)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-02-16 07:34:29 -08:00 |
|
Luka Govedič
|
23d825aba1
|
[torch.compile] Disable ar-rms fusion for ds3-fp4 & DP, fix CI test (#34392)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-02-15 06:33:57 -08:00 |
|
Thomas Parnell
|
d5fe3f702c
|
[Hybrid] Enable mamba prefix cache "align" mode with async scheduling (#33997)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
|
2026-02-14 13:15:56 -08:00 |
|
Wei Zhao
|
b37b679770
|
[Feature][Perf] Support Selective CPU Weight Offloading (#34535)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
|
2026-02-13 20:02:24 -08:00 |
|
Richard Zou
|
87789c8364
|
[Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only (#34523)
Signed-off-by: Richard Zou <zou3519@gmail.com>
|
2026-02-13 09:52:20 -08:00 |
|
Roger Wang
|
1dae7b7843
|
[Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one (#34508)
Signed-off-by: Roger Wang <hey@rogerw.io>
|
2026-02-13 13:59:00 +00:00 |
|
Harry Huang
|
6f019e6e0a
|
[BugFix] Add block_size validation for mamba cache align mode (#34445)
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com>
|
2026-02-12 18:18:07 -08:00 |
|
Cyrus Leung
|
fb455ed547
|
[V0 Deprecation] Remove code related to per-request logits processors (#34400)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2026-02-12 20:44:28 +08:00 |
|
Cyrus Leung
|
b96f7314b4
|
[Refactor] Pass Renderer to Input Processor (#34329)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2026-02-11 19:38:11 -08:00 |
|
SorenDreano
|
48134a2c22
|
[Docs] Fix typo ("defult") and double spacing (#34348)
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com>
Co-authored-by: Soren Dreano <soren@numind.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2026-02-11 09:02:27 -08:00 |
|
Harry Mellor
|
40b8f55358
|
[Docs] Reduce time spent generating API docs (#34255)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-02-11 02:56:02 -08:00 |
|
Luka Govedič
|
addac0e653
|
[torch.compile] Enable AR+rms fusion by default available for -O2 (#34299)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
|
2026-02-11 00:30:00 -08:00 |
|
Zhengkai Zhang
|
6f2f59f2b3
|
[Misc][Spec Decode] support different load config for draft model (#34022)
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com>
|
2026-02-10 14:52:43 -08:00 |
|
Pavani Majety
|
578977bb5e
|
[SM100] Resubmit FMHA FP8 prefill for MLA (#31195)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
|
2026-02-10 16:18:43 -05:00 |
|
Qi Wang
|
33bcd3dc3b
|
[Misc] Introduce ec_both role EC (encoder cache) connector (#34182)
Signed-off-by: Qi Wang <qiwa@nvidia.com>
|
2026-02-10 18:55:35 +00:00 |
|