Commit Graph

2123 Commits

Author SHA1 Message Date
JartX
2ce3d0ce36 [Feature] KV cache per-token-head INT8/FP8 quantization (#38378)
Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: yangyang4991 <yangyang4991@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2026-04-02 08:13:26 -04:00
Stefano Castagnetta
6183cae1bd [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang (#38730)
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
2026-04-01 12:08:40 -07:00
Monishver
c09ad767cd Feature/silu block quant fusion v1 (#32996)
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
2026-04-01 18:50:43 +00:00
Chauncey
cbe7d18096 [Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str (#38242)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-04-01 09:56:45 -07:00
Jesus Talavera
148c2072ec Add ibm-granite/granite-vision-3.3-2b to supported models documentation (#38714)
Signed-off-by: Jesus Talavera <jesus.talavera@ibm.com>
2026-04-01 08:22:25 -07:00
Li, Jiang
36d7f19897 [CPU] Support head_size 512 in cpu_attn (#38676)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2026-04-01 05:42:27 +00:00
Vedant V Jhaveri
2e56975657 Generative Scoring (#34539)
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-03-31 16:02:11 -07:00
BadrBasowid
077a9a8e37 [torch.compile] Refactor Attention Quant Fusion Pass and Remove Boilerplate (#37373)
Signed-off-by: BadrBasowid <badr.basowid@gmail.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-03-31 14:15:50 -04:00
Nicolò Lucchesi
7337ff7f03 [Docs] PD with Nixl compat matrix (#38628)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-03-31 15:01:21 +00:00
Flora Feng
3e802e8786 [Mypy] Fix adjust_request typing (#38264)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2026-03-31 04:21:18 +00:00
TJian
03ac6ca895 [ROCm] [DOC] Update the Documentation to include ROCm Nightly Wheel support (#38457)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-03-30 02:25:46 -07:00
Juan Pérez de Algaba
57861ae48d (security) Fix SSRF in batch runner download_bytes_from_url (#38482)
Signed-off-by: jperezde <jperezde@redhat.com>
2026-03-30 07:10:01 +00:00
Andreas Karatzas
43cc5138e5 [ROCm][CI] Fix cross-attention dispatch for encoder-decoder models (#38450)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-28 22:08:03 -07:00
haosdent
d39b8daf5f [Feature] Add Qwen3-ForcedAligner support via token classification pooling (#35367)
Signed-off-by: haosdent <haosdent@gmail.com>
2026-03-29 00:27:52 +00:00
whyiug
58c959a767 [Misc]: clean up non-core lint issues (#37049)
Signed-off-by: whyiug <whyiug@hotmail.com>
2026-03-28 10:28:16 -04:00
Hongxia Yang
83a4df049d [ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips (#38367)
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
2026-03-27 23:20:19 +00:00
Juan Pérez de Algaba
b111f8a61f fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit (#37952)
Signed-off-by: jperezde <jperezde@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2026-03-27 09:02:10 -04:00
Yuichiro Utsumi
a1746ff9ec [Doc] Clarify Helm chart location in deployment guide (#38328)
Signed-off-by: Yuichiro Utsumi <utsumi.yuichiro@fujitsu.com>
Signed-off-by: Yuichiro Utsumi <81412151+utsumi-fj@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 15:43:02 +08:00
Cyrus Leung
a9213c0ffe [Doc] Fix outdated reference to CUDAGraphManager (#38209)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-26 01:52:38 -07:00
Ekagra Ranjan
7b54f60db0 [Cohere] Enable Cohere-Transcribe (#38120)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
2026-03-25 16:13:51 -07:00
Rohan Potdar
a0e8c74005 [ROCm]: Update rope+kvcache fusion conditions and disable custom op by default (#36716)
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
2026-03-25 20:58:44 +00:00
Cyrus Leung
ba2f0acc2d [Misc] Reorganize inputs (#35182)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-25 10:22:54 -07:00
Ben Browning
a1a2566447 [Docs] Add guide for editing agent instruction files (#37819)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
2026-03-25 13:54:09 +00:00
R0CKSTAR
242c93f744 [Docs] Adds vllm-musa to custom_op.md (#37840)
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
2026-03-25 11:54:36 +00:00
Gregory Shtrasberg
189ddefbfd [ROCm] Attention selector reordering (#36702)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2026-03-25 17:42:56 +08:00
Baorun (Lauren) Mu
9d0351c91d [Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc (#37914)
Signed-off-by: Baorun Mu <bmu@nvidia.com>
2026-03-24 19:53:24 -07:00
Terry Gao
82580b10ac [Perf] Disable inductor runtime asserts by default for serving perfor… (#37485)
Signed-off-by: tianrengao <terrygao87@gmail.com>
Co-authored-by: Tianren Gao <tianren@fb.com>
2026-03-24 19:37:51 -04:00
Nick Cao
935c46dd9b [Model] Add Granite 4.0 1B speech to supported models (#38019)
Signed-off-by: Nick Cao <ncao@redhat.com>
2026-03-24 18:23:41 +00:00
Vineeta Tiwari
b58c5f28aa docs: fix broken offline inference paths in documentation (#37998)
Signed-off-by: Vineeta Tiwari <vineeta.tiwari2@ibm.com>
Signed-off-by: Vineeta Tiwari <vineetatiwari2000@gmail.com>
Co-authored-by: Vineeta Tiwari <vineeta.tiwari2@ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-24 17:35:14 +00:00
Sungjae Lee
4731884796 [Feature] limit thinking tokens (hard limit) (#20859)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-24 09:53:07 -07:00
wang.yuqi
1b6cb920e6 [Deprecate] Deprecate pooling multi task support. (#37956)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-03-24 14:07:47 +00:00
Andrew Xia
9ace378a63 [Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498)
Signed-off-by: Andrew Xia <axia@meta.com>
2026-03-23 09:58:08 +00:00
Yan Ma
d3fe857135 update doc for online fp8 quantization (#37851)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2026-03-23 05:19:03 +00:00
Lasha Koroshinadze
e7767eccae Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling (#37643)
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com>
2026-03-23 10:29:07 +08:00
Robert Shaw
4383f1532e [MoE] Move PF Methods to Folder (#35927) 2026-03-22 02:42:59 -06:00
Yongye Zhu
87bd91892f [MoE Refactor] Mxfp4 oracle rebased (#37128)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 03:37:04 +00:00
Ilya Boytsov
8b6c6b9505 [Model] Add LFM2-ColBERT-350M support (#37528)
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com>
2026-03-20 14:57:57 +00:00
Jee Jee Li
dd20ee4e3e [UX] Enable torch_profiler_with_stack (#37571)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-03-20 11:17:26 +00:00
wang.yuqi
ed359c497a [Model] Deprecate the score task (this will not affect users). (#37537)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-03-20 08:07:56 +00:00
Wangbei25
0674d1fee7 [PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder (#37293)
Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Signed-off-by: Wangbei25 <wangbei41@huawei.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>
2026-03-20 06:24:07 +00:00
bnellnm
91be5f9be3 [MoE Refactor] Rename "naive" all2all backend (#36294)
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-03-19 15:50:34 -04:00
Lucas Kabela
7769b58307 [torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict (#37345)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-03-19 17:26:12 +00:00
Ifta khairul Alam Adil
104605cbf2 Remove deprecated reasoning_content message field(part-2) (#37480)
Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Philip Ottesen <phiott256@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Signed-off-by: sihao.li <sihao.li@intel.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: JartX <sagformas@epdcenter.es>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Philip Ottesen <phiott256@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com>
Co-authored-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Thillai Chithambaram <79466435+thillai-c@users.noreply.github.com>
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 15:20:08 +00:00
wang.yuqi
f9e2a38386 [Docs] Reorganize pooling docs. (#35592)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 11:25:47 +00:00
Shwetha Poojary
cef1f302d2 [Model] Enable LoRA support for tower and connector in H2OVL (#31696)
Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com>
2026-03-18 13:26:47 +00:00
Aaron Hao
47a1f11bff [docs] Add docs for new RL flows (#36188)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-18 09:04:26 +00:00
Athrael Soju
c0745a851a [Model] Add ColQwen3.5 4.5B support (#36887)
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-03-17 21:17:02 +00:00
Wei Zhao
b36adfa349 [Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache (#37252)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
2026-03-17 20:09:20 +00:00
Bhoomit
3717a4dd47 [Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules (#34984)
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-17 14:36:41 +00:00
Walter Beller-Morales
061980c36a [Feature][Frontend] add support for Cohere Embed v2 API (#37074)
Signed-off-by: walterbm <walter.beller.morales@gmail.com>
2026-03-16 19:55:53 -04:00