Commit Graph

15349 Commits

Author SHA1 Message Date
Yifan Qiao
1dbbafd3f3 [Feat][v1] Simple yet General CPU KV Cache Offloading (#37160)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
(cherry picked from commit 91e4521f9f)
v0.19.0rc0
2026-04-01 01:03:14 -07:00
Lucas Wilkinson
0ee3b7fc3d [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking (#36178)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
(cherry picked from commit eb47454987)
2026-04-01 01:02:58 -07:00
Matthew Bonanni
268bed9cf3 [Bugfix][Async] Fix async spec decoding with hybrid models (#38556)
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
(cherry picked from commit 757068dc65)
2026-04-01 01:02:35 -07:00
Jiangyun Zhu
bcc0fdd0f3 [CI] fix LM Eval Qwen3.5 Models (B200) (#38632)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
(cherry picked from commit ea7bfde6e4)
2026-04-01 01:02:20 -07:00
wang.yuqi
69b8bd4b33 [CI Failure] pin colmodernvbert revision (#38612)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
(cherry picked from commit 719735d6c5)
2026-04-01 01:02:04 -07:00
Li, Jiang
12449f9492 [Bugfix][CPU] Skip set_num_threads after thread binding (#38535)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
(cherry picked from commit 6557f4937f)
2026-03-30 23:01:42 -07:00
haosdent
b92312dfd7 [CI] Fix SPLADE pooler test broken by #38139 (#38495)
Signed-off-by: haosdent <haosdent@gmail.com>
(cherry picked from commit a08b7733fd)
2026-03-30 21:52:13 -07:00
Jaewon
d816834c1a [MoE] Add RoutingMethodType.Simulated to TRT-LLM FP8/NVFP4 kernel allowlists (#38329)
Signed-off-by: Jaewon Lee <jaewon@meta.com>
2026-03-29 22:53:43 -07:00
Roger Wang
92f0db57a8 [Misc] Always use forward_mulmat for Conv3d on newer versions of torch. (#38487) 2026-03-30 05:39:41 +00:00
Andreas Karatzas
bea23536f6 [CI] Add temperature=0.0, reduce max_tokens, and add debug prints to audio_in_video tests (#38492)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-30 05:36:45 +00:00
Jiangyun Zhu
c133f33746 Add @ZJY0516 to CODEOWNERS (#38497)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2026-03-29 21:10:00 -07:00
Stanislav Kirillov
a6db99ba02 [Bugfix] Support multi-type params parsing for DeepSeek v3.2 (#33703)
Signed-off-by: Stanislav Kirillov <stas@nebius.com>
Co-authored-by: Stanislav Kirillov <stas@nebius.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2026-03-30 04:07:28 +00:00
Andreas Karatzas
4f2ed5fddb [ROCm][CI] Enable hybrid chunked prefill test (#38317)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-30 10:30:26 +08:00
Kyle Sayers
d28d86e8a3 [QeRL] Fix online quantized reloading (#38442)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-03-29 14:56:41 -06:00
Wentao Ye
995dea1354 [Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement (#38139)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-29 18:12:50 +00:00
allgather
8c0b6267d7 [Transformers v5] fix missing pixtral/voxtral multimodal dispatch (#38410)
Signed-off-by: allgather <all2allops@gmail.com>
2026-03-29 09:59:06 +00:00
Andreas Karatzas
43cc5138e5 [ROCm][CI] Fix cross-attention dispatch for encoder-decoder models (#38450)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-28 22:08:03 -07:00
Shubhra Pandit
5b8c30d62b [Spec Decode, BugFix] Propagate norm_before_fc from Eagle3 speculator (#38111)
Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com>
2026-03-29 00:42:06 +00:00
haosdent
d39b8daf5f [Feature] Add Qwen3-ForcedAligner support via token classification pooling (#35367)
Signed-off-by: haosdent <haosdent@gmail.com>
2026-03-29 00:27:52 +00:00
Walter Beller-Morales
fafca38adc [BugFix][Frontend] apply task instruction as system prompt in cohere v2/embed (#38362)
Signed-off-by: walterbm <walter.beller.morales@gmail.com>
2026-03-28 18:30:54 +00:00
Kunshang Ji
aa4eb0db78 [CI]revert initialize_model context manager (#38426)
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-03-28 16:56:50 +00:00
Andreas Karatzas
af89140efc [ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry (#38415)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-03-29 00:47:42 +08:00
haosdent
b2bc736b12 [CI] Fix Ernie4.5-VL initialization test (#38429)
Signed-off-by: haosdent <haosdent@gmail.com>
2026-03-28 22:43:24 +08:00
whyiug
58c959a767 [Misc]: clean up non-core lint issues (#37049)
Signed-off-by: whyiug <whyiug@hotmail.com>
2026-03-28 10:28:16 -04:00
Bvicii
bda3eda82d [Bugfix] Disallow renderer_num_workers > 1 with mm processor cache (#38418)
Signed-off-by: Bvicii <yizhanhuang2002@gmail.com>
2026-03-28 06:32:52 -07:00
Michael Goin
2bf5b70ae8 [CI Bugfix] Pre-download missing FlashInfer headers in Docker build (#38391)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-03-28 06:09:00 -07:00
yzong-rh
6dad4c5722 [Test] Fix flaky race condition in test_abort_final_step (#38414)
Signed-off-by: Yifan <yzong@redhat.com>
2026-03-28 09:06:56 +00:00
Liwen
171775f306 Fix Device Index for ROCm Ray Workers in MoE Benchmark (#38108)
Signed-off-by: Liwen <53441624+li-liwen@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-28 08:27:11 +00:00
TJian
58a249bc61 [ROCm] [Release] Update ROCm variant from rocm700 to rocm721 (#38413)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-03-28 06:07:03 +00:00
IriKa
148a5c1226 [Bugfix]fix output Nan/Inf in marlin if dtype=float16 (#33972)
Signed-off-by: IriKa Qiu <qiujie.jq@gmail.com>
2026-03-27 16:36:08 -07:00
Wei Zhao
b69bf2f0b1 [Perf] Use torch compile to fuse pack topk in trtllm moe (#37695)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com>
2026-03-27 17:30:46 -06:00
rongfu.leng
88149b635e Add nvidia h800 moe config (#31201)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2026-03-27 16:28:48 -07:00
Hongxia Yang
83a4df049d [ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips (#38367)
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com>
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com>
2026-03-27 23:20:19 +00:00
Gregory Shtrasberg
731285c939 [ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 (#38252)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2026-03-27 18:03:12 -05:00
Johnny
97d19197bc [NVIDIA] Fix DGX Spark logic (#38126)
Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com>
Signed-off-by: guillaume_guy <guillaume.guy@airbnb.com>
Signed-off-by: Guillaume Guy <guillaume.c.guy@gmail.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: Sathish Sanjeevi <SKPsanjeevi@users.noreply.github.com>
Co-authored-by: Guillaume Guy <guillaume.c.guy@gmail.com>
Co-authored-by: guillaume_guy <guillaume.guy@airbnb.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-03-27 15:26:07 -07:00
Giancarlo Delfin
384e4d5f48 [Model Runner V2] Rebuild attention metadata before eagle decode full… (#38311)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
2026-03-27 13:46:42 -07:00
Nicolò Lucchesi
44a6528028 [CI] Skip failing test (#38369)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-03-27 13:25:19 -07:00
Kyle Sayers
648edcf729 [QeRL] Compose online quantization with quantized reloading (#38032)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2026-03-27 13:22:33 -07:00
Michael Goin
7ba425e916 Add short flag -sc for --speculative-config argument (#38380)
Co-authored-by: Claude <noreply@anthropic.com>
2026-03-27 12:04:22 -07:00
Gregory Shtrasberg
b8665383df [ROCm] Fix GPT-OSS import for triton 3.6 (#37453)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2026-03-27 18:00:57 +00:00
Rohan Potdar
0e9358c11d {ROCm]: gpt-oss fusion/padding fixes (#38043)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
2026-03-27 12:19:15 -04:00
Harry Mellor
21d2b53f88 Remove need for explicit \n in docstring lists for --help formatting (#38350)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 08:38:00 -07:00
Jonas M. Kübler
98e7f223b9 enable skipping of SW attention layers when using FP8 KV cache (#33695)
Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
2026-03-27 07:25:02 -06:00
Juan Pérez de Algaba
b111f8a61f fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit (#37952)
Signed-off-by: jperezde <jperezde@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2026-03-27 09:02:10 -04:00
Sage Moore
497e234d38 [EPLB] Cleanup the transfer logic for the various eplb maps (#34520)
Signed-off-by: Sage Moore <sagmoore@redhat.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2026-03-27 10:18:46 +01:00
dtc
6287e7fa20 [P/D] Mooncake: Add unit tests and minor fixes for mooncake connector (#36946)
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
2026-03-27 09:26:40 +01:00
Shengqi Chen
84e439a9cb [CI/Build] Move nightly wheel index generation to a single post-build step (#38322)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-27 07:44:18 +00:00
Yuichiro Utsumi
a1746ff9ec [Doc] Clarify Helm chart location in deployment guide (#38328)
Signed-off-by: Yuichiro Utsumi <utsumi.yuichiro@fujitsu.com>
Signed-off-by: Yuichiro Utsumi <81412151+utsumi-fj@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 15:43:02 +08:00
Flora Feng
aee4c14689 [Bugfix] Fix Hermes tool parser when stream interval > 1 (#38168)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2026-03-27 14:42:26 +08:00
Bowen Bao
0ae89f18fd [Refactor] Move FusedMoE hidden_size roundup to quant_method (#34285)
Signed-off-by: Bowen Bao <bowenbao@amd.com>
2026-03-26 23:38:26 -07:00