Commit Graph

15177 Commits

Author SHA1 Message Date
mikaylagawarecki
8b10e4fb31 [1/n] Migrate permute_cols to libtorch stable ABI (#31509)
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
2026-03-19 11:27:26 -04:00
Ifta khairul Alam Adil
104605cbf2 Remove deprecated reasoning_content message field(part-2) (#37480)
Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Philip Ottesen <phiott256@gmail.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Signed-off-by: Andy Lo <andy@mistral.ai>
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Signed-off-by: sihao.li <sihao.li@intel.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: JartX <sagformas@epdcenter.es>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Philip Ottesen <phiott256@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com>
Co-authored-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Thillai Chithambaram <79466435+thillai-c@users.noreply.github.com>
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 15:20:08 +00:00
Jee Jee Li
96266f119b [LoRA] Minor improvements to LoRA log (#37557)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-03-19 15:18:06 +00:00
Sage Moore
7c0cf3bcd0 Cap the number of API servers to 1 when using Elastic EP. (#37466)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2026-03-19 10:42:57 -04:00
Harry Mellor
572b432913 Stop bench CLI from recursively casting all configs to dict (#37559)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 14:04:03 +00:00
Cyrus Leung
9515c20868 [Misc] Clean up processing logic (#37541)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-19 13:30:20 +00:00
DorBernsohn
c63ca2b2e6 [Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support (#37438)
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com>
2026-03-19 21:08:00 +08:00
Harry Mellor
a32eaf5bb2 [CI] Merge cleanup_pr_body.yml and reminder_comment.yml (#37552)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 12:55:07 +00:00
XueLiang Yang
e390742c59 Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… (#37536)
Signed-off-by: xueliangyang-oeuler <yxl546827391@gmail.com>
Co-authored-by: xueliangyang-oeuler <yxl546827391@gmail.com>
2026-03-19 12:05:07 +00:00
Cyrus Leung
7a6ebcbfcf [Model] Remove unnecessary get_language_model (#37545)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-19 20:00:36 +08:00
Cyrus Leung
c7bc12c20f [CI/Build] Split out MM pooling tests (#37542)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-19 11:36:11 +00:00
wang.yuqi
f9e2a38386 [Docs] Reorganize pooling docs. (#35592)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 11:25:47 +00:00
Harry Mellor
4426447bba Don't log exc_info when vLLM tries to doenload a file that doesn't exist (#37458)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-19 10:38:29 +00:00
Li, Jiang
3322e26420 [Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile (#37538)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2026-03-19 10:24:39 +00:00
Cyrus Leung
765e461065 [Bugfix] Fix Nemotron Parse loading (#37407)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-19 09:55:29 +00:00
Duyi-Wang
6a9cceb219 [Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant (#37418)
Signed-off-by: Duyi-Wang <duyi.wang@amd.com>
2026-03-19 09:49:27 +00:00
yassha
199f914183 fix(cpu): add null check for aligned_alloc in ScratchPadManager (#37369)
Signed-off-by: yassha <50112520+yassha@users.noreply.github.com>
2026-03-19 17:45:06 +08:00
Kunshang Ji
ca21483bf9 [MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available (#37415)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2026-03-19 09:23:24 +00:00
TJian
da70c87e81 [CI] Fix wrong path test file, missing rlhf_async_new_apis.py (#37532)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-03-19 02:21:55 -07:00
Collin McCarthy
0b6d52629f Support temporal compression for Nemotron-3-VL videos (#36808)
Signed-off-by: Collin McCarthy <cmccarthy@nvidia.com>
2026-03-19 08:02:19 +00:00
Ziming Huang
d3cc379567 [Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__ (#37425)
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
2026-03-19 15:43:48 +08:00
cdpath
354cd580d5 fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming (#37510)
Signed-off-by: cdpath <cdpath@outlook.com>
2026-03-19 07:23:35 +00:00
zhanqiuhu
d49f273144 [SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation (#37310) 2026-03-19 08:22:00 +01:00
Flora Feng
b21d384304 [Refactor] Relocate endpoint tests to mirror serving code directory structure (#37504)
Signed-off-by: sfeng33 <4florafeng@gmail.com>
2026-03-19 07:19:36 +00:00
Hongxia Yang
e3126cd107 [ROCm] issue management - request information for bug issues on ROCm (#37009)
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com>
2026-03-19 03:51:29 +00:00
Wentao Ye
e37ff5b5c8 [Perf] Optimize token_embed for pooling models, 1.0% token throughput improvement (#37347)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-19 10:27:51 +08:00
Aaron Hao
6accb21f2a [bug] Fix deadlock with pause resume and collective_rpc (#37024)
Signed-off-by: hao-aaron <ahao@anyscale.com>
2026-03-19 01:49:02 +00:00
Giancarlo Delfin
053f3b6309 [Model Runner V2] Spec decode rejection sampler logprobs support (#37237)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
2026-03-19 01:36:27 +00:00
Aaron Hao
5f82706a21 [BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep (#37334)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
2026-03-19 00:45:10 +00:00
Sage Moore
c32a58cc2a [EPLB] Simplify EPLB rearrange by only returning one map (#36267)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2026-03-18 20:34:00 -04:00
Elvir Crnčević
ef2c4f778d [Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding (#37442)
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
2026-03-19 00:28:37 +00:00
sihao_li
9dade5da3a [XPU]Unify xpu test dependencies in dockerfile.xpu (#36477)
Signed-off-by: sihao.li <sihao.li@intel.com>
2026-03-19 08:12:07 +08:00
Thillai Chithambaram
828f862acb [Bugfix] Expand quantization method support in perf metrics (#37231)
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
2026-03-18 23:54:19 +00:00
Andy Lo
577df69b26 [Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish (#37054)
Signed-off-by: Andy Lo <andy@mistral.ai>
2026-03-18 23:07:29 +00:00
Giancarlo Delfin
04244fd0e1 [Model Runner V2] Spec decode rejection sampler greedy support (#37238)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
2026-03-18 15:59:03 -07:00
Michael Goin
9482b0b085 [Bugfix] Remove assertion for NVFP4 scale dynamic range (#37465)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-03-18 15:37:49 -07:00
Woosuk Kwon
5bc1da147f [LoRA][BugFix] Fix skipped LoRA adapters for Mistral3 (#36928)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
2026-03-18 22:34:19 +00:00
Philip Ottesen
0091017188 fix(worker): optimize swap_states to copy only active token prefixes (#34733)
Signed-off-by: Philip Ottesen <phiott256@gmail.com>
2026-03-18 14:59:27 -07:00
Wentao Ye
0d81a1fe61 [V0 Deprecation] Deprecate virtual engine (#37195)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-18 14:30:14 -07:00
Netanel Haber
6ae4c8d6fc chunk parakeet into 30s clips to prevent OOMs on long audios (#36671)
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
2026-03-18 14:22:24 -07:00
JartX
a913b612d8 [Bugfix] Fix ROCm crash in qwen3_next multi-stream events (#36795) (#37427)
Signed-off-by: JartX <sagformas@epdcenter.es>
2026-03-18 16:06:31 -04:00
Harry Mellor
5ce2d10e4a Fix models which use layer_type_validation for Transformers v5 (#37398)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-18 18:41:51 +00:00
Chengyu Fang
738d0a281f [Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation (#37439)
Signed-off-by: chengyufang <cnyvfang@outlook.com>
2026-03-18 11:36:34 -07:00
youkaichao
70b81c4f3d [bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP (#37449)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2026-03-18 18:32:30 +00:00
Cyrus Leung
7476d148db [Model] Remove unnecessary processor definition for Nemotron Parse (#37456)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-18 18:25:13 +00:00
Cyrus Leung
f3732bd931 [Misc] Clean up model registry (#37457)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-03-18 18:24:44 +00:00
Wentao Ye
0ef7f79054 [Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement (#37340)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-03-18 14:18:34 -04:00
Or Ozeri
5dd8df0701 [kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec (#36642)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-03-18 19:26:40 +02:00
Harry Mellor
39bfb57b7c Add API docs link if the CLI arg is a config class (#37432)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-18 17:19:35 +00:00
RonaldBXu
c9d838fc33 Adding deterministic lora benchmarking to vLLM Bench (#36057)
Signed-off-by: Ubuntu <ubuntu@ip-172-31-43-201.ap-northeast-1.compute.internal>
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
2026-03-18 16:02:03 +00:00