Francesco Fusco
298e510848
[Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() ( #37318 )
...
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com >
2026-03-21 09:29:43 +00:00
Chaitanya Sri Krishna Lolla
3982bc2cd0
[ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs. ( #34692 )
...
Signed-off-by: Tej Kiran <vpolamre@amd.com >
Co-authored-by: Tej Kiran <vpolamre@amd.com >
2026-03-21 00:32:31 -07:00
Andreas Karatzas
02eec7ecbe
[ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list (MI355) ( #37721 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 15:27:12 +08:00
Bongwoo Bak
17ee641c45
[Responses API] Add kv_transfer_params for PD disaggregation ( #37424 )
...
Signed-off-by: bongwoobak <bongwoobak@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-21 13:48:54 +08:00
Andreas Karatzas
0d50fa1db6
[ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 ( #37610 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 12:57:25 +08:00
Simon Mo
1fa1e53a73
Revert "[compile] Initialize passes at VllmBackend init" ( #37733 )
2026-03-20 21:35:49 -07:00
Andreas Karatzas
3ffa52009f
[ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds ( #37617 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 11:58:58 +08:00
Yongye Zhu
87bd91892f
[MoE Refactor] Mxfp4 oracle rebased ( #37128 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-21 03:37:04 +00:00
Isotr0py
c7f98b4d0a
[Frontend] Remove librosa from audio dependency ( #37058 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-21 11:36:15 +08:00
tmm77
1c472f8fe1
Add get_device_uuid for rocm ( #37694 )
...
Signed-off-by: Tiffany Mintz <Tiffany.Mintz@amd.com >
2026-03-21 11:33:16 +08:00
Itay Alroy
c57d38d603
elastic_ep: Fix issues with repeated scale up/down cycles ( #37131 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-03-20 23:13:02 +00:00
Kaihang Jiang
e5ed6c6c13
[BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks ( #37475 )
...
Signed-off-by: Kaihang Jiang <kaihangj@nvidia.com >
2026-03-20 16:14:55 -06:00
Wentao Ye
b3d0b37908
[Refactor] Remove unused dead code ( #36171 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 16:12:51 -06:00
Santino Ramos
85f671b8e1
[Model Runner V2] Support Streaming Inputs ( #37028 )
...
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com >
2026-03-20 20:42:25 +00:00
Andreas Karatzas
8bc6b5cdb0
[ROCm][CI] Setting some mi325_4 tests back to optional (in parity with upstream) ( #37711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 12:25:08 -07:00
Vadim Gimpelson
4f16ebbbd3
[Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing ( #37591 ) ( #37605 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-20 12:19:26 -07:00
Angela Yi
12fd17eb51
[compile] Initialize passes at VllmBackend init ( #35216 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-03-20 11:40:33 -07:00
Cyrus Leung
37aadf6237
[Model] Update Kimi-K25 and Isaac processors to fit HF-style ( #37693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-20 18:30:22 +00:00
Le Yang
d7d2b5e405
[Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… ( #37565 )
...
Signed-off-by: Young-Leo <562593859@qq.com >
2026-03-20 18:28:34 +00:00
SherryC41
6ec5e9fd37
refactor: abstract deepgemm support into platform ( #37519 )
...
Co-authored-by: sherryC41 <sherry.c.c41@gmail.com >
2026-03-20 17:54:08 +00:00
Lucas Wilkinson
e1d85e5c24
[Attention] Support distinguishing between short extends and decodes ( #37303 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-03-20 10:49:36 -07:00
Peter Pan
79eb9369c5
fix CUDAGraph memory being counted twice ( #37426 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-20 17:36:32 +00:00
Woosuk Kwon
e80cfe575d
[MRV2] Avoid recompilation of _gather_block_tables_kernel ( #37645 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-20 10:31:45 -07:00
Xin Yang
d0532bf38d
[Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels ( #37683 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-20 11:28:41 -06:00
Andreas Karatzas
fb4e8bf442
[ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests ( #37613 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 10:16:59 -07:00
Harry Mellor
6ade4bc5a5
Fix various config related issues for Transformers v5 ( #37681 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-20 16:30:12 +00:00
Zhengxu Chen
2e089b96a8
[compile] Add compiled artifact counter for VLLM_USE_MEGA_AOT_ARTIFACT=1. ( #37589 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-20 16:22:46 +00:00
Martin Hickey
880be2b1b8
[Metrics] Some small refactoring for better maintainability ( #33898 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-20 16:11:34 +00:00
Zhengxu Chen
c0f5fae601
[compile] Fix aot test failures with torch 2.12. ( #37604 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-20 16:06:29 +00:00
Rémi Delacourt
aa84e43ccb
[Pixtral] Enable Pixtral language model support Eagle3 ( #37182 )
...
Signed-off-by: remi <remi@mistral.ai >
2026-03-20 15:50:15 +00:00
Matthias Gehre
5e806bcf54
[Bugfix] Fix ConchLinearKernel channelwise quantization (group_size=-1) ( #37329 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-20 10:32:21 -05:00
Matthias Gehre
56a62c310c
[Bugfix] Reject channelwise quantization (group_size <= 0) in ExllamaLinearKernel ( #37331 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-20 10:31:57 -05:00
L.B.R.
1779c09898
[ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode ( #34709 )
...
Signed-off-by: L.B.R. <lbr@mmonad.com >
Co-authored-by: L.B.R. <lbr@mmonad.com >
2026-03-20 10:11:23 -05:00
xuebwang-amd
44eea10f68
[ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization ( #36232 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-03-20 10:10:03 -05:00
Ilya Boytsov
8b6c6b9505
[Model] Add LFM2-ColBERT-350M support ( #37528 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
2026-03-20 14:57:57 +00:00
Harry Mellor
9f6d9dd371
Fix attribute error in isaac_patch_hf_runner ( #37685 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-20 14:49:40 +00:00
Jee Jee Li
dd20ee4e3e
[UX] Enable torch_profiler_with_stack ( #37571 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-20 11:17:26 +00:00
Chauncey
0523449c9c
[Misc] Use logger.info_once for auto tool choice log message ( #37661 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-20 10:40:36 +00:00
Flora Feng
b4c1aef21c
[Refactor] Relocate tests from tests/v1/entrypoints/ to tests/entrypoints/ ( #37500 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:50:34 -07:00
Flora Feng
6050b93bed
[Refactor] Move serve entrypoint tests under tests/entrypoints/serve/ ( #37595 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:10:47 -07:00
Andreas Karatzas
5a4a179591
[ROCm][CI] Fix granite_speech test for gfx90a by selecting compatible attention backend ( #37611 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:07:26 +08:00
Andreas Karatzas
37cd9fc107
[ROCm][CI] Remove deepep DBO tests on gfx90a ( #37614 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:07:07 +08:00
Andreas Karatzas
9cfd4ebb5e
[ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list ( #37619 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:06:53 +08:00
wang.yuqi
ed359c497a
[Model] Deprecate the score task (this will not affect users). ( #37537 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-20 08:07:56 +00:00
Giancarlo Delfin
dcee9be95a
[Model Runner V2] Fix draft logits not populated during cudagraph replay ( #37639 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-20 07:43:47 +00:00
Andreas Karatzas
bd8c4c0752
[CI] Removing deprecated rlhf examples reference ( #37585 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 15:20:33 +08:00
Wei Zhao
0140eafb15
[Bug] Fix FlashInfer allreduce fusion workspace uninitialized error ( #37461 )
...
Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: root <root@prenyx0169.a51.clusters.nvidia.com >
Co-authored-by: root <root@prenyx0042.a51.clusters.nvidia.com >
2026-03-20 03:09:21 -04:00
Kunshang Ji
bdf6a0a57b
[XPU] bump vllm-xpu-kernels to v0.1.4 ( #37641 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-20 15:04:38 +08:00
Wangbei25
0674d1fee7
[PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder ( #37293 )
...
Signed-off-by: Wangbei25 <wangbei41@huawie.com >
Signed-off-by: Wangbei25 <wangbei41@huawei.com >
Co-authored-by: Wangbei25 <wangbei41@huawie.com >
2026-03-20 06:24:07 +00:00
Cyrus Leung
30108fc8b0
[Model] Refactor Step3-VL processor to HF style ( #37579 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-20 06:05:08 +00:00
Flora Feng
e2d1c8b5e8
[Refactor] Relocate entrypoint tests to match serving code structure ( #37593 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 05:31:23 +00:00
Huanxing
6951fcd44f
[XPU] Automatically detect target platform as XPU in build. ( #37634 )
...
Signed-off-by: huanxing <huanxing.shen@intel.com >
2026-03-20 13:30:15 +08:00
Giancarlo Delfin
39474513f6
[Model Runner V2] fix draft attention metadata generation ( #37364 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-19 21:05:15 -07:00
Yuxiang Liang
638a872d77
fix(xpu): Re-compute compile ranges after platform-specific config updates ( #37523 )
...
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com >
Signed-off-by: Yuxiang Liang <yuliang@habana.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-20 03:52:35 +00:00
Flora Feng
9040151fe1
[V0 Deprecation] Deprecate --disable-frontend-multiprocessing ( #37612 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 11:31:43 +08:00
Jee Jee Li
8fbe3f303f
[Bugfix][LoRA] Fix Qwen35 LoRA ( #36976 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-20 11:09:32 +08:00
Xiao
ea2c148fa7
[compile][graph_partition]Add tensor size handling ( #36038 )
...
Signed-off-by: Xiao Fu <xiaofu@meta.com >
2026-03-19 19:55:25 -07:00
Tianmu Li
47b7af0d87
[Feat] Enable CompressedTensorW4A8Int for XPU ( #37207 )
...
Signed-off-by: Li, Tianmu <tianmu.li@intel.com >
2026-03-20 02:34:28 +00:00
tianshu-Michael-yu
269bf46d99
fix: disambiguate multimodal prefix cache keys ( #36708 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-03-20 10:33:20 +08:00
Flora Feng
e5a77a5015
[CI] Update mergify tool-calling label paths ( #37478 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:22:23 +00:00
Itay Alroy
ca1ac1a4b4
Fix DP coordinator ZMQ port TOCTOU ( #37452 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-20 00:58:31 +00:00
Divakar Verma
4ca3fa6bb4
[ROCm][Bugfix] fix cache block size mismatch for aiter unified attention ( #37606 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-20 00:00:08 +00:00
Flora Feng
be12afd284
[Bugfix] Fix Deepseekv32 tool parser when stream interval > 1 ( #36056 )
2026-03-19 19:51:25 -04:00
Wentao Ye
df3c0291a3
[Bug] Fix EmbedIOprocessor "classify" <-> "embed" ( #37573 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 07:40:10 +08:00
Wentao Ye
2be1a0f74b
[Refactor] Remove dead code in pooling model ( #37572 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 07:39:43 +08:00
Jim Smith
4120a05ff1
Fix AttributeError in Qwen3.5 GDN layers with quantized models ( #37448 )
...
Signed-off-by: Jim Smith <jim@joshua8.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
2026-03-19 19:21:14 -04:00
rasmith
98ff042917
[CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since it's not necessary ( #36996 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-20 07:12:45 +08:00
Artem Perevedentsev
b55156eae9
[Performance] Enable Triton autotuning disk cache by default ( #37188 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-19 17:36:28 -04:00
Laith Sakka
112944fab9
test Qwen/Qwen3-4B-Instruct-2507 for unbacked ( #36064 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-03-19 17:28:45 -04:00
bnellnm
91be5f9be3
[MoE Refactor] Rename "naive" all2all backend ( #36294 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-03-19 15:50:34 -04:00
Aaron Hao
4ee847e400
Comment fix for async rl example ( #35244 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-19 19:46:07 +00:00
Andreas Karatzas
040a505ff5
[ROCm][CI] Cleaning and restructuring amd-ci legacy pipeline ( #34839 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-19 14:30:58 -05:00
bnellnm
9279c59a0e
[MoE Refactor] DefaultMoERunner simplification ( #33049 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-03-19 15:07:44 -04:00
Wentao Ye
7454096199
[Log] Log once in local node by default ( #37568 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-19 12:04:59 -07:00
Andreas Karatzas
fb8b5e05fc
[CI] Add retry with 4x backoff to HTTP fetches for transient failures ( #37218 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-19 19:00:20 +00:00
Harry Mellor
e5d96dc8fc
Fix SpeculatorsConfig now that PreTrainedConfig is a dataclass in Transformers ( #37574 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 18:04:40 +00:00
EdalatiAli
daa05bf340
[Bugfix] Fix AttributeError when serving MXFP8 models with DeepGEMM installed ( #37358 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-19 17:58:33 +00:00
Lucas Kabela
7769b58307
[torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict ( #37345 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-19 17:26:12 +00:00
Chauncey
2f9f946b22
[P/D] AnthropicMessages add kv_transfer_params for PD disaggregation ( #37535 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-19 16:41:20 +00:00
Fadi Arafeh
2890aecce5
[CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded ( #37561 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-19 16:35:45 +00:00
Harry Mellor
34f093b417
[CI] Gate pre-commit on ready label or number of contributions ( #37544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 16:21:57 +00:00
Harry Mellor
4dce8321a9
Run MacOS smoke test on daily cron job instead of every commit ( #37567 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 16:19:50 +00:00
Cyrus Leung
657855ab41
[Misc] Cleanup more configs and processors ( #37560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 15:45:23 +00:00
Wei Zhao
e27b8ba3d1
[Bug] Fix fp8 trtllm MoE modular kernel supported routing methods ( #37346 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-19 11:43:06 -04:00
Woosuk Kwon
40b8363b45
[MRV2] Use fp32 for draft logits ( #37526 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-19 08:41:21 -07:00
mikaylagawarecki
8b10e4fb31
[1/n] Migrate permute_cols to libtorch stable ABI ( #31509 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-19 11:27:26 -04:00
Ifta khairul Alam Adil
104605cbf2
Remove deprecated reasoning_content message field (part-2) ( #37480 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com >
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Philip Ottesen <phiott256@gmail.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: Andy Lo <andy@mistral.ai >
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com >
Signed-off-by: sihao.li <sihao.li@intel.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Philip Ottesen <phiott256@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com >
Co-authored-by: Andy Lo <andy@mistral.ai >
Co-authored-by: Thillai Chithambaram <79466435+thillai-c@users.noreply.github.com >
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 15:20:08 +00:00
Jee Jee Li
96266f119b
[LoRA] Minor improvements to LoRA log ( #37557 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-19 15:18:06 +00:00
Sage Moore
7c0cf3bcd0
Cap the number of API servers to 1 when using Elastic EP. ( #37466 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-19 10:42:57 -04:00
Harry Mellor
572b432913
Stop bench CLI from recursively casting all configs to dict ( #37559 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 14:04:03 +00:00
Cyrus Leung
9515c20868
[Misc] Clean up processing logic ( #37541 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 13:30:20 +00:00
DorBernsohn
c63ca2b2e6
[Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support ( #37438 )
...
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com >
2026-03-19 21:08:00 +08:00
Harry Mellor
a32eaf5bb2
[CI] Merge cleanup_pr_body.yml and reminder_comment.yml ( #37552 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 12:55:07 +00:00
XueLiang Yang
e390742c59
Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… ( #37536 )
...
Signed-off-by: xueliangyang-oeuler <yxl546827391@gmail.com >
Co-authored-by: xueliangyang-oeuler <yxl546827391@gmail.com >
2026-03-19 12:05:07 +00:00
Cyrus Leung
7a6ebcbfcf
[Model] Remove unnecessary get_language_model ( #37545 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 20:00:36 +08:00
Cyrus Leung
c7bc12c20f
[CI/Build] Split out MM pooling tests ( #37542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 11:36:11 +00:00
wang.yuqi
f9e2a38386
[Docs] Reorganize pooling docs. ( #35592 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 11:25:47 +00:00
Harry Mellor
4426447bba
Don't log exc_info when vLLM tries to download a file that doesn't exist ( #37458 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 10:38:29 +00:00
Li, Jiang
3322e26420
[Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile ( #37538 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-19 10:24:39 +00:00
Cyrus Leung
765e461065
[Bugfix] Fix Nemotron Parse loading ( #37407 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 09:55:29 +00:00
Duyi-Wang
6a9cceb219
[Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant ( #37418 )
...
Signed-off-by: Duyi-Wang <duyi.wang@amd.com >
2026-03-19 09:49:27 +00:00
yassha
199f914183
fix(cpu): add null check for aligned_alloc in ScratchPadManager ( #37369 )
...
Signed-off-by: yassha <50112520+yassha@users.noreply.github.com >
2026-03-19 17:45:06 +08:00
Kunshang Ji
ca21483bf9
[MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available ( #37415 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-19 09:23:24 +00:00
TJian
da70c87e81
[CI] Fix wrong path test file, missing rlhf_async_new_apis.py ( #37532 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-19 02:21:55 -07:00
Collin McCarthy
0b6d52629f
Support temporal compression for Nemotron-3-VL videos ( #36808 )
...
Signed-off-by: Collin McCarthy <cmccarthy@nvidia.com >
2026-03-19 08:02:19 +00:00
Ziming Huang
d3cc379567
[Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__ ( #37425 )
...
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com >
2026-03-19 15:43:48 +08:00
cdpath
354cd580d5
fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming ( #37510 )
...
Signed-off-by: cdpath <cdpath@outlook.com >
2026-03-19 07:23:35 +00:00
zhanqiuhu
d49f273144
[SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation ( #37310 )
2026-03-19 08:22:00 +01:00
Flora Feng
b21d384304
[Refactor] Relocate endpoint tests to mirror serving code directory structure ( #37504 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-19 07:19:36 +00:00
Hongxia Yang
e3126cd107
[ROCm] issue management - request information for bug issues on ROCm ( #37009 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-19 03:51:29 +00:00
Wentao Ye
e37ff5b5c8
[Perf] Optimize token_embed for pooling models, 1.0% token throughput improvement ( #37347 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-19 10:27:51 +08:00
Aaron Hao
6accb21f2a
[bug] Fix deadlock with pause resume and collective_rpc ( #37024 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-19 01:49:02 +00:00
Giancarlo Delfin
053f3b6309
[Model Runner V2] Spec decode rejection sampler logprobs support ( #37237 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-19 01:36:27 +00:00
Aaron Hao
5f82706a21
[BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep ( #37334 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-19 00:45:10 +00:00
Sage Moore
c32a58cc2a
[EPLB] Simplify EPLB rearrange by only returning one map ( #36267 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-18 20:34:00 -04:00
Elvir Crnčević
ef2c4f778d
[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding ( #37442 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-19 00:28:37 +00:00
sihao_li
9dade5da3a
[XPU]Unify xpu test dependencies in dockerfile.xpu ( #36477 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-03-19 08:12:07 +08:00
Thillai Chithambaram
828f862acb
[Bugfix] Expand quantization method support in perf metrics ( #37231 )
...
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com >
2026-03-18 23:54:19 +00:00
Andy Lo
577df69b26
[Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish ( #37054 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-18 23:07:29 +00:00
Giancarlo Delfin
04244fd0e1
[Model Runner V2] Spec decode rejection sampler greedy support ( #37238 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-18 15:59:03 -07:00
Michael Goin
9482b0b085
[Bugfix] Remove assertion for NVFP4 scale dynamic range ( #37465 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-03-18 15:37:49 -07:00
Woosuk Kwon
5bc1da147f
[LoRA][BugFix] Fix skipped LoRA adapters for Mistral3 ( #36928 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-18 22:34:19 +00:00
Philip Ottesen
0091017188
fix(worker): optimize swap_states to copy only active token prefixes ( #34733 )
...
Signed-off-by: Philip Ottesen <phiott256@gmail.com >
2026-03-18 14:59:27 -07:00
Wentao Ye
0d81a1fe61
[V0 Deprecation] Deprecate virtual engine ( #37195 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 14:30:14 -07:00
Netanel Haber
6ae4c8d6fc
chunk parakeet into 30s clips to prevent OOMs on long audios ( #36671 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-18 14:22:24 -07:00
JartX
a913b612d8
[Bugfix] Fix ROCm crash in qwen3_next multi-stream events ( #36795 ) ( #37427 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-03-18 16:06:31 -04:00
Harry Mellor
5ce2d10e4a
Fix models which use layer_type_validation for Transformers v5 ( #37398 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 18:41:51 +00:00
Chengyu Fang
738d0a281f
[Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation ( #37439 )
...
Signed-off-by: chengyufang <cnyvfang@outlook.com >
2026-03-18 11:36:34 -07:00
youkaichao
70b81c4f3d
[bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP ( #37449 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-18 18:32:30 +00:00
Cyrus Leung
7476d148db
[Model] Remove unnecessary processor definition for Nemotron Parse ( #37456 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 18:25:13 +00:00
Cyrus Leung
f3732bd931
[Misc] Clean up model registry ( #37457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 18:24:44 +00:00
Wentao Ye
0ef7f79054
[Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement ( #37340 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 14:18:34 -04:00
Or Ozeri
5dd8df0701
[kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec ( #36642 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-18 19:26:40 +02:00
Harry Mellor
39bfb57b7c
Add API docs link if the CLI arg is a config class ( #37432 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 17:19:35 +00:00
RonaldBXu
c9d838fc33
Adding deterministic lora benchmarking to vLLM Bench ( #36057 )
...
Signed-off-by: Ubuntu <ubuntu@ip-172-31-43-201.ap-northeast-1.compute.internal >
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2026-03-18 16:02:03 +00:00
Xin Yang
b1169d7be8
[Kernel] Add gpt-oss Router GEMM kernel ( #37205 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-18 08:15:56 -07:00
XLiu-2000
17808394bc
standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 ( #37371 )
...
Signed-off-by: XuLiu <xuliu40@gmail.com >
Co-authored-by: XuLiu <xuliu40@gmail.com >
2026-03-18 15:05:37 +00:00
elvischenv
296839a1b0
[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE ( #30647 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-03-18 15:01:26 +00:00
Wentao Ye
c373b5c00d
[Log] Reduce duplicate log ( #37313 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 10:57:44 -04:00
Itay Alroy
de1a86b7de
elastic_ep: Fix stateless group port races ( #36330 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-18 14:36:18 +00:00
Cyrus Leung
99267c23ca
[2/3] Refactor InternVL-based processors ( #37324 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 22:22:19 +08:00
Or Ozeri
525f2eeb0b
[kv_offload+HMA][6/N]: Split offloading_connector.py ( #37405 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-18 14:42:46 +01:00
Yufeng He
918b7890a1
[Bugfix] Fix base64 JPEG video frames returning empty metadata ( #37301 )
...
Signed-off-by: Yufeng He <40085740+universeplayer@users.noreply.github.com >
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Yufeng He <40085740+universeplayer@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-18 13:40:03 +00:00
Andy Lo
98b09ddc27
[NIXL][Bugfix] metrics & testing minor bug ( #36051 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-18 14:39:14 +01:00
Shwetha Poojary
cef1f302d2
[Model] Enable LoRA support for tower and connector in H2OVL ( #31696 )
...
Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com >
2026-03-18 13:26:47 +00:00
Elvir Crnčević
17c47fb869
[Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy ( #37322 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-18 18:30:29 +08:00
Chauncey
b322b197f1
[Build] Bump python openai version ( #32316 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-18 18:20:10 +08:00
Andreas Karatzas
eaf7c9b976
[CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename ( #37328 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 09:44:12 +00:00
Aaron Hao
47a1f11bff
[docs] Add docs for new RL flows ( #36188 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 09:04:26 +00:00
Karan Bansal
fad09e8a1f
fix(glm47): improve tool call parsing and content normalization ( #37386 )
...
Signed-off-by: karanb192 <karan@example.com >
Co-authored-by: karanb192 <karan@example.com >
2026-03-18 08:12:21 +00:00
Jee Jee Li
8c31f47c63
[LoRA] Make LoRA respect language_model_only ( #37375 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-18 07:53:34 +00:00
Li, Jiang
261801242f
[Bugfix] Avoid OpenMP thread reallocation in CPU torch compile ( #37391 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-18 07:51:39 +00:00
Or Ozeri
fcf0687b27
[kv_offload+HMA][0/N]: Support block-level preemption handling ( #34805 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-18 08:49:53 +02:00
liuzhenwei
86b7e3c95a
[XPU] skip unsupported ut and update test_nixl_connector ( #37179 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-18 13:32:59 +08:00
Andrew Xia
0e95916155
[responsesAPI] parser.extract_response_outputs can take in token IDs ( #37130 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-18 05:31:31 +00:00
Andreas Karatzas
ce2ef42fd3
[CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset ( #37335 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 05:26:20 +00:00
Andreas Karatzas
8b6325758c
[ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture ( #37349 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 04:55:40 +00:00
gxd3
a0dd1995c7
[Hardware][TPU] Add supports_async_scheduling() method to Executor interface so that it can be extended for Executor implementations. ( #36924 )
...
Signed-off-by: Guangxiang Du <gxd@google.com >
2026-03-18 12:53:28 +08:00
Xin Yang
f1740006e4
[Perf] Enable dual stream execution of input projection for Qwen3 ( #36795 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-18 11:13:27 +08:00
Andreas Karatzas
58cde5c026
[ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm ( #37330 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 11:12:26 +08:00
Roy Wang
761e0aa7a0
[Performance] Add --enable-ep-weight-filter CLI option ( #37351 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-18 09:36:55 +08:00
Yanan Cao
ff9fbc9aff
[Kernel][Helion] [16/N] Refactor register_kernel API to be more Dynamo-friendly ( #36705 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-18 01:23:35 +00:00
Divakar Verma
e6c4797704
[ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn ( #36927 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-18 08:49:32 +08:00
Michael Goin
09e4576f65
[Kernel] Add non-gated support for NVFP4 CUTLASS MoE ( #37320 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-17 18:12:04 -04:00
Andreas Karatzas
3ed7b1e6e0
[ROCm] Validate block_size for explicitly selected attention backends ( #36846 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 17:04:40 -05:00
JartX
e8f9dbc369
[Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling ( #36720 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-03-17 17:55:34 -04:00
Yong Hoon Shin
de35c06c66
Make KV connector metadata build overridable via plugin ( #37336 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2026-03-17 21:29:06 +00:00
Athrael Soju
c0745a851a
[Model] Add ColQwen3.5 4.5B support ( #36887 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-17 21:17:02 +00:00
Ekagra Ranjan
b5ca9c3557
[Models] Cohere ASR ( #35809 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-17 21:04:17 +00:00
Chao-Ju Chen
245758992e
[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow ( #34577 )
...
Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-17 20:48:42 +00:00
Dimitrios Bariamis
1204cf0a9d
[Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 ( #37158 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-17 20:13:06 +00:00
Wei Zhao
b36adfa349
[Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache ( #37252 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-17 20:09:20 +00:00
Michael Goin
e78821b438
[Deprecation] Deprecate --calculate-kv-scales option ( #37201 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-03-17 19:57:24 +00:00
Cyrus Leung
51f0acda79
[Model] Remove unused handle_oov_mm_token ( #37321 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 19:44:52 +00:00
Brian Dellabetta
fa75204b16
bump compressed-tensors version to 0.14.0.1 ( #36988 )
...
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2026-03-17 15:36:19 -04:00
Wentao Ye
bdb903bb5f
[Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs ( #36674 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-17 15:19:52 -04:00
Andrey Talman
68f783a727
[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility ( #35673 )
...
Signed-off-by: atalman <atalman@fb.com >
2026-03-17 18:47:59 +00:00
Avinash Singh
c5030c439d
[CI] Split Distributed Tests (4 GPUs) and Kernel MoE tests ( #37100 )
...
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com >
Signed-off-by: Avinash Singh <107198269+avinashsingh77@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-17 11:44:55 -07:00
Michael Goin
51b2333be1
[Perf] Optimize top-k search in apply_top_k_top_p_triton sampler ( #37225 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-17 11:35:17 -07:00
Andreas Karatzas
4ed51308c8
[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__ ( #37230 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 09:08:08 -07:00
Cyrus Leung
c781fbbab3
[Bugfix] Standardize custom HF Processor init ( #37289 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 15:38:55 +00:00
Richard Zou
979ff44cea
[BugFix] PyTorch Compilation Tests should error if any test fails ( #37300 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-17 15:26:38 +00:00
Benjamin Chislett
f63ed7b5ac
[Bugfix] Fix DP MTP Dummy Run ( #35243 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-17 11:16:48 -04:00
Ning Xie
c9e5096256
[openapi] remove redundant exception stack trace[4/N] ( #37157 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-17 15:06:25 +00:00
Anton Vlasjuk
2ff0ad9694
[UltraVox] Fix output type ( #37224 )
...
Signed-off-by: vasqu <antonprogamer@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:51:17 +00:00
Isotr0py
a836524d20
[Chore] Replace all base64 usages with faster pybase64 package ( #37290 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-17 14:44:19 +00:00
Bhoomit
3717a4dd47
[Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules ( #34984 )
...
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:36:41 +00:00
Harry Mellor
ecfcdd2ce4
Fix Phi3 test that fails with Transformers v5 ( #37298 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:29:24 +00:00
Siew's Capital Jarvis
c25dbc2d27
[Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace ( #36955 )
...
Signed-off-by: Jarvis <brayden.stanley.0127@gmail.com >
2026-03-17 14:22:09 +00:00
Jonas M. Kübler
77d2a5f17b
pick up tuned prefill configs for FP8 FA3 ( #36265 )
...
Signed-off-by: Jonas M. Kübler <44084297+jmkuebler@users.noreply.github.com >
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2026-03-17 07:00:26 -07:00
Sage
59192dfd39
[Frontend] Complete OpenAI render delegation ( #37287 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-17 13:53:55 +00:00
Umut Polat
56cb1baa66
[Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators ( #36256 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-03-17 13:52:30 +00:00
Cyrus Leung
f340324335
[1/2] Move InternVL-based processors ( #37260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 21:50:56 +08:00
sfbemerk
2660b9289c
Bugfix for offloading+prefetch for GLM-4.7-FP8 ( #37178 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2026-03-17 21:22:09 +08:00
Viacheslav
293f036e6d
Add gigachat 3.1 tool parser + fix gigachat3 tool parser ( #36664 )
...
Signed-off-by: Viacheslav Barinov <viacheslav.teh@gmail.com >
2026-03-17 12:03:20 +00:00
youkaichao
0fb142a454
[perf][connector] optimize build_connector_meta when host buffer transfer is not used ( #37165 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-17 11:59:35 +00:00
Sage
00f8e0d211
[Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender ( #37266 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-17 11:22:54 +00:00
zhao, zhenhui
4af9ed21cb
[Bugfix](xpu): prevent “selected index k out of range” in TP decode path ( #37259 )
...
Signed-off-by: zhenzhao <zhenzhao@habana.ai >
2026-03-17 11:14:07 +00:00
Augusto Yao
9c7cab5ebb
[Feature]: Support for multiple embedding types in a single inference call ( #35829 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-03-17 17:05:42 +08:00
Chauncey
132bfd45b6
[Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens ( #37258 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-17 08:54:52 +00:00
xiao-llm
24b4272a8c
Fix infinite recursive search issue in quark.py ( #32779 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Signed-off-by: Xiao Yu <xiao.yu.dc@outlook.com >
Signed-off-by: kimheesu <wlskaka4@gmail.com >
Co-authored-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Kim Hee Su <wlskaka4@gmail.com >
2026-03-17 07:19:15 +00:00
Benjamin Chislett
8a680463fa
[Bugfix] Fix NemotronH MTP + Chunked Prefill ( #35447 )
2026-03-17 07:07:33 +01:00
Nick Cao
20b14095a4
[Bugfix] Fix loading Music Flamingo ( #35535 )
...
Signed-off-by: Nick Cao <ncao@redhat.com >
2026-03-17 05:24:40 +00:00
PatchyTIS
17c1bdf371
[Bugfix] dtype mismatch in ngram gpu propose ( #37246 )
...
Signed-off-by: PatchouliTaisa <patchychen@tencent.com >
Co-authored-by: PatchouliTaisa <patchychen@tencent.com >
2026-03-17 05:19:55 +00:00
Flora Feng
3e3d320c1b
[Refactor] Relocate responses API tests ( #37241 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 05:14:52 +00:00
Andreas Karatzas
54a62a79f7
[ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch ( #37219 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 11:34:49 +08:00
Flora Feng
384dc7f77b
[Refactor] Relocate completion and chat completion tests ( #37125 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 11:31:23 +08:00
Flora Feng
f04d5226f8
[CI] Fix flaky tool_use chat completion tests with deterministic seed ( #37027 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 03:24:34 +00:00
Kyuyeun Kim
0a0a1a198b
Add ability to replace oot ops when using lora ( #37181 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2026-03-16 18:04:15 -07:00
Vadim Gimpelson
6c1cfbad32
Support non-contiguous KV cache in TRTLLM fp8 dequant kernel ( #36867 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
Co-authored-by: Pavani Majety <pavanimajety@gmail.com >
2026-03-16 17:48:42 -07:00
Harry Huang
45f526d652
[BugFix] Correct max memory usage for multiple KV-cache groups ( #36030 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-03-17 00:38:52 +00:00
Julien Denize
5db91f0aaf
Fix some Mistral parser issues ( #37209 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-17 00:08:56 +00:00
Walter Beller-Morales
061980c36a
[Feature][Frontend] add support for Cohere Embed v2 API ( #37074 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-16 19:55:53 -04:00
Ben Browning
7a49742b88
[CI/Build] Add common tool call parser test suite ( #27599 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-16 19:46:20 -04:00
Terry Gao
3e6a1e1686
[Custom Ops] Add functional + out variant for scaled_fp4_quant ( #34389 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
2026-03-16 18:51:46 -04:00
Julien Denize
7961486a9b
Fix EagleMistralLarge3Model initialization ( #37232 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-16 15:41:00 -07:00
Andreas Karatzas
4f9b14c21c
[CI] Stabilize multinode DP internal LB completion tests ( #36356 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 15:40:23 -07:00
Yuchen Fama
31a458c091
[Doc] Clarify schema enforcement behavior for tool_choice modes ( #37064 )
...
Signed-off-by: yfama <yuchengu@gmail.com >
2026-03-16 22:27:42 +00:00
Wei Zhao
a3a51d20e7
[Benchmark] Improvements to attention benchmark script ( #37115 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-16 22:22:40 +00:00
EdalatiAli
e5b807607c
[Quant][Feature] Support online MXFP8 quantization for MoE and dense models ( #35448 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
2026-03-16 18:07:39 -04:00
Elvir Crnčević
fd4d96302a
Fix eplb nvfp4 experts hook ( #37217 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Signed-off-by: Elvir Crncevic <elvir@anthropic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-16 22:03:54 +00:00
Krish Gupta
c0f011918d
[Bugfix] opcheck false mutation error in rms_norm_per_block_quant ( #36688 ) ( #36779 )
...
Signed-off-by: Krish Gupta <krishom70@gmail.com >
2026-03-16 21:11:33 +00:00
Zhengxu Chen
e6ae4b1be1
[compile] Enable mega aot artifact for torch 2.12+. ( #37198 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-16 21:05:51 +00:00
zhanqiuhu
2dccb38f73
[Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors ( #36549 )
2026-03-16 20:51:04 +00:00
Kunshang Ji
d157216093
[BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer ( #37197 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-16 21:39:56 +01:00
Matthew Bonanni
93f3c8e531
[Misc] Add float16 to CacheDType ( #37199 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-16 13:24:48 -07:00
rasmith
2cc26c3a99
[CI][BugFix][MORI][AMD] Add transfer_id to kv transfer params for test ( #37213 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-16 13:22:57 -07:00
Flora Feng
dfa8852db2
[Refactor] Consolidate GPT-OSS reasoning parser tests ( #36915 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Flora Feng <4florafeng@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-16 15:53:07 -04:00
Lucas Kabela
714c6e0eab
[torch.compile][BE] Modify cudagraph callable to check for is_forward_context_set ( #36288 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-16 19:42:34 +00:00
Sage
0fefd00e6c
[Bugfix] Fix render server crash for quantized models on CPU-only hosts ( #37215 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-16 18:59:01 +00:00
Nicolò Lucchesi
f5c081d432
[PD][Nixl] Add support for hybrid SSM-FA models ( #36687 )
2026-03-16 19:58:06 +01:00
Matthew Bonanni
c88ea8338b
[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible ( #36982 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-16 13:51:21 -04:00
Max de Bayser
9f9ecff4cd
Add simple granite4 tool parser ( #36827 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2026-03-16 10:49:09 -07:00
haosdent
ca1954d58c
[Bugfix] Disable cross-layer KV cache for MLA attention backends ( #37090 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-16 19:03:10 +02:00
Raushan Turganbay
55e6d3d5c0
[Bugfix] Make siglip/clip compatible with transformers v5 ( #37200 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-03-16 16:48:18 +00:00
Chauncey
6682c231fa
[Bugfix] Add error handling for FINISHED_ERROR in OpenAIServing ( #37148 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-16 16:27:47 +00:00
Itay Etelis
5ae685c1c8
[Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout ( #34158 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
2026-03-16 11:20:51 -04:00
Wentao Ye
ce8cf9161d
[Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh ( #36693 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-16 11:12:15 -04:00
xjx
18be11fd59
[BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 ( #35594 )
...
Signed-off-by: xjx <493337577@qq.com >
2026-03-16 15:10:42 +00:00
Yuanheng Zhao
8d8855fdae
[Bugfix] Add safety check and fallback for null scaling factor ( #36106 )
...
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 14:27:29 +00:00
Wentao Ye
e855d380fa
[Compile] Fix compile warning in moe_permute ( #36529 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-16 10:16:14 -04:00
Benjamin Bartels
0e5a9382af
[Bugfix] accept redacted thinking blocks in Anthropic messages ( #36992 )
...
Signed-off-by: Benjamin Bartels <benjaminba@tiglab-ubuntu.ilab.local >
Signed-off-by: bbartels <benjamin@bartels.dev >
Co-authored-by: Benjamin Bartels <benjaminba@tiglab-ubuntu.ilab.local >
2026-03-16 22:01:57 +08:00
Fynn Schmitt-Ulms
04bf5a35fa
[Spec Decode] Update extract_hidden_states to use deferred kv_connector clear ( #37013 )
2026-03-16 14:53:45 +01:00
Tianyu Guo
43a73f853b
Remove unused EVS functions in qwen3_vl.py ( #37183 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
2026-03-16 13:09:09 +00:00
Julien Denize
ffbc2e5bdb
Patch Mistral config ( #37104 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-16 12:22:18 +00:00
Lukas Geiger
f9e6db3034
[Models][Qwen3 ViT] Keep max_seqlen on CPU to prevent D2H sync ( #37139 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-16 12:11:59 +00:00
elvischenv
d61d2b08e9
[Build] Fix API rate limit exceeded when using VLLM_USE_PRECOMPILED=1 ( #36229 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 12:09:27 +00:00
Artem Perevedentsev
f5e59ee7a6
[Performance] Add prefetch for checkpoints to OS page cache ( #36012 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-16 11:32:02 +00:00
Harry Mellor
9b005edc48
[Docs] Make the link to hardware plugins clearer ( #37174 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 04:12:58 -07:00
Robin Nabel
bf9a185395
GLM4 tool parser: fix streaming mode ( #35208 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-16 18:48:52 +08:00
Harry Mellor
ad041c79db
Fix text only inputs for MRoPE models with the Transformers modelling backend ( #37055 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:31:16 +00:00
Kunshang Ji
747b068136
[Hardware] Replace memory related torch.cuda APIs ( #37031 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-03-16 10:24:48 +00:00
Harry Mellor
122f75d939
Fix pipeline parallel with multimodal models with the Transformers modelling backend ( #37057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:20:37 +00:00
SoluMilken
d8f8a7aad2
[Misc] Sync pre-commit to 4.5.1 in workflows and docs ( #36675 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:03:21 +00:00
Roy Wang
0115e957d4
[Frontend][Misc] Remove unused log in /is_sleeping ( #37093 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-03-16 17:46:28 +08:00
haosdent
116ed130f4
[Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches ( #34871 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-16 10:30:23 +01:00
Vadim Gimpelson
8374387bd8
[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell ( #36987 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-16 09:04:29 +00:00
Isotr0py
912fbe9555
[Bugfix] Fix Qwen2.5-Omni/Qwen3-Omni use_audio_in_video with multi-video inputs ( #37147 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-16 08:56:06 +00:00
Laith Sakka
52131f88d9
use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks ( #36204 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-03-16 08:52:31 +00:00
Roy Wang
821eb80c0d
[Performance][Model Loader] Skip non-local expert weights during EP model loading ( #37136 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-03-16 01:33:36 -07:00
Andreas Karatzas
a2956a0f8e
[ROCm][CI] Retrying in case of batch variance effects and reducing flakiness ( #36442 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 16:08:51 +08:00
Andreas Karatzas
911355e216
[ROCm] Fix KV copy methods and auto-select attention backend for ROCm ( #36845 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 16:07:27 +08:00
Chauncey
8d3f8f485e
[Bugfix] fix Qwen3.5 tool calling bug ( #36774 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-16 15:38:42 +08:00
Woosuk Kwon
96efb91480
[Model Runner V2] Fix processed logits in sample() ( #37144 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-16 00:35:49 -07:00
leo-cf-tian
2754231ba3
[Kernel] Add FlashInfer MoE A2A Kernel ( #36022 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: Leo Tian <lctian@nvidia.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com >
2026-03-15 23:45:32 -07:00
bigshanedogg
2390d44209
[Model] Add HyperCLOVAX-SEED-Think-14B language model support ( #37107 )
...
Signed-off-by: bigshanedogg <bigshane319@gmail.com >
2026-03-16 06:40:05 +00:00
Li, Jiang
7362b4450a
[Bugfix] Avoid LD_PRELOAD check on MacOS ( #37145 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-15 23:31:44 -07:00
Andreas Karatzas
57a314d155
[CI][Bugfix] Fix 500 errors from priority overflow and TemplateError subclasses in schema fuzz tests ( #37127 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 05:27:21 +00:00
Andreas Karatzas
d4c57863f7
[ROCm][CI] Fix engine teardown and text normalization to stabilize voxtral test ( #37138 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 04:49:31 +00:00
Wang, Yiting
68e1b711f1
[XPU] Add deepseek_scaling_rope fused kernel ( #36612 )
...
Signed-off-by: yitingw1 <yiting.wang@intel.com >
2026-03-16 12:35:08 +08:00
rasmith
0024f39a32
[ROCm][P/D][MORI][BugFix] Add transfer_id for moriio_connector so moriio_connector to restore P/D functionality ( #34907 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-16 10:36:51 +08:00
Andrew Xia
e9163b536e
[responsesAPI][ez] add a unit test for SimpleContext logprobs ( #37126 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-15 17:12:26 -07:00
Lalithnarayan C
7acaea634c
In-Tree AMD Zen CPU Backend via zentorch [1/N] ( #35970 )
...
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Chinmay-Kulkarni-AMD <Chinmay.Kulkarni@amd.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-03-15 23:35:35 +00:00
Jiangyun Zhu
697e4ff352
[GDN] add a config for gdn kernel selection ( #36647 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-16 00:40:17 +08:00
Hari
a3e2e250f0
[Feature] Add Azure Blob Storage support for RunAI Model Streamer ( #34614 )
...
Signed-off-by: hasethuraman <hsethuraman@microsoft.com >
2026-03-15 19:38:21 +08:00
Isotr0py
143e4dccdf
[Misc] Add online audio_in_video test ( #36775 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-15 00:14:11 -07:00
Isotr0py
6590a3ecda
[Frontend] Remove torchcodec from audio dependency ( #37061 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-15 05:15:59 +00:00
Russell Bryant
b3debb7e77
[Build] Upgrade xgrammar to get a security fix ( #36168 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-15 03:13:48 +00:00
Nick Hill
458c1a4b2d
[Frontend] Reduce chat template warmup logging levels ( #37062 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-14 13:48:59 -07:00
Karan Bansal
821fde2df4
[Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference ( #32384 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
Co-authored-by: Inokinoki <inoki@inoki.cc >
2026-03-14 17:29:06 +00:00
arlo
8c29042bb9
[Feature] Add InstantTensor weight loader ( #36139 )
2026-03-14 18:05:23 +01:00
Cyrus Leung
5467d137b3
[Frontend] Avoid startup error log for models without chat template ( #37040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-14 09:36:11 -07:00
Santino Ramos
3ed46f374b
[Model Runner V2] Add Support for XD-RoPE ( #36817 )
...
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com >
2026-03-14 09:27:55 -07:00
seanmamasde
84868e4793
[Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats ( #35109 )
...
Signed-off-by: seanmamasde <seanmamasde@gmail.com >
2026-03-14 08:44:03 -07:00
Isotr0py
a8e8d62dd8
[Misc] Clean up Kimi-audio whisper encoder loading ( #36903 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-14 23:37:52 +08:00
Julien Denize
e42b49bd69
Mistral common v10 ( #36971 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: root <root@h200-bar-196-227.slurm-bar-compute.tenant-slurm.svc.cluster.local >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-14 07:26:43 -07:00
Sergey Zinchenko
4a718e770d
[Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests ( https://github.com/vllm-project/vllm/issues/35665 ) ( #35684 )
2026-03-14 14:10:11 +00:00
Kevin H. Luu
600a039f57
[CI] Shard Multi-Modal Models (Standard) into 4 parallel jobs ( #37014 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-14 08:26:54 +00:00
Harry Mellor
ffa5d74f15
Enable loading of fused expert weights in the Transformers modelling backend ( #36997 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-14 07:01:06 +00:00
Kevin H. Luu
74fe80ee95
[CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs ( #37015 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-14 12:21:13 +08:00
Flora Feng
bcfdadb1bc
[Refactor] Relocate chat completion and anthropic tests ( #36919 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-14 12:16:16 +08:00
Yanan Cao
236de72e49
[CI] Pin helion version ( #37012 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 23:25:29 -04:00
sbeurnier
a116f96930
[V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls ( #37006 )
...
Signed-off-by: Sebastien Beurnier <sbeurnier@together.ai >
2026-03-14 01:37:32 +00:00
Li, Jiang
092ace9e3a
[UX] Improve UX of CPU backend ( #36968 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-14 09:27:29 +08:00
Andrew Xia
f680dc1b39
[responsesAPI] prioritize content over summary in reasoning item input ( #36516 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <mitandrewxia@gmail.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Andrew Xia <axia@fb.com >
2026-03-14 09:20:30 +08:00
Giulio Leone
b41aa264f9
fix: resolve chat template names before kwargs detection ( #36937 )
...
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com >
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-03-14 00:20:16 +00:00
Dimitrios Bariamis
367cf5cd3e
[Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype ( #36931 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-13 16:41:16 -07:00
haosdent
6d53efd2a5
[Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models ( #34695 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-13 23:25:41 +00:00
Benjamin Chislett
8b346309a5
[Refactor] Consolidate SupportsEagle ( #36063 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-13 23:22:40 +00:00
Nick Hill
54a6db827f
[BugFix] Fix "DP Coordinator receives unexpected..." messages ( #37008 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 23:18:05 +00:00
Matthew Bonanni
9efc4db965
[Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces ( #37004 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-13 22:55:36 +00:00
Kevin H. Luu
f1816fb192
[CI] Split V1 e2e + engine (1 GPU) into separate jobs ( #36945 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 14:16:02 -07:00
Harry Mellor
0005d2a3c9
Use Transformers v5 WeightRenaming for Transformers modeling backend ( #31545 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-13 20:49:08 +00:00
Ekagra Ranjan
d0b402974f
[Bugfix][Spec Decode] Avoid double call of Ngram CPU ( #36952 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-13 20:33:19 +00:00
Divakar Verma
6341d43043
[ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer ( #35316 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-13 19:44:24 +00:00
Mark McLoughlin
7afe0faab1
[Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish ( #36666 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 12:10:06 -07:00
Harry Mellor
5a3f1eb62f
[Misc] Set default kv_buffer_device in a better way ( #36862 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-13 19:07:33 +00:00
yugong333
b3ce711b93
Fp8 lora dense kernel ( #35242 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
2026-03-13 19:05:08 +00:00
Isotr0py
abf61aaa8e
[Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request ( #36800 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-13 18:16:05 +00:00
bigmoyan
4508532fbd
[Bugfix] fix paddleocr crash on some image shape ( #36959 )
...
Signed-off-by: wangzhengtao <wangzhengtao@msh.team >
Signed-off-by: bigmoyan <moyan_work@foxmail.com >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-13 13:46:55 +00:00
Itay Alroy
d5af196c18
[2/N] Elastic EP Milestone 2: Integrating NIXL-EP ( #35627 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Co-authored-by: Yongji Wu <wuyongji317@gmail.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-03-13 09:25:33 -04:00
Chaojun Zhang
82f836d976
[XPU] Support LoRA via torch.compile on XPU platform ( #36962 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2026-03-13 10:34:59 +00:00
Andreas Karatzas
4fccd30f19
[ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options ( #36181 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-13 02:04:22 -07:00
Or Ozeri
cfaf4668f7
[kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec ( #36610 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-13 08:04:21 +00:00
Andreas Karatzas
99a57bdf74
[ROCm][CI] Corrected the GPT-OSS test root path ( #36711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-13 15:53:43 +08:00
Sage
a2268617cf
[Frontend] Delegate preprocessing to OpenAIServingRender ( #36483 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-13 00:39:43 -07:00
Rohan Potdar
a4ad9db541
Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) ( #35786 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-13 07:33:22 +00:00
Nick Hill
b373b5102a
[Tests] Shutdown test RemoteVLLMServer cleanly ( #36950 )
...
PR #33949 recently changed the teardown logic of the RemoteVLLMServer test utility class to
send SIGTERM to all vllm (sub)processes at once. This breaks the coordinated shutdown logic,
which assumes that only the top-level process receives a signal (for example, when running
in a container that is being shut down).
The change produced errors and stack traces in some test logs, even though those tests
still pass. Teardown should still attempt a normal shutdown first and kill other processes
only if they are still running after a few seconds.
Example: tests/v1/distributed/test_external_lb_dp.py::test_external_lb_completion_streaming
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 07:32:55 +00:00
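The shutdown order described in #36950 above (signal only the top-level process, wait, then clean up stragglers) can be sketched roughly as follows. This is a minimal illustration, not the code from that PR: the helper name `shutdown_gracefully`, the 5-second default, and the use of a POSIX process group to find leftover processes are all assumptions made for the example.

```python
import os
import signal
import subprocess


def shutdown_gracefully(proc: subprocess.Popen, timeout: float = 5.0) -> bool:
    """Stop a server process, escalating only if it does not exit in time.

    Hypothetical helper. Assumes POSIX and that `proc` was started with
    start_new_session=True, so its process group id equals proc.pid.
    Returns True if the top-level process exited after SIGTERM alone.
    """
    # Signal only the top-level process so it can coordinate its own children.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        exited_cleanly = True
    except subprocess.TimeoutExpired:
        exited_cleanly = False

    # Sweep up anything still alive in the group after the grace period.
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the whole group is already gone

    proc.wait()  # reap the top-level process if it was force-killed
    return exited_cleanly
```

A caller would start the server with `subprocess.Popen(cmd, start_new_session=True)` and pass the resulting object to the helper during teardown. Sending SIGTERM to the parent only is the design point here: the children are touched solely in the fallback path.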
Thomas Parnell
f296a1966d
[Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs ( #36876 )
2026-03-13 07:09:39 +01:00
Csrayz
bc2c0c86ef
[Frontend] Fix usage incorrectly returned with empty stream_options ( #36379 )
...
Signed-off-by: Csrayz <33659823+Csrayz@users.noreply.github.com >
2026-03-13 03:33:04 +00:00
jaime campos salas
891c60dcd5
fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 ( #36684 )
...
Signed-off-by: Jaime Campos Salas <jaime.campos.salas@gmail.com >
2026-03-12 23:28:27 -04:00
whyiug
1ce13cf992
[Model] Add support for BERT-like Chinese ERNIE pooling models ( #36385 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-13 03:23:53 +00:00
Nikita
10f08dedfa
[Model] Add ColPali late interaction model for multi-modal retrieval ( #36818 )
...
Signed-off-by: Nikita Sukharev <kaonael@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-13 02:18:57 +00:00
Aaron Hao
5e1a373d2e
[BUG] Fix rank calculation in NCCLWeightTransferEngine ( #36940 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-13 01:56:51 +00:00
Simo Lin
572c776bfb
build: update smg-grpc-servicer to use vllm extra ( #36938 )
...
Signed-off-by: Simo Lin <linsimo.mark@gmail.com >
2026-03-13 01:31:36 +00:00
Yifan Qiao
55d8073d06
[Bugfix] ep_scatter kernel store-load race condition ( #34991 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2026-03-13 01:07:59 +00:00
Nick Hill
cd32d6f586
[Model Runner V2] Some code simplification ( #36929 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 00:59:23 +00:00
Jaewon
aaa3092f51
[MoE] Add routing simulation override for MXFP4 quantized MoE ( #33595 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-13 00:30:44 +00:00
Shubhra Pandit
87985077a4
[Speculative Decoding] Add norm_before_fc for gpt-oss draft models ( #36545 )
...
Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-12 23:03:32 +00:00
Ryan Rock
a79c1c2c80
[AMD][Build] Add DeepEP to ROCm Dockerfile ( #36086 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-03-12 21:33:32 +00:00
Andreas Karatzas
cc8f1f4764
[ROCm][CI] Preparing gfx90a mirroring ( #36210 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-12 13:42:25 -07:00
Michael Goin
05b9e8ab5b
Revise environment setup in AGENTS.md ( #36909 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-12 19:21:11 +00:00
Xinan Miao
2cdf92228c
[Feature]: Remove Chunking From FusedMoE ( #34086 )
...
Signed-off-by: SouthWest7 <am1ao@qq.com >
Signed-off-by: Southwest <1403572259@qq.com >
Signed-off-by: southwest <am1ao@qq.com >
Signed-off-by: Xinan Miao <1403572259@qq.com >
Co-authored-by: SouthWest7 <am1ao@qq.com >
2026-03-12 14:24:38 -04:00
Marc Sun
c973ecdead
[bnb] Skip moe + bnb test ( #36896 )
...
Signed-off-by: Marc Sun <marc@huggingface.co >
2026-03-12 18:03:25 +00:00
Harry Mellor
e39257a552
Add AGENTS.md ( #36877 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 10:20:50 -07:00
Dimitrios Bariamis
cc16b24b17
Update Flashinfer to 0.6.6 ( #36768 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-12 13:19:19 -04:00
Eunkwang Jeon
bdc2343454
[Bugfix] Fix KeyError in parse_response_input for reasoning items with optional content ( #34499 )
...
Signed-off-by: jeonsworld <jeonsworld@gmail.com >
2026-03-13 00:13:36 +08:00
Matthew Bonanni
f444c05c32
[Attention] Use FA4 for MLA prefill ( #34732 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-12 12:10:17 -04:00
SoluMilken
85199f9681
[Bugfix] fix main branch pre-commit error (1 line change) ( #36897 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-12 09:08:37 -07:00
grimulkan
a1257fd1ea
[Kernel] Add FP8 KV cache support to Triton MLA decode attention ( #34597 )
...
Signed-off-by: grimulkan <grimulkan@gmail.com >
2026-03-12 08:32:34 -07:00
Thomas Parnell
abcffbba8c
[CI] Fix mypy pre-commit errors on main ( #36882 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-12 08:22:29 -07:00
Kunshang Ji
53ec16a705
[Hardware] Replace torch.cuda.device_count/current_device/set_device API ( #36145 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-12 07:57:47 -07:00
Wei Zhao
2e693f48e7
[Perf] Add TRTLLM FP8 MoE Modular Kernel ( #36307 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-12 07:32:31 -07:00
Martin Hickey
7f1f36bf91
[CI] Fix mypy for vllm/reasoning ( #35742 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-12 12:21:33 +00:00
Mark McLoughlin
5282c7d4d0
[docs] Add lightweight AI assisted contribution policy ( #30947 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-12 11:46:13 +00:00
caozuoba
9e19f8338b
[Perf] add packed recurrent fast path for decode ( #36596 )
...
Signed-off-by: hdj <1293066020@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-12 04:01:57 -07:00
Sage
06e0bc21d2
[Frontend] Split OpenAIServingModels into OpenAIModelRegistry + OpenAIServingModels ( #36536 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-12 03:29:37 -07:00
Chauncey
5a71cdd76e
[Bugfix] Fix crash when tool_choice=required exceeds max_tokens ( #36841 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-12 03:28:45 -07:00
Shanshan Shen
f0d3658c0f
[MM][OOT] Support CPU seq_lens for OOT MMEncoderAttention kernels ( #36605 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-12 03:28:23 -07:00
Michael Goin
57431d8231
[UX] Only show FP4 Marlin fallback warning for w4a4 models ( #36806 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-12 05:19:35 -04:00
Xu Jinyang
3e64fe4a18
[Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling ( #36599 )
...
Signed-off-by: AuYang <459461160@qq.com >
2026-03-12 00:51:09 -07:00
sfeiqiang
8cb24d3aed
[KV Connector] Support using FlexKV as KV Cache Offloading option. ( #34328 )
...
Signed-off-by: phaedonsun <phaedonsun@tencent.com >
Co-authored-by: phaedonsun <phaedonsun@tencent.com >
2026-03-12 00:46:20 -07:00
István Ketykó
00726c74c9
[Bugfix][Model] Fix DeepSeek-OCR TensorSchema crash on empty images_crop ( #36670 )
...
Signed-off-by: István Ketykó <istvan.ketyko@gmail.com >
2026-03-12 15:35:54 +08:00
Chauncey
9fe404ed04
[Frontend] OpenAI Responses API supports Tool/Function calling with streaming ( #29947 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-12 15:03:50 +08:00
Sage
802f306cd1
[Tests] Skip model weight download for render-only test server ( #36813 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-12 06:24:42 +00:00
Yan Ma
894843eb25
replace "with torch.cuda.device" with "with torch.accelerator.device_index" ( #36144 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-11 23:12:57 -07:00
Yanan Cao
584a3f56de
[Kernel][Helion][13/N] Force static_shapes=False in helion register ( #36677 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-12 05:35:29 +00:00
Nick Hill
36735fd772
[BugFix] Fix multiple/duplicate stdout prefixes ( #36822 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-12 12:23:21 +08:00
wang.yuqi
6ecabe4936
[CI Failure] Fix Language Models Test (Extended Pooling) daily CI Failure ( #36761 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-12 12:22:05 +08:00
Woosuk Kwon
2f8b4ce0c0
[Model Runner V2] Do not initialize sampler for non-last PP ranks ( #36824 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-12 03:55:28 +00:00
Yuwei An
2ef69456f5
[LMCache] Fault Tolerance Mechanism ( #36586 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-03-12 03:54:39 +00:00
Louie Tsai
17852aa503
more models for vLLM Benchmark Suite ( #35086 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-03-12 11:36:51 +08:00
Flora Feng
8647c6cf51
[Bugfix] Fix minimax_m2 tool parser when stream interval > 1 ( #35895 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-12 10:25:14 +08:00
Kunshang Ji
513949f95f
[XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu ( #36831 )
...
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-03-12 01:46:02 +00:00
Nick Hill
262b76a09f
[Frontend] Exclude anthropic billing header to avoid prefix cache miss ( #36829 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 01:20:34 +00:00
Wentao Ye
c34ba6b961
[Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement ( #36710 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-12 08:37:01 +08:00
Matthias Gehre
24062b704f
[ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures ( #36499 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-11 23:14:40 +00:00
Aaron Hao
d6b61e5166
[BUG] Fix async rlhf tests ( #35811 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-11 18:06:10 -04:00
Yanan Cao
cf632499ee
[Kernel] [Helion] [15/N] Split config files into per-platform files ( #36698 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:29 -04:00
Yanan Cao
a3774a8198
[Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation ( #36563 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:16 -04:00
Yanan Cao
0ce21c46a0
[Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning ( #36683 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:04 -04:00
Woosuk Kwon
55eed6b7a5
[Model Runner V2] Add WhisperModelState [6/N] ( #35790 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-11 14:20:38 -07:00
Giancarlo Delfin
c77181e534
[Model Runner V2] Add probabilistic rejection sampling for spec decoding ( #35461 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-11 14:04:32 -07:00
maobaolong
12001f2ebc
[LMCache] Pass TP size in lookup for MLA multi-reader locking ( #36129 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu >
2026-03-11 20:45:20 +00:00
Or Ozeri
7ee5d5093b
[BugFix][kv_offload] Fix offloading decodes with async scheduling ( #33881 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-11 20:43:40 +00:00
jennyyyyzhen
428bc718bd
[Bugfix][ROCm] Strip block_size before attention backend validation ( #36274 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-11 13:37:31 -07:00
汪志鹏
ff1e3d9c63
[BugFix]: add bagel to MM_PREFIX_LM_MODELS ( #36316 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
2026-03-11 19:55:59 +00:00
Wentao Ye
35bdca5431
[Refactor] Remove dead code in KV connector ( #36424 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-11 19:40:17 +00:00
Amanzhol Salykov
8a24842765
[ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 ( #35093 )
...
Signed-off-by: salykova <amsalykov@gmail.com >
Signed-off-by: amd-asalykov <asalykov@amd.com >
2026-03-11 19:00:08 +00:00
Harry Mellor
65986db6ba
Make Gemma and Gemma 2 accept inputs_embeds like Gemma 3 ( #36787 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 18:12:43 +00:00
Luka Govedič
9556af87d5
[torch.compile] Add support for non-contiguous fused RMSNorm + group quant ( #36551 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
2026-03-11 10:56:55 -07:00
Or Ozeri
a1a3523a56
[KVConnector] Support worker -> scheduler metadata ( #31964 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-11 17:36:37 +00:00
tianshu-Michael-yu
741f4e046b
fix: align lfm2 thumbnail token counting with HF ( #36707 )
2026-03-11 10:28:38 -07:00
Julien Denize
a5d06dc557
Add 320 dimension size support to MLA ( #36161 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2026-03-11 10:21:22 -07:00
Harry Mellor
5efa206a8c
Fix ExaoneMoeMTP test that never ran in Transformers v4 ( #36792 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 17:10:23 +00:00
Cyrus Leung
196802dfa6
[Misc] Clean up renderers ( #36770 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-11 16:39:29 +00:00
Isotr0py
c84b519cf3
[Bugfix] Fix negative max_tokens when input prompt is too long ( #36789 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 16:30:51 +00:00
Flora Feng
741ecf0630
[CI] Add bfcl tool call correctness eval ( #36560 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-11 12:27:36 -04:00
Robert Shaw
b7e5a588d8
[Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels ( #36061 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-11 16:07:14 +00:00
Richard Zou
822e250ab7
[torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation ( #36093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-11 16:07:09 +00:00
Hongxin Xu
bea02cdf93
Fix routed experts capture for hybrid models (Mamba + Attention) ( #35744 )
...
Signed-off-by: arlenxu <arlenxu@tencent.com >
Signed-off-by: xhx1022 <1737006628@qq.com >
Co-authored-by: arlenxu <arlenxu@tencent.com >
2026-03-11 08:53:10 -07:00
Julien Denize
a3ea760ea5
Add 'none' reasoning effort to ChatCompletionRequest ( #36238 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2026-03-11 15:45:34 +00:00
Harry Mellor
35db669f1d
Correct link to supported hardware on vllm.ai ( #36798 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 08:43:28 -07:00
Julien Denize
afebeffbfb
Add support to Mistral large 3 eagle with dense layers ( #36163 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-11 15:42:56 +00:00
Jhao-Ting Chen
5573894737
Kimi k2.5 MLA based eagle3 ( #36361 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com >
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Izzy Putterman <iputterman@nvidia.com >
2026-03-11 11:36:11 -04:00
Harry Mellor
d5816c8c2f
Fix tied weights in weight mapping test for Transformers v5 ( #36788 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 15:10:26 +00:00
Woosuk Kwon
8ccbcda5c0
[Model Runner V2] Remove unused warmup_for_prefill method ( #36762 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-11 08:02:44 -07:00
tvirolai-amd
a9e532afe2
[ROCm][Perf] Allow MTP lens > 1 in Sparse MLA ( #36681 )
...
Signed-off-by: Teemu Virolainen <teemu.virolainen@amd.com >
2026-03-11 14:43:03 +00:00
Harry Mellor
f3163bba67
Disable docs build skipping until a better solution is found ( #36790 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 13:53:23 +00:00
Martin Hickey
700a1ddc65
[Misc] Use envs module to get VLLM_DISABLED_KERNELS ( #35776 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-11 13:37:46 +00:00
Silvia Colabrese
f33251ffc8
[Bugfix] Fix Mistral-small --format ( #36782 )
...
Signed-off-by: 12010486 <silvia.colabrese@intel.com >
2026-03-11 04:47:52 -07:00
Wuxun Zhang
e584dce52b
Add XPU MLA Sparse backend for DeepSeek v3.2 ( #33230 )
...
Signed-off-by: Zhang, Wuxun <wuxun.zhang@intel.com >
2026-03-11 19:19:15 +08:00
Ning Xie
40c0461f24
[openapi] refactor render related openapi [3/N] ( #36749 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-11 03:14:34 -07:00
Weiguang Li
724759684c
[Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps ( #36136 )
...
Signed-off-by: OiPunk <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 03:13:06 -07:00
Michael Goin
9c34e9d24f
Disable cascade attention by default ( #36318 )
2026-03-11 03:12:23 -07:00
Richard Zou
09b6f99852
[compile] aot_compile should respect VLLM_DISABLE_COMPILE_CACHE ( #36358 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-11 03:12:03 -07:00
Ethan T.
c87fb515ed
fix(lora): use replaced_module_name in pooling model name check ( #36402 )
...
Signed-off-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 03:11:27 -07:00
Itay Alroy
5353c9b016
platforms: Fix Ray DP startup crash ( #36665 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-11 03:08:55 -07:00
Angela Yi
13e79fc811
[ci] Update rtol for test_classification ( #36556 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
2026-03-11 03:08:16 -07:00
Rahul Tuli
9d07a3d6e4
Add: Eagle3 support for Qwen3.5 ( #36658 )
...
Signed-off-by: Rahul-Tuli <rtuli@redhat.com >
2026-03-11 03:07:42 -07:00
Cyrus Leung
646b85544b
[Refactor] Remove Molmo2 processor wrapper ( #36667 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-11 03:07:20 -07:00
tc-mb
4286cc5ec2
fix(minicpmv): fix audio inference by handling meta device in init_re… ( #36751 )
...
Signed-off-by: caitianchi <caitianchi@modelbest.cn >
2026-03-11 03:06:28 -07:00
LoganJane
545d18d81b
[Bugfix] Support other quantization methods in glm41v ( #36321 )
...
Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com >
Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 09:48:05 +00:00
roikoren755
e661b9ee83
[NemotronH] Small fix reasoning parser ( #36635 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-11 02:44:41 -07:00
YiSheng5
c910eeb125
[XPU] Bug fix for some unexpected errors when using AgRs backend on XPU device. ( #36593 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2026-03-11 09:17:46 +00:00
Harry Mellor
f4ae58b38b
Remove unused config field from Gemma2 ( #36672 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 01:51:19 -07:00
Isotr0py
e568cf88bc
[UX] Infer dtype for local checkpoint ( #36218 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 08:50:04 +00:00
Nicolò Lucchesi
098d844731
[NIXL][1/N] Refactor kernel_block_size detection ( #35752 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-11 01:11:23 -07:00
JartX
a40ee486f2
[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 ( #35923 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: akaratza <akaratza@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-03-11 07:45:57 +00:00
pschlan-amd
eac2dc2b41
AITER MLA backend: Avoid CPU sync in _build_decode ( #35765 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-03-11 07:25:00 +00:00
Flora Feng
d5080aeaa4
[Refactor] Remove dead code in Responses API serving ( #36726 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
2026-03-11 07:11:41 +00:00
liuzhenwei
f22d6e0267
[Hardware][NIXL] set default kv buffer type for different platform ( #36438 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-11 05:19:28 +00:00
Kunshang Ji
76c6e6da08
[XPU] Support block fp8 moe by fallback to TritonExpert on XPU ( #36458 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-10 21:54:09 -07:00
typer-J
4184653775
feat: add RISC-V support for CPU backend (v2) ( #36578 )
...
Signed-off-by: typer-J <2236066784@qq.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-03-10 21:51:39 -07:00
Sladyn
4aaaf8c8ce
feat(spec_decode): fuse EAGLE step slot mapping and metadata updates ( #33503 )
...
Signed-off-by: sladynnunes <snunes@usc.edu >
2026-03-11 04:35:33 +00:00
Hongbin Guo
4bf533623b
[Doc] Fix duplicate words in comments ( #36713 )
...
Signed-off-by: Hongbin10 <jdmjdm1998@163.com >
2026-03-10 21:28:31 -07:00
Matthew Bonanni
5f77ef15ae
[Misc][Attention] Clean up unused method in CPU_ATTN ( #36673 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-10 21:27:22 -07:00
elvischenv
7d6abdd022
[Fix] Use torch.empty for output in attention+quant fusion ( #31785 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-03-10 21:26:14 -07:00
Wentao Ye
a8ff2cca92
[Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement ( #35781 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-10 21:25:30 -07:00
tunglinwood
42fadebecb
[Model] Add support for moonshotai/Kimi-Audio-7B-Instruct ( #36127 )
...
Signed-off-by: tunglinwood <tunglinwood@gmail.com >
Signed-off-by: tunglinwood <tomwu.tunglin@gmail.com >
Signed-off-by: tunglinwood <113751333+tunglinwood@users.noreply.github.com >
2026-03-10 21:24:48 -07:00
tianshu-Michael-yu
a197eda9c3
Add tuned H100 MoE configs for LFM2 8B and 24B ( #36699 )
2026-03-10 21:22:02 -07:00
Kevin H. Luu
82b110d50e
[ci] Bound nvidia-cudnn-frontend version ( #36719 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-11 12:17:35 +08:00
Benjamin Chislett
9040cd40af
[DSV3.2][MTP] Optimize Indexer MTP handling ( #36723 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-11 12:16:56 +08:00
fangyuchu
fa0d353acf
[Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks ( #35194 )
...
Signed-off-by: fangyuchu <fangyuchu@qq.com >
2026-03-11 03:22:21 +00:00
Augusto Yao
b386bb3d7c
fix bugs when token_classify & classify run concurrently ( #36614 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-03-10 20:16:34 -07:00
Ning Xie
fe714dd507
[openapi server] log exception in exception handler(2/N) ( #36201 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-10 20:16:30 -07:00
Matthew Bonanni
8ab3d7427c
[Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling ( #36691 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-11 03:01:07 +00:00
Wei Zhao
84e436ed1c
[Bug] Fix TRTLLM Block FP8 MoE Monolithic ( #36296 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-10 22:04:47 -04:00
Andreas Karatzas
81939e7733
[ROCm][CI] Making some tests optional to reduce workload ( #36090 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-10 16:45:27 -07:00
Woosuk Kwon
195d1ca3e8
[Minor] Enhance error message for TRTLLM decode uniformity check ( #36609 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-10 15:38:45 -07:00
Nick Hill
8d983d7cd6
[Model Runner V2] Add initial CI tests ( #36041 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 14:55:21 -07:00
Nick Hill
65b2f405dc
[Core] Simplify core kv-cache blocks initialization logic ( #36521 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 20:20:02 +00:00
Nick Hill
2a68464c5b
[Test] test_async_scheduling.py improvements ( #36340 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 11:17:26 -07:00
Zhengxu Chen
bdd8981dab
[compile] Apply stored functorch config while finalizing loaded artifacts. ( #36582 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-10 09:34:35 -07:00
Woosuk Kwon
f088a831dd
[Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata ( #36626 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-10 09:30:56 -07:00
Harry Mellor
f83b933b84
[CI] Bump mypy version to 1.19.1 ( #36104 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 09:18:28 -07:00
Pleaplusone
82f3f30e26
[ROCm][Perf] Enable sparse_mla's cudagraph on ROCm platform ( #35719 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-03-10 09:14:35 -07:00
Matthew Bonanni
9095cbbfb6
[Bugfix][Sparse MLA] report indexer CG support properly ( #36519 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-10 09:14:31 -07:00
Hashem Hashemi
721ae79f50
Improvements to wvSplitKrc skinny GEMM solution ( #34304 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-03-10 09:14:27 -07:00
AllenDou
aefc59f088
FunASR model bugfix ( #36633 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-03-10 08:14:21 -07:00
Harry Mellor
d88f28da05
Fix hf_override_fn when it modifies model_type ( #35200 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 15:03:18 +00:00
Srinivasoo7
106ff69c4e
feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency ( #35342 )
...
Signed-off-by: srinivas_oo7 <Sriusa4414@gmail.com >
Signed-off-by: Sriusa4414@gmail.com
Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com >
Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com >
Co-authored-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-10 14:43:40 +00:00
Jiangyun Zhu
ca5fb4bbd8
[Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs ( #36595 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-10 07:39:01 -07:00
Alvin Tang
cf88b23749
fix: check HTTP status in batch read_file to prevent silent failures ( #36397 )
...
Signed-off-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-10 07:22:40 -07:00
wang.yuqi
a3189a08b0
[Model] Consolidate score logic by introducing score_type ( #36479 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-10 13:32:25 +00:00
SoluMilken
409c4e632d
[Misc] fix typo: homogenous -> homogeneous (2 lines change) ( #36508 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-10 06:25:37 -07:00
Raushan Turganbay
8850738b70
[Bugfix] Fix processor signature ( #36630 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 06:20:47 -07:00
Mark McLoughlin
234860399b
[Frontend][Core] Revert "Add shutdown timeout" ( #34730 and #36270 ) ( #36628 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-10 06:20:41 -07:00
Harry Mellor
c88510083b
Fix Qwen2.5-VL test for Transformers v5 ( #36532 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 12:05:34 +00:00
Vadim Gimpelson
4ff8c3c8f9
[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU ( #35219 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-10 03:32:20 -07:00
Chang Su
507ddbe992
feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add --grpc flag to vllm serve ( #36169 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-03-10 03:29:59 -07:00
Nick Hill
ddbb0d230a
[Model Runner V2] Fix mm input embeddings lookup ( #36588 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 00:24:58 -07:00
Nick Hill
9efc3bdcd6
[Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill ( #36580 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 00:23:42 -07:00
amirkl94
156e33553c
Fix: Re-Enable EP for trtllm MoE FP8 backend ( #36494 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2026-03-09 23:11:27 -07:00
hallerite
d0cd736caa
[Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. ( #36557 )
...
Signed-off-by: hallerite <hallerite@users.noreply.github.com >
Signed-off-by: hallerite <git@hallerite.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-09 22:30:51 -07:00
Harry Mellor
195c997203
Fix LFM2 MoE test for Transformers v5 ( #36534 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-09 22:29:17 -07:00
Zhuohan Li
04b67d8f62
Remove unused disable_fallback field ( #36546 )
2026-03-09 20:56:54 -07:00
Wentao Ye
7279374f91
[Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement ( #36159 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 20:55:58 -07:00
Woosuk Kwon
006aea17d7
[BugFix] Remove incorrect assert in split_decodes_and_prefills ( #36553 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 20:02:02 -07:00
Hojin Yang
0836be3b03
[Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support ( #31471 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-10 10:59:19 +08:00
Ajay Anubolu
4e95ec111c
[Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 ( #36242 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
2026-03-09 19:16:26 -07:00
Andreas Karatzas
179547d62c
[ROCm][CI] Fix ROCm GPT-OSS Eval test group ( #36179 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 17:55:20 -07:00
youkaichao
f85b4eda3a
[bugfix] fix nvlink for nixl/ucx ( #36475 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-10 07:49:47 +08:00
Woosuk Kwon
2a194ddd72
[Model Runner V2] Add model_state inputs to CUDA graph capture ( #36544 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 15:14:51 -07:00
Shaun Kotek
203a7f27da
add nemotron v3 reasoning parser ( #36393 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
Co-authored-by: root <root@gpu-259.slurm-workers-slurm.slurm.svc.cluster.local >
2026-03-09 15:11:41 -07:00
Lucas Wilkinson
483463f735
[MRV2] Extensible CG dispatch rework ( #35959 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-03-09 13:58:45 -07:00
Matthew Bonanni
4e571ce643
[MTP][Misc] Clean up dead code ( #36507 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 14:43:06 -04:00
Micah Williamson
4ff9b045fe
[ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm ( #36025 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-09 13:27:55 -05:00
Lucas Kabela
3fd03f1ec2
[BE] Rename should_torch_compile_mm_vit to should_torch_compile_mm_encoder ( #36281 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-09 18:22:05 +00:00
Woosuk Kwon
10a5f4d53d
[Model Runner V2] Use NamedTuple for execute_model_state ( #35930 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 11:17:34 -07:00
Simon Mo
fe0c085c28
[Docs] Remove the reo beacon ( #36528 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-03-09 11:16:50 -07:00
Taneem Ibrahim
8d6b3d5dda
[Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers ( #36436 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-09 14:14:11 -04:00
Copilot
4b87ffbefb
[torch.compile] Rename compile_ranges_split_points to compile_ranges_endpoints ( #36027 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-09 18:04:40 +00:00
Shaun Kotek
fa028207aa
Fix/resupport nongated fused moe triton ( #36412 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com >
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: liweiguang <codingpunk@gmail.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Alex Brooks <albrooks@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
Signed-off-by: Tushar Shetty <tushar.shetty@abbyy.com >
Signed-off-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: nvnbagrov <nbagrov@nvidia.com >
Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com >
Co-authored-by: danisereb <daserebrenik@nvidia.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Weiguang Li <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: Alex Brooks <albrooks@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: cong-or <conchubhar.gannon@gmail.com >
Co-authored-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
Co-authored-by: liuzhenwei <zhenwei.liu@intel.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 11:01:18 -07:00
Russell Bryant
d460a18fc6
[Docs] Expand --allowed-media-domains security guidance with threat details ( #36506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-09 17:43:42 +00:00
Woosuk Kwon
6e956d9eca
[Model Runner V2] Add dummy profile_cudagraph_memory API ( #36520 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 10:20:13 -07:00
Andreas Karatzas
1e0f917b34
[ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm ( #36101 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 12:07:44 -05:00
Andreas Karatzas
c174d54f86
[ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks ( #36292 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 12:02:41 -05:00
SoluMilken
55d27cca55
[Misc] fix typo: dependant -> dependent (2 lines change) ( #36511 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-09 10:00:12 -07:00
Roberto L. Castro
580864d81e
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 ( #34917 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
2026-03-09 09:50:36 -07:00
Roberto L. Castro
2b28b9b269
[Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 ( #35290 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-09 09:46:57 -07:00
Taoyu Zhu
70485a11bd
[ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. ( #36253 )
...
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com >
2026-03-09 11:30:35 -05:00
Harry Mellor
74a9f54cdb
[CI] Fix edge case that could lead to broken docs builds on main ( #36515 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-09 09:06:19 -07:00
Matthew Bonanni
00c4cb5606
[Bugfix] Clear stale CG keys after memory profiling ( #36416 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 11:56:00 -04:00
Wentao Ye
941e52c298
[Refactor] Simplify chat_completion_full_generator for tool parsers ( #35634 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 23:33:46 +08:00
Wentao Ye
be292b7c14
[Bug] Fix pooling model benchmark script ( #36300 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 11:17:45 -04:00
Matthew Bonanni
77a73458e3
Reapply [Attention] Refactor check_and_update_config ( #35122 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 07:17:14 -07:00
Tianyu Guo
5578f2a4d3
Support online use_audio_in_video ( #36319 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 07:16:44 -07:00
Cyrus Leung
3ec2115015
[Frontend] Move warmup into Renderer ( #36482 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 06:03:21 -07:00
Isotr0py
b0906d8b02
[MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU ( #36472 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 03:43:44 -07:00
Kevin H. Luu
aaf5fa9abf
[ci] Bound openai dependency to 2.24.0 ( #36471 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-09 03:43:26 -07:00
Cyrus Leung
f96c3ab08c
[Deprecation][1/2] Remove items deprecated in v0.18 ( #36470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 03:43:23 -07:00
Xin Yang
dc6b578466
[Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next ( #35777 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-08 23:41:01 -07:00
liuzhenwei
1bc9c77f6d
[XPU] Add test script of PD disaggregation ( #36434 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
2026-03-09 05:50:27 +00:00
Alex Brooks
65a4da1504
[Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) ( #36160 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-03-09 05:46:23 +00:00
Li, Jiang
217f27598d
[Bugfix] Avoid to replace non-tensor members in cpu model runner ( #36430 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-09 13:06:28 +08:00
wang.yuqi
fff3711a24
[Frontend][2/n] Improve pooling entrypoints | embed. ( #36110 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
2026-03-09 11:42:19 +08:00
Tushar Shetty
c4d859c274
[Bugfix] Skip out-of-stage layers in get_layers_from_vllm_config for pipeline parallel ( #36243 )
...
Signed-off-by: Tushar Shetty <tushar.shetty@abbyy.com >
Signed-off-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
2026-03-08 20:40:16 -07:00
cong-or
747431044d
feat(attention): extract KV-cache update from FlexAttention backend ( #36263 )
...
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
2026-03-08 20:40:12 -07:00
Cyrus Leung
d62856b928
[Misc] Move processors to transformers_utils ( #35953 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 11:31:39 +08:00
Alex Brooks
bd2659a566
Increase Flexibility for OOV Multimodal Token Handling ( #34858 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-03-08 20:30:49 -07:00
Shaun Kotek
90512b2e8b
fix: Use iterator as not to store all the file loads in memory at once ( #36149 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
2026-03-08 20:25:21 -07:00
wang.yuqi
dcf8862fd4
[Examples][1/n] Resettle basic examples. ( #35579 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-08 20:22:53 -07:00
Weiguang Li
43aa389231
[Bugfix] Fix CPU OMP autobind assertion to use local_world_size ( #35815 )
...
Signed-off-by: liweiguang <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-03-08 20:07:29 -07:00
Wentao Ye
384425f84e
[Dependency] Remove default ray dependency ( #36170 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-08 20:06:22 -07:00
Harry Mellor
a0f44bb616
Allow markdownlint to run locally ( #36398 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-08 20:05:24 -07:00
Kunshang Ji
fde4771bbd
[XPU][Doc] update xpu document about triton dependency/conflict issue. ( #36301 )
...
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-03-09 02:09:22 +00:00
Jiangyun Zhu
e5ff140216
[cudagraph] fix cudagraph warning in deepseekv32 ( #28044 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-08 20:27:41 -04:00
danisereb
0a6a3a1290
Add support for ModelOpt MXFP8 MoE models ( #35986 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-03-08 13:00:05 -07:00
Sage
4497431df6
[Frontend] Add GPU-less render serving path (vllm launch render) ( #36166 )
2026-03-08 16:35:09 +01:00
nvnbagrov
b7332b058c
[Model] Nano Nemotron VL - fast media preprocessing ( #35657 )
...
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com >
2026-03-08 03:04:05 -07:00
Andreas Karatzas
40077ea3de
[CI] fix flaky empty responses and add diagnostic assertions in vision chat tests ( #36341 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-08 14:42:24 +08:00
Samuel Shen
5d6aae4577
[LMCache MP Patch]: Race Condition + Duplicated Block Ids ( #35831 )
2026-03-07 13:52:48 -08:00
Roy Huang
63298ee173
[Bugfix][LMCache][KVConnector] fix potential memory leak in LMCache multiprocess mode ( #35931 )
2026-03-07 13:52:35 -08:00
Richard Zou
2dde535df1
[compile] Split compile/warmup monitoring ( #36098 )
2026-03-07 13:52:11 -08:00
Wei Zhao
379689d533
[Perf] Support FP8 KV cache for Flashinfer MLA Sparse ( #35891 )
2026-03-07 13:51:54 -08:00
PatchyTIS
a6be75dbd2
[Core] NGram GPU Implementation compatible with Async Scheduler ( #29184 )
2026-03-07 13:51:37 -08:00
Micah Williamson
ee54f9cdb9
[ROCm][CI] Accept Different But Valid Output for test_olmoe_tp ( #35224 )
2026-03-07 13:50:52 -08:00
Micah Williamson
fc4657756f
[ROCm][CI] Enable AITER for failing test_gpt_oss test case on MI355 ( #36174 )
2026-03-07 13:50:17 -08:00
qli88
eebd14651f
[CI] Enable Crosslayer KV layout tests for ROCm platforms ( #35416 )
2026-03-07 13:49:56 -08:00
Matthew Bonanni
ebb9cc5f2b
[UX][Startup] Account for CUDA graphs during memory profiling ( #30515 )
2026-03-07 13:49:23 -08:00
rahul-sarvam
85f50eb41f
Adding support to Sarvam's MoE models ( #33942 )
...
Signed-off-by: rahul-sarvam <140298821+rahul-sarvam@users.noreply.github.com >
2026-03-08 01:16:24 +08:00
Taneem Ibrahim
5261223c2d
[Misc] Remove duplicate parser registration ( #36303 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-07 09:37:01 -05:00
lif
00b814ba5a
[V0 Deprecation] Remove unused swap_space parameter ( #36216 )
...
Signed-off-by: majiayu000 <1835304752@qq.com >
Co-authored-by: mcelrath
2026-03-07 22:09:55 +08:00
vllmellm
ee8a29511f
[Bugfix] Fix compressed-tensors quantization failure for DeepSeek-R1 on MI300x ( #36247 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-07 09:26:59 +00:00
milesial
755356b3d1
feat: expose media_io_kwargs at runtime ( #34778 )
...
Signed-off-by: Alexandre Milesi <milesial@users.noreply.github.com >
2026-03-07 04:27:04 +00:00
Andreas Karatzas
58928475e4
[ROCm][CI] Making entrypoints more deterministic on ROCm ( #36293 )
2026-03-06 19:04:40 -08:00
Mengtao (Martin) Yuan
1a9718085c
Fix CUDA graph decode capture crash in AITER FlashAttention ( #36042 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-03-06 18:12:07 -08:00
Kunshang Ji
7eb524e64c
refine vllm bench throughput --backend hf ( #35971 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-07 02:10:33 +00:00
Nick Hill
c7f32e08c2
[BugFix] Avoid ignored trust_remote_code warnings ( #36290 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-07 01:24:18 +00:00
Nick Hill
b354686524
[Model Runner V2] Fix warmup for pipeline parallel ( #36280 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-06 16:58:51 -08:00
Nick Hill
6a18d8789b
[Core] Fix benign error log during normal shutdown ( #36270 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2026-03-07 00:39:21 +00:00
Itay Alroy
24a03915f5
mla: don't update kv cache on dummy forwards ( #36282 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-07 00:36:00 +00:00
Andreas Karatzas
b5e34e1fca
[ROCm][CI] Fixing yaml file for external amd-ci signal ( #36284 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 18:30:39 -06:00
Copilot
ce8546a12b
[docs][torch.compile] Add fusions.md — kernel/operator fusion reference page ( #35538 )
...
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: ProExpertProg <luka.govedic@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-06 23:55:06 +00:00
Chuan (Richard) Li
c188749bcd
[ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) ( #35850 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-06 20:24:03 +00:00
Alexei-V-Ivanov-AMD
225d1090a0
Enabling some B200-specific tests on MI355 ( #35253 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Signed-off-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
2026-03-06 19:27:20 +00:00
eellison
f3c6c9c9d7
[CustomOp] CustomOp FusedRMSNormGated ( #35877 )
...
Signed-off-by: Elias Ellison <elias.ellison@gmail.com >
Signed-off-by: eellison <elias.ellison@gmail.com >
2026-03-06 10:53:37 -08:00
Nick Hill
26bd43b52d
Revert "[BugFix] Fix engine hanging after KV cache initialization fai… ( #36262 )
2026-03-06 08:28:09 -08:00
Travis Johnson
6b625a8807
[Bugfix] Quickfix followups to busy loop removal in #28053 ( #36068 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-06 08:13:05 -08:00
Richard Zou
54756b6109
[compile] Stop unconditionally patching constrain_to_fx_strides ( #36152 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-06 10:17:27 -05:00
Raphaël Rialland
39f9ea0da4
[Bugfix] Fix cudagraph_mode:FULL dispatch (This does not impact FULL_AND_PIECEWISE (default)) ( #36165 )
2026-03-06 09:15:31 -05:00
Isotr0py
e4ae148a78
[Refactor] Modular video loader backend refactoring ( #35202 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-06 06:06:59 -08:00
Isotr0py
1d0c0d209c
[Misc] Lazy import registered processors ( #36024 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-06 06:06:45 -08:00
Chenguang Zheng
fcb73f306c
[bugfix] add api process rank in default multimodal request ( #36150 )
...
Signed-off-by: fake0fan <645327136@qq.com >
Signed-off-by: Chenguang ZHENG <645327136@qq.com >
2026-03-06 12:00:09 +00:00
Harry Mellor
e2090bf3af
[CI] Fix startup error test ( #36230 )
...
A change in engine startup error messages in #35478 caused this test failure.
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-06 11:50:28 +00:00
Andreas Karatzas
2a00d3241f
[CI][MM] Gate vision encoder attention mask to MiniCPM only, fixing Aria regression ( #36206 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 01:17:08 -08:00
Alex Brooks
10f4db4dbe
[Frontend] Add Support for MM Encoder/Decoder Beam Search (Offline) ( #36153 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-06 01:16:56 -08:00
Nicolò Lucchesi
5b3ba94ab4
[Core][KVConnector] Support HMA+NixlConnector ( #35758 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-06 08:51:21 +01:00
zhanqiuhu
90f3c01fa4
[Spec Decode][KV Connector] Fix KV transfer in PD + speculative decoding ( #35158 )
...
Signed-off-by: Claude <noreply@anthropic.com >
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-06 08:50:44 +01:00
Andreas Karatzas
807d680337
[ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance ( #35553 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 15:15:12 +08:00
Tyler Michael Smith
5afb387bd4
Change "following fields were present in the request but ignored" log from warn to debug ( #36173 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-03-05 22:15:46 -08:00
Walter Beller-Morales
43e77e59ab
[BugFix] avoid infinite loop with VLLM_PORT and get_open_ports_list ( #36191 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-05 22:15:29 -08:00
Russell Bryant
00bd08edee
[Security] Respect user trust_remote_code setting in NemotronVL and KimiK25 ( #36192 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-05 22:15:19 -08:00
Ajay Anubolu
43f10573c9
[Bugfix] Fix misleading context length error messages ( #36197 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-05 22:15:12 -08:00
Yongye Zhu
86e1060b17
[Bugfix] Fix inner_dp_world initialization order for multi-node TP ( #35892 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-03-05 22:04:44 -08:00
Mark McLoughlin
27066d1b2b
[Frontend][Core] Add shutdown timeout - allowing in-flight requests to finish ( #34730 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-03-05 22:04:31 -08:00
cong-or
57c84ff129
perf: add __slots__ to KVCacheBlock ( #36164 )
...
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
2026-03-05 22:04:09 -08:00
Xiang Shi
e68de8adc0
docs: fix wrong cc in int8.md ( #36209 )
...
Signed-off-by: Xiang Shi <realkevin@tutanota.com >
2026-03-06 06:01:02 +00:00
Andreas Karatzas
a1ffa56a1e
[CI] Fix bge-m3 similarity reference values after *Defination* typo fix ( #36208 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 05:07:29 +00:00
Shiyan Deng
0a208d1f54
[BugFix] Fix engine hanging after KV cache initialization failure ( #35478 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:58:09 -08:00
Shiyan Deng
03a49bb8f0
[Feature] Add --distributed-timeout-seconds CLI option ( #36047 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:57:51 -08:00
Shiyan Deng
8e87cc57f1
[Bug] Fix a corner case in _process_simple_streaming_events ( #34754 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:57:32 -08:00
Cyrus Leung
6dd302653f
[Misc] Rename group_mm_kwargs_by_modality -> group_and_batch_mm_kwargs ( #36158 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-06 12:32:48 +08:00
Cyrus Leung
de00ebeac4
[Bugfix] Fix simple Mistral-Small example ( #36156 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-05 20:25:11 -08:00
Andreas Karatzas
639680d220
[ROCm][CI] Adding missing dependencies for Multi-modal models tests ( #36177 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 12:23:10 +08:00
Rohan Potdar
c5362c739f
Reenable features for ROCm attention backends ( #36185 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-05 20:21:06 -08:00
Nikhil Gupta
0a49676fb0
cpu: aarch64: Upgrade OneDNN for aarch64 to add support for int8 matmul ( #36147 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2026-03-06 03:48:59 +00:00
Jeffrey Wang
c012a8c477
Don't fire ray compatibility webhook when PR or branch is not provided ( #36088 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-03-06 00:42:21 +00:00
Dor Huri
ebed80a7c8
[Performance] Extract KV-cache update from TreeAttention backend ( #35384 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
2026-03-06 00:22:43 +00:00
Nick Hill
a73af584fe
[Model Runner V2] Fix warmup for very small kvcache and/or blocksizes ( #36176 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-05 14:48:10 -08:00
Zhengxu Chen
a97954b6a8
[compile] Consistent compiler config for saved/loaded vllm backends. ( #35810 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 15:08:12 -05:00
Yanhong Li
a911f4dd20
[Model] Add support for OLMo Hybrid ( #32550 )
2026-03-05 14:51:06 -05:00
Russell Bryant
5395471d29
[CI] Add explicit permissions to macOS smoke test workflow ( #35775 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-05 19:08:48 +00:00
Frank Wang
a57c877f18
[BugFix] Fallback from FA4->FA2 for Batch Invariance ( #36059 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2026-03-05 14:05:56 -05:00
Xin Yang
f917020983
[Perf] Optimize FusedMoEModularKernel output tensor using torch.empty ( #35794 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-05 13:47:53 -05:00
tomeras91
86483ca774
[Bugfix] Disable FlashInfer TRTLLM BF16 path for non-gated MoE ( #36146 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-05 09:49:05 -08:00
Netanel Haber
b93a9e6f6d
ParakeetProjection.norm = RMSNorm instead of nn.LayerNorm ( #36133 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-05 17:29:30 +00:00
Xinyu Chen
d8839ef7d9
[XPU] Enable ModelRunnerV2 on XPU ( #36078 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-03-05 17:19:18 +00:00
Avery Miao
e998fa76b9
[BUGFIX]Fix Qwen-Omni models audio max_token_per_item estimation error leading to encoder_cache_size is 0 ( #35994 )
...
Signed-off-by: Miao, Avery <avery.miao@intel.com >
2026-03-05 09:16:29 -08:00
Jiayi Yan
6a895197fa
[Bugfix][CI] fix typos ( #34934 )
...
Signed-off-by: 1195343015 <1195343015@qq.com >
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 17:05:46 +00:00
Sage Moore
8c760b6ab6
[ROCm] Refactor ROCm attention backend selection logic ( #35246 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-05 10:51:26 -06:00
AllenDou
3ee68590c7
refactor funasr model. ( #36108 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-05 08:07:37 -08:00
Cyrus Leung
7196348157
[Bugfix] Fix Qwen-VL tokenizer implementation ( #36140 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-05 08:07:19 -08:00
Ning Xie
176c799f4c
[openai api] log exception in exception handler (1/N) ( #31164 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-05 16:00:12 +00:00
Or Ozeri
612e7729c2
[KVConnector] Scheduler: Fix num_computed_tokens after async KV load ( #34616 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-05 14:25:15 +00:00
Harry Mellor
ecde7af9c4
Fix import that was moved in Transformers 5.2.0 ( #36120 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 13:59:44 +00:00
Harry Mellor
8df523351f
[Docs] Only build docs if documentation or ready labels are present ( #36135 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 13:58:16 +00:00
Andreas Karatzas
b03ff6a96b
[CI] Stabilize test_no_args_tool_call and add ROCm-specific server args ( #36107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-05 21:52:49 +08:00
Ajay Anubolu
ed81d5edd1
[Bugfix] Fix RunAI streamer crash with S3-hosted model paths ( #35976 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-05 12:14:20 +00:00
Shiyan Deng
3c23ac840e
[Bugfix] Fix mypy errors in hermes_tool_parser.py ( #36114 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-03-05 11:37:47 +00:00
cjackal
a708ef5944
[Misc] Fix SyntaxWarning - invalid escape sequence '\e' ( #36020 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-03-05 10:55:31 +00:00
Kunshang Ji
66a2209645
[Hardware] Replace torch.cuda.synchronize() api with torch.accelerator.synchronize ( #36085 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-05 10:36:39 +00:00
Doug Smith
0bfa229bf1
[Release] Include source distribution (sdist) in PyPI uploads ( #35136 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
Co-authored-by: Daniele Trifirò <dtrifiro@redhat.com >
2026-03-05 01:43:50 -08:00
Paco Xu
7493c51c55
[Docs] add Dynamo/aibrix integration and kubeai/aks link ( #32767 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-03-05 17:39:50 +08:00
Reagan Lee
ac773bbe80
[Docs] Update docs to include mm processor + encoder benchmarks ( #34083 )
...
Signed-off-by: Reagan <reaganjlee@gmail.com >
2026-03-05 01:38:25 -08:00
Christian Munley
48e376a007
qwen3coder tool parser fix anyOf double encoded parameters ( #36032 )
...
Signed-off-by: Christian Munley <cmunley@nvidia.com >
2026-03-05 09:06:57 +00:00
Isotr0py
21eb2c3372
[Chore] Correct MTP models test registry ordering ( #36115 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-05 08:55:04 +00:00
Seiji Eicher
e2b31243c0
[Docs] Update CacheConfig block_size docstring to remove inaccurate limit when using CUDA ( #35632 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-03-05 06:24:08 +00:00
Martin Hickey
c3598d02fa
[Misc] Remove deprecated items that are due for removal ( #36006 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-05 06:14:50 +00:00
Benjamin Chislett
57c629e9c1
[Bugfix] Fix block_size for hybrid model MTP ( #36036 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-05 06:10:54 +00:00
zihaoanllm
d106bf39f5
[Doc] Add Parallel Draft Models ( #35973 )
...
Signed-off-by: <zihaoan2@amd.com >
Signed-off-by: zihaoanllm <zihaoan2@amd.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 05:44:07 +00:00
Yanan Cao
b0651021e5
[Kernel] [Helion] [11/N] Retune configs for silu_mul_fp8 ( #36062 )
2026-03-04 21:25:59 -08:00
Hanjun Cho
f600d5192e
[Bugfix] Fix score layer quantization for sequence classification models - Qwen3 (VL) Reranker ( #35849 )
...
Signed-off-by: Hanjun Cho <gkswns0531@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-04 20:57:20 -08:00
Tianmu Li
8e7820131e
[Perf] Use dummy M for weight prepacking on x86 ( #35890 )
...
Signed-off-by: Li, Tianmu <tianmu.li@intel.com >
2026-03-05 04:56:49 +00:00
Andrii Skliar
0a12cea25f
Order config.py in Lexicographical order ( #35866 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-03-04 20:56:47 -08:00
Zhengxu Chen
dd6dbd93f8
[compile] Fix extra cache save on warm start. ( #35921 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 12:56:30 +08:00
Harry Mellor
26366009c5
[CI] Don't leave docs preview comment on closed PRs ( #36087 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 04:51:46 +00:00
Nick Hill
16c472abe7
[Core] Move ray-specific WorkerWrapperBase methods to RayWorkerWrapper ( #35328 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-05 12:11:59 +08:00
daje0601
3b23d57c96
[Model] Add LoRA support for Whisper models ( #29856 )
...
Signed-off-by: daje0601 <englishmt4118@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-05 10:38:25 +08:00
Wentao Ye
2f4226fe52
[CI] Fix pre-commit mypy issue in main ( #36049 )
2026-03-04 18:13:12 -08:00
nkm-meta
792cbd64ca
Add platform method to enable custom collective ops registration ( #34760 )
...
Signed-off-by: Naina Kuruballi Mahesh <nainakm@meta.com >
2026-03-05 00:50:32 +00:00
Zhengxu Chen
2ed4722e26
[compile] Reduce log spam from compile. ( #36044 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 00:48:36 +00:00
Nick Hill
a3299c3d1d
[Model Runner V2] Misc code simplification ( #35941 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 15:26:35 -08:00
Andreas Karatzas
6c21a0c2d7
[ROCm][CI] Added MI325 mirrors (stage C) ( #35239 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 14:48:46 -08:00
Shanshan Shen
562339abc3
[Misc] Support OOT linear method registering ( #35981 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-03-04 22:25:56 +00:00
amitz-nv
d7adcadb9b
[Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 ( #36017 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-03-04 22:23:51 +00:00
Simon Mo
f678c3f61a
[RL] [Weight Sync] Guard IPC update-info pickle deserialization behind insecure serialization flag ( #35928 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-03-04 17:05:32 -05:00
Thomas Parnell
be0a3f7570
[Bugfix] Fix race in non-blocking num_accepted_tokens GPU->CPU copy ( #36013 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-04 13:52:44 -08:00
Harry Mellor
17dc9c7fc9
[CI] Bump mypy version ( #34950 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 20:55:11 +00:00
fenypatel99
7eca859110
Add PyTorch profiler schedule support with warmup/active iterations ( #35240 )
2026-03-04 12:53:38 -08:00
Russell Bryant
636ee223ac
[Docs] Document security risks of GPT-OSS Python tool ( #35139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-04 20:27:31 +00:00
Robert Shaw
b7d59ffce2
[UX] Remove NoOpOffloader log ( #35678 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-04 12:13:40 -08:00
Richard Zou
5569f5218d
[torch.compile] Stop lazily compiling ( #35472 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-04 12:13:17 -08:00
Davina Zaman
138d891d7f
[Docs] Clarify structured outputs configuration for Qwen3 reasoning mode ( #32441 )
...
Signed-off-by: Davina Zaman <davzaman@users.noreply.github.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 11:44:39 -08:00
Stefano Castagnetta
d7166e74c1
[CI] Add Blackwell AsyncTP correctness test ( #35871 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
2026-03-04 19:41:21 +00:00
Nick Hill
417fd28fb1
[Model Runner V2] Fix pooling ( #36019 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 10:53:17 -08:00
tomeras91
7faba503c4
[Kernel][Mamba] Optimize Mamba2 SSD prefill Triton kernels ( #35397 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-04 19:47:17 +01:00
Hyunkyun Moon
bc6be89d16
[Frontend] Add vllm launch command for GPU-less preprocessing serving ( #34551 )
...
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com >
2026-03-04 18:41:52 +00:00
Maxime Grenu
32224f568a
docs: update CPU Docker images to reference Docker Hub instead of AWS ECR ( #34882 )
...
Signed-off-by: Maxime Grenu <69890511+cluster2600@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:31:35 -08:00
Abhishek Mathukiya
f3dc292e9f
docs: add version requirement note for --profiler-config flag ( #32454 )
...
Signed-off-by: abhishkh <mathukiya.a@northeastern.edu >
2026-03-04 18:13:54 +00:00
Chen
138c5fa186
[Docs] Add RunPod GPU deployment guide for vLLM ( #34531 )
...
Signed-off-by: lisperz <zhuchen200245@163.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:11:34 -08:00
Russell Bryant
2f2c1d73a7
[Docs] Upgrade dynamic LoRA warning to admonition block ( #35218 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-04 10:01:42 -08:00
Bhuminjay Soni
fb3e78ab09
[Feature][CI]: compare func & no_func outputs in test_functionalization.py ( #35481 )
...
Signed-off-by: Bhuminjay <bhuminjaysoni@gmail.com >
Signed-off-by: Bhuminjay Soni <Soni5Happy@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-04 18:01:16 +00:00
Michael Yao
fd3bfe74c9
[Docs] Update design/multiprocessing.md ( #30677 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2026-03-04 17:58:59 +00:00
tc-mb
bfdb512f11
fix minicpmo4.5: fix attn_mask in vit attn && fix resampler pos_emb i… ( #34127 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Co-authored-by: hezhihui <hezhihui@modelbest.cn >
2026-03-04 17:46:17 +00:00
Sage
d25c1ec3c9
docs(cpu): Clarify pre-built wheels requirement for CPU Python-only build ( #35090 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-04 17:45:35 +00:00
Xing Liu
7cc6058ac6
[Doc] Add MTP docs and update speculative decoding guidance ( #35197 )
...
Signed-off-by: liuxing <945764858@qq.com >
2026-03-04 17:23:34 +00:00
Manrique Vargas
28028dff2f
fix(docs): use static rdzv backend in multi-node troubleshooting script ( #34784 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-03-04 17:15:35 +00:00
Dr Alex Mitre
3417ba5648
docs: add README for logits_processor examples ( #35933 )
2026-03-04 17:09:19 +00:00
Yan Ma
58cfe0dc44
Fix phi4-mm and remove cuda binding ( #35964 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-05 01:08:05 +08:00
simone-dotolo
e86221deb6
[Doc] Fix GPU Worker count in Process Count Summary ( #36000 )
...
Signed-off-by: simone-dotolo <simonedotolo@libero.it >
Signed-off-by: simone-dotolo <84937474+simone-dotolo@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 17:03:14 +00:00
Netanel Haber
289fc48ab7
Use MMEncoderAttention (=use FlashAttention) instead of torch.sdpa in radio.py ( #35653 )
2026-03-04 08:43:13 -08:00
Christian Pinto
2f2212e6cc
Split generic IO Processor plugins tests from Terratorch specific ones ( #35756 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-03-05 00:01:03 +08:00
Nicolò Lucchesi
18e01a0a10
[Misc] Add --attention-backend auto option ( #35738 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-04 15:12:27 +00:00
sungsoo ha
6cb901093f
[Core] Add All-to-All communication backend for DCP ( #34883 )
...
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com >
Signed-off-by: sungsoo ha <hasungsoo@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:01:57 -05:00
Cyrus Leung
ead7bde1ab
[Bugfix] Make kaldi_native_fbank optional ( #35996 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-04 06:47:32 -08:00
Qi Wang
6aa6ad8992
[BugFix] Fix implicit and incorrect assumption on ECConnector is_producer ( #34783 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-03-04 15:01:30 +01:00
Raghavan
c8c3935b70
[Bugfix][Model] Fix FP8 k_scale/v_scale not loaded for Qwen3-MoE ( #35656 )
...
Signed-off-by: raghavan <oneraghavan@gmail.com >
2026-03-04 13:15:38 +00:00
Ronen Schaffer
bb6888b8b1
[Bugfix][CPUOffloadingManager] Prevent eviction of already-stored blocks in LRU/ARC prepare_store() ( #35846 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-03-04 14:25:33 +02:00
Taneem Ibrahim
1aaec59d79
[MISC] fixed tool_parser mypy errors ( #35640 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 12:23:12 +00:00
pougetat
1659b2e058
[Feature] Add basic metrics for /realtime endpoint ( #35500 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Signed-off-by: pougetat <thomas.pougetabadie@gmail.com >
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 19:56:32 +08:00
haosdent
d6e04f4c43
[Bugfix] Cap FULL decode cudagraph sizes for Mamba/hybrid models ( #34094 ) ( #34571 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-04 11:56:22 +01:00
Kunshang Ji
a8f66cbde8
[XPU] bump vllm-xpu-kernels to v0.1.3 ( #35984 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-04 18:23:31 +08:00
Kunshang Ji
16d2ad1d38
[Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache ( #30681 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 09:49:47 +00:00
Chuan (Richard) Li
5dc3538736
[ROCm][Bugfix] Fall back from CK MXFP4 MoE when GEMM dimensions are unsupported ( #35893 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-04 08:30:54 +00:00
Nathan Price
36bf213181
[Bugfix] Add missing dynamic_arg_dims for Qwen3-ASR torch.compile ( #35869 )
...
Signed-off-by: Nathan Price <nathan@abridge.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-04 08:29:01 +00:00
Joe Runde
6f0dd93801
[Core] Remove busy loop from idle buffer readers ( #28053 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 07:44:20 +00:00
Andrii Skliar
5d199ac8f2
Support Audio Extraction from MP4 Video for Nemotron Nano VL ( #35539 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Andrii <askliar@nvidia.com >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Andrii Skliar <askliar@oci-nrt-cs-001-vscode-01.cm.cluster >
Co-authored-by: Andrii <askliar@nvidia.com >
Co-authored-by: root <root@pool0-03748.cm.cluster >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: root <root@pool0-02416.cm.cluster >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: root <root@pool0-04880.cm.cluster >
2026-03-03 23:20:33 -08:00
Komal Kumar Teru
9e0f44bec4
[cohere][fix][spec-decode]: fix crash when allowed_token_ids is set without penalties ( #35654 )
...
Signed-off-by: kkt-cohere <komal@cohere.com >
2026-03-03 23:20:15 -08:00
lailoo
097eb544e9
[Bugfix] Improve engine ready timeout error message ( #35616 )
...
Signed-off-by: damaozi <1811866786@qq.com >
2026-03-04 05:54:32 +00:00
ShiJie Zhong
7cdba98edf
[BugFix] Support tool_choice=none in the Anthropic API ( #35835 )
...
Signed-off-by: ZhongsJie <zhongsjie@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-04 05:24:46 +00:00
Charlie Fu
3c85cd9d74
[Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) ( #35913 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-03-04 04:50:13 +00:00
Andreas Karatzas
edba15045a
[Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions ( #35711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 04:12:51 +00:00
Cyrus Leung
e379396167
[Refactor] Clean up processor kwargs extraction ( #35872 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-03 19:53:53 -08:00
Isotr0py
6e9f21e8a2
[Chore] Remove debug code in model implementation ( #35883 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 19:50:58 -08:00
AllenDou
c1d963403c
[model] support FireRedASR2 ( #35727 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 19:41:30 -08:00
Shanshan Shen
77e6dcbbfa
[PluggableLayer][MM] Add PluggableLayer for RelPosAttention ( #33753 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-03-03 19:41:27 -08:00
William Zhang
70c73df69e
[Bugfix] Fix EVS implementation for Qwen3 VL ( #33607 )
...
Signed-off-by: 2ez4bz <133824995+2ez4bz@users.noreply.github.com >
2026-03-04 02:18:11 +00:00
xjx
9a9d442464
Enable bnb for multiple indices weight ( #35838 )
...
Signed-off-by: xjx <493337577@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-04 01:46:47 +00:00
Andreas Karatzas
f7da9cdffc
[ROCm][CI] Support async weight transfer example with platform-aware determinism ( #35710 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 09:44:14 +08:00
Jaewon
f22ff2958c
[Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode ( #35916 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-04 00:10:11 +00:00
Nick Hill
d15c3b90fc
[Core] Move save_tensorized_model logic to Worker ( #35825 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-03 15:31:59 -08:00
zhrrr
97286a20ed
[Model Runner V2] support dp & ep for spec decoding ( #35294 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-03 15:19:45 -08:00
Amr Mahdi
12b38c0f45
[CI/Build] Allow mounting AWS credentials for sccache S3 auth ( #35912 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-03-03 14:30:47 -08:00
Woosuk Kwon
467886a0c4
[Model Runner V2] Fix inputs_embeds=None bug for MM models ( #35917 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-03 13:47:45 -08:00
bnellnm
a9b8b13e5c
[Bugfix] Fix misnamed parameter in compressed_tensors_moe.py ( #35813 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-03 16:29:57 -05:00
Micah Williamson
e7213003cb
[ROCm][CI] Fix TP size issue for test_gpt_oss ( #35887 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-03 20:57:34 +00:00
Rohan Potdar
3a8eef5869
[ROCm][Bugfix]: Disable AITER Triton ROPE by default ( #35601 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-03 13:43:56 -06:00
Robert Shaw
97995f6376
[MoE Refactor] Create MK for TRTLLM Kernels ( #32564 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2026-03-03 10:39:50 -08:00
Robert Shaw
881a6b011b
[CI] Temporarily Disable Llama4 MoE Refactor Test ( #35870 )
...
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-03 10:36:15 -08:00
Matthew Bonanni
8e1fd5baf0
[CI] Bump num_speculative_tokens to 3 in nightly DeepSeek tests ( #35882 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-03 09:26:44 -08:00
JasonCohere
ae88468bcc
fix: Ensure invalid audio files return 400 error ( #34715 )
...
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-03 08:47:39 -08:00
ojhaanshika
e05cb3b93e
TRTLLM gen-full attn Test Coverage ( #34986 )
...
Signed-off-by: Anshika Ojha <anshikao@nvidia.com >
Co-authored-by: Anshika Ojha <anshikao@gb-nvl-059-compute09.nvidia.com >
2026-03-03 11:35:34 -05:00
Lucas Wilkinson
28ef9ba399
[BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA ( #34552 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-03 07:21:57 -08:00
TJian
fb7fdc49c4
[ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops ( #34307 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-03 06:24:21 -08:00
wang.yuqi
ea463978bb
[Frontend][1/n] Improve pooling entrypoints | classify. ( #35604 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-03 06:05:36 -08:00
Li, Jiang
440f0e7dc6
[Bugfix] Avoid src/dst as None in irecv/isend_tensor_dict ( #35754 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-03 05:56:08 -08:00
wang.yuqi
fd4a90f337
[CI] Add PPL test for Qwen3.5. ( #35853 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-03 13:15:51 +00:00
Thomas Parnell
ad9d09e2b8
[Perf] [Hybrid] Copy num_accepted_tokens in non-blocking way when not using prefix caching ( #35442 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-03-03 04:15:43 -08:00
Szymon Reginis
4beebfd146
[CI/Build][Intel] Add new performance benchmarks for Intel Gaudi 3 ( #31025 )
...
Signed-off-by: Szymon Reginis <sreginis@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-03 19:48:24 +08:00
hallerite
b8401cde0e
add regression test ( #35834 )
...
Signed-off-by: hallerite <git@hallerite.com >
2026-03-03 07:32:15 +00:00
TJian
5dfc5abe94
[ROCm] [Release] Change the package from aiter to amd-aiter ( #35198 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-02 23:13:39 -08:00
lin-shh
8fa68a8ce4
Fix TYPE_CHECKING stub defaults in envs.py to match actual runtime defaults ( #35645 )
2026-03-02 21:59:43 -08:00
lin-shh
35a6f0bfe2
[Misc] Fix typos in comments: explict→explicit, paramaters→parameters ( #35648 )
2026-03-02 21:59:14 -08:00
Taneem Ibrahim
3a6cbf16e2
[MISC] Removed unused function find_all_indices() from tool_parsers/utils.py ( #35683 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-03 13:58:42 +08:00
Lucas Wilkinson
f44d1ddc8c
[BugFix] Fix cmake based incremental install (wrong vllm install dir) ( #35773 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-03-02 21:58:16 -08:00
Cyrus Leung
48a54c1e0d
[CI/Build] Trigger processor tests on registry update ( #35824 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-03 13:55:57 +08:00
Micah Williamson
8b9e8b7454
[ROCm][CI] Fix Assertion Logic For test_gpt_oss ( #35806 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-03 05:08:04 +00:00
Wentao Ye
c21d0039ec
[Refactor] Fix maxsim cuda platform and add cli to control it ( #35427 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-03 12:48:31 +08:00
Isotr0py
7d8bbe6f42
[CI/Build] Automatically patch video metadata for multimodal processor test ( #35822 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 04:27:45 +00:00
aykoppol
25e02647c2
[Core] Add optional flags to check for repetitive token patterns in engine output ( #35451 )
...
Signed-off-by: aykoppol <aykoppol+git@gmail.com >
2026-03-03 12:23:25 +08:00
Woosuk Kwon
a0a5178ab4
[Model Runner V2] Use ModelState.prepare_attn() for cuda graph capture [5/N] ( #35774 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-02 20:06:27 -08:00
Isotr0py
8ea8ba275e
[V0 deprecation] Remove Swin model ( #35821 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-02 20:03:41 -08:00
Woosuk Kwon
4f85bae9d6
[Docs][Model Runner V2] Add Design Docs ( #35819 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-02 19:58:14 -08:00
Andy Lo
0a7165fd71
[ModelRunnerV2] Rename sampler functions and variables for clarity ( #35459 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-02 19:48:56 -08:00
Robert Shaw
6521ccf286
[CI] Temporarily Disable Nightly Failures ( #35770 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-03 01:49:13 +00:00
Martin Vit
8ebd872f50
[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode ( #35615 )
...
Signed-off-by: Martin Vit <martin@voipmonitor.org >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-03 09:40:37 +08:00
zhrrr
168ee03e1c
[Model Runner V2][Perf] align dummy_run tokens to uniform decode for dp cudagraph ( #35376 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-03-02 17:10:47 -08:00
liuzhenwei
9dd656f0ea
[XPU][NIXL] Add GPUDirect RDMA support for XPU ( #35270 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-03 08:42:49 +08:00
Jakub Zakrzewski
c8b678e53e
[Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 ( #35735 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-03-03 08:32:14 +08:00
Andreas Karatzas
18c29c746b
[ROCm][CI] Fix backslash-continuation in pytest marker re-quoting and treat exit code 5 as success ( #35798 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 16:07:51 -08:00
Hanjie Qiu
96fc09503a
[All Reduce] Change default backend of Flashinfer All Reduce to trtllm ( #35793 )
...
Signed-off-by: hjjq <hanjieq@nvidia.com >
2026-03-02 18:57:38 -05:00
Roger Wang
1b82b433fc
[Bugfix] Fix MM processor test for Qwen3.5 ( #35797 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-03-02 23:05:08 +00:00
Robert Shaw
9319044ee9
[MoE][Perf] Wrap DSV3 QKVAProj GEMM in custom op for torch.compile ( #35751 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-02 23:03:49 +00:00
Boyuan Feng
c42dc402c1
clean unused cudagraph_batch_sizes ( #35552 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2026-03-02 22:00:16 +00:00
Ye (Charlotte) Qi
fa6a6be519
[Bugfix] Fix missing sequence_lengths in qwen3_omni_moe_thinker ( #35741 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2026-03-02 21:11:56 +00:00
Aaron Hao
cad21918e3
[BUG] Fix rlhf_async example ( #35788 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-02 20:36:40 +00:00
Jeffrey Wang
53700bf49b
[ci] Add Ray compatibility check informational CI job ( #34672 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-03-02 12:06:16 -08:00
Yashwant Bezawada
a13d8c03c9
[KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops ( #31057 )
...
Signed-off-by: Yashwant Bezawada <yashwant_b@me.com >
2026-03-02 15:04:47 -05:00
Fynn Schmitt-Ulms
9433acb8df
[Spec Decode] Add hidden states extraction system ( #33736 )
...
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com >
2026-03-02 14:29:09 -05:00
Richard Zou
d1a6e96d9e
[torch.compile] Improve cold and warm start compile tests ( #35709 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-02 19:27:06 +00:00
CSWYF3634076
2a9e3347e9
[BugFix][Model]Fix the garbled code in Ernie4.5-VL caused by fast_moe_cold_start ( #35587 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2026-03-02 18:56:33 +00:00
Isotr0py
cc0d565f40
[CI/Build] Enable Qwen3.5 tests on CI ( #35763 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-02 17:43:53 +00:00
Patryk Wolsza
358e4d5ba7
[CI][HPU] Pin vllm commit compatible with vllm-gaudi - HPU tests ( #35307 )
...
Signed-off-by: PatrykWo <patryk.wolsza@intel.com >
2026-03-02 17:02:26 +00:00
Cyrus Leung
792a74b973
[Doc] Improve UX of --enable-log-requests ( #35723 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-02 08:24:09 -08:00
Turner Jabbour
4034c3d32e
[Core] Move test utility to test file ( #35672 )
...
Signed-off-by: Turner Jabbour <doubleujabbour@gmail.com >
2026-03-02 10:56:03 -05:00
Martin Hickey
7560d674c9
[CI] Fix mypy for vllm/device allocator ( #35518 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 15:53:18 +00:00
ElizaWszola
d9c7730877
[Performance] Extract kv update ops from MLA attention backends ( #34627 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Di Wu <dw2761@nyu.edu >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-02 10:43:19 -05:00
Runkai Tao
ada4f4fadd
[Fix Bug] num_active_loras always equals zero ( #34119 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-02 23:17:46 +08:00
Harry Mellor
7e9149d9a9
[Docs] Add breadcrumbs for better UX ( #35749 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 14:31:54 +00:00
Martin Hickey
87c98b0236
[MyPy][BugFix] Check profiler is assigned before calling start() on it ( #35505 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 13:23:42 +00:00
Tyler Michael Smith
de7dd634b9
Fix unresolved-import errors when using Astral's ty by removing src.root ( #35681 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-03-02 10:26:47 +00:00
Chauncey
9a87b0578f
[Feat] Supports Anthropic Messages count_tokens API ( #35588 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-02 09:48:54 +00:00
wangxiyuan
510bc9e1df
[Misc] Cleanup useless current_platform import ( #35715 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-03-02 09:36:54 +00:00
Charles Ashby
cbd361fd46
[CPU][Distributed] Fix Enable _CPUSHMDistributed only when TP/PP ranks share the same SHM group name ( #34169 )
...
Signed-off-by: Charles Ashby <charlesa.l@hotmail.com >
2026-03-02 09:34:35 +00:00
Nicolò Lucchesi
c212202d93
[Misc] Bound NIXL upper bound version ( #35495 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-02 16:57:07 +08:00
Andreas Karatzas
ec27b36b4b
[CI] Defining extended V1 e2e + engine tests ( #35580 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 08:10:54 +00:00
Charlie Fu
3fd1d4ec2c
[Rocm][CI] Fix LM Eval Large Models (H100) test group ( #34750 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-03-02 07:43:38 +00:00
EdalatiAli
cb21972a97
[Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels ( #34448 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-01 23:31:19 -08:00
Andreas Karatzas
c34963f138
[ROCm][CI] Disable skinny GEMMs in language model standard tests to fix non-determinism ( #35152 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 15:04:18 +08:00
Hongxia Yang
f26650d649
[ROCm] add amd-quark package in requirements for rocm to use quantized models ( #35658 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-02 06:02:43 +00:00
Kunshang Ji
92f5d0f070
[XPU] fix mxfp4 activation type ( #35691 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-02 11:48:39 +08:00
Jesse Cai
a60985b07e
Fix deprecated v1 config tests ( #35327 )
...
Signed-off-by: Jesse Cai <jessecai@fb.com >
2026-03-01 20:32:03 -05:00
Lucas Wilkinson
8b5014d3dd
[Attention] FA4 integration ( #32974 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-01 23:44:57 +00:00
zhanqiuhu
57a96e26c9
Revert "[Bugfix] Disable TRTLLM attention with KV transfer enabled ( #33192 )" ( #34832 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-03-01 22:32:37 +00:00
Richard Zou
e82fbeec7b
[torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 ( #35475 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-01 21:44:22 +00:00
haosdent
6290470843
[Bugfix] Fix dtype mismatch in RMSNormGated.forward_native() during torch.compile ( #35256 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-01 15:14:46 -05:00
Woosuk Kwon
72f4d16262
[Model Runner V2] Use block table apis for capture inputs ( #35671 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-01 10:31:13 -08:00
Seungho Yoon
5a435507d8
fix(mxfp4): return is_monolithic=False when LoRA is enabled for Triton backend ( #35382 )
...
Signed-off-by: Seungho Yoon <yoonsnowdev@gmail.com >
2026-03-01 09:59:30 -05:00
Taneem Ibrahim
59d7af9c6c
[MISC] Fixing a null reference by removing parallel_utils from mypy EXCLUDE ( #35630 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-01 09:26:44 -05:00
Asaf Gardin
bbf81f9a92
[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching ( #34798 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-03-01 20:40:23 +08:00
Woosuk Kwon
da543d1abe
[Model Runner V2] Minor refactoring for EncoderRunner ( #35628 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-01 00:15:39 -08:00
Ryan Rock
87d319c52f
[AMD][CI] Support Triton attention with ExampleConnector ( #34931 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-03-01 09:58:07 +02:00
lin-shh
a9ec392c86
Fix typo: implictly -> implicitly in isaac.py docstring ( #35646 )
2026-02-28 23:34:37 -08:00
lailoo
afd089f231
[Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag on older GPUs ( #35617 )
2026-03-01 03:27:37 +00:00
gnovack
3ecd0bf9fc
Add TMA support to fused_moe_lora kernel ( #32195 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-01 10:55:25 +08:00
Woosuk Kwon
e3eb146f7a
[Model Runner V2] Add ModelStateInterface [4/N] ( #35621 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-28 13:19:45 -08:00
Martin Vit
95a395dbec
[Bugfix] Fix Anthropic API base64 image handling in Messages endpoint ( #35557 )
...
Signed-off-by: Martin Vit <martin@voipmonitor.org >
2026-02-28 20:57:08 +00:00
Isotr0py
e94b263bd6
[Chore] Cleanup BNB utilization dead code ( #35620 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-28 19:22:41 +00:00
Wentao Ye
e113a30113
[Deprecation] Deprecate code in 0.17 as scheduled ( #35441 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-28 17:32:37 +00:00
Cyrus Leung
1dafb29f91
[Benchmark] Avoid unnecessary video download in MMVU ( #35618 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-28 09:07:02 -08:00
emricksini-h
49b9ae32e9
[Fix] Avoid sending image input to other PP ranks ( #35405 )
...
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-01 00:14:29 +08:00
cwazai
63d7972f13
Fix Qwen3_5MTP packed_modules_mapping for gate_up_proj ( #35581 )
2026-02-28 14:50:55 +00:00
flutist
c68e69f144
custom dataset img support base64 ( #35280 )
...
Signed-off-by: xjx <493337577@qq.com >
2026-02-28 11:49:52 +00:00
Chauncey
7e08c22b8c
[Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function ( #35271 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 10:12:00 +00:00
Augusto Yao
8e75d88554
add io_process_plugin for sparse embedding ( #34214 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Signed-off-by: Augusto Yao <augusto.yjh@antgroup.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-28 09:16:37 +00:00
Mario Hong
0892d1ab1f
[Feature]Supports Anthropic Thinking Block ( #33671 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: zetaohong <i-hongzetao@stepfun.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-28 09:02:33 +00:00
Hashem Hashemi
7600642eae
Add padding support to wvSplitK solution for skinny GEMMs ( #33762 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-28 09:02:05 +00:00
Andreas Karatzas
1e69c04887
[ROCm][CI] Parametrize vision score tests across attention backends with per-backend tolerances ( #35571 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 08:59:26 +00:00
Cyrus Leung
4292e3b807
[Benchmark] Improve UX of sweep scripts ( #35600 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-28 00:36:02 -08:00
Cyrus Leung
24d6ea8afd
[Benchmark] Rename SLA Finder to Workload Explorer ( #35586 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-27 23:31:55 -08:00
Chauncey
57c86c0741
[Misc] Change logging level from info to debug for tool parser import ( #35575 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 14:51:35 +08:00
Chauncey
06254d4cbb
[CI] add trainer_send_weights for MockWeightTransferEngine ( #35589 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 06:47:43 +00:00
Andreas Karatzas
f5d1281c9d
[ROCm][CI] Expose tests to AMD production CI and fix amdsmi heap corruption ( #35071 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:57:31 +08:00
Andreas Karatzas
94029ffaf0
[ROCm] Derive device capability from GCN arch string without CUDA init ( #35069 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:55:28 +08:00
Andreas Karatzas
88e8525f2e
[ROCm][CI] Adding infiniband mappings for moriio tests ( #35170 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:53:28 +08:00
Ilya Markov
b2d8b422b2
[EPLB] Enforce sync eplb for NCCL-based all2all backend ( #35212 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-28 05:47:12 +00:00
Umut Polat
1d5ab5d603
[Bugfix] Move chat completion response_format validation to Pydantic model_validator ( #35510 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-27 21:26:19 -08:00
Huy Do
7b346ba8ed
[Bugfix] Propagate compilation_time from workers to main process for TP>1 ( #35503 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-02-28 05:03:22 +00:00
Itay Alroy
dea268336f
[1/N] Elastic EP Milestone 2 ( #34861 )
...
Signed-off-by: Yongji Wu <wuyongji317@gmail.com >
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: Ron Tourgeman <rtourgeman@nvidia.com >
Co-authored-by: Yongji Wu <wuyongji317@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-02-28 04:46:42 +00:00
Ma Jian
90805ff464
[CI/Build] CPU release supports both of AVX2 and AVX512 ( #35466 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: jiang1.li <jiang1.li@intel.com >
2026-02-28 04:35:21 +00:00
Matthew Bonanni
2562e0271e
[MTP] Validate that MTP weights are actually loaded ( #35548 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-28 12:27:40 +08:00
Cyrus Leung
fd68cd132b
[Bugfix] Fixes for SLA finder ( #35537 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-27 20:20:55 -08:00
Micah Williamson
0edf101d2b
[ROCm] Add stablelm Head Size 80 To Supported Head Sizes For ROCM_ATTN ( #35527 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-28 12:16:34 +08:00
Douglas Lehr
d5b6f3ba36
[ROCm][Quantization] Add Composable Kernel (CK) backend support for M… ( #34301 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
2026-02-28 03:37:01 +00:00
Woosuk Kwon
1a014a0a93
[Model Runner V2] Move MM encoder to Model States [3/N] ( #35564 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-27 18:32:38 -08:00
Woosuk Kwon
86ac7bcf84
[Model Runner V2] Support pooling models ( #35120 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-27 18:03:01 -08:00
Umut Polat
405f28d38d
[Misc] Clean up ResponsesRequest model validators ( #35531 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-28 01:19:21 +00:00
youkaichao
5323672bc2
[misc] cleanup one level of error stack when nixl fails to initialize ( #35517 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-02-28 08:42:37 +08:00
Roberto L. Castro
a201ad72d8
[Refactor][Kernel] Add global helper to deduplicate vectorized memory ops ( #35105 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
2026-02-27 16:28:17 -08:00
Rohan Potdar
e3691988d0
[ROCm]: fix aiter rope functionalization ( #35533 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-27 22:42:30 +00:00
Gregory Shtrasberg
9fa6c68fa6
[ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends ( #35334 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-27 21:32:55 +00:00
Aaron Hao
2ce6f3cf67
[Feat][RL][2/2] Native Weight Syncing API: IPC ( #34171 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-27 13:45:21 -07:00
Jakub Zakrzewski
1f3dbd95fd
[Bugfix][Model] Fix gpt-oss batch invariance ( #35404 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-02-27 20:41:24 +00:00
Lucas Wilkinson
1d532f9d8f
[DP] Only use DP padding when cudagraphs are actually used ( #34102 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-27 15:14:31 -05:00
Lucas Kabela
234a65b781
[Bugfix] Add monkeypatch to prevent race condition from writing ( #35420 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-02-27 14:51:36 -05:00
SteadfastAsArt
2decec9856
[Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 ( #34888 )
...
Signed-off-by: SteadfastAsArt <695488173@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-27 19:39:23 +00:00
Zhengxu Chen
29b35477b0
[compile] Fix caching error over pytree slice node. ( #35308 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-27 19:34:16 +00:00
Nick Hill
b1d9f5372d
[Model Runner V2] Warmup kernels ( #35172 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 10:43:30 -08:00
Raushan Turganbay
fd6de37fca
[BugFix] Fix 3D rope in transformers backend ( #35097 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-27 18:34:49 +00:00
Netanel Haber
c8aca0c9e1
Support parakeet as audio encoder for nemotron-nano-vl ( #35100 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 11:07:38 -07:00
Martin Hickey
b602e4f299
[Doc] Fix link to Llama chat template for usability ( #35525 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-27 17:51:09 +00:00
Huamin Li
157722da75
[perf] Use pinned memory for async H2D transfer in do_mamba_copy_block ( #35480 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2026-02-28 01:50:37 +08:00
Nick Hill
1d897ff04f
[Misc] Fill in some v1 CODEOWNERS gaps ( #35524 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 09:34:37 -08:00
fort726
905d76b51d
[Model] Add huggingface skt/A.X-K1 model ( #32407 )
...
Signed-off-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com >
Signed-off-by: fort726 <38447663+fort726@users.noreply.github.com >
Co-authored-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-27 09:26:02 -08:00
Yanan Cao
9098ce690c
[Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching ( #34390 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-27 09:21:35 -08:00
Nick Hill
876312f0b5
[Core] Fix gpu_worker.py pre-commit errors ( #35312 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 07:54:24 -08:00
Boyuan Feng
5de98abc12
Add @BoyuanFeng to CODEOWNERS ( #35317 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2026-02-27 15:53:47 +00:00
Koushik Dutta
9251ed5c4f
[Bugfix] Handle case when kimi ends reasoning with a tool call ( #33646 )
...
Signed-off-by: Koushik Dutta <koushd@gmail.com >
Co-authored-by: mondaylord <20212010046@fudan.edu.cn >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 14:58:28 +00:00
Yueqian Lin
e8249378e4
[Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests ( #35487 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 06:48:25 -08:00
haosdent
6d4f9d3ad5
[Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp ( #35082 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-27 22:27:06 +08:00
Harry Mellor
fbe3f0120a
Revert "Add GlmOcrConfig for GLM-OCR model type recognition" ( #35512 )
2026-02-27 06:13:27 -08:00
Jason Li
66c1751d13
[compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism ( #35410 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
2026-02-27 08:36:37 -05:00
Tib
6467b635b6
[Bugfix] Add missing activation attr to RMSNormGated ( #35423 )
...
Signed-off-by: tibG <naps@qubes.milou >
Co-authored-by: tibG <naps@qubes.milou >
2026-02-27 12:53:35 +00:00
Max Hu
9c3fe9936b
Flashinfer cuDNN backend for Qwen3 VL ViT attention ( #34580 )
...
Signed-off-by: Max Hu <maxhu@nvidia.com >
Signed-off-by: Max Hu <hyoung2991@gmail.com >
Co-authored-by: Max Hu <maxhu@nvidia.com >
Co-authored-by: Shang Wang <shangw@nvidia.com >
2026-02-27 20:20:23 +08:00
Umut Polat
b66a74649e
[Bugfix] Replace assert with ValueError for response_format validation in completions endpoint ( #35456 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-27 08:01:06 +00:00
Wang Xingran
07bdabef03
[Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter ( #33088 )
...
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Signed-off-by: Hongjian Zhang <hirokenovo@gmail.com >
Co-authored-by: Hongjian Zhang <hirokenovo@gmail.com >
2026-02-27 07:06:08 +00:00
Chengyi Nie
a572baff5e
[Model Performance] Add Qwen3MoE tuned MoE configs for H200 ( #35457 )
...
Signed-off-by: Chengyi Nie <cnie@roblox.com >
Co-authored-by: Chengyi Nie <cnie@roblox.com >
2026-02-27 13:51:14 +08:00
zofia
516cf26698
[Bug] correct out dtype of rms_norm_gated native path ( #35369 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-27 05:19:51 +00:00
Jiangyun Zhu
487e5c51f7
[Bugfix] disable allreduce_rms_fusion by default when pp size > 1 ( #35424 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-27 04:18:52 +00:00
Daniel Huang
1a8c71674e
[BugFix] Repo utils debug print patch ( #35434 )
...
Signed-off-by: Daniel Huang <daniel1.huang@intel.com >
2026-02-27 03:50:56 +00:00
Wentao Ye
062b789632
[Bug] Fix outdated links in source code ( #35314 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-27 03:50:46 +00:00
gnovack
a532c83849
use 'max_active_experts' for moe lora input size ( #33197 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2026-02-27 03:50:43 +00:00
Jee Jee Li
1e5ad9b74f
[Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping ( #35413 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-26 19:46:30 -08:00
Nicolò Lucchesi
cabdaa7619
[Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils ( #35400 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-27 11:42:51 +08:00
Chenyaaang
06be53563b
[Core]Extract is_last_rank in Ray for tpu to override ( #33012 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2026-02-27 03:18:52 +00:00
Angela Yi
c29ee9c326
[compile] Invalidate cache for cpu flags ( #35119 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-02-27 02:54:11 +00:00
daniel-salib
d43048ce05
[Bugfix] Emit reasoning_part events in simple streaming path for Resp… ( #35184 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-02-27 09:49:06 +08:00
Michael Goin
4fec53cfcb
[CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI ( #34274 )
2026-02-26 17:58:03 -07:00
roikoren755
38c498b8e3
[Performance] Cublas Bf16 Gate with Fp32 Output ( #35121 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-26 16:51:28 -08:00
Andrii Skliar
56a6371706
[Update] Use FlashInfer fast_decode_plan directly instead of replication ( #34687 )
...
Signed-off-by: Andrii <askliar@nvidia.com >
Co-authored-by: Andrii <askliar@nvidia.com >
2026-02-26 16:31:43 -08:00
Pavani Majety
6283021142
[Bugfix] Fix KV Scale loading for MLA Models ( #35430 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-26 23:38:19 +00:00
Aleksandr Malyshev
01923eec70
[ROCm][Quantization] GPT OSS Upstream MoE wmxfp4_afp8 with static scales ( #30357 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2026-02-26 16:50:16 -06:00
pkousha
31fb6f43da
[Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds ( #33839 )
...
Signed-off-by: pkousha <43781676+pkousha@users.noreply.github.com >
Co-authored-by: Pouya Kousha <pkousha@login-eos01.eos.clusters.nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-26 14:35:58 -08:00
Tyler Michael Smith
eb19955c37
[WideEP] Remove pplx all2all backend ( #33724 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-26 14:30:10 -08:00
Lucia Fang
0f2f24c8b2
[Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism ( #35429 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-26 22:08:16 +00:00
sychen52
d0105b84f0
add mixed precision support for modelopt ( #35047 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com >
2026-02-26 21:56:24 +00:00
danielafrimi
832a780f3a
Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models ( #35396 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
2026-02-26 16:55:19 -05:00
ElizaWszola
98217b09f9
[Performance] Extract KV cache update op from flashinfer forward ( #35422 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2026-02-26 21:29:01 +00:00
不做了睡大觉
967572dd5f
fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning ( #35230 )
...
Signed-off-by: stakeswky <stakeswky@users.noreply.github.com >
Co-authored-by: stakeswky <stakeswky@users.noreply.github.com >
2026-02-26 20:30:45 +00:00
Woosuk Kwon
3d66502e1b
[Model Runner V2] Prepare attn metadata in ModelState [2/N] ( #35383 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-26 11:47:02 -08:00
Woosuk Kwon
c66aa48e99
[Model Runner V2] Add model states [1/N] ( #35350 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-26 11:20:35 -08:00
Nick Hill
b6d5a17298
[Model Runner V2] Fix error-handling ( #35063 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-26 11:00:19 -08:00
Lucas Wilkinson
5e58bdc711
[Bugfix] Remove erroneous lower bound on LoRA vocab size constraint ( #35354 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-26 18:44:50 +00:00
Runkai Tao
a1f53addb1
[BugFix] Align fused MoE-LoRA kernel config with actual weight shapes ( #34396 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-26 18:03:10 +00:00
Wentao Ye
05970c772c
[Refactor] Remove dead code for attention benchmark script ( #35418 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 09:53:46 -08:00
Yiliu Dong
d940607629
[Core] Support min_tokens with speculative decoding ( #32642 )
...
Signed-off-by: qianlihuang <yiliu.dong@qq.com >
Co-authored-by: qianlihuang <yiliu.dong@qq.com >
2026-02-26 12:31:28 -05:00
Wentao Ye
99c7892c5b
[Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement ( #35330 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 17:14:54 +00:00
hujia177
ec8f943db1
Add GlmOcrConfig for GLM-OCR model type recognition ( #34982 )
2026-02-26 17:04:42 +00:00
Or Ozeri
f2ad952f40
[BugFix][kv_offload]: Fix kernel block size detection ( #35125 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-02-26 16:29:34 +00:00
Sage Moore
9e2cabdf9c
[ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release ( #34387 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-02-26 16:28:45 +00:00
Douglas Lehr
ec8ab9d254
[ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers ( #34157 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
2026-02-26 10:00:49 -06:00
Wentao Ye
05972ea7e5
[Refactor] Remove dead or duplicate func utils or variables ( #35318 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 10:57:56 -05:00
Jakub Zakrzewski
111d869069
[Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model ( #35297 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-02-26 14:17:17 +00:00
stingoChen
7fea7250a4
[Bug] Fix missing <think> tag after tool call in MiniMax 2.1 ( #35352 )
...
Signed-off-by: 冬马 <chenxinke@cai-inc.com >
Co-authored-by: 冬马 <chenxinke@cai-inc.com >
2026-02-26 22:11:07 +08:00
Cyrus Leung
845ee348ef
[Misc] Standardize handling of mm_processor_kwargs.size ( #35284 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-26 13:05:46 +00:00
Asaf Gardin
ec13e549d3
[Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic ( #35275 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-26 12:22:06 +00:00
Li-Yongwen
c6ca51598a
[Bugfix] fix device_name for routing replay ( #34336 )
...
Signed-off-by: liyongwen <1310439159@qq.com >
2026-02-26 12:18:38 +00:00
Yueqian Lin
c0615a296d
[Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression ( #35368 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
2026-02-26 11:58:23 +00:00
Harry Mellor
01914445b0
Remove bc-lint ( #35274 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-26 03:01:01 -08:00
Kunshang Ji
5281713e11
[XPU] use fixed UMD version in dockerfile.xpu ( #35392 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 18:54:55 +08:00
HZY
32693db8ce
[Bugfix] [Qwen3.5]Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading ( #35289 )
...
Signed-off-by: daowu.hzy <daowu.hzy@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 18:26:15 +08:00
Akash kaothalkar
e03ddcfbd4
[Hardware][Powerpc]Enable prefix caching and chunked prefill for ppc64le ( #35081 )
...
Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash kaothalkar <akash.kaothalkar@ibm.com >
2026-02-26 10:21:24 +00:00
Sophie du Couédic
02acd16861
[Benchmarks] Plot benchmark timeline and requests statistics ( #35220 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-26 02:17:43 -08:00
Jiangyun Zhu
ab87f85231
[Model] Ring 2.5 ( #35102 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-26 02:17:11 -08:00
Krish Gupta
3827c8c55a
[Test] Add tests for n parameter in chat completions API ( #35283 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-26 09:14:07 +00:00
Kevin McKay
ade81f17fe
[Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash ( #35250 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-02-26 16:11:07 +08:00
Gregory Shtrasberg
6042e66cd5
[ROCm] Add extra step in config initialization to populate custom ops before compilation config init ( #34848 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-26 16:05:40 +08:00
Chaojun Zhang
9f9a675b23
[XPU][8/N] Fix kernel bugs in XPU LoRA and MOE LORA ( #34115 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 15:46:44 +08:00
Ofir Zafrir
a07c4c5939
[BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 ( #35298 )
...
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 07:15:16 +00:00
Cyrus Leung
d3a51da92a
[Benchmark] Simplify SLA scan ( #35306 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-25 22:35:41 -08:00
Flora Feng
186ea22efe
[Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py ( #35339 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-26 14:35:16 +08:00
Daniele
4a9c07a0a2
[BugFix] anthropic/serving_messages: fix tool call arguments streaming ( #34887 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-26 05:39:48 +00:00
Jason Li
9d37941017
[torch.compile] Sequence Parallelism threshold compile ranges ( #28672 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-26 05:00:12 +00:00
Fadi Arafeh
4171ff6dd9
[CPU][Feat] Enable KleidiAI INT8_W4A8 for all input dtypes ( #34890 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-26 05:00:10 +00:00
Woosuk Kwon
13025e71e8
[Model Runner V2] Add coding style guide ( #35325 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-25 20:42:40 -08:00
Hanjie Qiu
71dfce6aa6
[Kernel] Refactor FlashInfer allreduce for mnnvl backend ( #34109 )
...
Signed-off-by: hjjq <50634613+hjjq@users.noreply.github.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com >
2026-02-26 03:17:20 +00:00
hujiaxin0
2aa4140402
openpangu-vl support video input ( #34134 )
...
Signed-off-by: hujiaxin <524446785@qq.com >
Signed-off-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 03:08:09 +00:00
Roberto L. Castro
86c3b5a808
[BugFix] Fix fp4 quant kernel on CUDA 12.8 ( #35210 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-02-25 18:32:50 -08:00
Seungmin Kim
160424a937
[Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs ( #33992 )
...
Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com >
Signed-off-by: Andrew Mello <19512127+88plug@users.noreply.github.com >
Co-authored-by: 88plug <19512127+88plug@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 18:15:51 -08:00
Lucas Wilkinson
9511a3f8ee
[Bugfix] Fix AttributeError in SMControlContextManager ( #35338 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-25 18:01:10 -08:00
Michael Goin
de527e1cec
[UX] Add --moe-backend arg for explicit kernel selection ( #33807 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-25 17:44:44 -08:00
Yongye Zhu
1976356ee6
[MoE Refactor] MXFP4 Cutlass Experts to MK ( #34542 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2026-02-25 17:32:39 -08:00
Michael Goin
cbf8f7028c
[UX] Add --performance-mode {balanced,interactivity,throughput} ( #34936 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-25 17:28:31 -08:00
Ming Yang
6831650c40
[offloader] v2: Hide weight onloading latency via prefetching ( #29941 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 17:20:59 -08:00
Andreas Karatzas
ed42507f6d
[ROCm][CI] Amending deletion of AMD mirror ( #35322 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:17:56 -08:00
Andreas Karatzas
9571e99945
[ROCm][CI] Extending attention backend coverage for Eagle spec decode tests ( #35265 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:16:18 -08:00
Elizabeth Thomas
c97234c08b
fix(mxfp4): Disable monolithic path for TRITON backend with EP ( #34270 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 13:33:42 -08:00
rasmith
b188bab441
[CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor ( #34985 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-25 19:18:00 +00:00
Lucas Wilkinson
15d76f74e2
Revert "[Misc] Enable weights loading tracking for quantized models" ( #35309 )
2026-02-25 09:20:15 -08:00
Andreas Karatzas
8fd6975479
[ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results ( #35049 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 16:48:37 +00:00
pushkar
5d18bf8b32
[Bugfix] Fix Harmony preamble visibility in Responses API ( #32114 )
...
Signed-off-by: Pushkar Patel <git@thepushkarp.com >
Signed-off-by: pupa <pupa@users.noreply.github.com >
2026-02-25 08:08:16 -08:00
haosdent
0788ff0a15
[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support ( #35085 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-25 07:31:45 -08:00
Chendi.Xue
d72b0be33c
[XPU]Fix for Qwen-OMNI crash ( #35249 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-02-25 07:31:07 -08:00
Bhoomit
42489e43c2
[Misc][LoRA] Increase max vocab size limit to 258048 in logits processor ( #34773 )
...
Signed-off-by: Bhoomit Vasani <vbhoomit@amazon.com >
2026-02-25 23:30:55 +08:00
Mario Hong
af5e6afa0a
[Bugfix] Fix step3p5 reasoning with interleaved thinking ( #34211 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-25 15:13:01 +00:00
Benjamin Chislett
ee59a7c615
[Tests] Add GSM8k check to SpecDec E2E tests ( #34772 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-25 07:51:14 -05:00
Joao Gante
709eadbb0b
Doc link typo ( #35281 )
...
Signed-off-by: Joao Gante <joaofranciscocardosogante@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-25 03:00:31 -08:00
Harry Mellor
90fc7f9109
Fix custom processors that use deleted behaviour for Transformers v5 ( #35107 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 02:36:21 -08:00
Yanwen Lin
675ec59aa9
[Bugfix][CPU] Fix basic unit tests failing in CPU platforms ( #34677 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:36:15 +00:00
Yanwen Lin
80e60a6133
[Doc] Suggest "--managed-python" flag when installing python using uv ( #33069 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
2026-02-25 08:19:43 +00:00
jonoillar
26e722f906
[DOC][BugFix] Specfiy build dependency installation ( #34513 )
...
Signed-off-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
Co-authored-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
2026-02-25 08:04:06 +00:00
lichuang
2c619e5e3f
[Docs]Fix documentation formatting in architecture overview ( #34679 )
...
Signed-off-by: codedump <lichuang1982@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:00:15 +00:00
Simon Mo
8a685be8d9
docs: document committer proposal process in governance ( #35225 )
...
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-25 07:58:48 +00:00
Laura Wang
2465071510
[Perf] Add opt-in SM100 Oink RMSNorm custom-op path ( #31828 )
...
Signed-off-by: Laura Wang <3700467+Laurawly@users.noreply.github.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-24 23:01:53 -08:00
wenshuai
cd43673668
[Perf] Optimize FP8 gemm of sm120. ( #34424 )
...
Signed-off-by: wenshuai <wenshuai@xiaomi.com >
2026-02-24 22:25:24 -08:00
Xinyu Chen
35d44b4557
[XPU]Support CUDAGraph on XPU Platform ( #34482 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:52 -08:00
Kunshang Ji
8ad54a991b
[Platform] Add current_platform.num_compute_units interface ( #35042 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-02-24 22:22:49 -08:00
Kunshang Ji
92510edc32
remove cuda check in top_k_top_p_triton kernel ( #35011 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:31 -08:00
Isotr0py
a6c137521c
[Misc] Add shard_id validation for MergedColumnLinear ( #35055 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:12:28 -08:00
Isotr0py
4572a06afe
[Misc] Enable weights loading tracking for quantized models ( #35074 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:11:03 -08:00
Zhengxu Chen
5cc29cfb8b
[compile] Improve error message during artifacts load failure. ( #35115 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 22:01:09 -08:00
Chen Zhang
8fae54faff
[Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache ( #35157 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-24 22:00:19 -08:00
Harry Mellor
f7967577f5
Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM ( #35203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 22:00:06 -08:00
pks
af770b8e7b
[Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest ( #35237 )
...
Signed-off-by: Patrick Simianer <patrick@lilt.com >
2026-02-24 22:00:03 -08:00
Andreas Karatzas
2ff3e436ad
[Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors ( #35231 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 05:52:44 +00:00
Jhao-Ting Chen
c2c4c4611a
[FIX] fused moe with lora shared expert dual stream (1.07x otps) ( #34933 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 04:40:45 +00:00
Rohan Potdar
f38f8c9742
[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE ( #35180 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-25 04:36:40 +00:00
Flora Feng
ec1d30c0f6
[Responses] Decouple SSE event helpers from Harmony context ( #35148 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-24 20:05:25 -08:00
Pooya Davoodi
e3b2324ec4
[Frontend] Use init_app_state and FrontendArgs in run_batch ( #32967 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-24 19:40:39 -08:00
Nick Hill
dbf0da817a
[Core] Cleanup engine pause/sleep logic ( #34528 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-24 19:33:34 -08:00
Xin Yang
3bbb2046ff
[Bugfix] Fix expert_ids padding values in moe_align_block_size kernel ( #35161 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-24 17:14:24 -08:00
yugong333
576fe50333
Adding Nemotron fp8 Triton MoE Config ( #34674 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-24 15:56:38 -08:00
Hashem Hashemi
a0e50a4260
Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. ( #34100 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-24 23:35:21 +00:00
Benjamin Chislett
9fa5b25a23
[Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention ( #35075 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-24 14:55:22 -08:00
Robert Shaw
ea97750414
[CI] Fix Distributed Tests ( #35236 )
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-24 22:31:56 +00:00
Andreas Karatzas
067c5d9ad1
[ROCm][CI] Added MI325 mirrors ( #34923 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-24 13:37:15 -08:00
Benjamin Chislett
f5972a872f
[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support ( #33726 )
...
Signed-off-by: Shahar Mor <smor@nvidia.com >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Shahar Mor <smor@nvidia.com >
Co-authored-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-24 09:49:56 -08:00
Matthew Bonanni
a9e15e040d
Add @MatthewBonanni to CODEOWNERS ( #35207 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-24 10:45:10 -07:00
Lucas Wilkinson
542ca66357
Revert "[CI/Build] Remove redundant OpenTelemetry pip install from CI configs" ( #35211 )
2026-02-24 09:26:42 -08:00
Cyrus Leung
fc8456c336
[CI/Build] Fix kernels test location ( #35205 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 09:20:34 -08:00
Wentao Ye
9ce8fad2a9
[Perf] Optimize Python Slice for Structured Output using islice instead of [:] ( #33593 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-24 09:02:36 -08:00
Harry Mellor
c38b8d5a31
Remove padding_index from models that don't use it for better Transformers v5 compatibility ( #35189 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 08:04:46 -08:00
Robert Shaw
60da0e1544
[CI] Remove Duplicated Tests ( #35199 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-24 23:53:30 +08:00
danisereb
9609b1f18d
Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 ( #35053 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 08:45:13 -07:00
danisereb
a0c7081695
Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe ( #35088 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 07:25:44 -08:00
R3hankhan
34ce0ffd1f
[CPU][Perf] Accelerate Attention head for s390x using vector intrinsics ( #34434 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-24 07:25:39 -08:00
Robin Nabel
0de5333989
Fix GLM4 parser tests ( #34905 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-24 22:27:42 +08:00
Eldar Kurtić
a87cc50859
[Attn,KV-cache] Use per-head scales in the attention selector ( #34281 )
...
Signed-off-by: Your Name <you@example.com >
Signed-off-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Your Name <you@example.com >
2026-02-24 09:02:43 -05:00
Cyrus Leung
761e63e541
[Frontend] Always pass supported_tasks to validation ( #35186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 04:16:33 -08:00
Isotr0py
d12d201409
[Bugfix] Fix failing FunASR processor test ( #35111 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 04:13:45 -08:00
eustlb
b3ad37c5db
[glm-asr] change defaults dummy audio size ( #35108 )
...
Signed-off-by: Eustache Le Bihan <eulebihan@gmail.com >
2026-02-24 04:13:33 -08:00
Wentao Ye
14561fabfd
[Perf] Optimize pooling model redundant copy, 1.8% throughput improvement ( #35127 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-24 04:13:11 -08:00
Zhengxu Chen
c77f3e1207
[compile] Save aot compile artifacts atomically. ( #35117 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 04:11:01 -08:00
Dor Huri
012dee9233
[Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) ( #35147 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-24 04:10:32 -08:00
Tugsbayasgalan Manlaibaatar
f1c664545b
Make voxtral compile friendly ( #33959 )
...
Signed-off-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-24 09:33:35 +01:00
Xin Yang
c870eb9e0f
[LoRA] Update LoRA expand kernel block_n calculation ( #32621 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 23:17:53 -08:00
BadrBasowid
6af03f2394
[Refactor] [1/N] Reorganize kernel abstraction directory ( #34055 )
...
Signed-off-by: BadrBasowid <badr.basowid@gmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-24 06:47:22 +00:00
Vlad Tiberiu Mihailescu
1a6cf39dec
[CI/Build] Remove redundant OpenTelemetry pip install from CI configs ( #35032 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-23 22:24:11 -08:00
Nicolò Lucchesi
f91808ae0d
[MM] Allow audio chunking for offline LLM ( #34628 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 21:04:28 -08:00
Vadim Gimpelson
33a0d43c71
[BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable ( #35156 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-23 19:42:24 -08:00
pschlan-amd
80d93fd6da
gpu_model_runner: Cache is_encoder_decoder from model config ( #35099 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-02-23 19:08:34 -08:00
Jia Guo
ec85340531
[Quantization] Support FP8 MoE bias for models like GPT-OSS ( #34906 )
...
Signed-off-by: jasperjiaguo <jasperg662@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-23 19:07:47 -08:00
Rohan Potdar
2ff4e51152
[ROCm] AITER fused RoPE+KVCache ( #33443 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: charlifu <charlifu@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
2026-02-23 19:06:00 -08:00
Asaf Gardin
95642441d0
[Mamba1] - Change supports_update_block_table to True ( #35054 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-23 19:05:57 -08:00
Xin Yang
a7c9f7b7ec
[Bugfix] Fix lora_ids in FusedMoE LoRA test ( #35135 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 21:49:25 -05:00
Michael Goin
a4bd661fb3
[Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default ( #34924 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 17:34:41 -08:00
Michael Goin
3ef9fd0f98
[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches ( #35123 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-23 17:11:27 -08:00
Michael Goin
22a97e6613
[Perf] Improve default triton fused moe configs ( #34846 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 16:01:28 -08:00
Aaron Hao
596ed1f02e
[RL] Validation for pause_mode='keep' ( #34992 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-23 16:30:56 -05:00
Nicolò Lucchesi
b8d8b7e934
[Misc] Monitor interface changes ( #35113 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 17:14:51 +00:00
Harry Mellor
28c5e69ba0
Enforce that model is the first positional arg when --served-model-name is used ( #34973 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:05 -08:00
Harry Mellor
864167d376
Fix custom processors that use deleted import for Transformers v5 ( #35101 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:00 -08:00
haosdent
a2ba6a5244
[Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) ( #34874 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 17:31:51 +01:00
Harry Mellor
c4f38696f7
Use Xet high performance mode for Transformers v5 ( #35098 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:19:30 -08:00
haosdent
a7f341c323
[Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling ( #35080 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 16:05:52 +00:00
Robert Shaw
d13ece38d7
[CI] Skip Responses API ( #34990 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 07:46:45 -08:00
Mark McLoughlin
5cc7c4452e
[Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) ( #30950 )
...
Export the existing Model FLOPs Utilization (MFU) metrics via Prometheus.
`--enable-mfu-metrics` is required for these to be exposed.
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-23 15:01:07 +00:00
Eldar Kurtić
b95bb6927f
[kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies ( #34254 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-23 07:37:55 -07:00
Cyrus Leung
392645454b
[Refactor] Decouple TimingContext from InputProcessingContext ( #35083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-23 14:15:50 +00:00
Eldar Kurtić
1e8438a89a
[Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests ( #35033 )
...
Signed-off-by: Eldar Kurtic <you@example.com >
Co-authored-by: Eldar Kurtic <you@example.com >
2026-02-23 09:04:34 -05:00
Robert Shaw
8435b2e049
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) ( #34302 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 14:02:26 +00:00
Yan Ma
b1b5e045df
[XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend ( #35010 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-02-23 05:06:44 -08:00
Andreas Karatzas
5f68464f92
[ROCm][CI] Fix spec decode profile assertion and logprob test determinism ( #35043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-23 05:05:54 -08:00
Vincent Gimenes
aa08a30fc9
[CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig ( #35060 )
...
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com >
2026-02-23 05:05:36 -08:00
Wentao Ye
7f40e9e516
[Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item ( #35068 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-23 05:05:20 -08:00
Harry Mellor
103e614b14
Fix pipeline parallel with embed scaling in the Transformers modelling backend ( #35094 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 05:04:47 -08:00
Neil Schemenauer
54e2f83d0a
[Feature] Lazy import for the "mistral" tokenizer module. ( #34651 )
...
Signed-off-by: Neil Schemenauer <nas@arctrix.com >
2026-02-23 00:43:01 -08:00
Gabe Goodhart
e631f8e78e
fix: Apply embedding_multiplier to inputs_embeds ( #34813 )
...
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-23 00:42:46 -08:00
Martin Hickey
e97c46a92d
[BugFix]: Fix local mypy issues ( #34739 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 00:40:29 -08:00
Jee Jee Li
7291d1b288
[Bugfix] Fix kernel benchmark ( #33752 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-22 21:18:08 -08:00
Cyrus Leung
987506bca6
[Refactor] Simplify dummy data generation ( #35025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-22 20:55:27 -08:00
Woosuk Kwon
c645e9a214
[Model Runner V2] Remove propose_draft method ( #35070 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 18:27:12 -08:00
Nick Hill
944ffb5968
[Model Runner V2][Minor] Remove redundant do_spec_decode field ( #35039 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 16:18:04 -08:00
qizixi
2bcf71b9c0
[Spec Decode] Reduce TP communication for speculative decoding draft token generation ( #34049 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 14:59:16 -08:00
tacos8me
b7892a3bef
[Model] Add NVFP4 quantization support for Step3.5-Flash ( #34478 )
...
Signed-off-by: tacos8me <ian@cloudhabit.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-22 12:30:46 -07:00
Benjamin Chislett
682566b18e
[Bug] Refactor max_num_batched_tokens to account for drafting ( #34898 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-22 11:18:46 -05:00
qizixi
b9c2a565cc
[Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup ( #34529 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 08:08:32 -08:00
Andreas Karatzas
dd8c3a7fb2
[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays ( #35052 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 10:07:18 +00:00
Andreas Karatzas
a8a47c17b6
[ROCm][CI] Fix flaky embedding chat test by using tolerance-based comparison ( #35050 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 09:03:44 +00:00
Roger Wang
40f88d8318
[Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser ( #34779 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-21 23:15:35 -08:00
Woosuk Kwon
2cbf9656ce
[Model Runner V2] Enable CUDA graph for Eagle3 ( #35040 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 21:42:50 -08:00
Xiao Li
30132cd144
Fix apply_top_k_top_p_triton called by non-cuda logits Tensor ( #35030 )
...
Signed-off-by: Xiao Li <ilx@meta.com >
2026-02-21 21:11:54 -08:00
Cyrus Leung
cbd95a2dd1
[Benchmark] Use sns.relplot for plotting ( #35027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 20:26:48 -08:00
Athrael Soju
970861ac0c
[New Model] Add ColModernVBERT ( #34558 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Signed-off-by: athrael-soju <athrael-soju@users.noreply.github.com >
2026-02-22 12:23:41 +08:00
Wentao Ye
d24bdd7c4b
[CI] Bump mteb version to mteb[bm25s]>=2, <3 for pooling model unit tests ( #34961 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-21 20:23:24 -08:00
Andreas Karatzas
d403c1da1c
[CI] Stabilizing ROCm amd-ci signal and minor name fix in upstream ( #35008 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 04:01:10 +00:00
Woosuk Kwon
b71fbd06e2
[Model Runner V2] Support attention group ( #35036 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 16:42:53 -08:00
Vadim Gimpelson
74d90b1ce4
[Model Bash][DSR1] Add selective dynamic shape marking for CustomOp ( #34900 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-21 19:28:01 -05:00
Woosuk Kwon
a4047d4ea9
[Model Runner V2] Support Eagle3 (no CUDA graph) ( #35029 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 12:55:24 -08:00
Cyrus Leung
965fe45935
[CI/Build] Fix gRPC version mismatch ( #35013 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 12:14:41 -07:00
Roman
98b0205c3c
[Frontend] Add automatic language detection for Whisper transcription ( #34342 )
...
Signed-off-by: space_check <roman.vuskov@rwth-aachen.de >
Signed-off-by: Roman <45857014+spacecheck@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-21 04:49:41 -08:00
Huy Do
272b535ab3
[Bugfix] Gate 256-bit instructions to CUDA 12.9+ ( #34791 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 04:48:14 -08:00
Cyrus Leung
f74f1572ca
[Benchmark] Improve benchmarks ( #35012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 10:31:58 +00:00
petrpechman
bebfe55b1c
[Doc] Fix example of eagle3 ( #34960 )
...
Signed-off-by: Petr Pechman <petr.pechman@firma.seznam.cz >
Co-authored-by: Petr Pechman <petr.pechman@firma.seznam.cz >
2026-02-21 09:57:53 +00:00
Nick Hill
820d7815eb
[Core] Minor structured-output related scheduler optimization ( #34765 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:38:28 -08:00
Nicolò Lucchesi
ab6f3487a6
[PD] Change kv_load_failure_policy Default from "recompute" to "fail" ( #34896 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:34:57 -08:00
BADAOUI Abdennacer
8dc8a99b56
[ROCm] Enable bitsandbytes quantization support on ROCm ( #34688 )
...
Signed-off-by: badaoui <abdennacerbadaoui0@gmail.com >
2026-02-21 00:34:55 -08:00
jennyyyyzhen
2aab2bb543
[ROCM] Optimize ROCM_AITER_FA spec decode eagle performance ( #34541 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
2026-02-20 20:32:05 -08:00
Andreas Karatzas
54254f7a61
[ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends ( #34599 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:23 -08:00
Andreas Karatzas
cf93c1a128
[ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64 ( #34570 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:07 -08:00
Andreas Karatzas
89358f0d35
[CI] Fix ColBERT HF comparison tests on AMD CI + refactor ( #34567 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:12:05 -08:00
zhongdaor-nv
a0fe7ea2f0
[feat] Add per-block extra_keys to KV events ( #33304 )
...
Signed-off-by: zhongdaor-nv <zhongdaor@nvidia.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 20:11:40 -08:00
Andreas Karatzas
991d6bff38
[CI][MCP][Harmony] Heavy refactoring Harmony & MCP response tests and stabilizing with deterministic test infrastructure ( #33949 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:03:32 -08:00
Kata Coder
5719a4e4e6
[Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) + NewModel: nvidia/nemotron-colembed ( #34574 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
2026-02-20 20:01:40 -08:00
pougetat
11be2c74dc
[Realtime] Add Qwen3-ASR realtime streaming support ( #34613 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-20 19:59:42 -08:00
Xin Yang
7a5adad480
[Kernel] Optimize sample_recovered_tokens_kernel ( #34974 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 19:59:06 -08:00
Li
59c6233297
Support prompt_embeds for pooling requests in output processor ( #34904 )
...
Signed-off-by: Li Zhang <lzhanga@amazon.com >
Co-authored-by: Li Zhang <lzhanga@amazon.com >
2026-02-20 19:57:38 -08:00
Taneem Ibrahim
d38cd3dde5
[Misc] Fix mypy errors in vllm/profiler and remove from exclude list ( #34959 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-02-20 19:56:33 -08:00
Rohan Potdar
ded333fb9b
[ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion ( #34636 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-20 19:56:16 -08:00
Yanan Cao
9d7577b2bd
[Kernel] [Helion] [9/N] Canonicalize GPU variant names to base model names ( #34928 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-20 19:55:51 -08:00
Vlad Tiberiu Mihailescu
e739c29ea4
[CI/Build] Add opentelemetry libs in default vllm build (requirements/common.txt) ( #34466 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-20 19:54:55 -08:00
yugong333
a55caf6ae9
[LoRA] Support Quantized Adapters ( #30286 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Signed-off-by: wz1qqx <ziqi.wang@novita.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: wz1qqx <55830058+wz1qqx@users.noreply.github.com >
Co-authored-by: wz1qqx <ziqi.wang@novita.ai >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 19:54:35 -08:00
Lucas Wilkinson
0e22cd618b
Revert "[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers " ( #34997 )
2026-02-20 17:19:19 -08:00
Wei Zhao
ea5f903f80
Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion ( #34899 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 13:37:31 -08:00
Ryan Rock
0632ed8778
[AMD][CI] Fix test_custom_allreduce for A100 testgroup ( #34735 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-02-20 21:33:04 +00:00
Lucas Wilkinson
aaefc58ee0
[CI] Revert PRs 34818 and 33600 ( #34979 )
2026-02-20 13:25:50 -08:00
Wei Zhao
f24b2de3d3
[Test] Add FP8 KV Cache Testing for MLA Backends ( #34473 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-20 18:51:58 +00:00
Michael Goin
fac1507f03
[CI] Remove failing prime-rl integration test ( #34843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-20 10:17:42 -08:00
Zhengxu Chen
f863994084
[compile] Fix torch.compile time discrepancy in logging. ( #34912 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 08:47:14 -08:00
Zhengxu Chen
e4a5d8c653
[compile] Move torch_aot_compile directory under torch_compile_cache ( #34831 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-20 08:46:45 -08:00
Yanan Cao
a6d0299c75
[Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching ( #34185 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-20 08:36:51 -08:00
Harry Mellor
6ce80f7071
Ensure that MkDocs v2 does not get installed ( #34958 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-20 15:38:11 +00:00
Huamin Li
1fe462168c
[perf] Avoid dtype promotion sync in mamba_get_block_table_tensor ( #34870 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:21:56 -08:00
Flora Feng
ed31a020ee
[Refactor] Extract Harmony streaming SSE event builders into streaming_events.py ( #34909 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:20:46 -08:00
Cyrus Leung
f9ac19204f
[V0 Deprecation] Remove unused MM placeholders in request output ( #34944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-20 06:19:23 -08:00
Vadim Gimpelson
59965affbd
[BUGFIX] Fix _dummy_run missing prepare_inputs_event synchronization ( #34866 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-20 05:54:27 -08:00
Xin Yang
b1c4f0b265
[Kernel] Optimize grouped topk kernel ( #34206 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 01:34:45 -08:00
Kevin McKay
8de7c636cc
[Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support ( #32877 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-19 22:25:46 -08:00
Frank Wang
059779231f
[Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend ( #34916 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-19 22:07:57 -08:00
tianshu-Michael-yu
ea37530b47
[Models] LFM2: Support LoRA ( #34921 )
...
Co-authored-by: Piotr Mazurek <piotr635@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-19 22:07:23 -08:00
Micah Williamson
f5432e35a3
[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout ( #34922 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-20 05:37:49 +00:00
杨朱 · Kiki
07cab212f0
[Misc] Add deprecated environment variable utilities ( #33677 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-19 21:33:25 -08:00
rasmith
0c1dc42748
[CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass ( #33739 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-19 21:32:40 -08:00
Varun Chawla
676f82ae81
Add validation to reject non-text content in system messages ( #34072 )
...
Signed-off-by: Varun Chawla <varun_6april@hotmail.com >
2026-02-19 21:30:33 -08:00
Elizabeth Thomas
81bfc21a6a
[Model Bash]: Improve FP8 Oracle for Config Specific Kernel Selection ( #34260 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Signed-off-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-19 21:29:08 -08:00
Matthias Gehre
4e2c7caf2d
[Bugfix] Add regression test for MoE quant_config under torch.compile ( #34335 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-20 13:27:26 +08:00
Bowen Bao
d9e62c03eb
[Quark] Fix MoE fp8 activation scale handling on mi300 ( #34386 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-02-19 21:27:14 -08:00
Kevin H. Luu
a1a2d79442
[ci] Use the right tag for CPU arm64 image ( #34915 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 19:59:15 -08:00
Cyrus Leung
ac900c89bb
[Refactor] Implement output type check in LLM ( #34794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:57:55 -08:00
Mark McLoughlin
76df6072ff
[Core] Fix state names in pause_scheduler() ( #34840 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-19 17:21:46 -08:00
Michael Goin
16f24e8797
[CI] Add GPT-OSS Eval job for H100 ( #34359 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 17:14:54 -08:00
Nick Hill
40b2f1c3d9
[Model Runner V2] Minor CPU optimizations ( #34856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-19 16:05:37 -08:00
Mayank Ketkar
648951a9c3
[Bugfix] Fix benchmark_fused_collective crash on CustomOp init ( #34665 )
...
Signed-off-by: Mayank Ketkar <mketkar@zoox.com >
Signed-off-by: Mayank Ketkar <mayket04@gmail.com >
Co-authored-by: Mayank Ketkar <mketkar@zoox.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-19 19:01:00 -05:00
Michael Goin
f72061a19a
[UX] More descriptive reasons in is_supported_config for MoE ( #34908 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-19 15:20:52 -08:00
Matthew Bonanni
662205d34e
[Bugfix] Fix Basic Models Test ( #34818 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 14:49:07 -08:00
Roger Wang
4fb8beefaa
[Bugfix] Fix cutlass fp8 kernel on hopper for Qwen3.5 ( #34914 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-19 13:34:55 -08:00
Alexei-V-Ivanov-AMD
304319c4ed
Change targets for AMD build in the "CI" pipeline ( #34918 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-19 21:26:53 +00:00
Wentao Ye
c683d11c94
[Refactor] Deprecate head_first for chunk_gated_delta_rule ( #34263 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-19 13:23:49 -05:00
roikoren755
3eff45d793
Revert "[NemotronH] Do not force router to run in fp32 ( #34582 )" ( #34808 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 09:47:05 -08:00
Robert Shaw
4685a630a2
[Model Bash][DeepSeekR1] Remove Shared Expert Clone ( #34344 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-19 07:56:14 -08:00
Eldar Kurtić
ee1d25f199
[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers ( #34471 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 07:55:41 -08:00
Linda
6fff24f30f
[Bugfix] Qwen3.5 kv-scale weight remapping ( #34719 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
2026-02-19 04:13:37 -08:00
Cyrus Leung
23210a911e
[CI/Build] Try to make beam search test less flaky ( #34885 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:16:58 +08:00
Cyrus Leung
1391378861
[Bugfix] Fix edge case in UUID data parsing ( #34884 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 02:24:30 -08:00
Andreas Karatzas
f6220f9877
[ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker ( #34878 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 08:25:26 +00:00
Andreas Karatzas
2df2bb27b0
[ROCm][CI] Removing all blocking labels from MI355 until stable infra ( #34879 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 07:53:08 +00:00
Tal Nir
f75b61a9e9
[Voxtral Realtime] Fix engine crash on empty multimodal embeddings ( #34862 )
...
Signed-off-by: Tal Nir <tal@nervexneurotech.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-18 23:21:47 -08:00
Wei Zhao
7f51e93864
[Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix ( #34876 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-18 23:20:30 -08:00
Alex Brooks
4611af1663
[Bugfix] Add Quant Config to Llava Next Projector ( #34847 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-02-18 23:18:23 -08:00
Manrique Vargas
ad5aa6bd9f
fix(docs): fix typos in comments and docstrings ( #34836 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-02-18 23:17:41 -08:00
Jaeyeon Kim(김재연)
9681068cf9
[Frontend] Fix reasoning_tokens for text-based parsers in Responses API ( #33513 )
...
Signed-off-by: Jaeyeon Kim <anencore94@gmail.com >
2026-02-18 23:16:41 -08:00
Kevin H. Luu
b6101d384d
Deprecate test-pipeline.yaml ( #34864 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 02:15:27 +00:00
Woosuk Kwon
5fcb0cdd68
[Model Runner V2] Use FP32 for Gumbel Noise ( #34854 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 17:07:37 -08:00
Woosuk Kwon
c878b43b64
[Model Runner V2] Remove unnecessary copies in PW CUDA graph capture ( #34849 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 15:52:50 -08:00
rasmith
2b84ac669c
[CI][AMD][BugFix] Use torch.testing.assert_close instead of assert torch.allclose in test_rocm_skinny_gemms.py ( #34181 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 23:10:19 +00:00
zhrrr
11d3976b88
[Model Runner V2] support piecewise & mixed cudagraph ( #32771 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-02-18 15:03:17 -08:00
Yongye Zhu
40da9625a1
[MoE Refactor] Convert mxfp4 marlin into modular kernel format ( #34588 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 14:37:14 -08:00
Flora Feng
8d9babd4de
Fix empty tool_call_id in Anthropic messages API tool result conversion ( #34745 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Flora Feng <sfeng33@h100-01.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-18 14:31:59 -08:00
Aaron Hao
e99ba957ec
[BUG] Fixing Weight Sync unit test ( #34841 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-18 17:20:10 -05:00
Kyle Sayers
64ac1395e8
[Docs] Clean up speculators docs ( #34065 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-02-18 13:48:11 -08:00
Cyrus Leung
61cf087680
[Bugfix] Fix lora tests ( #34834 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-18 13:22:31 -08:00
Wenlong Wang
847a57cd12
[Bugfix][MoE Kernel] Fix incorrect routing selection for models without expert groups (e.g., MiniMax-M2.1) ( #34673 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 13:03:24 -08:00
rasmith
fcd6ac97ed
[CI][AMD][BugFix] Skip tests in test_unquantized_backend_selection that should not run on ROCm ( #34655 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 15:00:40 -05:00
Woosuk Kwon
95be2a7f22
[Model Runner V2] Minor simplification for DCP ( #34786 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 11:04:53 -08:00
Jaden Mathias
0e60c925cf
[Bugfix] Remove assert causing hipErrorStreamCaptureUnsupported ( #34455 )
...
Signed-off-by: Jaden Mathias <jaden.mathias@amd.com >
2026-02-18 18:54:54 +00:00
Teng Ma
d7ff22204a
[Misc] Add mooncake-transfer-engine to kv_connectors requirements ( #34826 )
...
Signed-off-by: Teng Ma <teng-ma@linux.alibaba.com >
2026-02-18 18:26:24 +00:00
Isotr0py
c0bd8b13da
[Bugfix] Redo Qwen3.5/Qwen3-Next GDN projector fusion ( #34697 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
2026-02-18 09:46:53 -08:00
Michael Goin
caeb887bf6
[Bugfix] Fix NVFP4 TRTLLM MoE non-gated support; add gsm8k for Nemotron-3-Nano FP8+NVFP4 ( #34725 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-18 09:39:22 -08:00
Ilya Markov
6b3166a7c7
[CI][Bugfix] Fix multinode test script ( #34820 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-18 11:45:10 -05:00
Robert Shaw
25e2e136ef
[CI] temporarily disable multi-node tests ( #34825 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 11:32:44 -05:00
Robert Shaw
6874638bc4
[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) ( #34758 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 07:42:36 -08:00
Burkhard Ringlein
e24663c5a9
Add unit tests for fp8 output fusion of triton_attn ( #34228 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-18 06:22:49 -05:00
Nick Hill
c50e105a88
[Model Runner V2] Avoid prepare prefill kernel launch overhead ( #34780 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-18 00:49:21 -08:00
Cyrus Leung
a766b30349
[Renderer] Deprecate code paths for old input processing ( #34775 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 00:35:04 -08:00
Asaf Joseph Gardin
1faa8cb73c
[Quantization] - Added uses_meta_device_weights to quant config ( #34645 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-17 23:43:44 -08:00
Marek Michalowski
e89a91d927
[Bugfix] fix activation in cpu_fused_moe_torch call ( #34696 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-17 23:39:46 -08:00
Michael Goin
909b147197
[Bugfix] Fix prefix creation for Qwen3.5 ( #34723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-17 23:39:15 -08:00
ElizaWszola
a88b3be7c4
[Bugfix] Fix quant RMS norm fusion for quantization with TMA-aligned scales ( #33255 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-17 23:35:04 -08:00
Nick Hill
a49ea5a58f
[Model Runner V2] A bit more PP simplification ( #34766 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 21:39:07 -08:00
Cyrus Leung
30ebe0dc3c
[CI/Build] Remove use of skip_v1 ( #34699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 12:19:11 +08:00
Andreas Karatzas
cef65f0715
[ROCm][CI] Removed hard-coded attn backend requirement for Qwen VL ( #34753 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-18 03:59:53 +00:00
Russell Bryant
6f3b2047ab
[Core] Fix SSRF bypass via backslash-@ URL parsing inconsistency ( #34743 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: isotr0py <2037008807@qq.com >
2026-02-18 03:53:35 +00:00
Luka Govedič
02e8f26cea
[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ ( #34718 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-18 03:29:15 +00:00
Hongxia Yang
4a00a511bb
[BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site ( #34653 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-02-17 19:19:41 -08:00
Cyrus Leung
a0d8d944e2
[Renderer] Move MM Hash parsing into Renderer ( #34711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 19:18:55 -08:00
Amr Mahdi
df3f537a66
[CI] Remove unused precompiled wheel args from image build ( #34767 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 18:58:18 -08:00
Matthew Bonanni
7743152957
[Attention] Refactor check_and_update_config ( #33600 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-17 17:06:54 -08:00
Wentao Ye
ab33d2a629
[Feature] Decode Context Parallel support for GPU model runner v2 ( #34179 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-17 16:27:15 -08:00
Woosuk Kwon
be3af2d29e
[Model Runner V2] Further simplification for PP ( #34724 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-17 15:18:18 -08:00
Jongseok Park
c656ba3b4d
[Kernel] Triton-based Top-k and Top-p sampler kernels ( #33538 )
...
Signed-off-by: js_park <cakeng@naver.com >
Signed-off-by: Jongseok Park <37990712+cakeng@users.noreply.github.com >
Signed-off-by: Sunga Kim <sunga.kim@berkeley.edu >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Sunga Kim <sunga.kim@berkeley.edu >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 23:14:30 +00:00
Matthew Bonanni
dc5fa77a4e
[Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs ( #34457 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-17 14:01:27 -05:00
Flora Feng
1e4a084c8e
[CI] Fix flaky test_parsable_context ( #34717 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-17 18:42:52 +00:00
Richard Zou
7967e854da
[BugFix] Fix sp tests ( #34716 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-17 17:07:56 +00:00
almayne
6bd6d0c3c1
Fixed whisper CPU test that does not spawn properly. ( #34324 )
...
Signed-off-by: Anna Mayne <anna.mayne@arm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-17 06:46:23 -08:00
Nicolò Lucchesi
8e962fef5f
[CI][Nixl] Add CrossLayer KV layout tests ( #34615 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-17 21:35:40 +08:00
Cyrus Leung
574fe75245
[Renderer] Move InputPreprocessor into Renderer (2/2) ( #34560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 05:29:01 -08:00
junuxyz
c61a98f529
[CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior ( #34514 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-17 12:22:56 +00:00
Harry Mellor
28bffe9466
Fix docs build warning ( #34686 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 02:31:40 -08:00
ChenqianCao
ad65177a19
[Bugfix] Fix 'remove_instance_endpoint' method logic in disagg_proxy_demo ( #32922 )
...
Signed-off-by: ChenqianCao <39755070+ChenqianCao@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 10:06:53 +00:00
Tim Dettmers
d44a5b6c47
Remove dead bitsandbytes CxB code from 8-bit inference path ( #34633 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-17 01:49:14 -08:00
Jiangyun Zhu
1d65283e95
Revert "[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj" ( #34683 )
2026-02-17 01:29:27 -08:00
kourosh hakhamaneshi
c464b57374
[Ray] Propagate third-party env vars to Ray workers via prefix matching ( #34383 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-17 01:08:42 -08:00
Amr Mahdi
c5c38e152a
[CI] Fix bake config artifact path for AMI rebuild pipeline ( #34656 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 06:39:44 +00:00
Woosuk Kwon
d00df624f3
[Model Runner V2] Minor refactoring for penalties ( #34662 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:43:00 -08:00
Woosuk Kwon
9752da9d9c
[Model Runner V2] Minor simplification for BadWordsState ( #34669 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:27:24 -08:00
Woosuk Kwon
04925b2202
[Model Runner V2] Minor cleanup for PP ( #34666 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:15:31 -08:00
Woosuk Kwon
d74278fb67
[Model Runner V2] Fix unintended CPU-GPU sync in make_dummy ( #34667 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:00:29 -08:00
haosdent
b68fd899d1
[Bugfix] Fix fused MoE int32 overflow in stride*offset without perf regression ( #34507 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-16 17:58:49 -08:00
Aneesh Puttur
0b5f9b7204
[CI] Enable mypy import following for vllm/v1/kv_offload ( #34639 )
...
Signed-off-by: Aneesh Puttur <aneeshputtur@gmail.com >
2026-02-17 09:58:15 +08:00
zhanqiuhu
9a8853f781
[Core] Pipeline Parallel support for Model Runner V2 ( #33960 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-16 17:48:16 -08:00
zhrrr
387a1898d9
[Model Runner V2] support bad_words sampling param ( #33433 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 16:36:06 -08:00
roikoren755
3b30e61507
[NemotronH] Do not force router to run in fp32 ( #34582 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-16 10:15:32 -08:00
Alexei-V-Ivanov-AMD
824f9e8f3c
Targeting the MI355 agent pool with all existing tests ( #34629 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-16 17:02:27 +00:00
Nicolò Lucchesi
6cc403e67d
[Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] ( #34624 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-16 16:11:07 +00:00
Almog Tavor
72d5951d02
[Bugfix] Treat generation_config max_tokens as default not ceiling ( #34063 )
...
Signed-off-by: almogtavor <almogtavor@gmail.com >
2026-02-16 07:58:24 -08:00
Lucas Kabela
a3205beffb
[CI] Enable mypy coverage for individual excluded files ( #34292 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 07:34:29 -08:00
Christian Pinto
6930becd45
(bugfix): Fixed encode in LLM entrypoint for IOProcessr plugin prompts ( #34618 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-02-16 07:33:55 -08:00
Andreas Karatzas
03a8770a6d
[ROCm][CI] Fix plugins test group; updating terratorch and dependencies ( #34589 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 07:33:42 -08:00
Yiqi Xue
bc56a1d56e
[Bugfix] Fix ARC touch KeyError for non-ready T1 blocks in kv offload ( #34576 )
...
Signed-off-by: Yiqi Xue <xuey666@gmail.com >
2026-02-16 07:33:19 -08:00
danisereb
ec7d9e6745
Fix call to moe_mk in modelopt MoE modules (required for LoRA) ( #34575 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-16 07:33:09 -08:00
Isotr0py
3bb4e4311c
[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj ( #34492 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-16 07:32:51 -08:00
Amr Mahdi
08f8c198ae
[CI] Disable precompiled wheel path in CI image builds ( #34606 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-16 15:14:43 +00:00
Harry Mellor
a21cedf4ff
Bump lm-eval version for Transformers v5 compatibility ( #33994 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 05:24:35 -08:00
emricksini-h
3ef74cde5d
[CI][Tracing] Fix race condition by adding server readiness check ( #34364 )
...
Attempt to resolve #34284: "Metrics Tracing (2GPU)" fails with a segmentation fault.
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
2026-02-16 12:57:39 +00:00
Ekagra Ranjan
cd81cdb399
[Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs ( #31058 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-16 11:08:44 +00:00
Andreas Karatzas
1e828573b4
[CI][Metrics] Stabilize tests with polling and subprocess guards ( #34566 )
...
test_abort_metrics_reset is flaky due to hardware-dependent
fixed sleeps: replace fixed sleeps with polling.
test_metrics_exist_run_batch passes even when the engine crashes
on startup (false positive): add subprocess lifecycle guards.
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 10:52:02 +00:00
Samu Tamminen
a5ccc85c8c
[Bugfix] Fix Dynamo unexpected keyword argument ( #34320 )
...
Signed-off-by: Samu Tamminen <stammine@amd.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-16 01:32:30 -08:00
Roger Wang
b5475d0534
Revert "[Misc] fix qwen3.5 config" ( #34610 )
2026-02-16 01:06:05 -08:00
JJJYmmm
9521002f0a
[Misc] fix qwen3.5 config ( #34604 )
2026-02-16 00:25:38 -08:00
Cyrus Leung
ec17bdd894
[Renderer] Move InputPreprocessor into Renderer (1.5/2) ( #34598 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-15 23:46:33 -08:00
Amr Mahdi
bb59c90248
[CI] Write bake config to temp directory instead of repo root ( #34569 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-15 22:15:47 -08:00
bnellnm
5bff999d12
[Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues ( #34453 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-15 20:10:50 -08:00
Lucas Wilkinson
bb85929aa6
[BugFix] Fix Python 3.13 FlashMLA import error ( #34548 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-15 20:09:18 -08:00
Parth Bansal
5653021094
[Doc] Add Mistral-7b-v0.3 model to the batch invariance validated model ( #34584 )
...
Signed-off-by: Parth Bansal <parthbansal127@gmail.com >
2026-02-16 12:09:00 +08:00
Andreas Karatzas
974d829b05
[CI][Frontend] Return 422 instead of 500 for invalid Anthropic tool_choice ( #34590 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-15 20:06:48 -08:00
Isotr0py
91ac5d9bfd
[CI/Build] Enable tests for recent day-0 new models ( #34585 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 18:17:04 -08:00
Luka Govedič
23d825aba1
[torch.compile] Disable ar-rms fusion for ds3-fp4 & DP, fix CI test ( #34392 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-15 06:33:57 -08:00
Maryam Tahhan
f07a128413
[CPU][ARM] Add ARM BF16 cross-compilation support and improve documen… ( #33079 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-15 06:33:08 -08:00
Isotr0py
71cd89264f
[MM Encoder] Add Triton ViT attention backend ( #32183 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 06:32:47 -08:00
Isotr0py
19fab44152
[Doc] Update Encoder-Decoder models support doc with Florence-2 ( #34581 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 04:18:57 -08:00
Seiji Eicher
79c7e09235
[KV Connector] Add temporary, off-by-default VLLM_DISABLE_REQUEST_ID_RANDOMIZATION workaround ( #34415 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-14 23:26:10 -08:00
haosdent
79f3fab05a
[Bugfix] Handle num_expert_group=None in flashinfer block-scale FP8 MoE ( #34494 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-14 23:25:46 -08:00
Vadim Gimpelson
604b9eaec5
[BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 ( #34476 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-14 23:25:17 -08:00
Stanislav Kirillov
50dbd6c9e6
[bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used ( #34516 )
...
Signed-off-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-14 23:24:25 -08:00
Andreas Karatzas
98bcc6ca59
[CI][Entrypoints] Validate detokenize token IDs to prevent int64 overflow causing 500 ( #34468 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 23:08:38 -08:00
Andreas Karatzas
f13e86d8dd
[Kernels] Fix Helion GPU utils to use platform-agnostic device name API ( #34537 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 20:29:23 -08:00
Woosuk Kwon
9ca768c740
[Model Runner V2] Minor cleanup for Sampler ( #34563 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-14 18:29:03 -08:00
Thomas Parnell
d5fe3f702c
[Hybrid] Enable mamba prefix cache "align" mode with async scheduling ( #33997 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-02-14 13:15:56 -08:00
Cyrus Leung
73391a1baa
[Renderer] Move InputPreprocessor into Renderer (1/2) ( #34510 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-14 10:14:21 -08:00
Andreas Karatzas
b3c14229b0
[ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests ( #34538 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 07:32:09 -08:00
Roger Wang
2f186635cb
[Bugfix] Fix Qwen3.5 config loading ( #34554 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-14 03:56:11 -08:00
Christian Pinto
342a7cda2d
[Misc] Update tests and examples for Prithvi/Terratorch models ( #34416 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 23:03:51 -08:00
Kata Coder
d1ea65d0a1
[new model] add COLQwen3 code & Inference ( #34398 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
Signed-off-by: katacoder <craftsangjae@gmail.com >
2026-02-14 12:15:19 +08:00
Andreas Karatzas
de42abb366
[CI] Heavy refactoring of Voxtral multimodal audio model tests ( #34294 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:04:29 -08:00
Julien Denize
60ca7981bc
Add explicit validation error for tool calls. ( #34438 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-02-13 20:04:01 -08:00
Christian S. Perone
0ef5b9147b
fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection ( #34527 )
...
Signed-off-by: Christian S. Perone <christian.perone@gmail.com >
Signed-off-by: Christian S. Perone <perone@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-13 20:03:37 -08:00
Shiyan Deng
ed242652d7
[bug] Make sure get_modality_with_max_tokens is deterministic ( #34533 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-02-13 20:02:59 -08:00
Wei Zhao
b37b679770
[Feature][Perf] Support Selective CPU Weight Offloading ( #34535 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-13 20:02:24 -08:00
Andreas Karatzas
a0638d052d
[Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993 ( #34543 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:01:42 -08:00
Harry Huang
c027541eaf
[Hybrid] Enable spec decoding in mamba cache align mode ( #33705 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 13:02:28 -08:00
Ben Browning
fd267bc7b7
[Bugfix]: Fix structured output in multi-turn gpt-oss ( #34454 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 11:12:48 -08:00
Michael Goin
bfaa559305
Revert "[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides" ( #34530 )
2026-02-13 10:35:29 -08:00
Richard Zou
87789c8364
[Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only ( #34523 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-13 09:52:20 -08:00
Pushpinder Singh
bcd65c1f6a
[Bugfix] Replace c10::optional with std::optional in topk kernel ( #34467 )
...
Signed-off-by: Pushpinder Singh <pushpindersingh135@gmail.com >
2026-02-13 08:30:23 -08:00
Wei Zhao
59d53066d8
[Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation ( #32993 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-13 08:11:26 -08:00
LoganJane
4a9952ec1b
[Bugfix] Add quant_config in ViT of Kimi-K2.5 ( #34501 )
...
Signed-off-by: LoganJane <LoganJane73@hotmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-13 16:05:34 +00:00
Roger Wang
1dae7b7843
[Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one ( #34508 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 13:59:00 +00:00
Roger Wang
5885e330ef
[Misc] Port Qwen3.5 Configs ( #34512 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 05:24:25 -08:00
Ilya Boytsov
071d863e20
Extend ColBERT support to non-standard BERT backbones ( #34170 )
...
Signed-off-by: Ilya Boytsov <ilya.boytsov@aleph-alpha.com >
2026-02-13 09:53:09 +00:00
Woosuk Kwon
0916e7960b
[GDN] Use CPU tensors to build GDN metadata ( #34498 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-13 01:24:45 -08:00
Wentao Ye
3d2a026fd0
[Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement ( #33368 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-13 16:38:16 +08:00
Aaron Hao
dddbff4624
[Core] Move pause and resume functions into engine ( #34125 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-13 00:15:10 -08:00
Martin Hickey
47e9b63e1a
[KVConnector] Clean up redundant code in KV connectors ( #34147 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-02-13 00:14:30 -08:00
Matthias Gehre
934acddef9
[Perf] fused_moe: add int4_w4a16 benchmark support and tuning config ( #34130 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-13 00:14:27 -08:00
Marek Michalowski
742d214d6e
[Bugfix] fix the import path in moe test utils.py ( #34245 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-13 00:13:45 -08:00
haosdent
4137c5dfa7
[Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode ( #34418 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-13 00:13:22 -08:00
Harry Huang
7a8a46ddcb
[BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec ( #34440 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 00:13:14 -08:00
myselvess
bcf0731aa0
[New Model] support new model ovis2.6 ( #34426 )
...
Signed-off-by: myselvess <23743269+myselvess@users.noreply.github.com >
2026-02-13 00:12:45 -08:00
Cyrus Leung
ec090c2429
[Refactor] Call renderer for online IO processor request ( #34490 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 22:48:45 -08:00
Roger Wang
eea3024f43
[Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 ( #34489 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-12 22:48:42 -08:00
Cyrus Leung
2f308214c0
[Refactor] Pass full VllmConfig to Renderer ( #34485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 22:48:38 -08:00
Cyrus Leung
1b4e8e53f8
[CI/Build] Fix CUDA re-initialization error in distributed model tests ( #34491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-13 06:43:53 +00:00
haosdent
dcf6ee8592
[Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image ( #34483 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-12 21:04:06 -08:00
Cyrus Leung
372b2e762a
[Bugfix] Standardize getting number of image patches/tokens ( #34358 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:47:01 -08:00
Andreas Karatzas
6afa587d31
[ROCm][CI] Fix serving tokens test failures ( #34047 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 11:27:53 +08:00
Cyrus Leung
94ed6cf6ea
Add new sections to CODEOWNERS ( #34309 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:39:28 -08:00
Harry Huang
bf37812ca7
[Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode ( #33706 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:21:52 -08:00
Frank Wang
b86bf4417e
[Bugfix] Fix Random Dataset Prefix Length Inaccuracy ( #33907 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-12 18:21:19 -08:00
Yanan Cao
de13dd781f
[Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure ( #34025 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:21:05 -08:00
LoganJane
62788f99a4
[Bugfix] Delete unused redundant code in Kimi-K2.5 ( #34427 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 18:18:42 -08:00
Cyrus Leung
ea5ff3a1f6
[Refactor] Simplify BOS/EOS token handling ( #34435 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:18:24 -08:00
bnellnm
04ea31baab
[Bugfix] Remove assert that's no longer valid ( #34443 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-12 18:18:15 -08:00
Harry Huang
6f019e6e0a
[BugFix] Add block_size validation for mamba cache align mode ( #34445 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:18:07 -08:00
Zhuohan Li
d707678dfb
Fix num_logprobs parameter description in sampler.py ( #34451 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2026-02-12 18:18:03 -08:00
Cyrus Leung
fc22cae4ac
[CI/Build] Update video URLs for testing ( #34446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:15:36 -08:00
Yanan Cao
96161fe978
[Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel ( #33373 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:13:12 -08:00
Jaewon
4453ba8d9e
[Core] Profiler improvements and lazy initialization ( #33198 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:38 -08:00
Jaewon
aa181c923b
[Core] Add sleep level 0 mode with enqueue/wait pattern ( #33195 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:25 -08:00
Alec S
be7370daf3
[Frontend] Enable generic structured_outputs for responses API ( #33709 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
2026-02-12 16:15:48 -08:00
Mengtao (Martin) Yuan
9ea1f598ce
Use paged_attention_v1 for sliding window decode in rocm_aiter_fa ( #34378 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-02-12 16:14:43 -08:00
amitz-nv
f120bd42d3
[Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 ( #33506 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-02-12 13:06:58 -08:00
Hashem Hashemi
fac4e96940
small adjustment to wvSplitKrc ( #34410 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-12 20:26:36 +00:00
Michael Goin
6d4e27ce29
[Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA ( #34374 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 12:08:06 -08:00
Andreas Karatzas
4c078fa546
[ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility ( #34447 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-12 18:47:34 +00:00
Patrick von Platen
6c0baee610
[Voxtral Realtime] Refactor & Improve buffering logic ( #34428 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 09:46:43 -08:00
Patrick von Platen
1100a97621
[Voxstral Realtime] Enable tests ( #33803 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-02-12 09:43:24 -08:00
xuebwang-amd
766e167821
[ROCm][quantization] improve OCP weight quant parser robust ( #34431 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 09:40:19 -08:00
Isotr0py
becbe24808
[Bugfix] Remove broken raw url GGUF model loading support ( #34433 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 09:40:01 -08:00
Harry Mellor
679ca5d8d3
Fix MoE for the Transformers modelling backend ( #34436 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-12 09:29:42 -08:00
Matthew Bonanni
f2c47886fd
[Attention] Add FlashInfer Sparse MLA backend ( #33451 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-02-12 17:21:54 +00:00
Nicolò Lucchesi
334c715e0f
[Docs] Spec decoding docs warning removal ( #34439 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-12 09:01:51 -08:00
Aaron Hao
7b5a8b4a9d
[BUG] Reset running requests when clearing cache for pause/resume ( #34382 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-02-12 16:19:13 +00:00
danisereb
dea63512bb
Add config file for fused MoE for Nemotron (TP4, B200) ( #34411 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-12 06:09:55 -08:00
Douglas Lehr
8a798be929
[ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter ( #34192 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2026-02-12 05:06:33 -08:00
Cyrus Leung
fb455ed547
[V0 Deprecation] Remove code related to per-request logits processors ( #34400 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:44:28 +08:00
baonudesifeizhai
f5897613fb
Fix Mistral config remap to accept compressed-tensors quantization #34028 ( #34104 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-02-12 08:22:06 +00:00
Louie Tsai
55a1a9563a
Vllm CPU benchmark suite improvement ( #34128 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-02-12 16:04:44 +08:00
AllenDou
386bfe5d08
[bugfix] refactor FunASR's _get_data_parser ( #34397 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-12 07:26:49 +00:00
Kyle Sayers
e9cd691132
[Bugfix] Fix Sparse24 Compressed Tensors models ( #33446 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 23:15:16 -08:00
Yichuan Wang
80f2ba6ea6
Fix DeepSeek-OCR tensor validation for all size variants ( #34085 )
...
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-11 22:50:23 -08:00
Lucas Wilkinson
136b0bfa59
[BugFix] Fix DP chunking ( #34379 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Bill Nell <bnell@redhat.com >
2026-02-12 06:44:03 +00:00
Cyrus Leung
b96f7314b4
[Refactor] Pass Renderer to Input Processor ( #34329 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:38:11 -08:00
Cyrus Leung
ced2a92f40
[Refactor] Move validation to params definitions ( #34362 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:33:15 -08:00
Runkai Tao
e1d97c38f8
[Bug Fix] Fix naive_block_assignment always defaulting to False due to arg misalignment ( #33848 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-12 11:30:57 +08:00
Michael Goin
ec12d39d44
[Bugfix] Fix MTP accuracy for GLM-5 ( #34385 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 11:08:19 +08:00
Michael Goin
ff1f83b056
[Refactor] Replace activation: str with MoEActivation enum ( #33843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 17:29:32 -08:00
Kevin H. Luu
83b47f67b1
[ci] Integrate AMD tests into CI ( #33626 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 08:54:17 +08:00
Micah Williamson
fb7b30c716
[ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI ( #34384 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 15:52:34 -08:00
bnellnm
31d992d215
[Bugfix] Fix some issues with MoERunner PR #32344 ( #34371 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-11 14:33:14 -08:00
Wei Zhao
5aff2699bd
Fix CI failure - Flashinfer Kernel tests ( #34316 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-11 14:17:16 -08:00
Raushan Turganbay
527ca32197
[Bugfix] Fix more multimodal tests for transformers V5 ( #34334 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-02-11 22:02:05 +01:00
Junseo Park
5458eb835d
[Bugfix] send None sentinel on final commit so server properly sends transcription.done ( #33963 )
...
Signed-off-by: pjs102793 <pjs102793@naver.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 21:01:53 +00:00
Tomas Ruiz
144d9b7cc8
[Benchmarks] Reduce ready checker log verbosity ( #34349 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2026-02-11 20:57:57 +00:00
elvischenv
83e26c834e
[GPT-OSS] Remove unnecessary contiguous ( #34337 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-02-11 15:29:29 -05:00
TJian
5001211369
[ROCm] [CI] fix test_unrecognized_env ( #34350 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-02-11 18:50:44 +00:00
Eldar Kurtić
11c7ace340
[Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) ( #34243 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-11 13:24:22 -05:00
Xinyu Dong
be7f3d5d20
[Bugfix] fix default is_neox_style is True for deepseek ( #34353 )
...
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com >
2026-02-11 18:20:45 +00:00
Isotr0py
0ab06100f4
[Multimodal] Expose mm_processor_kwargs for DummyInputsBuilder ( #34330 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-11 09:37:40 -08:00
Xinyu Chen
ffb3d553cc
[Model Runner V2] Init cuda graph pool when necessary ( #33217 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-02-11 09:12:13 -08:00
junuxyz
fa7e0bfacf
[CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… ( #32458 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-11 17:03:48 +00:00
SorenDreano
48134a2c22
[Docs] Fix typo ("defult") and double spacing ( #34348 )
...
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com >
Co-authored-by: Soren Dreano <soren@numind.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-11 09:02:27 -08:00
kliuae
64f570ab56
[ROCm] [aiter] Split KV cache update for AiterFlashAttention ( #33681 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2026-02-11 16:26:44 +00:00
Rohan Potdar
fd618871b4
[Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache ( #33948 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-11 11:12:05 -05:00
Harry Mellor
67a42b5a44
Don't try and run GLM-ASR with remote code ( #34352 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 08:09:40 -08:00
Lucas Wilkinson
c7914d30f9
Reapply [Attention][FA3] Update FA3 to include new swizzle optimization ( #34043 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-11 07:07:56 -08:00
Adam Binford
1b8756562e
Responses harmony system message structured ( #34268 )
...
Signed-off-by: Adam Binford <adamq43@gmail.com >
2026-02-11 05:14:28 -08:00
Linda
275e0d2a99
[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE ( #33715 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-02-11 12:38:11 +00:00
Harry Mellor
0f5e55e7a8
Make JAIS compatible with Transformers v5 ( #34264 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 12:30:37 +00:00
Harry Mellor
1e9204bff3
Make Qwen3VL compatible with Transformers v5 ( #34262 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-11 04:13:23 -08:00
Li, Jiang
05339a7b20
[Bugfix][CPU] Fix llama4 inference on CPU ( #34321 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-11 19:07:23 +08:00
Harry Mellor
40b8f55358
[Docs] Reduce time spent generating API docs ( #34255 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 02:56:02 -08:00
Seiji Eicher
5045d5c983
Patch protobuf for CVE-2026-0994 ( #34253 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-11 02:25:04 -08:00
Nick Hill
e09546cf05
[Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer ( #34217 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 11:03:24 +01:00
Tianqi Ren
786806dd44
[Doc] Update Marlin support matrix for Turing ( #34319 )
...
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com >
2026-02-11 09:03:41 +00:00
Nick Hill
79504027ef
[Misc] Bump fastsafetensors version for latest fixes ( #34273 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 00:30:09 -08:00
Luka Govedič
addac0e653
[torch.compile] Enable AR+rms fusion by default available for -O2 ( #34299 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-11 00:30:00 -08:00
Cyrus Leung
675a22ed66
[Chore] Move BaseRenderer to base.py ( #34308 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 00:29:51 -08:00
Kunshang Ji
cb9574eb85
[XPU][9/N] clean up existing ipex code/doc ( #34111 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-11 00:27:15 -08:00
AllenDou
21dfb842d7
[model] support FunASR model ( #33247 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-11 07:37:09 +00:00
R3hankhan
d1b837f0ae
[CPU] Enable FP16 (Half dtype) support for s390x ( #34116 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-11 14:41:42 +08:00
Roger Wang
0b20469c62
[Bugfix] Fix weight naming in Qwen3.5 ( #34313 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 21:37:14 -08:00
Tyler Michael Smith
d7982daff5
[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides ( #34279 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-11 05:15:52 +00:00
Robert Shaw
9b17c57460
[ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast ( #34298 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-11 05:00:00 +00:00
Hashem Hashemi
1b3540e6c6
Threshold fix wvSplitk for occasional CI fails ( #34013 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-11 03:59:14 +00:00
Matthias Gehre
7a048ee65f
[Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 ( #34149 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-11 03:58:56 +00:00
Cyrus Leung
c9a1923bb4
[Plugin] Simplify IO Processor Plugin interface ( #34236 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:47:39 -08:00
zofia
b482f71e9f
[XPU][7/N] enable xpu fp8 moe ( #34202 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-11 03:33:59 +00:00
Дзержи́нский
1485396abb
[Kernel] Apply 256bit LDG/STG To Activation Kernels ( #33022 )
...
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com >
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-10 19:31:51 -08:00
Kebe
5ee5c86eeb
[Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast ( #33884 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2026-02-10 19:31:36 -08:00
Cyrus Leung
b5dcb372e4
[Misc] Clean up validation logic in input processor ( #34144 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:29:29 -08:00
Tyler Michael Smith
066c6da6a0
[WideEP] Fix nvfp4 DeepEP High Throughput All2All backend ( #33738 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 19:15:43 -08:00
Richard Zou
e30cedd44b
[torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter ( #34093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 19:15:40 -08:00
Cyrus Leung
3bcd494ef4
[Redo] Add --trust-remote-code to dataset bench args ( #34251 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 11:10:12 +08:00
tianshu-Michael-yu
0e725a7d22
[Bugfix] Fix Worker.load_model context-manager composition for sleep mode ( #34021 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-02-11 11:07:51 +08:00
Lucas Wilkinson
ba0511fd80
[Misc] Add run one batch script that supports profiling ( #32968 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-10 18:29:49 -08:00
Micah Williamson
4a1550d22d
[ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline ( #34280 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 01:08:11 +00:00
bnellnm
d1481ba783
[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner ( #32344 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-10 19:51:07 -05:00
7. Sun
dc6de33c3d
[CI] Add pip caching to cleanup_pr_body workflow ( #32979 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-02-11 00:45:28 +00:00
Tyler Michael Smith
c4b9e6778f
[Misc] Add pre-commit hook to catch boolean ops in with-statements ( #34271 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-10 15:13:20 -08:00
Richard Zou
341eed3d30
[torch.compile] Disable recursive pre_grad_passes ( #34092 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 18:02:31 -05:00
Zhengkai Zhang
6f2f59f2b3
[Misc][Spec Decode] support different load config for draft model ( #34022 )
...
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
2026-02-10 14:52:43 -08:00
Ilya Markov
bb2fc8b5e7
[BugFix] Fix async EPLB hang with DeepEP LL all2all backend ( #32860 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-10 22:34:47 +00:00
Ilya Markov
67132945bb
[Perf] Move eplb rebalance algo to async thread ( #30888 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-10 22:19:10 +00:00
Gregory Shtrasberg
f0ca0671c7
[Feature] Warn about unrecognized environment variables ( #33581 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-10 15:45:38 -06:00
Pavani Majety
578977bb5e
[SM100] Resubmit FMHA FP8 prefill for MLA ( #31195 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-10 16:18:43 -05:00
Roger Wang
9615575afc
[Bugfix] Fix mamba cache dtype for Qwen3.5 ( #34200 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:12:31 -08:00
Matthew Bonanni
4293c00b84
[Benchmarks] Fix attention benchmark smoke test ( #34269 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-10 16:04:07 -05:00
J Seppänen
506ad7d7c1
[Bugfix] Fix weights offloading for sleep mode ( #32947 )
...
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-02-10 20:38:17 +00:00
Reagan Lee
fdd6f2ad58
Convert online APIs to use Renderer ( #34084 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
2026-02-10 19:44:31 +00:00
Qi Wang
33bcd3dc3b
[Misc] Introduce ec_both role EC (encoder cache) connector ( #34182 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-02-10 18:55:35 +00:00
Michael Goin
1f5febb4b8
[UX nit] Fix non-default api_server_count message ( #34152 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-10 10:35:58 -08:00
Andy Lo
ae871ca923
Minor cleanup for Voxtral ( #34247 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-02-10 18:18:30 +00:00
Woosuk Kwon
a2443de5fa
[Model Runner V2] Use pinned memory for write_contents ( #34222 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-10 08:55:22 -08:00
Harry Mellor
f84a2a8f31
[Docs] Speed up build environment set-up ( #34240 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 16:34:43 +00:00
Vadim Gimpelson
000214c4bb
[BUGFIX] Fix accuracy bugs in Qwen3-Next MTP ( #34077 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-10 10:57:11 -05:00
junuxyz
c5a66d1697
[Core][BugFix] Fix PP KV cache sharding memory validation ( #33698 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-10 10:46:24 -05:00
Roberto L. Castro
afdce12c89
[Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention ( #33680 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-02-10 10:29:52 -05:00
Zhengxu Chen
82e11973cc
[compile] Enable AOT compile with 2.10 in trunk. ( #34155 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com >
2026-02-10 23:24:42 +08:00
xuebwang-amd
b129136c7a
[ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations ( #29008 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 10:08:05 -05:00
mgazz
599e4335a4
Support benchmarking of Geospatial models ( #33922 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
2026-02-10 07:04:16 -08:00
Fan Yang
a1946570d8
add --insecure arg to the vllm bench to skip TLS ( #34026 )
...
Signed-off-by: Fan Yang <yan9fan@meta.com >
Co-authored-by: Fan Yang <yan9fan@meta.com >
2026-02-10 22:23:52 +08:00
Harry Mellor
d0bc520569
Bump mamba-ssm version in CI for Transformers v5 compatibility ( #34233 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 14:46:01 +01:00
Krish Gupta
748625cdaf
[V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input ( #34220 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-10 13:05:32 +00:00
Harry Mellor
61413973e8
Stop testing for slow tokenizers as they will not exist soon ( #34235 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 12:08:20 +00:00
Phúc H. Lê Khắc
94de871546
[Misc] allow specify is_mm_prefix_lm in hf_config ( #34215 )
2026-02-10 11:16:21 +00:00
tc-mb
e042d7e685
Add flagos in MiniCPM-o ( #34126 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
2026-02-10 02:51:48 -08:00
Roger Wang
ae4e280602
[Bugfix] Fix FI kernelchunk_gated_delta_rule output shape for Qwen3.5 ( #34219 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 10:41:24 +00:00
zzaebok
cbea11c9f0
[Docs] Fix format error in KV load failure recovery doc ( #34137 )
...
Signed-off-by: Jaebok Lee <jaebok9541@naver.com >
2026-02-10 02:16:26 -08:00
Cyrus Leung
2c32558a3c
[Bugfix] Fix --trust-remote-code conflict ( #34218 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 00:29:10 -08:00
Zetong Li
5f970120f0
[Bugfix] Fix memory inconsistency in cross-process shared memory ( #32022 )
...
Signed-off-by: Zetong Li <slippersss@126.com >
2026-02-10 08:22:03 +00:00
Cyrus Leung
998e2d91f8
Revert #34208 ( #34216 )
2026-02-09 23:59:04 -08:00
Wentao Ye
e1060a71a1
[Perf] Optimize detokenizer python logic ( #32975 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-09 23:54:41 -08:00
Chen Zhang
97fa8f6590
[BugFix] Avoid prefix cache hit in the same schedule step for mamba layers ( #29387 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-10 07:41:16 +00:00
wang.yuqi
dab1de9f38
[Frontend][CI] Consolidate instrumentator entrypoints ( #34123 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-10 07:30:19 +00:00
Balaxxe
8d48d0a9d9
[Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 ( #34190 )
...
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com >
2026-02-09 23:06:30 -08:00
Andrew Xia
9608844f96
[responsesAPI] fix simpleContext streaming output_messages ( #34188 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-09 22:53:07 -08:00
Cyrus Leung
f69b903b4c
[Bugfix] Add --trust-remote-code to dataset bench args ( #34208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 22:37:50 -08:00
Lucas Wilkinson
81e217fe6b
[Bugfix] Fix DP Attention Padding in Dummy Run ( #34187 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-10 05:29:39 +00:00
Cyrus Leung
ab97bcf662
[CI/Build] Relax test_mcp_tool_call ( #34204 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 05:18:57 +00:00
Cyrus Leung
25e48a3aae
[Doc] Update usage of --limit-mm-per-prompt ( #34148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 21:12:13 -08:00
Roger Wang
8a5e0e2b2b
[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching ( #34183 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:03:32 +08:00
Andreas Karatzas
4cde2e0159
[ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection ( #34108 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 20:50:20 -08:00
Roger Wang
047a457fa4
[Bugfix] Adopt ChunkGatedDeltaRule for Qwen3.5 ( #34198 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 03:47:54 +00:00
Yuwei An
e94ec59733
[LMCache] Token Base IPC API ( #34175 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-02-10 01:18:42 +00:00
Ning Xie
13397841ab
[structured output] validate unsupported json features first ( #33233 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2026-02-09 23:49:09 +00:00
Gregory Shtrasberg
c60f8e3b49
[Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available ( #34153 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-09 17:38:54 -06:00
Michael Goin
5e75a14a66
[Doc] Add DCP support to attention backend doc ( #33936 )
2026-02-09 18:33:43 -05:00
Nick Hill
e7e52781ff
[ModelRunner V2][BugFix] Fix max_query_len calculation ( #34167 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-09 21:47:17 +00:00
Charlie Fu
bb9f97308d
[torch.compile][Fusion] Fix attention fusion pass removing kv_udpate op. ( #33945 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-02-09 16:15:43 -05:00
Hongxia Yang
4d39650961
[ROCm] update triton branch to support gpt-oss models for gfx11xx devices ( #34032 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2026-02-09 19:36:30 +00:00
Artus Krohn-Grimberghe
8fd31f6245
[Bugfix] Voxtral prompt/audio placeholder alignment ( #34140 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:30:38 +00:00
Artus Krohn-Grimberghe
eadb4e868b
[Bugfix] Avoid duplicate k-proj weight emission in helper ( #34142 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:17:44 +00:00
Jiangyun Zhu
285bab4752
[Kernel] use flashinfer for gdn prefill ( #32846 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-09 12:17:25 -05:00
TomerBN-Nvidia
995bbf38f1
[Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) ( #34087 )
...
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-09 16:44:18 +00:00
Mohammad Miadh Angkad
d4f123cc48
[Kernel] FlashInfer: switch allreduce fusion to unified API ( #33985 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-02-09 15:43:24 +00:00
ZhengHongming888
cb62e86f83
Add NUMA Core binding in nixl_connector for CPU xPyD ( #32365 )
...
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com >
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-09 15:39:12 +00:00
Luka Govedič
781ddf7868
[CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 ( #34031 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-09 10:05:14 -05:00
Roger Wang
64a9c2528b
[UX] Add --language-model-only for hybrid models ( #34120 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-09 14:57:33 +00:00
Lucas Wilkinson
d0d97e2974
[Misc] Fix up attention benchmarks ( #33810 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-09 09:42:03 -05:00
JJJYmmm
9562912cea
[MODEL] Adding Support for Qwen3.5 Models ( #34110 )
...
Signed-off-by: JJJYmmm <1650675829@qq.com >
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: wulipc <wulipc@users.noreply.github.com >
Co-authored-by: ywang96 <ywang96@users.noreply.github.com >
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-09 21:12:58 +08:00
zofia
9bdb06b436
[XPU][6/N] add xpu scaled_mm kernel ( #34117 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-09 20:17:35 +08:00
Nikhil Gupta
caad9f1e01
[Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul ( #33901 )
...
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com >
2026-02-09 18:04:41 +08:00
Ekagra Ranjan
1d5922fade
[ASR] Fix audio benchmark and add RTFx metric ( #32300 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-02-09 10:02:37 +00:00
Andreas Karatzas
3025b3cebb
[CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr ( #34107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 17:37:04 +08:00
Jee Jee Li
978a37c823
[Model] GLM adaptation ( #34124 )
2026-02-09 17:32:52 +08:00
ihb2032
5a5c43511a
fix(cpu): fix mla_decode compilation on x86 without AVX512 ( #34052 )
...
Signed-off-by: ihb2032 <hebome@foxmail.com >
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain >
2026-02-09 08:55:41 +00:00
Nick Hill
d9bede0314
[BugFix] Fix fastsafetensors TP all procs using all GPUs ( #34070 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-09 15:15:46 +08:00