Cyrus Leung
e5de19ff9a
[CI/Build] Don't auto-rebase PRs with CI failures ( #39443 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 13:57:37 -07:00
zzaebok
edee96519a
[Spec Decode] fix returning size mismatch on extract hidden states proposer ( #38610 )
...
Signed-off-by: Jaebok Lee <jaebok9541@naver.com >
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-04-09 20:39:39 +00:00
Rishi Puri
adaabb8a55
Add nightly b200 test for spec decode eagle correctness ( #38577 )
...
Signed-off-by: Rishi Puri <riship@nvidia.com >
2026-04-09 20:09:09 +00:00
Ekagra Ranjan
f7cad67412
[ASR] Fix spacing between chunks in multi chunk audio transcription ( #39116 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-04-09 12:46:33 -07:00
Xinyu Chen
a8134aef4e
[XPU] check is_xccl_available before oneccl warmup ( #39302 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-04-09 12:42:17 -07:00
Michael Goin
2800706f06
[Refactor] Move NVFP4 GEMM management into NvFp4LinearKernel ( #39129 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-04-09 15:05:36 -04:00
Cyrus Leung
0d310ffbeb
[CI/Build] Update auto-rebase rule ( #39429 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 10:59:56 -07:00
Micah Williamson
d5f75fdf50
[ROCm] Correctly guard fused_silu_mul_block_quant on ROCm ( #39387 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-04-09 17:59:03 +00:00
PikaPikachu
827268e98d
[Quantization] Support Quark W8A8 INT8 MoE inference ( #36320 )
...
Signed-off-by: kangletian <Letian.Kang@amd.com >
2026-04-09 17:24:43 +00:00
Wentao Ye
56e19d7ee2
[Model Runner V2] Fix flex attention kv blocks calculation issue ( #39353 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-09 13:07:43 -04:00
Andreas Karatzas
9036d4c464
[ROCm][CI] Resolved nvidia package deps issue ( #39421 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-10 00:06:06 +08:00
Lucas Kabela
a8c6ee9b78
[Performance Improvement] Update batched_count_greater_than to handle batch size 1 without recompile ( #38933 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-04-09 23:51:31 +08:00
Cyrus Leung
3b1d9c3156
[CI/Build] Fix memory cleanup in MM test ( #39411 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 08:50:45 -07:00
Cyrus Leung
54d244f28f
[UX] Improve error message for MM input too long ( #39409 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 13:20:19 +00:00
Richard Zou
6c749399b7
[BugFix] fix tests/kernels/moe/test_moe_layer.py ( #39404 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-04-09 08:48:59 -04:00
lalit10
91eea72330
[Tests] Add Qwen3-VL multimodal memory leak check ( #39268 )
...
Signed-off-by: Lalit Laxminarayan Bangad <lalitbangad@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 04:54:46 -07:00
Andrii Skliar
df2503e125
nemotron-nano-vl: Allow use_audio_in_video to be passed at vllm serve time ( #38538 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-04-09 11:44:39 +00:00
Nick Hill
c8d98f81f6
[Core] Simplify API server handshake ( #39364 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-04-09 18:56:15 +08:00
Harry Mellor
d87fb264df
[Docs] Bring README updates into docs README ( #39397 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-04-09 10:35:00 +00:00
wang.yuqi
66c079ae83
[Frontend][4/n] Improve pooling entrypoints | pooling. ( #39153 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-04-09 10:09:45 +00:00
Shengqi Chen
b6c9be509e
[CI] fix possible user permission issues in nightly index generation ( #39390 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2026-04-09 08:14:07 +00:00
Qidong Su
ed733802f0
Fix NUMA binding on non-CDMM Grace-Blackwell systems ( #39361 )
...
Signed-off-by: Qidong Su <soodoshll@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-09 07:36:51 +00:00
Andrew Barnes
8a34c5087a
[ROCm] Remove unnecessary fp8 roundtrip in gather cache NHD dequant ( #39122 )
...
Signed-off-by: Bortlesboat <bortstheboat@gmail.com >
2026-04-09 15:12:22 +08:00
Wentao Ye
ed2f282bc8
[Perf] Optimize redundant sync for pooling model, 3.7% Throughput Improvement ( #39113 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-08 23:12:23 -07:00
Zhewen Li
9e78555743
[Docker] Add fastsafetensors to NVIDIA Dockerfile ( #38950 )
2026-04-08 22:21:37 -07:00
sihao_li
e80e633927
[XPU] Skip VLLM_BATCH_INVARIANT for XPU in EAGLE DP test ( #39164 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-09 12:45:16 +08:00
Khairul Kabir
490f17d0c7
[Multimodal] Fix nested_tensors_equal: add length check for lists and tuple support ( #38388 )
...
Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com >
Co-authored-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com >
2026-04-09 04:40:37 +00:00
Yongye Zhu
2e98406048
[Refactor] Improve indexer decode path metadata preparation ( #38865 )
2026-04-08 20:49:15 -07:00
Chendi.Xue
ef5a226819
[PD][HeteroArch] Fix accuracy issue with CPU_ATTN as Decoder and Flash_ATTN as prefiller ( #38935 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-04-09 11:19:07 +08:00
Wentao Ye
aec18492d0
[CI] Fix mypy for vllm/v1/ops ( #39219 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-09 11:06:34 +08:00
noobHappylife
2a49284c8a
Fix Responses JSON schema alias serialization ( #38519 )
...
Signed-off-by: noobhappylife <aratar1991@hotmail.com >
Co-authored-by: OpenAI Codex <codex@openai.com >
2026-04-09 10:50:16 +08:00
Ilya Boytsov
d37b378762
[Model] Update ColModernVBERT to support latest HF checkpoint ( #39307 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
2026-04-09 10:48:51 +08:00
Wei Zhao
92fbec391b
[Bug] Fix routing bias dtype for trtllm per-block fp8 moe ( #38989 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-04-08 19:42:43 -07:00
Ajay Anubolu
2f41d6c063
[Bugfix] Fix cpu-offload-gb assertion with non-default block sizes ( #36461 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-04-08 19:42:16 -07:00
Dipika Sikka
3aecdf08b4
[Gemma4] Support quantized MoE ( #39045 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2026-04-08 21:57:53 -04:00
Michael Goin
eb4205fee5
[UX] Integrate DeepGEMM into vLLM wheel via CMake ( #37980 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-04-08 18:56:32 -07:00
liuzhenwei
83aea2147f
[XPU][UT] update UTs in CI ( #39296 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-09 09:38:16 +08:00
Maral
2e9034c998
[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. ( #33892 )
...
Signed-off-by: maral <maralbahari.98@gmail.com >
Signed-off-by: Maral <maralbahari.98@gmail.com >
2026-04-09 08:50:39 +08:00
Benjamin Chislett
8332078cfd
[Bugfix] FlashInfer MXINT4 MoE crashes, missing do_finalize ( #39315 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-08 20:36:33 -04:00
Richard Zou
ba4a78eb5d
[torch.compile] Allow usage of Opaque Objects in PyTorch 2.11 ( #39286 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-04-08 23:21:10 +00:00
Kai Song
f3c7941ec8
[Bugfix] Fix EP precision for Qwen3.5, Qwen3-Next ( #39181 )
...
Signed-off-by: Song Kai <songkai05@baidu.com >
2026-04-09 01:47:48 +04:00
Wentao Ye
3352bf8b03
[CI Bug] Fix pre-commit issue in main ( #39347 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-08 14:10:05 -07:00
triangleXIV
7c94ae16c6
[BugFix] --max-model-len=-1 causes over-limit requests to hang and starve the entire service ( #39102 )
...
Signed-off-by: triangle14 <y1019026570@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-04-08 14:03:17 -07:00
Rishi Puri
ad05edfbca
tests/v1/e2e/spec_decode: assert async scheduling is used ( #39206 )
...
Signed-off-by: Rishi Puri <riship@nvidia.com >
Signed-off-by: Rishi Puri <puririshi98@berkeley.edu >
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Flora Feng <4florafeng@gmail.com >
2026-04-08 20:30:03 +00:00
Wentao Ye
2018137242
[Feature] Batch invariant nvfp4 linear support ( #39322 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-08 16:29:13 -04:00
Jackmin801
a776a48b1c
[MoE] Move DEEP_GEMM into experts/ subdirectory ( #39005 )
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-08 19:23:08 +00:00
Ben Browning
8477fe427d
[Tool] adjust_request to reasoning parser, and Gemma4 fixes ( #39027 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-04-08 19:04:04 +00:00
Lain
e24e0a43a4
[Attention] relax the head dim 512 and paged kv for sm90+FA4 ( #38835 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-08 18:23:18 +00:00
Roberto L. Castro
b55d830ec7
[Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode ( #37421 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-04-08 13:35:57 -04:00
Shengqi Chen
75e01a39a1
[Feature] NUMA binding support for GPU workers ( #38635 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
Co-authored-by: Jason Li <jasonlizhengjian@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-04-08 09:55:24 -07:00
Or Ozeri
512c5eb455
[kv_offload+HMA][5/N]: Track group block hashes and block IDs ( #37109 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-04-08 19:50:28 +03:00
Flora Feng
13151a4df4
[Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values ( #39114 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-04-08 16:46:27 +00:00
Gregory Shtrasberg
56c976c1b5
[ROCm] Enable fused_silu_mul_block_quant on ROCm ( #38817 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-04-08 11:23:32 -05:00
Frederik Gossen
d74a306c4b
[Core] Use tuple_return in split_module for tuple-conformant subgraphs ( #38752 )
...
Signed-off-by: Frederik Gossen <frgossen@meta.com >
Co-authored-by: Boyuan Feng <boyuan@meta.com >
2026-04-08 09:09:58 -07:00
Gregory Shtrasberg
0e9f0a516c
[ROCm][CI-Build] Cherry pick triton BUFFER_OPS fix and update AITER ( #38580 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-04-08 10:38:03 -05:00
haosdent
8904fc4d19
[Bugfix] Fix V1 logprobs empty strings for multi-byte UTF-8 tokens when logprobs > 0 ( #34875 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-04-08 15:30:00 +00:00
nemanjaudovic
1a2c17634e
[Bugfix] Add missing ASRDataset import and CLI args in benchmarks/throughput.py ( #38114 )
...
Signed-off-by: nemanjaudovic <nudovic@amd.com >
2026-04-08 13:53:53 +00:00
Matthew Bonanni
308cec5864
[FlashAttention] Symlink FA4 instead of copying when using VLLM_FLASH_ATTN_SRC_DIR ( #38814 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-04-08 12:04:34 +00:00
wang.yuqi
4e2ab1861d
[CI Failure] pin nomic-embed-text-v1 revision ( #39292 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-04-08 11:43:06 +00:00
JartX
140cbb1186
[Bugfix] Cuda Clean up scales Kvcache fp8/int8_per_token_head ( #39224 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-04-08 04:08:04 -07:00
Kevin H. Luu
6155bbd1dd
[Bugfix][Docs] Fix ReadTheDocs build crash from mocked torch decorator ( #39284 )
...
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-08 09:43:01 +00:00
rasmith
78434b923c
[CI][AMD][BugFix][Kernel] Cast induction variable to int64 on MI350 for chunk_gated_delta_rule_fwd_kernel_h_blockdim64 to avoid illegal memory access ( #39087 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-04-08 16:57:18 +08:00
Michael Goin
2488d1dca2
[Docs] Update README ( #39251 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-04-08 11:34:07 +08:00
yoke
d734445fcd
[Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls ( #38909 )
...
Signed-off-by: yoke233 <yoke2012@gmail.com >
2026-04-08 11:03:54 +08:00
Flora Feng
927975ead8
[Parser] Migrate response api streaming to unified parser ( #38755 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Andrew Xia <axia@meta.com >
2026-04-08 10:09:00 +08:00
Flora Feng
9ea7d670d8
[Bugfix] Fix Qwen3 tool parser for Responses API tools ( #38848 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-04-08 10:08:51 +08:00
Varun Sundar Rabindranath
7b80cd8ac3
[Docs] Add Phi-4-reasoning-vision to supported models + examples ( #39232 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2026-04-08 02:02:26 +00:00
Andrey Talman
2111997f96
[release 2.11] Update to torch 2.11 ( #34644 )
2026-04-07 18:55:48 -07:00
Flora Feng
5af684c319
[CI] Add reasoning parser tests to CI ( #37025 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-04-08 00:57:36 +00:00
Md. Mekayel Anik
d521dcdbcc
docs: clarify SMT and OMP acronyms in CpuPlatform ( #39085 )
2026-04-07 17:42:07 -07:00
Giancarlo Delfin
5daf62271d
[Model Runner V2] Fuse probabilistic rejection sample kernels ( #38496 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-04-07 17:37:37 -07:00
zofia
ad3304425b
[XPU] add xpu backend implementation of mxfp8 quant ( #38682 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-08 08:30:35 +08:00
Lucas Wilkinson
70406eb1dc
[Attention][V0 Deprecation] Deprecate accept output buffer ( #39125 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-07 17:14:58 -04:00
Yubo Wang
08bfedc152
[Bugfix] Fix extract_hidden_states crash with quantized KV cache dtype ( #39160 )
...
Signed-off-by: Yubo Wang <yubowang2019@gmail.com >
2026-04-07 11:18:33 -07:00
Flora Feng
0102bd2f4c
[Parser] Pass request.tools to tool parser ( #38860 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-04-08 01:36:21 +08:00
rasmith
83d09d36b5
[CI][Bugfix][AMD] Ensure weights created when using emulating OCP MXFP4 ( #36993 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-04-08 00:37:16 +08:00
Chendi.Xue
92b9afeecd
[XPU] Quick fix for TritonMLA to remove cuda hardcode ( #39088 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-08 00:17:58 +08:00
Jinzhen Lin
7310555482
[Bugfix] Fix marlin nvfp4 rescaling ( #37502 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2026-04-07 08:57:17 -07:00
ibifrost
96b5004b71
[KVConnector] Support 3FS KVConnector ( #37636 )
...
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com >
Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2026-04-07 15:46:00 +00:00
kkyyxhll
98e1a43af7
[Bugfix][Quantization] Fix PerTensorScale loading with tuple shard_id in MergedColumnParallelLinear ( #38517 )
...
Signed-off-by: loukang <loukang@xiaohongshu.com >
2026-04-07 11:16:26 -04:00
maobaolong
729eb59f60
[KVConnector]: prioritize external connector over internal registry ( #38301 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-04-07 15:03:11 +00:00
Ilya Boytsov
6e1100889e
fix(test): recompute Jina ColBERT rotary inv_freq cleared by transformers v5 weight loader ( #39176 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
2026-04-07 22:40:55 +08:00
Harry Mellor
edcc37a8ce
Fix Mistral yarn warning in Transformers v5 ( #37292 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
2026-04-07 13:23:33 +00:00
Harry Mellor
79df4a794d
Automatically add links to API docs for matching strings in docs ( #37434 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-04-07 21:21:18 +08:00
Ronen Schaffer
7c139ab23f
[KV Offload] Clean up ARC/LRU refactoring leftovers: group ARC tests and fix stale comment ( #38217 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-04-07 15:14:45 +03:00
Wei Zhao
0be9516ea4
[Bug] Fix Trtllm Fp8 MoE Weight Shuffle Memory Fragmentation ( #39054 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-04-07 08:04:08 -04:00
Kyle Mylonakis
7b9de7c892
[Bugfix] Correct mistake in chained comparison in static assert logic ( #38699 )
...
Signed-off-by: Kyle Mylonakis <kyle@protopia.ai >
2026-04-07 18:24:39 +08:00
Rohan Potdar
dd9342e6bc
only patch runtime_env for torch >= 2.10 ( #38763 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-04-07 09:29:23 +00:00
Jiangyun Zhu
8060bb0333
[vLLM IR] rework gemma_rms_norm ( #39014 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-04-07 01:37:00 -07:00
Rishapveer Singh
da4c0e4db9
[Model] Use AutoWeightsLoader for FalconH1 ( #39092 )
...
Signed-off-by: Rishapveer Singh <215205492+rishaps@users.noreply.github.com >
2026-04-07 16:25:17 +08:00
Netanel Haber
a9a0e0551f
nano-nemotron-vl: get_mm_max_tokens_per_item for audio, video, image == seq_len ( #38727 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-04-07 00:23:29 -07:00
Andrew Barnes
5c35517a3e
[ROCm] Remove unused IS_FNUZ parameter from reshape_and_cache_shuffle_kernel ( #39123 )
...
Signed-off-by: Bortlesboat <bortstheboat@gmail.com >
2026-04-07 07:17:59 +00:00
Andreas Karatzas
a435e3108d
[ROCm][CI] Fix test repo-root assumptions ( #39053 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-07 13:36:21 +08:00
Andreas Karatzas
2df2c85be4
[Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path ( #38504 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-07 10:57:09 +08:00
Nick Hill
62095e82c1
[BugFix][MRV2] Fix cuda event reuse race ( #39115 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-04-07 00:21:09 +00:00
bnellnm
b2b2c5239e
[MoE Refactor] Split up compressed_tensors_moe.py ( #38960 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-04-06 20:07:54 -04:00
fxmarty-amd
00d7b497b3
[NVFP4] Support NVFP4 dense models from modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation ( #35733 )
...
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
Signed-off-by: fxmarty-amd <felmarty@amd.com >
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com >
2026-04-06 16:18:27 -06:00
Matthew Bonanni
9c81f35b1a
[Attention][MLA] Re-enable FA4 as default MLA prefill backend ( #38819 )
2026-04-06 17:51:46 -04:00
Woosuk Kwon
f186cfe75e
[MRV2] Fix hanging issue with DeepSeek V3.2 by setting skip_attn=False ( #39098 )
...
Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-04-06 12:55:13 -07:00
Netanel Haber
dfa5062a8f
NemotronH default mamba_ssm_cache_dtype=float32; enable auto-hook for NemotronHNanoVLV2Config ( #39032 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-04-06 19:47:46 +00:00
Yongye Zhu
e8ebbdde83
[Quantization] Add FlashInfer CuteDSL batched experts backend for NVFP4 MoE ( #38251 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-04-06 11:57:53 -07:00
namgyu-youn
94fbb09894
[EASY] Drop duplicate KV-cache initialization ( #38799 )
...
Signed-off-by: namgyu-youn <namgyu.dev@gmail.com >
2026-04-06 18:05:39 +00:00
Wentao Ye
419e73cdfa
[Bug] Fix mistral version dependency ( #39086 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-06 13:31:19 -04:00
bnellnm
f01482408c
[MoE Refactor][Test] FusedMoE layer test ( #24675 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-06 17:17:23 +00:00
zhanqiuhu
bfdc0a3a99
[NIXL][Mamba][3/N] Heterogeneous TP: 3-read conv state transfer ( #37635 )
2026-04-06 19:07:02 +02:00
bnellnm
93bada494f
[MoE Refactor] Split of DefaultMoERunner class ( #35326 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-06 12:41:59 -04:00
Frederik Gossen
608914de30
[Core] Re-enable Inductor pre-grad passes in standalone compile (torch>=2.12) ( #38944 )
...
Signed-off-by: Frederik Gossen <frgossen@meta.com >
2026-04-06 09:37:13 -07:00
Wentao Ye
4ae218c122
[Refactor] Remove unused dead code ( #38842 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-06 11:52:05 -04:00
Lukas Geiger
f40d9879f2
[Models][GDN] Remove GPU/CPU syncs in GDNAttentionMetadata.build during speculative decoding ( #38047 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2026-04-06 15:39:37 +00:00
Lucas Wilkinson
47e605092b
[Gemma4] Enable Fast Prefill Optimization ( #38879 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-06 11:19:39 -04:00
Walter Beller-Morales
e69a265135
[Feat][Core] safely abort requests when FSM fails to advance ( #38663 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-04-06 08:00:16 -07:00
Julien Denize
fef56c1855
[Mistral Grammar] Support Grammar Factory ( #38150 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-04-06 10:28:51 -04:00
bhargav-patel-29
c5e3454e5a
[Model] Add support for BharatGen's Param2MoE model ( #38000 )
...
Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-06 16:19:56 +08:00
liuchenbing2026
f6983f01de
MiniMax-M2: add Eagle3 speculative decoding support ( #37512 )
...
Signed-off-by: liuchenbing <chenliumail@163.com >
Signed-off-by: liucb <liuchengbao_work@163.com >
Co-authored-by: liuchenbing <chenliumail@163.com >
2026-04-05 19:50:18 -07:00
Andreas Karatzas
780ba37458
[ROCm][Quantization] Add asymmetric INT8 quantization support to TritonInt8ScaledMMLinearKernel ( #38501 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-06 09:42:10 +08:00
Micah Williamson
9570654c6d
[ROCm][CI] Run Kernels Core Operation Test On MI325 and mitigate flakiness ( #38184 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-04-06 09:42:02 +08:00
Netanel Haber
d56e952239
nano_nemotron_vl: fix tensor device mismatch exception when video profiling ( #39029 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-04-05 22:23:45 +00:00
Kevin H. Luu
56de443db1
[ci] Switch some CI jobs to H200 MIG slices ( #38956 )
2026-04-05 13:26:11 -07:00
Greg Pereira
4dd49b06f8
[Bug] Fix Import paths for encoder_cudagraph modules ( #38997 )
...
Signed-off-by: greg pereira <grpereir@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-05 19:11:58 +00:00
Greg Pereira
f53fa26e05
[Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters ( #38992 )
...
Signed-off-by: greg pereira <grpereir@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-05 17:11:18 +00:00
Wei Zhao
1af6f78ae5
[Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout ( #38993 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-05 10:54:31 -04:00
Martin Vit
228023b3a5
[Bugfix][MoE] Fix 6-8% decode regression: prefer multi-stream shared expert overlap ( #38990 )
...
Signed-off-by: Martin Vit <martin@voipmonitor.org >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-05 10:28:31 -04:00
Aaron Batilo
9a528260ef
[Bugfix][Spec Decode] Fix extract_hidden_states for VLM models ( #38987 )
...
Signed-off-by: Aaron Batilo <abatilo@coreweave.com >
2026-04-05 02:41:54 -07:00
Robert Shaw
968ed02ace
[Quantization][Deprecation] Remove Petit NVFP4 ( #32694 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-04-05 00:07:45 +00:00
Robert Shaw
7d266abb22
Revert "[vLLM IR] gemma_rms_norm" ( #38998 )
2026-04-04 17:48:08 -04:00
Xiaoshuang Wang
156405d243
[vLLM IR] gemma_rms_norm ( #38780 )
...
Signed-off-by: Icey <1790571317@qq.com >
2026-04-04 13:55:52 -04:00
Artem Perevedentsev
99e5539a67
[Perf][GDN] Align TMA usage with upstream FLA ( #38981 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-05 00:38:02 +08:00
Linkun
a88ce94bbb
[IR][RmsNorm] pass None if not has_weight ( #38961 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-04-04 11:02:30 -04:00
Ziming Qi
2a36d8fb72
[Bugfix][CPU] Fix macOS compatibility broken by #36487 ( #38970 )
...
Signed-off-by: Ziming (2imi9) <148090931+2imi9@users.noreply.github.com >
2026-04-04 14:05:58 +00:00
lalit10
93726b2a1c
Refactor Arctic loading to use AutoWeightsLoader ( #38955 )
...
Signed-off-by: Lalit Laxminarayan Bangad <lalitbangad@gmail.com >
Co-authored-by: Lalit Laxminarayan Bangad <lalitbangad@meta.com >
2026-04-04 05:01:09 +00:00
Yongye Zhu
8617f8676b
[Bugfix] Fix DSV32 weight loading ( #38870 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2026-04-03 19:57:52 -07:00
Andreas Karatzas
06fd9ffcc4
[ROCm][CI] Fix ROCm Dockerfile conftest generation for older Docker parsers ( #38959 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-04 10:41:41 +08:00
Wentao Ye
cab4064cd5
[Bug] Fix workspace manager _current_workspaces size ( #38853 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-04 01:29:45 +00:00
Wentao Ye
062f1a2d70
[Bug] Fix compile error for swap_blocks_batch in CUDA 13 ( #38915 )
2026-04-03 16:56:38 -07:00
elenalil-aws
81994e1d0e
[Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… ( #38927 )
...
Signed-off-by: elenalil-aws <elenalil@amazon.com >
2026-04-03 23:30:09 +00:00
Andreas Karatzas
4b506ff90a
[ROCm][CI] Minor missing import patch ( #38951 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-03 23:01:20 +00:00
Andreas Karatzas
5875bb2e9c
[ROCm][CI] Added back missing common deps ( #38937 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-03 15:58:57 -07:00
Kevin H. Luu
f0d3ad9f3e
[ci] Remove soft fail for AMD image build job ( #38941 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-04-03 20:42:33 +00:00
Divin Honnappa
121ea5a21f
Removed GPU state confirmation and cleanup steps. ( #38238 )
...
Signed-off-by: Divin Honnappa <divin.honnappa@amd.com >
2026-04-03 13:11:08 -07:00
Jeffrey Wang
ab79863e6c
Remove MQ multi-node tests ( #38934 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-04-03 20:00:08 +00:00
Nick Hill
5f1de2b14b
[Model Runner V2] Add config validation for not-yet-supported features ( #38758 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-04-03 12:08:08 -07:00
yzong-rh
a5a623d961
[Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts ( #38859 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-04-04 01:48:17 +08:00
Xiaoshuang Wang
f8c3af2d85
[vLLM IR] add import_ir_kernels() to support OOT platforms ( #38807 )
...
Signed-off-by: Icey <1790571317@qq.com >
2026-04-03 17:25:19 +00:00
danisereb
50cd5674b3
Fix invalid logprobs with MTP enabled and sync scheduling ( #38711 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-04-03 12:24:37 -04:00
Vasiliy Kuznetsov
7b1a7423be
[Frontend] new online quantization frontend ( #38138 )
...
Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com >
2026-04-03 11:58:39 -04:00
Nicolò Lucchesi
97f92c6b47
[KVConnector] Skip register_kv_caches on profiling ( #38558 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-04-03 15:40:16 +00:00
Yusuf Mohammad
46f02e00f2
[Bugfix] Fix AWQ models batch invariance issues ( #38670 )
...
Signed-off-by: yusuf <yusuf@deeplearningmachine.mynet >
Signed-off-by: <>
Co-authored-by: yusuf <yusuf@deeplearningmachine.mynet >
2026-04-03 14:54:15 +00:00
Qiming Zhang
6b4872240f
[XPU] bump up xpu-kernel v0.1.5, transpose moe weights ( #38342 )
...
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
Signed-off-by: Qiming Zhang <qiming1.zhang@intel.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-03 14:10:02 +00:00
Necofish
580090db6b
[Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM ( #38325 )
2026-04-03 15:49:59 +02:00
Artem Perevedentsev
cb10b7e80b
[GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill ( #38361 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
2026-04-03 13:38:02 +00:00
Mieszko Dziadowiec
bf8b022e60
[Intel][Triton] Support round_int8 for Intel backend ( #38825 )
...
Signed-off-by: Mieszko Dziadowiec <mdziadowiec@habana.ai >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-03 20:47:35 +08:00
xiangdong
40ee64c00e
[XPU][CI] Skip test_topp_only and test_topk_and_topp cases on Intel GPU in CI ( #38904 )
...
Signed-off-by: zengxian <xiangdong.zeng@intel.com >
2026-04-03 20:44:52 +08:00
wufann
1b117cb0ac
[ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8 ( #38615 )
...
Signed-off-by: wufann <36477220+wufann@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-03 03:54:00 -07:00
Anton Ivanov
abebd9323d
[CPU] Replace OMP initialization ( #36487 )
...
Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com >
2026-04-03 18:42:43 +08:00
Hyeonki Hong
25f2b55319
[Frontend] feat: add streaming support for token generation endpoint ( #37171 )
...
Signed-off-by: Hyeonki Hong <hyeonki.hong@moreh.io >
2026-04-03 10:20:32 +00:00
xiangdong
cb4ff07f8b
[XPU][CI] Skip test_topk_only cases on Intel GPU in CI ( #38899 )
...
Signed-off-by: zengxian <xiangdong.zeng@intel.com >
2026-04-03 09:50:41 +00:00
Gregory Shtrasberg
a7d79fa133
[ROCm][CI/Build] Fix the pytest hook to properly print out the summary ( #38585 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-04-03 17:24:26 +08:00
Netanel Haber
fa9e68022d
Fix Nano Nemotron VL regressions ( #38655 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-04-03 15:22:06 +08:00
Isotr0py
5506435419
[Misc] Clean up Gemma4 implementation ( #38872 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-04-03 05:47:02 +00:00
Yifan Qiao
311c981647
[MRV2][KVConnector] Fix missing build_connector_worker_meta ( #38698 )
...
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai >
2026-04-03 08:42:52 +03:00
Li, Jiang
21d7ecc5b0
[CI/Build] Add audio deps in Dockerfile.cpu ( #38876 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-04-03 05:05:14 +00:00
Aaron Hao
4729b90838
[Bug] Add e_score_correction_bias to SKIP_TENSORS ( #38746 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-04-02 21:15:05 -07:00
shunting314
8b141ed8c3
full cudagraph for flex-attn ( #36298 )
...
Signed-off-by: shunting314 <shunting@meta.com >
2026-04-02 21:15:01 -07:00
Varun Sundar Rabindranath
2ad7c0335f
[Model] Add Phi4ForCausalLMV for microsoft/Phi-4-reasoning-vision-15B ( #38306 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2026-04-02 21:14:57 -07:00
Bowen Bao
201d2ea5bf
[CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI ( #38664 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-04-03 04:05:45 +00:00
Bowen Bao
103f0de565
[ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle ( #38774 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-03 03:29:57 +00:00
wliao2
32e0c0bfa2
refactor hard coded device string in test files under tests/v1 and tests/lora ( #37566 )
...
Signed-off-by: Liao, Wei <wei.liao@intel.com >
2026-04-03 11:21:47 +08:00
Itay Etelis
4a06e1246e
[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync ( #38460 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-04-03 03:13:23 +00:00
Carl Y
3bc2734dd0
[Kernel] Fuse FP8 output quantization into merge_attn_states ( #36518 )
...
Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com >
2026-04-03 01:47:04 +00:00
Carl Y
1f5ec2889c
[mla] Support fused FP8/NVFP4 output quantization in MLA attention ( #35792 ) ( #36205 )
...
Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com >
Signed-off-by: Carl Y <4531192+carlyou@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-04-02 21:16:11 -04:00
Yan Ma
ee3cf45739
[XPU] Initial support for GDN attention on Qwen3-next/Qwen3.5 ( #33657 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-03 08:59:11 +08:00
Matthew Bonanni
05e68e1f81
[CI] Fix test_nixl_connector ( #38838 )
2026-04-02 17:52:13 -07:00
Vadim Gimpelson
771913e4a0
[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 ( #38832 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-04-03 04:45:57 +04:00
1096125073
71a9125c67
[New Model]: add support for telechat3 ( #38510 )
...
Signed-off-by: xiayongqiang <xiayq1@chinatelecom.cn >
Co-authored-by: xiayongqiang <xiayq1@chinatelecom.cn >
2026-04-03 08:26:22 +08:00
Nicolò Lucchesi
66e86f1dbd
[Kernel] Mamba support different layout for Conv state ( #37416 )
2026-04-03 01:50:09 +02:00
Michael
bb39382b2b
[Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter ( #38847 )
...
Signed-off-by: Michael Hospedales <hospedales@me.com >
2026-04-02 14:35:19 -07:00
zhanqiuhu
7b743ba953
[CI] Fix: pass string cache_dtype in test_register_kv_caches ( #38836 )
2026-04-02 19:42:09 +00:00
Stefano Castagnetta
188defbd0b
[CI] Add flashinfer.py to attention test source deps ( #38792 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-04-02 19:24:29 +00:00
Luciano Martins
08ed2b9688
feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) ( #38826 )
...
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com >
Signed-off-by: Luciano Martins <lucianomartins@google.com >
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2026-04-02 11:13:28 -07:00
Yanan Cao
ecd5443dbc
Bump helion dependency from 0.3.2 to 0.3.3 ( #38062 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-04-02 10:59:33 -07:00
Stefano Castagnetta
58262dec6e
[Bugfix] Fix test mocks after SM100 restriction in #38730 ( #38791 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-04-02 13:12:58 -04:00
Lucas Wilkinson
cb3935a8fc
[FA4] Update flash-attention to latest upstream FA4 ( #38690 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-02 17:02:37 +00:00
Bowen Bao
82a006beeb
[CI][ROCm] Add gpt-oss w4a8 in CI ( #38292 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-04-03 00:06:01 +08:00
wang.yuqi
a9b4f07ba2
[Frontend] Re-enable running MaxSim on GPU ( #38620 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-04-03 00:03:13 +08:00
Koushik Dutta
d9408ffba3
Triton MLA perf fixes ( #33529 )
...
Signed-off-by: Koushik Dutta <koushd@gmail.com >
Co-authored-by: root <root@ubuntu-nvidia.localdomain >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-04-02 09:40:01 -04:00
Yusuf Mohammad
16a65e4173
[Bugfix] Enable batch-invariant Triton matmul on all Ampere GPUs (SM 8x) ( #38427 )
...
Signed-off-by: yusuf <yusufmohammad@live.com >
Signed-off-by: yusuf <yusuf@deeplearningmachine.mynet >
Signed-off-by: Yusuf Mohammad <79484377+YM2132@users.noreply.github.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: yusuf <yusuf@deeplearningmachine.mynet >
2026-04-02 09:29:58 -04:00
bsliu
c0817e4d39
[Model] Add support for Cheers multimodal model ( #38788 )
...
Signed-off-by: bsliu <1187291748@qq.com >
Signed-off-by: 吴炳贤 <wubingxian24@mails.ucas.ac.cn >
2026-04-02 21:01:40 +08:00
Harry Mellor
dfe5e31689
Don't compile vision encoder for Transformers backend ( #30518 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-04-02 12:42:29 +00:00
JartX
2ce3d0ce36
[Feature] KV cache per-token-head INT8/FP8 quantization ( #38378 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: yangyang4991 <yangyang4991@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2026-04-02 08:13:26 -04:00
Jiangyun Zhu
4eefbf9609
[Perf] fuse kernels in gdn ( #37813 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-04-02 11:52:18 +00:00
vllmellm
551b3fb39f
[ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 ( #38086 )
...
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-04-02 08:13:42 +00:00
Li, Jiang
c6f722b93e
[CPU] Support gelu act in cpu_fused_moe ( #38770 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-04-02 14:14:32 +08:00
Xin Yang
9bd7231106
Revert "[Kernel] Add gpt-oss Router GEMM kernel ( #37205 )" ( #38778 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-04-01 22:02:32 -07:00
Yanan Cao
73f48ce559
[Kernel] [Helion] Use warning_once in get_gpu_name to prevent log spam ( #38743 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com >
2026-04-01 21:30:31 -07:00
Gregory Shtrasberg
3aab680e3e
[ROCm][Bugfix] Fix ROCm runtime failure due to missing symbol ( #38750 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2026-04-01 21:30:11 -07:00
Sergey Zinchenko
5a2d420c17
[Bugfix] Use dedicated MM processor cache in /tokenize to prevent sender-cache pollution ( #38545 )
...
Signed-off-by: Sergey Zinchenko <sergey.zinchenko.rnd@gmail.com >
2026-04-01 21:14:49 -07:00
Benjamin Chislett
5f96f9aff1
[Perf] DSV3.2 Indexer Fused Weights Projection ( #38684 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-04-02 03:34:49 +00:00
Luka Govedič
694449050f
Fix multiline-format string for python 3.10 ( #38739 )
...
Signed-off-by: Luka Govedic <luka.govedic@gmail.com >
2026-04-02 03:19:35 +00:00
Nick Hill
6241521dd2
[BugFix] Fix precommit breakage due to conflicting in-flight merges ( #38759 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-04-01 15:35:06 -07:00
Kevin H. Luu
1785dc5501
Revert "[Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params ( #37831 )" ( #38751 )
2026-04-02 06:34:28 +08:00
Chang Su
54500546ac
[Bugfix] Preserve original ImportError in gRPC server entrypoint ( #38673 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
2026-04-01 22:16:44 +00:00
Jeffrey Wang
de5e6c44c6
[Feat][Executor] Introduce RayExecutorV2 ( #36836 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-04-01 14:34:29 -07:00
yzong-rh
cb268e4e55
[Refactor] Simplify FutureWrapper in MultiprocExecutor ( #38644 )
...
Signed-off-by: Yifan <yzong@redhat.com >
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-04-01 21:28:26 +00:00
Stefano Castagnetta
6183cae1bd
[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang ( #38730 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
2026-04-01 12:08:40 -07:00
Monishver
c09ad767cd
Feature/silu block quant fusion v1 ( #32996 )
...
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com >
2026-04-01 18:50:43 +00:00
Wentao Ye
c9a9db0e02
[Compile] Fix nvfp4 compile warning ( #38573 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-01 18:28:57 +00:00
Chauncey
cbe7d18096
[Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str ( #38242 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-04-01 09:56:45 -07:00
Michael Goin
db5d0719e1
[Kernel] Add MXFP8 to Marlin GEMM/MoE and refactor Mxfp8LinearOp ( #34664 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-04-01 09:41:42 -07:00
yzong-rh
dc0428ebb8
[NIXL][BUG] Fix Triton heterogeneous TP ( #37940 )
...
Signed-off-by: Yifan <yzong@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-04-01 17:23:15 +02:00
Jesus Talavera
148c2072ec
Add ibm-granite/granite-vision-3.3-2b to supported models documentation ( #38714 )
...
Signed-off-by: Jesus Talavera <jesus.talavera@ibm.com >
2026-04-01 08:22:25 -07:00
majianhan
2f5c3c1ec0
[Misc] Fix docstring typo: buildin -> builtin ( #38722 )
...
Co-authored-by: majianhan <majianhan@kylinos.cn >
2026-04-01 07:39:46 -07:00
Fynn Schmitt-Ulms
fa246d5231
Fix shape comment in extract_hidden_states example ( #38723 )
...
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com >
2026-04-01 07:29:33 -07:00
bnellnm
7cf56a59a2
[MoE Refactor] Make SharedExperts class for use with DefaultMoERunner ( #35153 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-04-01 09:44:08 -04:00
Elvir Crnčević
5e30e9b9a9
[Bugfix] Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding" ( #38359 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-04-01 09:11:10 -04:00
손세정
582340f273
[Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params ( #37831 )
...
Signed-off-by: AAISSJ <maze0717@g.skku.edu >
Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local >
2026-04-01 20:22:29 +08:00
yjz
992368522f
[KVTransfer] Fix TpKVTopology.is_kv_replicated equality case ( #38179 )
...
Signed-off-by: JianDan0212 <zhangyj0212@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-04-01 12:41:49 +02:00
Juan Pérez de Algaba
58ee614221
(security) Enforce frame limit in VideoMediaIO ( #38636 )
...
Signed-off-by: jperezde <jperezde@redhat.com >
2026-04-01 10:23:45 +00:00
Harry Mellor
f9f6a9097a
Add verified label to trigger pre-commit ( #38708 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-04-01 02:31:02 -07:00
Zhanda Zhu
c75a313824
[Perf] triton bilinear_pos_embed kernel for ViT ( #37948 )
...
Signed-off-by: Zhanda Zhu <zhandazhu@gmail.com >
2026-04-01 01:52:02 -07:00
Lukas Geiger
4f6eed3bd4
[Core] Simplify multimodal masking ( #34246 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2026-04-01 01:18:22 -07:00
Li, Jiang
36d7f19897
[CPU] Support head_size 512 in cpu_attn ( #38676 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-04-01 05:42:27 +00:00
Jeffrey Wang
2d725b89c5
[Bugfix] Lazy import diskcache to avoid sqlite3/libstdc++ ImportError at startup ( #38649 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-04-01 05:31:20 +00:00
Augusto Yao
ef53395e2c
[bugfix] do not add extra linebreak for score/rerank with chat template ( #38617 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-04-01 04:50:07 +00:00
Lucas Wilkinson
eb47454987
[Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking ( #36178 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-01 00:15:53 -04:00
Matthew Bonanni
116f4be405
[1/N][Cleanup] Standardize on use of is_quantized_kv_cache ( #38659 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-04-01 04:08:01 +00:00
Wentao Ye
7b01d97a22
[Perf] Optimize mean pooling using chunks and index_add, 5.9% E2E throughput improvement ( #38559 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-01 03:54:58 +00:00
HarshRathva
17b72fd1c8
Fix priority preemption regression test in scheduler ( #37051 )
...
Signed-off-by: HarshRathva <harshrathvaai@gmail.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-04-01 06:36:12 +03:00
Samu Tamminen
c49497726b
[ROCm][perf] Shuffle KV cache to use paged_attention_common ( #32914 )
...
Signed-off-by: Samu Tamminen <stammine@amd.com >
Co-authored-by: Tuukka Sarvi <tuukka.sarvi@amd.com >
2026-04-01 03:30:19 +00:00
Ben Browning
cb0b443274
[Misc] Add 20 regression tests for 11 tool parser bug fixes ( #38172 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-04-01 03:00:31 +00:00
Luka Govedič
40bb175027
[vLLM IR] 1/N Implement IR skeleton and rms_norm op ( #33825 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Signed-off-by: Luka Govedic <luka.govedic@gmail.com >
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com >
Co-authored-by: Luka Govedič <ProExpertProg@h100-01.nemg-001.lab.rdu2.dc.redhat.com >
2026-03-31 22:15:05 -04:00
Elvir Crnčević
0fab52f0aa
Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor ( #38148 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-31 19:14:59 -07:00
Yifan Qiao
91e4521f9f
[Feat][v1] Simple yet General CPU KV Cache Offloading ( #37160 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai >
2026-03-31 17:58:37 -07:00
Stig-Arne Grönroos
31a719bcd3
[ROCm][perf] fix Aiter sparse MLA with MTP>1 ( #37887 )
...
Signed-off-by: Stig-Arne Grönroos <stig-arne.gronroos@amd.com >
Signed-off-by: Stig-Arne Grönroos <sgronroo@amd.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-31 19:22:23 -04:00
Vedant V Jhaveri
2e56975657
Generative Scoring ( #34539 )
...
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com >
Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-31 16:02:11 -07:00
Chang Su
36f1dc19ae
feat(grpc): add periodic stats logging and servicer log forwarding ( #38333 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
2026-03-31 15:50:07 -07:00
Asaf Gardin
3dc01ef352
[Quantization] Consolidate dummy format logic into DummyModelLoader ( #38637 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-03-31 22:20:45 +00:00
Yanan Cao
cc671cb110
[Kernel] [Helion] [17/N] Add Helion kernel torch.compile support ( #38592 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com >
2026-03-31 17:06:42 -04:00
Wentao Ye
856589ed9a
[Refactor] Remove dead code in kv connector and model runner ( #38383 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-31 17:05:23 -04:00
czhu-cohere
517b769b58
[Perf] Fix DBO overlap: capture DeepEP event before yield ( #38451 )
...
Signed-off-by: root <conway.zhu@cohere.com >
2026-03-31 20:38:59 +00:00
yzong-rh
d9b90a07ac
[MoE Refactor] Migrate Unquantized to Full Oracle Flow ( #36286 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: yzong-rh <yzong@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-31 15:43:33 -04:00
Olya Kozlova
598190aac3
[fix] Remove trtllm ragged mla prefills ( #36540 )
...
Signed-off-by: Olya Kozlova <okozlova@nvidia.com >
2026-03-31 12:30:27 -07:00
Xu Jinyang
b779eb3363
[Model] Sync upstream BT=chunk_size fix for GDN chunk_fwd_kernel_o, simplify warmup to single pass ( #38343 )
...
Signed-off-by: AuYang <459461160@qq.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
2026-03-31 23:03:24 +04:00
BadrBasowid
077a9a8e37
[torch.compile] Refactor Attention Quant Fusion Pass and Remove Boilerplate ( #37373 )
...
Signed-off-by: BadrBasowid <badr.basowid@gmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-31 14:15:50 -04:00
Run Yu
07edd551cc
[CI/Build] Resolve a dependency deadlock when installing the test dependencies used in CI ( #37766 )
...
Signed-off-by: Run Yu <yurun00@gmail.com >
2026-03-31 18:05:14 +00:00
mikaylagawarecki
7c080dd3c5
[4/n] Migrate FP4/W4A8 CUTLASS kernels to torch stable ABI ( #37503 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-31 10:21:13 -07:00
Yi Liu
0dd25a44ea
[Quantization][Autoround][XPU] Add W4A16 Support ( #37986 )
...
Signed-off-by: yiliu30 <yi4.liu@intel.com >
2026-03-31 16:48:24 +00:00
SandishKumarHN
3896e021a0
[Bugfix] Fix FusedMoE weight loading with padded hidden dimensions ( #37010 )
...
Signed-off-by: SandishKumarHN <sandish@fb.com >
2026-03-31 12:22:26 -04:00
zhang-prog
b6e636c12c
[Fix] handle PaddleOCR-VL image processor max_pixels across Transformers v4/v5 ( #38629 )
...
Signed-off-by: zhangyue66 <zhangyue66@baidu.com >
2026-03-31 15:50:41 +00:00
Jingu Kang
f1ff50c86c
[Bugfix] clamp dA_cumsum differences to prevent Inf in Mamba2 SSD kernels ( #37501 )
...
Signed-off-by: Jingu Kang <jg.k@navercorp.com >
2026-03-31 17:35:51 +02:00
Matthew Bonanni
757068dc65
[Bugfix][Async] Fix async spec decoding with hybrid models ( #38556 )
...
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com >
2026-03-31 11:08:54 -04:00
Nicolò Lucchesi
7337ff7f03
[Docs] PD with Nixl compat matrix ( #38628 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-31 15:01:21 +00:00
Kyle Sayers
5869f69c5f
[Online Quant] [QeRL] Minor code cleanup ( #38574 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-31 14:56:43 +00:00
wliao2
4dfad17ed1
replace cuda_device_count_stateless() to current_platform.device_count() ( #37841 )
...
Signed-off-by: Liao, Wei <wei.liao@intel.com >
Signed-off-by: wliao2 <wei.liao@intel.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-31 22:32:54 +08:00
wenjun liu
e8057c00bc
[CI] Avoid concurrent docker pull in intel XPU CI runners to prevent rate limit issues ( #38594 )
...
Signed-off-by: wendyliu235 <wenjun.liu@intel.com >
2026-03-31 22:23:18 +08:00
Nicolò Lucchesi
7430389669
[Bugfix][CI] Skip flaky test_eagle test ( #38566 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-31 09:42:37 -04:00
ElizaWszola
202f147cf2
Fix MLA runs when use_inductor_graph_partition=True ( #38631 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2026-03-31 13:37:43 +00:00
Jiangyun Zhu
ea7bfde6e4
[CI] fix LM Eval Qwen3.5 Models (B200) ( #38632 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-31 13:20:08 +00:00
sihao_li
d71a15041f
[XPU]move testing dependencies from Dockerfile to xpu-test.in ( #38596 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-31 12:49:43 +00:00
Ilya Markov
abdbb68386
[EPLB] Add alternative communication for EPLB weight exchange ( #33176 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Markov Ilya <markovilya19@gmail.com >
Co-authored-by: Markov Ilya <markovilya19@gmail.com >
2026-03-31 08:17:12 -04:00
liuzhenwei
0c63739135
[EPD] update EPD script arguments ( #36742 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
2026-03-31 12:02:09 +00:00
wang.yuqi
719735d6c5
[CI Failure] pin colmodernvbert revision ( #38612 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-31 10:54:54 +00:00
Maosheng Liao
aae3e688f8
Fix document of torchrun_example.py ( #31113 )
2026-03-31 10:54:23 +00:00
Matthew Bonanni
7d65463528
[WIP][CI][Bugfix] Fix test_run_eagle_dp ( #38584 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-31 12:30:25 +02:00
Mateusz Sokół
8278825b57
DOC: TPU mention fix ( #38129 )
...
Signed-off-by: Mateusz Sokół <mat646@gmail.com >
2026-03-31 03:27:56 -07:00
Chang Su
acf7292bf2
[Misc] Move --grpc CLI argument into make_arg_parser ( #38570 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
2026-03-31 03:24:05 -07:00
Chauncey
ce884756f0
[Feature]: add presence_penalty and frequency_penalty fields to Responses API ( #38613 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-31 08:45:57 +00:00
wang.yuqi
d9d21eb8e3
[Frontend][3/n] Improve pooling entrypoints | scoring. ( #28631 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-31 07:52:00 +00:00
Yintong Lu
f09daea261
[CPU] Support int8 compute mode in CPU AWQ ( #35697 )
...
Signed-off-by: Yintong Lu <yintong.lu@intel.com >
2026-03-31 15:27:37 +08:00
Kevin H. Luu
42318c840b
[ci] Remove benchmarks job ( #38611 )
2026-03-31 06:46:21 +00:00
zhangyiming
1ac6694297
[OOT] Add OOT support for linear kernel. ( #37989 )
...
Signed-off-by: menogrey <1299267905@qq.com >
2026-03-31 14:33:21 +08:00
Kfir Toledo
6cc7abdc66
[kv_offload+HMA] Fix num_blocks with different per-layer page sizes and improve assert message ( #38554 )
...
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-31 06:00:40 +00:00
Flora Feng
d53cb9cb8e
[Tool Parser][2/3] Use self.tools instead of request.tools in tool parsers ( #38189 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-31 13:41:36 +08:00
Louie Tsai
44eef0ca1e
vLLM Benchmark Suite perf regression after PR#32723 ( #38576 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-03-31 05:23:17 +00:00
Andreas Karatzas
b9cdc85207
[ROCm][CI] Fix Whisper translation test attention backend selection ( #38508 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-31 13:21:49 +08:00
Flora Feng
3e802e8786
[Mypy] Fix adjust_request typing ( #38264 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-31 04:21:18 +00:00
Martin Hickey
350af48e14
[KVConnector] Remove redundant method KVConnectorOutput::merge() ( #38546 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-31 07:11:02 +03:00
Lucas Kabela
e31915063d
[Bugfix] Fix for builtins (forward fix of pytorch/177558) ( #37234 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-31 01:08:11 +00:00
Flora Feng
29e48707e8
[Refactor] Consolidate Tool type alias in tool_parsers/utils.py ( #38265 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-31 00:55:51 +00:00
sungsoo ha
4ac227222f
[Bugfix][DCP] Fix CUDA graph capture for Decode Context Parallelism ( #36070 )
...
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-30 20:20:43 -04:00
Vadim Gimpelson
bb51d5b40d
Add @vadiklyutiy as committer ( #38589 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-31 07:50:04 +08:00
Prathmesh Bhatt
93b3ec1585
feat(attention): extract KV-cache update from FlashAttentionDiffKV ba… ( #36466 )
...
Signed-off-by: Prathmesh Bhatt <71340361+Prathmesh234@users.noreply.github.com >
2026-03-30 23:16:09 +00:00
Netanel Haber
e812bf70bd
Restore non-hf processor path for Nano-Nemotron-VL (bypass call_hf_processor_mm_only) - fixes #38018 ( #38567 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
2026-03-30 21:56:52 +00:00
SandishKumarHN
bcc6f67447
[Bugfix] Use null block (0) for padded block table entries ( #35431 )
...
Signed-off-by: SandishKumarHN <sandish@fb.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-30 14:02:51 -07:00
Asaf Gardin
1fc69f59bb
[Bug fix][Quantization] Fix dummy weight loading ( #38478 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-03-30 16:38:02 -04:00
Micah Williamson
d9c7db18da
[ROCm][CI] Pin test_hybrid test to TRITON_ATTN on ROCm ( #38381 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-30 20:26:46 +00:00
Ilya Markov
12701e8af2
[EPLB] Optimize eplb mapping and record in router for prefill ( #36261 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-03-30 19:48:33 +00:00
Benjamin Chislett
494636b29d
[Feat][Spec Decode] DFlash ( #36847 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-30 15:03:15 -04:00
mikaylagawarecki
ab1a6a43fa
[3/n] Migrate cutlass/scaled_mm_entry.cu torch stable ABI ( #37221 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-30 11:20:13 -07:00
fangyuchu
b5e608258e
[Refactor] Unify engine process monitoring in engine manager and add Ray backend support ( #35862 )
...
Signed-off-by: fangyuchu <fangyuchu@qq.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-30 10:16:09 -07:00
Matthew Bonanni
2c734ed0e0
[Bugfix][MLA] Change default SM100 MLA prefill backend back to TRT-LLM ( #38562 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-30 09:51:24 -07:00
Chendi.Xue
3b1dbaad4e
[HMA]Fix corner case when hybrid page_size can not be evenly divided issue (blk_size=64,tp=4) ( #37467 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-30 16:47:30 +00:00
Johnny
b4a2f3ac36
[NVIDIA] Bugfix NVFP4 DGX Spark and RTX50 ( #38423 )
...
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: Johnny <johnnynuca14@gmail.com >
2026-03-30 09:36:18 -07:00
roikoren755
8e6293e838
[Mamba] Add stochastic rounding support ( #35753 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-30 12:33:49 -04:00
Hongxia Yang
dbdd9ae067
[ROCm][Bugfix] fix exception related to trust_remote_code for MiniMax-M2.1-MXFP4 ( #37698 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-30 15:49:23 +00:00
Matthias Gehre
e8b055a5ac
[Bugfix] Handle ParallelLMHead in compressed-tensors get_quant_method ( #37291 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-30 07:30:52 -07:00
tomeras91
246dc7d864
[Misc] Add @tomeras91 as a maintainer of Nemotron related code + mamba block ( #38547 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-30 21:12:17 +08:00
Thomas Parnell
7c3f88b2a8
[Bugfix] Remove false-positive format mismatch warnings in FLA ops ( #38255 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-03-30 12:32:26 +00:00
Li, Jiang
6557f4937f
[Bugfix][CPU] Skip set_num_threads after thread binding ( #38535 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-30 20:13:00 +08:00
Andreas Karatzas
677424c7ac
[Core][CI] Add opt-in media URL caching via VLLM_MEDIA_CACHE ( #37123 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-30 04:58:53 -07:00
Collin McCarthy
1031c84c36
Fix ambiguous num_blocks for hybrid attn mamba ( #37236 )
...
Signed-off-by: Collin McCarthy <cmccarthy@nvidia.com >
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-30 11:09:45 +00:00
aliialsaeedii
7e76af14fa
[Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 ( #38253 )
...
Signed-off-by: aliialsaeedii <ali.al-saeedi@nscale.com >
2026-03-30 10:26:46 +00:00
yzong-rh
3683fe6c06
[Bugfix] Fix shared-object aliasing in n>1 streaming with tool calls ( #38158 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
Signed-off-by: Yifan <yzong@redhat.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-30 10:12:13 +00:00
Nicolò Lucchesi
cc06b4e86b
[Mamba][Bugfix] Raise on insufficient cache blocks instead of silently capping cudagraph sizes ( #38270 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-30 09:41:50 +00:00
TJian
03ac6ca895
[ROCm] [DOC] Update the Documentation to include ROCm Nightly Wheel support ( #38457 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-30 02:25:46 -07:00
haosdent
a08b7733fd
[CI] Fix SPLADE pooler test broken by #38139 ( #38495 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-30 07:48:33 +00:00
Tan Pin Siang
85c0950b1f
[ROCm] Enable MORI EP for unquantized MoE with AITER backend ( #37529 )
...
Signed-off-by: Tan Pin Siang <pinsiang.tan@amd.com >
2026-03-30 15:19:33 +08:00
Juan Pérez de Algaba
57861ae48d
(security) Fix SSRF in batch runner download_bytes_from_url ( #38482 )
...
Signed-off-by: jperezde <jperezde@redhat.com >
2026-03-30 07:10:01 +00:00
Jee Jee Li
ac30a8311e
[Bugfix][Model] Fix PixtralForConditionalGeneration LoRA ( #36963 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-29 23:59:42 -07:00
PikaPikachu
63babd17f1
[Model][Quantization] Add GGUF support for MiniMax-M2.1 ( #36965 )
...
Signed-off-by: kangletian <Letian.Kang@amd.com >
2026-03-30 14:24:06 +08:00
Kevin H. Luu
fec5aeca12
[ci] Soft fail and disable retry for AMD build image job ( #38505 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-29 23:05:26 -07:00
Jaewon
d816834c1a
[MoE] Add RoutingMethodType.Simulated to TRT-LLM FP8/NVFP4 kernel allowlists ( #38329 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-29 22:53:43 -07:00
Roger Wang
92f0db57a8
[Misc] Always use forward_mulmat for Conv3d on newer versions of torch. ( #38487 )
2026-03-30 05:39:41 +00:00
Andreas Karatzas
bea23536f6
[CI] Add temperature=0.0, reduce max_tokens, and add debug prints to audio_in_video tests ( #38492 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-30 05:36:45 +00:00
Jiangyun Zhu
c133f33746
Add @ZJY0516 to CODEOWNERS ( #38497 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-29 21:10:00 -07:00
Stanislav Kirillov
a6db99ba02
[Bugfix] Support multi-type params parsing for DeepSeek v3.2 ( #33703 )
...
Signed-off-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-30 04:07:28 +00:00
Andreas Karatzas
4f2ed5fddb
[ROCm][CI] Enable hybrid chunked prefill test ( #38317 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-30 10:30:26 +08:00
Kyle Sayers
d28d86e8a3
[QeRL] Fix online quantized reloading ( #38442 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-29 14:56:41 -06:00
Wentao Ye
995dea1354
[Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement ( #38139 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-29 18:12:50 +00:00
allgather
8c0b6267d7
[Transformers v5] fix missing pixtral/voxtral multimodal dispatch ( #38410 )
...
Signed-off-by: allgather <all2allops@gmail.com >
2026-03-29 09:59:06 +00:00
Andreas Karatzas
43cc5138e5
[ROCm][CI] Fix cross-attention dispatch for encoder-decoder models ( #38450 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-28 22:08:03 -07:00
Shubhra Pandit
5b8c30d62b
[Spec Decode, BugFix] Propagate norm_before_fc from Eagle3 speculator ( #38111 )
...
Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com >
2026-03-29 00:42:06 +00:00
haosdent
d39b8daf5f
[Feature] Add Qwen3-ForcedAligner support via token classification pooling ( #35367 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-29 00:27:52 +00:00
Walter Beller-Morales
fafca38adc
[BugFix][Frontend] apply task instruction as system prompt in cohere v2/embed ( #38362 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-28 18:30:54 +00:00
Kunshang Ji
aa4eb0db78
[CI] Revert initialize_model context manager ( #38426 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-28 16:56:50 +00:00
Andreas Karatzas
af89140efc
[ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry ( #38415 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-29 00:47:42 +08:00
haosdent
b2bc736b12
[CI] Fix Ernie4.5-VL initialization test ( #38429 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-28 22:43:24 +08:00
whyiug
58c959a767
[Misc]: clean up non-core lint issues ( #37049 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
2026-03-28 10:28:16 -04:00
Bvicii
bda3eda82d
[Bugfix] Disallow renderer_num_workers > 1 with mm processor cache ( #38418 )
...
Signed-off-by: Bvicii <yizhanhuang2002@gmail.com >
2026-03-28 06:32:52 -07:00
Michael Goin
2bf5b70ae8
[CI Bugfix] Pre-download missing FlashInfer headers in Docker build ( #38391 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-28 06:09:00 -07:00
yzong-rh
6dad4c5722
[Test] Fix flaky race condition in test_abort_final_step ( #38414 )
...
Signed-off-by: Yifan <yzong@redhat.com >
2026-03-28 09:06:56 +00:00
Liwen
171775f306
Fix Device Index for ROCm Ray Workers in MoE Benchmark ( #38108 )
...
Signed-off-by: Liwen <53441624+li-liwen@users.noreply.github.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-28 08:27:11 +00:00
TJian
58a249bc61
[ROCm] [Release] Update ROCm variant from rocm700 to rocm721 ( #38413 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-28 06:07:03 +00:00
IriKa
148a5c1226
[Bugfix] Fix output NaN/Inf in marlin if dtype=float16 ( #33972 )
...
Signed-off-by: IriKa Qiu <qiujie.jq@gmail.com >
2026-03-27 16:36:08 -07:00
Wei Zhao
b69bf2f0b1
[Perf] Use torch compile to fuse pack topk in trtllm moe ( #37695 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com >
2026-03-27 17:30:46 -06:00
rongfu.leng
88149b635e
Add nvidia h800 moe config ( #31201 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2026-03-27 16:28:48 -07:00
Hongxia Yang
83a4df049d
[ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips ( #38367 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-27 23:20:19 +00:00
Gregory Shtrasberg
731285c939
[ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 ( #38252 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-03-27 18:03:12 -05:00
Johnny
97d19197bc
[NVIDIA] Fix DGX Spark logic ( #38126 )
...
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com >
Signed-off-by: guillaume_guy <guillaume.guy@airbnb.com >
Signed-off-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Andreas Karatzas <akaratza@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Sathish Sanjeevi <SKPsanjeevi@users.noreply.github.com >
Co-authored-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: guillaume_guy <guillaume.guy@airbnb.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-27 15:26:07 -07:00
Giancarlo Delfin
384e4d5f48
[Model Runner V2] Rebuild attention metadata before eagle decode full… ( #38311 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-27 13:46:42 -07:00
Nicolò Lucchesi
44a6528028
[CI] Skip failing test ( #38369 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-27 13:25:19 -07:00
Kyle Sayers
648edcf729
[QeRL] Compose online quantization with quantized reloading ( #38032 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-27 13:22:33 -07:00
Michael Goin
7ba425e916
Add short flag -sc for --speculative-config argument ( #38380 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-27 12:04:22 -07:00
Gregory Shtrasberg
b8665383df
[ROCm] Fix GPT-OSS import for triton 3.6 ( #37453 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-03-27 18:00:57 +00:00
Rohan Potdar
0e9358c11d
[ROCm]: gpt-oss fusion/padding fixes ( #38043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: Andreas Karatzas <akaratza@amd.com >
2026-03-27 12:19:15 -04:00
Harry Mellor
21d2b53f88
Remove need for explicit \n in docstring lists for --help formatting ( #38350 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 08:38:00 -07:00
Jonas M. Kübler
98e7f223b9
enable skipping of SW attention layers when using FP8 KV cache ( #33695 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2026-03-27 07:25:02 -06:00
Juan Pérez de Algaba
b111f8a61f
fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit ( #37952 )
...
Signed-off-by: jperezde <jperezde@redhat.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2026-03-27 09:02:10 -04:00
Sage Moore
497e234d38
[EPLB] Cleanup the transfer logic for the various eplb maps ( #34520 )
...
Signed-off-by: Sage Moore <sagmoore@redhat.com >
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-27 10:18:46 +01:00
dtc
6287e7fa20
[P/D] Mooncake: Add unit tests and minor fixes for mooncake connector ( #36946 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
2026-03-27 09:26:40 +01:00
Shengqi Chen
84e439a9cb
[CI/Build] Move nightly wheel index generation to a single post-build step ( #38322 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-27 07:44:18 +00:00
Yuichiro Utsumi
a1746ff9ec
[Doc] Clarify Helm chart location in deployment guide ( #38328 )
...
Signed-off-by: Yuichiro Utsumi <utsumi.yuichiro@fujitsu.com >
Signed-off-by: Yuichiro Utsumi <81412151+utsumi-fj@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 15:43:02 +08:00
Flora Feng
aee4c14689
[Bugfix] Fix Hermes tool parser when stream interval > 1 ( #38168 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-27 14:42:26 +08:00
Bowen Bao
0ae89f18fd
[Refactor] Move FusedMoE hidden_size roundup to quant_method ( #34285 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-03-26 23:38:26 -07:00
wenjun liu
c2b17d71af
[CI] Add xpu auto-label rule for Intel GPU/XPU PRs ( #38320 )
...
Signed-off-by: wendyliu235 <wenjun.liu@intel.com >
2026-03-27 14:22:38 +08:00
Li, Jiang
becaed6ec8
[CPU] Support CT W4A16 on CPU MP kernel ( #38219 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-27 14:15:28 +08:00
Xiaoshuang Wang
a8eab8f30d
[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 ( #37975 )
...
Signed-off-by: wxsIcey <1790571317@qq.com >
Signed-off-by: Icey <1790571317@qq.com >
2026-03-27 14:13:21 +08:00
cjackal
2babac0bed
[frontend] dump openai responses type by alias ( #38262 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-03-27 05:58:20 +00:00
Or Ozeri
7cc302dd87
[kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models ( #37853 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-27 08:38:33 +03:00
Bvicii
999dfc1622
[Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop ( #34789 )
...
Signed-off-by: Bvicii <yizhanhuang2002@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-26 22:17:00 -07:00
wenjun liu
d86060122a
[CI/Build] enable Intel XPU test flow with prebuilt image ( #37447 )
...
Signed-off-by: wendyliu235 <wenjun.liu@intel.com >
2026-03-26 18:16:04 -07:00
Harry Mellor
f73bcb1c51
Various Transformers v5 config fixes ( #38247 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-26 23:06:59 +00:00
yzong-rh
28048bd6b0
[Bugfix] Add missing f-string prefix in xgrammar choices error message ( #38162 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-03-26 21:43:03 +00:00
Giancarlo Delfin
c32e97602d
[Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling ( #38045 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-26 13:38:12 -07:00
Wei Zhao
0904b6550d
Fix multi-node allreduce fusion ( #38136 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: root <root@theia0053.lyris.clusters.nvidia.com >
2026-03-26 20:24:36 +00:00
Stig-Arne Grönroos
f26fcdfb9e
[Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module ( #37547 )
...
Signed-off-by: Stig-Arne Grönroos <stig-arne.gronroos@amd.com >
2026-03-26 19:01:05 +00:00
TJian
bc9c6fbbe6
[ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline ( #38263 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-26 18:47:10 +00:00
Andreas Karatzas
bff9a1c266
[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers ( #38165 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 18:33:45 +00:00
Andreas Karatzas
db01535e2b
[ROCm][CI] Add uv pip compile workflow for rocm-test.txt lockfile ( #37930 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 12:44:01 -05:00
jennyyyyzhen
a4cf9b22ba
[ROCM][Bugfix] Use correct stride in cp_mha_gather_cache_kernel for hybrid model ( #37228 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
Co-authored-by: yZhen <yZhen@fb.com >
2026-03-26 10:33:39 -07:00
Andreas Karatzas
9c3ae04bfe
[ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355 ( #38155 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 16:51:18 +00:00
Andreas Karatzas
a8e48a7b85
[CI] Fix conch kernel crash on 3D input by reshaping to 2D before GEMM ( #38178 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 11:46:03 -05:00
Divakar Verma
b9dbc5c4ab
[Mamba][APC] Add test case to compare apc outputs ( #34977 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-26 16:40:35 +00:00
TJian
60af7b967b
[Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm ( #37283 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-26 16:32:25 +00:00
Andreas Karatzas
bdc1719eb9
[ROCm][CI] Fix AITER state leak in shared_fused_moe_routed_transform test ( #38137 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 09:26:46 -07:00
haosdent
0aac2048bf
[Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode ( #35175 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-26 16:13:39 +00:00
Chuan (Richard) Li
cb2263218e
[Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos ( #35886 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-26 11:59:24 -04:00
Wentao Ye
e054f152fa
[CI] Add batch invariant test for b200 ( #38014 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-26 11:54:54 -04:00
zhang-prog
0f5b526040
[Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility ( #38232 )
...
Signed-off-by: zhangyue66 <zhangyue66@baidu.com >
2026-03-26 15:34:49 +00:00
Zhewen Li
be1a85b7a2
Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" ( #38050 ) ( #38169 )
...
Co-authored-by: Zhewen Li <zhewenli@inferact.ai >
2026-03-26 07:59:09 -07:00
Cyrus Leung
2e225f7bd2
[Renderer] Consolidate factory methods ( #38218 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 12:19:22 +00:00
Jared Wen
757eafcf37
[bug-fix] GLM OCR Patch Merger context_dim ( #37962 )
...
Signed-off-by: JaredforReal <w13431838023@gmail.com >
2026-03-26 05:11:21 -07:00
wang.yuqi
dcdc145893
[CI] Reorganize scoring tests ( #38207 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-26 12:07:01 +00:00
Andreas Karatzas
f2d16207c7
[ROCm][CI] Fix flaky GPTQ compile correctness test ( #38161 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 19:57:00 +08:00
Andreas Karatzas
37a83007fe
[ROCm][CI] Fix wvSplitKrc mock argument order in test_rocm_unquantized_gemm ( #38167 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 19:54:59 +08:00
Wentao Ye
bf5eec638d
[Refactor] Remove unused utils ( #38153 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-26 17:08:19 +08:00
Mateusz Sokół
b1cb1d3d2c
DOC: Documentation pages fixes ( #38125 )
...
Signed-off-by: Mateusz Sokół <mat646@gmail.com >
2026-03-26 16:55:42 +08:00
Kunshang Ji
6ae8bbd0c2
[XPU] Disable xpu graph by default ( #38193 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-26 01:53:45 -07:00
Cyrus Leung
a9213c0ffe
[Doc] Fix outdated reference to CUDAGraphManager ( #38209 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 01:52:38 -07:00
Cyrus Leung
502c41a8f6
[Model] Use helper function to run MM processors with token inputs (where applicable) ( #38018 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 16:44:04 +08:00
Vadim Gimpelson
52069012fe
[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell ( #38083 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-26 01:21:47 -07:00
Fadi Arafeh
71161e8b63
[cpu][ci] remove soft-fail for Arm CI and add quant model tests ( #37691 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-26 07:03:31 +00:00
Terry Gao
38de822310
[Model] Add torch.compile support for InternVL vision encoder ( #38049 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
2026-03-25 23:52:29 -07:00
Jee Jee Li
2bfbdca23c
[Bugfix] Fix benchmark_fused_collective.py ( #38082 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-25 23:51:00 -07:00
Matej Rojec
2908094567
Add /v1/chat/completions/batch endpoint for batched chat completions ( #38011 )
...
Signed-off-by: Matej Rojec <64556640+MatejRojec@users.noreply.github.com >
2026-03-26 12:13:33 +08:00
BadrBasowid
e6bf9f15ec
[Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format ( #38092 )
...
Signed-off-by: BadrBasowid <Badr.Basowid@gmail.com >
Signed-off-by: BadrBasowid <61441185+BadrBasowid@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-25 21:11:43 -07:00
Woosuk Kwon
144030c84e
Relocate Encoder CUDA graph manager ( #38116 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-25 20:52:12 -07:00
Flora Feng
e2db2b4234
[Tool Parser][1/3] Pass tools to ToolParser constructor ( #38029 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-26 10:29:06 +08:00
Chauncey
87f05d6880
[Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder ( #38076 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-26 01:43:51 +00:00
Andreas Karatzas
36f6aede23
[Misc] Optimized check to encapsulate both CUDA and ROCm platforms ( #34549 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 09:43:07 +08:00
Xin Yang
9704a5c310
Disable dual stream execution of input projection for Qwen3 ( #38152 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-26 01:20:39 +00:00
Wei Zhao
74056039b7
Fix minimax m2.5 nvfp4 kv scales weight loading ( #37214 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-26 00:48:06 +00:00
Jacob Platin
d7d51a7ee5
[Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU ( #37348 )
...
Signed-off-by: Jacob Platin <jacobplatin@google.com >
2026-03-26 00:46:01 +00:00
Harry Mellor
3c3c084240
Various Transformers v5 fixes ( #38127 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-26 00:10:08 +00:00
Ekagra Ranjan
7b54f60db0
[Cohere] Enable Cohere-Transcribe ( #38120 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-25 16:13:51 -07:00
Rohan Potdar
a0e8c74005
[ROCm]: Update rope+kvcache fusion conditions and disable custom op by default ( #36716 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-25 20:58:44 +00:00
Guillaume Guy
70a2152830
[MultiModal] add support for numpy array embeddings ( #38119 )
...
Signed-off-by: guillaume_guy <guillaume.guy@airbnb.com >
Signed-off-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: guillaume_guy <guillaume.guy@airbnb.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-25 20:13:04 +00:00
Sathish Sanjeevi
978fc18bf0
[ROCm] Utilize persistent MLA kernel from AITER ( #36574 )
...
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com >
2026-03-26 03:00:42 +08:00
Andreas Karatzas
7d6917bef5
[ROCm] Fix MoE kernel test failures on gfx950 ( #37833 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-03-25 13:46:40 -05:00
Mark McLoughlin
e38817fadb
[Core][KV Connector] Remove use of num_cached_tokens in error handling ( #38096 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-25 18:20:48 +00:00
Nick Hill
72cad44d3c
[Frontend] Move APIServerProcessManager target server fn ( #38115 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-25 18:14:41 +00:00
Cyrus Leung
ba2f0acc2d
[Misc] Reorganize inputs ( #35182 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-25 10:22:54 -07:00
Yongye Zhu
678b3c99e8
[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration ( #38050 )
2026-03-25 10:16:40 -07:00
mikaylagawarecki
bf4cc9ed2d
[2/n] Migrate per_token_group_quant to torch stable ABI ( #36058 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-25 10:15:13 -07:00
Ben Browning
1ac2ef2e53
[CI/Docs] Improve aarch64/DGX Spark support for dev setup ( #38057 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 09:24:42 -07:00
Richard Zou
6e37c46b35
[compile] Add some more startup tests for top models ( #38046 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-25 12:02:22 -04:00
Wentao Ye
1bf2ddd0ee
[Refactor] Rename WAITING_FOR_FSM to WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR ( #38048 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-25 11:41:44 -04:00
Necofish
e7221180e1
[Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM ( #37970 )
...
Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-25 08:20:04 -07:00
RobTand
4a76ad12e0
[Bugfix] Preserve CUDA arch suffix (a/f) for SM12x — fixes NVFP4 NaN on desktop Blackwell ( #37725 )
...
Signed-off-by: Rob Tand <robert.tand@icloud.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-03-25 08:18:25 -07:00
Wentao Ye
d7e93e13fb
[Feature] EPLB Support for GPU Model Runner v2 ( #37488 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-25 08:16:39 -07:00
Andrii Skliar
cd7643015e
[Feature] Support per-draft-model MoE backend via --speculative-config ( #37880 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Signed-off-by: [Andrii Skliar] <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-03-25 14:31:52 +00:00
Ben Browning
a1a2566447
[Docs] Add guide for editing agent instruction files ( #37819 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-25 13:54:09 +00:00
yjz
b745e8b5d3
[KVTransfer][Mooncake] Add heterogeneous TP support for disaggregated P/D in MooncakeConnector ( #36869 )
...
Signed-off-by: JianDan0212 <zhangyj0212@gmail.com >
2026-03-25 14:24:07 +01:00
Harry Mellor
d215d1efca
[Mypy] Better fixes for the mypy issues in vllm/config ( #37902 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 06:14:43 -07:00
Fadi Arafeh
34d317dcec
[CPU][UX][Perf] Enable tcmalloc by default ( #37607 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-25 20:39:57 +08:00
grYe99
7ac48fd357
[Model] Add AutoWeightsLoader support for jais ( #38074 )
...
Signed-off-by: grYe99 <guorongye99@gmail.com >
Co-authored-by: grYe99 <guorongye99@gmail.com >
2026-03-25 12:38:40 +00:00
Harry Mellor
d6bb2a9d9a
Fix Plamo 2/3 & LFM2 for Transformers v5 ( #38090 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 12:29:49 +00:00
Harry Mellor
1e673a43ce
Better weight tying check for multimodal models ( #38035 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 12:07:23 +00:00
Andreas Karatzas
04417ecd5f
[ROCm][CI] Rename filepath test to point to correct file ( #38102 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 20:05:46 +08:00
R0CKSTAR
242c93f744
[Docs] Adds vllm-musa to custom_op.md ( #37840 )
...
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
2026-03-25 11:54:36 +00:00
Matthias Gehre
a889b7f584
[Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 ( #37280 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-25 11:42:58 +00:00
Harry Mellor
ba2910f73a
Fix offline mode test for Transformers v5 ( #38095 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 11:39:48 +00:00
Andreas Karatzas
f262a62aa1
[ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test ( #37616 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 10:55:51 +00:00
Andreas Karatzas
9ac2fcafbb
[CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors ( #37483 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 11:24:33 +01:00
Kunshang Ji
e9ae3f8077
[Hardware][XPU] Align memory usage with cuda on xpu ( #37029 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-25 18:14:29 +08:00
Andreas Karatzas
04cec4f927
[ROCm][CI] Increase OpenAPI schema test timeouts ( #38088 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 18:06:58 +08:00
Kunshang Ji
14771f7150
[XPU] support MLA model on Intel GPU ( #37143 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-25 17:43:42 +08:00
Gregory Shtrasberg
189ddefbfd
[ROCm] Attention selector reordering ( #36702 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2026-03-25 17:42:56 +08:00
Chauncey
09c3dc9186
[Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function ( #37968 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-25 06:19:37 +00:00
vllmellm
42e9547976
[ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test ( #37640 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-25 05:06:15 +00:00
Chauncey
a32783bb35
[Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser ( #37958 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-25 12:06:21 +08:00
Baorun (Lauren) Mu
9d0351c91d
[Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc ( #37914 )
...
Signed-off-by: Baorun Mu <bmu@nvidia.com >
2026-03-24 19:53:24 -07:00
Artem Perevedentsev
a93a53f8a1
[Performance] Auto-enable prefetch on NFS with RAM guard ( #37673 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-24 17:31:14 -07:00
Andreas Karatzas
679c6a3ecc
[Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 ( #37787 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 08:17:33 +08:00
Andreas Karatzas
8bbb7c7f20
[ROCm][CI][PD] Add Hybrid SSM integration tests to CI ( #37924 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 07:58:39 +08:00
Kevin H. Luu
af945615b5
[release] Move the rest of release jobs to release queue ( #38044 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-24 16:40:58 -07:00
Terry Gao
82580b10ac
[Perf] Disable inductor runtime asserts by default for serving perfor… ( #37485 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
Co-authored-by: Tianren Gao <tianren@fb.com >
2026-03-24 19:37:51 -04:00
Netanel Haber
a0d487b2e1
nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths ( #37903 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-24 23:25:56 +00:00
Junhao
b73b5b0629
Make microbatch optimization (DBO) work with general models ( #37926 )
...
Signed-off-by: Junhao Li <junhao@ubicloud.com >
2026-03-24 14:40:08 -07:00
Michael Goin
0f0e03890e
[UX] Add flashinfer-cubin as CUDA default dep ( #37233 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-24 14:13:08 -07:00
Woosuk Kwon
4b53740d7f
[MRV2] Fix for DS v3.2 ( #38030 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-24 14:03:24 -07:00
Nick Hill
4e824d1c83
[Model Runner V2][Minor] Simplify PP logic ( #38031 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-24 13:57:17 -07:00
amey asgaonkar
0c1809c806
Add Ubuntu 24.04 support for Docker builds ( #35386 )
...
Signed-off-by: aasgaonkar <aasgaonkar@nvidia.com >
2026-03-24 13:34:44 -07:00
liangel-02
8c47fdfdb1
[FlexAttention] allow custom mask mod ( #37692 )
...
Signed-off-by: Angel Li <liangel@meta.com >
2026-03-24 16:03:24 -04:00
Javier De Jesus
54b0578ada
[Bugfix] Pass hf_token through config loading paths for gated model support ( #37920 )
...
Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com >
2026-03-24 15:22:05 -04:00
Richard Zou
89f572dbc0
[BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 ( #38015 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-24 19:08:26 +00:00
Richard Zou
71a4a2fbd0
[BugFix] Fix order of compile logging ( #38012 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-24 18:58:18 +00:00
Nick Cao
935c46dd9b
[Model] Add Granite 4.0 1B speech to supported models ( #38019 )
...
Signed-off-by: Nick Cao <ncao@redhat.com >
2026-03-24 18:23:41 +00:00
Willy Hardy
057fc94cbd
[Bugfix] Fix structured output crash on CPU due to pin_memory=True ( #37706 )
...
Signed-off-by: Willy Hardy <whardy@redhat.com >
Signed-off-by: Will Hardy <whardy@redhat.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-24 17:44:17 +00:00
Vineeta Tiwari
b58c5f28aa
docs: fix broken offline inference paths in documentation ( #37998 )
...
Signed-off-by: Vineeta Tiwari <vineeta.tiwari2@ibm.com >
Signed-off-by: Vineeta Tiwari <vineetatiwari2000@gmail.com >
Co-authored-by: Vineeta Tiwari <vineeta.tiwari2@ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-24 17:35:14 +00:00
Ming Yang
c07e2ca6e0
Fix Mamba state corruption from referencing stale block table entries ( #37728 )
2026-03-24 10:29:59 -07:00
Dhruv Singal
4df5fa7439
[Bugfix] Force continuous usage stats when CLI override is enabled ( #37923 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: OpenCode <noreply@openai.com >
2026-03-24 10:29:50 -07:00
sihao_li
a5416bc52e
[XPU] Support Intel XPU hardware information collection in usage stats ( #37964 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-03-24 10:29:17 -07:00
Harry Mellor
b3601da6e7
[Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) ( #37904 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 17:14:01 +00:00
Dan Blanaru
dc78c2c933
[Core] add option to schedule requests based on full ISL ( #37307 )
...
Signed-off-by: Dan Blanaru <48605845+DanBlanaru@users.noreply.github.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-24 13:01:12 -04:00
Sungjae Lee
4731884796
[Feature] limit thinking tokens (hard limit) ( #20859 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com >
Signed-off-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 09:53:07 -07:00
Harry Mellor
8de5261e69
Update new contributor message ( #37999 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 16:01:41 +00:00
wang.yuqi
1b6cb920e6
[Deprecate] Deprecate pooling multi task support. ( #37956 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-24 14:07:47 +00:00
Li, Jiang
352b90c4a4
[Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU ( #37987 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-24 07:00:20 -07:00
Sage
1c0aabdeb0
[Bugfix] Suppress spurious CPU KV cache warning in launch render ( #37911 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-24 12:36:18 +00:00
Ilya Markov
14acf429ac
[EPLB] Remove main waits in case of slow EPLB ( #36271 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-03-24 11:50:44 +00:00
Harry Mellor
ce57fd5557
[Docs] Fix build ( #37991 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 03:20:49 -07:00
Flora Feng
2e67fa756d
Fix tool_parser_cls type annotation from Callable to type[ToolParser] ( #37957 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-23 22:58:27 -07:00
Ronen Schaffer
e3c6c10cad
[KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package ( #37874 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-03-24 07:02:51 +02:00
jetxa
16a664df24
[Frontend][Bugfix] Pass default_chat_template_kwargs to AnthropicServingMessages ( #37899 )
...
Signed-off-by: jetxa <jetxzhang@outlook.com >
2026-03-24 05:00:12 +00:00
Kevin H. Luu
7281199a8c
[release] Move agent queue to Release cluster queues ( #37783 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-23 20:36:47 -07:00
Kevin H. Luu
b2dd75eb48
Downsize CPU jobs to use small queue ( #37913 )
...
Signed-off-by: khluu <khluu000@gmail.com >
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-23 20:36:37 -07:00
Wentao Ye
c59a132f96
[V0 Deprecation] Refactor kv cache from list to element ( #37487 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 20:10:11 -07:00
Andreas Karatzas
de99d91ece
[ROCm][CI] Split Entrypoints Integration (API Server 1) into 3 jobs ( #37906 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-24 09:48:37 +08:00
Wentao Ye
83c9d525b6
[CI] Add batch invariant test: Block FP8 + small MOE ( #37895 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 21:16:14 -04:00
Giancarlo Delfin
8f4824b664
[Model Runner V2] Gather multimodal embeddings before draft model postprocess ( #37932 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-23 18:14:13 -07:00
roikoren755
56777b5c89
[Test] E2E Nemotron-3-Super tests ( #36803 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-23 17:49:56 -07:00
Kevin H. Luu
2488a82f89
[CI] Split V1 Others into 3 separate jobs ( #37016 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-24 06:44:38 +08:00
Ranran
dc6908ac6a
[Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning ( #35007 )
...
Signed-off-by: Ranran <1012869439@qq.com >
Signed-off-by: Ranran <hzz5361@psu.edu >
Signed-off-by: ran <hzz5361@psu.edu >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-03-23 18:31:14 -04:00
yzong-rh
e85f8f0932
[Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts ( #36728 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-03-23 17:02:57 -04:00
Robert Shaw
5bf3c42d4c
[Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision ( #36725 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-23 20:19:06 +00:00
Kyle Sayers
38364a7e32
[Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels ( #36799 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-23 16:03:29 -04:00
Matthew Bonanni
fafe76b4af
[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding ( #32951 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
2026-03-23 15:37:22 -04:00
Woosuk Kwon
ffb5b32b5f
[MRV2] Consider spec decoding in warmup ( #37812 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-23 17:45:43 +00:00
Kunshang Ji
91fd695b75
[CI] split Entrypoints Integration (API Server 1) into 3 jobs ( #37882 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-23 10:37:56 -07:00
Nicolò Lucchesi
1cbbcfe8a3
[CI][PD] Add Hybrid SSM integration tests to CI ( #37657 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-23 23:58:19 +08:00
Angela Yi
aceadb5ee1
Use lazy graph module during split_module to defer recompile() ( #37609 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-03-23 11:21:29 -04:00
Yufeng He
ec2280611a
[Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding ( #37884 )
2026-03-23 15:15:12 +00:00
yanghui1-arch
7151ae6528
[Bugfix] RoBERTa position_id accumulation in CUDA graph padding region ( #37873 )
...
Signed-off-by: dass90 <3053034939@qq.com >
2026-03-23 14:59:21 +00:00
Wentao Ye
45bd5c8e75
[Mypy] Fix mypy for vllm/config ( #37808 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 14:33:59 +00:00
Zhaodong Bing
10a1018c12
[ROCm] fix sleep mode not releasing GPU memory problem on ROCm ( #37533 )
...
Signed-off-by: bingzhaodong <aaab8b@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-03-23 06:07:19 -07:00
Jee Jee Li
aec2dc6c0d
[Bugfix][LoRA] Fix incorrect LoRA Log ( #37877 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-23 11:42:52 +00:00
DorBernsohn
7938d12119
[Bugfix] Fix CPU backend crash in KV cache block zeroing ( #37550 )
...
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com >
2026-03-23 11:35:45 +00:00
Kunshang Ji
debd6e768c
[XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle ( #37784 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-23 11:10:41 +00:00
Andrew Xia
9ace378a63
[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request ( #37498 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-23 09:58:08 +00:00
Kunshang Ji
27d5ee3e6f
[FP8]add FP8 WoQ kernel abstraction. ( #32929 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-03-23 09:47:47 +00:00
wangxiyuan
35141a7eed
[Misc]Update gitignore ( #37863 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-03-23 01:14:10 -07:00
Chuan (Richard) Li
e99fb98867
[ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs ( #36100 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-23 15:48:31 +08:00
Artem Perevedentsev
a16133a0f1
[Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 ( #37338 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-23 00:37:58 -07:00
Hojin Yang
54ab804e87
[Bugfix] Store Qwen3Next A_log in fp32 ( #37810 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-23 15:36:57 +08:00
r266-tech
02e6efe56d
[Bugfix] JAIS: Only apply ALiBi when position_embedding_type='alibi' ( #37820 )
...
Co-authored-by: r266-tech <r266-tech@users.noreply.github.com >
2026-03-23 07:36:34 +00:00
Matthias Gehre
410d300893
[ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel ( #36505 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-23 15:36:08 +08:00
Yan Ma
d3fe857135
update doc for online fp8 quantization ( #37851 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-23 05:19:03 +00:00
Baorun (Lauren) Mu
f85e479e66
[Feature] ViT Full CUDA Graph ( #35963 )
...
Signed-off-by: Baorun Mu <bmu@nvidia.com >
2026-03-23 13:01:10 +08:00
Jee Jee Li
1f0d210641
[CI/Build][LoRA] Update Qwen35 LoRA testing ( #37816 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-23 12:55:49 +08:00
Ben Browning
3bbe2e1e6e
[Test] Consolidate tool parser unit tests to tests/tool_parsers ( #37834 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-23 04:24:25 +00:00
Augusto Yao
6e04e79326
always use embed&token_classify for bge-m3 ( #37632 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-23 03:10:57 +00:00
Lasha Koroshinadze
e7767eccae
Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling ( #37643 )
...
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com >
2026-03-23 10:29:07 +08:00
Woosuk Kwon
43877a620b
[MRV2] Enable PP CUDA graph test ( #37830 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 16:30:25 -07:00
zhanqiuhu
63f49b8bd4
[Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism ( #35162 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 20:48:25 +00:00
Woosuk Kwon
a5e9d511de
[MRV2] Use FP64 for Gumbel noise ( #37798 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 12:28:10 -07:00
Yongye Zhu
c058ff44d4
[Bugfix] Fix LoRA test by passing padded size back to the layer ( #37811 )
2026-03-22 13:20:13 -06:00
Woosuk Kwon
ce9b1d76cf
[MRV2] Skip hidden states allocation for PW CUDA graphs ( #37818 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 11:47:21 -07:00
Netanel Haber
e74c17e153
Enable NemotronHPuzzle + NemotronHMTP ( #37803 )
2026-03-22 15:13:58 +00:00
Wentao Ye
eaf4978621
[Test] Only Run MLA model when user explicitly set for batch invariance ( #37719 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-22 09:09:12 -04:00
Wentao Ye
77d24c4bfe
[Bug] Fix fp8 deepgemm batch invariant ( #37718 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-22 08:57:20 -04:00
Giancarlo Delfin
b3e846017d
[Model Runner V2] Support multi-modal embeddings for spec decode model ( #36097 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 02:48:43 -07:00
Andreas Karatzas
cd1242d82a
[ROCm][CI] Stabilize ROCm speech-to-text translation test with lower min acc threshold ( #37723 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 17:32:08 +08:00
Robert Shaw
4383f1532e
[MoE] Move PF Methods to Folder ( #35927 )
2026-03-22 02:42:59 -06:00
Andreas Karatzas
6eedec6e36
[ROCm][CI] Make some duplicated tests optional so that they are only evaluated in our nightly ( #37780 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:03:18 +08:00
Andreas Karatzas
ffc8531524
[ROCm][CI] Added missing resampy dependency for MM audio tests ( #37778 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:41 +08:00
Andreas Karatzas
6ecba840d7
[ROCm][CI] get_cu_count was renamed to num_compute_units in #35042 ( #37764 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:21 +08:00
Andreas Karatzas
3b06c55c78
[ROCm][CI] Fix MEGA_AOT_ARTIFACT fallback when PyTorch < 2.10.0 lacks AOT support ( #37763 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:03 +08:00
Yang Liu
b050700462
[Perf] Optimize glm4.xv VIT ( #37779 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
2026-03-22 06:12:34 +00:00
Andreas Karatzas
5dac719b2b
[Bugfix] Handle libsndfile sf_error(NULL) race condition in audio fallback ( #37782 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 13:37:29 +08:00
Andreas Karatzas
c862481c02
[CI] Skip ISAAC multimodal tests due to broken upstream HF model weights ( #37781 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 13:23:32 +08:00
Andreas Karatzas
c86b17cfe6
[ROCm][CI] Add large_gpu_mark to test_max_tokens_none for ROCm ( #37717 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 12:25:16 +08:00
Andreas Karatzas
66f927f205
[Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing ( #37775 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 03:22:24 +00:00
Andreas Karatzas
e78bc74268
[ROCm][CI] close missing quote in kernels/moe block in run-amd-test.sh ( #37774 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 09:42:34 +08:00
Robert Shaw
6b2fa3a762
[MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ ( #37759 )
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-03-21 19:15:16 -04:00
Robert Shaw
eeee5b262d
[Quantization][Deprecation] Remove PTPC FP8 ( #32700 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-21 22:10:16 +00:00
Robert Shaw
5ad0446572
Revert "Consolidate AWQ quantization into single awq_marlin.py file" ( #37768 )
2026-03-21 17:20:41 -04:00
Robert Shaw
8cc700dd6a
Consolidate AWQ quantization into single awq_marlin.py file
...
Merge awq.py and awq_marlin.py into a single file, eliminating the
circular import between them. awq.py becomes a backward-compat shim.
Follows the same structure as gptq_marlin.py.
Co-authored-by: Claude
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-03-21 17:09:17 -04:00
Brandon Pelfrey
80b70884eb
Add tensor IPC transfer mechanism for multimodal data ( #32104 )
...
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com >
Signed-off-by: Brandon Pelfrey <brandonpelfrey@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-21 20:10:20 +00:00
Mohammad Miadh Angkad
61e381dcf0
[Perf] Add SM 10.3 (B300/GB300) all-reduce communicator tuning ( #37756 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-03-21 19:43:47 +00:00
Mohammad Miadh Angkad
88f1b374f5
[Core] Enable allreduce fusion by default for SM 10.3 (B300/GB300) ( #37755 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-03-21 19:40:37 +00:00
Francesco Fusco
298e510848
[Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() ( #37318 )
...
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com >
2026-03-21 09:29:43 +00:00
Chaitanya Sri Krishna Lolla
3982bc2cd0
[ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs. ( #34692 )
...
Signed-off-by: Tej Kiran <vpolamre@amd.com >
Co-authored-by: Tej Kiran <vpolamre@amd.com >
2026-03-21 00:32:31 -07:00
Andreas Karatzas
02eec7ecbe
[ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list (MI355) ( #37721 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 15:27:12 +08:00
Bongwoo Bak
17ee641c45
[Responses API] Add kv_transfer_params for PD disaggregation ( #37424 )
...
Signed-off-by: bongwoobak <bongwoobak@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-21 13:48:54 +08:00
Andreas Karatzas
0d50fa1db6
[ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 ( #37610 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 12:57:25 +08:00
Simon Mo
1fa1e53a73
Revert "[compile] Initialize passes at VllmBackend init" ( #37733 )
2026-03-20 21:35:49 -07:00
Andreas Karatzas
3ffa52009f
[ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds ( #37617 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 11:58:58 +08:00
Yongye Zhu
87bd91892f
[MoE Refactor] Mxfp4 oracle rebased ( #37128 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-21 03:37:04 +00:00
Isotr0py
c7f98b4d0a
[Frontend] Remove librosa from audio dependency ( #37058 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-21 11:36:15 +08:00
tmm77
1c472f8fe1
Add get_device_uuid for rocm ( #37694 )
...
Signed-off-by: Tiffany Mintz <Tiffany.Mintz@amd.com >
2026-03-21 11:33:16 +08:00
Itay Alroy
c57d38d603
elastic_ep: Fix issues with repeated scale up/down cycles ( #37131 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-03-20 23:13:02 +00:00
Kaihang Jiang
e5ed6c6c13
[BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks ( #37475 )
...
Signed-off-by: Kaihang Jiang <kaihangj@nvidia.com >
2026-03-20 16:14:55 -06:00
Wentao Ye
b3d0b37908
[Refactor] Remove unused dead code ( #36171 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 16:12:51 -06:00
Santino Ramos
85f671b8e1
[Model Runner V2] Support Streaming Inputs ( #37028 )
...
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com >
2026-03-20 20:42:25 +00:00
Andreas Karatzas
8bc6b5cdb0
[ROCm][CI] Setting some mi325_4 tests back to optional (in parity with upstream) ( #37711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 12:25:08 -07:00
Vadim Gimpelson
4f16ebbbd3
[Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing ( #37591 ) ( #37605 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-20 12:19:26 -07:00
Angela Yi
12fd17eb51
[compile] Initialize passes at VllmBackend init ( #35216 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-03-20 11:40:33 -07:00
Cyrus Leung
37aadf6237
[Model] Update Kimi-K25 and Isaac processors to fit HF-style ( #37693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-20 18:30:22 +00:00
Le Yang
d7d2b5e405
[Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… ( #37565 )
...
Signed-off-by: Young-Leo <562593859@qq.com >
2026-03-20 18:28:34 +00:00
SherryC41
6ec5e9fd37
refactor: abstract deepgemm support into platform ( #37519 )
...
Co-authored-by: sherryC41 <sherry.c.c41@gmail.com >
2026-03-20 17:54:08 +00:00
Lucas Wilkinson
e1d85e5c24
[Attention] Support distinguishing between short extends and decodes ( #37303 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-03-20 10:49:36 -07:00
Peter Pan
79eb9369c5
fix CUDAGraph memory being counted twice ( #37426 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-20 17:36:32 +00:00
Woosuk Kwon
e80cfe575d
[MRV2] Avoid recompilation of _gather_block_tables_kernel ( #37645 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-20 10:31:45 -07:00
Xin Yang
d0532bf38d
[Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels ( #37683 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-20 11:28:41 -06:00
Andreas Karatzas
fb4e8bf442
[ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests ( #37613 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 10:16:59 -07:00
Harry Mellor
6ade4bc5a5
Fix various config related issues for Transformers v5 ( #37681 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-20 16:30:12 +00:00
Zhengxu Chen
2e089b96a8
[compile] Add compiled artifact counter for VLLM_USE_MEGA_AOT_ARTIFACT=1. ( #37589 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-20 16:22:46 +00:00
Martin Hickey
880be2b1b8
[Metrics] Some small refactoring for better maintainability ( #33898 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-20 16:11:34 +00:00
Zhengxu Chen
c0f5fae601
[compile] Fix aot test failures with torch 2.12. ( #37604 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-20 16:06:29 +00:00
Rémi Delacourt
aa84e43ccb
[Pixtral] Enable Pixtral language model support Eagle3 ( #37182 )
...
Signed-off-by: remi <remi@mistral.ai >
2026-03-20 15:50:15 +00:00
Matthias Gehre
5e806bcf54
[Bugfix] Fix ConchLinearKernel channelwise quantization (group_size=-1) ( #37329 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-20 10:32:21 -05:00
Matthias Gehre
56a62c310c
[Bugfix] Reject channelwise quantization (group_size <= 0) in ExllamaLinearKernel ( #37331 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-20 10:31:57 -05:00
L.B.R.
1779c09898
[ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode ( #34709 )
...
Signed-off-by: L.B.R. <lbr@mmonad.com >
Co-authored-by: L.B.R. <lbr@mmonad.com >
2026-03-20 10:11:23 -05:00
xuebwang-amd
44eea10f68
[ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization ( #36232 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-03-20 10:10:03 -05:00
Ilya Boytsov
8b6c6b9505
[Model] Add LFM2-ColBERT-350M support ( #37528 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
2026-03-20 14:57:57 +00:00
Harry Mellor
9f6d9dd371
Fix attribute error in isaac_patch_hf_runner ( #37685 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-20 14:49:40 +00:00
Jee Jee Li
dd20ee4e3e
[UX] Enable torch_profiler_with_stack ( #37571 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-20 11:17:26 +00:00
Chauncey
0523449c9c
[Misc] Use logger.info_once for auto tool choice log message ( #37661 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-20 10:40:36 +00:00
Flora Feng
b4c1aef21c
[Refactor] Relocate tests from tests/v1/entrypoints/ to tests/entrypoints/ ( #37500 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:50:34 -07:00
Flora Feng
6050b93bed
[Refactor] Move serve entrypoint tests under tests/entrypoints/serve/ ( #37595 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:10:47 -07:00
Andreas Karatzas
5a4a179591
[ROCm][CI] Fix granite_speech test for gfx90a by selecting compatible attention backend ( #37611 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:07:26 +08:00
Andreas Karatzas
37cd9fc107
[ROCm][CI] Remove deepep DBO tests on gfx90a ( #37614 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:07:07 +08:00
Andreas Karatzas
9cfd4ebb5e
[ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list ( #37619 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:06:53 +08:00
wang.yuqi
ed359c497a
[Model] Deprecate the score task (this will not affect users). ( #37537 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-20 08:07:56 +00:00
Giancarlo Delfin
dcee9be95a
[Model Runner V2] Fix draft logits not populated during cudagraph replay ( #37639 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-20 07:43:47 +00:00
Andreas Karatzas
bd8c4c0752
[CI] Removing deprecated rlhf examples reference ( #37585 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 15:20:33 +08:00
Wei Zhao
0140eafb15
[Bug] Fix FlashInfer allreduce fusion workspace uninitialized error ( #37461 )
...
Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: <>
Co-authored-by: root <root@prenyx0169.a51.clusters.nvidia.com >
Co-authored-by: root <root@prenyx0042.a51.clusters.nvidia.com >
2026-03-20 03:09:21 -04:00
Kunshang Ji
bdf6a0a57b
[XPU] bump vllm-xpu-kernels to v0.1.4 ( #37641 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-20 15:04:38 +08:00
Wangbei25
0674d1fee7
[PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder ( #37293 )
...
Signed-off-by: Wangbei25 <wangbei41@huawie.com >
Signed-off-by: Wangbei25 <wangbei41@huawei.com >
Co-authored-by: Wangbei25 <wangbei41@huawie.com >
2026-03-20 06:24:07 +00:00
Cyrus Leung
30108fc8b0
[Model] Refactor Step3-VL processor to HF style ( #37579 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-20 06:05:08 +00:00
Flora Feng
e2d1c8b5e8
[Refactor] Relocate entrypoint tests to match serving code structure ( #37593 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 05:31:23 +00:00
Huanxing
6951fcd44f
[XPU] Automatically detect target platform as XPU in build. ( #37634 )
...
Signed-off-by: huanxing <huanxing.shen@intel.com >
2026-03-20 13:30:15 +08:00
Giancarlo Delfin
39474513f6
[Model Runner V2] fix draft attention metadata generation ( #37364 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-19 21:05:15 -07:00
Yuxiang Liang
638a872d77
fix(xpu): Re-compute compile ranges after platform-specific config updates ( #37523 )
...
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com >
Signed-off-by: Yuxiang Liang <yuliang@habana.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-20 03:52:35 +00:00
Flora Feng
9040151fe1
[V0 Deprecation] Deprecate --disable-frontend-multiprocessing ( #37612 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 11:31:43 +08:00
Jee Jee Li
8fbe3f303f
[Bugfix][LoRA] Fix Qwen35 LoRA ( #36976 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-20 11:09:32 +08:00
Xiao
ea2c148fa7
[compile][graph_partition]Add tensor size handling ( #36038 )
...
Signed-off-by: Xiao Fu <xiaofu@meta.com >
2026-03-19 19:55:25 -07:00
Tianmu Li
47b7af0d87
[Feat] Enable CompressedTensorW4A8Int for XPU ( #37207 )
...
Signed-off-by: Li, Tianmu <tianmu.li@intel.com >
2026-03-20 02:34:28 +00:00
tianshu-Michael-yu
269bf46d99
fix: disambiguate multimodal prefix cache keys ( #36708 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-03-20 10:33:20 +08:00
Flora Feng
e5a77a5015
[CI] Update mergify tool-calling label paths ( #37478 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:22:23 +00:00
Itay Alroy
ca1ac1a4b4
Fix DP coordinator ZMQ port TOCTOU ( #37452 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-20 00:58:31 +00:00
Divakar Verma
4ca3fa6bb4
[ROCm][Bugfix] fix cache block size mismatch for aiter unified attention ( #37606 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-20 00:00:08 +00:00
Flora Feng
be12afd284
[Bugfix] Fix Deepseekv32 tool parser when stream interval > 1 ( #36056 )
2026-03-19 19:51:25 -04:00
Wentao Ye
df3c0291a3
[Bug] Fix EmbedIOprocessor "classify" <-> "embed" ( #37573 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 07:40:10 +08:00
Wentao Ye
2be1a0f74b
[Refactor] Remove dead code in pooling model ( #37572 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 07:39:43 +08:00
Jim Smith
4120a05ff1
Fix AttributeError in Qwen3.5 GDN layers with quantized models ( #37448 )
...
Signed-off-by: Jim Smith <jim@joshua8.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
2026-03-19 19:21:14 -04:00
rasmith
98ff042917
[CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since it's not necessary ( #36996 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-20 07:12:45 +08:00
Artem Perevedentsev
b55156eae9
[Performance] Enable Triton autotuning disk cache by default ( #37188 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-19 17:36:28 -04:00
Laith Sakka
112944fab9
test Qwen/Qwen3-4B-Instruct-2507 for unbacked ( #36064 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-03-19 17:28:45 -04:00
bnellnm
91be5f9be3
[MoE Refactor] Rename "naive" all2all backend ( #36294 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-03-19 15:50:34 -04:00
Aaron Hao
4ee847e400
Comment fix for async rl example ( #35244 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-19 19:46:07 +00:00
Andreas Karatzas
040a505ff5
[ROCm][CI] Cleaning and restructuring amd-ci legacy pipeline ( #34839 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-19 14:30:58 -05:00
bnellnm
9279c59a0e
[MoE Refactor] DefaultMoERunner simplification ( #33049 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-03-19 15:07:44 -04:00
Wentao Ye
7454096199
[Log] Log once in local node by default ( #37568 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-19 12:04:59 -07:00
Andreas Karatzas
fb8b5e05fc
[CI] Add retry with 4x backoff to HTTP fetches for transient failures ( #37218 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-19 19:00:20 +00:00
Harry Mellor
e5d96dc8fc
Fix SpeculatorsConfig now that PreTrainedConfig is a dataclass in Transformers ( #37574 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 18:04:40 +00:00
EdalatiAli
daa05bf340
[Bugfix] Fix AttributeError when serving MXFP8 models with DeepGEMM installed ( #37358 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-19 17:58:33 +00:00
Lucas Kabela
7769b58307
[torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict ( #37345 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-19 17:26:12 +00:00
Chauncey
2f9f946b22
[P/D] AnthropicMessages add kv_transfer_params for PD disaggregation ( #37535 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-19 16:41:20 +00:00
Fadi Arafeh
2890aecce5
[CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded ( #37561 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-19 16:35:45 +00:00
Harry Mellor
34f093b417
[CI] Gate pre-commit on ready label or number of contributions ( #37544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 16:21:57 +00:00
Harry Mellor
4dce8321a9
Run MacOS smoke test on daily cron job instead of every commit ( #37567 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 16:19:50 +00:00
Cyrus Leung
657855ab41
[Misc] Cleanup more configs and processors ( #37560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 15:45:23 +00:00
Wei Zhao
e27b8ba3d1
[Bug] Fix fp8 trtllm MoE modular kernel supported routing methods ( #37346 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-19 11:43:06 -04:00
Woosuk Kwon
40b8363b45
[MRV2] Use fp32 for draft logits ( #37526 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-19 08:41:21 -07:00
mikaylagawarecki
8b10e4fb31
[1/n] Migrate permute_cols to libtorch stable ABI ( #31509 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-19 11:27:26 -04:00
Ifta khairul Alam Adil
104605cbf2
Remove deprecated reasoning_content message field(part-2) ( #37480 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com >
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Philip Ottesen <phiott256@gmail.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: Andy Lo <andy@mistral.ai >
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com >
Signed-off-by: sihao.li <sihao.li@intel.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Philip Ottesen <phiott256@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com >
Co-authored-by: Andy Lo <andy@mistral.ai >
Co-authored-by: Thillai Chithambaram <79466435+thillai-c@users.noreply.github.com >
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 15:20:08 +00:00
Jee Jee Li
96266f119b
[LoRA] Minor improvements to LoRA log ( #37557 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-19 15:18:06 +00:00
Sage Moore
7c0cf3bcd0
Cap the number of API servers to 1 when using Elastic EP. ( #37466 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-19 10:42:57 -04:00
Harry Mellor
572b432913
Stop bench CLI from recursively casting all configs to dict ( #37559 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 14:04:03 +00:00
Cyrus Leung
9515c20868
[Misc] Clean up processing logic ( #37541 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 13:30:20 +00:00
DorBernsohn
c63ca2b2e6
[Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support ( #37438 )
...
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com >
2026-03-19 21:08:00 +08:00
Harry Mellor
a32eaf5bb2
[CI] Merge cleanup_pr_body.yml and reminder_comment.yml ( #37552 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 12:55:07 +00:00
XueLiang Yang
e390742c59
Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… ( #37536 )
...
Signed-off-by: xueliangyang-oeuler <yxl546827391@gmail.com >
Co-authored-by: xueliangyang-oeuler <yxl546827391@gmail.com >
2026-03-19 12:05:07 +00:00
Cyrus Leung
7a6ebcbfcf
[Model] Remove unnecessary get_language_model ( #37545 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 20:00:36 +08:00
Cyrus Leung
c7bc12c20f
[CI/Build] Split out MM pooling tests ( #37542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 11:36:11 +00:00
wang.yuqi
f9e2a38386
[Docs] Reorganize pooling docs. ( #35592 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 11:25:47 +00:00
Harry Mellor
4426447bba
Don't log exc_info when vLLM tries to download a file that doesn't exist ( #37458 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 10:38:29 +00:00
Li, Jiang
3322e26420
[Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile ( #37538 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-19 10:24:39 +00:00
Cyrus Leung
765e461065
[Bugfix] Fix Nemotron Parse loading ( #37407 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 09:55:29 +00:00
Duyi-Wang
6a9cceb219
[Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant ( #37418 )
...
Signed-off-by: Duyi-Wang <duyi.wang@amd.com >
2026-03-19 09:49:27 +00:00
yassha
199f914183
fix(cpu): add null check for aligned_alloc in ScratchPadManager ( #37369 )
...
Signed-off-by: yassha <50112520+yassha@users.noreply.github.com >
2026-03-19 17:45:06 +08:00
Kunshang Ji
ca21483bf9
[MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available ( #37415 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-19 09:23:24 +00:00
TJian
da70c87e81
[CI] Fix wrong path test file, missing rlhf_async_new_apis.py ( #37532 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-19 02:21:55 -07:00
Collin McCarthy
0b6d52629f
Support temporal compression for Nemotron-3-VL videos ( #36808 )
...
Signed-off-by: Collin McCarthy <cmccarthy@nvidia.com >
2026-03-19 08:02:19 +00:00
Ziming Huang
d3cc379567
[Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__ ( #37425 )
...
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com >
2026-03-19 15:43:48 +08:00
cdpath
354cd580d5
fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming ( #37510 )
...
Signed-off-by: cdpath <cdpath@outlook.com >
2026-03-19 07:23:35 +00:00
zhanqiuhu
d49f273144
[SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation ( #37310 )
2026-03-19 08:22:00 +01:00
Flora Feng
b21d384304
[Refactor] Relocate endpoint tests to mirror serving code directory structure ( #37504 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-19 07:19:36 +00:00
Hongxia Yang
e3126cd107
[ROCm] issue management - request information for bug issues on ROCm ( #37009 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-19 03:51:29 +00:00
Wentao Ye
e37ff5b5c8
[Perf] Optimize token_embed for pooling models, 1.0% token throughput improvement ( #37347 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-19 10:27:51 +08:00
Aaron Hao
6accb21f2a
[bug] Fix deadlock with pause resume and collective_rpc ( #37024 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-19 01:49:02 +00:00
Giancarlo Delfin
053f3b6309
[Model Runner V2] Spec decode rejection sampler logprobs support ( #37237 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-19 01:36:27 +00:00
Aaron Hao
5f82706a21
[BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep ( #37334 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-19 00:45:10 +00:00
Sage Moore
c32a58cc2a
[EPLB] Simplify EPLB rearrange by only returning one map ( #36267 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-18 20:34:00 -04:00
Elvir Crnčević
ef2c4f778d
[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding ( #37442 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-19 00:28:37 +00:00
sihao_li
9dade5da3a
[XPU] Unify xpu test dependencies in dockerfile.xpu ( #36477 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-03-19 08:12:07 +08:00
Thillai Chithambaram
828f862acb
[Bugfix] Expand quantization method support in perf metrics ( #37231 )
...
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com >
2026-03-18 23:54:19 +00:00
Andy Lo
577df69b26
[Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish ( #37054 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-18 23:07:29 +00:00
Giancarlo Delfin
04244fd0e1
[Model Runner V2] Spec decode rejection sampler greedy support ( #37238 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-18 15:59:03 -07:00
Michael Goin
9482b0b085
[Bugfix] Remove assertion for NVFP4 scale dynamic range ( #37465 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-03-18 15:37:49 -07:00
Woosuk Kwon
5bc1da147f
[LoRA][BugFix] Fix skipped LoRA adapters for Mistral3 ( #36928 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-18 22:34:19 +00:00
Philip Ottesen
0091017188
fix(worker): optimize swap_states to copy only active token prefixes ( #34733 )
...
Signed-off-by: Philip Ottesen <phiott256@gmail.com >
2026-03-18 14:59:27 -07:00
Wentao Ye
0d81a1fe61
[V0 Deprecation] Deprecate virtual engine ( #37195 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 14:30:14 -07:00
Netanel Haber
6ae4c8d6fc
chunk parakeet into 30s clips to prevent OOMs on long audios ( #36671 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-18 14:22:24 -07:00
JartX
a913b612d8
[Bugfix] Fix ROCm crash in qwen3_next multi-stream events ( #36795 ) ( #37427 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-03-18 16:06:31 -04:00
Harry Mellor
5ce2d10e4a
Fix models which use layer_type_validation for Transformers v5 ( #37398 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 18:41:51 +00:00
Chengyu Fang
738d0a281f
[Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation ( #37439 )
...
Signed-off-by: chengyufang <cnyvfang@outlook.com >
2026-03-18 11:36:34 -07:00
youkaichao
70b81c4f3d
[bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP ( #37449 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-18 18:32:30 +00:00
Cyrus Leung
7476d148db
[Model] Remove unnecessary processor definition for Nemotron Parse ( #37456 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 18:25:13 +00:00
Cyrus Leung
f3732bd931
[Misc] Clean up model registry ( #37457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 18:24:44 +00:00
Wentao Ye
0ef7f79054
[Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement ( #37340 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 14:18:34 -04:00
Or Ozeri
5dd8df0701
[kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec ( #36642 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-18 19:26:40 +02:00
Harry Mellor
39bfb57b7c
Add API docs link if the CLI arg is a config class ( #37432 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 17:19:35 +00:00
RonaldBXu
c9d838fc33
Adding deterministic lora benchmarking to vLLM Bench ( #36057 )
...
Signed-off-by: Ubuntu <ubuntu@ip-172-31-43-201.ap-northeast-1.compute.internal >
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2026-03-18 16:02:03 +00:00
Xin Yang
b1169d7be8
[Kernel] Add gpt-oss Router GEMM kernel ( #37205 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-18 08:15:56 -07:00
XLiu-2000
17808394bc
standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 ( #37371 )
...
Signed-off-by: XuLiu <xuliu40@gmail.com >
Co-authored-by: XuLiu <xuliu40@gmail.com >
2026-03-18 15:05:37 +00:00
elvischenv
296839a1b0
[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE ( #30647 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-03-18 15:01:26 +00:00
Wentao Ye
c373b5c00d
[Log] Reduce duplicate log ( #37313 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 10:57:44 -04:00
Itay Alroy
de1a86b7de
elastic_ep: Fix stateless group port races ( #36330 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-18 14:36:18 +00:00
Cyrus Leung
99267c23ca
[2/3] Refactor InternVL-based processors ( #37324 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 22:22:19 +08:00
Or Ozeri
525f2eeb0b
[kv_offload+HMA][6/N]: Split offloading_connector.py ( #37405 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-18 14:42:46 +01:00
Yufeng He
918b7890a1
[Bugfix] Fix base64 JPEG video frames returning empty metadata ( #37301 )
...
Signed-off-by: Yufeng He <40085740+universeplayer@users.noreply.github.com >
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Yufeng He <40085740+universeplayer@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-18 13:40:03 +00:00
Andy Lo
98b09ddc27
[NIXL][Bugfix] metrics & testing minor bug ( #36051 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-18 14:39:14 +01:00
Shwetha Poojary
cef1f302d2
[Model] Enable LoRA support for tower and connector in H2OVL ( #31696 )
...
Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com >
2026-03-18 13:26:47 +00:00
Elvir Crnčević
17c47fb869
[Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy ( #37322 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-18 18:30:29 +08:00
Chauncey
b322b197f1
[Build] Bump python openai version ( #32316 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-18 18:20:10 +08:00
Andreas Karatzas
eaf7c9b976
[CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename ( #37328 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 09:44:12 +00:00
Aaron Hao
47a1f11bff
[docs] Add docs for new RL flows ( #36188 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 09:04:26 +00:00
Karan Bansal
fad09e8a1f
fix(glm47): improve tool call parsing and content normalization ( #37386 )
...
Signed-off-by: karanb192 <karan@example.com >
Co-authored-by: karanb192 <karan@example.com >
2026-03-18 08:12:21 +00:00
Jee Jee Li
8c31f47c63
[LoRA] Make LoRA respect language_model_only ( #37375 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-18 07:53:34 +00:00
Li, Jiang
261801242f
[Bugfix] Avoid OpenMP thread reallocation in CPU torch compile ( #37391 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-18 07:51:39 +00:00
Or Ozeri
fcf0687b27
[kv_offload+HMA][0/N]: Support block-level preemption handling ( #34805 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-18 08:49:53 +02:00
liuzhenwei
86b7e3c95a
[XPU] skip unsupported ut and update test_nixl_connector ( #37179 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-18 13:32:59 +08:00
Andrew Xia
0e95916155
[responsesAPI] parser.extract_response_outputs can take in token IDs ( #37130 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-18 05:31:31 +00:00
Andreas Karatzas
ce2ef42fd3
[CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset ( #37335 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 05:26:20 +00:00
Andreas Karatzas
8b6325758c
[ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture ( #37349 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 04:55:40 +00:00
gxd3
a0dd1995c7
[Hardware][TPU] Add supports_async_scheduling() method to Executor interface so that it can be extended for Executor implementations. ( #36924 )
...
Signed-off-by: Guangxiang Du <gxd@google.com >
2026-03-18 12:53:28 +08:00
Xin Yang
f1740006e4
[Perf] Enable dual stream execution of input projection for Qwen3 ( #36795 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-18 11:13:27 +08:00
Andreas Karatzas
58cde5c026
[ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm ( #37330 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 11:12:26 +08:00
Roy Wang
761e0aa7a0
[Performance] Add --enable-ep-weight-filter CLI option ( #37351 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-18 09:36:55 +08:00
Yanan Cao
ff9fbc9aff
[Kernel][Helion] [16/N] Refactor register_kernel API to be more Dynamo-friendly ( #36705 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-18 01:23:35 +00:00
Divakar Verma
e6c4797704
[ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn ( #36927 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-18 08:49:32 +08:00
Michael Goin
09e4576f65
[Kernel] Add non-gated support for NVFP4 CUTLASS MoE ( #37320 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-17 18:12:04 -04:00
Andreas Karatzas
3ed7b1e6e0
[ROCm] Validate block_size for explicitly selected attention backends ( #36846 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 17:04:40 -05:00
JartX
e8f9dbc369
[Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling ( #36720 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-03-17 17:55:34 -04:00
Yong Hoon Shin
de35c06c66
Make KV connector metadata build overridable via plugin ( #37336 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2026-03-17 21:29:06 +00:00
Athrael Soju
c0745a851a
[Model] Add ColQwen3.5 4.5B support ( #36887 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-17 21:17:02 +00:00
Ekagra Ranjan
b5ca9c3557
[Models] Cohere ASR ( #35809 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-17 21:04:17 +00:00
Chao-Ju Chen
245758992e
[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow ( #34577 )
...
Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-17 20:48:42 +00:00
Dimitrios Bariamis
1204cf0a9d
[Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 ( #37158 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-17 20:13:06 +00:00
Wei Zhao
b36adfa349
[Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache ( #37252 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-17 20:09:20 +00:00
Michael Goin
e78821b438
[Deprecation] Deprecate --calculate-kv-scales option ( #37201 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-03-17 19:57:24 +00:00
Cyrus Leung
51f0acda79
[Model] Remove unused handle_oov_mm_token ( #37321 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 19:44:52 +00:00
Brian Dellabetta
fa75204b16
bump compressed-tensors version to 0.14.0.1 ( #36988 )
...
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2026-03-17 15:36:19 -04:00
Wentao Ye
bdb903bb5f
[Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs ( #36674 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-17 15:19:52 -04:00
Andrey Talman
68f783a727
[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility ( #35673 )
...
Signed-off-by: atalman <atalman@fb.com >
2026-03-17 18:47:59 +00:00
Avinash Singh
c5030c439d
[CI] Split Distributed Tests (4 GPUs) and Kernel MoE tests ( #37100 )
...
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com >
Signed-off-by: Avinash Singh <107198269+avinashsingh77@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-17 11:44:55 -07:00
Michael Goin
51b2333be1
[Perf] Optimize top-k search in apply_top_k_top_p_triton sampler ( #37225 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-17 11:35:17 -07:00
Andreas Karatzas
4ed51308c8
[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__ ( #37230 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 09:08:08 -07:00
Cyrus Leung
c781fbbab3
[Bugfix] Standardize custom HF Processor init ( #37289 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 15:38:55 +00:00
Richard Zou
979ff44cea
[BugFix] PyTorch Compilation Tests should error if any test fails ( #37300 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-17 15:26:38 +00:00
Benjamin Chislett
f63ed7b5ac
[Bugfix] Fix DP MTP Dummy Run ( #35243 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-17 11:16:48 -04:00
Ning Xie
c9e5096256
[openapi] remove redundant exception stack trace[4/N] ( #37157 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-17 15:06:25 +00:00
Anton Vlasjuk
2ff0ad9694
[UltraVox] Fix output type ( #37224 )
...
Signed-off-by: vasqu <antonprogamer@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:51:17 +00:00
Isotr0py
a836524d20
[Chore] Replace all base64 usages with faster pybase64 package ( #37290 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-17 14:44:19 +00:00
Bhoomit
3717a4dd47
[Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules ( #34984 )
...
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:36:41 +00:00
Harry Mellor
ecfcdd2ce4
Fix Phi3 test that fails with Transformers v5 ( #37298 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:29:24 +00:00
Siew's Capital Jarvis
c25dbc2d27
[Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace ( #36955 )
...
Signed-off-by: Jarvis <brayden.stanley.0127@gmail.com >
2026-03-17 14:22:09 +00:00
Jonas M. Kübler
77d2a5f17b
pick up tuned prefill configs for FP8 FA3 ( #36265 )
...
Signed-off-by: Jonas M. Kübler <44084297+jmkuebler@users.noreply.github.com >
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2026-03-17 07:00:26 -07:00
Sage
59192dfd39
[Frontend] Complete OpenAI render delegation ( #37287 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-17 13:53:55 +00:00
Umut Polat
56cb1baa66
[Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators ( #36256 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-03-17 13:52:30 +00:00
Cyrus Leung
f340324335
[1/2] Move InternVL-based processors ( #37260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 21:50:56 +08:00
sfbemerk
2660b9289c
Bugfix for offloading+prefetch for GLM-4.7-FP8 ( #37178 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2026-03-17 21:22:09 +08:00
Viacheslav
293f036e6d
Add gigachat 3.1 tool parser + fix gigachat3 tool parser ( #36664 )
...
Signed-off-by: Viacheslav Barinov <viacheslav.teh@gmail.com >
2026-03-17 12:03:20 +00:00
youkaichao
0fb142a454
[perf][connector] optimize build_connector_meta when host buffer transfer is not used ( #37165 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-17 11:59:35 +00:00
Sage
00f8e0d211
[Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender ( #37266 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-17 11:22:54 +00:00
zhao, zhenhui
4af9ed21cb
[Bugfix](xpu): prevent “selected index k out of range” in TP decode path ( #37259 )
...
Signed-off-by: zhenzhao <zhenzhao@habana.ai >
2026-03-17 11:14:07 +00:00
Augusto Yao
9c7cab5ebb
[Feature]: Support for multiple embedding types in a single inference call ( #35829 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-03-17 17:05:42 +08:00
Chauncey
132bfd45b6
[Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens ( #37258 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-17 08:54:52 +00:00
xiao-llm
24b4272a8c
Fix infinite recursive search issue in quark.py ( #32779 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Signed-off-by: Xiao Yu <xiao.yu.dc@outlook.com >
Signed-off-by: kimheesu <wlskaka4@gmail.com >
Co-authored-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Kim Hee Su <wlskaka4@gmail.com >
2026-03-17 07:19:15 +00:00
Benjamin Chislett
8a680463fa
[Bugfix] Fix NemotronH MTP + Chunked Prefill ( #35447 )
2026-03-17 07:07:33 +01:00
Nick Cao
20b14095a4
[Bugfix] Fix loading Music Flamingo ( #35535 )
...
Signed-off-by: Nick Cao <ncao@redhat.com >
2026-03-17 05:24:40 +00:00
PatchyTIS
17c1bdf371
[Bugfix] dtype mismatch in ngram gpu propose ( #37246 )
...
Signed-off-by: PatchouliTaisa <patchychen@tencent.com >
Co-authored-by: PatchouliTaisa <patchychen@tencent.com >
2026-03-17 05:19:55 +00:00
Flora Feng
3e3d320c1b
[Refactor] Relocate responses API tests ( #37241 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 05:14:52 +00:00
Andreas Karatzas
54a62a79f7
[ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch ( #37219 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 11:34:49 +08:00
Flora Feng
384dc7f77b
[Refactor] Relocate completion and chat completion tests ( #37125 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 11:31:23 +08:00
Flora Feng
f04d5226f8
[CI] Fix flaky tool_use chat completion tests with deterministic seed ( #37027 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 03:24:34 +00:00
Kyuyeun Kim
0a0a1a198b
Add ability to replace oot ops when using lora ( #37181 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2026-03-16 18:04:15 -07:00
Vadim Gimpelson
6c1cfbad32
Support non-contiguous KV cache in TRTLLM fp8 dequant kernel ( #36867 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
Co-authored-by: Pavani Majety <pavanimajety@gmail.com >
2026-03-16 17:48:42 -07:00
Harry Huang
45f526d652
[BugFix] Correct max memory usage for multiple KV-cache groups ( #36030 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-03-17 00:38:52 +00:00
Julien Denize
5db91f0aaf
Fix some Mistral parser issues ( #37209 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-17 00:08:56 +00:00
Walter Beller-Morales
061980c36a
[Feature][Frontend] add support for Cohere Embed v2 API ( #37074 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-16 19:55:53 -04:00
Ben Browning
7a49742b88
[CI/Build] Add common tool call parser test suite ( #27599 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-16 19:46:20 -04:00
Terry Gao
3e6a1e1686
[Custom Ops] Add functional + out variant for scaled_fp4_quant ( #34389 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
2026-03-16 18:51:46 -04:00
Julien Denize
7961486a9b
Fix EagleMistralLarge3Model initialization ( #37232 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-16 15:41:00 -07:00
Andreas Karatzas
4f9b14c21c
[CI] Stabilize multinode DP internal LB completion tests ( #36356 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 15:40:23 -07:00
Yuchen Fama
31a458c091
[Doc] Clarify schema enforcement behavior for tool_choice modes ( #37064 )
...
Signed-off-by: yfama <yuchengu@gmail.com >
2026-03-16 22:27:42 +00:00
Wei Zhao
a3a51d20e7
[Benchmark] Improvements to attention benchmark script ( #37115 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-16 22:22:40 +00:00
EdalatiAli
e5b807607c
[Quant][Feature] Support online MXFP8 quantization for MoE and dense models ( #35448 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
2026-03-16 18:07:39 -04:00
Elvir Crnčević
fd4d96302a
Fix eplb nvfp4 experts hook ( #37217 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Signed-off-by: Elvir Crncevic <elvir@anthropic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-16 22:03:54 +00:00
Krish Gupta
c0f011918d
[Bugfix] opcheck false mutation error in rms_norm_per_block_quant ( #36688 ) ( #36779 )
...
Signed-off-by: Krish Gupta <krishom70@gmail.com >
2026-03-16 21:11:33 +00:00
Zhengxu Chen
e6ae4b1be1
[compile] Enable mega aot artifact for torch 2.12+. ( #37198 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-16 21:05:51 +00:00
zhanqiuhu
2dccb38f73
[Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors ( #36549 )
2026-03-16 20:51:04 +00:00
Kunshang Ji
d157216093
[BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer ( #37197 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-16 21:39:56 +01:00
Matthew Bonanni
93f3c8e531
[Misc] Add float16 to CacheDType ( #37199 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-16 13:24:48 -07:00
rasmith
2cc26c3a99
[CI][BugFix][MORI][AMD] Add transfer_id to kv transfer params for test ( #37213 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-16 13:22:57 -07:00
Flora Feng
dfa8852db2
[Refactor] Consolidate GPT-OSS reasoning parser tests ( #36915 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Flora Feng <4florafeng@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-16 15:53:07 -04:00
Lucas Kabela
714c6e0eab
[torch.compile][BE] Modify cudagraph callable to check for is_forward_context_set ( #36288 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-16 19:42:34 +00:00
Sage
0fefd00e6c
[Bugfix] Fix render server crash for quantized models on CPU-only hosts ( #37215 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-16 18:59:01 +00:00
Nicolò Lucchesi
f5c081d432
[PD][Nixl] Add support for hybrid SSM-FA models ( #36687 )
2026-03-16 19:58:06 +01:00
Matthew Bonanni
c88ea8338b
[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible ( #36982 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-16 13:51:21 -04:00
Max de Bayser
9f9ecff4cd
Add simple granite4 tool parser ( #36827 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2026-03-16 10:49:09 -07:00
haosdent
ca1954d58c
[Bugfix] Disable cross-layer KV cache for MLA attention backends ( #37090 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-16 19:03:10 +02:00
Raushan Turganbay
55e6d3d5c0
[Bugfix] Make siglip/clip compatible with transformers v5 ( #37200 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-03-16 16:48:18 +00:00
Chauncey
6682c231fa
[Bugfix] Add error handling for FINISHED_ERROR in OpenAIServing ( #37148 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-16 16:27:47 +00:00
Itay Etelis
5ae685c1c8
[Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout ( #34158 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
2026-03-16 11:20:51 -04:00
Wentao Ye
ce8cf9161d
[Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh ( #36693 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-16 11:12:15 -04:00
xjx
18be11fd59
[BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 ( #35594 )
...
Signed-off-by: xjx <493337577@qq.com >
2026-03-16 15:10:42 +00:00
Yuanheng Zhao
8d8855fdae
[Bugfix] Add safety check and fallback for null scaling factor ( #36106 )
...
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 14:27:29 +00:00
Wentao Ye
e855d380fa
[Compile] Fix compile warning in moe_permute ( #36529 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-16 10:16:14 -04:00
Benjamin Bartels
0e5a9382af
[Bugfix] accept redacted thinking blocks in Anthropic messages ( #36992 )
...
Signed-off-by: Benjamin Bartels <benjaminba@tiglab-ubuntu.ilab.local >
Signed-off-by: bbartels <benjamin@bartels.dev >
Co-authored-by: Benjamin Bartels <benjaminba@tiglab-ubuntu.ilab.local >
2026-03-16 22:01:57 +08:00
Fynn Schmitt-Ulms
04bf5a35fa
[Spec Decode] Update extract_hidden_states to use deferred kv_connector clear ( #37013 )
2026-03-16 14:53:45 +01:00
Tianyu Guo
43a73f853b
Remove unused EVS functions in qwen3_vl.py ( #37183 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
2026-03-16 13:09:09 +00:00
Julien Denize
ffbc2e5bdb
Patch Mistral config ( #37104 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-16 12:22:18 +00:00
Lukas Geiger
f9e6db3034
[Models][Qwen3 ViT] Keep max_seqlen on CPU to prevent D2H sync ( #37139 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-16 12:11:59 +00:00
elvischenv
d61d2b08e9
[Build] Fix API rate limit exceeded when using VLLM_USE_PRECOMPILED=1 ( #36229 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 12:09:27 +00:00
Artem Perevedentsev
f5e59ee7a6
[Performance] Add prefetch for checkpoints to OS page cache ( #36012 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-16 11:32:02 +00:00
Harry Mellor
9b005edc48
[Docs] Make the link to hardware plugins clearer ( #37174 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 04:12:58 -07:00
Robin Nabel
bf9a185395
GLM4 tool parser: fix streaming mode ( #35208 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-16 18:48:52 +08:00
Harry Mellor
ad041c79db
Fix text only inputs for MRoPE models with the Transformers modelling backend ( #37055 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:31:16 +00:00
Kunshang Ji
747b068136
[Hardware] Replace memory related torch.cuda APIs ( #37031 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-03-16 10:24:48 +00:00
Harry Mellor
122f75d939
Fix pipeline parallel with multimodal models with the Transformers modelling backend ( #37057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:20:37 +00:00
SoluMilken
d8f8a7aad2
[Misc] Sync pre-commit to 4.5.1 in workflows and docs ( #36675 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:03:21 +00:00
Roy Wang
0115e957d4
[Frontend][Misc] Remove unused log in /is_sleeping ( #37093 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-03-16 17:46:28 +08:00
haosdent
116ed130f4
[Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches ( #34871 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-16 10:30:23 +01:00
Vadim Gimpelson
8374387bd8
[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell ( #36987 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-16 09:04:29 +00:00
Isotr0py
912fbe9555
[Bugfix] Fix Qwen2.5-Omni/Qwen3-Omni use_audio_in_video with multi-video inputs ( #37147 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-16 08:56:06 +00:00
Laith Sakka
52131f88d9
use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks ( #36204 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-03-16 08:52:31 +00:00
Roy Wang
821eb80c0d
[Performance][Model Loader] Skip non-local expert weights during EP model loading ( #37136 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-03-16 01:33:36 -07:00
Andreas Karatzas
a2956a0f8e
[ROCm][CI] Retrying in case of batch variance effects and reducing flakiness ( #36442 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 16:08:51 +08:00
Andreas Karatzas
911355e216
[ROCm] Fix KV copy methods and auto-select attention backend for ROCm ( #36845 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 16:07:27 +08:00
Chauncey
8d3f8f485e
[Bugfix] fix Qwen3.5 tool calling bug ( #36774 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-16 15:38:42 +08:00
Woosuk Kwon
96efb91480
[Model Runner V2] Fix processed logits in sample() ( #37144 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-16 00:35:49 -07:00
leo-cf-tian
2754231ba3
[Kernel] Add FlashInfer MoE A2A Kernel ( #36022 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: Leo Tian <lctian@nvidia.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com >
2026-03-15 23:45:32 -07:00
bigshanedogg
2390d44209
[Model] Add HyperCLOVAX-SEED-Think-14B language model support ( #37107 )
...
Signed-off-by: bigshanedogg <bigshane319@gmail.com >
2026-03-16 06:40:05 +00:00
Li, Jiang
7362b4450a
[Bugfix] Avoid LD_PRELOAD check on MacOS ( #37145 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-15 23:31:44 -07:00
Andreas Karatzas
57a314d155
[CI][Bugfix] Fix 500 errors from priority overflow and TemplateError subclasses in schema fuzz tests ( #37127 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 05:27:21 +00:00
Andreas Karatzas
d4c57863f7
[ROCm][CI] Fix engine teardown and text normalization to stabilize voxtral test ( #37138 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 04:49:31 +00:00
Wang, Yiting
68e1b711f1
[XPU] Add deepseek_scaling_rope fused kernel ( #36612 )
...
Signed-off-by: yitingw1 <yiting.wang@intel.com >
2026-03-16 12:35:08 +08:00
rasmith
0024f39a32
[ROCm][P/D][MORI][BugFix] Add transfer_id for moriio_connector to restore P/D functionality ( #34907 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-16 10:36:51 +08:00
Andrew Xia
e9163b536e
[responsesAPI][ez] add a unit test for SimpleContext logprobs ( #37126 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-15 17:12:26 -07:00
Lalithnarayan C
7acaea634c
In-Tree AMD Zen CPU Backend via zentorch [1/N] ( #35970 )
...
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Chinmay-Kulkarni-AMD <Chinmay.Kulkarni@amd.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-03-15 23:35:35 +00:00
Jiangyun Zhu
697e4ff352
[GDN] add a config for gdn kernel selection ( #36647 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-16 00:40:17 +08:00
Hari
a3e2e250f0
[Feature] Add Azure Blob Storage support for RunAI Model Streamer ( #34614 )
...
Signed-off-by: hasethuraman <hsethuraman@microsoft.com >
2026-03-15 19:38:21 +08:00
Isotr0py
143e4dccdf
[Misc] Add online audio_in_video test ( #36775 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-15 00:14:11 -07:00
Isotr0py
6590a3ecda
[Frontend] Remove torchcodec from audio dependency ( #37061 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-15 05:15:59 +00:00
Russell Bryant
b3debb7e77
[Build] Upgrade xgrammar to get a security fix ( #36168 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-15 03:13:48 +00:00
Nick Hill
458c1a4b2d
[Frontend] Reduce chat template warmup logging levels ( #37062 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-14 13:48:59 -07:00
Karan Bansal
821fde2df4
[Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference ( #32384 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
Co-authored-by: Inokinoki <inoki@inoki.cc >
2026-03-14 17:29:06 +00:00
arlo
8c29042bb9
[Feature] Add InstantTensor weight loader ( #36139 )
2026-03-14 18:05:23 +01:00
Cyrus Leung
5467d137b3
[Frontend] Avoid startup error log for models without chat template ( #37040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-14 09:36:11 -07:00
Santino Ramos
3ed46f374b
[Model Runner V2] Add Support for XD-RoPE ( #36817 )
...
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com >
2026-03-14 09:27:55 -07:00
seanmamasde
84868e4793
[Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats ( #35109 )
...
Signed-off-by: seanmamasde <seanmamasde@gmail.com >
2026-03-14 08:44:03 -07:00
Isotr0py
a8e8d62dd8
[Misc] Clean up Kimi-audio whisper encoder loading ( #36903 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-14 23:37:52 +08:00
Julien Denize
e42b49bd69
Mistral common v10 ( #36971 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: root <root@h200-bar-196-227.slurm-bar-compute.tenant-slurm.svc.cluster.local >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-14 07:26:43 -07:00
Sergey Zinchenko
4a718e770d
[Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests ( https://github.com/vllm-project/vllm/issues/35665 ) ( #35684 )
2026-03-14 14:10:11 +00:00
Kevin H. Luu
600a039f57
[CI] Shard Multi-Modal Models (Standard) into 4 parallel jobs ( #37014 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-14 08:26:54 +00:00
Harry Mellor
ffa5d74f15
Enable loading of fused expert weights in the Transformers modelling backend ( #36997 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-14 07:01:06 +00:00
Kevin H. Luu
74fe80ee95
[CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs ( #37015 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-14 12:21:13 +08:00
Flora Feng
bcfdadb1bc
[Refactor] Relocate chat completion and anthropic tests ( #36919 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-14 12:16:16 +08:00
Yanan Cao
236de72e49
[CI] Pin helion version ( #37012 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 23:25:29 -04:00
sbeurnier
a116f96930
[V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls ( #37006 )
...
Signed-off-by: Sebastien Beurnier <sbeurnier@together.ai >
2026-03-14 01:37:32 +00:00
Li, Jiang
092ace9e3a
[UX] Improve UX of CPU backend ( #36968 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-14 09:27:29 +08:00
Andrew Xia
f680dc1b39
[responsesAPI] prioritize content over summary in reasoning item input ( #36516 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <mitandrewxia@gmail.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Andrew Xia <axia@fb.com >
2026-03-14 09:20:30 +08:00
Giulio Leone
b41aa264f9
fix: resolve chat template names before kwargs detection ( #36937 )
...
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com >
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-03-14 00:20:16 +00:00
Dimitrios Bariamis
367cf5cd3e
[Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype ( #36931 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-13 16:41:16 -07:00
haosdent
6d53efd2a5
[Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models ( #34695 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-13 23:25:41 +00:00
Benjamin Chislett
8b346309a5
[Refactor] Consolidate SupportsEagle ( #36063 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-13 23:22:40 +00:00
Nick Hill
54a6db827f
[BugFix] Fix "DP Coordinator receives unexpected..." messages ( #37008 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 23:18:05 +00:00
Matthew Bonanni
9efc4db965
[Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces ( #37004 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-13 22:55:36 +00:00
Kevin H. Luu
f1816fb192
[CI] Split V1 e2e + engine (1 GPU) into separate jobs ( #36945 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 14:16:02 -07:00
Harry Mellor
0005d2a3c9
Use Transformers v5 WeightRenaming for Transformers modeling backend ( #31545 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-13 20:49:08 +00:00
Ekagra Ranjan
d0b402974f
[Bugfix][Spec Decode] Avoid double call of Ngram CPU ( #36952 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-13 20:33:19 +00:00
Divakar Verma
6341d43043
[ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer ( #35316 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-13 19:44:24 +00:00
Mark McLoughlin
7afe0faab1
[Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish ( #36666 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 12:10:06 -07:00
Harry Mellor
5a3f1eb62f
[Misc] Set default kv_buffer_device in a better way ( #36862 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-13 19:07:33 +00:00
yugong333
b3ce711b93
Fp8 lora dense kernel ( #35242 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
2026-03-13 19:05:08 +00:00
Isotr0py
abf61aaa8e
[Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request ( #36800 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-13 18:16:05 +00:00
bigmoyan
4508532fbd
[Bugfix] fix paddleocr crash on some image shape ( #36959 )
...
Signed-off-by: wangzhengtao <wangzhengtao@msh.team >
Signed-off-by: bigmoyan <moyan_work@foxmail.com >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-13 13:46:55 +00:00
Itay Alroy
d5af196c18
[2/N] Elastic EP Milestone 2: Integrating NIXL-EP ( #35627 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Co-authored-by: Yongji Wu <wuyongji317@gmail.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-03-13 09:25:33 -04:00
Chaojun Zhang
82f836d976
[XPU] Support LoRA via torch.compile on XPU platform ( #36962 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2026-03-13 10:34:59 +00:00
Andreas Karatzas
4fccd30f19
[ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options ( #36181 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-13 02:04:22 -07:00
Or Ozeri
cfaf4668f7
[kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec ( #36610 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-13 08:04:21 +00:00
Andreas Karatzas
99a57bdf74
[ROCm][CI] Corrected the GPT-OSS test root path ( #36711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-13 15:53:43 +08:00
Sage
a2268617cf
[Frontend] Delegate preprocessing to OpenAIServingRender ( #36483 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-13 00:39:43 -07:00
Rohan Potdar
a4ad9db541
Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) ( #35786 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-13 07:33:22 +00:00
Nick Hill
b373b5102a
[Tests] Shutdown test RemoteVLLMServer cleanly ( #36950 )
...
Recent PR #33949 changed the teardown logic of the RemoteVLLMServer test utility class to
send SIGTERM to all vllm (sub)processes at once, which breaks the clean/coordinated
shutdown logic that assumes only the top-level process will receive a signal (for example
when running in a container that's shut down).
This caused a bunch of errors and stacktraces in some test logs, even though those tests
still pass. We should still attempt a normal shutdown and only kill other procs if they are
still running after a few seconds.
Example: tests/v1/distributed/test_external_lb_dp.py::test_external_lb_completion_streaming
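The shutdown order described above can be sketched roughly as follows. This is a minimal illustration and not the RemoteVLLMServer code: the helper name shutdown_gracefully, the 5-second default, and the use of a POSIX process group (start_new_session=True) to find leftover subprocesses are all assumptions made for the example.

```python
import os
import signal
import subprocess


def shutdown_gracefully(proc: subprocess.Popen, grace_seconds: float = 5.0) -> bool:
    """Signal only the top-level process, then force-kill whatever is left.

    Hypothetical helper. Assumes `proc` was started with
    start_new_session=True (POSIX only), so its process-group id equals
    its pid and any subprocesses it spawned share that group.

    Returns True if the top-level process exited within the grace period.
    """
    # Step 1: SIGTERM goes to the top-level process only, so it can run
    # its own coordinated shutdown of its subprocesses.
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        exited_cleanly = True
    except subprocess.TimeoutExpired:
        exited_cleanly = False

    # Step 2: anything still alive in the group after the wait is killed.
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except (ProcessLookupError, PermissionError):
        # No group members are left to signal.
        pass

    # Reap the top-level process so it does not linger as a zombie.
    proc.wait()
    return exited_cleanly
```

A caller would start the server with subprocess.Popen(..., start_new_session=True) and call shutdown_gracefully(proc) during teardown; a False return means the forced kill was needed.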
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 07:32:55 +00:00
Thomas Parnell
f296a1966d
[Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs ( #36876 )
2026-03-13 07:09:39 +01:00
Csrayz
bc2c0c86ef
[Frontend] Fix usage incorrectly returned with empty stream_options ( #36379 )
...
Signed-off-by: Csrayz <33659823+Csrayz@users.noreply.github.com >
2026-03-13 03:33:04 +00:00
jaime campos salas
891c60dcd5
fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 ( #36684 )
...
Signed-off-by: Jaime Campos Salas <jaime.campos.salas@gmail.com >
2026-03-12 23:28:27 -04:00
whyiug
1ce13cf992
[Model] Add support for BERT-like Chinese ERNIE pooling models ( #36385 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-13 03:23:53 +00:00
Nikita
10f08dedfa
[Model] Add ColPali late interaction model for multi-modal retrieval ( #36818 )
...
Signed-off-by: Nikita Sukharev <kaonael@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-13 02:18:57 +00:00
Aaron Hao
5e1a373d2e
[BUG] Fix rank calculation in NCCLWeightTransferEngine ( #36940 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-13 01:56:51 +00:00
Simo Lin
572c776bfb
build: update smg-grpc-servicer to use vllm extra ( #36938 )
...
Signed-off-by: Simo Lin <linsimo.mark@gmail.com >
2026-03-13 01:31:36 +00:00
Yifan Qiao
55d8073d06
[Bugfix] ep_scatter kernel store-load race condition ( #34991 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2026-03-13 01:07:59 +00:00
Nick Hill
cd32d6f586
[Model Runner V2] Some code simplification ( #36929 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 00:59:23 +00:00
Jaewon
aaa3092f51
[MoE] Add routing simulation override for MXFP4 quantized MoE ( #33595 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-13 00:30:44 +00:00
Shubhra Pandit
87985077a4
[Speculative Decoding] Add norm_before_fc for gpt-oss draft models ( #36545 )
...
Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-12 23:03:32 +00:00
Ryan Rock
a79c1c2c80
[AMD][Build] Add DeepEP to ROCm Dockerfile ( #36086 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-03-12 21:33:32 +00:00
Andreas Karatzas
cc8f1f4764
[ROCm][CI] Preparing gfx90a mirroring ( #36210 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-12 13:42:25 -07:00
Michael Goin
05b9e8ab5b
Revise environment setup in AGENTS.md ( #36909 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-12 19:21:11 +00:00
Xinan Miao
2cdf92228c
[Feature]: Remove Chunking From FusedMoE ( #34086 )
...
Signed-off-by: SouthWest7 <am1ao@qq.com >
Signed-off-by: Southwest <1403572259@qq.com >
Signed-off-by: southwest <am1ao@qq.com >
Signed-off-by: Xinan Miao <1403572259@qq.com >
Co-authored-by: SouthWest7 <am1ao@qq.com >
2026-03-12 14:24:38 -04:00
Marc Sun
c973ecdead
[bnb] Skip moe + bnb test ( #36896 )
...
Signed-off-by: Marc Sun <marc@huggingface.co >
2026-03-12 18:03:25 +00:00
Harry Mellor
e39257a552
Add AGENTS.md ( #36877 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 10:20:50 -07:00
Dimitrios Bariamis
cc16b24b17
Update Flashinfer to 0.6.6 ( #36768 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-12 13:19:19 -04:00
Eunkwang Jeon
bdc2343454
[Bugfix] Fix KeyError in parse_response_input for reasoning items with optional content ( #34499 )
...
Signed-off-by: jeonsworld <jeonsworld@gmail.com >
2026-03-13 00:13:36 +08:00
Matthew Bonanni
f444c05c32
[Attention] Use FA4 for MLA prefill ( #34732 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-12 12:10:17 -04:00
SoluMilken
85199f9681
[Bugfix] fix main branch pre-commit error (1 line change) ( #36897 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-12 09:08:37 -07:00
grimulkan
a1257fd1ea
[Kernel] Add FP8 KV cache support to Triton MLA decode attention ( #34597 )
...
Signed-off-by: grimulkan <grimulkan@gmail.com >
2026-03-12 08:32:34 -07:00
Thomas Parnell
abcffbba8c
[CI] Fix mypy pre-commit errors on main ( #36882 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-12 08:22:29 -07:00
Kunshang Ji
53ec16a705
[Hardware] Replace torch.cuda.device_count/current_device/set_device API ( #36145 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-12 07:57:47 -07:00
Wei Zhao
2e693f48e7
[Perf] Add TRTLLM FP8 MoE Modular Kernel ( #36307 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-12 07:32:31 -07:00
Martin Hickey
7f1f36bf91
[CI] Fix mypy for vllm/reasoning ( #35742 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-12 12:21:33 +00:00
Mark McLoughlin
5282c7d4d0
[docs] Add lightweight AI assisted contribution policy ( #30947 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-12 11:46:13 +00:00
caozuoba
9e19f8338b
[Perf] add packed recurrent fast path for decode ( #36596 )
...
Signed-off-by: hdj <1293066020@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-12 04:01:57 -07:00
Sage
06e0bc21d2
[Frontend] Split OpenAIServingModels into OpenAIModelRegistry + OpenAIServingModels ( #36536 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-12 03:29:37 -07:00
Chauncey
5a71cdd76e
[Bugfix] Fix crash when tool_choice=required exceeds max_tokens ( #36841 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-12 03:28:45 -07:00
Shanshan Shen
f0d3658c0f
[MM][OOT] Support CPU seq_lens for OOT MMEncoderAttention kernels ( #36605 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-12 03:28:23 -07:00
Michael Goin
57431d8231
[UX] Only show FP4 Marlin fallback warning for w4a4 models ( #36806 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-12 05:19:35 -04:00
Xu Jinyang
3e64fe4a18
[Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling ( #36599 )
...
Signed-off-by: AuYang <459461160@qq.com >
2026-03-12 00:51:09 -07:00
sfeiqiang
8cb24d3aed
[KV Connector] Support using FlexKV as KV Cache Offloading option. ( #34328 )
...
Signed-off-by: phaedonsun <phaedonsun@tencent.com >
Co-authored-by: phaedonsun <phaedonsun@tencent.com >
2026-03-12 00:46:20 -07:00
István Ketykó
00726c74c9
[Bugfix][Model] Fix DeepSeek-OCR TensorSchema crash on empty images_crop ( #36670 )
...
Signed-off-by: István Ketykó <istvan.ketyko@gmail.com >
2026-03-12 15:35:54 +08:00
Chauncey
9fe404ed04
[Frontend] OpenAI Responses API supports Tool/Function calling with streaming ( #29947 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-12 15:03:50 +08:00
Sage
802f306cd1
[Tests] Skip model weight download for render-only test server ( #36813 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-12 06:24:42 +00:00
Yan Ma
894843eb25
replace `with torch.cuda.device` with `with torch.accelerator.device_index` ( #36144 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-11 23:12:57 -07:00
Yanan Cao
584a3f56de
[Kernel][Helion][13/N] Force static_shapes=False in helion register ( #36677 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-12 05:35:29 +00:00
Nick Hill
36735fd772
[BugFix] Fix multiple/duplicate stdout prefixes ( #36822 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-12 12:23:21 +08:00
wang.yuqi
6ecabe4936
[CI Failure] Fix Language Models Test (Extended Pooling) daily CI Failure ( #36761 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-12 12:22:05 +08:00
Woosuk Kwon
2f8b4ce0c0
[Model Runner V2] Do not initialize sampler for non-last PP ranks ( #36824 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-12 03:55:28 +00:00
Yuwei An
2ef69456f5
[LMCache] Fault Tolerance Mechanism ( #36586 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-03-12 03:54:39 +00:00
Louie Tsai
17852aa503
more models for vLLM Benchmark Suite ( #35086 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-03-12 11:36:51 +08:00
Flora Feng
8647c6cf51
[Bugfix] Fix minimax_m2 tool parser when stream interval > 1 ( #35895 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-12 10:25:14 +08:00
Kunshang Ji
513949f95f
[XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu ( #36831 )
...
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-03-12 01:46:02 +00:00
Nick Hill
262b76a09f
[Frontend] Exclude anthropic billing header to avoid prefix cache miss ( #36829 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 01:20:34 +00:00
Wentao Ye
c34ba6b961
[Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement ( #36710 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-12 08:37:01 +08:00
Matthias Gehre
24062b704f
[ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures ( #36499 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-11 23:14:40 +00:00
Aaron Hao
d6b61e5166
[BUG] Fix async rlhf tests ( #35811 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-11 18:06:10 -04:00
Yanan Cao
cf632499ee
[Kernel] [Helion] [15/N] Split config files into per-platform files ( #36698 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:29 -04:00
Yanan Cao
a3774a8198
[Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation ( #36563 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:16 -04:00
Yanan Cao
0ce21c46a0
[Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning ( #36683 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:04 -04:00
Woosuk Kwon
55eed6b7a5
[Model Runner V2] Add WhisperModelState [6/N] ( #35790 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-11 14:20:38 -07:00
Giancarlo Delfin
c77181e534
[Model Runner V2] Add probabilistic rejection sampling for spec decoding ( #35461 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-11 14:04:32 -07:00
maobaolong
12001f2ebc
[LMCache] Pass TP size in lookup for MLA multi-reader locking ( #36129 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu >
2026-03-11 20:45:20 +00:00
Or Ozeri
7ee5d5093b
[BugFix][kv_offload] Fix offloading decodes with async scheduling ( #33881 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-11 20:43:40 +00:00
jennyyyyzhen
428bc718bd
[Bugfix][ROCm] Strip block_size before attention backend validation ( #36274 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-11 13:37:31 -07:00
汪志鹏
ff1e3d9c63
[BugFix]: add bagel to MM_PREFIX_LM_MODELS ( #36316 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
2026-03-11 19:55:59 +00:00
Wentao Ye
35bdca5431
[Refactor] Remove dead code in KV connector ( #36424 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-11 19:40:17 +00:00
Amanzhol Salykov
8a24842765
[ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 ( #35093 )
...
Signed-off-by: salykova <amsalykov@gmail.com >
Signed-off-by: amd-asalykov <asalykov@amd.com >
2026-03-11 19:00:08 +00:00
Harry Mellor
65986db6ba
Make Gemma and Gemma 2 accept inputs_embeds like Gemma 3 ( #36787 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 18:12:43 +00:00
Luka Govedič
9556af87d5
[torch.compile] Add support for non-contiguous fused RMSNorm + group quant ( #36551 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
2026-03-11 10:56:55 -07:00
Or Ozeri
a1a3523a56
[KVConnector] Support worker -> scheduler metadata ( #31964 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-11 17:36:37 +00:00
tianshu-Michael-yu
741f4e046b
fix: align lfm2 thumbnail token counting with HF ( #36707 )
2026-03-11 10:28:38 -07:00
Julien Denize
a5d06dc557
Add 320 dimension size support to MLA ( #36161 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2026-03-11 10:21:22 -07:00
Harry Mellor
5efa206a8c
Fix ExaoneMoeMTP test that never ran in Transformers v4 ( #36792 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 17:10:23 +00:00
Cyrus Leung
196802dfa6
[Misc] Clean up renderers ( #36770 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-11 16:39:29 +00:00
Isotr0py
c84b519cf3
[Bugfix] Fix negative max_tokens when input prompt is too long ( #36789 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 16:30:51 +00:00
Flora Feng
741ecf0630
[CI] Add bfcl tool call correctness eval ( #36560 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-11 12:27:36 -04:00
Robert Shaw
b7e5a588d8
[Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels ( #36061 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-11 16:07:14 +00:00
Richard Zou
822e250ab7
[torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation ( #36093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-11 16:07:09 +00:00
Hongxin Xu
bea02cdf93
Fix routed experts capture for hybrid models (Mamba + Attention) ( #35744 )
...
Signed-off-by: arlenxu <arlenxu@tencent.com >
Signed-off-by: xhx1022 <1737006628@qq.com >
Co-authored-by: arlenxu <arlenxu@tencent.com >
2026-03-11 08:53:10 -07:00
Julien Denize
a3ea760ea5
Add 'none' reasoning effort to ChatCompletionRequest ( #36238 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2026-03-11 15:45:34 +00:00
Harry Mellor
35db669f1d
Correct link to supported hardware on vllm.ai ( #36798 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 08:43:28 -07:00
Julien Denize
afebeffbfb
Add support to Mistral large 3 eagle with dense layers ( #36163 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-11 15:42:56 +00:00
Jhao-Ting Chen
5573894737
Kimi k2.5 MLA based eagle3 ( #36361 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com >
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Izzy Putterman <iputterman@nvidia.com >
2026-03-11 11:36:11 -04:00
Harry Mellor
d5816c8c2f
Fix tied weights in weight mapping test for Transformers v5 ( #36788 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 15:10:26 +00:00
Woosuk Kwon
8ccbcda5c0
[Model Runner V2] Remove unused warmup_for_prefill method ( #36762 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-11 08:02:44 -07:00
tvirolai-amd
a9e532afe2
[ROCm][Perf] Allow MTP lens > 1 in Sparse MLA ( #36681 )
...
Signed-off-by: Teemu Virolainen <teemu.virolainen@amd.com >
2026-03-11 14:43:03 +00:00
Harry Mellor
f3163bba67
Disable docs build skipping until a better solution is found ( #36790 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 13:53:23 +00:00
Martin Hickey
700a1ddc65
[Misc] Use envs module to get VLLM_DISABLED_KERNELS ( #35776 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-11 13:37:46 +00:00
Silvia Colabrese
f33251ffc8
[Bugfix] Fix Mistral-small --format ( #36782 )
...
Signed-off-by: 12010486 <silvia.colabrese@intel.com >
2026-03-11 04:47:52 -07:00
Wuxun Zhang
e584dce52b
Add XPU MLA Sparse backend for DeepSeek v3.2 ( #33230 )
...
Signed-off-by: Zhang, Wuxun <wuxun.zhang@intel.com >
2026-03-11 19:19:15 +08:00
Ning Xie
40c0461f24
[openapi] refactor render related openapi [3/N] ( #36749 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-11 03:14:34 -07:00
Weiguang Li
724759684c
[Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps ( #36136 )
...
Signed-off-by: OiPunk <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 03:13:06 -07:00
Michael Goin
9c34e9d24f
Disable cascade attention by default ( #36318 )
2026-03-11 03:12:23 -07:00
Richard Zou
09b6f99852
[compile] aot_compile should respect VLLM_DISABLE_COMPILE_CACHE ( #36358 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-11 03:12:03 -07:00
Ethan T.
c87fb515ed
fix(lora): use replaced_module_name in pooling model name check ( #36402 )
...
Signed-off-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 03:11:27 -07:00
Itay Alroy
5353c9b016
platforms: Fix Ray DP startup crash ( #36665 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-11 03:08:55 -07:00
Angela Yi
13e79fc811
[ci] Update rtol for test_classification ( #36556 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
2026-03-11 03:08:16 -07:00
Rahul Tuli
9d07a3d6e4
Add: Eagle3 support for Qwen3.5 ( #36658 )
...
Signed-off-by: Rahul-Tuli <rtuli@redhat.com >
2026-03-11 03:07:42 -07:00
Cyrus Leung
646b85544b
[Refactor] Remove Molmo2 processor wrapper ( #36667 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-11 03:07:20 -07:00
tc-mb
4286cc5ec2
fix(minicpmv): fix audio inference by handling meta device in init_re… ( #36751 )
...
Signed-off-by: caitianchi <caitianchi@modelbest.cn >
2026-03-11 03:06:28 -07:00
LoganJane
545d18d81b
[Bugfix] Support other quantization methods in glm41v ( #36321 )
...
Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com >
Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 09:48:05 +00:00
roikoren755
e661b9ee83
[NemotronH] Small fix reasoning parser ( #36635 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-11 02:44:41 -07:00
YiSheng5
c910eeb125
[XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. ( #36593 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2026-03-11 09:17:46 +00:00
Harry Mellor
f4ae58b38b
Remove unused config field from Gemma2 ( #36672 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 01:51:19 -07:00
Isotr0py
e568cf88bc
[UX] Infer dtype for local checkpoint ( #36218 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 08:50:04 +00:00
Nicolò Lucchesi
098d844731
[NIXL][1/N] Refactor kernel_block_size detection ( #35752 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-11 01:11:23 -07:00
JartX
a40ee486f2
[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 ( #35923 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: akaratza <akaratza@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-03-11 07:45:57 +00:00
pschlan-amd
eac2dc2b41
AITER MLA backend: Avoid CPU sync in _build_decode ( #35765 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-03-11 07:25:00 +00:00
Flora Feng
d5080aeaa4
[Refactor] Remove deadcode in Responses API serving ( #36726 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
2026-03-11 07:11:41 +00:00
liuzhenwei
f22d6e0267
[Hardware][NIXL] set default kv buffer type for different platform ( #36438 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-11 05:19:28 +00:00
Kunshang Ji
76c6e6da08
[XPU] Support block fp8 moe by fallback to TritonExpert on XPU ( #36458 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-10 21:54:09 -07:00
typer-J
4184653775
feat: add RISC-V support for CPU backend (v2) ( #36578 )
...
Signed-off-by: typer-J <2236066784@qq.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-03-10 21:51:39 -07:00
Sladyn
4aaaf8c8ce
feat(spec_decode): fuse EAGLE step slot mapping and metadata updates ( #33503 )
...
Signed-off-by: sladynnunes <snunes@usc.edu >
2026-03-11 04:35:33 +00:00
Hongbin Guo
4bf533623b
[Doc] Fix duplicate words in comments ( #36713 )
...
Signed-off-by: Hongbin10 <jdmjdm1998@163.com >
2026-03-10 21:28:31 -07:00
Matthew Bonanni
5f77ef15ae
[Misc][Attention] Clean up unused method in CPU_ATTN ( #36673 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-10 21:27:22 -07:00
elvischenv
7d6abdd022
[Fix] Use torch.empty for output in attention+quant fusion ( #31785 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-03-10 21:26:14 -07:00
Wentao Ye
a8ff2cca92
[Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement ( #35781 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-10 21:25:30 -07:00
tunglinwood
42fadebecb
[Model] Add support for moonshotai/Kimi-Audio-7B-Instruct ( #36127 )
...
Signed-off-by: tunglinwood <tunglinwood@gmail.com >
Signed-off-by: tunglinwood <tomwu.tunglin@gmail.com >
Signed-off-by: tunglinwood <113751333+tunglinwood@users.noreply.github.com >
2026-03-10 21:24:48 -07:00
tianshu-Michael-yu
a197eda9c3
Add tuned H100 MoE configs for LFM2 8B and 24B ( #36699 )
2026-03-10 21:22:02 -07:00
Kevin H. Luu
82b110d50e
[ci] Bound nvidia-cudnn-frontend version ( #36719 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-11 12:17:35 +08:00
Benjamin Chislett
9040cd40af
[DSV3.2][MTP] Optimize Indexer MTP handling ( #36723 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-11 12:16:56 +08:00
fangyuchu
fa0d353acf
[Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks ( #35194 )
...
Signed-off-by: fangyuchu <fangyuchu@qq.com >
2026-03-11 03:22:21 +00:00
Augusto Yao
b386bb3d7c
fix bugs when token_classify & classify run concurrently ( #36614 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-03-10 20:16:34 -07:00
Ning Xie
fe714dd507
[openapi server] log exception in exception handler(2/N) ( #36201 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-10 20:16:30 -07:00
Matthew Bonanni
8ab3d7427c
[Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling ( #36691 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-11 03:01:07 +00:00
Wei Zhao
84e436ed1c
[Bug] Fix TRTLLM Block FP8 MoE Monolithic ( #36296 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-10 22:04:47 -04:00
Andreas Karatzas
81939e7733
[ROCm][CI] Making some tests optional to reduce workload ( #36090 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-10 16:45:27 -07:00
Woosuk Kwon
195d1ca3e8
[Minor] Enhance error message for TRTLLM decode uniformity check ( #36609 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-10 15:38:45 -07:00
Nick Hill
8d983d7cd6
[Model Runner V2] Add initial CI tests ( #36041 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 14:55:21 -07:00
Nick Hill
65b2f405dc
[Core] Simplify core kv-cache blocks initialization logic ( #36521 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 20:20:02 +00:00
Nick Hill
2a68464c5b
[Test] test_async_scheduling.py improvements ( #36340 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 11:17:26 -07:00
Zhengxu Chen
bdd8981dab
[compile] Apply stored functorch config while finalizing loaded artifacts. ( #36582 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-10 09:34:35 -07:00
Woosuk Kwon
f088a831dd
[Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata ( #36626 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-10 09:30:56 -07:00
Harry Mellor
f83b933b84
[CI] Bump mypy version to 1.19.1 ( #36104 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 09:18:28 -07:00
Pleaplusone
82f3f30e26
[ROCm][Perf] Enable sparse_mla's cudagraph on ROCm platform ( #35719 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-03-10 09:14:35 -07:00
Matthew Bonanni
9095cbbfb6
[Bugfix][Sparse MLA] report indexer CG support properly ( #36519 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-10 09:14:31 -07:00
Hashem Hashemi
721ae79f50
Improvements to wvSplitKrc skinny GEMM solution ( #34304 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-03-10 09:14:27 -07:00
AllenDou
aefc59f088
FunASR model bugfix ( #36633 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-03-10 08:14:21 -07:00
Harry Mellor
d88f28da05
Fix hf_override_fn when it modifies model_type ( #35200 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 15:03:18 +00:00
Srinivasoo7
106ff69c4e
feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency ( #35342 )
...
Signed-off-by: srinivas_oo7 <Sriusa4414@gmail.com >
Signed-off-by: Sriusa4414@gmail.com
Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com >
Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com >
Co-authored-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-10 14:43:40 +00:00
Jiangyun Zhu
ca5fb4bbd8
[Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs ( #36595 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-10 07:39:01 -07:00
Alvin Tang
cf88b23749
fix: check HTTP status in batch read_file to prevent silent failures ( #36397 )
...
Signed-off-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-10 07:22:40 -07:00
wang.yuqi
a3189a08b0
[Model] Consolidate score logic by introduce score_type ( #36479 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-10 13:32:25 +00:00
SoluMilken
409c4e632d
[Misc] fix typo: homogenous-> homogeneous (2 lines change) ( #36508 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-10 06:25:37 -07:00
Raushan Turganbay
8850738b70
[Bugfix] Fix processor signature ( #36630 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 06:20:47 -07:00
Mark McLoughlin
234860399b
[Frontend][Core] Revert "Add shutdown timeout" ( #34730 and #36270 ) ( #36628 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-10 06:20:41 -07:00
Harry Mellor
c88510083b
Fix Qwen2.5-VL test for Transformers v5 ( #36532 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 12:05:34 +00:00
Vadim Gimpelson
4ff8c3c8f9
[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU ( #35219 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-10 03:32:20 -07:00
Chang Su
507ddbe992
feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add --grpc flag to vllm serve ( #36169 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-03-10 03:29:59 -07:00
Nick Hill
ddbb0d230a
[Model Runner V2] Fix mm input embeddings lookup ( #36588 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 00:24:58 -07:00
Nick Hill
9efc3bdcd6
[Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill ( #36580 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 00:23:42 -07:00
amirkl94
156e33553c
Fix: Re-Enable EP for trtllm MoE FP8 backend ( #36494 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2026-03-09 23:11:27 -07:00
hallerite
d0cd736caa
[Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. ( #36557 )
...
Signed-off-by: hallerite <hallerite@users.noreply.github.com >
Signed-off-by: hallerite <git@hallerite.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-09 22:30:51 -07:00
Harry Mellor
195c997203
Fix LFM2 MoE test for Transformers v5 ( #36534 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-09 22:29:17 -07:00
Zhuohan Li
04b67d8f62
Remove unused disable_fallback field ( #36546 )
2026-03-09 20:56:54 -07:00
Wentao Ye
7279374f91
[Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement ( #36159 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 20:55:58 -07:00
Woosuk Kwon
006aea17d7
[BugFix] Remove incorrect assert in split_decodes_and_prefills ( #36553 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 20:02:02 -07:00
Hojin Yang
0836be3b03
[Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support ( #31471 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-10 10:59:19 +08:00
Ajay Anubolu
4e95ec111c
[Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 ( #36242 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
2026-03-09 19:16:26 -07:00
Andreas Karatzas
179547d62c
[ROCm][CI] Fix ROCm GPT-OSS Eval test group ( #36179 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 17:55:20 -07:00
youkaichao
f85b4eda3a
[bugfix] fix nvlink for nixl/ucx ( #36475 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-10 07:49:47 +08:00
Woosuk Kwon
2a194ddd72
[Model Runner V2] Add model_state inputs to CUDA graph capture ( #36544 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 15:14:51 -07:00
Shaun Kotek
203a7f27da
add nemotron v3 reasoning parser ( #36393 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
Co-authored-by: root <root@gpu-259.slurm-workers-slurm.slurm.svc.cluster.local >
2026-03-09 15:11:41 -07:00
Lucas Wilkinson
483463f735
[MRV2] Extensible CG dispatch rework ( #35959 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-03-09 13:58:45 -07:00
Matthew Bonanni
4e571ce643
[MTP][Misc] Clean up dead code ( #36507 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 14:43:06 -04:00
Micah Williamson
4ff9b045fe
[ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm ( #36025 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-09 13:27:55 -05:00
Lucas Kabela
3fd03f1ec2
[BE] Rename should_torch_compile_mm_vit to should_torch_compile_mm_encoder ( #36281 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-09 18:22:05 +00:00
Woosuk Kwon
10a5f4d53d
[Model Runner V2] Use NamedTuple for execute_model_state ( #35930 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 11:17:34 -07:00
Simon Mo
fe0c085c28
[Docs] Remove the reo beacon ( #36528 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-03-09 11:16:50 -07:00
Taneem Ibrahim
8d6b3d5dda
[Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers ( #36436 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-09 14:14:11 -04:00
Copilot
4b87ffbefb
[torch.compile] Rename compile_ranges_split_points to compile_ranges_endpoints ( #36027 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-09 18:04:40 +00:00
Shaun Kotek
fa028207aa
Fix/resupport nongated fused moe triton ( #36412 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com >
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: liweiguang <codingpunk@gmail.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Alex Brooks <albrooks@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
Signed-off-by: Tushar Shetty <tushar.shetty@abbyy.com >
Signed-off-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: nvnbagrov <nbagrov@nvidia.com >
Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com >
Co-authored-by: danisereb <daserebrenik@nvidia.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Weiguang Li <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: Alex Brooks <albrooks@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: cong-or <conchubhar.gannon@gmail.com >
Co-authored-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
Co-authored-by: liuzhenwei <zhenwei.liu@intel.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 11:01:18 -07:00
Russell Bryant
d460a18fc6
[Docs] Expand --allowed-media-domains security guidance with threat details ( #36506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-09 17:43:42 +00:00
Woosuk Kwon
6e956d9eca
[Model Runner V2] Add dummy profile_cudagraph_memory API ( #36520 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 10:20:13 -07:00
Andreas Karatzas
1e0f917b34
[ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm ( #36101 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 12:07:44 -05:00
Andreas Karatzas
c174d54f86
[ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks ( #36292 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 12:02:41 -05:00
SoluMilken
55d27cca55
[Misc] fix typo: dependant -> dependent (2 lines change) ( #36511 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-09 10:00:12 -07:00
Roberto L. Castro
580864d81e
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 ( #34917 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
2026-03-09 09:50:36 -07:00
Roberto L. Castro
2b28b9b269
[Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 ( #35290 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-09 09:46:57 -07:00
Taoyu Zhu
70485a11bd
[ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. ( #36253 )
...
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com >
2026-03-09 11:30:35 -05:00
Harry Mellor
74a9f54cdb
[CI] Fix edge case that could lead to broken docs builds on main ( #36515 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-09 09:06:19 -07:00
Matthew Bonanni
00c4cb5606
[Bugfix] Clear stale CG keys after memory profiling ( #36416 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 11:56:00 -04:00
Wentao Ye
941e52c298
[Refactor] Simplify chat_completion_full_generator for tool parsers ( #35634 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 23:33:46 +08:00
Wentao Ye
be292b7c14
[Bug] Fix pooling model benchmark script ( #36300 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 11:17:45 -04:00
Matthew Bonanni
77a73458e3
Reapply [Attention] Refactor check_and_update_config ( #35122 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 07:17:14 -07:00
Tianyu Guo
5578f2a4d3
Support online use_audio_in_video ( #36319 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 07:16:44 -07:00
Cyrus Leung
3ec2115015
[Frontend] Move warmup into Renderer ( #36482 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 06:03:21 -07:00
Isotr0py
b0906d8b02
[MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU ( #36472 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 03:43:44 -07:00
Kevin H. Luu
aaf5fa9abf
[ci] Bound openai dependency to 2.24.0 ( #36471 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-09 03:43:26 -07:00
Cyrus Leung
f96c3ab08c
[Deprecation][1/2] Remove items deprecated in v0.18 ( #36470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 03:43:23 -07:00
Xin Yang
dc6b578466
[Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next ( #35777 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-08 23:41:01 -07:00
liuzhenwei
1bc9c77f6d
[XPU] Add test script of PD disaggregation ( #36434 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
2026-03-09 05:50:27 +00:00
Alex Brooks
65a4da1504
[Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) ( #36160 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-03-09 05:46:23 +00:00
Li, Jiang
217f27598d
[Bugfix] Avoid to replace non-tensor members in cpu model runner ( #36430 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-09 13:06:28 +08:00
wang.yuqi
fff3711a24
[Frontend][2/n] Improve pooling entrypoints | embed. ( #36110 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
2026-03-09 11:42:19 +08:00
Tushar Shetty
c4d859c274
[Bugfix] Skip out-of-stage layers in get_layers_from_vllm_config for pipeline parallel ( #36243 )
...
Signed-off-by: Tushar Shetty <tushar.shetty@abbyy.com >
Signed-off-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
2026-03-08 20:40:16 -07:00
cong-or
747431044d
feat(attention): extract KV-cache update from FlexAttention backend ( #36263 )
...
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
2026-03-08 20:40:12 -07:00
Cyrus Leung
d62856b928
[Misc] Move processors to transformers_utils ( #35953 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 11:31:39 +08:00
Alex Brooks
bd2659a566
Increase Flexibility for OOV Multimodal Token Handling ( #34858 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-03-08 20:30:49 -07:00
Shaun Kotek
90512b2e8b
fix: Use iterator as not to store all the file loads in memory at once ( #36149 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
2026-03-08 20:25:21 -07:00
wang.yuqi
dcf8862fd4
[Examples][1/n] Resettle basic examples. ( #35579 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-08 20:22:53 -07:00
Weiguang Li
43aa389231
[Bugfix] Fix CPU OMP autobind assertion to use local_world_size ( #35815 )
...
Signed-off-by: liweiguang <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-03-08 20:07:29 -07:00
Wentao Ye
384425f84e
[Dependency] Remove default ray dependency ( #36170 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-08 20:06:22 -07:00
Harry Mellor
a0f44bb616
Allow markdownlint to run locally ( #36398 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-08 20:05:24 -07:00
Kunshang Ji
fde4771bbd
[XPU][Doc] update xpu document about triton dependency/conflict issue. ( #36301 )
...
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-03-09 02:09:22 +00:00
Jiangyun Zhu
e5ff140216
[cudagraph] fix cudagraph warning in deepseekv32 ( #28044 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-08 20:27:41 -04:00
danisereb
0a6a3a1290
Add support for ModelOpt MXFP8 MoE models ( #35986 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-03-08 13:00:05 -07:00
Sage
4497431df6
[Frontend] Add GPU-less render serving path (vllm launch render) ( #36166 )
2026-03-08 16:35:09 +01:00
nvnbagrov
b7332b058c
[Model] Nano Nemotron VL - fast media preprocessing ( #35657 )
...
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com >
2026-03-08 03:04:05 -07:00
Andreas Karatzas
40077ea3de
[CI] fix flaky empty responses and add diagnostic assertions in vision chat tests ( #36341 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-08 14:42:24 +08:00
Samuel Shen
5d6aae4577
[LMCache MP Patch]: Race Condition + Duplicated Block Ids ( #35831 )
2026-03-07 13:52:48 -08:00
Roy Huang
63298ee173
[Bugfix][LMCache][KVConnector] fix potential memory leak in LMCache multiprocess mode ( #35931 )
2026-03-07 13:52:35 -08:00
Richard Zou
2dde535df1
[compile] Split compile/warmup monitoring ( #36098 )
2026-03-07 13:52:11 -08:00
Wei Zhao
379689d533
[Perf] Support FP8 KV cache for Flashinfer MLA Sparse ( #35891 )
2026-03-07 13:51:54 -08:00
PatchyTIS
a6be75dbd2
[Core] NGram GPU Implementation compatible with Async Scheduler ( #29184 )
2026-03-07 13:51:37 -08:00
Micah Williamson
ee54f9cdb9
[ROCm][CI] Accept Different But Valid Output for test_olmoe_tp ( #35224 )
2026-03-07 13:50:52 -08:00
Micah Williamson
fc4657756f
[ROCm][CI] Enable AITER for failing test_gpt_oss test case on MI355 ( #36174 )
2026-03-07 13:50:17 -08:00
qli88
eebd14651f
[CI] Enable Crosslayer KV layout tests for ROCm platforms ( #35416 )
2026-03-07 13:49:56 -08:00
Matthew Bonanni
ebb9cc5f2b
[UX][Startup] Account for CUDA graphs during memory profiling ( #30515 )
2026-03-07 13:49:23 -08:00
rahul-sarvam
85f50eb41f
Adding support to Sarvam's MoE models ( #33942 )
...
Signed-off-by: rahul-sarvam <140298821+rahul-sarvam@users.noreply.github.com >
2026-03-08 01:16:24 +08:00
Taneem Ibrahim
5261223c2d
[Misc] Remove duplicate parser registration ( #36303 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-07 09:37:01 -05:00
lif
00b814ba5a
[V0 Deprecation] Remove unused swap_space parameter ( #36216 )
...
Signed-off-by: majiayu000 <1835304752@qq.com >
Co-authored-by: mcelrath
2026-03-07 22:09:55 +08:00
vllmellm
ee8a29511f
[Bugfix] Fix compressed-tensors quantization failure for DeepSeek-R1 on MI300x ( #36247 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-07 09:26:59 +00:00
milesial
755356b3d1
feat: expose media_io_kwargs at runtime ( #34778 )
...
Signed-off-by: Alexandre Milesi <milesial@users.noreply.github.com >
2026-03-07 04:27:04 +00:00
Andreas Karatzas
58928475e4
[ROCm][CI] Making entrypoints more deterministic on ROCm ( #36293 )
2026-03-06 19:04:40 -08:00
Mengtao (Martin) Yuan
1a9718085c
Fix CUDA graph decode capture crash in AITER FlashAttention ( #36042 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-03-06 18:12:07 -08:00
Kunshang Ji
7eb524e64c
refine vllm bench throughput --backend hf ( #35971 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-07 02:10:33 +00:00
Nick Hill
c7f32e08c2
[BugFix] Avoid ignored trust_remote_code warnings ( #36290 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-07 01:24:18 +00:00
Nick Hill
b354686524
[Model Runner V2] Fix warmup for pipeline parallel ( #36280 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-06 16:58:51 -08:00
Nick Hill
6a18d8789b
[Core] Fix benign error log during normal shutdown ( #36270 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2026-03-07 00:39:21 +00:00
Itay Alroy
24a03915f5
mla: don't update kv cache on dummy forwards ( #36282 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-07 00:36:00 +00:00
Andreas Karatzas
b5e34e1fca
[ROCm][CI] Fixing yaml file for external amd-ci signal ( #36284 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 18:30:39 -06:00
Copilot
ce8546a12b
[docs][torch.compile] Add fusions.md — kernel/operator fusion reference page ( #35538 )
...
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: ProExpertProg <luka.govedic@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-06 23:55:06 +00:00
Chuan (Richard) Li
c188749bcd
[ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) ( #35850 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-06 20:24:03 +00:00
Alexei-V-Ivanov-AMD
225d1090a0
Enabling some B200-specific tests on MI355 ( #35253 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Signed-off-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
2026-03-06 19:27:20 +00:00
eellison
f3c6c9c9d7
[CustomOp] CustomOp FusedRMSNormGated ( #35877 )
...
Signed-off-by: Elias Ellison <elias.ellison@gmail.com >
Signed-off-by: eellison <elias.ellison@gmail.com >
2026-03-06 10:53:37 -08:00
Nick Hill
26bd43b52d
Revert "[BugFix] Fix engine hanging after KV cache initialization fai… ( #36262 )
2026-03-06 08:28:09 -08:00
Travis Johnson
6b625a8807
[Bugfix] Quickfix followups to busy loop removal in #28053 ( #36068 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-06 08:13:05 -08:00
Richard Zou
54756b6109
[compile] Stop unconditionally patching constrain_to_fx_strides ( #36152 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-06 10:17:27 -05:00
Raphaël Rialland
39f9ea0da4
[Bugfix] Fix cudagraph_mode:FULL dispatch (This does not impact FULL_AND_PIECEWISE (default)) ( #36165 )
2026-03-06 09:15:31 -05:00
Isotr0py
e4ae148a78
[Refactor] Modular video loader backend refactoring ( #35202 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-06 06:06:59 -08:00
Isotr0py
1d0c0d209c
[Misc] Lazy import registered processors ( #36024 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-06 06:06:45 -08:00
Chenguang Zheng
fcb73f306c
[bugfix] add api process rank in default multimodal request ( #36150 )
...
Signed-off-by: fake0fan <645327136@qq.com >
Signed-off-by: Chenguang ZHENG <645327136@qq.com >
2026-03-06 12:00:09 +00:00
Harry Mellor
e2090bf3af
[CI] Fix startup error test ( #36230 )
...
A change in engine startup error messages in #35478 caused this test failure.
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-06 11:50:28 +00:00
Andreas Karatzas
2a00d3241f
[CI][MM] Gate vision encoder attention mask to MiniCPM only, fixing Aria regression ( #36206 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 01:17:08 -08:00
Alex Brooks
10f4db4dbe
[Frontend] Add Support for MM Encoder/Decoder Beam Search (Offline) ( #36153 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-06 01:16:56 -08:00
Nicolò Lucchesi
5b3ba94ab4
[Core][KVConnector] Support HMA+NixlConnector ( #35758 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-06 08:51:21 +01:00
zhanqiuhu
90f3c01fa4
[Spec Decode][KV Connector] Fix KV transfer in PD + speculative decoding ( #35158 )
...
Signed-off-by: Claude <noreply@anthropic.com >
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-06 08:50:44 +01:00
Andreas Karatzas
807d680337
[ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance ( #35553 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 15:15:12 +08:00
Tyler Michael Smith
5afb387bd4
Change "following fields were present in the request but ignored" log from warn to debug ( #36173 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-03-05 22:15:46 -08:00
Walter Beller-Morales
43e77e59ab
[BugFix] avoid infinite loop with VLLM_PORT and get_open_ports_list ( #36191 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-05 22:15:29 -08:00
Russell Bryant
00bd08edee
[Security] Respect user trust_remote_code setting in NemotronVL and KimiK25 ( #36192 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-05 22:15:19 -08:00
Ajay Anubolu
43f10573c9
[Bugfix] Fix misleading context length error messages ( #36197 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-05 22:15:12 -08:00
Yongye Zhu
86e1060b17
[Bugfix] Fix inner_dp_world initialization order for multi-node TP ( #35892 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-03-05 22:04:44 -08:00
Mark McLoughlin
27066d1b2b
[Frontend][Core] Add shutdown timeout - allowing in-flight requests to finish ( #34730 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-03-05 22:04:31 -08:00
cong-or
57c84ff129
perf: add __slots__ to KVCacheBlock ( #36164 )
...
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
2026-03-05 22:04:09 -08:00
Xiang Shi
e68de8adc0
docs: fix wrong cc in int8.md ( #36209 )
...
Signed-off-by: Xiang Shi <realkevin@tutanota.com >
2026-03-06 06:01:02 +00:00
Andreas Karatzas
a1ffa56a1e
[CI] Fix bge-m3 similarity reference values after *Defination* typo fix ( #36208 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 05:07:29 +00:00
Shiyan Deng
0a208d1f54
[BugFix] Fix engine hanging after KV cache initialization failure ( #35478 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:58:09 -08:00
Shiyan Deng
03a49bb8f0
[Feature] Add --distributed-timeout-seconds CLI option ( #36047 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:57:51 -08:00
Shiyan Deng
8e87cc57f1
[Bug] Fix a corner case in _process_simple_streaming_events ( #34754 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:57:32 -08:00
Cyrus Leung
6dd302653f
[Misc] Rename group_mm_kwargs_by_modality -> group_and_batch_mm_kwargs ( #36158 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-06 12:32:48 +08:00
Cyrus Leung
de00ebeac4
[Bugfix] Fix simple Mistral-Small example ( #36156 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-05 20:25:11 -08:00
Andreas Karatzas
639680d220
[ROCm][CI] Adding missing dependencies for Multi-modal models tests ( #36177 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 12:23:10 +08:00
Rohan Potdar
c5362c739f
Reenable features for ROCm attention backends ( #36185 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-05 20:21:06 -08:00
Nikhil Gupta
0a49676fb0
cpu: aarch64: Upgrade OneDNN for aarch64 to add support for int8 matmul ( #36147 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2026-03-06 03:48:59 +00:00
Jeffrey Wang
c012a8c477
Don't fire ray compatibility webhook when PR or branch is not provided ( #36088 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-03-06 00:42:21 +00:00
Dor Huri
ebed80a7c8
[Performance] Extract KV-cache update from TreeAttention backend ( #35384 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
2026-03-06 00:22:43 +00:00
Nick Hill
a73af584fe
[Model Runner V2] Fix warmup for very small kvcache and/or blocksizes ( #36176 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-05 14:48:10 -08:00
Zhengxu Chen
a97954b6a8
[compile] Consistent compiler config for saved/loaded vllm backends. ( #35810 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 15:08:12 -05:00
Yanhong Li
a911f4dd20
[Model] Add support for OLMo Hybrid ( #32550 )
2026-03-05 14:51:06 -05:00
Russell Bryant
5395471d29
[CI] Add explicit permissions to macOS smoke test workflow ( #35775 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-05 19:08:48 +00:00
Frank Wang
a57c877f18
[BugFix] Fallback from FA4->FA2 for Batch Invariance ( #36059 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2026-03-05 14:05:56 -05:00
Xin Yang
f917020983
[Perf] Optimize FusedMoEModularKernel output tensor using torch.empty ( #35794 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-05 13:47:53 -05:00
tomeras91
86483ca774
[Bugfix] Disable FlashInfer TRTLLM BF16 path for non-gated MoE ( #36146 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-05 09:49:05 -08:00
Netanel Haber
b93a9e6f6d
ParakeetProjection.norm = RMSNorm instead of nn.LayerNorm ( #36133 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-05 17:29:30 +00:00
Xinyu Chen
d8839ef7d9
[XPU] Enable ModelRunnerV2 on XPU ( #36078 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-03-05 17:19:18 +00:00
Avery Miao
e998fa76b9
[BUGFIX]Fix Qwen-Omni models audio max_token_per_item estimation error leading to encoder_cache_size is 0 ( #35994 )
...
Signed-off-by: Miao, Avery <avery.miao@intel.com >
2026-03-05 09:16:29 -08:00
Jiayi Yan
6a895197fa
[Bugfix][CI] fix typos ( #34934 )
...
Signed-off-by: 1195343015 <1195343015@qq.com >
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 17:05:46 +00:00
Sage Moore
8c760b6ab6
[ROCm] Refactor ROCm attention backend selection logic ( #35246 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-05 10:51:26 -06:00
AllenDou
3ee68590c7
refactor funasr model. ( #36108 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-05 08:07:37 -08:00
Cyrus Leung
7196348157
[Bugfix] Fix Qwen-VL tokenizer implementation ( #36140 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-05 08:07:19 -08:00
Ning Xie
176c799f4c
[openai api] log exception in exception handler (1/N) ( #31164 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-05 16:00:12 +00:00
Or Ozeri
612e7729c2
[KVConnector] Scheduler: Fix num_computed_tokens after async KV load ( #34616 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-05 14:25:15 +00:00
Harry Mellor
ecde7af9c4
Fix import that was moved in Transformers 5.2.0 ( #36120 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 13:59:44 +00:00
Harry Mellor
8df523351f
[Docs] Only build docs if documentation or ready labels are present ( #36135 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 13:58:16 +00:00
Andreas Karatzas
b03ff6a96b
[CI] Stabilize test_no_args_tool_call and add ROCm-specific server args ( #36107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-05 21:52:49 +08:00
Ajay Anubolu
ed81d5edd1
[Bugfix] Fix RunAI streamer crash with S3-hosted model paths ( #35976 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-05 12:14:20 +00:00
Shiyan Deng
3c23ac840e
[Bugfix] Fix mypy errors in hermes_tool_parser.py ( #36114 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-03-05 11:37:47 +00:00
cjackal
a708ef5944
[Misc] Fix SyntaxWarning - invalid escape sequence '\e' ( #36020 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-03-05 10:55:31 +00:00
Kunshang Ji
66a2209645
[Hardware] Replace torch.cuda.synchronize() api with torch.accelerator.synchronize ( #36085 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-05 10:36:39 +00:00
Doug Smith
0bfa229bf1
[Release] Include source distribution (sdist) in PyPI uploads ( #35136 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
Co-authored-by: Daniele Trifirò <dtrifiro@redhat.com >
2026-03-05 01:43:50 -08:00
Paco Xu
7493c51c55
[Docs] add Dynamo/aibrix integration and kubeai/aks link ( #32767 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-03-05 17:39:50 +08:00
Reagan Lee
ac773bbe80
[Docs] Update docs to include mm processor + encoder benchmarks ( #34083 )
...
Signed-off-by: Reagan <reaganjlee@gmail.com >
2026-03-05 01:38:25 -08:00
Christian Munley
48e376a007
qwen3coder tool parser fix anyOf double encoded parameters ( #36032 )
...
Signed-off-by: Christian Munley <cmunley@nvidia.com >
2026-03-05 09:06:57 +00:00
Isotr0py
21eb2c3372
[Chore] Correct MTP models test registry ordering ( #36115 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-05 08:55:04 +00:00
Seiji Eicher
e2b31243c0
[Docs] Update CacheConfig block_size docstring to remove inaccurate limit when using CUDA ( #35632 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-03-05 06:24:08 +00:00
Martin Hickey
c3598d02fa
[Misc] Remove deprecated items that are due for removal ( #36006 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-05 06:14:50 +00:00
Benjamin Chislett
57c629e9c1
[Bugfix] Fix block_size for hybrid model MTP ( #36036 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-05 06:10:54 +00:00
zihaoanllm
d106bf39f5
[Doc] Add Parallel Draft Models ( #35973 )
...
Signed-off-by: <zihaoan2@amd.com >
Signed-off-by: zihaoanllm <zihaoan2@amd.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 05:44:07 +00:00
Yanan Cao
b0651021e5
[Kernel] [Helion] [11/N] Retune configs for silu_mul_fp8 ( #36062 )
2026-03-04 21:25:59 -08:00
Hanjun Cho
f600d5192e
[Bugfix] Fix score layer quantization for sequence classification models - Qwen3 (VL) Reranker ( #35849 )
...
Signed-off-by: Hanjun Cho <gkswns0531@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-04 20:57:20 -08:00
Tianmu Li
8e7820131e
[Perf] Use dummy M for weight prepacking on x86 ( #35890 )
...
Signed-off-by: Li, Tianmu <tianmu.li@intel.com >
2026-03-05 04:56:49 +00:00
Andrii Skliar
0a12cea25f
Order config.py in Lexicographical order ( #35866 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-03-04 20:56:47 -08:00
Zhengxu Chen
dd6dbd93f8
[compile] Fix extra cache save on warm start. ( #35921 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 12:56:30 +08:00
Harry Mellor
26366009c5
[CI] Don't leave docs preview comment on closed PRs ( #36087 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 04:51:46 +00:00
Nick Hill
16c472abe7
[Core] Move ray-specific WorkerWrapperBase methods to RayWorkerWrapper ( #35328 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-05 12:11:59 +08:00
daje0601
3b23d57c96
[Model] Add LoRA support for Whisper models ( #29856 )
...
Signed-off-by: daje0601 <englishmt4118@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-05 10:38:25 +08:00
Wentao Ye
2f4226fe52
[CI] Fix pre-commit mypy issue in main ( #36049 )
2026-03-04 18:13:12 -08:00
nkm-meta
792cbd64ca
Add platform method to enable custom collective ops registration ( #34760 )
...
Signed-off-by: Naina Kuruballi Mahesh <nainakm@meta.com >
2026-03-05 00:50:32 +00:00
Zhengxu Chen
2ed4722e26
[compile] Reduce log spam from compile. ( #36044 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 00:48:36 +00:00
Nick Hill
a3299c3d1d
[Model Runner V2] Misc code simplification ( #35941 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 15:26:35 -08:00
Andreas Karatzas
6c21a0c2d7
[ROCm][CI] Added MI325 mirrors (stage C) ( #35239 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 14:48:46 -08:00
Shanshan Shen
562339abc3
[Misc] Support OOT linear method registering ( #35981 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-03-04 22:25:56 +00:00
amitz-nv
d7adcadb9b
[Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 ( #36017 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-03-04 22:23:51 +00:00
Simon Mo
f678c3f61a
[RL] [Weight Sync] Guard IPC update-info pickle deserialization behind insecure serialization flag ( #35928 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-03-04 17:05:32 -05:00
Thomas Parnell
be0a3f7570
[Bugfix] Fix race in non-blocking num_accepted_tokens GPU->CPU copy ( #36013 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-04 13:52:44 -08:00
Harry Mellor
17dc9c7fc9
[CI] Bump mypy version ( #34950 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 20:55:11 +00:00
fenypatel99
7eca859110
Add PyTorch profiler schedule support with warmup/active iterations ( #35240 )
2026-03-04 12:53:38 -08:00
Russell Bryant
636ee223ac
[Docs] Document security risks of GPT-OSS Python tool ( #35139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-04 20:27:31 +00:00
Robert Shaw
b7d59ffce2
[UX] Remove NoOpOffloader log ( #35678 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-04 12:13:40 -08:00
Richard Zou
5569f5218d
[torch.compile] Stop lazily compiling ( #35472 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-04 12:13:17 -08:00
Davina Zaman
138d891d7f
[Docs] Clarify structured outputs configuration for Qwen3 reasoning mode ( #32441 )
...
Signed-off-by: Davina Zaman <davzaman@users.noreply.github.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 11:44:39 -08:00
Stefano Castagnetta
d7166e74c1
[CI] Add Blackwell AsyncTP correctness test ( #35871 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
2026-03-04 19:41:21 +00:00
Nick Hill
417fd28fb1
[Model Runner V2] Fix pooling ( #36019 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 10:53:17 -08:00
tomeras91
7faba503c4
[Kernel][Mamba] Optimize Mamba2 SSD prefill Triton kernels ( #35397 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-04 19:47:17 +01:00
Hyunkyun Moon
bc6be89d16
[Frontend] Add vllm launch command for GPU-less preprocessing serving ( #34551 )
...
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com >
2026-03-04 18:41:52 +00:00
Maxime Grenu
32224f568a
docs: update CPU Docker images to reference Docker Hub instead of AWS ECR ( #34882 )
...
Signed-off-by: Maxime Grenu <69890511+cluster2600@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:31:35 -08:00
Abhishek Mathukiya
f3dc292e9f
docs: add version requirement note for --profiler-config flag ( #32454 )
...
Signed-off-by: abhishkh <mathukiya.a@northeastern.edu >
2026-03-04 18:13:54 +00:00
Chen
138c5fa186
[Docs] Add RunPod GPU deployment guide for vLLM ( #34531 )
...
Signed-off-by: lisperz <zhuchen200245@163.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:11:34 -08:00
Russell Bryant
2f2c1d73a7
[Docs] Upgrade dynamic LoRA warning to admonition block ( #35218 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-04 10:01:42 -08:00
Bhuminjay Soni
fb3e78ab09
[Feature][CI]: compare func & no_func outputs in test_functionalization.py ( #35481 )
...
Signed-off-by: Bhuminjay <bhuminjaysoni@gmail.com >
Signed-off-by: Bhuminjay Soni <Soni5Happy@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-04 18:01:16 +00:00
Michael Yao
fd3bfe74c9
[Docs] Update design/multiprocessing.md ( #30677 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2026-03-04 17:58:59 +00:00
tc-mb
bfdb512f11
fix minicpmo4.5: fix attn_mask in vit attn && fix resampler pos_emb i… ( #34127 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Co-authored-by: hezhihui <hezhihui@modelbest.cn >
2026-03-04 17:46:17 +00:00
Sage
d25c1ec3c9
docs(cpu): Clarify pre-built wheels requirement for CPU Python-only build ( #35090 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-04 17:45:35 +00:00
Xing Liu
7cc6058ac6
[Doc] Add MTP docs and update speculative decoding guidance ( #35197 )
...
Signed-off-by: liuxing <945764858@qq.com >
2026-03-04 17:23:34 +00:00
Manrique Vargas
28028dff2f
fix(docs): use static rdzv backend in multi-node troubleshooting script ( #34784 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-03-04 17:15:35 +00:00
Dr Alex Mitre
3417ba5648
docs: add README for logits_processor examples ( #35933 )
2026-03-04 17:09:19 +00:00
Yan Ma
58cfe0dc44
Fix phi4-mm and remove cuda binding ( #35964 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-05 01:08:05 +08:00
simone-dotolo
e86221deb6
[Doc] Fix GPU Worker count in Process Count Summary ( #36000 )
...
Signed-off-by: simone-dotolo <simonedotolo@libero.it >
Signed-off-by: simone-dotolo <84937474+simone-dotolo@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 17:03:14 +00:00
Netanel Haber
289fc48ab7
Use MMEncoderAttention (=use FlashAttention) instead of torch.sdpa in radio.py ( #35653 )
2026-03-04 08:43:13 -08:00
Christian Pinto
2f2212e6cc
Split generic IO Processor plugins tests from Terratorch specific ones ( #35756 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-03-05 00:01:03 +08:00
Nicolò Lucchesi
18e01a0a10
[Misc] Add --attention-backend auto option ( #35738 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-04 15:12:27 +00:00
sungsoo ha
6cb901093f
[Core] Add All-to-All communication backend for DCP ( #34883 )
...
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com >
Signed-off-by: sungsoo ha <hasungsoo@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:01:57 -05:00
Cyrus Leung
ead7bde1ab
[Bugfix] Make kaldi_native_fbank optional ( #35996 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-04 06:47:32 -08:00
Qi Wang
6aa6ad8992
[BugFix] Fix implicit and incorrect assumption on ECConnector is_producer ( #34783 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-03-04 15:01:30 +01:00
Raghavan
c8c3935b70
[Bugfix][Model] Fix FP8 k_scale/v_scale not loaded for Qwen3-MoE ( #35656 )
...
Signed-off-by: raghavan <oneraghavan@gmail.com >
2026-03-04 13:15:38 +00:00
Ronen Schaffer
bb6888b8b1
[Bugfix][CPUOffloadingManager] Prevent eviction of already-stored blocks in LRU/ARC prepare_store() ( #35846 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-03-04 14:25:33 +02:00
Taneem Ibrahim
1aaec59d79
[MISC] fixed tool_parser mypy errors ( #35640 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 12:23:12 +00:00
pougetat
1659b2e058
[Feature] Add basic metrics for /realtime endpoint ( #35500 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Signed-off-by: pougetat <thomas.pougetabadie@gmail.com >
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 19:56:32 +08:00
haosdent
d6e04f4c43
[Bugfix] Cap FULL decode cudagraph sizes for Mamba/hybrid models ( #34094 ) ( #34571 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-04 11:56:22 +01:00
Kunshang Ji
a8f66cbde8
[XPU] bump vllm-xpu-kernels to v0.1.3 ( #35984 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-04 18:23:31 +08:00
Kunshang Ji
16d2ad1d38
[Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache ( #30681 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 09:49:47 +00:00
Chuan (Richard) Li
5dc3538736
[ROCm][Bugfix] Fall back from CK MXFP4 MoE when GEMM dimensions are unsupported ( #35893 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-04 08:30:54 +00:00
Nathan Price
36bf213181
[Bugfix] Add missing dynamic_arg_dims for Qwen3-ASR torch.compile ( #35869 )
...
Signed-off-by: Nathan Price <nathan@abridge.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-04 08:29:01 +00:00
Joe Runde
6f0dd93801
[Core] Remove busy loop from idle buffer readers ( #28053 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 07:44:20 +00:00
Andrii Skliar
5d199ac8f2
Support Audio Extraction from MP4 Video for Nemotron Nano VL ( #35539 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Andrii <askliar@nvidia.com >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Andrii Skliar <askliar@oci-nrt-cs-001-vscode-01.cm.cluster >
Co-authored-by: Andrii <askliar@nvidia.com >
Co-authored-by: root <root@pool0-03748.cm.cluster >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: root <root@pool0-02416.cm.cluster >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: root <root@pool0-04880.cm.cluster >
2026-03-03 23:20:33 -08:00
Komal Kumar Teru
9e0f44bec4
[cohere][fix][spec-decode]: fix crash when allowed_token_ids is set without penalties ( #35654 )
...
Signed-off-by: kkt-cohere <komal@cohere.com >
2026-03-03 23:20:15 -08:00