Cyrus Leung
e5de19ff9a
[CI/Build] Don't auto-rebase PRs with CI failures ( #39443 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 13:57:37 -07:00
zzaebok
edee96519a
[Spec Decode] fix returning size mismatch on extract hidden states proposer ( #38610 )
...
Signed-off-by: Jaebok Lee <jaebok9541@naver.com >
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
2026-04-09 20:39:39 +00:00
Rishi Puri
adaabb8a55
Add nightly b200 test for spec decode eagle correctness ( #38577 )
...
Signed-off-by: Rishi Puri <riship@nvidia.com >
2026-04-09 20:09:09 +00:00
Ekagra Ranjan
f7cad67412
[ASR] Fix spacing bw chunks in multi chunk audio transcription ( #39116 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-04-09 12:46:33 -07:00
Xinyu Chen
a8134aef4e
[XPU] check is_xccl_available before oneccl warmup ( #39302 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-04-09 12:42:17 -07:00
Michael Goin
2800706f06
[Refactor] Move NVFP4 GEMM management into NvFp4LinearKernel ( #39129 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-04-09 15:05:36 -04:00
Cyrus Leung
0d310ffbeb
[CI/Build] Update auto-rebase rule ( #39429 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 10:59:56 -07:00
Micah Williamson
d5f75fdf50
[ROCm] Correctly guard fused_silu_mul_block_quant on ROCm ( #39387 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-04-09 17:59:03 +00:00
PikaPikachu
827268e98d
[Quantization] Support Quark W8A8 INT8 MoE inference ( #36320 )
...
Signed-off-by: kangletian <Letian.Kang@amd.com >
2026-04-09 17:24:43 +00:00
Wentao Ye
56e19d7ee2
[Model Runner V2] Fix flex attention kv blocks calculation issue ( #39353 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-09 13:07:43 -04:00
Andreas Karatzas
9036d4c464
[ROCm][CI] Resolved nvidia package deps issue ( #39421 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-10 00:06:06 +08:00
Lucas Kabela
a8c6ee9b78
[Performance Improvement] Update batched_count_greater_than to handle batch size 1 without recompile ( #38933 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-04-09 23:51:31 +08:00
Cyrus Leung
3b1d9c3156
[CI/Build] Fix memory cleanup in MM test ( #39411 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 08:50:45 -07:00
Cyrus Leung
54d244f28f
[UX] Improve error message for MM input too long ( #39409 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 13:20:19 +00:00
Richard Zou
6c749399b7
[BugFix] fix tests/kernels/moe/test_moe_layer.py ( #39404 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-04-09 08:48:59 -04:00
lalit10
91eea72330
[Tests] Add Qwen3-VL multimodal memory leak check ( #39268 )
...
Signed-off-by: Lalit Laxminarayan Bangad <lalitbangad@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-04-09 04:54:46 -07:00
Andrii Skliar
df2503e125
nemotron-nano-vl: Allow use_audio_in_video to be passed at vllm serve time ( #38538 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-04-09 11:44:39 +00:00
Nick Hill
c8d98f81f6
[Core] Simplify API server handshake ( #39364 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-04-09 18:56:15 +08:00
Harry Mellor
d87fb264df
[Docs] Bring README updates into docs README ( #39397 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-04-09 10:35:00 +00:00
wang.yuqi
66c079ae83
[Frontend][4/n] Improve pooling entrypoints | pooling. ( #39153 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-04-09 10:09:45 +00:00
Shengqi Chen
b6c9be509e
[CI] fix possible user permission issues in nightly index generation ( #39390 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2026-04-09 08:14:07 +00:00
Qidong Su
ed733802f0
Fix NUMA binding on non-CDMM Grace-Blackwell systems ( #39361 )
...
Signed-off-by: Qidong Su <soodoshll@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-09 07:36:51 +00:00
Andrew Barnes
8a34c5087a
[ROCm] Remove unnecessary fp8 roundtrip in gather cache NHD dequant ( #39122 )
...
Signed-off-by: Bortlesboat <bortstheboat@gmail.com >
2026-04-09 15:12:22 +08:00
Wentao Ye
ed2f282bc8
[Perf] Optimize redundant sync for pooling model, 3.7% Throughput Improvement ( #39113 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-08 23:12:23 -07:00
Zhewen Li
9e78555743
[Docker] Add fastsafetensors to NVIDIA Dockerfile ( #38950 )
2026-04-08 22:21:37 -07:00
sihao_li
e80e633927
[XPU] Skip VLLM_BATCH_INVARIANT for XPU in EAGLE DP test ( #39164 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-09 12:45:16 +08:00
Khairul Kabir
490f17d0c7
[Multimodal] Fix nested_tensors_equal: add length check for lists and tuple support ( #38388 )
...
Signed-off-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com >
Co-authored-by: khairulkabir1661 <khairulkabir1661@users.noreply.github.com >
2026-04-09 04:40:37 +00:00
Yongye Zhu
2e98406048
[Refactor] Improve indexer decode path metadata preparation ( #38865 )
2026-04-08 20:49:15 -07:00
Chendi.Xue
ef5a226819
[PD][HeteroArch]Fix accuracy issue with CPU_ATTN as Decoder and Flash_ATTN as prefiller ( #38935 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-04-09 11:19:07 +08:00
Wentao Ye
aec18492d0
[CI] Fix mypy for vllm/v1/ops ( #39219 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-09 11:06:34 +08:00
noobHappylife
2a49284c8a
Fix Responses JSON schema alias serialization ( #38519 )
...
Signed-off-by: noobhappylife <aratar1991@hotmail.com >
Co-authored-by: OpenAI Codex <codex@openai.com >
2026-04-09 10:50:16 +08:00
Ilya Boytsov
d37b378762
[Model] Update ColModernVBERT to support latest HF checkpoint ( #39307 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
2026-04-09 10:48:51 +08:00
Wei Zhao
92fbec391b
[Bug] Fix routing bias dtype for trtllm per-block fp8 moe ( #38989 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-04-08 19:42:43 -07:00
Ajay Anubolu
2f41d6c063
[Bugfix] Fix cpu-offload-gb assertion with non-default block sizes ( #36461 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-04-08 19:42:16 -07:00
Dipika Sikka
3aecdf08b4
[Gemma4] Support quantized MoE ( #39045 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2026-04-08 21:57:53 -04:00
Michael Goin
eb4205fee5
[UX] Integrate DeepGEMM into vLLM wheel via CMake ( #37980 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-04-08 18:56:32 -07:00
liuzhenwei
83aea2147f
[XPU][UT] update UTs in CI ( #39296 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-09 09:38:16 +08:00
Maral
2e9034c998
[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. ( #33892 )
...
Signed-off-by: maral <maralbahari.98@gmail.com >
Signed-off-by: Maral <maralbahari.98@gmail.com >
2026-04-09 08:50:39 +08:00
Benjamin Chislett
8332078cfd
[Bugfix] FlashInfer MXINT4 MoE crashes, missing do_finalize ( #39315 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-08 20:36:33 -04:00
Richard Zou
ba4a78eb5d
[torch.compile] Allow usage of Opaque Objects in PyTorch 2.11 ( #39286 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-04-08 23:21:10 +00:00
Kai Song
f3c7941ec8
[Bugfix]Fix EP precision for Qwen3.5, Qwen3-Next ( #39181 )
...
Signed-off-by: Song Kai <songkai05@baidu.com >
2026-04-09 01:47:48 +04:00
Wentao Ye
3352bf8b03
[CI Bug] Fix pre-commit issue in main ( #39347 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-08 14:10:05 -07:00
triangleXIV
7c94ae16c6
[BugFix] --max-model-len=-1 causes over-limit requests to hang and starve the entire service ( #39102 )
...
Signed-off-by: triangle14 <y1019026570@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-04-08 14:03:17 -07:00
Rishi Puri
ad05edfbca
tests/v1/e2e/spec_decode: assert async scheduling is used ( #39206 )
...
Signed-off-by: Rishi Puri <riship@nvidia.com >
Signed-off-by: Rishi Puri <puririshi98@berkeley.edu >
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Flora Feng <4florafeng@gmail.com >
2026-04-08 20:30:03 +00:00
Wentao Ye
2018137242
[Feature] Batch invariant nvfp4 linear support ( #39322 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-08 16:29:13 -04:00
Jackmin801
a776a48b1c
[MoE] Move DEEP_GEMM into experts/ subdirectory ( #39005 )
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-08 19:23:08 +00:00
Ben Browning
8477fe427d
[Tool] adjust_request to reasoning parser, and Gemma4 fixes ( #39027 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-04-08 19:04:04 +00:00
Lain
e24e0a43a4
[Attention] relax the head dim 512 and paged kv for sm90+FA4 ( #38835 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-08 18:23:18 +00:00
Roberto L. Castro
b55d830ec7
[Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode ( #37421 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-04-08 13:35:57 -04:00
Shengqi Chen
75e01a39a1
[Feature] NUMA binding support for GPU workers ( #38635 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
Co-authored-by: Jason Li <jasonlizhengjian@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-04-08 09:55:24 -07:00
Or Ozeri
512c5eb455
[kv_offload+HMA][5/N]: Track group block hashes and block IDs ( #37109 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-04-08 19:50:28 +03:00
Flora Feng
13151a4df4
[Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values ( #39114 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-04-08 16:46:27 +00:00
Gregory Shtrasberg
56c976c1b5
[ROCm] Enable fused_silu_mul_block_quant on ROCm ( #38817 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-04-08 11:23:32 -05:00
Frederik Gossen
d74a306c4b
[Core] Use tuple_return in split_module for tuple-conformant subgraphs ( #38752 )
...
Signed-off-by: Frederik Gossen <frgossen@meta.com >
Co-authored-by: Boyuan Feng <boyuan@meta.com >
2026-04-08 09:09:58 -07:00
Gregory Shtrasberg
0e9f0a516c
[ROCm][CI-Build] Cherry pick triton BUFFER_OPS fix and update AITER ( #38580 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-04-08 10:38:03 -05:00
haosdent
8904fc4d19
[Bugfix] Fix V1 logprobs empty strings for multi-byte UTF-8 tokens when logprobs > 0 ( #34875 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-04-08 15:30:00 +00:00
nemanjaudovic
1a2c17634e
[Bugfix] Add missing ASRDataset import and CLI args in benchmarks/throughput.py ( #38114 )
...
Signed-off-by: nemanjaudovic <nudovic@amd.com >
2026-04-08 13:53:53 +00:00
Matthew Bonanni
308cec5864
[FlashAttention] Symlink FA4 instead of copying when using VLLM_FLASH_ATTN_SRC_DIR ( #38814 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-04-08 12:04:34 +00:00
wang.yuqi
4e2ab1861d
[CI Failure] pin nomic-embed-text-v1 revision ( #39292 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-04-08 11:43:06 +00:00
JartX
140cbb1186
[Bugfix] Cuda Clean up scales Kvcache fp8/int8_per_token_head ( #39224 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-04-08 04:08:04 -07:00
Kevin H. Luu
6155bbd1dd
[Bugfix][Docs] Fix ReadTheDocs build crash from mocked torch decorator ( #39284 )
...
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-04-08 09:43:01 +00:00
rasmith
78434b923c
[CI][AMD][BugFix][Kernel] Cast induction variable to int64 on MI350 for chunk_gated_delta_rule_fwd_kernel_h_blockdim64 to avoid illegal memory access ( #39087 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-04-08 16:57:18 +08:00
Michael Goin
2488d1dca2
[Docs] Update README ( #39251 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-04-08 11:34:07 +08:00
yoke
d734445fcd
[Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls ( #38909 )
...
Signed-off-by: yoke233 <yoke2012@gmail.com >
2026-04-08 11:03:54 +08:00
Flora Feng
927975ead8
[Parser] Migrate response api streaming to unified parser ( #38755 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Andrew Xia <axia@meta.com >
2026-04-08 10:09:00 +08:00
Flora Feng
9ea7d670d8
[Bugfix] Fix Qwen3 tool parser for Responses API tools ( #38848 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-04-08 10:08:51 +08:00
Varun Sundar Rabindranath
7b80cd8ac3
[Docs] Add Phi-4-reasoning-vision to supported models + examples ( #39232 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2026-04-08 02:02:26 +00:00
Andrey Talman
2111997f96
[release 2.11] Update to torch 2.11 ( #34644 )
2026-04-07 18:55:48 -07:00
Flora Feng
5af684c319
[CI] Add reasoning parser tests to CI ( #37025 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-04-08 00:57:36 +00:00
Md. Mekayel Anik
d521dcdbcc
docs: clarify SMT and OMP acronyms in CpuPlatform ( #39085 )
2026-04-07 17:42:07 -07:00
Giancarlo Delfin
5daf62271d
[Model Runner V2] Fuse probabilistic rejection sample kernels ( #38496 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-04-07 17:37:37 -07:00
zofia
ad3304425b
[XPU] add xpu backend implementation of mxfp8 quant ( #38682 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-08 08:30:35 +08:00
Lucas Wilkinson
70406eb1dc
[Attention][V0 Deprecation] Deprecate accept output buffer ( #39125 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-07 17:14:58 -04:00
Yubo Wang
08bfedc152
[Bugfix] Fix extract_hidden_states crash with quantized KV cache dtype ( #39160 )
...
Signed-off-by: Yubo Wang <yubowang2019@gmail.com >
2026-04-07 11:18:33 -07:00
Flora Feng
0102bd2f4c
[Parser] Pass request.tools to tool parser ( #38860 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-04-08 01:36:21 +08:00
rasmith
83d09d36b5
[CI][Bugfix][AMD] Ensure weights created when using emulating OCP MXFP4 ( #36993 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-04-08 00:37:16 +08:00
Chendi.Xue
92b9afeecd
[XPU] Quick fix for TritonMLA to remove cuda hardcode ( #39088 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-08 00:17:58 +08:00
Jinzhen Lin
7310555482
[Bugfix] Fix marlin nvfp4 rescaling ( #37502 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2026-04-07 08:57:17 -07:00
ibifrost
96b5004b71
[KVConnector] Support 3FS KVConnector ( #37636 )
...
Signed-off-by: wuchenxin <wuchenxin.wcx@alibaba-inc.com >
Signed-off-by: ibifrost <47308427+ibifrost@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2026-04-07 15:46:00 +00:00
kkyyxhll
98e1a43af7
[Bugfix][Quantization] Fix PerTensorScale loading with tuple shard_id in MergedColumnParallelLinear ( #38517 )
...
Signed-off-by: loukang <loukang@xiaohongshu.com >
2026-04-07 11:16:26 -04:00
maobaolong
729eb59f60
[KVConnector]: prioritize external connector over internal registry ( #38301 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-04-07 15:03:11 +00:00
Ilya Boytsov
6e1100889e
fix(test): recompute Jina ColBERT rotary inv_freq cleared by transformers v5 weight loader ( #39176 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
2026-04-07 22:40:55 +08:00
Harry Mellor
edcc37a8ce
Fix Mistral yarn warning in Transformers v5 ( #37292 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
2026-04-07 13:23:33 +00:00
Harry Mellor
79df4a794d
Automatically add links to API docs for matching strings in docs ( #37434 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-04-07 21:21:18 +08:00
Ronen Schaffer
7c139ab23f
[KV Offload] Clean up ARC/LRU refactoring leftovers: group ARC tests and fix stale comment ( #38217 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-04-07 15:14:45 +03:00
Wei Zhao
0be9516ea4
[Bug] Fix Trtllm Fp8 MoE Weight Shuffle Memory Fragmentation ( #39054 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-04-07 08:04:08 -04:00
Kyle Mylonakis
7b9de7c892
[Bugfix] Correct mistake in chained comparison in static assert logic ( #38699 )
...
Signed-off-by: Kyle Mylonakis <kyle@protopia.ai >
2026-04-07 18:24:39 +08:00
Rohan Potdar
dd9342e6bc
only patch runtime_env for torch >= 2.10 ( #38763 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-04-07 09:29:23 +00:00
Jiangyun Zhu
8060bb0333
[vLLM IR] rework gemma_rms_norm ( #39014 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-04-07 01:37:00 -07:00
Rishapveer Singh
da4c0e4db9
[Model] Use AutoWeightsLoader for FalconH1 ( #39092 )
...
Signed-off-by: Rishapveer Singh <215205492+rishaps@users.noreply.github.com >
2026-04-07 16:25:17 +08:00
Netanel Haber
a9a0e0551f
nano-nemotron-vl: get_mm_max_tokens_per_item for audio, video, image == seq_len ( #38727 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-04-07 00:23:29 -07:00
Andrew Barnes
5c35517a3e
[ROCm] Remove unused IS_FNUZ parameter from reshape_and_cache_shuffle_kernel ( #39123 )
...
Signed-off-by: Bortlesboat <bortstheboat@gmail.com >
2026-04-07 07:17:59 +00:00
Andreas Karatzas
a435e3108d
[ROCm][CI] Fix test repo-root assumptions ( #39053 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-07 13:36:21 +08:00
Andreas Karatzas
2df2c85be4
[Kernels][MoE] Fix legacy_routing to use bitmatrix-based routing path ( #38504 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-07 10:57:09 +08:00
Nick Hill
62095e82c1
[BugFix][MRV2] Fix cuda event reuse race ( #39115 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-04-07 00:21:09 +00:00
bnellnm
b2b2c5239e
[MoE Refactor] Split up compressed_tensors_moe.py ( #38960 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-04-06 20:07:54 -04:00
fxmarty-amd
00d7b497b3
[NVFP4] Support NVFP4 dense models from modelopt and compressed-tensors on AMD Instinct MI300, MI355X and Hopper through emulation ( #35733 )
...
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
Signed-off-by: fxmarty-amd <felmarty@amd.com >
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com >
2026-04-06 16:18:27 -06:00
Matthew Bonanni
9c81f35b1a
[Attention][MLA] Re-enable FA4 as default MLA prefill backend ( #38819 )
2026-04-06 17:51:46 -04:00
Woosuk Kwon
f186cfe75e
[MRV2] Fix hanging issue with DeepSeek V3.2 by setting skip_attn=False ( #39098 )
...
Signed-off-by: WoosukKwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-04-06 12:55:13 -07:00
Netanel Haber
dfa5062a8f
NemotronH default mamba_ssm_cache_dtype=float32; enable auto-hook for NemotronHNanoVLV2Config ( #39032 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-04-06 19:47:46 +00:00
Yongye Zhu
e8ebbdde83
[Quantization] Add FlashInfer CuteDSL batched experts backend for NVFP4 MoE ( #38251 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-04-06 11:57:53 -07:00
namgyu-youn
94fbb09894
[EASY] Drop duplicate KV-cache initialization ( #38799 )
...
Signed-off-by: namgyu-youn <namgyu.dev@gmail.com >
2026-04-06 18:05:39 +00:00
Wentao Ye
419e73cdfa
[Bug] Fix mistral version dependency ( #39086 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-06 13:31:19 -04:00
bnellnm
f01482408c
[MoE Refactor][Test] FusedMoE layer test ( #24675 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-06 17:17:23 +00:00
zhanqiuhu
bfdc0a3a99
[NIXL][Mamba][3/N] Heterogeneous TP: 3-read conv state transfer ( #37635 )
2026-04-06 19:07:02 +02:00
bnellnm
93bada494f
[MoE Refactor] Split of DefaultMoERunner class ( #35326 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-06 12:41:59 -04:00
Frederik Gossen
608914de30
[Core] Re-enable Inductor pre-grad passes in standalone compile (torch>=2.12) ( #38944 )
...
Signed-off-by: Frederik Gossen <frgossen@meta.com >
2026-04-06 09:37:13 -07:00
Wentao Ye
4ae218c122
[Refactor] Remove unused dead code ( #38842 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-06 11:52:05 -04:00
Lukas Geiger
f40d9879f2
[Models][GDN] Remove GPU/CPU syncs in GDNAttentionMetadata.build during speculative decoding ( #38047 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2026-04-06 15:39:37 +00:00
Lucas Wilkinson
47e605092b
[Gemma4] Enable Fast Prefill Optimization ( #38879 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-06 11:19:39 -04:00
Walter Beller-Morales
e69a265135
[Feat][Core] safely abort requests when FSM fails to advance ( #38663 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-04-06 08:00:16 -07:00
Julien Denize
fef56c1855
[Mistral Grammar] Support Grammar Factory ( #38150 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-04-06 10:28:51 -04:00
bhargav-patel-29
c5e3454e5a
[Model] Add support for BharatGen's Param2MoE model ( #38000 )
...
Signed-off-by: bhargav-patel-29 <bhargav.patel@tihiitb.org >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-06 16:19:56 +08:00
liuchenbing2026
f6983f01de
MiniMax-M2: add Eagle3 speculative decoding support ( #37512 )
...
Signed-off-by: liuchenbing <chenliumail@163.com >
Signed-off-by: liucb <liuchengbao_work@163.com >
Co-authored-by: liuchenbing <chenliumail@163.com >
2026-04-05 19:50:18 -07:00
Andreas Karatzas
780ba37458
[ROCm][Quantization] Add asymmetric INT8 quantization support to TritonInt8ScaledMMLinearKernel ( #38501 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-06 09:42:10 +08:00
Micah Williamson
9570654c6d
[ROCm][CI] Run Kernels Core Operation Test On MI325 and mitigate flakiness ( #38184 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-04-06 09:42:02 +08:00
Netanel Haber
d56e952239
nano_nemotron_vl: fix tensor device mismatch exception when video profiling ( #39029 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-04-05 22:23:45 +00:00
Kevin H. Luu
56de443db1
[ci] Switch some CI jobs to H200 MIG slices ( #38956 )
2026-04-05 13:26:11 -07:00
Greg Pereira
4dd49b06f8
[Bug] Fix Import paths for encoder_cudagraph modules ( #38997 )
...
Signed-off-by: greg pereira <grpereir@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-05 19:11:58 +00:00
Greg Pereira
f53fa26e05
[Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters ( #38992 )
...
Signed-off-by: greg pereira <grpereir@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-05 17:11:18 +00:00
Wei Zhao
1af6f78ae5
[Perf] Change Trtllm fp8 MoE to use Shuffled Weights and BlockMajorK Layout ( #38993 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-05 10:54:31 -04:00
Martin Vit
228023b3a5
[Bugfix][MoE] Fix 6-8% decode regression: prefer multi-stream shared expert overlap ( #38990 )
...
Signed-off-by: Martin Vit <martin@voipmonitor.org >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-05 10:28:31 -04:00
Aaron Batilo
9a528260ef
[Bugfix][Spec Decode] Fix extract_hidden_states for VLM models ( #38987 )
...
Signed-off-by: Aaron Batilo <abatilo@coreweave.com >
2026-04-05 02:41:54 -07:00
Robert Shaw
968ed02ace
[Quantization][Deprecation] Remove Petit NVFP4 ( #32694 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-04-05 00:07:45 +00:00
Robert Shaw
7d266abb22
Revert "[vLLM IR] gemma_rms_norm" ( #38998 )
2026-04-04 17:48:08 -04:00
Xiaoshuang Wang
156405d243
[vLLM IR] gemma_rms_norm ( #38780 )
...
Signed-off-by: Icey <1790571317@qq.com >
2026-04-04 13:55:52 -04:00
Artem Perevedentsev
99e5539a67
[Perf][GDN] Align TMA usage with upstream FLA ( #38981 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-05 00:38:02 +08:00
Linkun
a88ce94bbb
[IR][RmsNorm] pass None if not has_weight ( #38961 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-04-04 11:02:30 -04:00
Ziming Qi
2a36d8fb72
[Bugfix][CPU] Fix macOS compatibility broken by #36487 ( #38970 )
...
Signed-off-by: Ziming (2imi9) <148090931+2imi9@users.noreply.github.com >
2026-04-04 14:05:58 +00:00
lalit10
93726b2a1c
Refactor Arctic loading to use AutoWeightsLoader ( #38955 )
...
Signed-off-by: Lalit Laxminarayan Bangad <lalitbangad@gmail.com >
Co-authored-by: Lalit Laxminarayan Bangad <lalitbangad@meta.com >
2026-04-04 05:01:09 +00:00
Yongye Zhu
8617f8676b
[Bugfix] Fix DSV32 weight loading ( #38870 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2026-04-03 19:57:52 -07:00
Andreas Karatzas
06fd9ffcc4
[ROCm][CI] Fix ROCm Dockerfile conftest generation for older Docker parsers ( #38959 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-04 10:41:41 +08:00
Wentao Ye
cab4064cd5
[Bug] Fix workspace manager _current_workspaces size ( #38853 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-04 01:29:45 +00:00
Wentao Ye
062f1a2d70
[Bug] Fix compile error for swap_blocks_batch in CUDA 13 ( #38915 )
2026-04-03 16:56:38 -07:00
elenalil-aws
81994e1d0e
[Bugfix][LoRA] Fix missing in_proj_z in Qwen3_5ForConditionalGenerati… ( #38927 )
...
Signed-off-by: elenalil-aws <elenalil@amazon.com >
2026-04-03 23:30:09 +00:00
Andreas Karatzas
4b506ff90a
[ROCm][CI] Minor missing import patch ( #38951 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-03 23:01:20 +00:00
Andreas Karatzas
5875bb2e9c
[ROCm][CI] Added back missing common deps ( #38937 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-04-03 15:58:57 -07:00
Kevin H. Luu
f0d3ad9f3e
[ci] Remove soft fail for AMD image build job ( #38941 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-04-03 20:42:33 +00:00
Divin Honnappa
121ea5a21f
Removed GPU state confirmation and cleanup steps. ( #38238 )
...
Signed-off-by: Divin Honnappa <divin.honnappa@amd.com >
2026-04-03 13:11:08 -07:00
Jeffrey Wang
ab79863e6c
Remove MQ multi-node tests ( #38934 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-04-03 20:00:08 +00:00
Nick Hill
5f1de2b14b
[Model Runner V2] Add config validation for not-yet-supported features ( #38758 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-04-03 12:08:08 -07:00
yzong-rh
a5a623d961
[Bugfix] Re-enable Renormalize routing for TRT-LLM MoE experts ( #38859 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-04-04 01:48:17 +08:00
Xiaoshuang Wang
f8c3af2d85
[vLLM IR] add import_ir_kernels() to support OOT platforms ( #38807 )
...
Signed-off-by: Icey <1790571317@qq.com >
2026-04-03 17:25:19 +00:00
danisereb
50cd5674b3
Fix invalid logprobs with MTP enabled and sync scheduling ( #38711 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-04-03 12:24:37 -04:00
Vasiliy Kuznetsov
7b1a7423be
[Frontend] new online quantization frontend ( #38138 )
...
Signed-off-by: Vasiliy Kuznetsov <vasiliy@meta.com >
2026-04-03 11:58:39 -04:00
Nicolò Lucchesi
97f92c6b47
[KVConnector] Skip register_kv_caches on profiling ( #38558 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-04-03 15:40:16 +00:00
Yusuf Mohammad
46f02e00f2
[Bugfix] Fix AWQ models batch invariance issues ( #38670 )
...
Signed-off-by: yusuf <yusuf@deeplearningmachine.mynet >
Signed-off-by: <>
Co-authored-by: yusuf <yusuf@deeplearningmachine.mynet >
2026-04-03 14:54:15 +00:00
Qiming Zhang
6b4872240f
[XPU] bump up xpu-kernel v0.1.5, transpose moe weights ( #38342 )
...
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
Signed-off-by: Qiming Zhang <qiming1.zhang@intel.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-03 14:10:02 +00:00
Necofish
580090db6b
[Kernel] Add swapAB support for SM120 CUTLASS blockwise FP8 GEMM ( #38325 )
2026-04-03 15:49:59 +02:00
Artem Perevedentsev
cb10b7e80b
[GDN] Eliminate GPU->CPU sync in prepare_chunk_indices during prefill ( #38361 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
2026-04-03 13:38:02 +00:00
Mieszko Dziadowiec
bf8b022e60
[Intel][Triton] Support round_int8 for Intel backend ( #38825 )
...
Signed-off-by: Mieszko Dziadowiec <mdziadowiec@habana.ai >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-03 20:47:35 +08:00
xiangdong
40ee64c00e
[XPU][CI] Skip test_topp_only and test_topk_and_topp cases on Intel GPU in CI ( #38904 )
...
Signed-off-by: zengxian <xiangdong.zeng@intel.com >
2026-04-03 20:44:52 +08:00
wufann
1b117cb0ac
[ROCm] Fix aiter persistent mode mla with q/o nhead<16 for kimi-k2.5 tp8 ( #38615 )
...
Signed-off-by: wufann <36477220+wufann@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-03 03:54:00 -07:00
Anton Ivanov
abebd9323d
[CPU] Replace OMP initialization ( #36487 )
...
Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com >
2026-04-03 18:42:43 +08:00
Hyeonki Hong
25f2b55319
[Frontend] feat: add streaming support for token generation endpoint ( #37171 )
...
Signed-off-by: Hyeonki Hong <hyeonki.hong@moreh.io >
2026-04-03 10:20:32 +00:00
xiangdong
cb4ff07f8b
[XPU][CI] Skip test_topk_only cases on Intel GPU in CI ( #38899 )
...
Signed-off-by: zengxian <xiangdong.zeng@intel.com >
2026-04-03 09:50:41 +00:00
Gregory Shtrasberg
a7d79fa133
[ROCm][CI/Build] Fix the pytest hook to properly print out the summary ( #38585 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-04-03 17:24:26 +08:00
Netanel Haber
fa9e68022d
Fix Nano Nemotron VL regressions ( #38655 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-04-03 15:22:06 +08:00
Isotr0py
5506435419
[Misc] Clean up Gemma4 implementation ( #38872 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-04-03 05:47:02 +00:00
Yifan Qiao
311c981647
[MRV2][KVConnector] Fix missing build_connector_worker_meta ( #38698 )
...
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai >
2026-04-03 08:42:52 +03:00
Li, Jiang
21d7ecc5b0
[CI/Build] Add audio deps in Dockerfile.cpu ( #38876 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-04-03 05:05:14 +00:00
Aaron Hao
4729b90838
[Bug] Add e_score_correction_bias to SKIP_TENSORS ( #38746 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-04-02 21:15:05 -07:00
shunting314
8b141ed8c3
full cudagraph for flex-attn ( #36298 )
...
Signed-off-by: shunting314 <shunting@meta.com >
2026-04-02 21:15:01 -07:00
Varun Sundar Rabindranath
2ad7c0335f
[Model] Add Phi4ForCausalLMV for microsoft/Phi-4-reasoning-vision-15B ( #38306 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2026-04-02 21:14:57 -07:00
Bowen Bao
201d2ea5bf
[CI][ROCm] Add Qwen3.5-35B-A3B-MXFP4 model eval into CI ( #38664 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-04-03 04:05:45 +00:00
Bowen Bao
103f0de565
[ROCm][Quantization][1/N] Refactor quark_moe w_mxfp4 w/ oracle ( #38774 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-04-03 03:29:57 +00:00
wliao2
32e0c0bfa2
refactor hard coded device string in test files under tests/v1 and tests/lora ( #37566 )
...
Signed-off-by: Liao, Wei <wei.liao@intel.com >
2026-04-03 11:21:47 +08:00
Itay Etelis
4a06e1246e
[Perf] Batch KV cache swap copies via cuMemcpyBatchAsync ( #38460 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-04-03 03:13:23 +00:00
Carl Y
3bc2734dd0
[Kernel] Fuse FP8 output quantization into merge_attn_states ( #36518 )
...
Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com >
2026-04-03 01:47:04 +00:00
Carl Y
1f5ec2889c
[mla] Support fused FP8/NVFP4 output quantization in MLA attention ( #35792 ) ( #36205 )
...
Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com >
Signed-off-by: Carl Y <4531192+carlyou@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-04-02 21:16:11 -04:00
Yan Ma
ee3cf45739
[XPU] Initial support for GDN attention on Qwen3-next/Qwen3.5 ( #33657 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-04-03 08:59:11 +08:00
Matthew Bonanni
05e68e1f81
[CI] Fix test_nixl_connector ( #38838 )
2026-04-02 17:52:13 -07:00
Vadim Gimpelson
771913e4a0
[Bugfix] Fix NVFP4+MTP crash: force unquantized mtp.fc for Qwen3.5 ( #38832 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-04-03 04:45:57 +04:00
1096125073
71a9125c67
[New Model]: add support for telechat3 ( #38510 )
...
Signed-off-by: xiayongqiang <xiayq1@chinatelecom.cn >
Co-authored-by: xiayongqiang <xiayq1@chinatelecom.cn >
2026-04-03 08:26:22 +08:00
Nicolò Lucchesi
66e86f1dbd
[Kernel] Mamba support different layout for Conv state ( #37416 )
2026-04-03 01:50:09 +02:00
Michael
bb39382b2b
[Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter ( #38847 )
...
Signed-off-by: Michael Hospedales <hospedales@me.com >
2026-04-02 14:35:19 -07:00
zhanqiuhu
7b743ba953
[CI] Fix: pass string cache_dtype in test_register_kv_caches ( #38836 )
2026-04-02 19:42:09 +00:00
Stefano Castagnetta
188defbd0b
[CI] Add flashinfer.py to attention test source deps ( #38792 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-04-02 19:24:29 +00:00
Luciano Martins
08ed2b9688
feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) ( #38826 )
...
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com >
Signed-off-by: Luciano Martins <lucianomartins@google.com >
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2026-04-02 11:13:28 -07:00
Yanan Cao
ecd5443dbc
Bump helion dependency from 0.3.2 to 0.3.3 ( #38062 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-04-02 10:59:33 -07:00
Stefano Castagnetta
58262dec6e
[Bugfix] Fix test mocks after SM100 restriction in #38730 ( #38791 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-04-02 13:12:58 -04:00
Lucas Wilkinson
cb3935a8fc
[FA4] Update flash-attention to latest upstream FA4 ( #38690 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-02 17:02:37 +00:00
Bowen Bao
82a006beeb
[CI][ROCm] Add gpt-oss w4a8 in CI ( #38292 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-04-03 00:06:01 +08:00
wang.yuqi
a9b4f07ba2
[Frontend] Re-enable running MaxSim on GPU ( #38620 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-04-03 00:03:13 +08:00
Koushik Dutta
d9408ffba3
Triton MLA perf fixes ( #33529 )
...
Signed-off-by: Koushik Dutta <koushd@gmail.com >
Co-authored-by: root <root@ubuntu-nvidia.localdomain >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-04-02 09:40:01 -04:00
Yusuf Mohammad
16a65e4173
[Bugfix] Enable batch-invariant Triton matmul on all Ampere GPUs (SM 8x) ( #38427 )
...
Signed-off-by: yusuf <yusufmohammad@live.com >
Signed-off-by: yusuf <yusuf@deeplearningmachine.mynet >
Signed-off-by: Yusuf Mohammad <79484377+YM2132@users.noreply.github.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: yusuf <yusuf@deeplearningmachine.mynet >
2026-04-02 09:29:58 -04:00
bsliu
c0817e4d39
[Model] Add support for Cheers multimodal model ( #38788 )
...
Signed-off-by: bsliu <1187291748@qq.com >
Signed-off-by: 吴炳贤 <wubingxian24@mails.ucas.ac.cn >
2026-04-02 21:01:40 +08:00
Harry Mellor
dfe5e31689
Don't compile vision encoder for Transformers backend ( #30518 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-04-02 12:42:29 +00:00
JartX
2ce3d0ce36
[Feature] KV cache per-token-head INT8/FP8 quantization ( #38378 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: yangyang4991 <yangyang4991@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2026-04-02 08:13:26 -04:00
Jiangyun Zhu
4eefbf9609
[Perf] fuse kernels in gdn ( #37813 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-04-02 11:52:18 +00:00
vllmellm
551b3fb39f
[ROCm] Enable VLLM triton FP8 moe for gfx1201, tuned for Qwen3-30B-A3B-FP8 tp=2 and Qwen/Qwen3.5-35B-A3B-FP8 tp=2 ( #38086 )
...
Signed-off-by: big-yellow-duck <jeffaw99@hotmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-04-02 08:13:42 +00:00
Li, Jiang
c6f722b93e
[CPU] Support gelu act in cpu_fused_moe ( #38770 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-04-02 14:14:32 +08:00
Xin Yang
9bd7231106
Revert "[Kernel] Add gpt-oss Router GEMM kernel ( #37205 )" ( #38778 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-04-01 22:02:32 -07:00
Yanan Cao
73f48ce559
[Kernel] [Helion] Use warning_once in get_gpu_name to prevent log spam ( #38743 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com >
2026-04-01 21:30:31 -07:00
Gregory Shtrasberg
3aab680e3e
[ROCm][Bugfix] Fix ROCm runtime failure due to missing symbol ( #38750 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2026-04-01 21:30:11 -07:00
Sergey Zinchenko
5a2d420c17
[Bugfix] Use dedicated MM processor cache in /tokenize to prevent sender-cache pollution ( #38545 )
...
Signed-off-by: Sergey Zinchenko <sergey.zinchenko.rnd@gmail.com >
2026-04-01 21:14:49 -07:00
Benjamin Chislett
5f96f9aff1
[Perf] DSV3.2 Indexer Fused Weights Projection ( #38684 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-04-02 03:34:49 +00:00
Luka Govedič
694449050f
Fix multiline-format string for python 3.10 ( #38739 )
...
Signed-off-by: Luka Govedic <luka.govedic@gmail.com >
2026-04-02 03:19:35 +00:00
Nick Hill
6241521dd2
[BugFix] Fix precommit breakage due to conflicting in-flight merges ( #38759 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-04-01 15:35:06 -07:00
Kevin H. Luu
1785dc5501
Revert "[Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params ( #37831 )" ( #38751 )
2026-04-02 06:34:28 +08:00
Chang Su
54500546ac
[Bugfix] Preserve original ImportError in gRPC server entrypoint ( #38673 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
2026-04-01 22:16:44 +00:00
Jeffrey Wang
de5e6c44c6
[Feat][Executor] Introduce RayExecutorV2 ( #36836 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-04-01 14:34:29 -07:00
yzong-rh
cb268e4e55
[Refactor] Simplify FutureWrapper in MultiprocExecutor ( #38644 )
...
Signed-off-by: Yifan <yzong@redhat.com >
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-04-01 21:28:26 +00:00
Stefano Castagnetta
6183cae1bd
[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang ( #38730 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
2026-04-01 12:08:40 -07:00
Monishver
c09ad767cd
Feature/silu block quant fusion v1 ( #32996 )
...
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com >
2026-04-01 18:50:43 +00:00
Wentao Ye
c9a9db0e02
[Compile] Fix nvfp4 compile warning ( #38573 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-01 18:28:57 +00:00
Chauncey
cbe7d18096
[Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str ( #38242 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-04-01 09:56:45 -07:00
Michael Goin
db5d0719e1
[Kernel] Add MXFP8 to Marlin GEMM/MoE and refactor Mxfp8LinearOp ( #34664 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-04-01 09:41:42 -07:00
yzong-rh
dc0428ebb8
[NIXL][BUG] Fix Triton heterogeneous TP ( #37940 )
...
Signed-off-by: Yifan <yzong@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-04-01 17:23:15 +02:00
Jesus Talavera
148c2072ec
Add ibm-granite/granite-vision-3.3-2b to supported models documentation ( #38714 )
...
Signed-off-by: Jesus Talavera <jesus.talavera@ibm.com >
2026-04-01 08:22:25 -07:00
majianhan
2f5c3c1ec0
[Misc] Fix docstring typo: buildin -> builtin ( #38722 )
...
Co-authored-by: majianhan <majianhan@kylinos.cn >
2026-04-01 07:39:46 -07:00
Fynn Schmitt-Ulms
fa246d5231
Fix shape comment in extract_hidden_states example ( #38723 )
...
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com >
2026-04-01 07:29:33 -07:00
bnellnm
7cf56a59a2
[MoE Refactor] Make SharedExperts class for use with DefaultMoERunner ( #35153 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-04-01 09:44:08 -04:00
Elvir Crnčević
5e30e9b9a9
[Bugfix] Revert "Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding" ( #38359 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-04-01 09:11:10 -04:00
손세정
582340f273
[Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params ( #37831 )
...
Signed-off-by: AAISSJ <maze0717@g.skku.edu >
Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local >
2026-04-01 20:22:29 +08:00
yjz
992368522f
[KVTransfer] Fix TpKVTopology.is_kv_replicated equality case ( #38179 )
...
Signed-off-by: JianDan0212 <zhangyj0212@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-04-01 12:41:49 +02:00
Juan Pérez de Algaba
58ee614221
(security) Enforce frame limit in VideoMediaIO ( #38636 )
...
Signed-off-by: jperezde <jperezde@redhat.com >
2026-04-01 10:23:45 +00:00
Harry Mellor
f9f6a9097a
Add verified label to trigger pre-commit ( #38708 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-04-01 02:31:02 -07:00
Zhanda Zhu
c75a313824
[Perf] triton bilinear_pos_embed kernel for ViT ( #37948 )
...
Signed-off-by: Zhanda Zhu <zhandazhu@gmail.com >
2026-04-01 01:52:02 -07:00
Lukas Geiger
4f6eed3bd4
[Core] Simplify multimodal masking ( #34246 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2026-04-01 01:18:22 -07:00
Li, Jiang
36d7f19897
[CPU] Support head_size 512 in cpu_attn ( #38676 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-04-01 05:42:27 +00:00
Jeffrey Wang
2d725b89c5
[Bugfix] Lazy import diskcache to avoid sqlite3/libstdc++ ImportError at startup ( #38649 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-04-01 05:31:20 +00:00
Augusto Yao
ef53395e2c
[bugfix] do not add extra linebreak for score/rerank with chat template ( #38617 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-04-01 04:50:07 +00:00
Lucas Wilkinson
eb47454987
[Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking ( #36178 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-04-01 00:15:53 -04:00
Matthew Bonanni
116f4be405
[1/N][Cleanup] Standardize on use of is_quantized_kv_cache ( #38659 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-04-01 04:08:01 +00:00
Wentao Ye
7b01d97a22
[Perf] Optimize mean pooling using chunks and index_add, 5.9% E2E throughput improvement ( #38559 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-04-01 03:54:58 +00:00
HarshRathva
17b72fd1c8
Fix priority preemption regression test in scheduler ( #37051 )
...
Signed-off-by: HarshRathva <harshrathvaai@gmail.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-04-01 06:36:12 +03:00
Samu Tamminen
c49497726b
[ROCm][perf] Shuffle KV cache to use paged_attention_common ( #32914 )
...
Signed-off-by: Samu Tamminen <stammine@amd.com >
Co-authored-by: Tuukka Sarvi <tuukka.sarvi@amd.com >
2026-04-01 03:30:19 +00:00
Ben Browning
cb0b443274
[Misc] Add 20 regression tests for 11 tool parser bug fixes ( #38172 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-04-01 03:00:31 +00:00
Luka Govedič
40bb175027
[vLLM IR] 1/N Implement IR skeleton and rms_norm op ( #33825 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Signed-off-by: Luka Govedic <luka.govedic@gmail.com >
Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com >
Co-authored-by: Luka Govedič <ProExpertProg@h100-01.nemg-001.lab.rdu2.dc.redhat.com >
2026-03-31 22:15:05 -04:00
Elvir Crnčević
0fab52f0aa
Fix NaN from stale FP4 scale padding in create_fp4_scale_tensor ( #38148 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-31 19:14:59 -07:00
Yifan Qiao
91e4521f9f
[Feat][v1] Simple yet General CPU KV Cache Offloading ( #37160 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai >
2026-03-31 17:58:37 -07:00
Stig-Arne Grönroos
31a719bcd3
[ROCm][perf] fix Aiter sparse MLA with MTP>1 ( #37887 )
...
Signed-off-by: Stig-Arne Grönroos <stig-arne.gronroos@amd.com >
Signed-off-by: Stig-Arne Grönroos <sgronroo@amd.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-31 19:22:23 -04:00
Vedant V Jhaveri
2e56975657
Generative Scoring ( #34539 )
...
Signed-off-by: Vedant Jhaveri <vjhaveri@linkedin.com >
Co-authored-by: Vedant Jhaveri <vjhaveri@linkedin.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-31 16:02:11 -07:00
Chang Su
36f1dc19ae
feat(grpc): add periodic stats logging and servicer log forwarding ( #38333 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
2026-03-31 15:50:07 -07:00
Asaf Gardin
3dc01ef352
[Quantization] Consolidate dummy format logic into DummyModelLoader ( #38637 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-03-31 22:20:45 +00:00
Yanan Cao
cc671cb110
[Kernel] [Helion] [17/N] Add Helion kernel torch.compile support ( #38592 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Sonnet 4 <noreply@anthropic.com >
2026-03-31 17:06:42 -04:00
Wentao Ye
856589ed9a
[Refactor] Remove dead code in kv connector and model runner ( #38383 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-31 17:05:23 -04:00
czhu-cohere
517b769b58
[Perf] Fix DBO overlap: capture DeepEP event before yield ( #38451 )
...
Signed-off-by: root <conway.zhu@cohere.com >
2026-03-31 20:38:59 +00:00
yzong-rh
d9b90a07ac
[MoE Refactor] Migrate Unquantized to Full Oracle Flow ( #36286 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: yzong-rh <yzong@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-31 15:43:33 -04:00
Olya Kozlova
598190aac3
[fix] Remove trtllm ragged mla prefills ( #36540 )
...
Signed-off-by: Olya Kozlova <okozlova@nvidia.com >
2026-03-31 12:30:27 -07:00
Xu Jinyang
b779eb3363
[Model] Sync upstream BT=chunk_size fix for GDN chunk_fwd_kernel_o, simplify warmup to single pass ( #38343 )
...
Signed-off-by: AuYang <459461160@qq.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
2026-03-31 23:03:24 +04:00
BadrBasowid
077a9a8e37
[torch.compile] Refactor Attention Quant Fusion Pass and Remove Boilerplate ( #37373 )
...
Signed-off-by: BadrBasowid <badr.basowid@gmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-31 14:15:50 -04:00
Run Yu
07edd551cc
[CI/Build] Resolve a dependency deadlock when installing the test dependencies used in CI ( #37766 )
...
Signed-off-by: Run Yu <yurun00@gmail.com >
2026-03-31 18:05:14 +00:00
mikaylagawarecki
7c080dd3c5
[4/n] Migrate FP4/W4A8 CUTLASS kernels to torch stable ABI ( #37503 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-31 10:21:13 -07:00
Yi Liu
0dd25a44ea
[Quantization][Autoround][XPU] Add W4A16 Support ( #37986 )
...
Signed-off-by: yiliu30 <yi4.liu@intel.com >
2026-03-31 16:48:24 +00:00
SandishKumarHN
3896e021a0
[Bugfix] Fix FusedMoE weight loading with padded hidden dimensions ( #37010 )
...
Signed-off-by: SandishKumarHN <sandish@fb.com >
2026-03-31 12:22:26 -04:00
zhang-prog
b6e636c12c
[Fix] handle PaddleOCR-VL image processor max_pixels across Transformers v4/v5 ( #38629 )
...
Signed-off-by: zhangyue66 <zhangyue66@baidu.com >
2026-03-31 15:50:41 +00:00
Jingu Kang
f1ff50c86c
[Bugfix] clamp dA_cumsum differences to prevent Inf in Mamba2 SSD kernels ( #37501 )
...
Signed-off-by: Jingu Kang <jg.k@navercorp.com >
2026-03-31 17:35:51 +02:00
Matthew Bonanni
757068dc65
[Bugfix][Async] Fix async spec decoding with hybrid models ( #38556 )
...
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com >
2026-03-31 11:08:54 -04:00
Nicolò Lucchesi
7337ff7f03
[Docs] PD with Nixl compat matrix ( #38628 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-31 15:01:21 +00:00
Kyle Sayers
5869f69c5f
[Online Quant] [QeRL] Minor code cleanup ( #38574 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-31 14:56:43 +00:00
wliao2
4dfad17ed1
replace cuda_device_count_stateless() to current_platform.device_count() ( #37841 )
...
Signed-off-by: Liao, Wei <wei.liao@intel.com >
Signed-off-by: wliao2 <wei.liao@intel.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-31 22:32:54 +08:00
wenjun liu
e8057c00bc
[CI] Avoid concurrent docker pull in intel XPU CI runners to prevent rate limit issues ( #38594 )
...
Signed-off-by: wendyliu235 <wenjun.liu@intel.com >
2026-03-31 22:23:18 +08:00
Nicolò Lucchesi
7430389669
[Bugfix][CI] Skip flaky test_eagle test ( #38566 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-31 09:42:37 -04:00
ElizaWszola
202f147cf2
Fix MLA runs when use_inductor_graph_partition=True ( #38631 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2026-03-31 13:37:43 +00:00
Jiangyun Zhu
ea7bfde6e4
[CI] fix LM Eval Qwen3.5 Models (B200) ( #38632 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-31 13:20:08 +00:00
sihao_li
d71a15041f
[XPU]move testing dependencies from Dockerfile to xpu-test.in ( #38596 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-31 12:49:43 +00:00
Ilya Markov
abdbb68386
[EPLB] Add alternative communication for EPLB weight exchange ( #33176 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Markov Ilya <markovilya19@gmail.com >
Co-authored-by: Markov Ilya <markovilya19@gmail.com >
2026-03-31 08:17:12 -04:00
liuzhenwei
0c63739135
[EPD] update EPD script arguments ( #36742 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
2026-03-31 12:02:09 +00:00
wang.yuqi
719735d6c5
[CI Failure] pin colmodernvbert revision ( #38612 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-31 10:54:54 +00:00
Maosheng Liao
aae3e688f8
Fix document of torchrun_example.py ( #31113 )
2026-03-31 10:54:23 +00:00
Matthew Bonanni
7d65463528
[WIP][CI][Bugfix] Fix test_run_eagle_dp ( #38584 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-31 12:30:25 +02:00
Mateusz Sokół
8278825b57
DOC: TPU mention fix ( #38129 )
...
Signed-off-by: Mateusz Sokół <mat646@gmail.com >
2026-03-31 03:27:56 -07:00
Chang Su
acf7292bf2
[Misc] Move --grpc CLI argument into make_arg_parser ( #38570 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
2026-03-31 03:24:05 -07:00
Chauncey
ce884756f0
[Feature]: add presence_penalty and frequency_penalty fields to Responses API ( #38613 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-31 08:45:57 +00:00
wang.yuqi
d9d21eb8e3
[Frontend][3/n] Improve pooling entrypoints | scoring. ( #28631 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-31 07:52:00 +00:00
Yintong Lu
f09daea261
[CPU] Support int8 compute mode in CPU AWQ ( #35697 )
...
Signed-off-by: Yintong Lu <yintong.lu@intel.com >
2026-03-31 15:27:37 +08:00
Kevin H. Luu
42318c840b
[ci] Remove benchmarks job ( #38611 )
2026-03-31 06:46:21 +00:00
zhangyiming
1ac6694297
[OOT] Add OOT support for linear kernel. ( #37989 )
...
Signed-off-by: menogrey <1299267905@qq.com >
2026-03-31 14:33:21 +08:00
Kfir Toledo
6cc7abdc66
[kv_offload+HMA] Fix num_blocks with different per-layer page sizes and improve assert message ( #38554 )
...
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-31 06:00:40 +00:00
Flora Feng
d53cb9cb8e
[Tool Parser][2/3] Use self.tools instead of request.tools in tool parsers ( #38189 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-31 13:41:36 +08:00
Louie Tsai
44eef0ca1e
vLLM Benchmark Suite perf regression after PR#32723 ( #38576 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-03-31 05:23:17 +00:00
Andreas Karatzas
b9cdc85207
[ROCm][CI] Fix Whisper translation test attention backend selection ( #38508 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-31 13:21:49 +08:00
Flora Feng
3e802e8786
[Mypy] Fix adjust_request typing ( #38264 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-31 04:21:18 +00:00
Martin Hickey
350af48e14
[KVConnector] Remove redundant method KVConnectorOutput::merge() ( #38546 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-31 07:11:02 +03:00
Lucas Kabela
e31915063d
[Bugfix] Fix for builtins (forward fix of pytorch/177558) ( #37234 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-31 01:08:11 +00:00
Flora Feng
29e48707e8
[Refactor] Consolidate Tool type alias in tool_parsers/utils.py ( #38265 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-31 00:55:51 +00:00
sungsoo ha
4ac227222f
[Bugfix][DCP] Fix CUDA graph capture for Decode Context Parallelism ( #36070 )
...
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-30 20:20:43 -04:00
Vadim Gimpelson
bb51d5b40d
Add @vadiklyutiy as committer ( #38589 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-31 07:50:04 +08:00
Prathmesh Bhatt
93b3ec1585
feat(attention): extract KV-cache update from FlashAttentionDiffKV ba… ( #36466 )
...
Signed-off-by: Prathmesh Bhatt <71340361+Prathmesh234@users.noreply.github.com >
2026-03-30 23:16:09 +00:00
Netanel Haber
e812bf70bd
Restore non-hf processor path for Nano-Nemotron-VL (bypass call_hf_processor_mm_only) - fixes #38018 ( #38567 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
2026-03-30 21:56:52 +00:00
SandishKumarHN
bcc6f67447
[Bugfix] Use null block (0) for padded block table entries ( #35431 )
...
Signed-off-by: SandishKumarHN <sandish@fb.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-30 14:02:51 -07:00
Asaf Gardin
1fc69f59bb
[Bug fix][Quantization] Fix dummy weight loading ( #38478 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-03-30 16:38:02 -04:00
Micah Williamson
d9c7db18da
[ROCm][CI] Pin test_hybrid test to TRITON_ATTN on ROCm ( #38381 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-30 20:26:46 +00:00
Ilya Markov
12701e8af2
[EPLB] Optmize eplb mapping and record in router for prefill ( #36261 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-03-30 19:48:33 +00:00
Benjamin Chislett
494636b29d
[Feat][Spec Decode] DFlash ( #36847 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-30 15:03:15 -04:00
mikaylagawarecki
ab1a6a43fa
[3/n] Migrate cutlass/scaled_mm_entry.cu torch stable ABI ( #37221 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-30 11:20:13 -07:00
fangyuchu
b5e608258e
[Refactor] Unify engine process monitoring in engine manager and add Ray backend support ( #35862 )
...
Signed-off-by: fangyuchu <fangyuchu@qq.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-30 10:16:09 -07:00
Matthew Bonanni
2c734ed0e0
[Bugfix][MLA] Change default SM100 MLA prefill backend back to TRT-LLM ( #38562 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-30 09:51:24 -07:00
Chendi.Xue
3b1dbaad4e
[HMA]Fix corner case when hybrid page_size can not be evenly divided issue (blk_size=64,tp=4) ( #37467 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-30 16:47:30 +00:00
Johnny
b4a2f3ac36
[NVIDIA] Bugfix NVFP4 DGX Spark and RTX50 ( #38423 )
...
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: Johnny <johnnynuca14@gmail.com >
2026-03-30 09:36:18 -07:00
roikoren755
8e6293e838
[Mamba] Add stochastic rounding support ( #35753 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-30 12:33:49 -04:00
Hongxia Yang
dbdd9ae067
[ROCm][Bugfix] fix exception related to trust_remote_code for MiniMax-M2.1-MXFP4 ( #37698 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-30 15:49:23 +00:00
Matthias Gehre
e8b055a5ac
[Bugfix] Handle ParallelLMHead in compressed-tensors get_quant_method ( #37291 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-30 07:30:52 -07:00
tomeras91
246dc7d864
[Misc] Add @tomeras91 as a maintainer of Nemotron related code + mamba block ( #38547 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-30 21:12:17 +08:00
Thomas Parnell
7c3f88b2a8
[Bugfix] Remove false-positive format mismatch warnings in FLA ops ( #38255 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-03-30 12:32:26 +00:00
Li, Jiang
6557f4937f
[Bugfix][CPU] Skip set_num_threads after thread binding ( #38535 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-30 20:13:00 +08:00
Andreas Karatzas
677424c7ac
[Core][CI] Add opt-in media URL caching via VLLM_MEDIA_CACHE ( #37123 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-30 04:58:53 -07:00
Collin McCarthy
1031c84c36
Fix ambiguous num_blocks for hybrid attn mamba ( #37236 )
...
Signed-off-by: Collin McCarthy <cmccarthy@nvidia.com >
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-30 11:09:45 +00:00
aliialsaeedii
7e76af14fa
[Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500 ( #38253 )
...
Signed-off-by: aliialsaeedii <ali.al-saeedi@nscale.com >
2026-03-30 10:26:46 +00:00
yzong-rh
3683fe6c06
[Bugfix] Fix shared-object aliasing in n>1 streaming with tool calls ( #38158 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
Signed-off-by: Yifan <yzong@redhat.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-30 10:12:13 +00:00
Nicolò Lucchesi
cc06b4e86b
[Mamba][Bugfix] Raise on insufficient cache blocks instead of silently capping cudagraph sizes ( #38270 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-30 09:41:50 +00:00
TJian
03ac6ca895
[ROCm] [DOC] Update the Documentation to include ROCm Nightly Wheel support ( #38457 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-30 02:25:46 -07:00
haosdent
a08b7733fd
[CI] Fix SPLADE pooler test broken by #38139 ( #38495 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-30 07:48:33 +00:00
Tan Pin Siang
85c0950b1f
[ROCm] Enable MORI EP for unquantized MoE with AITER backend ( #37529 )
...
Signed-off-by: Tan Pin Siang <pinsiang.tan@amd.com >
2026-03-30 15:19:33 +08:00
Juan Pérez de Algaba
57861ae48d
(security) Fix SSRF in batch runner download_bytes_from_url ( #38482 )
...
Signed-off-by: jperezde <jperezde@redhat.com >
2026-03-30 07:10:01 +00:00
Jee Jee Li
ac30a8311e
[Bugfix][Model] Fix PixtralForConditionalGeneration LoRA ( #36963 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-29 23:59:42 -07:00
PikaPikachu
63babd17f1
[Model][Quantization] Add GGUF support for MiniMax-M2.1 ( #36965 )
...
Signed-off-by: kangletian <Letian.Kang@amd.com >
2026-03-30 14:24:06 +08:00
Kevin H. Luu
fec5aeca12
[ci] Soft fail and disable retry for AMD build image job ( #38505 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-29 23:05:26 -07:00
Jaewon
d816834c1a
[MoE] Add RoutingMethodType.Simulated to TRT-LLM FP8/NVFP4 kernel allowlists ( #38329 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-29 22:53:43 -07:00
Roger Wang
92f0db57a8
[Misc] Always use forward_mulmat for Conv3d on newer versions of torch. ( #38487 )
2026-03-30 05:39:41 +00:00
Andreas Karatzas
bea23536f6
[CI] Add temperature=0.0, reduce max_tokens, and add debug prints to audio_in_video tests ( #38492 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-30 05:36:45 +00:00
Jiangyun Zhu
c133f33746
Add @ZJY0516 to CODEOWNERS ( #38497 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-29 21:10:00 -07:00
Stanislav Kirillov
a6db99ba02
[Bugfix] Support multi-type params parsing for DeepSeek v3.2 ( #33703 )
...
Signed-off-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-30 04:07:28 +00:00
Andreas Karatzas
4f2ed5fddb
[ROCm][CI] Enable hybrid chunked prefill test ( #38317 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-30 10:30:26 +08:00
Kyle Sayers
d28d86e8a3
[QeRL] Fix online quantized reloading ( #38442 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-29 14:56:41 -06:00
Wentao Ye
995dea1354
[Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement ( #38139 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-29 18:12:50 +00:00
allgather
8c0b6267d7
[Transformers v5] fix missing pixtral/voxtral multimodal dispatch ( #38410 )
...
Signed-off-by: allgather <all2allops@gmail.com >
2026-03-29 09:59:06 +00:00
Andreas Karatzas
43cc5138e5
[ROCm][CI] Fix cross-attention dispatch for encoder-decoder models ( #38450 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-28 22:08:03 -07:00
Shubhra Pandit
5b8c30d62b
[Spec Decode, BugFix] Propagate norm_before_fc from Eagle3 speculator ( #38111 )
...
Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com >
2026-03-29 00:42:06 +00:00
haosdent
d39b8daf5f
[Feature] Add Qwen3-ForcedAligner support via token classification pooling ( #35367 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-29 00:27:52 +00:00
Walter Beller-Morales
fafca38adc
[BugFix][Frontend] apply task instruction as system prompt in cohere v2/embed ( #38362 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-28 18:30:54 +00:00
Kunshang Ji
aa4eb0db78
[CI] Revert initialize_model context manager ( #38426 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-28 16:56:50 +00:00
Andreas Karatzas
af89140efc
[ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry ( #38415 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-29 00:47:42 +08:00
haosdent
b2bc736b12
[CI] Fix Ernie4.5-VL initialization test ( #38429 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-28 22:43:24 +08:00
whyiug
58c959a767
[Misc]: clean up non-core lint issues ( #37049 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
2026-03-28 10:28:16 -04:00
Bvicii
bda3eda82d
[Bugfix] Disallow renderer_num_workers > 1 with mm processor cache ( #38418 )
...
Signed-off-by: Bvicii <yizhanhuang2002@gmail.com >
2026-03-28 06:32:52 -07:00
Michael Goin
2bf5b70ae8
[CI Bugfix] Pre-download missing FlashInfer headers in Docker build ( #38391 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-28 06:09:00 -07:00
yzong-rh
6dad4c5722
[Test] Fix flaky race condition in test_abort_final_step ( #38414 )
...
Signed-off-by: Yifan <yzong@redhat.com >
2026-03-28 09:06:56 +00:00
Liwen
171775f306
Fix Device Index for ROCm Ray Workers in MoE Benchmark ( #38108 )
...
Signed-off-by: Liwen <53441624+li-liwen@users.noreply.github.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-28 08:27:11 +00:00
TJian
58a249bc61
[ROCm] [Release] Update ROCm variant from rocm700 to rocm721 ( #38413 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-28 06:07:03 +00:00
IriKa
148a5c1226
[Bugfix] Fix NaN/Inf output in Marlin when dtype=float16 ( #33972 )
...
Signed-off-by: IriKa Qiu <qiujie.jq@gmail.com >
2026-03-27 16:36:08 -07:00
Wei Zhao
b69bf2f0b1
[Perf] Use torch compile to fuse pack topk in trtllm moe ( #37695 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com >
2026-03-27 17:30:46 -06:00
rongfu.leng
88149b635e
Add nvidia h800 moe config ( #31201 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2026-03-27 16:28:48 -07:00
Hongxia Yang
83a4df049d
[ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips ( #38367 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-27 23:20:19 +00:00
Gregory Shtrasberg
731285c939
[ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 ( #38252 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-03-27 18:03:12 -05:00
Johnny
97d19197bc
[NVIDIA] Fix DGX Spark logic ( #38126 )
...
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com >
Signed-off-by: guillaume_guy <guillaume.guy@airbnb.com >
Signed-off-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Andreas Karatzas <akaratza@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Sathish Sanjeevi <SKPsanjeevi@users.noreply.github.com >
Co-authored-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: guillaume_guy <guillaume.guy@airbnb.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-27 15:26:07 -07:00
Giancarlo Delfin
384e4d5f48
[Model Runner V2] Rebuild attention metadata before eagle decode full… ( #38311 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-27 13:46:42 -07:00
Nicolò Lucchesi
44a6528028
[CI] Skip failing test ( #38369 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-27 13:25:19 -07:00
Kyle Sayers
648edcf729
[QeRL] Compose online quantization with quantized reloading ( #38032 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-27 13:22:33 -07:00
Michael Goin
7ba425e916
Add short flag -sc for --speculative-config argument ( #38380 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-27 12:04:22 -07:00
Gregory Shtrasberg
b8665383df
[ROCm] Fix GPT-OSS import for triton 3.6 ( #37453 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-03-27 18:00:57 +00:00
Rohan Potdar
0e9358c11d
[ROCm]: gpt-oss fusion/padding fixes ( #38043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: Andreas Karatzas <akaratza@amd.com >
2026-03-27 12:19:15 -04:00
Harry Mellor
21d2b53f88
Remove need for explicit \n in docstring lists for --help formatting ( #38350 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 08:38:00 -07:00
Jonas M. Kübler
98e7f223b9
enable skipping of SW attention layers when using FP8 KV cache ( #33695 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2026-03-27 07:25:02 -06:00
Juan Pérez de Algaba
b111f8a61f
fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit ( #37952 )
...
Signed-off-by: jperezde <jperezde@redhat.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2026-03-27 09:02:10 -04:00
Sage Moore
497e234d38
[EPLB] Cleanup the transfer logic for the various eplb maps ( #34520 )
...
Signed-off-by: Sage Moore <sagmoore@redhat.com >
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-27 10:18:46 +01:00
dtc
6287e7fa20
[P/D] Mooncake: Add unit tests and minor fixes for mooncake connector ( #36946 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
2026-03-27 09:26:40 +01:00
Shengqi Chen
84e439a9cb
[CI/Build] Move nightly wheel index generation to a single post-build step ( #38322 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-27 07:44:18 +00:00
Yuichiro Utsumi
a1746ff9ec
[Doc] Clarify Helm chart location in deployment guide ( #38328 )
...
Signed-off-by: Yuichiro Utsumi <utsumi.yuichiro@fujitsu.com >
Signed-off-by: Yuichiro Utsumi <81412151+utsumi-fj@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 15:43:02 +08:00
Flora Feng
aee4c14689
[Bugfix] Fix Hermes tool parser when stream interval > 1 ( #38168 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-27 14:42:26 +08:00
Bowen Bao
0ae89f18fd
[Refactor] Move FusedMoE hidden_size roundup to quant_method ( #34285 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-03-26 23:38:26 -07:00
wenjun liu
c2b17d71af
[CI] Add xpu auto-label rule for Intel GPU/XPU PRs ( #38320 )
...
Signed-off-by: wendyliu235 <wenjun.liu@intel.com >
2026-03-27 14:22:38 +08:00
Li, Jiang
becaed6ec8
[CPU] Support CT W4A16 on CPU MP kernel ( #38219 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-27 14:15:28 +08:00
Xiaoshuang Wang
a8eab8f30d
[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 ( #37975 )
...
Signed-off-by: wxsIcey <1790571317@qq.com >
Signed-off-by: Icey <1790571317@qq.com >
2026-03-27 14:13:21 +08:00
cjackal
2babac0bed
[frontend] dump openai responses type by alias ( #38262 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-03-27 05:58:20 +00:00
Or Ozeri
7cc302dd87
[kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models ( #37853 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-27 08:38:33 +03:00
Bvicii
999dfc1622
[Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop ( #34789 )
...
Signed-off-by: Bvicii <yizhanhuang2002@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-26 22:17:00 -07:00
wenjun liu
d86060122a
[CI/Build] enable Intel XPU test flow with prebuilt image ( #37447 )
...
Signed-off-by: wendyliu235 <wenjun.liu@intel.com >
2026-03-26 18:16:04 -07:00
Harry Mellor
f73bcb1c51
Various Transformers v5 config fixes ( #38247 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-26 23:06:59 +00:00
yzong-rh
28048bd6b0
[Bugfix] Add missing f-string prefix in xgrammar choices error message ( #38162 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-03-26 21:43:03 +00:00
Giancarlo Delfin
c32e97602d
[Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling ( #38045 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-26 13:38:12 -07:00
Wei Zhao
0904b6550d
Fix multi-node allreduce fusion ( #38136 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: root <root@theia0053.lyris.clusters.nvidia.com >
2026-03-26 20:24:36 +00:00
Stig-Arne Grönroos
f26fcdfb9e
[Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module ( #37547 )
...
Signed-off-by: Stig-Arne Grönroos <stig-arne.gronroos@amd.com >
2026-03-26 19:01:05 +00:00
TJian
bc9c6fbbe6
[ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline ( #38263 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-26 18:47:10 +00:00
Andreas Karatzas
bff9a1c266
[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers ( #38165 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 18:33:45 +00:00
Andreas Karatzas
db01535e2b
[ROCm][CI] Add uv pip compile workflow for rocm-test.txt lockfile ( #37930 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 12:44:01 -05:00
jennyyyyzhen
a4cf9b22ba
[ROCM][Bugfix] Use correct stride in cp_mha_gather_cache_kernel for hybrid model ( #37228 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
Co-authored-by: yZhen <yZhen@fb.com >
2026-03-26 10:33:39 -07:00
Andreas Karatzas
9c3ae04bfe
[ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355 ( #38155 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 16:51:18 +00:00
Andreas Karatzas
a8e48a7b85
[CI] Fix conch kernel crash on 3D input by reshaping to 2D before GEMM ( #38178 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 11:46:03 -05:00
Divakar Verma
b9dbc5c4ab
[Mamba][APC] Add test case to compare apc outputs ( #34977 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-26 16:40:35 +00:00
TJian
60af7b967b
[Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm ( #37283 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-26 16:32:25 +00:00
Andreas Karatzas
bdc1719eb9
[ROCm][CI] Fix AITER state leak in shared_fused_moe_routed_transform test ( #38137 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 09:26:46 -07:00
haosdent
0aac2048bf
[Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode ( #35175 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-26 16:13:39 +00:00
Chuan (Richard) Li
cb2263218e
[Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos ( #35886 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-26 11:59:24 -04:00
Wentao Ye
e054f152fa
[CI] Add batch invariant test for b200 ( #38014 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-26 11:54:54 -04:00
zhang-prog
0f5b526040
[Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility ( #38232 )
...
Signed-off-by: zhangyue66 <zhangyue66@baidu.com >
2026-03-26 15:34:49 +00:00
Zhewen Li
be1a85b7a2
Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" ( #38050 ) ( #38169 )
...
Co-authored-by: Zhewen Li <zhewenli@inferact.ai >
2026-03-26 07:59:09 -07:00
Cyrus Leung
2e225f7bd2
[Renderer] Consolidate factory methods ( #38218 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 12:19:22 +00:00
Jared Wen
757eafcf37
[bug-fix] GLM OCR Patch Merger context_dim ( #37962 )
...
Signed-off-by: JaredforReal <w13431838023@gmail.com >
2026-03-26 05:11:21 -07:00
wang.yuqi
dcdc145893
[CI] Reorganize scoring tests ( #38207 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-26 12:07:01 +00:00
Andreas Karatzas
f2d16207c7
[ROCm][CI] Fix flaky GPTQ compile correctness test ( #38161 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 19:57:00 +08:00
Andreas Karatzas
37a83007fe
[ROCm][CI] Fix wvSplitKrc mock argument order in test_rocm_unquantized_gemm ( #38167 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 19:54:59 +08:00
Wentao Ye
bf5eec638d
[Refactor] Remove unused utils ( #38153 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-26 17:08:19 +08:00
Mateusz Sokół
b1cb1d3d2c
DOC: Documentation pages fixes ( #38125 )
...
Signed-off-by: Mateusz Sokół <mat646@gmail.com >
2026-03-26 16:55:42 +08:00
Kunshang Ji
6ae8bbd0c2
[XPU] Disable xpu graph by default ( #38193 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-26 01:53:45 -07:00
Cyrus Leung
a9213c0ffe
[Doc] Fix outdated reference to CUDAGraphManager ( #38209 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 01:52:38 -07:00
Cyrus Leung
502c41a8f6
[Model] Use helper function to run MM processors with token inputs (where applicable) ( #38018 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 16:44:04 +08:00
Vadim Gimpelson
52069012fe
[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell ( #38083 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-26 01:21:47 -07:00
Fadi Arafeh
71161e8b63
[cpu][ci] remove soft-fail for Arm CI and add quant model tests ( #37691 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-26 07:03:31 +00:00
Terry Gao
38de822310
[Model] Add torch.compile support for InternVL vision encoder ( #38049 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
2026-03-25 23:52:29 -07:00
Jee Jee Li
2bfbdca23c
[Bugfix] Fix benchmark_fused_collective.py ( #38082 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-25 23:51:00 -07:00
Matej Rojec
2908094567
Add /v1/chat/completions/batch endpoint for batched chat completions ( #38011 )
...
Signed-off-by: Matej Rojec <64556640+MatejRojec@users.noreply.github.com >
2026-03-26 12:13:33 +08:00
BadrBasowid
e6bf9f15ec
[Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format ( #38092 )
...
Signed-off-by: BadrBasowid <Badr.Basowid@gmail.com >
Signed-off-by: BadrBasowid <61441185+BadrBasowid@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-25 21:11:43 -07:00
Woosuk Kwon
144030c84e
Relocate Encoder CUDA graph manager ( #38116 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-25 20:52:12 -07:00
Flora Feng
e2db2b4234
[Tool Parser][1/3] Pass tools to ToolParser constructor ( #38029 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-26 10:29:06 +08:00
Chauncey
87f05d6880
[Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder ( #38076 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-26 01:43:51 +00:00
Andreas Karatzas
36f6aede23
[Misc] Optimized check to encapsulate both CUDA and ROCm platforms ( #34549 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 09:43:07 +08:00
Xin Yang
9704a5c310
Disable dual stream execution of input projection for Qwen3 ( #38152 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-26 01:20:39 +00:00
Wei Zhao
74056039b7
Fix minimax m2.5 nvfp4 kv scales weight loading ( #37214 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-26 00:48:06 +00:00
Jacob Platin
d7d51a7ee5
[Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU ( #37348 )
...
Signed-off-by: Jacob Platin <jacobplatin@google.com >
2026-03-26 00:46:01 +00:00
Harry Mellor
3c3c084240
Various Transformers v5 fixes ( #38127 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-26 00:10:08 +00:00
Ekagra Ranjan
7b54f60db0
[Cohere] Enable Cohere-Transcribe ( #38120 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-25 16:13:51 -07:00
Rohan Potdar
a0e8c74005
[ROCm]: Update rope+kvcache fusion conditions and disable custom op by default ( #36716 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-25 20:58:44 +00:00
Guillaume Guy
70a2152830
[MultiModal] add support for numpy array embeddings ( #38119 )
...
Signed-off-by: guillaume_guy <guillaume.guy@airbnb.com >
Signed-off-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: guillaume_guy <guillaume.guy@airbnb.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-25 20:13:04 +00:00
Sathish Sanjeevi
978fc18bf0
[ROCm] Utilize persistent MLA kernel from AITER ( #36574 )
...
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com >
2026-03-26 03:00:42 +08:00
Andreas Karatzas
7d6917bef5
[ROCm] Fix MoE kernel test failures on gfx950 ( #37833 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-03-25 13:46:40 -05:00
Mark McLoughlin
e38817fadb
[Core][KV Connector] Remove use of num_cached_tokens in error handling ( #38096 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-25 18:20:48 +00:00
Nick Hill
72cad44d3c
[Frontend] Move APIServerProcessManager target server fn ( #38115 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-25 18:14:41 +00:00
Cyrus Leung
ba2f0acc2d
[Misc] Reorganize inputs ( #35182 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-25 10:22:54 -07:00
Yongye Zhu
678b3c99e8
[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration ( #38050 )
2026-03-25 10:16:40 -07:00
mikaylagawarecki
bf4cc9ed2d
[2/n] Migrate per_token_group_quant to torch stable ABI ( #36058 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-25 10:15:13 -07:00
Ben Browning
1ac2ef2e53
[CI/Docs] Improve aarch64/DGX Spark support for dev setup ( #38057 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 09:24:42 -07:00
Richard Zou
6e37c46b35
[compile] Add some more startup tests for top models ( #38046 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-25 12:02:22 -04:00
Wentao Ye
1bf2ddd0ee
[Refactor] Rename WAITING_FOR_FSM to WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR ( #38048 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-25 11:41:44 -04:00
Necofish
e7221180e1
[Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM ( #37970 )
...
Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-25 08:20:04 -07:00
RobTand
4a76ad12e0
[Bugfix] Preserve CUDA arch suffix (a/f) for SM12x — fixes NVFP4 NaN on desktop Blackwell ( #37725 )
...
Signed-off-by: Rob Tand <robert.tand@icloud.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-03-25 08:18:25 -07:00
Wentao Ye
d7e93e13fb
[Feature] EPLB Support for GPU Model Runner v2 ( #37488 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-25 08:16:39 -07:00
Andrii Skliar
cd7643015e
[Feature] Support per-draft-model MoE backend via --speculative-config ( #37880 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Signed-off-by: [Andrii Skliar] <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-03-25 14:31:52 +00:00
Ben Browning
a1a2566447
[Docs] Add guide for editing agent instruction files ( #37819 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-25 13:54:09 +00:00
yjz
b745e8b5d3
[KVTransfer][Mooncake] Add heterogeneous TP support for disaggregated P/D in MooncakeConnector ( #36869 )
...
Signed-off-by: JianDan0212 <zhangyj0212@gmail.com >
2026-03-25 14:24:07 +01:00
Harry Mellor
d215d1efca
[Mypy] Better fixes for the mypy issues in vllm/config ( #37902 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 06:14:43 -07:00
Fadi Arafeh
34d317dcec
[CPU][UX][Perf] Enable tcmalloc by default ( #37607 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-25 20:39:57 +08:00
grYe99
7ac48fd357
[Model] Add AutoWeightsLoader support for jais ( #38074 )
...
Signed-off-by: grYe99 <guorongye99@gmail.com >
Co-authored-by: grYe99 <guorongye99@gmail.com >
2026-03-25 12:38:40 +00:00
Harry Mellor
d6bb2a9d9a
Fix Plamo 2/3 & LFM2 for Transformers v5 ( #38090 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 12:29:49 +00:00
Harry Mellor
1e673a43ce
Better weight tying check for multimodal models ( #38035 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 12:07:23 +00:00
Andreas Karatzas
04417ecd5f
[ROCm][CI] Rename filepath test to point to correct file ( #38102 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 20:05:46 +08:00
R0CKSTAR
242c93f744
[Docs] Adds vllm-musa to custom_op.md ( #37840 )
...
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
2026-03-25 11:54:36 +00:00
Matthias Gehre
a889b7f584
[Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 ( #37280 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-25 11:42:58 +00:00
Harry Mellor
ba2910f73a
Fix offline mode test for Transformers v5 ( #38095 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 11:39:48 +00:00
Andreas Karatzas
f262a62aa1
[ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test ( #37616 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 10:55:51 +00:00
Andreas Karatzas
9ac2fcafbb
[CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors ( #37483 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 11:24:33 +01:00
Kunshang Ji
e9ae3f8077
[Hardware][XPU] Align memory usage with cuda on xpu ( #37029 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-25 18:14:29 +08:00
Andreas Karatzas
04cec4f927
[ROCm][CI] Increase OpenAPI schema test timeouts ( #38088 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 18:06:58 +08:00
Kunshang Ji
14771f7150
[XPU] support MLA model on Intel GPU ( #37143 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-25 17:43:42 +08:00
Gregory Shtrasberg
189ddefbfd
[ROCm] Attention selector reordering ( #36702 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2026-03-25 17:42:56 +08:00
Chauncey
09c3dc9186
[Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function ( #37968 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-25 06:19:37 +00:00
vllmellm
42e9547976
[ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test ( #37640 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-25 05:06:15 +00:00
Chauncey
a32783bb35
[Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser ( #37958 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-25 12:06:21 +08:00
Baorun (Lauren) Mu
9d0351c91d
[Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc ( #37914 )
...
Signed-off-by: Baorun Mu <bmu@nvidia.com >
2026-03-24 19:53:24 -07:00
Artem Perevedentsev
a93a53f8a1
[Performance] Auto-enable prefetch on NFS with RAM guard ( #37673 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-24 17:31:14 -07:00
Andreas Karatzas
679c6a3ecc
[Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 ( #37787 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 08:17:33 +08:00
Andreas Karatzas
8bbb7c7f20
[ROCm][CI][PD] Add Hybrid SSM integration tests to CI ( #37924 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 07:58:39 +08:00
Kevin H. Luu
af945615b5
[release] Move the rest of release jobs to release queue ( #38044 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-24 16:40:58 -07:00
Terry Gao
82580b10ac
[Perf] Disable inductor runtime asserts by default for serving perfor… ( #37485 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
Co-authored-by: Tianren Gao <tianren@fb.com >
2026-03-24 19:37:51 -04:00
Netanel Haber
a0d487b2e1
nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths ( #37903 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-24 23:25:56 +00:00
Junhao
b73b5b0629
Make microbatch optimization (DBO) work with general models ( #37926 )
...
Signed-off-by: Junhao Li <junhao@ubicloud.com >
2026-03-24 14:40:08 -07:00
Michael Goin
0f0e03890e
[UX] Add flashinfer-cubin as CUDA default dep ( #37233 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-24 14:13:08 -07:00
Woosuk Kwon
4b53740d7f
[MRV2] Fix for DS v3.2 ( #38030 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-24 14:03:24 -07:00
Nick Hill
4e824d1c83
[Model Runner V2][Minor] Simplify PP logic ( #38031 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-24 13:57:17 -07:00
amey asgaonkar
0c1809c806
Add Ubuntu 24.04 support for Docker builds ( #35386 )
...
Signed-off-by: aasgaonkar <aasgaonkar@nvidia.com >
2026-03-24 13:34:44 -07:00
liangel-02
8c47fdfdb1
[FlexAttention] allow custom mask mod ( #37692 )
...
Signed-off-by: Angel Li <liangel@meta.com >
2026-03-24 16:03:24 -04:00
Javier De Jesus
54b0578ada
[Bugfix] Pass hf_token through config loading paths for gated model support ( #37920 )
...
Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com >
2026-03-24 15:22:05 -04:00
Richard Zou
89f572dbc0
[BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 ( #38015 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-24 19:08:26 +00:00
Richard Zou
71a4a2fbd0
[BugFix] Fix order of compile logging ( #38012 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-24 18:58:18 +00:00
Nick Cao
935c46dd9b
[Model] Add Granite 4.0 1B speech to supported models ( #38019 )
...
Signed-off-by: Nick Cao <ncao@redhat.com >
2026-03-24 18:23:41 +00:00
Willy Hardy
057fc94cbd
[Bugfix] Fix structured output crash on CPU due to pin_memory=True ( #37706 )
...
Signed-off-by: Willy Hardy <whardy@redhat.com >
Signed-off-by: Will Hardy <whardy@redhat.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-24 17:44:17 +00:00
Vineeta Tiwari
b58c5f28aa
docs: fix broken offline inference paths in documentation ( #37998 )
...
Signed-off-by: Vineeta Tiwari <vineeta.tiwari2@ibm.com >
Signed-off-by: Vineeta Tiwari <vineetatiwari2000@gmail.com >
Co-authored-by: Vineeta Tiwari <vineeta.tiwari2@ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-24 17:35:14 +00:00
Ming Yang
c07e2ca6e0
Fix Mamba state corruption from referencing stale block table entries ( #37728 )
2026-03-24 10:29:59 -07:00
Dhruv Singal
4df5fa7439
[Bugfix] Force continuous usage stats when CLI override is enabled ( #37923 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: OpenCode <noreply@openai.com >
2026-03-24 10:29:50 -07:00
sihao_li
a5416bc52e
[XPU] Support Intel XPU hardware information collection in usage stats ( #37964 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-03-24 10:29:17 -07:00
Harry Mellor
b3601da6e7
[Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) ( #37904 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 17:14:01 +00:00
Dan Blanaru
dc78c2c933
[Core] add option to schedule requests based on full ISL ( #37307 )
...
Signed-off-by: Dan Blanaru <48605845+DanBlanaru@users.noreply.github.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-24 13:01:12 -04:00
Sungjae Lee
4731884796
[Feature] limit thinking tokens (hard limit) ( #20859 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com >
Signed-off-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 09:53:07 -07:00
Harry Mellor
8de5261e69
Update new contributor message ( #37999 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 16:01:41 +00:00
wang.yuqi
1b6cb920e6
[Deprecate] Deprecate pooling multi task support. ( #37956 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-24 14:07:47 +00:00
Li, Jiang
352b90c4a4
[Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU ( #37987 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-24 07:00:20 -07:00
Sage
1c0aabdeb0
[Bugfix] Suppress spurious CPU KV cache warning in launch render ( #37911 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-24 12:36:18 +00:00
Ilya Markov
14acf429ac
[EPLB] Remove main waits in case of slow EPLB ( #36271 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-03-24 11:50:44 +00:00
Harry Mellor
ce57fd5557
[Docs] Fix build ( #37991 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 03:20:49 -07:00
Flora Feng
2e67fa756d
Fix tool_parser_cls type annotation from Callable to type[ToolParser] ( #37957 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-23 22:58:27 -07:00
Ronen Schaffer
e3c6c10cad
[KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package ( #37874 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-03-24 07:02:51 +02:00
jetxa
16a664df24
[Frontend][Bugfix] Pass default_chat_template_kwargs to AnthropicServingMessages ( #37899 )
...
Signed-off-by: jetxa <jetxzhang@outlook.com >
2026-03-24 05:00:12 +00:00
Kevin H. Luu
7281199a8c
[release] Move agent queue to Release cluster queues ( #37783 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-23 20:36:47 -07:00
Kevin H. Luu
b2dd75eb48
Downsize CPU jobs to use small queue ( #37913 )
...
Signed-off-by: khluu <khluu000@gmail.com >
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-23 20:36:37 -07:00
Wentao Ye
c59a132f96
[V0 Deprecation] Refactor kv cache from list to element ( #37487 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 20:10:11 -07:00
Andreas Karatzas
de99d91ece
[ROCm][CI] Split Entrypoints Integration (API Server 1) into 3 jobs ( #37906 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-24 09:48:37 +08:00
Wentao Ye
83c9d525b6
[CI] Add batch invariant test: Block FP8 + small MOE ( #37895 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 21:16:14 -04:00
Giancarlo Delfin
8f4824b664
[Model Runner V2] Gather multimodal embeddings before draft model postprocess ( #37932 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-23 18:14:13 -07:00
roikoren755
56777b5c89
[Test] E2E Nemotron-3-Super tests ( #36803 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-23 17:49:56 -07:00
Kevin H. Luu
2488a82f89
[CI] Split V1 Others into 3 separate jobs ( #37016 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-24 06:44:38 +08:00
Ranran
dc6908ac6a
[Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning ( #35007 )
...
Signed-off-by: Ranran <1012869439@qq.com >
Signed-off-by: Ranran <hzz5361@psu.edu >
Signed-off-by: ran <hzz5361@psu.edu >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-03-23 18:31:14 -04:00
yzong-rh
e85f8f0932
[Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts ( #36728 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-03-23 17:02:57 -04:00
Robert Shaw
5bf3c42d4c
[Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision ( #36725 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-23 20:19:06 +00:00
Kyle Sayers
38364a7e32
[Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels ( #36799 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-23 16:03:29 -04:00
Matthew Bonanni
fafe76b4af
[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding ( #32951 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
2026-03-23 15:37:22 -04:00
Woosuk Kwon
ffb5b32b5f
[MRV2] Consider spec decoding in warmup ( #37812 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-23 17:45:43 +00:00
Kunshang Ji
91fd695b75
[CI] split Entrypoints Integration (API Server 1) into 3 jobs ( #37882 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-23 10:37:56 -07:00
Nicolò Lucchesi
1cbbcfe8a3
[CI][PD] Add Hybrid SSM integration tests to CI ( #37657 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-23 23:58:19 +08:00
Angela Yi
aceadb5ee1
Use lazy graph module during split_module to defer recompile() ( #37609 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-03-23 11:21:29 -04:00
Yufeng He
ec2280611a
[Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding ( #37884 )
2026-03-23 15:15:12 +00:00
yanghui1-arch
7151ae6528
[Bugfix] RoBERTa position_id accumulation in CUDA graph padding region ( #37873 )
...
Signed-off-by: dass90 <3053034939@qq.com >
2026-03-23 14:59:21 +00:00
Wentao Ye
45bd5c8e75
[Mypy] Fix mypy for vllm/config ( #37808 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 14:33:59 +00:00
Zhaodong Bing
10a1018c12
[ROCm] fix sleep mode not releasing GPU memory problem on ROCm ( #37533 )
...
Signed-off-by: bingzhaodong <aaab8b@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-03-23 06:07:19 -07:00
Jee Jee Li
aec2dc6c0d
[Bugfix][LoRA] Fix incorrect LoRA Log ( #37877 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-23 11:42:52 +00:00
DorBernsohn
7938d12119
[Bugfix] Fix CPU backend crash in KV cache block zeroing ( #37550 )
...
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com >
2026-03-23 11:35:45 +00:00
Kunshang Ji
debd6e768c
[XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle ( #37784 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-23 11:10:41 +00:00
Andrew Xia
9ace378a63
[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request ( #37498 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-23 09:58:08 +00:00
Kunshang Ji
27d5ee3e6f
[FP8]add FP8 WoQ kernel abstraction. ( #32929 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-03-23 09:47:47 +00:00
wangxiyuan
35141a7eed
[Misc]Update gitignore ( #37863 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-03-23 01:14:10 -07:00
Chuan (Richard) Li
e99fb98867
[ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs ( #36100 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-23 15:48:31 +08:00
Artem Perevedentsev
a16133a0f1
[Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 ( #37338 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-23 00:37:58 -07:00
Hojin Yang
54ab804e87
[Bugfix] Store Qwen3Next A_log in fp32 ( #37810 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-23 15:36:57 +08:00
r266-tech
02e6efe56d
[Bugfix] JAIS: Only apply ALiBi when position_embedding_type='alibi' ( #37820 )
...
Co-authored-by: r266-tech <r266-tech@users.noreply.github.com >
2026-03-23 07:36:34 +00:00
Matthias Gehre
410d300893
[ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel ( #36505 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-23 15:36:08 +08:00
Yan Ma
d3fe857135
update doc for online fp8 quantization ( #37851 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-23 05:19:03 +00:00
Baorun (Lauren) Mu
f85e479e66
[Feature] ViT Full CUDA Graph ( #35963 )
...
Signed-off-by: Baorun Mu <bmu@nvidia.com >
2026-03-23 13:01:10 +08:00
Jee Jee Li
1f0d210641
[CI/Build][LoRA] Update Qwen35 LoRA testing ( #37816 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-23 12:55:49 +08:00
Ben Browning
3bbe2e1e6e
[Test] Consolidate tool parser unit tests to tests/tool_parsers ( #37834 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-23 04:24:25 +00:00
Augusto Yao
6e04e79326
always use embed&token_classify for bge-m3 ( #37632 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-23 03:10:57 +00:00
Lasha Koroshinadze
e7767eccae
Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling ( #37643 )
...
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com >
2026-03-23 10:29:07 +08:00
Woosuk Kwon
43877a620b
[MRV2] Enable PP CUDA graph test ( #37830 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 16:30:25 -07:00
zhanqiuhu
63f49b8bd4
[Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism ( #35162 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 20:48:25 +00:00
Woosuk Kwon
a5e9d511de
[MRV2] Use FP64 for Gumbel noise ( #37798 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 12:28:10 -07:00
Yongye Zhu
c058ff44d4
[Bugfix] Fix lora test by passing padded size back to the layer ( #37811 )
2026-03-22 13:20:13 -06:00
Woosuk Kwon
ce9b1d76cf
[MRV2] Skip hidden states allocation for PW CUDA graphs ( #37818 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 11:47:21 -07:00
Netanel Haber
e74c17e153
Enable NemotronHPuzzle + NemotronHMTP ( #37803 )
2026-03-22 15:13:58 +00:00
Wentao Ye
eaf4978621
[Test] Only Run MLA model when user explicitly set for batch invariance ( #37719 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-22 09:09:12 -04:00
Wentao Ye
77d24c4bfe
[Bug] Fix fp8 deepgemm batch invariant ( #37718 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-22 08:57:20 -04:00
Giancarlo Delfin
b3e846017d
[Model Runner V2] Support multi-modal embeddings for spec decode model ( #36097 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 02:48:43 -07:00
Andreas Karatzas
cd1242d82a
[ROCm][CI] Stabilize ROCm speech-to-text translation test with lower min acc threshold ( #37723 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 17:32:08 +08:00
Robert Shaw
4383f1532e
[MoE] Move PF Methods to Folder ( #35927 )
2026-03-22 02:42:59 -06:00
Andreas Karatzas
6eedec6e36
[ROCm][CI] Make some duplicated tests optional so that they are only evaluated in our nightly ( #37780 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:03:18 +08:00
Andreas Karatzas
ffc8531524
[ROCm][CI] Added missing resampy dependency for MM audio tests ( #37778 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:41 +08:00
Andreas Karatzas
6ecba840d7
[ROCm][CI] get_cu_count was renamed to num_compute_units in #35042 ( #37764 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:21 +08:00
Andreas Karatzas
3b06c55c78
[ROCm][CI] Fix MEGA_AOT_ARTIFACT fallback when PyTorch < 2.10.0 lacks AOT support ( #37763 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:03 +08:00
Yang Liu
b050700462
[Perf] Optimize glm4.xv VIT ( #37779 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
2026-03-22 06:12:34 +00:00
Andreas Karatzas
5dac719b2b
[Bugfix] Handle libsndfile sf_error(NULL) race condition in audio fallback ( #37782 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 13:37:29 +08:00
Andreas Karatzas
c862481c02
[CI] Skip ISAAC multimodal tests due to broken upstream HF model weights ( #37781 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 13:23:32 +08:00
Andreas Karatzas
c86b17cfe6
[ROCm][CI] Add large_gpu_mark to test_max_tokens_none for ROCm ( #37717 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 12:25:16 +08:00
Andreas Karatzas
66f927f205
[Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing ( #37775 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 03:22:24 +00:00
Andreas Karatzas
e78bc74268
[ROCm][CI] close missing quote in kernels/moe block in run-amd-test.sh ( #37774 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 09:42:34 +08:00
Robert Shaw
6b2fa3a762
[MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ ( #37759 )
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-03-21 19:15:16 -04:00
Robert Shaw
eeee5b262d
[Quantization][Deprecation] Remove PTPC FP8 ( #32700 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-21 22:10:16 +00:00
Robert Shaw
5ad0446572
Revert "Consolidate AWQ quantization into single awq_marlin.py file" ( #37768 )
2026-03-21 17:20:41 -04:00
Robert Shaw
8cc700dd6a
Consolidate AWQ quantization into single awq_marlin.py file
...
Merge awq.py and awq_marlin.py into a single file, eliminating the
circular import between them. awq.py becomes a backward-compat shim.
Follows the same structure as gptq_marlin.py.
Co-authored-by: Claude
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-03-21 17:09:17 -04:00
Brandon Pelfrey
80b70884eb
Add tensor IPC transfer mechanism for multimodal data ( #32104 )
...
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com >
Signed-off-by: Brandon Pelfrey <brandonpelfrey@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-21 20:10:20 +00:00
Mohammad Miadh Angkad
61e381dcf0
[Perf] Add SM 10.3 (B300/GB300) all-reduce communicator tuning ( #37756 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-03-21 19:43:47 +00:00
Mohammad Miadh Angkad
88f1b374f5
[Core] Enable allreduce fusion by default for SM 10.3 (B300/GB300) ( #37755 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-03-21 19:40:37 +00:00
Francesco Fusco
298e510848
[Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() ( #37318 )
...
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com >
2026-03-21 09:29:43 +00:00
Chaitanya Sri Krishna Lolla
3982bc2cd0
[ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs. ( #34692 )
...
Signed-off-by: Tej Kiran <vpolamre@amd.com >
Co-authored-by: Tej Kiran <vpolamre@amd.com >
2026-03-21 00:32:31 -07:00
Andreas Karatzas
02eec7ecbe
[ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list (MI355) ( #37721 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 15:27:12 +08:00
Bongwoo Bak
17ee641c45
[Responses API] Add kv_transfer_params for PD disaggregation ( #37424 )
...
Signed-off-by: bongwoobak <bongwoobak@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-21 13:48:54 +08:00
Andreas Karatzas
0d50fa1db6
[ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 ( #37610 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 12:57:25 +08:00
Simon Mo
1fa1e53a73
Revert "[compile] Initialize passes at VllmBackend init" ( #37733 )
2026-03-20 21:35:49 -07:00
Andreas Karatzas
3ffa52009f
[ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds ( #37617 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 11:58:58 +08:00
Yongye Zhu
87bd91892f
[MoE Refactor] Mxfp4 oracle rebased ( #37128 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-21 03:37:04 +00:00
Isotr0py
c7f98b4d0a
[Frontend] Remove librosa from audio dependency ( #37058 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-21 11:36:15 +08:00
tmm77
1c472f8fe1
Add get_device_uuid for rocm ( #37694 )
...
Signed-off-by: Tiffany Mintz <Tiffany.Mintz@amd.com >
2026-03-21 11:33:16 +08:00
Itay Alroy
c57d38d603
elastic_ep: Fix issues with repeated scale up/down cycles ( #37131 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-03-20 23:13:02 +00:00
Kaihang Jiang
e5ed6c6c13
[BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks ( #37475 )
...
Signed-off-by: Kaihang Jiang <kaihangj@nvidia.com >
2026-03-20 16:14:55 -06:00
Wentao Ye
b3d0b37908
[Refactor] Remove unused dead code ( #36171 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 16:12:51 -06:00
Santino Ramos
85f671b8e1
[Model Runner V2] Support Streaming Inputs ( #37028 )
...
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com >
2026-03-20 20:42:25 +00:00
Andreas Karatzas
8bc6b5cdb0
[ROCm][CI] Setting some mi325_4 tests back to optional (in parity with upstream) ( #37711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 12:25:08 -07:00
Vadim Gimpelson
4f16ebbbd3
[Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing ( #37591 ) ( #37605 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-20 12:19:26 -07:00
Angela Yi
12fd17eb51
[compile] Initialize passes at VllmBackend init ( #35216 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-03-20 11:40:33 -07:00
Cyrus Leung
37aadf6237
[Model] Update Kimi-K25 and Isaac processors to fit HF-style ( #37693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-20 18:30:22 +00:00
Le Yang
d7d2b5e405
[Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… ( #37565 )
...
Signed-off-by: Young-Leo <562593859@qq.com >
2026-03-20 18:28:34 +00:00
SherryC41
6ec5e9fd37
refactor: abstract deepgemm support into platform ( #37519 )
...
Co-authored-by: sherryC41 <sherry.c.c41@gmail.com >
2026-03-20 17:54:08 +00:00
Lucas Wilkinson
e1d85e5c24
[Attention] Support distinguishing between short extends and decodes ( #37303 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-03-20 10:49:36 -07:00
Peter Pan
79eb9369c5
fix CUDAGraph memory being counted twice ( #37426 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-20 17:36:32 +00:00
Woosuk Kwon
e80cfe575d
[MRV2] Avoid recompilation of _gather_block_tables_kernel ( #37645 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-20 10:31:45 -07:00
Xin Yang
d0532bf38d
[Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels ( #37683 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-20 11:28:41 -06:00
Andreas Karatzas
fb4e8bf442
[ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests ( #37613 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 10:16:59 -07:00
Harry Mellor
6ade4bc5a5
Fix various config related issues for Transformers v5 ( #37681 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-20 16:30:12 +00:00
Zhengxu Chen
2e089b96a8
[compile] Add compiled artifact counter for VLLM_USE_MEGA_AOT_ARTIFACT=1. ( #37589 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-20 16:22:46 +00:00
Martin Hickey
880be2b1b8
[Metrics] Some small refactoring for better maintainability ( #33898 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-20 16:11:34 +00:00
Zhengxu Chen
c0f5fae601
[compile] Fix aot test failures with torch 2.12. ( #37604 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-20 16:06:29 +00:00
Rémi Delacourt
aa84e43ccb
[Pixtral] Enable Pixtral language model support Eagle3 ( #37182 )
...
Signed-off-by: remi <remi@mistral.ai >
2026-03-20 15:50:15 +00:00
Matthias Gehre
5e806bcf54
[Bugfix] Fix ConchLinearKernel channelwise quantization (group_size=-1) ( #37329 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-20 10:32:21 -05:00
Matthias Gehre
56a62c310c
[Bugfix] Reject channelwise quantization (group_size <= 0) in ExllamaLinearKernel ( #37331 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-20 10:31:57 -05:00
L.B.R.
1779c09898
[ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode ( #34709 )
...
Signed-off-by: L.B.R. <lbr@mmonad.com >
Co-authored-by: L.B.R. <lbr@mmonad.com >
2026-03-20 10:11:23 -05:00
xuebwang-amd
44eea10f68
[ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization ( #36232 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-03-20 10:10:03 -05:00
Ilya Boytsov
8b6c6b9505
[Model] Add LFM2-ColBERT-350M support ( #37528 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
2026-03-20 14:57:57 +00:00
Harry Mellor
9f6d9dd371
Fix attribute error in isaac_patch_hf_runner ( #37685 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-20 14:49:40 +00:00
Jee Jee Li
dd20ee4e3e
[UX] Enable torch_profiler_with_stack ( #37571 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-20 11:17:26 +00:00
Chauncey
0523449c9c
[Misc] Use logger.info_once for auto tool choice log message ( #37661 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-20 10:40:36 +00:00
Flora Feng
b4c1aef21c
[Refactor] Relocate tests from tests/v1/entrypoints/ to tests/entrypoints/ ( #37500 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:50:34 -07:00
Flora Feng
6050b93bed
[Refactor] Move serve entrypoint tests under tests/entrypoints/serve/ ( #37595 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:10:47 -07:00
Andreas Karatzas
5a4a179591
[ROCm][CI] Fix granite_speech test for gfx90a by selecting compatible attention backend ( #37611 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:07:26 +08:00
Andreas Karatzas
37cd9fc107
[ROCm][CI] Remove deepep DBO tests on gfx90a ( #37614 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:07:07 +08:00
Andreas Karatzas
9cfd4ebb5e
[ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list ( #37619 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:06:53 +08:00
wang.yuqi
ed359c497a
[Model] Deprecate the score task (this will not affect users). ( #37537 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-20 08:07:56 +00:00
Giancarlo Delfin
dcee9be95a
[Model Runner V2] Fix draft logits not populated during cudagraph replay ( #37639 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-20 07:43:47 +00:00
Andreas Karatzas
bd8c4c0752
[CI] Removing deprecated rlhf examples reference ( #37585 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 15:20:33 +08:00
Wei Zhao
0140eafb15
[Bug] Fix FlashInfer allreduce fusion workspace uninitialized error ( #37461 )
...
Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: <>
Co-authored-by: root <root@prenyx0169.a51.clusters.nvidia.com >
Co-authored-by: root <root@prenyx0042.a51.clusters.nvidia.com >
2026-03-20 03:09:21 -04:00
Kunshang Ji
bdf6a0a57b
[XPU] bump vllm-xpu-kernels to v0.1.4 ( #37641 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-20 15:04:38 +08:00
Wangbei25
0674d1fee7
[PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder ( #37293 )
...
Signed-off-by: Wangbei25 <wangbei41@huawie.com >
Signed-off-by: Wangbei25 <wangbei41@huawei.com >
Co-authored-by: Wangbei25 <wangbei41@huawie.com >
2026-03-20 06:24:07 +00:00
Cyrus Leung
30108fc8b0
[Model] Refactor Step3-VL processor to HF style ( #37579 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-20 06:05:08 +00:00
Flora Feng
e2d1c8b5e8
[Refactor] Relocate entrypoint tests to match serving code structure ( #37593 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 05:31:23 +00:00
Huanxing
6951fcd44f
[XPU] Automatically detect target platform as XPU in build. ( #37634 )
...
Signed-off-by: huanxing <huanxing.shen@intel.com >
2026-03-20 13:30:15 +08:00
Giancarlo Delfin
39474513f6
[Model Runner V2] fix draft attention metadata generation ( #37364 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-19 21:05:15 -07:00
Yuxiang Liang
638a872d77
fix(xpu): Re-compute compile ranges after platform-specific config updates ( #37523 )
...
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com >
Signed-off-by: Yuxiang Liang <yuliang@habana.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-20 03:52:35 +00:00
Flora Feng
9040151fe1
[V0 Deprecation] Deprecate --disable-frontend-multiprocessing ( #37612 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 11:31:43 +08:00
Jee Jee Li
8fbe3f303f
[Bugfix][LoRA] Fix Qwen35 LoRA ( #36976 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-20 11:09:32 +08:00
Xiao
ea2c148fa7
[compile][graph_partition]Add tensor size handling ( #36038 )
...
Signed-off-by: Xiao Fu <xiaofu@meta.com >
2026-03-19 19:55:25 -07:00
Tianmu Li
47b7af0d87
[Feat] Enable CompressedTensorW4A8Int for XPU ( #37207 )
...
Signed-off-by: Li, Tianmu <tianmu.li@intel.com >
2026-03-20 02:34:28 +00:00
tianshu-Michael-yu
269bf46d99
fix: disambiguate multimodal prefix cache keys ( #36708 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-03-20 10:33:20 +08:00
Flora Feng
e5a77a5015
[CI] Update mergify tool-calling label paths ( #37478 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:22:23 +00:00
Itay Alroy
ca1ac1a4b4
Fix DP coordinator ZMQ port TOCTOU ( #37452 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-20 00:58:31 +00:00
Divakar Verma
4ca3fa6bb4
[ROCm][Bugfix] fix cache block size mismatch for aiter unified attention ( #37606 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-20 00:00:08 +00:00
Flora Feng
be12afd284
[Bugfix] Fix Deepseekv32 tool parser when stream interval > 1 ( #36056 )
2026-03-19 19:51:25 -04:00
Wentao Ye
df3c0291a3
[Bug] Fix EmbedIOprocessor "classify" <-> "embed" ( #37573 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 07:40:10 +08:00
Wentao Ye
2be1a0f74b
[Refactor] Remove dead code in pooling model ( #37572 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 07:39:43 +08:00
Jim Smith
4120a05ff1
Fix AttributeError in Qwen3.5 GDN layers with quantized models ( #37448 )
...
Signed-off-by: Jim Smith <jim@joshua8.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
2026-03-19 19:21:14 -04:00
rasmith
98ff042917
[CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since it's not necessary ( #36996 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-20 07:12:45 +08:00
Artem Perevedentsev
b55156eae9
[Performance] Enable Triton autotuning disk cache by default ( #37188 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-19 17:36:28 -04:00
Laith Sakka
112944fab9
test Qwen/Qwen3-4B-Instruct-2507 for unbacked ( #36064 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-03-19 17:28:45 -04:00
bnellnm
91be5f9be3
[MoE Refactor] Rename "naive" all2all backend ( #36294 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-03-19 15:50:34 -04:00
Aaron Hao
4ee847e400
Comment fix for async rl example ( #35244 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-19 19:46:07 +00:00
Andreas Karatzas
040a505ff5
[ROCm][CI] Cleaning and restructuring amd-ci legacy pipeline ( #34839 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-19 14:30:58 -05:00
bnellnm
9279c59a0e
[MoE Refactor] DefaultMoERunner simplification ( #33049 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-03-19 15:07:44 -04:00
Wentao Ye
7454096199
[Log] Log once in local node by default ( #37568 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-19 12:04:59 -07:00
Andreas Karatzas
fb8b5e05fc
[CI] Add retry with 4x backoff to HTTP fetches for transient failures ( #37218 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-19 19:00:20 +00:00
Harry Mellor
e5d96dc8fc
Fix SpeculatorsConfig now that PreTrainedConfig is a dataclass in Transformers ( #37574 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 18:04:40 +00:00
EdalatiAli
daa05bf340
[Bugfix] Fix AttributeError when serving MXFP8 models with DeepGEMM installed ( #37358 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-19 17:58:33 +00:00
Lucas Kabela
7769b58307
[torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict ( #37345 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-19 17:26:12 +00:00
Chauncey
2f9f946b22
[P/D] AnthropicMessages add kv_transfer_params for PD disaggregation ( #37535 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-19 16:41:20 +00:00
Fadi Arafeh
2890aecce5
[CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded ( #37561 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-19 16:35:45 +00:00
Harry Mellor
34f093b417
[CI] Gate pre-commit on ready label or number of contributions ( #37544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 16:21:57 +00:00
Harry Mellor
4dce8321a9
Run MacOS smoke test on daily cron job instead of every commit ( #37567 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 16:19:50 +00:00
Cyrus Leung
657855ab41
[Misc] Cleanup more configs and processors ( #37560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 15:45:23 +00:00
Wei Zhao
e27b8ba3d1
[Bug] Fix fp8 trtllm MoE modular kernel supported routing methods ( #37346 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-19 11:43:06 -04:00
Woosuk Kwon
40b8363b45
[MRV2] Use fp32 for draft logits ( #37526 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-19 08:41:21 -07:00
mikaylagawarecki
8b10e4fb31
[1/n] Migrate permute_cols to libtorch stable ABI ( #31509 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-19 11:27:26 -04:00
Ifta khairul Alam Adil
104605cbf2
Remove deprecated reasoning_content message field(part-2) ( #37480 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com >
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Philip Ottesen <phiott256@gmail.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: Andy Lo <andy@mistral.ai >
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com >
Signed-off-by: sihao.li <sihao.li@intel.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Philip Ottesen <phiott256@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com >
Co-authored-by: Andy Lo <andy@mistral.ai >
Co-authored-by: Thillai Chithambaram <79466435+thillai-c@users.noreply.github.com >
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 15:20:08 +00:00
Jee Jee Li
96266f119b
[LoRA] Minor improvements to LoRA log ( #37557 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-19 15:18:06 +00:00
Sage Moore
7c0cf3bcd0
Cap the number of API servers to 1 when using Elastic EP. ( #37466 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-19 10:42:57 -04:00
Harry Mellor
572b432913
Stop bench CLI from recursively casting all configs to dict ( #37559 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 14:04:03 +00:00
Cyrus Leung
9515c20868
[Misc] Clean up processing logic ( #37541 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 13:30:20 +00:00
DorBernsohn
c63ca2b2e6
[Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support ( #37438 )
...
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com >
2026-03-19 21:08:00 +08:00
Harry Mellor
a32eaf5bb2
[CI] Merge cleanup_pr_body.yml and reminder_comment.yml ( #37552 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 12:55:07 +00:00
XueLiang Yang
e390742c59
Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… ( #37536 )
...
Signed-off-by: xueliangyang-oeuler <yxl546827391@gmail.com >
Co-authored-by: xueliangyang-oeuler <yxl546827391@gmail.com >
2026-03-19 12:05:07 +00:00
Cyrus Leung
7a6ebcbfcf
[Model] Remove unnecessary get_language_model ( #37545 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 20:00:36 +08:00
Cyrus Leung
c7bc12c20f
[CI/Build] Split out MM pooling tests ( #37542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 11:36:11 +00:00
wang.yuqi
f9e2a38386
[Docs] Reorganize pooling docs. ( #35592 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 11:25:47 +00:00
Harry Mellor
4426447bba
Don't log exc_info when vLLM tries to download a file that doesn't exist ( #37458 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 10:38:29 +00:00
Li, Jiang
3322e26420
[Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile ( #37538 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-19 10:24:39 +00:00
Cyrus Leung
765e461065
[Bugfix] Fix Nemotron Parse loading ( #37407 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 09:55:29 +00:00
Duyi-Wang
6a9cceb219
[Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant ( #37418 )
...
Signed-off-by: Duyi-Wang <duyi.wang@amd.com >
2026-03-19 09:49:27 +00:00
yassha
199f914183
fix(cpu): add null check for aligned_alloc in ScratchPadManager ( #37369 )
...
Signed-off-by: yassha <50112520+yassha@users.noreply.github.com >
2026-03-19 17:45:06 +08:00
Kunshang Ji
ca21483bf9
[MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available ( #37415 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-19 09:23:24 +00:00
TJian
da70c87e81
[CI] Fix wrong path test file, missing rlhf_async_new_apis.py ( #37532 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-19 02:21:55 -07:00
Collin McCarthy
0b6d52629f
Support temporal compression for Nemotron-3-VL videos ( #36808 )
...
Signed-off-by: Collin McCarthy <cmccarthy@nvidia.com >
2026-03-19 08:02:19 +00:00
Ziming Huang
d3cc379567
[Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__ ( #37425 )
...
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com >
2026-03-19 15:43:48 +08:00
cdpath
354cd580d5
fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming ( #37510 )
...
Signed-off-by: cdpath <cdpath@outlook.com >
2026-03-19 07:23:35 +00:00
zhanqiuhu
d49f273144
[SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation ( #37310 )
2026-03-19 08:22:00 +01:00
Flora Feng
b21d384304
[Refactor] Relocate endpoint tests to mirror serving code directory structure ( #37504 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-19 07:19:36 +00:00
Hongxia Yang
e3126cd107
[ROCm] issue management - request information for bug issues on ROCm ( #37009 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-19 03:51:29 +00:00
Wentao Ye
e37ff5b5c8
[Perf] Optimize token_embed for pooling models, 1.0% token throughput improvement ( #37347 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-19 10:27:51 +08:00
Aaron Hao
6accb21f2a
[bug] Fix deadlock with pause resume and collective_rpc ( #37024 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-19 01:49:02 +00:00
Giancarlo Delfin
053f3b6309
[Model Runner V2] Spec decode rejection sampler logprobs support ( #37237 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-19 01:36:27 +00:00
Aaron Hao
5f82706a21
[BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep ( #37334 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-19 00:45:10 +00:00
Sage Moore
c32a58cc2a
[EPLB] Simplify EPLB rearrange by only returning one map ( #36267 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-18 20:34:00 -04:00
Elvir Crnčević
ef2c4f778d
[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding ( #37442 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-19 00:28:37 +00:00
sihao_li
9dade5da3a
[XPU]Unify xpu test dependencies in dockerfile.xpu ( #36477 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-03-19 08:12:07 +08:00
Thillai Chithambaram
828f862acb
[Bugfix] Expand quantization method support in perf metrics ( #37231 )
...
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com >
2026-03-18 23:54:19 +00:00
Andy Lo
577df69b26
[Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish ( #37054 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-18 23:07:29 +00:00
Giancarlo Delfin
04244fd0e1
[Model Runner V2] Spec decode rejection sampler greedy support ( #37238 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-18 15:59:03 -07:00
Michael Goin
9482b0b085
[Bugfix] Remove assertion for NVFP4 scale dynamic range ( #37465 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-03-18 15:37:49 -07:00
Woosuk Kwon
5bc1da147f
[LoRA][BugFix] Fix skipped LoRA adapters for Mistral3 ( #36928 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-18 22:34:19 +00:00
Philip Ottesen
0091017188
fix(worker): optimize swap_states to copy only active token prefixes ( #34733 )
...
Signed-off-by: Philip Ottesen <phiott256@gmail.com >
2026-03-18 14:59:27 -07:00
Wentao Ye
0d81a1fe61
[V0 Deprecation] Deprecate virtual engine ( #37195 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 14:30:14 -07:00
Netanel Haber
6ae4c8d6fc
chunk parakeet into 30s clips to prevent OOMs on long audios ( #36671 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-18 14:22:24 -07:00
JartX
a913b612d8
[Bugfix] Fix ROCm crash in qwen3_next multi-stream events ( #36795 ) ( #37427 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-03-18 16:06:31 -04:00
Harry Mellor
5ce2d10e4a
Fix models which use layer_type_validation for Transformers v5 ( #37398 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 18:41:51 +00:00
Chengyu Fang
738d0a281f
[Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation ( #37439 )
...
Signed-off-by: chengyufang <cnyvfang@outlook.com >
2026-03-18 11:36:34 -07:00
youkaichao
70b81c4f3d
[bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP ( #37449 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-18 18:32:30 +00:00
Cyrus Leung
7476d148db
[Model] Remove unnecessary processor definition for Nemotron Parse ( #37456 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 18:25:13 +00:00
Cyrus Leung
f3732bd931
[Misc] Clean up model registry ( #37457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 18:24:44 +00:00
Wentao Ye
0ef7f79054
[Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement ( #37340 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 14:18:34 -04:00
Or Ozeri
5dd8df0701
[kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec ( #36642 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-18 19:26:40 +02:00
Harry Mellor
39bfb57b7c
Add API docs link if the CLI arg is a config class ( #37432 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 17:19:35 +00:00
RonaldBXu
c9d838fc33
Adding deterministic lora benchmarking to vLLM Bench ( #36057 )
...
Signed-off-by: Ubuntu <ubuntu@ip-172-31-43-201.ap-northeast-1.compute.internal >
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2026-03-18 16:02:03 +00:00
Xin Yang
b1169d7be8
[Kernel] Add gpt-oss Router GEMM kernel ( #37205 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-18 08:15:56 -07:00
XLiu-2000
17808394bc
standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 ( #37371 )
...
Signed-off-by: XuLiu <xuliu40@gmail.com >
Co-authored-by: XuLiu <xuliu40@gmail.com >
2026-03-18 15:05:37 +00:00
elvischenv
296839a1b0
[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE ( #30647 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-03-18 15:01:26 +00:00
Wentao Ye
c373b5c00d
[Log] Reduce duplicate log ( #37313 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 10:57:44 -04:00
Itay Alroy
de1a86b7de
elastic_ep: Fix stateless group port races ( #36330 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-18 14:36:18 +00:00
Cyrus Leung
99267c23ca
[2/3] Refactor InternVL-based processors ( #37324 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 22:22:19 +08:00
Or Ozeri
525f2eeb0b
[kv_offload+HMA][6/N]: Split offloading_connector.py ( #37405 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-18 14:42:46 +01:00
Yufeng He
918b7890a1
[Bugfix] Fix base64 JPEG video frames returning empty metadata ( #37301 )
...
Signed-off-by: Yufeng He <40085740+universeplayer@users.noreply.github.com >
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Yufeng He <40085740+universeplayer@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-18 13:40:03 +00:00
Andy Lo
98b09ddc27
[NIXL][Bugfix] metrics & testing minor bug ( #36051 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-18 14:39:14 +01:00
Shwetha Poojary
cef1f302d2
[Model] Enable LoRA support for tower and connector in H2OVL ( #31696 )
...
Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com >
2026-03-18 13:26:47 +00:00
Elvir Crnčević
17c47fb869
[Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy ( #37322 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-18 18:30:29 +08:00
Chauncey
b322b197f1
[Build] Bump python openai version ( #32316 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-18 18:20:10 +08:00
Andreas Karatzas
eaf7c9b976
[CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename ( #37328 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 09:44:12 +00:00
Aaron Hao
47a1f11bff
[docs] Add docs for new RL flows ( #36188 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 09:04:26 +00:00
Karan Bansal
fad09e8a1f
fix(glm47): improve tool call parsing and content normalization ( #37386 )
...
Signed-off-by: karanb192 <karan@example.com >
Co-authored-by: karanb192 <karan@example.com >
2026-03-18 08:12:21 +00:00
Jee Jee Li
8c31f47c63
[LoRA] Make LoRA respect language_model_only ( #37375 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-18 07:53:34 +00:00
Li, Jiang
261801242f
[Bugfix] Avoid OpenMP thread reallocation in CPU torch compile ( #37391 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-18 07:51:39 +00:00
Or Ozeri
fcf0687b27
[kv_offload+HMA][0/N]: Support block-level preemption handling ( #34805 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-18 08:49:53 +02:00
liuzhenwei
86b7e3c95a
[XPU] skip unsupported ut and update test_nixl_connector ( #37179 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-18 13:32:59 +08:00
Andrew Xia
0e95916155
[responsesAPI] parser.extract_response_outputs can take in token IDs ( #37130 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-18 05:31:31 +00:00
Andreas Karatzas
ce2ef42fd3
[CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset ( #37335 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 05:26:20 +00:00
Andreas Karatzas
8b6325758c
[ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture ( #37349 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 04:55:40 +00:00
gxd3
a0dd1995c7
[Hardware][TPU] Add supports_async_scheduling() method to Executor interface so that it can be extended for Executor implementations. ( #36924 )
...
Signed-off-by: Guangxiang Du <gxd@google.com >
2026-03-18 12:53:28 +08:00
Xin Yang
f1740006e4
[Perf] Enable dual stream execution of input projection for Qwen3 ( #36795 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-18 11:13:27 +08:00
Andreas Karatzas
58cde5c026
[ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm ( #37330 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 11:12:26 +08:00
Roy Wang
761e0aa7a0
[Performance] Add --enable-ep-weight-filter CLI option ( #37351 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-18 09:36:55 +08:00
Yanan Cao
ff9fbc9aff
[Kernel][Helion] [16/N] Refactor register_kernel API to be more Dynamo-friendly ( #36705 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-18 01:23:35 +00:00
Divakar Verma
e6c4797704
[ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn ( #36927 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-18 08:49:32 +08:00
Michael Goin
09e4576f65
[Kernel] Add non-gated support for NVFP4 CUTLASS MoE ( #37320 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-17 18:12:04 -04:00
Andreas Karatzas
3ed7b1e6e0
[ROCm] Validate block_size for explicitly selected attention backends ( #36846 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 17:04:40 -05:00
JartX
e8f9dbc369
[Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling ( #36720 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-03-17 17:55:34 -04:00
Yong Hoon Shin
de35c06c66
Make KV connector metadata build overridable via plugin ( #37336 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2026-03-17 21:29:06 +00:00
Athrael Soju
c0745a851a
[Model] Add ColQwen3.5 4.5B support ( #36887 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-17 21:17:02 +00:00
Ekagra Ranjan
b5ca9c3557
[Models] Cohere ASR ( #35809 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-17 21:04:17 +00:00
Chao-Ju Chen
245758992e
[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow ( #34577 )
...
Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-17 20:48:42 +00:00
Dimitrios Bariamis
1204cf0a9d
[Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 ( #37158 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-17 20:13:06 +00:00
Wei Zhao
b36adfa349
[Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache ( #37252 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-17 20:09:20 +00:00
Michael Goin
e78821b438
[Deprecation] Deprecate --calculate-kv-scales option ( #37201 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-03-17 19:57:24 +00:00
Cyrus Leung
51f0acda79
[Model] Remove unused handle_oov_mm_token ( #37321 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 19:44:52 +00:00
Brian Dellabetta
fa75204b16
bump compressed-tensors version to 0.14.0.1 ( #36988 )
...
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2026-03-17 15:36:19 -04:00
Wentao Ye
bdb903bb5f
[Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs ( #36674 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-17 15:19:52 -04:00
Andrey Talman
68f783a727
[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility ( #35673 )
...
Signed-off-by: atalman <atalman@fb.com >
2026-03-17 18:47:59 +00:00
Avinash Singh
c5030c439d
[CI] Split Distributed Tests (4 GPUs) and Kernel MoE tests ( #37100 )
...
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com >
Signed-off-by: Avinash Singh <107198269+avinashsingh77@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-17 11:44:55 -07:00
Michael Goin
51b2333be1
[Perf] Optimize top-k search in apply_top_k_top_p_triton sampler ( #37225 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-17 11:35:17 -07:00
Andreas Karatzas
4ed51308c8
[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__ ( #37230 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 09:08:08 -07:00
Cyrus Leung
c781fbbab3
[Bugfix] Standardize custom HF Processor init ( #37289 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 15:38:55 +00:00
Richard Zou
979ff44cea
[BugFix] PyTorch Compilation Tests should error if any test fails ( #37300 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-17 15:26:38 +00:00
Benjamin Chislett
f63ed7b5ac
[Bugfix] Fix DP MTP Dummy Run ( #35243 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-17 11:16:48 -04:00
Ning Xie
c9e5096256
[openapi] remove redundant exception stack trace[4/N] ( #37157 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-17 15:06:25 +00:00
Anton Vlasjuk
2ff0ad9694
[UltraVox] Fix output type ( #37224 )
...
Signed-off-by: vasqu <antonprogamer@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:51:17 +00:00
Isotr0py
a836524d20
[Chore] Replace all base64 usages with faster pybase64 package ( #37290 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-17 14:44:19 +00:00
Bhoomit
3717a4dd47
[Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules ( #34984 )
...
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:36:41 +00:00
Harry Mellor
ecfcdd2ce4
Fix Phi3 test that fails with Transformers v5 ( #37298 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:29:24 +00:00
Siew's Capital Jarvis
c25dbc2d27
[Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace ( #36955 )
...
Signed-off-by: Jarvis <brayden.stanley.0127@gmail.com >
2026-03-17 14:22:09 +00:00
Jonas M. Kübler
77d2a5f17b
pick up tuned prefill configs for FP8 FA3 ( #36265 )
...
Signed-off-by: Jonas M. Kübler <44084297+jmkuebler@users.noreply.github.com >
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2026-03-17 07:00:26 -07:00
Sage
59192dfd39
[Frontend] Complete OpenAI render delegation ( #37287 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-17 13:53:55 +00:00
Umut Polat
56cb1baa66
[Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators ( #36256 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-03-17 13:52:30 +00:00
Cyrus Leung
f340324335
[1/2] Move InternVL-based processors ( #37260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 21:50:56 +08:00
sfbemerk
2660b9289c
Bugfix for offloading+prefetch for GLM-4.7-FP8 ( #37178 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2026-03-17 21:22:09 +08:00
Viacheslav
293f036e6d
Add gigachat 3.1 tool parser + fix gigachat3 tool parser ( #36664 )
...
Signed-off-by: Viacheslav Barinov <viacheslav.teh@gmail.com >
2026-03-17 12:03:20 +00:00
youkaichao
0fb142a454
[perf][connector] optimize build_connector_meta when host buffer transfer is not used ( #37165 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-17 11:59:35 +00:00
Sage
00f8e0d211
[Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender ( #37266 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-17 11:22:54 +00:00
zhao, zhenhui
4af9ed21cb
[Bugfix](xpu): prevent “selected index k out of range” in TP decode path ( #37259 )
...
Signed-off-by: zhenzhao <zhenzhao@habana.ai >
2026-03-17 11:14:07 +00:00
Augusto Yao
9c7cab5ebb
[Feature]: Support for multiple embedding types in a single inference call ( #35829 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-03-17 17:05:42 +08:00
Chauncey
132bfd45b6
[Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens ( #37258 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-17 08:54:52 +00:00
xiao-llm
24b4272a8c
Fix infinite recursive search issue in quark.py ( #32779 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Signed-off-by: Xiao Yu <xiao.yu.dc@outlook.com >
Signed-off-by: kimheesu <wlskaka4@gmail.com >
Co-authored-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Kim Hee Su <wlskaka4@gmail.com >
2026-03-17 07:19:15 +00:00
Benjamin Chislett
8a680463fa
[Bugfix] Fix NemotronH MTP + Chunked Prefill ( #35447 )
2026-03-17 07:07:33 +01:00
Nick Cao
20b14095a4
[Bugfix] Fix loading Music Flamingo ( #35535 )
...
Signed-off-by: Nick Cao <ncao@redhat.com >
2026-03-17 05:24:40 +00:00
PatchyTIS
17c1bdf371
[Bugfix] dtype mismatch in ngram gpu propose ( #37246 )
...
Signed-off-by: PatchouliTaisa <patchychen@tencent.com >
Co-authored-by: PatchouliTaisa <patchychen@tencent.com >
2026-03-17 05:19:55 +00:00
Flora Feng
3e3d320c1b
[Refactor] Relocate responses API tests ( #37241 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 05:14:52 +00:00
Andreas Karatzas
54a62a79f7
[ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch ( #37219 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 11:34:49 +08:00
Flora Feng
384dc7f77b
[Refactor] Relocate completion and chat completion tests ( #37125 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 11:31:23 +08:00
Flora Feng
f04d5226f8
[CI] Fix flaky tool_use chat completion tests with deterministic seed ( #37027 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 03:24:34 +00:00
Kyuyeun Kim
0a0a1a198b
Add ability to replace oot ops when using lora ( #37181 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2026-03-16 18:04:15 -07:00
Vadim Gimpelson
6c1cfbad32
Support non-contiguous KV cache in TRTLLM fp8 dequant kernel ( #36867 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
Co-authored-by: Pavani Majety <pavanimajety@gmail.com >
2026-03-16 17:48:42 -07:00
Harry Huang
45f526d652
[BugFix] Correct max memory usage for multiple KV-cache groups ( #36030 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-03-17 00:38:52 +00:00
Julien Denize
5db91f0aaf
Fix some Mistral parser issues ( #37209 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-17 00:08:56 +00:00
Walter Beller-Morales
061980c36a
[Feature][Frontend] add support for Cohere Embed v2 API ( #37074 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-16 19:55:53 -04:00
Ben Browning
7a49742b88
[CI/Build] Add common tool call parser test suite ( #27599 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-16 19:46:20 -04:00
Terry Gao
3e6a1e1686
[Custom Ops] Add functional + out variant for scaled_fp4_quant ( #34389 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
2026-03-16 18:51:46 -04:00
Julien Denize
7961486a9b
Fix EagleMistralLarge3Model initialization ( #37232 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-16 15:41:00 -07:00
Andreas Karatzas
4f9b14c21c
[CI] Stabilize multinode DP internal LB completion tests ( #36356 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 15:40:23 -07:00
Yuchen Fama
31a458c091
[Doc] Clarify schema enforcement behavior for tool_choice modes ( #37064 )
...
Signed-off-by: yfama <yuchengu@gmail.com >
2026-03-16 22:27:42 +00:00
Wei Zhao
a3a51d20e7
[Benchmark] Improvements to attention benchmark script ( #37115 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-16 22:22:40 +00:00
EdalatiAli
e5b807607c
[Quant][Feature] Support online MXFP8 quantization for MoE and dense models ( #35448 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
2026-03-16 18:07:39 -04:00
Elvir Crnčević
fd4d96302a
Fix eplb nvfp4 experts hook ( #37217 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Signed-off-by: Elvir Crncevic <elvir@anthropic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-16 22:03:54 +00:00
Krish Gupta
c0f011918d
[Bugfix] opcheck false mutation error in rms_norm_per_block_quant ( #36688 ) ( #36779 )
...
Signed-off-by: Krish Gupta <krishom70@gmail.com >
2026-03-16 21:11:33 +00:00
Zhengxu Chen
e6ae4b1be1
[compile] Enable mega aot artifact for torch 2.12+. ( #37198 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-16 21:05:51 +00:00
zhanqiuhu
2dccb38f73
[Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors ( #36549 )
2026-03-16 20:51:04 +00:00
Kunshang Ji
d157216093
[BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer ( #37197 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-16 21:39:56 +01:00
Matthew Bonanni
93f3c8e531
[Misc] Add float16 to CacheDType ( #37199 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-16 13:24:48 -07:00
rasmith
2cc26c3a99
[CI][BugFix][MORI][AMD] Add transfer_id to kv transfer params for test ( #37213 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-16 13:22:57 -07:00
Flora Feng
dfa8852db2
[Refactor] Consolidate GPT-OSS reasoning parser tests ( #36915 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Flora Feng <4florafeng@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-16 15:53:07 -04:00
Lucas Kabela
714c6e0eab
[torch.compile][BE] Modify cudagraph callable to check for is_forward_context_set ( #36288 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-16 19:42:34 +00:00
Sage
0fefd00e6c
[Bugfix] Fix render server crash for quantized models on CPU-only hosts ( #37215 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-16 18:59:01 +00:00
Nicolò Lucchesi
f5c081d432
[PD][Nixl] Add support for hybrid SSM-FA models ( #36687 )
2026-03-16 19:58:06 +01:00
Matthew Bonanni
c88ea8338b
[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible ( #36982 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-16 13:51:21 -04:00
Max de Bayser
9f9ecff4cd
Add simple granite4 tool parser ( #36827 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2026-03-16 10:49:09 -07:00
haosdent
ca1954d58c
[Bugfix] Disable cross-layer KV cache for MLA attention backends ( #37090 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-16 19:03:10 +02:00
Raushan Turganbay
55e6d3d5c0
[Bugfix] Make siglip/clip compatible with transformers v5 ( #37200 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-03-16 16:48:18 +00:00
Chauncey
6682c231fa
[Bugfix] Add error handling for FINISHED_ERROR in OpenAIServing ( #37148 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-16 16:27:47 +00:00
Itay Etelis
5ae685c1c8
[Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout ( #34158 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
2026-03-16 11:20:51 -04:00
Wentao Ye
ce8cf9161d
[Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh ( #36693 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-16 11:12:15 -04:00
xjx
18be11fd59
[BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 ( #35594 )
...
Signed-off-by: xjx <493337577@qq.com >
2026-03-16 15:10:42 +00:00
Yuanheng Zhao
8d8855fdae
[Bugfix] Add safety check and fallback for null scaling factor ( #36106 )
...
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 14:27:29 +00:00
Wentao Ye
e855d380fa
[Compile] Fix compile warning in moe_permute ( #36529 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-16 10:16:14 -04:00
Benjamin Bartels
0e5a9382af
[Bugfix] accept redacted thinking blocks in Anthropic messages ( #36992 )
...
Signed-off-by: Benjamin Bartels <benjaminba@tiglab-ubuntu.ilab.local >
Signed-off-by: bbartels <benjamin@bartels.dev >
Co-authored-by: Benjamin Bartels <benjaminba@tiglab-ubuntu.ilab.local >
2026-03-16 22:01:57 +08:00
Fynn Schmitt-Ulms
04bf5a35fa
[Spec Decode] Update extract_hidden_states to use deferred kv_connector clear ( #37013 )
2026-03-16 14:53:45 +01:00
Tianyu Guo
43a73f853b
Remove unused EVS functions in qwen3_vl.py ( #37183 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
2026-03-16 13:09:09 +00:00
Julien Denize
ffbc2e5bdb
Patch Mistral config ( #37104 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-16 12:22:18 +00:00
Lukas Geiger
f9e6db3034
[Models][Qwen3 ViT] Keep max_seqlen on CPU to prevent D2H sync ( #37139 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-16 12:11:59 +00:00
elvischenv
d61d2b08e9
[Build] Fix API rate limit exceeded when using VLLM_USE_PRECOMPILED=1 ( #36229 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 12:09:27 +00:00
Artem Perevedentsev
f5e59ee7a6
[Performance] Add prefetch for checkpoints to OS page cache ( #36012 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-16 11:32:02 +00:00
Harry Mellor
9b005edc48
[Docs] Make the link to hardware plugins clearer ( #37174 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 04:12:58 -07:00
Robin Nabel
bf9a185395
GLM4 tool parser: fix streaming mode ( #35208 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-16 18:48:52 +08:00
Harry Mellor
ad041c79db
Fix text only inputs for MRoPE models with the Transformers modelling backend ( #37055 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:31:16 +00:00
Kunshang Ji
747b068136
[Hardware] Replace memory related torch.cuda APIs ( #37031 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-03-16 10:24:48 +00:00
Harry Mellor
122f75d939
Fix pipeline parallel with multimodal models with the Transformers modelling backend ( #37057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:20:37 +00:00
SoluMilken
d8f8a7aad2
[Misc] Sync pre-commit to 4.5.1 in workflows and docs ( #36675 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:03:21 +00:00
Roy Wang
0115e957d4
[Frontend][Misc] Remove unused log in /is_sleeping ( #37093 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-03-16 17:46:28 +08:00
haosdent
116ed130f4
[Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches ( #34871 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-16 10:30:23 +01:00
Vadim Gimpelson
8374387bd8
[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell ( #36987 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-16 09:04:29 +00:00
Isotr0py
912fbe9555
[Bugfix] Fix Qwen2.5-Omni/Qwen3-Omni use_audio_in_video with multi-video inputs ( #37147 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-16 08:56:06 +00:00
Laith Sakka
52131f88d9
use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks ( #36204 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-03-16 08:52:31 +00:00
Roy Wang
821eb80c0d
[Performance][Model Loader] Skip non-local expert weights during EP model loading ( #37136 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-03-16 01:33:36 -07:00
Andreas Karatzas
a2956a0f8e
[ROCm][CI] Retrying in case of batch variance effects and reducing flakiness ( #36442 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 16:08:51 +08:00
Andreas Karatzas
911355e216
[ROCm] Fix KV copy methods and auto-select attention backend for ROCm ( #36845 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 16:07:27 +08:00
Chauncey
8d3f8f485e
[Bugfix] fix Qwen3.5 tool calling bug ( #36774 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-16 15:38:42 +08:00
Woosuk Kwon
96efb91480
[Model Runner V2] Fix processed logits in sample() ( #37144 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-16 00:35:49 -07:00
leo-cf-tian
2754231ba3
[Kernel] Add FlashInfer MoE A2A Kernel ( #36022 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: Leo Tian <lctian@nvidia.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com >
2026-03-15 23:45:32 -07:00
bigshanedogg
2390d44209
[Model] Add HyperCLOVAX-SEED-Think-14B language model support ( #37107 )
...
Signed-off-by: bigshanedogg <bigshane319@gmail.com >
2026-03-16 06:40:05 +00:00
Li, Jiang
7362b4450a
[Bugfix] Avoid LD_PRELOAD check on MacOS ( #37145 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-15 23:31:44 -07:00
Andreas Karatzas
57a314d155
[CI][Bugfix] Fix 500 errors from priority overflow and TemplateError subclasses in schema fuzz tests ( #37127 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 05:27:21 +00:00
Andreas Karatzas
d4c57863f7
[ROCm][CI] Fix engine teardown and text normalization to stabilize voxtral test ( #37138 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 04:49:31 +00:00
Wang, Yiting
68e1b711f1
[XPU] Add deepseek_scaling_rope fused kernel ( #36612 )
...
Signed-off-by: yitingw1 <yiting.wang@intel.com >
2026-03-16 12:35:08 +08:00
rasmith
0024f39a32
[ROCm][P/D][MORI][BugFix] Add transfer_id for moriio_connector to restore P/D functionality ( #34907 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-16 10:36:51 +08:00
Andrew Xia
e9163b536e
[responsesAPI][ez] add a unit test for SimpleContext logprobs ( #37126 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-15 17:12:26 -07:00
Lalithnarayan C
7acaea634c
In-Tree AMD Zen CPU Backend via zentorch [1/N] ( #35970 )
...
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Chinmay-Kulkarni-AMD <Chinmay.Kulkarni@amd.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-03-15 23:35:35 +00:00
Jiangyun Zhu
697e4ff352
[GDN] add a config for gdn kernel selection ( #36647 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-16 00:40:17 +08:00
Hari
a3e2e250f0
[Feature] Add Azure Blob Storage support for RunAI Model Streamer ( #34614 )
...
Signed-off-by: hasethuraman <hsethuraman@microsoft.com >
2026-03-15 19:38:21 +08:00
Isotr0py
143e4dccdf
[Misc] Add online audio_in_video test ( #36775 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-15 00:14:11 -07:00
Isotr0py
6590a3ecda
[Frontend] Remove torchcodec from audio dependency ( #37061 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-15 05:15:59 +00:00
Russell Bryant
b3debb7e77
[Build] Upgrade xgrammar to get a security fix ( #36168 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-15 03:13:48 +00:00
Nick Hill
458c1a4b2d
[Frontend] Reduce chat template warmup logging levels ( #37062 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-14 13:48:59 -07:00
Karan Bansal
821fde2df4
[Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference ( #32384 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
Co-authored-by: Inokinoki <inoki@inoki.cc >
2026-03-14 17:29:06 +00:00
arlo
8c29042bb9
[Feature] Add InstantTensor weight loader ( #36139 )
2026-03-14 18:05:23 +01:00
Cyrus Leung
5467d137b3
[Frontend] Avoid startup error log for models without chat template ( #37040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-14 09:36:11 -07:00
Santino Ramos
3ed46f374b
[Model Runner V2] Add Support for XD-RoPE ( #36817 )
...
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com >
2026-03-14 09:27:55 -07:00
seanmamasde
84868e4793
[Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats ( #35109 )
...
Signed-off-by: seanmamasde <seanmamasde@gmail.com >
2026-03-14 08:44:03 -07:00
Isotr0py
a8e8d62dd8
[Misc] Clean up Kimi-audio whisper encoder loading ( #36903 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-14 23:37:52 +08:00
Julien Denize
e42b49bd69
Mistral common v10 ( #36971 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: root <root@h200-bar-196-227.slurm-bar-compute.tenant-slurm.svc.cluster.local >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-14 07:26:43 -07:00
Sergey Zinchenko
4a718e770d
[Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests ( https://github.com/vllm-project/vllm/issues/35665 ) ( #35684 )
2026-03-14 14:10:11 +00:00
Kevin H. Luu
600a039f57
[CI] Shard Multi-Modal Models (Standard) into 4 parallel jobs ( #37014 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-14 08:26:54 +00:00
Harry Mellor
ffa5d74f15
Enable loading of fused expert weights in the Transformers modelling backend ( #36997 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-14 07:01:06 +00:00
Kevin H. Luu
74fe80ee95
[CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs ( #37015 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-14 12:21:13 +08:00
Flora Feng
bcfdadb1bc
[Refactor] Relocate chat completion and anthropic tests ( #36919 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-14 12:16:16 +08:00
Yanan Cao
236de72e49
[CI] Pin helion version ( #37012 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 23:25:29 -04:00
sbeurnier
a116f96930
[V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls ( #37006 )
...
Signed-off-by: Sebastien Beurnier <sbeurnier@together.ai >
2026-03-14 01:37:32 +00:00
Li, Jiang
092ace9e3a
[UX] Improve UX of CPU backend ( #36968 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-14 09:27:29 +08:00
Andrew Xia
f680dc1b39
[responsesAPI] prioritize content over summary in reasoning item input ( #36516 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <mitandrewxia@gmail.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Andrew Xia <axia@fb.com >
2026-03-14 09:20:30 +08:00
Giulio Leone
b41aa264f9
fix: resolve chat template names before kwargs detection ( #36937 )
...
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com >
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-03-14 00:20:16 +00:00
Dimitrios Bariamis
367cf5cd3e
[Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype ( #36931 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-13 16:41:16 -07:00
haosdent
6d53efd2a5
[Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models ( #34695 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-13 23:25:41 +00:00
Benjamin Chislett
8b346309a5
[Refactor] Consolidate SupportsEagle ( #36063 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-13 23:22:40 +00:00
Nick Hill
54a6db827f
[BugFix] Fix "DP Coordinator receives unexpected..." messages ( #37008 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 23:18:05 +00:00
Matthew Bonanni
9efc4db965
[Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces ( #37004 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-13 22:55:36 +00:00
Kevin H. Luu
f1816fb192
[CI] Split V1 e2e + engine (1 GPU) into separate jobs ( #36945 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 14:16:02 -07:00
Harry Mellor
0005d2a3c9
Use Transformers v5 WeightRenaming for Transformers modeling backend ( #31545 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-13 20:49:08 +00:00
Ekagra Ranjan
d0b402974f
[Bugfix][Spec Decode] Avoid double call of Ngram CPU ( #36952 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-13 20:33:19 +00:00
Divakar Verma
6341d43043
[ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer ( #35316 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-13 19:44:24 +00:00
Mark McLoughlin
7afe0faab1
[Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish ( #36666 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 12:10:06 -07:00
Harry Mellor
5a3f1eb62f
[Misc] Set default kv_buffer_device in a better way ( #36862 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-13 19:07:33 +00:00
yugong333
b3ce711b93
Fp8 lora dense kernel ( #35242 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
2026-03-13 19:05:08 +00:00
Isotr0py
abf61aaa8e
[Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request ( #36800 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-13 18:16:05 +00:00
bigmoyan
4508532fbd
[Bugfix] fix paddleocr crash on some image shape ( #36959 )
...
Signed-off-by: wangzhengtao <wangzhengtao@msh.team >
Signed-off-by: bigmoyan <moyan_work@foxmail.com >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-13 13:46:55 +00:00
Itay Alroy
d5af196c18
[2/N] Elastic EP Milestone 2: Integrating NIXL-EP ( #35627 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Co-authored-by: Yongji Wu <wuyongji317@gmail.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-03-13 09:25:33 -04:00
Chaojun Zhang
82f836d976
[XPU] Support LoRA via torch.compile on XPU platform ( #36962 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2026-03-13 10:34:59 +00:00
Andreas Karatzas
4fccd30f19
[ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options ( #36181 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-13 02:04:22 -07:00
Or Ozeri
cfaf4668f7
[kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec ( #36610 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-13 08:04:21 +00:00
Andreas Karatzas
99a57bdf74
[ROCm][CI] Corrected the GPT-OSS test root path ( #36711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-13 15:53:43 +08:00
Sage
a2268617cf
[Frontend] Delegate preprocessing to OpenAIServingRender ( #36483 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-13 00:39:43 -07:00
Rohan Potdar
a4ad9db541
Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) ( #35786 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-13 07:33:22 +00:00
Nick Hill
b373b5102a
[Tests] Shutdown test RemoteVLLMServer cleanly ( #36950 )
...
Recent PR #33949 changed the teardown logic of the RemoteVLLMServer test utility class to
send SIGTERM to all vllm (sub)processes at once, which breaks the clean/coordinated
shutdown logic that assumes only the top-level process will receive a signal (for example
when running in a container that's shut down).
This caused a bunch of errors and stacktraces in some test logs, even though those tests
still pass. We should still attempt a normal shutdown and only kill other procs if they are
still running after a few seconds.
Example: tests/v1/distributed/test_external_lb_dp.py::test_external_lb_completion_streaming
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 07:32:55 +00:00
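The shutdown order described in the message for #36950 above (signal only the top-level process, wait, then kill whatever is left) can be sketched roughly as follows. This is a minimal illustration, not the actual RemoteVLLMServer code: the function name `shutdown_cleanly`, the `grace_seconds` parameter, and the use of a POSIX process group to find leftover subprocesses are all assumptions made for the example. It also simplifies the fallback by killing stragglers as soon as the top-level process has exited or the grace period has elapsed.

```python
import os
import signal
import subprocess
import sys


def shutdown_cleanly(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Signal only the top-level process, then force-kill any stragglers.

    Hypothetical helper. Assumes `proc` was started with
    start_new_session=True, so it and its children share a process
    group whose id equals proc.pid (POSIX only).
    """
    # Step 1: SIGTERM goes to the top-level process alone, so its own
    # coordinated shutdown logic can stop the subprocesses in order.
    proc.terminate()
    try:
        returncode = proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        returncode = None

    # Step 2: anything still alive in the group is killed outright.
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except (ProcessLookupError, PermissionError):
        pass  # the group is already gone (or holds only zombies)

    if returncode is None:
        returncode = proc.wait()
    return returncode


if __name__ == "__main__":
    # Stand-in for a server process; it does nothing but sleep.
    server = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"],
        start_new_session=True,
    )
    print("exit code:", shutdown_cleanly(server))
```

Sending the first signal to one process rather than to all of them is the point of the change: a broadcast SIGTERM races the coordinated teardown and produces the spurious errors the message mentions.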
Thomas Parnell
f296a1966d
[Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs ( #36876 )
2026-03-13 07:09:39 +01:00
Csrayz
bc2c0c86ef
[Frontend] Fix usage incorrectly returned with empty stream_options ( #36379 )
...
Signed-off-by: Csrayz <33659823+Csrayz@users.noreply.github.com >
2026-03-13 03:33:04 +00:00
jaime campos salas
891c60dcd5
fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 ( #36684 )
...
Signed-off-by: Jaime Campos Salas <jaime.campos.salas@gmail.com >
2026-03-12 23:28:27 -04:00
whyiug
1ce13cf992
[Model] Add support for BERT-like Chinese ERNIE pooling models ( #36385 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-13 03:23:53 +00:00
Nikita
10f08dedfa
[Model] Add ColPali late interaction model for multi-modal retrieval ( #36818 )
...
Signed-off-by: Nikita Sukharev <kaonael@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-13 02:18:57 +00:00
Aaron Hao
5e1a373d2e
[BUG] Fix rank calculation in NCCLWeightTransferEngine ( #36940 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-13 01:56:51 +00:00
Simo Lin
572c776bfb
build: update smg-grpc-servicer to use vllm extra ( #36938 )
...
Signed-off-by: Simo Lin <linsimo.mark@gmail.com >
2026-03-13 01:31:36 +00:00
Yifan Qiao
55d8073d06
[Bugfix] ep_scatter kernel store-load race condition ( #34991 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2026-03-13 01:07:59 +00:00
Nick Hill
cd32d6f586
[Model Runner V2] Some code simplification ( #36929 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 00:59:23 +00:00
Jaewon
aaa3092f51
[MoE] Add routing simulation override for MXFP4 quantized MoE ( #33595 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-13 00:30:44 +00:00
Shubhra Pandit
87985077a4
[Speculative Decoding] Add norm_before_fc for gpt-oss draft models ( #36545 )
...
Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-12 23:03:32 +00:00
Ryan Rock
a79c1c2c80
[AMD][Build] Add DeepEP to ROCm Dockerfile ( #36086 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-03-12 21:33:32 +00:00
Andreas Karatzas
cc8f1f4764
[ROCm][CI] Preparing gfx90a mirroring ( #36210 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-12 13:42:25 -07:00
Michael Goin
05b9e8ab5b
Revise environment setup in AGENTS.md ( #36909 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-12 19:21:11 +00:00
Xinan Miao
2cdf92228c
[Feature]: Remove Chunking From FusedMoE ( #34086 )
...
Signed-off-by: SouthWest7 <am1ao@qq.com >
Signed-off-by: Southwest <1403572259@qq.com >
Signed-off-by: southwest <am1ao@qq.com >
Signed-off-by: Xinan Miao <1403572259@qq.com >
Co-authored-by: SouthWest7 <am1ao@qq.com >
2026-03-12 14:24:38 -04:00
Marc Sun
c973ecdead
[bnb] Skip moe + bnb test ( #36896 )
...
Signed-off-by: Marc Sun <marc@huggingface.co >
2026-03-12 18:03:25 +00:00
Harry Mellor
e39257a552
Add AGENTS.md ( #36877 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 10:20:50 -07:00
Dimitrios Bariamis
cc16b24b17
Update Flashinfer to 0.6.6 ( #36768 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-12 13:19:19 -04:00
Eunkwang Jeon
bdc2343454
[Bugfix] Fix KeyError in parse_response_input for reasoning items with optional content ( #34499 )
...
Signed-off-by: jeonsworld <jeonsworld@gmail.com >
2026-03-13 00:13:36 +08:00
Matthew Bonanni
f444c05c32
[Attention] Use FA4 for MLA prefill ( #34732 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-12 12:10:17 -04:00
SoluMilken
85199f9681
[Bugfix] fix main branch pre-commit error (1 line change) ( #36897 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-12 09:08:37 -07:00
grimulkan
a1257fd1ea
[Kernel] Add FP8 KV cache support to Triton MLA decode attention ( #34597 )
...
Signed-off-by: grimulkan <grimulkan@gmail.com >
2026-03-12 08:32:34 -07:00
Thomas Parnell
abcffbba8c
[CI] Fix mypy pre-commit errors on main ( #36882 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-12 08:22:29 -07:00
Kunshang Ji
53ec16a705
[Hardware] Replace torch.cuda.device_count/current_device/set_device API ( #36145 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-12 07:57:47 -07:00
Wei Zhao
2e693f48e7
[Perf] Add TRTLLM FP8 MoE Modular Kernel ( #36307 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-12 07:32:31 -07:00
Martin Hickey
7f1f36bf91
[CI] Fix mypy for vllm/reasoning ( #35742 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-12 12:21:33 +00:00
Mark McLoughlin
5282c7d4d0
[docs] Add lightweight AI assisted contribution policy ( #30947 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-12 11:46:13 +00:00
caozuoba
9e19f8338b
[Perf] add packed recurrent fast path for decode ( #36596 )
...
Signed-off-by: hdj <1293066020@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-12 04:01:57 -07:00
Sage
06e0bc21d2
[Frontend] Split OpenAIServingModels into OpenAIModelRegistry + OpenAIServingModels ( #36536 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-12 03:29:37 -07:00
Chauncey
5a71cdd76e
[Bugfix] Fix crash when tool_choice=required exceeds max_tokens ( #36841 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-12 03:28:45 -07:00
Shanshan Shen
f0d3658c0f
[MM][OOT] Support CPU seq_lens for OOT MMEncoderAttention kernels ( #36605 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-12 03:28:23 -07:00
Michael Goin
57431d8231
[UX] Only show FP4 Marlin fallback warning for w4a4 models ( #36806 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-12 05:19:35 -04:00
Xu Jinyang
3e64fe4a18
[Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling ( #36599 )
...
Signed-off-by: AuYang <459461160@qq.com >
2026-03-12 00:51:09 -07:00
sfeiqiang
8cb24d3aed
[KV Connector] Support using FlexKV as KV Cache Offloading option. ( #34328 )
...
Signed-off-by: phaedonsun <phaedonsun@tencent.com >
Co-authored-by: phaedonsun <phaedonsun@tencent.com >
2026-03-12 00:46:20 -07:00
István Ketykó
00726c74c9
[Bugfix][Model] Fix DeepSeek-OCR TensorSchema crash on empty images_crop ( #36670 )
...
Signed-off-by: István Ketykó <istvan.ketyko@gmail.com >
2026-03-12 15:35:54 +08:00
Chauncey
9fe404ed04
[Frontend] OpenAI Responses API supports Tool/Function calling with streaming ( #29947 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-12 15:03:50 +08:00
Sage
802f306cd1
[Tests] Skip model weight download for render-only test server ( #36813 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-12 06:24:42 +00:00
Yan Ma
894843eb25
replace "with torch.cuda.device" with "with torch.accelerator.device_index" ( #36144 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-11 23:12:57 -07:00
Yanan Cao
584a3f56de
[Kernel][Helion][13/N] Force static_shapes=False in helion register ( #36677 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-12 05:35:29 +00:00
Nick Hill
36735fd772
[BugFix] Fix multiple/duplicate stdout prefixes ( #36822 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-12 12:23:21 +08:00
wang.yuqi
6ecabe4936
[CI Failure] Fix Language Models Test (Extended Pooling) daily CI Failure ( #36761 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-12 12:22:05 +08:00
Woosuk Kwon
2f8b4ce0c0
[Model Runner V2] Do not initialize sampler for non-last PP ranks ( #36824 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-12 03:55:28 +00:00
Yuwei An
2ef69456f5
[LMCache] Fault Tolerance Mechanism ( #36586 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-03-12 03:54:39 +00:00
Louie Tsai
17852aa503
more models for vLLM Benchmark Suite ( #35086 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-03-12 11:36:51 +08:00
Flora Feng
8647c6cf51
[Bugfix] Fix minimax_m2 tool parser when stream interval > 1 ( #35895 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-12 10:25:14 +08:00
Kunshang Ji
513949f95f
[XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu ( #36831 )
...
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-03-12 01:46:02 +00:00
Nick Hill
262b76a09f
[Frontend] Exclude anthropic billing header to avoid prefix cache miss ( #36829 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 01:20:34 +00:00
Wentao Ye
c34ba6b961
[Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement ( #36710 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-12 08:37:01 +08:00
Matthias Gehre
24062b704f
[ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures ( #36499 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-11 23:14:40 +00:00
Aaron Hao
d6b61e5166
[BUG] Fix async rlhf tests ( #35811 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-11 18:06:10 -04:00
Yanan Cao
cf632499ee
[Kernel] [Helion] [15/N] Split config files into per-platform files ( #36698 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:29 -04:00
Yanan Cao
a3774a8198
[Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation ( #36563 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:16 -04:00
Yanan Cao
0ce21c46a0
[Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning ( #36683 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:04 -04:00
Woosuk Kwon
55eed6b7a5
[Model Runner V2] Add WhisperModelState [6/N] ( #35790 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-11 14:20:38 -07:00
Giancarlo Delfin
c77181e534
[Model Runner V2] Add probabilistic rejection sampling for spec decoding ( #35461 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-11 14:04:32 -07:00
maobaolong
12001f2ebc
[LMCache] Pass TP size in lookup for MLA multi-reader locking ( #36129 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu >
2026-03-11 20:45:20 +00:00
Or Ozeri
7ee5d5093b
[BugFix][kv_offload] Fix offloading decodes with async scheduling ( #33881 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-11 20:43:40 +00:00
jennyyyyzhen
428bc718bd
[Bugfix][ROCm] Strip block_size before attention backend validation ( #36274 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-11 13:37:31 -07:00
汪志鹏
ff1e3d9c63
[BugFix]: add bagel to MM_PREFIX_LM_MODELS ( #36316 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
2026-03-11 19:55:59 +00:00
Wentao Ye
35bdca5431
[Refactor] Remove dead code in KV connector ( #36424 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-11 19:40:17 +00:00
Amanzhol Salykov
8a24842765
[ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 ( #35093 )
...
Signed-off-by: salykova <amsalykov@gmail.com >
Signed-off-by: amd-asalykov <asalykov@amd.com >
2026-03-11 19:00:08 +00:00
Harry Mellor
65986db6ba
Make Gemma and Gemma 2 accept inputs_embeds like Gemma 3 ( #36787 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 18:12:43 +00:00
Luka Govedič
9556af87d5
[torch.compile] Add support for non-contiguous fused RMSNorm + group quant ( #36551 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
2026-03-11 10:56:55 -07:00
Or Ozeri
a1a3523a56
[KVConnector] Support worker -> scheduler metadata ( #31964 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-11 17:36:37 +00:00
tianshu-Michael-yu
741f4e046b
fix: align lfm2 thumbnail token counting with HF ( #36707 )
2026-03-11 10:28:38 -07:00
Julien Denize
a5d06dc557
Add 320 dimension size support to MLA ( #36161 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2026-03-11 10:21:22 -07:00
Harry Mellor
5efa206a8c
Fix ExaoneMoeMTP test that never ran in Transformers v4 ( #36792 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 17:10:23 +00:00
Cyrus Leung
196802dfa6
[Misc] Clean up renderers ( #36770 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-11 16:39:29 +00:00
Isotr0py
c84b519cf3
[Bugfix] Fix negative max_tokens when input prompt is too long ( #36789 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 16:30:51 +00:00
Flora Feng
741ecf0630
[CI] Add bfcl tool call correctness eval ( #36560 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-11 12:27:36 -04:00
Robert Shaw
b7e5a588d8
[Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels ( #36061 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-11 16:07:14 +00:00
Richard Zou
822e250ab7
[torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation ( #36093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-11 16:07:09 +00:00
Hongxin Xu
bea02cdf93
Fix routed experts capture for hybrid models (Mamba + Attention) ( #35744 )
...
Signed-off-by: arlenxu <arlenxu@tencent.com >
Signed-off-by: xhx1022 <1737006628@qq.com >
Co-authored-by: arlenxu <arlenxu@tencent.com >
2026-03-11 08:53:10 -07:00
Julien Denize
a3ea760ea5
Add 'none' reasoning effort to ChatCompletionRequest ( #36238 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2026-03-11 15:45:34 +00:00
Harry Mellor
35db669f1d
Correct link to supported hardware on vllm.ai ( #36798 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 08:43:28 -07:00
Julien Denize
afebeffbfb
Add support to Mistral large 3 eagle with dense layers ( #36163 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-11 15:42:56 +00:00
Jhao-Ting Chen
5573894737
Kimi k2.5 MLA based eagle3 ( #36361 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com >
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Izzy Putterman <iputterman@nvidia.com >
2026-03-11 11:36:11 -04:00
Harry Mellor
d5816c8c2f
Fix tied weights in weight mapping test for Transformers v5 ( #36788 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 15:10:26 +00:00
Woosuk Kwon
8ccbcda5c0
[Model Runner V2] Remove unused warmup_for_prefill method ( #36762 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-11 08:02:44 -07:00
tvirolai-amd
a9e532afe2
[ROCm][Perf] Allow MTP lens > 1 in Sparse MLA ( #36681 )
...
Signed-off-by: Teemu Virolainen <teemu.virolainen@amd.com >
2026-03-11 14:43:03 +00:00
Harry Mellor
f3163bba67
Disable docs build skipping until a better solution is found ( #36790 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 13:53:23 +00:00
Martin Hickey
700a1ddc65
[Misc] Use envs module to get VLLM_DISABLED_KERNELS ( #35776 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-11 13:37:46 +00:00
Silvia Colabrese
f33251ffc8
[Bugfix] Fix Mistral-small --format ( #36782 )
...
Signed-off-by: 12010486 <silvia.colabrese@intel.com >
2026-03-11 04:47:52 -07:00
Wuxun Zhang
e584dce52b
Add XPU MLA Sparse backend for DeepSeek v3.2 ( #33230 )
...
Signed-off-by: Zhang, Wuxun <wuxun.zhang@intel.com >
2026-03-11 19:19:15 +08:00
Ning Xie
40c0461f24
[openapi] refactor render related openapi [3/N] ( #36749 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-11 03:14:34 -07:00
Weiguang Li
724759684c
[Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps ( #36136 )
...
Signed-off-by: OiPunk <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 03:13:06 -07:00
Michael Goin
9c34e9d24f
Disable cascade attention by default ( #36318 )
2026-03-11 03:12:23 -07:00
Richard Zou
09b6f99852
[compile] aot_compile should respect VLLM_DISABLE_COMPILE_CACHE ( #36358 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-11 03:12:03 -07:00
Ethan T.
c87fb515ed
fix(lora): use replaced_module_name in pooling model name check ( #36402 )
...
Signed-off-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 03:11:27 -07:00
Itay Alroy
5353c9b016
platforms: Fix Ray DP startup crash ( #36665 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-11 03:08:55 -07:00
Angela Yi
13e79fc811
[ci] Update rtol for test_classification ( #36556 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
2026-03-11 03:08:16 -07:00
Rahul Tuli
9d07a3d6e4
Add: Eagle3 support for Qwen3.5 ( #36658 )
...
Signed-off-by: Rahul-Tuli <rtuli@redhat.com >
2026-03-11 03:07:42 -07:00
Cyrus Leung
646b85544b
[Refactor] Remove Molmo2 processor wrapper ( #36667 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-11 03:07:20 -07:00
tc-mb
4286cc5ec2
fix(minicpmv): fix audio inference by handling meta device in init_re… ( #36751 )
...
Signed-off-by: caitianchi <caitianchi@modelbest.cn >
2026-03-11 03:06:28 -07:00
LoganJane
545d18d81b
[Bugfix] Support other quantization methods in glm41v ( #36321 )
...
Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com >
Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 09:48:05 +00:00
roikoren755
e661b9ee83
[NemotronH] Small fix reasoning parser ( #36635 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-11 02:44:41 -07:00
YiSheng5
c910eeb125
[XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. ( #36593 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2026-03-11 09:17:46 +00:00
Harry Mellor
f4ae58b38b
Remove unused config field from Gemma2 ( #36672 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 01:51:19 -07:00
Isotr0py
e568cf88bc
[UX] Infer dtype for local checkpoint ( #36218 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 08:50:04 +00:00
Nicolò Lucchesi
098d844731
[NIXL][1/N] Refactor kernel_block_size detection ( #35752 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-11 01:11:23 -07:00
JartX
a40ee486f2
[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 ( #35923 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: akaratza <akaratza@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-03-11 07:45:57 +00:00
pschlan-amd
eac2dc2b41
AITER MLA backend: Avoid CPU sync in _build_decode ( #35765 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-03-11 07:25:00 +00:00
Flora Feng
d5080aeaa4
[Refactor] Remove deadcode in Responses API serving ( #36726 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-11 07:11:41 +00:00
liuzhenwei
f22d6e0267
[Hardware][NIXL] set default kv buffer type for different platform ( #36438 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-11 05:19:28 +00:00
Kunshang Ji
76c6e6da08
[XPU] Support block fp8 moe by fallback to TritonExpert on XPU ( #36458 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-10 21:54:09 -07:00
typer-J
4184653775
feat: add RISC-V support for CPU backend (v2) ( #36578 )
...
Signed-off-by: typer-J <2236066784@qq.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-03-10 21:51:39 -07:00
Sladyn
4aaaf8c8ce
feat(spec_decode): fuse EAGLE step slot mapping and metadata updates ( #33503 )
...
Signed-off-by: sladynnunes <snunes@usc.edu >
2026-03-11 04:35:33 +00:00
Hongbin Guo
4bf533623b
[Doc] Fix duplicate words in comments ( #36713 )
...
Signed-off-by: Hongbin10 <jdmjdm1998@163.com >
2026-03-10 21:28:31 -07:00
Matthew Bonanni
5f77ef15ae
[Misc][Attention] Clean up unused method in CPU_ATTN ( #36673 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-10 21:27:22 -07:00
elvischenv
7d6abdd022
[Fix] Use torch.empty for output in attention+quant fusion ( #31785 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-03-10 21:26:14 -07:00
Wentao Ye
a8ff2cca92
[Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement ( #35781 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-10 21:25:30 -07:00
tunglinwood
42fadebecb
[Model] Add support for moonshotai/Kimi-Audio-7B-Instruct ( #36127 )
...
Signed-off-by: tunglinwood <tunglinwood@gmail.com >
Signed-off-by: tunglinwood <tomwu.tunglin@gmail.com >
Signed-off-by: tunglinwood <113751333+tunglinwood@users.noreply.github.com >
2026-03-10 21:24:48 -07:00
tianshu-Michael-yu
a197eda9c3
Add tuned H100 MoE configs for LFM2 8B and 24B ( #36699 )
2026-03-10 21:22:02 -07:00
Kevin H. Luu
82b110d50e
[ci] Bound nvidia-cudnn-frontend version ( #36719 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-11 12:17:35 +08:00
Benjamin Chislett
9040cd40af
[DSV3.2][MTP] Optimize Indexer MTP handling ( #36723 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-11 12:16:56 +08:00
fangyuchu
fa0d353acf
[Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks ( #35194 )
...
Signed-off-by: fangyuchu <fangyuchu@qq.com >
2026-03-11 03:22:21 +00:00
Augusto Yao
b386bb3d7c
fix bugs when token_classify & classify run concurrently ( #36614 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-03-10 20:16:34 -07:00
Ning Xie
fe714dd507
[openapi server] log exception in exception handler(2/N) ( #36201 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-10 20:16:30 -07:00
Matthew Bonanni
8ab3d7427c
[Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling ( #36691 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-11 03:01:07 +00:00
Wei Zhao
84e436ed1c
[Bug] Fix TRTLLM Block FP8 MoE Monolithic ( #36296 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-10 22:04:47 -04:00
Andreas Karatzas
81939e7733
[ROCm][CI] Making some tests optional to reduce workload ( #36090 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-10 16:45:27 -07:00
Woosuk Kwon
195d1ca3e8
[Minor] Enhance error message for TRTLLM decode uniformity check ( #36609 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-10 15:38:45 -07:00
Nick Hill
8d983d7cd6
[Model Runner V2] Add initial CI tests ( #36041 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 14:55:21 -07:00
Nick Hill
65b2f405dc
[Core] Simplify core kv-cache blocks initialization logic ( #36521 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 20:20:02 +00:00
Nick Hill
2a68464c5b
[Test] test_async_scheduling.py improvements ( #36340 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 11:17:26 -07:00
Zhengxu Chen
bdd8981dab
[compile] Apply stored functorch config while finalizing loaded artifacts. ( #36582 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-10 09:34:35 -07:00
Woosuk Kwon
f088a831dd
[Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata ( #36626 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-10 09:30:56 -07:00
Harry Mellor
f83b933b84
[CI] Bump mypy version to 1.19.1 ( #36104 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 09:18:28 -07:00
Pleaplusone
82f3f30e26
[ROCm][Perf] Enable sparse_mla's cudagraph on ROCm platform ( #35719 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-03-10 09:14:35 -07:00
Matthew Bonanni
9095cbbfb6
[Bugfix][Sparse MLA] report indexer CG support properly ( #36519 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-10 09:14:31 -07:00
Hashem Hashemi
721ae79f50
Improvements to wvSplitKrc skinny GEMM solution ( #34304 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-03-10 09:14:27 -07:00
AllenDou
aefc59f088
FunASR model bugfix ( #36633 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-03-10 08:14:21 -07:00
Harry Mellor
d88f28da05
Fix hf_override_fn when it modifies model_type ( #35200 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 15:03:18 +00:00
Srinivasoo7
106ff69c4e
feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency ( #35342 )
...
Signed-off-by: srinivas_oo7 <Sriusa4414@gmail.com >
Signed-off-by: Sriusa4414@gmail.com
Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com >
Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com >
Co-authored-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-10 14:43:40 +00:00
Jiangyun Zhu
ca5fb4bbd8
[Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs ( #36595 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-10 07:39:01 -07:00
Alvin Tang
cf88b23749
fix: check HTTP status in batch read_file to prevent silent failures ( #36397 )
...
Signed-off-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-10 07:22:40 -07:00
wang.yuqi
a3189a08b0
[Model] Consolidate score logic by introduce score_type ( #36479 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-10 13:32:25 +00:00
SoluMilken
409c4e632d
[Misc] fix typo: homogenous-> homogeneous (2 lines change) ( #36508 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-10 06:25:37 -07:00
Raushan Turganbay
8850738b70
[Bugfix] Fix processor signature ( #36630 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 06:20:47 -07:00
Mark McLoughlin
234860399b
[Frontend][Core] Revert "Add shutdown timeout" ( #34730 and #36270 ) ( #36628 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-10 06:20:41 -07:00
Harry Mellor
c88510083b
Fix Qwen2.5-VL test for Transformers v5 ( #36532 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 12:05:34 +00:00
Vadim Gimpelson
4ff8c3c8f9
[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU ( #35219 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-10 03:32:20 -07:00
Chang Su
507ddbe992
feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add --grpc flag to vllm serve ( #36169 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-03-10 03:29:59 -07:00
Nick Hill
ddbb0d230a
[Model Runner V2] Fix mm input embeddings lookup ( #36588 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 00:24:58 -07:00
Nick Hill
9efc3bdcd6
[Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill ( #36580 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 00:23:42 -07:00
amirkl94
156e33553c
Fix: Re-Enable EP for trtllm MoE FP8 backend ( #36494 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2026-03-09 23:11:27 -07:00
hallerite
d0cd736caa
[Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. ( #36557 )
...
Signed-off-by: hallerite <hallerite@users.noreply.github.com >
Signed-off-by: hallerite <git@hallerite.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-09 22:30:51 -07:00
Harry Mellor
195c997203
Fix LFM2 MoE test for Transformers v5 ( #36534 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-09 22:29:17 -07:00
Zhuohan Li
04b67d8f62
Remove unused disable_fallback field ( #36546 )
2026-03-09 20:56:54 -07:00
Wentao Ye
7279374f91
[Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement ( #36159 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 20:55:58 -07:00
Woosuk Kwon
006aea17d7
[BugFix] Remove incorrect assert in split_decodes_and_prefills ( #36553 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 20:02:02 -07:00
Hojin Yang
0836be3b03
[Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support ( #31471 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-10 10:59:19 +08:00
Ajay Anubolu
4e95ec111c
[Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 ( #36242 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
2026-03-09 19:16:26 -07:00
Andreas Karatzas
179547d62c
[ROCm][CI] Fix ROCm GPT-OSS Eval test group ( #36179 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 17:55:20 -07:00
youkaichao
f85b4eda3a
[bugfix] fix nvlink for nixl/ucx ( #36475 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-10 07:49:47 +08:00
Woosuk Kwon
2a194ddd72
[Model Runner V2] Add model_state inputs to CUDA graph capture ( #36544 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 15:14:51 -07:00
Shaun Kotek
203a7f27da
add nemotron v3 reasoning parser ( #36393 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
Co-authored-by: root <root@gpu-259.slurm-workers-slurm.slurm.svc.cluster.local >
2026-03-09 15:11:41 -07:00
Lucas Wilkinson
483463f735
[MRV2] Extensible CG dispatch rework ( #35959 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-03-09 13:58:45 -07:00
Matthew Bonanni
4e571ce643
[MTP][Misc] Clean up dead code ( #36507 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 14:43:06 -04:00
Micah Williamson
4ff9b045fe
[ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm ( #36025 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-09 13:27:55 -05:00
Lucas Kabela
3fd03f1ec2
[BE] Rename should_torch_compile_mm_vit to should_torch_compile_mm_encoder ( #36281 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-09 18:22:05 +00:00
Woosuk Kwon
10a5f4d53d
[Model Runner V2] Use NamedTuple for execute_model_state ( #35930 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 11:17:34 -07:00
Simon Mo
fe0c085c28
[Docs] Remove the reo beacon ( #36528 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-03-09 11:16:50 -07:00
Taneem Ibrahim
8d6b3d5dda
[Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers ( #36436 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-09 14:14:11 -04:00
Copilot
4b87ffbefb
[torch.compile] Rename compile_ranges_split_points to compile_ranges_endpoints ( #36027 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-09 18:04:40 +00:00
Shaun Kotek
fa028207aa
Fix/resupport nongated fused moe triton ( #36412 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com >
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: liweiguang <codingpunk@gmail.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Alex Brooks <albrooks@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
Signed-off-by: Tushar Shetty <tushar.shetty@abbyy.com >
Signed-off-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: nvnbagrov <nbagrov@nvidia.com >
Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com >
Co-authored-by: danisereb <daserebrenik@nvidia.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Weiguang Li <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: Alex Brooks <albrooks@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: cong-or <conchubhar.gannon@gmail.com >
Co-authored-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
Co-authored-by: liuzhenwei <zhenwei.liu@intel.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 11:01:18 -07:00
Russell Bryant
d460a18fc6
[Docs] Expand --allowed-media-domains security guidance with threat details ( #36506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-09 17:43:42 +00:00
Woosuk Kwon
6e956d9eca
[Model Runner V2] Add dummy profile_cudagraph_memory API ( #36520 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 10:20:13 -07:00
Andreas Karatzas
1e0f917b34
[ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm ( #36101 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 12:07:44 -05:00
Andreas Karatzas
c174d54f86
[ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks ( #36292 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 12:02:41 -05:00
SoluMilken
55d27cca55
[Misc] fix typo: dependant -> dependent (2 lines change) ( #36511 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-09 10:00:12 -07:00
Roberto L. Castro
580864d81e
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 ( #34917 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
2026-03-09 09:50:36 -07:00
Roberto L. Castro
2b28b9b269
[Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 ( #35290 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-09 09:46:57 -07:00
Taoyu Zhu
70485a11bd
[ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. ( #36253 )
...
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com >
2026-03-09 11:30:35 -05:00
Harry Mellor
74a9f54cdb
[CI] Fix edge case that could lead to broken docs builds on main ( #36515 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-09 09:06:19 -07:00
Matthew Bonanni
00c4cb5606
[Bugfix] Clear stale CG keys after memory profiling ( #36416 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 11:56:00 -04:00
Wentao Ye
941e52c298
[Refactor] Simplify chat_completion_full_generator for tool parsers ( #35634 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 23:33:46 +08:00
Wentao Ye
be292b7c14
[Bug] Fix pooling model benchmark script ( #36300 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 11:17:45 -04:00
Matthew Bonanni
77a73458e3
Reapply [Attention] Refactor check_and_update_config ( #35122 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 07:17:14 -07:00
Tianyu Guo
5578f2a4d3
Support online use_audio_in_video ( #36319 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 07:16:44 -07:00
Cyrus Leung
3ec2115015
[Frontend] Move warmup into Renderer ( #36482 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 06:03:21 -07:00
Isotr0py
b0906d8b02
[MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU ( #36472 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 03:43:44 -07:00
Kevin H. Luu
aaf5fa9abf
[ci] Bound openai dependency to 2.24.0 ( #36471 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-09 03:43:26 -07:00
Cyrus Leung
f96c3ab08c
[Deprecation][1/2] Remove items deprecated in v0.18 ( #36470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 03:43:23 -07:00
Xin Yang
dc6b578466
[Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next ( #35777 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-08 23:41:01 -07:00
liuzhenwei
1bc9c77f6d
[XPU] Add test script of PD disaggregation ( #36434 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
2026-03-09 05:50:27 +00:00
Alex Brooks
65a4da1504
[Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) ( #36160 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-03-09 05:46:23 +00:00
Li, Jiang
217f27598d
[Bugfix] Avoid to replace non-tensor members in cpu model runner ( #36430 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-09 13:06:28 +08:00
wang.yuqi
fff3711a24
[Frontend][2/n] Improve pooling entrypoints | embed. ( #36110 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
2026-03-09 11:42:19 +08:00
Tushar Shetty
c4d859c274
[Bugfix] Skip out-of-stage layers in get_layers_from_vllm_config for pipeline parallel ( #36243 )
...
Signed-off-by: Tushar Shetty <tushar.shetty@abbyy.com >
Signed-off-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
2026-03-08 20:40:16 -07:00
cong-or
747431044d
feat(attention): extract KV-cache update from FlexAttention backend ( #36263 )
...
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
2026-03-08 20:40:12 -07:00
Cyrus Leung
d62856b928
[Misc] Move processors to transformers_utils ( #35953 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 11:31:39 +08:00
Alex Brooks
bd2659a566
Increase Flexibility for OOV Multimodal Token Handling ( #34858 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-03-08 20:30:49 -07:00
Shaun Kotek
90512b2e8b
fix: Use iterator as not to store all the file loads in memory at once ( #36149 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
2026-03-08 20:25:21 -07:00
wang.yuqi
dcf8862fd4
[Examples][1/n] Resettle basic examples. ( #35579 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-08 20:22:53 -07:00
Weiguang Li
43aa389231
[Bugfix] Fix CPU OMP autobind assertion to use local_world_size ( #35815 )
...
Signed-off-by: liweiguang <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-03-08 20:07:29 -07:00
Wentao Ye
384425f84e
[Dependency] Remove default ray dependency ( #36170 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-08 20:06:22 -07:00
Harry Mellor
a0f44bb616
Allow markdownlint to run locally ( #36398 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-08 20:05:24 -07:00
Kunshang Ji
fde4771bbd
[XPU][Doc] update xpu document about triton dependency/conflict issue. ( #36301 )
...
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-03-09 02:09:22 +00:00
Jiangyun Zhu
e5ff140216
[cudagraph] fix cudagraph warning in deepseekv32 ( #28044 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-08 20:27:41 -04:00
danisereb
0a6a3a1290
Add support for ModelOpt MXFP8 MoE models ( #35986 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-03-08 13:00:05 -07:00
Sage
4497431df6
[Frontend] Add GPU-less render serving path (vllm launch render) ( #36166 )
2026-03-08 16:35:09 +01:00
nvnbagrov
b7332b058c
[Model] Nano Nemotron VL - fast media preprocessing ( #35657 )
...
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com >
2026-03-08 03:04:05 -07:00
Andreas Karatzas
40077ea3de
[CI] fix flaky empty responses and add diagnostic assertions in vision chat tests ( #36341 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-08 14:42:24 +08:00
Samuel Shen
5d6aae4577
[LMCache MP Patch]: Race Condition + Duplicated Block Ids ( #35831 )
2026-03-07 13:52:48 -08:00
Roy Huang
63298ee173
[Bugfix][LMCache][KVConnector] fix potential memory leak in LMCache multiprocess mode ( #35931 )
2026-03-07 13:52:35 -08:00
Richard Zou
2dde535df1
[compile] Split compile/warmup monitoring ( #36098 )
2026-03-07 13:52:11 -08:00
Wei Zhao
379689d533
[Perf] Support FP8 KV cache for Flashinfer MLA Sparse ( #35891 )
2026-03-07 13:51:54 -08:00
PatchyTIS
a6be75dbd2
[Core] NGram GPU Implementation compatible with Async Scheduler ( #29184 )
2026-03-07 13:51:37 -08:00
Micah Williamson
ee54f9cdb9
[ROCm][CI] Accept Different But Valid Output for test_olmoe_tp ( #35224 )
2026-03-07 13:50:52 -08:00
Micah Williamson
fc4657756f
[ROCm][CI] Enable AITER for failing test_gpt_oss test case on MI355 ( #36174 )
2026-03-07 13:50:17 -08:00
qli88
eebd14651f
[CI] Enable Crosslayer KV layout tests for ROCm platforms ( #35416 )
2026-03-07 13:49:56 -08:00
Matthew Bonanni
ebb9cc5f2b
[UX][Startup] Account for CUDA graphs during memory profiling ( #30515 )
2026-03-07 13:49:23 -08:00
rahul-sarvam
85f50eb41f
Adding support to Sarvam's MoE models ( #33942 )
...
Signed-off-by: rahul-sarvam <140298821+rahul-sarvam@users.noreply.github.com >
2026-03-08 01:16:24 +08:00
Taneem Ibrahim
5261223c2d
[Misc] Remove duplicate parser registration ( #36303 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-07 09:37:01 -05:00
lif
00b814ba5a
[V0 Deprecation] Remove unused swap_space parameter ( #36216 )
...
Signed-off-by: majiayu000 <1835304752@qq.com >
Co-authored-by: mcelrath
2026-03-07 22:09:55 +08:00
vllmellm
ee8a29511f
[Bugfix] Fix compressed-tensors quantization failure for DeepSeek-R1 on MI300x ( #36247 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-07 09:26:59 +00:00
milesial
755356b3d1
feat: expose media_io_kwargs at runtime ( #34778 )
...
Signed-off-by: Alexandre Milesi <milesial@users.noreply.github.com >
2026-03-07 04:27:04 +00:00
Andreas Karatzas
58928475e4
[ROCm][CI] Making entrypoints more deterministic on ROCm ( #36293 )
2026-03-06 19:04:40 -08:00
Mengtao (Martin) Yuan
1a9718085c
Fix CUDA graph decode capture crash in AITER FlashAttention ( #36042 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-03-06 18:12:07 -08:00
Kunshang Ji
7eb524e64c
refine vllm bench throughput --backend hf ( #35971 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-07 02:10:33 +00:00
Nick Hill
c7f32e08c2
[BugFix] Avoid ignored trust_remote_code warnings ( #36290 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-07 01:24:18 +00:00
Nick Hill
b354686524
[Model Runner V2] Fix warmup for pipeline parallel ( #36280 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-06 16:58:51 -08:00
Nick Hill
6a18d8789b
[Core] Fix benign error log during normal shutdown ( #36270 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2026-03-07 00:39:21 +00:00
Itay Alroy
24a03915f5
mla: don't update kv cache on dummy forwards ( #36282 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-07 00:36:00 +00:00
Andreas Karatzas
b5e34e1fca
[ROCm][CI] Fixing yaml file for external amd-ci signal ( #36284 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 18:30:39 -06:00
Copilot
ce8546a12b
[docs][torch.compile] Add fusions.md — kernel/operator fusion reference page ( #35538 )
...
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: ProExpertProg <luka.govedic@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-06 23:55:06 +00:00
Chuan (Richard) Li
c188749bcd
[ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) ( #35850 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-06 20:24:03 +00:00
Alexei-V-Ivanov-AMD
225d1090a0
Enabling some B200-specific tests on MI355 ( #35253 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Signed-off-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
2026-03-06 19:27:20 +00:00
eellison
f3c6c9c9d7
[CustomOp] CustomOp FusedRMSNormGated ( #35877 )
...
Signed-off-by: Elias Ellison <elias.ellison@gmail.com >
Signed-off-by: eellison <elias.ellison@gmail.com >
2026-03-06 10:53:37 -08:00
Nick Hill
26bd43b52d
Revert "[BugFix] Fix engine hanging after KV cache initialization fai… ( #36262 )
2026-03-06 08:28:09 -08:00
Travis Johnson
6b625a8807
[Bugfix] Quickfix followups to busy loop removal in #28053 ( #36068 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-06 08:13:05 -08:00
Richard Zou
54756b6109
[compile] Stop unconditionally patching constrain_to_fx_strides ( #36152 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-06 10:17:27 -05:00
Raphaël Rialland
39f9ea0da4
[Bugfix] Fix cudagraph_mode:FULL dispatch (This does not impact FULL_AND_PIECEWISE (default)) ( #36165 )
2026-03-06 09:15:31 -05:00
Isotr0py
e4ae148a78
[Refactor] Modular video loader backend refactoring ( #35202 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-06 06:06:59 -08:00
Isotr0py
1d0c0d209c
[Misc] Lazy import registered processors ( #36024 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-06 06:06:45 -08:00
Chenguang Zheng
fcb73f306c
[bugfix] add api process rank in default multimodal request ( #36150 )
...
Signed-off-by: fake0fan <645327136@qq.com >
Signed-off-by: Chenguang ZHENG <645327136@qq.com >
2026-03-06 12:00:09 +00:00
Harry Mellor
e2090bf3af
[CI] Fix startup error test ( #36230 )
...
A change in engine startup error messages in #35478 caused this test failure.
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-06 11:50:28 +00:00
Andreas Karatzas
2a00d3241f
[CI][MM] Gate vision encoder attention mask to MiniCPM only, fixing Aria regression ( #36206 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 01:17:08 -08:00
Alex Brooks
10f4db4dbe
[Frontend] Add Support for MM Encoder/Decoder Beam Search (Offline) ( #36153 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-06 01:16:56 -08:00
Nicolò Lucchesi
5b3ba94ab4
[Core][KVConnector] Support HMA+NixlConnector ( #35758 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-06 08:51:21 +01:00
zhanqiuhu
90f3c01fa4
[Spec Decode][KV Connector] Fix KV transfer in PD + speculative decoding ( #35158 )
...
Signed-off-by: Claude <noreply@anthropic.com >
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-06 08:50:44 +01:00
Andreas Karatzas
807d680337
[ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance ( #35553 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 15:15:12 +08:00
Tyler Michael Smith
5afb387bd4
Change "following fields were present in the request but ignored" log from warn to debug ( #36173 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-03-05 22:15:46 -08:00
Walter Beller-Morales
43e77e59ab
[BugFix] avoid infinite loop with VLLM_PORT and get_open_ports_list ( #36191 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-05 22:15:29 -08:00
Russell Bryant
00bd08edee
[Security] Respect user trust_remote_code setting in NemotronVL and KimiK25 ( #36192 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-05 22:15:19 -08:00
Ajay Anubolu
43f10573c9
[Bugfix] Fix misleading context length error messages ( #36197 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-05 22:15:12 -08:00
Yongye Zhu
86e1060b17
[Bugfix] Fix inner_dp_world initialization order for multi-node TP ( #35892 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-03-05 22:04:44 -08:00
Mark McLoughlin
27066d1b2b
[Frontend][Core] Add shutdown timeout - allowing in-flight requests to finish ( #34730 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-03-05 22:04:31 -08:00
cong-or
57c84ff129
perf: add __slots__ to KVCacheBlock ( #36164 )
...
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
2026-03-05 22:04:09 -08:00
Xiang Shi
e68de8adc0
docs: fix wrong cc in int8.md ( #36209 )
...
Signed-off-by: Xiang Shi <realkevin@tutanota.com >
2026-03-06 06:01:02 +00:00
Andreas Karatzas
a1ffa56a1e
[CI] Fix bge-m3 similarity reference values after *Defination* typo fix ( #36208 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 05:07:29 +00:00
Shiyan Deng
0a208d1f54
[BugFix] Fix engine hanging after KV cache initialization failure ( #35478 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:58:09 -08:00
Shiyan Deng
03a49bb8f0
[Feature] Add --distributed-timeout-seconds CLI option ( #36047 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:57:51 -08:00
Shiyan Deng
8e87cc57f1
[Bug] Fix a corner case in _process_simple_streaming_events ( #34754 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:57:32 -08:00
Cyrus Leung
6dd302653f
[Misc] Rename group_mm_kwargs_by_modality -> group_and_batch_mm_kwargs ( #36158 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-06 12:32:48 +08:00
Cyrus Leung
de00ebeac4
[Bugfix] Fix simple Mistral-Small example ( #36156 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-05 20:25:11 -08:00
Andreas Karatzas
639680d220
[ROCm][CI] Adding missing dependencies for Multi-modal models tests ( #36177 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 12:23:10 +08:00
Rohan Potdar
c5362c739f
Reenable features for ROCm attention backends ( #36185 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-05 20:21:06 -08:00
Nikhil Gupta
0a49676fb0
cpu: aarch64: Upgrade OneDNN for aarch64 to add support for int8 matmul ( #36147 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2026-03-06 03:48:59 +00:00
Jeffrey Wang
c012a8c477
Don't fire ray compatibility webhook when PR or branch is not provided ( #36088 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-03-06 00:42:21 +00:00
Dor Huri
ebed80a7c8
[Performance] Extract KV-cache update from TreeAttention backend ( #35384 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
2026-03-06 00:22:43 +00:00
Nick Hill
a73af584fe
[Model Runner V2] Fix warmup for very small kvcache and/or blocksizes ( #36176 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-05 14:48:10 -08:00
Zhengxu Chen
a97954b6a8
[compile] Consistent compiler config for saved/loaded vllm backends. ( #35810 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 15:08:12 -05:00
Yanhong Li
a911f4dd20
[Model] Add support for OLMo Hybrid ( #32550 )
2026-03-05 14:51:06 -05:00
Russell Bryant
5395471d29
[CI] Add explicit permissions to macOS smoke test workflow ( #35775 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-05 19:08:48 +00:00
Frank Wang
a57c877f18
[BugFix] Fallback from FA4->FA2 for Batch Invariance ( #36059 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2026-03-05 14:05:56 -05:00
Xin Yang
f917020983
[Perf] Optimize FusedMoEModularKernel output tensor using torch.empty ( #35794 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-05 13:47:53 -05:00
tomeras91
86483ca774
[Bugfix] Disable FlashInfer TRTLLM BF16 path for non-gated MoE ( #36146 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-05 09:49:05 -08:00
Netanel Haber
b93a9e6f6d
ParakeetProjection.norm = RMSNorm instead of nn.LayerNorm ( #36133 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-05 17:29:30 +00:00
Xinyu Chen
d8839ef7d9
[XPU] Enable ModelRunnerV2 on XPU ( #36078 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-03-05 17:19:18 +00:00
Avery Miao
e998fa76b9
[BUGFIX]Fix Qwen-Omni models audio max_token_per_item estimation error leading to encoder_cache_size is 0 ( #35994 )
...
Signed-off-by: Miao, Avery <avery.miao@intel.com >
2026-03-05 09:16:29 -08:00
Jiayi Yan
6a895197fa
[Bugfix][CI] fix typos ( #34934 )
...
Signed-off-by: 1195343015 <1195343015@qq.com >
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 17:05:46 +00:00
Sage Moore
8c760b6ab6
[ROCm] Refactor ROCm attention backend selection logic ( #35246 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-05 10:51:26 -06:00
AllenDou
3ee68590c7
refactor funasr model. ( #36108 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-05 08:07:37 -08:00
Cyrus Leung
7196348157
[Bugfix] Fix Qwen-VL tokenizer implementation ( #36140 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-05 08:07:19 -08:00
Ning Xie
176c799f4c
[openai api] log exception in exception handler (1/N) ( #31164 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-05 16:00:12 +00:00
Or Ozeri
612e7729c2
[KVConnector] Scheduler: Fix num_computed_tokens after async KV load ( #34616 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-05 14:25:15 +00:00
Harry Mellor
ecde7af9c4
Fix import that was moved in Transformers 5.2.0 ( #36120 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 13:59:44 +00:00
Harry Mellor
8df523351f
[Docs] Only build docs if documentation or ready labels are present ( #36135 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 13:58:16 +00:00
Andreas Karatzas
b03ff6a96b
[CI] Stabilize test_no_args_tool_call and add ROCm-specific server args ( #36107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-05 21:52:49 +08:00
Ajay Anubolu
ed81d5edd1
[Bugfix] Fix RunAI streamer crash with S3-hosted model paths ( #35976 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-05 12:14:20 +00:00
Shiyan Deng
3c23ac840e
[Bugfix] Fix mypy errors in hermes_tool_parser.py ( #36114 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-03-05 11:37:47 +00:00
cjackal
a708ef5944
[Misc] Fix SyntaxWarning - invalid escape sequence '\e' ( #36020 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-03-05 10:55:31 +00:00
Kunshang Ji
66a2209645
[Hardware] Replace torch.cuda.synchronize() api with torch.accelerator.synchronize ( #36085 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-05 10:36:39 +00:00
Doug Smith
0bfa229bf1
[Release] Include source distribution (sdist) in PyPI uploads ( #35136 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
Co-authored-by: Daniele Trifirò <dtrifiro@redhat.com >
2026-03-05 01:43:50 -08:00
Paco Xu
7493c51c55
[Docs] add Dynamo/aibrix integration and kubeai/aks link ( #32767 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-03-05 17:39:50 +08:00
Reagan Lee
ac773bbe80
[Docs] Update docs to include mm processor + encoder benchmarks ( #34083 )
...
Signed-off-by: Reagan <reaganjlee@gmail.com >
2026-03-05 01:38:25 -08:00
Christian Munley
48e376a007
qwen3coder tool parser fix anyOf double encoded parameters ( #36032 )
...
Signed-off-by: Christian Munley <cmunley@nvidia.com >
2026-03-05 09:06:57 +00:00
Isotr0py
21eb2c3372
[Chore] Correct MTP models test registry ordering ( #36115 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-05 08:55:04 +00:00
Seiji Eicher
e2b31243c0
[Docs] Update CacheConfig block_size docstring to remove inaccurate limit when using CUDA ( #35632 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-03-05 06:24:08 +00:00
Martin Hickey
c3598d02fa
[Misc] Remove deprecated items that are due for removal ( #36006 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-05 06:14:50 +00:00
Benjamin Chislett
57c629e9c1
[Bugfix] Fix block_size for hybrid model MTP ( #36036 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-05 06:10:54 +00:00
zihaoanllm
d106bf39f5
[Doc] Add Parallel Draft Models ( #35973 )
...
Signed-off-by: <zihaoan2@amd.com >
Signed-off-by: zihaoanllm <zihaoan2@amd.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 05:44:07 +00:00
Yanan Cao
b0651021e5
[Kernel] [Helion] [11/N] Retune configs for silu_mul_fp8 ( #36062 )
2026-03-04 21:25:59 -08:00
Hanjun Cho
f600d5192e
[Bugfix] Fix score layer quantization for sequence classification models - Qwen3 (VL) Reranker ( #35849 )
...
Signed-off-by: Hanjun Cho <gkswns0531@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-04 20:57:20 -08:00
Tianmu Li
8e7820131e
[Perf] Use dummy M for weight prepacking on x86 ( #35890 )
...
Signed-off-by: Li, Tianmu <tianmu.li@intel.com >
2026-03-05 04:56:49 +00:00
Andrii Skliar
0a12cea25f
Order config.py in Lexicographical order ( #35866 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-03-04 20:56:47 -08:00
Zhengxu Chen
dd6dbd93f8
[compile] Fix extra cache save on warm start. ( #35921 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 12:56:30 +08:00
Harry Mellor
26366009c5
[CI] Don't leave docs preview comment on closed PRs ( #36087 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 04:51:46 +00:00
Nick Hill
16c472abe7
[Core] Move ray-specific WorkerWrapperBase methods to RayWorkerWrapper ( #35328 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-05 12:11:59 +08:00
daje0601
3b23d57c96
[Model] Add LoRA support for Whisper models ( #29856 )
...
Signed-off-by: daje0601 <englishmt4118@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-05 10:38:25 +08:00
Wentao Ye
2f4226fe52
[CI] Fix pre-commit mypy issue in main ( #36049 )
2026-03-04 18:13:12 -08:00
nkm-meta
792cbd64ca
Add platform method to enable custom collective ops registration ( #34760 )
...
Signed-off-by: Naina Kuruballi Mahesh <nainakm@meta.com >
2026-03-05 00:50:32 +00:00
Zhengxu Chen
2ed4722e26
[compile] Reduce log spam from compile. ( #36044 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 00:48:36 +00:00
Nick Hill
a3299c3d1d
[Model Runner V2] Misc code simplification ( #35941 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 15:26:35 -08:00
Andreas Karatzas
6c21a0c2d7
[ROCm][CI] Added MI325 mirrors (stage C) ( #35239 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 14:48:46 -08:00
Shanshan Shen
562339abc3
[Misc] Support OOT linear method registering ( #35981 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-03-04 22:25:56 +00:00
amitz-nv
d7adcadb9b
[Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 ( #36017 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-03-04 22:23:51 +00:00
Simon Mo
f678c3f61a
[RL] [Weight Sync] Guard IPC update-info pickle deserialization behind insecure serialization flag ( #35928 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-03-04 17:05:32 -05:00
Thomas Parnell
be0a3f7570
[Bugfix] Fix race in non-blocking num_accepted_tokens GPU->CPU copy ( #36013 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-04 13:52:44 -08:00
Harry Mellor
17dc9c7fc9
[CI] Bump mypy version ( #34950 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 20:55:11 +00:00
fenypatel99
7eca859110
Add PyTorch profiler schedule support with warmup/active iterations ( #35240 )
2026-03-04 12:53:38 -08:00
Russell Bryant
636ee223ac
[Docs] Document security risks of GPT-OSS Python tool ( #35139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-04 20:27:31 +00:00
Robert Shaw
b7d59ffce2
[UX] Remove NoOpOffloader log ( #35678 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-04 12:13:40 -08:00
Richard Zou
5569f5218d
[torch.compile] Stop lazily compiling ( #35472 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-04 12:13:17 -08:00
Davina Zaman
138d891d7f
[Docs] Clarify structured outputs configuration for Qwen3 reasoning mode ( #32441 )
...
Signed-off-by: Davina Zaman <davzaman@users.noreply.github.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 11:44:39 -08:00
Stefano Castagnetta
d7166e74c1
[CI] Add Blackwell AsyncTP correctness test ( #35871 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
2026-03-04 19:41:21 +00:00
Nick Hill
417fd28fb1
[Model Runner V2] Fix pooling ( #36019 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 10:53:17 -08:00
tomeras91
7faba503c4
[Kernel][Mamba] Optimize Mamba2 SSD prefill Triton kernels ( #35397 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-04 19:47:17 +01:00
Hyunkyun Moon
bc6be89d16
[Frontend] Add vllm launch command for GPU-less preprocessing serving ( #34551 )
...
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com >
2026-03-04 18:41:52 +00:00
Maxime Grenu
32224f568a
docs: update CPU Docker images to reference Docker Hub instead of AWS ECR ( #34882 )
...
Signed-off-by: Maxime Grenu <69890511+cluster2600@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:31:35 -08:00
Abhishek Mathukiya
f3dc292e9f
docs: add version requirement note for --profiler-config flag ( #32454 )
...
Signed-off-by: abhishkh <mathukiya.a@northeastern.edu >
2026-03-04 18:13:54 +00:00
Chen
138c5fa186
[Docs] Add RunPod GPU deployment guide for vLLM ( #34531 )
...
Signed-off-by: lisperz <zhuchen200245@163.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:11:34 -08:00
Russell Bryant
2f2c1d73a7
[Docs] Upgrade dynamic LoRA warning to admonition block ( #35218 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-04 10:01:42 -08:00
Bhuminjay Soni
fb3e78ab09
[Feature][CI]: compare func & no_func outputs in test_functionalization.py ( #35481 )
...
Signed-off-by: Bhuminjay <bhuminjaysoni@gmail.com >
Signed-off-by: Bhuminjay Soni <Soni5Happy@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-04 18:01:16 +00:00
Michael Yao
fd3bfe74c9
[Docs] Update design/multiprocessing.md ( #30677 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2026-03-04 17:58:59 +00:00
tc-mb
bfdb512f11
fix minicpmo4.5: fix attn_mask in vit attn && fix resampler pos_emb i… ( #34127 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Co-authored-by: hezhihui <hezhihui@modelbest.cn >
2026-03-04 17:46:17 +00:00
Sage
d25c1ec3c9
docs(cpu): Clarify pre-built wheels requirement for CPU Python-only build ( #35090 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-04 17:45:35 +00:00
Xing Liu
7cc6058ac6
[Doc] Add MTP docs and update speculative decoding guidance ( #35197 )
...
Signed-off-by: liuxing <945764858@qq.com >
2026-03-04 17:23:34 +00:00
Manrique Vargas
28028dff2f
fix(docs): use static rdzv backend in multi-node troubleshooting script ( #34784 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-03-04 17:15:35 +00:00
Dr Alex Mitre
3417ba5648
docs: add README for logits_processor examples ( #35933 )
2026-03-04 17:09:19 +00:00
Yan Ma
58cfe0dc44
Fix phi4-mm and remove cuda binding ( #35964 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-05 01:08:05 +08:00
simone-dotolo
e86221deb6
[Doc] Fix GPU Worker count in Process Count Summary ( #36000 )
...
Signed-off-by: simone-dotolo <simonedotolo@libero.it >
Signed-off-by: simone-dotolo <84937474+simone-dotolo@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 17:03:14 +00:00
Netanel Haber
289fc48ab7
Use MMEncoderAttention (=use FlashAttention) instead of torch.sdpa in radio.py ( #35653 )
2026-03-04 08:43:13 -08:00
Christian Pinto
2f2212e6cc
Split generic IO Processor plugins tests from Terratorch specific ones ( #35756 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-03-05 00:01:03 +08:00
Nicolò Lucchesi
18e01a0a10
[Misc] Add --attention-backend auto option ( #35738 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-04 15:12:27 +00:00
sungsoo ha
6cb901093f
[Core] Add All-to-All communication backend for DCP ( #34883 )
...
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com >
Signed-off-by: sungsoo ha <hasungsoo@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:01:57 -05:00
Cyrus Leung
ead7bde1ab
[Bugfix] Make kaldi_native_fbank optional ( #35996 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-04 06:47:32 -08:00
Qi Wang
6aa6ad8992
[BugFix] Fix implicit and incorrect assumption on ECConnector is_producer ( #34783 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-03-04 15:01:30 +01:00
Raghavan
c8c3935b70
[Bugfix][Model] Fix FP8 k_scale/v_scale not loaded for Qwen3-MoE ( #35656 )
...
Signed-off-by: raghavan <oneraghavan@gmail.com >
2026-03-04 13:15:38 +00:00
Ronen Schaffer
bb6888b8b1
[Bugfix][CPUOffloadingManager] Prevent eviction of already-stored blocks in LRU/ARC prepare_store() ( #35846 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-03-04 14:25:33 +02:00
Taneem Ibrahim
1aaec59d79
[MISC] fixed tool_parser mypy errors ( #35640 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 12:23:12 +00:00
pougetat
1659b2e058
[Feature] Add basic metrics for /realtime endpoint ( #35500 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Signed-off-by: pougetat <thomas.pougetabadie@gmail.com >
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 19:56:32 +08:00
haosdent
d6e04f4c43
[Bugfix] Cap FULL decode cudagraph sizes for Mamba/hybrid models ( #34094 ) ( #34571 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-04 11:56:22 +01:00
Kunshang Ji
a8f66cbde8
[XPU] bump vllm-xpu-kernels to v0.1.3 ( #35984 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-04 18:23:31 +08:00
Kunshang Ji
16d2ad1d38
[Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache ( #30681 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 09:49:47 +00:00
Chuan (Richard) Li
5dc3538736
[ROCm][Bugfix] Fall back from CK MXFP4 MoE when GEMM dimensions are unsupported ( #35893 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-04 08:30:54 +00:00
Nathan Price
36bf213181
[Bugfix] Add missing dynamic_arg_dims for Qwen3-ASR torch.compile ( #35869 )
...
Signed-off-by: Nathan Price <nathan@abridge.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-04 08:29:01 +00:00
Joe Runde
6f0dd93801
[Core] Remove busy loop from idle buffer readers ( #28053 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 07:44:20 +00:00
Andrii Skliar
5d199ac8f2
Support Audio Extraction from MP4 Video for Nemotron Nano VL ( #35539 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Andrii <askliar@nvidia.com >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Andrii Skliar <askliar@oci-nrt-cs-001-vscode-01.cm.cluster >
Co-authored-by: Andrii <askliar@nvidia.com >
Co-authored-by: root <root@pool0-03748.cm.cluster >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: root <root@pool0-02416.cm.cluster >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: root <root@pool0-04880.cm.cluster >
2026-03-03 23:20:33 -08:00
Komal Kumar Teru
9e0f44bec4
[cohere][fix][spec-decode]: fix crash when allowed_token_ids is set without penalties ( #35654 )
...
Signed-off-by: kkt-cohere <komal@cohere.com >
2026-03-03 23:20:15 -08:00
lailoo
097eb544e9
[Bugfix] Improve engine ready timeout error message ( #35616 )
...
Signed-off-by: damaozi <1811866786@qq.com >
2026-03-04 05:54:32 +00:00
ShiJie Zhong
7cdba98edf
[BugFix] Support tool_choice=none in the Anthropic API ( #35835 )
...
Signed-off-by: ZhongsJie <zhongsjie@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-04 05:24:46 +00:00
Charlie Fu
3c85cd9d74
[Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) ( #35913 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-03-04 04:50:13 +00:00
Andreas Karatzas
edba15045a
[Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions ( #35711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 04:12:51 +00:00
Cyrus Leung
e379396167
[Refactor] Clean up processor kwargs extraction ( #35872 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-03 19:53:53 -08:00
Isotr0py
6e9f21e8a2
[Chore] Remove debug code in model implementation ( #35883 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 19:50:58 -08:00
AllenDou
c1d963403c
[model] support FireRedASR2 ( #35727 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 19:41:30 -08:00
Shanshan Shen
77e6dcbbfa
[PluggableLayer][MM] Add PluggableLayer for RelPosAttention ( #33753 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-03-03 19:41:27 -08:00
William Zhang
70c73df69e
[Bugfix] Fix EVS implementation for Qwen3 VL ( #33607 )
...
Signed-off-by: 2ez4bz <133824995+2ez4bz@users.noreply.github.com >
2026-03-04 02:18:11 +00:00
xjx
9a9d442464
Enable bnb for multiple indices weight ( #35838 )
...
Signed-off-by: xjx <493337577@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-04 01:46:47 +00:00
Andreas Karatzas
f7da9cdffc
[ROCm][CI] Support async weight transfer example with platform-aware determinism ( #35710 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 09:44:14 +08:00
Jaewon
f22ff2958c
[Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode ( #35916 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-04 00:10:11 +00:00
Nick Hill
d15c3b90fc
[Core] Move save_tensorized_model logic to Worker ( #35825 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-03 15:31:59 -08:00
zhrrr
97286a20ed
[Model Runner V2] support dp & ep for spec decoding ( #35294 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-03 15:19:45 -08:00
Amr Mahdi
12b38c0f45
[CI/Build] Allow mounting AWS credentials for sccache S3 auth ( #35912 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-03-03 14:30:47 -08:00
Woosuk Kwon
467886a0c4
[Model Runner V2] Fix inputs_embeds=None bug for MM models ( #35917 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-03 13:47:45 -08:00
bnellnm
a9b8b13e5c
[Bugfix] Fix misnamed parameter in compressed_tensors_moe.py ( #35813 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-03 16:29:57 -05:00
Micah Williamson
e7213003cb
[ROCm][CI] Fix TP size issue for test_gpt_oss ( #35887 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-03 20:57:34 +00:00
Rohan Potdar
3a8eef5869
[ROCm][Bugfix]: Disable AITER Triton ROPE by default ( #35601 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-03 13:43:56 -06:00
Robert Shaw
97995f6376
[MoE Refactor] Create MK for TRTLLM Kernels ( #32564 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2026-03-03 10:39:50 -08:00
Robert Shaw
881a6b011b
[CI] Temporarily Disable Llama4 MoE Refactor Test ( #35870 )
...
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-03 10:36:15 -08:00
Matthew Bonanni
8e1fd5baf0
[CI] Bump num_speculative_tokens to 3 in nightly DeepSeek tests ( #35882 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-03 09:26:44 -08:00
JasonCohere
ae88468bcc
fix: Ensure invalid audio files return 400 error ( #34715 )
...
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-03 08:47:39 -08:00
ojhaanshika
e05cb3b93e
TRTLLM gen-full attn Test Coverage ( #34986 )
...
Signed-off-by: Anshika Ojha <anshikao@nvidia.com >
Co-authored-by: Anshika Ojha <anshikao@gb-nvl-059-compute09.nvidia.com >
2026-03-03 11:35:34 -05:00
Lucas Wilkinson
28ef9ba399
[BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA ( #34552 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-03 07:21:57 -08:00
TJian
fb7fdc49c4
[ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops ( #34307 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-03 06:24:21 -08:00
wang.yuqi
ea463978bb
[Frontend][1/n] Improve pooling entrypoints | classify. ( #35604 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-03 06:05:36 -08:00
Li, Jiang
440f0e7dc6
[Bugfix] Avoid src/dst as None in irecv/isend_tensor_dict ( #35754 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-03 05:56:08 -08:00
wang.yuqi
fd4a90f337
[CI] And PPL test for Qwen3.5. ( #35853 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-03 13:15:51 +00:00
Thomas Parnell
ad9d09e2b8
[Perf] [Hybrid] Copy num_accepted_tokens in non-blocking way when not using prefix caching ( #35442 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-03-03 04:15:43 -08:00
Szymon Reginis
4beebfd146
[CI/Build][Intel] Add new performance benchmarks for Intel Gaudi 3 ( #31025 )
...
Signed-off-by: Szymon Reginis <sreginis@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-03 19:48:24 +08:00
hallerite
b8401cde0e
add regression test ( #35834 )
...
Signed-off-by: hallerite <git@hallerite.com >
2026-03-03 07:32:15 +00:00
TJian
5dfc5abe94
[ROCm] [Release] Change the package from aiter to amd-aiter ( #35198 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-02 23:13:39 -08:00
lin-shh
8fa68a8ce4
Fix TYPE_CHECKING stub defaults in envs.py to match actual runtime defaults ( #35645 )
2026-03-02 21:59:43 -08:00
lin-shh
35a6f0bfe2
[Misc] Fix typos in comments: explict→explicit, paramaters→parameters ( #35648 )
2026-03-02 21:59:14 -08:00
Taneem Ibrahim
3a6cbf16e2
[MISC] Removed unused function find_all_indices() from tool_parsers/utils.py ( #35683 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-03 13:58:42 +08:00
Lucas Wilkinson
f44d1ddc8c
[BugFix] Fix cmake based incremental install (wrong vllm install dir) ( #35773 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-03-02 21:58:16 -08:00
Cyrus Leung
48a54c1e0d
[CI/Build] Trigger processor tests on registry update ( #35824 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-03 13:55:57 +08:00
Micah Williamson
8b9e8b7454
[ROCm][CI] Fix Assertion Logic For test_gpt_oss ( #35806 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-03 05:08:04 +00:00
Wentao Ye
c21d0039ec
[Refactor] Fix maxsim cuda platform and add cli to control it ( #35427 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-03 12:48:31 +08:00
Isotr0py
7d8bbe6f42
[CI/Build] Automatically patch video metadata for multimodal processor test ( #35822 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 04:27:45 +00:00
aykoppol
25e02647c2
[Core] Add optional flags to check for repetitive token patterns in engine output ( #35451 )
...
Signed-off-by: aykoppol <aykoppol+git@gmail.com >
2026-03-03 12:23:25 +08:00
Woosuk Kwon
a0a5178ab4
[Model Runner V2] Use ModelState.prepare_attn() for cuda graph capture [5/N] ( #35774 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-02 20:06:27 -08:00
Isotr0py
8ea8ba275e
[V0 deprecation] Remove Swin model ( #35821 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-02 20:03:41 -08:00
Woosuk Kwon
4f85bae9d6
[Docs][Model Runner V2] Add Design Docs ( #35819 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-02 19:58:14 -08:00
Andy Lo
0a7165fd71
[ModelRunnerV2] Rename sampler functions and variables for clarity ( #35459 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-02 19:48:56 -08:00
Robert Shaw
6521ccf286
[CI] Temporarily Disable Nightly Failures ( #35770 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-03 01:49:13 +00:00
Martin Vit
8ebd872f50
[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode ( #35615 )
...
Signed-off-by: Martin Vit <martin@voipmonitor.org >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-03 09:40:37 +08:00
zhrrr
168ee03e1c
[Model Runner V2][Perf] align dummy_run tokens to uniform decode for dp cudagraph ( #35376 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-03-02 17:10:47 -08:00
liuzhenwei
9dd656f0ea
[XPU][NIXL] Add GPUDirect RDMA support for XPU ( #35270 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-03 08:42:49 +08:00
Jakub Zakrzewski
c8b678e53e
[Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 ( #35735 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-03-03 08:32:14 +08:00
Andreas Karatzas
18c29c746b
[ROCm][CI] Fix backslash-continuation in pytest marker re-quoting and treat exit code 5 as success ( #35798 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 16:07:51 -08:00
Hanjie Qiu
96fc09503a
[All Reduce] Change default backend of Flashinfer All Reduce to trtllm ( #35793 )
...
Signed-off-by: hjjq <hanjieq@nvidia.com >
2026-03-02 18:57:38 -05:00
Roger Wang
1b82b433fc
[Bugfix] Fix MM processor test for Qwen3.5 ( #35797 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-03-02 23:05:08 +00:00
Robert Shaw
9319044ee9
[MoE][Perf] Wrap DSV3 QKVAProj GEMM in custom op for torch.compile ( #35751 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-02 23:03:49 +00:00
Boyuan Feng
c42dc402c1
clean unused cudagraph_batch_sizes ( #35552 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2026-03-02 22:00:16 +00:00
Ye (Charlotte) Qi
fa6a6be519
[Bugfix] Fix missing sequence_lengths in qwen3_omni_moe_thinker ( #35741 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2026-03-02 21:11:56 +00:00
Aaron Hao
cad21918e3
[BUG] Fix rlhf_async example ( #35788 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-02 20:36:40 +00:00
Jeffrey Wang
53700bf49b
[ci] Add Ray compatibility check informational CI job ( #34672 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-03-02 12:06:16 -08:00
Yashwant Bezawada
a13d8c03c9
[KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops ( #31057 )
...
Signed-off-by: Yashwant Bezawada <yashwant_b@me.com >
2026-03-02 15:04:47 -05:00
Fynn Schmitt-Ulms
9433acb8df
[Spec Decode] Add hidden states extraction system ( #33736 )
...
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com >
2026-03-02 14:29:09 -05:00
Richard Zou
d1a6e96d9e
[torch.compile] Improve cold and warm start compile tests ( #35709 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-02 19:27:06 +00:00
CSWYF3634076
2a9e3347e9
[BugFix][Model]Fix the garbled code in Ernie4.5-VL caused by fast_moe_cold_start ( #35587 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2026-03-02 18:56:33 +00:00
Isotr0py
cc0d565f40
[CI/Build] Enable Qwen3.5 tests on CI ( #35763 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-02 17:43:53 +00:00
Patryk Wolsza
358e4d5ba7
[CI][HPU] Pin vllm commit compatible with vllm-gaudi - HPU tests ( #35307 )
...
Signed-off-by: PatrykWo <patryk.wolsza@intel.com >
2026-03-02 17:02:26 +00:00
Cyrus Leung
792a74b973
[Doc] Improve UX of --enable-log-requests ( #35723 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-02 08:24:09 -08:00
Turner Jabbour
4034c3d32e
[Core] Move test utility to test file ( #35672 )
...
Signed-off-by: Turner Jabbour <doubleujabbour@gmail.com >
2026-03-02 10:56:03 -05:00
Martin Hickey
7560d674c9
[CI] Fix mypy for vllm/device allocator ( #35518 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 15:53:18 +00:00
ElizaWszola
d9c7730877
[Performance] Extract kv update ops from MLA attention backends ( #34627 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Di Wu <dw2761@nyu.edu >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-02 10:43:19 -05:00
Runkai Tao
ada4f4fadd
[Fix Bug]num_active_loras always equals to zero ( #34119 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-02 23:17:46 +08:00
Harry Mellor
7e9149d9a9
[Docs] Add breadcrumbs for better UX ( #35749 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 14:31:54 +00:00
Martin Hickey
87c98b0236
[MyPy][BugFix] Check profiler is assigned before calling start() on it ( #35505 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 13:23:42 +00:00
Tyler Michael Smith
de7dd634b9
Fix unresolved-import errors when using Astral's ty by removing src.root ( #35681 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-03-02 10:26:47 +00:00
Chauncey
9a87b0578f
[Feat] Supports Anthropic Messages count_tokens API ( #35588 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-02 09:48:54 +00:00
wangxiyuan
510bc9e1df
[Misc] Cleanup useless current_platform import ( #35715 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-03-02 09:36:54 +00:00
Charles Ashby
cbd361fd46
[CPU][Distributed] Fix Enable _CPUSHMDistributed only when TP/PP ranks share the same SHM group name ( #34169 )
...
Signed-off-by: Charles Ashby <charlesa.l@hotmail.com >
2026-03-02 09:34:35 +00:00
Nicolò Lucchesi
c212202d93
[Misc] Bound NIXL upper bound version ( #35495 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-02 16:57:07 +08:00
Andreas Karatzas
ec27b36b4b
[CI] Defining extended V1 e2e + engine tests ( #35580 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 08:10:54 +00:00
Charlie Fu
3fd1d4ec2c
[Rocm][CI] Fix LM Eval Large Models (H100) test group ( #34750 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-03-02 07:43:38 +00:00
EdalatiAli
cb21972a97
[Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels ( #34448 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-01 23:31:19 -08:00
Andreas Karatzas
c34963f138
[ROCm][CI] Disable skinny GEMMs in language model standard tests to fix non-determinism ( #35152 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 15:04:18 +08:00
Hongxia Yang
f26650d649
[ROCm] add amd-quark package in requirements for rocm to use quantized models ( #35658 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-02 06:02:43 +00:00
Kunshang Ji
92f5d0f070
[XPU] fix mxfp4 activation type ( #35691 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-02 11:48:39 +08:00
Jesse Cai
a60985b07e
Fix deprecated v1 config tests ( #35327 )
...
Signed-off-by: Jesse Cai <jessecai@fb.com >
2026-03-01 20:32:03 -05:00
Lucas Wilkinson
8b5014d3dd
[Attention] FA4 integration ( #32974 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-01 23:44:57 +00:00
zhanqiuhu
57a96e26c9
Revert "[Bugfix] Disable TRTLLM attention with KV transfer enabled ( #33192 )" ( #34832 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-03-01 22:32:37 +00:00
Richard Zou
e82fbeec7b
[torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 ( #35475 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-01 21:44:22 +00:00
haosdent
6290470843
[Bugfix] Fix dtype mismatch in RMSNormGated.forward_native() during torch.compile ( #35256 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-01 15:14:46 -05:00
Woosuk Kwon
72f4d16262
[Model Runner V2] Use block table apis for capture inputs ( #35671 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-01 10:31:13 -08:00
Seungho Yoon
5a435507d8
fix(mxfp4): return is_monolithic=False when LoRA is enabled for Triton backend ( #35382 )
...
Signed-off-by: Seungho Yoon <yoonsnowdev@gmail.com >
2026-03-01 09:59:30 -05:00
Taneem Ibrahim
59d7af9c6c
[MISC] Fixing a null reference by removing parallel_utils from mypy EXCLUDE ( #35630 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-01 09:26:44 -05:00
Asaf Gardin
bbf81f9a92
[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching ( #34798 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-03-01 20:40:23 +08:00
Woosuk Kwon
da543d1abe
[Model Runner V2] Minor refactoring for EncoderRunner ( #35628 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-01 00:15:39 -08:00
Ryan Rock
87d319c52f
[AMD][CI] Support Triton attention with ExampleConnector ( #34931 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-03-01 09:58:07 +02:00
lin-shh
a9ec392c86
Fix typo: implictly -> implicitly in isaac.py docstring ( #35646 )
2026-02-28 23:34:37 -08:00
lailoo
afd089f231
[Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag on older GPUs ( #35617 )
2026-03-01 03:27:37 +00:00
gnovack
3ecd0bf9fc
Add TMA support to fused_moe_lora kernel ( #32195 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-01 10:55:25 +08:00
Woosuk Kwon
e3eb146f7a
[Model Runner V2] Add ModelStateInterface [4/N] ( #35621 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-28 13:19:45 -08:00
Martin Vit
95a395dbec
[Bugfix] Fix Anthropic API base64 image handling in Messages endpoint ( #35557 )
...
Signed-off-by: Martin Vit <martin@voipmonitor.org >
2026-02-28 20:57:08 +00:00
Isotr0py
e94b263bd6
[Chore] Cleanup BNB utilization dead code ( #35620 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-28 19:22:41 +00:00
Wentao Ye
e113a30113
[Deprecation] Deprecate code in 0.17 as scheduled ( #35441 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-28 17:32:37 +00:00
Cyrus Leung
1dafb29f91
[Benchmark] Avoid unnecessary video download in MMVU ( #35618 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-28 09:07:02 -08:00
emricksini-h
49b9ae32e9
[Fix] Avoid sending image input to other PP ranks ( #35405 )
...
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-01 00:14:29 +08:00
cwazai
63d7972f13
Fix Qwen3_5MTP packed_modules_mapping for gate_up_proj ( #35581 )
2026-02-28 14:50:55 +00:00
flutist
c68e69f144
custom dataset img support base64 ( #35280 )
...
Signed-off-by: xjx <493337577@qq.com >
2026-02-28 11:49:52 +00:00
Chauncey
7e08c22b8c
[Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function ( #35271 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 10:12:00 +00:00
Augusto Yao
8e75d88554
add io_process_plugin for sparse embedding ( #34214 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Signed-off-by: Augusto Yao <augusto.yjh@antgroup.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-28 09:16:37 +00:00
Mario Hong
0892d1ab1f
[Feature]Supports Anthropic Thinking Block ( #33671 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: zetaohong <i-hongzetao@stepfun.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-28 09:02:33 +00:00
Hashem Hashemi
7600642eae
Add padding support to wvSplitK solution for skinny GEMMs ( #33762 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-28 09:02:05 +00:00
Andreas Karatzas
1e69c04887
[ROCm][CI] Parametrize vision score tests across attention backends with per-backend tolerances ( #35571 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 08:59:26 +00:00
Cyrus Leung
4292e3b807
[Benchmark] Improve UX of sweep scripts ( #35600 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-28 00:36:02 -08:00
Cyrus Leung
24d6ea8afd
[Benchmark] Rename SLA Finder to Workload Explorer ( #35586 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-27 23:31:55 -08:00
Chauncey
57c86c0741
[Misc] Change logging level from info to debug for tool parser import ( #35575 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 14:51:35 +08:00
Chauncey
06254d4cbb
[CI] add trainer_send_weights for MockWeightTransferEngine ( #35589 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 06:47:43 +00:00
Andreas Karatzas
f5d1281c9d
[ROCm][CI] Expose tests to AMD production CI and fix amdsmi heap corruption ( #35071 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:57:31 +08:00
Andreas Karatzas
94029ffaf0
[ROCm] Derive device capability from GCN arch string without CUDA init ( #35069 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:55:28 +08:00
Andreas Karatzas
88e8525f2e
[ROCm][CI] Adding infiniband mappings for moriio tests ( #35170 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:53:28 +08:00
Ilya Markov
b2d8b422b2
[EPLB] Enforce sync eplb for NCCL-based all2all backend ( #35212 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-28 05:47:12 +00:00
Umut Polat
1d5ab5d603
[Bugfix] Move chat completion response_format validation to Pydantic model_validator ( #35510 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-27 21:26:19 -08:00
Huy Do
7b346ba8ed
[Bugfix] Propagate compilation_time from workers to main process for TP>1 ( #35503 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-02-28 05:03:22 +00:00
Itay Alroy
dea268336f
[1/N] Elastic EP Milestone 2 ( #34861 )
...
Signed-off-by: Yongji Wu <wuyongji317@gmail.com >
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: Ron Tourgeman <rtourgeman@nvidia.com >
Co-authored-by: Yongji Wu <wuyongji317@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-02-28 04:46:42 +00:00
Ma Jian
90805ff464
[CI/Build] CPU release supports both of AVX2 and AVX512 ( #35466 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: jiang1.li <jiang1.li@intel.com >
2026-02-28 04:35:21 +00:00
Matthew Bonanni
2562e0271e
[MTP] Validate that MTP weights are actually loaded ( #35548 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-28 12:27:40 +08:00
Cyrus Leung
fd68cd132b
[Bugfix] Fixes for SLA finder ( #35537 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-27 20:20:55 -08:00
Micah Williamson
0edf101d2b
[ROCm] Add stablelm Head Size 80 To Supported Head Sizes For ROCM_ATTN ( #35527 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-28 12:16:34 +08:00
Douglas Lehr
d5b6f3ba36
[ROCm][Quantization] Add Composable Kernel (CK) backend support for M… ( #34301 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
2026-02-28 03:37:01 +00:00
Woosuk Kwon
1a014a0a93
[Model Runner V2] Move MM encoder to Model States [3/N] ( #35564 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-27 18:32:38 -08:00
Woosuk Kwon
86ac7bcf84
[Model Runner V2] Support pooling models ( #35120 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-27 18:03:01 -08:00
Umut Polat
405f28d38d
[Misc] Clean up ResponsesRequest model validators ( #35531 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-28 01:19:21 +00:00
youkaichao
5323672bc2
[misc] cleanup one level of error stack when nixl fails to initialize ( #35517 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-02-28 08:42:37 +08:00
Roberto L. Castro
a201ad72d8
[Refactor][Kernel] Add global helper to deduplicate vectorized memory ops ( #35105 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
2026-02-27 16:28:17 -08:00
Rohan Potdar
e3691988d0
[ROCm]: fix aiter rope functionalization ( #35533 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-27 22:42:30 +00:00
Gregory Shtrasberg
9fa6c68fa6
[ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends ( #35334 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-27 21:32:55 +00:00
Aaron Hao
2ce6f3cf67
[Feat][RL][2/2] Native Weight Syncing API: IPC ( #34171 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-27 13:45:21 -07:00
Jakub Zakrzewski
1f3dbd95fd
[Bugfix][Model] Fix gpt-oss batch invariance ( #35404 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-02-27 20:41:24 +00:00
Lucas Wilkinson
1d532f9d8f
[DP] Only use DP padding when cudagraphs are actually used ( #34102 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-27 15:14:31 -05:00
Lucas Kabela
234a65b781
[Bugfix] Add monkeypatch to prevent race condition from writing ( #35420 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-02-27 14:51:36 -05:00
SteadfastAsArt
2decec9856
[Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 ( #34888 )
...
Signed-off-by: SteadfastAsArt <695488173@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-27 19:39:23 +00:00
Zhengxu Chen
29b35477b0
[compile] Fix caching error over pytree slice node. ( #35308 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-27 19:34:16 +00:00
Nick Hill
b1d9f5372d
[Model Runner V2] Warmup kernels ( #35172 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 10:43:30 -08:00
Raushan Turganbay
fd6de37fca
[BugFix] Fix 3D rope in transformers backend ( #35097 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-27 18:34:49 +00:00
Netanel Haber
c8aca0c9e1
Support parakeet as audio encoder for nemotron-nano-vl ( #35100 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 11:07:38 -07:00
Martin Hickey
b602e4f299
[Doc] Fix link to Llama chat template for usability ( #35525 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-27 17:51:09 +00:00
Huamin Li
157722da75
[perf] Use pinned memory for async H2D transfer in do_mamba_copy_block ( #35480 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2026-02-28 01:50:37 +08:00
Nick Hill
1d897ff04f
[Misc] Fill in some v1 CODEOWNERS gaps ( #35524 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 09:34:37 -08:00
fort726
905d76b51d
[Model] Add huggingface skt/A.X-K1 model ( #32407 )
...
Signed-off-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com >
Signed-off-by: fort726 <38447663+fort726@users.noreply.github.com >
Co-authored-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-27 09:26:02 -08:00
Yanan Cao
9098ce690c
[Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching ( #34390 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-27 09:21:35 -08:00
Nick Hill
876312f0b5
[Core] Fix gpu_worker.py pre-commit errors ( #35312 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 07:54:24 -08:00
Boyuan Feng
5de98abc12
Add @BoyuanFeng to CODEOWNERS ( #35317 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2026-02-27 15:53:47 +00:00
Koushik Dutta
9251ed5c4f
[Bugfix] Handle case when kimi ends reasoning with a tool call ( #33646 )
...
Signed-off-by: Koushik Dutta <koushd@gmail.com >
Co-authored-by: mondaylord <20212010046@fudan.edu.cn >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 14:58:28 +00:00
Yueqian Lin
e8249378e4
[Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests ( #35487 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 06:48:25 -08:00
haosdent
6d4f9d3ad5
[Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp ( #35082 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-27 22:27:06 +08:00
Harry Mellor
fbe3f0120a
Revert "Add GlmOcrConfig for GLM-OCR model type recognition" ( #35512 )
2026-02-27 06:13:27 -08:00
Jason Li
66c1751d13
[compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism ( #35410 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
2026-02-27 08:36:37 -05:00
Tib
6467b635b6
[Bugfix] Add missing activation attr to RMSNormGated ( #35423 )
...
Signed-off-by: tibG <naps@qubes.milou >
Co-authored-by: tibG <naps@qubes.milou >
2026-02-27 12:53:35 +00:00
Max Hu
9c3fe9936b
Flashinfer cuDNN backend for Qwen3 VL ViT attention ( #34580 )
...
Signed-off-by: Max Hu <maxhu@nvidia.com >
Signed-off-by: Max Hu <hyoung2991@gmail.com >
Co-authored-by: Max Hu <maxhu@nvidia.com >
Co-authored-by: Shang Wang <shangw@nvidia.com >
2026-02-27 20:20:23 +08:00
Umut Polat
b66a74649e
[Bugfix] Replace assert with ValueError for response_format validation in completions endpoint ( #35456 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-27 08:01:06 +00:00
Wang Xingran
07bdabef03
[Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter ( #33088 )
...
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Signed-off-by: Hongjian Zhang <hirokenovo@gmail.com >
Co-authored-by: Hongjian Zhang <hirokenovo@gmail.com >
2026-02-27 07:06:08 +00:00
Chengyi Nie
a572baff5e
[Model Performance] Add Qwen3MoE tuned MoE configs for H200 ( #35457 )
...
Signed-off-by: Chengyi Nie <cnie@roblox.com >
Co-authored-by: Chengyi Nie <cnie@roblox.com >
2026-02-27 13:51:14 +08:00
zofia
516cf26698
[Bug] correct out dtype of rms_norm_gated native path ( #35369 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-27 05:19:51 +00:00
Jiangyun Zhu
487e5c51f7
[Bugfix] disable allreduce_rms_fusion by default when pp size > 1 ( #35424 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-27 04:18:52 +00:00
Daniel Huang
1a8c71674e
[BugFix] Repo utils debug print patch ( #35434 )
...
Signed-off-by: Daniel Huang <daniel1.huang@intel.com >
2026-02-27 03:50:56 +00:00
Wentao Ye
062b789632
[Bug] Fix outdated links in source code ( #35314 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-27 03:50:46 +00:00
gnovack
a532c83849
use 'max_active_experts' for moe lora input size ( #33197 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2026-02-27 03:50:43 +00:00
Jee Jee Li
1e5ad9b74f
[Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping ( #35413 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-26 19:46:30 -08:00
Nicolò Lucchesi
cabdaa7619
[Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils ( #35400 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-27 11:42:51 +08:00
Chenyaaang
06be53563b
[Core]Extract is_last_rank in Ray for tpu to override ( #33012 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2026-02-27 03:18:52 +00:00
Angela Yi
c29ee9c326
[compile] Invalidate cache for cpu flags ( #35119 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-02-27 02:54:11 +00:00
daniel-salib
d43048ce05
[Bugfix] Emit reasoning_part events in simple streaming path for Resp… ( #35184 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-02-27 09:49:06 +08:00
Michael Goin
4fec53cfcb
[CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI ( #34274 )
2026-02-26 17:58:03 -07:00
roikoren755
38c498b8e3
[Performance] Cublas Bf16 Gate with Fp32 Output ( #35121 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-26 16:51:28 -08:00
Andrii Skliar
56a6371706
[Update] Use FlashInfer fast_decode_plan directly instead of replication ( #34687 )
...
Signed-off-by: Andrii <askliar@nvidia.com >
Co-authored-by: Andrii <askliar@nvidia.com >
2026-02-26 16:31:43 -08:00
Pavani Majety
6283021142
[Bugfix] Fix KV Scale loading for MLA Models ( #35430 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-26 23:38:19 +00:00
Aleksandr Malyshev
01923eec70
[ROCm][Quantization] GPT OSS Upstream MoE wmxfp4_afp8 with static scales ( #30357 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2026-02-26 16:50:16 -06:00
pkousha
31fb6f43da
[Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds ( #33839 )
...
Signed-off-by: pkousha <43781676+pkousha@users.noreply.github.com >
Co-authored-by: Pouya Kousha <pkousha@login-eos01.eos.clusters.nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-26 14:35:58 -08:00
Tyler Michael Smith
eb19955c37
[WideEP] Remove pplx all2all backend ( #33724 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-26 14:30:10 -08:00
Lucia Fang
0f2f24c8b2
[Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism ( #35429 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-26 22:08:16 +00:00
sychen52
d0105b84f0
add mixed precision support for modelopt ( #35047 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com >
2026-02-26 21:56:24 +00:00
danielafrimi
832a780f3a
Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models ( #35396 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
2026-02-26 16:55:19 -05:00
ElizaWszola
98217b09f9
[Performance] Extract KV cache update op from flashinfer forward ( #35422 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2026-02-26 21:29:01 +00:00
不做了睡大觉
967572dd5f
fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning ( #35230 )
...
Signed-off-by: stakeswky <stakeswky@users.noreply.github.com >
Co-authored-by: stakeswky <stakeswky@users.noreply.github.com >
2026-02-26 20:30:45 +00:00
Woosuk Kwon
3d66502e1b
[Model Runner V2] Prepare attn metadata in ModelState [2/N] ( #35383 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-26 11:47:02 -08:00
Woosuk Kwon
c66aa48e99
[Model Runner V2] Add model states [1/N] ( #35350 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-26 11:20:35 -08:00
Nick Hill
b6d5a17298
[Model Runner V2] Fix error-handling ( #35063 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-26 11:00:19 -08:00
Lucas Wilkinson
5e58bdc711
[Bugfix] Remove erroneous lower bound on LoRA vocab size constraint ( #35354 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-26 18:44:50 +00:00
Runkai Tao
a1f53addb1
[BugFix] Align fused MoE-LoRA kernel config with actual weight shapes ( #34396 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-26 18:03:10 +00:00
Wentao Ye
05970c772c
[Refactor] Remove dead code for attention benchmark script ( #35418 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 09:53:46 -08:00
Yiliu Dong
d940607629
[Core] Support min_tokens with speculative decoding ( #32642 )
...
Signed-off-by: qianlihuang <yiliu.dong@qq.com >
Co-authored-by: qianlihuang <yiliu.dong@qq.com >
2026-02-26 12:31:28 -05:00
Wentao Ye
99c7892c5b
[Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement ( #35330 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 17:14:54 +00:00
hujia177
ec8f943db1
Add GlmOcrConfig for GLM-OCR model type recognition ( #34982 )
2026-02-26 17:04:42 +00:00
Or Ozeri
f2ad952f40
[BugFix][kv_offload]: Fix kernel block size detection ( #35125 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-02-26 16:29:34 +00:00
Sage Moore
9e2cabdf9c
[ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release ( #34387 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-02-26 16:28:45 +00:00
Douglas Lehr
ec8ab9d254
[ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers ( #34157 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
2026-02-26 10:00:49 -06:00
Wentao Ye
05972ea7e5
[Refactor] Remove dead or duplicate func utils or variables ( #35318 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 10:57:56 -05:00
Jakub Zakrzewski
111d869069
[Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model ( #35297 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-02-26 14:17:17 +00:00
stingoChen
7fea7250a4
[Bug] Fix missing <think> tag after tool call in MiniMax 2.1 ( #35352 )
...
Signed-off-by: 冬马 <chenxinke@cai-inc.com >
Co-authored-by: 冬马 <chenxinke@cai-inc.com >
2026-02-26 22:11:07 +08:00
Cyrus Leung
845ee348ef
[Misc] Standardize handling of mm_processor_kwargs.size ( #35284 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-26 13:05:46 +00:00
Asaf Gardin
ec13e549d3
[Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic ( #35275 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-26 12:22:06 +00:00
Li-Yongwen
c6ca51598a
[Bugfix] fix device_name for routing replay ( #34336 )
...
Signed-off-by: liyongwen <1310439159@qq.com >
2026-02-26 12:18:38 +00:00
Yueqian Lin
c0615a296d
[Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression ( #35368 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
2026-02-26 11:58:23 +00:00
Harry Mellor
01914445b0
Remove bc-lint ( #35274 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-26 03:01:01 -08:00
Kunshang Ji
5281713e11
[XPU] use fixed UMD version in dockerfile.xpu ( #35392 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 18:54:55 +08:00
HZY
32693db8ce
[Bugfix] [Qwen3.5]Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading ( #35289 )
...
Signed-off-by: daowu.hzy <daowu.hzy@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 18:26:15 +08:00
Akash kaothalkar
e03ddcfbd4
[Hardware][Powerpc]Enable prefix caching and chunked prefill for ppc64le ( #35081 )
...
Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash kaothalkar <akash.kaothalkar@ibm.com >
2026-02-26 10:21:24 +00:00
Sophie du Couédic
02acd16861
[Benchmarks] Plot benchmark timeline and requests statistics ( #35220 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-26 02:17:43 -08:00
Jiangyun Zhu
ab87f85231
[Model] Ring 2.5 ( #35102 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-26 02:17:11 -08:00
Krish Gupta
3827c8c55a
[Test] Add tests for n parameter in chat completions API ( #35283 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-26 09:14:07 +00:00
Kevin McKay
ade81f17fe
[Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash ( #35250 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-02-26 16:11:07 +08:00
Gregory Shtrasberg
6042e66cd5
[ROCm] Add extra step in config initialization to populate custom ops before compilation config init ( #34848 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-26 16:05:40 +08:00
Chaojun Zhang
9f9a675b23
[XPU][8/N] Fix kernel bugs in XPU LoRA and MOE LORA ( #34115 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 15:46:44 +08:00
Ofir Zafrir
a07c4c5939
[BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 ( #35298 )
...
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 07:15:16 +00:00
Cyrus Leung
d3a51da92a
[Benchmark] Simplify SLA scan ( #35306 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-25 22:35:41 -08:00
Flora Feng
186ea22efe
[Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py ( #35339 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-26 14:35:16 +08:00
Daniele
4a9c07a0a2
[BugFix] anthropic/serving_messages: fix tool call arguments streaming ( #34887 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-26 05:39:48 +00:00
Jason Li
9d37941017
[torch.compile] Sequence Parallelism threshold compile ranges ( #28672 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-26 05:00:12 +00:00
Fadi Arafeh
4171ff6dd9
[CPU][Feat] Enable KleidiAI INT8_W4A8 for all input dtypes ( #34890 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-26 05:00:10 +00:00
Woosuk Kwon
13025e71e8
[Model Runner V2] Add coding style guide ( #35325 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-25 20:42:40 -08:00
Hanjie Qiu
71dfce6aa6
[Kernel] Refactor FlashInfer allreduce for mnnvl backend ( #34109 )
...
Signed-off-by: hjjq <50634613+hjjq@users.noreply.github.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com >
2026-02-26 03:17:20 +00:00
hujiaxin0
2aa4140402
openpangu-vl support video input ( #34134 )
...
Signed-off-by: hujiaxin <524446785@qq.com >
Signed-off-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 03:08:09 +00:00
Roberto L. Castro
86c3b5a808
[BugFix] Fix fp4 quant kernel on CUDA 12.8 ( #35210 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-02-25 18:32:50 -08:00
Seungmin Kim
160424a937
[Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs ( #33992 )
...
Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com >
Signed-off-by: Andrew Mello <19512127+88plug@users.noreply.github.com >
Co-authored-by: 88plug <19512127+88plug@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 18:15:51 -08:00
Lucas Wilkinson
9511a3f8ee
[Bugfix] Fix AttributeError in SMControlContextManager ( #35338 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-25 18:01:10 -08:00
Michael Goin
de527e1cec
[UX] Add --moe-backend arg for explicit kernel selection ( #33807 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-25 17:44:44 -08:00
Yongye Zhu
1976356ee6
[MoE Refactor] MXFP4 Cutlass Experts to MK ( #34542 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2026-02-25 17:32:39 -08:00
Michael Goin
cbf8f7028c
[UX] Add --performance-mode {balanced,interactivity,throughput} ( #34936 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-25 17:28:31 -08:00
Ming Yang
6831650c40
[offloader] v2: Hide weight onloading latency via prefetching ( #29941 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 17:20:59 -08:00
Andreas Karatzas
ed42507f6d
[ROCm][CI] Amending deletion of AMD mirror ( #35322 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:17:56 -08:00
Andreas Karatzas
9571e99945
[ROCm][CI] Extending attention backend coverage for Eagle spec decode tests ( #35265 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:16:18 -08:00
Elizabeth Thomas
c97234c08b
fix(mxfp4): Disable monolithic path for TRITON backend with EP ( #34270 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 13:33:42 -08:00
rasmith
b188bab441
[CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor ( #34985 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-25 19:18:00 +00:00
Lucas Wilkinson
15d76f74e2
Revert "[Misc] Enable weights loading tracking for quantized models" ( #35309 )
2026-02-25 09:20:15 -08:00
Andreas Karatzas
8fd6975479
[ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results ( #35049 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 16:48:37 +00:00
pushkar
5d18bf8b32
[Bugfix] Fix Harmony preamble visibility in Responses API ( #32114 )
...
Signed-off-by: Pushkar Patel <git@thepushkarp.com >
Signed-off-by: pupa <pupa@users.noreply.github.com >
2026-02-25 08:08:16 -08:00
haosdent
0788ff0a15
[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support ( #35085 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-25 07:31:45 -08:00
Chendi.Xue
d72b0be33c
[XPU]Fix for Qwen-OMNI crash ( #35249 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-02-25 07:31:07 -08:00
Bhoomit
42489e43c2
[Misc][LoRA] Increase max vocab size limit to 258048 in logits processor ( #34773 )
...
Signed-off-by: Bhoomit Vasani <vbhoomit@amazon.com >
2026-02-25 23:30:55 +08:00
Mario Hong
af5e6afa0a
[Bugfix] Fix step3p5 reasoning with interleaved thinking ( #34211 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-25 15:13:01 +00:00
Benjamin Chislett
ee59a7c615
[Tests] Add GSM8k check to SpecDec E2E tests ( #34772 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-25 07:51:14 -05:00
Joao Gante
709eadbb0b
Doc link typo ( #35281 )
...
Signed-off-by: Joao Gante <joaofranciscocardosogante@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-25 03:00:31 -08:00
Harry Mellor
90fc7f9109
Fix custom processors that use deleted behaviour for Transformers v5 ( #35107 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 02:36:21 -08:00
Yanwen Lin
675ec59aa9
[Bugfix][CPU] Fix basic unit tests failing in CPU platforms ( #34677 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:36:15 +00:00
Yanwen Lin
80e60a6133
[Doc] Suggest "--managed-python" flag when installing python using uv ( #33069 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
2026-02-25 08:19:43 +00:00
jonoillar
26e722f906
[DOC][BugFix] Specify build dependency installation ( #34513 )
...
Signed-off-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
Co-authored-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
2026-02-25 08:04:06 +00:00
lichuang
2c619e5e3f
[Docs]Fix documentation formatting in architecture overview ( #34679 )
...
Signed-off-by: codedump <lichuang1982@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:00:15 +00:00
Simon Mo
8a685be8d9
docs: document committer proposal process in governance ( #35225 )
...
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-25 07:58:48 +00:00
Laura Wang
2465071510
[Perf] Add opt-in SM100 Oink RMSNorm custom-op path ( #31828 )
...
Signed-off-by: Laura Wang <3700467+Laurawly@users.noreply.github.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-24 23:01:53 -08:00
wenshuai
cd43673668
[Perf] Optimize FP8 gemm of sm120. ( #34424 )
...
Signed-off-by: wenshuai <wenshuai@xiaomi.com >
2026-02-24 22:25:24 -08:00
Xinyu Chen
35d44b4557
[XPU]Support CUDAGraph on XPU Platform ( #34482 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:52 -08:00
Kunshang Ji
8ad54a991b
[Platform] Add current_platform.num_compute_units interface ( #35042 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-02-24 22:22:49 -08:00
Kunshang Ji
92510edc32
remove cuda check in top_k_top_p_triton kernel ( #35011 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:31 -08:00
Isotr0py
a6c137521c
[Misc] Add shard_id validation for MergedColumnLinear ( #35055 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:12:28 -08:00
Isotr0py
4572a06afe
[Misc] Enable weights loading tracking for quantized models ( #35074 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:11:03 -08:00
Zhengxu Chen
5cc29cfb8b
[compile] Improve error message during artifacts load failure. ( #35115 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 22:01:09 -08:00
Chen Zhang
8fae54faff
[Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache ( #35157 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-24 22:00:19 -08:00
Harry Mellor
f7967577f5
Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM ( #35203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 22:00:06 -08:00
pks
af770b8e7b
[Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest ( #35237 )
...
Signed-off-by: Patrick Simianer <patrick@lilt.com >
2026-02-24 22:00:03 -08:00
Andreas Karatzas
2ff3e436ad
[Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors ( #35231 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 05:52:44 +00:00
Jhao-Ting Chen
c2c4c4611a
[FIX] fused moe with lora shared expert dual stream (1.07x otps) ( #34933 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 04:40:45 +00:00
Rohan Potdar
f38f8c9742
[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE ( #35180 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-25 04:36:40 +00:00
Flora Feng
ec1d30c0f6
[Responses] Decouple SSE event helpers from Harmony context ( #35148 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-24 20:05:25 -08:00
Pooya Davoodi
e3b2324ec4
[Frontend] Use init_app_state and FrontendArgs in run_batch ( #32967 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-24 19:40:39 -08:00
Nick Hill
dbf0da817a
[Core] Cleanup engine pause/sleep logic ( #34528 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-24 19:33:34 -08:00
Xin Yang
3bbb2046ff
[Bugfix] Fix expert_ids padding values in moe_align_block_size kernel ( #35161 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-24 17:14:24 -08:00
yugong333
576fe50333
Adding Nemotron fp8 Triton MoE Config ( #34674 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-24 15:56:38 -08:00
Hashem Hashemi
a0e50a4260
Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. ( #34100 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-24 23:35:21 +00:00
Benjamin Chislett
9fa5b25a23
[Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention ( #35075 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-24 14:55:22 -08:00
Robert Shaw
ea97750414
[CI] Fix Distributed Tests ( #35236 )
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-24 22:31:56 +00:00
Andreas Karatzas
067c5d9ad1
[ROCm][CI] Added MI325 mirrors ( #34923 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-24 13:37:15 -08:00
Benjamin Chislett
f5972a872f
[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support ( #33726 )
...
Signed-off-by: Shahar Mor <smor@nvidia.com >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Shahar Mor <smor@nvidia.com >
Co-authored-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-24 09:49:56 -08:00
Matthew Bonanni
a9e15e040d
Add @MatthewBonanni to CODEOWNERS ( #35207 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-24 10:45:10 -07:00
Lucas Wilkinson
542ca66357
Revert "[CI/Build] Remove redundant OpenTelemetry pip install from CI configs" ( #35211 )
2026-02-24 09:26:42 -08:00
Cyrus Leung
fc8456c336
[CI/Build] Fix kernels test location ( #35205 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 09:20:34 -08:00
Wentao Ye
9ce8fad2a9
[Perf] Optimize Python Slice for Structured Output using islice instead of [:] ( #33593 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-24 09:02:36 -08:00
Harry Mellor
c38b8d5a31
Remove padding_index from models that don't use it for better Transformers v5 compatibility ( #35189 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 08:04:46 -08:00
Robert Shaw
60da0e1544
[CI] Remove Duplicated Tests ( #35199 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-24 23:53:30 +08:00
danisereb
9609b1f18d
Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 ( #35053 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 08:45:13 -07:00
danisereb
a0c7081695
Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe ( #35088 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 07:25:44 -08:00
R3hankhan
34ce0ffd1f
[CPU][Perf] Accelerate Attention head for s390x using vector intrinsics ( #34434 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-24 07:25:39 -08:00
Robin Nabel
0de5333989
Fix GLM4 parser tests ( #34905 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-24 22:27:42 +08:00
Eldar Kurtić
a87cc50859
[Attn,KV-cache] Use per-head scales in the attention selector ( #34281 )
...
Signed-off-by: Your Name <you@example.com >
Signed-off-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Your Name <you@example.com >
2026-02-24 09:02:43 -05:00
Cyrus Leung
761e63e541
[Frontend] Always pass supported_tasks to validation ( #35186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 04:16:33 -08:00
Isotr0py
d12d201409
[Bugfix] Fix failing FunASR processor test ( #35111 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 04:13:45 -08:00
eustlb
b3ad37c5db
[glm-asr] change defaults dummy audio size ( #35108 )
...
Signed-off-by: Eustache Le Bihan <eulebihan@gmail.com >
2026-02-24 04:13:33 -08:00
Wentao Ye
14561fabfd
[Perf] Optimize pooling model redundant copy, 1.8% throughput improvement ( #35127 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-24 04:13:11 -08:00
Zhengxu Chen
c77f3e1207
[compile] Save aot compile artifacts atomically. ( #35117 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 04:11:01 -08:00
Dor Huri
012dee9233
[Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) ( #35147 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-24 04:10:32 -08:00
Tugsbayasgalan Manlaibaatar
f1c664545b
Make voxtral compile friendly ( #33959 )
...
Signed-off-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-24 09:33:35 +01:00
Xin Yang
c870eb9e0f
[LoRA] Update LoRA expand kernel block_n calculation ( #32621 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 23:17:53 -08:00
BadrBasowid
6af03f2394
[Refactor] [1/N] Reorganize kernel abstraction directory ( #34055 )
...
Signed-off-by: BadrBasowid <badr.basowid@gmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-24 06:47:22 +00:00
Vlad Tiberiu Mihailescu
1a6cf39dec
[CI/Build] Remove redundant OpenTelemetry pip install from CI configs ( #35032 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-23 22:24:11 -08:00
Nicolò Lucchesi
f91808ae0d
[MM] Allow audio chunking for offline LLM ( #34628 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 21:04:28 -08:00
Vadim Gimpelson
33a0d43c71
[BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable ( #35156 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-23 19:42:24 -08:00
pschlan-amd
80d93fd6da
gpu_model_runner: Cache is_encoder_decoder from model config ( #35099 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-02-23 19:08:34 -08:00
Jia Guo
ec85340531
[Quantization] Support FP8 MoE bias for models like GPT-OSS ( #34906 )
...
Signed-off-by: jasperjiaguo <jasperg662@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-23 19:07:47 -08:00
Rohan Potdar
2ff4e51152
[ROCm] AITER fused RoPE+KVCache ( #33443 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: charlifu <charlifu@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
2026-02-23 19:06:00 -08:00
Asaf Gardin
95642441d0
[Mamba1] - Change supports_update_block_table to True ( #35054 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-23 19:05:57 -08:00
Xin Yang
a7c9f7b7ec
[Bugfix] Fix lora_ids in FusedMoE LoRA test ( #35135 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 21:49:25 -05:00
Michael Goin
a4bd661fb3
[Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default ( #34924 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 17:34:41 -08:00
Michael Goin
3ef9fd0f98
[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches ( #35123 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-23 17:11:27 -08:00
Michael Goin
22a97e6613
[Perf] Improve default triton fused moe configs ( #34846 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 16:01:28 -08:00
Aaron Hao
596ed1f02e
[RL] Validation for pause_mode='keep' ( #34992 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-23 16:30:56 -05:00
Nicolò Lucchesi
b8d8b7e934
[Misc] Monitor interface changes ( #35113 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 17:14:51 +00:00
Harry Mellor
28c5e69ba0
Enforce that model is the first positional arg when --served-model-name is used ( #34973 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:05 -08:00
Harry Mellor
864167d376
Fix custom processors that use deleted import for Transformers v5 ( #35101 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:00 -08:00
haosdent
a2ba6a5244
[Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) ( #34874 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 17:31:51 +01:00
Harry Mellor
c4f38696f7
Use Xet high performance mode for Transformers v5 ( #35098 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:19:30 -08:00
haosdent
a7f341c323
[Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling ( #35080 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 16:05:52 +00:00
Robert Shaw
d13ece38d7
[CI] Skip Responses API ( #34990 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 07:46:45 -08:00
Mark McLoughlin
5cc7c4452e
[Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) ( #30950 )
...
Export the existing Model FLOPs Utilization (MFU) metrics via Prometheus.
`--enable-mfu-metrics` is required for these to be exposed.
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-23 15:01:07 +00:00
Eldar Kurtić
b95bb6927f
[kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies ( #34254 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-23 07:37:55 -07:00
Cyrus Leung
392645454b
[Refactor] Decouple TimingContext from InputProcessingContext ( #35083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-23 14:15:50 +00:00
Eldar Kurtić
1e8438a89a
[Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests ( #35033 )
...
Signed-off-by: Eldar Kurtic <you@example.com >
Co-authored-by: Eldar Kurtic <you@example.com >
2026-02-23 09:04:34 -05:00
Robert Shaw
8435b2e049
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) ( #34302 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 14:02:26 +00:00
Yan Ma
b1b5e045df
[XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend ( #35010 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-02-23 05:06:44 -08:00
Andreas Karatzas
5f68464f92
[ROCm][CI] Fix spec decode profile assertion and logprob test determinism ( #35043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-23 05:05:54 -08:00
Vincent Gimenes
aa08a30fc9
[CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig ( #35060 )
...
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com >
2026-02-23 05:05:36 -08:00
Wentao Ye
7f40e9e516
[Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item ( #35068 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-23 05:05:20 -08:00
Harry Mellor
103e614b14
Fix pipeline parallel with embed scaling in the Transformers modelling backend ( #35094 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 05:04:47 -08:00
Neil Schemenauer
54e2f83d0a
[Feature] Lazy import for the "mistral" tokenizer module. ( #34651 )
...
Signed-off-by: Neil Schemenauer <nas@arctrix.com >
2026-02-23 00:43:01 -08:00
Gabe Goodhart
e631f8e78e
fix: Apply embedding_multiplier to inputs_embeds ( #34813 )
...
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-23 00:42:46 -08:00
Martin Hickey
e97c46a92d
[BugFix]: Fix local mypy issues ( #34739 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 00:40:29 -08:00
Jee Jee Li
7291d1b288
[Bugfix] Fix kernel benchmark ( #33752 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-22 21:18:08 -08:00
Cyrus Leung
987506bca6
[Refactor] Simplify dummy data generation ( #35025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-22 20:55:27 -08:00
Woosuk Kwon
c645e9a214
[Model Runner V2] Remove propose_draft method ( #35070 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 18:27:12 -08:00
Nick Hill
944ffb5968
[Model Runner V2][Minor] Remove redundant do_spec_decode field ( #35039 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 16:18:04 -08:00
qizixi
2bcf71b9c0
[Spec Decode] Reduce TP communication for speculative decoding draft token generation ( #34049 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 14:59:16 -08:00
tacos8me
b7892a3bef
[Model] Add NVFP4 quantization support for Step3.5-Flash ( #34478 )
...
Signed-off-by: tacos8me <ian@cloudhabit.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-22 12:30:46 -07:00
Benjamin Chislett
682566b18e
[Bug] Refactor max_num_batched_tokens to account for drafting ( #34898 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-22 11:18:46 -05:00
qizixi
b9c2a565cc
[Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup ( #34529 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 08:08:32 -08:00
Andreas Karatzas
dd8c3a7fb2
[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays ( #35052 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 10:07:18 +00:00
Andreas Karatzas
a8a47c17b6
[ROCm][CI] Fix flaky embedding chat test by using tolerance-based comparison ( #35050 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 09:03:44 +00:00
Roger Wang
40f88d8318
[Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser ( #34779 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-21 23:15:35 -08:00
Woosuk Kwon
2cbf9656ce
[Model Runner V2] Enable CUDA graph for Eagle3 ( #35040 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 21:42:50 -08:00
Xiao Li
30132cd144
Fix apply_top_k_top_p_triton called by non-cuda logits Tensor ( #35030 )
...
Signed-off-by: Xiao Li <ilx@meta.com >
2026-02-21 21:11:54 -08:00
Cyrus Leung
cbd95a2dd1
[Benchmark] Use sns.relplot for plotting ( #35027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 20:26:48 -08:00
Athrael Soju
970861ac0c
[New Model] Add ColModernVBERT ( #34558 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Signed-off-by: athrael-soju <athrael-soju@users.noreply.github.com >
2026-02-22 12:23:41 +08:00
Wentao Ye
d24bdd7c4b
[CI] Bump mteb version to mteb[bm25s]>=2, <3 for pooling model unit tests ( #34961 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-21 20:23:24 -08:00
Andreas Karatzas
d403c1da1c
[CI] Stabilizing ROCm amd-ci signal and minor name fix in upstream ( #35008 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 04:01:10 +00:00
Woosuk Kwon
b71fbd06e2
[Model Runner V2] Support attention group ( #35036 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 16:42:53 -08:00
Vadim Gimpelson
74d90b1ce4
[Model Bash][DSR1] Add selective dynamic shape marking for CustomOp ( #34900 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-21 19:28:01 -05:00
Woosuk Kwon
a4047d4ea9
[Model Runner V2] Support Eagle3 (no CUDA graph) ( #35029 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 12:55:24 -08:00
Cyrus Leung
965fe45935
[CI/Build] Fix gRPC version mismatch ( #35013 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 12:14:41 -07:00
Roman
98b0205c3c
[Frontend] Add automatic language detection for Whisper transcription ( #34342 )
...
Signed-off-by: space_check <roman.vuskov@rwth-aachen.de >
Signed-off-by: Roman <45857014+spacecheck@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-21 04:49:41 -08:00
Huy Do
272b535ab3
[Bugfix] Gate 256-bit instructions to CUDA 12.9+ ( #34791 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 04:48:14 -08:00
Cyrus Leung
f74f1572ca
[Benchmark] Improve benchmarks ( #35012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 10:31:58 +00:00
petrpechman
bebfe55b1c
[Doc] Fix example of eagle3 ( #34960 )
...
Signed-off-by: Petr Pechman <petr.pechman@firma.seznam.cz >
Co-authored-by: Petr Pechman <petr.pechman@firma.seznam.cz >
2026-02-21 09:57:53 +00:00
Nick Hill
820d7815eb
[Core] Minor structured-output related scheduler optimization ( #34765 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:38:28 -08:00
Nicolò Lucchesi
ab6f3487a6
[PD] Change kv_load_failure_policy Default from "recompute" to "fail" ( #34896 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:34:57 -08:00
BADAOUI Abdennacer
8dc8a99b56
[ROCm] Enable bitsandbytes quantization support on ROCm ( #34688 )
...
Signed-off-by: badaoui <abdennacerbadaoui0@gmail.com >
2026-02-21 00:34:55 -08:00
jennyyyyzhen
2aab2bb543
[ROCM] Optimize ROCM_AITER_FA spec decode eagle performance ( #34541 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
2026-02-20 20:32:05 -08:00
Andreas Karatzas
54254f7a61
[ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends ( #34599 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:23 -08:00
Andreas Karatzas
cf93c1a128
[ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64 ( #34570 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:07 -08:00
Andreas Karatzas
89358f0d35
[CI] Fix ColBERT HF comparison tests on AMD CI + refactor ( #34567 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:12:05 -08:00
zhongdaor-nv
a0fe7ea2f0
[feat] Add per-block extra_keys to KV events ( #33304 )
...
Signed-off-by: zhongdaor-nv <zhongdaor@nvidia.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 20:11:40 -08:00
Andreas Karatzas
991d6bff38
[CI][MCP][Harmony] Heavy refactoring Harmony & MCP response tests and stabilizing with deterministic test infrastructure ( #33949 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:03:32 -08:00
Kata Coder
5719a4e4e6
[Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) + NewModel: nvidia/nemotron-colembed ( #34574 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
2026-02-20 20:01:40 -08:00
pougetat
11be2c74dc
[Realtime] Add Qwen3-ASR realtime streaming support ( #34613 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-20 19:59:42 -08:00
Xin Yang
7a5adad480
[Kernel] Optimize sample_recovered_tokens_kernel ( #34974 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 19:59:06 -08:00
Li
59c6233297
Support prompt_embeds for pooling requests in output processor ( #34904 )
...
Signed-off-by: Li Zhang <lzhanga@amazon.com >
Co-authored-by: Li Zhang <lzhanga@amazon.com >
2026-02-20 19:57:38 -08:00
Taneem Ibrahim
d38cd3dde5
[Misc] Fix mypy errors in vllm/profiler and remove from exclude list ( #34959 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-02-20 19:56:33 -08:00
Rohan Potdar
ded333fb9b
[ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion ( #34636 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-20 19:56:16 -08:00
Yanan Cao
9d7577b2bd
[Kernel] [Helion] [9/N] Canonicalize GPU variant names to base model names ( #34928 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-20 19:55:51 -08:00
Vlad Tiberiu Mihailescu
e739c29ea4
[CI/Build] Add opentelemetry libs in default vllm build (requirements/common.txt) ( #34466 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-20 19:54:55 -08:00
yugong333
a55caf6ae9
[LoRA] Support Quantized Adapters ( #30286 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Signed-off-by: wz1qqx <ziqi.wang@novita.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: wz1qqx <55830058+wz1qqx@users.noreply.github.com >
Co-authored-by: wz1qqx <ziqi.wang@novita.ai >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 19:54:35 -08:00
Lucas Wilkinson
0e22cd618b
Revert "[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers" ( #34997 )
2026-02-20 17:19:19 -08:00
Wei Zhao
ea5f903f80
Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion ( #34899 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 13:37:31 -08:00
Ryan Rock
0632ed8778
[AMD][CI] Fix test_custom_allreduce for A100 testgroup ( #34735 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-02-20 21:33:04 +00:00
Lucas Wilkinson
aaefc58ee0
[CI] Revert PRs 34818 and 33600 ( #34979 )
2026-02-20 13:25:50 -08:00
Wei Zhao
f24b2de3d3
[Test] Add FP8 KV Cache Testing for MLA Backends ( #34473 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-20 18:51:58 +00:00
Michael Goin
fac1507f03
[CI] Remove failing prime-rl integration test ( #34843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-20 10:17:42 -08:00
Zhengxu Chen
f863994084
[compile] Fix torch.compile time discrepancy in logging. ( #34912 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 08:47:14 -08:00
Zhengxu Chen
e4a5d8c653
[compile] Move torch_aot_compile directory under torch_compile_cache ( #34831 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-20 08:46:45 -08:00
Yanan Cao
a6d0299c75
[Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching ( #34185 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-20 08:36:51 -08:00
Harry Mellor
6ce80f7071
Ensure that MkDocs v2 does not get installed ( #34958 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-20 15:38:11 +00:00
Huamin Li
1fe462168c
[perf] Avoid dtype promotion sync in mamba_get_block_table_tensor ( #34870 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:21:56 -08:00
Flora Feng
ed31a020ee
[Refactor] Extract Harmony streaming SSE event builders into streaming_events.py ( #34909 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:20:46 -08:00
Cyrus Leung
f9ac19204f
[V0 Deprecation] Remove unused MM placeholders in request output ( #34944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-20 06:19:23 -08:00
Vadim Gimpelson
59965affbd
[BUGFIX] Fix _dummy_run missing prepare_inputs_event synchronization ( #34866 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-20 05:54:27 -08:00
Xin Yang
b1c4f0b265
[Kernel] Optimize grouped topk kernel ( #34206 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 01:34:45 -08:00
Kevin McKay
8de7c636cc
[Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support ( #32877 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-19 22:25:46 -08:00
Frank Wang
059779231f
[Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend ( #34916 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-19 22:07:57 -08:00
tianshu-Michael-yu
ea37530b47
[Models] LFM2: Support LoRA ( #34921 )
...
Co-authored-by: Piotr Mazurek <piotr635@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-19 22:07:23 -08:00
Micah Williamson
f5432e35a3
[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout ( #34922 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-20 05:37:49 +00:00
杨朱 · Kiki
07cab212f0
[Misc] Add deprecated environment variable utilities ( #33677 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-19 21:33:25 -08:00
rasmith
0c1dc42748
[CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass ( #33739 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-19 21:32:40 -08:00
Varun Chawla
676f82ae81
Add validation to reject non-text content in system messages ( #34072 )
...
Signed-off-by: Varun Chawla <varun_6april@hotmail.com >
2026-02-19 21:30:33 -08:00
Elizabeth Thomas
81bfc21a6a
[Model Bash]: Improve FP8 Oracle for Config Specific Kernel Selection ( #34260 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Signed-off-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-19 21:29:08 -08:00
Matthias Gehre
4e2c7caf2d
[Bugfix] Add regression test for MoE quant_config under torch.compile ( #34335 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-20 13:27:26 +08:00
Bowen Bao
d9e62c03eb
[Quark] Fix MoE fp8 activation scale handling on mi300 ( #34386 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-02-19 21:27:14 -08:00
Kevin H. Luu
a1a2d79442
[ci] Use the right tag for CPU arm64 image ( #34915 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 19:59:15 -08:00
Cyrus Leung
ac900c89bb
[Refactor] Implement output type check in LLM ( #34794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:57:55 -08:00
Mark McLoughlin
76df6072ff
[Core] Fix state names in pause_scheduler() ( #34840 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-19 17:21:46 -08:00
Michael Goin
16f24e8797
[CI] Add GPT-OSS Eval job for H100 ( #34359 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 17:14:54 -08:00
Nick Hill
40b2f1c3d9
[Model Runner V2] Minor CPU optimizations ( #34856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-19 16:05:37 -08:00
Mayank Ketkar
648951a9c3
[Bugfix] Fix benchmark_fused_collective crash on CustomOp init ( #34665 )
...
Signed-off-by: Mayank Ketkar <mketkar@zoox.com >
Signed-off-by: Mayank Ketkar <mayket04@gmail.com >
Co-authored-by: Mayank Ketkar <mketkar@zoox.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-19 19:01:00 -05:00
Michael Goin
f72061a19a
[UX] More descriptive reasons in is_supported_config for MoE ( #34908 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-19 15:20:52 -08:00
Matthew Bonanni
662205d34e
[Bugfix] Fix Basic Models Test ( #34818 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 14:49:07 -08:00
Roger Wang
4fb8beefaa
[Bugfix] Fix cutlass fp8 kernel on hopper for Qwen3.5 ( #34914 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-19 13:34:55 -08:00
Alexei-V-Ivanov-AMD
304319c4ed
Change targets for AMD build in the "CI" pipeline ( #34918 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-19 21:26:53 +00:00
Wentao Ye
c683d11c94
[Refactor] Deprecate head_first for chunk_gated_delta_rule ( #34263 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-19 13:23:49 -05:00
roikoren755
3eff45d793
Revert "[NemotronH] Do not force router to run in fp32 ( #34582 )" ( #34808 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 09:47:05 -08:00
Robert Shaw
4685a630a2
[Model Bash][DeepSeekR1] Remove Shared Expert Clone ( #34344 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-19 07:56:14 -08:00
Eldar Kurtić
ee1d25f199
[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers ( #34471 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 07:55:41 -08:00
Linda
6fff24f30f
[Bugfix] Qwen3.5 kv-scale weight remapping ( #34719 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
2026-02-19 04:13:37 -08:00
Cyrus Leung
23210a911e
[CI/Build] Try to make beam search test less flaky ( #34885 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:16:58 +08:00
Cyrus Leung
1391378861
[Bugfix] Fix edge case in UUID data parsing ( #34884 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 02:24:30 -08:00
Andreas Karatzas
f6220f9877
[ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker ( #34878 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 08:25:26 +00:00
Andreas Karatzas
2df2bb27b0
[ROCm][CI] Removing all blocking labels from MI355 until stable infra ( #34879 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 07:53:08 +00:00
Tal Nir
f75b61a9e9
[Voxtral Realtime] Fix engine crash on empty multimodal embeddings ( #34862 )
...
Signed-off-by: Tal Nir <tal@nervexneurotech.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-18 23:21:47 -08:00
Wei Zhao
7f51e93864
[Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix ( #34876 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-18 23:20:30 -08:00
Alex Brooks
4611af1663
[Bugfix] Add Quant Config to Llava Next Projector ( #34847 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-02-18 23:18:23 -08:00
Manrique Vargas
ad5aa6bd9f
fix(docs): fix typos in comments and docstrings ( #34836 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-02-18 23:17:41 -08:00
Jaeyeon Kim(김재연)
9681068cf9
[Frontend] Fix reasoning_tokens for text-based parsers in Responses API ( #33513 )
...
Signed-off-by: Jaeyeon Kim <anencore94@gmail.com >
2026-02-18 23:16:41 -08:00
Kevin H. Luu
b6101d384d
Deprecate test-pipeline.yaml ( #34864 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 02:15:27 +00:00
Woosuk Kwon
5fcb0cdd68
[Model Runner V2] Use FP32 for Gumbel Noise ( #34854 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 17:07:37 -08:00
Woosuk Kwon
c878b43b64
[Model Runner V2] Remove unnecessary copies in PW CUDA graph capture ( #34849 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 15:52:50 -08:00
rasmith
2b84ac669c
[CI][AMD][BugFix] Use torch.testing.assert_close instead of assert torch.allclose in test_rocm_skinny_gemms.py ( #34181 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 23:10:19 +00:00
zhrrr
11d3976b88
[Model Runner V2] support piecewise & mixed cudagraph ( #32771 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-02-18 15:03:17 -08:00
Yongye Zhu
40da9625a1
[MoE Refactor] Convert mxfp4 marlin into modular kernel format ( #34588 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 14:37:14 -08:00
Flora Feng
8d9babd4de
Fix empty tool_call_id in Anthropic messages API tool result conversion ( #34745 )
...
Signed-off-by: <>
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Flora Feng <sfeng33@h100-01.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-18 14:31:59 -08:00
Aaron Hao
e99ba957ec
[BUG] Fixing Weight Sync unit test ( #34841 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-18 17:20:10 -05:00
Kyle Sayers
64ac1395e8
[Docs] Clean up speculators docs ( #34065 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-02-18 13:48:11 -08:00
Cyrus Leung
61cf087680
[Bugfix] Fix lora tests ( #34834 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-18 13:22:31 -08:00
Wenlong Wang
847a57cd12
[Bugfix][MoE Kernel] Fix incorrect routing selection for models without expert groups (e.g., MiniMax-M2.1) ( #34673 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 13:03:24 -08:00
rasmith
fcd6ac97ed
[CI][AMD][BugFix] Skip tests in test_unquantized_backend_selection that should not run on ROCm ( #34655 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 15:00:40 -05:00
Woosuk Kwon
95be2a7f22
[Model Runner V2] Minor simplification for DCP ( #34786 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 11:04:53 -08:00
Jaden Mathias
0e60c925cf
[Bugfix] Remove assert causing hipErrorStreamCaptureUnsupported ( #34455 )
...
Signed-off-by: Jaden Mathias <jaden.mathias@amd.com >
2026-02-18 18:54:54 +00:00
Teng Ma
d7ff22204a
[Misc] Add mooncake-transfer-engine to kv_connectors requirements ( #34826 )
...
Signed-off-by: Teng Ma <teng-ma@linux.alibaba.com >
2026-02-18 18:26:24 +00:00
Isotr0py
c0bd8b13da
[Bugfix] Redo Qwen3.5/Qwen3-Next GDN projector fusion ( #34697 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
2026-02-18 09:46:53 -08:00
Michael Goin
caeb887bf6
[Bugfix] Fix NVFP4 TRTLLM MoE non-gated support; add gsm8k for Nemotron-3-Nano FP8+NVFP4 ( #34725 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-18 09:39:22 -08:00
Ilya Markov
6b3166a7c7
[CI][Bugfix] Fix multinode test script ( #34820 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-18 11:45:10 -05:00
Robert Shaw
25e2e136ef
[CI] temporarily disable multi-node tests ( #34825 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 11:32:44 -05:00
Robert Shaw
6874638bc4
[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) ( #34758 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 07:42:36 -08:00
Burkhard Ringlein
e24663c5a9
Add unit tests for fp8 output fusion of triton_attn ( #34228 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-18 06:22:49 -05:00
Nick Hill
c50e105a88
[Model Runner V2] Avoid prepare prefill kernel launch overhead ( #34780 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-18 00:49:21 -08:00
Cyrus Leung
a766b30349
[Renderer] Deprecate code paths for old input processing ( #34775 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 00:35:04 -08:00
Asaf Joseph Gardin
1faa8cb73c
[Quantization] - Added uses_meta_device_weights to quant config ( #34645 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-17 23:43:44 -08:00
Marek Michalowski
e89a91d927
[Bugfix] fix activation in cpu_fused_moe_torch call ( #34696 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-17 23:39:46 -08:00
Michael Goin
909b147197
[Bugfix] Fix prefix creation for Qwen3.5 ( #34723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-17 23:39:15 -08:00
ElizaWszola
a88b3be7c4
[Bugfix] Fix quant RMS norm fusion for quantization with TMA-aligned scales ( #33255 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-17 23:35:04 -08:00
Nick Hill
a49ea5a58f
[Model Runner V2] A bit more PP simplification ( #34766 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 21:39:07 -08:00
Cyrus Leung
30ebe0dc3c
[CI/Build] Remove use of skip_v1 ( #34699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 12:19:11 +08:00
Andreas Karatzas
cef65f0715
[ROCm][CI] Removed hard-coded attn backend requirement for Qwen VL ( #34753 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-18 03:59:53 +00:00
Russell Bryant
6f3b2047ab
[Core] Fix SSRF bypass via backslash-@ URL parsing inconsistency ( #34743 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: isotr0py <2037008807@qq.com >
2026-02-18 03:53:35 +00:00
Luka Govedič
02e8f26cea
[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ ( #34718 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-18 03:29:15 +00:00
Hongxia Yang
4a00a511bb
[BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site ( #34653 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-02-17 19:19:41 -08:00
Cyrus Leung
a0d8d944e2
[Renderer] Move MM Hash parsing into Renderer ( #34711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 19:18:55 -08:00
Amr Mahdi
df3f537a66
[CI] Remove unused precompiled wheel args from image build ( #34767 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 18:58:18 -08:00
Matthew Bonanni
7743152957
[Attention] Refactor check_and_update_config ( #33600 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-17 17:06:54 -08:00
Wentao Ye
ab33d2a629
[Feature] Decode Context Parallel support for GPU model runner v2 ( #34179 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-17 16:27:15 -08:00
Woosuk Kwon
be3af2d29e
[Model Runner V2] Further simplification for PP ( #34724 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-17 15:18:18 -08:00
Jongseok Park
c656ba3b4d
[Kernel] Triton-based Top-k and Top-p sampler kernels ( #33538 )
...
Signed-off-by: js_park <cakeng@naver.com >
Signed-off-by: Jongseok Park <37990712+cakeng@users.noreply.github.com >
Signed-off-by: Sunga Kim <sunga.kim@berkeley.edu >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Sunga Kim <sunga.kim@berkeley.edu >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 23:14:30 +00:00
Matthew Bonanni
dc5fa77a4e
[Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs ( #34457 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-17 14:01:27 -05:00
Flora Feng
1e4a084c8e
[CI] Fix flaky test_parsable_context ( #34717 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-17 18:42:52 +00:00
Richard Zou
7967e854da
[BugFix] Fix sp tests ( #34716 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-17 17:07:56 +00:00
almayne
6bd6d0c3c1
Fixed whisper CPU test that does not spawn properly. ( #34324 )
...
Signed-off-by: Anna Mayne <anna.mayne@arm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-17 06:46:23 -08:00
Nicolò Lucchesi
8e962fef5f
[CI][Nixl] Add CrossLayer KV layout tests ( #34615 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-17 21:35:40 +08:00
Cyrus Leung
574fe75245
[Renderer] Move InputPreprocessor into Renderer (2/2) ( #34560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 05:29:01 -08:00
junuxyz
c61a98f529
[CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior ( #34514 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-17 12:22:56 +00:00
Harry Mellor
28bffe9466
Fix docs build warning ( #34686 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 02:31:40 -08:00
ChenqianCao
ad65177a19
[Bugfix] Fix 'remove_instance_endpoint' method logic in disagg_proxy_demo ( #32922 )
...
Signed-off-by: ChenqianCao <39755070+ChenqianCao@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 10:06:53 +00:00
Tim Dettmers
d44a5b6c47
Remove dead bitsandbytes CxB code from 8-bit inference path ( #34633 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-17 01:49:14 -08:00
Jiangyun Zhu
1d65283e95
Revert "[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj" ( #34683 )
2026-02-17 01:29:27 -08:00
kourosh hakhamaneshi
c464b57374
[Ray] Propagate third-party env vars to Ray workers via prefix matching ( #34383 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-17 01:08:42 -08:00
Amr Mahdi
c5c38e152a
[CI] Fix bake config artifact path for AMI rebuild pipeline ( #34656 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 06:39:44 +00:00
Woosuk Kwon
d00df624f3
[Model Runner V2] Minor refactoring for penalties ( #34662 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:43:00 -08:00
Woosuk Kwon
9752da9d9c
[Model Runner V2] Minor simplification for BadWordsState ( #34669 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:27:24 -08:00
Woosuk Kwon
04925b2202
[Model Runner V2] Minor cleanup for PP ( #34666 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:15:31 -08:00
Woosuk Kwon
d74278fb67
[Model Runner V2] Fix unintended CPU-GPU sync in make_dummy ( #34667 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:00:29 -08:00
haosdent
b68fd899d1
[Bugfix] Fix fused MoE int32 overflow in stride*offset without perf regression ( #34507 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-16 17:58:49 -08:00
Aneesh Puttur
0b5f9b7204
[CI] Enable mypy import following for vllm/v1/kv_offload ( #34639 )
...
Signed-off-by: Aneesh Puttur <aneeshputtur@gmail.com >
2026-02-17 09:58:15 +08:00
zhanqiuhu
9a8853f781
[Core] Pipeline Parallel support for Model Runner V2 ( #33960 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-16 17:48:16 -08:00
zhrrr
387a1898d9
[Model Runner V2] support bad_words sampling param ( #33433 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 16:36:06 -08:00
roikoren755
3b30e61507
[NemotronH] Do not force router to run in fp32 ( #34582 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-16 10:15:32 -08:00
Alexei-V-Ivanov-AMD
824f9e8f3c
Targeting the MI355 agent pool with all existing tests ( #34629 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-16 17:02:27 +00:00
Nicolò Lucchesi
6cc403e67d
[Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] ( #34624 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-16 16:11:07 +00:00
Almog Tavor
72d5951d02
[Bugfix] Treat generation_config max_tokens as default not ceiling ( #34063 )
...
Signed-off-by: almogtavor <almogtavor@gmail.com >
2026-02-16 07:58:24 -08:00
Lucas Kabela
a3205beffb
[CI] Enable mypy coverage for individual excluded files ( #34292 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 07:34:29 -08:00
Christian Pinto
6930becd45
(bugfix): Fixed encode in LLM entrypoint for IOProcessr plugin prompts ( #34618 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-02-16 07:33:55 -08:00
Andreas Karatzas
03a8770a6d
[ROCm][CI] Fix plugins test group; updating terratorch and dependencies ( #34589 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 07:33:42 -08:00
Yiqi Xue
bc56a1d56e
[Bugfix] Fix ARC touch KeyError for non-ready T1 blocks in kv offload ( #34576 )
...
Signed-off-by: Yiqi Xue <xuey666@gmail.com >
2026-02-16 07:33:19 -08:00
danisereb
ec7d9e6745
Fix call to moe_mk in modelopt MoE modules (required for LoRA) ( #34575 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-16 07:33:09 -08:00
Isotr0py
3bb4e4311c
[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj ( #34492 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-16 07:32:51 -08:00
Amr Mahdi
08f8c198ae
[CI] Disable precompiled wheel path in CI image builds ( #34606 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-16 15:14:43 +00:00
Harry Mellor
a21cedf4ff
Bump lm-eval version for Transformers v5 compatibility ( #33994 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 05:24:35 -08:00
emricksini-h
3ef74cde5d
[CI][Tracing] Fix race condition by adding server readiness check ( #34364 )
...
Attempt to resolve #34284: "Metrics Tracing (2GPU)" fails with a
segmentation fault.
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
2026-02-16 12:57:39 +00:00
Ekagra Ranjan
cd81cdb399
[Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs ( #31058 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-16 11:08:44 +00:00
Andreas Karatzas
1e828573b4
[CI][Metrics] Stabilize tests with polling and subprocess guards ( #34566 )
...
test_abort_metrics_reset is flaky due to hardware-dependent
fixed sleeps: replace fixed sleeps with polling.
test_metrics_exist_run_batch passes even when the engine crashes
on startup (false positive): add subprocess lifecycle guards.
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 10:52:02 +00:00
Samu Tamminen
a5ccc85c8c
[Bugfix] Fix Dynamo unexpected keyword argument ( #34320 )
...
Signed-off-by: Samu Tamminen <stammine@amd.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-16 01:32:30 -08:00
Roger Wang
b5475d0534
Revert "[Misc] fix qwen3.5 config" ( #34610 )
2026-02-16 01:06:05 -08:00
JJJYmmm
9521002f0a
[Misc] fix qwen3.5 config ( #34604 )
2026-02-16 00:25:38 -08:00
Cyrus Leung
ec17bdd894
[Renderer] Move InputPreprocessor into Renderer (1.5/2) ( #34598 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-15 23:46:33 -08:00
Amr Mahdi
bb59c90248
[CI] Write bake config to temp directory instead of repo root ( #34569 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-15 22:15:47 -08:00
bnellnm
5bff999d12
[Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues ( #34453 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-15 20:10:50 -08:00
Lucas Wilkinson
bb85929aa6
[BugFix] Fix Python 3.13 FlashMLA import error ( #34548 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-15 20:09:18 -08:00
Parth Bansal
5653021094
[Doc] Add Mistral-7b-v0.3 model to the batch invariance validated model ( #34584 )
...
Signed-off-by: Parth Bansal <parthbansal127@gmail.com >
2026-02-16 12:09:00 +08:00
Andreas Karatzas
974d829b05
[CI][Frontend] Return 422 instead of 500 for invalid Anthropic tool_choice ( #34590 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-15 20:06:48 -08:00
Isotr0py
91ac5d9bfd
[CI/Build] Enable tests for recent day-0 new models ( #34585 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 18:17:04 -08:00
Luka Govedič
23d825aba1
[torch.compile] Disable ar-rms fusion for ds3-fp4 & DP, fix CI test ( #34392 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-15 06:33:57 -08:00
Maryam Tahhan
f07a128413
[CPU][ARM] Add ARM BF16 cross-compilation support and improve documen… ( #33079 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-15 06:33:08 -08:00
Isotr0py
71cd89264f
[MM Encoder] Add Triton ViT attention backend ( #32183 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 06:32:47 -08:00
Isotr0py
19fab44152
[Doc] Update Encoder-Decoder models support doc with Florence-2 ( #34581 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 04:18:57 -08:00
Seiji Eicher
79c7e09235
[KV Connector] Add temporary, off-by-default VLLM_DISABLE_REQUEST_ID_RANDOMIZATION workaround ( #34415 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-14 23:26:10 -08:00
haosdent
79f3fab05a
[Bugfix] Handle num_expert_group=None in flashinfer block-scale FP8 MoE ( #34494 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-14 23:25:46 -08:00
Vadim Gimpelson
604b9eaec5
[BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 ( #34476 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-14 23:25:17 -08:00
Stanislav Kirillov
50dbd6c9e6
[bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used ( #34516 )
...
Signed-off-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-14 23:24:25 -08:00
Andreas Karatzas
98bcc6ca59
[CI][Entrypoints] Validate detokenize token IDs to prevent int64 overflow causing 500 ( #34468 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 23:08:38 -08:00
Andreas Karatzas
f13e86d8dd
[Kernels] Fix Helion GPU utils to use platform-agnostic device name API ( #34537 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 20:29:23 -08:00
Woosuk Kwon
9ca768c740
[Model Runner V2] Minor cleanup for Sampler ( #34563 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-14 18:29:03 -08:00
Thomas Parnell
d5fe3f702c
[Hybrid] Enable mamba prefix cache "align" mode with async scheduling ( #33997 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-02-14 13:15:56 -08:00
Cyrus Leung
73391a1baa
[Renderer] Move InputPreprocessor into Renderer (1/2) ( #34510 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-14 10:14:21 -08:00
Andreas Karatzas
b3c14229b0
[ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests ( #34538 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 07:32:09 -08:00
Roger Wang
2f186635cb
[Bugfix] Fix Qwen3.5 config loading ( #34554 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-14 03:56:11 -08:00
Christian Pinto
342a7cda2d
[Misc] Update tests and examples for Prithvi/Terratorch models ( #34416 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 23:03:51 -08:00
Kata Coder
d1ea65d0a1
[new model] add COLQwen3 code & Inference ( #34398 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
Signed-off-by: katacoder <craftsangjae@gmail.com >
2026-02-14 12:15:19 +08:00
Andreas Karatzas
de42abb366
[CI] Heavy refactoring of Voxtral multimodal audio model tests ( #34294 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:04:29 -08:00
Julien Denize
60ca7981bc
Add explicit validation error for tool calls. ( #34438 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-02-13 20:04:01 -08:00
Christian S. Perone
0ef5b9147b
fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection ( #34527 )
...
Signed-off-by: Christian S. Perone <christian.perone@gmail.com >
Signed-off-by: Christian S. Perone <perone@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-13 20:03:37 -08:00
Shiyan Deng
ed242652d7
[bug] Make sure get_modality_with_max_tokens is deterministic ( #34533 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-02-13 20:02:59 -08:00
Wei Zhao
b37b679770
[Feature][Perf] Support Selective CPU Weight Offloading ( #34535 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-13 20:02:24 -08:00
Andreas Karatzas
a0638d052d
[Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993 ( #34543 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:01:42 -08:00
Harry Huang
c027541eaf
[Hybrid] Enable spec decoding in mamba cache align mode ( #33705 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 13:02:28 -08:00
Ben Browning
fd267bc7b7
[Bugfix]: Fix structured output in multi-turn gpt-oss ( #34454 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 11:12:48 -08:00
Michael Goin
bfaa559305
Revert "[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides" ( #34530 )
2026-02-13 10:35:29 -08:00
Richard Zou
87789c8364
[Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only ( #34523 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-13 09:52:20 -08:00
Pushpinder Singh
bcd65c1f6a
[Bugfix] Replace c10::optional with std::optional in topk kernel ( #34467 )
...
Signed-off-by: Pushpinder Singh <pushpindersingh135@gmail.com >
2026-02-13 08:30:23 -08:00
Wei Zhao
59d53066d8
[Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation ( #32993 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-13 08:11:26 -08:00
LoganJane
4a9952ec1b
[Bugfix] Add quant_config in ViT of Kimi-K2.5 ( #34501 )
...
Signed-off-by: LoganJane <LoganJane73@hotmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-13 16:05:34 +00:00
Roger Wang
1dae7b7843
[Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one ( #34508 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 13:59:00 +00:00
Roger Wang
5885e330ef
[Misc] Port Qwen3.5 Configs ( #34512 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 05:24:25 -08:00
Ilya Boytsov
071d863e20
Extend ColBERT support to non-standard BERT backbones ( #34170 )
...
Signed-off-by: Ilya Boytsov <ilya.boytsov@aleph-alpha.com >
2026-02-13 09:53:09 +00:00
Woosuk Kwon
0916e7960b
[GDN] Use CPU tensors to build GDN metadata ( #34498 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-13 01:24:45 -08:00
Wentao Ye
3d2a026fd0
[Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement ( #33368 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-13 16:38:16 +08:00
Aaron Hao
dddbff4624
[Core] Move pause and resume functions into engine ( #34125 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-13 00:15:10 -08:00
Martin Hickey
47e9b63e1a
[KVConnector] Clean up redundant code in KV connectors ( #34147 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-02-13 00:14:30 -08:00
Matthias Gehre
934acddef9
[Perf] fused_moe: add int4_w4a16 benchmark support and tuning config ( #34130 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-13 00:14:27 -08:00
Marek Michalowski
742d214d6e
[Bugfix] fix the import path in moe test utils.py ( #34245 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-13 00:13:45 -08:00
haosdent
4137c5dfa7
[Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode ( #34418 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-13 00:13:22 -08:00
Harry Huang
7a8a46ddcb
[BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec ( #34440 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 00:13:14 -08:00
myselvess
bcf0731aa0
[New Model] support new model ovis2.6 ( #34426 )
...
Signed-off-by: myselvess <23743269+myselvess@users.noreply.github.com >
2026-02-13 00:12:45 -08:00
Cyrus Leung
ec090c2429
[Refactor] Call renderer for online IO processor request ( #34490 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 22:48:45 -08:00
Roger Wang
eea3024f43
[Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 ( #34489 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-12 22:48:42 -08:00
Cyrus Leung
2f308214c0
[Refactor] Pass full VllmConfig to Renderer ( #34485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 22:48:38 -08:00
Cyrus Leung
1b4e8e53f8
[CI/Build] Fix CUDA re-initialization error in distributed model tests ( #34491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-13 06:43:53 +00:00
haosdent
dcf6ee8592
[Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image ( #34483 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-12 21:04:06 -08:00
Cyrus Leung
372b2e762a
[Bugfix] Standardize getting number of image patches/tokens ( #34358 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:47:01 -08:00
Andreas Karatzas
6afa587d31
[ROCm][CI] Fix serving tokens test failures ( #34047 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 11:27:53 +08:00
Cyrus Leung
94ed6cf6ea
Add new sections to CODEOWNERS ( #34309 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:39:28 -08:00
Harry Huang
bf37812ca7
[Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode ( #33706 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:21:52 -08:00
Frank Wang
b86bf4417e
[Bugfix] Fix Random Dataset Prefix Length Inaccuracy ( #33907 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-12 18:21:19 -08:00
Yanan Cao
de13dd781f
[Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure ( #34025 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:21:05 -08:00
LoganJane
62788f99a4
[Bugfix] Delete unused redundant code in Kimi-K2.5 ( #34427 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 18:18:42 -08:00
Cyrus Leung
ea5ff3a1f6
[Refactor] Simplify BOS/EOS token handling ( #34435 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:18:24 -08:00
bnellnm
04ea31baab
[Bugfix] Remove assert that's no longer valid ( #34443 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-12 18:18:15 -08:00
Harry Huang
6f019e6e0a
[BugFix] Add block_size validation for mamba cache align mode ( #34445 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:18:07 -08:00
Zhuohan Li
d707678dfb
Fix num_logprobs parameter description in sampler.py ( #34451 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2026-02-12 18:18:03 -08:00
Cyrus Leung
fc22cae4ac
[CI/Build] Update video URLs for testing ( #34446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:15:36 -08:00
Yanan Cao
96161fe978
[Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel ( #33373 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:13:12 -08:00
Jaewon
4453ba8d9e
[Core] Profiler improvements and lazy initialization ( #33198 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:38 -08:00
Jaewon
aa181c923b
[Core] Add sleep level 0 mode with enqueue/wait pattern ( #33195 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:25 -08:00
Alec S
be7370daf3
[Frontend] Enable generic structured_outputs for responses API ( #33709 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
2026-02-12 16:15:48 -08:00
Mengtao (Martin) Yuan
9ea1f598ce
Use paged_attention_v1 for sliding window decode in rocm_aiter_fa ( #34378 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-02-12 16:14:43 -08:00
amitz-nv
f120bd42d3
[Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 ( #33506 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-02-12 13:06:58 -08:00
Hashem Hashemi
fac4e96940
small adjustment to wvSplitKrc ( #34410 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-12 20:26:36 +00:00
Michael Goin
6d4e27ce29
[Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA ( #34374 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 12:08:06 -08:00
Andreas Karatzas
4c078fa546
[ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility ( #34447 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-12 18:47:34 +00:00
Patrick von Platen
6c0baee610
[Voxtral Realtime] Refactor & Improve buffering logic ( #34428 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 09:46:43 -08:00
Patrick von Platen
1100a97621
[Voxtral Realtime] Enable tests ( #33803 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-02-12 09:43:24 -08:00
xuebwang-amd
766e167821
[ROCm][quantization] improve OCP weight quant parser robustness ( #34431 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 09:40:19 -08:00
Isotr0py
becbe24808
[Bugfix] Remove broken raw url GGUF model loading support ( #34433 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 09:40:01 -08:00
Harry Mellor
679ca5d8d3
Fix MoE for the Transformers modelling backend ( #34436 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-12 09:29:42 -08:00
Matthew Bonanni
f2c47886fd
[Attention] Add FlashInfer Sparse MLA backend ( #33451 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-02-12 17:21:54 +00:00
Nicolò Lucchesi
334c715e0f
[Docs] Spec decoding docs warning removal ( #34439 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-12 09:01:51 -08:00
Aaron Hao
7b5a8b4a9d
[BUG] Reset running requests when clearing cache for pause/resume ( #34382 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-02-12 16:19:13 +00:00
danisereb
dea63512bb
Add config file for fused MoE for Nemotron (TP4, B200) ( #34411 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-12 06:09:55 -08:00
Douglas Lehr
8a798be929
[ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter ( #34192 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2026-02-12 05:06:33 -08:00
Cyrus Leung
fb455ed547
[V0 Deprecation] Remove code related to per-request logits processors ( #34400 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:44:28 +08:00
baonudesifeizhai
f5897613fb
Fix Mistral config remap to accept compressed-tensors quantization #34028 ( #34104 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-02-12 08:22:06 +00:00
Louie Tsai
55a1a9563a
Vllm CPU benchmark suite improvement ( #34128 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-02-12 16:04:44 +08:00
AllenDou
386bfe5d08
[bugfix] refactor FunASR's _get_data_parser ( #34397 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-12 07:26:49 +00:00
Kyle Sayers
e9cd691132
[Bugfix] Fix Sparse24 Compressed Tensors models ( #33446 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 23:15:16 -08:00
Yichuan Wang
80f2ba6ea6
Fix DeepSeek-OCR tensor validation for all size variants ( #34085 )
...
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-11 22:50:23 -08:00
Lucas Wilkinson
136b0bfa59
[BugFix] Fix DP chunking ( #34379 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Bill Nell <bnell@redhat.com >
2026-02-12 06:44:03 +00:00
Cyrus Leung
b96f7314b4
[Refactor] Pass Renderer to Input Processor ( #34329 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:38:11 -08:00
Cyrus Leung
ced2a92f40
[Refactor] Move validation to params definitions ( #34362 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:33:15 -08:00
Runkai Tao
e1d97c38f8
[Bug Fix] Fix naive_block_assignment always defaulting to False due to arg misalignment ( #33848 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-12 11:30:57 +08:00
Michael Goin
ec12d39d44
[Bugfix] Fix MTP accuracy for GLM-5 ( #34385 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 11:08:19 +08:00
Michael Goin
ff1f83b056
[Refactor] Replace activation: str with MoEActivation enum ( #33843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 17:29:32 -08:00
Kevin H. Luu
83b47f67b1
[ci] Integrate AMD tests into CI ( #33626 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 08:54:17 +08:00
Micah Williamson
fb7b30c716
[ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI ( #34384 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 15:52:34 -08:00
bnellnm
31d992d215
[Bugfix] Fix some issues with MoERunner PR #32344 ( #34371 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-11 14:33:14 -08:00
Wei Zhao
5aff2699bd
Fix CI failure - Flashinfer Kernel tests ( #34316 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-11 14:17:16 -08:00
Raushan Turganbay
527ca32197
[Bugfix] Fix more multimodal tests for transformers V5 ( #34334 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-02-11 22:02:05 +01:00
Junseo Park
5458eb835d
[Bugfix] send None sentinel on final commit so server properly sends transcription.done ( #33963 )
...
Signed-off-by: pjs102793 <pjs102793@naver.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 21:01:53 +00:00
Tomas Ruiz
144d9b7cc8
[Benchmarks] Reduce ready checker log verbosity ( #34349 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2026-02-11 20:57:57 +00:00
elvischenv
83e26c834e
[GPT-OSS] Remove unnecessary contiguous ( #34337 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-02-11 15:29:29 -05:00
TJian
5001211369
[ROCm] [CI] fix test_unrecognized_env ( #34350 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-02-11 18:50:44 +00:00
Eldar Kurtić
11c7ace340
[Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) ( #34243 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-11 13:24:22 -05:00
Xinyu Dong
be7f3d5d20
[Bugfix] fix default is_neox_style is True for deepseek ( #34353 )
...
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com >
2026-02-11 18:20:45 +00:00
Isotr0py
0ab06100f4
[Multimodal] Expose mm_processor_kwargs for DummyInputsBuilder ( #34330 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-11 09:37:40 -08:00
Xinyu Chen
ffb3d553cc
[Model Runner V2] Init cuda graph pool when necessary ( #33217 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-02-11 09:12:13 -08:00
junuxyz
fa7e0bfacf
[CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… ( #32458 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-11 17:03:48 +00:00
SorenDreano
48134a2c22
[Docs] Fix typo ("defult") and double spacing ( #34348 )
...
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com >
Co-authored-by: Soren Dreano <soren@numind.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-11 09:02:27 -08:00
kliuae
64f570ab56
[ROCm] [aiter] Split KV cache update for AiterFlashAttention ( #33681 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2026-02-11 16:26:44 +00:00
Rohan Potdar
fd618871b4
[Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache ( #33948 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-11 11:12:05 -05:00
Harry Mellor
67a42b5a44
Don't try and run GLM-ASR with remote code ( #34352 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 08:09:40 -08:00
Lucas Wilkinson
c7914d30f9
Reapply [Attention][FA3] Update FA3 to include new swizzle optimization ( #34043 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-11 07:07:56 -08:00
Adam Binford
1b8756562e
Responses harmony system message structured ( #34268 )
...
Signed-off-by: Adam Binford <adamq43@gmail.com >
2026-02-11 05:14:28 -08:00
Linda
275e0d2a99
[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE ( #33715 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-02-11 12:38:11 +00:00
Harry Mellor
0f5e55e7a8
Make JAIS compatible with Transformers v5 ( #34264 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 12:30:37 +00:00
Harry Mellor
1e9204bff3
Make Qwen3VL compatible with Transformers v5 ( #34262 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-11 04:13:23 -08:00
Li, Jiang
05339a7b20
[Bugfix][CPU] Fix llama4 inference on CPU ( #34321 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-11 19:07:23 +08:00
Harry Mellor
40b8f55358
[Docs] Reduce time spent generating API docs ( #34255 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 02:56:02 -08:00
Seiji Eicher
5045d5c983
Patch protobuf for CVE-2026-0994 ( #34253 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-11 02:25:04 -08:00
Nick Hill
e09546cf05
[Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer ( #34217 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 11:03:24 +01:00
Tianqi Ren
786806dd44
[Doc] Update Marlin support matrix for Turing ( #34319 )
...
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com >
2026-02-11 09:03:41 +00:00
Nick Hill
79504027ef
[Misc] Bump fastsafetensors version for latest fixes ( #34273 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 00:30:09 -08:00
Luka Govedič
addac0e653
[torch.compile] Enable AR+rms fusion by default available for -O2 ( #34299 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-11 00:30:00 -08:00
Cyrus Leung
675a22ed66
[Chore] Move BaseRenderer to base.py ( #34308 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 00:29:51 -08:00
Kunshang Ji
cb9574eb85
[XPU][9/N] clean up existing ipex code/doc ( #34111 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-11 00:27:15 -08:00
AllenDou
21dfb842d7
[model] support FunASR model ( #33247 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-11 07:37:09 +00:00
R3hankhan
d1b837f0ae
[CPU] Enable FP16 (Half dtype) support for s390x ( #34116 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-11 14:41:42 +08:00
Roger Wang
0b20469c62
[Bugfix] Fix weight naming in Qwen3.5 ( #34313 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 21:37:14 -08:00
Tyler Michael Smith
d7982daff5
[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides ( #34279 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-11 05:15:52 +00:00
Robert Shaw
9b17c57460
[ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast ( #34298 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-11 05:00:00 +00:00
Hashem Hashemi
1b3540e6c6
Threshold fix wvSplitk for occasional CI fails ( #34013 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-11 03:59:14 +00:00
Matthias Gehre
7a048ee65f
[Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 ( #34149 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-11 03:58:56 +00:00
Cyrus Leung
c9a1923bb4
[Plugin] Simplify IO Processor Plugin interface ( #34236 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:47:39 -08:00
zofia
b482f71e9f
[XPU][7/N] enable xpu fp8 moe ( #34202 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-11 03:33:59 +00:00
Дзержи́нский
1485396abb
[Kernel] Apply 256bit LDG/STG To Activation Kernels ( #33022 )
...
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com >
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-10 19:31:51 -08:00
Kebe
5ee5c86eeb
[Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast ( #33884 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2026-02-10 19:31:36 -08:00
Cyrus Leung
b5dcb372e4
[Misc] Clean up validation logic in input processor ( #34144 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:29:29 -08:00
Tyler Michael Smith
066c6da6a0
[WideEP] Fix nvfp4 DeepEP High Throughput All2All backend ( #33738 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 19:15:43 -08:00
Richard Zou
e30cedd44b
[torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter ( #34093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 19:15:40 -08:00
Cyrus Leung
3bcd494ef4
[Redo] Add --trust-remote-code to dataset bench args ( #34251 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 11:10:12 +08:00
tianshu-Michael-yu
0e725a7d22
[Bugfix] Fix Worker.load_model context-manager composition for sleep mode ( #34021 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-02-11 11:07:51 +08:00
Lucas Wilkinson
ba0511fd80
[Misc] Add run one batch script that supports profiling ( #32968 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-10 18:29:49 -08:00
Micah Williamson
4a1550d22d
[ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline ( #34280 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 01:08:11 +00:00
bnellnm
d1481ba783
[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner ( #32344 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-10 19:51:07 -05:00
7. Sun
dc6de33c3d
[CI] Add pip caching to cleanup_pr_body workflow ( #32979 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-02-11 00:45:28 +00:00
Tyler Michael Smith
c4b9e6778f
[Misc] Add pre-commit hook to catch boolean ops in with-statements ( #34271 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-10 15:13:20 -08:00
Richard Zou
341eed3d30
[torch.compile] Disable recursive pre_grad_passes ( #34092 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 18:02:31 -05:00
Zhengkai Zhang
6f2f59f2b3
[Misc][Spec Decode] support different load config for draft model ( #34022 )
...
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
2026-02-10 14:52:43 -08:00
Ilya Markov
bb2fc8b5e7
[BugFix] Fix async EPLB hang with DeepEP LL all2all backend ( #32860 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-10 22:34:47 +00:00
Ilya Markov
67132945bb
[Perf] Move eplb rebalance algo to async thread ( #30888 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-10 22:19:10 +00:00
Gregory Shtrasberg
f0ca0671c7
[Feature] Warn about unrecognized environment variables ( #33581 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-10 15:45:38 -06:00
Pavani Majety
578977bb5e
[SM100] Resubmit FMHA FP8 prefill for MLA ( #31195 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-10 16:18:43 -05:00
Roger Wang
9615575afc
[Bugfix] Fix mamba cache dtype for Qwen3.5 ( #34200 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:12:31 -08:00
Matthew Bonanni
4293c00b84
[Benchmarks] Fix attention benchmark smoke test ( #34269 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-10 16:04:07 -05:00
J Seppänen
506ad7d7c1
[Bugfix] Fix weights offloading for sleep mode ( #32947 )
...
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-02-10 20:38:17 +00:00
Reagan Lee
fdd6f2ad58
Convert online APIs to use Renderer ( #34084 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
2026-02-10 19:44:31 +00:00
Qi Wang
33bcd3dc3b
[Misc] Introduce ec_both role EC (encoder cache) connector ( #34182 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-02-10 18:55:35 +00:00
Michael Goin
1f5febb4b8
[UX nit] Fix non-default api_server_count message ( #34152 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-10 10:35:58 -08:00
Andy Lo
ae871ca923
Minor cleanup for Voxtral ( #34247 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-02-10 18:18:30 +00:00
Woosuk Kwon
a2443de5fa
[Model Runner V2] Use pinned memory for write_contents ( #34222 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-10 08:55:22 -08:00
Harry Mellor
f84a2a8f31
[Docs] Speed up build environment set-up ( #34240 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 16:34:43 +00:00
Vadim Gimpelson
000214c4bb
[BUGFIX] Fix accuracy bugs in Qwen3-Next MTP ( #34077 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-10 10:57:11 -05:00
junuxyz
c5a66d1697
[Core][BugFix] Fix PP KV cache sharding memory validation ( #33698 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-10 10:46:24 -05:00
Roberto L. Castro
afdce12c89
[Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention ( #33680 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-02-10 10:29:52 -05:00
Zhengxu Chen
82e11973cc
[compile] Enable AOT compile with 2.10 in trunk. ( #34155 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com >
2026-02-10 23:24:42 +08:00
xuebwang-amd
b129136c7a
[ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations ( #29008 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 10:08:05 -05:00
mgazz
599e4335a4
Support benchmarking of Geospatial models ( #33922 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
2026-02-10 07:04:16 -08:00
Fan Yang
a1946570d8
add --insecure arg to the vllm bench to skip TLS ( #34026 )
...
Signed-off-by: Fan Yang <yan9fan@meta.com >
Co-authored-by: Fan Yang <yan9fan@meta.com >
2026-02-10 22:23:52 +08:00
Harry Mellor
d0bc520569
Bump mamba-ssm version in CI for Transformers v5 compatibility ( #34233 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 14:46:01 +01:00
Krish Gupta
748625cdaf
[V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input ( #34220 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-10 13:05:32 +00:00
Harry Mellor
61413973e8
Stop testing for slow tokenizers as they will not exist soon ( #34235 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 12:08:20 +00:00
Phúc H. Lê Khắc
94de871546
[Misc] allow specify is_mm_prefix_lm in hf_config ( #34215 )
2026-02-10 11:16:21 +00:00
tc-mb
e042d7e685
Add flagos in MiniCPM-o ( #34126 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
2026-02-10 02:51:48 -08:00
Roger Wang
ae4e280602
[Bugfix] Fix FI kernel chunk_gated_delta_rule output shape for Qwen3.5 ( #34219 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 10:41:24 +00:00
zzaebok
cbea11c9f0
[Docs] Fix format error in KV load failure recovery doc ( #34137 )
...
Signed-off-by: Jaebok Lee <jaebok9541@naver.com >
2026-02-10 02:16:26 -08:00
Cyrus Leung
2c32558a3c
[Bugfix] Fix --trust-remote-code conflict ( #34218 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 00:29:10 -08:00
Zetong Li
5f970120f0
[Bugfix] Fix memory inconsistency in cross-process shared memory ( #32022 )
...
Signed-off-by: Zetong Li <slippersss@126.com >
2026-02-10 08:22:03 +00:00
Cyrus Leung
998e2d91f8
Revert #34208 ( #34216 )
2026-02-09 23:59:04 -08:00
Wentao Ye
e1060a71a1
[Perf] Optimize detokenizer python logic ( #32975 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-09 23:54:41 -08:00
Chen Zhang
97fa8f6590
[BugFix] Avoid prefix cache hit in the same schedule step for mamba layers ( #29387 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-10 07:41:16 +00:00
wang.yuqi
dab1de9f38
[Frontend][CI] Consolidate instrumentator entrypoints ( #34123 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-10 07:30:19 +00:00
Balaxxe
8d48d0a9d9
[Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 ( #34190 )
...
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com >
2026-02-09 23:06:30 -08:00
Andrew Xia
9608844f96
[responsesAPI] fix simpleContext streaming output_messages ( #34188 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-09 22:53:07 -08:00
Cyrus Leung
f69b903b4c
[Bugfix] Add --trust-remote-code to dataset bench args ( #34208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 22:37:50 -08:00
Lucas Wilkinson
81e217fe6b
[Bugfix] Fix DP Attention Padding in Dummy Run ( #34187 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-10 05:29:39 +00:00
Cyrus Leung
ab97bcf662
[CI/Build] Relax test_mcp_tool_call ( #34204 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 05:18:57 +00:00
Cyrus Leung
25e48a3aae
[Doc] Update usage of --limit-mm-per-prompt ( #34148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 21:12:13 -08:00
Roger Wang
8a5e0e2b2b
[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching ( #34183 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:03:32 +08:00
Andreas Karatzas
4cde2e0159
[ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection ( #34108 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 20:50:20 -08:00
Roger Wang
047a457fa4
[Bugfix] Adopt ChunkGatedDeltaRule for Qwen3.5 ( #34198 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 03:47:54 +00:00
Yuwei An
e94ec59733
[LMCache] Token Base IPC API ( #34175 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-02-10 01:18:42 +00:00
Ning Xie
13397841ab
[structured output] validate unsupported json features first ( #33233 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2026-02-09 23:49:09 +00:00
Gregory Shtrasberg
c60f8e3b49
[Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available ( #34153 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-09 17:38:54 -06:00
Michael Goin
5e75a14a66
[Doc] Add DCP support to attention backend doc ( #33936 )
2026-02-09 18:33:43 -05:00
Nick Hill
e7e52781ff
[ModelRunner V2][BugFix] Fix max_query_len calculation ( #34167 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-09 21:47:17 +00:00
Charlie Fu
bb9f97308d
[torch.compile][Fusion] Fix attention fusion pass removing kv_update op. ( #33945 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-02-09 16:15:43 -05:00
Hongxia Yang
4d39650961
[ROCm] update triton branch to support gpt-oss models for gfx11xx devices ( #34032 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2026-02-09 19:36:30 +00:00
Artus Krohn-Grimberghe
8fd31f6245
[Bugfix] Voxtral prompt/audio placeholder alignment ( #34140 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:30:38 +00:00
Artus Krohn-Grimberghe
eadb4e868b
[Bugfix] Avoid duplicate k-proj weight emission in helper ( #34142 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:17:44 +00:00
Jiangyun Zhu
285bab4752
[Kernel] use flashinfer for gdn prefill ( #32846 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-09 12:17:25 -05:00
TomerBN-Nvidia
995bbf38f1
[Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) ( #34087 )
...
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-09 16:44:18 +00:00
Mohammad Miadh Angkad
d4f123cc48
[Kernel] FlashInfer: switch allreduce fusion to unified API ( #33985 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-02-09 15:43:24 +00:00
ZhengHongming888
cb62e86f83
Add NUMA Core binding in nixl_connector for CPU xPyD ( #32365 )
...
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com >
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-09 15:39:12 +00:00
Luka Govedič
781ddf7868
[CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 ( #34031 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-09 10:05:14 -05:00
Roger Wang
64a9c2528b
[UX] Add --language-model-only for hybrid models ( #34120 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-09 14:57:33 +00:00
Lucas Wilkinson
d0d97e2974
[Misc] Fix up attention benchmarks ( #33810 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-09 09:42:03 -05:00
JJJYmmm
9562912cea
[MODEL] Adding Support for Qwen3.5 Models ( #34110 )
...
Signed-off-by: JJJYmmm <1650675829@qq.com >
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: wulipc <wulipc@users.noreply.github.com >
Co-authored-by: ywang96 <ywang96@users.noreply.github.com >
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-09 21:12:58 +08:00
zofia
9bdb06b436
[XPU][6/N] add xpu scaled_mm kernel ( #34117 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-09 20:17:35 +08:00
Nikhil Gupta
caad9f1e01
[Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul ( #33901 )
...
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com >
2026-02-09 18:04:41 +08:00
Ekagra Ranjan
1d5922fade
[ASR] Fix audio benchmark and add RTFx metric ( #32300 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-02-09 10:02:37 +00:00
Andreas Karatzas
3025b3cebb
[CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr ( #34107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 17:37:04 +08:00
Jee Jee Li
978a37c823
[Model] GLM adaptation ( #34124 )
2026-02-09 17:32:52 +08:00
ihb2032
5a5c43511a
fix(cpu): fix mla_decode compilation on x86 without AVX512 ( #34052 )
...
Signed-off-by: ihb2032 <hebome@foxmail.com >
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain >
2026-02-09 08:55:41 +00:00
Nick Hill
d9bede0314
[BugFix] Fix fastsafetensors TP all procs using all GPUs ( #34070 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-09 15:15:46 +08:00
wang.yuqi
22b64948f6
[Frontend][last/5] Make pooling entrypoints request schema consensus. ( #31127 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-09 06:42:38 +00:00
Reagan Lee
7c233dbb36
[Tiny] Rename encoder budget file to more specific name ( #34103 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
2026-02-09 03:48:19 +00:00
kourosh hakhamaneshi
a75a5b54c7
[bug-fix] supported_tasks is breaking backward compatibility at init_app_state ( #34027 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-09 09:46:46 +08:00
Andrey Talman
f97ca67176
[Release 2.10] Update to Torch 2.10 - final release ( #30525 )
2026-02-08 13:51:09 -08:00
danisereb
084aa19f02
Add support for ModelOpt MXFP8 dense models ( #33786 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-08 11:16:48 -08:00
navmarri14
1ecfabe525
glm 4.6 fused tuned inference config for B200 ( #32958 )
2026-02-08 18:55:47 +00:00
Richard Zou
4df841fe75
[torch.compile] Add an option to force-enable the MOE cold start optimization ( #33735 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-08 18:42:56 +00:00
TomerBN-Nvidia
a263aa6140
[BugFix] Change support no act and mul for marlin ( #34088 )
...
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
2026-02-08 17:18:22 +00:00
aabbccddwasd
179ae7da8f
[Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate ( #33771 )
...
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com >
2026-02-08 08:13:24 -08:00
Reagan Lee
c4df59ad43
Add embedding input functionality for disabled modalities [remake] ( #32493 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com >
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-08 04:57:16 -08:00
TJian
785cf28fff
[ROCm] [CI] Reduce Resource of two test groups ( #34059 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-02-08 15:17:26 +08:00
Nick Hill
a96197f564
[Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used ( #33855 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-08 07:16:34 +00:00
Andreas Karatzas
ab10d79855
[ROCm][Bugfix] fix act_quant_fusion module import error ( #34069 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-07 19:21:12 -08:00
Cyrus Leung
7fcb705b80
[CI/Build] Skip GCS test ( #34057 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 08:52:38 -08:00
Cyrus Leung
b956cdf818
[Doc] Fix run_batch docs ( #34056 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 06:18:16 -08:00
Hashem Hashemi
ed17f54c8b
Perf tuning and expansion of cases covered for wvSplitKrc ( #33493 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-07 05:33:11 -08:00
Jiang Wu
860981d8d8
Make directory exist ok for ray spinning up multiple replicas on a single instance ( #33604 )
...
Signed-off-by: Jiang Wu <jwu@cclgroup.com >
2026-02-07 05:30:49 -08:00
zifeitong
52181baaea
Update DeepGEMM version pin in Dockerfile to match #32479 ( #33935 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-07 05:30:22 -08:00
Rohan Potdar
de3869bb4d
move checks out of unified_kv_cache_update custom op ( #33943 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-07 05:30:09 -08:00
whx
ce9b3cd3e9
[PluggableLayer][3/N] Apply PluggableLayer to mamba layers. ( #33660 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-02-07 05:26:05 -08:00
Jee Jee Li
db4ede9743
[Model] Enable Step3p5ForCausalLM testing ( #33755 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-07 05:25:24 -08:00
Pooya Davoodi
2cb2340f7a
[Frontend] Add support for transcriptions and translations to run_batch ( #33934 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-07 05:24:57 -08:00
TundeAtSN
4df44c16ba
Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration to support eagle3 ( #33939 )
...
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com >
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-07 05:24:52 -08:00
Richard Zou
81fe69cae5
[torch.compile] Stop compiling identical artifacts ( #34003 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-07 05:24:48 -08:00
Mohammad Miadh Angkad
dd6a6e1190
[Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune ( #34006 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-07 05:24:44 -08:00
Cyrus Leung
edb359cce4
[Renderer] Define render_cmpl and render_chat ( #34039 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 05:24:40 -08:00
wang.yuqi
6ed5eda300
[CI][Build] Pin grpcio-tools==1.78.0 ( #34048 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-07 05:24:35 -08:00
Cyrus Leung
11a4c9d30d
[Misc] Simplify get_max_tokens ( #34036 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 00:59:49 -08:00
lukec
15a0b9e570
Fix spelling errors ( #33978 )
2026-02-06 23:58:50 -08:00
Andreas Karatzas
c490d8cc73
[ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug ( #34038 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-06 22:21:08 -08:00
Cyrus Leung
48312e579a
[Misc] Make PlaceholderRange.get_num_embeds a method ( #34035 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 05:30:17 +00:00
Vel
bc32444b23
[Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support ( #33517 )
...
Signed-off-by: code4me2 <velvetmoon222999@gmail.com >
2026-02-06 20:28:01 -08:00
Wentao Ye
18e8545297
[Revert] Add util handle_deprecated back ( #33998 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-07 04:14:45 +00:00
果冻虾仁
6f7adc533a
fix description in plugin_system.md ( #33999 )
2026-02-06 19:37:02 -08:00
Nick Hill
40218a82ba
[ModelRunner V2] Revert token rank comparison difference for now ( #34017 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-07 11:11:05 +08:00
kourosh hakhamaneshi
1c3b22058f
[Misc] Add backward-compatible import aliases for renamed translations module ( #34015 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-07 11:01:41 +08:00
Xin Yang
3920cafdd6
[Bugfix] Fix _fused_moe_lora_expand signature mismatch ( #33821 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-07 10:45:59 +08:00
rasmith
ec28784fdc
[CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion ( #34007 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-07 02:43:25 +00:00
Nicolò Lucchesi
55aeec04f5
[Bugfix] Fix Whisper tokenization ( #34011 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-07 10:42:52 +08:00
Ikenna
906077181b
[Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 ( #33967 )
...
Signed-off-by: Ikenna <ikennachifo@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-07 02:27:33 +00:00
Aaron Hao
89a385d79f
[Feat][RL] Pause and Resume with keep requests for single engine ( #32351 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-07 00:08:58 +00:00
kourosh hakhamaneshi
4a2d00eafd
[bugfix] [ROCm] Fix premature CUDA initialization in platform detection ( #33941 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2026-02-06 16:17:55 -06:00
Dimitrios Bariamis
207c3a0c20
Fix RoutingMethodType logic ( #33919 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-02-06 14:03:34 -08:00
Sumanth R Hegde
ae2e93f89b
[Fix] Fix logprobs=0 handling for /inference/v1/generate endpoint ( #34010 )
...
Signed-off-by: SumanthRH <sumanthrh99@gmail.com >
2026-02-06 20:33:40 +00:00
xuebwang-amd
9e9acce577
[Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) ( #33993 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-02-06 19:11:32 +00:00
Charlie Fu
fe5438200b
[Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op ( #33734 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-02-06 19:09:59 +00:00
Wentao Ye
77c09e1130
[Refactor] Remove align block size logic in moe_permute ( #33449 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-06 10:57:06 -08:00
zhrrr
16786da735
[Model Runner V2] support apply penalty for spec decode ( #33251 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-02-06 10:56:48 -08:00
vllmellm
aaa2efbe98
[DOC] [ROCm] Update docker deployment doc ( #33971 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 10:05:35 -08:00
Seiji Eicher
aca5967416
[KV Connector] Add missing method overrides to MultiConnector ( #33292 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-06 12:58:21 -05:00
Wentao Ye
67a746e87f
[Log] Optimize duplicate startup log ( #33944 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-06 17:49:56 +00:00
Chauncey
7bec435130
[Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 ( #33964 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-06 09:23:44 -08:00
Eldar Kurtić
5c52644b10
[Docs] Update link to Benchmark CLI documentation ( #33254 )
...
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com >
2026-02-06 16:00:59 +00:00
zofia
2ce9fe4ad0
[XPU][5/N] add wna16 xpu kernel ( #33973 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-06 15:59:53 +00:00
Cyrus Leung
cd8b405bd0
[Refactor] Consolidate sequence normalization and enc-dec parsing ( #33928 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-06 15:43:47 +00:00
tc-mb
4707f7ebb4
[Model] Support MiniCPM-o 4.5 ( #33431 )
...
Signed-off-by: caitianchi <caitianchi@modelbest.cn >
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Co-authored-by: mslv <mslv@baai.ac.cn >
2026-02-06 15:29:10 +00:00
Michael Goin
c39ee9ee2b
[Docs] Add sections on process architecture and minimum CPU resources ( #33940 )
...
It seems users can be confused about vLLM's performance when running
with very few CPU cores available. We were missing a clear overview of
vLLM's process architecture, so I added one, along with some diagrams,
in arch_overview.md, and included a section on CPU resource
recommendations in optimization.md.
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-06 15:26:43 +00:00
Andreas Karatzas
350ca72c04
[ROCm][AITER] Fix AITER import regression for explicit backend selection ( #33749 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-06 15:08:16 +00:00
FredericOdermatt
1fb0495a72
[FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab ( #33509 )
...
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch >
2026-02-06 14:23:03 +00:00
Raushan Turganbay
85ee1d962b
[Bugfix] Fix models and tests for transformers v5 ( #33977 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 21:47:41 +08:00
Harry Mellor
51a7bda625
Update WeightTransferConfig to be more standard like the others ( #33989 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 13:15:00 +00:00
SorenDreano
6e7b1c4b59
[Docs] Improve documentation ( #33799 )
...
Co-authored-by: Soren Dreano <soren@numind.ai >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-06 12:57:09 +00:00
Kurt Shuster
2991dd3d22
[Bugfix][Model] Support LoRA on Qwen3 Output Embedding ( #29816 )
...
Signed-off-by: kurt <kurt@thinkingmachines.ai >
2026-02-06 20:25:31 +08:00
Luka Govedič
ac32e66cf9
[torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) ( #33731 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-06 04:19:49 -08:00
Fadi Arafeh
f79d9dce16
[CPU][BugFix] Fix loading of w8a8int models with bias ( #33582 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-02-06 11:59:20 +00:00
Harry Mellor
ba5cbbf107
Bump HF Hub client to get bug fix ( #33984 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 11:25:33 +00:00
zhang-prog
233b26ab35
[PaddleOCR-VL] Add BC for transformers 5.0 config ( #33976 )
...
Signed-off-by: zhangyue66 <zhangyue66@baidu.com >
2026-02-06 10:33:49 +00:00
Harry Mellor
791a94bed0
Consolidate and fix forbidden import pre-commit checks ( #33982 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 01:47:41 -08:00
Xinyu Chen
e969a169ef
support view_from_cpu_tensor on XPU ( #33868 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-02-06 08:34:20 +00:00
Harry Mellor
6d8d34be6d
Fix main pre-commit ( #33975 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 00:08:05 -08:00
Gassan Salama
1363e3d6d5
[cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation ( #32263 )
...
Signed-off-by: Gassan <gassan.salama@arm.com >
2026-02-06 15:01:48 +08:00
chengchengpei
965525667b
Onboard voyage-4-nano ( #33720 )
...
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com >
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com >
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-06 06:23:34 +00:00
sihao_li
6550815c3a
[XPU] Replace pip in docker.xpu with uv pip ( #31112 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-02-06 14:02:33 +08:00
Kunshang Ji
7439e4f41b
[XPU][4/N] add mxfp4 moe model support ( #33679 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-06 13:03:59 +08:00
R3hankhan
ac04dd374f
[CPU] Add BF16 Kernel type for s390x ( #33788 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-06 04:57:02 +00:00
Cyrus Leung
035a6cb09a
[Misc] Update code for encoder-decoder models ( #33900 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-06 11:38:39 +08:00
Mingliang Li
a32cb49b60
feat(frontend): early-fail tokenization guard for user requests ( #31366 )
...
Signed-off-by: limingliang <limingliang@stepfun.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: limingliang <limingliang@stepfun.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 19:38:02 -08:00
Rabi Mishra
20d7454c9b
fix(ROCm): Make flash_attn import optional in MLA attention ( #33511 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-02-06 02:22:53 +00:00
Simon Mo
5819ca8944
[Docs] Add reo analytics ( #33957 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2026-02-05 17:42:22 -08:00
Xin Yang
79028d4388
[Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel ( #33568 )
2026-02-05 20:34:00 -05:00
emricksini-h
325ab6b0a8
[Feature] OTEL tracing during loading ( #31162 )
2026-02-05 16:59:28 -08:00
Wei Zhao
91a07ff618
[Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue ( #33832 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-05 23:50:49 +00:00
Hashem Hashemi
d5c4800112
Adds padding and perf improvements to wvSplitK_fp8 ( #33527 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-05 22:16:02 +00:00
Lumosis
42d5d705f9
[Minor] Sort safetensors files to ensure deterministic loading order ( #33491 )
...
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-02-05 17:05:09 -05:00
Cyrus Leung
116880a5a0
[Bugfix] Make MM batching more robust ( #33817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 20:40:58 +00:00
Matthew Bonanni
4145e50d85
[Bugfix] Fix DSV3.2 NVFP4 ( #33932 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-05 19:22:19 +00:00
Nicolò Lucchesi
20f5d185a6
[Misc] Rename translations to speech_to_text for OAI serving component ( #33904 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 19:16:52 +00:00
Harry Mellor
1887acca9e
Fix tokenizer test for renamed attr on Transformers v5 ( #33902 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-05 19:16:20 +00:00
Tsukasa OI
92e7562a99
[Bugfix] Suppress non-TTY color output on the process name part of the log ( #29714 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
2026-02-05 18:47:09 +00:00
Isotr0py
87d0d17ab5
[Models] Consolidate Deepseek-OCR2 processor ( #33909 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-05 18:29:20 +00:00
bnellnm
a57c8228ff
[Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor ( #33375 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-05 18:07:18 +00:00
zackyoray
1ee95841bd
[Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path ( #33795 )
...
Signed-off-by: Yoray Zack <yorayz@nvidia.com >
2026-02-05 17:51:58 +00:00
Nicolò Lucchesi
7d8c6804e2
[Misc] Add debug logs ( #33931 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 09:42:40 -08:00
Benjamin Chislett
af3162d3aa
[Spec Decode] Unified Parallel Drafting ( #32887 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-05 12:37:18 -05:00
danisereb
5b2a9422f0
[BugFix] Fix LoRA Fp8 ( #33879 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-05 17:25:55 +00:00
Aaron Hao
c1858b7ec8
[Feat][RL][1/2] Native Weight Syncing API: NCCL ( #31943 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Co-authored-by: SumanthRH <sumanthrh99@gmail.com >
2026-02-05 12:13:23 -05:00
Mario Hong
82914d2ae8
[Bugfix] Fix step3p5 parser when using mtp ( #33690 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
2026-02-05 16:04:04 +00:00
Nicolò Lucchesi
81a90e5277
[Docs] Add bart-plugin to docs ( #33905 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 12:20:25 +00:00
wang.yuqi
1c3a221d3b
[Bugfix] Fix corner case of sparse embedding ( #33886 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 02:51:22 -08:00
Cyrus Leung
7bd42e609d
[Refactor] Clean up input preprocessing ( #33687 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 18:43:42 +08:00
Isotr0py
a2522839d8
[Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading ( #33876 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-05 10:29:54 +00:00
jiahanc
59a5cb387a
[perf] Integrate flashinfer concat_mla_k ( #31171 )
2026-02-05 05:23:11 -05:00
liranschour
8322d4e47f
Enable Cross layers KV cache layout at NIXL Connector V2 ( #33339 )
...
Signed-off-by: Liran Schour <lirans@il.ibm.com >
Signed-off-by: liranschour <liranschour@users.noreply.github.com >
Co-authored-by: Or Ozeri <or@ozery.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-05 02:17:02 -08:00
Andreas Karatzas
3e472e81f9
[ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) ( #32710 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-02-05 10:01:23 +00:00
Cyrus Leung
038914b7c8
[Refactor] Move task outside of PoolingParams.verify ( #33796 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 09:33:11 +00:00
Pavani Majety
d2f4a71cd5
[Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. ( #33858 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-05 09:32:10 +00:00
Mark McLoughlin
2abd97592f
[KV Connector][Metrics] Do not count local prefix cache hits in connector queries ( #30522 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-05 09:57:27 +02:00
Chauncey
6abb0454ad
[Perf] Optimize the performance of structured output + reasoning ( #33557 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-05 15:45:29 +08:00
Li, Jiang
db6f71d4c9
[CI/Build] Fix CPU CI test case title ( #33870 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-05 15:07:14 +08:00
Fadi Arafeh
fd03538bf9
[CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs ( #33727 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-02-05 06:26:09 +00:00
Andreas Karatzas
1f70313e59
[Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result ( #33837 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 06:17:00 +00:00
Li, Jiang
07daee132b
[CI/Build] Parallelize CPU CI tests ( #33778 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-05 13:53:48 +08:00
Andrew Xia
9595afda18
[2/N] move responses/serving _make_response_output_items logic to parser ( #33281 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Signed-off-by: Andrew Xia <axia@meta.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-05 13:46:15 +08:00
rasmith
c1395f72cd
[CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly ( #33840 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-05 05:05:48 +00:00
rinbaro
007b183d74
[docs] fix unintentional misspellings ( #33863 )
...
Signed-off-by: rinbaro <ilgomishra@gmail.com >
2026-02-04 20:50:59 -08:00
Nick Hill
add9f1fbd9
[Minor] Include StreamingInput in inputs package ( #33856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-05 04:38:20 +00:00
Luka Govedič
e3bf79ffa0
Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" ( #33841 )
2026-02-04 19:54:27 -08:00
Andreas Karatzas
fb1270f1f8
[CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode ( #32762 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-05 11:14:06 +08:00
Kevin H. Luu
72bb24e2db
[release] Minor fixes to release annotation ( #33849 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-05 02:07:35 +00:00
Chauncey
a7be77beef
[Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 ( #33637 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-05 01:28:36 +00:00