Krish Gupta
3827c8c55a
[Test] Add tests for n parameter in chat completions API ( #35283 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-26 09:14:07 +00:00
Kevin McKay
ade81f17fe
[Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash ( #35250 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-02-26 16:11:07 +08:00
Gregory Shtrasberg
6042e66cd5
[ROCm] Add extra step in config initialization to populate custom ops before compilation config init ( #34848 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-26 16:05:40 +08:00
Chaojun Zhang
9f9a675b23
[XPU][8/N] Fix kernel bugs in XPU LoRA and MOE LORA ( #34115 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 15:46:44 +08:00
Ofir Zafrir
a07c4c5939
[BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 ( #35298 )
...
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 07:15:16 +00:00
Cyrus Leung
d3a51da92a
[Benchmark] Simplify SLA scan ( #35306 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-25 22:35:41 -08:00
Flora Feng
186ea22efe
[Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py ( #35339 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-26 14:35:16 +08:00
Daniele
4a9c07a0a2
[BugFix] anthropic/serving_messages: fix tool call arguments streaming ( #34887 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-26 05:39:48 +00:00
Jason Li
9d37941017
[torch.compile] Sequence Parallelism threshold compile ranges ( #28672 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-26 05:00:12 +00:00
Fadi Arafeh
4171ff6dd9
[CPU][Feat] Enable KleidiAI INT8_W4A8 for all input dtypes ( #34890 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-26 05:00:10 +00:00
Woosuk Kwon
13025e71e8
[Model Runner V2] Add coding style guide ( #35325 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-25 20:42:40 -08:00
Hanjie Qiu
71dfce6aa6
[Kernel] Refactor FlashInfer allreduce for mnnvl backend ( #34109 )
...
Signed-off-by: hjjq <50634613+hjjq@users.noreply.github.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com >
2026-02-26 03:17:20 +00:00
hujiaxin0
2aa4140402
openpangu-vl support video input ( #34134 )
...
Signed-off-by: hujiaxin <524446785@qq.com >
Signed-off-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 03:08:09 +00:00
Roberto L. Castro
86c3b5a808
[BugFix] Fix fp4 quant kernel on CUDA 12.8 ( #35210 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-02-25 18:32:50 -08:00
Seungmin Kim
160424a937
[Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs ( #33992 )
...
Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com >
Signed-off-by: Andrew Mello <19512127+88plug@users.noreply.github.com >
Co-authored-by: 88plug <19512127+88plug@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 18:15:51 -08:00
Lucas Wilkinson
9511a3f8ee
[Bugfix] Fix AttributeError in SMControlContextManager ( #35338 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-25 18:01:10 -08:00
Michael Goin
de527e1cec
[UX] Add --moe-backend arg for explicit kernel selection ( #33807 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-25 17:44:44 -08:00
Yongye Zhu
1976356ee6
[MoE Refactor] MXFP4 Cutlass Experts to MK ( #34542 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2026-02-25 17:32:39 -08:00
Michael Goin
cbf8f7028c
[UX] Add --performance-mode {balanced,interactivity,throughput} ( #34936 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-25 17:28:31 -08:00
Ming Yang
6831650c40
[offloader] v2: Hide weight onloading latency via prefetching ( #29941 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 17:20:59 -08:00
Andreas Karatzas
ed42507f6d
[ROCm][CI] Amending deletion of AMD mirror ( #35322 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:17:56 -08:00
Andreas Karatzas
9571e99945
[ROCm][CI] Extending attention backend coverage for Eagle spec decode tests ( #35265 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:16:18 -08:00
Elizabeth Thomas
c97234c08b
fix(mxfp4): Disable monolithic path for TRITON backend with EP ( #34270 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 13:33:42 -08:00
rasmith
b188bab441
[CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor ( #34985 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-25 19:18:00 +00:00
Lucas Wilkinson
15d76f74e2
Revert "[Misc] Enable weights loading tracking for quantized models" ( #35309 )
2026-02-25 09:20:15 -08:00
Andreas Karatzas
8fd6975479
[ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results ( #35049 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 16:48:37 +00:00
pushkar
5d18bf8b32
[Bugfix] Fix Harmony preamble visibility in Responses API ( #32114 )
...
Signed-off-by: Pushkar Patel <git@thepushkarp.com >
Signed-off-by: pupa <pupa@users.noreply.github.com >
2026-02-25 08:08:16 -08:00
haosdent
0788ff0a15
[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support ( #35085 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-25 07:31:45 -08:00
Chendi.Xue
d72b0be33c
[XPU]Fix for Qwen-OMNI crash ( #35249 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-02-25 07:31:07 -08:00
Bhoomit
42489e43c2
[Misc][LoRA] Increase max vocab size limit to 258048 in logits processor ( #34773 )
...
Signed-off-by: Bhoomit Vasani <vbhoomit@amazon.com >
2026-02-25 23:30:55 +08:00
Mario Hong
af5e6afa0a
[Bugfix] Fix step3p5 reasoning with interleaved thinking ( #34211 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-25 15:13:01 +00:00
Benjamin Chislett
ee59a7c615
[Tests] Add GSM8k check to SpecDec E2E tests ( #34772 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-25 07:51:14 -05:00
Joao Gante
709eadbb0b
Doc link typo ( #35281 )
...
Signed-off-by: Joao Gante <joaofranciscocardosogante@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-25 03:00:31 -08:00
Harry Mellor
90fc7f9109
Fix custom processors that use deleted behaviour for Transformers v5 ( #35107 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 02:36:21 -08:00
Yanwen Lin
675ec59aa9
[Bugfix][CPU] Fix basic unit tests failing in CPU platforms ( #34677 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:36:15 +00:00
Yanwen Lin
80e60a6133
[Doc] Suggest "--managed-python" flag when installing python using uv ( #33069 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
2026-02-25 08:19:43 +00:00
jonoillar
26e722f906
[DOC][BugFix] Specify build dependency installation ( #34513 )
...
Signed-off-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
Co-authored-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
2026-02-25 08:04:06 +00:00
lichuang
2c619e5e3f
[Docs]Fix documentation formatting in architecture overview ( #34679 )
...
Signed-off-by: codedump <lichuang1982@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:00:15 +00:00
Simon Mo
8a685be8d9
docs: document committer proposal process in governance ( #35225 )
...
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-25 07:58:48 +00:00
Laura Wang
2465071510
[Perf] Add opt-in SM100 Oink RMSNorm custom-op path ( #31828 )
...
Signed-off-by: Laura Wang <3700467+Laurawly@users.noreply.github.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-24 23:01:53 -08:00
wenshuai
cd43673668
[Perf] Optimize FP8 gemm of sm120. ( #34424 )
...
Signed-off-by: wenshuai <wenshuai@xiaomi.com >
2026-02-24 22:25:24 -08:00
Xinyu Chen
35d44b4557
[XPU]Support CUDAGraph on XPU Platform ( #34482 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:52 -08:00
Kunshang Ji
8ad54a991b
[Platform] Add current_platform.num_compute_units interface ( #35042 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-02-24 22:22:49 -08:00
Kunshang Ji
92510edc32
remove cuda check in top_k_top_p_triton kernel ( #35011 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:31 -08:00
Isotr0py
a6c137521c
[Misc] Add shard_id validation for MergedColumnLinear ( #35055 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:12:28 -08:00
Isotr0py
4572a06afe
[Misc] Enable weights loading tracking for quantized models ( #35074 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:11:03 -08:00
Zhengxu Chen
5cc29cfb8b
[compile] Improve error message during artifacts load failure. ( #35115 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 22:01:09 -08:00
Chen Zhang
8fae54faff
[Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache ( #35157 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-24 22:00:19 -08:00
Harry Mellor
f7967577f5
Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM ( #35203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 22:00:06 -08:00
pks
af770b8e7b
[Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest ( #35237 )
...
Signed-off-by: Patrick Simianer <patrick@lilt.com >
2026-02-24 22:00:03 -08:00
Andreas Karatzas
2ff3e436ad
[Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors ( #35231 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 05:52:44 +00:00
Jhao-Ting Chen
c2c4c4611a
[FIX] fused moe with lora shared expert dual stream (1.07x otps) ( #34933 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 04:40:45 +00:00
Rohan Potdar
f38f8c9742
[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE ( #35180 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-25 04:36:40 +00:00
Flora Feng
ec1d30c0f6
[Responses] Decouple SSE event helpers from Harmony context ( #35148 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-24 20:05:25 -08:00
Pooya Davoodi
e3b2324ec4
[Frontend] Use init_app_state and FrontendArgs in run_batch ( #32967 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-24 19:40:39 -08:00
Nick Hill
dbf0da817a
[Core] Cleanup engine pause/sleep logic ( #34528 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-24 19:33:34 -08:00
Xin Yang
3bbb2046ff
[Bugfix] Fix expert_ids padding values in moe_align_block_size kernel ( #35161 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-24 17:14:24 -08:00
yugong333
576fe50333
Adding Nemotron fp8 Triton MoE Config ( #34674 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-24 15:56:38 -08:00
Hashem Hashemi
a0e50a4260
Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. ( #34100 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-24 23:35:21 +00:00
Benjamin Chislett
9fa5b25a23
[Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention ( #35075 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-24 14:55:22 -08:00
Robert Shaw
ea97750414
[CI] Fix Distributed Tests ( #35236 )
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-24 22:31:56 +00:00
Andreas Karatzas
067c5d9ad1
[ROCm][CI] Added MI325 mirrors ( #34923 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-24 13:37:15 -08:00
Benjamin Chislett
f5972a872f
[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support ( #33726 )
...
Signed-off-by: Shahar Mor <smor@nvidia.com >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Shahar Mor <smor@nvidia.com >
Co-authored-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-24 09:49:56 -08:00
Matthew Bonanni
a9e15e040d
Add @MatthewBonanni to CODEOWNERS ( #35207 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-24 10:45:10 -07:00
Lucas Wilkinson
542ca66357
Revert "[CI/Build] Remove redundant OpenTelemetry pip install from CI configs" ( #35211 )
2026-02-24 09:26:42 -08:00
Cyrus Leung
fc8456c336
[CI/Build] Fix kernels test location ( #35205 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 09:20:34 -08:00
Wentao Ye
9ce8fad2a9
[Perf] Optimize Python Slice for Structured Output using islice instead of [:] ( #33593 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-24 09:02:36 -08:00
Harry Mellor
c38b8d5a31
Remove padding_index from models that don't use it for better Transformers v5 compatibility ( #35189 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 08:04:46 -08:00
Robert Shaw
60da0e1544
[CI] Remove Duplicated Tests ( #35199 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-24 23:53:30 +08:00
danisereb
9609b1f18d
Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 ( #35053 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 08:45:13 -07:00
danisereb
a0c7081695
Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe ( #35088 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 07:25:44 -08:00
R3hankhan
34ce0ffd1f
[CPU][Perf] Accelerate Attention head for s390x using vector intrinsics ( #34434 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-24 07:25:39 -08:00
Robin Nabel
0de5333989
Fix GLM4 parser tests ( #34905 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-24 22:27:42 +08:00
Eldar Kurtić
a87cc50859
[Attn,KV-cache] Use per-head scales in the attention selector ( #34281 )
...
Signed-off-by: Your Name <you@example.com >
Signed-off-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Your Name <you@example.com >
2026-02-24 09:02:43 -05:00
Cyrus Leung
761e63e541
[Frontend] Always pass supported_tasks to validation ( #35186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 04:16:33 -08:00
Isotr0py
d12d201409
[Bugfix] Fix failing FunASR processor test ( #35111 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 04:13:45 -08:00
eustlb
b3ad37c5db
[glm-asr] change defaults dummy audio size ( #35108 )
...
Signed-off-by: Eustache Le Bihan <eulebihan@gmail.com >
2026-02-24 04:13:33 -08:00
Wentao Ye
14561fabfd
[Perf] Optimize pooling model redundant copy, 1.8% throughput improvement ( #35127 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-24 04:13:11 -08:00
Zhengxu Chen
c77f3e1207
[compile] Save aot compile artifacts atomically. ( #35117 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 04:11:01 -08:00
Dor Huri
012dee9233
[Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) ( #35147 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-24 04:10:32 -08:00
Tugsbayasgalan Manlaibaatar
f1c664545b
Make voxtral compile friendly ( #33959 )
...
Signed-off-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-24 09:33:35 +01:00
Xin Yang
c870eb9e0f
[LoRA] Update LoRA expand kernel block_n calculation ( #32621 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 23:17:53 -08:00
BadrBasowid
6af03f2394
[Refactor] [1/N] Reorganize kernel abstraction directory ( #34055 )
...
Signed-off-by: BadrBasowid <badr.basowid@gmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-24 06:47:22 +00:00
Vlad Tiberiu Mihailescu
1a6cf39dec
[CI/Build] Remove redundant OpenTelemetry pip install from CI configs ( #35032 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-23 22:24:11 -08:00
Nicolò Lucchesi
f91808ae0d
[MM] Allow audio chunking for offline LLM ( #34628 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 21:04:28 -08:00
Vadim Gimpelson
33a0d43c71
[BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable ( #35156 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-23 19:42:24 -08:00
pschlan-amd
80d93fd6da
gpu_model_runner: Cache is_encoder_decoder from model config ( #35099 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-02-23 19:08:34 -08:00
Jia Guo
ec85340531
[Quantization] Support FP8 MoE bias for models like GPT-OSS ( #34906 )
...
Signed-off-by: jasperjiaguo <jasperg662@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-23 19:07:47 -08:00
Rohan Potdar
2ff4e51152
[ROCm] AITER fused RoPE+KVCache ( #33443 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: charlifu <charlifu@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
2026-02-23 19:06:00 -08:00
Asaf Gardin
95642441d0
[Mamba1] - Change supports_update_block_table to True ( #35054 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-23 19:05:57 -08:00
Xin Yang
a7c9f7b7ec
[Bugfix] Fix lora_ids in FusedMoE LoRA test ( #35135 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 21:49:25 -05:00
Michael Goin
a4bd661fb3
[Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default ( #34924 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 17:34:41 -08:00
Michael Goin
3ef9fd0f98
[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches ( #35123 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-23 17:11:27 -08:00
Michael Goin
22a97e6613
[Perf] Improve default triton fused moe configs ( #34846 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 16:01:28 -08:00
Aaron Hao
596ed1f02e
[RL] Validation for pause_mode='keep' ( #34992 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-23 16:30:56 -05:00
Nicolò Lucchesi
b8d8b7e934
[Misc] Monitor interface changes ( #35113 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 17:14:51 +00:00
Harry Mellor
28c5e69ba0
Enforce that model is the first positional arg when --served-model-name is used ( #34973 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:05 -08:00
Harry Mellor
864167d376
Fix custom processors that use deleted import for Transformers v5 ( #35101 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:00 -08:00
haosdent
a2ba6a5244
[Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) ( #34874 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 17:31:51 +01:00
Harry Mellor
c4f38696f7
Use Xet high performance mode for Transformers v5 ( #35098 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:19:30 -08:00
haosdent
a7f341c323
[Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling ( #35080 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 16:05:52 +00:00
Robert Shaw
d13ece38d7
[CI] Skip Responses API ( #34990 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 07:46:45 -08:00
Mark McLoughlin
5cc7c4452e
[Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) ( #30950 )
...
Export the existing Model FLOPs Utilization (MFU) metrics via Prometheus.
`--enable-mfu-metrics` is required for these to be exposed.
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-23 15:01:07 +00:00
Eldar Kurtić
b95bb6927f
[kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies ( #34254 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-23 07:37:55 -07:00
Cyrus Leung
392645454b
[Refactor] Decouple TimingContext from InputProcessingContext ( #35083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-23 14:15:50 +00:00
Eldar Kurtić
1e8438a89a
[Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests ( #35033 )
...
Signed-off-by: Eldar Kurtic <you@example.com >
Co-authored-by: Eldar Kurtic <you@example.com >
2026-02-23 09:04:34 -05:00
Robert Shaw
8435b2e049
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) ( #34302 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 14:02:26 +00:00
Yan Ma
b1b5e045df
[XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend ( #35010 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-02-23 05:06:44 -08:00
Andreas Karatzas
5f68464f92
[ROCm][CI] Fix spec decode profile assertion and logprob test determinism ( #35043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-23 05:05:54 -08:00
Vincent Gimenes
aa08a30fc9
[CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig ( #35060 )
...
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com >
2026-02-23 05:05:36 -08:00
Wentao Ye
7f40e9e516
[Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item ( #35068 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-23 05:05:20 -08:00
Harry Mellor
103e614b14
Fix pipeline parallel with embed scaling in the Transformers modelling backend ( #35094 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 05:04:47 -08:00
Neil Schemenauer
54e2f83d0a
[Feature] Lazy import for the "mistral" tokenizer module. ( #34651 )
...
Signed-off-by: Neil Schemenauer <nas@arctrix.com >
2026-02-23 00:43:01 -08:00
Gabe Goodhart
e631f8e78e
fix: Apply embedding_multiplier to inputs_embeds ( #34813 )
...
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-23 00:42:46 -08:00
Martin Hickey
e97c46a92d
[BugFix]: Fix local mypy issues ( #34739 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 00:40:29 -08:00
Jee Jee Li
7291d1b288
[Bugfix] Fix kernel benchmark ( #33752 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-22 21:18:08 -08:00
Cyrus Leung
987506bca6
[Refactor] Simplify dummy data generation ( #35025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-22 20:55:27 -08:00
Woosuk Kwon
c645e9a214
[Model Runner V2] Remove propose_draft method ( #35070 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 18:27:12 -08:00
Nick Hill
944ffb5968
[Model Runner V2][Minor] Remove redundant do_spec_decode field ( #35039 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 16:18:04 -08:00
qizixi
2bcf71b9c0
[Spec Decode] Reduce TP communication for speculative decoding draft token generation ( #34049 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 14:59:16 -08:00
tacos8me
b7892a3bef
[Model] Add NVFP4 quantization support for Step3.5-Flash ( #34478 )
...
Signed-off-by: tacos8me <ian@cloudhabit.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-22 12:30:46 -07:00
Benjamin Chislett
682566b18e
[Bug] Refactor max_num_batched_tokens to account for drafting ( #34898 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-22 11:18:46 -05:00
qizixi
b9c2a565cc
[Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup ( #34529 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 08:08:32 -08:00
Andreas Karatzas
dd8c3a7fb2
[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays ( #35052 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 10:07:18 +00:00
Andreas Karatzas
a8a47c17b6
[ROCm][CI] Fix flaky embedding chat test by using tolerance-based comparison ( #35050 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 09:03:44 +00:00
Roger Wang
40f88d8318
[Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser ( #34779 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-21 23:15:35 -08:00
Woosuk Kwon
2cbf9656ce
[Model Runner V2] Enable CUDA graph for Eagle3 ( #35040 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 21:42:50 -08:00
Xiao Li
30132cd144
Fix apply_top_k_top_p_triton called by non-cuda logits Tensor ( #35030 )
...
Signed-off-by: Xiao Li <ilx@meta.com >
2026-02-21 21:11:54 -08:00
Cyrus Leung
cbd95a2dd1
[Benchmark] Use sns.relplot for plotting ( #35027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 20:26:48 -08:00
Athrael Soju
970861ac0c
[New Model] Add ColModernVBERT ( #34558 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Signed-off-by: athrael-soju <athrael-soju@users.noreply.github.com >
2026-02-22 12:23:41 +08:00
Wentao Ye
d24bdd7c4b
[CI] Bump mteb version to mteb[bm25s]>=2, <3 for pooling model unit tests ( #34961 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-21 20:23:24 -08:00
Andreas Karatzas
d403c1da1c
[CI] Stabilizing ROCm amd-ci signal and minor name fix in upstream ( #35008 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 04:01:10 +00:00
Woosuk Kwon
b71fbd06e2
[Model Runner V2] Support attention group ( #35036 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 16:42:53 -08:00
Vadim Gimpelson
74d90b1ce4
[Model Bash][DSR1] Add selective dynamic shape marking for CustomOp ( #34900 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-21 19:28:01 -05:00
Woosuk Kwon
a4047d4ea9
[Model Runner V2] Support Eagle3 (no CUDA graph) ( #35029 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 12:55:24 -08:00
Cyrus Leung
965fe45935
[CI/Build] Fix gRPC version mismatch ( #35013 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 12:14:41 -07:00
Roman
98b0205c3c
[Frontend] Add automatic language detection for Whisper transcription ( #34342 )
...
Signed-off-by: space_check <roman.vuskov@rwth-aachen.de >
Signed-off-by: Roman <45857014+spacecheck@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-21 04:49:41 -08:00
Huy Do
272b535ab3
[Bugfix] Gate 256-bit instructions to CUDA 12.9+ ( #34791 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 04:48:14 -08:00
Cyrus Leung
f74f1572ca
[Benchmark] Improve benchmarks ( #35012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 10:31:58 +00:00
petrpechman
bebfe55b1c
[Doc] Fix example of eagle3 ( #34960 )
...
Signed-off-by: Petr Pechman <petr.pechman@firma.seznam.cz >
Co-authored-by: Petr Pechman <petr.pechman@firma.seznam.cz >
2026-02-21 09:57:53 +00:00
Nick Hill
820d7815eb
[Core] Minor structured-output related scheduler optimization ( #34765 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:38:28 -08:00
Nicolò Lucchesi
ab6f3487a6
[PD] Change kv_load_failure_policy Default from "recompute" to "fail" ( #34896 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:34:57 -08:00
BADAOUI Abdennacer
8dc8a99b56
[ROCm] Enable bitsandbytes quantization support on ROCm ( #34688 )
...
Signed-off-by: badaoui <abdennacerbadaoui0@gmail.com >
2026-02-21 00:34:55 -08:00
jennyyyyzhen
2aab2bb543
[ROCM] Optimize ROCM_AITER_FA spec decode eagle performance ( #34541 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
2026-02-20 20:32:05 -08:00
Andreas Karatzas
54254f7a61
[ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends ( #34599 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:23 -08:00
Andreas Karatzas
cf93c1a128
[ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64 ( #34570 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:07 -08:00
Andreas Karatzas
89358f0d35
[CI] Fix ColBERT HF comparison tests on AMD CI + refactor ( #34567 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:12:05 -08:00
zhongdaor-nv
a0fe7ea2f0
[feat] Add per-block extra_keys to KV events ( #33304 )
...
Signed-off-by: zhongdaor-nv <zhongdaor@nvidia.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 20:11:40 -08:00
Andreas Karatzas
991d6bff38
[CI][MCP][Harmony] Heavy refactoring Harmony & MCP response tests and stabilizing with deterministic test infrastructure ( #33949 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:03:32 -08:00
Kata Coder
5719a4e4e6
[Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) + NewModel: nvidia/nemotron-colembed ( #34574 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
2026-02-20 20:01:40 -08:00
pougetat
11be2c74dc
[Realtime] Add Qwen3-ASR realtime streaming support ( #34613 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-20 19:59:42 -08:00
Xin Yang
7a5adad480
[Kernel] Optimize sample_recovered_tokens_kernel ( #34974 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 19:59:06 -08:00
Li
59c6233297
Support prompt_embeds for pooling requests in output processor ( #34904 )
...
Signed-off-by: Li Zhang <lzhanga@amazon.com >
Co-authored-by: Li Zhang <lzhanga@amazon.com >
2026-02-20 19:57:38 -08:00
Taneem Ibrahim
d38cd3dde5
[Misc] Fix mypy errors in vllm/profiler and remove from exclude list ( #34959 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-02-20 19:56:33 -08:00
Rohan Potdar
ded333fb9b
[ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion ( #34636 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-20 19:56:16 -08:00
Yanan Cao
9d7577b2bd
[Kernel] [Helion] [9/N] Canonicalize GPU variant names to base model names ( #34928 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-20 19:55:51 -08:00
Vlad Tiberiu Mihailescu
e739c29ea4
[CI/Build] Add opentelemetry libs in default vllm build (requirements/common.txt) ( #34466 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-20 19:54:55 -08:00
yugong333
a55caf6ae9
[LoRA] Support Quantized Adapters ( #30286 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Signed-off-by: wz1qqx <ziqi.wang@novita.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: wz1qqx <55830058+wz1qqx@users.noreply.github.com >
Co-authored-by: wz1qqx <ziqi.wang@novita.ai >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 19:54:35 -08:00
Lucas Wilkinson
0e22cd618b
Revert "[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers" ( #34997 )
2026-02-20 17:19:19 -08:00
Wei Zhao
ea5f903f80
Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion ( #34899 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 13:37:31 -08:00
Ryan Rock
0632ed8778
[AMD][CI] Fix test_custom_allreduce for A100 testgroup ( #34735 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-02-20 21:33:04 +00:00
Lucas Wilkinson
aaefc58ee0
[CI] Revert PRs 34818 and 33600 ( #34979 )
2026-02-20 13:25:50 -08:00
Wei Zhao
f24b2de3d3
[Test] Add FP8 KV Cache Testing for MLA Backends ( #34473 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-20 18:51:58 +00:00
Michael Goin
fac1507f03
[CI] Remove failing prime-rl integration test ( #34843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-20 10:17:42 -08:00
Zhengxu Chen
f863994084
[compile] Fix torch.compile time discrepancy in logging. ( #34912 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 08:47:14 -08:00
Zhengxu Chen
e4a5d8c653
[compile] Move torch_aot_compile directory under torch_compile_cache ( #34831 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-20 08:46:45 -08:00
Yanan Cao
a6d0299c75
[Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching ( #34185 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-20 08:36:51 -08:00
Harry Mellor
6ce80f7071
Ensure that MkDocs v2 does not get installed ( #34958 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-20 15:38:11 +00:00
Huamin Li
1fe462168c
[perf] Avoid dtype promotion sync in mamba_get_block_table_tensor ( #34870 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:21:56 -08:00
Flora Feng
ed31a020ee
[Refactor] Extract Harmony streaming SSE event builders into streaming_events.py ( #34909 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:20:46 -08:00
Cyrus Leung
f9ac19204f
[V0 Deprecation] Remove unused MM placeholders in request output ( #34944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-20 06:19:23 -08:00
Vadim Gimpelson
59965affbd
[BUGFIX] Fix _dummy_run missing prepare_inputs_event synchronization ( #34866 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-20 05:54:27 -08:00
Xin Yang
b1c4f0b265
[Kernel] Optimize grouped topk kernel ( #34206 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 01:34:45 -08:00
Kevin McKay
8de7c636cc
[Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support ( #32877 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-19 22:25:46 -08:00
Frank Wang
059779231f
[Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend ( #34916 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-19 22:07:57 -08:00
tianshu-Michael-yu
ea37530b47
[Models] LFM2: Support LoRA ( #34921 )
...
Co-authored-by: Piotr Mazurek <piotr635@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-19 22:07:23 -08:00
Micah Williamson
f5432e35a3
[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout ( #34922 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-20 05:37:49 +00:00
杨朱 · Kiki
07cab212f0
[Misc] Add deprecated environment variable utilities ( #33677 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-19 21:33:25 -08:00
rasmith
0c1dc42748
[CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass ( #33739 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-19 21:32:40 -08:00
Varun Chawla
676f82ae81
Add validation to reject non-text content in system messages ( #34072 )
...
Signed-off-by: Varun Chawla <varun_6april@hotmail.com >
2026-02-19 21:30:33 -08:00
Elizabeth Thomas
81bfc21a6a
[Model Bash]: Improve FP8 Oracle for Config Specific Kernel Selection ( #34260 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Signed-off-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-19 21:29:08 -08:00
Matthias Gehre
4e2c7caf2d
[Bugfix] Add regression test for MoE quant_config under torch.compile ( #34335 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-20 13:27:26 +08:00
Bowen Bao
d9e62c03eb
[Quark] Fix MoE fp8 activation scale handling on mi300 ( #34386 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-02-19 21:27:14 -08:00
Kevin H. Luu
a1a2d79442
[ci] Use the right tag for CPU arm64 image ( #34915 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 19:59:15 -08:00
Cyrus Leung
ac900c89bb
[Refactor] Implement output type check in LLM ( #34794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:57:55 -08:00
Mark McLoughlin
76df6072ff
[Core] Fix state names in pause_scheduler() ( #34840 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-19 17:21:46 -08:00
Michael Goin
16f24e8797
[CI] Add GPT-OSS Eval job for H100 ( #34359 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 17:14:54 -08:00
Nick Hill
40b2f1c3d9
[Model Runner V2] Minor CPU optimizations ( #34856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-19 16:05:37 -08:00
Mayank Ketkar
648951a9c3
[Bugfix] Fix benchmark_fused_collective crash on CustomOp init ( #34665 )
...
Signed-off-by: Mayank Ketkar <mketkar@zoox.com >
Signed-off-by: Mayank Ketkar <mayket04@gmail.com >
Co-authored-by: Mayank Ketkar <mketkar@zoox.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-19 19:01:00 -05:00
Michael Goin
f72061a19a
[UX] More descriptive reasons in is_supported_config for MoE ( #34908 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-19 15:20:52 -08:00
Matthew Bonanni
662205d34e
[Bugfix] Fix Basic Models Test ( #34818 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 14:49:07 -08:00
Roger Wang
4fb8beefaa
[Bugfix] Fix cutlass fp8 kernel on hopper for Qwen3.5 ( #34914 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-19 13:34:55 -08:00
Alexei-V-Ivanov-AMD
304319c4ed
Change targets for AMD build in the "CI" pipeline ( #34918 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-19 21:26:53 +00:00
Wentao Ye
c683d11c94
[Refactor] Deprecate head_first for chunk_gated_delta_rule ( #34263 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-19 13:23:49 -05:00
roikoren755
3eff45d793
Revert "[NemotronH] Do not force router to run in fp32 ( #34582 )" ( #34808 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 09:47:05 -08:00
Robert Shaw
4685a630a2
[Model Bash][DeepSeekR1] Remove Shared Expert Clone ( #34344 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-19 07:56:14 -08:00
Eldar Kurtić
ee1d25f199
[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers ( #34471 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 07:55:41 -08:00
Linda
6fff24f30f
[Bugfix] Qwen3.5 kv-scale weight remapping ( #34719 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
2026-02-19 04:13:37 -08:00
Cyrus Leung
23210a911e
[CI/Build] Try to make beam search test less flaky ( #34885 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:16:58 +08:00
Cyrus Leung
1391378861
[Bugfix] Fix edge case in UUID data parsing ( #34884 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 02:24:30 -08:00
Andreas Karatzas
f6220f9877
[ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker ( #34878 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 08:25:26 +00:00
Andreas Karatzas
2df2bb27b0
[ROCm][CI] Removing all blocking labels from MI355 until stable infra ( #34879 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 07:53:08 +00:00
Tal Nir
f75b61a9e9
[Voxtral Realtime] Fix engine crash on empty multimodal embeddings ( #34862 )
...
Signed-off-by: Tal Nir <tal@nervexneurotech.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-18 23:21:47 -08:00
Wei Zhao
7f51e93864
[Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix ( #34876 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-18 23:20:30 -08:00
Alex Brooks
4611af1663
[Bugfix] Add Quant Config to Llava Next Projector ( #34847 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-02-18 23:18:23 -08:00
Manrique Vargas
ad5aa6bd9f
fix(docs): fix typos in comments and docstrings ( #34836 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-02-18 23:17:41 -08:00
Jaeyeon Kim(김재연)
9681068cf9
[Frontend] Fix reasoning_tokens for text-based parsers in Responses API ( #33513 )
...
Signed-off-by: Jaeyeon Kim <anencore94@gmail.com >
2026-02-18 23:16:41 -08:00
Kevin H. Luu
b6101d384d
Deprecate test-pipeline.yaml ( #34864 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 02:15:27 +00:00
Woosuk Kwon
5fcb0cdd68
[Model Runner V2] Use FP32 for Gumbel Noise ( #34854 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 17:07:37 -08:00
Woosuk Kwon
c878b43b64
[Model Runner V2] Remove unnecessary copies in PW CUDA graph capture ( #34849 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 15:52:50 -08:00
rasmith
2b84ac669c
[CI][AMD][BugFix] Use torch.testing.assert_close instead of assert torch.allclose in test_rocm_skinny_gemms.py ( #34181 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 23:10:19 +00:00
zhrrr
11d3976b88
[Model Runner V2] support piecewise & mixed cudagraph ( #32771 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-02-18 15:03:17 -08:00
Yongye Zhu
40da9625a1
[MoE Refactor] Convert mxfp4 marlin into modular kernel format ( #34588 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 14:37:14 -08:00
Flora Feng
8d9babd4de
Fix empty tool_call_id in Anthropic messages API tool result conversion ( #34745 )
...
Signed-off-by: <>
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Flora Feng <sfeng33@h100-01.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-18 14:31:59 -08:00
Aaron Hao
e99ba957ec
[BUG] Fixing Weight Sync unit test ( #34841 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-18 17:20:10 -05:00
Kyle Sayers
64ac1395e8
[Docs] Clean up speculators docs ( #34065 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-02-18 13:48:11 -08:00
Cyrus Leung
61cf087680
[Bugfix] Fix lora tests ( #34834 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-18 13:22:31 -08:00
Wenlong Wang
847a57cd12
[Bugfix][MoE Kernel] Fix incorrect routing selection for models without expert groups (e.g., MiniMax-M2.1) ( #34673 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 13:03:24 -08:00
rasmith
fcd6ac97ed
[CI][AMD][BugFix] Skip tests in test_unquantized_backend_selection that should not run on ROCm ( #34655 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 15:00:40 -05:00
Woosuk Kwon
95be2a7f22
[Model Runner V2] Minor simplification for DCP ( #34786 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 11:04:53 -08:00
Jaden Mathias
0e60c925cf
[Bugfix] Remove assert causing hipErrorStreamCaptureUnsupported ( #34455 )
...
Signed-off-by: Jaden Mathias <jaden.mathias@amd.com >
2026-02-18 18:54:54 +00:00
Teng Ma
d7ff22204a
[Misc] Add mooncake-transfer-engine to kv_connectors requirements ( #34826 )
...
Signed-off-by: Teng Ma <teng-ma@linux.alibaba.com >
2026-02-18 18:26:24 +00:00
Isotr0py
c0bd8b13da
[Bugfix] Redo Qwen3.5/Qwen3-Next GDN projector fusion ( #34697 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
2026-02-18 09:46:53 -08:00
Michael Goin
caeb887bf6
[Bugfix] Fix NVFP4 TRTLLM MoE non-gated support; add gsm8k for Nemotron-3-Nano FP8+NVFP4 ( #34725 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-18 09:39:22 -08:00
Ilya Markov
6b3166a7c7
[CI][Bugfix] Fix multinode test script ( #34820 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-18 11:45:10 -05:00
Robert Shaw
25e2e136ef
[CI] temporarily disable multi-node tests ( #34825 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 11:32:44 -05:00
Robert Shaw
6874638bc4
[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) ( #34758 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 07:42:36 -08:00
Burkhard Ringlein
e24663c5a9
Add unit tests for fp8 output fusion of triton_attn ( #34228 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-18 06:22:49 -05:00
Nick Hill
c50e105a88
[Model Runner V2] Avoid prepare prefill kernel launch overhead ( #34780 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-18 00:49:21 -08:00
Cyrus Leung
a766b30349
[Renderer] Deprecate code paths for old input processing ( #34775 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 00:35:04 -08:00
Asaf Joseph Gardin
1faa8cb73c
[Quantization] - Added uses_meta_device_weights to quant config ( #34645 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-17 23:43:44 -08:00
Marek Michalowski
e89a91d927
[Bugfix] fix activation in cpu_fused_moe_torch call ( #34696 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-17 23:39:46 -08:00
Michael Goin
909b147197
[Bugfix] Fix prefix creation for Qwen3.5 ( #34723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-17 23:39:15 -08:00
ElizaWszola
a88b3be7c4
[Bugfix] Fix quant RMS norm fusion for quantization with TMA-aligned scales ( #33255 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-17 23:35:04 -08:00
Nick Hill
a49ea5a58f
[Model Runner V2] A bit more PP simplification ( #34766 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 21:39:07 -08:00
Cyrus Leung
30ebe0dc3c
[CI/Build] Remove use of skip_v1 ( #34699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 12:19:11 +08:00
Andreas Karatzas
cef65f0715
[ROCm][CI] Removed hard-coded attn backend requirement for Qwen VL ( #34753 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-18 03:59:53 +00:00
Russell Bryant
6f3b2047ab
[Core] Fix SSRF bypass via backslash-@ URL parsing inconsistency ( #34743 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: isotr0py <2037008807@qq.com >
2026-02-18 03:53:35 +00:00
Luka Govedič
02e8f26cea
[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ ( #34718 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-18 03:29:15 +00:00
Hongxia Yang
4a00a511bb
[BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site ( #34653 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-02-17 19:19:41 -08:00
Cyrus Leung
a0d8d944e2
[Renderer] Move MM Hash parsing into Renderer ( #34711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 19:18:55 -08:00
Amr Mahdi
df3f537a66
[CI] Remove unused precompiled wheel args from image build ( #34767 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 18:58:18 -08:00
Matthew Bonanni
7743152957
[Attention] Refactor check_and_update_config ( #33600 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-17 17:06:54 -08:00
Wentao Ye
ab33d2a629
[Feature] Decode Context Parallel support for GPU model runner v2 ( #34179 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-17 16:27:15 -08:00
Woosuk Kwon
be3af2d29e
[Model Runner V2] Further simplification for PP ( #34724 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-17 15:18:18 -08:00
Jongseok Park
c656ba3b4d
[Kernel] Triton-based Top-k and Top-p sampler kernels ( #33538 )
...
Signed-off-by: js_park <cakeng@naver.com >
Signed-off-by: Jongseok Park <37990712+cakeng@users.noreply.github.com >
Signed-off-by: Sunga Kim <sunga.kim@berkeley.edu >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Sunga Kim <sunga.kim@berkeley.edu >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 23:14:30 +00:00
Matthew Bonanni
dc5fa77a4e
[Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs ( #34457 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-17 14:01:27 -05:00
Flora Feng
1e4a084c8e
[CI] Fix flaky test_parsable_context ( #34717 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-17 18:42:52 +00:00
Richard Zou
7967e854da
[BugFix] Fix sp tests ( #34716 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-17 17:07:56 +00:00
almayne
6bd6d0c3c1
Fixed whisper CPU test that does not spawn properly. ( #34324 )
...
Signed-off-by: Anna Mayne <anna.mayne@arm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-17 06:46:23 -08:00
Nicolò Lucchesi
8e962fef5f
[CI][Nixl] Add CrossLayer KV layout tests ( #34615 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-17 21:35:40 +08:00
Cyrus Leung
574fe75245
[Renderer] Move InputPreprocessor into Renderer (2/2) ( #34560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 05:29:01 -08:00
junuxyz
c61a98f529
[CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior ( #34514 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-17 12:22:56 +00:00
Harry Mellor
28bffe9466
Fix docs build warning ( #34686 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 02:31:40 -08:00
ChenqianCao
ad65177a19
[Bugfix] Fix 'remove_instance_endpoint' method logic in disagg_proxy_demo ( #32922 )
...
Signed-off-by: ChenqianCao <39755070+ChenqianCao@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 10:06:53 +00:00
Tim Dettmers
d44a5b6c47
Remove dead bitsandbytes CxB code from 8-bit inference path ( #34633 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-17 01:49:14 -08:00
Jiangyun Zhu
1d65283e95
Revert "[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj" ( #34683 )
2026-02-17 01:29:27 -08:00
kourosh hakhamaneshi
c464b57374
[Ray] Propagate third-party env vars to Ray workers via prefix matching ( #34383 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-17 01:08:42 -08:00
Amr Mahdi
c5c38e152a
[CI] Fix bake config artifact path for AMI rebuild pipeline ( #34656 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 06:39:44 +00:00
Woosuk Kwon
d00df624f3
[Model Runner V2] Minor refactoring for penalties ( #34662 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:43:00 -08:00
Woosuk Kwon
9752da9d9c
[Model Runner V2] Minor simplification for BadWordsState ( #34669 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:27:24 -08:00
Woosuk Kwon
04925b2202
[Model Runner V2] Minor cleanup for PP ( #34666 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:15:31 -08:00
Woosuk Kwon
d74278fb67
[Model Runner V2] Fix unintended CPU-GPU sync in make_dummy ( #34667 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:00:29 -08:00
haosdent
b68fd899d1
[Bugfix] Fix fused MoE int32 overflow in stride*offset without perf regression ( #34507 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-16 17:58:49 -08:00
Aneesh Puttur
0b5f9b7204
[CI] Enable mypy import following for vllm/v1/kv_offload ( #34639 )
...
Signed-off-by: Aneesh Puttur <aneeshputtur@gmail.com >
2026-02-17 09:58:15 +08:00
zhanqiuhu
9a8853f781
[Core] Pipeline Parallel support for Model Runner V2 ( #33960 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-16 17:48:16 -08:00
zhrrr
387a1898d9
[Model Runner V2] support bad_words sampling param ( #33433 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 16:36:06 -08:00
roikoren755
3b30e61507
[NemotronH] Do not force router to run in fp32 ( #34582 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-16 10:15:32 -08:00
Alexei-V-Ivanov-AMD
824f9e8f3c
Targeting the MI355 agent pool with all existing tests ( #34629 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-16 17:02:27 +00:00
Nicolò Lucchesi
6cc403e67d
[Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] ( #34624 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-16 16:11:07 +00:00
Almog Tavor
72d5951d02
[Bugfix] Treat generation_config max_tokens as default not ceiling ( #34063 )
...
Signed-off-by: almogtavor <almogtavor@gmail.com >
2026-02-16 07:58:24 -08:00
Lucas Kabela
a3205beffb
[CI] Enable mypy coverage for individual excluded files ( #34292 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 07:34:29 -08:00
Christian Pinto
6930becd45
(bugfix): Fixed encode in LLM entrypoint for IOProcessor plugin prompts ( #34618 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-02-16 07:33:55 -08:00
Andreas Karatzas
03a8770a6d
[ROCm][CI] Fix plugins test group; updating terratorch and dependencies ( #34589 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 07:33:42 -08:00
Yiqi Xue
bc56a1d56e
[Bugfix] Fix ARC touch KeyError for non-ready T1 blocks in kv offload ( #34576 )
...
Signed-off-by: Yiqi Xue <xuey666@gmail.com >
2026-02-16 07:33:19 -08:00
danisereb
ec7d9e6745
Fix call to moe_mk in modelopt MoE modules (required for LoRA) ( #34575 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-16 07:33:09 -08:00
Isotr0py
3bb4e4311c
[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj ( #34492 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-16 07:32:51 -08:00
Amr Mahdi
08f8c198ae
[CI] Disable precompiled wheel path in CI image builds ( #34606 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-16 15:14:43 +00:00
Harry Mellor
a21cedf4ff
Bump lm-eval version for Transformers v5 compatibility ( #33994 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 05:24:35 -08:00
emricksini-h
3ef74cde5d
[CI][Tracing] Fix race condition by adding server readiness check ( #34364 )
...
Attempt to resolve #34284: "Metrics Tracing (2GPU)" fails with a
segmentation fault.
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
2026-02-16 12:57:39 +00:00
Ekagra Ranjan
cd81cdb399
[Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs ( #31058 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-16 11:08:44 +00:00
Andreas Karatzas
1e828573b4
[CI][Metrics] Stabilize tests with polling and subprocess guards ( #34566 )
...
test_abort_metrics_reset is flaky due to hardware-dependent fixed sleeps: replace fixed sleeps with polling.
test_metrics_exist_run_batch passes even when the engine crashes on startup (false positive): add subprocess lifecycle guards.
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 10:52:02 +00:00
Samu Tamminen
a5ccc85c8c
[Bugfix] Fix Dynamo unexpected keyword argument ( #34320 )
...
Signed-off-by: Samu Tamminen <stammine@amd.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-16 01:32:30 -08:00
Roger Wang
b5475d0534
Revert "[Misc] fix qwen3.5 config" ( #34610 )
2026-02-16 01:06:05 -08:00
JJJYmmm
9521002f0a
[Misc] fix qwen3.5 config ( #34604 )
2026-02-16 00:25:38 -08:00
Cyrus Leung
ec17bdd894
[Renderer] Move InputPreprocessor into Renderer (1.5/2) ( #34598 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-15 23:46:33 -08:00
Amr Mahdi
bb59c90248
[CI] Write bake config to temp directory instead of repo root ( #34569 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-15 22:15:47 -08:00
bnellnm
5bff999d12
[Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues ( #34453 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-15 20:10:50 -08:00
Lucas Wilkinson
bb85929aa6
[BugFix] Fix Python 3.13 FlashMLA import error ( #34548 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-15 20:09:18 -08:00
Parth Bansal
5653021094
[Doc] Add Mistral-7b-v0.3 model to the batch invariance validated model ( #34584 )
...
Signed-off-by: Parth Bansal <parthbansal127@gmail.com >
2026-02-16 12:09:00 +08:00
Andreas Karatzas
974d829b05
[CI][Frontend] Return 422 instead of 500 for invalid Anthropic tool_choice ( #34590 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-15 20:06:48 -08:00
Isotr0py
91ac5d9bfd
[CI/Build] Enable tests for recent day-0 new models ( #34585 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 18:17:04 -08:00
Luka Govedič
23d825aba1
[torch.compile] Disable ar-rms fusion for ds3-fp4 & DP, fix CI test ( #34392 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-15 06:33:57 -08:00
Maryam Tahhan
f07a128413
[CPU][ARM] Add ARM BF16 cross-compilation support and improve documen… ( #33079 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-15 06:33:08 -08:00
Isotr0py
71cd89264f
[MM Encoder] Add Triton ViT attention backend ( #32183 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 06:32:47 -08:00
Isotr0py
19fab44152
[Doc] Update Encoder-Decoder models support doc with Florence-2 ( #34581 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 04:18:57 -08:00
Seiji Eicher
79c7e09235
[KV Connector] Add temporary, off-by-default VLLM_DISABLE_REQUEST_ID_RANDOMIZATION workaround ( #34415 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-14 23:26:10 -08:00
haosdent
79f3fab05a
[Bugfix] Handle num_expert_group=None in flashinfer block-scale FP8 MoE ( #34494 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-14 23:25:46 -08:00
Vadim Gimpelson
604b9eaec5
[BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 ( #34476 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-14 23:25:17 -08:00
Stanislav Kirillov
50dbd6c9e6
[bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used ( #34516 )
...
Signed-off-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-14 23:24:25 -08:00
Andreas Karatzas
98bcc6ca59
[CI][Entrypoints] Validate detokenize token IDs to prevent int64 overflow causing 500 ( #34468 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 23:08:38 -08:00
Andreas Karatzas
f13e86d8dd
[Kernels] Fix Helion GPU utils to use platform-agnostic device name API ( #34537 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 20:29:23 -08:00
Woosuk Kwon
9ca768c740
[Model Runner V2] Minor cleanup for Sampler ( #34563 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-14 18:29:03 -08:00
Thomas Parnell
d5fe3f702c
[Hybrid] Enable mamba prefix cache "align" mode with async scheduling ( #33997 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-02-14 13:15:56 -08:00
Cyrus Leung
73391a1baa
[Renderer] Move InputPreprocessor into Renderer (1/2) ( #34510 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-14 10:14:21 -08:00
Andreas Karatzas
b3c14229b0
[ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests ( #34538 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 07:32:09 -08:00
Roger Wang
2f186635cb
[Bugfix] Fix Qwen3.5 config loading ( #34554 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-14 03:56:11 -08:00
Christian Pinto
342a7cda2d
[Misc] Update tests and examples for Prithvi/Terratorch models ( #34416 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 23:03:51 -08:00
Kata Coder
d1ea65d0a1
[new model] add COLQwen3 code & Inference ( #34398 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
Signed-off-by: katacoder <craftsangjae@gmail.com >
2026-02-14 12:15:19 +08:00
Andreas Karatzas
de42abb366
[CI] Heavy refactoring of Voxtral multimodal audio model tests ( #34294 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:04:29 -08:00
Julien Denize
60ca7981bc
Add explicit validation error for tool calls. ( #34438 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-02-13 20:04:01 -08:00
Christian S. Perone
0ef5b9147b
fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection ( #34527 )
...
Signed-off-by: Christian S. Perone <christian.perone@gmail.com >
Signed-off-by: Christian S. Perone <perone@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-13 20:03:37 -08:00
Shiyan Deng
ed242652d7
[bug] Make sure get_modality_with_max_tokens is deterministic ( #34533 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-02-13 20:02:59 -08:00
Wei Zhao
b37b679770
[Feature][Perf] Support Selective CPU Weight Offloading ( #34535 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-13 20:02:24 -08:00
Andreas Karatzas
a0638d052d
[Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993 ( #34543 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:01:42 -08:00
Harry Huang
c027541eaf
[Hybrid] Enable spec decoding in mamba cache align mode ( #33705 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 13:02:28 -08:00
Ben Browning
fd267bc7b7
[Bugfix]: Fix structured output in multi-turn gpt-oss ( #34454 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 11:12:48 -08:00
Michael Goin
bfaa559305
Revert "[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides" ( #34530 )
2026-02-13 10:35:29 -08:00
Richard Zou
87789c8364
[Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only ( #34523 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-13 09:52:20 -08:00
Pushpinder Singh
bcd65c1f6a
[Bugfix] Replace c10::optional with std::optional in topk kernel ( #34467 )
...
Signed-off-by: Pushpinder Singh <pushpindersingh135@gmail.com >
2026-02-13 08:30:23 -08:00
Wei Zhao
59d53066d8
[Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation ( #32993 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-13 08:11:26 -08:00
LoganJane
4a9952ec1b
[Bugfix] Add quant_config in ViT of Kimi-K2.5 ( #34501 )
...
Signed-off-by: LoganJane <LoganJane73@hotmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-13 16:05:34 +00:00
Roger Wang
1dae7b7843
[Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one ( #34508 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 13:59:00 +00:00
Roger Wang
5885e330ef
[Misc] Port Qwen3.5 Configs ( #34512 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 05:24:25 -08:00
Ilya Boytsov
071d863e20
Extend ColBERT support to non-standard BERT backbones ( #34170 )
...
Signed-off-by: Ilya Boytsov <ilya.boytsov@aleph-alpha.com >
2026-02-13 09:53:09 +00:00
Woosuk Kwon
0916e7960b
[GDN] Use CPU tensors to build GDN metadata ( #34498 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-13 01:24:45 -08:00
Wentao Ye
3d2a026fd0
[Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement ( #33368 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-13 16:38:16 +08:00
Aaron Hao
dddbff4624
[Core] Move pause and resume functions into engine ( #34125 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-13 00:15:10 -08:00
Martin Hickey
47e9b63e1a
[KVConnector] Clean up redundant code in KV connectors ( #34147 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-02-13 00:14:30 -08:00
Matthias Gehre
934acddef9
[Perf] fused_moe: add int4_w4a16 benchmark support and tuning config ( #34130 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-13 00:14:27 -08:00
Marek Michalowski
742d214d6e
[Bugfix] fix the import path in moe test utils.py ( #34245 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-13 00:13:45 -08:00
haosdent
4137c5dfa7
[Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode ( #34418 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-13 00:13:22 -08:00
Harry Huang
7a8a46ddcb
[BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec ( #34440 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 00:13:14 -08:00
myselvess
bcf0731aa0
[New Model] support new model ovis2.6 ( #34426 )
...
Signed-off-by: myselvess <23743269+myselvess@users.noreply.github.com >
2026-02-13 00:12:45 -08:00
Cyrus Leung
ec090c2429
[Refactor] Call renderer for online IO processor request ( #34490 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 22:48:45 -08:00
Roger Wang
eea3024f43
[Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 ( #34489 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-12 22:48:42 -08:00
Cyrus Leung
2f308214c0
[Refactor] Pass full VllmConfig to Renderer ( #34485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 22:48:38 -08:00
Cyrus Leung
1b4e8e53f8
[CI/Build] Fix CUDA re-initialization error in distributed model tests ( #34491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-13 06:43:53 +00:00
haosdent
dcf6ee8592
[Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image ( #34483 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-12 21:04:06 -08:00
Cyrus Leung
372b2e762a
[Bugfix] Standardize getting number of image patches/tokens ( #34358 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:47:01 -08:00
Andreas Karatzas
6afa587d31
[ROCm][CI] Fix serving tokens test failures ( #34047 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 11:27:53 +08:00
Cyrus Leung
94ed6cf6ea
Add new sections to CODEOWNERS ( #34309 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:39:28 -08:00
Harry Huang
bf37812ca7
[Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode ( #33706 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:21:52 -08:00
Frank Wang
b86bf4417e
[Bugfix] Fix Random Dataset Prefix Length Inaccuracy ( #33907 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-12 18:21:19 -08:00
Yanan Cao
de13dd781f
[Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure ( #34025 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:21:05 -08:00
LoganJane
62788f99a4
[Bugfix] Delete unused redundant code in Kimi-K2.5 ( #34427 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 18:18:42 -08:00
Cyrus Leung
ea5ff3a1f6
[Refactor] Simplify BOS/EOS token handling ( #34435 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:18:24 -08:00
bnellnm
04ea31baab
[Bugfix] Remove assert that's no longer valid ( #34443 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-12 18:18:15 -08:00
Harry Huang
6f019e6e0a
[BugFix] Add block_size validation for mamba cache align mode ( #34445 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:18:07 -08:00
Zhuohan Li
d707678dfb
Fix num_logprobs parameter description in sampler.py ( #34451 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2026-02-12 18:18:03 -08:00
Cyrus Leung
fc22cae4ac
[CI/Build] Update video URLs for testing ( #34446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:15:36 -08:00
Yanan Cao
96161fe978
[Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel ( #33373 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:13:12 -08:00
Jaewon
4453ba8d9e
[Core] Profiler improvements and lazy initialization ( #33198 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:38 -08:00
Jaewon
aa181c923b
[Core] Add sleep level 0 mode with enqueue/wait pattern ( #33195 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:25 -08:00
Alec S
be7370daf3
[Frontend] Enable generic structured_outputs for responses API ( #33709 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
2026-02-12 16:15:48 -08:00
Mengtao (Martin) Yuan
9ea1f598ce
Use paged_attention_v1 for sliding window decode in rocm_aiter_fa ( #34378 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-02-12 16:14:43 -08:00
amitz-nv
f120bd42d3
[Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 ( #33506 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-02-12 13:06:58 -08:00
Hashem Hashemi
fac4e96940
small adjustment to wvSplitKrc ( #34410 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-12 20:26:36 +00:00
Michael Goin
6d4e27ce29
[Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA ( #34374 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 12:08:06 -08:00
Andreas Karatzas
4c078fa546
[ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility ( #34447 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-12 18:47:34 +00:00
Patrick von Platen
6c0baee610
[Voxtral Realtime] Refactor & Improve buffering logic ( #34428 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 09:46:43 -08:00
Patrick von Platen
1100a97621
[Voxstral Realtime] Enable tests ( #33803 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-02-12 09:43:24 -08:00
xuebwang-amd
766e167821
[ROCm][quantization] improve OCP weight quant parser robust ( #34431 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 09:40:19 -08:00
Isotr0py
becbe24808
[Bugfix] Remove broken raw url GGUF model loading support ( #34433 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 09:40:01 -08:00
Harry Mellor
679ca5d8d3
Fix MoE for the Transformers modelling backend ( #34436 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-12 09:29:42 -08:00
Matthew Bonanni
f2c47886fd
[Attention] Add FlashInfer Sparse MLA backend ( #33451 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-02-12 17:21:54 +00:00
Nicolò Lucchesi
334c715e0f
[Docs] Spec decoding docs warning removal ( #34439 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-12 09:01:51 -08:00
Aaron Hao
7b5a8b4a9d
[BUG] Reset running requests when clearing cache for pause/resume ( #34382 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-02-12 16:19:13 +00:00
danisereb
dea63512bb
Add config file for fused MoE for Nemotron (TP4, B200) ( #34411 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-12 06:09:55 -08:00
Douglas Lehr
8a798be929
[ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter ( #34192 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2026-02-12 05:06:33 -08:00
Cyrus Leung
fb455ed547
[V0 Deprecation] Remove code related to per-request logits processors ( #34400 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:44:28 +08:00
baonudesifeizhai
f5897613fb
Fix Mistral config remap to accept compressed-tensors quantization #34028 ( #34104 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-02-12 08:22:06 +00:00
Louie Tsai
55a1a9563a
Vllm CPU benchmark suite improvement ( #34128 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-02-12 16:04:44 +08:00
AllenDou
386bfe5d08
[bugfix] refactor FunASR's _get_data_parser ( #34397 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-12 07:26:49 +00:00
Kyle Sayers
e9cd691132
[Bugfix] Fix Sparse24 Compressed Tensors models ( #33446 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 23:15:16 -08:00
Yichuan Wang
80f2ba6ea6
Fix DeepSeek-OCR tensor validation for all size variants ( #34085 )
...
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-11 22:50:23 -08:00
Lucas Wilkinson
136b0bfa59
[BugFix] Fix DP chunking ( #34379 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Bill Nell <bnell@redhat.com >
2026-02-12 06:44:03 +00:00
Cyrus Leung
b96f7314b4
[Refactor] Pass Renderer to Input Processor ( #34329 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:38:11 -08:00
Cyrus Leung
ced2a92f40
[Refactor] Move validation to params definitions ( #34362 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:33:15 -08:00
Runkai Tao
e1d97c38f8
[Bug Fix] Fix naive_block_assignment always defaulting to False due to arg misalignment ( #33848 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-12 11:30:57 +08:00
Michael Goin
ec12d39d44
[Bugfix] Fix MTP accuracy for GLM-5 ( #34385 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 11:08:19 +08:00
Michael Goin
ff1f83b056
[Refactor] Replace activation: str with MoEActivation enum ( #33843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 17:29:32 -08:00
Kevin H. Luu
83b47f67b1
[ci] Integrate AMD tests into CI ( #33626 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 08:54:17 +08:00
Micah Williamson
fb7b30c716
[ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI ( #34384 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 15:52:34 -08:00
bnellnm
31d992d215
[Bugfix] Fix some issues with MoERunner PR #32344 ( #34371 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-11 14:33:14 -08:00
Wei Zhao
5aff2699bd
Fix CI failure - Flashinfer Kernel tests ( #34316 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-11 14:17:16 -08:00
Raushan Turganbay
527ca32197
[Bugfix] Fix more multimodal tests for transformers V5 ( #34334 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-02-11 22:02:05 +01:00
Junseo Park
5458eb835d
[Bugfix] send None sentinel on final commit so server properly sends transcription.done ( #33963 )
...
Signed-off-by: pjs102793 <pjs102793@naver.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 21:01:53 +00:00
Tomas Ruiz
144d9b7cc8
[Benchmarks] Reduce ready checker log verbosity ( #34349 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2026-02-11 20:57:57 +00:00
elvischenv
83e26c834e
[GPT-OSS] Remove unnecessary contiguous ( #34337 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-02-11 15:29:29 -05:00
TJian
5001211369
[ROCm] [CI] fix test_unrecognized_env ( #34350 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-02-11 18:50:44 +00:00
Eldar Kurtić
11c7ace340
[Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) ( #34243 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-11 13:24:22 -05:00
Xinyu Dong
be7f3d5d20
[Bugfix] fix default is_neox_style is True for deepseek ( #34353 )
...
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com >
2026-02-11 18:20:45 +00:00
Isotr0py
0ab06100f4
[Multimodal] Expose mm_processor_kwargs for DummyInputsBuilder ( #34330 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-11 09:37:40 -08:00
Xinyu Chen
ffb3d553cc
[Model Runner V2] Init cuda graph pool when necessary ( #33217 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-02-11 09:12:13 -08:00
junuxyz
fa7e0bfacf
[CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… ( #32458 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-11 17:03:48 +00:00
SorenDreano
48134a2c22
[Docs] Fix typo ("defult") and double spacing ( #34348 )
...
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com >
Co-authored-by: Soren Dreano <soren@numind.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-11 09:02:27 -08:00
kliuae
64f570ab56
[ROCm] [aiter] Split KV cache update for AiterFlashAttention ( #33681 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2026-02-11 16:26:44 +00:00
Rohan Potdar
fd618871b4
[Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache ( #33948 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-11 11:12:05 -05:00
Harry Mellor
67a42b5a44
Don't try and run GLM-ASR with remote code ( #34352 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 08:09:40 -08:00
Lucas Wilkinson
c7914d30f9
Reapply [Attention][FA3] Update FA3 to include new swizzle optimization ( #34043 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-11 07:07:56 -08:00
Adam Binford
1b8756562e
Responses harmony system message structured ( #34268 )
...
Signed-off-by: Adam Binford <adamq43@gmail.com >
2026-02-11 05:14:28 -08:00
Linda
275e0d2a99
[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE ( #33715 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-02-11 12:38:11 +00:00
Harry Mellor
0f5e55e7a8
Make JAIS compatible with Transformers v5 ( #34264 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 12:30:37 +00:00
Harry Mellor
1e9204bff3
Make Qwen3VL compatible with Transformers v5 ( #34262 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-11 04:13:23 -08:00
Li, Jiang
05339a7b20
[Bugfix][CPU] Fix llama4 inference on CPU ( #34321 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-11 19:07:23 +08:00
Harry Mellor
40b8f55358
[Docs] Reduce time spent generating API docs ( #34255 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 02:56:02 -08:00
Seiji Eicher
5045d5c983
Patch protobuf for CVE-2026-0994 ( #34253 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-11 02:25:04 -08:00
Nick Hill
e09546cf05
[Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer ( #34217 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 11:03:24 +01:00
Tianqi Ren
786806dd44
[Doc] Update Marlin support matrix for Turing ( #34319 )
...
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com >
2026-02-11 09:03:41 +00:00
Nick Hill
79504027ef
[Misc] Bump fastsafetensors version for latest fixes ( #34273 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 00:30:09 -08:00
Luka Govedič
addac0e653
[torch.compile] Enable AR+rms fusion by default available for -O2 ( #34299 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-11 00:30:00 -08:00
Cyrus Leung
675a22ed66
[Chore] Move BaseRenderer to base.py ( #34308 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 00:29:51 -08:00
Kunshang Ji
cb9574eb85
[XPU][9/N] clean up existing ipex code/doc ( #34111 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-11 00:27:15 -08:00
AllenDou
21dfb842d7
[model] support FunASR model ( #33247 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-11 07:37:09 +00:00
R3hankhan
d1b837f0ae
[CPU] Enable FP16 (Half dtype) support for s390x ( #34116 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-11 14:41:42 +08:00
Roger Wang
0b20469c62
[Bugfix] Fix weight naming in Qwen3.5 ( #34313 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 21:37:14 -08:00
Tyler Michael Smith
d7982daff5
[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides ( #34279 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-11 05:15:52 +00:00
Robert Shaw
9b17c57460
[ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast ( #34298 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-11 05:00:00 +00:00
Hashem Hashemi
1b3540e6c6
Threshold fix wvSplitk for occasional CI fails ( #34013 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-11 03:59:14 +00:00
Matthias Gehre
7a048ee65f
[Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 ( #34149 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-11 03:58:56 +00:00
Cyrus Leung
c9a1923bb4
[Plugin] Simplify IO Processor Plugin interface ( #34236 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:47:39 -08:00
zofia
b482f71e9f
[XPU][7/N] enable xpu fp8 moe ( #34202 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-11 03:33:59 +00:00
Дзержи́нский
1485396abb
[Kernel] Apply 256bit LDG/STG To Activation Kernels ( #33022 )
...
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com >
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-10 19:31:51 -08:00
Kebe
5ee5c86eeb
[Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast ( #33884 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2026-02-10 19:31:36 -08:00
Cyrus Leung
b5dcb372e4
[Misc] Clean up validation logic in input processor ( #34144 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:29:29 -08:00
Tyler Michael Smith
066c6da6a0
[WideEP] Fix nvfp4 DeepEP High Throughput All2All backend ( #33738 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 19:15:43 -08:00
Richard Zou
e30cedd44b
[torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter ( #34093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 19:15:40 -08:00
Cyrus Leung
3bcd494ef4
[Redo] Add --trust-remote-code to dataset bench args ( #34251 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 11:10:12 +08:00
tianshu-Michael-yu
0e725a7d22
[Bugfix] Fix Worker.load_model context-manager composition for sleep mode ( #34021 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-02-11 11:07:51 +08:00
Lucas Wilkinson
ba0511fd80
[Misc] Add run one batch script that supports profiling ( #32968 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-10 18:29:49 -08:00
Micah Williamson
4a1550d22d
[ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline ( #34280 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 01:08:11 +00:00
bnellnm
d1481ba783
[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner ( #32344 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-10 19:51:07 -05:00
7. Sun
dc6de33c3d
[CI] Add pip caching to cleanup_pr_body workflow ( #32979 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-02-11 00:45:28 +00:00
Tyler Michael Smith
c4b9e6778f
[Misc] Add pre-commit hook to catch boolean ops in with-statements ( #34271 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-10 15:13:20 -08:00
Richard Zou
341eed3d30
[torch.compile] Disable recursive pre_grad_passes ( #34092 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 18:02:31 -05:00
Zhengkai Zhang
6f2f59f2b3
[Misc][Spec Decode] support different load config for draft model ( #34022 )
...
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
2026-02-10 14:52:43 -08:00
Ilya Markov
bb2fc8b5e7
[BugFix] Fix async EPLB hang with DeepEP LL all2all backend ( #32860 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-10 22:34:47 +00:00
Ilya Markov
67132945bb
[Perf] Move eplb rebalance algo to async thread ( #30888 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-10 22:19:10 +00:00
Gregory Shtrasberg
f0ca0671c7
[Feature] Warn about unrecognized environment variables ( #33581 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-10 15:45:38 -06:00
Pavani Majety
578977bb5e
[SM100] Resubmit FMHA FP8 prefill for MLA ( #31195 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-10 16:18:43 -05:00
Roger Wang
9615575afc
[Bugfix] Fix mamba cache dtype for Qwen3.5 ( #34200 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:12:31 -08:00
Matthew Bonanni
4293c00b84
[Benchmarks] Fix attention benchmark smoke test ( #34269 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-10 16:04:07 -05:00
J Seppänen
506ad7d7c1
[Bugfix] Fix weights offloading for sleep mode ( #32947 )
...
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-02-10 20:38:17 +00:00
Reagan Lee
fdd6f2ad58
Convert online APIs to use Renderer ( #34084 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
2026-02-10 19:44:31 +00:00
Qi Wang
33bcd3dc3b
[Misc] Introduce ec_both role EC (encoder cache) connector ( #34182 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-02-10 18:55:35 +00:00
Michael Goin
1f5febb4b8
[UX nit] Fix non-default api_server_count message ( #34152 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-10 10:35:58 -08:00
Andy Lo
ae871ca923
Minor cleanup for Voxtral ( #34247 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-02-10 18:18:30 +00:00
Woosuk Kwon
a2443de5fa
[Model Runner V2] Use pinned memory for write_contents ( #34222 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-10 08:55:22 -08:00
Harry Mellor
f84a2a8f31
[Docs] Speed up build environment set-up ( #34240 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 16:34:43 +00:00
Vadim Gimpelson
000214c4bb
[BUGFIX] Fix accuracy bugs in Qwen3-Next MTP ( #34077 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-10 10:57:11 -05:00
junuxyz
c5a66d1697
[Core][BugFix] Fix PP KV cache sharding memory validation ( #33698 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-10 10:46:24 -05:00
Roberto L. Castro
afdce12c89
[Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention ( #33680 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-02-10 10:29:52 -05:00
Zhengxu Chen
82e11973cc
[compile] Enable AOT compile with 2.10 in trunk. ( #34155 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com >
2026-02-10 23:24:42 +08:00
xuebwang-amd
b129136c7a
[ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations ( #29008 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 10:08:05 -05:00
mgazz
599e4335a4
Support benchmarking of Geospatial models ( #33922 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
2026-02-10 07:04:16 -08:00
Fan Yang
a1946570d8
add --insecure arg to the vllm bench to skip TLS ( #34026 )
...
Signed-off-by: Fan Yang <yan9fan@meta.com >
Co-authored-by: Fan Yang <yan9fan@meta.com >
2026-02-10 22:23:52 +08:00
Harry Mellor
d0bc520569
Bump mamba-ssm version in CI for Transformers v5 compatibility ( #34233 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 14:46:01 +01:00
Krish Gupta
748625cdaf
[V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input ( #34220 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-10 13:05:32 +00:00
Harry Mellor
61413973e8
Stop testing for slow tokenizers as they will not exist soon ( #34235 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 12:08:20 +00:00
Phúc H. Lê Khắc
94de871546
[Misc] allow specify is_mm_prefix_lm in hf_config ( #34215 )
2026-02-10 11:16:21 +00:00
tc-mb
e042d7e685
Add flagos in MiniCPM-o ( #34126 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
2026-02-10 02:51:48 -08:00
Roger Wang
ae4e280602
[Bugfix] Fix FI kernel chunk_gated_delta_rule output shape for Qwen3.5 ( #34219 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 10:41:24 +00:00
zzaebok
cbea11c9f0
[Docs] Fix format error in KV load failure recovery doc ( #34137 )
...
Signed-off-by: Jaebok Lee <jaebok9541@naver.com >
2026-02-10 02:16:26 -08:00
Cyrus Leung
2c32558a3c
[Bugfix] Fix --trust-remote-code conflict ( #34218 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 00:29:10 -08:00
Zetong Li
5f970120f0
[Bugfix] Fix memory inconsistency in cross-process shared memory ( #32022 )
...
Signed-off-by: Zetong Li <slippersss@126.com >
2026-02-10 08:22:03 +00:00
Cyrus Leung
998e2d91f8
Revert #34208 ( #34216 )
2026-02-09 23:59:04 -08:00
Wentao Ye
e1060a71a1
[Perf] Optimize detokenizer python logic ( #32975 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-09 23:54:41 -08:00
Chen Zhang
97fa8f6590
[BugFix] Avoid prefix cache hit in the same schedule step for mamba layers ( #29387 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-10 07:41:16 +00:00
wang.yuqi
dab1de9f38
[Frontend][CI] Consolidate instrumentator entrypoints ( #34123 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-10 07:30:19 +00:00
Balaxxe
8d48d0a9d9
[Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 ( #34190 )
...
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com >
2026-02-09 23:06:30 -08:00
Andrew Xia
9608844f96
[responsesAPI] fix simpleContext streaming output_messages ( #34188 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-09 22:53:07 -08:00
Cyrus Leung
f69b903b4c
[Bugfix] Add --trust-remote-code to dataset bench args ( #34208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 22:37:50 -08:00
Lucas Wilkinson
81e217fe6b
[Bugfix] Fix DP Attention Padding in Dummy Run ( #34187 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-10 05:29:39 +00:00
Cyrus Leung
ab97bcf662
[CI/Build] Relax test_mcp_tool_call ( #34204 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 05:18:57 +00:00
Cyrus Leung
25e48a3aae
[Doc] Update usage of --limit-mm-per-prompt ( #34148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 21:12:13 -08:00
Roger Wang
8a5e0e2b2b
[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching ( #34183 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:03:32 +08:00
Andreas Karatzas
4cde2e0159
[ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection ( #34108 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 20:50:20 -08:00
Roger Wang
047a457fa4
[Bugfix] Adopt ChunkGatedDeltaRule for Qwen3.5 ( #34198 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 03:47:54 +00:00
Yuwei An
e94ec59733
[LMCache] Token Base IPC API ( #34175 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-02-10 01:18:42 +00:00
Ning Xie
13397841ab
[structured output] validate unsupported json features first ( #33233 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2026-02-09 23:49:09 +00:00
Gregory Shtrasberg
c60f8e3b49
[Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available ( #34153 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-09 17:38:54 -06:00
Michael Goin
5e75a14a66
[Doc] Add DCP support to attention backend doc ( #33936 )
2026-02-09 18:33:43 -05:00
Nick Hill
e7e52781ff
[ModelRunner V2][BugFix] Fix max_query_len calculation ( #34167 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-09 21:47:17 +00:00
Charlie Fu
bb9f97308d
[torch.compile][Fusion] Fix attention fusion pass removing kv_update op. ( #33945 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-02-09 16:15:43 -05:00
Hongxia Yang
4d39650961
[ROCm] update triton branch to support gpt-oss models for gfx11xx devices ( #34032 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2026-02-09 19:36:30 +00:00
Artus Krohn-Grimberghe
8fd31f6245
[Bugfix] Voxtral prompt/audio placeholder alignment ( #34140 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:30:38 +00:00
Artus Krohn-Grimberghe
eadb4e868b
[Bugfix] Avoid duplicate k-proj weight emission in helper ( #34142 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:17:44 +00:00
Jiangyun Zhu
285bab4752
[Kernel] use flashinfer for gdn prefill ( #32846 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-09 12:17:25 -05:00
TomerBN-Nvidia
995bbf38f1
[Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) ( #34087 )
...
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-09 16:44:18 +00:00
Mohammad Miadh Angkad
d4f123cc48
[Kernel] FlashInfer: switch allreduce fusion to unified API ( #33985 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-02-09 15:43:24 +00:00
ZhengHongming888
cb62e86f83
Add NUMA Core binding in nixl_connector for CPU xPyD ( #32365 )
...
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com >
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-09 15:39:12 +00:00
Luka Govedič
781ddf7868
[CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 ( #34031 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-09 10:05:14 -05:00
Roger Wang
64a9c2528b
[UX] Add --language-model-only for hybrid models ( #34120 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-09 14:57:33 +00:00
Lucas Wilkinson
d0d97e2974
[Misc] Fix up attention benchmarks ( #33810 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-09 09:42:03 -05:00
JJJYmmm
9562912cea
[MODEL] Adding Support for Qwen3.5 Models ( #34110 )
...
Signed-off-by: JJJYmmm <1650675829@qq.com >
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: wulipc <wulipc@users.noreply.github.com >
Co-authored-by: ywang96 <ywang96@users.noreply.github.com >
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-09 21:12:58 +08:00
zofia
9bdb06b436
[XPU][6/N] add xpu scaled_mm kernel ( #34117 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-09 20:17:35 +08:00
Nikhil Gupta
caad9f1e01
[Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul ( #33901 )
...
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com >
2026-02-09 18:04:41 +08:00
Ekagra Ranjan
1d5922fade
[ASR] Fix audio benchmark and add RTFx metric ( #32300 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-02-09 10:02:37 +00:00
Andreas Karatzas
3025b3cebb
[CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr ( #34107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 17:37:04 +08:00
Jee Jee Li
978a37c823
[Model] GLM adaptation ( #34124 )
2026-02-09 17:32:52 +08:00
ihb2032
5a5c43511a
fix(cpu): fix mla_decode compilation on x86 without AVX512 ( #34052 )
...
Signed-off-by: ihb2032 <hebome@foxmail.com >
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain >
2026-02-09 08:55:41 +00:00
Nick Hill
d9bede0314
[BugFix] Fix fastsafetensors TP all procs using all GPUs ( #34070 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-09 15:15:46 +08:00
wang.yuqi
22b64948f6
[Frontend][last/5] Make pooling entrypoints request schema consensus. ( #31127 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-09 06:42:38 +00:00
Reagan Lee
7c233dbb36
[Tiny] Rename encoder budget file to more specific name ( #34103 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
2026-02-09 03:48:19 +00:00
kourosh hakhamaneshi
a75a5b54c7
[bug-fix] supported_tasks is breaking backward compatibility at init_app_state ( #34027 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Signed-off-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-09 09:46:46 +08:00
Andrey Talman
f97ca67176
[Release 2.10] Update to Torch 2.10 - final release ( #30525 )
2026-02-08 13:51:09 -08:00
danisereb
084aa19f02
Add support for ModelOpt MXFP8 dense models ( #33786 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-08 11:16:48 -08:00
navmarri14
1ecfabe525
glm 4.6 fused tuned inference config for B200 ( #32958 )
2026-02-08 18:55:47 +00:00
Richard Zou
4df841fe75
[torch.compile] Add an option to force-enable the MOE cold start optimization ( #33735 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-08 18:42:56 +00:00
TomerBN-Nvidia
a263aa6140
[BugFix] Change support no act and mul for marlin ( #34088 )
...
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
2026-02-08 17:18:22 +00:00
aabbccddwasd
179ae7da8f
[Revert] Fix performance regression for GLM-4.7-GPTQ decode and MTP acceptance rate ( #33771 )
...
Signed-off-by: aabbccddwasd <aabbccddwasd@qq.com >
2026-02-08 08:13:24 -08:00
Reagan Lee
c4df59ad43
Add embedding input functionality for disabled modalities [remake] ( #32493 )
...
Signed-off-by: Reagan Lee <“reaganjlee@gmail.com ”>
Signed-off-by: Reagan Lee <reaganjlee@gmail.com >
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Co-authored-by: Reagan Lee <“reaganjlee@gmail.com ”>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-08 04:57:16 -08:00
TJian
785cf28fff
[ROCm] [CI] Reduce Resource of two test groups ( #34059 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-02-08 15:17:26 +08:00
Nick Hill
a96197f564
[Perf] Simplify DeepseekV32 tokenizer, ensure fast detokenization used ( #33855 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-08 07:16:34 +00:00
Andreas Karatzas
ab10d79855
[ROCm][Bugfix] fix act_quant_fusion module import error ( #34069 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-07 19:21:12 -08:00
Cyrus Leung
7fcb705b80
[CI/Build] Skip GCS test ( #34057 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 08:52:38 -08:00
Cyrus Leung
b956cdf818
[Doc] Fix run_batch docs ( #34056 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 06:18:16 -08:00
Hashem Hashemi
ed17f54c8b
Perf tuning and expansion of cases covered for wvSplitKrc ( #33493 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-07 05:33:11 -08:00
Jiang Wu
860981d8d8
Make directory exist ok for ray spinning up multiple replicas on a single instance ( #33604 )
...
Signed-off-by: Jiang Wu <jwu@cclgroup.com >
2026-02-07 05:30:49 -08:00
zifeitong
52181baaea
Update DeepGEMM version pin in Dockerfile to match #32479 ( #33935 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-07 05:30:22 -08:00
Rohan Potdar
de3869bb4d
move checks out of unified_kv_cache_update custom op ( #33943 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-07 05:30:09 -08:00
whx
ce9b3cd3e9
[PluggableLayer][3/N] Apply PluggableLayer to mamba layers. ( #33660 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-02-07 05:26:05 -08:00
Jee Jee Li
db4ede9743
[Model] Enable Step3p5ForCausalLM testing ( #33755 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-07 05:25:24 -08:00
Pooya Davoodi
2cb2340f7a
[Frontend]Add support for transcriptions and translations to run_batch ( #33934 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-07 05:24:57 -08:00
TundeAtSN
4df44c16ba
Enable Eagle3 speculative decoding for Mistral3ForConditionalGeneration ( #33939 )
...
Signed-off-by: Akintunde Oladipo <akintunde.oladipo@servicenow.com >
Signed-off-by: TundeAtSN <akintunde.oladipo@servicenow.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-07 05:24:52 -08:00
Richard Zou
81fe69cae5
[torch.compile] Stop compiling identical artifacts ( #34003 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-07 05:24:48 -08:00
Mohammad Miadh Angkad
dd6a6e1190
[Kernel] Add KernelConfig flag to enable/disable FlashInfer autotune ( #34006 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-07 05:24:44 -08:00
Cyrus Leung
edb359cce4
[Renderer] Define render_cmpl and render_chat ( #34039 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 05:24:40 -08:00
wang.yuqi
6ed5eda300
[CI][Build] Pin grpcio-tools==1.78.0 ( #34048 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-07 05:24:35 -08:00
Cyrus Leung
11a4c9d30d
[Misc] Simplify get_max_tokens ( #34036 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 00:59:49 -08:00
lukec
15a0b9e570
Fix spelling errors ( #33978 )
2026-02-06 23:58:50 -08:00
Andreas Karatzas
c490d8cc73
[ROCm][CI] Pinning lm-eval version to resolve multi-modal small eval bug ( #34038 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-06 22:21:08 -08:00
Cyrus Leung
48312e579a
[Misc] Make PlaceholderRange.get_num_embeds a method ( #34035 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-07 05:30:17 +00:00
Vel
bc32444b23
[Kernel] Add enable_sm120_or_later for SM121 (DGX Spark) CUTLASS support ( #33517 )
...
Signed-off-by: code4me2 <velvetmoon222999@gmail.com >
2026-02-06 20:28:01 -08:00
Wentao Ye
18e8545297
[Revert] Add util handle_deprecated back ( #33998 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-07 04:14:45 +00:00
果冻虾仁
6f7adc533a
fix description in plugin_system.md ( #33999 )
2026-02-06 19:37:02 -08:00
Nick Hill
40218a82ba
[ModelRunner V2] Revert token rank comparison difference for now ( #34017 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-07 11:11:05 +08:00
kourosh hakhamaneshi
1c3b22058f
[Misc] Add backward-compatible import aliases for renamed translations module ( #34015 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-07 11:01:41 +08:00
Xin Yang
3920cafdd6
[Bugfix] Fix _fused_moe_lora_expand signature mismatch ( #33821 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-07 10:45:59 +08:00
rasmith
ec28784fdc
[CI][AMD][Bugfix] Check that model_config is not None in enable_norm_pad_fusion ( #34007 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-07 02:43:25 +00:00
Nicolò Lucchesi
55aeec04f5
[Bugfix] Fix Whisper tokenization ( #34011 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-07 10:42:52 +08:00
Ikenna
906077181b
[Bugfix] Fix QK Norm+RoPE fusion pattern matching on B200+FP8 ( #33967 )
...
Signed-off-by: Ikenna <ikennachifo@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-07 02:27:33 +00:00
Aaron Hao
89a385d79f
[Feat][RL] Pause and Resume with keep requests for single engine ( #32351 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-07 00:08:58 +00:00
kourosh hakhamaneshi
4a2d00eafd
[bugfix] [ROCm] Fix premature CUDA initialization in platform detection ( #33941 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2026-02-06 16:17:55 -06:00
Dimitrios Bariamis
207c3a0c20
Fix RoutingMethodType logic ( #33919 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-02-06 14:03:34 -08:00
Sumanth R Hegde
ae2e93f89b
[Fix] Fix logprobs=0 handling for /inference/v1/generate endpoint ( #34010 )
...
Signed-off-by: SumanthRH <sumanthrh99@gmail.com >
2026-02-06 20:33:40 +00:00
xuebwang-amd
9e9acce577
[Bugfix] Fix no attribute error of SharedFusedMoE (DeepSeek-V3.1 as test model) ( #33993 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-02-06 19:11:32 +00:00
Charlie Fu
fe5438200b
[Rocm][Bugfix] Fix dtype not same for gemm_a4w4 op ( #33734 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-02-06 19:09:59 +00:00
Wentao Ye
77c09e1130
[Refactor] Remove align block size logic in moe_permute ( #33449 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-06 10:57:06 -08:00
zhrrr
16786da735
[Model Runner V2] support apply penalty for spec decode ( #33251 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-02-06 10:56:48 -08:00
vllmellm
aaa2efbe98
[DOC] [ROCm] Update docker deployment doc ( #33971 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 10:05:35 -08:00
Seiji Eicher
aca5967416
[KV Connector] Add missing method overrides to MultiConnector ( #33292 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-06 12:58:21 -05:00
Wentao Ye
67a746e87f
[Log] Optimize duplicate startup log ( #33944 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-06 17:49:56 +00:00
Chauncey
7bec435130
[Bugfix] Fix the issue where tool calling does not work when using fast detokenization with dsv32 ( #33964 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-06 09:23:44 -08:00
Eldar Kurtić
5c52644b10
[Docs] Update link to Benchmark CLI documentation ( #33254 )
...
Signed-off-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com >
2026-02-06 16:00:59 +00:00
zofia
2ce9fe4ad0
[XPU][5/N] add wna16 xpu kernel ( #33973 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-06 15:59:53 +00:00
Cyrus Leung
cd8b405bd0
[Refactor] Consolidate sequence normalization and enc-dec parsing ( #33928 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-06 15:43:47 +00:00
tc-mb
4707f7ebb4
[Model] Support MiniCPM-o 4.5 ( #33431 )
...
Signed-off-by: caitianchi <caitianchi@modelbest.cn >
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Co-authored-by: mslv <mslv@baai.ac.cn >
2026-02-06 15:29:10 +00:00
Michael Goin
c39ee9ee2b
[Docs] Add sections on process architecture and minimum CPU resources ( #33940 )
...
It seems users can be confused about vLLM's performance when running
with very few CPU cores available. We were missing a clear overview of
vLLM's process architecture, so I added one, along with some diagrams,
in arch_overview.md, and included a section on CPU resource
recommendations in optimization.md.
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-06 15:26:43 +00:00
Andreas Karatzas
350ca72c04
[ROCm][AITER] Fix AITER import regression for explicit backend selection ( #33749 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-06 15:08:16 +00:00
FredericOdermatt
1fb0495a72
[FIX] guidance: use max(vocab_size, len(tokenizer)) for n_vocab ( #33509 )
...
Signed-off-by: Frederic Odermatt <frederic.odermatt@44ai.ch >
2026-02-06 14:23:03 +00:00
Raushan Turganbay
85ee1d962b
[Bugfix] Fix models and tests for transformers v5 ( #33977 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Signed-off-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 21:47:41 +08:00
Harry Mellor
51a7bda625
Update WeightTransferConfig to be more standard like the others ( #33989 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 13:15:00 +00:00
SorenDreano
6e7b1c4b59
[Docs] Improve documentation ( #33799 )
...
Co-authored-by: Soren Dreano <soren@numind.ai >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-06 12:57:09 +00:00
Kurt Shuster
2991dd3d22
[Bugfix][Model] Support LoRA on Qwen3 Output Embedding ( #29816 )
...
Signed-off-by: kurt <kurt@thinkingmachines.ai >
2026-02-06 20:25:31 +08:00
Luka Govedič
ac32e66cf9
[torch.compile] Reorganize vllm/compilation and tests/compile (0/N for vLLM IR) ( #33731 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-06 04:19:49 -08:00
Fadi Arafeh
f79d9dce16
[CPU][BugFix] Fix loading of w8a8int models with bias ( #33582 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-02-06 11:59:20 +00:00
Harry Mellor
ba5cbbf107
Bump HF Hub client to get bug fix ( #33984 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 11:25:33 +00:00
zhang-prog
233b26ab35
[PaddleOCR-VL] Add BC for transformers 5.0 config ( #33976 )
...
Signed-off-by: zhangyue66 <zhangyue66@baidu.com >
2026-02-06 10:33:49 +00:00
Harry Mellor
791a94bed0
Consolidate and fix forbidden import pre-commit checks ( #33982 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 01:47:41 -08:00
Xinyu Chen
e969a169ef
support view_from_cpu_tensor on XPU ( #33868 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-02-06 08:34:20 +00:00
Harry Mellor
6d8d34be6d
Fix main pre-commit ( #33975 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-06 00:08:05 -08:00
Gassan Salama
1363e3d6d5
[cpu][performance] CPU Paged Attention NEON BFMMLA BF16 Implementation ( #32263 )
...
Signed-off-by: Gassan <gassan.salama@arm.com >
2026-02-06 15:01:48 +08:00
chengchengpei
965525667b
Onboard voyage-4-nano ( #33720 )
...
Signed-off-by: Chengcheng Pei <chengchengpei@outlook.com >
Signed-off-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com >
Co-authored-by: chengchengpei <5881383+chengchengpei@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-06 06:23:34 +00:00
sihao_li
6550815c3a
[XPU]Replace pip in docker.xpu with uv pip ( #31112 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-02-06 14:02:33 +08:00
Kunshang Ji
7439e4f41b
[XPU][4/N] add mxfp4 moe model support ( #33679 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-06 13:03:59 +08:00
R3hankhan
ac04dd374f
[CPU] Add BF16 Kernel type for s390x ( #33788 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-06 04:57:02 +00:00
Cyrus Leung
035a6cb09a
[Misc] Update code for encoder-decoder models ( #33900 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-06 11:38:39 +08:00
Mingliang Li
a32cb49b60
feat(frontend): early-fail tokenization guard for user requests ( #31366 )
...
Signed-off-by: limingliang <limingliang@stepfun.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: limingliang <limingliang@stepfun.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 19:38:02 -08:00
Rabi Mishra
20d7454c9b
fix(ROCm): Make flash_attn import optional in MLA attention ( #33511 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-02-06 02:22:53 +00:00
Simon Mo
5819ca8944
[Docs] Add reo analytics ( #33957 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2026-02-05 17:42:22 -08:00
Xin Yang
79028d4388
[Perf] Disable clean_logits in deepgemm fp8_mqa_logits kernel ( #33568 )
2026-02-05 20:34:00 -05:00
emricksini-h
325ab6b0a8
[Feature] OTEL tracing during loading ( #31162 )
2026-02-05 16:59:28 -08:00
Wei Zhao
91a07ff618
[Bugfix] Fix DeepSeek v3.2 tokenizer outputting None issue ( #33832 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-05 23:50:49 +00:00
Hashem Hashemi
d5c4800112
Adds padding and perf improvements to wvSplitK_fp8 ( #33527 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-05 22:16:02 +00:00
Lumosis
42d5d705f9
[Minor] Sort safetensors files to ensure deterministic loading order ( #33491 )
...
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-02-05 17:05:09 -05:00
Cyrus Leung
116880a5a0
[Bugfix] Make MM batching more robust ( #33817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 20:40:58 +00:00
Matthew Bonanni
4145e50d85
[Bugfix] Fix DSV3.2 NVFP4 ( #33932 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-05 19:22:19 +00:00
Nicolò Lucchesi
20f5d185a6
[Misc] Rename translations to speech_to_text for OAI serving component ( #33904 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 19:16:52 +00:00
Harry Mellor
1887acca9e
Fix tokenizer test for renamed attr on Transformers v5 ( #33902 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-05 19:16:20 +00:00
Tsukasa OI
92e7562a99
[Bugfix] Suppress non-TTY color output on the process name part of the log ( #29714 )
...
Signed-off-by: Tsukasa OI <floss_llm@irq.a4lg.com >
2026-02-05 18:47:09 +00:00
Isotr0py
87d0d17ab5
[Models] Consolidate Deepseek-OCR2 processor ( #33909 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-05 18:29:20 +00:00
bnellnm
a57c8228ff
[Moe Refactor] Make Inplace Flag for FusedMoEModularKernel part of the constructor ( #33375 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-05 18:07:18 +00:00
zackyoray
1ee95841bd
[Bugfix] Fix swapped engine_ids in NIXL Llama 4 local attention path ( #33795 )
...
Signed-off-by: Yoray Zack <yorayz@nvidia.com >
2026-02-05 17:51:58 +00:00
Nicolò Lucchesi
7d8c6804e2
[Misc] Add debug logs ( #33931 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 09:42:40 -08:00
Benjamin Chislett
af3162d3aa
[Spec Decode] Unified Parallel Drafting ( #32887 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-05 12:37:18 -05:00
danisereb
5b2a9422f0
[BugFix] Fix LoRA Fp8 ( #33879 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-05 17:25:55 +00:00
Aaron Hao
c1858b7ec8
[Feat][RL][1/2] Native Weight Syncing API: NCCL ( #31943 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Co-authored-by: SumanthRH <sumanthrh99@gmail.com >
2026-02-05 12:13:23 -05:00
Mario Hong
82914d2ae8
[Bugfix] Fix step3p5 parser when using mtp ( #33690 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
2026-02-05 16:04:04 +00:00
Nicolò Lucchesi
81a90e5277
[Docs] Add bart-plugin to docs ( #33905 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-05 12:20:25 +00:00
wang.yuqi
1c3a221d3b
[Bugfix] Fix corner case of sparse embedding ( #33886 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 02:51:22 -08:00
Cyrus Leung
7bd42e609d
[Refactor] Clean up input preprocessing ( #33687 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-05 18:43:42 +08:00
Isotr0py
a2522839d8
[Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading ( #33876 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-05 10:29:54 +00:00
jiahanc
59a5cb387a
[perf] Integrate flashinfer concat_mla_k ( #31171 )
2026-02-05 05:23:11 -05:00
liranschour
8322d4e47f
Enable Cross layers KV cache layout at NIXL Connector V2 ( #33339 )
...
Signed-off-by: Liran Schour <lirans@il.ibm.com >
Signed-off-by: liranschour <liranschour@users.noreply.github.com >
Co-authored-by: Or Ozeri <or@ozery.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-05 02:17:02 -08:00
Andreas Karatzas
3e472e81f9
[ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) ( #32710 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-02-05 10:01:23 +00:00
Cyrus Leung
038914b7c8
[Refactor] Move task outside of PoolingParams.verify ( #33796 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 09:33:11 +00:00
Pavani Majety
d2f4a71cd5
[Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. ( #33858 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-05 09:32:10 +00:00
Mark McLoughlin
2abd97592f
[KV Connector][Metrics] Do not count local prefix cache hits in connector queries ( #30522 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-05 09:57:27 +02:00
Chauncey
6abb0454ad
[Perf] Optimize the performance of structured output + reasoning ( #33557 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-05 15:45:29 +08:00
Li, Jiang
db6f71d4c9
[CI/Build] Fix CPU CI test case title ( #33870 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-05 15:07:14 +08:00
Fadi Arafeh
fd03538bf9
[CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs ( #33727 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-02-05 06:26:09 +00:00
Andreas Karatzas
1f70313e59
[Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result ( #33837 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 06:17:00 +00:00
Li, Jiang
07daee132b
[CI/Build] Parallelize CPU CI tests ( #33778 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-05 13:53:48 +08:00
Andrew Xia
9595afda18
[2/N] move responses/serving _make_response_output_items logic to parser ( #33281 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Signed-off-by: Andrew Xia <axia@meta.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-05 13:46:15 +08:00
rasmith
c1395f72cd
[CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly ( #33840 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-05 05:05:48 +00:00
rinbaro
007b183d74
[docs] fix unintentional misspellings ( #33863 )
...
Signed-off-by: rinbaro <ilgomishra@gmail.com >
2026-02-04 20:50:59 -08:00
Nick Hill
add9f1fbd9
[Minor] Include StreamingInput in inputs package ( #33856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-05 04:38:20 +00:00
Luka Govedič
e3bf79ffa0
Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" ( #33841 )
2026-02-04 19:54:27 -08:00
Andreas Karatzas
fb1270f1f8
[CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode ( #32762 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-05 11:14:06 +08:00
Kevin H. Luu
72bb24e2db
[release] Minor fixes to release annotation ( #33849 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-05 02:07:35 +00:00
Chauncey
a7be77beef
[Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 ( #33637 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-05 01:28:36 +00:00
zhanqiuhu
bbe0574d8e
[Bugfix] Disable TRTLLM attention when KV transfer is enabled ( #33192 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-05 00:49:18 +00:00
Luka Govedič
4d9513537d
[CI][torch.compile] Reduce e2e fusion test time ( #33293 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-04 19:09:03 -05:00
Ilya Boytsov
439afa4eea
feat: Add ColBERT late interaction model support ( #33686 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
Signed-off-by: Ilya Boytsov <boytsovpanamera@mail.ru >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-05 08:05:13 +08:00
Nick Hill
fa4e0fb028
[Core] Don't schedule spec tokens with prefill chunks ( #33652 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-04 23:40:22 +00:00
Sage Moore
ce498a6d61
Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] ( #33573 )
...
Signed-off-by: Sage Moore <sagmoore@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-04 17:02:46 -05:00
Richard Zou
9f14c9224d
Revert "[torch.compile] Significantly speed up cold start times" ( #33820 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-04 21:59:59 +00:00
Muhammad Hashmi
535de06cb1
[Model] Add transcription support for Qwen3-Omni ( #29828 )
...
Signed-off-by: Muhammad Hashmi <mhashmi@berkeley.edu >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: NickLucche <nlucches@redhat.com >
2026-02-04 21:17:47 +00:00
Simon Danielsson
4292c90a2a
[Bugfix] Support RotaryEmbedding CustomOp for gpt-oss ( #33800 )
...
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com >
2026-02-04 20:17:41 +00:00
Taeksang Kim
6e98f6d8b6
Implement zero-copy GQA for multimodal and CPU ( #33732 )
...
Signed-off-by: Taeksang Kim <ts.kim@hyperaccel.ai >
2026-02-04 20:11:39 +00:00
kourosh hakhamaneshi
2f6d17cb2f
[rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm ( #33308 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2026-02-04 10:09:14 -08:00
Isotr0py
192ad4648b
[Bugfix] Fix interns1-pro initialization and PP ( #33793 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-04 17:54:45 +00:00
Lucas Wilkinson
0e92298622
[Misc] Delay deprecation of CommonAttentionMetadata properties ( #33801 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-04 08:41:57 -08:00
jiangkuaixue123
87d9a26166
[Bugfix] Fix ubatch wrapper num_tokens calculate ( #33694 )
...
Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com >
2026-02-04 16:41:45 +00:00
Cyrus Leung
80f921ba4b
[Bugfix] Fix normalize still being passed to PoolerConfig ( #33794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-04 23:56:02 +08:00
Wentao Ye
711edaf0d0
[Perf] Optimize spec decoding + async scheduling, 1.5% Throughput improvement ( #33612 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-04 09:34:32 -05:00
Micah Williamson
1d367a738e
[Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching ( #33713 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-04 05:36:29 -08:00
Cyrus Leung
32a02c7ca2
Apply #33621 to main ( #33758 )
...
Signed-off-by: Zachary Aristei <zaristei@nvidia.com >
Co-authored-by: zaristei2 <zaristei2@gmail.com >
Co-authored-by: Zachary Aristei <zaristei@nvidia.com >
2026-02-04 05:35:39 -08:00
Chauncey
f67ee8b859
[Perf] Optimize chat completion streaming performance ( #33782 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-04 12:30:36 +00:00
Cyrus Leung
e57ef99b40
[Model] Apply #32631 for recent models ( #33785 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-04 12:23:01 +00:00
Yueqian Lin
f8516a1ab9
[Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni ( #33605 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-04 12:15:29 +00:00
Vadim Gimpelson
824058076c
[PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] ( #33291 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-04 11:20:52 +00:00
Or Ozeri
8e32690869
[KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads ( #32255 )
...
Fixes a not-yet-reported case where it was possible for blocks to be
freed by an abort before an async transfer completed, resulting
in corrupted KV data.
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-02-04 11:16:34 +00:00
Zhengxu Chen
a208439537
[compile] Remove runner type from ignored caching factor list. ( #33712 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-04 10:56:45 +00:00
Zhengxu Chen
bcd2f74c0d
[compile] Clean up AOT compile bypass on evaluate_guards. ( #33578 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-04 02:12:53 -08:00
Kunshang Ji
f79f777803
[XPU][2/N] add support unquantized moe support for xpu ( #33659 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-04 02:12:25 -08:00
Augusto Yao
4c8d1bf361
use ORJSONResponse when available to improve the efficiency of request process ( #33548 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-02-04 10:04:11 +00:00
Kunshang Ji
061da6bcf7
[XPU] remove common path warning log ( #33769 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-04 16:40:17 +08:00
zhanqiuhu
4403e3ed4c
[Metrics] Add labeled prompt token metrics for P/D disaggregation ( #33290 )
...
Add labeled Prometheus metrics to distinguish where prompt tokens come
from in P/D disaggregated deployments.
In P/D disaggregation, decode instances receive KV cache from prefill instances.
Currently, decode reports inflated prompt throughput because it counts all
prompt tokens as "computed", even though most were transferred.
This PR adds labeled metrics so users can understand actual compute work vs
transferred work:
    vllm:prompt_tokens_by_source_total{source="local_compute"}         # Tokens prefilled locally
    vllm:prompt_tokens_by_source_total{source="external_kv_transfer"}  # Tokens received via KV transfer
    vllm:prompt_tokens_by_source_total{source="local_cache_hit"}       # Tokens from local prefix cache
    vllm:prompt_tokens_cached_total                                    # Total cached (local + external, -1 when all
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-04 07:46:48 +00:00
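Illustrative note on #33290: the labeled counters described in that commit message could be read from a Prometheus text scrape roughly as sketched here, to see what share of prompt tokens a decode instance actually computed. This is a minimal sketch and not code from the PR; the sample counter values and the helper names prompt_tokens_by_source and source_shares are assumptions made up for illustration. Only the metric name and the three source labels come from the commit message.

```python
import re

# Hypothetical scrape output: the counter values below are made up.
SAMPLE_METRICS = """\
vllm:prompt_tokens_by_source_total{source="local_compute"} 200
vllm:prompt_tokens_by_source_total{source="external_kv_transfer"} 700
vllm:prompt_tokens_by_source_total{source="local_cache_hit"} 100
"""

# Matches one sample line of the labeled counter: name{labels} value
_LINE = re.compile(r'^vllm:prompt_tokens_by_source_total\{([^}]*)\}\s+(\S+)\s*$')
_SOURCE = re.compile(r'source="([^"]*)"')


def prompt_tokens_by_source(text):
    """Sum the labeled counter per `source` label (hypothetical helper)."""
    totals = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue  # comment lines, other metrics, blank lines
        source = _SOURCE.search(match.group(1))
        if not source:
            continue
        key = source.group(1)
        # Other labels may be present; samples with the same source are summed.
        totals[key] = totals.get(key, 0.0) + float(match.group(2))
    return totals


def source_shares(totals):
    """Convert per-source totals into fractions of all prompt tokens."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {key: 0.0 for key in totals}
    return {key: value / grand_total for key, value in totals.items()}


if __name__ == "__main__":
    for source, share in source_shares(prompt_tokens_by_source(SAMPLE_METRICS)).items():
        print(f"{source}: {share:.0%}")
```

With made-up numbers like these, most prompt tokens on the decode side would show up under external_kv_transfer rather than local_compute, which is the distinction the commit message says the unlabeled prompt-token count hides.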
Matt
08e094997e
[Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism ( #32745 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-02-04 14:51:33 +08:00
Wentao Ye
d88a1df699
[Deprecation] Deprecate profiling envs ( #33722 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-04 05:58:21 +00:00
Cyrus Leung
90d74ebaa4
[Deprecation] Remove _get_data_parser in MM processor ( #33757 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-04 05:51:52 +00:00
Frank Wang
45f8fd6f97
[Feature] Enable TRITON_ATTN for Batch Invariance ( #33688 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2026-02-04 13:27:34 +08:00
Wentao Ye
5e1e0a0fbd
[Refactor] Remove unused dead code ( #33718 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-03 21:25:11 -08:00
Michael Goin
eb5ed20743
[Bugfix] Define router_logits_dtype for remaining MoE models ( #33737 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-04 13:24:14 +08:00
Huy Do
2647163674
Save startup benchmark results as a list of values ( #33629 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-02-03 20:37:51 -08:00
Shanshan Shen
9fb27dd3b3
[MM] Align the prefix of MMEncoderAttention with Attention ( #33750 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-02-04 04:07:30 +00:00
R3hankhan
4dffc5e044
[CPU] Split attention dispatch by head_dim alignment ( #32161 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-03 19:37:15 -08:00
Andrew Xia
e1bf04b6c2
[1/N] Initial Implementation of Parser for ResponsesAPI ( #32712 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-04 10:59:03 +08:00
Isotr0py
02080179a3
[Bugfix] Fix torchrun PP broadcast deadlock with async scheduling ( #33701 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-04 02:17:37 +00:00
wang.yuqi
1b8fe6f7c4
[Frontend][4/n] Make pooling entrypoints request schema consensus | ScoreRequest ( #33060 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-04 01:48:40 +00:00
Nick Hill
52ee21021a
[BugFix][Spec Decoding] Fix negative accepted tokens metric crash ( #33729 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-03 23:34:41 +00:00
Wentao Ye
655efb3e69
[Dependency] Remove comments of ray in dependency files ( #33351 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-03 15:30:47 -08:00
Matthew Bonanni
bd8da29a66
[Bugfix] Fix sparse MLA metadata building ( #33579 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-03 15:29:48 -08:00
Michael Goin
2a99c5a6c8
[Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3 ( #33613 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-03 13:26:51 -08:00
Patrick von Platen
3f7662d650
[Voxtral Realtime] Change name ( #33716 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-02-03 13:03:28 -08:00
Vadim Gimpelson
a372f3f40a
[MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 ( #33257 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-03 15:10:31 -05:00
Harry Mellor
61e632aea1
Turn @config into a dataclass_transform ( #31541 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 17:40:59 +00:00
Richard Zou
b1bb18de8d
[torch.compile] Significantly speed up cold start times ( #33641 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-03 09:12:11 -08:00
Lucas Wilkinson
2267cb1cfd
[Attention][FA3] Update FA3 to include new swizzle optimization ( #23465 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-03 08:08:47 -08:00
dtc
0d6ccf68fa
[P/D] rework mooncake connector and introduce its bootstrap server ( #31034 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-02-03 08:08:25 -08:00
Cyrus Leung
18e7cbbb15
[Bugfix] Fix startup hang for Granite Speech ( #33699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 15:57:56 +00:00
Patrick von Platen
f0d5251715
[Voxtral models] Skip warm-up to skip confusing error message in warm-up ( #33576 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-03 07:22:34 -08:00
Shanshan Shen
5c4f2dd6ef
[MM] Pass prefix parameter to MMEncoderAttention ( #33674 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-02-03 06:47:41 -08:00
wang.yuqi
f3d8a34671
[Bugfix] Do not add extra \n for image-only cases when constructing multimodal text prompts. ( #33647 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-03 06:43:47 -08:00
shaharmor98
4bc913aeec
Feat/add nemotron nano v3 tests ( #33345 )
2026-02-03 08:52:49 -05:00
Kuntai Du
fbb3cf6981
[Bugfix][Async][Connector] avoid vllm-side double free during async scheduling + request abort + async KV cache transfer ( #33377 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2026-02-03 21:50:15 +08:00
Krish Gupta
2df2b3499d
Document NixlConnector backend selection via kv_connector_extra_config ( #33552 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-03 05:49:59 -08:00
Harry Mellor
2a8d84e66d
Fix Gemma3n audio encoder for Transformers v5 ( #33673 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 05:49:49 -08:00
zxy
a3acfa1071
[Models] Intern-S1-Pro ( #33636 )
...
Signed-off-by: zxy <zhou0493@e.ntu.edu.sg >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-03 05:49:45 -08:00
Harry Mellor
be8168ff88
Fix Gemma3 GGUF for Transformers v5 ( #33683 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 12:36:53 +00:00
Harry Mellor
f6af34626d
Fix offline test for Transformers v5 ( #33682 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-03 12:07:24 +00:00
Song Zhixin
ceab70c89d
[Bugfix] fix qwen3-asr response error ( #33644 )
...
Signed-off-by: jesse <szxfml@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-03 03:33:56 -08:00
Cyrus Leung
52683ccbe1
[Misc] Update default image format of encode_base64 ( #33656 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 03:13:16 -08:00
Michael Goin
e346e2d056
[Bugfix] Disable RoutingMethodType.[Renormalize,RenormalizeNaive] TRTLLM per-tensor FP8 MoE ( #33620 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-03 10:37:15 +00:00
Cyrus Leung
83449a5ff0
[Refactor] Clean up pooling serial utils ( #33665 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-03 10:29:18 +00:00
Lucas Hänke de Cansino
dad2d6a590
[Bugfix][Model] Fix DeepSeek-OCR-2 chat template to include BOS token ( #33642 )
...
Signed-off-by: l4b4r4b4b4 <lucas.cansino@mail.de >
2026-02-03 00:35:58 -08:00
Isotr0py
32e84fa1ff
[CI/Build] Investigate torchrun distributed tests hanging issue ( #33650 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-03 15:49:17 +08:00
Richard Zou
fd9c83d0e0
[torch.compile] Document the workaround to standalone_compile failing ( #33571 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-03 07:16:55 +00:00
杨朱 · Kiki
b95cc5014d
[Misc] Remove deprecated VLLM_ALL2ALL_BACKEND environment variable ( #33535 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-03 15:01:59 +08:00
Nick Hill
61397891ce
[Minor] Some code simplification in scheduler.py ( #33597 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-03 15:00:00 +08:00
杨朱 · Kiki
ef248ff740
[Misc] Remove deprecated profiler environment variables ( #33536 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-03 14:58:44 +08:00
Kunshang Ji
e10604480b
[XPU][1/N] Deprecate ipex and switch to vllm-xpu-kernels for xpu platform ( #33379 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-02 22:46:10 -08:00
Chauncey
bf001da4bf
[Bugfix] Interleaved thinking keeps compatibility with reasoning_content ( #33635 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Koushik Dutta <koushd@gmail.com >
2026-02-03 06:46:05 +00:00
杨朱 · Kiki
a0a984ac2e
[CI/Build] Remove hardcoded America/Los_Angeles timezone from Dockerfiles ( #33553 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-02-02 22:32:39 -08:00
Shengliang Xu
f1cb9b5544
Fix quantized Falcon-H1 model loading issues ( #32728 )
...
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-02 22:31:27 -08:00
Daniel Mescheder
4c4b6f7a97
[Frontend] Add sampling parameters to Responses API ( #32609 )
...
Signed-off-by: Daniel Mescheder <dmesch@amazon.com >
Co-authored-by: Daniel Mescheder <dmesch@amazon.com >
2026-02-03 13:51:10 +08:00
Roger Wang
10546f925a
[Bugfix] Fix mm budget setting for Qwen Omni models ( #33634 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-03 04:56:25 +00:00
Radu Salavat
e69c990c21
[Feature][CPU Backend]: Optimize ARM vectorization backend ( #30329 )
...
Signed-off-by: Radu Salavat <radu.salavat@arm.com >
2026-02-02 20:17:56 -08:00
Richard Zou
5eac9a1b34
[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding ( #33624 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-03 03:38:49 +00:00
Nathan Weinberg
1b60b45d0d
[CI/Build] add directions for CPU image upload to Docker Hub ( #32032 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
Signed-off-by: Nathan Weinberg <31703736+nathan-weinberg@users.noreply.github.com >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2026-02-03 02:48:06 +00:00
Dezhan
4b3803d180
[BugFix] DPMetadata raises assert error for dense model ( #32739 )
...
Co-authored-by: Dezhan Tu <dztu@meta.com >
2026-02-03 00:56:44 +00:00
Patrick von Platen
5019c59dd2
[Voxtral Realtime] Introduce global log mel max ( #33574 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 17:01:47 -05:00
Lain
089cd4f002
fix cutlass_3x_gemm_fp8_blockwise on sm103a ( #32224 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-02-02 11:47:46 -08:00
Vasiliy Kuznetsov
0130223bd9
fix memory for online fp8 quantization with streaming weight load ( #31914 )
...
Signed-off-by: vasiliy <vasiliy@fb.com >
2026-02-02 14:17:42 -05:00
Matthew Bonanni
5d1aef3004
[UX] Format attention backend log line ( #33570 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-02 18:57:12 +00:00
yugong333
ffe1fc7a28
Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. ( #32005 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
2026-02-02 12:30:06 -05:00
Harry Mellor
8b7346d5f1
Update huggingface-hub again ( #33567 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-02 09:20:54 -08:00
Harry Mellor
6141ebe0dd
Remove incorrect tokenizer info test ( #33565 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-02 17:11:44 +00:00
Yang Liu
199e3cb476
[Model] Use mm_position to compute mrope positions for GLM-4.xV ( #33039 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
2026-02-02 16:55:48 +00:00
Matthew Bonanni
9f8cb81b44
[CI] Add DeepSeek V3.2 nightly eval ( #33566 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-02 16:10:02 +00:00
Cyrus Leung
d7e17aaacd
[Refactor] Move profiling methods to MM budget ( #33559 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-02 23:27:00 +08:00
Kebe
528e9b1490
[Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series ( #33540 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Thomas Vegas <tvegas@nvidia.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2026-02-02 22:55:46 +08:00
shanjiaz
d95b4be47a
move spec decode slow test to test_areas.yaml ( #33365 )
...
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com >
2026-02-02 06:28:36 -08:00
Isotr0py
4061dcf4c5
[Bugfix] Enable Kimi k25 processor test ( #33562 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-02 14:25:25 +00:00
danielafrimi
0aca8b8c62
[MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) ( #32790 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
2026-02-02 09:18:50 -05:00
Rabi Mishra
9eb58f8cf1
fix[ROCm]: Remove unconditional aiter import ( #32902 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-02-02 22:10:02 +08:00
Cyrus Leung
b10d05b8a8
[Model] Use explicit types in get_generation_prompt ( #33551 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-02 12:38:49 +00:00
Borushiki
b398e5c819
Update get_expert_mapping to include self parameter ( #33525 )
...
Signed-off-by: Borushiki <38628261+Otsutsukii@users.noreply.github.com >
2026-02-02 20:29:07 +08:00
Grzegorz K. Karch
78061ef584
Fix accessing hidden_act from model config ( #32686 )
...
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com >
2026-02-02 11:11:33 +00:00
Nicolò Lucchesi
528b3076af
[CI][Bugfix] Fix flaky tests/v1/kv_connector/unit/test_multi_connector.py::test_multi_example_connector_consistency ( #33555 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-02 03:01:29 -08:00
Cyrus Leung
a502831d36
[Chore] Remove redundant input parsing methods ( #33542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-02 10:50:47 +00:00
Komal Kumar Teru
ba871fb788
[Misc] support arbitrary MM datasets in spec dec bench ( #33486 )
...
Signed-off-by: kkt-cohere <komal@cohere.com >
Signed-off-by: Komal Kumar Teru <162363718+kkt-cohere@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-02 08:49:48 +00:00
R3hankhan
ab374786c7
[CPU][IBM Z][Dockerfile] Fix IBM Z builds ( #33243 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-01 23:41:29 -08:00
RED
808dd87b30
[Model] Support DeepSeek-OCR-2 ( #33165 )
...
Signed-off-by: liuli <ll407707@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: liuli <ll407707@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-02 06:24:10 +00:00
Andy Lo
beb8899482
Fix mistral sliding window parsing ( #33521 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-02-02 05:08:04 +00:00
Sawyer Bowerman
ce88756b96
[Doc]: update paths for Offline/Online/Others example sections ( #33494 )
...
Signed-off-by: Sawyer Bowerman <sbowerma@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 03:56:53 +00:00
Paco Xu
a3154a6092
[Doc] add missing model entries in supported_models.md ( #33220 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-02-02 03:37:25 +00:00
jack
7c036432fc
[Bugfix] GLM-4 tool parser: incremental string streaming ( #33218 )
...
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com >
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com >
2026-02-02 11:13:31 +08:00
Robert Shaw
318b120766
[Nightly CI] Remove CT Model ( #33530 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-01 19:09:09 -08:00
csy0225
c3b40dc3e7
[Models] Step-3.5-Flash ( #33523 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: i-zhangmingming <i-zhangmingming@stepfun.com >
Co-authored-by: xiewuxun <xiewuxun@stepfun.com >
Co-authored-by: zetaohong <i-hongzetao@stepfun.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-02 10:21:18 +08:00
Yifan Qiao
a01ef3fa51
[Fix] prefix cache hit rate == 0 bug with gpt-oss style models ( #33524 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2026-02-02 01:59:58 +00:00
Runkai Tao
7320ca3942
Add unpermute-aware fused MoE LoRA path ( #32655 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-02 09:46:09 +08:00
Nick Hill
cf0a99f84d
[ModelRunner V2] Support spec decode with structured outputs ( #33374 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-02 00:19:59 +00:00
Nick Hill
e535d90deb
[ModelRunner V2] Misc minor simplifications and optimizations ( #33467 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-01 22:17:14 +00:00
Komal Kumar Teru
0b225fb7b2
[Misc] skip target model mm emb in draft proposal step when draft is text-only ( #33437 )
...
Signed-off-by: kkt-cohere <komal@cohere.com >
2026-02-01 21:13:35 +00:00
will b.
46b4a02794
Fix DeepSeek V2 RoPE initialization error ( #33501 )
...
Signed-off-by: Eduardo Salinas <edus@microsoft.com >
Signed-off-by: catswe <212922539+catswe@users.noreply.github.com >
Co-authored-by: Eduardo Salinas <edus@microsoft.com >
2026-02-01 21:00:56 +00:00
shaharmor98
8869cd8ec1
Add MoE config for Super B200 TP2 ( #33510 )
2026-02-01 18:48:37 +00:00
JartX
cd86fff38f
[BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling allow inference Omni on ROCM ( #33077 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-02-01 13:36:25 +00:00
Maral
b5f8c3092d
[W8A8 Block Linear Refactor][1/N] Keep all quantization types into QuantFP8 class. ( #33047 )
...
Signed-off-by: maral <maralbahari.98@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-01 09:28:01 +00:00
Cyrus Leung
21997f45b1
[Redo] #33110 with threading limit ( #33502 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: YunzhuLu <lucia.yunzhu@gmail.com >
2026-02-01 09:18:11 +00:00
Luka Govedič
672023877b
Change defaults for vllm bench startup ( #33489 )
2026-01-31 23:46:01 -08:00
Zack Yu
754a8ca942
fix: only include Authorization header when OPENAI_API_KEY is set ( #33488 )
...
Signed-off-by: zack041 <zackyu041@gmail.com >
2026-01-31 23:35:09 -08:00
Eduardo Salinas
302ecf64ff
[Models]: lfm2_siglip2 return intermediate encoder layers ( #33370 )
...
Signed-off-by: Eduardo Salinas <edus@microsoft.com >
2026-02-01 06:17:49 +00:00
Cyrus Leung
b6bb2842cf
[Critical] Revert #33110 ( #33500 )
2026-01-31 21:06:42 -08:00
Cyrus Leung
79b6ec6aab
[Bugfix] Fix inconsistent handling of cache reset ( #33481 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 20:23:41 -08:00
Greg Pereira
d6416fdde9
pin LMCache to v0.3.9 or greater with vLLM v0.15.0 ( #33440 )
...
Signed-off-by: greg pereira <grpereir@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-31 20:50:38 -07:00
Andreas Karatzas
0fb3157267
[ROCm][CI] Update huggingface-hub pin ( #33492 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-01 02:51:54 +00:00
Cyrus Leung
a358e4dffe
[Refactor] Make Renderer an abstract class ( #33479 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-01 10:36:30 +08:00
René Honig
079781177a
fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels ( #33417 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-01-31 14:06:42 -08:00
Roy Wang
63c0889416
[Misc] Fix flashinfer related tests ( #33462 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-01-31 16:10:24 -05:00
smashyalts
1e86c802d4
Fix grammar ( #33121 )
...
Signed-off-by: smashyalts <smashyalts@gmail.com >
2026-01-31 09:59:34 -08:00
linhaifeng
fedf64332e
[Bugfix]: Fix display errors in TORCH_CHECK messages ( #32942 )
...
Signed-off-by: linhaifeng <1371675203@qq.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-31 09:48:48 -08:00
Xiao Yang
2238a12c13
[Misc] support collect_env for endpoint /server_info ( #33246 )
...
Signed-off-by: yang.xiao <yang.xiao@daocloud.io >
2026-02-01 01:42:59 +08:00
Harry Mellor
ce0afe2451
Update huggingface-hub pin for the last time before Transformers v5 ( #33473 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-31 09:14:24 -08:00
Cyrus Leung
88c3e114d8
[Refactor] Move MM data parsing outside processor ( #33408 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 16:46:14 +00:00
Cyrus Leung
92924b2ddd
[Deprecation] Remove deprecated items related to pooling ( #33477 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 08:44:40 -08:00
YunzhuLu
27cb2f678f
[Bugfix] Early-reject requests with MM data longer than encode cache capacity ( #33110 )
...
Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-31 08:41:13 -08:00
jma99_2333
22d9a056d5
Support clear mm and encoder cache ( #33452 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-31 15:22:25 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
13b842f271
[BugFix][Router Replay] Capture Logical Experts with EPLB ( #33013 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2026-01-31 10:12:17 -05:00
Luka Govedič
15f40b20aa
[fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops ( #33441 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Richard Zou <zou3519@gmail.com >
2026-01-31 06:48:34 -08:00
Cyrus Leung
793af538a3
[Doc] Update plugin deprecation notices ( #33476 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 22:48:28 +08:00
cmunley1
6f5e7cda57
support return prompt token ids in responses ( #33378 )
2026-01-31 06:04:20 -08:00
Roy Wang
68feb76a6f
[Misc] Replace deprecated interface seed_everything ( #33474 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-01-31 05:38:39 -08:00
Cyrus Leung
4cb59dea6a
[Bugfix] Fix incompatibility between #33372 and #32863 ( #33475 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 05:21:32 -08:00
Angela Yi
608b556507
[ez] Add structured torch.compile logs ( #33213 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-31 21:00:54 +08:00
Cyrus Leung
f0a1c8453a
[Frontend] Use new Renderer for Completions and Tokenize API ( #32863 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-31 04:51:15 -08:00
caozuoba
8980001c93
[perf] v1/spec_decode: skip softmax for all-greedy rejection sampling ( #32852 )
...
Signed-off-by: hdj <1293066020@qq.com >
2026-01-31 09:51:26 +00:00
jennyyyyzhen
527bcd14d4
[ROCM] Enable aiter attn backend for qwen3-next model ( #32492 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
2026-01-31 17:03:57 +08:00
Jinwu
f68e3ea4e1
[BugFix] Add synchronize in CutlassW4A8LinearKernel to ensure data is ready for use. ( #33078 )
...
Co-authored-by: jinwuguo <jinwuguo@tencent.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-31 08:14:54 +00:00
Yanan Cao
d5c41db35b
[Kernel] [Helion] [3/N] Helion kernel registry ( #33203 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-31 15:38:46 +08:00
Fadi Arafeh
1618e25492
[CPU][Feat] Enable KleidiAI accelerated int4 dynamic quant with BF16 activations on Arm CPUs ( #33122 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-31 07:16:22 +00:00
AutumnAurelium
f3888aca83
Add EAGLE3 support for AFMoE ( #33111 )
...
Signed-off-by: AutumnAurelium <88015631+AutumnAurelium@users.noreply.github.com >
2026-01-31 06:53:08 +00:00
Dimitrios Bariamis
f0bca83ee4
Add support for Mistral Large 3 inference with Flashinfer MoE ( #33174 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-01-30 22:48:27 -08:00
Matthias Gehre
73419abfae
[Bugfix] Handle Asym W4A16 (ConchLinearKernel) for CT ( #33200 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-01-31 06:21:51 +00:00
Nicolò Lucchesi
e77f162cf5
[Bugfix] Fix Qwen3ASR language asr tag in output ( #33410 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-31 05:24:49 +00:00
Yanan Cao
8ecd213c0b
[Kernel] [Helion] [2/N] Helion kernel wrapper ( #32964 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-31 12:53:01 +08:00
Francesco Fusco
5b55c0bea7
[Attention] Clarify comment explaining attn_logits +1 dimension ( #33427 )
...
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com >
2026-01-31 04:50:30 +00:00
Patrick von Platen
15e0bb9c42
[Streaming -> Realtime] Rename all voxtral related classes, fn, files ( #33415 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-01-31 04:49:00 +00:00
Micah Williamson
6c64c41b4a
[ROCm][CI] Force max_num_seqs=1 on ROCm In test_sharded_state_loader to reduce flakiness ( #33277 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-31 12:28:29 +08:00
Russell Bryant
a2ef06e1b3
[Misc] offest -> offset in comments and variable names ( #33444 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-01-30 20:19:22 -08:00
Lucas Wilkinson
0a3c71e7e5
[BugFix] Fix whisper FA2 + full cudagraphs ( #33360 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-31 12:15:06 +08:00
Michael Goin
29fba76781
[UX] Use gguf repo_id:quant_type syntax for examples and docs ( #33371 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-31 12:14:54 +08:00
Isotr0py
9df152bbf6
[Misc] Align Qwen3-VL-embedding image example outputs with HF repo example ( #33419 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-30 19:36:56 -08:00
Nick Hill
876a16f4fb
[ModelRunner V2] Fix spec decoding + logprobs ( #33391 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-31 03:33:26 +00:00
Matthew Bonanni
aaa901ad55
[Attention] Move MLA forward from backend to layer ( #33284 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-30 19:30:00 -08:00
Wentao Ye
010ec0c30e
[Deprecation] Deprecate seed_everything and scatter_mm_placeholders in v0.15 ( #33362 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-31 02:54:16 +00:00
Alberto Ferrer
64a40a7ab4
[Bugfix] Fix typo in read_offset variable name ( #33426 )
...
Signed-off-by: Alberto Ferrer <albertof@barrahome.org >
2026-01-31 01:26:15 +00:00
Gregory Shtrasberg
31aedfe7d6
[Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 ( #33366 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-01-30 19:05:23 -06:00
Michael Goin
67ebaff528
Refactor NVFP4 Linear utils for ModelOpt and CT ( #33201 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-30 16:37:42 -08:00
Chendi.Xue
2b465570e6
[CI][HPU]accelerate hpu test by skip python re-install and clean container name ( #33286 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-01-30 21:36:29 +00:00
Huy Do
9ca66ecc10
Indicate compile mode in the benchmark results ( #32990 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-01-30 15:34:36 -05:00
Pavani Majety
c3a9752b0c
[Hardware][SM100] Add TRTLLM Kernel for INT4 W4A16 Kernel. ( #32437 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-01-30 10:30:46 -08:00
xuebwang-amd
f451b4558b
[Quantization][ROCm] Fix MoE weight loading to be robust (Qwen3_MoE/Qwen3_next as example models) ( #33173 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-01-30 17:50:23 +00:00
Vasiliy Kuznetsov
3f96fcf646
fix QERL attention import path ( #33432 )
...
Signed-off-by: vasiliy <vasiliy@fb.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-30 09:29:09 -08:00
Yanan Cao
6c1f9e4c18
[Kernel] [Helion] [1/N] Add Helion ConfigManager ( #32740 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-30 12:19:19 -05:00
Harry Mellor
67239c4c42
Fix encoder-decoder model disabling mm processor cache ( #33236 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 16:30:10 +00:00
Nicolò Lucchesi
8ece60768f
[CI] Qwen3-ASR transcriptions tests ( #33414 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-30 16:17:56 +00:00
Michael Goin
fd0e377244
Support FP8 block quant for CompressedTensorsW8A16Fp8 ( #33280 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-30 11:15:20 -05:00
Kyle Sayers
f857a03f6b
[QeRL] Layerwise Reloading ( #32133 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-01-30 08:50:05 -07:00
Danielle Robinson
74898a7015
[BugFix][LoRA] TritonExperts is ModularMoEPath for FP8 models ( #33393 )
...
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com >
Co-authored-by: Danielle Robinson <dmmaddix@amazon.com >
2026-01-30 15:27:42 +00:00
Frank Wang
8f5d51203b
Disable Cascade Attention for Batch Invariance ( #32561 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-30 10:00:46 -05:00
Julien Denize
ae5b7aff2b
Improve Mistral format checks. ( #33253 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-30 06:23:33 -08:00
Harry Mellor
a11bc12d53
Fix test_moe.py for Transformers v5 ( #33413 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 14:03:25 +00:00
Nathan Weinberg
58cb55e4de
[Doc] Enhance documentation around CPU container images ( #32286 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
2026-01-30 13:36:20 +00:00
杨朱 · Kiki
cf896ae0e3
[Misc] Clean up HIDDEN_DEPRECATED_METRICS after metric removal ( #33323 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-30 13:31:17 +00:00
Harry Mellor
c5113f60f2
Remove deprecated reasoning_content message field ( #33402 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 11:48:15 +00:00
vllmellm
174f16700b
[Doc] [ROCm] Update Documentation to reflect v0.15.0 release ( #33388 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-30 19:06:08 +08:00
Julien Denize
8e2ad97ad0
[BUGFIX] Pixtral cannot be loaded with --limit-mm-per-prompt 0 ( #33406 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-01-30 02:52:02 -08:00
Patrick von Platen
10152d2194
[Realtime API] Adds minimal realtime API based on websockets ( #33187 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-30 18:41:29 +08:00
杨朱 · Kiki
1a7894dbdf
[Misc] Replace Optional[X] with X | None syntax ( #33332 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-30 01:56:59 -08:00
Cyrus Leung
c87eac18f7
[Refactor] Move MM item count validation outside of processor ( #33396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-30 09:27:31 +00:00
tianshu-Michael-yu
f45870b53f
fix: allow LFM2 MoE prefix caching (align) ( #33376 )
...
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
2026-01-30 08:23:14 +00:00
hujiaxin0
ba45bedfd1
[model] Add support for openPangu7B-VL ( #32449 )
...
Signed-off-by: hujiaxin <524446785@qq.com >
Signed-off-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
2026-01-30 15:54:27 +08:00
Harry Mellor
9432ed8c7e
Explicitly set return_dict for apply_chat_template ( #33372 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 07:27:04 +00:00
Lucas Kabela
726d89720c
[CI] Enable mypy import following for vllm/spec_decode ( #33282 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-30 06:43:32 +00:00
Harry Mellor
d334dd26c4
Move decode context parallel validation to ParallelConfig ( #33239 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 06:18:41 +00:00
Ryan Rock
070c811d6f
[CI][AMD] Skip 4 GPUs testgroup ray tests ( #33305 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-01-29 21:39:53 -08:00
Isotr0py
8bfc8d5600
[Models] Refactor Kimi-K2.5 weight loading ( #33346 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-30 05:31:20 +00:00
Harry Huang
ec51831a22
[BugFix] Disable async scheduling for Mamba prefix caching ( #33352 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-01-30 04:40:19 +00:00
Harry Mellor
80b918f2bd
Fix tie_word_embeddings for multimodal models in Transformers v5 ( #33359 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-30 03:37:39 +00:00
Wang Haoyu
c46b0cd0af
[Model][Multimodal] Add explicit MusicFlamingo adapter ( #32696 )
...
Signed-off-by: WangHaoyuuu <mailwhaoyu@gmail.com >
2026-01-30 11:01:29 +08:00
Aidan Reilly
133765760b
[Docs] Adding links and intro to Speculators and LLM Compressor ( #32849 )
...
Signed-off-by: Aidan Reilly <aireilly@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-29 14:12:35 -08:00
Michael Goin
bfb9bdaf3f
[Bugfix] Enable Triton MoE for FP8 per-tensor dynamic ( #33300 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-29 12:15:17 -08:00
Kevin H. Luu
2284461d02
[release] Minor fixes to release annotation and wheel upload ( #33129 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-29 12:09:35 -08:00
danisereb
8e2a469b3b
Add Triton fused MoE config for B200 (Nemotron Nano) ( #32804 )
2026-01-29 19:21:33 +00:00
CarstyYou
23591e631e
[Bugfix][Kernel] Fix negative memory offset in GDN Triton kernel ( #33326 )
...
Signed-off-by: CarstyYou <186021327+CarstyYou@users.noreply.github.com >
2026-01-29 10:40:11 -08:00
Linda
0493d897c4
[NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe ( #32954 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
2026-01-29 10:00:13 -08:00
Chendi.Xue
8c8ebeb941
[BUGFIX][XPU] fix memory check after XPU reuse GPU_worker ( #33358 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-01-29 09:56:30 -08:00
Cyrus Leung
831453fcef
[Chore] Move MediaConnector to vllm.multimodal.media ( #33324 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-29 16:54:31 +00:00
Angela Yi
5a66c9cc76
[ez] Delete torch25_custom_graph_pass ( #33287 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-29 16:47:05 +00:00
Isotr0py
5e73e4900c
[Bugfix] Fix broken GLM-OCR initialization ( #33350 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-29 07:56:05 -08:00
Cyrus Leung
c6e7404cc5
[Multimodal] Simplify MM input definitions ( #33331 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-29 13:32:04 +00:00
sthWrong
17b17c0684
[Backport] [Kimi-K2.5] Replace torch.cuda with current_platform for d… ( #33320 )
2026-01-29 12:29:17 +00:00
Kunshang Ji
8bb6271c77
[Intel GPU] refine xpu worker ( #32894 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-01-29 12:26:52 +00:00
Roger Wang
8b3f0a99dd
[Models] Qwen3-ASR ( #33312 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-29 19:27:15 +08:00
Li, Jiang
8311f083bd
[Bugfix][CPU] Fix thread num for shared memory communication ( #33317 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 03:26:58 -08:00
Patrick von Platen
40c35038d2
[Voxtral] Streaming example ( #33042 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-29 03:22:49 -08:00
zofia
a5aa4d5c0f
[Quantization][Refactor] use platform dict to choose kernel ( #33130 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
Signed-off-by: zofia <110436990+zufangzhu@users.noreply.github.com >
2026-01-29 10:44:58 +00:00
andrii.pasternak
615e8033e5
[Bug Fix] Handle variable-length tensors in MultiModalFlatField batching ( #31751 )
...
Signed-off-by: Andrii Pasternak <andriipasternak31@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-29 10:42:59 +00:00
Ilya Markov
d09135fbd0
[BugFix] Async Eplb fix potential race condition ( #32881 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-01-29 10:31:40 +00:00
daniel-salib
8688c3d460
[fix] test mcp_tool_calling_streaming with a more complex math question ( #32769 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-01-29 10:25:58 +00:00
Isotr0py
5400014d55
[Chore] Remove use_data_parallel kwargs from ViT implementation ( #33310 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-29 10:20:52 +00:00
Isotr0py
3a92c6f3b5
[Misc] Cleanup Kimi-K2.5's vision chunk modality entrypoints ( #33157 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-29 09:46:02 +00:00
amirkl94
e01ff5c070
Bugfix: Pass router logits dtype in nemotron shared experts ( #32669 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2026-01-29 09:36:34 +00:00
Harry Mellor
fb946a7f89
Make mypy opt-out instead of opt-in ( #33205 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-29 09:12:26 +00:00
Lucas Wilkinson
a650ad1588
[Misc] Remove missed pad_for_cudagraph ( #33283 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-29 09:12:05 +00:00
graftim
d697581a7c
[Doc] Update outdated link to Ray documentation ( #32660 )
...
Signed-off-by: graftim <38649219+graftim@users.noreply.github.com >
2026-01-29 00:56:06 -08:00
shanjiaz
5eeba80c74
Adding optional speculator tests for larger models ( #32943 )
...
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com >
2026-01-29 16:54:02 +08:00
whx
08b1195e62
[PluggableLayer][2/N] Apply PluggableLayer to linear layers ( #33152 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-01-29 16:53:15 +08:00
cmunley1
3bba2edb0f
support returning tokenids in responses api ( #33212 )
...
Signed-off-by: Christian Munley <cmunley@nvidia.com >
2026-01-29 16:52:39 +08:00
Ilya Markov
53fc166402
[BugFix] Fix EPLB fail for MoeFP4 model with Marlin backend ( #33262 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-01-29 16:52:11 +08:00
Didier Durand
31b25f6516
[Doc]: fixing multiple typos in diverse files ( #33256 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Didier Durand <2927957+didier-durand@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 16:52:03 +08:00
wang.yuqi
abb34ac43a
[Bugfix] Fix Qwen3-VL-Reranker load. ( #33298 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-29 08:42:53 +00:00
Pengchao Wang
2515bbd027
[CI/Build][BugFix] fix cuda/compat loading order issue in docker build ( #33116 )
...
Signed-off-by: Pengchao Wang <wpc@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2026-01-29 00:19:05 -08:00
TJian
c487a8eef4
[Release] [ROCm] Remove old build step ( #33316 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-28 23:35:51 -08:00
Kiersten Stokes
9e138cb01d
[Misc][Build] Lazy load cv2 in nemotron_parse.py ( #33189 )
...
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com >
2026-01-29 06:55:50 +00:00
TJian
f9d03599ef
[Release] [CI] Optim release pipeline ( #33156 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-28 22:45:42 -08:00
wangln19
39037d258e
Fix tool call indexing double-counting ( #33141 )
...
Signed-off-by: wanglinian <wanglinian@stu.pku.edu.cn >
2026-01-29 05:57:09 +00:00
Cyrus Leung
51550179fc
[Refactor] Define MM data parser in processing info instead of processor itself ( #33260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-29 13:55:17 +08:00
Angela Yi
07ea184f00
[ez] Delete more torch version checks <= 2.8 ( #33288 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-29 05:28:46 +00:00
Or Ozeri
a663b218ae
[Misc] Add orozery to CODEOWNERS (core, kv_transfer, kv_offload) ( #33227 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-29 04:24:20 +00:00
Michael Goin
1bd47d6e5a
[Bugfix] Register fp8 cutlass_group_gemm as supported for only SM90+SM100 ( #33285 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-28 18:40:59 -08:00
Michael Goin
141cd43967
[UX] Remove noisy CT UnquantizedLinearMethod warn ( #33273 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-28 16:09:30 -08:00
Nick Hill
6bf3b46d78
[ModelRunner V2] Misc code simplification and cleanup ( #33266 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-28 14:41:23 -08:00
Matthew Bonanni
77c4f45c6c
[7/N][Attention][Docs] Add documentation for attention backends ( #32477 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-28 17:20:22 -05:00
Michael Goin
ca1969186d
[UX] Enable nested configs in config yaml files ( #33193 )
2026-01-28 16:54:25 -05:00
Gregory Shtrasberg
ab597c869a
[Bugfix] Add missing encoder only guard for do_kv_cache_update ( #33269 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-01-28 21:25:07 +00:00
Angela Yi
4197168ea5
[ez] Remove checks for torch version <= 2.8 ( #33209 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-28 16:03:56 -05:00
Rohan Potdar
59bcc5b6f2
Use aiter triton fused_add_rmsnorm_pad for gpt-oss ( #30976 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-01-28 20:47:47 +00:00
Wentao Ye
3e440786af
[Feature] Fully support for async scheduling + PP, 30.8% E2E throughput improvement, 31.8% TPOT improvement ( #32618 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-28 20:30:32 +00:00
Kevin H. Luu
8bdd3979d8
[CI] Change GPU key to device key for B200 test ( #33275 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-28 19:14:29 +00:00
Wentao Ye
c4e744dbd4
[Perf] Optimize moe_permute for CUTLASS FP8 ( #32892 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-28 10:15:24 -08:00
Nicolò Lucchesi
8ebf372e9d
[CI] Whisper tests enforce_eager=False ( #33098 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-28 09:36:56 -08:00
cwazai
f210f0b7b1
[lora/moe] Avoid extra intermediate buffer & Python slicing in expand phase when split_k == 1 ( #32774 )
...
Signed-off-by: 陈建华 <1647430658@qq.com >
2026-01-29 00:22:45 +08:00
Bin Bao
392c5af4fe
[Benchmark] Add startup benchmarking to buildkite run ( #33183 )
...
Signed-off-by: Bin Bao <binbao@meta.com >
2026-01-28 16:03:07 +00:00
Robert Shaw
af9b69f977
[Quantization][Deprecation] Remove Marlin 24 ( #32688 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-28 15:54:59 +00:00
Chauncey
8e5e40daf4
[Misc] Provide a DeepSeek ReasoningParser with thinking enabled by default ( #33221 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-28 21:16:53 +08:00
Or Ozeri
2e8de86777
Revert "Enable Cross layers KV cache layout at NIXL Connector ( #30207 )" ( #33241 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-01-28 04:36:00 -08:00
Robert Shaw
247d1a32ea
[Quantization][Deprecation] Remove BitBlas ( #32683 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-28 11:06:22 +00:00
Kevin H. Luu
ecb4f82209
[CI] Update job dependency syntax for Intel and AMD jobs ( #33240 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-28 01:33:59 -08:00
Kevin H. Luu
5914090765
[CI] Update job dependency for hardware and CPU jobs ( #33237 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-28 01:10:05 -08:00
Harry Mellor
f1acbd68c5
[CI] Enable mypy import following for vllm/compilation ( #33199 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-28 08:59:54 +00:00
Yan Ma
9581185d51
[XPU]disable test_acceptance_length UT ( #33226 )
2026-01-28 15:24:13 +08:00
Maryam Tahhan
2dd359f953
[Docs] Simplify CPU x86 Docker build documentation ( #33071 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
2026-01-28 06:37:09 +00:00
Gregory Shtrasberg
22ad649501
[ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends ( #33106 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-01-28 14:36:14 +08:00
ramos
36d450e3b8
Adds FunAudioChat multimodal audio model support ( #2 ) ( #33058 )
...
Signed-off-by: ramos <49182011+nemoramo@users.noreply.github.com >
Signed-off-by: mayufeng <mayufeng@example.com >
Co-authored-by: mayufeng <mayufeng@example.com >
2026-01-28 05:18:09 +00:00
22quinn
a2b877df6c
[Bugfix] Lazy import NgramProposer in GPU model runner ( #32821 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2026-01-27 21:07:16 -08:00
Harry Mellor
35fb0b8613
Don't use min_pixels/max_pixels from Qwen2VL's processor ( #33208 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-28 05:02:08 +00:00
Harry Mellor
2eb673a088
Add flake8-implicit-str-concat rules to Ruff ( #33191 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-28 04:56:10 +00:00
Jeffrey Wang
a97b5e206d
Relax protobuf library version constraints ( #33202 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-01-28 04:15:53 +00:00
Micah Williamson
911b51b69f
[ROCm][CI] Add TORCH_NCCL_BLOCKING_WAIT For Distributed Tests (A100) ( #32891 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-28 11:32:31 +08:00
Xinan Miao
604e3b87e8
[Feature]: Container image WORKDIR consistency ( #33159 )
...
Signed-off-by: SouthWest7 <am1ao@qq.com >
Co-authored-by: SouthWest7 <am1ao@qq.com >
2026-01-28 11:06:48 +08:00
Harry Mellor
706f123b23
[Docs] Use definition lists for CLI reference docs ( #33186 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Ashwin Phadke <23502062+ashwin-phadke@users.noreply.github.com >
2026-01-28 02:22:48 +00:00
Angela Yi
fb7abfc1d0
[docs] Improve tlparse section ( #33211 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-28 02:07:37 +00:00
Kevin H. Luu
5d3d6e44e8
[CI] minor fixes to pipeline generator and tests ( #33151 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-01-27 17:04:02 -08:00
Woosuk Kwon
46ec6d71c7
[Model Runner V2] Use a different stream for grammar bitmask h2d copy ( #33059 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-01-27 16:37:43 -08:00
Matthew Bonanni
e82fa448c4
Add attention benchmarking tools ( #26835 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-01-28 00:09:20 +00:00
Richard Zou
d9aa39a3bb
[torch.compile] Speed up MOE handling in forward_context ( #33184 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-27 15:17:54 -08:00
Wentao Ye
3a6d5cbefd
[Perf] Optimize dcp allocate tensor ( #33102 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-27 17:24:41 -05:00
linhaifeng
f5d7049cc1
[Bugfix] Fix display error (inconsistent with context) ( #33020 )
...
Signed-off-by: linhaifeng <1371675203@qq.com >
2026-01-27 20:33:29 +00:00
Alexei-V-Ivanov-AMD
3c3c547ce0
Enabling "2 node" distributed tests in the AMD CI pipeline. ( #32719 )
...
Signed-off-by: DCCS-4560 <alivanov@chi-mi325x-pod1-112.ord.vultr.cpe.ice.amd.com >
Co-authored-by: DCCS-4560 <alivanov@chi-mi325x-pod1-112.ord.vultr.cpe.ice.amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-01-27 19:13:21 +00:00
Matthew Bonanni
1cbccb6dba
[Attention] Use has_flashinfer helper ( #33177 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-27 18:33:17 +00:00
Iris
bd92089d33
feature: support eagle3 for HunyuanVL & Hunyuan ( #33035 )
...
Signed-off-by: irisliu10 <601012173@qq.com >
Signed-off-by: Iris <38269816+irisliu10@users.noreply.github.com >
2026-01-27 17:55:48 +00:00
Karan Bansal
a6760f1525
[Doc] Improve serve parameter documentation with meaningful defaults ( #33082 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-27 09:19:37 -08:00
IriKa
66e601ef79
Support compress-tensors with nvfp4 or fp8 weights and modelopt with nvfp4 weights on Turing ( #33076 )
...
Signed-off-by: IriKa Qiu <qiujie.jq@gmail.com >
2026-01-27 11:04:05 -05:00
Nick Hill
0cd259b2d8
[BugFix] Fix P/D with non-MoE DP ( #33037 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-27 08:03:47 -08:00
danielafrimi
83fb2d09e8
Support heterogeneous NemotronHPuzzle model ( #32549 )
...
Signed-off-by: <dafrimi@nvidia.com >
Signed-off-by: Daniel Afrimi <dafrimi@nvidia.com >
Signed-off-by: root <dafrimi@nvidia.com >
2026-01-27 10:55:54 -05:00
danisereb
f3a5ee705f
[LoRA][Spec Decode] Support LoRA for Nemotron-H MTP models ( #32265 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-27 07:53:26 -08:00
wang.yuqi
7cbbca9aaa
[Frontend] Cleanup api server ( #33158 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
2026-01-27 15:18:10 +00:00
omkhalil
5ec44056f7
[Metrics][MFU] Fix UnembedMetrics FLOP overcounting for prefill ( #33045 ) ( #33045 )
...
Fix UnembedMetrics to correctly count FLOPs for the unembedding (LM head) layer.
The bug: UnembedMetrics used total_num_tokens(), which counts all tokens in the
batch, to compute projection FLOPs. In the autoregressive use case, the vocab
projection runs on just the last token.
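The overcount described above can be illustrated with a small sketch. It is not the vLLM implementation: the helper `unembed_flops`, the batch shape, and the model sizes are hypothetical, and it assumes the usual convention of 2 FLOPs (one multiply, one add) per weight per projected token, with one projected token per request during prefill.

```python
def unembed_flops(num_tokens: int, hidden_size: int, vocab_size: int) -> int:
    """FLOPs of a dense [num_tokens, hidden] x [hidden, vocab] projection.

    Assumes 2 FLOPs (multiply + add) per weight per projected token.
    """
    return 2 * num_tokens * hidden_size * vocab_size


# Hypothetical prefill batch: two requests with 1000 and 24 prompt tokens.
prompt_lengths = [1000, 24]
hidden_size, vocab_size = 4096, 32000  # illustrative sizes only

# Overcount: every token in the batch is charged for the LM head.
overcounted = unembed_flops(sum(prompt_lengths), hidden_size, vocab_size)

# Corrected: only the last token of each request reaches the LM head
# (assumption: one logit position per request in prefill).
corrected = unembed_flops(len(prompt_lengths), hidden_size, vocab_size)

# Ratio of the two counts; equals total tokens / number of requests.
overcount_factor = overcounted // corrected
```

Under these assumptions the error grows with prompt length, so prefill-heavy workloads would see the largest inflation in reported FLOPs.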
Co-authored-by: Omar Mohamed Khalil <omarkhalil@meta.com >
2026-01-27 15:16:49 +00:00
Nicolò Lucchesi
492a7983dd
[Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 ( #33090 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-27 15:03:20 +00:00
Matthew Bonanni
a608b4c6c2
[5/N][Attention] Finish eliminating vllm/attention folder ( #32064 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-27 10:02:51 -05:00
Nicolò Lucchesi
1f3a2c2944
[Bugfix] Disable CG for Whisper+FA2 ( #33164 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-27 21:46:51 +08:00
omerpaz95
7227d06156
[Metrics] [KVConnector] Add Offloading Connector metrics ( #27942 )
...
Added query and hit metrics for the Offloading Connector.
Also added timing metrics for store and load operations, which report the
average per-token time taken to load/store.
The metrics are available from Prometheus and from the StatLogger.
Signed-off-by: omerpaz95 <omerpaz95@gmail.com >
Co-authored-by: Omer Paz <Omer.Paz@ibm.com >
2026-01-27 13:34:49 +00:00
Harry Mellor
14385c80fc
Fix weight mapping test for Transformers v5 ( #33162 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-01-27 12:30:14 +00:00
wang.yuqi
76139d0801
[Frontend] Frontend will only attach supported tasks corresponding entrypoints. ( #33139 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-27 12:15:43 +00:00
Lifan Shen
da8d0c441a
[AMD][QWEN3-NEXT] FP8 Tunings ( #32042 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2026-01-27 09:34:13 +00:00
rasmith
58996f3589
[AMD][Kernel][BugFix] Use correct scale in concat_and_cache_ds_mla_kernel when on gfx942 ( #32976 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2026-01-27 07:16:43 +00:00
Roger Wang
b539f988e1
[Models] Kimi-K2.5 ( #33131 )
...
Signed-off-by: wanglinian <wanglinian@stu.pku.edu.cn >
Signed-off-by: wangln19 <96399074+wangln19@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: wanglinian <wanglinian@stu.pku.edu.cn >
Co-authored-by: wangln19 <96399074+wangln19@users.noreply.github.com >
Co-authored-by: Zaida Zhou <58739961+zhouzaida@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-27 14:50:31 +08:00
Andreas Karatzas
6c00645712
[CI][Pooling] Stabilize ModernBERT test ( #32909 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-27 05:26:48 +00:00
Ning Xie
b781eeaa15
[code clean] remove duplicate code ( #33135 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-27 04:57:16 +00:00
Cyrus Leung
e0b005d9cf
[Frontend] Cleanup serving engine ( #33103 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 20:47:26 -08:00
Richard Zou
3b8f0fe59e
[torch.compile] Stop assuming 32 bit indexing ( #33113 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-27 04:25:02 +00:00
Cyrus Leung
c831911be2
[Frontend] Reduce mixin usage in serving pooling ( #33101 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-27 11:50:37 +08:00
Paco Xu
157caf511b
[Perf] avoid duplicate mem_get_info() call in get_current_memory_usage ( #33064 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-01-27 03:45:45 +00:00
Vincent Gimenes
0b53bec60b
[DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled ( #33109 )
...
Signed-off-by: Vincent Gimenes <147169146+VincentG1234@users.noreply.github.com >
2026-01-27 03:05:02 +00:00
Strahinja Stamenkovic
c568581ff3
Fix IndexError with encoder-decoder models when using Custom Paged Attention ( #33112 )
...
Signed-off-by: sstamenk <strahinja.stamenkovic@amd.com >
2026-01-27 10:33:37 +08:00
wangln19
2d7053438a
fix: preserve native tool call ID in multi-turn tool calling ( #32768 )
...
Signed-off-by: wanglinian <wanglinian@stu.pku.edu.cn >
Signed-off-by: wangln19 <96399074+wangln19@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Isotr0py <2037008807@qq.com >
2026-01-27 10:22:35 +08:00
Robert Shaw
5a93b9162b
[MoE Refactor] Integrate Naive Prepare Finalize into MK ( #32567 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: amirkl94 <203507526+amirkl94@users.noreply.github.com >
2026-01-27 01:28:02 +00:00
Woosuk Kwon
6d86fde09c
[Model Runner V2] Remove UvaBufferPool for cpu->gpu copy ( #33055 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-01-26 16:47:35 -08:00
XiongfeiWei
510ed1e8d3
[Bugfix][TPU] Return a Default fp8 MoE Backend ( #32908 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-26 18:46:11 -05:00
Pengchao Wang
8caffd92df
[Bugfix][MXFP4] Call trtllm_fp4_block_scale_moe with kwargs ( #33104 )
...
Signed-off-by: Pengchao Wang <wpc@fb.com >
2026-01-26 15:13:18 -08:00
dolpm
58a05b0ca1
[fix] CPUDNNLGEMMHandler pointer baked into inductor artifact ( #32913 )
...
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
2026-01-26 16:59:44 -05:00
Jared Wen
6ee7f18f33
[Logging] add --disable-access-log-for-endpoints CLI option ( #30011 )
...
Add a new CLI option --disable-access-log-for-endpoints to suppress
uvicorn access logs for specified endpoints (e.g., /health, /metrics, /ping).
This addresses the need to reduce log noise in production environments
where health check endpoints are frequently polled by load balancers or
monitoring systems, generating excessive log entries that obscure
meaningful request logs.
Fixes #29982
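The general technique of suppressing access logs for selected endpoints can be sketched with a standard `logging.Filter`. This is only an illustration, not the code added by this commit: the class `EndpointAccessLogFilter` and the helper `install_access_log_filter` are hypothetical names, and the sketch assumes that uvicorn emits access records on the `uvicorn.access` logger with `record.args` shaped as `(client_addr, method, full_path, http_version, status_code)`.

```python
import logging
from collections.abc import Iterable


class EndpointAccessLogFilter(logging.Filter):
    """Drop access-log records whose request path is in an exclusion set."""

    def __init__(self, excluded_paths: Iterable[str]) -> None:
        super().__init__()
        self.excluded_paths = frozenset(excluded_paths)

    def filter(self, record: logging.LogRecord) -> bool:
        args = record.args
        # Assumption: args = (client_addr, method, full_path, http_version, status).
        if isinstance(args, tuple) and len(args) >= 3:
            # Strip any query string before comparing the path.
            path = str(args[2]).split("?", 1)[0]
            return path not in self.excluded_paths
        # Records with an unexpected shape are kept rather than silently dropped.
        return True


def install_access_log_filter(excluded_paths: Iterable[str]) -> EndpointAccessLogFilter:
    """Attach the filter to the (assumed) uvicorn access logger and return it."""
    access_filter = EndpointAccessLogFilter(excluded_paths)
    logging.getLogger("uvicorn.access").addFilter(access_filter)
    return access_filter
```

Calling `install_access_log_filter(["/health", "/metrics", "/ping"])` before the server starts would silence those paths while leaving other request logs intact.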
Signed-off-by: JaredforReal <w13431838023@gmail.com >
2026-01-26 21:49:03 +00:00
Wentao Ye
8f987883cb
[Refactor] Remove unused _moe_permute function ( #33108 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-26 16:06:45 -05:00
Kevin H. Luu
ebe0ba91db
[ci] Sync test areas with test-pipeline.yaml and enable new pipeline generator ( #33080 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: Kevin Luu <khluu@Kevins-MacBook-Pro.local >
2026-01-26 12:28:20 -08:00
Robert Shaw
43a013c3a2
[Bugfix] Fix Dtypes for Pynccl Wrapper ( #33030 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-26 20:09:32 +00:00
Cyrus Leung
c25dbee40d
[Model] Bump transformers version for test registry ( #33100 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 18:53:22 +00:00
Nicolò Lucchesi
19ab0f7ce5
[Bugfix] Fix Voxtral streaming slot_mapping ( #33073 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-26 10:40:40 -08:00
danielafrimi
67fe677c53
[FIX] Always support TP > 4 for FP4 Gemm ( #31099 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
Co-authored-by: root <root@gpu-51.slurm-workers-slurm.slurm.svc.cluster.local >
2026-01-26 11:04:20 -07:00
Andy Lo
d56afd45fd
Remove unused logic in models/mistral.py ( #33095 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-01-26 09:01:52 -08:00
Chauncey
a2393ed496
[CI] Fix AssertionError: MCP tool call not found in output_messages ( #33093 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-26 15:19:57 +00:00
Pleaplusone
be6931ee27
[ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp ( #33018 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-01-26 23:19:04 +08:00
Chauncey
9ef3b718d9
[Bugfix] Fix Can't instantiate abstract class DeepseekV32IndexerBackend ( #33052 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-26 06:44:02 -08:00
Yuxuan Zhang
bb17e8f11c
[GLM-OCR] GLM-OCR with MTP Support ( #33005 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-26 06:24:43 -08:00
Cyrus Leung
dcd80206b7
[Chore] Update type annotation of input_ids in model forward ( #33063 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 06:02:10 -08:00
danisereb
f4a0921c9c
[Performance] Tune Mamba selective scan kernel for B200 ( #32873 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-26 05:56:54 -08:00
VihaanThat
208c56256f
[Feature] Add LoRA support for Gemma3 vision components ( #32764 )
2026-01-26 13:56:40 +00:00
Alex Brooks
9ac818a551
[Misc] HF Hub LoRA Resolver ( #20320 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2026-01-26 13:56:32 +00:00
Itay Etelis
6ca2c91b96
[Model] Use mm_position to compute mrope positions for Qwen3-Omni ( #33010 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
2026-01-26 13:48:07 +00:00
cwazai
e33192b269
[lora/moe] Improve fused MoE‑LoRA kernel indexing and memory access ( #32770 )
...
Signed-off-by: 陈建华 <1647430658@qq.com >
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Signed-off-by: kimheesu <wlskaka4@gmail.com >
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: ganyi <ygan@amd.com >
Signed-off-by: whx-sjtu <2952154980@qq.com >
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com >
Signed-off-by: Ifta khairul Alam Adil <25082512+ikaadil@users.noreply.github.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Kebe <mail@kebe7jun.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Alex Sun <alex.s@amd.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Liran Schour <lirans@il.ibm.com >
Signed-off-by: liranschour <liranschour@users.noreply.github.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Signed-off-by: Richard Zou <zou3519@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: AuYang <459461160@qq.com >
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com >
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: RishabhSaini <rishabhsaini01@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: Karan Bansal <karanb192@gmail.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
Signed-off-by: raushan <raushan@huggingface.co >
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: sangbumlikeagod <oironese@naver.com >
Signed-off-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com >
Signed-off-by: Matteo Fari <matteofari06@gmail.com >
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com >
Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com >
Signed-off-by: marksverdhei <marksverdhei@hotmail.com >
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Randall Smith <ransmith@amd.com >
Signed-off-by: jon <joninco@bullpoint.org >
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Luka Govedič <luka.govedic@gmail.com >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
Signed-off-by: esmeetu <jasonailu87@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Reagan <reaganjlee@gmail.com >
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Signed-off-by: Hongjian Zhang <zhanghongjian@xiaohongshu.com >
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Signed-off-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com >
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Signed-off-by: Joshua Deng <joshuakdeng@gmail.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: cwazai <38356712+cwazai@users.noreply.github.com >
Co-authored-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Kim Hee Su <wlskaka4@gmail.com >
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Pleaplusone <ygan@amd.com >
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com >
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: danisereb <daserebrenik@nvidia.com >
Co-authored-by: Yanan Cao <gmagogsfm@users.noreply.github.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Matt <156021403+mawong-amd@users.noreply.github.com >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Lucain <lucainp@gmail.com >
Co-authored-by: Ifta khairul Alam Adil <25082512+ikaadil@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: Andreas Karatzas <akaratza@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Kebe <mail@kebe7jun.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Alex Sun <minchsun@amd.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: liranschour <liranschour@users.noreply.github.com >
Co-authored-by: Or Ozeri <or@ozery.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Shengqi Chen <harry-chen@outlook.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
Co-authored-by: Xu Jinyang <72930776+AuYang261@users.noreply.github.com >
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: David Ramon Prados <davidramon3@hotmail.es >
Co-authored-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com >
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com >
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com >
Co-authored-by: Rishabh Saini <rishabhsaini01@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Karan Bansal <karanb192@users.noreply.github.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: tianshu-Michael-yu <101950379+tianshu-Michael-yu@users.noreply.github.com >
Co-authored-by: Raushan Turganbay <raushan@huggingface.co >
Co-authored-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com >
Co-authored-by: Matteo Fari <matteofari06@gmail.com >
Co-authored-by: Harry Huang <vastrockhuang162@gmail.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Orion Reblitz-Richardson <orionr@gmail.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
Co-authored-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: rasmith <Randall.Smith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
Co-authored-by: joninco <joninco@bullpoint.org >
Co-authored-by: dolpm <34420038+dolpm@users.noreply.github.com >
Co-authored-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Luka Govedič <luka.govedic@gmail.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: monajafi-amd <mohammad.najafi@amd.com >
Co-authored-by: ruizcrp <ruiz.crp@gmail.com >
Co-authored-by: Shengqi Chen <i@harrychen.xyz >
Co-authored-by: 7. Sun <jhao.sun@gmail.com >
Co-authored-by: Roy Wang <jasonailu87@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Co-authored-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com >
Co-authored-by: Xingran Wang <wangxingran123456@outlook.com >
Co-authored-by: david guan <102001211+Chenhao-Guan@users.noreply.github.com >
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Joshua Deng <91448271+joshuadeng@users.noreply.github.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-26 04:56:34 -08:00
Cyrus Leung
61274bdef5
[Doc] Further update multi-modal impl doc ( #33065 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 10:54:20 +00:00
ltd0924
b40db4dfec
[StepVL] add step vl offline example ( #33054 )
...
Signed-off-by: luotingdan <luotingdan@stepfun.com >
Co-authored-by: luotingdan <luotingdan@stepfun.com >
2026-01-26 01:00:32 -08:00
Cyrus Leung
11b556878b
[Refactor] Use data parser for matching data items to multi-modal UUIDs ( #32955 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-26 15:00:28 +08:00
Danielle Robinson
ee484b3f4b
Set splitk=1 for fused-moe-lora expand kernel ( #32882 )
...
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com >
Co-authored-by: Danielle Robinson <dmmaddix@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-25 22:52:34 -08:00
Woosuk Kwon
a9b53dd435
[Model Runner V2] Add LoRAState to consolidate lora logic ( #33062 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-01-25 22:21:12 -08:00
Robert Shaw
254db42ede
[Tests] Remove Duplicates ( #33032 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-26 05:23:54 +00:00
ltd0924
105d104576
[StepVL] support close img patch ( #32923 )
...
Signed-off-by: luotingdan <luotingdan@stepfun.com >
Signed-off-by: ltd0924 <32387785+ltd0924@users.noreply.github.com >
Co-authored-by: luotingdan <luotingdan@stepfun.com >
2026-01-25 20:56:39 -08:00
Lucas Wilkinson
566cdb6cfb
[CI] Fix MHA attention test failure (AttributeError when model_config is None in ViT attention backend) ( #33033 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-25 19:49:53 -08:00
Woosuk Kwon
2f0d3ba745
[Model Runner V2] Minor simplification for finish_requests ( #33048 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-01-25 18:35:02 -08:00
Woosuk Kwon
edf927bc9f
[Model Runner V2] Fix slot_mapping after #25954 ( #33046 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-01-25 18:29:49 -08:00
Andreas Karatzas
22aeb43007
[Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling ( #32969 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-26 08:34:05 +08:00
Itay Etelis
a698e8e7ad
[Model] Use mm_position to compute mrope positions for Qwen2.5-Omni ( #32772 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
2026-01-25 20:15:53 +08:00
zhanqiuhu
151e5451c2
[Doc] Add Qwen2.5 models to batch invariance tested models ( #33016 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-01-25 09:20:46 +00:00
Jee Jee Li
73b243463b
[BugFix] Add env variable to control PDL in LoRA ( #32836 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-25 16:32:30 +08:00
JJJYmmm
7e67df5570
[Bugfix] fix encoder cache hang in Qwen3VL ( #32684 )
...
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-25 05:17:31 +00:00
7. Sun
ff6c1da4e6
[Docs] Fix Apple silicon include path in CPU installation docs ( #32977 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-25 01:51:49 +00:00
Roberto L. Castro
fcb9df99bd
[Perf][Kernel] Optimize FP4 quantization kernels (SM100F) ( #32520 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-01-24 18:45:27 -07:00
TJian
1ebdff412a
[DOC] [ROCm] Update doc for v0.14.1 ( #32998 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-25 09:13:21 +08:00
Joshua Deng
91601ff478
[Feature] add session based streaming input support to v1 ( #28973 )
...
Signed-off-by: Joshua Deng <joshuakdeng@gmail.com >
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-24 12:06:28 -08:00
yugong333
d4dbb7af63
Using max_loras + 1 to construct grid in fused_moe_lora ( #32277 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
2026-01-24 12:39:30 -05:00
Maryam Tahhan
203d0bc0c2
[CPU] Improve CPU Docker build ( #30953 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-01-24 17:08:24 +00:00
Fadi Arafeh
17ab54de81
[CPU Backend][BugFix] Fix failing Darwin pipelines ( #33002 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-24 17:02:22 +00:00
7. Sun
cd775bdbe0
[Tests] Replace flaky sleep with polling in test_background_cancel ( #32986 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-24 16:39:07 +00:00
Lucas Wilkinson
da5e7b12be
[MLA] Fuse cat and quant for fp8 kv-cache ( #32950 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-24 16:03:02 +00:00
Louie Tsai
719ac592ed
Update CPU doc according to feedback ( #32963 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-24 16:02:44 +00:00
Hiroken.
1209b784f2
[Bugfix]: resolve torch.compile cache conflict between mm_encoder_tp_modes ( #32842 )
...
Signed-off-by: Hongjian Zhang <zhanghongjian@xiaohongshu.com >
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Co-authored-by: Xingran Wang <wangxingran123456@outlook.com >
2026-01-24 14:45:14 +00:00
Lukas Geiger
5fa0f6efa9
[EncoderCacheManager] Remove unnecessary copy ( #32800 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2026-01-24 14:28:57 +00:00
david guan
bc0d291bfe
feat: Complete LoRA support for MiniMaxM2 Fixes #32736 ( #32763 )
...
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-01-24 20:48:46 +08:00
Isotr0py
9ad7f89f55
[Models]: Make Multimodal config implicit in ViT implementation ( #31972 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-24 20:34:26 +08:00
Hiroken.
6450b536a6
[Bugfix] Fix E2E latency calculation and add warmup support in mm_processor benchmark ( #32646 )
...
Signed-off-by: Hongjian Zhang <zhanghongjian@xiaohongshu.com >
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Signed-off-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com >
Co-authored-by: Xingran Wang <wangxingran123456@outlook.com >
2026-01-24 10:31:41 +00:00
7. Sun
0f19427db5
[Perf] Cache exc.errors() result in validation exception handler ( #32984 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-24 02:01:35 -08:00
Cyrus Leung
51931c5c9a
[UX] Deduplicate sampling parameter startup logs ( #32953 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-24 17:37:28 +08:00
Reagan Lee
06b557ecd9
feat(benchmark): add encoder forward pass benchmarking to mm-processor ( #31655 )
...
Signed-off-by: Reagan <reaganjlee@gmail.com >
Signed-off-by: Reagan Lee <96998476+reaganjlee@users.noreply.github.com >
Co-authored-by: Hiroken. <105287758+HirokenOvo@users.noreply.github.com >
2026-01-24 08:24:44 +00:00
Roger Wang
81c2a889ce
[Doc] Ignore typo check on doc ( #32999 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-23 23:52:22 -08:00
Isotr0py
8edaf38570
[Models] Add SharedFusedMoE support to Qwen3MoE ( #32082 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-23 23:36:31 -08:00
Roy Wang
5c86a89805
[docs] Update governance process links ( #32995 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-23 23:32:44 -08:00
7. Sun
0ccecf8833
[Tests] Standardize RNG seed utility across test files ( #32982 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-24 06:47:14 +00:00
7. Sun
0b9a735e11
[Tests] Clarify pytest skip reasons with actionable context ( #32981 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-24 06:38:50 +00:00
7. Sun
14d03b8ddb
[Perf] Cache xpu_get_mem_info() result to avoid duplicate calls ( #32983 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-01-23 20:56:23 -08:00
Michael Goin
d0cbac5827
[Dev UX] Add auto-detection for VLLM_PRECOMPILED_WHEEL_VARIANT during install ( #32948 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Shengqi Chen <i@harrychen.xyz >
2026-01-23 19:15:17 -08:00
ruizcrp
c0d820457a
Auth_token added in documentation as it is required ( #32988 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-24 03:03:05 +00:00
monajafi-amd
97ef11dd34
[ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 ( #32944 )
...
Signed-off-by: mohammad najafi <mohammad.najafi@amd.com >
2026-01-24 10:03:07 +08:00
Xin Yang
ecc3dd66cc
[Bugfix] Fix FusedMoE LoRA kernel offs_token out of bound value ( #32279 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-24 01:41:35 +00:00
Joe Runde
7e1f10d562
[Core][Bugfix] allow graceful worker termination ( #32965 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2026-01-23 17:28:45 -08:00
ElizaWszola
a28b94e6ef
[Performance] Split FlashAttn attention and cache update ( #25954 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Luka Govedič <luka.govedic@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <luka.govedic@gmail.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2026-01-23 17:28:06 -08:00
dolpm
0118cdcc02
[fix] add VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to compile factors ( #32912 )
...
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
2026-01-23 22:53:10 +00:00
Shengqi Chen
136c499f6e
[CI] fix version comparison and exclusion patterns in upload-release-wheels.sh ( #32971 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2026-01-23 22:21:49 +00:00
joninco
ebd0a17e0e
[Bugfix] Fix missing is_layer_skipped check for FusedMoE in AWQConfig ( #32935 )
...
Signed-off-by: jon <joninco@bullpoint.org >
2026-01-23 17:19:56 -05:00
Wentao Ye
37c9859fab
[Refactor] Clean up unused variables & func ( #32692 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-23 17:04:25 -05:00
Michael Goin
4561f13985
[Refactor] Rename gptq_marlin to marlin to match MoE ( #32952 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-23 16:48:12 -05:00
rasmith
6cc6d92be5
[CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to ensure tensors passed will be made contiguous for the kernel ( #32831 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2026-01-23 13:35:48 -08:00
Wentao Ye
dfab5f3764
[Bug] Fix benchmark script moe_permute_unpermute ( #32949 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-23 16:18:56 -05:00
Markus / Mark
586a57ad7e
fix: Add glm4_moe_lite to MLA detection ( #32614 )
...
Signed-off-by: marksverdhei <marksverdhei@hotmail.com >
Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2026-01-23 12:38:57 -08:00
Lucas Wilkinson
3a41459501
[cudagraphs] Refactor cudagraph capture loop ( #32946 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-23 13:22:20 -07:00
Nick Hill
8518b30447
[Model Runner V2] Add KV Connector support ( #32742 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-23 10:49:17 -08:00
Matthew Bonanni
2d6b537157
[Bugfix][CI] Fix pre-commit ( #32956 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-23 10:26:56 -08:00
Orion Reblitz-Richardson
68b0a6c1ba
[CI][torch nightlies] Use main Dockerfile with flags for nightly torch tests ( #30443 )
...
Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com >
Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-01-23 10:22:56 -08:00
Harry Huang
5206e5e28c
[V1][Hybrid] Mamba Prefix Caching with align mode ( #30877 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2026-01-23 09:56:48 -08:00
Matteo Fari
fec9da0af4
[Model] Enable LoRA support for internvl2 ( #32397 )
...
Signed-off-by: Matteo Fari <matteofari06@gmail.com >
2026-01-24 01:39:01 +08:00
Luka Govedič
bbbd696af9
[torch.compile][CI] Add back attn fusion on hopper/ada ( #32940 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-01-23 16:49:20 +00:00
sangbumlikeagod
9b77bb790d
[Frontend] add logprob, compression_rate to 'verbose_json' features ( #31059 )
...
Signed-off-by: sangbumlikeagod <oironese@naver.com >
Signed-off-by: sangbumlikeagod <98077576+sangbumlikeagod@users.noreply.github.com >
2026-01-23 16:35:13 +00:00
Matt
305e53ade8
[Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test ( #32904 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-23 16:24:26 +00:00
Mark McLoughlin
1cb4341fbc
[ROCm][PD] Remove unused moriio connector proxy code ( #32939 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-01-23 15:59:04 +00:00
baonudesifeizhai
1fb648bf10
[Bugfix] Fix FP8 MoE EP Weight Loading for ModelOpt Llama4 ( #32886 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-01-23 10:31:48 -05:00
Nicolò Lucchesi
7e22309755
[Misc] Postpone torch_profiler deprecation ( #32867 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-23 14:39:48 +00:00
Xin Yang
90c2007932
[Bugfix] Disable tma_aligned_scales in test_fusions_e2e ( #32916 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-23 14:34:30 +00:00
Raushan Turganbay
d95d650762
[Bugfix] Fix getting vision features in Transformer Multimodal backend ( #32933 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-01-23 13:34:48 +00:00
tianshu-Michael-yu
13d8746c54
[Feature]: Remove DtoH Copy for lfm2_vl On Default Stream ( #32815 )
...
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
2026-01-23 13:20:30 +00:00
Fadi Arafeh
10e94c84f6
[CPU][Feat] Update PyTorch to v2.10 for CPU Backend ( #32869 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-01-23 21:13:06 +08:00
Isotr0py
243e78c20f
[Benchmark][Bugfix] Fix race condition when starting server for sweep benchmark ( #32927 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-23 12:11:18 +00:00
Fadi Arafeh
aac0b817fa
[CPU Backend][BugFix] Fix failing CPU MoE test ( #32876 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-23 12:06:51 +00:00
wang.yuqi
05f3d714db
[Frontend][3/n] Make pooling entrypoints request schema consensus | EmbedRequest & ClassifyRequest ( #32905 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 12:03:44 +00:00
Patrick von Platen
3f3f89529d
[Voxtral] Add new streaming arch ( #32861 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 12:41:52 +01:00
Li, Jiang
5da4c7d789
[CI/Build][CPU] Fix failed pooling tests and macos smoke test ( #32907 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-23 10:48:20 +00:00
Nicolò Lucchesi
160c6fa387
[Misc] Add get_name to missing AttentionBackends ( #32698 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-23 10:35:44 +00:00
Andreas Karatzas
a8eb1182f1
[CI][Models] Add VLM Support for Sequence Classification Conversion ( #32885 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-23 16:22:51 +08:00
Karan Bansal
fa6e599a61
[Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set ( #32777 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
2026-01-23 08:22:37 +00:00
Wentao Ye
7ef5873752
[CI] Fix mypy for vllm/v1/structured_output ( #32722 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-23 11:55:51 +08:00
Luka Govedič
5e4e0e51f4
[torch.compile] Compile CustomOp.forward_native for SiluAndMul and QuantFP8 to avoid raw torch ops inside opaque custom ops ( #32806 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-22 19:52:26 -08:00
Rishabh Saini
f61c9da711
[BugFix] deepseek_v32_encoding: Replace asserts with proper exceptions ( #32884 )
...
Signed-off-by: RishabhSaini <rishabhsaini01@gmail.com >
2026-01-23 03:44:11 +00:00
Nick Hill
7fe255889e
[Misc] Log vLLM logo when starting server ( #32796 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-23 11:15:12 +08:00
bnellnm
dc917cceb8
[MoE Refactor] Move select_experts from FusedMoEQuantMethod -> FusedMoE ( #31996 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-01-22 18:21:35 -05:00
Fadi Arafeh
fc56f4a071
[BugFix] Fix invalid flashinfer_fused_moe_blockscale_fp8 op registration ( #32855 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-22 22:27:40 +00:00
Xin Yang
d08b356ee0
[Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper ( #32619 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-22 15:47:04 -05:00
Wentao Ye
f744810184
[Refactor] Remove unused tpu files ( #32610 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-22 15:35:18 -05:00
Eldar Kurtić
44f08af3a7
Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) ( #30141 )
...
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com >
2026-01-22 13:29:57 -07:00
Matthew Bonanni
955b43a5a5
[Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 ( #32795 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-22 19:05:18 +00:00
Fadi Arafeh
744ef30484
[CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inference across NUMA domains on Arm ( #32792 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-22 18:55:23 +00:00
Matthew Bonanni
300622e609
[CI][Attention] Add more CI dependencies for attention tests ( #32487 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-22 18:44:56 +00:00
RickyChen / 陳昭儒
69d09fdd6c
[Feature] Add --ssl-ciphers CLI argument for TLS cipher control ( #30937 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
2026-01-22 09:53:24 -08:00
David Ramon Prados
3a63be0faa
Support custom URI schemes and trace handlers for profiler ( #32393 )
2026-01-22 09:45:40 -08:00
Tyler Michael Smith
803e3f3f68
[UX] Default api_server_count to dp_size if not specified ( #32525 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-01-22 17:35:35 +00:00
Vadim Gimpelson
70917b1c55
[MISC] Add .cursor to .gitignore ( #32868 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-22 17:27:13 +00:00
Matt
c517d8c934
[Hardware][AMD][CI][Bugfix] Fix regressions from deprecated env vars ( #32837 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-23 00:59:15 +08:00
Xu Jinyang
fc37187a51
[Bugfix] ModelScope is supported when downloading LORA models. ( #32844 )
...
Signed-off-by: AuYang <459461160@qq.com >
2026-01-22 16:33:21 +00:00
Maximilien de Bayser
ff365eea94
Support bge-m3 sparse embeddings and colbert embeddings ( #14526 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2026-01-22 23:52:57 +08:00
Isotr0py
444e2e7e1f
[Misc] Bump opencv-python dependency version to 4.13 ( #32668 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-22 15:51:15 +00:00
Nick Hill
bc14663e6a
[Cleanup] Move scheduler get_routed_experts logic to separate method ( #32706 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-22 10:46:00 -05:00
Richard Zou
654a71fc3c
[torch.compile] Improve Cold Start for MoEs ( #32805 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-22 10:44:40 -05:00
Lucas Kabela
15e302dfce
[Misc][BE] Turn on strict type coverage for vllm/compilation ( #31756 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-22 15:12:26 +00:00
Cyrus Leung
d117a4d1a9
[Frontend] Introduce Renderer for processing chat messages (using ModelConfig) ( #30200 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-22 12:44:22 +00:00
Or Ozeri
421012b63a
OffloadingConnector: Support kernel_block_size != block_size ( #30692 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-22 12:30:04 +00:00
Chauncey
841d53aaa8
[Frontend] add prompt_cache_key for openresponses ( #32824 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-22 11:34:14 +00:00
Shengqi Chen
1752262e96
[CI] refactor release pipeline config into groups ( #32833 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2026-01-22 11:27:21 +00:00
Nicolò Lucchesi
ea6102b85d
[Bugfix] Fix Whisper/encoder-decoder GPU memory leak ( #32789 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-22 10:50:37 +00:00
wang.yuqi
328cbb2773
[Frontend][2/n] Make pooling entrypoints request schema consensus | ChatRequest ( #32574 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-22 10:32:44 +00:00
liranschour
64e3d67ac0
Enable Cross layers KV cache layout at NIXL Connector ( #30207 )
...
Signed-off-by: Liran Schour <lirans@il.ibm.com >
Signed-off-by: liranschour <liranschour@users.noreply.github.com >
Co-authored-by: Or Ozeri <or@ozery.com >
2026-01-22 10:12:58 +00:00
Nick Hill
098b2d66fe
[Benchmark] Don't default to temperature==0 in vllm bench serve ( #32723 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-22 10:03:15 +00:00
Isotr0py
8ebf271bb6
[Misc] Replace urllib's urlparse with urllib3's parse_url ( #32746 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-22 16:37:15 +08:00
Alex Sun
49a1262267
[AMD][ROCm] MoRI EP: a high-performance all2all backend ( #28664 )
...
Signed-off-by: Alex Sun <alex.s@amd.com >
2026-01-22 16:33:18 +08:00
Cyrus Leung
2b8a38b6d6
[Model] Extend collect_children and no_init_weights contexts ( #32757 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-22 08:20:27 +00:00
Kebe
1bf1a34b19
[bench] add start_times field to vllm bench serve json result ( #32667 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2026-01-22 07:10:14 +00:00
Andreas Karatzas
a810299838
[ROCm][CI][Docs] Add comment explaining TRITON_ATTN fallback for ROCm ( #32835 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-21 22:11:09 -08:00
Andreas Karatzas
eb1629da24
[ROCm][CI] Fix AITER test flakiness by using explicit attention backend ( #32346 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-22 13:55:25 +08:00
Micah Williamson
019e2c3b7c
[ROCm][CI] Lower Acceptance Len Threshold For test_draft_model_quantization ( #32731 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-22 05:47:33 +00:00
Huy Do
f5fdec8ce2
Upgrade transformers-4.57.5 ( #32287 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-01-22 05:19:19 +00:00
Patrick von Platen
1579c9b5fd
[Llama.py -> mistral.py] Extract mistral-only relevant code into separate file ( #32780 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-01-22 05:14:57 +00:00
Lucas Wilkinson
889722f3bf
[FlashMLA] Update FlashMLA to expose new arguments ( #32810 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-21 22:02:39 -07:00
Divakar Verma
49d9653852
[ROCm][CI] fix get_valid_backends ( #32787 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-22 04:27:47 +00:00
Ifta khairul Alam Adil
a1d82466ea
[Docs] Remove outdated async_scheduling limitation with speculative decoding ( #32775 )
...
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com >
Signed-off-by: Ifta khairul Alam Adil <25082512+ikaadil@users.noreply.github.com >
2026-01-21 20:19:25 -08:00
Lucain
24a163ed77
Cleanup some huggingface_hub-related stuff ( #32788 )
2026-01-22 03:38:17 +00:00
knlnguyen1802
378385b90c
[EC Connector] Optimize remote cache check in scheduler ( #32585 )
...
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
2026-01-22 03:30:59 +00:00
Matt
c5487e2b96
[Bugfix] Fix potential EAGLE spec decode segfault during graph capture ( #32818 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-22 03:11:55 +00:00
Wentao Ye
6437ff1fb9
[Deprecation] Remove deprecated environment variables ( #32812 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-22 02:25:16 +00:00
Woosuk Kwon
5e00b561cd
[Model Runner V2] Do not error on attention backends ( #32820 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-21 17:02:48 -08:00
Woosuk Kwon
408195ec59
[Model Runner V2] Refactor Prompt Logprobs ( #32811 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-21 15:12:20 -08:00
Xin Yang
63227accf5
[Kernel] Add topk_sigmoid kernel ( #31246 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-21 22:49:51 +00:00
Yanan Cao
e675dda67b
[Misc] Add Helion version check to collect_env ( #32797 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-21 21:54:46 +00:00
Nick Hill
24dc30f7ff
[ModelRunner V2] Don't pin reused flashinfer tensors ( #32799 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-21 13:17:43 -08:00
Divakar Verma
180fba653e
[ROCm] fix import for on_gfx9 ( #32783 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-21 18:41:11 +00:00
danisereb
f999539869
Add missing import of fused_topk to benchmark_moe ( #32784 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-01-21 18:30:10 +00:00
Woosuk Kwon
e1da249c93
[Model Runner V2] Minor refactor for compute_slot_mappings ( #32794 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-21 10:24:35 -08:00
Nick Hill
9b693d023c
[Misc] Omit "disable NCCL for DP sync" startup log when not applicable ( #32707 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-21 17:03:39 +00:00
elvischenv
808d6fd7b9
Bump Flashinfer to v0.6.1 ( #30993 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-01-21 08:49:50 -08:00
whx
1861ae8aae
[PluggableLayer][1/N] Define PluggableLayer (Fix ci) ( #32744 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-01-21 11:38:04 -05:00
Robert Shaw
4e31b7f228
[Quantization][Deprecation] Remove RTN ( #32697 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-21 16:34:42 +00:00
Pleaplusone
6c20e89c02
[ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp ( #29287 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-01-21 23:16:30 +08:00
Robert Shaw
85f55c943c
[Quantization][Deprecation] Deprecate HQQ ( #32681 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-21 09:32:40 -05:00
Robert Shaw
cea3c754c4
[Quantization][Deprecation] Remove DeepSpeedFp8 ( #32679 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-21 09:32:12 -05:00
Robert Shaw
42135d6898
[MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority ( #32414 )
2026-01-21 08:22:33 -05:00
Divakar Verma
e14467be43
[bugfix] Aria model ( #32727 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-21 05:11:31 -08:00
Kim Hee Su
7727ce35c2
[Model] Add Eagle2.5-8B Vision-Language Model support ( #32456 )
...
Signed-off-by: kimheesu <wlskaka4@gmail.com >
2026-01-21 09:39:53 +00:00
Yanwen Lin
6bb2bc71e2
[Bugfix] Force using spawn multiprocess method when it's the WSL platform ( #32749 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
2026-01-21 09:35:55 +00:00
Lucas Kabela
c80f92c14d
[Documentation] Fix typo in docs/design/torch_compile_multimodal.md ( #32741 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-20 23:54:20 -08:00
RickyChen / 陳昭儒
f23fb5a7c1
[Bugfix] Support HF sharded weights for Mistral3/Pixtral models ( #32673 )
...
Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com >
Signed-off-by: vllm-dev <ricky.chen@infinirc.com >
2026-01-20 23:27:30 -08:00
Paco Xu
360aa93f8f
[Docs] Fix GitHub handle in governance process ( #32582 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-01-21 07:07:50 +00:00
Netanel Haber
27ca95b3c9
[Bugfix] Fix Nemotron-Nano-v2-vlm static resolution ( #32682 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-01-21 06:28:21 +00:00
Lucas Wilkinson
b4f64e5b02
Update FlashMLA ( #32491 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-21 13:03:37 +08:00
shanjiaz
7ab80a8e37
Added qwen3 vision language moe support for speculative decoding ( #32048 )
...
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com >
Signed-off-by: shanjiaz <43143795+shanjiaz@users.noreply.github.com >
2026-01-21 03:24:05 +00:00
gopalsarda
0900cedb3f
Enable Eagle3 speculative decoding for Pixtral (LlavaForConditionalGeneration) ( #32542 )
...
Signed-off-by: gopalsarda <gopal.sarda@servicenow.com >
2026-01-21 11:18:05 +08:00
Nick Hill
6f067b1fb7
[Cleanup] Remove unused KVConnectorModelRunnerMixin methods ( #32077 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-21 11:16:37 +08:00
Alex Brooks
27b81e010d
[Bugfix] Fix Granite Vision / Don't use Siglip Pooling Head Nested Models by Default ( #32299 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2026-01-21 11:11:52 +08:00
Or Ozeri
7013e9ac8f
OffloadingConnector: Prevent redundant loads ( #29087 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-21 01:15:42 +00:00
Robert Shaw
c78ee240b3
Revert "[PluggableLayer][1/N] Define PluggableLayer" ( #32725 )
2026-01-21 00:21:06 +00:00
Vasiliy Kuznetsov
d2389c1262
fp8 online quant: split out Fp8OnlineLinearMethod ( #32189 )
2026-01-20 18:13:22 -05:00
Micah Williamson
22375f8d13
[ROCm][CI] Remove DS async eplb accuracy test from AMD CI ( #32717 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-20 13:40:48 -08:00
TJian
9b67338b78
[Bugfix] Suppress log on non-ROCm platform ( #32703 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-20 13:38:20 -08:00
Lucas Wilkinson
2261340806
[Misc] Remove pad_for_cudagraphs from config ( #30143 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-20 15:05:48 -05:00
Shinichi Hemmi
86c69dc54c
[Bugfix] Fix byte fallback handling when using outlines ( #31391 )
...
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
Co-authored-by: Kenichi Maehashi <maehashi@preferred.jp >
2026-01-20 19:48:08 +00:00
dolpm
7c5dedc247
[AOT compilation] support torch.compile inductor artifacts in VllmCompiledFunction ( #25205 )
...
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
2026-01-20 19:45:59 +00:00
Cyrus Leung
193069d129
[5/N] Initialize MM components in context managers (Q-Z) ( #32695 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 19:10:23 +00:00
Rahul Tuli
f0feb1cf81
Test: added acceptance length tests ( #32030 )
...
Signed-off-by: rahul-tuli <rtuli@redhat.com >
2026-01-20 18:55:15 +00:00
Cyrus Leung
09194b90a5
[Doc] Update docs for MM model development with context usage ( #32691 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 10:37:35 -08:00
Woosuk Kwon
9ab4388cd3
[Model Runner V2] Support FLASHINFER_MLA backend ( #32709 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-20 10:26:17 -08:00
JJJYmmm
04a9e064db
[Bugfix] fix the ima issue of qwen-vit ( #32687 )
...
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
2026-01-20 17:21:25 +00:00
TJian
c025263ddd
[Doc] [ROCm] Update ROCm getting started doc ( #32580 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 09:20:08 -08:00
Wentao Ye
6c97b9b9b6
[Perf] Only clone when needed for moe_permute ( #32273 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-20 11:34:39 -05:00
whx
4ca62a0dbd
[PluggableLayer][1/N] Define PluggableLayer ( #32331 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2026-01-20 16:19:21 +00:00
linhaifeng
7901109ea5
[Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculation ( #32603 )
...
Signed-off-by: linhaifeng <1371675203@qq.com >
2026-01-20 11:13:39 -05:00
YiSheng5
13f6630a9e
[XPU]Support AgRsAll2AllManager on XPU device ( #32654 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2026-01-20 14:27:24 +00:00
Cyrus Leung
fda3f03eb2
[4/N] Initialize MM components in context managers (M-P) ( #32663 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 14:06:32 +00:00
杨朱 · Kiki
bb9172030e
[Metrics] Complete removal of deprecated vllm:time_per_output_token_seconds metric ( #32661 )
...
This PR completes the removal of the vllm:time_per_output_token_seconds metric,
which was deprecated in v0.11, hidden in v0.12, and scheduled for removal in v0.13,
but delayed until v0.15.
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com >
2026-01-20 12:28:41 +00:00
Chauncey
c4e5bdf61b
[Bugfix] Fix the fp8_mqa_logits dim mismatch ( #32652 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-20 18:48:07 +08:00
Cyrus Leung
7f1bcd18ff
[3/N] Initialize MM components in context managers (I-L) ( #32650 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 10:21:56 +00:00
Walter Beller-Morales
8be263c3fb
[Core] Cleanup shm based object store on engine shutdown ( #32429 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-01-20 08:53:37 +00:00
Cyrus Leung
e1a34c3a5d
[2/N] Initialize MM components in context managers (E-H) ( #32641 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 08:12:56 +00:00
vllmellm
148117ea2e
[Refactor] Make FP8 Linear Ops use kernel abstraction ( #27814 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-20 14:48:20 +08:00
Woosuk Kwon
e9c83cdc51
[Model Runner V2] Skip kernel launch for penalties & logit_bias ( #32634 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 22:20:19 -08:00
Cyrus Leung
b75e85dede
[1/N] Initialize MM components in context managers (A-D) ( #32632 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 14:12:42 +08:00
Cyrus Leung
4753f3bf69
[Model] Use context managers for encoder- and LM-only mode ( #32605 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-20 11:43:38 +08:00
Woosuk Kwon
6c01ffb897
[Model Runner V2] Decouple temperature from penalties ( #32629 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 19:13:24 -08:00
Woosuk Kwon
7b7cdce968
[Model Runner V2] Refactor get_cudagraph_and_dp_padding ( #32625 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 18:25:02 -08:00
Jackmin801
12dab78f49
[Feat] allow inplace loading lora ( #31326 )
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com >
Signed-off-by: Jackmin801 <56836461+Jackmin801@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-20 10:15:20 +08:00
Woosuk Kwon
05dc4bfab6
[Model Runner V2] Initialized communication buffer for DP ( #32624 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 17:27:06 -08:00
Matthew Bonanni
1a1fc3bbc0
[Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill ( #32615 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-19 18:41:34 -05:00
Woosuk Kwon
43fada5360
[Model Runner V2] Refactor dummy_run ( #32533 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-19 14:50:59 -08:00
Tomas Ruiz
4a5299c93f
feat: spec decode with draft models ( #24322 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2026-01-19 16:05:46 -05:00
lon
73f2a81c75
docs: prefix caching seems quite outdated ( #28784 )
...
Signed-off-by: lon <114724657+longregen@users.noreply.github.com >
Signed-off-by: Russell Bryant <russell.bryant@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <russell.bryant@gmail.com >
2026-01-19 11:49:52 -08:00
jiahanc
7350331718
[BugFix] Fix TRT-LLM NVFP4 DP/EP ( #32349 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-19 14:32:24 -05:00
Yanan Cao
9d1e611f0e
[CI] Add Helion as an optional dependency ( #32482 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-01-19 19:09:56 +00:00
Vadim Gimpelson
0727cc9ecf
[BUGFIX] Fix test_mla_backends.py. Scale MLA projection weights to prevent numerical instability ( #32529 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-19 13:49:29 -05:00
qli88
a0490be8f1
[CI][amd] Revert NIXL connector change to avoid crash ( #32570 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-19 18:39:16 +00:00
Netanel Haber
cd3ac5b797
support dynamic resolution image encoding for Nemotron Nano VL ( #32121 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-01-19 18:15:58 +00:00
Jee Jee Li
2636d76257
[Misc] Remove unused ModelKeys ( #32608 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-19 17:34:59 +00:00
danisereb
aa7f37ccfa
Add support for LoRA adapters in Nemotron-H models ( #30802 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-01-19 22:30:44 +08:00
wang.yuqi
c88860d759
[Frontend] Score entrypoint support data_1 & data_2 and queries & documents as inputs ( #32577 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-19 14:07:46 +00:00
Nicolò Lucchesi
758df5afe7
[NIXL][Metrics] Track nixl_num_kv_expired_reqs metric in Prometheus ( #32340 )
...
Add a new metric to track the number of requests that had their KV blocks
expire. The scenario is particularly important to surface and track as it is a
vital indicator of the health of the deployment.
Currently we resort to tracking these failures through unstructured log
parsing (which, among other things, depends on the error string); on current main:
> Releasing expired KV blocks for request cmpl-071d which were retrieved by 0 decode worker(s) within 0 seconds.
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-19 12:28:27 +00:00
Daniel Mescheder
cdd03d25d3
[CI/Build] Fix dependency conflict between model-hosting-container-standards and starlette ( #32560 )
...
Signed-off-by: Daniel Mescheder <dmesch@amazon.com >
Co-authored-by: Daniel Mescheder <dmesch@amazon.com >
2026-01-19 03:27:08 -08:00
Nicolò Lucchesi
74c583bc50
[Core] Whisper support torch.compile ( #30385 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-19 10:02:31 +00:00
Andreas Karatzas
c0a350ca73
[ROCm][CI] Add ROCm attention backend support for EAGLE DP tests ( #32363 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-19 09:57:54 +00:00
Yuxuan Zhang
71832ba71e
[GLM-4.7] GLM Model support for GLM-Lite ( #31386 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Yuxuan Zhang <2448370773@qq.com >
2026-01-19 01:18:38 -08:00
Matt
11bbf86f6a
[CI][Hardware][AMD] Fix test_rotary_embedding_mla_cache_fused ( #32408 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-19 08:25:47 +00:00
Hyunkyun Moon
3c8740aacb
[Frontend] Add render endpoints for prompt preprocessing ( #32473 )
...
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com >
Signed-off-by: Hyunkyun Moon <mhg5303@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-19 12:21:46 +08:00
Alex Brooks
7518a3dc65
[CI/Build] Use Common Event Map Fixture in Harmony / MCP Server Tests ( #32531 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2026-01-19 04:05:51 +00:00
honglyua
976af2f314
[BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration ( #32462 )
2026-01-19 03:06:02 +00:00
Woosuk Kwon
9a1f16da1e
[Model Runner V2] Refactor update_states ( #32562 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-18 17:32:42 -08:00
Woosuk Kwon
bb1848cd62
[Model Runner V2] Support VLM ( #32546 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-18 16:58:51 -08:00
Vadim Gimpelson
6101a26dc9
[BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 ( #32417 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-18 16:57:32 -08:00
Iryna Boiko
f5d1740030
[Bugfix] Add OOT backend option ( #32471 )
...
Signed-off-by: Iryna Boiko <iboiko@habana.ai >
2026-01-18 22:20:39 +00:00
Wentao Ye
eebc58df0c
[Refactor] Remove unused cutlass moe problem size function ( #32047 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-18 12:46:59 -08:00
Wentao Ye
16de822c71
[Refactor] Remove unused file pallas_kv_cache_update.py ( #32433 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-18 12:46:39 -08:00
Deming
5480c6b1fa
[Doc] Correct comment for _jobs dict in OffloadingConnectorWorker ( #32556 )
2026-01-18 12:46:00 -08:00
Andrey Khalyavin
ba29ab441e
Use the same memory for workspace13 and fused_output. ( #31531 )
...
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru >
2026-01-18 19:14:22 +00:00
Robert Shaw
afc3622602
[CI] Move Distributed Tests from H200 -> H100 ( #32555 )
2026-01-18 10:25:23 -08:00
bnellnm
327a02d8db
[MoE Refactor] Separate Router into OO Classes ( #30623 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-01-18 11:40:49 -05:00
tjp_zju
2f03035a61
"refactor: refactor_repeated_interfaces" ( #32486 )
...
Signed-off-by: tom-zju <tanjianpingzju1990@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-18 22:07:01 +08:00
Isotr0py
38bf2ffb21
[Bugfix] Fix GLM-ASR audio encoder RoPE dim ( #32540 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-18 19:17:59 +08:00
Li Xie
c826c72a96
[Model] Support Step1 Model ( #32511 )
...
Signed-off-by: xieli <xieli@stepfun.com >
2026-01-18 10:20:46 +00:00
Canlin Guo
fe36bf5e80
[Model] Remove the unnecessary dtype conversion in MiniCPM ( #32523 )
...
Signed-off-by: gcanlin <canlinguosdu@gmail.com >
2026-01-18 08:07:28 +00:00
Woosuk Kwon
963dc0b865
[Model Runner V2] Minor optimization for eagle input processing ( #32535 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-17 21:55:17 -08:00
Isotr0py
8cc26acd8b
[Performance] Improve Triton prefill attention kernel's performance ( #32403 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-17 20:19:59 -08:00
Robert Shaw
4a6af8813f
[MoE Refactor] Move Test Impl into Test Dirs ( #32129 )
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2026-01-18 12:16:59 +08:00
Woosuk Kwon
4147910f1e
[Model Runner V2] Move mrope_positions buffer to MRopeState ( #32532 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-17 20:09:48 -08:00
Karan Bansal
3055232ba0
[Feature] Add FIPS 140-3 compliant hash algorithm option for multimodal hashing ( #32386 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
2026-01-18 11:02:01 +08:00
Shengqi Chen
965765aef9
[build] fix cu130 related release pipeline steps and publish as nightly image ( #32522 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2026-01-17 18:36:11 -08:00
Mritunjay Kumar Sharma
9e078d0582
[CI/Build][Docker] Add centralized version manifest for Docker builds ( #31492 )
...
Signed-off-by: Mritunjay Sharma <mritunjay.sharma@chainguard.dev >
2026-01-17 13:45:30 +00:00
Guofang.Tang
2b99f210f5
[Misc] Fix typo: seperator -> separator in flashmla_sparse.py ( #32411 )
...
Signed-off-by: Guofang Tang <tinggofun@gmail.com >
Co-authored-by: Guofang Tang <tinggofun@gmail.com >
2026-01-17 12:18:30 +00:00
Kim Hee Su
1646fea672
[Model] Molmo2: Enable quantized weight mapping for vision backbone ( #32385 )
...
Signed-off-by: kimheesu <wlskaka4@gmail.com >
2026-01-17 09:33:05 +00:00
Paul Pak
d3317bbba4
[Models] Lfm2Moe: minor name changes for resolving lora conflicts ( #29063 )
...
Signed-off-by: Paul Pak <paulpak58@gmail.com >
2026-01-16 22:12:55 -08:00
Shengqi Chen
8e61425ee6
[CI] Implement uploading to PyPI and GitHub in the release pipeline, enable release image building for CUDA 13.0 ( #31032 )
2026-01-17 04:52:33 +00:00
Matthew Bonanni
2e7c89e708
Revert "[Attention][MLA] Make FLASHINFER_MLA the default MLA backen… ( #32484 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-17 04:42:39 +00:00
vanshil shah
037a6487af
apply _validate_input to MistralTokenizer token-id chat prompts ( #32448 )
...
Signed-off-by: Vanshil Shah <vanshilshah@gmail.com >
2026-01-17 03:23:45 +00:00
Simon Mo
5a3050a089
[Docs][Governance] Add @robertshaw2-redhat to lead maintainers group ( #32498 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-01-16 18:35:49 -08:00
Chenyaaang
484e22bc18
[TPU][Core] Enable Pipeline Parallelism on TPU backend ( #28506 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2026-01-16 15:29:20 -08:00
Lucas Wilkinson
ca21288080
[CI] Fix OOM in Hopper Fusion E2E Tests (H100) ( #32489 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-16 21:27:16 +00:00
Andrew Xia
4c82b6fac7
[responsesAPI] allow tuning include_stop_str_in_output ( #32383 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-01-16 21:14:40 +00:00
Xin Yang
a884bc62d6
[LoRA] Update LoRA expand kernel heuristic ( #32425 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-16 18:38:07 +00:00
Hashem Hashemi
7a1030431a
Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. ( #29843 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-01-16 11:45:04 -06:00
Wentao Ye
9fd918e510
[CI] Update deepgemm to newer version ( #32479 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-17 01:18:05 +08:00
Ilya Markov
c9a533079c
[EPLB][BugFix]Possible deadlock fix ( #32418 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-01-16 09:11:01 -05:00
rasmith
6ca4f400d8
[CI][AMD] Skip test_permute_cols since the kernel is not used and not built for ROCm ( #32444 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
2026-01-16 16:22:53 +08:00
Cyrus Leung
180e981d56
[Chore] Replace swish with silu ( #32459 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-16 08:22:45 +00:00
Micah Williamson
b84c426a8c
[ROCm][CI] Skip Qwen3-30B-A3B-MXFP4A16 Eval Test On Non-CUDA Platforms ( #32460 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-16 00:17:44 -08:00
Rabi Mishra
b66b0d6abb
fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROCm ( #32244 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-16 15:31:10 +08:00
Hongxin Xu
03da3b52ef
[Bugfix] Refactor to support DP parallel in R3 ( #32306 )
...
Signed-off-by: xhx1022 <1737006628@qq.com >
Co-authored-by: arlenxu <arlenxu@tencent.com >
2026-01-16 15:13:58 +08:00
Lucas Wilkinson
14ce524249
[CI] Breakup h200 tests ( #30499 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-16 06:23:22 +00:00
wang.yuqi
4ae77dfd42
[Frontend][1/n] Make pooling entrypoints request schema consensus | CompletionRequest ( #32395 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-16 06:17:04 +00:00
XiongfeiWei
73f635a75f
[Bug] Add TPU backend option ( #32438 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2026-01-16 05:17:12 +00:00
cjackal
35bf5d08e8
[bugfix] Fix online serving crash when text type response_format is received ( #26822 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
Signed-off-by: j0shuajun <59368606+j0shuajun@users.noreply.github.com >
Co-authored-by: j0shuajun <59368606+j0shuajun@users.noreply.github.com >
2026-01-16 12:23:54 +08:00
Kebe
5de6dd0662
[Bugfix] [DeepSeek-V3.2] fix sparse_attn_indexer padding ( #32175 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-16 03:21:55 +00:00
ltd0924
709502558c
[Model] Add Step3vl 10b ( #32329 )
...
Signed-off-by: luotingdan <luotingdan@stepfun.com >
Signed-off-by: ltd0924 <32387785+ltd0924@users.noreply.github.com >
Co-authored-by: luotingdan <luotingdan@stepfun.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-15 19:04:16 -08:00
Micah Williamson
46f8a982b1
[ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test ( #32431 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-16 00:55:57 +00:00
Matthew Bonanni
bcf2333cd6
[CI] Fix LM Eval Large Models (H100) ( #32423 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-16 00:52:49 +00:00
Michael Goin
83239ff19a
Add thread_n=64 support to Marlin MoE ( #32360 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-15 16:45:44 -08:00
TomerBN-Nvidia
c277fbdf31
[Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors ( #32257 )
...
Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Tomer Natan <tbarnatan@ipp1-1429.ipp1a1.colossus.nvidia.com >
2026-01-15 16:15:05 -08:00
Wentao Ye
aca5c51487
[Refactor] Remove unused file ( #32422 )
2026-01-15 15:59:38 -07:00
Yongye Zhu
31c29257c8
[MoE Refactor][17/N] Apply Refactor to Bf16 ( #31827 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-15 12:53:40 -08:00
Aleksandr Malyshev
8c11001ba2
[ROCM] DSfp4 mla projection gemms weight dynamic quantization ( #32238 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2026-01-15 14:13:08 -06:00
Richard Zou
bd292be0c0
[BugFix] Python file source reading can fail on UnicodeDecodeError ( #32416 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-15 20:01:41 +00:00
TJian
41c544f78a
[ROCm] [CI] [Release] Rocm wheel pipeline with sccache ( #32264 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-16 02:56:18 +08:00
Michael Goin
1be5a73571
[UX] Use kv_offloading_backend=native by default ( #32421 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-15 18:55:11 +00:00
Lucas Wilkinson
c36ba69bda
[BugFix] Fix assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1] in Blackwell Quantized MoE Test ( #32362 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-15 10:19:12 -08:00
Matthias Gehre
047413375c
[Attention][AMD] Make flash-attn optional ( #30361 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-01-15 17:18:24 +00:00
smit kadvani
74e4bb1c5a
fixing podman build issue ( #32131 )
...
Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com >
Co-authored-by: Smit Shaileshbhai Kadvani <kadvani@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-01-15 11:07:08 -06:00
Wentao Ye
b34474bf2c
[Feature] Support async scheduling + PP ( #32359 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-15 12:06:23 -05:00
Woosuk Kwon
6218034dd7
[Model Runner V2] Support FlashInfer backend & Fix CUDA Graph bug [1/2] ( #32348 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-15 08:59:23 -08:00
Pleaplusone
77c16df31d
[ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm ( #32413 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-01-15 16:35:47 +00:00
Pleaplusone
130d6c9514
[ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for AiterFlashAttentionBackend ( #29887 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-01-15 15:29:53 +00:00
Dipika Sikka
361dfdc9d8
[Quant] Support MXFP4 W4A16 for compressed-tensors MoE models ( #32285 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-15 07:25:55 -08:00
Matthew Bonanni
8ebfacaa75
[Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill ( #32339 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-15 09:49:57 -05:00
brian033
b89275d018
[ROCm] Improve error handling while loading quantized model on gfx120… ( #31715 )
...
Signed-off-by: brian033 <85883730+brian033@users.noreply.github.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-01-15 04:16:00 -08:00
Cyrus Leung
28459785ff
[3/N] Group together media-related code ( #32406 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-15 11:52:12 +00:00
rasmith
8853a50af2
[CI][BugFix][AMD][FP8] Fix test_rms_norm so it runs correctly on ROCm ( #32372 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2026-01-15 19:05:54 +08:00
Douglas Lehr
c5891b5430
[ROCM] Add ROCm image build to release pipeline ( #31995 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
2026-01-15 19:01:40 +08:00
Chauncey
707b44cc28
[Refactor] [11/N] to simplify the mcp architecture ( #32396 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-15 18:49:31 +08:00
rongfu.leng
3a4e10c847
[Benchmark] [Feature] add vllm bench sweep startup command ( #32337 )
...
Signed-off-by: lengrongfu <lenronfu@gmail.com >
2026-01-15 09:25:46 +00:00
Cyrus Leung
cbbae38f93
[2/N] Move cache factories to MM registry ( #32382 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-15 01:02:30 -08:00
Cyrus Leung
cdba4c74b3
[Model] Avoid token selection in SigLIP pooling head ( #32389 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-15 17:01:59 +08:00
seeksky
a52d1396a7
fix: avoid crash on zero-arg tool calls in glm4 parser ( #32321 )
...
Signed-off-by: seekskyworld <djh1813553759@gmail.com >
2026-01-15 08:45:59 +00:00
dtc
1e584823f8
[Bugfix] Strengthen the check of X-data-parallel-rank in Hybrid LB mode ( #32314 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
2026-01-15 16:31:16 +08:00
Chauncey
4c1c501a7e
[Refactor] [10/N] to simplify the vLLM openai completion serving architecture ( #32369 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-15 07:41:34 +00:00
Andreas Karatzas
ae1eba6a9a
[ROCm][CI] Pin transformers 4.57.3 to fix jina test failures ( #32350 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-15 15:19:34 +08:00
Ofir Zafrir
e9ec2a72d8
[Bugfix] Fix stale common_attn_metadata.max_seq_len in speculative decoding with Eagle ( #32312 )
...
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com >
2026-01-15 06:39:37 +00:00
Lucas Wilkinson
2c9b4cf5bf
[BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes ( #32361 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Eldar Kurtić <8884008+eldarkurtic@users.noreply.github.com >
2026-01-15 06:32:22 +00:00
Ning Xie
9d7ae3fcdb
[code clean] remove duplicate check ( #32376 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-15 05:29:34 +00:00
rasmith
3c2685645e
[CI][AMD][Quantization][BugFix] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol ( #32201 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
2026-01-15 05:04:34 +00:00
Micah Williamson
773d7073ae
[ROCm][CI] Disable async scheduling on ROCm for test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] ( #32355 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-15 04:53:43 +00:00
kzwrime
edadca109c
[Bugfix] Add CpuCommunicator.dispatch and combine to fix DP+MoE inference ( #31867 )
...
Signed-off-by: kunzh <zhikun.wu@outlook.com >
2026-01-15 04:50:48 +00:00
Li Wang
d86fc23bdd
[Misc] Remove redundant line ( #32366 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2026-01-15 04:29:56 +00:00
Shiyan Deng
375e5984fe
Support configure skip_special_tokens in openai response api ( #32345 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-01-15 04:07:26 +00:00
baonudesifeizhai
19b251fe3d
Fix optional parameter parsing in MiniMax M2 tool parser #32278 ( #32342 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-01-15 04:05:48 +00:00
Ryan Rock
15422ed3f7
[CI/Build][Hardware][AMD] Fix v1/shutdown ( #31997 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-01-15 04:01:42 +00:00
dolpm
8471b27df9
[compile] raise on compile_size implicit padding ( #32343 )
...
Signed-off-by: dolpm <34420038+dolpm@users.noreply.github.com >
2026-01-14 20:46:56 +00:00
Lumosis
66652e8082
[BugFix] Assign page_size_padded when unifying kv cache spec. ( #32283 )
...
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com >
2026-01-14 20:10:01 +00:00
vllmellm
e27078ea80
[Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten ( #32336 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-14 19:32:48 +00:00
Aleksandr Samarin
d084e9fca7
[MODEL] Fix handling of multiple channels for gpt-oss with speculative decoding ( #26291 )
...
Signed-off-by: Aleksandr Samarin <astrlrd@nebius.com >
Signed-off-by: southfreebird <yvorott@gmail.com >
Co-authored-by: southfreebird <yvorott@gmail.com >
2026-01-14 13:20:52 -05:00
qli88
3a612322eb
[CI] Move rixl/ucx from Dockerfile.rocm_base to Dockerfile.rocm ( #32295 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2026-01-14 16:53:36 +00:00
Cyrus Leung
9ea07b41da
[1/N] Reorganize multimodal processing code ( #32327 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-14 15:25:31 +00:00
Ning Xie
552b262936
rename tokenize serving api request id prefix to tokenize ( #32328 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-14 14:52:20 +00:00
Chauncey
00e6402d56
[Frontend] track responsesAPI server_load ( #32323 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-14 12:00:37 +00:00
Shanshan Shen
ce0946249d
[Misc] Make mem utils can be reused by other platforms ( #32322 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-01-14 03:46:01 -08:00
Cyrus Leung
3f28174c6a
[Frontend] Standardize use of create_error_response ( #32319 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-14 11:22:26 +00:00
Chauncey
769d0629e1
[Refactor] [9/N] to simplify the vLLM openai translations serving architecture ( #32313 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-14 10:20:58 +00:00
Cyrus Leung
90db5b31e4
[Refactor] Move top-level dummy data generation to registry ( #32310 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-14 02:17:46 -08:00
Roger Wang
b8199f6049
[Model] Re-implement Qwen3Omni Audio Encoder ( #32167 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-14 15:40:30 +08:00
sangho.lee
7e6f123810
Add Molmo2 multimodal model support ( #30997 )
...
Signed-off-by: sanghol <sanghol@allenai.org >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-14 15:33:09 +08:00
Chauncey
9312a6c03a
[Refactor] [8/N] to simplify the vLLM openai responsesapi_serving architecture ( #32260 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-14 07:26:24 +00:00
Michael Goin
6388b50058
[Docs] Add docs about OOT Quantization Plugins ( #32035 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-14 15:25:45 +08:00
Hongxia Yang
048bb59728
AMD CI Test - unskip moe_sum test and moe_align_block_size tests ( #32039 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2026-01-13 23:25:10 -08:00
Angela Yi
7933638051
[misc] Remove is_torch_equal_or_newer(2.4) cases ( #32296 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-01-13 23:22:07 -08:00
David
6b176095e3
[Build] Relax anthropic version pin from ==0.71.0 to >=0.71.0 ( #32289 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-13 23:21:39 -08:00
Andreas Karatzas
9d0d7f48d5
[ROCm][CI] Handle missing vision_config in Isaac model attention patch ( #32281 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-14 07:21:26 +00:00
Yi Liu
50632adc58
Consolidate Intel Quantization Toolkit Integration in vLLM ( #31716 )
...
Signed-off-by: yiliu30 <yi4.liu@intel.com >
2026-01-14 07:11:30 +00:00
Micah Williamson
6fa6e7ef0c
[ROCm][CI] Disable Async Scheduling For Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test ( #32275 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-14 13:29:42 +08:00
Woosuk Kwon
90c0836902
[Model Runner V2] Refactor Sampler ( #32245 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-13 17:58:12 -08:00
Roberto L. Castro
8ef50d9a6b
[Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding ( #30885 )
...
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-01-13 15:22:53 -08:00
emricksini-h
2a60ac91d0
[Improvement] Persist CUDA compat libraries paths to prevent reset on apt-get ( #30784 )
...
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
2026-01-13 14:35:05 -08:00
Michael Goin
9e65bb4ef4
Add mergify label job for "bug" in PR titles ( #31980 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-13 14:28:19 -08:00
Simon Mo
0db574b185
[Build] Add scripts for cherry-picking and trigger build ( #32282 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-01-13 13:21:05 -08:00
HappyAmazonian
2f4a71daf2
[Misc] Add In-Container restart capability through supervisord for sagemaker entrypoint ( #28502 )
...
Signed-off-by: Shen Teng <sheteng@amazon.com >
Signed-off-by: HappyAmazonian <91216626+HappyAmazonian@users.noreply.github.com >
2026-01-13 13:06:10 -08:00
Rabi Mishra
69f8a0ea37
fix(rocm): Use refresh_env_variables() for rocm_aiter_ops in test_moe ( #31711 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-13 19:11:54 +00:00
Wentao Ye
f28125d87b
[Perf] Optimize grouped topk kernel, 1.2%~2% E2E Throughput improvement ( #32058 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-13 10:58:18 -08:00
Dmitry Tokarev
46f8c6b725
Fix CUDA 13 wheel installation doc ( #32276 )
...
Signed-off-by: Dmitry Tokarev <dtokarev@nvidia.com >
2026-01-13 10:48:37 -08:00
Andrew Xia
af54d2e2d0
[responseAPI] support partial message generation ( #32100 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Signed-off-by: Andrew Xia <mitandrewxia@gmail.com >
Signed-off-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-01-13 10:41:26 -08:00
Sage Moore
6beef12b9b
[EPLB][Cleanup] Remove is_async_enabled from EplbModelState ( #32050 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-01-13 18:19:03 +00:00
Mark McLoughlin
ab74b2a27a
[Trivial] Remove duplicate enable_mfu_metrics ( #32246 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-01-14 01:09:23 +08:00
Matthew Bonanni
2263d44b68
[4/N][Attention] Move MLA common to model_executor ( #32060 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-13 09:08:45 -08:00
Mathis Felardos
4f3676e726
nixl_connector: export UCX_MEM_MMAP_HOOK_MODE=none to avoid a UCX memory leak ( #32181 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2026-01-13 16:21:10 +00:00
Martin Hickey
510265472c
[BugFix] [KVConnector] Fix KV events for LMCache connector ( #32169 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-13 15:50:34 +00:00
Chauncey
4f02cb2eac
[Refactor] [7/N] to simplify the vLLM lora serving architecture ( #32251 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-13 15:37:34 +00:00
Cyrus Leung
252c011012
[Refactor] Remove MultiModalProfiler ( #32254 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 15:10:20 +00:00
Matthew Bonanni
98f60e5acb
[6/N][Attention] Move utils to more appropriate locations ( #32215 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-13 05:38:52 -08:00
Chauncey
fefce49807
[Refactor] [6/N] to simplify the vLLM openai chat_completion serving architecture ( #32240 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-13 13:01:39 +00:00
Mickaël Seznec
a5bbbd2f24
[Quantization] fix: overflow with static per-tensor scaling ( #29867 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-13 12:56:01 +00:00
Nicolò Lucchesi
8c8653b672
[Docs] Nixl Usage recommend fail kv_load_failure_policy ( #32198 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-13 12:51:57 +00:00
Cyrus Leung
232214b2ae
[Bugfix] Replace PoolingParams.normalize with use_activation ( #32243 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 10:45:42 +00:00
Cyrus Leung
eb28e8068d
[Refactor] Remove get_encoder_dummy_data ( #32241 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 09:21:23 +00:00
YunzhuLu
542a4059b2
[Model] Use mm_position to compute mrope positions for Qwen2-VL/2.5-VL ( #32126 )
...
Signed-off-by: YunzhuLu <lucia.yunzhu@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-13 09:04:29 +00:00
Andreas Karatzas
df7e12715f
[ROCm][CI] Fix engine core client tests for ROCm spawn multiprocessing ( #32061 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-13 15:14:30 +08:00
Roy Wang
44c34f22d9
[Doc] Update installation from source command ( #32239 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-01-12 23:10:27 -08:00
Xingyu Liu
80221e1884
[BugFix]Fix eagle draft_model_config and add tests ( #31753 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
2026-01-12 23:09:36 -08:00
Andreas Karatzas
5e714f7ff4
[ROCm][CI] Fix HuggingFace flash_attention_2 accuracy issue in Isaac vision encoder ( #32233 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-12 22:33:59 -08:00
Andreas Karatzas
11b6af5280
[ROCm][Bugfix] Fix Mamba batched decode producing incorrect output ( #32099 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-13 05:46:53 +00:00
Wentao Ye
2a719e0865
[Perf] Optimize requests abort ( #32211 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-13 04:11:37 +00:00
Andrew Bennett
f243abc92d
Fix various typos found in docs ( #32212 )
...
Signed-off-by: Andrew Bennett <potatosaladx@meta.com >
2026-01-13 03:41:47 +00:00
Sanghoon Yoon
60b77e1463
[Frontend] Add reasoning_effort to OpenAIServing._preprocess_chat() ( #31956 )
...
Signed-off-by: Sanghoon Yoon <seanyoon@kakao.com >
2026-01-13 03:21:49 +00:00
cjackal
15b33ff064
[Misc] improve warning/assert messages ( #32226 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-01-13 03:11:23 +00:00
Nick Hill
c6bb5b5603
[BugFix] Fix engine crash caused by chat tools + response_format ( #32127 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-13 10:33:14 +08:00
Nick Hill
9273a427b5
[Misc] Allow enabling NCCL for DP sync when async scheduling ( #32197 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-13 02:03:08 +00:00
Cyrus Leung
78d13ea9de
[Model] Handle trust_remote_code for transformers backend ( #32194 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 09:30:12 +08:00
Andrew Xia
a307ac0734
[responsesAPI] add unit test for optional function tool call id ( #32036 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-01-12 16:14:54 -08:00
Divakar Verma
a28d9f4470
[ROCm][CI] Handle pytest status code 5 when a shard isn't allocated any tests ( #32040 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-12 17:35:49 -05:00
xuebwang-amd
629584bfc9
[Kernel][MoE] fix computation order of MoE weight multiplication and improve flow ( #31962 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-01-12 17:17:30 -05:00
Woosuk Kwon
0a7dd23754
[Model Runner V2] Add support for M-RoPE ( #32143 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-12 13:37:43 -08:00
Woosuk Kwon
dec28688c5
[Model Runner V2] Minor refactor for logit_bias ( #32209 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-12 13:08:30 -08:00
Vadim Gimpelson
9f430c94bd
[BUGFIX] Add missed remaping of the names of fp8 kv-scale ( #32199 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-12 20:42:06 +00:00
Nicolò Lucchesi
f8bd8394e3
[NIXL][Bugfix] Failure logging overhaul + early metadata free on failure ( #32031 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-12 20:38:49 +00:00
Woosuk Kwon
ca81811bfe
[Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens ( #32163 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-12 11:31:10 -08:00
Lucas Kabela
ad8818bb5e
[Misc][BE] Type coverage for vllm/compilation [3/3] ( #31748 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-12 19:24:38 +00:00
Nicolò Lucchesi
08e8e99ce7
[Misc] Change log level for batch queue log ( #32192 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-12 18:59:31 +00:00
Or Ozeri
2be765b68a
[BugFix] scheduler: Fix ordering preserving of skipped requests ( #32173 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-12 18:39:38 +00:00
Roger Wang
16abe6b85a
[Misc] Set default torch num threads for input processing ( #31879 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-12 10:28:16 -08:00
Ilya Markov
1eb61ab34b
[Refactor] EPLB rebalance algo to NumPy ( #30697 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-01-12 18:13:23 +00:00
Kyungmin Lee
3d962d72ab
[BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE ( #32196 )
...
Signed-off-by: lkm2835 <lkm2835@gmail.com >
2026-01-12 10:00:45 -08:00
Matthew Bonanni
20228cb851
[3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py ( #32054 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-12 09:13:56 -08:00
Cyrus Leung
7c0d3c5152
[Benchmark] Share data between SLA runs ( #32184 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-13 01:12:22 +08:00
Nicolò Lucchesi
5b68107411
[Misc][PD] Fix get_attn_backend usage in transfer connectors ( #31988 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-12 18:10:05 +01:00
Asaf Joseph Gardin
8fb2c135be
[Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode ( #32118 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-01-12 17:02:38 +00:00
Cyrus Leung
8863c2b25c
[Model] Standardize pooling heads ( #32148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-12 17:01:49 +00:00
danielafrimi
3f72639d36
[FIX] Add NO_MUL activation support for modular kernel path ( #31528 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
Signed-off-by: <>
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local >
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: root <root@pool0-01777.cm.cluster >
2026-01-12 11:55:49 -05:00
Jaehyun An
6bc9c8473e
[MODEL] New model support for kakaocorp/kanana-1.5-v-3b-instruct ( #29384 )
...
Signed-off-by: Jaehyun An <steve.ai@kakaocorp.com >
2026-01-12 16:39:02 +00:00
Kyungmin Lee
63ed2409e8
Add K-EXAONE-236B-A23B ( #31621 )
...
Signed-off-by: lkm2835 <lkm2835@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: lgai-exaone <exaonemodels@lgresearch.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-12 16:30:50 +00:00
Andy Zhang
95e53d907c
doc: Update model references in supported_models.md ( #32188 )
2026-01-12 08:15:28 -08:00
TJian
0346396e94
[ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base ( #32179 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-12 15:33:21 +00:00
Andy Zhang
e68b0dad8b
doc: Update model name for Qwen3-Coder in documentation ( #32185 )
...
Signed-off-by: Andy Zhang <xiazhang@microsoft.com >
2026-01-12 07:10:50 -08:00
Or Ozeri
9cddbdba6d
OffloadingConnector: Add cpu_bytes_to_use configuration ( #24498 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-12 15:00:43 +00:00
Hongxin Xu
49e6b86c91
[Feature] Support recording expert indices for rollout router replay ( #28284 )
...
Signed-off-by: xhx1022 <1737006628@qq.com >
Signed-off-by: Hongxin Xu <70438206+xhx1022@users.noreply.github.com >
Signed-off-by: arlenxu <arlenxu@tencent.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: arlenxu <arlenxu@tencent.com >
2026-01-12 06:23:04 -08:00
dtc
0565f1fdec
[P/D] Refactor mooncake connector sender thread using async coroutines ( #31573 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-01-12 12:35:35 +00:00
Isotr0py
9dbe1fe960
[Bugfix] Fix missing scale passing for encoder Triton Attention implementation ( #32149 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-12 11:13:41 +00:00
RickyChen / 陳昭儒
a5f89ae296
[Doc] Add documentation for offline API docs feature ( #32134 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
2026-01-12 10:33:48 +00:00
Jee Jee Li
05e8981234
[Doc] Improve LoRA docs ( #32159 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-12 02:19:17 -08:00
XlKsyt
899541bdb1
[doc] fix broken links ( #32158 )
...
Signed-off-by: minimAluminiumalism <caixuesen@outlook.com >
2026-01-12 10:18:38 +00:00
daniel-salib
d7b2e57097
[Frontend] Fix Flaky MCP Streaming Test ( #32153 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-01-12 18:03:32 +08:00
Andika Rachman
5e034f2e3d
[cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend ( #32092 )
...
Signed-off-by: andikarachman <andika.rachman.y@gmail.com >
2026-01-12 10:03:28 +00:00
Nicolò Lucchesi
22970c1626
[Misc] Disable default --ready-check-timeout-sec extra call in vllm bench ( #30975 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-12 01:58:21 -08:00
Cyrus Leung
600aaab8d6
[Model] Remove incorrect SupportsPP from MTP models ( #32150 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-12 01:19:30 -08:00
wang.yuqi
60446cd684
[Model] Improve multimodal pooling examples ( #32085 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-12 07:54:09 +00:00
Cyrus Leung
9101dc756c
[Model] Avoid hardcoding pooling type ( #32119 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-11 21:28:12 -08:00
Woosuk Kwon
025a32f9ed
[Model Runner V2] Remove async barrier ( #32083 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-11 20:24:30 -08:00
Woosuk Kwon
19504ac07f
[Model Runner V2] Skip building deprecated fields in attn metadata ( #32132 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-11 14:31:04 -08:00
Jiangyun Zhu
3df619ac94
[CI] fix test_concat_and_cache_mla_rope_fused ( #32117 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-01-11 15:11:11 +00:00
Ning Xie
d74132ca3b
fix offline inference chat response prompt ( #32088 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-11 14:01:18 +00:00
maang
a34abc49b7
[FixBug] Improve exception string in tensorizer.py ( #31680 )
...
Signed-off-by: maang <maang_h@163.com >
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-11 05:01:53 -08:00
rongfu.leng
d70249e2e9
[Misc] fix this log format not space ( #32112 )
...
Signed-off-by: lengrongfu <lenronfu@gmail.com >
2026-01-11 05:01:16 -08:00
Cyrus Leung
a374532111
[CI/Build] Separate out flaky responses API tests ( #32110 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-11 05:01:12 -08:00
Isotr0py
cee7436a26
[Misc] Make scipy as optional audio/benchmark dependency ( #32096 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-11 00:18:57 -08:00
Or Ozeri
4c16ba617f
[KVConnector] OffloadingConnector: Fix bug in handling of preemptions ( #29870 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-11 08:05:36 +00:00
Matt
bde57ab2ed
[Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group ( #31713 )
...
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-01-10 23:19:46 -08:00
Fadi Arafeh
9103ed1696
[CPU][BugFix] Disable AOT Compile for CPU ( #32037 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-10 23:15:49 -08:00
Laith Sakka
46eb30f519
make assume_32_bit_indexing configurable ( #32044 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-01-10 23:15:46 -08:00
Andy Liu
0dd63639be
[MTP][GLM][Bugfix] Fixed .weight_scale loading logic that dropped MTP prediction accuracy with fp8+mtp ( #32101 )
...
Signed-off-by: Andy Liu <andyliu@roblox.com >
2026-01-10 23:14:54 -08:00
Cyrus Leung
ef96fa3f1f
[Benchmark][2/2] Use spline interpolation to tune SLA variables ( #32095 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-10 20:27:27 -08:00
Or Ozeri
2a4dbe24ea
[BugFix] Wait for compute before offloading KV to CPU ( #31341 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-10 22:25:08 +00:00
RickyChen / 陳昭儒
8020a60402
[Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification ( #32089 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-10 12:40:09 -08:00
Vadim Gimpelson
e15a5ff07b
[MISC] Add strict contiguity check for FlashInfer attention tensors ( #32008 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
2026-01-10 12:40:05 -08:00
Vensen
6ea001cfb7
[Bugfix][Quantization] Ensure input contiguity in per_token_quant_int8 ( #31637 )
...
Signed-off-by: vensen <vensenmu@gmail.com >
2026-01-10 12:40:02 -08:00
shyeh25
1c46dea001
Revert "[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… ( #31617 )
...
Signed-off-by: shyeh25 <206795756+shyeh25@users.noreply.github.com >
2026-01-10 12:39:59 -08:00
Or Ozeri
028599739d
[BugFix] scheduler: Fix resuming of preempted requests after async load ( #31583 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-10 12:39:25 -08:00
gnovack
d1fd802fa3
fused_moe_kernel - cast accumulator after applying router weights ( #32002 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2026-01-11 04:36:45 +08:00
Xin Yang
543c23be78
[LoRA][Perf] Improve FusedMoE LoRA performance for small rank ( #32019 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-10 11:04:18 -08:00
jvlunteren
b8bf5c45bb
[Kernel] Optimize Sliding Window Attention in 3D Triton Kernel ( #31984 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2026-01-10 18:13:44 +00:00
Michael Goin
e6c6f2c79d
[Quant] Support MXFP4 W4A16 for compressed-tensors dense models ( #31926 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-01-10 06:44:35 -08:00
Jeremy Teboul
07286ec5a6
[Bugfix] Fix integer overflow in Gemma3n audio processing ( #31657 )
...
Signed-off-by: Jeremy Teboul <jeremyte@meta.com >
2026-01-10 17:52:53 +08:00
Ning Xie
14fc7a68c7
[Bugfix] fix offline chat output prompt ( #32076 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-10 07:50:57 +00:00
Cyrus Leung
5f2385a4c8
[Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins ( #32075 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-10 07:11:03 +00:00
Frelam
a01a1c0d69
[Bugfix] fix encoder cache leak of waiting requests in scheduler to solve stuck in CPU scheduling ( #31857 )
...
Signed-off-by: frelam <frelam112233@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-10 06:27:58 +00:00
Lucas Wilkinson
da6709c9fe
[Misc] Delay deprecation of CommonAttentionMetadata properties ( #32074 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-09 21:06:44 -08:00
Andreas Karatzas
d83becd503
[ROCm][CI] Fix flaky test_function_calling_with_stream and reduce schema test examples ( #32063 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-10 05:02:35 +00:00
roikoren755
0c9614876e
Update modelopt KV cache quantization resolution to new scheme ( #31895 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-01-10 04:54:13 +00:00
Cyrus Leung
583a90e005
[Refactor] Separate sequence and token pooling types ( #32026 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-10 04:53:24 +00:00
maang
52d428295d
[Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward ( #31939 )
...
Signed-off-by: maang <maang_h@163.com >
2026-01-10 04:19:49 +00:00
Kevin McKay
c60578de0a
[Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process ( #31295 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-01-10 03:57:38 +00:00
PatrykSaffer
80fead8bf6
Fuse RoPE and MLA KV-cache write ( #25774 )
...
Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com >
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai >
Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-09 19:18:37 -08:00
Akshat Shrivastava
e45946bd91
feature/issac 0.2 ( #31550 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-01-10 03:18:05 +00:00
Lucas Kabela
ea6d067a2a
[Misc][LLaMa4] Compile LLaMa Vision Encoder ( #30709 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-09 22:01:38 -05:00
Ning Xie
abd9224280
resolve pydantic error in startup benchmark ( #31348 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-10 02:41:27 +00:00
Kevin McKay
4dc0d606b7
[Bugfix] Narrow broad exceptions in compilation backends ( #31616 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-01-09 21:39:22 -05:00
Micah Williamson
ac0675ff6b
[CI] Allow Deprecated Quantization For LM Eval Tests ( #32065 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-01-09 19:10:47 -07:00
Wentao Ye
e18464a57d
[Perf] Optimize async scheduling placeholder using empty ( #32056 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-10 00:46:11 +00:00
Russell Bryant
1963245ed1
[Core] Use weights_only=True with torch.load ( #32045 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-01-10 00:28:57 +00:00
Matthew Bonanni
0308901975
[2/N][Attention] Fix pre-commit errors ( #32052 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-10 00:27:15 +00:00
Lucas Kabela
aaf4b70aae
[Misc][BE] Type coverage for vllm/compilation [2/3] ( #31744 )
2026-01-09 18:30:38 -05:00
Nick Hill
3adffd5b90
[Misc] Enable async scheduling by default with spec decoding ( #31998 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-09 23:09:34 +00:00
zhrrr
97ba96fbe9
[perf][async] support non cpu sync get logprob tensors for spec ( #31336 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-01-09 21:24:51 +00:00
Chendi.Xue
94578127a4
[NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout ( #30275 )
2026-01-09 21:22:19 +00:00
Matthew Bonanni
2612ba9285
[1/N][Attention] Restructure attention: move files ( #31916 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-09 13:10:24 -08:00
Andrew Xia
1f8b7c536b
[responsesAPI] fix incomplete_messages for simple/parsable context ( #31836 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-01-09 21:00:57 +00:00
Lucas Wilkinson
0a0aa07747
[Quant] Make static quant support all group shapes ( #30833 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-09 12:49:27 -08:00
jiahanc
f9e2a75a1e
[fix] add cutedsl to global sf ( #32001 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2026-01-09 12:03:02 -08:00
Runkai Tao
a4d5d663e2
Add unpermute-aware fused MoE path and small-batch fallback ( #29354 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-09 12:58:39 -07:00
Jeremy Teboul
657e9c0e18
[Fix] Introduce audio channels spec ( #31595 )
...
Signed-off-by: Jeremy Teboul <jeremyte@meta.com >
2026-01-09 19:34:51 +00:00
Wentao Ye
308feab33f
[Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement ( #31830 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-09 11:13:43 -08:00
Wentao Ye
28ae32a5d3
[Refactor] Remove numpy split in async scheduling ( #32034 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-09 19:09:02 +00:00
Andrew Xia
f32c629eb4
[Frontend][gpt-oss] Allow system message to overwrite model identity ( #31737 )
...
Signed-off-by: lacora <hyelacora@gmail.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: lacora <hyelacora@gmail.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-01-09 14:03:57 -05:00
Yifan Qiao
cd4a95e3aa
[Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator ( #31707 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2026-01-09 10:53:20 -08:00
Michael Goin
d5ec6c056f
[UX] Add vLLM model inspection view ( #29450 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-09 10:12:35 -07:00
Shanshan Shen
08d954f036
[Doc] Add developer guide for CustomOp ( #30886 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-01-09 16:21:11 +00:00
Kevin Šuc
ac9f9330e6
Rename --exclude-log-deltas to --enable-log-deltas ( #32020 )
...
Signed-off-by: Catacomba <kevinsuc16@gmail.com >
2026-01-09 15:30:40 +00:00
Isotr0py
2d0c5b630e
[Doc] Remove hardcoded Whisper in example openai translation client ( #32027 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-09 14:44:52 +00:00
Michael Goin
34cd32fe30
[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe ( #31832 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-01-09 07:40:33 -07:00
R3hankhan
8e27663b6a
[CPU] Add head sizes 80 and 112 with vec16 fallback ( #31968 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-01-09 22:14:46 +08:00
maang
7cdf7e2fe0
[Model] Remove redundant None check in DeepSeekOCR image input processing ( #32016 )
...
Signed-off-by: maang <maang_h@163.com >
2026-01-09 06:12:44 -08:00
Adolfo Victoria
bbf80ede43
Fix type error ( #31999 )
...
Signed-off-by: Adolfo Victoria <adolfokarim@gmail.com >
Co-authored-by: Adolfo Victoria <adovi@meta.com >
2026-01-09 22:03:32 +08:00
inkcherry
4505849b30
[ROCm][PD] add moriio kv connector. ( #29304 )
...
Signed-off-by: inkcherry <mingzhi.liu@amd.com >
2026-01-09 14:01:57 +00:00
Roger Wang
db07433ce5
[Misc] Skip hashing kwargs if value is None ( #32025 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-01-09 13:20:59 +00:00
Andreas Karatzas
e02706d2d2
[ROCm][CI][V1] Fix nixl_connector test failure and achieve CUDA parity in test_async_scheduling ( #32000 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-09 20:48:32 +08:00
Sophie du Couédic
b474782ad7
[Feature][Benchmarks] Custom dataset: read output length from dataset ( #31881 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
2026-01-09 12:40:59 +00:00
Bofeng Xue
55212c1404
fix: remove duplicate engine_id check in nixl_connector ( #31948 )
...
Signed-off-by: Bofeng BF1 Xue <xuebf1@Lenovo.com >
Co-authored-by: Bofeng BF1 Xue <xuebf1@Lenovo.com >
2026-01-09 12:13:17 +00:00
Xin Yang
e7b68f4d6c
[Bugfix] Fix Triton FusedMoE LoRA ( #30585 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-09 11:46:59 +00:00
vllmellm
1a19e9cd87
[Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten ( #31380 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-09 19:28:02 +08:00
Cyrus Leung
c8ed39b9dd
[Model] Reorganize pooling layers ( #31973 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-09 11:02:14 +00:00
Andreas Karatzas
020732800c
[Bugfix] Fix OpenAPI schema test failures ( #31921 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-09 10:56:20 +00:00
Alex Brooks
dc77cb7129
[Bugfix] Fix Var Length Batched Padding in Granite Speech ( #31906 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2026-01-09 10:28:43 +00:00
gnovack
bde38c11df
fix lora moe sharding when rank < max_lora_rank ( #31994 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-09 14:43:25 +08:00
Xin Yang
707b240d7e
[Bugfix] Fix FusedMoE LoRA w2_output_size ( #31949 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-09 00:54:05 -05:00
Nick Hill
29ce48221c
[Cleanup] Remove obsolete spec decoding compatibility logic ( #32003 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-09 05:44:18 +00:00
TJian
7a05d2dc65
[CI] [ROCm] Fix tests/entrypoints/test_grpc_server.py on ROCm ( #31970 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-09 12:54:20 +08:00
Divakar Verma
a1648c4045
[ROCm][CI] Fix test_token_classification.py::test_bert_models ( #31993 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-01-09 04:04:33 +00:00
RioS
e2d49ec2a4
[Bugfix] missing tokens occur in harmony streaming ( #30437 )
...
Signed-off-by: RioS <aa248424@gmail.com >
Signed-off-by: Ri0S <aa248424@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-01-09 03:59:34 +00:00
Xin Yang
8413868dab
[Bugfix] Fix typo in FusedMoE LoRA reshape comment ( #31992 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-01-08 18:46:05 -08:00
zhrrr
8ff4a99566
[Async][Feat] support apply penalty or bad_words for async + spec ( #30495 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-09 02:31:50 +00:00
daniel-salib
a4ec0c5595
[Frontend] Add MCP tool streaming support to Responses API ( #31761 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-01-09 09:19:34 +08:00
Robert Shaw
0fa8dd24d2
[Bugfix] Fix Typo from NVFP4 Refactor ( #31977 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-08 16:18:50 -08:00
Max Hu
6ebe34d6fa
[Feature] Add iteration level logging and enhance nvtx marker ( #31193 )
...
Signed-off-by: Max Hu <maxhu@nvidia.com >
Signed-off-by: Max Hu <hyoung2991@gmail.com >
Co-authored-by: Max Hu <maxhu@nvidia.com >
2026-01-09 00:13:39 +00:00
Nick Hill
11cec296dd
[BugFix] Add spec-decode-incompatible request param validation ( #31982 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-09 00:08:21 +00:00
Robert Shaw
5825bbc1f7
[Quantization] Deprecate Long Tail of Schemes ( #31688 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-08 19:07:45 -05:00
Yongye Zhu
d62cfe546d
[MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection ( #31752 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-08 19:01:30 -05:00
Lucas Wilkinson
6cdf015c3c
[Misc] Fix Current vLLM config is not set. warnings, assert to avoid issues in the future ( #31747 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-08 15:20:49 -08:00
Dipika Sikka
5d3b6097ad
[Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs ( #30881 )
2026-01-08 17:45:17 -05:00
bnellnm
e74698c27a
[Misc][Refactor] Add FusedMoERouter object ( #30519 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-01-08 20:52:55 +00:00
Cyrus Leung
aa125ecf0e
[Frontend] Improve error message ( #31987 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-08 20:07:03 +00:00
Lucas Kabela
f16bfbe5bc
[Documentation][torch.compile] Add documentation for torch.compile + multimodal encoders ( #31627 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-08 14:33:24 -05:00
Michael Goin
87e07a6b46
Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE kernels" ( #31978 )
2026-01-08 11:31:53 -08:00
Woosuk Kwon
7508243249
[Model Runner V2] Simplify BlockTables with UVA ( #31965 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2026-01-08 10:24:26 -08:00
Nicolò Lucchesi
83e1c76dbe
[CI][ROCm] Fix NIXL tests on ROCm ( #31728 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-01-09 01:34:43 +08:00
Nishidha Panpaliya
a563866b48
Fix ijson build for Power. ( #31702 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2026-01-08 17:12:33 +00:00
Nick Hill
a3d909ad2b
[Misc] Tidy up some spec decode logic in GPUModelRunner ( #31591 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-08 09:10:07 -08:00
Jee Jee Li
49568d5cf9
[Doc] Improve MM models LoRA notes ( #31979 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-08 08:55:22 -08:00
danisereb
b8112c1d85
[Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 ( #31960 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-01-08 16:08:37 +00:00
Chauncey
eaba8ece77
[Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming ( #31969 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-08 15:28:13 +00:00
yxing-bj
fe86be66c5
[Model] Support IQuestCoder model ( #31575 )
...
Signed-off-by: yxing <yxing@iquestlab.com >
2026-01-08 14:42:57 +00:00
Chauncey
1da3a5441a
[Docs]: update claude code url ( #31971 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-08 14:04:55 +00:00
TJian
72c068b8e0
[CI] [Bugfix] Fix unbounded variable in run-multi-node-test.sh ( #31967 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-01-08 05:42:01 -08:00
Mary
7645bc524b
[OpenAI] Fix tool_choice=required streaming when output has trailing extra data ( #31610 )
...
Signed-off-by: maylikenoother <ogedengbemary19@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-01-08 21:01:42 +08:00
Ce Zhao
1123a87892
[Model] Enable LoRA support for Pixtral ( #31724 )
...
Signed-off-by: 赵策 <alcor@zhaocedeMacBook-Air.local >
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com >
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com >
2026-01-08 05:00:57 -08:00
tianshu-Michael-yu
03fd76c570
[Model] Add LFM2-VL model support ( #31758 )
...
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-08 05:00:27 -08:00
Bijaya Dangol
59d260f5e4
[Model] Add Grok-2 ( #31847 )
...
Signed-off-by: dangoldbj <dangoldbj23@gmail.com >
2026-01-08 04:59:48 -08:00
Patrick von Platen
18d4e481d0
[Voxtral] Fix speech transcription api ( #31388 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: bk-201 <joy25810@foxmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: bk-201 <joy25810@foxmail.com >
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com >
Co-authored-by: Anexdeus <5142168@mail.ru >
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-01-08 18:34:19 +08:00
Isotr0py
2972a05473
[MM Encoder]: Make MMEncoderAttention's scale takes effect properly ( #31950 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-08 02:33:48 -08:00
Cyrus Leung
5576227bc1
[Model] Standardize common vision encoders ( #31947 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-08 02:33:16 -08:00
Cyrus Leung
d1b6fe007f
[Chore] Further cleanup pooler ( #31951 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-08 02:16:21 -08:00
omer-dayan
04a49669d1
RayLLM Bugfix - Preserve obj store URL for multi engine_config creation ( #30803 )
...
Signed-off-by: Omer Dayan <omdayan@nvidia.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-08 10:00:25 +00:00
BingjiaWang
96fcd3c267
[Misc] Support qwen3-next lora ( #31719 )
2026-01-08 09:27:50 +00:00
DevByteAI
1f214290d6
fix(compile): apply partition wrapper when loading AOT cached functions ( #31536 )
...
Signed-off-by: Devbyteai <abud6673@gmail.com >
Signed-off-by: DevByteAI <161969603+devbyteai@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-08 17:27:26 +08:00
Ryan Rock
8cbdc7eb94
[CI/Build] Enable test_kv_cache_events_dp for AMD ( #31834 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-01-08 09:00:24 +00:00
Lumosis
b634e619bb
Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. ( #31635 )
...
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com >
Signed-off-by: Lumosis <30372757+Lumosis@users.noreply.github.com >
2026-01-08 09:00:07 +00:00
Isotr0py
eac3b96ec0
[Models] Allow converting Qwen3-VL into Reranker model ( #31890 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-08 08:10:15 +00:00
Zhiwei
573a1d1119
[ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch ( #31905 )
...
Signed-off-by: ZhiweiYan-96 <zhiwei.yan@amd.com >
2026-01-08 15:47:44 +08:00
Shang Wang
33156f56e0
[docker] A follow-up patch to fix #30913 : [docker] install cuda13 version of lmcache and nixl ( #31775 )
...
Signed-off-by: Shang Wang <shangw@nvidia.com >
2026-01-07 23:47:02 -08:00
Rabi Mishra
107cf8e92f
fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN ( #31712 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-08 15:46:07 +08:00
Zyyeric
63baa28cf5
[Model] Enable LoRA support for tower and connector in GLM4-V ( #31652 )
...
Signed-off-by: Zyyeric <eric1976808123@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-08 15:45:53 +08:00
Andy Liu
e5173d3bac
[Bugfix] Remove the num_hidden_layers override for glm4_moe ( #31745 )
2026-01-08 15:45:10 +08:00
prashanth058
d3235cb503
[Fix] Enable mm_processor_cache with vision LoRA ( #31927 )
...
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com >
2026-01-08 15:31:51 +08:00
Nick Hill
287b37cda4
[BugFix] Fix spec decoding edge case bugs ( #31944 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-08 15:31:03 +08:00
Chang Su
791b2fc30a
[grpc] Support gRPC server entrypoint ( #30190 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
Signed-off-by: njhill <nickhill123@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: njhill <nickhill123@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2026-01-07 23:24:46 -08:00
Lucas Wilkinson
be6a81f31b
[chore] Update FA commit ( #30460 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-07 23:24:18 -08:00
Ronald
2ab441befe
[platform] add dp_metadata arg to set_additional_forward_context ( #31942 )
...
Signed-off-by: Ronald1995 <ronaldautomobile@163.com >
2026-01-08 06:56:44 +00:00
ShaanveerS
9572f74f15
[Model] Enable LoRA support for tower and connector in DotsOCR ( #31825 )
...
Signed-off-by: ShaanveerS <shaanver.singh@gmail.com >
2026-01-08 14:50:16 +08:00
Andreas Karatzas
5f2a473ff3
[ROCm][CI] v1 cpu offloading attention backend fix ( #31833 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-08 14:37:50 +08:00
Michael Goin
6b2a672e47
[Doc] Add Claude code usage example ( #31188 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-08 13:50:23 +08:00
rasmith
f1b1bea5c3
[CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 ( #31873 )
...
Signed-off-by: Randall Smith <ransmith@amd.com >
Co-authored-by: Randall Smith <ransmith@amd.com >
2026-01-08 13:06:09 +08:00
Charlie Fu
cddbc2b4b2
[ROCm][CI] Add rocm support for run-multi-node-test.sh ( #31922 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-08 04:36:39 +00:00
Andreas Karatzas
087a138963
[ROCm][CI] Fix attention backend test flakiness from uninitialized KV cache memory ( #31928 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-08 04:35:25 +00:00
Andreas Karatzas
c4041f37a4
[ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling ( #31931 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-08 04:17:56 +00:00
Richard Zou
a79079feef
[BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 ( #31915 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-01-08 04:04:58 +00:00
Robert Shaw
9f6dcb71ae
[MoE Refactor][16/N] Apply Refactor to NVFP4 ( #31692 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-01-08 03:46:27 +00:00
Andreas Karatzas
8dd2419fa9
[CI] Skip Qwen-VL in multimodal processing tests due to flaky external dependency ( #31932 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-08 02:58:01 +00:00
Rabi Mishra
39d82005f7
fix(rocm): add early return in get_flash_attn_version for ROCm ( #31286 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-08 10:28:07 +08:00
Rabi Mishra
25eef3dc2e
feat(moe): Add is_act_and_mul=False support for Triton MoE kernels ( #31645 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2026-01-08 10:27:09 +08:00
Matthew Bonanni
0d7667419f
[0/N][Attention] Fix miscellaneous pre-commit issues ( #31924 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-01-08 01:15:17 +00:00
Robert Shaw
5dcd7ef1f2
[MoE Refactor][15/N] Apply Refactor to Fp8 ( #31415 )
2026-01-07 19:42:33 -05:00
Elvir Crnčević
ffc0a2798b
Add back missing DeepEP LL params ( #31911 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
2026-01-07 17:47:54 -05:00
Nick Hill
10ef65eded
[BugFix] Fix bad words with speculative decoding ( #31908 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-01-07 15:46:42 -05:00
Ilya Markov
6170d47d22
[EPLB] Optimize EPLB with numpy ( #29499 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-01-07 15:21:35 -05:00
Xin Yang
0ada960a20
[Kernel] Support bias type in grouped_topk kernel ( #31781 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-07 12:16:32 -08:00
Ning Xie
c907d22158
[refactor] refactor memory constants usage ( #31865 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-07 18:37:31 +00:00
Michael Goin
f347ac6c34
[Perf] Fuse stride preparation for NVFP4 cutlass_moe ( #31837 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-07 13:31:26 -05:00
Festus Ayobami Owumi
05f47bd8d2
[Doc] Fix: Correct vLLM announcing blog post link in docs ( #31868 )
...
Signed-off-by: enfinity <festusowumi@gmail.com >
2026-01-07 10:06:42 -08:00
roikoren755
bf184a6621
Enable quantized attention in NemotronH models ( #31898 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-01-07 17:37:19 +00:00
Jee Jee Li
30399cc725
UX: add vLLM env info in '/server_info' ( #31899 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-07 17:13:02 +00:00
Kfir Toledo
b89443b8d9
[KVConnector]: Enable Cross-layers KV cache layout for MultiConnector ( #30761 )
...
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com >
2026-01-07 16:59:43 +00:00
Marko Rosenmueller
1d9e9ae8a4
[Bugfix]: prevent leaking tokens in crash log ( #30751 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2026-01-07 16:15:19 +00:00
Cyrus Leung
b7036c87a1
[Refactor] Clean up pooler modules ( #31897 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-08 00:07:43 +08:00
Kate Cheng
cc6dafaef2
[Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 Linear Op) ( #29213 )
...
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com >
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
2026-01-07 10:53:54 -05:00
R3hankhan
1ab055efe6
[OpenAI] Extend VLLMValidationError to additional validation parameters ( #31870 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-01-07 14:45:49 +00:00
Cyrus Leung
b665bbc2d4
[Chore] Migrate V0 attention utils ( #31891 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-07 13:44:36 +00:00
Jared Wen
974138751b
[Refactor] GLM-ASR Modeling ( #31779 )
...
Signed-off-by: JaredforReal <w13431838023@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-07 13:08:29 +00:00
vllmellm
41cfa50632
[ROCm][AITER] fix wrong argument passed to AITER flash_attn_varlen_func ( #31880 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-07 11:25:03 +00:00
Andy Liu
d111bc53ad
[Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on ( #31757 )
...
Signed-off-by: Andy Liu <andyliu@roblox.com >
2026-01-07 09:18:52 +00:00
BlankR
0790f07695
[Misc] Improve error messages for unsupported types and parameters ( #30593 )
...
Signed-off-by: BlankR <hjyblanche@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-07 09:00:16 +00:00
maang
1f33e38e81
[Model] Cleanup: Remove redundant manual definition of make_empty_intermediate_tensors in GLM-4-MoE ( #31869 )
...
Signed-off-by: maang <maang_h@163.com >
2026-01-07 08:18:28 +00:00
sihao_li
59fe6f298e
[XPU]fallback to TRITON_ATTN on xpu when use float32 dtype ( #31762 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-01-07 08:10:29 +00:00
weiyu
e7596371a4
[Refactor][TPU] Remove torch_xla path and use tpu-inference ( #30808 )
...
Signed-off-by: Wei-Yu Lin <weiyulin@google.com >
Signed-off-by: weiyu <62784299+weiyu0824@users.noreply.github.com >
2026-01-07 16:07:16 +08:00
xuebwang-amd
0dd5dee9b9
[Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe ( #31676 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-01-07 07:36:13 +00:00
Kevin McKay
4614c5a539
[Bugfix][Hardware][AMD] Consolidate FP8 min/max values helper function ( #31106 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Signed-off-by: Kevin McKay <kevin@example.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-01-07 06:55:03 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
482914849c
[BugFix] LoRA: Support loading base_layer of experts ( #31104 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2026-01-07 14:49:39 +08:00
tianshu-Michael-yu
efeaac92f2
[Bugfix] Fix race condition in async-scheduling for vlm model ( #31841 )
...
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com >
2026-01-07 06:45:10 +00:00
tjp_zju
55caa6051d
refactor: find_loaded_library ( #31866 )
...
Signed-off-by: tjp_zju <tanjianpingzju1990@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-07 06:42:20 +00:00
Lucas Wilkinson
c7a79d41a0
[Attention][3/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties ( #31850 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-07 13:31:34 +08:00
vllmellm
6409004b26
[ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend ( #31816 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-07 05:04:53 +00:00
Cyrus Leung
aafd4d2354
[Chore] Try remove init_cached_hf_modules ( #31786 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-07 12:34:04 +08:00
Jack Yang
0a2c2dc3f1
fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround ( #31465 )
...
Signed-off-by: Zhuohao Yang <zy242@cornell.edu >
Co-authored-by: Zhuohao Yang <zy242@cornell.edu >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-07 04:08:47 +00:00
Tyler Michael Smith
f09c5feb7c
Change warning in get_current_vllm_config to report caller's line number ( #31855 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-01-07 03:48:13 +00:00
Cyrus Leung
1b8af957f6
[Doc] Update release docs ( #31799 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-07 03:27:40 +00:00
Ce Zhao
a051525e07
[Model] Enable LoRA support for PaliGemma ( #31656 )
...
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com >
Signed-off-by: Alcor <alcor_zhao@outlook.com >
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com >
2026-01-07 10:09:32 +08:00
Yihua Cheng
5b833be49e
[1/2][lmcache connector] clean up lmcache multi-process adapter ( #31838 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2026-01-07 02:02:42 +00:00
Lucas Kabela
873480d133
[Misc][BE] Type coverage for vllm/compilation [1/3] ( #31554 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-01-06 20:37:51 -05:00
vSeamar
6f351548b2
[Frontend] Implement robust video frame recovery for corrupted videos ( #29197 )
...
Signed-off-by: cmartinez <cmartinez@roblox.com >
Signed-off-by: vSeamar <cmartinez@roblox.com >
2026-01-07 01:13:24 +00:00
Andreas Karatzas
364a8bc6dc
[ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing VLLM_FLOAT32_MATMUL_PRECISION from all ROCm tests ( #31829 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-07 01:12:23 +00:00
Angela Yi
9a1d20a89c
[CI] Add warmup run in test_fusion_attn ( #31183 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-07 00:31:52 +00:00
Cyrus Leung
309a8f66ee
[Bugfix] Handle mistral tokenizer in get_hf_processor ( #31817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-07 07:46:56 +08:00
Andreas Karatzas
e5d427e93a
[ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) ( #31835 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-06 23:23:11 +00:00
Andreas Karatzas
2a42ae790d
[ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm ( #31820 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-06 23:21:15 +00:00
Matthew Bonanni
d49899732e
[Spec Decode][UX] Add acceptance stats to vllm bench serve report ( #31739 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-01-06 21:21:42 +00:00
Elvir Crnčević
dba95378a6
Report error log after vllm bench serve ( #31808 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
2026-01-06 20:24:19 +00:00
Nikhil G
ada6f91d56
Fix RecursionError in MediaWithBytes unpickling ( #31191 )
...
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com >
2026-01-06 20:11:26 +00:00
Li, Jiang
8becf146bd
[Quantization][Refactor] Move CPU GPTQ kernel into MP linear ( #31801 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-06 19:10:18 +00:00
Charlie Fu
c07163663d
[ROCm][CI] Fix tests/compile unit tests ( #28895 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-01-06 18:50:43 +00:00
Benjamin Chislett
f7008ce1c4
[Perf] Async Scheduling + Speculative Decoding + Structured Outputs ( #29821 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-01-06 18:50:37 +00:00
Yakine Tahtah
4e67a8f616
[Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking ( #31055 )
...
Signed-off-by: ReinforcedKnowledge <reinforced.knowledge@gmail.com >
2026-01-06 17:57:56 +00:00
Masataro Asai
142c4d1738
make 500: InternalServerError more informative ( #20610 )
...
Signed-off-by: Masataro Asai <guicho2.71828@gmail.com >
2026-01-06 17:36:24 +00:00
Ning Xie
6f5e653383
[Log] add log about gpu worker init snapshot and requested memory ( #29493 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-06 17:32:55 +00:00
Vadim Gimpelson
22dffca982
[PERF] Speed-up of GDN attention decode part (Qwen3-Next) ( #31722 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-01-06 17:32:46 +00:00
Lucas Wilkinson
4c73be14e0
[Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties ( #31774 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-06 17:32:14 +00:00
Jinzhen Lin
2f4bdee61e
[Quantization][MoE] remove unused ep logic from moe marlin ( #31571 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-06 09:07:19 -08:00
roikoren755
28c94770ad
[NemotronH] Use ReplicatedLinear for fc1_latent_proj ( #31807 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-01-06 16:00:40 +00:00
Robert Shaw
af8fd73051
[MoE Refactor][14/N] Clean Up FI Quant Config Smuggling ( #31593 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-06 15:47:04 +00:00
Robert Shaw
d3e477c013
[MoE Refactor] Add Temporary Integration Tests - H100/B200 ( #31759 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-06 10:34:17 -05:00
Isotr0py
02809af1e7
[Bugfix]: Fix cross attention backend selection for Turing GPU ( #31806 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-06 23:15:56 +08:00
Jee Jee Li
cbd4690a03
[LoRA]Disable linear LoRA kernel PDL ( #31777 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-06 23:12:25 +08:00
wang.yuqi
96860af655
[Model] rename use_pad_token to use_sep_token ( #31784 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-06 14:16:04 +00:00
Chauncey
0202971a48
[Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false ( #31788 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-01-06 13:53:21 +00:00
Jzz1943
2c1a4f2488
[Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) ( #31790 )
...
Signed-off-by: Zhongze Jiang <jiangzhongze.jzz@ant-intl.com >
2026-01-06 12:59:17 +00:00
Cyrus Leung
6444824873
[Misc] Implement TokenizerLike.convert_tokens_to_ids ( #31796 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-06 12:08:22 +00:00
kzwrime
bf0f3a4638
[Bugfix] Fix torch.compile error for DP + MoE on CPU Backend ( #31650 )
...
Signed-off-by: kunzh <zhikun.wu@outlook.com >
2026-01-06 12:06:20 +00:00
Lucas Wilkinson
e0327c9db2
[Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties ( #31773 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-01-06 04:05:17 -08:00
Cyrus Leung
14df02b4e1
[Chore] Cleanup mem_utils.py ( #31793 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-06 19:55:59 +08:00
BlankR
6ebb66ccea
[Doc] Fix format of multimodal_inputs.md ( #31800 )
...
Signed-off-by: BlankR <hjyblanche@gmail.com >
2026-01-06 03:30:24 -08:00
wang.yuqi
43d384bab4
[CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. ( #31797 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-06 19:30:05 +08:00
Cyrus Leung
db318326a5
[Misc] Use deprecated for seed_everything ( #31780 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-06 11:29:55 +00:00
Fadi Arafeh
799b5721f6
[cpu][bench] Add CPU paged attention benchmarks ( #31720 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-01-06 10:57:57 +00:00
Cyrus Leung
97ca4c3b60
[Chore] Remove more V0 dead code from sequence.py ( #31783 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-06 10:25:14 +00:00
Isotr0py
ee2e69d6cd
[Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff ( #31776 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-06 00:44:22 -08:00
Isotr0py
7101e0851f
[Models]: Use MMEncoderAttention for MoonViT ( #31738 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: h100 <h100@inferact.ai >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: h100 <h100@inferact.ai >
2026-01-06 08:00:25 +00:00
vllmellm
e9717801bd
[Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py ( #31714 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-01-06 07:53:22 +00:00
Cyrus Leung
da71d44410
[Doc] Show that use_audio_in_video is supported in docs ( #30837 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-01-05 23:27:19 -08:00
Kevin McKay
1fb0209bbc
[Bugfix][Hardware][AMD] Fix exception types in AITER MLA FP8 check ( #31177 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-06 14:10:59 +08:00
Robert Shaw
81323ea221
[CI] Fix CPU MM Processor Test ( #31764 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-06 04:22:18 +00:00
Michael Goin
e1cd7a5faf
[Bugfix] Add init_workspace_manager to moe kernel benchmarks ( #31042 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-05 19:14:33 -08:00
Michael Goin
a68e703c32
[UX] Add -ep shorthand for --enable-expert-parallel ( #30890 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-05 19:13:36 -08:00
maang
cd1245a184
[Cleanup] Remove redundant decoder_layer_type assignment in Qwen2 ( #31760 )
...
Signed-off-by: maang <maang_h@163.com >
2026-01-05 18:09:18 -08:00
Wentao Ye
ffec815422
[Perf] Optimize additional fill(0) in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement ( #31754 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-05 18:01:13 -08:00
maang
d386ab1412
[Docs] Improve malformed exception caused by backslash line continuations ( #31694 )
...
Signed-off-by: maang <maang_h@163.com >
Signed-off-by: maang <55082429+maang-h@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-05 17:51:54 -08:00
Michael Goin
ccb309a964
Revert "[CI Failure] Disable B200 tests while runner is broken" ( #31750 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-01-05 17:26:33 -08:00
John Calderon
2f4e6548ef
[Bugfix] vLLM produces invalid UTF-8 tokens and “�” ( #28874 )
...
Signed-off-by: John Calderon <jcalderon@nvidia.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-01-06 00:23:00 +00:00
Seiji Eicher
3c98c2d21b
[CI/Build] Allow user to configure NVSHMEM version via ENV or command line ( #30732 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-01-05 15:56:08 -08:00
Michael Goin
9513029898
[Bugfix] Properly apply v_scale for mimo_v2_flash ( #31175 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-05 23:20:46 +00:00
Robert Shaw
f6c0009afa
[Bugfix] Fix Broken ModelOpt NVFP4 MoE ( #31742 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-01-05 23:18:38 +00:00
Yongye Zhu
776ca1e187
[MoE Refactor] Aiter Experts for BF16 MoE ( #31542 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-05 14:52:59 -08:00
Wentao Ye
af9a7ec255
[Bug] Revert torch warning fix ( #31585 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-01-05 22:31:21 +00:00
Matthew Bonanni
276e03b92c
[CI][DeepSeek] Add nightly DeepSeek R1 lm_eval tests on H200 ( #30356 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-05 17:17:59 -05:00
Nick Hill
32f4e4db00
[Cleanup] Remove deprecated fields from CachedRequestData class ( #31734 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-05 21:07:14 +00:00
amitz-nv
ee21291825
[Model] Nemotron Parse 1.1 Support ( #30864 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-01-05 13:00:14 -08:00
Qidong Su
af1b07b0c5
[docker] install cuda13 version of lmcache and nixl ( #30913 )
...
Signed-off-by: Qidong Su <soodoshll@gmail.com >
2026-01-05 12:50:39 -08:00
gnovack
c77a993cc2
pin lora_b moe weights on cpu ( #31317 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2026-01-05 12:15:40 -08:00
Roberto L. Castro
fdcc5176be
[BugFix] Fix architecture flags to prevent issues on SM103 ( #31150 )
...
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com >
2026-01-05 20:11:35 +00:00
Wang Kunpeng
5708297e4e
[Misc][Model][Refactor] Pass the prefix into Linear layers ( #31669 )
...
Signed-off-by: Wang Kunpeng <1289706727@qq.com >
2026-01-05 20:03:18 +00:00
baonudesifeizhai
02dbb933cb
Fix GLM-4.6v flash tool calling in transformers 5.x ( #31622 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-01-05 11:32:43 -08:00
Isotr0py
51e38a8e30
[Misc] Enable Paligemma's PrefixLM attention mask computation ( #31725 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-06 03:31:49 +08:00
Or Ozeri
d8e38d4939
Triton Attention: Support cross-layers blocks ( #30687 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-01-05 19:29:16 +00:00
kzwrime
21156ff199
[Bugfix] Add missing extra_tensors arg to DeviceCommunicatorBase.disp… ( #31644 )
...
Signed-off-by: kunzh <zhikun.wu@outlook.com >
2026-01-06 01:26:09 +08:00
RickyChen / 陳昭儒
c455b771fd
[Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager ( #31643 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
2026-01-06 01:25:38 +08:00
Michael Goin
eefa713a66
[CI Failure] Disable B200 tests while runner is broken ( #31732 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-01-05 08:50:51 -08:00
Kevin Šuc
79ed460dd5
[Frontend] [Doc] Exclude log deltas feature ( #30322 )
...
Signed-off-by: Catacomba <kevinsuc16@gmail.com >
Signed-off-by: Kevin Šuc <kevinsuc16@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-05 16:34:35 +00:00
Isotr0py
6aa5b18e1d
[v1] Add encoder-only/cross attention support to Triton Attention backend ( #31406 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-06 00:00:23 +08:00
wang.yuqi
911d38ed99
[Model] Let more models to support the score template. ( #31335 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-05 11:54:26 +00:00
zzzzwwjj
caaa482aca
[platform] Support additional forward context for OOT ( #31674 )
...
Signed-off-by: zzzzwwjj <1183291235@qq.com >
Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-01-05 10:25:13 +00:00
Yihua Cheng
b471aad41f
[KVconnector][LMCache] remove the import of legacy LMCache code ( #31704 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2026-01-05 10:11:01 +00:00
Jee Jee Li
d5503ca7f9
[LoRA] LoRA PDL improvement ( #31660 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-05 08:28:46 +00:00
Qiping Pan
a2ad15c070
[Model] Enable LoRA support for BLIP2 ( #31620 )
...
Signed-off-by: Qiping Pan <panqiping@outlook.com >
2026-01-05 08:02:24 +00:00
Tres
3133c192a3
[ROCM] Reorder arguments and rename parameters for rope_cached_thd_positions_2c_fwd_inplace ( #29993 )
...
Signed-off-by: Tres Popp <tres.popp@amd.com >
2026-01-05 15:37:57 +08:00
wang.yuqi
76fd458aa7
[CI] Bump sentence-transformer from 3.2.1 to 5.2.0 ( #31664 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-04 21:45:01 -08:00
cjackal
e2701cc525
[Frontend] [Bugfix] respect server-level default chat template kwargs in reasoning parser ( #31581 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-01-05 05:42:47 +00:00
Tyler Michael Smith
fe8a9fbd2e
[Bugfix] Fix EPLB state logging error ( #31455 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-01-05 04:06:28 +00:00
Ning Xie
98b8b3abaa
[log] enable max_log_len trim only when needed ( #31482 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-05 03:55:43 +00:00
CHENYUE
346e56455a
Add chat prefix completion feature to DeepSeek v3.2 ( #31147 )
2026-01-05 11:20:25 +08:00
wang.yuqi
8be6432bda
[CI Failure] Fix NomicBert max_model_len validation ( #31662 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-01-05 11:06:52 +08:00
Nick Hill
43e3f8e4a9
[Misc] Various code simplifications ( #31666 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-04 18:35:56 -08:00
wangxiyuan
bb4337b34c
[Platform] Deprecate seed_everything ( #31659 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-01-04 18:34:04 -08:00
Isotr0py
367856de14
[CI/Build] Revive skipped reward models e2e test ( #31665 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-01-05 02:33:46 +00:00
Nick Hill
da436f868a
[Minor] Small pooler output processing optimization ( #31667 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-04 18:33:12 -08:00
Jee Jee Li
f099cd557a
[Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' ( #31663 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-05 02:31:31 +00:00
Andreas Karatzas
f2b6dfd237
[ROCm][CI] Fix language generation test accuracy by disabling HF flash_sdp and mem_efficient_sdp ( #31597 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-05 02:17:05 +00:00
Andreas Karatzas
89f1f25310
[CI] Skip Phi-MoE test due to old API util ( #31632 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-05 08:52:07 +08:00
Nick Hill
b53b89fdb3
[BugFix] Async scheduling: handle model forward errors more cleanly ( #31611 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-04 11:04:37 -08:00
Ning Xie
6522721d17
[misc] Sort uvicorn log level description according to verbosity ( #31137 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-01-04 18:45:37 +00:00
Yuxuan Zhang
0d4044edd8
fix no think of GLM-4.5 / GLM-4.7 ( #31449 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2026-01-04 11:43:00 +08:00
Reagan Lee
41ab179738
[Docs] Fix argparse include path for mm-processor benchmark ( #31654 )
...
Signed-off-by: Reagan <reaganjlee@gmail.com >
2026-01-04 03:31:29 +00:00
Robert Shaw
268b1c55ad
[MoE Refactor][13/N] Convert FI to Use PFNoEP ( #31533 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-01-03 12:26:36 -08:00
Andreas Karatzas
4f9ce35afe
[CI][Bugfix] Fix token counting in chunked prefill compl test ( #31630 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-03 14:28:49 +08:00
jeremyteboul
97a01308e9
Improve HF qwen3_omni: preserve audio_sample_rate in kwargs restructuring ( #29255 )
...
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com >
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com >
2026-01-03 04:31:09 +00:00
Xingyu Liu
0eee877f67
[Core] Parse vLLM engine required fields from hf_config to model_arch_config ( #28454 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Signed-off-by: Xingyu Liu <38244988+charlotte12l@users.noreply.github.com >
2026-01-02 15:13:15 -08:00
Alfred
a0e9ee83c7
[Benchmark] Fix OOM during MoE kernel tuning for large models ( #31604 )
...
Signed-off-by: Alfred <massif0601@gmail.com >
2026-01-02 22:24:51 +00:00
Yongye Zhu
a3f2f40947
[MoE Refactor] Explicit construct mk for flashinfer bf16 kernel ( #31504 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-02 13:54:50 -08:00
Yongye Zhu
5a468ff7c7
[MoE Refactor] Split invoke_fused_moe_kernel ( #31050 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-01-02 13:47:15 -08:00
Andreas Karatzas
6ef770df7c
[MoE] Fix output_shape calculation in Attention layer to handle 3D query inputs ( #31596 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-02 15:46:23 +00:00
Nick Hill
bd877162eb
[BugFix] Support online dense model DP without overhead ( #30739 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: njhill <nickhill123@gmail.com >
2026-01-02 23:36:38 +08:00
Xinyu Chen
08f425bad1
CustomOp: test forward dispatch for grouped_topk ( #31530 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-01-02 10:04:01 -05:00
labAxiaoming
a01f2faedf
Add multimodal input method in the documentation ( #31601 )
...
Signed-off-by: xiaoming <1259730330@qq.com >
2026-01-02 12:43:30 +00:00
Kyuyeun Kim
cc410e8644
[Bugfix] Fix weight_loader v1 block scale ( #31103 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2026-01-02 13:14:10 +08:00
Kevin McKay
825c2dc133
[Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode ( #31282 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-01-01 21:14:00 -08:00
Vaibhav Sourirajan
1f43c121d5
Remove unused use_marlin variable in Mxfp4MoEMethod ( #31549 )
...
Signed-off-by: vaibhav sourirajan <vs2787@columbia.edu >
2026-01-01 21:13:36 -08:00
Tmn07
ca179d0f64
[Bugfix] Fix activation quantization for compressed-tensors W4A16 ( #31572 )
...
Signed-off-by: Tmn07 <tmn0796@gmail.com >
2026-01-01 21:13:22 -08:00
Andreas Karatzas
013b54088c
[ROCm][CI] Fix ModernBERT token classification test ( #31612 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-02 04:19:08 +00:00
Jay Hemnani
5ac55eb30f
[Model] Enable LoRA support for tower and connector in LLaVA ( #31513 )
...
Signed-off-by: Jay Hemnani <jayhemnani9910@gmail.com >
Co-authored-by: Jay Hemnani <jayhemnani9910@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-01-01 19:32:39 -08:00
Benjamin Chislett
ea53ca5e85
[Bugfix] Fix block size used in EAGLE slot mapping ( #31540 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-01-01 19:32:30 -08:00
zhima771
27864a851c
feat: support LoRA for DeepSeek-OCR(Language Model part) ( #31569 )
...
Signed-off-by: zhima771 <15836938703@163.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-01-01 19:32:11 -08:00
Andreas Karatzas
5cc4876630
[ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by reducing agent pool size ( #31553 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-01-01 19:29:42 -08:00
Kevin McKay
5fff44064b
[Bugfix] Replace BaseException with specific exceptions in FLA utils ( #31590 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-01-01 19:27:54 -08:00
Reagan Lee
1f5b7c41c3
Add Multimodal Processor Benchmark ( #29105 )
...
Signed-off-by: Reagan Lee <reaganjlee@gmail.com >
Signed-off-by: Reagan <reaganjlee@gmail.com >
2026-01-01 19:26:53 -08:00
Ekagra Ranjan
adcf682fc7
[Audio] Improve Audio Inference Scripts (offline/online) ( #29279 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-12-31 23:34:18 +00:00
Andreas Karatzas
21de6d4b02
[CI][Bugfix] Fix token counting in chunked prefill streaming test ( #31565 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-31 23:05:14 +00:00
Nick Hill
6c2cfb62ff
[BugFix] Fix async scheduling for pooling models ( #31584 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2025-12-31 14:48:51 -08:00
Fanjiang Ye
d8da76f3b7
[Bugfix] Fix BAGEL online serving for text and image understanding ( #31546 )
...
Signed-off-by: Dylan1229 <yvanphys@gmail.com >
Signed-off-by: UED <zxr3611244710@gmail.com >
Signed-off-by: mr-ye-cao <yecaoyc2019@gmail.com >
Co-authored-by: UED <zxr3611244710@gmail.com >
Co-authored-by: mr-ye-cao <yecaoyc2019@gmail.com >
Co-authored-by: Mr-Ye-Cao <60802056+Mr-Ye-Cao@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-31 14:46:10 -08:00
baonudesifeizhai
d722e9e614
Add GLM-ASR multimodal support ( #31436 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-31 23:12:24 +08:00
Andreas Karatzas
cf16342d43
[ROCm][CI] Update MiniCPM model test: MiniCPM3-4B to MiniCPM4.1-8B and simplify attention backend testing ( #31551 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-31 00:12:01 -08:00
Wentao Ye
357d435c54
[Bug] Fix log issue with \n ( #31390 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-12-30 21:16:55 -08:00
danisereb
108a2728f7
Add get_expert_mapping to NemotronHModel (for LoRA support) ( #31539 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2025-12-30 21:09:03 -08:00
TJian
578c8f51f6
[CI] [Critical] [CUDA] Fix duplicated test name ( #31562 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-30 21:01:09 -08:00
maang-h
b4bb5f312f
[Core] Remove unused num_tokens parameter from _init_model_kwargs ( #31517 )
...
Signed-off-by: maang <maang_h@163.com >
2025-12-30 20:47:23 -08:00
SameerAsal
70e1acefcd
[BugFix] Fix NUMA node validation in CPU platform ( #31520 )
...
Signed-off-by: SameerAsal <SameerAsal@users.noreply.github.com >
Co-authored-by: SameerAsal <SameerAsal@users.noreply.github.com >
2025-12-31 04:06:49 +00:00
Qiu
84f6cd741b
[Misc] add pcp basic support to MoE model ( #31003 )
2025-12-30 20:01:29 -08:00
B-201
ecd49ce7e6
[Fix] Align fused moe lora_b shape with peft ( #31534 )
...
Signed-off-by: bk-201 <joy25810@foxmail.com >
2025-12-31 09:44:59 +08:00
Amr Mahdi
e1ee11b2a5
Add docker buildx bake configuration ( #31477 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2025-12-31 01:08:54 +00:00
vintipandey
04147dcfa7
[Bugfix]Fix pooling model always disabled due to incorrect PP rank check ( #31505 )
...
Signed-off-by: vintipandey <vinti.pandey@gmail.com >
2025-12-30 11:27:10 -08:00
JartX
07728bf5cd
[BugFix] add select_gemm_impl on CompressedTensorsWNA16MoEMethod to support LoRA ( #31453 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2025-12-30 11:20:15 -08:00
yt0428
3f52fa5aa2
[Model] Add support for openPangu moe model ( #28775 )
...
Signed-off-by: yuantao <2422264527@qq.com >
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-30 08:11:38 -08:00
Li, Jiang
7157596103
[CPU] Disable async schedule on CPU ( #31525 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-30 12:34:08 +00:00
Nicolò Lucchesi
ab1af6aa3e
[CI][NIXL] Split DPEP tests ( #31491 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-30 07:26:12 -05:00
Pleaplusone
1a834df2d4
[ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled ( #31523 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-12-30 09:21:49 +00:00
Kevin
51085c2aeb
[Frontend] add continue_final_message parameter to /embeddings endpoint ( #31497 )
...
Signed-off-by: Kevin P-W <140451262+kevin-pw@users.noreply.github.com >
2025-12-30 07:21:13 +00:00
Roger Feng
3d973764ce
[xpu] [bugfix] upgrade to latest oneccl in dockerfile ( #31522 )
...
Signed-off-by: roger feng <roger.feng@intel.com >
2025-12-30 14:52:28 +08:00
Nick Hill
3b312fb792
[Minor] Various small code cleanups/simplifications ( #31508 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2025-12-29 22:42:06 -08:00
ZT-AIA
f84bf7d79b
Add Loraconfig parameter to get_punica_wrapper function ( #31408 )
...
Signed-off-by: ZT-AIA <1028681969@qq.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-29 22:27:31 -08:00
Roy Wang
99dcf5dcc5
Migrate meetups & sponsors [2/N] ( #31500 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2025-12-30 04:26:15 +00:00
Hojin Yang
dc837bc23e
feat(frontend): add --default-chat-template-kwargs CLI argument ( #31343 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
2025-12-30 03:38:47 +00:00
Nick Hill
e54ee3ea33
[Core] Deduplicate generate/encode logic in AsyncLLM ( #31510 )
...
Signed-off-by: njhill <nickhill123@gmail.com >
2025-12-30 10:42:45 +08:00
wangln19
358bfd315c
fix: update kimi k2 tool parser logic ( #31207 )
...
Signed-off-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local >
Signed-off-by: Wang Linian <wanglinian@stu.pku.edu.cn >
Co-authored-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-30 10:01:58 +08:00
Sage
39512aba72
[Prefix Cache] Include lora_name in BlockStored event for deterministic KV-cache reconstruction ( #27577 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
Co-authored-by: Sage <80211083+sagiahrac@users.noreply.github.com >
2025-12-30 00:17:16 +00:00
qli88
0f35429a0c
[CI]Test Group 'NixlConnector PD accuracy tests' is fixed ( #31460 )
...
Signed-off-by: qli88 <qiang.li2@amd.com >
2025-12-29 23:48:56 +00:00
Alexei-V-Ivanov-AMD
d63b969675
[CI/ROCm] Fixing "V1 Test attention (H100)" test group. ( #31187 )
...
Signed-off-by: DCCS-4560 <alivanov@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com >
Co-authored-by: DCCS-4560 <alivanov@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com >
Co-authored-by: root <root@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com >
2025-12-29 16:53:59 -05:00
Robert Shaw
56f516254c
[Bugfix][ROCm] Fix Static Quant Issue ( #31502 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-12-29 13:27:55 -08:00
Robert Shaw
9152a30d8f
[MoE Refactor][12/N] Marlin Fp8 MoE Pure Function ( #31499 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-29 13:27:00 -08:00
Nick Hill
c2ff33cc8c
[Core] Enable async scheduling by default ( #27614 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2025-12-29 13:20:55 -07:00
chunxiaozheng
b12cb38398
implements register kv caches in lmcache connector ( #31397 )
...
Signed-off-by: idellzheng <idellzheng@tencent.com >
2025-12-29 11:13:42 -08:00
Roger Young
5bc664110f
Optimize QKNorm for MiniMax-M2/M2.1 ( #31493 )
...
Signed-off-by: xuebi <xuebi@minimaxi.com >
Co-authored-by: xuebi <xuebi@minimaxi.com >
2025-12-29 16:30:18 +00:00
RickyChen / 陳昭儒
b3a2bdf1ac
[Feature] Add offline FastAPI documentation support for air-gapped environments ( #30184 )
...
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com >
Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-29 16:22:39 +00:00
Harry Mellor
e37e7349e6
Replace nn.ConvNd with vLLM's ConvNdLayer for Transformers modeling backend ( #31498 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-29 16:20:01 +00:00
Roy Wang
b5d2d71d26
Migrate doc to website: Hardware Plugins (1/N) ( #31496 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2025-12-29 15:55:20 +00:00
Harry Mellor
decc244767
[Docs] Use relative md links instead of absolute html links for cross referencing ( #31494 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-29 13:33:44 +00:00
amittell
9c884faa95
[Bugfix] Preserve tool call id/type/name in streaming finish chunk ( #31438 )
...
Signed-off-by: amittell <mittell@me.com >
Signed-off-by: Alex Mittell <mittell@me.com >
2025-12-29 21:10:52 +08:00
Chauncey
48d5ca4e8b
[CI] fix test_chat_truncation_content_not_null test ( #31488 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-29 12:47:08 +00:00
twj
bf73a3e4d7
[Bugfix][Frontend] Fix Jina reranker multimodal input compatibility ( #31445 )
...
Signed-off-by: tianwenjing <tianwenjing@jfgenius.com >
Signed-off-by: twj <151701930+twjww@users.noreply.github.com >
Co-authored-by: tianwenjing <tianwenjing@jfgenius.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-29 01:13:18 -08:00
Andreas Karatzas
3ecfdc3776
[ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition ( #30719 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-29 01:13:14 -08:00
Andreas Karatzas
45c1ca1ca1
[ROCm][CI] Skip DeepGemm-dependent test on ROCm platform ( #31462 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-29 16:31:10 +09:00
Li, Jiang
17347daaa2
[CI/Build][CPU] Update CPU CI test cases ( #31466 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-29 14:17:52 +08:00
Mamy Ratsimbazafy
b9793e6a8c
Add Fused MoE Triton kernels for GLM-4.5-Air, GLM-4.5v, GLM-4.6v on 2x RTX Pro 6000 ( #31407 )
...
Signed-off-by: Mamy Ratsimbazafy <mamy_github@numforge.co >
2025-12-28 08:38:33 -08:00
Jzz1943
0b6b701050
[Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 ( #31448 )
...
Signed-off-by: Zhongze Jiang <jiangzhongze.jzz@ant-intl.com >
2025-12-28 08:38:07 -08:00
Nick Hill
094fcce250
[BugFix] Re-fix async multimodal cpu tensor race condition ( #31373 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: njhill <nickhill123@gmail.com >
2025-12-28 03:05:08 -08:00
Andreas Karatzas
573dd0e6f0
[ROCm] Migrate xgrammar to upstream release ( #31327 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-28 00:08:29 -08:00
Andreas Karatzas
f70368867e
[ROCm][CI] Add TorchCodec source build for transcription tests ( #31323 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-28 16:06:05 +08:00
Andreas Karatzas
96142f2094
[ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test ( #31441 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-28 04:15:14 +00:00
Boyuan Feng
62def07d67
[BugFix] register quant scale tensors as buffer ( #31395 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-28 11:20:02 +08:00
yitingdc
b326598e97
add tip for VLLM_USE_PRECOMPILED arg to reduce docker build time ( #31385 )
...
Signed-off-by: yiting.jiang <yiting.jiang@daocloud.io >
2025-12-28 03:19:47 +00:00
Robert Shaw
727c41f3fd
[MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading ( #31169 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-27 20:22:48 +00:00
Boyuan Feng
2f12cd32c0
[BugFix] Fix cache issue in compilation_config ( #31376 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-27 09:30:39 -05:00
Isotr0py
40a8756224
[Chore]: Remove HF format Phi4-MM examples ( #31405 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-27 13:42:02 +00:00
Isotr0py
3d024985ab
[CI/Build] Ignore max transformers version for more common tests ( #31401 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-27 13:06:26 +00:00
baonudesifeizhai
8711b21676
Fix/get raw stream patch #30905 ( #30912 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-12-26 20:08:47 -08:00
Yifan Qiao
52bf066516
[Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector ( #30166 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Co-authored-by: KuntaiDu <kuntai@uchicago.edu >
2025-12-26 18:25:46 -08:00
Kunshang Ji
5326c89803
[XPU][CI]skip test_preprocess_error_handling due to fork/spawn issue ( #31381 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-12-26 21:40:44 +00:00
Xinyu Chen
87f1b8ca2c
CustomOp: Unify aiter impl into GroupedTopk ( #31221 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2025-12-26 12:44:29 -05:00
rongfu.leng
887e900b77
[Docs] Add profiler user docs for http request ( #31370 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-12-26 23:48:15 +08:00
Patrick von Platen
48e744976c
[Mistral common] Ensure all functions are imported from the top & only use public methods ( #31138 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-26 04:48:24 -08:00
Jee Jee Li
ce1eafd1a5
[Core] Initialize LoRA support for tower and connector in multi-modal models ( #26674 )
...
Signed-off-by: bk-201 <joy25810@foxmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com >
Co-authored-by: bk-201 <joy25810@foxmail.com >
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com >
Co-authored-by: Anexdeus <5142168@mail.ru >
2025-12-26 04:48:20 -08:00
Harry Mellor
0b544e6476
[Docs] Fix some snippets ( #31378 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-26 12:47:41 +00:00
Jee Jee Li
c3666f56fd
[Misc] Fix Qwen2-MoE shared_expert_gate ( #31339 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-26 05:10:39 +00:00
Andreas Karatzas
c79dbfa9ad
[CI] Fix flaky vision beam search test with flexible semantic validation ( #31324 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-26 04:39:32 +00:00
Shinichi Hemmi
9ee05cbe7f
Support LoRA and GPTQModel for PLaMo 2/3 ( #31322 )
...
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
2025-12-26 11:41:33 +08:00
Ning Xie
3b8f31b362
[benchmark] use model card root instead of id ( #31329 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-12-26 10:55:56 +08:00
Isotr0py
2cd94259c8
[CI/Build] Ignore max transformers version skipping for initialization tests ( #30619 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-26 10:50:32 +08:00
oscardev256
b7165d53c6
Feature/isaac 0.1 ( #28367 )
...
Signed-off-by: oscardev256 <42308241+oscardev256@users.noreply.github.com >
Signed-off-by: Oscar Gonzalez <ogonzal6@alumni.jh.edu >
Signed-off-by: Yang <lymailforjob@gmail.com >
Co-authored-by: Yang <lymailforjob@gmail.com >
2025-12-25 18:49:11 -08:00
Nick Hill
81786c8774
[BugFix] Fix async scheduling + reasoning with struct output ( #31332 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2025-12-25 23:01:02 +00:00
Stan Wozniak
f1531d9f2a
[Hybrid] Mamba2 prefix cache blocks freeing for running requests ( #28047 )
...
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-12-25 20:54:06 +00:00
SongHe
2d6001f491
[Model][Ernie4.5-VL] Support video metadata for timestamp rendering ( #31274 )
...
Signed-off-by: dengsonghe <dengsonghe@baidu.com >
Co-authored-by: dengsonghe <dengsonghe@baidu.com >
2025-12-25 14:07:15 +00:00
Amir Samani
030fc44914
use the same stream for cuda graph capture and replay for NCCL ( #29207 )
...
Signed-off-by: Amir Samani <asamani@nvidia.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-12-25 19:10:03 +08:00
Isotr0py
2532f437ee
[Doc] Add troubleshooting for Triton PTX error about undefined gpu-name ( #31338 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-12-25 02:26:34 -08:00
Louie Tsai
f15185fbdb
[Benchmark Suite] improve cpu Benchmark Suite tests and comparison report for 0.12.0 ( #30994 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-12-25 08:51:45 +00:00
Mark Gatere
ba25a65992
[Frontend] add FunctionGemma tool parser support ( #31218 )
...
Signed-off-by: gateremark <gateremg@gmail.com >
2025-12-25 15:29:25 +08:00
Amith KK
42826bbccd
[Doc] Add tool call parser documentation for GPT-OSS models ( #31212 )
...
Signed-off-by: Amith KK <amithkumaran@gmail.com >
2025-12-25 05:29:10 +00:00
Richard Zou
254f6b9867
[Bugfix] Fix eagle dp tests on A100 ( #31241 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-12-25 00:05:04 +00:00
Michael Goin
bc5ef333e0
[Perf] Add skip_clone to SamplingParams for internal request handling ( #31041 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-24 14:35:57 -08:00
Cyrus Leung
09dc7c690c
[Chore][1/2] Drop v0.14 deprecations ( #31285 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 09:54:01 -08:00
ゆり
506eb0f454
[Bugfix] Remove dead block_quant_to_tensor_quant function ( #31294 )
...
Co-authored-by: yurekami <yurekami@users.noreply.github.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-24 17:22:48 +00:00
Ning Xie
5d93089686
[cli] complete vllm cli help message ( #31226 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-12-24 15:45:47 +00:00
Kevin McKay
66c9887440
[Bugfix][Hardware][AMD] Fix FP8 dtype in silu_mul quantization ( #31179 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-24 10:37:11 -05:00
wang.yuqi
1ff67df182
[CI] Reorganization pooling_mteb_test ( #31265 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-12-24 23:36:20 +08:00
skaraban3807
7cd288a4b3
[PERF] Add interleaved memory allocation to NUMA module ( #30800 )
2025-12-24 13:47:49 +00:00
Cyrus Leung
d201807339
[Chore] Bump lm-eval version ( #31264 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 05:39:13 -08:00
Cyrus Leung
aa3868ecfe
[Chore] Remove unused noqas ( #31263 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 05:38:46 -08:00
Cyrus Leung
7adeb4bfa8
[Bugfix] Fix max_model_len="auto" handling ( #31260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 19:15:27 +08:00
wang.yuqi
bd89ce16d2
[Model] Introduce verify_and_update_model_config for VerifyAndUpdateConfig. ( #31131 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
2025-12-24 09:54:57 +00:00
Pleaplusone
b41aeb3468
[Bugfix][ROCm] Fix load issue on deepseek quark quantization when shared expert enabled ( #31261 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-12-24 16:47:44 +08:00
Ryan Rock
ddfac7034e
[CI/Build] Ignore data_parallel_size_local ( #30281 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2025-12-24 07:40:54 +00:00
Micah Williamson
6559d96796
[ROCm][CI] Set TORCH_NCCL_BLOCKING_WAIT Distributed Tests On ROCm ( #31259 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-24 07:19:07 +00:00
kliuae
1c74150bca
[ROCm][CI] Fix "Distributed Tests (H200)" Test ( #31227 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2025-12-24 06:56:30 +00:00
Andreas Karatzas
0247a91e00
[ROCm][CI] Fix entrypoints tests and Python-only installation test on ROCm ( #28979 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-23 22:42:30 -08:00
Michael Goin
8ee90c83f8
Add --max-model-len auto to auto-fit context to available memory ( #29431 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-23 21:37:14 -08:00
Nick Cao
d7e05ac743
[docker] Fix downloading sccache on aarch64 platform ( #30070 )
...
Signed-off-by: Nick Cao <nickcao@nichi.co >
2025-12-23 21:36:33 -08:00
sihao_li
471ddb99a0
[XPU] Remove distributed_executor_backend check ( #30760 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-12-23 21:34:33 -08:00
Xiong Wang
bb24592d13
[Qwen3-Omni] fixed _get_feat_extract_output_lengths function ( #31007 )
...
Signed-off-by: Xiong Wang <wangxiongts@163.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-12-23 21:33:54 -08:00
Matthew Bonanni
369f47aa0f
[DeepSeek v3.2] Remove unnecessary syncwarps ( #31047 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-23 21:33:30 -08:00
zejunchen-zejun
dabff12ed3
[Bugfix][ROCm][Dynamo][DS 3.1][FP8] fix unsupported hasattr call when Dynamo tracing for ROCm device ( #31149 )
...
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com >
2025-12-23 21:32:19 -08:00
Ming Yang
3bb9561928
Revert "[bench] Support common prefix len config (for decode-only bench)" ( #31240 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-23 21:17:23 -08:00
Micah Williamson
3ce791ac77
[ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION="tf32" For terratorch Tests In AMD CI ( #31242 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-24 03:21:50 +00:00
Andreas Karatzas
e42894f5b5
[ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch and InternVL video test tolerance ( #31235 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-24 02:56:58 +00:00
Wentao Ye
76e6a95192
[Bug] Fix Number of dimensions of tensors must match. for Deepseek V3.2 ( #31160 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-24 10:41:09 +08:00
Chao Lei
8b59753cdb
[P/D] Mooncake connector support more protocols ( #30133 )
...
Signed-off-by: LCAIZJ <leichao139636@163.com >
2025-12-24 10:24:07 +08:00
Chen Zhang
538e830caa
[KVEvent] User request.block_hash for parent block_hash ( #30544 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Co-authored-by: Yifan Qiao <yifanqiao@berkeley.edu >
2025-12-23 18:23:43 -08:00
rongfu.leng
4ed11105d7
[Misc] Remove unused custom ops copy_blocks and copy_blocks_mla ( #30967 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-12-23 18:22:35 -08:00
Cyrus Leung
dd424571c8
[Bugfix] Enable dynamic_dims for different embeds shape ( #31223 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-24 10:15:47 +08:00
Cyrus Leung
ca6a95ba25
[Chore] Simplify logic of _execute_mm_encoder ( #31222 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-23 18:15:16 -08:00
Vadim Gimpelson
bc0a5a0c08
[CI] Add Qwen3-Next-FP8 to Blackwell model tests ( #31049 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-12-23 17:21:50 -08:00
Andreas Karatzas
bfa2c0bbb9
[ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() ( #31203 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-23 21:48:01 +00:00
Mark McLoughlin
f790068600
[Core] Add a random suffix to frontend-provided request IDs ( #27987 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-12-23 13:05:39 -08:00
Asaf Joseph Gardin
34916ae37f
[Mamba] - Consolidate Mambas Attention Logic ( #28133 )
2025-12-23 21:57:00 +01:00
Yuan Tang
0736f901e7
docs: Add llm-d integration to the website ( #31234 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-12-23 20:27:22 +00:00
Harry Mellor
c016c95b45
Use helper function instead of looping through attribute names ( #29788 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-23 17:31:56 +00:00
Harry Mellor
1339878e13
Only patch original_max_position_embeddings for Transformers v4 ( #31214 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-23 16:46:32 +00:00
danielafrimi
b94f80ffb8
[FIX] FP4 quantization kernel padding initialization bug ( #31097 )
...
Signed-off-by: <>
Co-authored-by: root <root@gpu-193.slurm-workers-slurm.slurm.svc.cluster.local >
Co-authored-by: root <root@gpu-951.slurm-workers-slurm.slurm.svc.cluster.local >
2025-12-23 08:45:18 -08:00
Joachim Studnia
38c361f99d
Fix edge case Mistral tool parser ( #30724 )
...
Signed-off-by: Joachim Studnia <joachim@mistral.ai >
Signed-off-by: Joachim Studnia <studniajoachim@gmail.com >
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: juliendenize <julien.denize@mistral.ai >
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-12-23 14:19:58 +00:00
Cyrus Leung
bb62dda2c3
[Misc] Introduce encode_*_url utility function ( #31208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-23 13:45:21 +00:00
Patrick von Platen
3faa8bee57
adapt voxtral ( #31095 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-12-23 05:31:55 -08:00
Harry Mellor
b10d47e0e0
Add util function for checking nesting of rope parameters ( #31146 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-23 11:41:49 +00:00
R3hankhan
769f27e701
[OpenAI] Add parameter metadata to validation errors ( #30134 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2025-12-23 11:30:12 +00:00
Jakub Zakrzewski
23daef548d
[Frontend] Support using chat template as custom score template for reranking models ( #30550 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2025-12-23 11:19:16 +00:00
Jee Jee Li
27c6c2f98c
[Bugfix] Fix MoE LoRA bin/pt loading ( #31161 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-23 19:09:15 +08:00
Weida Hong
73cfb7a722
Correct position of docstring of class attributes ( #31209 )
...
Signed-off-by: Weida Hong <wdhongtw@google.com >
2025-12-23 02:08:58 -08:00
vllmellm
f32cfd7d97
[ROCm][FEAT] Support AITER RMSNorm quantization fusion pass ( #26575 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-12-23 02:07:54 -08:00
Jee Jee Li
6b16fff01b
[Bugfix] Fix Jais2ForCausalLM ( #31198 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-23 07:44:01 +00:00
Yan Ma
f1c2c20136
[XPU] decrease IGC_ForceOCLSIMDWidth for speculative decoding triton-xpu kernel compilation ( #30538 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-12-23 05:22:15 +00:00
Cyrus Leung
8cef137689
[Chore] Update more locations to use attention_config.backend ( #31153 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-22 19:19:50 -08:00
quanliu
a37328fc5c
[Feature] Batch invariant: Lora ( #30097 )
...
Signed-off-by: quanliu <18646313696@163.com >
2025-12-23 10:32:47 +08:00
Pavani Majety
3e10262356
Revert "[SM100] Enable fp8 compute for prefill MLA ( #30746 )" ( #31197 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-12-22 18:15:33 -08:00
Angela Yi
612d5ffdab
[ci] Fix Pytorch compilation test oom in 2.10 ( #31194 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2025-12-23 01:56:47 +00:00
Divakar Verma
78e5e62bbf
[AMD][CI] fix v1/engine test_preprocess_error_handling ( #31192 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-12-23 01:28:19 +00:00
Robert Shaw
b57b967386
[MoE Refactor][7/N] AITER MK ( #31102 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-22 16:42:58 -07:00
Michael Goin
6d518ffbaa
[CI Failure] Disable mosaicml/mpt-7b and databricks/dbrx-instruct tests ( #31182 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-22 15:40:35 -08:00
Benjamin Chislett
85aff45e24
[Perf] Remove blocking copy in GDN Attention ( #31167 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2025-12-22 14:25:22 -08:00
Wentao Ye
5312a7284e
[Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' ( #31173 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-22 14:24:27 -08:00
Lucas Wilkinson
de71747655
[SpecDecode] Simplified alternative padded-speculation acceptance rate fix ( #29845 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-22 13:06:10 -08:00
Michael Goin
9586354053
[Doc] Add vllm-metal to hardware plugin documentation ( #31174 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-22 20:06:29 +00:00
Pavani Majety
b10f41c894
[SM100] Enable fp8 compute for prefill MLA ( #30746 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-12-22 19:15:57 +00:00
Yongye Zhu
7b926e8901
[MoE Refactor][9/N] Use modular kernel for unquantized Triton MoE ( #31052 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-12-22 17:34:19 +00:00
Gregory Shtrasberg
ab3a85fd68
[ROCm][CI/Build] Fix triton version to one that has triton_kernels required for gpt-oss to run ( #31159 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-12-22 17:19:27 +00:00
Boyuan Feng
8dd0db687b
[UX] improve profiler error message ( #31125 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-22 08:45:59 -08:00
TJian
022f3cea53
[ROCm] [Critical]: Remove unused variable ( #31156 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-22 08:28:22 -08:00
Micah Williamson
a5bc77c253
[AMD][CI] Add "V1 Test e2e + engine" to mi325_8 Agent Pool ( #31040 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-22 10:41:56 -05:00
Nicolò Lucchesi
b1c3f96ae3
[CI][Bugfix] Fix entrypoints/openai/test_audio.py ( #31151 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-22 07:21:40 -08:00
dengyunyang
8f8f469b1b
[BugFix] skip language model in Encoder ( #30242 )
...
Signed-off-by: dengyunyang <584797741@qq.com >
2025-12-22 05:25:59 -08:00
Shengqi Chen
2cf91c2ea4
[CI] add polling for precompiled wheel in python_only_compile.sh, fix index generation for releases ( #30781 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-22 13:24:21 +00:00
AlonKejzman
bd6d5a7475
[gpt-oss] Fix harmony parser in streaming responses ( #30205 )
...
Signed-off-by: AlonKejzman <alonkeizman@gmail.com >
2025-12-22 20:56:06 +08:00
Li Wang
256a33ecb4
[Model] Fix bagel failed to run ( #31132 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-12-22 02:15:54 -08:00
Roger Young
c02a2705f9
Update MiniMax-M2 ToolCall and add MiniMax-M2.1 in Docs ( #31083 )
...
Signed-off-by: xuebi <xuebi@minimaxi.com >
Co-authored-by: xuebi <xuebi@minimaxi.com >
2025-12-22 05:28:40 +00:00
Kevin McKay
cf8eed7bef
[Bugfix][ROCm] Fix typo: is_linear_fp8_enaled -> is_linear_fp8_enabled ( #31109 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-21 21:14:58 -08:00
Kevin McKay
44ae85f725
[Misc] Fix typo: 'occured' -> 'occurred' ( #31120 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:14:27 -08:00
Kevin McKay
14c3e6ade3
[Misc] Fix spelling typos in model comments ( #31117 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:14:14 -08:00
Kevin McKay
42b42824ae
[Misc] Fix grammar errors in comments and messages ( #31115 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:14:02 -08:00
Kevin McKay
ec58c10ce1
[Misc] Fix quantization-related typos ( #31116 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:13:48 -08:00
Kevin McKay
8c084de59d
[Misc] Fix spelling typos in comments ( #31114 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2025-12-21 21:13:14 -08:00
CedricHuang
19cc9468fd
[Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM ( #30957 )
2025-12-21 22:34:49 -05:00
Jee Jee Li
097978a15d
[Kernel] Enable fused_qknorm_rope_kernel supports partial rope ( #30821 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-21 18:39:22 -08:00
Lucas Wilkinson
7e065eba59
[CI] Fix "2 Node Tests (4 GPUs in total)" ( #31090 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-22 10:32:40 +08:00
Steve Westerhouse
9d701e90d8
[Doc] Clarify FP8 KV cache computation workflow ( #31071 )
...
Signed-off-by: westers <steve.westerhouse@origami-analytics.com >
2025-12-22 08:41:37 +08:00
Michael Goin
06d490282f
[NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size ( #30897 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-21 09:41:57 -08:00
Robert Shaw
b471092d3a
[MoE Refactor][4/N] Marlin Fp8 Mk ( #31036 )
2025-12-21 12:37:42 -05:00
Ameen Patel
93cabc417c
ci: add nvidia-smi warmup before Prime-RL integration test ( #31093 )
...
Signed-off-by: AmeenP <ameenp360@gmail.com >
2025-12-21 15:43:01 +00:00
Chauncey
bb80f69bc9
add aarnphm and chaunceyjiang to the new tool_parser directory ( #31088 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-21 03:24:34 +00:00
汪志鹏
3e92b2b7ac
[BugFix]fix gpt-oss v1/completions response bug ( #30608 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: bbrowning <bbrownin@redhat.com >
2025-12-21 10:39:31 +08:00
Jinzhen Lin
7c73ceb581
[Quantization] add marlin w4a8/w8a8 check ( #31061 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-12-20 21:58:11 +00:00
Lucas Wilkinson
ae0770fa6b
[CI] Fix H200 Distributed test ( #31054 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-20 16:48:49 -05:00
Jinzhen Lin
ee52d9901d
[Quantization] support logical_widths for fp8 marlin ( #30962 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-20 12:02:57 -08:00
baonudesifeizhai
54c8924384
[MoE Refactor][5/N] Isolate zero expert to LongCatFlash ( #28891 )
...
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com >
Signed-off-by: Dongjie Zou <85092850+baonudesifeizhai@users.noreply.github.com >
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com >
2025-12-20 18:22:04 +00:00
Yan Ma
560ae9638c
[XPU] enable fp8 online streaming quantization ( #30944 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-12-20 13:45:27 +00:00
Jeffrey Wang
1501a4070e
[Bugfix] Read truncate_prompt_tokens from pooling_params in AsyncLLM.encode() ( #31013 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2025-12-20 10:29:31 +00:00
Lucas Wilkinson
ff2168bca3
[CI] Fix fixture 'siglip_attention_config' not found ( #31053 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-20 03:46:15 +00:00
Gregory Shtrasberg
0be149524c
[ROCm][CI/Build] Update ROCm dockerfiles ( #30991 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-12-20 03:19:12 +00:00
zejunchen-zejun
d52c5096d7
[Bugfix] fix the alias bug of AttentionBackendEnum when register CUSTOM attention backend to vllm ( #30869 )
...
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com >
2025-12-20 09:03:35 +08:00
Yuxuan Zhang
8a7a414374
GLM-4.7 Tool Parser and Doc Update ( #30876 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-12-20 00:09:58 +00:00
Robert Shaw
95befecc18
[MoE Refactor][2/N] Use Modular Kernels for Fp8 ( #30825 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-19 23:36:38 +00:00
Wentao Ye
4cf9429897
[Bug] Fix error 'Dynamo failed to run FX node with fake tensors' for Deepseek V3.2 ( #31046 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-19 23:31:31 +00:00
Robert Shaw
83a317f650
[MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) ( #30990 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-12-19 13:09:54 -08:00
Lucas Wilkinson
5f6477d1d0
[BugFix] Fix TypeError: unhashable type: 'dict' when serving deepseek32 ( #30924 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-19 16:07:54 -05:00
Wentao Ye
3bd8335bd0
[Refactor] Refactor for DeepGemmQuantScaleFMT using cache ( #30898 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-19 13:50:39 -07:00
Seiji Eicher
1ab5213531
Make engine core client handshake timeout configurable ( #27444 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-12-19 20:38:30 +00:00
Zhonghua Deng
969bbc7c61
[Model] Add MiMo-V2-Flash support ( #30836 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
Signed-off-by: Jumiar <liuanqim10@126.com >
Signed-off-by: Zyann7 <zyann7@outlook.com >
Co-authored-by: Jumiar <liuanqim10@126.com >
Co-authored-by: Zyann7 <zyann7@outlook.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-19 17:17:03 +00:00
Andrey Talman
268a972c62
Update Pytorch version update docs ( #30982 )
2025-12-19 16:08:53 +00:00
Jinzhen Lin
5fbfa8d9ef
[Quantization] fix marlin w8a8 check ( #30961 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-12-19 07:33:22 -08:00
Shanshan Shen
23a1946e3b
[CustomOp][Refactor] Extract common methods for ApplyRotaryEmb CustomOp ( #31021 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-12-19 22:16:09 +08:00
Thomas Parnell
b5545d9d5c
[Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window ( #30887 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-12-19 21:39:54 +08:00
Nishidha Panpaliya
bd2b52fc2d
[CPU][Bugfix] Fix ppc64le CPU build ( #30871 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-12-19 12:26:35 +00:00
Li, Jiang
420ba2dbb6
Enable aarch64 CPU performance benchmarks ( #26494 )
...
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com >
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Ioana Ghiban <ioana.ghiban@arm.com >
Co-authored-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-19 12:16:18 +00:00
Marko Rosenmueller
455949675d
[Frontend][Bug] allow tool calls in analysis channel ( #28139 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-19 10:47:44 +00:00
lif
086b96339f
[Bugfix] Add validation for tool requests when tool_parser is unavailable ( #30613 )
...
Signed-off-by: majiayu000 <1835304752@qq.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-19 18:23:28 +08:00
Jinzhen Lin
9187de9fac
[Quantization] enable compressed-tensors marlin support for turing (2) ( #31008 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-12-19 08:56:35 +00:00
Isotr0py
ac1c934276
[Bugfix] Fix incorrect tiles creation for mm prefix triton attention ( #30974 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-19 16:00:33 +08:00
Wenqi Glantz
4924ac582c
Add hidden dimension validation for multimodal embedding inputs ( #30968 )
...
Signed-off-by: Wenqi Glantz <wglantz@nvidia.com >
2025-12-19 07:59:36 +00:00
Li, Jiang
096b25c9ed
[Doc][CPU] Fix index link for CPU regular release wheels ( #31015 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-19 07:29:52 +00:00
Jinzhen Lin
de08b8f61b
[Quantization] enable compressed-tensors marlin support for turing ( #31000 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-12-18 20:29:48 -08:00
Nick Hill
2ac85a4544
[BugFix] Fix logprobs with spec decode and modified logits ( #30846 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-18 19:58:28 -08:00
Andreas Karatzas
7b43db210c
[ROCm][CI][Bugfix] Multi-Modal Model Support Fixes and Attention Backend Improvements ( #30270 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-19 02:17:27 +00:00
PlatinumGod
6a09612b2e
[Bugfix] Fix tool_choice="none" being ignored by GPT-OSS/harmony models ( #30867 )
...
Signed-off-by: yujiepu <pyjapple@gmail.com >
Signed-off-by: PlatinumGod <pyjapple@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-19 09:34:27 +08:00
Nick Hill
45c0526ac9
[BugFix] Handle errors when preprocessing added requests ( #30895 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-19 01:29:11 +00:00
Benjamin Chislett
d6b3d39b6d
[Cleanup] Refactor FlashInferMetadataBuilder ( #29128 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-12-18 14:45:30 -08:00
Chendi.Xue
6ca74bc11a
[NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA ( #30419 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2025-12-18 22:10:02 +00:00
Harry Mellor
19c583398a
Check for truthy rope_parameters not the existence of it ( #30983 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-18 13:59:10 -08:00
Nick Hill
b0b77c4655
[BugFix] Fix spec decode + structured outputs + preemption edge case ( #30916 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-18 12:59:55 -08:00
Kayvan Mivehnejad
634a14bd7d
Strengthen input validation and tests for 'parse_raw_prompts'. ( #30652 )
...
Signed-off-by: Kayvan Mivehnejad <K.Mivehnejad@gmail.com >
2025-12-18 19:51:58 +00:00
Chen Zhang
24b65eff0d
[BugFix] Spec decode with VLLM_ENABLE_V1_MULTIPROCESSING=0 ( #30319 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-12-18 19:47:56 +00:00
Elizabeth Thomas
41b6f9200f
Remove all2all backend envvar ( #30363 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-18 19:46:28 +00:00
Wentao Ye
97000a2be7
[Bug] Fix compressed tensor not using deepgemm ( #30820 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-18 14:45:55 -05:00
Isotr0py
d2dc5dfc6e
[Bugfix] Remove tile_size=64 for mm_prefix triton attention ( #30973 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-18 20:42:32 +01:00
navmarri14
b8c477c115
tuned fused configs for B300 ( #30629 )
2025-12-18 11:41:59 -08:00
jiahanc
53ad423f26
[Perf] enable flashinfer rotary_embedding custom ops in DeepSeek rotary ( #30729 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2025-12-18 14:31:18 -05:00
wz1qqx
889f8bb250
[BugFix] Reclaim resources to prevent memory leaks when using LMCacheMPConnector ( #30745 )
...
Signed-off-by: wz1qqx <ziqi.wang@novita.ai >
Co-authored-by: wz1qqx <ziqi.wang@novita.ai >
2025-12-18 19:09:51 +00:00
Fanli Lin
058926d48c
[XPU] allow custom workers (e.g. vllm-omni workers) to be used on XPU ( #30935 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
2025-12-18 10:16:36 -08:00
Isotr0py
700a5ad6c6
[MM Encoder]: Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface ( #30684 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-19 02:04:19 +08:00
Alec
62be3670cb
[BugFix] Add sleep to fix tight loop and release GIL ( #29476 )
...
Signed-off-by: alec-flowers <aflowers@nvidia.com >
Signed-off-by: Alec <35311602+alec-flowers@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-12-18 09:52:55 -08:00
inkcherry
500f26e6d3
[Bugfix] fix DP-aware routing in OpenAI API requests ( #29002 )
...
Signed-off-by: inkcherry <mingzhi.liu@amd.com >
2025-12-18 09:50:42 -08:00
Nick Hill
686cbaac64
[Cleanup] Remove unused ModelRunner V1 InputBatch.num_tokens field ( #30218 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-12-18 09:17:00 -08:00
Vasiliy Kuznetsov
f4ee2c3d90
fix fp8 online quantization streaming with tp > 1 ( #30900 )
...
Signed-off-by: vasiliy <vasiliy@fb.com >
2025-12-18 11:45:15 -05:00
Xin Yang
9a5e96523b
[LoRA] Set default MXFP4 LoRA backend to Marlin ( #30598 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-18 08:42:22 -08:00
wzyrrr
326e7c3105
[Doc] Add Sophgo TPU Support ( #30949 )
...
Co-authored-by: zhaoyang.wang <zhaoyang.wang@sophgo.com >
2025-12-18 16:29:33 +00:00
Lucas Kabela
0db5439ded
[Bugfix][torch2.10] Fix test_qwen2_5_vl_compilation with 2.10 RC ( #30822 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-18 08:23:31 -08:00
sarathc-cerebras
28d15ab56b
adds jais 2 support ( #30188 )
...
Signed-off-by: sarathc-cerebras <sarath.chandran@cerebras.net >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-12-18 15:46:58 +00:00
Wentao Ye
6628758233
[Bug] Fix batch invariant in torch 2.10 ( #30907 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-18 07:27:51 -08:00
zhrrr
eee600c34f
[Misc] support nsys profile for bench latency ( #29776 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2025-12-18 14:52:20 +00:00
Michael Goin
100f93d2be
Filter safetensors files to download if .safetensors.index.json exists ( #30537 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-18 14:51:17 +00:00
vllmellm
96bf50a2c0
[ROCm] Serving Fails on Radeon Due to AITER Dtype Import ( #30952 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-12-18 11:47:46 +00:00
Li, Jiang
f90d3636e2
[Bugfix][CPU] Fix Mac CPU build ( #30955 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-18 01:38:22 -08:00
Ming Yang
8372be2828
[moe] Use enable_chunking func (to support disabling chunking) ( #29935 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-18 09:02:38 +00:00
Andreas Karatzas
8da6ae49c3
[ROCm][Bugfix] Fix fa_version argument error in flash_attn_maxseqlen_wrapper for ROCm without aiter ( #30909 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-18 16:45:51 +08:00
Lucas Wilkinson
30bb19a760
[BugFix] Partial revert of #29558 (DeepEP HT + PIECEWISE CG support) ( #30910 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-17 23:50:15 -08:00
Chauncey
aa7e836055
[Bugfix] Fix Unicode issues in GLM-4 tool calling ( #30920 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-18 07:12:17 +00:00
Andreas Karatzas
be2ad5f920
[ROCm][Bugfix] fix(structured_output): Skip guidance backend for schemas with patternProperties ( #30730 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2025-12-18 07:04:57 +00:00
wangxiyuan
a85724bd6e
[Platform] Let EPD work with non-cuda platform ( #30225 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-12-18 06:45:29 +00:00
Yifan Qiao
11a89cf95c
[Fix][FlexAttention] return max logical block index to handle reused blocks ( #30915 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2025-12-18 06:42:21 +00:00
Li, Jiang
e3ab93c896
[CPU] Refactor CPU fused MOE ( #30531 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-12-18 14:36:49 +08:00
Nathan Price
fc2ae6d617
fix: add warmup for audio preprocessing ( #30706 )
...
Signed-off-by: Nathan Price <nathan@abridge.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-18 06:12:29 +00:00
Yihua Cheng
ec965569d9
[KV connector][LMCache] Only record the cuda event when there are requests to store/load ( #30814 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-12-18 05:31:34 +00:00
Divakar Verma
82dc338ad6
[AMD][CI] fix lm eval ci arg ( #30911 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-12-18 13:18:26 +08:00
Vadim Gimpelson
717ac33d9c
[PERF] Qwen3-next. Add fp8 cutlass MoE tuned configs. chmod -x *MI308X.json ( #29553 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-12-18 13:16:04 +08:00
Li, Jiang
cfb7e55515
[Doc][CPU] Update CPU doc ( #30765 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-18 04:59:09 +00:00
zzhxxx
b166ef20e1
[refactor] Add prefix support to embed_tokens in DeepSeek MTP ( #30788 )
...
Signed-off-by: zzhx1 <zzh_201018@outlook.com >
2025-12-18 04:45:56 +00:00
Zhengxu Chen
5f2f3fba1d
[compile] Fix CI for test_gpt2_cache_hit ( #30902 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-12-17 20:22:23 -08:00
Matthew Bonanni
4a8412f773
[UX] Reduce DeepGEMM warmup log output to single progress bar ( #30903 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-17 20:21:51 -08:00
Bowen Bao
0c738b58bc
[Quantization] Support Quark int4-fp8 w4a8 for MoE ( #30071 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2025-12-18 04:20:42 +00:00
gnovack
5a3adf581e
fused_moe_lora PDL improvements ( #30716 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-17 19:55:00 -08:00
Isotr0py
6fe5887652
[Chore] Remove v0 dead code for Qwen2.5-omni ( #30883 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-17 19:54:39 -08:00
Nicolò Lucchesi
bc3700e0cd
[NIXL] Support P tensor-parallel-size > D tensor-parallel-size ( #27274 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-18 11:53:30 +08:00
Micah Williamson
fd8afdf38d
[ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 ( #30811 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2025-12-18 10:27:37 +08:00
SungMinCho
a0b782f9cc
[Metrics] Model FLOPs Utilization estimation ( #30738 )
...
Signed-off-by: SungMinCho <tjdals4565@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-12-18 01:40:51 +00:00
Rafael Vasquez
ed2897f336
[CI][Feature] Adds auto-rebase PR rule ( #30875 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2025-12-18 00:46:44 +00:00
Isotr0py
74a1ac38b0
[v1] Add PrefixLM support to TritonAttention backend ( #30386 )
2025-12-17 16:05:24 -08:00
Nathan Price
05a83dc6ee
feat(api): Eager chat template warmup to eliminate first-request latency ( #30700 )
...
Signed-off-by: Nathan Price <nathan@abridge.com >
2025-12-18 00:01:29 +00:00
Varun Sundar Rabindranath
e3fc374a9a
[BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM ( #30899 )
2025-12-17 15:00:59 -08:00
Andrey Talman
e06d0bf0aa
2.9.1 PyTorch release update ( #28495 )
2025-12-17 12:20:22 -08:00
Xunzhuo
e3a0f21e6c
[docs]: add ecosystem projects sr in docs/governance ( #30844 )
...
Signed-off-by: bitliu <bitliu@tencent.com >
2025-12-17 18:45:56 +00:00
Matthew Bonanni
7eb6cb6c18
[Attention] Update tests to remove deprecated env vars ( #30563 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-17 09:49:59 -08:00
Nicolò Lucchesi
9ca8cb38fd
[CI][Bugfix] Fix flaky tests/entrypoints/openai/test_audio.py::test_chat_streaming_audio ( #30878 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-17 18:49:56 +01:00
Cyrus Leung
2497228ad4
[Chore] Factor out logic for requesting initial memory ( #30868 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-17 07:32:17 -08:00
KimHyemin
196cdc3224
[Model] Gemma3: Support untied word embeddings ( #30827 )
...
Signed-off-by: www-spam <panmahm@naver.com >
2025-12-17 07:11:18 -08:00
高鑫崧
b7b6a60aca
Adapt the old parameter enable_thinking in chat_template_kwargs ( #30852 )
...
Signed-off-by: xinsong.gao <1418762819@qq.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-17 07:10:59 -08:00
rongfu.leng
9e67c4ce98
[Docs] fix function name ( #30748 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-12-17 12:14:45 +00:00
Jialin Ouyang
6e9dbcc50e
[Fix] uniform decode batch check ( #30747 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-12-17 19:58:43 +08:00
Hank_
6482e3895b
chores: adjust the attn register param order ( #30688 )
...
Signed-off-by: Hank <hcc.mayday@gmail.com >
2025-12-17 19:58:16 +08:00
Harry Mellor
fb980eb2fd
Fix lazy import ( #30858 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-17 03:33:50 -08:00
baoqian426
84896fda22
[Bugfix] deepseek-V3.2 self.weights_proj has no bias ( #30841 )
...
Signed-off-by: baoqian <1354987947@qq.com >
Signed-off-by: baoqian426 <1354987947@qq.com >
2025-12-17 03:32:34 -08:00
Kevin H. Luu
4bf6c23668
[ci] Sync test areas yaml file with test-pipeline ( #30862 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2025-12-17 02:30:56 -08:00
Chauncey
9ad5b21710
[Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory ( #30749 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-12-17 02:27:30 -08:00
Wentao Ye
f284d7bd0c
[Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute weight_scale_inv ( #30823 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-17 02:00:35 -08:00
Zhengxu Chen
53cd7f868b
[compile] Recompile graph module during Dynamo cache loading. ( #30743 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@fb.com >
2025-12-17 02:00:12 -08:00
danielafrimi
7b966ae2ba
[Fix]Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) ( #30785 )
...
Signed-off-by: <>
Co-authored-by: root <root@gpu-937.slurm-workers-slurm.slurm.svc.cluster.local >
2025-12-17 01:56:38 -08:00
Zhengxu Chen
9db1db5949
[compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors ( #30809 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-12-17 01:56:24 -08:00
Zhengxu Chen
177c391db2
[compile] Disable aot when eager backend is used. ( #30810 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2025-12-17 01:55:56 -08:00
Michael Goin
519ef9a911
[UX] Make vllm bench serve discover model by default and use --input-len ( #30816 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-17 01:55:30 -08:00
Ye (Charlotte) Qi
a100152288
[Kernels][FI] Skip trtllm attention when num_kv_heads=1 ( #30842 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-12-17 01:54:21 -08:00
Andrew Xia
4c054d89aa
[Doc][ResponsesAPI] add documentation ( #30840 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-12-17 01:53:02 -08:00
Sheng Lin
f4e884f222
[NIXL][Bugfix] Fix NIXL/RDMA registration failure over CuMemAllocator ( #29569 )
...
Signed-off-by: Somoku <linsh0@protonmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2025-12-17 01:52:58 -08:00
Xinyu Chen
3b1d440ede
CustomOp: grouped topk ( #29575 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2025-12-17 17:43:00 +08:00
Asaf Joseph Gardin
a9e15c21ef
[Mamba] Removed disable cascade attn in MambaModelConfig ( #30712 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-12-17 08:48:53 +00:00
Robin
20fda43151
[Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction ( #30555 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-12-17 16:37:57 +08:00
Yan Ma
4f735babb7
[XPU] fix broken fp8 online quantization for XPU platform ( #30831 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-12-17 00:28:13 -08:00
Li, Jiang
0cd5353644
[Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models ( #30829 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-16 23:25:12 -08:00
Michael Goin
d4d2751732
Update note comment for flashinfer attention warmup ( #30711 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-16 21:29:03 -08:00
shanjiaz
009a773828
bump up compressed tensors version to 0.13.0 ( #30799 )
...
Signed-off-by: shanjiaz <zsjwpianpian@gmail.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-12-16 21:01:04 -08:00
Cyrus Leung
44d3b1df3d
[CI/Build] Fix compatibility between #30244 and #30396 ( #30787 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-12-16 20:21:19 -08:00
Fadi Arafeh
bb5ac1fe38
[CPU] Add action to automatically label CPU related PRs ( #30678 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-12-17 04:21:07 +00:00
Michael Goin
811cdf5197
Update model-hosting-container-standards to 0.1.10 ( #30815 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-12-16 17:52:14 -08:00
Grzegorz K. Karch
f5db6385a1
Fix nemotron_nas intermediate_size computation ( #30795 )
...
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com >
2025-12-17 01:06:28 +00:00
Amr Mahdi
c0a88df7f7
[docker] Allow kv_connectors install to fail on arm64 ( #30806 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2025-12-16 16:41:57 -08:00
Nicolò Lucchesi
e087fbc393
[MM] Pass FA version in ViT Attn ( #30756 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-17 07:54:45 +08:00
Michael Goin
e80455ca8b
Replace deprecated enable_fusion with fuse_norm_quant in test_rms_group_quant ( #30817 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-16 23:40:47 +00:00
TJian
2410132bb1
[ROCm] [Bugfix] Fix torch sdpa hallucination ( #30789 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-16 15:32:43 -08:00
Michael Goin
0a1ab1e565
[Perf][Kernels] Vectorize csrc/activations_kernels.cu ( #29512 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-16 14:56:02 -08:00
Wentao Ye
b6ec077e05
[CI] Skip ci failure test ( #30804 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-16 22:47:53 +00:00
Jinzhen Lin
ce96857fdd
[Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) ( #29901 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-12-16 14:35:28 -08:00
Daniel Cámpora
eaa82a709a
[Bugfix][DSV32] Fix overflow in topk. ( #30754 )
...
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-12-16 14:21:17 -08:00
Roger Wang
f5f51e5931
[Core][MM] Optimize encoder cache manager by operating with embeddings only ( #30475 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Sun Kim <sunytokki@gmail.com >
2025-12-16 14:18:17 -08:00
Lucas Wilkinson
9fec0e13d5
[Attention] Cache attention metadata builds across hybrid KV-cache groups ( #29627 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com >
2025-12-16 17:10:16 -05:00
jiahanc
254a7f8fd6
[Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE ( #30014 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2025-12-16 13:01:48 -08:00
Wentao Ye
f21f5ea38c
[Refactor] Small refactor for group topk ( #30562 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-12-16 14:50:59 -05:00
Nicolò Lucchesi
ca702a14dc
[Frontend] Add max-completion-token option to transcription/translation endpoints ( #30769 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-16 19:36:49 +00:00
Michael Goin
10ee1c64cf
[CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test ( #30723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-12-16 14:28:34 -05:00
Mark McLoughlin
66c3537e5d
[Docs][API] Remove warning about LoRARequest being internal-only ( #30774 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-12-16 08:35:46 -08:00
Harry Mellor
e1625498f4
Update where bytes_to_unicode is imported from ( #30771 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-16 08:05:01 -08:00
Harry Mellor
0b0acc758e
Remove head_mask from Ultravox and Swin ( #30764 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-16 08:02:41 -08:00
Harry Mellor
af506fd76a
Fix instantiation of HfHubHTTPError in LoRA test ( #30768 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-16 08:02:24 -08:00
Ming Yang
ce12b407f2
[TRTLLM] Remove the MoE GEMM weight name change ( #30713 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-12-16 11:01:38 -05:00
Wentao Ye
59bd5f6a71
[Feat] Enable eplb with default all2all backend ( #30559 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-12-16 10:33:52 -05:00
Lucas Wilkinson
00a8d7628c
[BugFix] Fix memory spike in workspace allocation ( #30744 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-12-16 06:46:22 -08:00
Isotr0py
4de08ad698
[CI/Build] Skip broken ViT backend functionality test temporarily ( #30782 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-16 06:45:25 -08:00
Nicolò Lucchesi
75eb302a2e
[Bugfix] Whisper fix number of allocated CrossAttn blocks per-request ( #30772 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-12-16 14:20:19 +00:00
Pleaplusone
9dbbc59b15
[ROCm][MTP] Support MTP for AITER MLA backend ( #28624 )
...
Signed-off-by: ganyi <ygan@amd.com >
2025-12-16 14:10:26 +00:00
Boyuan Feng
104003dc77
update piecewise cudagraph warning when splitting_ops=[] ( #30728 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-16 06:09:34 -08:00
TJian
d0fb572929
[ROCm] [AITER] [DOC] Add usage description about check functions in _aiter_ops ( #30586 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-12-16 13:50:47 +00:00
Harry Mellor
6f15ac5de7
Don't assume position_embedding_type will be present for BERT and RoBERTa models ( #30770 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-12-16 13:40:26 +00:00
Junru Shen
676db55eec
[Bugfix] Fix prefix_repetition routing in bench throughput ( #29663 )
...
Signed-off-by: Junru Shen <jrshen.sjr@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-16 01:37:15 -08:00
Jee Jee Li
0e391e7570
[Bugfix] Fix RequestOutput miss lora_request ( #30636 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-12-16 01:36:35 -08:00
Andrew Xia
0d0c929f23
[responsesAPI][8] input/output messages for ResponsesParser ( #30158 )
...
Signed-off-by: Andrew Xia <axia@fb.com >
Signed-off-by: Andrew Xia <axia@meta.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-12-16 13:54:59 +08:00
Isotr0py
e94384bbad
[Bugfix] Fix broken ViT attention selection for Blackwell device ( #30731 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-12-16 05:24:32 +00:00
jiangkuaixue123
b9ff4f2a8d
[feature] extend DBO to XBO ( #30120 )
...
Signed-off-by: jiangkuaixue123 <jiangxiaozhou111@163.com >
Co-authored-by: root <root@hk01dgx028.cm.cluster >
2025-12-16 00:04:01 -05:00
Boyuan Feng
c881db364e
improve lazy import test ( #30733 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-12-16 03:12:05 +00:00
Shanshan Shen
3bd9c49158
[CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic ( #29873 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
Co-authored-by: gcanlin <canlinguosdu@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2025-12-15 19:08:16 -08:00
Amr Mahdi
ff21a0fc85
[docker] Restructure Dockerfile for more efficient and cache-friendly builds ( #30626 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2025-12-15 18:52:19 -08:00
penfree
bbd850e597
[Bugfix] fix streaming final output for non harmony ( #30237 )
...
Signed-off-by: penfree <qiupengfei@baidu.com >
Co-authored-by: penfree <qiupengfei@baidu.com >
2025-12-16 09:03:11 +08:00
Shengqi Chen
511e81e7c9
[BUILD] use sm_100f when compiling flashmla to fix support on sm103 ( #30705 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
2025-12-15 14:48:01 -08:00
Matthew Bonanni
a182be4308
[UX][Attention] Add attention_config argument to LLM() ( #30710 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-12-15 17:29:09 -05:00