Commit Graph

14061 Commits

Author SHA1 Message Date
Kyungmin Lee
63ed2409e8 Add K-EXAONE-236B-A23B (#31621)
Signed-off-by: lkm2835 <lkm2835@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: lgai-exaone <exaonemodels@lgresearch.ai>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-12 16:30:50 +00:00
Andy Zhang
95e53d907c doc: Update model references in supported_models.md (#32188)
2026-01-12 08:15:28 -08:00
TJian
0346396e94 [ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base (#32179)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-01-12 15:33:21 +00:00
Andy Zhang
e68b0dad8b doc: Update model name for Qwen3-Coder in documentation (#32185)
Signed-off-by: Andy Zhang <xiazhang@microsoft.com>
2026-01-12 07:10:50 -08:00
Or Ozeri
9cddbdba6d OffloadingConnector: Add cpu_bytes_to_use configuration (#24498)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-12 15:00:43 +00:00
Hongxin Xu
49e6b86c91 [Feature] Support recording expert indices for rollout router replay (#28284)
Signed-off-by: xhx1022 <1737006628@qq.com>
Signed-off-by: Hongxin Xu <70438206+xhx1022@users.noreply.github.com>
Signed-off-by: arlenxu <arlenxu@tencent.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: arlenxu <arlenxu@tencent.com>
2026-01-12 06:23:04 -08:00
dtc
0565f1fdec [P/D] Refactor mooncake connector sender thread using async coroutines (#31573)
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2026-01-12 12:35:35 +00:00
Isotr0py
9dbe1fe960 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation (#32149)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-12 11:13:41 +00:00
RickyChen / 陳昭儒
a5f89ae296 [Doc] Add documentation for offline API docs feature (#32134)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
2026-01-12 10:33:48 +00:00
Jee Jee Li
05e8981234 [Doc] Improve LoRA docs (#32159)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-12 02:19:17 -08:00
XlKsyt
899541bdb1 [doc] fix broken links (#32158)
Signed-off-by: minimAluminiumalism <caixuesen@outlook.com>
2026-01-12 10:18:38 +00:00
daniel-salib
d7b2e57097 [Frontend] Fix Flaky MCP Streaming Test (#32153)
Signed-off-by: Daniel Salib <danielsalib@meta.com>
2026-01-12 18:03:32 +08:00
Andika Rachman
5e034f2e3d [cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend (#32092)
Signed-off-by: andikarachman <andika.rachman.y@gmail.com>
2026-01-12 10:03:28 +00:00
Nicolò Lucchesi
22970c1626 [Misc] Disable default --ready-check-timeout-sec extra call in vllm bench (#30975)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 01:58:21 -08:00
Cyrus Leung
600aaab8d6 [Model] Remove incorrect SupportsPP from MTP models (#32150)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-12 01:19:30 -08:00
wang.yuqi
60446cd684 [Model] Improve multimodal pooling examples (#32085)
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-12 07:54:09 +00:00
Cyrus Leung
9101dc756c [Model] Avoid hardcoding pooling type (#32119)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-11 21:28:12 -08:00
Woosuk Kwon
025a32f9ed [Model Runner V2] Remove async barrier (#32083)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-11 20:24:30 -08:00
Woosuk Kwon
19504ac07f [Model Runner V2] Skip building deprecated fields in attn metadata (#32132)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-11 14:31:04 -08:00
Jiangyun Zhu
3df619ac94 [CI] fix test_concat_and_cache_mla_rope_fused (#32117)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2026-01-11 15:11:11 +00:00
Ning Xie
d74132ca3b fix offline inference chat response prompt (#32088)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-11 14:01:18 +00:00
maang
a34abc49b7 [FixBug] Improve exception string in tensorizer.py (#31680)
Signed-off-by: maang <maang_h@163.com>
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-11 05:01:53 -08:00
rongfu.leng
d70249e2e9 [Misc] fix this log format not space (#32112)
Signed-off-by: lengrongfu <lenronfu@gmail.com>
2026-01-11 05:01:16 -08:00
Cyrus Leung
a374532111 [CI/Build] Separate out flaky responses API tests (#32110)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-11 05:01:12 -08:00
Isotr0py
cee7436a26 [Misc] Make scipy as optional audio/benchmark dependency (#32096)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-11 00:18:57 -08:00
Or Ozeri
4c16ba617f [KVConnector] OffloadingConnector: Fix bug in handling of preemptions (#29870)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-11 08:05:36 +00:00
Matt
bde57ab2ed [Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group (#31713)
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
2026-01-10 23:19:46 -08:00
Fadi Arafeh
9103ed1696 [CPU][BugFix] Disable AOT Compile for CPU (#32037)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-01-10 23:15:49 -08:00
Laith Sakka
46eb30f519 make assume_32_bit_indexing configurable (#32044)
Signed-off-by: Laith Sakka <lsakka@meta.com>
2026-01-10 23:15:46 -08:00
Andy Liu
0dd63639be [MTP][GLM][Bugfix] Fixed .weight_scale loading logic that dropped MTP prediction accuracy with fp8+mtp (#32101)
Signed-off-by: Andy Liu <andyliu@roblox.com>
2026-01-10 23:14:54 -08:00
Cyrus Leung
ef96fa3f1f [Benchmark][2/2] Use spline interpolation to tune SLA variables (#32095)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 20:27:27 -08:00
Or Ozeri
2a4dbe24ea [BugFix] Wait for compute before offloading KV to CPU (#31341)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-10 22:25:08 +00:00
RickyChen / 陳昭儒
8020a60402 [Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification (#32089)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-10 12:40:09 -08:00
Vadim Gimpelson
e15a5ff07b [MISC] Add strict contiguity check for FlashInfer attention tensors (#32008)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
2026-01-10 12:40:05 -08:00
Vensen
6ea001cfb7 [Bugfix][Quantization] Ensure input contiguity in per_token_quant_int8 (#31637)
Signed-off-by: vensen <vensenmu@gmail.com>
2026-01-10 12:40:02 -08:00
shyeh25
1c46dea001 Revert "[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… (#31617)
Signed-off-by: shyeh25 <206795756+shyeh25@users.noreply.github.com>
2026-01-10 12:39:59 -08:00
Or Ozeri
028599739d [BugFix] scheduler: Fix resuming of preempted requests after async load (#31583)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-10 12:39:25 -08:00
gnovack
d1fd802fa3 fused_moe_kernel - cast accumulator after applying router weights (#32002)
Signed-off-by: gnovack <gnovack@amazon.com>
2026-01-11 04:36:45 +08:00
Xin Yang
543c23be78 [LoRA][Perf] Improve FusedMoE LoRA performance for small rank (#32019)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-10 11:04:18 -08:00
jvlunteren
b8bf5c45bb [Kernel] Optimize Sliding Window Attention in 3D Triton Kernel (#31984)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
2026-01-10 18:13:44 +00:00
Michael Goin
e6c6f2c79d [Quant] Support MXFP4 W4A16 for compressed-tensors dense models (#31926)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-01-10 06:44:35 -08:00
Jeremy Teboul
07286ec5a6 [Bugfix] Fix integer overflow in Gemma3n audio processing (#31657)
Signed-off-by: Jeremy Teboul <jeremyte@meta.com>
2026-01-10 17:52:53 +08:00
Ning Xie
14fc7a68c7 [Bugfix] fix offline chat output prompt (#32076)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-10 07:50:57 +00:00
Cyrus Leung
5f2385a4c8 [Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins (#32075)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 07:11:03 +00:00
Frelam
a01a1c0d69 [Bugfix] fix encoder cache leak of waiting requests in scheduler to solve stuck in CPU scheduling (#31857)
Signed-off-by: frelam <frelam112233@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-10 06:27:58 +00:00
Lucas Wilkinson
da6709c9fe [Misc] Delay deprecation of CommonAttentionMetadata properties (#32074)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-09 21:06:44 -08:00
Andreas Karatzas
d83becd503 [ROCm][CI] Fix flaky test_function_calling_with_stream and reduce schema test examples (#32063)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-10 05:02:35 +00:00
roikoren755
0c9614876e Update modelopt KV cache quantization resolution to new scheme (#31895)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-01-10 04:54:13 +00:00
Cyrus Leung
583a90e005 [Refactor] Separate sequence and token pooling types (#32026)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 04:53:24 +00:00
maang
52d428295d [Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward (#31939)
Signed-off-by: maang <maang_h@163.com>
2026-01-10 04:19:49 +00:00