34cd32fe30 | Michael Goin | 2026-01-09 07:40:33 -07:00
[Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe (#31832)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>

7cdf7e2fe0 | maang | 2026-01-09 06:12:44 -08:00
[Model] Remove redundant None check in DeepSeekOCR image input processing (#32016)
Signed-off-by: maang <maang_h@163.com>

e7b68f4d6c | Xin Yang | 2026-01-09 11:46:59 +00:00
[Bugfix] Fix Triton FusedMoE LoRA (#30585)
Signed-off-by: Xin Yang <xyangx@amazon.com>

c8ed39b9dd | Cyrus Leung | 2026-01-09 11:02:14 +00:00
[Model] Reorganize pooling layers (#31973)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

dc77cb7129 | Alex Brooks | 2026-01-09 10:28:43 +00:00
[Bugfix] Fix Var Length Batched Padding in Granite Speech (#31906)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

0fa8dd24d2 | Robert Shaw | 2026-01-08 16:18:50 -08:00
[Bugfix] Fix Typo from NVFP4 Refactor (#31977)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>

5825bbc1f7 | Robert Shaw | 2026-01-08 19:07:45 -05:00
[Quantization] Deprecate Long Tail of Schemes (#31688)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

d62cfe546d | Yongye Zhu | 2026-01-08 19:01:30 -05:00
[MoE Refactoring][Bugfix] Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection (#31752)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>

6cdf015c3c | Lucas Wilkinson | 2026-01-08 15:20:49 -08:00
[Misc] Fix "Current vLLM config is not set." warnings, assert to avoid issues in the future (#31747)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

5d3b6097ad | Dipika Sikka | 2026-01-08 17:45:17 -05:00
[Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs (#30881)

e74698c27a | bnellnm | 2026-01-08 20:52:55 +00:00
[Misc][Refactor] Add FusedMoERouter object (#30519)
Signed-off-by: Bill Nell <bnell@redhat.com>

87e07a6b46 | Michael Goin | 2026-01-08 11:31:53 -08:00
Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE kernels" (#31978)

b8112c1d85 | danisereb | 2026-01-08 16:08:37 +00:00
[Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 (#31960)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>

fe86be66c5 | yxing-bj | 2026-01-08 14:42:57 +00:00
[Model] Support IQuestCoder model (#31575)
Signed-off-by: yxing <yxing@iquestlab.com>

1123a87892 | Ce Zhao | 2026-01-08 05:00:57 -08:00
[Model] Enable LoRA support for Pixtral (#31724)
Signed-off-by: <>
Signed-off-by: 赵策 <alcor@zhaocedeMacBook-Air.local>
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com>
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com>

03fd76c570 | tianshu-Michael-yu | 2026-01-08 05:00:27 -08:00
[Model] Add LFM2-VL model support (#31758)
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

59d260f5e4 | Bijaya Dangol | 2026-01-08 04:59:48 -08:00
[Model] Add Grok-2 (#31847)
Signed-off-by: dangoldbj <dangoldbj23@gmail.com>

18d4e481d0 | Patrick von Platen | 2026-01-08 18:34:19 +08:00
[Voxtral] Fix speech transcription api (#31388)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: bk-201 <joy25810@foxmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: bk-201 <joy25810@foxmail.com>
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: Anexdeus <5142168@mail.ru>
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>

2972a05473 | Isotr0py | 2026-01-08 02:33:48 -08:00
[MM Encoder]: Make MMEncoderAttention's scale takes effect properly (#31950)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

5576227bc1 | Cyrus Leung | 2026-01-08 02:33:16 -08:00
[Model] Standardize common vision encoders (#31947)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

d1b6fe007f | Cyrus Leung | 2026-01-08 02:16:21 -08:00
[Chore] Further cleanup pooler (#31951)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

04a49669d1 | omer-dayan | 2026-01-08 10:00:25 +00:00
RayLLM Bugfix - Preserve obj store URL for multi engine_config creation (#30803)
Signed-off-by: Omer Dayan <omdayan@nvidia.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

96fcd3c267 | BingjiaWang | 2026-01-08 09:27:50 +00:00
[Misc] Support qwen3-next lora (#31719)

eac3b96ec0 | Isotr0py | 2026-01-08 08:10:15 +00:00
[Models] Allow converting Qwen3-VL into Reranker model (#31890)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

63baa28cf5 | Zyyeric | 2026-01-08 15:45:53 +08:00
[Model] Enable LoRA support for tower and connector in GLM4-V (#31652)
Signed-off-by: Zyyeric <eric1976808123@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

9572f74f15 | ShaanveerS | 2026-01-08 14:50:16 +08:00
[Model] Enable LoRA support for tower and connector in DotsOCR (#31825)
Signed-off-by: ShaanveerS <shaanver.singh@gmail.com>

c4041f37a4 | Andreas Karatzas | 2026-01-08 04:17:56 +00:00
[ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling (#31931)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>

9f6dcb71ae | Robert Shaw | 2026-01-08 03:46:27 +00:00
[MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>

25eef3dc2e | Rabi Mishra | 2026-01-08 10:27:09 +08:00
feat(moe): Add is_act_and_mul=False support for Triton MoE kernels (#31645)
Signed-off-by: rabi <ramishra@redhat.com>

5dcd7ef1f2 | Robert Shaw | 2026-01-07 19:42:33 -05:00
[MoE Refactor][15/N] Apply Refactor to Fp8 (#31415)

ffc0a2798b | Elvir Crnčević | 2026-01-07 17:47:54 -05:00
Add back missing DeepEP LL params (#31911)
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>

0ada960a20 | Xin Yang | 2026-01-07 12:16:32 -08:00
[Kernel] Support bias type in grouped_topk kernel (#31781)
Signed-off-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

bf184a6621 | roikoren755 | 2026-01-07 17:37:19 +00:00
Enable quantized attention in NemotronH models (#31898)
Signed-off-by: Roi Koren <roik@nvidia.com>

b7036c87a1 | Cyrus Leung | 2026-01-08 00:07:43 +08:00
[Refactor] Clean up pooler modules (#31897)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

cc6dafaef2 | Kate Cheng | 2026-01-07 10:53:54 -05:00
[Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 Linear Op) (#29213)
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Co-authored-by: Jhao-Ting Chen <jhaotingc@nvidia.com>

b665bbc2d4 | Cyrus Leung | 2026-01-07 13:44:36 +00:00
[Chore] Migrate V0 attention utils (#31891)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

974138751b | Jared Wen | 2026-01-07 13:08:29 +00:00
[Refactor] GLM-ASR Modeling (#31779)
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

d111bc53ad | Andy Liu | 2026-01-07 09:18:52 +00:00
[Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on (#31757)
Signed-off-by: Andy Liu <andyliu@roblox.com>

0790f07695 | BlankR | 2026-01-07 09:00:16 +00:00
[Misc] Improve error messages for unsupported types and parameters (#30593)
Signed-off-by: BlankR <hjyblanche@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

1f33e38e81 | maang | 2026-01-07 08:18:28 +00:00
[Model] Cleanup: Remove redundant manual definition of make_empty_intermediate_tensors in GLM-4-MoE (#31869)
Signed-off-by: maang <maang_h@163.com>

e7596371a4 | weiyu | 2026-01-07 16:07:16 +08:00
[Refactor][TPU] Remove torch_xla path and use tpu-inference (#30808)
Signed-off-by: Wei-Yu Lin <weiyulin@google.com>
Signed-off-by: weiyu <62784299+weiyu0824@users.noreply.github.com>

0dd5dee9b9 | xuebwang-amd | 2026-01-07 07:36:13 +00:00
[Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe (#31676)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

4614c5a539 | Kevin McKay | 2026-01-07 06:55:03 +00:00
[Bugfix][Hardware][AMD] Consolidate FP8 min/max values helper function (#31106)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Signed-off-by: Kevin McKay <kevin@example.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

482914849c | ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 | 2026-01-07 14:49:39 +08:00
[BugFix] LoRA: Support loading base_layer of experts (#31104)
Signed-off-by: Hollow Man <hollowman@opensuse.org>

0a2c2dc3f1 | Jack Yang | 2026-01-07 04:08:47 +00:00
fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround (#31465)
Signed-off-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

a051525e07 | Ce Zhao | 2026-01-07 10:09:32 +08:00
[Model] Enable LoRA support for PaliGemma (#31656)
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com>
Signed-off-by: Alcor <alcor_zhao@outlook.com>
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com>

8becf146bd | Li, Jiang | 2026-01-06 19:10:18 +00:00
[Quantization][Refactor] Move CPU GPTQ kernel into MP linear (#31801)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

4e67a8f616 | Yakine Tahtah | 2026-01-06 17:57:56 +00:00
[Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking (#31055)
Signed-off-by: ReinforcedKnowledge <reinforced.knowledge@gmail.com>

22dffca982 | Vadim Gimpelson | 2026-01-06 17:32:46 +00:00
[PERF] Speed-up of GDN attention decode part (Qwen3-Next) (#31722)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

2f4bdee61e | Jinzhen Lin | 2026-01-06 09:07:19 -08:00
[Quantization][MoE] remove unused ep logic from moe marlin (#31571)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>