Jialin Ouyang
|
4228be7959
|
[Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhead (#28245)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
|
2025-11-11 10:28:47 -08:00 |
|
Zhewen Li
|
e553424919
|
[CI/Build] Refactor Attention backend for test_prefix_prefill from xformers to SDPA (#28424)
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
|
2025-11-12 01:09:47 +08:00 |
|
xuebwang-amd
|
5a1271d83a
|
[Quantization] fix attention quantization of gpt_oss model (#27334)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
|
2025-11-11 12:06:00 -05:00 |
|
xuebwang-amd
|
05576df85c
|
[ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model (#24239)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: fxmarty-amd <felmarty@amd.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2025-11-11 12:05:22 -05:00 |
|
zhrrr
|
68c09efc37
|
[Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (#27165)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
|
2025-11-11 12:00:31 -05:00 |
|
Nicolò Lucchesi
|
a7ef3eb0cd
|
[NIXL] Generalize block-first backend layouts (FlashInfer-like) (#28282)
|
2025-11-11 16:57:43 +00:00 |
|
jvlunteren
|
533b018f72
|
[BugFix] Fix Failing Ruff Check (#28469)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
|
2025-11-11 06:41:43 -08:00 |
|
bnellnm
|
a1448b4b69
|
[Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (#28064)
|
2025-11-11 07:29:02 -07:00 |
|
Matthew Bonanni
|
b30dfa03c5
|
[Attention] Refactor CUDA attention backend selection logic (#24794)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2025-11-11 07:40:44 -05:00 |
|
Robert Shaw
|
e605e8e323
|
[Bugfix] Fix Stream Sync for Shared Expert Overlap (#28430)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
|
2025-11-11 05:59:08 +00:00 |
|
Zuyi Zhao
|
bca74e32b7
|
[Frontend] Add sagemaker_standards dynamic lora adapter and stateful session management decorators to vLLM OpenAI API server (#27892)
Signed-off-by: Zuyi Zhao <zhaozuy@amazon.com>
Signed-off-by: Shen Teng <sheteng@amazon.com>
Co-authored-by: Shen Teng <sheteng@amazon.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2025-11-11 04:57:01 +00:00 |
|
Zhuohan Li
|
8d706cca90
|
[Misc] FlattenLogprobs -> FlatLogprobs (#28335)
|
2025-11-11 03:41:23 +00:00 |
|
Lucas Wilkinson
|
39029d5192
|
[CI/Test Fix] Fix CP tests on Blackwell (#28404)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
2025-11-11 01:36:29 +00:00 |
|
Matthew Bonanni
|
0bf29fadf5
|
[Test] Remove old non-varlen FA2 test (#28420)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-11-10 23:57:41 +00:00 |
|
Adrian Abeyta
|
a5a790eea6
|
[Bugfix] Ensure calculated KV scales are applied in attention. (#27232)
Signed-off-by: adabeyta <aabeyta@redhat.com>
|
2025-11-10 23:42:37 +00:00 |
|
Ilya Markov
|
d17ecc6b19
|
[PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds (#24248)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
|
2025-11-10 18:33:11 -05:00 |
|
Rémi Delacourt
|
6d54336ae5
|
[Bugfix] Fix llguidance backend, rollback when EOS was encountered (#25905)
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
Signed-off-by: remi <remi@mistral.ai>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
|
2025-11-10 14:53:32 -05:00 |
|
Varun Sundar Rabindranath
|
b039bfda8f
|
[Bugfix] Fix persistent_masked_m_silu_mul_quant tests (#28366)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
|
2025-11-10 09:21:52 -08:00 |
|
vllmellm
|
f080a83511
|
[RFC][ROCm][AITER] Keep all AITER kernels in _aiter_ops class like _custom_ops and _ipex_ops (#24490)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2025-11-10 08:20:53 -08:00 |
|
Mark McLoughlin
|
6f7de33bed
|
[Metrics] Refactor LoRA state tracking (#26801)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-11-10 16:34:36 +08:00 |
|
Shinichi Hemmi
|
a98cc35c34
|
Restore PlaMo2 unit test as pfnet/plamo-2-1b now supports transformers >=4.56 (#28019)
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com>
|
2025-11-10 06:50:02 +00:00 |
|
Lucas Wilkinson
|
e8697faf03
|
[V0 deprecation] Remove no longer used get_metadata_cls (#28370)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2025-11-10 14:32:09 +08:00 |
|
Varun Sundar Rabindranath
|
6b2b9fd934
|
[CI] lora/test_mixtral.py : Add additional expected outputs due to flakiness (#28322)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
|
2025-11-10 10:45:29 +08:00 |
|
Zhewen Li
|
a65a934ebe
|
[CI/Build] Temporary fix to LM Eval Small Models (#28324)
Signed-off-by: zhewenli <zhewenli@meta.com>
|
2025-11-09 21:08:38 +00:00 |
|
usberkeley
|
4a8d6bd168
|
Fix cu_num_generated_tokens slicing logic in LogprobsLists.slice() method (#28214)
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
|
2025-11-09 19:11:46 +00:00 |
|
Nick Hill
|
289eb6c537
|
[Core] Simplify async KV output aggregation (#28327)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-11-09 09:44:13 -08:00 |
|
Nicolò Lucchesi
|
19d91ece4b
|
[CI] Fix flaky test_eagle_correctness test (#28364)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-11-09 16:04:59 +00:00 |
|
ElizaWszola
|
171133f929
|
[Bugfix] Fix test fused quant layernorm tests (#27865)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-11-08 14:31:33 -08:00 |
|
zhangsicheng5
|
2108a571d7
|
[DCP] Support dcp kv_cache interleave size > 1 (#26696)
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: Qiu <qiuchunshuo@huawei.com>
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
|
2025-11-09 04:45:27 +09:00 |
|
Andy Lo
|
47604137a2
|
[Bugfix] Spec decode + structured output + spec model max len edge case (#28298)
Signed-off-by: Andy Lo <andy@mistral.ai>
|
2025-11-08 19:44:25 +00:00 |
|
Harry Mellor
|
d9ab1ad9d1
|
reasoning_content -> reasoning (#27752)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-11-08 12:15:08 +00:00 |
|
Isotr0py
|
934a9c3b79
|
[Model] Consolidate Deepseek-MoE implementation with DeepSeek-v2 (#28101)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2025-11-08 05:01:27 +00:00 |
|
Xiaohong (Sean) Chen
|
d0c7792004
|
[Bugfix][LoRA][Spec Decode] Support LoRA with speculative decoding (#21068)
Signed-off-by: Sean Chen <xiaohong_chen1991@hotmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Danielle Robinson <dcmaddix@gmail.com>
Co-authored-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
|
2025-11-08 01:58:22 +00:00 |
|
Boyuan Feng
|
b158df2813
|
remove resolve_op_overloads and use splitting_ops directly (#28081)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
|
2025-11-08 01:13:13 +00:00 |
|
Harry Mellor
|
811df41ee9
|
Update Flashinfer from v0.4.1 to v0.5.2 (#27952)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-11-07 16:24:42 -08:00 |
|
Nick Hill
|
67a2da890e
|
[PerfFix] Avoid separate thread for MP executor shm spin (take 2) (#28319)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-11-07 22:11:03 +00:00 |
|
Nick Hill
|
da786e339e
|
[Core] Rework handling of async scheduling config (#28250)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-11-07 20:01:23 +00:00 |
|
Benjamin Chislett
|
18903216f5
|
[Bugfix] Fix and add tests for GptOss reasoning parser (#28000)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
|
2025-11-07 19:28:04 +00:00 |
|
Nicolò Lucchesi
|
68a72a5cc1
|
Revert "[PerfFix] Avoid separate thread for MP executor shm spin (#28012)" (#28289)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-11-07 15:07:01 +00:00 |
|
Pavani Majety
|
72b1c2ae2c
|
[Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (#27439)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
|
2025-11-07 04:18:39 -08:00 |
|
Jialin Ouyang
|
ccd98b59c1
|
[Perf] Introduce FlattenLogprobs to store logprobs results to reduce GC overhead (#28171)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
|
2025-11-07 00:27:12 -08:00 |
|
Copilot
|
a736e5ff77
|
[CI] Reduce Blackwell Fusion test runtime by filtering tests and only run all tests in nightly (#28074)
|
2025-11-07 15:58:16 +08:00 |
|
Alexis MacAskill
|
a47d94f18c
|
Add runai model streamer e2e test for GCS (#28079)
Signed-off-by: Alexis MacAskill <amacaskill@google.com>
|
2025-11-07 03:07:54 +00:00 |
|
Alex Brooks
|
e70fbc599b
|
[CI/Build] Loosen STT LoRA Translate Check (Flaky Test) (#28247)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2025-11-07 02:51:27 +00:00 |
|
Lucas Kabela
|
4bf56c79cc
|
[Multimodal][torch.compile] Add compilation config field for turning off ViT/MM compile (#28242)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
|
2025-11-07 00:16:03 +00:00 |
|
Junhong Liu
|
59b453eaa2
|
Speed up mm processor kwargs per request by splitting dynamic and static kwargs (#26483)
Signed-off-by: Junhong <liujunhong11@huawei.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Co-authored-by: Junhong <liujunhong11@huawei.com>
|
2025-11-07 07:51:28 +08:00 |
|
Eugene Khvedchenya
|
827e4237bc
|
Fix failing test for CRadio (#27738)
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wang.yuqi <noooop@126.com>
|
2025-11-06 15:32:25 -08:00 |
|
Matthew Bonanni
|
ca90f50304
|
[Test] Add non-MoE DP test coverage (#28235)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-11-06 20:59:57 +00:00 |
|
Chauncey
|
59a50afa08
|
[Frontend] OpenAI Responses API supports Tool/Function calling - non-harmony (#26874)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
2025-11-06 10:40:03 +00:00 |
|
wangxiyuan
|
c3ee80a01a
|
[V0 deprecation]clean up is_v1_supported_oracle (#28116)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2025-11-06 16:05:32 +08:00 |
|