Compare commits

...

955 Commits

Author SHA1 Message Date
Nick Hill
439368496d [BugFix] Fix PP/async scheduling with pooling models (#28899)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-18 00:20:45 -08:00
Isotr0py
896e41ae04 [CI/Build] Replace Wikipedia URLs with local server ones (#28908)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-18 08:10:55 +00:00
Kuntai Du
5bb1da5190 [MISC] Remove format.sh (#28906)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-11-18 05:28:31 +00:00
Nick Hill
5bdd155277 [CI] Fix async scheduling + spec decoding test flake (#28902)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-18 05:26:32 +00:00
Ning Xie
0168f69e50 [Misc] Remove unnecessary parentheses from log statements (#28897)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-11-17 20:33:46 -08:00
Didier Durand
083cf326dc [Doc]: fix typos in various files (#28863)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-17 20:32:14 -08:00
Cyrus Leung
bf9e1e8767 [Bugfix] Fix wrong CLI defaults for dynamic SchedulerConfig fields (#28872)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-17 20:30:29 -08:00
Wentao Ye
3ddcf46011 [Refactor] Remove Unused Func in Batch Invariant (#28881)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-17 20:29:29 -08:00
xuebwang-amd
d0a73620cc [ROCm][Quantization] add apply_vllm_mapper in quark config for models like gpt-oss (#28638)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-18 11:16:45 +08:00
Michael Goin
88ab591f0b Run macos smoke test workflow on main commit (#28752)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-18 11:16:03 +08:00
Benjamin Bartels
b6e04390d3 [Bugfix] Fix Kimi-K2 tool parser concatenated tool calls parsing (#28831)
Signed-off-by: Thomas Mao <yiyeguhu@gmail.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
Co-authored-by: Thomas Mao <yiyeguhu@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-11-17 19:13:25 -08:00
Zhuohan Li
552cac95b5 [Misc] Fix wrong comment in scheduler (#28880)
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
2025-11-17 15:32:22 -08:00
Bangsheng Tang
61485844fc [BugFix] Corner case that could cause out-of-sync with external launcher mode and dp > 1 (#28774) 2025-11-17 15:22:11 -08:00
Pranav
f77bce001a [Model] Add Afmoe architecture implementation (#28332)
Signed-off-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>
Signed-off-by: Pranav <veldurthipranav@gmail.com>
Co-authored-by: Maziyar Panahi <maziyar.panahi@iscpif.fr>
2025-11-17 15:11:20 -08:00
Wentao Ye
a289cc1dde [Test] Batch Invariant: Rename and organize tests (#27421)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-17 18:09:47 -05:00
Shreyas Kulkarni
95ae50b7d1 [Quantization] [Eagle] Add complete quantization support to the draft model in Eagle (#28435)
Signed-off-by: Shreyas Kulkarni <shreyas.gp269@gmail.com>
2025-11-17 15:01:34 -08:00
Nick Hill
7765e5ba75 [BugFix] Fix PP performance and PP kv connector output regression (#28768)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-17 14:08:50 -08:00
Ronald
d8874c61a5 [Core] Async Scheduling X Spec Decoding Compatibility (#24799)
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-11-17 12:16:20 -08:00
Zhewen Li
f8b19c0ffd [Bugfix] Fix GPT-OSS on AMD after #28603 (#28816)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-17 13:15:26 -05:00
tiehexue
e42bd8c2e3 Cast return value to int64_t for cache size (#28814)
Signed-off-by: tiehexue <tiehexue@hotmail.com>
2025-11-17 16:02:32 +00:00
Roger Wang
7f064491f8 [Bugfix][Perf] Revert applying HF processor on text-only inputs for multimodal models (#28858)
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-11-17 14:49:25 +00:00
Lucas Wilkinson
64e39d667c [BugFix] Temporary fix for IMA with MTP = 2 and full-cg (#28315)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-17 09:41:22 -05:00
Kunshang Ji
1b82fb0ad3 [XPU] work around for sp, avoid custom op import error (#28822)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-17 13:16:44 +00:00
Jae-Won Chung
d4acf518d0 [Metrics] Fix KV cache usage percent metric multiproc (#28792)
The `vllm:kv_cache_usage_perc` Gauge metric is missing `multiprocess_mode="mostrecent"` and ends up returning

```
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="277"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="275"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-VL-8B-Instruct",pid="273"} 0.6530455880475035
...
```

The deprecated `vllm:gpu_cache_usage_perc` Gauge metric has `multiprocess_mode="mostrecent"`.

Signed-off-by: Jae-Won Chung <jwnchung@umich.edu>
2025-11-17 09:54:15 +00:00
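For context on the fix above: in `prometheus_client`, a Gauge's `multiprocess_mode` controls how per-process values are merged at scrape time; the default (`all`) exports one pid-labelled sample per worker, which is exactly what produces the stale `0.0` rows shown in the commit message. A minimal sketch of the corrected declaration (illustrative only, not vLLM's actual metrics code; `mostrecent` requires a recent `prometheus_client`):

```python
import os
import tempfile

# Must be set before metrics are created so prometheus_client uses
# its multiprocess value store (illustrative setup only).
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", tempfile.mkdtemp())

from prometheus_client import CollectorRegistry, Gauge, generate_latest, multiprocess

# With multiprocess_mode="mostrecent", only the most recently set value
# across all worker processes is exported; the default "all" would emit
# one pid-labelled sample per process, including stale zeros.
kv_cache_usage = Gauge(
    "vllm:kv_cache_usage_perc",
    "KV cache usage (fraction of blocks in use)",
    labelnames=["model_name", "engine"],
    multiprocess_mode="mostrecent",
)

kv_cache_usage.labels(
    model_name="Qwen/Qwen3-VL-8B-Instruct", engine="0"
).set(0.6530455880475035)

# Scrape path: a MultiProcessCollector merges the per-process files
# according to each metric's multiprocess_mode.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
print(generate_latest(registry).decode())
```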
wuyaoxuehun
ab01cd14e5 [BugFix] Fix glm4_moe_mtp load weights bug (#28805)
Signed-off-by: wuyaoxuehun <798143193@qq.com>
2025-11-17 17:13:11 +08:00
Li, Jiang
577bb34fff [CPU][Bugfix] Fix _to_list in CPU model runner (#28824)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-17 07:47:24 +00:00
Jee Jee Li
3380ed5e11 [Doc] Add llama4 LoRA tag (#28825)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-17 14:08:48 +08:00
Jay Caldwell
6f37419244 [Bugfix][Model] Prevent special token leakage in KimiK2ToolParser streaming mode (#28543)
Signed-off-by: Jscaldwell55 <jay.s.caldwell@gmail.com>
2025-11-17 13:54:46 +08:00
Xiake Sun
60e089f0b9 [ROCm][Qwen3-32B] Fix AITER MHA accuracy issue caused by #25763 (#28670)
Signed-off-by: Xiake Sun <xiake.sun@amd.com>
2025-11-16 20:52:11 -08:00
liuzhenwei
d64429bb36 [NIXL][XPU] update install script of NIXL (#28778)
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
2025-11-17 03:01:33 +00:00
jiahanc
561253b37f [Performance][Fix] update nvfp4 code to support renorm routing (#28569)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-16 18:02:42 -08:00
Nick Hill
80b6080ddc [BugFix] Fix async scheduling + chunked prefill + preemption (#28787)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-17 06:46:46 +08:00
amirkl94
03ee48111d Feature: Support Relu2 in FusedMoE fp8 cutlass path (#27261) 2025-11-16 13:39:44 -05:00
Lukas Geiger
5a87076d6e [Model][QwenVL] Optimize Qwen2_5_VisionAttention q,k preparation (#28769)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-16 17:37:15 +00:00
Ning Xie
ac1daf3233 fix comment typo (#28802)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-11-16 17:03:21 +00:00
Didier Durand
63fed55506 [Doc]: fix typos in various files (#28811)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-16 14:30:06 +00:00
Anna Shors
8d259fad6c Fix gpt oss weight loading with EP + bf16 (#28765)
Signed-off-by: ashors1 <ashors@nvidia.com>
2025-11-16 13:12:45 +00:00
scottzh8
3bc1175798 [Bugfix] Fix host and port join for ipv6 in bench serve (#28679)
Signed-off-by: Scott Zhang <scottzh@fb.com>
Co-authored-by: Scott Zhang <scottzh@fb.com>
2025-11-16 10:20:57 +00:00
Dezhan
af02c40970 Fixed gpt-oss _load_weights_other() parameter position bug (#28715)
Co-authored-by: Dezhan Tu <dztu@meta.com>
2025-11-16 09:46:29 +00:00
Lucia Fang
b316ac6589 [V1] Support MP Executor for multi node distributed inference (#23691)
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Lucia Fang <116399278+luccafong@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-16 09:01:21 +00:00
wang.yuqi
a55b64635c [Model] Allow users to control skipping cache reads per request. (#28194)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
2025-11-16 00:04:50 -08:00
ai-jz
d231876ce3 [Benchmark] Fix client seed synchronization in multi-turn benchmark (#28512)
Signed-off-by: ai-jz <aijz.xplr@gmail.com>
2025-11-16 15:04:32 +08:00
Bram Wasti
f849ee739c Adding a benchmark for batch invariance (#28161)
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-16 13:22:17 +08:00
Lucas Wilkinson
be263f7645 [BugFix] Fix AssertionError: DCP not support reorder_batch_threshold > 1 now. (#28751)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-15 22:35:06 +00:00
Didier Durand
2bb4435cb7 [Doc]: fix typos in various files (#28567)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
2025-11-15 19:27:50 +00:00
Lukas Geiger
07cadab27a [Model][Qwen3VL] Cache positional embedding indices (#28475)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-11-15 19:03:09 +00:00
Nick Hill
637f292196 [CI] Fix broken pipeline (#28781)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-15 08:44:14 -08:00
Eldar Kurtić
e439c784fa Add support for Eagle with separate lm-head and embed_tokens layers (#28549)
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
2025-11-15 06:12:02 -08:00
hwhaokun
085a525332 [Model] Fix lmhead init bug of bailing_moe (#28777)
Signed-off-by: hwhaokun <haokun0405@163.com>
Co-authored-by: zhaozx-cn <zhaozx2116@163.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-15 05:44:12 -08:00
Cyrus Leung
89d3679221 [Doc] Fix failing doc build (#28772)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-15 05:33:27 -08:00
tingtinggithub
cb15ee28db Allow Gemma3 to take image embeddings (#28483)
Signed-off-by: tingtinggithub <streamttt@gmail.com>
2025-11-15 04:18:08 -08:00
Angela Yi
f36292dbee [compile] Enable sequence parallelism matching w/o custom ops enabled (#27126)
Signed-off-by: angelayi <yiangela7@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: ProExpertProg <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <luka.govedic@gmail.com>
2025-11-15 11:46:12 +00:00
Vadim Gimpelson
173b356abf [PERF] Remove TRTLLM Gen attn kernel limitation max_seq_len <=131072 (#28755)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-15 15:43:41 +05:30
Cyrus Leung
638e4196d1 [Misc] Make SchedulerConfig.max_model_len init-only (#28733)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-15 01:59:31 -08:00
Zhewen Li
1ec978c209 [Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 (#28709)
Signed-off-by: Zhewen Li <zhewenli@meta.com>
2025-11-15 01:10:48 -08:00
Jane (Yuan) Xu
74b5267d3a Use narrow over indexing in hadacore_transform to prep for ABI stable (#28756)
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-15 01:10:15 -08:00
Zhuohan Li
dd6ac1c2bb [RL] [V1] Remove unused device argument from reset_kv_cache (#28766)
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
2025-11-14 23:59:42 -08:00
Cyrus Leung
98b4d389ed [Redo] #26368 (#28771)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-14 22:47:41 -08:00
Varun Sundar Rabindranath
6965ef436f [Performance][DeepGEMM] Estimate expected_m (#28694)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-15 13:52:14 +08:00
Chendi.Xue
c9e665852a [NIXL] heterogeneous block_size support (#26759)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2025-11-14 21:51:32 -08:00
Mohammad Othman
363aaeef0f Fix IntermediateTensors initialization and add type hints (#28743)
Signed-off-by: Mohammad Othman <Mo@MohammadOthman.com>
Co-authored-by: Mohammad Othman <Mo@MohammadOthman.com>
2025-11-15 04:31:36 +00:00
Nick Hill
ac86bff8cb Revert "[Core] Performance: Use list[np.ndarray] instead of list[list… (#28773) 2025-11-14 20:24:00 -08:00
Michael Goin
edfe498189 [Bugfix] Build hadacore kernels on >SM90 (#28748)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-14 19:51:05 -08:00
Lukas Geiger
f05d474c8a [Model][Qwen3VL] Use mm_position to compute mrope positions (#28730)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-14 19:45:11 -08:00
QiliangCui
9fc81ec765 [TPU] Fix import error in tpu launch (#28758)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-11-15 00:58:32 +00:00
Jialin Ouyang
186352b270 [Core] Performance: Use list[np.ndarray] instead of list[list[int]] for output tokens for GC optimization (#26368)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-14 16:04:04 -08:00
Nick Hill
58e61e56b7 [Test] Rework e2e async scheduling tests (#28744)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-14 16:01:09 -08:00
Gregory Shtrasberg
75f01b9d3c [ROCm][CI/Build] Upgrade to ROCm 7.1 and AITER main (#28753)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-11-14 15:53:21 -08:00
rasmith
ba041d980b [Log] Save profiler results to file instead of stdout (#28144)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-14 23:26:39 +00:00
Thomas Parnell
e0c910bb89 [Hybrid] [Kernel] Fix chunk scan kernel when BLOCK_SIZE_DSTATE > 128 (#28295)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-11-14 22:55:42 +00:00
Benjamin Chislett
bf3ffb61e6 [Bugfix] Fix ChunkedLocalAttention CUDA Graph setting (#28739)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-11-14 14:14:46 -08:00
Alexander Matveev
e5c78956c0 [Bugfix] Fix incorrect use of hidden_states for shared_experts due to do_naive_dispatch_combine (#28740)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-11-14 14:13:46 -08:00
Laith Sakka
2e0ad629b0 Avoid bytecode hook and simplify TorchCompileWrapperWithCustomDipatch (#25110)
Signed-off-by: Laith Sakka <lsakka@meta.com>
2025-11-14 14:11:10 -08:00
Gregory Shtrasberg
5a84b76b86 [ROCm][CI/Build] Change install location of uv (#28741)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-11-14 21:34:18 +00:00
Marcin Ostrowski
0de4f217ab [Bugfix] TypeError: 'NoneType' object is not callable (#27410)
Signed-off-by: Marcin Ostrowski <marcinx.ostrowski@intel.com>
2025-11-14 21:13:53 +00:00
Michael Goin
f08eab2acc [CI] Fix macos smoke test uv cache issue (#28736)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-14 13:29:55 -07:00
Sage Moore
8977ffb5e6 [ROCm][Bugfix] Fix compilation errors with fused_qknorm_rope_kernel.cu (#28682)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-11-14 11:06:01 -08:00
Andrey Khalyavin
fd4555089a [BugFix] Fix misprint introduced by modular_kernel refactoring. (#28728)
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru>
2025-11-14 10:58:18 -08:00
GuanH
cec275efce [Bugfix] resolve Qwen3-VL GPTQModel quantized model loading failure (#28663)
Signed-off-by: GuanH <guansdrailib@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-14 18:44:27 +00:00
Cyrus Leung
e2741f6cbc [Chore] Rename SchedulerConfig.chunked_prefill_enabled (#28735)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-14 18:39:57 +00:00
Harry Mellor
67187554dd [Docs] Enable some more markdown lint rules for the docs (#28731)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-14 18:39:19 +00:00
TJian
a425dc256e [Bugfix] [ROCm] [AITER]: Fix aiter block quant not compatible with torch compile dynamo (#28716)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-11-14 10:30:50 -08:00
Fardin Hoque
964d65deed LLaMA4 LoRA Adapter Enablement (#28602)
Signed-off-by: Fardin Hoque <kfhfar@amazon.com>
Co-authored-by: Wei Wei <wwei6@meta.com>
2025-11-14 13:27:56 -05:00
Chen Wang
9261eb3dc1 docs(lora_resolvers): clarify multi-resolver order and storage path requirement (#28153)
Signed-off-by: Chen Wang <Chen.Wang1@ibm.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-14 18:08:30 +00:00
czhu-cohere
cdd7025961 [kernel] Improve FP8 PTPC on Hopper for larger shapes (#28692)
Signed-off-by: czhu-cohere <conway.zhu@cohere.com>
2025-11-14 09:59:11 -08:00
Julien Denize
085424808e Remove audio optional dependency for mistral-common (#28722)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-14 09:54:38 -08:00
Mohammad Othman
a17e36f223 Fix typo in comment: existance -> existence (#28737)
Signed-off-by: Mohammad Othman <emranm226@hotmail.com>
2025-11-14 09:35:45 -08:00
Matthew Bonanni
8cc40f8992 [Attention] Bump FA for removed method (#28429)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-14 09:13:37 -08:00
Nicolò Lucchesi
6f1e7f7226 [DisaggEverything] Tokens in<>out /generate endpoint (#24261)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-14 09:58:01 -07:00
Michael Goin
d54a18a47e [CI][CPU] Smoke test for Apple Silicon using GHA MacOS runner (#28688)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-14 09:37:18 -07:00
Harry Mellor
5f3cd7f7f2 [Docs] Update the name of Transformers backend -> Transformers modeling backend (#28725)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-14 16:34:14 +00:00
dongbo910220
c934caee88 [Fix] improve aspect ratio in dummy image generation and add common VLM tests for PaddleOCR-VL (#28711)
Signed-off-by: dongbo910220 <1275604947@qq.com>
2025-11-14 16:07:20 +00:00
Duncan Moss
3f8a874065 [Kernels] Enable FlashInfer FP8 Blockscale on SM90 (for TEP DSR1) (#27134)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-14 08:02:44 -08:00
Cyrus Leung
511a6b611d [Config] Clean up SchedulerConfig initialization (#28665)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-14 22:41:02 +08:00
Nicolò Lucchesi
96b23b8e3b [Bugfix][Nixl] Fix kernel physical<>logical block_size issue (#28677)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-11-14 22:40:05 +08:00
zhaozx-cn
433c0f8675 [Model] Fix bailing_moe accuracy problem (#28277)
Signed-off-by: zhaozx-cn <zhaozx2116@163.com>
2025-11-14 13:33:02 +00:00
Fasal Shah
8d3748d3c7 [Doc] Fix macOS installation dependency resolution issue (#26721)
Signed-off-by: faisal shah <fashah@redhat.com>
2025-11-14 12:43:56 +00:00
Lucas Wilkinson
db56a59970 [BugFix] Fix FA3 IMA with FULL_AND_PIECEWISE and cascade attention (default) (#28702) 2025-11-14 12:19:22 +00:00
Yong Hoon Shin
9324e10275 Fix KV sharing fast prefill with cudagraph enabled (#28537)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-14 11:53:42 +00:00
Jingchun Gao
4516d44b7f [DCP] Support Decode Context Parallel (DCP) for GQA with Flashinfer (#25438)
Signed-off-by: gaojc <1055866782@qq.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Co-authored-by: gaojingchun (A) <g00955623@china.huawei.com>
Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com>
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
2025-11-14 11:24:10 +00:00
Shanshan Shen
41b92f7d38 [Model][MM] Extract conv layer as CustomOp (#28455)
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-14 19:16:13 +08:00
Srreyansh Sethi
360bd8762f [Frontend] Added chat-style multimodal support to /classify. (#27516)
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com>
Signed-off-by: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com>
Signed-off-by: vnadathur <glvikramn@gmail.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: vnadathur <236933696+vnadathur@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vnadathur <glvikramn@gmail.com>
Co-authored-by: wang.yuqi <noooop@126.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
2025-11-14 11:03:55 +00:00
lyn610
ecf8230d4d [Metrics] Log number of preempted requests (#28522)
Add tracking and periodic logging for the number of preempted requests in the
metrics logger. This helps monitor system behavior under load.

Signed-off-by: Yining Liu <610lyn@gmail.com>
2025-11-14 09:47:45 +00:00
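Roughly, the pattern this commit describes looks like the following sketch (illustrative, under assumed names, not the PR's actual code): a counter incremented on each preemption event and flushed with the periodic stats log.

```python
import time


class PreemptionStatsLogger:
    """Illustrative sketch: track preemptions and log them periodically."""

    def __init__(self, interval_s: float = 10.0) -> None:
        self.interval_s = interval_s
        self.num_preempted = 0
        self._last_log = time.monotonic()

    def on_preemption(self, num_requests: int = 1) -> None:
        # Called by the scheduler whenever requests are preempted.
        self.num_preempted += num_requests

    def maybe_log(self) -> None:
        # Called once per engine step; emits at most one line per interval.
        now = time.monotonic()
        if now - self._last_log >= self.interval_s:
            print(
                f"Preempted {self.num_preempted} request(s) "
                f"in the last {now - self._last_log:.1f}s"
            )
            self.num_preempted = 0
            self._last_log = now
```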
Xing Liu
8cfbe89b93 [Misc] fix comment in test_envs (#28529)
Signed-off-by: Xing Liu <xingliu14@gmail.com>
2025-11-14 09:32:46 +00:00
Boyuan Feng
fd75d3e8c0 [Minor] avoid registering a new custom op and just import silly_attn (#28578)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-11-14 09:32:31 +00:00
Michael Goin
c9a3a02149 Add output token counting to gsm8k eval (#28594)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-14 09:32:03 +00:00
Nick Hill
bc3e43069a [BugFix] Fix multi-modal async scheduling race condition (#28706)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-14 01:11:13 -08:00
Jiangyun Zhu
c36bcfe6b3 [Bugfix] fix dots.ocr pp support (#28705)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-14 09:01:26 +00:00
Yan Ma
529cea343d use default CCL_ZE_IPC_EXCHANGE (#28700)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2025-11-14 16:55:29 +08:00
rasmith
93103575ce [BugFix][CI/Build][ROCM] Fix import error and apply assert in appropriate case in test_struct_output_generate (#28311)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2025-11-13 22:41:29 -08:00
rasmith
15ae8e0784 [Bugfix][CI/Test][Spec Decode] Fix illegal memory access in offline_inference/spec_decode.py (Issue 27619) (#28432)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-11-13 22:34:01 -08:00
haoyangli-amd
0b25498990 [Misc] add ignore mapper for quark quantization (#28275)
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
2025-11-14 05:56:35 +00:00
Roger Wang
0aecd9138f [Misc] Update xformers to 0.33.0.post1 (#28678)
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-11-13 21:52:53 -08:00
Kunshang Ji
da14ae0fad [XPU][CI] disable lm cache UTs (#28696)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-14 03:15:50 +00:00
Cyrus Leung
01bea115c4 [Misc] Remove warn_for_unimplemented_methods (#28613)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-14 11:10:10 +08:00
Bradley D
b39a5026eb [ci][amd] fix basic models extra init test (#28676)
Signed-off-by: Bradley Davis <bradleyhd@meta.com>
2025-11-14 02:44:36 +00:00
Michael Goin
622e6106a9 [CPU][Bugfix] Fix Apple Silicon M1 compilation failure (#28681)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-14 09:49:55 +08:00
Sage Moore
2aa75c752b [ROCm] Bump up the version of amd-smi to 6.4.3 (#28680)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-11-14 01:24:28 +00:00
Hank_
4d5943bda6 [quantization][config] enable override existing quant_config (#28510)
Signed-off-by: Hank <hcc.mayday@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-14 01:24:10 +00:00
Alexei-V-Ivanov-AMD
f2b8e1c551 Mirrored test group definitions for AMD (2025-11-11) (#28573)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-11-14 00:16:34 +00:00
Mark McLoughlin
6e25b1cddf [KV Connector] Test async mode in scheduler tests (#28550)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-11-13 18:30:59 -05:00
Wentao Ye
e64011f29a [CI] Bug: Fix ci entrypoint pooling (#28684)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-13 14:19:35 -08:00
Simon Mo
1b622deba7 [Misc] Update CODEOWNERS for simon-mo and comaniac (#28675)
Signed-off-by: Simon Mo <simon.mo@hey.com>
2025-11-13 21:01:43 +00:00
Kebe
faed7bf07e [Bugfix] [CPU] bump torch to 2.9.0 for Darwin to fix segmentation fault (#27791)
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-13 12:48:08 -08:00
Yanan Cao
262d263f6c [Bugfix] Eliminate tuple inputs to submodules in graph partitioning (#28533)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2025-11-13 15:09:05 -05:00
Qiu
968060c15a [bugfix] correct local_chunk_len for DCP in reorg_kvcache with long context (#28526)
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-13 11:29:22 -08:00
elvischenv
5d6ce2b960 [Perf] Support stream interval for reducing host overhead (#27869)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-13 13:21:25 -05:00
Matthew Bonanni
f9f3b596f3 [Attention][Bugfix] Fix FA sink support (#28660)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-13 13:20:01 -05:00
Yannick Schnider
119c4927b3 [Bugfix] Fix validate model input for decoder models (#27099)
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-11-13 10:18:47 -08:00
Varun Sundar Rabindranath
fe1cd7704d [Performance][B200] silu_mul_quant: pack scales in int32 (#28358)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-13 10:16:55 -08:00
Johnny Yang
fdfd5075aa [TPU] patch TPU wheel build script to resolve metadata issue (#27279)
Signed-off-by: Johnny Yang <johnnyyang@google.com>
2025-11-13 09:36:54 -08:00
Nick Hill
327c0a9a23 [BugFix] Ensure EngineArgs.create_engine_config is idempotent (#28515)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-13 17:14:08 +00:00
Jane (Yuan) Xu
06c4873d95 Rewrite C++ meta funcs to Python (#28595)
Signed-off-by: Jane Xu <janeyx@meta.com>
2025-11-14 00:52:50 +08:00
Roger Wang
d3387750f1 [Misc] Turn off encoder torch compile by default (#28634)
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-11-13 08:38:08 -08:00
Harry Mellor
b230286fbc Fix get_num_experts when config sets it explicitly to None (#28652)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: bruceszchen <bruceszchen@tencent.com>
2025-11-13 16:02:42 +00:00
Yuanping Song
3035d1a166 [BugFix] DeepSeek-OCR: apply NoRepeatNGramLogitsProcessor to greedy path (#28617)
Signed-off-by: Yuanping Song <yuanping.song@outlook.com>
2025-11-13 15:24:35 +00:00
Huamin Li
07a606aa7e [CI Failure] Fix backend selection for encoder-only models (#28534)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-13 10:11:27 -05:00
amdfaa
a7791eac9d [CI/Build] Install uv for AMD MI300: Language Models Tests (Hybrid) %N (#28142)
Signed-off-by: amdfaa <107946068+amdfaa@users.noreply.github.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: zhewenli <zhewenli@meta.com>
2025-11-13 14:34:55 +00:00
Pleaplusone
8da2f28f53 [ROCm][BugFix]Fix get_cu_count in rocm_aiter_fa.py (#28618)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-13 14:18:20 +00:00
Akash kaothalkar
86d15bfd8d [Hardware][PowerPC] Fix fp16 compilation error for Power in cpu attention backend and bump oneDNN version (#28535)
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
2025-11-13 13:32:21 +00:00
Fanli Lin
c9fe6abe7c [Bugfix] Fix FPS value type for Qwen2.5-Omni video processing (#28630)
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-13 13:06:06 +00:00
zofia
c47b6c85ac [XPU] add sym params to IPEXConfig (#28611)
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
2025-11-13 11:35:04 +00:00
baonudesifeizhai
c428e8d80b Fix io processor pooling #28273 (#28484)
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
2025-11-13 11:34:14 +00:00
Zijing Liu
5e973209aa [BugFix] Fix type error when assigning a Triton kernel tensor to a torch.nn.Parameter (#28603)
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com>
2025-11-13 11:30:04 +00:00
Di Wu
e63fd44560 Fix: Correctly filter special tokens in benchmark_prefix_caching (#28615)
Signed-off-by: Di Wu <dw2761@nyu.edu>
2025-11-13 10:57:44 +00:00
Yong Hoon Shin
11ac9ddd03 Support all interleaved layer types (#28485)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-11-13 08:57:20 +00:00
Chauncey
5c9ad138d5 [Frontend] supports interleaved thinking (#28531)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-11-13 16:14:13 +08:00
Jiangyun Zhu
fa183e9271 [Bugfix] fix kimi-linear crash (#28445)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-13 07:59:58 +00:00
usberkeley
4ab34f6ef1 Add NUMA node validation for CPU thread binding (#28555)
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
2025-11-13 07:03:52 +00:00
Huy Do
c33b87e777 Use official xformers-0.0.33 built for PT 2.9 (#28600)
Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-11-12 22:48:53 -08:00
tjandy98
4504e8029b [Bugfix] Prevent crash on empty grammar string (#28210)
Signed-off-by: tjandy98 <3953059+tjandy98@users.noreply.github.com>
2025-11-13 06:42:29 +00:00
Pleaplusone
ca00b1bfc6 [ROCm][BugFix] Remove the usage of device_info from aiter (#28383)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-12 21:43:42 -08:00
Radu Salavat
d44fbbab0e [build][cmake]: Bundle static ACL and torch libgomp for CPU extension builds (#28059)
Signed-off-by: Radu Salavat <radu.salavat@arm.com>
2025-11-13 05:43:08 +00:00
Lucia Fang
7e082bc14e Support DeepEP for Kimi-k2-thinking by enabling gemm selection for compressed-tensor marlin wna16 (#28574)
Signed-off-by: Lu Fang <fanglu@fb.com>
2025-11-12 21:40:45 -08:00
Fanli Lin
dbbe0c756a [XPU] Support Triton path for LoRA operations on XPU (#28511)
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
2025-11-13 05:31:42 +00:00
Pleaplusone
7dca0c90cb [BugFix][ROCm] Fix get_cu_count missing variable error (#28608)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-13 05:18:56 +00:00
Andrew Xia
1a0b157a2e [Frontend][responsesAPI][1/n] convert responses API tool input to chat completions tool format (#28231)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-11-13 04:47:22 +00:00
Andrew Xia
7c38ed0f1c [Frontend] split append tool output (#28333)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2025-11-13 04:03:23 +00:00
Jialin Ouyang
a1d3866dda [n-gen] DO NOT repeatedly return finished child requests (#28591)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-13 03:36:07 +00:00
Harry Mellor
97d1c99302 Rename clashing method names for vLLM model protocol (#27583)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 19:14:33 -08:00
Harry Mellor
3226283461 [Docs] Add some details about what the MoE block needs for the Transformers backend (#28588)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-13 03:12:14 +00:00
Nick Hill
8832fff972 [BugFix] Fix mm_encoder_attn_backend arg type checking (#28599)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-13 03:06:03 +00:00
Michael Goin
a543e678b4 [Bugfix] Fix SM100 gpt-oss regression due to faulty attn sink support (#28561)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-12 19:40:59 -07:00
wangxiyuan
2dacd57394 [platform] Move get_cu_count to utils (#27005)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-13 08:48:47 +08:00
Gregory Shtrasberg
d75ad04818 [ROCm][Bugfix] Revert removing setuptools version restriction (#28592)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-11-12 16:46:58 -08:00
Michael Goin
52eadcec9e [Docs] Update meetups.md description (#28583)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-13 00:00:23 +00:00
Harry Mellor
51c599f0ec Skip models that cannot currently init on Transformers v5 (#28471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 23:43:57 +00:00
Alexander Matveev
69d0e90313 [MoE][Kernel][Perf] Improve Shared Expert Stream Overlap (#28406)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-11-12 23:37:24 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
4ca5cd5740 [Core][AMD] Migrate fully transparent sleep mode to ROCm platform (#12695)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com>
2025-11-12 15:24:12 -08:00
Michael Goin
10f01d5a3a [Bugfix] Adjust Marlin CUDA arch selection to 8.0+PTX;9.0+PTX (#28294) 2025-11-12 15:14:13 -08:00
QiliangCui
3eb0c2673e [TPU] Support GCS path in VLLM_TORCH_PROFILER_DIR (#28487)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-11-12 22:31:14 +00:00
vllmellm
d8140b9833 [ROCM] Fix ROCm warnings, environment flag access, and GEMM kernel naming for consistency in _aiter_ops.py (#28464)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-12 21:46:57 +00:00
Varun Sundar Rabindranath
74a9a9faad [Performance][B200] Fix deepgemm prologue (#27897)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-12 13:13:03 -08:00
Wei Wei
478ee511de [Misc] Fix typo in llm_engine.py (#28584)
Signed-off-by: Wei Wei <wwei6@meta.com>
2025-11-12 12:59:43 -08:00
Andy Lo
58ce8d12b7 [BugFix] Priority scheduling and spec tokens preemption (#28558)
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-11-12 20:29:21 +00:00
Yihua Cheng
94a9ebcf31 [KV connector][WIP] KV cache proxy based on LMCache multi-process mode (#27902)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2025-11-12 20:25:43 +00:00
Harry Mellor
a39dd7bb06 [CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken in current Transformers (#28559)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 19:38:13 +00:00
Thomas Parnell
64d57c3be7 [Model] [Config] Correctly identify granite-4.0-micro as non-hybrid model (#28563)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-11-12 18:17:55 +00:00
PerryZhang01
a1e7fa362a [EPLB][ROCm]: support EPLB for ROCm backend (#27731)
Signed-off-by: Perry Zhang <perzhang@amd.com>
Co-authored-by: Perry Zhang <perzhang@amd.com>
2025-11-12 18:16:35 +00:00
alberto
bac904565f Implement ARC KV cache eviction policy for CPU offloader (#27039)
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: alberto <aperdomo@redhat.com>
Co-authored-by: Or Ozeri <or@ozery.com>
2025-11-12 09:51:39 -08:00
Benjamin Chislett
304419576a [Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (#28479)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-11-13 01:56:40 +09:00
Harry Mellor
a742134cc5 Remove deprecated fields from CompilationConfig (#27593)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 16:10:28 +00:00
Nicolò Lucchesi
728a9eb70e [Misc] Refactor Attention kv transfer methods into decorator (#27816)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
2025-11-12 16:05:44 +00:00
Canlin Guo
bc5bd45c7d [Refactor] Remove redundant TP gather/split in split_qkv in QwenVL (#28271)
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-11-12 15:56:47 +00:00
Alexander Matveev
f76e85c299 [Performance][Hopper] Avoid M dim padding to 4x for most cases (due to cuda graphs paddings) (#28492)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-11-12 10:51:43 -05:00
Harry Mellor
54aecd9ed5 Fix pre-commit (and XPU) on main (#28556)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 06:13:41 -08:00
wangxiyuan
10138c92a5 [V0 deprecation] Deprecate use_v1 parameter (#28112)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-12 14:03:52 +00:00
Jee Jee Li
a9d18b5107 [Bugfix] Fix gpt_oss packed_modules_mapping (#28536)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-12 21:02:06 +08:00
TJian
edb59a9470 [ROCm] [Bugfix] Fix fused_qknorm_rope_kernel rocm compatibility (#28500)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-11-12 05:01:14 -08:00
ZhengHongming888
c5f10cc139 add cpu option for p/d in nixl_connector (#28356)
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com>
2025-11-12 11:53:08 +00:00
ziruiliu
d143152308 [KVConnector] Enable get_block_ids_with_load_errors() in LMCache connector (#27978)
Signed-off-by: Zirui Liu <ziliu@ddn.com>
Signed-off-by: ziruiliu <ziliu@ddn.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-11-12 11:44:58 +01:00
Chaojun Zhang
a4730c1b4f [XPU]Fix crash due to removed VLLM_USE_V1 attribute (#28520)
Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
2025-11-12 10:20:55 +00:00
wuyaoxuehun
d3ade61e42 [Model] fix glm4_moe_mtp load weights with GLM-4.6 checkpoint. (#27597)
Signed-off-by: wuao.scotty <wuao.scotty@bytedance.com>
Co-authored-by: wuao.scotty <wuao.scotty@bytedance.com>
2025-11-12 10:14:00 +00:00
yyzxw
1761dea1a8 [BugFix]: --enable-lora with model granite-4.0-micro crashes (#27733)
Signed-off-by: zxw <1020938856@qq.com>
2025-11-12 09:03:56 +00:00
Huamin Li
c748355e0d [CI] Introduce autorun_on_main feature (#27836)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-11-12 08:51:19 +00:00
Chenguang Zheng
91864b79b3 [CI/Build] Fix crash due to removed VLLM_USE_V1 attribute in EPD (#28521)
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-11-11 23:09:33 -08:00
Lukas Geiger
ac0bb2c307 [Core] Cache vllm_is_batch_invariant (#28304)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-12 05:03:01 +00:00
ai-jz
f31419ed8b [Benchmark] Add retry support to fix workload bias in multi-turn benchmark (#28493) 2025-11-12 05:00:45 +00:00
Fanli Lin
b9ce9a3013 [BugFix] Add fallback path in apply_rotary_pos_emb_flashattn for non-cuda platforms (#28447)
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-12 03:13:21 +00:00
Chenguang Zheng
4ccffe561f [Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#25233)
Signed-off-by: n00909098 <nguyen.kha.long@huawei.com>
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Signed-off-by: herotai214 <herotai214@gmail.com>
Signed-off-by: Khuong Le <khuong.le.manh@huawei.com>
Signed-off-by: Khuong Le <lemanhkhuong2611@gmail.com>
Co-authored-by: n00909098 <nguyen.kha.long@huawei.com>
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: herotai214 <herotai214@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Khuong Le <khuong.le.manh@huawei.com>
Co-authored-by: Khuong Le <lemanhkhuong2611@gmail.com>
2025-11-11 18:58:33 -08:00
Lukas Geiger
cbb799e314 [Model][Qwen3VL] Simplify get_mrope_input_positions using numpy (#28302)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-12 02:55:10 +00:00
Andreas Karatzas
9f0247cfa4 VLLM_USE_TRITON_FLASH_ATTN V0 variable deprecation (#27611)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <Andreas.Karatzas@amd.com>
2025-11-11 18:34:36 -08:00
Li, Jiang
7f829be7d3 [CPU] Refactor CPU attention backend (#27954)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-11-12 09:43:06 +08:00
wangxiyuan
e1710393c4 [V0 deprecation] Remove VLLM_USE_V1 env (#28204)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 18:22:16 -07:00
Isotr0py
3f770f4427 [Performance] Cache loaded custom logitsprocs to avoid overheads (#28462)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-11 16:49:29 -08:00
Yanan Cao
48c879369f [Frontend] Change CompilationMode to a proper Enum (#28165)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2025-11-11 19:46:18 -05:00
Ilya Markov
1788aa1efb [BugFix] Graceful handling of torch symm mem errors. (#27671)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-11 17:41:54 -07:00
Adrian Abeyta
d23539549a Use FLASHINFER MLA backend when testing fp8_kv_scale_compile (#28491)
Signed-off-by: adabeyta <aabeyta@redhat.com>
2025-11-12 00:34:58 +00:00
Max Hu
412e153df5 [Feature] Allow configuring FlashInfer workspace size (#28269)
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-11 23:32:20 +00:00
Michael Goin
e5f599d4d1 [Bugfix] Disable shared expert overlap if Marlin MoE is used (#28410)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 23:16:12 +00:00
Michael Goin
28534b92b9 Add Zurich vLLM Meetup (#28488)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 14:53:59 -08:00
wangxiyuan
d4902ba56d [Misc] Cleanup Executor interface (#28441)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-11 22:28:07 +00:00
Kyuyeun Kim
df4d3a44a8 [TPU] Rename path to tpu platform (#28452)
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
2025-11-11 19:16:47 +00:00
Jee Jee Li
9d1c474704 [LoRA][1/N]Remove LoRA extra vocab (#28382)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-11 11:06:21 -08:00
Jie Luo
8c32c6e4b4 [Misc] fix typo in DCP comment (#28389)
Signed-off-by: Livinfly <luojie3m@gmail.com>
2025-11-11 10:59:16 -08:00
Canlin Guo
de120bc94f [V0 deprecation] Clean up num_prefill_tokens logic for V0 (#28203)
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2025-11-11 10:57:12 -08:00
Jialin Ouyang
4228be7959 [Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhead (#28245)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-11 10:28:47 -08:00
Lukas Geiger
76e4dcf225 [Misc] Remove unused attention prefix prefill ops functions (#26971)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-11 18:26:04 +00:00
Fanli Lin
d5edcb8678 [BugFix] Fix Siglip2Attention on XPU (#28448)
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-11 18:18:02 +00:00
Xin Yang
6c3c0f8235 [Kernel] Optimize rms_norm kernel (#27931)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2025-11-11 18:02:23 +00:00
Matthew Bonanni
684f254585 Prefer FlashAttention MLA as default over FlashMLA (#27363)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-11 17:13:51 +00:00
Zhewen Li
e553424919 [CI/Build] Refactor Attention backend for test_prefix_prefill from xformers to SDPA (#28424)
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-11-12 01:09:47 +08:00
xuebwang-amd
5a1271d83a [Quantization] fix attention quantization of gpt_oss model (#27334)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
2025-11-11 12:06:00 -05:00
xuebwang-amd
05576df85c [ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model (#24239)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Co-authored-by: fxmarty-amd <felmarty@amd.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-11-11 12:05:22 -05:00
zhrrr
68c09efc37 [Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (#27165)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
2025-11-11 12:00:31 -05:00
Nicolò Lucchesi
a7ef3eb0cd [NIXL] Generalize block-first backend layouts (FlashInfer-like) (#28282) 2025-11-11 16:57:43 +00:00
Michael Goin
f9a4087182 Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (#28431)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 11:46:04 -05:00
the-codeboy
287bbbeb06 [Doc] Fix typo in serving docs (#28474)
Signed-off-by: the-codeboy <71213855+the-codeboy@users.noreply.github.com>
2025-11-11 16:45:49 +00:00
usberkeley
3143eb23fc [BugFix] Add test_outputs.py to CI pipeline (#28466)
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-11 16:01:30 +00:00
Fanli Lin
b886068056 [BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU (#28444)
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-11 15:29:33 +00:00
Mark McLoughlin
a90ad7d838 Add @markmc to CODEOWNERS for Observability (#28457)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-11-11 23:03:22 +08:00
jvlunteren
533b018f72 [BugFix] Fix Failing Ruff Check (#28469)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
2025-11-11 06:41:43 -08:00
bnellnm
a1448b4b69 [Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (#28064) 2025-11-11 07:29:02 -07:00
Maryam Tahhan
fa1970201d [Docs] Fix grammar in CPU installation guide (#28461)
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
2025-11-11 14:01:11 +00:00
Ido Segev
3380543b20 Add request timeout override for multi-turn benchmarks (#28386)
Signed-off-by: Ido Segev <idos@pliops.com>
2025-11-11 13:41:18 +00:00
Cyrus Leung
afffd3cc8a [Model] Pass mm_features directly into get_mrope_input_positions (#28399)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-11 21:14:48 +08:00
Chaojun Zhang
7dbe6d81d6 Fix Fused MoE LoRA Triton kernel bug (#28450)
Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
2025-11-11 20:46:47 +08:00
Matthew Bonanni
b30dfa03c5 [Attention] Refactor CUDA attention backend selection logic (#24794)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-11-11 07:40:44 -05:00
Michael Goin
2e78150d24 [CI] Add mergify rules for nvidia label (#28417)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 04:28:28 -08:00
Ido Segev
d381eb967f Multi turn benchmark progress bar for synthetic conversation generation (#28394)
Signed-off-by: Ido Segev <idos@pliops.com>
2025-11-11 11:06:04 +00:00
Lukas Geiger
9973e6e04a [Model][Qwen3VL] Slightly speed up fast_pos_embed_interpolate (#28434)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-11 10:35:10 +00:00
Fanli Lin
c7991269dd [BugFix] 'DeepseekV2Config' object has no attribute 'use_mla' (#28387)
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
2025-11-11 08:45:38 +00:00
Jiangyun Zhu
f0359fffa4 [Bugfix] fix qwen3-next crash (#28202)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-11 08:24:28 +00:00
Sage Moore
798c7bebca [EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB (#28369)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-11-11 00:19:51 -08:00
Roger Wang
4fd4b743a2 [Bugfix] Fix max image size for PaddleOCR-VL (#28442)
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-11-11 08:07:24 +00:00
David Ben-David
cc079763c5 [BugFix] Avoid calling KV connector layer APIs when metadata is unset (#28253)
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
2025-11-10 23:39:36 -08:00
iAmir97
a7adbc6c6b [Doc] Sleep mode documentation (#28357)
Signed-off-by: Amir Balwel <amir.balwel@embeddedllm.com>
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com>
Co-authored-by: Amir Balwel <amir.balwel@embeddedllm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-10 22:44:35 -08:00
Robert Shaw
e605e8e323 [Bugfix] Fix Stream Sync for Shared Expert Overlap (#28430)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-11 05:59:08 +00:00
Zuyi Zhao
bca74e32b7 [Frontend] Add sagemaker_standards dynamic lora adapter and stateful session management decorators to vLLM OpenAI API server (#27892)
Signed-off-by: Zuyi Zhao <zhaozuy@amazon.com>
Signed-off-by: Shen Teng <sheteng@amazon.com>
Co-authored-by: Shen Teng <sheteng@amazon.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-11-11 04:57:01 +00:00
Zhuohan Li
8d706cca90 [Misc] FlattenLogprobs -> FlatLogprobs (#28335) 2025-11-11 03:41:23 +00:00
Xin Yang
57201a6a4c Fix rotary embedding benchmark script (#28323)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2025-11-10 21:57:12 -05:00
Michael Goin
f2d9ad0620 Only register rocm_aiter_ops if aiter is found (#28428)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-11 02:53:24 +00:00
Wentao Ye
de540c0354 [Feature] Add env var VLLM_MOE_USE_DEEP_GEMM (#28422)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-11 02:29:48 +00:00
Lucas Wilkinson
39029d5192 [CI/Test Fix] Fix CP tests on Blackwell (#28404)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-11 01:36:29 +00:00
Wentao Ye
35d801f13f [Feature] Refactor batch invariant fp8 DeepGEMM (#27606)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-11 00:08:40 +00:00
Matthew Bonanni
0bf29fadf5 [Test] Remove old non-varlen FA2 test (#28420)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-10 23:57:41 +00:00
Adrian Abeyta
a5a790eea6 [Bugfix] Ensure calculated KV scales are applied in attention. (#27232)
Signed-off-by: adabeyta <aabeyta@redhat.com>
2025-11-10 23:42:37 +00:00
Jialin Ouyang
b30372cbd0 [Perf] Move gc.freeze logic from EngineCoreProc to EngineCore for better coverage (#27896)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-10 15:34:18 -08:00
Ilya Markov
d17ecc6b19 [PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds (#24248)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-11-10 18:33:11 -05:00
Yong Hoon Shin
021143561f [ROCm] Add missing gemm_a8w8_blockscale import (#28378)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-11-10 23:13:36 +00:00
Robert Shaw
30700b1cd7 [CI] Fix Plugin Tests Tests (#28413)
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
2025-11-10 22:36:11 +00:00
Andrew Xia
4b94ed8f92 [Frontend][2/n] remove empty content from _parse_tool_calls_from_content (#28331)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2025-11-10 14:07:49 -08:00
Lucas Wilkinson
6dec9f6109 [BugFix] Fix DeepGEMM over-allocating workspace (#28254)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-10 17:01:17 -05:00
Wei Wei
bf6a3d0ff5 [Misc] Add more scoping for improved trace (#28329)
Signed-off-by: Wei Wei <wwei6@meta.com>
2025-11-10 21:03:21 +00:00
Sage Moore
40d33264c6 [Bugfix][EPLB] Disabled shared expert overlap when EPLB is enabled (#28377)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: Sage Moore <sagemoore@utexas.edu>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-10 20:39:19 +00:00
Jonas M. Kübler
9c84ca8293 [FA/Chore] Bump FA version for FP8 two-level accumulation (#27889)
Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
2025-11-10 12:06:04 -08:00
Rémi Delacourt
6d54336ae5 [Bugfix] Fix llguidance backend, rollback when EOS was encountered (#25905)
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
Signed-off-by: remi <remi@mistral.ai>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2025-11-10 14:53:32 -05:00
jiahanc
34553b9d27 [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next (#27492)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2025-11-10 12:34:57 -05:00
Varun Sundar Rabindranath
b039bfda8f [Bugfix] Fix persistent_masked_m_silu_mul_quant tests (#28366)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-10 09:21:52 -08:00
Cyrus Leung
d0e186c16f [V0 Deprecation] Remove unused context_len and seq_len from M-RoPE (#28395)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-11 00:30:06 +08:00
vllmellm
f080a83511 [RFC][ROCm][AITER] Keep all AITER kernels in _aiter_ops class like _custom_ops and _ipex_ops (#24490)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-11-10 08:20:53 -08:00
caozuoba
40e2eeeb92 [Kernel] Optimization of the mm_k operator. (#28280)
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-10 16:03:46 +00:00
zejunchen-zejun
b06b9470ca [Rocm][fused_moe][fp4] view weight to torch.float4_e2m1fn_x2 when running aiter fused moe for fp4 model (#27474)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
2025-11-10 10:38:56 -05:00
TJian
4673e465ff Add @tjtanaa to codeowner for ROCm and multi-modal (#28360)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-11-10 21:39:17 +08:00
Ferrebo
912744d066 [Fix] optimize visual token mask with caching and multi-token support (#28374)
Signed-off-by: Ferrebo <itachi971009@gmail.com>
Signed-off-by: kebo01 <kebo01@baidu.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-10 13:23:49 +00:00
Yu Jiaqi
15be507c86 [bugfix] fix siglip batch text output error (#28365)
Signed-off-by: piood <2477084691@qq.com>
2025-11-10 21:21:15 +08:00
Mark McLoughlin
6f7de33bed [Metrics] Refactor LoRA state tracking (#26801)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-11-10 16:34:36 +08:00
Shinichi Hemmi
a98cc35c34 Restore PlaMo2 unit test as pfnet/plamo-2-1b now supports transformers >=4.56 (#28019)
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com>
2025-11-10 06:50:02 +00:00
Lucas Wilkinson
e8697faf03 [V0 deprecation] Remove no longer used get_metadata_cls (#28370)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-10 14:32:09 +08:00
Xiake Sun
03fa4d3fb3 [Hardware][AMD][Model] Add Triton MoE tuning support and optimized configs for Qwen3 omni for MI308X (#28373)
Signed-off-by: Xiake Sun <xiake.sun@amd.com>
Signed-off-by: Xiake Sun <xisun@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-10 04:53:40 +00:00
Varun Sundar Rabindranath
6b2b9fd934 [CI] lora/test_mixtral.py : Add additional expected outputs due to flakiness (#28322)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-10 10:45:29 +08:00
JartX
c5f685b3ae [ROCm][Platform] Add RX7900XTX device id in _ROCM_DEVICE_ID_NAME_MAP (#28279)
Signed-off-by: JartX <sagformas@epdcenter.es>
2025-11-09 23:09:36 +00:00
Jiangyun Zhu
c4768dcf47 [Kernel] Fix fused_gdn_gating (#28343)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-09 14:26:35 -07:00
Zhewen Li
a65a934ebe [CI/Build] Temporary fix to LM Eval Small Models (#28324)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-09 21:08:38 +00:00
usberkeley
4a8d6bd168 Fix cu_num_generated_tokens slicing logic in LogprobsLists.slice() method (#28214)
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
2025-11-09 19:11:46 +00:00
Lucas Wilkinson
636efd10a5 [Core] Separate out attention metadata building logic from prepare inputs (#26764)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-09 13:51:43 -05:00
Nick Hill
289eb6c537 [Core] Simplify async KV output aggregation (#28327)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-09 09:44:13 -08:00
Nicolò Lucchesi
19d91ece4b [CI] Fix flaky test_eagle_correctness test (#28364)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-11-09 16:04:59 +00:00
Jiangyun Zhu
7ae5a5fb11 [Misc] Add some comments in qwen3-next (#28267)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-08 23:59:24 -08:00
Yong Hoon Shin
de2b78305f [ROCm] Add env to enable/disable aiter triton gemm (#28321)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-11-08 22:27:00 -08:00
Ning Xie
e5e9067e61 [Misc] fix typo and add detailed log (#28178)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-11-09 05:33:46 +00:00
yihong
3a7d580343 fix: close issue 28338 by pinning the Python version (#28339)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-11-09 05:07:26 +00:00
Kevin H. Luu
05f8d69077 [chore] Move some wikimedia images to S3 (#28351)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
2025-11-09 01:58:26 +00:00
Mohammad Miadh Angkad
404d7a9d14 [Performance][gpt-oss] Revert gpt-oss max cudagraph size to 1024 (#28345)
Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
2025-11-08 15:50:10 -07:00
ElizaWszola
171133f929 [Bugfix] Fix fused quant layernorm tests (#27865)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-08 14:31:33 -08:00
Cole Murray
32787d0644 Remove setuptools upper bound constraint (<80) (#28337)
Signed-off-by: Cole Murray <colemurray.cs@gmail.com>
2025-11-08 22:30:18 +00:00
Benjamin Chislett
975676d174 [Feat] Drop-in Torch CUDA Profiler (#27841)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-11-08 14:07:37 -08:00
Ev Lacey
77d702a22b Enhance run_cluster.sh for multi-NIC support (#28328)
Signed-off-by: Ev Lacey <elacey@nvidia.com>
2025-11-08 22:04:16 +00:00
zhangsicheng5
2108a571d7 [DCP] Support dcp kv_cache interleave size > 1 (#26696)
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: Qiu <qiuchunshuo@huawei.com>
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
2025-11-09 04:45:27 +09:00
Andy Lo
47604137a2 [Bugfix] Spec decode + structured output + spec model max len edge case (#28298)
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-11-08 19:44:25 +00:00
Robert Shaw
26990d25dc [Bugfix] Update device name for H200 detection (#28349)
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-11-08 19:01:11 +00:00
Harry Mellor
d9ab1ad9d1 reasoning_content -> reasoning (#27752)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-08 12:15:08 +00:00
22quinn
608bb14462 [Attention] Remove max cudagraph size limit of 992 (#27840)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-11-07 22:33:27 -08:00
Xiaozhu Meng
4a36681f85 [flashinfer][fix] do not check nvcc availability when using pre-downloaded cubins (#27990)
Signed-off-by: Xiaozhu <mxz297@gmail.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2025-11-07 22:25:21 -08:00
Abolfazl Shahbazi
d15afc1fd0 Refactor CPU/GPU extension targets for CMake build (#28026)
Signed-off-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
2025-11-08 14:17:35 +08:00
Isotr0py
934a9c3b79 [Model] Consolidate Deepseek-MoE implementation with DeepSeek-v2 (#28101)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-08 05:01:27 +00:00
gnovack
70af44fd10 [bugfix] support eagle with lora cudagraph specialization (#28318)
Signed-off-by: gnovack <gnovack@amazon.com>
2025-11-08 03:25:45 +00:00
Aurick Qiao
781f5ebf52 Bump arctic-inference requirement (#28174)
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-07 18:31:18 -08:00
Michael Goin
0852527647 [Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM (#28124)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-11-07 18:20:55 -08:00
Hamid Mukhtar
61d25dc44b Update gpu.rocm.inc.md to add support for AMD Ryzen AI MAX / AI 300 Series (gfx1151, gfx1150) (#28308)
Signed-off-by: Hamid Mukhtar <15519013+hammmmy@users.noreply.github.com>
2025-11-08 02:09:21 +00:00
Xiaohong (Sean) Chen
d0c7792004 [Bugfix][LoRA][Spec Decode] Support LoRA with speculative decoding (#21068)
Signed-off-by: Sean Chen <xiaohong_chen1991@hotmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Danielle Robinson <dcmaddix@gmail.com>
Co-authored-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
2025-11-08 01:58:22 +00:00
Boyuan Feng
b158df2813 remove resolve_op_overloads and use splitting_ops directly (#28081)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-11-08 01:13:13 +00:00
Kunshang Ji
1aaecda078 [XPU] Enable Expert parallel for MoE models (#28263)
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-08 00:33:11 +00:00
Harry Mellor
811df41ee9 Update Flashinfer from v0.4.1 to v0.5.2 (#27952)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-07 16:24:42 -08:00
Nick Hill
67a2da890e [PerfFix] Avoid separate thread for MP executor shm spin (take 2) (#28319)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-07 22:11:03 +00:00
Nick Hill
da786e339e [Core] Rework handling of async scheduling config (#28250)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-07 20:01:23 +00:00
Benjamin Chislett
18903216f5 [Bugfix] Fix and add tests for GptOss reasoning parser (#28000)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-11-07 19:28:04 +00:00
Simon Mo
d0ceb38ae8 [Build] Fix release pipeline failing annotation (#28272)
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-07 10:06:45 -08:00
youkaichao
155ad56d7b [doc] add guide about the provided PTX was compiled with an unsupported toolchain (#28305)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-11-08 00:26:34 +08:00
Fadi Arafeh
5fb4137c99 [README] Add Arm CPUs to the list of supported targets (#28290)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-11-07 15:41:47 +00:00
Nicolò Lucchesi
68a72a5cc1 Revert "[PerfFix] Avoid separate thread for MP executor shm spin (#28012)" (#28289)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-11-07 15:07:01 +00:00
Boyuan Feng
0f872b7977 [Log] update shm wait time msg (#28255) 2025-11-07 09:43:30 -05:00
Wentao Ye
4b1ff13221 [Feature] Default ignore_eos True for random dataset (#28227)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-07 07:35:33 -05:00
Iceber Gu
e0d6b4a867 [CLI] add --max-tokens to vllm complete (#28109)
Signed-off-by: Iceber Gu <caiwei95@hotmail.com>
2025-11-07 12:21:40 +00:00
Pavani Majety
72b1c2ae2c [Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (#27439)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-11-07 04:18:39 -08:00
Lukas Geiger
e0919f331d [Core][MM] Add mechanism to configure multimodal fields which should stay on CPU (#28168)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-07 12:14:29 +00:00
Kevin H. Luu
8e19d470af [fix] Revert "fixing mm placeholder replacement issue with gemma3" (#28285)
Signed-off-by: Kevin H. Luu <khluu000@gmail.com>
2025-11-07 12:09:09 +00:00
Mengqing Cao
1958bda9b4 [Misc][Model][Refactor] Pass the prefix into Linear layers (#28259)
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-11-07 19:38:38 +08:00
Zhang Xiangze
7bdb42b2f2 [CPU]Avoid repeated random sample compile (#28260)
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
2025-11-07 11:03:57 +00:00
汪志鹏
315068eb4a [FixBug] Aeala/ShareGPT_Vicuna_unfiltered marked as multimodal benchmark (#28265)
Signed-off-by: princepride <wangzhipeng628@gmail.com>
2025-11-07 09:35:22 +00:00
Jialin Ouyang
ccd98b59c1 [Perf] Introduce FlattenLogprobs to store logprobs results to reduce GC overhead (#28171)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-11-07 00:27:12 -08:00
Jee Jee Li
21b82f4ea2 [Kernel] LoRA triton kernels support PDL (#27402)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-07 08:05:48 +00:00
Copilot
a736e5ff77 [CI] Reduce Blackwell Fusion test runtime by filtering tests and only run all tests in nightly (#28074) 2025-11-07 15:58:16 +08:00
baonudesifeizhai
9da9208b20 [Bug] Fix missing token_ids for reasoning parser models in chat completions #28246 (#28256) 2025-11-07 07:31:58 +00:00
smit kadvani
11fd69dd54 [amd][gptoss] Perf gain because of block alignment (#28024)
Signed-off-by: Smit Kadvani <smit.kadvani@gmail.com>
Co-authored-by: Smit Shaileshbhai Kadvani <kadvani@meta.com>
2025-11-07 05:27:42 +00:00
Harry Mellor
c0a4b95d64 Fix issues from #28242 (#28257)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-07 04:23:17 +00:00
Alexis MacAskill
a47d94f18c Add runai model streamer e2e test for GCS (#28079)
Signed-off-by: Alexis MacAskill <amacaskill@google.com>
2025-11-07 03:07:54 +00:00
Alex Brooks
e70fbc599b [CI/Build] Loosen STT LoRA Translate Check (Flaky Test) (#28247)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Signed-off-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-11-07 02:51:27 +00:00
Lucas Kabela
4bf56c79cc [Multimodal][torch.compile] Add compilation config field for turning off ViT/MM compile (#28242)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2025-11-07 00:16:03 +00:00
Junhong Liu
59b453eaa2 Speed up mm processor kwargs per request by splitting dynamic and static kwargs (#26483)
Signed-off-by: Junhong <liujunhong11@huawei.com>
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com>
Co-authored-by: Junhong <liujunhong11@huawei.com>
2025-11-07 07:51:28 +08:00
Eugene Khvedchenya
827e4237bc Fix failing test for CRadio (#27738)
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: wang.yuqi <noooop@126.com>
2025-11-06 15:32:25 -08:00
Varun Sundar Rabindranath
ca6f755d24 [BugFix] Fix FusedMoELoRA + ModularKernel Integration (#28237)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-11-06 22:53:30 +00:00
Matthew Bonanni
ca90f50304 [Test] Add non-MoE DP test coverage (#28235)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-06 20:59:57 +00:00
Fang Han
da855b42d2 [Doc]: Make extraInit containers fully configurable in helm chart (#27497)
Signed-off-by: Fang Han <fhan0520@gmail.com>
2025-11-06 20:27:16 +00:00
Aleksandr Malyshev
449de9001a [ROCm] triton fp8 kernel (#27058)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
2025-11-06 14:46:44 -05:00
Vico Chu
d4aa65c998 [Chore] eliminate duplicated and unconditional object serialization in anthropic messages api (#27792)
Signed-off-by: Vico Chu <vico24826@gmail.com>
2025-11-06 19:09:19 +00:00
Julien Denize
7a8375f8a0 Add llama 4 scaling support (#28145)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
2025-11-06 18:55:17 +00:00
Andy Lo
5e0c1fe69c [Structured outputs] Upgrade llguidance to 1.3.0 (#28039)
Signed-off-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2025-11-06 10:24:47 -08:00
Russell Bryant
4507a6dae4 CODEOWNERS: Add myself as reviewer on security docs (#28216)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-11-06 17:39:42 +00:00
Roy Wang
d1dd5f53e4 [Frontend] Fix logging format when enable response logging (#28049)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
2025-11-06 16:25:39 +00:00
StanHatko
e52e4da971 [HARDWARE][CPU] Add Option for Disabling Binding to Specific CPU Cores (#27953)
Signed-off-by: Stan Hatko <stan_hatko@live.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
2025-11-06 23:47:11 +08:00
Milos Puzovic
2176778cd3 [Doc] Add Arm CPUs are on the list of supported targets in vLLM (#26018)
Signed-off-by: Milos Puzovic <milos.puzovic@arm.com>
2025-11-06 15:30:26 +00:00
Eric Yue
0370679ce9 [Kernel][Model] Tune fused_moe Triton configs for MiniMax-M2 on H100 (#28200)
Signed-off-by: minatoaquaMK2 <jiacheng.yue@foxmail.com>
2025-11-06 07:29:46 -08:00
Harry Mellor
8816e375d3 [Docs] Switch to directory style URLs (#28058)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-06 07:06:33 -08:00
Michael Goin
f32229293e Disable nm-testing models with issues in CI (#28206)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-11-06 06:19:07 -08:00
xiangze-arm
c757a15f0f [CPU] Improve cpu fused moe perf (#27244)
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
2025-11-06 11:04:18 +00:00
Chauncey
59a50afa08 [Frontend] OpenAI Responses API supports Tool/Function calling - non-harmony (#26874)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-11-06 10:40:03 +00:00
courage17340
981cadb35c [Bugfix][Kernel] fix merge attn states when both prefix and suffix are empty (#28181)
Signed-off-by: courage17340 <courage17340@163.com>
2025-11-06 17:52:13 +08:00
wangxiyuan
c3ee80a01a [V0 deprecation] clean up is_v1_supported_oracle (#28116)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-06 16:05:32 +08:00
Aditya Tewari
3755c14532 [CPU] Enable torch profiling (#28130)
Signed-off-by: Aditya Tewari <aditya.tewari@arm.com>
2025-11-06 07:32:05 +00:00
Seungduk Kim
201dc98acc Fix hard-coded parameter name in gemma3n.py (#27946)
Signed-off-by: Seungduk Kim <seungduk.kim@yanolja.com>
Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Biswa Panda <biswa.panda@gmail.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-11-05 23:07:36 -08:00
Julien Denize
a404e2c0f1 Patch Mistral Tokenizer (#28146)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
2025-11-06 06:43:16 +00:00
Xiaozhu Meng
e31946f86e [flashinfer] fix FI all2all with FI cutlass moe (#28166)
Signed-off-by: Xiaozhu <mxz297@gmail.com>
2025-11-06 05:52:16 +00:00
gmagogsfm
bde5039325 [CI] Add compile/test_multimodal_compile.py to CI (#28151)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-06 05:41:47 +00:00
Jacob Zhong
d72299d47b Make the cv2 dependency optional (#27780)
Signed-off-by: Jacob <cmpute@qq.com>
2025-11-06 05:08:55 +00:00
Lukas Geiger
80679f108f [Core][MM] Use non-blocking CPU-GPU copy of multimodal data (#28141)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-11-06 04:05:12 +00:00
Isotr0py
43ecd0a900 [Chore] Clean up deepseek v2/v3 config copy (#28055)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-06 03:46:30 +00:00
Chauncey
07d614511f [Misc] Remove the duplicate code (#28111)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-11-05 21:07:47 -05:00
Vadim Gimpelson
f948ab6945 [CI Failure] nm-testing/Qwen2-0.5B-Instruct-FP8-SkipQKV was removed from HF. Skip it in tests (#28170)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-06 01:22:13 +00:00
Wentao Ye
d71af5f502 [Feature] Enable TP + EP shared_experts overlap with router, 3.7% E2E performance improvement (#28164)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-05 17:21:08 -08:00
Wentao Ye
90189c71a9 [Bug] Fix env string "0" being treated the same as True (#28159)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-05 17:04:20 -08:00
Wentao Ye
d79d9f0780 [Bug] Fix disabling shared_experts on CPU via VLLM_DISABLE_SHARED_EXPERTS_STREAM (#28157)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-05 17:03:09 -08:00
Vadim Gimpelson
b6a248bdd7 [PERF] Decouple projections from GDN custom op. Attempt 2 (#28083)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-05 17:01:12 -08:00
Dayeol Lee
1767658559 [Debugging] Add annotation for easier trace analysis (#22496) 2025-11-05 16:52:52 -08:00
Kuntai Du
efe73e9b57 [Core][Hybrid allocator + connector 2/n] Unify remove_skipped_blocks by get_last_useful_token (#25431)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2025-11-06 00:12:00 +00:00
Zhewen Li
0b8e871e5e [CI/Build] Fix test_defaults_with_usage_context in AMD CI (#27926)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-05 15:40:24 -08:00
Zhewen Li
5ee93a5956 [CI/Build] Update checking logic in cutlass_group_gemm_supported (#27948)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-05 15:40:10 -08:00
Snehlata
e15601789b [Feature]: Add corrupted request metric to V1 metrics system. (#27306)
Signed-off-by: atalhens <sneh.lata@nutanix.com>
2025-11-05 13:45:29 -08:00
Richard Zou
65ac8d8dc4 [Docs] Add guide to debugging vLLM-torch.compile integration (#28094)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-11-05 21:31:46 +00:00
Isotr0py
ffb08379d8 [Chore] Remove Nemotron-Nano-VL config copy (#28126)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-05 20:06:45 +00:00
R3hankhan
e04492449e [Hardware][IBM Z] Optimize s390x Dockerfile (#28023)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2025-11-05 11:25:44 -08:00
Michael Yao
518ec6b722 [Docs] Clean up README_TUNING.md (#28088)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-11-05 19:01:34 +00:00
wang.yuqi
802748bddb [Bugfix] Fix Qwen3-Reranker-8B load (#28117)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-11-05 18:33:50 +00:00
Paul Zhang
faedbb4d4f [Feature] Extend batch invariant torch.compile to B200 (#27856)
Signed-off-by: PaulZhang12 <paulzhan@fb.com>
2025-11-05 10:04:49 -08:00
Samuel Shen
40db194446 [CI]: Add LMCacheConnector Unit Tests (#27852)
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu>
2025-11-05 09:45:57 -08:00
Chen Zhang
c765f0b443 [FlashInfer] Avoid FlashInfer block_size 16 + head_size 256 on blackwell (#27994)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-11-05 09:25:32 -08:00
gmagogsfm
002b07c4b2 [Bugfix] vLLM should check Inductor config for compile cache enablement status (#27637)
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
2025-11-05 12:22:44 -05:00
Walter Beller-Morales
752ddeacaa [Core] add support for reasoning parser plugins (#28075)
Signed-off-by: walter beller-morales <walter.beller.morales@gmail.com>
2025-11-06 01:15:06 +08:00
Jiangyun Zhu
c18f88c6ca [Kernel] Fuse computation of g and beta for Gated Delta Net (#28095)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-11-05 09:14:55 -08:00
Jiaju Zhang
6fd0df8132 [misc] add vLLM Beijing Meetup (#28127)
Signed-off-by: Jiaju Zhang <jjzhang@redhat.com>
2025-11-05 17:12:59 +00:00
Isotr0py
3f5a4b6473 [Bugfix] Validate custom logits processor xargs for online serving (#27560)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-05 16:53:33 +00:00
Pleaplusone
6cae1e5332 [ROCm][MLA] Support block-size > 1 for AITER MLA backend (#27224)
Signed-off-by: ganyi <ygan@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
2025-11-05 10:43:02 -05:00
Alexei-V-Ivanov-AMD
80c9275348 Enabling cooperative multi-gpu tests on multi-gpu nodes (#27986)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-11-05 10:35:49 -05:00
Ilya Markov
e50c454672 [BugFix] Support EP/DP + EPLB with MTP (#25311)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
2025-11-05 15:22:17 +00:00
Chen Zhang
5d16d0fa62 [DCP] check return_lse for all layers in dcp (#27929)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-11-05 22:27:25 +08:00
bigmoyan
0606bea2b6 add kimi reasoning parser (#28128)
Signed-off-by: wangzhengtao <wangzhengtao@msh.team>
Co-authored-by: wangzhengtao <wangzhengtao@msh.team>
2025-11-05 21:48:33 +08:00
Frost Mitchell
6e97eccf5d [XPU] Enable custom routing functions in IPEX for Llama4 (#28004)
Signed-off-by: frost-intel <frost.mitchell@intel.com>
2025-11-05 13:39:57 +00:00
Boyuan Feng
6ab183813c [Graph Partition][Cache] Use inductor partition ops config (#27702)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-11-05 13:04:48 +00:00
amirkl94
6b7a81185d Bugfix: Cutlass FP8 FusedMoE bad scaling factors (#27255)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-05 06:06:06 -05:00
Eric Yue
b57789b62b Fix excessive logging noise by reducing the log level of the MinimaxM2ToolParser import success message (#27635)
Signed-off-by: minatoaquaMK2 <jiacheng.yue@foxmail.com>
2025-11-05 19:03:51 +08:00
Chauncey
377061d481 [Misc] fix import error for DeepSeekR1ReasoningParser (#28114)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-11-05 19:02:32 +08:00
Kuntai Du
86dca07d9b [Hybrid allocator + kv connector] revert connector test changes related to hybrid allocator (#28011)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2025-11-05 10:36:31 +00:00
Qiu
16b37f3119 [bugfix] fix wrong dcp_local_seq_lens calc (#27518)
Signed-off-by: Qiu <qiuchunshuo@huawei.com>
2025-11-05 17:58:13 +08:00
Chauncey
0976711f3b [Refactor] Simplify and extract the shared logic between chat completion and responses (#27961)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-11-05 15:46:39 +08:00
Chauncey
e261d37c9a [Refactor] Lazy-loaded reasoning_parser (#28092)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-11-05 15:37:02 +08:00
Alex Brooks
b7cbc25416 [Model, Core] Support Granite Speech & LoRA for STT (#24455) 2025-11-05 08:33:48 +01:00
Lucas Wilkinson
d43ad5a757 [BugFix] Fix DCP Assert (AssertionError: DCP not support reorder_batch_threshold > 1 now.) (#28100)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-05 14:54:43 +08:00
Isotr0py
0ff05e3770 [Bugfix] Fix encoder-only model support for transformers backend (#28021)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-04 22:24:41 -08:00
wangxiyuan
428bc7bf1c [V0 deprecation] Remove VLLM_USE_V1 usage in most modules (#27955)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-11-04 20:51:16 -08:00
Zhewen Li
878fd5a16f [CI/Build] Enable some fixed tests in AMD CI (#28078)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-05 03:15:59 +00:00
Kunshang Ji
18b39828d9 [XPU] Add gpt-oss model support for Intel GPU (#27786)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-05 02:17:23 +00:00
tou
4ea62b77f5 [Qwen3-Next] MOE configs for A100-SXM4-80GB TP4 TP8 (#27740) 2025-11-05 09:25:09 +08:00
Vadim Gimpelson
d4e547bb7e Revert "[PERF] Decouple projections from GDN custom op" (#28080)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-04 15:58:23 -08:00
Aleksandr Malyshev
2d977a7a9e [ROCm] gemm_a16w16 upstreaming (#26969)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-11-04 16:01:00 -05:00
Chenheli Hua
1fb4217a05 [Multimodal] Make MediaConnector extensible. (#27759)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-11-04 18:28:01 +00:00
nadavkluger
611c86ea3c Added disable rule to track files under benchmarks/lib (#28048)
Signed-off-by: Nadav Kluger <nadav.k@fmr.ai>
2025-11-04 18:18:43 +00:00
Pleaplusone
dc937175d4 [ROCm][Perf] New design on ROCm AITER MHA backend Implementation (#25763)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-04 18:05:33 +00:00
Harry Mellor
2f1cc8cef1 Remove deprecated --rope-scaling and --rope-theta (#28006)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-04 18:01:56 +00:00
Nick Hill
938a81692e [AsyncScheduling] Don't schedule past request max_tokens (#27922)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-04 17:06:28 +00:00
Nick Hill
c9f66da8fd [PerfFix] Avoid separate thread for MP executor shm spin (#28012)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-04 08:33:55 -08:00
yt0428
05cae69f0f [model] Add support for openPangu_Ultra_MoE (#27521)
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-04 08:17:20 -08:00
Vadim Gimpelson
5fd8f02ea9 [PERF] Decouple projections from GDN custom op (#27512)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-11-04 08:11:41 -08:00
lyrisz
97e3dda84b [Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM (#27284)
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com>
Co-authored-by: Faqin Zhong <zhofaqin@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-11-04 07:49:25 -08:00
Nick Hill
5a0a6dfd55 [BugFix] Fix incorrect preallocated sampled_token_ids tensor size (#28025)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-04 07:38:16 -08:00
bnellnm
938772af03 [Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses. (#27123) 2025-11-04 21:59:45 +08:00
tomeras91
e4ee658672 [Model] add optimal triton fused moe configs for NemotronH MoE (#27967)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-11-04 12:59:43 +00:00
tomeras91
77f8001f53 [Model][Bugfix] fix pipeline parallelism support for NemotronH (#27968)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-11-04 12:28:36 +00:00
Zhuohan Li
300a265978 [Core] Enable StatLogger in LLMEngine (#28020)
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
2025-11-04 04:13:35 -08:00
Jerry Zhang
03c4c4aa9d Support using Int4PreshuffledTensor after loading (#26066)
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
2025-11-04 06:00:57 -05:00
yugong333
2ec401bc39 Load tuned fused_moe_lora shrink and expand kernel configs separately (#27435)
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-04 18:27:35 +08:00
Varun Sundar Rabindranath
4022a9d279 [BugFix][Performance] Restore flashinfer autotuning for all scenarios (#27904) 2025-11-04 15:56:21 +08:00
Zhewen Li
53f6e81dfd [CI/Build] Fix OpenAI API correctness on AMD CI (#28022)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-04 07:20:50 +00:00
CSWYF3634076
43a6acfb7d [Model] fix ernie45 reasoning_parser (#27973)
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
2025-11-04 07:16:46 +00:00
Mark McLoughlin
58279c60b5 [KV Connector] Make KVCacheConfig an explicit constructor argument (#27887)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-11-03 23:00:49 -08:00
Zhewen Li
2f84ae1f27 [CI/Build] Update LM Eval Version in AMD CI (#27944)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-11-04 06:36:40 +00:00
xiangze-arm
f32cbc9a0c [CPU] Improve dynamic 4bit moe performance (#27240)
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
2025-11-04 06:33:23 +00:00
Wentao Ye
7e4be74104 [Bug] Batch invariant: Fix flash attn MLA RuntimeError: scheduler_metadata must have shape (metadata_size) (#27884) 2025-11-04 14:05:55 +08:00
Mark McLoughlin
380ba6816d [Metrics] Enable sleep state metric outside of dev mode (#27867)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-11-03 20:35:36 -08:00
liuzhenwei
14a125a06d [NIXL][XPU] Pin NIXL version to 0.7.0 (#27849)
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
2025-11-04 03:28:35 +00:00
Chauncey
c02fccdbd2 [Refactor] Lazy import tool_parser (#27974)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-11-04 10:10:10 +08:00
li2haipeng
6ddae74054 [LoRA] Lora shrink swizzle (#27694)
Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-04 09:30:20 +08:00
vllmellm
b13a447546 [Bugfix][ROCm] Fix ViT rotary embeddings for torch.compile compatibility on ROCm (#27748)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-11-03 17:12:19 -08:00
QiliangCui
7956b0c0bc Remove the tpu docker image nightly build. (#27997)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-11-04 00:35:54 +00:00
Tyler Michael Smith
3758757377 [Bugfix] Fix MoE Routing Simulation (#28002)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2025-11-03 22:26:49 +00:00
Hank_
ccd3e55e51 [Bugfix][plugin] Fix fla crash on plugin (#27322) 2025-11-04 05:27:03 +08:00
Matthew Bonanni
01baefe674 Add TP parameter to attention tests (#27683)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-03 13:04:40 -08:00
Ning Xie
786030721e [Docs] add runai_streamer_sharded to LoadConfig (#27937)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-11-03 20:35:16 +00:00
Matthew Bonanni
145c00a4d3 [Bugfix] change FlashMLA reorder_batch_threshold (#27777)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-03 15:17:10 -05:00
Lucas Kabela
55011aef24 [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op and reenable compile (#27764)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2025-11-03 11:12:15 -08:00
Sophie du Couédic
a4398fbb5e [Feature][Benchmarks] Support inf burstiness (#26941)
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
2025-11-03 18:33:17 +00:00
Aurick Qiao
2c19d96777 [Spec Decode] Integrate Suffix Decoding from Arctic Inference (#25784)
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
2025-11-03 09:23:31 -08:00
Lucas Wilkinson
4bc400f47e [CI/Testing] Add basic single node dual batch overlap test (#27235)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-11-03 17:00:46 +00:00
ahao-anyscale
cac4c10ef0 [BUG] Make 'binary' the default option for saving torch compile artifacts when using standalone_compile (#27616)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
2025-11-03 11:13:51 -05:00
pwschuurman
f7d2946e99 [Bugfix] Skip gs:// model paths for speculator detection (#27846)
Signed-off-by: Peter Schuurman <psch@google.com>
2025-11-03 14:31:03 +00:00
gnovack
294c805f1d Early exit for MoE LoRA kernels (#27131)
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-03 20:22:17 +08:00
zhang-prog
40b69e33e7 [Model] Add PaddleOCR-VL Model Support (#27758)
Signed-off-by: zhangyue <zhangyue66@baidu.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-03 19:04:22 +08:00
Jee Jee Li
32257297dd [CI/Build] Remove the flaky gpt-oss lora test (#27966)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-03 16:50:06 +08:00
Misha Efimov
ba464e6ae2 Add ORCA endpoint load metrics support (#24905)
Signed-off-by: Misha Efimov <mef@google.com>
2025-11-03 08:21:31 +00:00
Kunshang Ji
7f4bdadb92 [XPU] Refine Dockerfile.xpu, avoid oneccl dependency issue (#27964)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-03 07:36:59 +00:00
Rémi Delacourt
cec7c28833 [Bugfix] Padded Eagle Specdec with Chunked Prefill (#26263)
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
Signed-off-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com>
Signed-off-by: remi <remi@mistral.ai>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
2025-11-03 02:22:46 -05:00
Thomas Parnell
18961c5ea6 [Hybrid] Pass kernel block size to builders (#27753)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-11-03 05:48:03 +00:00
Sungyoon Jeong
470ad118b6 [Frontend] Align finish_reason when tool is called with OpenAI (#25054)
Signed-off-by: Sungyoon Jeong <sungyoon.jeong@furiosa.ai>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-11-03 04:21:18 +00:00
Biswa Panda
1bf43ae35d [BugFix][LoRA] use adapter_id instead of id field of lora_request (#27728)
Signed-off-by: Biswa Panda <biswa.panda@gmail.com>
2025-11-03 10:08:08 +08:00
Vensen
0ce743f4e1 Fix(llm): Abort orphaned requests when llm.chat() batch fails (Fixes #26081) (#27420)
Signed-off-by: vensenmu <vensenmu@gmail.com>
2025-11-02 16:24:01 +00:00
Cyrus Leung
6c317a656e [Misc] Provide Siglip2 chat template (#27939)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-02 13:42:38 +00:00
Asaf Joseph Gardin
00b31a36a2 [V1] [Hybrid] Mamba1 Automatic Prefix Caching (#26377)
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
2025-11-02 04:16:23 -08:00
Julien Denize
73444b7b56 Performance fix for MistralTokenizer: cache special ids and tokens (#27925)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2025-11-02 08:48:33 +00:00
Cyrus Leung
853a8eb53b [Bugfix] Fix Qwen Omni audio inference (#27920)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-02 05:06:05 +00:00
Ben Browning
758ea2e980 [CI/Build] Fix flaky test_transcription_validation.py::test_basic_audio_gemma (#27924)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-11-02 03:45:02 +00:00
Yue Zhang
685c99ee77 [KV offload] Offloading connector async scheduling support (#27648)
Signed-off-by: KevinCheung2259 <2651309292@qq.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-11-01 21:08:56 +00:00
Benjamin Bartels
1e88fb751b Adds anthropic /v1/messages endpoint to openai api_server (#27882)
Signed-off-by: bbartels <benjamin@bartels.dev>
Signed-off-by: Benjamin Bartels <benjamin@bartels.dev>
2025-11-01 12:45:42 -07:00
Nick Hill
c2ed069b32 [BugFix] Fix mixed penalties batch with async scheduling (#27910)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-01 10:51:24 -07:00
wenxindongwork
af6e19f50f [Core][TPU] Support TPU Data Parallelism (#27365)
Signed-off-by: wenxindongwork <wenxindong@google.com>
2025-11-01 17:14:44 +00:00
Cyrus Leung
99d69af9ec [Bugfix] Python 3.10 compatibility for Self (#27918)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-11-01 15:28:54 +00:00
Haco
d811b442d3 [Bugfix] DeepSeek V3.2 MTP metadata & CUDA graph issues (#26779)
Signed-off-by: xiaohajiayou <923390377@qq.com>
2025-11-01 10:52:43 -04:00
wangxiyuan
30a14b034f [V0 deprecation] Remove VLLM_USE_V1 usage in platform and v1 module (#27798)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-01 10:17:45 +00:00
Harry Mellor
799ce45cc1 [Docs] Mock all imports for docs (#27873)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-01 10:02:23 +00:00
ai-jz
2c0c7c39bd feat(benchmarks): support HF model names in multi-turn benchmark (#27850) 2025-11-01 08:04:52 +00:00
Yihua Cheng
e675118849 [Add] cmdline argument parsing for KV cache offloading modules (#27621)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-01 07:17:07 +00:00
TJian
e2347dbf58 [Bugfix] [Model] Missing MRoPE function definition from KeyeForConditionalGeneration (#27895)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-11-01 13:45:23 +08:00
Cyrus Leung
879a06579e [CI/Build] Bump transformers version (#27528)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-31 22:11:07 -07:00
yugong333
29de3cdee4 Adding SplitK in fused_moe_lora kernel (#27818)
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-01 12:55:46 +08:00
Yan Ma
7e2729b57e [Multimodal][XPU] Enable vision attn backend for xpu platform (#27525)
Signed-off-by: Yan Ma <yan.ma@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yejing Lai <yejing.lai@intel.com>
Co-authored-by: Guancheng Fu <110874468+gc-fu@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-01 04:45:02 +00:00
Jee Jee Li
3a5de7d2d6 [Bugfix] Fix KDA output (#27905)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-01 11:54:36 +08:00
Jee Jee Li
bc4486d609 [Kernel] Enable FusedMoEModularKernel to support bias (#27754)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-11-01 02:05:12 +00:00
Nick Hill
0cdbe7b744 [Core] Async scheduling + structured outputs compatibility (#26866)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-11-01 00:35:04 +00:00
Chen Zhang
df334868ca [Hybrid] A simpler algorithm to find kernel_block_size (#26476)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-10-31 21:30:28 +00:00
Bram Wasti
0e0a638c3b Batch invariance doc (#27839)
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-31 17:22:19 -04:00
Matthew Bonanni
f29aeb5a25 Add FLASHINFER_MLA to test_mla_backends and add B200 CI run (#27663)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-31 11:12:19 -07:00
Vinay R Damodaran
5e8862e9e0 [Feature] Pydantic validation for scheduler.py and structured_outputs.py (#26519)
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-31 18:05:50 +00:00
Nick Hill
9e5bd3076e [Cleanup] Remove no-longer-used SpeculativeConfig.enable_chunked_prefill (#27826)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-31 10:57:45 -07:00
Shu Wang
fc16f1c477 Flashinfer_CUTLASS_MOE fuses quantization for TP (#27223)
Signed-off-by: Shu Wang. <shuw@nvidia.com>
2025-10-31 17:54:29 +00:00
ZiTian Zhao
bc306fe5e9 fix incorrect type annotation in KimiMLP (#27885)
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
2025-10-31 17:38:02 +00:00
Chenguang Zheng
103a468bbf [bugfix] Missing cached item in beam search (#27874)
Signed-off-by: fake0fan <645327136@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-31 17:34:27 +00:00
Rob Mulla
70bfbd7b16 Docs: update TPU install instructions (#27824)
Signed-off-by: Rob Mulla <rob.mulla@gmail.com>
Signed-off-by: Rob Mulla <RobMulla@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-31 10:29:55 -07:00
GuanLuo
d6517be3cd [Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node (#26338)
Signed-off-by: Guan Luo <gluo@nvidia.com>
Signed-off-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Signed-off-by: Guan Luo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-10-31 10:16:00 -07:00
Isotr0py
7e06c40e63 [Bugfix] Fix broken MRoPE for GLM-4.1V/GLM-4.5V (#27860)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-31 17:04:51 +00:00
Madeesh Kannan
675704ac01 [Bugfix] Allow 64-bit integer values for LoRA IDs to avoid overflow/truncation (#27876)
Signed-off-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2025-10-31 16:58:42 +00:00
Jee Jee Li
0384aa7150 [CI/Build] Add gpt-oss LoRA test (#27870)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-31 22:17:21 +08:00
Jiangyun Zhu
3857eb8725 [Perf] Decouple torch op from GDA to leverage torch.compile (#27871)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-10-31 21:35:52 +08:00
Huamin Li
933cdea440 [BugFix] Don’t compute reorder threshold when there are no attention groups (#27861) 2025-10-31 11:36:18 +00:00
Isotr0py
3933f18a5e [Bugfix] Avoid too small block m/n for FlexAttention kernel option (#27853)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-31 19:33:12 +08:00
toncao
e5ef4dfc11 [Kimi-Linear] Correct prefixes and add compatibility with AWQ quants (#27834)
Signed-off-by: toncao <cpatonn@gmail.com>
Co-authored-by: toncao <cpatonn@gmail.com>
2025-10-31 17:36:37 +08:00
Akash kaothalkar
36960501d3 [Hardware][Powerpc] Fix VLLM_CPU_OMP_THREADS_BIND="auto" low CPU utilization for Power (#27734)
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
2025-10-31 07:45:26 +00:00
Seiji Eicher
b2e65cb4a7 [benchmark] Make request IDs unique across clients by default (#27723)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-10-30 17:40:35 -07:00
Wentao Ye
2bf0bcc1fc [CI Test] Add Scheduled Integration Test (#27765)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-30 17:29:26 -07:00
Jakub Sochacki
697f507a8e [CI/Build][Intel] Enable performance benchmarks for Intel Gaudi 3 (#26919)
Signed-off-by: jakub-sochacki <jakub.sochacki@wp.pl>
2025-10-31 07:57:22 +08:00
Matthew Bonanni
d5d2a0fe74 [Misc] Make all tool scripts executable (#27831)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-30 23:46:02 +00:00
Nick Hill
c9791f1813 [BugFix] Fix broken import in initialize_ray_cluster() (#27838)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-30 16:26:13 -07:00
Paul Zhang
e7acb20076 [Feature] Batch invariant torch.compile (#27660)
Signed-off-by: PaulZhang12 <paulzhan@fb.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-10-30 13:11:29 -07:00
Jialin Ouyang
4b68c4a55b [Core][Perf] Only invoke save_new_computed_blocks when computed blocks are not empty (#27799)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-10-30 19:47:30 +00:00
Wentao Ye
a8141fa649 [Refactor] Remove VLLM_DEEPEP_LOW_LATENCY_ALLOW_NVLINK (#27750)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-30 15:32:39 -04:00
Sumanth R Hegde
4917002523 [Fix] Skip record_sleep_state logic in PrometheusStatsLogger if not in dev mode (#27789)
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
2025-10-30 19:26:27 +00:00
cong-meta
a2981c4272 [EP/DP][API Server] Enable DP-aware routing in OpenAI API requests (#24945)
Co-authored-by: Cong Chen <prowindy@gmail.com>
2025-10-30 12:10:16 -07:00
Jialin Ouyang
4574d48bab [Core][Bookkeeping] Update cu_num_accepted_tokens for all req_index (#27629)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-10-30 11:52:36 -07:00
Tyler Michael Smith
ab98f6556f [Bugfix] Fix 2 precommit issues (mamba_block_size, kv_cache_config) (#27811)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-10-30 11:52:18 -07:00
Roger Meier
2918c1b49c [Model] Use the same fused_moe configs for all H200 devices (#23642)
Signed-off-by: Roger Meier <r.meier@siemens.com>
2025-10-30 17:36:56 +00:00
Mengqing Cao
1004205795 [MTP] Refactor mtp predictor to avoid d2h operation (#27643)
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-30 17:27:39 +00:00
Huy Do
ba33e8830d Reapply "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (#27768)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-10-30 10:22:30 -07:00
Kebe
33a0ea5f32 [Docs] add Shanghai Meetup - 2025/10 (#27545)
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: esmeetu <jasonailu87@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: esmeetu <jasonailu87@gmail.com>
2025-10-31 00:33:13 +08:00
Ilya Markov
60f76baa66 [Misc] Replace CUDA_VISIBLE_DEVICES in DP with torch.cuda.set_device for device selection on cuda-like devices (#27564)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2025-10-30 11:41:44 -04:00
Varun Sundar Rabindranath
e5e076cad7 [BugFix] Stopgap - Flashinfer Autotuner + GPT-OSS + DP/TP (#27762)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-10-30 08:24:31 -07:00
Li, Jiang
eebf00cb0c [Bugfix][CPU] Fix MRoPE dispatch on the CPU backend (#27800)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-10-30 15:12:05 +00:00
Fan Yin
9956aae4ea [Model][Ouro] Support Ouro Model (#27794)
Signed-off-by: yinfan.1024 <yinfan.1024@bytedance.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: yinfan.1024 <yinfan.1024@bytedance.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-30 22:34:41 +08:00
Zhewen Li
0fe0140408 [KV offload] Enable CPU KV offload on CUDA-like platforms (#27770)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-30 22:10:29 +08:00
Zhiyuan Li
4e68cc9b6a [Model] Introduce Kimi Linear to vLLM (#27809)
Signed-off-by: lizhiyuan <lizhiyuan@moonshot.cn>
Signed-off-by: Zhiyuan Li <uniartisan2017@gmail.com>
2025-10-30 21:02:27 +08:00
Huamin Li
1994de99ea [CI Failure] Fix test_kv_cache_model_load_and_run (#27717)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-10-30 12:27:53 +00:00
wang.yuqi
4464723f22 [Frontend][Doc][5/N] Improve all pooling tasks | Polish encode (pooling) API & documentation. (#25524)
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-10-30 12:13:05 +00:00
Sairam Pillai
74374386e2 [Bugfix] Improve GPU validation logging in Ray fallback scenarios (#25775)
Signed-off-by: Sairam Pillai <sairam.pillai61@gmail.com>
2025-10-30 11:57:59 +00:00
Wentao Ye
c01f6e525f [CI] Fix mypy for vllm/v1/core and vllm/v1/engine (#27108)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-30 11:32:17 +00:00
Huamin Li
c7d2a554ba [CI Failure] fix test_default_mm_loras (#27795)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-10-30 18:13:03 +08:00
wangxiyuan
af826e0820 [V0 deprecation] Remove VLLM_USE_V1 usage in config module (#27784)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-30 09:42:49 +00:00
Zhewen Li
e806178d2a [BugFix][VL] Fix FA selection on Qwen2.5-VL (#27790)
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-30 07:54:44 +00:00
Huamin Li
5be1bed790 [CI/Build] Add eval config for Qwen3-235B-A22B-Instruct-2507-FP8 (#27113)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-10-30 07:50:56 +00:00
yitingdc
31b55ffc62 use stringData in secret yaml to store huggingface token (#25685)
Signed-off-by: yiting.jiang <yiting.jiang@daocloud.io>
2025-10-30 00:47:36 -07:00
Bram Wasti
ded8ada86a Add more dims for batch invariant shims (#27489)
Signed-off-by: Bram Wasti <bwasti@meta.com>
Signed-off-by: Bram Wasti <bwasti@fb.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-30 05:28:45 +00:00
Kuntai Du
8bff831f0a [Benchmark] Cleanup deprecated nightly benchmark and adjust the docstring for performance benchmark (#25786)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2025-10-30 04:43:37 +00:00
Lucas Wilkinson
b5d70751d8 [BugFix] Reordering extend logic fix (#27739)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-10-29 21:39:34 -07:00
Fardin Hoque
b8c48c5d72 kernels/moe test pruning (#27053)
Signed-off-by: Fardin Hoque <kfhfar@amazon.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-10-30 12:10:34 +08:00
Benjamin Bartels
17d055f527 [Feat] Adds runai distributed streamer (#27230)
Signed-off-by: bbartels <benjamin@bartels.dev>
Signed-off-by: Benjamin Bartels <benjamin@bartels.dev>
Co-authored-by: omer-dayan <omdayan@nvidia.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-29 21:09:10 -07:00
Nick Hill
2ce5c5d3d6 [BugFix] Handle unscheduled requests properly when async scheduling (#27756)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-29 21:04:25 -07:00
Kunshang Ji
b5bae42f91 [XPU] Update latest IPEX 2.8 release (#27735)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-10-30 11:17:13 +08:00
Chen Zhang
d7fb10c574 [Bugfix] mamba-block-size is set for vision language model (#27773)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-10-29 19:39:57 -07:00
Yan Ma
b798e39f93 [XPU][bugfix] fix rope for llama4 and deepseek (#25145)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2025-10-30 09:43:13 +08:00
Chenheli Hua
48eb8eba58 [Temp fix] Disable torch.compile for Qwen2.5 VL's VisionBlock temporarily. (#27760)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-29 23:17:48 +00:00
Wentao Ye
b5d90f7400 [Bug] Fix DBO IMA issue for DeepEPHT (#27666)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-29 16:28:27 -04:00
Nick Hill
d4aa144343 [BugFix] Fix handling of resumed reqs in SharedStorageConnector (#27719)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-29 20:16:52 +00:00
Wentao Ye
fcb1d570bb [Bug] Fix DeepEP low latency assert self.batched_router_logits.size(-1) == full_router_logits.size(-1) Bug (#27682)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-29 14:50:39 -04:00
Nicolò Lucchesi
accb8fab07 [KVConnector] Add metrics to Prometheus-Grafana dashboard (#26811)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
2025-10-29 18:44:49 +00:00
Wentao Ye
5b0448104f [Bug] Raise error explicitly if using incompatible backend (#27424)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-29 13:29:20 -04:00
22quinn
f7a6682872 [CI/Build] Test torchrun with 8 cards (#27548)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-10-29 10:26:06 -07:00
Boyuan Feng
a9fe0793f2 use_aot_compile should respect VLLM_DISABLE_COMPILE_CACHE (#27698)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-10-29 17:08:54 +00:00
JartX
7568a282b9 [FIXBUG] Qwen3VL hallucinations without Contiguous on Torch.SDPA (#27744)
Signed-off-by: JartX <sagformas@epdcenter.es>
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-10-29 16:55:35 +00:00
Braulio Dumba
1da3309ace [Core] Exposing engine sleep & wake_up state as prometheus metrics (#24176)
Signed-off-by: Braulio Dumba <Braulio.Dumba@ibm.com>
2025-10-29 09:32:01 -07:00
Wentao Ye
5522fb274b [Chore] Optimize P2PNCCLEngine http_address (#27488)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-30 00:05:09 +08:00
Nicolò Lucchesi
0f95a1c3f2 [CI] Fix flaky test_two_responses_with_same_prev_id test (#27745)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-29 15:10:35 +00:00
Xiake Sun
ded24e3e54 [ROCm][Platform] Add MI308X device id in _ROCM_DEVICE_ID_NAME_MAP (#27623)
Signed-off-by: Xiake Sun <xiake.sun@amd.com>
2025-10-29 14:44:03 +00:00
Roger Young
d6704dd099 Fix MiniMax-M2 rmsnorm precision and remove useless code (#27627)
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-10-29 21:01:05 +08:00
Cyrus Leung
ecca3fee76 [Frontend] Add vllm bench sweep to CLI (#27639)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-29 05:59:48 -07:00
Zhewen Li
9a0d2f0d92 [CI/Build] Skip cpu offloading test on AMD (#27690)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-29 12:55:51 +00:00
Isotr0py
ad3ec89532 [VLM] Add Qwen3-VL generation test (#25185)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-29 12:19:37 +00:00
Kevin H. Luu
3481e40743 [chore] Remove models weight on S3 logic (#27725)
Signed-off-by: kevin <kevin@anyscale.com>
2025-10-29 10:29:49 +00:00
Eugene Khvedchenya
5e72216d17 Feature: video support in random mm dataset (#25963)
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Signed-off-by: Eugene Khvedchenya <ekhvedchenia@nvidia.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-29 18:24:52 +08:00
Isotr0py
1a33aacf82 [Misc] Raise error for missing video metadata in MultiModalDataParser (#27664)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-29 10:06:42 +00:00
Yue Zhang
7ba6aa8f56 [Fix] get_kv_cache_torch_dtype import error in LMCacheConnector integration (#27670)
Signed-off-by: KevinCheung2259 <2651309292@qq.com>
2025-10-29 10:03:54 +00:00
Alec S
ab2eb27b74 [Frontend] [gpt-oss] Mcp type bug (#27689)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: Alec Solder <alecs@fb.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Co-authored-by: Alec Solder <alecs@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-10-29 10:01:32 +00:00
Alec S
3c7fefdeba [Frontend] [gpt-oss] Tool json call parsing error retry (#27675)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: Alec Solder <alecs@fb.com>
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Co-authored-by: Alec Solder <alecs@fb.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-10-29 09:42:44 +00:00
bnellnm
1891cf605a [Bugfix] Fix modular kernel tests (#27707)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-10-29 16:14:33 +08:00
Jiangyun Zhu
8df98c2161 [perf] Enable concurrent execution of "shared_experts" and "selected_experts" in qwen3-next (#27578)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-10-29 08:12:54 +00:00
Cyrus Leung
4fb8771cc0 [CI/Build] Move pre-commit only scripts to tools/pre_commit (#27657)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-29 08:04:33 +00:00
Dipika Sikka
413ef7a3b4 [Speculators] Move tests + fix integration (#27308)
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: rahul-tuli <rtuli@redhat.com>
Co-authored-by: Rahul Tuli <rtuli@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-10-29 00:54:21 -07:00
Zhewen Li
8b62495076 [Bugfix] Fix non-contiguous tensor error in rocm_unquantized_gemm_impl (#27605)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-29 00:00:15 -07:00
Zhewen Li
83fd49b1fc [CI/Build][Bugfix] Fix Quantized Models Test on AMD (#27712)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-29 06:27:30 +00:00
Shaoting
a4a4f0f617 [KV Connector] Update lmcache connector with latest compatibility (#27681)
Signed-off-by: Samuel Shen <slshen@uchicago.edu>
Co-authored-by: Samuel Shen <slshen@uchicago.edu>
2025-10-29 05:38:37 +00:00
Lukas Geiger
0d8161b075 [Model] Fix Qwen3VL and Qwen3Omni after torch.compile changes (#27705)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-29 05:28:20 +00:00
liuzhenwei
d2c33c397a [NIXL][XPU] update name of nixl wheel (#27631)
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
2025-10-29 12:43:29 +08:00
Varun Sundar Rabindranath
f6d5f5888c [Build] Revert triton_kernels requirements (#27659) 2025-10-28 21:07:09 -07:00
Simon Mo
9007bf57e6 Revert "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (#27714) 2025-10-28 20:58:01 -07:00
Huy Do
f257544709 Install pre-built xformers-0.0.32.post2 built with pt-2.9.0 (#27598)
Signed-off-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-28 19:39:15 -07:00
Jialin Ouyang
0b51c9bd8b [Core] Early return in SlidingWindowManager.remove_skipped_blocks (#27673)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-10-29 01:32:33 +00:00
Wentao Ye
d3ab240f39 [Bug] Fix deepep low latency use nvlink by default (#27677)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-28 23:53:12 +00:00
Lucas Kabela
94666612a9 [Misc][qwen2_5_vl][torch.compile] Enable supports_torch_compile on generic nn.Module and demonstrate speedup on Qwen Vision model (#23207)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: Lucas Kabela <lucasakabela@gmail.com>
2025-10-28 22:36:43 +00:00
Nick Hill
4fe5895361 [AsyncScheduling] Make async overlap work with logprobs (#27615)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-28 22:35:54 +00:00
Or Ozeri
111faf1118 [Core] Scheduler: Publish connector events after output (#25875)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2025-10-28 21:01:33 +00:00
Wentao Ye
6afc28a9ba [Test] Batch Invariant: Unit test using parameterized backend (#27478)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-28 13:51:35 -07:00
Lucas Wilkinson
141e6a0505 [Misc] Make reorder batch also separate extends (#27367)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-10-28 10:55:10 -07:00
Matvei Pashkovskii
130aa8cbcf Add load pattern configuration guide to benchmarks (#26886)
Signed-off-by: Matvei Pashkovskii <mpashkov@amd.com>
Signed-off-by: Matvei Pashkovskii <matvei.pashkovskii@amd.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-28 10:49:15 -07:00
Zhengxu Chen
e3d8186666 [compile] Add fallback path to AOT compile when serialization fails. (#27350)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-28 12:54:26 -04:00
Cyrus Leung
f5710ef02a [Misc] Make LayerBlockType a Literal instead of Enum (#27658)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-28 16:23:35 +00:00
Mohammad Miadh Angkad
a8c02fb5bf [Bugfix][CI] Fix v1 attention backend tests and add CI coverage (#26597)
Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
Signed-off-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-10-28 11:42:05 -04:00
Kero Liang
02af36df36 [Bugfix] Fix allocation & free logic of SingleWriterShmRingBuffer (#27117)
Signed-off-by: Kero Liang <kerorek@outlook.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: donglu <donglu@cohere.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-28 15:01:24 +00:00
Zhiyuan Li
e88bdd60d9 [FLA] Introduce Kimi Delta Attention (KDA) to vLLM (#27654)
Signed-off-by: lizhiyuan <lizhiyuan@moonshot.cn>
2025-10-28 22:56:28 +08:00
Samuel Shen
05e034f085 [nit]: Fix import for the lmcache integration (#27600)
Signed-off-by: Samuel Shen <slshen@uchicago.edu>
Co-authored-by: Samuel Shen <slshen@uchicago.edu>
2025-10-28 14:40:55 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
936643a868 [BugFix] Also consider RAY_EXPERIMENTAL_NOSET_* when storing compilation cache (#27294)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-10-28 10:22:28 -04:00
Junpu Fan
b186149e8e [Bugfix][Frontend] validate arg priority in frontend LLM class before add request (#27596)
Signed-off-by: Junpu Fan <junpufan@gmail.com>
2025-10-28 14:02:43 +00:00
22quinn
2abbd351ef [Core] Enable async scheduling for external_launcher mode (#27394)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2025-10-28 13:52:47 +00:00
wangln19
446912d1cb fix: allow HuggingFace standard chat template params via **kwargs (#27622)
Signed-off-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local>
Signed-off-by: wangln19 <96399074+wangln19@users.noreply.github.com>
Co-authored-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-10-28 21:12:34 +08:00
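For context, a minimal sketch of what this enables, assuming a vLLM OpenAI-compatible server at localhost:8000; the model name and the enable_thinking flag are illustrative, and which kwargs are accepted depends on the model's chat template:

```python
# Pass HuggingFace chat-template parameters through the request body.
# Server URL, model name, and the enable_thinking kwarg are assumptions
# for illustration; accepted kwargs depend on the model's template.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```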
Zhengxu Chen
a00d6254e9 [compile] Disable dynamo guards check for AOT compilation. (#27288)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-28 12:58:12 +00:00
Asaf Joseph Gardin
05181cc57f [Hybrid] Add mamba_block_size to Engine Args (#27289)
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
2025-10-28 12:54:24 +00:00
Zhengxu Chen
259504e147 [compile] Add enable_prompt_embeds to compile hash. (#27285)
Signed-off-by: zhxchen17 <zhxchen17@fb.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-28 20:46:03 +08:00
Wentao Ye
0484b64248 [Bug] Fix shape issue for eplb expert weights (#27589)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-28 20:44:05 +08:00
Cyrus Leung
f58d9b6404 [Misc] Separate out utils.counter and move utils.Device to engine (#27588)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-28 12:20:46 +00:00
Matthew Bonanni
44b5ce956d [Bugfix] In LongRoPE, decide short vs long based on max_model_len (#27431)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-28 12:00:56 +00:00
Nick Hill
7a865f2325 [V0 Deprecation] Remove vestigial V0 logits_processors.py file (#27601)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-28 19:17:45 +08:00
wangln19
2fa90bda27 Fix a parsing robustness issue in KimiK2ToolParser that causes IndexError (#27565)
Signed-off-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local>
Co-authored-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local>
2025-10-28 11:11:50 +00:00
Zhewen Li
0291fbf65c [CI/Build] Fix amd model executor test (#27612)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-28 08:58:11 +00:00
Jialin Ouyang
b46e4a06f1 [Core][Bookkeeping Optimization] Update against numpy view of is_token_ids tensor (#27618)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-10-28 08:13:10 +00:00
Li, Jiang
d34f5fe939 [Bugfix][CPU] Fall back from oneDNN linear to torch linear to fix half gemm support on legacy platforms (#27526)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-27 23:25:44 -07:00
Eric Yue
bdb01a38fe [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.6 for MI300X (#27323)
Signed-off-by: minatoaquaMK2 <jiacheng.yue@foxmail.com>
2025-10-27 22:58:06 -07:00
vllmellm
5b3c35a68e [ROCm] [Doc] Update ROCm installation docs (#27327)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-28 13:00:50 +08:00
Chauncey
61fbfe5274 [Bugfix] Fix inconsistent finish_reason handling between V0 and V1 engines (#27555)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-10-28 02:18:08 +00:00
Kuntai Du
255e34ca50 [Stability fix] turn off HMA allocator when connector is set (#27592)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-10-27 18:32:23 -07:00
Roger Wang
a8d2e326ec [Bugfix][CI] Fix config resolving logic with remote models (#27610) 2025-10-28 00:48:32 +00:00
Andrew Xia
53a56e658b [gpt-oss][2/N] Support input_messages in responsesRequest (#26962)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2025-10-27 23:15:49 +00:00
usberkeley
69f064062b Code quality improvements: version update, type annotation enhancement, and enum usage simplification (#27581)
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
2025-10-27 17:50:22 +00:00
Micah Williamson
921e78f4bb [ROCm] Update AITER branch for ROCm base docker (#27586)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-10-27 17:22:33 +00:00
Cyrus Leung
6ebffafbb6 [Misc] Clean up more utils (#27567)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-27 15:30:38 +00:00
Ben Browning
3b96f85c36 [Chore]: Stream tokens vs characters in tool call parser tests (#26513)
Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-10-27 23:06:25 +08:00
tingtinggithub
23ad820553 Fix mm placeholder replacement issue with gemma3 (#27538)
Signed-off-by: tingtingtang1992 <streamttt@gmail.com>
2025-10-27 14:34:01 +00:00
Varun Sundar Rabindranath
5d3be3ba4c [Bugfix][LoRA][FusedMoE] Select MxFP4 Backend based on LoRA Enablement (#27487)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-10-27 07:32:50 -07:00
Yu Jiaqi
4f882be4a0 [Model] Siglip2 Model Support (#27566)
Signed-off-by: piood <2477084691@qq.com>
2025-10-27 06:57:37 -07:00
Asaf Joseph Gardin
9273754222 [Hybrid] Added supports_mamba_prefix_caching Protocol (#27339)
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
2025-10-27 13:05:20 +00:00
Jee Jee Li
f4e8154076 [Kernel] Enable moe LoRA kernel support FP16 (#27468)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-27 19:48:37 +08:00
Fadi Arafeh
a663f6ae64 [cpu][perf] Fix low CPU utilization with VLLM_CPU_OMP_THREADS_BIND on AArch64 (#27415)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-10-27 11:14:55 +00:00
Chauncey
a4fc21895e [Bugfix] Fix first event still containing prompt_token_ids when return_token_ids=False (#27561)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-10-27 11:06:43 +00:00
Shanshan Shen
a3e8611da5 [Bugfix] Limit the default value of max_model_len when it is not specified by users (#27556)
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-10-27 10:16:20 +00:00
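For context, a minimal sketch of the offline API this default interacts with; the model and length below are illustrative, and setting max_model_len explicitly bypasses the capped default applied when it is left unspecified:

```python
# Explicitly setting max_model_len overrides the limited default that
# this change applies when users do not specify one. Model and length
# are illustrative choices, not part of the change itself.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", max_model_len=2048)
out = llm.generate("Hello, my name is")
print(out[0].outputs[0].text)
```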
Cyrus Leung
7c2bdb83dc [Misc] Clean up utils (#27552)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-27 09:05:40 +00:00
Danielle Robinson
9932ed6a83 [Kernel] Adding split_K implementation for fused_moe_lora (#27291)
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com>
Signed-off-by: Danielle Robinson <dcmaddix@gmail.com>
Co-authored-by: Danielle Robinson <dmmaddix@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-27 02:05:24 -07:00
Jee Jee Li
2d631d28c6 [Doc] Slight improvement to M2 and beyond (#27554)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-27 09:02:10 +00:00
Cyrus Leung
b368382964 [Model] Deprecate merge_by_field_config=False (#27551)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-27 16:43:00 +08:00
gnovack
a806c14cc7 [Performance][LoRA] add context varying params to 'do_not_specialize' in fused moe lora (#27445)
Signed-off-by: gnovack <gnovack@amazon.com>
2025-10-27 06:31:55 +00:00
yyzxw
181bf5bbde [Docs] Remove the incorrect enable_reasoning parameter (#27550)
Signed-off-by: zxw <1020938856@qq.com>
2025-10-26 23:17:19 -07:00
Cyrus Leung
cbd5e07a51 [Model] Use merge_by_field_config for MM models (Qwen series) (#27546)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-27 05:38:05 +00:00
CSWYF3634076
63b22e0dbb [Model][Bugfix] fix ernie45 moe 300B SharedFusedMoE output tuple (#27316)
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
2025-10-26 20:53:31 -07:00
Roger Young
5980604c44 Fix MiniMax-M2 copyright (#27537)
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-10-27 03:29:51 +00:00
youkaichao
361a7463d3 fix m2 test (#27536)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-10-27 01:04:36 +08:00
Roger Young
720af6ab79 [Model][MiniMax-M2] Support MiniMax-M2 Model (#27535)
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-10-27 00:59:11 +08:00
Cyrus Leung
55cba4a05c [CI/Build] Update causal-conv1d installation (#27529)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-26 22:14:22 +08:00
Cyrus Leung
c7abff2990 Revert "[CI/Build] Use CPU for mm processing test on CI (#27522)" (#27531)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-26 04:44:27 -07:00
Yeshwanth N
71b1c8b667 [Chore]: Extract math and argparse utilities to separate modules (#27188)
Signed-off-by: Yeshwanth Surya <yeshsurya@gmail.com>
Signed-off-by: Yeshwanth N <yeshsurya@gmail.com>
Signed-off-by: yeshsurya <yeshsurya@gmail.com>
2025-10-26 04:03:32 -07:00
Cyrus Leung
8fb7b2fab9 [Doc] Fix links to GH projects (#27530)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-26 17:55:51 +08:00
Cyrus Leung
be7b55a83d [Doc] Remove Molmo warning (#27527)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-26 16:22:52 +08:00
Lucia Fang
315b860abe [Bugfix] Fix empty prompts for async-engine mode in benchmark throughput (#27494)
Signed-off-by: Lucia Fang <fanglu@fb.com>
2025-10-26 08:16:35 +00:00
rongfu.leng
87c41c26ad [Bugfix] Fix processor initialization for model from modelscope instead of HF (#27461)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-26 07:44:31 +00:00
JartX
65d2cf9511 [BUGFIX][ROCM] ViT FlashAttention on ROCm (no GFX9) and contiguous on qwen3vl ROCm TORCH_SDPA (#27190)
Signed-off-by: JartX <sagformas@epdcenter.es>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-10-26 15:08:52 +08:00
Isotr0py
d63cd9ff10 [CI/Build] Use CPU for mm processing test on CI (#27522)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-26 13:09:18 +08:00
Cyrus Leung
66a168a197 [CI/Build] Refactor processing tests (#27470)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-25 16:14:30 +00:00
Matthew Bonanni
a99564ac5b [Attention] Add missing kv cache scale setup (#27490)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-25 00:12:49 -07:00
Cyrus Leung
4c5f632165 [Misc] Simplify max tokens in multimodal registry (#27500)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-24 23:56:01 -07:00
Kuntai Du
b853540388 [Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector (#25712)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-10-24 23:34:18 -07:00
Zhuohan Li
56ed7609a9 Revert "[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selectio… (#27502) 2025-10-25 05:31:43 +00:00
Jiangyun Zhu
29c9cb8007 [CI] Add tests for cudagraph (#27391)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-10-25 02:37:33 +00:00
Yihua Cheng
83f478bb19 [KVConnector] Migrate the LMCache integration code to be vLLM native (#25542)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2025-10-25 00:23:53 +00:00
Varun Sundar Rabindranath
269c4db0a4 [Misc][DP] Guard mxfp4 implementation selection (#27484)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-10-24 23:29:24 +00:00
Wentao Ye
52efc34ebf [Log] Optimize Startup Log (#26740)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-24 19:27:04 -04:00
Pengchao Wang
d95d0f4b98 [Distributed] Basic set of configuration for large EP deployment on GB200 (#27328)
Signed-off-by: Pengchao Wang <wpc@fb.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2025-10-24 14:16:44 -07:00
Lehua Ding
0402428200 [Perf][Async Scheduling] Remove CPU->GPU sync in dummy_run (#27455)
Signed-off-by: Lehua Ding <lehuading@tencent.com>
2025-10-24 20:45:36 +00:00
jinghanhu
17af6aa0da [Document] Add ms-swift library to rlhf.md (#27469) 2025-10-24 20:31:50 +00:00
Zhewen Li
fc168c33f3 [CI/Build] Fix test_torch_utils in AMD CI (#27317)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-24 12:26:00 -07:00
Isotr0py
acc78aeb88 [Bugfix] Fix interns1-vit qk norm code path (#27480)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-24 17:43:45 +00:00
Ming Yang
0f67d4d962 [Attention] Add MLA prefill backend: trtllm_ragged_attention_deepseek (#26397)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-10-24 10:24:08 -07:00
kourosh hakhamaneshi
7e1d697b56 [Bugfix] Fix MultiConnector stats reconstruction across process boundaries (#27366)
Signed-off-by: Kourosh Hakhamaneshi <Kourosh@anyscale.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
2025-10-24 17:08:05 +00:00
Chendi.Xue
699d62e6cf [NIXL][BUGFIX] delay done_recving queue cleanup to bottom of get_finished (#27297)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2025-10-24 17:01:41 +00:00
Richard Zou
cd390b609d [compile] Turn standalone_compile back on (#27460)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-10-24 16:30:27 +00:00
Fadi Arafeh
2080b05099 [cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (#27472)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-10-24 15:57:48 +00:00
Lifans
6454afec90 [Doc] Fix minor issues in docs/design/metrics.md (#27436)
Signed-off-by: Lifan Shen <lifans@meta.com>
2025-10-24 05:40:54 -07:00
Chauncey
41a62564a7 Fix test named tool use (#27458)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-10-24 20:27:45 +08:00
fhl2000
284cc92275 [MISC] cudagraph_capture_sizes related improvements (#26016)
Signed-off-by: fhl <2410591650@qq.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-24 05:11:05 -07:00
ioana ghiban
435be10db9 Fix AArch64 CPU Docker pipeline (#27331)
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com>
2025-10-24 05:11:01 -07:00
Cyrus Leung
b7030d962b [Benchmark] Enable benchmark to run with encoding_format="bytes" (#27467)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-24 11:16:50 +00:00
Chauncey
3567816932 [Refactor] move tool parsing logic from protocol.py to the tool parser (#27383)
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
2025-10-24 09:53:23 +00:00
22quinn
e0ef8a2920 [BugFix] Fix torchrun DP with LLM class (#27395)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-10-24 08:11:37 +00:00
Isotr0py
42efe609ba [MM][Bugfix] Replace PatchEmbed's conv3d to linear layer (#27418)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-24 07:32:47 +00:00
Yu Jiaqi
88d3141ec6 [Docs] remove v1 column for embedding models (#27446)
Signed-off-by: piood <2477084691@qq.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-10-23 23:55:03 -07:00
Rui Qiao
09a6a49eaf [Misc] Avoid "PyTorch non-writable tensors" warning in RayPPCommunicator (#27443)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-10-24 14:53:09 +08:00
strinczer
074475541a [Bugfix] Fix Pydantic union resolution for ResponseFunctionToolCall in Responses API (#26706)
Signed-off-by: Shai Trinczer <strinczer@icloud.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-10-23 22:53:42 -07:00
Aaron Pham
d4c574c39f [Chore] remove structural tags logging lines (#27451) 2025-10-24 05:35:45 +00:00
usberkeley
c528b9006a Fix EventPublisherFactory logic for disabled KV cache events (#27419)
Signed-off-by: Bradley <bradley.b.pitt@gmail.com>
2025-10-24 05:00:01 +00:00
fhl2000
85fee74b33 [Bugfix][CI] Move resolving cudagraph_mode before initializing attn_metadata_builder (#27427)
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
2025-10-23 20:31:14 -07:00
hfan
8dbe0c527f [Misc] Add TPU usage report when using tpu_inference. (#27423)
Signed-off-by: Hongmin Fan <fanhongmin@google.com>
2025-10-23 20:29:37 -07:00
Xiangyu Li
5cc6bddb6e [Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm (#26092) 2025-10-23 23:26:13 -04:00
Harry Mellor
1f9460c4c1 Fix pooling adapters for Transformers backend (#27338)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-23 20:23:55 -07:00
xiao-llm
70022ffc00 Granite 4.0 quark quantization support (#26944)
Signed-off-by: Xiao YU <Xiao.YU@xilinx.com>
Signed-off-by: Xiao Yu <xiao.yu.dc@outlook.com>
Co-authored-by: Xiao YU <Xiao.YU@xilinx.com>
2025-10-24 02:14:03 +00:00
Akash kaothalkar
f417746ad7 [Hardware][POWERPC] Disable oneDNN path in vllm/model_executor/layers/utils.py for PowerPC (#27422)
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
2025-10-23 21:21:36 +00:00
Yu Jiaqi
0552cfb195 [Model] Siglip Embedding Support (#27324)
Signed-off-by: piood <2477084691@qq.com>
2025-10-23 20:19:48 +00:00
Kebe
51dd14ac2b [Bugfix][DP] Fix creating too many DP Placement Groups (#26880)
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Rui Qiao <ruisearch42@gmail.com>
2025-10-23 20:16:51 +00:00
Matthew Bonanni
dbfbf9f324 [Attention] Fix FlashMLA metadata builder arguments for q_len > 1 (#27368)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-23 15:58:15 -04:00
Jonathan Chen
ca76486a16 [Chore] Separate out vllm.utils.platform_utils.py (#27374)
Signed-off-by: Jonathan <chenleejonathan@gmail.com>
2025-10-23 19:08:06 +00:00
Varun Sundar Rabindranath
a9f55dc588 [Misc] Add triton_kernels dependency (#27370)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-10-23 12:04:14 -07:00
Isotr0py
81d5bb765a [Bugfix] Fix AWQ marlin layer skipping (#27416)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-23 18:30:28 +00:00
Gregory Shtrasberg
0825197bee [Bugfix][ROCm][DeepSeek] Fix for forward_hip in rope for DeepSeek (#27373)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-10-23 17:43:53 +00:00
Alexander Matveev
9ef3d5b875 [Bugfix] Fix dp_chunking enablement logic in FusedMoE layer (#27220)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-10-24 00:03:14 +08:00
Alexei-V-Ivanov-AMD
295c7f0267 Mirroring the test definitions (2025-10-22) (#27362)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-10-24 00:02:26 +08:00
wang.yuqi
3fa2c12185 [Frontend][4/N] Improve all pooling task | Add plugin pooling task (#26973)
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Christian Pinto <christian.pinto@ibm.com>
2025-10-23 14:46:18 +00:00
Cyrus Leung
fe2016de2d [CI/Build] Remove unnecessary flags from test registry (#27353)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-23 14:42:40 +00:00
Ilya Markov
237cf6d32a [Misc] Remove use of CUDA_VISIBLE_DEVICES for device selection (fix DP slow startup time &c) (#26709)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2025-10-23 20:58:39 +08:00
Navya Srivastava
faee3ccdc2 [Feature] Pydantic validation for speculative.py (#27156)
Signed-off-by: Navya Srivastava <navya.srivastava1707@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-23 12:19:33 +00:00
Bradley D
570c3e1cd4 [Bugfix] Honor --mm_encoder_attn_backend when used (#27124)
Co-authored-by: Bradley D <4551889+bradleyhd@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-23 20:09:52 +08:00
Harry Mellor
3a4255c7c4 Run mypy on the lowest supported Python version instead of system Python (#27048)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-23 05:07:44 -07:00
tomeras91
61089465a6 [Model] Add MoE support for NemotronH (#25863)
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
2025-10-23 10:27:23 +00:00
Tova Movshovitz
88afa11010 [Metrics] [KVConnector] Add connector prefix cache hit rate stats (#26245)
Signed-off-by: tovam <tovam@pliops.com>
2025-10-23 12:21:08 +02:00
Chauncey
d00ce29d89 [CI] Reorganize entrypoints tests (#27403)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-10-23 10:10:06 +00:00
Louie Tsai
3b7bdf983b add SLA information into comparison graph for vLLM Benchmark Suite (#25525)
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
Signed-off-by: louie-tsai <louie.tsai@intel.com>
Signed-off-by: Louie Tsai <louie.tsai@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-23 08:04:59 +00:00
Zhewen Li
50b788a17a [CI/Build] Fix AMD CI: test_cpu_gpu.py (#27388)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-23 07:55:00 +00:00
Lucia Fang
fc059c7061 [Bugfix] Fix args settings for guided decoding args (#27375)
Signed-off-by: Lucia Fang <fanglu@fb.com>
2025-10-23 07:34:06 +00:00
Cyrus Leung
bfb240cc49 [CI/Build] Fix Prithvi plugin test (#27393)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-23 07:30:44 +00:00
Jonathan Chen
e255d92990 [Chore] Remove duplicate has_ functions in vllm.utils (#27372)
Signed-off-by: Jonathan <chenleejonathan@gmail.com>
2025-10-23 06:11:59 +00:00
wang.yuqi
3729ed00ba [Model] Add num_cached_tokens for PoolingRequestOutput (#27378)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-10-23 14:03:42 +08:00
Giancarlo Delfin
6644796bf4 [V1][spec decode] return logprobs for spec decoding (#26060)
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-10-22 22:59:59 -07:00
Andrew Sansom
ff93cc8c84 [CORE] Support Prefix Caching with Prompt Embeds (#27219)
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
2025-10-22 22:18:07 -07:00
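For context, a minimal sketch of prompt-embeds input on the offline API, assuming the engine is started with enable_prompt_embeds=True; with this change, prefix caching can also cover such inputs. The model, shape, and dtype below are illustrative:

```python
# Prompt embeddings as input; prefix caching can now reuse KV blocks for
# these too. Shape and dtype are assumptions matching facebook/opt-125m.
import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_prompt_embeds=True)
embeds = torch.randn(16, 768, dtype=torch.float16)  # (num_tokens, hidden_size)
out = llm.generate({"prompt_embeds": embeds})
print(out[0].outputs[0].text)
```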
PiteXChen
243ed7d32e [Bugfix][Core] Fix running queue index leakage exception (#26754)
Signed-off-by: CLFutureX <chenyongqyl@163.com>
2025-10-22 21:40:12 -07:00
fangpings
7e0941055f [Bugfix] Fix incorrect kv cache metrics in grafana.json (#27133)
Signed-off-by: Fangping Shi <fangping_shi@apple.com>
Co-authored-by: Fangping Shi <fangping_shi@apple.com>
2025-10-22 20:58:36 -07:00
Cyrus Leung
6738e4a093 [Bugfix] Fix SLA tuner initialization (#27355)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-22 20:43:04 -07:00
Isotr0py
2566dca2a9 [Bugfix] Fix deepseek-ocr multi-image inference and add merge_by_field_config=True with tensor schema support (#27361)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-22 17:15:38 -07:00
Matthew Bonanni
b4fda58a2d [MLA] Bump FlashMLA (#27354)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-22 15:48:37 -07:00
dongbo910220
a0003b56b0 [Chore] Separate out system utilities from vllm.utils (#27201)
Signed-off-by: dongbo910220 <1275604947@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-22 20:25:25 +00:00
Daisy-Ma-coder
5beacce2ea [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (#27128)
Signed-off-by: qqma <qqma@amazon.com>
Co-authored-by: qqma <qqma@amazon.com>
2025-10-22 19:36:39 +00:00
rongfu.leng
8669c69afa [Feature] Set publisher default to zmq in kv_event config (#26915)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-22 19:19:33 +00:00
Sage
1651003c35 [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (#27211)
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
2025-10-22 18:13:03 +00:00
William Song
1cb8c6c5fe [Doc] Fix numbering sequence in prefix caching (#27357)
Signed-off-by: William Song <jinwook@umich.edu>
2025-10-22 17:35:47 +00:00
Luciano Martins
e05a6754a8 [Model] Revert PR #26715: Restore custom PaliGemma and Gemma3-MM impl… (#27309)
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
2025-10-22 10:05:34 -07:00
Isotr0py
084a9dae80 [Bugfix] Disable FlexAttention direct block mask building for encoder-only models (#27344)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-22 16:39:08 +00:00
RED
c9461e05a4 Support Anthropic API /v1/messages Endpoint (#22627)
Signed-off-by: liuli <ll407707@alibaba-inc.com>
Co-authored-by: liuli <ll407707@alibaba-inc.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-10-22 09:13:18 -07:00
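For context, a minimal sketch of the Anthropic-style request shape this endpoint accepts, assuming a vLLM server at localhost:8000; the served model name is a placeholder:

```python
# POST to the new /v1/messages endpoint using the Anthropic Messages
# schema (model, max_tokens, messages). URL and model are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/v1/messages",
    json={
        "model": "my-served-model",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json())
```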
Nicolò Lucchesi
4dfdb821c8 [P/D] Dynamic kv_output_aggregator collect size (#26734)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-22 18:07:58 +02:00
Russell Bryant
58fab50d82 [Frontend] Require flag for loading text and image embeds (#27204)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-22 15:52:02 +00:00
Isotr0py
db6f28d898 [Bugfix] Fix HF format InternVL large variants video processing (#27330)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-22 08:39:23 -07:00
Cyrus Leung
14e2f1231e [Bugfix] Make get_mrope_input_positions instance methods (#27342)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-22 08:38:34 -07:00
Chendi.Xue
7c4767f1eb [NIXL] use Host buffer to support TP_ratio > 1 for XPU (#27140)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2025-10-22 15:28:13 +00:00
Jee Jee Li
9771e0b432 [Bugfix] Add missing 'is_internal_router' attribute to FusedMoEWithLoRA (#27351)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-22 08:19:12 -07:00
Reinforce-II
980de31ca0 [bugfix] remove unused parameters to reduce unnecessary vram usage (#26789)
Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-10-22 08:16:09 -07:00
Wentao Ye
1c160841ea [Bug] Fix DeepSeek-V2.5-1210-FP8 issue (#27267)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-22 11:00:10 -04:00
Mark McLoughlin
4ca13a8667 [NIXL] Terminate handshake listener thread in shutdown (#26404)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-10-22 16:59:53 +02:00
Isotr0py
675aa2ec64 [Model] Upstream Deepseek-OCR model (#27247)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-22 07:59:15 -07:00
dongbo910220
3ae082c373 [Chore] Separate out optional dependency checks from vllm.utils (#27207)
Signed-off-by: dongbo910220 <1275604947@qq.com>
Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-22 10:44:21 -04:00
Alexei-V-Ivanov-AMD
49c00fe304 Mirroring changes in test-pipeline.yaml into test-amd.yaml (#27242)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-10-22 09:59:45 -04:00
Mark McLoughlin
141d3b9fc5 [docs] Update v1 metrics design doc (#27332)
Signed-off-by: Simon Mo <simon.mo@hey.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: atalhens <sneh.lata@nutanix.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: atalhens <sneh.lata@nutanix.com>
2025-10-22 06:29:15 -07:00
Jee Jee Li
abf3db40ef [Core] Handle MoE LoRA edge cases (#27335)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-22 13:14:33 +00:00
gnovack
8e4ca4d14e Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size' (#27311)
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-22 12:23:57 +00:00
Wentao Ye
1a0f4defb7 [Log] Add Warning for LLM(data_parallel_size=k) single-process DP Usage (#27282)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-22 12:12:21 +00:00
Li, Jiang
843af7f7fc [Bugfix][CPU] Disable dual stream execution for experts on CPU (#27320)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-10-22 11:02:27 +00:00
wang.yuqi
1f633b8632 [Frontend][3/N] Improve all pooling task | Support binary embedding response (#27066)
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-10-22 18:38:57 +08:00
ExtReMLapin
a4c29e6e82 Fix reasoning streaming with tool_choice="required" (#24108)
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
Co-authored-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-10-22 09:42:55 +00:00
Harry Mellor
8f18feb191 Remove last level references not removed in #26355 (#27260)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-22 09:18:17 +00:00
Huy Do
ed540d6d4c Update release pipeline for PyTorch 2.9.0 (#27303)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-10-22 09:18:01 +00:00
wangxiyuan
f6027b2855 [1/N][Platform] Cleanup useless function (#26982)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-22 09:04:57 +00:00
Jiangyun Zhu
ab3e80042e [torch.compile] Enable silu_mul_fp8_quant fusion without custom ops enabled (#27146)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-10-22 00:22:39 -04:00
Cyrus Leung
ceacedc1f9 [Benchmark] Add plot utility for parameter sweep (#27168)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-21 20:30:03 -07:00
Nicolò Lucchesi
bfa59be8f1 [CI] Nixl integration tests DP-EP (#27199)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-22 11:17:48 +08:00
vllmellm
265ecb05fb [DOC] [ROCm] Add ROCm quickstart guide (#26505)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-22 03:10:48 +00:00
Lain
09a7e6f617 [Deepseek v3.2] Remove extra logic in indexer (#26465)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Lain <siyuanf@nvidia.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-21 23:34:03 +00:00
Tyler Michael Smith
6c2eef5a5d [P/D] KVConnector for decode benchmarking (#25986)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-10-21 16:30:47 -07:00
Benjamin Chislett
19748806f0 [Bugfix] skip cuda graph for drafter when running with eager (#26821)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-10-21 15:39:09 -07:00
ExtReMLapin
4a8a567e16 Updated xgrammar backend to not deny supported string formats (#27253)
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
Co-authored-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-21 22:25:23 +00:00
Alexander Matveev
344a0017c0 [Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE (#26440)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-10-21 21:38:29 +00:00
Huy Do
becb7de40b Update PyTorch to 2.9.0+cu129 (#24994)
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-21 17:20:18 -04:00
Tao He
250fb1b8ea [Bugfix] Fix the decoding metadata of dense MLA's FP8 KV cache (#27144)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-10-21 18:27:03 +00:00
Nick Hill
647214f3d5 [V0 Deprecation] Remove V0 executors (#27142)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-21 11:09:37 -07:00
David Whyte-Gray
ddeec11ba9 [Bugfix][P/D] Reduce num_threads used by nixl ucx backend (#27196)
Signed-off-by: David Whyte-Gray <40244437+dagrayvid@users.noreply.github.com>
2025-10-21 13:41:52 -04:00
Wentao Ye
86ed77022d [Feature] Batch Invariant for R1 TP 8 on Blackwell (#27229)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-21 10:25:55 -07:00
Micah Williamson
aa1356ec53 [ROCm] Update Triton, Torch, and AITER branches for ROCm base Dockerfile (#27206)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-10-21 12:01:23 -04:00
Pavani Majety
ecc3c0940a Add @pavanimajety to .github/codeowners for Flashinfer, ModelOpt related code (#27213)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-10-21 22:59:53 +08:00
JartX
ba09652de2 [ROCM] Enable CompressedTensorsWNA16 (#27187)
Signed-off-by: JartX <sagformas@epdcenter.es>
2025-10-21 10:43:23 -04:00
Harry Mellor
bd66b8529b [CI] Install pre-release version of apache-tvm-ffi for flashinfer (#27262)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-21 14:23:56 +00:00
dongbo910220
6c728f7771 [Chore] Separate out NCCL utilities from vllm.utils (#27197)
Signed-off-by: dongbo910220 <1275604947@qq.com>
2025-10-21 06:18:23 -07:00
Daniel Cámpora
80e9452984 [Deepseek v3.2] Optimize top_k_per_row (#26763)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-21 08:30:07 +00:00
Roger Wang
c3a2c6ac5f [MM][Core] Decouple ViT backend from LM backend (#27061)
Signed-off-by: Roger Wang <hey@rogerw.io>
2025-10-21 00:30:10 -07:00
Nicolò Lucchesi
72f431e709 [Nixl] Minor refactor to handshake related metadata (#26410)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-21 09:07:47 +02:00
Zebing Lin
be4445072c [Fix][Spec Decode] Fix llama4 draft loading with different quantization (#27136)
Signed-off-by: linzebing <linzebing1995@gmail.com>
2025-10-20 23:19:00 -07:00
Benjamin Chislett
f381cf2302 [Bugfix] Fix broken MTP weight loading for FP8 KV Scales (#27227)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-10-20 22:51:44 -07:00
Varun Sundar Rabindranath
5ff5d94e77 [Bugfix] Fix gpt-oss w4a8 DP/EP on B200 (#26729)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-10-21 01:51:14 -04:00
Shu Wang
f95da13c3d [ModelOpt] Load w13/w2_input_scale for all experts, nvfp4 (#26135)
Signed-off-by: Shu Wang <shuw@nvidia.com>
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-10-21 01:50:31 -04:00
Po-Han Huang (NVIDIA)
aef368aa08 [BugFix] GPT-OSS Attention DP + MoE TP weight loading issue (#24032)
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
2025-10-21 04:03:47 +00:00
Chen Wu
5f6cbf60d6 [Feature][Kernel]FusedMoE LoRA (#21229)
Signed-off-by: wuchen <cntryroa@gmail.com>
Signed-off-by: banjuede <lmklhc@163.com>
Signed-off-by: Chen Wu <cntryroa@gmail.com>
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: bk-201 <joy25810@foxmail.com>
Co-authored-by: wuchen <wuchen@zetyun.com>
Co-authored-by: Nathan Van Gheem <vangheem@gmail.com>
Co-authored-by: banjuede <lmklhc@163.com>
Co-authored-by: Danielle Robinson <dmmaddix@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: bk-201 <joy25810@foxmail.com>
2025-10-21 03:01:37 +00:00
Russell Bryant
3ada34f9cb [Frontend] Enforce tokenize=False when applying chat template (#27205)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-21 02:57:34 +00:00
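For context, a minimal sketch of the enforced pattern using the standard transformers API; the model name is illustrative. With tokenize=False the template renders to a prompt string, leaving tokenization to the engine:

```python
# apply_chat_template with tokenize=False returns the rendered prompt
# string rather than token ids; the engine tokenizes it afterwards.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # illustrative
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Hi"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```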
Lunwen He
0eb8f2b880 create is_in_the_same_node on cpu (#26832)
Co-authored-by: Lunwen He <lunwenh@meta.com>
2025-10-21 02:04:14 +00:00
Fadi Arafeh
163965d183 [cpu] Dispatch un-quantized linear to oneDNN/ACL by default for AArch64 (#27183)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Co-authored-by: Michael Yang <Michael.Yang@arm.com>
2025-10-21 02:02:58 +00:00
Nick Hill
a03cf9bc70 [V0 Deprecation] Remove V0 metrics code (#27215)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-21 02:02:10 +00:00
Isotr0py
352c0c8a28 [Quantization] Automatically infer AWQ modules_to_not_convert field (#26909)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-21 01:49:28 +00:00
Andrew Xia
bfe0b4bd2a [ez] add uv lock to gitignore (#27212)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2025-10-21 00:37:44 +00:00
Concurrensee
58fbbcb2f5 [ROCm] enable some tests in entrypoints test groups on AMD (#26725)
Signed-off-by: Yida <yida.wu@amd.com>
2025-10-21 00:37:16 +00:00
Heng Guo
87778d5f00 [Feature][Quantization] auto_round support for mixed bits quantization (#23812)
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: Heng Guo <heng.guo@intel.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-20 22:23:30 +00:00
Nicolò Lucchesi
f9e7ad5400 [Bugfix][CI] Fix Distributed Tests (4 GPUs) async_sched+ray test (#27195)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-20 16:34:54 +00:00
shivampr
4d0f266113 [Kernel][Model] Tune fused_moe Triton configs for Qwen3-30B A3/A3B on H100 (FP8/BF16) (#26268)
Signed-off-by: Shivam <shivampr.dev@gmail.com>
2025-10-20 07:48:01 -07:00
Eugene Khvedchenya
e93ff6c8b9 Nemotron Nano V2 VL + EVS Video Support (#27107)
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Natan Bagrov <nbagrov@nvidia.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-20 22:19:11 +08:00
ioana ghiban
1c691f4a71 AArch64 CPU Docker pipeline (#26931) 2025-10-20 07:09:40 -04:00
Jiangyun Zhu
9fce7bee74 [Kernel] Accelerate solve_tril with TMA (#26746)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2025-10-20 05:39:02 +00:00
Andy Lo
b63f2143f8 [LoRA] LoRA cuda graph specialization (#25914)
Signed-off-by: Andy Lo <andy@mistral.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-20 04:21:09 +00:00
Yi Zhang
f32bf7582e [Model][VLM] Support Bee-8B Model (#27012)
Signed-off-by: uyzhang <yi.zhang.4096@gmail.com>
Signed-off-by: Yi Zhang <zhangyi970819@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-10-20 02:31:26 +00:00
Yongtao Huang
8a81d776ce Fix typo in ValueError message: use kv_role instead of kv_disagg_role (#27166)
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
2025-10-19 19:47:19 +00:00
Sergei Skvortsov
f6fdacd82c [Bugfix] Fix error with penalties when speculative decoding and structural output are enabled (#26586)
Signed-off-by: southfreebird <yvorott@gmail.com>
2025-10-19 19:24:46 +00:00
Cyrus Leung
d31f7844f8 [Misc] Move utils to avoid conflicts with stdlib, and move tests (#27169)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-19 05:20:55 -07:00
iAmir97
7a6c8c3fa1 [Chore] Separate out vllm.utils.network_utils (#27164)
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com>
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com>
2025-10-19 03:06:32 -07:00
Jianyu Huang
221bf72577 output type conversion fix (#27159) 2025-10-19 08:10:07 +00:00
Cyrus Leung
b3aba04e5a [Benchmark] Convenience script for multiple parameter combinations (#27085)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-18 23:57:01 -07:00
dongbo910220
8a297115e2 [Chore] Separate out hashing utilities from vllm.utils (#27151)
Signed-off-by: dongbo910220 <1275604947@qq.com>
2025-10-19 11:09:38 +08:00
22quinn
191eed0bb9 [BugFix] Fix lazy imports involving outlines_core (#27158)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-10-19 02:35:32 +00:00
Woosuk Kwon
fb860670da [Minor] Remove unused env variable (#27161) 2025-10-18 18:48:35 -07:00
Tova Movshovitz
83e760c57d [V1][Metrics][Plugin] Add plugin support for custom StatLoggerBase implementations (#22456)
Signed-off-by: tovam <tovam@pliops.com>
2025-10-18 15:12:46 -07:00
Lucas Wilkinson
c2bba69065 [BugFix] Disable fp8 kv-cache by default for DeepSeek V3.2 (#27121)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-18 22:05:23 +00:00
Boyuan Feng
e133d6d218 [BugFix] fix graph partition signature (#27139)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-10-18 17:34:36 -04:00
dongbo910220
a1946c9f61 [Chore] Separate out profiling utilities from vllm.utils (#27150)
Signed-off-by: dongbo910220 <1275604947@qq.com>
2025-10-18 19:12:01 +00:00
Lucas Wilkinson
9f020f4f31 [BugFix] Fix failing gemma-3-1b-it test: test_lm_eval_accuracy_v1_engine[google/gemma-3-1b-it] (#27111)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-10-18 12:44:39 -06:00
Nick Hill
3b45075206 [Minor] Add some clarifying comments to recent changes (#27130)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-18 09:52:45 -07:00
Yongtao Huang
168e578efc Fix incorrect string formatting in barrier timeout exceptions (#27149)
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
2025-10-18 09:51:57 -07:00
Isotr0py
6ac5e06f7c [Chore] Clean up pytorch helper functions in vllm.utils (#26908)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: isotr0py <2037008807@qq.com>
2025-10-18 09:48:22 -07:00
Lukas Geiger
5c2acb270a [Models][QwenVL] Remove unnecessary .contiguous() calls (#27106)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-10-18 07:05:05 -07:00
Nicolò Lucchesi
b26b70bec4 [Misc] Refactor get_kv_cache_spec into AttentionLayerBase (#26587)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-18 13:51:21 +00:00
Fadi Arafeh
ab4be40fc5 [fix][cpu] fix prefill attention in CPU attention backend (#27035)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-10-18 13:30:21 +00:00
Wentao Ye
245e4f2c01 [Feature] Batch Invariant: Support DeepGEMM and Blackwell (#27127)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-18 09:28:05 -04:00
iAmir97
1d165d6d85 [Chore] Separate out vllm.utils.mem_utils (#27143)
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com>
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com>
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-18 10:06:59 +00:00
dongbo910220
83004020fd [Test] Add test for /health endpoint on engine failure (#26074)
Signed-off-by: dongbo910220 <1275604947@qq.com>
2025-10-18 09:59:05 +00:00
Chendi.Xue
12e21701e7 [DOC][FEATURES][CPU] Update CPU features for v1 (#27135)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2025-10-18 01:10:45 -07:00
Varun Sundar Rabindranath
30a33b92ee [Misc] Rev DeepEP (#27122)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-10-18 14:54:29 +08:00
Hanchenli
7c572544e4 [GPT-OSS] Structure_Tag support for gpt-oss tool-call in cot (#25515)
Signed-off-by: Hanchenli <lihanc2002@gmail.com>
Signed-off-by: Hanchenli <61769611+Hanchenli@users.noreply.github.com>
Signed-off-by: Wei Wei <wwei6@meta.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Wei Wei <wwei6@meta.com>
Co-authored-by: Wei Wei <weiweinpu@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-10-17 21:55:54 -07:00
Huamin Li
c312320764 [CI/Build] tests(v1): feed Triton attention the (num_blocks, 2, …) KV cache layout in backend-correctness tests (#26663)
Signed-off-by: Huamin Li <3ericli@gmail.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
2025-10-17 21:11:26 -07:00
ZiTian Zhao
c981f0ea78 [Perf] Add H100 fused MoE config (#25398)
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com>
2025-10-18 02:21:27 +00:00
Lehua Ding
6367bde739 [BugFix][Core] Fix error when enable async-scheduling in multi-node env (#25887)
Signed-off-by: Lehua Ding <lehuading@tencent.com>
Signed-off-by: Lehua Ding <lehuading@qq.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
2025-10-17 22:16:18 +00:00
Wentao Ye
f50cc221ea [Test] Make test_failure more stable for batch invariance (#27054) 2025-10-17 16:59:08 -04:00
Pradyun92
acedc74b1a [V1][Spec Decode] Fix greedy temperature detection after sampler refactor (#27077)
Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com>
2025-10-17 13:27:47 -07:00
Zhuohan Li
d29483b58a [Minor] Remove unnecessary error message (#27115)
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
2025-10-17 20:02:12 +00:00
Michael Goin
950cf9e58e [Bugfix] Use PIECEWISE cudagraphs on Blackwell if max_model_len > 131072 (#27114)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-10-17 19:47:18 +00:00
Isotr0py
3125d79950 [Chore] Remove unused PolyNorm layer (#27110)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-17 19:03:43 +00:00
vllmellm
e33ee23ee3 [Bugfix] [AITER] [ROCm] Fix Quark MoE Quant Config and AITER Fused MoE quant type logic (#27029)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-10-17 12:51:10 -06:00
rasmith
b10c64c834 [ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_moe models with rms_norm (Qwen3-235B-A22B, Qwen3-30B-A3B, etc.) (#26192)
Signed-off-by: Randall Smith <ransmith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: rasmith <Randall.Smith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-17 14:17:18 -04:00
Aleksandr Malyshev
0925b28a8e [ROCM] MoE fp4 CK kernel (#26545)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-10-17 14:06:33 -04:00
Nicolò Lucchesi
99722d5f0e [CI] Remove forbidden slash (#27112)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-17 09:38:00 -07:00
4c91a28e30 [bugfix] Qwen3-VL fix video incorrect timestamp calculations while do_sample_frames=True (#27104)
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com>
2025-10-17 16:26:33 +00:00
Patrick von Platen
b038d9c40c [Data-parallel] Allow DP>1 for world_size > num_gpus on node (8) (#26367)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Rui Qiao <ruisearch42@gmail.com>
2025-10-17 08:24:42 -07:00
Nicolò Lucchesi
2ba60ec7fe [CI] Nixl integration tests (#27010)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-17 07:13:31 -07:00
Luka Govedič
bd7157a071 [torch.compile] Enable attention and allreduce fusion without custom ops enabled (#24604)
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-17 08:10:23 -06:00
Yongtao Huang
be429d0cfd Fix incorrect docstring for stop_profile() method (#27101)
Signed-off-by: Yongtao Huang <yongtaoh2022@gmail.com>
2025-10-17 06:30:23 -07:00
Reima Karhila (AMD)
c253745eb8 [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI350 and MI355 (#25586)
Signed-off-by: Reima Karhila <reima.karhila@amd.com>
Signed-off-by: xaguilar <Xavier.AguilarFruto@amd.com>
Co-authored-by: xaguilar <Xavier.AguilarFruto@amd.com>
2025-10-17 04:56:12 -07:00
Jee Jee Li
daec4d2624 [Model]Improve Qwen3VLMoeForConditionalGeneration packed_modules_mapping (#27096)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-17 04:47:00 -07:00
Harry Mellor
6c9fdbf725 [Docs] Replace rst style double-backtick with md single-backtick (#27091)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-17 02:47:34 -07:00
Harry Mellor
483ea64611 [Docs] Replace all explicit anchors with real links (#27087)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-17 02:22:06 -07:00
Mengqing Cao
e20eba753b [VLM][Refactor] Remove useless func get_input_positions in MRotaryEmbedding (#27088)
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-17 02:00:30 -07:00
cong-meta
bbc1b29665 Update troubleshooting.md and add a reminder about VLLM_TRACE_FUNCTION usage (#27069)
Signed-off-by: cong-meta <prowindy@hotmail.com>
2025-10-17 01:53:06 -07:00
Chauncey
acb1bfa601 [CI] fix docs build failed (#27082)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-10-17 07:53:40 +00:00
zhrrr
75c7ad9918 [Kernel][Performance] Fuse float cast and renormalize to topk softmax kernel (#26717)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
2025-10-17 07:30:35 +00:00
Li, Jiang
5550ff9c25 [CI/Build] Update compressed tensor test path to fix CPU CI (#27068)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-10-16 22:34:56 -07:00
Said Taghadouini
3aeb19a39e [Model] Add support for LightOnOCR (#26916)
Signed-off-by: Said Taghadouini <taghadouinisaid@gmail.com>
Signed-off-by: Said Taghadouini <84044788+staghado@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-10-17 05:05:24 +00:00
Cyrus Leung
8c017b3490 [Model] Always use Transformers backend for PaliGemma and Gemma3-MM (#26715)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-17 05:03:35 +00:00
Zhewen Li
9c2c2287a0 [CI/Build] Update Llama4 eval yaml (#27070)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-17 04:59:47 +00:00
Jee Jee Li
fec2b341ad [Kernel] Lazy import FlashInfer (#26977) 2025-10-17 04:48:18 +00:00
Jee Jee Li
87bc0c492f [Bugfix] Fix ReplicatedLinearWithLoRA (#27065)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-17 04:43:16 +00:00
Nick Hill
fe3b9372ad [Core] Change execute_model_with_error_logging() to be a ctx manager (#27060)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-17 11:45:32 +08:00
Tao He
bde9e2272a [Bugfix][Qwen] fixes the weights dtype in qwen3_next: it is actually a bfloat16 (#27030)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
2025-10-17 03:37:52 +00:00
Boyuan Feng
08405609cc disable graph partition in custom op (#26952)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Boyuan Feng <fby.1994@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-17 11:08:47 +08:00
Nick Hill
ab81379ea6 [Perf] Exploit out-of-band buffers in shm_broadcast (#26961)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-16 20:08:03 -07:00
Harry Mellor
4ffd6e8942 [Docs] Reduce custom syntax used in docs (#27009)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-16 20:05:34 -07:00
Tomas Ruiz
965c5f4914 vllm bench serve shows the number of failed requests (#26478)
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>
2025-10-16 19:55:09 -07:00
Lukas Geiger
4d055ef465 Remove unused imports (#26972)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-10-16 19:51:17 -07:00
Boyuan Feng
17c540a993 [torch.compile] fix simple inductor graph partition test (#27050)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-10-16 21:09:36 -04:00
Cyrus Leung
4d4d6bad19 [Chore] Separate out vllm.utils.importlib (#27022)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-17 00:48:59 +00:00
Lucia Fang
11ae016bd7 [torch.compile] Passing only necessary compilation config to inductor pass config (#27041)
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com>
2025-10-17 00:01:52 +00:00
jiahanc
41d3071918 [NVIDIA] [Perf] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (#26714)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-10-16 16:20:25 -07:00
Harry Mellor
fb5e10d3fb Refactor Transformers backend to use mixins (#26906)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-16 21:50:39 +00:00
Bram Wasti
b2f78cbad4 [small][batch invariance] Rename the env and internal flags to simplify usage (#26855)
Signed-off-by: Bram Wasti <bwasti@meta.com>
2025-10-16 21:40:25 +00:00
Wentao Ye
23583ee28c [Bug] Add Assertion for random-input-len / random-output-len (#26834)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-16 21:36:39 +00:00
Michael Goin
01c977e96d [CI] Prune Quantization Tests and skip compilation (#27038)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-10-16 17:26:35 -04:00
Wentao Ye
b3dda72c23 [Feature] Migrate DeepGEMM API from get_m_alignment_for_contiguous_layout to get_mk_alignment_for_contiguous_layout (#26935)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-16 16:46:48 -04:00
Varun Sundar Rabindranath
fb0571b077 [GPTOSS][DP/EP][Marlin] Enable GPTOSS Batched DP/EP using Marlin kernels (#25997)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-10-16 12:53:11 -07:00
Wentao Ye
2ed8b6b3d0 [Bug] Fix batch invariant test: change 'has' to 'is' (#27032)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-16 19:45:14 +00:00
kimbochen
013abde6ef Adding Warmup to Benchmark Serving (#26943)
Signed-off-by: Kimbo Chen <chentenghung@gmail.com>
2025-10-16 12:44:32 -07:00
Kyle Sayers
a5464dcf92 [Compressed Tensors] Always clone output for compile robustness (#26849)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-10-16 19:29:59 +00:00
Mandy Li
ac3ed5a815 Support block size of 256 used by Intel HPU (#26883)
Signed-off-by: mandy-li <mandy.j.li@intel.com>
2025-10-16 15:10:57 -04:00
Andrew Xia
e6ba2000ae [gpt-oss][1/N] EZ: refactor serving_responses for modularity (#26948)
Signed-off-by: Andrew Xia <axia@meta.com>
2025-10-16 18:44:06 +00:00
Harry Mellor
aa255ff55a Support set in the CLI generation (#27031)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-16 18:07:18 +00:00
ZiTian Zhao
7bb736d00e Fix Qwen2.5 VL image grid docstring (#27033)
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com>
2025-10-16 09:57:36 -07:00
Jee Jee Li
9f4e30904b [Model] Fix Qwen3VL mm mapping (#27027)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-16 09:45:59 -07:00
rongfu.leng
5afd3276df [Feature] Add process_weights_after_loading to AttentionImpl (#26870)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-10-16 08:02:30 -07:00
Tahsin Tunan
43721bc67f [CI] Replace large models with tiny alternatives in tests (#24057)
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-16 15:51:27 +01:00
Kay Yan
02d709a6f1 [docs] standardize Hugging Face env var to HF_TOKEN (deprecates HUGGING_FACE_HUB_TOKEN) (#27020)
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
2025-10-16 15:31:02 +01:00
Mark McLoughlin
4a510ab487 [NIXL] Improve request_finished() debug logs (#25665)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-10-16 15:55:17 +02:00
Matthew Bonanni
314fa8abbf [Attention] Tune CUTLASS MLA num_splits (#26846)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-16 06:36:09 -07:00
Cyrus Leung
334535b6fb [Benchmark] Show E2EL by default for pooling models (#27014)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-16 12:47:09 +00:00
bogdanm
dcbb3f1871 [Bugfix] Correct LayerNorm epsilon parameter in modernbert.py (#27008)
Signed-off-by: bogdanm <152898065+bogdan01m@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-16 12:27:44 +00:00
Sungjae Lee
00417f4e44 [MISC] fix import violations for re and triton modules (#26654)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-10-16 03:38:27 -07:00
Lukas Geiger
ed344f4116 Cleanup code after Python 3.10 upgrade (#26520)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-10-16 03:38:23 -07:00
CSWYF3634076
e51928793e [Model][Bugfix] fix ernie45 vl run failed from shared experts optimization (#26885)
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
2025-10-16 03:37:35 -07:00
Cyrus Leung
d2740fafbf [Chore] Separate out vllm.utils.collections (#26990)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-16 08:35:35 +00:00
Cyrus Leung
17838e50ef [Benchmark] Use truncation by default for pooling benchmarks (#26992)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-16 16:02:39 +08:00
Zhewen Li
44c8555621 [CI/Build] Fix AMD import failures in CI (#26841)
Signed-off-by: zhewenli <zhewenli@meta.com>
2025-10-16 07:28:20 +00:00
Akash kaothalkar
f7d318de2b [Hardware][CPU][PowerPC]Disable torch.compile() in toptopk sampling (#26987)
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
2025-10-15 22:36:59 -07:00
Cyrus Leung
76f0d05bc6 [CI/Build] Update expected beam search output for Phi3V (#26978)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-16 05:12:44 +00:00
Bram Wasti
7d8975de84 Deepseek-v3 Batch Invariant on 8xH100 (#26609)
Signed-off-by: Bram Wasti <bwasti@meta.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-10-15 22:06:02 -07:00
Vadim Gimpelson
785d8b6410 [PERF] Qwen3-next MTP speedup (change bool mask indexing to index_select / index_copy to reduce d2h) (#26437)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-10-16 12:18:31 +08:00
Cyrus Leung
f6cdc9a02f [Chore] Rename utils submodules (#26920)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-16 03:58:13 +00:00
Chendi.Xue
509cdc0370 [DOC][XPU]update feature parity with Intel GPU (#26954)
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2025-10-15 20:07:10 -07:00
Richard Zou
9b6504c307 [BugFix] Work around graph partition x torch.compile cache issue (#26956)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-10-15 20:06:11 -07:00
Angela Yi
e19b16dde6 [bugfix] Fix SP + PP without specifying compile size (#26955)
Signed-off-by: angelayi <yiangela7@gmail.com>
2025-10-15 20:05:33 -07:00
ahao-anyscale
582f2c6be7 [BUG] Allow runai_streamer_sharded in config check (#26958)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
2025-10-15 20:05:14 -07:00
Michael Goin
f8a0acbdbe [CI] Enable Blackwell Llama4 MoE tests (#26731)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-10-15 21:02:57 -06:00
kliuae
1317034379 [ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops (#24097)
Signed-off-by: chenjun <junchen2@amd.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: valarLip <103567126+valarLip@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-10-16 10:41:34 +08:00
InChang Jeong
0ecc553ee6 [Bugfix] reasoning_parser parameter handling in run_batch.py (#26225)
Signed-off-by: inc-jeong <inc.jeong@navercorp.com>
Signed-off-by: InChang Jeong <inc.jeong@navercorp.com>
Co-authored-by: USER <user@AL02367916.local>
2025-10-16 10:24:05 +08:00
felixzhu555
f96bc3649c [Qwen3-Next] Add tuned MoE config for Qwen3-Next FP8 on H100 tp2 (#26887)
Signed-off-by: Felix Zhu <felixzhu555@gmail.com>
2025-10-15 18:55:05 -07:00
Alexei-V-Ivanov-AMD
938c43ea7f [ci] Adjusting AMD test composition 2025-10-14 (#26852)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-10-15 23:52:13 +00:00
Adrian Abeyta
0a9ef0cfce Move query quantization to attention layer for Flashinfer & Triton. (#26534)
Signed-off-by: adabeyta <aabeyta@redhat.com>
Signed-off-by: Adrian Abeyta <aabeyta@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-15 19:01:38 -04:00
Wentao Ye
e5b438a247 [Bug] Temporally Disable VLLM_ALLREDUCE_USE_SYMM_MEM by Default (#26925)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-15 16:18:50 -04:00
XiaobingZhang
0b99f5d302 support flashinfer_fp4 moe for 5090 gpu (#26669)
Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-10-15 15:06:47 -04:00
Benji Beck
1f491aa0c8 Vectorize RMS norm variance using vectorize_read_with_alignment (#26234)
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2025-10-15 11:54:41 -07:00
Kaixi Hou
de92d916fe [NVIDIA] Add support for cudnn fp4 gemm via flashinfer (#26107)
Signed-off-by: kaixih <kaixih@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-10-15 13:53:00 -04:00
Woosuk Kwon
a1063628a4 [Chore] Clean up CODEOWNERS (#26923)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-10-15 10:52:54 -07:00
XiaobingZhang
d796375258 [ModelOpt] Remove NVFP4 MoE K%16==0 constraint (#26891)
Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
2025-10-15 13:06:17 -04:00
Sam/Samuel
14f8456344 [Feature]: Use pydantic validation in observability.py config (#26637)
Signed-off-by: Samuel Wu <cernunnos1710@gmail.com>
Signed-off-by: Sam/Samuel <57896620+cern1710@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-15 16:44:03 +00:00
Pradeep Dasigi
4794c2bd92 Olmo 3 tool parser and tests (#26143)
Signed-off-by: Pradeep Dasigi <pradeepd@allenai.org>
2025-10-15 16:36:12 +00:00
Harry Mellor
d3cbaa08dc Lower sevarity of log when model info cache misses due to exception (#26917)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-15 09:01:09 -07:00
Cyrus Leung
828523ad8e [Chore] Separate out vllm.utils.async_utils (#26913)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-15 15:33:00 +00:00
Cyrus Leung
136a17fe6e [Chore] Separate out vllm.utils.func (#26904)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-15 13:03:58 +00:00
Boyuan Feng
f57438338d [BugFix] Patch inductor memory plan logic (#26878)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-15 12:51:45 +00:00
Max Wittig
5d598680e3 chore: remove unused marker (#26890)
Signed-off-by: Max Wittig <max.wittig@siemens.com>
2025-10-15 05:40:33 -07:00
wangxiyuan
8f4b313c37 [Misc] rename torch_dtype to dtype (#26695)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-15 12:11:48 +00:00
Cyrus Leung
f93e348010 [Misc] Remove isort and yapf ignores (#26888)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-15 12:09:03 +00:00
wang.yuqi
f54f85129e [Model][2/N] Improve all pooling task | Support multi-vector retrieval (#25370)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-10-15 11:14:41 +00:00
li2haipeng
d4d1a6024f [Lora]Load tuned multi-lora kernel configs from json files (#26319)
Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-10-15 09:45:14 +00:00
wangxiyuan
db1764e4e0 [Platform] allow platform to init dp group (#22243)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-10-15 02:32:17 -07:00
Jialin Ouyang
7f83b4ee8e [Easy] Get rid of unnecessary paraenthesis in kv_cache_manager (#26842)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-10-15 09:17:43 +00:00
ant-yy
5c3bae1a6a [Fix] Remove divisibility requirement between num_kv_heads and tp_size in bailing_moe (#26876)
Signed-off-by: vito.yy <vito.yy@antgroup.com>
2025-10-15 16:44:04 +08:00
Xudong Ma
5210dc3940 [Misc] Update TritonLanguagePlaceholder to have attributes that are used by Flash Linear Attention ops. (#26853)
Co-authored-by: Xudong Ma <mxd@meta.com>
2025-10-15 08:37:49 +00:00
youkaichao
650b51f9f9 [doc] add Context Parallel Deployment doc (#26877)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-10-15 16:33:52 +08:00
Cyrus Leung
6256697997 [Doc] ruff format remaining Python examples (#26795)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-15 01:25:49 -07:00
Wentao Ye
71557a5f7c [CI] Fix mypy for vllm/executor (#26845)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-10-15 01:23:33 -07:00
Zhewen Li
f3c378ffa7 [CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI (#21810)
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com>
Co-authored-by: Ye (Charlotte) Qi <ye.charlotte.qi@gmail.com>
2025-10-15 08:09:56 +00:00
Yongye Zhu
f5ed68ef63 [Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather (#26456)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-10-15 16:05:01 +08:00
Angela Yi
efdef57b1f [bugfix] Lazy import cv2 (#26869)
Signed-off-by: angelayi <yiangela7@gmail.com>
2025-10-15 07:47:50 +00:00
Cyrus Leung
b8a4572157 [Misc] Use helper function to generate dummy messages in OpenAI MM tests (#26875)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-10-15 07:17:37 +00:00
Mengqing Cao
302ef403a2 [DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends (#26656)
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-15 00:16:44 -07:00
sangho.lee
8865da157b [Bugfix][Multi Modal] Fix incorrect Molmo token processing (#26873)
Signed-off-by: sanghol <sanghol@allenai.org>
2025-10-15 07:13:59 +00:00
Boyuan Feng
f0862eae43 [Graph Partition] pass tests for decorator (#26831)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-10-15 06:39:48 +00:00
Isotr0py
8c851f6d04 [Bugfix] Fix qwen3-omni audio truncation issue (#26815)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-10-15 05:38:36 +00:00
Angela Yi
7cfa420f49 [BugFix] Patch inductor partitioning logic (#26735)
Signed-off-by: angelayi <yiangela7@gmail.com>
2025-10-15 05:04:32 +00:00
rongfu.leng
a27b288e4a [Feature] default --extra-body param to disable thinking in vllm bench serve (#26784)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-10-15 04:23:44 +00:00
zhrrr
e471d7ca7e [CI/Build][Bugfix] fix qutlass cmake error when set QUTLASS_SRC_DIR (#26773)
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-10-15 04:09:44 +00:00
Michael Yao
c43ca8259e [Docs] Move build.inc into arm.inc (#26862)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-10-14 20:35:08 -07:00
Tao Hui
85a65e7f51 [Model] Add DeepSeek-V3.1 reasoning parser (split from PR #24972) (#25589)
Signed-off-by: taohui <taohui3@gmail.com>
Signed-off-by: Tao Hui <taohui3@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-10-15 11:09:52 +08:00
kourosh hakhamaneshi
a2986b3e33 [Bugfix] Fixes prefix-repetition benchmark script (#26828)
Signed-off-by: Kourosh Hakhamaneshi <Kourosh@anyscale.com>
2025-10-15 02:54:43 +00:00
Morrison Turnansky
96b9aa5aa0 [Frontend][torch.compile] CompilationConfig Overhaul (#20283): name change compilation level to compilation mode, deprecation compilation level (#26355)
Signed-off-by: morrison-turnansky <mturnans@redhat.com>
Signed-off-by: Morrison Turnansky <mturnans@redhat.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-15 02:51:16 +00:00
Michael Goin
e66d787bce Disable FlashInfer sampler by default (#26859)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-10-15 02:35:18 +00:00
Chendi.Xue
bfad142e25 [BUGFIX][NIXL] quick fix for 'assert self.connector_worker is not None' in get_kv_connector_stats (#26851)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2025-10-15 02:33:25 +00:00
Zhikaiiii
9354660036 [Bugfix]fix Qwen3 xml tool parser (#26345)
Signed-off-by: Zhikaiiii <1658973216@qq.com>
2025-10-15 09:50:30 +08:00
Jialin Ouyang
07ca70af8d [Core][Easy] Use envs.__getattr__ for all Unify to environment variable access (#26810)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-10-15 01:41:18 +00:00
Luka Govedič
2dcd12d357 [torch.compile] Fix tests for torch==2.9 inductor partition (#26116)
Signed-off-by: ProExpertProg <lgovedic@redhat.com>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
2025-10-14 19:55:02 -04:00
Tyler Michael Smith
579d2e5458 [WideEP][P/D] Add usage stats for DP+EP and KV Connector (#26836)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2025-10-14 23:51:54 +00:00
Ye Hu
0512c04aee [frontend][gptoss] Add per turn stats into Harmony Context (#25061)
Signed-off-by: lacora <hyelacora@gmail.com>
Co-authored-by: Ye Hu <yehu@fb.com>
2025-10-14 16:48:13 -07:00
Michael Goin
7e0ef4084a [CI Failure] Fix torchao dep failure for Quantization Test (#26824)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-10-14 16:41:43 -07:00
Nick Hill
4aed506b65 [Core] Streamline some structured output related code (#26737)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-14 23:27:44 +00:00
Boyuan Feng
a86b4c58e8 remove attn output view kernel (#26680)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Boyuan Feng <fby.1994@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-14 22:53:10 +00:00
Nick Hill
ff4810ba73 [Minor] Group async_scheduling related fields in model runner init (#26736)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-10-14 14:46:37 -07:00
Nan Qin
9d6964926e fix: response_format for completion (#23212)
Signed-off-by: Nan2018 <qinnanjoshua@gmail.com>
2025-10-14 21:23:22 +00:00
Dhruvil Bhatt
0e65818910 Added MoE configs for llama 4, H200 device with tp=4/8 tuning (#26837)
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
2025-10-14 14:21:03 -07:00
Jialin Ouyang
380f17527c [Perf] Cache vllm.env.__getattr__ result to avoid recomputation (#26146)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-10-14 17:03:21 -04:00
HDCharles
b92ab3deda Notice for deprecation of AutoAWQ (#26820)
Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-10-14 13:39:59 -07:00
Jialin Ouyang
acaa2c0a4a [Core] Reuse empty block lists whenever possible in KVCacheBlocks to mitigate GC costs (#24964)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-10-14 12:58:43 -07:00
Matthew Bonanni
82af928c41 [Attention][Spec Decode] FlashMLA spec decode support (#26541)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-10-14 19:38:20 +00:00
Huamin Li
87efc681db llama4_vision_rope: add HIP override to accept (q, k) and avoid (positions, q, k) mismatch (#26790)
Signed-off-by: Huamin Li <3ericli@gmail.com>
2025-10-14 11:54:12 -07:00
1567 changed files with 102850 additions and 37336 deletions

View File

@@ -1,12 +1,12 @@
 # For vllm script, with -t option (tensor parallel size).
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise -b "auto" -l 1000 -f 5 -t 1
-model_name: "nm-testing/Qwen2-1.5B-Instruct-W8A16-Channelwise"
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
+model_name: "HandH1998/QQQ-Llama-3-8b-g128"
 tasks:
 - name: "gsm8k"
   metrics:
   - name: "exact_match,strict-match"
-    value: 0.595
+    value: 0.419
   - name: "exact_match,flexible-extract"
-    value: 0.582
+    value: 0.416
   limit: 1000
   num_fewshot: 5

View File

@@ -0,0 +1,12 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 100 -t 8
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
  metrics:
  - name: "relaxed_accuracy,none"
    # TODO(zhewenl): model card is 0.90, but the actual score is 0.80.
    value: 0.80
  limit: 100
  num_fewshot: 0

View File

@@ -0,0 +1,10 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 250 -t 8 -f 5
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tasks:
- name: "mmlu_pro"
  metrics:
  - name: "exact_match,custom-extract"
    value: 0.80
  limit: 250 # will run on 250 * 14 subjects = 3500 samples
  num_fewshot: 5

View File

@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
+# For vllm script, with -t option (tensor parallel size)
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -l 1319 -t 1
 model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
 tasks:
 - name: "gsm8k"

View File

@@ -0,0 +1,12 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m Qwen/Qwen2.5-VL-7B-Instruct -l 2500 -t 1
model_name: "Qwen/Qwen2.5-VL-7B-Instruct"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
  metrics:
  - name: "relaxed_accuracy,none"
    value: 0.855
  limit: 2500
  num_fewshot: 0

View File

@@ -0,0 +1,14 @@
model_name: "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8"
tasks:
- name: "mmlu_pro"
metrics:
- name: "exact_match,custom-extract"
value: 0.82
limit: 250 # will run on 250 * 14 subjects = 3500 samples
num_fewshot: 5
enforce_eager: false # we use false to speed up the eval process
kv_cache_dtype: fp8 # we use fp8 to speed up the eval process
max_model_len: 40960
apply_chat_template: true
fewshot_as_multiturn: true
gen_kwargs: "temperature=0,top_p=1,top_k=0,max_gen_toks=5632,until=<|ENDANSWER|>"

View File

@@ -0,0 +1 @@
Qwen3-235B-A22B-Instruct-2507-FP8.yaml

View File

@@ -0,0 +1 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8-MM.yaml

View File

@@ -0,0 +1 @@
Qwen2.5-VL-7B-Instruct.yaml

View File

@@ -0,0 +1,44 @@
#!/bin/bash

# We can use this script to compute baseline accuracy on chartqa for vllm.
#
# Make sure you have lm-eval-harness installed:
#   pip install lm-eval==0.4.9

usage() {
    echo
    echo "Runs lm eval harness on ChartQA using multimodal vllm."
    echo "This pathway is intended to be used to create baselines for "
    echo "our correctness tests in vllm's CI."
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -m    - huggingface stub or local directory of the model"
    echo "  -l    - limit number of samples to run"
    echo "  -t    - tensor parallel size to run at"
    echo
}

while getopts "m:l:t:" OPT; do
    case ${OPT} in
        m )
            MODEL="$OPTARG"
            ;;
        l )
            LIMIT="$OPTARG"
            ;;
        t )
            TP_SIZE="$OPTARG"
            ;;
        \? )
            usage
            exit 1
            ;;
    esac
done

lm_eval --model vllm-vlm \
    --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE" \
    --tasks chartqa \
    --batch_size auto \
    --apply_chat_template \
    --limit $LIMIT

View File

@@ -0,0 +1,50 @@
#!/bin/bash

# We can use this script to compute baseline accuracy on MMLUPRO for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
#   pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
    echo
    echo "Runs lm eval harness on MMLU Pro using huggingface transformers."
    echo "This pathway is intended to be used to create baselines for "
    echo "our automated nm-test-accuracy workflow"
    echo
    echo "usage: ${0} <options>"
    echo
    echo "  -m    - huggingface stub or local directory of the model"
    echo "  -l    - limit number of samples to run"
    echo "  -f    - number of fewshot samples to use"
    echo "  -t    - tensor parallel size to run at"
    echo
}

while getopts "m:b:l:f:t:" OPT; do
    case ${OPT} in
        m )
            MODEL="$OPTARG"
            ;;
        b )
            BATCH_SIZE="$OPTARG"
            ;;
        l )
            LIMIT="$OPTARG"
            ;;
        f )
            FEWSHOT="$OPTARG"
            ;;
        t )
            TP_SIZE="$OPTARG"
            ;;
        \? )
            usage
            exit 1
            ;;
    esac
done

lm_eval --model vllm \
    --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,add_bos_token=true,trust_remote_code=true,max_model_len=4096" \
    --tasks mmlu_pro --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
    --batch_size auto

View File

@@ -19,21 +19,35 @@ RTOL = 0.08
 def launch_lm_eval(eval_config, tp_size):
     trust_remote_code = eval_config.get("trust_remote_code", False)
     max_model_len = eval_config.get("max_model_len", 4096)
+    batch_size = eval_config.get("batch_size", "auto")
+    backend = eval_config.get("backend", "vllm")
+    enforce_eager = eval_config.get("enforce_eager", "true")
+    kv_cache_dtype = eval_config.get("kv_cache_dtype", "auto")
     model_args = (
         f"pretrained={eval_config['model_name']},"
         f"tensor_parallel_size={tp_size},"
-        f"enforce_eager=true,"
+        f"enforce_eager={enforce_eager},"
+        f"kv_cache_dtype={kv_cache_dtype},"
         f"add_bos_token=true,"
         f"trust_remote_code={trust_remote_code},"
-        f"max_model_len={max_model_len}"
+        f"max_model_len={max_model_len},"
     )
     results = lm_eval.simple_evaluate(
-        model="vllm",
+        model=backend,
         model_args=model_args,
         tasks=[task["name"] for task in eval_config["tasks"]],
         num_fewshot=eval_config["num_fewshot"],
         limit=eval_config["limit"],
-        batch_size="auto",
+        # TODO(yeq): using chat template w/ fewshot_as_multiturn is supposed to help
+        # text models. however, this is regressing measured strict-match for
+        # existing text models in CI, so only apply it for mm, or when explicitly set.
+        apply_chat_template=eval_config.get(
+            "apply_chat_template", backend == "vllm-vlm"
+        ),
+        fewshot_as_multiturn=eval_config.get("fewshot_as_multiturn", False),
+        # Forward decoding and early-stop controls (e.g., max_gen_toks, until=...)
+        gen_kwargs=eval_config.get("gen_kwargs"),
+        batch_size=batch_size,
     )
     return results
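To make the new knobs concrete, here is a hypothetical eval-config sketch (ours, not part of the change; the model and values are illustrative) showing how the optional keys read by `launch_lm_eval` above would be set:

```python
# Hypothetical minimal eval config exercising the optional keys read by
# launch_lm_eval above; any key left out falls back to the defaults in the diff.
eval_config = {
    "model_name": "Qwen/Qwen2.5-VL-7B-Instruct",
    "backend": "vllm-vlm",        # passed as model= to lm_eval.simple_evaluate
    "batch_size": "auto",
    "enforce_eager": "false",
    "kv_cache_dtype": "fp8",
    "apply_chat_template": True,  # otherwise defaults to (backend == "vllm-vlm")
    "fewshot_as_multiturn": True,
    "gen_kwargs": "temperature=0,top_p=1",
    "tasks": [{"name": "chartqa"}],
    "num_fewshot": 0,
    "limit": 100,
}
results = launch_lm_eval(eval_config, tp_size=1)
```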

View File

@@ -1,184 +0,0 @@
steps:
- label: "Wait for container to be ready"
key: wait-for-container-image
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
containers:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
- label: "Cleanup H100"
agents:
queue: H100
depends_on: ~
command: docker system prune -a --volumes --force
- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
volumeMounts:
- name: devshm
mountPath: /dev/shm
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
volumes:
- name: devshm
emptyDir:
medium: Memory
- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~
- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
# Premerge benchmark
- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
volumeMounts:
- name: devshm
mountPath: /dev/shm
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
volumes:
- name: devshm
emptyDir:
medium: Memory
- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~
- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

View File

@@ -1,28 +0,0 @@
# Nightly benchmark annotation

## Description

This file contains the download links for the benchmarking results.

- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
- [benchmarking results](artifact://results.zip)
- [benchmarking code](artifact://nightly-benchmarks.zip)

Please download the visualization scripts in the post.

## Results reproduction

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
  - Download `nightly-benchmarks.zip`.
  - In the same folder, run the following code:

    ```bash
    export HF_TOKEN=<your HF token>
    apt update
    apt install -y git
    unzip nightly-benchmarks.zip
    VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
    ```

The results will be written to `./benchmarks/results`.

View File

@@ -1,39 +0,0 @@
# Nightly benchmark
This benchmark aims to:
- Provide performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy or SGLang) leads in performance for a given workload.
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker image by following the reproduction instructions.
Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
## Setup

- Docker images:
  - vLLM: `vllm/vllm-openai:v0.6.2`
  - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
  - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
  - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
    - *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
  - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
  - 8x Nvidia A100 GPUs
- Workload:
  - Dataset
    - ShareGPT dataset
    - Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
    - Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
    - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
  - Models: llama-3 8B, llama-3 70B.
    - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105))
  - Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
  - Queries are randomly sampled, and arrival patterns are determined via a Poisson process, all with a fixed random seed (see the Python sketch after this list).
- Evaluation metrics: Throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).
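To make the arrival pattern above concrete, here is a minimal sketch (ours, not the benchmark suite's actual code; the function name and defaults are illustrative) of drawing Poisson arrival times at a given QPS under a fixed seed:

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Return request send times (in seconds) for a Poisson process at `qps`.

    A Poisson process has exponentially distributed inter-arrival gaps with
    mean 1/qps; fixing the seed keeps the schedule identical across runs.
    """
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# e.g. 10 requests at 2 QPS covers roughly 5 seconds of simulated traffic
print(poisson_arrival_times(10, qps=2.0))
```

A QPS of `inf` corresponds to issuing every request immediately, i.e. zero inter-arrival gaps.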
## Known issues
- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
- TGI does not support `ignore-eos` flag.

View File

@@ -1,196 +0,0 @@
common_pod_spec: &common_pod_spec
priorityClassName: perf-benchmark
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
volumes:
- name: devshm
emptyDir:
medium: Memory
- name: hf-cache
hostPath:
path: /root/.cache/huggingface
type: Directory
common_container_settings: &common_container_settings
command:
- bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
volumeMounts:
- name: devshm
mountPath: /dev/shm
- name: hf-cache
mountPath: /root/.cache/huggingface
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
steps:
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
- label: "A100 vllm step 10"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: vllm/vllm-openai:v0.6.2
<<: *common_container_settings
- label: "A100 sglang benchmark"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: lmsysorg/sglang:v0.3.2-cu121
<<: *common_container_settings
- label: "A100 lmdeploy benchmark"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: openmmlab/lmdeploy:v0.6.1-cu12
<<: *common_container_settings
- label: "A100 trt llama-8B"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: TEST_SELECTOR
value: "llama8B"
- label: "A100 trt llama-70B"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: TEST_SELECTOR
value: "llama70B"
# FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
# - label: "A100 trt benchmark"
# priority: 100
# agents:
# queue: A100
# plugins:
# - kubernetes:
# podSpec:
# <<: *common_pod_spec
# containers:
# - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# <<: *common_container_settings
# FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
# - label: "A100 tgi benchmark"
# priority: 100
# agents:
# queue: A100
# plugins:
# - kubernetes:
# podSpec:
# <<: *common_pod_spec
# containers:
# - image: ghcr.io/huggingface/text-generation-inference:2.2.0
# <<: *common_container_settings
- wait
- label: "Collect the results"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: vllm/vllm-openai:v0.5.0.post1
command:
- bash .buildkite/nightly-benchmarks/scripts/nightly-annotate.sh
resources:
limits:
nvidia.com/gpu: 8
volumeMounts:
- name: devshm
mountPath: /dev/shm
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- block: ":rocket: check the results!"

View File

@@ -1,26 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse

from transformers import AutoTokenizer


def main(model, cachedir):
    # Load the tokenizer and save it to the specified directory
    tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.save_pretrained(cachedir)
    print(f"Tokenizer saved to {cachedir}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Download and save Hugging Face tokenizer"
    )
    parser.add_argument("--model", type=str, required=True, help="Name of the model")
    parser.add_argument(
        "--cachedir", type=str, required=True, help="Directory to save the tokenizer"
    )
    args = parser.parse_args()
    main(args.model, args.cachedir)

View File

@@ -1,97 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import json
from pathlib import Path

import numpy as np
import pandas as pd
from tabulate import tabulate


def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Parse command line arguments for summary-nightly-results script."
    )
    parser.add_argument(
        "--results-folder",
        type=str,
        required=True,
        help="The folder where the results are stored.",
    )
    parser.add_argument(
        "--description", type=str, required=True, help="Description of the results."
    )
    args = parser.parse_args()
    return args


def get_perf(df, method, model, metric):
    means = []
    for qps in [2, 4, 8, 16, "inf"]:
        target = df["Test name"].str.contains(model)
        target = target & df["Engine"].str.contains(method)
        target = target & df["Test name"].str.contains("qps_" + str(qps))
        filtered_df = df[target]
        if filtered_df.empty:
            means.append(0.0)
        else:
            means.append(filtered_df[metric].values[0])
    return np.array(means)


def get_perf_w_std(df, method, model, metric):
    if metric in ["TTFT", "ITL"]:
        mean = get_perf(df, method, model, "Mean " + metric + " (ms)")
        mean = mean.tolist()
        std = get_perf(df, method, model, "Std " + metric + " (ms)")
        if std.mean() == 0:
            std = None
        success = get_perf(df, method, model, "Successful req.")
        if std is not None:
            std = std / np.sqrt(success)
            std = std.tolist()
    else:
        assert metric == "Tput"
        mean = get_perf(df, method, model, "Input Tput (tok/s)") + get_perf(
            df, method, model, "Output Tput (tok/s)"
        )
        mean = mean.tolist()
        std = None
    return mean, std


def main(args):
    results_folder = Path(args.results_folder)
    results = []
    # collect results
    for test_file in results_folder.glob("*_nightly_results.json"):
        with open(test_file) as f:
            results = results + json.loads(f.read())
    # generate markdown table
    df = pd.DataFrame.from_dict(results)
    md_table = tabulate(df, headers="keys", tablefmt="pipe", showindex=False)
    with open(args.description) as f:
        description = f.read()
    description = description.format(nightly_results_benchmarking_table=md_table)
    with open("nightly_results.md", "w") as f:
        f.write(description)


if __name__ == "__main__":
    args = parse_arguments()
    main(args)

View File

@@ -1,9 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient("http://localhost:8000")
model_name = api_client.available_models[0]
print(model_name)

View File

@@ -1,78 +0,0 @@
#!/bin/bash
set -ex
set -o pipefail
main() {
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get update && apt-get -y install jq)
(which zip) || (apt-get install -y zip)
if [ ! -f /workspace/buildkite-agent ]; then
echo "buildkite-agent binary not found. Skip plotting the results."
exit 0
fi
# initial annotation
#description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"
# download results
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
mkdir -p results/
/workspace/buildkite-agent artifact download 'results/*nightly_results.json' results/
ls
ls results/
# upload benchmark results
zip -r results.zip results/
/workspace/buildkite-agent artifact upload "results.zip"
# upload benchmarking scripts
cd "$VLLM_SOURCE_CODE_LOC/"
zip -r nightly-benchmarks.zip .buildkite/ benchmarks/
/workspace/buildkite-agent artifact upload "nightly-benchmarks.zip"
cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
# upload benchmarking pipeline
/workspace/buildkite-agent artifact upload "nightly-pipeline.yaml"
cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
/workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md
# The figures should be generated by a separate process outside the CI/CD pipeline
# # generate figures
# python3 -m pip install tabulate pandas matplotlib
# python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/generate-nightly-markdown.py \
# --description $description \
# --results-folder results/
# python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
# --description $description \
# --results-folder results/ \
# --dataset sharegpt
# python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
# --description $description \
# --results-folder results/ \
# --dataset sonnet_2048_128
# python3 $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/plot-nightly-results.py \
# --description $description \
# --results-folder results/ \
# --dataset sonnet_128_2048
# # upload results and figures
# /workspace/buildkite-agent artifact upload "nightly_results*.png"
# /workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-pipeline.yaml
# /workspace/buildkite-agent artifact upload $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/tests/nightly-tests.json
# /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly_results.md
}
main "$@"

View File

@@ -1,464 +0,0 @@
#!/bin/bash
set -o pipefail
set -x
check_gpus() {
# check the number of GPUs and GPU type.
declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
if [[ $gpu_count -gt 0 ]]; then
echo "GPU found."
else
echo "Need at least 1 GPU to run benchmarking."
exit 1
fi
declare -g gpu_type="$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')"
echo "GPU type is $gpu_type"
}
check_hf_token() {
# check if HF_TOKEN is available and valid
if [[ -z "$HF_TOKEN" ]]; then
echo "Error: HF_TOKEN is not set."
exit 1
elif [[ ! "$HF_TOKEN" =~ ^hf_ ]]; then
echo "Error: HF_TOKEN does not start with 'hf_'."
exit 1
else
echo "HF_TOKEN is set and valid."
fi
}
upload_to_buildkite() {
# upload the benchmarking results to buildkite
# if the agent binary is not found, skip uploading the results, exit 0
if [ ! -f /workspace/buildkite-agent ]; then
echo "buildkite-agent binary not found. Skip uploading the results."
return 0
fi
# /workspace/buildkite-agent annotate --style "success" --context "benchmark-results" --append < $RESULTS_FOLDER/${CURRENT_LLM_SERVING_ENGINE}_nightly_results.md
/workspace/buildkite-agent artifact upload "$RESULTS_FOLDER/*"
}
get_current_llm_serving_engine() {
if which lmdeploy >/dev/null; then
echo "Container: lmdeploy"
export CURRENT_LLM_SERVING_ENGINE=lmdeploy
return
fi
if [ -e /tgi-entrypoint.sh ]; then
echo "Container: tgi"
export CURRENT_LLM_SERVING_ENGINE=tgi
return
fi
if which trtllm-build >/dev/null; then
echo "Container: tensorrt-llm"
export CURRENT_LLM_SERVING_ENGINE=trt
return
fi
if [ -e /sgl-workspace ]; then
echo "Container: sglang"
export CURRENT_LLM_SERVING_ENGINE=sglang
return
fi
if [ -e /vllm-workspace ]; then
echo "Container: vllm"
# move to a completely irrelevant directory, to avoid import vllm from current folder
export CURRENT_LLM_SERVING_ENGINE=vllm
return
fi
}
json2args() {
# transforms the JSON string to command line args, and '_' is replaced to '-'
# example:
# input: { "model": "meta-llama/Llama-2-7b-chat-hf", "tensor_parallel_size": 1 }
# output: --model meta-llama/Llama-2-7b-chat-hf --tensor-parallel-size 1
local json_string=$1
local args=$(
echo "$json_string" | jq -r '
to_entries |
map("--" + (.key | gsub("_"; "-")) + " " + (.value | tostring)) |
join(" ")
'
)
echo "$args"
}
kill_gpu_processes() {
pkill -f '[p]ython'
pkill -f '[p]ython3'
pkill -f '[t]ritonserver'
pkill -f '[p]t_main_thread'
pkill -f '[t]ext-generation'
pkill -f '[l]mdeploy'
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pkill -f '[V]LLM'
while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
sleep 1
done
}
wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
timeout 1200 bash -c '
until curl -s localhost:8000/v1/completions > /dev/null; do
sleep 1
done' && return 0 || return 1
}
ensure_installed() {
# Ensure that the given command is installed by apt-get
local cmd=$1
if ! which "$cmd" >/dev/null; then
apt-get update && apt-get install -y "$cmd"
fi
}
run_serving_tests() {
# run serving tests using `vllm bench serve` command
# $1: a json file specifying serving test cases
local serving_test_file
serving_test_file=$1
# Iterate over serving tests
jq -c '.[]' "$serving_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
# get common parameters
common_params=$(echo "$params" | jq -r '.common_parameters')
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
reuse_server=$(echo "$common_params" | jq -r '.reuse_server')
# get client and server arguments
server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
client_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_client_parameters")
client_args=$(json2args "$client_params")
qps_list=$(echo "$params" | jq -r '.qps_list')
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
echo "Running over qps list $qps_list"
# check if there is enough GPU to run the test
if [[ $gpu_count -lt $tp ]]; then
echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue
fi
if [[ $reuse_server == "true" ]]; then
echo "Reuse previous server for test case $test_name"
else
kill_gpu_processes
bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
"$server_params" "$common_params"
fi
if wait_for_server; then
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
else
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
break
fi
# prepare tokenizer
# this is required for lmdeploy.
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
rm -rf /tokenizer_cache
mkdir /tokenizer_cache
python3 ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
--model "$model" \
--cachedir /tokenizer_cache
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
# change model name for lmdeploy (it will not follow standard hf name)
if [[ "$CURRENT_LLM_SERVING_ENGINE" == "lmdeploy" ]]; then
model=$(python ../.buildkite/nightly-benchmarks/scripts/get-lmdeploy-modelname.py)
fi
# iterate over different QPS
for qps in $qps_list; do
# remove the surrounding single quote from qps
if [[ "$qps" == *"inf"* ]]; then
echo "qps was $qps"
qps="inf"
echo "now qps is $qps"
fi
new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE
if [[ $backend = "trt" ]]; then
backend="tensorrt-llm"
fi
if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
if [[ "$dataset_name" = "sharegpt" ]]; then
client_command="vllm bench serve \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--num-prompts $num_prompts \
--port $port \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
--request-rate $qps \
--ignore-eos \
$client_args"
elif [[ "$dataset_name" = "sonnet" ]]; then
sonnet_input_len=$(echo "$common_params" | jq -r '.sonnet_input_len')
sonnet_output_len=$(echo "$common_params" | jq -r '.sonnet_output_len')
sonnet_prefix_len=$(echo "$common_params" | jq -r '.sonnet_prefix_len')
client_command="vllm bench serve \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--num-prompts $num_prompts \
--sonnet-input-len $sonnet_input_len \
--sonnet-output-len $sonnet_output_len \
--sonnet-prefix-len $sonnet_prefix_len \
--port $port \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
--request-rate $qps \
--ignore-eos \
$client_args"
else
echo "The dataset name must be either 'sharegpt' or 'sonnet'. Got $dataset_name."
exit 1
fi
echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"
eval "$client_command"
server_command="None"
# record the benchmarking commands
jq_output=$(jq -n \
--arg server "$server_command" \
--arg client "$client_command" \
--arg gpu "$gpu_type" \
--arg engine "$CURRENT_LLM_SERVING_ENGINE" \
'{
server_command: $server,
client_command: $client,
gpu_type: $gpu,
engine: $engine
}')
echo "$jq_output" >"$RESULTS_FOLDER/${new_test_name}.commands"
done
done
kill_gpu_processes
}
run_genai_perf_tests() {
# run genai-perf tests
# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
genai_perf_test_file=$1
# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
# get common parameters
common_params=$(echo "$params" | jq -r '.common_parameters')
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
reuse_server=$(echo "$common_params" | jq -r '.reuse_server')
# get client and server arguments
server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
qps_list=$(echo "$params" | jq -r '.qps_list')
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
echo "Running over qps list $qps_list"
# check if there is enough GPU to run the test
if [[ $gpu_count -lt $tp ]]; then
echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue
fi
if [[ $reuse_server == "true" ]]; then
echo "Reuse previous server for test case $test_name"
else
kill_gpu_processes
bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
"$server_params" "$common_params"
fi
if wait_for_server; then
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
else
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
break
fi
# iterate over different QPS
for qps in $qps_list; do
# remove the surrounding single quote from qps
if [[ "$qps" == *"inf"* ]]; then
echo "qps was $qps"
qps=$num_prompts
echo "now qps is $qps"
fi
new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE
if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
#TODO: add output dir.
client_command="genai-perf profile \
-m $model \
--service-kind openai \
--backend "$backend" \
--endpoint-type chat \
--streaming \
--url localhost:$port \
--request-rate $qps \
--num-prompts $num_prompts \
"
echo "Client command: $client_command"
eval "$client_command"
#TODO: process/record outputs
done
done
kill_gpu_processes
}
prepare_dataset() {
# download sharegpt dataset
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# duplicate sonnet by 4x, to allow benchmarking with input length 2048
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
echo "" > sonnet_4x.txt
for _ in {1..4}
do
cat sonnet.txt >> sonnet_4x.txt
done
}
main() {
# check if the environment variable is successfully injected from yaml
check_gpus
check_hf_token
get_current_llm_serving_engine
pip install -U transformers
pip install -r requirements/dev.txt
which genai-perf
# check storage
df -h
ensure_installed wget
ensure_installed curl
ensure_installed jq
# genai-perf dependency
ensure_installed libb64-0d
prepare_dataset
cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
declare -g RESULTS_FOLDER=results/
mkdir -p $RESULTS_FOLDER
BENCHMARK_ROOT="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
# run the test
run_serving_tests "$BENCHMARK_ROOT/tests/nightly-tests.json"
# run genai-perf tests
run_genai_perf_tests "$BENCHMARK_ROOT/tests/genai-perf-tests.json"
mv artifacts/ $RESULTS_FOLDER/
# upload benchmark results to buildkite
python3 -m pip install tabulate pandas
python3 "$BENCHMARK_ROOT/scripts/summary-nightly-results.py"
upload_to_buildkite
}
main "$@"

View File

@@ -1,82 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import datetime
import json
import os
from pathlib import Path

import pandas as pd
from tabulate import tabulate

results_folder = Path("results/")

# serving results and the keys that will be printed into markdown
serving_results = []
serving_column_mapping = {
    "test_name": "Test name",
    "gpu_type": "GPU",
    "completed": "Successful req.",
    "request_throughput": "Tput (req/s)",
    "mean_ttft_ms": "Mean TTFT (ms)",
    "std_ttft_ms": "Std TTFT (ms)",
    "median_ttft_ms": "Median TTFT (ms)",
    "mean_itl_ms": "Mean ITL (ms)",
    "std_itl_ms": "Std ITL (ms)",
    "median_itl_ms": "Median ITL (ms)",
    "mean_tpot_ms": "Mean TPOT (ms)",
    "std_tpot_ms": "Std TPOT (ms)",
    "median_tpot_ms": "Median TPOT (ms)",
    "total_token_throughput": "Total Token Tput (tok/s)",
    "output_throughput": "Output Tput (tok/s)",
    "total_input_tokens": "Total input tokens",
    "total_output_tokens": "Total output tokens",
    "engine": "Engine",
}

if __name__ == "__main__":
    # collect results
    for test_file in results_folder.glob("*.json"):
        with open(test_file) as f:
            raw_result = json.loads(f.read())
        # attach the benchmarking command to raw_result
        with open(test_file.with_suffix(".commands")) as f:
            command = json.loads(f.read())
        raw_result.update(command)
        # update the test name of this result
        raw_result.update({"test_name": test_file.stem})
        # add the result to raw_result
        serving_results.append(raw_result)
        continue

    serving_results = pd.DataFrame.from_dict(serving_results)

    if not serving_results.empty:
        serving_results = serving_results[list(serving_column_mapping.keys())].rename(
            columns=serving_column_mapping
        )

    serving_md_table_with_headers = tabulate(
        serving_results, headers="keys", tablefmt="pipe", showindex=False
    )
    # remove the first line of header
    serving_md_table_lines = serving_md_table_with_headers.split("\n")
    serving_md_table_without_header = "\n".join(serving_md_table_lines[2:])

    prefix = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    prefix = prefix + "_" + os.environ.get("CURRENT_LLM_SERVING_ENGINE")

    # document benchmarking results in markdown
    with open(results_folder / f"{prefix}_nightly_results.md", "w") as f:
        # document results with header.
        # for those who want to reproduce our benchmark.
        f.write(serving_md_table_with_headers)
        f.write("\n")

    # document benchmarking results in json
    with open(results_folder / f"{prefix}_nightly_results.json", "w") as f:
        results = serving_results.to_dict(orient="records")
        f.write(json.dumps(results))

View File

@@ -1,23 +0,0 @@
#!/bin/bash
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)

if [[ "$BUILDKITE_BRANCH" == "main" ]]; then
    URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"
else
    URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
fi

TIMEOUT_SECONDS=10

retries=0
while [ $retries -lt 1000 ]; do
    if [ "$(curl -s --max-time "$TIMEOUT_SECONDS" -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" "$URL")" -eq 200 ]; then
        exit 0
    fi

    echo "Waiting for image to be available..."

    retries=$((retries + 1))
    sleep 5
done

exit 1

View File

@@ -1,30 +0,0 @@
[
    {
        "test_name": "latency_llama8B_tp1",
        "environment_variables": {
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    },
    {
        "test_name": "latency_llama8B_tp4",
        "environment_variables": {
            "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
            "VLLM_CPU_KVCACHE_SPACE": 40
        },
        "parameters": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "tensor_parallel_size": 4,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    }
]

View File

@@ -1,32 +0,0 @@
[
{
"test_name": "throughput_llama8B_tp1",
"environment_variables": {
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
},
{
"test_name": "throughput_llama8B_tp4",
"environment_variables": {
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 4,
"load_format": "dummy",
"dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
}
]

View File

@@ -2,40 +2,23 @@
## Introduction
This directory contains two sets of benchmarks for vllm.
- Performance benchmark: benchmark vllm's performance under various workloads, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
- Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.
See [vLLM performance dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
This directory contains a benchmarking suite for **developers** to run locally and gain clarity on whether their PR improves/degrades vllm's performance.
vLLM also maintains a continuous performance benchmark under [perf.vllm.ai](https://perf.vllm.ai/), hosted under PyTorch CI HUD.
## Performance benchmark quick overview
**Benchmarking Coverage**: latency, throughput and fixed-QPS serving on A100 (the support for FP8 benchmark on H100 is coming!) and Intel® Xeon® Processors, with different models.
**Benchmarking Coverage**: latency, throughput and fixed-QPS serving on B200, A100, H100, Intel® Xeon® Processors and Intel® Gaudi® 3 Accelerators, with different models.
**Benchmarking Duration**: about 1hr.
**For benchmarking developers**: please try your best to constrain the duration of benchmarking to about 1 hr so that it won't take forever to run.
## Nightly benchmark quick overview
**Benchmarking Coverage**: fixed-QPS serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
**Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
**Benchmarking Duration**: about 3.5hrs.
## Trigger the benchmark
Performance benchmark will be triggered when:
- A PR is merged into vllm.
- A commit is pushed to a PR that has both the `perf-benchmarks` and `ready` labels.
Manually Trigger the benchmark
The benchmark needs to be triggered manually:
```bash
bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
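# e.g. to benchmark an already-running remote vLLM server instead of a local one
# (illustrative values; REMOTE_HOST/REMOTE_PORT are described under the runtime
# environment variables below):
# REMOTE_HOST=10.0.0.5 REMOTE_PORT=8000 bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh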
```
Runtime environment variables:
@@ -47,14 +30,11 @@ Runtime environment variables:
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
Nightly benchmark will be triggered when:
- A commit is pushed to a PR that has both the `perf-benchmarks` and `nightly-benchmarks` labels.
## Performance benchmark details
See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
> For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
### Latency test
@@ -152,26 +132,3 @@ Here is an example using the script to compare result_a and result_b with Model,
A comparison diagram will be generated below the table.
Here is an example comparing 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3:
<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />
## Nightly test details
See [nightly-descriptions.md](nightly-descriptions.md) for a detailed description of the test workload, models and docker containers used to benchmark other llm engines.
### Workflow
- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
- Inside each container, we run [scripts/run-nightly-benchmarks.sh](scripts/run-nightly-benchmarks.sh), which will probe the serving engine of the current container.
- `scripts/run-nightly-benchmarks.sh` parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and launches the right benchmark for the specified serving engine via `scripts/launch-server.sh`.
- Finally, we run [scripts/summary-nightly-results.py](scripts/summary-nightly-results.py) to collect and plot the final benchmarking results, and upload them to Buildkite.
### Nightly tests
In [nightly-tests.json](tests/nightly-tests.json), we include the command-line arguments for the benchmarking commands, together with the benchmarking test cases. The format is very similar to that of the performance benchmark.
### Docker containers
The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `scripts/run-nightly-benchmarks.sh` and `scripts/launch-server.sh`.
WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).

View File

@@ -5,7 +5,7 @@
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- GPU/HPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- CPU Models: llama-3.1 8B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
@@ -16,7 +16,7 @@
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm to achieve maximum throughput.
- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- GPU/HPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- CPU Models: llama-3.1 8B.
- Evaluation metrics: throughput.
@@ -28,7 +28,7 @@
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vllm and the arrival pattern of the requests.
- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed); a sketch of this generation is shown after this list.
- GPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- GPU/HPU Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- We also added a speculative decoding test for llama-3 70B on GPU, under QPS 2
- CPU Models: llama-3.1 8B.
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
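For illustration, here is a minimal sketch of how fixed-seed Poisson arrival times can be generated. The benchmark client's actual implementation is not reproduced here; the use of `numpy` and the function/parameter names below are assumptions:

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Cumulative arrival timestamps (in seconds) for a Poisson process at `qps`."""
    # Inter-arrival gaps of a Poisson process are exponentially distributed
    # with mean 1/qps; fixing the seed keeps benchmark runs reproducible.
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps).tolist()

# e.g. arrival timestamps for 200 ShareGPT prompts at QPS = 4
arrivals = poisson_arrival_times(200, qps=4.0, seed=42)
```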

View File

@@ -7,6 +7,7 @@ from importlib import util
import pandas as pd
pd.options.display.float_format = "{:.2f}".format
plotly_found = util.find_spec("plotly.express") is not None
@@ -109,7 +110,10 @@ def compare_data_columns(
if len(compare_frames) >= 2:
base = compare_frames[0]
current = compare_frames[-1]
ratio = current / base
if "P99" in data_column or "Median" in data_column:
ratio = base / current # for latency
else:
ratio = current / base
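# Latency-style columns (Median/P99 TTFT/TPOT) improve when they decrease, so the
# ratio is inverted for them; a ratio > 1 then uniformly reads as "current run is better".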
ratio = ratio.mask(base == 0) # avoid inf when baseline is 0
ratio.name = f"Ratio 1 vs {len(compare_frames)}"
frames.append(ratio)
@@ -199,6 +203,71 @@ def split_json_by_tp_pp(
return saved_paths
def _add_limit_line(fig, y_value, label):
# Visible dashed line + annotation
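# (relies on the script-level "plot" flag and the "plotly_found" probe defined near the top)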
fig.add_hline(
y=y_value,
line_dash="dash",
line_color="red" if "ttft" in label.lower() else "blue",
annotation_text=f"{label}: {y_value} ms",
annotation_position="top left",
)
# Optional: add a legend item (as a transparent helper trace)
if plot and plotly_found:
import plotly.graph_objects as go
fig.add_trace(
go.Scatter(
x=[None],
y=[None],
mode="lines",
line=dict(
dash="dash", color="red" if "ttft" in label.lower() else "blue"
),
name=f"{label}",
)
)
def _find_concurrency_col(df: pd.DataFrame) -> str:
for c in [
"# of max concurrency.",
"# of max concurrency",
"Max Concurrency",
"max_concurrency",
"Concurrency",
]:
if c in df.columns:
return c
# Fallback: guess an integer-like column (harmless if unused)
for c in df.columns:
if df[c].dtype.kind in "iu" and df[c].nunique() > 1 and df[c].min() >= 1:
return c
return "# of max concurrency."
def _highlight_threshold(
df: pd.DataFrame, threshold: float
) -> "pd.io.formats.style.Styler":
"""Highlight numeric per-configuration columns with value <= threshold."""
conc_col = _find_concurrency_col(df)
key_cols = [
c
for c in ["Model", "Dataset Name", "Input Len", "Output Len", conc_col]
if c in df.columns
]
conf_cols = [
c for c in df.columns if c not in key_cols and not str(c).startswith("Ratio")
]
conf_cols = [c for c in conf_cols if pd.api.types.is_numeric_dtype(df[c])]
return df.style.map(
lambda v: "background-color:#e6ffe6;font-weight:bold;"
if pd.notna(v) and v <= threshold
else "",
subset=conf_cols,
)
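# Illustrative usage, mirroring the __main__ block below (the 3000 ms budget is
# just an example value):
#   styler = _highlight_threshold(group, 3000.0)
#   html = styler.to_html(table_attributes='border="1" class="dataframe"')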
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
@@ -220,6 +289,26 @@ if __name__ == "__main__":
default="# of max concurrency.",
help="column name to use as X Axis in comparison graph",
)
parser.add_argument(
"-l",
"--latency",
type=str,
default="p99",
help="take median|p99 for latency like TTFT/TPOT",
)
parser.add_argument(
"--ttft-max-ms",
type=float,
default=3000.0,
help="Reference limit for TTFT plots (ms)",
)
parser.add_argument(
"--tpot-max-ms",
type=float,
default=100.0,
help="Reference limit for TPOT plots (ms)",
)
args = parser.parse_args()
drop_column = "P99"
@@ -234,12 +323,22 @@ if __name__ == "__main__":
"# of max concurrency.",
"qps",
]
data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
html_msgs_for_data_cols = [
"Compare Output Tokens /n",
"Median TTFT /n",
"Median TPOT /n",
]
if "median" in args.latency:
data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
html_msgs_for_data_cols = [
"Compare Output Tokens /n",
"Median TTFT /n",
"Median TPOT /n",
]
drop_column = "P99"
elif "p99" in args.latency:
data_cols_to_compare = ["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"]
html_msgs_for_data_cols = [
"Compare Output Tokens /n",
"P99 TTFT /n",
"P99 TPOT /n",
]
if len(args.file) == 1:
files = split_json_by_tp_pp(args.file[0], output_root="splits")
@@ -275,33 +374,83 @@ if __name__ == "__main__":
f"Expected subset: {filtered_info_cols}, "
f"but DataFrame has: {list(output_df.columns)}"
)
output_df_sorted = output_df.sort_values(by=existing_group_cols)
# output_df_sorted = output_df.sort_values(by=existing_group_cols)
output_df_sorted = output_df.sort_values(by=args.xaxis)
output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
for name, group in output_groups:
html = group.to_html()
group_name = (
",".join(map(str, name)).replace(",", "_").replace("/", "-")
)
group_html_name = "perf_comparison_" + group_name + ".html"
metric_name = str(data_cols_to_compare[i]).lower()
if "tok/s" in metric_name:
html = group.to_html()
elif "ttft" in metric_name:
styler = _highlight_threshold(group, args.ttft_max_ms).format(
{c: "{:.2f}" for c in group.select_dtypes("number").columns},
na_rep="",
)
html = styler.to_html(
table_attributes='border="1" class="dataframe"'
)
elif (
"tpot" in metric_name
or "median" in metric_name
or "p99" in metric_name
):
styler = _highlight_threshold(group, args.tpot_max_ms).format(
{c: "{:.2f}" for c in group.select_dtypes("number").columns},
na_rep="",
)
html = styler.to_html(
table_attributes='border="1" class="dataframe"'
)
text_file.write(html_msgs_for_data_cols[i])
text_file.write(html)
with open(group_html_name, "a+") as sub_text_file:
sub_text_file.write(html_msgs_for_data_cols[i])
sub_text_file.write(html)
if plot and plotly_found:
import plotly.express as px
df = group[raw_data_cols]
df_sorted = df.sort_values(by=info_cols[y_axis_index])
# Melt DataFrame for plotting
df_melted = df_sorted.melt(
id_vars=info_cols[y_axis_index],
var_name="Configuration",
value_name=data_cols_to_compare[i],
)
title = data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
# Create Plotly line chart
fig = px.line(
df_melted,
x=info_cols[y_axis_index],
y=data_cols_to_compare[i],
color="Configuration",
title=title,
markers=True,
)
# Export to HTML
text_file.write(fig.to_html(full_html=True, include_plotlyjs="cdn"))
df = group[raw_data_cols]
df_sorted = df.sort_values(by=info_cols[y_axis_index])
# Melt DataFrame for plotting
df_melted = df_sorted.melt(
id_vars=info_cols[y_axis_index],
var_name="Configuration",
value_name=data_cols_to_compare[i],
)
title = (
data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
)
# Create Plotly line chart
fig = px.line(
df_melted,
x=info_cols[y_axis_index],
y=data_cols_to_compare[i],
color="Configuration",
title=title,
markers=True,
)
# ---- Add threshold lines based on metric name ----
if "ttft" in metric_name:
_add_limit_line(fig, args.ttft_max_ms, "TTFT limit")
elif (
"tpot" in metric_name
or "median" in metric_name
or "p99" in metric_name
):
_add_limit_line(fig, args.tpot_max_ms, "TPOT limit")
# Export to HTML
text_file.write(
fig.to_html(full_html=True, include_plotlyjs="cdn")
)
sub_text_file.write(
fig.to_html(full_html=True, include_plotlyjs="cdn")
)

View File

@@ -63,9 +63,11 @@ serving_column_mapping = {
"mean_ttft_ms": "Mean TTFT (ms)",
"median_ttft_ms": "Median TTFT (ms)",
"p99_ttft_ms": "P99 TTFT (ms)",
"std_ttft_ms": "STD TTFT (ms)",
"mean_tpot_ms": "Mean TPOT (ms)",
"median_tpot_ms": "Median",
"p99_tpot_ms": "P99",
"std_tpot_ms": "STD TPOT (ms)",
"mean_itl_ms": "Mean ITL (ms)",
"median_itl_ms": "Median ITL (ms)",
"p99_itl_ms": "P99 ITL (ms)",
@@ -368,7 +370,7 @@ if __name__ == "__main__":
# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.splitlines())}x{x.splitlines()[0]}"
lambda x: "{}x{}".format(len(x.split("\n")), x.split("\n")[0])
)
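# e.g. "NVIDIA A100-SXM4-80GB\nNVIDIA A100-SXM4-80GB\n..." -> "8xNVIDIA A100-SXM4-80GB"
# (the GPU name here is only an illustrative value)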
# get markdown tables
@@ -390,7 +392,7 @@ if __name__ == "__main__":
json_file = "benchmark_results.json"
with open(results_folder / md_file, "w") as f:
results = read_markdown(
"../.buildkite/nightly-benchmarks/"
"../.buildkite/performance-benchmarks/"
+ "performance-benchmarks-descriptions.md"
)
results = results.format(

View File

@@ -15,6 +15,8 @@ check_gpus() {
declare -g gpu_count=$(nvidia-smi --list-gpus | wc -l)
elif command -v amd-smi; then
declare -g gpu_count=$(amd-smi list | grep 'GPU' | wc -l)
elif command -v hl-smi; then
declare -g gpu_count=$(hl-smi --list | grep -i "Module ID" | wc -l)
fi
if [[ $gpu_count -gt 0 ]]; then
@@ -23,10 +25,16 @@ check_gpus() {
echo "Need at least 1 GPU to run benchmarking."
exit 1
fi
declare -g arch_suffix=''
if command -v nvidia-smi; then
declare -g gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
elif command -v amd-smi; then
declare -g gpu_type=$(amd-smi static -g 0 -a | grep 'MARKET_NAME' | awk '{print $2}')
elif command -v hl-smi; then
declare -g gpu_type=$(hl-smi -q | grep "Product Name" | head -n 1 | awk -F ':' '{print $2}' | sed 's/^ *//')
arch_suffix='-hpu'
fi
echo "GPU type is $gpu_type"
}
@@ -138,6 +146,10 @@ kill_gpu_processes() {
while [ "$(amd-smi metric -g 0 | grep 'USED_VRAM' | awk '{print $2}')" -ge 1000 ]; do
sleep 1
done
elif command -v hl-smi; then
while [ "$(hl-smi -q | grep "Used" | head -n 1 | awk '{print $3}')" -ge 1000 ]; do
sleep 1
done
fi
# remove vllm config file
@@ -451,6 +463,7 @@ main() {
ARCH='-cpu'
else
check_gpus
ARCH="$arch_suffix"
fi
check_hf_token
@@ -469,7 +482,12 @@ main() {
ensure_sharegpt_downloaded
declare -g RESULTS_FOLDER=results/
mkdir -p $RESULTS_FOLDER
QUICK_BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
QUICK_BENCHMARK_ROOT=../.buildkite/performance-benchmarks/
# dump vllm info via vllm collect-env
env_output=$(vllm collect-env)
echo "$env_output" >"$RESULTS_FOLDER/vllm_env.txt"
# benchmarking
run_serving_tests $QUICK_BENCHMARK_ROOT/tests/"${SERVING_JSON:-serving-tests$ARCH.json}"

View File

@@ -0,0 +1,26 @@
[
{
"test_name": "latency_llama8B_tp2",
"environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 2,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"num_iters_warmup": 5,
"num_iters": 15
}
}
]

View File

@@ -0,0 +1,55 @@
[
{
"test_name": "latency_llama8B_tp1",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"num-iters-warmup": 5,
"num-iters": 15,
"max-model-len": 256,
"async-scheduling": ""
}
},
{
"test_name": "latency_llama70B_tp4",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"tensor_parallel_size": 4,
"load_format": "dummy",
"num-iters-warmup": 5,
"num-iters": 15,
"max-model-len": 256,
"async-scheduling": ""
}
},
{
"test_name": "latency_mixtral8x7B_tp2",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"tensor_parallel_size": 2,
"load_format": "dummy",
"num-iters-warmup": 5,
"num-iters": 15,
"max-model-len": 256,
"async-scheduling": ""
}
}
]

View File

@@ -95,6 +95,38 @@
"num_prompts": 200
}
},
{
"test_name": "serving_llama8B_bf16_tp4_sharegpt",
"qps_list": ["inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 4,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_llama8B_bf16_tp2pp3_sharegpt",
"qps_list": ["inf"],
@@ -233,6 +265,41 @@
"num_prompts": 1000
}
},
{
"test_name": "serving_llama8B_bf16_tp4_random_128_128",
"qps_list": ["inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 4,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"enable_chunked_prefill": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128,
"ignore-eos": "",
"num_prompts": 1000
}
},
{
"test_name": "serving_llama8B_bf16_tp2pp3_random_128_128",
"qps_list": ["inf"],
@@ -365,6 +432,38 @@
"num_prompts": 200
}
},
{
"test_name": "serving_llama8B_int8_tp4_sharegpt",
"qps_list": ["inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
"tensor_parallel_size": 4,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_llama8B_int8_tp2pp3_sharegpt",
"qps_list": ["inf"],
@@ -503,6 +602,41 @@
"num_prompts": 1000
}
},
{
"test_name": "serving_llama8B_int8_tp4_random_128_128",
"qps_list": ["inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
"tensor_parallel_size": 4,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"enable_chunked_prefill": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128,
"ignore-eos": "",
"num_prompts": 1000
}
},
{
"test_name": "serving_llama8B_int8_tp2pp3_random_128_128",
"qps_list": ["inf"],
@@ -638,6 +772,39 @@
"num_prompts": 200
}
},
{
"test_name": "serving_llama8B_int4_tp4_sharegpt",
"qps_list": ["inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"quantization": "awq",
"tensor_parallel_size": 4,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_llama8B_int4_tp2pp3_sharegpt",
"qps_list": ["inf"],
@@ -780,6 +947,42 @@
"num_prompts": 1000
}
},
{
"test_name": "serving_llama8B_int4_tp4_random_128_128",
"qps_list": ["inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"quantization": "awq",
"tensor_parallel_size": 4,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"enable_chunked_prefill": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128,
"ignore-eos": "",
"num_prompts": 1000
}
},
{
"test_name": "serving_llama8B_int4_tp2pp3_random_128_128",
"qps_list": ["inf"],

View File

@@ -2,7 +2,7 @@
{
"test_name": "serving_llama8B_tp1_sharegpt",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
"max_concurrency_list": [32],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -28,13 +28,13 @@
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
"num_prompts": 32
}
},
{
"test_name": "serving_llama8B_tp2_sharegpt",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
"max_concurrency_list": [32],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -60,13 +60,13 @@
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
"num_prompts": 32
}
},
{
"test_name": "serving_llama8B_tp4_sharegpt",
"test_name": "serving_llama8B_tp1_random_128_128",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
"max_concurrency_list": [32],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -76,39 +76,7 @@
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 4,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_llama8B_tp4_random_1024_128",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 4,
"tensor_parallel_size": 1,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
@@ -124,16 +92,16 @@
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 1024,
"random-input-len": 128,
"random-output-len": 128,
"ignore-eos": "",
"num_prompts": 100
"num_prompts": 32
}
},
{
"test_name": "serving_llama8B_pp6_random_1024_128",
"test_name": "serving_llama8B_tp2_random_128_128",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
"max_concurrency_list": [32],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -143,7 +111,7 @@
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"pipeline_parallel_size": 6,
"tensor_parallel_size": 2,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
@@ -159,10 +127,150 @@
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 1024,
"random-input-len": 128,
"random-output-len": 128,
"ignore-eos": "",
"num_prompts": 100
"num_prompts": 32
}
},
{
"test_name": "serving_llama8B_tp1_random_128_2048",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [32],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"enable_chunked_prefill": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 2048,
"ignore-eos": "",
"num_prompts": 32
}
},
{
"test_name": "serving_llama8B_tp2_random_128_2048",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [32],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 2,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"enable_chunked_prefill": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 2048,
"ignore-eos": "",
"num_prompts": 32
}
},
{
"test_name": "serving_llama8B_tp1_random_2048_128",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [32],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"enable_chunked_prefill": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 2048,
"random-output-len": 128,
"ignore-eos": "",
"num_prompts": 32
}
},
{
"test_name": "serving_llama8B_tp2_random_2048_128",
"qps_list": [1, 4, 16, "inf"],
"max_concurrency_list": [32],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 2,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"enable_chunked_prefill": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "random",
"random-input-len": 2048,
"random-output-len": 128,
"ignore-eos": "",
"num_prompts": 32
}
}
]

View File

@@ -0,0 +1,82 @@
[
{
"test_name": "serving_llama8B_tp1_sharegpt",
"qps_list": [1, 4, 16, "inf"],
"server_environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"server_parameters": {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"load_format": "dummy",
"max-model-len": 2048,
"max-num-seqs": 256,
"async-scheduling": ""
},
"client_parameters": {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_llama70B_tp4_sharegpt",
"qps_list": [1, 4, 16, "inf"],
"server_environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"server_parameters": {
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"tensor_parallel_size": 4,
"swap_space": 16,
"disable_log_stats": "",
"load_format": "dummy",
"max-model-len": 2048,
"max-num-seqs": 256,
"async-scheduling": ""
},
"client_parameters": {
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_mixtral8x7B_tp2_sharegpt",
"qps_list": [1, 4, 16, "inf"],
"server_environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"server_parameters": {
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"tensor_parallel_size": 2,
"swap_space": 16,
"disable_log_stats": "",
"load_format": "dummy",
"max-model-len": 2048,
"max-num-seqs": 256,
"async-scheduling": ""
},
"client_parameters": {
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
}
]

View File

@@ -0,0 +1,27 @@
[
{
"test_name": "throughput_llama8B_tp2",
"environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 2,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
}
]

View File

@@ -0,0 +1,61 @@
[
{
"test_name": "throughput_llama8B_tp1",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 1000,
"backend": "vllm",
"max-model-len": 2048,
"max-num-seqs": 512,
"async-scheduling": ""
}
},
{
"test_name": "throughput_llama70B_tp4",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"tensor_parallel_size": 4,
"load_format": "dummy",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 1000,
"backend": "vllm",
"max-model-len": 2048,
"max-num-seqs": 512,
"async-scheduling": ""
}
},
{
"test_name": "throughput_mixtral8x7B_tp2",
"environment_variables": {
"PT_HPU_LAZY_MODE": 1,
"PT_HPU_ENABLE_LAZY_COLLECTIVES": 1,
"VLLM_CONTIGUOUS_PA": 1,
"VLLM_DEFRAG": 1
},
"parameters": {
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"tensor_parallel_size": 2,
"load_format": "dummy",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 1000,
"backend": "vllm",
"max-model-len": 2048,
"max-num-seqs": 512,
"async-scheduling": ""
}
}
]

View File

@@ -1,5 +1,5 @@
steps:
# aarch64 + CUDA builds. PyTorch 2.8 aarch64 + CUDA wheel is only available on CUDA 12.9
# aarch64 + CUDA builds
- label: "Build arm64 wheel - CUDA 12.9"
depends_on: ~
id: build-wheel-arm64-cuda-12-9
@@ -15,6 +15,21 @@ steps:
env:
DOCKER_BUILDKIT: "1"
# aarch64 build
- label: "Build arm64 CPU wheel"
depends_on: ~
id: build-wheel-arm64-cpu
agents:
queue: arm64_cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_BUILD_ACL=ON --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/scripts/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"
# x86 + CUDA builds
- label: "Build wheel - CUDA 12.8"
depends_on: ~
id: build-wheel-cuda-12-8
@@ -28,20 +43,6 @@ steps:
env:
DOCKER_BUILDKIT: "1"
- label: "Build wheel - CUDA 12.6"
depends_on: ~
id: build-wheel-cuda-12-6
agents:
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.6.3 --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0+PTX' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/scripts/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"
# x86 + CUDA builds
- label: "Build wheel - CUDA 12.9"
depends_on: ~
id: build-wheel-cuda-12-9
@@ -55,6 +56,20 @@ steps:
env:
DOCKER_BUILDKIT: "1"
- label: "Build wheel - CUDA 13.0"
depends_on: ~
id: build-wheel-cuda-13-0
agents:
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/scripts/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"
# Build release images (12.9)
- label: "Build release image (x86)"
depends_on: ~
id: build-release-image-x86
@@ -62,13 +77,12 @@ steps:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"
# re-tag to default image tag and push, just in case arm64 build fails
- "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
# PyTorch 2.8 aarch64 + CUDA wheel is only available on CUDA 12.9
- label: "Build release image (arm64)"
depends_on: ~
id: build-release-image-arm64
@@ -102,24 +116,6 @@ steps:
commands:
- "bash .buildkite/scripts/annotate-release.sh"
- label: "Build and publish TPU release image"
depends_on: ~
if: build.env("NIGHTLY") == "1"
agents:
queue: tpu_queue_postmerge
commands:
- "yes | docker system prune -a"
- "git fetch --all"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f docker/Dockerfile.tpu ."
- "docker push vllm/vllm-tpu:nightly"
- "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
plugins:
- docker-login#v3.0.0:
username: vllmbot
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"
- input: "Provide Release version here"
id: input-release-version
fields:
@@ -136,12 +132,28 @@ steps:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_CPU_AVX512BF16=true --build-arg VLLM_CPU_AVX512VNNI=true --build-arg VLLM_CPU_AMXBF16=true --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest"
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
- block: "Build arm64 CPU release image"
key: block-arm64-cpu-release-image-build
depends_on: ~
- label: "Build and publish arm64 CPU release image"
depends_on: block-arm64-cpu-release-image-build
agents:
queue: arm64_cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest"
- "docker push public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"
- label: "Build and publish nightly multi-arch image to DockerHub"
depends_on:
- create-multi-arch-manifest

View File

@@ -2,16 +2,23 @@
set -ex
# Get release version and strip leading 'v' if present
RELEASE_VERSION=$(buildkite-agent meta-data get release-version | sed 's/^v//')
if [ -z "$RELEASE_VERSION" ]; then
echo "Error: RELEASE_VERSION is empty. 'release-version' metadata might not be set or is invalid."
exit 1
# Get release version, default to 1.0.0.dev for nightly/per-commit builds
RELEASE_VERSION=$(buildkite-agent meta-data get release-version 2>/dev/null | sed 's/^v//')
if [ -z "${RELEASE_VERSION}" ]; then
RELEASE_VERSION="1.0.0.dev"
fi
buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
To download the wheel:
To download the wheel (by commit):
\`\`\`
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux2014_aarch64.whl .
\`\`\`
To download the wheel (by version):
\`\`\`
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .

View File

@@ -59,7 +59,7 @@ while true; do
fi
done
echo "--- Pulling container"
echo "--- Pulling container"
image_name="rocm/vllm-ci:${BUILDKITE_COMMIT}"
container_name="rocm_${BUILDKITE_COMMIT}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
docker pull "${image_name}"
@@ -78,17 +78,13 @@ HF_MOUNT="/root/.cache/huggingface"
commands=$@
echo "Commands:$commands"
if [[ $commands == *"pytest -v -s basic_correctness/test_basic_correctness.py"* ]]; then
commands=${commands//"pytest -v -s basic_correctness/test_basic_correctness.py"/"VLLM_USE_TRITON_FLASH_ATTN=0 pytest -v -s basic_correctness/test_basic_correctness.py"}
fi
commands=${commands//"pytest -v -s basic_correctness/test_basic_correctness.py"/"pytest -v -s basic_correctness/test_basic_correctness.py"}
if [[ $commands == *"pytest -v -s models/test_registry.py"* ]]; then
commands=${commands//"pytest -v -s models/test_registry.py"/"pytest -v -s models/test_registry.py -k 'not BambaForCausalLM and not GritLM and not Mamba2ForCausalLM and not Zamba2ForCausalLM'"}
fi
if [[ $commands == *"pytest -v -s compile/test_basic_correctness.py"* ]]; then
commands=${commands//"pytest -v -s compile/test_basic_correctness.py"/"VLLM_USE_TRITON_FLASH_ATTN=0 pytest -v -s compile/test_basic_correctness.py"}
fi
commands=${commands//"pytest -v -s compile/test_basic_correctness.py"/"pytest -v -s compile/test_basic_correctness.py"}
if [[ $commands == *"pytest -v -s lora"* ]]; then
commands=${commands//"pytest -v -s lora"/"VLLM_ROCM_CUSTOM_PAGED_ATTN=0 pytest -v -s lora"}
@@ -173,19 +169,28 @@ fi
PARALLEL_JOB_COUNT=8
MYPYTHONPATH=".."
# check if the command contains the shard flag; if so, run all shards in parallel because the host has 8 GPUs.
# Test that we're launching on the machine that has
# proper access to GPUs
render_gid=$(getent group render | cut -d: -f3)
if [[ -z "$render_gid" ]]; then
echo "Error: 'render' group not found. This is required for GPU access." >&2
exit 1
fi
# check if the command contains the shard flag; if so, run all shards in parallel because the host has 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
# assign job count as the number of shards used
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
# assign job count as the number of shards used
commands=$(echo "$commands" | sed -E "s/--num-shards[[:blank:]]*=[[:blank:]]*[0-9]*/--num-shards=${PARALLEL_JOB_COUNT} /g" | sed 's/ \\ / /g')
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
# assign shard-id for each shard
commands_gpu=${commands//"--shard-id= "/"--shard-id=${GPU} "}
commands_gpu=$(echo "$commands" | sed -E "s/--shard-id[[:blank:]]*=[[:blank:]]*[0-9]*/--shard-id=${GPU} /g" | sed 's/ \\ / /g')
echo "Shard ${GPU} commands:$commands_gpu"
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
docker run \
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--group-add "$render_gid" \
--rm \
-e HIP_VISIBLE_DEVICES="${GPU}" \
-e HF_TOKEN \
@@ -217,8 +222,8 @@ else
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--group-add "$render_gid" \
--rm \
-e HIP_VISIBLE_DEVICES=0 \
-e HF_TOKEN \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \

View File

@@ -49,6 +49,7 @@ function cpu_tests() {
# Run kernel tests
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -x -v -s tests/kernels/attention/test_cpu_attn.py
pytest -x -v -s tests/kernels/test_onednn.py"
# Run basic model test
@@ -70,13 +71,13 @@ function cpu_tests() {
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -x -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs[False-10-32-neuralmagic/Llama-3.2-1B-quantized.w8a8]"
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs"
# Note: disable it until it supports V1
# Run AWQ test
# docker exec cpu-test-"$NUMA_NODE" bash -c "
# set -e
# VLLM_USE_V1=0 pytest -x -s -v \
# pytest -x -s -v \
# tests/quantization/test_ipex_quant.py"
# Run multi-lora tests
@@ -116,4 +117,4 @@ function cpu_tests() {
# All CPU tests are expected to finish in less than 40 mins.
export -f cpu_tests
timeout 2h bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"
timeout 2.5h bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"

View File

@@ -20,7 +20,10 @@ trap remove_docker_container EXIT
# Run the image and test offline inference/tensor parallel
docker run \
--device /dev/dri \
--device /dev/dri:/dev/dri \
--net=host \
--ipc=host \
--privileged \
-v /dev/dri/by-path:/dev/dri/by-path \
--entrypoint="" \
-e "HF_TOKEN=${HF_TOKEN}" \
@@ -42,7 +45,7 @@ docker run \
pytest -v -s v1/sample --ignore=v1/sample/test_logprobs.py --ignore=v1/sample/test_logprobs_e2e.py
pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py
pytest -v -s v1/structured_output
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_shared_storage_connector.py
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py --ignore=v1/spec_decode/test_speculators_eagle3.py
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_shared_storage_connector.py --ignore=v1/kv_connector/unit/test_lmcache_integration.py
pytest -v -s v1/test_serial_utils.py
'

View File

@@ -0,0 +1,62 @@
#!/usr/bin/env bash
set -euxo pipefail
# args: [THRESHOLD] [NUM_QUESTIONS] [START_PORT]
THRESHOLD=${1:-0.25}
NUM_Q=${2:-1319}
PORT=${3:-8010}
OUT_DIR=${OUT_DIR:-/tmp/vllm-scheduled}
mkdir -p "${OUT_DIR}"
wait_for_server() {
local port=$1
timeout 600 bash -c '
until curl -sf "http://127.0.0.1:'"$port"'/health" > /dev/null; do
sleep 1
done'
}
MODEL="deepseek-ai/DeepSeek-V2-lite"
BACKENDS=("deepep_high_throughput" "deepep_low_latency")
cleanup() {
if [[ -n "${SERVER_PID:-}" ]] && kill -0 "${SERVER_PID}" 2>/dev/null; then
kill "${SERVER_PID}" 2>/dev/null || true
for _ in {1..20}; do
kill -0 "${SERVER_PID}" 2>/dev/null || break
sleep 0.5
done
kill -9 "${SERVER_PID}" 2>/dev/null || true
fi
}
trap cleanup EXIT
for BACK in "${BACKENDS[@]}"; do
VLLM_DEEP_GEMM_WARMUP=skip \
VLLM_ALL2ALL_BACKEND=$BACK \
vllm serve "$MODEL" \
--enforce-eager \
--tensor-parallel-size 2 \
--data-parallel-size 2 \
--enable-expert-parallel \
--enable-eplb \
--trust-remote-code \
--max-model-len 2048 \
--port $PORT &
SERVER_PID=$!
wait_for_server $PORT
TAG=$(echo "$MODEL" | tr '/: \\n' '_____')
OUT="${OUT_DIR}/${TAG}_${BACK}.json"
python3 tests/evals/gsm8k/gsm8k_eval.py --host http://127.0.0.1 --port $PORT --num-questions ${NUM_Q} --save-results ${OUT}
python3 - <<PY
import json; acc=json.load(open('${OUT}'))['accuracy']
print(f"${MODEL} ${BACK}: accuracy {acc:.3f}")
assert acc >= ${THRESHOLD}, f"${MODEL} ${BACK} accuracy {acc}"
PY
cleanup
SERVER_PID=
sleep 1
PORT=$((PORT+1))
done

View File

@@ -0,0 +1,61 @@
#!/usr/bin/env bash
set -euxo pipefail
# args: [THRESHOLD] [NUM_QUESTIONS] [START_PORT]
THRESHOLD=${1:-0.8}
NUM_Q=${2:-1319}
PORT=${3:-8020}
OUT_DIR=${OUT_DIR:-/tmp/vllm-scheduled}
mkdir -p "${OUT_DIR}"
wait_for_server() {
local port=$1
timeout 600 bash -c '
until curl -sf "http://127.0.0.1:'"$port"'/health" > /dev/null; do
sleep 1
done'
}
MODEL="QWen/Qwen3-30B-A3B-FP8"
BACKENDS=("deepep_high_throughput" "deepep_low_latency")
cleanup() {
if [[ -n "${SERVER_PID:-}" ]] && kill -0 "${SERVER_PID}" 2>/dev/null; then
kill "${SERVER_PID}" 2>/dev/null || true
for _ in {1..20}; do
kill -0 "${SERVER_PID}" 2>/dev/null || break
sleep 0.5
done
kill -9 "${SERVER_PID}" 2>/dev/null || true
fi
}
trap cleanup EXIT
for BACK in "${BACKENDS[@]}"; do
VLLM_DEEP_GEMM_WARMUP=skip \
VLLM_ALL2ALL_BACKEND=$BACK \
vllm serve "$MODEL" \
--enforce-eager \
--tensor-parallel-size 2 \
--data-parallel-size 2 \
--enable-expert-parallel \
--trust-remote-code \
--max-model-len 2048 \
--port $PORT &
SERVER_PID=$!
wait_for_server $PORT
TAG=$(echo "$MODEL" | tr '/: \\n' '_____')
OUT="${OUT_DIR}/${TAG}_${BACK}.json"
python3 tests/evals/gsm8k/gsm8k_eval.py --host http://127.0.0.1 --port $PORT --num-questions ${NUM_Q} --save-results ${OUT}
python3 - <<PY
import json; acc=json.load(open('${OUT}'))['accuracy']
print(f"${MODEL} ${BACK}: accuracy {acc:.3f}")
assert acc >= ${THRESHOLD}, f"${MODEL} ${BACK} accuracy {acc}"
PY
cleanup
SERVER_PID=
sleep 1
PORT=$((PORT+1))
done

View File

@@ -58,33 +58,25 @@ python3 .buildkite/generate_index.py --wheel "$normal_wheel"
aws s3 cp "$wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
if [[ $normal_wheel == *"cu126"* ]]; then
# if $normal_wheel matches cu126, do not upload the index.html
echo "Skipping index files for cu126 wheels"
elif [[ $normal_wheel == *"cu128"* ]]; then
# if $normal_wheel matches cu128, do not upload the index.html
echo "Skipping index files for cu128 wheels"
else
if [[ $normal_wheel == *"cu129"* ]]; then
# only upload index.html for cu129 wheels (default wheels) as it
# is available on both x86 and arm64
aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
else
echo "Skipping index files for non-cu129 wheels"
fi
# generate index for nightly
aws s3 cp "$wheel" "s3://vllm-wheels/nightly/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"
if [[ $normal_wheel == *"cu126"* ]]; then
# if $normal_wheel matches cu126, do not upload the index.html
echo "Skipping index files for cu126 wheels"
elif [[ $normal_wheel == *"cu128"* ]]; then
# if $normal_wheel matches cu128, do not upload the index.html
echo "Skipping index files for cu128 wheels"
else
if [[ $normal_wheel == *"cu129"* ]]; then
# only upload index.html for cu129 wheels (default wheels) as it
# is available on both x86 and arm64
aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
else
echo "Skipping index files for non-cu129 wheels"
fi
aws s3 cp "$wheel" "s3://vllm-wheels/$version/"

View File

@@ -38,7 +38,7 @@ steps:
- label: Pytorch Nightly Dependency Override Check # 2min
# if this test fails, it means the nightly torch version is not compatible with some
# of the dependencies. Please check the error message and add the package to whitelist
# in /vllm/tools/generate_nightly_torch_test.py
# in /vllm/tools/pre_commit/generate_nightly_torch_test.py
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
# grade: Blocking
@@ -48,9 +48,9 @@ steps:
commands:
- bash standalone_tests/pytorch_nightly_dependency.sh
- label: Async Engine, Inputs, Utils, Worker Test # 36min
timeout_in_minutes: 50
mirror_hardwares: [amdexperimental]
- label: Async Engine, Inputs, Utils, Worker Test # 10min
timeout_in_minutes: 15
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
@@ -63,7 +63,7 @@ steps:
- label: Async Engine, Inputs, Utils, Worker Test (CPU) # 4 mins
timeout_in_minutes: 10
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
@@ -226,6 +226,27 @@ steps:
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
- popd
- label: Distributed Tests (8 GPUs) # 4min
timeout_in_minutes: 10
mirror_hardwares: [amdexperimental]
agent_pool: mi325_8
# grade: Blocking
gpu: h100
num_gpus: 8
working_dir: "/vllm-workspace/tests"
source_file_dependencies:
- examples/offline_inference/torchrun_dp_example.py
- vllm/config/parallel.py
- vllm/distributed/
- vllm/v1/engine/llm_engine.py
- vllm/v1/executor/uniproc_executor.py
- vllm/v1/worker/gpu_worker.py
commands:
# https://github.com/NVIDIA/nccl/issues/1838
#- export NCCL_CUMEM_HOST_ENABLE=0
# test with torchrun tp=2 and dp=4 with ep
- torchrun --nproc-per-node=8 ../examples/offline_inference/torchrun_dp_example.py --tp-size=2 --pp-size=1 --dp-size=4 --enable-ep
- label: EPLB Algorithm Test # 5min
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
@@ -238,11 +259,11 @@ steps:
commands:
- pytest -v -s distributed/test_eplb_algo.py
- label: EPLB Execution Test # 5min
- label: EPLB Execution Test # 10min
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_4
# grade: Blocking
timeout_in_minutes: 15
timeout_in_minutes: 20
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
@@ -250,6 +271,7 @@ steps:
- tests/distributed/test_eplb_execute.py
commands:
- pytest -v -s distributed/test_eplb_execute.py
- pytest -v -s distributed/test_eplb_spec_decode.py
- label: Metrics, Tracing Test # 12min
timeout_in_minutes: 20
@@ -273,7 +295,7 @@ steps:
- label: Regression Test # 7min
timeout_in_minutes: 20
mirror_hardwares: [amdexperimental, amdproduction]
mirror_hardwares: [amdexperimental, amdproduction, amdtentative]
agent_pool: mi325_1
grade: Blocking
source_file_dependencies:
@@ -286,9 +308,9 @@ steps:
- label: Engine Test # 25min
timeout_in_minutes: 40
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
#grade: Blocking
# grade: Blocking
source_file_dependencies:
- vllm/
- tests/engine
@@ -318,7 +340,7 @@ steps:
- label: V1 Test entrypoints # 35min
timeout_in_minutes: 50
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
@@ -337,6 +359,7 @@ steps:
- tests/v1
commands:
# split the test to avoid interference
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- pytest -v -s -m 'not cpu_test' v1/core
- pytest -v -s v1/executor
- pytest -v -s v1/kv_offload
@@ -348,14 +371,29 @@ steps:
- pytest -v -s -m 'not cpu_test' v1/metrics
- pytest -v -s v1/test_oracle.py
- pytest -v -s v1/test_request.py
- pytest -v -s v1/test_outputs.py
# Integration test for streaming correctness (requires special branch).
- pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
- label: V1 Test others (CPU) # 5 mins
# TODO: Add the "V1 Test attention (MI300)" test group
- label: V1 Test attention (H100) # 10min
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
# grade: Blocking
timeout_in_minutes: 30
gpu: h100
source_file_dependencies:
- vllm/v1/attention
- tests/v1/attention
commands:
- pytest -v -s v1/attention
- label: V1 Test others (CPU) # 5 mins
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
- vllm/
- tests/v1
@@ -395,7 +433,9 @@ steps:
- python3 offline_inference/basic/embed.py
- python3 offline_inference/basic/score.py
- python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
# https://github.com/vllm-project/vllm/pull/26682 uses slightly more memory in PyTorch 2.9+, causing this test to OOM on a single L4 GPU
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 1536
#- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
- label: Platform Tests (CUDA) # 4min
timeout_in_minutes: 15
@@ -436,7 +476,11 @@ steps:
--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
--ignore=lora/test_chatglm3_tp.py \
--ignore=lora/test_llama_tp.py \
--ignore=lora/test_llm_with_multi_loras.py
--ignore=lora/test_llm_with_multi_loras.py \
--ignore=lora/test_olmoe_tp.py \
--ignore=lora/test_deepseekv2_tp.py \
--ignore=lora/test_gptoss_tp.py \
--ignore=lora/test_qwen3moe_tp.py
parallelism: 4
- label: PyTorch Compilation Unit Tests # 15min
@@ -454,11 +498,12 @@ steps:
- pytest -v -s compile/test_fusion_attn.py
- pytest -v -s compile/test_functionalization.py
- pytest -v -s compile/test_silu_mul_quant_fusion.py
- pytest -v -s compile/test_sequence_parallelism.py
- pytest -v -s compile/test_async_tp.py
# - pytest -v -s compile/test_sequence_parallelism.py
# - pytest -v -s compile/test_async_tp.py
- pytest -v -s compile/test_fusion_all_reduce.py
- pytest -v -s compile/test_decorator.py
- pytest -v -s compile/test_noop_elimination.py
- pytest -v -s compile/test_aot_compile.py
- label: PyTorch Fullgraph Smoke Test # 15min
timeout_in_minutes: 30
@@ -471,10 +516,11 @@ steps:
- tests/compile
commands:
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s compile/test_multimodal_compile.py
- pytest -v -s compile/piecewise/
- label: PyTorch Fullgraph Test # 20min
timeout_in_minutes: 30
- label: PyTorch Fullgraph Test # 27min
timeout_in_minutes: 40
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
@@ -483,18 +529,35 @@ steps:
- vllm/
- tests/compile
commands:
- pytest -v -s compile/test_full_graph.py
- pytest -v -s compile/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
# Limit to no custom ops to reduce running time
# Wrap with quotes to escape yaml and avoid starting -k string with a -
- "pytest -v -s compile/test_fusions_e2e.py -k 'TRITON and -quant_fp8'"
- label: Cudagraph test
timeout_in_minutes: 20
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
source_file_dependencies:
- tests/v1/cudagraph
- vllm/v1/cudagraph_dispatcher.py
- vllm/config/compilation.py
- vllm/compilation
commands:
- pytest -v -s v1/cudagraph/test_cudagraph_dispatch.py
- pytest -v -s v1/cudagraph/test_cudagraph_mode.py
- label: Kernels Core Operation Test # 48min
timeout_in_minutes: 75
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
- csrc/
- tests/kernels/core
- tests/kernels/test_top_k_per_row.py
commands:
- pytest -v -s kernels/core
- pytest -v -s kernels/core kernels/test_top_k_per_row.py
- label: Kernels Attention Test %N # 23min
timeout_in_minutes: 35
@@ -534,6 +597,8 @@ steps:
- tests/kernels/moe
- vllm/model_executor/layers/fused_moe/
- vllm/distributed/device_communicators/
- vllm/envs.py
- vllm/config
commands:
- pytest -v -s kernels/moe --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
parallelism: 2
@@ -552,10 +617,13 @@ steps:
- label: Model Executor Test # 23min
timeout_in_minutes: 35
mirror_hardwares: [amdexperimental]
torch_nightly: true
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
- vllm/engine/arg_utils.py
- vllm/config/model.py
- vllm/model_executor
- tests/model_executor
- tests/entrypoints/openai/test_tensorizer_entrypoint.py
@@ -603,12 +671,13 @@ steps:
# since torchao nightly is only compatible with torch nightly currently
# https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now
# we can only upgrade after this is resolved
- pip install --pre torchao==0.13.0.dev20250814 --index-url https://download.pytorch.org/whl/nightly/cu128
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/
# TODO(jerryzh168): resolve the above comment
- uv pip install --system torchao==0.13.0
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py
- label: LM Eval Small Models # 53min
timeout_in_minutes: 75
mirror_hardwares: [amdexperimental]
- label: LM Eval Small Models # 15min
timeout_in_minutes: 20
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
@@ -617,21 +686,22 @@ steps:
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1
- label: OpenAI API correctness # 22min
timeout_in_minutes: 30
mirror_hardwares: [amdexperimental]
- label: OpenAI API correctness # 10min
timeout_in_minutes: 15
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
- csrc/
- vllm/entrypoints/openai/
- vllm/model_executor/models/whisper.py
commands: # LMEval+Transcription WER check
- pytest -s entrypoints/openai/correctness/
commands: # LMEval
# Transcription WER check is skipped because encoder-decoder models are not supported on ROCm, see https://github.com/vllm-project/vllm/issues/27442
- pytest -s entrypoints/openai/correctness/ --ignore entrypoints/openai/correctness/test_transcription_api_correctness.py
- label: OpenAI-Compatible Tool Use # 23 min
timeout_in_minutes: 35
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
fast_check: false
@@ -779,8 +849,10 @@ steps:
- vllm/
- tests/models/language/generation
commands:
# Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
- pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
# Install fast path packages for testing against transformers
# Note: also needed to run plamo2 model in vLLM
- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
- label: Language Models Test (PPL)
@@ -846,6 +918,19 @@ steps:
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
- label: Multi-Modal Accuracy Eval (Small Models) # 10min
timeout_in_minutes: 70
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies:
- vllm/multimodal/
- vllm/inputs/
- vllm/v1/core/
commands:
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1
- label: Multi-Modal Models Test (Extended) 1
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
@@ -884,7 +969,7 @@ steps:
- label: Quantized Models Test # 45 min
timeout_in_minutes: 60
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
@@ -908,6 +993,7 @@ steps:
- label: Transformers Nightly Models Test
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
# grade: Blocking
working_dir: "/vllm-workspace/"
optional: true
commands:
@@ -921,8 +1007,8 @@ steps:
# Whisper needs spawn method to avoid deadlock
- VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper
- label: Blackwell Test # 38 min
timeout_in_minutes: 60
- label: Blackwell Test # 21 min
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
gpu: b200
# optional: true
@@ -935,13 +1021,16 @@ steps:
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/fusion.py
- vllm/compilation/fusion_attn.py
- vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py
- vllm/platforms/cuda.py
- vllm/attention/selector.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py
# Attention
# num_heads2 broken by https://github.com/flashinfer-ai/flashinfer/issues/1353
- pytest -v -s tests/kernels/attention/test_attention_selector.py
- pytest -v -s tests/kernels/attention/test_flashinfer.py -k 'not num_heads2'
- pytest -v -s tests/kernels/attention/test_flashinfer_trtllm_attention.py
- pytest -v -s tests/kernels/attention/test_cutlass_mla_decode.py
@@ -953,18 +1042,64 @@ steps:
- pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_nvfp4_qutlass.py
- pytest -v -s tests/kernels/quantization/test_mxfp4_qutlass.py
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
# Fusion
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- pytest -v -s tests/compile/test_fusion_attn.py::test_attention_quant_pattern
- pytest -v -s tests/kernels/moe/test_flashinfer.py
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
- label: Blackwell GPT-OSS Eval
timeout_in_minutes: 60
- label: Blackwell Fusion Tests # 30 min
timeout_in_minutes: 40
working_dir: "/vllm-workspace/"
gpu: b200
source_file_dependencies:
- csrc/quantization/fp4/
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
commands:
- nvidia-smi
- pytest -v -s tests/compile/test_fusion_attn.py
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
# this runner has 2 GPUs available even though num_gpus=2 is not set
- pytest -v -s tests/compile/test_fusion_all_reduce.py
# Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
# Wrap with quotes to escape yaml
- "pytest -v -s tests/compile/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and Llama-3.1 and -quant_fp8 and -rms_norm'"
- label: Blackwell Fusion E2E Tests # 30 min
timeout_in_minutes: 40
working_dir: "/vllm-workspace/"
gpu: b200
optional: true
num_gpus: 2
source_file_dependencies:
- csrc/quantization/fp4/
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
- tests/compile/test_fusions_e2e.py
- tests/compile/test_full_graph.py
commands:
- nvidia-smi
# Run all e2e fusion tests
- pytest -v -s tests/compile/test_fusions_e2e.py
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
- pytest -v -s tests/compile/test_full_graph.py::test_fp8_kv_scale_compile
- label: ROCm GPT-OSS Eval
timeout_in_minutes: 60
working_dir: "/vllm-workspace/"
agent_pool: mi325_1
mirror_hardwares: [amdproduction]
optional: true # run on nightlies
source_file_dependencies:
- tests/evals/gpt_oss
@@ -973,7 +1108,7 @@ steps:
- vllm/v1/attention/backends/flashinfer.py
commands:
- uv pip install --system 'gpt-oss[eval]==0.0.5'
- pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py --model openai/gpt-oss-20b --metric 0.58
- VLLM_ROCM_USE_AITER_MHA=0 VLLM_ROCM_USE_AITER=1 VLLM_USE_AITER_UNIFIED_ATTENTION=1 pytest -s -v tests/evals/gpt_oss/test_gpqa_correctness.py --model openai/gpt-oss-20b --metric 0.58
- label: Blackwell Quantized MoE Test
timeout_in_minutes: 60
@@ -1079,6 +1214,7 @@ steps:
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- VLLM_TEST_SAME_HOST=1 VLLM_TEST_WITH_DEFAULT_DEVICE_SET=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- pytest -v -s distributed/test_sequence_parallel.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
- pytest -v -s v1/worker/test_worker_memory_snapshot.py
@@ -1126,6 +1262,11 @@ steps:
- pytest -v -s plugins_tests/test_io_processor_plugins.py
- pip uninstall prithvi_io_processor_plugin -y
# end io_processor plugins test
# begin stat_logger plugins test
- pip install -e ./plugins/vllm_add_dummy_stat_logger
- pytest -v -s plugins_tests/test_stats_logger_plugins.py
- pip uninstall dummy_stat_logger -y
# end stat_logger plugins test
# other tests continue here:
- pytest -v -s plugins_tests/test_scheduler_plugins.py
- pip install -e ./plugins/vllm_add_dummy_model
@@ -1169,6 +1310,8 @@ steps:
- pytest -v -s -x lora/test_chatglm3_tp.py
- pytest -v -s -x lora/test_llama_tp.py
- pytest -v -s -x lora/test_llm_with_multi_loras.py
- pytest -v -s -x lora/test_olmoe_tp.py
- pytest -v -s -x lora/test_gptoss_tp.py
- label: Weight Loading Multiple GPU Test # 33min
@@ -1199,11 +1342,27 @@ steps:
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
- label: NixlConnector PD accuracy tests (Distributed) # 30min
mirror_hardwares: [amdexperimental]
agent_pool: mi325_4
# grade: Blocking
timeout_in_minutes: 30
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
##### multi gpus test #####
##### A100 test #####
- label: Distributed Tests (A100) # optional
mirror_hardwares: [amdexperimental]
agent_pool: mi325_4
# grade: Blocking
gpu: a100
optional: true
num_gpus: 4
@@ -1218,6 +1377,9 @@ steps:
- pytest -v -s -x lora/test_mixtral.py
- label: LM Eval Large Models # optional
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_4
# grade: Blocking
gpu: a100
optional: true
num_gpus: 4
@@ -1229,15 +1391,39 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4
##### H100 test #####
- label: LM Eval Large Models (H100) # optional
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_4
# grade: Blocking
gpu: h100
optional: true
num_gpus: 4
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies:
- csrc/
- vllm/model_executor/layers/quantization
commands:
- export VLLM_USE_DEEP_GEMM=0 # We found Triton is faster than DeepGEMM for H100
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large-hopper.txt --tp-size=4
##### H200 test #####
- label: Distrubted Tests (H200) # optional
- label: Distributed Tests (H200) # optional
mirror_hardwares: [amdexperimental]
agent_pool: mi325_2
# grade: Blocking
gpu: h200
optional: true
working_dir: "/vllm-workspace/"
num_gpus: 2
commands:
- pytest -v -s tests/compile/test_async_tp.py
- pytest -v -s tests/compile/test_sequence_parallelism.py
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- pytest -v -s tests/compile/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm
- pytest -v -s tests/distributed/test_context_parallel.py
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
- pytest -v -s tests/v1/distributed/test_dbo.py
##### B200 test #####
- label: Distributed Tests (B200) # optional
@@ -1248,6 +1434,7 @@ steps:
commands:
- pytest -v -s tests/distributed/test_context_parallel.py
- pytest -v -s tests/distributed/test_nccl_symm_mem_allreduce.py
- pytest -v -s tests/v1/distributed/test_dbo.py
##### RL Integration Tests #####
- label: Prime-RL Integration Test # 15min
@@ -1263,3 +1450,27 @@ steps:
- .buildkite/scripts/run-prime-rl-test.sh
commands:
- bash .buildkite/scripts/run-prime-rl-test.sh
- label: DeepSeek V2-Lite Accuracy
mirror_hardwares: [amdexperimental]
agent_pool: mi325_4
# grade: Blocking
timeout_in_minutes: 60
gpu: h100
optional: true
num_gpus: 4
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_eplb.sh 0.25 200 8010
- label: Qwen3-30B-A3B-FP8-block Accuracy
mirror_hardwares: [amdexperimental]
agent_pool: mi325_4
# grade: Blocking
timeout_in_minutes: 60
gpu: h100
optional: true
num_gpus: 4
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep.sh 0.8 200 8020


@@ -25,6 +25,7 @@
# and $$BUILDKITE_PARALLEL_JOB_COUNT environment variables.
# working_dir(str): specify the directory where the command should execute; defaults to /vllm-workspace/tests
# source_file_dependencies(list): the list of path prefixes that opt the test in; if empty, the test will always run.
# autorun_on_main (bool): defaults to false; if true, the test will run automatically when a commit is pushed to the main branch.
# When adding a test
# - If the test belongs to an existing group, add it there
@@ -38,7 +39,7 @@ steps:
- label: Pytorch Nightly Dependency Override Check # 2min
# if this test fails, it means the nightly torch version is not compatible with some
# of the dependencies. Please check the error message and add the package to whitelist
# in /vllm/tools/generate_nightly_torch_test.py
# in /vllm/tools/pre_commit/generate_nightly_torch_test.py
soft_fail: true
source_file_dependencies:
- requirements/nightly_torch_test.txt
@@ -56,7 +57,7 @@ steps:
- pytest -v -s -m 'not cpu_test' multimodal
- pytest -v -s utils_
- label: Async Engine, Inputs, Utils, Worker Test (CPU) # 4 mins
- label: Async Engine, Inputs, Utils, Worker, Config Test (CPU) # 4 mins
timeout_in_minutes: 10
source_file_dependencies:
- vllm/
@@ -65,6 +66,7 @@ steps:
- tests/multimodal
- tests/standalone_tests/lazy_imports.py
- tests/transformers_utils
- tests/config
no_gpu: true
commands:
- python3 standalone_tests/lazy_imports.py
@@ -72,6 +74,7 @@ steps:
- pytest -v -s test_outputs.py
- pytest -v -s -m 'cpu_test' multimodal
- pytest -v -s transformers_utils
- pytest -v -s config
- label: Python-only Installation Test # 10min
timeout_in_minutes: 20
@@ -172,6 +175,8 @@ steps:
- tests/v1/engine/test_engine_core_client.py
- tests/distributed/test_symm_mem_allreduce.py
commands:
# https://github.com/NVIDIA/nccl/issues/1838
- export NCCL_CUMEM_HOST_ENABLE=0
# test with torchrun tp=2 and external_dp=2
- torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
# test with torchrun tp=2 and pp=2
@@ -203,6 +208,24 @@ steps:
- VLLM_ALLOW_INSECURE_SERIALIZATION=1 RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
- popd
- label: Distributed Tests (8 GPUs) # 4min
timeout_in_minutes: 10
gpu: h100
num_gpus: 8
working_dir: "/vllm-workspace/tests"
source_file_dependencies:
- examples/offline_inference/torchrun_dp_example.py
- vllm/config/parallel.py
- vllm/distributed/
- vllm/v1/engine/llm_engine.py
- vllm/v1/executor/uniproc_executor.py
- vllm/v1/worker/gpu_worker.py
commands:
# https://github.com/NVIDIA/nccl/issues/1838
- export NCCL_CUMEM_HOST_ENABLE=0
# test with torchrun tp=2 and dp=4 with ep
- torchrun --nproc-per-node=8 ../examples/offline_inference/torchrun_dp_example.py --tp-size=2 --pp-size=1 --dp-size=4 --enable-ep
- label: EPLB Algorithm Test # 5min
timeout_in_minutes: 15
working_dir: "/vllm-workspace/tests"
@@ -212,8 +235,8 @@ steps:
commands:
- pytest -v -s distributed/test_eplb_algo.py
- label: EPLB Execution Test # 5min
timeout_in_minutes: 15
- label: EPLB Execution Test # 10min
timeout_in_minutes: 20
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
@@ -221,6 +244,7 @@ steps:
- tests/distributed/test_eplb_execute.py
commands:
- pytest -v -s distributed/test_eplb_execute.py
- pytest -v -s distributed/test_eplb_spec_decode.py
- label: Metrics, Tracing Test # 12min
timeout_in_minutes: 20
@@ -295,6 +319,7 @@ steps:
- vllm/
- tests/v1
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
# split the test to avoid interference
- pytest -v -s -m 'not cpu_test' v1/core
- pytest -v -s v1/executor
@@ -307,10 +332,29 @@ steps:
- pytest -v -s -m 'not cpu_test' v1/metrics
- pytest -v -s v1/test_oracle.py
- pytest -v -s v1/test_request.py
- pytest -v -s v1/test_outputs.py
# Integration test for streaming correctness (requires special branch).
- pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
- label: V1 Test attention (H100) # 10min
timeout_in_minutes: 30
gpu: h100
source_file_dependencies:
- vllm/v1/attention
- tests/v1/attention
commands:
- pytest -v -s v1/attention
- label: V1 Test attention (B200) # 10min
timeout_in_minutes: 30
gpu: b200
source_file_dependencies:
- vllm/v1/attention
- tests/v1/attention
commands:
- VLLM_DISABLE_FLASHINFER_PREFILL=1 pytest -v -s v1/attention # TODO: FI prefill is bugged and causes incorrectness, fix this
- label: V1 Test others (CPU) # 5 mins
source_file_dependencies:
- vllm/
@@ -349,7 +393,8 @@ steps:
- python3 offline_inference/basic/embed.py
- python3 offline_inference/basic/score.py
- python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
# https://github.com/vllm-project/vllm/pull/26682 uses slightly more memory in PyTorch 2.9+, causing this test to OOM on a single L4 GPU
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 1536
- label: Platform Tests (CUDA) # 4min
timeout_in_minutes: 15
@@ -384,7 +429,12 @@ steps:
--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
--ignore=lora/test_chatglm3_tp.py \
--ignore=lora/test_llama_tp.py \
--ignore=lora/test_llm_with_multi_loras.py
--ignore=lora/test_llm_with_multi_loras.py \
--ignore=lora/test_olmoe_tp.py \
--ignore=lora/test_deepseekv2_tp.py \
--ignore=lora/test_gptoss_tp.py \
--ignore=lora/test_qwen3moe_tp.py
parallelism: 4
- label: PyTorch Compilation Unit Tests # 15min
@@ -395,6 +445,8 @@ steps:
- vllm/
- tests/compile
commands:
- pytest -v -s compile/test_graph_partition.py
- pytest -v -s compile/test_config.py
- pytest -v -s compile/test_pass_manager.py
- pytest -v -s compile/test_fusion.py
- pytest -v -s compile/test_fusion_attn.py
@@ -404,6 +456,7 @@ steps:
- pytest -v -s compile/test_decorator.py
- pytest -v -s compile/test_noop_elimination.py
- pytest -v -s compile/test_aot_compile.py
- pytest -v -s compile/test_qk_norm_rope_fusion.py
- label: PyTorch Fullgraph Smoke Test # 15min
timeout_in_minutes: 30
@@ -414,17 +467,34 @@ steps:
- tests/compile
commands:
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s compile/test_multimodal_compile.py
- pytest -v -s compile/piecewise/
- label: PyTorch Fullgraph Test # 20min
timeout_in_minutes: 30
- label: PyTorch Fullgraph Test # 27min
timeout_in_minutes: 40
mirror_hardwares: [amdexperimental]
torch_nightly: true
source_file_dependencies:
- vllm/
- tests/compile
commands:
- pytest -v -s compile/test_full_graph.py
# fp8 kv scales not supported on sm89, tested on Blackwell instead
- pytest -v -s compile/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
# Limit to no custom ops to reduce running time
# Wrap with quotes to escape yaml and avoid starting -k string with a -
- "pytest -v -s compile/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'"
- label: Cudagraph test
timeout_in_minutes: 20
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- tests/v1/cudagraph
- vllm/v1/cudagraph_dispatcher.py
- vllm/config/compilation.py
- vllm/compilation
commands:
- pytest -v -s v1/cudagraph/test_cudagraph_dispatch.py
- pytest -v -s v1/cudagraph/test_cudagraph_mode.py
- label: Kernels Core Operation Test # 48min
timeout_in_minutes: 75
@@ -468,6 +538,8 @@ steps:
- tests/kernels/moe
- vllm/model_executor/layers/fused_moe/
- vllm/distributed/device_communicators/
- vllm/envs.py
- vllm/config
commands:
- pytest -v -s kernels/moe --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
parallelism: 2
@@ -484,8 +556,11 @@ steps:
- label: Model Executor Test # 23min
timeout_in_minutes: 35
torch_nightly: true
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/engine/arg_utils.py
- vllm/config/model.py
- vllm/model_executor
- tests/model_executor
- tests/entrypoints/openai/test_tensorizer_entrypoint.py
@@ -527,8 +602,9 @@ steps:
# since torchao nightly is only compatible with torch nightly currently
# https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now
# we can only upgrade after this is resolved
- pip install --pre torchao==0.13.0.dev20250814 --index-url https://download.pytorch.org/whl/nightly/cu128
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/
# TODO(jerryzh168): resolve the above comment
- uv pip install --system torchao==0.13.0 --index-url https://download.pytorch.org/whl/cu129
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py
- label: LM Eval Small Models # 53min
timeout_in_minutes: 75
@@ -536,6 +612,7 @@ steps:
source_file_dependencies:
- csrc/
- vllm/model_executor/layers/quantization
autorun_on_main: true
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-small.txt --tp-size=1
@@ -677,8 +754,10 @@ steps:
- vllm/
- tests/models/language/generation
commands:
# Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
- pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
# Install fast path packages for testing against transformers
# Note: also needed to run plamo2 model in vLLM
- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
- label: Language Models Test (PPL)
@@ -733,6 +812,16 @@ steps:
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
- label: Multi-Modal Accuracy Eval (Small Models) # 50min
timeout_in_minutes: 70
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies:
- vllm/multimodal/
- vllm/inputs/
- vllm/v1/core/
commands:
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1
- label: Multi-Modal Models Test (Extended) 1
mirror_hardwares: [amdexperimental]
optional: true
@@ -787,17 +876,17 @@ steps:
optional: true
commands:
- pip install --upgrade git+https://github.com/huggingface/transformers
- pytest -v -s tests/models/test_initialization.py
- pytest -v -s tests/models/test_initialization.py -k 'not (Gemma3 or ModernBert or Qwen2_5_VL or Qwen2_5vl or Qwen2VL or TransformersMultiModalEmbeddingModel or TransformersMultiModalForSequenceClassification or Ultravox or Phi4Multimodal or LlavaNextVideo or MiniCPMO or Lfm2Moe or PaliGemma or RobertaForSequenceClassification or Ovis2_5 or Fuyu or DeepseekOCR or KimiVL)'
- pytest -v -s tests/models/test_transformers.py
- pytest -v -s tests/models/multimodal/processing/
- pytest -v -s tests/models/multimodal/test_mapping.py
# - pytest -v -s tests/models/multimodal/processing/
- pytest -v -s tests/models/multimodal/test_mapping.py -k 'not (Gemma3 or Qwen2VL or Qwen2_5_VL)'
- python3 examples/offline_inference/basic/chat.py
- python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl
# - python3 examples/offline_inference/vision_language.py --model-type qwen2_5_vl
# Whisper needs spawn method to avoid deadlock
- VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper
- label: Blackwell Test # 38 min
timeout_in_minutes: 60
- label: Blackwell Test # 21 min
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
gpu: b200
# optional: true
@@ -810,13 +899,16 @@ steps:
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/fusion.py
- vllm/compilation/fusion_attn.py
- vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py
- vllm/platforms/cuda.py
- vllm/attention/selector.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py
# Attention
# num_heads2 broken by https://github.com/flashinfer-ai/flashinfer/issues/1353
- pytest -v -s tests/kernels/attention/test_attention_selector.py
- pytest -v -s tests/kernels/attention/test_flashinfer.py -k 'not num_heads2'
- pytest -v -s tests/kernels/attention/test_flashinfer_trtllm_attention.py
- pytest -v -s tests/kernels/attention/test_cutlass_mla_decode.py
@@ -828,15 +920,58 @@ steps:
- pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
- pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
# Fusion
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- pytest -v -s tests/compile/test_fusion_attn.py::test_attention_quant_pattern
- pytest -v -s tests/kernels/moe/test_flashinfer.py
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
- pytest -v -s tests/kernels/quantization/test_nvfp4_qutlass.py
- pytest -v -s tests/kernels/quantization/test_mxfp4_qutlass.py
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
- pytest -v -s tests/kernels/moe/test_flashinfer.py
- label: Blackwell Fusion and Compile Tests # 30 min
timeout_in_minutes: 40
working_dir: "/vllm-workspace/"
gpu: b200
source_file_dependencies:
- csrc/quantization/fp4/
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
commands:
- nvidia-smi
- pytest -v -s tests/compile/test_fusion_attn.py
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
# this runner has 2 GPUs available even though num_gpus=2 is not set
- pytest -v -s tests/compile/test_fusion_all_reduce.py
# Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
# Wrap with quotes to escape yaml
- "pytest -v -s tests/compile/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and not +quant_fp8 and not +rms_norm'"
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
- pytest -v -s tests/compile/test_full_graph.py::test_fp8_kv_scale_compile
- label: Blackwell Fusion E2E Tests # 30 min
timeout_in_minutes: 40
working_dir: "/vllm-workspace/"
gpu: b200
optional: true
num_gpus: 2
source_file_dependencies:
- csrc/quantization/fp4/
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
- vllm/v1/attention/backends/flashinfer.py
- vllm/compilation/
# can affect pattern matching
- vllm/model_executor/layers/layernorm.py
- vllm/model_executor/layers/activation.py
- vllm/model_executor/layers/quantization/input_quant_fp8.py
- tests/compile/test_fusions_e2e.py
- tests/compile/test_full_graph.py
commands:
- nvidia-smi
# Run all e2e fusion tests
- pytest -v -s tests/compile/test_fusions_e2e.py
- label: Blackwell GPT-OSS Eval
timeout_in_minutes: 60
@@ -943,6 +1078,8 @@ steps:
- tests/v1/shutdown
- tests/v1/worker/test_worker_memory_snapshot.py
commands:
# https://github.com/NVIDIA/nccl/issues/1838
- export NCCL_CUMEM_HOST_ENABLE=0
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
- DP_SIZE=2 pytest -v -s v1/entrypoints/openai/test_multi_api_servers.py
@@ -950,6 +1087,7 @@ steps:
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- VLLM_TEST_SAME_HOST=1 VLLM_TEST_WITH_DEFAULT_DEVICE_SET=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- pytest -v -s distributed/test_sequence_parallel.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
- pytest -v -s v1/worker/test_worker_memory_snapshot.py
@@ -993,6 +1131,11 @@ steps:
- pytest -v -s plugins_tests/test_io_processor_plugins.py
- pip uninstall prithvi_io_processor_plugin -y
# end io_processor plugins test
# begin stat_logger plugins test
- pip install -e ./plugins/vllm_add_dummy_stat_logger
- pytest -v -s plugins_tests/test_stats_logger_plugins.py
- pip uninstall dummy_stat_logger -y
# end stat_logger plugins test
# other tests continue here:
- pytest -v -s plugins_tests/test_scheduler_plugins.py
- pip install -e ./plugins/vllm_add_dummy_model
@@ -1032,6 +1175,8 @@ steps:
- pytest -v -s -x lora/test_chatglm3_tp.py
- pytest -v -s -x lora/test_llama_tp.py
- pytest -v -s -x lora/test_llm_with_multi_loras.py
- pytest -v -s -x lora/test_olmoe_tp.py
- pytest -v -s -x lora/test_gptoss_tp.py
- label: Weight Loading Multiple GPU Test # 33min
@@ -1058,6 +1203,17 @@ steps:
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
- label: NixlConnector PD accuracy tests (Distributed) # 30min
timeout_in_minutes: 30
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
##### multi gpus test #####
##### A100 test #####
@@ -1088,8 +1244,21 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4
##### H100 test #####
- label: LM Eval Large Models (H100) # optional
gpu: h100
optional: true
num_gpus: 4
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies:
- csrc/
- vllm/model_executor/layers/quantization
commands:
- export VLLM_USE_DEEP_GEMM=0 # We found Triton is faster than DeepGEMM for H100
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large-hopper.txt --tp-size=4
##### H200 test #####
- label: Distrubted Tests (H200) # optional
- label: Distributed Tests (H200) # optional
gpu: h200
optional: true
working_dir: "/vllm-workspace/"
@@ -1097,8 +1266,12 @@ steps:
commands:
- pytest -v -s tests/compile/test_async_tp.py
- pytest -v -s tests/compile/test_sequence_parallelism.py
- pytest -v -s tests/compile/test_fusion_all_reduce.py
- "pytest -v -s tests/compile/test_fusions_e2e.py -k 'not Llama-4'"
- pytest -v -s tests/distributed/test_sequence_parallel.py
- pytest -v -s tests/distributed/test_context_parallel.py
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
- pytest -v -s tests/v1/distributed/test_dbo.py
##### B200 test #####
- label: Distributed Tests (B200) # optional
@@ -1109,6 +1282,7 @@ steps:
commands:
- pytest -v -s tests/distributed/test_context_parallel.py
- pytest -v -s tests/distributed/test_nccl_symm_mem_allreduce.py
- pytest -v -s tests/v1/distributed/test_dbo.py
##### RL Integration Tests #####
- label: Prime-RL Integration Test # 15min
@@ -1121,3 +1295,21 @@ steps:
- .buildkite/scripts/run-prime-rl-test.sh
commands:
- bash .buildkite/scripts/run-prime-rl-test.sh
- label: DeepSeek V2-Lite Accuracy
timeout_in_minutes: 60
gpu: h100
optional: true
num_gpus: 4
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/deepseek_v2_lite_ep_eplb.sh 0.25 200 8010
- label: Qwen3-30B-A3B-FP8-block Accuracy
timeout_in_minutes: 60
gpu: h100
optional: true
num_gpus: 4
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep.sh 0.8 200 8020

.github/CODEOWNERS

@@ -3,15 +3,13 @@
# This lists cover the "core" components of vLLM that require careful review
/vllm/attention @LucasWilkinson
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @22quinn
/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @22quinn
/vllm/model_executor/layers/fused_moe @mgoin
/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @NickLucche
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @njhill @22quinn
/vllm/model_executor/layers/fused_moe @mgoin @pavanimajety
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @pavanimajety
/vllm/model_executor/layers/mamba @tdoublep
/vllm/model_executor/model_loader @22quinn
/vllm/multimodal @DarkLight1337 @ywang96 @NickLucche
/vllm/multimodal @DarkLight1337 @ywang96 @NickLucche @tjtanaa
/vllm/vllm_flash_attn @LucasWilkinson
/vllm/lora @jeejeelee
/vllm/reasoning @aarnphm @chaunceyjiang
@@ -22,15 +20,15 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
# Any change to the VllmConfig changes can have a large user-facing impact,
# so spam a lot of people
/vllm/config @simon-mo @WoosukKwon @youkaichao @robertgshaw2-redhat @mgoin @tlrmchlsmth @houseroad @hmellor @yewentao256 @ProExpertProg
/vllm/config/cache.py @simon-mo @WoosukKwon @youkaichao @robertgshaw2-redhat @mgoin @tlrmchlsmth @houseroad @hmellor @yewentao256 @ProExpertProg @heheda12345
/vllm/config @WoosukKwon @youkaichao @robertgshaw2-redhat @mgoin @tlrmchlsmth @houseroad @hmellor @yewentao256 @ProExpertProg
/vllm/config/cache.py @WoosukKwon @youkaichao @robertgshaw2-redhat @mgoin @tlrmchlsmth @houseroad @hmellor @yewentao256 @ProExpertProg @heheda12345
# vLLM V1
/vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat
/vllm/v1/attention @LucasWilkinson
/vllm/v1/attention/backends/flashinfer.py @mgoin
/vllm/v1/attention/backends/mla @pavanimajety
/vllm/v1/attention/backends/flashinfer.py @mgoin @pavanimajety
/vllm/v1/attention/backends/triton_attn.py @tdoublep
/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat @heheda12345 @ApostaC
/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC
/vllm/v1/sample @22quinn @houseroad @njhill
/vllm/v1/spec_decode @benchislett @luccafong
/vllm/v1/structured_output @mgoin @russellb @aarnphm @benchislett
@@ -38,20 +36,20 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
/vllm/v1/offloading @ApostaC
# Test ownership
/.buildkite/lm-eval-harness @mgoin @simon-mo
/.buildkite/lm-eval-harness @mgoin
/tests/distributed/test_multi_node_assignment.py @youkaichao
/tests/distributed/test_pipeline_parallel.py @youkaichao
/tests/distributed/test_same_node.py @youkaichao
/tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo @aarnphm @NickLucche
/tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @aarnphm @NickLucche
/tests/evals @mgoin
/tests/kernels @mgoin @tlrmchlsmth @WoosukKwon @yewentao256
/tests/models @DarkLight1337 @ywang96
/tests/multimodal @DarkLight1337 @ywang96 @NickLucche
/tests/quantization @mgoin @robertgshaw2-redhat @yewentao256
/tests/quantization @mgoin @robertgshaw2-redhat @yewentao256 @pavanimajety
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
/tests/v1/structured_output @mgoin @russellb @aarnphm
/tests/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat @heheda12345 @ApostaC
/tests/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC
/tests/weight_loading @mgoin @youkaichao @yewentao256
/tests/lora @jeejeelee
/tests/models/language/generation/test_hybrid.py @tdoublep
@@ -59,10 +57,20 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
/tests/v1/kv_connector @ApostaC
/tests/v1/offloading @ApostaC
# Transformers backend
/vllm/model_executor/models/transformers.py @hmellor
# Transformers modeling backend
/vllm/model_executor/models/transformers @hmellor
/tests/models/test_transformers.py @hmellor
# Observability
/vllm/config/observability.py @markmc
/vllm/v1/metrics @markmc
/tests/v1/metrics @markmc
/vllm/tracing.py @markmc
/tests/v1/tracing/test_tracing.py @markmc
/vllm/config/kv_events.py @markmc
/vllm/distributed/kv_events.py @markmc
/tests/distributed/test_events.py @markmc
# Docs
/docs/mkdocs @hmellor
/docs/**/*.yml @hmellor
@@ -107,11 +115,21 @@ mkdocs.yaml @hmellor
/vllm/attention/ops/triton_unified_attention.py @tdoublep
# ROCm related: specify owner with write access to notify AMD folks for careful code review
/docker/Dockerfile.rocm* @gshtras
/vllm/v1/attention/backends/rocm*.py @gshtras
/vllm/v1/attention/backends/mla/rocm*.py @gshtras
/vllm/attention/ops/rocm*.py @gshtras
/vllm/model_executor/layers/fused_moe/rocm*.py @gshtras
/vllm/**/*rocm* @tjtanaa
/docker/Dockerfile.rocm* @gshtras @tjtanaa
/vllm/v1/attention/backends/rocm*.py @gshtras @tjtanaa
/vllm/v1/attention/backends/mla/rocm*.py @gshtras @tjtanaa
/vllm/attention/ops/rocm*.py @gshtras @tjtanaa
/vllm/model_executor/layers/fused_moe/rocm*.py @gshtras @tjtanaa
/csrc/rocm @gshtras @tjtanaa
/requirements/*rocm* @tjtanaa
/tests/**/*rocm* @tjtanaa
/docs/**/*rocm* @tjtanaa
/vllm/**/*quark* @tjtanaa
/tests/**/*quark* @tjtanaa
/docs/**/*quark* @tjtanaa
/vllm/**/*aiter* @tjtanaa
/tests/**/*aiter* @tjtanaa
# TPU
/vllm/v1/worker/tpu* @NickLucche
@@ -129,3 +147,8 @@ mkdocs.yaml @hmellor
/vllm/config/pooler.py @noooop
/vllm/pooling_params.py @noooop
/vllm/model_executor/layers/pooler.py @noooop
# Security guide and policies
/docs/usage/security.md @russellb
/SECURITY.md @russellb
/docs/contributing/vulnerability_management.md @russellb

.github/mergify.yml

@@ -108,7 +108,7 @@ pull_request_rules:
- files~=^benchmarks/
- files~=^vllm/benchmarks/
- files~=^tests/benchmarks/
- files~=^\.buildkite/nightly-benchmarks/
- files~=^\.buildkite/performance-benchmarks/
actions:
label:
add:
@@ -151,6 +151,23 @@ pull_request_rules:
add:
- gpt-oss
- name: label-nvidia
description: Automatically apply nvidia label
conditions:
- label != stale
- or:
- files~=cuda
- files~=cutlass
- files~=flashinfer
- files~=trtllm
- title~=(?i)NVIDIA
- title~=(?i)CUDA
- title~=(?i)CUTLASS
actions:
label:
add:
- nvidia
- name: label-rocm
description: Automatically apply rocm label
conditions:

.github/workflows/macos-smoke-test.yml (new file)

@@ -0,0 +1,81 @@
name: macOS Apple Silicon Smoke Test
on:
push:
branches:
- main
workflow_dispatch: # Manual trigger
jobs:
macos-m1-smoke-test:
runs-on: macos-latest
timeout-minutes: 20
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v7
with:
enable-cache: true
cache-dependency-glob: |
requirements/**/*.txt
pyproject.toml
python-version: '3.12'
- name: Create virtual environment
run: |
uv venv
echo "$GITHUB_WORKSPACE/.venv/bin" >> "$GITHUB_PATH"
- name: Install dependencies and build vLLM
run: |
uv pip install -r requirements/cpu.txt --index-strategy unsafe-best-match
uv pip install -e .
env:
CMAKE_BUILD_PARALLEL_LEVEL: 4
- name: Verify installation
run: |
python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
- name: Smoke test vllm serve
timeout-minutes: 10
run: |
# Start server in background
vllm serve Qwen/Qwen3-0.6B \
--max-model-len=2048 \
--load-format=dummy \
--enforce-eager \
--port 8000 &
SERVER_PID=$!
# Wait for server to start
for i in {1..30}; do
if curl -s http://localhost:8000/health > /dev/null; then
echo "Server started successfully"
break
fi
if [ "$i" -eq 30 ]; then
echo "Server failed to start"
kill "$SERVER_PID"
exit 1
fi
sleep 2
done
# Test health endpoint
curl -f http://localhost:8000/health
# Test completion
curl -f http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"prompt": "Hello",
"max_tokens": 5
}'
# Cleanup
kill "$SERVER_PID"

.gitignore

@@ -94,6 +94,9 @@ ipython_config.py
# generated files
**/generated/**
# uv
uv.lock
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
@@ -218,3 +221,6 @@ csrc/moe/marlin_moe_wna16/kernel_*
# Ignore ep_kernels_workspace folder
ep_kernels_workspace/
# Allow tracked library source folders under submodules (e.g., benchmarks/lib)
!vllm/benchmarks/lib/


@@ -3,11 +3,9 @@ MD007:
MD013: false
MD024:
siblings_only: true
MD031:
list_items: false
MD033: false
MD042: false
MD045: false
MD046: false
MD051: false
MD052: false
MD053: false
MD059: false


@@ -38,18 +38,18 @@ repos:
rev: 0.9.1
hooks:
- id: pip-compile
args: [requirements/test.in, -o, requirements/test.txt, --index-strategy, unsafe-best-match, --torch-backend, cu128, --python-platform, x86_64-manylinux_2_28]
args: [requirements/test.in, -o, requirements/test.txt, --index-strategy, unsafe-best-match, --torch-backend, cu129, --python-platform, x86_64-manylinux_2_28, --python-version, "3.12"]
files: ^requirements/test\.(in|txt)$
- repo: local
hooks:
- id: format-torch-nightly-test
name: reformat nightly_torch_test.txt to be in sync with test.in
language: python
entry: python tools/generate_nightly_torch_test.py
entry: python tools/pre_commit/generate_nightly_torch_test.py
files: ^requirements/test\.(in|txt)$
- id: mypy-local
name: Run mypy for local Python installation
entry: python tools/pre_commit/mypy.py 0 "local"
name: Run mypy locally for lowest supported Python version
entry: python tools/pre_commit/mypy.py 0 "3.10"
stages: [pre-commit] # Don't run in CI
<<: &mypy_common
language: python
@@ -78,12 +78,12 @@ repos:
stages: [manual] # Only run in CI
- id: shellcheck
name: Lint shell scripts
entry: tools/shellcheck.sh
entry: tools/pre_commit/shellcheck.sh
language: script
types: [shell]
- id: png-lint
name: Lint PNG exports from excalidraw
entry: tools/png-lint.sh
entry: tools/pre_commit/png-lint.sh
language: script
types: [png]
- id: signoff-commit
@@ -100,12 +100,12 @@ repos:
stages: [commit-msg]
- id: check-spdx-header
name: Check SPDX headers
entry: python tools/check_spdx_header.py
entry: python tools/pre_commit/check_spdx_header.py
language: python
types: [python]
- id: check-root-lazy-imports
name: Check root lazy imports
entry: python tools/check_init_lazy_imports.py
entry: python tools/pre_commit/check_init_lazy_imports.py
language: python
types: [python]
- id: check-filenames
@@ -119,11 +119,11 @@ repos:
pass_filenames: false
- id: update-dockerfile-graph
name: Update Dockerfile dependency graph
entry: tools/update-dockerfile-graph.sh
entry: tools/pre_commit/update-dockerfile-graph.sh
language: script
- id: enforce-import-regex-instead-of-re
name: Enforce import regex as re
entry: python tools/enforce_regex_import.py
entry: python tools/pre_commit/enforce_regex_import.py
language: python
types: [python]
pass_filenames: false
@@ -131,7 +131,7 @@ repos:
# forbid directly import triton
- id: forbid-direct-triton-import
name: "Forbid direct 'import triton'"
entry: python tools/check_triton_import.py
entry: python tools/pre_commit/check_triton_import.py
language: python
types: [python]
pass_filenames: false
@@ -144,7 +144,7 @@ repos:
additional_dependencies: [regex]
- id: validate-config
name: Validate configuration has default values and that each field has a docstring
entry: python tools/validate_config.py
entry: python tools/pre_commit/validate_config.py
language: python
additional_dependencies: [regex]
# Keep `suggestion` last


@@ -39,6 +39,13 @@ set(PYTHON_SUPPORTED_VERSIONS "3.10" "3.11" "3.12" "3.13")
# Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151")
# ROCm installation prefix. Default to /opt/rocm but allow override via
# -DROCM_PATH=/your/rocm/path when invoking cmake.
if(NOT DEFINED ROCM_PATH)
set(ROCM_PATH "/opt/rocm" CACHE PATH "ROCm installation prefix")
else()
set(ROCM_PATH ${ROCM_PATH} CACHE PATH "ROCm installation prefix" FORCE)
endif()
#
# Supported/expected torch versions for CUDA/ROCm.
#
@@ -49,8 +56,8 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1
# requirements.txt files and should be kept consistent. The ROCm torch
# versions are derived from docker/Dockerfile.rocm
#
set(TORCH_SUPPORTED_VERSION_CUDA "2.8.0")
set(TORCH_SUPPORTED_VERSION_ROCM "2.8.0")
set(TORCH_SUPPORTED_VERSION_CUDA "2.9.0")
set(TORCH_SUPPORTED_VERSION_ROCM "2.9.0")
#
# Try to find python package with an executable that exactly matches
@@ -237,11 +244,28 @@ set_gencode_flags_for_srcs(
SRCS "${VLLM_CUMEM_EXT_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}")
if(VLLM_GPU_LANG STREQUAL "CUDA")
if(VLLM_GPU_LANG STREQUAL "CUDA" OR VLLM_GPU_LANG STREQUAL "HIP")
message(STATUS "Enabling cumem allocator extension.")
# link against cuda driver library
list(APPEND CUMEM_LIBS CUDA::cuda_driver)
define_gpu_extension_target(
if(VLLM_GPU_LANG STREQUAL "CUDA")
# link against cuda driver library
list(APPEND CUMEM_LIBS CUDA::cuda_driver)
else()
# link against rocm driver library. Prefer an absolute path to
# libamdhip64.so inside ${ROCM_PATH}/lib if available, otherwise fall
# back to linking by name "amdhip64".
find_library(AMDHIP64_LIB
NAMES amdhip64 libamdhip64.so
PATHS ${ROCM_PATH}/lib
NO_DEFAULT_PATH)
if(AMDHIP64_LIB)
message(STATUS "Found libamdhip64 at ${AMDHIP64_LIB}")
list(APPEND CUMEM_LIBS ${AMDHIP64_LIB})
else()
message(WARNING "libamdhip64 not found in ${ROCM_PATH}/lib; falling back to linking 'amdhip64' by name")
list(APPEND CUMEM_LIBS amdhip64)
endif()
endif()
define_extension_target(
cumem_allocator
DESTINATION vllm
LANGUAGE CXX
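As a usage sketch for the ROCM_PATH override introduced above (the prefix below is a placeholder; any ROCm install containing lib/libamdhip64.so would work):

# Hypothetical configure step against a non-default ROCm prefix.
# If libamdhip64.so is not found under the prefix, the build falls
# back to linking 'amdhip64' by name, per the CMake logic above.
cmake -S . -B build -DROCM_PATH=/opt/rocm-6.2
cmake --build build -j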
@@ -265,6 +289,7 @@ set(VLLM_EXT_SRC
"csrc/pos_encoding_kernels.cu"
"csrc/activation_kernels.cu"
"csrc/layernorm_kernels.cu"
"csrc/fused_qknorm_rope_kernel.cu"
"csrc/layernorm_quant_kernels.cu"
"csrc/sampler.cu"
"csrc/cuda_view.cu"
@@ -330,7 +355,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# Keep building Marlin for 9.0 as there are some group sizes and shapes that
# are not supported by Machete yet.
# 9.0 for latest bf16 atomicAdd PTX
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.7;9.0+PTX" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0+PTX;9.0+PTX" "${CUDA_ARCHS}")
if (MARLIN_ARCHS)
#
@@ -836,7 +861,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()
# Hadacore kernels
cuda_archs_loose_intersection(HADACORE_ARCHS "8.0;8.9;9.0" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(HADACORE_ARCHS "8.0+PTX;9.0+PTX" "${CUDA_ARCHS}")
if(HADACORE_ARCHS)
set(SRCS "csrc/quantization/hadamard/hadacore/hadamard_transform_cuda.cu")
set_gencode_flags_for_srcs(
@@ -858,7 +883,7 @@ if (VLLM_GPU_LANG STREQUAL "HIP")
endif()
message(STATUS "Enabling C extension.")
define_gpu_extension_target(
define_extension_target(
_C
DESTINATION vllm
LANGUAGE ${VLLM_GPU_LANG}
@@ -883,6 +908,7 @@ target_compile_definitions(_C PRIVATE CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1)
set(VLLM_MOE_EXT_SRC
"csrc/moe/torch_bindings.cpp"
"csrc/moe/moe_align_sum_kernels.cu"
"csrc/moe/moe_lora_align_sum_kernels.cu"
"csrc/moe/topk_softmax_kernels.cu")
if(VLLM_GPU_LANG STREQUAL "CUDA")
@@ -913,7 +939,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
list(APPEND VLLM_MOE_EXT_SRC "${VLLM_MOE_WNA16_SRC}")
# 9.0 for latest bf16 atomicAdd PTX
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.7;9.0+PTX" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0+PTX;9.0+PTX" "${CUDA_ARCHS}")
if (MARLIN_MOE_ARCHS)
#
@@ -972,7 +998,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()
message(STATUS "Enabling moe extension.")
define_gpu_extension_target(
define_extension_target(
_moe_C
DESTINATION vllm
LANGUAGE ${VLLM_GPU_LANG}
@@ -993,7 +1019,7 @@ if(VLLM_GPU_LANG STREQUAL "HIP")
"csrc/rocm/skinny_gemms.cu"
"csrc/rocm/attention.cu")
define_gpu_extension_target(
define_extension_target(
_rocm_C
DESTINATION vllm
LANGUAGE ${VLLM_GPU_LANG}

View File

@@ -21,6 +21,9 @@ Join us at the [PyTorch Conference, October 22-23](https://events.linuxfoundatio
*Latest News* 🔥
- [2025/11] We hosted [the first vLLM Europe Meetup in Zurich](https://luma.com/0gls27kb) focused on quantization, distributed inference, and reinforcement learning at scale with speakers from Mistral, IBM, and Red Hat. Please find the meetup slides [here](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) and recording [here](https://www.youtube.com/watch?v=6m6ZE6yVEDI).
- [2025/11] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w) focusing on distributed inference and diverse accelerator support with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link).
- [2025/10] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg) focused on hands-on vLLM inference optimization! Please find the meetup slides [here](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6).
- [2025/09] We hosted [vLLM Toronto Meetup](https://luma.com/e80e0ymm) focused on tackling inference at scale and speculative decoding with speakers from NVIDIA and Red Hat! Please find the meetup slides [here](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing).
- [2025/08] We hosted [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ) focusing on the ecosystem around vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA).
- [2025/08] We hosted [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet). We shared V1 updates, disaggregated serving and MLLM speedups with speakers from Embedded LLM, AMD, WekaIO, and A*STAR. Please find the meetup slides [here](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing).
@@ -82,7 +85,7 @@ vLLM is flexible and easy to use with:
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support

View File

@@ -0,0 +1,380 @@
#!/usr/bin/env python3
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
Benchmark to measure the performance overhead of VLLM_BATCH_INVARIANT mode.
This benchmark runs the same workload twice:
1. With VLLM_BATCH_INVARIANT=0 (baseline)
2. With VLLM_BATCH_INVARIANT=1 (batch invariant mode)
And reports the timing and throughput metrics for comparison.
Environment variables:
VLLM_BENCH_MODEL: Model to benchmark (default: "Qwen/Qwen3-1.7B")
VLLM_BENCH_TP_SIZE: Tensor parallel size (default: 1, use 8 for deepseek)
VLLM_BENCH_BATCH_SIZE: Max batch size (default: 128)
VLLM_BENCH_NUM_TRIALS: Number of trials to run (default: 5)
VLLM_BENCH_MIN_PROMPT: Min prompt length in words (default: 1024)
VLLM_BENCH_MAX_PROMPT: Max prompt length in words (default: 2048)
VLLM_BENCH_MAX_TOKENS: Max tokens to generate (default: 128)
VLLM_BENCH_TEMPERATURE: Temperature for sampling (default: 0.0)
VLLM_BENCH_GPU_MEMORY_UTILIZATION: GPU memory utilization (default: 0.4)
VLLM_BENCH_MAX_MODEL_LEN: Max model length (default: 5120)
VLLM_BENCH_BACKEND: Attention backend (default: FLASH_ATTN)
Example usage:
# Benchmark qwen3 (default)
python benchmarks/benchmark_batch_invariance.py
# Benchmark deepseek with 8 GPUs
VLLM_BENCH_MODEL="deepseek-ai/DeepSeek-V3" VLLM_BENCH_TP_SIZE=8 \\
python benchmarks/benchmark_batch_invariance.py
# Quick test with fewer trials
VLLM_BENCH_NUM_TRIALS=2 VLLM_BENCH_BATCH_SIZE=32 \\
python benchmarks/benchmark_batch_invariance.py
"""
import contextlib
import os
import random
import time
from vllm import LLM, SamplingParams
from vllm.platforms import current_platform
def _random_prompt(min_words: int = 1024, max_words: int = 1024 * 2) -> str:
"""Generate a random prompt for benchmarking."""
prompt_templates = [
"Question: What is the capital of France?\nAnswer: The capital of France is",
"Q: How does photosynthesis work?\nA: Photosynthesis is the process by which",
"User: Can you explain quantum mechanics?\nAssistant: Quantum mechanics is",
"Once upon a time in a distant galaxy, there lived",
"The old man walked slowly down the street, remembering",
"In the year 2157, humanity finally discovered",
"To implement a binary search tree in Python, first we need to",
"The algorithm works by iterating through the array and",
"Here's how to optimize database queries using indexing:",
"The Renaissance was a period in European history that",
"Climate change is caused by several factors including",
"The human brain contains approximately 86 billion neurons which",
"I've been thinking about getting a new laptop because",
"Yesterday I went to the store and bought",
"My favorite thing about summer is definitely",
]
base_prompt = random.choice(prompt_templates)
if max_words < min_words:
max_words = min_words
target_words = random.randint(min_words, max_words)
if target_words > 50:
padding_text = (
" This is an interesting topic that deserves more explanation. "
* (target_words // 50)
)
base_prompt = base_prompt + padding_text
return base_prompt
def run_benchmark_with_batch_invariant(
model: str,
tp_size: int,
max_batch_size: int,
num_trials: int,
min_prompt: int,
max_prompt: int,
max_tokens: int,
temperature: float,
gpu_mem_util: float,
max_model_len: int,
backend: str,
batch_invariant: bool,
seed: int = 12345,
) -> dict:
"""
Run the benchmark with the specified configuration.
Returns a dict with timing and throughput metrics.
"""
random.seed(seed)
# Set environment variables
os.environ["VLLM_ATTENTION_BACKEND"] = backend
if batch_invariant:
os.environ["VLLM_BATCH_INVARIANT"] = "1"
else:
os.environ["VLLM_BATCH_INVARIANT"] = "0"
print(f"\n{'=' * 80}")
print(f"BENCHMARK: VLLM_BATCH_INVARIANT={int(batch_invariant)}")
print(f" Model: {model}")
print(f" TP Size: {tp_size}")
print(f" Backend: {backend}")
print(f" Max Batch Size: {max_batch_size}")
print(f" Trials: {num_trials}")
print(f" Max Tokens: {max_tokens}")
print(f"{'=' * 80}\n")
sampling = SamplingParams(
temperature=temperature,
top_p=0.95,
max_tokens=max_tokens,
seed=20240919,
)
needle_prompt = "There once was a "
llm = None
try:
# Create LLM engine
start_init = time.perf_counter()
llm = LLM(
model=model,
max_num_seqs=max_batch_size,
gpu_memory_utilization=gpu_mem_util,
max_model_len=max_model_len,
dtype="bfloat16",
tensor_parallel_size=tp_size,
enable_prefix_caching=False,
)
init_time = time.perf_counter() - start_init
print(f"Engine initialization time: {init_time:.2f}s\n")
# Generate baseline
print("Generating baseline (warmup)...")
baseline_out = llm.generate([needle_prompt], sampling)
assert len(baseline_out) == 1
baseline_text = baseline_out[0].outputs[0].text
print(f"Baseline output: '{baseline_text[:50]}...'\n")
# Run trials and measure timing
trial_times: list[float] = []
total_tokens = 0
total_prompts = 0
for trial in range(num_trials):
# Create a batch
prompts: list[str] = []
batch_size = random.randint(max_batch_size // 2, max_batch_size)
needle_pos = random.randint(0, batch_size - 1)
for i in range(batch_size):
if i == needle_pos:
prompts.append(needle_prompt)
else:
prompts.append(_random_prompt(min_prompt, max_prompt))
# Measure time for this trial
start_time = time.perf_counter()
outputs = llm.generate(prompts, sampling)
trial_time = time.perf_counter() - start_time
trial_times.append(trial_time)
total_prompts += len(prompts)
# Count tokens
for output in outputs:
if output.outputs:
total_tokens += len(output.outputs[0].token_ids)
print(
f"Trial {trial + 1}/{num_trials}: "
f"batch_size={batch_size}, "
f"time={trial_time:.2f}s"
)
# Verify needle output still matches
needle_output = outputs[needle_pos]
assert needle_output.prompt == needle_prompt
# Compute statistics
avg_time = sum(trial_times) / len(trial_times)
min_time = min(trial_times)
max_time = max(trial_times)
throughput = total_tokens / sum(trial_times)
prompts_per_sec = total_prompts / sum(trial_times)
print(f"\n{'=' * 80}")
print("RESULTS:")
print(f" Average time per trial: {avg_time:.2f}s")
print(f" Min time: {min_time:.2f}s")
print(f" Max time: {max_time:.2f}s")
print(f" Total tokens generated: {total_tokens}")
print(f" Total prompts processed: {total_prompts}")
print(f" Throughput: {throughput:.2f} tokens/s")
print(f" Prompts/s: {prompts_per_sec:.2f}")
print(f"{'=' * 80}\n")
return {
"init_time": init_time,
"avg_time": avg_time,
"min_time": min_time,
"max_time": max_time,
"total_tokens": total_tokens,
"total_prompts": total_prompts,
"throughput": throughput,
"prompts_per_sec": prompts_per_sec,
"trial_times": trial_times,
}
finally:
# Cleanup
if llm is not None:
with contextlib.suppress(Exception):
llm.shutdown()
def main():
# Check platform support
if not (current_platform.is_cuda() and current_platform.has_device_capability(90)):
print("ERROR: Requires CUDA and >= Hopper (SM90)")
print(f"Current platform: {current_platform.device_type}")
if current_platform.is_cuda():
print(f"Device capability: {current_platform.get_device_capability()}")
return 1
# Read configuration from environment
model = os.getenv("VLLM_BENCH_MODEL", "Qwen/Qwen3-1.7B")
tp_size = int(os.getenv("VLLM_BENCH_TP_SIZE", "1"))
max_batch_size = int(os.getenv("VLLM_BENCH_BATCH_SIZE", "128"))
num_trials = int(os.getenv("VLLM_BENCH_NUM_TRIALS", "5"))
min_prompt = int(os.getenv("VLLM_BENCH_MIN_PROMPT", "1024"))
max_prompt = int(os.getenv("VLLM_BENCH_MAX_PROMPT", "2048"))
max_tokens = int(os.getenv("VLLM_BENCH_MAX_TOKENS", "128"))
temperature = float(os.getenv("VLLM_BENCH_TEMPERATURE", "0.0"))
gpu_mem_util = float(os.getenv("VLLM_BENCH_GPU_MEMORY_UTILIZATION", "0.4"))
max_model_len = int(os.getenv("VLLM_BENCH_MAX_MODEL_LEN", "5120"))
backend = os.getenv("VLLM_BENCH_BACKEND", "FLASH_ATTN")
print("\n" + "=" * 80)
print("VLLM BATCH INVARIANCE BENCHMARK")
print("=" * 80)
print("\nConfiguration:")
print(f" Model: {model}")
print(f" Tensor Parallel Size: {tp_size}")
print(f" Attention Backend: {backend}")
print(f" Max Batch Size: {max_batch_size}")
print(f" Number of Trials: {num_trials}")
print(f" Prompt Length Range: {min_prompt}-{max_prompt} words")
print(f" Max Tokens to Generate: {max_tokens}")
print(f" Temperature: {temperature}")
print(f" GPU Memory Utilization: {gpu_mem_util}")
print(f" Max Model Length: {max_model_len}")
print("=" * 80)
# Run benchmark WITHOUT batch invariance (baseline)
print("\n" + "=" * 80)
print("PHASE 1: Running WITHOUT batch invariance (baseline)")
print("=" * 80)
baseline_results = run_benchmark_with_batch_invariant(
model=model,
tp_size=tp_size,
max_batch_size=max_batch_size,
num_trials=num_trials,
min_prompt=min_prompt,
max_prompt=max_prompt,
max_tokens=max_tokens,
temperature=temperature,
gpu_mem_util=gpu_mem_util,
max_model_len=max_model_len,
backend=backend,
batch_invariant=False,
)
# Run benchmark WITH batch invariance
print("\n" + "=" * 80)
print("PHASE 2: Running WITH batch invariance")
print("=" * 80)
batch_inv_results = run_benchmark_with_batch_invariant(
model=model,
tp_size=tp_size,
max_batch_size=max_batch_size,
num_trials=num_trials,
min_prompt=min_prompt,
max_prompt=max_prompt,
max_tokens=max_tokens,
temperature=temperature,
gpu_mem_util=gpu_mem_util,
max_model_len=max_model_len,
backend=backend,
batch_invariant=True,
)
# Compare results
print("\n" + "=" * 80)
print("COMPARISON: Batch Invariance vs Baseline")
print("=" * 80)
init_overhead_pct = (
(batch_inv_results["init_time"] - baseline_results["init_time"])
/ baseline_results["init_time"]
* 100
)
time_overhead_pct = (
(batch_inv_results["avg_time"] - baseline_results["avg_time"])
/ baseline_results["avg_time"]
* 100
)
throughput_change_pct = (
(batch_inv_results["throughput"] - baseline_results["throughput"])
/ baseline_results["throughput"]
* 100
)
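# Example with made-up numbers: a 10.0s baseline avg vs a 12.0s batch-invariant
# avg gives time_overhead_pct = (12.0 - 10.0) / 10.0 * 100 = +20.0%; the same
# formula applied to throughput yields a signed percentage change.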
print("\nInitialization Time:")
print(f" Baseline: {baseline_results['init_time']:.2f}s")
print(f" Batch Invariant: {batch_inv_results['init_time']:.2f}s")
print(f" Overhead: {init_overhead_pct:+.2f}%")
print("\nAverage Trial Time:")
print(f" Baseline: {baseline_results['avg_time']:.2f}s")
print(f" Batch Invariant: {batch_inv_results['avg_time']:.2f}s")
print(f" Overhead: {time_overhead_pct:+.2f}%")
print("\nThroughput (tokens/s):")
print(f" Baseline: {baseline_results['throughput']:.2f}")
print(f" Batch Invariant: {batch_inv_results['throughput']:.2f}")
print(f" Change: {throughput_change_pct:+.2f}%")
print("\nPrompts/s:")
print(f" Baseline: {baseline_results['prompts_per_sec']:.2f}")
print(f" Batch Invariant: {batch_inv_results['prompts_per_sec']:.2f}")
print("\n" + "=" * 80)
print("SUMMARY")
print("=" * 80)
if time_overhead_pct > 0:
print(
f"Batch invariance mode adds approximately {time_overhead_pct:.1f}% "
"overhead"
)
else:
print(
f"Batch invariance mode is approximately {-time_overhead_pct:.1f}% "
"faster (unexpected!)"
)
if abs(throughput_change_pct) < 1.0:
print("Throughput difference is negligible (< 1%)")
elif throughput_change_pct < 0:
print(
f"Throughput decreased by {-throughput_change_pct:.1f}% "
"with batch invariance"
)
else:
print(
f"Throughput increased by {throughput_change_pct:.1f}% "
"with batch invariance (unexpected!)"
)
print("=" * 80 + "\n")
return 0
if __name__ == "__main__":
exit(main())

View File

@@ -5,7 +5,7 @@ import gc
from benchmark_utils import TimeCollector
from tabulate import tabulate
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.v1.core.block_pool import BlockPool

View File

@@ -46,7 +46,7 @@ import time
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
def test_long_document_qa(llm=None, sampling_params=None, prompts=None):

View File

@@ -19,7 +19,7 @@ from vllm.config import (
VllmConfig,
)
from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.v1.spec_decode.ngram_proposer import NgramProposer
from vllm.v1.worker.gpu_input_batch import InputBatch
from vllm.v1.worker.gpu_model_runner import GPUModelRunner

View File

@@ -37,7 +37,7 @@ from transformers import PreTrainedTokenizerBase
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
try:
from vllm.transformers_utils.tokenizer import get_tokenizer
@@ -69,7 +69,7 @@ def sample_tokens(tokenizer: PreTrainedTokenizerBase, length: int) -> list[int]:
# Remove the special tokens.
return random.choices(
[v for k, v in vocab.items() if k not in all_special_ids],
[v for v in vocab.values() if v not in all_special_ids],
k=length,
)

View File

@@ -11,7 +11,7 @@ import time
from transformers import AutoTokenizer, PreTrainedTokenizerBase
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
# Select an equi-probable random priority

View File

@@ -31,6 +31,7 @@ import time
import uuid
import warnings
from collections.abc import AsyncGenerator
from contextlib import nullcontext
from dataclasses import dataclass
import datasets
@@ -50,7 +51,7 @@ except ImportError:
from backend_request_func import get_tokenizer
try:
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
except ImportError:
from argparse import ArgumentParser as FlexibleArgumentParser
@@ -501,15 +502,9 @@ async def benchmark(
pbar = None if disable_tqdm else tqdm(total=len(input_requests))
# This can be used once the minimum Python version is 3.10 or higher,
# and it will simplify the code in limited_request_func.
# semaphore = (asyncio.Semaphore(max_concurrency)
# if max_concurrency else contextlib.nullcontext())
semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None
semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else nullcontext()
async def limited_request_func(request_func_input, pbar):
if semaphore is None:
return await request_func(request_func_input=request_func_input, pbar=pbar)
async with semaphore:
return await request_func(request_func_input=request_func_input, pbar=pbar)
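The switch to nullcontext() works because contextlib.nullcontext supports "async with" starting in Python 3.10, so both branches now produce a context manager and the "if semaphore is None" fast path can be deleted. A minimal standalone sketch of the pattern (not the benchmark's exact code):

import asyncio
from contextlib import nullcontext

async def main(max_concurrency: int | None = 2) -> None:
    # Real semaphore when a limit is set, no-op context manager otherwise.
    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else nullcontext()

    async def limited(i: int) -> int:
        # Works in both cases; no None check needed.
        async with semaphore:
            await asyncio.sleep(0)
            return i

    print(await asyncio.gather(*(limited(i) for i in range(4))))

asyncio.run(main())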

View File

@@ -15,7 +15,7 @@ from utils import make_rand_sparse_tensors
from weight_shapes import WEIGHT_SHAPES
from vllm import _custom_ops as ops
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]

View File

@@ -18,7 +18,8 @@ from vllm import _custom_ops as ops
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
w8a8_triton_block_scaled_mm,
)
from vllm.utils import FlexibleArgumentParser, cdiv
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.math_utils import cdiv
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]

View File

@@ -1,10 +1,18 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import os
# Disable DeepGEMM for this benchmark to use CUTLASS
os.environ["VLLM_USE_DEEP_GEMM"] = "0"
import torch
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
apply_w8a8_block_fp8_linear,
W8A8BlockFp8LinearOp,
)
from vllm.model_executor.layers.quantization.utils.quant_utils import (
GroupShape,
)
from vllm.model_executor.layers.quantization.utils.w8a8_utils import (
CUTLASS_BLOCK_FP8_SUPPORTED,
@@ -39,13 +47,14 @@ def build_w8a8_block_fp8_runner(M, N, K, block_size, device, use_cutlass):
fp8_info = torch.finfo(torch.float8_e4m3fn)
fp8_max, fp8_min = fp8_info.max, fp8_info.min
# Create random FP8 tensors
# Create random input tensor (bfloat16, will be quantized by W8A8BlockFp8LinearOp)
A_ref = (torch.rand(M, K, dtype=torch.bfloat16, device=device) - 0.5) * 2 * fp8_max
# Create quantized weight tensor
B_ref = (torch.rand(N, K, dtype=torch.bfloat16, device=device) - 0.5) * 2 * fp8_max
B = B_ref.clamp(min=fp8_min, max=fp8_max).to(torch.float8_e4m3fn)
# Create scales
# Create weight scales
block_n, block_k = block_size[0], block_size[1]
n_tiles = (N + block_n - 1) // block_n
k_tiles = (K + block_k - 1) // block_k
@@ -55,19 +64,25 @@ def build_w8a8_block_fp8_runner(M, N, K, block_size, device, use_cutlass):
* factor_for_scale
)
# SM90 CUTLASS requires row-major format for scales
if use_cutlass and current_platform.is_device_capability(90):
Bs = Bs.T.contiguous()
# Create W8A8BlockFp8LinearOp instance
weight_group_shape = GroupShape(block_n, block_k)
act_quant_group_shape = GroupShape(1, block_k) # Per-token, per-group quantization
linear_op = W8A8BlockFp8LinearOp(
weight_group_shape=weight_group_shape,
act_quant_group_shape=act_quant_group_shape,
cutlass_block_fp8_supported=use_cutlass,
use_aiter_and_is_supported=False,
)
def run():
if use_cutlass:
return apply_w8a8_block_fp8_linear(
A_ref, B, block_size, Bs, cutlass_block_fp8_supported=True
)
else:
return apply_w8a8_block_fp8_linear(
A_ref, B, block_size, Bs, cutlass_block_fp8_supported=False
)
return linear_op.apply(
input=A_ref,
weight=B,
weight_scale=Bs,
input_scale=None,
bias=None,
)
return run

View File

@@ -10,7 +10,8 @@ import torch
from vllm.model_executor.layers.quantization.input_quant_fp8 import QuantFP8
from vllm.model_executor.layers.quantization.utils.quant_utils import GroupShape
from vllm.triton_utils import triton
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
def with_triton_mode(fn):

View File

@@ -10,7 +10,8 @@ import vllm.model_executor.layers.activation # noqa F401
from vllm.model_executor.custom_op import CustomOp
from vllm.platforms import current_platform
from vllm.triton_utils import triton
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
batch_size_range = [1, 16, 32, 64, 128]
seq_len_range = [1, 16, 64, 128, 256, 512, 1024, 2048, 4096]

View File

@@ -28,7 +28,7 @@ except ImportError as e:
from bitblas import Matmul, MatmulConfig, auto_detect_nvidia_target
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
parser = FlexibleArgumentParser(
description="Benchmark BitBLAS int4 on a specific target."

View File

@@ -20,7 +20,7 @@ from vllm.model_executor.layers.fused_moe.config import (
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp4
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
from vllm.scalar_type import scalar_types
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
WEIGHT_SHAPES_MOE = {
"nvidia/DeepSeek-R1-FP4": [

View File

@@ -14,7 +14,7 @@ from vllm.model_executor.layers.fused_moe.config import fp8_w8a8_moe_quant_confi
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
# Weight shapes for different models: [num_experts, topk, hidden_size,
# intermediate_size]

View File

@@ -39,7 +39,7 @@ from vllm.distributed.device_communicators.pynccl_allocator import (
)
from vllm.distributed.device_communicators.symm_mem import SymmMemCommunicator
from vllm.logger import init_logger
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
logger = init_logger(__name__)

File diff suppressed because it is too large

View File

@@ -13,11 +13,11 @@ from vllm.model_executor.layers.fused_moe.fused_moe import (
fused_experts,
fused_topk,
)
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
DEFAULT_MODELS = [
"nm-testing/Mixtral-8x7B-Instruct-v0.1",
"nm-testing/deepseekv2-lite",
"mistralai/Mixtral-8x7B-Instruct-v0.1",
"deepseek-ai/DeepSeek-V2-Lite",
"ibm-granite/granite-3.0-1b-a400m",
"ibm-granite/granite-3.0-3b-a800m",
]

View File

@@ -7,7 +7,8 @@ import torch
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.platforms import current_platform
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
@torch.inference_mode()

View File

@@ -19,13 +19,24 @@ from torch.utils.benchmark import Measurement as TMeasurement
from utils import ArgPool, Bench, CudaGraphBenchParams
from weight_shapes import WEIGHT_SHAPES
from vllm.triton_utils import HAS_TRITON
from vllm.lora.ops.triton_ops.utils import get_lora_op_configs
from vllm.triton_utils import HAS_TRITON, triton
if HAS_TRITON:
from vllm.lora.ops.triton_ops import LoRAKernelMeta, lora_expand, lora_shrink
from vllm.lora.ops.triton_ops import ( ## added fused_moe_lora
LoRAKernelMeta,
fused_moe_lora_expand,
fused_moe_lora_shrink,
lora_expand,
lora_shrink,
)
from vllm.lora.ops.triton_ops.fused_moe_lora_op import (
_LORA_PTR_DICT, ## added _LORA_PTR_DICT for fused_moe_lora
)
from vllm.lora.ops.triton_ops.utils import _LORA_A_PTR_DICT, _LORA_B_PTR_DICT
from vllm.utils import FlexibleArgumentParser
from vllm import _custom_ops as ops
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.math_utils import round_up
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
DEFAULT_TP_SIZES = [1]
@@ -59,6 +70,8 @@ DEFAULT_NUM_LORAS = [1, 2, 3, 4]
DEFAULT_SORT_BY_LORA_IDS = [False, True]
DEFAULT_SEQ_LENGTHS = [1]
DEFAULT_EXPAND_FN_ADD_INPUTS = [True, False]
DEFAULT_TOP_K_NUMS = [1] # Added for MoE LoRA top_k
DEFAULT_NUM_EXPERTS = [8] # Added for MoE LoRA num_experts
# Utilities
@@ -191,6 +204,11 @@ class OpType(Enum):
LORA_SHRINK = auto()
LORA_EXPAND = auto()
## Adding support for fused moe lora
FUSED_MOE_LORA_GATE_UP_SHRINK = auto() ## Gate/Up projection variant with shrink
FUSED_MOE_LORA_GATE_UP_EXPAND = auto() ## Gate/Up projection variant with expand
FUSED_MOE_LORA_DOWN_SHRINK = auto() ## Down projection variant with shrink
FUSED_MOE_LORA_DOWN_EXPAND = auto() ## Down projection variant with expand
@staticmethod
def from_str(s: str) -> "OpType":
@@ -198,6 +216,15 @@ class OpType(Enum):
return OpType.LORA_SHRINK
if s.lower() == "lora_expand":
return OpType.LORA_EXPAND
# Adding support for fused moe lora, both in gate_up and down
if s.lower() == "fused_moe_lora_gate_up_shrink": ## Gate/Up variant with shrink
return OpType.FUSED_MOE_LORA_GATE_UP_SHRINK
if s.lower() == "fused_moe_lora_gate_up_expand": ## Gate/Up variant with expand
return OpType.FUSED_MOE_LORA_GATE_UP_EXPAND
if s.lower() == "fused_moe_lora_down_shrink": ## Down variant with shrink
return OpType.FUSED_MOE_LORA_DOWN_SHRINK
if s.lower() == "fused_moe_lora_down_expand": ## Down variant with expand
return OpType.FUSED_MOE_LORA_DOWN_EXPAND
raise ValueError(f"Unrecognized str {s} to convert to OpType")
def is_shrink_fn(self) -> bool:
@@ -206,19 +233,56 @@ class OpType(Enum):
def is_expand_fn(self) -> bool:
return self in [OpType.LORA_EXPAND]
def is_fused_moe_lora_fn(self) -> bool: ## adding for fused MoE LoRA
return self in [
OpType.FUSED_MOE_LORA_GATE_UP_SHRINK,
OpType.FUSED_MOE_LORA_DOWN_SHRINK,
OpType.FUSED_MOE_LORA_GATE_UP_EXPAND,
OpType.FUSED_MOE_LORA_DOWN_EXPAND,
]
def is_fused_moe_lora_gate_up_fn(
self,
) -> bool: ## adding for fused MoE LoRA Gate/Up
return self in [
OpType.FUSED_MOE_LORA_GATE_UP_SHRINK,
OpType.FUSED_MOE_LORA_GATE_UP_EXPAND,
]
def is_fused_moe_lora_down_fn(self) -> bool: ## adding for fused MoE LoRA Down
return self in [
OpType.FUSED_MOE_LORA_DOWN_SHRINK,
OpType.FUSED_MOE_LORA_DOWN_EXPAND,
]
def is_fused_moe_lora_shrink_fn(self) -> bool:
return self in [
OpType.FUSED_MOE_LORA_GATE_UP_SHRINK,
OpType.FUSED_MOE_LORA_DOWN_SHRINK,
]
def is_fused_moe_lora_expand_fn(self) -> bool:
return self in [
OpType.FUSED_MOE_LORA_GATE_UP_EXPAND,
OpType.FUSED_MOE_LORA_DOWN_EXPAND,
]
def num_slices(self) -> list[int]:
if self.is_fused_moe_lora_gate_up_fn():
return [2]
elif self.is_fused_moe_lora_down_fn():
return [1]
return [1, 2, 3]
def mkn(
self, batch_size: int, seq_length: int, hidden_size: int, lora_rank: int
) -> tuple[int, int, int]:
num_tokens = batch_size * seq_length
if self.is_shrink_fn():
if self.is_shrink_fn() or self.is_fused_moe_lora_fn():
m = num_tokens
k = hidden_size
n = lora_rank
else:
assert self.is_expand_fn()
elif self.is_expand_fn():
m = num_tokens
k = lora_rank
n = hidden_size
@@ -232,9 +296,36 @@ class OpType(Enum):
"""
if self.is_shrink_fn():
return op_dtype, op_dtype, torch.float32
else:
assert self.is_expand_fn()
elif self.is_expand_fn():
return torch.float32, op_dtype, op_dtype
else:
assert self.is_fused_moe_lora_fn()
return op_dtype, op_dtype, op_dtype
def matmul_shapes_fused_moe_lora(
self,
m: int,
n: int,
k: int,
num_loras: int,
num_slices: int,
top_k_num: int,
num_experts: int,
) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...]]:
if self.is_fused_moe_lora_shrink_fn():
input_shape = (
(m * top_k_num, n)
if self in [OpType.FUSED_MOE_LORA_DOWN_SHRINK]
else (m, n)
)
output_shape = (num_slices, m, top_k_num, k)
weight_shape = (num_loras, num_experts, k, n)
else:
assert self.is_fused_moe_lora_expand_fn()
input_shape = (num_slices, m, top_k_num, k)
output_shape = (m, top_k_num, n * num_slices)
weight_shape = (num_loras, num_experts, n, k)
return (input_shape, weight_shape, output_shape)
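# Worked example (toy sizes, not part of the PR): with m=4, n=8, k=2,
# num_loras=2, num_slices=2, top_k_num=2, num_experts=4, the gate_up shrink
# variant returns input (4, 8), weight (2, 4, 2, 8), output (2, 4, 2, 2);
# note the caller below passes hidden_size as n and lora_rank as k for shrink.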
def matmul_shapes(
self,
@@ -244,6 +335,8 @@ class OpType(Enum):
lora_rank: int,
num_loras: int,
num_slices: int,
top_k_num: int | None = None,
num_experts: int | None = None,
) -> tuple[tuple[int, ...], tuple[int, ...], tuple[int, ...]]:
"""
Given num_slices, return the shapes of the A, B, and C matrices
@@ -258,6 +351,16 @@ class OpType(Enum):
if self in [OpType.LORA_EXPAND]:
# LoRA expand kernels support num_slices inherently in the kernel
return ((num_slices, m, k), b_shape, (m, n * num_slices))
if self.is_fused_moe_lora_fn():
return self.matmul_shapes_fused_moe_lora(
m,
k,
n,
num_loras,
num_slices,
top_k_num,
num_experts,
)
raise ValueError(f"Unrecognized op_type {self}")
def bench_fn(self) -> Callable:
@@ -265,6 +368,16 @@ class OpType(Enum):
return lora_shrink
if self == OpType.LORA_EXPAND:
return lora_expand
if self in [
OpType.FUSED_MOE_LORA_GATE_UP_SHRINK,
OpType.FUSED_MOE_LORA_DOWN_SHRINK,
]:
return fused_moe_lora_shrink
if self in [
OpType.FUSED_MOE_LORA_GATE_UP_EXPAND,
OpType.FUSED_MOE_LORA_DOWN_EXPAND,
]:
return fused_moe_lora_expand
raise ValueError(f"Unrecognized optype {self}")
@@ -318,6 +431,8 @@ class BenchmarkContext:
sort_by_lora_id: bool
dtype: torch.dtype
seq_length: int | None = None
num_experts: int | None = None # num_experts for MoE based ops
top_k_num: int | None = None # top_k for MoE based ops
num_slices: int | None = None # num_slices for slice based ops
def with_seq_length(self, seq_length: int) -> "BenchmarkContext":
@@ -373,6 +488,11 @@ class BenchmarkTensors:
f"{dtype_to_str(self.output.dtype)}"
)
def get_num_tokens(self, size: int, top_k_num: int, op_type: OpType):
return (
size * top_k_num if op_type in [OpType.FUSED_MOE_LORA_DOWN_SHRINK] else size
)
@staticmethod
def make(
ctx: BenchmarkContext, op_type: OpType, device: str = "cuda"
@@ -385,6 +505,8 @@ class BenchmarkTensors:
ctx.lora_rank,
ctx.num_loras,
ctx.num_slices,
ctx.top_k_num,
ctx.num_experts,
)
a_type, b_type, c_type = op_type.matmul_dtypes(ctx.dtype)
input_tensor, lora_weights, output_tensor = make_rand_tensors(
@@ -432,17 +554,27 @@ class BenchmarkTensors:
prompt_lora_indices_tensor,
)
def sanity_check(self) -> None:
def sanity_check(self, ctx: BenchmarkContext, op_type: OpType) -> None:
"""
Fails asserts when non-conformality is detected.
"""
num_tokens = self.input.shape[-2]
num_tokens = (
self.input.shape[1]
if op_type.is_fused_moe_lora_expand_fn()
else self.input.shape[-2]
)
# check metadata tensors
assert torch.sum(self.seq_lens) == num_tokens
## In the down-shrink case, each token is repeated top_k_num times
assert num_tokens == self.get_num_tokens(
torch.sum(self.seq_lens), ctx.top_k_num, op_type
), f"Expected {num_tokens} tokens, but got {torch.sum(self.seq_lens)}"
num_seqs = self.seq_lens.shape[0]
# assert self.seq_start_loc.shape[0] == num_seqs
## In the down-shrink case, each prompt corresponds to top_k_num sequences
assert self.prompt_lora_mapping.shape[0] == num_seqs
assert self.lora_kernel_meta.token_lora_mapping.shape[0] == num_tokens
assert num_tokens == self.get_num_tokens(
self.lora_kernel_meta.token_lora_mapping.shape[0], ctx.top_k_num, op_type
)
def to_device(self, device: str):
"""
@@ -471,21 +603,111 @@ class BenchmarkTensors:
to_device(field) if field_name != "no_lora_flag_cpu" else field,
)
def metadata(self) -> tuple[int, int, int]:
def metadata(self, ctx: BenchmarkContext, op_type: OpType) -> tuple[int, int, int, int]:
"""
Return num_seqs, num_tokens, max_seq_len and num_slices
"""
num_seqs = self.seq_lens.shape[0]
num_tokens = self.lora_kernel_meta.token_lora_mapping.shape[0]
num_tokens = self.get_num_tokens(
self.lora_kernel_meta.token_lora_mapping.shape[0], ctx.top_k_num, op_type
)
max_seq_len = torch.max(self.seq_lens).item()
num_slices = len(self.lora_weights_lst)
return num_seqs, num_tokens, max_seq_len, num_slices
def as_lora_shrink_kwargs(self) -> dict[str, Any]:
self.sanity_check()
def fused_moe_lora_data_prepare(
self,
block_size: int,
token_lora_mapping: torch.Tensor,
ctx: BenchmarkContext,
):
def moe_lora_align_block_size(
topk_ids: torch.Tensor,
token_lora_mapping: torch.Tensor,
block_size: int,
num_experts: int,
max_loras: int,
expert_map: torch.Tensor | None = None,
pad_sorted_ids: bool = False,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Aligns tokens and experts into block-sized chunks for LoRA-based
mixture-of-experts (MoE) execution.
"""
max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1)
if pad_sorted_ids:
max_num_tokens_padded = round_up(max_num_tokens_padded, block_size)
sorted_ids = torch.empty(
(max_loras * max_num_tokens_padded,),
dtype=torch.int32,
device=topk_ids.device,
)
max_num_m_blocks = triton.cdiv(max_num_tokens_padded, block_size)
# Expert ids must be set to -1 by default to prevent a blank block
expert_ids = torch.empty(
(max_loras * max_num_m_blocks,),
dtype=torch.int32,
device=topk_ids.device,
)
num_tokens_post_pad = torch.empty(
(max_loras), dtype=torch.int32, device=topk_ids.device
)
ops.moe_lora_align_block_size(
topk_ids,
token_lora_mapping,
num_experts,
block_size,
max_loras,
max_num_tokens_padded,
max_num_m_blocks,
sorted_ids,
expert_ids,
num_tokens_post_pad,
)
if expert_map is not None:
expert_ids = expert_map[expert_ids]
return sorted_ids, expert_ids, num_tokens_post_pad
num_tokens = ctx.batch_size
curr_topk_ids = torch.randint(
0,
ctx.num_experts,
(num_tokens, ctx.top_k_num),
device="cuda",
dtype=torch.int32,
)
topk_weights = torch.randint(
0,
ctx.num_experts,
(num_tokens, ctx.top_k_num),
device="cuda",
dtype=torch.int32,
)
(sorted_token_ids_lora, expert_ids_lora, num_tokens_post_padded_lora) = (
moe_lora_align_block_size(
topk_ids=curr_topk_ids,
token_lora_mapping=token_lora_mapping,
block_size=block_size,
num_experts=ctx.num_experts,
max_loras=ctx.num_loras,
)
)
sorted_token_ids = sorted_token_ids_lora.view(ctx.num_loras, -1)
expert_ids = expert_ids_lora.view(ctx.num_loras, -1)
num_tokens_post_padded = num_tokens_post_padded_lora
return (topk_weights, sorted_token_ids, expert_ids, num_tokens_post_padded)
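To make the padding arithmetic in moe_lora_align_block_size concrete, here is a small pure-Python check with toy sizes (cdiv and round_up are re-implemented locally to mirror vllm.utils.math_utils; the numbers are illustrative, not from the PR):

def cdiv(a: int, b: int) -> int:
    return -(-a // b)  # ceiling division

def round_up(x: int, m: int) -> int:
    return cdiv(x, m) * m

num_tokens, top_k, num_experts, block_size = 5, 2, 4, 16
topk_numel = num_tokens * top_k  # 10 routed (token, expert) assignments
# Worst case: every expert's last block is nearly empty.
max_num_tokens_padded = topk_numel + num_experts * (block_size - 1)  # 70
assert round_up(max_num_tokens_padded, block_size) == 80  # pad_sorted_ids=True
assert cdiv(max_num_tokens_padded, block_size) == 5      # max_num_m_blocks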
def as_lora_shrink_kwargs(
self, ctx: BenchmarkContext, op_type: OpType
) -> dict[str, Any]:
self.sanity_check(ctx, op_type)
self.to_device(self.input.device)
_, num_tokens, _, num_slices = self.metadata()
_, num_tokens, _, num_slices = self.metadata(ctx, op_type)
# Sanity check matrix shapes.
i_shape, lw_shape, o_shape = (
@@ -520,11 +742,13 @@ class BenchmarkTensors:
"no_lora_flag_cpu": self.lora_kernel_meta.no_lora_flag_cpu,
}
def as_lora_expand_kwargs(self, add_inputs: bool) -> dict[str, Any]:
self.sanity_check()
def as_lora_expand_kwargs(
self, ctx: BenchmarkContext, op_type: OpType, add_inputs: bool
) -> dict[str, Any]:
self.sanity_check(ctx, op_type)
self.to_device(self.input.device)
_, num_tokens, _, num_slices = self.metadata()
_, num_tokens, _, num_slices = self.metadata(ctx, op_type)
# Sanity check matrix shapes.
i_shape, lw_shape, o_shape = (
@@ -561,18 +785,173 @@ class BenchmarkTensors:
"no_lora_flag_cpu": self.lora_kernel_meta.no_lora_flag_cpu,
}
def bench_fn_kwargs(
self, op_type: OpType, add_inputs: bool | None = None
def as_fused_moe_lora_shrink_kwargs(
self, ctx: BenchmarkContext, op_type: OpType
) -> dict[str, Any]:
if op_type.is_shrink_fn():
self.sanity_check(ctx, op_type)
self.to_device(self.input.device)
_, num_tokens, _, num_slices = self.metadata(ctx, op_type)
# Sanity check matrix shapes.
i_shape, lw_shape, o_shape = (
self.input.shape,
self.lora_weights_lst[0].shape,
self.output.shape,
)
# Expected input shape : [num_tokens, hidden_size] for gate_up
# Expected input shape : [top_k_num * num_tokens, hidden_size] for down
assert len(i_shape) == 2
assert i_shape[0] == num_tokens
hidden_size = i_shape[1]
# Expected lora weight shape [max_lora, num_experts, lora_rank, hidden_size]
assert len(lw_shape) == 4
assert lw_shape[-1] == hidden_size
lora_rank = lw_shape[-2]
# Expected output shape : [num_slices, num_tokens, top_k_num, lora_rank]
assert len(o_shape) == 4
assert (
o_shape
== (num_slices, num_tokens // ctx.top_k_num, ctx.top_k_num, lora_rank)
if op_type in [OpType.FUSED_MOE_LORA_DOWN_SHRINK]
else o_shape == (num_slices, num_tokens, ctx.top_k_num, lora_rank)
)
kernel_config = get_lora_op_configs(
op_type.name.lower(),
max_loras=lw_shape[0],
batch=num_tokens,
hidden_size=hidden_size,
rank=lora_rank,
num_slices=num_slices,
add_inputs=False,
)
(topk_weights, sorted_token_ids, expert_ids, num_tokens_post_padded) = (
self.fused_moe_lora_data_prepare(
block_size=kernel_config["BLOCK_SIZE_M"],
token_lora_mapping=self.lora_kernel_meta.token_lora_mapping,
ctx=ctx,
)
)
return {
"qcurr_hidden_states": self.input,
"lora_a_stacked": self.lora_weights_lst,
"a_intermediate_cache1": self.output,
"topk_weights": topk_weights,
"sorted_token_ids": sorted_token_ids,
"expert_ids": expert_ids,
"num_tokens_post_padded": num_tokens_post_padded,
"top_k_num": ctx.top_k_num,
"device": self.input.device,
"N": lora_rank,
"M": topk_weights.shape[0],
"EM": sorted_token_ids.shape[1],
"K": self.input.shape[1],
"num_tokens": num_tokens,
"num_experts": ctx.num_experts,
"num_slices": num_slices,
"shrink_block_size_m": kernel_config["BLOCK_SIZE_M"],
"shrink_block_size_n": kernel_config["BLOCK_SIZE_N"],
"shrink_block_size_k": kernel_config["BLOCK_SIZE_K"],
"shrink_group_size_m": kernel_config["GROUP_SIZE_M"],
"shrink_num_warps": kernel_config["NUM_WARPS"],
"shrink_num_stages": kernel_config["NUM_STAGES"],
"shrink_split_k": kernel_config.get("SPLIT_K", 1),
"mul_routed_weight": op_type.is_fused_moe_lora_down_fn(),
}
def as_fused_moe_lora_expand_kwargs(
self, ctx: BenchmarkContext, op_type: OpType
) -> dict[str, Any]:
self.sanity_check(ctx, op_type)
self.to_device(self.input.device)
_, num_tokens, _, num_slices = self.metadata(ctx, op_type)
# Sanity check matrix shapes.
i_shape, lw_shape, o_shape = (
self.input.shape,
self.lora_weights_lst[0].shape,
self.output.shape,
)
# Expected input shape : [num_slices, num_tokens, top_k_num, lora_rank]
assert len(i_shape) == 4
assert i_shape[0] == num_slices
assert i_shape[1] == num_tokens
lora_rank = i_shape[-1]
# Expected lora weight shape : [num_loras, num_experts, hidden_size, lora_rank]
assert len(lw_shape) == 4
assert lw_shape[-1] == lora_rank
hidden_size = lw_shape[-2]
# Expected output shape : [num_tokens, top_k_num, hidden_size * num_slices]
assert len(o_shape) == 3
assert o_shape == (num_tokens, ctx.top_k_num, hidden_size * num_slices)
kernel_config = get_lora_op_configs(
op_type.name.lower(),
max_loras=lw_shape[0],
batch=num_tokens,
hidden_size=hidden_size,
rank=lora_rank,
num_slices=num_slices,
add_inputs=False,
)
(topk_weights, sorted_token_ids, expert_ids, num_tokens_post_padded) = (
self.fused_moe_lora_data_prepare(
block_size=kernel_config["BLOCK_SIZE_M"],
token_lora_mapping=self.lora_kernel_meta.token_lora_mapping,
ctx=ctx,
)
)
return {
"a_intermediate_cache1": self.input,
"lora_b_stacked": self.lora_weights_lst,
"output": self.output,
"topk_weights": topk_weights,
"sorted_token_ids": sorted_token_ids,
"expert_ids": expert_ids,
"num_tokens_post_padded": num_tokens_post_padded,
"top_k_num": ctx.top_k_num,
"device": self.input.device,
"N": lora_rank,
"M": topk_weights.shape[0],
"EM": sorted_token_ids.shape[1],
"K": self.input.shape[1],
"num_tokens": num_tokens,
"num_experts": ctx.num_experts,
"num_slices": num_slices,
"max_lora_rank": lora_rank,
"w1_output_dim_size": lw_shape[2],
"expand_block_size_m": kernel_config["BLOCK_SIZE_M"],
"expand_block_size_n": kernel_config["BLOCK_SIZE_N"],
"expand_block_size_k": kernel_config["BLOCK_SIZE_K"],
"expand_group_size_m": kernel_config["GROUP_SIZE_M"],
"expand_num_warps": kernel_config["NUM_WARPS"],
"expand_num_stages": kernel_config["NUM_STAGES"],
"expand_split_k": kernel_config.get("SPLIT_K", 1),
"mul_routed_weight": op_type.is_fused_moe_lora_down_fn(),
}
def bench_fn_kwargs(
self, ctx: BenchmarkContext, op_type: OpType, add_inputs: bool | None = None
) -> dict[str, Any]:
if op_type.is_shrink_fn() or op_type.is_fused_moe_lora_fn():
assert add_inputs is None
else:
assert add_inputs is not None
if op_type == OpType.LORA_SHRINK:
return self.as_lora_shrink_kwargs()
return self.as_lora_shrink_kwargs(ctx, op_type)
if op_type == OpType.LORA_EXPAND:
return self.as_lora_expand_kwargs(add_inputs)
return self.as_lora_expand_kwargs(ctx, op_type, add_inputs)
if op_type.is_fused_moe_lora_shrink_fn():
return self.as_fused_moe_lora_shrink_kwargs(ctx, op_type)
if op_type.is_fused_moe_lora_expand_fn():
return self.as_fused_moe_lora_expand_kwargs(ctx, op_type)
raise ValueError(f"Unrecognized optype {self}")
def test_correctness(
@@ -617,7 +996,7 @@ def bench_optype(
test_correctness: bool = False,
) -> TMeasurement:
assert arg_pool_size >= 1
if op_type.is_shrink_fn():
if op_type.is_shrink_fn() or op_type.is_fused_moe_lora_fn():
assert expand_fn_add_inputs is None
else:
assert expand_fn_add_inputs is not None
@@ -627,23 +1006,30 @@ def bench_optype(
BenchmarkTensors.make(ctx, op_type) for _ in range(arg_pool_size)
]
for bt in bench_tensors:
bt.sanity_check()
bt.sanity_check(ctx, op_type)
# Test correctness of our implementation.
if test_correctness:
assert op_type in [OpType.LORA_SHRINK, OpType.LORA_EXPAND], (
f"Correctness testing is not supported for {op_type.name}."
)
assert all(
[bt.test_correctness(op_type, expand_fn_add_inputs) for bt in bench_tensors]
[
bt.test_correctness(ctx, op_type, expand_fn_add_inputs)
for bt in bench_tensors
]
)
# BenchmarkTensors -> dict (kwargs)
kwargs_list = [
bt.bench_fn_kwargs(op_type, add_inputs=expand_fn_add_inputs)
bt.bench_fn_kwargs(ctx, op_type, add_inputs=expand_fn_add_inputs)
for bt in bench_tensors
]
# Clear LoRA optimization hash-maps.
_LORA_A_PTR_DICT.clear()
_LORA_B_PTR_DICT.clear()
_LORA_PTR_DICT.clear()
# Run bench function so that _LORA_A_PTR_DICT and _LORA_B_PTR_DICT are set up
for kwargs in kwargs_list:
op_type.bench_fn()(**kwargs)
@@ -793,7 +1179,9 @@ def run(args: argparse.Namespace, bench_ctxs: list[BenchmarkContext]):
# Benchmark bench_op
expand_fn_add_inputs = (
[None] if bench_op.is_shrink_fn() else args.expand_fn_add_inputs
[None]
if bench_op.is_shrink_fn() or bench_op.is_fused_moe_lora_fn()
else args.expand_fn_add_inputs
)
for add_input_arg in expand_fn_add_inputs:
seq_len_timers.append(
@@ -831,12 +1219,22 @@ def as_benchmark_contexts(
hidden_sizes: list[int], lora_ranks: list[int], args: argparse.Namespace
) -> list[BenchmarkContext]:
ctxs: list[BenchmarkContext] = []
for batch_size, hidden_size, lora_rank, num_loras, sort_by_lora_id in product( # noqa
for (
batch_size,
hidden_size,
lora_rank,
num_loras,
sort_by_lora_id,
top_k_num,
num_experts,
) in product( # noqa
args.batch_sizes,
list(hidden_sizes),
lora_ranks,
args.num_loras,
args.sort_by_lora_id,
args.top_k_nums,
args.num_experts,
):
ctxs.append(
BenchmarkContext(
@@ -851,6 +1249,8 @@ def as_benchmark_contexts(
seq_length=None,
sort_by_lora_id=sort_by_lora_id,
dtype=args.dtype,
top_k_num=top_k_num,
num_experts=num_experts,
# To be filled based on the OpType to benchmark
num_slices=None,
)
@@ -1012,6 +1412,22 @@ if __name__ == "__main__":
),
)
p.add_argument(
"--top-k-nums",
nargs="+",
type=int,
default=DEFAULT_TOP_K_NUMS,
help="Top-K values for MoE LoRA operations",
)
p.add_argument(
"--num-experts",
nargs="+",
type=int,
default=DEFAULT_NUM_EXPERTS,
help="Number of experts for MoE LoRA operations",
)
parser = FlexibleArgumentParser(
description=f"""
Benchmark LoRA kernels:

View File

@@ -33,7 +33,7 @@ from vllm.model_executor.layers.quantization.utils.quant_utils import (
quantize_weights,
)
from vllm.scalar_type import ScalarType, scalar_types
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
DEFAULT_MODELS = ["meta-llama/Llama-3-8b", "meta-llama/Llama-2-70b-hf"]
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512, 1024]

View File

@@ -44,7 +44,7 @@ from vllm.model_executor.layers.quantization.utils.quant_utils import (
sort_weights,
)
from vllm.scalar_type import ScalarType, scalar_types
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
DEFAULT_MODELS = ["meta-llama/Llama-2-7b-hf/TP1"]
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]

View File

@@ -22,7 +22,7 @@ from vllm.model_executor.layers.fused_moe.fused_moe import *
from vllm.platforms import current_platform
from vllm.transformers_utils.config import get_config
from vllm.triton_utils import triton
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
FP8_DTYPE = current_platform.fp8_dtype()
@@ -211,7 +211,7 @@ def get_rocm_tuning_space(use_fp16):
num_warps_range = [1, 2, 4, 8]
group_m_range = [1, 4, 8, 16, 32]
num_stage_range = [2]
waves_per_eu_range = [0]
waves_per_eu_range = [0, 1, 2, 4]
matrix_instr_nonkdim_range = [16, 32] if use_fp16 else []
kpack_range = [1, 2] if use_fp16 else []
@@ -590,6 +590,7 @@ def main(args: argparse.Namespace):
"DeepseekV3ForCausalLM",
"DeepseekV32ForCausalLM",
"Glm4MoeForCausalLM",
"NemotronHForCausalLM",
):
E = config.n_routed_experts
topk = config.num_experts_per_tok
@@ -615,6 +616,11 @@ def main(args: argparse.Namespace):
topk = config.moe_topk[0]
intermediate_size = config.moe_intermediate_size[0]
hidden_size = config.hidden_size
elif config.architectures[0] in ["Qwen3OmniMoeForConditionalGeneration"]:
E = config.thinker_config.text_config.num_experts
topk = config.thinker_config.text_config.num_experts_per_tok
intermediate_size = config.thinker_config.text_config.moe_intermediate_size
hidden_size = config.thinker_config.text_config.hidden_size
else:
# Support for llama4
config = config.get_text_config()
@@ -631,7 +637,7 @@ def main(args: argparse.Namespace):
else:
ensure_divisibility(intermediate_size, args.tp_size, "intermediate_size")
shard_intermediate_size = 2 * intermediate_size // args.tp_size
dtype = torch.float16 if current_platform.is_rocm() else config.torch_dtype
dtype = torch.float16 if current_platform.is_rocm() else config.dtype
use_fp8_w8a8 = args.dtype == "fp8_w8a8"
use_int8_w8a16 = args.dtype == "int8_w8a16"
block_quant_shape = get_weight_block_size_safety(config)

View File

@@ -17,7 +17,7 @@ from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
)
from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
FP8_DTYPE = current_platform.fp8_dtype()
@@ -344,7 +344,7 @@ def main(args: argparse.Namespace):
topk = config.num_experts_per_tok
hidden_size = config.hidden_size
dtype = torch.float16 if current_platform.is_rocm() else config.torch_dtype
dtype = torch.float16 if current_platform.is_rocm() else config.dtype
use_fp8_w8a8 = args.dtype == "fp8_w8a8"
use_int8_w8a16 = args.dtype == "int8_w8a16"
use_customized_permute = args.use_customized_permute

View File

@@ -39,7 +39,7 @@ import torch
from vllm.model_executor.layers.rotary_embedding import get_rope
from vllm.platforms import current_platform
from vllm.transformers_utils.config import get_config
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

View File

@@ -9,9 +9,9 @@ import torch
from vllm import _custom_ops as ops
from vllm.logger import init_logger
from vllm.platforms import current_platform
from vllm.utils import (
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import (
STR_DTYPE_TO_TORCH_DTYPE,
FlexibleArgumentParser,
create_kv_caches_with_random,
)

View File

@@ -1,155 +0,0 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import itertools
import torch
from vllm import _custom_ops as vllm_ops
from vllm.triton_utils import triton
def polynorm_naive(
x: torch.Tensor,
weight: torch.Tensor,
bias: torch.Tensor,
eps: float = 1e-6,
):
orig_shape = x.shape
x = x.view(-1, x.shape[-1])
def norm(x, eps: float):
return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)
x = x.float()
return (
(
weight[0] * norm(x**3, eps)
+ weight[1] * norm(x**2, eps)
+ weight[2] * norm(x, eps)
+ bias
)
.to(weight.dtype)
.view(orig_shape)
)
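# In formula form, polynorm_naive computes
#   PolyNorm(x) = w0 * norm(x**3) + w1 * norm(x**2) + w2 * norm(x) + b,
# where norm(y) = y / sqrt(mean(y**2) + eps) and the powers are elementwise.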
def polynorm_vllm(
x: torch.Tensor,
weight: torch.Tensor,
bias: torch.Tensor,
eps: float = 1e-6,
):
orig_shape = x.shape
x = x.view(-1, x.shape[-1])
out = torch.empty_like(x)
vllm_ops.poly_norm(out, x, weight, bias, eps)
output = out
output = output.view(orig_shape)
return output
def calculate_diff(batch_size, seq_len, hidden_dim):
dtype = torch.bfloat16
x = torch.randn(batch_size, seq_len, hidden_dim, dtype=dtype, device="cuda")
weight = torch.ones(3, dtype=dtype, device="cuda")
bias = torch.ones(1, dtype=dtype, device="cuda")
output_naive = polynorm_naive(x, weight, bias)
output_vllm = polynorm_vllm(x, weight, bias)
if torch.allclose(output_naive, output_vllm, atol=1e-2, rtol=1e-2):
print("✅ All implementations match")
else:
print("❌ Implementations differ")
batch_size_range = [2**i for i in range(0, 7, 2)]
seq_length_range = [2**i for i in range(6, 11, 1)]
dim_range = [2048, 4096]
configs = list(itertools.product(dim_range, batch_size_range, seq_length_range))
def get_benchmark():
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=["dim", "batch_size", "seq_len"],
x_vals=[list(_) for _ in configs],
line_arg="provider",
line_vals=["naive", "vllm"],
line_names=["Naive", "vLLM"],
styles=[("blue", "-"), ("red", "-")],
ylabel="us",
plot_name="polynorm-perf",
args={},
)
)
def benchmark(dim, batch_size, seq_len, provider):
dtype = torch.bfloat16
hidden_dim = dim * 4
x = torch.randn(batch_size, seq_len, hidden_dim, dtype=dtype, device="cuda")
weight = torch.ones(3, dtype=dtype, device="cuda")
bias = torch.ones(1, dtype=dtype, device="cuda")
quantiles = [0.5, 0.2, 0.8]
if provider == "naive":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: polynorm_naive(x, weight, bias),
quantiles=quantiles,
)
else:
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: polynorm_vllm(x, weight, bias),
quantiles=quantiles,
)
return 1000 * ms, 1000 * max_ms, 1000 * min_ms
return benchmark
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument(
"--batch-size",
type=int,
default=4,
help="Batch size",
)
parser.add_argument(
"--seq-len",
type=int,
default=128,
help="Sequence length",
)
parser.add_argument(
"--hidden-dim",
type=int,
default=8192,
help="Intermediate size of MLP",
)
parser.add_argument(
"--save-path",
type=str,
default="./configs/polnorm/",
help="Path to save polnorm benchmark results",
)
args = parser.parse_args()
# Run correctness test
calculate_diff(
batch_size=args.batch_size,
seq_len=args.seq_len,
hidden_dim=args.hidden_dim,
)
benchmark = get_benchmark()
# Run performance benchmark
benchmark.run(print_data=True, save_path=args.save_path)

View File

@@ -7,7 +7,8 @@ import torch
from vllm import _custom_ops as ops
from vllm.platforms import current_platform
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
@torch.inference_mode()

View File

@@ -9,9 +9,9 @@ from tabulate import tabulate
from vllm import _custom_ops as ops
from vllm.logger import init_logger
from vllm.platforms import current_platform
from vllm.utils import (
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import (
STR_DTYPE_TO_TORCH_DTYPE,
FlexibleArgumentParser,
create_kv_caches_with_random,
)

View File

@@ -12,9 +12,9 @@ from vllm.attention.ops.triton_reshape_and_cache_flash import (
)
from vllm.logger import init_logger
from vllm.platforms import current_platform
from vllm.utils import (
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import (
STR_DTYPE_TO_TORCH_DTYPE,
FlexibleArgumentParser,
create_kv_caches_with_random_flash,
)

View File

@@ -1,97 +1,76 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from itertools import accumulate
import itertools
import nvtx
import torch
from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding, get_rope
from vllm.platforms import current_platform
from vllm.utils import FlexibleArgumentParser
from vllm.model_executor.layers.rotary_embedding import get_rope
from vllm.triton_utils import triton
from vllm.utils.argparse_utils import FlexibleArgumentParser
batch_size_range = [2**i for i in range(0, 8, 2)]
seq_len_range = [2**i for i in range(6, 10, 1)]
num_heads_range = [32, 48]
configs = list(itertools.product(batch_size_range, seq_len_range, num_heads_range))
def benchmark_rope_kernels_multi_lora(
is_neox_style: bool,
batch_size: int,
seq_len: int,
num_heads: int,
head_size: int,
rotary_dim: int | None,
dtype: torch.dtype,
seed: int,
device: str,
max_position: int = 8192,
base: float = 10000,
) -> None:
current_platform.seed_everything(seed)
torch.set_default_device(device)
if rotary_dim is None:
rotary_dim = head_size
# simulating serving 4 LoRAs
scaling_factors = [1, 2, 4, 8]
# batched RoPE can take multiple scaling factors
batched_rope = get_rope(
head_size,
rotary_dim,
max_position,
base,
is_neox_style,
{"rope_type": "linear", "factor": tuple(scaling_factors)},
)
# non-batched RoPE takes only one scaling factor, we create multiple
# instances to simulate the same behavior
non_batched_ropes: list[RotaryEmbedding] = []
for scaling_factor in scaling_factors:
non_batched_ropes.append(
get_rope(
head_size,
rotary_dim,
max_position,
base,
is_neox_style,
{"rope_type": "linear", "factor": (scaling_factor,)},
)
)
positions = torch.randint(0, max_position, (batch_size, seq_len))
query = torch.randn(batch_size, seq_len, num_heads * head_size, dtype=dtype)
key = torch.randn_like(query)
# create query offsets for batched RoPE, we concat multiple kv cache
# together and each query needs to find the right kv cache of its type
offset_map = torch.tensor(
list(
accumulate(
[0]
+ [
max_position * scaling_factor * 2
for scaling_factor in scaling_factors[:-1]
]
)
def get_benchmark(head_size, rotary_dim, is_neox_style, device):
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=["batch_size", "seq_len", "num_heads"],
x_vals=[list(_) for _ in configs],
line_arg="provider",
line_vals=["torch", "flashinfer", "vllm"],
line_names=["PyTorch", "FlashInfer", "vLLM"],
styles=[("blue", "-"), ("green", "-"), ("red", "-")],
ylabel="us",
plot_name=f"rope-perf{'-neox-style' if is_neox_style else ''}",
args={},
)
)
query_types = torch.randint(
0, len(scaling_factors), (batch_size, seq_len), device=device
)
# map query types to offsets
query_offsets = offset_map[query_types]
# the kernel takes flattened offsets
flatten_offsets = query_offsets.flatten()
def benchmark(batch_size, seq_len, num_heads, provider):
dtype = torch.bfloat16
max_position = 8192
base = 10000
rope = get_rope(head_size, rotary_dim, max_position, base, is_neox_style)
rope = rope.to(dtype=dtype, device=device)
cos_sin_cache = rope.cos_sin_cache.to(dtype=torch.float, device=device)
# batched queries of the same type together for non-batched RoPE
queries = [query[query_types == i] for i in range(len(scaling_factors))]
keys = [key[query_types == i] for i in range(len(scaling_factors))]
packed_qkr = zip(queries, keys, non_batched_ropes)
# synchronize before start timing
torch.cuda.synchronize()
with nvtx.annotate("non-batched", color="yellow"):
for q, k, r in packed_qkr:
r.forward(positions, q, k)
torch.cuda.synchronize()
with nvtx.annotate("batched", color="green"):
batched_rope.forward(positions, query, key, flatten_offsets)
torch.cuda.synchronize()
positions = torch.randint(0, max_position, (batch_size, seq_len), device=device)
query = torch.randn(
(batch_size, seq_len, num_heads * head_size), dtype=dtype, device=device
)
key = torch.randn_like(query)
quantiles = [0.5, 0.2, 0.8]
if provider == "torch":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: rope.forward_native(positions, query.clone(), key.clone()),
quantiles=quantiles,
)
elif provider == "flashinfer":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: torch.ops.vllm.flashinfer_rotary_embedding(
positions,
query.clone(),
key.clone(),
head_size,
cos_sin_cache,
is_neox_style,
),
quantiles=quantiles,
)
else:
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: rope.forward_cuda(positions, query.clone(), key.clone()),
quantiles=quantiles,
)
return 1000 * ms, 1000 * max_ms, 1000 * min_ms
return benchmark
if __name__ == "__main__":
@@ -116,17 +95,12 @@ if __name__ == "__main__":
parser.add_argument(
"--device", type=str, choices=["cuda:0", "cuda:1"], default="cuda:0"
)
parser.add_argument("--save-path", type=str, default="./configs/rope/")
args = parser.parse_args()
print(args)
benchmark_rope_kernels_multi_lora(
is_neox_style=args.is_neox_style,
batch_size=args.batch_size,
seq_len=args.seq_len,
num_heads=args.num_heads,
head_size=args.head_size,
rotary_dim=args.rotary_dim,
dtype=getattr(torch, args.dtype),
seed=args.seed,
device=args.device,
)
# Get the benchmark function
benchmark = get_benchmark(
args.head_size, args.rotary_dim, args.is_neox_style, args.device
)
# Run performance benchmark
benchmark.run(print_data=True, save_path=args.save_path)
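For reference, here is a standalone sketch of the offset arithmetic in the removed multi-LoRA path above, plugging in the script's defaults (max_position=8192, scaling_factors=[1, 2, 4, 8]); the segment width max_position * factor * 2 is taken directly from the code above, not derived independently:

from itertools import accumulate

max_position = 8192
scaling_factors = [1, 2, 4, 8]
# Running sum of segment widths gives each scaling factor's start offset
# in the concatenated cos/sin cache.
offsets = list(accumulate([0] + [max_position * f * 2 for f in scaling_factors[:-1]]))
print(offsets)  # [0, 16384, 49152, 114688]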

View File

@@ -78,11 +78,11 @@ WEIGHT_SHAPES = {
}
WEIGHT_SHAPES_MOE = {
"nm-testing/Mixtral-8x7B-Instruct-v0.1": [
"mistralai/Mixtral-8x7B-Instruct-v0.1": [
[8, 2, 4096, 28672],
[8, 2, 14336, 4096],
],
"nm-testing/deepseekv2-lite": [
"deepseek-ai/DeepSeek-V2-Lite": [
[64, 6, 2048, 1408],
],
"ibm-granite/granite-3.0-1b-a400m": [

View File

@@ -8,7 +8,7 @@ from datetime import datetime
import flashinfer
import torch
from vllm.utils import round_up
from vllm.utils.math_utils import round_up
FLOAT32_BYTES = torch.finfo(torch.float).bits // 8
FP8_DTYPE = torch.float8_e4m3fn
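Only the import path of round_up moves here (and in the next file); its behavior is unchanged. A minimal sketch, assuming the usual ceiling-to-a-multiple semantics:

def round_up(x: int, multiple: int) -> int:
    # Smallest multiple of `multiple` that is >= x.
    return ((x + multiple - 1) // multiple) * multiple

assert round_up(5, 4) == 8
assert round_up(8, 4) == 8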

View File

@@ -8,7 +8,7 @@ from datetime import datetime
import flashinfer
import torch
from vllm.utils import round_up
from vllm.utils.math_utils import round_up
FLOAT32_BYTES = torch.finfo(torch.float).bits // 8
FP8_DTYPE = torch.float8_e4m3fn

View File

@@ -18,7 +18,7 @@ from vllm.model_executor.layers.quantization.utils.fp8_utils import (
)
from vllm.platforms import current_platform
from vllm.triton_utils import triton
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
mp.set_start_method("spawn", force=True)
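Likewise, FlexibleArgumentParser only changes import path here and in the next file. For context, it is vLLM's argparse subclass that, among other things, accepts underscore and dash spellings of the same flag interchangeably; a hypothetical mini-version of that one idea (not the real implementation):

import argparse

class UnderscoreTolerantParser(argparse.ArgumentParser):
    @staticmethod
    def _normalize(arg: str) -> str:
        # Rewrite "--foo_bar[=value]" to "--foo-bar[=value]"; leave values alone.
        if not arg.startswith("--"):
            return arg
        key, sep, value = arg[2:].partition("=")
        return "--" + key.replace("_", "-") + sep + value

    def parse_args(self, args=None, namespace=None):
        if args is None:
            import sys
            args = sys.argv[1:]
        return super().parse_args([self._normalize(a) for a in args], namespace)

parser = UnderscoreTolerantParser()
parser.add_argument("--head-size", type=int)
print(parser.parse_args(["--head_size=64"]).head_size)  # 64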

View File

@@ -11,7 +11,7 @@ import regex as re
import seaborn as sns
from torch.utils.benchmark import Measurement as TMeasurement
from vllm.utils import FlexibleArgumentParser
from vllm.utils.argparse_utils import FlexibleArgumentParser
if __name__ == "__main__":
parser = FlexibleArgumentParser(

View File

@@ -11,6 +11,7 @@ from bench_utils import (
Color,
logger,
)
from tqdm import tqdm
from transformers import AutoTokenizer # type: ignore
# Conversation ID is a string (e.g., "UzTK34D")
@@ -417,6 +418,10 @@ def generate_conversations(
data = file.read()
tokens_in_file = tokenizer.encode(data, add_special_tokens=False)
list_of_tokens.extend(tokens_in_file)
logger.info(
f"Loaded {len(tokens_in_file)} tokens from file {filename}, "
f"total tokens so far: {len(list_of_tokens)}"
)
conversations: ConversationsMap = {}
conv_id = 0
@@ -449,18 +454,25 @@ def generate_conversations(
)
base_offset += common_prefix_tokens
for conv_id in range(args.num_conversations):
for conv_id in tqdm(
range(args.num_conversations),
total=args.num_conversations,
desc="Generating conversations",
unit="conv",
):
# Generate a single conversation
messages: MessagesList = []
nturns = turn_count[conv_id]
# User prompt token count per turn (with lower limit)
input_token_count: np.ndarray = args.input_num_tokens.sample(nturns)
input_token_count: np.ndarray = args.input_num_tokens.sample(nturns).astype(int)
input_token_count = np.maximum(input_token_count, base_prompt_token_count)
# Assistant answer token count per turn (with lower limit)
output_token_count: np.ndarray = args.output_num_tokens.sample(nturns)
output_token_count: np.ndarray = args.output_num_tokens.sample(nturns).astype(
int
)
output_token_count = np.maximum(output_token_count, 1)
user_turn = True
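The new .astype(int) casts guard against samplers that return floats (e.g. draws from a continuous distribution), since token counts must be integers. A minimal illustration (the lognormal here is illustrative, not necessarily the benchmark's actual distribution):

import numpy as np

rng = np.random.default_rng(0)
samples = rng.lognormal(mean=5.0, sigma=0.5, size=4)   # float64 draws
token_counts = np.maximum(samples.astype(int), 1)      # integer counts, floored at 1
print(samples.dtype, token_counts.dtype)  # float64 int64 (int size is platform-dependent)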

Some files were not shown because too many files have changed in this diff.