Compare commits

539 Commits

Author SHA1 Message Date
Andreas Karatzas
11b6af5280 [ROCm][Bugfix] Fix Mamba batched decode producing incorrect output (#32099)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-13 05:46:53 +00:00
Wentao Ye
2a719e0865 [Perf] Optimize requests abort (#32211)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-13 04:11:37 +00:00
Andrew Bennett
f243abc92d Fix various typos found in docs (#32212)
Signed-off-by: Andrew Bennett <potatosaladx@meta.com>
2026-01-13 03:41:47 +00:00
Sanghoon Yoon
60b77e1463 [Frontend] Add reasoning_effort to OpenAIServing._preprocess_chat() (#31956)
Signed-off-by: Sanghoon Yoon <seanyoon@kakao.com>
2026-01-13 03:21:49 +00:00
cjackal
15b33ff064 [Misc] improve warning/assert messages (#32226)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2026-01-13 03:11:23 +00:00
Nick Hill
c6bb5b5603 [BugFix] Fix engine crash caused by chat tools + response_format (#32127)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-13 10:33:14 +08:00
Nick Hill
9273a427b5 [Misc] Allow enabling NCCL for DP sync when async scheduling (#32197)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-13 02:03:08 +00:00
Cyrus Leung
78d13ea9de [Model] Handle trust_remote_code for transformers backend (#32194)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-13 09:30:12 +08:00
Andrew Xia
a307ac0734 [responsesAPI] add unit test for optional function tool call id (#32036)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2026-01-12 16:14:54 -08:00
Divakar Verma
a28d9f4470 [ROCm][CI] Handle pytest status code 5 when a shard isn't allocated any tests (#32040)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2026-01-12 17:35:49 -05:00
xuebwang-amd
629584bfc9 [Kernel][MoE] fix computation order of MoE weight multiplication and improve flow (#31962)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
2026-01-12 17:17:30 -05:00
Woosuk Kwon
0a7dd23754 [Model Runner V2] Add support for M-RoPE (#32143)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-12 13:37:43 -08:00
Woosuk Kwon
dec28688c5 [Model Runner V2] Minor refactor for logit_bias (#32209)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-12 13:08:30 -08:00
Vadim Gimpelson
9f430c94bd [BUGFIX] Add missed remaping of the names of fp8 kv-scale (#32199)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-01-12 20:42:06 +00:00
Nicolò Lucchesi
f8bd8394e3 [NIXL][Bugfix] Failure logging overhaul + early metadata free on failure (#32031)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 20:38:49 +00:00
Woosuk Kwon
ca81811bfe [Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens (#32163)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-12 11:31:10 -08:00
Lucas Kabela
ad8818bb5e [Misc][BE] Type coverage for vllm/compilation [3/3] (#31748)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-12 19:24:38 +00:00
Nicolò Lucchesi
08e8e99ce7 [Misc] Change log level for batch queue log (#32192)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 18:59:31 +00:00
Or Ozeri
2be765b68a [BugFix] scheduler: Fix ordering preserving of skipped requests (#32173)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-12 18:39:38 +00:00
Roger Wang
16abe6b85a [Misc] Set default torch num threads for input processing (#31879)
Signed-off-by: Roger Wang <hey@rogerw.io>
2026-01-12 10:28:16 -08:00
Ilya Markov
1eb61ab34b [Refactor] EPLB rebalance algo to NumPy (#30697)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
2026-01-12 18:13:23 +00:00
Kyungmin Lee
3d962d72ab [BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE (#32196)
Signed-off-by: lkm2835 <lkm2835@gmail.com>
2026-01-12 10:00:45 -08:00
Matthew Bonanni
20228cb851 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-12 09:13:56 -08:00
Cyrus Leung
7c0d3c5152 [Benchmark] Share data between SLA runs (#32184)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-13 01:12:22 +08:00
Nicolò Lucchesi
5b68107411 [Misc][PD] Fix get_attn_backend usage in transfer connectors (#31988)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 18:10:05 +01:00
Asaf Joseph Gardin
8fb2c135be [Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode (#32118)
Signed-off-by: Josephasafg <ajgard7@gmail.com>
2026-01-12 17:02:38 +00:00
Cyrus Leung
8863c2b25c [Model] Standardize pooling heads (#32148)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-12 17:01:49 +00:00
danielafrimi
3f72639d36 [FIX] Add NO_MUL activation support for modular kernel path (#31528)
Signed-off-by: dafrimi <dafrimi@nvidia.com>
Signed-off-by: <>
Co-authored-by: root <root@gpu-267.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-537.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: root <root@pool0-01777.cm.cluster>
2026-01-12 11:55:49 -05:00
Jaehyun An
6bc9c8473e [MODEL] New model support for kakaocorp/kanana-1.5-v-3b-instruct (#29384)
Signed-off-by: Jaehyun An <steve.ai@kakaocorp.com>
2026-01-12 16:39:02 +00:00
Kyungmin Lee
63ed2409e8 Add K-EXAONE-236B-A23B (#31621)
Signed-off-by: lkm2835 <lkm2835@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: lgai-exaone <exaonemodels@lgresearch.ai>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-12 16:30:50 +00:00
Andy Zhang
95e53d907c doc: Update model references in supported_models.md (#32188)
2026-01-12 08:15:28 -08:00
TJian
0346396e94 [ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base (#32179)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-01-12 15:33:21 +00:00
Andy Zhang
e68b0dad8b doc: Update model name for Qwen3-Coder in documentation (#32185)
Signed-off-by: Andy Zhang <xiazhang@microsoft.com>
2026-01-12 07:10:50 -08:00
Or Ozeri
9cddbdba6d OffloadingConnector: Add cpu_bytes_to_use configuration (#24498)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-12 15:00:43 +00:00
Hongxin Xu
49e6b86c91 [Feature] Support recording expert indices for rollout router replay (#28284)
Signed-off-by: xhx1022 <1737006628@qq.com>
Signed-off-by: Hongxin Xu <70438206+xhx1022@users.noreply.github.com>
Signed-off-by: arlenxu <arlenxu@tencent.com>
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: arlenxu <arlenxu@tencent.com>
2026-01-12 06:23:04 -08:00
dtc
0565f1fdec [P/D] Refactor mooncake connector sender thread using async coroutines (#31573)
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2026-01-12 12:35:35 +00:00
Isotr0py
9dbe1fe960 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation (#32149)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-12 11:13:41 +00:00
RickyChen / 陳昭儒
a5f89ae296 [Doc] Add documentation for offline API docs feature (#32134)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
2026-01-12 10:33:48 +00:00
Jee Jee Li
05e8981234 [Doc] Improve LoRA docs (#32159)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-12 02:19:17 -08:00
XlKsyt
899541bdb1 [doc] fix broken links (#32158)
Signed-off-by: minimAluminiumalism <caixuesen@outlook.com>
2026-01-12 10:18:38 +00:00
daniel-salib
d7b2e57097 [Frontend] Fix Flaky MCP Streaming Test (#32153)
Signed-off-by: Daniel Salib <danielsalib@meta.com>
2026-01-12 18:03:32 +08:00
Andika Rachman
5e034f2e3d [cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend (#32092)
Signed-off-by: andikarachman <andika.rachman.y@gmail.com>
2026-01-12 10:03:28 +00:00
Nicolò Lucchesi
22970c1626 [Misc] Disable default --ready-check-timeout-sec extra call in vllm bench (#30975)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-12 01:58:21 -08:00
Cyrus Leung
600aaab8d6 [Model] Remove incorrect SupportsPP from MTP models (#32150)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-12 01:19:30 -08:00
wang.yuqi
60446cd684 [Model] Improve multimodal pooling examples (#32085)
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-12 07:54:09 +00:00
Cyrus Leung
9101dc756c [Model] Avoid hardcoding pooling type (#32119)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-11 21:28:12 -08:00
Woosuk Kwon
025a32f9ed [Model Runner V2] Remove async barrier (#32083)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-11 20:24:30 -08:00
Woosuk Kwon
19504ac07f [Model Runner V2] Skip building deprecated fields in attn metadata (#32132)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-11 14:31:04 -08:00
Jiangyun Zhu
3df619ac94 [CI] fix test_concat_and_cache_mla_rope_fused (#32117)
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
2026-01-11 15:11:11 +00:00
Ning Xie
d74132ca3b fix offline inference chat response prompt (#32088)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-11 14:01:18 +00:00
maang
a34abc49b7 [FixBug] Improve exception string in tensorizer.py (#31680)
Signed-off-by: maang <maang_h@163.com>
Signed-off-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-11 05:01:53 -08:00
rongfu.leng
d70249e2e9 [Misc] fix this log format not space (#32112)
Signed-off-by: lengrongfu <lenronfu@gmail.com>
2026-01-11 05:01:16 -08:00
Cyrus Leung
a374532111 [CI/Build] Separate out flaky responses API tests (#32110)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-11 05:01:12 -08:00
Isotr0py
cee7436a26 [Misc] Make scipy as optional audio/benchmark dependency (#32096)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-11 00:18:57 -08:00
Or Ozeri
4c16ba617f [KVConnector] OffloadingConnector: Fix bug in handling of preemptions (#29870)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-11 08:05:36 +00:00
Matt
bde57ab2ed [Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group (#31713)
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
2026-01-10 23:19:46 -08:00
Fadi Arafeh
9103ed1696 [CPU][BugFix] Disable AOT Compile for CPU (#32037)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-01-10 23:15:49 -08:00
Laith Sakka
46eb30f519 make assume_32_bit_indexing configurable (#32044)
Signed-off-by: Laith Sakka <lsakka@meta.com>
2026-01-10 23:15:46 -08:00
Andy Liu
0dd63639be [MTP][GLM][Bugfix] Fixed .weight_scale loading logic that dropped MTP prediction accuracy with fp8+mtp (#32101)
Signed-off-by: Andy Liu <andyliu@roblox.com>
2026-01-10 23:14:54 -08:00
Cyrus Leung
ef96fa3f1f [Benchmark][2/2] Use spline interpolation to tune SLA variables (#32095)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 20:27:27 -08:00
Or Ozeri
2a4dbe24ea [BugFix] Wait for compute before offloading KV to CPU (#31341)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-10 22:25:08 +00:00
RickyChen / 陳昭儒
8020a60402 [Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification (#32089)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-10 12:40:09 -08:00
Vadim Gimpelson
e15a5ff07b [MISC] Add strict contiguity check for FlashInfer attention tensors (#32008)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
2026-01-10 12:40:05 -08:00
Vensen
6ea001cfb7 [Bugfix][Quantization] Ensure input contiguity in per_token_quant_int8 (#31637)
Signed-off-by: vensen <vensenmu@gmail.com>
2026-01-10 12:40:02 -08:00
shyeh25
1c46dea001 Revert "[Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#308… (#31617)
Signed-off-by: shyeh25 <206795756+shyeh25@users.noreply.github.com>
2026-01-10 12:39:59 -08:00
Or Ozeri
028599739d [BugFix] scheduler: Fix resuming of preempted requests after async load (#31583)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-10 12:39:25 -08:00
gnovack
d1fd802fa3 fused_moe_kernel - cast accumulator after applying router weights (#32002)
Signed-off-by: gnovack <gnovack@amazon.com>
2026-01-11 04:36:45 +08:00
Xin Yang
543c23be78 [LoRA][Perf] Improve FusedMoE LoRA performance for small rank (#32019)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-10 11:04:18 -08:00
jvlunteren
b8bf5c45bb [Kernel] Optimize Sliding Window Attention in 3D Triton Kernel (#31984)
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
2026-01-10 18:13:44 +00:00
Michael Goin
e6c6f2c79d [Quant] Support MXFP4 W4A16 for compressed-tensors dense models (#31926)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-01-10 06:44:35 -08:00
Jeremy Teboul
07286ec5a6 [Bugfix] Fix integer overflow in Gemma3n audio processing (#31657)
Signed-off-by: Jeremy Teboul <jeremyte@meta.com>
2026-01-10 17:52:53 +08:00
Ning Xie
14fc7a68c7 [Bugfix] fix offline chat output prompt (#32076)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-10 07:50:57 +00:00
Cyrus Leung
5f2385a4c8 [Benchmark][1/2] Generalize SLA criterion validation from binary flags to margins (#32075)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 07:11:03 +00:00
Frelam
a01a1c0d69 [Bugfix] fix encoder cache leak of waiting requests in scheduler to solve stuck in CPU scheduling (#31857)
Signed-off-by: frelam <frelam112233@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-10 06:27:58 +00:00
Lucas Wilkinson
da6709c9fe [Misc] Delay deprecation of CommonAttentionMetadata properties (#32074)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-09 21:06:44 -08:00
Andreas Karatzas
d83becd503 [ROCm][CI] Fix flaky test_function_calling_with_stream and reduce schema test examples (#32063)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-10 05:02:35 +00:00
roikoren755
0c9614876e Update modelopt KV cache quantization resolution to new scheme (#31895)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-01-10 04:54:13 +00:00
Cyrus Leung
583a90e005 [Refactor] Separate sequence and token pooling types (#32026)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-10 04:53:24 +00:00
maang
52d428295d [Core] Refactor ColumnParallelLinear: remove unused parameter and optimize forward (#31939)
Signed-off-by: maang <maang_h@163.com>
2026-01-10 04:19:49 +00:00
Kevin McKay
c60578de0a [Bugfix][Hardware][AMD] Use dynamic WARP_SIZE in sampler vectorized_process (#31295)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2026-01-10 03:57:38 +00:00
PatrykSaffer
80fead8bf6 Fuse RoPE and MLA KV-cache write (#25774)
Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com>
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai>
Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-09 19:18:37 -08:00
Akshat Shrivastava
e45946bd91 feature/issac 0.2 (#31550)
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2026-01-10 03:18:05 +00:00
Lucas Kabela
ea6d067a2a [Misc][LLaMa4] Compile LLaMa Vision Encoder (#30709)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-09 22:01:38 -05:00
Ning Xie
abd9224280 resolve pydantic error in startup benchmark (#31348)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-10 02:41:27 +00:00
Kevin McKay
4dc0d606b7 [Bugfix] Narrow broad exceptions in compilation backends (#31616)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-09 21:39:22 -05:00
Micah Williamson
ac0675ff6b [CI] Allow Deprecated Quantization For LM Eval Tests (#32065)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2026-01-09 19:10:47 -07:00
Wentao Ye
e18464a57d [Perf] Optimize async scheduling placeholder using empty (#32056)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-10 00:46:11 +00:00
Russell Bryant
1963245ed1 [Core] Use weights_only=True with torch.load (#32045)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2026-01-10 00:28:57 +00:00
Matthew Bonanni
0308901975 [2/N][Attention] Fix pre-commit errors (#32052)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-10 00:27:15 +00:00
Lucas Kabela
aaf4b70aae [Misc][BE] Type coverage for vllm/compilation [2/3] (#31744)
2026-01-09 18:30:38 -05:00
Nick Hill
3adffd5b90 [Misc] Enable async scheduling by default with spec decoding (#31998)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 23:09:34 +00:00
zhrrr
97ba96fbe9 [perf][async] support non cpu sync get logprob tensors for spec (#31336)
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
2026-01-09 21:24:51 +00:00
Chendi.Xue
94578127a4 [NIXL] refine decoder side post process for heterogeneous BlockSize and kv_layout (#30275)
2026-01-09 21:22:19 +00:00
Matthew Bonanni
2612ba9285 [1/N][Attention] Restructure attention: move files (#31916)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-09 13:10:24 -08:00
Andrew Xia
1f8b7c536b [responsesAPI] fix incomplete_messages for simple/parsable context (#31836)
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: Andrew Xia <axia@fb.com>
2026-01-09 21:00:57 +00:00
Lucas Wilkinson
0a0aa07747 [Quant] Make static quant support all group shapes (#30833)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-09 12:49:27 -08:00
jiahanc
f9e2a75a1e [fix] add cutedsl to global sf (#32001)
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
2026-01-09 12:03:02 -08:00
Runkai Tao
a4d5d663e2 Add unpermute-aware fused MoE path and small-batch fallback (#29354)
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-09 12:58:39 -07:00
Jeremy Teboul
657e9c0e18 [Fix] Introduce audio channels spec (#31595)
Signed-off-by: Jeremy Teboul <jeremyte@meta.com>
2026-01-09 19:34:51 +00:00
Wentao Ye
308feab33f [Perf] Optimize cutlass moe problem size calculation, 5.3% E2E Throughput improvement, 2.2% TTFT improvement (#31830)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-09 11:13:43 -08:00
Wentao Ye
28ae32a5d3 [Refactor] Remove numpy split in async scheduling (#32034)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-09 19:09:02 +00:00
Andrew Xia
f32c629eb4 [Frontend][gpt-oss] Allow system message to overwrite model identity (#31737)
Signed-off-by: lacora <hyelacora@gmail.com>
Signed-off-by: Andrew Xia <axia@fb.com>
Co-authored-by: lacora <hyelacora@gmail.com>
Co-authored-by: Andrew Xia <axia@fb.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-09 14:03:57 -05:00
Yifan Qiao
cd4a95e3aa [Feat][Core] Support multiple KV cache groups in Hybrid KV Coordinator (#31707)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
2026-01-09 10:53:20 -08:00
Michael Goin
d5ec6c056f [UX] Add vLLM model inspection view (#29450)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-09 10:12:35 -07:00
Shanshan Shen
08d954f036 [Doc] Add developer guide for CustomOp (#30886)
Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-09 16:21:11 +00:00
Kevin Šuc
ac9f9330e6 Rename --exclude-log-deltas to --enable-log-deltas (#32020)
Signed-off-by: Catacomba <kevinsuc16@gmail.com>
2026-01-09 15:30:40 +00:00
Isotr0py
2d0c5b630e [Doc] Remove hardcoded Whisper in example openai translation client (#32027)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-09 14:44:52 +00:00
Michael Goin
34cd32fe30 [Perf][Kernel] Fused SiLU+Mul+Quant kernel for NVFP4 cutlass_moe (#31832)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-01-09 07:40:33 -07:00
R3hankhan
8e27663b6a [CPU] Add head sizes 80 and 112 with vec16 fallback (#31968)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2026-01-09 22:14:46 +08:00
maang
7cdf7e2fe0 [Model] Remove redundant None check in DeepSeekOCR image input processing (#32016)
Signed-off-by: maang <maang_h@163.com>
2026-01-09 06:12:44 -08:00
Adolfo Victoria
bbf80ede43 Fix type error (#31999)
Signed-off-by: Adolfo Victoria <adolfokarim@gmail.com>
Co-authored-by: Adolfo Victoria <adovi@meta.com>
2026-01-09 22:03:32 +08:00
inkcherry
4505849b30 [ROCm][PD] add moriio kv connector. (#29304)
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
2026-01-09 14:01:57 +00:00
Roger Wang
db07433ce5 [Misc] Skip hashing kwargs if value is None (#32025)
Signed-off-by: Roger Wang <hey@rogerw.io>
2026-01-09 13:20:59 +00:00
Andreas Karatzas
e02706d2d2 [ROCm][CI][V1] Fix nixl_connector test failure and achieve CUDA parity in test_async_scheduling (#32000)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-09 20:48:32 +08:00
Sophie du Couédic
b474782ad7 [Feature][Benchmarks] Custom dataset: read output length from dataset (#31881)
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
2026-01-09 12:40:59 +00:00
Bofeng Xue
55212c1404 fix: remove duplicate engine_id check in nixl_connector (#31948)
Signed-off-by: Bofeng BF1 Xue <xuebf1@Lenovo.com>
Co-authored-by: Bofeng BF1 Xue <xuebf1@Lenovo.com>
2026-01-09 12:13:17 +00:00
Xin Yang
e7b68f4d6c [Bugfix] Fix Triton FusedMoE LoRA (#30585)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-09 11:46:59 +00:00
vllmellm
1a19e9cd87 [Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten (#31380)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-09 19:28:02 +08:00
Cyrus Leung
c8ed39b9dd [Model] Reorganize pooling layers (#31973)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-09 11:02:14 +00:00
Andreas Karatzas
020732800c [Bugfix] Fix OpenAPI schema test failures (#31921)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-09 10:56:20 +00:00
Alex Brooks
dc77cb7129 [Bugfix] Fix Var Length Batched Padding in Granite Speech (#31906)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2026-01-09 10:28:43 +00:00
gnovack
bde38c11df fix lora moe sharding when rank < max_lora_rank (#31994)
Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-09 14:43:25 +08:00
Xin Yang
707b240d7e [Bugfix] Fix FusedMoE LoRA w2_output_size (#31949)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-09 00:54:05 -05:00
Nick Hill
29ce48221c [Cleanup] Remove obsolete spec decoding compatibility logic (#32003)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 05:44:18 +00:00
TJian
7a05d2dc65 [CI] [ROCm] Fix tests/entrypoints/test_grpc_server.py on ROCm (#31970)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-01-09 12:54:20 +08:00
Divakar Verma
a1648c4045 [ROCm][CI] Fix test_token_classification.py::test_bert_models (#31993)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2026-01-09 04:04:33 +00:00
RioS
e2d49ec2a4 [Bugfix] missing tokens occur in harmony streaming (#30437)
Signed-off-by: RioS <aa248424@gmail.com>
Signed-off-by: Ri0S <aa248424@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2026-01-09 03:59:34 +00:00
Xin Yang
8413868dab [Bugfix] Fix typo in FusedMoE LoRA reshape comment (#31992)
Signed-off-by: Xin Yang <xyangx@amazon.com>
2026-01-08 18:46:05 -08:00
zhrrr
8ff4a99566 [Async][Feat] support apply penalty or bad_words for async + spec (#30495)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: izhuhaoran <izhuhaoran@qq.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 02:31:50 +00:00
daniel-salib
a4ec0c5595 [Frontend] Add MCP tool streaming support to Responses API (#31761)
Signed-off-by: Daniel Salib <danielsalib@meta.com>
2026-01-09 09:19:34 +08:00
Robert Shaw
0fa8dd24d2 [Bugfix] Fix Typo from NVFP4 Refactor (#31977)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-08 16:18:50 -08:00
Max Hu
6ebe34d6fa [Feature] Add iteration level logging and enhance nvtx marker (#31193)
Signed-off-by: Max Hu <maxhu@nvidia.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: Max Hu <maxhu@nvidia.com>
2026-01-09 00:13:39 +00:00
Nick Hill
11cec296dd [BugFix] Add spec-decode-incompatible request param validation (#31982)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-09 00:08:21 +00:00
Robert Shaw
5825bbc1f7 [Quantization] Deprecate Long Tail of Schemes (#31688)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-08 19:07:45 -05:00
Yongye Zhu
d62cfe546d [MoE Refactoring][Bugfix]Wrap WNA16 Triton kernel into mk and change compressed tensor kernel selection (#31752)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-08 19:01:30 -05:00
Lucas Wilkinson
6cdf015c3c [Misc] Fix Current vLLM config is not set. warnings, assert to avoid issues in the future (#31747)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-08 15:20:49 -08:00
Dipika Sikka
5d3b6097ad [Compressed-Tensors] Simplify NVFP4 Conditions, enable marlin support for NVFP4A16 MoEs (#30881)
2026-01-08 17:45:17 -05:00
bnellnm
e74698c27a [Misc][Refactor] Add FusedMoERouter object (#30519)
Signed-off-by: Bill Nell <bnell@redhat.com>
2026-01-08 20:52:55 +00:00
Cyrus Leung
aa125ecf0e [Frontend] Improve error message (#31987)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-08 20:07:03 +00:00
Lucas Kabela
f16bfbe5bc [Documentation][torch.compile] Add documentation for torch.compile + multimodal encoders (#31627)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-08 14:33:24 -05:00
Michael Goin
87e07a6b46 Revert "feat(moe): Add is_act_and_mul=False support for Triton MoE kernels" (#31978)
2026-01-08 11:31:53 -08:00
Woosuk Kwon
7508243249 [Model Runner V2] Simplify BlockTables with UVA (#31965)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2026-01-08 10:24:26 -08:00
Nicolò Lucchesi
83e1c76dbe [CI][ROCm] Fix NIXL tests on ROCm (#31728)
Signed-off-by: NickLucche <nlucches@redhat.com>
2026-01-09 01:34:43 +08:00
Nishidha Panpaliya
a563866b48 Fix ijson build for Power. (#31702)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
2026-01-08 17:12:33 +00:00
Nick Hill
a3d909ad2b [Misc] Tidy up some spec decode logic in GPUModelRunner (#31591)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-08 09:10:07 -08:00
Jee Jee Li
49568d5cf9 [Doc] Improve MM models LoRA notes (#31979)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-08 08:55:22 -08:00
danisereb
b8112c1d85 [Bugfix] Fix vllm serve failure with Nemotron Nano V3 FP8 (#31960)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
2026-01-08 16:08:37 +00:00
Chauncey
eaba8ece77 [Bugfix]: Fix Step3ReasoningParser missing is_reasoning_end_streaming (#31969)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-08 15:28:13 +00:00
yxing-bj
fe86be66c5 [Model] Support IQuestCoder model (#31575)
Signed-off-by: yxing <yxing@iquestlab.com>
2026-01-08 14:42:57 +00:00
Chauncey
1da3a5441a [Docs]: update claude code url (#31971)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-08 14:04:55 +00:00
TJian
72c068b8e0 [CI] [Bugfix] Fix unbounded variable in run-multi-node-test.sh (#31967)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2026-01-08 05:42:01 -08:00
Mary
7645bc524b [OpenAI] Fix tool_choice=required streaming when output has trailing extra data (#31610)
Signed-off-by: maylikenoother <ogedengbemary19@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2026-01-08 21:01:42 +08:00
Ce Zhao
1123a87892 [Model] Enable LoRA support for Pixtral (#31724)
Signed-off-by: <>
Signed-off-by: 赵策 <alcor@zhaocedeMacBook-Air.local>
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com>
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com>
2026-01-08 05:00:57 -08:00
tianshu-Michael-yu
03fd76c570 [Model] Add LFM2-VL model support (#31758)
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-08 05:00:27 -08:00
Bijaya Dangol
59d260f5e4 [Model] Add Grok-2 (#31847)
Signed-off-by: dangoldbj <dangoldbj23@gmail.com>
2026-01-08 04:59:48 -08:00
Patrick von Platen
18d4e481d0 [Voxtral] Fix speech transcription api (#31388)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: bk-201 <joy25810@foxmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: bk-201 <joy25810@foxmail.com>
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: Anexdeus <5142168@mail.ru>
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2026-01-08 18:34:19 +08:00
Isotr0py
2972a05473 [MM Encoder]: Make MMEncoderAttention's scale takes effect properly (#31950)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-08 02:33:48 -08:00
Cyrus Leung
5576227bc1 [Model] Standardize common vision encoders (#31947)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-08 02:33:16 -08:00
Cyrus Leung
d1b6fe007f [Chore] Further cleanup pooler (#31951)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-08 02:16:21 -08:00
omer-dayan
04a49669d1 RayLLM Bugfix - Preserve obj store URL for multi engine_config creation (#30803)
Signed-off-by: Omer Dayan <omdayan@nvidia.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-08 10:00:25 +00:00
BingjiaWang
96fcd3c267 [Misc] Support qwen3-next lora (#31719)
2026-01-08 09:27:50 +00:00
DevByteAI
1f214290d6 fix(compile): apply partition wrapper when loading AOT cached functions (#31536)
Signed-off-by: Devbyteai <abud6673@gmail.com>
Signed-off-by: DevByteAI <161969603+devbyteai@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-08 17:27:26 +08:00
Ryan Rock
8cbdc7eb94 [CI/Build] Enable test_kv_cache_events_dp for AMD (#31834)
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
2026-01-08 09:00:24 +00:00
Lumosis
b634e619bb Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA Compatibility. (#31635)
Signed-off-by: Lihao Ran <imlihao.ran@gmail.com>
Signed-off-by: Lumosis <30372757+Lumosis@users.noreply.github.com>
2026-01-08 09:00:07 +00:00
Isotr0py
eac3b96ec0 [Models] Allow converting Qwen3-VL into Reranker model (#31890)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-08 08:10:15 +00:00
Zhiwei
573a1d1119 [ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch (#31905)
Signed-off-by: ZhiweiYan-96 <zhiwei.yan@amd.com>
2026-01-08 15:47:44 +08:00
Shang Wang
33156f56e0 [docker] A follow-up patch to fix #30913: [docker] install cuda13 version of lmcache and nixl (#31775)
Signed-off-by: Shang Wang <shangw@nvidia.com>
2026-01-07 23:47:02 -08:00
Rabi Mishra
107cf8e92f fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN (#31712)
Signed-off-by: rabi <ramishra@redhat.com>
2026-01-08 15:46:07 +08:00
Zyyeric
63baa28cf5 [Model] Enable LoRA support for tower and connector in GLM4-V (#31652)
Signed-off-by: Zyyeric <eric1976808123@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-08 15:45:53 +08:00
Andy Liu
e5173d3bac [Bugfix] Remove the num_hidden_layers override for glm4_moe (#31745)
2026-01-08 15:45:10 +08:00
prashanth058
d3235cb503 [Fix] Enable mm_processor_cache with vision LoRA (#31927)
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com>
2026-01-08 15:31:51 +08:00
Nick Hill
287b37cda4 [BugFix] Fix spec decoding edge case bugs (#31944)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-08 15:31:03 +08:00
Chang Su
791b2fc30a [grpc] Support gRPC server entrypoint (#30190)
Signed-off-by: Chang Su <chang.s.su@oracle.com>
Signed-off-by: njhill <nickhill123@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: njhill <nickhill123@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2026-01-07 23:24:46 -08:00
Lucas Wilkinson
be6a81f31b [chore] Update FA commit (#30460)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-07 23:24:18 -08:00
Ronald
2ab441befe [platform] add dp_metadata arg to set_additional_forward_context (#31942)
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-01-08 06:56:44 +00:00
ShaanveerS
9572f74f15 [Model] Enable LoRA support for tower and connector in DotsOCR (#31825)
Signed-off-by: ShaanveerS <shaanver.singh@gmail.com>
2026-01-08 14:50:16 +08:00
Andreas Karatzas
5f2a473ff3 [ROCm][CI] v1 cpu offloading attention backend fix (#31833)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-08 14:37:50 +08:00
Michael Goin
6b2a672e47 [Doc] Add Claude code usage example (#31188)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-08 13:50:23 +08:00
rasmith
f1b1bea5c3 [CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 (#31873)
Signed-off-by: Randall Smith <ransmith@amd.com>
Co-authored-by: Randall Smith <ransmith@amd.com>
2026-01-08 13:06:09 +08:00
Charlie Fu
cddbc2b4b2 [ROCm][CI] Add rocm support for run-multi-node-test.sh (#31922)
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-08 04:36:39 +00:00
Andreas Karatzas
087a138963 [ROCm][CI] Fix attention backend test flakiness from uninitialized KV cache memory (#31928)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-08 04:35:25 +00:00
Andreas Karatzas
c4041f37a4 [ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling (#31931)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-08 04:17:56 +00:00
Richard Zou
a79079feef [BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 (#31915)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2026-01-08 04:04:58 +00:00
Robert Shaw
9f6dcb71ae [MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Pavani Majety <pmajety@nvidia.com>
2026-01-08 03:46:27 +00:00
Andreas Karatzas
8dd2419fa9 [CI] Skip Qwen-VL in multimodal processing tests due to flaky external dependency (#31932)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-08 02:58:01 +00:00
Rabi Mishra
39d82005f7 fix(rocm): add early return in get_flash_attn_version for ROCm (#31286)
Signed-off-by: rabi <ramishra@redhat.com>
2026-01-08 10:28:07 +08:00
Rabi Mishra
25eef3dc2e feat(moe): Add is_act_and_mul=False support for Triton MoE kernels (#31645)
Signed-off-by: rabi <ramishra@redhat.com>
2026-01-08 10:27:09 +08:00
Matthew Bonanni
0d7667419f [0/N][Attention] Fix miscellaneous pre-commit issues (#31924)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-08 01:15:17 +00:00
Robert Shaw
5dcd7ef1f2 [MoE Refactor][15/N] Apply Refactor to Fp8 (#31415)
2026-01-07 19:42:33 -05:00
Elvir Crnčević
ffc0a2798b Add back missing DeepEP LL params (#31911)
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
2026-01-07 17:47:54 -05:00
Nick Hill
10ef65eded [BugFix] Fix bad words with speculative decoding (#31908)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2026-01-07 15:46:42 -05:00
Ilya Markov
6170d47d22 [EPLB] Optimize EPLB with numpy (#29499)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2026-01-07 15:21:35 -05:00
Xin Yang
0ada960a20 [Kernel] Support bias type in grouped_topk kernel (#31781)
Signed-off-by: Xin Yang <xyangx@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-07 12:16:32 -08:00
Ning Xie
c907d22158 [refactor] refactor memory constants usage (#31865)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-07 18:37:31 +00:00
Michael Goin
f347ac6c34 [Perf] Fuse stride preparation for NVFP4 cutlass_moe (#31837)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-07 13:31:26 -05:00
Festus Ayobami Owumi
05f47bd8d2 [Doc] Fix: Correct vLLM announcing blog post link in docs (#31868)
Signed-off-by: enfinity <festusowumi@gmail.com>
2026-01-07 10:06:42 -08:00
roikoren755
bf184a6621 Enable quantized attention in NemotronH models (#31898)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-01-07 17:37:19 +00:00
Jee Jee Li
30399cc725 UX: add vLLM env info in '/server_info' (#31899)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-07 17:13:02 +00:00
Kfir Toledo
b89443b8d9 [KVConnector]: Enable Cross-layers KV cache layout for MultiConnector (#30761)
Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com>
2026-01-07 16:59:43 +00:00
Marko Rosenmueller
1d9e9ae8a4 [Bugfix]: prevent leaking tokens in crash log (#30751)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2026-01-07 16:15:19 +00:00
Cyrus Leung
b7036c87a1 [Refactor] Clean up pooler modules (#31897)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-08 00:07:43 +08:00
Kate Cheng
cc6dafaef2 [Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 Linear Op) (#29213)
Signed-off-by: Kate Cheng <yunhsuanc@nvidia.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Co-authored-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
2026-01-07 10:53:54 -05:00
R3hankhan
1ab055efe6 [OpenAI] Extend VLLMValidationError to additional validation parameters (#31870)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2026-01-07 14:45:49 +00:00
Cyrus Leung
b665bbc2d4 [Chore] Migrate V0 attention utils (#31891)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 13:44:36 +00:00
Jared Wen
974138751b [Refactor] GLM-ASR Modeling (#31779)
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-07 13:08:29 +00:00
vllmellm
41cfa50632 [ROCm][AITER] fix wrong argument passed to AITER flash_attn_varlen_func (#31880)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-07 11:25:03 +00:00
Andy Liu
d111bc53ad [Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on (#31757)
Signed-off-by: Andy Liu <andyliu@roblox.com>
2026-01-07 09:18:52 +00:00
BlankR
0790f07695 [Misc] Improve error messages for unsupported types and parameters (#30593)
Signed-off-by: BlankR <hjyblanche@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-07 09:00:16 +00:00
maang
1f33e38e81 [Model] Cleanup: Remove redundant manual definition of make_empty_intermediate_tensors in GLM-4-MoE (#31869)
Signed-off-by: maang <maang_h@163.com>
2026-01-07 08:18:28 +00:00
sihao_li
59fe6f298e [XPU]fallback to TRITON_ATTN on xpu when use float32 dtype (#31762)
Signed-off-by: sihao.li <sihao.li@intel.com>
2026-01-07 08:10:29 +00:00
weiyu
e7596371a4 [Refactor][TPU] Remove torch_xla path and use tpu-inference (#30808)
Signed-off-by: Wei-Yu Lin <weiyulin@google.com>
Signed-off-by: weiyu <62784299+weiyu0824@users.noreply.github.com>
2026-01-07 16:07:16 +08:00
xuebwang-amd
0dd5dee9b9 [Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe (#31676)
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
2026-01-07 07:36:13 +00:00
Kevin McKay
4614c5a539 [Bugfix][Hardware][AMD] Consolidate FP8 min/max values helper function (#31106)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Signed-off-by: Kevin McKay <kevin@example.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-07 06:55:03 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
482914849c [BugFix] LoRA: Support loading base_layer of experts (#31104)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2026-01-07 14:49:39 +08:00
tianshu-Michael-yu
efeaac92f2 [Bugfix] Fix race condition in async-scheduling for vlm model (#31841)
Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>
2026-01-07 06:45:10 +00:00
tjp_zju
55caa6051d refactor: find_loaded_library (#31866)
Signed-off-by: tjp_zju <tanjianpingzju1990@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-07 06:42:20 +00:00
Lucas Wilkinson
c7a79d41a0 [Attention][3/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31850)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-07 13:31:34 +08:00
vllmellm
6409004b26 [ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend (#31816)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-07 05:04:53 +00:00
Cyrus Leung
aafd4d2354 [Chore] Try remove init_cached_hf_modules (#31786)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 12:34:04 +08:00
Jack Yang
0a2c2dc3f1 fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround (#31465)
Signed-off-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Zhuohao Yang <zy242@cornell.edu>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-07 04:08:47 +00:00
Tyler Michael Smith
f09c5feb7c Change warning in get_current_vllm_config to report caller's line number (#31855)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2026-01-07 03:48:13 +00:00
Cyrus Leung
1b8af957f6 [Doc] Update release docs (#31799)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 03:27:40 +00:00
Ce Zhao
a051525e07 [Model] Enable LoRA support for PaliGemma (#31656)
Signed-off-by: 赵策 <alcor@mac.mynetworksettings.com>
Signed-off-by: Alcor <alcor_zhao@outlook.com>
Co-authored-by: 赵策 <alcor@mac.mynetworksettings.com>
2026-01-07 10:09:32 +08:00
Yihua Cheng
5b833be49e [1/2][lmcache connector] clean up lmcache multi-process adapter (#31838)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2026-01-07 02:02:42 +00:00
Lucas Kabela
873480d133 [Misc][BE] Type coverage for vllm/compilation [1/3] (#31554)
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2026-01-06 20:37:51 -05:00
vSeamar
6f351548b2 [Frontend] Implement robust video frame recovery for corrupted videos (#29197)
Signed-off-by: cmartinez <cmartinez@roblox.com>
Signed-off-by: vSeamar <cmartinez@roblox.com>
2026-01-07 01:13:24 +00:00
Andreas Karatzas
364a8bc6dc [ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing VLLM_FLOAT32_MATMUL_PRECISION from all ROCm tests (#31829)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-07 01:12:23 +00:00
Angela Yi
9a1d20a89c [CI] Add warmup run in test_fusion_attn (#31183)
Signed-off-by: angelayi <yiangela7@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-07 00:31:52 +00:00
Cyrus Leung
309a8f66ee [Bugfix] Handle mistral tokenizer in get_hf_processor (#31817)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-07 07:46:56 +08:00
Andreas Karatzas
e5d427e93a [ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) (#31835)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-06 23:23:11 +00:00
Andreas Karatzas
2a42ae790d [ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm (#31820)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-06 23:21:15 +00:00
Matthew Bonanni
d49899732e [Spec Decode][UX] Add acceptance stats to vllm bench serve report (#31739)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
2026-01-06 21:21:42 +00:00
Elvir Crnčević
dba95378a6 Report error log after vllm bench serve (#31808)
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com>
2026-01-06 20:24:19 +00:00
Nikhil G
ada6f91d56 Fix RecursionError in MediaWithBytes unpickling (#31191)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
2026-01-06 20:11:26 +00:00
Li, Jiang
8becf146bd [Quantization][Refactor] Move CPU GPTQ kernel into MP linear (#31801)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Li, Jiang <bigpyj64@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-01-06 19:10:18 +00:00
Charlie Fu
c07163663d [ROCm][CI] Fix tests/compile unit tests (#28895)
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Charlie Fu <Charlie.Fu@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2026-01-06 18:50:43 +00:00
Benjamin Chislett
f7008ce1c4 [Perf] Async Scheduling + Speculative Decoding + Structured Outputs (#29821)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
2026-01-06 18:50:37 +00:00
Yakine Tahtah
4e67a8f616 [Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking (#31055)
Signed-off-by: ReinforcedKnowledge <reinforced.knowledge@gmail.com>
2026-01-06 17:57:56 +00:00
Masataro Asai
142c4d1738 make 500: InternalServerError more informative (#20610)
Signed-off-by: Masataro Asai <guicho2.71828@gmail.com>
2026-01-06 17:36:24 +00:00
Ning Xie
6f5e653383 [Log] add log about gpu worker init snapshot and requested memory (#29493)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-06 17:32:55 +00:00
Vadim Gimpelson
22dffca982 [PERF] Speed-up of GDN attention decode part (Qwen3-Next) (#31722)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2026-01-06 17:32:46 +00:00
Lucas Wilkinson
4c73be14e0 [Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31774)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-06 17:32:14 +00:00
Jinzhen Lin
2f4bdee61e [Quantization][MoE] remove unused ep logic from moe marlin (#31571)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-06 09:07:19 -08:00
roikoren755
28c94770ad [NemotronH] Use ReplicatedLinear for fc1_latent_proj (#31807)
Signed-off-by: Roi Koren <roik@nvidia.com>
2026-01-06 16:00:40 +00:00
Robert Shaw
af8fd73051 [MoE Refactor][14/N] Clean Up FI Quant Config Smuggling (#31593)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-06 15:47:04 +00:00
Robert Shaw
d3e477c013 [MoE Refactor] Add Temporary Integration Tests - H100/B200 (#31759)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-06 10:34:17 -05:00
Isotr0py
02809af1e7 [Bugfix]: Fix cross attention backend selection for Turing GPU (#31806)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-06 23:15:56 +08:00
Jee Jee Li
cbd4690a03 [LoRA]Disable linear LoRA kernel PDL (#31777)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-06 23:12:25 +08:00
wang.yuqi
96860af655 [Model] rename use_pad_token to use_sep_token (#31784)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-06 14:16:04 +00:00
Chauncey
0202971a48 [Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false (#31788)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-06 13:53:21 +00:00
Jzz1943
2c1a4f2488 [Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) (#31790)
Signed-off-by: Zhongze Jiang <jiangzhongze.jzz@ant-intl.com>
2026-01-06 12:59:17 +00:00
Cyrus Leung
6444824873 [Misc] Implement TokenizerLike.convert_tokens_to_ids (#31796)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 12:08:22 +00:00
kzwrime
bf0f3a4638 [Bugfix] Fix torch.compile error for DP + MoE on CPU Backend (#31650)
Signed-off-by: kunzh <zhikun.wu@outlook.com>
2026-01-06 12:06:20 +00:00
Lucas Wilkinson
e0327c9db2 [Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31773)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-06 04:05:17 -08:00
Cyrus Leung
14df02b4e1 [Chore] Cleanup mem_utils.py (#31793)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 19:55:59 +08:00
BlankR
6ebb66ccea [Doc] Fix format of multimodal_inputs.md (#31800)
Signed-off-by: BlankR <hjyblanche@gmail.com>
2026-01-06 03:30:24 -08:00
wang.yuqi
43d384bab4 [CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. (#31797)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-06 19:30:05 +08:00
Cyrus Leung
db318326a5 [Misc] Use deprecated for seed_everything (#31780)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 11:29:55 +00:00
Fadi Arafeh
799b5721f6 [cpu][bench] Add CPU paged attention benchmarks (#31720)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
2026-01-06 10:57:57 +00:00
Cyrus Leung
97ca4c3b60 [Chore] Remove more V0 dead code from sequence.py (#31783)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-06 10:25:14 +00:00
Isotr0py
ee2e69d6cd [Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff (#31776)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-06 00:44:22 -08:00
Isotr0py
7101e0851f [Models]: Use MMEncoderAttention for MoonViT (#31738)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: h100 <h100@inferact.ai>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: h100 <h100@inferact.ai>
2026-01-06 08:00:25 +00:00
vllmellm
e9717801bd [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py (#31714)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2026-01-06 07:53:22 +00:00
Cyrus Leung
da71d44410 [Doc] Show that use_audio_in_video is supported in docs (#30837)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-05 23:27:19 -08:00
Kevin McKay
1fb0209bbc [Bugfix][Hardware][AMD] Fix exception types in AITER MLA FP8 check (#31177)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-06 14:10:59 +08:00
Robert Shaw
81323ea221 [CI] Fix CPU MM Processor Test (#31764)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-06 04:22:18 +00:00
Michael Goin
e1cd7a5faf [Bugfix] Add init_workspace_manager to moe kernel benchmarks (#31042)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 19:14:33 -08:00
Michael Goin
a68e703c32 [UX] Add -ep shorthand for --enable-expert-parallel (#30890)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 19:13:36 -08:00
maang
cd1245a184 [Cleanup] Remove redundant decoder_layer_type assignment in Qwen2 (#31760)
Signed-off-by: maang <maang_h@163.com>
2026-01-05 18:09:18 -08:00
Wentao Ye
ffec815422 [Perf] Optimize additional fill(0) in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement (#31754)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-05 18:01:13 -08:00
maang
d386ab1412 [Docs] Improve malformed exception caused by backslash line continuations (#31694)
Signed-off-by: maang <maang_h@163.com>
Signed-off-by: maang <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-05 17:51:54 -08:00
Michael Goin
ccb309a964 Revert "[CI Failure] Disable B200 tests while runner is broken" (#31750)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2026-01-05 17:26:33 -08:00
John Calderon
2f4e6548ef [Bugfix] vLLM produces invalid UTF-8 tokens and “�” (#28874)
Signed-off-by: John Calderon <jcalderon@nvidia.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
2026-01-06 00:23:00 +00:00
Seiji Eicher
3c98c2d21b [CI/Build] Allow user to configure NVSHMEM version via ENV or command line (#30732)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-05 15:56:08 -08:00
Michael Goin
9513029898 [Bugfix] Properly apply v_scale for mimo_v2_flash (#31175)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 23:20:46 +00:00
Robert Shaw
f6c0009afa [Bugfix] Fix Broken ModelOpt NVFP4 MoE (#31742)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-01-05 23:18:38 +00:00
Yongye Zhu
776ca1e187 [MoE Refactor] Aiter Experts for BF16 MoE (#31542)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-01-05 14:52:59 -08:00
Wentao Ye
af9a7ec255 [Bug] Revert torch warning fix (#31585)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2026-01-05 22:31:21 +00:00
Matthew Bonanni
276e03b92c [CI][DeepSeek] Add nightly DeepSeek R1 lm_eval tests on H200 (#30356)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-05 17:17:59 -05:00
Nick Hill
32f4e4db00 [Cleanup] Remove deprecated fields from CachedRequestData class (#31734)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-05 21:07:14 +00:00
amitz-nv
ee21291825 [Model] Nemotron Parse 1.1 Support (#30864)
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2026-01-05 13:00:14 -08:00
Qidong Su
af1b07b0c5 [docker] install cuda13 version of lmcache and nixl (#30913)
Signed-off-by: Qidong Su <soodoshll@gmail.com>
2026-01-05 12:50:39 -08:00
gnovack
c77a993cc2 pin lora_b moe weights on cpu (#31317)
Signed-off-by: gnovack <gnovack@amazon.com>
2026-01-05 12:15:40 -08:00
Roberto L. Castro
fdcc5176be [BugFix] Fix architecture flags to prevent issues on SM103 (#31150)
Signed-off-by: LopezCastroRoberto <robertol.c510@gmail.com>
2026-01-05 20:11:35 +00:00
Wang Kunpeng
5708297e4e [Misc][Model][Refactor] Pass the prefix into Linear layers (#31669)
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-01-05 20:03:18 +00:00
baonudesifeizhai
02dbb933cb Fix GLM-4.6v flash tool calling in transformers 5.x (#31622)
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
2026-01-05 11:32:43 -08:00
Isotr0py
51e38a8e30 [Misc] Enable Paligemma's PrefixLM attention mask computation (#31725)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-06 03:31:49 +08:00
Or Ozeri
d8e38d4939 Triton Attention: Support cross-layers blocks (#30687)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2026-01-05 19:29:16 +00:00
kzwrime
21156ff199 [Bugfix] Add missing extra_tensors arg to DeviceCommunicatorBase.disp… (#31644)
Signed-off-by: kunzh <zhikun.wu@outlook.com>
2026-01-06 01:26:09 +08:00
RickyChen / 陳昭儒
c455b771fd [Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager (#31643)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
2026-01-06 01:25:38 +08:00
Michael Goin
eefa713a66 [CI Failure] Disable B200 tests while runner is broken (#31732)
Signed-off-by: mgoin <mgoin64@gmail.com>
2026-01-05 08:50:51 -08:00
Kevin Šuc
79ed460dd5 [Frontend] [Doc] Exclude log deltas feature (#30322)
Signed-off-by: Catacomba <kevinsuc16@gmail.com>
Signed-off-by: Kevin Šuc <kevinsuc16@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-05 16:34:35 +00:00
Isotr0py
6aa5b18e1d [v1] Add encoder-only/cross attention support to Triton Attention backend (#31406)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-06 00:00:23 +08:00
wang.yuqi
911d38ed99 [Model] Let more models to support the score template. (#31335)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-05 11:54:26 +00:00
zzzzwwjj
caaa482aca [platform] Support additional forward context for OOT (#31674)
Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2026-01-05 10:25:13 +00:00
Yihua Cheng
b471aad41f [KVconnector][LMCache] remove the import of legacy LMCache code (#31704)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
2026-01-05 10:11:01 +00:00
Jee Jee Li
d5503ca7f9 [LoRA] LoRA PDL improvement (#31660)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-05 08:28:46 +00:00
Qiping Pan
a2ad15c070 [Model] Enable LoRA support for BLIP2 (#31620)
Signed-off-by: Qiping Pan <panqiping@outlook.com>
2026-01-05 08:02:24 +00:00
Tres
3133c192a3 [ROCM] Reorder arguments and rename parameters for rope_cached_thd_positions_2c_fwd_inplace (#29993)
Signed-off-by: Tres Popp <tres.popp@amd.com>
2026-01-05 15:37:57 +08:00
wang.yuqi
76fd458aa7 [CI] Bump sentence-transformer from 3.2.1 to 5.2.0 (#31664)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-04 21:45:01 -08:00
cjackal
e2701cc525 [Frontend] [Bugfix] respect server-level default chat template kwargs in reasoning parser (#31581)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2026-01-05 05:42:47 +00:00
Tyler Michael Smith
fe8a9fbd2e [Bugfix] Fix EPLB state logging error (#31455)
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
2026-01-05 04:06:28 +00:00
Ning Xie
98b8b3abaa [log] enable max_log_len trim only when needed (#31482)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-05 03:55:43 +00:00
CHENYUE
346e56455a Add chat prefix completion feature to DeepSeek v3.2 (#31147)
2026-01-05 11:20:25 +08:00
wang.yuqi
8be6432bda [CI Failure] Fix NomicBert max_model_len validation (#31662)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
2026-01-05 11:06:52 +08:00
Nick Hill
43e3f8e4a9 [Misc] Various code simplifications (#31666)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-04 18:35:56 -08:00
wangxiyuan
bb4337b34c [Platform] Deprecate seed_everything (#31659)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-04 18:34:04 -08:00
Isotr0py
367856de14 [CI/Build] Revive skipped reward models e2e test (#31665)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-01-05 02:33:46 +00:00
Nick Hill
da436f868a [Minor] Small pooler output processing optimization (#31667)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-04 18:33:12 -08:00
Jee Jee Li
f099cd557a [Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' (#31663)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-05 02:31:31 +00:00
Andreas Karatzas
f2b6dfd237 [ROCm][CI] Fix language generation test accuracy by disabling HF flash_sdp and mem_efficient_sdp (#31597)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-05 02:17:05 +00:00
Andreas Karatzas
89f1f25310 [CI] Skip Phi-MoE test due to old API util (#31632)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-05 08:52:07 +08:00
Nick Hill
b53b89fdb3 [BugFix] Async scheduling: handle model forward errors more cleanly (#31611)
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-04 11:04:37 -08:00
Ning Xie
6522721d17 [misc] Sort uvicorn log level description according to verbosity (#31137)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2026-01-04 18:45:37 +00:00
Yuxuan Zhang
0d4044edd8 fix no think of GLM-4.5 / GLM-4.7 (#31449)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
2026-01-04 11:43:00 +08:00
Reagan Lee
41ab179738 [Docs] Fix argparse include path for mm-processor benchmark (#31654)
Signed-off-by: Reagan <reaganjlee@gmail.com>
2026-01-04 03:31:29 +00:00
Robert Shaw
268b1c55ad [MoE Refactor][13/N] Convert FI to Use PFNoEP (#31533)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
2026-01-03 12:26:36 -08:00
Andreas Karatzas
4f9ce35afe [CI][Bugfix] Fix token counting in chunked prefill compl test (#31630)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-03 14:28:49 +08:00
jeremyteboul
97a01308e9 Improve HF qwen3_omni: preserve audio_sample_rate in kwargs restructuring (#29255)
Signed-off-by: Jeremy Teboul <jeremyteboul@fb.com>
Co-authored-by: Jeremy Teboul <jeremyteboul@fb.com>
2026-01-03 04:31:09 +00:00
Xingyu Liu
0eee877f67 [Core] Parse vLLM engine required fields from hf_config to model_arch_config (#28454)
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
Signed-off-by: Xingyu Liu <38244988+charlotte12l@users.noreply.github.com>
2026-01-02 15:13:15 -08:00
Alfred
a0e9ee83c7 [Benchmark] Fix OOM during MoE kernel tuning for large models (#31604)
Signed-off-by: Alfred <massif0601@gmail.com>
2026-01-02 22:24:51 +00:00
Yongye Zhu
a3f2f40947 [MoE Refactor] Explicit construct mk for flashinfer bf16 kernel (#31504)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-01-02 13:54:50 -08:00
Yongye Zhu
5a468ff7c7 [MoE Refactor] Split invoke_fused_moe_kernel (#31050)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2026-01-02 13:47:15 -08:00
Andreas Karatzas
6ef770df7c [MoE] Fix output_shape calculation in Attention layer to handle 3D query inputs (#31596)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-02 15:46:23 +00:00
Nick Hill
bd877162eb [BugFix] Support online dense model DP without overhead (#30739)
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: njhill <nickhill123@gmail.com>
2026-01-02 23:36:38 +08:00
Xinyu Chen
08f425bad1 CustomOp: test forward dispatch for grouped_topk (#31530)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
2026-01-02 10:04:01 -05:00
labAxiaoming
a01f2faedf Add multimodal input method in the documentation (#31601)
Signed-off-by: xiaoming <1259730330@qq.com>
2026-01-02 12:43:30 +00:00
Kyuyeun Kim
cc410e8644 [Bugfix] Fix weight_loader v1 block scale (#31103)
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com>
2026-01-02 13:14:10 +08:00
Kevin McKay
825c2dc133 [Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode (#31282)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2026-01-01 21:14:00 -08:00
Vaibhav Sourirajan
1f43c121d5 Remove unused use_marlin variable in Mxfp4MoEMethod (#31549)
Signed-off-by: vaibhav sourirajan <vs2787@columbia.edu>
2026-01-01 21:13:36 -08:00
Tmn07
ca179d0f64 [Bugfix] Fix activation quantization for compressed-tensors W4A16 (#31572)
Signed-off-by: Tmn07 <tmn0796@gmail.com>
2026-01-01 21:13:22 -08:00
Andreas Karatzas
013b54088c [ROCm][CI] Fix ModernBERT token classification test (#31612)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-02 04:19:08 +00:00
Jay Hemnani
5ac55eb30f [Model] Enable LoRA support for tower and connector in LLaVA (#31513)
Signed-off-by: Jay Hemnani <jayhemnani9910@gmail.com>
Co-authored-by: Jay Hemnani <jayhemnani9910@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-01 19:32:39 -08:00
Benjamin Chislett
ea53ca5e85 [Bugfix] Fix block size used in EAGLE slot mapping (#31540)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2026-01-01 19:32:30 -08:00
zhima771
27864a851c feat: support LoRA for DeepSeek-OCR(Language Model part) (#31569)
Signed-off-by: zhima771 <15836938703@163.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2026-01-01 19:32:11 -08:00
Andreas Karatzas
5cc4876630 [ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by reducing agent pool size (#31553)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2026-01-01 19:29:42 -08:00
Kevin McKay
5fff44064b [Bugfix] Replace BaseException with specific exceptions in FLA utils (#31590)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2026-01-01 19:27:54 -08:00
Reagan Lee
1f5b7c41c3 Add Multimodal Processor Benchmark (#29105)
Signed-off-by: Reagan Lee <reaganjlee@gmail.com>
Signed-off-by: Reagan <reaganjlee@gmail.com>
2026-01-01 19:26:53 -08:00
Ekagra Ranjan
adcf682fc7 [Audio] Improve Audio Inference Scripts (offline/online) (#29279)
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
2025-12-31 23:34:18 +00:00
Andreas Karatzas
21de6d4b02 [CI][Bugfix] Fix token counting in chunked prefill streaming test (#31565)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-31 23:05:14 +00:00
Nick Hill
6c2cfb62ff [BugFix] Fix async scheduling for pooling models (#31584)
Signed-off-by: njhill <nickhill123@gmail.com>
2025-12-31 14:48:51 -08:00
Fanjiang Ye
d8da76f3b7 [Bugfix] Fix BAGEL online serving for text and image understanding (#31546)
Signed-off-by: Dylan1229 <yvanphys@gmail.com>
Signed-off-by: UED <zxr3611244710@gmail.com>
Signed-off-by: mr-ye-cao <yecaoyc2019@gmail.com>
Co-authored-by: UED <zxr3611244710@gmail.com>
Co-authored-by: mr-ye-cao <yecaoyc2019@gmail.com>
Co-authored-by: Mr-Ye-Cao <60802056+Mr-Ye-Cao@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-31 14:46:10 -08:00
baonudesifeizhai
d722e9e614 Add GLM-ASR multimodal support (#31436)
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-31 23:12:24 +08:00
Andreas Karatzas
cf16342d43 [ROCm][CI] Update MiniCPM model test: MiniCPM3-4B to MiniCPM4.1-8B and simplify attention backend testing (#31551)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-31 00:12:01 -08:00
Wentao Ye
357d435c54 [Bug] Fix log issue with \n (#31390)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-12-30 21:16:55 -08:00
danisereb
108a2728f7 Add get_expert_mapping to NemotronHModel (for LoRA support) (#31539)
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
2025-12-30 21:09:03 -08:00
TJian
578c8f51f6 [CI] [Critical] [CUDA] Fix duplicated test name (#31562)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-12-30 21:01:09 -08:00
maang-h
b4bb5f312f [Core] Remove unused num_tokens parameter from _init_model_kwargs (#31517)
Signed-off-by: maang <maang_h@163.com>
2025-12-30 20:47:23 -08:00
SameerAsal
70e1acefcd [BugFix] Fix NUMA node validation in CPU platform (#31520)
Signed-off-by: SameerAsal <SameerAsal@users.noreply.github.com>
Co-authored-by: SameerAsal <SameerAsal@users.noreply.github.com>
2025-12-31 04:06:49 +00:00
Qiu
84f6cd741b [Mics] add pcp basic support to MoE model (#31003)
2025-12-30 20:01:29 -08:00
B-201
ecd49ce7e6 [Fix] Align fused moe lora_b shape with peft (#31534)
Signed-off-by: bk-201 <joy25810@foxmail.com>
2025-12-31 09:44:59 +08:00
Amr Mahdi
e1ee11b2a5 Add docker buildx bake configuration (#31477)
Signed-off-by: Amr Mahdi <amrmahdi@meta.com>
2025-12-31 01:08:54 +00:00
vintipandey
04147dcfa7 [Bugfix]Fix pooling model always disabled due to incorrect PP rank check (#31505)
Signed-off-by: vintipandey <vinti.pandey@gmail.com>
2025-12-30 11:27:10 -08:00
JartX
07728bf5cd [BugFix] add select_gemm_impl on CompressedTensorsWNA16MoEMethod to support LoRA (#31453)
Signed-off-by: JartX <sagformas@epdcenter.es>
2025-12-30 11:20:15 -08:00
yt0428
3f52fa5aa2 [Model] Add support for openPangu moe model (#28775)
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-30 08:11:38 -08:00
Li, Jiang
7157596103 [CPU] Disable async schedule on CPU (#31525)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-12-30 12:34:08 +00:00
Nicolò Lucchesi
ab1af6aa3e [CI][NIXL] Split DPEP tests (#31491)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-30 07:26:12 -05:00
Pleaplusone
1a834df2d4 [ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled (#31523)
Signed-off-by: ganyi <ygan@amd.com>
2025-12-30 09:21:49 +00:00
Kevin
51085c2aeb [Frontend] add continue_final_message parameter to /embeddings endpoint (#31497)
Signed-off-by: Kevin P-W <140451262+kevin-pw@users.noreply.github.com>
2025-12-30 07:21:13 +00:00
Roger Feng
3d973764ce [xpu] [bugfix] upgrade to latest oneccl in dockerfile (#31522)
Signed-off-by: roger feng <roger.feng@intel.com>
2025-12-30 14:52:28 +08:00
Nick Hill
3b312fb792 [Minor] Various small code cleanups/simplifications (#31508)
Signed-off-by: njhill <nickhill123@gmail.com>
2025-12-29 22:42:06 -08:00
ZT-AIA
f84bf7d79b Add Loraconfig parameter to get_punica_wrapper function (#31408)
Signed-off-by: ZT-AIA <1028681969@qq.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-29 22:27:31 -08:00
Roy Wang
99dcf5dcc5 Migrate meetups & sponsors [2/N] (#31500)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
2025-12-30 04:26:15 +00:00
Hojin Yang
dc837bc23e feat(frontend): add --default-chat-template-kwargs CLI argument (#31343)
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com>
2025-12-30 03:38:47 +00:00
Nick Hill
e54ee3ea33 [Core] Deduplicate generate/encode logic in AsyncLLM (#31510)
Signed-off-by: njhill <nickhill123@gmail.com>
2025-12-30 10:42:45 +08:00
wangln19
358bfd315c fix: update kimi k2 tool parser logic (#31207)
Signed-off-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local>
Signed-off-by: Wang Linian <wanglinian@stu.pku.edu.cn>
Co-authored-by: wangln19 <wanglinian@dev.wanglinian.msh-dev.svc.cluster.local>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-12-30 10:01:58 +08:00
Sage
39512aba72 [Prefix Cache] Include lora_name in BlockStored event for deterministic KV-cache reconstruction (#27577)
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Co-authored-by: Sage <80211083+sagiahrac@users.noreply.github.com>
2025-12-30 00:17:16 +00:00
qli88
0f35429a0c [CI]Test Group 'NixlConnector PD accuracy tests' is fixed (#31460)
Signed-off-by: qli88 <qiang.li2@amd.com>
2025-12-29 23:48:56 +00:00
Alexei-V-Ivanov-AMD
d63b969675 [CI/ROCm] Fixing "V1 Test attention (H100)" test group. (#31187)
Signed-off-by: DCCS-4560 <alivanov@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com>
Signed-off-by: <>
Co-authored-by: DCCS-4560 <alivanov@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com>
Co-authored-by: root <root@chi-mi325x-pod1-108.ord.vultr.cpe.ice.amd.com>
2025-12-29 16:53:59 -05:00
Robert Shaw
56f516254c [Bugfix][ROCm] Fix Static Quant Issue (#31502)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-12-29 13:27:55 -08:00
Robert Shaw
9152a30d8f [MoE Refactor][12/N] Marlin Fp8 MoE Pure Function (#31499)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-29 13:27:00 -08:00
Nick Hill
c2ff33cc8c [Core] Enable async scheduling by default (#27614)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2025-12-29 13:20:55 -07:00
chunxiaozheng
b12cb38398 implements register kv caches in lmcache connector (#31397)
Signed-off-by: idellzheng <idellzheng@tencent.com>
2025-12-29 11:13:42 -08:00
Roger Young
5bc664110f Optimize QKNorm for MiniMax-M2/M2.1 (#31493)
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-12-29 16:30:18 +00:00
RickyChen / 陳昭儒
b3a2bdf1ac [Feature] Add offline FastAPI documentation support for air-gapped environments (#30184)
Signed-off-by: rickychen-infinirc <ricky.chen@infinirc.com>
Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-29 16:22:39 +00:00
Harry Mellor
e37e7349e6 Replace nn.ConvNd with vLLM's ConvNdLayer for Transformers modeling backend (#31498)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-29 16:20:01 +00:00
Roy Wang
b5d2d71d26 Migrate doc to website: Hardware Plugins (1/N) (#31496)
Signed-off-by: esmeetu <jasonailu87@gmail.com>
2025-12-29 15:55:20 +00:00
Harry Mellor
decc244767 [Docs] Use relative md links instead of absolute html links for cross referencing (#31494)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-29 13:33:44 +00:00
amittell
9c884faa95 [Bugfix] Preserve tool call id/type/name in streaming finish chunk (#31438)
Signed-off-by: amittell <mittell@me.com>
Signed-off-by: Alex Mittell <mittell@me.com>
2025-12-29 21:10:52 +08:00
Chauncey
48d5ca4e8b [CI] fix test_chat_truncation_content_not_null test (#31488)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-12-29 12:47:08 +00:00
twj
bf73a3e4d7 [Bugfix][Frontend] Fix Jina reranker multimodal input compatibility (#31445)
Signed-off-by: tianwenjing <tianwenjing@jfgenius.com>
Signed-off-by: twj <151701930+twjww@users.noreply.github.com>
Co-authored-by: tianwenjing <tianwenjing@jfgenius.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-29 01:13:18 -08:00
Andreas Karatzas
3ecfdc3776 [ROCm][GPTQ][Bugfix] Fix GPTQ GEMM kernel output zeroing race condition (#30719)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-29 01:13:14 -08:00
Andreas Karatzas
45c1ca1ca1 [ROCm][CI] Skip DeepGemm-dependent test on ROCm platform (#31462)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-29 16:31:10 +09:00
Li, Jiang
17347daaa2 [CI/Build][CPU] Update CPU CI test cases (#31466)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-12-29 14:17:52 +08:00
Mamy Ratsimbazafy
b9793e6a8c Add Fused MoE Triton kernels for GLM-4.5-Air, GLM-4.5v, GLM-4.6v on 2x RTX Pro 6000 (#31407)
Signed-off-by: Mamy Ratsimbazafy <mamy_github@numforge.co>
2025-12-28 08:38:33 -08:00
Jzz1943
0b6b701050 [Model] Add tuned triton fused_moe configs for Qwen3Moe on B200 (#31448)
Signed-off-by: Zhongze Jiang <jiangzhongze.jzz@ant-intl.com>
2025-12-28 08:38:07 -08:00
Nick Hill
094fcce250 [BugFix] Re-fix async multimodal cpu tensor race condition (#31373)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: njhill <nickhill123@gmail.com>
2025-12-28 03:05:08 -08:00
Andreas Karatzas
573dd0e6f0 [ROCm] Migrate xgrammar to upstream release (#31327)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-28 00:08:29 -08:00
Andreas Karatzas
f70368867e [ROCm][CI] Add TorchCodec source build for transcription tests (#31323)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-28 16:06:05 +08:00
Andreas Karatzas
96142f2094 [ROCm][CI] Added perceptron lib in requirements for isaac multi-modal test (#31441)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-28 04:15:14 +00:00
Boyuan Feng
62def07d67 [BugFix] register quant scale tensors as buffer (#31395)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-12-28 11:20:02 +08:00
yitingdc
b326598e97 add tip for VLLM_USE_PRECOMPILED arg to reduce docker build time (#31385)
Signed-off-by: yiting.jiang <yiting.jiang@daocloud.io>
2025-12-28 03:19:47 +00:00
Robert Shaw
727c41f3fd [MoE Refactor][10/N] Cleanup Fp8 Process Weights After Loading (#31169)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-27 20:22:48 +00:00
Boyuan Feng
2f12cd32c0 [BugFix] Fix cache issue in compilation_config (#31376)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-12-27 09:30:39 -05:00
Isotr0py
40a8756224 [Chore]: Remove HF format Phi4-MM examples (#31405)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-27 13:42:02 +00:00
Isotr0py
3d024985ab [CI/Build] Ignore max transformers version for more common tests (#31401)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-27 13:06:26 +00:00
baonudesifeizhai
8711b21676 Fix/get raw stream patch #30905 (#30912)
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-12-26 20:08:47 -08:00
Yifan Qiao
52bf066516 [Core][Hybrid allocator + connector] Support hybrid allocator + kv cache connector (#30166)
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: KuntaiDu <kuntai@uchicago.edu>
2025-12-26 18:25:46 -08:00
Kunshang Ji
5326c89803 [XPU][CI]skip test_preprocess_error_handling due to fork/spawn issue (#31381)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-12-26 21:40:44 +00:00
Xinyu Chen
87f1b8ca2c CustomOp: Unify aiter impl into GroupedTopk (#31221)
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com>
2025-12-26 12:44:29 -05:00
rongfu.leng
887e900b77 [Docs] Add profiler user docs for http request (#31370)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-12-26 23:48:15 +08:00
Patrick von Platen
48e744976c [Mistral common] Ensure all functions are imported from the top & only use public methods (#31138)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-26 04:48:24 -08:00
Jee Jee Li
ce1eafd1a5 [Core] Initialize LoRA support for tower and connector in multi-modal models (#26674)
Signed-off-by: bk-201 <joy25810@foxmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: bk-201 <joy25810@foxmail.com>
Co-authored-by: prashanth058 <prashanth.dannamaneni@uipath.com>
Co-authored-by: Anexdeus <5142168@mail.ru>
2025-12-26 04:48:20 -08:00
Harry Mellor
0b544e6476 [Docs] Fix some snippets (#31378)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-26 12:47:41 +00:00
Jee Jee Li
c3666f56fd [Misc] Fix Qwen2-MoE shared_expert_gate (#31339)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-26 05:10:39 +00:00
Andreas Karatzas
c79dbfa9ad [CI] Fix flaky vision beam search test with flexible semantic validation (#31324)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-26 04:39:32 +00:00
Shinichi Hemmi
9ee05cbe7f Support LoRA and GPTQModel for PLaMo 2/3 (#31322)
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com>
2025-12-26 11:41:33 +08:00
Ning Xie
3b8f31b362 [benchmark] use model card root instead of id (#31329)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-12-26 10:55:56 +08:00
Isotr0py
2cd94259c8 [CI/Build] Ignore max transformers version skipping for initialization tests (#30619)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-26 10:50:32 +08:00
oscardev256
b7165d53c6 Feature/isaac 0.1 (#28367)
Signed-off-by: oscardev256 <42308241+oscardev256@users.noreply.github.com>
Signed-off-by: Oscar Gonzalez <ogonzal6@alumni.jh.edu>
Signed-off-by: Yang <lymailforjob@gmail.com>
Co-authored-by: Yang <lymailforjob@gmail.com>
2025-12-25 18:49:11 -08:00
Nick Hill
81786c8774 [BugFix] Fix async scheduling + reasoning with struct output (#31332)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
2025-12-25 23:01:02 +00:00
Stan Wozniak
f1531d9f2a [Hybrid] Mamba2 prefix cache blocks freeing for running requests (#28047)
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-12-25 20:54:06 +00:00
SongHe
2d6001f491 [Model][Ernie4.5-VL] Support video metadata for timestamp rendering (#31274)
Signed-off-by: dengsonghe <dengsonghe@baidu.com>
Co-authored-by: dengsonghe <dengsonghe@baidu.com>
2025-12-25 14:07:15 +00:00
Amir Samani
030fc44914 use the same stream for cuda graph catpure and replay for NCCL (#29207)
Signed-off-by: Amir Samani <asamani@nvidia.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-12-25 19:10:03 +08:00
Isotr0py
2532f437ee [Doc] Add troubleshooting for Triton PTX error about undefined gpu-name (#31338)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-12-25 02:26:34 -08:00
Louie Tsai
f15185fbdb [Benchmark Suite] improve cpu Benchmark Suite tests and comparison report for 0.12.0 (#30994)
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
2025-12-25 08:51:45 +00:00
Mark Gatere
ba25a65992 [Frontend] add FunctionGemma tool parser support (#31218)
Signed-off-by: gateremark <gateremg@gmail.com>
2025-12-25 15:29:25 +08:00
Amith KK
42826bbccd [Doc] Add tool call parser documentation for GPT-OSS models (#31212)
Signed-off-by: Amith KK <amithkumaran@gmail.com>
2025-12-25 05:29:10 +00:00
Richard Zou
254f6b9867 [Bugfix] Fix eagle dp tests on A100 (#31241)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-12-25 00:05:04 +00:00
Michael Goin
bc5ef333e0 [Perf] Add skip_clone to SamplingParams for internal request handling (#31041)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-24 14:35:57 -08:00
Cyrus Leung
09dc7c690c [Chore][1/2] Drop v0.14 deprecations (#31285)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 09:54:01 -08:00
ゆり
506eb0f454 [Bugfix] Remove dead block_quant_to_tensor_quant function (#31294)
Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-24 17:22:48 +00:00
Ning Xie
5d93089686 [cli] complete vllm cli help message (#31226)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-12-24 15:45:47 +00:00
Kevin McKay
66c9887440 [Bugfix][Hardware][AMD] Fix FP8 dtype in silu_mul quantization (#31179)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-24 10:37:11 -05:00
wang.yuqi
1ff67df182 [CI] Reorganization pooling_mteb_test (#31265)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-12-24 23:36:20 +08:00
skaraban3807
7cd288a4b3 [PERF] Add interleaved memory allocation to NUMA module (#30800)
2025-12-24 13:47:49 +00:00
Cyrus Leung
d201807339 [Chore] Bump lm-eval version (#31264)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 05:39:13 -08:00
Cyrus Leung
aa3868ecfe [Chore] Remove unused noqas (#31263)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 05:38:46 -08:00
Cyrus Leung
7adeb4bfa8 [Bugfix] Fix max_model_len="auto" handling (#31260)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 19:15:27 +08:00
wang.yuqi
bd89ce16d2 [Model] Introduce verify_and_update_model_config for VerifyAndUpdateConfig. (#31131)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
2025-12-24 09:54:57 +00:00
Pleaplusone
b41aeb3468 [Bugfix][ROCm] Fix load issue on deepseek quark quantization when shared expert enabled (#31261)
Signed-off-by: ganyi <ygan@amd.com>
2025-12-24 16:47:44 +08:00
Ryan Rock
ddfac7034e [CI/Build] Ignore data_parallel_size_local (#30281)
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
2025-12-24 07:40:54 +00:00
Micah Williamson
6559d96796 [ROCm][CI] Set TORCH_NCCL_BLOCKING_WAIT Distributed Tests On ROCm (#31259)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-24 07:19:07 +00:00
kliuae
1c74150bca [ROCm][CI] Fix "Distributed Tests (H200)" Test (#31227)
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
2025-12-24 06:56:30 +00:00
Andreas Karatzas
0247a91e00 [ROCm][CI] Fix entrypoints tests and Python-only installation test on ROCm (#28979)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-23 22:42:30 -08:00
Michael Goin
8ee90c83f8 Add --max-model-len auto to auto-fit context to available memory (#29431)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-23 21:37:14 -08:00
Nick Cao
d7e05ac743 [docker] Fix downloading sccache on aarch64 platform (#30070)
Signed-off-by: Nick Cao <nickcao@nichi.co>
2025-12-23 21:36:33 -08:00
sihao_li
471ddb99a0 [XPU] Remove distributed_executor_backend check (#30760)
Signed-off-by: sihao.li <sihao.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2025-12-23 21:34:33 -08:00
Xiong Wang
bb24592d13 [Qwen3-Omni] fixed _get_feat_extract_output_lengths function (#31007)
Signed-off-by: Xiong Wang <wangxiongts@163.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
2025-12-23 21:33:54 -08:00
Matthew Bonanni
369f47aa0f [DeepSeek v3.2] Remove unnecessary syncwarps (#31047)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-12-23 21:33:30 -08:00
zejunchen-zejun
dabff12ed3 [Bugfix][ROCm][Dynamo][DS 3.1][FP8] fix unsupported hasattr call when Dynamo tracing for ROCm device (#31149)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
2025-12-23 21:32:19 -08:00
Ming Yang
3bb9561928 Revert "[bench] Support common prefix len config (for decode-only bench)" (#31240)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-12-23 21:17:23 -08:00
Micah Williamson
3ce791ac77 [ROCm][CI] Set VLLM_FLOAT32_MATMUL_PRECISION="tf32" For terratorch Tests In AMD CI (#31242)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-24 03:21:50 +00:00
Andreas Karatzas
e42894f5b5 [ROCm][CI][Bugfix] Fix Siglip2 rotary embedding dispatch and InternVL video test tolerance (#31235)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-24 02:56:58 +00:00
Wentao Ye
76e6a95192 [Bug] Fix Number of dimensions of tensors must match. for Deepseek V3.2 (#31160)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-24 10:41:09 +08:00
Chao Lei
8b59753cdb [P/D] Mooncake connector support more protocols (#30133)
Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-24 10:24:07 +08:00
Chen Zhang
538e830caa [KVEvent] User request.block_hash for parent block_hash (#30544)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu>
Co-authored-by: Yifan Qiao <yifanqiao@berkeley.edu>
2025-12-23 18:23:43 -08:00
rongfu.leng
4ed11105d7 [Misc] Remove unused custom ops copy_blocks and copy_blocks_mla (#30967)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-12-23 18:22:35 -08:00
Cyrus Leung
dd424571c8 [Bugfix] Enable dynamic_dims for different embeds shape (#31223)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-24 10:15:47 +08:00
Cyrus Leung
ca6a95ba25 [Chore] Simplify logic of _execute_mm_encoder (#31222)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-23 18:15:16 -08:00
Vadim Gimpelson
bc0a5a0c08 [CI] Add Qwen3-Next-FP8 to Blackwell model tests (#31049)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
2025-12-23 17:21:50 -08:00
Andreas Karatzas
bfa2c0bbb9 [ROCm][Bugfix] Fix RuntimeError in MMEncoderAttention by replacing .view() with .reshape() (#31203)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-23 21:48:01 +00:00
Mark McLoughlin
f790068600 [Core] Add a random suffix to frontend-provided request IDs (#27987)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-12-23 13:05:39 -08:00
Asaf Joseph Gardin
34916ae37f [Mamba] - Consolidate Mambas Attention Logic (#28133)
2025-12-23 21:57:00 +01:00
Yuan Tang
0736f901e7 docs: Add llm-d integration to the website (#31234)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-12-23 20:27:22 +00:00
Harry Mellor
c016c95b45 Use helper function instead of looping through attribute names (#29788)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-23 17:31:56 +00:00
Harry Mellor
1339878e13 Only patch original_max_position_embeddings for Transformers v4 (#31214)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-23 16:46:32 +00:00
danielafrimi
b94f80ffb8 [FIX] FP4 quantization kernel padding initialization bug (#31097)
Signed-off-by: <>
Co-authored-by: root <root@gpu-193.slurm-workers-slurm.slurm.svc.cluster.local>
Co-authored-by: root <root@gpu-951.slurm-workers-slurm.slurm.svc.cluster.local>
2025-12-23 08:45:18 -08:00
Joachim Studnia
38c361f99d Fix edge case Mistral tool parser (#30724)
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
Signed-off-by: Joachim Studnia <studniajoachim@gmail.com>
Signed-off-by: juliendenize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: juliendenize <julien.denize@mistral.ai>
Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2025-12-23 14:19:58 +00:00
Cyrus Leung
bb62dda2c3 [Misc] Introduce encode_*_url utility function (#31208)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-23 13:45:21 +00:00
Patrick von Platen
3faa8bee57 adapt voxtral (#31095)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
2025-12-23 05:31:55 -08:00
Harry Mellor
b10d47e0e0 Add util function for checking nesting of rope parameters (#31146)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-23 11:41:49 +00:00
R3hankhan
769f27e701 [OpenAI] Add parameter metadata to validation errors (#30134)
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
2025-12-23 11:30:12 +00:00
Jakub Zakrzewski
23daef548d [Frontend] Support using chat template as custom score template for reranking models (#30550)
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
2025-12-23 11:19:16 +00:00
Jee Jee Li
27c6c2f98c [Bugfix] Fix MoE LoRA bin/pt loading (#31161)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-23 19:09:15 +08:00
Weida Hong
73cfb7a722 Correct position of docstring of class attributes (#31209)
Signed-off-by: Weida Hong <wdhongtw@google.com>
2025-12-23 02:08:58 -08:00
vllmellm
f32cfd7d97 [ROCm][FEAT] Support AITER RMSNorm quantization fusion pass (#26575)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-12-23 02:07:54 -08:00
Jee Jee Li
6b16fff01b [Bugfix] Fix Jais2ForCausalLM (#31198)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-23 07:44:01 +00:00
Yan Ma
f1c2c20136 [XPU] decrease IGC_ForceOCLSIMDWidth for speculative decoding triton-xpu kernel compilation (#30538)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2025-12-23 05:22:15 +00:00
Cyrus Leung
8cef137689 [Chore] Update more locations to use attention_config.backend (#31153)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-12-22 19:19:50 -08:00
quanliu
a37328fc5c [Feature] Batch invariant: Lora (#30097)
Signed-off-by: quanliu <18646313696@163.com>
2025-12-23 10:32:47 +08:00
Pavani Majety
3e10262356 Revert "[SM100] Enable fp8 compute for prefill MLA (#30746)" (#31197)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-12-22 18:15:33 -08:00
Angela Yi
612d5ffdab [ci] Fix Pytorch compilation test oom in 2.10 (#31194)
Signed-off-by: angelayi <yiangela7@gmail.com>
2025-12-23 01:56:47 +00:00
Divakar Verma
78e5e62bbf [AMD][CI] fix v1/engine test_preprocess_error_handling (#31192)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-12-23 01:28:19 +00:00
Robert Shaw
b57b967386 [MoE Refactor][7/N] AITER MK (#31102)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-22 16:42:58 -07:00
Michael Goin
6d518ffbaa [CI Failure] Disable mosaicml/mpt-7b and databricks/dbrx-instruct tests (#31182)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-22 15:40:35 -08:00
Benjamin Chislett
85aff45e24 [Perf] Remove blocking copy in GDN Attention (#31167)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
2025-12-22 14:25:22 -08:00
Wentao Ye
5312a7284e [Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' (#31173)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-22 14:24:27 -08:00
Lucas Wilkinson
de71747655 [SpecDecode] Simplified alternative padded-speculation acceptance rate fix (#29845)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-22 13:06:10 -08:00
Michael Goin
9586354053 [Doc] Add vllm-metal to hardware plugin documentation (#31174)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-22 20:06:29 +00:00
Pavani Majety
b10f41c894 [SM100] Enable fp8 compute for prefill MLA (#30746)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-12-22 19:15:57 +00:00
Yongye Zhu
7b926e8901 [MoE Refactor][9/N] Use modular kernel for unquantized Triton MoE (#31052)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-12-22 17:34:19 +00:00
Gregory Shtrasberg
ab3a85fd68 [ROCm][CI/Build] Fix triton version to one that has triton_kernels required for gpt-oss to run (#31159)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-12-22 17:19:27 +00:00
Boyuan Feng
8dd0db687b [UX] improve profiler error message (#31125)
Signed-off-by: Boyuan Feng <boyuan@meta.com>
2025-12-22 08:45:59 -08:00
TJian
022f3cea53 [ROCm] [Critical]: Remove unused variable (#31156)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-12-22 08:28:22 -08:00
Micah Williamson
a5bc77c253 [AMD][CI] Add "V1 Test e2e + engine" to mi325_8 Agent Pool (#31040)
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
2025-12-22 10:41:56 -05:00
Nicolò Lucchesi
b1c3f96ae3 [CI][Bugfix] Fix entrypoints/openai/test_audio.py (#31151)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-12-22 07:21:40 -08:00
dengyunyang
8f8f469b1b [BugFix] skip language model in Encoder (#30242)
Signed-off-by: dengyunyang <584797741@qq.com>
2025-12-22 05:25:59 -08:00
Shengqi Chen
2cf91c2ea4 [CI] add polling for precompiled wheel in python_only_compile.sh, fix index generation for releases (#30781)
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
2025-12-22 13:24:21 +00:00
AlonKejzman
bd6d5a7475 [gpt-oss] Fix harmony parser in streaming responses (#30205)
Signed-off-by: AlonKejzman <alonkeizman@gmail.com>
2025-12-22 20:56:06 +08:00
Li Wang
256a33ecb4 [Model] Fix bagel failed to run (#31132)
Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-22 02:15:54 -08:00
Roger Young
c02a2705f9 Update MiniMax-M2 ToolCall and add MiniMax-M2.1 in Docs (#31083)
Signed-off-by: xuebi <xuebi@minimaxi.com>
Co-authored-by: xuebi <xuebi@minimaxi.com>
2025-12-22 05:28:40 +00:00
Kevin McKay
cf8eed7bef [Bugfix][ROCm] Fix typo: is_linear_fp8_enaled -> is_linear_fp8_enabled (#31109)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-21 21:14:58 -08:00
Kevin McKay
44ae85f725 [Misc] Fix typo: 'occured' -> 'occurred' (#31120)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:14:27 -08:00
Kevin McKay
14c3e6ade3 [Misc] Fix spelling typos in model comments (#31117)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:14:14 -08:00
Kevin McKay
42b42824ae [Misc] Fix grammar errors in comments and messages (#31115)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:14:02 -08:00
Kevin McKay
ec58c10ce1 [Misc] Fix quantization-related typos (#31116)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:13:48 -08:00
Kevin McKay
8c084de59d [Misc] Fix spelling typos in comments (#31114)
Signed-off-by: c0de128 <kevin.mckay@outlook.com>
2025-12-21 21:13:14 -08:00
CedricHuang
19cc9468fd [Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM (#30957)
2025-12-21 22:34:49 -05:00
Jee Jee Li
097978a15d [Kernel] Enable fused_qknorm_rope_kernel supports partial rope (#30821)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-21 18:39:22 -08:00
Lucas Wilkinson
7e065eba59 [CI] Fix "2 Node Tests (4 GPUs in total)" (#31090)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-22 10:32:40 +08:00
Steve Westerhouse
9d701e90d8 [Doc] Clarify FP8 KV cache computation workflow (#31071)
Signed-off-by: westers <steve.westerhouse@origami-analytics.com>
2025-12-22 08:41:37 +08:00
Michael Goin
06d490282f [NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size (#30897)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-12-21 09:41:57 -08:00
Robert Shaw
b471092d3a [MoE Refactor][4/N] Marlin Fp8 Mk (#31036)
2025-12-21 12:37:42 -05:00
Ameen Patel
93cabc417c ci: add nvidia-smi warmup before Prime-RL integration test (#31093)
Signed-off-by: AmeenP <ameenp360@gmail.com>
2025-12-21 15:43:01 +00:00
Chauncey
bb80f69bc9 add aarnphm and chaunceyjiang to the new tool_parser directory (#31088)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-12-21 03:24:34 +00:00
汪志鹏
3e92b2b7ac [BugFix]fix gpt-oss v1/completions response bug (#30608)
Signed-off-by: princepride <wangzhipeng628@gmail.com>
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: bbrowning <bbrownin@redhat.com>
2025-12-21 10:39:31 +08:00
Jinzhen Lin
7c73ceb581 [Quantization] add marlin w4a8/w8a8 check (#31061)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
2025-12-20 21:58:11 +00:00
Lucas Wilkinson
ae0770fa6b [CI] Fix H200 Distributed test (#31054)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-20 16:48:49 -05:00
Jinzhen Lin
ee52d9901d [Quantization] support logical_widths for fp8 marlin (#30962)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-12-20 12:02:57 -08:00
baonudesifeizhai
54c8924384 [MoE Refactor][5/N] Isolate zero expert to LongCatFlash (#28891)
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
Signed-off-by: Dongjie Zou <85092850+baonudesifeizhai@users.noreply.github.com>
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com>
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com>
2025-12-20 18:22:04 +00:00
Yan Ma
560ae9638c [XPU] enable fp8 online streaming quantization (#30944)
Signed-off-by: Yan Ma <yan.ma@intel.com>
2025-12-20 13:45:27 +00:00
Jeffrey Wang
1501a4070e [Bugfix] Read truncate_prompt_tokens from pooling_params in AsyncLLM.encode() (#31013)
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
2025-12-20 10:29:31 +00:00
Lucas Wilkinson
ff2168bca3 [CI] FIx fixture 'siglip_attention_config' not found (#31053)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-20 03:46:15 +00:00
Gregory Shtrasberg
0be149524c [ROCm][CI/Build] Update ROCm dockerfiles (#30991)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-12-20 03:19:12 +00:00
zejunchen-zejun
d52c5096d7 [Bugfix] fix the alias bug of AttentionBackendEnum when register CUSTOM attention backend to vllm (#30869)
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
2025-12-20 09:03:35 +08:00
Yuxuan Zhang
8a7a414374 GLM-4.7 Tool Parser and Doc Update (#30876)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
2025-12-20 00:09:58 +00:00
Robert Shaw
95befecc18 [MoE Refactor][2/N] Use Modular Kernels for Fp8 (#30825)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-19 23:36:38 +00:00
Wentao Ye
4cf9429897 [Bug] Fix error 'Dynamo failed to run FX node with fake tensors for Deepseek V3.2 (#31046)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-19 23:31:31 +00:00
Robert Shaw
83a317f650 [MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) (#30990)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-12-19 13:09:54 -08:00
Lucas Wilkinson
5f6477d1d0 [BugFix] Fix TypeError: unhashable type: 'dict' when serving deepseek32 (#30924)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-19 16:07:54 -05:00
Wentao Ye
3bd8335bd0 [Refactor] Refactor for DeepGemmQuantScaleFMT using cache (#30898)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-19 13:50:39 -07:00
Seiji Eicher
1ab5213531 Make engine core client handshake timeout configurable (#27444)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-12-19 20:38:30 +00:00
Zhonghua Deng
969bbc7c61 [Model] Add MiMo-V2-Flash support (#30836)
Signed-off-by: Abatom <abzhonghua@gmail.com>
Signed-off-by: Jumiar <liuanqim10@126.com>
Signed-off-by: Zyann7 <zyann7@outlook.com>
Co-authored-by: Jumiar <liuanqim10@126.com>
Co-authored-by: Zyann7 <zyann7@outlook.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-12-19 17:17:03 +00:00
Andrey Talman
268a972c62 Update Pytorch version update docs (#30982)
2025-12-19 16:08:53 +00:00
Jinzhen Lin
5fbfa8d9ef [Quantization] fix marlin w8a8 check (#30961)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
2025-12-19 07:33:22 -08:00
Shanshan Shen
23a1946e3b [CustomOp][Refactor] Extract common methods for ApplyRotaryEmb CustomOp (#31021)
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-12-19 22:16:09 +08:00
Thomas Parnell
b5545d9d5c [Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window (#30887)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-12-19 21:39:54 +08:00
Nishidha Panpaliya
bd2b52fc2d [CPU][Bugfix] Fix ppc64le CPU build (#30871)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
2025-12-19 12:26:35 +00:00
Li, Jiang
420ba2dbb6 Enable aarch64 CPU performance benchmarks (#26494)
Signed-off-by: Ioana Ghiban <ioana.ghiban@arm.com>
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
Co-authored-by: Ioana Ghiban <ioana.ghiban@arm.com>
Co-authored-by: Fadi Arafeh <fadi.arafeh@arm.com>
2025-12-19 12:16:18 +00:00
Marko Rosenmueller
455949675d [Frontend][Bug] allow tool calls in analysis channel (#28139)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
2025-12-19 10:47:44 +00:00
lif
086b96339f [Bugfix] Add validation for tool requests when tool_parser is unavailable (#30613)
Signed-off-by: majiayu000 <1835304752@qq.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-19 18:23:28 +08:00
Jinzhen Lin
9187de9fac [Quantization] enable compressed-tensors marlin support for turing (2) (#31008)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
2025-12-19 08:56:35 +00:00
Isotr0py
ac1c934276 [Bugfix] Fix incorrect tiles creation for mm prefix triton attention (#30974)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-12-19 16:00:33 +08:00
Wenqi Glantz
4924ac582c Add hidden dimension validation for multimodal embedding inputs (#30968)
Signed-off-by: Wenqi Glantz <wglantz@nvidia.com>
2025-12-19 07:59:36 +00:00
Li, Jiang
096b25c9ed [Doc][CPU] Fix index link for CPU regular release wheels (#31015)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-12-19 07:29:52 +00:00
Jinzhen Lin
de08b8f61b [Quantization] enable compressed-tensors marlin support for turing (#31000)
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com>
2025-12-18 20:29:48 -08:00
Nick Hill
2ac85a4544 [BugFix] Fix logprobs with spec decode and modified logits (#30846)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-12-18 19:58:28 -08:00
Andreas Karatzas
7b43db210c [ROCm][CI][Bugfix] Multi-Modal Model Support Fixes and Attention Backend Improvements (#30270)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
2025-12-19 02:17:27 +00:00
1213 changed files with 57814 additions and 25530 deletions


@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on chartqa for vllm.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.9
# pip install "lm-eval[api]>=0.4.9.2"
usage() {
echo``


@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
# pip install "lm-eval[api]>=0.4.9.2"
usage() {
echo``


@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
# pip install "lm-eval[api]>=0.4.9.2"
usage() {
echo``


@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
# pip install "lm-eval[api]>=0.4.9.2"
usage() {
echo``


@@ -60,6 +60,7 @@ def launch_lm_eval(eval_config, tp_size):
f"add_bos_token=true,"
f"trust_remote_code={trust_remote_code},"
f"max_model_len={max_model_len},"
"allow_deprecated_quantization=True,"
)
env_vars = eval_config.get("env_vars", None)


@@ -7,7 +7,7 @@ vLLM also maintains a continuous performance benchmark under [perf.vllm.ai](http
## Performance benchmark quick overview
**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors and Intel® Gaudi® 3 Accelerators with different models.
**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors, Intel® Gaudi® 3 Accelerators and Arm® Neoverse™ with different models.
**Benchmarking Duration**: about 1hr.
@@ -23,7 +23,7 @@ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh
Runtime environment variables:
- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
@@ -34,8 +34,9 @@ Runtime environment variables:
See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
>
> For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
> For Arm® Neoverse™, use `tests/latency-tests-arm64-cpu.json`, `tests/throughput-tests-arm64-cpu.json`, `tests/serving-tests-arm64-cpu.json` instead.
### Latency test
Here is an example of one test inside `latency-tests.json`:
@@ -175,19 +176,6 @@ If you do not see the table, please wait till the benchmark finish running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.
#### Performance Results Comparison
Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------|
| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 |
| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf| 241.620334 | 294.018783 | 1.216863 |
A comparison diagram will be generated below the table.
Here is an example to compare between 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3
<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />
Follow the instructions in [performance results comparison](https://docs.vllm.ai/en/latest/benchmarking/dashboard/#performance-results-comparison) to analyze performance results and the sizing guide.
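
As a rough illustration of what comparing two `benchmark_results.json` files involves, the sketch below aligns rows by their descriptive columns and computes one ratio per matching row. It is a stdlib-only approximation, not the implementation in `compare-json-results.py`: the helper name `perf_ratios`, the `KEY_COLS` list, and the choice to skip rows where either value is zero are assumptions made for this example. The orientation of the ratio (baseline over current for `Median`/`P99` latency columns, current over baseline otherwise) follows the script's own convention, so a value above 1 means the second file performed better.

```python
# Hypothetical helper for illustration only; it is not part of
# compare-json-results.py and uses plain dicts instead of pandas.
KEY_COLS = ["Model", "Dataset Name", "Input Len", "Output Len", "qps"]


def perf_ratios(base_rows, new_rows, metric, key_cols=KEY_COLS):
    """Return {key_tuple: ratio} for rows that appear in both result sets."""

    def index(rows):
        # Map each row's key tuple to its metric value, ignoring rows
        # that do not report the metric at all.
        return {
            tuple(row.get(c) for c in key_cols): row[metric]
            for row in rows
            if row.get(metric) is not None
        }

    base, new = index(base_rows), index(new_rows)
    # Latency-style metrics are better when lower, so the ratio is inverted.
    lower_is_better = "P99" in metric or "Median" in metric

    ratios = {}
    for key, base_value in base.items():
        if key not in new:
            continue
        new_value = new[key]
        # Assumption of this sketch: skip rows where a zero would make the
        # ratio undefined or infinite.
        if base_value == 0 or new_value == 0:
            continue
        if lower_is_better:
            ratios[key] = base_value / new_value
        else:
            ratios[key] = new_value / base_value
    return ratios


if __name__ == "__main__":
    row = {
        "Model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "Dataset Name": "random",
        "Input Len": 128,
        "Output Len": 128,
        "qps": 1,
    }
    results_a = [{**row, "Output Tput (tok/s)": 142.633982}]
    results_b = [{**row, "Output Tput (tok/s)": 156.526018}]
    for key, ratio in perf_ratios(results_a, results_b, "Output Tput (tok/s)").items():
        print(key, round(ratio, 6))
```

Keying on the descriptive columns rather than on row order means a result file with missing or reordered test cases still lines up with its counterpart.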


@@ -1,8 +1,13 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from __future__ import annotations
import argparse
import html as _html
import json
import os
from dataclasses import dataclass
from importlib import util
import pandas as pd
@@ -10,27 +15,49 @@ import pandas as pd
pd.options.display.float_format = "{:.2f}".format
plotly_found = util.find_spec("plotly.express") is not None
DEFAULT_INFO_COLS = [
"Model",
"Dataset Name",
"Input Len",
"Output Len",
# "TP Size",
# "PP Size",
"# of max concurrency.",
"qps",
]
# Safety net: if any DataFrame leaks into to_html(), keep precision at 2.
pd.set_option("display.precision", 2)
pd.set_option("display.float_format", lambda x: f"{x:.2f}")
# -----------------------------
# Core data compare
# -----------------------------
def compare_data_columns(
files, name_column, data_column, info_cols, drop_column, debug=False
files: list[str],
name_column: str,
data_column: str,
info_cols: list[str],
drop_column: str,
debug: bool = False,
):
"""
Align concatenation by keys derived from info_cols instead of row order.
- Pick one canonical key list: subset of info_cols present in ALL files.
- For each file: set index to those keys, aggregate duplicates
- (mean for metric, first for names).
(mean for metric, first for names).
- Concat along axis=1 (indexes align), then reset_index so callers can
- group by columns.
group by columns.
- If --debug, add a <file_label>_name column per file.
"""
print("\ncompare_data_column:", data_column)
frames = []
raw_data_cols = []
raw_data_cols: list[str] = []
compare_frames = []
# 1) choose a canonical key list from info_cols that exists in ALL files
cols_per_file = []
cols_per_file: list[set] = []
for f in files:
try:
df_tmp = pd.read_json(f, orient="records")
@@ -40,24 +67,20 @@ def compare_data_columns(
key_cols = [c for c in info_cols if all(c in cset for cset in cols_per_file)]
if not key_cols:
# soft fallback: use any info_cols present in the first file
key_cols = [c for c in info_cols if c in list(cols_per_file[0])]
if not key_cols:
raise ValueError(
"No common key columns found from info_cols across the input files."
)
# 2) build a single "meta" block (keys as columns) once, aligned by the key index
meta_added = False
for file in files:
df = pd.read_json(file, orient="records")
# Keep rows that actually have the compared metric (same as original behavior)
if drop_column in df.columns:
df = df.dropna(subset=[drop_column], ignore_index=True)
# Stabilize numeric key columns (harmless if missing)
for c in (
"Input Len",
"Output Len",
@@ -69,32 +92,26 @@ def compare_data_columns(
if c in df.columns:
df[c] = pd.to_numeric(df[c], errors="coerce")
# Ensure all key columns exist
for c in key_cols:
if c not in df.columns:
df[c] = pd.NA
# Set index = key_cols and aggregate duplicates → unique MultiIndex
df_idx = df.set_index(key_cols, drop=False)
# meta (key columns), unique per key
meta = df_idx[key_cols]
if not meta.index.is_unique:
meta = meta.groupby(level=key_cols, dropna=False).first()
# metric series for this file, aggregated to one row per key
file_label = "/".join(file.split("/")[:-1]) or os.path.basename(file)
s = df_idx[data_column]
if not s.index.is_unique:
s = s.groupby(level=key_cols, dropna=False).mean()
s.name = file_label # column label like original
s.name = file_label
# add meta once (from first file) so keys are the leftmost columns
if not meta_added:
frames.append(meta)
meta_added = True
# (NEW) debug: aligned test-name column per file
if debug and name_column in df_idx.columns:
name_s = df_idx[name_column]
if not name_s.index.is_unique:
@@ -106,26 +123,19 @@ def compare_data_columns(
raw_data_cols.append(file_label)
compare_frames.append(s)
# Generalize ratio: for any file N>=2, add ratio (fileN / file1)
if len(compare_frames) >= 2:
base = compare_frames[0]
current = compare_frames[-1]
if "P99" in data_column or "Median" in data_column:
ratio = base / current # for latency
ratio = base / current
else:
ratio = current / base
ratio = ratio.mask(base == 0) # avoid inf when baseline is 0
ratio = ratio.mask(base == 0)
ratio.name = f"Ratio 1 vs {len(compare_frames)}"
frames.append(ratio)
# 4) concat on columns with aligned MultiIndex;
# then reset_index to return keys as columns
concat_df = pd.concat(frames, axis=1)
concat_df = concat_df.reset_index(drop=True).reset_index()
if "index" in concat_df.columns:
concat_df = concat_df.drop(columns=["index"])
concat_df = pd.concat(frames, axis=1).reset_index(drop=True)
# Ensure key/info columns appear first (in your info_cols order)
front = [c for c in info_cols if c in concat_df.columns]
rest = [c for c in concat_df.columns if c not in front]
concat_df = concat_df[front + rest]
@@ -134,20 +144,15 @@ def compare_data_columns(
return concat_df, raw_data_cols
# -----------------------------
# Split helper
# -----------------------------
def split_json_by_tp_pp(
input_file: str = "benchmark_results.json", output_root: str = "."
) -> list[str]:
"""
Split a benchmark JSON into separate folders by (TP Size, PP Size).
Creates: <output_root>/tp{TP}_pp{PP}/benchmark_results.json
Returns: list of file paths written.
"""
# Load JSON data into DataFrame
with open(input_file, encoding="utf-8") as f:
data = json.load(f)
# If the JSON is a dict with a list under common keys, use that list
if isinstance(data, dict):
for key in ("results", "serving_results", "benchmarks", "data"):
if isinstance(data.get(key), list):
@@ -156,7 +161,6 @@ def split_json_by_tp_pp(
df = pd.DataFrame(data)
# Keep only "serving" tests
name_col = next(
(c for c in ["Test name", "test_name", "Test Name"] if c in df.columns), None
)
@@ -165,7 +169,6 @@ def split_json_by_tp_pp(
df[name_col].astype(str).str.contains(r"serving", case=False, na=False)
].copy()
# Handle alias column names
rename_map = {
"tp_size": "TP Size",
"tensor_parallel_size": "TP Size",
@@ -176,21 +179,14 @@ def split_json_by_tp_pp(
columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True
)
# Ensure TP/PP columns exist (default to 1 if missing)
if "TP Size" not in df.columns:
df["TP Size"] = 1
if "PP Size" not in df.columns:
df["PP Size"] = 1
# make sure TP/PP are numeric ints with no NaN
df["TP Size"] = (
pd.to_numeric(df.get("TP Size", 1), errors="coerce").fillna(1).astype(int)
)
df["PP Size"] = (
pd.to_numeric(df.get("PP Size", 1), errors="coerce").fillna(1).astype(int)
)
df["TP Size"] = pd.to_numeric(df["TP Size"], errors="coerce").fillna(1).astype(int)
df["PP Size"] = pd.to_numeric(df["PP Size"], errors="coerce").fillna(1).astype(int)
# Split into separate folders
saved_paths: list[str] = []
for (tp, pp), group_df in df.groupby(["TP Size", "PP Size"], dropna=False):
folder_name = os.path.join(output_root, f"tp{int(tp)}_pp{int(pp)}")
@@ -203,32 +199,9 @@ def split_json_by_tp_pp(
return saved_paths
def _add_limit_line(fig, y_value, label):
# Visible dashed line + annotation
fig.add_hline(
y=y_value,
line_dash="dash",
line_color="red" if "ttft" in label.lower() else "blue",
annotation_text=f"{label}: {y_value} ms",
annotation_position="top left",
)
# Optional: add a legend item (as a transparent helper trace)
if plot and plotly_found:
import plotly.graph_objects as go
fig.add_trace(
go.Scatter(
x=[None],
y=[None],
mode="lines",
line=dict(
dash="dash", color="red" if "ttft" in label.lower() else "blue"
),
name=f"{label}",
)
)
# -----------------------------
# Styling helpers
# -----------------------------
def _find_concurrency_col(df: pd.DataFrame) -> str:
for c in [
"# of max concurrency.",
@@ -239,7 +212,6 @@ def _find_concurrency_col(df: pd.DataFrame) -> str:
]:
if c in df.columns:
return c
# Fallback: guess an integer-like column (harmless if unused)
for c in df.columns:
if df[c].dtype.kind in "iu" and df[c].nunique() > 1 and df[c].min() >= 1:
return c
@@ -248,8 +220,7 @@ def _find_concurrency_col(df: pd.DataFrame) -> str:
def _highlight_threshold(
df: pd.DataFrame, threshold: float
) -> "pd.io.formats.style.Styler":
"""Highlight numeric per-configuration columns with value <= threshold."""
) -> pd.io.formats.style.Styler:
conc_col = _find_concurrency_col(df)
key_cols = [
c
@@ -260,6 +231,7 @@ def _highlight_threshold(
c for c in df.columns if c not in key_cols and not str(c).startswith("Ratio")
]
conf_cols = [c for c in conf_cols if pd.api.types.is_numeric_dtype(df[c])]
return df.style.map(
lambda v: "background-color:#e6ffe6;font-weight:bold;"
if pd.notna(v) and v <= threshold
@@ -268,7 +240,264 @@ def _highlight_threshold(
)
if __name__ == "__main__":
def highlight_ratio_columns(styler: pd.io.formats.style.Styler):
ratio_cols = [c for c in styler.data.columns if "ratio" in str(c).lower()]
if not ratio_cols:
return styler
styler = styler.apply(
lambda _: ["background-color: #fff3b0"] * len(styler.data),
subset=ratio_cols,
axis=0,
)
styler = styler.set_table_styles(
[
{
"selector": f"th.col_heading.level0.col{i}",
"props": [("background-color", "#fff3b0")],
}
for i, col in enumerate(styler.data.columns)
if col in ratio_cols
],
overwrite=False,
)
return styler
def _apply_two_decimals(
styler: pd.io.formats.style.Styler,
) -> pd.io.formats.style.Styler:
df = styler.data
num_cols = df.select_dtypes("number").columns
if len(num_cols) == 0:
return styler
return styler.format({c: "{:.2f}" for c in num_cols}, na_rep="")
# -----------------------------
# Valid max concurrency summary helpers
# -----------------------------
def _config_value_columns(df: pd.DataFrame, conc_col: str) -> list[str]:
key_cols = [
c
for c in ["Model", "Dataset Name", "Input Len", "Output Len"]
if c in df.columns
]
exclude = set(key_cols + [conc_col, "qps", "QPS"])
cols: list[str] = []
for c in df.columns:
if c in exclude:
continue
lc = str(c).lower()
if lc.startswith("ratio"):
continue
if lc.endswith("_name") or lc == "test name" or lc == "test_name":
continue
if pd.api.types.is_numeric_dtype(df[c]):
cols.append(c)
return cols
def _max_concurrency_ok(
df: pd.DataFrame, conc_col: str, cfg_col: str, threshold: float
):
if df is None or conc_col not in df.columns or cfg_col not in df.columns:
return pd.NA
d = df[[conc_col, cfg_col]].copy()
d[conc_col] = pd.to_numeric(d[conc_col], errors="coerce")
d[cfg_col] = pd.to_numeric(d[cfg_col], errors="coerce")
d = d.dropna(subset=[conc_col, cfg_col])
if d.empty:
return pd.NA
ok = d[d[cfg_col] <= threshold]
if ok.empty:
return pd.NA
return ok[conc_col].max()
def _value_at_concurrency(df: pd.DataFrame, conc_col: str, cfg_col: str, conc_value):
if (
df is None
or conc_col not in df.columns
or cfg_col not in df.columns
or pd.isna(conc_value)
):
return pd.NA
d = df[[conc_col, cfg_col]].copy()
d[conc_col] = pd.to_numeric(d[conc_col], errors="coerce")
d[cfg_col] = pd.to_numeric(d[cfg_col], errors="coerce")
conc_value = pd.to_numeric(conc_value, errors="coerce")
if pd.isna(conc_value):
return pd.NA
hit = d[d[conc_col] == conc_value]
if hit.empty:
return pd.NA
return hit[cfg_col].iloc[0]
def build_valid_max_concurrency_summary_html(
tput_group_df: pd.DataFrame | None,
ttft_group_df: pd.DataFrame | None,
tpot_group_df: pd.DataFrame | None,
conc_col: str,
args,
) -> str:
if ttft_group_df is None and tpot_group_df is None:
return ""
ttft_cols = (
_config_value_columns(ttft_group_df, conc_col)
if ttft_group_df is not None
else []
)
tpot_cols = (
_config_value_columns(tpot_group_df, conc_col)
if tpot_group_df is not None
else []
)
tput_cols = (
_config_value_columns(tput_group_df, conc_col)
if tput_group_df is not None
else []
)
if ttft_group_df is not None and tpot_group_df is not None:
cfg_cols = [c for c in ttft_cols if c in tpot_cols]
if tput_group_df is not None:
cfg_cols = [c for c in cfg_cols if c in tput_cols] or cfg_cols
else:
cfg_cols = ttft_cols or tpot_cols
if not cfg_cols:
cfg_cols = sorted(set(ttft_cols) | set(tpot_cols) | set(tput_cols), key=str)
rows = []
for cfg in cfg_cols:
ttft_max = (
_max_concurrency_ok(ttft_group_df, conc_col, cfg, args.ttft_max_ms)
if ttft_group_df is not None
else pd.NA
)
tpot_max = (
_max_concurrency_ok(tpot_group_df, conc_col, cfg, args.tpot_max_ms)
if tpot_group_df is not None
else pd.NA
)
both = (
pd.NA
if (pd.isna(ttft_max) or pd.isna(tpot_max))
else min(ttft_max, tpot_max)
)
tput_at_both = (
_value_at_concurrency(tput_group_df, conc_col, cfg, both)
if tput_group_df is not None
else pd.NA
)
ttft_at_both = (
_value_at_concurrency(ttft_group_df, conc_col, cfg, both)
if ttft_group_df is not None
else pd.NA
)
tpot_at_both = (
_value_at_concurrency(tpot_group_df, conc_col, cfg, both)
if tpot_group_df is not None
else pd.NA
)
rows.append(
{
"Configuration": cfg,
f"Max {conc_col} (TTFT ≤ {args.ttft_max_ms:g} ms)": ttft_max,
f"Max {conc_col} (TPOT ≤ {args.tpot_max_ms:g} ms)": tpot_max,
f"Max {conc_col} (Both)": both,
"Output Tput @ Both (tok/s)": tput_at_both,
"TTFT @ Both (ms)": ttft_at_both,
"TPOT @ Both (ms)": tpot_at_both,
}
)
summary_df = pd.DataFrame(rows)
# --- Coerce numeric columns so Styler doesn't miss them due to object dtype ---
for c in summary_df.columns:
if c == "Configuration":
continue
summary_df[c] = pd.to_numeric(summary_df[c], errors="coerce")
both_col = f"Max {conc_col} (Both)"
# --- Strict 2-decimal formatting for ALL non-Configuration columns ---
formatters = {}
for c in summary_df.columns:
if c == "Configuration":
continue
# same formatter for every column: blank for NaN, two decimals otherwise
formatters[c] = lambda v: "" if pd.isna(v) else f"{float(v):.2f}"
styler = summary_df.style.format(formatters)
def _green(v):
return "background-color:#e6ffe6;font-weight:bold;" if pd.notna(v) else ""
if both_col in summary_df.columns:
styler = styler.map(_green, subset=[both_col])
title = (
'<div style="font-size: 1.15em; font-weight: 700; margin: 12px 0 6px 0;">'
"Valid Max Concurrency Summary"
"</div>\n"
)
return title + styler.to_html(table_attributes='border="1" class="dataframe"')
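
Review note: the selection rule behind the "Valid Max Concurrency Summary" can be read as follows. For each configuration, take the largest concurrency whose TTFT is within its limit, do the same for TPOT, and report the smaller of the two as the "Both" value. The sketch below restates that rule with plain Python lists instead of DataFrames. The function names and the `(concurrency, latency_ms)` tuple layout are assumptions made for this example and do not appear in the script.

```python
# Hypothetical stand-ins for the pandas-based helpers in the diff above;
# points are (concurrency, latency_ms) pairs for a single configuration.
def max_ok_concurrency(points, limit_ms):
    """Largest concurrency whose latency is at or below limit_ms, else None."""
    ok = [conc for conc, latency in points if latency is not None and latency <= limit_ms]
    return max(ok) if ok else None


def max_ok_both(ttft_points, tpot_points, ttft_limit_ms, tpot_limit_ms):
    """Smaller of the two per-metric maxima; None if either metric never passes."""
    ttft_max = max_ok_concurrency(ttft_points, ttft_limit_ms)
    tpot_max = max_ok_concurrency(tpot_points, tpot_limit_ms)
    if ttft_max is None or tpot_max is None:
        return None
    return min(ttft_max, tpot_max)


if __name__ == "__main__":
    ttft = [(1, 100.0), (4, 400.0), (8, 900.0)]
    tpot = [(1, 20.0), (4, 40.0), (8, 60.0)]
    print(max_ok_both(ttft, tpot, ttft_limit_ms=500.0, tpot_limit_ms=100.0))
```

Like the diffed helper, this picks the maximum passing concurrency without requiring every lower concurrency to pass as well, so a non-monotonic latency curve can still yield a high value.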
# -----------------------------
# Plot helper
# -----------------------------
def _add_limit_line(fig, y_value: float, label: str):
fig.add_hline(
y=y_value,
line_dash="dash",
line_color="red" if "ttft" in label.lower() else "blue",
annotation_text=f"{label}: {y_value} ms",
annotation_position="top left",
)
if plotly_found:
import plotly.graph_objects as go
fig.add_trace(
go.Scatter(
x=[None],
y=[None],
mode="lines",
line=dict(
dash="dash",
color="red" if "ttft" in label.lower() else "blue",
),
name=label,
)
)
# -----------------------------
# Refactored main + group-first report
# -----------------------------
@dataclass(frozen=True)
class MetricPlan:
data_cols: list[str]
drop_column: str
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser()
parser.add_argument(
"-f", "--file", action="append", type=str, help="input file name"
@@ -308,149 +537,289 @@ if __name__ == "__main__":
default=100.0,
help="Reference limit for TPOT plots (ms)",
)
return parser
args = parser.parse_args()
def choose_metrics(latency: str) -> MetricPlan:
latency = (latency or "").lower()
drop_column = "P99"
name_column = "Test name"
info_cols = [
"Model",
"Dataset Name",
"Input Len",
"Output Len",
"TP Size",
"PP Size",
"# of max concurrency.",
"qps",
]
if "median" in args.latency:
data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
html_msgs_for_data_cols = [
"Compare Output Tokens /n",
"Median TTFT /n",
"Median TPOT /n",
]
drop_column = "P99"
elif "p99" in args.latency:
data_cols_to_compare = ["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"]
html_msgs_for_data_cols = [
"Compare Output Tokens /n",
"P99 TTFT /n",
"P99 TPOT /n",
]
if "median" in latency:
return MetricPlan(
data_cols=["Output Tput (tok/s)", "Median TTFT (ms)", "Median"],
drop_column=drop_column,
)
return MetricPlan(
data_cols=["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"],
drop_column=drop_column,
)
def prepare_input_files(args, info_cols: list[str]) -> tuple[list[str], list[str]]:
if not args.file:
raise ValueError("No input files provided. Use -f/--file.")
if len(args.file) == 1:
files = split_json_by_tp_pp(args.file[0], output_root="splits")
info_cols = [c for c in info_cols if c not in ("TP Size", "PP Size")]
else:
files = args.file
return files, info_cols
def get_y_axis_col(info_cols: list[str], xaxis: str) -> str:
y_axis_index = info_cols.index(xaxis) if xaxis in info_cols else 6
return info_cols[y_axis_index]
def get_group_cols(output_df: pd.DataFrame, info_cols: list[str]) -> list[str]:
filtered_info_cols = info_cols[:4]
group_cols = [c for c in filtered_info_cols if c in output_df.columns]
if not group_cols:
raise ValueError(
f"No valid group-by columns. Expected subset: {filtered_info_cols}, "
f"but DataFrame has: {list(output_df.columns)}"
)
return group_cols
def normalize_group_key(name):
return name if isinstance(name, tuple) else (name,)
def group_filename(name, prefix: str = "perf_comparison_") -> str:
name_vals = normalize_group_key(name)
safe = ",".join(map(str, name_vals)).replace(",", "_").replace("/", "-")
return f"{prefix}{safe}.html"
def build_group_suffix(group_cols: list[str], name) -> str:
name_vals = normalize_group_key(name)
return " , ".join(f"{col} : [ {val} ] " for col, val in zip(group_cols, name_vals))
def render_metric_table_html(
display_group: pd.DataFrame,
metric_label: str,
group_suffix: str,
args,
) -> str:
title = (
f'<div style="font-size: 1.25em; font-weight: 600; margin: 12px 0;">'
f"{_html.escape(metric_label)}"
f"{_html.escape(group_suffix)}"
f"</div>\n"
)
metric_name = metric_label.lower()
if "ttft" in metric_name:
styler = _highlight_threshold(display_group, args.ttft_max_ms)
elif ("tpot" in metric_name) or ("median" in metric_name) or ("p99" in metric_name):
styler = _highlight_threshold(display_group, args.tpot_max_ms)
else:
styler = display_group.style
styler = _apply_two_decimals(styler)
styler = highlight_ratio_columns(styler)
return title + styler.to_html(table_attributes='border="1" class="dataframe"')
def maybe_write_plot(
main_fh,
sub_fh,
group_df: pd.DataFrame,
raw_data_cols: list[str],
metric_label: str,
y_axis_col: str,
args,
):
if not (args.plot and plotly_found):
return
import plotly.express as px
df = group_df[raw_data_cols].sort_values(by=y_axis_col)
df_melted = df.melt(
id_vars=y_axis_col,
var_name="Configuration",
value_name=metric_label,
)
fig = px.line(
df_melted,
x=y_axis_col,
y=metric_label,
color="Configuration",
title=f"{metric_label} vs {y_axis_col}",
markers=True,
)
# Ensure plot hover + y tick labels are also 2 decimals.
fig.update_traces(hovertemplate="%{y:.2f}<extra></extra>")
fig.update_yaxes(tickformat=".2f")
metric_name = metric_label.lower()
if "ttft" in metric_name:
_add_limit_line(fig, args.ttft_max_ms, "TTFT limit")
elif ("tpot" in metric_name) or ("median" in metric_name) or ("p99" in metric_name):
_add_limit_line(fig, args.tpot_max_ms, "TPOT limit")
html = fig.to_html(full_html=True, include_plotlyjs="cdn")
main_fh.write(html)
sub_fh.write(html)
def build_group_keys(
df: pd.DataFrame, group_cols: list[str], sort_cols: list[str] | None = None
):
if sort_cols:
df = df.sort_values(by=sort_cols)
gb = df.groupby(group_cols, dropna=False)
return [k for k, _ in gb]
def write_report_group_first(
files: list[str], info_cols: list[str], plan: MetricPlan, args
):
name_column = "Test name"
y_axis_col = get_y_axis_col(info_cols, args.xaxis)
print("comparing : " + ", ".join(files))
debug = args.debug
plot = args.plot
# For Plot feature, assign y axis from one of info_cols
y_axis_index = info_cols.index(args.xaxis) if args.xaxis in info_cols else 6
with open("perf_comparison.html", "w") as text_file:
for i in range(len(data_cols_to_compare)):
output_df, raw_data_cols = compare_data_columns(
files,
name_column,
data_cols_to_compare[i],
info_cols,
drop_column,
debug=debug,
metric_cache: dict[str, tuple[pd.DataFrame, list[str]]] = {}
group_cols_canonical: list[str] | None = None
for metric_label in plan.data_cols:
output_df, raw_data_cols = compare_data_columns(
files,
name_column,
metric_label,
info_cols,
plan.drop_column,
debug=args.debug,
)
raw_data_cols = list(raw_data_cols)
raw_data_cols.insert(0, y_axis_col)
group_cols = get_group_cols(output_df, info_cols)
if group_cols_canonical is None:
group_cols_canonical = group_cols
else:
group_cols_canonical = [c for c in group_cols_canonical if c in group_cols]
metric_cache[metric_label] = (
output_df.sort_values(by=args.xaxis),
raw_data_cols,
)
if not group_cols_canonical:
raise ValueError("No canonical group columns found across metrics.")
first_metric = plan.data_cols[0]
first_df_sorted, _ = metric_cache[first_metric]
group_keys = build_group_keys(
first_df_sorted, group_cols_canonical, sort_cols=[args.xaxis]
)
metric_groupbys = {
metric_label: df.groupby(group_cols_canonical, dropna=False)
for metric_label, (df, _) in metric_cache.items()
}
with open("perf_comparison.html", "w", encoding="utf-8") as main_fh:
main_fh.write('<meta charset="utf-8">\n')
for gkey in group_keys:
gkey_tuple = normalize_group_key(gkey)
suffix = build_group_suffix(group_cols_canonical, gkey_tuple)
sub_path = group_filename(gkey_tuple)
group_header = (
'<div style="font-size: 1.4em; font-weight: 700; '
'margin: 18px 0 10px 0;">'
f"{_html.escape(suffix)}"
"</div>\n"
)
# For Plot feature, insert y axis from one of info_cols
raw_data_cols.insert(0, info_cols[y_axis_index])
main_fh.write(group_header)
with open(sub_path, "w", encoding="utf-8") as sub_fh:
sub_fh.write('<meta charset="utf-8">\n')
sub_fh.write(group_header)
tput_group_df = None
ttft_group_df = None
tpot_group_df = None
conc_col = args.xaxis
filtered_info_cols = info_cols[:-2]
existing_group_cols = [
c for c in filtered_info_cols if c in output_df.columns
]
if not existing_group_cols:
raise ValueError(
f"No valid group-by columns "
f"Expected subset: {filtered_info_cols}, "
f"but DataFrame has: {list(output_df.columns)}"
for metric_label in plan.data_cols:
gb = metric_groupbys[metric_label]
df_sorted, raw_data_cols = metric_cache[metric_label]
try:
group_df = gb.get_group(gkey)
except KeyError:
missing = (
'<div style="font-size: 1.1em; font-weight: 600; '
'margin: 10px 0;">'
f"{_html.escape(metric_label)} — missing for this group"
"</div>\n"
)
main_fh.write(missing)
sub_fh.write(missing)
continue
if conc_col not in group_df.columns:
conc_col = _find_concurrency_col(group_df)
mn = metric_label.lower().strip()
if "tok/s" in mn:
tput_group_df = group_df
elif "ttft" in mn:
ttft_group_df = group_df
elif mn in ("p99", "median") or "tpot" in mn:
tpot_group_df = group_df
display_group = group_df.drop(
columns=group_cols_canonical, errors="ignore"
)
html = render_metric_table_html(
display_group, metric_label, suffix, args
)
main_fh.write(html)
sub_fh.write(html)
maybe_write_plot(
main_fh,
sub_fh,
group_df=group_df,
raw_data_cols=raw_data_cols,
metric_label=metric_label,
y_axis_col=y_axis_col,
args=args,
)
summary_html = build_valid_max_concurrency_summary_html(
tput_group_df=tput_group_df,
ttft_group_df=ttft_group_df,
tpot_group_df=tpot_group_df,
conc_col=conc_col,
args=args,
)
# output_df_sorted = output_df.sort_values(by=existing_group_cols)
output_df_sorted = output_df.sort_values(by=args.xaxis)
output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
for name, group in output_groups:
group_name = (
",".join(map(str, name)).replace(",", "_").replace("/", "-")
)
group_html_name = "perf_comparison_" + group_name + ".html"
if summary_html:
main_fh.write(summary_html)
sub_fh.write(summary_html)
metric_name = str(data_cols_to_compare[i]).lower()
if "tok/s" in metric_name:
html = group.to_html()
elif "ttft" in metric_name:
styler = _highlight_threshold(group, args.ttft_max_ms).format(
{c: "{:.2f}" for c in group.select_dtypes("number").columns},
na_rep="",
)
html = styler.to_html(
table_attributes='border="1" class="dataframe"'
)
elif (
"tpot" in metric_name
or "median" in metric_name
or "p99" in metric_name
):
styler = _highlight_threshold(group, args.tpot_max_ms).format(
{c: "{:.2f}" for c in group.select_dtypes("number").columns},
na_rep="",
)
html = styler.to_html(
table_attributes='border="1" class="dataframe"'
)
text_file.write(html_msgs_for_data_cols[i])
text_file.write(html)
with open(group_html_name, "a+") as sub_text_file:
sub_text_file.write(html_msgs_for_data_cols[i])
sub_text_file.write(html)
def main():
args = build_parser().parse_args()
info_cols = list(DEFAULT_INFO_COLS)
plan = choose_metrics(args.latency)
files, info_cols = prepare_input_files(args, info_cols)
write_report_group_first(files, info_cols, plan, args)
if plot and plotly_found:
import plotly.express as px
df = group[raw_data_cols]
df_sorted = df.sort_values(by=info_cols[y_axis_index])
# Melt DataFrame for plotting
df_melted = df_sorted.melt(
id_vars=info_cols[y_axis_index],
var_name="Configuration",
value_name=data_cols_to_compare[i],
)
title = (
data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
)
# Create Plotly line chart
fig = px.line(
df_melted,
x=info_cols[y_axis_index],
y=data_cols_to_compare[i],
color="Configuration",
title=title,
markers=True,
)
# ---- Add threshold lines based on metric name ----
if "ttft" in metric_name:
_add_limit_line(fig, args.ttft_max_ms, "TTFT limit")
elif (
"tpot" in metric_name
or "median" in metric_name
or "p99" in metric_name
):
_add_limit_line(fig, args.tpot_max_ms, "TPOT limit")
# Export to HTML
text_file.write(
fig.to_html(full_html=True, include_plotlyjs="cdn")
)
sub_text_file.write(
fig.to_html(full_html=True, include_plotlyjs="cdn")
)
if __name__ == "__main__":
main()
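
Reviewer note: the helpers `normalize_group_key` and `group_filename` above decide which per-group HTML file each group is written to. The sketch below copies those two helpers from this diff and runs them on a made-up group key. The key values are assumptions for illustration, not taken from a real benchmark run. It shows that tuple keys and scalar keys both yield a usable file name, which matters because `groupby` may hand back either form depending on how many columns are grouped and on the pandas version.

```python
def normalize_group_key(name):
    # groupby keys can be a scalar or a tuple; always work with a tuple.
    return name if isinstance(name, tuple) else (name,)


def group_filename(name, prefix: str = "perf_comparison_") -> str:
    name_vals = normalize_group_key(name)
    # Join the key parts with "_" and replace "/" so the result is a flat file name.
    safe = ",".join(map(str, name_vals)).replace(",", "_").replace("/", "-")
    return f"{prefix}{safe}.html"


# Hypothetical group key: (model, dataset, input length, output length).
tuple_key = ("meta-llama/Llama-3.1-8B-Instruct", "sharegpt", 128, 128)
tuple_name = group_filename(tuple_key)
scalar_name = group_filename("sharegpt")

print(tuple_name)
print(scalar_name)
```

Note that a comma inside a key value is also turned into `_`, so two distinct keys could in principle map to the same file name.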

View File

@@ -49,7 +49,11 @@ check_cpus() {
echo "Need at least 1 NUMA to run benchmarking."
exit 1
fi
declare -g gpu_type="cpu"
if [[ "$(uname -m)" == "aarch64" ]] || [[ "$(uname -m)" == "arm64" ]]; then
declare -g gpu_type="arm64-cpu"
else
declare -g gpu_type="cpu"
fi
echo "GPU type is $gpu_type"
}
@@ -207,8 +211,8 @@ run_latency_tests() {
# check if there is enough GPU to run the test
tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size')
if [ "$ON_CPU" == "1" ]; then
pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size')
if [[ "$ON_CPU" == "1" ]]; then
pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size // 1')
world_size=$(($tp*$pp))
if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
@@ -276,8 +280,8 @@ run_throughput_tests() {
# check if there is enough GPU to run the test
tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size')
if [ "$ON_CPU" == "1" ]; then
pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size')
if [[ "$ON_CPU" == "1" ]]; then
pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size // 1')
world_size=$(($tp*$pp))
if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
@@ -393,8 +397,8 @@ run_serving_tests() {
# check if there is enough resources to run the test
tp=$(echo "$server_params" | jq -r '.tensor_parallel_size')
if [ "$ON_CPU" == "1" ]; then
pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size')
if [[ "$ON_CPU" == "1" ]]; then
pp=$(echo "$server_params" | jq -r '.pipeline_parallel_size // 1')
world_size=$(($tp*$pp))
if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
@@ -496,9 +500,9 @@ run_serving_tests() {
main() {
local ARCH
ARCH=''
if [ "$ON_CPU" == "1" ];then
check_cpus
ARCH='-cpu'
if [[ "$ON_CPU" == "1" ]]; then
check_cpus
ARCH="-$gpu_type"
else
check_gpus
ARCH="$arch_suffix"

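Reviewer note: the three hunks above switch to `jq -r '.pipeline_parallel_size // 1'`, so a test case that omits `pipeline_parallel_size` is treated as pp=1 instead of yielding the string `null`. The Python sketch below mirrors that world-size check under stated assumptions: the function names and the `remote_host` parameter are hypothetical stand-ins for the shell variables, not part of the script.

```python
import json


def world_size(params_json: str) -> int:
    """Return tp * pp, treating a missing/null/false pp as 1 (like jq's `// 1`)."""
    params = json.loads(params_json)
    tp = params["tensor_parallel_size"]
    pp = params.get("pipeline_parallel_size")
    if pp is None or pp is False:
        pp = 1
    return tp * pp


def should_skip(params_json: str, numa_count: int, remote_host: str = "") -> bool:
    """Skip when there are fewer NUMA nodes than ranks and no remote host is set."""
    return numa_count < world_size(params_json) and not remote_host


print(world_size('{"tensor_parallel_size": 2}'))
print(world_size('{"tensor_parallel_size": 2, "pipeline_parallel_size": 2}'))
print(should_skip('{"tensor_parallel_size": 2}', numa_count=1))
```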
View File

@@ -0,0 +1,26 @@
[
{
"test_name": "latency_llama8B_tp1",
"environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"num_iters_warmup": 5,
"num_iters": 15
}
}
]

View File

@@ -0,0 +1,130 @@
{
"defaults": {
"qps_list": [
"inf"
],
"max_concurrency_list": [
12,
16,
24,
32,
64,
128,
200
],
"server_environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_SGL_KERNEL": 1,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"server_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"backend": "vllm",
"ignore-eos": "",
"num_prompts": 200
}
},
"tests": [
{
"test_name": "serving_llama8B_tp1_sharegpt",
"server_parameters": {
"tensor_parallel_size": 1
},
"client_parameters": {
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json"
}
},
{
"test_name": "serving_llama8B_tp2_sharegpt",
"server_parameters": {
"tensor_parallel_size": 2
},
"client_parameters": {
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json"
}
},
{
"test_name": "serving_llama8B_tp1_random_128_128",
"server_parameters": {
"tensor_parallel_size": 1
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_tp2_random_128_128",
"server_parameters": {
"tensor_parallel_size": 2
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_tp1_random_128_2048",
"server_parameters": {
"tensor_parallel_size": 1
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 2048
}
},
{
"test_name": "serving_llama8B_tp2_random_128_2048",
"server_parameters": {
"tensor_parallel_size": 2
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 2048
}
},
{
"test_name": "serving_llama8B_tp1_random_2048_128",
"server_parameters": {
"tensor_parallel_size": 1
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 2048,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_tp2_random_2048_128",
"server_parameters": {
"tensor_parallel_size": 2
},
"client_parameters": {
"dataset_name": "random",
"random-input-len": 2048,
"random-output-len": 128
}
}
]
}

View File

@@ -19,10 +19,8 @@
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"load_format": "dummy"
"max_num_seqs": 256
},
"client_parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
@@ -151,6 +149,45 @@
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_int4_tp1_random_128_128",
"server_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"tensor_parallel_size": 1
},
"client_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_int4_tp2_random_128_128",
"server_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"tensor_parallel_size": 2
},
"client_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama8B_int4_tp4_random_128_128",
"server_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"tensor_parallel_size": 4
},
"client_parameters": {
"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
"dataset_name": "random",
"random-input-len": 128,
"random-output-len": 128
}
},
{
"test_name": "serving_llama3B_tp1_random_128_128",
"server_parameters": {

View File

@@ -0,0 +1,27 @@
[
{
"test_name": "throughput_llama8B_tp1",
"environment_variables": {
"VLLM_RPC_TIMEOUT": 100000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
"VLLM_CPU_KVCACHE_SPACE": 40
},
"parameters": {
"model": "meta-llama/Llama-3.1-8B-Instruct",
"tensor_parallel_size": 1,
"load_format": "dummy",
"dtype": "bfloat16",
"distributed_executor_backend": "mp",
"block_size": 128,
"trust_remote_code": "",
"disable_log_stats": "",
"enforce_eager": "",
"max_num_batched_tokens": 2048,
"max_num_seqs": 256,
"dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200,
"backend": "vllm"
}
}
]

View File

@@ -291,6 +291,7 @@ if __name__ == "__main__":
"""
Arguments:
--version <version> : version string for the current build (e.g., commit hash)
--wheel-dir <wheel_directory> : directory containing wheel files (defaults to `version`)
--current-objects <path_to_json> : path to JSON file containing current S3 objects listing in this version directory
--output-dir <output_directory> : directory to store generated index files
--alias-to-default <alias_variant_name> : (optional) alias variant name for the default variant
@@ -318,6 +319,12 @@ if __name__ == "__main__":
required=True,
help="Directory to store generated index files",
)
parser.add_argument(
"--wheel-dir",
type=str,
default=None,
help="Directory containing wheel files (defaults to `version`)",
)
parser.add_argument(
"--alias-to-default",
type=str,
@@ -372,7 +379,7 @@ if __name__ == "__main__":
print(f"Found {len(wheel_files)} wheel files for version {version}: {wheel_files}")
# keep only "official" files for a non-nightly version (specifed by cli args)
# keep only "official" files for a non-nightly version (specified by cli args)
PY_VERSION_RE = re.compile(r"^\d+\.\d+\.\d+([a-zA-Z0-9.+-]*)?$")
if PY_VERSION_RE.match(version):
# upload-wheels.sh ensures no "dev" is in args.version
@@ -384,9 +391,10 @@ if __name__ == "__main__":
print("Nightly version detected, keeping all wheel files.")
# Generate index and metadata, assuming wheels and indices are stored as:
# s3://vllm-wheels/{version}/<wheel files>
# s3://vllm-wheels/{wheel_dir}/<wheel files>
# s3://vllm-wheels/<anything>/<index files>
wheel_base_dir = Path(output_dir).parent / version
wheel_dir = args.wheel_dir or version
wheel_base_dir = Path(output_dir).parent / wheel_dir.strip().rstrip("/")
index_base_dir = Path(output_dir)
generate_index_and_metadata(

View File

@@ -209,12 +209,21 @@ if [[ $commands == *"--shard-id="* ]]; then
wait "${pid}"
STATUS+=($?)
done
at_least_one_shard_with_tests=0
for st in "${STATUS[@]}"; do
if [[ ${st} -ne 0 ]]; then
if [[ ${st} -ne 0 ]] && [[ ${st} -ne 5 ]]; then
echo "One of the processes failed with $st"
exit "${st}"
elif [[ ${st} -eq 5 ]]; then
echo "Shard exited with status 5 (no tests collected) - treating as success"
else # This means st is 0
at_least_one_shard_with_tests=1
fi
done
if [[ ${#STATUS[@]} -gt 0 && ${at_least_one_shard_with_tests} -eq 0 ]]; then
echo "All shards reported no tests collected. Failing the build."
exit 1
fi
else
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
docker run \

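Reviewer note: pytest exits with status 5 when no tests were collected, so a shard that is allocated no tests should not fail the job on its own. The sketch below restates the shell loop above in Python to make the three outcomes explicit. `aggregate_shard_status` is a hypothetical name, and the function returns the exit code rather than calling `exit`.

```python
def aggregate_shard_status(statuses: list[int]) -> int:
    """Combine per-shard pytest exit codes into one job exit code."""
    saw_tests = False
    for st in statuses:
        if st not in (0, 5):
            # A real failure: propagate the first failing status.
            return st
        if st == 0:
            # This shard collected and passed at least one test.
            saw_tests = True
        # st == 5: no tests collected on this shard, treated as success.
    if statuses and not saw_tests:
        # Every shard reported "no tests collected": fail the build.
        return 1
    return 0


print(aggregate_shard_status([0, 5, 0]))
print(aggregate_shard_status([5, 5]))
print(aggregate_shard_status([0, 2, 5]))
```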
View File

@@ -84,7 +84,7 @@ function cpu_tests() {
docker exec cpu-test-"$NUMA_NODE" bash -c "
set -e
pytest -x -s -v \
tests/lora/test_qwen2vl.py"
tests/lora/test_qwenvl.py"
# online serving: tp+pp
docker exec cpu-test-"$NUMA_NODE" bash -c '

View File

@@ -61,7 +61,7 @@ echo "Results will be stored in: $RESULTS_DIR"
echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
&& python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
&& python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \
&& python3 -m pip install --progress-bar off "lm-eval[api]>=0.4.9.2" \
&& python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0
echo "--- Python dependencies installed ---"

View File

@@ -61,7 +61,7 @@ echo "Results will be stored in: $RESULTS_DIR"
echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
&& python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
&& python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \
&& python3 -m pip install --progress-bar off "lm-eval[api]>=0.4.9.2" \
&& python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0
echo "--- Python dependencies installed ---"

View File

@@ -2,6 +2,17 @@
set -euox pipefail
# Detect ROCm by checking multiple indicators:
if [ -e /dev/kfd ] || \
[ -d /opt/rocm ] || \
command -v rocm-smi &> /dev/null || \
[ -n "${ROCM_HOME:-}" ]; then
IS_ROCM=1
else
IS_ROCM=0
fi
if [[ $# -lt 4 ]]; then
echo "Usage: .buildkite/scripts/run-multi-node-test.sh WORKING_DIR NUM_NODES NUM_GPUS DOCKER_IMAGE COMMAND1 COMMAND2 ... COMMANDN"
exit 1
@@ -26,13 +37,18 @@ for command in "${COMMANDS[@]}"; do
echo "$command"
done
start_network() {
docker network create --subnet=192.168.10.0/24 docker-net
}
start_nodes() {
for node in $(seq 0 $(($NUM_NODES-1))); do
GPU_DEVICES='"device='
if [ "$IS_ROCM" -eq 1 ]; then
GPU_DEVICES='--device /dev/kfd --device /dev/dri -e HIP_VISIBLE_DEVICES='
else
GPU_DEVICES='--gpus "device='
fi
for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do
DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu))
GPU_DEVICES+=$(($DEVICE_NUM))
@@ -40,7 +56,9 @@ start_nodes() {
GPU_DEVICES+=','
fi
done
GPU_DEVICES+='"'
if [ "$IS_ROCM" -eq 0 ]; then
GPU_DEVICES+='"'
fi
# start the container in detached mode
# things to note:
@@ -49,7 +67,7 @@ start_nodes() {
# 3. map the huggingface cache directory to the container
# 4. assign IP addresses to the containers (head node: 192.168.10.10, worker nodes:
# starting from 192.168.10.11)
docker run -d --gpus "$GPU_DEVICES" --shm-size=10.24gb -e HF_TOKEN \
docker run -d $GPU_DEVICES --shm-size=10.24gb -e HF_TOKEN \
-v ~/.cache/huggingface:/root/.cache/huggingface --name "node$node" \
--network docker-net --ip 192.168.10.$((10 + $node)) --rm "$DOCKER_IMAGE" \
/bin/bash -c "tail -f /dev/null"

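Reviewer note: after this change `GPU_DEVICES` holds the complete device arguments for `docker run`, not just the value of `--gpus`. The sketch below rebuilds that string in Python for one node. It assumes the device IDs are comma-separated, since the line that guards `GPU_DEVICES+=','` falls outside the hunks shown, and `gpu_device_args` is a hypothetical name.

```python
def gpu_device_args(node: int, num_gpus: int, is_rocm: bool) -> str:
    """Build the docker device arguments for one node, as in start_nodes()."""
    # Each node gets a contiguous block of device IDs.
    ids = ",".join(str(node * num_gpus + gpu) for gpu in range(num_gpus))
    if is_rocm:
        # ROCm: expose the kfd/dri devices and select GPUs via HIP_VISIBLE_DEVICES.
        return f"--device /dev/kfd --device /dev/dri -e HIP_VISIBLE_DEVICES={ids}"
    # Otherwise: select GPUs through docker's --gpus flag, with a closing quote.
    return f'--gpus "device={ids}"'


print(gpu_device_args(node=1, num_gpus=2, is_rocm=False))
print(gpu_device_args(node=0, num_gpus=2, is_rocm=True))
```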
View File

@@ -102,6 +102,7 @@ if [[ "$version" != *"dev"* ]]; then
echo "Re-generating indices for /$pure_version/"
rm -rf "$INDICES_OUTPUT_DIR/*"
mkdir -p "$INDICES_OUTPUT_DIR"
$PYTHON .buildkite/scripts/generate-nightly-index.py --version "$pure_version" --current-objects "$obj_json" --output-dir "$INDICES_OUTPUT_DIR" --comment "version $pure_version" $alias_arg
# wheel-dir is overridden to be the commit directory, so that the indices point to the correct wheel path
$PYTHON .buildkite/scripts/generate-nightly-index.py --version "$pure_version" --wheel-dir "$SUBPATH" --current-objects "$obj_json" --output-dir "$INDICES_OUTPUT_DIR" --comment "version $pure_version" $alias_arg
aws s3 cp --recursive "$INDICES_OUTPUT_DIR/" "s3://$BUCKET/$pure_version/"
fi

View File

@@ -162,7 +162,7 @@ steps:
- tests/entrypoints/test_chat_utils
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses
- pytest -v -s entrypoints/test_chat_utils.py
- label: Entrypoints Integration Test (API Server 2)
@@ -199,6 +199,21 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/pooling
- label: Entrypoints Integration Test (Responses API)
timeout_in_minutes: 50
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
# grade: Blocking
working_dir: "/vllm-workspace/tests"
fast_check: true
torch_nightly: true
source_file_dependencies:
- vllm/
- tests/entrypoints/openai/responses
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/openai/responses
- label: Distributed Tests (4 GPUs) # 35min
timeout_in_minutes: 50
mirror_hardwares: [amdexperimental]
@@ -219,6 +234,9 @@ steps:
- tests/v1/engine/test_engine_core_client.py
- tests/distributed/test_symm_mem_allreduce.py
commands:
# Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876
# TODO: Remove when the bug is fixed in a future ROCm release
- export TORCH_NCCL_BLOCKING_WAIT=1
# test with torchrun tp=2 and external_dp=2
- torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
# test with torchrun tp=2 and pp=2
@@ -267,9 +285,10 @@ steps:
- vllm/v1/executor/uniproc_executor.py
- vllm/v1/worker/gpu_worker.py
commands:
# https://github.com/NVIDIA/nccl/issues/1838
#- export NCCL_CUMEM_HOST_ENABLE=0
# test with torchrun tp=2 and dp=4 with ep
# Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876
# TODO: Remove when the bug is fixed in a future ROCm release
- export TORCH_NCCL_BLOCKING_WAIT=1
- torchrun --nproc-per-node=8 ../examples/offline_inference/torchrun_dp_example.py --tp-size=2 --pp-size=1 --dp-size=4 --enable-ep
- label: EPLB Algorithm Test # 5min
@@ -349,7 +368,9 @@ steps:
- label: V1 Test e2e + engine # 65min
timeout_in_minutes: 90
mirror_hardwares: [amdexperimental]
agent_pool: mi325_4
# The test uses 4 GPUs, but we schedule it on 8-GPU machines for stability.
# See discussion here: https://github.com/vllm-project/vllm/pull/31040
agent_pool: mi325_8
# grade: Blocking
source_file_dependencies:
- vllm/
@@ -510,8 +531,7 @@ steps:
- tests/samplers
- tests/conftest.py
commands:
- pytest -v -s samplers
- VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
- pytest -v -s -m 'not skip_v1' samplers
- label: LoRA Test %N # 20min each
timeout_in_minutes: 30
@@ -725,7 +745,7 @@ steps:
- label: Quantization Test # 70min
timeout_in_minutes: 90
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
source_file_dependencies:
@@ -765,8 +785,9 @@ steps:
- csrc/
- vllm/entrypoints/openai/
- vllm/model_executor/models/whisper.py
- tools/
commands: # LMEval+Transcription WER check
# Transcription WER check is skipped because encoder-decoder models are not supported on ROCm, see https://github.com/vllm-project/vllm/issues/27442
- bash ../tools/install_torchcodec_rocm.sh || exit 1
- pytest -s entrypoints/openai/correctness/
@@ -861,6 +882,7 @@ steps:
# Shard slow subset of standard language models tests. Only run when model
# source is modified, or when specified test files are modified
- pip freeze | grep -E 'torch'
- export TORCH_NCCL_BLOCKING_WAIT=1
- pytest -v -s models/language -m 'core_model and slow_test' \
--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
--shard-id=$$BUILDKITE_PARALLEL_JOB
@@ -878,7 +900,7 @@ steps:
commands:
# Install fast path packages for testing against transformers
# Note: also needed to run plamo2 model in vLLM
- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
- uv pip install --system --no-build-isolation 'git+https://github.com/AndreasKaratzas/mamba@fix-rocm-7.0-warp-size-constexpr'
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
# Shard hybrid language model tests
- pytest -v -s models/language/generation \
@@ -899,7 +921,7 @@ steps:
commands:
# Install fast path packages for testing against transformers
# Note: also needed to run plamo2 model in vLLM
- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
- uv pip install --system --no-build-isolation 'git+https://github.com/AndreasKaratzas/mamba@fix-rocm-7.0-warp-size-constexpr'
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
@@ -964,7 +986,7 @@ steps:
- pytest -v -s models/multimodal/processing
- label: Multi-Modal Models Test (Standard) # 60min
timeout_in_minutes: 80
timeout_in_minutes: 100
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
# grade: Blocking
@@ -973,13 +995,16 @@ steps:
- vllm/
- tests/models/multimodal
commands:
- export MIOPEN_DEBUG_CONV_DIRECT=0
- export MIOPEN_DEBUG_CONV_GEMM=0
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pip freeze | grep -E 'torch'
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing --ignore models/multimodal/pooling/test_prithvi_mae.py
- pytest -v -s models/multimodal/pooling/test_prithvi_mae.py -m core_model
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
- label: Multi-Modal Accuracy Eval (Small Models) # 150min - 180min
timeout_in_minutes: 180
- label: Multi-Modal Accuracy Eval (Small Models) # 5min
timeout_in_minutes: 10
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_1
# grade: Blocking
@@ -989,7 +1014,9 @@ steps:
- vllm/inputs/
- vllm/v1/core/
commands:
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1
- export MIOPEN_DEBUG_CONV_DIRECT=0
- export MIOPEN_DEBUG_CONV_GEMM=0
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt
- label: Multi-Modal Models Test (Extended) 1 # 60min
timeout_in_minutes: 120
@@ -1001,10 +1028,13 @@ steps:
- vllm/
- tests/models/multimodal
commands:
- export MIOPEN_DEBUG_CONV_DIRECT=0
- export MIOPEN_DEBUG_CONV_GEMM=0
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal -m 'not core_model' --ignore models/multimodal/generation/test_common.py --ignore models/multimodal/processing
- label: Multi-Modal Models Test (Extended) 2
- label: Multi-Modal Models Test (Extended) 2 #60min
timeout_in_minutes: 120
mirror_hardwares: [amdexperimental]
agent_pool: mi325_1
# grade: Blocking
@@ -1013,6 +1043,8 @@ steps:
- vllm/
- tests/models/multimodal
commands:
- export MIOPEN_DEBUG_CONV_DIRECT=0
- export MIOPEN_DEBUG_CONV_GEMM=0
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=0) and not core_model'
@@ -1026,6 +1058,8 @@ steps:
- vllm/
- tests/models/multimodal
commands:
- export MIOPEN_DEBUG_CONV_DIRECT=0
- export MIOPEN_DEBUG_CONV_GEMM=0
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal/generation/test_common.py -m 'split(group=1) and not core_model'
@@ -1085,8 +1119,8 @@ steps:
- vllm/v1/attention/backends/flashinfer.py
- vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py
- vllm/v1/attention/selector.py
- vllm/platforms/cuda.py
- vllm/attention/selector.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py
@@ -1243,13 +1277,13 @@ steps:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- label: Distributed Tests (2 GPUs) # 68min
timeout_in_minutes: 90
@@ -1275,6 +1309,9 @@ steps:
- tests/v1/shutdown
- tests/v1/worker/test_worker_memory_snapshot.py
commands:
# Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876
# TODO: Remove when the bug is fixed in a future ROCm release
- export TORCH_NCCL_BLOCKING_WAIT=1
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_eagle_dp.py
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
@@ -1424,8 +1461,22 @@ steps:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt
- VLLM_ATTENTION_BACKEND=ROCM_ATTN bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min
mirror_hardwares: [amdexperimental]
agent_pool: mi325_4
# grade: Blocking
timeout_in_minutes: 15
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors_rocm.txt
- VLLM_ATTENTION_BACKEND=ROCM_ATTN DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
##### multi gpus test #####
##### A100 test #####
@@ -1497,7 +1548,7 @@ steps:
- "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'"
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py
- pytest -v -s tests/distributed/test_context_parallel.py
- HIP_VISIBLE_DEVICES=0,1 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 --all2all-backend deepep_high_throughput
- HIP_VISIBLE_DEVICES=0,1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=allgather_reducescatter --disable-nccl-for-dp-synchronization
- pytest -v -s tests/v1/distributed/test_dbo.py
##### B200 test #####
@@ -1576,6 +1627,8 @@ steps:
- .buildkite/scripts/run-prime-rl-test.sh
commands:
- bash .buildkite/scripts/run-prime-rl-test.sh
##### EPLB Accuracy Tests #####
- label: DeepSeek V2-Lite Accuracy
mirror_hardwares: [amdexperimental, amdproduction]
agent_pool: mi325_4

View File

@@ -144,7 +144,7 @@ steps:
- tests/entrypoints/test_chat_utils
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses
- pytest -v -s entrypoints/test_chat_utils.py
- label: Entrypoints Integration Test (API Server 2)
@@ -177,6 +177,18 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/pooling
- label: Entrypoints Integration Test (Responses API)
timeout_in_minutes: 50
mirror_hardwares: [amdexperimental]
working_dir: "/vllm-workspace/tests"
fast_check: true
torch_nightly: true
source_file_dependencies:
- vllm/
- tests/entrypoints/openai/responses
commands:
- pytest -v -s entrypoints/openai/responses
- label: Distributed Tests (4 GPUs) # 35min
timeout_in_minutes: 50
mirror_hardwares: [amdexperimental]
@@ -943,7 +955,6 @@ steps:
timeout_in_minutes: 30
working_dir: "/vllm-workspace/"
gpu: b200
# optional: true
source_file_dependencies:
- csrc/quantization/fp4/
- csrc/attention/mla/
@@ -955,8 +966,8 @@ steps:
- vllm/v1/attention/backends/flashinfer.py
- vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py
- vllm/v1/attention/selector.py
- vllm/platforms/cuda.py
- vllm/attention/selector.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py
@@ -1105,17 +1116,18 @@ steps:
- vllm/model_executor/models/
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
- .buildkite/scripts/run-multi-node-test.sh
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
- python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
- label: Distributed Tests (2 GPUs) # 68min
timeout_in_minutes: 90
@@ -1267,8 +1279,8 @@ steps:
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
- label: NixlConnector PD accuracy tests (Distributed) # 30min
timeout_in_minutes: 30
- label: NixlConnector PD accuracy tests (Distributed) # 40min
timeout_in_minutes: 40
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
@@ -1276,7 +1288,18 @@ steps:
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
- bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
- label: DP EP NixlConnector PD accuracy tests (Distributed) # 15min
timeout_in_minutes: 15
working_dir: "/vllm-workspace/tests"
num_gpus: 4
source_file_dependencies:
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- DP_EP=1 bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
##### multi gpus test #####
@@ -1334,9 +1357,17 @@ steps:
- "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'"
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py
- pytest -v -s tests/distributed/test_context_parallel.py
- CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 --all2all-backend deepep_high_throughput
- CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
- pytest -v -s tests/v1/distributed/test_dbo.py
- label: LM Eval Large Models (H200) # optional
timeout_in_minutes: 60
gpu: h200
optional: true
num_gpus: 8
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-h200.txt
##### B200 test #####
- label: Distributed Tests (B200) # optional
gpu: b200
@@ -1359,6 +1390,7 @@ steps:
- vllm/
- .buildkite/scripts/run-prime-rl-test.sh
commands:
- nvidia-smi
- bash .buildkite/scripts/run-prime-rl-test.sh
- label: DeepSeek V2-Lite Accuracy
@@ -1387,3 +1419,26 @@ steps:
working_dir: "/vllm-workspace"
commands:
- bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 2 1
##### MoE Refactor (Temporary) Tests #####
- label: MoE Refactor Integration Test (H100 - TEMPORARY) # optional
gpu: h100
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt
- label: MoE Refactor Integration Test (B200 - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-b200.txt
- label: MoE Refactor Integration Test (B200 DP - TEMPORARY) # optional
gpu: b200
optional: true
num_gpus: 2
commands:
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor-dp-ep/config-b200.txt

View File

@@ -145,7 +145,7 @@ steps:
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py
- pytest -v -s tests/distributed/test_context_parallel.py
- CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048 --all2all-backend deepep_high_throughput
- CUDA_VISIBLE_DEVICES=1,2 VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
- pytest -v -s tests/v1/distributed/test_dbo.py
- label: Distributed Tests (2 GPUs)(B200)
@@ -171,7 +171,7 @@ steps:
- tests/distributed/
- tests/examples/offline_inference/data_parallel.py
commands:
- ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0bec63fa317e1fbd62e19b0fc31c43c81bf89077 "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=0 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py --dp-size=2 --tp-size=1 --node-size=2 --node-rank=1 --master-addr=192.168.10.10 --master-port=12345 --enforce-eager --trust-remote-code"
- ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0bec63fa317e1fbd62e19b0fc31c43c81bf89077 "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code"
- label: Distributed NixlConnector PD accuracy (4 GPUs)
timeout_in_minutes: 30
@@ -182,7 +182,7 @@ steps:
- tests/v1/kv_connector/nixl_integration/
commands:
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
- bash v1/kv_connector/nixl_integration/config_sweep_accuracy_test.sh
- label: Pipeline + Context Parallelism (4 GPUs))
timeout_in_minutes: 60

View File

@@ -34,10 +34,9 @@ steps:
- tests/entrypoints/test_chat_utils
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/tool_parsers/ --ignore=entrypoints/openai/responses
- pytest -v -s entrypoints/test_chat_utils.py
- label: Entrypoints Integration (API Server 2)
timeout_in_minutes: 130
working_dir: "/vllm-workspace/tests"
@@ -64,6 +63,14 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/pooling
- label: Entrypoints Integration (Responses API)
timeout_in_minutes: 50
working_dir: "/vllm-workspace/tests"
source_file_dependencies:
- vllm/
- tests/entrypoints/openai/responses
commands:
- pytest -v -s entrypoints/openai/responses
- label: Entrypoints V1
timeout_in_minutes: 50

View File

@@ -90,8 +90,8 @@ steps:
- vllm/v1/attention/backends/flashinfer.py
- vllm/v1/attention/backends/mla/cutlass_mla.py
- vllm/v1/attention/backends/mla/flashinfer_mla.py
- vllm/v1/attention/selector.py
- vllm/platforms/cuda.py
- vllm/attention/selector.py
commands:
- nvidia-smi
- python3 examples/offline_inference/basic/chat.py

11
.github/CODEOWNERS vendored

@@ -3,7 +3,6 @@
# This lists cover the "core" components of vLLM that require careful review
/vllm/attention @LucasWilkinson
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @njhill @22quinn
/vllm/model_executor/layers/fused_moe @mgoin @pavanimajety
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @pavanimajety
@@ -15,6 +14,7 @@
/vllm/lora @jeejeelee
/vllm/reasoning @aarnphm @chaunceyjiang
/vllm/entrypoints @aarnphm @chaunceyjiang
/vllm/tool_parsers @aarnphm @chaunceyjiang
/vllm/compilation @zou3519 @youkaichao @ProExpertProg
/vllm/distributed/kv_transfer @NickLucche @ApostaC
CMakeLists.txt @tlrmchlsmth @LucasWilkinson
@@ -26,6 +26,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
# vLLM V1
/vllm/v1/attention @LucasWilkinson
/vllm/v1/attention/backend.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @njhill
/vllm/v1/attention/backends/mla @pavanimajety
/vllm/v1/attention/backends/flashinfer.py @mgoin @pavanimajety
/vllm/v1/attention/backends/triton_attn.py @tdoublep
@@ -116,15 +117,15 @@ mkdocs.yaml @hmellor
/vllm/transformers_utils/tokenizers/mistral.py @patrickvonplaten
# Kernels
/vllm/attention/ops/chunked_prefill_paged_decode.py @tdoublep
/vllm/attention/ops/triton_unified_attention.py @tdoublep
/vllm/v1/attention/ops/chunked_prefill_paged_decode.py @tdoublep
/vllm/v1/attention/ops/triton_unified_attention.py @tdoublep
# ROCm related: specify owner with write access to notify AMD folks for careful code review
/vllm/**/*rocm* @tjtanaa
/docker/Dockerfile.rocm* @gshtras @tjtanaa
/vllm/v1/attention/backends/rocm*.py @gshtras @tjtanaa
/vllm/v1/attention/backends/mla/rocm*.py @gshtras @tjtanaa
/vllm/attention/ops/rocm*.py @gshtras @tjtanaa
/vllm/v1/attention/ops/rocm*.py @gshtras @tjtanaa
/vllm/model_executor/layers/fused_moe/rocm*.py @gshtras @tjtanaa
/csrc/rocm @gshtras @tjtanaa
/requirements/*rocm* @tjtanaa
@@ -152,7 +153,7 @@ mkdocs.yaml @hmellor
/vllm/entrypoints/pooling @noooop
/vllm/config/pooler.py @noooop
/vllm/pooling_params.py @noooop
/vllm/model_executor/layers/pooler.py @noooop
/vllm/model_executor/layers/pooler @noooop
# Security guide and policies
/docs/usage/security.md @russellb

4
.github/mergify.yml vendored

@@ -222,10 +222,10 @@ pull_request_rules:
- files~=^csrc/rocm/
- files~=^docker/Dockerfile.rocm
- files~=^requirements/rocm.*\.txt
- files~=^vllm/attention/backends/rocm.*\.py
- files~=^vllm/attention/ops/rocm.*\.py
- files~=^vllm/model_executor/layers/fused_moe/rocm.*\.py
- files~=^vllm/v1/attention/backends/rocm.*\.py
- files~=^vllm/v1/attention/backends/mla/rocm.*\.py
- files~=^vllm/v1/attention/ops/rocm.*\.py
- files~=^tests/kernels/.*_rocm.*\.py
- files=vllm/platforms/rocm.py
- title~=(?i)AMD
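
As a rough illustration of how `files~=` path patterns like the ones in this rule select changed files, the sketch below applies a few of them with Python's `re.search`. It assumes the `~=` operator behaves like an unanchored regular-expression search (the `^` in each pattern then pins the match to the start of the path). The helper name `matches_rocm_rule` and the sample paths are hypothetical.

```python
import re

# A subset of the path patterns listed in the rule above.
ROCM_PATTERNS = [
    r"^csrc/rocm/",
    r"^requirements/rocm.*\.txt",
    r"^vllm/model_executor/layers/fused_moe/rocm.*\.py",
    r"^vllm/v1/attention/backends/rocm.*\.py",
    r"^vllm/v1/attention/backends/mla/rocm.*\.py",
    r"^vllm/v1/attention/ops/rocm.*\.py",
    r"^tests/kernels/.*_rocm.*\.py",
]


def matches_rocm_rule(path: str) -> bool:
    """Return True if a changed file path matches any of the patterns."""
    # Assumption: regex conditions are evaluated like re.search.
    return any(re.search(pattern, path) for pattern in ROCM_PATTERNS)


if __name__ == "__main__":
    # Hypothetical changed files, for illustration only.
    for changed in [
        "vllm/v1/attention/ops/rocm_example.py",
        "csrc/rocm/example.cu",
        "docs/README.md",
    ]:
        print(changed, matches_rocm_rule(changed))
```

A pattern such as `^vllm/v1/attention/ops/rocm.*\.py` only fires for files whose names start with `rocm` directly under that directory, which is why moved modules need their patterns updated alongside the move.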

5
.gitignore vendored

@@ -227,3 +227,8 @@ ep_kernels_workspace/
# Allow tracked library source folders under submodules (e.g., benchmarks/lib)
!vllm/benchmarks/lib/
# Generated gRPC protobuf files (compiled at build time from vllm_engine.proto)
vllm/grpc/vllm_engine_pb2.py
vllm/grpc/vllm_engine_pb2_grpc.py
vllm/grpc/vllm_engine_pb2.pyi

View File

@@ -282,6 +282,7 @@ endif()
set(VLLM_EXT_SRC
"csrc/mamba/mamba_ssm/selective_scan_fwd.cu"
"csrc/cache_kernels.cu"
"csrc/cache_kernels_fused.cu"
"csrc/attention/paged_attention_v1.cu"
"csrc/attention/paged_attention_v2.cu"
"csrc/attention/merge_attn_states.cu"
@@ -799,24 +800,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
else()
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
endif()
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
set(SRCS "csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")
message(STATUS "Building blockwise_scaled_group_mm_sm100 for archs: ${SCALED_MM_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
message(STATUS "Not building blockwise_scaled_group_mm_sm100 kernels as CUDA Compiler version is "
"not >= 12.8, we recommend upgrading to CUDA 12.8 or later "
"if you intend on running FP8 quantized MoE models on Blackwell.")
else()
message(STATUS "Not building blockwise_scaled_group_mm_sm100 as no compatible archs found "
"in CUDA target architectures")
endif()
endif()
#
# Machete kernels

View File

@@ -14,51 +14,8 @@ Easy, fast, and cheap LLM serving for everyone
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://blog.vllm.ai/"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>
---
Join us at the [PyTorch Conference, October 22-23](https://events.linuxfoundation.org/pytorch-conference/) and [Ray Summit, November 3-5](https://www.anyscale.com/ray-summit/2025) in San Francisco for our latest updates on vLLM and to meet the vLLM team! Register now for the largest vLLM community events of the year!
---
*Latest News* 🔥
- [2025/11] We hosted [vLLM Bangkok Meetup](https://luma.com/v0f647nv). We explored vLLM and LMCache inference and low-resource language adaptation with speakers from Embedded LLM, AMD, and Red Hat. Please find the meetup slides [here](https://drive.google.com/drive/folders/1H0DS57F8HQ5q3kSOSoRmucPJWL3E0A_X?usp=sharing).
- [2025/11] We hosted [the first vLLM Europe Meetup in Zurich](https://luma.com/0gls27kb) focused on quantization, distributed inference, and reinforcement learning at scale with speakers from Mistral, IBM, and Red Hat. Please find the meetup slides [here](https://docs.google.com/presentation/d/1UC9PTLCHYXQpOmJDSFg6Sljra3iVXzc09DeEI7dnxMc/edit?usp=sharing) and recording [here](https://www.youtube.com/watch?v=6m6ZE6yVEDI)
- [2025/11] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/xSrYXjNgr1HbCP4ExYNG1w) focusing on distributed inference and diverse accelerator support with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1nQJ8ZkLSjKxvu36sSHaceVXtttbLvvu-?usp=drive_link).
- [2025/10] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/__xb4OyOsImz-9eAVrdlcg) focused on hands-on vLLM inference optimization! Please find the meetup slides [here](https://drive.google.com/drive/folders/1KqwjsFJLfEsC8wlDugnrR61zsWHt94Q6).
- [2025/09] We hosted [vLLM Toronto Meetup](https://luma.com/e80e0ymm) focused on tackling inference at scale and speculative decoding with speakers from NVIDIA and Red Hat! Please find the meetup slides [here](https://docs.google.com/presentation/d/1IYJYmJcu9fLpID5N5RbW_vO0XLo0CGOR14IXOjB61V8/edit?usp=sharing).
- [2025/08] We hosted [vLLM Shenzhen Meetup](https://mp.weixin.qq.com/s/k8ZBO1u2_2odgiKWH_GVTQ) focusing on the ecosystem around vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Ua2SVKVSu-wp5vou_6ElraDt2bnKhiEA).
- [2025/08] We hosted [vLLM Singapore Meetup](https://www.sginnovate.com/event/vllm-sg-meet). We shared V1 updates, disaggregated serving and MLLM speedups with speakers from Embedded LLM, AMD, WekaIO, and A*STAR. Please find the meetup slides [here](https://drive.google.com/drive/folders/1ncf3GyqLdqFaB6IeB834E5TZJPLAOiXZ?usp=sharing).
- [2025/08] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg) focusing on building, developing, and integrating with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH).
- [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/).
- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
<details>
<summary>Previous News</summary>
- [2025/08] We hosted [vLLM Korea Meetup](https://luma.com/cgcgprmh) with Red Hat and Rebellions! We shared the latest advancements in vLLM along with project spotlights from the vLLM Korea community. Please find the meetup slides [here](https://drive.google.com/file/d/1bcrrAE1rxUgx0mjIeOWT6hNe2RefC5Hm/view).
- [2025/08] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA) focusing on large-scale LLM deployment! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF) and the recording [here](https://www.chaspark.com/#/live/1166916873711665152).
- [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing).
- [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing).
- [2025/03] We hosted [vLLM x Ollama Inference Night](https://lu.ma/vllm-ollama)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/16T2PDD1YwRnZ4Tu8Q5r6n53c5Lr5c73UV9Vd2_eBo4U/edit?usp=sharing).
- [2025/03] We hosted [the first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg)! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing).
- [2025/03] We hosted [the East Coast vLLM Meetup](https://lu.ma/7mu4k4xx)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1NHiv8EUFF1NLd3fEYODm56nDmL26lEeXCaDgyDlTsRs/edit#slide=id.g31441846c39_0_0).
- [2025/02] We hosted [the ninth vLLM meetup](https://lu.ma/h7g3kuj9) with Meta! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1jzC_PZVXrVNSFVCW-V4cFXb6pn7zZ2CyP_Flwo05aqg/edit?usp=sharing) and AMD [here](https://drive.google.com/file/d/1Zk5qEJIkTmlQ2eQcXQZlljAx3m9s7nwn/view?usp=sharing). The slides from Meta will not be posted.
- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing), and Google Cloud team [here](https://drive.google.com/file/d/1h24pHewANyRL11xy5dXUbvRC9F9Kkjix/view?usp=sharing).
- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users!
- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
- [2024/06] We hosted [the fourth vLLM meetup](https://lu.ma/agivllm) with Cloudflare and BentoML! Please find the meetup slides [here](https://docs.google.com/presentation/d/1iJ8o7V2bQEi0BFEljLTwc5G1S10_Rhv3beed5oB0NJ4/edit?usp=sharing).
- [2024/04] We hosted [the third vLLM meetup](https://robloxandvllmmeetup2024.splashthat.com/) with Roblox! Please find the meetup slides [here](https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit?usp=sharing).
- [2024/01] We hosted [the second vLLM meetup](https://lu.ma/ygxbpzhl) with IBM! Please find the meetup slides [here](https://docs.google.com/presentation/d/12mI2sKABnUw5RBWXDYY-HtHth4iMSNcEoQ10jDQbxgA/edit?usp=sharing).
- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) with a16z! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
</details>
🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
---
@@ -118,50 +75,6 @@ Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
## Sponsors
vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!
<!-- Note: Please sort them in alphabetical order. -->
<!-- Note: Please keep these consistent with docs/community/sponsors.md -->
Cash Donations:
- a16z
- Dropbox
- Sequoia Capital
- Skywork AI
- ZhenFund
Compute Resources:
- Alibaba Cloud
- AMD
- Anyscale
- Arm
- AWS
- Crusoe Cloud
- Databricks
- DeepInfra
- Google Cloud
- IBM
- Intel
- Lambda Lab
- Nebius
- Novita AI
- NVIDIA
- Red Hat
- Replicate
- Roblox
- RunPod
- Trainy
- UC Berkeley
- UC San Diego
- Volcengine
Slack Sponsor: Anyscale
We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.
## Citation
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
@@ -182,7 +95,7 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu)
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
<!-- --8<-- [end:contact-us] -->
## Media Kit

View File

@@ -1,47 +1,30 @@
# Releasing vLLM
vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via PyPI. These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.
vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via [PyPI](https://pypi.org/project/vllm). These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.
## Release Versioning
## Release Cadence and Versioning
vLLM uses a “right-shifted” versioning scheme where a new patch release is out every 2 weeks. And patch releases contain features and bug fixes (as opposed to semver where patch release contains only backwards-compatible bug fixes). When critical fixes need to be made, special release post1 is released.
We aim to have a regular release every 2 weeks. Since v0.12.0, regular releases increment the minor version rather than patch version. The list of past releases can be found [here](https://vllm.ai/releases).
* _major_ major architectural milestone and when incompatible API changes are made, similar to PyTorch 2.0.
* _minor_ major features
* _patch_ features and backwards-compatible bug fixes
* _post1_ or _patch-1_ backwards-compatible bug fixes, either explicit or implicit post release
Our version numbers are expressed in the form `vX.Y.Z`, where `X` is the major version, `Y` is the minor version, and `Z` is the patch version. They are incremented according to the following rules:
## Release Cadence
* _Major_ releases are reserved for architectural milestones involving sweeping API changes, similar to PyTorch 2.0.
* _Minor_ releases correspond to regular releases, which include new features, bug fixes and other backwards-compatible changes.
* _Patch_ releases correspond to special releases for new models, as well as emergency patches for critical performance, functionality and security issues.
Patch release is released on bi-weekly basis. Post release 1-3 days after patch release and uses same branch as patch release.
Following is the release cadence for year 2025. All future release dates below are tentative. Please note: Post releases are optional.
This versioning scheme is similar to [SemVer](https://semver.org/) for compatibility purposes, except that backwards compatibility is only guaranteed for a limited number of minor releases (see our [deprecation policy](https://docs.vllm.ai/en/latest/contributing/deprecation_policy) for details).
| Release Date | Patch release versions | Post Release versions |
| --- | --- | --- |
| Jan 2025 | 0.7.0 | --- |
| Feb 2025 | 0.7.1, 0.7.2, 0.7.3 | --- |
| Mar 2025 | 0.7.4, 0.7.5 | --- |
| Apr 2025 | 0.7.6, 0.7.7 | --- |
| May 2025 | 0.7.8, 0.7.9 | --- |
| Jun 2025 | 0.7.10, 0.7.11 | --- |
| Jul 2025 | 0.7.12, 0.7.13 | --- |
| Aug 2025 | 0.7.14, 0.7.15 | --- |
| Sep 2025 | 0.7.16, 0.7.17 | --- |
| Oct 2025 | 0.7.18, 0.7.19 | --- |
| Nov 2025 | 0.7.20, 0.7.21 | --- |
| Dec 2025 | 0.7.22, 0.7.23 | --- |
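As a rough illustration of the `vX.Y.Z` rules described above, the sketch below parses two version strings and reports which component was incremented. The helper names `parse_version` and `release_kind` are hypothetical and are not part of vLLM's release tooling; the example version numbers are for illustration only.

```python
import re

# Matches "vX.Y.Z" (the leading "v" is optional in this sketch).
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag: str) -> tuple:
    """Split a version string into (major, minor, patch) integers."""
    match = _VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"not a vX.Y.Z version: {tag!r}")
    return tuple(int(part) for part in match.groups())


def release_kind(previous: str, current: str) -> str:
    """Return 'major', 'minor' or 'patch' depending on what was bumped."""
    prev, cur = parse_version(previous), parse_version(current)
    if cur <= prev:
        raise ValueError(f"{current} is not newer than {previous}")
    if cur[0] != prev[0]:
        return "major"  # architectural milestone
    if cur[1] != prev[1]:
        return "minor"  # regular release
    return "patch"      # special release or emergency patch


print(release_kind("v0.12.0", "v0.13.0"))
print(release_kind("v0.13.0", "v0.13.1"))
```

Comparing the parsed tuples rather than the raw strings keeps the ordering numeric, so `v0.10.0` sorts after `v0.9.0`.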
## Release branch
## Release Branch
Each release is built from a dedicated release branch.
* For _major_, _minor_, _patch_ releases, the release branch cut is performed 1-2 days before release is live.
* For post releases, previously cut release branch is reused
* Release builds are triggered via push to RC tag like vX.Y.Z-rc1 . This enables us to build and test multiple RCs for each release.
* Final tag : vX.Y.Z does not trigger the build but used for Release notes and assets.
* After branch cut is created we monitor the main branch for any reverts and apply these reverts to a release branch.
* For _major_ and _minor_ releases, the release branch is cut 1-2 days before the release goes live.
* For _patch_ releases, the previously cut release branch is reused.
* Release builds are triggered by pushing an RC tag such as `vX.Y.Z-rc1`. This enables us to build and test multiple RCs for each release.
* The final tag `vX.Y.Z` does not trigger a build; it is used for the release notes and assets.
* After the branch is cut, we monitor the main branch for any reverts and apply them to the release branch.
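The tag convention in the list above can be sketched as a simple pattern check: only release-candidate tags of the form `vX.Y.Z-rcN` start a build, while the final `vX.Y.Z` tag does not. The function name `triggers_release_build` and the exact regular expression are assumptions made for illustration; the pattern actually used by the CI pipeline is not shown in this document.

```python
import re

# Assumed shape of an RC tag: vX.Y.Z-rcN, e.g. v0.13.0-rc1.
_RC_TAG_RE = re.compile(r"^v\d+\.\d+\.\d+-rc\d+$")


def triggers_release_build(tag: str) -> bool:
    """Return True only for release-candidate tags."""
    return _RC_TAG_RE.match(tag) is not None


for tag in ("v0.13.0-rc1", "v0.13.0-rc2", "v0.13.0"):
    print(tag, triggers_release_build(tag))
```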
## Release Cherry-Pick Criteria
### Cherry-Pick Criteria
After branch cut, we approach finalizing the release branch with clear criteria on what cherry picks are allowed in. Note: a cherry pick is a process to land a PR in the release branch after branch cut. These are typically limited to ensure that the team has sufficient time to complete a thorough round of testing on a stable code base.

View File

@@ -104,7 +104,6 @@ def run_benchmark_with_batch_invariant(
random.seed(seed)
# Set environment variables
os.environ["VLLM_ATTENTION_BACKEND"] = backend
if batch_invariant:
os.environ["VLLM_BATCH_INVARIANT"] = "1"
else:
@@ -140,6 +139,7 @@ def run_benchmark_with_batch_invariant(
max_model_len=max_model_len,
dtype="bfloat16",
tensor_parallel_size=tp_size,
attention_config={"backend": backend},
enable_prefix_caching=False,
)
init_time = time.perf_counter() - start_init

View File

@@ -135,7 +135,6 @@ def benchmark_batched_propose(args):
block_sizes=[16],
)
dummy_input_batch._req_ids = list(str(id) for id in range(args.num_req))
dummy_input_batch.spec_decode_unsupported_reqs = ()
dummy_input_batch.num_tokens_no_spec = [args.num_token] * args.num_req
dummy_input_batch.token_ids_cpu = np.random.randint(
0, 20, (args.num_req, args.num_token)
@@ -151,10 +150,8 @@ def benchmark_batched_propose(args):
start = time.time()
runner.drafter.propose(
sampled_token_ids,
dummy_input_batch.req_ids,
dummy_input_batch.num_tokens_no_spec,
dummy_input_batch.token_ids_cpu,
dummy_input_batch.spec_decode_unsupported_reqs,
)
end = time.time()
print(f"Iteration time (s): {end - start}")

View File

@@ -343,7 +343,9 @@ def bench(
return bench_int8(dtype, m, k, n, label, sub_label)
if dtype == torch.float8_e4m3fn:
return bench_fp8(dtype, m, k, n, label, sub_label)
raise ValueError("unsupported type")
raise ValueError(
f"Unsupported dtype {dtype}: should be one of torch.int8, torch.float8_e4m3fn."
)
# runner

View File

@@ -0,0 +1,177 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import copy
import itertools
import torch
from weight_shapes import WEIGHT_SHAPES
from vllm import _custom_ops as ops
from vllm.platforms import current_platform
from vllm.scalar_type import scalar_types
from vllm.triton_utils import triton
from vllm.utils.flashinfer import flashinfer_fp4_quantize
if not current_platform.has_device_capability(100):
    raise RuntimeError("NVFP4 requires compute capability of 10.0 (Blackwell)")
FLOAT4_E2M1_MAX = scalar_types.float4_e2m1f.max()
FLOAT8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max
PROVIDER_CFGS = {
    "vllm": dict(backend="vllm", enabled=True),
    "flashinfer": dict(backend="flashinfer", enabled=True),
}
_enabled = [k for k, v in PROVIDER_CFGS.items() if v["enabled"]]
def compute_global_scale(tensor: torch.Tensor) -> torch.Tensor:
    """Compute global scale for FP4 quantization."""
    amax = torch.abs(tensor).max().to(torch.float32)
    return FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX / amax
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["batch_size"],
        x_vals=[1, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096],
        x_log=False,
        line_arg="provider",
        line_vals=_enabled,
        line_names=_enabled,
        ylabel="us (lower is better)",
        plot_name="NVFP4 Input Quantization Latency (us)",
        args={},
    )
)
def benchmark(batch_size, provider, N, K):
    M = batch_size
    device = "cuda"
    dtype = torch.bfloat16
    # Create input tensor
    a = torch.randn((M, K), device=device, dtype=dtype)
    # Compute global scale for activation
    a_global_scale = compute_global_scale(a)
    quantiles = [0.5, 0.2, 0.8]
    cfg = PROVIDER_CFGS[provider]
    if cfg["backend"] == "vllm":
        # vLLM's FP4 quantization
        ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
            lambda: ops.scaled_fp4_quant(a, a_global_scale),
            quantiles=quantiles,
        )
    elif cfg["backend"] == "flashinfer":
        # FlashInfer's FP4 quantization
        # Use is_sf_swizzled_layout=True to match vLLM's output format
        ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
            lambda: flashinfer_fp4_quantize(
                a, a_global_scale, is_sf_swizzled_layout=True
            ),
            quantiles=quantiles,
        )
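    # Convert ms to us for better readability at small batch sizes
    to_us = lambda t_ms: t_ms * 1000
    # perf_report unpacks (value, min, max); for a latency metric the
    # lower quantile is the minimum and the upper quantile is the maximum
    return to_us(ms), to_us(min_ms), to_us(max_ms)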
def prepare_shapes(args):
    out = []
    for model, tp_size in itertools.product(args.models, args.tp_sizes):
        for KN, tp_dim in copy.deepcopy(WEIGHT_SHAPES[model]):
            KN[tp_dim] //= tp_size
            KN.append(model)
            out.append(KN)
    return out
def _test_accuracy_once(M: int, K: int, dtype: torch.dtype, device: str):
    """Test accuracy between vLLM and FlashInfer FP4 quantization."""
    # Create input tensor
    a = torch.randn((M, K), device=device, dtype=dtype)
    # Compute global scale
    a_global_scale = compute_global_scale(a)
    # vLLM quantization
    vllm_fp4, vllm_scale = ops.scaled_fp4_quant(a, a_global_scale)
    # FlashInfer quantization (with swizzled layout to match vLLM's output)
    flashinfer_fp4, flashinfer_scale = flashinfer_fp4_quantize(
        a, a_global_scale, is_sf_swizzled_layout=True
    )
    flashinfer_scale = flashinfer_scale.view(torch.float8_e4m3fn)
    # Compare outputs
    torch.testing.assert_close(
        vllm_fp4,
        flashinfer_fp4,
    )
    print(f"M={M}, K={K}, dtype={dtype}: PASSED")
def test_accuracy():
    """Run accuracy tests across various shapes."""
    print("\n" + "=" * 60)
    print("Running accuracy tests: vLLM vs FlashInfer")
    print("=" * 60)
    device = "cuda"
    dtype = torch.bfloat16
    # Test various batch sizes and hidden dimensions
    Ms = [1, 1024]
    Ks = [4096]
    for M in Ms:
        for K in Ks:
            _test_accuracy_once(M, K, dtype, device)
    print("\nAll accuracy tests passed!")
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Benchmark NVFP4 quantization: vLLM vs FlashInfer"
    )
    parser.add_argument(
        "--models",
        nargs="+",
        type=str,
        default=["meta-llama/Llama-3.1-8B-Instruct"],
        choices=list(WEIGHT_SHAPES.keys()),
    )
    parser.add_argument("--tp-sizes", nargs="+", type=int, default=[1])
    parser.add_argument(
        "--save-path",
        type=str,
        default=None,
        help="Path to save benchmark results",
    )
    parser.add_argument(
        "--accuracy",
        action="store_true",
        help="Run accuracy tests",
    )
    args = parser.parse_args()
    if args.accuracy:
        test_accuracy()
    for K, N, model in prepare_shapes(args):
        print(f"\n{model}, N={N} K={K}")
        benchmark.run(
            print_data=True,
            save_path=args.save_path,
            N=N,
            K=K,
        )
    print("\nBenchmark finished!")
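Two pieces of arithmetic in the script above are easy to check without a GPU: the global scale, and the way weight shapes are divided across tensor-parallel ranks. The following is a minimal pure-Python sketch of both. It hard-codes the format maxima (6.0 for FP4 E2M1 and 448.0 for FP8 E4M3FN) instead of querying `scalar_types` and `torch.finfo`, and the names `global_scale`, `shard_shapes` and the sample shapes are hypothetical stand-ins, not values taken from `weight_shapes.py`.

```python
import copy

FLOAT4_E2M1_MAX = 6.0    # largest finite FP4 (E2M1) magnitude
FLOAT8_E4M3_MAX = 448.0  # largest finite FP8 (E4M3FN) magnitude


def global_scale(values):
    """Same formula as compute_global_scale, on a plain list of floats."""
    amax = max(abs(v) for v in values)
    return FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX / amax


def shard_shapes(weight_shapes, tp_size):
    """Divide the tensor-parallel dimension of each [K, N] shape by tp_size."""
    out = []
    # deepcopy so the caller's shape table is not mutated in place
    for kn, tp_dim in copy.deepcopy(weight_shapes):
        kn[tp_dim] //= tp_size
        out.append(kn)
    return out


# Hypothetical shape table: ([K, N], index of the dimension that is sharded).
example_shapes = [([4096, 6144], 1), ([4096, 4096], 0)]

print(global_scale([0.5, -2.0, 1.0]))
print(shard_shapes(example_shapes, tp_size=2))
```

The `deepcopy` matters because the loop divides the shape lists in place; without it, a second call with a different `tp_size` would start from already-divided shapes.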

View File

@@ -8,10 +8,9 @@ import torch
import vllm.model_executor.layers.activation # noqa F401
from vllm.model_executor.custom_op import CustomOp
from vllm.platforms import current_platform
from vllm.triton_utils import triton
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed
batch_size_range = [1, 16, 128]
seq_len_range = [1, 16, 64, 1024, 4096]
@@ -30,7 +29,7 @@ def benchmark_activation(
device = "cuda"
num_tokens = batch_size * seq_len
dim = intermediate_size
current_platform.seed_everything(42)
set_random_seed(42)
torch.set_default_device(device)
if func_name == "gelu_and_mul":

View File

@@ -6,15 +6,19 @@ kernel. Both kernels take in fp8 quantized weights and 16-bit activations,
but use different quantization strategies and backends.
"""
import nvtx
import torch
import vllm.model_executor.layers.fused_moe.modular_kernel as mk
from vllm import _custom_ops as ops
from vllm.model_executor.layers.fused_moe.config import fp8_w8a8_moe_quant_config
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8
from vllm.model_executor.layers.fused_moe.cutlass_moe import CutlassExpertsFp8
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
from vllm.model_executor.layers.fused_moe.prepare_finalize import (
MoEPrepareAndFinalizeNoEP,
)
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.v1.worker.workspace import init_workspace_manager
# Weight shapes for different models: [num_experts, topk, hidden_size,
# intermediate_size]
@@ -58,6 +62,7 @@ def bench_run(
per_out_ch: bool,
mkn: tuple[int, int, int],
):
init_workspace_manager(torch.cuda.current_device())
(m, k, n) = mkn
dtype = torch.half
@@ -120,85 +125,6 @@ def bench_run(
# Force per-tensor quantization for all cases
per_act_token = False
# Create stride tensors for CUTLASS
ab_strides1 = torch.full((num_experts,), k, dtype=torch.int64, device=device)
ab_strides2 = torch.full((num_experts,), n, dtype=torch.int64, device=device)
c_strides1 = torch.full((num_experts,), 2 * n, dtype=torch.int64, device=device)
c_strides2 = torch.full((num_experts,), k, dtype=torch.int64, device=device)
def run_triton_moe(
a: torch.Tensor,
w1: torch.Tensor,
w2: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
w1_scale: torch.Tensor,
w2_scale: torch.Tensor,
a1_scale: torch.Tensor,
a2_scale: torch.Tensor,
num_repeats: int,
):
quant_config = fp8_w8a8_moe_quant_config(
w1_scale=w1_scale,
w2_scale=w2_scale,
a1_scale=a1_scale,
a2_scale=a2_scale,
per_act_token_quant=per_act_token,
per_out_ch_quant=per_out_ch,
)
for _ in range(num_repeats):
fused_experts(
a,
w1,
w2,
topk_weights,
topk_ids,
quant_config=quant_config,
)
def run_cutlass_moe_fp8(
a: torch.Tensor,
w1: torch.Tensor,
w2: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
ab_strides1: torch.Tensor,
ab_strides2: torch.Tensor,
c_strides1: torch.Tensor,
c_strides2: torch.Tensor,
w1_scale: torch.Tensor,
w2_scale: torch.Tensor,
a1_scale: torch.Tensor,
a2_scale: torch.Tensor,
num_repeats: int,
):
quant_config = fp8_w8a8_moe_quant_config(
w1_scale=w1_scale,
w2_scale=w2_scale,
a1_scale=a1_scale,
a2_scale=a2_scale,
per_act_token_quant=per_act_token,
per_out_ch_quant=per_out_ch,
)
for _ in range(num_repeats):
with nvtx.annotate("cutlass_moe_fp8", color="blue"):
cutlass_moe_fp8(
a=a,
w1_q=w1,
w2_q=w2,
topk_weights=topk_weights,
topk_ids=topk_ids,
ab_strides1=ab_strides1,
ab_strides2=ab_strides2,
c_strides1=c_strides1,
c_strides2=c_strides2,
quant_config=quant_config,
activation="silu",
global_num_experts=num_experts,
)
# Pre-create quantization config to avoid creating it inside CUDA graph
quant_config = fp8_w8a8_moe_quant_config(
w1_scale=w1_scale,
@@ -209,23 +135,30 @@ def bench_run(
per_out_ch_quant=per_out_ch,
)
fn = mk.FusedMoEModularKernel(
MoEPrepareAndFinalizeNoEP(),
CutlassExpertsFp8(
out_dtype=a.dtype,
e=num_experts,
n=n,
k=k,
quant_config=quant_config,
device=w1.device,
),
)
# Create CUDA graphs for CUTLASS (match benchmark_moe.py pattern exactly)
cutlass_stream = torch.cuda.Stream()
cutlass_graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(cutlass_graph, stream=cutlass_stream):
# Capture 10 invocations like benchmark_moe.py
for _ in range(10):
cutlass_moe_fp8(
a=a,
w1_q=w1_fp8q_cutlass,
w2_q=w2_fp8q_cutlass,
topk_weights=topk_weights,
topk_ids=topk_ids,
ab_strides1=ab_strides1,
ab_strides2=ab_strides2,
c_strides1=c_strides1,
c_strides2=c_strides2,
quant_config=quant_config,
fn(
a,
w1_fp8q_cutlass,
w2_fp8q_cutlass,
topk_weights,
topk_ids,
activation="silu",
global_num_experts=num_experts,
)
@@ -297,6 +230,10 @@ def bench_run(
def main(args):
# Initialize workspace manager (required for CUTLASS MoE kernels)
device = torch.device("cuda:0")
init_workspace_manager(device)
print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")

View File

@@ -11,16 +11,23 @@ import nvtx
import torch
import torch.utils.benchmark as benchmark
import vllm.model_executor.layers.fused_moe.modular_kernel as mk
from vllm import _custom_ops as ops
from vllm.config import ParallelConfig, VllmConfig, set_current_vllm_config
from vllm.model_executor.layers.fused_moe.config import (
fp8_w8a8_moe_quant_config,
nvfp4_moe_quant_config,
)
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp4
from vllm.model_executor.layers.fused_moe.cutlass_moe import (
CutlassExpertsFp4,
)
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
from vllm.model_executor.layers.fused_moe.prepare_finalize import (
MoEPrepareAndFinalizeNoEP,
)
from vllm.scalar_type import scalar_types
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.v1.worker.workspace import init_workspace_manager
WEIGHT_SHAPES_MOE = {
"nvidia/DeepSeek-R1-FP4": [
@@ -187,19 +194,24 @@ def bench_run(
g1_alphas=w1_gs,
g2_alphas=w2_gs,
)
kernel = mk.FusedMoEModularKernel(
MoEPrepareAndFinalizeNoEP(defer_input_quant=True),
CutlassExpertsFp4(
out_dtype=dtype,
max_experts_per_worker=e,
quant_config=quant_config,
),
)
for _ in range(num_repeats):
with nvtx.annotate("cutlass_moe_fp4", color="green"):
cutlass_moe_fp4(
a=a,
w1_fp4=w1_fp4,
w2_fp4=w2_fp4,
kernel(
hidden_states=a,
w1=w1_fp4,
w2=w2_fp4,
topk_weights=topk_weights,
topk_ids=topk_ids,
m=m,
n=n,
k=k,
e=num_experts,
quant_config=quant_config,
)
def run_cutlass_from_graph(
@@ -229,20 +241,24 @@ def bench_run(
g2_alphas=w2_gs,
)
kernel = mk.FusedMoEModularKernel(
MoEPrepareAndFinalizeNoEP(defer_input_quant=True),
CutlassExpertsFp4(
out_dtype=dtype,
max_experts_per_worker=e,
quant_config=quant_config,
),
)
with set_current_vllm_config(
VllmConfig(parallel_config=ParallelConfig(pipeline_parallel_size=1))
):
return cutlass_moe_fp4(
a=a,
w1_fp4=w1_fp4,
w2_fp4=w2_fp4,
return kernel(
hidden_states=a,
w1=w1_fp4,
w2=w2_fp4,
topk_weights=topk_weights,
topk_ids=topk_ids,
m=m,
n=n,
k=k,
e=num_experts,
quant_config=quant_config,
)
def run_triton_from_graph(
@@ -441,6 +457,10 @@ def bench_run(
def main(args):
# Initialize workspace manager (required for CUTLASS MoE kernels)
device = torch.device("cuda:0")
init_workspace_manager(device)
print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")

View File

@@ -293,7 +293,7 @@ class CommunicatorBenchmark:
graph = torch.cuda.CUDAGraph()
graph_pool = torch.cuda.graph_pool_handle()
set_graph_pool_id(graph_pool)
with torch.cuda.graph(graph, pool=graph_pool):
with torch.cuda.graph(graph, pool=graph_pool, stream=stream):
for _ in range(CUDA_GRAPH_CAPTURE_CYCLES):
allreduce_fn(graph_input)

View File

@@ -5,15 +5,20 @@ import torch
import torch.utils.benchmark as benchmark
from benchmark_shapes import WEIGHT_SHAPES_MOE
import vllm.model_executor.layers.fused_moe.modular_kernel as mk
from vllm import _custom_ops as ops
from vllm.config import ParallelConfig, VllmConfig, set_current_vllm_config
from vllm.model_executor.layers.fused_moe.config import fp8_w8a8_moe_quant_config
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8
from vllm.model_executor.layers.fused_moe.cutlass_moe import CutlassExpertsFp8
from vllm.model_executor.layers.fused_moe.fused_moe import (
fused_experts,
fused_topk,
)
from vllm.model_executor.layers.fused_moe.prepare_finalize import (
MoEPrepareAndFinalizeNoEP,
)
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.v1.worker.workspace import init_workspace_manager
DEFAULT_MODELS = [
"mistralai/Mixtral-8x7B-Instruct-v0.1",
@@ -44,6 +49,7 @@ def bench_run(
per_out_ch: bool,
mkn: tuple[int, int, int],
):
init_workspace_manager(torch.cuda.current_device())
label = "Quant Matmul"
sub_label = (
@@ -81,11 +87,6 @@ def bench_run(
a, score, topk, renormalize=False
)
ab_strides1 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64)
ab_strides2 = torch.full((num_experts,), n, device="cuda", dtype=torch.int64)
c_strides1 = torch.full((num_experts,), 2 * n, device="cuda", dtype=torch.int64)
c_strides2 = torch.full((num_experts,), k, device="cuda", dtype=torch.int64)
def run_triton_moe(
a: torch.Tensor,
w1: torch.Tensor,
@@ -119,10 +120,6 @@ def bench_run(
w2: torch.Tensor,
w1_scale: torch.Tensor,
w2_scale: torch.Tensor,
ab_strides1: torch.Tensor,
ab_strides2: torch.Tensor,
c_strides1: torch.Tensor,
c_strides2: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
per_act_token: bool,
@@ -134,31 +131,29 @@ def bench_run(
per_act_token_quant=per_act_token,
)
for _ in range(num_repeats):
cutlass_moe_fp8(
a,
w1,
w2,
topk_weights,
topk_ids,
ab_strides1,
ab_strides2,
c_strides1,
c_strides2,
fn = mk.FusedMoEModularKernel(
MoEPrepareAndFinalizeNoEP(),
CutlassExpertsFp8(
out_dtype=a.dtype,
# NOTE(rob): w2 is shaped as [E, hidden, intermediate]
e=w2.shape[0],
n=w2.shape[2],
k=w2.shape[1],
quant_config=quant_config,
)
device=w1.device,
),
)
for _ in range(num_repeats):
fn(a, w1, w2, topk_weights, topk_ids)
def run_cutlass_from_graph(
a: torch.Tensor,
a_scale: torch.Tensor,
w1_q: torch.Tensor,
w2_q: torch.Tensor,
w1: torch.Tensor,
w2: torch.Tensor,
w1_scale: torch.Tensor,
w2_scale: torch.Tensor,
ab_strides1: torch.Tensor,
ab_strides2: torch.Tensor,
c_strides1: torch.Tensor,
c_strides2: torch.Tensor,
topk_weights: torch.Tensor,
topk_ids: torch.Tensor,
):
@@ -168,21 +163,23 @@ def bench_run(
per_act_token_quant=per_act_token,
)
fn = mk.FusedMoEModularKernel(
MoEPrepareAndFinalizeNoEP(),
CutlassExpertsFp8(
out_dtype=a.dtype,
# NOTE(rob): w2 is shaped as [E, hidden, intermediate]
e=w2.shape[0],
n=w2.shape[2],
k=w2.shape[1],
quant_config=quant_config,
device=w1.device,
),
)
with set_current_vllm_config(
VllmConfig(parallel_config=ParallelConfig(pipeline_parallel_size=1))
):
return cutlass_moe_fp8(
a,
w1_q,
w2_q,
topk_weights,
topk_ids,
ab_strides1,
ab_strides2,
c_strides1,
c_strides2,
quant_config=quant_config,
)
return fn(a, w1, w2, topk_weights, topk_ids)
def run_triton_from_graph(
a: torch.Tensor,
@@ -226,10 +223,6 @@ def bench_run(
w2_q,
w1_scale,
w2_scale,
ab_strides1,
ab_strides2,
c_strides1,
c_strides2,
topk_weights,
topk_ids,
)
@@ -267,10 +260,6 @@ def bench_run(
"w1_scale": w1_scale,
"w2_scale": w2_scale,
"per_act_token": per_act_token,
"ab_strides1": ab_strides1,
"ab_strides2": ab_strides2,
"c_strides1": c_strides1,
"c_strides2": c_strides2,
# cuda graph params
"cutlass_graph": cutlass_graph,
"triton_graph": triton_graph,
@@ -329,10 +318,6 @@ def bench_run(
w2_q,
w1_scale,
w2_scale,
ab_strides1,
ab_strides2,
c_strides1,
c_strides2,
topk_weights,
topk_ids,
per_act_token,
@@ -341,7 +326,7 @@ def bench_run(
results.append(
benchmark.Timer(
stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, ab_strides1, ab_strides2, c_strides1, c_strides2, topk_weights, topk_ids, per_act_token, num_runs)", # noqa: E501
stmt="run_cutlass_moe(a, a_scale, w1_q, w2_q, w1_scale, w2_scale, topk_weights, topk_ids, per_act_token, num_runs)", # noqa: E501
globals=globals,
label=label,
sub_label=sub_label,
@@ -364,6 +349,10 @@ def bench_run(
def main(args):
# Initialize workspace manager (required for CUTLASS MoE kernels)
device = torch.device("cuda:0")
init_workspace_manager(device)
print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")

View File

@@ -6,9 +6,8 @@ import time
import torch
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed
@torch.inference_mode()
@@ -22,7 +21,7 @@ def main(
num_warmup_iters: int = 5,
num_iters: int = 100,
) -> None:
current_platform.seed_everything(seed)
set_random_seed(seed)
torch.set_default_device("cuda")
layer = RMSNorm(hidden_size).to(dtype=dtype)
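The benchmarks in this comparison switch from `current_platform.seed_everything(...)` to `set_random_seed(...)`. The implementation of `set_random_seed` is not shown here, so the following is only a generic sketch of what a seed-everything helper usually does, under the assumption that it seeds each random number generator the process may use. The name `seed_all` is hypothetical, and NumPy and PyTorch are treated as optional.

```python
import random


def seed_all(seed: int) -> None:
    """Seed the RNGs that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_all(42)
first = random.random()
seed_all(42)
second = random.random()
print(first == second)
```

Seeding once at the top of a benchmark, as the scripts above do, makes the randomly generated inputs repeatable across runs.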

View File

@@ -2,6 +2,7 @@
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import gc
import json
import os
import time
@@ -23,9 +24,48 @@ from vllm.platforms import current_platform
from vllm.transformers_utils.config import get_config
from vllm.triton_utils import triton
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import set_random_seed
FP8_DTYPE = current_platform.fp8_dtype()
# Default interval for clearing Triton JIT cache during tuning
# Set to 0 to disable automatic cache clearing
_CACHE_CLEAR_INTERVAL_ENV = "VLLM_MOE_TUNE_CACHE_CLEAR_INTERVAL"
TRITON_CACHE_CLEAR_INTERVAL = int(os.environ.get(_CACHE_CLEAR_INTERVAL_ENV, "50"))
def clear_triton_cache():
    """Clear Triton JIT compilation cache and Python/CUDA memory.
    This helps prevent OOM during tuning with large models (many experts).
    """
    # Force Python garbage collection
    gc.collect()
    # Clear CUDA memory cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    # Try to clear Triton's runtime cache
    try:
        if (
            hasattr(triton, "runtime")
            and hasattr(triton.runtime, "cache")
            and hasattr(triton.runtime.cache, "clear")
        ):
            triton.runtime.cache.clear()
    except ImportError:
        # Triton not installed, skip cache clearing
        pass
    except AttributeError:
        # Triton version doesn't have expected cache API
        pass
    except Exception as e:
        print(f"Warning: Failed to clear Triton cache: {e}")
    # Additional garbage collection after clearing caches
    gc.collect()
def ensure_divisibility(numerator, denominator, text):
"""Ensure that numerator is divisible by the denominator."""
@@ -390,7 +430,7 @@ def merge_unique_dicts(list1, list2):
class BenchmarkWorker:
def __init__(self, seed: int) -> None:
torch.set_default_device("cuda")
current_platform.seed_everything(seed)
set_random_seed(seed)
self.seed = seed
# Get the device ID to allocate tensors and kernels
# on the respective GPU. This is required for Ray to work
@@ -410,7 +450,7 @@ class BenchmarkWorker:
block_quant_shape: list[int] = None,
use_deep_gemm: bool = False,
) -> tuple[dict[str, int], float]:
current_platform.seed_everything(self.seed)
set_random_seed(self.seed)
dtype_str = _get_config_dtype_str(
dtype, use_int8_w8a16=use_int8_w8a16, use_fp8_w8a8=use_fp8_w8a8
)
@@ -483,7 +523,7 @@ class BenchmarkWorker:
need_device_guard = True
with torch.cuda.device(self.device_id) if need_device_guard else nullcontext():
for config in tqdm(search_space):
for idx, config in enumerate(tqdm(search_space)):
try:
kernel_time = benchmark_config(
config,
@@ -506,6 +546,19 @@ class BenchmarkWorker:
if kernel_time < best_time:
best_time = kernel_time
best_config = config
# Periodically clear Triton JIT cache to prevent OOM
# This is especially important for large models with many experts
if (
TRITON_CACHE_CLEAR_INTERVAL > 0
and idx > 0
and idx % TRITON_CACHE_CLEAR_INTERVAL == 0
):
clear_triton_cache()
# Final cleanup after tuning completes
clear_triton_cache()
now = datetime.now()
print(f"[{now.ctime()}] Completed tuning for batch_size={num_tokens}")
assert best_config is not None
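The periodic cleanup added to the tuning loop above follows a common pattern: run a cleanup every N iterations, skip iteration 0, and run it once more after the loop. A minimal sketch of that control flow, assuming `gc.collect()` as a stand-in for `clear_triton_cache()`, is shown below. The function name `tune` and the toy cost function are hypothetical.

```python
import gc


def tune(configs, evaluate, interval=50):
    """Pick the fastest config, cleaning up every `interval` iterations.

    Returns (best_config, best_time, number_of_periodic_cleanups).
    An interval of 0 disables the periodic cleanup.
    """
    best_config, best_time = None, float("inf")
    periodic_cleanups = 0
    for idx, config in enumerate(configs):
        elapsed = evaluate(config)
        if elapsed < best_time:
            best_time, best_config = elapsed, config
        # Same condition as the tuning loop: never on idx 0, then every interval
        if interval > 0 and idx > 0 and idx % interval == 0:
            gc.collect()  # stand-in for clear_triton_cache()
            periodic_cleanups += 1
    gc.collect()  # final cleanup after tuning completes
    return best_config, best_time, periodic_cleanups


# Toy search space: 120 "configs" whose cost is lowest at 70.
result = tune(range(120), lambda c: abs(c - 70), interval=50)
print(result)
```

With 120 configurations and an interval of 50, the periodic cleanup fires at indices 50 and 100, which is why the environment variable in the script trades tuning speed against peak memory.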

View File

@@ -18,6 +18,7 @@ from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import set_random_seed
FP8_DTYPE = current_platform.fp8_dtype()
@@ -261,7 +262,7 @@ def benchmark_unpermute(
class BenchmarkWorker:
def __init__(self, seed: int) -> None:
torch.set_default_device("cuda")
current_platform.seed_everything(seed)
set_random_seed(seed)
self.seed = seed
# Get the device ID to allocate tensors and kernels
# on the respective GPU. This is required for Ray to work
@@ -279,7 +280,7 @@ class BenchmarkWorker:
use_int8_w8a16: bool,
use_customized_permute: bool = False,
) -> tuple[dict[str, int], float]:
current_platform.seed_everything(self.seed)
set_random_seed(self.seed)
permute_time = benchmark_permute(
num_tokens,

View File

@@ -37,9 +37,9 @@ import numpy as np
import torch
from vllm.model_executor.layers.rotary_embedding import get_rope
from vllm.platforms import current_platform
from vllm.transformers_utils.config import get_config
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import set_random_seed
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
@@ -94,7 +94,7 @@ def benchmark_mrope(
benchmark_iter: int = 100,
csv_writer=None,
):
current_platform.seed_everything(seed)
set_random_seed(seed)
torch.set_default_device(device)
# the parameters to compute the q k v size based on tp_size
mrope_helper_class = get_rope(

View File

@@ -13,6 +13,7 @@ from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import (
STR_DTYPE_TO_TORCH_DTYPE,
create_kv_caches_with_random,
set_random_seed,
)
logger = init_logger(__name__)
@@ -38,7 +39,7 @@ def main(
device: str = "cuda",
kv_cache_dtype: str | None = None,
) -> None:
current_platform.seed_everything(seed)
set_random_seed(seed)
scale = float(1.0 / (head_size**0.5))
query = torch.empty(

View File

@@ -6,9 +6,8 @@ import time
import torch
from vllm import _custom_ops as ops
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed
@torch.inference_mode()
@@ -23,7 +22,7 @@ def main(
num_warmup_iters: int = 5,
num_iters: int = 100,
) -> None:
current_platform.seed_everything(seed)
set_random_seed(seed)
torch.set_default_device("cuda")
x = torch.randn(num_tokens, hidden_size, dtype=dtype)

View File

@@ -8,11 +8,11 @@ from tabulate import tabulate
from vllm import _custom_ops as ops
from vllm.logger import init_logger
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import (
STR_DTYPE_TO_TORCH_DTYPE,
create_kv_caches_with_random,
set_random_seed,
)
logger = init_logger(__name__)
@@ -36,7 +36,7 @@ def run_benchmark(
if kv_cache_dtype == "fp8" and head_size % 16:
raise ValueError("fp8 kv-cache requires head_size to be a multiple of 16.")
current_platform.seed_everything(42)
set_random_seed(42)
torch.set_default_device(device)
# create random key / value tensors [T, H, D].

View File

@@ -7,15 +7,15 @@ import torch
from tabulate import tabulate
from vllm import _custom_ops as ops
from vllm.attention.ops.triton_reshape_and_cache_flash import (
triton_reshape_and_cache_flash,
)
from vllm.logger import init_logger
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import (
STR_DTYPE_TO_TORCH_DTYPE,
create_kv_caches_with_random_flash,
set_random_seed,
)
from vllm.v1.attention.ops.triton_reshape_and_cache_flash import (
triton_reshape_and_cache_flash,
)
logger = init_logger(__name__)
@@ -49,7 +49,7 @@ def run_benchmark(
if implementation == "triton" and kv_cache_layout == "HND":
return float("nan") # Triton does not support HND layout yet.
current_platform.seed_everything(42)
set_random_seed(42)
torch.set_default_device(device)
# create random key / value tensors [T, H, D].

View File

@@ -23,9 +23,9 @@ import torch
from vllm.model_executor.layers.fused_moe.batched_deep_gemm_moe import (
persistent_masked_m_silu_mul_quant,
)
from vllm.platforms import current_platform
from vllm.triton_utils import tl, triton
from vllm.utils.deep_gemm import is_deep_gemm_e8m0_used
from vllm.utils.torch_utils import set_random_seed
@triton.jit
@@ -207,7 +207,7 @@ def benchmark(
):
def generate_data(seed_offset=0):
"""Generate input data with given seed offset"""
current_platform.seed_everything(42 + seed_offset)
set_random_seed(42 + seed_offset)
y = torch.rand((E, T, 2 * H), dtype=torch.bfloat16, device="cuda").contiguous()
if gen_strategy == "random_imbalanced":

View File

@@ -0,0 +1,272 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import functools
import time
import numpy as np
import torch
from vllm._custom_ops import (
cpu_attention_with_kv_cache,
cpu_attn_get_scheduler_metadata,
cpu_attn_reshape_and_cache,
)
from vllm.platforms import CpuArchEnum, current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
from vllm.v1.attention.backends.cpu_attn import CPUAttentionBackend, _get_attn_isa
def get_attn_isa(
block_size: int | None = None,
dtype: torch.dtype | None = None,
):
if block_size and dtype:
return _get_attn_isa(dtype, block_size)
else:
if current_platform.get_cpu_architecture() == CpuArchEnum.ARM:
return "neon"
elif torch._C._cpu._is_amx_tile_supported():
return "amx"
else:
return "vec"
# Random number generation is slow, so cache the random tensors.
@functools.lru_cache(maxsize=128, typed=False)
def tensor_cache(
elem_num: int,
dtype: torch.dtype,
) -> torch.Tensor:
tensor = torch.randn(elem_num, dtype=dtype)
return tensor
@torch.inference_mode()
def main(
seq_lens: list[tuple[int, int]],
num_heads: tuple[int, int],
head_size: int,
sliding_window: int | None = None,
dtype: torch.dtype = torch.bfloat16,
block_size: int = 128,
num_blocks: int = 4096,
use_sink: bool = False,
enable_kv_split: bool = False,
isa: str | None = None,
seed: int = 0,
iters: int = 20,
) -> None:
current_platform.seed_everything(seed)
num_seqs = len(seq_lens)
query_lens = [x[0] for x in seq_lens]
kv_lens = [x[1] for x in seq_lens]
num_query_heads = num_heads[0]
num_kv_heads = num_heads[1]
assert num_query_heads % num_kv_heads == 0
max_kv_len = max(kv_lens)
window_size = (sliding_window - 1, 0) if sliding_window is not None else (-1, -1)
scale = head_size**-0.5
token_num = sum(query_lens)
if isa is None:
isa = get_attn_isa(block_size, dtype)
s_aux = (
15 * torch.rand((num_query_heads,), dtype=torch.bfloat16) if use_sink else None
)
query = tensor_cache(
elem_num=token_num * num_query_heads * head_size,
dtype=dtype,
)
query = query.view(
token_num,
num_query_heads,
head_size,
)
key_value = tensor_cache(
elem_num=2 * num_blocks * num_kv_heads * block_size * head_size,
dtype=dtype,
)
key_value = key_value.view(
2,
num_blocks,
block_size,
num_kv_heads,
head_size,
)
key_cache, value_cache = key_value.unbind(0)
# KV cache for CPU attention
packed_key_cache = torch.empty(
num_blocks, num_kv_heads, block_size, head_size, dtype=dtype
)
packed_value_cache = torch.empty_like(packed_key_cache)
cu_query_lens = torch.tensor([0] + query_lens, dtype=torch.int32).cumsum(
dim=0, dtype=torch.int32
)
kv_lens_tensor = torch.tensor(kv_lens, dtype=torch.int32)
max_num_blocks_per_seq = (max_kv_len + block_size - 1) // block_size
block_tables = torch.randint(
0, num_blocks, (num_seqs, max_num_blocks_per_seq), dtype=torch.int32
)
# use reshape_and_cache to pack key_cache and value_cache
slot_mapping = torch.arange(0, num_blocks * block_size, dtype=torch.int64)
cpu_attn_reshape_and_cache(
key=key_cache.view(-1, num_kv_heads, head_size),
value=value_cache.view(-1, num_kv_heads, head_size),
key_cache=packed_key_cache,
value_cache=packed_value_cache,
slot_mapping=slot_mapping,
isa=isa,
)
metadata = cpu_attn_get_scheduler_metadata(
num_reqs=num_seqs,
num_heads=num_query_heads,
num_kv_heads=num_kv_heads,
head_dim=head_size,
seq_lens=kv_lens_tensor,
dtype=dtype,
query_start_loc=cu_query_lens,
causal=True,
sliding_window_size=sliding_window if sliding_window is not None else -1,
isa=isa,
enable_kv_split=enable_kv_split,
)
out_with_split = torch.empty_like(query)
def run_benchmark(iters: int) -> list[float]:
times = []
for _ in range(iters):
start_time = time.perf_counter_ns()
cpu_attention_with_kv_cache(
query=query,
key_cache=packed_key_cache,
value_cache=packed_value_cache,
output=out_with_split,
query_start_loc=cu_query_lens,
seq_lens=kv_lens_tensor,
scale=scale,
causal=True,
alibi_slopes=None,
sliding_window=window_size,
block_table=block_tables,
softcap=0,
scheduler_metadata=metadata,
s_aux=s_aux,
)
end_time = time.perf_counter_ns()
times.append((end_time - start_time) / 1e6)
return times
# warmup
run_benchmark(5)
# benchmark
times = run_benchmark(iters)
time_min = min(times)
time_max = max(times)
time_mean = np.mean(times)
time_std = np.std(times)
print("\tmin (ms) = ", time_min)
print("\tmax (ms) = ", time_max)
print("\tmean (ms) = ", time_mean)
print("\tstd (ms) = ", time_std)
print("\tmedian (ms) = ", np.median(times))
def generate_seq_lens(
batch_size: int,
q_len_min: int,
q_len_max: int,
kv_len_min: int,
kv_len_max: int,
seed: int = 0,
) -> list[tuple[int, int]]:
assert 1 <= q_len_min <= q_len_max
assert 1 <= kv_len_min <= kv_len_max
assert kv_len_max >= q_len_min
g = torch.Generator(device="cpu").manual_seed(seed)
def rint(lo: int, hi: int) -> int:
return torch.randint(lo, hi + 1, (1,), generator=g).item()
seq_lens: list[tuple[int, int]] = []
for _ in range(batch_size):
# ensure q <= kv
kv = rint(max(kv_len_min, q_len_min), kv_len_max)
q = rint(q_len_min, min(q_len_max, kv))
seq_lens.append((q, kv))
return seq_lens
if __name__ == "__main__":
parser = FlexibleArgumentParser(description="Benchmark the paged attention kernel.")
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--q-len-min", type=int, default=512)
parser.add_argument("--q-len-max", type=int, default=512)
parser.add_argument("--kv-len-min", type=int, default=512)
parser.add_argument("--kv-len-max", type=int, default=512)
parser.add_argument("--num-blocks", type=int, default=4096)
parser.add_argument("--sliding-window", type=int, default=None)
parser.add_argument("--num-query-heads", type=int, default=32)
parser.add_argument("--num-kv-heads", type=int, default=8)
parser.add_argument(
"--head-size",
type=int,
choices=CPUAttentionBackend.get_supported_head_sizes(),
default=128,
)
parser.add_argument("--enable-kv-split", action="store_true")
parser.add_argument("--block-size", type=int, choices=[32, 64, 128], default=128)
parser.add_argument(
"--dtype", type=str, choices=["half", "bfloat16", "float"], default="bfloat16"
)
parser.add_argument("--use-sink", action="store_true")
parser.add_argument(
"--isa", type=str, choices=["vec", "neon", "amx", "vec16"], default=None
)
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--iters", type=int, default=20)
args = parser.parse_args()
print(args)
seq_lens = generate_seq_lens(
args.batch_size,
args.q_len_min,
args.q_len_max,
args.kv_len_min,
args.kv_len_max,
args.seed,
)
print("batch (query len, kv len) = ", seq_lens)
main(
seq_lens=seq_lens,
num_heads=(args.num_query_heads, args.num_kv_heads),
head_size=args.head_size,
sliding_window=args.sliding_window,
dtype=STR_DTYPE_TO_TORCH_DTYPE[args.dtype],
block_size=args.block_size,
num_blocks=args.num_blocks,
use_sink=args.use_sink,
enable_kv_split=args.enable_kv_split,
isa=args.isa
if args.isa is not None
else get_attn_isa(args.block_size, STR_DTYPE_TO_TORCH_DTYPE[args.dtype]),
seed=args.seed,
iters=args.iters,
)
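
The index bookkeeping in the CPU attention benchmark above (cumulative query offsets, blocks per sequence, and the sliding-window tuple) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the benchmark's own code: the helper names `cumulative_query_offsets`, `max_blocks_per_seq`, and `window_size` are hypothetical, and lists stand in for the `torch` tensors used by the real script.

```python
from itertools import accumulate


def cumulative_query_offsets(query_lens):
    """Return [0, q0, q0 + q1, ...], mirroring how cu_query_lens is built."""
    return [0] + list(accumulate(query_lens))


def max_blocks_per_seq(kv_lens, block_size):
    """Ceiling division of the longest KV length by the block size."""
    return (max(kv_lens) + block_size - 1) // block_size


def window_size(sliding_window):
    """(left, right) window; (-1, -1) means no sliding window."""
    if sliding_window is None:
        return (-1, -1)
    return (sliding_window - 1, 0)


if __name__ == "__main__":
    # Hypothetical batch of three requests.
    print(cumulative_query_offsets([3, 1, 4]))
    print(max_blocks_per_seq([512, 130], block_size=128))
    print(window_size(256))
```

Sequence `i` owns the query rows from offset `i` up to (but excluding) offset `i + 1`, which is why the offsets list has one more entry than the batch has sequences.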


@@ -0,0 +1,175 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import sys
import time
import numpy as np
import torch
from vllm.platforms import current_platform
from vllm.utils.argparse_utils import FlexibleArgumentParser
# Check if CPU MoE operations are available
try:
from vllm._custom_ops import cpu_fused_moe, cpu_prepack_moe_weight
except (ImportError, AttributeError) as e:
print("ERROR: CPU fused MoE operations are not available on this platform.")
print("This benchmark requires an x86 CPU with the vLLM CPU extensions compiled.")
print(
"The cpu_fused_moe kernel is typically available on Linux x86_64 "
"with AVX2/AVX512."
)
print(f"Import error: {e}")
sys.exit(1)
# ISA selection following test_cpu_fused_moe.py pattern
ISA_CHOICES = ["amx", "vec"] if torch._C._cpu._is_amx_tile_supported() else ["vec"]
@torch.inference_mode()
def main(
batch_size: int,
expert_num: int,
hidden_size: int,
intermediate_size: int,
topk_num: int,
use_bias: bool = False,
dtype: torch.dtype = torch.bfloat16,
activation: str = "silu",
isa: str = "vec",
seed: int = 0,
iters: int = 20,
) -> None:
current_platform.seed_everything(seed)
# up_dim = 2 * intermediate_size for gate + up projection
up_dim = 2 * intermediate_size
input_tensor = torch.randn((batch_size, hidden_size), dtype=dtype) / (
0.5 * hidden_size**0.5
)
w13 = torch.randn((expert_num, up_dim, hidden_size), dtype=dtype) / (
0.5 * hidden_size**0.5
)
w2 = torch.randn((expert_num, hidden_size, intermediate_size), dtype=dtype) / (
0.5 * intermediate_size**0.5
)
w13_bias = None
w2_bias = None
if use_bias:
w13_bias = torch.randn((expert_num, up_dim), dtype=dtype) / (0.5 * up_dim**0.5)
w2_bias = torch.randn((expert_num, hidden_size), dtype=dtype) / (
0.5 * hidden_size**0.5
)
router_logits = torch.randn((batch_size, expert_num), dtype=dtype)
score = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
topk_weights, topk_ids = torch.topk(score, topk_num)
topk_ids = topk_ids.to(torch.int32)
packed_w13 = cpu_prepack_moe_weight(w13, isa)
packed_w2 = cpu_prepack_moe_weight(w2, isa)
def run_benchmark(iters: int) -> list[float]:
times = []
for _ in range(iters):
start_time = time.perf_counter_ns()
_ = cpu_fused_moe(
input_tensor,
packed_w13,
packed_w2,
w13_bias,
w2_bias,
topk_weights,
topk_ids,
activation,
isa,
)
end_time = time.perf_counter_ns()
times.append((end_time - start_time) / 1e6)
return times
# warmup
run_benchmark(5)
# benchmark
times = run_benchmark(iters)
if not times:
print("No iterations to measure. Set --iters > 0.")
return
time_min = min(times)
time_max = max(times)
time_mean = np.mean(times)
time_std = np.std(times)
print("\tmin (ms) = ", time_min)
print("\tmax (ms) = ", time_max)
print("\tmean (ms) = ", time_mean)
print("\tstd (ms) = ", time_std)
print("\tmedian (ms) = ", np.median(times))
# Calculate throughput metrics
# FLOPs estimation: 2 * batch * topk * (hidden * up_dim + intermediate * hidden)
flops_per_token = (
2 * topk_num * (hidden_size * up_dim + intermediate_size * hidden_size)
)
total_flops = batch_size * flops_per_token
tflops = total_flops / (time_mean * 1e-3) / 1e12
print(f"\tthroughput (TFLOP/s) = {tflops:.4f}")
if __name__ == "__main__":
parser = FlexibleArgumentParser(description="Benchmark the CPU fused MoE kernel.")
parser.add_argument("--batch-size", type=int, default=64)
parser.add_argument("--expert-num", type=int, default=8)
parser.add_argument("--hidden-size", type=int, default=2880)
parser.add_argument("--intermediate-size", type=int, default=2880)
parser.add_argument(
"--topk-num",
type=int,
default=None,
help="Number of experts to route each token to (default: expert_num // 2)",
)
parser.add_argument("--use-bias", action="store_true")
parser.add_argument(
"--activation",
type=str,
choices=["silu", "swigluoai"],
default="silu",
help="Activation function",
)
parser.add_argument(
"--isa",
type=str,
choices=ISA_CHOICES,
default=ISA_CHOICES[0],
help=f"ISA to use (available: {ISA_CHOICES})",
)
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--iters", type=int, default=20)
args = parser.parse_args()
# Default topk_num to expert_num // 2, minimum 1
topk_num = (
args.topk_num if args.topk_num is not None else max(args.expert_num // 2, 1)
)
print(args)
main(
batch_size=args.batch_size,
expert_num=args.expert_num,
hidden_size=args.hidden_size,
intermediate_size=args.intermediate_size,
topk_num=topk_num,
use_bias=args.use_bias,
dtype=torch.bfloat16, # Following test_cpu_fused_moe.py
activation=args.activation,
isa=args.isa,
seed=args.seed,
iters=args.iters,
)
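
As a worked instance of the FLOPs estimate used in the fused MoE benchmark above, the sketch below evaluates the same formula for the script's default arguments (batch 64, 8 experts so top-k defaults to 4, hidden and intermediate size 2880). The function names `moe_flops` and `tflops` are hypothetical, and the 10 ms mean latency is a made-up input chosen only to show the unit conversion, not a measured result.

```python
def moe_flops(batch_size, topk_num, hidden_size, intermediate_size):
    """Total FLOPs: 2 * batch * topk * (hidden * up_dim + intermediate * hidden)."""
    up_dim = 2 * intermediate_size  # gate + up projection
    flops_per_token = 2 * topk_num * (
        hidden_size * up_dim + intermediate_size * hidden_size
    )
    return batch_size * flops_per_token


def tflops(total_flops, mean_ms):
    """Convert total FLOPs and a mean latency in milliseconds to TFLOP/s."""
    return total_flops / (mean_ms * 1e-3) / 1e12


if __name__ == "__main__":
    total = moe_flops(batch_size=64, topk_num=4, hidden_size=2880,
                      intermediate_size=2880)
    print(total)  # 12740198400
    # Hypothetical mean latency of 10 ms, for illustration only.
    print(f"{tflops(total, 10.0):.4f} TFLOP/s")
```

The estimate counts only the two matrix multiplications per routed expert; activation, bias, and routing costs are left out, as in the benchmark's own comment.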


@@ -31,10 +31,15 @@ if(NOT qutlass_SOURCE_DIR)
endif()
message(STATUS "[QUTLASS] QuTLASS is available at ${qutlass_SOURCE_DIR}")
cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND QUTLASS_ARCHS)
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0f" "${CUDA_ARCHS}")
else()
cuda_archs_loose_intersection(QUTLASS_ARCHS "12.0a;10.0a;10.3a" "${CUDA_ARCHS}")
endif()
if(QUTLASS_ARCHS MATCHES "10\\.0a")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND QUTLASS_ARCHS)
if(QUTLASS_ARCHS MATCHES "10\\.(0a|3a|0f)")
set(QUTLASS_TARGET_CC 100)
elseif(QUTLASS_ARCHS MATCHES "12\\.0a")
set(QUTLASS_TARGET_CC 120)


@@ -38,7 +38,7 @@ else()
FetchContent_Declare(
vllm-flash-attn
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
GIT_TAG 86f8f157cf82aa2342743752b97788922dd7de43
GIT_TAG 188be16520ceefdc625fdf71365585d2ee348fe2
GIT_PROGRESS TRUE
# Don't share the vllm-flash-attn build between build types
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn


@@ -9,16 +9,6 @@
void swap_blocks(torch::Tensor& src, torch::Tensor& dst,
const torch::Tensor& block_mapping);
// Note: the key_caches and value_caches vectors are constant but
// not the Tensors they contain. The vectors need to be const refs
// in order to satisfy pytorch's C++ operator registration code.
void copy_blocks(std::vector<torch::Tensor> const& key_caches,
std::vector<torch::Tensor> const& value_caches,
const torch::Tensor& block_mapping);
void copy_blocks_mla(std::vector<torch::Tensor> const& kv_caches,
const torch::Tensor& block_mapping);
void reshape_and_cache(torch::Tensor& key, torch::Tensor& value,
torch::Tensor& key_cache, torch::Tensor& value_cache,
torch::Tensor& slot_mapping,
@@ -37,6 +27,13 @@ void concat_and_cache_mla(torch::Tensor& kv_c, torch::Tensor& k_pe,
const std::string& kv_cache_dtype,
torch::Tensor& scale);
// NOTE: k_pe and kv_c order is flipped compared to concat_and_cache_mla
void concat_and_cache_mla_rope_fused(
torch::Tensor& positions, torch::Tensor& q_pe, torch::Tensor& k_pe,
torch::Tensor& kv_c, torch::Tensor& rope_cos_sin_cache, bool rope_is_neox,
torch::Tensor& kv_cache_slot_mapping, torch::Tensor& kv_cache,
const std::string& kv_cache_dtype, torch::Tensor& kv_cache_quant_scale);
// Just for unittest
void convert_fp8(torch::Tensor& dst_cache, torch::Tensor& src_cache,
const double scale, const std::string& kv_cache_dtype);


@@ -119,94 +119,6 @@ __global__ void copy_blocks_mla_kernel(
} // namespace vllm
// Note: the key_caches and value_caches vectors are constant but
// not the Tensors they contain. The vectors need to be const refs
// in order to satisfy pytorch's C++ operator registration code.
void copy_blocks(std::vector<torch::Tensor> const& key_caches,
std::vector<torch::Tensor> const& value_caches,
const torch::Tensor& block_mapping) {
int num_layers = key_caches.size();
TORCH_CHECK(num_layers == value_caches.size());
if (num_layers == 0) {
return;
}
torch::Device cache_device = key_caches[0].device();
TORCH_CHECK(cache_device.is_cuda());
// Create data structures for the kernel.
// Create an array of pointers to the key and value caches.
int64_t key_cache_ptrs[num_layers];
int64_t value_cache_ptrs[num_layers];
for (int layer_idx = 0; layer_idx < num_layers; ++layer_idx) {
key_cache_ptrs[layer_idx] =
reinterpret_cast<int64_t>(key_caches[layer_idx].data_ptr());
value_cache_ptrs[layer_idx] =
reinterpret_cast<int64_t>(value_caches[layer_idx].data_ptr());
}
// block_mapping is a 2D tensor with shape (num_pairs, 2).
int num_pairs = block_mapping.size(0);
// Move the data structures to the GPU.
// NOTE: This synchronizes the CPU and GPU.
torch::Tensor key_cache_ptrs_tensor =
torch::from_blob(key_cache_ptrs, {num_layers}, torch::kInt64)
.to(cache_device);
torch::Tensor value_cache_ptrs_tensor =
torch::from_blob(value_cache_ptrs, {num_layers}, torch::kInt64)
.to(cache_device);
// Launch the kernel.
const int numel_per_block = key_caches[0][0].numel();
dim3 grid(num_layers, num_pairs);
dim3 block(std::min(1024, numel_per_block));
const at::cuda::OptionalCUDAGuard device_guard(cache_device);
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_AND_BYTE_TYPES(
key_caches[0].scalar_type(), "copy_blocks_kernel", ([&] {
vllm::copy_blocks_kernel<scalar_t><<<grid, block, 0, stream>>>(
key_cache_ptrs_tensor.data_ptr<int64_t>(),
value_cache_ptrs_tensor.data_ptr<int64_t>(),
block_mapping.data_ptr<int64_t>(), numel_per_block);
}));
}
// copy blocks kernel for MLA (assumes a joint KV-cache)
void copy_blocks_mla(std::vector<torch::Tensor> const& kv_caches,
const torch::Tensor& block_mapping) {
int num_layers = kv_caches.size();
if (num_layers == 0) {
return;
}
torch::Device cache_device = kv_caches[0].device();
TORCH_CHECK(cache_device.is_cuda(), "kv_cache must be on CUDA");
std::vector<int64_t> cache_ptrs(num_layers);
for (int layer_idx = 0; layer_idx < num_layers; ++layer_idx) {
cache_ptrs[layer_idx] =
reinterpret_cast<int64_t>(kv_caches[layer_idx].data_ptr());
}
torch::Tensor cache_ptrs_tensor =
torch::from_blob(cache_ptrs.data(), {num_layers}, torch::kInt64)
.to(cache_device);
int num_pairs = block_mapping.size(0);
// We use the stride instead of numel in case the cache is padded for memory
// alignment reasons, we assume the blocks data (inclusive of any padding)
// is contiguous in memory
int mem_footprint_per_block = kv_caches[0].stride(0);
dim3 grid(num_layers, num_pairs);
dim3 block(std::min(1024, mem_footprint_per_block));
const at::cuda::OptionalCUDAGuard device_guard(cache_device);
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_FLOATING_AND_BYTE_TYPES(
kv_caches[0].scalar_type(), "copy_blocks_mla_kernel", ([&] {
vllm::copy_blocks_mla_kernel<scalar_t><<<grid, block, 0, stream>>>(
cache_ptrs_tensor.data_ptr<int64_t>(),
block_mapping.data_ptr<int64_t>(), mem_footprint_per_block);
}));
}
namespace vllm {
// Used to copy/convert one element
@@ -539,9 +451,6 @@ __global__ void indexer_k_quant_and_cache_kernel(
for (int i = 0; i < VEC_SIZE; i++) {
amax = fmaxf(amax, fabsf(float(k_val_ptr[i])));
}
#ifndef USE_ROCM
__syncwarp();
#endif
// Reduced amax
for (int mask = 16; mask > 0; mask /= 2) {
@@ -551,9 +460,7 @@ __global__ void indexer_k_quant_and_cache_kernel(
amax = fmaxf(amax, __shfl_xor_sync(unsigned(-1), amax, mask));
#endif
}
#ifndef USE_ROCM
__syncwarp();
#endif
#if defined(__gfx942__)
float scale = fmaxf(amax, 1e-4) / 224.0f;
#else

csrc/cache_kernels_fused.cu Normal file

@@ -0,0 +1,279 @@
#include <torch/all.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include "cuda_compat.h"
#include "dispatch_utils.h"
#include "quantization/w8a8/fp8/common.cuh"
#ifdef USE_ROCM
#include "quantization/w8a8/fp8/amd/quant_utils.cuh"
#else
#include "quantization/w8a8/fp8/nvidia/quant_utils.cuh"
#endif
#ifdef USE_ROCM
#include <hip/hip_bf16.h>
typedef __hip_bfloat16 __nv_bfloat16;
#endif
namespace vllm {
// NOTE Be EXTRA careful with raw_kv_scalar_t, for __half and __nv_bfloat16 it's
// using u16 as the backing type.
template <typename qk_t, bool IS_NEOX, typename raw_kv_scalar_t,
typename cache_t, Fp8KVCacheDataType kv_dt>
__global__ void concat_and_cache_mla_rope_fused_kernel(
const int64_t* __restrict__ positions, // [num_tokens]
qk_t* __restrict__ q_pe, // [num_tokens, num_q_heads, rot_dim]
qk_t* __restrict__ k_pe, // [num_tokens, rot_dim]
const qk_t* __restrict__ kv_c, // [num_tokens, kv_lora_rank]
const qk_t* __restrict__ rope_cos_sin_cache, // [max_position, 2,
// rot_dim // 2]
const int rot_dim, const int64_t q_pe_stride_token,
const int64_t q_pe_stride_head, const int64_t k_pe_stride,
const int64_t kv_c_stride, const int num_q_heads,
cache_t* __restrict__ kv_cache, // [num_blocks, block_size, (kv_lora_rank +
// rot_dim)]
const int64_t* __restrict__ kv_cache_slot_mapping, // [num_tokens]
const int block_stride, const int entry_stride, const int kv_lora_rank,
const int block_size, const float* kv_cache_quant_scale) {
// Each thread block is responsible for one token.
const int64_t token_idx = blockIdx.x;
const int64_t pos = positions[token_idx];
const qk_t* cos_sin_ptr = rope_cos_sin_cache + pos * rot_dim;
const int embed_dim = rot_dim / 2;
// Q ROPE
const int nq = num_q_heads * embed_dim;
for (int i = threadIdx.x; i < nq; i += blockDim.x) {
int head_idx = i / embed_dim;
int pair_idx = i % embed_dim;
// NOTE: Would be nice to have interleaved sin/cos so we could just load
// both at the same time.
qk_t cos = VLLM_LDG(cos_sin_ptr + pair_idx);
qk_t sin = VLLM_LDG(cos_sin_ptr + pair_idx + embed_dim);
qk_t* q_pe_head_ptr =
q_pe + token_idx * q_pe_stride_token + head_idx * q_pe_stride_head;
int pair_idx_x, pair_idx_y;
if constexpr (IS_NEOX) {
// GPT-NeoX style rotary embedding.
pair_idx_x = pair_idx;
pair_idx_y = embed_dim + pair_idx;
} else {
// GPT-J style rotary embedding.
pair_idx_x = pair_idx * 2;
pair_idx_y = pair_idx * 2 + 1;
}
qk_t x_src = q_pe_head_ptr[pair_idx_x];
qk_t y_src = q_pe_head_ptr[pair_idx_y];
qk_t x_dst = x_src * cos - y_src * sin;
qk_t y_dst = y_src * cos + x_src * sin;
q_pe_head_ptr[pair_idx_x] = x_dst;
q_pe_head_ptr[pair_idx_y] = y_dst;
}
const int64_t slot_idx = kv_cache_slot_mapping[token_idx];
const int64_t block_idx = slot_idx / block_size;
const int64_t entry_idx = slot_idx % block_size;
// NOTE: slot_idx can be -1 if the token is padded
if (slot_idx < 0) {
return;
}
// K with 1 HEAD
for (int i = threadIdx.x; i < embed_dim; i += blockDim.x) {
int pair_idx = i;
qk_t cos = VLLM_LDG(cos_sin_ptr + pair_idx);
qk_t sin = VLLM_LDG(cos_sin_ptr + pair_idx + embed_dim);
qk_t* k_pe_head_ptr = k_pe + token_idx * k_pe_stride;
int pair_idx_x, pair_idx_y;
if constexpr (IS_NEOX) {
// GPT-NeoX style rotary embedding.
pair_idx_x = pair_idx;
pair_idx_y = embed_dim + pair_idx;
} else {
// GPT-J style rotary embedding.
pair_idx_x = pair_idx * 2;
pair_idx_y = pair_idx * 2 + 1;
}
qk_t x_src = k_pe_head_ptr[pair_idx_x];
qk_t y_src = k_pe_head_ptr[pair_idx_y];
qk_t x_dst = x_src * cos - y_src * sin;
qk_t y_dst = y_src * cos + x_src * sin;
k_pe_head_ptr[pair_idx_x] = x_dst;
k_pe_head_ptr[pair_idx_y] = y_dst;
// NOTE: Why is this reinterpretation necessary?
// When K is of type float16, the actual template argument for
// raw_kv_scalar_t will be u16. That is why the cast is applied only at the
// last moment; otherwise the arithmetic above would operate on raw bits.
const raw_kv_scalar_t raw_x_value =
*reinterpret_cast<const raw_kv_scalar_t*>(&x_dst);
const raw_kv_scalar_t raw_y_value =
*reinterpret_cast<const raw_kv_scalar_t*>(&y_dst);
cache_t* kv_cache_ptr = kv_cache + block_idx * block_stride +
entry_idx * entry_stride + kv_lora_rank;
// MLA Cache Store
if constexpr (kv_dt == Fp8KVCacheDataType::kAuto) {
kv_cache_ptr[pair_idx_x] = raw_x_value;
kv_cache_ptr[pair_idx_y] = raw_y_value;
} else {
kv_cache_ptr[pair_idx_x] =
fp8::scaled_convert<cache_t, raw_kv_scalar_t, kv_dt>(
raw_x_value, *kv_cache_quant_scale);
kv_cache_ptr[pair_idx_y] =
fp8::scaled_convert<cache_t, raw_kv_scalar_t, kv_dt>(
raw_y_value, *kv_cache_quant_scale);
}
}
// NOPE
for (int i = threadIdx.x; i < kv_lora_rank; i += blockDim.x) {
const qk_t* src_ptr = kv_c + token_idx * kv_c_stride + i;
const raw_kv_scalar_t src_value =
*reinterpret_cast<const raw_kv_scalar_t*>(src_ptr);
cache_t* kv_cache_ptr =
kv_cache + block_idx * block_stride + entry_idx * entry_stride;
if constexpr (kv_dt == Fp8KVCacheDataType::kAuto) {
kv_cache_ptr[i] = src_value;
} else {
kv_cache_ptr[i] = fp8::scaled_convert<cache_t, raw_kv_scalar_t, kv_dt>(
src_value, *kv_cache_quant_scale);
}
}
}
} // namespace vllm
#define CALL_CONCAT_AND_CACHE_MLA_ROPE_FUSED(RAW_KV_T, CACHE_T, KV_DTYPE) \
do { \
VLLM_DISPATCH_FLOATING_TYPES(q_pe.scalar_type(), "qk_scalar_type", [&] { \
using qk_t = scalar_t; \
if (rope_is_neox) { \
vllm::concat_and_cache_mla_rope_fused_kernel<qk_t, true, RAW_KV_T, \
CACHE_T, KV_DTYPE> \
<<<grid, block, 0, stream>>>( \
positions.data_ptr<int64_t>(), q_pe.data_ptr<qk_t>(), \
k_pe.data_ptr<qk_t>(), kv_c.data_ptr<qk_t>(), \
rope_cos_sin_cache.data_ptr<qk_t>(), rot_dim, \
q_pe_stride_token, q_pe_stride_head, k_pe_stride, kv_c_stride, \
num_q_heads, reinterpret_cast<CACHE_T*>(kv_cache.data_ptr()), \
kv_cache_slot_mapping.data_ptr<int64_t>(), block_stride, \
entry_stride, kv_lora_rank, block_size, \
kv_cache_quant_scale.data_ptr<float>()); \
} else { \
vllm::concat_and_cache_mla_rope_fused_kernel<qk_t, false, RAW_KV_T, \
CACHE_T, KV_DTYPE> \
<<<grid, block, 0, stream>>>( \
positions.data_ptr<int64_t>(), q_pe.data_ptr<qk_t>(), \
k_pe.data_ptr<qk_t>(), kv_c.data_ptr<qk_t>(), \
rope_cos_sin_cache.data_ptr<qk_t>(), rot_dim, \
q_pe_stride_token, q_pe_stride_head, k_pe_stride, kv_c_stride, \
num_q_heads, reinterpret_cast<CACHE_T*>(kv_cache.data_ptr()), \
kv_cache_slot_mapping.data_ptr<int64_t>(), block_stride, \
entry_stride, kv_lora_rank, block_size, \
kv_cache_quant_scale.data_ptr<float>()); \
} \
}); \
} while (false)
// Executes RoPE on q_pe and k_pe, then writes k_pe and kv_c in the kv cache.
// q_pe and k_pe are modified in place.
// Replaces DeepseekScalingRotaryEmbedding.self.rotary_emb and
// concat_and_cache_mla.
void concat_and_cache_mla_rope_fused(
torch::Tensor& positions, // [num_tokens]
torch::Tensor& q_pe, // [num_tokens, num_q_heads, rot_dim]
torch::Tensor& k_pe, // [num_tokens, rot_dim]
torch::Tensor& kv_c, // [num_tokens, kv_lora_rank]
torch::Tensor& rope_cos_sin_cache, // [max_position, rot_dim]
bool rope_is_neox,
torch::Tensor&
kv_cache_slot_mapping, // [num_tokens] or [num_actual_tokens]
torch::Tensor&
kv_cache, // [num_blocks, block_size, (kv_lora_rank + rot_dim)]
const std::string& kv_cache_dtype, torch::Tensor& kv_cache_quant_scale) {
const int64_t num_tokens = q_pe.size(0);
const int num_q_heads = q_pe.size(1);
const int rot_dim = q_pe.size(2);
const int kv_lora_rank = kv_c.size(1);
TORCH_CHECK(positions.size(0) >=
num_tokens); // CUDA Graphs might pad this for us
TORCH_CHECK_EQ(positions.dim(), 1);
TORCH_CHECK_EQ(positions.scalar_type(), c10::ScalarType::Long);
TORCH_CHECK_EQ(q_pe.size(0), num_tokens);
TORCH_CHECK_EQ(q_pe.size(1), num_q_heads);
TORCH_CHECK_EQ(q_pe.size(2), rot_dim);
TORCH_CHECK_EQ(q_pe.dim(), 3);
TORCH_CHECK_EQ(k_pe.size(0), num_tokens);
TORCH_CHECK_EQ(k_pe.size(1), rot_dim);
TORCH_CHECK_EQ(k_pe.dim(), 2);
TORCH_CHECK_EQ(k_pe.scalar_type(), q_pe.scalar_type());
TORCH_CHECK_EQ(kv_c.size(0), num_tokens);
TORCH_CHECK_EQ(kv_c.size(1), kv_lora_rank);
TORCH_CHECK_EQ(kv_c.dim(), 2);
TORCH_CHECK_EQ(kv_c.scalar_type(), q_pe.scalar_type());
TORCH_CHECK_EQ(kv_c.dtype(), q_pe.dtype());
TORCH_CHECK_EQ(rope_cos_sin_cache.size(1), rot_dim);
TORCH_CHECK_EQ(rope_cos_sin_cache.scalar_type(), q_pe.scalar_type());
TORCH_CHECK_EQ(kv_cache_slot_mapping.size(0), num_tokens);
TORCH_CHECK_EQ(kv_cache_slot_mapping.scalar_type(), c10::ScalarType::Long);
TORCH_CHECK_EQ(kv_cache.size(2), kv_lora_rank + rot_dim);
TORCH_CHECK_EQ(kv_cache.dim(), 3);
TORCH_CHECK_EQ(kv_cache_quant_scale.numel(), 1);
TORCH_CHECK_EQ(kv_cache_quant_scale.scalar_type(), c10::ScalarType::Float);
int64_t q_pe_stride_token = q_pe.stride(0);
int64_t q_pe_stride_head = q_pe.stride(1);
int64_t k_pe_stride = k_pe.stride(0);
int64_t kv_c_stride = kv_c.stride(0);
int block_size = kv_cache.size(1);
int block_stride = kv_cache.stride(0);
int entry_stride = kv_cache.stride(1);
int rope_block_size = std::min(num_q_heads * rot_dim / 2, 512);
int mla_block_size = kv_lora_rank;
int thread_block_size =
std::min(std::max(rope_block_size, mla_block_size), 512);
dim3 grid(num_tokens, 1, 1);
dim3 block(thread_block_size, 1, 1);
const at::cuda::OptionalCUDAGuard device_guard(device_of(positions));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
DISPATCH_BY_KV_CACHE_DTYPE(kv_c.dtype(), kv_cache_dtype,
CALL_CONCAT_AND_CACHE_MLA_ROPE_FUSED);
}
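
The element pairing and slot arithmetic in the fused kernel above can be checked against a small host-side reference. The sketch below is written in plain Python under stated assumptions: `rope_rotate` and `slot_to_block_entry` are hypothetical names, `cos` and `sin` are passed as separate lists of length `rot_dim // 2` rather than read from one cache row, and FP8 scaling is left out entirely.

```python
import math


def rope_rotate(vec, cos, sin, is_neox):
    """Rotate one head vector of length rot_dim and return a new list.

    NeoX style pairs element i with element i + rot_dim // 2;
    GPT-J style pairs element 2 * i with element 2 * i + 1.
    """
    embed_dim = len(vec) // 2
    out = list(vec)
    for pair_idx in range(embed_dim):
        if is_neox:
            ix, iy = pair_idx, embed_dim + pair_idx
        else:
            ix, iy = 2 * pair_idx, 2 * pair_idx + 1
        x, y = vec[ix], vec[iy]
        out[ix] = x * cos[pair_idx] - y * sin[pair_idx]
        out[iy] = y * cos[pair_idx] + x * sin[pair_idx]
    return out


def slot_to_block_entry(slot_idx, block_size):
    """Map a flat cache slot to (block, entry); None for padded tokens."""
    if slot_idx < 0:
        return None
    return slot_idx // block_size, slot_idx % block_size


if __name__ == "__main__":
    angles = [math.pi / 2, 0.0]  # one angle per rotation pair
    cos = [math.cos(a) for a in angles]
    sin = [math.sin(a) for a in angles]
    print(rope_rotate([1.0, 0.0, 0.0, 0.0], cos, sin, is_neox=True))
    print(rope_rotate([1.0, 0.0, 0.0, 0.0], cos, sin, is_neox=False))
    print(slot_to_block_entry(130, block_size=64))
```

Because each pair is rotated by a plain 2D rotation, the vector norm is preserved in both layouts; only the positions of the paired elements differ.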


@@ -15,6 +15,7 @@
#ifdef __aarch64__
#include "cpu_attn_neon.hpp"
// NEON requires head_dim to be a multiple of 32
#define NEON_DISPATCH(...) \
case cpu_attention::ISA::NEON: { \
using attn_impl = cpu_attention::AttentionImpl<cpu_attention::ISA::NEON, \
@@ -36,7 +37,9 @@
switch (HEAD_DIM) { \
CPU_ATTN_DISPATCH_CASE(32, __VA_ARGS__) \
CPU_ATTN_DISPATCH_CASE(64, __VA_ARGS__) \
CPU_ATTN_DISPATCH_CASE(80, __VA_ARGS__) \
CPU_ATTN_DISPATCH_CASE(96, __VA_ARGS__) \
CPU_ATTN_DISPATCH_CASE(112, __VA_ARGS__) \
CPU_ATTN_DISPATCH_CASE(128, __VA_ARGS__) \
CPU_ATTN_DISPATCH_CASE(160, __VA_ARGS__) \
CPU_ATTN_DISPATCH_CASE(192, __VA_ARGS__) \


@@ -377,7 +377,7 @@ class AttentionImpl<ISA::AMX, scalar_t, head_dim> {
const int32_t q_heads_per_kv, const int64_t q_num_stride,
const int64_t q_head_stride, const float scale) {
constexpr int64_t bytes_per_head = head_dim * sizeof(scalar_t);
static_assert(bytes_per_head % AMX_TILE_ROW_BYTES == 0);
// static_assert(bytes_per_head % AMX_TILE_ROW_BYTES == 0);
constexpr int64_t head_size_block_num = bytes_per_head / AMX_TILE_ROW_BYTES;
constexpr int64_t head_elem_num_pre_block =
AMX_TILE_ROW_BYTES / sizeof(scalar_t);


@@ -264,7 +264,7 @@ class AttentionImpl<ISA::NEON, scalar_t, head_dim> {
constexpr static ISA ISAType = ISA::NEON;
constexpr static bool scale_on_logits = false; // apply scale on q_buffer
static_assert(HeadDim % HeadDimAlignment == 0);
// static_assert(HeadDim % HeadDimAlignment == 0);
// the gemm micro kernel is Mx8
static_assert(HeadDimAlignment % 8 == 0);
static_assert(BlockSizeAlignment % 8 == 0);


@@ -24,6 +24,8 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) {
#ifndef VLLM_NUMA_DISABLED
std::string init_cpu_threads_env(const std::string& cpu_ids) {
bitmask* omp_cpu_mask = numa_parse_cpustring_all(cpu_ids.c_str());
TORCH_CHECK(omp_cpu_mask != nullptr,
"Failed to parse CPU string: " + cpu_ids);
TORCH_CHECK(omp_cpu_mask->size > 0);
std::vector<int> omp_cpu_ids;
omp_cpu_ids.reserve(omp_cpu_mask->size);
@@ -44,20 +46,12 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) {
// Memory node binding
if (numa_available() != -1) {
int mem_node_id = numa_node_of_cpu(omp_cpu_ids.front());
std::set<int> node_ids;
for (const auto& cpu_id : omp_cpu_ids) {
int node_id = numa_node_of_cpu(cpu_id);
if (node_id != -1) {
node_ids.insert(node_id);
}
if (node_id != mem_node_id) {
TORCH_WARN("CPU ", cpu_id, " is on NUMA node ", node_id, ", but CPU ",
omp_cpu_ids.front(), " is on NUMA node ", mem_node_id,
". All CPUs should be on the same NUMA node for optimal "
"performance. Memory will be bound to NUMA node ",
mem_node_id, ".");
}
}
// Concatenate all node_ids into a single comma-separated string
if (!node_ids.empty()) {
@@ -70,7 +64,7 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) {
}
bitmask* mask = numa_parse_nodestring(node_ids_str.c_str());
bitmask* src_mask = numa_get_membind();
bitmask* src_mask = numa_get_mems_allowed();
int pid = getpid();
@@ -83,15 +77,46 @@ std::string init_cpu_threads_env(const std::string& cpu_ids) {
std::to_string(errno));
}
// restrict memory allocation node.
numa_set_membind(mask);
// Restrict memory allocation to the selected NUMA node(s).
// Enhances memory locality for the threads bound to those NUMA CPUs.
if (node_ids.size() > 1) {
errno = 0;
numa_set_interleave_mask(mask);
if (errno != 0) {
TORCH_WARN("numa_set_interleave_mask failed. errno: " +
std::to_string(errno));
} else {
TORCH_WARN(
"NUMA binding: Using INTERLEAVE policy for memory "
"allocation across multiple NUMA nodes (nodes: " +
node_ids_str +
"). Memory allocations will be "
"interleaved across the specified NUMA nodes.");
}
} else {
errno = 0;
numa_set_membind(mask);
if (errno != 0) {
TORCH_WARN("numa_set_membind failed. errno: " +
std::to_string(errno));
} else {
TORCH_WARN(
"NUMA binding: Using MEMBIND policy for memory "
"allocation on the NUMA nodes (" +
node_ids_str +
"). Memory allocations will be "
"strictly bound to these NUMA nodes.");
}
}
numa_set_strict(1);
numa_free_nodemask(mask);
numa_free_nodemask(src_mask);
} else {
TORCH_WARN("numa_parse_nodestring or numa_get_membind failed. errno: " +
std::to_string(errno));
TORCH_WARN(
"numa_parse_nodestring or numa_get_mems_allowed failed. errno: " +
std::to_string(errno));
}
}
}
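
The policy choice in the NUMA binding hunk above (interleave when the bound CPUs span several nodes, strict membind when they sit on one) can be sketched as a pure function. This is an illustration under stated assumptions: `choose_memory_policy` is a hypothetical name, and the `cpu_to_node` dictionary stands in for the `numa_node_of_cpu` lookups, with `-1` or a missing key meaning the node is unknown.

```python
def choose_memory_policy(cpu_to_node, cpu_ids):
    """Return (policy, node_string), or None when no node could be resolved."""
    # Collect the distinct NUMA nodes of the selected CPUs, skipping unknowns.
    node_ids = sorted(
        {cpu_to_node.get(cpu, -1) for cpu in cpu_ids} - {-1}
    )
    if not node_ids:
        return None
    node_str = ",".join(str(n) for n in node_ids)
    # More than one node: spread allocations; otherwise bind strictly.
    policy = "interleave" if len(node_ids) > 1 else "membind"
    return policy, node_str


if __name__ == "__main__":
    # Hypothetical topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1.
    topology = {0: 0, 1: 0, 2: 1, 3: 1}
    print(choose_memory_policy(topology, [0, 1]))
    print(choose_memory_policy(topology, [0, 2]))
```

Keeping the decision separate from the libnuma calls makes it easy to test the selection logic without a multi-node machine.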


@@ -37,10 +37,12 @@ struct VecTypeTrait<c10::BFloat16> {
};
#endif
#if !defined(__powerpc__)
template <>
struct VecTypeTrait<c10::Half> {
using vec_t = vec_op::FP16Vec16;
};
#endif
struct Counter {
std::atomic<int64_t> counter;


@@ -107,7 +107,8 @@ __global__ void fusedQKNormRopeKernel(
void const* k_weight_void, // RMSNorm weights for key
void const* cos_sin_cache_void, // Pre-computed cos/sin cache
int64_t const* position_ids, // Position IDs for RoPE
int const num_tokens // Number of tokens
int const num_tokens, // Number of tokens
int const rotary_dim // Dimension for RoPE
) {
#if (!defined(__CUDA_ARCH__) || __CUDA_ARCH__ < 800) && !defined(USE_ROCM)
if constexpr ((std::is_same_v<scalar_t_in, c10::BFloat16>) ||
@@ -227,56 +228,59 @@ __global__ void fusedQKNormRopeKernel(
// Calculate cache pointer for this position - similar to
// pos_encoding_kernels.cu
T_cache const* cache_ptr = cos_sin_cache + pos_id * head_dim;
int const embed_dim = head_dim / 2;
T_cache const* cache_ptr = cos_sin_cache + pos_id * rotary_dim;
int const embed_dim = rotary_dim / 2;
T_cache const* cos_ptr = cache_ptr;
T_cache const* sin_ptr = cache_ptr + embed_dim;
if constexpr (interleave) {
// Perform interleaving. Use pre-computed cos/sin values.
int const rotary_lanes = rotary_dim / numElemsPerThread; // rotary range
if (laneId < rotary_lanes) {
if constexpr (interleave) {
// Perform interleaving. Use pre-computed cos/sin values.
#pragma unroll
for (int i = 0; i < numElemsPerThread / 2; ++i) {
int const idx0 = 2 * i;
int const idx1 = 2 * i + 1;
for (int i = 0; i < numElemsPerThread / 2; ++i) {
int const idx0 = 2 * i;
int const idx1 = 2 * i + 1;
// Global dimension index in the head
int const dim_idx = laneId * numElemsPerThread + idx0;
float const val0 = elements[idx0];
float const val1 = elements[idx1];
float const val0 = elements[idx0];
float const val1 = elements[idx1];
int const dim_idx = laneId * numElemsPerThread + idx0;
int const half_dim = dim_idx / 2;
float const cos_val =
CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim));
float const sin_val =
CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim));
int const half_dim = dim_idx / 2;
float const cos_val =
CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim));
float const sin_val =
CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim));
elements[idx0] = val0 * cos_val - val1 * sin_val;
elements[idx1] = val0 * sin_val + val1 * cos_val;
}
} else {
// Before exchanging data within the warp, we need to sync.
__syncwarp();
// Get the data from the other half of the warp. Use pre-computed cos/sin
// values.
#pragma unroll
for (int i = 0; i < numElemsPerThread; i++) {
elements2[i] = __shfl_xor_sync(FINAL_MASK, elements[i], 16);
if (laneId < 16) {
elements2[i] = -elements2[i];
elements[idx0] = val0 * cos_val - val1 * sin_val;
elements[idx1] = val0 * sin_val + val1 * cos_val;
}
} else {
// Before exchanging data within the warp, we need to sync.
__syncwarp();
int pairOffset = (rotary_dim / 2) / numElemsPerThread;
// Get the data from the other half of the warp. Use pre-computed
// cos/sin values.
#pragma unroll
for (int i = 0; i < numElemsPerThread; i++) {
elements2[i] = __shfl_xor_sync(FINAL_MASK, elements[i], pairOffset);
int dim_idx = laneId * numElemsPerThread + i;
dim_idx = (dim_idx * 2) % head_dim;
int half_dim = dim_idx / 2;
// Use pre-computed cos/sin from cache
float cos_val = CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim));
float sin_val = CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim));
if (laneId < pairOffset) {
elements2[i] = -elements2[i];
}
int dim_idx = laneId * numElemsPerThread + i;
elements[i] = elements[i] * cos_val + elements2[i] * sin_val;
dim_idx = (dim_idx * 2) % rotary_dim;
int half_dim = dim_idx / 2;
float cos_val = CacheConverter::convert(VLLM_LDG(cos_ptr + half_dim));
float sin_val = CacheConverter::convert(VLLM_LDG(sin_ptr + half_dim));
elements[i] = elements[i] * cos_val + elements2[i] * sin_val;
}
// __shfl_xor_sync does not provide memfence. Need to sync again.
__syncwarp();
}
// __shfl_xor_sync does not provide memfence. Need to sync again.
__syncwarp();
}
// Store.
{
vec_T vec;
@@ -312,10 +316,10 @@ template <typename scalar_t_in, typename scalar_t_cache>
void launchFusedQKNormRope(void* qkv, int const num_tokens,
int const num_heads_q, int const num_heads_k,
int const num_heads_v, int const head_dim,
float const eps, void const* q_weight,
void const* k_weight, void const* cos_sin_cache,
bool const interleave, int64_t const* position_ids,
cudaStream_t stream) {
int const rotary_dim, float const eps,
void const* q_weight, void const* k_weight,
void const* cos_sin_cache, bool const interleave,
int64_t const* position_ids, cudaStream_t stream) {
constexpr int blockSize = 256;
int const warpsPerBlock = blockSize / 32;
@@ -332,7 +336,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens,
fusedQKNormRopeKernel<scalar_t_in, scalar_t_cache, 64, INTERLEAVE>
<<<gridDim, blockDim, 0, stream>>>(
qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight,
k_weight, cos_sin_cache, position_ids, num_tokens);
k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim);
});
break;
case 128:
@@ -340,7 +344,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens,
fusedQKNormRopeKernel<scalar_t_in, scalar_t_cache, 128, INTERLEAVE>
<<<gridDim, blockDim, 0, stream>>>(
qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight,
k_weight, cos_sin_cache, position_ids, num_tokens);
k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim);
});
break;
case 256:
@@ -348,7 +352,7 @@ void launchFusedQKNormRope(void* qkv, int const num_tokens,
fusedQKNormRopeKernel<scalar_t_in, scalar_t_cache, 256, INTERLEAVE>
<<<gridDim, blockDim, 0, stream>>>(
qkv, num_heads_q, num_heads_k, num_heads_v, eps, q_weight,
k_weight, cos_sin_cache, position_ids, num_tokens);
k_weight, cos_sin_cache, position_ids, num_tokens, rotary_dim);
});
break;
default:
@@ -392,8 +396,11 @@ void fused_qk_norm_rope(
"Query weights size must match head dimension");
TORCH_CHECK(k_weight.size(0) == head_dim,
"Key weights size must match head dimension");
TORCH_CHECK(cos_sin_cache.size(1) == head_dim,
"Cos/sin cache dimension must match head_dim");
TORCH_CHECK(cos_sin_cache.size(1) % 2 == 0, "rotary_dim must be even");
TORCH_CHECK(cos_sin_cache.size(1) <= head_dim,
"rotary_dim must be less than or equal to head_dim");
TORCH_CHECK(qkv.scalar_type() == q_weight.scalar_type() &&
qkv.scalar_type() == k_weight.scalar_type(),
"qkv, q_weight and k_weight must have the same dtype");
@@ -419,7 +426,8 @@ void fused_qk_norm_rope(
qkv.data_ptr(), static_cast<int>(num_tokens),
static_cast<int>(num_heads_q), static_cast<int>(num_heads_k),
static_cast<int>(num_heads_v), static_cast<int>(head_dim),
static_cast<float>(eps), q_weight.data_ptr(), k_weight.data_ptr(),
static_cast<int>(cos_sin_cache.size(1)), static_cast<float>(eps),
q_weight.data_ptr(), k_weight.data_ptr(),
cos_sin_cache.data_ptr(), !is_neox,
reinterpret_cast<int64_t const*>(position_ids.data_ptr()),
stream);
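
The hunks above let the rotary width (cos_sin_cache.size(1)) be smaller than head_dim, so only the first rotary_dim elements of each head are rotated. A minimal host-side sketch of that behaviour, assuming the per-position cache layout implied by cos_ptr and sin_ptr (rotary_dim / 2 cosines followed by rotary_dim / 2 sines); apply_partial_rope_ref is a hypothetical helper for illustration and is not part of the diff:

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU reference for partial RoPE on a single head.
// cos_sin holds rotary_dim values for one position:
// rotary_dim / 2 cosines followed by rotary_dim / 2 sines (assumed layout).
inline void apply_partial_rope_ref(std::vector<float>& head,
                                   const std::vector<float>& cos_sin,
                                   int rotary_dim, bool interleave) {
  const int head_dim = static_cast<int>(head.size());
  // Same preconditions as the TORCH_CHECKs in fused_qk_norm_rope.
  assert(rotary_dim % 2 == 0);
  assert(rotary_dim <= head_dim);
  assert(static_cast<int>(cos_sin.size()) == rotary_dim);

  const int embed_dim = rotary_dim / 2;
  const float* cos_ptr = cos_sin.data();
  const float* sin_ptr = cos_sin.data() + embed_dim;

  for (int i = 0; i < embed_dim; ++i) {
    // Interleaved style pairs (2i, 2i + 1); NeoX style pairs (i, i + embed_dim).
    const int idx0 = interleave ? 2 * i : i;
    const int idx1 = interleave ? 2 * i + 1 : i + embed_dim;
    const float x0 = head[idx0];
    const float x1 = head[idx1];
    head[idx0] = x0 * cos_ptr[i] - x1 * sin_ptr[i];
    head[idx1] = x0 * sin_ptr[i] + x1 * cos_ptr[i];
  }
  // Elements in [rotary_dim, head_dim) are left unchanged.
}
```

With rotary_dim equal to head_dim this reduces to the pre-change behaviour, which is why the old equality check on the cache width could be relaxed to the two checks shown above.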

@@ -457,8 +457,8 @@ __device__ inline T apply_scoring(T val) {
}
}
template <typename T, ScoringFunc SF>
__device__ void topk_with_k2(T* output, T const* input, T const* bias,
template <typename T, typename BiasT, ScoringFunc SF>
__device__ void topk_with_k2(T* output, T const* input, BiasT const* bias,
cg::thread_block_tile<32> const& tile,
int32_t const lane_id,
int const num_experts_per_group) {
@@ -469,7 +469,7 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias,
if (num_experts_per_group > WARP_SIZE) {
for (int i = lane_id; i < num_experts_per_group; i += WARP_SIZE) {
T value = apply_scoring<SF>(input[i]);
value = value + bias[i];
value = value + static_cast<T>(bias[i]);
if (value > largest) {
second_largest = largest;
@@ -481,7 +481,7 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias,
} else {
for (int i = lane_id; i < num_experts_per_group; i += WARP_SIZE) {
T value = apply_scoring<SF>(input[i]);
value = value + bias[i];
value = value + static_cast<T>(bias[i]);
largest = value;
}
}
@@ -503,8 +503,8 @@ __device__ void topk_with_k2(T* output, T const* input, T const* bias,
}
}
template <typename T, ScoringFunc SF>
__global__ void topk_with_k2_kernel(T* output, T* input, T const* bias,
template <typename T, typename BiasT, ScoringFunc SF>
__global__ void topk_with_k2_kernel(T* output, T* input, BiasT const* bias,
int64_t const num_tokens,
int64_t const num_cases,
int64_t const n_group,
@@ -517,7 +517,7 @@ __global__ void topk_with_k2_kernel(T* output, T* input, T const* bias,
input += case_id * num_experts_per_group;
// bias is per expert group, offset to current group
int32_t group_id = case_id % n_group;
T const* group_bias = bias + group_id * num_experts_per_group;
BiasT const* group_bias = bias + group_id * num_experts_per_group;
output += case_id;
cg::thread_block block = cg::this_thread_block();
@@ -526,18 +526,19 @@ __global__ void topk_with_k2_kernel(T* output, T* input, T const* bias,
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
asm volatile("griddepcontrol.wait;");
#endif
topk_with_k2<T, SF>(output, input, group_bias, tile, lane_id,
num_experts_per_group);
topk_with_k2<T, BiasT, SF>(output, input, group_bias, tile, lane_id,
num_experts_per_group);
}
#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
asm volatile("griddepcontrol.launch_dependents;");
#endif
}
template <typename T, typename IdxT, ScoringFunc SF, int NGroup = -1>
template <typename T, typename BiasT, typename IdxT, ScoringFunc SF,
int NGroup = -1>
__global__ void group_idx_and_topk_idx_kernel(
T* scores, T const* group_scores, float* topk_values, IdxT* topk_indices,
T const* bias, int64_t const num_tokens, int64_t const n_group,
BiasT const* bias, int64_t const num_tokens, int64_t const n_group,
int64_t const topk_group, int64_t const topk, int64_t const num_experts,
int64_t const num_experts_per_group, bool renormalize,
double routed_scaling_factor) {
@@ -623,7 +624,7 @@ __global__ void group_idx_and_topk_idx_kernel(
T input = scores[offset + i];
if (is_finite(input)) {
T score = apply_scoring<SF>(input);
candidates = score + bias[offset + i];
candidates = score + static_cast<T>(bias[offset + i]);
}
}
queue.add(candidates, offset + i);
@@ -698,10 +699,10 @@ __global__ void group_idx_and_topk_idx_kernel(
#endif
}
template <typename T, typename IdxT, ScoringFunc SF>
template <typename T, typename BiasT, typename IdxT, ScoringFunc SF>
inline void launch_group_idx_and_topk_kernel(
cudaLaunchConfig_t const& config, T* scores, T* group_scores,
float* topk_values, IdxT* topk_indices, T const* bias,
float* topk_values, IdxT* topk_indices, BiasT const* bias,
int64_t const num_tokens, int64_t const n_group, int64_t const topk_group,
int64_t const topk, int64_t const num_experts,
int64_t const num_experts_per_group, bool const renormalize,
@@ -715,36 +716,36 @@ inline void launch_group_idx_and_topk_kernel(
switch (n_group) {
case 4: {
launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF, 4>);
launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF, 4>);
break;
}
case 8: {
launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF, 8>);
launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF, 8>);
break;
}
case 16: {
launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF, 16>);
launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF, 16>);
break;
}
case 32: {
launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF, 32>);
launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF, 32>);
break;
}
default: {
launch(&group_idx_and_topk_idx_kernel<T, IdxT, SF>);
launch(&group_idx_and_topk_idx_kernel<T, BiasT, IdxT, SF>);
break;
}
}
}
template <typename T, typename IdxT>
template <typename T, typename BiasT, typename IdxT>
void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
IdxT* topk_indices, T const* bias, int64_t const num_tokens,
int64_t const num_experts, int64_t const n_group,
int64_t const topk_group, int64_t const topk,
bool const renormalize, double const routed_scaling_factor,
int const scoring_func, bool enable_pdl = false,
cudaStream_t const stream = 0) {
IdxT* topk_indices, BiasT const* bias,
int64_t const num_tokens, int64_t const num_experts,
int64_t const n_group, int64_t const topk_group,
int64_t const topk, bool const renormalize,
double const routed_scaling_factor, int const scoring_func,
bool enable_pdl = false, cudaStream_t const stream = 0) {
int64_t num_cases = num_tokens * n_group;
int64_t topk_with_k2_num_blocks = (num_cases - 1) / NUM_WARPS_PER_BLOCK + 1;
cudaLaunchConfig_t config;
@@ -765,12 +766,12 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
};
switch (sf) {
case SCORING_NONE: {
auto* kernel_instance1 = &topk_with_k2_kernel<T, SCORING_NONE>;
auto* kernel_instance1 = &topk_with_k2_kernel<T, BiasT, SCORING_NONE>;
launch_topk_with_k2(kernel_instance1);
break;
}
case SCORING_SIGMOID: {
auto* kernel_instance1 = &topk_with_k2_kernel<T, SCORING_SIGMOID>;
auto* kernel_instance1 = &topk_with_k2_kernel<T, BiasT, SCORING_SIGMOID>;
launch_topk_with_k2(kernel_instance1);
break;
}
@@ -794,14 +795,14 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
config.attrs = attrs;
switch (sf) {
case SCORING_NONE: {
launch_group_idx_and_topk_kernel<T, IdxT, SCORING_NONE>(
launch_group_idx_and_topk_kernel<T, BiasT, IdxT, SCORING_NONE>(
config, scores, group_scores, topk_values, topk_indices, bias,
num_tokens, n_group, topk_group, topk, num_experts,
num_experts_per_group, renormalize, routed_scaling_factor);
break;
}
case SCORING_SIGMOID: {
launch_group_idx_and_topk_kernel<T, IdxT, SCORING_SIGMOID>(
launch_group_idx_and_topk_kernel<T, BiasT, IdxT, SCORING_SIGMOID>(
config, scores, group_scores, topk_values, topk_indices, bias,
num_tokens, n_group, topk_group, topk, num_experts,
num_experts_per_group, renormalize, routed_scaling_factor);
@@ -812,17 +813,23 @@ void invokeNoAuxTc(T* scores, T* group_scores, float* topk_values,
}
}
#define INSTANTIATE_NOAUX_TC(T, IdxT) \
template void invokeNoAuxTc<T, IdxT>( \
T * scores, T * group_scores, float* topk_values, IdxT* topk_indices, \
T const* bias, int64_t const num_tokens, int64_t const num_experts, \
int64_t const n_group, int64_t const topk_group, int64_t const topk, \
bool const renormalize, double const routed_scaling_factor, \
#define INSTANTIATE_NOAUX_TC(T, BiasT, IdxT) \
template void invokeNoAuxTc<T, BiasT, IdxT>( \
T * scores, T * group_scores, float* topk_values, IdxT* topk_indices, \
BiasT const* bias, int64_t const num_tokens, int64_t const num_experts, \
int64_t const n_group, int64_t const topk_group, int64_t const topk, \
bool const renormalize, double const routed_scaling_factor, \
int const scoring_func, bool enable_pdl, cudaStream_t const stream);
INSTANTIATE_NOAUX_TC(float, int32_t);
INSTANTIATE_NOAUX_TC(half, int32_t);
INSTANTIATE_NOAUX_TC(__nv_bfloat16, int32_t);
INSTANTIATE_NOAUX_TC(float, float, int32_t);
INSTANTIATE_NOAUX_TC(float, half, int32_t);
INSTANTIATE_NOAUX_TC(float, __nv_bfloat16, int32_t);
INSTANTIATE_NOAUX_TC(half, float, int32_t);
INSTANTIATE_NOAUX_TC(half, half, int32_t);
INSTANTIATE_NOAUX_TC(half, __nv_bfloat16, int32_t);
INSTANTIATE_NOAUX_TC(__nv_bfloat16, float, int32_t);
INSTANTIATE_NOAUX_TC(__nv_bfloat16, half, int32_t);
INSTANTIATE_NOAUX_TC(__nv_bfloat16, __nv_bfloat16, int32_t);
} // end namespace moe
} // namespace vllm
@@ -831,6 +838,7 @@ std::tuple<torch::Tensor, torch::Tensor> grouped_topk(
int64_t topk, bool renormalize, double routed_scaling_factor,
torch::Tensor const& bias, int64_t scoring_func = 0) {
auto data_type = scores.scalar_type();
auto bias_type = bias.scalar_type();
auto input_size = scores.sizes();
int64_t num_tokens = input_size[0];
int64_t num_experts = input_size[1];
@@ -854,39 +862,62 @@ std::tuple<torch::Tensor, torch::Tensor> grouped_topk(
auto stream = c10::cuda::getCurrentCUDAStream(scores.get_device());
#define LAUNCH_KERNEL(T, IdxT) \
do { \
switch (bias_type) { \
case torch::kFloat16: \
vllm::moe::invokeNoAuxTc<T, half, IdxT>( \
reinterpret_cast<T*>(scores.mutable_data_ptr()), \
reinterpret_cast<T*>(group_scores.mutable_data_ptr()), \
reinterpret_cast<float*>(topk_values.mutable_data_ptr()), \
reinterpret_cast<IdxT*>(topk_indices.mutable_data_ptr()), \
reinterpret_cast<half const*>(bias.data_ptr()), num_tokens, \
num_experts, n_group, topk_group, topk, renormalize, \
routed_scaling_factor, static_cast<int>(scoring_func), false, \
stream); \
break; \
case torch::kFloat32: \
vllm::moe::invokeNoAuxTc<T, float, IdxT>( \
reinterpret_cast<T*>(scores.mutable_data_ptr()), \
reinterpret_cast<T*>(group_scores.mutable_data_ptr()), \
reinterpret_cast<float*>(topk_values.mutable_data_ptr()), \
reinterpret_cast<IdxT*>(topk_indices.mutable_data_ptr()), \
reinterpret_cast<float const*>(bias.data_ptr()), num_tokens, \
num_experts, n_group, topk_group, topk, renormalize, \
routed_scaling_factor, static_cast<int>(scoring_func), false, \
stream); \
break; \
case torch::kBFloat16: \
vllm::moe::invokeNoAuxTc<T, __nv_bfloat16, IdxT>( \
reinterpret_cast<T*>(scores.mutable_data_ptr()), \
reinterpret_cast<T*>(group_scores.mutable_data_ptr()), \
reinterpret_cast<float*>(topk_values.mutable_data_ptr()), \
reinterpret_cast<IdxT*>(topk_indices.mutable_data_ptr()), \
reinterpret_cast<__nv_bfloat16 const*>(bias.data_ptr()), \
num_tokens, num_experts, n_group, topk_group, topk, renormalize, \
routed_scaling_factor, static_cast<int>(scoring_func), false, \
stream); \
break; \
default: \
throw std::invalid_argument( \
"Invalid bias dtype, only supports float16, float32, and " \
"bfloat16"); \
break; \
} \
} while (0)
switch (data_type) {
case torch::kFloat16:
// Handle Float16
vllm::moe::invokeNoAuxTc<half, int32_t>(
reinterpret_cast<half*>(scores.mutable_data_ptr()),
reinterpret_cast<half*>(group_scores.mutable_data_ptr()),
reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()),
reinterpret_cast<half const*>(bias.data_ptr()), num_tokens,
num_experts, n_group, topk_group, topk, renormalize,
routed_scaling_factor, static_cast<int>(scoring_func), false, stream);
LAUNCH_KERNEL(half, int32_t);
break;
case torch::kFloat32:
// Handle Float32
vllm::moe::invokeNoAuxTc<float, int32_t>(
reinterpret_cast<float*>(scores.mutable_data_ptr()),
reinterpret_cast<float*>(group_scores.mutable_data_ptr()),
reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()),
reinterpret_cast<float const*>(bias.data_ptr()), num_tokens,
num_experts, n_group, topk_group, topk, renormalize,
routed_scaling_factor, static_cast<int>(scoring_func), false, stream);
LAUNCH_KERNEL(float, int32_t);
break;
case torch::kBFloat16:
// Handle BFloat16
vllm::moe::invokeNoAuxTc<__nv_bfloat16, int32_t>(
reinterpret_cast<__nv_bfloat16*>(scores.mutable_data_ptr()),
reinterpret_cast<__nv_bfloat16*>(group_scores.mutable_data_ptr()),
reinterpret_cast<float*>(topk_values.mutable_data_ptr()),
reinterpret_cast<int32_t*>(topk_indices.mutable_data_ptr()),
reinterpret_cast<__nv_bfloat16 const*>(bias.data_ptr()), num_tokens,
num_experts, n_group, topk_group, topk, renormalize,
routed_scaling_factor, static_cast<int>(scoring_func), false, stream);
LAUNCH_KERNEL(__nv_bfloat16, int32_t);
break;
default:
// Handle other data types
@@ -894,5 +925,6 @@ std::tuple<torch::Tensor, torch::Tensor> grouped_topk(
"Invalid dtype, only supports float16, float32, and bfloat16");
break;
}
#undef LAUNCH_KERNEL
return {topk_values, topk_indices};
}
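
The change above decouples the bias dtype from the score dtype by adding a BiasT template parameter and casting each bias element to T before the addition. A small host-side sketch of that pattern, assuming only what the hunks show (the running largest and second-largest values); top2_with_bias_ref is a hypothetical name and the function is not part of the diff:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical CPU analogue of the per-group scan: scores are of type T,
// bias may be stored in a different type BiasT and is cast to T on use.
template <typename T, typename BiasT>
std::pair<T, T> top2_with_bias_ref(const std::vector<T>& scores,
                                   const std::vector<BiasT>& bias) {
  assert(scores.size() == bias.size());
  T largest = std::numeric_limits<T>::lowest();
  T second_largest = std::numeric_limits<T>::lowest();
  for (std::size_t i = 0; i < scores.size(); ++i) {
    // Mirrors `value = value + static_cast<T>(bias[i])` in the kernel.
    const T value = scores[i] + static_cast<T>(bias[i]);
    if (value > largest) {
      second_largest = largest;
      largest = value;
    } else if (value > second_largest) {
      second_largest = value;
    }
  }
  return {largest, second_largest};
}
```

Because T and BiasT are independent, every (score dtype, bias dtype) pair needs its own instantiation, which matches the nine INSTANTIATE_NOAUX_TC lines and the nested dtype switch in LAUNCH_KERNEL.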

@@ -7,20 +7,20 @@
#include "quantization/gptq_marlin/marlin_dtypes.cuh"
#include "core/scalar_type.hpp"
#define MARLIN_KERNEL_PARAMS \
const int4 *__restrict__ A, const int4 *__restrict__ B, \
int4 *__restrict__ C, int4 *__restrict__ C_tmp, \
const int4 *__restrict__ b_bias_ptr, \
const float *__restrict__ a_scales_ptr, \
const int4 *__restrict__ scales_ptr, \
const uint16_t *__restrict__ global_scale_ptr, \
const int4 *__restrict__ zp_ptr, const int *__restrict__ g_idx, \
const int32_t *__restrict__ sorted_token_ids_ptr, \
const int32_t *__restrict__ expert_ids_ptr, \
const int32_t *__restrict__ num_tokens_past_padded_ptr, \
const float *__restrict__ topk_weights_ptr, int top_k, \
bool mul_topk_weights, bool is_ep, int num_groups, int prob_m, \
int prob_n, int prob_k, int *locks, bool has_bias, bool use_atomic_add, \
#define MARLIN_KERNEL_PARAMS \
const int4 *__restrict__ A, const int4 *__restrict__ B, \
int4 *__restrict__ C, int4 *__restrict__ C_tmp, \
const int4 *__restrict__ b_bias_ptr, \
const float *__restrict__ a_scales_ptr, \
const int4 *__restrict__ scales_ptr, \
const uint16_t *__restrict__ global_scale_ptr, \
const int4 *__restrict__ zp_ptr, const int *__restrict__ g_idx, \
const int32_t *__restrict__ sorted_token_ids_ptr, \
const int32_t *__restrict__ expert_ids_ptr, \
const int32_t *__restrict__ num_tokens_past_padded_ptr, \
const float *__restrict__ topk_weights_ptr, int top_k, \
bool mul_topk_weights, int num_groups, int prob_m, int prob_n, \
int prob_k, int *locks, bool has_bias, bool use_atomic_add, \
bool use_fp32_reduce
namespace MARLIN_NAMESPACE_NAME {

@@ -71,7 +71,6 @@ __global__ void Marlin(
const float* __restrict__ topk_weights_ptr, // moe top weights
int top_k, // num of experts per token
bool mul_topk_weights, // mul topk weights or not
bool is_ep, // expert parallelism
int num_groups, // number of scale groups per output channel
int prob_m, // batch dimension m
int prob_n, // output dimension n
@@ -273,7 +272,6 @@ __global__ void Marlin(
const float* __restrict__ topk_weights_ptr, // moe top weights
int top_k, // num of experts per token
bool mul_topk_weights, // mul topk weights or not
bool is_ep, // expert parallelism
int num_groups, // number of scale groups per output channel
int prob_m, // batch dimension m
int prob_n, // output dimension n
@@ -376,14 +374,6 @@ __global__ void Marlin(
// parallel: num valid moe blocks
int parallel = num_tokens_past_padded / moe_block_size;
int num_valid_blocks = parallel;
if (is_ep) {
for (int i = 0; i < parallel; i++) {
if (expert_ids_ptr[i] == -1) num_valid_blocks--;
}
}
int num_invalid_blocks = parallel - num_valid_blocks;
parallel = num_valid_blocks;
int k_tiles = prob_k / 16 / thread_k_blocks;
int n_tiles = prob_n / 16 / thread_n_blocks;
@@ -538,22 +528,8 @@ __global__ void Marlin(
if (par_id >= parallel) return;
old_expert_id = expert_id;
if (num_invalid_blocks > 0) {
int skip_count = par_id;
for (int i = 0; i < num_tokens_past_padded / moe_block_size; i++) {
expert_id = expert_ids_ptr[i];
if (expert_id != -1) {
if (skip_count == 0) {
block_id = i;
break;
};
skip_count--;
};
}
} else {
block_id = par_id;
expert_id = expert_ids_ptr[block_id];
}
block_id = par_id;
expert_id = expert_ids_ptr[block_id];
if constexpr (b_type == vllm::kFE2M1f && s_type == vllm::kFE4M3fn) {
uint16_t val = global_scale_ptr[expert_id];

@@ -336,14 +336,14 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias,
void* perm, void* a_tmp, void* sorted_token_ids,
void* expert_ids, void* num_tokens_past_padded,
void* topk_weights, int moe_block_size, int num_experts,
int top_k, bool mul_topk_weights, bool is_ep, int prob_m,
int prob_n, int prob_k, void* workspace,
vllm::ScalarType const& a_type, vllm::ScalarType const& b_type,
vllm::ScalarType const& c_type, vllm::ScalarType const& s_type,
bool has_bias, bool has_act_order, bool is_k_full, bool has_zp,
int num_groups, int group_size, int dev, cudaStream_t stream,
int thread_k, int thread_n, int sms, int blocks_per_sm,
bool use_atomic_add, bool use_fp32_reduce, bool is_zp_float) {
int top_k, bool mul_topk_weights, int prob_m, int prob_n,
int prob_k, void* workspace, vllm::ScalarType const& a_type,
vllm::ScalarType const& b_type, vllm::ScalarType const& c_type,
vllm::ScalarType const& s_type, bool has_bias,
bool has_act_order, bool is_k_full, bool has_zp, int num_groups,
int group_size, int dev, cudaStream_t stream, int thread_k,
int thread_n, int sms, int blocks_per_sm, bool use_atomic_add,
bool use_fp32_reduce, bool is_zp_float) {
int thread_m_blocks = div_ceil(moe_block_size, 16);
bool m_block_size_8 = moe_block_size == 8;
bool is_a_8bit = a_type.size_bits() == 8;
@@ -523,7 +523,7 @@ void marlin_mm(const void* A, const void* B, void* C, void* C_tmp, void* b_bias,
kernel<<<blocks, num_threads, max_shared_mem, stream>>>(
A_ptr, B_ptr, C_ptr, C_tmp_ptr, bias_ptr, a_s_ptr, b_s_ptr, g_s_ptr, zp_ptr, g_idx_ptr,
sorted_token_ids_ptr, expert_ids_ptr, num_tokens_past_padded_ptr,
topk_weights_ptr, top_k, mul_topk_weights, is_ep, num_groups, prob_m,
topk_weights_ptr, top_k, mul_topk_weights, num_groups, prob_m,
prob_n, prob_k, locks, has_bias, use_atomic_add, use_fp32_reduce);
// clang-format on
}
@@ -541,7 +541,7 @@ torch::Tensor moe_wna16_marlin_gemm(
std::optional<torch::Tensor> const& perm_or_none, torch::Tensor& workspace,
torch::Tensor& sorted_token_ids, torch::Tensor& expert_ids,
torch::Tensor& num_tokens_past_padded, torch::Tensor& topk_weights,
int64_t moe_block_size, int64_t top_k, bool mul_topk_weights, bool is_ep,
int64_t moe_block_size, int64_t top_k, bool mul_topk_weights,
vllm::ScalarTypeId const& b_type_id, int64_t size_m, int64_t size_n,
int64_t size_k, bool is_k_full, bool use_atomic_add, bool use_fp32_reduce,
bool is_zp_float, int64_t thread_k, int64_t thread_n,
@@ -855,9 +855,9 @@ torch::Tensor moe_wna16_marlin_gemm(
perm.data_ptr(), a_tmp.data_ptr(), sorted_token_ids.data_ptr(),
expert_ids.data_ptr(), num_tokens_past_padded.data_ptr(),
topk_weights.data_ptr(), moe_block_size, num_experts, top_k,
mul_topk_weights, is_ep, size_m, size_n, size_k, workspace.data_ptr(),
a_type, b_type, c_type, s_type, has_bias, has_act_order, is_k_full,
has_zp, num_groups, group_size, dev, at::cuda::getCurrentCUDAStream(dev),
mul_topk_weights, size_m, size_n, size_k, workspace.data_ptr(), a_type,
b_type, c_type, s_type, has_bias, has_act_order, is_k_full, has_zp,
num_groups, group_size, dev, at::cuda::getCurrentCUDAStream(dev),
thread_k, thread_n, sms, blocks_per_sm, use_atomic_add, use_fp32_reduce,
is_zp_float);
@@ -866,4 +866,4 @@ torch::Tensor moe_wna16_marlin_gemm(
TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
m.impl("moe_wna16_marlin_gemm", &moe_wna16_marlin_gemm);
}
}

@@ -71,7 +71,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
"Tensor sorted_token_ids,"
"Tensor! expert_ids, Tensor! num_tokens_past_padded,"
"Tensor! topk_weights, int moe_block_size, int top_k, "
"bool mul_topk_weights, bool is_ep, int b_type_id,"
"bool mul_topk_weights, int b_type_id,"
"int size_m, int size_n, int size_k,"
"bool is_full_k, bool use_atomic_add,"
"bool use_fp32_reduce, bool is_zp_float,"

@@ -2,6 +2,7 @@
#include <optional>
#include <torch/library.h>
#include <tuple>
#include "core/scalar_type.hpp"
@@ -265,6 +266,11 @@ void get_cutlass_moe_mm_problem_sizes(
const int64_t k, const std::optional<torch::Tensor>& blockscale_offsets,
std::optional<bool> force_swap_ab = std::nullopt);
void get_cutlass_moe_mm_problem_sizes_from_expert_offsets(
const torch::Tensor& expert_first_token_offset,
torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
const int64_t n, const int64_t k, const bool swap_ab);
void get_cutlass_pplx_moe_mm_data(torch::Tensor& expert_offsets,
torch::Tensor& problem_sizes1,
torch::Tensor& problem_sizes2,
@@ -301,6 +307,12 @@ void scaled_fp4_experts_quant(
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts);
void silu_and_mul_scaled_fp4_experts_quant(
torch::Tensor& output, torch::Tensor& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts);
void per_token_group_quant_fp8(const torch::Tensor& input,
torch::Tensor& output_q, torch::Tensor& output_s,
int64_t group_size, double eps, double fp8_min,
@@ -335,8 +347,9 @@ torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
void gptq_shuffle(torch::Tensor q_weight, torch::Tensor q_perm, int64_t bit);
void static_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input,
torch::Tensor const& scale);
void static_scaled_fp8_quant(
torch::Tensor& out, torch::Tensor const& input, torch::Tensor const& scale,
std::optional<std::tuple<int64_t, int64_t>> group_shape = std::nullopt);
void dynamic_scaled_fp8_quant(torch::Tensor& out, torch::Tensor const& input,
torch::Tensor& scale);

@@ -31,37 +31,6 @@
namespace vllm {
// silu in float32
__device__ __forceinline__ float silu(float x) {
return __fdividef(x, (1.f + __expf(-x)));
}
__device__ __forceinline__ float2 silu2(float2 x) {
return make_float2(silu(x.x), silu(x.y));
}
template <class Type>
__inline__ __device__ PackedVec<Type> compute_silu_mul(PackedVec<Type>& vec,
PackedVec<Type>& vec2) {
PackedVec<Type> result;
using packed_type = typename TypeConverter<Type>::Type;
#pragma unroll
for (int i = 0; i < CVT_FP4_ELTS_PER_THREAD / 2; ++i) {
// silu_mul in float32
if constexpr (std::is_same_v<Type, half>) {
float2 silu_vec = silu2(__half22float2(vec.elts[i]));
result.elts[i] =
__float22half2_rn(__fmul2_rn(silu_vec, __half22float2(vec2.elts[i])));
} else {
float2 silu_vec = silu2(__bfloat1622float2(vec.elts[i]));
result.elts[i] = __float22bfloat162_rn(
__fmul2_rn(silu_vec, __bfloat1622float2(vec2.elts[i])));
}
}
return result;
}
// Use UE4M3 by default.
template <class Type, bool UE8M0_SF = false>
__global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
@@ -74,6 +43,9 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD,
"Vec size is not matched.");
// Precompute SF layout parameter (constant for entire kernel).
int32_t const numKTiles = (numCols + 63) / 64;
// Get the global scaling factor, which will be applied to the SF.
// Note SFScale is the same as next GEMM's alpha, which is
// (448.f / (Alpha_A / 6.f)).
@@ -101,7 +73,7 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
auto sf_out =
cvt_quant_to_fp4_get_sf_out_offset<uint32_t,
CVT_FP4_NUM_THREADS_PER_SF>(
rowIdx, colIdx, numCols, SFout);
rowIdx, colIdx, numKTiles, SFout);
out_pos = cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(out_silu_mul, SFScaleVal,
sf_out);
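
The hunk above hoists (numCols + 63) / 64 out of the per-element path and passes the result to the scale-factor offset helper instead of numCols. The expression is a ceiling division by 64; a minimal sketch, where kColsPerKTile and num_k_tiles are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constant: the tile width implied by the expression
// (numCols + 63) / 64 in the kernel.
constexpr int32_t kColsPerKTile = 64;

// Ceiling division: number of 64-column tiles needed to cover num_cols.
inline int32_t num_k_tiles(int32_t num_cols) {
  assert(num_cols >= 0);
  return (num_cols + kColsPerKTile - 1) / kColsPerKTile;
}
```

Since the value depends only on numCols, computing it once per kernel rather than once per offset lookup gives the same result.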

@@ -62,7 +62,9 @@ __global__ void __get_group_gemm_starts(
ElementSF* a_scales_base_as_int, ElementSF* b_scales_base_as_int,
ElementAccumulator* alphas_base_as_int, const int32_t* expert_offsets,
const int32_t* sf_offsets, const int32_t* problem_sizes_as_shapes,
const int K, const int N) {
int64_t* a_strides, int64_t* b_strides, int64_t* c_strides,
const int64_t a_stride_val, const int64_t b_stride_val,
const int64_t c_stride_val, const int K, const int N) {
int64_t expert_id = threadIdx.x;
if (expert_id >= gridDim.x * blockDim.x) {
return;
@@ -103,6 +105,11 @@ __global__ void __get_group_gemm_starts(
// Shape of alpha = [E]
alpha_offsets[expert_id] = alphas_base_as_int + expert_id;
// Initialize strides (constant across all experts, avoids separate kernels)
a_strides[expert_id] = a_stride_val;
b_strides[expert_id] = b_stride_val;
c_strides[expert_id] = c_stride_val;
LayoutSFA* layout_sfa_ptr = layout_sfa_base_as_int + expert_id;
LayoutSFB* layout_sfb_ptr = layout_sfb_base_as_int + expert_id;
@@ -135,7 +142,11 @@ __global__ void __get_group_gemm_starts(
static_cast<float*>(alphas.data_ptr()), \
static_cast<int32_t*>(expert_offsets.data_ptr()), \
static_cast<int32_t*>(sf_offsets.data_ptr()), \
static_cast<int32_t*>(problem_sizes.data_ptr()), K, N); \
static_cast<int32_t*>(problem_sizes.data_ptr()), \
static_cast<int64_t*>(a_strides.data_ptr()), \
static_cast<int64_t*>(b_strides.data_ptr()), \
static_cast<int64_t*>(c_strides.data_ptr()), a_stride_val, \
b_stride_val, c_stride_val, K, N); \
}
template <typename LayoutSFA, typename LayoutSFB, typename ScaleConfig>
@@ -144,6 +155,9 @@ void run_get_group_gemm_starts(
const torch::Tensor& out_starts, const torch::Tensor& a_scales_starts,
const torch::Tensor& b_scales_starts, const torch::Tensor& alpha_starts,
const torch::Tensor& layout_sfa, const torch::Tensor& layout_sfb,
const torch::Tensor& a_strides, const torch::Tensor& b_strides,
const torch::Tensor& c_strides, int64_t a_stride_val, int64_t b_stride_val,
int64_t c_stride_val,
/*these are used for their base addresses*/
torch::Tensor const& a_tensors, torch::Tensor const& b_tensors,
torch::Tensor const& out_tensors, torch::Tensor const& a_scales,
@@ -269,17 +283,16 @@ void run_fp4_blockwise_scaled_group_mm_sm100(
torch::Tensor alpha_ptrs = torch::empty(num_experts, options_int);
torch::Tensor layout_sfa = torch::empty({num_experts, 5}, options_int);
torch::Tensor layout_sfb = torch::empty({num_experts, 5}, options_int);
torch::Tensor c_strides1 =
torch::full({num_experts}, output.stride(0), options_int);
torch::Tensor a_strides1 =
torch::full({num_experts}, a.stride(0) * 2, options_int);
torch::Tensor b_strides1 =
torch::full({num_experts}, b.stride(1) * 2, options_int);
torch::Tensor a_strides1 = torch::empty(num_experts, options_int);
torch::Tensor b_strides1 = torch::empty(num_experts, options_int);
torch::Tensor c_strides1 = torch::empty(num_experts, options_int);
run_get_group_gemm_starts<LayoutSFA, LayoutSFB, ScaleConfig>(
a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, alpha_ptrs,
layout_sfa, layout_sfb, a, b, output, a_blockscale, b_blockscales, alphas,
expert_offsets, sf_offsets, problem_sizes, M, N, K);
layout_sfa, layout_sfb, a_strides1, b_strides1, c_strides1,
a.stride(0) * 2, b.stride(1) * 2, output.stride(0), a, b, output,
a_blockscale, b_blockscales, alphas, expert_offsets, sf_offsets,
problem_sizes, M, N, K);
// Create an instance of the GEMM
Gemm gemm_op;
@@ -444,17 +457,16 @@ void run_fp4_blockwise_scaled_group_mm_sm120(
torch::Tensor alpha_ptrs = torch::empty(num_experts, options_int);
torch::Tensor layout_sfa = torch::empty({num_experts, 5}, options_int);
torch::Tensor layout_sfb = torch::empty({num_experts, 5}, options_int);
torch::Tensor c_strides1 =
torch::full({num_experts}, output.stride(0), options_int);
torch::Tensor a_strides1 =
torch::full({num_experts}, a.stride(0) * 2, options_int);
torch::Tensor b_strides1 =
torch::full({num_experts}, b.stride(1) * 2, options_int);
torch::Tensor a_strides1 = torch::empty(num_experts, options_int);
torch::Tensor b_strides1 = torch::empty(num_experts, options_int);
torch::Tensor c_strides1 = torch::empty(num_experts, options_int);
run_get_group_gemm_starts<LayoutSFA, LayoutSFB, ScaleConfig>(
a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, alpha_ptrs,
layout_sfa, layout_sfb, a, b, output, a_blockscale, b_blockscales, alphas,
expert_offsets, sf_offsets, problem_sizes, M, N, K);
layout_sfa, layout_sfb, a_strides1, b_strides1, c_strides1,
a.stride(0) * 2, b.stride(1) * 2, output.stride(0), a, b, output,
a_blockscale, b_blockscales, alphas, expert_offsets, sf_offsets,
problem_sizes, M, N, K);
// Create an instance of the GEMM
Gemm gemm_op;

@@ -25,13 +25,18 @@
#include <cuda_fp8.h>
#include "dispatch_utils.h"
#include "cuda_utils.h"
#include "nvfp4_utils.cuh"
#include "launch_bounds_utils.h"
namespace vllm {
// NVFP4 quantization kernel for experts (low-latency path).
// When FUSE_SILU_MUL=true, expects input with gate||up layout and fuses
// SiLU(gate)*up before quantization.
// Use UE4M3 by default.
template <class Type, bool UE8M0_SF = false, bool SMALL_NUM_EXPERTS = false>
template <class Type, bool FUSE_SILU_MUL = false, bool UE8M0_SF = false,
bool SMALL_NUM_EXPERTS = false>
__global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
cvt_fp16_to_fp4(int32_t numRows, int32_t numCols, Type const* in,
float const* SFScale, uint32_t* out, uint32_t* SFout,
@@ -44,8 +49,13 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD,
"Vec size is not matched.");
// Precompute SF layout parameter (constant for entire kernel).
int32_t const numKTiles = (numCols + 63) / 64;
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int colsPerRow = numCols / CVT_FP4_ELTS_PER_THREAD;
// When fusing SiLU+Mul, input has gate || up layout (doubled width)
int inColsPerRow = FUSE_SILU_MUL ? colsPerRow * 2 : colsPerRow;
// Each global thread processes one element
for (int globalIdx = tid; globalIdx < numRows * colsPerRow;
@@ -54,13 +64,6 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
int rowIdx = globalIdx / colsPerRow;
int colIdx = globalIdx % colsPerRow;
int64_t inOffset = rowIdx * colsPerRow + colIdx;
PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
// Get the output tensor offset.
// Same as inOffset because 8 elements are packed into one uint32_t.
int64_t outOffset = inOffset;
auto& out_pos = out[outOffset];
// Find index within the experts using different strategies based on expert
// count
int rowIdx_in_expert = 0;
@@ -107,29 +110,46 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
}
}
// Load input and optionally apply fused SiLU+Mul
int64_t inOffset = rowIdx * inColsPerRow + colIdx;
PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
PackedVec quant_input;
if constexpr (FUSE_SILU_MUL) {
PackedVec in_vec_up =
reinterpret_cast<PackedVec const*>(in)[inOffset + colsPerRow];
quant_input = compute_silu_mul(in_vec, in_vec_up);
} else {
quant_input = in_vec;
}
// Get the output tensor offset. The output row width is never doubled, so
// this equals inOffset only when FUSE_SILU_MUL is false (8 elements are
// packed into one uint32_t).
int64_t outOffset = rowIdx * colsPerRow + colIdx;
auto& out_pos = out[outOffset];
// Get the global scaling factor, which will be applied to the SF.
// Note SFScale is the same as next GEMM's alpha, which is
// (448.f / (Alpha_A / 6.f)).
float const SFScaleVal = SFScale == nullptr ? 1.0f : SFScale[expert_idx];
int factor = CVT_FP4_SF_VEC_SIZE * 4;
// The actual output_scales dim is computed from the padded numCols.
int32_t numCols_padded = (numCols + factor - 1) / factor * factor;
int numCols_SFout = numCols_padded / CVT_FP4_SF_VEC_SIZE / 4;
uint32_t* SFout_in_expert =
SFout + output_scale_offset_by_experts[expert_idx] * numCols_SFout;
SFout + output_scale_offset_by_experts[expert_idx] * numKTiles;
auto sf_out =
cvt_quant_to_fp4_get_sf_out_offset<uint32_t,
CVT_FP4_NUM_THREADS_PER_SF>(
rowIdx_in_expert, colIdx, numCols, SFout_in_expert);
rowIdx_in_expert, colIdx, numKTiles, SFout_in_expert);
out_pos = cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(in_vec, SFScaleVal, sf_out);
out_pos =
cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(quant_input, SFScaleVal, sf_out);
}
}
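The offset arithmetic for the fused gate||up layout can be checked on the host. The sketch below is not part of the change; `fused_offsets`, `PackedOffsets`, and `kEltsPerThread` are hypothetical names, and the value 8 for `CVT_FP4_ELTS_PER_THREAD` is an assumption taken from the comment "8 elements are packed into one uint32_t". Offsets are counted in packed vectors, as in the kernel.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value of CVT_FP4_ELTS_PER_THREAD (not confirmed by this hunk).
constexpr int kEltsPerThread = 8;

// Offsets, in units of packed vectors, for one (rowIdx, colIdx) work item.
struct PackedOffsets {
  int64_t gate;  // where the gate (or plain input) vector is read
  int64_t up;    // where the up vector is read, -1 when not fusing
  int64_t out;   // where the quantized vector is written
};

// Hypothetical helper mirroring the index math in the kernel above.
// numCols is the output width k; the fused input row holds 2 * k elements.
inline PackedOffsets fused_offsets(int rowIdx, int colIdx, int numCols,
                                   bool fuse) {
  int colsPerRow = numCols / kEltsPerThread;
  int inColsPerRow = fuse ? colsPerRow * 2 : colsPerRow;
  // Widen before multiplying so large row counts cannot overflow int.
  int64_t gate = static_cast<int64_t>(rowIdx) * inColsPerRow + colIdx;
  int64_t up = fuse ? gate + colsPerRow : -1;
  int64_t out = static_cast<int64_t>(rowIdx) * colsPerRow + colIdx;
  return {gate, up, out};
}
```

With `numCols = 64` there are 8 packed vectors per output row and 16 per fused input row, so row 2, column 3 reads gate at 35 and up at 43, and writes at 19. One difference from the diff: the sketch widens to `int64_t` before the multiplication, whereas the kernel multiplies two `int` values first.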
// Kernel for LARGE_M_TOPK = true (large m_topk optimized version)
template <class Type, bool UE8M0_SF = false, bool SMALL_NUM_EXPERTS = false>
// NVFP4 quantization kernel for LARGE_M_TOPK = true (large m_topk optimized
// version). When FUSE_SILU_MUL=true, expects input with gate||up layout and
// fuses SiLU(gate)*up before quantization.
template <class Type, bool FUSE_SILU_MUL = false, bool UE8M0_SF = false,
bool SMALL_NUM_EXPERTS = false>
__global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
cvt_fp16_to_fp4(int32_t numRows, int32_t numCols, Type const* in,
float const* SFScale, uint32_t* out, uint32_t* SFout,
@@ -140,6 +160,10 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
(CVT_FP4_SF_VEC_SIZE / CVT_FP4_ELTS_PER_THREAD);
static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD,
"Vec size is not matched.");
// Precompute SF layout parameter (constant for entire kernel).
int32_t const numKTiles = (numCols + 63) / 64;
extern __shared__ uint32_t shared_input_offsets[];
// Load input offsets into shared memory.
@@ -163,6 +187,8 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int colsPerRow = numCols / CVT_FP4_ELTS_PER_THREAD;
// When fusing SiLU+Mul, input has gate || up layout (doubled width)
int inColsPerRow = FUSE_SILU_MUL ? colsPerRow * 2 : colsPerRow;
// Each global thread processes one element
for (int globalIdx = tid; globalIdx < numRows * colsPerRow;
@@ -171,11 +197,6 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
int rowIdx = globalIdx / colsPerRow;
int colIdx = globalIdx % colsPerRow;
int64_t inOffset = rowIdx * colsPerRow + colIdx;
PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
int64_t outOffset = inOffset;
auto& out_pos = out[outOffset];
// Find expert using binary search for better performance with large m_topk
int rowIdx_in_expert = 0;
int expert_idx = 0;
@@ -200,34 +221,43 @@ __global__ void __launch_bounds__(1024, VLLM_BLOCKS_PER_SM(1024))
}
}
// Load input and optionally apply fused SiLU+Mul
int64_t inOffset = rowIdx * inColsPerRow + colIdx;
PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
PackedVec quant_input;
if constexpr (FUSE_SILU_MUL) {
PackedVec in_vec_up =
reinterpret_cast<PackedVec const*>(in)[inOffset + colsPerRow];
quant_input = compute_silu_mul(in_vec, in_vec_up);
} else {
quant_input = in_vec;
}
int64_t outOffset = rowIdx * colsPerRow + colIdx;
auto& out_pos = out[outOffset];
float const SFScaleVal = SFScale == nullptr ? 1.0f : SFScale[expert_idx];
int factor = CVT_FP4_SF_VEC_SIZE * 4;
int32_t numCols_padded = (numCols + factor - 1) / factor * factor;
int numCols_SFout = numCols_padded / CVT_FP4_SF_VEC_SIZE / 4;
uint32_t* SFout_in_expert =
SFout + output_scale_offset_by_experts[expert_idx] * numCols_SFout;
SFout + output_scale_offset_by_experts[expert_idx] * numKTiles;
auto sf_out =
cvt_quant_to_fp4_get_sf_out_offset<uint32_t,
CVT_FP4_NUM_THREADS_PER_SF>(
rowIdx_in_expert, colIdx, numCols, SFout_in_expert);
rowIdx_in_expert, colIdx, numKTiles, SFout_in_expert);
out_pos = cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(in_vec, SFScaleVal, sf_out);
out_pos =
cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(quant_input, SFScaleVal, sf_out);
}
}
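Both expert kernels replace the per-iteration `numCols_SFout` computation with the precomputed `numKTiles`. A small host-side check, assuming `CVT_FP4_SF_VEC_SIZE` is 16 (the function names here are hypothetical), suggests the two expressions give the same scale-factor row width:

```cpp
#include <cassert>

// Assumed value of CVT_FP4_SF_VEC_SIZE (one scale factor per 16 elements).
constexpr int kSfVecSize = 16;

// Width computed the way the removed lines did: pad numCols to a multiple of
// 64, then divide by the SF vector size and by 4 SFs per packed uint32_t.
inline int sf_cols_old(int numCols) {
  int factor = kSfVecSize * 4;
  int numColsPadded = (numCols + factor - 1) / factor * factor;
  return numColsPadded / kSfVecSize / 4;
}

// Width computed the way the added lines do.
inline int num_k_tiles(int numCols) { return (numCols + 63) / 64; }

// Returns true when both formulas agree for every width in [0, maxCols].
inline bool sf_widths_agree(int maxCols) {
  for (int n = 0; n <= maxCols; ++n) {
    if (sf_cols_old(n) != num_k_tiles(n)) return false;
  }
  return true;
}
```

Under that assumption the change is a pure hoist of loop-invariant arithmetic: the value no longer depends on anything inside the grid-stride loop.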
template <typename T>
template <typename T, bool FUSE_SILU_MUL = false>
void quant_impl(void* output, void* output_scale, void* input,
void* input_global_scale, void* input_offset_by_experts,
void* output_scale_offset_by_experts, int m_topk, int k,
int n_experts, cudaStream_t stream) {
// TODO: this multiProcessorCount should be cached.
int device;
cudaGetDevice(&device);
int multiProcessorCount;
cudaDeviceGetAttribute(&multiProcessorCount, cudaDevAttrMultiProcessorCount,
device);
int multiProcessorCount =
get_device_attribute(cudaDevAttrMultiProcessorCount, -1);
// Grid, Block size.
// Each thread converts 8 values.
@@ -249,7 +279,7 @@ void quant_impl(void* output, void* output_scale, void* input,
if (blockRepeat > 1) {
size_t shared_mem_size = (n_experts + 1) * sizeof(uint32_t);
if (n_experts >= 4) {
cvt_fp16_to_fp4<T, false, false>
cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, false>
<<<grid, block, shared_mem_size, stream>>>(
m_topk, k, reinterpret_cast<T*>(input),
reinterpret_cast<float*>(input_global_scale),
@@ -259,34 +289,37 @@ void quant_impl(void* output, void* output_scale, void* input,
reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
n_experts);
} else {
cvt_fp16_to_fp4<T, false, true><<<grid, block, shared_mem_size, stream>>>(
m_topk, k, reinterpret_cast<T*>(input),
reinterpret_cast<float*>(input_global_scale),
reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(output_scale),
reinterpret_cast<uint32_t*>(input_offset_by_experts),
reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
n_experts);
cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, true>
<<<grid, block, shared_mem_size, stream>>>(
m_topk, k, reinterpret_cast<T*>(input),
reinterpret_cast<float*>(input_global_scale),
reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(output_scale),
reinterpret_cast<uint32_t*>(input_offset_by_experts),
reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
n_experts);
}
} else {
if (n_experts >= 16) {
cvt_fp16_to_fp4<T, false, false><<<grid, block, 0, stream>>>(
m_topk, k, reinterpret_cast<T*>(input),
reinterpret_cast<float*>(input_global_scale),
reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(output_scale),
reinterpret_cast<uint32_t*>(input_offset_by_experts),
reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
n_experts, /* bool low_latency */ true);
cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, false>
<<<grid, block, 0, stream>>>(
m_topk, k, reinterpret_cast<T*>(input),
reinterpret_cast<float*>(input_global_scale),
reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(output_scale),
reinterpret_cast<uint32_t*>(input_offset_by_experts),
reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
n_experts, /* bool low_latency */ true);
} else {
cvt_fp16_to_fp4<T, false, true><<<grid, block, 0, stream>>>(
m_topk, k, reinterpret_cast<T*>(input),
reinterpret_cast<float*>(input_global_scale),
reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(output_scale),
reinterpret_cast<uint32_t*>(input_offset_by_experts),
reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
n_experts, /* bool low_latency */ true);
cvt_fp16_to_fp4<T, FUSE_SILU_MUL, false, true>
<<<grid, block, 0, stream>>>(
m_topk, k, reinterpret_cast<T*>(input),
reinterpret_cast<float*>(input_global_scale),
reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(output_scale),
reinterpret_cast<uint32_t*>(input_offset_by_experts),
reinterpret_cast<uint32_t*>(output_scale_offset_by_experts),
n_experts, /* bool low_latency */ true);
}
}
}
@@ -307,19 +340,19 @@ constexpr auto FLOAT = at::ScalarType::Float;
constexpr auto INT = at::ScalarType::Int;
constexpr auto UINT8 = at::ScalarType::Byte;
void scaled_fp4_experts_quant_sm1xxa(
torch::Tensor& output, torch::Tensor& output_scale,
// Common validation for fp4 experts quantization entry points.
static void validate_fp4_experts_quant_inputs(
torch::Tensor const& output, torch::Tensor const& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts) {
CHECK_INPUT(output, "output must be a CUDA tensor");
CHECK_INPUT(output_scale, "output_scale must be a CUDA tensor");
CHECK_INPUT(input, "input must be a CUDA tensor");
CHECK_INPUT(input_global_scale, "input_global_scale must be a CUDA tensor");
CHECK_INPUT(input_offset_by_experts,
"input_offset_by_experts must be a CUDA tensor");
CHECK_INPUT(output_scale_offset_by_experts,
"output_scale_offset_by_experts must be a CUDA tensor");
torch::Tensor const& output_scale_offset_by_experts, int64_t m_topk,
int64_t k) {
CHECK_INPUT(output, "output");
CHECK_INPUT(output_scale, "output_scale");
CHECK_INPUT(input, "input");
CHECK_INPUT(input_global_scale, "input_global_scale");
CHECK_INPUT(input_offset_by_experts, "input_offset_by_experts");
CHECK_INPUT(output_scale_offset_by_experts, "output_scale_offset_by_experts");
TORCH_CHECK(output.dim() == 2);
TORCH_CHECK(output_scale.dim() == 2);
@@ -338,8 +371,6 @@ void scaled_fp4_experts_quant_sm1xxa(
TORCH_CHECK(output_scale.scalar_type() == INT);
const int BLOCK_SIZE = 16;
auto m_topk = input.size(0);
auto k = input.size(1);
TORCH_CHECK(k % BLOCK_SIZE == 0, "k must be a multiple of 16");
auto n_experts = input_global_scale.size(0);
TORCH_CHECK(input_offset_by_experts.size(0) == n_experts + 1);
@@ -351,7 +382,21 @@ void scaled_fp4_experts_quant_sm1xxa(
int padded_k = (scales_k + (4 - 1)) / 4 * 4;
// 4 means 4 fp8 values are packed into one int32
TORCH_CHECK(output_scale.size(1) * 4 == padded_k);
}
void scaled_fp4_experts_quant_sm1xxa(
torch::Tensor& output, torch::Tensor& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts) {
auto m_topk = input.size(0);
auto k = input.size(1);
validate_fp4_experts_quant_inputs(output, output_scale, input,
input_global_scale, input_offset_by_experts,
output_scale_offset_by_experts, m_topk, k);
auto n_experts = input_global_scale.size(0);
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
const cudaStream_t stream =
at::cuda::getCurrentCUDAStream(input.get_device());
@@ -359,7 +404,38 @@ void scaled_fp4_experts_quant_sm1xxa(
VLLM_DISPATCH_HALF_TYPES(
input.scalar_type(), "nvfp4_experts_quant_kernel", [&] {
using cuda_type = vllm::CUDATypeConverter<scalar_t>::Type;
vllm::quant_impl<cuda_type>(
vllm::quant_impl<cuda_type, /*FUSE_SILU_MUL=*/false>(
output.data_ptr(), output_scale.data_ptr(), input.data_ptr(),
input_global_scale.data_ptr(), input_offset_by_experts.data_ptr(),
output_scale_offset_by_experts.data_ptr(), m_topk, k, n_experts,
stream);
});
}
void silu_and_mul_scaled_fp4_experts_quant_sm1xxa(
torch::Tensor& output, torch::Tensor& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts) {
auto m_topk = input.size(0);
// Input has gate || up layout, so k = input.size(1) / 2
auto k_times_2 = input.size(1);
TORCH_CHECK(k_times_2 % 2 == 0, "input width must be even (gate || up)");
auto k = k_times_2 / 2;
validate_fp4_experts_quant_inputs(output, output_scale, input,
input_global_scale, input_offset_by_experts,
output_scale_offset_by_experts, m_topk, k);
auto n_experts = input_global_scale.size(0);
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
const cudaStream_t stream =
at::cuda::getCurrentCUDAStream(input.get_device());
VLLM_DISPATCH_HALF_TYPES(
input.scalar_type(), "silu_mul_nvfp4_experts_quant_kernel", [&] {
using cuda_type = vllm::CUDATypeConverter<scalar_t>::Type;
vllm::quant_impl<cuda_type, /*FUSE_SILU_MUL=*/true>(
output.data_ptr(), output_scale.data_ptr(), input.data_ptr(),
input_global_scale.data_ptr(), input_offset_by_experts.data_ptr(),
output_scale_offset_by_experts.data_ptr(), m_topk, k, n_experts,

@@ -41,6 +41,15 @@ void silu_and_mul_nvfp4_quant_sm1xxa(torch::Tensor& output,
torch::Tensor& input_sf);
#endif
#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \
(defined(ENABLE_NVFP4_SM120) && ENABLE_NVFP4_SM120)
void silu_and_mul_scaled_fp4_experts_quant_sm1xxa(
torch::Tensor& output, torch::Tensor& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts);
#endif
void scaled_fp4_quant(torch::Tensor& output, torch::Tensor const& input,
torch::Tensor& output_sf, torch::Tensor const& input_sf) {
#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \
@@ -74,3 +83,18 @@ void silu_and_mul_nvfp4_quant(torch::Tensor& output, torch::Tensor& output_sf,
TORCH_CHECK_NOT_IMPLEMENTED(
false, "No compiled silu_and_mul nvfp4 quantization kernel");
}
void silu_and_mul_scaled_fp4_experts_quant(
torch::Tensor& output, torch::Tensor& output_scale,
torch::Tensor const& input, torch::Tensor const& input_global_scale,
torch::Tensor const& input_offset_by_experts,
torch::Tensor const& output_scale_offset_by_experts) {
#if (defined(ENABLE_NVFP4_SM100) && ENABLE_NVFP4_SM100) || \
(defined(ENABLE_NVFP4_SM120) && ENABLE_NVFP4_SM120)
return silu_and_mul_scaled_fp4_experts_quant_sm1xxa(
output, output_scale, input, input_global_scale, input_offset_by_experts,
output_scale_offset_by_experts);
#endif
TORCH_CHECK_NOT_IMPLEMENTED(
false, "No compiled silu_and_mul nvfp4 experts quantization kernel");
}

@@ -35,7 +35,13 @@ template <typename Int>
__host__ __device__ inline Int round_up(Int x, Int y) {
static_assert(std::is_integral_v<Int>,
"round_up argument must be integral type");
return (x + y - 1) / y * y;
return ((x + y - 1) / y) * y;
}
// Compute effective rows for grid configuration with swizzled SF layouts.
inline int computeEffectiveRows(int m) {
constexpr int ROW_TILE = 128;
return round_up(m, ROW_TILE);
}
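As a quick illustration of what `computeEffectiveRows` yields, here is a host-only sketch of the same arithmetic. `compute_effective_rows` is a hypothetical stand-in that drops the `__host__ __device__` qualifiers so it compiles without CUDA.

```cpp
#include <cassert>
#include <type_traits>

// Host-side copy of the integer round-up helper shown in the hunk above.
template <typename Int>
constexpr Int round_up(Int x, Int y) {
  static_assert(std::is_integral_v<Int>,
                "round_up argument must be integral type");
  return ((x + y - 1) / y) * y;
}

// Rows rounded up to a whole number of 128-row scale-factor tiles.
constexpr int compute_effective_rows(int m) {
  constexpr int ROW_TILE = 128;
  return round_up(m, ROW_TILE);
}
```

For example, 1 row becomes 128, 128 stays 128, and 129 becomes 256, which matches the `sf_m` bound used by the row loop in the kernel below.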
// Use UE4M3 by default.
@@ -49,81 +55,57 @@ __global__ void __launch_bounds__(512, VLLM_BLOCKS_PER_SM(512))
static_assert(sizeof(PackedVec) == sizeof(Type) * CVT_FP4_ELTS_PER_THREAD,
"Vec size is not matched.");
// Precompute SF layout parameter (constant for entire kernel).
int32_t const numKTiles = (numCols + 63) / 64;
int sf_m = round_up<int>(numRows, 128);
int sf_n_unpadded = numCols / CVT_FP4_SF_VEC_SIZE;
int sf_n_int = round_up<int>(sf_n_unpadded, 4) / 4;
for (int row = numRows + blockIdx.x; row < sf_m; row += gridDim.x) {
// Each thread writes 4 uint32_t elements.
for (int col = sf_n_unpadded + threadIdx.x * 4; col < sf_n_int;
col += blockDim.x * 4) {
SFout[row * sf_n_int + col] = 0x00;
}
}
int num_padded_cols = sf_n_int * 4 * CVT_FP4_SF_VEC_SIZE;
// Get the global scaling factor, which will be applied to the SF.
// Note SFScale is the same as next GEMM's alpha, which is
// (448.f / (Alpha_A / 6.f)).
float const global_scale = SFScale == nullptr ? 1.0f : SFScale[0];
// Input tensor row/col loops.
for (int rowIdx = blockIdx.x; rowIdx < numRows; rowIdx += gridDim.x) {
for (int colIdx = threadIdx.x; colIdx < numCols / CVT_FP4_ELTS_PER_THREAD;
// Iterate over all rows and columns, including padded ones, so that every
// scale factor address is visited and initialized.
for (int rowIdx = blockIdx.x; rowIdx < sf_m; rowIdx += gridDim.x) {
for (int colIdx = threadIdx.x;
colIdx < num_padded_cols / CVT_FP4_ELTS_PER_THREAD;
colIdx += blockDim.x) {
int elem_idx = colIdx * CVT_FP4_ELTS_PER_THREAD;
PackedVec in_vec;
int64_t inOffset = rowIdx * (numCols / CVT_FP4_ELTS_PER_THREAD) + colIdx;
PackedVec in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
// Get the output tensor offset.
// Same as inOffset because 8 elements are packed into one uint32_t.
int64_t outOffset = inOffset;
auto& out_pos = out[outOffset];
// If we are outside valid rows OR outside valid columns -> Use Zeros
if (rowIdx >= numRows || elem_idx >= numCols) {
memset(&in_vec, 0, sizeof(PackedVec));
} else {
// Valid Region: Load actual data
in_vec = reinterpret_cast<PackedVec const*>(in)[inOffset];
}
auto sf_out =
cvt_quant_to_fp4_get_sf_out_offset<uint32_t,
CVT_FP4_NUM_THREADS_PER_SF>(
rowIdx, colIdx, numCols, SFout);
rowIdx, colIdx, numKTiles, SFout);
out_pos =
auto out_val =
cvt_warp_fp16_to_fp4<Type, UE8M0_SF>(in_vec, global_scale, sf_out);
// We do NOT write output for padding because the 'out' tensor is not
// padded.
if (rowIdx < numRows && elem_idx < numCols) {
// The output offset equals inOffset because 8 elements are packed into
// one uint32_t.
out[inOffset] = out_val;
}
}
}
}
template <typename T>
void invokeFP4Quantization(int m, int n, T const* input, float const* SFScale,
int64_t* output, int32_t* SFOuput, bool useUE8M0,
int multiProcessorCount, cudaStream_t stream) {
// Grid, Block size.
// Each thread converts 8 values.
dim3 block(std::min(int(n / ELTS_PER_THREAD), 512));
// Get number of blocks per SM
int const numBlocksPerSM =
vllm_runtime_blocks_per_sm(static_cast<int>(block.x));
dim3 grid(std::min(int(m), multiProcessorCount * numBlocksPerSM));
// Launch the cvt kernel.
if (useUE8M0) {
cvt_fp16_to_fp4<T, true><<<grid, block, 0, stream>>>(
m, n, input, SFScale, reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(SFOuput));
} else {
cvt_fp16_to_fp4<T, false><<<grid, block, 0, stream>>>(
m, n, input, SFScale, reinterpret_cast<uint32_t*>(output),
reinterpret_cast<uint32_t*>(SFOuput));
}
}
// Instantiate the function.
template void invokeFP4Quantization(int m, int n, half const* input,
float const* SFScale, int64_t* output,
int32_t* SFOuput, bool useUE8M0,
int multiProcessorCount,
cudaStream_t stream);
template void invokeFP4Quantization(int m, int n, __nv_bfloat16 const* input,
float const* SFScale, int64_t* output,
int32_t* SFOuput, bool useUE8M0,
int multiProcessorCount,
cudaStream_t stream);
} // namespace vllm
void scaled_fp4_quant_sm1xxa(torch::Tensor const& output,
@@ -147,13 +129,19 @@ void scaled_fp4_quant_sm1xxa(torch::Tensor const& output,
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
auto stream = at::cuda::getCurrentCUDAStream(input.get_device());
// We don't support e8m0 scales at this moment.
bool useUE8M0 = false;
// Grid, Block size. Each thread converts 8 values.
dim3 block(std::min(int(n / ELTS_PER_THREAD), 512));
int const numBlocksPerSM =
vllm_runtime_blocks_per_sm(static_cast<int>(block.x));
int effectiveRows = vllm::computeEffectiveRows(m);
dim3 grid(std::min(effectiveRows, multiProcessorCount * numBlocksPerSM));
VLLM_DISPATCH_HALF_TYPES(input.scalar_type(), "nvfp4_quant_kernel", [&] {
using cuda_type = vllm::CUDATypeConverter<scalar_t>::Type;
auto input_ptr = static_cast<cuda_type const*>(input.data_ptr());
vllm::invokeFP4Quantization(m, n, input_ptr, input_sf_ptr, output_ptr,
sf_out, useUE8M0, multiProcessorCount, stream);
// NOTE: We don't support e8m0 scales at this moment.
vllm::cvt_fp16_to_fp4<cuda_type, false><<<grid, block, 0, stream>>>(
m, n, input_ptr, input_sf_ptr, reinterpret_cast<uint32_t*>(output_ptr),
reinterpret_cast<uint32_t*>(sf_out));
});
}
}

@@ -128,51 +128,42 @@ inline __device__ float reciprocal_approximate_ftz(float a) {
return b;
}
// Compute SF output offset for swizzled tensor core layout.
// SF layout: [numMTiles, numKTiles, 32, 4, 4]
// Caller must precompute: numKTiles = (numCols + 63) / 64
template <class SFType, int CVT_FP4_NUM_THREADS_PER_SF>
__device__ uint8_t* cvt_quant_to_fp4_get_sf_out_offset(int rowIdx, int colIdx,
int numCols,
SFType* SFout) {
__device__ __forceinline__ uint8_t* cvt_quant_to_fp4_get_sf_out_offset(
int rowIdx, int colIdx, int32_t numKTiles, SFType* SFout) {
static_assert(CVT_FP4_NUM_THREADS_PER_SF == 1 ||
CVT_FP4_NUM_THREADS_PER_SF == 2);
// One pair of threads writes one SF to global memory.
// TODO: stage through smem for packed STG.32;
// is it better than STG.8 from 4 threads?
if (threadIdx.x % CVT_FP4_NUM_THREADS_PER_SF == 0) {
// SF vector index (16 elements share one SF in the K dimension).
int32_t kIdx = colIdx / CVT_FP4_NUM_THREADS_PER_SF;
int32_t mIdx = rowIdx;
// SF layout [numMTiles, numKTiles, 32 (mTile), 4 (mTile), 4(kTile)]
// --> index [mTileIdx, kTileIdx, outerMIdx, innerMIdx, innerKIdx]
int32_t mTileIdx = mIdx / (32 * 4);
// SF vector size 16.
int factor = CVT_FP4_SF_VEC_SIZE * 4;
int32_t numKTiles = (numCols + factor - 1) / factor;
int64_t mTileStride = numKTiles * 32 * 4 * 4;
int32_t kTileIdx = (kIdx / 4);
int64_t kTileStride = 32 * 4 * 4;
// M tile layout [32, 4] is column-major.
int32_t outerMIdx = (mIdx % 32);
int64_t outerMStride = 4 * 4;
int32_t innerMIdx = (mIdx % (32 * 4)) / 32;
int64_t innerMStride = 4;
int32_t innerKIdx = (kIdx % 4);
int64_t innerKStride = 1;
// Compute the global offset.
int64_t SFOffset = mTileIdx * mTileStride + kTileIdx * kTileStride +
outerMIdx * outerMStride + innerMIdx * innerMStride +
innerKIdx * innerKStride;
return reinterpret_cast<uint8_t*>(SFout) + SFOffset;
if (threadIdx.x % CVT_FP4_NUM_THREADS_PER_SF != 0) {
return nullptr;
}
return nullptr;
// SF vector index (16 elements share one SF in the K dimension).
int32_t kIdx = colIdx / CVT_FP4_NUM_THREADS_PER_SF;
int32_t mIdx = rowIdx;
// Decompose indices using bitwise ops (all divisors are powers of 2).
// SF layout [numMTiles, numKTiles, 32 (mTile), 4 (mTile), 4(kTile)]
int32_t mTileIdx = mIdx >> 7; // mIdx / 128
int32_t outerMIdx = mIdx & 31; // mIdx % 32
int32_t innerMIdx = (mIdx >> 5) & 3; // (mIdx / 32) % 4
int32_t kTileIdx = kIdx >> 2; // kIdx / 4
int32_t innerKIdx = kIdx & 3; // kIdx % 4
// Compute global SF offset: mTileIdx * (numKTiles * 512) + kTileIdx * 512 +
// outerMIdx * 16 + innerMIdx * 4 + innerKIdx
// Use bitwise OR for non-overlapping lower bits.
int64_t SFOffset = (static_cast<int64_t>(mTileIdx) * numKTiles + kTileIdx)
<< 9 |
(outerMIdx << 4) | (innerMIdx << 2) | innerKIdx;
return reinterpret_cast<uint8_t*>(SFout) + SFOffset;
}
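The rewritten offset computation should return the same value as the stride-based version it replaces. The host-side sketch below compares the two; the function names are hypothetical, and indices are assumed non-negative so that shifts and masks match division and modulo.

```cpp
#include <cassert>
#include <cstdint>

// Stride arithmetic as in the removed lines: layout
// [numMTiles, numKTiles, 32 (mTile), 4 (mTile), 4 (kTile)].
inline int64_t sf_offset_divmod(int32_t mIdx, int32_t kIdx,
                                int32_t numKTiles) {
  int32_t mTileIdx = mIdx / (32 * 4);
  int64_t mTileStride = static_cast<int64_t>(numKTiles) * 32 * 4 * 4;
  int32_t kTileIdx = kIdx / 4;
  int64_t kTileStride = 32 * 4 * 4;
  int32_t outerMIdx = mIdx % 32;
  int64_t outerMStride = 4 * 4;
  int32_t innerMIdx = (mIdx % (32 * 4)) / 32;
  int64_t innerMStride = 4;
  int32_t innerKIdx = kIdx % 4;
  return mTileIdx * mTileStride + kTileIdx * kTileStride +
         outerMIdx * outerMStride + innerMIdx * innerMStride + innerKIdx;
}

// Shift-and-mask arithmetic as in the added lines.
inline int64_t sf_offset_bitwise(int32_t mIdx, int32_t kIdx,
                                 int32_t numKTiles) {
  int32_t mTileIdx = mIdx >> 7;         // mIdx / 128
  int32_t outerMIdx = mIdx & 31;        // mIdx % 32
  int32_t innerMIdx = (mIdx >> 5) & 3;  // (mIdx / 32) % 4
  int32_t kTileIdx = kIdx >> 2;         // kIdx / 4
  int32_t innerKIdx = kIdx & 3;         // kIdx % 4
  int64_t tile = static_cast<int64_t>(mTileIdx) * numKTiles + kTileIdx;
  // The low 9 bits of (tile << 9) are zero, so OR acts as addition here.
  return (tile << 9) | (outerMIdx << 4) | (innerMIdx << 2) | innerKIdx;
}

// Returns true when both versions agree on every index in the given range.
inline bool sf_offsets_agree(int32_t numRows, int32_t numKTiles) {
  for (int32_t m = 0; m < numRows; ++m) {
    for (int32_t k = 0; k < numKTiles * 4; ++k) {
      if (sf_offset_divmod(m, k, numKTiles) !=
          sf_offset_bitwise(m, k, numKTiles)) {
        return false;
      }
    }
  }
  return true;
}
```

The OR is safe because the three low fields occupy disjoint bit ranges (bits 4 to 8, 2 to 3, and 0 to 1) beneath the 512-byte tile stride.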
// Quantizes the provided PackedVec into the uint32_t output
@@ -248,4 +239,34 @@ __device__ uint32_t cvt_warp_fp16_to_fp4(PackedVec<Type>& vec, float SFScaleVal,
return e2m1Vec;
}
// silu in float32
__device__ __forceinline__ float silu(float x) {
return __fdividef(x, (1.f + __expf(-x)));
}
__device__ __forceinline__ float2 silu2(float2 x) {
return make_float2(silu(x.x), silu(x.y));
}
template <class Type>
__inline__ __device__ PackedVec<Type> compute_silu_mul(
const PackedVec<Type>& x_vec, const PackedVec<Type>& y_vec) {
PackedVec<Type> result;
#pragma unroll
for (int i = 0; i < CVT_FP4_ELTS_PER_THREAD / 2; ++i) {
// silu_mul in float32
if constexpr (std::is_same_v<Type, half>) {
float2 silu_vec = silu2(__half22float2(x_vec.elts[i]));
result.elts[i] = __float22half2_rn(
__fmul2_rn(silu_vec, __half22float2(y_vec.elts[i])));
} else {
float2 silu_vec = silu2(__bfloat1622float2(x_vec.elts[i]));
result.elts[i] = __float22bfloat162_rn(
__fmul2_rn(silu_vec, __bfloat1622float2(y_vec.elts[i])));
}
}
return result;
}
} // namespace vllm
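For testing the fused path, a scalar float32 reference of SiLU(gate) * up over a gate||up row layout may be useful. This is only a sketch under assumptions: `silu_ref` and `silu_mul_ref` are hypothetical names, and the reference uses `std::exp` with exact division, so it will differ slightly from the device code, which uses the `__expf` and `__fdividef` fast intrinsics and rounds the product to half or bfloat16.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU in float32: x * sigmoid(x) = x / (1 + exp(-x)).
inline float silu_ref(float x) { return x / (1.0f + std::exp(-x)); }

// Reference for the fused activation. `in` is row-major with shape
// [rows, 2 * k]: the first k columns of each row are gate, the last k are up.
// Returns a row-major [rows, k] buffer holding silu(gate) * up.
inline std::vector<float> silu_mul_ref(const std::vector<float>& in,
                                       std::size_t rows, std::size_t k) {
  std::vector<float> out(rows * k);
  for (std::size_t r = 0; r < rows; ++r) {
    const float* gate = in.data() + r * 2 * k;
    const float* up = gate + k;
    for (std::size_t c = 0; c < k; ++c) {
      out[r * k + c] = silu_ref(gate[c]) * up[c];
    }
  }
  return out;
}
```

A comparison against kernel output would need a tolerance that accounts for the reduced-precision rounding before quantization.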

@@ -233,11 +233,6 @@ __global__ void gemm_half_q_half_gptq_4bit_kernel(
// Zero output
if (n >= size_n) return;
if (blockIdx.z == 0) {
for (int m = 0; m < m_count; m++)
*((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0;
}
__syncthreads();
// Find initial group
@@ -372,11 +367,6 @@ __global__ void gemm_half_q_half_gptq_2bit_kernel(
// Zero output
if (n >= size_n) return;
if (blockIdx.z == 0) {
for (int m = 0; m < m_count; m++)
*((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0;
}
__syncthreads();
// Find initial group
@@ -494,11 +484,6 @@ __global__ void gemm_half_q_half_gptq_3bit_kernel(
// Zero output
if (n >= size_n) return;
if (blockIdx.z == 0) {
for (int m = 0; m < m_count; m++)
*((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0;
}
__syncthreads();
// Find initial group
@@ -623,11 +608,6 @@ __global__ void gemm_half_q_half_gptq_8bit_kernel(
// Zero output
if (n >= size_n) return;
if (blockIdx.z == 0) {
for (int m = 0; m < m_count; m++)
*((uint64_t*)c_.item_ptr(offset_m + m, n)) = 0;
}
__syncthreads();
// Find initial group
@@ -1224,9 +1204,6 @@ __global__ void gemm_half_q_half_alt_4bit_kernel(
__halves2half2(__int2half_rn(val & 0xF), __int2half_rn(val >> 4));
}
if (blockIdx.z == 0) {
for (int m = 0; m < b_end; m++) mul[(b + m) * width + w] = __int2half_rn(0);
}
__syncthreads();
int i = width * h + w;
@@ -1319,9 +1296,6 @@ __global__ void gemm_half_q_half_alt_8bit_kernel(
}
}
if (blockIdx.z == 0) {
for (int m = 0; m < b_end; m++) mul[(b + m) * width + w] = __int2half_rn(0);
}
__syncthreads();
int i = width * h + w;
@@ -1857,7 +1831,7 @@ torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
bool use_exllama, bool use_v2_format, int64_t bit) {
const at::cuda::OptionalCUDAGuard device_guard(device_of(a));
auto options = torch::TensorOptions().dtype(a.dtype()).device(a.device());
at::Tensor c = torch::empty({a.size(0), b_q_weight.size(1)}, options);
at::Tensor c = torch::zeros({a.size(0), b_q_weight.size(1)}, options);
at::Tensor temp_dq = torch::empty(
{b_q_weight.size(0) * 32 / bit, b_q_weight.size(1)}, options);

@@ -1,373 +0,0 @@
#include "core/registration.h"
#include <torch/all.h>
#include <cutlass/arch/arch.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>
#include "cute/tensor.hpp"
#include "cutlass/tensor_ref.h"
#include "cutlass/epilogue/collective/default_epilogue.hpp"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/dispatch_policy.hpp"
#include "cutlass/gemm/group_array_problem_shape.hpp"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/util/command_line.h"
#include "cutlass/util/distribution.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/packed_stride.hpp"
#include "cutlass/util/tensor_view_io.h"
#include "cutlass/util/reference/device/gemm.h"
#include "cutlass/util/reference/device/tensor_compare.h"
#include "cutlass/util/reference/host/tensor_fill.h"
#include "cutlass/util/reference/host/gett.hpp"
#include "cutlass/util/reference/host/tensor_norm.h"
#include "cutlass/util/reference/host/tensor_compare.h"
#include <cassert>
using namespace cute;
template <typename ElementAB, typename ElementC, typename ElementAccumulator,
typename LayoutSFA, typename LayoutSFB, typename ScaleConfig>
__global__ void get_ggemm_starts(
int32_t* expert_offsets, ElementAB** a_offsets, ElementAB** b_offsets,
ElementC** out_offsets, ElementAccumulator** a_scale_offsets,
ElementAccumulator** b_scale_offsets, ElementAB* a_base_as_int,
ElementAB* b_base_as_int, ElementC* out_base_as_int,
ElementAccumulator* a_scale_base_as_int,
ElementAccumulator* b_scale_base_as_int, LayoutSFA* layout_sfa_base_as_int,
LayoutSFB* layout_sfb_base_as_int, int* problem_sizes) {
int expert_id = threadIdx.x;
if (expert_id >= gridDim.x * blockDim.x) {
return;
}
int m = problem_sizes[expert_id * 3];
int n = problem_sizes[expert_id * 3 + 1];
int k = problem_sizes[expert_id * 3 + 2];
int32_t expert_offset = expert_offsets[expert_id];
int a_stride = expert_offset * k;
int b_stride = expert_id * k * n;
int a_scale_stride = expert_offset * k / 128;
int b_scale_stride = expert_id * k * n / 128 / 128;
a_offsets[expert_id] = a_base_as_int + a_stride;
b_offsets[expert_id] = b_base_as_int + b_stride;
out_offsets[expert_id] = out_base_as_int + expert_offset * n;
a_scale_offsets[expert_id] = a_scale_base_as_int + a_scale_stride;
b_scale_offsets[expert_id] = b_scale_base_as_int + b_scale_stride;
LayoutSFA* layout_sfa_ptr = layout_sfa_base_as_int + expert_id;
LayoutSFB* layout_sfb_ptr = layout_sfb_base_as_int + expert_id;
*layout_sfa_ptr =
ScaleConfig::tile_atom_to_shape_SFA(cute::make_shape(m, n, k, 1));
*layout_sfb_ptr =
ScaleConfig::tile_atom_to_shape_SFB(cute::make_shape(m, n, k, 1));
}
#define __CALL_GET_STARTS_KERNEL(TENSOR_C_TYPE, C_TYPE, LayoutSFA, LayoutSFB, \
ScaleConfig) \
else if (out_tensors.dtype() == TENSOR_C_TYPE) { \
get_ggemm_starts<cutlass::float_e4m3_t, C_TYPE, float, LayoutSFA, \
LayoutSFB, ScaleConfig><<<1, num_experts, 0, stream>>>( \
static_cast<int32_t*>(expert_offsets.data_ptr()), \
static_cast<cutlass::float_e4m3_t**>(a_ptrs.data_ptr()), \
static_cast<cutlass::float_e4m3_t**>(b_ptrs.data_ptr()), \
static_cast<C_TYPE**>(out_ptrs.data_ptr()), \
static_cast<float**>(a_scales_ptrs.data_ptr()), \
static_cast<float**>(b_scales_ptrs.data_ptr()), \
static_cast<cutlass::float_e4m3_t*>(a_tensors.data_ptr()), \
static_cast<cutlass::float_e4m3_t*>(b_tensors.data_ptr()), \
static_cast<C_TYPE*>(out_tensors.data_ptr()), \
static_cast<float*>(a_scales.data_ptr()), \
static_cast<float*>(b_scales.data_ptr()), \
reinterpret_cast<LayoutSFA*>(layout_sfa.data_ptr()), \
reinterpret_cast<LayoutSFB*>(layout_sfb.data_ptr()), \
static_cast<int*>(problem_sizes.data_ptr())); \
}
template <typename LayoutSFA, typename LayoutSFB, typename ScaleConfig>
void run_get_ggemm_starts(
torch::Tensor const& expert_offsets, torch::Tensor& a_ptrs,
torch::Tensor& b_ptrs, torch::Tensor& out_ptrs,
torch::Tensor& a_scales_ptrs, torch::Tensor& b_scales_ptrs,
torch::Tensor const& a_tensors, torch::Tensor const& b_tensors,
torch::Tensor out_tensors, torch::Tensor const& a_scales,
torch::Tensor const& b_scales, torch::Tensor const& layout_sfa,
torch::Tensor const& layout_sfb, torch::Tensor const& problem_sizes) {
TORCH_CHECK(a_tensors.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(b_tensors.dtype() == torch::kFloat8_e4m3fn);
TORCH_CHECK(a_scales.dtype() == torch::kFloat32);
TORCH_CHECK(b_scales.dtype() == torch::kFloat32);
TORCH_CHECK(out_tensors.size(1) % 128 == 0 or out_tensors.size(0) % 128 == 0);
TORCH_CHECK(a_tensors.size(1) % 128 == 0 or a_tensors.size(0) % 128 == 0);
int num_experts = (int)expert_offsets.size(0);
auto stream = at::cuda::getCurrentCUDAStream(a_tensors.device().index());
if (false) {
}
__CALL_GET_STARTS_KERNEL(torch::kBFloat16, cutlass::bfloat16_t, LayoutSFA,
LayoutSFB, ScaleConfig)
__CALL_GET_STARTS_KERNEL(torch::kFloat16, cutlass::half_t, LayoutSFA,
LayoutSFB, ScaleConfig)
else {
TORCH_CHECK(false, "Unsupported output tensor type");
}
}
template <typename OutType, typename ScheduleConfig, typename LayoutD>
void run_blockwise_scaled_group_mm(
torch::Tensor& out_ptrs, const torch::Tensor& a_ptrs,
const torch::Tensor& b_ptrs, const torch::Tensor& a_scales_ptrs,
const torch::Tensor& b_scales_ptrs, const torch::Tensor& stride_a,
const torch::Tensor& stride_b, const torch::Tensor& stride_c,
const torch::Tensor& layout_sfa, const torch::Tensor& layout_sfb,
const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) {
using ProblemShape = cutlass::gemm::GroupProblemShape<Shape<int, int, int>>;
// Types
using ElementA = cutlass::float_e4m3_t;
using ElementB = cutlass::float_e4m3_t;
using ElementC = OutType;
using ElementD = ElementC;
using ElementAccumulator = float;
using LayoutA = cutlass::layout::RowMajor;
using LayoutB = cutlass::layout::ColumnMajor;
using LayoutC = LayoutD;
// Alignments
static constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value;
static constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value;
static constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value;
using ArchTag = cutlass::arch::Sm100;
using OperatorClass = cutlass::arch::OpClassTensorOp;
using CollectiveEpilogue =
typename cutlass::epilogue::collective::CollectiveBuilder<
ArchTag, OperatorClass, typename ScheduleConfig::MmaTileShape,
typename ScheduleConfig::ClusterShape,
cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator,
ElementAccumulator, void, LayoutC*, AlignmentC, ElementD, LayoutC*,
AlignmentC, typename ScheduleConfig::EpilogueSchedule>::CollectiveOp;
using CollectiveMainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag, OperatorClass, ElementA,
cute::tuple<LayoutA*, typename ScheduleConfig::LayoutSFA*>,
AlignmentA, ElementB,
cute::tuple<LayoutB*, typename ScheduleConfig::LayoutSFB*>,
AlignmentB, ElementAccumulator, typename ScheduleConfig::MmaTileShape,
typename ScheduleConfig::ClusterShape,
cutlass::gemm::collective::StageCountAutoCarveout<static_cast<int>(
sizeof(typename CollectiveEpilogue::SharedStorage))>,
typename ScheduleConfig::KernelSchedule>::CollectiveOp;
using GemmKernel =
cutlass::gemm::kernel::GemmUniversal<ProblemShape, CollectiveMainloop,
CollectiveEpilogue, void>;
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
using StrideA = typename Gemm::GemmKernel::InternalStrideA;
using StrideB = typename Gemm::GemmKernel::InternalStrideB;
using StrideC = typename Gemm::GemmKernel::InternalStrideC;
using StrideD = typename Gemm::GemmKernel::InternalStrideD;
using UnderlyingProblemShape = ProblemShape::UnderlyingProblemShape;
int num_experts = (int)expert_offsets.size(0);
Gemm gemm_op;
// Mainloop Arguments
typename GemmKernel::MainloopArguments mainloop_args{
static_cast<const ElementA**>(a_ptrs.data_ptr()),
static_cast<StrideA*>(stride_a.data_ptr()),
static_cast<const ElementB**>(b_ptrs.data_ptr()),
static_cast<StrideB*>(stride_b.data_ptr()),
static_cast<const ElementAccumulator**>(a_scales_ptrs.data_ptr()),
reinterpret_cast<typename ScheduleConfig::LayoutSFA*>(
layout_sfa.data_ptr()),
static_cast<const ElementAccumulator**>(b_scales_ptrs.data_ptr()),
reinterpret_cast<typename ScheduleConfig::LayoutSFB*>(
layout_sfb.data_ptr())};
int device_id = a_ptrs.device().index();
static const cutlass::KernelHardwareInfo hw_info{
device_id, cutlass::KernelHardwareInfo::query_device_multiprocessor_count(
device_id)};
// Epilogue Arguments
typename GemmKernel::EpilogueArguments epilogue_args{
{}, // epilogue.thread
nullptr,
static_cast<StrideC*>(stride_c.data_ptr()),
static_cast<ElementD**>(out_ptrs.data_ptr()),
static_cast<StrideC*>(stride_c.data_ptr())};
UnderlyingProblemShape* problem_sizes_as_shapes =
static_cast<UnderlyingProblemShape*>(problem_sizes.data_ptr());
// Gemm Arguments
typename GemmKernel::Arguments args{
cutlass::gemm::GemmUniversalMode::kGrouped,
{num_experts, problem_sizes_as_shapes, nullptr},
mainloop_args,
epilogue_args,
hw_info};
at::cuda::CUDAGuard device_guard{(char)a_ptrs.device().index()};
const cudaStream_t stream =
at::cuda::getCurrentCUDAStream(a_ptrs.get_device());
auto can_implement_status = gemm_op.can_implement(args);
TORCH_CHECK(can_implement_status == cutlass::Status::kSuccess,
"Failed to implement GEMM");
size_t workspace_size = gemm_op.get_workspace_size(args);
auto const workspace_options =
torch::TensorOptions().dtype(torch::kUInt8).device(a_ptrs.device());
auto workspace = torch::empty(workspace_size, workspace_options);
auto status = gemm_op.initialize(args, workspace.data_ptr(), stream);
TORCH_CHECK(status == cutlass::Status::kSuccess, "Failed to initialize GEMM");
status = gemm_op.run(stream);
TORCH_CHECK(status == cutlass::Status::kSuccess, "Failed to run GEMM");
}
template <typename OutType>
void blockwise_scaled_group_mm_dispatch_shape(
torch::Tensor& output, const torch::Tensor& a, const torch::Tensor& b,
const torch::Tensor& scales_a, const torch::Tensor& scales_b,
const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) {
struct MmaConfig {
using ElementA = cutlass::float_e4m3_t;
using KernelSchedule =
cutlass::gemm::KernelPtrArrayTmaWarpSpecializedBlockwise1SmSm100;
using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecialized1Sm;
using ScaleConfig = cutlass::detail::Sm100BlockwiseScaleConfig<
1, 128, 128, cute::UMMA::Major::K, cute::UMMA::Major::K>;
using LayoutSFA = decltype(ScaleConfig::deduce_layoutSFA());
using LayoutSFB = decltype(ScaleConfig::deduce_layoutSFB());
using LayoutC = cutlass::layout::RowMajor;
using MmaTileShape = Shape<_128, _128, _128>;
using ClusterShape = Shape<_1, _1, _1>;
};
int num_experts = (int)expert_offsets.size(0);
auto a_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto b_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto out_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto a_scales_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto b_scales_ptrs = torch::empty(
{num_experts},
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto layout_sfa = torch::empty(
{num_experts, 5},
torch::TensorOptions().dtype(torch::kInt32).device(a.device()));
auto layout_sfb = torch::empty(
{num_experts, 5},
torch::TensorOptions().dtype(torch::kInt32).device(a.device()));
auto stride_a = torch::full(
{num_experts}, a.size(1),
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto stride_b = torch::full(
{num_experts}, a.size(1),
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
auto stride_c = torch::full(
{num_experts}, output.size(1),
torch::TensorOptions().dtype(torch::kInt64).device(a.device()));
torch::TensorOptions options_int =
torch::TensorOptions().dtype(torch::kInt64).device(a.device());
run_get_ggemm_starts<typename MmaConfig::LayoutSFA,
typename MmaConfig::LayoutSFB,
typename MmaConfig::ScaleConfig>(
expert_offsets, a_ptrs, b_ptrs, out_ptrs, a_scales_ptrs, b_scales_ptrs, a,
b, output, scales_a, scales_b, layout_sfa, layout_sfb, problem_sizes);
run_blockwise_scaled_group_mm<OutType, MmaConfig,
typename MmaConfig::LayoutC>(
out_ptrs, a_ptrs, b_ptrs, a_scales_ptrs, b_scales_ptrs, stride_a,
stride_b, stride_c, layout_sfa, layout_sfb, problem_sizes,
expert_offsets);
}
void cutlass_blockwise_scaled_grouped_mm(
torch::Tensor& output, const torch::Tensor& a, const torch::Tensor& b,
const torch::Tensor& scales_a, const torch::Tensor& scales_b,
const torch::Tensor& problem_sizes, const torch::Tensor& expert_offsets) {
TORCH_CHECK(problem_sizes.dim() == 2, "problem_sizes must be 2D tensor");
TORCH_CHECK(problem_sizes.size(1) == 3,
"problem_sizes must have shape (num_experts, 3)");
TORCH_CHECK(problem_sizes.size(0) == expert_offsets.size(0),
"Number of experts in problem_sizes must match expert_offsets");
TORCH_CHECK(problem_sizes.dtype() == torch::kInt32,
"problem_sizes must be int32");
TORCH_CHECK(a.scalar_type() == torch::kFloat8_e4m3fn,
"a must be kFloat8_e4m3fn");
TORCH_CHECK(b.scalar_type() == torch::kFloat8_e4m3fn,
"b must be kFloat8_e4m3fn");
TORCH_CHECK(output.scalar_type() == torch::kBFloat16 ||
output.scalar_type() == torch::kHalf,
"output must be bfloat16 or half");
TORCH_CHECK(scales_a.scalar_type() == torch::kFloat32,
"scales_a must be float32");
TORCH_CHECK(scales_b.scalar_type() == torch::kFloat32,
"scales_b must be float32");
TORCH_CHECK(expert_offsets.scalar_type() == torch::kInt32,
"expert_offsets must be int32");
TORCH_CHECK(output.dim() == 2, "output must be 2D tensor");
TORCH_CHECK(a.dim() == 2, "a must be 2D tensor");
TORCH_CHECK(b.dim() == 3, "b must be 3D tensor");
TORCH_CHECK(scales_a.dim() == 2, "scales_a must be 2D tensor");
TORCH_CHECK(scales_b.dim() == 3, "scales_b must be 3D tensor");
TORCH_CHECK(problem_sizes.dim() == 2, "problem_sizes must be 2D tensor");
TORCH_CHECK(problem_sizes.size(1) == 3,
"problem_sizes must have shape (num_experts, 3)");
TORCH_CHECK(problem_sizes.size(0) == expert_offsets.size(0),
"Number of experts in problem_sizes must match expert_offsets");
TORCH_CHECK(problem_sizes.dtype() == torch::kInt32,
"problem_sizes must be int32");
TORCH_CHECK(expert_offsets.dim() == 1, "expert_offsets must be 1D tensor");
#if defined(ENABLE_CUTLASS_MOE_SM100) && ENABLE_CUTLASS_MOE_SM100
if (output.scalar_type() == torch::kBFloat16) {
blockwise_scaled_group_mm_dispatch_shape<cutlass::bfloat16_t>(
output, a, b, scales_a, scales_b, problem_sizes, expert_offsets);
} else if (output.scalar_type() == torch::kFloat16) {
blockwise_scaled_group_mm_dispatch_shape<cutlass::half_t>(
output, a, b, scales_a, scales_b, problem_sizes, expert_offsets);
} else {
TORCH_CHECK(false, "Unsupported output tensor type");
}
#endif
}
TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
m.impl("cutlass_blockwise_scaled_grouped_mm",
&cutlass_blockwise_scaled_grouped_mm);
}
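
Review note: the per-expert pointer arithmetic in get_ggemm_starts can be checked on the host with a small reference. The sketch below is an assumption-laden mirror of that kernel, not part of vLLM: ExpertStarts and expert_starts are hypothetical names, offsets are in elements rather than bytes, the block size of 128 is taken from the divisors in the kernel, and 64-bit arithmetic is used here where the kernel uses int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side mirror of the offsets computed per expert in
// get_ggemm_starts. All values are element offsets from the tensor base.
struct ExpertStarts {
  int64_t a;        // into the A activations: expert_offset * k
  int64_t b;        // into the stacked B weights: expert_id * k * n
  int64_t out;      // into the output: expert_offset * n
  int64_t a_scale;  // into the A scales: expert_offset * k / block
  int64_t b_scale;  // into the B scales: expert_id * k * n / block / block
};

// problem_sizes holds one (m, n, k) triple per expert; expert_offsets[e] is
// the first token row that belongs to expert e.
inline std::vector<ExpertStarts> expert_starts(
    const std::vector<int32_t>& expert_offsets,
    const std::vector<int32_t>& problem_sizes, int block = 128) {
  assert(problem_sizes.size() == expert_offsets.size() * 3);
  std::vector<ExpertStarts> starts(expert_offsets.size());
  for (std::size_t e = 0; e < expert_offsets.size(); ++e) {
    const int64_t n = problem_sizes[e * 3 + 1];
    const int64_t k = problem_sizes[e * 3 + 2];
    const int64_t row = expert_offsets[e];
    const int64_t id = static_cast<int64_t>(e);
    starts[e] = {row * k, id * k * n, row * n, row * k / block,
                 id * k * n / block / block};
  }
  return starts;
}
```

Comparing these values against the pointer tables written by the kernel (after subtracting the base pointers) is one way to sanity-check a port of this code path.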

@@ -3,6 +3,8 @@
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>
#include "dispatch_utils.h"
#include <iostream>
constexpr uint64_t THREADS_PER_EXPERT = 512;
@@ -114,22 +116,17 @@ inline void launch_compute_problem_sizes(const torch::Tensor& topk_ids,
const bool swap_ab) {
int num_threads = min(THREADS_PER_EXPERT, topk_ids.numel());
const int32_t* topk_ptr = static_cast<const int32_t*>(topk_ids.data_ptr());
int32_t* ps1_ptr = static_cast<int32_t*>(problem_sizes1.data_ptr());
int32_t* ps2_ptr = static_cast<int32_t*>(problem_sizes2.data_ptr());
int32_t* atomic_ptr = static_cast<int32_t*>(atomic_buffer.data_ptr());
auto const* topk_ptr = topk_ids.data_ptr<int32_t>();
auto* ps1_ptr = problem_sizes1.data_ptr<int32_t>();
auto* ps2_ptr = problem_sizes2.data_ptr<int32_t>();
auto* atomic_ptr = atomic_buffer.data_ptr<int32_t>();
if (swap_ab) {
compute_problem_sizes<true><<<num_experts, num_threads, 0, stream>>>(
VLLM_DISPATCH_BOOL(swap_ab, SwapAB, [&] {
compute_problem_sizes<SwapAB><<<num_experts, num_threads, 0, stream>>>(
topk_ptr, ps1_ptr, ps2_ptr, atomic_ptr,
static_cast<int>(topk_ids.numel()), static_cast<int>(n),
static_cast<int>(k));
} else {
compute_problem_sizes<false><<<num_experts, num_threads, 0, stream>>>(
topk_ptr, ps1_ptr, ps2_ptr, atomic_ptr,
static_cast<int>(topk_ids.numel()), static_cast<int>(n),
static_cast<int>(k));
}
});
}
} // namespace
@@ -153,6 +150,93 @@ void get_cutlass_moe_mm_problem_sizes_caller(
may_swap_ab);
}
template <bool SWAP_AB>
__global__ void compute_problem_sizes_from_expert_offsets(
const int64_t* __restrict__ expert_first_token_offset,
int32_t* __restrict__ problem_sizes1, int32_t* __restrict__ problem_sizes2,
const int num_experts, const int n, const int k) {
int const expert_id = blockIdx.x * blockDim.x + threadIdx.x;
if (expert_id >= num_experts) {
return;
}
int64_t const m64 = expert_first_token_offset[expert_id + 1] -
expert_first_token_offset[expert_id];
int32_t const m = static_cast<int32_t>(m64);
int32_t* ps1 = problem_sizes1 + expert_id * 3;
int32_t* ps2 = problem_sizes2 + expert_id * 3;
if constexpr (!SWAP_AB) {
// [M, 2*N, K]
ps1[0] = m;
ps1[1] = 2 * n;
ps1[2] = k;
// [M, K, N]
ps2[0] = m;
ps2[1] = k;
ps2[2] = n;
} else {
// swap logical M/N in the problem shape
// [2*N, M, K]
ps1[0] = 2 * n;
ps1[1] = m;
ps1[2] = k;
// [K, M, N]
ps2[0] = k;
ps2[1] = m;
ps2[2] = n;
}
}
void get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller(
const torch::Tensor& expert_first_token_offset,
torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
const int64_t n, const int64_t k, const bool swap_ab) {
TORCH_CHECK(expert_first_token_offset.is_cuda(),
"expert_first_token_offset must be a CUDA tensor");
TORCH_CHECK(expert_first_token_offset.dtype() == torch::kInt64,
"expert_first_token_offset must be int64");
TORCH_CHECK(problem_sizes1.is_cuda() && problem_sizes2.is_cuda(),
"problem_sizes must be CUDA tensors");
TORCH_CHECK(problem_sizes1.dtype() == torch::kInt32 &&
problem_sizes2.dtype() == torch::kInt32,
"problem_sizes must be int32");
TORCH_CHECK(problem_sizes1.is_contiguous() && problem_sizes2.is_contiguous(),
"problem_sizes must be contiguous");
TORCH_CHECK(problem_sizes1.dim() == 2 && problem_sizes2.dim() == 2,
"problem_sizes must be 2D tensors");
TORCH_CHECK(problem_sizes1.size(1) == 3 && problem_sizes2.size(1) == 3,
"problem_sizes second dim must be 3");
TORCH_CHECK(problem_sizes1.sizes() == problem_sizes2.sizes(),
"problem_sizes1 and problem_sizes2 must have same shape");
int64_t const num_experts64 = problem_sizes1.size(0);
TORCH_CHECK(expert_first_token_offset.numel() == num_experts64 + 1,
"expert_first_token_offset must have num_experts + 1 elements");
TORCH_CHECK(num_experts64 <= INT32_MAX, "num_experts must fit in int32");
TORCH_CHECK(n <= INT32_MAX && k <= INT32_MAX, "n and k must fit in int32");
int const num_experts = static_cast<int>(num_experts64);
auto stream = at::cuda::getCurrentCUDAStream(
expert_first_token_offset.device().index());
int const threads = (num_experts < 256) ? num_experts : 256;
int const blocks = (num_experts + threads - 1) / threads;
auto const* offsets_ptr = expert_first_token_offset.data_ptr<int64_t>();
auto* ps1_ptr = problem_sizes1.data_ptr<int32_t>();
auto* ps2_ptr = problem_sizes2.data_ptr<int32_t>();
VLLM_DISPATCH_BOOL(swap_ab, SwapAB, [&] {
compute_problem_sizes_from_expert_offsets<SwapAB>
<<<blocks, threads, 0, stream>>>(offsets_ptr, ps1_ptr, ps2_ptr,
num_experts, static_cast<int>(n),
static_cast<int>(k));
});
}
void get_cutlass_moe_mm_data_caller(
const torch::Tensor& topk_ids, torch::Tensor& expert_offsets,
torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
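
Review note: a host-side reference for what compute_problem_sizes_from_expert_offsets writes, plus the launch geometry chosen by its caller. This is a sketch under stated assumptions: Shape3, ProblemSizes, problem_sizes_from_offsets and launch_dims are hypothetical names used only for illustration, and launch_dims assumes at least one expert, since the caller's ceil-division would otherwise divide by a thread count of zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Shape3 = std::array<int32_t, 3>;

// Hypothetical container for the two per-expert problem-size tables.
struct ProblemSizes {
  std::vector<Shape3> ps1;  // first grouped GEMM
  std::vector<Shape3> ps2;  // second grouped GEMM
};

// first_token_offset has num_experts + 1 entries; consecutive differences
// give the number of tokens (m) routed to each expert.
inline ProblemSizes problem_sizes_from_offsets(
    const std::vector<int64_t>& first_token_offset, int32_t n, int32_t k,
    bool swap_ab) {
  assert(!first_token_offset.empty());
  const std::size_t num_experts = first_token_offset.size() - 1;
  ProblemSizes out{std::vector<Shape3>(num_experts),
                   std::vector<Shape3>(num_experts)};
  for (std::size_t e = 0; e < num_experts; ++e) {
    const auto m = static_cast<int32_t>(first_token_offset[e + 1] -
                                        first_token_offset[e]);
    if (!swap_ab) {
      out.ps1[e] = {m, 2 * n, k};  // [M, 2*N, K]
      out.ps2[e] = {m, k, n};      // [M, K, N]
    } else {
      out.ps1[e] = {2 * n, m, k};  // logical M/N swapped: [2*N, M, K]
      out.ps2[e] = {k, m, n};      // [K, M, N]
    }
  }
  return out;
}

// Returns {blocks, threads}: at most 256 threads per block, ceil-div blocks.
inline std::array<int, 2> launch_dims(int num_experts) {
  assert(num_experts > 0);  // assumption: an empty expert list is rejected
  const int threads = num_experts < 256 ? num_experts : 256;
  return {(num_experts + threads - 1) / threads, threads};
}
```

An expert with no tokens simply gets m = 0 in both tables; whether the grouped GEMM tolerates such entries is not something this sketch checks.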

@@ -83,6 +83,11 @@ void get_cutlass_moe_mm_problem_sizes_caller(
const int64_t k, const std::optional<torch::Tensor>& blockscale_offsets,
std::optional<bool> force_swap_ab = std::nullopt);
void get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller(
const torch::Tensor& expert_first_token_offset,
torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
const int64_t n, const int64_t k, const bool swap_ab);
void get_cutlass_pplx_moe_mm_data_caller(torch::Tensor& expert_offsets,
torch::Tensor& problem_sizes1,
torch::Tensor& problem_sizes2,
@@ -322,6 +327,25 @@ void get_cutlass_moe_mm_problem_sizes(
version_num, ". Required capability: 90, 100, or 120");
}
void get_cutlass_moe_mm_problem_sizes_from_expert_offsets(
const torch::Tensor& expert_first_token_offset,
torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
const int64_t n, const int64_t k, const bool swap_ab) {
int32_t version_num = get_sm_version_num();
#if (defined ENABLE_CUTLASS_MOE_SM90 && ENABLE_CUTLASS_MOE_SM90) || \
(defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100) || \
(defined ENABLE_CUTLASS_MOE_SM120 && ENABLE_CUTLASS_MOE_SM120)
get_cutlass_moe_mm_problem_sizes_from_expert_offsets_caller(
expert_first_token_offset, problem_sizes1, problem_sizes2, n, k, swap_ab);
return;
#endif
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"No compiled get_cutlass_moe_mm_problem_sizes_from_expert_offsets: "
"no cutlass_scaled_mm kernel for CUDA device capability: ",
version_num, ". Required capability: 90, 100, or 120");
}
void get_cutlass_pplx_moe_mm_data(torch::Tensor& expert_offsets,
torch::Tensor& problem_sizes1,
torch::Tensor& problem_sizes2,

@@ -4,28 +4,77 @@
#include "quantization/vectorization_utils.cuh"
#include <c10/cuda/CUDAGuard.h>
#include <ATen/cuda/Exceptions.h>
#include <tuple>
namespace vllm {
template <typename scalar_t, typename fp8_type>
__global__ void scaled_fp8_quant_kernel_strided(
// STRIDE_I_ZERO: true if scale_stride_i == 0 (per-tensor or per-channel)
// STRIDE_J_ZERO: true if scale_stride_j == 0 (per-tensor or per-token)
template <typename scalar_t, typename fp8_type, bool STRIDE_I_ZERO,
bool STRIDE_J_ZERO>
__global__ void scaled_fp8_quant_kernel_strided_group_shape(
fp8_type* __restrict__ out, const scalar_t* __restrict__ input,
const float* __restrict__ scale, int hidden_size, int64_t in_row_stride,
int64_t out_row_stride) {
const int64_t token_idx = blockIdx.x; // one token per block
int64_t out_row_stride, int group_m, int group_n, int64_t scale_stride_i,
int64_t scale_stride_j) {
const int64_t token_idx = blockIdx.x;
const int tid = threadIdx.x;
const scalar_t* token_in = input + token_idx * in_row_stride;
fp8_type* token_out = out + token_idx * out_row_stride;
const float inv_scale = 1.0f / (*scale);
// Precompute row-level base offset for scale access (compile-time eliminated
// when STRIDE_I_ZERO)
const int64_t scale_row_base =
STRIDE_I_ZERO ? 0
: static_cast<int>(token_idx) / group_m * scale_stride_i;
vectorize_with_alignment<16>(
token_in, token_out, hidden_size, tid, blockDim.x,
[=] __device__(fp8_type & dst, const scalar_t& src) {
dst = scaled_fp8_conversion<true, fp8_type>(static_cast<float>(src),
inv_scale);
});
auto get_inv_scale = [&](int gj) {
return 1.0f / scale[scale_row_base + gj * scale_stride_j];
};
int cached_gj = -1;
float cached_inv_scale = 0.0f;
auto get_inv_scale_cached = [&](int gj) {
if (gj != cached_gj) {
cached_inv_scale = 1.0f / scale[scale_row_base + gj * scale_stride_j];
cached_gj = gj;
}
return cached_inv_scale;
};
constexpr int VEC_SIZE = 16; // FP8 so vectorize to 128 bits
auto scaled_fp8_conversion_vectorized = [&](const scalar_t* in, fp8_type* out,
int size, float inv_scale) {
vectorize_with_alignment<VEC_SIZE>(
in, out, size, tid, blockDim.x,
[=] __device__(fp8_type & dst, const scalar_t& src) {
dst = scaled_fp8_conversion<true, fp8_type>(static_cast<float>(src),
inv_scale);
});
};
if (STRIDE_J_ZERO && hidden_size % VEC_SIZE == 0) {
// Per-tensor or per-token: single scale per row, vectorize full row
scaled_fp8_conversion_vectorized(token_in, token_out, hidden_size,
get_inv_scale(0));
} else if (group_n % VEC_SIZE == 0) {
// Multiple column groups with vectorization
const int num_groups_n = hidden_size / group_n;
for (int gj = 0; gj < num_groups_n; gj++) {
scaled_fp8_conversion_vectorized(token_in + gj * group_n,
token_out + gj * group_n, group_n,
get_inv_scale(gj));
}
} else {
// Scalar path for column groups that are not a multiple of VEC_SIZE
// (e.g. group_n < VEC_SIZE)
for (int n = tid; n < hidden_size; n += blockDim.x) {
const int gj = n / group_n;
token_out[n] = scaled_fp8_conversion<true, fp8_type>(
static_cast<float>(token_in[n]), get_inv_scale_cached(gj));
}
}
}
template <typename scalar_t, typename fp8_type>
@@ -133,17 +182,116 @@ __global__ void dynamic_per_token_scaled_fp8_quant_kernel_strided(
} // namespace vllm
void static_scaled_fp8_quant(torch::Tensor& out, // [..., d]
torch::Tensor const& input, // [..., d]
torch::Tensor const& scale) // [1]
void static_scaled_fp8_quant(
torch::Tensor& out, // [..., d]
torch::Tensor const& input, // [..., d]
torch::Tensor const& scale, // various shapes
std::optional<std::tuple<int64_t, int64_t>>
opt_group_shape) // optional explicit (group_m, group_n)
{
TORCH_CHECK(input.stride(-1) == 1,
"last dimension of input must be contiguous");
TORCH_CHECK(out.stride(-1) == 1,
"last dimension of output must be contiguous");
const int hidden_size = input.size(-1);
const int num_tokens = input.numel() / hidden_size;
const int hidden_size = input.size(-1); // N (columns)
const int num_tokens = input.numel() / hidden_size; // M (rows)
// Determine group_m, group_n, and scale strides from scale shape
// Scale indexing: scale[gi * scale_stride_i + gj * scale_stride_j]
// where gi = m / group_m, gj = n / group_n
int group_m, group_n;
int64_t scale_stride_i, scale_stride_j;
if (scale.dim() == 0 || scale.numel() == 1) {
// Per-tensor: one scale for the entire tensor
group_m = num_tokens;
group_n = hidden_size;
scale_stride_i = 0;
scale_stride_j = 0;
} else if (scale.dim() == 1) {
// 1D scale: require explicit group_shape to disambiguate per-channel vs
// per-token (avoids edge case where num_tokens == hidden_size)
TORCH_CHECK(opt_group_shape.has_value(),
"1D scale requires explicit group_shape to disambiguate "
"per-channel vs per-token quantization. "
"Use group_shape=(-1, 1) for per-channel or group_shape=(1, "
"-1) for per-token.");
const auto& [opt_group_m, opt_group_n] = opt_group_shape.value();
group_m = opt_group_m == -1 ? num_tokens : static_cast<int>(opt_group_m);
group_n = opt_group_n == -1 ? hidden_size : static_cast<int>(opt_group_n);
// Validate the explicit group shape matches the 1D scale
const int64_t scale_len = scale.numel();
const int64_t expected_scale_m = num_tokens / group_m;
const int64_t expected_scale_n = hidden_size / group_n;
const int64_t expected_scale_numel = expected_scale_m * expected_scale_n;
TORCH_CHECK(scale_len == expected_scale_numel, "1D scale length (",
scale_len, ") does not match expected size (",
expected_scale_numel, ") for group_shape (", opt_group_m, ", ",
opt_group_n, ") with input shape (", num_tokens, ", ",
hidden_size, ")");
// For 1D scale, determine strides based on which dim is trivial
// Scale indexing: scale[gi * scale_stride_i + gj * scale_stride_j]
// where gi = m / group_m (row group), gj = n / group_n (col group)
if (expected_scale_m == 1) {
// Per-channel style: one scale in M dim, scale varies along N
// gi = 0 always, gj varies, so scale_stride_j traverses the scale
scale_stride_i = 0;
scale_stride_j = scale.stride(0);
} else if (expected_scale_n == 1) {
// Per-token style: one scale in N dim, scale varies along M
// gj = 0 always, gi varies, so scale_stride_i traverses the scale
scale_stride_i = scale.stride(0);
scale_stride_j = 0;
} else {
TORCH_CHECK(
false,
"1D scale can only be used when one of the scale dimensions is 1. "
"For 2D group scaling, use a 2D scale tensor.");
}
} else if (scale.dim() == 2) {
// 2D scale: infer group sizes from scale dimensions (or use explicit if
// provided)
const int64_t scale_size_0 = scale.size(0);
const int64_t scale_size_1 = scale.size(1);
TORCH_CHECK(num_tokens % scale_size_0 == 0, "num_tokens (", num_tokens,
") must be divisible by scale.size(0) (", scale_size_0, ")");
TORCH_CHECK(hidden_size % scale_size_1 == 0, "hidden_size (", hidden_size,
") must be divisible by scale.size(1) (", scale_size_1, ")");
// Infer from 2D scale shape
int inferred_group_m = num_tokens / scale_size_0;
int inferred_group_n = hidden_size / scale_size_1;
// Use explicit if provided, otherwise use inferred
if (opt_group_shape.has_value()) {
const auto& [opt_group_m, opt_group_n] = opt_group_shape.value();
group_m = opt_group_m == -1 ? num_tokens : static_cast<int>(opt_group_m);
group_n = opt_group_n == -1 ? hidden_size : static_cast<int>(opt_group_n);
// Validate explicit matches inferred
TORCH_CHECK(group_m == inferred_group_m && group_n == inferred_group_n,
"Explicit group_shape (", opt_group_m, ", ", opt_group_n,
") does not match inferred group shape (", inferred_group_m,
", ", inferred_group_n, ") from 2D scale tensor shape (",
scale_size_0, ", ", scale_size_1, ")");
} else {
group_m = inferred_group_m;
group_n = inferred_group_n;
}
scale_stride_i = scale.stride(0);
scale_stride_j = scale.stride(1);
} else {
TORCH_CHECK(false, "scale must be 0D, 1D, or 2D tensor, but got ",
scale.dim(), "D");
}
const int block_size = 256;
dim3 grid(num_tokens);
dim3 block(block_size);
@@ -153,15 +301,23 @@ void static_scaled_fp8_quant(torch::Tensor& out, // [..., d]
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
// Dispatch to template-specialized kernel based on stride pattern
VLLM_DISPATCH_FLOATING_TYPES(
input.scalar_type(), "scaled_fp8_quant_kernel_scalar_type", [&] {
VLLM_DISPATCH_FP8_TYPES(
out.scalar_type(), "scaled_fp8_quant_kernel_fp8_type", [&] {
vllm::scaled_fp8_quant_kernel_strided<scalar_t, fp8_t>
<<<grid, block, 0, stream>>>(
out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(),
scale.data_ptr<float>(), hidden_size, in_row_stride,
out_row_stride);
VLLM_DISPATCH_BOOL(scale_stride_i == 0, S0_ZERO, [&] {
VLLM_DISPATCH_BOOL(scale_stride_j == 0, S1_ZERO, [&] {
vllm::scaled_fp8_quant_kernel_strided_group_shape<
scalar_t, fp8_t, S0_ZERO, S1_ZERO>
<<<grid, block, 0, stream>>>(
out.data_ptr<fp8_t>(), input.data_ptr<scalar_t>(),
scale.data_ptr<float>(), hidden_size, in_row_stride,
out_row_stride, group_m, group_n, scale_stride_i,
scale_stride_j);
});
});
});
});
}
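
Review note: the scale lookup that the new kernel performs per element can be stated as a small host-side function. The sketch below assumes the indexing used in the kernel body (row-group index times scale_stride_i plus column-group index times scale_stride_j) and the -1 convention for group_shape; resolve_group and scale_index are hypothetical helper names, not vLLM functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: a requested group size of -1 means the group spans
// the whole dimension (num_tokens for rows, hidden_size for columns).
inline int resolve_group(int64_t requested, int full_extent) {
  return requested == -1 ? full_extent : static_cast<int>(requested);
}

// Index into the scale tensor for the element at row m, column n.
// gi is the row-group index and gj the column-group index; a zero stride
// means the scale does not vary along that dimension.
inline int64_t scale_index(int m, int n, int group_m, int group_n,
                           int64_t scale_stride_i, int64_t scale_stride_j) {
  assert(group_m > 0 && group_n > 0);
  const int64_t gi = m / group_m;
  const int64_t gj = n / group_n;
  return gi * scale_stride_i + gj * scale_stride_j;
}
```

For a 4 x 8 input, per-token scaling corresponds to group_shape (1, -1) with strides (1, 0), per-channel scaling to (-1, 1) with strides (0, 1), and a contiguous 2 x 4 scale tensor to groups of 2 x 2 with strides (4, 1).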

@@ -1,3 +1,4 @@
#include "cuda_compat.h"
#include "dispatch_utils.h"
#include <torch/cuda.h>
@@ -97,7 +98,9 @@ static inline __device__ bool isPartialMatch(float x, uint32_t pattern) {
template <typename T, typename idxT, typename Func>
__device__ void vectorized_process(size_t thread_rank, size_t num_threads,
const T* in, idxT len, Func f) {
constexpr int WARP_SIZE = 32;
// Use the platform-dependent, compile-time WARP_SIZE from cuda_compat.h to
// support both Wave64 (MI300X/gfx942) and Wave32 (Strix Halo/gfx1151)
// architectures
constexpr int kWarpSize = WARP_SIZE;
using WideT = float4;
if constexpr (sizeof(T) >= sizeof(WideT)) {
for (idxT i = thread_rank; i < len; i += num_threads) {
@@ -132,8 +135,8 @@ __device__ void vectorized_process(size_t thread_rank, size_t num_threads,
}
}
static_assert(WARP_SIZE >= items_per_scalar);
// and because items_per_scalar > skip_cnt, WARP_SIZE > skip_cnt
static_assert(kWarpSize >= items_per_scalar);
// and because items_per_scalar > skip_cnt, kWarpSize > skip_cnt
// no need to use loop
if (thread_rank < skip_cnt) {
f(in[thread_rank], thread_rank);
@@ -142,7 +145,7 @@ __device__ void vectorized_process(size_t thread_rank, size_t num_threads,
// len_cast * items_per_scalar + items_per_scalar > len - skip_cnt;
// and so
// len - (skip_cnt + len_cast * items_per_scalar) < items_per_scalar <=
// WARP_SIZE no need to use loop
// kWarpSize no need to use loop
const idxT remain_i = skip_cnt + len_cast * items_per_scalar + thread_rank;
if (remain_i < len) {
f(in[remain_i], remain_i);

@@ -416,13 +416,6 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
" Tensor alpha) -> ()");
ops.impl("cutlass_scaled_fp4_mm", torch::kCUDA, &cutlass_scaled_fp4_mm);
// cutlass blockwise scaled group GEMM
ops.def(
"cutlass_blockwise_scaled_grouped_mm(Tensor! output, Tensor a, Tensor b, "
"Tensor scales_a, Tensor scales_b, "
"Tensor problem_sizes, Tensor expert_offsets) -> ()");
// conditionally compiled so impl registration is in source file
// cutlass nvfp4 block scaled group GEMM
ops.def(
"cutlass_fp4_group_mm(Tensor! out, Tensor a, Tensor b,"
@@ -494,6 +487,17 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.impl("get_cutlass_moe_mm_problem_sizes", torch::kCUDA,
&get_cutlass_moe_mm_problem_sizes);
// compute per-expert problem sizes from expert_first_token_offset
// produced by vLLM's moe_permute kernel
ops.def(
"get_cutlass_moe_mm_problem_sizes_from_expert_offsets("
" Tensor expert_first_token_offset, "
" Tensor! problem_sizes1, "
" Tensor! problem_sizes2, "
" int n, int k, bool swap_ab) -> ()");
ops.impl("get_cutlass_moe_mm_problem_sizes_from_expert_offsets", torch::kCUDA,
&get_cutlass_moe_mm_problem_sizes_from_expert_offsets);
// A function that computes data required to run fused MoE with w8a8 grouped
// GEMM and PPLX. It takes expert_num_tokens and non_zero_expert_idxs
// as an input, and computes expert_offsets (token start indices of each
@@ -565,6 +569,15 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
"Tensor output_scale_offset_by_experts) -> ()");
ops.impl("scaled_fp4_experts_quant", torch::kCUDA, &scaled_fp4_experts_quant);
// Fused SiLU+Mul+NVFP4 experts quantization.
ops.def(
"silu_and_mul_scaled_fp4_experts_quant(Tensor! output, Tensor! "
"output_scale,"
"Tensor input, Tensor input_global_scale, Tensor input_offset_by_experts,"
"Tensor output_scale_offset_by_experts) -> ()");
ops.impl("silu_and_mul_scaled_fp4_experts_quant", torch::kCUDA,
&silu_and_mul_scaled_fp4_experts_quant);
// Check if cutlass_scaled_mm_fp4 is supported for CUDA devices
// of the given capability
ops.def("cutlass_scaled_mm_supports_fp4(int cuda_device_capability) -> bool");
@@ -586,9 +599,12 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.impl("gptq_shuffle", torch::kCUDA, &gptq_shuffle);
// Compute FP8 quantized tensor for given scaling factor.
// Supports per-tensor, per-channel, per-token, and arbitrary 2D group
// scaling. Optional group_m/group_n specify the group shape explicitly;
// required for 1D scales to disambiguate per-channel vs per-token.
ops.def(
"static_scaled_fp8_quant(Tensor! result, Tensor input, Tensor scale) -> "
"()");
"static_scaled_fp8_quant(Tensor! result, Tensor input, Tensor scale, "
"(int, int)? group_shape=None) -> ()");
ops.impl("static_scaled_fp8_quant", torch::kCUDA, &static_scaled_fp8_quant);
// Compute dynamic-per-tensor FP8 quantized tensor and scaling factor.
@@ -692,16 +708,6 @@ TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) {
"swap_blocks(Tensor src, Tensor! dst, Tensor block_mapping) -> ()");
cache_ops.impl("swap_blocks", torch::kCUDA, &swap_blocks);
// Copy the cache blocks from src to dst.
cache_ops.def(
"copy_blocks(Tensor(a!)[] key_caches, Tensor[](b!) value_caches, "
"Tensor block_mapping) -> ()");
cache_ops.impl("copy_blocks", torch::kCUDA, &copy_blocks);
cache_ops.def(
"copy_blocks_mla(Tensor(a!)[] kv_caches, Tensor block_mapping) -> ()");
cache_ops.impl("copy_blocks_mla", torch::kCUDA, &copy_blocks_mla);
// Reshape the key and value tensors and cache them.
cache_ops.def(
"reshape_and_cache(Tensor key, Tensor value,"
@@ -731,6 +737,22 @@ TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _cache_ops), cache_ops) {
" Tensor scale) -> ()");
cache_ops.impl("concat_and_cache_mla", torch::kCUDA, &concat_and_cache_mla);
// Rotate Q and K, then write to kv cache for MLA
cache_ops.def(
"concat_and_cache_mla_rope_fused("
" Tensor positions,"
" Tensor! q_pe,"
" Tensor! k_pe,"
" Tensor kv_c,"
" Tensor cos_sin_cache,"
" bool is_neox,"
" Tensor slot_mapping,"
" Tensor! kv_cache,"
" str kv_cache_dtype,"
" Tensor kv_cache_scale) -> ()");
cache_ops.impl("concat_and_cache_mla_rope_fused", torch::kCUDA,
&concat_and_cache_mla_rope_fused);
// Convert the key and value cache to fp8 data type.
cache_ops.def(
"convert_fp8(Tensor! dst_cache, Tensor src_cache, float scale, "

View File

@@ -183,7 +183,7 @@ ARG nvcc_threads=8
ENV NVCC_THREADS=$nvcc_threads
ARG USE_SCCACHE
ARG SCCACHE_DOWNLOAD_URL=https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-x86_64-unknown-linux-musl.tar.gz
ARG SCCACHE_DOWNLOAD_URL
ARG SCCACHE_ENDPOINT
ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
ARG SCCACHE_REGION_NAME=us-west-2
@@ -201,10 +201,16 @@ ENV SETUPTOOLS_SCM_PRETEND_VERSION="0.0.0+csrc.build"
RUN --mount=type=cache,target=/root/.cache/uv \
if [ "$USE_SCCACHE" = "1" ]; then \
echo "Installing sccache..." \
&& case "${TARGETPLATFORM}" in \
linux/arm64) SCCACHE_ARCH="aarch64" ;; \
linux/amd64) SCCACHE_ARCH="x86_64" ;; \
*) echo "Unsupported TARGETPLATFORM for sccache: ${TARGETPLATFORM}" >&2; exit 1 ;; \
esac \
&& export SCCACHE_DOWNLOAD_URL="${SCCACHE_DOWNLOAD_URL:-https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl.tar.gz}" \
&& curl -L -o sccache.tar.gz ${SCCACHE_DOWNLOAD_URL} \
&& tar -xzf sccache.tar.gz \
&& sudo mv sccache-v0.8.1-x86_64-unknown-linux-musl/sccache /usr/bin/sccache \
&& rm -rf sccache.tar.gz sccache-v0.8.1-x86_64-unknown-linux-musl \
&& sudo mv sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl/sccache /usr/bin/sccache \
&& rm -rf sccache.tar.gz sccache-v0.8.1-${SCCACHE_ARCH}-unknown-linux-musl \
&& if [ ! -z ${SCCACHE_ENDPOINT} ] ; then export SCCACHE_ENDPOINT=${SCCACHE_ENDPOINT} ; fi \
&& export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} \
&& export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
@@ -267,6 +273,7 @@ RUN mkdir -p /tmp/deepgemm/dist && touch /tmp/deepgemm/dist/.deepgemm_skipped
COPY tools/ep_kernels/install_python_libraries.sh /tmp/install_python_libraries.sh
ARG PPLX_COMMIT_HASH
ARG DEEPEP_COMMIT_HASH
ARG NVSHMEM_VER
RUN --mount=type=cache,target=/root/.cache/uv \
mkdir -p /tmp/ep_kernels_workspace/dist && \
export TORCH_CUDA_ARCH_LIST='9.0a 10.0a' && \
@@ -274,7 +281,8 @@ RUN --mount=type=cache,target=/root/.cache/uv \
--workspace /tmp/ep_kernels_workspace \
--mode wheel \
${PPLX_COMMIT_HASH:+--pplx-ref "$PPLX_COMMIT_HASH"} \
${DEEPEP_COMMIT_HASH:+--deepep-ref "$DEEPEP_COMMIT_HASH"} && \
${DEEPEP_COMMIT_HASH:+--deepep-ref "$DEEPEP_COMMIT_HASH"} \
${NVSHMEM_VER:+--nvshmem-ver "$NVSHMEM_VER"} && \
find /tmp/ep_kernels_workspace/nvshmem -name '*.a' -delete
#################### EXTENSIONS BUILD IMAGE ####################
@@ -609,6 +617,7 @@ RUN mv vllm src/vllm
FROM vllm-base AS vllm-openai-base
ARG TARGETPLATFORM
ARG INSTALL_KV_CONNECTORS=false
ARG CUDA_VERSION
ARG PIP_INDEX_URL UV_INDEX_URL
ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
@@ -618,10 +627,30 @@ ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
ENV UV_HTTP_TIMEOUT=500
# install kv_connectors if requested
ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,source=requirements/kv_connectors.txt,target=/tmp/kv_connectors.txt,ro \
CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
CUDA_VERSION_DASH=$(echo $CUDA_VERSION | cut -d. -f1,2 | tr '.' '-'); \
CUDA_HOME=/usr/local/cuda; \
# lmcache requires CUDA_HOME to be set explicitly
BUILD_PKGS="libcusparse-dev-${CUDA_VERSION_DASH} \
libcublas-dev-${CUDA_VERSION_DASH} \
libcusolver-dev-${CUDA_VERSION_DASH}"; \
if [ "$INSTALL_KV_CONNECTORS" = "true" ]; then \
uv pip install --system -r /tmp/kv_connectors.txt || true; \
if [ "$CUDA_MAJOR" -ge 13 ]; then \
uv pip install --system nixl-cu13; \
fi; \
uv pip install --system -r /tmp/kv_connectors.txt --no-build || ( \
# if the above fails, install from source
apt-get update -y && \
apt-get install -y --no-install-recommends ${BUILD_PKGS} && \
uv pip install --system -r /tmp/kv_connectors.txt --no-build-isolation && \
apt-get purge -y ${BUILD_PKGS} && \
# clean up -dev packages, keep runtime libraries
rm -rf /var/lib/apt/lists/* \
); \
fi
ENV VLLM_USAGE_SOURCE production-docker-image

View File

@@ -22,13 +22,13 @@ RUN microdnf install -y dnf && dnf install -y gcc-toolset-14 make wget unzip \
###############################################################
FROM registry.access.redhat.com/ubi9/ubi-minimal:${BASE_UBI_IMAGE_TAG} AS centos-deps-builder
RUN microdnf install -y dnf && \
dnf install -y https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-gpg-keys-9.0-24.el9.noarch.rpm \
https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-stream-repos-9.0-24.el9.noarch.rpm \
dnf install -y https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-gpg-keys-9.0-26.el9.noarch.rpm \
https://mirror.stream.centos.org/9-stream/BaseOS/`arch`/os/Packages/centos-stream-repos-9.0-26.el9.noarch.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
dnf config-manager --set-enabled crb
RUN dnf install -y openjpeg2-devel lcms2-devel tcl-devel tk-devel fribidi-devel && \
dnf remove -y centos-gpg-keys-9.0-24.el9.noarch centos-stream-repos-9.0-24.el9.noarch
RUN dnf install -y openjpeg2-devel lcms2-devel tcl-devel tk-devel fribidi-devel yajl-devel && \
dnf remove -y centos-gpg-keys-9.0-26.el9.noarch centos-stream-repos-9.0-26.el9.noarch
###############################################################
@@ -346,4 +346,4 @@ WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["vllm", "serve"]
ENTRYPOINT ["vllm", "serve"]

View File

@@ -97,6 +97,14 @@ RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system hf_transfer
ENV HF_HUB_ENABLE_HF_TRANSFER=1
# install audio decode package `torchcodec` from source (required due to
# ROCm and torch version mismatch) for tests with datasets package
COPY tools/install_torchcodec_rocm.sh /tmp/install_torchcodec.sh
RUN bash /tmp/install_torchcodec.sh \
&& rm /tmp/install_torchcodec.sh \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# Copy in the v1 package (for python-only install test group)
COPY --from=export_vllm /vllm_v1 /usr/local/lib/python${PYTHON_VERSION}/dist-packages/vllm/v1
@@ -130,6 +138,7 @@ RUN --mount=type=bind,from=export_vllm,src=/,target=/install \
&& uv pip install --system *.whl
ARG COMMON_WORKDIR
ARG BASE_IMAGE
# Copy over the benchmark scripts as well
COPY --from=export_vllm /benchmarks ${COMMON_WORKDIR}/vllm/benchmarks
@@ -144,4 +153,9 @@ ENV SAFETENSORS_FAST_GPU=1
# Performance environment variable.
ENV HIP_FORCE_DEV_KERNARG=1
# Workaround for ROCm profiler limits
RUN echo "ROCTRACER_MAX_EVENTS=10000000" > ${COMMON_WORKDIR}/libkineto.conf
ENV KINETO_CONFIG="${COMMON_WORKDIR}/libkineto.conf"
RUN echo "VLLM_BASE_IMAGE=${BASE_IMAGE}" >> ${COMMON_WORKDIR}/versions.txt
CMD ["/bin/bash"]

View File

@@ -1,16 +1,29 @@
ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.1-complete
ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.0-complete
ARG TRITON_BRANCH="57c693b6"
ARG TRITON_REPO="https://github.com/ROCm/triton.git"
ARG PYTORCH_BRANCH="1c57644d"
ARG PYTORCH_VISION_BRANCH="v0.23.0"
ARG PYTORCH_BRANCH="89075173"
ARG PYTORCH_REPO="https://github.com/ROCm/pytorch.git"
ARG PYTORCH_VISION_BRANCH="v0.24.1"
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
ARG PYTORCH_AUDIO_BRANCH="v2.9.0"
ARG PYTORCH_AUDIO_REPO="https://github.com/pytorch/audio.git"
ARG FA_BRANCH="0e60e394"
ARG FA_REPO="https://github.com/Dao-AILab/flash-attention.git"
ARG AITER_BRANCH="59bd8ff2"
ARG AITER_BRANCH="6af8b687"
ARG AITER_REPO="https://github.com/ROCm/aiter.git"
ARG MORI_BRANCH="2d02c6a9"
ARG MORI_REPO="https://github.com/ROCm/mori.git"
#TODO: When patch has been upstreamed, switch to the main repo/branch
# ARG RIXL_BRANCH="<TODO>"
# ARG RIXL_REPO="https://github.com/ROCm/RIXL.git"
ARG RIXL_BRANCH="50d63d94"
ARG RIXL_REPO="https://github.com/vcave/RIXL.git"
# Needed by RIXL
ARG ETCD_BRANCH="7c6e714f"
ARG ETCD_REPO="https://github.com/etcd-cpp-apiv3/etcd-cpp-apiv3.git"
ARG UCX_BRANCH="da3fac2a"
ARG UCX_REPO="https://github.com/ROCm/ucx.git"
FROM ${BASE_IMAGE} AS base
@@ -20,6 +33,7 @@ ENV LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
ARG PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}
ENV AITER_ROCM_ARCH=gfx942;gfx950
ENV MORI_GPU_ARCHS=gfx942;gfx950
# Required for RCCL in ROCm7.1
ENV HSA_NO_SCRATCH_RECLAIM=1
@@ -33,7 +47,7 @@ ENV DEBIAN_FRONTEND=noninteractive
# Install Python and other dependencies
RUN apt-get update -y \
&& apt-get install -y software-properties-common git curl sudo vim less libgfortran5 \
&& apt-get install -y software-properties-common git curl sudo vim less libgfortran5 libopenmpi-dev libpci-dev \
&& for i in 1 2 3; do \
add-apt-repository -y ppa:deadsnakes/ppa && break || \
{ echo "Attempt $i failed, retrying in 5s..."; sleep 5; }; \
@@ -50,6 +64,10 @@ RUN apt-get update -y \
RUN pip install -U packaging 'cmake<4' ninja wheel 'setuptools<80' pybind11 Cython
RUN apt-get update && apt-get install -y libjpeg-dev libsox-dev libsox-fmt-all sox && rm -rf /var/lib/apt/lists/*
###
### Triton Build
###
FROM base AS build_triton
ARG TRITON_BRANCH
ARG TRITON_REPO
@@ -62,11 +80,19 @@ RUN cd triton \
RUN if [ -d triton/python/triton_kernels ]; then pip install build && cd triton/python/triton_kernels \
&& python3 -m build --wheel && cp dist/*.whl /app/install; fi
###
### AMD SMI Build
###
FROM base AS build_amdsmi
RUN cd /opt/rocm/share/amd_smi \
&& pip wheel . --wheel-dir=dist
RUN mkdir -p /app/install && cp /opt/rocm/share/amd_smi/dist/*.whl /app/install
###
### Pytorch build
###
FROM base AS build_pytorch
ARG PYTORCH_BRANCH
ARG PYTORCH_VISION_BRANCH
@@ -95,6 +121,112 @@ RUN mkdir -p /app/install && cp /app/pytorch/dist/*.whl /app/install \
&& cp /app/vision/dist/*.whl /app/install \
&& cp /app/audio/dist/*.whl /app/install
###
### MORI Build
###
FROM base AS build_mori
ARG MORI_BRANCH
ARG MORI_REPO
RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
pip install /install/*.whl
RUN git clone ${MORI_REPO}
RUN cd mori \
&& git checkout ${MORI_BRANCH} \
&& git submodule update --init --recursive \
&& python3 setup.py bdist_wheel --dist-dir=dist && ls /app/mori/dist/*.whl
RUN mkdir -p /app/install && cp /app/mori/dist/*.whl /app/install
###
### RIXL Build
###
FROM build_pytorch AS build_rixl
ARG RIXL_BRANCH
ARG RIXL_REPO
ARG ETCD_BRANCH
ARG ETCD_REPO
ARG UCX_BRANCH
ARG UCX_REPO
ENV ROCM_PATH=/opt/rocm
ENV UCX_HOME=/usr/local/ucx
ENV RIXL_HOME=/usr/local/rixl
ENV RIXL_BENCH_HOME=/usr/local/rixl_bench
# RIXL build system dependencies and RDMA support
RUN apt-get -y update && apt-get -y install autoconf libtool pkg-config \
libgrpc-dev \
libgrpc++-dev \
libprotobuf-dev \
protobuf-compiler-grpc \
libcpprest-dev \
libaio-dev \
librdmacm1 \
librdmacm-dev \
libibverbs1 \
libibverbs-dev \
ibverbs-utils \
rdmacm-utils \
ibverbs-providers
RUN pip install meson auditwheel patchelf tomlkit
WORKDIR /workspace
RUN git clone ${ETCD_REPO} && \
cd etcd-cpp-apiv3 && \
git checkout ${ETCD_BRANCH} && \
mkdir build && cd build && \
cmake .. -DCMAKE_POLICY_VERSION_MINIMUM=3.5 && \
make -j$(nproc) && \
make install
RUN cd /usr/local/src && \
git clone ${UCX_REPO} && \
cd ucx && \
git checkout ${UCX_BRANCH} && \
./autogen.sh && \
mkdir build && cd build && \
../configure \
--prefix=/usr/local/ucx \
--enable-shared \
--disable-static \
--disable-doxygen-doc \
--enable-optimizations \
--enable-devel-headers \
--with-rocm=/opt/rocm \
--with-verbs \
--with-dm \
--enable-mt && \
make -j && \
make -j install
ENV PATH=/usr/local/ucx/bin:$PATH
ENV LD_LIBRARY_PATH=${UCX_HOME}/lib:${LD_LIBRARY_PATH}
RUN git clone ${RIXL_REPO} /opt/rixl && \
cd /opt/rixl && \
git checkout ${RIXL_BRANCH} && \
meson setup build --prefix=${RIXL_HOME} \
-Ducx_path=${UCX_HOME} \
-Drocm_path=${ROCM_PATH} && \
cd build && \
ninja && \
ninja install
# Generate RIXL wheel
RUN cd /opt/rixl && mkdir -p /app/install && \
./contrib/build-wheel.sh \
--output-dir /app/install \
--rocm-dir ${ROCM_PATH} \
--ucx-plugins-dir ${UCX_HOME}/lib/ucx \
--nixl-plugins-dir ${RIXL_HOME}/lib/x86_64-linux-gnu/plugins
###
### FlashAttention Build
###
FROM base AS build_fa
ARG FA_BRANCH
ARG FA_REPO
@@ -107,6 +239,10 @@ RUN cd flash-attention \
&& GPU_ARCHS=$(echo ${PYTORCH_ROCM_ARCH} | sed -e 's/;gfx1[0-9]\{3\}//g') python3 setup.py bdist_wheel --dist-dir=dist
RUN mkdir -p /app/install && cp /app/flash-attention/dist/*.whl /app/install
###
### AITER Build
###
FROM base AS build_aiter
ARG AITER_BRANCH
ARG AITER_REPO
@@ -120,6 +256,10 @@ RUN cd aiter \
RUN pip install pyyaml && cd aiter && PREBUILD_KERNELS=1 GPU_ARCHS=${AITER_ROCM_ARCH} python3 setup.py bdist_wheel --dist-dir=dist && ls /app/aiter/dist/*.whl
RUN mkdir -p /app/install && cp /app/aiter/dist/*.whl /app/install
###
### Final Build
###
FROM base AS debs
RUN mkdir /app/debs
RUN --mount=type=bind,from=build_triton,src=/app/install/,target=/install \
@@ -132,6 +272,10 @@ RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_aiter,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_mori,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
RUN --mount=type=bind,from=build_rixl,src=/app/install/,target=/install \
cp /install/*.whl /app/debs
FROM base AS final
RUN --mount=type=bind,from=debs,src=/app/debs,target=/install \
@@ -150,6 +294,14 @@ ARG FA_BRANCH
ARG FA_REPO
ARG AITER_BRANCH
ARG AITER_REPO
ARG RIXL_BRANCH
ARG RIXL_REPO
ARG ETCD_BRANCH
ARG ETCD_REPO
ARG UCX_BRANCH
ARG UCX_REPO
ARG MORI_BRANCH
ARG MORI_REPO
RUN echo "BASE_IMAGE: ${BASE_IMAGE}" > /app/versions.txt \
&& echo "TRITON_BRANCH: ${TRITON_BRANCH}" >> /app/versions.txt \
&& echo "TRITON_REPO: ${TRITON_REPO}" >> /app/versions.txt \
@@ -162,4 +314,12 @@ RUN echo "BASE_IMAGE: ${BASE_IMAGE}" > /app/versions.txt \
&& echo "FA_BRANCH: ${FA_BRANCH}" >> /app/versions.txt \
&& echo "FA_REPO: ${FA_REPO}" >> /app/versions.txt \
&& echo "AITER_BRANCH: ${AITER_BRANCH}" >> /app/versions.txt \
&& echo "AITER_REPO: ${AITER_REPO}" >> /app/versions.txt
&& echo "AITER_REPO: ${AITER_REPO}" >> /app/versions.txt \
&& echo "RIXL_BRANCH: ${RIXL_BRANCH}" >> /app/versions.txt \
&& echo "RIXL_REPO: ${RIXL_REPO}" >> /app/versions.txt \
&& echo "ETCD_BRANCH: ${ETCD_BRANCH}" >> /app/versions.txt \
&& echo "ETCD_REPO: ${ETCD_REPO}" >> /app/versions.txt \
&& echo "UCX_BRANCH: ${UCX_BRANCH}" >> /app/versions.txt \
&& echo "UCX_REPO: ${UCX_REPO}" >> /app/versions.txt \
&& echo "MORI_BRANCH: ${MORI_BRANCH}" >> /app/versions.txt \
&& echo "MORI_REPO: ${MORI_REPO}" >> /app/versions.txt

View File

@@ -2,7 +2,7 @@ FROM intel/deep-learning-essentials:2025.2.2-0-devel-ubuntu24.04 AS vllm-base
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && \
add-apt-repository -y ppa:kobuk-team/intel-graphics
add-apt-repository -y ppa:kobuk-team/intel-graphics-staging
RUN apt clean && apt-get update -y && \
apt-get install -y --no-install-recommends --fix-missing \
@@ -28,10 +28,14 @@ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.12 1
RUN apt install -y libze1 libze-dev libze-intel-gpu1 intel-opencl-icd libze-intel-gpu-raytracing intel-ocloc
# This oneCCL release includes BMG support, which the default oneCCL shipped with oneAPI 2025.2 lacks.
RUN wget https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.6/intel-oneccl-2021.15.6.9_offline.sh
RUN bash intel-oneccl-2021.15.6.9_offline.sh -a --silent --eula accept && \
ARG ONECCL_INSTALLER="intel-oneccl-2021.15.7.6_offline.sh"
RUN wget "https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.7/${ONECCL_INSTALLER}" && \
bash "${ONECCL_INSTALLER}" -a --silent --eula accept && \
rm "${ONECCL_INSTALLER}" && \
echo "source /opt/intel/oneapi/setvars.sh --force" >> /root/.bashrc && \
echo "source /opt/intel/oneapi/ccl/2021.15/env/vars.sh --force" >> /root/.bashrc
RUN rm -f /opt/intel/oneapi/ccl/latest && \
ln -s /opt/intel/oneapi/ccl/2021.15 /opt/intel/oneapi/ccl/latest
SHELL ["bash", "-c"]
CMD ["bash", "-c", "source /root/.bashrc && exec bash"]
@@ -47,6 +51,11 @@ RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-cache-dir \
-r requirements/xpu.txt
# arctic-inference is built from source which needs torch-xpu properly installed
# used for suffix method speculative decoding
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-cache-dir arctic-inference==0.1.1
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/"
COPY . .

76
docker/docker-bake.hcl Normal file
View File

@@ -0,0 +1,76 @@
# docker-bake.hcl - vLLM Docker build configuration
#
# This file lives in vLLM repo at docker/docker-bake.hcl
#
# Usage:
# cd docker && docker buildx bake # Build default target (openai)
# cd docker && docker buildx bake test # Build test target
# docker buildx bake --print # Show resolved config
#
# Reference: https://docs.docker.com/build/bake/reference/
# Build configuration
variable "MAX_JOBS" {
default = 16
}
variable "NVCC_THREADS" {
default = 8
}
variable "TORCH_CUDA_ARCH_LIST" {
default = "8.0 8.9 9.0 10.0"
}
variable "COMMIT" {
default = ""
}
# Groups
group "default" {
targets = ["openai"]
}
# Base targets
target "_common" {
dockerfile = "docker/Dockerfile"
context = "."
args = {
max_jobs = MAX_JOBS
nvcc_threads = NVCC_THREADS
torch_cuda_arch_list = TORCH_CUDA_ARCH_LIST
}
}
target "_labels" {
labels = {
"org.opencontainers.image.source" = "https://github.com/vllm-project/vllm"
"org.opencontainers.image.vendor" = "vLLM"
"org.opencontainers.image.title" = "vLLM"
"org.opencontainers.image.description" = "vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs"
"org.opencontainers.image.licenses" = "Apache-2.0"
"org.opencontainers.image.revision" = COMMIT
}
annotations = [
"index,manifest:org.opencontainers.image.revision=${COMMIT}",
]
}
# Build targets
target "test" {
inherits = ["_common", "_labels"]
target = "test"
tags = ["vllm:test"]
output = ["type=docker"]
}
target "openai" {
inherits = ["_common", "_labels"]
target = "vllm-openai"
tags = ["vllm:openai"]
output = ["type=docker"]
}

View File

@@ -62,7 +62,7 @@ vLLM is flexible and easy to use with:
For more information, check out the following:
- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM announcing blog post](https://blog.vllm.ai/2023/06/20/vllm.html) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- [vLLM Meetups](community/meetups.md)

View File

@@ -72,7 +72,6 @@ Internal data structures.
- [vllm.multimodal.inputs.MultiModalFieldConfig][]
- [vllm.multimodal.inputs.MultiModalKwargsItem][]
- [vllm.multimodal.inputs.MultiModalKwargsItems][]
- [vllm.multimodal.inputs.MultiModalKwargs][]
- [vllm.multimodal.inputs.MultiModalInputs][]
### Data Parsing

Binary file not shown.


View File

@@ -8,12 +8,19 @@ The results are automatically published to the public [vLLM Performance Dashboar
## Manually Trigger the benchmark
Use [vllm-ci-test-repo images](https://gallery.ecr.aws/q9t5s3a7/vllm-ci-test-repo) with vLLM benchmark suite.
For CPU environment, please use the image with "-cpu" postfix.
For an x86 CPU environment, use the image with the "-cpu" suffix. For an AArch64 CPU environment, use the image with the "-arm64-cpu" suffix.
Here is an example for docker run command for CPU.
Here is an example `docker run` command for CPU. For GPUs, skip setting the `ON_CPU` env var.
```bash
docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN='' --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:1da94e673c257373280026f75ceb4effac80e892-cpu
export VLLM_COMMIT=1da94e673c257373280026f75ceb4effac80e892 # use full commit hash from the main branch
export HF_TOKEN=<valid Hugging Face token>
if [[ "$(uname -m)" == aarch64 || "$(uname -m)" == arm64 ]]; then
IMG_SUFFIX="arm64-cpu"
else
IMG_SUFFIX="cpu"
fi
docker run -it --entrypoint /bin/bash -v /data/huggingface:/root/.cache/huggingface -e HF_TOKEN=$HF_TOKEN -e ON_ARM64_CPU=1 --shm-size=16g --name vllm-cpu-ci public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:${VLLM_COMMIT}-${IMG_SUFFIX}
```
Then, run below command inside the docker instance.
@@ -26,14 +33,65 @@ When run, benchmark script generates results under **benchmark/results** folder,
### Runtime environment variables
- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
- `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
- `REMOTE_HOST`: IP for the remote vLLM service to benchmark. Default value is empty string.
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.
For more results visualization, check the [visualizing the results](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md#visualizing-the-results).
### Visualization
The `convert-results-json-to-markdown.py` script puts the benchmarking results into a markdown table.
You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The JSON version of the table (together with the JSON version of the benchmark) is also attached to the markdown file.
The raw benchmarking results (as JSON files) are in the `Artifacts` tab of the benchmarking job.
#### Performance Results Comparison
The `compare-json-results.py` script compares benchmark result JSON files produced by `convert-results-json-to-markdown.py`.
When run, the benchmark script generates results under the `benchmark/results` folder, along with `benchmark_results.md` and `benchmark_results.json`.
`compare-json-results.py` compares two `benchmark_results.json` files and reports the performance ratio, e.g. for Output Tput, Median TTFT and Median TPOT.
If only one `benchmark_results.json` is passed, `compare-json-results.py` instead compares the different TP and PP configurations within that file.
The following example compares result_a and result_b by max concurrency and qps for the same model, dataset name, and input/output length.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
***Output Tput (tok/s) — Model : [ meta-llama/Llama-3.1-8B-Instruct ] , Dataset Name : [ random ] , Input Len : [ 2048.0 ] , Output Len : [ 2048.0 ]***
| | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
|----|------|-----|-----------|----------|----------|
| 0 | 12 | inf | 24.98 | 186.03 | 7.45 |
| 1 | 16 | inf | 25.49 | 246.92 | 9.69 |
| 2 | 24 | inf | 27.74 | 293.34 | 10.57 |
| 3 | 32 | inf | 28.61 | 306.69 | 10.72 |
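
The `perf_ratio` values in the table above are consistent with dividing the second file's value by the first file's value for each row. The snippet below is a minimal sketch of that arithmetic only; the helper name `perf_ratio` and the `rows` dictionary are illustrative assumptions, not the internals of `compare-json-results.py`.

```python
def perf_ratio(baseline: float, candidate: float) -> float:
    """Return candidate / baseline rounded to two decimals (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return round(candidate / baseline, 2)


# Output throughput (tok/s) per max concurrency, copied from the table above:
# (results_a value, results_b value)
rows = {
    12: (24.98, 186.03),
    16: (25.49, 246.92),
    24: (27.74, 293.34),
    32: (28.61, 306.69),
}

# Ratio > 1 means results_b has higher throughput than results_a.
ratios = {concurrency: perf_ratio(a, b) for concurrency, (a, b) in rows.items()}
print(ratios)
```

For latency metrics such as TTFT and TPOT, lower is better, so a ratio computed this way would read in the opposite direction.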
***compare-json-results.py Command-Line Parameters***
compare-json-results.py provides configurable parameters to compare one or more benchmark_results.json files and generate summary tables and plots.
In most cases, users only need to specify --file to parse the desired benchmark results.
| Parameter | Type | Default Value | Description |
| ---------------------- | ------------------ | ----------------------- | ----------------------------------------------------------------------------------------------------- |
| `--file` | `str` (appendable) | *None* | Input JSON result file(s). Can be specified multiple times to compare multiple benchmark outputs. |
| `--debug` | `bool` | `False` | Enables debug mode. When set, prints all available information to aid troubleshooting and validation. |
| `--plot` / `--no-plot` | `bool` | `True` | Controls whether performance plots are generated. Use `--no-plot` to disable graph generation. |
| `--xaxis` | `str` | `# of max concurrency.` | Column name used as the X-axis in comparison plots (for example, concurrency or batch size). |
| `--latency` | `str` | `p99` | Latency aggregation method used for TTFT/TPOT. Supported values: `median` or `p99`. |
| `--ttft-max-ms` | `float` | `3000.0` | Reference upper bound (milliseconds) for TTFT plots, typically used to visualize SLA thresholds. |
| `--tpot-max-ms` | `float` | `100.0` | Reference upper bound (milliseconds) for TPOT plots, typically used to visualize SLA thresholds. |
***Valid Max Concurrency Summary***
Based on the configured TTFT and TPOT SLA thresholds, compare-json-results.py computes the maximum valid concurrency for each benchmark result.
The “Max # of max concurrency. (Both)” column represents the highest concurrency level that satisfies both TTFT and TPOT constraints simultaneously.
This value is typically used in capacity planning and sizing guides.
| # | Configuration | Max # of max concurrency. (TTFT ≤ 10000 ms) | Max # of max concurrency. (TPOT ≤ 100 ms) | Max # of max concurrency. (Both) | Output Tput @ Both (tok/s) | TTFT @ Both (ms) | TPOT @ Both (ms) |
| - | -------------- | ------------------------------------------- | ----------------------------------------- | -------------------------------- | -------------------------- | ---------------- | ---------------- |
| 0 | results-a | 128.00 | 12.00 | 12.00 | 127.76 | 3000.82 | 93.24 |
| 1 | results-b | 128.00 | 32.00 | 32.00 | 371.42 | 2261.53 | 81.74 |
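
One way to read the summary above is as a filter over per-concurrency measurements: keep the runs that meet a threshold, then take the highest concurrency among them. The sketch below illustrates that reading under stated assumptions. The `Run` dataclass, the function name, and the sample numbers are hypothetical, and the real script may apply a stricter rule (for example, requiring every lower concurrency level to pass as well). The default thresholds mirror the `--ttft-max-ms` and `--tpot-max-ms` defaults listed in the parameter table.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Run:
    """One benchmark measurement at a fixed max concurrency (hypothetical shape)."""
    concurrency: int
    ttft_ms: float
    tpot_ms: float


def max_valid_concurrency(
    runs: list[Run],
    ttft_max_ms: float = 3000.0,
    tpot_max_ms: float = 100.0,
) -> dict[str, Optional[int]]:
    """Return the highest concurrency meeting each constraint, or None if none do."""

    def highest(passes: Callable[[Run], bool]) -> Optional[int]:
        valid = [r.concurrency for r in runs if passes(r)]
        return max(valid) if valid else None

    return {
        "ttft": highest(lambda r: r.ttft_ms <= ttft_max_ms),
        "tpot": highest(lambda r: r.tpot_ms <= tpot_max_ms),
        "both": highest(
            lambda r: r.ttft_ms <= ttft_max_ms and r.tpot_ms <= tpot_max_ms
        ),
    }


# Made-up measurements, for illustration only.
sample_runs = [
    Run(concurrency=8, ttft_ms=1200.0, tpot_ms=60.0),
    Run(concurrency=12, ttft_ms=2900.0, tpot_ms=93.0),
    Run(concurrency=16, ttft_ms=3400.0, tpot_ms=97.0),
    Run(concurrency=32, ttft_ms=2500.0, tpot_ms=120.0),
]
summary = max_valid_concurrency(sample_runs)
print(summary)
```

The "both" value can be lower than either single-metric value, because a concurrency level has to satisfy the two limits in the same run.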
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/performance-benchmarks/performance-benchmarks-descriptions.md).

View File

@@ -129,10 +129,10 @@ vllm bench sweep serve_sla \
The algorithm for adjusting the SLA variable is as follows:
1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
- For example, the initial request rate is set to the concurrency under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
1. Run the benchmark once with maximum possible QPS, and once with minimum possible QPS. For each run, calculate the distance of the SLA metrics from their targets, resulting in data points of QPS vs SLA distance.
2. Perform spline interpolation between the data points to estimate the QPS that results in zero SLA distance.
3. Run the benchmark with the estimated QPS and add the resulting data point to the history.
4. Repeat Steps 2 and 3 until the maximum QPS that passes SLA and the minimum QPS that fails SLA in the history are close enough to each other.
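
The interpolation-based search in the steps above can be sketched roughly as follows. This is a simplified illustration, not the `vllm bench sweep serve_sla` implementation: it uses linear interpolation between the two bracketing points in place of spline interpolation to stay dependency-free, it tracks a single latency metric, and it assumes latency does not decrease as QPS grows. `run_benchmark`, `find_max_qps`, and all parameter names are hypothetical.

```python
from typing import Callable, Optional


def find_max_qps(
    run_benchmark: Callable[[float], float],  # hypothetical: QPS -> latency (ms)
    qps_min: float,
    qps_max: float,
    sla_target_ms: float,
    tol: float = 0.05,
    max_iters: int = 20,
) -> Optional[float]:
    """Estimate the largest QPS whose latency stays within the SLA target."""
    # SLA distance = measured latency - target; a value <= 0 passes the SLA.
    history = {q: run_benchmark(q) - sla_target_ms for q in (qps_min, qps_max)}

    if history[qps_max] <= 0:
        return qps_max  # even the maximum QPS passes
    if history[qps_min] > 0:
        return None  # even the minimum QPS fails

    for _ in range(max_iters):
        # Bracket: highest passing QPS and the lowest failing QPS above it.
        lo = max(q for q, d in history.items() if d <= 0)
        hi = min(q for q, d in history.items() if d > 0 and q > lo)
        if hi - lo <= tol:
            break

        # Linear estimate of the QPS where the SLA distance crosses zero.
        d_lo, d_hi = history[lo], history[hi]
        estimate = lo + (hi - lo) * (-d_lo) / (d_hi - d_lo)
        # Keep the estimate strictly inside the bracket so each step makes progress.
        estimate = min(max(estimate, lo + tol / 2), hi - tol / 2)

        history[estimate] = run_benchmark(estimate) - sla_target_ms

    return max(q for q, d in history.items() if d <= 0)


def toy_latency_ms(qps: float) -> float:
    """Toy stand-in for a real benchmark run: latency grows linearly with QPS."""
    return 10.0 * qps


best_qps = find_max_qps(toy_latency_ms, qps_min=1.0, qps_max=50.0, sla_target_ms=100.0)
print(best_qps)
```

Compared with doubling followed by binary search, interpolating on the measured SLA distance tends to need fewer benchmark runs when latency varies smoothly with QPS, which matters because each run is expensive.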
!!! important
SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.

View File

@@ -6,4 +6,4 @@
## Arguments
--8<-- "docs/argparse/bench_latency.inc.md"
--8<-- "docs/generated/argparse/bench_latency.inc.md"

View File

@@ -0,0 +1,9 @@
# vllm bench mm-processor
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Arguments
--8<-- "docs/generated/argparse/bench_mm_processor.inc.md"

View File

@@ -6,4 +6,4 @@
## Arguments
--8<-- "docs/argparse/bench_serve.inc.md"
--8<-- "docs/generated/argparse/bench_serve.inc.md"

View File

@@ -6,4 +6,4 @@
## Arguments
--8<-- "docs/argparse/bench_sweep_plot.inc.md"
--8<-- "docs/generated/argparse/bench_sweep_plot.inc.md"

View File

@@ -6,4 +6,4 @@
## Arguments
--8<-- "docs/argparse/bench_sweep_plot_pareto.inc.md"
--8<-- "docs/generated/argparse/bench_sweep_plot_pareto.inc.md"

View File

@@ -6,4 +6,4 @@
## Arguments
--8<-- "docs/argparse/bench_sweep_serve.inc.md"
--8<-- "docs/generated/argparse/bench_sweep_serve.inc.md"

View File

@@ -6,4 +6,4 @@
## Arguments
--8<-- "docs/argparse/bench_sweep_serve_sla.inc.md"
--8<-- "docs/generated/argparse/bench_sweep_serve_sla.inc.md"

View File

@@ -6,4 +6,4 @@
## Arguments
--8<-- "docs/argparse/bench_throughput.inc.md"
--8<-- "docs/generated/argparse/bench_throughput.inc.md"

View File

@@ -2,4 +2,4 @@
## Arguments
--8<-- "docs/argparse/chat.inc.md"
--8<-- "docs/generated/argparse/chat.inc.md"
