Commit Graph

126 Commits

Author SHA1 Message Date
Robert Shaw
8435b2e049 [ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) (#34302)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-02-23 14:02:26 +00:00
Wei Zhao
7f51e93864 [Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix (#34876)
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
2026-02-18 23:20:30 -08:00
Robert Shaw
6874638bc4 [Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) (#34758)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2026-02-18 07:42:36 -08:00
Xinyu Dong
be7f3d5d20 [Bugfix] fix default is_neox_style is True for deepseek (#34353)
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com>
2026-02-11 18:20:45 +00:00
Jee Jee Li
978a37c823 [Model] GLM adaptation (#34124) 2026-02-09 17:32:52 +08:00
Isotr0py
a2522839d8 [Bugfix] Fix Kimi-K2.5 NVFP4 checkpoints weight loading (#33876)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2026-02-05 10:29:54 +00:00
Pavani Majety
d2f4a71cd5 [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. (#33858)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2026-02-05 09:32:10 +00:00
Dimitrios Bariamis
f0bca83ee4 Add support for Mistral Large 3 inference with Flashinfer MoE (#33174)
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2026-01-30 22:48:27 -08:00
Matthew Bonanni
a608b4c6c2 [5/N][Attention] Finish eliminating vllm/attention folder (#32064)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-27 10:02:51 -05:00
Cyrus Leung
dcd80206b7 [Chore] Update type annotation of input_ids in model forward (#33063)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2026-01-26 06:02:10 -08:00
Pleaplusone
6c20e89c02 [ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp (#29287)
Signed-off-by: ganyi <ygan@amd.com>
2026-01-21 23:16:30 +08:00
Chauncey
c4e5bdf61b [Bugfix] Fix the fp8_mqa_logits dim mismatch (#32652)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2026-01-20 18:48:07 +08:00
Kebe
5de6dd0662 [Bugfix] [DeepSeek-V3.2] fix sparse_attn_indexer padding (#32175)
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
2026-01-16 03:21:55 +00:00
Matthew Bonanni
2612ba9285 [1/N][Attention] Restructure attention: move files (#31916)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2026-01-09 13:10:24 -08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
482914849c [BugFix] LoRA: Support loading base_layer of experts (#31104)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2026-01-07 14:49:39 +08:00
Pleaplusone
b41aeb3468 [Bugfix][ROCm] Fix load issue on deepseek quark quantization when shared expert enabled (#31261)
Signed-off-by: ganyi <ygan@amd.com>
2025-12-24 16:47:44 +08:00
Wentao Ye
76e6a95192 [Bug] Fix Number of dimensions of tensors must match. for Deepseek V3.2 (#31160)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-24 10:41:09 +08:00
Wentao Ye
4cf9429897 [Bug] Fix error 'Dynamo failed to run FX node with fake tensors for Deepseek V3.2 (#31046)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-12-19 23:31:31 +00:00
baoqian426
84896fda22 [Bugfix] deepseek-V3.2 self.weights_proj has no bias (#30841)
Signed-off-by: baoqian <1354987947@qq.com>
Signed-off-by: baoqian426 <1354987947@qq.com>
2025-12-17 03:32:34 -08:00
Lucas Wilkinson
3e41992fec [Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 (#27532)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-12-12 05:57:47 -08:00
Harry Mellor
cf3eacfe58 Standardise get_rope to use rope_parameters["partial_rotary_factor"], not rotary_dim (#30389)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-12-11 20:45:23 +00:00
Daniel Cámpora
184076c3fe [DeepSeek v3.2] Make top-k work for any logit values. (#27568)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-12-08 06:55:58 -08:00
Julien Denize
5e5646e206 [BUGFIX] llama_4_scaling wrongly passed to DeepseekAttention (#29908)
Signed-off-by: juliendenize <julien.denize@mistral.ai>
2025-12-02 14:51:20 -08:00
Julien Denize
d8c6210eea Add Mistral Large 3 and Ministral 3 (#29757)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com>
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Mickael Seznec <mickael@mistral.ai>
2025-12-02 10:29:00 +00:00
Matthew Bonanni
430dd4d9eb [Attention] Remove imports from vllm/attention/__init__.py (#29342)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-11-26 10:53:15 -07:00
杰兮
8005e606bf [Bugfix][Rocm] Fix shared expert weight loading failure in DeepSeek-MTP (#27563)
Signed-off-by: zhyajie <yajizhan@amd.com>
Co-authored-by: zhyajie <yajizhan@amd.com>
2025-11-24 10:16:52 +00:00
Pleaplusone
06c20c9904 [ROCm] Add AMD GPU support on Deepseek v3.2 and SparseMLA (#26670)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-20 02:54:01 -08:00
Pleaplusone
7218f83992 [ROCm][BugFix] Fix shared expert loading error when disable VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS (#28633)
Signed-off-by: ganyi <ygan@amd.com>
2025-11-20 14:50:23 +07:00
Wentao Ye
0075bfffd4 [CI] Fix precommit rope_theta issue (#29040)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-11-19 14:22:43 -08:00
Yongye Zhu
88f5b19f0b [DeepSeek] Fix DeepSeek V3.2 Rope Embedding (#28968)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-11-19 16:30:04 -05:00
Harry Mellor
a8b70304d6 Update rope_scaling to rope_parameters in preparation for Transformers v5 (#28542)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-19 09:06:36 -08:00
Eldar Kurtić
e439c784fa Add support for Eagle with separate lm-head and embed_tokens layers (#28549)
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
2025-11-15 06:12:02 -08:00
Harry Mellor
97d1c99302 Rename clashing method names for vLLM model protocol (#27583)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-11-12 19:14:33 -08:00
vllmellm
f080a83511 [RFC][ROCm][AITER] Keep all AITER kernels in _aiter_ops class like _custom_ops and _ipex_ops (#24490)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-11-10 08:20:53 -08:00
Isotr0py
934a9c3b79 [Model] Consolidate Deepseek-MoE implementation with DeepSeek-v2 (#28101)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2025-11-08 05:01:27 +00:00
Isotr0py
43ecd0a900 [Chore] Clean up deepseek v2/v3 config copy (#28055)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-11-06 03:46:30 +00:00
Ilya Markov
e50c454672 [BugFix] Support EP/DP + EPLB with MTP (#25311)
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
2025-11-05 15:22:17 +00:00
Lain
09a7e6f617 [Deepseek v3.2] Remove extra logics in indexer (#26465)
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Lain <siyuanf@nvidia.com>
Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-21 23:34:03 +00:00
Alexander Matveev
344a0017c0 [Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE (#26440)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-10-21 21:38:29 +00:00
Daniel Cámpora
80e9452984 [Deepseek v3.2] Optimize top_k_per_row (#26763)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
2025-10-21 08:30:07 +00:00
Chen Wu
5f6cbf60d6 [Feature][Kernel]FusedMoE LoRA (#21229)
Signed-off-by: wuchen <cntryroa@gmail.com>
Signed-off-by: banjuede <lmklhc@163.com>
Signed-off-by: Chen Wu <cntryroa@gmail.com>
Signed-off-by: Danielle Robinson <dmmaddix@amazon.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: bk-201 <joy25810@foxmail.com>
Co-authored-by: wuchen <wuchen@zetyun.com>
Co-authored-by: Nathan Van Gheem <vangheem@gmail.com>
Co-authored-by: banjuede <lmklhc@163.com>
Co-authored-by: Danielle Robinson <dmmaddix@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: bk-201 <joy25810@foxmail.com>
2025-10-21 03:01:37 +00:00
Isotr0py
6ac5e06f7c [Chore] Clean up pytorch helper functions in vllm.utils (#26908)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: isotr0py <2037008807@qq.com>
2025-10-18 09:48:22 -07:00
Nicolò Lucchesi
b26b70bec4 [Misc] Refactor get_kv_cache_spec into AttentionLayerBase (#26587)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-10-18 13:51:21 +00:00
kliuae
1317034379 [ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops (#24097)
Signed-off-by: chenjun <junchen2@amd.com>
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
Co-authored-by: valarLip <103567126+valarLip@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
2025-10-16 10:41:34 +08:00
Yongye Zhu
f5ed68ef63 [Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather (#26456)
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
2025-10-15 16:05:01 +08:00
Mengqing Cao
302ef403a2 [DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends (#26656)
Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-15 00:16:44 -07:00
Harry Mellor
8fcaaf6a16 Update Optional[x] -> x | None and Union[x, y] to x | y (#26633)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-10-12 09:51:31 -07:00
bnellnm
47e66c24e2 [Model] Apply shared experts overlap optimization to all models with shared experts (#26145)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-10-09 11:31:04 -04:00
Naveenraj Kamalakannan
e614ab7806 Separate MLAAttention class from Attention (#25103)
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-10-08 17:11:11 -07:00
Daniel Cámpora
e1098ced95 Add topk logits torch op for DS3.2. (#25945)
Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>
Signed-off-by: Daniel Cámpora <961215+dcampora@users.noreply.github.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-10-07 10:07:32 +00:00