biondizzle/vllm - vllm

Author	SHA1	Message	Date
Wentao Ye	3a6d5cbefd	[Perf] Optimize dcp allocate tensor (#33102 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-01-27 17:24:41 -05:00
Matthew Bonanni	a608b4c6c2	[5/N][Attention] Finish eliminating `vllm/attention` folder (#32064 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-27 10:02:51 -05:00
Nicolò Lucchesi	1f3a2c2944	[Bugfix] Disable CG for Whisper+FA2 (#33164 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2026-01-27 21:46:51 +08:00
Strahinja Stamenkovic	c568581ff3	Fix IndexError with encoder-decoder models when using Custom Paged Attention (#33112 ) Signed-off-by: sstamenk <strahinja.stamenkovic@amd.com>	2026-01-27 10:33:37 +08:00
ElizaWszola	a28b94e6ef	[Performance] Split FlashAttn attention and cache update (#25954 ) Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Luka Govedič <luka.govedic@gmail.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: Luka Govedič <lgovedic@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Luka Govedič <luka.govedic@gmail.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Luka Govedič <lgovedic@redhat.com>	2026-01-23 17:28:06 -08:00
Markus / Mark	586a57ad7e	fix: Add glm4_moe_lite to MLA detection (#32614 ) Signed-off-by: marksverdhei <marksverdhei@hotmail.com> Signed-off-by: Markus / Mark <46672778+marksverdhei@users.noreply.github.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: mgoin <mgoin64@gmail.com>	2026-01-23 12:38:57 -08:00
Harry Huang	5206e5e28c	[V1][Hybrid] Mamba Prefix Caching with align mode (#30877 ) Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com> Signed-off-by: Chen Zhang <zhangch99@outlook.com> Co-authored-by: Chen Zhang <zhangch99@outlook.com>	2026-01-23 09:56:48 -08:00
tianshu-Michael-yu	13d8746c54	[Feature]: Remove DtoH Copy for lfm2_vl On Default Stream (#32815 ) Signed-off-by: Tianshu Yu <tianshuyu.formal@gmail.com>	2026-01-23 13:20:30 +00:00
Nicolò Lucchesi	160c6fa387	[Misc] Add `get_name` to missing AttentionBackends (#32698 ) Signed-off-by: NickLucche <nlucches@redhat.com>	2026-01-23 10:35:44 +00:00
Andreas Karatzas	a8eb1182f1	[CI][Models] Add VLM Support for Sequence Classification Conversion (#32885 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-01-23 16:22:51 +08:00
Wentao Ye	7ef5873752	[CI] Fix mypy for `vllm/v1/structured_output` (#32722 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-01-23 11:55:51 +08:00
Eldar Kurtić	44f08af3a7	Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) (#30141 ) Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com> Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>	2026-01-22 13:29:57 -07:00
Matthew Bonanni	955b43a5a5	[Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 (#32795 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-22 19:05:18 +00:00
Lucas Wilkinson	889722f3bf	[FlashMLA] Update FlashMLA to expose new arguments (#32810 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-21 22:02:39 -07:00
Nick Hill	24dc30f7ff	[ModelRunner V2] Don't pin reused flashinfer tensors (#32799 ) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-01-21 13:17:43 -08:00
elvischenv	808d6fd7b9	Bump Flashinfer to v0.6.1 (#30993 ) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>	2026-01-21 08:49:50 -08:00
Pleaplusone	6c20e89c02	[ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp (#29287 ) Signed-off-by: ganyi <ygan@amd.com>	2026-01-21 23:16:30 +08:00
Robert Shaw	42135d6898	[MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414 )	2026-01-21 08:22:33 -05:00
Lucas Wilkinson	b4f64e5b02	Update FlashMLA (#32491 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-21 13:03:37 +08:00
Tomas Ruiz	4a5299c93f	feat: spec decode with draft models (#24322 ) Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>	2026-01-19 16:05:46 -05:00
Vadim Gimpelson	6101a26dc9	[BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 (#32417 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-01-18 16:57:32 -08:00
Wentao Ye	16de822c71	[Refactor] Remove unused file `pallas_kv_cache_update.py` (#32433 ) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-01-18 12:46:39 -08:00
Li Xie	c826c72a96	[Model] Support Step1 Model (#32511 ) Signed-off-by: xieli <xieli@stepfun.com>	2026-01-18 10:20:46 +00:00
Isotr0py	8cc26acd8b	[Performance] Improve Triton prefill attention kernel's performance (#32403 ) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2026-01-17 20:19:59 -08:00
Guofang.Tang	2b99f210f5	[Misc] Fix typo: seperator -> separator in flashmla_sparse.py (#32411 ) Signed-off-by: Guofang Tang <tinggofun@gmail.com> Co-authored-by: Guofang Tang <tinggofun@gmail.com>	2026-01-17 12:18:30 +00:00
Matthias Gehre	047413375c	[Attention][AMD] Make flash-attn optional (#30361 ) Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>	2026-01-15 17:18:24 +00:00
Pleaplusone	130d6c9514	[ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for `AiterFlashAttentionBackend` (#29887 ) Signed-off-by: ganyi <ygan@amd.com>	2026-01-15 15:29:53 +00:00
vllmellm	e27078ea80	[Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten (#32336 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-14 19:32:48 +00:00
Matthew Bonanni	2263d44b68	[4/N][Attention] Move MLA common to model_executor (#32060 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-01-13 09:08:45 -08:00
Matthew Bonanni	98f60e5acb	[6/N][Attention] Move utils to more appropriate locations (#32215 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-13 05:38:52 -08:00
Mickaël Seznec	a5bbbd2f24	[Quantization] fix: overflow with static per-tensor scaling (#29867 ) Signed-off-by: Mickael Seznec <mickael@mistral.ai> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-01-13 12:56:01 +00:00
cjackal	15b33ff064	[Misc] improve warning/assert messages (#32226 ) Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>	2026-01-13 03:11:23 +00:00
Matthew Bonanni	20228cb851	[3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-12 09:13:56 -08:00
Asaf Joseph Gardin	8fb2c135be	[Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode (#32118 ) Signed-off-by: Josephasafg <ajgard7@gmail.com>	2026-01-12 17:02:38 +00:00
Isotr0py	9dbe1fe960	[Bugfix] Fix missing scale passing for encoder Triton Attention implementation (#32149 ) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2026-01-12 11:13:41 +00:00
Vadim Gimpelson	e15a5ff07b	[MISC] Add strict contiguity check for FlashInfer attention tensors (#32008 ) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>	2026-01-10 12:40:05 -08:00
jvlunteren	b8bf5c45bb	[Kernel] Optimize Sliding Window Attention in 3D Triton Kernel (#31984 ) Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>	2026-01-10 18:13:44 +00:00
Lucas Wilkinson	da6709c9fe	[Misc] Delay deprecation of CommonAttentionMetadata properties (#32074 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-09 21:06:44 -08:00
Lucas Kabela	ea6d067a2a	[Misc][LLaMa4] Compile LLaMa Vision Encoder (#30709 ) Signed-off-by: Lucas Kabela <lucaskabela@meta.com>	2026-01-09 22:01:38 -05:00
Matthew Bonanni	0308901975	[2/N][Attention] Fix pre-commit errors (#32052 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-10 00:27:15 +00:00
Matthew Bonanni	2612ba9285	[1/N][Attention] Restructure attention: move files (#31916 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-09 13:10:24 -08:00
R3hankhan	8e27663b6a	[CPU] Add head sizes 80 and 112 with vec16 fallback (#31968 ) Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>	2026-01-09 22:14:46 +08:00
vllmellm	1a19e9cd87	[Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten (#31380 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-09 19:28:02 +08:00
Rabi Mishra	107cf8e92f	fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN (#31712 ) Signed-off-by: rabi <ramishra@redhat.com>	2026-01-08 15:46:07 +08:00
Cyrus Leung	b665bbc2d4	[Chore] Migrate V0 attention utils (#31891 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-01-07 13:44:36 +00:00
vllmellm	41cfa50632	[ROCm][AITER] fix wrong argument passed to AITER `flash_attn_varlen_func` (#31880 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-07 11:25:03 +00:00
weiyu	e7596371a4	[Refactor][TPU] Remove torch_xla path and use tpu-inference (#30808 ) Signed-off-by: Wei-Yu Lin <weiyulin@google.com> Signed-off-by: weiyu <62784299+weiyu0824@users.noreply.github.com>	2026-01-07 16:07:16 +08:00
Lucas Wilkinson	c7a79d41a0	[Attention][3/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (#31850 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-07 13:31:34 +08:00
vllmellm	6409004b26	[ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend (#31816 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-07 05:04:53 +00:00
Jack Yang	0a2c2dc3f1	fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround (#31465 ) Signed-off-by: Zhuohao Yang <zy242@cornell.edu> Co-authored-by: Zhuohao Yang <zy242@cornell.edu> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2026-01-07 04:08:47 +00:00

504 Commits