biondizzle/vllm - vllm

Author	SHA1	Message	Date
Andreas Karatzas	a8eb1182f1	[CI][Models] Add VLM Support for Sequence Classification Conversion (#32885) Signed-off-by: Andreas Karatzas <akaratza@amd.com>	2026-01-23 16:22:51 +08:00
Wentao Ye	7ef5873752	[CI] Fix mypy for `vllm/v1/structured_output` (#32722) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-01-23 11:55:51 +08:00
Eldar Kurtić	44f08af3a7	Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) (#30141) Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com> Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>	2026-01-22 13:29:57 -07:00
Matthew Bonanni	955b43a5a5	[Bugfix][Attention] Explicitly report support for kv_cache_dtype bfloat16 (#32795) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-22 19:05:18 +00:00
Lucas Wilkinson	889722f3bf	[FlashMLA] Update FlashMLA to expose new arguments (#32810) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-21 22:02:39 -07:00
Nick Hill	24dc30f7ff	[ModelRunner V2] Don't pin reused flashinfer tensors (#32799) Signed-off-by: Nick Hill <nickhill123@gmail.com>	2026-01-21 13:17:43 -08:00
elvischenv	808d6fd7b9	Bump Flashinfer to v0.6.1 (#30993) Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>	2026-01-21 08:49:50 -08:00
Pleaplusone	6c20e89c02	[ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp (#29287) Signed-off-by: ganyi <ygan@amd.com>	2026-01-21 23:16:30 +08:00
Robert Shaw	42135d6898	[MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414)	2026-01-21 08:22:33 -05:00
Lucas Wilkinson	b4f64e5b02	Update FlashMLA (#32491) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-21 13:03:37 +08:00
Tomas Ruiz	4a5299c93f	feat: spec decode with draft models (#24322) Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com>	2026-01-19 16:05:46 -05:00
Vadim Gimpelson	6101a26dc9	[BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 (#32417) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>	2026-01-18 16:57:32 -08:00
Wentao Ye	16de822c71	[Refactor] Remove unused file `pallas_kv_cache_update.py` (#32433) Signed-off-by: yewentao256 <zhyanwentao@126.com>	2026-01-18 12:46:39 -08:00
Li Xie	c826c72a96	[Model] Support Step1 Model (#32511) Signed-off-by: xieli <xieli@stepfun.com>	2026-01-18 10:20:46 +00:00
Isotr0py	8cc26acd8b	[Performance] Improve Triton prefill attention kernel's performance (#32403) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2026-01-17 20:19:59 -08:00
Guofang.Tang	2b99f210f5	[Misc] Fix typo: seperator -> separator in flashmla_sparse.py (#32411) Signed-off-by: Guofang Tang <tinggofun@gmail.com> Co-authored-by: Guofang Tang <tinggofun@gmail.com>	2026-01-17 12:18:30 +00:00
Matthias Gehre	047413375c	[Attention][AMD] Make flash-attn optional (#30361) Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>	2026-01-15 17:18:24 +00:00
Pleaplusone	130d6c9514	[ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for `AiterFlashAttentionBackend` (#29887) Signed-off-by: ganyi <ygan@amd.com>	2026-01-15 15:29:53 +00:00
vllmellm	e27078ea80	[Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten (#32336) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-14 19:32:48 +00:00
Matthew Bonanni	2263d44b68	[4/N][Attention] Move MLA common to model_executor (#32060) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2026-01-13 09:08:45 -08:00
Matthew Bonanni	98f60e5acb	[6/N][Attention] Move utils to more appropriate locations (#32215) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-13 05:38:52 -08:00
Mickaël Seznec	a5bbbd2f24	[Quantization] fix: overflow with static per-tensor scaling (#29867) Signed-off-by: Mickael Seznec <mickael@mistral.ai> Signed-off-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2026-01-13 12:56:01 +00:00
cjackal	15b33ff064	[Misc] improve warning/assert messages (#32226) Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>	2026-01-13 03:11:23 +00:00
Matthew Bonanni	20228cb851	[3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-12 09:13:56 -08:00
Asaf Joseph Gardin	8fb2c135be	[Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode (#32118) Signed-off-by: Josephasafg <ajgard7@gmail.com>	2026-01-12 17:02:38 +00:00
Isotr0py	9dbe1fe960	[Bugfix] Fix missing scale passing for encoder Triton Attention implementation (#32149) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2026-01-12 11:13:41 +00:00
Vadim Gimpelson	e15a5ff07b	[MISC] Add strict contiguity check for FlashInfer attention tensors (#32008) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>	2026-01-10 12:40:05 -08:00
jvlunteren	b8bf5c45bb	[Kernel] Optimize Sliding Window Attention in 3D Triton Kernel (#31984) Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>	2026-01-10 18:13:44 +00:00
Lucas Wilkinson	da6709c9fe	[Misc] Delay deprecation of CommonAttentionMetadata properties (#32074) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-09 21:06:44 -08:00
Lucas Kabela	ea6d067a2a	[Misc][LLaMa4] Compile LLaMa Vision Encoder (#30709) Signed-off-by: Lucas Kabela <lucaskabela@meta.com>	2026-01-09 22:01:38 -05:00
Matthew Bonanni	0308901975	[2/N][Attention] Fix pre-commit errors (#32052) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-10 00:27:15 +00:00
Matthew Bonanni	2612ba9285	[1/N][Attention] Restructure attention: move files (#31916) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2026-01-09 13:10:24 -08:00
R3hankhan	8e27663b6a	[CPU] Add head sizes 80 and 112 with vec16 fallback (#31968) Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>	2026-01-09 22:14:46 +08:00
vllmellm	1a19e9cd87	[Bugfix][ROCm]Fix Qwen3-Next-80B-A3B-Thinking inference and optimize non-standard block size (544) support under rocm_atten (#31380) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-09 19:28:02 +08:00
Rabi Mishra	107cf8e92f	fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN (#31712) Signed-off-by: rabi <ramishra@redhat.com>	2026-01-08 15:46:07 +08:00
Cyrus Leung	b665bbc2d4	[Chore] Migrate V0 attention utils (#31891) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2026-01-07 13:44:36 +00:00
vllmellm	41cfa50632	[ROCm][AITER] fix wrong argument passed to AITER `flash_attn_varlen_func` (#31880) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-07 11:25:03 +00:00
weiyu	e7596371a4	[Refactor][TPU] Remove torch_xla path and use tpu-inference (#30808) Signed-off-by: Wei-Yu Lin <weiyulin@google.com> Signed-off-by: weiyu <62784299+weiyu0824@users.noreply.github.com>	2026-01-07 16:07:16 +08:00
Lucas Wilkinson	c7a79d41a0	[Attention][3/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (#31850) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-07 13:31:34 +08:00
vllmellm	6409004b26	[ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend (#31816) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>	2026-01-07 05:04:53 +00:00
Jack Yang	0a2c2dc3f1	fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround (#31465) Signed-off-by: Zhuohao Yang <zy242@cornell.edu> Co-authored-by: Zhuohao Yang <zy242@cornell.edu> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2026-01-07 04:08:47 +00:00
Lucas Wilkinson	4c73be14e0	[Attention][2/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (#31774) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2026-01-06 17:32:14 +00:00
Lucas Wilkinson	e0327c9db2	[Attention][1/n] Remove usage of deprecated `seq_lens_cpu` and `num_computed_tokens_cpu` CommonAttentionMetadata properties (#31773) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2026-01-06 04:05:17 -08:00
Or Ozeri	d8e38d4939	Triton Attention: Support cross-layers blocks (#30687) Signed-off-by: Or Ozeri <oro@il.ibm.com>	2026-01-05 19:29:16 +00:00
Isotr0py	6aa5b18e1d	[v1] Add encoder-only/cross attention support to Triton Attention backend (#31406) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2026-01-06 00:00:23 +08:00
Kevin McKay	825c2dc133	[Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode (#31282) Signed-off-by: c0de128 <kevin.mckay@outlook.com>	2026-01-01 21:14:00 -08:00
Wentao Ye	357d435c54	[Bug] Fix log issue with `\n` (#31390) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>	2025-12-30 21:16:55 -08:00
yt0428	3f52fa5aa2	[Model] Add support for openPangu moe model (#28775) Signed-off-by: yuantao <2422264527@qq.com> Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2025-12-30 08:11:38 -08:00
Asaf Joseph Gardin	34916ae37f	[Mamba] - Consolidate Mambas Attention Logic (#28133)	2025-12-23 21:57:00 +01:00
Patrick von Platen	3faa8bee57	adapt voxtral (#31095) Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>	2025-12-23 05:31:55 -08:00

495 Commits