Commit Graph

171 Commits

Sage Moore
0edaf752d7 [Attention][DBO] Add support for "splitting" the CommonAttentionMetadata (#21153)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-08-01 19:47:53 -07:00
fhl2000
23322431c8 [V1][CUDA] Full cudagraph support for FlashInfer (#21367) 2025-08-01 21:49:34 -04:00
Michael Goin
f81c1bb055 [Bugfix] Check NVIDIA artifactory is accessible before using flashinfer cubin kernels (#21893) 2025-08-01 08:28:45 -04:00
Mickaël Seznec
e1a7fe4af5 [BugFix] fix: aot passes kvcache dtype information (#19750)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
2025-08-01 05:45:02 +00:00
Michael Goin
61445453df [UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA (#21966)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-30 20:40:34 -07:00
Yong Hoon Shin
ad510309ee Override attention metadata for fast prefill in some KV sharing setups (#21590)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-07-30 08:54:15 -07:00
Chen Zhang
555e7225bc [v1][attention] Support Hybrid Allocator + FlashInfer (#21412)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-07-30 01:45:29 +00:00
elvischenv
58b11b24a6 [Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend (#21525)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-07-29 10:34:00 -04:00
Asaf Joseph Gardin
a6c050286a [v1][mamba] Added mamba_type into MambaSpec (#21715)
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
2025-07-28 08:15:55 +00:00
Maximilien de Bayser
1cd6eaba54 Support encoder-only models without KV-Cache (#21270)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2025-07-26 21:09:52 +08:00
who who who
b3caeb82e7 [ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. (#20295)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com>
2025-07-25 06:50:21 -07:00
weiliang
2dd72d23d9 update flashinfer to v0.2.9rc1 (#21485)
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>
2025-07-24 14:06:11 -07:00
Lucas Wilkinson
61b8cea3b4 [Attention] Optimize FlashInfer MetadataBuilder Build call (#21137)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-24 03:21:46 -07:00
Lucas Wilkinson
304dce7ec0 [Attention] Clean up iRoPE in V1 (#21188)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-07-21 09:10:30 -07:00
Chengji Yao
3a1d8940ae [TPU] support fp8 kv cache quantization (#19292)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-20 03:01:00 +00:00
Woosuk Kwon
752c6ade2e [V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-19 13:53:17 -07:00
Lucas Wilkinson
59f935300c [BugFix] Fix potential cuda-graph IMA (#21196)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-19 02:18:47 -07:00
Lucia Fang
9a9fda1423 [Core] Support Local Chunked Attention for Hybrid KV Cache (#19351)
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@meta.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Lu Fang <fanglu@meta.com>
2025-07-18 20:48:38 -07:00
Lucas Wilkinson
89cab4d01f [Attention] Make local attention backend agnostic (#21093) 2025-07-18 00:10:42 -04:00
elvischenv
8dfb45ca33 [Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel (#21133) 2025-07-18 00:35:58 +00:00
Lucas Wilkinson
76b494444f [Attention] Refactor attention metadata builder interface (#20466)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-17 04:44:25 +00:00
QiliangCui
72ad273582 Remove torch_xla.tpu.version() from pallas.py. (#21065)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-07-17 00:25:26 +00:00
Chengji Yao
85431bd9ad [TPU] fix kv_cache_update kernel block size choosing logic (#21007)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-16 04:39:48 +00:00
Peter Pan
1eb2b9c102 [CI] update typos config for CI pre-commit and fix some spells (#20919)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2025-07-15 21:12:40 -07:00
Elfie Guo
30800b01c2 [Nvidia] Integrate SM100 cudnn prefill API to MLA prefill (#20411)
Signed-off-by: Elfie Guo <elfieg@nvidia.com>
Co-authored-by: Elfie Guo <eflieg@nvidia.com>
2025-07-15 17:56:45 -07:00
Yifei Teng
c586b55667 [TPU] Optimize kv cache update kernel (#20415)
Signed-off-by: Yifei Teng <tengyifei88@gmail.com>
2025-07-15 03:56:43 -07:00
Alexander Matveev
8cdc371217 SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP (#20769)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-07-15 01:06:38 +00:00
Cyrus Leung
e8cc53af5e [Misc] Log the reason for falling back to FlexAttention (#20699)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-14 04:16:51 -07:00
Pavani Majety
7bd4c37ae7 [Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: shuw <shuw@nvidia.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-11 09:23:23 +00:00
nopperl
5d09152ff1 [V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine (#20660)
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com>
2025-07-11 05:53:31 +00:00
Alexander Matveev
5b032352cc [Attention] MLA - Flashinfer Ragged Prefill (#20034) 2025-07-10 20:17:47 -07:00
Tuan, Hoang-Trong
47043eb678 [Kernel] Triton implementation of causal-conv1d for Mamba-based models (#18218)
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-07-09 12:53:55 -07:00
Akash kaothalkar
6db31e7a27 [Hardware][PPC64LE] Enable V1 for ppc64le and ARM (#20554)
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com>
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
2025-07-08 20:00:41 -07:00
Chenyaaang
e34d130c16 [TPU] Temporary fix vmem oom for long model len by reducing page size (#20278)
Signed-off-by: Chenyaaang <chenyangli@google.com>
2025-07-08 05:16:16 +00:00
Li, Jiang
7721ef1786 [CI/Build][CPU] Fix CPU CI and remove all CPU V0 files (#20560)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-07 22:13:44 -07:00
Cyrus Leung
9fb52e523a [V1] Support any head size for FlexAttention backend (#20467)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-06 09:54:36 -07:00
Isotr0py
32c9be2200 [v1] Re-add fp32 support to v1 engine through FlexAttention (#19754)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-07-05 09:41:10 +00:00
Jee Jee Li
1caca5a589 [Misc] Add SPDX-FileCopyrightText (#20428)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-04 07:40:42 +00:00
Nicolò Lucchesi
8d775dd30a [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning (#20400)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-07-03 14:56:09 -07:00
vllmellm
a1aafc827a [ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) (#20254)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-07-02 16:25:46 +00:00
Chengji Yao
7da296be04 [TPU] kv cache update kernel supports dynamic grid (#20235)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-02 06:33:37 +00:00
Liangliang Ma
a0389e0554 [UT][intel GPU] use current_platform instead of device hardcode in v1 tests (#20169)
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
2025-07-02 09:06:04 +08:00
Woosuk Kwon
8acb4badee [CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling (#20301)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-01 09:07:36 -07:00
TY-AMD
96453cfa83 [BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine (#19067)
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com>
2025-07-01 16:12:19 +08:00
Chendi.Xue
dec197e3e5 Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn (#20143)
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
2025-06-27 05:48:13 +00:00
Chengji Yao
04e1642e32 [TPU] add kv cache update kernel (#19928)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-06-26 10:01:37 -07:00
Kunshang Ji
b69781f107 [Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. (#19560)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-06-26 09:27:18 -07:00
TJian
27c065df50 [Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) (#19904)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-06-26 12:42:31 +00:00
Chenyaaang
2d7620c3eb [TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN (#19919)
Signed-off-by: Chenyaaang <chenyangli@google.com>
2025-06-25 15:51:02 -07:00
Chengji Yao
2cc2069970 [TPU][Bugfix] fix kv cache padding (#20048)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-06-25 21:24:10 +00:00