biondizzle/vllm - vllm

Author	SHA1	Message	Date
rasmith	711241c13c	[CI/Build] Fix illegal memory access and unsupported test in kernels/attention/test_cache.py (#29118 ) Signed-off-by: Randall Smith <ransmith@amd.com> Co-authored-by: Randall Smith <ransmith@amd.com>	2025-11-21 10:58:38 -05:00
rasmith	5e5a7eb16f	[CI/Build] Make test_attention_selector.py run tests on correct platform (#29064 ) Signed-off-by: Randall Smith <ransmith@amd.com> Signed-off-by: rasmith <Randall.Smith@amd.com> Co-authored-by: Randall Smith <ransmith@amd.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-11-20 20:45:56 +00:00
rasmith	3d84ef9054	[CI/Build][AMD] Skip if flash_attn_varlen_func not available in test_aiter_flash_attn.py (#29043 ) Signed-off-by: Randall Smith <ransmith@amd.com> Co-authored-by: Randall Smith <ransmith@amd.com>	2025-11-20 20:39:49 +00:00
Vensen	fb8851f254	[Bugfix][cache_kernels]: Fix OOB in cache_kernels.cu (#28760 ) Signed-off-by: vensen <vensenmu@gmail.com> Signed-off-by: Vensenmu <vensenmu@gmail.com>	2025-11-20 02:52:02 -08:00
rasmith	322cb02872	[CI/Build][AMD] Fix import errors in tests/kernels/attention (#29032 ) Signed-off-by: Randall Smith <ransmith@amd.com> Co-authored-by: Randall Smith <ransmith@amd.com>	2025-11-20 17:48:09 +08:00
Alexander Matveev	3aaa94ac99	[Performance] Reduce DeepGEMM N dim restriction from 128 to 64 multiplier (#28687 ) Signed-off-by: Alexander Matveev <amatveev@redhat.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: mgoin <mgoin64@gmail.com>	2025-11-19 15:47:13 -08:00
Shu Wang	613abb50d5	[MoE] Nvfp4 Masked Gemm: Add flashinfer grouped_gemm_nt_masked (#25990 ) Signed-off-by: Shu Wang. <shuw@nvidia.com> Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2025-11-19 13:29:06 -08:00
Ryan Rock	68d7231991	[CI/Build] Fix test_prefix_prefill for AMD (#28905 ) Signed-off-by: Ryan Rock <ryan.rock@amd.com>	2025-11-19 16:04:36 -05:00
Qiu	2fd893b4ce	[Feature] Prefill Context Parallel (PCP) basic support (#28718 ) Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com> Signed-off-by: FENP <yuanyongjie.yyj@antgroup.com> Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com> Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: FENP <yuanyongjie.yyj@antgroup.com> Co-authored-by: LookAround <lixushi@huawei.com> Co-authored-by: Jingchun Gao <gaojingchun1@huawei.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: Jingchun Gao <63247409+gjc0824@users.noreply.github.com>	2025-11-19 15:52:44 -05:00
Harry Mellor	a8b70304d6	Update `rope_scaling` to `rope_parameters` in preparation for Transformers v5 (#28542 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-11-19 09:06:36 -08:00
Matthew Bonanni	4c23690f43	[Attention] FlashAttention ViT support, make default backend (#28763 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2025-11-18 20:06:21 -08:00
Kunshang Ji	2a2d5d2780	Replace `torch.cuda.Event` with `torch.Event` for better hardware compatibility (#26985 ) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>	2025-11-18 11:34:36 -08:00
amirkl94	03ee48111d	Feature: Support Relu2 in FusedMoE fp8 cutlass path (#27261 )	2025-11-16 13:39:44 -05:00
Cyrus Leung	638e4196d1	[Misc] Make `SchedulerConfig.max_model_len` init-only (#28733 ) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>	2025-11-15 01:59:31 -08:00
Varun Sundar Rabindranath	6965ef436f	[Performance][DeepGEMM] Estimate expected_m (#28694 ) Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>	2025-11-15 13:52:14 +08:00
Lucas Wilkinson	db56a59970	[BugFix] Fix FA3 IMA with FULL_AND_PIECEWISE and cascade attention (default) (#28702 )	2025-11-14 12:19:22 +00:00
Varun Sundar Rabindranath	fe1cd7704d	[Performance][B200] silu_mul_quant: pack scales in int32 (#28358 ) Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>	2025-11-13 10:16:55 -08:00
wangxiyuan	2dacd57394	[platform] Move get_cu_count to utils (#27005 ) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>	2025-11-13 08:48:47 +08:00
TJian	edb59a9470	[ROCm] [Bugfix] Fix `fused_qknorm_rope_kernel` rocm compatibility (#28500 ) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>	2025-11-12 05:01:14 -08:00
Andreas Karatzas	9f0247cfa4	`VLLM_USE_TRITON_FLASH_ATTN` V0 variable deprecation (#27611 ) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Andreas Karatzas <Andreas.Karatzas@amd.com>	2025-11-11 18:34:36 -08:00
Li, Jiang	7f829be7d3	[CPU] Refactor CPU attention backend (#27954 ) Signed-off-by: jiang1.li <jiang1.li@intel.com>	2025-11-12 09:43:06 +08:00
Zhewen Li	e553424919	[CI/Build] Refactor Attention backend for test_prefix_prefill from xformers to SDPA (#28424 ) Signed-off-by: zhewenli <zhewenli@meta.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roger Wang <hey@rogerw.io>	2025-11-12 01:09:47 +08:00
zhrrr	68c09efc37	[Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (#27165 ) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>	2025-11-11 12:00:31 -05:00
bnellnm	a1448b4b69	[Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (#28064 )	2025-11-11 07:29:02 -07:00
Matthew Bonanni	b30dfa03c5	[Attention] Refactor CUDA attention backend selection logic (#24794 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2025-11-11 07:40:44 -05:00
Matthew Bonanni	0bf29fadf5	[Test] Remove old non-varlen FA2 test (#28420 ) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>	2025-11-10 23:57:41 +00:00
Varun Sundar Rabindranath	b039bfda8f	[Bugfix] Fix persistent_masked_m_silu_mul_quant tests (#28366 ) Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>	2025-11-10 09:21:52 -08:00
vllmellm	f080a83511	[RFC][ROCm][AITER] Keep all AITER kernels in `_aiter_ops` class like `_custom_ops` and `_ipex_ops` (#24490 ) Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2025-11-10 08:20:53 -08:00
Lucas Wilkinson	e8697faf03	[V0 deprecation] Remove no longer used `get_metadata_cls` (#28370 ) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>	2025-11-10 14:32:09 +08:00
ElizaWszola	171133f929	[Bugfix] Fix test fused quant layernorm tests (#27865 ) Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2025-11-08 14:31:33 -08:00
Harry Mellor	811df41ee9	Update Flashinfer from `v0.4.1` to `v0.5.2` (#27952 ) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>	2025-11-07 16:24:42 -08:00
Pavani Majety	72b1c2ae2c	[Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (#27439 ) Signed-off-by: Pavani Majety <pmajety@nvidia.com>	2025-11-07 04:18:39 -08:00
Pleaplusone	6cae1e5332	[ROCm][MLA] Support block-size > 1 for AITER MLA backend (#27224 ) Signed-off-by: ganyi <ygan@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com>	2025-11-05 10:43:02 -05:00
amirkl94	6b7a81185d	Bugfix: Cutlass FP8 FusedMoE bad scaling factors (#27255 ) Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Michael Goin <mgoin64@gmail.com>	2025-11-05 06:06:06 -05:00
Asaf Joseph Gardin	00b31a36a2	[V1] [Hybrid] Mamba1 Automatic Prefix Caching (#26377 ) Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>	2025-11-02 04:16:23 -08:00
Fardin Hoque	b8c48c5d72	kernels/moe test pruning (#27053 ) Signed-off-by: Fardin Hoque <kfhfar@amazon.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>	2025-10-30 12:10:34 +08:00
bnellnm	1891cf605a	[Bugfix] Fix modular kernel tests (#27707 ) Signed-off-by: Bill Nell <bnell@redhat.com>	2025-10-29 16:14:33 +08:00
Yeshwanth N	71b1c8b667	[Chore]:Extract math and argparse utilities to separate modules (#27188 ) Signed-off-by: Yeshwanth Surya <yeshsurya@gmail.com> Signed-off-by: Yeshwanth N <yeshsurya@gmail.com> Signed-off-by: yeshsurya <yeshsurya@gmail.com>	2025-10-26 04:03:32 -07:00
Xiangyu Li	5cc6bddb6e	[Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm (#26092 )	2025-10-23 23:26:13 -04:00
Jonathan Chen	ca76486a16	[Chore] Separate out `vllm.utils.platform_utils.py` (#27374 ) Signed-off-by: Jonathan <chenleejonathan@gmail.com>	2025-10-23 19:08:06 +00:00
Varun Sundar Rabindranath	a9f55dc588	[Misc] Add triton_kernels dependency (#27370 ) Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com> Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>	2025-10-23 12:04:14 -07:00
dongbo910220	a0003b56b0	[Chore] Separate out system utilities from vllm.utils (#27201 ) Signed-off-by: dongbo910220 <1275604947@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2025-10-22 20:25:25 +00:00
dongbo910220	3ae082c373	[Chore] Separate out optional dependency checks from vllm.utils (#27207 ) Signed-off-by: dongbo910220 <1275604947@qq.com> Signed-off-by: dongbo910220 <32610838+dongbo910220@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2025-10-22 10:44:21 -04:00
Lain	09a7e6f617	[Deepseek v3.2] Remove extra logics in indexer (#26465 ) Signed-off-by: Siyuan Fu <siyuanf@nvidia.com> Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com> Signed-off-by: Lain <siyuanf@nvidia.com> Co-authored-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-10-21 23:34:03 +00:00
Daniel Cámpora	80e9452984	[Deepseek v3.2] Optimize top_k_per_row (#26763 ) Signed-off-by: Daniel Campora <961215+dcampora@users.noreply.github.com>	2025-10-21 08:30:07 +00:00
iAmir97	7a6c8c3fa1	[Chore] Separate out `vllm.utils.network_utils` (#27164 ) Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com> Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com>	2025-10-19 03:06:32 -07:00
Isotr0py	6ac5e06f7c	[Chore] Clean up pytorch helper functions in `vllm.utils` (#26908 ) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: isotr0py <2037008807@qq.com>	2025-10-18 09:48:22 -07:00
iAmir97	1d165d6d85	[Chore] Separate out `vllm.utils.mem_utils` (#27143 ) Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com> Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com> Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-10-18 10:06:59 +00:00
Isotr0py	3125d79950	[Chore] Remove unused `PolyNorm` layer (#27110 ) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>	2025-10-17 19:03:43 +00:00
Luka Govedič	bd7157a071	[torch.compile] Enable attention and allreduce fusion without custom ops enabled (#24604 ) Signed-off-by: Luka Govedič <lgovedic@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>	2025-10-17 08:10:23 -06:00

578 Commits