Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

10546f925a [Bugfix] Fix mm budget setting for Qwen Omni models (#33634) Roger Wang 2026-02-02 20:56:25 -08:00
e69c990c21 [Feature][CPU Backend]: Optimize ARM vectorization backend (#30329) Radu Salavat 2026-02-03 04:17:56 +00:00
5eac9a1b34 [torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding (#33624) Richard Zou 2026-02-02 19:38:49 -08:00
1b60b45d0d [CI/Build] add directions for CPU image upload to Docker Hub (#32032) Nathan Weinberg 2026-02-02 21:48:06 -05:00
4b3803d180 [BugFix] DPMetadata raises assert error for dense model (#32739) Dezhan 2026-02-02 16:56:44 -08:00
31a64c63a8 [Release] Fix format and cherry-pick (#33618) Zhewen Li 2026-02-02 16:19:05 -08:00
57eae2f891 [Release] patch step3p5 attention class in v0.15.1 release (#33602) Zhewen Li 2026-02-02 14:54:08 -08:00
5019c59dd2 [Voxtral Realtime] Introduce global log mel max (#33574) Patrick von Platen 2026-02-02 23:01:47 +01:00
089cd4f002 fix cutlass_3x_gemm_fp8_blockwise on sm103a (#32224) Lain 2026-02-02 11:47:46 -08:00
0130223bd9 fix memory for online fp8 quantization with streaming weight load (#31914) Vasiliy Kuznetsov 2026-02-02 14:17:42 -05:00
5d1aef3004 [UX] Format attention backend log line (#33570) Matthew Bonanni 2026-02-02 13:57:12 -05:00
f0d005864a [Fix] prefix cache hit rate == 0 bug with gpt-oss style models (#33524) Yifan Qiao 2026-02-01 17:59:58 -08:00
ffe1fc7a28 Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. (#32005) yugong333 2026-02-02 09:30:06 -08:00
8b7346d5f1 Update huggingface-hub again (#33567) Harry Mellor 2026-02-02 17:20:54 +00:00
6141ebe0dd Remove incorrect tokenizer info test (#33565) Harry Mellor 2026-02-02 17:11:44 +00:00
199e3cb476 [Model] Use mm_position to compute mrope positions for GLM-4.xV (#33039) Yang Liu 2026-02-02 08:55:48 -08:00
9f8cb81b44 [CI] Add DeepSeek V3.2 nightly eval (#33566) Matthew Bonanni 2026-02-02 11:10:02 -05:00
d7e17aaacd [Refactor] Move profiling methods to MM budget (#33559) Cyrus Leung 2026-02-02 23:27:00 +08:00
528e9b1490 [Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series (#33540) Kebe 2026-02-02 23:55:46 +09:00
d95b4be47a move spec decode slow test to test_areas.yaml (#33365) shanjiaz 2026-02-02 09:28:36 -05:00
4061dcf4c5 [Bugfix] Enable Kimi k25 processor test (#33562) Isotr0py 2026-02-02 22:25:25 +08:00
0aca8b8c62 [MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) (#32790) danielafrimi 2026-02-02 16:18:50 +02:00
9eb58f8cf1 fix[ROCm]: Remove unconditional aiter import (#32902) Rabi Mishra 2026-02-02 19:40:02 +05:30
b10d05b8a8 [Model] Use explicit types in get_generation_prompt (#33551) Cyrus Leung 2026-02-02 20:38:49 +08:00
b398e5c819 Update get_expert_mapping to include self parameter (#33525) Borushiki 2026-02-02 13:29:07 +01:00
78061ef584 Fix accessing hidden_act from model config (#32686) Grzegorz K. Karch 2026-02-02 12:11:33 +01:00
528b3076af [CI][Bugfix] Fix flaky tests/v1/kv_connector/unit/test_multi_connector.py::test_multi_example_connector_consistency (#33555) Nicolò Lucchesi 2026-02-02 12:01:29 +01:00
a502831d36 [Chore] Remove redundant input parsing methods (#33542) Cyrus Leung 2026-02-02 18:50:47 +08:00
94cbe0a328 [Nightly CI] Remove CT Model (#33530) Robert Shaw 2026-02-01 22:09:09 -05:00
8b45c58fe9 [Models] Step-3.5-Flash (#33523) csy0225 2026-02-02 10:21:18 +08:00
ba871fb788 [Misc] support arbitrary MM datasets in spec dec bench (#33486) Komal Kumar Teru 2026-02-02 14:19:48 +05:30
c7039a80b8 pin LMCache to v0.3.9 or greater with vLLM v0.15.0 (#33440) Greg Pereira 2026-01-31 19:50:38 -08:00
15ebd0cedf fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels (#33417) René Honig 2026-01-31 23:06:42 +01:00
2915268369 [fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops (#33441) Luka Govedič 2026-01-31 09:48:34 -05:00
d984d664cc [BugFix] Fix whisper FA2 + full cudagraphs (#33360) Lucas Wilkinson 2026-01-30 21:15:06 -07:00
5f45b0b7e0 [Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 (#33366) Gregory Shtrasberg 2026-01-30 19:05:23 -06:00
a2dba556db [release] Minor fixes to release annotation and wheel upload (#33129) Kevin H. Luu 2026-01-29 12:09:35 -08:00
6ff16b77f8 [Bugfix] Enable Triton MoE for FP8 per-tensor dynamic (#33300) Michael Goin 2026-01-29 15:15:17 -05:00
1ed963d43a [Bugfix] Fix Qwen3-VL-Reranker load. (#33298) wang.yuqi 2026-01-29 16:42:53 +08:00
39e8b49378 [Bugfix] Register fp8 cutlass_group_gemm as supported for only SM90+SM100 (#33285) Michael Goin 2026-01-28 21:40:59 -05:00
ab374786c7 [CPU][IBM Z][Dockerfile] Fix IBM Z builds (#33243) R3hankhan 2026-02-02 13:11:29 +05:30
808dd87b30 [Model] Support DeepSeek-OCR-2 (#33165) RED 2026-02-02 14:24:10 +08:00
beb8899482 Fix mistral sliding window parsing (#33521) Andy Lo 2026-02-02 05:08:04 +00:00
ce88756b96 [Doc]: update paths for Offline/Online/Others example sections (#33494) Sawyer Bowerman 2026-02-01 22:56:53 -05:00
a3154a6092 [Doc] add missing model entries in supported_models.md (#33220) Paco Xu 2026-02-02 11:37:25 +08:00
7c036432fc [Bugfix] GLM-4 tool parser: incremental string streaming (#33218) jack 2026-02-02 11:13:31 +08:00
318b120766 [Nightly CI] Remove CT Model (#33530) Robert Shaw 2026-02-01 22:09:09 -05:00
c3b40dc3e7 [Models] Step-3.5-Flash (#33523) csy0225 2026-02-02 10:21:18 +08:00
a01ef3fa51 [Fix] prefix cache hit rate == 0 bug with gpt-oss style models (#33524) Yifan Qiao 2026-02-01 17:59:58 -08:00
7320ca3942 Add unpermute-aware fused MoE LoRA path (#32655) Runkai Tao 2026-02-01 20:46:09 -05:00
cf0a99f84d [ModelRunner V2] Support spec decode with structured outputs (#33374) Nick Hill 2026-02-01 16:19:59 -08:00
e535d90deb [ModelRunner V2] Misc minor simplifications and optimizations (#33467) Nick Hill 2026-02-01 14:17:14 -08:00
0b225fb7b2 [Misc] skip target model mm emb in draft proposal step when draft is text-only (#33437) Komal Kumar Teru 2026-02-02 02:43:35 +05:30
46b4a02794 Fix DeepSeek V2 RoPE initialization error (#33501) will b. 2026-02-01 15:00:56 -06:00
8869cd8ec1 Add MoE config for Super B200 TP2 (#33510) shaharmor98 2026-02-01 20:48:37 +02:00
cd86fff38f [BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling allow inference Omni on ROCM (#33077) JartX 2026-02-01 14:36:25 +01:00
b5f8c3092d [W8A8 Block Linear Refactor][1/N] Keep all quantization types into QuantFP8 class. (#33047) Maral 2026-02-01 17:28:01 +08:00
21997f45b1 [Redo] #33110 with threading limit (#33502) Cyrus Leung 2026-02-01 17:18:11 +08:00
672023877b Change defaults for vllm bench startup (#33489) Luka Govedič 2026-02-01 02:46:01 -05:00
754a8ca942 fix: only include Authorization header when OPENAI_API_KEY is set (#33488) Zack Yu 2026-01-31 23:35:09 -08:00
302ecf64ff [Models]: lfm2_siglip2 return intermediate encoder layers (#33370) Eduardo Salinas 2026-02-01 01:17:49 -05:00
b6bb2842cf [Critical] Revert #33110 (#33500) Cyrus Leung 2026-02-01 13:06:42 +08:00
79b6ec6aab [Bugfix] Fix inconsistent handling of cache reset (#33481) Cyrus Leung 2026-02-01 12:23:41 +08:00
d6416fdde9 pin LMCache to v0.3.9 or greater with vLLM v0.15.0 (#33440) Greg Pereira 2026-01-31 19:50:38 -08:00
0fb3157267 [ROCm][CI] Update huggingface-hub pin (#33492) Andreas Karatzas 2026-01-31 20:51:54 -06:00
a358e4dffe [Refactor] Make Renderer an abstract class (#33479) Cyrus Leung 2026-02-01 10:36:30 +08:00
079781177a fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels (#33417) René Honig 2026-01-31 23:06:42 +01:00
63c0889416 [Misc] Fix flashinfer related tests (#33462) Roy Wang 2026-02-01 05:10:24 +08:00
1e86c802d4 Fix grammar (#33121) smashyalts 2026-01-31 18:59:34 +01:00
fedf64332e [Bugfix]: Fix display errors in TORCH_CHECK messages (#32942) linhaifeng 2026-02-01 01:48:48 +08:00
2238a12c13 [Misc] support collect_env for endpoint /server_info (#33246) Xiao Yang 2026-02-01 01:42:59 +08:00
ce0afe2451 Update huggingface-hub pin for the last time before Transformers v5 (#33473) Harry Mellor 2026-01-31 17:14:24 +00:00
88c3e114d8 [Refactor] Move MM data parsing outside processor (#33408) Cyrus Leung 2026-02-01 00:46:14 +08:00
92924b2ddd [Deprecation] Remove deprecated items related to pooling (#33477) Cyrus Leung 2026-02-01 00:44:40 +08:00
27cb2f678f [Bugfix] Early-reject requests with MM data longer than encode cache capacity (#33110) YunzhuLu 2026-02-01 00:41:13 +08:00
22d9a056d5 Support clear mm and encoder cache (#33452) jma99_2333 2026-01-31 07:22:25 -08:00
13b842f271 [BugFix][Router Replay] Capture Logical Experts with EPLB (#33013) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2026-01-31 17:12:17 +02:00
15f40b20aa [fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops (#33441) Luka Govedič 2026-01-31 09:48:34 -05:00
793af538a3 [Doc] Update plugin deprecation notices (#33476) Cyrus Leung 2026-01-31 22:48:28 +08:00
6f5e7cda57 support return prompt token ids in responses (#33378) cmunley1 2026-01-31 06:04:20 -08:00
68feb76a6f [Misc] Replace deprecated interface seed_everything (#33474) Roy Wang 2026-01-31 21:38:39 +08:00
4cb59dea6a [Bugfix] Fix incompatibility between #33372 and #32863 (#33475) Cyrus Leung 2026-01-31 21:21:32 +08:00
608b556507 [ez] Add structured torch.compile logs (#33213) Angela Yi 2026-01-31 05:00:54 -08:00
f0a1c8453a [Frontend] Use new Renderer for Completions and Tokenize API (#32863) Cyrus Leung 2026-01-31 20:51:15 +08:00
8980001c93 [perf] v1/spec_decode: skip softmax for all-greedy rejection sampling (#32852) caozuoba 2026-01-31 17:51:26 +08:00
527bcd14d4 [ROCM] Enable aiter attn backend for qwen3-next model (#32492) jennyyyyzhen 2026-01-31 01:03:57 -08:00
f68e3ea4e1 [BugFix] Add synchronize in CutlassW4A8LinearKernel to ensure data is ready for use. (#33078) Jinwu 2026-01-31 00:14:54 -08:00
d5c41db35b [Kernel] [Helion] [3/N] Helion kernel registry (#33203) Yanan Cao 2026-01-30 23:38:46 -08:00
1618e25492 [CPU][Feat] Enable KleidiAI accelerated int4 dynamic quant with BF16 activations on Arm CPUs (#33122) Fadi Arafeh 2026-01-31 07:16:22 +00:00
f3888aca83 Add EAGLE3 support for AFMoE (#33111) AutumnAurelium 2026-01-30 22:53:08 -08:00
f0bca83ee4 Add support for Mistral Large 3 inference with Flashinfer MoE (#33174) Dimitrios Bariamis 2026-01-31 07:48:27 +01:00
73419abfae [Bugfix] Handle Asym W4A16 (ConchLinearKernel) for CT (#33200) Matthias Gehre 2026-01-31 07:21:51 +01:00
e77f162cf5 [Bugfix] Fix Qwen3ASR language asr tag in output (#33410) Nicolò Lucchesi 2026-01-31 06:24:49 +01:00
8ecd213c0b [Kernel] [Helion] [2/N] Helion kernel wrapper (#32964) Yanan Cao 2026-01-30 20:53:01 -08:00
5b55c0bea7 [Attention] Clarify comment explaining attn_logits +1 dimension (#33427) Francesco Fusco 2026-01-31 05:50:30 +01:00
15e0bb9c42 [Streaming -> Realtime] Rename all voxtral related classes, fn, files (#33415) Patrick von Platen 2026-01-31 05:49:00 +01:00
6c64c41b4a [ROCm][CI] Force max_num_seqs=1 on ROCm In test_sharded_state_loader to reduce flakiness (#33277) Micah Williamson 2026-01-30 22:28:29 -06:00
a2ef06e1b3 [Misc] offest -> offset in comments and variable names (#33444) Russell Bryant 2026-01-30 23:19:22 -05:00
0a3c71e7e5 [BugFix] Fix whisper FA2 + full cudagraphs (#33360) Lucas Wilkinson 2026-01-30 21:15:06 -07:00
29fba76781 [UX] Use gguf repo_id:quant_type syntax for examples and docs (#33371) Michael Goin 2026-01-30 23:14:54 -05:00
