Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

1838cd4860 Revert "Add batch invariant kernel override for FlashInfer backend [2/n]" (#26220) Cyrus Leung 2025-10-04 17:45:08 +08:00
7d6b03381e [CI Failure] fix_test_auto_prefix_cache_support (#26053) Huamin Li 2025-10-04 02:44:49 -07:00
7c2e91c4e0 [Misc] Remove unused executor.apply_model (#26215) Cyrus Leung 2025-10-04 16:45:53 +08:00
736fbf4c89 [Misc] Require merge_by_field_config argument (#26214) Cyrus Leung 2025-10-04 16:40:14 +08:00
44ea85137a [Model] Support nested structures for TensorSchema (#26212) Cyrus Leung 2025-10-04 16:20:32 +08:00
d3d649efec Support expert parallel in Transformers backend (#26162) Harry Mellor 2025-10-04 05:35:04 +01:00
ea507c3a93 [V1] [Hybrid] Mamba2 Automatic Prefix Caching (#25752) Stan Wozniak 2025-10-04 06:34:22 +02:00
9705fba7b7 [cpu][perf] Accelerate unquantized-linear for AArch64 through oneDNN/ACL and weight prepack (#25948) Fadi Arafeh 2025-10-04 05:16:38 +01:00
2f7dbc9b42 Add batch invariant kernel override for FlashInfer backend [2/n] (#25769) Bram Wasti 2025-10-03 21:49:30 -05:00
ea25a76c05 [BugFix] Use async Mistral Tokenizer in Chat Completions (#26134) Ben Browning 2025-10-03 21:42:08 -04:00
67bc0c003e [Bugfix] Fix qwen3 vl dummy data generation with overrides (#26193) Roger Wang 2025-10-03 18:40:20 -07:00
5a05f26603 Fix issue of using only the part of video frame [Nemotron Nano] (#26186) Eugene Khvedchenya 2025-10-04 03:21:00 +03:00
7ef40bb983 [GPTOSS][DP/EP][Marlin] Enable GPTOSS DP/EP using Marlin kernels (#25488) Varun Sundar Rabindranath 2025-10-03 20:13:13 -04:00
767cbb011d [CI] Fix Pre-commit Mypy Error (#26181) Wentao Ye 2025-10-03 19:08:03 -04:00
7cfa4b24bf [BugFix] Fix de-functionalization pass for rotary_embedding (#23953) Angela Yi 2025-10-03 15:44:18 -07:00
b71fcd4905 [Misc] Add penalties sampling parameters to serve tool (#25974) Sergei Skvortsov 2025-10-03 23:43:14 +01:00
75003f34e8 [CI] Push multiarch manifests as nightly builds (#25764) Sahithi Chigurupati 2025-10-03 15:42:55 -07:00
78b8015a4d [Bugfix] Relax tokenizer regex for mixtral to include 'tokenizer.model' (#25964) Bowen Bao 2025-10-03 15:31:59 -07:00
831b124151 [responsesAPI] add better error messaging for long prompts (#25724) Andrew Xia 2025-10-03 14:33:13 -07:00
c1ffcb55da [Refactor] Optimize FP8 MOE Backend Choice and Log (#26044) Wentao Ye 2025-10-03 17:23:42 -04:00
0879736aab [Perf] Remove hardcoded num_warps=1 (#26183) Corey Lowman 2025-10-03 16:38:50 -04:00
a26917332f [Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix K/V scale loading for MLA Attn (#25968) Pavani Majety 2025-10-03 12:35:06 -07:00
cd9e5b8340 Fix V1 engine serialization error with Ray distributed executor (#26148) Nikhil G 2025-10-03 11:39:45 -07:00
300a59c4c3 Avoid division by zero in cache DS MLA kernel (#26174) Matthew Bonanni 2025-10-03 13:35:17 -04:00
d76541a6c5 Stop mergify from keeping stale PRs alive (#26169) Harry Mellor 2025-10-03 17:42:34 +01:00
dd96465fd7 [BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduced by #24642 (#26123) Chendi.Xue 2025-10-03 10:52:26 -05:00
4f8f47e87e Fix undefined symbol: cutlass_moe_mm_sm100 (#26098) Jun Jiang 2025-10-03 23:48:32 +08:00
d78fda7cda [Renderer] Move Processor out of LLMEngine (#26165) Cyrus Leung 2025-10-03 23:08:22 +08:00
73a99cc2a5 [Model] Fixed stream generator for gpt-oss + spec-decoding (#26027) Aleksandr Samarin 2025-10-03 15:43:41 +02:00
adae0c1f43 [CI/Build] do not enforce precompilation on tpu ci tests (#25992) Xiang Si 2025-10-03 06:38:42 -07:00
cbf9221992 [Model] Supplement to PR 24862: Pass param prefix to LLMHead (#25805) whx 2025-10-03 21:34:53 +08:00
5f42fc53b6 [backends][short_conv] CUDA graph piecewise edits (#24215) Paul Pak 2025-10-03 21:59:48 +09:00
8ee846c27c [Bugfix] Re-enable prefill of max model length (#24446) Yannick Schnider 2025-10-03 14:13:34 +02:00
812b7f54a8 [Renderer] Move Processor out of AsyncLLM (#24138) Yang Liu 2025-10-03 04:29:45 -07:00
5f2cacdb1e Quick fix for IMA with the Prefix Prefill kernel during graph capture (#25983) Sage Moore 2025-10-03 04:28:22 -07:00
aa5053e3fe [Doc] Fixed shape description for fused_batched_moe.py (#25668) Egor 2025-10-03 13:00:23 +02:00
79aa244678 [Multi Modal] Configurable MM Profiling (#25631) Wenlong Wang 2025-10-03 03:59:10 -07:00
2ed3f20dba [openai] Fix missing tool usage check (system message) (#24768) kyt 2025-10-03 19:55:44 +09:00
48f309029a [NIXL][Misc] Expose metrics from NIXL for logging to CLI (#25388) Nicolò Lucchesi 2025-10-03 12:47:59 +02:00
0e93ac0b3a [CI] Fix distributed hybrid tests in CI (#26155) Thomas Parnell 2025-10-03 11:14:18 +02:00
5446ad1d24 [test utils] correct wrong typing (#26159) Yannick Schnider 2025-10-03 11:11:49 +02:00
f9a8084e48 [Model] Use merge_by_field_config for MM models (InternVL family) (#26153) Cyrus Leung 2025-10-03 16:59:06 +08:00
3e70e3d4d5 add(v1): RequestStatesStats to RequestOutput (#24947) HUIJONG JEONG 2025-10-03 17:56:25 +09:00
eb0fa43868 [Perf] Optimize reshape_and_cache CUDA Kernel (#25955) Jiangyun Zhu 2025-10-03 16:33:46 +08:00
0ad9951c41 [Input] Remove unused prompt field (#26097) Cyrus Leung 2025-10-03 15:23:21 +08:00
8c9117181d [Misc] Remove typing.List (#26150) Varun Sundar Rabindranath 2025-10-03 03:00:33 -04:00
c4b48d3c0f [BUG] Reorder model config creation (#26124) ahao-anyscale 2025-10-02 23:59:36 -07:00
10d765482d FusedMoE support for the Transformers backend (#22650) Harry Mellor 2025-10-03 07:12:15 +01:00
39b643dc1a [Model] Use merge_by_field_config for MM models (G) (#26117) Cyrus Leung 2025-10-03 13:38:29 +08:00
711f485643 [Bugfix] Fix import gemm_afp4wfp4 failure on AMD (#26068) Zhewen Li 2025-10-02 22:37:25 -07:00
9c5ee91b2a [ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (#26104) TJian 2025-10-02 22:34:53 -07:00
f71952c1c4 [Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (#26103) (tag: v0.11.0rc6) Tyler Michael Smith 2025-10-03 01:21:01 -04:00
d1007767c5 [Bugfix] Disable cascade attention with FlashInfer (#26130) Michael Goin 2025-10-02 19:30:37 -04:00
27edd2aeb4 [Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (#26103) Tyler Michael Smith 2025-10-03 01:21:01 -04:00
e5017cd6d6 [gpt-oss] disable tool server initialization if no tool in request (#25790) Andrew Xia 2025-10-02 22:08:35 -07:00
6a7796e871 [Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (#26144) Benjamin Chislett 2025-10-03 00:00:20 -04:00
47b9339546 [DeepSeek] Improve performance of DS MLA cache kernel (#26132) Matthew Bonanni 2025-10-02 23:35:47 -04:00
5d5146eee3 [CI/Build] Conditionally register cutlass_fp4_group_mm to fix building on Hopper (#26138) Michael Goin 2025-10-02 23:32:38 -04:00
2aaa423842 [Attention] Move Backend enum into registry (#25893) Matthew Bonanni 2025-10-02 23:32:24 -04:00
ad2d788016 [Bug][Benchmark] Fix duplicate req in oversampling (#26140) Ekagra Ranjan 2025-10-02 22:55:24 -04:00
36ce76c632 [Log] Optimize DeepGEMM Missing Log (#26106) Wentao Ye 2025-10-02 22:02:26 -04:00
f1fc2107a3 [Bugfix] Disable cascade attention with FlashInfer (#26130) Michael Goin 2025-10-02 19:30:37 -04:00
13cdc02173 Fix MTP with deepep_low_latency (#25904) Matthew Bonanni 2025-10-02 17:29:49 -04:00
502640c3f9 [Perf] Fix and reapply move apply w8a8 block fp8 linear to class (#25696) ElizaWszola 2025-10-02 21:35:13 +02:00
3d5f1c8640 [Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MTP (#25119) Chen Zhang 2025-10-02 11:48:31 -07:00
1cab2f9cad EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench (#25916) Ekagra Ranjan 2025-10-02 14:29:35 -04:00
c75c2e70d6 [Deepseek v3.2] Support indexer prefill chunking (#25999) (tag: v0.11.0rc5) Chen Zhang 2025-10-02 10:29:12 -07:00
9d9a2b77f1 [Small] Prevent bypassing media domain restriction via HTTP redirects (#26035) Chenheli Hua 2025-10-02 10:27:10 -07:00
6040e0b6c0 [BugFix] Fix FI accuracy issue when used for MLA prefill (#26063) Lucas Wilkinson 2025-10-02 13:18:13 -04:00
05bf0c52a1 Update base image to 22.04 (jammy) (#26065) Huy Do 2025-10-02 05:48:04 -07:00
c536881a7c [BugFix] ChunkedLocalAttention is currently not CG compatible (#26034) Lucas Wilkinson 2025-10-01 19:28:00 -04:00
ebce361c07 [BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026) Lucas Wilkinson 2025-10-01 15:30:00 -04:00
1e50f1be70 [Deepseek v3.2] Support indexer prefill chunking (#25999) Chen Zhang 2025-10-02 10:29:12 -07:00
ad87ba927a [Small] Prevent bypassing media domain restriction via HTTP redirects (#26035) Chenheli Hua 2025-10-02 10:27:10 -07:00
decf7f794b [BugFix] Fix FI accuracy issue when used for MLA prefill (#26063) Lucas Wilkinson 2025-10-02 13:18:13 -04:00
d00d652998 [CI/Build] Replace vllm.entrypoints.openai.api_server entrypoint with vllm serve command (#25967) Cyrus Leung 2025-10-03 01:04:57 +08:00
3b279a84be [CI] Add Blackwell DeepSeek FP8 FlashInfer MoE tests (#26040) Michael Goin 2025-10-02 12:07:19 -04:00
5e4a8223c6 [Qwen][ROCm] Flash Attention Rotary Embeddings (#24642) vllmellm 2025-10-02 23:26:08 +08:00
e51de388a2 [Platform][CI] Added OOT platform interface e2e test that running on Ascend NPU (#25470) leo-pony 2025-10-02 23:19:22 +08:00
cc253b73d3 [Model] Use merge_by_field_config for MM models (D-F) (#26076) Cyrus Leung 2025-10-02 23:17:35 +08:00
7d6fb905d9 [Model] Use merge_by_field_config for MM models (A-C) (#26073) Cyrus Leung 2025-10-02 23:17:31 +08:00
418d111f8c [FA/Chore] Bump vllm-flash-attention (#25537) Lucas Wilkinson 2025-10-02 11:06:14 -04:00
be8921fbba Change size of single CUDA graph for CI to 4 (#26089) Thomas Parnell 2025-10-02 16:14:28 +02:00
d4e7a1152d Update base image to 22.04 (jammy) (#26065) Huy Do 2025-10-02 05:48:04 -07:00
be22bb6f3d Run:ai model streamer add GCS package support (#24909) pwschuurman 2025-10-01 20:59:13 -07:00
169313b9f8 [Misc] Make handling of SamplingParams clearer in n>1 case (#26032) Nick Hill 2025-10-01 19:31:39 -07:00
0b018d8baf [ROCm][Bugfix] Add missing parameter to ROCm backend (#26029) Gregory Shtrasberg 2025-10-01 22:23:14 -04:00
c31246800c Support RL online quantization with torchao (#23014) Jerry Zhang 2025-10-01 16:39:29 -07:00
4134312b35 [BugFix] ChunkedLocalAttention is currently not CG compatible (#26034) Lucas Wilkinson 2025-10-01 19:28:00 -04:00
da554f932e [Bug] Fix Negative Cuda Memory Usage (#25683) Wentao Ye 2025-10-01 18:16:26 -04:00
aac622e0cd [ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (#25908) Hosang 2025-10-01 17:39:49 -04:00
1726e93ef1 [BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026) Lucas Wilkinson 2025-10-01 15:30:00 -04:00
ee04c0cd04 [CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (#26030) Michael Goin 2025-10-01 15:02:17 -04:00
c36f0aa300 Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_indices_offsets (#25995) Huamin Li 2025-10-01 11:18:36 -07:00
5234dc7451 [NVIDIA] Blackwell Family (#24673) Johnny 2025-10-01 19:50:54 +02:00
3b7c20a6b5 [Bugfix] Apply same sampling parameters for both n=1 and n>1 (#26005) Kenichi Maehashi 2025-10-01 23:37:35 +09:00
f9e714813a [Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type (#26007) Nathan Scott 2025-10-01 22:41:57 +10:00
2518230d3e [MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes < 4 (#25829) billishyahao 2025-10-01 20:39:45 +08:00
a332b84578 [CI] Only capture a single CUDA graph size in CI by default (#25951) Harry Mellor 2025-10-01 10:03:44 +01:00
1405f0c7ba [Misc] Factor out common _apply_feature_select_strategy (#26003) Cyrus Leung 2025-10-01 16:31:03 +08:00
