Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

f78c0be80a Fix benchmark_moe.py tuning for CUDA devices (#14164) Michael Goin 2025-03-04 00:11:03 -05:00
66233af7b6 Use math.prod instead of np.prod for trivial ops (#14142) Zhanwen Chen 2025-03-04 00:09:22 -05:00
bf13d40972 [core] Pass all driver env vars to ray workers unless excluded (#14099) Rui Qiao 2025-03-03 19:44:17 -08:00
989f4f430c [Misc] Remove lru_cache in NvmlCudaPlatform (#14156) Cody Yu 2025-03-03 19:09:34 -08:00
bb5b640359 [core] moe fp8 block quant tuning support (#14068) Divakar Verma 2025-03-03 19:30:23 -06:00
c060b71408 [Model] Add support for GraniteMoeShared models (#13313) Travis Johnson 2025-03-03 17:04:52 -07:00
79e4937c65 [v1] Add comments to the new ragged paged attention Pallas kernel (#14155) iefgnoix 2025-03-03 15:00:55 -08:00
cd1d3c3df8 [Docs] Add GPTQModel (#14056) Qubitium-ModelCloud 2025-03-04 05:59:09 +08:00
19d98e0c7d [Kernel] Optimize moe intermediate_cache usage (#13625) Michael Goin 2025-03-03 16:29:53 -05:00
2b04c209ee [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (#14100) Michael Goin 2025-03-03 16:20:24 -05:00
ae122b1cbd [WIP][[V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055) Mark McLoughlin 2025-03-03 19:04:45 +00:00
872db2be0e [V1] Simplify stats logging (#14082) Nick Hill 2025-03-03 10:34:14 -08:00
2dfdfed8a0 [V0][Metrics] Deprecate some KV/prefix cache metrics (#14136) Mark McLoughlin 2025-03-03 18:25:46 +00:00
c41d27156b [V0][Metrics] Remove unimplemented vllm:tokens_total (#14134) Mark McLoughlin 2025-03-03 17:50:22 +00:00
91373a0d15 Fix head_dim not existing in all model configs (Transformers backend) (#14141) Harry Mellor 2025-03-03 17:48:11 +00:00
848a6438ae [ROCm] Faster Custom Paged Attention kernels (#12348) TJian 2025-03-04 01:24:45 +08:00
98175b2816 Improve the docs for TransformersModel (#14147) Harry Mellor 2025-03-03 17:03:05 +00:00
4167252eaf [V1] Refactor parallel sampling support (#13774) Mark McLoughlin 2025-03-03 16:15:27 +00:00
f35f8e2242 [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921) Cody Yu 2025-03-03 00:43:14 -08:00
b87c21fc89 [Misc][Platform] Move use allgather to platform (#14010) Mengqing Cao 2025-03-03 15:40:04 +08:00
e584b85afd [Misc] duplicate code in deepseek_v2 (#14106) wang.yuqi 2025-03-03 14:10:11 +08:00
09e56f9262 [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure (#14051) Sheng Yao 2025-03-03 09:35:01 +08:00
cf069aa8aa Update deprecated Python 3.8 typing (#13971) Harry Mellor 2025-03-03 01:34:51 +00:00
bf33700ecd [v0][structured output] Support reasoning output (#12955) Ce Gao 2025-03-03 03:49:42 +08:00
bc6ccb9878 [Doc] Source building add clone step (#14086) qux-bbb 2025-03-02 18:59:50 +08:00
82fbeae92b [Misc] Accurately capture the time of loading weights (#14063) Jun Duan 2025-03-01 20:20:30 -05:00
cc5e8f6db8 [Model] Add LoRA support for TransformersModel (#13770) Jee Jee Li 2025-03-02 09:17:34 +08:00
d54990da47 [v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081) Chen Zhang 2025-03-02 04:46:02 +08:00
b9f1d4294e [v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073) Chen Zhang 2025-03-01 16:25:54 +08:00
b28246f6ff [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065) Sage Moore 2025-02-28 23:18:32 -08:00
3b5567a209 [V1][Minor] Do not print attn backend twice (#13985) Woosuk Kwon 2025-02-28 23:09:14 -08:00
fdcc405346 [Doc] Consolidate whisper and florence2 examples (#14050) Isotr0py 2025-03-01 14:49:15 +08:00
8994dabc22 [Documentation] Add more deployment guide for Kubernetes deployment (#13841) Kuntai Du 2025-02-28 22:44:24 -08:00
02296f420d [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053) Li, Jiang 2025-03-01 14:31:01 +08:00
6a92ff93e1 [Misc][Kernel]: Add GPTQAllSpark Quantization (#12931) YajieWang 2025-03-01 14:30:59 +08:00
6a84164add [Bugfix] Add file lock for ModelScope download (#14060) Jee Jee Li 2025-03-01 14:10:28 +08:00
f64ffa8c25 [Docs] Add pipeline_parallel_size to optimization docs (#14059) Brayden Zhong 2025-03-01 00:43:54 -05:00
bd56c983d6 [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902) Luka Govedič 2025-02-28 18:20:11 -05:00
084bbac8cc [core] Bump ray to 2.43 (#13994) Rui Qiao 2025-02-28 13:47:44 -08:00
28943d36ce [v1] Move block pool operations to a separate class (#13973) Chen Zhang 2025-03-01 04:53:31 +08:00
b526ca6726 Add RELEASE.md (#13926) Andrey Talman 2025-02-28 20:25:50 +00:00
e7bd944e08 [v1] Cleanup the BlockTable in InputBatch (#13977) Chen Zhang 2025-03-01 03:03:16 +08:00
c3b6559a10 [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379) iefgnoix 2025-02-28 10:01:36 -08:00
4be4b26cb7 Fix entrypoint tests for embedding models (#14052) Harry Mellor 2025-02-28 16:56:44 +00:00
2aed2c9fa7 [Doc] Fix ROCm documentation (#14041) Brayden Zhong 2025-02-28 11:42:07 -05:00
9b61dd41e7 [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series (#14031) Yang Liu 2025-02-28 23:36:08 +08:00
f7bee5c815 [VLM][Bugfix] Enable specifying prompt target via index (#14038) Cyrus Leung 2025-02-28 23:35:55 +08:00
e0734387fb [Bugfix] Fix MoeWNA16Method activation (#14024) Jee Jee Li 2025-02-28 23:22:42 +08:00
f58f8b5c96 Update AutoAWQ docs (#14042) Harry Mellor 2025-02-28 15:20:29 +00:00
b3f7aaccd0 [V1][Minor] Restore V1 compatibility with LLMEngine class (#13090) Thibault Schueller 2025-02-28 09:52:25 +01:00
b91660ddb8 [Hardware][Intel-Gaudi] Regional compilation support (#13213) Kacper Pietkun 2025-02-28 09:51:49 +01:00
76c89fcadd Use smaller embedding model when not testing model specifically (#13891) Harry Mellor 2025-02-28 08:50:43 +00:00
b9e41734c5 [Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) (#13987) Mathis Felardos 2025-02-28 08:53:45 +01:00
1088f06242 [Doc] Move multimodal Embedding API example to Online Serving page (#14017) Cyrus Leung 2025-02-28 15:12:04 +08:00
73e0225ee9 [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (#13911) Travis Johnson 2025-02-27 21:00:45 -07:00
6c85da3a18 [V1]SupportsV0Only protocol for model definitions (#13959) Roger Wang 2025-02-27 17:02:15 -08:00
67fc426845 [Misc] Print FusedMoE detail info (#13974) Jee Jee Li 2025-02-28 07:53:13 +08:00
9804145cac [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict (#13626) Benjamin Chislett 2025-02-27 18:28:08 -05:00
2e94b9cfbb [Attention] Flash MLA for V1 (#13867) Lucas Wilkinson 2025-02-27 18:03:41 -05:00
8294773e48 [core] Perf improvement for DSv3 on AMD GPUs (#13718) qli88 2025-02-27 16:14:30 -06:00
cd813c6d4d [V1][Minor] Minor cleanup for GPU Model Runner (#13983) Woosuk Kwon 2025-02-27 13:11:40 -08:00
38acae6e97 [ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups (#13970) Sage Moore 2025-02-27 12:31:47 -08:00
a2dd48c386 [VLM] Deprecate legacy input mapper for OOT multimodal models (#13979) Cyrus Leung 2025-02-28 03:14:55 +08:00
126f6beeb4 Bump azure/setup-helm from 4.2.0 to 4.3.0 (#13742) dependabot[bot] 2025-02-27 19:04:10 +00:00
58d1b2aa77 [Attention] MLA support for V1 (#13789) Yang Chen 2025-02-27 10:14:17 -08:00
f1579b229d [VLM] Generalized prompt updates for multi-modal processor (#13964) Cyrus Leung 2025-02-28 01:44:25 +08:00
7864875879 [Bugfix] Fix qwen2.5-vl overflow issue (#13968) Isotr0py 2025-02-28 01:30:39 +08:00
1dd422b64a Update LMFE version to v0.10.11 to support new versions of transforme… (#13930) Noam Gat 2025-02-27 19:16:12 +02:00
06c8f8d885 [bugfix] Fix profiling for RayDistributedExecutor (#13945) Rui Qiao 2025-02-27 09:01:21 -08:00
5677c9bb3e Deduplicate .pre-commit-config.yaml's exclude (#13967) Harry Mellor 2025-02-27 16:27:47 +00:00
512d77d582 Update quickstart.md (#13958) 王博伟 2025-02-28 00:05:11 +08:00
7f0be2aa24 [Model] Deepseek GGUF support (#13167) Szymon Ożóg 2025-02-27 11:08:35 +01:00
edf309ebbe [VLM] Support multimodal inputs for Florence-2 models (#13320) Isotr0py 2025-02-27 18:06:41 +08:00
788f284b53 Fix test_block_fp8.py test for MoE (#13915) Michael Goin 2025-02-27 05:00:00 -05:00
4b1d141f49 [PP] Correct cache size check (#13873) Yang Zheng 2025-02-27 17:47:29 +08:00
10c3b8c1cf [Misc] fixed 'required' is an invalid argument for positionals (#13948) Chauncey 2025-02-27 17:06:49 +08:00
a7f37314b7 [CI/Build] Add examples/ directory to be labelled by mergify (#13944) Brayden Zhong 2025-02-27 03:24:11 -05:00
cd711c48b2 [V1][Metrics] Handle preemptions (#13169) Mark McLoughlin 2025-02-27 04:04:59 +00:00
378b3ef6f8 [ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding (#13922) Sage Moore 2025-02-26 20:04:12 -08:00
c9944acbf9 [misc] Rename Ray ADAG to Compiled Graph (#13928) Rui Qiao 2025-02-26 20:03:28 -08:00
ca377cf1b9 Use CUDA 12.4 as default for release and nightly wheels (#12098) Michael Goin 2025-02-26 22:06:37 -05:00
a31614e386 [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined (#13851) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-02-27 04:39:10 +02:00
f95903909f [Kernel] FlashMLA integration (#13747) Lucas Wilkinson 2025-02-26 21:35:08 -05:00
b382a7f28f [BugFix] Make FP8 Linear compatible with torch.compile (#13918) Woosuk Kwon 2025-02-26 13:48:55 -08:00
4cb6fa0a9c [Bugfix] Backend option to disable xgrammar any_whitespace (#12744) Wallas Henrique 2025-02-26 15:52:34 -03:00
d08b285adf [Misc] fixed qwen_vl_utils parameter error (#13906) Chauncey 2025-02-27 00:31:53 +08:00
b27122acc2 [TPU] use torch2.6 with whl package (#13860) Chenyaaang 2025-02-26 05:18:54 -08:00
934bb99c71 [Bugfix] Update expected token counts for Ultravox tests (#13895) Cyrus Leung 2025-02-26 20:56:50 +08:00
3f808cc044 [Bugfix] Do not crash V0 engine on input errors (#13101) Joe Runde 2025-02-26 04:07:29 -07:00
ec8a5e5386 [Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor (#13736) Brayden Zhong 2025-02-26 06:06:47 -05:00
215bf150a6 [Bugfix] Handle None parameters in Mistral function calls. (#13786) Florian Greinacher 2025-02-26 12:06:21 +01:00
0ecdd98031 Add comments on accessing kv_cache and attn_metadata (#13887) Harry Mellor 2025-02-26 10:41:02 +00:00
7b700ec8c8 [Bugfix] Add test example for Ultravox v0.5 (#13890) Cyrus Leung 2025-02-26 18:31:43 +08:00
7ca1da020f [Misc] Fix input processing for Ultravox (#13871) Roger Wang 2025-02-25 23:56:34 -08:00
5157338ed9 [Misc] Improve LoRA spelling (#13831) Jee Jee Li 2025-02-26 15:43:01 +08:00
e206b54331 [v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine (#13837) Seth Kimmel 2025-02-25 22:58:24 -08:00
1d35662e6d [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms (#13844) Sage Moore 2025-02-25 22:56:58 -08:00
e656f638de [Doc] fix the incorrect module path of tensorize_vllm_model (#13863) Albert 2025-02-26 14:56:19 +08:00
145944cb94 Improve pipeline partitioning (#13839) Harry Mellor 2025-02-26 02:53:56 +00:00
094b7d9496 [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues (#13797) Henry Tsang 2025-02-25 18:52:03 -08:00
