Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

4abed65c58 [VLM] Disallow overflowing max_model_len for multimodal models (#7998) Cyrus Leung 2024-08-30 08:49:04 +08:00
0c785d344d Add more percentiles and latencies (#7759) Wei-Sheng Chin 2024-08-29 16:48:11 -07:00
4664ceaad6 support bitsandbytes 8-bit and FP4 quantized models (#7445) chenqianfzh 2024-08-29 16:09:08 -07:00
257afc37c5 [Neuron] Adding support for context-lenght, token-gen buckets. (#7885) Harsha vardhan manoj Bikki 2024-08-29 13:58:14 -07:00
86a677de42 [misc] update tpu int8 to use new vLLM Parameters (#7973) Dipika Sikka 2024-08-29 16:46:55 -04:00
d78789ac16 [Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism (#7954) Isotr0py 2024-08-30 03:54:49 +08:00
c334b1898b extend cuda graph size for H200 (#7894) kushanam 2024-08-29 12:15:04 -07:00
6b3421567d [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985) Pavani Majety 2024-08-29 11:53:11 -07:00
3f60f2244e [Core] Combine async postprocessor and multi-step (#7921) Alexander Matveev 2024-08-29 14:18:26 -04:00
f205c09854 [Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) Jonas M. Kübler 2024-08-29 07:18:13 +02:00
ef99a78760 Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) youkaichao 2024-08-28 21:27:06 -07:00
74d5543ec5 [VLM][Core] Fix exceptions on ragged NestedTensors (#7974) Peter Salas 2024-08-28 20:24:31 -07:00
a7f65c2be9 [torch.compile] remove reset (#7975) youkaichao 2024-08-28 17:32:26 -07:00
4289cad37f [Frontend] Minor optimizations to zmq decoupled front-end (#7957) Nick Hill 2024-08-28 17:22:43 -07:00
af59df0a10 Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961) Michael Goin 2024-08-28 19:19:17 -04:00
ce6bf3a2cf [torch.compile] avoid Dynamo guard evaluation overhead (#7898) youkaichao 2024-08-28 16:10:12 -07:00
3cdfe1f38b [Bugfix] Make torch registration of punica ops optional (#7970) bnellnm 2024-08-28 18:11:49 -04:00
fdd9daafa3 [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) Mor Zusman 2024-08-29 01:06:52 +03:00
8c56e57def [Doc] fix 404 link (#7966) Stas Bekman 2024-08-28 13:54:23 -07:00
eeffde1ac0 [TPU] Upgrade PyTorch XLA nightly (#7967) Woosuk Kwon 2024-08-28 13:10:21 -07:00
e5697d161c [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) rasmith 2024-08-28 14:37:47 -05:00
b98cc28f91 [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) Pavani Majety 2024-08-28 10:01:22 -07:00
ef9baee3c5 [Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948) Cyrus Leung 2024-08-28 23:11:18 +08:00
98c12cffe5 [Doc] fix the autoAWQ example (#7937) Stas Bekman 2024-08-28 05:12:32 -07:00
f52a43a8b9 [ci][test] fix pp test failure (#7945) youkaichao 2024-08-28 01:27:07 -07:00
e3580537a4 [Performance] Enable chunked prefill and prefix caching together (#7753) Cody Yu 2024-08-28 00:36:31 -07:00
f508e03e7f [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911) Alexander Matveev 2024-08-28 03:02:30 -04:00
51f86bf487 [mypy][CI/Build] Fix mypy errors (#7929) Cyrus Leung 2024-08-28 14:47:44 +08:00
c166e7e43e [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886) bnellnm 2024-08-27 23:13:45 -04:00
bc6e42a9b1 [hardware][rocm] allow rocm to override default env var (#7926) youkaichao 2024-08-27 19:50:06 -07:00
fab5f53e2d [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) Peter Salas 2024-08-27 18:53:56 -07:00
9c71c97ae2 [mypy] Enable mypy type checking for vllm/core (#7229) Jonathan Berkhahn 2024-08-27 16:11:14 -07:00
5340a2dccf [Model] Add multi-image input support for LLaVA-Next offline inference (#7230) zifeitong 2024-08-27 16:09:02 -07:00
345be0e244 [benchmark] Update TGI version (#7917) Philipp Schmid 2024-08-28 00:07:53 +02:00
fc911880cc [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) Dipika Sikka 2024-08-27 18:07:09 -04:00
ed6f002d33 [cuda][misc] error on empty CUDA_VISIBLE_DEVICES (#7924) youkaichao 2024-08-27 12:06:11 -07:00
b09c755be8 [Bugfix] Fix phi3v incorrect image_idx when using async engine (#7916) Isotr0py 2024-08-28 01:36:09 +08:00
42e932c7d4 [CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237) alexeykondrat 2024-08-27 13:09:13 -04:00
076169f603 [Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810) Kunshang Ji 2024-08-28 01:07:02 +08:00
9db642138b [CI/Build][VLM] Cleanup multiple images inputs model test (#7897) Isotr0py 2024-08-27 23:28:30 +08:00
6fc4e6e07a [Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) Patrick von Platen 2024-08-27 14:40:02 +02:00
9606c7197d Revert #7509 (#7887) Cody Yu 2024-08-27 00:16:31 -07:00
64cc644425 [core][torch.compile] discard the compile for profiling (#7796) youkaichao 2024-08-26 21:33:58 -07:00
39178c7fbc [Tests] Disable retries and use context manager for openai client (#7565) Nick Hill 2024-08-26 21:33:17 -07:00
2eedede875 [Core] Asynchronous Output Processor (#7049) Megha Agarwal 2024-08-26 20:53:20 -07:00
015e6cc252 [Misc] Update compressed tensors lifecycle to remove prefix from create_weights (#7825) Dipika Sikka 2024-08-26 20:09:34 -04:00
760e9f71a8 [Bugfix] neuron: enable tensor parallelism (#7562) omrishiv 2024-08-26 15:13:13 -07:00
05826c887b [misc] fix custom allreduce p2p cache file generation (#7853) youkaichao 2024-08-26 15:02:25 -07:00
dd9857f5fa [Misc] Update gptq_marlin_24 to use vLLMParameters (#7762) Dipika Sikka 2024-08-26 17:44:54 -04:00
665304092d [Misc] Update qqq to use vLLMParameters (#7805) Dipika Sikka 2024-08-26 15:16:15 -04:00
2deb029d11 [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) Cody Yu 2024-08-26 11:24:53 -07:00
029c71de11 [CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836) Cyrus Leung 2024-08-26 13:31:10 +08:00
0b769992ec [Bugfix]: Use float32 for base64 embedding (#7855) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2024-08-26 06:16:38 +03:00
1856aff4d6 [Spec Decoding] Streamline batch expansion tensor manipulation (#7851) Nick Hill 2024-08-25 15:45:14 -07:00
70c094ade6 [misc][cuda] improve pynvml warning (#7852) youkaichao 2024-08-25 14:30:09 -07:00
2059b8d9ca [Misc] Remove snapshot_download usage in InternVL2 test (#7835) Isotr0py 2024-08-25 23:53:09 +08:00
8aaf3d5347 [Model][VLM] Support multi-images inputs for Phi-3-vision models (#7783) Isotr0py 2024-08-25 19:51:20 +08:00
80162c44b1 [Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840) zifeitong 2024-08-24 18:16:24 -07:00
aab0fcdb63 [ci][test] fix RemoteOpenAIServer (#7838) youkaichao 2024-08-24 10:31:28 -07:00
ea9fa160e3 [ci][test] exclude model download time in server start time (#7834) youkaichao 2024-08-24 01:03:27 -07:00
7d9ffa2ae1 [misc][core] lazy import outlines (#7831) youkaichao 2024-08-24 00:51:38 -07:00
d81abefd2e [Frontend] add json_schema support from OpenAI protocol (#7654) Tyler Rockwood 2024-08-24 01:07:24 -05:00
8da48e4d95 [Frontend] Publish Prometheus metrics in run_batch API (#7641) Pooya Davoodi 2024-08-23 23:04:22 -07:00
6885fde317 [Bugfix] Fix run_batch logger (#7640) Pooya Davoodi 2024-08-23 13:58:26 -07:00
9db93de20c [Core] Add multi-step support to LLMEngine (#7789) Alexander Matveev 2024-08-23 15:45:53 -04:00
09c7792610 Bump version to v0.5.5 (#7823) v0.5.5 Simon Mo 2024-08-23 11:35:33 -07:00
f1df5dbfd6 [Misc] Update marlin to use vLLMParameters (#7803) Dipika Sikka 2024-08-23 14:30:52 -04:00
35ee2ad6b9 [github][misc] promote asking llm first (#7809) youkaichao 2024-08-23 09:38:50 -07:00
e25fee57c2 [BugFix] Fix server crash on empty prompt (#7746) Maximilien de Bayser 2024-08-23 10:12:44 -03:00
faeddb565d [misc] Add Torch profiler support for CPU-only devices (#7806) Jie Fu (傅杰) 2024-08-23 13:46:25 +08:00
fc5ebbd1d3 [Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712) Kunshang Ji 2024-08-23 11:06:54 +08:00
c01a6cb231 [Ray backend] Better error when pg topology is bad. (#7584) SangBin Cho 2024-08-22 17:44:25 -07:00
b903e1ba7f [Frontend] error suppression cleanup (#7786) Joe Runde 2024-08-22 15:50:21 -06:00
a152246428 [Misc] fix typo in triton import warning (#7794) Siyuan Liu 2024-08-22 13:51:23 -07:00
666ad0aa16 [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args (#7705) Kevin H. Luu 2024-08-22 13:10:55 -07:00
15310b5101 [Bugfix] Use LoadFormat values for vllm serve --load-format (#7784) Michael Goin 2024-08-22 14:37:08 -04:00
57792ed469 [Doc] Fix incorrect docs from #7615 (#7788) Peter Salas 2024-08-22 10:02:06 -07:00
d3b5b98021 [Misc] Enhance prefix-caching benchmark tool (#6568) Jiaxin Shan 2024-08-23 00:32:02 +08:00
cc0eaf12b1 [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232) Travis Johnson 2024-08-22 07:33:48 -06:00
955b5191c9 [Misc] update fp8 to use vLLMParameter (#7437) Dipika Sikka 2024-08-22 08:36:18 -04:00
55d63b1211 [Bugfix] Don't build machete on cuda <12.0 (#7757) Lucas Wilkinson 2024-08-22 08:28:52 -04:00
4f419c00a6 Fix ShardedStateLoader for vllm fp8 quantization (#7708) Flex Wang 2024-08-22 05:25:04 -07:00
a3fce56b88 [Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) Abhinav Goyal 2024-08-22 15:12:24 +05:30
b3856bef7d [Misc] Use torch.compile for GemmaRMSNorm (#7642) Woosuk Kwon 2024-08-22 01:14:13 -07:00
8c6f694a79 [ci] refine dependency for distributed tests (#7776) youkaichao 2024-08-22 00:54:15 -07:00
eeee1c3b1a [TPU] Avoid initializing TPU runtime in is_tpu (#7763) Woosuk Kwon 2024-08-21 21:31:49 -07:00
aae74ef95c Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) Michael Goin 2024-08-21 23:42:14 -04:00
cde9183b40 [Bug][Frontend] Improve ZMQ client robustness (#7443) Joe Runde 2024-08-21 20:18:11 -06:00
df1a21131d [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710) zifeitong 2024-08-21 18:36:24 -07:00
7937009a7e [Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233) Luka Govedič 2024-08-21 20:18:00 -04:00
9984605412 [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility (#7477) Gregory Shtrasberg 2024-08-21 19:47:36 -04:00
7eebe8ccaa [distributed][misc] error on same VLLM_HOST_IP setting (#7756) youkaichao 2024-08-21 16:25:34 -07:00
8678a69ab5 [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527) Dipika Sikka 2024-08-21 19:17:10 -04:00
5844017285 [ci] [multi-step] narrow multi-step test dependency paths (#7760) William Lin 2024-08-21 15:52:40 -07:00
1ca0d4f86b [Model] Add UltravoxModel and UltravoxConfig (#7615) Peter Salas 2024-08-21 15:49:39 -07:00
dd53c4b023 [misc] Add Torch profiler support (#7451) William Lin 2024-08-21 15:39:26 -07:00
970dfdc01d [Frontend] Improve Startup Failure UX (#7716) Robert Shaw 2024-08-21 15:53:01 -04:00
91f4522cbf [multi-step] Raise error if not using async engine (#7703) William Lin 2024-08-21 11:49:19 -07:00
1b32e02648 [Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730) sasha0552 2024-08-21 18:17:48 +00:00
f7e3b0c5aa [Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend (#7394) Robert Shaw 2024-08-21 13:34:14 -04:00
