Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

d3c002eadc [Bugfix] chat method add_generation_prompt param (#7734) Brian Li 2024-08-22 01:33:35 +08:00
9b73a2f498 [Spec Decoding] Use target model max length as default for draft model (#7706) Nick Hill 2024-08-21 12:23:22 -04:00
6925cdbeea [Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend (#7735) Isotr0py 2024-08-22 00:23:03 +08:00
53328d7536 [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) LI MOU 2024-08-21 23:54:31 +08:00
c75363fbc0 [BugFix] Avoid premature async generator exit and raise all exception variations (#7698) Nick Hill 2024-08-21 11:45:55 -04:00
dd3fa0e430 [Bugfix] Mirror jinja2 in pyproject.toml (#7723) sasha0552 2024-08-21 13:41:17 +00:00
baaedfdb2d [mypy] Enable following imports for entrypoints (#7248) Cyrus Leung 2024-08-21 14:28:21 +08:00
4506641212 [Doc] Section for Multimodal Language Models (#7719) Roger Wang 2024-08-20 23:24:01 -07:00
12e1c65bc9 [Model] Add AWQ quantization support for InternVL2 model (#7187) Isotr0py 2024-08-21 14:18:57 +08:00
b74a125800 [ci] try to log process using the port to debug the port usage (#7711) youkaichao 2024-08-20 17:41:12 -07:00
66a9e713a7 [Core] Pipe worker_class_fn argument in Executor (#7707) Antoni Baum 2024-08-20 17:37:39 -07:00
9e51b6a626 [ci][test] adjust max wait time for cpu offloading test (#7709) youkaichao 2024-08-20 17:12:44 -07:00
6e4658c7aa [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) (#7685) Kunshang Ji 2024-08-21 03:01:09 +08:00
3b682179dd [Core] Add AttentionState abstraction (#7663) Antoni Baum 2024-08-20 11:50:45 -07:00
c6af027a35 [Misc] Add jinja2 as an explicit build requirement (#7695) Lucas Wilkinson 2024-08-20 13:17:47 -04:00
2aa00d59ad [CI/Build] Pin OpenTelemetry versions and make errors clearer (#7266) Ronen Schaffer 2024-08-20 20:02:21 +03:00
c42590f97a [Hardware] [Intel GPU] refactor xpu worker/executor (#7686) Kunshang Ji 2024-08-21 00:54:10 +08:00
aae6927be0 [VLM][Model] Add test for InternViT vision encoder (#7409) Isotr0py 2024-08-20 23:10:20 +08:00
398521ad19 [OpenVINO] Updated documentation (#7687) Ilya Lavrenov 2024-08-20 17:33:56 +04:00
5288c06aa0 [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) Lucas Wilkinson 2024-08-20 09:09:33 -04:00
b6f99a6ffe [Core] Refactor executor classes for easier inheritance (#7673) Kunshang Ji 2024-08-20 15:56:50 +08:00
ad28a74beb [misc][cuda] add warning for pynvml user (#7675) youkaichao 2024-08-20 00:35:09 -07:00
e6d811dd13 [XPU] fallback to native implementation for xpu custom op (#7670) jianyizh 2024-08-20 15:26:09 +08:00
c4be16e1a7 [misc] add nvidia related library in collect env (#7674) youkaichao 2024-08-19 23:22:49 -07:00
3d8a5f063d [CI] Organizing performance benchmark files (#7616) Kuntai Du 2024-08-19 22:43:54 -07:00
f4fc7337bf [Bugfix] support tie_word_embeddings for all models (#5724) Zijian Hu 2024-08-19 20:00:04 -07:00
0df7ec0b2d [ci] Install Buildkite test suite analysis (#7667) Kevin H. Luu 2024-08-19 19:55:04 -07:00
312f761232 [Speculative Decoding] Fixing hidden states handling in batch expansion (#7508) Abhinav Goyal 2024-08-20 06:28:14 +05:30
e54ebc2f8f [doc] fix doc build error caused by msgspec (#7659) youkaichao 2024-08-19 17:50:59 -07:00
67e02fa8a4 [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding (#7665) Travis Johnson 2024-08-19 18:43:09 -06:00
43735bf5e1 [TPU] Remove redundant input tensor cloning (#7660) Woosuk Kwon 2024-08-19 15:55:04 -07:00
da115230fd [Bugfix] Don't disable existing loggers (#7664) Andrew Song 2024-08-19 15:11:58 -07:00
7601cb044d [Core] Support tensor parallelism for GGUF quantization (#7520) Isotr0py 2024-08-20 05:30:14 +08:00
47b65a5508 [core] Multi Step Scheduling (#7000) William Lin 2024-08-19 13:52:13 -07:00
dad961ef5c [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 (#5428) Ali Panahi 2024-08-19 13:47:00 -07:00
3ac50b47d0 [MISC] Add prefix cache hit rate to metrics (#7606) Cody Yu 2024-08-19 11:52:07 -07:00
df845b2b46 [Misc] Remove Gemma RoPE (#7638) Woosuk Kwon 2024-08-19 09:29:31 -07:00
1a36287b89 [Bugfix] Fix xpu build (#7644) Kunshang Ji 2024-08-19 13:00:09 +08:00
f710fb5265 [Core] Use flashinfer sampling kernel when available (#7137) Peng Guanwen 2024-08-19 11:24:03 +08:00
ff7ec82c4d [Core] Optimize SPMD architecture with delta + serialization optimization (#7109) SangBin Cho 2024-08-18 17:57:20 -07:00
200a2ffa6b [Misc] Refactor Llama3 RoPE initialization (#7637) Woosuk Kwon 2024-08-18 17:18:12 -07:00
40e1360bb6 [CI/Build] Add text-only test for Qwen models (#7475) Alex Brooks 2024-08-18 17:43:46 -06:00
e3b318216d [ Bugfix ] Fix Prometheus Metrics With zeromq Frontend (#7279) Robert Shaw 2024-08-18 16:19:48 -04:00
ab7165f2c7 [TPU] Optimize RoPE forward_native2 (#7636) Woosuk Kwon 2024-08-18 01:15:10 -07:00
0c2fa50b84 [TPU] Use mark_dynamic only for dummy run (#7634) Woosuk Kwon 2024-08-18 00:18:53 -07:00
ce143353c6 [TPU] Skip creating empty tensor (#7630) Woosuk Kwon 2024-08-17 14:22:46 -07:00
bbf55c4805 [VLM] Refactor MultiModalConfig initialization and profiling (#7530) Roger Wang 2024-08-17 13:30:55 -07:00
1ef13cf92f [Misc]Fix BitAndBytes exception messages (#7626) Jee Jee Li 2024-08-18 03:02:14 +08:00
832163b875 [ci][test] allow longer wait time for api server (#7629) youkaichao 2024-08-17 11:26:38 -07:00
e73f76eec6 [Model] Pipeline parallel support for JAIS (#7603) Besher Alkurdi 2024-08-17 21:11:09 +03:00
d95cc0a55c [core][misc] update libcudart finding (#7620) youkaichao 2024-08-16 23:01:35 -07:00
5bf45db7df [ci][test] fix engine/logger test (#7621) youkaichao 2024-08-16 23:00:59 -07:00
eed020f673 [misc] use nvml to get consistent device name (#7582) youkaichao 2024-08-16 21:15:13 -07:00
7c0b7ea214 [Bugfix] add >= 1.0 constraint for openai dependency (#7612) Xander Johnson 2024-08-16 20:56:01 -07:00
4706eb628e [aDAG] Unflake aDAG + PP tests (#7600) SangBin Cho 2024-08-16 20:49:30 -07:00
bae888cb8e [Bugfix] Clear engine reference in AsyncEngineRPCServer (#7618) Rui Qiao 2024-08-16 20:44:05 -07:00
6bd19551b0 .[Build/CI] Enabling passing AMD tests. (#7610) Alexei-V-Ivanov-AMD 2024-08-16 22:25:32 -05:00
e680349994 [Bugfix] Fix custom_ar support check (#7617) bnellnm 2024-08-16 22:05:49 -04:00
44f26a9466 [Model] Align nemotron config with final HF state and fix lm-eval-small (#7611) Michael Goin 2024-08-16 18:56:34 -04:00
37fd47e780 [Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596) bnellnm 2024-08-16 17:00:11 -04:00
7759ae958f [Kernel][Misc] dynamo support for ScalarType (#7594) bnellnm 2024-08-16 16:59:49 -04:00
9f69856356 [Kernel] register punica functions as torch ops (#7591) bnellnm 2024-08-16 16:59:38 -04:00
d4f0f17b02 [Doc] Update quantization supported hardware table (#7595) Michael Goin 2024-08-16 16:59:27 -04:00
b3f4e17935 [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints (#7444) Michael Goin 2024-08-16 16:59:16 -04:00
93478b63d2 [Core] Fix tracking of model forward time in case of PP>1 (#7440) Mahesh Keralapura 2024-08-16 13:46:01 -07:00
f366f6339b [spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571) William Lin 2024-08-16 11:41:56 -07:00
855866caa9 [Kernel] Add tuned triton configs for ExpertsInt8 (#7601) Michael Goin 2024-08-16 14:37:01 -04:00
7fc23be81c [Kernel] W8A16 Int8 inside FusedMoE (#7415) Mor Zusman 2024-08-16 20:06:51 +03:00
e837b624f2 [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) Charlie Fu 2024-08-16 12:06:30 -05:00
ec724a725e support tqdm in notebooks (#7510) fzyzcjy 2024-08-17 00:17:50 +08:00
0e39a33c6d [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method (#7513) Gordon Wong 2024-08-17 00:05:18 +08:00
6fc5b0f249 [CI] Fix crashes of performance benchmark (#7500) Kuntai Du 2024-08-16 08:08:45 -07:00
9587b050fb [Core] Use uvloop with zmq-decoupled front-end (#7570) Nick Hill 2024-08-15 22:48:07 -07:00
54bd9a03c4 register custom op for flash attn and use from torch.ops (#7536) youkaichao 2024-08-15 22:38:56 -07:00
50b8d08dbd [Misc/Testing] Use torch.testing.assert_close (#7324) jon-chuang 2024-08-15 21:24:04 -07:00
e165528778 [CI] Move quantization cpu offload tests out of fastcheck (#7574) Michael Goin 2024-08-16 00:16:20 -04:00
3b19e39dc5 Chat method for offline llm (#5049) nunjunj 2024-08-16 09:41:34 +07:00
4cd7d47fed [ci/test] rearrange tests and make adag test soft fail (#7572) youkaichao 2024-08-15 19:39:04 -07:00
f878c8feb0 [Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453) Grant Pinkert 2024-08-16 12:38:08 +10:00
b67ae00cdb [Misc] Add quantization config support for speculative model. (#7343) shangmingc 2024-08-16 10:34:28 +08:00
9c8e2d1161 [Bugfix][Harmless] Fix float16 dtype for model_is_embedding (#7566) Michael Goin 2024-08-15 21:26:19 -04:00
21313e09e3 [Bugfix] Fix default weight loading for scalars (#7534) Michael Goin 2024-08-15 16:10:22 -04:00
f4da5f7b6d [Misc] Update dockerfile for CPU to cover protobuf installation (#7182) PHILO-HE 2024-08-16 01:03:01 +08:00
9c1f78d5d6 [Bugfix] update neuron for version > 0.5.0 (#7175) omrishiv 2024-08-15 09:44:14 -07:00
fc93e56143 [Bugfix][TPU] Correct env variable for XLA cache path (#7544) Woosuk Kwon 2024-08-15 00:02:29 -07:00
22b39e11f2 llama_index serving integration documentation (#6973) Kameshwara Pavan Kumar Mantha 2024-08-15 04:08:37 +05:30
f55a9aea45 [Misc] Revert compressed-tensors code reuse (#7521) Kyle Sayers 2024-08-14 18:07:37 -04:00
951fdd66d3 [TPU] Set per-rank XLA cache (#7533) Woosuk Kwon 2024-08-14 14:47:51 -07:00
2ecf7b1757 [core] [3/N] multi-step args and sequence.py (#7452) William Lin 2024-08-14 12:32:45 -07:00
3f674a49b5 [VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) Cyrus Leung 2024-08-15 01:55:42 +08:00
70b746efcf [Misc] Deprecation Warning when setting --engine-use-ray (#7424) Wallas Henrique 2024-08-14 13:44:27 -03:00
67d115db08 [Bugfix][Frontend] Disable embedding API for chat models (#7504) jack 2024-08-15 00:15:19 +08:00
d3d9cb6e4b [ci] fix model tests (#7507) youkaichao 2024-08-14 01:01:43 -07:00
c134a46402 Fix empty output when temp is too low (#2937) Chang Su 2024-08-13 22:31:44 -07:00
199adbb7cf [doc] update test script to include cudagraph (#7501) youkaichao 2024-08-13 21:52:58 -07:00
dd164d72f3 [Bugfix][Docs] Update list of mock imports (#7493) Cyrus Leung 2024-08-14 11:37:30 +08:00
ea49e6a3c8 [misc][ci] fix cpu test with plugins (#7489) youkaichao 2024-08-13 19:27:46 -07:00
97992802f3 [CI/Build]Reduce the time consumption for LoRA tests (#7396) Jee Jee Li 2024-08-14 08:27:29 +08:00
59edd0f134 [Bugfix][CI] Import ray under guard (#7486) Woosuk Kwon 2024-08-13 17:12:58 -07:00
a08df8322e [TPU] Support multi-host inference (#7457) Woosuk Kwon 2024-08-13 16:31:20 -07:00
