Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

396d92d5e0 [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) Alexander Matveev 2024-07-21 19:41:42 -04:00
25e778aa16 [Model] Refactor and decouple phi3v image embedding (#6621) Isotr0py 2024-07-22 07:07:58 +08:00
b6df37f943 [Misc] Remove abused noqa (#6619) Woosuk Kwon 2024-07-21 08:47:04 -07:00
14f91fe67c [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) sroy745 2024-07-20 23:58:58 -07:00
d7f4178dd9 [Frontend] Move chat utils (#6602) Cyrus Leung 2024-07-21 08:38:17 +08:00
082ecd80d5 [ Bugfix ] Fix AutoFP8 fp8 marlin (#6609) Robert Shaw 2024-07-20 19:25:56 -04:00
f952bbc8ff [Misc] Fix input_scale typing in w8a8_utils.py (#6579) Michael Goin 2024-07-20 19:11:13 -04:00
9364f74eee [ Kernel ] Enable fp8-marlin for fbgemm-fp8 models (#6606) Robert Shaw 2024-07-20 14:50:10 -04:00
06d6c5fe9f [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) Matt Wong 2024-07-20 11:39:07 -05:00
683e3cb9c4 [ Misc ] fbgemm checkpoints (#6559) Robert Shaw 2024-07-20 12:36:57 -04:00
9042d68362 [Misc] Consolidate and optimize logic for building padded tensors (#6541) Cyrus Leung 2024-07-20 12:17:24 +08:00
3f8d42c81f Pipeline Parallel: Guard for KeyErrors at request abort (#6587) Travis Johnson 2024-07-19 20:18:19 -06:00
7bd82002ae [Core] Allow specifying custom Executor (#6557) Antoni Baum 2024-07-19 18:25:06 -07:00
2e26564259 [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) Varun Sundar Rabindranath 2024-07-19 21:15:26 -04:00
e81522e879 [build] add ib in image for out-of-the-box infiniband support (#6599) youkaichao 2024-07-19 17:16:57 -07:00
45ceb85a0c [Docs] Update PP docs (#6598) Murali Andoorveedu 2024-07-19 19:38:21 -04:00
4cc24f01b1 [ Kernel ] Enable Dynamic Per Token fp8 (#6547) Robert Shaw 2024-07-19 19:08:15 -04:00
07eb6f19f3 [bugfix][distributed] fix multi-node bug for shared memory (#6597) youkaichao 2024-07-19 15:34:34 -07:00
f0bbfaf917 [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578) Thomas Parnell 2024-07-19 23:01:03 +02:00
30efe41532 [Docs] Update docs for wheel location (#6580) Simon Mo 2024-07-19 12:14:11 -07:00
9ed82e7074 [Misc] Small perf improvements (#6520) Antoni Baum 2024-07-19 12:10:56 -07:00
51f8aa90ad [Bugfix][Frontend] remove duplicate init logger (#6581) Daniele 2024-07-19 19:16:27 +02:00
a5314e8698 [Model] RowParallelLinear: pass bias to quant_method.apply (#6327) Thomas Parnell 2024-07-19 15:15:22 +02:00
a921e86392 [BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) Woo-Yeon Lee 2024-07-19 22:01:09 +09:00
6366efc67b [Bugfix][Frontend] Fix missing /metrics endpoint (#6463) Cyrus Leung 2024-07-19 11:55:13 +08:00
dbe5588554 [ Misc ] non-uniform quantization via compressed-tensors for Llama (#6515) Robert Shaw 2024-07-18 22:39:18 -04:00
d4201e06d5 [Bugfix] Make spec. decode respect per-request seed. (#6034) Thomas Parnell 2024-07-19 04:22:08 +02:00
b5672a112c [Core] Multiprocessing Pipeline Parallel support (#6130) Nick Hill 2024-07-18 19:15:52 -07:00
c5df56f88b Add support for a rope extension method (#6553) Simon Mo 2024-07-18 18:53:03 -07:00
1689219ebf [CI/Build] Build on Ubuntu 20.04 instead of 22.04 (#6517) Tyler Michael Smith 2024-07-18 20:29:25 -04:00
4ffffccb7e [Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm (#6552) Tyler Michael Smith 2024-07-18 19:52:22 -04:00
f53b8f0d05 [ci][test] add correctness test for cpu offloading (#6549) youkaichao 2024-07-18 16:41:06 -07:00
2d4733ba2d Fix PR comment bot (#6554) Kevin H. Luu 2024-07-18 14:48:29 -07:00
15c6a079b1 [Model] Support Mistral-Nemo (#6548) Michael Goin 2024-07-18 16:31:50 -04:00
ecdb462c24 [ci] Reword Github bot comment (#6534) Kevin H. Luu 2024-07-18 08:01:45 -07:00
58ca663224 [ Misc ] Improve Min Capability Checking in compressed-tensors (#6522) Robert Shaw 2024-07-18 10:39:12 -04:00
4634c8728b [TPU] Refactor TPU worker & model runner (#6506) Woosuk Kwon 2024-07-18 01:34:16 -07:00
c8a7d51c49 [Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash (#6501) Noam Gat 2024-07-18 10:47:13 +03:00
e2fbaee725 [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227) Nick Hill 2024-07-18 00:13:30 -07:00
8a74c68bd1 [Misc] Minor patch for draft model runner (#6523) Cody Yu 2024-07-17 23:06:21 -07:00
61e592747c [Core] Introduce SPMD worker execution using Ray accelerated DAG (#6032) Rui Qiao 2024-07-17 22:27:09 -07:00
d25877dd9b [BugFix] Avoid secondary error in ShmRingBuffer destructor (#6530) Nick Hill 2024-07-17 22:24:43 -07:00
1c27d25fb5 [core][model] yet another cpu offload implementation (#6496) youkaichao 2024-07-17 20:54:35 -07:00
18fecc3559 [ Kernel ] Fp8 Channelwise Weight Support (#6487) Robert Shaw 2024-07-17 23:18:13 -04:00
b5af8c223c [Model] Pipeline parallel support for Mixtral (#6516) Cody Yu 2024-07-17 19:26:04 -07:00
b5241e41d9 [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) Varun Sundar Rabindranath 2024-07-17 21:38:35 -04:00
e76466dde2 [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) Alexander Matveev 2024-07-17 17:30:28 -04:00
5f0b9933e6 [Bugfix] Fix Ray Metrics API usage (#6354) Antoni Baum 2024-07-17 12:40:10 -07:00
a38524f338 [DOC] - Add docker image to Cerebrium Integration (#6510) milo157 2024-07-17 13:22:53 -04:00
2fa4623d9e [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) Cody Yu 2024-07-17 09:37:16 -07:00
a9a2e74d21 [Misc] Use torch.Tensor for type annotation (#6505) Woosuk Kwon 2024-07-17 06:01:10 -07:00
e09ce759aa [TPU] Remove multi-modal args in TPU backend (#6504) Woosuk Kwon 2024-07-17 04:02:53 -07:00
5fa6e9876e [Bugfix] Fix for multinode crash on 4 PP (#6495) Murali Andoorveedu 2024-07-17 04:25:10 -04:00
5bf35a91e4 [Doc][CI/Build] Update docs and tests to use vllm serve (#6431) Cyrus Leung 2024-07-17 15:43:21 +08:00
a19e8d3726 [Misc][Speculative decoding] Typos and typing fixes (#6467) shangmingc 2024-07-17 15:17:07 +08:00
10383887e0 [ROCm] Cleanup Dockerfile and remove outdated patch (#6482) Hongxia Yang 2024-07-17 01:47:02 -04:00
1d094fd7c0 [Distributed][PP] only create embedding & lm head when necessary (#6455) Wushi Dong 2024-07-16 19:20:26 -07:00
ce37be7ba0 [misc][distributed] add seed to dummy weights (#6491) youkaichao 2024-07-16 19:16:34 -07:00
7f62077af5 [misc][distributed] improve tests (#6488) youkaichao 2024-07-16 17:35:52 -07:00
09c2eb85dd [ci][distributed] add pipeline parallel correctness test (#6410) youkaichao 2024-07-16 15:44:22 -07:00
978aed5300 [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) Michael Goin 2024-07-16 18:31:32 -04:00
160e1d8c99 [Misc] Log spec decode metrics (#6454) Cody Yu 2024-07-16 13:37:10 -07:00
94162beb9f [Doc] Fix the lora adapter path in server startup script (#6230) Jiaxin Shan 2024-07-16 10:11:04 -07:00
c467dff24f [Hardware][TPU] Support MoE with Pallas GMM kernel (#6457) Woosuk Kwon 2024-07-16 09:56:28 -07:00
9f4ccec761 [doc][misc] remind to cancel debugging environment variables (#6481) youkaichao 2024-07-16 09:45:30 -07:00
38ef94888a [CI/Build] Remove "boardwalk" image asset (#6460) Cyrus Leung 2024-07-16 23:59:36 +08:00
2bb0489cb3 [Core] Use numpy to speed up padded token processing (#6442) Peng Guanwen 2024-07-16 23:13:25 +08:00
7508a3dc34 [Misc] Fix typos in spec. decode metrics logging. (#6470) Thomas Parnell 2024-07-16 15:55:15 +02:00
7a3d2a5b95 [Frontend] Support for chat completions input in the tokenize endpoint (#5923) sasha0552 2024-07-16 12:18:09 +00:00
d97011512e [CI/Build] vLLM cache directory for images (#6444) Cyrus Leung 2024-07-16 14:12:25 +08:00
37d776606f [Docs] Announce 5th meetup (#6458) Woosuk Kwon 2024-07-15 21:04:58 -07:00
d92b3c5cde [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419) Joe 2024-07-15 18:54:15 -07:00
9ad32dacd9 [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) Mor Zusman 2024-07-16 04:32:55 +03:00
d6f3b3d5c4 Pin sphinx-argparse version (#6453) Kevin H. Luu 2024-07-15 18:26:11 -07:00
4552e37b55 [CI/Build][TPU] Add TPU CI test (#6277) Woosuk Kwon 2024-07-15 14:31:16 -07:00
ec9933f4a5 [Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod (#6289) Woosuk Kwon 2024-07-15 12:02:14 -07:00
3dee97b05f [Docs] Add Google Cloud to sponsor list (#6450) Woosuk Kwon 2024-07-15 11:58:10 -07:00
4cf256ae7f [misc][distributed] fix pp missing layer condition (#6446) (tag: v0.5.2) youkaichao 2024-07-15 10:32:35 -07:00
64fdc08c72 bump version to v0.5.2 (#6433) Simon Mo 2024-07-15 10:27:40 -07:00
4ef95b0f06 [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409) Thomas Parnell 2024-07-15 19:14:49 +02:00
eaec4b9153 [Bugfix] Add custom Triton cache manager to resolve MoE MP issue (#6140) Thomas Parnell 2024-07-15 19:12:47 +02:00
a63a4c6341 [Misc] Use 0.0.9 version for flashinfer (#6447) Pernekhan Utemuratov 2024-07-15 10:10:26 -07:00
c8fd97f26d [Kernel] Use CUTLASS kernels for the FP8 layers with Bias (#6270) Tyler Michael Smith 2024-07-15 13:05:52 -04:00
94b82e8c18 [doc][distributed] add suggestion for distributed inference (#6418) youkaichao 2024-07-15 09:45:51 -07:00
6ae1597ddf [VLM] Minor space optimization for ClipVisionModel (#6436) Roger Wang 2024-07-15 02:29:51 -07:00
22e79ee8f3 [doc][misc] doc update (#6439) youkaichao 2024-07-14 23:33:25 -07:00
de19916314 [Bugfix] Convert image to RGB by default (#6430) Cyrus Leung 2024-07-15 13:39:15 +08:00
69672f116c [core][distributed] simplify code to support pipeline parallel (#6406) youkaichao 2024-07-14 21:20:51 -07:00
44874a0bf9 [Doc] add env docs for flashinfer backend (#6437) DefTruth 2024-07-15 12:16:51 +08:00
b47008b4d2 [BugFix] BatchResponseData body should be optional (#6345) zifeitong 2024-07-14 21:06:09 -07:00
9bfece89fd Add FUNDING.yml (#6435) Simon Mo 2024-07-14 20:36:16 -07:00
32c9d7f765 Report usage for beam search (#6404) Simon Mo 2024-07-14 19:37:35 -07:00
ccb20db8bd [Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' (#6428) Fish 2024-07-15 10:27:01 +08:00
a754dc2cb9 [CI/Build] Cross python wheel (#6394) Robert Shaw 2024-07-14 21:54:46 -04:00
61e85dbad8 [Doc] xpu backend requires running setvars.sh (#6393) Robert Cohn 2024-07-14 20:10:11 -04:00
dbfe254eda [Feature] vLLM CLI (#5090) Ethan Xu 2024-07-14 15:36:43 -07:00
73030b7dae [ Misc ] Enable Quantizing All Layers of DeekSeekv2 (#6423) Robert Shaw 2024-07-14 17:38:42 -04:00
ccd3c04571 [ci][build] fix commit id (#6420) youkaichao 2024-07-14 07:16:21 -07:00
9dad5cc859 [Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace (#6384) Tyler Michael Smith 2024-07-14 09:37:19 -04:00
6ef3bf912c Remove unnecessary trailing period in spec_decode.rst (#6405) Yuan Tang 2024-07-14 02:58:09 -05:00
