biondizzle/vllm

Commit Graph

ef65dcfa6f [Doc] Add docs about OpenAI compatible server (#3288) Simon Mo 2024-03-18 22:05:34 -07:00
6a9c583e73 [Core] print error before deadlock (#3459) youkaichao 2024-03-18 21:06:23 -07:00
b37cdce2b1 [Core] Cache some utils (#3474) Antoni Baum 2024-03-18 17:14:26 -07:00
b30880a762 [Misc] Update README for the Third vLLM Meetup (#3479) Zhuohan Li 2024-03-18 15:58:38 -07:00
49eedea373 [Core] Zero-copy asdict for InputMetadata (#3475) Antoni Baum 2024-03-18 15:56:40 -07:00
9fdf3de346 Cmake based build system (#2830) bnellnm 2024-03-18 18:38:33 -04:00
c0c17d4896 [Misc] Fix PR Template (#3478) Zhuohan Li 2024-03-18 15:00:31 -07:00
097aa0ea22 [CI/Build] Fix Bad Import In Test (#3473) Robert Shaw 2024-03-18 15:28:00 -05:00
482b0adf1b [Testing] Add test_config.py to CI (#3437) Cade Daniel 2024-03-18 12:48:45 -07:00
8c654c045f CI: Add ROCm Docker Build (#2886) Simon Mo 2024-03-18 12:33:47 -07:00
9101d832e6 [Bugfix] Make moe_align_block_size AMD-compatible (#3470) Woosuk Kwon 2024-03-18 11:26:24 -07:00
93348d9458 [CI] Shard tests for LoRA and Kernels to speed up (#3445) Simon Mo 2024-03-17 14:56:30 -07:00
abfc4f3387 [Misc] Use dataclass for InputMetadata (#3452) Woosuk Kwon 2024-03-17 03:02:46 -07:00
6b78837b29 Fix setup.py neuron-ls issue (#2671) Simon Mo 2024-03-16 16:00:25 -07:00
120157fd2a Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) Simon Mo 2024-03-16 13:35:27 -07:00
8e67598aa6 [Misc] fix line length for entire codebase (#3444) Simon Mo 2024-03-16 00:36:29 -07:00
ad50bf4b25 fix lint simon-mo 2024-03-15 22:23:38 -07:00
cf6ff18246 Fix Baichuan chat template (#3340) Dinghow Yang 2024-03-16 12:02:12 +08:00
14e3f9a1b2 Replace lstrip() with removeprefix() to fix Ruff linter warning (#2958) Ronen Schaffer 2024-03-16 06:01:30 +02:00
3123f15138 Fixes the incorrect argument in the prefix-prefill test cases (#3246) Tao He 2024-03-16 11:58:10 +08:00
413366e9a2 [Misc] PR templates (#3413) youkaichao 2024-03-15 18:25:51 -07:00
10585e035e Removed Extraneous Print Message From OAI Server (#3440) Robert Shaw 2024-03-15 19:35:36 -05:00
fb96c1e98c Asynchronous tokenization (#2879) Antoni Baum 2024-03-15 16:37:01 -07:00
8fa7357f2d fix document error for value and v_vec illustration (#3421) laneeee 2024-03-16 07:06:09 +08:00
a7af4538ca Fix issue templates (#3436) Harry Mellor 2024-03-15 21:26:00 +00:00
604f235937 [Misc] add error message in non linux platform (#3438) youkaichao 2024-03-15 14:21:37 -07:00
14b8ae02e7 Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220) Tao He 2024-03-16 02:25:43 +08:00
03d37f2441 [Fix] Add args for mTLS support (#3430) Dan Clark 2024-03-15 09:56:13 -07:00
a7c871680e Fix tie_word_embeddings for Qwen2. (#3344) Yang Fan 2024-03-16 00:36:53 +08:00
429284dc37 Fix dist.broadcast stall without group argument (#3408) Junda Chen 2024-03-14 23:25:05 -07:00
253a98078a Add chat templates for ChatGLM (#3418) Dinghow Yang 2024-03-15 14:19:22 +08:00
21539e6856 Add chat templates for Falcon (#3420) Dinghow Yang 2024-03-15 14:19:02 +08:00
b522c4476f [Misc] add HOST_IP env var (#3419) youkaichao 2024-03-14 21:32:52 -07:00
78b6c4845a Dynamically configure shared memory size for moe_align_block_size_kernel (#3376) akhoroshev 2024-03-15 04:18:07 +03:00
b983ba35bd fix marlin config repr (#3414) Enrique Shockwave 2024-03-14 23:26:19 +00:00
54be8a0be2 Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373) 陈序 2024-03-15 04:56:57 +08:00
dfc77408bd [issue templates] add some issue templates (#3412) youkaichao 2024-03-14 13:16:00 -07:00
c17ca8ef18 Add args for mTLS support (#3410) Dan Clark 2024-03-14 13:11:45 -07:00
06ec486794 Install flash_attn in Docker image (#3396) Thomas Parnell 2024-03-14 18:55:54 +01:00
8fe8386591 [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) youkaichao 2024-03-14 01:11:48 -07:00
a37415c31b allow user to chose which vllm's merics to display in grafana (#3393) Allen.Dou 2024-03-14 14:35:13 +08:00
81653d9688 [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion (#3383) Simon Mo 2024-03-13 17:02:21 -07:00
eeab52a4ff [FIX] Simpler fix for async engine running on ray (#3371) Zhuohan Li 2024-03-13 14:18:40 -07:00
c33afd89f5 Fix lint (#3388) Antoni Baum 2024-03-13 13:56:49 -07:00
7e9bd08f60 Add batched RoPE kernel (#3095) Terry 2024-03-13 13:45:26 -07:00
ae0ccb4017 Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) Or Sharir 2024-03-13 21:18:25 +02:00
739c350c19 [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256) 陈序 2024-03-14 00:43:24 +08:00
ba8dc958a3 [Minor] Fix bias in if to remove ambiguity (#3259) Hui Liu 2024-03-13 09:16:55 -07:00
e221910e77 add hf_transfer to requirements.txt (#3031) Ronan McGovern 2024-03-13 06:33:43 +00:00
b167109ba1 [Fix] Fix quantization="gptq" when using Marlin (#3319) Bo-Wen Wang 2024-03-13 13:51:42 +08:00
602358f8a8 Add kernel for GeGLU with approximate GELU (#3337) Woosuk Kwon 2024-03-12 22:06:17 -07:00
49a3c8662b Fixes #1556 double free (#3347) Breno Faria 2024-03-13 01:30:08 +01:00
b0925b3878 docs: Add BentoML deployment doc (#3336) Sherlock Xu 2024-03-13 01:34:30 +08:00
654865e21d Support Mistral Model Inference with transformers-neuronx (#3153) DAIZHENWEI 2024-03-11 13:19:51 -07:00
c9415c19d3 [ROCm] Fix warp and lane calculation in blockReduceSum (#3321) kliuae 2024-03-12 04:14:07 +08:00
4c922709b6 Add distributed model executor abstraction (#3191) Zhuohan Li 2024-03-11 11:03:45 -07:00
657061fdce [docs] Add LoRA support information for models (#3299) Philipp Moritz 2024-03-11 00:54:51 -07:00
2f8844ba08 Re-enable the 80 char line width limit (#3305) Zhuohan Li 2024-03-10 19:49:14 -07:00
4b59f00e91 [Fix] Fix best_of behavior when n=1 (#3298) Nick Hill 2024-03-10 19:17:46 -07:00
9e8744a545 [BugFix] Fix get tokenizer when using ray (#3301) Roy 2024-03-11 10:17:16 +08:00
e4a28e5316 [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) Douglas Lehr 2024-03-10 17:27:45 -05:00
0bba88df03 Enhance lora tests with more layer and rank variations (#3243) Terry 2024-03-09 17:14:16 -08:00
8437bae6ef [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) Cade Daniel 2024-03-08 23:32:46 -08:00
f48c6791b7 [FIX] Fix prefix test error on main (#3286) Zhuohan Li 2024-03-08 17:16:14 -08:00
c2c5e0909a Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir (#3241) Michael Goin 2024-03-08 13:33:10 -08:00
1cb0cc2975 [FIX] Make flash_attn optional (#3269) Woosuk Kwon 2024-03-08 10:52:20 -08:00
99c3cfb83c [Docs] Fix Unmocked Imports (#3275) Roger Wang 2024-03-08 09:58:01 -08:00
1ece1ae829 [Minor Fix] Fix comments in benchmark_serving (#3252) TianYu GUO 2024-03-08 14:22:59 +08:00
c59e120c55 Feature add lora support for Qwen2 (#3177) whyiug 2024-03-08 13:58:24 +08:00
d2339d6840 Connect engine healthcheck to openai server (#3260) Nick Hill 2024-03-07 16:38:12 -08:00
b35cc93420 Fix auto prefix bug (#3239) ElizaWszola 2024-03-08 01:37:28 +01:00
8cbba4622c Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) jacobthebanana 2024-03-07 18:03:22 -05:00
385da2dae2 Measure model memory usage (#3120) Michael Goin 2024-03-07 11:42:42 -08:00
2daf23ab0c Separate attention backends (#3005) Woosuk Kwon 2024-03-07 01:45:50 -08:00
cbf4c05b15 Update requirements-dev.txt to include package for benchmarking scripts. (#3181) Chen Wang 2024-03-07 03:39:28 -05:00
d3c04b6a39 Add GPTQ support for Gemma (#3200) TechxGenus 2024-03-07 08:19:14 +08:00
4cb3b924cd Add tqdm dynamic_ncols=True (#3242) Chujie Zheng 2024-03-06 14:41:42 -08:00
a33ce60c66 [Testing] Fix core tests (#3224) Cade Daniel 2024-03-06 01:04:23 -08:00
24aecf421a [Tests] Add block manager and scheduler tests (#3108) SangBin Cho 2024-03-06 11:23:34 +09:00
2efce05dc3 [Fix] Avoid pickling entire LLMEngine for Ray workers (#3207) Nick Hill 2024-03-05 16:17:20 -08:00
8999ec3c16 Store eos_token_id in Sequence for easy access (#3166) Nick Hill 2024-03-05 15:35:43 -08:00
05af6da8d9 [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#3123) Hongxia Yang 2024-03-04 21:14:53 -05:00
9a4548bae7 Fix the openai benchmarking requests to work with latest OpenAI apis (#2992) Chen Wang 2024-03-04 18:51:56 -05:00
ff578cae54 Add health check, make async Engine more robust (#3015) Antoni Baum 2024-03-04 14:01:40 -08:00
22de45235c Push logprob generation to LLMEngine (#3065) Antoni Baum 2024-03-04 11:54:06 -08:00
76e8a70476 [Minor fix] The domain dns.google may cause a socket.gaierror exception (#3176) ttbachyinsda 2024-03-05 03:17:12 +08:00
9cbc7e5f3b enable --gpu-memory-utilization in benchmark_throughput.py (#3175) Allen.Dou 2024-03-05 02:37:58 +08:00
27a7b070db Add document for vllm paged attention kernel. (#2978) Jialun Lyu 2024-03-04 09:23:34 -08:00
901cf4c52b [Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) TianYu GUO 2024-03-04 14:48:27 +08:00
d0fae88114 [DOC] add setup document to support neuron backend (#2777) Liangfu Chen 2024-03-03 17:03:51 -08:00
17c3103c56 Make it easy to profile workers with nsight (#3162) Philipp Moritz 2024-03-03 16:19:13 -08:00
996d095c54 [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) Zhuohan Li 2024-03-03 14:37:18 -08:00
d65fac2738 Add vLLM version info to logs and openai API server (#3161) Jason Cox 2024-03-03 00:00:29 -05:00
ce4f5a29fb Add Automatic Prefix Caching (#2762) Sage Moore 2024-03-02 03:50:01 -05:00
baee28c46c Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) cloudhan 2024-03-02 14:34:48 +08:00
29e70e3e88 allow user chose log level by --log-level instead of fixed 'info'. (#3109) Allen.Dou 2024-03-02 07:28:41 +08:00
82091b864a Bump up to v0.3.3 (#3129) v0.3.3 Woosuk Kwon 2024-03-01 12:58:06 -08:00
c0c2335ce0 Integrate Marlin Kernels for Int4 GPTQ inference (#2497) Robert Shaw 2024-03-01 14:47:51 -06:00
90fbf12540 fix relative import path of protocol.py (#3134) Huarong 2024-03-02 03:42:06 +08:00
49d849b3ab docs: Add tutorial on deploying vLLM model with KServe (#2586) Yuan Tang 2024-03-01 14:04:14 -05:00
