Commit Graph

  • ef65dcfa6f [Doc] Add docs about OpenAI compatible server (#3288) Simon Mo 2024-03-18 22:05:34 -07:00
  • 6a9c583e73 [Core] print error before deadlock (#3459) youkaichao 2024-03-18 21:06:23 -07:00
  • b37cdce2b1 [Core] Cache some utils (#3474) Antoni Baum 2024-03-18 17:14:26 -07:00
  • b30880a762 [Misc] Update README for the Third vLLM Meetup (#3479) Zhuohan Li 2024-03-18 15:58:38 -07:00
  • 49eedea373 [Core] Zero-copy asdict for InputMetadata (#3475) Antoni Baum 2024-03-18 15:56:40 -07:00
  • 9fdf3de346 Cmake based build system (#2830) bnellnm 2024-03-18 18:38:33 -04:00
  • c0c17d4896 [Misc] Fix PR Template (#3478) Zhuohan Li 2024-03-18 15:00:31 -07:00
  • 097aa0ea22 [CI/Build] Fix Bad Import In Test (#3473) Robert Shaw 2024-03-18 15:28:00 -05:00
  • 482b0adf1b [Testing] Add test_config.py to CI (#3437) Cade Daniel 2024-03-18 12:48:45 -07:00
  • 8c654c045f CI: Add ROCm Docker Build (#2886) Simon Mo 2024-03-18 12:33:47 -07:00
  • 9101d832e6 [Bugfix] Make moe_align_block_size AMD-compatible (#3470) Woosuk Kwon 2024-03-18 11:26:24 -07:00
  • 93348d9458 [CI] Shard tests for LoRA and Kernels to speed up (#3445) Simon Mo 2024-03-17 14:56:30 -07:00
  • abfc4f3387 [Misc] Use dataclass for InputMetadata (#3452) Woosuk Kwon 2024-03-17 03:02:46 -07:00
  • 6b78837b29 Fix setup.py neuron-ls issue (#2671) Simon Mo 2024-03-16 16:00:25 -07:00
  • 120157fd2a Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) Simon Mo 2024-03-16 13:35:27 -07:00
  • 8e67598aa6 [Misc] fix line length for entire codebase (#3444) Simon Mo 2024-03-16 00:36:29 -07:00
  • ad50bf4b25 fix lint simon-mo 2024-03-15 22:23:38 -07:00
  • cf6ff18246 Fix Baichuan chat template (#3340) Dinghow Yang 2024-03-16 12:02:12 +08:00
  • 14e3f9a1b2 Replace lstrip() with removeprefix() to fix Ruff linter warning (#2958) Ronen Schaffer 2024-03-16 06:01:30 +02:00
  • 3123f15138 Fixes the incorrect argument in the prefix-prefill test cases (#3246) Tao He 2024-03-16 11:58:10 +08:00
  • 413366e9a2 [Misc] PR templates (#3413) youkaichao 2024-03-15 18:25:51 -07:00
  • 10585e035e Removed Extraneous Print Message From OAI Server (#3440) Robert Shaw 2024-03-15 19:35:36 -05:00
  • fb96c1e98c Asynchronous tokenization (#2879) Antoni Baum 2024-03-15 16:37:01 -07:00
  • 8fa7357f2d fix document error for value and v_vec illustration (#3421) laneeee 2024-03-16 07:06:09 +08:00
  • a7af4538ca Fix issue templates (#3436) Harry Mellor 2024-03-15 21:26:00 +00:00
  • 604f235937 [Misc] add error message in non linux platform (#3438) youkaichao 2024-03-15 14:21:37 -07:00
  • 14b8ae02e7 Fixes the misuse/mixuse of time.time()/time.monotonic() (#3220) Tao He 2024-03-16 02:25:43 +08:00
  • 03d37f2441 [Fix] Add args for mTLS support (#3430) Dan Clark 2024-03-15 09:56:13 -07:00
  • a7c871680e Fix tie_word_embeddings for Qwen2. (#3344) Yang Fan 2024-03-16 00:36:53 +08:00
  • 429284dc37 Fix dist.broadcast stall without group argument (#3408) Junda Chen 2024-03-14 23:25:05 -07:00
  • 253a98078a Add chat templates for ChatGLM (#3418) Dinghow Yang 2024-03-15 14:19:22 +08:00
  • 21539e6856 Add chat templates for Falcon (#3420) Dinghow Yang 2024-03-15 14:19:02 +08:00
  • b522c4476f [Misc] add HOST_IP env var (#3419) youkaichao 2024-03-14 21:32:52 -07:00
  • 78b6c4845a Dynamically configure shared memory size for moe_align_block_size_kernel (#3376) akhoroshev 2024-03-15 04:18:07 +03:00
  • b983ba35bd fix marlin config repr (#3414) Enrique Shockwave 2024-03-14 23:26:19 +00:00
  • 54be8a0be2 Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373) 陈序 2024-03-15 04:56:57 +08:00
  • dfc77408bd [issue templates] add some issue templates (#3412) youkaichao 2024-03-14 13:16:00 -07:00
  • c17ca8ef18 Add args for mTLS support (#3410) Dan Clark 2024-03-14 13:11:45 -07:00
  • 06ec486794 Install flash_attn in Docker image (#3396) Thomas Parnell 2024-03-14 18:55:54 +01:00
  • 8fe8386591 [Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 (#3389) youkaichao 2024-03-14 01:11:48 -07:00
  • a37415c31b allow user to chose which vllm's merics to display in grafana (#3393) Allen.Dou 2024-03-14 14:35:13 +08:00
  • 81653d9688 [Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion (#3383) Simon Mo 2024-03-13 17:02:21 -07:00
  • eeab52a4ff [FIX] Simpler fix for async engine running on ray (#3371) Zhuohan Li 2024-03-13 14:18:40 -07:00
  • c33afd89f5 Fix lint (#3388) Antoni Baum 2024-03-13 13:56:49 -07:00
  • 7e9bd08f60 Add batched RoPE kernel (#3095) Terry 2024-03-13 13:45:26 -07:00
  • ae0ccb4017 Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) Or Sharir 2024-03-13 21:18:25 +02:00
  • 739c350c19 [Minor Fix] Use cupy-cuda11x in CUDA 11.8 build (#3256) 陈序 2024-03-14 00:43:24 +08:00
  • ba8dc958a3 [Minor] Fix bias in if to remove ambiguity (#3259) Hui Liu 2024-03-13 09:16:55 -07:00
  • e221910e77 add hf_transfer to requirements.txt (#3031) Ronan McGovern 2024-03-13 06:33:43 +00:00
  • b167109ba1 [Fix] Fix quantization="gptq" when using Marlin (#3319) Bo-Wen Wang 2024-03-13 13:51:42 +08:00
  • 602358f8a8 Add kernel for GeGLU with approximate GELU (#3337) Woosuk Kwon 2024-03-12 22:06:17 -07:00
  • 49a3c8662b Fixes #1556 double free (#3347) Breno Faria 2024-03-13 01:30:08 +01:00
  • b0925b3878 docs: Add BentoML deployment doc (#3336) Sherlock Xu 2024-03-13 01:34:30 +08:00
  • 654865e21d Support Mistral Model Inference with transformers-neuronx (#3153) DAIZHENWEI 2024-03-11 13:19:51 -07:00
  • c9415c19d3 [ROCm] Fix warp and lane calculation in blockReduceSum (#3321) kliuae 2024-03-12 04:14:07 +08:00
  • 4c922709b6 Add distributed model executor abstraction (#3191) Zhuohan Li 2024-03-11 11:03:45 -07:00
  • 657061fdce [docs] Add LoRA support information for models (#3299) Philipp Moritz 2024-03-11 00:54:51 -07:00
  • 2f8844ba08 Re-enable the 80 char line width limit (#3305) Zhuohan Li 2024-03-10 19:49:14 -07:00
  • 4b59f00e91 [Fix] Fix best_of behavior when n=1 (#3298) Nick Hill 2024-03-10 19:17:46 -07:00
  • 9e8744a545 [BugFix] Fix get tokenizer when using ray (#3301) Roy 2024-03-11 10:17:16 +08:00
  • e4a28e5316 [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) Douglas Lehr 2024-03-10 17:27:45 -05:00
  • 0bba88df03 Enhance lora tests with more layer and rank variations (#3243) Terry 2024-03-09 17:14:16 -08:00
  • 8437bae6ef [Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) Cade Daniel 2024-03-08 23:32:46 -08:00
  • f48c6791b7 [FIX] Fix prefix test error on main (#3286) Zhuohan Li 2024-03-08 17:16:14 -08:00
  • c2c5e0909a Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir (#3241) Michael Goin 2024-03-08 13:33:10 -08:00
  • 1cb0cc2975 [FIX] Make flash_attn optional (#3269) Woosuk Kwon 2024-03-08 10:52:20 -08:00
  • 99c3cfb83c [Docs] Fix Unmocked Imports (#3275) Roger Wang 2024-03-08 09:58:01 -08:00
  • 1ece1ae829 [Minor Fix] Fix comments in benchmark_serving (#3252) TianYu GUO 2024-03-08 14:22:59 +08:00
  • c59e120c55 Feature add lora support for Qwen2 (#3177) whyiug 2024-03-08 13:58:24 +08:00
  • d2339d6840 Connect engine healthcheck to openai server (#3260) Nick Hill 2024-03-07 16:38:12 -08:00
  • b35cc93420 Fix auto prefix bug (#3239) ElizaWszola 2024-03-08 01:37:28 +01:00
  • 8cbba4622c Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) jacobthebanana 2024-03-07 18:03:22 -05:00
  • 385da2dae2 Measure model memory usage (#3120) Michael Goin 2024-03-07 11:42:42 -08:00
  • 2daf23ab0c Separate attention backends (#3005) Woosuk Kwon 2024-03-07 01:45:50 -08:00
  • cbf4c05b15 Update requirements-dev.txt to include package for benchmarking scripts. (#3181) Chen Wang 2024-03-07 03:39:28 -05:00
  • d3c04b6a39 Add GPTQ support for Gemma (#3200) TechxGenus 2024-03-07 08:19:14 +08:00
  • 4cb3b924cd Add tqdm dynamic_ncols=True (#3242) Chujie Zheng 2024-03-06 14:41:42 -08:00
  • a33ce60c66 [Testing] Fix core tests (#3224) Cade Daniel 2024-03-06 01:04:23 -08:00
  • 24aecf421a [Tests] Add block manager and scheduler tests (#3108) SangBin Cho 2024-03-06 11:23:34 +09:00
  • 2efce05dc3 [Fix] Avoid pickling entire LLMEngine for Ray workers (#3207) Nick Hill 2024-03-05 16:17:20 -08:00
  • 8999ec3c16 Store eos_token_id in Sequence for easy access (#3166) Nick Hill 2024-03-05 15:35:43 -08:00
  • 05af6da8d9 [ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs (#3123) Hongxia Yang 2024-03-04 21:14:53 -05:00
  • 9a4548bae7 Fix the openai benchmarking requests to work with latest OpenAI apis (#2992) Chen Wang 2024-03-04 18:51:56 -05:00
  • ff578cae54 Add health check, make async Engine more robust (#3015) Antoni Baum 2024-03-04 14:01:40 -08:00
  • 22de45235c Push logprob generation to LLMEngine (#3065) Antoni Baum 2024-03-04 11:54:06 -08:00
  • 76e8a70476 [Minor fix] The domain dns.google may cause a socket.gaierror exception (#3176) ttbachyinsda 2024-03-05 03:17:12 +08:00
  • 9cbc7e5f3b enable --gpu-memory-utilization in benchmark_throughput.py (#3175) Allen.Dou 2024-03-05 02:37:58 +08:00
  • 27a7b070db Add document for vllm paged attention kernel. (#2978) Jialun Lyu 2024-03-04 09:23:34 -08:00
  • 901cf4c52b [Minor Fix] Remove unused code in benchmark_prefix_caching.py (#3171) TianYu GUO 2024-03-04 14:48:27 +08:00
  • d0fae88114 [DOC] add setup document to support neuron backend (#2777) Liangfu Chen 2024-03-03 17:03:51 -08:00
  • 17c3103c56 Make it easy to profile workers with nsight (#3162) Philipp Moritz 2024-03-03 16:19:13 -08:00
  • 996d095c54 [FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark (#3158) Zhuohan Li 2024-03-03 14:37:18 -08:00
  • d65fac2738 Add vLLM version info to logs and openai API server (#3161) Jason Cox 2024-03-03 00:00:29 -05:00
  • ce4f5a29fb Add Automatic Prefix Caching (#2762) Sage Moore 2024-03-02 03:50:01 -05:00
  • baee28c46c Reorder kv dtype check to avoid nvcc not found error on AMD platform (#3104) cloudhan 2024-03-02 14:34:48 +08:00
  • 29e70e3e88 allow user chose log level by --log-level instead of fixed 'info'. (#3109) Allen.Dou 2024-03-02 07:28:41 +08:00
  • 82091b864a Bump up to v0.3.3 (#3129) (tag: v0.3.3) Woosuk Kwon 2024-03-01 12:58:06 -08:00
  • c0c2335ce0 Integrate Marlin Kernels for Int4 GPTQ inference (#2497) Robert Shaw 2024-03-01 14:47:51 -06:00
  • 90fbf12540 fix relative import path of protocol.py (#3134) Huarong 2024-03-02 03:42:06 +08:00
  • 49d849b3ab docs: Add tutorial on deploying vLLM model with KServe (#2586) Yuan Tang 2024-03-01 14:04:14 -05:00