Commit Graph

  • 3aa7b6cf66 [Misc][Doc] Add Example of using OpenAI Server with VLM (#5832) Roger Wang 2024-06-25 20:34:25 -07:00
  • dda4811591 [Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408) Stephanie Wang 2024-06-25 20:30:03 -07:00
  • 82079729cc [Bugfix] Fix assertion in NeuronExecutor (#5841) aws-patlange 2024-06-25 19:52:10 -07:00
  • c2a8ac75e0 [CI/Build] Add E2E tests for MLPSpeculator (#5791) Thomas Parnell 2024-06-26 01:04:08 +01:00
  • f178e56c68 [Hardware][TPU] Raise errors for unsupported sampling params (#5850) Woosuk Kwon 2024-06-25 16:58:23 -07:00
  • dd793d1de5 [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422) Matt Wong 2024-06-25 17:56:15 -05:00
  • bc34937d68 [Hardware][TPU] Refactor TPU backend (#5831) Woosuk Kwon 2024-06-25 15:25:52 -07:00
  • dd248f7675 [Misc] Update w4a16 compressed-tensors support to include w8a16 (#5794) Dipika Sikka 2024-06-25 15:23:35 -04:00
  • d9b34baedd [CI/Build] Add unit testing for FlexibleArgumentParser (#5798) Michael Goin 2024-06-25 15:18:03 -04:00
  • c18ebfdd71 [doc][distributed] add both gloo and nccl tests (#5834) youkaichao 2024-06-25 12:10:28 -07:00
  • 67882dbb44 [Core] Add fault tolerance for RayTokenizerGroupPool (#5748) Antoni Baum 2024-06-25 10:15:10 -07:00
  • 7b99314301 [Misc] Remove useless code in cpu_worker (#5824) Jie Fu (傅杰) 2024-06-26 00:41:36 +08:00
  • 2ce5d6688b [Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) Woo-Yeon Lee 2024-06-25 18:56:06 +09:00
  • f23871e9ee [Doc] Add notice about breaking changes to VLMs (#5818) Cyrus Leung 2024-06-25 16:25:03 +08:00
  • e9de9dd551 [ci] Remove aws template (#5757) Kevin H. Luu 2024-06-24 21:09:02 -07:00
  • ba991d5c84 [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args (#5795) Chang Su 2024-06-24 16:01:19 -07:00
  • 1744cc99ba [Doc] Add Phi-3-medium to list of supported models (#5788) Michael Goin 2024-06-24 13:48:55 -04:00
  • e72dc6cb35 [Doc] Add "Suggest edit" button to doc pages (#5789) Michael Goin 2024-06-24 13:26:17 -04:00
  • c246212952 [doc][faq] add warning to download models for every nodes (#5783) youkaichao 2024-06-24 00:37:42 -07:00
  • edd5fe5fa2 [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (#5772) Isotr0py 2024-06-24 12:11:53 +08:00
  • 5d4d90536f [Distributed] Add send and recv helpers (#5719) Murali Andoorveedu 2024-06-23 17:42:28 -04:00
  • 6c916ac8a8 [BugFix] [Kernel] Add Cutlass2x fallback kernels (#5744) Varun Sundar Rabindranath 2024-06-24 02:37:11 +05:30
  • 832ea88fcb [core][distributed] improve shared memory broadcast (#5754) youkaichao 2024-06-22 10:00:43 -07:00
  • 8c00f9c15d [Docs][TPU] Add installation tip for TPU (#5761) Woosuk Kwon 2024-06-21 23:09:40 -07:00
  • 0cbc1d2b4f [Bugfix] Fix pin_lora error in TPU executor (#5760) Woosuk Kwon 2024-06-21 22:25:14 -07:00
  • ff9ddbceee [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py (#5756) zifeitong 2024-06-21 20:33:12 -07:00
  • 9c62db07ed [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs (#5710) Jie Fu (傅杰) 2024-06-22 10:07:08 +08:00
  • cf90ae0123 [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline (#5616) Kunshang Ji 2024-06-22 08:09:34 +08:00
  • f5dda63eb5 [LoRA] Add support for pinning lora adapters in the LRU cache (#5603) rohithkrn 2024-06-21 15:42:46 -07:00
  • 7187507301 [ci][test] fix ca test in main (#5746) youkaichao 2024-06-21 14:04:26 -07:00
  • f1e72cc19a [BugFix] exclude version 1.15.0 for modelscope (#5668) zhyncs 2024-06-22 03:15:48 +08:00
  • 5b15bde539 [Doc] Documentation on supported hardware for quantization methods (#5745) Michael Goin 2024-06-21 12:44:29 -04:00
  • bd620b01fb [Kernel][CPU] Add Quick gelu to CPU (#5717) Roger Wang 2024-06-20 23:39:40 -07:00
  • d9a252bc8e [Core][Distributed] add shm broadcast (#5399) youkaichao 2024-06-20 22:12:35 -07:00
  • 67005a07bc [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665) Jee Li 2024-06-21 12:46:28 +08:00
  • c35e4a3dd7 [BugFix] Fix test_phi3v.py (#5725) Chang Su 2024-06-20 21:45:34 -07:00
  • 1f5674218f [Kernel] Add punica dimension for Qwen2 LoRA (#5441) Jinzhen Lin 2024-06-21 08:55:41 +08:00
  • b12518d3cf [Model] MLPSpeculator speculative decoding support (#4947) Joshua Rosenkranz 2024-06-20 20:23:12 -04:00
  • 6c5b7af152 [distributed][misc] use fork by default for mp (#5669) youkaichao 2024-06-20 17:06:34 -07:00
  • 8065a7e220 [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718) Michael Goin 2024-06-20 19:00:13 -04:00
  • 3f3b6b2150 [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels (#5715) Tyler Michael Smith 2024-06-20 14:36:10 -04:00
  • a7dcc62086 [Kernel] Update Cutlass int8 kernel configs for SM80 (#5275) Varun Sundar Rabindranath 2024-06-20 19:03:21 +05:30
  • ad137cd111 [Model] Port over CLIPVisionModel for VLMs (#5591) Roger Wang 2024-06-20 04:52:09 -07:00
  • 111af1fa2c [Kernel] Update Cutlass int8 kernel configs for SM90 (#5514) Varun Sundar Rabindranath 2024-06-20 12:07:08 +05:30
  • 1b2eaac316 [Bugfix][Doc] Fix Duplicate Explicit Target Name Errors (#5703) Roger Wang 2024-06-19 23:10:47 -07:00
  • 3730a1c832 [Misc] Improve conftest (#5681) Cyrus Leung 2024-06-20 10:09:21 +08:00
  • 949e49a685 [ci] Limit num gpus if specified for A100 (#5694) Kevin H. Luu 2024-06-19 16:30:03 -07:00
  • 4a30d7e3cc [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650) Dipika Sikka 2024-06-19 18:06:44 -04:00
  • e83db9e7e3 [Doc] Update docker references (#5614) Rafael Vasquez 2024-06-19 18:01:45 -04:00
  • 78687504f7 [Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) zifeitong 2024-06-19 13:57:12 -07:00
  • d571ca0108 [ci][distributed] add tests for custom allreduce (#5689) youkaichao 2024-06-19 13:16:04 -07:00
  • afed90a034 [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py (#5688) Michael Goin 2024-06-19 14:41:42 -04:00
  • 3ee5c4bca5 [ci] Add A100 queue into AWS CI template (#5648) Kevin H. Luu 2024-06-19 07:42:13 -07:00
  • e9c2732b97 [CI/Build] Add tqdm to dependencies (#5680) Cyrus Leung 2024-06-19 22:37:33 +08:00
  • d8714530d1 [Misc] Add param max-model-len in benchmark_latency.py (#5629) DearPlanet 2024-06-19 18:19:08 +08:00
  • 7d46c8d378 [Bugfix] Fix sampling_params passed incorrectly in Phi3v example (#5684) Isotr0py 2024-06-19 17:58:32 +08:00
  • da971ec7a5 [Model] Add FP8 kv cache for Qwen2 (#5656) Michael Goin 2024-06-19 05:38:26 -04:00
  • 3eea74889f [misc][distributed] use 127.0.0.1 for single-node (#5619) youkaichao 2024-06-19 01:05:00 -07:00
  • f758aed0e8 [Bugfix][CI/Build][AMD][ROCm] Fixed the cmake build bug which generates garbage on certain devices (#5641) Hongxia Yang 2024-06-19 02:21:29 -04:00
  • e5150f2c28 [Bugfix] Added test for sampling repetition penalty bug. (#5659) Thomas Parnell 2024-06-19 08:03:55 +02:00
  • 59a1eb59c9 [Bugfix] Fix Phi-3 Long RoPE scaling implementation (#5628) Shukant Pal 2024-06-18 18:46:38 -07:00
  • 6820724e51 [Bugfix] Fix w8a8 benchmarks for int8 case (#5643) Tyler Michael Smith 2024-06-18 20:33:25 -04:00
  • b23ce92032 [Bugfix] Fix CUDA version check for mma warning suppression (#5642) Tyler Michael Smith 2024-06-18 19:48:49 -04:00
  • 2bd231a7b7 [Doc] Added cerebrium as Integration option (#5553) milo157 2024-06-18 18:56:59 -04:00
  • 8a173382c8 [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties (#5639) Thomas Parnell 2024-06-18 23:18:37 +02:00
  • 07feecde1a [Model] LoRA support added for command-r (#5178) sergey-tinkoff 2024-06-18 21:01:21 +03:00
  • 19091efc44 [ci] Setup Release pipeline and build release wheels with cache (#5610) Kevin H. Luu 2024-06-18 11:00:36 -07:00
  • 95db455e7f [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542) Dipika Sikka 2024-06-18 12:45:05 -04:00
  • 7879f24dcc [Misc] Add OpenTelemetry support (#4687) Ronen Schaffer 2024-06-18 19:17:03 +03:00
  • 13db4369d9 [ci] Deprecate original CI template (#5624) Kevin H. Luu 2024-06-18 07:26:20 -07:00
  • 4ad7b53e59 [CI/Build][Misc] Update Pytest Marker for VLMs (#5623) Roger Wang 2024-06-18 06:10:04 -07:00
  • f0cc0e68e3 [Misc] Remove import from transformers logging (#5625) Chang Su 2024-06-18 05:12:19 -07:00
  • db5ec52ad7 [bugfix][distributed] improve p2p capability test (#5612) youkaichao 2024-06-18 00:21:05 -07:00
  • 114d7270ff [CI] Avoid naming different metrics with the same name in performance benchmark (#5615) Kuntai Du 2024-06-17 21:37:18 -07:00
  • 32c86e494a [Misc] Fix typo (#5618) Cyrus Leung 2024-06-18 11:58:30 +08:00
  • 8eadcf0b90 [misc][typo] fix typo (#5620) youkaichao 2024-06-17 20:54:57 -07:00
  • 5002175e80 [Kernel] Add punica dimensions for Granite 13b (#5559) Joe Runde 2024-06-17 21:54:11 -06:00
  • daef218b55 [Model] Initialize Phi-3-vision support (#4986) Isotr0py 2024-06-18 10:34:33 +08:00
  • fa9e385229 [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131) sroy745 2024-06-17 19:29:09 -07:00
  • 26e1188e51 [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py (#5606) zifeitong 2024-06-17 16:16:10 -07:00
  • a3e8a05d4c [Bugfix] Fix KV head calculation for MPT models when using GQA (#5142) Bruce Fontaine 2024-06-17 15:26:41 -07:00
  • e441bad674 [Optimization] use a pool to reuse LogicalTokenBlock.token_ids (#5584) youkaichao 2024-06-17 15:08:05 -07:00
  • 1b44aaf4e3 [bugfix][distributed] fix 16 gpus local rank arrangement (#5604) youkaichao 2024-06-17 14:35:04 -07:00
  • 9e4e6fe207 [CI] the readability of benchmarking and prepare for dashboard (#5571) Kuntai Du 2024-06-17 11:41:08 -07:00
  • ab66536dbf [CI/BUILD] Support non-AVX512 vLLM building and testing (#5574) Jie Fu (傅杰) 2024-06-18 02:36:10 +08:00
  • 728c4c8a06 [Hardware][Intel GPU] Add Intel GPU(XPU) inference backend (#3814) Kunshang Ji 2024-06-18 02:01:25 +08:00
  • 1f12122b17 [Misc] use AutoTokenizer for benchmark serving when vLLM not installed (#5588) zhyncs 2024-06-18 00:40:35 +08:00
  • 890d8d960b [Kernel] compressed-tensors marlin 24 support (#5435) Dipika Sikka 2024-06-17 12:32:48 -04:00
  • 9e74d9d003 Correct alignment in the seq_len diagram. (#5592) Charles Riggins 2024-06-18 00:05:33 +08:00
  • 9333fb8eb9 [Model] Rename Phi3 rope scaling type (#5595) Amit Garg 2024-06-17 09:04:14 -07:00
  • e2b85cf86a Fix w8a8 benchmark and add Llama-3-8B (#5562) Cody Yu 2024-06-16 23:48:06 -07:00
  • 845a3f26f9 [Doc] add debugging tips for crash and multi-node debugging (#5581) youkaichao 2024-06-16 19:08:01 -07:00
  • f07d513320 [build][misc] limit numpy version (#5582) youkaichao 2024-06-16 16:07:01 -07:00
  • 4a6769053a [CI][BugFix] Flip is_quant_method_supported condition (#5577) Michael Goin 2024-06-16 10:07:34 -04:00
  • f31c1f90e3 Add basic correctness 2 GPU tests to 4 GPU pipeline (#5518) Antoni Baum 2024-06-16 00:48:02 -07:00
  • 3ce2c050dd [Fix] Correct OpenAI batch response format (#5554) zifeitong 2024-06-15 16:57:54 -07:00
  • 1c0afa13c5 [BugFix] Don't start a Ray cluster when not using Ray (#5570) Nick Hill 2024-06-15 16:30:51 -07:00
  • d919ecc771 add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 (#5145) Alexander Matveev 2024-06-15 13:38:16 -04:00
  • e691918e3b [misc] Do not allow to use lora with chunked prefill. (#5538) SangBin Cho 2024-06-15 23:59:36 +09:00
  • 81fbb3655f [CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568) Cyrus Leung 2024-06-15 19:29:42 +08:00