Commit Graph

  • 0e9164b40a [mypy] Enable type checking for test directory (#5017) Cyrus Leung 2024-06-15 12:45:31 +08:00
  • 1b8a0d71cf [Core][Bugfix]: fix prefix caching for blockv2 (#5364) leiwen83 2024-06-15 08:23:56 +08:00
  • bd7efe95d0 Add ccache to amd (#5555) Simon Mo 2024-06-14 19:18:22 -05:00
  • f5bb85b435 [Core][Distributed] improve p2p cache generation (#5528) youkaichao 2024-06-14 14:47:45 -07:00
  • 28c145eb57 [Bugfix] Fix typo in Pallas backend (#5558) Woosuk Kwon 2024-06-14 14:40:09 -07:00
  • e2afb03c92 [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460) Thomas Parnell 2024-06-14 22:28:11 +02:00
  • 6e2527a7cb [Doc] Update documentation on Tensorizer (#5471) Sanger Steel 2024-06-14 14:27:57 -04:00
  • cdab68dcdb [Docs] Add ZhenFund as a Sponsor (#5548) Simon Mo 2024-06-14 13:17:21 -05:00
  • d1c3d7d139 [misc][distributed] fix benign error in is_in_the_same_node (#5512) youkaichao 2024-06-14 10:59:28 -07:00
  • 77490c6f2f [Core] Remove duplicate processing in async engine (#5525) Cyrus Leung 2024-06-15 01:04:42 +08:00
  • 48f589e18b [mis] fix flaky test of test_cuda_device_count_stateless (#5546) youkaichao 2024-06-14 10:02:23 -07:00
  • 348616ac4b [Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401) Tyler Michael Smith 2024-06-14 13:02:00 -04:00
  • 15985680e2 [ Misc ] Rs/compressed tensors cleanup (#5432) Robert Shaw 2024-06-14 13:01:46 -04:00
  • d74674bbd9 [Misc] Fix arg names (#5524) Allen.Dou 2024-06-15 00:47:44 +08:00
  • 703475f6c2 [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516) Tyler Michael Smith 2024-06-14 12:30:15 -04:00
  • d47af2bc02 [CI/Build] Disable LLaVA-NeXT CPU test (#5529) Cyrus Leung 2024-06-15 00:27:30 +08:00
  • 319ad7f1d3 [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label (#5073) Kuntai Du 2024-06-13 22:36:20 -07:00
  • 0f0d8bc065 bump version to v0.5.0.post1 (#5522) Simon Mo 2024-06-13 21:42:06 -05:00
  • 55d6361b13 [Misc] Fix arg names in quantizer script (#5507) Allen.Dou 2024-06-14 10:02:53 +08:00
  • cd9c0d65d9 [Hardware][Intel] Support CPU inference with AVX2 ISA (#5452) Jie Fu (傅杰) 2024-06-14 07:22:24 +08:00
  • 50eed24d25 Add cuda_device_count_stateless (#5473) (tag: v0.5.0.post1) Antoni Baum 2024-06-13 16:06:49 -07:00
  • e38042d4af [Kernel] Disable CUTLASS kernels for fp8 (#5505) Tyler Michael Smith 2024-06-13 16:38:05 -04:00
  • 33e3b37242 [CI/Build] Disable test_fp8.py (#5508) Tyler Michael Smith 2024-06-13 16:37:48 -04:00
  • 1696efe6c9 [misc] fix format.sh (#5511) youkaichao 2024-06-13 12:09:16 -07:00
  • 6b0511a57b Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478) Antoni Baum 2024-06-13 11:22:50 -07:00
  • a8fda4f661 Seperate dev requirements into lint and test (#5474) Antoni Baum 2024-06-13 11:22:41 -07:00
  • 30299a41fa [MISC] Remove FP8 warning (#5472) Cody Yu 2024-06-13 11:22:30 -07:00
  • 85657b5607 [Kernel] Factor out epilogues from cutlass kernels (#5391) Tyler Michael Smith 2024-06-13 14:22:19 -04:00
  • 0ce7b952f8 [Doc] Update LLaVA docs (#5437) Cyrus Leung 2024-06-14 02:22:07 +08:00
  • 39873476f8 [CI/Build] Simplify OpenAI server setup in tests (#5100) Cyrus Leung 2024-06-14 02:21:53 +08:00
  • 03dccc886e [Misc] Add vLLM version getter to utils (#5098) Cyrus Leung 2024-06-14 02:21:39 +08:00
  • a65634d3ae [Docs] Add 4th meetup slides (#5509) Woosuk Kwon 2024-06-13 10:18:26 -07:00
  • 80aa7e91fc [Hardware][Intel] Optimize CPU backend and add more performance tips (#4971) Li, Jiang 2024-06-14 00:33:14 +08:00
  • bd43973522 [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497) wenyujin333 2024-06-14 00:01:10 +08:00
  • 23ec72fa03 [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466) Michael Goin 2024-06-13 11:18:08 -04:00
  • c2637a613b [Kernel] w4a16 support for compressed-tensors (#5385) Dipika Sikka 2024-06-13 10:19:56 -04:00
  • 88407532e7 [Bugfix]if the content is started with ":"(response of ping), client should i… (#5303) Wang, Yi 2024-06-13 11:16:41 +08:00
  • 916d219d62 [ci] Use sccache to build images (#5419) Kevin H. Luu 2024-06-12 17:58:12 -07:00
  • ea3890a5f0 [Core][Distributed] code deduplication in tp&pp with coordinator(#5293) youkaichao 2024-06-12 17:27:08 -07:00
  • 2135cacb45 [Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451) Isotr0py 2024-06-13 07:20:18 +08:00
  • 7d19de2e9c [Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425) Michael Goin 2024-06-12 18:42:12 -04:00
  • 94a07bbdd8 [Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470) Michael Goin 2024-06-12 17:59:44 -04:00
  • b8d4dfff9c [Doc] Update debug docs (#5438) Cyrus Leung 2024-06-13 05:49:31 +08:00
  • 622d45128c [misc] add hint for AttributeError (#5462) youkaichao 2024-06-12 14:46:35 -07:00
  • 51602eefd3 [Frontend] [Core] Support for sharded tensorized models (#4990) Travis Johnson 2024-06-12 15:13:52 -06:00
  • 5cc50a531f [Bugfix] TYPE_CHECKING for MultiModalData (#5444) Arthur Kim 2024-06-13 06:08:52 +09:00
  • 5985e3427d [Kernel] Vectorized FP8 quantize kernel (#5396) Cody Yu 2024-06-12 14:07:26 -07:00
  • 8b82a89997 [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests (#5464) Kevin H. Luu 2024-06-12 14:00:18 -07:00
  • c3c2903e72 [Bugfix] Add device assertion to TorchSDPA (#5402) Li, Jiang 2024-06-13 03:58:53 +08:00
  • 1a8bfd92d5 [Hardware] Initial TPU integration (#5292) Woosuk Kwon 2024-06-12 11:53:03 -07:00
  • 847cdcca1c [CI] Upgrade codespell version. (#5381) SangBin Cho 2024-06-13 02:06:14 +09:00
  • e3c12bf6d2 Revert "[CI/Build] Add is_quant_method_supported to control quantization test configurations" (#5463) Simon Mo 2024-06-12 12:03:24 -05:00
  • 3dd6853bc8 [CI/Build] Add is_quant_method_supported to control quantization test configurations (#5253) Michael Goin 2024-06-12 12:58:02 -04:00
  • 8f89d72090 [Doc] add common case for long waiting time (#5430) (tag: v0.5.0) youkaichao 2024-06-11 11:12:13 -07:00
  • 99dac099ab [Core][Doc] Default to multiprocessing for single-node distributed case (#5230) Nick Hill 2024-06-11 11:10:41 -07:00
  • c4bd03c7c5 [Core][Distributed] add same-node detection (#5369) youkaichao 2024-06-11 10:53:59 -07:00
  • dcbf4286af [Frontend] Customizable RoPE theta (#5197) sasha0552 2024-06-11 17:42:26 +00:00
  • 00e6a2dc53 [Bugfix] fix lora_dtype value type in arg_utils.py (#5398) Ali Panahi 2024-06-11 10:40:23 -07:00
  • 2e02311a1b [Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 (#5254) Junichi Sato 2024-06-12 02:38:07 +09:00
  • 89ec06c33b [Docs] [Spec decode] Fix docs error in code example (#5427) Cade Daniel 2024-06-11 10:31:56 -07:00
  • 9fde251bf0 [Doc] Add an automatic prefix caching section in vllm documentation (#5324) Kuntai Du 2024-06-11 10:24:59 -07:00
  • 4c2ffb28ff [Speculative decoding] Initial spec decode docs (#5400) Cade Daniel 2024-06-11 10:15:40 -07:00
  • 246598a6b1 [CI] docfix (#5410) SangBin Cho 2024-06-11 17:28:50 +09:00
  • 8bab4959be [Misc] Remove VLLM_BUILD_WITH_NEURON env variable (#5389) Woosuk Kwon 2024-06-11 00:37:56 -07:00
  • 3c4cebf751 [Doc][Typo] Fixing Missing Comma (#5403) Roger Wang 2024-06-11 00:20:28 -07:00
  • d8f31f2f8b [Doc] add debugging tips (#5409) youkaichao 2024-06-10 23:21:43 -07:00
  • 640052b069 [Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026) Cyrus Leung 2024-06-11 13:36:46 +08:00
  • 351d5e7b82 [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) maor-ps 2024-06-11 05:30:31 +03:00
  • a008629807 [Misc] Various simplifications and typing fixes (#5368) Nick Hill 2024-06-10 19:29:02 -07:00
  • 76477a93b7 [ci] Fix Buildkite agent path (#5392) Kevin H. Luu 2024-06-10 18:58:07 -07:00
  • 77c87beb06 [Doc] Add documentation for FP8 W8A8 (#5388) Michael Goin 2024-06-10 20:55:12 -04:00
  • 114332b88e Bump version to v0.5.0 (#5384) Simon Mo 2024-06-10 17:56:06 -05:00
  • cb77ad836f [Docs] Alphabetically sort sponsors (#5386) Woosuk Kwon 2024-06-10 13:17:19 -07:00
  • 856c990041 [Docs] Add Docs on Limitations of VLM Support (#5383) Roger Wang 2024-06-10 09:53:50 -07:00
  • c5602f0baa [ci] Mount buildkite agent on Docker container to upload benchmark results (#5330) Kevin H. Luu 2024-06-10 09:22:34 -07:00
  • f7f9c5f97b [ci] Use small_cpu_queue for doc build (#5331) Kevin H. Luu 2024-06-10 09:21:11 -07:00
  • 2c0d933594 [Bugfix] Fix LLaVA-NeXT (#5380) Cyrus Leung 2024-06-10 23:38:47 +08:00
  • 774d1035e4 [Feature][Frontend]: Continued stream_options implementation also in CompletionRequest (#5319) Itay Etelis 2024-06-10 17:22:09 +03:00
  • 6b29d6fe70 [Model] Initial support for LLaVA-NeXT (#4199) Cyrus Leung 2024-06-10 20:47:15 +08:00
  • 0bfa1c4f13 [Misc] Improve error message when LoRA parsing fails (#5194) Cyrus Leung 2024-06-10 19:38:49 +08:00
  • c81da5f56d [misc][typo] fix typo (#5372) youkaichao 2024-06-10 02:51:02 -07:00
  • 68bc81703e [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (#5374) Roger Wang 2024-06-10 02:13:39 -07:00
  • 5884c2b454 [Misc] Update to comply with the new compressed-tensors config (#5350) Dipika Sikka 2024-06-09 23:49:46 -04:00
  • 45f92c00cf [Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164) Bla_ckB 2024-06-10 06:23:14 +07:00
  • 5467ac3196 [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) bnellnm 2024-06-09 16:23:30 -04:00
  • 5d7e3d0176 [mis][ci/test] fix flaky test in test_sharded_state_loader.py (#5361) youkaichao 2024-06-08 20:50:14 -07:00
  • 0373e1837e [Core][CUDA Graph] add output buffer for cudagraph (#5074) youkaichao 2024-06-08 19:14:43 -07:00
  • c09dade2a2 [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (#5353) Michael Goin 2024-06-08 13:54:05 -04:00
  • 8ea5e44a43 [CI/Test] improve robustness of test (vllm_runner) (#5357) youkaichao 2024-06-08 01:59:20 -07:00
  • 9fb900f90c [CI/Test] improve robustness of test (hf_runner) (#5347) youkaichao 2024-06-07 22:31:32 -07:00
  • c96fc06747 [ROCm][AMD] Use pytorch sdpa math backend to do naive attention (#4965) Hongxia Yang 2024-06-07 22:13:12 -04:00
  • b3376e5c76 [Misc] Add args for selecting distributed executor to benchmarks (#5335) Benjamin Kitor 2024-06-07 18:20:16 -07:00
  • e69ded7d1c [Bug Fix] Fix the support check for FP8 CUTLASS (#5352) Cheng Li 2024-06-07 17:42:05 -07:00
  • 767c727a81 fix DbrxFusedNormAttention missing cache_config (#5340) Calvinn Ng 2024-06-08 05:10:21 +08:00
  • 6840a71610 [Misc] Remove unused cuda_utils.h in CPU backend (#5345) Jie Fu (傅杰) 2024-06-08 05:09:13 +08:00
  • 7a9cb294ae [Frontend] Add OpenAI Vision API Support (#5237) Roger Wang 2024-06-07 11:23:32 -07:00
  • ca3ea51bde [Kernel] Dynamic Per-Token Activation Quantization (#5037) Dipika Sikka 2024-06-07 12:36:26 -04:00
  • dc49fb892c Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#5296) limingshu 2024-06-07 21:35:42 +08:00
  • 18a277b52d Remove Ray health check (#4693) Antoni Baum 2024-06-07 03:01:56 -07:00
  • 8d75fe48ca [Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183) Tyler Michael Smith 2024-06-07 04:42:35 -04:00