Commit Graph

  • 4abed65c58 [VLM] Disallow overflowing max_model_len for multimodal models (#7998) Cyrus Leung 2024-08-30 08:49:04 +08:00
  • 0c785d344d Add more percentiles and latencies (#7759) Wei-Sheng Chin 2024-08-29 16:48:11 -07:00
  • 4664ceaad6 support bitsandbytes 8-bit and FP4 quantized models (#7445) chenqianfzh 2024-08-29 16:09:08 -07:00
  • 257afc37c5 [Neuron] Adding support for context-length, token-gen buckets. (#7885) Harsha vardhan manoj Bikki 2024-08-29 13:58:14 -07:00
  • 86a677de42 [misc] update tpu int8 to use new vLLM Parameters (#7973) Dipika Sikka 2024-08-29 16:46:55 -04:00
  • d78789ac16 [Bugfix] Fix incorrect vocab embedding shards for GGUF model in tensor parallelism (#7954) Isotr0py 2024-08-30 03:54:49 +08:00
  • c334b1898b extend cuda graph size for H200 (#7894) kushanam 2024-08-29 12:15:04 -07:00
  • 6b3421567d [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985) Pavani Majety 2024-08-29 11:53:11 -07:00
  • 3f60f2244e [Core] Combine async postprocessor and multi-step (#7921) Alexander Matveev 2024-08-29 14:18:26 -04:00
  • f205c09854 [Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) Jonas M. Kübler 2024-08-29 07:18:13 +02:00
  • ef99a78760 Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) youkaichao 2024-08-28 21:27:06 -07:00
  • 74d5543ec5 [VLM][Core] Fix exceptions on ragged NestedTensors (#7974) Peter Salas 2024-08-28 20:24:31 -07:00
  • a7f65c2be9 [torch.compile] remove reset (#7975) youkaichao 2024-08-28 17:32:26 -07:00
  • 4289cad37f [Frontend] Minor optimizations to zmq decoupled front-end (#7957) Nick Hill 2024-08-28 17:22:43 -07:00
  • af59df0a10 Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961) Michael Goin 2024-08-28 19:19:17 -04:00
  • ce6bf3a2cf [torch.compile] avoid Dynamo guard evaluation overhead (#7898) youkaichao 2024-08-28 16:10:12 -07:00
  • 3cdfe1f38b [Bugfix] Make torch registration of punica ops optional (#7970) bnellnm 2024-08-28 18:11:49 -04:00
  • fdd9daafa3 [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) Mor Zusman 2024-08-29 01:06:52 +03:00
  • 8c56e57def [Doc] fix 404 link (#7966) Stas Bekman 2024-08-28 13:54:23 -07:00
  • eeffde1ac0 [TPU] Upgrade PyTorch XLA nightly (#7967) Woosuk Kwon 2024-08-28 13:10:21 -07:00
  • e5697d161c [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) rasmith 2024-08-28 14:37:47 -05:00
  • b98cc28f91 [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) Pavani Majety 2024-08-28 10:01:22 -07:00
  • ef9baee3c5 [Bugfix][VLM] Fix incompatibility between #7902 and #7230 (#7948) Cyrus Leung 2024-08-28 23:11:18 +08:00
  • 98c12cffe5 [Doc] fix the autoAWQ example (#7937) Stas Bekman 2024-08-28 05:12:32 -07:00
  • f52a43a8b9 [ci][test] fix pp test failure (#7945) youkaichao 2024-08-28 01:27:07 -07:00
  • e3580537a4 [Performance] Enable chunked prefill and prefix caching together (#7753) Cody Yu 2024-08-28 00:36:31 -07:00
  • f508e03e7f [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911) Alexander Matveev 2024-08-28 03:02:30 -04:00
  • 51f86bf487 [mypy][CI/Build] Fix mypy errors (#7929) Cyrus Leung 2024-08-28 14:47:44 +08:00
  • c166e7e43e [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. (#7886) bnellnm 2024-08-27 23:13:45 -04:00
  • bc6e42a9b1 [hardware][rocm] allow rocm to override default env var (#7926) youkaichao 2024-08-27 19:50:06 -07:00
  • fab5f53e2d [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) Peter Salas 2024-08-27 18:53:56 -07:00
  • 9c71c97ae2 [mypy] Enable mypy type checking for vllm/core (#7229) Jonathan Berkhahn 2024-08-27 16:11:14 -07:00
  • 5340a2dccf [Model] Add multi-image input support for LLaVA-Next offline inference (#7230) zifeitong 2024-08-27 16:09:02 -07:00
  • 345be0e244 [benchmark] Update TGI version (#7917) Philipp Schmid 2024-08-28 00:07:53 +02:00
  • fc911880cc [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) Dipika Sikka 2024-08-27 18:07:09 -04:00
  • ed6f002d33 [cuda][misc] error on empty CUDA_VISIBLE_DEVICES (#7924) youkaichao 2024-08-27 12:06:11 -07:00
  • b09c755be8 [Bugfix] Fix phi3v incorrect image_idx when using async engine (#7916) Isotr0py 2024-08-28 01:36:09 +08:00
  • 42e932c7d4 [CI/Build][ROCm] Enabling tensorizer tests for ROCm (#7237) alexeykondrat 2024-08-27 13:09:13 -04:00
  • 076169f603 [Hardware][Intel GPU] Add intel GPU pipeline parallel support. (#7810) Kunshang Ji 2024-08-28 01:07:02 +08:00
  • 9db642138b [CI/Build][VLM] Cleanup multiple images inputs model test (#7897) Isotr0py 2024-08-27 23:28:30 +08:00
  • 6fc4e6e07a [Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) Patrick von Platen 2024-08-27 14:40:02 +02:00
  • 9606c7197d Revert #7509 (#7887) Cody Yu 2024-08-27 00:16:31 -07:00
  • 64cc644425 [core][torch.compile] discard the compile for profiling (#7796) youkaichao 2024-08-26 21:33:58 -07:00
  • 39178c7fbc [Tests] Disable retries and use context manager for openai client (#7565) Nick Hill 2024-08-26 21:33:17 -07:00
  • 2eedede875 [Core] Asynchronous Output Processor (#7049) Megha Agarwal 2024-08-26 20:53:20 -07:00
  • 015e6cc252 [Misc] Update compressed tensors lifecycle to remove prefix from create_weights (#7825) Dipika Sikka 2024-08-26 20:09:34 -04:00
  • 760e9f71a8 [Bugfix] neuron: enable tensor parallelism (#7562) omrishiv 2024-08-26 15:13:13 -07:00
  • 05826c887b [misc] fix custom allreduce p2p cache file generation (#7853) youkaichao 2024-08-26 15:02:25 -07:00
  • dd9857f5fa [Misc] Update gptq_marlin_24 to use vLLMParameters (#7762) Dipika Sikka 2024-08-26 17:44:54 -04:00
  • 665304092d [Misc] Update qqq to use vLLMParameters (#7805) Dipika Sikka 2024-08-26 15:16:15 -04:00
  • 2deb029d11 [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) Cody Yu 2024-08-26 11:24:53 -07:00
  • 029c71de11 [CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836) Cyrus Leung 2024-08-26 13:31:10 +08:00
  • 0b769992ec [Bugfix]: Use float32 for base64 embedding (#7855) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2024-08-26 06:16:38 +03:00
  • 1856aff4d6 [Spec Decoding] Streamline batch expansion tensor manipulation (#7851) Nick Hill 2024-08-25 15:45:14 -07:00
  • 70c094ade6 [misc][cuda] improve pynvml warning (#7852) youkaichao 2024-08-25 14:30:09 -07:00
  • 2059b8d9ca [Misc] Remove snapshot_download usage in InternVL2 test (#7835) Isotr0py 2024-08-25 23:53:09 +08:00
  • 8aaf3d5347 [Model][VLM] Support multi-images inputs for Phi-3-vision models (#7783) Isotr0py 2024-08-25 19:51:20 +08:00
  • 80162c44b1 [Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840) zifeitong 2024-08-24 18:16:24 -07:00
  • aab0fcdb63 [ci][test] fix RemoteOpenAIServer (#7838) youkaichao 2024-08-24 10:31:28 -07:00
  • ea9fa160e3 [ci][test] exclude model download time in server start time (#7834) youkaichao 2024-08-24 01:03:27 -07:00
  • 7d9ffa2ae1 [misc][core] lazy import outlines (#7831) youkaichao 2024-08-24 00:51:38 -07:00
  • d81abefd2e [Frontend] add json_schema support from OpenAI protocol (#7654) Tyler Rockwood 2024-08-24 01:07:24 -05:00
  • 8da48e4d95 [Frontend] Publish Prometheus metrics in run_batch API (#7641) Pooya Davoodi 2024-08-23 23:04:22 -07:00
  • 6885fde317 [Bugfix] Fix run_batch logger (#7640) Pooya Davoodi 2024-08-23 13:58:26 -07:00
  • 9db93de20c [Core] Add multi-step support to LLMEngine (#7789) Alexander Matveev 2024-08-23 15:45:53 -04:00
  • 09c7792610 Bump version to v0.5.5 (#7823) (tag: v0.5.5) Simon Mo 2024-08-23 11:35:33 -07:00
  • f1df5dbfd6 [Misc] Update marlin to use vLLMParameters (#7803) Dipika Sikka 2024-08-23 14:30:52 -04:00
  • 35ee2ad6b9 [github][misc] promote asking llm first (#7809) youkaichao 2024-08-23 09:38:50 -07:00
  • e25fee57c2 [BugFix] Fix server crash on empty prompt (#7746) Maximilien de Bayser 2024-08-23 10:12:44 -03:00
  • faeddb565d [misc] Add Torch profiler support for CPU-only devices (#7806) Jie Fu (傅杰) 2024-08-23 13:46:25 +08:00
  • fc5ebbd1d3 [Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712) Kunshang Ji 2024-08-23 11:06:54 +08:00
  • c01a6cb231 [Ray backend] Better error when pg topology is bad. (#7584) SangBin Cho 2024-08-22 17:44:25 -07:00
  • b903e1ba7f [Frontend] error suppression cleanup (#7786) Joe Runde 2024-08-22 15:50:21 -06:00
  • a152246428 [Misc] fix typo in triton import warning (#7794) Siyuan Liu 2024-08-22 13:51:23 -07:00
  • 666ad0aa16 [ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args (#7705) Kevin H. Luu 2024-08-22 13:10:55 -07:00
  • 15310b5101 [Bugfix] Use LoadFormat values for vllm serve --load-format (#7784) Michael Goin 2024-08-22 14:37:08 -04:00
  • 57792ed469 [Doc] Fix incorrect docs from #7615 (#7788) Peter Salas 2024-08-22 10:02:06 -07:00
  • d3b5b98021 [Misc] Enhance prefix-caching benchmark tool (#6568) Jiaxin Shan 2024-08-23 00:32:02 +08:00
  • cc0eaf12b1 [Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232) Travis Johnson 2024-08-22 07:33:48 -06:00
  • 955b5191c9 [Misc] update fp8 to use vLLMParameter (#7437) Dipika Sikka 2024-08-22 08:36:18 -04:00
  • 55d63b1211 [Bugfix] Don't build machete on cuda <12.0 (#7757) Lucas Wilkinson 2024-08-22 08:28:52 -04:00
  • 4f419c00a6 Fix ShardedStateLoader for vllm fp8 quantization (#7708) Flex Wang 2024-08-22 05:25:04 -07:00
  • a3fce56b88 [Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) Abhinav Goyal 2024-08-22 15:12:24 +05:30
  • b3856bef7d [Misc] Use torch.compile for GemmaRMSNorm (#7642) Woosuk Kwon 2024-08-22 01:14:13 -07:00
  • 8c6f694a79 [ci] refine dependency for distributed tests (#7776) youkaichao 2024-08-22 00:54:15 -07:00
  • eeee1c3b1a [TPU] Avoid initializing TPU runtime in is_tpu (#7763) Woosuk Kwon 2024-08-21 21:31:49 -07:00
  • aae74ef95c Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) Michael Goin 2024-08-21 23:42:14 -04:00
  • cde9183b40 [Bug][Frontend] Improve ZMQ client robustness (#7443) Joe Runde 2024-08-21 20:18:11 -06:00
  • df1a21131d [Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710) zifeitong 2024-08-21 18:36:24 -07:00
  • 7937009a7e [Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233) Luka Govedič 2024-08-21 20:18:00 -04:00
  • 9984605412 [AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility (#7477) Gregory Shtrasberg 2024-08-21 19:47:36 -04:00
  • 7eebe8ccaa [distributed][misc] error on same VLLM_HOST_IP setting (#7756) youkaichao 2024-08-21 16:25:34 -07:00
  • 8678a69ab5 [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527) Dipika Sikka 2024-08-21 19:17:10 -04:00
  • 5844017285 [ci] [multi-step] narrow multi-step test dependency paths (#7760) William Lin 2024-08-21 15:52:40 -07:00
  • 1ca0d4f86b [Model] Add UltravoxModel and UltravoxConfig (#7615) Peter Salas 2024-08-21 15:49:39 -07:00
  • dd53c4b023 [misc] Add Torch profiler support (#7451) William Lin 2024-08-21 15:39:26 -07:00
  • 970dfdc01d [Frontend] Improve Startup Failure UX (#7716) Robert Shaw 2024-08-21 15:53:01 -04:00
  • 91f4522cbf [multi-step] Raise error if not using async engine (#7703) William Lin 2024-08-21 11:49:19 -07:00
  • 1b32e02648 [Bugfix] Pass PYTHONPATH from setup.py to CMake (#7730) sasha0552 2024-08-21 18:17:48 +00:00
  • f7e3b0c5aa [Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend (#7394) Robert Shaw 2024-08-21 13:34:14 -04:00