Commit Graph

  • 3b7fea770f [Model][VLM] Add Qwen2-VL model support (#7905) Yang Fan 2024-09-12 00:31:19 +08:00
  • cea95dfb94 [Frontend] Create ErrorResponse instead of raising exceptions in run_batch (#8347) Pooya Davoodi 2024-09-10 22:30:11 -07:00
  • 6a512a00df [model] Support for Llava-Next-Video model (#7559) Yangshen⚡Deng 2024-09-11 13:21:36 +08:00
  • efcf946a15 [Hardware][NV] Add support for ModelOpt static scaling checkpoints. (#6112) Pavani Majety 2024-09-10 21:38:40 -07:00
  • 1230263e16 [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (#8299) Isotr0py 2024-09-11 10:11:01 +08:00
  • e497b8aeff [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (#8329) Jee Jee Li 2024-09-11 08:59:19 +08:00
  • 94144e726c [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (#8043) Tyler Michael Smith 2024-09-10 19:51:58 -04:00
  • 1d5e397aa4 [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (#8172) William Lin 2024-09-10 16:46:08 -07:00
  • 22f3a4bc6c [Bugfix] lookahead block table with cuda graph max capture (#8340) Alexander Matveev 2024-09-10 19:00:35 -04:00
  • b1f3e18958 [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled (#8342) Cody Yu 2024-09-10 15:28:28 -07:00
  • 04e7c4e771 [Misc] remove peft as dependency for prompt models (#8162) Prashant Gupta 2024-09-10 14:21:56 -07:00
  • 5faedf1b62 [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224) Kevin Lin 2024-09-10 15:18:14 -05:00
  • 02751a7a42 Fix ppc64le buildkite job (#8309) sumitd2 2024-09-11 01:28:34 +05:30
  • f421f3cefb [CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail (#8130) Alexey Kondratiev(AMD) 2024-09-10 14:51:15 -04:00
  • 8c054b7a62 [Frontend] Clean up type annotations for mistral tokenizer (#8314) Cyrus Leung 2024-09-11 00:49:11 +08:00
  • 6234385f4a [CI/Build] enable ccache/sccache for HIP builds (#8327) Daniele 2024-09-10 17:55:08 +02:00
  • da1a844e61 [Bugfix] Fix missing post_layernorm in CLIP (#8155) Cyrus Leung 2024-09-10 16:22:50 +08:00
  • a1d874224d Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (#8319) Simon Mo 2024-09-09 23:21:00 -07:00
  • 6cd5e5b07e [Misc] Fused MoE Marlin support for GPTQ (#8217) Dipika Sikka 2024-09-09 23:02:52 -04:00
  • c7cb5c3335 [Misc] GPTQ Activation Ordering (#8135) Kyle Sayers 2024-09-09 16:27:26 -04:00
  • f9b4a2d415 [Bugfix] Correct adapter usage for cohere and jamba (#8292) Vladislav Kruglikov 2024-09-09 21:20:46 +03:00
  • 58fcc8545a [Frontend] Add progress reporting to run_batch.py (#8060) Adam Lugowski 2024-09-09 11:16:37 -07:00
  • 08287ef675 [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (#8272) Kyle Mistele 2024-09-09 09:45:11 -05:00
  • 4ef41b8476 [Bugfix] Fix async postprocessor in case of preemption (#8267) Alexander Matveev 2024-09-08 00:01:51 -04:00
  • cfe712bf1a [CI/Build] Use python 3.12 in cuda image (#8133) Joe Runde 2024-09-07 14:03:16 -06:00
  • b962ee1470 ppc64le: Dockerfile fixed, and a script for buildkite (#8026) sumitd2 2024-09-07 23:48:40 +05:30
  • 36bf8150cc [Model][VLM] Decouple weight loading logic for Paligemma (#8269) Isotr0py 2024-09-08 01:45:44 +08:00
  • e807125936 [Model][VLM] Support multi-images inputs for InternVL2 models (#8201) Isotr0py 2024-09-07 16:38:23 +08:00
  • 9f68e00d27 [Bugfix] Fix broken OpenAI tensorizer test (#8258) Cyrus Leung 2024-09-07 16:02:39 +08:00
  • ce2702a923 [tpu][misc] fix typo (#8260) youkaichao 2024-09-06 22:40:46 -07:00
  • 795b662cff Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) (#8241) Wei-Sheng Chin 2024-09-06 20:18:16 -07:00
  • 2f707fcb35 [Model] Multi-input support for LLaVA (#8238) Cyrus Leung 2024-09-07 10:57:24 +08:00
  • 41e95c5247 [Bugfix] Fix Hermes tool call chat template bug (#8256) Kyle Mistele 2024-09-06 21:49:01 -05:00
  • 12dd715807 [misc] [doc] [frontend] LLM torch profiler support (#7943) William Lin 2024-09-06 17:48:48 -07:00
  • 29f49cd6e3 [Model] Allow loading from original Mistral format (#8168) Patrick von Platen 2024-09-07 01:02:05 +02:00
  • 23f322297f [Misc] Remove SqueezeLLM (#8220) Dipika Sikka 2024-09-06 18:29:03 -04:00
  • 9db52eab3d [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput (#8248) rasmith 2024-09-06 17:26:09 -05:00
  • 1447c97e75 [CI/Build] Increasing timeout for multiproc worker tests (#8203) Alexey Kondratiev(AMD) 2024-09-06 14:51:03 -04:00
  • de80783b69 [Misc] Use ray[adag] dependency instead of cuda (#7938) Rui Qiao 2024-09-06 09:18:35 -07:00
  • e5cab71531 [Frontend] Add --logprobs argument to benchmark_serving.py (#8191) afeldman-nm 2024-09-06 12:01:14 -04:00
  • baa5467547 [BugFix] Fix Granite model configuration (#8216) Nick Hill 2024-09-05 20:39:29 -07:00
  • db3bf7c991 [Core] Support load and unload LoRA in api server (#6566) Jiaxin Shan 2024-09-05 18:10:33 -07:00
  • 2febcf2777 [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM (#7962) sroy745 2024-09-05 13:25:29 -07:00
  • 2ee45281a5 Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165) Michael Goin 2024-09-05 11:09:46 -04:00
  • 9da25a88aa [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029) Alex Brooks 2024-09-05 06:48:10 -06:00
  • 8685ba1a1e Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) (#7860) manikandan.tm@zucisystems.com 2024-09-05 17:03:37 +05:30
  • 288a938872 [Doc] Indicate more information about supported modalities (#8181) Cyrus Leung 2024-09-05 18:51:53 +08:00
  • e39ebf5cf5 [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) Elfie Guo 2024-09-04 22:12:26 -07:00
  • ba262c4e5a [ci] Mark LoRA test as soft-fail (#8160) Kevin H. Luu 2024-09-04 20:33:12 -07:00
  • 4624d98dbd [Misc] Clean up RoPE forward_native (#8076) Woosuk Kwon 2024-09-04 20:31:48 -07:00
  • 1afc931987 [bugfix] >1.43 constraint for openai (#8169) William Lin 2024-09-04 17:35:36 -07:00
  • e01c2beb7d [Doc] [Misc] Create CODE_OF_CONDUCT.md (#8161) Maureen McElaney 2024-09-04 19:50:13 -04:00
  • 32e7db2536 Bump version to v0.6.0 (#8166) v0.6.0 Simon Mo 2024-09-04 16:34:27 -07:00
  • 008cf886c9 [Neuron] Adding support for adding/ overriding neuron configuration a… (#8062) Harsha vardhan manoj Bikki 2024-09-04 16:33:43 -07:00
  • 77d9e514a2 [MISC] Replace input token throughput with total token throughput (#8164) Cody Yu 2024-09-04 13:23:22 -07:00
  • e02ce498be [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649) Kyle Mistele 2024-09-04 15:18:13 -05:00
  • 561d6f8077 [CI] Change test input in Gemma LoRA test (#8163) Woosuk Kwon 2024-09-04 13:05:50 -07:00
  • d1dec64243 [CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369) alexeykondrat 2024-09-04 14:57:54 -04:00
  • 2ad2e5608e [MISC] Consolidate FP8 kv-cache tests (#8131) Cody Yu 2024-09-04 11:53:25 -07:00
  • d3311562fb [Bugfix] remove post_layernorm in siglip (#8106) wnma 2024-09-04 18:55:37 +08:00
  • ccd7207191 chore: Update check-wheel-size.py to read MAX_SIZE_MB from env (#8103) TimWang 2024-09-04 14:17:05 +08:00
  • 855c262a6b [Frontend] Multimodal support in offline chat (#8098) Cyrus Leung 2024-09-04 13:22:17 +08:00
  • 2be8ec6e71 [Model] Add Ultravox support for multiple audio chunks (#7963) Peter Salas 2024-09-03 21:38:21 -07:00
  • e16fa99a6a [Misc] Update fbgemmfp8 to use vLLMParameters (#7972) Dipika Sikka 2024-09-03 22:12:41 -04:00
  • 61f4a93d14 [TPU][Bugfix] Use XLA rank for persistent cache path (#8137) Woosuk Kwon 2024-09-03 18:35:33 -07:00
  • d4db9f53c8 [Benchmark] Add --async-engine option to benchmark_throughput.py (#7964) Nick Hill 2024-09-03 17:57:41 -07:00
  • 2188a60c7e [Misc] Update GPTQ to use vLLMParameters (#7976) Dipika Sikka 2024-09-03 17:21:44 -04:00
  • dc0b6066ab [CI] Change PR reminder to avoid at-mentions (#8134) Simon Mo 2024-09-03 14:11:42 -07:00
  • 0af3abe3d3 [TPU][Bugfix] Fix next_token_ids shape (#8128) Woosuk Kwon 2024-09-03 13:29:24 -07:00
  • f1575dc99f [ci] Fix GHA workflow (#8129) Kevin H. Luu 2024-09-03 13:25:09 -07:00
  • c02638efb3 [CI/Build] make pip install vllm work in macos (for import only) (#8118) tomeras91 2024-09-03 22:37:08 +03:00
  • 652c83b697 [Misc] Raise a more informative exception in add/remove_logger (#7750) Antoni Baum 2024-09-03 12:28:25 -07:00
  • 6d646d08a2 [Core] Optimize Async + Multi-step (#8050) Alexander Matveev 2024-09-03 14:50:29 -04:00
  • 95a178f861 [CI] Only PR reviewers/committers can trigger CI on PR (#8124) Kevin H. Luu 2024-09-03 11:32:27 -07:00
  • bd852f2a8b [Performance] Enable chunked prefill and prefix caching together (#8120) Cody Yu 2024-09-03 10:49:18 -07:00
  • ec266536b7 [Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend (#8061) Isotr0py 2024-09-03 21:37:52 +08:00
  • 0fbc6696c2 [Bugfix] Fix single output condition in output processor (#7881) Woosuk Kwon 2024-09-02 20:35:42 -07:00
  • 6e36f4fa6c improve chunked prefill performance wang.yuqi 2024-09-03 05:20:12 +08:00
  • dd2a6a82e3 [Bugfix] Fix internlm2 tensor parallel inference (#8055) Isotr0py 2024-09-02 23:48:56 +08:00
  • 4ca65a9763 [Core][Bugfix] Accept GGUF model without .gguf extension (#8056) Isotr0py 2024-09-02 20:43:26 +08:00
  • e2b2aa5a0f [TPU] Align worker index with node boundary (#7932) Woosuk Kwon 2024-09-01 23:09:46 -07:00
  • e6a26ed037 [SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) Lily Liu 2024-09-01 21:23:29 -07:00
  • f8d60145b4 [Model] Add Granite model (#7436) Shawn Tan 2024-09-01 21:37:18 -04:00
  • 5b86b19954 [Misc] Optional installation of audio related packages (#8063) Roger Wang 2024-09-01 14:46:57 -07:00
  • 5231f0898e [Frontend][VLM] Add support for multiple multi-modal items (#8049) Roger Wang 2024-08-31 16:35:53 -07:00
  • 8423aef4c8 [BugFix][Core] Multistep Fix Crash on Request Cancellation (#8059) Robert Shaw 2024-08-31 15:44:03 -04:00
  • 4f5d8446ed [Bugfix] Fix ModelScope models in v0.5.5 (#8037) Nicolò Lucchesi 2024-08-31 09:27:58 +02:00
  • d05f0a9db2 [Bugfix] Fix import error in Phi-3.5-MoE (#8052) Cyrus Leung 2024-08-31 13:26:55 +08:00
  • 622f8abff8 [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) Pavani Majety 2024-08-30 22:18:50 -07:00
  • 1248e8506a [Model] Adding support for MSFT Phi-3.5-MoE (#7729) Wenxiang 2024-08-31 03:42:57 +08:00
  • 2684efc467 [TPU][Bugfix] Fix tpu type api (#8035) Woosuk Kwon 2024-08-30 09:01:26 -07:00
  • 058344f89a [Frontend]-config-cli-args (#7737) Kaunil Dhruv 2024-08-30 08:21:02 -07:00
  • 98cef6a227 [Core] Increase default max_num_batched_tokens for multimodal models (#8028) Cyrus Leung 2024-08-30 23:20:34 +08:00
  • f97be32d1d [VLM][Model] TP support for ViTs (#7186) Jungho Christopher Cho 2024-08-31 00:19:27 +09:00
  • afd39a4511 [Bugfix] Fix import error in Exaone model (#8034) Cyrus Leung 2024-08-30 23:03:28 +08:00
  • 2148441fd3 [TPU] Support single and multi-host TPUs on GKE (#7613) Richard Liu 2024-08-30 00:27:40 -07:00
  • dc13e99348 [MODEL] add Exaone model support (#7819) Yohan Na 2024-08-30 15:34:20 +09:00
  • 34a0e96d46 [Kernel] changing fused moe kernel chunk size default to 32k (#7995) Avshalom Manevich 2024-08-30 11:11:39 +07:00
  • 80c7b089b1 [TPU] Async output processing for TPU (#8011) Woosuk Kwon 2024-08-29 19:35:29 -07:00
  • 428dd1445e [Core] Logprobs support in Multi-step (#7652) afeldman-nm 2024-08-29 22:19:08 -04:00