Commit Graph - vllm - Gitea

biondizzle/vllm

Commit Graph

Branches and tags:

cmm

main

ci/build/22474

submission

v0.1.0

v0.1.1

v0.1.2

v0.1.3

v0.1.4

v0.1.5

v0.1.6

v0.1.7

v0.10.0

v0.10.0rc1

v0.10.0rc2

v0.10.1

v0.10.1.1

v0.10.1rc1

v0.10.2

v0.10.2rc1

v0.10.2rc2

v0.10.2rc3

v0.11.0

v0.11.0rc1

v0.11.0rc2

v0.11.0rc3

v0.11.0rc4

v0.11.0rc5

v0.11.0rc6

v0.11.1

v0.11.1rc0

v0.11.1rc1

v0.11.1rc2

v0.11.1rc3

v0.11.1rc4

v0.11.1rc5

v0.11.1rc6

v0.11.1rc7

v0.11.2

v0.12.0

v0.13.0

v0.13.0rc1

v0.13.0rc2

v0.13.0rc3

v0.13.0rc4

v0.14.0

v0.14.0rc0

v0.14.0rc1

v0.14.0rc2

v0.14.1

v0.15.0

v0.15.0rc0

v0.15.0rc1

v0.15.0rc2

v0.15.0rc3

v0.15.1

v0.15.1rc0

v0.15.1rc1

v0.15.2rc0

v0.16.0

v0.16.0rc0

v0.16.0rc1

v0.16.0rc2

v0.16.0rc3

v0.16.1rc0

v0.17.0

v0.17.0rc0

v0.17.0rc1

v0.17.1

v0.17.1rc0

v0.17.2rc0

v0.18.0

v0.18.0rc0

v0.18.0rc1

v0.18.0rc2

v0.18.1

v0.18.1rc0

v0.18.2rc0

v0.19.0

v0.19.0rc0

v0.19.0rc1

v0.19.1rc0

v0.2.0

v0.2.1

v0.2.1.post1

v0.2.2

v0.2.3

v0.2.4

v0.2.5

v0.2.6

v0.2.7

v0.3.0

v0.3.1

v0.3.2

v0.3.3

v0.4.0

v0.4.0.post1

v0.4.1

v0.4.2

v0.4.3

v0.5.0

v0.5.0.post1

v0.5.1

v0.5.2

v0.5.3

v0.5.3.post1

v0.5.4

v0.5.5

v0.6.0

v0.6.1

v0.6.1.post1

v0.6.1.post2

v0.6.2

v0.6.3

v0.6.3.post1

v0.6.4

v0.6.4.post1

v0.6.5

v0.6.6

v0.6.6.post1

v0.7.0

v0.7.1

v0.7.2

v0.7.3

v0.8.0

v0.8.0rc1

v0.8.0rc2

v0.8.1

v0.8.2

v0.8.3

v0.8.3rc1

v0.8.4

v0.8.5

v0.8.5.post1

v0.9.0

v0.9.0.1

v0.9.1

v0.9.1rc1

v0.9.1rc2

v0.9.2

v0.9.2rc1

v0.9.2rc2

3b7fea770f [Model][VLM] Add Qwen2-VL model support (#7905) Yang Fan 2024-09-12 00:31:19 +08:00
cea95dfb94 [Frontend] Create ErrorResponse instead of raising exceptions in run_batch (#8347) Pooya Davoodi 2024-09-10 22:30:11 -07:00
6a512a00df [model] Support for Llava-Next-Video model (#7559) Yangshen⚡Deng 2024-09-11 13:21:36 +08:00
efcf946a15 [Hardware][NV] Add support for ModelOpt static scaling checkpoints. (#6112) Pavani Majety 2024-09-10 21:38:40 -07:00
1230263e16 [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (#8299) Isotr0py 2024-09-11 10:11:01 +08:00
e497b8aeff [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models (#8329) Jee Jee Li 2024-09-11 08:59:19 +08:00
94144e726c [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag (#8043) Tyler Michael Smith 2024-09-10 19:51:58 -04:00
1d5e397aa4 [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers (#8172) William Lin 2024-09-10 16:46:08 -07:00
22f3a4bc6c [Bugfix] lookahead block table with cuda graph max capture (#8340) Alexander Matveev 2024-09-10 19:00:35 -04:00
b1f3e18958 [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled (#8342) Cody Yu 2024-09-10 15:28:28 -07:00
04e7c4e771 [Misc] remove peft as dependency for prompt models (#8162) Prashant Gupta 2024-09-10 14:21:56 -07:00
5faedf1b62 [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224) Kevin Lin 2024-09-10 15:18:14 -05:00
02751a7a42 Fix ppc64le buildkite job (#8309) sumitd2 2024-09-11 01:28:34 +05:30
f421f3cefb [CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail (#8130) Alexey Kondratiev(AMD) 2024-09-10 14:51:15 -04:00
8c054b7a62 [Frontend] Clean up type annotations for mistral tokenizer (#8314) Cyrus Leung 2024-09-11 00:49:11 +08:00
6234385f4a [CI/Build] enable ccache/scccache for HIP builds (#8327) Daniele 2024-09-10 17:55:08 +02:00
da1a844e61 [Bugfix] Fix missing post_layernorm in CLIP (#8155) Cyrus Leung 2024-09-10 16:22:50 +08:00
a1d874224d Add NVIDIA Meetup slides, announce AMD meetup, and add contact info (#8319) Simon Mo 2024-09-09 23:21:00 -07:00
6cd5e5b07e [Misc] Fused MoE Marlin support for GPTQ (#8217) Dipika Sikka 2024-09-09 23:02:52 -04:00
c7cb5c3335 [Misc] GPTQ Activation Ordering (#8135) Kyle Sayers 2024-09-09 16:27:26 -04:00
f9b4a2d415 [Bugfix] Correct adapter usage for cohere and jamba (#8292) Vladislav Kruglikov 2024-09-09 21:20:46 +03:00
58fcc8545a [Frontend] Add progress reporting to run_batch.py (#8060) Adam Lugowski 2024-09-09 11:16:37 -07:00
08287ef675 [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (#8272) Kyle Mistele 2024-09-09 09:45:11 -05:00
4ef41b8476 [Bugfix] Fix async postprocessor in case of preemption (#8267) Alexander Matveev 2024-09-08 00:01:51 -04:00
cfe712bf1a [CI/Build] Use python 3.12 in cuda image (#8133) Joe Runde 2024-09-07 14:03:16 -06:00
b962ee1470 ppc64le: Dockerfile fixed, and a script for buildkite (#8026) sumitd2 2024-09-07 23:48:40 +05:30
36bf8150cc [Model][VLM] Decouple weight loading logic for Paligemma (#8269) Isotr0py 2024-09-08 01:45:44 +08:00
e807125936 [Model][VLM] Support multi-images inputs for InternVL2 models (#8201) Isotr0py 2024-09-07 16:38:23 +08:00
9f68e00d27 [Bugfix] Fix broken OpenAI tensorizer test (#8258) Cyrus Leung 2024-09-07 16:02:39 +08:00
ce2702a923 [tpu][misc] fix typo (#8260) youkaichao 2024-09-06 22:40:46 -07:00
795b662cff Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) (#8241) Wei-Sheng Chin 2024-09-06 20:18:16 -07:00
2f707fcb35 [Model] Multi-input support for LLaVA (#8238) Cyrus Leung 2024-09-07 10:57:24 +08:00
41e95c5247 [Bugfix] Fix Hermes tool call chat template bug (#8256) Kyle Mistele 2024-09-06 21:49:01 -05:00
12dd715807 [misc] [doc] [frontend] LLM torch profiler support (#7943) William Lin 2024-09-06 17:48:48 -07:00
29f49cd6e3 [Model] Allow loading from original Mistral format (#8168) Patrick von Platen 2024-09-07 01:02:05 +02:00
23f322297f [Misc] Remove SqueezeLLM (#8220) Dipika Sikka 2024-09-06 18:29:03 -04:00
9db52eab3d [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput (#8248) rasmith 2024-09-06 17:26:09 -05:00
1447c97e75 [CI/Build] Increasing timeout for multiproc worker tests (#8203) Alexey Kondratiev(AMD) 2024-09-06 14:51:03 -04:00
de80783b69 [Misc] Use ray[adag] dependency instead of cuda (#7938) Rui Qiao 2024-09-06 09:18:35 -07:00
e5cab71531 [Frontend] Add --logprobs argument to benchmark_serving.py (#8191) afeldman-nm 2024-09-06 12:01:14 -04:00
baa5467547 [BugFix] Fix Granite model configuration (#8216) Nick Hill 2024-09-05 20:39:29 -07:00
db3bf7c991 [Core] Support load and unload LoRA in api server (#6566) Jiaxin Shan 2024-09-05 18:10:33 -07:00
2febcf2777 [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM (#7962) sroy745 2024-09-05 13:25:29 -07:00
2ee45281a5 Move verify_marlin_supported to GPTQMarlinLinearMethod (#8165) Michael Goin 2024-09-05 11:09:46 -04:00
9da25a88aa [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029) Alex Brooks 2024-09-05 06:48:10 -06:00
8685ba1a1e Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) (#7860) manikandan.tm@zucisystems.com 2024-09-05 17:03:37 +05:30
288a938872 [Doc] Indicate more information about supported modalities (#8181) Cyrus Leung 2024-09-05 18:51:53 +08:00
e39ebf5cf5 [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) Elfie Guo 2024-09-04 22:12:26 -07:00
ba262c4e5a [ci] Mark LoRA test as soft-fail (#8160) Kevin H. Luu 2024-09-04 20:33:12 -07:00
4624d98dbd [Misc] Clean up RoPE forward_native (#8076) Woosuk Kwon 2024-09-04 20:31:48 -07:00
1afc931987 [bugfix] >1.43 constraint for openai (#8169) William Lin 2024-09-04 17:35:36 -07:00
e01c2beb7d [Doc] [Misc] Create CODE_OF_CONDUCT.md (#8161) Maureen McElaney 2024-09-04 19:50:13 -04:00
32e7db2536 Bump version to v0.6.0 (#8166) [tag: v0.6.0] Simon Mo 2024-09-04 16:34:27 -07:00
008cf886c9 [Neuron] Adding support for adding/ overriding neuron configuration a… (#8062) Harsha vardhan manoj Bikki 2024-09-04 16:33:43 -07:00
77d9e514a2 [MISC] Replace input token throughput with total token throughput (#8164) Cody Yu 2024-09-04 13:23:22 -07:00
e02ce498be [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649) Kyle Mistele 2024-09-04 15:18:13 -05:00
561d6f8077 [CI] Change test input in Gemma LoRA test (#8163) Woosuk Kwon 2024-09-04 13:05:50 -07:00
d1dec64243 [CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369) alexeykondrat 2024-09-04 14:57:54 -04:00
2ad2e5608e [MISC] Consolidate FP8 kv-cache tests (#8131) Cody Yu 2024-09-04 11:53:25 -07:00
d3311562fb [Bugfix] remove post_layernorm in siglip (#8106) wnma 2024-09-04 18:55:37 +08:00
ccd7207191 chore: Update check-wheel-size.py to read MAX_SIZE_MB from env (#8103) TimWang 2024-09-04 14:17:05 +08:00
855c262a6b [Frontend] Multimodal support in offline chat (#8098) Cyrus Leung 2024-09-04 13:22:17 +08:00
2be8ec6e71 [Model] Add Ultravox support for multiple audio chunks (#7963) Peter Salas 2024-09-03 21:38:21 -07:00
e16fa99a6a [Misc] Update fbgemmfp8 to use vLLMParameters (#7972) Dipika Sikka 2024-09-03 22:12:41 -04:00
61f4a93d14 [TPU][Bugfix] Use XLA rank for persistent cache path (#8137) Woosuk Kwon 2024-09-03 18:35:33 -07:00
d4db9f53c8 [Benchmark] Add --async-engine option to benchmark_throughput.py (#7964) Nick Hill 2024-09-03 17:57:41 -07:00
2188a60c7e [Misc] Update GPTQ to use vLLMParameters (#7976) Dipika Sikka 2024-09-03 17:21:44 -04:00
dc0b6066ab [CI] Change PR remainder to avoid at-mentions (#8134) Simon Mo 2024-09-03 14:11:42 -07:00
0af3abe3d3 [TPU][Bugfix] Fix next_token_ids shape (#8128) Woosuk Kwon 2024-09-03 13:29:24 -07:00
f1575dc99f [ci] Fix GHA workflow (#8129) Kevin H. Luu 2024-09-03 13:25:09 -07:00
c02638efb3 [CI/Build] make pip install vllm work in macos (for import only) (#8118) tomeras91 2024-09-03 22:37:08 +03:00
652c83b697 [Misc] Raise a more informative exception in add/remove_logger (#7750) Antoni Baum 2024-09-03 12:28:25 -07:00
6d646d08a2 [Core] Optimize Async + Multi-step (#8050) Alexander Matveev 2024-09-03 14:50:29 -04:00
95a178f861 [CI] Only PR reviewers/committers can trigger CI on PR (#8124) Kevin H. Luu 2024-09-03 11:32:27 -07:00
bd852f2a8b [Performance] Enable chunked prefill and prefix caching together (#8120) Cody Yu 2024-09-03 10:49:18 -07:00
ec266536b7 [Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend (#8061) Isotr0py 2024-09-03 21:37:52 +08:00
0fbc6696c2 [Bugfix] Fix single output condition in output processor (#7881) Woosuk Kwon 2024-09-02 20:35:42 -07:00
6e36f4fa6c improve chunked prefill performance wang.yuqi 2024-09-03 05:20:12 +08:00
dd2a6a82e3 [Bugfix] Fix internlm2 tensor parallel inference (#8055) Isotr0py 2024-09-02 23:48:56 +08:00
4ca65a9763 [Core][Bugfix] Accept GGUF model without .gguf extension (#8056) Isotr0py 2024-09-02 20:43:26 +08:00
e2b2aa5a0f [TPU] Align worker index with node boundary (#7932) Woosuk Kwon 2024-09-01 23:09:46 -07:00
e6a26ed037 [SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) Lily Liu 2024-09-01 21:23:29 -07:00
f8d60145b4 [Model] Add Granite model (#7436) Shawn Tan 2024-09-01 21:37:18 -04:00
5b86b19954 [Misc] Optional installation of audio related packages (#8063) Roger Wang 2024-09-01 14:46:57 -07:00
5231f0898e [Frontend][VLM] Add support for multiple multi-modal items (#8049) Roger Wang 2024-08-31 16:35:53 -07:00
8423aef4c8 [BugFix][Core] Multistep Fix Crash on Request Cancellation (#8059) Robert Shaw 2024-08-31 15:44:03 -04:00
4f5d8446ed [Bugfix] Fix ModelScope models in v0.5.5 (#8037) Nicolò Lucchesi 2024-08-31 09:27:58 +02:00
d05f0a9db2 [Bugfix] Fix import error in Phi-3.5-MoE (#8052) Cyrus Leung 2024-08-31 13:26:55 +08:00
622f8abff8 [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) Pavani Majety 2024-08-30 22:18:50 -07:00
1248e8506a [Model] Adding support for MSFT Phi-3.5-MoE (#7729) Wenxiang 2024-08-31 03:42:57 +08:00
2684efc467 [TPU][Bugfix] Fix tpu type api (#8035) Woosuk Kwon 2024-08-30 09:01:26 -07:00
058344f89a [Frontend]-config-cli-args (#7737) Kaunil Dhruv 2024-08-30 08:21:02 -07:00
98cef6a227 [Core] Increase default max_num_batched_tokens for multimodal models (#8028) Cyrus Leung 2024-08-30 23:20:34 +08:00
f97be32d1d [VLM][Model] TP support for ViTs (#7186) Jungho Christopher Cho 2024-08-31 00:19:27 +09:00
afd39a4511 [Bugfix] Fix import error in Exaone model (#8034) Cyrus Leung 2024-08-30 23:03:28 +08:00
2148441fd3 [TPU] Support single and multi-host TPUs on GKE (#7613) Richard Liu 2024-08-30 00:27:40 -07:00
dc13e99348 [MODEL] add Exaone model support (#7819) Yohan Na 2024-08-30 15:34:20 +09:00
34a0e96d46 [Kernel] changing fused moe kernel chunk size default to 32k (#7995) Avshalom Manevich 2024-08-30 11:11:39 +07:00
80c7b089b1 [TPU] Async output processing for TPU (#8011) Woosuk Kwon 2024-08-29 19:35:29 -07:00
428dd1445e [Core] Logprobs support in Multi-step (#7652) afeldman-nm 2024-08-29 22:19:08 -04:00

... 132 133 134 135 136 ...