Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

9cc373f390 [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) Charlie Fu 2024-09-19 12:37:57 -05:00
76515f303b [Frontend] Use MQLLMEngine for embeddings models too (#8584) Nick Hill 2024-09-19 17:51:06 +01:00
855c8ae2c9 [MISC] remove engine_use_ray in benchmark_throughput.py (#8615) Kunshang Ji 2024-09-19 13:33:20 +08:00
c52ec5f034 [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616) Kuntai Du 2024-09-18 22:24:24 -07:00
02c9afa2d0 Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (#8593) Roger Wang 2024-09-18 21:14:28 -07:00
3118f63385 [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545) sroy745 2024-09-18 19:24:15 -07:00
4c34ce8916 [Kernel] Remove marlin moe templating on thread_m_blocks (#8573) Tyler Michael Smith 2024-09-18 21:42:49 -04:00
0d47bf3bf4 [Bugfix] add dead_error property to engine client (#8574) Joe Runde 2024-09-18 16:10:01 -06:00
d9cd78eb71 [BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572) Nick Hill 2024-09-18 21:17:55 +01:00
db9120cded [Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) Tyler Michael Smith 2024-09-18 16:05:06 -04:00
b3195bc9e4 [AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380) Gregory Shtrasberg 2024-09-18 13:41:08 -04:00
e18749ff09 [Model] Support Solar Model (#8386) Geun, Lim 2024-09-19 02:04:00 +09:00
d65798f78c [Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543) Russell Bryant 2024-09-18 12:10:27 -04:00
a8c1d161a7 [Core] *Prompt* logprobs support in Multi-step (#8199) afeldman-nm 2024-09-18 11:38:43 -04:00
7c7714d856 [Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH (#8157) Alexander Matveev 2024-09-18 09:56:58 -04:00
9d104b5beb [CI/Build] Update Ruff version (#8469) Aaron Pham 2024-09-18 07:00:56 -04:00
6ffa3f314c [CI/Build] Avoid CUDA initialization (#8534) Cyrus Leung 2024-09-18 18:38:11 +08:00
e351572900 [Misc] Add argument to disable FastAPI docs (#8554) Jiaxin Shan 2024-09-18 02:51:59 -07:00
95965d31b6 [CI/Build] fix Dockerfile.cpu on podman (#8540) Daniele 2024-09-18 04:49:53 +02:00
8110e44529 [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) Tyler Michael Smith 2024-09-17 19:44:27 -04:00
09deb4721f [CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520) Alexey Kondratiev(AMD) 2024-09-17 19:40:29 -04:00
fa0c114fad [doc] improve installation doc (#8550) youkaichao 2024-09-17 16:24:06 -07:00
98f9713399 [Bugfix] Fix TP > 1 for new granite (#8544) Joe Runde 2024-09-17 17:17:08 -06:00
56c3de018c [Misc] Don't dump contents of kvcache tensors on errors (#8527) Nick Hill 2024-09-17 20:24:29 +01:00
a54ed80249 [Model] Add mistral function calling format to all models loaded with "mistral" format (#8515) Patrick von Platen 2024-09-17 19:50:37 +02:00
9855b99502 [Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) chenqianfzh 2024-09-17 08:09:12 -07:00
1009e93c5d [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) sroy745 2024-09-17 07:35:01 -07:00
1b6de8352b [Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495) Isotr0py 2024-09-17 15:34:27 +08:00
cbdb252259 [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (#8509) Rui Qiao 2024-09-17 00:06:26 -07:00
99aa4eddaf [torch.compile] register allreduce operations as custom ops (#8526) youkaichao 2024-09-16 22:57:57 -07:00
ee2bceaaa6 [Misc][Bugfix] Disable guided decoding for mistral tokenizer (#8521) Roger Wang 2024-09-16 22:22:45 -07:00
1c1bb388e0 [Frontend] Improve Nullable kv Arg Parsing (#8525) Alex Brooks 2024-09-16 22:17:32 -06:00
546034b466 [refactor] remove triton based sampler (#8524) Simon Mo 2024-09-16 20:04:48 -07:00
cca61642e0 [Bugfix] Fix 3.12 builds on main (#8510) Joe Runde 2024-09-16 18:01:45 -06:00
5ce45eb54d [misc] small qol fixes for release process (#8517) Simon Mo 2024-09-16 15:11:27 -07:00
5478c4b41f [perf bench] set timeout to debug hanging (#8516) Simon Mo 2024-09-16 14:30:02 -07:00
47f5e03b5b [Bugfix] Bind api server port before starting engine (#8491) Kevin Lin 2024-09-16 15:56:28 -05:00
2759a43a26 [doc] update doc on testing and debugging (#8514) youkaichao 2024-09-16 12:10:23 -07:00
5d73ae49d6 [Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) Luka Govedič 2024-09-16 14:52:40 -04:00
781e3b9a42 [Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (#8506) sasha0552 2024-09-16 18:15:57 +00:00
acd5511b6d [BugFix] Fix clean shutdown issues (#8492) Nick Hill 2024-09-16 17:33:46 +01:00
837c1968f9 [Frontend] Expose revision arg in OpenAI server (#8501) lewtun 2024-09-16 17:55:26 +02:00
a091e2da3e [Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032) ElizaWszola 2024-09-16 17:47:19 +02:00
fc990f9795 [Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357) Isotr0py 2024-09-16 06:51:44 +08:00
3724d5f6b5 [Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (#8490) Chris 2024-09-15 06:20:05 +02:00
50e9ec41fc [TPU] Implement multi-step scheduling (#8489) Woosuk Kwon 2024-09-14 16:58:31 -07:00
47790f3e32 [torch.compile] add a flag to disable custom op (#8488) youkaichao 2024-09-14 13:07:16 -07:00
a36e070dad [torch.compile] fix functionalization (#8480) youkaichao 2024-09-14 09:46:04 -07:00
8a0cf1ddc3 [Model] support minicpm3 (#8297) ywfang 2024-09-14 22:50:26 +08:00
1ef0d2efd0 [Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) Charlie Fu 2024-09-13 19:01:11 -05:00
851725202a [Hardware][intel GPU] bump up ipex version to 2.3 (#8365) Kunshang Ji 2024-09-14 07:54:34 +08:00
9ba0817ff1 bump version to v0.6.1.post2 (#8473) (tag: v0.6.1.post2) Simon Mo 2024-09-13 11:35:00 -07:00
18e9e1f7b3 [HotFix] Fix final output truncation with stop string + streaming (#8468) Nick Hill 2024-09-13 19:31:12 +01:00
f57092c00b [Doc] Add oneDNN installation to CPU backend documentation (#8467) Isotr0py 2024-09-14 02:06:30 +08:00
a84e598e21 [CI/Build] Reorganize models tests (#7820) Cyrus Leung 2024-09-14 01:20:06 +08:00
0a4806f0a9 [plugin][torch.compile] allow to add custom compile backend (#8445) youkaichao 2024-09-13 09:32:42 -07:00
ecd7a1d5b6 [Installation] Gate FastAPI version for Python 3.8 (#8456) Cyrus Leung 2024-09-14 00:02:26 +08:00
a2469127db [misc][ci] fix quant test (#8449) youkaichao 2024-09-13 02:20:14 -07:00
06311e2956 [Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 (#8442) Jee Jee Li 2024-09-13 15:58:28 +08:00
cab69a15e4 [doc] recommend pip instead of conda (#8446) youkaichao 2024-09-12 23:52:41 -07:00
9b4a3b235e [CI/Build] Enable InternVL2 PP test only on single node (#8437) Isotr0py 2024-09-13 14:35:20 +08:00
acda0b35d0 bump version to v0.6.1.post1 (#8440) (tag: v0.6.1.post1) Simon Mo 2024-09-12 21:39:49 -07:00
ba77527955 [bugfix] torch profiler bug for single gpu with GPUExecutor (#8354) William Lin 2024-09-12 21:30:00 -07:00
6821020109 [Bugfix] Fix async log stats (#8417) Alexander Matveev 2024-09-12 23:48:59 -04:00
8427550488 [CI/Build] Update pixtral tests to use JSON (#8436) Cyrus Leung 2024-09-13 11:47:52 +08:00
3f79bc3d1a [Bugfix] Bump fastapi and pydantic version (#8435) Cyrus Leung 2024-09-13 11:21:42 +08:00
40c396533d [Bugfix] Mapping physical device indices for e2e test utils (#8290) shangmingc 2024-09-13 11:06:28 +08:00
5ec9c0fb3c [Core] Factor out input preprocessing to a separate class (#7329) Cyrus Leung 2024-09-13 10:56:13 +08:00
8f44a92d85 [BugFix] fix group_topk (#8430) Dipika Sikka 2024-09-12 21:23:42 -04:00
360ddbd37e [Misc] Update Pixtral example (#8431) Roger Wang 2024-09-12 17:31:18 -07:00
a480939e8e [Bugfix] Fix weight loading issue by rename variable. (#8293) Wenxiang 2024-09-13 07:25:00 +08:00
d31174a4e1 [Hotfix][Pixtral] Fix multiple images bugs (#8415) Patrick von Platen 2024-09-13 00:21:51 +02:00
b61bd98f90 [CI/Build] Disable multi-node test for InternVL2 (#8428) Roger Wang 2024-09-12 15:05:35 -07:00
c16369455f [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models (#8425) Roger Wang 2024-09-12 14:06:51 -07:00
019877253b [Bugfix] multi-step + flashinfer: ensure cuda graph compatible (#8427) Alexander Matveev 2024-09-12 17:01:50 -04:00
551ce01078 [Core] Add engine option to return only deltas or final output (#7381) Nick Hill 2024-09-12 20:02:00 +01:00
a6c0f3658d [multi-step] add flashinfer backend (#7928) William Lin 2024-09-12 11:16:22 -07:00
f2e263b801 [Bugfix] Offline mode fix (#8376) Joe Runde 2024-09-12 12:11:57 -06:00
1f0c75afa9 [BugFix] Fix Duplicate Assignment in Hermes2ProToolParser (#8423) Luis Vega 2024-09-12 11:10:11 -07:00
8a23e93302 [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance (#8403) WANGWEI 2024-09-13 01:47:42 +08:00
c6202daeed [Model] Support multiple images for qwen-vl (#8247) Alex Brooks 2024-09-12 11:10:54 -06:00
e56bf27741 [Bugfix] Fix InternVL2 inference with various num_patches (#8375) Isotr0py 2024-09-13 01:10:35 +08:00
520ca380ae [Hotfix][VLM] Fixing max position embeddings for Pixtral (#8399) Roger Wang 2024-09-12 09:28:37 -07:00
7de49aa86c [torch.compile] hide slicing under custom op for inductor (#8384) youkaichao 2024-09-12 00:11:55 -07:00
42ffba11ad [Misc] Use RoPE cache for MRoPE (#8396) Woosuk Kwon 2024-09-11 23:13:14 -07:00
295c4730a8 [Misc] Raise error when using encoder/decoder model with cpu backend (#8355) Kevin Lin 2024-09-12 00:45:24 -05:00
1bf2dd9df0 [Gemma2] add bitsandbytes support for Gemma2 (#8338) Blueyo0 2024-09-12 12:53:12 +08:00
5a60699c45 [Bugfix]: Fix the logic for deciding if tool parsing is used (#8366) tomeras91 2024-09-12 06:55:30 +03:00
b6c75e1cf2 Fix the AMD weight loading tests (#8390) Michael Goin 2024-09-11 23:35:33 -04:00
b71c956deb [TPU] Use Ray for default distributed backend (#8389) Woosuk Kwon 2024-09-11 20:31:51 -07:00
f842a7aff1 [misc] remove engine_use_ray (#8126) youkaichao 2024-09-11 18:23:36 -07:00
a65cb16067 [MISC] Dump model runner inputs when crashing (#8305) Cody Yu 2024-09-11 18:12:25 -07:00
3fd2b0d21c Bump version to v0.6.1 (#8379) (tag: v0.6.1) Simon Mo 2024-09-11 14:42:11 -07:00
d394787e52 Pixtral (#8377) Patrick von Platen 2024-09-11 23:41:55 +02:00
775f00f81e [Speculative Decoding] Test refactor (#8317) Lily Liu 2024-09-11 14:07:34 -07:00
8baa454937 [Misc] Move device options to a single place (#8322) Aarni Koskela 2024-09-11 23:25:58 +03:00
73202dbe77 [Kernel][Misc] register ops to prevent graph breaks (#6917) bnellnm 2024-09-11 15:52:19 -04:00
7015417fd4 [Bugfix] Add missing attributes in mistral tokenizer (#8364) Cyrus Leung 2024-09-12 02:36:54 +08:00
aea02f30de [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation (#8373) Alexey Kondratiev(AMD) 2024-09-11 14:31:41 -04:00
0b952af458 [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) Li, Jiang 2024-09-12 00:46:46 +08:00
