Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm


Commit Graph


a1a2aaadb9 [Model]: Add transformers backend support (#11330) Arthur 2025-02-03 14:30:38 +01:00
1298a400e8 [ci/build] fix gh200 test (#12681) youkaichao 2025-02-03 15:59:49 +08:00
ad4a9dc817 [cuda] manually import the correct pynvml module (#12679) youkaichao 2025-02-03 15:58:21 +08:00
b9986454fe Fix for attention layers to remain unquantized during moe_wn16 quant (#12570) Srikanth Srinivas 2025-02-02 21:46:19 -08:00
c5932e5dac Properly check if all fused layers are in the list of targets (#12666) Eldar Kurtic 2025-02-03 06:42:18 +01:00
20579c0fae make sure mistral_common not imported for non-mistral models (#12669) youkaichao 2025-02-03 13:40:25 +08:00
95460fc513 [Kernel] port sgl moe_align_block_size kernels (#12574) Yang Chen 2025-02-02 21:09:50 -08:00
326fcc8b9f [Doc] Deprecate Discord (#12668) Zhuohan Li 2025-02-02 19:19:56 -08:00
e64330910b [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667) youkaichao 2025-02-03 09:32:18 +08:00
e489ad7a21 [Misc] Add SPDX-License-Identifier headers to python source files (#12628) Russell Bryant 2025-02-02 14:58:18 -05:00
f256ebe4df [Hardware][Intel GPU] add XPU bf16 support (#12392) Kunshang Ji 2025-02-02 18:17:26 +08:00
f8ece6e17f [Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608) Shawn Du 2025-02-02 16:40:58 +08:00
abfcdcdf27 [V1][Minor] Avoid frequently creating ConstantList (#12653) Woosuk Kwon 2025-02-01 23:43:20 -08:00
e497f33491 [Core] Silence unnecessary deprecation warnings (#12620) Russell Bryant 2025-02-02 02:35:50 -05:00
baaa2b24da [Bugfix] fix moe_wna16 get_quant_method (#12648) Jinzhen Lin 2025-02-02 15:29:56 +08:00
b4e5c03306 doc: fixing minor typo in readme.md (#12643) Vicente Herrera 2025-02-01 18:17:29 +01:00
3194039c0e Apply torch.compile to fused_moe/grouped_topk (#12637) Michael Goin 2025-02-01 11:16:19 -05:00
4f4d427ac2 Disable chunked prefill and/or prefix caching when MLA is enabled (#12642) (tag: v0.7.1) Simon Mo 2025-01-31 23:46:57 -08:00
1e3698393f [CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280) Russell Bryant 2025-02-01 02:13:10 -05:00
baeded2569 [Attention] Deepseek v3 MLA support with FP8 compute (#12601) Lucas Wilkinson 2025-02-01 00:52:51 -05:00
3e1c76cf3a Fix: Respect sparsity_config.ignore in Cutlass Integration (#12517) Rahul Tuli 2025-01-31 23:41:59 -06:00
cfa134d247 [Bugfix/CI] Fixup benchmark_moe.py (#12562) Tyler Michael Smith 2025-02-01 00:41:35 -05:00
35b7a05507 [ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599) Kevin H. Luu 2025-01-31 21:22:23 -08:00
1867c258bd Fix target matching for fused layers with compressed-tensors (#12617) Eldar Kurtic 2025-02-01 06:07:46 +01:00
cb3e73e4c8 [BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161) fade_away 2025-02-01 12:52:07 +08:00
b1340f9d55 [V1] Bugfix: Validate Model Input Length (#12600) Robert Shaw 2025-01-31 21:32:04 -05:00
44bbca78d7 [Doc] int4 w4a16 example (#12585) Brian Dellabetta 2025-01-31 17:38:48 -06:00
60808bd4c7 [Doc] Improve installation signposting (#12575) Harry Mellor 2025-01-31 23:38:35 +00:00
fc542144c4 [Feature] Fix guided decoding blocking bitmask memcpy (#12563) Ryan Nguyen 2025-01-31 18:37:30 -05:00
eb5741ad42 [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587) Tyler Michael Smith 2025-01-31 18:29:11 -05:00
145c2ff648 [Bugfix] Revert MoE Triton Config Default (#12629) Robert Shaw 2025-01-31 18:28:47 -05:00
415f19474d [release] Add input step to ask for Release version (#12631) Kevin H. Luu 2025-01-31 13:39:36 -08:00
89003c4082 [v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603) Chen Zhang 2025-02-01 05:13:04 +08:00
60bcef000e [Docs][V1] Prefix caching design (#12598) Cody Yu 2025-01-31 12:30:46 -08:00
847f883232 [Git] Automatically sign-off commits (#12595) Cody Yu 2025-01-31 12:30:33 -08:00
325f679f32 [BugFix] Fix Torch.Compile For DeepSeek (#12594) Robert Shaw 2025-01-31 15:06:39 -05:00
e3f7ff65e7 Add favicon to docs (#12611) Harry Mellor 2025-01-31 17:20:34 +00:00
7a8987dac5 [Bugfix] Gracefully handle huggingface hub http error (#12571) Roger Wang 2025-01-31 00:19:35 -08:00
cabaf4eff3 [Attention] MLA decode optimizations (#12528) Lucas Wilkinson 2025-01-31 02:49:37 -05:00
a1fc18c030 [ROCm][AMD][Model] llama 3.2 support upstreaming (#12421) Aleksandr Malyshev 2025-01-30 20:24:28 -08:00
9798b2fb00 [Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling (#11868) Lucas Wilkinson 2025-01-30 21:33:00 -05:00
4078052f09 [V1][Log] Add max request concurrency log to V1 (#12569) Michael Goin 2025-01-30 18:07:19 -05:00
bd2107e30a [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555) Nishidha 2025-01-31 02:59:39 +05:30
9b0c4bab36 [Kernel] Triton Configs for Fp8 Block Quantization (#11589) Robert Shaw 2025-01-30 14:53:22 -05:00
41bf5612f5 [Misc] fix typo: add missing space in lora adapter error message (#12564) Beim 2025-01-31 04:39:22 +13:00
a2769032ca Set ?device={device} when changing tab in installation guides (#12560) Harry Mellor 2025-01-30 08:05:42 +00:00
f17f1d4608 [V1][Metrics] Add GPU cache usage % gauge (#12561) Mark McLoughlin 2025-01-30 02:31:01 +00:00
1c1bb0bbf2 [Misc][MoE] add Deepseek-V3 moe tuning support (#12558) Divakar Verma 2025-01-29 18:47:30 -06:00
e0cc5f259a [V1][BugFix] Free encoder cache for aborted requests (#12545) Woosuk Kwon 2025-01-29 13:47:33 -08:00
73aa6cfdf7 Revert "[Build/CI] Fix libcuda.so linkage" (#12552) Tyler Michael Smith 2025-01-29 16:12:24 -05:00
27b78c73ca [Kernel] add triton fused moe kernel for gptq/awq (#12185) Jinzhen Lin 2025-01-29 22:07:09 +08:00
b02fd288b2 [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. (#11787) Pavani Majety 2025-01-29 01:46:12 -08:00
ff7424f491 [Frontend] Support override generation config in args (#12409) Yanyi Liu 2025-01-29 17:41:01 +08:00
d93bf4da85 [Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069) Alphi 2025-01-29 17:24:59 +08:00
036ca94c25 [Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347) Travis Johnson 2025-01-29 01:54:35 -07:00
ef001d98ef Fix the pydantic logging validator (#12420) Maximilien de Bayser 2025-01-29 04:53:13 -03:00
5f671cb4c3 [V1] Improve Error Message for Unsupported Config (#12535) Robert Shaw 2025-01-28 23:56:56 -05:00
bd02164cf9 Bugfix for whisper quantization due to fake k_proj bias (#12524) Michael Goin 2025-01-28 23:49:03 -05:00
46fb056749 [V1][Metrics] Add TTFT and TPOT histograms (#12530) Mark McLoughlin 2025-01-29 04:11:16 +00:00
dd6a3a02cb [Doc] Convert docs to use colon fences (#12471) Harry Mellor 2025-01-29 03:38:29 +00:00
a7e3eba66f [Frontend] Support reasoning content for deepseek r1 (#12473) Ce Gao 2025-01-29 11:38:08 +08:00
fbb5bd4cef [TPU] Add example for profiling TPU inference (#12531) Michael Goin 2025-01-28 22:16:47 -05:00
80fcc3ed1c [Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels (#12482) fenghuizhang 2025-01-28 14:36:44 -08:00
c386c43ca3 [V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516) Mark McLoughlin 2025-01-28 22:07:22 +00:00
f26d790718 Do not run suggestion pre-commit hook multiple times (#12521) Harry Mellor 2025-01-28 20:05:27 +00:00
0f657bdc52 Replace missed warning_once for rerank API (#12472) Michael Goin 2025-01-28 14:06:32 -05:00
3fd1fb63ef [V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478) Mark McLoughlin 2025-01-28 16:38:38 +00:00
925d2f1908 [Doc] Fix typo for x86 CPU installation (#12514) Jun Duan 2025-01-28 11:37:10 -05:00
8f58a51358 [VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504) Cyrus Leung 2025-01-29 00:25:05 +08:00
2079e43bee [Core] Make raw_request optional in ServingCompletion (#12503) Sebastian Schoennenbeck 2025-01-28 11:56:45 +01:00
e29d4358ef [V1] Include Engine Version in Logs (#12496) Robert Shaw 2025-01-28 03:27:41 -05:00
8cbc424975 Update README.md with V1 alpha release (#12495) Roger Wang 2025-01-28 00:22:41 -08:00
dd66fd2b01 [CI] fix pre-commit error (#12494) Mengqing Cao 2025-01-28 14:11:05 +08:00
0f465ab533 [FEATURE] Enables offline /score for embedding models (#12021) Gabriel Marinho 2025-01-28 00:30:13 -03:00
23a7cbc88b [CI/Build] Fixed the xla nightly issue report in #12451 (#12453) Hossein Sarshar 2025-01-27 22:18:07 -05:00
426a5c3625 Fix bad path in prometheus example (#12481) Michael Goin 2025-01-27 20:56:31 -05:00
ddee88d0ff [Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (#11277) Liangfu Chen 2025-01-27 17:31:16 -08:00
823ab79633 Update pre-commit hooks (#12475) Harry Mellor 2025-01-28 00:23:08 +00:00
6116ca8cd7 [Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132) Nicolò Lucchesi 2025-01-27 22:38:35 +01:00
2bc3fbba0c [FlashInfer] Upgrade to 0.2.0 (#11194) Bowen Wang 2025-01-28 02:19:24 +08:00
3f1fc7425a [V1][CI/Test] Do basic test for top-p & top-k sampling (#12469) Woosuk Kwon 2025-01-27 09:40:04 -08:00
01ba927040 [V1][Metrics] Add initial Prometheus logger (#12416) Mark McLoughlin 2025-01-27 17:26:28 +00:00
103bd17ac5 [Build] Only build 9.0a for scaled_mm and sparse kernels (#12339) Lucas Wilkinson 2025-01-27 10:40:00 -05:00
ce69f7f754 [Bugfix] Fix gpt2 GGUF inference (#12467) Isotr0py 2025-01-27 18:31:49 +08:00
624a1e4711 [V1][Minor] Minor optimizations for update_from_output (#12454) Woosuk Kwon 2025-01-27 01:09:27 -08:00
372bf0890b [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464) Isotr0py 2025-01-27 15:25:30 +08:00
5204ff5c3f [Bugfix] Fix Granite 3.0 MoE model loading (#12446) (tag: v0.7.0) Cyrus Leung 2025-01-27 13:26:44 +08:00
0cc6b383d7 [Frontend] Support scores endpoint in run_batch (#12430) Pooya Davoodi 2025-01-26 20:30:17 -08:00
28e0750847 [V1] Avoid list creation in input preparation (#12457) Woosuk Kwon 2025-01-26 19:57:56 -08:00
582cf78798 [DOC] Add link to vLLM blog (#12460) Yuan Tang 2025-01-26 21:46:19 -06:00
0034b09ceb [Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376) Kyle Mistele 2025-01-26 20:58:45 -06:00
72bac73067 [Build/CI] Fix libcuda.so linkage (#12424) Tyler Michael Smith 2025-01-26 16:18:19 -05:00
68f11149d8 [Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434) Lucas Wilkinson 2025-01-26 14:09:34 -05:00
72f4880425 [Bugfix/CI] Fix broken kernels/test_mha.py (#12450) Tyler Michael Smith 2025-01-26 13:39:03 -05:00
aa2cd2c43d [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417) Tyler Michael Smith 2025-01-26 06:59:58 -05:00
9ddc35220b [Frontend] generation_config.json for maximum tokens(#12242) Matthew Hendrey 2025-01-26 06:59:25 -05:00
a5255270c3 [Misc] Revert FA on ViT #12355 and #12435 (#12445) Roger Wang 2025-01-26 03:56:34 -08:00
0ee349b553 [V1][Bugfix] Fix assertion when mm hashing is turned off (#12439) Roger Wang 2025-01-26 00:47:42 -08:00
fa63e710c7 [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094) Keyun Tong 2025-01-26 00:42:37 -08:00
2a0309a646 [Misc][Bugfix] FA3 support to ViT MHA layer (#12435) Roger Wang 2025-01-25 21:00:31 -08:00
