Commit Graph - vllm

biondizzle/vllm

Commit Graph

d0feea31c7 [Kernel] optimize performance of gptq marlin kernel when n is small (#14138) Jinzhen Lin 2025-03-08 00:53:38 +08:00
58abe35455 [Benchmarks] Make detokenization optional in benchmark scripts (#11697) Jeremy Arnold 2025-03-07 10:09:00 -06:00
f7ebad2307 [Doc] Update prefix_caching.md to match the example image (#14420) York-RDWang 2025-03-07 23:29:00 +08:00
80e9afb5bc [V1][Core] Support for Structured Outputs (#12388) Aaron Pham 2025-03-07 10:19:11 -05:00
1e3598edeb Use the optimized block sizes after tuning the kernel. (#14329) iefgnoix 2025-03-07 05:25:13 -08:00
f7a6bd0fa1 Fix missing kv_caches and attn_metadata in OpenVINOCausalLM (#14271) Harry Mellor 2025-03-07 13:30:42 +01:00
0ca3b8e01c [BUGFIX] Skip tokenization support for throughput benchmark (#12712) Aleksandr Malyshev 2025-03-07 02:51:47 -08:00
cc10281498 [Misc] Set default value of seed to None (#14274) மனோஜ்குமார் பழனிச்சாமி 2025-03-07 16:10:01 +05:30
05fb6718f0 [Bugfix] Clean up multi-modal processors (#14417) Cyrus Leung 2025-03-07 18:33:38 +08:00
12c29a881f [Bugfix] Further clean up LoRA test (#14422) Jee Jee Li 2025-03-07 18:30:55 +08:00
70da0c0748 correct wrong markdown syntax (#14414) Peng Li 2025-03-07 16:01:18 +08:00
c1588a2c94 [GH] Auto-apply multi-modality label to relevant PRs (#14402) Cyrus Leung 2025-03-07 15:26:32 +08:00
8ca7a71df7 OpenVINO: added CPU-like conditions (#14338) Ilya Lavrenov 2025-03-07 10:24:49 +04:00
63137cd922 [Build] Add nightly wheel fallback when latest commit wheel unavailable (#14358) Isotr0py 2025-03-07 14:10:57 +08:00
ddd1ef66ec [Bugfix] Fix JambaForCausalLM LoRA (#14370) Jee Jee Li 2025-03-07 14:05:47 +08:00
e5e03c2c1b [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs (#14396) Lucas Wilkinson 2025-03-07 00:56:06 -05:00
e1744502c2 [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object (#14390) Luka Govedič 2025-03-07 00:20:16 -05:00
dae6896977 [Perf] Reduce MLA CPU overheads in V1 (#14384) Lucas Wilkinson 2025-03-06 22:59:14 -05:00
c34eeec58d [Bugfix] Correctly call cudaProfilerStop in benchmarks script (#14183) Brayden Zhong 2025-03-06 19:42:49 -05:00
ad60bbb2b2 [Doc] Fix a typo (#14385) Daniel Li 2025-03-06 16:31:52 -08:00
0578e5a462 [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue (#14310) Chengji Yao 2025-03-06 15:31:05 -08:00
04222984f8 [Docs] Add nsight guide to profiling docs (#14298) Michael Goin 2025-03-06 17:19:58 -05:00
6832707e90 [V1][Bugfix] Standardize quantized kv cache rejection for attention backends (#14221) Michael Goin 2025-03-06 17:18:29 -05:00
6b2ef5cd17 [Bug] Fix Attention when ignored in by quant_method (#14313) Michael Goin 2025-03-06 17:18:06 -05:00
958adce478 [Bugfix] Fix use_direct_call condition in FusedMoE layer for (#14382) Tyler Michael Smith 2025-03-06 17:17:21 -05:00
99b0915d3b [Kernel] Add needs_fixed_stride_order tag to most GEMMs (#14306) Tyler Michael Smith 2025-03-06 17:17:09 -05:00
8ca2b21c98 [CI] Disable spawn when running V1 Test (#14345) Thomas Parnell 2025-03-06 22:52:46 +01:00
d9292786e1 [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa (#13569) Michael Goin 2025-03-06 16:08:36 -05:00
cc2f9b32c8 [Distributed] Add enable_expert_parallel arg (#14305) Tyler Michael Smith 2025-03-06 13:54:45 -05:00
cd579352bf [V1] Do not detokenize if sampling param detokenize is False (#14224) Himanshu Jaju 2025-03-06 19:40:24 +01:00
9f1710f1ac Fix mla prefill context performance (#13897) Ying Zhong 2025-03-07 01:35:49 +08:00
e642ec962c Add authors to license header. (#14371) Thomas Parnell 2025-03-06 17:43:09 +01:00
ada19210a3 Adding cpu inference with VXE ISA for s390x architecture (#12613) Dilip Gowda Bhagavan 2025-03-06 22:10:53 +05:30
bf0560bda9 Reinstate best_of for V0 (#14356) Harry Mellor 2025-03-06 17:34:22 +01:00
151b08e0fe [RLHF] use worker_extension_cls for compatibility with V0 and V1 (#14185) youkaichao 2025-03-07 00:32:46 +08:00
81b2f4a45f [Doc] Fix date typo in README.md (#14366) Jitse Klomp 2025-03-06 17:29:57 +01:00
82551ad616 [Core] Don't use cache during multi-modal profiling (#14336) Cyrus Leung 2025-03-07 00:03:31 +08:00
caac5c2e59 [Bugfix][Core] fix abort_seq_group and memory leak when n>1 (#14326) courage17340 2025-03-06 23:59:32 +08:00
6bd1dd9d26 [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152) Thomas Parnell 2025-03-06 16:39:16 +01:00
4f27044aab [Doc] Correct beam_search using in generative_models.md (#14363) Irina Yuryeva 2025-03-06 18:37:10 +03:00
0ddc991f5c [Doc] Update reasoning with stream example to use OpenAI library (#14077) Yanyi Liu 2025-03-06 21:20:37 +08:00
fa82b93853 [Frontend][Docs] Transcription API streaming (#13301) Nicolò Lucchesi 2025-03-06 11:39:35 +01:00
69ff99fdcd [Core] Optimizing cross-attention QKVParallelLinear computation (#12325) Nicolò Lucchesi 2025-03-06 10:37:26 +01:00
5d802522a7 [V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 (#14275) lkchen 2025-03-06 00:58:41 -08:00
1769928079 [Model] Update Paligemma multimodal processing with PromptUpdate (#14015) kYLe 2025-03-06 02:31:38 -06:00
ed6ea06577 [Hardware] Update the flash attn tag to support Blackwell (#14244) Pavani Majety 2025-03-05 22:01:37 -08:00
5ee10e990d [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (#11301) Nicolò Lucchesi 2025-03-06 05:00:53 +01:00
3dbd2d813a [V1] LoRA - Enable more V1 tests (#14315) Varun Sundar Rabindranath 2025-03-05 22:55:42 -05:00
f5f7f00cd9 [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 (#14114) Ce Gao 2025-03-06 11:49:20 +08:00
abcc61e0af [misc] Mention ray list nodes command to troubleshoot ray issues (#14318) Rui Qiao 2025-03-05 18:00:36 -08:00
f6bb18fd9a [BugFix] MLA + V1, illegal memory access and accuracy issues (#14253) Lucas Wilkinson 2025-03-05 20:10:13 -05:00
71eaf8969b [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation (#13850) Yuan Tang 2025-03-05 20:09:29 -05:00
ca100c90fe Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917) Michael Goin 2025-03-05 20:08:51 -05:00
ffad94397d [CI/Build] Use spawn multiprocessing mode for V1 test pipeline (#14243) Russell Bryant 2025-03-05 20:08:02 -05:00
4dacaa4a83 [BugFix] Fix prefix caching V0 MLA (#14255) Lucas Wilkinson 2025-03-05 20:07:42 -05:00
a7ea35aa67 [Bugfix] Remove num_tokens_across_dp (#14302) Tyler Michael Smith 2025-03-05 18:55:55 -05:00
1e3e76b6cc [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch (#14237) pyc96 2025-03-05 14:22:40 -08:00
53ea6ad830 [V1][Easy] Add empty allowed_token_ids in the v1 sampler test (#14308) Lu Fang 2025-03-05 13:41:18 -08:00
1b7624bf5c [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env (#14267) Serena 2025-03-06 05:28:50 +08:00
ac60dc7fe1 [V1][BugFix] Fix for mixed top_k batch (#14301) Nick Hill 2025-03-05 12:43:04 -08:00
a4f1ee35d6 Deprecate best_of Sampling Parameter in anticipation for vLLM V1 (#13997) Vincent 2025-03-05 15:22:43 -05:00
a32c8669ca [V1][Minor] Remove obsolete FIXME comment (#14304) Nick Hill 2025-03-05 11:59:23 -08:00
ca2ca8de57 [Docs] Add Meta Slides (#14297) Simon Mo 2025-03-05 08:30:23 -08:00
f71b00a19e [Bugfix] Fix broken vision language example (#14292) Isotr0py 2025-03-05 23:57:10 +08:00
8f808cf86e prefix_caching.md: Fixed typo (#14293) DaividFrank 2025-03-05 16:43:13 +01:00
7bab4bb048 [Misc] Add Qwen2MoeForCausalLM moe tuning support (#14276) Jee Jee Li 2025-03-05 23:11:29 +08:00
e17e4488bd [LoRA] Remove linear hack outside transformers backend (#14177) Isotr0py 2025-03-05 23:06:28 +08:00
257e200a25 [V1][Frontend] Add Testing For V1 Runtime Parameters (#14159) Robert Shaw 2025-03-05 14:18:55 +00:00
47d4a7e004 Small update for external_launcher backend docs (#14288) Zhe Zhang 2025-03-05 05:30:00 -08:00
7f89a594dd [Doc] [3/N] Refer code examples for common cases in dev multimodal processor (#14278) Cyrus Leung 2025-03-05 20:29:50 +08:00
961644e6a8 [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (#14217) Iacopo Poli 2025-03-05 12:44:10 +01:00
8d6cd32b7b [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (#14169) Lu Fang 2025-03-05 00:49:44 -08:00
ec79b67c77 [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing (#14256) Roger Wang 2025-03-04 23:37:16 -08:00
32985bed7c [Frontend] Allow return_tokens_as_token_ids to be passed as a request param (#14066) Benjamin Chislett 2025-03-05 01:30:40 -05:00
dae9ec464c Temporarily disable test_awq_gemm_opcheck (#14251) Michael Goin 2025-03-05 01:10:35 -05:00
6eaf93020d [platforms] improve rocm debugging info (#14257) youkaichao 2025-03-05 13:32:18 +08:00
72c62eae5f [V1] EP/TP MoE + DP Attention (#13931) Tyler Michael Smith 2025-03-05 00:27:26 -05:00
0a995d5434 [Model] New model support for Phi-4-multimodal-instruct (#14119) Congcong Chen 2025-03-04 20:57:01 -08:00
ade3f7d988 [V1][Bugfix] Do not reset prefix caching metrics (#14235) Cody Yu 2025-03-04 20:39:13 -08:00
0df25101d6 [Bugfix] Fix gptq_marlin for deepseek-v3 (#13750) rainkert 2025-03-05 12:25:53 +08:00
e123aafdf0 Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (#14157) Michael Goin 2025-03-04 23:25:24 -05:00
5b143d33be Moved numba from common requirements to cuda/rocm specific requirements (#14199) Nishidha 2025-03-05 09:55:00 +05:30
eb59b5a6cb [misc] announce china meetup (#14248) youkaichao 2025-03-05 10:33:50 +08:00
fbfc3ee37e [V1][TPU] TPU multimodal model support for ragged attention (#14158) Michael Goin 2025-03-04 19:58:48 -05:00
3e1d223626 [ROCm] Disable a few more kernel tests that are broken on ROCm (#14145) Sage Moore 2025-03-04 15:37:55 -08:00
4f5b059f14 Clean up unused padding_idx variables across many model definitions (#13240) Tyler Michael Smith 2025-03-04 16:27:00 -05:00
288ca110f6 [Security] Serialize using safetensors instead of pickle in Mooncake Pipe (#14228) Kuntai Du 2025-03-04 15:10:32 -06:00
c2bd2196fc [v1][Metrics] Add design doc (#12745) Mark McLoughlin 2025-03-04 20:36:55 +00:00
550c7ba3dc [Docs] Update Dockerfile dependency image (#14215) Michael Goin 2025-03-04 15:22:11 -05:00
e5b2f1601a [Frontend] Do prompt_logprobs clamping for chat as well as completions (#14225) Harry Mellor 2025-03-04 21:13:06 +01:00
9badee53de Fix performance when --generation-config is not None (#14223) Harry Mellor 2025-03-04 20:59:22 +01:00
beebf4742a [TPU][Profiler] Support start_profile/stop_profile in TPU worker (#13988) Siyuan Liu 2025-03-04 11:40:06 -08:00
f89978ad7c add cutlass support for blackwell fp8 gemm (#13798) kushanam 2025-03-04 07:55:07 -08:00
b3cf368d79 [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py (#14161) lkchen 2025-03-04 07:43:59 -08:00
c8525f06fc [V0][Metrics] Deprecate some questionable request time metrics (#14135) Mark McLoughlin 2025-03-04 15:11:33 +00:00
5db6b2c961 [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (#13869) Nick Hill 2025-03-04 07:06:47 -08:00
6247bae6c6 [Bugfix] Restrict MacOS CPU detection (#14210) Michael Goin 2025-03-04 09:25:27 -05:00
3610fb4930 [doc] add "Failed to infer device type" to faq (#14200) youkaichao 2025-03-04 20:47:06 +08:00
71c4b40562 [sleep mode] error out with expandable_segments (#14189) youkaichao 2025-03-04 18:54:19 +08:00
ac65bc92df [platform] add debug logging during inferring the device type (#14195) youkaichao 2025-03-04 18:39:16 +08:00
