Commit Graph

  • 0c6f5023c3 [V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250) Woosuk Kwon 2025-03-20 17:50:43 -07:00
  • 06dd08256f Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. (#14617) Yu Chin Fabian Lim 2025-03-21 08:44:37 +08:00
  • 2b22290ce0 [V1] Add flag to disable cascade attention (#15243) Woosuk Kwon 2025-03-20 15:24:16 -07:00
  • d8e82bc06d [Bugfix] fix V1 Engine crash while handling requests with duplicate request id (#15043) Jason 2025-03-21 01:01:02 +08:00
  • 086b56824c [ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 (#15172) Chi Zhang 2025-03-21 00:30:04 +08:00
  • 5a0905ba2a Replace misc issues with link to forum (#15226) Harry Mellor 2025-03-20 15:18:20 +00:00
  • a8f12a63fd Fix env vars for running Ray distributed backend on GKE (#15166) Richard Liu 2025-03-20 07:59:33 -07:00
  • 69ae2380c6 Add user forum to README (#15220) Harry Mellor 2025-03-20 14:39:51 +00:00
  • 27261e40a6 [Bugfix] Multi-video inference on LLaVA-Onevision (#15082) Cyrus Leung 2025-03-20 22:10:45 +08:00
  • e3f813c33b [macOS] Ugrade pytorch to 2.6.0 (#15129) Quang-Linh LE 2025-03-20 09:22:40 +01:00
  • c607a2652b Fixing Imprecise Type Annotations (#15192) Wang Ran (汪然) 2025-03-20 16:19:55 +08:00
  • 3d45e3d749 [release] Tag vllm-cpu with latest upon new version released (#15193) Kevin H. Luu 2025-03-20 01:19:10 -07:00
  • 742369d35a [Frontend][Bugfix] support prefill decode disaggregation on deepseek (#14824) billishyahao 2025-03-20 15:00:33 +08:00
  • bfe2fe0af4 typo: Update config.py (#15189) Wang Ran (汪然) 2025-03-20 14:31:21 +08:00
  • a8652f4f0f Enable CUDA graph support for llama 3.2 vision (#14917) Matt Ritter 2025-03-19 23:29:16 -07:00
  • 2f726b241e [Doc] Update README.md (#15187) Cyrus Leung 2025-03-20 13:25:58 +08:00
  • a597a57595 [Attention] Flash Attention 3 - fp8 (#14570) Mickaël Seznec 2025-03-20 06:14:20 +01:00
  • ae65f3e237 [Misc]fixed disable these http request logs (#14754) Chauncey 2025-03-20 12:53:40 +08:00
  • 34868b106a [Doc] Update Mistral Small 3.1/Pixtral example (#15184) Roger Wang 2025-03-19 21:46:06 -07:00
  • 1f16b7fe74 [Core][V0] Add guidance backend for structured output (#14589) Russell Bryant 2025-03-20 00:33:51 -04:00
  • b88be22165 [Benchmark] Allow oversample request in benchmark dataset (#15170) Jennifer Zhao 2025-03-19 21:32:58 -07:00
  • d8c6d7d6b5 [V1][TPU] Support V1 Sampler for ragged attention (#14227) Nicolò Lucchesi 2025-03-20 05:00:39 +01:00
  • 40828ce5fe fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… (#14673) Wang, Yi 2025-03-20 11:56:16 +08:00
  • ffa443afed [Bugfix] Fix embedding assignment for InternVL-based models (#15086) Cyrus Leung 2025-03-20 11:40:13 +08:00
  • 70e500cad9 Fix broken tests (#14713) Jovan Sardinha 2025-03-19 19:06:49 -07:00
  • 4cb1c05c9e [Doc] Clarify run vllm only on one node in distributed inference (#15148) Rui Qiao 2025-03-19 18:55:59 -07:00
  • c47aafa37c [BugFix] Lazily import XgrammarBackend to avoid early cuda init (#15171) Nick Hill 2025-03-19 18:30:43 -07:00
  • cfbca8a2f2 [V1] TPU - Tensor parallel MP support (#15059) Alexander Matveev 2025-03-19 20:55:18 -04:00
  • 0fe5609874 [Docs] Annouce Ollama and Singapore Meetups (#15161) Simon Mo 2025-03-19 16:18:04 -07:00
  • 22d33baca2 [FrontEnd][Perf] merge_async_iterators fast-path for single-prompt requests (#15150) Nick Hill 2025-03-19 14:04:41 -07:00
  • b0e96aaebb [V1][TPU] Change kv cache shape. (#15145) iefgnoix 2025-03-19 12:16:42 -07:00
  • 8310e0b59b simple bugfix: Update stats.py (#15139) Wang Ran (汪然) 2025-03-20 02:26:27 +08:00
  • 26dd972adb [FEAT]Support reset prefix cache by specified device (#15003) maobaolong 2025-03-20 01:54:41 +08:00
  • 61c7a1b856 [V1] Minor V1 async engine test refactor (#15075) (tag: v0.8.1) Murali Andoorveedu 2025-03-19 10:37:17 -07:00
  • 374ee287d8 [Frontend] Remove custom_cache_manager (#13791) Alessandro Sangiorgi 2025-03-19 11:13:50 -05:00
  • a4d83661d7 [Misc] Update the "the first vLLM China Meetup" slides link to point to the first page (#15134) Kero Liang 2025-03-19 23:07:39 +08:00
  • 8363cd093d [Bugfix] Adjust mllama to regional compilation (#15112) Jan Kaniecki 2025-03-19 15:57:25 +01:00
  • 6c5a3195db [Misc][Benchmark] Add support for different tokenizer_mode (#15040) Aaron Pham 2025-03-19 10:56:50 -04:00
  • 073d1ed354 [Doc] Update tip info on using latest transformers when creating a custom Dockerfile (#15070) Marc-Alexandre Côté 2025-03-19 09:33:40 -04:00
  • 3d446433ec [Bugfix] Fix size calculation of processing cache (#15114) Cyrus Leung 2025-03-19 20:53:19 +08:00
  • 1fe0fd12d3 [Misc] Avoid unnecessary HF do_rescale warning when passing dummy data (#15107) Cyrus Leung 2025-03-19 18:42:31 +08:00
  • dafb4e504a [V1][Bugfix] Fix oracle for device checking (#15104) Roger Wang 2025-03-19 03:35:32 -07:00
  • 68cf1601d3 [CI][Intel GPU] update XPU dockerfile and CI script (#15109) Kunshang Ji 2025-03-19 01:29:25 -07:00
  • 61f412187d [Bugfix] Re-enable Gemma3 for V1 (#14980) Cyrus Leung 2025-03-19 14:58:22 +08:00
  • 05ccd0aa35 [V1] Ensure using int64 for sampled token ids (#15065) Woosuk Kwon 2025-03-18 23:52:19 -07:00
  • f690372b68 [Core] Update dtype detection and defaults (#14858) Cyrus Leung 2025-03-19 13:49:33 +08:00
  • 8b3e94a357 [Model] Remove duplicated message check in Mistral chat completion request (#15069) Brayden Zhong 2025-03-19 01:09:32 -04:00
  • 437f9162d0 [Model] Pixtral: Remove layer instantiation duplication (#15053) Julien Denize 2025-03-19 03:34:03 +01:00
  • 4f065f12f5 [Misc][V1] Skip device checking if not available (#15061) Cody Yu 2025-03-18 19:33:43 -07:00
  • 228b768db6 [Doc] Minor v1_user_guide update (#15064) Jennifer Zhao 2025-03-18 16:10:45 -07:00
  • 027827cc1d fix long dtype in topk sampling (#15049) Chujie Zheng 2025-03-19 06:57:31 +08:00
  • 72a8639b68 [V1] TPU - CI/CD use smaller model (#15054) Alexander Matveev 2025-03-18 17:39:21 -04:00
  • 99abb8b650 [V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14930) Woosuk Kwon 2025-03-18 14:31:54 -07:00
  • 3a1e648158 [V1] Refactor Structured Output for multiple backends (#14694) Russell Bryant 2025-03-18 15:49:15 -04:00
  • 966f933ee1 [Bugfix] Fix LoRA extra vocab size (#15047) (tag: v0.8.0) Jee Jee Li 2025-03-19 00:40:29 +08:00
  • 1a504aff6c [Bugfix] Fix broken CPU quantization due to triton import (#15038) Isotr0py 2025-03-18 23:57:39 +08:00
  • 01ca85bbd8 [MODEL] Add support for Zamba2 models (#13185) yury-tokpanov 2025-03-18 08:56:21 -07:00
  • d82b9487ea [Bugfix] Register serializers for V0 MQ Engine (#15009) Simon Mo 2025-03-18 06:14:47 -07:00
  • be13281d4b [Bugfix] Loosen type check to avoid errors in V1 (#15021) Cyrus Leung 2025-03-18 20:54:40 +08:00
  • 54e084f7fb [Bugfix] torchrun compatibility (#14899) hoshi-hiyouga 2025-03-18 20:49:27 +08:00
  • 9e8f089d08 [Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685) Varun Sundar Rabindranath 2025-03-18 05:47:53 -04:00
  • 46c759c165 [Bugfix] Fix LoRA extra vocab size (#15047) Jee Jee Li 2025-03-19 00:40:29 +08:00
  • 179a619c21 [Bugfix] Fix broken CPU quantization due to triton import (#15038) Isotr0py 2025-03-18 23:57:39 +08:00
  • 452e8fd968 [MODEL] Add support for Zamba2 models (#13185) yury-tokpanov 2025-03-18 08:56:21 -07:00
  • 8b793f7ec6 MI325 configs, fused_moe_kernel bugfix (#14987) ekuznetsov139 2025-03-18 08:05:18 -07:00
  • af35d3a3cc [TPU][V1][Bugfix] Fix chunked prefill with padding (#15037) Nicolò Lucchesi 2025-03-18 15:34:45 +01:00
  • 3b457143d2 [Bugfix] Register serializers for V0 MQ Engine (#15009) Simon Mo 2025-03-18 06:14:47 -07:00
  • ab656f2c2f [Bugfix] Loosen type check to avoid errors in V1 (#15021) Cyrus Leung 2025-03-18 20:54:40 +08:00
  • 64fc2193dc [Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros (#14347) Serena 2025-03-18 20:50:19 +08:00
  • dd732028f5 [Bugfix][Frontend] Fix validation of logprobs in ChatCompletionRequest (#14352) Sebastian Schoennenbeck 2025-03-18 13:50:05 +01:00
  • 414919138b [Bugfix] torchrun compatibility (#14899) hoshi-hiyouga 2025-03-18 20:49:27 +08:00
  • db7c8ca910 [Misc] Embedding model support LoRA (#14935) Jee Jee Li 2025-03-18 20:07:00 +08:00
  • f863ffc965 [Mistral-Small 3.1] Update docs and tests (#14977) Patrick von Platen 2025-03-18 11:29:42 +01:00
  • 400d483e87 [Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685) Varun Sundar Rabindranath 2025-03-18 05:47:53 -04:00
  • d1695758b2 [Doc][V1] Fix V1 APC doc (#14920) Shanshan Shen 2025-03-18 16:15:46 +08:00
  • 53a0cf8b95 [Neuron] trim attention kernel tests to fit trn1.2x instance (#14988) Liangfu Chen 2025-03-18 00:05:52 -07:00
  • 5eeabc2a44 [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights (#14950) Tristan Leclercq 2025-03-18 00:27:26 +01:00
  • 18551e820c [V1] TPU - Fix CI/CD runner (#14974) Alexander Matveev 2025-03-17 17:07:07 -04:00
  • 16e9064f84 [V1] Guard Against Main Thread Usage (#14972) Robert Shaw 2025-03-17 16:23:02 -04:00
  • e41e160263 [V1] Guard Against Main Thread Usage (#14972) Robert Shaw 2025-03-17 16:23:02 -04:00
  • 5ac1a8e6e4 [Bugfix] Fix interface for Olmo2 on V1 (#14976) Roger Wang 2025-03-17 11:26:38 -07:00
  • b89fb2a4a1 [CI/Build] Use AutoModelForImageTextToText to load VLMs in tests (#14945) Cyrus Leung 2025-03-18 02:35:17 +08:00
  • 5340b0e221 [Bugfix] Fix interface for Olmo2 on V1 (#14976) Roger Wang 2025-03-17 11:26:38 -07:00
  • 37e3806132 [Bugfix] Make Gemma3 MM V0 only for now (#14971) (tag: v0.8.0rc2) Roger Wang 2025-03-17 10:04:21 -07:00
  • c0efdd655b [Fix][Structured Output] using vocab_size to construct matcher (#14868) Aaron Pham 2025-03-17 11:42:45 -04:00
  • aaaec52ad9 [Bugfix][Model] Mixtral: use unused head_dim config argument (#14961) Quentin 2025-03-17 15:44:18 +01:00
  • e1eb45d397 [Bugfix] Fix precommit - line too long in pixtral.py (#14960) Tyler Michael Smith 2025-03-17 10:18:50 -04:00
  • 89fca671fb [V1] Default MLA to V1 (#14921) Simon Mo 2025-03-17 06:54:40 -07:00
  • d20b0c139c Add patch merger (#14957) Patrick von Platen 2025-03-17 14:47:50 +01:00
  • 166a168b0f [Doc] Fix misleading log during multi-modal profiling (#14955) Cyrus Leung 2025-03-17 21:14:32 +08:00
  • 2bb0e1a799 [Bugfix][ROCm] running new process using spawn method for rocm in tests. (#14810) vllmellm 2025-03-17 19:33:35 +08:00
  • 6eaf1e5c52 [Misc] Add --seed option to offline multi-modal examples (#14934) Cyrus Leung 2025-03-17 18:00:17 +08:00
  • 868a8c5b2c [Bugfix] Fix Ultravox on V1 (#14929) Cyrus Leung 2025-03-17 17:15:20 +08:00
  • b4ad56c1bd [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846) iefgnoix 2025-03-17 01:48:28 -07:00
  • 69698f257e fix minor miscalled method (#14327) kushanam 2025-03-17 01:47:58 -07:00
  • cd0cd85102 [MISC] More AMD unused var clean up (#14926) Lu Fang 2025-03-17 01:40:41 -07:00
  • 0a74bfce9c setup.py: drop assumption about local main branch (#14692) Russell Bryant 2025-03-17 04:37:42 -04:00
  • dd3b865854 [Doc] Add vLLM Beijing meetup slide (#14938) Chen Zhang 2025-03-17 16:29:36 +08:00
  • 9b87a579aa [Misc][XPU] Use None as device capacity for XPU (#14932) Yan Ma 2025-03-17 16:22:14 +08:00
  • b539222d4e [V1] Remove input cache client (#14864) Cyrus Leung 2025-03-17 14:42:06 +08:00