Commit Graph

  • a11f326528 [V1] Initial support of multimodal models for V1 re-arch (#10699) Roger Wang 2024-12-08 04:50:51 -08:00
  • fd57d2b534 [torch.compile] allow candidate compile sizes (#10984) youkaichao 2024-12-08 03:05:21 -08:00
  • 7be15d9356 [core][misc] remove use_dummy driver for _run_workers (#10920) youkaichao 2024-12-07 12:06:08 -08:00
  • 1b62745b1d [core][executor] simplify instance id (#10976) youkaichao 2024-12-07 09:33:45 -08:00
  • 78029b34ed [BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928) zhou fan 2024-12-08 01:21:18 +08:00
  • c889d5888b [Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975) Cyrus Leung 2024-12-08 01:20:49 +08:00
  • 39e227c7ae [Model] Update multi-modal processor to support Mantis(LLaVA) model (#10711) Cyrus Leung 2024-12-08 01:10:05 +08:00
  • 1c768fe537 [Doc] Explicitly state that InternVL 2.5 is supported (#10978) Cyrus Leung 2024-12-08 00:58:02 +08:00
  • bf0e382e16 [Model] Composite weight loading for multimodal Qwen2 (#10944) Cyrus Leung 2024-12-07 22:22:52 +08:00
  • b26b4cd03c [Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation (#10958) Isotr0py 2024-12-07 18:33:49 +08:00
  • f13cf9ad50 [Build] Fix for the Wswitch-bool clang warning (#10060) Gregory Shtrasberg 2024-12-07 04:03:44 -05:00
  • 955fa9533a [3/N] Support and implement merged input processor for LLaVA model (#10676) Cyrus Leung 2024-12-07 16:50:58 +08:00
  • acf092d348 [Bugfix] Fix test-pipeline.yaml (#10973) Jee Jee Li 2024-12-07 12:08:54 +08:00
  • 69d357ba12 [Core] Cleanup startup logging a bit (#10961) Russell Bryant 2024-12-06 21:30:23 -05:00
  • dcdc3fafe5 [ci] fix broken tests (#10956) youkaichao 2024-12-06 11:25:47 -08:00
  • c05cfb67da [misc] fix typo (#10960) youkaichao 2024-12-06 11:25:20 -08:00
  • 7406274041 [Doc] add KubeAI to serving integrations (#10837) Sam Stoelinga 2024-12-06 09:03:56 -08:00
  • 8b59631855 [Core] Support Lark grammars for XGrammar (#10870) Michael Goin 2024-12-06 10:34:29 -05:00
  • a1887f2c96 [torch.compile] fix deprecated code (#10948) youkaichao 2024-12-06 03:01:23 -08:00
  • 222f5b082a [CI/Build] Fix broken multimodal test (#10950) Cyrus Leung 2024-12-06 18:41:23 +08:00
  • b031a455a9 [torch.compile] add logging for compilation time (#10941) youkaichao 2024-12-06 02:07:15 -08:00
  • db87eb6c67 [torch.compile] use size tuning for specific sizes (#10933) youkaichao 2024-12-05 20:30:41 -08:00
  • 9743d64e4e [ci][build] add tests for python only compilation (#10915) youkaichao 2024-12-05 08:54:47 -08:00
  • a43065272f [Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897) Konrad Zawora 2024-12-05 17:47:46 +01:00
  • 998eeafe58 [CI/Build] Bump test transformers version (#10106) Isotr0py 2024-12-06 00:05:52 +08:00
  • 571da8fc43 [Misc][LoRA] Clean up the function interface of Punica (#10917) Jee Jee Li 2024-12-05 21:22:28 +08:00
  • 39c89e71a8 [Misc] Update llama 3.2 template to support system prompt with images (#10901) Travis Johnson 2024-12-04 22:54:06 -07:00
  • 1f958a7d52 [Bugfix] Fix BNB loader target_modules (#10720) Jee Jee Li 2024-12-05 13:20:26 +08:00
  • aa39a8e175 [Doc] Create a new "Usage" section (#10827) Cyrus Leung 2024-12-05 11:19:35 +08:00
  • 8d370e91cb [Bugfix] Fallback to outlines for complex json schemas (#10899) Michael Goin 2024-12-04 22:14:06 -05:00
  • 7883c2bbe7 [benchmark] Make H100 benchmark optional (#10908) Kevin H. Luu 2024-12-04 17:02:17 -08:00
  • 2a56e1264f [V1] Fix when max_model_len is not divisible by block_size (#10903) Woosuk Kwon 2024-12-04 16:54:05 -08:00
  • e4c34c23de [CI/Build] improve python-only dev setup (#9621) Daniele 2024-12-04 22:48:13 +01:00
  • 82eb5ea8f3 Benchmark serving structured output (#10880) Chendi.Xue 2024-12-04 15:28:21 -06:00
  • 10398b4706 [Model] Consolidate ViTs attention implementation without mask (#10893) Isotr0py 2024-12-05 02:11:08 +08:00
  • 01d079fd8e [LoRA] Change lora_tokenizers capacity (#10796) Xin Yang 2024-12-04 09:40:16 -08:00
  • c92acb9693 [ci/build] Update vLLM postmerge ECR repo (#10887) Kevin H. Luu 2024-12-04 01:01:20 -08:00
  • 8db957ee3a [bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854) jianzheng 2024-12-04 16:48:22 +08:00
  • c9ca4fce3f [ci/build] Job to build and push release image (#10877) Kevin H. Luu 2024-12-03 23:02:40 -08:00
  • fa2dea61df [ci/build] Change queue name for Release jobs (#10875) Kevin H. Luu 2024-12-03 23:02:16 -08:00
  • b5b647b084 Drop ROCm load format check (#10767) wangxiyuan 2024-12-04 12:32:21 +08:00
  • d2bd88b122 [CI/Build] Replace mean with torch.all in test_pynccl.py (#10876) Tyler Michael Smith 2024-12-03 22:23:21 -05:00
  • 381ac93bb5 [Benchmark] Benchmark structured output with datasets (#10557) Chendi.Xue 2024-12-03 18:21:06 -06:00
  • a061fe601e [Build][Bugfix] Using the correct type hint (#10866) Gregory Shtrasberg 2024-12-03 15:47:55 -05:00
  • 7c32b6861e [Frontend] correctly record prefill and decode time metrics (#10853) tomeras91 2024-12-03 21:13:31 +02:00
  • 7090c27bb2 [Bugfix] Only require XGrammar on x86 (#10865) Michael Goin 2024-12-03 13:32:21 -05:00
  • 2f2cdc745a [MISC][XPU] quick fix for XPU CI (#10859) Yan Ma 2024-12-04 01:16:31 +08:00
  • 3bc94cab69 [V1] VLM - Run the mm_mapper preprocessor in the frontend process (#10640) Alexander Matveev 2024-12-03 05:33:10 -05:00
  • f6084f6324 [Speculative Decoding] Move indices to device before filtering output (#10850) Yang Zheng 2024-12-03 17:01:39 +08:00
  • 9323a3153b [Core][Performance] Add XGrammar support for guided decoding and set it as default (#10785) Aaron Pham 2024-12-03 02:17:00 -05:00
  • 3257d449fa [Misc] Remove deprecated names (#10817) Cyrus Leung 2024-12-03 14:52:57 +08:00
  • ef51831ee8 [Doc] Add github links for source code references (#10672) Russell Bryant 2024-12-03 01:46:07 -05:00
  • dc5ce861bf [torch.compile] remove compilation_context and simplify code (#10838) youkaichao 2024-12-02 22:19:02 -08:00
  • 21fe7b481a [core][distributed] add pynccl broadcast (#10843) youkaichao 2024-12-02 20:53:23 -08:00
  • a4cf256159 [Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844) Jee Jee Li 2024-12-03 12:10:29 +08:00
  • d746268e92 [Model] support bitsandbytes quantization with minicpm model (#10842) zixuanzhang226 2024-12-02 19:06:41 -08:00
  • 4433195ab7 [Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts (#10753) Michael Goin 2024-12-02 21:26:15 -05:00
  • 4c05edb33a [Model] Add TP and BNB quantization support to LlavaMultiModalProjector (#10834) Isotr0py 2024-12-03 07:06:09 +08:00
  • 9b14d978aa Fix openvino on GPU (#10793) Jani Monoses 2024-12-02 20:52:19 +02:00
  • 519cc6ca12 [Misc][XPU] Avoid torch compile for XPU platform (#10747) Yan Ma 2024-12-03 01:53:55 +08:00
  • b45f0d7946 [Misc][LoRA] Move the implementation of lora bias to punica.py (#10829) Jee Jee Li 2024-12-03 01:53:36 +08:00
  • a4c4daf364 [misc] use out argument for flash attention (#10822) youkaichao 2024-12-02 02:50:10 -08:00
  • e95f275f57 [CI/Build] Update mistral_common version for tests and docs (#10825) Cyrus Leung 2024-12-02 18:26:10 +08:00
  • ef31eabc68 [Model]: add some tests for aria model (#10770) zhou fan 2024-12-02 13:36:36 +08:00
  • 995a148575 [doc]Update config docstring (#10732) wangxiyuan 2024-12-02 12:14:45 +08:00
  • 63a164172d [misc] remove xverse modeling file (#10814) youkaichao 2024-12-01 19:27:13 -08:00
  • e25810ae29 Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799) Maximilien de Bayser 2024-12-01 23:05:32 -03:00
  • 073a4bd1c0 [Kernel] Use out arg in flash_attn_varlen_func (#10811) Woosuk Kwon 2024-12-01 17:55:39 -08:00
  • b7954776fd [core] Avoid metrics log noise when idle - include speculative decodi… (#10809) cduk 2024-12-02 02:49:48 +01:00
  • b18c9bbaba [Model] Add BNB support to Llava and Pixtral-HF (#10795) Isotr0py 2024-12-02 09:31:09 +08:00
  • 0590ec3fd9 [Core] Implement disagg prefill by StatelessProcessGroup (#10502) Kuntai Du 2024-12-01 19:01:00 -06:00
  • c11f172187 [Misc] Adding MMMU-Pro vision dataset to serving benchmark (#10804) Roger Wang 2024-12-01 00:47:05 -08:00
  • 169a0ff911 [doc] add warning about comparing hf and vllm outputs (#10805) youkaichao 2024-12-01 00:41:38 -08:00
  • d2f058e76c [Misc] Rename embedding classes to pooling (#10801) Cyrus Leung 2024-12-01 14:36:51 +08:00
  • f877a7d12a [Misc] Improve type annotations for support_torch_compile (#10763) Cyrus Leung 2024-12-01 09:48:35 +08:00
  • 133707123e [Model] Replace embedding models with pooling adapter (#10769) Cyrus Leung 2024-12-01 08:02:54 +08:00
  • 7e4bbda573 [doc] format fix (#10789) wangxiyuan 2024-11-30 19:38:40 +08:00
  • e7cfc4ef4c [Interleaved ATTN] Support for Mistral-8B (#10591) Patrick von Platen 2024-11-30 08:45:50 +01:00
  • 16ee07f22a [Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771) Isotr0py 2024-11-30 12:19:14 +08:00
  • 40bc242579 [Bugfix] Fix OpenVino/Neuron driver_worker init (#10779) Nicolò Lucchesi 2024-11-30 05:07:13 +01:00
  • 661175bc82 [platform] Add verify_quantization in platform. (#10757) wangxiyuan 2024-11-29 23:22:21 +08:00
  • 3132aac043 [Bugfix] Fix Idefics3 bug (#10778) Jee Jee Li 2024-11-29 21:56:46 +08:00
  • c82b432d4a [Misc] typo find in sampling_metadata.py (#10740) wang.yuqi 2024-11-29 13:17:57 +08:00
  • fa6ecb9aa7 [Model] Clean up MiniCPMV (#10751) Cyrus Leung 2024-11-29 12:47:06 +08:00
  • c83919c7a6 [Model] Add Internlm2 LoRA support (#5064) Isotr0py 2024-11-29 01:29:04 +08:00
  • 98f47f2a40 [V1] Optimize the CPU overheads in FlashAttention custom op (#10733) Woosuk Kwon 2024-11-28 09:01:02 -08:00
  • 8c1e77fb58 [Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742) Woosuk Kwon 2024-11-28 08:31:28 -08:00
  • 5fc5ce0fe4 [Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561) sixgod 2024-11-28 22:53:31 +08:00
  • 3ed5e73146 [TPU] Update requirements-tpu (#10726) Richard Liu 2024-11-28 02:30:48 -08:00
  • 9a8bff0285 [Kernel] Update vllm-flash-attn version (#10736) Woosuk Kwon 2024-11-28 02:25:59 -08:00
  • a79b122400 [V1] Do not allocate beyond the max_model_len (#10730) Woosuk Kwon 2024-11-28 00:13:15 -08:00
  • d9b4b3f069 [Bug][CLI] Allow users to disable prefix caching explicitly (#10724) Ricky Xu 2024-11-27 23:59:28 -08:00
  • 278be671a3 [Doc] Update model in arch_overview.rst to match comment (#10701) 罗泽轩 2024-11-28 15:58:39 +08:00
  • 70dc14fbd0 [Model] support bitsandbytes quantization with minicpm3 model (#10682) zixuanzhang226 2024-11-27 23:58:02 -08:00
  • cb4e1c3f3a [misc] upgrade filelock version (#10731) youkaichao 2024-11-27 19:54:58 -08:00
  • 395b1c7454 [Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server (#10635) tomeras91 2024-11-27 23:21:10 +02:00
  • 9b4b150395 [Bugfix] Ignore lm_head when loading embedding models (#10719) Cyrus Leung 2024-11-28 03:05:29 +08:00
  • 197b4484a3 [Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705) Mor Zusman 2024-11-27 21:02:27 +02:00
  • b98c62ba49 [Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675) Isotr0py 2024-11-28 02:43:17 +08:00
  • c411def234 [torch.compile] fix shape specialization (#10722) youkaichao 2024-11-27 10:16:10 -08:00