Commit Graph

  • 308cc5e21e [ci] fix slow tests (#10698) youkaichao 2024-11-27 09:26:14 -08:00
  • 9e0a147d50 [V1] Update interface for mistral-format Pixtral (#10703) Roger Wang 2024-11-27 04:26:27 -08:00
  • 418cb3b93f [Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700) Li, Jiang 2024-11-27 19:55:38 +08:00
  • 1209261e93 [Model] Support telechat2 (#10311) shunxing12345 2024-11-27 19:32:35 +08:00
  • e2251109c7 [Kernel] Remove if-else with identical branches in marlin 2:4 (#10687) Tyler Michael Smith 2024-11-27 01:55:32 -05:00
  • 15cc2a9f1a [Misc]Further reduce BNB static variable (#10597) Jee Jee Li 2024-11-27 14:54:12 +08:00
  • e85250b1d1 [Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667) Kunshang Ji 2024-11-27 14:49:40 +08:00
  • cfb3bf25fb [bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig (#10657) yansh97 2024-11-27 13:55:23 +08:00
  • 1bf905ddaa [Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. (#10198) jeongin601 2024-11-27 14:07:30 +09:00
  • 0a4d968500 [V1] Update interface for idefics3 (#10680) Roger Wang 2024-11-26 18:04:01 -08:00
  • 0a71900bc9 Remove hard-dependencies of Speculative decode to CUDA workers (#10587) Chendi.Xue 2024-11-26 19:57:11 -06:00
  • 2f0a0a17a4 [V1] Refactor model executable interface for multimodal models (#10570) Roger Wang 2024-11-26 12:46:11 -08:00
  • 7576cd38df [Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642) Michael Goin 2024-11-26 15:29:00 -05:00
  • 9a99273b48 [Bugfix] Fix using -O[0,3] with LLM entrypoint (#10677) Michael Goin 2024-11-26 13:44:01 -05:00
  • f5792c7c4a [Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735) Conroy Cheers 2024-11-27 05:26:28 +11:00
  • db66e018ea [Bugfix] Fix for Spec model TP + Chunked Prefill (#10232) Murali Andoorveedu 2024-11-26 09:11:16 -08:00
  • 1f6584ee85 [V1] Enable profile for LLMEngine (#10665) Kunshang Ji 2024-11-26 18:36:45 +08:00
  • 334d64d1e8 [ci] add vllm_test_utils (#10659) youkaichao 2024-11-26 00:20:04 -08:00
  • 940635343a [Misc] Remove outdated init protocols (#10655) Cyrus Leung 2024-11-26 14:55:00 +08:00
  • 9a88f89799 custom allreduce + torch.compile (#10121) Sage Moore 2024-11-26 00:00:16 -06:00
  • 519e8e4182 [v1] EngineArgs for better config handling for v1 (#10382) Ricky Xu 2024-11-25 21:09:43 -08:00
  • a6760f6456 [Feature] vLLM ARM Enablement for AARCH64 CPUs (#9228) Sanket Kale 2024-11-26 08:02:39 +05:30
  • 45ac4ff270 [bugfix] fix aria model and add torch.compile (#10645) youkaichao 2024-11-25 18:32:09 -08:00
  • 6e9ff050c8 [misc] do not read HOST_IP (#10644) youkaichao 2024-11-25 17:04:50 -08:00
  • 9db713a1dc [Model] Add OLMo November 2024 model (#10503) Shane A 2024-11-25 14:26:40 -08:00
  • 1b583cfefa [Doc] Fix typos in docs (#10636) Cyrus Leung 2024-11-26 02:15:45 +08:00
  • cf73f0c95e [Model] Enable optional prefix when loading embedding models (#10639) Cyrus Leung 2024-11-26 02:14:33 +08:00
  • b1d920531f [Model]: Add support for Aria model (#10514) zhou fan 2024-11-26 02:10:55 +08:00
  • 452a4e80c3 [Docs] Add Snowflake Slides (#10641) Simon Mo 2024-11-25 09:34:46 -08:00
  • c27df94e1f [Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850) Wallas Henrique 2024-11-25 14:23:32 -03:00
  • d04b13a380 [Bug]: Authorization ignored when root_path is set (#10606) Chauncey 2024-11-26 00:21:41 +08:00
  • 2b0879bfc2 Super tiny little typo fix (#10633) fzyzcjy 2024-11-25 21:08:30 +08:00
  • ed46f14321 [Model] Support is_causal HF config field for Qwen2 model (#10621) Cyrus Leung 2024-11-25 17:51:20 +08:00
  • 05d1f8c9c6 [misc] move functions to config.py (#10624) youkaichao 2024-11-25 01:27:30 -08:00
  • 25d806e953 [misc] add torch.compile compatibility check (#10618) youkaichao 2024-11-24 23:40:08 -08:00
  • 65813781a2 [torch.compile] add warning for unsupported models (#10622) youkaichao 2024-11-24 23:27:51 -08:00
  • 7c2134beda [torch.compile] force inductor threads (#10620) Jee Jee Li 2024-11-25 15:04:21 +08:00
  • a30a605d21 [Doc] Add encoder-based models to Supported Models page (#10616) Cyrus Leung 2024-11-25 14:34:07 +08:00
  • 571841b7fc [torch.compile] support encoder based models (#10613) youkaichao 2024-11-24 21:24:33 -08:00
  • 7ea3cd7c3e [Refactor][MISC] del redundant code in ParallelConfig.postinit (#10614) Mengqing Cao 2024-11-25 13:14:56 +08:00
  • 214efc2c3c Support Cross encoder models (#10400) Maximilien de Bayser 2024-11-24 23:56:20 -03:00
  • 49628fe13e [Doc] Update README.md with Ray Summit talk links (#10610) Zhuohan Li 2024-11-24 16:45:09 -08:00
  • e4fbb14414 [doc] update the code to add models (#10603) youkaichao 2024-11-24 11:21:40 -08:00
  • c055747867 [model][utils] add extract_layer_index utility function (#10599) youkaichao 2024-11-23 22:22:54 -08:00
  • eda2b3589c Revert "Print running script to enhance CI log readability" (#10601) youkaichao 2024-11-23 21:31:47 -08:00
  • 1c445dca51 [CI/Build] Print running script to enhance CI log readability (#10594) Jee Jee Li 2024-11-24 11:57:13 +08:00
  • 1700c543a5 [Bugfix] Fix LoRA weight sharding (#10450) Jee Jee Li 2024-11-24 09:23:17 +08:00
  • 17d8fc1806 [bugfix] Fix example/tensorize_vllm_model tests (#10595) Jee Jee Li 2024-11-24 09:22:33 +08:00
  • 04668ebe7a [Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593) Isotr0py 2024-11-24 02:12:20 +08:00
  • 651f6c31ac For ppc64le, disabled tests for now and addressed space issues (#10538) Nishidha 2024-11-23 15:03:53 +05:30
  • 86a44fb896 [Platforms] Refactor openvino code (#10573) JiHuazhong 2024-11-23 14:23:12 +08:00
  • 4cfe5d2bca [Bugfix] multi_modal_kwargs broadcast for CPU tensor parallel (#10541) Isotr0py 2024-11-23 13:25:46 +08:00
  • c8acd80548 [2/N] handling placeholders in merged multi-modal processor (#10485) Cyrus Leung 2024-11-23 13:25:09 +08:00
  • 4634a89d18 Prefix Cache Aware Scheduling [1/n] (#10128) Ricky Xu 2024-11-22 21:15:55 -08:00
  • 7c25fe45a6 [AMD] Add support for GGUF quantization on ROCm (#10254) kliuae 2024-11-23 13:14:49 +08:00
  • 02a43f82a9 Update default max_num_batch_tokens for chunked prefill to 2048 (#10544) Michael Goin 2024-11-23 00:14:19 -05:00
  • cfea9c04ef [Model] Fix Baichuan BNB online quantization (#10572) Chen Wu 2024-11-23 13:13:59 +08:00
  • 7d8ffb344f [Bugfix] Internal Server Error when tool_choice is incorrect. (#10567) Varun Vinayak Shenoy 2024-11-22 21:13:29 -08:00
  • 4aba6e3d1a [core] gemma2 full context length support (#10584) youkaichao 2024-11-22 20:13:54 -08:00
  • 978b39744b [Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432) Tyler Michael Smith 2024-11-22 22:14:03 -05:00
  • ebda51968b [Core] Fix broken log configuration (#10458) Russell Bryant 2024-11-22 21:23:51 -05:00
  • 9195dbdbca [Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use (#10164) Travis Johnson 2024-11-22 19:17:38 -07:00
  • d559979c54 [bugfix] fix cpu tests (#10585) youkaichao 2024-11-22 17:34:03 -08:00
  • d345f409b7 [V1] EngineCore supports profiling (#10564) Zhonghua Deng 2024-11-23 09:16:15 +08:00
  • 28598f3939 [Core] remove temporary local variables in LLMEngine.__init__ (#10577) Russell Bryant 2024-11-22 19:22:53 -05:00
  • 948c859571 support bitsandbytes quantization with qwen model (#10549) zixuanzhang226 2024-11-22 16:16:14 -08:00
  • 97814fbf0f [v1] Refactor KVCacheManager for more hash input than token ids (#10507) Ricky Xu 2024-11-22 15:27:25 -08:00
  • eebad39f26 [torch.compile] support all attention backends (#10558) youkaichao 2024-11-22 14:04:42 -08:00
  • db100c5cde [bugfix] fix full graph tests (#10581) youkaichao 2024-11-22 10:02:14 -08:00
  • 11fcf0e066 Remove token-adding chat embedding params (#10551) Noam Gat 2024-11-22 09:59:47 +02:00
  • b6374e09b0 [Bugfix] Fix Phi-3 BNB quantization with tensor parallel (#9948) Isotr0py 2024-11-22 15:01:56 +08:00
  • a111d0151f [platforms] absorb worker cls difference into platforms folder (#10555) youkaichao 2024-11-21 21:00:32 -08:00
  • 446c7806b2 [Minor] Fix line-too-long (#10563) Woosuk Kwon 2024-11-21 19:40:40 -08:00
  • 33e0a2540a [9/N] torch.compile LLM usage (#10552) youkaichao 2024-11-21 19:13:31 -08:00
  • aed074860a [Benchmark] Add new H100 machine (#10547) Simon Mo 2024-11-21 18:27:20 -08:00
  • 9afa014552 Add small example to metrics.rst (#10550) Michael Goin 2024-11-21 18:43:43 -05:00
  • 46fe9b46d8 [Minor] Revert change in offline inference example (#10545) Woosuk Kwon 2024-11-21 13:28:16 -08:00
  • cf656f5a02 [misc] improve error message (#10553) youkaichao 2024-11-21 13:13:17 -08:00
  • edec3385b6 [CI][Installation] Avoid uploading CUDA 11.8 wheel (#10535) Yunmeng 2024-11-22 05:03:58 +08:00
  • f9310cbd0c [V1] Fix Compilation config & Enable CUDA graph by default (#10528) Woosuk Kwon 2024-11-21 12:53:39 -08:00
  • 7560ae5caf [8/N] enable cli flag without a space (#10529) youkaichao 2024-11-21 12:30:42 -08:00
  • e7a8341c7c [Bugfix] Allow token ID-only inputs in Qwen2-Audio (#10536) Cyrus Leung 2024-11-22 02:09:43 +08:00
  • c51e397fe8 [Misc] Suppress duplicated logging regarding multimodal input pipeline (#10530) Roger Wang 2024-11-21 09:21:31 -08:00
  • 2385b60d83 [Kernel] Register punica ops directly (#10522) Jee Jee Li 2024-11-22 01:18:11 +08:00
  • da7e702c6f [Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored (#10180) Chauncey 2024-11-22 00:24:32 +08:00
  • 4d676f0852 [Bugfix] Embedding model pooling_type equals ALL and multi input's bug (#10494) Xiaoyu Zhang 2024-11-21 22:40:02 +08:00
  • d5ec121f95 [Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models (#10518) Isotr0py 2024-11-21 22:20:08 +08:00
  • 8a93a598d9 fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len (#10524) Wang, Yi 2024-11-21 19:15:36 +08:00
  • 1cfde82ffd [Model] Add Support for Multimodal Granite Models (#10291) Alex Brooks 2024-11-21 03:46:20 -07:00
  • f0e0238016 [Doc] fix a small typo in docstring of llama_tool_parser (#10513) Zhong Qishuai 2024-11-21 17:05:23 +08:00
  • aaddce5d26 [platforms] improve error message for unspecified platforms (#10520) youkaichao 2024-11-20 23:07:56 -08:00
  • 3430857b64 [Misc] Increase default video fetch timeout (#10495) Cyrus Leung 2024-11-21 15:06:42 +08:00
  • 8b0fe06c89 [torch.compile] Inductor code caching fix (#10273) Luka Govedič 2024-11-21 00:44:57 -05:00
  • 9d827170a3 [Platforms] Add device_type in Platform (#10508) Mengqing Cao 2024-11-21 12:44:20 +08:00
  • 6c1208d083 [Core] Add Sliding Window Support with Flashinfer (#10462) Pavani Majety 2024-11-20 19:56:47 -08:00
  • 388ee3de66 [torch.compile] limit inductor threads and lazy import quant (#10482) youkaichao 2024-11-20 18:36:33 -08:00
  • 2f77b6cfec [TPU] Implement prefix caching for TPUs (#10307) Woosuk Kwon 2024-11-20 13:54:15 -08:00
  • c68f7ede6a [Bugfix]: allow extra fields in requests to openai compatible server (#10463) Guillaume Calmettes 2024-11-20 22:42:21 +01:00
  • 0cd3d9717e [7/N] torch.compile, reduce compilation time (#10460) youkaichao 2024-11-20 11:20:38 -08:00
  • 5f1d6af2b6 [perf bench] H200 development (#9768) Simon Mo 2024-11-20 11:06:56 -08:00