Commit Graph

  • 8c6de96ea1 [Model] Explicit interface for vLLM models and support OOT embedding models (#9108) Cyrus Leung 2024-10-07 14:10:35 +08:00
  • 18b296fdb2 [core] remove beam search from the core (#9105) youkaichao 2024-10-06 22:47:04 -07:00
  • c8f26bb636 [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (#9103) sroy745 2024-10-06 20:52:42 -07:00
  • 487678d046 [Bugfix][Hardware][CPU] Fix CPU model input for decode (#9044) Isotr0py 2024-10-07 10:14:27 +08:00
  • cb3b2b9ba4 [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (#9038) Varun Sundar Rabindranath 2024-10-06 15:48:11 -04:00
  • fdf59d30ea [Bugfix] fix tool_parser error handling when serve a model not support it (#8709) Yanyi Liu 2024-10-06 20:51:08 +08:00
  • b22b798471 [Model] PP support for embedding models and update docs (#9090) Cyrus Leung 2024-10-06 16:35:27 +08:00
  • f22619fe96 [Misc] Remove user-facing error for removed VLM args (#9104) Cyrus Leung 2024-10-06 16:33:52 +08:00
  • 168cab6bbf [Frontend] API support for beam search (#9087) Brendan Wong 2024-10-05 23:39:03 -07:00
  • 23fea8714a [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model (#9101) TJian 2024-10-05 22:00:04 -07:00
  • f4dd830e09 [core] use forward context for flash infer (#9097) youkaichao 2024-10-05 19:37:31 -07:00
  • 5df1834895 [Bugfix] Fix order of arguments matters in config.yaml (#8960) Andy Dai 2024-10-05 10:35:11 -07:00
  • cfadb9c687 [Bugfix] Deprecate registration of custom configs to huggingface (#9083) Chen Zhang 2024-10-05 06:56:40 -07:00
  • 15986f598c [Model] Support Gemma2 embedding model (#9004) Xin Yang 2024-10-04 23:57:05 -07:00
  • 53b3a33027 [Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (#8979) hhzhang16 2024-10-04 22:05:37 -07:00
  • dac914b0d6 [Bugfix] use blockmanagerv1 for encoder-decoder (#9084) Chen Zhang 2024-10-04 21:45:38 -07:00
  • a95354a36e [Doc] Update README.md with Ray summit slides (#9088) Zhuohan Li 2024-10-04 19:54:45 -07:00
  • 663874e048 [torch.compile] improve allreduce registration (#9061) youkaichao 2024-10-04 16:43:50 -07:00
  • cc90419e89 [Hardware][Neuron] Add on-device sampling support for Neuron (#8746) Chongming Ni 2024-10-04 16:42:20 -07:00
  • 27302dd584 [Misc] Fix CI lint (#9085) Cody Yu 2024-10-04 16:07:54 -07:00
  • 0cc566ca8f [Misc] Add random seed for prefix cache benchmark (#9081) Andy Dai 2024-10-04 14:58:57 -07:00
  • 05c531be47 [Misc] Improved prefix cache example (#9077) Andy Dai 2024-10-04 14:38:42 -07:00
  • fbb74420e7 [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (#7412) Kuntai Du 2024-10-04 14:01:44 -07:00
  • 05d686432f [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973) ElizaWszola 2024-10-04 20:34:44 +02:00
  • 0dcc8cbe5a Adds truncate_prompt_tokens param for embeddings creation (#8999) Flávia Béo 2024-10-04 15:31:40 -03:00
  • 26aa325f4f [Core][VLM] Test registration for OOT multimodal models (#8717) Roger Wang 2024-10-04 10:38:25 -07:00
  • e5dc713c23 [Hardware][PowerPC] Make oneDNN dependency optional for Power (#9039) Varad Ahirwadkar 2024-10-04 22:54:42 +05:30
  • 36eecfbddb Remove AMD Ray Summit Banner (#9075) Simon Mo 2024-10-04 10:17:16 -07:00
  • 9ade8bbc8d [Model] add a bunch of supported lora modules for mixtral (#9008) Prashant Gupta 2024-10-04 09:24:40 -07:00
  • 22482e495e [Bugfix] Flash attention arches not getting set properly (#9062) Lucas Wilkinson 2024-10-04 11:43:15 -04:00
  • 3d826d2c52 [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL (#9071) whyiug 2024-10-04 22:34:58 +08:00
  • 0e36fd4909 [Misc] Move registry to its own file (#9064) Cyrus Leung 2024-10-04 18:01:37 +08:00
  • 0f6d7a9a34 [Models] Add remaining model PP support (#7168) Murali Andoorveedu 2024-10-03 19:56:58 -07:00
  • 303d44790a [Misc] Enable multi-step output streaming by default (#9047) Michael Goin 2024-10-03 22:55:42 -04:00
  • aeb37c2a72 [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845) Lucas Wilkinson 2024-10-03 22:55:25 -04:00
  • 3dbb215b38 [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (#8405) 代君 2024-10-04 10:36:39 +08:00
  • 2838d6b38e [Bugfix] Weight loading fix for OPT model (#9042) Domen Vreš 2024-10-04 01:53:29 +02:00
  • 91add85ec4 Fix failing spec decode test (#9054) sroy745 2024-10-03 16:07:29 -07:00
  • 9aaf14c62e [misc] add forward context for attention (#9029) youkaichao 2024-10-03 12:09:42 -07:00
  • 63e39937f9 [Frontend] [Neuron] Parse literals out of override-neuron-config (#8959) xendo 2024-10-03 20:02:07 +02:00
  • f5d72b2fc6 [Core] Make BlockSpaceManagerV2 the default BlockManager to use. (#8678) sroy745 2024-10-03 09:44:21 -07:00
  • 83caf35e08 [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser (#9020) Guillaume Calmettes 2024-10-03 10:44:52 +02:00
  • 01843c89b8 [Misc] log when using default MoE config (#8971) Divakar Verma 2024-10-02 23:31:07 -05:00
  • 19a4dd0990 [Bugfix] example template should not add parallel_tool_prompt if tools is none (#9007) Travis Johnson 2024-10-02 21:04:17 -06:00
  • 18c2e30c57 [Doc] Update Granite model docs (#9025) Nick Hill 2024-10-03 03:42:24 +01:00
  • 19f0d25796 [Model] Adding Granite MoE. (#8206) Shawn Tan 2024-10-02 21:33:57 -04:00
  • f58d4fccc9 [OpenVINO] Enable GPU support for OpenVINO vLLM backend (#8192) Sergey Shlyapnikov 2024-10-03 01:50:01 +04:00
  • afb050b29d [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645) Varun Sundar Rabindranath 2024-10-02 15:44:39 -04:00
  • 7f60520deb [Misc] Update Default Image Mapper Error Log (#8977) Alex Brooks 2024-10-02 05:44:38 -06:00
  • 563649aafe [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (#8804) afeldman-nm 2024-10-02 03:52:20 -04:00
  • 1570203864 [Spec Decode] (1/2) Remove batch expansion (#8839) Lily Liu 2024-10-01 16:04:42 -07:00
  • 22f5851b80 Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (#8997) vlsav 2024-10-01 21:07:06 +03:00
  • 4f341bd4bf [Doc] Update list of supported models (#8987) Cyrus Leung 2024-10-02 00:35:39 +08:00
  • 35bd215168 [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API (#8965) Sebastian Schoennenbeck 2024-10-01 11:58:06 +02:00
  • 1fe0a4264a [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders (#8991) Alex Brooks 2024-10-01 03:52:44 -06:00
  • bc4eb65b54 [Bugfix] Fix Fuyu tensor parallel inference (#8986) Isotr0py 2024-10-01 17:51:41 +08:00
  • 82f3937e59 [Misc] add process_weights_after_loading for DummyLoader (#8969) Divakar Verma 2024-09-30 22:46:41 -05:00
  • 7da2487591 [torch.compile] fix tensor alias (#8982) youkaichao 2024-09-30 20:40:48 -07:00
  • aaccca2b4d [CI/Build] Fix machete generated kernel files ordering (#8976) Kevin H. Luu 2024-09-30 20:33:12 -07:00
  • 062c89e7c9 [Frontend][Core] Move guided decoding params into sampling params (#8252) Joe Runde 2024-09-30 19:34:25 -06:00
  • bce324487a [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. (#8975) Lily Liu 2024-09-30 17:51:40 -07:00
  • 1425a1bcf9 [ci] Add CODEOWNERS for test directories (#8795) Kevin H. Luu 2024-09-30 17:47:08 -07:00
  • 1cabfcefb6 [Misc] Adjust max_position_embeddings for LoRA compatibility (#8957) Jee Jee Li 2024-09-30 20:57:39 +08:00
  • be76e5aabf [Core] Make scheduling policy settable via EngineArgs (#8956) Sebastian Schoennenbeck 2024-09-30 14:28:44 +02:00
  • 2ae25f79cf [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (#8946) Isotr0py 2024-09-30 13:01:20 +08:00
  • 8e60afa15e [Model][LoRA]LoRA support added for MiniCPMV2.6 (#8943) Jee Jee Li 2024-09-30 12:31:55 +08:00
  • b6d7392579 [Misc][CI/Build] Include cv2 via mistral_common[opencv] (#8951) Roger Wang 2024-09-29 21:28:26 -07:00
  • e01ab595d8 [Model] support input embeddings for qwen2vl (#8856) whyiug 2024-09-30 11:16:10 +08:00
  • f13a07b1f8 [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533) Mor Zusman 2024-09-30 00:35:58 +03:00
  • 6c9ba48fde [Frontend] Added support for HF's new continue_final_message parameter (#8942) danieljannai21 2024-09-29 20:59:47 +03:00
  • 1fb9c1b0bf [Misc] Fix typo in BlockSpaceManagerV1 (#8944) juncheoll 2024-09-30 00:05:54 +09:00
  • 31f46a0d35 [BugFix] Fix seeded random sampling with encoder-decoder models (#8870) Nick Hill 2024-09-29 10:43:14 +01:00
  • 3d49776bbb [Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199) Jee Jee Li 2024-09-29 14:59:45 +08:00
  • bc2ef1f77c [Model] Support Qwen2.5-Math-RM-72B (#8896) Zilin Zhu 2024-09-29 12:19:39 +08:00
  • 2e7fe7e79f [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (#8930) Tyler Michael Smith 2024-09-28 23:13:01 -04:00
  • 26a68d5d7e [CI/Build] Add test decorator for minimum GPU memory (#8925) Cyrus Leung 2024-09-29 10:50:51 +08:00
  • d081da0064 [Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741) ElizaWszola 2024-09-29 03:19:40 +02:00
  • 5bf8789b2a [Bugfix] Block manager v2 with preemption and lookahead slots (#8824) sroy745 2024-09-28 18:17:45 -07:00
  • d1537039ce [Core] Improve choice of Python multiprocessing method (#8823) Russell Bryant 2024-09-28 21:17:07 -04:00
  • cc276443b5 [doc] organize installation doc and expose per-commit docker (#8931) youkaichao 2024-09-28 17:48:41 -07:00
  • e585b583a9 [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891) Chen Zhang 2024-09-28 11:51:22 -07:00
  • 090e945e36 [Frontend] Make beam search emulator temperature modifiable (#8928) Edouard B. 2024-09-28 20:30:21 +02:00
  • e1a3f5e831 [CI/Build] Update models tests & examples (#8874) Cyrus Leung 2024-09-29 00:54:35 +08:00
  • 19d02ff938 [Bugfix] Fix PP for Multi-Step (#8887) Varun Sundar Rabindranath 2024-09-28 11:52:46 -04:00
  • 39d3f8d94f [Bugfix] Fix code for downloading models from modelscope (#8443) tastelikefeet 2024-09-28 23:24:12 +08:00
  • b0298aa8cc [Misc] Remove vLLM patch of BaichuanTokenizer (#8921) Cyrus Leung 2024-09-28 16:11:25 +08:00
  • 260024a374 [Bugfix][Intel] Fix XPU Dockerfile Build (#7824) Tyler Titsworth 2024-09-27 23:45:50 -07:00
  • d86f6b2afb [misc] fix wheel name (#8919) youkaichao 2024-09-27 22:10:44 -07:00
  • bd429f2b75 [Core] Priority-based scheduling in async engine (#8850) Sebastian Schoennenbeck 2024-09-28 00:07:10 +02:00
  • 18e60d7d13 [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (#8911) youkaichao 2024-09-27 14:27:56 -07:00
  • c2ec430ab5 [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378) Varun Sundar Rabindranath 2024-09-27 16:32:07 -04:00
  • c5d55356f9 [Bugfix] fix for deepseek w4a16 (#8906) Lucas Wilkinson 2024-09-27 15:12:34 -04:00
  • 172d1cd276 [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271) Luka Govedič 2024-09-27 14:25:10 -04:00
  • a9b15c606f [torch.compile] use empty tensor instead of None for profiling (#8875) youkaichao 2024-09-27 08:11:32 -07:00
  • 8df2dc3c88 [TPU] Update pallas.py to support trillium (#8871) Brittany 2024-09-27 01:16:55 -07:00
  • 6d792d2f31 [Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 (#8892) Isotr0py 2024-09-27 16:15:58 +08:00
  • 0e088750af [MISC] Fix invalid escape sequence '\' (#8830) Peter Pan 2024-09-27 16:13:25 +08:00
  • dc4e3df5c2 [misc] fix collect env (#8894) youkaichao 2024-09-27 00:26:38 -07:00
  • 3b00b9c26c [Core] rename PromptInputs and inputs (#8876) Cyrus Leung 2024-09-27 11:35:15 +08:00
  • 344cd2b6f4 [Feature] Add support for Llama 3.1 and 3.2 tool use (#8343) Maximilien de Bayser 2024-09-26 21:01:42 -03:00