Commit Graph

  • 3c6325f0fc [core][distributed] custom allreduce when pp size > 1 (#6117) youkaichao 2024-07-03 14:41:32 -07:00
  • 47f0954af0 [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin (#5975) Michael Goin 2024-07-03 13:38:00 -04:00
  • 7cd2ebb025 [Bugfix] Fix compute_logits in Jamba (#6093) Roger Wang 2024-07-03 00:32:35 -07:00
  • f1c78138aa [Doc] Fix Mock Import (#6094) Roger Wang 2024-07-03 00:13:56 -07:00
  • 3a86b54fb0 [VLM][Frontend] Proper Image Prompt Formatting from OpenAI API (#6091) Roger Wang 2024-07-02 23:41:23 -07:00
  • f666207161 [misc][distributed] error on invalid state (#6092) youkaichao 2024-07-02 23:37:29 -07:00
  • d830656a97 [BugFix] Avoid unnecessary Ray import warnings (#6079) Nick Hill 2024-07-02 23:09:40 -07:00
  • d18bab3587 [CI] Fix base url doesn't strip "/" (#6087) SangBin Cho 2024-07-03 13:31:25 +09:00
  • 9831aec49f [Core] Dynamic image size support for VLMs (#5276) Cyrus Leung 2024-07-03 11:34:00 +08:00
  • 482045ee77 [hardware][misc] introduce platform abstraction (#6080) youkaichao 2024-07-02 20:12:22 -07:00
  • 9d6a8daa87 [Model] Jamba support (#4115) Mor Zusman 2024-07-03 02:11:29 +03:00
  • ee93f4f92a [CORE] Quantized lm-head Framework (#4442) Qubitium-ModelCloud 2024-07-03 06:25:17 +08:00
  • 7c008c51a9 [ Misc ] Refactor MoE to isolate Fp8 From Mixtral (#5970) Robert Shaw 2024-07-02 17:54:35 -04:00
  • 4d26d806e1 Update conftest.py (#6076) Robert Shaw 2024-07-02 16:14:22 -04:00
  • c5832d2ae9 [Core] Pipeline Parallel Support (#4412) Murali Andoorveedu 2024-07-02 10:58:08 -07:00
  • 15aba081f3 [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050) Sirej Dua 2024-07-02 07:20:29 -07:00
  • 31354e563f [Doc] Reinstate doc dependencies (#6061) Cyrus Leung 2024-07-02 18:53:16 +08:00
  • 98d6682cd1 [VLM] Remove image_input_type from VLM config (#5852) xwjiang2010 2024-07-02 00:57:09 -07:00
  • 2c37540aa6 [Frontend] Add template related params to request (#5709) danieljannai21 2024-07-02 09:01:57 +03:00
  • 3476ed0809 [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602) Alexander Matveev 2024-07-01 23:10:37 -04:00
  • 54600709b6 [Model] Changes to MLPSpeculator to support tie_weights and input_scale (#5965) Thomas Parnell 2024-07-02 01:40:02 +02:00
  • e373853e12 [Frontend] Relax api url assertion for openai benchmarking (#6046) James Whedbee 2024-07-01 18:39:10 -05:00
  • c87ebc3ef9 [BugFix] Ensure worker model loop is always stopped at the right time (#5987) Nick Hill 2024-07-01 16:17:58 -07:00
  • c4059ea54f [Bugfix] Add explicit end_forward calls to flashinfer (#6044) Antoni Baum 2024-07-01 16:08:58 -07:00
  • 8e0817c262 [Bugfix][Doc] Fix Doc Formatting (#6048) Roger Wang 2024-07-01 15:09:11 -07:00
  • 83bdcb6ac3 add FAQ doc under 'serving' (#5946) ning.zhang 2024-07-01 14:11:36 -07:00
  • 12a59959ed [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029) Avshalom Manevich 2024-07-02 00:08:29 +03:00
  • dec6fc6f3b [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool (#6039) Antoni Baum 2024-07-01 13:12:40 -07:00
  • 8893130b63 [doc][misc] further lower visibility of simple api server (#6041) youkaichao 2024-07-01 10:50:56 -07:00
  • bb60326836 [Misc] update benchmark backend for scalellm (#6018) zhyncs 2024-07-02 01:20:33 +08:00
  • 4050d646e5 [doc][misc] remove deprecated api server in doc (#6037) youkaichao 2024-07-01 09:52:43 -07:00
  • d76084c12f [ CI ] Re-enable Large Model LM Eval (#6031) Robert Shaw 2024-07-01 12:40:45 -04:00
  • 80ca1e6a3a [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) sroy745 2024-07-01 00:33:05 -07:00
  • 614aa51203 [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) youkaichao 2024-06-30 20:07:34 -07:00
  • af9ad46fca [ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) (#5940) Robert Shaw 2024-06-30 19:06:27 -04:00
  • 7836fdcc11 [Misc] Fix get_min_capability (#5971) Dipika Sikka 2024-06-30 16:15:16 -04:00
  • deacb7ec44 [ CI ] Temporarily Disable Large LM-Eval Tests (#6005) Robert Shaw 2024-06-30 14:56:56 -04:00
  • f5e73c9f1b [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909) SangBin Cho 2024-07-01 02:11:15 +09:00
  • c6c240aa0a [Frontend]: Support base64 embedding (#5935) llmpros 2024-06-30 08:53:00 -07:00
  • 2be6955a3f [ci][distributed] fix device count call youkaichao 2024-06-30 01:06:13 -07:00
  • 9d47f64eb6 [CI/Build] [3/3] Reorganize entrypoints tests (#5966) Cyrus Leung 2024-06-30 12:58:49 +08:00
  • cff6a1fec1 [CI/Build] Reuse code for checking output consistency (#5988) Cyrus Leung 2024-06-30 11:44:25 +08:00
  • bcc6a09b63 [CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989) Roger Wang 2024-06-29 18:18:31 -07:00
  • 9def10664e [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (#5949) Matt Wong 2024-06-29 14:47:58 -05:00
  • 75aa1442db [ CI/Build ] LM Eval Harness Based CI Testing (#5838) Robert Shaw 2024-06-29 13:04:30 -04:00
  • 99397da534 [CI/Build] Add TP test for vision models (#5892) Cyrus Leung 2024-06-29 23:45:54 +08:00
  • 8dbfcd35bf [ CI/Build ] Added E2E Test For Compressed Tensors (#5839) Robert Shaw 2024-06-29 09:12:58 -04:00
  • f7dac83d95 [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k (#5939) Cody Yu 2024-06-29 06:04:20 -07:00
  • 7c01f70641 [Core] Optimize SequenceStatus.is_finished by switching to IntEnum (#5974) Antoni Baum 2024-06-29 05:47:53 -07:00
  • 51e971d39e [Bugfix] Support eos_token_id from config.json (#5954) Cyrus Leung 2024-06-29 19:19:02 +08:00
  • 329df38f1a [Misc] Update Phi-3-Vision Example (#5981) Roger Wang 2024-06-28 23:34:29 -07:00
  • 580353da93 [Bugfix] Fix precisions in Gemma 1 (#5913) Woosuk Kwon 2024-06-28 20:10:21 -07:00
  • ba4994443a [Kernel] Add punica dimensions for Granite 3b and 8b (#5930) Joe Runde 2024-06-28 20:48:25 -06:00
  • 906a19cdb0 [Misc] Extend vLLM Metrics logging API (#5925) William Lin 2024-06-28 19:36:06 -07:00
  • c4bca740e8 [Bugfix] fix missing last itl in openai completions benchmark (#5926) mcalman 2024-06-28 22:34:42 -04:00
  • 7f83f40dee [Bugfix][TPU] Fix pad slot id (#5977) Woosuk Kwon 2024-06-28 18:55:17 -07:00
  • 54814fd85b [Bugfix][TPU] Fix TPU sampler output (#5978) Woosuk Kwon 2024-06-28 18:14:16 -07:00
  • 7041de4384 [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628) Lily Liu 2024-06-28 15:28:49 -07:00
  • 6a62cb82cc [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (#5963) Robert Shaw 2024-06-28 17:46:30 -04:00
  • 5d2a1a9cf0 Unmark more files as executable (#5962) Tyler Michael Smith 2024-06-28 17:34:56 -04:00
  • 4bf35ed9ae [Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled (#5936) Michael Goin 2024-06-28 17:12:40 -04:00
  • be0b3af9e0 Support Deepseek-V2 (#4650) wangding zeng 2024-06-29 04:24:57 +08:00
  • 2cd402e169 [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921) Robert Shaw 2024-06-28 14:43:49 -04:00
  • b185230744 [ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) (#5928) Robert Shaw 2024-06-28 13:49:57 -04:00
  • 6a2d659d28 [Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) Tyler Michael Smith 2024-06-28 13:10:34 -04:00
  • b2c620230a [Spec Decode] Introduce DraftModelRunner (#5799) Cody Yu 2024-06-28 09:17:51 -07:00
  • b90d8cd832 [Distributed] Make it clear that % should not be in tensor dict keys. (#5927) xwjiang2010 2024-06-28 08:20:22 -07:00
  • 3b752a6555 [CI/Build] [2/3] Reorganize entrypoints tests (#5904) Cyrus Leung 2024-06-28 22:59:18 +08:00
  • ec1ad0046c [Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high (#5894) Thomas Parnell 2024-06-28 16:42:17 +02:00
  • 57f09a419c [Hardware][Intel] OpenVINO vLLM backend (#5379) Ilya Lavrenov 2024-06-28 17:50:16 +04:00
  • 5932634409 Unmark fused_moe config json file as executable (#5960) Tyler Michael Smith 2024-06-28 09:36:12 -04:00
  • 5cbe8d155c [Core] Registry for processing model inputs (#5214) Cyrus Leung 2024-06-28 20:09:56 +08:00
  • 0d0e3a42ac [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (#5956) Isotr0py 2024-06-28 20:03:41 +08:00
  • 74d55c065b [VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. (#5905) xwjiang2010 2024-06-28 00:29:13 -07:00
  • f136da15e1 [Hardware][TPU] Optimize KV cache swapping (#5878) Woosuk Kwon 2024-06-27 21:12:13 -07:00
  • c3dde367f1 [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932) Divakar Verma 2024-06-27 15:41:08 -05:00
  • 64e8d2a783 [core][misc] remove logical block (#5882) youkaichao 2024-06-27 13:34:55 -07:00
  • 79c92c7c8a [Model] Add Gemma 2 (#5908) Woosuk Kwon 2024-06-27 13:33:56 -07:00
  • 736ed38849 [CI/Build] Fix Args for _get_logits_warper in Sampler Test (#5922) Roger Wang 2024-06-27 11:43:04 -07:00
  • 365791ff81 [BugFix] Fix min_tokens behaviour for multiple eos tokens (#5849) Nick Hill 2024-06-27 11:31:11 -07:00
  • 691e29ecf3 [BugFix] Fix MLPSpeculator handling of num_speculative_tokens (#5876) Nick Hill 2024-06-27 10:59:33 -07:00
  • 3fd02bda51 [doc][misc] add note for Kubernetes users (#5916) youkaichao 2024-06-27 10:07:07 -07:00
  • 98cf2ed678 [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896) Cyrus Leung 2024-06-28 00:08:10 +08:00
  • e9d32d077d [CI/Build] [1/3] Reorganize entrypoints tests (#5526) Cyrus Leung 2024-06-27 20:43:17 +08:00
  • 2061f0b8a7 [Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888) Roger Wang 2024-06-27 01:29:24 -07:00
  • 96354d6a29 [Model] Add base class for LoRA-supported models (#5018) Cyrus Leung 2024-06-27 16:03:04 +08:00
  • d12af207d2 [VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly (#5880) xwjiang2010 2024-06-27 00:15:24 -07:00
  • 6eabc6cb0e [Doc] Add note about context length in Phi-3-Vision example (#5887) Cyrus Leung 2024-06-27 14:20:01 +08:00
  • 2110557dab [BugFix] Fix cuda graph for MLPSpeculator (#5875) Nick Hill 2024-06-26 21:12:10 -07:00
  • b9e84259e9 [Misc] Add example for LLaVA-NeXT (#5879) Roger Wang 2024-06-26 17:57:16 -07:00
  • 294104c3f9 [doc] update usage of env var to avoid conflict (#5873) youkaichao 2024-06-26 14:57:12 -07:00
  • 38a1674abb Support CPU inference with VSX PowerPC ISA (#5652) Chip Kerchner 2024-06-26 17:53:04 -04:00
  • f5c8628fdc [Bugfix][TPU] Fix CPU cache allocation (#5869) Woosuk Kwon 2024-06-26 13:42:40 -07:00
  • cbc53b6b8d [Hardware][TPU] Support parallel sampling & Swapping (#5855) Woosuk Kwon 2024-06-26 11:07:49 -07:00
  • c54269d967 [Frontend] Add tokenize/detokenize endpoints (#5054) sasha0552 2024-06-26 16:54:22 +00:00
  • 5bfd1bbc98 [Kernel] Adding bias epilogue support for cutlass_scaled_mm (#5560) Luka Govedič 2024-06-26 11:16:00 -04:00
  • 6984c02a27 [CI/Build] Refactor image test assets (#5821) Cyrus Leung 2024-06-26 16:02:34 +08:00
  • 3439c5a8e3 [Bugfix][TPU] Fix KV cache size calculation (#5860) Woosuk Kwon 2024-06-26 00:58:23 -07:00
  • 6806998bf9 [Bugfix] Fix embedding to support 2D inputs (#5829) Woosuk Kwon 2024-06-26 00:15:22 -07:00
  • 515080ad2f [bugfix][distributed] fix shm broadcast when the queue size is full (#5801) youkaichao 2024-06-25 21:56:02 -07:00