Commit Graph

  • 388596c914 [Misc][Utils] allow get_open_port to be called for multiple times (#5333) youkaichao 2024-06-06 22:15:11 -07:00
  • baa15a9ec3 [Feature][Frontend]: Add support for stream_options in ChatCompletionRequest (#5135) Itay Etelis 2024-06-07 06:29:24 +03:00
  • 15063741e3 [Misc] Missing error message for custom ops import (#5282) Jie Fu (傅杰) 2024-06-07 11:17:21 +08:00
  • ccdc490dda [Core] Change LoRA embedding sharding to support loading methods (#5038) Antoni Baum 2024-06-06 19:07:57 -07:00
  • a31cab7556 [Core] Avoid copying prompt/output tokens if no penalties are used (#5289) Antoni Baum 2024-06-06 18:12:00 -07:00
  • 828da0d44e [Frontend] enable passing multiple LoRA adapters at once to generate() (#5300) Matthew Goldey 2024-06-06 16:48:13 -04:00
  • abe855d637 [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294) Philipp Moritz 2024-06-06 09:29:29 -07:00
  • 4efff036f0 Bugfix: fix broken of download models from modelscope (#5233) liuyhwangyh 2024-06-07 00:28:10 +08:00
  • 89c920785f [CI/Build] Update vision tests (#5307) Cyrus Leung 2024-06-06 18:17:18 +08:00
  • 7b0a0dfb22 [Frontend][Core] Update Outlines Integration from FSM to Guide (#4109) Breno Faria 2024-06-06 01:49:12 +02:00
  • 3a6ae1d33c [CI] Disable flash_attn backend for spec decode (#5286) Simon Mo 2024-06-05 17:49:27 -05:00
  • 8f1729b829 [Docs] Add Ray Summit CFP (#5295) Simon Mo 2024-06-05 17:25:18 -05:00
  • 6a7c7711a2 [Misc] Skip for logits_scale == 1.0 (#5291) Woosuk Kwon 2024-06-05 15:19:02 -07:00
  • 0f83ddd4d7 [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. (#5290) Alex Wu 2024-06-05 15:18:12 -07:00
  • 065aff6c16 [Bugfix] Make EngineArgs use named arguments for config construction (#5285) Michael Goin 2024-06-05 18:16:56 -04:00
  • 3d33e372a1 [BugFix] Fix log message about default max model length (#5284) Nick Hill 2024-06-05 14:53:16 -07:00
  • faf71bcd4b [Speculative Decoding] Add ProposerWorkerBase abstract class (#5252) Nick Hill 2024-06-05 14:53:05 -07:00
  • f270a39537 [Docs] Add Sequoia as sponsors (#5287) Simon Mo 2024-06-05 13:02:56 -05:00
  • 51a08e7d8f [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238) Philipp Moritz 2024-06-05 10:59:14 -07:00
  • eb8fcd2666 [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#5207) DriverSong 2024-06-06 01:59:02 +08:00
  • 5563a4dea8 [Model] Correct Mixtral FP8 checkpoint loading (#5231) Cody Yu 2024-06-05 10:58:50 -07:00
  • ccd4f129e8 [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size (#5157) Tyler Michael Smith 2024-06-05 13:44:15 -04:00
  • 02cc3b51a7 [misc] benchmark_serving.py -- add ITL results and tweak TPOT results (#5263) Tyler Michael Smith 2024-06-05 13:17:51 -04:00
  • d5b1eb081e [CI] Add nightly benchmarks (#5260) Simon Mo 2024-06-05 11:42:08 -05:00
  • f0a500545f [Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) (#5278) tomeras91 2024-06-05 19:32:58 +03:00
  • c65146e75e [Misc] Fix docstring of get_attn_backend (#5271) Woosuk Kwon 2024-06-05 09:18:59 -07:00
  • 41ca62cf03 [Misc] Add CustomOp interface for device portability (#5255) Woosuk Kwon 2024-06-05 09:18:19 -07:00
  • 974fc9b845 [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226) zifeitong 2024-06-04 19:37:28 -07:00
  • fee4dcc33a [Misc] update collect env (#5261) youkaichao 2024-06-04 15:29:09 -07:00
  • 650a4cc55e [Misc] Add transformers version to collect_env.py (#5259) Michael Goin 2024-06-04 15:52:28 -04:00
  • 9ca62d8668 [CI] mark AMD test as softfail to prevent blockage (#5256) Simon Mo 2024-06-04 13:34:53 -05:00
  • 45c35f0d58 [CI/Build] Reducing CPU CI execution time (#5241) Li, Jiang 2024-06-05 01:26:40 +08:00
  • 9ba093b4f4 [CI/Build] Simplify model loading for HfRunner (#5251) Cyrus Leung 2024-06-05 01:09:19 +08:00
  • 27208be66e [Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242) Woosuk Kwon 2024-06-04 09:58:47 -07:00
  • 87d5abef75 [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend (#5249) Jie Fu (傅杰) 2024-06-05 00:57:51 +08:00
  • ec784b2526 [CI/Build] Add inputs tests (#5215) Cyrus Leung 2024-06-04 12:01:46 +08:00
  • a58f24e590 [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (#5229) zifeitong 2024-06-03 20:55:50 -07:00
  • f42a006b15 [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210) afeldman-nm 2024-06-03 23:32:57 -04:00
  • 3a434b07ed [Kernel] Enhance MoE benchmarking & tuning script (#4921) Woosuk Kwon 2024-06-03 20:06:59 -07:00
  • bd0e7802e0 [Bugfix] Add warmup for prefix caching example (#5235) Zhuohan Li 2024-06-03 19:36:41 -07:00
  • 06b2550cbb [Bugfix] Support prompt_logprobs==0 (#5217) Toshiki Kataoka 2024-06-04 09:59:30 +09:00
  • f775a07e30 [FRONTEND] OpenAI tools support named functions (#5032) Breno Faria 2024-06-04 01:25:29 +02:00
  • 4f0d17c05c New CI template on AWS stack (#5110) Kevin H. Luu 2024-06-03 16:16:43 -07:00
  • 10c38e3e46 [Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) Kaiyang Chen 2024-06-04 04:37:11 +08:00
  • cafb8e06c5 [CI/BUILD] enable intel queue for longer CPU tests (#4113) Yuan 2024-06-04 01:39:50 +08:00
  • cbb2f59cc8 [Kernel] Pass a device pointer into the quantize kernel for the scales (#5159) Tyler Michael Smith 2024-06-03 12:52:30 -04:00
  • 0ab278ca31 [Core] Remove unnecessary copies in flash attn backend (#5138) Antoni Baum 2024-06-03 09:39:31 -07:00
  • 7a64d24aad [Core] Support image processor (#4197) Cyrus Leung 2024-06-03 13:56:41 +08:00
  • dfbe60dc62 [Misc] Simplify code and fix type annotations in conftest.py (#5118) Cyrus Leung 2024-06-03 07:05:50 +08:00
  • a66cf40b20 [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#4927) Divakar Verma 2024-06-02 16:13:26 -05:00
  • f790ad3c50 [Frontend][OpenAI] Support for returning max_model_len on /v1/models response (#4643) Avinash Raj 2024-06-02 13:36:13 +05:30
  • ed59a7ed23 Update test_ignore_eos (#4898) Simon Mo 2024-06-01 21:21:53 -05:00
  • 044793d8df [BugFix] Prevent LLM.encode for non-generation Models (#5184) Robert Shaw 2024-06-01 19:35:41 -04:00
  • c2d6d2f960 [Bugfix]: Fix issues related to prefix caching example (#5177) (#5180) Daniil Arapov 2024-06-02 01:53:52 +03:00
  • 8279078e21 [Bugfix] Remove deprecated @abstractproperty (#5174) Zhuohan Li 2024-06-01 15:40:25 -07:00
  • b9c0605a8e [Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776) chenqianfzh 2024-06-01 13:51:10 -07:00
  • 37464a0f74 [Bugfix] Fix call to init_logger in openai server (#4765) Nadav Shmayovits 2024-06-01 20:18:50 +03:00
  • c354072828 [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (#5151) Ye Cao 2024-06-02 01:11:22 +08:00
  • f081c3ce4b [Kernel] Update Cutlass fp8 configs (#5144) Varun Sundar Rabindranath 2024-06-01 14:16:07 +05:30
  • 260d119e86 [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137) Tyler Michael Smith 2024-06-01 02:45:32 -04:00
  • a360ff80bb [CI/Build] CMakeLists: build all extensions' cmake targets at the same time (#5034) Daniele 2024-06-01 06:06:45 +02:00
  • 1197e02141 [Build] Guard against older CUDA versions when building CUTLASS 3.x kernels (#5168) v0.4.3 Tyler Michael Smith 2024-05-31 20:21:38 -04:00
  • 657579113f [Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support (#5171) Nick Hill 2024-05-31 17:20:19 -07:00
  • e9899fb7a4 [Model] Enable FP8 QKV in MoE and refine kernel tuning script (#5039) Cody Yu 2024-05-31 14:29:19 -07:00
  • a377f0bd5e [Misc]: optimize eager mode host time (#4196) functionxu123 2024-05-31 13:14:50 +08:00
  • e9d3aa04f6 Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" (#5149) Simon Mo 2024-05-31 00:00:26 -05:00
  • a22dea54d3 [Model] Support MAP-NEO model (#5081) SnowDist 2024-05-31 10:24:41 +08:00
  • 533c217792 Fix cutlass sm_90a vesrion in CMakeList simon-mo 2024-05-31 02:13:01 +00:00
  • 6d21fa1cad [Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) (#5136) Alexander Matveev 2024-05-30 22:02:11 -04:00
  • b35be5403f [Bugfix] Avoid Warnings in SparseML Activation Quantization (#5120) Robert Shaw 2024-05-30 17:04:37 -07:00
  • 45a1a69b98 [Build] Disable sm_90a in cu11 (#5141) Simon Mo 2024-05-30 16:37:16 -05:00
  • 87a658c812 Bump version to v0.4.3 (#5046) Simon Mo 2024-05-30 13:13:46 -05:00
  • 429d89720e add doc about serving option on dstack (#3074) Chansung Park 2024-05-31 02:11:07 +09:00
  • a9bcc7afb2 [Doc] Use intersphinx and update entrypoints docs (#5125) Cyrus Leung 2024-05-31 00:59:23 +08:00
  • d79d9eaaff [Misc] remove duplicate definition of seq_lens_tensor in model_runner.py (#5129) Hyunsung Lee 2024-05-30 22:56:19 +09:00
  • f758505c73 [CI/Build] increase wheel size limit to 200 MB (#5130) youkaichao 2024-05-30 06:29:48 -07:00
  • d910816c73 [Bugfix] Automatically Detect SparseML models (#5119) Robert Shaw 2024-05-30 05:58:37 -07:00
  • 87d41c849d [BUGFIX] [FRONTEND] Correct chat logprobs (#5029) Breno Faria 2024-05-30 11:52:14 +02:00
  • e07aff9e52 [CI/Build] Docker cleanup functionality for amd servers (#5112) omkar kakarparthi 2024-05-29 22:27:39 -05:00
  • 5bf185a1c4 [Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter (#5108) Alexander Matveev 2024-05-29 20:30:18 -04:00
  • 4fbcb0f27e [Doc][Build] update after removing vllm-nccl (#5103) youkaichao 2024-05-29 16:51:18 -07:00
  • 7c3604fb68 [Bugfix] logprobs is not compatible with the OpenAI spec #4795 (#5031) Itay Etelis 2024-05-30 02:13:22 +03:00
  • b1c255630d [Core] Avoid the need to pass None values to Sequence.inputs (#5099) Cyrus Leung 2024-05-30 07:05:01 +08:00
  • eb6c50cdc2 [Bugfix][CI/Build] Fix codespell failing to skip files in git diff (#5097) Cyrus Leung 2024-05-30 07:02:54 +08:00
  • eecd864388 [Bugfix][CI/Build] Fix test and improve code for merge_async_iterators (#5096) Cyrus Leung 2024-05-30 07:02:25 +08:00
  • ae495c74ea [Doc]Replace deprecated flag in readme (#4526) Ronen Schaffer 2024-05-30 01:26:33 +03:00
  • 4238bc82f2 [Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837) afeldman-nm 2024-05-29 12:09:13 -04:00
  • 594392d27a [Core][Distributed] improve p2p access check (#4992) youkaichao 2024-05-29 04:29:07 -07:00
  • 18c1f16d86 [Bugfix] Fix arguments passed to Sequence in stop checker test (#5092) Cyrus Leung 2024-05-29 15:16:41 +08:00
  • 5bd3c65072 [Core][Optimization] remove vllm-nccl (#5091) youkaichao 2024-05-28 22:13:52 -07:00
  • 616e600e0b [Misc] add gpu_memory_utilization arg (#5079) Marut Pandya 2024-05-28 17:16:18 -07:00
  • dfba529b40 [Bugfix] Remove the last EOS token unless explicitly specified (#5077) Junichi Sato 2024-05-29 09:15:35 +09:00
  • 5ae5ed1e60 [Core] Consolidate prompt arguments to LLM engines (#4328) Cyrus Leung 2024-05-29 04:29:31 +08:00
  • 290f4ada2b [Docs] Add Dropbox as sponsors (#5089) Simon Mo 2024-05-28 12:29:09 -05:00
  • dd8de11f0a [Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951) Divakar Verma 2024-05-28 11:03:23 -05:00
  • 9ba415588a [BugFix] Fix Embedding Models with TP>1 (#5075) Robert Shaw 2024-05-28 08:32:42 -07:00
  • d4f3985907 [Core] Sliding window for block manager v2 (#4545) Michał Moskal 2024-05-27 19:07:07 -07:00
  • 890aa93d27 [Model] Add support for falcon-11B (#5069) Isotr0py 2024-05-28 07:41:43 +08:00
  • fbdb7b3ee2 [Core] Allow AQLM on Pascal (#5058) sasha0552 2024-05-27 22:26:14 +00:00
  • 1102bef219 [Bugfix / Core] Prefix Caching Guards (merged with main) (#4846) Zhuohan Li 2024-05-27 15:18:17 -07:00