Commit Graph

  • 789937af2e [Doc] [SpecDecode] Update MLPSpeculator documentation (#7100) Thomas Parnell 2024-08-06 01:29:43 +02:00
  • dfb1a15dcb [ci][frontend] deduplicate tests (#7101) youkaichao 2024-08-05 15:59:22 -07:00
  • 4db5176d97 bump version to v0.5.4 (#7139) v0.5.4 Simon Mo 2024-08-05 14:39:48 -07:00
  • 4cf1dc39be [Bugfix][CI/Build] Fix CUTLASS FetchContent (#7171) Tyler Michael Smith 2024-08-05 17:22:57 -04:00
  • 6e4852ce28 [CI/Build] Suppress divide-by-zero and missing return statement warnings (#7001) Tyler Michael Smith 2024-08-05 16:00:01 -04:00
  • 8571ac4672 [Kernel] Update CUTLASS to 3.5.1 (#7085) Tyler Michael Smith 2024-08-05 15:13:43 -04:00
  • 997cf78308 [Misc] Fix typo in GroupCoordinator.recv() (#7167) Rui Qiao 2024-08-05 11:10:16 -07:00
  • 57f560aa23 [BugFix] Use args.trust_remote_code (#7121) Aditya Paliwal 2024-08-05 09:26:14 -07:00
  • 003f8ee128 [BugFix] Use IP4 localhost form for zmq bind (#7163) Nick Hill 2024-08-05 08:41:03 -07:00
  • e9630458c7 [SpecDecode] Support FlashInfer in DraftModelRunner (#6926) Bongwon Jang 2024-08-06 00:05:05 +09:00
  • 82a1b1a82b [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963) Cade Daniel 2024-08-05 01:46:44 -07:00
  • c0d8f1636c [Model] SiglipVisionModel ported from transformers (#6942) Jungho Christopher Cho 2024-08-05 15:22:12 +09:00
  • cc08fc7225 [Frontend] Reapply "Factor out code for running uvicorn" (#7095) Cyrus Leung 2024-08-05 11:40:51 +08:00
  • 7b86e7c9cd [Model] Add multi-image support for minicpmv (#7122) Alphi 2024-08-05 09:23:17 +08:00
  • f80ab3521c Clean up remaining Punica C information (#7027) Jee Jee Li 2024-08-05 06:37:08 +08:00
  • 16a1cc9bb2 [misc][distributed] improve libcudart.so finding (#7127) youkaichao 2024-08-04 11:31:51 -07:00
  • b1c9aa3daa [Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator (#7105) Thomas Parnell 2024-08-04 16:13:18 +02:00
  • 179a6a36f2 [Model]Refactor MiniCPMV (#7020) Jee Jee Li 2024-08-04 16:12:41 +08:00
  • 83c644fe7e [core][misc] simply output processing with shortcut code path (#7117) youkaichao 2024-08-04 00:22:19 -07:00
  • 9fadc7b7a0 [misc] add zmq in collect env (#7119) youkaichao 2024-08-03 22:03:46 -07:00
  • 654bc5ca49 Support for guided decoding for offline LLM (#6878) Yihuan Bu 2024-08-03 23:12:09 -04:00
  • 825b044863 [Frontend] Warn if user max_model_len is greater than derived max_model_len (#7080) Jeff Fialho 2024-08-03 20:01:38 -03:00
  • 44dcb52e39 [ci][test] finalize fork_new_process_for_each_test (#7114) youkaichao 2024-08-03 10:44:53 -07:00
  • 67d745cc68 [CI] Temporarily turn off H100 performance benchmark (#7104) Kuntai Du 2024-08-02 23:52:44 -07:00
  • 99d7cabd7b [LoRA] ReplicatedLinear support LoRA (#7081) Jee Jee Li 2024-08-03 13:40:19 +08:00
  • fb2c1c86c1 [Bugfix] Fix block table for seqs that have prefix cache hits (#7018) Zach Zheng 2024-08-02 22:38:15 -07:00
  • 0c25435daa [Model] Refactor and decouple weight loading logic for InternVL2 model (#7067) Isotr0py 2024-08-03 13:36:14 +08:00
  • a0d164567c [ci][distributed] disable ray dag tests (#7099) youkaichao 2024-08-02 22:32:04 -07:00
  • 04e5583425 [ci][distributed] merge distributed test commands (#7097) youkaichao 2024-08-02 21:33:53 -07:00
  • 8c025fa703 [Frontend] Factor out chat message parsing (#7055) Cyrus Leung 2024-08-03 12:31:27 +08:00
  • 69ea15e5cc [ci][distributed] shorten wait time if server hangs (#7098) youkaichao 2024-08-02 21:05:16 -07:00
  • ed812a73fa [ Frontend ] Multiprocessing for OpenAI Server with zeromq (#6883) Robert Shaw 2024-08-02 21:27:28 -04:00
  • 708989341e [misc] add a flag to enable compile (#7092) youkaichao 2024-08-02 16:18:45 -07:00
  • 22e718ff1a [Misc] Revive to use loopback address for driver IP (#7091) Rui Qiao 2024-08-02 15:50:00 -07:00
  • 05308891e2 [Core] Pipeline parallel with Ray ADAG (#6837) Rui Qiao 2024-08-02 13:55:40 -07:00
  • a8d604ca2a [Misc] Disambiguate quantized types via a new ScalarType (#6396) Lucas Wilkinson 2024-08-02 16:51:58 -04:00
  • b482b9a5b1 [CI/Build] Add support for Python 3.12 (#7035) Michael Goin 2024-08-02 16:51:22 -04:00
  • 806949514a [ci] set timeout for test_oot_registration.py (#7082) youkaichao 2024-08-02 10:03:24 -07:00
  • c16eaac500 [Hardware][Intel CPU] Update torch 2.4.0 for CPU backend (#6931) Jie Fu (傅杰) 2024-08-02 23:55:58 +08:00
  • db35186391 [Core] Comment out unused code in sampler (#7023) Peng Guanwen 2024-08-02 15:58:26 +08:00
  • 660dea1235 [cuda][misc] remove error_on_invalid_device_count_status (#7069) youkaichao 2024-08-02 00:14:21 -07:00
  • cf2a1a4d9d Fix tracing.py (#7065) Bongwon Jang 2024-08-02 15:28:00 +09:00
  • 252357793d [ci][distributed] try to fix pp test (#7054) youkaichao 2024-08-01 22:03:12 -07:00
  • 3bb4b1e4cd [mypy] Speed up mypy checking (#7056) Cyrus Leung 2024-08-02 10:49:43 +08:00
  • 954f7305a1 [Kernel] Fix input for flashinfer prefill wrapper. (#7008) Lily Liu 2024-08-01 18:44:16 -07:00
  • 6ce01f3066 [Performance] Optimize get_seqs (#7051) Woosuk Kwon 2024-08-01 18:29:52 -07:00
  • 6a11fdfbb8 [CI/Build][Bugfix] Fix CUTLASS header-only line (#7034) Tyler Michael Smith 2024-08-01 16:51:15 -04:00
  • 805a8a75f2 [Misc] Support attention logits soft-capping with flash-attn (#7022) Woosuk Kwon 2024-08-01 13:14:37 -07:00
  • 562e580abc Update run-amd-test.sh (#7044) omkar kakarparthi 2024-08-01 15:12:37 -05:00
  • fc912e0886 [Models] Support Qwen model with PP (#6974) Murali Andoorveedu 2024-08-01 12:40:43 -07:00
  • f4fd390f5d [Bugfix] Lower gemma's unloaded_params exception to warning (#7002) Michael Goin 2024-08-01 15:01:07 -04:00
  • fb3db61688 [CI/Build] Remove sparseml requirement from testing (#7037) Michael Goin 2024-08-01 15:00:51 -04:00
  • 2dd34371a6 [Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm (#6992) Isotr0py 2024-08-02 03:00:28 +08:00
  • 7e0861bd0b [CI/Build] Update PyTorch to 2.4.0 (#6951) Sage Moore 2024-08-01 11:11:24 -07:00
  • a72a424b3e [Build/CI] Fixing Docker Hub quota issue. (#7043) Alexei-V-Ivanov-AMD 2024-08-01 13:07:37 -05:00
  • c8a7e93273 [core][scheduler] simplify and improve scheduler (#6867) youkaichao 2024-07-31 23:51:09 -07:00
  • 3c10591ef2 [Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954) zifeitong 2024-07-31 21:13:34 -07:00
  • 0437492ea9 PP comm optimization: replace send with partial send + allgather (#6695) Aurick Qiao 2024-07-31 20:15:42 -07:00
  • 630dd9e0ae [Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings (#6758) Travis Johnson 2024-07-31 20:49:11 -06:00
  • 23993a7997 [Bugfix][TPU] Do not use torch.Generator for TPUs (#6981) Woosuk Kwon 2024-07-31 18:50:28 -07:00
  • 1d2e7fb73f [Model] Pipeline parallel support for Qwen2 (#6924) xuyi 2024-08-01 09:49:51 +08:00
  • 7ecee34321 [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) Jee Jee Li 2024-08-01 08:12:24 +08:00
  • 7eb0cb4a14 Revert "[Frontend] Factor out code for running uvicorn" (#7012) Simon Mo 2024-07-31 16:34:26 -07:00
  • a0dce9383a [Misc] Add compressed-tensors to optimized quant list (#7006) Michael Goin 2024-07-31 17:40:44 -04:00
  • 35e9c12bfa [Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996) Varun Sundar Rabindranath 2024-07-31 17:40:32 -04:00
  • 93548eb37e [Kernel] Enable FP8 Cutlass for Ada Lovelace (#6950) Varun Sundar Rabindranath 2024-07-31 17:40:22 -04:00
  • 460c1884e3 [Bugfix] Support cpu offloading with fp8 quantization (#6960) Michael Goin 2024-07-31 15:47:46 -04:00
  • bd70013407 [MISC] Introduce pipeline parallelism partition strategies (#6920) Cody Yu 2024-07-31 12:02:17 -07:00
  • 2ee8d3ba55 [Model] use FusedMoE layer in Jamba (#6935) Avshalom Manevich 2024-07-31 22:00:24 +03:00
  • daed30c4a9 [Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982) Cyrus Leung 2024-07-31 23:46:17 +08:00
  • 2f4e108f75 [Bugfix] Clean up MiniCPM-V (#6939) Alphi 2024-07-31 22:39:19 +08:00
  • 6512937de1 Support W4A8 quantization for vllm (#5218) HandH1998 2024-07-31 21:55:21 +08:00
  • c0644cf9ce [Bugfix] fix logit processor excceed vocab size issue (#6927) Fei 2024-07-31 01:16:01 -07:00
  • 533d1932d2 [Bugfix][TPU] Set readonly=True for non-root devices (#6980) Woosuk Kwon 2024-07-31 00:19:28 -07:00
  • 9f0e69b653 [CI/Build] Fix mypy errors (#6968) Cyrus Leung 2024-07-31 10:49:48 +08:00
  • f230cc2ca6 [Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836) Cyrus Leung 2024-07-31 10:38:45 +08:00
  • da1f7cc12a [mypy] Enable following imports for some directories (#6681) Cyrus Leung 2024-07-31 10:38:03 +08:00
  • c32ab8be1a [Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding (#6964) Cade Daniel 2024-07-30 17:53:21 -07:00
  • fb4f530bf5 [CI] [nightly benchmark] Do not re-download sharegpt dataset if exists (#6706) Cade Daniel 2024-07-30 16:28:49 -07:00
  • 79319cedfa [Nightly benchmarking suite] Remove pkill python from run benchmark suite (#6965) Cade Daniel 2024-07-30 16:28:05 -07:00
  • 40c27a7cbb [Build] Temporarily Disable Kernels and LoRA tests (#6961) Simon Mo 2024-07-30 14:59:48 -07:00
  • 6ca8031e71 [core][misc] improve free_finished_seq_groups (#6865) youkaichao 2024-07-30 14:32:12 -07:00
  • d7a299edaa [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) Tyler Michael Smith 2024-07-30 16:37:01 -04:00
  • 052b6f8ca4 [Bugfix] Fix tensorizer memory profiling bug during testing (#6881) Sanger Steel 2024-07-30 14:48:50 -04:00
  • 5895b24677 [OpenVINO] Updated OpenVINO requirements and build docs (#6948) Ilya Lavrenov 2024-07-30 22:33:01 +04:00
  • cbbc904470 [Kernel] Squash a few more warnings (#6914) Tyler Michael Smith 2024-07-30 13:50:42 -04:00
  • 5cf9254a9c [BugFix] Fix use of per-request seed with pipeline parallel (#6698) Nick Hill 2024-07-30 10:40:08 -07:00
  • f058403683 [Doc] Super tiny fix doc typo (#6949) fzyzcjy 2024-07-31 00:14:03 +08:00
  • c66c7f86ac [Bugfix] Fix PaliGemma MMP (#6930) Roger Wang 2024-07-30 02:20:57 -07:00
  • 6e063ea35b [TPU] Fix greedy decoding (#6933) Woosuk Kwon 2024-07-30 02:06:29 -07:00
  • af647fb8b3 [Kernel] Tuned int8 kernels for Ada Lovelace (#6848) Varun Sundar Rabindranath 2024-07-29 22:24:58 -04:00
  • 61a97c32f6 [Kernel] Fix marlin divide-by-zero warnings (#6904) Tyler Michael Smith 2024-07-29 21:26:07 -04:00
  • 4fbf4aa128 [ci] GHA workflow to remove ready label upon "/notready" comment (#6921) Kevin H. Luu 2024-07-29 17:03:45 -07:00
  • aae6d36f7e [Kernel] Remove unused variables in awq/gemm_kernels.cu (#6908) Tyler Michael Smith 2024-07-29 20:01:17 -04:00
  • 9f69d8245a [Frontend] New allowed_token_ids decoding request parameter (#6753) Nick Hill 2024-07-29 16:37:27 -07:00
  • 9a7e2d0534 [Bugfix] Allow vllm to still work if triton is not installed. (#6786) Thomas Parnell 2024-07-29 23:51:27 +02:00
  • 7f8d612d24 [TPU] Support tensor parallelism in async llm engine (#6891) Earthwalker 2024-07-30 03:42:21 +08:00
  • 60d1c6e584 [Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel (#6901) Tyler Michael Smith 2024-07-29 12:59:02 -04:00
  • db9e5708a9 [Core] Reduce unnecessary compute when logprobs=None (#6532) Peng Guanwen 2024-07-30 00:47:31 +08:00
  • 766435e660 [Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677) Varun Sundar Rabindranath 2024-07-29 11:42:35 -04:00