Commit Graph

  • d3c002eadc [Bugfix] chat method add_generation_prompt param (#7734) Brian Li 2024-08-22 01:33:35 +08:00
  • 9b73a2f498 [Spec Decoding] Use target model max length as default for draft model (#7706) Nick Hill 2024-08-21 12:23:22 -04:00
  • 6925cdbeea [Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend (#7735) Isotr0py 2024-08-22 00:23:03 +08:00
  • 53328d7536 [BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) LI MOU 2024-08-21 23:54:31 +08:00
  • c75363fbc0 [BugFix] Avoid premature async generator exit and raise all exception variations (#7698) Nick Hill 2024-08-21 11:45:55 -04:00
  • dd3fa0e430 [Bugfix] Mirror jinja2 in pyproject.toml (#7723) sasha0552 2024-08-21 13:41:17 +00:00
  • baaedfdb2d [mypy] Enable following imports for entrypoints (#7248) Cyrus Leung 2024-08-21 14:28:21 +08:00
  • 4506641212 [Doc] Section for Multimodal Language Models (#7719) Roger Wang 2024-08-20 23:24:01 -07:00
  • 12e1c65bc9 [Model] Add AWQ quantization support for InternVL2 model (#7187) Isotr0py 2024-08-21 14:18:57 +08:00
  • b74a125800 [ci] try to log process using the port to debug the port usage (#7711) youkaichao 2024-08-20 17:41:12 -07:00
  • 66a9e713a7 [Core] Pipe worker_class_fn argument in Executor (#7707) Antoni Baum 2024-08-20 17:37:39 -07:00
  • 9e51b6a626 [ci][test] adjust max wait time for cpu offloading test (#7709) youkaichao 2024-08-20 17:12:44 -07:00
  • 6e4658c7aa [Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) (#7685) Kunshang Ji 2024-08-21 03:01:09 +08:00
  • 3b682179dd [Core] Add AttentionState abstraction (#7663) Antoni Baum 2024-08-20 11:50:45 -07:00
  • c6af027a35 [Misc] Add jinja2 as an explicit build requirement (#7695) Lucas Wilkinson 2024-08-20 13:17:47 -04:00
  • 2aa00d59ad [CI/Build] Pin OpenTelemetry versions and make errors clearer (#7266) Ronen Schaffer 2024-08-20 20:02:21 +03:00
  • c42590f97a [Hardware] [Intel GPU] refactor xpu worker/executor (#7686) Kunshang Ji 2024-08-21 00:54:10 +08:00
  • aae6927be0 [VLM][Model] Add test for InternViT vision encoder (#7409) Isotr0py 2024-08-20 23:10:20 +08:00
  • 398521ad19 [OpenVINO] Updated documentation (#7687) Ilya Lavrenov 2024-08-20 17:33:56 +04:00
  • 5288c06aa0 [Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) Lucas Wilkinson 2024-08-20 09:09:33 -04:00
  • b6f99a6ffe [Core] Refactor executor classes for easier inheritance (#7673) Kunshang Ji 2024-08-20 15:56:50 +08:00
  • ad28a74beb [misc][cuda] add warning for pynvml user (#7675) youkaichao 2024-08-20 00:35:09 -07:00
  • e6d811dd13 [XPU] fallback to native implementation for xpu custom op (#7670) jianyizh 2024-08-20 15:26:09 +08:00
  • c4be16e1a7 [misc] add nvidia related library in collect env (#7674) youkaichao 2024-08-19 23:22:49 -07:00
  • 3d8a5f063d [CI] Organizing performance benchmark files (#7616) Kuntai Du 2024-08-19 22:43:54 -07:00
  • f4fc7337bf [Bugfix] support tie_word_embeddings for all models (#5724) Zijian Hu 2024-08-19 20:00:04 -07:00
  • 0df7ec0b2d [ci] Install Buildkite test suite analysis (#7667) Kevin H. Luu 2024-08-19 19:55:04 -07:00
  • 312f761232 [Speculative Decoding] Fixing hidden states handling in batch expansion (#7508) Abhinav Goyal 2024-08-20 06:28:14 +05:30
  • e54ebc2f8f [doc] fix doc build error caused by msgspec (#7659) youkaichao 2024-08-19 17:50:59 -07:00
  • 67e02fa8a4 [Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding (#7665) Travis Johnson 2024-08-19 18:43:09 -06:00
  • 43735bf5e1 [TPU] Remove redundant input tensor cloning (#7660) Woosuk Kwon 2024-08-19 15:55:04 -07:00
  • da115230fd [Bugfix] Don't disable existing loggers (#7664) Andrew Song 2024-08-19 15:11:58 -07:00
  • 7601cb044d [Core] Support tensor parallelism for GGUF quantization (#7520) Isotr0py 2024-08-20 05:30:14 +08:00
  • 47b65a5508 [core] Multi Step Scheduling (#7000) William Lin 2024-08-19 13:52:13 -07:00
  • dad961ef5c [Bugfix] fix lora_dtype value type in arg_utils.py - part 2 (#5428) Ali Panahi 2024-08-19 13:47:00 -07:00
  • 3ac50b47d0 [MISC] Add prefix cache hit rate to metrics (#7606) Cody Yu 2024-08-19 11:52:07 -07:00
  • df845b2b46 [Misc] Remove Gemma RoPE (#7638) Woosuk Kwon 2024-08-19 09:29:31 -07:00
  • 1a36287b89 [Bugfix] Fix xpu build (#7644) Kunshang Ji 2024-08-19 13:00:09 +08:00
  • f710fb5265 [Core] Use flashinfer sampling kernel when available (#7137) Peng Guanwen 2024-08-19 11:24:03 +08:00
  • ff7ec82c4d [Core] Optimize SPMD architecture with delta + serialization optimization (#7109) SangBin Cho 2024-08-18 17:57:20 -07:00
  • 200a2ffa6b [Misc] Refactor Llama3 RoPE initialization (#7637) Woosuk Kwon 2024-08-18 17:18:12 -07:00
  • 40e1360bb6 [CI/Build] Add text-only test for Qwen models (#7475) Alex Brooks 2024-08-18 17:43:46 -06:00
  • e3b318216d [ Bugfix ] Fix Prometheus Metrics With zeromq Frontend (#7279) Robert Shaw 2024-08-18 16:19:48 -04:00
  • ab7165f2c7 [TPU] Optimize RoPE forward_native2 (#7636) Woosuk Kwon 2024-08-18 01:15:10 -07:00
  • 0c2fa50b84 [TPU] Use mark_dynamic only for dummy run (#7634) Woosuk Kwon 2024-08-18 00:18:53 -07:00
  • ce143353c6 [TPU] Skip creating empty tensor (#7630) Woosuk Kwon 2024-08-17 14:22:46 -07:00
  • bbf55c4805 [VLM] Refactor MultiModalConfig initialization and profiling (#7530) Roger Wang 2024-08-17 13:30:55 -07:00
  • 1ef13cf92f [Misc]Fix BitAndBytes exception messages (#7626) Jee Jee Li 2024-08-18 03:02:14 +08:00
  • 832163b875 [ci][test] allow longer wait time for api server (#7629) youkaichao 2024-08-17 11:26:38 -07:00
  • e73f76eec6 [Model] Pipeline parallel support for JAIS (#7603) Besher Alkurdi 2024-08-17 21:11:09 +03:00
  • d95cc0a55c [core][misc] update libcudart finding (#7620) youkaichao 2024-08-16 23:01:35 -07:00
  • 5bf45db7df [ci][test] fix engine/logger test (#7621) youkaichao 2024-08-16 23:00:59 -07:00
  • eed020f673 [misc] use nvml to get consistent device name (#7582) youkaichao 2024-08-16 21:15:13 -07:00
  • 7c0b7ea214 [Bugfix] add >= 1.0 constraint for openai dependency (#7612) Xander Johnson 2024-08-16 20:56:01 -07:00
  • 4706eb628e [aDAG] Unflake aDAG + PP tests (#7600) SangBin Cho 2024-08-16 20:49:30 -07:00
  • bae888cb8e [Bugfix] Clear engine reference in AsyncEngineRPCServer (#7618) Rui Qiao 2024-08-16 20:44:05 -07:00
  • 6bd19551b0 .[Build/CI] Enabling passing AMD tests. (#7610) Alexei-V-Ivanov-AMD 2024-08-16 22:25:32 -05:00
  • e680349994 [Bugfix] Fix custom_ar support check (#7617) bnellnm 2024-08-16 22:05:49 -04:00
  • 44f26a9466 [Model] Align nemotron config with final HF state and fix lm-eval-small (#7611) Michael Goin 2024-08-16 18:56:34 -04:00
  • 37fd47e780 [Kernel] fix types used in aqlm and ggml kernels to support dynamo (#7596) bnellnm 2024-08-16 17:00:11 -04:00
  • 7759ae958f [Kernel][Misc] dynamo support for ScalarType (#7594) bnellnm 2024-08-16 16:59:49 -04:00
  • 9f69856356 [Kernel] register punica functions as torch ops (#7591) bnellnm 2024-08-16 16:59:38 -04:00
  • d4f0f17b02 [Doc] Update quantization supported hardware table (#7595) Michael Goin 2024-08-16 16:59:27 -04:00
  • b3f4e17935 [Doc] Add docs for llmcompressor INT8 and FP8 checkpoints (#7444) Michael Goin 2024-08-16 16:59:16 -04:00
  • 93478b63d2 [Core] Fix tracking of model forward time in case of PP>1 (#7440) Mahesh Keralapura 2024-08-16 13:46:01 -07:00
  • f366f6339b [spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571) William Lin 2024-08-16 11:41:56 -07:00
  • 855866caa9 [Kernel] Add tuned triton configs for ExpertsInt8 (#7601) Michael Goin 2024-08-16 14:37:01 -04:00
  • 7fc23be81c [Kernel] W8A16 Int8 inside FusedMoE (#7415) Mor Zusman 2024-08-16 20:06:51 +03:00
  • e837b624f2 [Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) Charlie Fu 2024-08-16 12:06:30 -05:00
  • ec724a725e support tqdm in notebooks (#7510) fzyzcjy 2024-08-17 00:17:50 +08:00
  • 0e39a33c6d [Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method (#7513) Gordon Wong 2024-08-17 00:05:18 +08:00
  • 6fc5b0f249 [CI] Fix crashes of performance benchmark (#7500) Kuntai Du 2024-08-16 08:08:45 -07:00
  • 9587b050fb [Core] Use uvloop with zmq-decoupled front-end (#7570) Nick Hill 2024-08-15 22:48:07 -07:00
  • 54bd9a03c4 register custom op for flash attn and use from torch.ops (#7536) youkaichao 2024-08-15 22:38:56 -07:00
  • 50b8d08dbd [Misc/Testing] Use torch.testing.assert_close (#7324) jon-chuang 2024-08-15 21:24:04 -07:00
  • e165528778 [CI] Move quantization cpu offload tests out of fastcheck (#7574) Michael Goin 2024-08-16 00:16:20 -04:00
  • 3b19e39dc5 Chat method for offline llm (#5049) nunjunj 2024-08-16 09:41:34 +07:00
  • 4cd7d47fed [ci/test] rearrange tests and make adag test soft fail (#7572) youkaichao 2024-08-15 19:39:04 -07:00
  • f878c8feb0 [Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453) Grant Pinkert 2024-08-16 12:38:08 +10:00
  • b67ae00cdb [Misc] Add quantization config support for speculative model. (#7343) shangmingc 2024-08-16 10:34:28 +08:00
  • 9c8e2d1161 [Bugfix][Harmless] Fix float16 dtype for model_is_embedding (#7566) Michael Goin 2024-08-15 21:26:19 -04:00
  • 21313e09e3 [Bugfix] Fix default weight loading for scalars (#7534) Michael Goin 2024-08-15 16:10:22 -04:00
  • f4da5f7b6d [Misc] Update dockerfile for CPU to cover protobuf installation (#7182) PHILO-HE 2024-08-16 01:03:01 +08:00
  • 9c1f78d5d6 [Bugfix] update neuron for version > 0.5.0 (#7175) omrishiv 2024-08-15 09:44:14 -07:00
  • fc93e56143 [Bugfix][TPU] Correct env variable for XLA cache path (#7544) Woosuk Kwon 2024-08-15 00:02:29 -07:00
  • 22b39e11f2 llama_index serving integration documentation (#6973) Kameshwara Pavan Kumar Mantha 2024-08-15 04:08:37 +05:30
  • f55a9aea45 [Misc] Revert compressed-tensors code reuse (#7521) Kyle Sayers 2024-08-14 18:07:37 -04:00
  • 951fdd66d3 [TPU] Set per-rank XLA cache (#7533) Woosuk Kwon 2024-08-14 14:47:51 -07:00
  • 2ecf7b1757 [core] [3/N] multi-step args and sequence.py (#7452) William Lin 2024-08-14 12:32:45 -07:00
  • 3f674a49b5 [VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) Cyrus Leung 2024-08-15 01:55:42 +08:00
  • 70b746efcf [Misc] Deprecation Warning when setting --engine-use-ray (#7424) Wallas Henrique 2024-08-14 13:44:27 -03:00
  • 67d115db08 [Bugfix][Frontend] Disable embedding API for chat models (#7504) jack 2024-08-15 00:15:19 +08:00
  • d3d9cb6e4b [ci] fix model tests (#7507) youkaichao 2024-08-14 01:01:43 -07:00
  • c134a46402 Fix empty output when temp is too low (#2937) Chang Su 2024-08-13 22:31:44 -07:00
  • 199adbb7cf [doc] update test script to include cudagraph (#7501) youkaichao 2024-08-13 21:52:58 -07:00
  • dd164d72f3 [Bugfix][Docs] Update list of mock imports (#7493) Cyrus Leung 2024-08-14 11:37:30 +08:00
  • ea49e6a3c8 [misc][ci] fix cpu test with plugins (#7489) youkaichao 2024-08-13 19:27:46 -07:00
  • 97992802f3 [CI/Build]Reduce the time consumption for LoRA tests (#7396) Jee Jee Li 2024-08-14 08:27:29 +08:00
  • 59edd0f134 [Bugfix][CI] Import ray under guard (#7486) Woosuk Kwon 2024-08-13 17:12:58 -07:00
  • a08df8322e [TPU] Support multi-host inference (#7457) Woosuk Kwon 2024-08-13 16:31:20 -07:00