Commit Graph

  • 4ea1f9678d [BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418) Robert Shaw 2024-04-27 14:35:33 -04:00
  • ba4be44c32 [BugFix] Fix return type of executor execute_model methods (#4402) Nick Hill 2024-04-27 11:17:45 -07:00
  • d6e520e170 [Core] Support offline use of local cache for models (#4374) Prashant Gupta 2024-04-27 09:59:55 -07:00
  • 81661da7b2 [BugFix] Fix min_tokens when eos_token_id is None (#4389) Nick Hill 2024-04-27 09:52:46 -07:00
  • dfea173148 [Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363) Ruoyu Qin 2024-04-28 00:48:37 +08:00
  • 7134303cbb [Bugfix][Core] Fix get decoding config from ray (#4335) Roy 2024-04-27 19:30:08 +08:00
  • 3da24c2df7 [Model] Phi-3 4k sliding window temp. fix (#4380) Caio Mendes 2024-04-27 07:08:15 -03:00
  • eefeb16464 [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) Austin Veselka 2024-04-27 02:03:48 -05:00
  • 18d23f642a [ROCm][Hardware][AMD] Enable group query attention for triton FA (#4406) Hongxia Yang 2024-04-27 02:37:40 -04:00
  • 87f545ba6f [Misc] Fix logger format typo (#4396) Roy 2024-04-27 13:45:02 +08:00
  • 8947bc3c15 [Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) Cyrus Leung 2024-04-27 13:08:24 +08:00
  • 12628d3c78 [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343) Philipp Moritz 2024-04-26 21:49:59 -07:00
  • 258a2c58d0 [Core] Introduce DistributedGPUExecutor abstract class (#4348) Nick Hill 2024-04-26 21:14:26 -07:00
  • aba47be3fe [Misc] add RFC issue template (#4401) youkaichao 2024-04-26 15:47:45 -07:00
  • a62aaf1df5 [Misc][Refactor] Generalize linear_method to be quant_method (#4373) Cody Yu 2024-04-26 13:41:14 -07:00
  • 603ad84815 [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) SangBin Cho 2024-04-26 22:02:02 +09:00
  • a88081bf76 [CI] Disable non-lazy string operation on logging (#4326) SangBin Cho 2024-04-26 16:16:58 +09:00
  • 2f30e7c72f [Frontend] Add --log-level option to api server (#4377) Norman Mu 2024-04-25 22:36:01 -07:00
  • a74dee9b62 [Bugfix] Fix parameter name in get_tokenizer (#4107) Cyrus Leung 2024-04-26 10:10:48 +08:00
  • cf29b7eda4 [ROCm][Hardware][AMD][Doc] Documentation update for ROCm (#4376) Hongxia Yang 2024-04-25 21:12:25 -04:00
  • efffb63f58 [Core] Move function tracing setup to util function (#4352) Nick Hill 2024-04-25 16:45:12 -07:00
  • 15e7c675b0 [Core] Add shutdown() method to ExecutorBase (#4349) Nick Hill 2024-04-25 16:32:48 -07:00
  • b6dcb4d442 [Misc] Fix flash attention backend log (#4368) Roy 2024-04-26 03:43:32 +08:00
  • b5b4a398a7 [Mypy] Typing lora folder (#4337) SangBin Cho 2024-04-26 04:13:50 +09:00
  • f4bc4de1b1 [Core]refactor aqlm quant ops (#4351) Kunshang Ji 2024-04-25 19:03:56 +00:00
  • bd7a8eef25 [Doc] README Phi-3 name fix. (#4372) Caio Mendes 2024-04-25 14:32:00 -03:00
  • 7ee82bef1e [CI/Build] Adding functionality to reset the node's GPUs before processing. (#4213) Alexei-V-Ivanov-AMD 2024-04-25 11:37:20 -05:00
  • fbf152d976 [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 (#4324) Isotr0py 2024-04-26 00:35:56 +08:00
  • 479d69fad0 [Core] Move ray_utils.py from engine to executor package (#4347) Nick Hill 2024-04-24 23:52:22 -07:00
  • 96e90fdeb3 [Model] Adds Phi-3 support (#4298) Caio Mendes 2024-04-25 00:06:57 -03:00
  • a395a638c2 [Misc] Use public API in benchmark_throughput (#4300) zifeitong 2024-04-24 14:10:24 -07:00
  • 2768884ac4 [Doc] Add note for docker user (#4340) youkaichao 2024-04-24 14:09:44 -07:00
  • aae08249ac [Bugfix] Fix marlin kernel crash on H100 (#4218) alexm-nm 2024-04-24 13:35:01 -04:00
  • 7923dcad12 [Misc] Update ShareGPT Dataset Sampling in Serving Benchmark (#4279) Roger Wang 2024-04-24 09:49:13 -07:00
  • 3cd9b5bb2d [Core][Distributed] use existing torch.cuda.device (#4318) youkaichao 2024-04-24 09:00:20 -07:00
  • 468d761b32 [Misc] Reduce supported Punica dtypes (#4304) (tag: v0.4.1) Woosuk Kwon 2024-04-23 18:54:33 -07:00
  • e4bf860a54 [CI][Build] change pynvml to nvidia-ml-py (#4302) youkaichao 2024-04-23 18:33:12 -07:00
  • 91f50a6fe2 [Core][Distributed] use cpu/gloo to initialize pynccl (#4248) youkaichao 2024-04-23 18:32:19 -07:00
  • 79a268c4ab [BUG] fixed fp8 conflict with aqlm (#4307) Robert Shaw 2024-04-23 21:26:33 -04:00
  • eace8bf0b9 [Kernel] FP8 support for MoE kernel / Mixtral (#4244) Philipp Moritz 2024-04-23 18:18:23 -07:00
  • 1e8f4252aa [Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292) Cyrus Leung 2024-04-24 02:19:03 +08:00
  • 2b7949c1c2 AQLM CUDA support (#3287) James Fleming 2024-04-23 13:59:33 -04:00
  • 62b5166bd4 [CI] Add ccache for wheel builds job (#4281) Simon Mo 2024-04-23 09:51:41 -07:00
  • d86285a4a4 [Core][Logging] Add last frame information for better debugging (#4278) youkaichao 2024-04-23 09:45:52 -07:00
  • d87f39e9a9 [Bugfix] Add init_cached_hf_modules to RayWorkerWrapper (#4286) DefTruth 2024-04-24 00:28:35 +08:00
  • d3c8180ac4 [Bugfix] Fixing max token error message for openai compatible server (#4016) Jack Gordley 2024-04-23 12:06:29 +01:00
  • 62b8aebc6f [Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) Cade Daniel 2024-04-23 01:02:36 -07:00
  • 050f285ff6 [Core] Scheduling optimization 2 (#4280) SangBin Cho 2024-04-23 17:02:11 +09:00
  • 8f2ea22bde [Core] Some simplification of WorkerWrapper changes (#4183) Nick Hill 2024-04-23 00:49:08 -07:00
  • 0ae11f78ab [Mypy] Part 3 fix typing for nested directories for most of directory (#4161) SangBin Cho 2024-04-23 13:32:44 +09:00
  • 34128a697e Fix autodoc directives (#4272) Harry Mellor 2024-04-23 02:53:01 +01:00
  • c1b4e4157c [Core][Distributed] use absolute path for library file (#4271) youkaichao 2024-04-22 17:21:48 -07:00
  • ceaf4ed003 [Doc] Update the SkyPilot doc with serving and Llama-3 (#4276) Zhanghao Wu 2024-04-22 15:34:31 -07:00
  • ad8d696a99 [Core] Scheduler perf fix (#4270) SangBin Cho 2024-04-23 06:11:06 +09:00
  • 3d925165f2 Add example scripts to documentation (#4225) Harry Mellor 2024-04-22 17:36:54 +01:00
  • 1543680691 [Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter (#4217) alexm-nm 2024-04-22 12:10:48 -04:00
  • 077f0a2e8a [Frontend] Enable support for CPU backend in AsyncLLMEngine. (#3993) Tao He 2024-04-22 17:19:51 +08:00
  • e73ed0f1c6 [Bugfix] Fix type annotations in CPU model runner (#4256) Woosuk Kwon 2024-04-22 00:54:16 -07:00
  • 296cdf8ac7 [Misc] Add vision language model support to CPU backend (#3968) Isotr0py 2024-04-22 15:44:16 +08:00
  • 747b1a7147 [Core][Distributed] fix _is_full_nvlink detection (#4233) youkaichao 2024-04-21 23:04:16 -07:00
  • 95e5b087cf [AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring (#4129) Hongxia Yang 2024-04-22 00:57:24 -04:00
  • a37d815b83 Make initialization of tokenizer and detokenizer optional (#3748) GeauxEric 2024-04-21 15:06:46 -07:00
  • 7f2593b164 [Doc]: Update the doc of adding new models (#4236) xiaoji 2024-04-22 00:57:08 +08:00
  • fe7d648fe5 Don't show default value for flags in EngineArgs (#4223) Harry Mellor 2024-04-21 17:15:28 +01:00
  • cc74b2b232 Updating lm-format-enforcer version and adding links to decoding libraries in docs (#4222) Noam Gat 2024-04-20 11:33:16 +03:00
  • 91528575ec [Frontend] multiple sampling params support (#3570) nunjunj 2024-04-20 00:11:57 -07:00
  • a22cdea371 [Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118) Cody Yu 2024-04-19 21:28:57 -07:00
  • 682789d402 Fix missing docs and out of sync EngineArgs (#4219) Harry Mellor 2024-04-20 04:51:33 +01:00
  • 138485a82d [Bugfix] Add fix for JSON whitespace (#4189) Ayush Rautwar 2024-04-19 23:49:22 -04:00
  • bc9df1571b Pass tokenizer_revision when getting tokenizer in openai serving (#4214) Chirag Jain 2024-04-20 05:43:56 +05:30
  • 15b86408a8 [Misc] add nccl in collect env (#4211) youkaichao 2024-04-19 12:44:51 -07:00
  • 7be4f5628f [Bugfix][Core] Restore logging of stats in the async engine (#4150) Ronen Schaffer 2024-04-19 18:08:26 +03:00
  • 8f20fc04bf [Misc] fix docstrings (#4191) Uranus 2024-04-19 16:18:33 +08:00
  • 221d93ecbf Bump version of 0.4.1 (#4177) Simon Mo 2024-04-19 01:00:22 -07:00
  • d17c8477f1 [Bugfix] Fix LoRA loading check (#4138) Jee Li 2024-04-19 15:59:54 +08:00
  • a134ef6f5e Support eos_token_id from generation_config.json (#4182) Simon Mo 2024-04-18 21:13:36 -07:00
  • 8a7a3e4436 [Core] add an option to log every function call to for debugging hang/crash in distributed inference (#4079) youkaichao 2024-04-18 16:15:12 -07:00
  • 8f9c28fd40 [Bugfix] Fix CustomAllreduce nvlink topology detection (#3974) Adam Tilghman 2024-04-18 15:32:47 -07:00
  • cd2f63fb36 [CI/CD] add neuron docker and ci test scripts (#3571) Liangfu Chen 2024-04-18 15:26:01 -07:00
  • 87fa80c91f [Misc] Bump transformers to latest version (#4176) Nick Hill 2024-04-18 14:36:39 -07:00
  • e1bb2fd52d [Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149) James Whedbee 2024-04-18 16:12:55 -05:00
  • 705578ae14 [Docs] document that Meta Llama 3 is supported (#4175) Simon Mo 2024-04-18 10:55:48 -07:00
  • e8cc7967ff [Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128) Michał Moskal 2024-04-18 00:51:28 -07:00
  • 53b018edcb [Bugfix] Get available quantization methods from quantization registry (#4098) Michael Goin 2024-04-18 03:21:55 -04:00
  • 66ded03067 Allow model to be served under multiple names (#2894) Harry Mellor 2024-04-18 08:16:26 +01:00
  • 6dc1fc9cfe [Core] nccl integrity check and test (#4155) youkaichao 2024-04-17 22:28:52 -07:00
  • 533d2a1f39 [Typing] Mypy typing part 2 (#4043) SangBin Cho 2024-04-18 09:28:43 +09:00
  • a53222544c [Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) Shoichi Uchinami 2024-04-18 02:02:45 +09:00
  • fe3b5bbc23 [Bugfix] fix output parsing error for trtllm backend (#4137) Elinx 2024-04-17 19:07:23 +08:00
  • 8438e0569e [Core] RayWorkerVllm --> WorkerWrapper to reduce duplication (#4024) youkaichao 2024-04-17 01:34:33 -07:00
  • 11d652bd4f [CI] Move CPU/AMD tests to after wait (#4123) Cade Daniel 2024-04-16 22:53:26 -07:00
  • d150e4f89f [Misc] [CI] Fix CI failure caught after merge (#4126) Cade Daniel 2024-04-16 17:56:01 -07:00
  • e95cd87959 [Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) Cade Daniel 2024-04-16 13:09:21 -07:00
  • 69e1d2fb69 [Core] Refactor model loading code (#4097) Antoni Baum 2024-04-16 11:34:39 -07:00
  • 05434764cd LM Format Enforcer Guided Decoding Support (#3868) Noam Gat 2024-04-16 08:54:57 +03:00
  • 4e7ee664e2 [Core] Fix engine-use-ray broken (#4105) SangBin Cho 2024-04-16 14:24:53 +09:00
  • 37e84a403d [Typing] Fix Sequence type GenericAlias only available after Python 3.9. (#4092) SangBin Cho 2024-04-16 06:47:31 +09:00
  • 4695397dcf [Bugfix] Fix ray workers profiling with nsight (#4095) Ricky Xu 2024-04-15 14:24:45 -07:00
  • d619ae2d19 [Doc] Add better clarity for tensorizer usage (#4090) Sanger Steel 2024-04-15 16:28:25 -04:00
  • eb46fbfda2 [Core] Simplifications to executor classes (#4071) Nick Hill 2024-04-15 13:05:09 -07:00