Commit Graph

  • 7cbd9ec7a9 [Model] Initialize support for InternVL2 series models (#6514) Isotr0py 2024-07-29 18:16:30 +08:00
  • 3eeb148f46 [Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (#6871) Elsa Granger 2024-07-28 23:13:49 +08:00
  • b1366a9534 Add Nemotron to PP_SUPPORTED_MODELS (#6863) Michael Goin 2024-07-27 18:05:17 -04:00
  • 75acdaa4b6 [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) Alexander Matveev 2024-07-27 17:52:33 -04:00
  • fad5576c58 [TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856) Woosuk Kwon 2024-07-27 10:28:33 -07:00
  • f954d0715c [Docs] Add RunLLM chat widget (#6857) Chenggang Wu 2024-07-27 09:24:46 -07:00
  • 1ad86acf17 [Model] Initial support for BLIP-2 (#5920) Cyrus Leung 2024-07-27 19:53:07 +08:00
  • ecb33a28cb [CI/Build][Doc] Update CI and Doc for VLM example changes (#6860) Roger Wang 2024-07-27 02:54:14 -07:00
  • a57d75821c [bugfix] make args.stream work (#6831) Wang Ran (汪然) 2024-07-27 17:07:02 +08:00
  • 925de97e05 [Bugfix] Fix VLM example typo (#6859) Roger Wang 2024-07-26 23:24:08 -07:00
  • aa46953a20 [Misc][VLM][Doc] Consolidate offline examples for vision language models (#6858) Roger Wang 2024-07-26 22:44:13 -07:00
  • 593e79e733 [Bugfix] torch.set_num_threads() in multiproc_gpu_executor (#6802) Travis Johnson 2024-07-26 23:15:20 -06:00
  • c53041ae3b [Doc] Add missing mock import to docs conf.py (#6834) Harry Mellor 2024-07-27 05:47:33 +01:00
  • 52f07e3dec [Hardware][TPU] Implement tensor parallelism with Ray (#5871) Woosuk Kwon 2024-07-26 20:54:27 -07:00
  • 14dbd5a767 [Model] H2O Danube3-4b (#6451) Joe 2024-07-26 20:47:50 -07:00
  • ed94e4f427 [Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba (#6784) tomeras91 2024-07-27 06:45:31 +03:00
  • 3c3012398e [Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron (#6844) omrishiv 2024-07-26 20:20:16 -07:00
  • ced36cd89b [ROCm] Upgrade PyTorch nightly version (#6845) Woosuk Kwon 2024-07-26 20:16:13 -07:00
  • 969d032265 [Bugfix]: Fix Tensorizer test failures (#6835) Sanger Steel 2024-07-26 23:02:25 -04:00
  • 55712941e5 [Bug Fix] Illegal memory access, FP8 Llama 3.1 405b (#6852) Lucas Wilkinson 2024-07-26 22:27:44 -04:00
  • 981b0d5673 [Frontend] Factor out code for running uvicorn (#6828) Cyrus Leung 2024-07-27 09:58:25 +08:00
  • d09b94ca58 [TPU] Support collective communications in XLA devices (#6813) Woosuk Kwon 2024-07-26 18:45:57 -07:00
  • bb5494676f enforce eager mode with bnb quantization temporarily (#6846) chenqianfzh 2024-07-26 18:32:20 -07:00
  • b5f49ee55b Update README.md (#6847) Gurpreet Singh Dhami 2024-07-26 20:26:45 -04:00
  • 150a1ffbfd [Doc] Update SkyPilot doc for wrong indents and instructions for update service (#4283) Zhanghao Wu 2024-07-26 17:39:10 -04:00
  • 281977bd6e [Doc] Add Nemotron to supported model docs (#6843) Michael Goin 2024-07-26 17:32:44 -04:00
  • 3bbb4936dc [Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation (#6125) Li, Jiang 2024-07-27 04:50:10 +08:00
  • aa4867791e [Misc][TPU] Support TPU in initialize_ray_cluster (#6812) Woosuk Kwon 2024-07-26 12:39:49 -07:00
  • 71734f1bf2 [Build/CI][ROCm] Minor simplification to Dockerfile.rocm (#6811) Woosuk Kwon 2024-07-26 12:28:32 -07:00
  • 50704f52c4 [Bugfix][Kernel] Promote another index to int64_t (#6838) Tyler Michael Smith 2024-07-26 14:41:04 -04:00
  • 07278c37dd [Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#6611) Michael Goin 2024-07-26 14:33:42 -04:00
  • 85ad7e2d01 [doc][debugging] add known issues for hangs (#6816) youkaichao 2024-07-25 21:48:05 -07:00
  • 89a84b0bb7 [Core] Use array to speedup padding (#6779) Peng Guanwen 2024-07-26 12:31:31 +08:00
  • 084a01fd35 [Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. (#6770) Anthony Platanios 2024-07-26 00:25:35 -04:00
  • 062a1d0fab Fix ReplicatedLinear weight loading (#6793) QQSong 2024-07-25 19:24:58 -07:00
  • 2eb9f4ff26 [ci] Mark tensorizer as soft fail and separate from grouped test (#6810) Kevin H. Luu 2024-07-25 18:08:33 -07:00
  • 443c7cf4cf [ci][distributed] fix flaky tests (#6806) youkaichao 2024-07-25 17:44:09 -07:00
  • 1adddb14bf [Core] Fix ray forward_dag error mssg (#6792) SangBin Cho 2024-07-25 16:53:25 -07:00
  • b7215de2c5 [Docs] Publish 5th meetup slides (#6799) Woosuk Kwon 2024-07-25 16:47:55 -07:00
  • f3ff63c3f4 [doc][distributed] improve multinode serving doc (#6804) youkaichao 2024-07-25 15:38:32 -07:00
  • cd7edc4e87 [Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors (#6798) Lucas Wilkinson 2024-07-25 18:05:09 -04:00
  • 6a1e25b151 [Doc] Add documentations for nightly benchmarks (#6412) Kuntai Du 2024-07-25 11:57:16 -07:00
  • 95db75de64 [Bugfix] Add synchronize to prevent possible data race (#6788) Tyler Michael Smith 2024-07-25 13:40:01 -04:00
  • 65b1f121c8 [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints (#6761) Michael Goin 2024-07-25 12:46:15 -04:00
  • 889da130e7 [ Misc ] fp8-marlin channelwise via compressed-tensors (#6524) Robert Shaw 2024-07-25 09:46:04 -07:00
  • b75e314fff [Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V (#6787) Alphi 2024-07-26 00:42:49 +08:00
  • 316a41ac1d [Bugfix] Fix encoding_format in examples/openai_embedding_client.py (#6755) Chang Su 2024-07-24 22:48:07 -07:00
  • 0310029a2f [Bugfix] Fix awq_marlin and gptq_marlin flags (#6745) Alexander Matveev 2024-07-25 01:34:11 -04:00
  • 309aaef825 [Bugfix] Fix decode tokens w. CUDA graph (#6757) Cody Yu 2024-07-24 22:33:56 -07:00
  • 9e169a4c61 [Model] Adding support for MiniCPM-V (#4087) Alphi 2024-07-25 11:59:30 +08:00
  • 5689e256ba [Frontend] Represent tokens with identifiable strings (#6626) Evan Z. Liu 2024-07-24 18:51:00 -07:00
  • 740374d456 [core][distributed] fix zmq hang (#6759) youkaichao 2024-07-24 17:37:12 -07:00
  • d88c458f44 [Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users (#6754) Hongxia Yang 2024-07-24 17:32:57 -04:00
  • 421e218b37 [Bugfix] Bump transformers to 4.43.2 (#6752) Michael Goin 2024-07-24 16:22:16 -04:00
  • 5448f67635 [Core] Tweaks to model runner/input builder developer APIs (#6712) Antoni Baum 2024-07-24 12:17:12 -07:00
  • 0e63494cf3 Add fp8 support to reshape_and_cache_flash (#6667) Antoni Baum 2024-07-24 11:36:52 -07:00
  • ee812580f7 [Frontend] split run_server into build_server and run_server (#6740) Daniele 2024-07-24 19:36:04 +02:00
  • 40468b13fa [Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. (#6686) Allen.Dou 2024-07-24 23:58:42 +08:00
  • 2cf0df3381 [Bugfix] Fix speculative decode seeded test (#6743) Nick Hill 2024-07-24 08:58:31 -07:00
  • 545146349c Adding f-string to validation error which is missing (#6748) LF Marques 2024-07-24 16:55:53 +01:00
  • f4f8a9d892 [Bugfix]fix modelscope compatible issue (#6730) liuyhwangyh 2024-07-24 20:04:46 +08:00
  • b570811706 [Build/CI] Update run-amd-test.sh. Enable Docker Hub login. (#6711) Alexei-V-Ivanov-AMD 2024-07-24 07:01:14 -05:00
  • ccc4a73257 [Docs][ROCm] Detailed instructions to build from source (#6680) Woosuk Kwon 2024-07-24 01:07:23 -07:00
  • 0a740a11ba [Bugfix] Fix token padding for chameleon (#6724) Roger Wang 2024-07-24 01:05:09 -07:00
  • c882a7f5b3 [SpecDecoding] Update MLPSpeculator CI tests to use smaller model (#6714) Nick Hill 2024-07-24 00:34:22 -07:00
  • 5e8ca973eb [Bugfix] fix flashinfer cudagraph capture for PP (#6708) William Lin 2024-07-23 18:49:44 -07:00
  • 87525fab92 [bitsandbytes]: support read bnb pre-quantized model (#5753) dongmao zhang 2024-07-23 16:45:09 -07:00
  • 2f808e69ab [Bugfix] StatLoggers: cache spec decode metrics when they get collected. (#6645) Thomas Parnell 2024-07-24 01:05:05 +02:00
  • 01c16ede6b [CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) Michael Goin 2024-07-23 18:45:12 -04:00
  • 72fc704803 [build] relax wheel size limit (#6704) youkaichao 2024-07-23 14:03:49 -07:00
  • 1bedf210e3 Bump transformers version for Llama 3.1 hotfix and patch Chameleon (#6690) Roger Wang 2024-07-23 13:47:48 -07:00
  • 507ef787d8 [Model] Pipeline Parallel Support for DeepSeek v2 (#6519) Travis Johnson 2024-07-23 13:22:09 -06:00
  • 58f53034ad [Frontend] Add Usage data in each chunk for chat_serving. #6540 (#6652) Yehoshua Cohen 2024-07-23 21:41:55 +03:00
  • 0eb0757bef [Misc] Add ignored layers for fp8 quantization (#6657) Michael Goin 2024-07-23 14:04:04 -04:00
  • 38c4b7e863 Bump version to 0.5.3.post1 (#6696) [tag: v0.5.3.post1] Simon Mo 2024-07-23 10:08:59 -07:00
  • a112a84aad [BugFix] Fix RoPE error in Llama 3.1 (#6693) Woosuk Kwon 2024-07-23 09:46:05 -07:00
  • 461089a21a [Bugfix] Fix a log error in chunked prefill (#6694) Woosuk Kwon 2024-07-23 09:27:58 -07:00
  • 71950af726 [doc][distributed] fix doc argument order (#6691) youkaichao 2024-07-23 08:55:33 -07:00
  • cb1362a889 [Docs] Announce llama3.1 support (#6688) Woosuk Kwon 2024-07-23 08:18:15 -07:00
  • bb2fc08072 Bump version to v0.5.3 (#6674) [tag: v0.5.3] Simon Mo 2024-07-23 00:00:08 -07:00
  • 3eda4ec780 support ignore patterns in model loader (#6673) Simon Mo 2024-07-22 23:59:42 -07:00
  • 22fa2e35cb [VLM][Model] Support image input for Chameleon (#6633) Roger Wang 2024-07-22 23:50:48 -07:00
  • c5201240a4 [misc] only tqdm for first rank (#6672) youkaichao 2024-07-22 21:57:27 -07:00
  • 97234be0ec [Misc] Manage HTTP connections in one place (#6600) Cyrus Leung 2024-07-23 12:32:02 +08:00
  • c051bfe4eb [doc][distributed] doc for setting up multi-node environment (#6529) youkaichao 2024-07-22 21:22:09 -07:00
  • 9e0b558a09 [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) Michael Goin 2024-07-23 00:11:50 -04:00
  • e519ae097a add tqdm when loading checkpoint shards (#6569) zhaotyer 2024-07-23 11:48:01 +08:00
  • 7c2749a4fd [misc] add start loading models for users information (#6670) youkaichao 2024-07-22 20:08:02 -07:00
  • 729171ae58 [Misc] Enable chunked prefill by default for long context models (#6666) Woosuk Kwon 2024-07-22 20:03:13 -07:00
  • c5e8330997 [Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization (#6665) Cheng Li 2024-07-22 19:25:05 -07:00
  • e0c15758b8 [Core] Modulize prepare input and attention metadata builder (#6596) Cody Yu 2024-07-22 17:45:24 -07:00
  • bdf5fd1386 [Misc] Remove deprecation warning for beam search (#6659) Woosuk Kwon 2024-07-22 17:21:58 -07:00
  • 5a96ee52a3 [ci][build] add back vim in docker (#6661) youkaichao 2024-07-22 16:26:29 -07:00
  • 42c7f66a38 [Core] Support dynamically loading Lora adapter from HuggingFace (#6234) Jiaxin Shan 2024-07-22 15:42:40 -07:00
  • 69d5ae38dc [ci] Use different sccache bucket for CUDA 11.8 wheel build (#6656) Kevin H. Luu 2024-07-22 14:20:41 -07:00
  • fea59c7712 [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) Tyler Michael Smith 2024-07-22 16:08:30 -04:00
  • 739b61a348 [Frontend] Refactor prompt processing (#4028) Cyrus Leung 2024-07-23 01:13:53 +08:00
  • 89c1c6a196 [Bugfix] Fix vocab_size field access in llava_next.py (#6624) Jae-Won Chung 2024-07-22 01:02:51 -04:00
  • 42de2cefcb [Misc] Add a wrapper for torch.inference_mode (#6618) Woosuk Kwon 2024-07-21 18:43:11 -07:00
  • c9eef37f32 [Model] Initial Support for Chameleon (#5770) Roger Wang 2024-07-21 17:37:51 -07:00