Commit Graph

  • 540c0368b1 [Model] Initialize Fuyu-8B support (#3924) Isotr0py 2024-07-14 13:27:14 +08:00
  • fb6af8bc08 [ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 (#6417) Robert Shaw 2024-07-13 23:03:58 -04:00
  • eeceadaecc [Misc] Add deprecation warning for beam search (#6402) Woosuk Kwon 2024-07-13 11:52:22 -07:00
  • babf52dade [ Misc ] More Cleanup of Marlin (#6359) Robert Shaw 2024-07-13 06:21:37 -04:00
  • 9da4aad44b Updating LM Format Enforcer version to v10.3 (#6411) Noam Gat 2024-07-13 13:09:12 +03:00
  • 41708e5034 [ci] try to add multi-node tests (#6280) youkaichao 2024-07-12 21:51:48 -07:00
  • d80aef3776 [Docs] Clean up latest news (#6401) Woosuk Kwon 2024-07-12 19:36:53 -07:00
  • e1684a766a [Bugfix] Fix hard-coded value of x in context_attention_fwd (#6373) Thomas Parnell 2024-07-13 03:30:54 +02:00
  • a27f87da34 [Doc] Fix Typo in Doc (#6392) Saliya Ekanayake 2024-07-12 17:48:23 -07:00
  • 16ff6bd58c [ci] Fix wording for GH bot (#6398) Kevin H. Luu 2024-07-12 16:34:37 -07:00
  • f8f9ff57ee [Bugfix][TPU] Fix megacore setting for v5e-litepod (#6397) Woosuk Kwon 2024-07-12 15:59:47 -07:00
  • 6bc9710f6e Fix release pipeline's dir permission (#6391) Simon Mo 2024-07-12 15:52:43 -07:00
  • 111fc6e7ec [Misc] Add generated git commit hash as vllm.__commit__ (#6386) Michael Goin 2024-07-12 18:52:15 -04:00
  • 75f64d8b94 [Bugfix] Fix illegal memory access in FP8 MoE kernel (#6382) Cody Yu 2024-07-12 14:33:33 -07:00
  • 21b2dcedab Fix release pipeline's -e flag (#6390) Simon Mo 2024-07-12 14:08:04 -07:00
  • 07b35af86d Fix interpolation in release pipeline (#6389) Simon Mo 2024-07-12 14:03:39 -07:00
  • bb1a784b05 Fix release-pipeline.yaml (#6388) Simon Mo 2024-07-12 14:00:57 -07:00
  • d719ba24c5 Build some nightly wheels by default (#6380) Simon Mo 2024-07-12 13:56:59 -07:00
  • aa48e502fb [MISC] Upgrade dependency to PyTorch 2.3.1 (#5327) Cody Yu 2024-07-12 12:04:26 -07:00
  • 4dbebd03cc [ci] Add GHA workflows to enable full CI run (#6381) Kevin H. Luu 2024-07-12 11:36:26 -07:00
  • b75bce1008 [ci] Add grouped tests & mark tests to run by default for fastcheck pipeline (#6365) Kevin H. Luu 2024-07-12 09:58:38 -07:00
  • b039cbbce3 [Misc] add fixture to guided processor tests (#6341) Yihuan Bu 2024-07-12 12:55:39 -04:00
  • f9d25c2519 [Build/CI] Checking/Waiting for the GPU's clean state (#6379) Alexei-V-Ivanov-AMD 2024-07-12 11:42:24 -05:00
  • 024ad87cdc [Bugfix] Fix dtype mismatch in PaliGemma (#6367) Cyrus Leung 2024-07-12 23:22:18 +08:00
  • aea19f0989 [ Misc ] Support Models With Bias in compressed-tensors integration (#6356) Robert Shaw 2024-07-12 11:11:29 -04:00
  • f7160d946a [Misc][Bugfix] Update transformers for tokenizer issue (#6364) Roger Wang 2024-07-12 01:40:07 -07:00
  • 6047187cd8 [ Misc ] Remove separate bias add (#6353) Robert Shaw 2024-07-12 01:06:09 -04:00
  • b6c16cf8ff [ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352) Hongxia Yang 2024-07-12 00:30:46 -04:00
  • d26a8b3f1f [CI/Build] (2/2) Switching AMD CI to store images in Docker Hub (#6350) adityagoel14 2024-07-12 00:26:26 -04:00
  • d59eb98489 [Model][Phi3-Small] Remove scipy from blocksparse_attention (#6343) Michael Goin 2024-07-11 22:47:17 -04:00
  • adf32e0a0f [Bugfix] Fix usage stats logging exception warning with OpenVINO (#6349) Helena Kloosterman 2024-07-12 04:47:00 +02:00
  • 2b0fb53481 [distributed][misc] be consistent with pytorch for libcudart.so (#6346) youkaichao 2024-07-11 19:35:17 -07:00
  • d6ab528997 [Misc] Remove flashinfer warning, add flashinfer tests to CI (#6351) Lily Liu 2024-07-11 18:32:06 -07:00
  • 7ed6a4f0e1 [ BugFix ] Prompt Logprobs Detokenization (#6223) Robert Shaw 2024-07-11 18:02:29 -04:00
  • a4feba929b [CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy (#5362) Kuntai Du 2024-07-11 13:28:38 -07:00
  • 2d23b42d92 [doc] update pipeline parallel in readme (#6347) youkaichao 2024-07-11 11:38:40 -07:00
  • 1df43de9bb [bug fix] Fix llava next feature size calculation. (#6339) xwjiang2010 2024-07-11 10:21:10 -07:00
  • 52b7fcb35a Benchmark: add H100 suite (#6047) Simon Mo 2024-07-11 09:17:07 -07:00
  • b675069d74 [ Misc ] Refactor Marlin Python Utilities (#6082) Robert Shaw 2024-07-11 11:40:11 -04:00
  • 55f692b46e [BugFix] get_and_reset only when scheduler outputs are not empty (#6266) Mor Zusman 2024-07-11 17:40:20 +03:00
  • 8a1415cf77 [Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. (#6326) Thomas Parnell 2024-07-11 16:05:59 +02:00
  • 546b101fa0 [BugFix]: fix engine timeout due to request abort (#6255) pushan 2024-07-11 21:46:31 +08:00
  • 3963a5335b [Misc] refactor(config): clean up unused code (#6320) aniaan 2024-07-11 17:39:07 +08:00
  • c4774eb841 [Bugfix] Fix snapshot download in serving benchmark (#6318) Roger Wang 2024-07-11 00:04:05 -07:00
  • fc17110bbe [BugFix]: set outlines pkg version (#6262) Lim Xiang Yang 2024-07-11 12:37:11 +08:00
  • 439c84581a [Doc] Update description of vLLM support for CPUs (#6003) Jie Fu (傅杰) 2024-07-11 12:15:29 +08:00
  • 99ded1e1c4 [Doc] Remove comments incorrectly copied from another project (#6286) daquexian 2024-07-11 01:05:26 +01:00
  • 997df46a32 [Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor (#6313) Woosuk Kwon 2024-07-10 16:39:02 -07:00
  • ae151d73be [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models (#5765) sroy745 2024-07-10 16:02:47 -07:00
  • 44cc76610d [Bugfix] Fix OpenVINOExecutor abstractmethod error (#6296) sangjune.park 2024-07-11 02:03:32 +09:00
  • b422d4961a [CI/Build] Enable mypy typing for remaining folders (#6268) Benjamin Muskalla 2024-07-10 16:15:55 +02:00
  • c38eba3046 [Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. (#6303) Thomas Parnell 2024-07-10 15:04:07 +02:00
  • e72ae80b06 [Bugfix] Support 2D input shape in MoE layer (#6287) Woosuk Kwon 2024-07-10 06:03:16 -07:00
  • 8a924d2248 [Doc] Guide for adding multi-modal plugins (#6205) Cyrus Leung 2024-07-10 14:55:34 +08:00
  • 5ed3505d82 [Bugfix][TPU] Add prompt adapter methods to TPUExecutor (#6279) Woosuk Kwon 2024-07-09 19:30:56 -07:00
  • da78caecfa [core][distributed] zmq fallback for broadcasting large objects (#6183) youkaichao 2024-07-09 18:49:11 -07:00
  • 2416b26e11 [Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978) Abhinav Goyal 2024-07-10 07:04:02 +05:30
  • d3a245138a [Bugfix]fix and needs_scalar_to_array logic check (#6238) Baoyuan Qi 2024-07-10 07:43:24 +08:00
  • 673dd4cae9 [Docs] Docs update for Pipeline Parallel (#6222) Murali Andoorveedu 2024-07-09 16:24:58 -07:00
  • 4d6ada947c [CORE] Adding support for insertion of soft-tuned prompts (#4645) Swapnil Parekh 2024-07-09 16:26:36 -04:00
  • a0550cbc80 Add support for multi-node on CI (#5955) Kevin H. Luu 2024-07-09 12:56:56 -07:00
  • 08c5bdecae [Bugfix][TPU] Fix outlines installation in TPU Dockerfile (#6256) Woosuk Kwon 2024-07-09 02:56:06 -07:00
  • 5d5b4c5fe5 [Bugfix][TPU] Add missing None to model input (#6245) Woosuk Kwon 2024-07-09 00:21:37 -07:00
  • 70c232f85a [core][distributed] fix ray worker rank assignment (#6235) youkaichao 2024-07-08 21:31:44 -07:00
  • a3c9435d93 [hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability (#6216) youkaichao 2024-07-08 20:02:15 -07:00
  • 4f0e0ea131 Add FlashInfer to default Dockerfile (#6172) Simon Mo 2024-07-08 13:38:03 -07:00
  • ddc369fba1 [Bugfix] Mamba cache Cuda Graph padding (#6214) tomeras91 2024-07-08 21:25:51 +03:00
  • 185ad31f37 [Bugfix] use diskcache in outlines _get_guide #5436 (#6203) Eric 2024-07-09 02:23:24 +08:00
  • 543aa48573 [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888) afeldman-nm 2024-07-08 13:12:15 -04:00
  • f7a8fa39d8 [Kernel] reloading fused_moe config on the last chunk (#6210) Avshalom Manevich 2024-07-08 18:00:38 +03:00
  • 717f4bcea0 Feature/add benchmark testing (#5947) Haichuan 2024-07-08 15:52:06 +08:00
  • 16620f439d do not exclude object field in CompletionStreamResponse (#6196) kczimm 2024-07-07 21:32:57 -05:00
  • 3b08fe2b13 [misc][frontend] log all available endpoints (#6195) youkaichao 2024-07-07 15:11:12 -07:00
  • abfe705a02 [ Misc ] Support Fp8 via llm-compressor (#6110) Robert Shaw 2024-07-07 16:42:11 -04:00
  • 333306a252 add benchmark for fix length input and output (#5857) Haichuan 2024-07-07 15:42:13 +08:00
  • 6206dcb29e [Model] Add PaliGemma (#5189) Roger Wang 2024-07-06 18:25:50 -07:00
  • 9389380015 [Doc] Move guide for multimodal model and other improvements (#6168) Cyrus Leung 2024-07-06 17:18:59 +08:00
  • 175c43eca4 [Doc] Reorganize Supported Models by Type (#6167) Roger Wang 2024-07-05 22:59:36 -07:00
  • bc96d5c330 Move release wheel env var to Dockerfile instead (#6163) Simon Mo 2024-07-05 17:19:53 -07:00
  • f0250620dd Fix release wheel build env var (#6162) Simon Mo 2024-07-05 16:24:31 -07:00
  • 2de490d60f Update wheel builds to strip debug (#6161) Simon Mo 2024-07-05 14:51:25 -07:00
  • 79d406e918 [Docs] Fix readthedocs for tag build (#6158) (tag: v0.5.1) Simon Mo 2024-07-05 12:44:40 -07:00
  • abad5746a7 bump version to v0.5.1 (#6157) Simon Mo 2024-07-05 12:04:51 -07:00
  • e58294ddf2 [Bugfix] Add verbose error if scipy is missing for blocksparse attention (#5695) JGSweets 2024-07-05 12:41:01 -05:00
  • f1e15da6fe [Frontend] Continuous usage stats in OpenAI completion API (#5742) jvlunteren 2024-07-05 19:37:09 +02:00
  • 0097bb1829 [Bugfix] Use templated datasource in grafana.json to allow automatic imports (#6136) Christian Rohmann 2024-07-05 18:49:47 +02:00
  • ea4b570483 [VLM] Cleanup validation and update docs (#6149) Cyrus Leung 2024-07-05 13:49:38 +08:00
  • a41357e941 [VLM] Improve consistency between feature size calculation and dummy data for profiling (#6146) Roger Wang 2024-07-04 18:29:47 -07:00
  • ae96ef8fbd [VLM] Calculate maximum number of multi-modal tokens by model (#6121) Cyrus Leung 2024-07-05 07:37:23 +08:00
  • 69ec3ca14c [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer (#6051) Lily Liu 2024-07-04 16:35:51 -07:00
  • 81d7a50f24 [Hardware][Intel CPU] Adding intel openmp tunings in Docker file (#6008) Yuan 2024-07-05 06:22:12 +08:00
  • 27902d42be [misc][doc] try to add warning for latest html (#5979) youkaichao 2024-07-04 09:57:09 -07:00
  • 56b325e977 [ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention (#6043) Gregory Shtrasberg 2024-07-04 01:19:38 -04:00
  • 3dd507083f [CI/Build] Cleanup VLM tests (#6107) Cyrus Leung 2024-07-04 09:58:18 +08:00
  • 0ed646b7aa [Distributed][Core] Support Py39 and Py38 for PP (#6120) Murali Andoorveedu 2024-07-03 17:52:29 -07:00
  • 1dab9bc8a9 [Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing (#6109) Travis Johnson 2024-07-03 17:56:59 -06:00
  • 3de6e6a30e [core][distributed] support n layers % pp size != 0 (#6115) youkaichao 2024-07-03 16:40:31 -07:00
  • 966fe72141 [doc][misc] bump up py version in installation doc (#6119) youkaichao 2024-07-03 15:52:04 -07:00
  • 62963d129e [ Misc ] Clean Up CompressedTensorsW8A8 (#6113) Robert Shaw 2024-07-03 18:50:08 -04:00
  • d9e98f42e4 [vlm] Remove vision language config. (#6089) xwjiang2010 2024-07-03 15:14:16 -07:00