Commit Graph

  • 16422ea76f [misc][plugin] add plugin system implementation (#7426) youkaichao 2024-08-13 16:24:17 -07:00
  • 373538f973 [Misc] compressed-tensors code reuse (#7277) Kyle Sayers 2024-08-13 19:05:15 -04:00
  • 33e5d7e6b6 [frontend] spawn engine process from api server process (#7484) youkaichao 2024-08-13 15:40:17 -07:00
  • c5c7768264 Announce NVIDIA Meetup (#7483) Simon Mo 2024-08-13 14:28:36 -07:00
  • b1e5afc3e7 [Misc] Update awq and awq_marlin to use vLLMParameters (#7422) Dipika Sikka 2024-08-13 17:08:20 -04:00
  • d3bdfd3ab9 [Misc] Update Fused MoE weight loading (#7334) Dipika Sikka 2024-08-13 14:57:45 -04:00
  • fb377d7e74 [Misc] Update gptq_marlin to use new vLLMParameters (#7281) Dipika Sikka 2024-08-13 14:30:11 -04:00
  • 181abbc27d [Misc] Update LM Eval Tolerance (#7473) Dipika Sikka 2024-08-13 14:28:14 -04:00
  • 00c3d68e45 [Frontend][Core] Add plumbing to support audio language models (#7446) Peter Salas 2024-08-13 10:39:33 -07:00
  • e20233d361 Revert "[Doc] Update supported_hardware.rst (#7276)" (#7467) Woosuk Kwon 2024-08-13 01:37:08 -07:00
  • d6e634f3d7 [TPU] Suppress import custom_ops warning (#7458) Woosuk Kwon 2024-08-13 00:30:30 -07:00
  • 4d2dc5072b [hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102) youkaichao 2024-08-13 00:16:42 -07:00
  • 7025b11d94 [Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) Cyrus Leung 2024-08-13 13:33:41 +08:00
  • 5469146bcc [ci] Remove fast check cancel workflow (#7455) Kevin H. Luu 2024-08-12 21:19:51 -07:00
  • 97a6be95ba [Misc] improve logits processors logging message (#7435) Andrew Wang 2024-08-12 19:29:34 -07:00
  • 9ba85bc152 [mypy] Misc. typing improvements (#7417) Cyrus Leung 2024-08-13 09:20:20 +08:00
  • 198d6a2898 [Core] Shut down aDAG workers with clean async llm engine exit (#7224) Rui Qiao 2024-08-12 17:57:16 -07:00
  • 774cd1d3bf [CI/Build] bump minimum cmake version (#6999) Daniele 2024-08-13 01:29:20 +02:00
  • 91294d56e1 [Bugfix] Handle PackageNotFoundError when checking for xpu version (#7398) sasha0552 2024-08-12 23:07:20 +00:00
  • a046f86397 [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) jon-chuang 2024-08-12 15:47:41 -07:00
  • 4ddc4743d7 [Core] Consolidate GB constant and enable float GB arguments (#7416) Cyrus Leung 2024-08-13 05:14:14 +08:00
  • 6aa33cb2dd [Misc] Use scalar type to dispatch to different gptq_marlin kernels (#7323) Lucas Wilkinson 2024-08-12 14:40:13 -04:00
  • 1137f343aa [ci] Cancel fastcheck when PR is ready (#7433) Kevin H. Luu 2024-08-12 10:59:14 -07:00
  • 9b3e2edd30 [ci] Cancel fastcheck run when PR is marked ready (#7427) Kevin H. Luu 2024-08-12 10:56:52 -07:00
  • 65950e8f58 [ci] Entrypoints run upon changes in vllm/ (#7423) Kevin H. Luu 2024-08-12 10:18:03 -07:00
  • cfba4def5d [Bugfix] Fix logit soft cap in flash-attn backend (#7425) Woosuk Kwon 2024-08-12 09:58:28 -07:00
  • d2bc4510a4 [CI/Build] bump Dockerfile.neuron image base, use public ECR (#6832) Daniele 2024-08-12 18:53:35 +02:00
  • 24154f8618 [Frontend] Disallow passing model as both argument and option (#7347) Cyrus Leung 2024-08-12 20:58:34 +08:00
  • e6e42e4b17 [Core][VLM] Support image embeddings as input (#6613) Roger Wang 2024-08-12 01:16:06 -07:00
  • ec2affa8ae [Kernel] Flashinfer correctness fix for v0.1.3 (#7319) Lily Liu 2024-08-12 00:59:17 -07:00
  • 86ab567bae [CI/Build] Minor refactoring for vLLM assets (#7407) Roger Wang 2024-08-11 19:41:52 -07:00
  • f020a6297e [Docs] Update readme (#7316) Simon Mo 2024-08-11 17:13:37 -07:00
  • 6c8e595710 [misc] add commit id in collect env (#7405) youkaichao 2024-08-11 15:40:48 -07:00
  • 02b1988b9f [Doc] building vLLM with VLLM_TARGET_DEVICE=empty (#7403) tomeras91 2024-08-12 00:38:17 +03:00
  • 386087970a [CI/Build] build on empty device for better dev experience (#4773) tomeras91 2024-08-11 23:09:44 +03:00
  • c08e2b3086 [core] [2/N] refactor worker_base input preparation for multi-step (#7387) William Lin 2024-08-11 08:50:08 -07:00
  • 4fb7b52a2c Updating LM Format Enforcer version to v0.10.6 (#7189) Noam Gat 2024-08-11 15:11:50 +03:00
  • 90bab18f24 [TPU] Use mark_dynamic to reduce compilation time (#7340) Woosuk Kwon 2024-08-10 18:12:22 -07:00
  • 4c5d8e8ea9 [Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392) Isotr0py 2024-08-11 00:19:33 +08:00
  • baa240252e [Core] Fix edge case in chunked prefill + block manager v2 (#7380) Cade Daniel 2024-08-09 16:48:49 -07:00
  • 999ef0b917 [Misc] Add numpy implementation of compute_slot_mapping (#7377) Antoni Baum 2024-08-09 15:52:29 -07:00
  • 5c6c54d67a [Bugfix] Fix PerTensorScaleParameter weight loading for fused models (#7376) Dipika Sikka 2024-08-09 17:23:46 -04:00
  • 933790c209 [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) Mahesh Keralapura 2024-08-09 13:55:13 -07:00
  • 70d268a399 [Bugfix] Fix ITL recording in serving benchmark (#7372) Roger Wang 2024-08-09 10:00:00 -07:00
  • 249b88228d [Frontend] Support embeddings in the run_batch API (#7132) Pooya Davoodi 2024-08-09 09:48:21 -07:00
  • 74af2bbd90 [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360) Alexander Matveev 2024-08-09 12:35:49 -04:00
  • fc7b8d1eef [Performance] e2e overheads reduction: Small followup diff (#7364) Alexander Matveev 2024-08-09 11:49:36 -04:00
  • 67abdbb42f [VLM][Doc] Add stop_token_ids to InternVL example (#7354) Isotr0py 2024-08-09 22:51:04 +08:00
  • 07ab160741 [Model][Jamba] Mamba cache single buffer (#6739) Mor Zusman 2024-08-09 17:07:06 +03:00
  • b4e9528f95 [Core] Streamline stream termination in AsyncLLMEngine (#7336) Nick Hill 2024-08-09 00:06:36 -07:00
  • 57b7be0e1c [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) William Lin 2024-08-08 22:42:45 -07:00
  • 99b4cf5f23 [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218) Travis Johnson 2024-08-08 23:08:46 -06:00
  • e02ac55617 [Performance] Optimize e2e overheads: Reduce python allocations (#7162) Alexander Matveev 2024-08-09 00:34:28 -04:00
  • 73388c07a4 [TPU] Fix dockerfile.tpu (#7331) Woosuk Kwon 2024-08-08 20:24:58 -07:00
  • 7eb4a51c5f [Core] Support serving encoder/decoder models (#7258) Cyrus Leung 2024-08-09 10:39:41 +08:00
  • 0fa14907da [TPU] Add Load-time W8A16 quantization for TPU Backend (#7005) Siyuan Liu 2024-08-08 18:35:49 -07:00
  • 5923532e15 Add Skywork AI as Sponsor (#7314) Simon Mo 2024-08-08 13:59:57 -07:00
  • a049b107e2 [Misc] Temporarily resolve the error of BitAndBytes (#7308) Jee Jee Li 2024-08-09 04:42:58 +08:00
  • 8334c39f37 [Bugfix] Fix new Llama3.1 GGUF model loading (#7269) Isotr0py 2024-08-09 04:42:44 +08:00
  • e904576743 [CI/Build] Dockerfile.cpu improvements (#7298) Daniele 2024-08-08 21:24:52 +02:00
  • e14fb22e59 [Doc] Put collect_env issue output in a <detail> block (#7310) Michael Goin 2024-08-08 14:22:49 -04:00
  • 782e53ab59 [Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) Zach Zheng 2024-08-08 10:43:30 -07:00
  • 21b9c49aa3 [Frontend] Kill the server on engine death (#6594) Joe Runde 2024-08-08 10:47:48 -06:00
  • 5fb4a3f678 [Bugfix][Kernel] Increased atol to fix failing tests (#7305) Luka Govedič 2024-08-08 12:16:13 -04:00
  • 757ac70a64 [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 (#7273) Jee Jee Li 2024-08-08 22:02:41 +08:00
  • 6dffa4b0a6 [Bugfix] Fix LoRA with PP (#7292) Murali Andoorveedu 2024-08-08 00:02:27 -07:00
  • 48abee9e54 [Frontend] remove max_num_batched_tokens limit for lora (#7288) Cherilyn Buren 2024-08-08 14:17:29 +08:00
  • 746709642c [Misc] Fix typos in scheduler.py (#7285) Rui Qiao 2024-08-07 17:06:01 -07:00
  • e53dfd3eaf [Kernel] Fix Flashinfer Correctness (#7284) Lily Liu 2024-08-07 16:26:52 -07:00
  • 6d94420246 [Doc] Update supported_hardware.rst (#7276) Michael Goin 2024-08-07 17:21:50 -04:00
  • fc1493a01e [FrontEnd] Make merge_async_iterators is_cancelled arg optional (#7282) Nick Hill 2024-08-07 13:35:14 -07:00
  • 311f743831 [Bugfix] Fix gptq failure on T4s (#7264) Lucas Wilkinson 2024-08-07 16:05:37 -04:00
  • 469b3bc538 [ci] Make building wheels per commit optional (#7278) Kevin H. Luu 2024-08-07 11:34:25 -07:00
  • 5223199e03 [Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) Michael Goin 2024-08-07 14:23:12 -04:00
  • fde47d3bc2 [BugFix] Fix frontend multiprocessing hang (#7217) Maximilien de Bayser 2024-08-07 15:09:36 -03:00
  • 0e12cd67a8 [Doc] add online speculative decoding example (#7243) Stas Bekman 2024-08-07 09:58:02 -07:00
  • 80cbe10c59 [OpenVINO] migrate to latest dependencies versions (#7251) Ilya Lavrenov 2024-08-07 20:49:10 +04:00
  • b764547616 [Bugfix] Fix input processor for InternVL2 model (#7164) Isotr0py 2024-08-08 00:32:07 +08:00
  • ab0f5e2823 Fixes typo in function name (#7275) Rafael Vasquez 2024-08-07 12:29:27 -04:00
  • 564985729a [ BugFix ] Move zmq frontend to IPC instead of TCP (#7222) Robert Shaw 2024-08-07 12:24:56 -04:00
  • 0f7052bc7e [Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 (#5874) Dipika Sikka 2024-08-07 12:17:58 -04:00
  • 639159b2a6 [distributed][misc] add specialized method for cuda platform (#7249) youkaichao 2024-08-07 08:54:52 -07:00
  • 66d617e343 [Frontend] Gracefully handle missing chat template and fix CI failure (#7238) Cyrus Leung 2024-08-07 17:12:05 +08:00
  • 7b261092de [BUGFIX]: top_k is expected to be an integer. (#7227) Atilla Akkuş 2024-08-07 10:32:16 +03:00
  • 2385c8f374 [Doc] Mock new dependencies for documentation (#7245) Roger Wang 2024-08-06 23:43:03 -07:00
  • 9a3f49ae07 [BugFix] Overhaul async request cancellation (#7111) Nick Hill 2024-08-06 22:21:41 -07:00
  • f9a5600649 [Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225) Michael Goin 2024-08-06 21:34:26 -04:00
  • fd95e026e0 [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) afeldman-nm 2024-08-06 16:51:47 -04:00
  • 660470e5a3 [Core] Optimize evictor-v2 performance (#7193) xiaobochen123 2024-08-07 03:34:25 +08:00
  • 8d59dbb000 [Kernel] Add per-tensor and per-token AZP epilogues (#5941) Luka Govedič 2024-08-06 14:17:08 -04:00
  • 5c60c8c423 [SpecDecode] [Minor] Fix spec decode sampler tests (#7183) Lily Liu 2024-08-06 10:40:32 -07:00
  • 00afc78590 [Bugfix] add gguf dependency (#7198) Katarzyna Papis 2024-08-06 19:08:35 +02:00
  • 541c1852d3 [ BugFix ] Fix ZMQ when VLLM_PORT is set (#7205) Robert Shaw 2024-08-06 12:26:26 -04:00
  • a3bbbfa1d8 [BugFix] Fix DeepSeek remote code (#7178) Dipika Sikka 2024-08-06 11:16:53 -04:00
  • 1f26efbb3a [Model] Support SigLIP encoder and alternative decoders for LLaVA models (#7153) Cyrus Leung 2024-08-06 16:55:31 +08:00
  • 9118217f58 [LoRA] Relax LoRA condition (#7146) Jee Jee Li 2024-08-06 09:57:25 +08:00
  • e3c664bfcb [Build] Add initial conditional testing spec (#6841) Simon Mo 2024-08-05 17:39:22 -07:00
  • 360bd67cf0 [Core] Support loading GGUF model (#5191) Isotr0py 2024-08-06 07:54:23 +08:00
  • ef527be06c [MISC] Use non-blocking transfer in prepare_input (#7172) Cody Yu 2024-08-05 16:41:27 -07:00
  • 89b8db6bb2 [Bugfix] Specify device when loading LoRA and embedding tensors (#7129) Jacob Schein 2024-08-05 16:35:47 -07:00