Commit Graph

  • ebce310b74 [Model] Snowflake arctic model implementation (#4652) Hao Zhang 2024-05-09 15:37:14 -07:00
  • be0c5180ac [Bugfix] Add logs for all model dtype casting (#4717) Michael Goin 2024-05-09 14:36:25 -04:00
  • cea64430f6 [Bugfix] Update grafana.json (#4711) Robert Shaw 2024-05-09 11:10:13 -06:00
  • a3c124570a [Bugfix] Fix CLI arguments in OpenAI server docs (#4709) Cyrus Leung 2024-05-10 00:53:14 +08:00
  • ff5abcd746 [ROCm] Add support for Punica kernels on AMD GPUs (#3140) kliuae 2024-05-10 00:19:50 +08:00
  • 0ee535b294 [Misc] Set block size at initialization & Fix test_model_runner (#4705) Woosuk Kwon 2024-05-09 09:04:59 -07:00
  • 190bc838e1 [Misc] Remove unnecessary ModelRunner imports (#4703) Woosuk Kwon 2024-05-09 00:17:17 -07:00
  • f12b20decc [Frontend] Move async logic outside of constructor (#4674) Cyrus Leung 2024-05-09 13:48:33 +08:00
  • 16bc0a098f [Frontend] add tok/s speed metric to llm class when using tqdm (#4400) Mahmoud Ashraf 2024-05-09 08:02:31 +03:00
  • e288df0632 [Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#4626) alexm-nm 2024-05-08 20:14:31 -04:00
  • 8b9241be3a [Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs (#4672) Cade Daniel 2024-05-08 16:24:46 -07:00
  • f942efb5a3 [Dynamic Spec Decoding] Auto-disable by the running queue size (#4592) Cody Yu 2024-05-08 14:44:00 -07:00
  • 89579a201f [Misc] Use vllm-flash-attn instead of flash-attn (#4686) Woosuk Kwon 2024-05-08 13:15:34 -07:00
  • 230c4b38c1 [CI/Test] fix swap test for multi gpu (#4689) youkaichao 2024-05-08 13:14:02 -07:00
  • 20cfcdec99 [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) youkaichao 2024-05-08 12:07:05 -07:00
  • ad932a221d [Core] Faster startup for LoRA enabled models (#4634) Antoni Baum 2024-05-08 10:33:18 -07:00
  • 5510cf0e8a [Misc] Add get_name method to attention backends (#4685) Woosuk Kwon 2024-05-08 09:59:31 -07:00
  • 0f9a6e3d22 [Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573) DefTruth 2024-05-09 00:19:58 +08:00
  • f6a593093a [CI] Make mistral tests pass (#4596) SangBin Cho 2024-05-09 00:44:35 +09:00
  • d7740ea4dc [Core] Optimize sampler get_logprobs (#4594) SangBin Cho 2024-05-09 00:42:28 +09:00
  • cc466a3290 [Core][Distributed] support cpu&device in broadcast tensor dict (#4660) youkaichao 2024-05-07 19:34:47 -07:00
  • 8344f7742b [Bug fix][Core] fixup ngram not setup correctly (#4551) leiwen83 2024-05-08 02:40:18 +08:00
  • 469f85c782 [Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) youkaichao 2024-05-07 11:06:32 -07:00
  • 10760da800 [Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora (#4609) Austin Veselka 2024-05-07 12:59:07 -05:00
  • 478aed5827 [Build/CI] Fixing 'docker run' to re-enable AMD CI tests. (#4642) Alexei-V-Ivanov-AMD 2024-05-07 11:23:17 -05:00
  • 63575bc2e1 [Core][Optimization] change python dict to pytorch tensor (#4607) youkaichao 2024-05-06 21:30:27 -07:00
  • a98187cf72 [Kernel] Make static FP8 scaling more robust (#4570) Philipp Moritz 2024-05-06 17:39:28 -07:00
  • bd99d22629 Update lm-format-enforcer to 0.10.1 (#4631) Noam Gat 2024-05-07 02:51:59 +03:00
  • 19cb4716ee [CI] Add retry for agent lost (#4633) Cade Daniel 2024-05-06 16:18:57 -07:00
  • e186d37cb1 [CI] use ccache actions properly in release workflow (#4629) Simon Mo 2024-05-06 15:23:36 -07:00
  • 323f27b904 [Bugfix] Fix asyncio.Task not being subscriptable (#4623) Cyrus Leung 2024-05-07 00:31:05 +08:00
  • 0650e5935b Disable cuda version check in vllm-openai image (#4530) zhaoyang-star 2024-05-06 07:58:55 +08:00
  • c7f2cf2b7f [CI] Reduce wheel size by not shipping debug symbols (#4602) (tag: v0.4.2) Simon Mo 2024-05-04 21:28:58 -07:00
  • 8d8357c8ed bump version to v0.4.2 (#4600) Simon Mo 2024-05-04 17:09:49 -07:00
  • 4302987069 [Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937) DearPlanet 2024-05-05 06:39:34 +08:00
  • 021b1a2ab7 [CI] check size of the wheels (#4319) Simon Mo 2024-05-04 13:44:36 -07:00
  • 2a052011ca [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527) Michael Goin 2024-05-04 14:45:16 -04:00
  • 36fb68f947 [Doc] Chunked Prefill Documentation (#4580) SangBin Cho 2024-05-04 16:18:00 +09:00
  • bc8ad68455 [Misc][Refactor] Introduce ExecuteModelData (#4540) Cody Yu 2024-05-03 17:47:07 -07:00
  • 344bf7cd2d [Misc] add installation time env vars (#4574) youkaichao 2024-05-03 15:55:56 -07:00
  • ab50275111 [Speculative decoding] Support target-model logprobs (#4378) Cade Daniel 2024-05-03 15:52:01 -07:00
  • 43c413ec57 [Kernel] Use flashinfer for decoding (#4353) Lily Liu 2024-05-03 15:51:27 -07:00
  • f8e7adda21 Fix/async chat serving (#2727) Sebastian Schoennenbeck 2024-05-03 20:04:14 +02:00
  • 7e65477e5e [Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586) Michael Goin 2024-05-03 13:32:21 -04:00
  • 3521ba4f25 [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) SangBin Cho 2024-05-04 02:20:12 +09:00
  • 2d7bce9cd5 [Doc] add env vars to the doc (#4572) youkaichao 2024-05-02 22:13:49 -07:00
  • ce3f1eedf8 [Misc] remove chunk detected debug logs (#4571) DefTruth 2024-05-03 12:48:08 +08:00
  • 808632d3b4 [BugFix] Prevent the task of _force_log from being garbage collected (#4567) Yang, Bo 2024-05-02 18:35:18 -07:00
  • 344a5d0c33 [Core][Distributed] enable allreduce for multiple tp groups (#4566) youkaichao 2024-05-02 17:32:33 -07:00
  • 0f8a91401c [Core] Ignore infeasible swap requests. (#4557) SangBin Cho 2024-05-03 06:31:20 +09:00
  • 9b5c9f9484 [CI/Build] AMD CI pipeline with extended set of tests. (#4267) Alexei-V-Ivanov-AMD 2024-05-02 14:29:07 -05:00
  • 32881f3f31 [kernel] fix sliding window in prefix prefill Triton kernel (#4405) Michał Moskal 2024-05-02 11:23:37 -07:00
  • 5b8a7c1cb0 [Misc] centralize all usage of environment variables (#4548) youkaichao 2024-05-02 11:13:25 -07:00
  • 1ff0c73a79 [BugFix] Include target-device specific requirements.txt in sdist (#4559) Mark McLoughlin 2024-05-02 18:52:51 +01:00
  • 5ad60b0cbd [Misc] Exclude the tests directory from being packaged (#4552) Hu Dong 2024-05-03 01:50:25 +08:00
  • fb087af52e [mypy][7/N] Cover all directories (#4555) SangBin Cho 2024-05-03 02:47:41 +09:00
  • 7038e8b803 [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) alexm-nm 2024-05-02 12:56:22 -04:00
  • 2a85f93007 [Core][Distributed] enable multiple tp group (#4512) youkaichao 2024-05-01 21:28:21 -07:00
  • cf8cac8c70 [mypy][6/N] Fix all the core subdirectory typing (#4450) SangBin Cho 2024-05-02 12:01:00 +09:00
  • 5e401bce17 [CI]Add regression tests to ensure the async engine generates metrics (#4524) Ronen Schaffer 2024-05-02 05:57:12 +03:00
  • 0d62fe58db [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451) SangBin Cho 2024-05-02 11:24:13 +09:00
  • b8afa8b95a [MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273) Danny Guinther 2024-05-01 20:34:40 -04:00
  • 826b82a260 [Misc] Fix expert_ids shape in MoE (#4517) Woosuk Kwon 2024-05-01 16:47:59 -07:00
  • c9d852d601 [Misc] Remove Mixtral device="cuda" declarations (#4543) Philipp Moritz 2024-05-01 16:30:52 -07:00
  • 6ef09b08f8 [Core][Distributed] fix pynccl del error (#4508) youkaichao 2024-05-01 15:23:06 -07:00
  • 3a922c1e7e [Bugfix][Core] Fix and refactor logging stats (#4336) Roy 2024-05-02 04:08:14 +08:00
  • c47ba4aaa9 [Bugfix] Add validation for seed (#4529) sasha0552 2024-05-01 19:31:22 +00:00
  • 24bb4fe432 [Kernel] Update fused_moe tuning script for FP8 (#4457) Philipp Moritz 2024-05-01 11:47:38 -07:00
  • a657bfc48a [Core] Add multiproc_worker_utils for multiprocessing-based workers (#4357) Nick Hill 2024-05-01 11:41:59 -07:00
  • 24750f4cad [Core] Enable prefix caching with block manager v2 enabled (#4142) leiwen83 2024-05-02 02:20:32 +08:00
  • b38e42fbca [Speculative decoding] Add ngram prompt lookup decoding (#4237) leiwen83 2024-05-02 02:13:03 +08:00
  • 8b798eec75 [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534) Travis Johnson 2024-05-01 12:01:50 -06:00
  • 69909126a7 [Bugfix] Use random seed if seed is -1 (#4531) sasha0552 2024-05-01 17:41:17 +00:00
  • e491c7e053 [Doc] update(example model): for OpenAI compatible serving (#4503) Frαnçois 2024-05-01 19:14:16 +02:00
  • 4dc8026d86 [Bugfix] Fix 307 Redirect for /metrics (#4523) Robert Shaw 2024-05-01 12:14:13 -04:00
  • a88bb9b032 [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173) AnyISalIn 2024-05-02 00:11:03 +08:00
  • 6f1df80436 [Test] Add ignore_eos test (#4519) SangBin Cho 2024-05-01 21:45:42 +09:00
  • d6f4bd7cdd [Misc]Add customized information for models (#4132) Jee Li 2024-05-01 12:18:14 +08:00
  • c3845d82dc Allow user to define whitespace pattern for outlines (#4305) Robert Caulk 2024-05-01 05:48:39 +02:00
  • a822eb3413 [Misc] fix typo in block manager (#4453) Pastel! 2024-05-01 11:41:32 +08:00
  • f458112e8a [Misc][Typo] type annotation fix (#4495) harrywu 2024-05-01 11:21:39 +08:00
  • 2e240c69a9 [Core] Centralize GPU Worker construction (#4419) Nick Hill 2024-04-30 18:06:34 -07:00
  • ee37328da0 Unable to find Punica extension issue during source code installation (#4494) fuchen.ljl 2024-05-01 08:42:09 +08:00
  • 6ad58f42c5 fix_tokenizer_snapshot_download_bug (#4493) fuchen.ljl 2024-05-01 07:38:50 +08:00
  • dd1a50a8bc [Bugfix][Minor] Make ignore_eos effective (#4468) Li, Jiang 2024-05-01 07:33:33 +08:00
  • 715c2d854d [Frontend] [Core] Tensorizer: support dynamic num_readers, update version (#4467) Alpay Ariyak 2024-04-30 19:32:13 -04:00
  • a494140433 [Frontend] Support complex message content for chat completions endpoint (#3467) Florian Greinacher 2024-05-01 01:28:46 +02:00
  • 111815d482 [Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) Robert Shaw 2024-04-30 17:46:12 -04:00
  • b31a1fb63c [Doc] add visualization for multi-stage dockerfile (#4456) Prashant Gupta 2024-04-30 10:41:59 -07:00
  • 4bb53e2dde [BugFix] fix num_lookahead_slots missing in async executor (#4165) leiwen83 2024-05-01 01:12:59 +08:00
  • 26f2fb5113 [Core]Refactor gptq_marlin ops (#4466) Kunshang Ji 2024-04-30 12:14:47 +00:00
  • fa32207842 [Bugfix][Kernel] Fix compute_type for MoE kernel (#4463) Woosuk Kwon 2024-04-29 22:05:40 -07:00
  • d627a3d837 [Misc] Upgrade to torch==2.3.0 (#4454) Michael Goin 2024-04-29 20:05:47 -04:00
  • f4f921b7f1 [Core][Distributed] use cpu group to broadcast metadata in cpu (#4444) youkaichao 2024-04-29 13:52:22 -07:00
  • ac5ccf0156 [CI] hotfix: soft fail neuron test (#4458) Simon Mo 2024-04-29 12:50:01 -07:00
  • 73c8d677e5 [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922) Robert Shaw 2024-04-29 12:35:34 -04:00
  • df29793dc7 [mypy][5/N] Support all typing on model executor (#4427) SangBin Cho 2024-04-29 11:01:26 +09:00
  • 03dd7d52bf [CI] clean docker cache for neuron (#4441) Simon Mo 2024-04-28 16:32:07 -07:00
  • bf480c5302 Add more Prometheus metrics (#2764) Ronen Schaffer 2024-04-29 01:59:33 +03:00
  • 9c7306ac11 [Misc] fix typo in llm_engine init logging (#4428) DefTruth 2024-04-28 18:58:30 +08:00