Compare commits

..

1433 Commits

Author SHA1 Message Date
youkaichao
0408efc6d0 [Misc] Improve error message for incorrect pynvml (#12809)
Some checks failed
Create Release / Create Release (push) Has been cancelled
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-06 15:23:50 +08:00
Michael Goin
449d1bce02 [Misc] Remove duplicated DeepSeek V2/V3 model definition (#12793) 2025-02-05 23:16:20 -08:00
Harry Mellor
1a6fcad4c9 Improve TransformersModel UX (#12785) 2025-02-05 22:24:57 -08:00
Lu Fang
56534cd577 [Bugfix] Fix the test_ultravox.py's license (#12806)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-06 13:25:54 +08:00
Sumit Vij
d88506dda4 [Model] LoRA Support for Ultravox model (#11253) 2025-02-05 19:54:13 -08:00
Lu Fang
9cdea30b4f [Misc][Easy] Remove the space from the file name 2025-02-05 19:23:35 -08:00
Lucas Wilkinson
76abd0c881 [Bugfix] Better FP8 supported defaults 2025-02-05 19:22:19 -08:00
Gregory Shtrasberg
5b19b93082 [ROCm][Kernel] Using the correct warp_size value 2025-02-05 19:15:08 -08:00
Cyrus Leung
75404d041b [VLM] Update compatibility with transformers 4.49 2025-02-05 19:09:45 -08:00
Roger Wang
bf3b79efb8 [VLM] Qwen2.5-VL 2025-02-05 13:31:38 -08:00
Russell Bryant
9a5b1554b4 [Docs] Drop duplicate [source] links 2025-02-05 13:30:50 -08:00
Cyrus Leung
a4ce74c14a [VLM] Use shared field to pass token ids to model 2025-02-05 13:30:46 -08:00
Rahul Tuli
3b2005e1db Add: Support for Sparse24Bitmask Compressed Models 2025-02-05 13:30:43 -08:00
Sanju C Sudhakaran
af8486de49 [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) 2025-02-05 13:29:45 -08:00
Chen Zhang
4c3aac51e1 Merging PR #12536
Merged via CLI script
2025-02-05 13:24:26 -08:00
youkaichao
bc1bdecebf [core][distributed] exact ray placement control (#12732)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-06 02:03:19 +08:00
Akash kaothalkar
022bcc701a [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (#12546) 2025-02-04 23:11:02 -08:00
Michael Goin
c53dc466b1 [Doc] Remove performance warning for auto_awq.md (#12743) 2025-02-04 22:43:11 -08:00
Nick Hill
3d09e592a8 [V1][Misc] Shorten FinishReason enum and use constant strings (#12760) 2025-02-04 22:43:02 -08:00
Harry Mellor
fcf2e3d7fc [Bugfix] Fix OpenVINO model runner (#12750) 2025-02-04 22:42:46 -08:00
Michael Goin
58b218d7ae [Doc] Update PR Reminder with link to Developer Slack (#12748) 2025-02-04 22:42:09 -08:00
Kyle Sayers
7ff7a638b6 [Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-02-05 05:32:06 +00:00
Dipika Sikka
686006a220 [Misc] Bump the compressed-tensors version (#12736) 2025-02-04 20:44:48 -08:00
Isotr0py
98fd089fc9 [VLM] Add MLA with pure RoPE support for deepseek-vl2 models (#12729) 2025-02-04 20:44:26 -08:00
Harry Mellor
249824c3bf Refactor Linear handling in TransformersModel (#12727)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-05 04:31:12 +00:00
Aleksandr Malyshev
64862d106e [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-02-05 03:58:22 +00:00
Aviv Keshet
b3a0d01e45 [Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS (#12368)
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
2025-02-04 18:46:26 -08:00
Lucas Wilkinson
75e94309e8 [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-02-04 18:22:24 -08:00
Mark McLoughlin
233df6f5c4 [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-02-04 19:46:54 -05:00
Cyrus Leung
18016a5e62 [Bugfix] Fix CI failures for InternVL and Mantis models (#12728)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-04 23:54:23 +08:00
Sophie du Couédic
649550f27e [Build] update requirements of no-device for plugin usage (#12630)
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
2025-02-04 21:19:12 +08:00
Kero Liang
62467a834a Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722)
Signed-off-by: imkero <kerorek@outlook.com>
2025-02-04 21:03:19 +08:00
Michael Greenbaum
6469038b14 [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689)
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com>
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com>
2025-02-04 20:58:48 +08:00
Isotr0py
815079de8e [VLM] merged multimodal processor and V1 support for idefics3 (#12660)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-02-04 20:00:51 +08:00
Woosuk Kwon
18a88fcccc [V1] Remove scheduling constraint on partial requests (#12674)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-04 02:43:58 -08:00
Cyrus Leung
d1ca7df84d [VLM] Merged multi-modal processor for InternVL-based models (#12553)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-02-04 16:44:52 +08:00
Jee Jee Li
96b23621c1 [Misc] Add BNB quantization for Whisper (#12381)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-02-04 16:27:36 +08:00
Hongxia Yang
c36ac98d01 [AMD][ROCm] Enable DeepSeek model on ROCm (#12662)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
2025-02-04 08:24:11 +00:00
Kyle Sayers
4896d0c2dd [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) 2025-02-03 23:27:11 -08:00
Thomas Parnell
bb392af434 [Doc] Replace ibm-fms with ibm-ai-platform (#12709)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-02-04 07:05:04 +00:00
Michael Goin
5d98d56089 Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-02-04 11:55:46 +08:00
Russell Bryant
73b35cca7f [Core] Improve hash collision avoidance in prefix caching (#12621)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-03 16:28:20 -08:00
Cody Yu
5095e96606 [V1] Revert uncache_blocks and support recaching full blocks (#12415)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-03 15:04:53 -08:00
Cody Yu
cf58b9c4ca [MISC] Remove model input dumping when exception (#12582)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-03 13:34:16 -08:00
kushanam
4797dad3ec [Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) 2025-02-03 13:30:39 -08:00
Kyle Sayers
6dd5e52823 Squelch MLA warning for Compressed-Tensors Models (#12704)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-02-03 13:29:56 -08:00
Tyler Michael Smith
c11de33dad [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-03 13:04:59 -08:00
Russell Bryant
33e0602e59 [Misc] Fix improper placement of SPDX header in scripts (#12694)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-03 11:16:59 -08:00
Arthur
a1a2aaadb9 [Model]: Add transformers backend support (#11330)
# Adds support for `transformers` as a backend

Following https://github.com/huggingface/transformers/pull/35235, a
bunch of models should already be supported, we are ramping up support
for more models.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes: 
- `trust_remote_code=True` support: any model on the hub, if it
implements attention the correct way can be natively supported!!
- tensor parallel support

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-02-03 21:30:38 +08:00
youkaichao
1298a400e8 [ci/build] fix gh200 test (#12681)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 15:59:49 +08:00
youkaichao
ad4a9dc817 [cuda] manually import the correct pynvml module (#12679)
fixes problems like https://github.com/vllm-project/vllm/pull/12635 and
https://github.com/vllm-project/vllm/pull/12636 and
https://github.com/vllm-project/vllm/pull/12565

---------

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 15:58:21 +08:00
Srikanth Srinivas
b9986454fe Fix for attention layers to remain unquantized during moe_wn16 quant (#12570)
Fix to AWQ quant loading of the new R1 model

The new optimized MoE kernels for a large number of experts `moe_wn16`
uses AWQ quant which requires the attention layers to be in 16bit

The current merge has broken this, and the `get_quant_method` must
return None for it to work correctly again

---------

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Beim <805908499@qq.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: fade_away <1028552010@qq.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Shawn Du <shawnd200@outlook.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-02-03 13:46:19 +08:00
Eldar Kurtic
c5932e5dac Properly check if all fused layers are in the list of targets (#12666)
Thanks @kylesayrs for catching this!
2025-02-03 13:42:18 +08:00
youkaichao
20579c0fae make sure mistral_common not imported for non-mistral models (#12669)
When people use deepseek models, they find that they need to solve cv2
version conflict, see https://zhuanlan.zhihu.com/p/21064432691 .

I added the check, and make all imports of `cv2` lazy.

---------

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 13:40:25 +08:00
Yang Chen
95460fc513 [Kernel] port sgl moe_align_block_size kernels (#12574)
sgl_moe_align_block_size is based on:


ded9fcd09a

moe_align_block_size is based on:


ba5112ff69

Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-03 13:09:50 +08:00
Zhuohan Li
326fcc8b9f [Doc] Deprecate Discord (#12668) 2025-02-02 19:19:56 -08:00
youkaichao
e64330910b [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667)
As more and more people are trying deepseek models with multi-node
inference, https://github.com/vllm-project/vllm/issues/7815 becomes more
frequent. Let's give clear message to users.

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 09:32:18 +08:00
Russell Bryant
e489ad7a21 [Misc] Add SPDX-License-Identifier headers to python source files (#12628)
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
This commit adds SPDX license headers to python source files as
recommended to
the project by the Linux Foundation. These headers provide a concise way
that is
both human and machine readable for communicating license information
for each
source file. It helps avoid any ambiguity about the license of the code
and can
    also be easily used by tools to help manage license compliance.
    
The Linux Foundation runs license scans against the codebase to help
ensure
    we are in compliance with the licenses of the code we use, including
dependencies. Having these headers in place helps that tool do its job.
    
    More information can be found on the SPDX site:
    
    - https://spdx.dev/learn/handling-license-info/
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 11:58:18 -08:00
Kunshang Ji
f256ebe4df [Hardware][Intel GPU] add XPU bf16 support (#12392)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-02-02 10:17:26 +00:00
Shawn Du
f8ece6e17f [Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608)
As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254,
this PR achieves the task: combine allocate_slots and append_slots.

There should be no functionality change, except that in decode, also
raise exception when num_tokens is zero (like prefill), and change the
unit test case accordingly.

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo

---------

Signed-off-by: Shawn Du <shawnd200@outlook.com>
2025-02-02 16:40:58 +08:00
Woosuk Kwon
abfcdcdf27 [V1][Minor] Avoid frequently creating ConstantList (#12653)
A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-01 23:43:20 -08:00
Russell Bryant
e497f33491 [Core] Silence unnecessary deprecation warnings (#12620)
I noticed during testing that I was getting a lot of these deprecation
warnings about `local_lora_path`:

```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
     and will be removed in a future version.
     Please use 'lora_path' instead.
```

The check used for emitting this warning was always True, even when the
parameter was not actually specified. It will always be in
`__struct_fields__`. We should be checking for a non-None value,
instead.

Signed-off-by: Russell Bryant <rbryant@redhat.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 15:35:50 +08:00
Jinzhen Lin
baaa2b24da [Bugfix] fix moe_wna16 get_quant_method (#12648)
Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always return moe method,
GPTQ-based linear method or AWQ-based linear method, even when the
target module is attention layer.


baeded2569/vllm/attention/layer.py (L86-L92)

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-02-02 15:29:56 +08:00
Vicente Herrera
b4e5c03306 doc: fixing minor typo in readme.md (#12643)
Word "evolved" was mistyped

Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>

---------

Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
2025-02-01 17:17:29 +00:00
Michael Goin
3194039c0e Apply torch.compile to fused_moe/grouped_topk (#12637) 2025-02-01 16:16:19 +00:00
Simon Mo
4f4d427ac2 Disable chunked prefill and/or prefix caching when MLA is enabled (#12642)
Some checks failed
Create Release / Create Release (push) Has been cancelled
From @mgoin in https://github.com/vllm-project/vllm/pull/12638

I cannot push to that branch, therefore a new PR to unblock release.

---------

Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-31 23:46:57 -08:00
Russell Bryant
1e3698393f [CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280)
We have `v1`, `structured-output`, and `speculative-decoding` labels on
github. This adds automation for applying these labels based on the
files touched by a PR.

Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-31 23:13:10 -08:00
Lucas Wilkinson
baeded2569 [Attention] Deepseek v3 MLA support with FP8 compute (#12601)
This PR implements the Deepseek V3 support by performing matrix absorption the fp8 weights 

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
2025-01-31 21:52:51 -08:00
Rahul Tuli
3e1c76cf3a Fix: Respect sparsity_config.ignore in Cutlass Integration (#12517)
This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.

---

### Changes

- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.

---

<details>
<summary>Testing Setup</summary>

The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097).

#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```

#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                              QuantizationStrategy,
                                              QuantizationType)
 from pydantic import BaseModel
-
+from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
     should_ignore_layer)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
 __all__ = ["CompressedTensorsLinearMethod"]
 
 SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```

Apply using:
```bash
git apply logging-patch.patch
```

</details>

---

<details>
<summary>Models Tested</summary>

- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24` 
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
-
`nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
-
`nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`

</details>

---


<details>
<summary>Example Output</summary>

#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```

#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```

</details>

**Note:** Assumed all modules in fused layers such as `QKV_proj` and
`Gate_up_proj` follow the same quantization/pruning scheme.

---

For related tasks using the Asana app for GitHub, refer to [[this
link](https://app.asana.com/0/0/1209227810815160)](https://app.asana.com/0/0/1209227810815160).

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
2025-02-01 13:41:59 +08:00
Tyler Michael Smith
cfa134d247 [Bugfix/CI] Fixup benchmark_moe.py (#12562)
Fixes `is_marlin` not being passed into `get_default_config`

Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size`

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-01 13:41:35 +08:00
Kevin H. Luu
35b7a05507 [ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599) 2025-01-31 21:22:23 -08:00
Eldar Kurtic
1867c258bd Fix target matching for fused layers with compressed-tensors (#12617)
Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).

Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```

To reproduce the vLLM error: 
```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
2025-02-01 05:07:46 +00:00
fade_away
cb3e73e4c8 [BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161)
FIX issue https://github.com/vllm-project/vllm/issues/9688
https://github.com/vllm-project/vllm/issues/11086 #12487

---------

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-02-01 12:52:07 +08:00
Robert Shaw
b1340f9d55 [V1] Bugfix: Validate Model Input Length (#12600)
SUMMARY:
* avoid crashing the engine when we get an input longer than
max_model_len

FIX #12567(*link existing issues this PR will resolve*)
2025-01-31 18:32:04 -08:00
Brian Dellabetta
44bbca78d7 [Doc] int4 w4a16 example (#12585)
Based on a request by @mgoin , with @kylesayrs we have added an example
doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
quantization example and the example available in
[`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py)

FIX #n/a (no issue created)

@kylesayrs and I have discussed a couple additional improvements for the
quantization docs. We will revisit at a later date, possibly including:
- A section for "choosing the correct quantization scheme/ compression
technique"
- Additional vision or audio calibration datasets

---------

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-31 15:38:48 -08:00
Harry Mellor
60808bd4c7 [Doc] Improve installation signposting (#12575)
- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-31 15:38:35 -08:00
Ryan Nguyen
fc542144c4 [Feature] Fix guided decoding blocking bitmask memcpy (#12563)
**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR changes
the operation to async via setting `non-blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
pre-empts the sampling kernels after bitmask application. Below is the
Nsys profile for one decode phase from Llama 3.1 8B.

![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

With the optimization, this is no longer the case:

![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

---------

Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
2025-01-31 15:37:30 -08:00
Tyler Michael Smith
eb5741ad42 [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587)
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-31 15:29:11 -08:00
Robert Shaw
145c2ff648 [Bugfix] Revert MoE Triton Config Default (#12629)
SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-31 15:28:47 -08:00
Kevin H. Luu
415f19474d [release] Add input step to ask for Release version (#12631)
Instead of having to create a new build with release version put in as
env var.
2025-01-31 13:39:36 -08:00
Chen Zhang
89003c4082 [v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603)
This pr adds extra key to block hash, to generate different hash value
for two blocks with the same token string but different extra_keys in
their parent blocks. For example, it can generate different hash value
for the second block of the following two requests:
```python
request1 = make_request(
        request_id=0,
        prompt_token_ids=[_ for _ in range(6)],
        mm_positions=[{
            "offset": 0,
            "length": 3
        }, {
            "offset": 3,
            "length": 3
        }],
        mm_hashes=["hash1", "hash2"],
    )
    request2 = make_request(
        request_id=1,
        prompt_token_ids=[_ for _ in range(6)],
        mm_positions=[{
            "offset": 0,
            "length": 3
        }, {
            "offset": 3,
            "length": 3
        }],
        mm_hashes=["hash3", "hash2"],
    )
```

---------

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-31 13:13:04 -08:00
Cody Yu
60bcef000e [Docs][V1] Prefix caching design (#12598)
- Create v1 design document section in docs.
- Add prefix caching design doc.

@WoosukKwon @ywang96

---------

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-31 12:30:46 -08:00
Cody Yu
847f883232 [Git] Automatically sign-off commits (#12595)
It's very annoying when I forgot to add `-s` in `git commit` to
sign-off, because I then need to `git rebase HEAD~1 --signoff` and `git
push -f` to fix the DCO. This PR adds a hook to sign off commits
automatically when `-s` is missing to solve this problem. The only
change from the user side is now users have to install 2 hooks, so
instead of just

```
pre-commit install
```

Now we need to

```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```

Note that even if users still only install the pre-commit hook, they
won't get any error in `git commit`. Just the sign-off hook won't run.

cc @hmellor @youkaichao

---------

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-31 12:30:33 -08:00
Robert Shaw
325f679f32 [BugFix] Fix Torch.Compile For DeepSeek (#12594)
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-01-31 12:06:39 -08:00
Harry Mellor
e3f7ff65e7 Add favicon to docs (#12611)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-31 09:20:34 -08:00
Roger Wang
7a8987dac5 [Bugfix] Gracefully handle huggingface hub http error (#12571) 2025-01-31 08:19:35 +00:00
Lucas Wilkinson
cabaf4eff3 [Attention] MLA decode optimizations (#12528)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-01-30 23:49:37 -08:00
Aleksandr Malyshev
a1fc18c030 [ROCm][AMD][Model] llama 3.2 support upstreaming (#12421)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-01-31 12:24:28 +08:00
Lucas Wilkinson
9798b2fb00 [Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling (#11868) 2025-01-30 18:33:00 -08:00
Michael Goin
4078052f09 [V1][Log] Add max request concurrency log to V1 (#12569)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-30 23:07:19 +00:00
Nishidha
bd2107e30a [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555)
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
2025-01-30 16:29:39 -05:00
Robert Shaw
9b0c4bab36 [Kernel] Triton Configs for Fp8 Block Quantization (#11589)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-01-30 11:53:22 -08:00
Beim
41bf5612f5 [Misc] fix typo: add missing space in lora adapter error message (#12564)
Signed-off-by: Beim <beim2015@outlook.com>
2025-01-30 15:39:22 +00:00
Harry Mellor
a2769032ca Set ?device={device} when changing tab in installation guides (#12560)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-30 00:05:42 -08:00
Mark McLoughlin
f17f1d4608 [V1][Metrics] Add GPU cache usage % gauge (#12561)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-29 18:31:01 -08:00
Divakar Verma
1c1bb0bbf2 [Misc][MoE] add Deepseek-V3 moe tuning support (#12558)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-30 00:47:30 +00:00
Woosuk Kwon
e0cc5f259a [V1][BugFix] Free encoder cache for aborted requests (#12545)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-29 13:47:33 -08:00
Tyler Michael Smith
73aa6cfdf7 Revert "[Build/CI] Fix libcuda.so linkage" (#12552) 2025-01-29 21:12:24 +00:00
Jinzhen Lin
27b78c73ca [Kernel] add triton fused moe kernel for gptq/awq (#12185) 2025-01-29 09:07:09 -05:00
Pavani Majety
b02fd288b2 [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. (#11787)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-29 01:46:12 -08:00
Yanyi Liu
ff7424f491 [Frontend] Support override generation config in args (#12409)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
2025-01-29 01:41:01 -08:00
Alphi
d93bf4da85 [Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069)
Signed-off-by: hzh <hezhihui_thu@163.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Yikun <yikunkero@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-01-29 09:24:59 +00:00
Travis Johnson
036ca94c25 [Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Wallas Santos <wallashss@ibm.com>
2025-01-29 08:54:35 +00:00
Maximilien de Bayser
ef001d98ef Fix the pydantic logging validator (#12420)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-01-29 07:53:13 +00:00
Robert Shaw
5f671cb4c3 [V1] Improve Error Message for Unsupported Config (#12535)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-29 04:56:56 +00:00
Michael Goin
bd02164cf9 Bugfix for whisper quantization due to fake k_proj bias (#12524)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-29 04:49:03 +00:00
Mark McLoughlin
46fb056749 [V1][Metrics] Add TTFT and TPOT histograms (#12530)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-29 04:11:16 +00:00
Harry Mellor
dd6a3a02cb [Doc] Convert docs to use colon fences (#12471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-29 11:38:29 +08:00
Ce Gao
a7e3eba66f [Frontend] Support reasoning content for deepseek r1 (#12473)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-01-29 11:38:08 +08:00
Michael Goin
fbb5bd4cef [TPU] Add example for profiling TPU inference (#12531)
Signed-off-by: mgoin <mgoin@redhat.com>
2025-01-29 03:16:47 +00:00
fenghuizhang
80fcc3ed1c [Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels (#12482)
Signed-off-by: Fenghui Zhang <fhzhang@google.com>
2025-01-28 22:36:44 +00:00
Mark McLoughlin
c386c43ca3 [V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-28 22:07:22 +00:00
Harry Mellor
f26d790718 Do not run suggestion pre-commit hook multiple times (#12521)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-28 20:05:27 +00:00
Michael Goin
0f657bdc52 Replace missed warning_once for rerank API (#12472)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-28 19:06:32 +00:00
Mark McLoughlin
3fd1fb63ef [V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-28 16:38:38 +00:00
Jun Duan
925d2f1908 [Doc] Fix typo for x86 CPU installation (#12514)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-01-28 16:37:10 +00:00
Cyrus Leung
8f58a51358 [VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-28 16:25:05 +00:00
Sebastian Schoennenbeck
2079e43bee [Core] Make raw_request optional in ServingCompletion (#12503)
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com>
2025-01-28 10:56:45 +00:00
Robert Shaw
e29d4358ef [V1] Include Engine Version in Logs (#12496)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-28 08:27:41 +00:00
Roger Wang
8cbc424975 Update README.md with V1 alpha release (#12495) 2025-01-28 08:22:41 +00:00
Mengqing Cao
dd66fd2b01 [CI] fix pre-commit error (#12494)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-01-28 06:11:05 +00:00
Gabriel Marinho
0f465ab533 [FEATURE] Enables offline /score for embedding models (#12021)
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>
2025-01-28 11:30:13 +08:00
Hossein Sarshar
23a7cbc88b [CI/Build] Fixed the xla nightly issue report in #12451 (#12453) 2025-01-28 11:18:07 +08:00
Michael Goin
426a5c3625 Fix bad path in prometheus example (#12481)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-27 18:56:31 -07:00
Liangfu Chen
ddee88d0ff [Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (#11277)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: Jiangfei Duan <jfduan@outlook.com>
2025-01-27 17:31:16 -08:00
Harry Mellor
823ab79633 Update pre-commit hooks (#12475)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-27 17:23:08 -07:00
Nicolò Lucchesi
6116ca8cd7 [Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
2025-01-27 13:38:35 -08:00
Bowen Wang
2bc3fbba0c [FlashInfer] Upgrade to 0.2.0 (#11194)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-27 18:19:24 +00:00
Woosuk Kwon
3f1fc7425a [V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-27 09:40:04 -08:00
Mark McLoughlin
01ba927040 [V1][Metrics] Add initial Prometheus logger (#12416)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-27 12:26:28 -05:00
Lucas Wilkinson
103bd17ac5 [Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-27 10:40:00 -05:00
Isotr0py
ce69f7f754 [Bugfix] Fix gpt2 GGUF inference (#12467)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-27 18:31:49 +08:00
Woosuk Kwon
624a1e4711 [V1][Minor] Minor optimizations for update_from_output (#12454)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-27 01:09:27 -08:00
Isotr0py
372bf0890b [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-27 07:25:30 +00:00
Cyrus Leung
5204ff5c3f [Bugfix] Fix Granite 3.0 MoE model loading (#12446)
Some checks failed
Create Release / Create Release (push) Has been cancelled
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-26 21:26:44 -08:00
Pooya Davoodi
0cc6b383d7 [Frontend] Support scores endpoint in run_batch (#12430)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2025-01-27 04:30:17 +00:00
Woosuk Kwon
28e0750847 [V1] Avoid list creation in input preparation (#12457)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-26 19:57:56 -08:00
Yuan Tang
582cf78798 [DOC] Add link to vLLM blog (#12460)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-27 03:46:19 +00:00
Kyle Mistele
0034b09ceb [Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
Signed-off-by: Kyle Mistele <kyle@mistele.com>
2025-01-26 19:58:45 -07:00
Tyler Michael Smith
72bac73067 [Build/CI] Fix libcuda.so linkage (#12424)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-26 21:18:19 +00:00
Lucas Wilkinson
68f11149d8 [Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-26 11:09:34 -08:00
Tyler Michael Smith
72f4880425 [Bugfix/CI] Fix broken kernels/test_mha.py (#12450) 2025-01-26 10:39:03 -08:00
Tyler Michael Smith
aa2cd2c43d [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-26 19:59:58 +08:00
Matthew Hendrey
9ddc35220b [Frontend] generation_config.json for maximum tokens(#12242)
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-26 19:59:25 +08:00
Roger Wang
a5255270c3 [Misc] Revert FA on ViT #12355 and #12435 (#12445) 2025-01-26 03:56:34 -08:00
Roger Wang
0ee349b553 [V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-26 00:47:42 -08:00
Keyun Tong
fa63e710c7 [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094)
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
2025-01-26 00:42:37 -08:00
Roger Wang
2a0309a646 [Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-26 05:00:31 +00:00
Siyuan Liu
324960a95c [TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-01-25 07:23:03 +00:00
Isotr0py
f1fc0510df [Misc] Add FA2 support to ViT MHA layer (#12355)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-25 15:07:35 +08:00
Divakar Verma
bf21481dde [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-25 12:17:19 +08:00
Cyrus Leung
fb30ee92ee [Bugfix] Fix BLIP-2 processing (#12412)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-25 11:42:42 +08:00
ElizaWszola
221d388cc5 [Bugfix][Kernel] Fix moe align block issue for mixtral (#12413) 2025-01-25 01:49:28 +00:00
Lucas Wilkinson
3132a933b6 [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (#12405)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 20:20:59 +00:00
Cyrus Leung
df5dafaa5b [Misc] Remove deprecated code (#12383)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-24 14:45:20 -05:00
Lucas Wilkinson
ab5bbf5ae3 [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 15:27:59 +00:00
Junichi Sato
3bb8e2c9a2 [Misc] Enable proxy support in benchmark script (#12356)
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
2025-01-24 14:58:26 +00:00
youkaichao
e784c6b998 [ci/build] sync default value for wheel size (#12398)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 17:54:29 +08:00
Mohit Deopujari
9a0f3bdbe5 [Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382) 2025-01-24 09:43:49 +00:00
youkaichao
c7c9851036 [ci/build] fix wheel size check (#12396)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 17:31:25 +08:00
Roger Wang
3c818bdb42 [Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-24 00:22:04 -08:00
youkaichao
6dd94dbe94 [perf] fix perf regression from #12253 (#12380)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 11:34:27 +08:00
Woosuk Kwon
0e74d797ce [V1] Increase default batch size for H100/H200 (#12369)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-24 03:19:55 +00:00
Dipika Sikka
55ef66edf4 Update compressed-tensors version (#12367) 2025-01-24 11:19:42 +08:00
omer-dayan
5e5630a478 [Bugfix] Path join when building local path for S3 clone (#12353)
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
2025-01-24 11:06:07 +08:00
Russell Bryant
d3d6bb13fb Set weights_only=True when using torch.load() (#12366)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-24 02:17:30 +00:00
Nick Hill
24b0205f58 [V1][Frontend] Coalesce bunched RequestOutputs (#12298)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2025-01-23 17:17:41 -08:00
Russell Bryant
c5cffcd0cd [Docs] Update spec decode + structured output in compat matrix (#12373)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-24 01:15:52 +00:00
Woosuk Kwon
682b55bc07 [Docs] Add meetup slides (#12345)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-23 14:10:03 -08:00
Junichi Sato
9726ad676d [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
2025-01-23 17:02:13 -05:00
Dipika Sikka
eb5cb5e528 [BugFix] Fix parameter names and process_after_weight_loading for W4A16 MoE Group Act Order (#11528)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-23 21:40:33 +00:00
Isotr0py
2cbeedad09 [Docs] Document Phi-4 support (#12362)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-23 19:18:51 +00:00
Siyuan Liu
2c85529bfc [TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-01-23 18:50:16 +00:00
Gregory Shtrasberg
e97f802b2d [FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2025-01-23 18:04:03 +00:00
youkaichao
6e650f56a1 [torch.compile] decouple compile sizes and cudagraph sizes (#12243)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 02:01:30 +08:00
youkaichao
3f50c148fd [core] add wake_up doc and some sanity check (#12361)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 02:00:50 +08:00
Isotr0py
8c01b8022c [Bugfix] Fix broken internvl2 inference with v1 (#12360)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-23 17:20:33 +00:00
Roger Wang
99d01a5e3d [V1] Simplify M-RoPE (#12352)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: imkero <kerorek@outlook.com>
2025-01-23 23:13:23 +08:00
Cyrus Leung
d07efb31c5 [Doc] Troubleshooting errors during model inspection (#12351)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-23 22:46:58 +08:00
Lucas Wilkinson
978b45f399 [Kernel] Flash Attention 3 Support (#12093)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-23 06:45:48 -08:00
Isotr0py
c5b4b11d7f [Bugfix] Fix k_proj's bias for whisper self attention (#12342)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-23 10:15:33 +00:00
liuzhenwei
8ae5ff2009 [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (#12338)
Signed-off-by: zhenwei <zhenweiliu@habana.ai>
2025-01-23 08:35:46 +00:00
youkaichao
511627445e [doc] explain common errors around torch.compile (#12340)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-23 14:56:02 +08:00
Cody Yu
f0ef37233e [V1] Add uncache_blocks (#12333) 2025-01-23 04:19:21 +00:00
Russell Bryant
7551a34032 [Docs] Document vulnerability disclosure process (#12326)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-23 03:44:09 +00:00
Michael Goin
01a55941f5 [Docs] Update FP8 KV Cache documentation (#12238)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-23 11:18:09 +08:00
Alexei-V-Ivanov-AMD
8d7aa9de71 [Bugfix] Fixing AMD LoRA CI test. (#12329)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-01-23 10:53:02 +08:00
rasmith
68c4421b6d [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-01-23 00:10:37 +00:00
Nick Hill
aea94362c9 [Frontend][V1] Online serving performance improvements (#12287) 2025-01-22 22:22:12 +00:00
Cody Yu
7206ce4ce1 [Core] Support reset_prefix_cache (#12284) 2025-01-22 18:52:27 +00:00
Konrad Zawora
96f6a7596f [Bugfix] Fix HPU multiprocessing executor (#12167)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2025-01-23 02:07:07 +08:00
Jee Jee Li
84bee4bd5c [Misc] Improve the readability of BNB error messages (#12320)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-22 16:56:54 +00:00
Robin
fc66dee76d [Misc] Fix the error in the tip for the --lora-modules parameter (#12319)
Signed-off-by: wangerxiao <863579016@qq.com>
2025-01-22 16:48:41 +00:00
Cyrus Leung
6609cdf019 [Doc] Add docs for prompt replacement (#12318)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-22 14:56:29 +00:00
Roger Wang
16366ee8bb [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (#12313)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-22 21:06:36 +08:00
zhou fan
528dbcac7d [Model][Bugfix]: correct Aria model output (#12309)
Signed-off-by: xffxff <1247714429@qq.com>
2025-01-22 11:39:19 +00:00
Cyrus Leung
cd7b6f0857 [VLM] Avoid unnecessary tokenization (#12310)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-22 11:08:31 +00:00
youkaichao
68ad4e3a8d [Core] Support fully transparent sleep mode (#11743)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-22 14:39:32 +08:00
Mengqing Cao
4004f144f3 [Build] update requirements of no-device (#12299)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-01-22 14:29:31 +08:00
youkaichao
66818e5b63 [core] separate builder init and builder prepare for each batch (#12253)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-22 14:13:52 +08:00
Nick Hill
222a9dc350 [Benchmark] More accurate TPOT calc in benchmark_serving.py (#12288)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-01-22 13:46:14 +08:00
Cyrus Leung
cbdc4ad5a5 [Ci/Build] Fix mypy errors on main (#12296)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-22 12:06:54 +08:00
Liangfu Chen
016e3676e7 [CI] add docker volume prune to neuron CI (#12291)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-01-22 10:47:49 +08:00
Kevin H. Luu
64ea24d0b3 [ci/lint] Add back default arg for pre-commit (#12279)
Signed-off-by: kevin <kevin@anyscale.com>
2025-01-22 01:15:27 +00:00
Cyrus Leung
df76e5af26 [VLM] Simplify post-processing of replacement info (#12269)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 16:48:13 -08:00
Hongxia Yang
09ccc9c8f7 [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (#12281)
Signed-off-by: Hongxia Yang <hongxyan@amd.com>
2025-01-22 07:49:22 +08:00
Aleksandr Malyshev
69196a9bc7 [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (#12277)
Signed-off-by: maleksan85 <maleksan@amd.com>
Co-authored-by: maleksan85 <maleksan@amd.com>
2025-01-21 23:30:46 +00:00
Divakar Verma
2acba47d9b [bugfix] moe tuning. rm is_navi() (#12273)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-21 22:47:32 +00:00
Jani Monoses
9c485d9e25 [Core] Free CPU pinned memory on environment cleanup (#10477) 2025-01-21 11:56:41 -08:00
wangxiyuan
fa9ee08121 [Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-21 11:52:11 -08:00
Adrian Cole
347eeebe3b [Misc] Remove experimental dep from tracing.py (#12007)
Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
2025-01-21 11:51:55 -08:00
Andy Lo
18fd4a8331 [Bugfix] Multi-sequence broken (#11898)
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-01-21 11:51:35 -08:00
Ricky Xu
132a132100 [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
Signed-off-by: rickyx <rickyx@anyscale.com>
2025-01-21 11:51:13 -08:00
Jinzhen Lin
1e60f87bb3 [Kernel] fix moe_align_block_size error condition (#12239)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-01-21 10:30:28 -08:00
Jannis Schönleber
9705b90bcf [Bugfix] fix race condition that leads to wrong order of token returned (#10802)
Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
2025-01-21 09:47:04 -08:00
youkaichao
3aec49e56f [ci/build] update nightly torch for gh200 test (#12270)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-21 23:03:17 +08:00
Mengqing Cao
c64612802b [Platform] improve platforms getattr (#12264)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-01-21 14:42:41 +00:00
Thomas Parnell
9a7c3a0042 Remove pytorch comments for outlines + compressed-tensors (#12260)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-01-21 21:49:08 +08:00
Roger Wang
b197a5ccfd [V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-21 13:18:43 +00:00
youkaichao
c81081fece [torch.compile] transparent compilation with more logging (#12246)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-21 19:32:55 +08:00
Cyrus Leung
a94eee4456 [Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 10:09:39 +00:00
Cyrus Leung
f2e9f2a3be [Misc] Remove redundant TypeVar from base model (#12248)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 08:40:39 +00:00
Jee Jee Li
1f1542afa9 [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (#12237)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-21 07:49:08 +00:00
Cyrus Leung
96912550c8 [Misc] Rename MultiModalInputsV2 -> MultiModalInputs (#12244)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 07:31:19 +00:00
youkaichao
2fc6944c5e [ci/build] disable failed and flaky tests (#12240)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-21 13:25:03 +08:00
Nicolò Lucchesi
5fe6bf29d6 [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-01-21 12:23:14 +08:00
Gregory Shtrasberg
d4b62d4641 [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-01-21 12:22:23 +08:00
Michael Goin
ecf67814f1 Add quantization and guided decoding CODEOWNERS (#12228)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-20 18:23:40 -07:00
Jinzhen Lin
750f4cabfa [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (#12222)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-20 16:42:16 -08:00
Cheng Kuan Yong Jason
06a760d6e8 [bugfix] catch xgrammar unsupported array constraints (#12210)
Signed-off-by: Jason Cheng <jasoncky96@gmail.com>
2025-01-20 16:42:02 -08:00
youkaichao
da7512215f [misc] add cuda runtime version to usage data (#12190)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-01-21 00:31:01 +00:00
Işık
af69a6aded fix: update platform detection for M-series arm based MacBook processors (#12227)
Signed-off-by: isikhi <huseyin.isik000@gmail.com>
2025-01-20 22:23:28 +00:00
Roger Wang
7bd3630067 [Misc] Update CODEOWNERS (#12229) 2025-01-20 22:19:09 +00:00
Chen Zhang
96663699b2 [CI] Pass local python version explicitly to pre-commit mypy.sh (#12224)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-20 23:49:18 +08:00
Cyrus Leung
18572e3384 [Bugfix] Fix HfExampleModels.find_hf_info (#12223)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 15:35:36 +00:00
wangxiyuan
86bfb6dba7 [Misc] Pass attention to impl backend (#12218)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-20 23:25:28 +08:00
Chen Zhang
5f0ec3935a [V1] Remove _get_cache_block_size (#12214)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-20 21:54:16 +08:00
youkaichao
c222f47992 [core][bugfix] configure env var during import vllm (#12209)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-20 19:35:59 +08:00
youkaichao
170eb35079 [misc] print a message to suggest how to bypass commit hooks (#12217)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-20 18:06:24 +08:00
Cyrus Leung
b37d82791e [Model] Upgrade Aria to transformers 4.48 (#12203)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 17:58:48 +08:00
Cyrus Leung
3127e975fb [CI/Build] Make pre-commit faster (#12212)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 17:36:24 +08:00
Cyrus Leung
4001ea1266 [CI/Build] Remove dummy CI steps (#12208)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 16:41:57 +08:00
youkaichao
5c89a29c22 [misc] add placeholder format.sh (#12206)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-20 16:04:49 +08:00
Cyrus Leung
59a0192fb9 [Core] Interface for accessing model from VllmRunner (#10353)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 15:00:59 +08:00
Isotr0py
83609791d2 [Model] Add Qwen2 PRM model support (#12202)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-20 14:59:46 +08:00
Yuan Tang
0974c9bc5c [Bugfix] Fix incorrect types in LayerwiseProfileResults (#12196)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-20 14:59:20 +08:00
Yuan Tang
d2643128f7 [DOC] Add missing docstring in LLMEngine.add_request() (#12195)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-20 14:59:00 +08:00
Yuan Tang
c5c06209ec [DOC] Fix typo in docstring and assert message (#12194)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-20 14:58:29 +08:00
Harry Mellor
3ea7b94523 Move linting to pre-commit (#11975)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-20 14:58:01 +08:00
youkaichao
51ef828f10 [torch.compile] fix sym_tensor_indices (#12191)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-20 11:37:50 +08:00
shangmingc
df450aa567 [Bugfix] Fix num_heads value for simple connector when tp enabled (#12074)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-01-20 02:56:43 +00:00
Martin Gleize
bbe5f9de7d [Model] Support for fairseq2 Llama (#11442)
Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
2025-01-19 10:40:40 -08:00
Roger Wang
81763c58a0 [V1] Add V1 support of Qwen2-VL (#12128)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-19 19:52:13 +08:00
Isotr0py
edaae198e7 [Misc] Add BNB support to GLM4-V model (#12184)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-19 19:49:22 +08:00
gujing
936db119ed benchmark_serving support --served-model-name param (#12109)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2025-01-19 09:59:56 +00:00
youkaichao
e66faf4809 [torch.compile] store inductor compiled Python file (#12182)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-19 16:27:26 +08:00
Cyrus Leung
630eb5b5ce [Bugfix] Fix multi-modal processors for transformers 4.48 (#12187) 2025-01-18 19:16:34 -08:00
Michal Adamczyk
4e94951bb1 [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (#12152)
Signed-off-by: Michal Adamczyk <madamczyk@habana.ai>
2025-01-19 11:12:05 +08:00
Simon Mo
7a8a48d51e [V1] Collect env var for usage stats (#12115) 2025-01-19 03:07:15 +00:00
yancong
32eb0da808 [Misc] Support register quantization method out-of-tree (#11969) 2025-01-18 16:13:16 -08:00
youkaichao
6d0e3d3724 [core] clean up executor class hierarchy between v1 and v0 (#12171)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-18 14:35:15 +08:00
Isotr0py
02798ecabe [Model] Port deepseek-vl2 processor, remove dependency (#12169)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-18 13:59:39 +08:00
Russell Bryant
813f249f02 [Docs] Fix broken link in SECURITY.md (#12175)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-18 04:35:21 +00:00
youkaichao
da02cb4b27 [core] further polish memory profiling (#12126)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-18 12:25:08 +08:00
Hongxia Yang
c09503ddd6 [AMD][CI/Build][Bugfix] use pytorch stale wheel (#12172)
Signed-off-by: hongxyan <hongxyan@amd.com>
2025-01-18 11:15:53 +08:00
youkaichao
2b83503227 [misc] fix cross-node TP (#12166)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-18 10:53:27 +08:00
youkaichao
7b98a65ae6 [torch.compile] disable logging when cache is disabled (#12043)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-17 20:29:31 +00:00
Gregory Shtrasberg
b5b57e301e [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-01-17 17:12:26 +00:00
Kunshang Ji
54cacf008f [Bugfix] Mistral tokenizer encode accept list of str (#12149)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-01-17 16:47:53 +00:00
Wallas Henrique
58fd57ff1d [Bugfix] Fix score api for missing max_model_len validation (#12119)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2025-01-17 16:24:22 +00:00
youkaichao
87a0c076af [core] allow callable in collective_rpc (#12151)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-17 20:47:01 +08:00
Li, Jiang
d4e6194570 [CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-01-17 19:39:52 +08:00
Jee Jee Li
07934cc237 [Misc][LoRA] Improve the readability of LoRA error messages (#12102)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-17 19:32:28 +08:00
Chen Zhang
69d765f5a5 [V1] Move more control of kv cache initialization from model_executor to EngineCore (#11960)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-17 07:39:35 +00:00
Divakar Verma
8027a72461 [ROCm][MoE] moe tuning support for rocm (#12049)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-17 14:49:16 +08:00
Isotr0py
d75ab55f10 [Misc] Add deepseek_vl2 chat template (#12143)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-17 06:34:48 +00:00
Chen Zhang
d1adb9b403 [BugFix] add more is not None check in VllmConfig.__post_init__ (#12138)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-17 05:33:22 +00:00
Yuan Tang
b8bfa46a18 [Bugfix] Fix issues in CPU build Dockerfile (#12135)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-17 12:54:01 +08:00
Yuan Tang
1475847a14 [Doc] Add instructions on using Podman when SELinux is active (#12136)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-17 04:45:36 +00:00
Kunshang Ji
fead53ba78 [CI]add genai-perf benchmark in nightly benchmark (#10704)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-01-17 04:15:09 +00:00
Kuntai Du
ebc73f2828 [Bugfix] Fix a path bug in disaggregated prefill example script. (#12121)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-01-17 11:12:41 +08:00
Chen Zhang
d06e824006 [Bugfix] Set enforce_eager automatically for mllama (#12127)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-16 15:30:08 -05:00
Isotr0py
62b06ba23d [Model] Add support for deepseek-vl2-tiny model (#12068)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-16 17:14:48 +00:00
Varun Sundar Rabindranath
5fd24ec02e [misc] Add LoRA kernel micro benchmarks (#11579) 2025-01-16 15:51:40 +00:00
Roger Wang
874f7c292a [Bugfix] Fix max image feature size for Llava-one-vision (#12104)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-16 14:54:06 +00:00
youkaichao
92e793d91a [core] LLM.collective_rpc interface and RLHF example (#12084)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-16 20:19:52 +08:00
youkaichao
bf53e0c70b Support torchrun and SPMD-style offline inference (#12071)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-16 19:58:53 +08:00
Isotr0py
dd7c9ad870 [Bugfix] Remove hardcoded head_size=256 for Deepseek v2 and v3 (#12067)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-16 10:11:54 +00:00
Michael Goin
9aa1519f08 Various cosmetic/comment fixes (#12089)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-16 09:59:06 +00:00
Cyrus Leung
f8ef146f03 [Doc] Add documentation for specifying model architecture (#12105) 2025-01-16 15:53:43 +08:00
Elfie Guo
fa0050db08 [Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-16 04:31:27 +00:00
tvirolai-amd
cd9d06fb8d Allow hip sources to be directly included when compiling for rocm. (#12087) 2025-01-15 16:46:03 -05:00
Varun Sundar Rabindranath
ebd8c669ef [Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-01-15 19:59:42 +00:00
Roger Wang
70755e819e [V1][Core] Autotune encoder cache budget (#11895)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-15 11:29:00 -08:00
Joe Runde
edce722eaa [Bugfix] use right truncation for non-generative tasks (#12050)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-01-16 00:31:01 +08:00
maang-h
57e729e874 [Doc]: Update OpenAI-Compatible Server documents (#12082) 2025-01-15 16:07:45 +00:00
kewang-xlnx
de0526f668 [Misc][Quark] Upstream Quark format to VLLM (#10765)
Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-15 11:05:15 -05:00
Yuan
5ecf3e0aaf Misc: allow to use proxy in HTTPConnection (#12042)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2025-01-15 13:16:40 +00:00
RunningLeon
97eb97b5a4 [Model]: Support internlm3 (#12037) 2025-01-15 11:35:17 +00:00
wangxiyuan
3adf0ffda8 [Platform] Do not raise error if _Backend is not found (#12023)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-01-15 10:14:15 +00:00
Keyun Tong
ad388d25a8 Type-fix: make execute_model output type optional (#12020) 2025-01-15 09:44:56 +00:00
Rahul Tuli
cbe94391eb Fix: cases with empty sparsity config (#12057)
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
2025-01-15 17:41:24 +08:00
Chen Zhang
994fc655b7 [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (#12003) 2025-01-15 07:55:30 +00:00
Kyle Sayers
3f9b7ab9f5 [Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-01-15 06:36:01 +00:00
youkaichao
ad34c0df0f [core] platform agnostic executor via collective_rpc (#11256)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-15 13:45:21 +08:00
Rui Qiao
f218f9c24d [core] Turn off GPU communication overlap for Ray executor (#12051)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-01-15 05:19:55 +00:00
Elfie Guo
0794e7446e [Misc] Add multipstep chunked-prefill support for FlashInfer (#10467) 2025-01-15 12:47:49 +08:00
Woosuk Kwon
b7ee940a82 [V1][BugFix] Fix edge case in VLM scheduling (#12065)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-14 20:21:28 -08:00
Shanshan Shen
9ddac56311 [Platform] move current_memory_usage() into platform (#11369)
Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-01-15 03:38:25 +00:00
Konrad Zawora
1a51b9f872 [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (#12046)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2025-01-15 02:59:18 +00:00
Jee Jee Li
42f5e7c52a [Kernel] Support MulAndSilu (#11624)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-15 02:29:53 +00:00
Jee Jee Li
a3a3ee4e6f [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (#11924)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-15 07:49:49 +08:00
maang-h
87054a57ab [Doc]: Update the Json Example of the Engine Arguments document (#12045) 2025-01-14 17:03:04 +00:00
Harry Mellor
c9d6ff530b Explain where the engine args go when using Docker (#12041)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-14 16:05:50 +00:00
Chen Zhang
a2d2acb4c8 [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-14 15:45:05 +00:00
wangxiyuan
2e0e017610 [Platform] Add output for Attention Backend (#11981)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-14 13:27:04 +00:00
Chen Zhang
1f18adb245 [Kernel] Revert the API change of Attention.forward (#12038)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-14 20:59:32 +08:00
Cyrus Leung
bb354e6b2d [Bugfix] Fix various bugs in multi-modal processor (#12031)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-14 12:16:11 +00:00
youkaichao
ff39141a49 [HPU][misc] add comments for explanation (#12034)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-14 19:24:06 +08:00
TJian
8a1f938e6f [Doc] Update Quantization Hardware Support Documentation (#12025)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-01-14 04:37:52 +00:00
Konrad Zawora
078da31903 [HPU][Bugfix] set_forward_context and CI test execution (#12014)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2025-01-14 11:04:18 +08:00
Woosuk Kwon
1a401252b5 [Docs] Add Sky Computing Lab to project intro (#12019)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-13 17:24:36 -08:00
Steve Luo
f35ec461fc [Bugfix] Fix deepseekv3 gate bias error (#12002)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-13 13:43:51 -07:00
Yikun Jiang
289b5191d5 [Doc] Fix build from source and installation link in README.md (#12013)
Signed-off-by: Yikun <yikunkero@gmail.com>
2025-01-13 17:23:59 +00:00
elijah
c6db21313c bugfix: Fix signature mismatch in benchmark's get_tokenizer function (#11982)
Signed-off-by: elijah <f1renze.142857@gmail.com>
2025-01-13 15:22:07 +00:00
Shanshan Shen
a7d59688fb [Platform] Move get_punica_wrapper() function to Platform (#11516)
Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-01-13 13:12:10 +00:00
youkaichao
458e63a2c6 [platform] add device_control env var (#12009)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-13 20:59:09 +08:00
Harry Mellor
e8c23ff989 [Doc] Organise installation documentation into categories and tabs (#11935)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-13 12:27:36 +00:00
Roger Wang
cd8249903f [Doc][V1] Update model implementation guide for V1 support (#11998)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-01-13 11:58:54 +00:00
Chen Zhang
0f8cafe2d1 [Kernel] unified_attention for Attention.forward (#11967)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-13 19:28:53 +08:00
Alex Brooks
5340a30d01 Fix Max Token ID for Qwen-VL-Chat (#11980)
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
2025-01-13 08:37:48 +00:00
youkaichao
89ce62a316 [platform] add ray_device_key (#11948)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-13 16:20:52 +08:00
Chenguang Li
c3f05b09a0 [Misc]Minor Changes about Worker (#11555)
Signed-off-by: Chenguang Li <757486878@qq.com>
2025-01-13 15:47:05 +08:00
Concurrensee
cf6bbcb493 [Misc] Fix Deepseek V2 fp8 kv-scale remapping (#11947)
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
2025-01-12 23:05:06 -08:00
Sungjae Lee
80ea3af1a0 [CI][Spec Decode] fix: broken test for EAGLE model (#11972)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
2025-01-13 06:50:35 +00:00
Siyuan Li
9dd02d85ca [Bug] Fix usage of .transpose() and .view() consecutively. (#11979) 2025-01-13 06:24:10 +00:00
Yangcheng Li
f7b3ba82c3 [MISC] fix typo in kv transfer send recv test (#11983) 2025-01-13 05:07:48 +00:00
Robert Shaw
619ae268c3 [V1] [2/n] Logging and Metrics - OutputProcessor Abstraction (#11973)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-13 04:54:10 +00:00
Isotr0py
d14e98d924 [Model] Support GGUF models newly added in transformers 4.46.0 (#9685)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-13 00:13:44 +00:00
Robert Shaw
9597a095f2 [V1][Core][1/n] Logging and Metrics (#11962)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-12 21:02:02 +00:00
Avshalom Manevich
263a870ee1 [Hardware][TPU] workaround fix for MoE on TPU (#11764) 2025-01-12 10:53:51 -05:00
Akshat Tripathi
8bddb73512 [Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100)
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-12 13:01:52 +00:00
Isotr0py
f967e51f38 [Model] Initialize support for Deepseek-VL2 models (#11578)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-12 00:17:24 -08:00
Rafael Vasquez
43f3d9e699 [CI/Build] Add markdown linter (#11857)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2025-01-12 00:17:13 -08:00
Roger Wang
b25cfab9a0 [V1] Avoid sending text prompt to core engine (#11963)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-12 06:36:38 +00:00
sixgod
4b657d3292 [Model] Add cogagent model support vLLM (#11742)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-11 19:05:56 +00:00
Nicolò Lucchesi
d697dc01b4 [Bugfix] Fix RobertaModel loading (#11940)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-01-11 14:05:09 +00:00
Cyrus Leung
a991f7d508 [Doc] Basic guide for writing unit tests for new models (#11951)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-11 21:27:24 +08:00
Cyrus Leung
7a3a83e3b8 [CI/Build] Move model-specific multi-modal processing tests (#11934)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-11 13:50:05 +08:00
shaochangxu
c32a7c7c0c [Bugfix] fused_experts_impl wrong compute type for float32 (#11921)
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
2025-01-11 13:49:39 +08:00
Sungjae Lee
2118d0565c [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (#11672)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
2025-01-10 20:49:38 -08:00
youkaichao
899136b857 [ci] fix broken distributed-tests-4-gpus (#11937)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-11 09:07:24 +08:00
Fred Reiss
c9f09a4fe8 [mypy] Fix mypy warnings in api_server.py (#11941)
Signed-off-by: Fred Reiss <frreiss@us.ibm.com>
2025-01-11 01:04:58 +00:00
Travis Johnson
d45cbe70f5 [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (#11939)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2025-01-10 23:26:00 +00:00
minmin
8a579408f3 [Misc] Update benchmark_prefix_caching.py fixed example usage (#11920)
Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn>
Co-authored-by: Ren MinMin <renmm6@chinaunicom.cn>
2025-01-10 20:39:22 +00:00
Isotr0py
46fa98ccad [Misc] Clean up debug code in Deepseek-V3 (#11930)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-10 19:19:15 +00:00
Li, Jiang
aa1e77a19c [Hardware][CPU] Support MOE models on x86 CPU (#11831)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-01-10 11:07:58 -05:00
Kuntai Du
5959564f94 Doc fix in benchmark_long_document_qa_throughput.py (#11933)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-01-10 23:51:43 +08:00
Kuntai Du
f33e033e27 [Docs] Fix docstring in get_ip function (#11932)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-01-10 23:51:02 +08:00
Harry Mellor
482cdc494e [Doc] Rename offline inference examples (#11927)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-10 23:50:29 +08:00
wangxiyuan
20410b2fda [platform] support custom torch.compile backend key (#11318)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-10 23:46:51 +08:00
Cyrus Leung
12664ddda5 [Doc] [1/N] Initial guide for merged multi-modal processor (#11925)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-10 14:30:25 +00:00
youkaichao
241ad7b301 [ci] Fix sampler tests (#11922)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-10 20:45:33 +08:00
Harry Mellor
d85c47d6ad Replace "online inference" with "online serving" (#11923)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-10 12:05:56 +00:00
wangxiyuan
ef725feafc [platform] support pytorch custom op pluggable (#11328)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-10 10:02:38 +00:00
cennn
d907be7dc7 [misc] remove python function call for custom activation op (#11885)
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-10 17:18:25 +08:00
youkaichao
d53575a5f0 [ci] fix gh200 tests (#11919)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-10 16:25:17 +08:00
Kunshang Ji
61af633256 [BUGFIX] Fix UnspecifiedPlatform package name (#11916)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-01-10 16:20:46 +08:00
Joe Runde
ac2f3f7fee [Bugfix] Validate lora adapters to avoid crashing server (#11727)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-10 15:56:36 +08:00
Chen Zhang
cf5f000d21 [torch.compile] Hide KV cache behind torch.compile boundary (#11677)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-10 13:14:42 +08:00
Cyrus Leung
3de2b1eafb [Doc] Show default pooling method in a table (#11904)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-10 11:25:20 +08:00
Cyrus Leung
b844b99ad3 [VLM] Enable tokenized inputs for merged multi-modal processor (#11900)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-10 03:24:00 +00:00
Cyrus Leung
c3cf54dda4 [Doc][5/N] Move Community and API Reference to the bottom (#11896)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-01-10 03:10:12 +00:00
Charles Frye
36f5303578 [Docs] Add Modal to deployment frameworks (#11907) 2025-01-09 23:26:37 +00:00
Cyrus Leung
9a228348d2 [Misc] Provide correct Pixtral-HF chat template (#11891)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-09 10:19:37 -07:00
youkaichao
bd82872211 [ci]try to fix flaky multi-step tests (#11894)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-09 14:47:29 +00:00
wangxiyuan
405eb8e396 [platform] Allow platform specify attention backend (#11609)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-01-09 21:46:50 +08:00
Cyrus Leung
65097ca0af [Doc] Add model development API Reference (#11884)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-09 09:43:40 +00:00
Ye (Charlotte) Qi
1d967acb45 [Bugfix] fix beam search input errors and latency benchmark script (#11875)
Signed-off-by: Ye Qi <yeq@meta.com>
Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com>
2025-01-09 17:36:39 +08:00
Cyrus Leung
0bd1ff4346 [Bugfix] Override dunder methods of placeholder modules (#11882)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-09 09:02:53 +00:00
youkaichao
310aca88c9 [perf]fix current stream (#11870)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-09 07:18:21 +00:00
Guspan Tanadi
a732900efc [Doc] Intended links Python multiprocessing library (#11878) 2025-01-09 05:39:39 +00:00
Cyrus Leung
d848800e88 [Misc] Move print_*_once from utils to logger (#11298)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
2025-01-09 12:48:12 +08:00
Michael Goin
730e9592e9 [Doc] Recommend uv and python 3.12 for quickstart guide (#11849)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-09 11:37:48 +08:00
Maximilien de Bayser
1fe554bac3 treat do_lower_case in the same way as the sentence-transformers library (#11815)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-01-09 11:05:43 +08:00
Tyler Michael Smith
615e4a5401 [CI] Turn on basic correctness tests for V1 (#10864) 2025-01-08 21:20:44 -05:00
Simon Mo
3db0cafdf1 [Docs] Add Google Cloud Meetup (#11864) 2025-01-08 12:38:28 -08:00
rasmith
526de822d5 [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (#11698)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-01-08 20:23:15 +00:00
Robert Shaw
56fe4c297c [TPU][Quantization] TPU W8A8 (#11785)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-08 19:33:29 +00:00
WangErXiao
47de8821d3 [Misc]add some explanations for BlockHashType (#11847) 2025-01-08 18:21:30 +00:00
Cyrus Leung
5984499e47 [Doc] Expand Multimodal API Reference (#11852)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 17:14:14 +00:00
Cyrus Leung
ca47e176af [Misc] Move some model utils into vision file (#11848)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 17:04:46 +00:00
Yan Ma
78f4590b60 [Bugfix][XPU] fix silu_and_mul (#11823)
Signed-off-by: yan ma <yan.ma@intel.com>
2025-01-09 00:11:50 +08:00
Li, Jiang
2f7024987e [CI/Build][Bugfix] Fix CPU CI image clean up (#11836)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-01-08 15:18:28 +00:00
Cyrus Leung
6cd40a5bfe [Doc][4/N] Reorganize API Reference (#11843)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 21:34:44 +08:00
Harry Mellor
aba8d6ee00 [Doc] Move examples into categories (#11840)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-08 13:09:53 +00:00
Cyrus Leung
2a0596bc48 [VLM] Reorganize profiling/processing-related code (#11812)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 18:59:58 +08:00
youkaichao
f12141170a [torch.compile] consider relevant code in compilation cache (#11614)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-08 10:46:43 +00:00
Wallas Henrique
cfd3219f58 [Hardware][Apple] Native support for macOS Apple Silicon (#11696)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-08 16:35:49 +08:00
Simon Mo
a1b2b8606e [Docs] Update sponsor name: 'Novita' to 'Novita AI' (#11833) 2025-01-07 23:05:46 -08:00
youkaichao
ad9f1aa679 [doc] update wheels url (#11830)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-08 14:36:49 +08:00
youkaichao
889e662eae [misc] improve memory profiling (#11809)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-08 06:36:03 +00:00
Cyrus Leung
ef68eb28d8 [Bug] Fix pickling of ModelConfig when RunAI Model Streamer is used (#11825)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 13:40:09 +08:00
Simon Mo
259abd8953 [Docs] reorganize sponsorship page (#11639)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-01-07 21:16:08 -08:00
Jee Jee Li
f645eb6954 [Bugfix] Add checks for LoRA and CPU offload (#11810)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-08 13:08:48 +08:00
Ilya Lavrenov
f4923cb8bc [OpenVINO] Fixed Docker.openvino build (#11732)
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
2025-01-08 13:08:30 +08:00
Nishidha
b640b19cc0 Fixed docker build for ppc64le (#11518)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
2025-01-08 13:05:37 +08:00
WangErXiao
dc71af0a71 Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (#11824) 2025-01-08 04:09:25 +00:00
Divakar Verma
4d29e91be8 [Misc] sort torch profiler table by kernel timing (#11813) 2025-01-08 10:57:04 +08:00
Cyrus Leung
91445c7bc8 [Bugfix] Fix image input for Pixtral-HF (#11741)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 10:17:16 +08:00
Harry Mellor
5950f555a1 [Doc] Group examples into categories (#11782)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-08 09:20:12 +08:00
Jie Fu (傅杰)
a4e2b26856 [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 (#11794) 2025-01-07 16:15:50 -08:00
sroy745
973f5dc581 [Doc]Add documentation for using EAGLE in vLLM (#11417)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2025-01-07 19:19:12 +00:00
jiangjiadi
c994223d56 [Bugfix] update the prefix for qwen2 (#11795)
Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com>
2025-01-07 18:36:34 +00:00
youkaichao
869579a702 [optimization] remove python function call for custom op (#11750)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-07 17:04:28 +00:00
Cyrus Leung
c0efe92d8b [Doc] Add note to gte-Qwen2 models (#11808)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 21:50:58 +08:00
youkaichao
d9fa1c05ad [doc] update how pip can install nightly wheels (#11806)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-07 21:42:58 +08:00
Roger Wang
2de197bdd4 [V1] Support audio language models on V1 (#11733)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-07 19:47:36 +08:00
youkaichao
869e829b85 [doc] add doc to explain how to use uv (#11773)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-07 18:41:17 +08:00
Cyrus Leung
8f37be38eb [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (#11800)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 18:25:02 +08:00
Roger Wang
8082ad7950 [V1][Doc] Update V1 support for LLaVa-NeXT-Video (#11798)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-07 09:55:39 +00:00
Yuan
1e4ce295ae [CI][CPU] adding build number to docker image name (#11788)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2025-01-07 07:28:01 +00:00
Russell Bryant
ce1917fcf2 [Doc] Create a vulnerability management team (#9925)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-06 22:57:32 -08:00
XiaobingZhang
e512f76a89 fix init error for MessageQueue when n_local_reader is zero (#11768) 2025-01-07 06:12:48 +00:00
Liangfu Chen
898cdf033e [CI] Fix neuron CI and run offline tests (#11779)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-01-06 21:36:10 -08:00
Roger Wang
0f3f3c86ec [Bugfix] Update attention interface in Whisper (#11784)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-07 04:36:24 +00:00
Jee Jee Li
b278557935 [Kernel][LoRA]Punica prefill kernels fusion (#11234)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Abatom <abzhonghua@gmail.com>
Co-authored-by: Zhonghua Deng <abatom@163.com>
2025-01-07 04:01:39 +00:00
Cyrus Leung
8ceffbf315 [Doc][3/N] Reorganize Serving section (#11766)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 11:20:01 +08:00
YiSheng5
d93d2d74fd [XPU] Make pp group initilized for pipeline-parallelism (#11648)
Signed-off-by: yisheng <yi.sheng@intel.com>
2025-01-07 11:09:58 +08:00
Cyrus Leung
d0169e1b0f [Model] Future-proof Qwen2-Audio multi-modal processor (#11776)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 11:05:17 +08:00
Cyrus Leung
08fb75c72e [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (#11772)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 01:10:54 +00:00
Roger Wang
91b361ae89 [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (#11685)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 19:58:16 +00:00
Chen Zhang
e20c92bb61 [Kernel] Move attn_type to Attention.__init__() (#11690)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-07 00:11:28 +08:00
Jee Jee Li
32c9eff2ff [Bugfix][V1] Fix molmo text-only inputs (#11676)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-06 15:22:25 +00:00
youkaichao
4ca5d40adc [doc] explain how to add interleaving sliding window support (#11771)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-06 21:57:44 +08:00
Roger Wang
9279b9f83d [Bugfix] Fix max image size for LLaVA-Onevision (#11769)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-06 13:48:53 +00:00
Cyrus Leung
ee77fdb5de [Doc][2/N] Reorganize Models and Usage sections (#11755)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 21:40:31 +08:00
Cyrus Leung
996357e480 [VLM] Separate out profiling-related logic (#11746)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 16:02:21 +08:00
Suraj Deshmukh
2a622d704a k8s-config: Update the secret to use stringData (#11679)
Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com>
2025-01-06 08:01:22 +00:00
Lucas Tucker
9c749713f6 [mypy] Forward pass function type hints in lora (#11740)
Signed-off-by: lucast2021 <lucast2021@headroyce.org>
Co-authored-by: lucast2021 <lucast2021@headroyce.org>
2025-01-06 07:59:36 +00:00
Rui Qiao
022c5c6944 [V1] Refactor get_executor_cls (#11754) 2025-01-06 07:59:16 +00:00
Rui Qiao
f8fcca100b [Misc] Fix typo for valid_tool_parses (#11753)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-01-06 07:12:38 +00:00
Woosuk Kwon
06bfb51963 [V1] Add BlockTable class (#11693)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-06 14:24:42 +09:00
Cody Yu
408e560015 [Bugfix] Remove block size constraint (#11723) 2025-01-06 12:49:55 +08:00
Cyrus Leung
402d378360 [Doc] [1/N] Reorganize Getting Started section (#11645)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 02:18:33 +00:00
cennn
9e764e7b10 [distributed] remove pynccl's redundant change_state (#11749) 2025-01-06 09:05:48 +08:00
Robert Shaw
33fc1e2e86 [Frontend] Improve StreamingResponse Exception Handling (#11752) 2025-01-05 16:35:01 -05:00
Lancer
eba17173d3 fix: [doc] fix typo (#11751)
Co-authored-by: Lancer <maruixiang6688@gmail.com>
2025-01-06 00:48:16 +08:00
cennn
635b897246 [distributed] remove pynccl's redundant stream (#11744) 2025-01-05 23:09:11 +08:00
Lu Fang
4068f4b5b5 [MISC] Replace c10::optional with std::optional (#11730)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-01-05 10:20:34 +09:00
Jee Jee Li
47831430cc [Bugfix][V1] Fix test_kv_cache_utils.py (#11738)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-04 16:07:59 +00:00
Cyrus Leung
65c08928c2 [Model] Remove unnecessary weight initialization logic (#11736)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-04 23:46:21 +08:00
Cyrus Leung
ba214dffbe [Bugfix] Fix precision error in LLaVA-NeXT (#11735)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-04 23:45:57 +08:00
Cyrus Leung
eed11ebee9 [VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision (#11717)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-04 11:40:53 +00:00
Yan Burman
300acb8347 [Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture (#11233)
Signed-off-by: Yan Burman <yanburman@users.noreply.github.com>
Signed-off-by: Ido Asraff <idoa@atero.ai>
2025-01-04 14:50:16 +08:00
xcnick
d91457d529 [V1] Add kv cache utils tests. (#11513)
Signed-off-by: xcnick <xcnick0412@gmail.com>
2025-01-04 14:49:46 +08:00
Kunshang Ji
fbf2564554 [V1] Add RayExecutor support for AsyncLLM (api server) (#11712) 2025-01-04 06:41:31 +00:00
Alberto Ferrer
d1d49397e7 Update bnb.md with example for OpenAI (#11718) 2025-01-04 06:29:02 +00:00
Hust_YangXian
9c93636d84 Update tool_calling.md (#11701) 2025-01-04 06:16:30 +00:00
WangErXiao
e5d7ed0c53 [V1] log GPU blocks num for MultiprocExecutor (#11656) 2025-01-04 00:13:12 +00:00
Robert Shaw
ad0d567e1c [V1] Chore: cruft removal (#11724) 2025-01-03 23:25:02 +00:00
Michael Goin
bf0d97d786 Update requirements-tpu.txt to support python 3.9 and 3.11 (#11695)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-03 22:36:46 +00:00
Jee Jee Li
a655eb3025 [Misc]Add BNB quantization for Qwen2VL (#11719)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-03 15:19:02 -07:00
Robert Shaw
1543914c04 [V1] Improve TP>1 Error Handling + Stack Trace (#11721)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-03 21:29:11 +00:00
ZincCat
61fed92c7e [Bugfix] Fix ColumnParallelLinearWithLoRA slice (#11708)
Signed-off-by: ZincCat <zincchloride@outlook.com>
2025-01-03 21:02:34 +00:00
Robert Shaw
80c751e7f6 [V1] Simplify Shutdown (#11659) 2025-01-03 17:25:38 +00:00
Aurick Qiao
e1a5c2f0a1 [Model] Whisper model implementation (#11280)
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
2025-01-03 16:39:19 +08:00
Kevin H. Luu
fd3a62a122 [perf-benchmark] Fix dependency for steps in benchmark pipeline (#11710) 2025-01-02 22:38:37 -08:00
Lu Fang
07064cb1d4 [Bugfix] Check chain_speculative_sampling before calling it (#11673)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-01-02 16:58:56 -08:00
Sachin Varghese
2f1e8e8f54 Update default max_num_batch_tokens for chunked prefill (#11694) 2025-01-03 00:25:53 +00:00
Nathan Azrak
68d37809b9 [Misc] Minimum requirements for SageMaker compatibility (#11576) 2025-01-02 15:59:25 -08:00
wchen61
5dba257506 Resolve race conditions in Marlin kernel (#11493)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2025-01-02 22:58:56 +00:00
bjmsong
187e32997c [Bugfix] Change kv scaling factor by param json on nvidia gpu (#11688)
Signed-off-by: bjmsong <bjmsong@126.com>
Co-authored-by: bjmsong <bjmsong@126.com>
2025-01-02 21:11:39 +00:00
Woosuk Kwon
b55ed6ef8a [V1][Minor] Optimize token_ids_cpu copy (#11692)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-02 12:04:58 -07:00
Kathy Yu
2f385183f3 [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. (#10013)
Signed-off-by: Kathy Yu <feiyangyu@google.com>
2025-01-02 10:28:09 -08:00
Chunyang Wen
84c35c374a According to vllm.EngineArgs, the name should be distributed_executor_backend (#11689) 2025-01-02 18:14:16 +00:00
Cyrus Leung
8c38ee7007 [VLM] Merged multi-modal processor for LLaVA-NeXT (#11682)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-02 16:39:27 +00:00
Tobias Pitters
b6087a6bee [mypy] Pass type checking in vllm/inputs (#11680)
Signed-off-by: Tobias Pitters <tobias.pitters@gmail.com>
2025-01-02 16:18:15 +00:00
Cyrus Leung
23c1b10a4c [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (#11674)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-02 17:00:00 +08:00
Cyrus Leung
a115ac46b5 [VLM] Move supported limits and max tokens to merged multi-modal processor (#11669)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-01 15:44:42 +00:00
Woosuk Kwon
73001445fb [V1] Implement Cascade Attention (#11635)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-01 21:56:46 +09:00
Kazuhiro Serizawa
6d70198b17 [Doc] Fix typo (#11666)
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com>
2025-01-01 08:10:10 +00:00
Lu Fang
f962f426bc [Misc] Replace space with - in the file names (#11667)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-01-01 07:39:30 +00:00
Jee Jee Li
11d8a091c6 [Misc] Optimize Qwen2-VL LoRA test (#11663)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-01 14:42:23 +08:00
Cyrus Leung
365801fedd [VLM] Add max-count checking in data parser for single image models (#11661)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-31 22:15:21 -08:00
Joe Runde
4db72e57f6 [Bugfix][Refactor] Unify model management in frontend (#11660)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-01-01 02:21:51 +00:00
Yihua Cheng
0c6f998554 [Benchmark] Add benchmark script for CPU offloading (#11533)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: KuntaiDu <kuntai@uchicago.edu>
2025-01-01 00:10:55 +00:00
Roger Wang
e7c7c5e822 [V1][VLM] V1 support for selected single-image models. (#11632)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-31 21:17:22 +00:00
Chen Zhang
8c3230d8c1 [V1] Simpify vision block hash for prefix caching by removing offset from hash (#11646) 2024-12-31 08:56:01 +00:00
sakunkun
2c5718809b [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. (#11565) 2024-12-31 06:29:04 +00:00
John Giorgi
82c49d3260 [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (#6909)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-30 22:15:58 -08:00
Michael Goin
74fa1d123c [Bugfix] Fix OpenAI parallel sampling when using xgrammar (#11637)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-31 03:43:54 +00:00
Matthias Vogler
a2a40bcd0d [Model][LoRA]LoRA support added for MolmoForCausalLM (#11439)
Signed-off-by: Matthias Vogler <matthias.vogler@joesecurity.org>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Matthias Vogler <matthias.vogler@joesecurity.org>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-30 17:33:06 -08:00
Kevin H. Luu
ccb1aabcca [benchmark] Remove dependency for H100 benchmark step (#11572) 2024-12-30 12:27:07 -08:00
whyiug
36e7670045 [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel (#11631) 2024-12-30 18:51:04 +00:00
Robert Shaw
5886aa496e [V1] [6/N] API Server: Better Shutdown (#11586) 2024-12-30 15:51:02 +00:00
Cyrus Leung
8d9b6721e7 [VLM] Abstract out multi-modal data parsing in merged processor (#11620)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-30 15:01:35 +00:00
youkaichao
b12e87f942 [platforms] enable platform plugins (#11602)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-30 20:24:45 +08:00
Li, Jiang
5dbf854553 [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (#11618)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-12-30 10:17:04 +00:00
Tyler Michael Smith
970d6d0776 [Build][Kernel] Update CUTLASS to v3.6.0 (#11607)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-30 17:22:13 +08:00
Liangfu Chen
628ec6c17b [Docker] bump up neuron sdk v2.21 (#11593)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2024-12-30 13:46:14 +08:00
youkaichao
3682e33f9f [v1] fix compilation cache (#11598)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-30 04:24:12 +00:00
Michael Goin
0aa38d16f5 Remove print statement in DeepseekScalingRotaryEmbedding (#11604) 2024-12-29 20:16:46 +00:00
Kuntai Du
faef77c0d6 [Misc] KV cache transfer connector registry (#11481)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2024-12-29 16:08:09 +00:00
youkaichao
dba4d9dec6 [v1][bugfix] fix cudagraph with inplace buffer assignment (#11596)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-29 09:03:49 +00:00
Cyrus Leung
32b4c63f02 [Doc] Convert list tables to MyST (#11594)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-29 15:56:22 +08:00
Robert Shaw
4fb8e329fd [V1] [5/N] API Server: unify Detokenizer and EngineCore input (#11545)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-28 20:51:57 +00:00
youkaichao
328841d002 [bugfix] interleaving sliding window for cohere2 model (#11583)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-28 16:55:42 +00:00
Cyrus Leung
d427e5cfda [Doc] Minor documentation fixes (#11580)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-28 21:53:59 +08:00
Woosuk Kwon
42bb201fd6 [V1][Minor] Set pin_memory=False for token_ids_cpu tensor (#11581)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-28 13:33:12 +00:00
hj-wei
59d6bb4c86 [Hardware][AMD]: Replace HIPCC version with more precise ROCm version (#11515)
Signed-off-by: hjwei <hjwei_xd@163.com>
2024-12-28 11:17:35 +00:00
Roger Wang
b7dcc003dc [Model] Remove hardcoded image tokens ids from Pixtral (#11582)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-28 10:54:23 +00:00
Isotr0py
d34be24bb1 [Model] Support InternLM2 Reward models (#11571)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-28 06:14:10 +00:00
Rajveer Bachkaniwala
b5cbe8eeb3 [Bugfix] Last token measurement fix (#11376)
Signed-off-by: rajveerb <46040700+rajveerb@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-12-28 11:34:46 +08:00
Robert Shaw
df04dffade [V1] [4/N] API Server: ZMQ/MP Utilities (#11541) 2024-12-28 01:45:08 +00:00
Chen Zhang
a60731247f [Doc] Update mllama example based on official doc (#11567)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2024-12-28 00:31:10 +00:00
Selali
ac79799403 [Bugfix] Fix for ROCM compressed tensor support (#11561) 2024-12-27 20:12:11 +00:00
Isotr0py
dde1fa18c9 [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix (#11566)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-27 19:45:13 +00:00
Jee Jee Li
0240402c46 [Misc]Add BNB quantization for MolmoForCausalLM (#11551)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-27 18:48:24 +00:00
ErezSC42
55509c2114 [MODEL] LoRA support for Jamba model (#11209)
Signed-off-by: Erez Schwartz <erezs@ai21.com>
2024-12-27 17:58:21 +00:00
Cyrus Leung
101418096f [VLM] Support caching in merged multi-modal processor (#11396)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-27 17:22:48 +00:00
Chen1022
5ce4627a7e [Doc] Add xgrammar in doc (#11549)
Signed-off-by: ccjincong <chenjincong11@gmail.com>
2024-12-27 13:05:10 +00:00
Cyrus Leung
7af553ea30 [Misc] Abstract the logic for reading and writing media content (#11527)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-27 19:21:23 +08:00
Jee Jee Li
2c9b8ea2b0 [Bugfix] Fix TeleChat2ForCausalLM weights mapper (#11546)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-27 10:39:15 +00:00
AlexHe99
d003f3ea39 Update deploying_with_k8s.md with AMD ROCm GPU example (#11465)
Signed-off-by: Alex He <alehe@amd.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-27 10:00:04 +00:00
Mengqing Cao
6c6f7fe8a8 [Platform] Move model arch check to platform (#11503)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-12-27 08:45:25 +00:00
Robert Shaw
2339d59f92 [BugFix] Fix quantization for all other methods (#11547)
Some checks failed
Create Release / Create Release (push) Has been cancelled
2024-12-26 22:23:29 -08:00
Robert Shaw
1b875a0ef3 [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (#11534) 2024-12-26 21:19:21 -08:00
youkaichao
eb881ed006 [misc] fix typing (#11540)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-27 11:05:08 +08:00
Robert Shaw
46d4359450 [CI] Fix broken CI (#11543) 2024-12-26 18:49:16 -08:00
Woosuk Kwon
81b979f2a8 [V1] Fix yapf (#11538)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-27 09:47:10 +09:00
Woosuk Kwon
371d04d39b [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (#11394)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-27 09:32:38 +09:00
Robert Shaw
0c0c2015c5 Update openai_compatible_server.md (#11536)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-12-26 16:26:18 -08:00
Simon Mo
82d24f7aac [Docs] Document Deepseek V3 support (#11535)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-12-26 16:21:56 -08:00
Simon Mo
f49777ba62 Deepseek v3 (#11502)
Some checks failed
Create Release / Create Release (push) Has been cancelled
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com>
2024-12-26 16:09:44 -08:00
Robert Shaw
55fb97f7bd [2/N] API Server: Avoid ulimit footgun (#11530) 2024-12-26 23:43:05 +00:00
Michael Goin
2072924d14 [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: HandH1998 <1335248067@qq.com>
2024-12-26 15:33:30 -08:00
Robert Shaw
720b10fdc6 [1/N] API Server (Remove Proxy) (#11529) 2024-12-26 23:03:43 +00:00
Isotr0py
b85a977822 [Doc] Add video example to openai client for multimodal (#11521)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-26 17:31:29 +00:00
Cyrus Leung
eec906d811 [Misc] Add placeholder module (#11501)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-26 13:12:51 +00:00
Jee Jee Li
f57ee5650d [Model] Modify MolmoForCausalLM MLP (#11510)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-26 13:12:05 +00:00
sroy745
dcb1a944d4 [V1] Adding min tokens/repetition/presence/frequence penalties to V1 sampler (#10681)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-26 19:02:58 +09:00
Roger Wang
7492a36207 [Doc] Add QVQ and QwQ to the list of supported models (#11509)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-12-26 09:44:32 +00:00
Jee Jee Li
aa25985bd1 [Misc][LoRA] Fix LoRA weight mapper (#11495)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-26 15:52:48 +08:00
Lucas Tucker
dbeac95dbb Mypy checking for vllm/compilation (#11496)
Signed-off-by: lucast2021 <lucast2021@headroyce.org>
Co-authored-by: lucast2021 <lucast2021@headroyce.org>
2024-12-26 05:04:07 +00:00
Cyrus Leung
51a624bf02 [Misc] Move some multimodal utils to modality-specific modules (#11494)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-26 04:23:20 +00:00
Cyrus Leung
6ad909fdda [Doc] Improve GitHub links (#11491)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-25 14:49:26 -08:00
Cyrus Leung
b689ada91e [Frontend] Enable decord to load video from base64 (#11492)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-25 16:33:55 +00:00
Jiaxin Shan
fc601665eb [Misc] Update disaggregation benchmark scripts and test logs (#11456)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
2024-12-25 06:58:48 +00:00
Rui Qiao
9832e5572a [V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (#11472) 2024-12-24 19:49:46 -08:00
Cyrus Leung
3f3e92e1f2 [Model] Automatic conversion of classification and reward models (#11469)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-24 18:22:22 +00:00
Yuan Tang
409475a827 [Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 (#11435)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-12-24 16:53:28 +00:00
Jee Jee Li
196c34b0ac [Misc] Move weights mapper (#11443)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-24 13:05:25 +00:00
Mengqing Cao
5c7963249d [attn][tiny fix] fix attn backend in MultiHeadAttention (#11463)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-12-24 12:39:36 +00:00
Ilya Lavrenov
461cde2080 [OpenVINO] Fixed installation conflicts (#11458)
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
2024-12-24 11:38:21 +00:00
Isotr0py
7a5286cc04 [Bugfix][Hardware][CPU] Fix CPU input_positions creation for text-only inputs with mrope (#11434)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-24 17:59:51 +08:00
Jee Jee Li
b1b1038fbd [Bugfix] Fix Qwen2-VL LoRA weight loading (#11430)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-24 09:56:10 +00:00
Cyrus Leung
9edca6bf8f [Frontend] Online Pooling API (#11457)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-24 17:54:30 +08:00
dpxa
4f074fbf53 [Misc]Suppress irrelevant exception stack trace information when CUDA… (#11438)
Co-authored-by: shiquan <shiquan>
2024-12-24 08:43:39 +00:00
Rui Qiao
a491d6f535 [V1] TP Ray executor (#11107)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-12-23 23:00:12 +00:00
Rafael Vasquez
32aa2059ad [Docs] Convert rST to MyST (Markdown) (#11145)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-12-23 22:35:38 +00:00
yansh97
94d545a1a1 [Doc] Fix typo in the help message of '--guided-decoding-backend' (#11440) 2024-12-23 20:20:44 +00:00
Michael Goin
60fb4f3bcf [Bugfix] Add kv cache scales to gemma2.py (#11269) 2024-12-23 19:30:45 +00:00
Michael Goin
63afbe9215 [CI] Expand OpenAI test_chat.py guided decoding tests (#11048)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-23 18:35:38 +00:00
Dipika Sikka
8cef6e02dc [Misc] add w8a8 asym models (#11075) 2024-12-23 13:33:20 -05:00
Dipika Sikka
b866cdbd05 [Misc] Add assertion and helpful message for marlin24 compressed models (#11388) 2024-12-24 02:23:38 +08:00
Yuan Tang
2e726680b3 [Bugfix] torch nightly version in ROCm installation guide (#11423)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-12-23 17:20:22 +00:00
Michael Goin
5bfb30a529 [Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF (#11389)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-23 23:06:20 +08:00
Lucas Tucker
e51719ae72 mypy type checking for vllm/worker (#11418)
Signed-off-by: lucast2021 <lucast2021@headroyce.org>
Co-authored-by: lucast2021 <lucast2021@headroyce.org>
2024-12-23 13:55:49 +00:00
youkaichao
f30581c518 [misc][perf] remove old code (#11425)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-23 08:01:08 +00:00
Simon Mo
048fc57a0f [CI] Unboock H100 Benchmark (#11419)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-12-22 14:17:43 -08:00
Jason T. Greene
f1d1bf6288 [Bugfix] Fix fully sharded LoRAs with Mixtral (#11390)
Signed-off-by: Jason Greene <jason.greene@redhat.com>
2024-12-22 23:25:10 +08:00
youkaichao
72d9c316d3 [cd][release] fix race conditions (#11407)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-22 00:39:11 -08:00
youkaichao
4a9139780a [cd][release] add pypi index for every commit and nightly build (#11404)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-12-21 23:53:44 -08:00
Roger Wang
29c748930e [CI] Fix flaky entrypoint tests (#11403)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-21 21:08:44 -08:00
Roger Wang
c2d1b075ba [Bugfix] Fix issues for Pixtral-Large-Instruct-2411 (#11393)
Signed-off-by: ywang96 <ywang@example.com>
Co-authored-by: ywang96 <ywang@example.com>
2024-12-21 10:15:03 +00:00
Ricky Xu
584f0ae40d [V1] Make AsyncLLMEngine v1-v0 opaque (#11383)
Signed-off-by: Ricky Xu <xuchen727@hotmail.com>
2024-12-21 15:14:08 +08:00
George
51ff216d85 [Bugfix] update should_ignore_layer (#11354)
Signed-off-by: George Ohashi <george@neuralmagic.com>
2024-12-21 06:36:23 +00:00
Woosuk Kwon
dd2b5633dd [V1][Bugfix] Skip hashing empty or None mm_data (#11386)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-21 14:22:21 +09:00
Jiaxin Shan
47a0b615b4 Add ray[default] to wget to run distributed inference out of box (#11265)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
2024-12-20 13:54:55 -08:00
youkaichao
5d2248d81a [doc] explain nccl requirements for rlhf (#11381)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-20 13:00:56 -08:00
Michael Goin
d573aeadcc [Bugfix] Don't log OpenAI field aliases as ignored (#11378)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-20 19:03:50 +00:00
omer-dayan
995f56236b [Core] Loading model from S3 using RunAI Model Streamer as optional loader (#10192)
Signed-off-by: OmerD <omer@run.ai>
2024-12-20 16:46:24 +00:00
Daniele
7c7aa37c69 [CI/Build] fix pre-compiled wheel install for exact tag (#11373)
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com>
2024-12-21 00:14:40 +08:00
Roger Wang
04139ade59 [V1] Fix profiling for models with merged input processor (#11370)
Signed-off-by: ywang96 <ywang@roblox.com>
2024-12-20 12:04:21 +00:00
youkaichao
1ecc645b8f [doc] backward compatibility for 0.6.4 (#11359)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-19 21:33:53 -08:00
youkaichao
c954f21ac0 [misc] add early error message for custom ops (#11355)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-19 21:18:25 -08:00
Wallas Henrique
86c2d8fd1c [Bugfix] Fix spec decoding when seed is none in a batch (#10863)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-12-20 05:15:31 +00:00
Michael Goin
b880ffb87e [Misc] Add tqdm progress bar during graph capture (#11349)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-20 04:35:18 +00:00
youkaichao
7801f56ed7 [ci][gh200] dockerfile clean up (#11351)
Signed-off-by: drikster80 <ed.sealing@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: drikster80 <ed.sealing@gmail.com>
Co-authored-by: cenzhiyao <2523403608@qq.com>
2024-12-19 18:13:06 -08:00
Akash kaothalkar
48edab8041 [Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 (#11331)
Signed-off-by: Akash Kaothalkar <0052v2@linux.vnet.ibm.com>
2024-12-20 01:32:07 +00:00
Yuan
a985f7af9f [CI] Adding CPU docker pipeline (#11261)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
2024-12-19 11:46:55 -08:00
yangzhibin
e461c262f0 [Misc] Remove unused vllm/block.py (#11336) 2024-12-19 17:54:24 +00:00
Isotr0py
276738ce0f [Bugfix] Fix broken CPU compressed-tensors test (#11338)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-19 17:37:31 +00:00
Cyrus Leung
cdf22afdda [Misc] Clean up and consolidate LRUCache (#11339)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-20 00:59:32 +08:00
Isotr0py
e24113a8fe [Model] Refactor Qwen2-VL to use merged multimodal processor (#11258)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 16:28:00 +00:00
Roger Wang
7379b3d4b2 [V1] Fix multimodal profiling for Molmo (#11325)
Signed-off-by: ywang96 <ywang@example.com>
Co-authored-by: ywang96 <ywang@example.com>
2024-12-19 16:27:22 +00:00
Yehoshua Cohen
6c7f881541 [Model] Add JambaForSequenceClassification model (#10860)
Signed-off-by: Yehoshua Cohen <yehoshuaco@ai21.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Yehoshua Cohen <yehoshuaco@ai21.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 22:48:06 +08:00
Cyrus Leung
a0f7d53beb [Bugfix] Cleanup Pixtral HF code (#11333)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 13:22:00 +00:00
Yanyi Liu
5aef49806d [Feature] Add load generation config from model (#11164)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
Signed-off-by: Yanyi Liu <wolfsonliu@163.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-12-19 10:50:38 +00:00
Varun Sundar Rabindranath
98356735ac [misc] benchmark_throughput : Add LoRA (#11267)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-19 15:43:16 +08:00
Rui Qiao
f26c4aeecb [Misc] Optimize ray worker initialization time (#11275)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-18 23:38:02 -08:00
Varun Sundar Rabindranath
8936316d58 [Kernel] Refactor Cutlass c3x (#10049)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-19 07:00:18 +00:00
Cyrus Leung
6142ef0ada [VLM] Merged multimodal processor for Qwen2-Audio (#11303)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 06:14:17 +00:00
Chen Zhang
c6b0a7d3ba [V1] Simplify prefix caching logic by removing num_evictable_computed_blocks (#11310) 2024-12-19 04:17:12 +00:00
Michael Goin
a30482f054 [CI] Expand test_guided_generate to test all backends (#11313)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-19 04:00:38 +00:00
Travis Johnson
17ca964273 [Model] IBM Granite 3.1 (#11307)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-12-19 11:27:24 +08:00
Tyler Michael Smith
5a9da2e6e9 [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-19 02:43:30 +00:00
Alexander Matveev
fdea8ec167 [V1] VLM - enable processor cache by default (#11305)
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
2024-12-18 18:54:46 -05:00
Joe Runde
ca5f54a9b9 [Bugfix] fix minicpmv test (#11304)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-12-18 10:34:26 -08:00
Kunshang Ji
f954fe0e65 [FIX] update openai version (#11287)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2024-12-18 10:17:05 -08:00
Simon Mo
362cff1eb3 [CI][Misc] Remove Github Action Release Workflow (#11274) 2024-12-18 10:16:53 -08:00
Isotr0py
996aa70f00 [Bugfix] Fix broken phi3-v mm_processor_kwargs tests (#11263)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-18 10:16:40 -08:00
Dipika Sikka
60508ffda9 [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
Yan Ma
f04e407e6b [MISC][XPU]update ipex link for CI fix (#11278) 2024-12-17 22:34:23 -08:00
Wallas Henrique
8b79f9e107 [Bugfix] Fix guided decoding with tokenizer mode mistral (#11046) 2024-12-17 22:34:08 -08:00
Konrad Zawora
866fa4550d [Bugfix] Restore support for larger block sizes (#11259)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2024-12-17 16:39:07 -08:00
Cody Yu
bf8717ebae [V1] Prefix caching for vision language models (#11187)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-17 16:37:59 -08:00
Michael Goin
c77eb8a33c [Bugfix] Set temperature=0.7 in test_guided_choice_chat (#11264) 2024-12-17 16:34:06 -08:00
Joe Runde
2d1b9baa8f [Bugfix] Fix request cancellation without polling (#11190)
Some checks failed
Create Release / Create Release (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.10, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.11, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.12, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.9, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.10, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.11, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.12, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.9, 2.4.0) (push) Has been cancelled
2024-12-17 12:26:32 -08:00
Isotr0py
f9ecbb18bf [Misc] Allow passing logits_soft_cap for xformers backend (#11252)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-17 00:37:04 -08:00
Roger Wang
02222a0256 [Misc] Kernel Benchmark for RMSNorm (#11241)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Xiaoyu Zhang <BBuf@users.noreply.github.com>
2024-12-17 06:57:02 +00:00
Tyler Michael Smith
2bfdbf2a36 [V1][Core] Use weakref.finalize instead of atexit (#11242)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-16 22:11:33 -08:00
wangxiyuan
e88db68cf5 [Platform] platform agnostic for EngineArgs initialization (#11225)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-12-16 22:11:06 -08:00
Roger Wang
59c9b6ebeb [V1][VLM] Proper memory profiling for image language models (#11210)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: ywang96 <ywang@example.com>
2024-12-16 22:10:57 -08:00
kYLe
66d4b16724 [Frontend] Add OpenAI API support for input_audio (#11027)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-16 22:09:58 -08:00
Michael Goin
0064f697d3 [CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse (#10935)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-17 11:39:58 +08:00
youkaichao
35bae114a8 fix gh200 tests on main (#11246)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 17:22:38 -08:00
youkaichao
88a412ed3d [torch.compile] fast inductor (#11108)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-16 16:15:22 -08:00
youkaichao
c301616ed2 [ci][tests] add gh200 tests (#11244)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 15:53:18 -08:00
bk-TurbaAI
35ffa682b1 [Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving (#11235)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-12-16 22:20:39 +00:00
youkaichao
551603feff [core] overhaul memory profiling and fix backward compatibility (#10511)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 13:32:25 -08:00
Varun Sundar Rabindranath
efbce85f4d [misc] Layerwise profile updates (#10242)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-16 18:14:57 +00:00
Isotr0py
2ca830dbaa [Doc] Reorder vision language examples in alphabet order (#11228)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-16 11:23:33 +00:00
Isotr0py
d927dbcd88 [Model] Refactor Ultravox to use merged input processor (#11198)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-16 10:09:53 +00:00
Jani Monoses
bddbbcb132 [Model] Support Cohere2ForCausalLM (Cohere R7B) (#11203) 2024-12-16 09:56:19 +00:00
cennn
b3b1526f03 WIP: [CI/Build] simplify Dockerfile build for ARM64 / GH200 (#11212)
Signed-off-by: drikster80 <ed.sealing@gmail.com>
Co-authored-by: drikster80 <ed.sealing@gmail.com>
2024-12-16 09:20:49 +00:00
yansh97
17138af7c4 [Bugfix] Fix the default value for temperature in ChatCompletionRequest (#11219) 2024-12-16 00:15:40 -08:00
chenqianfzh
69ba344de8 [Bugfix] Fix block size validation (#10938) 2024-12-15 16:38:40 -08:00
AlexHe99
da6f409246 Update deploying_with_k8s.rst (#10922) 2024-12-15 16:33:58 -08:00
Woosuk Kwon
25ebed2f8c [V1][Minor] Cache np arange to reduce input preparation overhead (#11214)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-15 13:33:00 -08:00
shangmingc
d263bd9df7 [Core] Support disaggregated prefill with Mooncake Transfer Engine (#10884)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2024-12-15 21:28:18 +00:00
Kuntai Du
38e599d6a8 [Doc] add documentation for disaggregated prefilling (#11197)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2024-12-15 13:31:16 -06:00
Cyrus Leung
96d673e0f8 [Bugfix] Fix error handling of unsupported sliding window (#11213)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-15 10:59:42 -07:00
Cyrus Leung
b10609e6a1 [Misc] Clean up multi-modal processor (#11207)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-15 06:30:28 +00:00
youkaichao
a1c02058ba [torch.compile] allow tracking forward time (#11081)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-14 19:45:00 -08:00
Jee Jee Li
15859f2357 [[Misc]Upgrade bitsandbytes to the latest version 0.45.0 (#11201) 2024-12-15 03:03:06 +00:00
Sungjae Lee
886936837c [Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion (#7209) 2024-12-14 11:38:10 -08:00
Mark McLoughlin
6d917d0eeb Enable mypy checking on V1 code (#11105)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2024-12-14 09:54:04 -08:00
Cyrus Leung
93abf23a64 [VLM] Fully dynamic prompt replacement in merged input processor (#11199)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 17:52:18 +00:00
Brad Hilton
9c3dadd1c9 [Frontend] Add logits_processors as an extra completion argument (#11150)
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com>
2024-12-14 16:46:42 +00:00
Jee Jee Li
3cb5769883 [Misc] Minor improvements to the readability of PunicaWrapperBase (#11200)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-14 16:38:27 +00:00
Tyler Michael Smith
ea7bd68d10 [V1][Bugfix] Fix V1 TP trust-remote-code (#11182)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-14 08:21:23 +00:00
Russell Bryant
48259264a4 [Core] Update outlines and increase its threadpool size (#11140)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-14 07:46:18 +00:00
dhuangnm
24a3d12b82 update compressed-tensors to latest version (#11183)
Co-authored-by: dhuangnm <dhuang@MacBook-Pro-2.local>
2024-12-14 03:22:44 +00:00
Cody Yu
9855aea21b [Bugfix][V1] Re-compute an entire block when fully cache hit (#11186)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-13 17:08:23 -08:00
Tyler Michael Smith
4b5b8a6a3b [V1][Bugfix] Fix EngineCoreProc profile (#11185)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-13 17:02:35 -08:00
Russell Bryant
4863e5fba5 [Core] V1: Use multiprocessing by default (#11074)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-13 16:27:32 -08:00
Jiaxin Shan
0d8451c3a4 [Distributed] Allow the placement group more time to wait for resources to be ready (#11138)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
2024-12-13 20:17:37 +00:00
Jani Monoses
0a56bcc03d [Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend (#11169) 2024-12-13 18:00:40 +00:00
Cyrus Leung
0920ab9131 [Doc] Reorganize online pooling APIs (#11172)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 00:22:22 +08:00
Alexander Matveev
238c0d93b4 [Misc] Add tokenizer_mode param to benchmark_serving.py (#11174)
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
2024-12-13 16:19:10 +00:00
zhangjf
5b0ed8391d [Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor (#11156) 2024-12-13 15:56:19 +00:00
Sungjae Lee
c31d4a57a6 [Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching (#8240) 2024-12-13 07:51:25 -08:00
Chenguang Li
d1fa714cb1 [Refactor]A simple device-related refactor (#11163)
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-13 13:39:00 +00:00
Roger Wang
969da7d70b [V1][VLM] Fix edge case bug for InternVL2 (#11165)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-13 11:09:30 +00:00
Cyrus Leung
eeec9e3390 [Frontend] Separate pooling APIs in offline inference (#11129)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-13 10:40:07 +00:00
Li, Jiang
f93bf2b189 [Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt (#11159)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-12-13 08:50:35 +00:00
Jani Monoses
7cd7409142 PaliGemma 2 support (#11142) 2024-12-13 07:40:07 +00:00
youkaichao
be39e3cd18 [core] clean up cudagraph batchsize padding logic (#10996)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-13 06:57:50 +00:00
Cody Yu
34f1a806d5 [Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' (#11157)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-13 06:30:06 +00:00
Gregory Shtrasberg
00c1bde5d8 [ROCm][AMD] Disable auto enabling chunked prefill on ROCm (#11146)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-13 05:31:26 +00:00
Dipika Sikka
3989a79824 [Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization (#11148) 2024-12-13 05:07:20 +00:00
Pooya Davoodi
1efce68605 [Bugfix] Use runner_type instead of task in GritLM (#11144)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2024-12-13 04:09:53 +00:00
Luka Govedič
30870b4f66 [torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-13 03:19:23 +00:00
Cody Yu
78ed8f57d8 [Misc][V1] Fix type in v1 prefix caching (#11151) 2024-12-13 00:57:40 +00:00
shangmingc
db6c264a1e [Bugfix] Fix value unpack error of simple connector for KVCache transfer. (#11058)
Signed-off-by: ShangmingCai <csmthu@gmail.com>
2024-12-12 21:19:17 +00:00
Jeremy Arnold
9f3974a319 Fix logging of the vLLM Config (#11143) 2024-12-12 12:05:57 -08:00
Cody Yu
2c97eca1ff [Misc] Validate grammar and fail early (#11119) 2024-12-12 18:34:26 +00:00
Jeff Cook
5d712571af [Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c7ae. (#11024) 2024-12-12 18:09:20 +00:00
Ramon Ziai
d4d5291cc2 fix(docs): typo in helm install instructions (#11141)
Signed-off-by: Ramon Ziai <ramon.ziai@bettermarks.com>
2024-12-12 17:36:32 +00:00
Roger Wang
4816d20aa4 [V1] Fix torch profiling for offline inference (#11125)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-12 15:51:53 +00:00
Jiaxin Shan
85362f028c [Misc][LoRA] Ensure Lora Adapter requests return adapter name (#11094)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-12 09:25:16 +00:00
youkaichao
62de37a38e [core][distributed] initialization from StatelessProcessGroup (#10986)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-12 09:04:19 +00:00
Sanju C Sudhakaran
8195824206 [Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) (#10565)
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai>
2024-12-12 08:09:28 +00:00
Woosuk Kwon
f092153fbe [V1] Use more persistent buffers to optimize input preparation overheads (#11111)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-11 23:14:20 -08:00
Pooya Davoodi
1da8f0e1dd [Model] Add support for embedding model GritLM (#10816)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2024-12-12 06:39:16 +00:00
Russell Bryant
ccede2b264 [Core] cleanup zmq ipc sockets on exit (#11115)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-11 19:12:24 -08:00
Yuan Tang
24a36d6d5f Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst (#11112)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-12-12 02:39:21 +00:00
Simon Mo
8fb26dac61 [Docs] Add media kit (#11121) 2024-12-11 17:33:11 -08:00
Clayton
7439a8b5fc [Bugfix] Multiple fixes to tool streaming with hermes and mistral (#10979)
Signed-off-by: cedonley <clayton@donley.io>
2024-12-12 01:10:12 +00:00
Alexander Matveev
4e11683368 [V1] VLM preprocessor hashing (#11020)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-12 00:55:30 +00:00
Tyler Michael Smith
452a723bf2 [V1][Core] Remove should_shutdown to simplify core process termination (#11113)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-11 23:34:54 +00:00
Cyrus Leung
d1e21a979b [CI/Build] Split up VLM tests (#11083)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-12 06:18:16 +08:00
Rui Qiao
72ff3a9686 [core] Bump ray to use _overlap_gpu_communication in compiled graph tests (#10410)
Signed-off-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
2024-12-11 11:36:35 -08:00
youkaichao
66aaa7722d [torch.compile] remove graph logging in ci (#11110)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-11 10:59:50 -08:00
Woosuk Kwon
d643c2aba1 [V1] Use input_ids as input for text-only models (#11032)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-11 10:49:23 -08:00
youkaichao
91642db952 [torch.compile] use depyf to dump torch.compile internals (#10972)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-11 10:43:05 -08:00
bingps
fd22220687 [Doc] Installed version of llmcompressor for int8/fp8 quantization (#11103)
Signed-off-by: Guangda Liu <bingps@users.noreply.github.com>
Co-authored-by: Guangda Liu <bingps@users.noreply.github.com>
2024-12-11 15:43:24 +00:00
hissu-hyvarinen
b2f775456e [CI/Build] Enable prefix caching test for AMD (#11098)
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
2024-12-11 15:23:37 +00:00
Cyrus Leung
cad5c0a6ed [Doc] Update docs to refer to pooling models (#11093)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 13:36:27 +00:00
Cyrus Leung
8f10d5e393 [Misc] Split up pooling tasks (#10820)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 01:28:00 -08:00
Rafael Vasquez
40766ca1b8 [Bugfix]: Clamp -inf logprob values in prompt_logprobs (#11073)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-12-11 01:27:39 -08:00
B-201
2e32f5d28d [Bugfix] Fix Idefics3 fails during multi-image inference (#11080)
Signed-off-by: B-201 <Joy25810@foxmail.com>
2024-12-11 01:27:07 -08:00
Russell Bryant
61b1d2f6ae [Core] v1: Use atexit to handle engine core client shutdown (#11076)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-11 01:26:36 -08:00
Kevin H. Luu
9974fca047 [ci/build] Fix entrypoints test and pin outlines version (#11088) 2024-12-11 01:01:53 -08:00
Kevin H. Luu
3fb4b4f163 [ci/build] Fix AMD CI dependencies (#11087) 2024-12-11 00:39:53 -08:00
Cyrus Leung
2e33fe4191 [CI/Build] Check transformers v4.47 (#10991)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 05:02:02 +00:00
Maximilien de Bayser
e39400a4b6 Fix streaming for granite tool call when <|tool_call|> is present (#11069)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-12-11 04:51:40 +00:00
Mor Zusman
ffa48c9146 [Model] PP support for Mamba-like models (#10992)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-12-10 21:53:37 -05:00
Aurick Qiao
d5c5154fcf [Misc] LoRA + Chunked Prefill (#9057) 2024-12-11 10:09:20 +08:00
Tyler Michael Smith
9a93973708 [Bugfix] Fix Mamba multistep (#11071)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-11 00:16:22 +00:00
Woosuk Kwon
134810b3d9 [V1][Bugfix] Always set enable_chunked_prefill = True for V1 (#11061)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-10 14:41:23 -08:00
youkaichao
75f89dc44c [torch.compile] add a flag to track batchsize statistics (#11059)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-10 12:40:52 -08:00
Russell Bryant
e739194926 [Core] Update to outlines >= 0.1.8 (#10576)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-10 12:08:16 -08:00
Flávia Béo
250ee65d72 [BUG] Remove token param #10921 (#11022)
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
2024-12-10 17:38:15 +00:00
Joe Runde
9b9cef3145 [Bugfix] Backport request id validation to v0 (#11036)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-12-10 16:38:23 +00:00
Jee Jee Li
d05f88679b [Misc][LoRA] Add PEFTHelper for LoRA (#11003)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-10 11:12:01 +00:00
Travis Johnson
beb16b2c81 [Bugfix] Handle <|tool_call|> token in granite tool parser (#11039)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-12-10 10:27:11 +00:00
Maxime Fournioux
fe2e10c71b Add example of helm chart for vllm deployment on k8s (#9199)
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
2024-12-10 09:19:27 +00:00
Gene Der Su
82c73fd510 [Bugfix] cuda error running llama 3.2 (#11047) 2024-12-10 07:41:11 +00:00
Diego Marinho
bfd610430c Update README.md (#11034) 2024-12-09 23:08:10 -08:00
Jeff Cook
e35879c276 [Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. (#11043) 2024-12-10 14:54:22 +08:00
youkaichao
ebf778061d monitor metrics of tokens per step using cudagraph batchsizes (#11031)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-09 22:35:36 -08:00
Tyler Michael Smith
28b3a1c7e5 [V1] Multiprocessing Tensor Parallel Support for v1 (#9856)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-10 06:28:14 +00:00
Patrick von Platen
bc192a2b09 [Pixtral] Improve loading (#11040) 2024-12-10 06:09:32 +00:00
Joe Runde
980ad394a8 [Frontend] Use request id from header (#10968)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-12-10 13:46:29 +08:00
Cyrus Leung
391d7b2763 [Bugfix] Fix usage of deprecated decorator (#11025)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-10 13:45:47 +08:00
Isotr0py
d1f6d1c8af [Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba (#10739)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-10 10:23:07 +08:00
Michael Goin
6d525288c1 [Docs] Add dedicated tool calling page to docs (#10554)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-09 20:15:34 -05:00
Woosuk Kwon
6faec54505 [V1] Do not store None in self.generators (#11038)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-09 15:08:19 -08:00
Richard Liu
5ed5d5f128 Build tpu image in release pipeline (#10936)
Signed-off-by: Richard Liu <ricliu@google.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
2024-12-09 23:07:48 +00:00
Gregory Shtrasberg
b63ba84832 [ROCm][bugfix] scpecilative decoding worker class (#11035)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-09 14:00:29 -08:00
xendo
9c6459e4cb [Neuron] Upgrade neuron to 2.20.2 (#11016)
Signed-off-by: Jerzy Zagorski <jzagorsk@amazon.com>
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com>
2024-12-09 13:53:24 -08:00
youkaichao
1a2f8fb828 [v1] fix use compile sizes (#11000)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-09 13:47:24 -08:00
Konrad Zawora
cbcbdb1ceb [Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2024-12-09 13:21:06 -08:00
Isotr0py
a811dd6608 [Model] merged input processor for Phi-3-Vision models (#10977)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-09 12:55:10 -08:00
Jee Jee Li
ca871491ed [Misc][LoRA] Abstract PunicaWrapper (#10955)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-09 12:54:44 -08:00
Woosuk Kwon
3b61cb450d [V1] Further reduce CPU overheads in flash-attn (#10989)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-09 12:38:46 -08:00
Kevin H. Luu
edc4fa3188 [ci/build] Recompile CI dependencies list with Python 3.12 (#11013)
Signed-off-by: kevin <kevin@anyscale.com>
2024-12-09 11:46:58 -08:00
Varun Sundar Rabindranath
25b79d9fd3 [V1] Input Batch Relocation (#10962)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-09 09:33:41 -08:00
wangxiyuan
aea2fc38c3 [Platform] Move async output check to platform (#10768)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-12-09 17:24:46 +00:00
Russell Bryant
e691b26f6f [Core] Require xgrammar >= 0.1.6 (#11021)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-09 16:44:27 +00:00
Roger Wang
c690357928 [V1] Fix Detokenizer loading in AsyncLLM (#10997)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-09 16:27:10 +00:00
youkaichao
d1c2e15eb3 [torch.compile] add dynamo time tracking (#11005)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 23:09:04 -08:00
Roger Wang
af7c4a92e6 [Doc][V1] Add V1 support column for multimodal models (#10998)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-08 22:29:16 -08:00
youkaichao
46004e83a2 [misc] clean up and unify logging (#10999)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 17:28:27 -08:00
youkaichao
43b05fa314 [torch.compile][misc] fix comments (#10993)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 11:18:18 -08:00
Roger Wang
a11f326528 [V1] Initial support of multimodal models for V1 re-arch (#10699)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-08 12:50:51 +00:00
youkaichao
fd57d2b534 [torch.compile] allow candidate compile sizes (#10984)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 11:05:21 +00:00
youkaichao
7be15d9356 [core][misc] remove use_dummy driver for _run_workers (#10920)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-07 12:06:08 -08:00
youkaichao
1b62745b1d [core][executor] simplify instance id (#10976)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-07 09:33:45 -08:00
zhou fan
78029b34ed [BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928)
Signed-off-by: xffxff <1247714429@qq.com>
2024-12-08 01:21:18 +08:00
Cyrus Leung
c889d5888b [Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 17:20:49 +00:00
Cyrus Leung
39e227c7ae [Model] Update multi-modal processor to support Mantis(LLaVA) model (#10711)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 17:10:05 +00:00
Cyrus Leung
1c768fe537 [Doc] Explicitly state that InternVL 2.5 is supported (#10978)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 16:58:02 +00:00
Cyrus Leung
bf0e382e16 [Model] Composite weight loading for multimodal Qwen2 (#10944)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 07:22:52 -07:00
Isotr0py
b26b4cd03c [Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation (#10958)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-07 18:33:49 +08:00
Gregory Shtrasberg
f13cf9ad50 [Build] Fix for the Wswitch-bool clang warning (#10060)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-07 09:03:44 +00:00
Cyrus Leung
955fa9533a [3/N] Support and implement merged input processor for LLaVA model (#10676)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-07 00:50:58 -08:00
Jee Jee Li
acf092d348 [Bugfix] Fix test-pipeline.yaml (#10973)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-07 12:08:54 +08:00
Russell Bryant
69d357ba12 [Core] Cleanup startup logging a bit (#10961)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-07 02:30:23 +00:00
youkaichao
dcdc3fafe5 [ci] fix broken tests (#10956)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-06 11:25:47 -08:00
youkaichao
c05cfb67da [misc] fix typo (#10960)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-06 11:25:20 -08:00
Sam Stoelinga
7406274041 [Doc] add KubeAI to serving integrations (#10837)
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com>
2024-12-06 17:03:56 +00:00
Michael Goin
8b59631855 [Core] Support Lark grammars for XGrammar (#10870)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-06 08:34:29 -07:00
youkaichao
a1887f2c96 [torch.compile] fix deprecated code (#10948)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-06 11:01:23 +00:00
Cyrus Leung
222f5b082a [CI/Build] Fix broken multimodal test (#10950) 2024-12-06 10:41:23 +00:00
youkaichao
b031a455a9 [torch.compile] add logging for compilation time (#10941)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-06 10:07:15 +00:00
youkaichao
db87eb6c67 [torch.compile] use size tuning for specific sizes (#10933)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-05 20:30:41 -08:00
youkaichao
9743d64e4e [ci][build] add tests for python only compilation (#10915)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-05 08:54:47 -08:00
Konrad Zawora
a43065272f [Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2024-12-05 08:47:46 -08:00
Isotr0py
998eeafe58 [CI/Build] Bump test transformers version (#10106)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-05 16:05:52 +00:00
Jee Jee Li
571da8fc43 [Misc][LoRA] Clean up the function interface of Punica (#10917)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-05 13:22:28 +00:00
Travis Johnson
39c89e71a8 [Misc] Update llama 3.2 template to support system prompt with images (#10901)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-12-05 05:54:06 +00:00
Jee Jee Li
1f958a7d52 [Bugfix] Fix BNB loader target_modules (#10720)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-05 13:20:26 +08:00
Cyrus Leung
aa39a8e175 [Doc] Create a new "Usage" section (#10827)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-05 11:19:35 +08:00
Michael Goin
8d370e91cb [Bugfix] Fallback to outlines for complex json schemas (#10899)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-05 11:14:06 +08:00
Kevin H. Luu
7883c2bbe7 [benchmark] Make H100 benchmark optional (#10908) 2024-12-04 17:02:17 -08:00
Woosuk Kwon
2a56e1264f [V1] Fix when max_model_len is not divisible by block_size (#10903)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-04 16:54:05 -08:00
Daniele
e4c34c23de [CI/Build] improve python-only dev setup (#9621)
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-12-04 21:48:13 +00:00
Chendi.Xue
82eb5ea8f3 Benchmark serving structured output (#10880)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-12-04 16:28:21 -05:00
Isotr0py
10398b4706 [Model] Consolidate ViTs attention implementation without mask (#10893)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-04 18:11:08 +00:00
Xin Yang
01d079fd8e [LoRA] Change lora_tokenizers capacity (#10796)
Signed-off-by: Xin Yang <xyang19@gmail.com>
2024-12-04 17:40:16 +00:00
Kevin H. Luu
c92acb9693 [ci/build] Update vLLM postmerge ECR repo (#10887) 2024-12-04 09:01:20 +00:00
jianzheng
8db957ee3a [bugfix] fixed parameter “n” when set parameter “bestof” > 1 (#10854)
Signed-off-by: jianzheng <57654625+o2363286@users.noreply.github.com>
2024-12-04 08:48:22 +00:00
Kevin H. Luu
c9ca4fce3f [ci/build] Job to build and push release image (#10877) 2024-12-04 15:02:40 +08:00
Kevin H. Luu
fa2dea61df [ci/build] Change queue name for Release jobs (#10875) 2024-12-04 15:02:16 +08:00
wangxiyuan
b5b647b084 Drop ROCm load format check (#10767)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-12-04 04:32:21 +00:00
Tyler Michael Smith
d2bd88b122 [CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-04 03:23:21 +00:00
Chendi.Xue
381ac93bb5 [Benchmark] Benchmark structured output with datasets (#10557)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
2024-12-03 17:21:06 -07:00
Gregory Shtrasberg
a061fe601e [Build][Bugfix] Using the correct type hint (#10866)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-03 15:47:55 -05:00
tomeras91
7c32b6861e [Frontend] correctly record prefill and decode time metrics (#10853)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-12-03 19:13:31 +00:00
Michael Goin
7090c27bb2 [Bugfix] Only require XGrammar on x86 (#10865)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-03 10:32:21 -08:00
Yan Ma
2f2cdc745a [MISC][XPU] quick fix for XPU CI (#10859)
Signed-off-by: yan ma <yan.ma@intel.com>
2024-12-03 17:16:31 +00:00
Alexander Matveev
3bc94cab69 [V1] VLM - Run the mm_mapper preprocessor in the frontend process (#10640)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-03 10:33:10 +00:00
Yang Zheng
f6084f6324 [Speculative Decoding] Move indices to device before filtering output (#10850)
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
2024-12-03 17:01:39 +08:00
Aaron Pham
9323a3153b [Core][Performance] Add XGrammar support for guided decoding and set it as default (#10785)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-12-03 15:17:00 +08:00
Cyrus Leung
3257d449fa [Misc] Remove deprecated names (#10817)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-03 06:52:57 +00:00
Russell Bryant
ef51831ee8 [Doc] Add github links for source code references (#10672)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-03 06:46:07 +00:00
youkaichao
dc5ce861bf [torch.compile] remove compilation_context and simplify code (#10838)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 06:19:02 +00:00
youkaichao
21fe7b481a [core][distributed] add pynccl broadcast (#10843)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 04:53:23 +00:00
Jee Jee Li
a4cf256159 [Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-03 12:10:29 +08:00
zixuanzhang226
d746268e92 [Model] support bitsandbytes quantization with minicpm model (#10842)
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com>
2024-12-03 03:06:41 +00:00
Michael Goin
4433195ab7 [Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts (#10753) 2024-12-03 02:26:15 +00:00
Isotr0py
4c05edb33a [Model] Add TP and BNB quantization support to LlavaMultiModalProjector (#10834)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-02 23:06:09 +00:00
Jani Monoses
9b14d978aa Fix openvino on GPU (#10793) 2024-12-02 18:52:19 +00:00
Yan Ma
519cc6ca12 [Misc][XPU] Avoid torch compile for XPU platform (#10747)
Signed-off-by: yan ma <yan.ma@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-12-02 17:53:55 +00:00
Jee Jee Li
b45f0d7946 [Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-02 17:53:36 +00:00
youkaichao
a4c4daf364 [misc] use out argument for flash attention (#10822)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-02 10:50:10 +00:00
Cyrus Leung
e95f275f57 [CI/Build] Update mistral_common version for tests and docs (#10825)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-02 10:26:10 +00:00
zhou fan
ef31eabc68 [Model]: add some tests for aria model (#10770)
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-02 05:36:36 +00:00
wangxiyuan
995a148575 [doc]Update config docstring (#10732)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-12-02 04:14:45 +00:00
youkaichao
63a164172d [misc] remove xverse modeling file (#10814)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-02 03:27:13 +00:00
Maximilien de Bayser
e25810ae29 Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-12-02 10:05:32 +08:00
Woosuk Kwon
073a4bd1c0 [Kernel] Use out arg in flash_attn_varlen_func (#10811)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-01 17:55:39 -08:00
cduk
b7954776fd [core] Avoid metrics log noise when idle - include speculative decodi… (#10809) 2024-12-02 01:49:48 +00:00
Isotr0py
b18c9bbaba [Model] Add BNB support to Llava and Pixtral-HF (#10795)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-02 01:31:09 +00:00
Kuntai Du
0590ec3fd9 [Core] Implement disagg prefill by StatelessProcessGroup (#10502)
This PR provides initial support for single-node disaggregated prefill in 1P1D scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn>
2024-12-01 19:01:00 -06:00
Roger Wang
c11f172187 [Misc] Adding MMMU-Pro vision dataset to serving benchmark (#10804)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-01 08:47:05 +00:00
youkaichao
169a0ff911 [doc] add warning about comparing hf and vllm outputs (#10805)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-01 00:41:38 -08:00
Cyrus Leung
d2f058e76c [Misc] Rename embedding classes to pooling (#10801)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 14:36:51 +08:00
Cyrus Leung
f877a7d12a [Misc] Improve type annotations for support_torch_compile (#10763)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-30 17:48:35 -08:00
Cyrus Leung
133707123e [Model] Replace embedding models with pooling adapter (#10769)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 08:02:54 +08:00
wangxiyuan
7e4bbda573 [doc] format fix (#10789)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-11-30 11:38:40 +00:00
Patrick von Platen
e7cfc4ef4c [Interleaved ATTN] Support for Mistral-8B (#10591)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-30 07:45:50 +00:00
Isotr0py
16ee07f22a [Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-30 04:19:14 +00:00
Nicolò Lucchesi
40bc242579 [Bugfix] Fix OpenVino/Neuron driver_worker init (#10779)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-11-30 12:07:13 +08:00
wangxiyuan
661175bc82 [platform] Add verify_quantization in platform. (#10757)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-11-29 15:22:21 +00:00
Jee Jee Li
3132aac043 [Bugfix] Fix Idefics3 bug (#10778)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-29 13:56:46 +00:00
wang.yuqi
c82b432d4a [Misc] typo find in sampling_metadata.py (#10740) 2024-11-29 05:17:57 +00:00
Cyrus Leung
fa6ecb9aa7 [Model] Clean up MiniCPMV (#10751)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-29 04:47:06 +00:00
Isotr0py
c83919c7a6 [Model] Add Internlm2 LoRA support (#5064)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-28 17:29:04 +00:00
Woosuk Kwon
98f47f2a40 [V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 09:01:02 -08:00
Woosuk Kwon
8c1e77fb58 [Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 08:31:28 -08:00
sixgod
5fc5ce0fe4 [Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-11-28 14:53:31 +00:00
Richard Liu
3ed5e73146 [TPU] Update requirements-tpu (#10726)
Signed-off-by: Richard Liu <ricliu@google.com>
2024-11-28 02:30:48 -08:00
Woosuk Kwon
9a8bff0285 [Kernel] Update vllm-flash-attn version (#10736)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 02:25:59 -08:00
Woosuk Kwon
a79b122400 [V1] Do not allocate beyond the max_model_len (#10730)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 00:13:15 -08:00
Ricky Xu
d9b4b3f069 [Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-27 23:59:28 -08:00
罗泽轩
278be671a3 [Doc] Update model in arch_overview.rst to match comment (#10701)
Signed-off-by: spacewander <spacewanderlzx@gmail.com>
2024-11-27 23:58:39 -08:00
zixuanzhang226
70dc14fbd0 [Model] support bitsandbytes quantization with minicpm3 model (#10682)
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com>
2024-11-27 23:58:02 -08:00
youkaichao
cb4e1c3f3a [misc] upgrade filelock version (#10731)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 19:54:58 -08:00
tomeras91
395b1c7454 [Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server (#10635)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-27 13:21:10 -08:00
Cyrus Leung
9b4b150395 [Bugfix] Ignore lm_head when loading embedding models (#10719)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-27 19:05:29 +00:00
Mor Zusman
197b4484a3 [Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-11-27 19:02:27 +00:00
Isotr0py
b98c62ba49 [Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-27 10:43:17 -08:00
youkaichao
c411def234 [torch.compile] fix shape specialization (#10722)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 10:16:10 -08:00
youkaichao
308cc5e21e [ci] fix slow tests (#10698)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 09:26:14 -08:00
Roger Wang
9e0a147d50 [V1] Update interface for mistral-format Pixtral (#10703)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-27 12:26:27 +00:00
Li, Jiang
418cb3b93f [Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-27 11:55:38 +00:00
shunxing12345
1209261e93 [Model] Support telechat2 (#10311)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-11-27 11:32:35 +00:00
Tyler Michael Smith
e2251109c7 [Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-26 22:55:32 -08:00
Jee Jee Li
15cc2a9f1a [Misc]Further reduce BNB static variable (#10597)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-26 22:54:12 -08:00
Kunshang Ji
e85250b1d1 [Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2024-11-26 22:49:40 -08:00
yansh97
cfb3bf25fb [bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig (#10657) 2024-11-27 13:55:23 +08:00
jeongin601
1bf905ddaa [Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. (#10198)
Signed-off-by: jeongin601 <0200angela@gmail.com>
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com>
2024-11-27 05:07:30 +00:00
Roger Wang
0a4d968500 [V1] Update interface for idefics3 (#10680)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-27 10:04:01 +08:00
Chendi.Xue
0a71900bc9 Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2024-11-26 17:57:11 -08:00
Roger Wang
2f0a0a17a4 [V1] Refactor model executable interface for multimodal models (#10570)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-26 20:46:11 +00:00
Michael Goin
7576cd38df [Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642) 2024-11-26 12:29:00 -08:00
Michael Goin
9a99273b48 [Bugfix] Fix using -O[0,3] with LLM entrypoint (#10677)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-26 10:44:01 -08:00
Conroy Cheers
f5792c7c4a [Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
Signed-off-by: Conroy Cheers <conroy@corncheese.org>
2024-11-26 10:26:28 -08:00
Murali Andoorveedu
db66e018ea [Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
2024-11-26 09:11:16 -08:00
Kunshang Ji
1f6584ee85 [V1] Enable profile for LLMEngine (#10665) 2024-11-26 10:36:45 +00:00
youkaichao
334d64d1e8 [ci] add vllm_test_utils (#10659)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-26 00:20:04 -08:00
Cyrus Leung
940635343a [Misc] Remove outdated init protocols (#10655)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-26 14:55:00 +08:00
Sage Moore
9a88f89799 custom allreduce + torch.compile (#10121)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-25 22:00:16 -08:00
Ricky Xu
519e8e4182 [v1] EngineArgs for better config handling for v1 (#10382)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-25 21:09:43 -08:00
Sanket Kale
a6760f6456 [Feature] vLLM ARM Enablement for AARCH64 CPUs (#9228)
Signed-off-by: Sanket Kale <sanketk.kale@fujitsu.com>
Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-11-25 18:32:39 -08:00
youkaichao
45ac4ff270 [bugfix] fix aria model and add torch.compile (#10645)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 18:32:09 -08:00
youkaichao
6e9ff050c8 [misc] do not read HOST_IP (#10644)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 17:04:50 -08:00
Shane A
9db713a1dc [Model] Add OLMo November 2024 model (#10503) 2024-11-25 17:26:40 -05:00
Cyrus Leung
1b583cfefa [Doc] Fix typos in docs (#10636)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 10:15:45 -08:00
Cyrus Leung
cf73f0c95e [Model] Enable optional prefix when loading embedding models (#10639)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 18:14:33 +00:00
zhou fan
b1d920531f [Model]: Add support for Aria model (#10514)
Signed-off-by: xffxff <1247714429@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-11-25 18:10:55 +00:00
Simon Mo
452a4e80c3 [Docs] Add Snowflake Slides (#10641)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-25 09:34:46 -08:00
Wallas Henrique
c27df94e1f [Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-11-25 12:23:32 -05:00
Chauncey
d04b13a380 [Bug]: Authorization ignored when root_path is set (#10606)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-25 16:21:41 +00:00
fzyzcjy
2b0879bfc2 Super tiny little typo fix (#10633) 2024-11-25 13:08:30 +00:00
Cyrus Leung
ed46f14321 [Model] Support is_causal HF config field for Qwen2 model (#10621)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 09:51:20 +00:00
youkaichao
05d1f8c9c6 [misc] move functions to config.py (#10624)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 09:27:30 +00:00
youkaichao
25d806e953 [misc] add torch.compile compatibility check (#10618)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-24 23:40:08 -08:00
youkaichao
65813781a2 [torch.compile] add warning for unsupported models (#10622)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-24 23:27:51 -08:00
Jee Jee Li
7c2134beda [torch.compile] force inductor threads (#10620)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-24 23:04:21 -08:00
Cyrus Leung
a30a605d21 [Doc] Add encoder-based models to Supported Models page (#10616)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 06:34:07 +00:00
youkaichao
571841b7fc [torch.compile] support encoder based models (#10613)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 05:24:33 +00:00
Mengqing Cao
7ea3cd7c3e [Refactor][MISC] del redundant code in ParallelConfig.postinit (#10614)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-25 05:14:56 +00:00
Maximilien de Bayser
214efc2c3c Support Cross encoder models (#10400)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-24 18:56:20 -08:00
Zhuohan Li
49628fe13e [Doc] Update README.md with Ray Summit talk links (#10610) 2024-11-24 16:45:09 -08:00
youkaichao
e4fbb14414 [doc] update the code to add models (#10603)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-24 11:21:40 -08:00
youkaichao
c055747867 [model][utils] add extract_layer_index utility function (#10599)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-23 22:22:54 -08:00
youkaichao
eda2b3589c Revert "Print running script to enhance CI log readability" (#10601) 2024-11-23 21:31:47 -08:00
Jee Jee Li
1c445dca51 [CI/Build] Print running script to enhance CI log readability (#10594)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-24 03:57:13 +00:00
Jee Jee Li
1700c543a5 [Bugfix] Fix LoRA weight sharding (#10450)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-23 17:23:17 -08:00
Jee Jee Li
17d8fc1806 [bugfix] Fix example/tensorize_vllm_model tests (#10595)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-23 17:22:33 -08:00
Isotr0py
04668ebe7a [Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-23 18:12:20 +00:00
Nishidha
651f6c31ac For ppc64le, disabled tests for now and addressed space issues (#10538) 2024-11-23 09:33:53 +00:00
JiHuazhong
86a44fb896 [Platforms] Refactor openvino code (#10573)
Signed-off-by: statelesshz <hzji210@gmail.com>
2024-11-22 22:23:12 -08:00
Isotr0py
4cfe5d2bca [Bugfix] multi_modal_kwargs broadcast for CPU tensor parallel (#10541)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-22 21:25:46 -08:00
Cyrus Leung
c8acd80548 [2/N] handling placeholders in merged multi-modal processor (#10485)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-22 21:25:09 -08:00
Ricky Xu
4634a89d18 Prefix Cache Aware Scheduling [1/n] (#10128)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-22 21:15:55 -08:00
kliuae
7c25fe45a6 [AMD] Add support for GGUF quantization on ROCm (#10254) 2024-11-22 21:14:49 -08:00
Michael Goin
02a43f82a9 Update default max_num_batch_tokens for chunked prefill to 2048 (#10544) 2024-11-22 21:14:19 -08:00
Chen Wu
cfea9c04ef [Model] Fix Baichuan BNB online quantization (#10572)
Signed-off-by: Chen Wu <cntryroa@gmail.com>
2024-11-22 21:13:59 -08:00
Varun Vinayak Shenoy
7d8ffb344f [Bugfix] Internal Server Error when tool_choice is incorrect. (#10567)
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com>
2024-11-22 21:13:29 -08:00
youkaichao
4aba6e3d1a [core] gemma2 full context length support (#10584)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 20:13:54 -08:00
Tyler Michael Smith
978b39744b [Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432) 2024-11-22 22:14:03 -05:00
Russell Bryant
ebda51968b [Core] Fix broken log configuration (#10458)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-23 10:23:51 +08:00
Travis Johnson
9195dbdbca [Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use (#10164)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-23 10:17:38 +08:00
youkaichao
d559979c54 [bugfix] fix cpu tests (#10585)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 17:34:03 -08:00
Zhonghua Deng
d345f409b7 [V1] EngineCore supports profiling (#10564)
Signed-off-by: Abatom <abzhonghua@gmail.com>
2024-11-22 17:16:15 -08:00
Russell Bryant
28598f3939 [Core] remove temporary local variables in LLMEngine.__init__ (#10577)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-22 16:22:53 -08:00
zixuanzhang226
948c859571 support bitsandbytes quantization with qwen model (#10549)
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com>
2024-11-22 16:16:14 -08:00
Ricky Xu
97814fbf0f [v1] Refactor KVCacheManager for more hash input than token ids (#10507)
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-11-22 23:27:25 +00:00
youkaichao
eebad39f26 [torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 14:04:42 -08:00
youkaichao
db100c5cde [bugfix] fix full graph tests (#10581)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 10:02:14 -08:00
Noam Gat
11fcf0e066 Remove token-adding chat embedding params (#10551)
Signed-off-by: Noam Gat <noamgat@gmail.com>
2024-11-21 23:59:47 -08:00
Isotr0py
b6374e09b0 [Bugfix] Fix Phi-3 BNB quantization with tensor parallel (#9948)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-22 15:01:56 +08:00
youkaichao
a111d0151f [platforms] absorb worker cls difference into platforms folder (#10555)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-21 21:00:32 -08:00
Woosuk Kwon
446c7806b2 [Minor] Fix line-too-long (#10563)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 19:40:40 -08:00
youkaichao
33e0a2540a [9/N] torch.compile LLM usage (#10552)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 19:13:31 -08:00
Simon Mo
aed074860a [Benchmark] Add new H100 machine (#10547) 2024-11-21 18:27:20 -08:00
Michael Goin
9afa014552 Add small example to metrics.rst (#10550) 2024-11-21 23:43:43 +00:00
Woosuk Kwon
46fe9b46d8 [Minor] Revert change in offline inference example (#10545)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 21:28:16 +00:00
youkaichao
cf656f5a02 [misc] improve error message (#10553)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 13:13:17 -08:00
Yunmeng
edec3385b6 [CI][Installation] Avoid uploading CUDA 11.8 wheel (#10535)
Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-11-21 13:03:58 -08:00
Woosuk Kwon
f9310cbd0c [V1] Fix Compilation config & Enable CUDA graph by default (#10528)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 12:53:39 -08:00
youkaichao
7560ae5caf [8/N] enable cli flag without a space (#10529)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 12:30:42 -08:00
Cyrus Leung
e7a8341c7c [Bugfix] Allow token ID-only inputs in Qwen2-Audio (#10536)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-21 18:09:43 +00:00
Roger Wang
c51e397fe8 [Misc] Suppress duplicated logging regarding multimodal input pipeline (#10530)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-21 09:21:31 -08:00
Jee Jee Li
2385b60d83 [Kernel] Register punica ops directly (#10522)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-21 09:18:11 -08:00
Chauncey
da7e702c6f [Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored (#10180)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-21 16:24:32 +00:00
Xiaoyu Zhang
4d676f0852 [Bugfix] Embedding model pooling_type equals ALL and multi input's bug (#10494) 2024-11-21 14:40:02 +00:00
Isotr0py
d5ec121f95 [Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models (#10518)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-21 14:20:08 +00:00
Wang, Yi
8a93a598d9 fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len (#10524)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-11-21 11:15:36 +00:00
Alex Brooks
1cfde82ffd [Model] Add Support for Multimodal Granite Models (#10291)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-21 10:46:20 +00:00
Zhong Qishuai
f0e0238016 [Doc] fix a small typo in docstring of llama_tool_parser (#10513) 2024-11-21 09:05:23 +00:00
youkaichao
aaddce5d26 [platforms] improve error message for unspecified platforms (#10520)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 23:07:56 -08:00
Cyrus Leung
3430857b64 [Misc] Increase default video fetch timeout (#10495)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 23:06:42 -08:00
Luka Govedič
8b0fe06c89 [torch.compile] Inductor code caching fix (#10273)
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>
2024-11-20 21:44:57 -08:00
Mengqing Cao
9d827170a3 [Platforms] Add device_type in Platform (#10508)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-21 04:44:20 +00:00
Pavani Majety
6c1208d083 [Core] Add Sliding Window Support with Flashinfer (#10462)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2024-11-20 19:56:47 -08:00
youkaichao
388ee3de66 [torch.compile] limit inductor threads and lazy import quant (#10482)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 18:36:33 -08:00
Woosuk Kwon
2f77b6cfec [TPU] Implement prefix caching for TPUs (#10307)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-20 13:54:15 -08:00
Guillaume Calmettes
c68f7ede6a [Bugfix]: allow extra fields in requests to openai compatible server (#10463)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-20 16:42:21 -05:00
youkaichao
0cd3d9717e [7/N] torch.compile, reduce compilation time (#10460)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 11:20:38 -08:00
Simon Mo
5f1d6af2b6 [perf bench] H200 development (#9768)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-20 11:06:56 -08:00
youkaichao
772a66732d [platforms] restore xpu check for parallel config (#10479)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 17:13:28 +00:00
Li, Jiang
63f1fde277 [Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-20 10:57:39 +00:00
Mengqing Cao
d5b28447e0 [Platforms] Refactor xpu code (#10468)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-19 22:52:13 -08:00
Cyrus Leung
09dbf9ff16 [Bugfix] Handle conflicts between modern and legacy fields (#10471)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 14:45:08 +08:00
Sky Lee
343041c4c4 [model] Reduce medusa weight (#10454)
Signed-off-by: skylee-01 <497627264@qq.com>
2024-11-20 06:05:55 +00:00
Kevin H. Luu
ed701ca963 [ci/build] Combine nightly and optional (#10465) 2024-11-19 21:36:03 -08:00
wchen61
7629a9c6e5 [CI/Build] Support compilation with local cutlass path (#10423) (#10424) 2024-11-19 21:35:50 -08:00
Rafael Vasquez
709c9f1f25 [CI/Build] Add sphinx/rst linter for docs (#10366) 2024-11-19 21:35:31 -08:00
Cyrus Leung
b4be5a8adb [Bugfix] Enforce no chunked prefill for embedding models (#10470)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 05:12:51 +00:00
Isotr0py
ad44437ba3 [Bugfix] Fix Mamba model initialization and MLP Speculator weights loading (#10456)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-20 05:04:05 +00:00
Yanyi Liu
9e05252b46 [Misc] Add __setitem__ for LazyDict (#10469)
Signed-off-by: Yanyi Liu <wolfsonliu@163.com>
2024-11-20 04:44:57 +00:00
Lucas Wilkinson
d200972e7f [Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-19 19:40:33 -08:00
Alexei-V-Ivanov-AMD
d5b68aba2f [CI/Build] Update Dockerfile.rocm (#10434)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2024-11-19 17:19:59 -08:00
Maximilien de Bayser
a324d3a1a7 Change granite chat template to keep json list formatting for tool calls (#10452)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
2024-11-19 18:16:54 -07:00
ElizaWszola
b00b33d77e [Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
Russell Bryant
efa9084628 [Core] Avoid metrics log noise when idle (#8868)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-19 21:05:25 +00:00
youkaichao
803f37eaaa [6/N] torch.compile rollout to users (#10437)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-19 10:09:03 -08:00
Russell Bryant
fd9f124971 [Doc] fix link for page that was renamed (#10455)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-19 09:48:30 -08:00
Manjul Mohan
1ea291a417 Fix: Build error seen on Power Architecture (#10421)
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com>
Signed-off-by: B-201 <Joy25810@foxmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: ismael-dm <ismaeldm99@gmail.com>
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: yan ma <yan.ma@intel.com>
Signed-off-by: Angus Wang <wangjadehao@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Manjul Mohan manjul.mohan@ibm.com <manjulmohan@ltcd97-lp2.aus.stglabs.ibm.com>
Co-authored-by: B-201 <Joy25810@foxmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: ismael-dm <ismaeldm99@gmail.com>
Co-authored-by: Andrew Nesbitt <andrewnez@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Angus Wang <wangjadehao@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Ricky Xu <rickyx@anyscale.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2024-11-19 09:34:57 -08:00
Patrick von Platen
11fd7ea639 [Pixtral-Large] Pixtral actually has no bias in vision-lang adapter (#10449) 2024-11-19 17:33:06 +00:00
COSMOPlat
f028dff33d [BugFix] Fix hermes tool parser output error stream arguments in some cases (#10395) (#10398)
Signed-off-by: xiyuan lee <lixiyuan@haier.com>
2024-11-19 13:42:50 +00:00
Yuan
b4614656b8 [CI][CPU] adding numa node number as container name suffix (#10441)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2024-11-19 13:16:43 +00:00
youkaichao
25f9c78961 [misc][plugin] improve plugin loading (#10443)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-19 10:43:21 +00:00
Russell Bryant
5390d6664f [Doc] Add the start of an arch overview page (#10368) 2024-11-19 09:52:11 +00:00
Jee Jee Li
382b6a4852 [Misc] Avoid misleading warning messages (#10438)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-19 08:54:58 +00:00
Travis Johnson
272e31c0bd [Bugfix] Guard for negative counter metrics to prevent crash (#10430)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-19 04:57:10 +00:00
Michael Goin
74f8c2cf5f Add openai.beta.chat.completions.parse example to structured_outputs.rst (#10433) 2024-11-19 04:37:46 +00:00
Mengqing Cao
8c1fb50705 [Platform][Refactor] Extract func get_default_attn_backend to Platform (#10358)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-11-19 11:22:26 +08:00
Jee Jee Li
7eb719df13 [Bugfix]Fix Phi-3 BNB online quantization (#10417)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-19 03:21:42 +00:00
Kevin H. Luu
284203f171 [ci/build] Have dependabot ignore all patch update (#10436)
We have too many dependencies and all patch updates can be a little noisy. This is to have dependabot ignore all patch version updates.
2024-11-19 01:04:25 +00:00
Ricky Xu
90a6c759ca [misc] partial prefix & random input generation benchmark (#9929)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-18 15:39:14 -08:00
youkaichao
2298e69b5f [ci][bugfix] fix kernel tests (#10431)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-18 15:29:37 -08:00
youkaichao
a03ea40792 [3/N][torch.compile] consolidate custom op logging (#10399)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-18 15:14:59 -08:00
Lucas Wilkinson
96d999fbe8 [Kernel] Initial Machete W4A8 support + Refactors (#9855)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-18 12:59:29 -07:00
Angus Wang
c2170a5b39 [Kernel] Explicitly specify other value in tl.load calls (#9014)
Signed-off-by: Angus Wang <wangjadehao@gmail.com>
2024-11-18 11:39:40 -08:00
Yan Ma
6b2d25efc7 [Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107)
Signed-off-by: yan ma <yan.ma@intel.com>
2024-11-18 11:18:05 -07:00
Michael Goin
281cc4b3cd [Model][Bugfix] Support TP for PixtralHF ViT (#10405)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-18 10:04:14 -08:00
Andrew Nesbitt
4f686d139f Fix open_collective value in FUNDING.yml (#10426)
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com>
2024-11-18 09:52:42 -08:00
ismael-dm
31894a2155 [Doc] Add documentation for Structured Outputs (#9943)
Signed-off-by: ismael-dm <ismaeldm99@gmail.com>
2024-11-18 09:52:12 -08:00
youkaichao
7851b45196 [5/N][torch.compile] torch.jit.script --> torch.compile (#10406)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-18 23:20:06 +08:00
B-201
4186be8111 [Doc] Update doc for LoRA support in GLM-4V (#10425)
Signed-off-by: B-201 <Joy25810@foxmail.com>
2024-11-18 15:08:30 +00:00
Isotr0py
e7ebb662d7 [Model] Remove transformers attention porting in VITs (#10414)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-18 21:45:21 +08:00
B-201
5be4e52b65 [Model][LoRA]LoRA support added for glm-4v (#10418)
Signed-off-by: B-201 <Joy25810@foxmail.com>
2024-11-18 12:57:10 +00:00
Maybewuss
01aae1cc68 [Model] Remove redundant softmax when using PoolingType.STEP (#10415) 2024-11-18 10:05:36 +00:00
lkchen
c7dec926f6 [VLM] Report multi_modal_placeholders in output (#10407)
Signed-off-by: Linkun Chen <lkchen+anyscale@github.com>
2024-11-18 16:06:16 +08:00
youkaichao
51bb12d17b [4/N][torch.compile] clean up set_torch_compile_backend (#10401)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-17 23:57:20 -08:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
47826cacf0 [Bugfix] Ignore ray reinit error when current platform is ROCm or XPU (#10375)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-11-18 11:29:26 +08:00
Isotr0py
c4e464333e [Misc] Add uninitialized params tracking for AutoWeightsLoader (#10327)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-18 09:07:46 +08:00
wchen61
d1557e66d3 [Misc] Enhance offline_inference to support user-configurable paramet… (#10392)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2024-11-17 11:32:40 +00:00
电脑星人
80d85c5d7b [Bugfix] Fix mrope_position_delta in non-last prefill chunk (#10403)
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-17 08:50:24 +00:00
Kunshang Ji
76aab90ab6 [Hardware] [HPU]add mark_step for hpu (#10239)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2024-11-17 00:44:44 -08:00
youkaichao
8d74b5aee9 [platforms] refactor cpu code (#10402)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 23:14:23 -08:00
Isotr0py
cf349c4a97 [Bugfix][CPU] Fix CPU embedding runner with tensor parallel (#10394)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-16 23:12:04 -08:00
Chendi.Xue
905d0f0af4 [CI/Build] Fix IDC hpu [Device not found] issue (#10384)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2024-11-17 14:58:22 +08:00
Roger Wang
643ecf7b11 [V1] Refactor model executable interface for all text-only language models (#10374)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-17 05:18:46 +00:00
youkaichao
4fd9375028 [2/N][torch.compile] make compilation cfg part of vllm cfg (#10383)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 18:02:14 -08:00
Woosuk Kwon
661a34fd4f [V1] Add code owners for V1 (#10397)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-16 10:45:26 -08:00
电脑星人
361c29e174 [Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled (#10388)
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-17 02:10:00 +08:00
Sky Lee
b98d89efd4 [Misc] Medusa supports custom bias (#10361) 2024-11-16 16:33:01 +00:00
Jaehyun An
8b6725b0cf [Misc] Update benchmark to support image_url file or http (#10287)
Signed-off-by: rbbang <anjaehyun87@gmail.com>
2024-11-16 18:15:40 +08:00
rasmith
1d75472626 [BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel (#10385)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2024-11-16 09:55:05 +00:00
youkaichao
2f427c2d16 [misc][plugin] improve log messages (#10386)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 01:23:20 -08:00
youkaichao
755b85359b [doc] add doc for the plugin system (#10372)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-15 21:46:27 -08:00
Cyrus Leung
32e46e000f [Frontend] Automatic detection of chat content format from AST (#9919)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-16 13:35:40 +08:00
Michael Green
4f168f69a3 [Docs] Misc updates to TPU installation instructions (#10165) 2024-11-15 13:26:17 -08:00
Russell Bryant
3e8d14d8a1 [Doc] Move PR template content to docs (#10159)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-15 13:20:20 -08:00
Russell Bryant
a067f85e08 [Frontend] Add --version flag to CLI (#10369)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-15 13:13:53 -08:00
Simon Mo
c76ac49d26 [Docs] Add Nebius as sponsors (#10371)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-15 12:47:40 -08:00
Simon Mo
a6221a144a [Misc] bump mistral common version (#10367)
Some checks failed
Create Release / Create Release (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.10, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.11, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.12, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.9, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.10, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.11, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.12, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.9, 2.4.0) (push) Has been cancelled
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-15 09:48:07 -08:00
ElizaWszola
79ee45b428 [Misc] Bump up test_fused_moe tolerance (#10364)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-15 16:31:18 +00:00
Guillaume Calmettes
691a3ec047 [Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer (#10363)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-15 14:50:40 +00:00
youkaichao
3a763ba0c3 [core][misc] keep compatibility for old-style classes (#10356)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-15 13:55:51 +00:00
shangmingc
f2056f726d [Misc] Fix some help info of arg_utils to improve readability (#10362) 2024-11-15 12:40:30 +00:00
Jee Jee Li
1d65ec7eeb [Bugfix] Fix fully sharded LoRA bug (#10352)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-15 10:34:58 +00:00
Xin Yang
26908554b2 [Doc] Remove float32 choice from --lora-dtype (#10348)
Signed-off-by: Xin Yang <xyang19@gmail.com>
2024-11-15 10:22:57 +00:00
Cyrus Leung
b311efd0bd [Misc] Fix import error in tensorizer tests and cleanup some code (#10349)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-15 09:34:17 +00:00
wchen61
3d158cdc8d Add default value to avoid Falcon crash (#5363) (#10347)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2024-11-15 08:52:20 +00:00
Simon Mo
02dbf30e9a [Build] skip renaming files for release wheels pipeline (#9671)
Some checks failed
Create Release / Create Release (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.10, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.11, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.12, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.9, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.10, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.11, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.12, 2.4.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.9, 2.4.0) (push) Has been cancelled
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-14 23:31:52 -08:00
Cyrus Leung
2ac6d0e75b [Misc] Consolidate pooler config overrides (#10351)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-15 06:59:00 +00:00
Sky Lee
2ec8827288 [Bugfix] Qwen-vl output is inconsistent in speculative decoding (#10350) 2024-11-15 05:40:10 +00:00
Cyrus Leung
b40cf6402e [Model] Support Qwen2 embeddings and use tags to select model tests (#10184) 2024-11-14 20:23:09 -08:00
Tyler Michael Smith
2885ba0e24 [Misc] Change RedundantReshapesPass and FusionPass logging from info to debug (#10308)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-15 02:44:26 +00:00
Luka Govedič
bf2ddc6610 [bugfix] Fix static asymmetric quantization case (#10334)
Signed-off-by: Daniël de Kok <me@danieldk.eu>
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Daniël de Kok <me@danieldk.eu>
2024-11-15 09:35:11 +08:00
Cyrus Leung
972112d82f [Bugfix] Fix unable to load some models (#10312)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-14 16:55:54 -08:00
Patrick von Platen
11cd1ae6ad [Tool parsing] Improve / correct mistral tool parsing (#10333) 2024-11-15 00:42:49 +00:00
Zijin Xiao
554af9228d [Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 (#9583)
Signed-off-by: xiaozijin <xiaozijin@bytedance.com>
2024-11-14 16:38:53 -08:00
Murali Andoorveedu
b2e0ad3b59 [Perf] Reduce peak memory usage of llama (#10339)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
2024-11-15 00:38:20 +00:00
Maximilien de Bayser
4a18fd14ba Support Roberta embedding models (#9387)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-14 21:23:29 +00:00
Woosuk Kwon
1dbae0329c [Docs] Publish meetup slides (#10331)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-14 16:19:38 +00:00
Cyrus Leung
675d603400 [CI/Build] Make shellcheck happy (#10285)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-14 09:47:53 +00:00
Isotr0py
03025c023f [CI/Build] Fix CPU CI online inference timeout (#10314)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-14 16:45:32 +08:00
youkaichao
29f3ef26a3 [ci][distributed] disable hanging tests (#10317)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-14 00:23:39 -08:00
B-201
294bf467ba [Model] Add BNB quantization support for Idefics3 (#10310)
Signed-off-by: B-201 <Joy25810@foxmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-14 06:31:44 +00:00
Guillaume Calmettes
52b48c1ead [BugFix]: properly deserialize tool_calls iterator before processing by mistral-common when MistralTokenizer is used (#9951)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-14 04:48:16 +00:00
Mike Depinet
f67ce05d0b [Frontend] Pythonic tool parser (#9859)
Signed-off-by: Mike Depinet <mike@fixie.ai>
2024-11-14 04:14:34 +00:00
Russell Bryant
e0853b6508 [Misc] format.sh: Simplify tool_version_check (#10305)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-14 11:12:35 +08:00
youkaichao
504ac53d18 [misc] error early for old-style class (#10304)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-13 18:55:39 -08:00
Isotr0py
15bb8330aa [Bugfix] Fix tensor parallel for qwen2 classification model (#10297)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-14 10:54:59 +08:00
HoangCongDuc
ac49b59d8b [Bugfix] bitsandbytes models fail to run pipeline parallel (#10200)
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com>
2024-11-13 09:56:39 -07:00
Cyrus Leung
0b8bb86bf1 [1/N] Initial prototype for multi-modal processor (#10044)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-13 12:39:03 +00:00
Roger Wang
bb7991aa29 [V1] Add missing tokenizer options for Detokenizer (#10288)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-13 11:02:56 +00:00
B-201
d909acf9fe [Model][LoRA]LoRA support added for idefics3 (#10281)
Signed-off-by: B-201 <Joy25810@foxmail.com>
2024-11-13 17:25:59 +08:00
Pavani Majety
b6dde33019 [Core] Flashinfer - Remove advance step size restriction (#10282) 2024-11-13 16:29:32 +08:00
Austin Veselka
1b886aa104 [Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 (#9944)
Signed-off-by: FurtherAI <austin.veselka@lighton.ai>
Co-authored-by: FurtherAI <austin.veselka@lighton.ai>
2024-11-13 08:28:13 +00:00
电脑星人
3945c82346 [Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions (#10221)
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-13 07:07:22 +00:00
Xin Yang
032fcf16ae [Doc] Fix typo in arg_utils.py (#10264)
Signed-off-by: Xin Yang <xyang19@gmail.com>
2024-11-12 21:54:52 -08:00
Dipika Sikka
56a955e774 Bump to compressed-tensors v0.8.0 (#10279)
Signed-off-by: Dipika <dipikasikka1@gmail.com>
2024-11-12 21:54:10 -08:00
Woosuk Kwon
bbd3e86926 [V1] Support VLMs with fine-grained scheduling (#9871)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-11-13 04:53:13 +00:00
youkaichao
0d4ea3fb5c [core][distributed] use tcp store directly (#10275)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-12 17:36:08 -08:00
Woosuk Kwon
112fa0bbe5 [V1] Fix CI tests on V1 engine (#10272)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-12 16:17:20 -08:00
youkaichao
377b74fe87 Revert "[ci][build] limit cmake version" (#10271) 2024-11-12 15:06:48 -08:00
youkaichao
18081451f9 [doc] improve debugging doc (#10270)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-12 14:43:52 -08:00
youkaichao
96ae0eaeb2 [doc] fix location of runllm widget (#10266)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-12 14:34:39 -08:00
Woosuk Kwon
1f55e05713 [V1] Enable Inductor when using piecewise CUDA graphs (#10268)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-12 13:39:56 -08:00
Umesh
8a06428c70 [LoRA] Adds support for bias in LoRA (#5733)
Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com>
Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com>
2024-11-12 11:08:40 -08:00
sroy745
b41fb9d3b1 [Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers (#9982)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2024-11-12 10:53:57 -08:00
Woosuk Kwon
7c65527918 [V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest (#10245)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-12 08:57:14 -08:00
zifeitong
47db6ec831 [Frontend] Add per-request number of cached token stats (#10174) 2024-11-12 16:42:28 +00:00
Jie Fu (傅杰)
176fcb1c71 [Bugfix] Fix QwenModel argument (#10262)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2024-11-12 16:36:51 +00:00
Jee Jee Li
a838ba7254 [Misc]Fix Idefics3Model argument (#10255)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-12 13:07:11 +00:00
Guillaume Calmettes
36c513a076 [BugFix] Do not raise a ValueError when tool_choice is set to the supported none option and tools are not defined. (#10000)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-12 11:13:46 +00:00
Yuan
d201d41973 [CI][CPU]refactor CPU tests to allow to bind with different cores (#10222)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2024-11-12 10:07:32 +00:00
youkaichao
3a28f18b0b [doc] explain the class hierarchy in vLLM (#10240)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 22:56:44 -08:00
Aleksandr Malyshev
812c981fa0 Splitting attention kernel file (#10091)
Signed-off-by: maleksan85 <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2024-11-11 22:55:07 -08:00
Jee Jee Li
7f5edb5900 [Misc][LoRA] Replace hardcoded cuda device with configurable argument (#10223)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-12 11:10:15 +08:00
youkaichao
eea55cca5b [1/N] torch.compile user interface design (#10237)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 18:01:06 -08:00
Russell Bryant
9cdba9669c [Doc] Update help text for --distributed-executor-backend (#10231)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-12 09:55:09 +08:00
youkaichao
d1c6799b88 [doc] update debugging guide (#10236)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 15:21:12 -08:00
Robert Shaw
6ace6fba2c [V1] AsyncLLM Implementation (#9826)
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-11 23:05:38 +00:00
Nikolai Shcheglov
08f93e7439 Make shutil rename in python_only_dev (#10233)
Signed-off-by: shcheglovnd <shcheglovnd@avride.ai>
2024-11-11 14:29:19 -08:00
Woosuk Kwon
9d5b4e4dea [V1] Enable custom ops with piecewise CUDA graphs (#10228)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:58:07 -08:00
youkaichao
8a7fe47d32 [misc][distributed] auto port selection and disable tests (#10226)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 11:54:59 -08:00
Yuan Tang
4800339c62 Add docs on serving with Llama Stack (#10183)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2024-11-11 11:28:55 -08:00
Woosuk Kwon
fe15729a2b [V1] Use custom ops for piecewise CUDA graphs (#10227)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:26:48 -08:00
youkaichao
330e82d34a [v1][torch.compile] support managing cudagraph buffer (#10203)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:10:27 -08:00
Woosuk Kwon
d7a4f2207b [V1] Do not use inductor for piecewise CUDA graphs (#10225)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 11:05:57 -08:00
Woosuk Kwon
f9dadfbee3 [V1] Fix detokenizer ports (#10224)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-11 10:42:07 -08:00
dependabot[bot]
25144ceed0 Bump actions/setup-python from 5.2.0 to 5.3.0 (#10209)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 17:24:10 +00:00
youkaichao
e6de9784d2 [core][distributed] add stateless process group (#10216)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 09:02:14 -08:00
Yangcheng Li
36fc439de0 [Doc] fix doc string typo in block_manager swap_out function (#10212) 2024-11-11 08:53:07 -08:00
harrywu
874f551b36 [Metrics] add more metrics (#4464)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-12 00:17:38 +08:00
Isotr0py
2cebda42bb [Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner (#10218)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-11 12:37:58 +00:00
Roger Wang
5fb1f935b0 [V1] Allow tokenizer_mode and trust_remote_code for Detokenizer (#10211)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-11 18:01:18 +08:00
Jee Jee Li
36e4acd02a [LoRA][Kernel] Remove the unused libentry module (#10214)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-11 09:43:23 +00:00
Isotr0py
58170d6503 [Hardware][CPU] Add embedding models support for CPU backend (#10193)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-11 08:54:28 +00:00
dependabot[bot]
9804ac7c7c Bump the patch-update group with 5 updates (#10210)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 07:22:40 +00:00
youkaichao
f89d18ff74 [6/N] pass whole config to inner model (#10205)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 06:41:46 +00:00
youkaichao
f0f2e5638e [doc] improve debugging code (#10206)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-10 17:49:40 -08:00
yansh97
ad9a78bf64 [Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py (#10196) 2024-11-11 00:14:22 +00:00
youkaichao
73b9083e99 [misc] improve cloudpickle registration and tests (#10202)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-11 00:10:53 +00:00
Shawn Du
20cf2f553c [Misc] small fixes to function tracing file path (#9543)
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-10 15:21:06 -08:00
Yongzao
bfb7d61a7c [doc] Polish the integration with huggingface doc (#10195)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-10 10:22:04 -08:00
FuryMartin
19682023b6 [Doc] Fix typo error in CONTRIBUTING.md (#10190)
Signed-off-by: FuryMartin <furymartin9910@outlook.com>
2024-11-10 07:47:24 +00:00
youkaichao
9fa4bdde9d [ci][build] limit cmake version (#10188)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-09 16:27:26 -08:00
Cyrus Leung
51c2e1fcef [CI/Build] Split up models tests (#10069)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 11:39:14 -08:00
Krishna Mandal
b09895a618 [Frontend][Core] Override HF config.json via CLI (#5836)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 16:19:27 +00:00
cjackal
d88bff1b96 [Frontend] add add_request_id middleware (#9594)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2024-11-09 10:18:29 +00:00
Zhao Yingzhuo
9e37266420 bugfix: fix the bug that stream generate not work (#2756) 2024-11-09 10:09:48 +00:00
youkaichao
8a4358ecb5 [doc] explaining the integration with huggingface (#10173)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-09 01:02:54 -08:00
youkaichao
bd46357ad9 [bugfix] fix broken tests of mlp speculator (#10177)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-09 00:04:50 -08:00
bnellnm
f192aeba74 [Bugfix] Enable some fp8 and quantized fullgraph tests (#10171)
Signed-off-by: Bill Nell <bill@neuralmagic.com>
2024-11-09 08:01:27 +00:00
Chendi.Xue
8e1529dc57 [CI/Build] Add run-hpu-test.sh script (#10167)
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
2024-11-09 06:26:52 +00:00
youkaichao
1a95f10ee7 [5/N] pass the whole config to model (#9983)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-09 14:17:28 +08:00
Cyrus Leung
49d2a41a86 [Doc] Adjust RunLLM location (#10176)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-08 20:07:10 -08:00
Isotr0py
47672f38b5 [CI/Build] Fix VLM broadcast tests tensor_parallel_size passing (#10161)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-09 04:02:59 +00:00
Michael Goin
f83feccd7f [Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module (#10169)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-09 03:36:46 +00:00
Cyrus Leung
e0191a95d8 [0/N] Rename MultiModalInputs to MultiModalKwargs (#10040)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 11:31:02 +08:00
Li, Jiang
d7edca1dee [CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking (#6892)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-09 03:27:11 +00:00
rasmith
127c07480e [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case (#9857)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2024-11-08 19:59:22 -05:00
bnellnm
10b67d865d [Bugfix] SymIntArrayRef expected to contain concrete integers (#10170)
Signed-off-by: Bill Nell <bill@neuralmagic.com>
2024-11-08 14:44:18 -08:00
Luka Govedič
4f93dfe952 [torch.compile] Fuse RMSNorm with quant (#9138)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-11-08 21:20:08 +00:00
Florian Zimmermeister
e1b5a82179 Rename vllm.logging to vllm.logging_utils (#10134) 2024-11-08 20:53:24 +00:00
Luka Govedič
87713c6053 [CI/Build] Ignore .gitignored files for shellcheck (#10162)
Signed-off-by: luka <luka@neuralmagic.com>
2024-11-08 19:53:36 +00:00
Woosuk Kwon
b5815c8413 [V1] Fix non-cudagraph op name (#10166)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-08 10:23:04 -08:00
Rafael Vasquez
6b30471586 [Misc] Improve Web UI (#10090)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-11-08 09:51:04 -08:00
sroy745
f6778620a9 Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 (#10136)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2024-11-08 15:56:18 +00:00
Patrick von Platen
0535e5fe6c Fix edge case Mistral tokenizer (#10152) 2024-11-08 15:42:27 +00:00
Cyrus Leung
b489fc3c91 [CI/Build] Update CPU tests to include all "standard" tests (#5481)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-08 23:30:04 +08:00
Roger Wang
208ce622c7 [V1]Enable APC by default only for text models (#10148)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-08 14:39:41 +00:00
Isotr0py
1ff4aed5bd [Model] Expose size to Idefics3 as mm_processor_kwargs (#10146)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-08 09:56:58 +00:00
Yan Ma
f10797c0ce [Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator (#10144)
Signed-off-by: yan ma <yan.ma@intel.com>
2024-11-08 09:41:03 +00:00
Cyrus Leung
f4c2187e29 [Misc] Fix typo in #5895 (#10145)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-08 09:07:01 +00:00
Michael Goin
aea6ad629f Add hf_transfer to testing image (#10096) 2024-11-08 08:35:25 +00:00
Tao He
da07a9ead7 Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. (#9285)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
2024-11-08 05:31:28 +00:00
Russell Bryant
3a7f15a398 [Doc] Move CONTRIBUTING to docs site (#9924)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-08 05:15:12 +00:00
Mengqing Cao
7371749d54 [Misc] Fix ImportError causing by triton (#9493) 2024-11-08 05:08:51 +00:00
DearPlanet
ad39bd640c [Bugfix] Add error handling when server cannot respond any valid tokens (#5895) 2024-11-08 04:58:37 +00:00
whyiug
40d0e7411d [Doc] Update FAQ links in spec_decode.rst (#9662)
Signed-off-by: whyiug <whyiug@hotmail.com>
2024-11-08 04:44:58 +00:00
Russell Bryant
6bb52b0f97 [CI/Build] Give PR cleanup job PR write access (#10139)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-08 12:10:20 +08:00
Cody Yu
201fc07730 [V1] Prefix caching (take 2) (#9972)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-11-07 17:34:44 -08:00
Woosuk Kwon
42b4f46b71 [V1] Add all_token_ids attribute to Request (#10135)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-07 17:08:24 -08:00
Jiangtao Hu
073a472728 [Misc] report relevant env vars in collect_env.py tool (#9293) 2024-11-07 16:14:01 -08:00
dependabot[bot]
93bff421bc Bump actions/checkout from 4.2.1 to 4.2.2 (#9746)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 21:44:58 +00:00
litianjian
28b2877d30 Online video support for VLMs (#10020)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-07 20:25:59 +00:00
dependabot[bot]
97b8475beb Bump actions/setup-python from 5.2.0 to 5.3.0 (#9745)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 18:55:35 +00:00
Russell Bryant
a2f1f3b089 [CI/Build] Automate PR body text cleanup (#10082)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-07 18:26:28 +00:00
Russell Bryant
3be5b26a76 [CI/Build] Add shell script linting using shellcheck (#7925)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-07 18:17:29 +00:00
Russell Bryant
de0e61a323 [CI/Build] Always run mypy (#10122)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-07 16:43:16 +00:00
Nicolò Lucchesi
9d43afcc53 [Feature] [Spec decode]: Combine chunked prefill with speculative decoding (#9291)
Signed-off-by: NickLucche <nlucches@redhat.com>
2024-11-07 08:15:14 -08:00
Maximilien de Bayser
ae62fd17c0 [Frontend] Tool calling parser for Granite 3.0 models (#9027)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-11-07 07:09:02 -08:00
Atlas
a62bc0109c [Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. (#10105)
Signed-off-by: Mozhou <spli161006@gmail.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-11-07 11:20:30 +00:00
Jiahao Li
999df95b4e [Bugfix] Make image processor respect mm_processor_kwargs for Qwen2-VL (#10112)
Signed-off-by: Jiahao Li <liplus17@163.com>
2024-11-07 10:50:44 +00:00
Li, Jiang
a6f332d0d9 [Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target (#10108)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-07 18:42:50 +08:00
Lei Yang
0dfba97b42 [Frontend] Fix multiple values for keyword argument error (#10075) (#10076)
Signed-off-by: Lei <ylxx@live.com>
2024-11-07 09:07:19 +00:00
Flávia Béo
aa9078fa03 Adds method to read the pooling types from model's files (#9506)
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
2024-11-07 08:42:40 +00:00
Russell Bryant
e036e527a0 [CI/Build] Improve mypy + python version matrix (#10041)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-07 07:54:16 +00:00
Hanzhi Zhou
6192e9b8fe [Core][Distributed] Refactor ipc buffer init in CustomAllreduce (#10030)
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com>
2024-11-06 23:50:47 -08:00
Rafael Vasquez
d7263a1bb8 Doc: Improve benchmark documentation (#9927)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-11-06 23:50:35 -08:00
Russell Bryant
104d729656 [CI/Build] re-add codespell to CI (#10083)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-06 22:54:46 -08:00
Cyrus Leung
db7db4aab9 [Misc] Consolidate ModelConfig code related to HF config (#10104)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-07 06:00:21 +00:00
Nick Hill
1fa020c539 [V1][BugFix] Fix Generator construction in greedy + seed case (#10097)
Signed-off-by: Nick Hill <nhill@redhat.com>
2024-11-07 05:06:57 +00:00
youkaichao
e7b84c394d [doc] add back Python 3.8 ABI (#10100)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-06 21:06:41 -08:00
Li, Jiang
a4b3e0c1e9 [Hardware][CPU] Update torch 2.5 (#9911)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-07 04:43:08 +00:00
Nick Hill
29862b884b [Frontend] Adjust try/except blocks in API impl (#10056)
Signed-off-by: Nick Hill <nhill@redhat.com>
2024-11-06 20:07:51 -08:00
Yan Ma
d3859f1891 [Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend (#9823)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: yan ma <yan.ma@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-11-06 17:29:03 -08:00
Michael Goin
4ab3256644 [Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 (#10095)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-07 00:54:13 +00:00
youkaichao
719c1ca468 [core][distributed] add stateless_init_process_group (#10072)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-06 16:42:09 -08:00
Russell Bryant
74f2f8a0f1 [CI/Build] Always run the ruff workflow (#10092)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-06 22:25:23 +00:00
Joe Runde
d58268c56a [V1] Make v1 more testable (#9888)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-06 11:57:35 -08:00
Russell Bryant
87bd7e0515 [CI/Build] change conflict PR comment from mergify (#10080)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-06 10:15:42 -08:00
Russell Bryant
098f94de42 [CI/Build] Drop Python 3.8 support (#10038)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-06 14:31:01 +00:00
Michael Goin
399c798608 Remove ScaledActivation for AWQ (#10057)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-06 14:27:06 +00:00
Eric
406d4cc480 [Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration (#10022)
Signed-off-by: ericperfect <ericperfectttt@gmail.com>
2024-11-06 14:13:15 +00:00
Jee Jee Li
a5bba7d234 [Model] Add Idefics3 support (#9767)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: B-201 <Joy25810@foxmail.com>
Co-authored-by: B-201 <Joy25810@foxmail.com>
2024-11-06 11:41:17 +00:00
Jee Jee Li
2003cc3513 [Model][LoRA]LoRA support added for LlamaEmbeddingModel (#10071)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-06 09:49:19 +00:00
Woosuk Kwon
6a585a23d2 [Hotfix] Fix ruff errors (#10073)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-06 01:24:28 -08:00
Konrad Zawora
a02a50e6e5 [Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend (#6143)
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Signed-off-by: Bob Zhu <bob.zhu@intel.com>
Signed-off-by: zehao-intel <zehao.huang@intel.com>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai>
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai>
Co-authored-by: Marceli Fylcek <mfylcek@habana.ai>
Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com>
Co-authored-by: Vivek Goel <vgoel@habana.ai>
Co-authored-by: yuwenzho <yuwen.zhou@intel.com>
Co-authored-by: Dominika Olszewska <dolszewska@habana.ai>
Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com>
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com>
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com>
Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai>
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com>
Co-authored-by: Ilia Taraban <tarabanil@gmail.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai>
Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com>
Co-authored-by: Sun Choi <schoi@habana.ai>
Co-authored-by: Iryna Boiko <iboiko@habana.ai>
Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com>
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Co-authored-by: Zehao Huang <zehao.huang@intel.com>
Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com>
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com>
Co-authored-by: Nir David <ndavid@habana.ai>
Co-authored-by: Yu-Zhou <yu.zhou@intel.com>
Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai>
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai>
Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com>
Co-authored-by: Jacek Czaja <jczaja@habana.ai>
Co-authored-by: Yuan <yuan.zhou@outlook.com>
2024-11-06 01:09:10 -08:00
Isotr0py
a5fda50a10 [CI/Build] Fix large_gpu_mark reason (#10070)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-06 08:50:37 +00:00
Aaron Pham
21063c11c7 [CI/Build] drop support for Python 3.8 EOL (#8464)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2024-11-06 07:11:55 +00:00
youkaichao
4be3a45158 [distributed] add function to create ipc buffers directly (#10064)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-05 22:35:03 -08:00
Woosuk Kwon
4089985552 [V1] Integrate Piecewise CUDA graphs (#10058)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-05 22:16:04 -08:00
zifeitong
9d59b75593 [Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type (#10054)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
2024-11-06 05:13:09 +00:00
arakowsk-amd
ea928f608c [Bugfix] Gpt-j-6B patch kv_scale to k_scale path (#10063)
Signed-off-by: Alex Rakowski <alex.rakowski@amd.com>
Signed-off-by: Alex Rakowski <182798202+arakowsk-amd@users.noreply.github.com>
2024-11-06 05:10:40 +00:00
Travis Johnson
2bcbae704c [Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer (#10051)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-06 04:28:29 +00:00
Peter Salas
ffc0f2b47a [Model][OpenVINO] Fix regressions from #8346 (#10045)
Signed-off-by: Peter Salas <peter@fixie.ai>
2024-11-06 04:19:15 +00:00
Cyrus Leung
82bfc38d07 [Misc] Sort the list of embedding models (#10037)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-06 04:05:05 +00:00
youkaichao
c4cacbaa7f [v1] reduce graph capture time for piecewise cudagraph (#10059)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-05 18:19:50 -08:00
Sungjae Lee
0c63c34f72 [Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode (#9730)
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-11-06 01:45:45 +00:00
Wallas Henrique
966e31697b [Bugfix] Fix pickle of input when async output processing is on (#9931)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-11-06 00:39:26 +00:00
zifeitong
43300bd98a [Bugfix] Properly propagate trust_remote_code settings (#10047)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
2024-11-05 16:34:40 -08:00
youkaichao
ca9844b340 [bugfix] fix weak ref in piecewise cudagraph and tractable test (#10048)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-05 14:49:20 -08:00
Michael Goin
235366fe2e [CI] Prune back the number of tests in tests/kernels/* (#9932)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 16:02:32 -05:00
Michael Goin
02462465ea [CI] Prune tests/models/decoder_only/language/* tests (#9940)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 16:02:23 -05:00
Jee Jee Li
b9c64c0ca7 [Misc] Modify BNB parameter name (#9997)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-05 14:40:08 -05:00
lkchen
d2e80332a7 [Feature] Update benchmark_throughput.py to support image input (#9851)
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net>
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net>
2024-11-05 19:30:02 +00:00
Michael Goin
a53046b16f [Model] Support quantization of PixtralHFTransformer for PixtralHF (#9921)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-05 10:42:20 -08:00
Russell Bryant
731aec5be7 [CI/Build] Limit github CI jobs based on files changed (#9928)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-05 10:30:42 -08:00
Chenghao (Alan) Yang
09d3550372 [Misc] Add logging for CUDA memory (#10027)
Signed-off-by: Chenghao Yang <yangalan1996@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Chenghao Yang <yangalan1996@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-05 09:50:50 -08:00
Richard Liu
cd34029e91 Refactor TPU requirements file and pin build dependencies (#10010)
Signed-off-by: Richard Liu <ricliu@google.com>
2024-11-05 16:48:44 +00:00
Russell Bryant
5952d81139 [Frontend] Fix tcp port reservation for api server (#10012)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-05 07:50:57 -08:00
Chauncey
93dee88f6b [Misc] vllm CLI flags should be ordered for better user readability (#10017)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-05 18:59:56 +08:00
Gene Der Su
7a83b1aec0 [BugFix] Lazy import ray (#10021) 2024-11-05 10:04:10 +00:00
Tyler Michael Smith
ad23318928 [Bugfix] Fixup Mamba (#10004)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-05 03:46:38 +00:00
Cyrus Leung
bbc3619dc8 [Core] Make encoder-decoder inputs a nested structure to be more composable (#9604)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-05 10:07:31 +08:00
Tyler Michael Smith
04bbf38e05 [Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep (#9994)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-05 01:08:21 +00:00
Michael Goin
8f0a9ca890 [Bugfix] Respect modules_to_not_convert within awq_marlin (#9895)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-04 16:57:44 -07:00
youkaichao
2094062b4e [4.5/N] bugfix for quant config in speculative decode (#10007)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-04 15:11:59 -08:00
bnellnm
d93478b399 [Bugfix] Upgrade to pytorch 2.5.1 (#10001)
Signed-off-by: Bill Nell <bill@neuralmagic.com>
2024-11-04 15:11:28 -08:00
tomeras91
ac04a97a9f [Frontend] Add max_tokens prometheus metric (#9881)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-04 22:53:24 +00:00
lkchen
9a5664d4a4 [Misc] Refactor benchmark_throughput.py (#9779)
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net>
Co-authored-by: Linkun Chen <lkchen@github.com>
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net>
2024-11-04 14:32:16 -08:00
Robert Shaw
04cef2c6ab [Bugfix] Fix MQLLMEngine hanging (#9973)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-11-04 16:01:43 -05:00
Roger Wang
6e056bcf04 [Doc] Update VLM doc about loading from local files (#9999)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-04 19:47:11 +00:00
hissu-hyvarinen
5208dc7a20 [Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs (#9279)
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
2024-11-04 11:37:46 -08:00
Robert Shaw
1c45f4c385 [CI] Basic Integration Test For TPU (#9968)
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com>
2024-11-04 11:34:26 -08:00
Mor Zusman
603a661ae8 [Model] factoring out MambaMixer out of Jamba (#8993)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-11-04 18:00:00 +00:00
Jee Jee Li
fb2716d641 [Misc]Reduce BNB static variable (#9987)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-04 17:04:40 +00:00
youkaichao
8d72bb20fa [4/N] make quant config first-class citizen (#9978)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-04 08:51:31 -08:00
Chauncey
ac6b8f19b9 [Frontend] Multi-Modality Support for Loading Local Image Files (#9915)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-04 15:34:57 +00:00
Mengqing Cao
ccb5376a9a [Bugfix][OpenVINO] Fix circular reference #9939 (#9974)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-04 18:14:13 +08:00
Tran Quang Dai
ea4adeddc1 [Bugfix] Fix E2EL mean and median stats (#9984)
Signed-off-by: daitran2k1 <tranquangdai7a@gmail.com>
2024-11-04 09:37:58 +00:00
Yang Zheng
4dbcbbeb09 [Misc] Compute query_start_loc/seq_start_loc on CPU (#9447)
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
2024-11-04 08:54:37 +00:00
Gregory Shtrasberg
b67feb1274 [Bugfix]Using the correct type hints (#9885)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-11-04 06:19:51 +00:00
Jee Jee Li
c49f0407ba [Bugfix] Fix MiniCPMV and Mllama BNB bug (#9917)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-04 03:36:41 +00:00
Robert Shaw
91c9ebbb1b [V1] Fix Configs (#9971) 2024-11-04 00:24:40 +00:00
shanshan wang
54597724f4 [Model] Add support for H2OVL-Mississippi models (#9747)
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-11-04 00:15:36 +00:00
Nick Hill
1f1b6d6eda [V1] Support per-request seed (#9945)
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
2024-11-03 09:14:17 -08:00
youkaichao
3bb4befea7 [bugfix] fix tsts (#9959)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-02 15:54:05 -07:00
Yongzao
ae5279a163 [torch.compile] Adding torch compile to vision-language models (#9946) 2024-11-02 12:56:05 -07:00
Nikita Furin
1b73ab2a1f [CI/Build] Quoting around > (#9956) 2024-11-02 12:50:28 -07:00
youkaichao
cea808f325 [3/N] model runner pass the whole config to model (#9958)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-02 12:08:49 -07:00
youkaichao
74b529ceee [bugfix] fix chatglm dummy_data_for_glmv (#9955)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-02 08:03:33 -07:00
Robert Shaw
d6459b4516 [V1] Fix EngineArgs refactor on V1 (#9954) 2024-11-02 07:44:38 -07:00
youkaichao
e893795443 [2/N] executor pass the complete config to worker/modelrunner (#9938)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-02 07:35:05 -07:00
Michael Green
1d4cfe2be1 [Doc] Updated tpu-installation.rst with more details (#9926)
Signed-off-by: Michael Green <mikegre@google.com>
2024-11-02 10:06:45 -04:00
Nick Hill
eed92f12fc [Docs] Update Granite 3.0 models in supported models table (#9930)
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-02 09:02:18 +00:00
youkaichao
af7380d83b [torch.compile] fix cpu broken code (#9947)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-01 23:35:47 -07:00
sroy745
a78dd3303e [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models (#9559) 2024-11-01 23:22:49 -07:00
Kevin H. Luu
d522034c85 [ci/build] Have dependabot ignore pinned dependencies (#9935)
Signed-off-by: kevin <kevin@anyscale.com>
2024-11-01 23:56:13 +00:00
Peter Salas
6c0b7f548d [Core][VLM] Add precise multi-modal placeholder tracking (#8346)
Signed-off-by: Peter Salas <peter@fixie.ai>
2024-11-01 16:21:10 -07:00
dependabot[bot]
d151fde834 [ci/build] Bump the patch-update group with 10 updates (#9897)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
2024-11-01 23:04:42 +00:00
Gene Der Su
27cd36e6e2 [Bugfix] PicklingError on RayTaskError (#9934)
Signed-off-by: Gene Su <e870252314@gmail.com>
2024-11-01 22:08:23 +00:00
youkaichao
18bd7587b7 [1/N] pass the complete config from engine to executor (#9933)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-01 13:51:57 -07:00
Pavani Majety
598b6d7b07 [Bugfix/Core] Flashinfer k_scale and v_scale (#9861) 2024-11-01 12:15:05 -07:00
youkaichao
aff1fd8188 [torch.compile] use interpreter with stable api from pytorch (#9889)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-01 11:50:37 -07:00
André Jonasson
4581d2cc02 [Core] Refactor: Clean up unused argument in Scheduler._preempt (#9696)
Signed-off-by: André Jonasson <andre.jonasson@gmail.com>
2024-11-01 11:41:38 -07:00
Travis Johnson
1dd4cb2935 [Bugfix] Fix edge cases for MistralTokenizer (#9625)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2024-11-01 10:33:15 -07:00
Cyrus Leung
ba0d892074 [Frontend] Use a proper chat template for VLM2Vec (#9912) 2024-11-01 14:09:07 +00:00
Michael Goin
30a2e80742 [CI/Build] Add Model Tests for PixtralHF (#9813) 2024-11-01 07:55:29 -06:00
Cyrus Leung
06386a64dd [Frontend] Chat-based Embeddings API (#9759) 2024-11-01 08:13:35 +00:00
Cyrus Leung
d3aa2a8b2f [Doc] Update multi-input support (#9906) 2024-11-01 07:34:49 +00:00
Yongzao
2b5bf20988 [torch.compile] Adding torch compile annotations to some models (#9876)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-01 00:25:47 -07:00
Michael Goin
93a76dd21d [Model] Support bitsandbytes for MiniCPMV (#9891)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-01 13:31:56 +08:00
youkaichao
566cd27797 [torch.compile] rework test plans (#9866)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-31 22:20:17 -07:00
Michael Goin
37a4947dcd [Bugfix] Fix layer skip logic with bitsandbytes (#9887)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-01 13:12:44 +08:00
youkaichao
96e0c9cbbd [torch.compile] directly register custom op (#9896)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-31 21:56:09 -07:00
Joe Runde
031a7995f3 [Bugfix][Frontend] Reject guided decoding in multistep mode (#9892)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-11-01 01:09:46 +00:00
Kevin H. Luu
b63c64d95b [ci/build] Configure dependabot to update pip dependencies (#9811)
Signed-off-by: kevin <kevin@anyscale.com>
2024-10-31 15:55:38 -07:00
Mor Zusman
9fb12f7848 [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 (#9838)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-10-31 20:06:25 +00:00
sasha0552
55650c83a0 [Bugfix] Fix illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together (#9532)
Signed-off-by: sasha0552 <admin@sasha0552.org>
2024-10-31 11:46:36 -07:00
Alexei-V-Ivanov-AMD
77f7ef2908 [CI/Build] Adding a forced docker system prune to clean up space (#9849) 2024-11-01 01:02:58 +08:00
Alex Brooks
16b8f7a86f [CI/Build] Add Model Tests for Qwen2-VL (#9846)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-31 09:10:52 -07:00
Jee Jee Li
5608e611c2 [Doc] Update Qwen documentation (#9869) 2024-10-31 08:54:18 +00:00
Roger Wang
3ea2dc2ec4 [Misc] Remove deprecated arg for cuda graph capture (#9864)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-10-31 07:22:07 +00:00
Michael Goin
d087bf863e [Model] Support quantization of Qwen2VisionTransformer (#9817)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-10-30 22:41:20 -07:00
Kevin H. Luu
890ca36072 Revert "[Bugfix] Use host argument to bind to interface (#9798)" (#9852) 2024-10-31 01:44:51 +00:00
Guillaume Calmettes
abbfb6134d [Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint (#9837) 2024-10-30 18:15:56 -07:00
youkaichao
64384bbcdf [torch.compile] upgrade tests (#9858)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-30 16:34:22 -07:00
Yongzao
00d91c8a2c [CI/Build] Simplify exception trace in api server tests (#9787)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-30 14:52:05 -07:00
youkaichao
c2cd1a2142 [doc] update pp support (#9853)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-30 13:36:51 -07:00
Harsha vardhan manoj Bikki
c787f2d81d [Neuron] Update Dockerfile.neuron to fix build failure (#9822) 2024-10-30 12:22:02 -07:00
Joe Runde
33d257735f [Doc] link bug for multistep guided decoding (#9843)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-30 17:28:29 +00:00
Joe Runde
3b3f1e7436 [Bugfix][core] replace heartbeat with pid check (#9818)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-30 09:34:07 -07:00
Elfie Guo
9ff4511e43 [Misc] Add chunked-prefill support on FlashInfer. (#9781) 2024-10-30 09:33:53 -07:00
Went-Liang
81f09cfd80 [Model] Support math-shepherd-mistral-7b-prm model (#9697)
Signed-off-by: Went-Liang <wenteng_liang@163.com>
2024-10-30 09:33:42 -07:00
Alex Brooks
cc98f1e079 [CI/Build] VLM Test Consolidation (#9372)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-30 09:32:17 -07:00
Woosuk Kwon
211fe91aa8 [TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA (#9438) 2024-10-30 09:41:38 +00:00
Jee Jee Li
6aa6020f9b [Misc] Specify minimum pynvml version (#9827)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-10-29 23:05:43 -07:00
youkaichao
ff5ed6e1bc [torch.compile] rework compile control with piecewise cudagraph (#9715)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-29 23:03:49 -07:00
Russell Bryant
7b0365efef [Doc] Add the DCO to CONTRIBUTING.md (#9803)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-30 05:22:23 +00:00
Yan Ma
04a3ae0aca [Bugfix] Fix multi nodes TP+PP for XPU (#8884)
Signed-off-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Signed-off-by: yan ma <yan.ma@intel.com>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
2024-10-29 21:34:45 -07:00
Kevin H. Luu
62fac4b9aa [ci/build] Pin CI dependencies version with pip-compile (#9810)
Signed-off-by: kevin <kevin@anyscale.com>
2024-10-30 03:34:55 +00:00
Michael Goin
226688bd61 [Bugfix][VLM] Make apply_fp8_linear work with >2D input (#9812) 2024-10-29 19:49:44 -07:00
Lily Liu
64cb1cdc3f Update README.md (#9819) 2024-10-29 17:28:43 -07:00
youkaichao
1ab6f6b4ad [core][distributed] fix custom allreduce in pytorch 2.5 (#9815)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-29 17:06:24 -07:00
Michael Goin
bc73e9821c [Bugfix] Fix prefix strings for quantized VLMs (#9772) 2024-10-29 16:02:59 -07:00
Simon Mo
8d7724104a [Docs] Add notes about Snowflake Meetup (#9814)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-10-29 15:19:02 -07:00
Will Eaton
882a1ad0de [Model] tool calling support for ibm-granite/granite-20b-functioncalling (#8339)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com>
2024-10-29 15:07:37 -07:00
Joe Runde
67bdf8e523 [Bugfix][Frontend] Guard against bad token ids (#9634)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-29 14:13:20 -07:00
Kunjan
0ad216f575 [MISC] Set label value to timestamp over 0, to keep track of recent history (#9777)
Signed-off-by: Kunjan Patel <kunjanp@google.com>
2024-10-29 19:52:19 +00:00
Russell Bryant
7585ec996f [CI/Build] mergify: fix rules for ci/build label (#9804)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-29 19:24:42 +00:00
Michael Goin
ab6f981671 [CI][Bugfix] Skip chameleon for transformers 4.46.1 (#9808) 2024-10-29 11:12:43 -07:00
Junichi Sato
ac3d748dba [Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel (#9806) 2024-10-29 10:40:35 -07:00
yannicks1
0ce7798f44 [Misc]: Typo fix: Renaming classes (casualLM -> causalLM) (#9801)
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
2024-10-29 10:39:20 -07:00
Sven Seeberg
0f43387157 [Bugfix] Use host argument to bind to interface (#9798) 2024-10-29 10:37:59 -07:00
tastelikefeet
08600ddc68 Fix the log to correct guide user to install modelscope (#9793)
Signed-off-by: yuze.zyz <yuze.zyz@alibaba-inc.com>
2024-10-29 10:36:59 -07:00
科英
74fc2d77ae [Misc] Add metrics for request queue time, forward time, and execute time (#9659) 2024-10-29 10:32:56 -07:00
wangshuai09
622b7ab955 [Hardware] using current_platform.seed_everything (#9785)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-29 14:47:44 +00:00
Isotr0py
09500f7dde [Model] Add BNB quantization support for Mllama (#9720) 2024-10-29 08:20:02 -04:00
Zhong Qishuai
ef7865b4f9 [Frontend] re-enable multi-modality input in the new beam search implementation (#9427)
Signed-off-by: Qishuai Ferdinandzhong@gmail.com
2024-10-29 11:49:47 +00:00
Cyrus Leung
eae3d48181 [Bugfix] Use temporary directory in registry (#9721) 2024-10-28 22:08:20 -07:00
Cyrus Leung
e74f2d448c [Doc] Specify async engine args in docs (#9726) 2024-10-28 22:07:57 -07:00
Jee Jee Li
7a4df5f200 [Model][LoRA]LoRA support added for Qwen (#9622)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-10-29 04:14:07 +00:00
Russell Bryant
c5d7fb9ddc [Doc] fix third-party model example (#9771)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-28 19:39:21 -07:00
youkaichao
76ed5340f0 [torch.compile] add deepseek v2 compile (#9775)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-28 14:35:17 -07:00
youkaichao
97b61bfae6 [misc] avoid circular import (#9765)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-28 20:51:23 +00:00
Yongzao
aa0addb397 Adding "torch compile" annotations to moe models (#9758) 2024-10-28 13:49:56 -07:00
litianjian
5f8d8075f9 [Model][VLM] Add multi-video support for LLaVA-Onevision (#8905)
Co-authored-by: litianjian <litianjian@bytedance.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-28 18:04:10 +00:00
Russell Bryant
8b0e4f2ad7 [CI/Build] Adopt Mergify for auto-labeling PRs (#9259)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-28 09:38:09 -07:00
Yan Ma
2adb4409e0 [Bugfix] Fix ray instance detect issue (#9439) 2024-10-28 07:13:03 +00:00
Robert Shaw
feb92fbe4a Fix beam search eos (#9627) 2024-10-28 06:59:37 +00:00
youkaichao
32176fee73 [torch.compile] support moe models (#9632)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 21:58:04 -07:00
wangshuai09
4e2d95e372 [Hardware][ROCM] using current_platform.is_rocm (#9642)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
madt2709
34a9941620 [Bugfix] Fix load config when using bools (#9533) 2024-10-27 13:46:41 -04:00
Harry Mellor
e130c40e4e Fix cache management in "Close inactive issues and PRs" actions workflow (#9734) 2024-10-27 10:30:03 -07:00
bnellnm
3cb07a36a2 [Misc] Upgrade to pytorch 2.5 (#9588)
Signed-off-by: Bill Nell <bill@neuralmagic.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-27 09:44:24 +00:00
youkaichao
8549c82660 [core] cudagraph output with tensor weak reference (#9724)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-27 00:19:28 -07:00
科英
67a6882da4 [Misc] SpecDecodeWorker supports profiling (#9719)
Signed-off-by: Abatom <abatom@163.com>
2024-10-27 04:18:03 +00:00
kakao-kevin-us
6650e6a930 [Model] Add classification Task with Qwen2ForSequenceClassification (#9704)
Signed-off-by: Kevin-Yang <ykcha9@gmail.com>
Co-authored-by: Kevin-Yang <ykcha9@gmail.com>
2024-10-26 17:53:35 +00:00
Vasiliy Alekseev
07e981fdf4 [Frontend] Bad words sampling parameter (#9717)
Signed-off-by: Vasily Alexeev <alvasian@yandex.ru>
2024-10-26 16:29:38 +00:00
ErkinSagiroglu
55137e8ee3 Fix: MI100 Support By Bypassing Custom Paged Attention (#9560) 2024-10-26 12:12:57 +00:00
Mengqing Cao
5cbdccd151 [Hardware][openvino] is_openvino --> current_platform.is_openvino (#9716) 2024-10-26 10:59:06 +00:00
Sam Stoelinga
067e77f9a8 [Bugfix] Steaming continuous_usage_stats default to False (#9709)
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com>
2024-10-26 05:05:47 +00:00
Travis Johnson
6567e13724 [Bugfix] Fix crash with llama 3.2 vision models and guided decoding (#9631)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: pavlo-ruban <pavlo.ruban@servicenow.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-25 15:42:56 -07:00
Rafael Vasquez
228cfbd03f [Doc] Improve quickstart documentation (#9256)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-10-25 14:32:10 -07:00
Michael Goin
ca0d92227e [Bugfix] Fix compressed_tensors_moe bad config.strategy (#9677) 2024-10-25 12:40:33 -07:00
Woosuk Kwon
9645b9f646 [V1] Support sliding window attention (#9679)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-10-24 22:20:37 -07:00
Will Johnson
a6f3721861 [Model] add a lora module for granite 3.0 MoE models (#9673) 2024-10-24 22:00:17 -07:00
Kevin H. Luu
9f7b4ba865 [ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 (#9676) 2024-10-24 20:59:00 -07:00
Michael Goin
c91ed47c43 [Bugfix] Remove xformers requirement for Pixtral (#9597)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-10-24 15:38:05 -07:00
Charlie Fu
59449095ab [Performance][Kernel] Fused_moe Performance Improvement (#9384)
Signed-off-by: charlifu <charlifu@amd.com>
2024-10-24 15:37:52 -07:00
Michael Goin
e26d37a185 [Log][Bugfix] Fix default value check for image_url.detail (#9663) 2024-10-24 10:44:38 -07:00
Alex Brooks
722d46edb9 [Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints (#9650)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-24 10:42:24 -07:00
Cyrus Leung
c866e0079d [CI/Build] Fix VLM test failures when using transformers v4.46 (#9666) 2024-10-25 01:40:40 +08:00
Yongzao
d27cfbf791 [torch.compile] Adding torch compile annotations to some models (#9641)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-24 09:31:42 -07:00
Harry Mellor
de662d32b5 Increase operation per run limit for "Close inactive issues and PRs" workflow (#9661)
Signed-off-by: Harry Mellor <hej.mellor@gmail.com>
2024-10-24 12:17:45 -04:00
litianjian
f58454968f [Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models (#9653) 2024-10-24 07:52:07 -07:00
Cyrus Leung
b979143d5b [Doc] Move additional tips/notes to the top (#9647) 2024-10-24 09:43:59 +00:00
Yongzao
ad6f78053e [torch.compile] expanding support and fix allgather compilation (#9637)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-24 01:32:15 -07:00
Jee Jee Li
295a061fb3 [Kernel] add kernel for FATReLU (#9610)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-10-24 16:18:27 +08:00
Yongzao
8a02cd045a [torch.compile] Adding torch compile annotations to some models (#9639)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-10-24 00:54:57 -07:00
youkaichao
4fdc581f9e [core] simplify seq group code (#9569)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-10-24 00:16:44 -07:00
Woosuk Kwon
3770071eb4 [V1][Bugfix] Clean up requests when aborted (#9629)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-10-23 23:33:22 -07:00
Cyrus Leung
836e8ef6ee [Bugfix] Fix PP for ChatGLM and Molmo (#9422) 2024-10-24 06:12:05 +00:00
Yan Ma
056a68c7db [XPU] avoid triton import for xpu (#9440)
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-24 05:14:00 +00:00
Vinay R Damodaran
33bab41060 [Bugfix]: Make chat content text allow type content (#9358)
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
2024-10-24 05:05:49 +00:00
Michael Goin
b7df53cd42 [Bugfix] Use "vision_model" prefix for MllamaVisionModel (#9628)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-10-24 10:07:44 +08:00
Michael Goin
bb01f2915e [Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image (#9626)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-10-24 10:03:44 +08:00
Russell Bryant
b548d7a5f4 [CI/Build] Add bot to close stale issues and PRs (#9436) 2024-10-23 15:45:26 -07:00
Yunfei Chu
fc6c274626 [Model] Add Qwen2-Audio model support (#9248)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-10-23 17:54:22 +00:00
Alex Brooks
150b779081 [Frontend] Enable Online Multi-image Support for MLlama (#9393)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-23 17:28:57 +00:00
Yongzao
9013e24f7b [torch.compile] Adding torch compile annotations to some models (#9614) 2024-10-23 10:07:48 -07:00
Michael Goin
fd0e2cfdb2 [Misc] Separate total and output tokens in benchmark_throughput.py (#8914) 2024-10-23 16:47:20 +00:00
Tyler Michael Smith
e5ac6a4199 [Bugfix] Fix divide by zero when serving Mamba models (#9617)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-10-23 16:40:43 +00:00
youkaichao
dbdd3b5e5a [misc] comment to avoid future confusion about baichuan (#9620)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-10-23 09:14:44 -07:00
Cyrus Leung
e7116c017c [Bugfix] Fix _init_vision_model in NVLM_D model (#9611)
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-10-23 14:09:04 +00:00
Alex Brooks
31a08f5bd2 [Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs (#9612)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-10-23 14:05:18 +00:00
Cyrus Leung
c18e1a3418 [VLM] Enable overriding whether post layernorm is used in vision encoder + fix quant args (#9217)
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-10-23 11:27:37 +00:00
Isotr0py
3ff57ebfca [Model] Initialize Florence-2 language backbone support (#9555) 2024-10-23 10:42:47 +00:00
Mengqing Cao
2394962d70 [Hardware][XPU] using current_platform.is_xpu (#9605) 2024-10-23 08:28:21 +00:00
Luka Govedič
51c24c9736 [Build] Fix FetchContent multiple build issue (#9596)
Signed-off-by: luka <luka@neuralmagic.com>
2024-10-23 12:43:07 +08:00
Cyrus Leung
831540cf04 [Model] Support E5-V (#9576) 2024-10-23 11:35:29 +08:00
Flex Wang
29061ed9df [Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages (#9590) 2024-10-23 11:17:28 +08:00
Chen Zhang
65050a40e6 [Bugfix] Generate exactly input_len tokens in benchmark_throughput (#9592) 2024-10-22 17:45:35 -07:00
Seth Kimmel
208cb34c81 [Doc]: Update tensorizer docs to include vllm[tensorizer] (#7889)
Co-authored-by: Kaunil Dhruv <dhruv.kaunil@gmail.com>
2024-10-22 15:43:25 -07:00
yulei
b17046e298 [BugFix] Fix metrics error for --num-scheduler-steps > 1 (#8234) 2024-10-22 15:43:03 -07:00
Lucas Wilkinson
d1e8240875 [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing (#9487) 2024-10-22 15:41:13 -07:00
Jeremy Arnold
cb6fdaa0a0 [Misc] Make benchmarks use EngineArgs (#9529) 2024-10-22 15:40:38 -07:00
Aurick Qiao
23b899a8e6 [Bugfix] fix detokenizer shallow copy (#5919) 2024-10-22 15:38:12 -07:00
youkaichao
17c79f3c36 [torch.compile] auto infer dynamic_arg_dims from type annotation (#9589) 2024-10-22 13:43:37 -07:00
Ronen Schaffer
cd5601ac37 [BugFix] Prevent exporting duplicate OpenTelemetry spans (#9017) 2024-10-22 11:11:53 -07:00
Yuhong Guo
434984e665 [Frontend] Support custom request_id from request (#9550)
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>
2024-10-22 18:07:30 +00:00
Yuan
32a1ee74a0 [Hardware][Intel CPU][DOC] Update docs for CPU backend (#6212)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Gubrud, Aaron D <aaron.d.gubrud@intel.com>
Co-authored-by: adgubrud <96072084+adgubrud@users.noreply.github.com>
2024-10-22 10:38:04 -07:00
gopalsarda
08075c3448 [Bugfix] Eagle: change config name for fc bias (#9580) 2024-10-22 16:14:22 +00:00
Isotr0py
bb392ea2d2 [Model][VLM] Initialize support for Mono-InternVL model (#9528) 2024-10-22 16:01:46 +00:00
xendo
9dbcce84a7 [Neuron] [Bugfix] Fix neuron startup (#9374)
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com>
2024-10-22 12:51:41 +00:00
Jee Jee Li
a48e3ec052 [CI/Build][LoRA] Temporarily fix long context failure issue (#9579) 2024-10-22 11:32:51 +00:00
Woosuk Kwon
6c5af09b39 [V1] Implement vLLM V1 [1/N] (#9289) 2024-10-22 01:24:07 -07:00
wangshuai09
3ddbe25502 [Hardware][CPU] using current_platform.is_cpu (#9536) 2024-10-22 00:50:43 -07:00
chenqianfzh
0d02747f2e support TP in qwen2 bnb (#9574) 2024-10-22 07:13:23 +00:00
Rafael Vasquez
f7db5f0fa9 [Doc] Use shell code-blocks and fix section headers (#9508)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-10-22 06:43:24 +00:00
Kuntai Du
ca30c3c84b [Core] Remove evictor_v1 (#9572) 2024-10-22 04:55:49 +00:00
Wallas Henrique
c0292211ce [CI/Build] Replaced some models on tests for smaller ones (#9570)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-22 04:52:14 +00:00
Falko1
74692421f7 [Bugfix]: phi.py get rope_theta from config file (#9503)
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-10-22 02:53:36 +00:00
ngrozae
29acd2c34c [Bugfix][OpenVINO] fix_dockerfile_openvino (#9552) 2024-10-21 19:47:52 -07:00
Cyrus Leung
f085995a7b [CI/Build] Remove unnecessary fork_new_process (#9484) 2024-10-21 19:47:29 -07:00
Travis Johnson
b729901139 [Bugfix]: serialize config by value for --trust-remote-code (#6751)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-10-21 19:46:24 -07:00
youkaichao
76a5e13270 [core] move parallel sampling out from vllm core (#9302) 2024-10-22 00:31:44 +00:00
Joe Runde
ef7faad1b8 🐛 Fixup more test failures from memory profiling (#9563)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-21 17:10:56 -07:00
Kuntai Du
575dcebe9a [CI] Make format checker error message more user-friendly by using emoji (#9564)
This PR makes format checker error message more user-friendly by adding emojis.
2024-10-21 23:45:15 +00:00
Wallas Henrique
711f3a7806 [Frontend] Don't log duplicate error stacktrace for every request in the batch (#9023)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-10-21 14:49:41 -07:00
Nick Hill
15713e3b75 [BugFix] Update draft model TP size check to allow matching target TP size (#9394)
Co-authored-by: Baoyuan Qi <qibaoyuan@126.com>
2024-10-21 14:14:29 -07:00
youkaichao
d621c43df7 [doc] fix format (#9562) 2024-10-21 13:54:57 -07:00
Nick Hill
9d9186be97 [Frontend] Reduce frequency of client cancellation checking (#7959) 2024-10-21 13:28:10 -07:00
Michael Goin
5241aa1494 [Model][Bugfix] Fix batching with multi-image in PixtralHF (#9518) 2024-10-21 14:20:07 -04:00
Varad Ahirwadkar
ec6bd6c4c6 [BugFix] Use correct python3 binary in Docker.ppc64le entrypoint (#9492)
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com>
2024-10-21 17:43:02 +00:00
yudian0504
8ca8954841 [Bugfix][Misc]: fix graph capture for decoder (#9549) 2024-10-21 17:33:30 +00:00
Dhia Eddine Rhaiem
f6b97293aa [Model] FalconMamba Support (#9325) 2024-10-21 12:50:16 -04:00
Thomas Parnell
496e991da8 [Doc] Consistent naming of attention backends (#9498)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-10-21 22:29:57 +08:00
Cyrus Leung
696b01af8f [CI/Build] Split up decoder-only LM tests (#9488)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-10-20 21:27:50 -07:00
Andy Dai
855e0e6f97 [Frontend][Misc] Goodput metric support (#9338) 2024-10-20 18:39:32 +00:00
Chen Zhang
4fa3e33349 [Kernel] Support sliding window in flash attention backend (#9403) 2024-10-20 10:57:52 -07:00
Michael Goin
962d2c6349 [Model][Pixtral] Use memory_efficient_attention for PixtralHFVision (#9520) 2024-10-20 05:29:14 +00:00
Chen Zhang
5b59fe0f08 [Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger (#9530) 2024-10-20 00:05:02 +00:00
Michael Goin
8e3e7f2713 [Model][Pixtral] Optimizations for input_processor_for_pixtral_hf (#9514) 2024-10-19 10:44:29 -04:00
Cyrus Leung
263d8ee150 [Bugfix] Fix missing task for speculative decoding (#9524) 2024-10-19 06:49:40 +00:00
Yue Zhang
c5eea3c8ba [Frontend] Support simpler image input format (#9478) 2024-10-18 23:17:07 -07:00
Russell Bryant
85dc92fc98 [CI/Build] Configure matcher for actionlint workflow (#9511)
Signed-off-by: Russell Bryant <russell.bryant@gmail.com>
2024-10-19 06:04:18 +00:00
Russell Bryant
dfd951ed9b [CI/Build] Add error matching for ruff output (#9513)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-19 05:42:20 +00:00
Joe Runde
82c25151ec [Doc] update gpu-memory-utilization flag docs (#9507)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-19 11:26:36 +08:00
Nick Hill
1325872ec8 [Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily (#9521) 2024-10-18 20:21:01 -07:00
Joe Runde
380e18639f 🐛 fix torch memory profiling (#9516)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-18 21:25:19 -04:00
sasha0552
337ed76671 [Bugfix] Fix offline mode when using mistral_common (#9457) 2024-10-18 18:12:32 -07:00
Thomas Parnell
0c9a5258f9 [Kernel] Add env variable to force flashinfer backend to enable tensor cores (#9497)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-10-18 17:55:48 -07:00
Cody Yu
d11bf435a0 [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Kunjan
9bb10a7d27 [MISC] Add lora requests to metrics (#9477)
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>
2024-10-18 20:50:18 +00:00
Michael Goin
3921a2f29e [Model] Support Pixtral models in the HF Transformers format (#9036) 2024-10-18 13:29:56 -06:00
Russell Bryant
67a7e5ef38 [CI/Build] Add error matching config for mypy (#9512) 2024-10-18 12:17:53 -07:00
Cyrus Leung
051eaf6db3 [Model] Add user-configurable task for models that support both generation and embedding (#9424) 2024-10-18 11:31:58 -07:00
Russell Bryant
7dbe738d65 [Misc] benchmark: Add option to set max concurrency (#9390)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-10-18 11:15:28 -07:00
Tyler Michael Smith
ae8b633ba3 [Bugfix] Fix offline_inference_with_prefix.py (#9505) 2024-10-18 16:59:19 +00:00
Cyrus Leung
1bbbcc0b1d [CI/Build] Fix lint errors in mistral tokenizer (#9504) 2024-10-19 00:09:35 +08:00
Nick Hill
25aeb7d4c9 [BugFix] Fix and simplify completion API usage streaming (#9475) 2024-10-18 14:10:26 +00:00
tomeras91
d2b1bf55ec [Frontend][Feature] Add jamba tool parser (#9154) 2024-10-18 10:27:48 +00:00
Nick Hill
1ffc8a7362 [BugFix] Typing fixes to RequestOutput.prompt and beam search (#9473) 2024-10-18 07:19:53 +00:00
Russell Bryant
944dd8edaf [CI/Build] Use commit hash references for github actions (#9430) 2024-10-17 21:54:58 -07:00
Haoyu Wang
154a8ae880 [Qwen2.5] Support bnb quant for Qwen2.5 (#9467) 2024-10-18 04:40:14 +00:00
Joe Runde
de4008e2ab [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage (#9352)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-10-17 22:47:27 -04:00
Dipika Sikka
48138a8415 [BugFix] Stop silent failures on compressed-tensors parsing (#9381) 2024-10-17 18:54:00 -07:00
Robert Shaw
343f8e0905 Support BERTModel (first encoder-only embedding model) (#9056)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: laishzh <laishengzhang@gmail.com>
Co-authored-by: Max de Bayser <maxdebayser@gmail.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-10-17 23:21:01 +00:00
Shashwat Srijan
bb76538bbd [Hardwware][Neuron] Simplify model load for transformers-neuronx library (#9380) 2024-10-17 15:39:39 -07:00
sasha0552
d615b5c9f8 [Bugfix] Print warnings related to mistral_common tokenizer only once (#9468) 2024-10-17 21:44:20 +00:00
Kai Wu
d65049daab [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script (#9013)
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-10-17 21:11:11 +00:00
bnellnm
eca2c5f7c0 [Bugfix] Fix support for dimension like integers and ScalarType (#9299) 2024-10-17 19:08:34 +00:00
Luka Govedič
0f41fbe5a3 [torch.compile] Fine-grained CustomOp enabling mechanism (#9300) 2024-10-17 18:36:37 +00:00
Cyrus Leung
7871659abb [Misc] Remove commit id file (#9470) 2024-10-17 10:34:37 -07:00
1606 changed files with 152833 additions and 45210 deletions

View File

@@ -1,9 +1,14 @@
# SPDX-License-Identifier: Apache-2.0
import os import os
import sys import sys
import zipfile import zipfile
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB # Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250)) # Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
def print_top_10_largest_files(zip_file): def print_top_10_largest_files(zip_file):

View File

@@ -0,0 +1,26 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import os
template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1/>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
</body>
</html>
"""
parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()
filename = os.path.basename(args.wheel)
with open("index.html", "w") as f:
print(f"Generated index.html for {args.wheel}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename,
wheel_html_escaped=filename.replace("+", "%2B")))

View File

@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.356
- name: "exact_match,flexible-extract"
value: 0.358
limit: 1000
num_fewshot: 5

View File

@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.6353
- name: "exact_match,flexible-extract"
value: 0.637
limit: null
num_fewshot: null

View File

@@ -1,6 +1,6 @@
Meta-Llama-3-8B-Instruct.yaml Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml

View File

@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
done done
lm_eval --model hf \ lm_eval --model hf \
--model_args pretrained=$MODEL,parallelize=True \ --model_args "pretrained=$MODEL,parallelize=True" \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \ --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size $BATCH_SIZE --batch_size "$BATCH_SIZE"

View File

@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
done done
lm_eval --model vllm \ lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \ --model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \ --tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size $BATCH_SIZE --batch_size "$BATCH_SIZE"

View File

@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
done done
# Parse list of configs. # Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"
for MODEL_CONFIG in "${MODEL_CONFIGS[@]}" for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do do

View File

@@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
""" """
LM eval harness on model to compare vs HF baseline computed offline. LM eval harness on model to compare vs HF baseline computed offline.
Configs are found in configs/$MODEL.yaml Configs are found in configs/$MODEL.yaml

View File

@@ -1,5 +1,6 @@
steps: steps:
- label: "Wait for container to be ready" - label: "Wait for container to be ready"
key: wait-for-container-image
agents: agents:
queue: A100 queue: A100
plugins: plugins:
@@ -9,16 +10,18 @@ steps:
- image: badouralix/curl-jq - image: badouralix/curl-jq
command: command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh - sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
- wait
- label: "A100" - label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents: agents:
queue: A100 queue: A100
depends_on: wait-for-container-image
plugins: plugins:
- kubernetes: - kubernetes:
podSpec: podSpec:
priorityClassName: perf-benchmark priorityClassName: perf-benchmark
containers: containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT - image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command: command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources: resources:
@@ -41,20 +44,49 @@ steps:
- name: devshm - name: devshm
emptyDir: emptyDir:
medium: Memory medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN
- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~
- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: wait-for-container-image
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

View File

@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import json import json
import os import os
from pathlib import Path from pathlib import Path
@@ -56,7 +58,7 @@ serving_column_mapping = {
def read_markdown(file): def read_markdown(file):
if os.path.exists(file): if os.path.exists(file):
with open(file, "r") as f: with open(file) as f:
return f.read() + "\n" return f.read() + "\n"
else: else:
return f"{file} not found.\n" return f"{file} not found.\n"
@@ -75,14 +77,14 @@ if __name__ == "__main__":
# collect results # collect results
for test_file in results_folder.glob("*.json"): for test_file in results_folder.glob("*.json"):
with open(test_file, "r") as f: with open(test_file) as f:
raw_result = json.loads(f.read()) raw_result = json.loads(f.read())
if "serving" in str(test_file): if "serving" in str(test_file):
# this result is generated via `benchmark_serving.py` # this result is generated via `benchmark_serving.py`
# attach the benchmarking command to raw_result # attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f: with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read()) command = json.loads(f.read())
raw_result.update(command) raw_result.update(command)
@@ -97,7 +99,7 @@ if __name__ == "__main__":
# this result is generated via `benchmark_latency.py` # this result is generated via `benchmark_latency.py`
# attach the benchmarking command to raw_result # attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f: with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read()) command = json.loads(f.read())
raw_result.update(command) raw_result.update(command)
@@ -119,7 +121,7 @@ if __name__ == "__main__":
# this result is generated via `benchmark_throughput.py` # this result is generated via `benchmark_throughput.py`
# attach the benchmarking command to raw_result # attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f: with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read()) command = json.loads(f.read())
raw_result.update(command) raw_result.update(command)
@@ -157,6 +159,18 @@ if __name__ == "__main__":
throughput_results, throughput_results,
serving_results) serving_results)
for df in [latency_results, serving_results, throughput_results]:
if df.empty:
continue
# Sort all dataframes by their respective "Test name" columns
df.sort_values(by="Test name", inplace=True)
# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")
# get markdown tables # get markdown tables
latency_md_table = tabulate(latency_results, latency_md_table = tabulate(latency_results,
headers='keys', headers='keys',

View File

@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse import argparse
from transformers import AutoTokenizer from transformers import AutoTokenizer

View File

@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse import argparse
import json import json
from pathlib import Path from pathlib import Path
@@ -72,7 +74,7 @@ def main(args):
# collect results # collect results
for test_file in results_folder.glob("*_nightly_results.json"): for test_file in results_folder.glob("*_nightly_results.json"):
with open(test_file, "r") as f: with open(test_file) as f:
results = results + json.loads(f.read()) results = results + json.loads(f.read())
# generate markdown table # generate markdown table
@@ -80,7 +82,7 @@ def main(args):
md_table = tabulate(df, headers='keys', tablefmt='pipe', showindex=False) md_table = tabulate(df, headers='keys', tablefmt='pipe', showindex=False)
with open(args.description, "r") as f: with open(args.description) as f:
description = f.read() description = f.read()
description = description.format( description = description.format(

View File

@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from lmdeploy.serve.openai.api_client import APIClient from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient("http://localhost:8000") api_client = APIClient("http://localhost:8000")

View File

@@ -50,31 +50,30 @@ launch_trt_server() {
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
git lfs install git lfs install
cd tensorrtllm_backend cd tensorrtllm_backend
git checkout $trt_llm_version git checkout "$trt_llm_version"
tensorrtllm_backend_dir=$(pwd)
git submodule update --init --recursive git submodule update --init --recursive
# build trtllm engine # build trtllm engine
cd /tensorrtllm_backend cd /tensorrtllm_backend
cd ./tensorrt_llm/examples/${model_type} cd "./tensorrt_llm/examples/${model_type}"
python3 convert_checkpoint.py \ python3 convert_checkpoint.py \
--model_dir ${model_path} \ --model_dir "${model_path}" \
--dtype ${model_dtype} \ --dtype "${model_dtype}" \
--tp_size ${model_tp_size} \ --tp_size "${model_tp_size}" \
--output_dir ${trt_model_path} --output_dir "${trt_model_path}"
trtllm-build \ trtllm-build \
--checkpoint_dir ${trt_model_path} \ --checkpoint_dir "${trt_model_path}" \
--use_fused_mlp \ --use_fused_mlp \
--reduce_fusion disable \ --reduce_fusion disable \
--workers 8 \ --workers 8 \
--gpt_attention_plugin ${model_dtype} \ --gpt_attention_plugin "${model_dtype}" \
--gemm_plugin ${model_dtype} \ --gemm_plugin "${model_dtype}" \
--tp_size ${model_tp_size} \ --tp_size "${model_tp_size}" \
--max_batch_size ${max_batch_size} \ --max_batch_size "${max_batch_size}" \
--max_input_len ${max_input_len} \ --max_input_len "${max_input_len}" \
--max_seq_len ${max_seq_len} \ --max_seq_len "${max_seq_len}" \
--max_num_tokens ${max_num_tokens} \ --max_num_tokens "${max_num_tokens}" \
--output_dir ${trt_engine_path} --output_dir "${trt_engine_path}"
# handle triton protobuf files and launch triton server # handle triton protobuf files and launch triton server
cd /tensorrtllm_backend cd /tensorrtllm_backend
@@ -82,15 +81,15 @@ launch_trt_server() {
cp -r all_models/inflight_batcher_llm/* triton_model_repo/ cp -r all_models/inflight_batcher_llm/* triton_model_repo/
cd triton_model_repo cd triton_model_repo
rm -rf ./tensorrt_llm/1/* rm -rf ./tensorrt_llm/1/*
cp -r ${trt_engine_path}/* ./tensorrt_llm/1 cp -r "${trt_engine_path}"/* ./tensorrt_llm/1
python3 ../tools/fill_template.py -i tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,decoupled_mode:true,batching_strategy:inflight_fused_batching,batch_scheduler_policy:guaranteed_no_evict,exclude_input_in_output:true,triton_max_batch_size:2048,max_queue_delay_microseconds:0,max_beam_width:1,max_queue_size:2048,enable_kv_cache_reuse:false python3 ../tools/fill_template.py -i tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1,decoupled_mode:true,batching_strategy:inflight_fused_batching,batch_scheduler_policy:guaranteed_no_evict,exclude_input_in_output:true,triton_max_batch_size:2048,max_queue_delay_microseconds:0,max_beam_width:1,max_queue_size:2048,enable_kv_cache_reuse:false
python3 ../tools/fill_template.py -i preprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5 python3 ../tools/fill_template.py -i preprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,preprocessing_instance_count:5"
python3 ../tools/fill_template.py -i postprocessing/config.pbtxt triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false python3 ../tools/fill_template.py -i postprocessing/config.pbtxt "triton_max_batch_size:2048,tokenizer_dir:$model_path,postprocessing_instance_count:5,skip_special_tokens:false"
python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:$max_batch_size python3 ../tools/fill_template.py -i ensemble/config.pbtxt triton_max_batch_size:"$max_batch_size"
python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:"False",bls_instance_count:1 python3 ../tools/fill_template.py -i tensorrt_llm_bls/config.pbtxt "triton_max_batch_size:$max_batch_size,decoupled_mode:true,accumulate_tokens:False,bls_instance_count:1"
cd /tensorrtllm_backend cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py \ python3 scripts/launch_triton_server.py \
--world_size=${model_tp_size} \ --world_size="${model_tp_size}" \
--model_repo=/tensorrtllm_backend/triton_model_repo & --model_repo=/tensorrtllm_backend/triton_model_repo &
} }
@@ -98,10 +97,7 @@ launch_trt_server() {
launch_tgi_server() { launch_tgi_server() {
model=$(echo "$common_params" | jq -r '.model') model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp') tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port') port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params") server_args=$(json2args "$server_params")
if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -129,10 +125,7 @@ launch_tgi_server() {
launch_lmdeploy_server() { launch_lmdeploy_server() {
model=$(echo "$common_params" | jq -r '.model') model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp') tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port') port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params") server_args=$(json2args "$server_params")
server_command="lmdeploy serve api_server $model \ server_command="lmdeploy serve api_server $model \
@@ -149,10 +142,7 @@ launch_sglang_server() {
model=$(echo "$common_params" | jq -r '.model') model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp') tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port') port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params") server_args=$(json2args "$server_params")
if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -185,10 +175,7 @@ launch_vllm_server() {
model=$(echo "$common_params" | jq -r '.model') model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp') tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port') port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
server_args=$(json2args "$server_params") server_args=$(json2args "$server_params")
if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then if echo "$common_params" | jq -e 'has("fp8")' >/dev/null; then
@@ -217,19 +204,19 @@ launch_vllm_server() {
main() { main() {
if [[ $CURRENT_LLM_SERVING_ENGINE == "trt" ]]; then if [[ "$CURRENT_LLM_SERVING_ENGINE" == "trt" ]]; then
launch_trt_server launch_trt_server
fi fi
if [[ $CURRENT_LLM_SERVING_ENGINE == "tgi" ]]; then if [[ "$CURRENT_LLM_SERVING_ENGINE" == "tgi" ]]; then
launch_tgi_server launch_tgi_server
fi fi
if [[ $CURRENT_LLM_SERVING_ENGINE == "lmdeploy" ]]; then if [[ "$CURRENT_LLM_SERVING_ENGINE" == "lmdeploy" ]]; then
launch_lmdeploy_server launch_lmdeploy_server
fi fi
if [[ $CURRENT_LLM_SERVING_ENGINE == "sglang" ]]; then if [[ "$CURRENT_LLM_SERVING_ENGINE" == "sglang" ]]; then
launch_sglang_server launch_sglang_server
fi fi

View File

@@ -16,10 +16,10 @@ main() {
fi fi
# initial annotation # initial annotation
description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md" #description="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/nightly-descriptions.md"
# download results # download results
cd $VLLM_SOURCE_CODE_LOC/benchmarks cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
mkdir -p results/ mkdir -p results/
/workspace/buildkite-agent artifact download 'results/*nightly_results.json' results/ /workspace/buildkite-agent artifact download 'results/*nightly_results.json' results/
ls ls
@@ -30,20 +30,20 @@ main() {
/workspace/buildkite-agent artifact upload "results.zip" /workspace/buildkite-agent artifact upload "results.zip"
# upload benchmarking scripts # upload benchmarking scripts
cd $VLLM_SOURCE_CODE_LOC/ cd "$VLLM_SOURCE_CODE_LOC/"
zip -r nightly-benchmarks.zip .buildkite/ benchmarks/ zip -r nightly-benchmarks.zip .buildkite/ benchmarks/
/workspace/buildkite-agent artifact upload "nightly-benchmarks.zip" /workspace/buildkite-agent artifact upload "nightly-benchmarks.zip"
cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/ cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
# upload benchmarking pipeline # upload benchmarking pipeline
/workspace/buildkite-agent artifact upload "nightly-pipeline.yaml" /workspace/buildkite-agent artifact upload "nightly-pipeline.yaml"
cd $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/ cd "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
/workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md /workspace/buildkite-agent annotate --style "success" --context "nightly-benchmarks-results" --append < nightly-annotation.md
# The figures should be genereated by a separate process outside the CI/CD pipeline # The figures should be generated by a separate process outside the CI/CD pipeline
# # generate figures # # generate figures
# python3 -m pip install tabulate pandas matplotlib # python3 -m pip install tabulate pandas matplotlib

View File

@@ -12,7 +12,7 @@ check_gpus() {
echo "Need at least 1 GPU to run benchmarking." echo "Need at least 1 GPU to run benchmarking."
exit 1 exit 1
fi fi
declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}') declare -g gpu_type="$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')"
echo "GPU type is $gpu_type" echo "GPU type is $gpu_type"
} }
@@ -102,7 +102,7 @@ kill_gpu_processes() {
pkill -f text-generation pkill -f text-generation
pkill -f lmdeploy pkill -f lmdeploy
while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
sleep 1 sleep 1
done done
} }
@@ -119,8 +119,8 @@ wait_for_server() {
ensure_installed() { ensure_installed() {
# Ensure that the given command is installed by apt-get # Ensure that the given command is installed by apt-get
local cmd=$1 local cmd=$1
if ! which $cmd >/dev/null; then if ! which "$cmd" >/dev/null; then
apt-get update && apt-get install -y $cmd apt-get update && apt-get install -y "$cmd"
fi fi
} }
@@ -173,13 +173,11 @@ run_serving_tests() {
echo "Reuse previous server for test case $test_name" echo "Reuse previous server for test case $test_name"
else else
kill_gpu_processes kill_gpu_processes
bash $VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh \ bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
"$server_params" "$common_params" "$server_params" "$common_params"
fi fi
wait_for_server if wait_for_server; then
if [ $? -eq 0 ]; then
echo "" echo ""
echo "$CURRENT_LLM_SERVING_ENGINE server is up and running." echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
else else
@@ -190,13 +188,13 @@ run_serving_tests() {
# prepare tokenizer # prepare tokenizer
# this is required for lmdeploy. # this is required for lmdeploy.
cd $VLLM_SOURCE_CODE_LOC/benchmarks cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
rm -rf /tokenizer_cache rm -rf /tokenizer_cache
mkdir /tokenizer_cache mkdir /tokenizer_cache
python3 ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \ python3 ../.buildkite/nightly-benchmarks/scripts/download-tokenizer.py \
--model "$model" \ --model "$model" \
--cachedir /tokenizer_cache --cachedir /tokenizer_cache
cd $VLLM_SOURCE_CODE_LOC/benchmarks cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
# change model name for lmdeploy (it will not follow standard hf name) # change model name for lmdeploy (it will not follow standard hf name)
@@ -303,15 +301,113 @@ run_serving_tests() {
kill_gpu_processes kill_gpu_processes
} }
run_genai_perf_tests() {
# run genai-perf tests
# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
genai_perf_test_file=$1
# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
# get common parameters
common_params=$(echo "$params" | jq -r '.common_parameters')
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
reuse_server=$(echo "$common_params" | jq -r '.reuse_server')
# get client and server arguments
server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
qps_list=$(echo "$params" | jq -r '.qps_list')
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
echo "Running over qps list $qps_list"
# check if there is enough GPU to run the test
if [[ $gpu_count -lt $tp ]]; then
echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue
fi
if [[ $reuse_server == "true" ]]; then
echo "Reuse previous server for test case $test_name"
else
kill_gpu_processes
bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
"$server_params" "$common_params"
fi
if wait_for_server; then
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
else
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
break
fi
# iterate over different QPS
for qps in $qps_list; do
# remove the surrounding single quote from qps
if [[ "$qps" == *"inf"* ]]; then
echo "qps was $qps"
qps=$num_prompts
echo "now qps is $qps"
fi
new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE
if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
#TODO: add output dir.
client_command="genai-perf profile \
-m $model \
--service-kind openai \
--backend vllm \
--endpoint-type chat \
--streaming \
--url localhost:$port \
--request-rate $qps \
--num-prompts $num_prompts \
"
echo "Client command: $client_command"
eval "$client_command"
#TODO: process/record outputs
done
done
kill_gpu_processes
}
prepare_dataset() { prepare_dataset() {
# download sharegpt dataset # download sharegpt dataset
cd $VLLM_SOURCE_CODE_LOC/benchmarks cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# duplicate sonnet by 4x, to allow benchmarking with input length 2048 # duplicate sonnet by 4x, to allow benchmarking with input length 2048
cd $VLLM_SOURCE_CODE_LOC/benchmarks cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
echo "" > sonnet_4x.txt echo "" > sonnet_4x.txt
for _ in {1..4} for _ in {1..4}
do do
@@ -330,26 +426,35 @@ main() {
pip install -U transformers pip install -U transformers
pip install -r requirements-dev.txt
which genai-perf
# check storage # check storage
df -h df -h
ensure_installed wget ensure_installed wget
ensure_installed curl ensure_installed curl
ensure_installed jq ensure_installed jq
# genai-perf dependency
ensure_installed libb64-0d
prepare_dataset prepare_dataset
cd $VLLM_SOURCE_CODE_LOC/benchmarks cd "$VLLM_SOURCE_CODE_LOC/benchmarks"
declare -g RESULTS_FOLDER=results/ declare -g RESULTS_FOLDER=results/
mkdir -p $RESULTS_FOLDER mkdir -p $RESULTS_FOLDER
BENCHMARK_ROOT=$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/ BENCHMARK_ROOT="$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/"
# run the test # run the test
run_serving_tests $BENCHMARK_ROOT/tests/nightly-tests.json run_serving_tests "$BENCHMARK_ROOT/tests/nightly-tests.json"
# run genai-perf tests
run_genai_perf_tests "$BENCHMARK_ROOT/tests/genai-perf-tests.json"
mv artifacts/ $RESULTS_FOLDER/
# upload benchmark results to buildkite # upload benchmark results to buildkite
python3 -m pip install tabulate pandas python3 -m pip install tabulate pandas
python3 $BENCHMARK_ROOT/scripts/summary-nightly-results.py python3 "$BENCHMARK_ROOT/scripts/summary-nightly-results.py"
upload_to_buildkite upload_to_buildkite
} }

View File

@@ -6,6 +6,7 @@
# Do not set -e, as the mixtral 8x22B model tends to crash occasionally # Do not set -e, as the mixtral 8x22B model tends to crash occasionally
# and we still want to see other benchmarking results even when mixtral crashes. # and we still want to see other benchmarking results even when mixtral crashes.
set -x
set -o pipefail set -o pipefail
check_gpus() { check_gpus() {
@@ -17,7 +18,7 @@ check_gpus() {
echo "Need at least 1 GPU to run benchmarking." echo "Need at least 1 GPU to run benchmarking."
exit 1 exit 1
fi fi
declare -g gpu_type=$(echo $(nvidia-smi --query-gpu=name --format=csv,noheader) | awk '{print $2}') declare -g gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
echo "GPU type is $gpu_type" echo "GPU type is $gpu_type"
} }
@@ -85,15 +86,11 @@ kill_gpu_processes() {
ps -aux ps -aux
lsof -t -i:8000 | xargs -r kill -9 lsof -t -i:8000 | xargs -r kill -9
pkill -f pt_main_thread pgrep python3 | xargs -r kill -9
# this line doesn't work now
# ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
pkill -f python3
pkill -f /usr/bin/python3
# wait until GPU memory usage smaller than 1GB # wait until GPU memory usage smaller than 1GB
while [ $(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1) -ge 1000 ]; do while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
sleep 1 sleep 1
done done
@@ -117,7 +114,7 @@ upload_to_buildkite() {
fi fi
# Use the determined command to annotate and upload artifacts # Use the determined command to annotate and upload artifacts
$BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" <$RESULTS_FOLDER/benchmark_results.md $BUILDKITE_AGENT_COMMAND annotate --style "info" --context "$BUILDKITE_LABEL-benchmark-results" < "$RESULTS_FOLDER/benchmark_results.md"
$BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*" $BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
} }
@@ -150,7 +147,7 @@ run_latency_tests() {
# check if there is enough GPU to run the test # check if there is enough GPU to run the test
tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size') tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size')
if [[ $gpu_count -lt $tp ]]; then if [[ $gpu_count -lt $tp ]]; then
echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $testname." echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue continue
fi fi
@@ -206,9 +203,9 @@ run_throughput_tests() {
throughput_args=$(json2args "$throughput_params") throughput_args=$(json2args "$throughput_params")
# check if there is enough GPU to run the test # check if there is enough GPU to run the test
tp=$(echo $throughput_params | jq -r '.tensor_parallel_size') tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size')
if [[ $gpu_count -lt $tp ]]; then if [[ $gpu_count -lt $tp ]]; then
echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $testname." echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue continue
fi fi
@@ -270,7 +267,7 @@ run_serving_tests() {
# check if there is enough GPU to run the test # check if there is enough GPU to run the test
tp=$(echo "$server_params" | jq -r '.tensor_parallel_size') tp=$(echo "$server_params" | jq -r '.tensor_parallel_size')
if [[ $gpu_count -lt $tp ]]; then if [[ $gpu_count -lt $tp ]]; then
echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $testname." echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue continue
fi fi
@@ -278,7 +275,7 @@ run_serving_tests() {
server_model=$(echo "$server_params" | jq -r '.model') server_model=$(echo "$server_params" | jq -r '.model')
client_model=$(echo "$client_params" | jq -r '.model') client_model=$(echo "$client_params" | jq -r '.model')
if [[ $server_model != "$client_model" ]]; then if [[ $server_model != "$client_model" ]]; then
echo "Server model and client model must be the same. Skip testcase $testname." echo "Server model and client model must be the same. Skip testcase $test_name."
continue continue
fi fi
@@ -289,12 +286,11 @@ run_serving_tests() {
# run the server # run the server
echo "Running test case $test_name" echo "Running test case $test_name"
echo "Server command: $server_command" echo "Server command: $server_command"
eval "$server_command" & bash -c "$server_command" &
server_pid=$! server_pid=$!
# wait until the server is alive # wait until the server is alive
wait_for_server if wait_for_server; then
if [ $? -eq 0 ]; then
echo "" echo ""
echo "vllm server is up and running." echo "vllm server is up and running."
else else
@@ -323,7 +319,7 @@ run_serving_tests() {
echo "Running test case $test_name with qps $qps" echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command" echo "Client command: $client_command"
eval "$client_command" bash -c "$client_command"
# record the benchmarking commands # record the benchmarking commands
jq_output=$(jq -n \ jq_output=$(jq -n \

View File

@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import datetime import datetime
import json import json
import os import os
@@ -36,11 +38,11 @@ if __name__ == "__main__":
# collect results # collect results
for test_file in results_folder.glob("*.json"): for test_file in results_folder.glob("*.json"):
with open(test_file, "r") as f: with open(test_file) as f:
raw_result = json.loads(f.read()) raw_result = json.loads(f.read())
# attach the benchmarking command to raw_result # attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f: with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read()) command = json.loads(f.read())
raw_result.update(command) raw_result.update(command)

View File

@@ -1,12 +1,12 @@
#!/bin/sh #!/bin/sh
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token) TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT" URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"
TIMEOUT_SECONDS=10 TIMEOUT_SECONDS=10
retries=0 retries=0
while [ $retries -lt 1000 ]; do while [ $retries -lt 1000 ]; do
if [ $(curl -s --max-time $TIMEOUT_SECONDS -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" $URL) -eq 200 ]; then if [ "$(curl -s --max-time "$TIMEOUT_SECONDS" -L -H "Authorization: Bearer $TOKEN" -o /dev/null -w "%{http_code}" "$URL")" -eq 200 ]; then
exit 0 exit 0
fi fi

View File

@@ -0,0 +1,23 @@
[
{
"test_name": "llama8B_tp1_genai_perf",
"qps_list": [4,8,16,32],
"common_parameters": {
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"tp": 1,
"port": 8000,
"num_prompts": 500,
"reuse_server": false
},
"vllm_server_parameters": {
"disable_log_stats": "",
"disable_log_requests": "",
"gpu_memory_utilization": 0.9,
"num_scheduler_steps": 10,
"max_num_seqs": 512,
"dtype": "bfloat16"
},
"genai_perf_input_parameters": {
}
}
]

View File

@@ -1,33 +1,77 @@
steps: steps:
- label: "Build wheel - CUDA 12.1" - label: "Build wheel - CUDA 12.1"
agents: agents:
queue: cpu_queue queue: cpu_queue_postmerge
commands: commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ." - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts" - "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'" - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1 - "bash .buildkite/upload-wheels.sh"
- "for f in artifacts/dist/*.whl; do mv -- \"$$f\" \"$${f/linux/manylinux1}\"; done"
- "mv artifacts/dist/$(ls artifacts/dist) artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
- "aws s3 cp artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl s3://vllm-wheels/$BUILDKITE_COMMIT/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
- "aws s3 cp artifacts/dist/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl s3://vllm-wheels/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
env: env:
DOCKER_BUILDKIT: "1" DOCKER_BUILDKIT: "1"
- block: "Build CUDA 11.8 wheel" # Note(simon): We can always build CUDA 11.8 wheel to ensure the build is working.
key: block-build-cu118-wheel # However, this block can be uncommented to save some compute hours.
# - block: "Build CUDA 11.8 wheel"
# key: block-build-cu118-wheel
- label: "Build wheel - CUDA 11.8" - label: "Build wheel - CUDA 11.8"
depends_on: block-build-cu118-wheel # depends_on: block-build-cu118-wheel
agents: agents:
queue: cpu_queue queue: cpu_queue_postmerge
commands: commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ." - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts" - "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'" - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
# rename the files to change linux -> manylinux1 - "bash .buildkite/upload-wheels.sh"
- "for f in artifacts/dist/*.whl; do mv -- \"$$f\" \"$${f/linux/manylinux1}\"; done" env:
- "aws s3 cp --recursive artifacts/dist s3://vllm-wheels/$BUILDKITE_COMMIT/" DOCKER_BUILDKIT: "1"
- "aws s3 cp --recursive artifacts/dist s3://vllm-wheels/nightly/"
- block: "Build release image"
depends_on: ~
key: block-release-image-build
- label: "Build release image"
depends_on: block-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
- label: "Build and publish TPU release image"
depends_on: ~
if: build.env("NIGHTLY") == "1"
agents:
queue: tpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f Dockerfile.tpu ."
- "docker push vllm/vllm-tpu:nightly"
- "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
plugins:
- docker-login#v3.0.0:
username: vllm
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"
- input: "Provide Release version here"
fields:
- text: "What is the release version?"
key: "release-version"
- block: "Build CPU release image"
key: block-cpu-release-image-build
depends_on: ~
- label: "Build and publish CPU release image"
depends_on: block-cpu-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --progress plain -f Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env: env:
DOCKER_BUILDKIT: "1" DOCKER_BUILDKIT: "1"

View File

@@ -1,3 +1,5 @@
#!/bin/bash
# This script runs test inside the corresponding ROCm docker container. # This script runs test inside the corresponding ROCm docker container.
set -o pipefail set -o pipefail
@@ -31,8 +33,8 @@ cleanup_docker() {
echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..." echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..."
# Remove dangling images (those that are not tagged and not used by any container) # Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f docker image prune -f
# Remove unused volumes # Remove unused volumes / force the system prune for old images as well.
docker volume prune -f docker volume prune -f && docker system prune --force --filter "until=72h" --all
echo "Docker images and volumes cleanup completed." echo "Docker images and volumes cleanup completed."
else else
echo "Disk usage is below $threshold%. No cleanup needed." echo "Disk usage is below $threshold%. No cleanup needed."
@@ -57,17 +59,17 @@ done
echo "--- Pulling container" echo "--- Pulling container"
image_name="rocm/vllm-ci:${BUILDKITE_COMMIT}" image_name="rocm/vllm-ci:${BUILDKITE_COMMIT}"
container_name="rocm_${BUILDKITE_COMMIT}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)" container_name="rocm_${BUILDKITE_COMMIT}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
docker pull ${image_name} docker pull "${image_name}"
remove_docker_container() { remove_docker_container() {
docker rm -f ${container_name} || docker image rm -f ${image_name} || true docker rm -f "${container_name}" || docker image rm -f "${image_name}" || true
} }
trap remove_docker_container EXIT trap remove_docker_container EXIT
echo "--- Running container" echo "--- Running container"
HF_CACHE="$(realpath ~)/huggingface" HF_CACHE="$(realpath ~)/huggingface"
mkdir -p ${HF_CACHE} mkdir -p "${HF_CACHE}"
HF_MOUNT="/root/.cache/huggingface" HF_MOUNT="/root/.cache/huggingface"
commands=$@ commands=$@
@@ -83,7 +85,6 @@ if [[ $commands == *" kernels "* ]]; then
--ignore=kernels/test_encoder_decoder_attn.py \ --ignore=kernels/test_encoder_decoder_attn.py \
--ignore=kernels/test_flash_attn.py \ --ignore=kernels/test_flash_attn.py \
--ignore=kernels/test_flashinfer.py \ --ignore=kernels/test_flashinfer.py \
--ignore=kernels/test_gguf.py \
--ignore=kernels/test_int8_quant.py \ --ignore=kernels/test_int8_quant.py \
--ignore=kernels/test_machete_gemm.py \ --ignore=kernels/test_machete_gemm.py \
--ignore=kernels/test_mamba_ssm.py \ --ignore=kernels/test_mamba_ssm.py \
@@ -107,35 +108,36 @@ fi
PARALLEL_JOB_COUNT=8 PARALLEL_JOB_COUNT=8
# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs. # check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then if [[ $commands == *"--shard-id="* ]]; then
# assign job count as the number of shards used
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "}
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
#replace shard arguments # assign shard-id for each shard
commands=${commands//"--shard-id= "/"--shard-id=${GPU} "} commands_gpu=${commands//"--shard-id= "/"--shard-id=${GPU} "}
commands=${commands//"--num-shards= "/"--num-shards=${PARALLEL_JOB_COUNT} "} echo "Shard ${GPU} commands:$commands_gpu"
echo "Shard ${GPU} commands:$commands"
docker run \ docker run \
--device /dev/kfd --device /dev/dri \ --device /dev/kfd --device /dev/dri \
--network host \ --network host \
--shm-size=16gb \ --shm-size=16gb \
--rm \ --rm \
-e HIP_VISIBLE_DEVICES=${GPU} \ -e HIP_VISIBLE_DEVICES="${GPU}" \
-e HF_TOKEN \ -e HF_TOKEN \
-v ${HF_CACHE}:${HF_MOUNT} \ -v "${HF_CACHE}:${HF_MOUNT}" \
-e HF_HOME=${HF_MOUNT} \ -e "HF_HOME=${HF_MOUNT}" \
--name ${container_name}_${GPU} \ --name "${container_name}_${GPU}" \
${image_name} \ "${image_name}" \
/bin/bash -c "${commands}" \ /bin/bash -c "${commands_gpu}" \
|& while read -r line; do echo ">>Shard $GPU: $line"; done & |& while read -r line; do echo ">>Shard $GPU: $line"; done &
PIDS+=($!) PIDS+=($!)
done done
#wait for all processes to finish and collect exit codes #wait for all processes to finish and collect exit codes
for pid in ${PIDS[@]}; do for pid in "${PIDS[@]}"; do
wait ${pid} wait "${pid}"
STATUS+=($?) STATUS+=($?)
done done
for st in ${STATUS[@]}; do for st in "${STATUS[@]}"; do
if [[ ${st} -ne 0 ]]; then if [[ ${st} -ne 0 ]]; then
echo "One of the processes failed with $st" echo "One of the processes failed with $st"
exit ${st} exit "${st}"
fi fi
done done
else else
@@ -146,9 +148,9 @@ else
--rm \ --rm \
-e HIP_VISIBLE_DEVICES=0 \ -e HIP_VISIBLE_DEVICES=0 \
-e HF_TOKEN \ -e HF_TOKEN \
-v ${HF_CACHE}:${HF_MOUNT} \ -v "${HF_CACHE}:${HF_MOUNT}" \
-e HF_HOME=${HF_MOUNT} \ -e "HF_HOME=${HF_MOUNT}" \
--name ${container_name} \ --name "${container_name}" \
${image_name} \ "${image_name}" \
/bin/bash -c "${commands}" /bin/bash -c "${commands}"
fi fi

View File

@@ -1,3 +1,5 @@
#!/bin/bash
# This script is run by buildkite to run the benchmarks and upload the results to buildkite # This script is run by buildkite to run the benchmarks and upload the results to buildkite
set -ex set -ex

View File

@@ -1,39 +1,14 @@
#!/bin/bash
# This script build the CPU docker image and run the offline inference inside the container. # This script build the CPU docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage. # It serves a sanity check for compilation and basic model usage.
set -ex set -ex
# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .
# Setup cleanup # Setup cleanup
remove_docker_container() { docker rm -f cpu-test || true; } remove_docker_container() { docker rm -f cpu-test || true; docker system prune -f; }
trap remove_docker_container EXIT trap remove_docker_container EXIT
remove_docker_container remove_docker_container
# Run the image, setting --shm-size=4g for tensor parallel. # Try building the docker image
source /etc/environment docker build -t cpu-test -f Dockerfile.ppc64le .
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN=$HF_TOKEN --name cpu-test cpu-test
# Run basic model test
docker exec cpu-test bash -c "
pip install pytest matplotlib einops transformers_stream_generator
pytest -v -s tests/models -m \"not vlm\" \
--ignore=tests/models/test_embedding.py \
--ignore=tests/models/test_oot_registration.py \
--ignore=tests/models/test_registry.py \
--ignore=tests/models/test_jamba.py \
--ignore=tests/models/test_mamba.py \
--ignore=tests/models/test_danube3_4b.py" # Mamba kernels and Danube3-4B on CPU is not supported
# online inference
docker exec cpu-test bash -c "
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"

View File

@@ -1,57 +1,88 @@
#!/bin/bash
# This script build the CPU docker image and run the offline inference inside the container. # This script build the CPU docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage. # It serves a sanity check for compilation and basic model usage.
set -ex set -ex
# allow to bind to different cores
CORE_RANGE=${CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}
# Try building the docker image # Try building the docker image
numactl -C 48-95 -N 1 docker build -t cpu-test -f Dockerfile.cpu . numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test-"$BUILDKITE_BUILD_NUMBER" -f Dockerfile.cpu .
numactl -C 48-95 -N 1 docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu . numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 -f Dockerfile.cpu .
# Setup cleanup # Setup cleanup
remove_docker_container() { docker rm -f cpu-test cpu-test-avx2 || true; } remove_docker_container() { set -e; docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true; }
trap remove_docker_container EXIT trap remove_docker_container EXIT
remove_docker_container remove_docker_container
# Run the image, setting --shm-size=4g for tensor parallel. # Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 \ docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems=1 --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 \ docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems=1 --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2 cpu-test-avx2 --cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2
# offline inference function cpu_tests() {
docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py" set -e
export NUMA_NODE=$2
# Run basic model test # offline inference
docker exec cpu-test bash -c " docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
pip install pytest matplotlib einops transformers_stream_generator datamodel_code_generator set -e
pytest -v -s tests/models/encoder_decoder/language python3 examples/offline_inference/basic.py"
pytest -v -s tests/models/decoder_only/language \
--ignore=tests/models/test_fp8.py \
--ignore=tests/models/decoder_only/language/test_jamba.py \
--ignore=tests/models/decoder_only/language/test_mamba.py \
--ignore=tests/models/decoder_only/language/test_granitemoe.py \
--ignore=tests/models/decoder_only/language/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported
# Run compressed-tensor test # Run basic model test
docker exec cpu-test bash -c " docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
pytest -s -v \ set -e
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \ pip install -r vllm/requirements-test.txt
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token" pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"
# Run AWQ test # Run compressed-tensor test
docker exec cpu-test bash -c " docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
pytest -s -v \ set -e
tests/quantization/test_ipex_quant.py" pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"
# online inference # Run AWQ test
docker exec cpu-test bash -c " docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
export VLLM_CPU_KVCACHE_SPACE=10 set -e
export VLLM_CPU_OMP_THREADS_BIND=48-92 pytest -s -v \
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m & tests/quantization/test_ipex_quant.py"
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \ # Run chunked-prefill and prefix-cache test
--backend vllm \ docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
--dataset-name random \ set -e
--model facebook/opt-125m \ pytest -s -v -k cpu_model \
--num-prompts 20 \ tests/basic_correctness/test_chunked_prefill.py"
--endpoint /v1/completions \
--tokenizer facebook/opt-125m" # online serving
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=$1
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --dtype half &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
# Run multi-lora tests
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/lora/test_qwen2vl.py"
}
# All of CPU tests are expected to be finished less than 40 mins.
export -f cpu_tests
timeout 40m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE"

View File

@@ -0,0 +1,28 @@
#!/bin/bash
# This script build the GH200 docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage.
set -ex
# Skip the new torch installation during build since we are using the specified version for arm64 in the Dockerfile
python3 use_existing_torch.py
# Try building the docker image
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--platform "linux/arm64" \
-t gh200-test \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg torch_cuda_arch_list="9.0+PTX" \
--build-arg vllm_fa_cmake_gpu_arches="90-real"
# Setup cleanup
remove_docker_container() { docker rm -f gh200-test || true; }
trap remove_docker_container EXIT
remove_docker_container
# Run the image and test offline inference
docker run -e HF_TOKEN -v /root/.cache/huggingface:/root/.cache/huggingface --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
python3 examples/offline_inference/cli.py --model meta-llama/Llama-3.2-1B
'

View File

@@ -0,0 +1,24 @@
#!/bin/bash
# This script build the CPU docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage.
set -ex
# Try building the docker image
docker build -t hpu-test-env -f Dockerfile.hpu .
# Setup cleanup
# certain versions of HPU software stack have a bug that can
# override the exit code of the script, so we need to use
# separate remove_docker_container and remove_docker_container_and_exit
# functions, while other platforms only need one remove_docker_container
# function.
EXITCODE=1
remove_docker_container() { docker rm -f hpu-test || true; }
remove_docker_container_and_exit() { remove_docker_container; exit $EXITCODE; }
trap remove_docker_container_and_exit EXIT
remove_docker_container
# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
EXITCODE=$?

View File

@@ -14,7 +14,7 @@ DOCKER_IMAGE=$4
shift 4 shift 4
COMMANDS=("$@") COMMANDS=("$@")
if [ ${#COMMANDS[@]} -ne $NUM_NODES ]; then if [ ${#COMMANDS[@]} -ne "$NUM_NODES" ]; then
echo "The number of commands must be equal to the number of nodes." echo "The number of commands must be equal to the number of nodes."
echo "Number of nodes: $NUM_NODES" echo "Number of nodes: $NUM_NODES"
echo "Number of commands: ${#COMMANDS[@]}" echo "Number of commands: ${#COMMANDS[@]}"
@@ -23,7 +23,7 @@ fi
echo "List of commands" echo "List of commands"
for command in "${COMMANDS[@]}"; do for command in "${COMMANDS[@]}"; do
echo $command echo "$command"
done done
start_network() { start_network() {
@@ -36,7 +36,7 @@ start_nodes() {
for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do
DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu)) DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu))
GPU_DEVICES+=$(($DEVICE_NUM)) GPU_DEVICES+=$(($DEVICE_NUM))
if [ $node_gpu -lt $(($NUM_GPUS - 1)) ]; then if [ "$node_gpu" -lt $(($NUM_GPUS - 1)) ]; then
GPU_DEVICES+=',' GPU_DEVICES+=','
fi fi
done done
@@ -49,17 +49,20 @@ start_nodes() {
# 3. map the huggingface cache directory to the container # 3. map the huggingface cache directory to the container
# 3. assign ip addresses to the containers (head node: 192.168.10.10, worker nodes: # 3. assign ip addresses to the containers (head node: 192.168.10.10, worker nodes:
# starting from 192.168.10.11) # starting from 192.168.10.11)
docker run -d --gpus "$GPU_DEVICES" --shm-size=10.24gb -e HF_TOKEN -v ~/.cache/huggingface:/root/.cache/huggingface --name node$node --network docker-net --ip 192.168.10.$((10 + $node)) --rm $DOCKER_IMAGE /bin/bash -c "tail -f /dev/null" docker run -d --gpus "$GPU_DEVICES" --shm-size=10.24gb -e HF_TOKEN \
-v ~/.cache/huggingface:/root/.cache/huggingface --name "node$node" \
--network docker-net --ip 192.168.10.$((10 + $node)) --rm "$DOCKER_IMAGE" \
/bin/bash -c "tail -f /dev/null"
# organize containers into a ray cluster # organize containers into a ray cluster
if [ $node -eq 0 ]; then if [ "$node" -eq 0 ]; then
# start the ray head node # start the ray head node
docker exec -d node$node /bin/bash -c "ray start --head --port=6379 --block" docker exec -d "node$node" /bin/bash -c "ray start --head --port=6379 --block"
# wait for the head node to be ready # wait for the head node to be ready
sleep 10 sleep 10
else else
# start the ray worker nodes, and connect them to the head node # start the ray worker nodes, and connect them to the head node
docker exec -d node$node /bin/bash -c "ray start --address=192.168.10.10:6379 --block" docker exec -d "node$node" /bin/bash -c "ray start --address=192.168.10.10:6379 --block"
fi fi
done done
@@ -79,22 +82,22 @@ run_nodes() {
for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do for node_gpu in $(seq 0 $(($NUM_GPUS - 1))); do
DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu)) DEVICE_NUM=$(($node * $NUM_GPUS + $node_gpu))
GPU_DEVICES+=$(($DEVICE_NUM)) GPU_DEVICES+=$(($DEVICE_NUM))
if [ $node_gpu -lt $(($NUM_GPUS - 1)) ]; then if [ "$node_gpu" -lt $(($NUM_GPUS - 1)) ]; then
GPU_DEVICES+=',' GPU_DEVICES+=','
fi fi
done done
GPU_DEVICES+='"' GPU_DEVICES+='"'
echo "Running node$node with GPU devices: $GPU_DEVICES" echo "Running node$node with GPU devices: $GPU_DEVICES"
if [ $node -ne 0 ]; then if [ "$node" -ne 0 ]; then
docker exec -d node$node /bin/bash -c "cd $WORKING_DIR ; ${COMMANDS[$node]}" docker exec -d "node$node" /bin/bash -c "cd $WORKING_DIR ; ${COMMANDS[$node]}"
else else
docker exec node$node /bin/bash -c "cd $WORKING_DIR ; ${COMMANDS[$node]}" docker exec "node$node" /bin/bash -c "cd $WORKING_DIR ; ${COMMANDS[$node]}"
fi fi
done done
} }
cleanup() { cleanup() {
for node in $(seq 0 $(($NUM_NODES-1))); do for node in $(seq 0 $(($NUM_NODES-1))); do
docker stop node$node docker stop "node$node"
done done
docker network rm docker-net docker network rm docker-net
} }

View File

@@ -1,6 +1,20 @@
#!/bin/bash
# This script build the Neuron docker image and run the API server inside the container. # This script build the Neuron docker image and run the API server inside the container.
# It serves a sanity check for compilation and basic model usage. # It serves a sanity check for compilation and basic model usage.
set -e set -e
set -v
image_name="neuron/vllm-ci"
container_name="neuron_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
HF_CACHE="$(realpath ~)/huggingface"
mkdir -p "${HF_CACHE}"
HF_MOUNT="/root/.cache/huggingface"
NEURON_COMPILE_CACHE_URL="$(realpath ~)/neuron_compile_cache"
mkdir -p "${NEURON_COMPILE_CACHE_URL}"
NEURON_COMPILE_CACHE_MOUNT="/root/.cache/neuron_compile_cache"
# Try building the docker image # Try building the docker image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
@@ -11,41 +25,33 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
last_build=$(cat /tmp/neuron-docker-build-timestamp) last_build=$(cat /tmp/neuron-docker-build-timestamp)
current_time=$(date +%s) current_time=$(date +%s)
if [ $((current_time - last_build)) -gt 86400 ]; then if [ $((current_time - last_build)) -gt 86400 ]; then
docker system prune -f # Remove dangling images (those that are not tagged and not used by any container)
echo $current_time > /tmp/neuron-docker-build-timestamp docker image prune -f
# Remove unused volumes / force the system prune for old images as well.
docker volume prune -f && docker system prune -f
# Remove huggingface model artifacts and compiler cache
rm -rf "${HF_MOUNT:?}/*"
rm -rf "${NEURON_COMPILE_CACHE_MOUNT:?}/*"
echo "$current_time" > /tmp/neuron-docker-build-timestamp
fi fi
else else
echo $(date +%s) > /tmp/neuron-docker-build-timestamp date "+%s" > /tmp/neuron-docker-build-timestamp
fi fi
docker build -t neuron -f Dockerfile.neuron . docker build -t "${image_name}" -f Dockerfile.neuron .
# Setup cleanup # Setup cleanup
remove_docker_container() { docker rm -f neuron || true; } remove_docker_container() {
docker image rm -f "${image_name}" || true;
}
trap remove_docker_container EXIT trap remove_docker_container EXIT
remove_docker_container
# Run the image # Run the image
docker run --device=/dev/neuron0 --device=/dev/neuron1 --network host --name neuron neuron python3 -m vllm.entrypoints.api_server \ docker run --rm -it --device=/dev/neuron0 --device=/dev/neuron1 --network host \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-num-seqs 8 --max-model-len 128 --block-size 128 --device neuron --tensor-parallel-size 2 & -v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
# Wait for the server to start -v "${NEURON_COMPILE_CACHE_URL}:${NEURON_COMPILE_CACHE_MOUNT}" \
wait_for_server_to_start() { -e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
timeout=300 --name "${container_name}" \
counter=0 ${image_name} \
/bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py && python3 -m pytest /workspace/vllm/tests/neuron/ -v --capture=tee-sys"
while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
sleep 1
counter=$((counter + 1))
if [ $counter -ge $timeout ]; then
echo "Timeout after $timeout seconds"
break
fi
done
}
wait_for_server_to_start
# Test a simple prompt
curl -X POST -H "Content-Type: application/json" \
localhost:8000/generate \
-d '{"prompt": "San Francisco is a"}'

View File

@@ -1,3 +1,5 @@
#!/bin/bash
# This script build the OpenVINO docker image and run the offline inference inside the container. # This script build the OpenVINO docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage. # It serves a sanity check for compilation and basic model usage.
set -ex set -ex
@@ -11,4 +13,4 @@ trap remove_docker_container EXIT
remove_docker_container remove_docker_container
# Run the image and launch offline inference # Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/vllm/examples/offline_inference.py docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/basic.py

13
.buildkite/run-tpu-test.sh Normal file → Executable file
View File

@@ -1,3 +1,5 @@
#!/bin/bash
set -e set -e
# Build the docker image. # Build the docker image.
@@ -12,4 +14,13 @@ remove_docker_container
# For HF_TOKEN. # For HF_TOKEN.
source /etc/environment source /etc/environment
# Run a simple end-to-end example. # Run a simple end-to-end example.
docker run --privileged --net host --shm-size=16G -it -e HF_TOKEN=$HF_TOKEN --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py" docker run --privileged --net host --shm-size=16G -it \
-e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
&& python3 -m pip install pytest \
&& python3 -m pip install lm_eval[api]==0.4.4 \
&& pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py \
&& pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \
&& python3 /workspace/vllm/tests/tpu/test_compilation.py \
&& python3 /workspace/vllm/tests/tpu/test_quantization_accuracy.py \
&& python3 /workspace/vllm/examples/offline_inference/tpu.py"

View File

@@ -1,3 +1,5 @@
#!/bin/bash
# This script build the CPU docker image and run the offline inference inside the container. # This script build the CPU docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage. # It serves a sanity check for compilation and basic model usage.
set -ex set -ex
@@ -10,5 +12,8 @@ remove_docker_container() { docker rm -f xpu-test || true; }
trap remove_docker_container EXIT trap remove_docker_container EXIT
remove_docker_container remove_docker_container
# Run the image and launch offline inference # Run the image and test offline inference/tensor parallel
docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test python3 examples/offline_inference.py docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
python3 examples/offline_inference/basic.py
python3 examples/offline_inference/cli.py -tp 2
'

View File

@@ -9,7 +9,7 @@
# label(str): the name of the test. emoji allowed. # label(str): the name of the test. emoji allowed.
# fast_check(bool): whether to run this on each commit on fastcheck pipeline. # fast_check(bool): whether to run this on each commit on fastcheck pipeline.
# fast_check_only(bool): run this test on fastcheck pipeline only # fast_check_only(bool): run this test on fastcheck pipeline only
# optional(bool): never run this test by default (i.e. need to unblock manually) # optional(bool): never run this test by default (i.e. need to unblock manually) unless it's scheduled nightly run.
# command(str): the single command to run for tests. incompatible with commands. # command(str): the single command to run for tests. incompatible with commands.
# commands(list): the list of commands to run for test. incompatbile with command. # commands(list): the list of commands to run for test. incompatbile with command.
# mirror_hardwares(list): the list of hardwares to run the test on as well. currently only supports [amd] # mirror_hardwares(list): the list of hardwares to run the test on as well. currently only supports [amd]
@@ -38,7 +38,7 @@ steps:
- pip install -r requirements-docs.txt - pip install -r requirements-docs.txt
- SPHINXOPTS=\"-W\" make html - SPHINXOPTS=\"-W\" make html
# Check API reference (if it fails, you may have missing mock imports) # Check API reference (if it fails, you may have missing mock imports)
- grep \"sig sig-object py\" build/html/dev/sampling_params.html - grep \"sig sig-object py\" build/html/api/inference_params.html
- label: Async Engine, Inputs, Utils, Worker Test # 24min - label: Async Engine, Inputs, Utils, Worker Test # 24min
fast_check: true fast_check: true
@@ -50,7 +50,9 @@ steps:
- tests/multimodal - tests/multimodal
- tests/test_utils - tests/test_utils
- tests/worker - tests/worker
- tests/standalone_tests/lazy_imports.py
commands: commands:
- python3 standalone_tests/lazy_imports.py
- pytest -v -s mq_llm_engine # MQLLMEngine - pytest -v -s mq_llm_engine # MQLLMEngine
- pytest -v -s async_engine # AsyncLLMEngine - pytest -v -s async_engine # AsyncLLMEngine
- NUM_SCHEDULER_STEPS=4 pytest -v -s async_engine/test_async_llm_engine.py - NUM_SCHEDULER_STEPS=4 pytest -v -s async_engine/test_async_llm_engine.py
@@ -59,6 +61,13 @@ steps:
- pytest -v -s test_utils.py # Utils - pytest -v -s test_utils.py # Utils
- pytest -v -s worker # Worker - pytest -v -s worker # Worker
- label: Python-only Installation Test
source_file_dependencies:
- tests/standalone_tests/python_only_compile.sh
- setup.py
commands:
- bash standalone_tests/python_only_compile.sh
- label: Basic Correctness Test # 30min - label: Basic Correctness Test # 30min
#mirror_hardwares: [amd] #mirror_hardwares: [amd]
fast_check: true fast_check: true
@@ -67,7 +76,9 @@ steps:
- tests/basic_correctness/test_basic_correctness - tests/basic_correctness/test_basic_correctness
- tests/basic_correctness/test_cpu_offload - tests/basic_correctness/test_cpu_offload
- tests/basic_correctness/test_preemption - tests/basic_correctness/test_preemption
- tests/basic_correctness/test_cumem.py
commands: commands:
- pytest -v -s basic_correctness/test_cumem.py
- pytest -v -s basic_correctness/test_basic_correctness.py - pytest -v -s basic_correctness/test_basic_correctness.py
- pytest -v -s basic_correctness/test_cpu_offload.py - pytest -v -s basic_correctness/test_cpu_offload.py
- VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py - VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
@@ -97,14 +108,12 @@ steps:
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
commands: commands:
- pip install -e ./plugins/vllm_add_dummy_model - pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process - pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process - pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process - pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process - pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py - pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py
- pytest -v -s entrypoints/openai/test_oot_registration.py # it needs a clean process
- pytest -v -s entrypoints/test_chat_utils.py - pytest -v -s entrypoints/test_chat_utils.py
- pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests - pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests
@@ -118,10 +127,17 @@ steps:
- tests/distributed - tests/distributed
- tests/spec_decode/e2e/test_integration_dist_tp4 - tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile - tests/compile
- examples/offline_inference/rlhf.py
- examples/offline_inference/ray_placement.py
commands: commands:
- pytest -v -s distributed/test_utils.py
- pytest -v -s compile/test_basic_correctness.py - pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py - pytest -v -s distributed/test_pynccl.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py - pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
# TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests
- python3 ../examples/offline_inference/rlhf.py
- RAY_DEDUP_LOGS=0 python3 ../examples/offline_inference/ray_placement.py
- label: Metrics, Tracing Test # 10min - label: Metrics, Tracing Test # 10min
num_gpus: 2 num_gpus: 2
@@ -163,27 +179,47 @@ steps:
# OOM in the CI unless we run this separately # OOM in the CI unless we run this separately
- pytest -v -s tokenization - pytest -v -s tokenization
- label: Examples Test # 15min - label: V1 Test
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/v1
commands:
# split the test to avoid interference
- VLLM_USE_V1=1 pytest -v -s v1/core
- VLLM_USE_V1=1 pytest -v -s v1/engine
- VLLM_USE_V1=1 pytest -v -s v1/sample
- VLLM_USE_V1=1 pytest -v -s v1/worker
- VLLM_USE_V1=1 pytest -v -s v1/test_stats.py
- VLLM_USE_V1=1 pytest -v -s v1/test_utils.py
# TODO: accuracy does not match, whether setting
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- VLLM_USE_V1=1 pytest -v -s v1/e2e
- label: Examples Test # 25min
working_dir: "/vllm-workspace/examples" working_dir: "/vllm-workspace/examples"
#mirror_hardwares: [amd] #mirror_hardwares: [amd]
source_file_dependencies: source_file_dependencies:
- vllm/entrypoints - vllm/entrypoints
- examples/ - examples/
commands: commands:
- pip install awscli tensorizer # for llava example and tensorizer test - pip install tensorizer # for tensorizer test
- python3 offline_inference.py - python3 offline_inference/basic.py
- python3 cpu_offload.py - python3 offline_inference/cpu_offload.py
- python3 offline_inference_chat.py - python3 offline_inference/chat.py
- python3 offline_inference_with_prefix.py - python3 offline_inference/prefix_caching.py
- python3 llm_engine_example.py - python3 offline_inference/llm_engine_example.py
- python3 offline_inference_vision_language.py - python3 offline_inference/vision_language.py
- python3 offline_inference_vision_language_multi_image.py - python3 offline_inference/vision_language_multi_image.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors - python3 other/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 other/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py - python3 offline_inference/encoder_decoder.py
- python3 offline_profile.py --model facebook/opt-125m - python3 offline_inference/classification.py
- python3 offline_inference/embedding.py
- python3 offline_inference/scoring.py
- python3 offline_inference/profiling.py --model facebook/opt-125m run_num_steps --num-steps 2
- label: Prefix Caching Test # 9min - label: Prefix Caching Test # 9min
#mirror_hardwares: [amd] mirror_hardwares: [amd]
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
- tests/prefix_caching - tests/prefix_caching
@@ -195,6 +231,7 @@ steps:
- vllm/model_executor/layers - vllm/model_executor/layers
- vllm/sampling_metadata.py - vllm/sampling_metadata.py
- tests/samplers - tests/samplers
- tests/conftest.py
commands: commands:
- pytest -v -s samplers - pytest -v -s samplers
- VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers - VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
@@ -203,23 +240,29 @@ steps:
mirror_hardwares: [amd] mirror_hardwares: [amd]
source_file_dependencies: source_file_dependencies:
- vllm/model_executor/layers - vllm/model_executor/layers
- vllm/model_executor/guided_decoding
- tests/test_logits_processor - tests/test_logits_processor
command: pytest -v -s test_logits_processor.py - tests/model_executor/test_guided_processors
commands:
- pytest -v -s test_logits_processor.py
- pytest -v -s model_executor/test_guided_processors.py
- label: Speculative decoding tests # 30min - label: Speculative decoding tests # 40min
source_file_dependencies: source_file_dependencies:
- vllm/spec_decode - vllm/spec_decode
- tests/spec_decode - tests/spec_decode
- vllm/model_executor/models/eagle.py
commands: commands:
- pytest -v -s spec_decode/e2e/test_multistep_correctness.py - pytest -v -s spec_decode/e2e/test_multistep_correctness.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py
- pytest -v -s spec_decode/e2e/test_eagle_correctness.py
- label: LoRA Test %N # 15min each - label: LoRA Test %N # 15min each
mirror_hardwares: [amd] mirror_hardwares: [amd]
source_file_dependencies: source_file_dependencies:
- vllm/lora - vllm/lora
- tests/lora - tests/lora
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_minicpmv_tp.py
parallelism: 4 parallelism: 4
- label: "PyTorch Fullgraph Smoke Test" # 9min - label: "PyTorch Fullgraph Smoke Test" # 9min
@@ -229,15 +272,16 @@ steps:
- tests/compile - tests/compile
commands: commands:
- pytest -v -s compile/test_basic_correctness.py - pytest -v -s compile/test_basic_correctness.py
# these tests need to be separated, cannot combine
- pytest -v -s compile/piecewise/test_simple.py
- pytest -v -s compile/piecewise/test_toy_llama.py
# TODO: re-write in comparison tests, and fix symbolic shape - label: "PyTorch Fullgraph Test" # 18min
# for quantization ops. source_file_dependencies:
# - label: "PyTorch Fullgraph Test" # 18min - vllm/
# source_file_dependencies: - tests/compile
# - vllm/ commands:
# - tests/compile - pytest -v -s compile/test_full_graph.py
# commands:
# - pytest -v -s compile/test_full_graph.py
- label: Kernels Test %N # 1h each - label: Kernels Test %N # 1h each
mirror_hardwares: [amd] mirror_hardwares: [amd]
@@ -266,7 +310,6 @@ steps:
source_file_dependencies: source_file_dependencies:
- benchmarks/ - benchmarks/
commands: commands:
- pip install aiohttp
- bash run-benchmarks.sh - bash run-benchmarks.sh
- label: Quantization Test # 33min - label: Quantization Test # 33min
@@ -303,46 +346,84 @@ steps:
##### models test ##### ##### models test #####
- label: Basic Models Test # 3min - label: Basic Models Test # 24min
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
- tests/models - tests/models
commands: commands:
- pip install -e ./plugins/vllm_add_dummy_model - pytest -v -s models/test_transformers.py
- pytest -v -s models/test_oot_registration.py # it needs a clean process - pytest -v -s models/test_registry.py
- pytest -v -s models/*.py --ignore=models/test_oot_registration.py - pytest -v -s models/test_initialization.py
- label: Decoder-only Language Models Test # 1h36min - label: Language Models Test (Standard) # 32min
#mirror_hardwares: [amd] #mirror_hardwares: [amd]
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
- tests/models/decoder_only/language - tests/models/decoder_only/language
- tests/models/embedding/language
- tests/models/encoder_decoder/language
commands: commands:
- pytest -v -s models/decoder_only/language - pytest -v -s models/decoder_only/language -m 'core_model or quant_model'
- pytest -v -s models/embedding/language -m core_model
- label: Decoder-only Multi-Modal Models Test # 1h31min - label: Language Models Test (Extended) # 1h10min
optional: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
- tests/models/embedding/language
- tests/models/encoder_decoder/language
commands:
- pytest -v -s models/decoder_only/language -m 'not core_model and not quant_model'
- pytest -v -s models/embedding/language -m 'not core_model'
- label: Multi-Modal Models Test (Standard) # 40min
#mirror_hardwares: [amd] #mirror_hardwares: [amd]
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
- tests/models/decoder_only/audio_language - tests/models/decoder_only/audio_language
- tests/models/decoder_only/vision_language - tests/models/decoder_only/vision_language
commands:
- pytest -v -s models/decoder_only/audio_language
- pytest -v -s models/decoder_only/vision_language
- label: Other Models Test # 6min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/embedding/language
- tests/models/embedding/vision_language - tests/models/embedding/vision_language
- tests/models/encoder_decoder/language - tests/models/encoder_decoder/audio_language
- tests/models/encoder_decoder/vision_language - tests/models/encoder_decoder/vision_language
commands: commands:
- pytest -v -s models/embedding/language - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/embedding/vision_language - pytest -v -s models/multimodal
- pytest -v -s models/encoder_decoder/language - pytest -v -s models/decoder_only/audio_language -m 'core_model or quant_model'
- pytest -v -s models/encoder_decoder/vision_language - pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'core_model or quant_model'
- pytest -v -s models/embedding/vision_language -m core_model
- pytest -v -s models/encoder_decoder/audio_language -m core_model
- pytest -v -s models/encoder_decoder/language -m core_model
- pytest -v -s models/encoder_decoder/vision_language -m core_model
- label: Multi-Modal Models Test (Extended) 1 # 48m
optional: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/audio_language
- tests/models/decoder_only/vision_language
- tests/models/embedding/vision_language
- tests/models/encoder_decoder/vision_language
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/decoder_only/audio_language -m 'not core_model and not quant_model'
- pytest -v -s models/decoder_only/vision_language/test_models.py -m 'split(group=0) and not core_model and not quant_model'
# HACK - run phi3v tests separately to sidestep this transformers bug
# https://github.com/huggingface/transformers/issues/34307
- pytest -v -s models/decoder_only/vision_language/test_phi3v.py
- pytest -v -s --ignore models/decoder_only/vision_language/test_models.py --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'not core_model and not quant_model'
- pytest -v -s models/embedding/vision_language -m 'not core_model'
- pytest -v -s models/encoder_decoder/language -m 'not core_model'
- pytest -v -s models/encoder_decoder/vision_language -m 'not core_model'
- label: Multi-Modal Models Test (Extended) 2 # 38m
optional: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/vision_language
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/decoder_only/vision_language/test_models.py -m 'split(group=1) and not core_model and not quant_model'
# This test is used only in PR development phase to test individual models and should never run on main # This test is used only in PR development phase to test individual models and should never run on main
- label: Custom Models Test - label: Custom Models Test
@@ -378,11 +459,11 @@ steps:
- tests/distributed/ - tests/distributed/
commands: commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up) - # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed' - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py - VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up) - # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed' - VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- label: Distributed Tests (2 GPUs) # 40min - label: Distributed Tests (2 GPUs) # 40min
#mirror_hardwares: [amd] #mirror_hardwares: [amd]
@@ -395,20 +476,46 @@ steps:
- vllm/model_executor/models/ - vllm/model_executor/models/
- tests/distributed/ - tests/distributed/
- vllm/compilation - vllm/compilation
- vllm/worker/worker_base.py
- vllm/worker/worker.py
- vllm/worker/model_runner.py
- entrypoints/llm/test_collective_rpc.py
commands: commands:
- pytest -v -s entrypoints/llm/test_collective_rpc.py
- torchrun --nproc-per-node=2 distributed/test_torchrun_example.py
- pytest -v -s ./compile/test_basic_correctness.py - pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py - pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed' - VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus - TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
# Avoid importing model tests that cause CUDA reinitialization error # Avoid importing model tests that cause CUDA reinitialization error
- pytest models/encoder_decoder/language/test_bart.py -v -s -m distributed_2_gpus - pytest models/test_transformers.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m distributed_2_gpus - pytest models/encoder_decoder/language/test_bart.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/decoder_only/vision_language/test_broadcast.py -v -s -m distributed_2_gpus - pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py - pytest models/decoder_only/vision_language/test_models.py -v -s -m 'distributed(num_gpus=2)'
# this test fails consistently.
# TODO: investigate and fix
# - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/disagg_test.py
- label: Plugin Tests (2 GPUs) # 40min
working_dir: "/vllm-workspace/tests"
num_gpus: 2
fast_check: true
source_file_dependencies:
- vllm/plugins/
- tests/plugins/
commands:
# begin platform plugin tests, all the code in-between runs on dummy platform
- pip install -e ./plugins/vllm_add_dummy_platform
- pytest -v -s plugins_tests/test_platform_plugins.py
- pip uninstall vllm_add_dummy_platform -y
# end platform plugin tests
# other tests continue here:
- pip install -e ./plugins/vllm_add_dummy_model - pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s distributed/test_distributed_oot.py - pytest -v -s distributed/test_distributed_oot.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py - pytest -v -s entrypoints/openai/test_oot_registration.py # it needs a clean process
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s distributed/test_utils.py - pytest -v -s models/test_oot_registration.py # it needs a clean process
- label: Multi-step Tests (4 GPUs) # 36min - label: Multi-step Tests (4 GPUs) # 36min
working_dir: "/vllm-workspace/tests" working_dir: "/vllm-workspace/tests"
@@ -425,7 +532,9 @@ steps:
- vllm/engine - vllm/engine
- tests/multi_step - tests/multi_step
commands: commands:
- pytest -v -s multi_step/test_correctness_async_llm.py # this test is quite flaky
# TODO: investigate and fix.
# - pytest -v -s multi_step/test_correctness_async_llm.py
- pytest -v -s multi_step/test_correctness_llm.py - pytest -v -s multi_step/test_correctness_llm.py
- label: Pipeline Parallelism Test # 45min - label: Pipeline Parallelism Test # 45min
@@ -441,18 +550,23 @@ steps:
- pytest -v -s distributed/test_pp_cudagraph.py - pytest -v -s distributed/test_pp_cudagraph.py
- pytest -v -s distributed/test_pipeline_parallel.py - pytest -v -s distributed/test_pipeline_parallel.py
- label: LoRA Long Context (Distributed) # 11min - label: LoRA TP Test (Distributed)
# This test runs llama 13B, so it is required to run on 4 GPUs.
num_gpus: 4 num_gpus: 4
soft_fail: true
source_file_dependencies: source_file_dependencies:
- vllm/lora - vllm/lora
- tests/lora/test_long_context - tests/lora
commands: commands:
# FIXIT: find out which code initialize cuda before running the test # FIXIT: find out which code initialize cuda before running the test
# before the fix, we need to use spawn to test it # before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn - export VLLM_WORKER_MULTIPROC_METHOD=spawn
# This test runs llama 13B, so it is required to run on 4 GPUs.
- pytest -v -s -x lora/test_long_context.py - pytest -v -s -x lora/test_long_context.py
# There is some Tensor Parallelism related processing logic in LoRA that
# requires multi-GPU testing for validation.
- pytest -v -s -x lora/test_chatglm3_tp.py
- pytest -v -s -x lora/test_llama_tp.py
- pytest -v -s -x lora/test_minicpmv_tp.py
- label: Weight Loading Multiple GPU Test # 33min - label: Weight Loading Multiple GPU Test # 33min
working_dir: "/vllm-workspace/tests" working_dir: "/vllm-workspace/tests"
@@ -480,6 +594,7 @@ steps:
- label: Distributed Tests (A100) # optional - label: Distributed Tests (A100) # optional
gpu: a100 gpu: a100
optional: true
num_gpus: 4 num_gpus: 4
source_file_dependencies: source_file_dependencies:
- vllm/ - vllm/
@@ -487,11 +602,13 @@ steps:
# NOTE: don't test llama model here, it seems hf implementation is buggy # NOTE: don't test llama model here, it seems hf implementation is buggy
# see https://github.com/vllm-project/vllm/pull/5689 for details # see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py - pytest -v -s distributed/test_custom_all_reduce.py
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m distributed_2_gpus - torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s -x lora/test_mixtral.py - pytest -v -s -x lora/test_mixtral.py
- label: LM Eval Large Models # optional - label: LM Eval Large Models # optional
gpu: a100 gpu: a100
optional: true
num_gpus: 4 num_gpus: 4
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness" working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies: source_file_dependencies:

View File

@@ -0,0 +1,71 @@
#!/usr/bin/env bash
set -ex
# Assume wheels are in artifacts/dist/*.whl
wheel_files=(artifacts/dist/*.whl)
# Check that exactly one wheel is found
if [[ ${#wheel_files[@]} -ne 1 ]]; then
echo "Error: Expected exactly one wheel file in artifacts/dist/, but found ${#wheel_files[@]}"
exit 1
fi
# Get the single wheel file
wheel="${wheel_files[0]}"
# Rename 'linux' to 'manylinux1' in the wheel filename
new_wheel="${wheel/linux/manylinux1}"
mv -- "$wheel" "$new_wheel"
wheel="$new_wheel"
# Extract the version from the wheel
version=$(unzip -p "$wheel" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
echo "Version: $version"
normal_wheel="$wheel" # Save the original wheel filename
# If the version contains "dev", rename it to v1.0.0.dev for consistency
if [[ $version == *dev* ]]; then
suffix="${version##*.}"
if [[ $suffix == cu* ]]; then
new_version="1.0.0.dev+${suffix}"
else
new_version="1.0.0.dev"
fi
new_wheel="${wheel/$version/$new_version}"
# use cp to keep both files in the artifacts directory
cp -- "$wheel" "$new_wheel"
wheel="$new_wheel"
version="$new_version"
fi
# Upload the wheel to S3
python3 .buildkite/generate_index.py --wheel "$normal_wheel"
# generate index for this commit
aws s3 cp "$wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
if [[ $normal_wheel == *"cu118"* ]]; then
# if $normal_wheel matches cu118, do not upload the index.html
echo "Skipping index files for cu118 wheels"
else
# only upload index.html for cu12 wheels (default wheels)
aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
fi
# generate index for nightly
aws s3 cp "$wheel" "s3://vllm-wheels/nightly/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"
if [[ $normal_wheel == *"cu118"* ]]; then
# if $normal_wheel matches cu118, do not upload the index.html
echo "Skipping index files for cu118 wheels"
else
# only upload index.html for cu12 wheels (default wheels)
aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
fi
aws s3 cp "$wheel" "s3://vllm-wheels/$version/"

30
.github/CODEOWNERS vendored
View File

@@ -2,29 +2,35 @@
# for more info about CODEOWNERS file # for more info about CODEOWNERS file
# This lists cover the "core" components of vLLM that require careful review # This lists cover the "core" components of vLLM that require careful review
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill /vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/core @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill /vllm/core @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/engine/llm_engine.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill /vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/executor/executor_base.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill /vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/worker/worker_base.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill /vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/worker/worker.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill /vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/sampler.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill /vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
CMakeLists.txt @tlrmchlsmth @WoosukKwon /vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth
/vllm/model_executor/guided_decoding @mgoin
/vllm/multimodal @DarkLight1337 @ywang96
CMakeLists.txt @tlrmchlsmth
# vLLM V1
/vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat
# Test ownership # Test ownership
/tests/async_engine @njhill @robertgshaw2-neuralmagic @simon-mo /tests/async_engine @njhill @robertgshaw2-redhat @simon-mo
/tests/test_inputs.py @DarkLight1337 @ywang96 /tests/test_inputs.py @DarkLight1337 @ywang96
/tests/entrypoints @DarkLight1337 @robertgshaw2-neuralmagic @simon-mo /tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo
/tests/models @DarkLight1337 @ywang96 /tests/models @DarkLight1337 @ywang96
/tests/multimodal @DarkLight1337 @ywang96 /tests/multimodal @DarkLight1337 @ywang96
/tests/prefix_caching @comaniac @KuntaiDu /tests/prefix_caching @comaniac @KuntaiDu
/tests/spec_decode @njhill @LiuXiaoxuanPKU /tests/spec_decode @njhill @LiuXiaoxuanPKU
/tests/kernels @tlrmchlsmth @WoosukKwon /tests/kernels @tlrmchlsmth @WoosukKwon
/tests/quantization @mgoin @robertgshaw2-neuralmagic /tests/quantization @mgoin @robertgshaw2-redhat
/.buildkite/lm-eval-harness @mgoin @simon-mo /.buildkite/lm-eval-harness @mgoin @simon-mo
/tests/distributed/test_multi_node_assignment.py @youkaichao /tests/distributed/test_multi_node_assignment.py @youkaichao
/tests/distributed/test_pipeline_parallel.py @youkaichao /tests/distributed/test_pipeline_parallel.py @youkaichao
/tests/distributed/test_same_node.py @youkaichao /tests/distributed/test_same_node.py @youkaichao
/tests/multi_step @alexm-neuralmagic @comaniac /tests/multi_step @alexm-redhat @comaniac
/tests/weight_loading @mgoin @youkaichao /tests/weight_loading @mgoin @youkaichao
/tests/basic_correctness/test_chunked_prefill @rkooo567 @comaniac /tests/basic_correctness/test_chunked_prefill @rkooo567 @comaniac

2
.github/FUNDING.yml vendored
View File

@@ -1,2 +1,2 @@
github: [vllm-project] github: [vllm-project]
open_collective: [vllm] open_collective: vllm

View File

@@ -30,15 +30,6 @@ body:
</details> </details>
validations: validations:
required: true required: true
- type: textarea
attributes:
label: Model Input Dumps
description: |
If you are facing crashing due to illegal memory access or other issues with model execution, vLLM may dump the problematic input of the model. In this case, you will see the message `Error in model execution (input dumped to /tmp/err_xxx.pkl)`. If you see this message, please zip the file (because GitHub doesn't support .pkl file format) and upload it here. This will help us to reproduce the issue and facilitate the debugging process.
placeholder: |
Upload the dumped input file.
validations:
required: false
- type: textarea - type: textarea
attributes: attributes:
label: 🐛 Describe the bug label: 🐛 Describe the bug

View File

@@ -9,7 +9,7 @@ body:
value: > value: >
#### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+). #### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
#### We also highly recommend you read https://docs.vllm.ai/en/latest/models/adding_model.html first to understand how to add a new model. #### We also highly recommend you read https://docs.vllm.ai/en/latest/contributing/model/adding_model.html first to understand how to add a new model.
- type: textarea - type: textarea
attributes: attributes:
label: The model to consider. label: The model to consider.

View File

@@ -2,73 +2,4 @@ FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (*link existing issues this PR will resolve*) FIX #xxxx (*link existing issues this PR will resolve*)
**BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE** **BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html **
---
<details>
<!-- inside this <details> section, markdown rendering does not work, so we use raw html here. -->
<summary><b> PR Checklist (Click to Expand) </b></summary>
<p>Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.</p>
<h3>PR Title and Classification</h3>
<p>Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:</p>
<ul>
<li><code>[Bugfix]</code> for bug fixes.</li>
<li><code>[CI/Build]</code> for build or continuous integration improvements.</li>
<li><code>[Doc]</code> for documentation fixes and improvements.</li>
<li><code>[Model]</code> for adding a new model or improving an existing model. Model name should appear in the title.</li>
<li><code>[Frontend]</code> For changes on the vLLM frontend (e.g., OpenAI API server, <code>LLM</code> class, etc.) </li>
<li><code>[Kernel]</code> for changes affecting CUDA kernels or other compute kernels.</li>
<li><code>[Core]</code> for changes in the core vLLM logic (e.g., <code>LLMEngine</code>, <code>AsyncLLMEngine</code>, <code>Scheduler</code>, etc.)</li>
<li><code>[Hardware][Vendor]</code> for hardware-specific changes. Vendor name should appear in the prefix (e.g., <code>[Hardware][AMD]</code>).</li>
<li><code>[Misc]</code> for PRs that do not fit the above categories. Please use this sparingly.</li>
</ul>
<p><strong>Note:</strong> If the PR spans more than one category, please include all relevant prefixes.</p>
<h3>Code Quality</h3>
<p>The PR need to meet the following code quality standards:</p>
<ul>
<li>We adhere to <a href="https://google.github.io/styleguide/pyguide.html">Google Python style guide</a> and <a href="https://google.github.io/styleguide/cppguide.html">Google C++ style guide</a>.</li>
<li>Pass all linter checks. Please use <a href="https://github.com/vllm-project/vllm/blob/main/format.sh"><code>format.sh</code></a> to format your code.</li>
<li>The code need to be well-documented to ensure future contributors can easily understand the code.</li>
<li>Include sufficient tests to ensure the project to stay correct and robust. This includes both unit tests and integration tests.</li>
<li>Please add documentation to <code>docs/source/</code> if the PR modifies the user-facing behaviors of vLLM. It helps vLLM user understand and utilize the new features or changes.</li>
</ul>
<h3>Adding or changing kernels</h3>
<p>Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.</p>
<ul>
<li>Make sure custom ops are registered following PyTorch guidelines: <a href="https://pytorch.org/tutorials/advanced/cpp_custom_ops.html#cpp-custom-ops-tutorial">Custom C++ and CUDA Operators</a> and <a href="https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU">The Custom Operators Manual</a></li>
<li>Custom operations that return <code>Tensors</code> require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.</li>
<li>Use <a href="https://pytorch.org/docs/stable/library.html#torch.library.opcheck"><code>torch.libary.opcheck()</code></a> to test the function registration and meta-function for any registered ops. See <code>tests/kernels</code> for examples.</li>
<li>When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.</li>
<li>If a new custom type is needed, see the following document: <a href="https://docs.google.com/document/d/18fBMPuOJ0fY5ZQ6YyrHUppw9FA332CpNtgB6SOIgyuA">Custom Class Support in PT2</a>.
</ul>
<h3>Notes for Large Changes</h3>
<p>Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with <code>rfc-required</code> and might not go through the PR.</p>
<h3>What to Expect for the Reviews</h3>
<p>The goal of the vLLM team is to be a <i>transparent reviewing machine</i>. We would like to make the review process transparent and efficient and make sure no contributor feel confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process: </p>
<ul>
<li> After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.</li>
<li> After the PR is assigned, the reviewer will provide status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.</li>
<li> After the review, the reviewer will put an <code> action-required</code> label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.</li>
<li> Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.
</li>
</ul>
<h3>Thank You</h3>
<p> Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone! </p>
</details>

View File

@@ -5,3 +5,27 @@ updates:
directory: "/" directory: "/"
schedule: schedule:
interval: "weekly" interval: "weekly"
- package-ecosystem: "pip"
directory: "/"
schedule:
interval: "weekly"
labels: ["dependencies"]
open-pull-requests-limit: 5
reviewers: ["khluu", "simon-mo"]
allow:
- dependency-type: "all"
ignore:
- dependency-name: "*"
update-types: ["version-update:semver-patch"]
- dependency-name: "torch"
- dependency-name: "torchvision"
- dependency-name: "xformers"
- dependency-name: "lm-format-enforcer"
- dependency-name: "gguf"
- dependency-name: "compressed-tensors"
- dependency-name: "ray[adag]"
- dependency-name: "lm-eval"
groups:
minor-update:
applies-to: version-updates
update-types: ["minor"]

97
.github/mergify.yml vendored Normal file
View File

@@ -0,0 +1,97 @@
pull_request_rules:
- name: label-documentation
description: Automatically apply documentation label
conditions:
- or:
- files~=^[^/]+\.md$
- files~=^docs/
actions:
label:
add:
- documentation
- name: label-ci-build
description: Automatically apply ci/build label
conditions:
- or:
- files~=^\.github/
- files~=\.buildkite/
- files~=^cmake/
- files=CMakeLists.txt
- files~=^Dockerfile
- files~=^requirements.*\.txt
- files=setup.py
actions:
label:
add:
- ci/build
- name: label-frontend
description: Automatically apply frontend label
conditions:
- files~=^vllm/entrypoints/
actions:
label:
add:
- frontend
- name: label-structured-output
description: Automatically apply structured-output label
conditions:
- or:
- files~=^vllm/model_executor/guided_decoding/
- files=tests/model_executor/test_guided_processors.py
- files=tests/entrypoints/llm/test_guided_generate.py
- files=benchmarks/benchmark_serving_guided.py
- files=benchmarks/benchmark_guided.py
actions:
label:
add:
- structured-output
- name: label-speculative-decoding
description: Automatically apply speculative-decoding label
conditions:
- or:
- files~=^vllm/spec_decode/
- files=vllm/model_executor/layers/spec_decode_base_sampler.py
- files~=^tests/spec_decode/
actions:
label:
add:
- speculative-decoding
- name: label-v1
description: Automatically apply v1 label
conditions:
- or:
- files~=^vllm/v1/
- files~=^tests/v1/
actions:
label:
add:
- v1
- name: ping author on conflicts and add 'needs-rebase' label
conditions:
- conflict
- -closed
actions:
label:
add:
- needs-rebase
comment:
message: |
This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @{{author}}.
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
- name: remove 'needs-rebase' label when conflict is resolved
conditions:
- -conflict
- -closed
actions:
label:
remove:
- needs-rebase

50
.github/scripts/cleanup_pr_body.sh vendored Executable file
View File

@@ -0,0 +1,50 @@
#!/bin/bash
set -eu
# ensure 1 argument is passed
if [ "$#" -ne 1 ]; then
echo "Usage: $0 <pr_number>"
exit 1
fi
PR_NUMBER=$1
OLD=/tmp/orig_pr_body.txt
NEW=/tmp/new_pr_body.txt
gh pr view --json body --template "{{.body}}" "${PR_NUMBER}" > "${OLD}"
cp "${OLD}" "${NEW}"
# Remove "FIX #xxxx (*link existing issues this PR will resolve*)"
sed -i '/FIX #xxxx.*$/d' "${NEW}"
# Remove "FILL IN THE PR DESCRIPTION HERE"
sed -i '/FILL IN THE PR DESCRIPTION HERE/d' "${NEW}"
# Remove all lines after and including "**BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE**"
sed -i '/\*\*BEFORE SUBMITTING, PLEASE READ.*\*\*/,$d' "${NEW}"
# Remove HTML <details> section that includes <summary> text of "PR Checklist (Click to Expand)"
python3 - <<EOF
import re
with open("${NEW}", "r") as file:
content = file.read()
pattern = re.compile(r'(---\n\n)?<details>.*?<summary>.*?PR Checklist \(Click to Expand\).*?</summary>.*?</details>', re.DOTALL)
content = re.sub(pattern, '', content)
with open("${NEW}", "w") as file:
file.write(content)
EOF
# Run this only if ${NEW} is different than ${OLD}
if ! cmp -s "${OLD}" "${NEW}"; then
gh pr edit --body-file "${NEW}" "${PR_NUMBER}"
echo
echo "Updated PR body:"
echo
cat "${NEW}"
else
echo "No changes needed"
fi

View File

@@ -1,37 +0,0 @@
name: Lint GitHub Actions workflows
on:
push:
branches:
- "main"
paths:
- '.github/workflows/*.ya?ml'
- '.github/workflows/actionlint.*'
pull_request:
branches:
- "main"
paths:
- '.github/workflows/*.ya?ml'
- '.github/workflows/actionlint.*'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
actionlint:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@eef61447b9ff4aafe5dcd4e0bbf5d482be7e7871 # v4.2.1
with:
fetch-depth: 0
- name: "Run actionlint"
run: |
tools/actionlint.sh -color

View File

@@ -8,7 +8,7 @@ jobs:
runs-on: ubuntu-latest runs-on: ubuntu-latest
steps: steps:
- name: Add label - name: Add label
uses: actions/github-script@v7 uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
with: with:
script: | script: |
github.rest.issues.addLabels({ github.rest.issues.addLabels({

View File

@@ -1,41 +0,0 @@
name: clang-format
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
pull_request:
branches:
- main
jobs:
clang-format:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install clang-format==18.1.5
- name: Running clang-format
run: |
EXCLUDES=(
'csrc/moe/topk_softmax_kernels.cu'
'csrc/quantization/gguf/ggml-common.h'
'csrc/quantization/gguf/dequantize.cuh'
'csrc/quantization/gguf/vecdotq.cuh'
'csrc/quantization/gguf/mmq.cuh'
'csrc/quantization/gguf/mmvq.cuh'
)
find csrc/ \( -name '*.h' -o -name '*.cpp' -o -name '*.cu' -o -name '*.cuh' \) -print \
| grep -vFf <(printf "%s\n" "${EXCLUDES[@]}") \
| xargs clang-format --dry-run --Werror

26
.github/workflows/cleanup_pr_body.yml vendored Normal file
View File

@@ -0,0 +1,26 @@
name: Cleanup PR Body
on:
pull_request_target:
types: [opened, reopened, edited]
permissions:
pull-requests: write
jobs:
update-description:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: '3.12'
- name: Update PR description
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: .github/scripts/cleanup_pr_body.sh "${{ github.event.number }}"

82
.github/workflows/lint-and-deploy.yaml vendored Normal file
View File

@@ -0,0 +1,82 @@
name: Lint and Deploy Charts
on: pull_request
jobs:
lint-and-deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Set up Helm
uses: azure/setup-helm@fe7b79cd5ee1e45176fcad797de68ecaf3ca4814 # v4.2.0
with:
version: v3.14.4
#Python is required because ct lint runs Yamale and yamllint which require Python.
- uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: '3.13'
- name: Set up chart-testing
uses: helm/chart-testing-action@e6669bcd63d7cb57cb4380c33043eebe5d111992 # v2.6.1
with:
version: v3.10.1
- name: Run chart-testing (lint)
run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/online_serving/chart-helm --charts examples/online_serving/chart-helm
- name: Setup minio
run: |
docker network create vllm-net
docker run -d -p 9000:9000 --name minio --net vllm-net \
-e "MINIO_ACCESS_KEY=minioadmin" \
-e "MINIO_SECRET_KEY=minioadmin" \
-v /tmp/data:/data \
-v /tmp/config:/root/.minio \
minio/minio server /data
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_EC2_METADATA_DISABLED=true
mkdir opt-125m
cd opt-125m && curl -O -Ls "https://huggingface.co/facebook/opt-125m/resolve/main/{pytorch_model.bin,config.json,generation_config.json,merges.txt,special_tokens_map.json,tokenizer_config.json,vocab.json}" && cd ..
aws --endpoint-url http://127.0.0.1:9000/ s3 mb s3://testbucket
aws --endpoint-url http://127.0.0.1:9000/ s3 cp opt-125m/ s3://testbucket/opt-125m --recursive
- name: Create kind cluster
uses: helm/kind-action@0025e74a8c7512023d06dc019c617aa3cf561fde # v1.10.0
- name: Build the Docker image vllm cpu
run: docker buildx build -f Dockerfile.cpu -t vllm-cpu-env .
- name: Configuration of docker images, network and namespace for the kind cluster
run: |
docker pull amazon/aws-cli:2.6.4
kind load docker-image amazon/aws-cli:2.6.4 --name chart-testing
kind load docker-image vllm-cpu-env:latest --name chart-testing
docker network connect vllm-net "$(docker ps -aqf "name=chart-testing-control-plane")"
kubectl create ns ns-vllm
- name: Run chart-testing (install)
run: |
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
sleep 30 && kubectl -n ns-vllm logs -f "$(kubectl -n ns-vllm get pods | awk '/deployment/ {print $1;exit}')" &
helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/online_serving/chart-helm -f examples/online_serving/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"
- name: curl test
run: |
kubectl -n ns-vllm port-forward service/test-vllm-service 8001:80 &
sleep 10
CODE="$(curl -v -f --location http://localhost:8001/v1/completions \
--header "Content-Type: application/json" \
--data '{
"model": "opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'):$CODE"
echo "$CODE"

16
.github/workflows/matchers/mypy.json vendored Normal file
View File

@@ -0,0 +1,16 @@
{
"problemMatcher": [
{
"owner": "mypy",
"pattern": [
{
"regexp": "^(.+):(\\d+):\\s(error|warning):\\s(.+)$",
"file": 1,
"line": 2,
"severity": 3,
"message": 4
}
]
}
]
}

View File

@@ -1,35 +0,0 @@
name: mypy
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
pull_request:
branches:
- main
jobs:
mypy:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install mypy==1.11.1
pip install types-setuptools
pip install types-PyYAML
pip install types-requests
pip install types-setuptools
- name: Mypy
run: |
tools/mypy.sh

19
.github/workflows/pre-commit.yml vendored Normal file
View File

@@ -0,0 +1,19 @@
name: pre-commit
on:
pull_request:
push:
branches: [main]
jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: "3.12"
- run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
- uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
with:
extra_args: --all-files --hook-stage manual

View File

@@ -21,7 +21,7 @@ jobs:
upload_url: ${{ steps.create_release.outputs.upload_url }} upload_url: ${{ steps.create_release.outputs.upload_url }}
steps: steps:
- name: Checkout - name: Checkout
uses: actions/checkout@v4 uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Extract branch info - name: Extract branch info
shell: bash shell: bash
@@ -30,7 +30,7 @@ jobs:
- name: Create Release - name: Create Release
id: create_release id: create_release
uses: "actions/github-script@v7" uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
env: env:
RELEASE_TAG: ${{ env.release_tag }} RELEASE_TAG: ${{ env.release_tag }}
with: with:
@@ -39,67 +39,68 @@ jobs:
const script = require('.github/workflows/scripts/create_release.js') const script = require('.github/workflows/scripts/create_release.js')
await script(github, context, core) await script(github, context, core)
wheel: # NOTE(simon): No longer build wheel using Github Actions. See buildkite's release workflow.
name: Build Wheel # wheel:
runs-on: ${{ matrix.os }} # name: Build Wheel
needs: release # runs-on: ${{ matrix.os }}
# needs: release
strategy: # strategy:
fail-fast: false # fail-fast: false
matrix: # matrix:
os: ['ubuntu-20.04'] # os: ['ubuntu-20.04']
python-version: ['3.8', '3.9', '3.10', '3.11', '3.12'] # python-version: ['3.9', '3.10', '3.11', '3.12']
pytorch-version: ['2.4.0'] # Must be the most recent version that meets requirements-cuda.txt. # pytorch-version: ['2.4.0'] # Must be the most recent version that meets requirements-cuda.txt.
cuda-version: ['11.8', '12.1'] # cuda-version: ['11.8', '12.1']
steps: # steps:
- name: Checkout # - name: Checkout
uses: actions/checkout@v4 # uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Setup ccache # - name: Setup ccache
uses: hendrikmuhs/ccache-action@v1.2 # uses: hendrikmuhs/ccache-action@ed74d11c0b343532753ecead8a951bb09bb34bc9 # v1.2.14
with: # with:
create-symlink: true # create-symlink: true
key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }} # key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }}
- name: Set up Linux Env # - name: Set up Linux Env
if: ${{ runner.os == 'Linux' }} # if: ${{ runner.os == 'Linux' }}
run: | # run: |
bash -x .github/workflows/scripts/env.sh # bash -x .github/workflows/scripts/env.sh
- name: Set up Python # - name: Set up Python
uses: actions/setup-python@v5 # uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with: # with:
python-version: ${{ matrix.python-version }} # python-version: ${{ matrix.python-version }}
- name: Install CUDA ${{ matrix.cuda-version }} # - name: Install CUDA ${{ matrix.cuda-version }}
run: | # run: |
bash -x .github/workflows/scripts/cuda-install.sh ${{ matrix.cuda-version }} ${{ matrix.os }} # bash -x .github/workflows/scripts/cuda-install.sh ${{ matrix.cuda-version }} ${{ matrix.os }}
- name: Install PyTorch ${{ matrix.pytorch-version }} with CUDA ${{ matrix.cuda-version }} # - name: Install PyTorch ${{ matrix.pytorch-version }} with CUDA ${{ matrix.cuda-version }}
run: | # run: |
bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }} # bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }}
- name: Build wheel # - name: Build wheel
shell: bash # shell: bash
env: # env:
CMAKE_BUILD_TYPE: Release # do not compile with debug symbol to reduce wheel size # CMAKE_BUILD_TYPE: Release # do not compile with debug symbol to reduce wheel size
run: | # run: |
bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }} # bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
wheel_name=$(find dist -name "*whl" -print0 | xargs -0 -n 1 basename) # wheel_name=$(find dist -name "*whl" -print0 | xargs -0 -n 1 basename)
asset_name=${wheel_name//"linux"/"manylinux1"} # asset_name=${wheel_name//"linux"/"manylinux1"}
echo "wheel_name=${wheel_name}" >> "$GITHUB_ENV" # echo "wheel_name=${wheel_name}" >> "$GITHUB_ENV"
echo "asset_name=${asset_name}" >> "$GITHUB_ENV" # echo "asset_name=${asset_name}" >> "$GITHUB_ENV"
- name: Upload Release Asset # - name: Upload Release Asset
uses: actions/upload-release-asset@v1 # uses: actions/upload-release-asset@e8f9f06c4b078e705bd2ea027f0926603fc9b4d5 # v1.0.2
env: # env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with: # with:
upload_url: ${{ needs.release.outputs.upload_url }} # upload_url: ${{ needs.release.outputs.upload_url }}
asset_path: ./dist/${{ env.wheel_name }} # asset_path: ./dist/${{ env.wheel_name }}
asset_name: ${{ env.asset_name }} # asset_name: ${{ env.asset_name }}
asset_content_type: application/* # asset_content_type: application/*
# (Danielkinz): This last step will publish the .whl to pypi. Warning: untested # (Danielkinz): This last step will publish the .whl to pypi. Warning: untested
# - name: Publish package # - name: Publish package

View File

@@ -2,20 +2,24 @@ name: PR Reminder Comment Bot
on: on:
pull_request_target: pull_request_target:
types: [opened] types: [opened]
jobs: jobs:
pr_reminder: pr_reminder:
runs-on: ubuntu-latest runs-on: ubuntu-latest
steps: steps:
- name: Remind to run full CI on PR - name: Remind to run full CI on PR
uses: actions/github-script@v7 uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea # v7.0.1
with: with:
script: | script: |
github.rest.issues.createComment({ github.rest.issues.createComment({
owner: context.repo.owner, owner: context.repo.owner,
repo: context.repo.repo, repo: context.repo.repo,
issue_number: context.issue.number, issue_number: context.issue.number,
body: '👋 Hi! Thank you for contributing to the vLLM project.\n Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your `fastcheck` build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping `simon-mo` or `khluu` to add you in our Buildkite org. \n\nOnce the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.\n\n To run CI, PR reviewers can do one of these:\n- Add `ready` label to the PR\n- Enable auto-merge.\n\n🚀' body: '👋 Hi! Thank you for contributing to the vLLM project.\n\n' +
'💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.\n\n' +
'Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your `fastcheck` build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping `simon-mo` or `khluu` to add you in our Buildkite org.\n\n' +
'Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.\n\n' +
'To run CI, PR reviewers can either: Add `ready` label to the PR or enable auto-merge.\n\n' +
'🚀'
}) })
env: env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -1,37 +0,0 @@
name: ruff
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
pull_request:
branches:
- main
jobs:
ruff:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements-lint.txt
- name: Analysing the code with ruff
run: |
ruff check .
- name: Spelling check with codespell
run: |
codespell --toml pyproject.toml
- name: Run isort
run: |
isort . --check-only

View File

@@ -1,16 +1,16 @@
#!/bin/bash #!/bin/bash
# Replace '.' with '-' ex: 11.8 -> 11-8 # Replace '.' with '-' ex: 11.8 -> 11-8
cuda_version=$(echo $1 | tr "." "-") cuda_version=$(echo "$1" | tr "." "-")
# Removes '-' and '.' ex: ubuntu-20.04 -> ubuntu2004 # Removes '-' and '.' ex: ubuntu-20.04 -> ubuntu2004
OS=$(echo $2 | tr -d ".\-") OS=$(echo "$2" | tr -d ".\-")
# Installs CUDA # Installs CUDA
wget -nv https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/cuda-keyring_1.1-1_all.deb wget -nv "https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/cuda-keyring_1.1-1_all.deb"
sudo dpkg -i cuda-keyring_1.1-1_all.deb sudo dpkg -i cuda-keyring_1.1-1_all.deb
rm cuda-keyring_1.1-1_all.deb rm cuda-keyring_1.1-1_all.deb
sudo apt -qq update sudo apt -qq update
sudo apt -y install cuda-${cuda_version} cuda-nvcc-${cuda_version} cuda-libraries-dev-${cuda_version} sudo apt -y install "cuda-${cuda_version}" "cuda-nvcc-${cuda_version}" "cuda-libraries-dev-${cuda_version}"
sudo apt clean sudo apt clean
# Test nvcc # Test nvcc

View File

@@ -6,7 +6,7 @@ cuda_version=$3
# Install torch # Install torch
$python_executable -m pip install numpy pyyaml scipy ipython mkl mkl-include ninja cython typing pandas typing-extensions dataclasses setuptools && conda clean -ya $python_executable -m pip install numpy pyyaml scipy ipython mkl mkl-include ninja cython typing pandas typing-extensions dataclasses setuptools && conda clean -ya
$python_executable -m pip install torch==${pytorch_version}+cu${cuda_version//./} --extra-index-url https://download.pytorch.org/whl/cu${cuda_version//./} $python_executable -m pip install torch=="${pytorch_version}+cu${cuda_version//./}" --extra-index-url "https://download.pytorch.org/whl/cu${cuda_version//./}"
# Print version information # Print version information
$python_executable --version $python_executable --version

52
.github/workflows/stale.yml vendored Normal file
View File

@@ -0,0 +1,52 @@
name: 'Close inactive issues and PRs'
on:
schedule:
# Daily at 1:30 AM UTC
- cron: '30 1 * * *'
jobs:
close-issues-and-pull-requests:
permissions:
issues: write
pull-requests: write
actions: write
runs-on: ubuntu-latest
steps:
- uses: actions/stale@28ca1036281a5e5922ead5184a1bbf96e5fc984e # v9.0.0
with:
# Increasing this value ensures that changes to this workflow
# propagate to all issues and PRs in days rather than months
operations-per-run: 1000
exempt-draft-pr: true
exempt-issue-labels: 'keep-open'
exempt-pr-labels: 'keep-open'
labels-to-add-when-unstale: 'unstale'
labels-to-remove-when-stale: 'unstale'
days-before-issue-stale: 90
days-before-issue-close: 30
stale-issue-label: 'stale'
stale-issue-message: >
This issue has been automatically marked as stale because it has not
had any activity within 90 days. It will be automatically closed if no
further activity occurs within 30 days. Leave a comment if
you feel this issue should remain open. Thank you!
close-issue-message: >
This issue has been automatically closed due to inactivity. Please
feel free to reopen if you feel it is still relevant. Thank you!
days-before-pr-stale: 90
days-before-pr-close: 30
stale-pr-label: 'stale'
stale-pr-message: >
This pull request has been automatically marked as stale because it
has not had any activity within 90 days. It will be automatically
closed if no further activity occurs within 30 days. Leave a comment
if you feel this pull request should remain open. Thank you!
close-pr-message: >
This pull request has been automatically closed due to inactivity.
Please feel free to reopen if you intend to continue working on it.
Thank you!

View File

@@ -1,31 +0,0 @@
name: yapf
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
pull_request:
branches:
- main
jobs:
yapf:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install yapf==0.32.0
pip install toml==0.10.2
- name: Running yapf
run: |
yapf --diff --recursive .

4
.gitignore vendored
View File

@@ -79,8 +79,7 @@ instance/
# Sphinx documentation # Sphinx documentation
docs/_build/ docs/_build/
docs/source/getting_started/examples/*.rst docs/source/getting_started/examples/
!**/*.template.rst
# PyBuilder # PyBuilder
.pybuilder/ .pybuilder/
@@ -202,3 +201,4 @@ benchmarks/*.json
# Linting # Linting
actionlint actionlint
shellcheck*/

110
.pre-commit-config.yaml Normal file
View File

@@ -0,0 +1,110 @@
default_stages:
- pre-commit # Run locally
- manual # Run in CI
repos:
- repo: https://github.com/google/yapf
rev: v0.43.0
hooks:
- id: yapf
args: [--in-place, --verbose]
additional_dependencies: [toml] # TODO: Remove when yapf is upgraded
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.9.3
hooks:
- id: ruff
args: [--output-format, github]
- repo: https://github.com/codespell-project/codespell
rev: v2.4.0
hooks:
- id: codespell
exclude: 'benchmarks/sonnet.txt|(build|tests/(lora/data|models/fixtures|prompts))/.*'
- repo: https://github.com/PyCQA/isort
rev: 5.13.2
hooks:
- id: isort
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v19.1.7
hooks:
- id: clang-format
exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))'
types_or: [c++, cuda]
args: [--style=file, --verbose]
- repo: https://github.com/jackdewinter/pymarkdown
rev: v0.9.27
hooks:
- id: pymarkdown
files: docs/.*
- repo: https://github.com/rhysd/actionlint
rev: v1.7.7
hooks:
- id: actionlint
- repo: local
hooks:
- id: mypy-local
name: Run mypy for local Python installation
entry: tools/mypy.sh 0 "local"
language: python
types: [python]
additional_dependencies: &mypy_deps [mypy==1.11.1, types-setuptools, types-PyYAML, types-requests]
stages: [pre-commit] # Don't run in CI
- id: mypy-3.9 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.9
entry: tools/mypy.sh 1 "3.9"
language: python
types: [python]
additional_dependencies: *mypy_deps
stages: [manual] # Only run in CI
- id: mypy-3.10 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.10
entry: tools/mypy.sh 1 "3.10"
language: python
types: [python]
additional_dependencies: *mypy_deps
stages: [manual] # Only run in CI
- id: mypy-3.11 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.11
entry: tools/mypy.sh 1 "3.11"
language: python
types: [python]
additional_dependencies: *mypy_deps
stages: [manual] # Only run in CI
- id: mypy-3.12 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.12
entry: tools/mypy.sh 1 "3.12"
language: python
types: [python]
additional_dependencies: *mypy_deps
stages: [manual] # Only run in CI
- id: shellcheck
name: Lint shell scripts
entry: tools/shellcheck.sh
language: script
types: [shell]
- id: png-lint
name: Lint PNG exports from excalidraw
entry: tools/png-lint.sh
language: script
types: [png]
- id: signoff-commit
name: Sign-off Commit
entry: bash
args:
- -c
- |
if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" .git/COMMIT_EDITMSG; then
printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> .git/COMMIT_EDITMSG
fi
language: system
verbose: true
stages: [commit-msg]
- id: check-spdx-header
name: Check SPDX headers
entry: python tools/check_spdx_header.py
language: python
types: [python]
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false

View File

@@ -6,17 +6,16 @@ version: 2
build: build:
os: ubuntu-22.04 os: ubuntu-22.04
tools: tools:
python: "3.8" python: "3.12"
sphinx: sphinx:
configuration: docs/source/conf.py configuration: docs/source/conf.py
fail_on_warning: true fail_on_warning: true
# If using Sphinx, optionally build your docs in additional formats such as PDF # If using Sphinx, optionally build your docs in additional formats such as PDF
formats: [] formats: []
# Optionally declare the Python requirements required to build your docs # Optionally declare the Python requirements required to build your docs
python: python:
install: install:
- requirements: docs/requirements-docs.txt - requirements: docs/requirements-docs.txt

9
.shellcheckrc Normal file
View File

@@ -0,0 +1,9 @@
# rules currently disabled:
#
# SC1091 (info): Not following: <sourced file> was not specified as input (see shellcheck -x)
# SC2004 (style): $/${} is unnecessary on arithmetic variables.
# SC2129 (style): Consider using { cmd1; cmd2; } >> file instead of individual redirects.
# SC2155 (warning): Declare and assign separately to avoid masking return values.
# SC2164 (warning): Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
#
disable=SC1091,SC2004,SC2129,SC2155,SC2164

194
CMakeLists.txt Normal file → Executable file
View File

@@ -24,20 +24,17 @@ include(${CMAKE_CURRENT_LIST_DIR}/cmake/utils.cmake)
# Suppress potential warnings about unused manually-specified variables # Suppress potential warnings about unused manually-specified variables
set(ignoreMe "${VLLM_PYTHON_PATH}") set(ignoreMe "${VLLM_PYTHON_PATH}")
# Prevent installation of dependencies (cutlass) by default.
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" ALL_COMPONENTS)
# #
# Supported python versions. These versions will be searched in order, the # Supported python versions. These versions will be searched in order, the
# first match will be selected. These should be kept in sync with setup.py. # first match will be selected. These should be kept in sync with setup.py.
# #
set(PYTHON_SUPPORTED_VERSIONS "3.8" "3.9" "3.10" "3.11" "3.12") set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11" "3.12")
# Supported NVIDIA architectures. # Supported NVIDIA architectures.
set(CUDA_SUPPORTED_ARCHS "7.0;7.5;8.0;8.6;8.9;9.0") set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0")
# Supported AMD GPU architectures. # Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100") set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101")
# #
# Supported/expected torch versions for CUDA/ROCm. # Supported/expected torch versions for CUDA/ROCm.
@@ -49,8 +46,8 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx11
# requirements.txt files and should be kept consistent. The ROCm torch # requirements.txt files and should be kept consistent. The ROCm torch
# versions are derived from Dockerfile.rocm # versions are derived from Dockerfile.rocm
# #
set(TORCH_SUPPORTED_VERSION_CUDA "2.4.0") set(TORCH_SUPPORTED_VERSION_CUDA "2.5.1")
set(TORCH_SUPPORTED_VERSION_ROCM "2.5.0") set(TORCH_SUPPORTED_VERSION_ROCM "2.5.1")
# #
# Try to find python package with an executable that exactly matches # Try to find python package with an executable that exactly matches
@@ -83,24 +80,6 @@ endif()
# #
find_package(Torch REQUIRED) find_package(Torch REQUIRED)
#
message(STATUS "Enabling core extension.")
# Define _core_C extension
# built for (almost) every target platform, (excludes TPU and Neuron)
set(VLLM_EXT_SRC
"csrc/core/torch_bindings.cpp")
define_gpu_extension_target(
_core_C
DESTINATION vllm
LANGUAGE CXX
SOURCES ${VLLM_EXT_SRC}
COMPILE_FLAGS ${CXX_COMPILE_FLAGS}
USE_SABI 3
WITH_SOABI)
# #
# Forward the non-CUDA device extensions to external CMake scripts. # Forward the non-CUDA device extensions to external CMake scripts.
# #
@@ -187,33 +166,61 @@ endif()
# #
# Use FetchContent for C++ dependencies that are compiled as part of vLLM's build process. # Use FetchContent for C++ dependencies that are compiled as part of vLLM's build process.
# Configure it to place files in vllm/.deps, in order to play nicely with sccache. # setup.py will override FETCHCONTENT_BASE_DIR to play nicely with sccache.
# Each dependency that produces build artifacts should override its BINARY_DIR to avoid
# conflicts between build types. It should instead be set to ${CMAKE_BINARY_DIR}/<dependency>.
# #
include(FetchContent) include(FetchContent)
get_filename_component(PROJECT_ROOT_DIR "${CMAKE_CURRENT_SOURCE_DIR}" ABSOLUTE) file(MAKE_DIRECTORY ${FETCHCONTENT_BASE_DIR}) # Ensure the directory exists
file(MAKE_DIRECTORY "${FETCHCONTENT_BASE_DIR}")
set(FETCHCONTENT_BASE_DIR "${PROJECT_ROOT_DIR}/.deps")
message(STATUS "FetchContent base directory: ${FETCHCONTENT_BASE_DIR}") message(STATUS "FetchContent base directory: ${FETCHCONTENT_BASE_DIR}")
# #
# Define other extension targets # Define other extension targets
# #
#
# cumem_allocator extension
#
set(VLLM_CUMEM_EXT_SRC
"csrc/cumem_allocator.cpp")
set_gencode_flags_for_srcs(
SRCS "${VLLM_CUMEM_EXT_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}")
if(VLLM_GPU_LANG STREQUAL "CUDA")
message(STATUS "Enabling cumem allocator extension.")
# link against cuda driver library
list(APPEND CUMEM_LIBS cuda)
define_gpu_extension_target(
cumem_allocator
DESTINATION vllm
LANGUAGE CXX
SOURCES ${VLLM_CUMEM_EXT_SRC}
LIBRARIES ${CUMEM_LIBS}
USE_SABI 3.8
WITH_SOABI)
endif()
# #
# _C extension # _C extension
# #
set(VLLM_EXT_SRC set(VLLM_EXT_SRC
"csrc/cache_kernels.cu" "csrc/cache_kernels.cu"
"csrc/attention/attention_kernels.cu" "csrc/attention/paged_attention_v1.cu"
"csrc/attention/paged_attention_v2.cu"
"csrc/pos_encoding_kernels.cu" "csrc/pos_encoding_kernels.cu"
"csrc/activation_kernels.cu" "csrc/activation_kernels.cu"
"csrc/layernorm_kernels.cu" "csrc/layernorm_kernels.cu"
"csrc/layernorm_quant_kernels.cu"
"csrc/quantization/gptq/q_gemm.cu" "csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu" "csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
"csrc/quantization/fp8/common.cu" "csrc/quantization/fp8/common.cu"
"csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu"
"csrc/quantization/gguf/gguf_kernel.cu"
"csrc/cuda_utils_kernels.cu" "csrc/cuda_utils_kernels.cu"
"csrc/moe_align_block_size_kernels.cu"
"csrc/prepare_inputs/advance_step.cu" "csrc/prepare_inputs/advance_step.cu"
"csrc/torch_bindings.cpp") "csrc/torch_bindings.cpp")
@@ -221,19 +228,32 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
SET(CUTLASS_ENABLE_HEADERS_ONLY ON CACHE BOOL "Enable only the header library") SET(CUTLASS_ENABLE_HEADERS_ONLY ON CACHE BOOL "Enable only the header library")
# Set CUTLASS_REVISION manually -- its revision detection doesn't work in this case. # Set CUTLASS_REVISION manually -- its revision detection doesn't work in this case.
set(CUTLASS_REVISION "v3.5.1" CACHE STRING "CUTLASS revision to use") set(CUTLASS_REVISION "v3.6.0" CACHE STRING "CUTLASS revision to use")
FetchContent_Declare( # Use the specified CUTLASS source directory for compilation if VLLM_CUTLASS_SRC_DIR is provided
if (DEFINED ENV{VLLM_CUTLASS_SRC_DIR})
set(VLLM_CUTLASS_SRC_DIR $ENV{VLLM_CUTLASS_SRC_DIR})
endif()
if(VLLM_CUTLASS_SRC_DIR)
if(NOT IS_ABSOLUTE VLLM_CUTLASS_SRC_DIR)
get_filename_component(VLLM_CUTLASS_SRC_DIR "${VLLM_CUTLASS_SRC_DIR}" ABSOLUTE)
endif()
message(STATUS "The VLLM_CUTLASS_SRC_DIR is set, using ${VLLM_CUTLASS_SRC_DIR} for compilation")
FetchContent_Declare(cutlass SOURCE_DIR ${VLLM_CUTLASS_SRC_DIR})
else()
FetchContent_Declare(
cutlass cutlass
GIT_REPOSITORY https://github.com/nvidia/cutlass.git GIT_REPOSITORY https://github.com/nvidia/cutlass.git
GIT_TAG v3.5.1 GIT_TAG v3.7.0
GIT_PROGRESS TRUE GIT_PROGRESS TRUE
# Speed up CUTLASS download by retrieving only the specified GIT_TAG instead of the history. # Speed up CUTLASS download by retrieving only the specified GIT_TAG instead of the history.
# Important: If GIT_SHALLOW is enabled then GIT_TAG works only with branch names and tags. # Important: If GIT_SHALLOW is enabled then GIT_TAG works only with branch names and tags.
# So if the GIT_TAG above is updated to a commit hash, GIT_SHALLOW must be set to FALSE # So if the GIT_TAG above is updated to a commit hash, GIT_SHALLOW must be set to FALSE
GIT_SHALLOW TRUE GIT_SHALLOW TRUE
) )
endif()
FetchContent_MakeAvailable(cutlass) FetchContent_MakeAvailable(cutlass)
list(APPEND VLLM_EXT_SRC list(APPEND VLLM_EXT_SRC
@@ -241,10 +261,12 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"csrc/mamba/causal_conv1d/causal_conv1d.cu" "csrc/mamba/causal_conv1d/causal_conv1d.cu"
"csrc/quantization/aqlm/gemm_kernels.cu" "csrc/quantization/aqlm/gemm_kernels.cu"
"csrc/quantization/awq/gemm_kernels.cu" "csrc/quantization/awq/gemm_kernels.cu"
"csrc/quantization/gguf/gguf_kernel.cu"
"csrc/custom_all_reduce.cu" "csrc/custom_all_reduce.cu"
"csrc/permute_cols.cu" "csrc/permute_cols.cu"
"csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu") "csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_entry.cu"
"csrc/sparse/cutlass/sparse_compressor_entry.cu"
"csrc/cutlass_extensions/common.cpp")
set_gencode_flags_for_srcs( set_gencode_flags_for_srcs(
SRCS "${VLLM_EXT_SRC}" SRCS "${VLLM_EXT_SRC}"
@@ -253,7 +275,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# Only build Marlin kernels if we are building for at least some compatible archs. # Only build Marlin kernels if we are building for at least some compatible archs.
# Keep building Marlin for 9.0 as there are some group sizes and shapes that # Keep building Marlin for 9.0 as there are some group sizes and shapes that
# are not supported by Machete yet. # are not supported by Machete yet.
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.9;9.0" ${CUDA_ARCHS}) cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
if (MARLIN_ARCHS) if (MARLIN_ARCHS)
set(MARLIN_SRCS set(MARLIN_SRCS
"csrc/quantization/fp8/fp8_marlin.cu" "csrc/quantization/fp8/fp8_marlin.cu"
@@ -270,15 +292,19 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
message(STATUS "Building Marlin kernels for archs: ${MARLIN_ARCHS}") message(STATUS "Building Marlin kernels for archs: ${MARLIN_ARCHS}")
else() else()
message(STATUS "Not building Marlin kernels as no compatible archs found" message(STATUS "Not building Marlin kernels as no compatible archs found"
"in CUDA target architectures") " in CUDA target architectures")
endif() endif()
#
# The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require # The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
# CUDA 12.0 or later (and only work on Hopper, 9.0/9.0a for now). # CUDA 12.0 or later (and only work on Hopper, 9.0a for now).
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0;9.0a" "${CUDA_ARCHS}") cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS) if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu") set(SRCS
"csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_fp8.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_int8.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_azp_sm90_int8.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8.cu")
set_gencode_flags_for_srcs( set_gencode_flags_for_srcs(
SRCS "${SRCS}" SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_3X_ARCHS}") CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
@@ -305,7 +331,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# For the cutlass_scaled_mm kernels we want to build the c2x (CUTLASS 2.x) # For the cutlass_scaled_mm kernels we want to build the c2x (CUTLASS 2.x)
# kernels for the remaining archs that are not already built for 3x. # kernels for the remaining archs that are not already built for 3x.
cuda_archs_loose_intersection(SCALED_MM_2X_ARCHS cuda_archs_loose_intersection(SCALED_MM_2X_ARCHS
"7.5;8.0;8.6;8.9;9.0" "${CUDA_ARCHS}") "7.5;8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
# subtract out the archs that are already built for 3x # subtract out the archs that are already built for 3x
list(REMOVE_ITEM SCALED_MM_2X_ARCHS ${SCALED_MM_3X_ARCHS}) list(REMOVE_ITEM SCALED_MM_2X_ARCHS ${SCALED_MM_3X_ARCHS})
if (SCALED_MM_2X_ARCHS) if (SCALED_MM_2X_ARCHS)
@@ -326,6 +352,31 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif() endif()
endif() endif()
#
# 2:4 Sparse Kernels
# The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
# require CUDA 12.2 or later (and only work on Hopper, 9.0a for now).
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/sparse/cutlass/sparse_compressor_c3x.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_SPARSE_SCALED_MM_C3X=1")
message(STATUS "Building sparse_scaled_mm_c3x for archs: ${SCALED_MM_3X_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_3X_ARCHS)
message(STATUS "Not building sparse_scaled_mm_c3x kernels as CUDA Compiler version is "
"not >= 12.2, we recommend upgrading to CUDA 12.2 or later "
"if you intend on running FP8 sparse quantized models on Hopper.")
else()
message(STATUS "Not building sparse_scaled_mm_c3x as no compatible archs found "
"in CUDA target architectures")
endif()
endif()
# #
# Machete kernels # Machete kernels
@@ -407,7 +458,7 @@ define_gpu_extension_target(
SOURCES ${VLLM_EXT_SRC} SOURCES ${VLLM_EXT_SRC}
COMPILE_FLAGS ${VLLM_GPU_FLAGS} COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES} ARCHITECTURES ${VLLM_GPU_ARCHES}
INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR} INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR};${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
USE_SABI 3 USE_SABI 3
WITH_SOABI) WITH_SOABI)
@@ -423,6 +474,7 @@ target_compile_definitions(_C PRIVATE CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1)
set(VLLM_MOE_EXT_SRC set(VLLM_MOE_EXT_SRC
"csrc/moe/torch_bindings.cpp" "csrc/moe/torch_bindings.cpp"
"csrc/moe/moe_align_sum_kernels.cu"
"csrc/moe/topk_softmax_kernels.cu") "csrc/moe/topk_softmax_kernels.cu")
set_gencode_flags_for_srcs( set_gencode_flags_for_srcs(
@@ -430,7 +482,7 @@ set_gencode_flags_for_srcs(
CUDA_ARCHS "${CUDA_ARCHS}") CUDA_ARCHS "${CUDA_ARCHS}")
if(VLLM_GPU_LANG STREQUAL "CUDA") if(VLLM_GPU_LANG STREQUAL "CUDA")
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.9;9.0" "${CUDA_ARCHS}") cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.7;8.9;9.0" "${CUDA_ARCHS}")
if (MARLIN_MOE_ARCHS) if (MARLIN_MOE_ARCHS)
set(MARLIN_MOE_SRC set(MARLIN_MOE_SRC
"csrc/moe/marlin_kernels/marlin_moe_kernel.h" "csrc/moe/marlin_kernels/marlin_moe_kernel.h"
@@ -450,7 +502,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
message(STATUS "Building Marlin MOE kernels for archs: ${MARLIN_MOE_ARCHS}") message(STATUS "Building Marlin MOE kernels for archs: ${MARLIN_MOE_ARCHS}")
else() else()
message(STATUS "Not building Marlin MOE kernels as no compatible archs found" message(STATUS "Not building Marlin MOE kernels as no compatible archs found"
"in CUDA target architectures") " in CUDA target architectures")
endif() endif()
endif() endif()
@@ -485,7 +537,7 @@ if(VLLM_GPU_LANG STREQUAL "HIP")
endif() endif()
# vllm-flash-attn currently only supported on CUDA # vllm-flash-attn currently only supported on CUDA
if (NOT VLLM_TARGET_DEVICE STREQUAL "cuda") if (NOT VLLM_GPU_LANG STREQUAL "CUDA")
return() return()
endif () endif ()
@@ -508,7 +560,7 @@ endif()
# They should be identical but if they aren't, this is a massive footgun. # They should be identical but if they aren't, this is a massive footgun.
# #
# The vllm-flash-attn install rules are nested under vllm to make sure the library gets installed in the correct place. # The vllm-flash-attn install rules are nested under vllm to make sure the library gets installed in the correct place.
# To only install vllm-flash-attn, use --component vllm_flash_attn_c. # To only install vllm-flash-attn, use --component _vllm_fa2_C (for FA2) or --component _vllm_fa3_C (for FA3).
# If no component is specified, vllm-flash-attn is still installed. # If no component is specified, vllm-flash-attn is still installed.
# If VLLM_FLASH_ATTN_SRC_DIR is set, vllm-flash-attn is installed from that directory instead of downloading. # If VLLM_FLASH_ATTN_SRC_DIR is set, vllm-flash-attn is installed from that directory instead of downloading.
@@ -520,41 +572,41 @@ if (DEFINED ENV{VLLM_FLASH_ATTN_SRC_DIR})
endif() endif()
if(VLLM_FLASH_ATTN_SRC_DIR) if(VLLM_FLASH_ATTN_SRC_DIR)
FetchContent_Declare(vllm-flash-attn SOURCE_DIR ${VLLM_FLASH_ATTN_SRC_DIR}) FetchContent_Declare(
vllm-flash-attn SOURCE_DIR
${VLLM_FLASH_ATTN_SRC_DIR}
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
)
else() else()
FetchContent_Declare( FetchContent_Declare(
vllm-flash-attn vllm-flash-attn
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
GIT_TAG 013f0c4fc47e6574060879d9734c1df8c5c273bd GIT_TAG d4e09037abf588af1ec47d0e966b237ee376876c
GIT_PROGRESS TRUE GIT_PROGRESS TRUE
# Don't share the vllm-flash-attn build between build types
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
) )
endif() endif()
# Set the parent build flag so that the vllm-flash-attn library does not redo compile flag and arch initialization.
set(VLLM_PARENT_BUILD ON)
# Ensure the vllm/vllm_flash_attn directory exists before installation
install(CODE "file(MAKE_DIRECTORY \"\${CMAKE_INSTALL_PREFIX}/vllm/vllm_flash_attn\")" COMPONENT vllm_flash_attn_c)
# Make sure vllm-flash-attn install rules are nested under vllm/
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY FALSE)" COMPONENT vllm_flash_attn_c)
install(CODE "set(OLD_CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
install(CODE "set(CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}/vllm/\")" COMPONENT vllm_flash_attn_c)
# Fetch the vllm-flash-attn library # Fetch the vllm-flash-attn library
FetchContent_MakeAvailable(vllm-flash-attn) FetchContent_MakeAvailable(vllm-flash-attn)
message(STATUS "vllm-flash-attn is available at ${vllm-flash-attn_SOURCE_DIR}") message(STATUS "vllm-flash-attn is available at ${vllm-flash-attn_SOURCE_DIR}")
# Restore the install prefix # Copy over the vllm-flash-attn python files (duplicated for fa2 and fa3, in
install(CODE "set(CMAKE_INSTALL_PREFIX \"\${OLD_CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c) # case only one is built, in the case both are built redundant work is done)
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" COMPONENT vllm_flash_attn_c)
# Copy over the vllm-flash-attn python files
install( install(
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/ DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm/vllm_flash_attn DESTINATION vllm_flash_attn
COMPONENT vllm_flash_attn_c COMPONENT _vllm_fa2_C
FILES_MATCHING PATTERN "*.py" FILES_MATCHING PATTERN "*.py"
)
install(
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm_flash_attn
COMPONENT _vllm_fa3_C
FILES_MATCHING PATTERN "*.py"
) )
# Nothing after vllm-flash-attn, see comment about macros above # Nothing after vllm-flash-attn, see comment about macros above

View File

@@ -61,7 +61,7 @@ representative at an online or offline/IRL event.
Instances of abusive, harassing, or otherwise unacceptable behavior may be Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement in the #code-of-conduct reported to the community leaders responsible for enforcement in the #code-of-conduct
channel in the [vLLM Discord](https://discord.com/invite/jz7wjKhh6g). channel in the [vLLM Slack](https://slack.vllm.ai).
All complaints will be reviewed and investigated promptly and fairly. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the All community leaders are obligated to respect the privacy and security of the

View File

@@ -1,50 +1,3 @@
# Contributing to vLLM # Contributing to vLLM
Thank you for your interest in contributing to vLLM! Our community is open to everyone and welcomes all kinds of contributions, no matter how small or large. There are several ways you can contribute to the project: You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing/overview.html).
- Identify and report any issues or bugs.
- Request or add support for a new model.
- Suggest or implement new features.
- Improve documentation or contribute a how-to guide.
We also believe in the power of community support; thus, answering queries, offering PR reviews, and assisting others are also highly regarded and beneficial contributions.
Finally, one of the most impactful ways to support us is by raising awareness about vLLM. Talk about it in your blog posts and highlight how it's driving your incredible projects. Express your support on social media if you're using vLLM, or simply offer your appreciation by starring our repository!
## Developing
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Check out the [building from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source) documentation for details.
## Testing
```bash
pip install -r requirements-dev.txt
# linting and formatting
bash format.sh
# Static type checking
mypy
# Unit tests
pytest tests/
```
**Note:** Currently, the repository does not pass the ``mypy`` tests.
## Contribution Guidelines
### Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
> [!IMPORTANT]
> If you discover a security vulnerability, please follow the instructions [here](/SECURITY.md#reporting-a-vulnerability).
### Pull Requests & Code Reviews
Please check the PR checklist in the [PR template](.github/PULL_REQUEST_TEMPLATE.md) for detailed guide for contribution.
### Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM.
All of your contributions help make vLLM a great tool and community for everyone!

34
DCO Normal file
View File

@@ -0,0 +1,34 @@
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.

View File

@@ -2,8 +2,8 @@
# to run the OpenAI compatible server. # to run the OpenAI compatible server.
# Please update any changes made here to # Please update any changes made here to
# docs/source/dev/dockerfile/dockerfile.rst and # docs/source/contributing/dockerfile/dockerfile.md and
# docs/source/assets/dev/dockerfile-stages-dependency.png # docs/source/assets/contributing/dockerfile-stages-dependency.png
ARG CUDA_VERSION=12.4.1 ARG CUDA_VERSION=12.4.1
#################### BASE BUILD IMAGE #################### #################### BASE BUILD IMAGE ####################
@@ -11,6 +11,7 @@ ARG CUDA_VERSION=12.4.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
ARG CUDA_VERSION=12.4.1 ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.12 ARG PYTHON_VERSION=3.12
ARG TARGETPLATFORM
ENV DEBIAN_FRONTEND=noninteractive ENV DEBIAN_FRONTEND=noninteractive
# Install Python and other dependencies # Install Python and other dependencies
@@ -44,12 +45,21 @@ RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/
WORKDIR /workspace WORKDIR /workspace
# install build and runtime dependencies # install build and runtime dependencies
# arm64 (GH200) build follows the practice of "use existing pytorch" build,
# we need to install torch and torchvision from the nightly builds first,
# pytorch will not appear as a vLLM dependency in all of the following steps
# after this step
RUN --mount=type=cache,target=/root/.cache/pip \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu126 "torch==2.7.0.dev20250121+cu126" "torchvision==0.22.0.dev20250121"; \
fi
COPY requirements-common.txt requirements-common.txt COPY requirements-common.txt requirements-common.txt
COPY requirements-cuda.txt requirements-cuda.txt COPY requirements-cuda.txt requirements-cuda.txt
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-cuda.txt python3 -m pip install -r requirements-cuda.txt
# cuda arch list used by torch # cuda arch list used by torch
# can be useful for both `dev` and `test` # can be useful for both `dev` and `test`
# explicitly set the list to avoid issues with torch 2.2 # explicitly set the list to avoid issues with torch 2.2
@@ -63,6 +73,7 @@ ENV VLLM_FA_CMAKE_GPU_ARCHES=${vllm_fa_cmake_gpu_arches}
#################### WHEEL BUILD IMAGE #################### #################### WHEEL BUILD IMAGE ####################
FROM base AS build FROM base AS build
ARG TARGETPLATFORM
# install build dependencies # install build dependencies
COPY requirements-build.txt requirements-build.txt COPY requirements-build.txt requirements-build.txt
@@ -115,8 +126,8 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
# Check the size of the wheel if RUN_WHEEL_CHECK is true # Check the size of the wheel if RUN_WHEEL_CHECK is true
COPY .buildkite/check-wheel-size.py check-wheel-size.py COPY .buildkite/check-wheel-size.py check-wheel-size.py
# Default max size of the wheel is 250MB # sync the default value with .buildkite/check-wheel-size.py
ARG VLLM_MAX_SIZE_MB=250 ARG VLLM_MAX_SIZE_MB=400
ENV VLLM_MAX_SIZE_MB=$VLLM_MAX_SIZE_MB ENV VLLM_MAX_SIZE_MB=$VLLM_MAX_SIZE_MB
ARG RUN_WHEEL_CHECK=true ARG RUN_WHEEL_CHECK=true
RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \ RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
@@ -134,15 +145,17 @@ COPY requirements-test.txt requirements-test.txt
COPY requirements-dev.txt requirements-dev.txt COPY requirements-dev.txt requirements-dev.txt
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-dev.txt python3 -m pip install -r requirements-dev.txt
#################### DEV IMAGE #################### #################### DEV IMAGE ####################
#################### vLLM installation IMAGE #################### #################### vLLM installation IMAGE ####################
# image with vLLM installed # image with vLLM installed
FROM nvidia/cuda:${CUDA_VERSION}-base-ubuntu22.04 AS vllm-base # TODO: Restore to base image after FlashInfer AOT wheel fixed
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 AS vllm-base
ARG CUDA_VERSION=12.4.1 ARG CUDA_VERSION=12.4.1
ARG PYTHON_VERSION=3.12 ARG PYTHON_VERSION=3.12
WORKDIR /vllm-workspace WORKDIR /vllm-workspace
ENV DEBIAN_FRONTEND=noninteractive ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETPLATFORM
RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \ RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \
echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment
@@ -151,7 +164,7 @@ RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \ && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
&& apt-get update -y \ && apt-get update -y \
&& apt-get install -y ccache software-properties-common git curl sudo vim python3-pip \ && apt-get install -y ccache software-properties-common git curl wget sudo vim python3-pip \
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \ && apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
&& add-apt-repository ppa:deadsnakes/ppa \ && add-apt-repository ppa:deadsnakes/ppa \
&& apt-get update -y \ && apt-get update -y \
@@ -168,17 +181,45 @@ RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
# or future versions of triton. # or future versions of triton.
RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/ RUN ldconfig /usr/local/cuda-$(echo $CUDA_VERSION | cut -d. -f1,2)/compat/
# install vllm wheel first, so that torch etc will be installed # arm64 (GH200) build follows the practice of "use existing pytorch" build,
# we need to install torch and torchvision from the nightly builds first,
# pytorch will not appear as a vLLM dependency in all of the following steps
# after this step
RUN --mount=type=cache,target=/root/.cache/pip \
if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
python3 -m pip install --index-url https://download.pytorch.org/whl/nightly/cu124 "torch==2.6.0.dev20241210+cu124" "torchvision==0.22.0.dev20241215"; \
fi
# Install vllm wheel first, so that torch etc will be installed.
RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
--mount=type=cache,target=/root/.cache/pip \ --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install dist/*.whl --verbose python3 -m pip install dist/*.whl --verbose
RUN --mount=type=cache,target=/root/.cache/pip \ # How to build this FlashInfer wheel:
. /etc/environment && \ # $ export FLASHINFER_ENABLE_AOT=1
python3 -m pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu121torch2.4-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl # $ # Note we remove 7.0 from the arch list compared to the list below, since FlashInfer only supports sm75+
COPY examples examples # $ export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX'
#################### vLLM installation IMAGE #################### # $ git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
# $ cd flashinfer
# $ git checkout 524304395bd1d8cd7d07db083859523fcaa246a4
# $ python3 setup.py bdist_wheel --dist-dir=dist --verbose
RUN --mount=type=cache,target=/root/.cache/pip \
. /etc/environment && \
if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
python3 -m pip install https://wheels.vllm.ai/flashinfer/524304395bd1d8cd7d07db083859523fcaa246a4/flashinfer_python-0.2.0.post1-cp${PYTHON_VERSION_STR}-cp${PYTHON_VERSION_STR}-linux_x86_64.whl; \
fi
COPY examples examples
# Although we build Flashinfer with AOT mode, there's still
# some issues w.r.t. JIT compilation. Therefore we need to
# install build dependencies for JIT compilation.
# TODO: Remove this once FlashInfer AOT wheel is fixed
COPY requirements-build.txt requirements-build.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-build.txt
#################### vLLM installation IMAGE ####################
#################### TEST IMAGE #################### #################### TEST IMAGE ####################
# image to run unit testing suite # image to run unit testing suite
@@ -191,24 +232,48 @@ ADD . /vllm-workspace/
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -r requirements-dev.txt python3 -m pip install -r requirements-dev.txt
# install development dependencies (for testing)
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -e tests/vllm_test_utils
# enable fast downloads from hf (for testing)
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install hf_transfer
ENV HF_HUB_ENABLE_HF_TRANSFER 1
# Copy in the v1 package for testing (it isn't distributed yet)
COPY vllm/v1 /usr/local/lib/python3.12/dist-packages/vllm/v1
# doc requires source code # doc requires source code
# we hide them inside `test_docs/` , so that this source code # we hide them inside `test_docs/` , so that this source code
# will not be imported by other tests # will not be imported by other tests
RUN mkdir test_docs RUN mkdir test_docs
RUN mv docs test_docs/ RUN mv docs test_docs/
RUN mv vllm test_docs/ RUN mv vllm test_docs/
#################### TEST IMAGE #################### #################### TEST IMAGE ####################
#################### OPENAI API SERVER #################### #################### OPENAI API SERVER ####################
# openai api server alternative # base openai image with additional requirements, for any subsequent openai-style images
FROM vllm-base AS vllm-openai FROM vllm-base AS vllm-openai-base
# install additional dependencies for openai api server # install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer 'modelscope!=1.15.0' bitsandbytes>=0.44.0 timm==0.9.10 if [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.42.0' 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]; \
else \
pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.45.0' 'timm==0.9.10' boto3 runai-model-streamer runai-model-streamer[s3]; \
fi
ENV VLLM_USAGE_SOURCE production-docker-image ENV VLLM_USAGE_SOURCE production-docker-image
# define sagemaker first, so it is not default from `docker build`
FROM vllm-openai-base AS vllm-sagemaker
COPY examples/online_serving/sagemaker-entrypoint.sh .
RUN chmod +x sagemaker-entrypoint.sh
ENTRYPOINT ["./sagemaker-entrypoint.sh"]
FROM vllm-openai-base AS vllm-openai
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"] ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
#################### OPENAI API SERVER #################### #################### OPENAI API SERVER ####################

62
Dockerfile.arm Normal file
View File

@@ -0,0 +1,62 @@
# This vLLM Dockerfile is used to construct an image that can build and run vLLM on ARM CPU platform.
FROM ubuntu:22.04 AS cpu-test-arm
ENV CCACHE_DIR=/root/.cache/ccache
ENV CMAKE_CXX_COMPILER_LAUNCHER=ccache
RUN --mount=type=cache,target=/var/cache/apt \
apt-get update -y \
&& apt-get install -y curl ccache git wget vim numactl gcc-12 g++-12 python3 python3-pip libtcmalloc-minimal4 libnuma-dev \
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
# tcmalloc provides better memory allocation efficiency, e.g., holding memory in caches to speed up access of commonly-used objects.
RUN --mount=type=cache,target=/root/.cache/pip \
pip install py-cpuinfo # Use this to gather CPU info and optimize based on ARM Neoverse cores
# Set LD_PRELOAD for tcmalloc on ARM
ENV LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4"
RUN echo 'ulimit -c 0' >> ~/.bashrc
WORKDIR /workspace
ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,src=requirements-build.txt,target=requirements-build.txt \
pip install --upgrade pip && \
pip install -r requirements-build.txt
FROM cpu-test-arm AS build
WORKDIR /workspace/vllm
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,src=requirements-common.txt,target=requirements-common.txt \
--mount=type=bind,src=requirements-cpu.txt,target=requirements-cpu.txt \
pip install -v -r requirements-cpu.txt
COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
# Disabling AVX512 specific optimizations for ARM
ARG VLLM_CPU_DISABLE_AVX512="true"
ENV VLLM_CPU_DISABLE_AVX512=${VLLM_CPU_DISABLE_AVX512}
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=cache,target=/root/.cache/ccache \
--mount=type=bind,source=.git,target=.git \
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel && \
pip install dist/*.whl && \
rm -rf dist
WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@@ -16,13 +16,13 @@ RUN --mount=type=cache,target=/var/cache/apt \
# intel-openmp provides additional performance improvement vs. openmp # intel-openmp provides additional performance improvement vs. openmp
# tcmalloc provides better memory allocation efficiency, e.g, holding memory in caches to speed up access of commonly-used objects. # tcmalloc provides better memory allocation efficiency, e.g, holding memory in caches to speed up access of commonly-used objects.
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
pip install intel-openmp pip install intel-openmp==2025.0.1
ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/usr/local/lib/libiomp5.so" ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/usr/local/lib/libiomp5.so"
RUN echo 'ulimit -c 0' >> ~/.bashrc RUN echo 'ulimit -c 0' >> ~/.bashrc
RUN pip install intel_extension_for_pytorch==2.4.0 RUN pip install intel_extension_for_pytorch==2.5.0
WORKDIR /workspace WORKDIR /workspace
@@ -62,4 +62,8 @@ WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
# install development dependencies (for testing)
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -e tests/vllm_test_utils
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"] ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

21
Dockerfile.hpu Normal file
View File

@@ -0,0 +1,21 @@
FROM vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
COPY ./ /workspace/vllm
WORKDIR /workspace/vllm
RUN pip install -v -r requirements-hpu.txt
ENV no_proxy=localhost,127.0.0.1
ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true
RUN VLLM_TARGET_DEVICE=hpu python3 setup.py install
# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils
WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@@ -1,5 +1,6 @@
# default base image # default base image
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04" # https://gallery.ecr.aws/neuron/pytorch-inference-neuronx
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.5.1-neuronx-py310-sdk2.21.0-ubuntu22.04"
FROM $BASE_IMAGE FROM $BASE_IMAGE
@@ -14,16 +15,17 @@ RUN apt-get update && \
ffmpeg libsm6 libxext6 libgl1 ffmpeg libsm6 libxext6 libgl1
### Mount Point ### ### Mount Point ###
# When launching the container, mount the code directory to /app # When launching the container, mount the code directory to /workspace
ARG APP_MOUNT=/app ARG APP_MOUNT=/workspace
VOLUME [ ${APP_MOUNT} ] VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT}/vllm WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
RUN python3 -m pip install sentencepiece transformers==4.36.2 -U RUN python3 -m pip install sentencepiece transformers==4.45.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U RUN python3 -m pip install neuronx-cc==2.16.345.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install pytest
COPY . . COPY . .
ARG GIT_REPO_CHECK=0 ARG GIT_REPO_CHECK=0
@@ -31,11 +33,17 @@ RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
RUN python3 -m pip install -U \ RUN python3 -m pip install -U \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \ 'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
-r requirements-neuron.txt -r requirements-neuron.txt
ENV VLLM_TARGET_DEVICE neuron ENV VLLM_TARGET_DEVICE neuron
RUN --mount=type=bind,source=.git,target=.git \ RUN --mount=type=bind,source=.git,target=.git \
pip install --no-build-isolation -v -e . \ pip install --no-build-isolation -v -e .
# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils
# overwrite entrypoint to run bash script
RUN echo "import subprocess; import sys; subprocess.check_call(sys.argv[1:])" > /usr/local/bin/dockerd-entrypoint.py
CMD ["/bin/bash"] CMD ["/bin/bash"]

View File

@@ -14,12 +14,16 @@ ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \ RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
RUN python3 -m pip install -U pip
# install build requirements # install build requirements
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/vllm/requirements-build.txt RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/requirements-build.txt
# build vLLM with OpenVINO backend # build vLLM with OpenVINO backend
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace/vllm/ RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace
COPY examples/ /workspace/vllm/examples COPY examples/ /workspace/examples
COPY benchmarks/ /workspace/vllm/benchmarks COPY benchmarks/ /workspace/benchmarks
# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils
CMD ["/bin/bash"] CMD ["/bin/bash"]

View File

@@ -4,12 +4,12 @@ USER root
ENV PATH="/usr/local/cargo/bin:$PATH:/opt/conda/bin/" ENV PATH="/usr/local/cargo/bin:$PATH:/opt/conda/bin/"
RUN apt-get update -y && apt-get install -y git wget curl vim libnuma-dev libsndfile-dev libprotobuf-dev build-essential ffmpeg libsm6 libxext6 libgl1 RUN apt-get update -y && apt-get install -y git wget kmod curl vim libnuma-dev libsndfile-dev libprotobuf-dev build-essential ffmpeg libsm6 libxext6 libgl1 libssl-dev
# Some packages in requirements-cpu are installed here # Some packages in requirements-cpu are installed here
# IBM provides optimized packages for ppc64le processors in the open-ce project for mamba # IBM provides optimized packages for ppc64le processors in the open-ce project for mamba
# Currently these may not be available for venv or pip directly # Currently these may not be available for venv or pip directly
RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p10/ -c defaults python=3.10 torchvision-cpu=0.16.2 rust && micromamba clean --all --yes RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p10/ -c defaults python=3.10 rust && micromamba clean --all --yes
COPY ./ /workspace/vllm COPY ./ /workspace/vllm
@@ -18,19 +18,20 @@ ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \ RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh; fi if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh; fi
# These packages will be in rocketce eventually
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
pip install -v --prefer-binary --extra-index-url https://repo.fury.io/mgiessing \ RUSTFLAGS='-L /opt/conda/lib' pip install -v --prefer-binary --extra-index-url https://repo.fury.io/mgiessing \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \ 'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
torch==2.3.1 \
-r requirements-cpu.txt \ -r requirements-cpu.txt \
xformers uvloop==0.20.0 xformers uvloop==0.20.0
RUN --mount=type=bind,source=.git,target=.git \ RUN --mount=type=bind,source=.git,target=.git \
VLLM_TARGET_DEVICE=cpu python3 setup.py install VLLM_TARGET_DEVICE=cpu python3 setup.py install
# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils
WORKDIR /workspace/ WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"] ENTRYPOINT ["/opt/conda/bin/python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@@ -1,169 +1,119 @@
# Default ROCm 6.2 base image # default base image
ARG BASE_IMAGE="rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0" ARG REMOTE_VLLM="0"
ARG USE_CYTHON="0"
ARG BUILD_RPD="1"
ARG COMMON_WORKDIR=/app
ARG BASE_IMAGE=rocm/vllm-dev:base
# Default ROCm ARCHes to build vLLM for. FROM ${BASE_IMAGE} AS base
ARG PYTORCH_ROCM_ARCH="gfx908;gfx90a;gfx942;gfx1100"
# Whether to install CK-based flash-attention ARG ARG_PYTORCH_ROCM_ARCH
# If 0, will not install flash-attention ENV PYTORCH_ROCM_ARCH=${ARG_PYTORCH_ROCM_ARCH:-${PYTORCH_ROCM_ARCH}}
ARG BUILD_FA="1"
ARG FA_GFX_ARCHS="gfx90a;gfx942"
ARG FA_BRANCH="3cea2fb"
# Whether to build triton on rocm
ARG BUILD_TRITON="1"
ARG TRITON_BRANCH="e192dba"
### Base image build stage
FROM $BASE_IMAGE AS base
# Import arg(s) defined before this build stage
ARG PYTORCH_ROCM_ARCH
# Install some basic utilities # Install some basic utilities
RUN apt-get update && apt-get install python3 python3-pip -y RUN apt-get update -q -y && apt-get install -q -y \
RUN apt-get update && apt-get install -y \ sqlite3 libsqlite3-dev libfmt-dev libmsgpack-dev libsuitesparse-dev
curl \ # Remove sccache
ca-certificates \ RUN python3 -m pip install --upgrade pip && pip install setuptools_scm
sudo \
git \
bzip2 \
libx11-6 \
build-essential \
wget \
unzip \
tmux \
ccache \
&& rm -rf /var/lib/apt/lists/*
# When launching the container, mount the code directory to /vllm-workspace
ARG APP_MOUNT=/vllm-workspace
WORKDIR ${APP_MOUNT}
RUN python3 -m pip install --upgrade pip
# Remove sccache so it doesn't interfere with ccache
# TODO: implement sccache support across components
RUN apt-get purge -y sccache; python3 -m pip uninstall -y sccache; rm -f "$(which sccache)" RUN apt-get purge -y sccache; python3 -m pip uninstall -y sccache; rm -f "$(which sccache)"
ARG COMMON_WORKDIR
WORKDIR ${COMMON_WORKDIR}
# Install torch == 2.6.0 on ROCm
RUN --mount=type=cache,target=/root/.cache/pip \ # -----------------------
case "$(ls /opt | grep -Po 'rocm-[0-9]\.[0-9]')" in \ # vLLM fetch stages
*"rocm-6.2"*) \ FROM base AS fetch_vllm_0
python3 -m pip uninstall -y torch torchvision \ ONBUILD COPY ./ vllm/
&& python3 -m pip install --pre \ FROM base AS fetch_vllm_1
torch==2.6.0.dev20240918 \ ARG VLLM_REPO="https://github.com/vllm-project/vllm.git"
setuptools-scm>=8 \ ARG VLLM_BRANCH="main"
torchvision==0.20.0.dev20240918 \ ONBUILD RUN git clone ${VLLM_REPO} \
--extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2;; \ && cd vllm \
&& git checkout ${VLLM_BRANCH}
FROM fetch_vllm_${REMOTE_VLLM} AS fetch_vllm
# -----------------------
# vLLM build stages
FROM fetch_vllm AS build_vllm
ARG USE_CYTHON
# Build vLLM
RUN cd vllm \
&& python3 -m pip install -r requirements-rocm.txt \
&& python3 setup.py clean --all \
&& if [ ${USE_CYTHON} -eq "1" ]; then python3 setup_cython.py build_ext --inplace; fi \
&& python3 setup.py bdist_wheel --dist-dir=dist
FROM scratch AS export_vllm
ARG COMMON_WORKDIR
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/dist/*.whl /
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/requirements*.txt /
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/benchmarks /benchmarks
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/tests /tests
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/examples /examples
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm/.buildkite /.buildkite
# -----------------------
# Test vLLM image
FROM base AS test
RUN python3 -m pip install --upgrade pip && rm -rf /var/lib/apt/lists/*
# Install vLLM
RUN --mount=type=bind,from=export_vllm,src=/,target=/install \
cd /install \
&& pip install -U -r requirements-rocm.txt \
&& pip uninstall -y vllm \
&& pip install *.whl
WORKDIR /vllm-workspace
ARG COMMON_WORKDIR
COPY --from=build_vllm ${COMMON_WORKDIR}/vllm /vllm-workspace
# install development dependencies (for testing)
RUN cd /vllm-workspace \
&& rm -rf vllm \
&& python3 -m pip install -e tests/vllm_test_utils \
&& python3 -m pip install lm-eval[api]==0.4.4 \
&& python3 -m pip install pytest-shard
# -----------------------
# Final vLLM image
FROM base AS final
RUN python3 -m pip install --upgrade pip && rm -rf /var/lib/apt/lists/*
# Error related to odd state for numpy 1.20.3 where there is no METADATA etc, but an extra LICENSES_bundled.txt.
# Manually remove it so that later steps of numpy upgrade can continue
RUN case "$(which python3)" in \
*"/opt/conda/envs/py_3.9"*) \
rm -rf /opt/conda/envs/py_3.9/lib/python3.9/site-packages/numpy-1.20.3.dist-info/;; \
*) ;; esac *) ;; esac
ENV LLVM_SYMBOLIZER_PATH=/opt/rocm/llvm/bin/llvm-symbolizer RUN python3 -m pip install --upgrade huggingface-hub[cli]
ENV PATH=$PATH:/opt/rocm/bin:/libtorch/bin: ARG BUILD_RPD
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib/:/libtorch/lib: RUN if [ ${BUILD_RPD} -eq "1" ]; then \
ENV CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/libtorch/include:/libtorch/include/torch/csrc/api/include/:/opt/rocm/include/: git clone -b nvtx_enabled https://github.com/ROCm/rocmProfileData.git \
&& cd rocmProfileData/rpd_tracer \
&& pip install -r requirements.txt && cd ../ \
&& make && make install \
&& cd hipMarker && python3 setup.py install ; fi
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} # Install vLLM
ENV CCACHE_DIR=/root/.cache/ccache RUN --mount=type=bind,from=export_vllm,src=/,target=/install \
cd /install \
&& pip install -U -r requirements-rocm.txt \
&& pip uninstall -y vllm \
&& pip install *.whl
ARG COMMON_WORKDIR
### AMD-SMI build stage # Copy over the benchmark scripts as well
FROM base AS build_amdsmi COPY --from=export_vllm /benchmarks ${COMMON_WORKDIR}/vllm/benchmarks
# Build amdsmi wheel always COPY --from=export_vllm /examples ${COMMON_WORKDIR}/vllm/examples
RUN cd /opt/rocm/share/amd_smi \
&& python3 -m pip wheel . --wheel-dir=/install
### Flash-Attention wheel build stage
FROM base AS build_fa
ARG BUILD_FA
ARG FA_GFX_ARCHS
ARG FA_BRANCH
# Build ROCm flash-attention wheel if `BUILD_FA = 1`
RUN --mount=type=cache,target=${CCACHE_DIR} \
if [ "$BUILD_FA" = "1" ]; then \
mkdir -p libs \
&& cd libs \
&& git clone https://github.com/ROCm/flash-attention.git \
&& cd flash-attention \
&& git checkout "${FA_BRANCH}" \
&& git submodule update --init \
&& GPU_ARCHS="${FA_GFX_ARCHS}" python3 setup.py bdist_wheel --dist-dir=/install; \
# Create an empty directory otherwise as later build stages expect one
else mkdir -p /install; \
fi
### Triton wheel build stage
FROM base AS build_triton
ARG BUILD_TRITON
ARG TRITON_BRANCH
# Build triton wheel if `BUILD_TRITON = 1`
RUN --mount=type=cache,target=${CCACHE_DIR} \
if [ "$BUILD_TRITON" = "1" ]; then \
mkdir -p libs \
&& cd libs \
&& python3 -m pip install ninja cmake wheel pybind11 \
&& git clone https://github.com/OpenAI/triton.git \
&& cd triton \
&& git checkout "${TRITON_BRANCH}" \
&& cd python \
&& python3 setup.py bdist_wheel --dist-dir=/install; \
# Create an empty directory otherwise as later build stages expect one
else mkdir -p /install; \
fi
### Final vLLM build stage
FROM base AS final
# Import the vLLM development directory from the build context
COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
# Package upgrades for useful functionality or to avoid dependency issues
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install --upgrade numba scipy huggingface-hub[cli] pytest-shard
# Workaround for ray >= 2.10.0
ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1 ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
# Silences the HF Tokenizers warning
ENV TOKENIZERS_PARALLELISM=false ENV TOKENIZERS_PARALLELISM=false
RUN --mount=type=cache,target=${CCACHE_DIR} \ # Performance environment variable.
--mount=type=bind,source=.git,target=.git \ ENV HIP_FORCE_DEV_KERNARG=1
--mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -Ur requirements-rocm.txt \
&& python3 setup.py clean --all \
&& python3 setup.py develop
# Copy amdsmi wheel into final image
RUN --mount=type=bind,from=build_amdsmi,src=/install,target=/install \
mkdir -p libs \
&& cp /install/*.whl libs \
# Preemptively uninstall to avoid same-version no-installs
&& python3 -m pip uninstall -y amdsmi;
# Copy triton wheel(s) into final image if they were built
RUN --mount=type=bind,from=build_triton,src=/install,target=/install \
mkdir -p libs \
&& if ls /install/*.whl; then \
cp /install/*.whl libs \
# Preemptively uninstall to avoid same-version no-installs
&& python3 -m pip uninstall -y triton; fi
# Copy flash-attn wheel(s) into final image if they were built
RUN --mount=type=bind,from=build_fa,src=/install,target=/install \
mkdir -p libs \
&& if ls /install/*.whl; then \
cp /install/*.whl libs \
# Preemptively uninstall to avoid same-version no-installs
&& python3 -m pip uninstall -y flash-attn; fi
# Install wheels that were built to the final image
RUN --mount=type=cache,target=/root/.cache/pip \
if ls libs/*.whl; then \
python3 -m pip install libs/*.whl; fi
CMD ["/bin/bash"] CMD ["/bin/bash"]

158
Dockerfile.rocm_base Normal file
View File

@@ -0,0 +1,158 @@
ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:6.3.1-complete
ARG HIPBLASLT_BRANCH="4d40e36"
ARG HIPBLAS_COMMON_BRANCH="7c1566b"
ARG LEGACY_HIPBLASLT_OPTION=
ARG RCCL_BRANCH="648a58d"
ARG RCCL_REPO="https://github.com/ROCm/rccl"
ARG TRITON_BRANCH="e5be006"
ARG TRITON_REPO="https://github.com/triton-lang/triton.git"
ARG PYTORCH_BRANCH="8d4926e"
ARG PYTORCH_VISION_BRANCH="v0.19.1"
ARG PYTORCH_REPO="https://github.com/pytorch/pytorch.git"
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
ARG FA_BRANCH="b7d29fb"
ARG FA_REPO="https://github.com/ROCm/flash-attention.git"
FROM ${BASE_IMAGE} AS base
ENV PATH=/opt/rocm/llvm/bin:$PATH
ENV ROCM_PATH=/opt/rocm
ENV LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
ARG PYTORCH_ROCM_ARCH=gfx90a;gfx942
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}
ARG PYTHON_VERSION=3.12
RUN mkdir -p /app
WORKDIR /app
ENV DEBIAN_FRONTEND=noninteractive
# Install Python and other dependencies
RUN apt-get update -y \
&& apt-get install -y software-properties-common git curl sudo vim less \
&& add-apt-repository ppa:deadsnakes/ppa \
&& apt-get update -y \
&& apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv \
python${PYTHON_VERSION}-lib2to3 python-is-python3 \
&& update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 1 \
&& update-alternatives --set python3 /usr/bin/python${PYTHON_VERSION} \
&& ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config \
&& curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} \
&& python3 --version && python3 -m pip --version
RUN pip install -U packaging cmake ninja wheel setuptools pybind11 Cython
FROM base AS build_hipblaslt
ARG HIPBLASLT_BRANCH
ARG HIPBLAS_COMMON_BRANCH
# Set to "--legacy_hipblas_direct" for ROCm<=6.2
ARG LEGACY_HIPBLASLT_OPTION
RUN git clone https://github.com/ROCm/hipBLAS-common.git
RUN cd hipBLAS-common \
&& git checkout ${HIPBLAS_COMMON_BRANCH} \
&& mkdir build \
&& cd build \
&& cmake .. \
&& make package \
&& dpkg -i ./*.deb
RUN git clone https://github.com/ROCm/hipBLASLt
RUN cd hipBLASLt \
&& git checkout ${HIPBLASLT_BRANCH} \
&& ./install.sh -d --architecture ${PYTORCH_ROCM_ARCH} ${LEGACY_HIPBLASLT_OPTION} \
&& cd build/release \
&& make package
RUN mkdir -p /app/install && cp /app/hipBLASLt/build/release/*.deb /app/hipBLAS-common/build/*.deb /app/install
FROM base AS build_rccl
ARG RCCL_BRANCH
ARG RCCL_REPO
RUN git clone ${RCCL_REPO}
RUN cd rccl \
&& git checkout ${RCCL_BRANCH} \
&& ./install.sh -p --amdgpu_targets ${PYTORCH_ROCM_ARCH}
RUN mkdir -p /app/install && cp /app/rccl/build/release/*.deb /app/install
FROM base AS build_triton
ARG TRITON_BRANCH
ARG TRITON_REPO
RUN git clone ${TRITON_REPO}
RUN cd triton \
&& git checkout ${TRITON_BRANCH} \
&& cd python \
&& python3 setup.py bdist_wheel --dist-dir=dist
RUN mkdir -p /app/install && cp /app/triton/python/dist/*.whl /app/install
FROM base AS build_amdsmi
RUN cd /opt/rocm/share/amd_smi \
&& pip wheel . --wheel-dir=dist
RUN mkdir -p /app/install && cp /opt/rocm/share/amd_smi/dist/*.whl /app/install
FROM base AS build_pytorch
ARG PYTORCH_BRANCH
ARG PYTORCH_VISION_BRANCH
ARG PYTORCH_REPO
ARG PYTORCH_VISION_REPO
ARG FA_BRANCH
ARG FA_REPO
RUN git clone ${PYTORCH_REPO} pytorch
RUN cd pytorch && git checkout ${PYTORCH_BRANCH} && \
pip install -r requirements.txt && git submodule update --init --recursive \
&& python3 tools/amd_build/build_amd.py \
&& CMAKE_PREFIX_PATH=$(python3 -c 'import sys; print(sys.prefix)') python3 setup.py bdist_wheel --dist-dir=dist \
&& pip install dist/*.whl
RUN git clone ${PYTORCH_VISION_REPO} vision
RUN cd vision && git checkout ${PYTORCH_VISION_BRANCH} \
&& python3 setup.py bdist_wheel --dist-dir=dist \
&& pip install dist/*.whl
RUN git clone ${FA_REPO}
RUN cd flash-attention \
&& git checkout ${FA_BRANCH} \
&& git submodule update --init \
&& MAX_JOBS=64 GPU_ARCHS=${PYTORCH_ROCM_ARCH} python3 setup.py bdist_wheel --dist-dir=dist
RUN mkdir -p /app/install && cp /app/pytorch/dist/*.whl /app/install \
&& cp /app/vision/dist/*.whl /app/install \
&& cp /app/flash-attention/dist/*.whl /app/install
FROM base AS final
RUN --mount=type=bind,from=build_hipblaslt,src=/app/install/,target=/install \
dpkg -i /install/*deb \
&& sed -i 's/, hipblaslt-dev \(.*\), hipcub-dev/, hipcub-dev/g' /var/lib/dpkg/status \
&& sed -i 's/, hipblaslt \(.*\), hipfft/, hipfft/g' /var/lib/dpkg/status
RUN --mount=type=bind,from=build_rccl,src=/app/install/,target=/install \
dpkg -i /install/*deb \
&& sed -i 's/, rccl-dev \(.*\), rocalution/, rocalution/g' /var/lib/dpkg/status \
&& sed -i 's/, rccl \(.*\), rocalution/, rocalution/g' /var/lib/dpkg/status
RUN --mount=type=bind,from=build_triton,src=/app/install/,target=/install \
pip install /install/*.whl
RUN --mount=type=bind,from=build_amdsmi,src=/app/install/,target=/install \
pip install /install/*.whl
RUN --mount=type=bind,from=build_pytorch,src=/app/install/,target=/install \
pip install /install/*.whl
ARG BASE_IMAGE
ARG HIPBLASLT_BRANCH
ARG LEGACY_HIPBLASLT_OPTION
ARG RCCL_BRANCH
ARG RCCL_REPO
ARG TRITON_BRANCH
ARG TRITON_REPO
ARG PYTORCH_BRANCH
ARG PYTORCH_VISION_BRANCH
ARG PYTORCH_REPO
ARG PYTORCH_VISION_REPO
ARG FA_BRANCH
ARG FA_REPO
RUN echo "BASE_IMAGE: ${BASE_IMAGE}" > /app/versions.txt \
&& echo "HIPBLAS_COMMON_BRANCH: ${HIPBLAS_COMMON_BRANCH}" >> /app/versions.txt \
&& echo "HIPBLASLT_BRANCH: ${HIPBLASLT_BRANCH}" >> /app/versions.txt \
&& echo "LEGACY_HIPBLASLT_OPTION: ${LEGACY_HIPBLASLT_OPTION}" >> /app/versions.txt \
&& echo "RCCL_BRANCH: ${RCCL_BRANCH}" >> /app/versions.txt \
&& echo "RCCL_REPO: ${RCCL_REPO}" >> /app/versions.txt \
&& echo "TRITON_BRANCH: ${TRITON_BRANCH}" >> /app/versions.txt \
&& echo "TRITON_REPO: ${TRITON_REPO}" >> /app/versions.txt \
&& echo "PYTORCH_BRANCH: ${PYTORCH_BRANCH}" >> /app/versions.txt \
&& echo "PYTORCH_VISION_BRANCH: ${PYTORCH_VISION_BRANCH}" >> /app/versions.txt \
&& echo "PYTORCH_REPO: ${PYTORCH_REPO}" >> /app/versions.txt \
&& echo "PYTORCH_VISION_REPO: ${PYTORCH_VISION_REPO}" >> /app/versions.txt \
&& echo "FA_BRANCH: ${FA_BRANCH}" >> /app/versions.txt \
&& echo "FA_REPO: ${FA_REPO}" >> /app/versions.txt

View File

@@ -1,4 +1,4 @@
ARG NIGHTLY_DATE="20240828" ARG NIGHTLY_DATE="20250124"
ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE" ARG BASE_IMAGE="us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_$NIGHTLY_DATE"
FROM $BASE_IMAGE FROM $BASE_IMAGE
@@ -9,12 +9,6 @@ RUN apt-get update && apt-get install -y \
git \ git \
ffmpeg libsm6 libxext6 libgl1 ffmpeg libsm6 libxext6 libgl1
# Install the TPU and Pallas dependencies.
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
# Build vLLM. # Build vLLM.
COPY . . COPY . .
ARG GIT_REPO_CHECK=0 ARG GIT_REPO_CHECK=0
@@ -25,8 +19,10 @@ ENV VLLM_TARGET_DEVICE="tpu"
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=.git,target=.git \ --mount=type=bind,source=.git,target=.git \
python3 -m pip install \ python3 -m pip install \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
-r requirements-tpu.txt -r requirements-tpu.txt
RUN python3 setup.py develop RUN python3 setup.py develop
# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils
CMD ["/bin/bash"] CMD ["/bin/bash"]

View File

@@ -30,9 +30,19 @@ COPY requirements-common.txt /workspace/vllm/requirements-common.txt
RUN --mount=type=cache,target=/root/.cache/pip \ RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-cache-dir \ pip install --no-cache-dir \
--extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ \
-r requirements-xpu.txt -r requirements-xpu.txt
RUN git clone https://github.com/intel/pti-gpu && \
cd pti-gpu/sdk && \
git checkout 6c491f07a777ed872c2654ca9942f1d0dde0a082 && \
mkdir build && \
cd build && \
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchains/icpx_toolchain.cmake -DBUILD_TESTING=OFF .. && \
make -j && \
cmake --install . --config Release --prefix "/usr/local"
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/"
COPY . . COPY . .
ARG GIT_REPO_CHECK ARG GIT_REPO_CHECK
RUN --mount=type=bind,source=.git,target=.git \ RUN --mount=type=bind,source=.git,target=.git \
@@ -54,5 +64,6 @@ RUN --mount=type=cache,target=/root/.cache/pip \
ENV VLLM_USAGE_SOURCE production-docker-image \ ENV VLLM_USAGE_SOURCE production-docker-image \
TRITON_XPU_PROFILE 1 TRITON_XPU_PROFILE 1
# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"] ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@@ -10,13 +10,18 @@ Easy, fast, and cheap LLM serving for everyone
</h3> </h3>
<p align="center"> <p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> | | <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p> </p>
---
*Latest News* 🔥 *Latest News* 🔥
- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing).
- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there! - [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog?tab.day=20241001&search.sessiontracks=1719251906298001uzJ2) from other vLLM contributors and users! - [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users!
- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing). - [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing). - [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html). - [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
@@ -31,10 +36,12 @@ Easy, fast, and cheap LLM serving for everyone
## About ## About
vLLM is a fast and easy-to-use library for LLM inference and serving. vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with: vLLM is fast with:
- State-of-the-art serving throughput - State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention** - Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests - Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph - Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8. - Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
@@ -57,7 +64,7 @@ vLLM is flexible and easy to use with:
vLLM seamlessly supports most popular open-source models on HuggingFace, including: vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama) - Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral) - Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g. E5-Mistral) - Embedding Models (e.g. E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA) - Multi-modal LLMs (e.g., LLaVA)
@@ -65,16 +72,16 @@ Find the full list of supported models [here](https://docs.vllm.ai/en/latest/mod
## Getting Started ## Getting Started
Install vLLM with `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source): Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
```bash ```bash
pip install vllm pip install vllm
``` ```
Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more. Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html) - [Installation](https://docs.vllm.ai/en/latest/getting_started/installation/index.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) - [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) - [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
## Contributing ## Contributing
@@ -87,27 +94,33 @@ vLLM is a community project. Our compute resources for development and testing a
<!-- Note: Please sort them in alphabetical order. --> <!-- Note: Please sort them in alphabetical order. -->
<!-- Note: Please keep these consistent with docs/source/community/sponsors.md --> <!-- Note: Please keep these consistent with docs/source/community/sponsors.md -->
Cash Donations:
- a16z - a16z
- Dropbox
- Sequoia Capital
- Skywork AI
- ZhenFund
Compute Resources:
- AMD - AMD
- Anyscale - Anyscale
- AWS - AWS
- Crusoe Cloud - Crusoe Cloud
- Databricks - Databricks
- DeepInfra - DeepInfra
- Dropbox
- Google Cloud - Google Cloud
- Lambda Lab - Lambda Lab
- Nebius
- Novita AI
- NVIDIA - NVIDIA
- Replicate - Replicate
- Roblox - Roblox
- RunPod - RunPod
- Sequoia Capital
- Skywork AI
- Trainy - Trainy
- UC Berkeley - UC Berkeley
- UC San Diego - UC San Diego
- ZhenFund
Slack Sponsor: Anyscale
We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM. We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.
@@ -126,7 +139,10 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
## Contact Us ## Contact Us
* For technical questions and feature requests, please use Github issues or discussions. * For technical questions and feature requests, please use Github issues or discussions.
* For discussing with fellow users, please use Discord. * For discussing with fellow users and coordinating contributions and development, please use Slack.
* For coordinating contributions and development, please use Slack.
* For security disclosures, please use Github's security advisory feature. * For security disclosures, please use Github's security advisory feature.
* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu. * For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.
## Media Kit
* If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).

View File

@@ -4,7 +4,7 @@
If you believe you have found a security vulnerability in vLLM, we encourage you to let us know right away. We will investigate all legitimate reports and do our best to quickly fix the problem. If you believe you have found a security vulnerability in vLLM, we encourage you to let us know right away. We will investigate all legitimate reports and do our best to quickly fix the problem.
Please report security issues privately using [the vulnerability submission form](https://github.com/vllm-project/vllm/security/advisories/new). Please report security issues privately using [the vulnerability submission form](https://github.com/vllm-project/vllm/security/advisories/new). Reports will then be triaged by the [vulnerability management team](https://docs.vllm.ai/en/latest/contributing/vulnerability_management.html).
--- ---

View File

@@ -6,3 +6,14 @@ You can download the dataset by running:
```bash ```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
``` ```
## Downloading the ShareGPT4V dataset
The json file refers to several image datasets (coco, llava, etc.). The benchmark scripts
will ignore a datapoint if the referred image is missing.
```bash
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
mkdir coco -p
wget http://images.cocodataset.org/zips/train2017.zip -O coco/train2017.zip
unzip coco/train2017.zip -d coco/
```

View File

@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import json import json
import os import os
import sys import sys
@@ -22,8 +24,10 @@ class RequestFuncInput:
prompt_len: int prompt_len: int
output_len: int output_len: int
model: str model: str
model_name: Optional[str] = None
best_of: int = 1 best_of: int = 1
logprobs: Optional[int] = None logprobs: Optional[int] = None
extra_body: Optional[dict] = None
multi_modal_content: Optional[dict] = None multi_modal_content: Optional[dict] = None
ignore_eos: bool = False ignore_eos: bool = False
@@ -33,9 +37,11 @@ class RequestFuncOutput:
generated_text: str = "" generated_text: str = ""
success: bool = False success: bool = False
latency: float = 0.0 latency: float = 0.0
output_tokens: int = 0
ttft: float = 0.0 # Time to first token ttft: float = 0.0 # Time to first token
itl: List[float] = field( itl: List[float] = field(
default_factory=list) # List of inter-token latencies default_factory=list) # List of inter-token latencies
tpot: float = 0.0 # avg next-token latencies
prompt_len: int = 0 prompt_len: int = 0
error: str = "" error: str = ""
@@ -47,13 +53,15 @@ async def async_request_tgi(
api_url = request_func_input.api_url api_url = request_func_input.api_url
assert api_url.endswith("generate_stream") assert api_url.endswith("generate_stream")
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
params = { params = {
"best_of": request_func_input.best_of, "best_of": request_func_input.best_of,
"max_new_tokens": request_func_input.output_len, "max_new_tokens": request_func_input.output_len,
"do_sample": True, "do_sample": True,
"temperature": 0.01, # TGI does not accept 0.0 temperature. "temperature": 0.01, # TGI does not accept 0.0 temperature.
"top_p": 0.99, # TGI does not accept 1.0 top_p. "top_p": 0.99, # TGI does not accept 1.0 top_p.
"truncate": request_func_input.prompt_len,
# TGI does not accept ignore_eos flag. # TGI does not accept ignore_eos flag.
} }
payload = { payload = {
@@ -75,11 +83,11 @@ async def async_request_tgi(
continue continue
chunk_bytes = chunk_bytes.decode("utf-8") chunk_bytes = chunk_bytes.decode("utf-8")
#NOTE: Sometimes TGI returns a ping response without # NOTE: Sometimes TGI returns a ping response without
# any data, we should skip it. # any data, we should skip it.
if chunk_bytes.startswith(":"): if chunk_bytes.startswith(":"):
continue continue
chunk = remove_prefix(chunk_bytes, "data:") chunk = chunk_bytes.removeprefix("data:")
data = json.loads(chunk) data = json.loads(chunk)
timestamp = time.perf_counter() timestamp = time.perf_counter()
@@ -118,7 +126,8 @@ async def async_request_trt_llm(
api_url = request_func_input.api_url api_url = request_func_input.api_url
assert api_url.endswith("generate_stream") assert api_url.endswith("generate_stream")
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
assert request_func_input.best_of == 1 assert request_func_input.best_of == 1
payload = { payload = {
"accumulate_tokens": True, "accumulate_tokens": True,
@@ -144,15 +153,15 @@ async def async_request_trt_llm(
if not chunk_bytes: if not chunk_bytes:
continue continue
chunk = remove_prefix(chunk_bytes.decode("utf-8"), chunk = chunk_bytes.decode("utf-8").removeprefix(
"data:") "data:")
data = json.loads(chunk) data = json.loads(chunk)
output.generated_text += data["text_output"] output.generated_text += data["text_output"]
timestamp = time.perf_counter() timestamp = time.perf_counter()
# First token # First token
if ttft == 0.0: if ttft == 0.0:
ttft = time.perf_counter() - st ttft = timestamp - st
output.ttft = ttft output.ttft = ttft
# Decoding phase # Decoding phase
@@ -182,7 +191,8 @@ async def async_request_deepspeed_mii(
request_func_input: RequestFuncInput, request_func_input: RequestFuncInput,
pbar: Optional[tqdm] = None, pbar: Optional[tqdm] = None,
) -> RequestFuncOutput: ) -> RequestFuncOutput:
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
assert request_func_input.best_of == 1 assert request_func_input.best_of == 1
payload = { payload = {
@@ -230,17 +240,25 @@ async def async_request_openai_completions(
("completions", "profile") ("completions", "profile")
), "OpenAI Completions API URL must end with 'completions' or 'profile'." ), "OpenAI Completions API URL must end with 'completions' or 'profile'."
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
payload = { payload = {
"model": request_func_input.model, "model": request_func_input.model_name \
if request_func_input.model_name else request_func_input.model,
"prompt": request_func_input.prompt, "prompt": request_func_input.prompt,
"temperature": 0.0, "temperature": 0.0,
"best_of": request_func_input.best_of, "best_of": request_func_input.best_of,
"max_tokens": request_func_input.output_len, "max_tokens": request_func_input.output_len,
"logprobs": request_func_input.logprobs, "logprobs": request_func_input.logprobs,
"stream": True, "stream": True,
"ignore_eos": request_func_input.ignore_eos, "stream_options": {
"include_usage": True,
},
} }
if request_func_input.ignore_eos:
payload["ignore_eos"] = request_func_input.ignore_eos
if request_func_input.extra_body:
payload.update(request_func_input.extra_body)
headers = { headers = {
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}" "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
} }
@@ -249,32 +267,34 @@ async def async_request_openai_completions(
output.prompt_len = request_func_input.prompt_len output.prompt_len = request_func_input.prompt_len
generated_text = "" generated_text = ""
ttft = 0.0
st = time.perf_counter() st = time.perf_counter()
most_recent_timestamp = st most_recent_timestamp = st
try: try:
async with session.post(url=api_url, json=payload, async with session.post(url=api_url, json=payload,
headers=headers) as response: headers=headers) as response:
if response.status == 200: if response.status == 200:
first_chunk_received = False
async for chunk_bytes in response.content: async for chunk_bytes in response.content:
chunk_bytes = chunk_bytes.strip() chunk_bytes = chunk_bytes.strip()
if not chunk_bytes: if not chunk_bytes:
continue continue
chunk = remove_prefix(chunk_bytes.decode("utf-8"), chunk = chunk_bytes.decode("utf-8").removeprefix(
"data: ") "data: ")
if chunk == "[DONE]": if chunk != "[DONE]":
latency = time.perf_counter() - st
else:
data = json.loads(chunk) data = json.loads(chunk)
# NOTE: Some completion API might have a last # NOTE: Some completion API might have a last
# usage summary response without a token so we # usage summary response without a token so we
# want to check a token was generated # want to check a token was generated
if data["choices"][0]["text"]: if choices := data.get("choices"):
# Note that text could be empty here
# e.g. for special tokens
text = choices[0].get("text")
timestamp = time.perf_counter() timestamp = time.perf_counter()
# First token # First token
if ttft == 0.0: if not first_chunk_received:
first_chunk_received = True
ttft = time.perf_counter() - st ttft = time.perf_counter() - st
output.ttft = ttft output.ttft = ttft
@@ -284,11 +304,19 @@ async def async_request_openai_completions(
most_recent_timestamp) most_recent_timestamp)
most_recent_timestamp = timestamp most_recent_timestamp = timestamp
generated_text += data["choices"][0]["text"] generated_text += text or ""
elif usage := data.get("usage"):
output.output_tokens = usage.get(
"completion_tokens")
if first_chunk_received:
output.success = True
else:
output.success = False
output.error = (
"Never received a valid chunk to calculate TTFT."
"This response will be marked as failed!")
output.generated_text = generated_text output.generated_text = generated_text
output.success = True output.latency = most_recent_timestamp - st
output.latency = latency
else: else:
output.error = response.reason or "" output.error = response.reason or ""
output.success = False output.success = False
@@ -311,12 +339,14 @@ async def async_request_openai_chat_completions(
"chat/completions" "chat/completions"
), "OpenAI Chat Completions API URL must end with 'chat/completions'." ), "OpenAI Chat Completions API URL must end with 'chat/completions'."
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session: async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
content = [{"type": "text", "text": request_func_input.prompt}] content = [{"type": "text", "text": request_func_input.prompt}]
if request_func_input.multi_modal_content: if request_func_input.multi_modal_content:
content.append(request_func_input.multi_modal_content) content.append(request_func_input.multi_modal_content)
payload = { payload = {
"model": request_func_input.model, "model": request_func_input.model_name \
if request_func_input.model_name else request_func_input.model,
"messages": [ "messages": [
{ {
"role": "user", "role": "user",
@@ -324,10 +354,16 @@ async def async_request_openai_chat_completions(
}, },
], ],
"temperature": 0.0, "temperature": 0.0,
"max_tokens": request_func_input.output_len, "max_completion_tokens": request_func_input.output_len,
"stream": True, "stream": True,
"ignore_eos": request_func_input.ignore_eos, "stream_options": {
"include_usage": True,
},
} }
if request_func_input.ignore_eos:
payload["ignore_eos"] = request_func_input.ignore_eos
if request_func_input.extra_body:
payload.update(request_func_input.extra_body)
headers = { headers = {
"Content-Type": "application/json", "Content-Type": "application/json",
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}", "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
@@ -349,19 +385,17 @@ async def async_request_openai_chat_completions(
if not chunk_bytes: if not chunk_bytes:
continue continue
chunk = remove_prefix(chunk_bytes.decode("utf-8"), chunk = chunk_bytes.decode("utf-8").removeprefix(
"data: ") "data: ")
if chunk == "[DONE]": if chunk != "[DONE]":
latency = time.perf_counter() - st
else:
timestamp = time.perf_counter() timestamp = time.perf_counter()
data = json.loads(chunk) data = json.loads(chunk)
delta = data["choices"][0]["delta"] if choices := data.get("choices"):
if delta.get("content", None): content = choices[0]["delta"].get("content")
# First token # First token
if ttft == 0.0: if ttft == 0.0:
ttft = time.perf_counter() - st ttft = timestamp - st
output.ttft = ttft output.ttft = ttft
# Decoding phase # Decoding phase
@@ -369,13 +403,16 @@ async def async_request_openai_chat_completions(
output.itl.append(timestamp - output.itl.append(timestamp -
most_recent_timestamp) most_recent_timestamp)
generated_text += delta["content"] generated_text += content or ""
elif usage := data.get("usage"):
output.output_tokens = usage.get(
"completion_tokens")
most_recent_timestamp = timestamp most_recent_timestamp = timestamp
output.generated_text = generated_text output.generated_text = generated_text
output.success = True output.success = True
output.latency = latency output.latency = most_recent_timestamp - st
else: else:
output.error = response.reason or "" output.error = response.reason or ""
output.success = False output.success = False
@@ -389,14 +426,6 @@ async def async_request_openai_chat_completions(
return output return output
# Since vllm must support Python 3.8, we can't use str.removeprefix(prefix)
# introduced in Python 3.9
def remove_prefix(text: str, prefix: str) -> str:
if text.startswith(prefix):
return text[len(prefix):]
return text
def get_model(pretrained_model_name_or_path: str) -> str: def get_model(pretrained_model_name_or_path: str) -> str:
if os.getenv('VLLM_USE_MODELSCOPE', 'False').lower() == 'true': if os.getenv('VLLM_USE_MODELSCOPE', 'False').lower() == 'true':
from modelscope import snapshot_download from modelscope import snapshot_download
@@ -411,14 +440,35 @@ def get_model(pretrained_model_name_or_path: str) -> str:
def get_tokenizer( def get_tokenizer(
pretrained_model_name_or_path: str, trust_remote_code: bool pretrained_model_name_or_path: str,
tokenizer_mode: str = "auto",
trust_remote_code: bool = False,
**kwargs,
) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]: ) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
if pretrained_model_name_or_path is not None and not os.path.exists( if pretrained_model_name_or_path is not None and not os.path.exists(
pretrained_model_name_or_path): pretrained_model_name_or_path):
pretrained_model_name_or_path = get_model( pretrained_model_name_or_path = get_model(
pretrained_model_name_or_path) pretrained_model_name_or_path)
return AutoTokenizer.from_pretrained(pretrained_model_name_or_path, if tokenizer_mode == "slow":
trust_remote_code=trust_remote_code) if kwargs.get("use_fast", False):
raise ValueError(
"Cannot use the fast tokenizer in slow tokenizer mode.")
kwargs["use_fast"] = False
if tokenizer_mode == "mistral":
try:
from vllm.transformers_utils.tokenizer import MistralTokenizer
except ImportError as e:
raise ImportError("MistralTokenizer requires vllm package.\n"
"Please install it with `pip install vllm` "
"to use mistral tokenizer mode.") from e
return MistralTokenizer.from_pretrained(
str(pretrained_model_name_or_path))
else:
return AutoTokenizer.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=trust_remote_code,
**kwargs,
)
ASYNC_REQUEST_FUNCS = { ASYNC_REQUEST_FUNCS = {

View File

@@ -0,0 +1,495 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark guided decoding throughput."""
import argparse
import dataclasses
import json
import os
import random
import time
from typing import List
import datasets
import pandas as pd
import uvloop
from transformers import AutoTokenizer, PreTrainedTokenizerBase
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
from vllm.entrypoints.openai.api_server import (
build_async_engine_client_from_engine_args)
from vllm.sampling_params import GuidedDecodingParams
from vllm.utils import FlexibleArgumentParser, merge_async_iterators
@dataclasses.dataclass
class SampleRequest:
"""A class representing a single inference request for benchmarking.
Attributes:
prompt: The input text prompt for the model.
multi_modal_data: Optional dictionary containing multi-modal data (e.g.
images).
prompt_len: The length of the prompt in tokens.
expected_output_len: The expected length of the output in tokens.
"""
prompt: str
prompt_len: int
expected_output_len: int
schema: dict
structure_type: str = 'json'
completion: str = None
def run_vllm(requests: List[SampleRequest],
engine_args: EngineArgs,
n: int,
guided_decoding_rate: float = 1.0,
warmup: bool = False) -> float:
from vllm import LLM, SamplingParams
llm = LLM(**vars(engine_args))
# Add the requests to the engine.
prompts: List[str] = []
sampling_params: List[SamplingParams] = []
# create a list containing random selected true or false
guided_decoding_req_idx = random.sample(
range(len(requests)), int(len(requests) * guided_decoding_rate))
if warmup:
print(">>>>> Running warmup prompt, for the first 5")
# We setup the first 5 requests to warmup FSM
# if using xgrammar dataset, we will skip warmup
warmup_requests = requests[:5]
for i, request in enumerate(warmup_requests):
prompts.append(request.prompt)
sampling_params.append(
SamplingParams(
n=n,
temperature=1.0,
top_p=1.0,
ignore_eos=True,
max_tokens=request.expected_output_len,
guided_decoding=GuidedDecodingParams(json=request.schema)
if guided_decoding_rate > 0 else None,
))
llm.generate(prompts, sampling_params, use_tqdm=False)
print(">>>>> Benchmark started...")
prompts = []
sampling_params = []
for i, request in enumerate(requests):
prompts.append(request.prompt)
sampling_params.append(
SamplingParams(
n=n,
temperature=1.0,
top_p=1.0,
ignore_eos=True,
max_tokens=request.expected_output_len,
guided_decoding=GuidedDecodingParams(
**{request.structure_type: request.schema})
if i in guided_decoding_req_idx else None,
))
start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
ret = []
for output, request in zip(outputs, requests):
generated_text = output.outputs[0].text
ret.append({
"generated": generated_text,
"expected": request.completion
})
end = time.perf_counter()
return end - start, ret
async def run_vllm_async(
requests: List[SampleRequest],
engine_args: AsyncEngineArgs,
n: int,
guided_decoding_rate: float = 1.0,
warmup: bool = False,
disable_frontend_multiprocessing: bool = False) -> float:
from vllm import SamplingParams
async with build_async_engine_client_from_engine_args(
engine_args, disable_frontend_multiprocessing) as llm:
# Add the requests to the engine.
prompts: List[str] = []
sampling_params: List[SamplingParams] = []
guided_decoding_req_idx = random.sample(
range(len(requests)), int(len(requests) * guided_decoding_rate))
if warmup:
print(">>>>>> Running warmup prompt, for the first 5")
# We setup the first 5 requests to warmup FSM
# if using xgrammar dataset, we will skip warmup
warmup_requests = requests[:5]
for i, request in enumerate(warmup_requests):
prompts.append(request.prompt)
sampling_params.append(
SamplingParams(
n=n,
temperature=1.0,
top_p=1.0,
ignore_eos=True,
max_tokens=request.expected_output_len,
guided_decoding=GuidedDecodingParams(
json=request.schema)
if guided_decoding_rate > 0 else None,
))
generators = []
for i, (prompt, sp) in enumerate(zip(prompts, sampling_params)):
generator = llm.generate(prompt, sp, request_id=f"test{i}")
generators.append(generator)
all_gens = merge_async_iterators(*generators)
async for i, res in all_gens:
pass
print(">>>>> Benchmark started...")
prompts = []
sampling_params = []
for i, request in enumerate(requests):
prompts.append(request.prompt)
sampling_params.append(
SamplingParams(
n=n,
temperature=1.0,
top_p=1.0,
ignore_eos=True,
max_tokens=request.expected_output_len,
guided_decoding=GuidedDecodingParams(json=request.schema)
if i in guided_decoding_req_idx else None,
))
generators = []
start_time = []
latencies = []
start = time.perf_counter()
for i, (prompt, sp) in enumerate(zip(prompts, sampling_params)):
generator = llm.generate(prompt, sp, request_id=f"test{i}")
generators.append(generator)
start_time.append(time.perf_counter())
latencies.append([])
all_gens = merge_async_iterators(*generators)
generated_texts = [''] * len(requests)
async for i, res in all_gens:
generated_texts[i] = res.outputs[0].text
lat = time.perf_counter() - start_time[i]
latencies[i].append(lat)
ret = [{
'generated': gt,
'expected': req.completion
} for gt, req in zip(generated_texts, requests)]
end = time.perf_counter()
first_latency = pd.Series([lat[0] * 1000 for lat in latencies])
next_latency = pd.Series([(lat[-1] - lat[0]) / len(lat[1:]) * 1000
for lat in latencies])
return end - start, ret, (first_latency, next_latency)
def sample_requests(tokenizer: PreTrainedTokenizerBase,
args: argparse.Namespace) -> List[SampleRequest]:
if args.dataset == 'json':
if args.json_schema_path is None:
dir_path = os.path.dirname(os.path.realpath(__file__))
args.json_schema_path = os.path.join(dir_path,
"structured_schemas",
"structured_schema_1.json")
with open(args.json_schema_path) as f:
schema = json.load(f)
prompt = f"Generate an example of a user profile given the following schema: {json.dumps(schema)}" # noqa: E501
input_len = len(tokenizer(prompt).input_ids)
print(f"Input length of the prompt: {input_len} tokens")
requests = [
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=schema,
structure_type=args.structure_type)
for _ in range(args.num_prompts)
]
elif args.dataset == "grammar":
schema = """
?start: select_statement
?select_statement: "SELECT " column_list " FROM " table_name
?column_list: column_name ("," column_name)*
?table_name: identifier
?column_name: identifier
?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
"""
prompt = "Generate an SQL query to show the 'username' \
and 'email' from the 'users' table."
input_len = len(tokenizer(prompt).input_ids)
print(f"Input length of the prompt: {input_len} tokens")
requests = [
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=schema,
structure_type=args.structure_type)
for _ in range(args.num_prompts)
]
elif args.dataset == "regex":
regex = r"\w+@\w+\.com\n"
args.regex = regex
prompt = "Generate an email address for Alan Turing, \
who works in Enigma. End in .com and new line. \
Example result: alan.turing@enigma.com\n"
input_len = len(tokenizer(prompt).input_ids)
print(f"Input length of the prompt: {input_len} tokens")
requests = [
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=regex,
structure_type=args.structure_type)
for _ in range(args.num_prompts)
]
elif args.dataset == "choice":
choice = ["Positive", "Negative"]
args.choice = choice
prompt = "Classify this sentiment: vLLM is wonderful!"
input_len = len(tokenizer(prompt).input_ids)
print(f"Input length of the prompt: {input_len} tokens")
requests = [
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=choice,
structure_type=args.structure_type)
for _ in range(args.num_prompts)
]
elif args.dataset == "xgrammar_bench":
args.warmup = False
requests: List[SampleRequest] = []
dataset = datasets.load_dataset("NousResearch/json-mode-eval",
split="train")
print(f"dataset has {len(dataset)} entries")
len_dataset = len(dataset)
for data_point_idx in range(args.num_prompts):
idx = data_point_idx
while idx >= len_dataset:
idx -= len_dataset
schema = dataset["schema"][idx]
prompt = tokenizer.apply_chat_template(dataset["prompt"][idx],
tokenize=False)
input_len = len(tokenizer(prompt).input_ids)
completion = dataset["completion"][idx]
requests.append(
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=schema,
completion=completion))
return requests
def evaluate(ret, args):
def _eval_correctness_json(expected, actual):
# extract json string from string using regex
import re
actual = actual.replace('\n', '').replace(' ', '').strip()
try:
actual = re.search(r'\{.*\}', actual).group()
actual = json.loads(actual)
except Exception:
return False
return True
def _eval_correctness_choice(expected, actual):
return actual in args.choice
def _eval_correctness_regex(expected, actual):
import re
return re.match(args.regex, actual) is not None
def _eval_correctness(expected, actual):
if args.structure_type == 'json':
return _eval_correctness_json(expected, actual)
elif args.structure_type == 'regex':
return _eval_correctness_regex(expected, actual)
elif args.structure_type == 'choice':
return _eval_correctness_choice(expected, actual)
else:
return None
scores = []
for res in ret:
score = _eval_correctness(res['expected'], res['generated'])
res['correctness'] = score
scores.append(score)
not_none_scores = [score for score in scores if score is not None]
return (sum(not_none_scores) / len(not_none_scores) *
100) if len(not_none_scores) > 0 else None
def main(args: argparse.Namespace):
print(args)
random.seed(args.seed)
# async engine is working for 'regex', 'choice' and 'grammar'
if args.dataset == 'grammar':
args.structure_type = 'grammar'
args.async_engine = False
elif args.dataset == 'regex':
args.structure_type = 'regex'
args.async_engine = False
elif args.dataset == 'choice':
args.structure_type = 'choice'
args.async_engine = False
else:
args.structure_type = 'json'
if args.no_guided_decoding:
args.guided_decoding_ratio = 0
if args.save_results:
result_file_name = f'{args.guided_decoding_ratio}guided'
result_file_name += f"_{args.model.split('/')[-1]}"
result_file_name += f"_{args.dataset}"
result_file_name += f"_{args.num_prompts}"
result_file_name += f"_out{args.output_len}"
result_file_name += f"_async{args.async_engine}"
result_file_name += f"_warmup{args.warmup}"
result_file_name += f"_chunkedprefill{args.enable_chunked_prefill}"
result_file_name += ".txt"
else:
result_file_name = None
# Synthesize a prompt with the given input length.
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer, trust_remote_code=args.trust_remote_code)
requests = sample_requests(tokenizer, args)
if args.async_engine:
engine_args = AsyncEngineArgs.from_cli_args(args)
elapsed_time, ret, (first_latency, next_latency) = uvloop.run(
run_vllm_async(requests, engine_args, args.n,
args.guided_decoding_ratio, args.warmup,
args.disable_frontend_multiprocessing))
else:
engine_args = EngineArgs.from_cli_args(args)
elapsed_time, ret = run_vllm(requests, engine_args, args.n,
args.guided_decoding_ratio, args.warmup)
first_latency, next_latency = None, None
score = evaluate(ret, args)
total_num_tokens = sum(request.prompt_len + request.expected_output_len
for request in requests)
total_output_tokens = sum(request.expected_output_len
for request in requests)
if first_latency is not None:
latency_breakdown = "\nFirst token latency(msecs):\n"
latency_breakdown += f"{first_latency.describe()}"
latency_breakdown += "\nNext token latency(msecs):\n"
latency_breakdown += f"{next_latency.describe()}"
print(
f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
f"{total_output_tokens / elapsed_time:.2f} output tokens/s",
f"Correct rate is {score} %",
f"{latency_breakdown if first_latency is not None else ''}")
# Output JSON results if specified
if args.output_json or result_file_name:
results = {
"elapsed_time": elapsed_time,
"num_requests": len(requests),
"total_num_tokens": total_num_tokens,
"total_output_tokens": total_output_tokens,
"requests_per_second": len(requests) / elapsed_time,
"tokens_per_second": f"{total_num_tokens / elapsed_time:.2f}",
"output_tokens_per_second":
f"{total_output_tokens / elapsed_time:.2f}",
"correct_rate(%)": score
}
results = {"outputs": ret, **results}
if first_latency is not None:
results["first_token_latency(msecs)"] = first_latency.describe(
).to_dict()
results["next_token_latency(msecs)"] = next_latency.describe(
).to_dict()
if args.output_json:
with open(args.output_json, "w") as f:
json.dump(results, f, indent=4)
elif result_file_name:
with open(result_file_name, "w") as f:
json.dump(results, f, indent=4)
if __name__ == "__main__":
parser = FlexibleArgumentParser(description="Benchmark guided decoding.")
parser = AsyncEngineArgs.add_cli_args(parser)
parser.add_argument("--output-len",
type=int,
default=512,
help="Output length for each request. Overrides the "
"output length from the dataset.")
parser.add_argument(
"--dataset",
default='json',
choices=['json', 'grammar', 'regex', 'choice', 'xgrammar_bench'])
parser.add_argument("--json_schema_path",
type=str,
default=None,
help="Path to json schema.")
parser.add_argument("--n",
type=int,
default=1,
help="Number of generated sequences per prompt.")
parser.add_argument("--num-prompts",
type=int,
default=10,
help="Number of prompts to process.")
parser.add_argument(
'--output-json',
type=str,
default=None,
help='Path to save the throughput results in JSON format.')
parser.add_argument("--async-engine",
action='store_true',
default=False,
help="Use vLLM async engine rather than LLM class.")
parser.add_argument("--no-guided-decoding",
action='store_true',
default=False,
help="Whether to disable JSON decoding or not.")
parser.add_argument("--guided-decoding-ratio",
type=float,
default=1.0,
help="Ratio of Guided Decoding requests")
parser.add_argument("--disable-frontend-multiprocessing",
action='store_true',
default=False,
help="Disable decoupled async engine frontend.")
parser.add_argument("--warmup",
action="store_true",
default=False,
help="Run warmup prompts before benchmark.")
parser.add_argument("--save-results",
action="store_true",
default=False,
help="save output results.")
args = parser.parse_args()
if args.tokenizer is None:
args.tokenizer = args.model
main(args)

View File

@@ -1,5 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark the latency of processing a single batch of requests.""" """Benchmark the latency of processing a single batch of requests."""
import argparse import argparse
import dataclasses
import json import json
import time import time
from pathlib import Path from pathlib import Path
@@ -10,43 +12,20 @@ import torch
from tqdm import tqdm from tqdm import tqdm
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import DEVICE_OPTIONS, EngineArgs from vllm.engine.arg_utils import EngineArgs
from vllm.inputs import PromptType from vllm.inputs import PromptType
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS from vllm.sampling_params import BeamSearchParams
from vllm.utils import FlexibleArgumentParser from vllm.utils import FlexibleArgumentParser
def main(args: argparse.Namespace): def main(args: argparse.Namespace):
print(args) print(args)
engine_args = EngineArgs.from_cli_args(args)
# NOTE(woosuk): If the request cannot be processed in a single batch, # NOTE(woosuk): If the request cannot be processed in a single batch,
# the engine will automatically process the request in multiple batches. # the engine will automatically process the request in multiple batches.
llm = LLM( llm = LLM(**dataclasses.asdict(engine_args))
model=args.model,
speculative_model=args.speculative_model,
num_speculative_tokens=args.num_speculative_tokens,
speculative_draft_tensor_parallel_size=\
args.speculative_draft_tensor_parallel_size,
tokenizer=args.tokenizer,
quantization=args.quantization,
tensor_parallel_size=args.tensor_parallel_size,
trust_remote_code=args.trust_remote_code,
dtype=args.dtype,
max_model_len=args.max_model_len,
enforce_eager=args.enforce_eager,
kv_cache_dtype=args.kv_cache_dtype,
quantization_param_path=args.quantization_param_path,
device=args.device,
ray_workers_use_nsight=args.ray_workers_use_nsight,
enable_chunked_prefill=args.enable_chunked_prefill,
download_dir=args.download_dir,
block_size=args.block_size,
gpu_memory_utilization=args.gpu_memory_utilization,
load_format=args.load_format,
distributed_executor_backend=args.distributed_executor_backend,
otlp_traces_endpoint=args.otlp_traces_endpoint,
enable_prefix_caching=args.enable_prefix_caching,
)
sampling_params = SamplingParams( sampling_params = SamplingParams(
n=args.n, n=args.n,
@@ -63,6 +42,20 @@ def main(args: argparse.Namespace):
"prompt_token_ids": batch "prompt_token_ids": batch
} for batch in dummy_prompt_token_ids.tolist()] } for batch in dummy_prompt_token_ids.tolist()]
def llm_generate():
if not args.use_beam_search:
llm.generate(dummy_prompts,
sampling_params=sampling_params,
use_tqdm=False)
else:
llm.beam_search(
dummy_prompts,
BeamSearchParams(
beam_width=args.n,
max_tokens=args.output_len,
ignore_eos=True,
))
def run_to_completion(profile_dir: Optional[str] = None): def run_to_completion(profile_dir: Optional[str] = None):
if profile_dir: if profile_dir:
with torch.profiler.profile( with torch.profiler.profile(
@@ -72,15 +65,11 @@ def main(args: argparse.Namespace):
], ],
on_trace_ready=torch.profiler.tensorboard_trace_handler( on_trace_ready=torch.profiler.tensorboard_trace_handler(
str(profile_dir))) as p: str(profile_dir))) as p:
llm.generate(dummy_prompts, llm_generate()
sampling_params=sampling_params, print(p.key_averages().table(sort_by="self_cuda_time_total"))
use_tqdm=False)
print(p.key_averages())
else: else:
start_time = time.perf_counter() start_time = time.perf_counter()
llm.generate(dummy_prompts, llm_generate()
sampling_params=sampling_params,
use_tqdm=False)
end_time = time.perf_counter() end_time = time.perf_counter()
latency = end_time - start_time latency = end_time - start_time
return latency return latency
@@ -125,19 +114,6 @@ if __name__ == '__main__':
parser = FlexibleArgumentParser( parser = FlexibleArgumentParser(
description='Benchmark the latency of processing a single batch of ' description='Benchmark the latency of processing a single batch of '
'requests till completion.') 'requests till completion.')
parser.add_argument('--model', type=str, default='facebook/opt-125m')
parser.add_argument('--speculative-model', type=str, default=None)
parser.add_argument('--num-speculative-tokens', type=int, default=None)
parser.add_argument('--speculative-draft-tensor-parallel-size',
'-spec-draft-tp',
type=int,
default=None)
parser.add_argument('--tokenizer', type=str, default=None)
parser.add_argument('--quantization',
'-q',
choices=[*QUANTIZATION_METHODS, None],
default=None)
parser.add_argument('--tensor-parallel-size', '-tp', type=int, default=1)
parser.add_argument('--input-len', type=int, default=32) parser.add_argument('--input-len', type=int, default=32)
parser.add_argument('--output-len', type=int, default=128) parser.add_argument('--output-len', type=int, default=128)
parser.add_argument('--batch-size', type=int, default=8) parser.add_argument('--batch-size', type=int, default=8)
@@ -154,45 +130,6 @@ if __name__ == '__main__':
type=int, type=int,
default=30, default=30,
help='Number of iterations to run.') help='Number of iterations to run.')
parser.add_argument('--trust-remote-code',
action='store_true',
help='trust remote code from huggingface')
parser.add_argument(
'--max-model-len',
type=int,
default=None,
help='Maximum length of a sequence (including prompt and output). '
'If None, will be derived from the model.')
parser.add_argument(
'--dtype',
type=str,
default='auto',
choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
help='data type for model weights and activations. '
'The "auto" option will use FP16 precision '
'for FP32 and FP16 models, and BF16 precision '
'for BF16 models.')
parser.add_argument('--enforce-eager',
action='store_true',
help='enforce eager mode and disable CUDA graph')
parser.add_argument(
'--kv-cache-dtype',
type=str,
choices=['auto', 'fp8', 'fp8_e5m2', 'fp8_e4m3'],
default="auto",
help='Data type for kv cache storage. If "auto", will use model '
'data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. '
'ROCm (AMD GPU) supports fp8 (=fp8_e4m3)')
parser.add_argument(
'--quantization-param-path',
type=str,
default=None,
help='Path to the JSON file containing the KV cache scaling factors. '
'This should generally be supplied, when KV cache dtype is FP8. '
'Otherwise, KV cache scaling factors default to 1.0, which may cause '
'accuracy issues. FP8_E5M2 (without scaling) is only supported on '
'cuda version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is '
'instead supported for common inference criteria.')
parser.add_argument( parser.add_argument(
'--profile', '--profile',
action='store_true', action='store_true',
@@ -203,78 +140,12 @@ if __name__ == '__main__':
default=None, default=None,
help=('path to save the pytorch profiler output. Can be visualized ' help=('path to save the pytorch profiler output. Can be visualized '
'with ui.perfetto.dev or Tensorboard.')) 'with ui.perfetto.dev or Tensorboard.'))
parser.add_argument("--device",
type=str,
default="auto",
choices=DEVICE_OPTIONS,
help='device type for vLLM execution')
parser.add_argument('--block-size',
type=int,
default=16,
help='block size of key/value cache')
parser.add_argument(
'--enable-chunked-prefill',
action='store_true',
help='If True, the prefill requests can be chunked based on the '
'max_num_batched_tokens')
parser.add_argument("--enable-prefix-caching",
action='store_true',
help="Enable automatic prefix caching")
parser.add_argument(
"--ray-workers-use-nsight",
action='store_true',
help="If specified, use nsight to profile ray workers",
)
parser.add_argument('--download-dir',
type=str,
default=None,
help='directory to download and load the weights, '
'default to the default cache dir of huggingface')
parser.add_argument( parser.add_argument(
'--output-json', '--output-json',
type=str, type=str,
default=None, default=None,
help='Path to save the latency results in JSON format.') help='Path to save the latency results in JSON format.')
parser.add_argument('--gpu-memory-utilization',
type=float, parser = EngineArgs.add_cli_args(parser)
default=0.9,
help='the fraction of GPU memory to be used for '
'the model executor, which can range from 0 to 1.'
'If unspecified, will use the default value of 0.9.')
parser.add_argument(
'--load-format',
type=str,
default=EngineArgs.load_format,
choices=[
'auto', 'pt', 'safetensors', 'npcache', 'dummy', 'tensorizer',
'bitsandbytes'
],
help='The format of the model weights to load.\n\n'
'* "auto" will try to load the weights in the safetensors format '
'and fall back to the pytorch bin format if safetensors format '
'is not available.\n'
'* "pt" will load the weights in the pytorch bin format.\n'
'* "safetensors" will load the weights in the safetensors format.\n'
'* "npcache" will load the weights in pytorch format and store '
'a numpy cache to speed up the loading.\n'
'* "dummy" will initialize the weights with random values, '
'which is mainly for profiling.\n'
'* "tensorizer" will load the weights using tensorizer from '
'CoreWeave. See the Tensorize vLLM Model script in the Examples'
'section for more information.\n'
'* "bitsandbytes" will load the weights using bitsandbytes '
'quantization.\n')
parser.add_argument(
'--distributed-executor-backend',
choices=['ray', 'mp'],
default=None,
help='Backend to use for distributed serving. When more than 1 GPU '
'is used, will be automatically set to "ray" if installed '
'or "mp" (multiprocessing) otherwise.')
parser.add_argument(
'--otlp-traces-endpoint',
type=str,
default=None,
help='Target URL to which OpenTelemetry traces will be sent.')
args = parser.parse_args() args = parser.parse_args()
main(args) main(args)

View File

@@ -0,0 +1,184 @@
# SPDX-License-Identifier: Apache-2.0
"""
Offline benchmark to test the long document QA throughput.
Example usage:
# This workload samples 8 different prompts with a default input
# length of 20000 tokens, then replicates each prompt 2 times
# in random order.
python benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--repeat-count 2
Commandline arguments:
--num-documents: The number of documents to sample prompts from.
--document-length: The length of each document in tokens.
(Optional, default: 20000)
--output-len: The number of tokens to generate for each prompt.
(Optional, default: 10)
--repeat-count: The number of times to repeat each prompt.
(Optional, default: 2)
--repeat-mode: The mode to repeat prompts. The supported modes are:
- 'random': shuffle the prompts randomly. (Default)
- 'tile': the entire prompt list is repeated in sequence. (Potentially
lowest cache hit)
- 'interleave': each prompt is repeated consecutively before
moving to the next element. (Highest cache hit)
--shuffle-seed: Random seed when the repeat mode is "random".
(Optional, default: 0)
In the meantime, it also supports all the vLLM engine args to initialize the
LLM engine. You can refer to the `vllm.engine.arg_utils.EngineArgs` for more
details.
"""
import dataclasses
import random
import time
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser
def test_long_document_qa(llm=None, sampling_params=None, prompts=None):
"""
Test long document QA with the given prompts and sampling parameters.
Print the time spent in processing all the prompts.
Args:
llm: The language model used for generating responses.
sampling_params: Sampling parameter used to generate the response.
prompts: A list of prompt strings to be processed by the LLM.
"""
start_time = time.time()
llm.generate(prompts, sampling_params=sampling_params)
end_time = time.time()
print(f"Time to execute all requests: {end_time - start_time:.4f} secs")
def repeat_prompts(prompts, repeat_count, mode: str):
"""
Repeat each prompt in the list for a specified number of times.
The order of prompts in the output list depends on the mode.
Args:
prompts: A list of prompts to be repeated.
repeat_count: The number of times each prompt is repeated.
mode: The mode of repetition. Supported modes are:
- 'random': Shuffle the prompts randomly after repetition.
- 'tile': Repeat the entire prompt list in sequence.
Example: [1, 2, 3] -> [1, 2, 3, 1, 2, 3].
- 'interleave': Repeat each prompt consecutively before moving to
the next. Example: [1, 2, 3] -> [1, 1, 2, 2, 3, 3].
Returns:
A list of repeated prompts in the specified order.
Raises:
ValueError: If an invalid mode is provided.
"""
print("Repeat mode: ", mode)
if mode == 'random':
repeated_prompts = prompts * repeat_count
random.shuffle(repeated_prompts)
return repeated_prompts
elif mode == 'tile':
return prompts * repeat_count
elif mode == 'interleave':
repeated_prompts = []
for prompt in prompts:
repeated_prompts.extend([prompt] * repeat_count)
return repeated_prompts
else:
raise ValueError(f"Invalid mode: {mode}, only support "
"'random', 'tile', 'interleave'")
def main(args):
random.seed(args.shuffle_seed)
# Prepare the prompts:
# we append the document id at the beginning to avoid any of the document
# being the prefix of other documents
prompts = [
str(i) + ' '.join(['hi'] * args.document_length)
for i in range(args.num_documents)
]
prompts = repeat_prompts(prompts, args.repeat_count, mode=args.repeat_mode)
warmup_prompts = [
"This is warm up request " + str(i) + \
' '.join(['hi'] * args.document_length)
for i in range(args.num_documents)]
# Create the LLM engine
engine_args = EngineArgs.from_cli_args(args)
llm = LLM(**dataclasses.asdict(engine_args))
sampling_params = SamplingParams(temperature=0, max_tokens=args.output_len)
print("------warm up------")
test_long_document_qa(
llm=llm,
prompts=warmup_prompts,
sampling_params=sampling_params,
)
print("------start generating------")
test_long_document_qa(
llm=llm,
prompts=prompts,
sampling_params=sampling_params,
)
if __name__ == "__main__":
parser = FlexibleArgumentParser(
description=
'Benchmark the performance with or without automatic prefix caching.')
parser.add_argument(
'--document-length',
type=int,
# Roughly the number of tokens for a system paper,
# excluding images
default=20000,
help='Range of input lengths for sampling prompts,'
'specified as "min:max" (e.g., "128:256").')
parser.add_argument('--num-documents',
type=int,
default=8,
help='Range of input lengths for sampling prompts,'
'specified as "min:max" (e.g., "128:256").')
parser.add_argument('--output-len', type=int, default=10)
parser.add_argument('--repeat-count',
type=int,
default=2,
help='Number of times to repeat each prompt')
parser.add_argument("--repeat-mode",
type=str,
default='random',
help='The mode to repeat prompts. The supported '
'modes are "random", "tile", and "interleave". '
'See repeat_prompts() in the source code for details.')
parser.add_argument("--shuffle-seed",
type=int,
default=0,
help='Random seed when the repeat mode is "random"')
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args()
main(args)

View File

@@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
""" """
Benchmark the efficiency of prefix caching. Benchmark the efficiency of prefix caching.
@@ -10,7 +11,8 @@ Fixed example usage:
--model meta-llama/Llama-2-7b-chat-hf \ --model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \ --enable-prefix-caching \
--num-prompts 1 \ --num-prompts 1 \
--repeat-count 100 --repeat-count 100 \
--input-length-range 128:256
ShareGPT example usage: ShareGPT example usage:
# This command samples 20 prompts with input lengths # This command samples 20 prompts with input lengths
@@ -25,6 +27,7 @@ ShareGPT example usage:
--input-length-range 128:256 --input-length-range 128:256
""" """
import dataclasses
import json import json
import random import random
import time import time
@@ -33,6 +36,7 @@ from typing import List, Optional, Tuple
from transformers import PreTrainedTokenizerBase from transformers import PreTrainedTokenizerBase
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser from vllm.utils import FlexibleArgumentParser
try: try:
@@ -52,13 +56,30 @@ def test_prefix(llm=None, sampling_params=None, prompts=None):
print(f"cost time {end_time - start_time}") print(f"cost time {end_time - start_time}")
def sample_requests( @dataclasses.dataclass
class Request:
prompt: str
prompt_len: int
output_len: int
def sample_tokens(tokenizer: PreTrainedTokenizerBase, length: int) -> str:
vocab = tokenizer.get_vocab()
# Remove the special tokens.
vocab = {
k: v
for k, v in vocab.items() if k not in tokenizer.all_special_ids
}
return random.choices(list(vocab.values()), k=length)
def sample_requests_from_dataset(
dataset_path: str, dataset_path: str,
num_requests: int, num_requests: int,
tokenizer: PreTrainedTokenizerBase, tokenizer: PreTrainedTokenizerBase,
input_length_range: Tuple[int, int], input_length_range: Tuple[int, int],
fixed_output_len: Optional[int], fixed_output_len: Optional[int],
) -> List[Tuple[str, int, int]]: ) -> List[Request]:
if fixed_output_len is not None and fixed_output_len < 4: if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small") raise ValueError("output_len too small")
@@ -75,31 +96,55 @@ def sample_requests(
random.shuffle(dataset) random.shuffle(dataset)
min_len, max_len = input_length_range min_len, max_len = input_length_range
assert min_len >= 0 and max_len >= min_len, "input_length_range too small"
# Filter out sequences that are too long or too short # Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = [] filtered_requests: List[Request] = []
for i in range(len(dataset)): for i in range(len(dataset)):
if len(filtered_dataset) == num_requests: if len(filtered_requests) == num_requests:
break break
# Tokenize the prompts and completions. # Tokenize the prompts and completions.
prompt = dataset[i][0] prompt_token_ids = tokenizer(dataset[i][0]).input_ids
prompt_token_ids = tokenizer(prompt).input_ids prompt = tokenizer.decode(prompt_token_ids)
completion = dataset[i][1] completion = dataset[i][1]
completion_token_ids = tokenizer(completion).input_ids completion_token_ids = tokenizer(completion).input_ids
prompt_len = len(prompt_token_ids) prompt_len = len(prompt_token_ids)
output_len = len(completion_token_ids output_len = (len(completion_token_ids)
) if fixed_output_len is None else fixed_output_len if fixed_output_len is None else fixed_output_len)
if prompt_len < 4 or output_len < 4:
# Prune too short sequences.
continue
if min_len <= prompt_len <= max_len: if min_len <= prompt_len <= max_len:
filtered_dataset.append((prompt, prompt_len, output_len)) filtered_requests.append(Request(prompt, prompt_len, output_len))
return filtered_dataset return filtered_requests
def repeat_and_sort_requests(requests: List[Tuple[str, int, int]], def sample_requests_from_random(
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
input_length_range: Tuple[int, int],
fixed_output_len: Optional[int],
prefix_len: int,
) -> List[Request]:
requests = []
prefix_token_ids = sample_tokens(tokenizer, prefix_len)
min_len, max_len = input_length_range
for i in range(num_requests):
unique_part_token_ids = sample_tokens(
tokenizer,
random.randint(min_len - prefix_len, max_len - prefix_len))
prompt_token_ids = prefix_token_ids + unique_part_token_ids
prompt = tokenizer.decode(prompt_token_ids)
prompt_len = len(prompt_token_ids)
assert (min_len <= prompt_len <= max_len
), f"prompt_len {prompt_len} out of range {min_len}:{max_len}"
requests.append(Request(prompt, prompt_len, fixed_output_len))
return requests
def repeat_and_sort_requests(requests: List[Request],
repeat_count: int, repeat_count: int,
sort: bool = False) -> List[str]: sort: bool = False) -> List[str]:
repeated_requests = requests * repeat_count repeated_requests = requests * repeat_count
@@ -107,7 +152,7 @@ def repeat_and_sort_requests(requests: List[Tuple[str, int, int]],
repeated_requests.sort(key=lambda x: x[1]) repeated_requests.sort(key=lambda x: x[1])
else: else:
random.shuffle(repeated_requests) random.shuffle(repeated_requests)
return [req[0] for req in repeated_requests] return [req.prompt for req in repeated_requests]
def main(args): def main(args):
@@ -115,9 +160,12 @@ def main(args):
input_length_range = tuple(map(int, args.input_length_range.split(':'))) input_length_range = tuple(map(int, args.input_length_range.split(':')))
random.seed(args.seed) random.seed(args.seed)
if args.dataset_path is not None: if args.dataset_path is not None:
print(f"Start to sample {args.num_prompts} prompts" if args.prefix_len > 0:
"from {args.dataset_path}") raise ValueError("prefix-len is not supported when "
filtered_datasets = sample_requests( "dataset-path is provided.")
print(f"Start to sample {args.num_prompts} prompts "
f"from {args.dataset_path}")
filtered_requests = sample_requests_from_dataset(
dataset_path=args.dataset_path, dataset_path=args.dataset_path,
num_requests=args.num_prompts, num_requests=args.num_prompts,
tokenizer=tokenizer, tokenizer=tokenizer,
@@ -125,31 +173,34 @@ def main(args):
fixed_output_len=args.output_len, fixed_output_len=args.output_len,
) )
else: else:
prompt_len = len(tokenizer(PROMPT).input_ids) print(f"Start to sample {args.num_prompts} prompts from random")
filtered_datasets = [(PROMPT, prompt_len, args.output_len) filtered_requests = sample_requests_from_random(
] * args.num_prompts num_requests=args.num_prompts,
tokenizer=tokenizer,
input_length_range=input_length_range,
fixed_output_len=args.output_len,
prefix_len=args.prefix_len,
)
llm = LLM(model=args.model, # Print some helpful stats of the requests.
tokenizer_mode='auto', print(f"Sampled {len(filtered_requests)} requests.")
trust_remote_code=True, prompt_lens = [req.prompt_len for req in filtered_requests]
enforce_eager=True, print(f"Average input length: {sum(prompt_lens) / len(prompt_lens)}")
tensor_parallel_size=args.tensor_parallel_size, print(f"P50 input length: {sorted(prompt_lens)[len(prompt_lens) // 2]}")
enable_prefix_caching=args.enable_prefix_caching) print(f"Min Prompt Length: {min(prompt_lens)}")
print(f"Max Prompt Length: {max(prompt_lens)}")
engine_args = EngineArgs.from_cli_args(args)
llm = LLM(**dataclasses.asdict(engine_args))
sampling_params = SamplingParams(temperature=0, max_tokens=args.output_len) sampling_params = SamplingParams(temperature=0, max_tokens=args.output_len)
print("Testing filtered datasets") print("Testing filtered requests")
prompts = repeat_and_sort_requests(filtered_datasets, prompts = repeat_and_sort_requests(filtered_requests,
repeat_count=args.repeat_count, repeat_count=args.repeat_count,
sort=args.sort) sort=args.sort)
print("------warm up------")
test_prefix(
llm=llm,
prompts=prompts,
sampling_params=sampling_params,
)
print("------start generating------") print("------start generating------")
test_prefix( test_prefix(
llm=llm, llm=llm,
@@ -162,37 +213,37 @@ if __name__ == "__main__":
parser = FlexibleArgumentParser( parser = FlexibleArgumentParser(
description= description=
'Benchmark the performance with or without automatic prefix caching.') 'Benchmark the performance with or without automatic prefix caching.')
parser.add_argument('--model',
type=str,
default='baichuan-inc/Baichuan2-13B-Chat')
parser.add_argument("--dataset-path", parser.add_argument("--dataset-path",
type=str, type=str,
default=None, default=None,
help="Path to the dataset.") help="Path to the dataset.")
parser.add_argument('--tensor-parallel-size', '-tp', type=int, default=1)
parser.add_argument('--output-len', type=int, default=10) parser.add_argument('--output-len', type=int, default=10)
parser.add_argument('--enable-prefix-caching',
action='store_true',
help='enable prefix caching')
parser.add_argument('--num-prompts', parser.add_argument('--num-prompts',
type=int, type=int,
default=1, required=True,
help="Number of the prompts sampled from dataset") help="Number of the prompts sampled from dataset")
parser.add_argument('--repeat-count', parser.add_argument('--repeat-count',
type=int, type=int,
default=100, default=1,
help='Number of times to repeat each prompt') help='Number of times to repeat each prompt')
parser.add_argument('--sort', parser.add_argument('--sort',
action='store_true', action='store_true',
help='Sort prompts by input length') help='Sort prompts by input length')
parser.add_argument('--input-length-range', parser.add_argument('--input-length-range',
type=str, type=str,
default='128:256', required=True,
help='Range of input lengths for sampling prompts,' help='Range of input lengths for sampling prompts,'
'specified as "min:max" (e.g., "128:256").') 'specified as "min:max" (e.g., "128:256").')
parser.add_argument("--seed", parser.add_argument(
type=int, "--prefix-len",
default=0, type=int,
help='Random seed for reproducibility') default=0,
help="Specifies the length of a common prefix to be "
"added to the input prompt. The input-length-range will "
"subtract this length when filtering prompts. Only used "
"when dataset-path is not provided.",
)
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args() args = parser.parse_args()
main(args) main(args)

View File

@@ -1,5 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark offline prioritization.""" """Benchmark offline prioritization."""
import argparse import argparse
import dataclasses
import json import json
import random import random
import time import time
@@ -7,7 +9,8 @@ from typing import List, Optional, Tuple
from transformers import AutoTokenizer, PreTrainedTokenizerBase from transformers import AutoTokenizer, PreTrainedTokenizerBase
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser
def sample_requests( def sample_requests(
@@ -62,46 +65,11 @@ def sample_requests(
def run_vllm( def run_vllm(
requests: List[Tuple[str, int, int]], requests: List[Tuple[str, int, int]],
model: str,
tokenizer: str,
quantization: Optional[str],
tensor_parallel_size: int,
seed: int,
n: int, n: int,
trust_remote_code: bool, engine_args: EngineArgs,
dtype: str,
max_model_len: Optional[int],
enforce_eager: bool,
kv_cache_dtype: str,
quantization_param_path: Optional[str],
device: str,
enable_prefix_caching: bool,
enable_chunked_prefill: bool,
max_num_batched_tokens: int,
gpu_memory_utilization: float = 0.9,
download_dir: Optional[str] = None,
) -> float: ) -> float:
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
llm = LLM( llm = LLM(**dataclasses.asdict(engine_args))
model=model,
tokenizer=tokenizer,
quantization=quantization,
tensor_parallel_size=tensor_parallel_size,
seed=seed,
trust_remote_code=trust_remote_code,
dtype=dtype,
max_model_len=max_model_len,
gpu_memory_utilization=gpu_memory_utilization,
enforce_eager=enforce_eager,
kv_cache_dtype=kv_cache_dtype,
quantization_param_path=quantization_param_path,
device=device,
enable_prefix_caching=enable_prefix_caching,
download_dir=download_dir,
enable_chunked_prefill=enable_chunked_prefill,
max_num_batched_tokens=max_num_batched_tokens,
disable_log_stats=False,
)
# Add the requests to the engine. # Add the requests to the engine.
prompts = [] prompts = []
@@ -142,16 +110,8 @@ def main(args: argparse.Namespace):
args.output_len) args.output_len)
if args.backend == "vllm": if args.backend == "vllm":
elapsed_time = run_vllm(requests, args.model, args.tokenizer, elapsed_time = run_vllm(requests, args.n,
args.quantization, args.tensor_parallel_size, EngineArgs.from_cli_args(args))
args.seed, args.n, args.trust_remote_code,
args.dtype, args.max_model_len,
args.enforce_eager, args.kv_cache_dtype,
args.quantization_param_path, args.device,
args.enable_prefix_caching,
args.enable_chunked_prefill,
args.max_num_batched_tokens,
args.gpu_memory_utilization, args.download_dir)
else: else:
raise ValueError(f"Unknown backend: {args.backend}") raise ValueError(f"Unknown backend: {args.backend}")
total_num_tokens = sum(prompt_len + output_len total_num_tokens = sum(prompt_len + output_len
@@ -173,7 +133,7 @@ def main(args: argparse.Namespace):
if __name__ == "__main__": if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Benchmark the throughput.") parser = FlexibleArgumentParser(description="Benchmark the throughput.")
parser.add_argument("--backend", parser.add_argument("--backend",
type=str, type=str,
choices=["vllm", "hf", "mii"], choices=["vllm", "hf", "mii"],
@@ -191,13 +151,6 @@ if __name__ == "__main__":
default=None, default=None,
help="Output length for each request. Overrides the " help="Output length for each request. Overrides the "
"output length from the dataset.") "output length from the dataset.")
parser.add_argument("--model", type=str, default="facebook/opt-125m")
parser.add_argument("--tokenizer", type=str, default=None)
parser.add_argument('--quantization',
'-q',
choices=[*QUANTIZATION_METHODS, None],
default=None)
parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1)
parser.add_argument("--n", parser.add_argument("--n",
type=int, type=int,
default=1, default=1,
@@ -206,81 +159,13 @@ if __name__ == "__main__":
type=int, type=int,
default=200, default=200,
help="Number of prompts to process.") help="Number of prompts to process.")
parser.add_argument("--seed", type=int, default=0)
parser.add_argument('--trust-remote-code',
action='store_true',
help='trust remote code from huggingface')
parser.add_argument(
'--max-model-len',
type=int,
default=None,
help='Maximum length of a sequence (including prompt and output). '
'If None, will be derived from the model.')
parser.add_argument(
'--dtype',
type=str,
default='auto',
choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
help='data type for model weights and activations. '
'The "auto" option will use FP16 precision '
'for FP32 and FP16 models, and BF16 precision '
'for BF16 models.')
parser.add_argument('--gpu-memory-utilization',
type=float,
default=0.9,
help='the fraction of GPU memory to be used for '
'the model executor, which can range from 0 to 1.'
'If unspecified, will use the default value of 0.9.')
parser.add_argument("--enforce-eager",
action="store_true",
help="enforce eager execution")
parser.add_argument(
'--kv-cache-dtype',
type=str,
choices=['auto', 'fp8', 'fp8_e5m2', 'fp8_e4m3'],
default="auto",
help='Data type for kv cache storage. If "auto", will use model '
'data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. '
'ROCm (AMD GPU) supports fp8 (=fp8_e4m3)')
parser.add_argument(
'--quantization-param-path',
type=str,
default=None,
help='Path to the JSON file containing the KV cache scaling factors. '
'This should generally be supplied, when KV cache dtype is FP8. '
'Otherwise, KV cache scaling factors default to 1.0, which may cause '
'accuracy issues. FP8_E5M2 (without scaling) is only supported on '
'cuda version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is '
'instead supported for common inference criteria.')
parser.add_argument(
"--device",
type=str,
default="cuda",
choices=["cuda", "cpu"],
help='device type for vLLM execution, supporting CUDA and CPU.')
parser.add_argument(
"--enable-prefix-caching",
action='store_true',
help="enable automatic prefix caching for vLLM backend.")
parser.add_argument("--enable-chunked-prefill",
action='store_true',
help="enable chunked prefill for vLLM backend.")
parser.add_argument('--max-num-batched-tokens',
type=int,
default=None,
help='maximum number of batched tokens per '
'iteration')
parser.add_argument('--download-dir',
type=str,
default=None,
help='directory to download and load the weights, '
'default to the default cache dir of huggingface')
parser.add_argument( parser.add_argument(
'--output-json', '--output-json',
type=str, type=str,
default=None, default=None,
help='Path to save the throughput results in JSON format.') help='Path to save the throughput results in JSON format.')
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args() args = parser.parse_args()
if args.tokenizer is None: if args.tokenizer is None:
args.tokenizer = args.model args.tokenizer = args.model

View File

@@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
r"""Benchmark online serving throughput. r"""Benchmark online serving throughput.
On the server side, run one of the following commands: On the server side, run one of the following commands:
@@ -25,6 +26,7 @@ On the client side, run:
import argparse import argparse
import asyncio import asyncio
import base64 import base64
import gc
import io import io
import json import json
import os import os
@@ -53,6 +55,8 @@ try:
except ImportError: except ImportError:
from argparse import ArgumentParser as FlexibleArgumentParser from argparse import ArgumentParser as FlexibleArgumentParser
MILLISECONDS_TO_SECONDS_CONVERSION = 1000
@dataclass @dataclass
class BenchmarkMetrics: class BenchmarkMetrics:
@@ -60,6 +64,7 @@ class BenchmarkMetrics:
total_input: int total_input: int
total_output: int total_output: int
request_throughput: float request_throughput: float
request_goodput: float
output_throughput: float output_throughput: float
total_token_throughput: float total_token_throughput: float
mean_ttft_ms: float mean_ttft_ms: float
@@ -196,22 +201,80 @@ def sample_sonnet_requests(
return sampled_requests return sampled_requests
def sample_hf_requests( def sample_vision_arena_requests(
dataset_path: str, dataset,
dataset_subset: str,
dataset_split: str,
num_requests: int, num_requests: int,
tokenizer: PreTrainedTokenizerBase, tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int] = None, fixed_output_len: Optional[int] = None,
) -> List[Tuple[str, str, int, Optional[Dict[str, Collection[str]]]]]: ) -> List[Tuple[str, str, int, Optional[Dict[str, Collection[str]]]]]:
sampled_requests: List[Tuple[str, int, int, Dict[str,
Collection[str]]]] = []
for data in dataset:
if len(sampled_requests) == num_requests:
break
prompt = data["turns"][0][0]['content']
prompt_token_ids = tokenizer(prompt).input_ids
if fixed_output_len is None:
# Default max output len is set to 128
print("--hf-output-len is not provided. Using default value 128.")
fixed_output_len = 128
prompt_len = len(prompt_token_ids)
output_len = fixed_output_len
assert isinstance(
data["images"][0],
Image), ("Input image format must be `PIL.Image.Image`, "
f"given {type(data['image'])}.")
image: Image = data["images"][0]
image = image.convert("RGB")
image_data = io.BytesIO()
image.save(image_data, format='JPEG')
image_base64 = base64.b64encode(image_data.getvalue()).decode("utf-8")
mm_content = {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}
sampled_requests.append((prompt, prompt_len, output_len, mm_content))
return sampled_requests
def sample_hf_requests(
dataset_path: str,
dataset_subset: Optional[str],
dataset_split: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
random_seed: int,
fixed_output_len: Optional[int] = None,
) -> List[Tuple[str, str, int, Optional[Dict[str, Collection[str]]]]]:
# Special case for vision_arena dataset
if dataset_path == 'lmarena-ai/vision-arena-bench-v0.1' \
and dataset_subset is None:
assert dataset_split == "train"
dataset = load_dataset(dataset_path,
name=dataset_subset,
split=dataset_split,
streaming=True)
dataset = dataset.shuffle(seed=random_seed)
return sample_vision_arena_requests(dataset, num_requests, tokenizer,
fixed_output_len)
dataset = load_dataset(dataset_path, dataset = load_dataset(dataset_path,
name=dataset_subset, name=dataset_subset,
split=dataset_split, split=dataset_split,
streaming=True) streaming=True)
assert "conversations" in dataset.features, ( assert "conversations" in dataset.features, (
"HF Dataset must have 'conversations' column.") "HF Dataset must have 'conversations' column.")
filtered_dataset = dataset.shuffle().filter( filter_func = lambda x: len(x["conversations"]) >= 2
lambda x: len(x["conversations"]) >= 2) filtered_dataset = dataset.shuffle(seed=random_seed).filter(filter_func)
sampled_requests: List[Tuple[str, int, int, Dict[str, sampled_requests: List[Tuple[str, int, int, Dict[str,
Collection[str]]]] = [] Collection[str]]]] = []
for data in filtered_dataset: for data in filtered_dataset:
@@ -247,6 +310,19 @@ def sample_hf_requests(
"url": f"data:image/jpeg;base64,{image_base64}" "url": f"data:image/jpeg;base64,{image_base64}"
}, },
} }
elif "image" in data and isinstance(data["image"], str):
if (data["image"].startswith("http://") or \
data["image"].startswith("file://")):
image_url = data["image"]
else:
image_url = f"file://{data['image']}"
mm_content = {
"type": "image_url",
"image_url": {
"url": image_url
},
}
else: else:
mm_content = None mm_content = None
@@ -293,8 +369,33 @@ def sample_random_requests(
async def get_request( async def get_request(
input_requests: List[Tuple[str, int, int]], input_requests: List[Tuple[str, int, int]],
request_rate: float, request_rate: float,
burstiness: float = 1.0,
) -> AsyncGenerator[Tuple[str, int, int], None]: ) -> AsyncGenerator[Tuple[str, int, int], None]:
"""
Asynchronously generates requests at a specified rate
with OPTIONAL burstiness.
Args:
input_requests:
A list of input requests, each represented as a tuple.
request_rate:
The rate at which requests are generated (requests/s).
burstiness (optional):
The burstiness factor of the request generation.
Only takes effect when request_rate is not inf.
Default value is 1, which follows a Poisson process.
Otherwise, the request intervals follow a gamma distribution.
A lower burstiness value (0 < burstiness < 1) results
in more bursty requests, while a higher burstiness value
(burstiness > 1) results in a more uniform arrival of requests.
"""
input_requests = iter(input_requests) input_requests = iter(input_requests)
# Calculate scale parameter theta to maintain the desired request_rate.
assert burstiness > 0, (
f"A positive burstiness factor is expected, but given {burstiness}.")
theta = 1.0 / (request_rate * burstiness)
for request in input_requests: for request in input_requests:
yield request yield request
@@ -302,8 +403,9 @@ async def get_request(
# If the request rate is infinity, then we don't need to wait. # If the request rate is infinity, then we don't need to wait.
continue continue
# Sample the request interval from the exponential distribution. # Sample the request interval from the gamma distribution.
interval = np.random.exponential(1.0 / request_rate) # If burstiness is 1, it follows exponential distribution.
interval = np.random.gamma(shape=burstiness, scale=theta)
# The next request will be sent after the interval. # The next request will be sent after the interval.
await asyncio.sleep(interval) await asyncio.sleep(interval)
@@ -315,28 +417,39 @@ def calculate_metrics(
tokenizer: PreTrainedTokenizerBase, tokenizer: PreTrainedTokenizerBase,
selected_percentile_metrics: List[str], selected_percentile_metrics: List[str],
selected_percentiles: List[float], selected_percentiles: List[float],
goodput_config_dict: Dict[str, float],
) -> Tuple[BenchmarkMetrics, List[int]]: ) -> Tuple[BenchmarkMetrics, List[int]]:
actual_output_lens: List[int] = [] actual_output_lens: List[int] = []
total_input = 0 total_input = 0
completed = 0 completed = 0
good_completed = 0
itls: List[float] = [] itls: List[float] = []
tpots: List[float] = [] tpots: List[float] = []
all_tpots: List[float] = []
ttfts: List[float] = [] ttfts: List[float] = []
e2els: List[float] = [] e2els: List[float] = []
for i in range(len(outputs)): for i in range(len(outputs)):
if outputs[i].success: if outputs[i].success:
# We use the tokenizer to count the number of output tokens for all output_len = outputs[i].output_tokens
# serving backends instead of looking at len(outputs[i].itl) since
# multiple output tokens may be bundled together if output_len is None:
# Note : this may inflate the output token count slightly # We use the tokenizer to count the number of output tokens
output_len = len( # for some serving backends instead of looking at
tokenizer(outputs[i].generated_text, # len(outputs[i].itl) since multiple output tokens may be
add_special_tokens=False).input_ids) # bundled together
# Note : this may inflate the output token count slightly
output_len = len(
tokenizer(outputs[i].generated_text,
add_special_tokens=False).input_ids)
actual_output_lens.append(output_len) actual_output_lens.append(output_len)
total_input += input_requests[i][1] total_input += input_requests[i][1]
tpot = 0
if output_len > 1: if output_len > 1:
tpots.append( latency_minus_ttft = outputs[i].latency - outputs[i].ttft
(outputs[i].latency - outputs[i].ttft) / (output_len - 1)) tpot = latency_minus_ttft / (output_len - 1)
tpots.append(tpot)
# Note: if output_len <= 1, we regard tpot as 0 for goodput
all_tpots.append(tpot)
itls += outputs[i].itl itls += outputs[i].itl
ttfts.append(outputs[i].ttft) ttfts.append(outputs[i].ttft)
e2els.append(outputs[i].latency) e2els.append(outputs[i].latency)
@@ -344,6 +457,28 @@ def calculate_metrics(
else: else:
actual_output_lens.append(0) actual_output_lens.append(0)
if goodput_config_dict:
valid_metrics = []
slo_values = []
if "ttft" in goodput_config_dict:
valid_metrics.append(ttfts)
slo_values.append(goodput_config_dict["ttft"] /
MILLISECONDS_TO_SECONDS_CONVERSION)
if "tpot" in goodput_config_dict:
valid_metrics.append(all_tpots)
slo_values.append(goodput_config_dict["tpot"] /
MILLISECONDS_TO_SECONDS_CONVERSION)
if "e2el" in goodput_config_dict:
valid_metrics.append(e2els)
slo_values.append(goodput_config_dict["e2el"] /
MILLISECONDS_TO_SECONDS_CONVERSION)
for req_metric in zip(*valid_metrics):
is_good_req = all([s >= r for s, r in zip(slo_values, req_metric)])
if is_good_req:
good_completed += 1
if completed == 0: if completed == 0:
warnings.warn( warnings.warn(
"All requests failed. This is likely due to a misconfiguration " "All requests failed. This is likely due to a misconfiguration "
@@ -354,6 +489,7 @@ def calculate_metrics(
total_input=total_input, total_input=total_input,
total_output=sum(actual_output_lens), total_output=sum(actual_output_lens),
request_throughput=completed / dur_s, request_throughput=completed / dur_s,
request_goodput=good_completed / dur_s,
output_throughput=sum(actual_output_lens) / dur_s, output_throughput=sum(actual_output_lens) / dur_s,
total_token_throughput=(total_input + sum(actual_output_lens)) / dur_s, total_token_throughput=(total_input + sum(actual_output_lens)) / dur_s,
mean_ttft_ms=np.mean(ttfts or 0) * mean_ttft_ms=np.mean(ttfts or 0) *
@@ -372,9 +508,9 @@ def calculate_metrics(
median_itl_ms=np.median(itls or 0) * 1000, median_itl_ms=np.median(itls or 0) * 1000,
percentiles_itl_ms=[(p, np.percentile(itls or 0, p) * 1000) percentiles_itl_ms=[(p, np.percentile(itls or 0, p) * 1000)
for p in selected_percentiles], for p in selected_percentiles],
mean_e2el_ms=np.median(e2els or 0) * 1000, mean_e2el_ms=np.mean(e2els or 0) * 1000,
std_e2el_ms=np.std(e2els or 0) * 1000, std_e2el_ms=np.std(e2els or 0) * 1000,
median_e2el_ms=np.mean(e2els or 0) * 1000, median_e2el_ms=np.median(e2els or 0) * 1000,
percentiles_e2el_ms=[(p, np.percentile(e2els or 0, p) * 1000) percentiles_e2el_ms=[(p, np.percentile(e2els or 0, p) * 1000)
for p in selected_percentiles], for p in selected_percentiles],
) )
@@ -387,16 +523,20 @@ async def benchmark(
api_url: str, api_url: str,
base_url: str, base_url: str,
model_id: str, model_id: str,
model_name: str,
tokenizer: PreTrainedTokenizerBase, tokenizer: PreTrainedTokenizerBase,
input_requests: List[Tuple[str, int, int]], input_requests: List[Tuple[str, int, int]],
logprobs: Optional[int], logprobs: Optional[int],
best_of: int, best_of: int,
request_rate: float, request_rate: float,
burstiness: float,
disable_tqdm: bool, disable_tqdm: bool,
profile: bool, profile: bool,
selected_percentile_metrics: List[str], selected_percentile_metrics: List[str],
selected_percentiles: List[str], selected_percentiles: List[str],
ignore_eos: bool, ignore_eos: bool,
goodput_config_dict: Dict[str, float],
max_concurrency: Optional[int],
): ):
if backend in ASYNC_REQUEST_FUNCS: if backend in ASYNC_REQUEST_FUNCS:
request_func = ASYNC_REQUEST_FUNCS[backend] request_func = ASYNC_REQUEST_FUNCS[backend]
@@ -412,6 +552,7 @@ async def benchmark(
"Multi-modal content is only supported on 'openai-chat' backend.") "Multi-modal content is only supported on 'openai-chat' backend.")
test_input = RequestFuncInput( test_input = RequestFuncInput(
model=model_id, model=model_id,
model_name=model_name,
prompt=test_prompt, prompt=test_prompt,
api_url=api_url, api_url=api_url,
prompt_len=test_prompt_len, prompt_len=test_prompt_len,
@@ -432,6 +573,7 @@ async def benchmark(
if profile: if profile:
print("Starting profiler...") print("Starting profiler...")
profile_input = RequestFuncInput(model=model_id, profile_input = RequestFuncInput(model=model_id,
model_name=model_name,
prompt=test_prompt, prompt=test_prompt,
api_url=base_url + "/start_profile", api_url=base_url + "/start_profile",
prompt_len=test_prompt_len, prompt_len=test_prompt_len,
@@ -444,15 +586,38 @@ async def benchmark(
if profile_output.success: if profile_output.success:
print("Profiler started") print("Profiler started")
if burstiness == 1.0:
distribution = "Poisson process"
else:
distribution = "Gamma distribution"
print(f"Traffic request rate: {request_rate}") print(f"Traffic request rate: {request_rate}")
print(f"Burstiness factor: {burstiness} ({distribution})")
print(f"Maximum request concurrency: {max_concurrency}")
pbar = None if disable_tqdm else tqdm(total=len(input_requests)) pbar = None if disable_tqdm else tqdm(total=len(input_requests))
# This can be used once the minimum Python version is 3.10 or higher,
# and it will simplify the code in limited_request_func.
# semaphore = (asyncio.Semaphore(max_concurrency)
# if max_concurrency else contextlib.nullcontext())
semaphore = (asyncio.Semaphore(max_concurrency)
if max_concurrency else None)
async def limited_request_func(request_func_input, pbar):
if semaphore is None:
return await request_func(request_func_input=request_func_input,
pbar=pbar)
async with semaphore:
return await request_func(request_func_input=request_func_input,
pbar=pbar)
benchmark_start_time = time.perf_counter() benchmark_start_time = time.perf_counter()
tasks: List[asyncio.Task] = [] tasks: List[asyncio.Task] = []
async for request in get_request(input_requests, request_rate): async for request in get_request(input_requests, request_rate, burstiness):
prompt, prompt_len, output_len, mm_content = request prompt, prompt_len, output_len, mm_content = request
request_func_input = RequestFuncInput(model=model_id, request_func_input = RequestFuncInput(model=model_id,
model_name=model_name,
prompt=prompt, prompt=prompt,
api_url=api_url, api_url=api_url,
prompt_len=prompt_len, prompt_len=prompt_len,
@@ -463,8 +628,8 @@ async def benchmark(
ignore_eos=ignore_eos) ignore_eos=ignore_eos)
tasks.append( tasks.append(
asyncio.create_task( asyncio.create_task(
request_func(request_func_input=request_func_input, limited_request_func(request_func_input=request_func_input,
pbar=pbar))) pbar=pbar)))
outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks) outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks)
if profile: if profile:
@@ -494,6 +659,7 @@ async def benchmark(
tokenizer=tokenizer, tokenizer=tokenizer,
selected_percentile_metrics=selected_percentile_metrics, selected_percentile_metrics=selected_percentile_metrics,
selected_percentiles=selected_percentiles, selected_percentiles=selected_percentiles,
goodput_config_dict=goodput_config_dict,
) )
print("{s:{c}^{n}}".format(s=' Serving Benchmark Result ', n=50, c='=')) print("{s:{c}^{n}}".format(s=' Serving Benchmark Result ', n=50, c='='))
@@ -505,6 +671,9 @@ async def benchmark(
metrics.total_output)) metrics.total_output))
print("{:<40} {:<10.2f}".format("Request throughput (req/s):", print("{:<40} {:<10.2f}".format("Request throughput (req/s):",
metrics.request_throughput)) metrics.request_throughput))
if goodput_config_dict:
print("{:<40} {:<10.2f}".format("Request goodput (req/s):",
metrics.request_goodput))
print("{:<40} {:<10.2f}".format("Output token throughput (tok/s):", print("{:<40} {:<10.2f}".format("Output token throughput (tok/s):",
metrics.output_throughput)) metrics.output_throughput))
print("{:<40} {:<10.2f}".format("Total Token throughput (tok/s):", print("{:<40} {:<10.2f}".format("Total Token throughput (tok/s):",
@@ -516,6 +685,8 @@ async def benchmark(
"total_input_tokens": metrics.total_input, "total_input_tokens": metrics.total_input,
"total_output_tokens": metrics.total_output, "total_output_tokens": metrics.total_output,
"request_throughput": metrics.request_throughput, "request_throughput": metrics.request_throughput,
"request_goodput:":
metrics.request_goodput if goodput_config_dict else None,
"output_throughput": metrics.output_throughput, "output_throughput": metrics.output_throughput,
"total_token_throughput": metrics.total_token_throughput, "total_token_throughput": metrics.total_token_throughput,
"input_lens": [output.prompt_len for output in outputs], "input_lens": [output.prompt_len for output in outputs],
@@ -569,6 +740,41 @@ async def benchmark(
return result return result
def check_goodput_args(args):
# Check and parse goodput arguments
goodput_config_dict = {}
VALID_NAMES = ["ttft", "tpot", "e2el"]
if args.goodput:
goodput_config_dict = parse_goodput(args.goodput)
for slo_name, slo_val in goodput_config_dict.items():
if slo_name not in VALID_NAMES:
raise ValueError(
f"Invalid metric name found, {slo_name}: {slo_val}. "
"The service level objective name should be one of "
f"{str(VALID_NAMES)}. ")
if slo_val < 0:
raise ValueError(
f"Invalid value found, {slo_name}: {slo_val}. "
"The service level objective value should be "
"non-negative.")
return goodput_config_dict
def parse_goodput(slo_pairs):
goodput_config_dict = {}
try:
for slo_pair in slo_pairs:
slo_name, slo_val = slo_pair.split(":")
goodput_config_dict[slo_name] = float(slo_val)
except ValueError as err:
raise argparse.ArgumentTypeError(
"Invalid format found for service level objectives. "
"Specify service level objectives for goodput as \"KEY:VALUE\" "
"pairs, where the key is a metric name, and the value is a "
"number in milliseconds.") from err
return goodput_config_dict
def main(args: argparse.Namespace): def main(args: argparse.Namespace):
print(args) print(args)
random.seed(args.seed) random.seed(args.seed)
@@ -576,7 +782,9 @@ def main(args: argparse.Namespace):
backend = args.backend backend = args.backend
model_id = args.model model_id = args.model
model_name = args.served_model_name
tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
tokenizer_mode = args.tokenizer_mode
if args.base_url is not None: if args.base_url is not None:
api_url = f"{args.base_url}{args.endpoint}" api_url = f"{args.base_url}{args.endpoint}"
@@ -586,6 +794,7 @@ def main(args: argparse.Namespace):
base_url = f"http://{args.host}:{args.port}" base_url = f"http://{args.host}:{args.port}"
tokenizer = get_tokenizer(tokenizer_id, tokenizer = get_tokenizer(tokenizer_id,
tokenizer_mode=tokenizer_mode,
trust_remote_code=args.trust_remote_code) trust_remote_code=args.trust_remote_code)
if args.dataset is not None: if args.dataset is not None:
@@ -646,6 +855,7 @@ def main(args: argparse.Namespace):
dataset_split=args.hf_split, dataset_split=args.hf_split,
num_requests=args.num_prompts, num_requests=args.num_prompts,
tokenizer=tokenizer, tokenizer=tokenizer,
random_seed=args.seed,
fixed_output_len=args.hf_output_len, fixed_output_len=args.hf_output_len,
) )
@@ -662,17 +872,25 @@ def main(args: argparse.Namespace):
else: else:
raise ValueError(f"Unknown dataset: {args.dataset_name}") raise ValueError(f"Unknown dataset: {args.dataset_name}")
goodput_config_dict = check_goodput_args(args)
# Avoid GC processing "static" data - reduce pause times.
gc.collect()
gc.freeze()
benchmark_result = asyncio.run( benchmark_result = asyncio.run(
benchmark( benchmark(
backend=backend, backend=backend,
api_url=api_url, api_url=api_url,
base_url=base_url, base_url=base_url,
model_id=model_id, model_id=model_id,
model_name=model_name,
tokenizer=tokenizer, tokenizer=tokenizer,
input_requests=input_requests, input_requests=input_requests,
logprobs=args.logprobs, logprobs=args.logprobs,
best_of=args.best_of, best_of=args.best_of,
request_rate=args.request_rate, request_rate=args.request_rate,
burstiness=args.burstiness,
disable_tqdm=args.disable_tqdm, disable_tqdm=args.disable_tqdm,
profile=args.profile, profile=args.profile,
selected_percentile_metrics=args.percentile_metrics.split(","), selected_percentile_metrics=args.percentile_metrics.split(","),
@@ -680,6 +898,8 @@ def main(args: argparse.Namespace):
float(p) for p in args.metric_percentiles.split(",") float(p) for p in args.metric_percentiles.split(",")
], ],
ignore_eos=args.ignore_eos, ignore_eos=args.ignore_eos,
goodput_config_dict=goodput_config_dict,
max_concurrency=args.max_concurrency,
)) ))
# Save config and results to json # Save config and results to json
@@ -707,15 +927,19 @@ def main(args: argparse.Namespace):
) )
# Traffic # Traffic
result_json["request_rate"] = ( result_json["request_rate"] = (args.request_rate if args.request_rate
args.request_rate if args.request_rate < float("inf") else "inf") < float("inf") else "inf")
result_json["burstiness"] = args.burstiness
result_json["max_concurrency"] = args.max_concurrency
# Merge with benchmark result # Merge with benchmark result
result_json = {**result_json, **benchmark_result} result_json = {**result_json, **benchmark_result}
# Save to file # Save to file
base_model_id = model_id.split("/")[-1] base_model_id = model_id.split("/")[-1]
file_name = f"{backend}-{args.request_rate}qps-{base_model_id}-{current_dt}.json" #noqa max_concurrency_str = (f"-concurrency{args.max_concurrency}"
if args.max_concurrency is not None else "")
file_name = f"{backend}-{args.request_rate}qps{max_concurrency_str}-{base_model_id}-{current_dt}.json" #noqa
if args.result_filename: if args.result_filename:
file_name = args.result_filename file_name = args.result_filename
if args.result_dir: if args.result_dir:
@@ -766,6 +990,19 @@ if __name__ == "__main__":
default=None, default=None,
help="Path to the sharegpt/sonnet dataset. " help="Path to the sharegpt/sonnet dataset. "
"Or the huggingface dataset ID if using HF dataset.") "Or the huggingface dataset ID if using HF dataset.")
parser.add_argument(
"--max-concurrency",
type=int,
default=None,
help="Maximum number of concurrent requests. This can be used "
"to help simulate an environment where a higher level component "
"is enforcing a maximum number of concurrent requests. While the "
"--request-rate argument controls the rate at which requests are "
"initiated, this argument will control how many are actually allowed "
"to execute at a time. This means that when used in combination, the "
"actual request rate may be lower than specified with --request-rate, "
"if the server is not processing requests fast enough to keep up.")
parser.add_argument( parser.add_argument(
"--model", "--model",
type=str, type=str,
@@ -808,8 +1045,20 @@ if __name__ == "__main__":
default=float("inf"), default=float("inf"),
help="Number of requests per second. If this is inf, " help="Number of requests per second. If this is inf, "
"then all the requests are sent at time 0. " "then all the requests are sent at time 0. "
"Otherwise, we use Poisson process to synthesize " "Otherwise, we use Poisson process or gamma distribution "
"the request arrival times.", "to synthesize the request arrival times.",
)
parser.add_argument(
"--burstiness",
type=float,
default=1.0,
help="Burstiness factor of the request generation. "
"Only take effect when request_rate is not inf. "
"Default value is 1, which follows Poisson process. "
"Otherwise, the request intervals follow a gamma distribution. "
"A lower burstiness value (0 < burstiness < 1) results in more "
"bursty requests. A higher burstiness value (burstiness > 1) "
"results in a more uniform arrival of requests.",
) )
parser.add_argument("--seed", type=int, default=0) parser.add_argument("--seed", type=int, default=0)
parser.add_argument( parser.add_argument(
@@ -879,6 +1128,17 @@ if __name__ == "__main__":
"Default value is \"99\". " "Default value is \"99\". "
"Use \"--percentile-metrics\" to select metrics.", "Use \"--percentile-metrics\" to select metrics.",
) )
parser.add_argument(
"--goodput",
nargs="+",
required=False,
help="Specify service level objectives for goodput as \"KEY:VALUE\" "
"pairs, where the key is a metric name, and the value is in "
"milliseconds. Multiple \"KEY:VALUE\" pairs can be provided, "
"separated by spaces. Allowed request level metric names are "
"\"ttft\", \"tpot\", \"e2el\". For more context on the definition of "
"goodput, refer to DistServe paper: https://arxiv.org/pdf/2401.09670 "
"and the blog: https://hao-ai-lab.github.io/blogs/distserve")
# group for dataset specific arguments # group for dataset specific arguments
sonnet_group = parser.add_argument_group("sonnet dataset options") sonnet_group = parser.add_argument_group("sonnet dataset options")
@@ -960,5 +1220,22 @@ if __name__ == "__main__":
"from the sampled HF dataset.", "from the sampled HF dataset.",
) )
parser.add_argument(
'--tokenizer-mode',
type=str,
default="auto",
choices=['auto', 'slow', 'mistral'],
help='The tokenizer mode.\n\n* "auto" will use the '
'fast tokenizer if available.\n* "slow" will '
'always use the slow tokenizer. \n* '
'"mistral" will always use the `mistral_common` tokenizer.')
parser.add_argument("--served-model-name",
type=str,
default=None,
help="The model name used in the API. "
"If not specified, the model name will be the "
"same as the ``--model`` argument. ")
args = parser.parse_args() args = parser.parse_args()
main(args) main(args)

View File

@@ -0,0 +1,882 @@
# SPDX-License-Identifier: Apache-2.0
r"""Benchmark online serving throughput with guided decoding.
On the server side, run one of the following commands:
(vLLM OpenAI API server)
vllm serve <your_model> --disable-log-requests
(TGI backend)
./launch_tgi_server.sh <your_model> <max_batch_total_tokens>
On the client side, run:
python benchmarks/benchmark_serving.py \
--backend <backend> \
--model <your_model> \
--dataset json \
--guided-decoding-ratio 1.0 \
--guided-decoding-backend xgrammar \
--request-rate 10 \
--num-prompts 1000
when using tgi backend, add
--endpoint /generate_stream
to the end of the command above.
"""
import argparse
import asyncio
import dataclasses
import json
import os
import random
import time
import warnings
from dataclasses import dataclass
from typing import AsyncGenerator, List, Optional, Tuple
import datasets
import numpy as np
import pandas as pd
from backend_request_func import (ASYNC_REQUEST_FUNCS, RequestFuncInput,
RequestFuncOutput)
from tqdm.asyncio import tqdm
from transformers import PreTrainedTokenizerBase
try:
from vllm.transformers_utils.tokenizer import get_tokenizer
except ImportError:
from backend_request_func import get_tokenizer
try:
from vllm.utils import FlexibleArgumentParser
except ImportError:
from argparse import ArgumentParser as FlexibleArgumentParser
MILLISECONDS_TO_SECONDS_CONVERSION = 1000
@dataclass
class BenchmarkMetrics:
completed: int
total_input: int
total_output: int
request_throughput: float
request_goodput: float
output_throughput: float
total_token_throughput: float
mean_ttft_ms: float
median_ttft_ms: float
std_ttft_ms: float
percentiles_ttft_ms: List[Tuple[float, float]]
mean_tpot_ms: float
median_tpot_ms: float
std_tpot_ms: float
percentiles_tpot_ms: List[Tuple[float, float]]
mean_itl_ms: float
median_itl_ms: float
std_itl_ms: float
percentiles_itl_ms: List[Tuple[float, float]]
# E2EL stands for end-to-end latency per request.
# It is the time taken on the client side from sending
# a request to receiving a complete response.
mean_e2el_ms: float
median_e2el_ms: float
std_e2el_ms: float
percentiles_e2el_ms: List[Tuple[float, float]]
@dataclasses.dataclass
class SampleRequest:
"""A class representing a single inference request for benchmarking.
Attributes:
prompt: The input text prompt for the model.
multi_modal_data: Optional dictionary containing multi-modal data (e.g.
images).
prompt_len: The length of the prompt in tokens.
expected_output_len: The expected length of the output in tokens.
"""
prompt: str
prompt_len: int
expected_output_len: int
schema: dict
structure_type: str
completion: str = None
def sample_requests(tokenizer: PreTrainedTokenizerBase,
args: argparse.Namespace) -> List[SampleRequest]:
if args.dataset == 'json':
if args.json_schema_path is None:
dir_path = os.path.dirname(os.path.realpath(__file__))
args.json_schema_path = os.path.join(dir_path,
"structured_schemas",
"structured_schema_1.json")
with open(args.json_schema_path) as f:
schema = json.load(f)
prompt = f"Generate an example of a user profile given the following schema: {json.dumps(schema)}" # noqa: E501
input_len = len(tokenizer(prompt).input_ids)
print(f"Input length of the prompt: {input_len} tokens")
requests = [
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=schema,
structure_type=args.structure_type)
for _ in range(args.num_prompts)
]
elif args.dataset == "grammar":
schema = """
?start: select_statement
?select_statement: "SELECT " column_list " FROM " table_name
?column_list: column_name ("," column_name)*
?table_name: identifier
?column_name: identifier
?identifier: /[a-zA-Z_][a-zA-Z0-9_]*/
"""
prompt = "Generate an SQL query to show the 'username' \
and 'email' from the 'users' table."
input_len = len(tokenizer(prompt).input_ids)
print(f"Input length of the prompt: {input_len} tokens")
requests = [
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=schema,
structure_type=args.structure_type)
for _ in range(args.num_prompts)
]
elif args.dataset == "regex":
regex = r"\w+@\w+\.com\n"
args.regex = regex
prompt = "Generate an email address for Alan Turing, \
who works in Enigma. End in .com and new line. \
Example result: alan.turing@enigma.com\n"
input_len = len(tokenizer(prompt).input_ids)
print(f"Input length of the prompt: {input_len} tokens")
requests = [
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=regex,
structure_type=args.structure_type)
for _ in range(args.num_prompts)
]
elif args.dataset == "choice":
choice = ["Positive", "Negative"]
args.choice = choice
prompt = "Classify this sentiment: vLLM is wonderful!"
input_len = len(tokenizer(prompt).input_ids)
print(f"Input length of the prompt: {input_len} tokens")
requests = [
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=choice,
structure_type=args.structure_type)
for _ in range(args.num_prompts)
]
elif args.dataset == "xgrammar_bench":
requests: List[SampleRequest] = []
dataset = datasets.load_dataset("NousResearch/json-mode-eval",
split="train")
print(f"dataset has {len(dataset)} entries")
len_dataset = len(dataset)
for data_point_idx in range(args.num_prompts):
idx = data_point_idx
while idx >= len_dataset:
idx -= len_dataset
schema = dataset["schema"][idx]
prompt = tokenizer.apply_chat_template(dataset["prompt"][idx],
tokenize=False)
input_len = len(tokenizer(prompt).input_ids)
completion = dataset["completion"][idx]
requests.append(
SampleRequest(prompt=prompt,
prompt_len=input_len,
expected_output_len=args.output_len,
schema=schema,
structure_type=args.structure_type,
completion=completion))
return requests
async def get_request(
input_requests: List[SampleRequest],
request_rate: float,
burstiness: float = 1.0,
) -> AsyncGenerator[Tuple[int, SampleRequest], None]:
"""
Asynchronously generates requests at a specified rate
with OPTIONAL burstiness.
Args:
input_requests:
A list of input requests, each represented as a tuple.
request_rate:
The rate at which requests are generated (requests/s).
burstiness (optional):
The burstiness factor of the request generation.
Only takes effect when request_rate is not inf.
Default value is 1, which follows a Poisson process.
Otherwise, the request intervals follow a gamma distribution.
A lower burstiness value (0 < burstiness < 1) results
in more bursty requests, while a higher burstiness value
(burstiness > 1) results in a more uniform arrival of requests.
"""
input_requests = iter(input_requests)
# Calculate scale parameter theta to maintain the desired request_rate.
assert burstiness > 0, (
f"A positive burstiness factor is expected, but given {burstiness}.")
theta = 1.0 / (request_rate * burstiness)
for i, request in enumerate(input_requests):
yield i, request
if request_rate == float("inf"):
# If the request rate is infinity, then we don't need to wait.
continue
# Sample the request interval from the gamma distribution.
# If burstiness is 1, it follows exponential distribution.
interval = np.random.gamma(shape=burstiness, scale=theta)
# The next request will be sent after the interval.
await asyncio.sleep(interval)
def calculate_metrics(
input_requests: List[Tuple[str, int, int]],
outputs: List[RequestFuncOutput],
dur_s: float,
tokenizer: PreTrainedTokenizerBase,
selected_percentile_metrics: List[str],
selected_percentiles: List[float],
) -> Tuple[BenchmarkMetrics, List[int]]:
actual_output_lens: List[int] = []
total_input = 0
completed = 0
good_completed = 0
itls: List[float] = []
tpots: List[float] = []
all_tpots: List[float] = []
ttfts: List[float] = []
e2els: List[float] = []
for i in range(len(outputs)):
if outputs[i].success:
# We use the tokenizer to count the number of output tokens for all
# serving backends instead of looking at len(outputs[i].itl) since
# multiple output tokens may be bundled together
# Note : this may inflate the output token count slightly
output_len = len(
tokenizer(outputs[i].generated_text,
add_special_tokens=False).input_ids)
actual_output_lens.append(output_len)
total_input += input_requests[i].prompt_len
tpot = 0
if output_len > 1:
tpot = (outputs[i].latency - outputs[i].ttft) / (output_len -
1)
tpots.append(tpot)
outputs[i].tpot = sum(tpots) / len(tpots) if len(tpots) else 0
# Note: if output_len <= 1, we regard tpot as 0 for goodput
all_tpots.append(tpot)
itls += outputs[i].itl
ttfts.append(outputs[i].ttft)
e2els.append(outputs[i].latency)
completed += 1
else:
actual_output_lens.append(0)
if completed == 0:
warnings.warn(
"All requests failed. This is likely due to a misconfiguration "
"on the benchmark arguments.",
stacklevel=2)
metrics = BenchmarkMetrics(
completed=completed,
total_input=total_input,
total_output=sum(actual_output_lens),
request_throughput=completed / dur_s,
request_goodput=good_completed / dur_s,
output_throughput=sum(actual_output_lens) / dur_s,
total_token_throughput=(total_input + sum(actual_output_lens)) / dur_s,
mean_ttft_ms=np.mean(ttfts or 0) *
1000, # ttfts is empty if streaming is not supported by backend
std_ttft_ms=np.std(ttfts or 0) * 1000,
median_ttft_ms=np.median(ttfts or 0) * 1000,
percentiles_ttft_ms=[(p, np.percentile(ttfts or 0, p) * 1000)
for p in selected_percentiles],
mean_tpot_ms=np.mean(tpots or 0) * 1000,
std_tpot_ms=np.std(tpots or 0) * 1000,
median_tpot_ms=np.median(tpots or 0) * 1000,
percentiles_tpot_ms=[(p, np.percentile(tpots or 0, p) * 1000)
for p in selected_percentiles],
mean_itl_ms=np.mean(itls or 0) * 1000,
std_itl_ms=np.std(itls or 0) * 1000,
median_itl_ms=np.median(itls or 0) * 1000,
percentiles_itl_ms=[(p, np.percentile(itls or 0, p) * 1000)
for p in selected_percentiles],
mean_e2el_ms=np.mean(e2els or 0) * 1000,
std_e2el_ms=np.std(e2els or 0) * 1000,
median_e2el_ms=np.median(e2els or 0) * 1000,
percentiles_e2el_ms=[(p, np.percentile(e2els or 0, p) * 1000)
for p in selected_percentiles],
)
return metrics, actual_output_lens
async def benchmark(
backend: str,
api_url: str,
base_url: str,
model_id: str,
tokenizer: PreTrainedTokenizerBase,
input_requests: List[SampleRequest],
request_rate: float,
burstiness: float,
disable_tqdm: bool,
profile: bool,
selected_percentile_metrics: List[str],
selected_percentiles: List[str],
ignore_eos: bool,
max_concurrency: Optional[int],
guided_decoding_ratio: float,
guided_decoding_backend: str,
):
if backend in ASYNC_REQUEST_FUNCS:
request_func = ASYNC_REQUEST_FUNCS[backend]
else:
raise ValueError(f"Unknown backend: {backend}")
def prepare_extra_body(request) -> dict:
extra_body = {}
# Add the schema to the extra_body
extra_body[request.structure_type] = request.schema
# Add the specific guided_decoding_backend
extra_body["guided_decoding_backend"] = guided_decoding_backend
return extra_body
print("Starting initial single prompt test run...")
guided_decoding_req_idx = random.sample(
range(len(input_requests)),
int(len(input_requests) * guided_decoding_ratio))
test_request = input_requests[0]
test_input = RequestFuncInput(
model=model_id,
prompt=test_request.prompt,
api_url=api_url,
prompt_len=test_request.prompt_len,
output_len=test_request.expected_output_len,
ignore_eos=ignore_eos,
extra_body=prepare_extra_body(test_request),
)
test_output = await request_func(request_func_input=test_input)
if not test_output.success:
raise ValueError(
"Initial test run failed - Please make sure benchmark arguments "
f"are correctly specified. Error: {test_output.error}")
else:
print("Initial test run completed. Starting main benchmark run...")
if profile:
print("Starting profiler...")
profile_input = RequestFuncInput(
model=model_id,
prompt=test_request.prompt,
api_url=base_url + "/start_profile",
prompt_len=test_request.prompt_len,
output_len=test_request.expected_output_len,
ignore_eos=ignore_eos,
extra_body=prepare_extra_body(test_request),
)
profile_output = await request_func(request_func_input=profile_input)
if profile_output.success:
print("Profiler started")
if burstiness == 1.0:
distribution = "Poisson process"
else:
distribution = "Gamma distribution"
print(f"Traffic request rate: {request_rate}")
print(f"Burstiness factor: {burstiness} ({distribution})")
print(f"Maximum request concurrency: {max_concurrency}")
pbar = None if disable_tqdm else tqdm(total=len(input_requests))
# This can be used once the minimum Python version is 3.10 or higher,
# and it will simplify the code in limited_request_func.
# semaphore = (asyncio.Semaphore(max_concurrency)
# if max_concurrency else contextlib.nullcontext())
semaphore = (asyncio.Semaphore(max_concurrency)
if max_concurrency else None)
async def limited_request_func(request_func_input, pbar):
if semaphore is None:
return await request_func(request_func_input=request_func_input,
pbar=pbar)
async with semaphore:
return await request_func(request_func_input=request_func_input,
pbar=pbar)
benchmark_start_time = time.perf_counter()
tasks: List[asyncio.Task] = []
expected: List[str] = []
async for i, request in get_request(input_requests, request_rate,
burstiness):
extra_body = prepare_extra_body(
request) if i in guided_decoding_req_idx else None
request_func_input = RequestFuncInput(
model=model_id,
prompt=request.prompt,
api_url=api_url,
prompt_len=request.prompt_len,
output_len=request.expected_output_len,
ignore_eos=ignore_eos,
extra_body=extra_body,
)
expected.append(request.completion)
tasks.append(
asyncio.create_task(
limited_request_func(request_func_input=request_func_input,
pbar=pbar)))
outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks)
if profile:
print("Stopping profiler...")
profile_input = RequestFuncInput(
model=model_id,
prompt=test_request.prompt,
api_url=base_url + "/stop_profile",
prompt_len=test_request.prompt_len,
output_len=test_request.expected_output_len,
extra_body={test_request.structure_type: test_request.schema},
)
profile_output = await request_func(request_func_input=profile_input)
if profile_output.success:
print("Profiler stopped")
if pbar is not None:
pbar.close()
benchmark_duration = time.perf_counter() - benchmark_start_time
metrics, actual_output_lens = calculate_metrics(
input_requests=input_requests,
outputs=outputs,
dur_s=benchmark_duration,
tokenizer=tokenizer,
selected_percentile_metrics=selected_percentile_metrics,
selected_percentiles=selected_percentiles,
)
print("{s:{c}^{n}}".format(s=' Serving Benchmark Result ', n=50, c='='))
print("{:<40} {:<10}".format("Successful requests:", metrics.completed))
print("{:<40} {:<10.2f}".format("Benchmark duration (s):",
benchmark_duration))
print("{:<40} {:<10}".format("Total input tokens:", metrics.total_input))
print("{:<40} {:<10}".format("Total generated tokens:",
metrics.total_output))
print("{:<40} {:<10.2f}".format("Request throughput (req/s):",
metrics.request_throughput))
print("{:<40} {:<10.2f}".format("Output token throughput (tok/s):",
metrics.output_throughput))
print("{:<40} {:<10.2f}".format("Total Token throughput (tok/s):",
metrics.total_token_throughput))
result = {
"duration":
benchmark_duration,
"completed":
metrics.completed,
"total_input_tokens":
metrics.total_input,
"total_output_tokens":
metrics.total_output,
"request_throughput":
metrics.request_throughput,
"output_throughput":
metrics.output_throughput,
"total_token_throughput":
metrics.total_token_throughput,
"ttft_description":
pd.Series([output.ttft for output in outputs]).describe().to_dict(),
"tpot_description":
pd.Series([output.tpot for output in outputs]).describe().to_dict(),
"input_lens": [output.prompt_len for output in outputs],
"output_lens":
actual_output_lens,
"ttfts": [output.ttft for output in outputs],
"itls": [output.itl for output in outputs],
"errors": [output.error for output in outputs],
}
ret = [{
'generated': output.generated_text,
'expected': gt
} for output, gt in zip(outputs, expected)]
def process_one_metric(
# E.g., "ttft"
metric_attribute_name: str,
# E.g., "TTFT"
metric_name: str,
# E.g., "Time to First Token"
metric_header: str,
):
# This function prints and adds statistics of the specified
# metric.
if metric_attribute_name not in selected_percentile_metrics:
return
print("{s:{c}^{n}}".format(s=metric_header, n=50, c='-'))
print("{:<40} {:<10.2f}".format(
f"Mean {metric_name} (ms):",
getattr(metrics, f"mean_{metric_attribute_name}_ms")))
print("{:<40} {:<10.2f}".format(
f"Median {metric_name} (ms):",
getattr(metrics, f"median_{metric_attribute_name}_ms")))
result[f"mean_{metric_attribute_name}_ms"] = getattr(
metrics, f"mean_{metric_attribute_name}_ms")
result[f"median_{metric_attribute_name}_ms"] = getattr(
metrics, f"median_{metric_attribute_name}_ms")
result[f"std_{metric_attribute_name}_ms"] = getattr(
metrics, f"std_{metric_attribute_name}_ms")
for p, value in getattr(metrics,
f"percentiles_{metric_attribute_name}_ms"):
p_word = str(int(p)) if int(p) == p else str(p)
print("{:<40} {:<10.2f}".format(f"P{p_word} {metric_name} (ms):",
value))
result[f"p{p_word}_{metric_attribute_name}_ms"] = value
process_one_metric("ttft", "TTFT", "Time to First Token")
process_one_metric("tpot", "TPOT",
"Time per Output Token (excl. 1st token)")
process_one_metric("itl", "ITL", "Inter-token Latency")
process_one_metric("e2el", "E2EL", "End-to-end Latency")
print("=" * 50)
return result, ret
def evaluate(ret, args):
def _eval_correctness_json(expected, actual):
# extract json string from string using regex
import re
actual = actual.replace('\n', '').replace(' ', '').strip()
try:
actual = re.search(r'\{.*\}', actual).group()
actual = json.loads(actual)
except Exception:
return False
return True
def _eval_correctness_choice(expected, actual):
return actual in args.choice
def _eval_correctness_regex(expected, actual):
import re
return re.match(args.regex, actual) is not None
def _eval_correctness(expected, actual):
if args.structure_type == 'guided_json':
return _eval_correctness_json(expected, actual)
elif args.structure_type == 'guided_regex':
return _eval_correctness_regex(expected, actual)
elif args.structure_type == 'guided_choice':
return _eval_correctness_choice(expected, actual)
else:
return None
scores = []
for res in ret:
score = _eval_correctness(res['expected'], res['generated'])
res['correctness'] = score
scores.append(score)
not_none_scores = [score for score in scores if score is not None]
return (sum(not_none_scores) / len(not_none_scores) *
100) if len(not_none_scores) > 0 else None
def main(args: argparse.Namespace):
print(args)
random.seed(args.seed)
np.random.seed(args.seed)
backend = args.backend
model_id = args.model
tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
if args.base_url is not None:
api_url = f"{args.base_url}{args.endpoint}"
base_url = f"{args.base_url}"
else:
api_url = f"http://{args.host}:{args.port}{args.endpoint}"
base_url = f"http://{args.host}:{args.port}"
tokenizer = get_tokenizer(tokenizer_id,
trust_remote_code=args.trust_remote_code)
if args.dataset == 'grammar':
args.structure_type = 'guided_grammar'
elif args.dataset == 'regex':
args.structure_type = 'guided_regex'
elif args.dataset == 'choice':
args.structure_type = 'guided_choice'
else:
args.structure_type = 'guided_json'
if args.no_guided_decoding:
args.guided_decoding_ratio = 0
if args.save_results:
result_file_name = f'{args.guided_decoding_ratio}guided'
result_file_name += f"_{backend}"
result_file_name += f"_{args.request_rate}qps"
result_file_name += f"_{args.model.split('/')[-1]}"
result_file_name += f"_{args.dataset}"
result_file_name += f"_{args.num_prompts}"
result_file_name += f"_out{args.output_len}"
result_file_name += ".txt"
else:
result_file_name = None
input_requests = sample_requests(tokenizer, args)
benchmark_result, ret = asyncio.run(
benchmark(
backend=backend,
api_url=api_url,
base_url=base_url,
model_id=model_id,
tokenizer=tokenizer,
input_requests=input_requests,
request_rate=args.request_rate,
burstiness=args.burstiness,
disable_tqdm=args.disable_tqdm,
profile=args.profile,
selected_percentile_metrics=args.percentile_metrics.split(","),
selected_percentiles=[
float(p) for p in args.metric_percentiles.split(",")
],
ignore_eos=args.ignore_eos,
max_concurrency=args.max_concurrency,
guided_decoding_ratio=args.guided_decoding_ratio,
guided_decoding_backend=args.guided_decoding_backend,
))
# Save config and results to json
score = evaluate(ret, args)
print("correct_rate(%)", score, '\n')
if args.save_results:
results = {
"backend":
backend,
"model_id":
model_id,
"tokenizer_id":
tokenizer_id,
"num_prompts":
args.num_prompts,
"request_rate":
args.request_rate if args.request_rate < float("inf") else "inf",
"burstiness":
args.burstiness,
"max_concurrency":
args.max_concurrency,
"correct_rate(%)":
score
}
results = {"outputs": ret, **results, **benchmark_result}
# Save to file
if args.result_filename:
result_file_name = args.result_filename
if args.result_dir:
result_file_name = os.path.join(args.result_dir, result_file_name)
with open(result_file_name, "w", encoding='utf-8') as outfile:
json.dump(results, outfile, indent=4)
if __name__ == "__main__":
parser = FlexibleArgumentParser(
description="Benchmark the online serving throughput.")
parser.add_argument(
"--backend",
type=str,
default="vllm",
choices=list(ASYNC_REQUEST_FUNCS.keys()),
)
parser.add_argument(
"--base-url",
type=str,
default=None,
help="Server or API base url if not using http host and port.",
)
parser.add_argument("--host", type=str, default="localhost")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument(
"--endpoint",
type=str,
default="/v1/completions",
help="API endpoint.",
)
parser.add_argument(
"--dataset",
default='json',
choices=['json', 'grammar', 'regex', 'choice', 'xgrammar_bench'])
parser.add_argument("--json_schema_path",
type=str,
default=None,
help="Path to json schema.")
parser.add_argument(
"--max-concurrency",
type=int,
default=None,
help="Maximum number of concurrent requests. This can be used "
"to help simulate an environment where a higher level component "
"is enforcing a maximum number of concurrent requests. While the "
"--request-rate argument controls the rate at which requests are "
"initiated, this argument will control how many are actually allowed "
"to execute at a time. This means that when used in combination, the "
"actual request rate may be lower than specified with --request-rate, "
"if the server is not processing requests fast enough to keep up.")
parser.add_argument(
"--model",
type=str,
required=True,
help="Name of the model.",
)
parser.add_argument(
"--tokenizer",
type=str,
help=
"Name or path of the tokenizer, if not using the default tokenizer.", # noqa: E501
)
parser.add_argument(
"--num-prompts",
type=int,
default=1000,
help="Number of prompts to process.",
)
parser.add_argument(
"--output-len",
type=int,
default=128,
help="Number of output tokens.",
)
parser.add_argument(
"--request-rate",
type=float,
default=float("inf"),
help="Number of requests per second. If this is inf, "
"then all the requests are sent at time 0. "
"Otherwise, we use Poisson process or gamma distribution "
"to synthesize the request arrival times.",
)
parser.add_argument(
"--burstiness",
type=float,
default=1.0,
help="Burstiness factor of the request generation. "
"Only take effect when request_rate is not inf. "
"Default value is 1, which follows Poisson process. "
"Otherwise, the request intervals follow a gamma distribution. "
"A lower burstiness value (0 < burstiness < 1) results in more "
"bursty requests. A higher burstiness value (burstiness > 1) "
"results in a more uniform arrival of requests.",
)
parser.add_argument("--seed", type=int, default=0)
parser.add_argument(
"--trust-remote-code",
action="store_true",
help="Trust remote code from huggingface",
)
parser.add_argument(
"--disable-tqdm",
action="store_true",
help="Specify to disable tqdm progress bar.",
)
parser.add_argument(
"--save-results",
action="store_true",
help="Specify to save benchmark results to a json file",
)
parser.add_argument(
"--profile",
action="store_true",
help="Use Torch Profiler. The endpoint must be launched with "
"VLLM_TORCH_PROFILER_DIR to enable profiler.",
)
parser.add_argument(
"--result-dir",
type=str,
default=None,
help="Specify directory to save benchmark json results."
"If not specified, results are saved in the current directory.",
)
parser.add_argument(
"--result-filename",
type=str,
default=None,
help="Specify the filename to save benchmark json results."
"If not specified, results will be saved in "
"{backend}-{args.request_rate}qps-{base_model_id}-{current_dt}.json"
" format.",
)
parser.add_argument(
"--ignore-eos",
action="store_true",
help="Set ignore_eos flag when sending the benchmark request."
"Warning: ignore_eos is not supported in deepspeed_mii and tgi.")
parser.add_argument(
"--percentile-metrics",
type=str,
default="ttft,tpot,itl",
help="Comma-seperated list of selected metrics to report percentils. "
"This argument specifies the metrics to report percentiles. "
"Allowed metric names are \"ttft\", \"tpot\", \"itl\", \"e2el\". "
"Default value is \"ttft,tpot,itl\".")
parser.add_argument(
"--metric-percentiles",
type=str,
default="99",
help="Comma-seperated list of percentiles for selected metrics. "
"To report 25-th, 50-th, and 75-th percentiles, use \"25,50,75\". "
"Default value is \"99\". "
"Use \"--percentile-metrics\" to select metrics.",
)
parser.add_argument("--no-guided-decoding",
action='store_true',
default=False,
help="Whether to disable JSON decoding or not.")
parser.add_argument("--guided-decoding-ratio",
type=float,
default=1.0,
help="Ratio of Guided Decoding requests")
parser.add_argument("--guided-decoding-backend",
type=str,
choices=["outlines", "lm-format-enforcer", "xgrammar"],
default="xgrammar",
help="Backend to use for guided decoding")
args = parser.parse_args()
main(args)

View File

@@ -1,30 +1,100 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark offline inference throughput.""" """Benchmark offline inference throughput."""
import argparse import argparse
import dataclasses
import json import json
import random import random
import time import time
from typing import List, Optional, Tuple from functools import cache
from typing import Dict, List, Optional, Tuple
import torch import torch
import uvloop import uvloop
from PIL import Image
from tqdm import tqdm from tqdm import tqdm
from transformers import (AutoModelForCausalLM, AutoTokenizer, from transformers import (AutoModelForCausalLM, AutoTokenizer,
PreTrainedTokenizerBase) PreTrainedTokenizerBase)
from vllm.engine.arg_utils import DEVICE_OPTIONS, AsyncEngineArgs, EngineArgs from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
from vllm.entrypoints.openai.api_server import ( from vllm.entrypoints.openai.api_server import (
build_async_engine_client_from_engine_args) build_async_engine_client_from_engine_args)
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS from vllm.inputs import TextPrompt
from vllm.lora.request import LoRARequest
from vllm.lora.utils import get_adapter_absolute_path
from vllm.multimodal import MultiModalDataDict
from vllm.sampling_params import BeamSearchParams from vllm.sampling_params import BeamSearchParams
from vllm.transformers_utils.tokenizer import AnyTokenizer, get_lora_tokenizer
from vllm.utils import FlexibleArgumentParser, merge_async_iterators from vllm.utils import FlexibleArgumentParser, merge_async_iterators
def sample_requests( @dataclasses.dataclass
dataset_path: str, class SampleRequest:
num_requests: int, """A class representing a single inference request for benchmarking.
tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int], Attributes:
) -> List[Tuple[str, int, int]]: prompt: The input text prompt for the model.
prompt_len: The length of the prompt in tokens.
expected_output_len: The expected length of the output in tokens.
multi_modal_data: Optional dictionary containing multi-modal data (e.g.
images).
lora_request: Optional LoRARequest specifying the LoRA to use.
"""
prompt: str
prompt_len: int
expected_output_len: int
multi_modal_data: Optional[MultiModalDataDict] = None
lora_request: Optional[LoRARequest] = None
def _get_prompt_for_image_model(question: str, *, model: str) -> str:
"""Prepend and append special tokens around the question to form a prompt.
Args:
question: The input question text to wrap with special tokens
model: The name of the model being used, to determine which special
tokens to add
Returns:
The formatted prompt string with appropriate special tokens for the
model
Raises:
ValueError: If an unsupported model name is provided
"""
model = model.lower()
if "pixtral" in model:
return f"<s>[INST]{question}\n[IMG][/INST]"
raise ValueError(f"Unsupported model {model}")
@cache
def lora_path_on_disk(lora_path: str) -> str:
return get_adapter_absolute_path(lora_path)
lora_tokenizer_cache: Dict[int, AnyTokenizer] = {}
def get_random_lora_request(
args: argparse.Namespace
) -> Tuple[LoRARequest, Optional[AnyTokenizer]]:
global lora_tokenizer_cache
lora_id = random.randint(1, args.max_loras)
lora_request = LoRARequest(lora_name=str(lora_id),
lora_int_id=lora_id,
lora_path=lora_path_on_disk(args.lora_path))
if lora_id not in lora_tokenizer_cache:
lora_tokenizer_cache[lora_id] = get_lora_tokenizer(lora_request)
return lora_request, lora_tokenizer_cache[lora_id]
def sample_requests(tokenizer: PreTrainedTokenizerBase,
args: argparse.Namespace) -> List[SampleRequest]:
dataset_path: str = args.dataset
num_requests: int = args.num_prompts
fixed_output_len: Optional[int] = args.output_len
model: str = args.model
if fixed_output_len is not None and fixed_output_len < 4: if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small") raise ValueError("output_len too small")
@@ -33,24 +103,46 @@ def sample_requests(
dataset = json.load(f) dataset = json.load(f)
# Filter out the conversations with less than 2 turns. # Filter out the conversations with less than 2 turns.
dataset = [data for data in dataset if len(data["conversations"]) >= 2] dataset = [data for data in dataset if len(data["conversations"]) >= 2]
# Only keep the first two turns of each conversation.
dataset = [(data["conversations"][0]["value"],
data["conversations"][1]["value"]) for data in dataset]
# Shuffle the dataset. # Shuffle the dataset.
random.shuffle(dataset) random.shuffle(dataset)
# Filter out sequences that are too long or too short # Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = [] filtered_dataset: List[SampleRequest] = []
for i in range(len(dataset)): for data in tqdm(dataset,
total=len(filtered_dataset),
desc="sampling requests"):
if len(filtered_dataset) == num_requests: if len(filtered_dataset) == num_requests:
break break
# Only keep the first two turns of each conversation.
prompt = data["conversations"][0]["value"]
completion = data["conversations"][1]["value"]
multi_modal_data: Optional[MultiModalDataDict] = None
if "image" in data:
multi_modal_data = multi_modal_data or {}
image_path = data["image"]
# TODO(vllm-project/vllm/issues/9778): Support multiple images.
assert isinstance(image_path,
str), "Only support single image input"
try:
multi_modal_data["image"] = Image.open(image_path).convert(
"RGB")
except FileNotFoundError:
# Ignore datapoint where asset is missing
continue
prompt = _get_prompt_for_image_model(question=prompt, model=model)
request_tokenizer = tokenizer
lora_request: Optional[LoRARequest] = None
if args.enable_lora:
lora_request, lora_tokenizer = get_random_lora_request(args)
if lora_tokenizer:
request_tokenizer = lora_tokenizer
# Tokenize the prompts and completions. # Tokenize the prompts and completions.
prompt = dataset[i][0] prompt_token_ids = request_tokenizer(prompt).input_ids
prompt_token_ids = tokenizer(prompt).input_ids completion_token_ids = request_tokenizer(completion).input_ids
completion = dataset[i][1]
completion_token_ids = tokenizer(completion).input_ids
prompt_len = len(prompt_token_ids) prompt_len = len(prompt_token_ids)
output_len = len(completion_token_ids output_len = len(completion_token_ids
) if fixed_output_len is None else fixed_output_len ) if fixed_output_len is None else fixed_output_len
@@ -60,87 +152,59 @@ def sample_requests(
if prompt_len > 1024 or prompt_len + output_len > 2048: if prompt_len > 1024 or prompt_len + output_len > 2048:
# Prune too long sequences. # Prune too long sequences.
continue continue
filtered_dataset.append((prompt, prompt_len, output_len)) filtered_dataset.append(
SampleRequest(prompt=prompt,
prompt_len=prompt_len,
expected_output_len=output_len,
multi_modal_data=multi_modal_data,
lora_request=lora_request))
return filtered_dataset return filtered_dataset
def run_vllm( def run_vllm(
requests: List[Tuple[str, int, int]], requests: List[SampleRequest],
model: str,
tokenizer: str,
quantization: Optional[str],
tensor_parallel_size: int,
seed: int,
n: int, n: int,
trust_remote_code: bool, engine_args: EngineArgs,
dtype: str,
max_model_len: Optional[int],
enforce_eager: bool,
kv_cache_dtype: str,
quantization_param_path: Optional[str],
device: str,
enable_prefix_caching: bool,
enable_chunked_prefill: bool,
max_num_batched_tokens: int,
distributed_executor_backend: Optional[str],
gpu_memory_utilization: float = 0.9,
num_scheduler_steps: int = 1,
download_dir: Optional[str] = None,
load_format: str = EngineArgs.load_format,
disable_async_output_proc: bool = False,
) -> float: ) -> float:
from vllm import LLM, SamplingParams from vllm import LLM, SamplingParams
llm = LLM( llm = LLM(**dataclasses.asdict(engine_args))
model=model,
tokenizer=tokenizer,
quantization=quantization,
tensor_parallel_size=tensor_parallel_size,
seed=seed,
trust_remote_code=trust_remote_code,
dtype=dtype,
max_model_len=max_model_len,
gpu_memory_utilization=gpu_memory_utilization,
enforce_eager=enforce_eager,
kv_cache_dtype=kv_cache_dtype,
quantization_param_path=quantization_param_path,
device=device,
enable_prefix_caching=enable_prefix_caching,
download_dir=download_dir,
enable_chunked_prefill=enable_chunked_prefill,
max_num_batched_tokens=max_num_batched_tokens,
distributed_executor_backend=distributed_executor_backend,
load_format=load_format,
num_scheduler_steps=num_scheduler_steps,
disable_async_output_proc=disable_async_output_proc,
)
# Add the requests to the engine. # Add the requests to the engine.
prompts: List[str] = [] prompts: List[TextPrompt] = []
sampling_params: List[SamplingParams] = [] sampling_params: List[SamplingParams] = []
for prompt, _, output_len in requests: for request in requests:
prompts.append(prompt) prompts.append(
TextPrompt(prompt=request.prompt,
multi_modal_data=request.multi_modal_data))
sampling_params.append( sampling_params.append(
SamplingParams( SamplingParams(
n=n, n=n,
temperature=1.0, temperature=1.0,
top_p=1.0, top_p=1.0,
ignore_eos=True, ignore_eos=True,
max_tokens=output_len, max_tokens=request.expected_output_len,
)) ))
lora_requests: Optional[List[LoRARequest]] = None
if engine_args.enable_lora:
lora_requests = [request.lora_request for request in requests]
use_beam_search = False use_beam_search = False
if not use_beam_search: if not use_beam_search:
start = time.perf_counter() start = time.perf_counter()
llm.generate(prompts, sampling_params, use_tqdm=True) llm.generate(prompts,
sampling_params,
lora_request=lora_requests,
use_tqdm=True)
end = time.perf_counter() end = time.perf_counter()
else: else:
prompts = [prompt for prompt, _, _ in requests] assert lora_requests is None, "BeamSearch API does not support LoRA"
prompts = [request.prompt for request in requests]
# output_len should be the same for all requests. # output_len should be the same for all requests.
output_len = requests[0][2] output_len = requests[0][2]
for prompt, input_len, _output_len in requests: for request in requests:
assert _output_len == output_len assert request.expected_output_len == output_len
start = time.perf_counter() start = time.perf_counter()
llm.beam_search( llm.beam_search(
prompts, prompts,
@@ -154,79 +218,42 @@ def run_vllm(
async def run_vllm_async( async def run_vllm_async(
requests: List[Tuple[str, int, int]], requests: List[SampleRequest],
model: str,
tokenizer: str,
quantization: Optional[str],
tensor_parallel_size: int,
seed: int,
n: int, n: int,
trust_remote_code: bool, engine_args: AsyncEngineArgs,
dtype: str,
max_model_len: Optional[int],
enforce_eager: bool,
kv_cache_dtype: str,
quantization_param_path: Optional[str],
device: str,
enable_prefix_caching: bool,
enable_chunked_prefill: bool,
max_num_batched_tokens: int,
distributed_executor_backend: Optional[str],
gpu_memory_utilization: float = 0.9,
num_scheduler_steps: int = 1,
download_dir: Optional[str] = None,
load_format: str = EngineArgs.load_format,
disable_async_output_proc: bool = False,
disable_frontend_multiprocessing: bool = False, disable_frontend_multiprocessing: bool = False,
) -> float: ) -> float:
from vllm import SamplingParams from vllm import SamplingParams
engine_args = AsyncEngineArgs(
model=model,
tokenizer=tokenizer,
quantization=quantization,
tensor_parallel_size=tensor_parallel_size,
seed=seed,
trust_remote_code=trust_remote_code,
dtype=dtype,
max_model_len=max_model_len,
gpu_memory_utilization=gpu_memory_utilization,
enforce_eager=enforce_eager,
kv_cache_dtype=kv_cache_dtype,
quantization_param_path=quantization_param_path,
device=device,
enable_prefix_caching=enable_prefix_caching,
download_dir=download_dir,
enable_chunked_prefill=enable_chunked_prefill,
max_num_batched_tokens=max_num_batched_tokens,
distributed_executor_backend=distributed_executor_backend,
load_format=load_format,
num_scheduler_steps=num_scheduler_steps,
disable_async_output_proc=disable_async_output_proc,
worker_use_ray=False,
disable_log_requests=True,
)
async with build_async_engine_client_from_engine_args( async with build_async_engine_client_from_engine_args(
engine_args, disable_frontend_multiprocessing) as llm: engine_args, disable_frontend_multiprocessing) as llm:
# Add the requests to the engine. # Add the requests to the engine.
prompts: List[str] = [] prompts: List[TextPrompt] = []
sampling_params: List[SamplingParams] = [] sampling_params: List[SamplingParams] = []
for prompt, _, output_len in requests: lora_requests: List[Optional[LoRARequest]] = []
prompts.append(prompt) for request in requests:
prompts.append(
TextPrompt(prompt=request.prompt,
multi_modal_data=request.multi_modal_data))
sampling_params.append( sampling_params.append(
SamplingParams( SamplingParams(
n=n, n=n,
temperature=1.0, temperature=1.0,
top_p=1.0, top_p=1.0,
ignore_eos=True, ignore_eos=True,
max_tokens=output_len, max_tokens=request.expected_output_len,
)) ))
lora_requests.append(request.lora_request)
generators = [] generators = []
start = time.perf_counter() start = time.perf_counter()
for i, (prompt, sp) in enumerate(zip(prompts, sampling_params)): for i, (prompt, sp,
generator = llm.generate(prompt, sp, request_id=f"test{i}") lr) in enumerate(zip(prompts, sampling_params, lora_requests)):
generator = llm.generate(prompt,
sp,
lora_request=lr,
request_id=f"test{i}")
generators.append(generator) generators.append(generator)
all_gens = merge_async_iterators(*generators) all_gens = merge_async_iterators(*generators)
async for i, res in all_gens: async for i, res in all_gens:
@@ -236,7 +263,7 @@ async def run_vllm_async(
def run_hf( def run_hf(
requests: List[Tuple[str, int, int]], requests: List[SampleRequest],
model: str, model: str,
tokenizer: PreTrainedTokenizerBase, tokenizer: PreTrainedTokenizerBase,
n: int, n: int,
@@ -294,14 +321,14 @@ def run_hf(
def run_mii( def run_mii(
requests: List[Tuple[str, int, int]], requests: List[SampleRequest],
model: str, model: str,
tensor_parallel_size: int, tensor_parallel_size: int,
output_len: int, output_len: int,
) -> float: ) -> float:
from mii import client, serve from mii import client, serve
llm = serve(model, tensor_parallel=tensor_parallel_size) llm = serve(model, tensor_parallel=tensor_parallel_size)
prompts = [prompt for prompt, _, _ in requests] prompts = [request.prompt for request in requests]
start = time.perf_counter() start = time.perf_counter()
llm.generate(prompts, max_new_tokens=output_len) llm.generate(prompts, max_new_tokens=output_len)
@@ -319,32 +346,62 @@ def main(args: argparse.Namespace):
tokenizer = AutoTokenizer.from_pretrained( tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer, trust_remote_code=args.trust_remote_code) args.tokenizer, trust_remote_code=args.trust_remote_code)
if args.dataset is None: if args.dataset is None:
# Synthesize a prompt with the given input length. vocab_size = tokenizer.vocab_size
prompt = "hi" * (args.input_len - 1) requests = []
requests = [(prompt, args.input_len, args.output_len) for _ in range(args.num_prompts):
for _ in range(args.num_prompts)]
request_tokenizer = tokenizer
lora_request: Optional[LoRARequest] = None
if args.enable_lora:
lora_request, lora_tokenizer = get_random_lora_request(args)
if lora_tokenizer:
request_tokenizer = lora_tokenizer
# Synthesize a prompt with the given input length.
candidate_ids = [
random.randint(0, vocab_size - 1)
for _ in range(args.input_len)
]
# As tokenizer may add additional tokens like BOS, we need to try
# different lengths to get the desired input length.
for _ in range(5): # Max attempts to correct
candidate_prompt = request_tokenizer.decode(candidate_ids)
tokenized_len = len(request_tokenizer.encode(candidate_prompt))
if tokenized_len == args.input_len:
break
# Adjust length based on difference
diff = args.input_len - tokenized_len
if diff > 0:
candidate_ids.extend([
random.randint(100, vocab_size - 100)
for _ in range(diff)
])
else:
candidate_ids = candidate_ids[:diff]
requests.append(
SampleRequest(prompt=candidate_prompt,
prompt_len=args.input_len,
expected_output_len=args.output_len,
lora_request=lora_request))
else: else:
requests = sample_requests(args.dataset, args.num_prompts, tokenizer, requests = sample_requests(tokenizer, args)
args.output_len)
is_multi_modal = any(request.multi_modal_data is not None
for request in requests)
if args.backend == "vllm": if args.backend == "vllm":
run_args = [
requests, args.model, args.tokenizer, args.quantization,
args.tensor_parallel_size, args.seed, args.n,
args.trust_remote_code, args.dtype, args.max_model_len,
args.enforce_eager, args.kv_cache_dtype,
args.quantization_param_path, args.device,
args.enable_prefix_caching, args.enable_chunked_prefill,
args.max_num_batched_tokens, args.distributed_executor_backend,
args.gpu_memory_utilization, args.num_scheduler_steps,
args.download_dir, args.load_format, args.disable_async_output_proc
]
if args.async_engine: if args.async_engine:
run_args.append(args.disable_frontend_multiprocessing) elapsed_time = uvloop.run(
elapsed_time = uvloop.run(run_vllm_async(*run_args)) run_vllm_async(
requests,
args.n,
AsyncEngineArgs.from_cli_args(args),
args.disable_frontend_multiprocessing,
))
else: else:
elapsed_time = run_vllm(*run_args) elapsed_time = run_vllm(requests, args.n,
EngineArgs.from_cli_args(args))
elif args.backend == "hf": elif args.backend == "hf":
assert args.tensor_parallel_size == 1 assert args.tensor_parallel_size == 1
elapsed_time = run_hf(requests, args.model, tokenizer, args.n, elapsed_time = run_hf(requests, args.model, tokenizer, args.n,
@@ -354,10 +411,18 @@ def main(args: argparse.Namespace):
args.output_len) args.output_len)
else: else:
raise ValueError(f"Unknown backend: {args.backend}") raise ValueError(f"Unknown backend: {args.backend}")
total_num_tokens = sum(prompt_len + output_len total_num_tokens = sum(request.prompt_len + request.expected_output_len
for _, prompt_len, output_len in requests) for request in requests)
total_output_tokens = sum(request.expected_output_len
for request in requests)
if is_multi_modal:
print("\033[91mWARNING\033[0m: Multi-modal request detected. The "
"following metrics are not accurate because image tokens are not"
" counted. See vllm-project/vllm/issues/9778 for details.")
# TODO(vllm-project/vllm/issues/9778): Count molti-modal token length.
print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, " print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
f"{total_num_tokens / elapsed_time:.2f} tokens/s") f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
f"{total_output_tokens / elapsed_time:.2f} output tokens/s")
# Output JSON results if specified # Output JSON results if specified
if args.output_json: if args.output_json:
@@ -381,7 +446,9 @@ if __name__ == "__main__":
parser.add_argument("--dataset", parser.add_argument("--dataset",
type=str, type=str,
default=None, default=None,
help="Path to the dataset.") help="Path to the dataset. The dataset is expected to "
"be a json in form of List[Dict[..., conversations: "
"List[Dict[..., value: <prompt_or_response>]]]]")
parser.add_argument("--input-len", parser.add_argument("--input-len",
type=int, type=int,
default=None, default=None,
@@ -391,13 +458,6 @@ if __name__ == "__main__":
default=None, default=None,
help="Output length for each request. Overrides the " help="Output length for each request. Overrides the "
"output length from the dataset.") "output length from the dataset.")
parser.add_argument("--model", type=str, default="facebook/opt-125m")
parser.add_argument("--tokenizer", type=str, default=None)
parser.add_argument('--quantization',
'-q',
choices=[*QUANTIZATION_METHODS, None],
default=None)
parser.add_argument("--tensor-parallel-size", "-tp", type=int, default=1)
parser.add_argument("--n", parser.add_argument("--n",
type=int, type=int,
default=1, default=1,
@@ -406,123 +466,15 @@ if __name__ == "__main__":
type=int, type=int,
default=1000, default=1000,
help="Number of prompts to process.") help="Number of prompts to process.")
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--hf-max-batch-size", parser.add_argument("--hf-max-batch-size",
type=int, type=int,
default=None, default=None,
help="Maximum batch size for HF backend.") help="Maximum batch size for HF backend.")
parser.add_argument('--trust-remote-code',
action='store_true',
help='trust remote code from huggingface')
parser.add_argument(
'--max-model-len',
type=int,
default=None,
help='Maximum length of a sequence (including prompt and output). '
'If None, will be derived from the model.')
parser.add_argument(
'--dtype',
type=str,
default='auto',
choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
help='data type for model weights and activations. '
'The "auto" option will use FP16 precision '
'for FP32 and FP16 models, and BF16 precision '
'for BF16 models.')
parser.add_argument('--gpu-memory-utilization',
type=float,
default=0.9,
help='the fraction of GPU memory to be used for '
'the model executor, which can range from 0 to 1.'
'If unspecified, will use the default value of 0.9.')
parser.add_argument("--enforce-eager",
action="store_true",
help="enforce eager execution")
parser.add_argument(
'--kv-cache-dtype',
type=str,
choices=['auto', 'fp8', 'fp8_e5m2', 'fp8_e4m3'],
default="auto",
help='Data type for kv cache storage. If "auto", will use model '
'data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. '
'ROCm (AMD GPU) supports fp8 (=fp8_e4m3)')
parser.add_argument(
'--quantization-param-path',
type=str,
default=None,
help='Path to the JSON file containing the KV cache scaling factors. '
'This should generally be supplied, when KV cache dtype is FP8. '
'Otherwise, KV cache scaling factors default to 1.0, which may cause '
'accuracy issues. FP8_E5M2 (without scaling) is only supported on '
'cuda version greater than 11.8. On ROCm (AMD GPU), FP8_E4M3 is '
'instead supported for common inference criteria.')
parser.add_argument("--device",
type=str,
default="auto",
choices=DEVICE_OPTIONS,
help='device type for vLLM execution')
parser.add_argument(
"--num-scheduler-steps",
type=int,
default=1,
help="Maximum number of forward steps per scheduler call.")
parser.add_argument(
"--enable-prefix-caching",
action='store_true',
help="Enable automatic prefix caching for vLLM backend.")
parser.add_argument("--enable-chunked-prefill",
action='store_true',
help="enable chunked prefill for vLLM backend.")
parser.add_argument('--max-num-batched-tokens',
type=int,
default=None,
help='maximum number of batched tokens per '
'iteration')
parser.add_argument('--download-dir',
type=str,
default=None,
help='directory to download and load the weights, '
'default to the default cache dir of huggingface')
parser.add_argument( parser.add_argument(
'--output-json', '--output-json',
type=str, type=str,
default=None, default=None,
help='Path to save the throughput results in JSON format.') help='Path to save the throughput results in JSON format.')
parser.add_argument(
'--distributed-executor-backend',
choices=['ray', 'mp'],
default=None,
help='Backend to use for distributed serving. When more than 1 GPU '
'is used, will be automatically set to "ray" if installed '
'or "mp" (multiprocessing) otherwise.')
parser.add_argument(
'--load-format',
type=str,
default=EngineArgs.load_format,
choices=[
'auto', 'pt', 'safetensors', 'npcache', 'dummy', 'tensorizer',
'bitsandbytes'
],
help='The format of the model weights to load.\n\n'
'* "auto" will try to load the weights in the safetensors format '
'and fall back to the pytorch bin format if safetensors format '
'is not available.\n'
'* "pt" will load the weights in the pytorch bin format.\n'
'* "safetensors" will load the weights in the safetensors format.\n'
'* "npcache" will load the weights in pytorch format and store '
'a numpy cache to speed up the loading.\n'
'* "dummy" will initialize the weights with random values, '
'which is mainly for profiling.\n'
'* "tensorizer" will load the weights using tensorizer from '
'CoreWeave. See the Tensorize vLLM Model script in the Examples'
'section for more information.\n'
'* "bitsandbytes" will load the weights using bitsandbytes '
'quantization.\n')
parser.add_argument(
"--disable-async-output-proc",
action='store_true',
default=False,
help="Disable async output processor for vLLM backend.")
parser.add_argument("--async-engine", parser.add_argument("--async-engine",
action='store_true', action='store_true',
default=False, default=False,
@@ -531,6 +483,15 @@ if __name__ == "__main__":
action='store_true', action='store_true',
default=False, default=False,
help="Disable decoupled async engine frontend.") help="Disable decoupled async engine frontend.")
# LoRA
parser.add_argument(
"--lora-path",
type=str,
default=None,
help="Path to the lora adapters to use. This can be an absolute path, "
"a relative path, or a Hugging Face model identifier.")
parser = AsyncEngineArgs.add_cli_args(parser)
args = parser.parse_args() args = parser.parse_args()
if args.tokenizer is None: if args.tokenizer is None:
args.tokenizer = args.model args.tokenizer = args.model
@@ -539,6 +500,8 @@ if __name__ == "__main__":
assert args.output_len is not None assert args.output_len is not None
else: else:
assert args.input_len is None assert args.input_len is None
if args.enable_lora:
assert args.lora_path is not None
if args.backend == "vllm": if args.backend == "vllm":
if args.hf_max_batch_size is not None: if args.hf_max_batch_size is not None:
@@ -548,6 +511,9 @@ if __name__ == "__main__":
raise ValueError("HF max batch size is required for HF backend.") raise ValueError("HF max batch size is required for HF backend.")
if args.quantization is not None: if args.quantization is not None:
raise ValueError("Quantization is only for vLLM backend.") raise ValueError("Quantization is only for vLLM backend.")
if args.enable_lora is not None:
raise ValueError("LoRA benchmarking is only supported for vLLM"
" backend")
elif args.backend == "mii": elif args.backend == "mii":
if args.dtype != "auto": if args.dtype != "auto":
raise ValueError("dtype must be auto for MII backend.") raise ValueError("dtype must be auto for MII backend.")
@@ -560,4 +526,7 @@ if __name__ == "__main__":
if args.tokenizer != args.model: if args.tokenizer != args.model:
raise ValueError("Tokenizer must be the same as the model for MII " raise ValueError("Tokenizer must be the same as the model for MII "
"backend.") "backend.")
if args.enable_lora is not None:
raise ValueError("LoRA benchmarking is only supported for vLLM"
" backend")
main(args) main(args)

View File

@@ -0,0 +1,386 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import copy
import itertools
import pickle as pkl
import time
from typing import Callable, Iterable, List, Tuple
import torch
import torch.utils.benchmark as TBenchmark
from torch.utils.benchmark import Measurement as TMeasurement
from utils import make_rand_sparse_tensors
from weight_shapes import WEIGHT_SHAPES
from vllm import _custom_ops as ops
from vllm.utils import FlexibleArgumentParser
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
DEFAULT_TP_SIZES = [1]
# bench
def bench_fn(label: str, sub_label: str, description: str, fn: Callable, *args,
**kwargs) -> TMeasurement:
min_run_time = 1
globals = {
"args": args,
"kwargs": kwargs,
"fn": fn,
}
return TBenchmark.Timer(
stmt="fn(*args, **kwargs)",
globals=globals,
label=label,
sub_label=sub_label,
description=description,
).blocked_autorange(min_run_time=min_run_time)
def bench_int8(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
assert dtype == torch.int8
b_compressed, e, a, b = make_rand_sparse_tensors(torch.int8, m, n, k)
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
scale_b = torch.tensor(1.0, device="cuda", dtype=torch.float32)
bias = torch.zeros((n, ), device="cuda", dtype=torch.bfloat16)
out = ops.cutlass_scaled_sparse_mm(a, b_compressed, e, scale_a, scale_b,
torch.bfloat16)
out_ref = ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16)
if not torch.allclose(out, out_ref):
print("Incorrect results")
print(out)
print(out_ref)
else:
print("Correct results")
timers = []
# pytorch impl - bfloat16
timers.append(
bench_fn(label, sub_label, "pytorch_bf16_bf16_bf16_matmul-no-scales",
torch.mm, a.to(dtype=torch.bfloat16),
b.to(dtype=torch.bfloat16)))
# pytorch impl - float16
timers.append(
bench_fn(label, sub_label,
"pytorch_fp16_fp16_fp16_matmul-no-scales", torch.mm,
a.to(dtype=torch.float16), b.to(dtype=torch.float16)))
# cutlass impl
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b,
torch.bfloat16))
# cutlass with bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_bias",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.bfloat16,
bias))
# cutlass sparse impl
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_sparse_mm",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.bfloat16))
# cutlass sparse with bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_sparse_mm_bias",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.bfloat16, bias))
return timers
def bench_fp8(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
assert dtype == torch.float8_e4m3fn
b_compressed, e, a, b = make_rand_sparse_tensors(torch.float8_e4m3fn, m, n,
k)
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
scale_b = torch.tensor(1.0, device="cuda", dtype=torch.float32)
bias = torch.zeros((n, ), device="cuda", dtype=torch.bfloat16)
out = ops.cutlass_scaled_sparse_mm(a, b_compressed, e, scale_a, scale_b,
torch.bfloat16)
out_ref = ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16)
if not torch.allclose(out, out_ref):
print("Incorrect results")
print(out)
print(out_ref)
else:
print("Correct results")
timers = []
# pytorch impl w. bf16
timers.append(
bench_fn(label, sub_label, "pytorch_bf16_bf16_bf16_matmul-no-scales",
torch.mm, a.to(dtype=torch.bfloat16, device="cuda"),
b.to(dtype=torch.bfloat16, device="cuda")))
# pytorch impl: bf16 output, without fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_bf16_scaled_mm",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.bfloat16))
# pytorch impl: bf16 output, with fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_bf16_scaled_mm_fast_accum",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.bfloat16,
use_fast_accum=True))
# pytorch impl: fp16 output, without fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_fp16_scaled_mm",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.float16))
# pytorch impl: fp16 output, with fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_fp16_scaled_mm_fast_accum",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.float16,
use_fast_accum=True))
# cutlass impl: bf16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_bf16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b,
torch.bfloat16))
# cutlass impl: bf16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_bf16_scaled_sparse_mm",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.bfloat16))
# cutlass impl: fp16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_fp16_scaled_sparse_mm",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.float16))
# cutlass impl: bf16 output, with bias
timers.append(
bench_fn(label, sub_label,
"cutlass_fp8_fp8_bf16_scaled_sparse_mm_bias",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.bfloat16, bias))
# cutlass impl: fp16 output, with bias
timers.append(
bench_fn(label, sub_label,
"cutlass_fp8_fp8_fp16_scaled_sparse_mm_bias",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.float16, bias.to(dtype=torch.float16)))
return timers
def bench(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
if dtype == torch.int8:
return bench_int8(dtype, m, k, n, label, sub_label)
if dtype == torch.float8_e4m3fn:
return bench_fp8(dtype, m, k, n, label, sub_label)
raise ValueError("unsupported type")
# runner
def print_timers(timers: Iterable[TMeasurement]):
compare = TBenchmark.Compare(timers)
compare.print()
def run(dtype: torch.dtype,
MKNs: Iterable[Tuple[int, int, int]]) -> Iterable[TMeasurement]:
results = []
for m, k, n in MKNs:
timers = bench(dtype, m, k, n, f"scaled-{dtype}-gemm",
f"MKN=({m}x{k}x{n})")
print_timers(timers)
results.extend(timers)
return results
# output makers
def make_output(data: Iterable[TMeasurement],
MKNs: Iterable[Tuple[int, int, int]],
base_description: str,
timestamp=None):
print(f"== All Results {base_description} ====")
print_timers(data)
# pickle all the results
timestamp = int(time.time()) if timestamp is None else timestamp
with open(f"{base_description}-{timestamp}.pkl", "wb") as f:
pkl.dump(data, f)
# argparse runners
def run_square_bench(args):
dim_sizes = list(
range(args.dim_start, args.dim_end + 1, args.dim_increment))
MKNs = list(zip(dim_sizes, dim_sizes, dim_sizes))
data = run(args.dtype, MKNs)
make_output(data, MKNs, f"square_bench-{args.dtype}")
def run_range_bench(args):
dim_sizes = list(range(args.dim_start, args.dim_end, args.dim_increment))
n = len(dim_sizes)
Ms = [args.m_constant] * n if args.m_constant is not None else dim_sizes
Ks = [args.k_constant] * n if args.k_constant is not None else dim_sizes
Ns = [args.n_constant] * n if args.n_constant is not None else dim_sizes
MKNs = list(zip(Ms, Ks, Ns))
data = run(args.dtype, MKNs)
make_output(data, MKNs, f"range_bench-{args.dtype}")
def run_model_bench(args):
print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
def model_shapes(model_name: str, tp_size: int) -> List[Tuple[int, int]]:
KNs = []
for KN, tp_split_dim in copy.deepcopy(WEIGHT_SHAPES[model_name]):
KN[tp_split_dim] = KN[tp_split_dim] // tp_size
KNs.append(KN)
return KNs
model_bench_data = []
models_tps = list(itertools.product(args.models, args.tp_sizes))
for model, tp_size in models_tps:
Ms = args.batch_sizes
KNs = model_shapes(model, tp_size)
MKNs = []
for m in Ms:
for k, n in KNs:
MKNs.append((m, k, n))
data = run(args.dtype, MKNs)
model_bench_data.append(data)
# Print all results
for data, model_tp in zip(model_bench_data, models_tps):
model, tp_size = model_tp
print(f"== Results {args.dtype} {model}-TP{tp_size} ====")
print_timers(data)
timestamp = int(time.time())
all_data = []
for d in model_bench_data:
all_data.extend(d)
# pickle all data
with open(f"model_bench-{args.dtype}-{timestamp}.pkl", "wb") as f:
pkl.dump(all_data, f)
if __name__ == '__main__':
def to_torch_dtype(dt):
if dt == "int8":
return torch.int8
if dt == "fp8":
return torch.float8_e4m3fn
raise ValueError("unsupported dtype")
parser = FlexibleArgumentParser(
description="""
Benchmark Cutlass GEMM.
To run square GEMMs:
python3 ./benchmarks/cutlass_benchmarks/sparse_benchmarks.py --dtype fp8 square_bench --dim-start 128 --dim-end 512 --dim-increment 64
To run constant N and K and sweep M:
python3 ./benchmarks/cutlass_benchmarks/sparse_benchmarks.py --dtype fp8 range_bench --dim-start 128 --dim-end 512 --dim-increment 64 --n-constant 16384 --k-constant 16384
To run dimensions from a model:
python3 ./benchmarks/cutlass_benchmarks/sparse_benchmarks.py --dtype fp8 model_bench --models meta-llama/Llama-2-7b-hf --batch-sizes 16 --tp-sizes 1
Output:
- a .pkl file, that is a list of raw torch.benchmark.utils.Measurements for the pytorch and cutlass implementations for the various GEMMs.
""", # noqa: E501
formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("--dtype",
type=to_torch_dtype,
required=True,
help="Available options are ['int8', 'fp8']")
subparsers = parser.add_subparsers(dest="cmd")
square_parser = subparsers.add_parser("square_bench")
square_parser.add_argument("--dim-start", type=int, required=True)
square_parser.add_argument("--dim-end", type=int, required=True)
square_parser.add_argument("--dim-increment", type=int, required=True)
square_parser.set_defaults(func=run_square_bench)
range_parser = subparsers.add_parser("range_bench")
range_parser.add_argument("--dim-start", type=int, required=True)
range_parser.add_argument("--dim-end", type=int, required=True)
range_parser.add_argument("--dim-increment", type=int, required=True)
range_parser.add_argument("--m-constant", type=int, default=None)
range_parser.add_argument("--n-constant", type=int, default=None)
range_parser.add_argument("--k-constant", type=int, default=None)
range_parser.set_defaults(func=run_range_bench)
model_parser = subparsers.add_parser("model_bench")
model_parser.add_argument("--models",
nargs="+",
type=str,
default=DEFAULT_MODELS,
choices=WEIGHT_SHAPES.keys())
model_parser.add_argument("--tp-sizes",
nargs="+",
type=int,
default=DEFAULT_TP_SIZES)
model_parser.add_argument("--batch-sizes",
nargs="+",
type=int,
default=DEFAULT_BATCH_SIZES)
model_parser.set_defaults(func=run_model_bench)
args = parser.parse_args()
args.func(args)

View File

@@ -0,0 +1,98 @@
# SPDX-License-Identifier: Apache-2.0
# Cutlass bench utils
from typing import Iterable, Tuple
import torch
import vllm._custom_ops as ops
def to_fp8(tensor: torch.Tensor) -> torch.Tensor:
finfo = torch.finfo(torch.float8_e4m3fn)
return torch.round(tensor.clamp(
min=finfo.min, max=finfo.max)).to(dtype=torch.float8_e4m3fn)
def to_int8(tensor: torch.Tensor) -> torch.Tensor:
return torch.round(tensor.clamp(min=-128, max=127)).to(dtype=torch.int8)
def to_bf16(tensor: torch.Tensor) -> torch.Tensor:
return tensor.to(dtype=torch.bfloat16)
def to_fp16(tensor: torch.Tensor) -> torch.Tensor:
return tensor.to(dtype=torch.float16)
def make_rand_tensors(dtype: torch.dtype, m: int, n: int,
k: int) -> Tuple[torch.Tensor, torch.Tensor]:
a = torch.randn((m, k), device='cuda') * 5
b = torch.randn((n, k), device='cuda').t() * 5
if dtype == torch.int8:
return to_int8(a), to_int8(b)
if dtype == torch.float8_e4m3fn:
return to_fp8(a), to_fp8(b)
raise ValueError("unsupported dtype")
def prune_to_2_4(tensor):
# Reshape tensor to [N, 4] where N is number of groups of 4
original_shape = tensor.shape
reshaped = tensor.reshape(-1, 4)
# Get indices of top 2 absolute values in each group of 4
_, indices = torch.topk(torch.abs(reshaped), k=2, dim=1)
# Create binary mask
mask = torch.zeros_like(reshaped)
mask.scatter_(dim=1,
index=indices,
src=torch.ones_like(indices, dtype=mask.dtype))
# Apply mask and reshape back
pruned = reshaped * mask
# Turn all -0.0 to 0.0
pruned[pruned == -0.0] = 0.0
return pruned.reshape(original_shape)
def make_rand_sparse_tensors(dtype: torch.dtype, m: int, n: int,
k: int) -> Tuple[torch.Tensor, torch.Tensor]:
a = torch.randn((m, k), device='cuda') * 5
b = torch.randn((n, k), device='cuda').t() * 5
b = prune_to_2_4(b.t()).t()
if dtype == torch.int8:
a, b = to_int8(a), to_int8(b)
elif dtype == torch.float8_e4m3fn:
a, b = to_fp8(a), to_fp8(b)
elif dtype == torch.float16:
a, b = to_fp16(a), to_fp16(b)
elif dtype == torch.bfloat16:
a, b = to_bf16(a), to_bf16(b)
else:
raise ValueError("unsupported dtype")
b_compressed, e = ops.cutlass_sparse_compress(b.t())
# Compressed B, Metadata, Original A, B
return b_compressed, e, a, b
def make_n_rand_sparse_tensors(num_tensors: int, dtype: torch.dtype,
m: int, n: int, k: int) -> \
Tuple[Iterable[torch.Tensor], Iterable[torch.Tensor]]:
ABs = []
for _ in range(num_tensors):
b_comp, e, a, b = make_rand_sparse_tensors(dtype, m, n, k)
if b_comp is not None:
ABs.append(make_rand_sparse_tensors(dtype, m, n, k))
BComps, Es, As, Bs = zip(*ABs)
return list(BComps), list(Es), list(As), list(Bs)

View File

@@ -1,47 +1,27 @@
# SPDX-License-Identifier: Apache-2.0
import argparse import argparse
import copy import copy
import itertools import itertools
import pickle as pkl import pickle as pkl
import time import time
from typing import Callable, Iterable, List, Tuple from typing import Callable, Iterable, List, Optional, Tuple
import torch import torch
import torch.utils.benchmark as TBenchmark import torch.utils.benchmark as TBenchmark
from torch.utils.benchmark import Measurement as TMeasurement from torch.utils.benchmark import Measurement as TMeasurement
from utils import make_rand_tensors
from weight_shapes import WEIGHT_SHAPES from weight_shapes import WEIGHT_SHAPES
from vllm import _custom_ops as ops from vllm import _custom_ops as ops
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
w8a8_block_fp8_matmul)
from vllm.utils import FlexibleArgumentParser from vllm.utils import FlexibleArgumentParser
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys()) DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512] DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
DEFAULT_TP_SIZES = [1] DEFAULT_TP_SIZES = [1]
# helpers
def to_fp8(tensor: torch.Tensor) -> torch.Tensor:
finfo = torch.finfo(torch.float8_e4m3fn)
return torch.round(tensor.clamp(
min=finfo.min, max=finfo.max)).to(dtype=torch.float8_e4m3fn)
def to_int8(tensor: torch.Tensor) -> torch.Tensor:
return torch.round(tensor.clamp(min=-128, max=127)).to(dtype=torch.int8)
def make_rand_tensors(dtype: torch.dtype, m: int, n: int,
k: int) -> Tuple[torch.Tensor, torch.Tensor]:
a = torch.randn((m, k), device='cuda') * 5
b = torch.randn((n, k), device='cuda').t() * 5
if dtype == torch.int8:
return to_int8(a), to_int8(b)
if dtype == torch.float8_e4m3fn:
return to_fp8(a), to_fp8(b)
raise ValueError("unsupported dtype")
# bench # bench
def bench_fn(label: str, sub_label: str, description: str, fn: Callable, *args, def bench_fn(label: str, sub_label: str, description: str, fn: Callable, *args,
@@ -62,8 +42,15 @@ def bench_fn(label: str, sub_label: str, description: str, fn: Callable, *args,
).blocked_autorange(min_run_time=min_run_time) ).blocked_autorange(min_run_time=min_run_time)
def bench_int8(dtype: torch.dtype, m: int, k: int, n: int, label: str, def bench_int8(
sub_label: str) -> Iterable[TMeasurement]: dtype: torch.dtype,
m: int,
k: int,
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[List[str]] = None) -> Iterable[TMeasurement]:
"""Benchmark INT8-based kernels."""
assert dtype == torch.int8 assert dtype == torch.int8
a, b = make_rand_tensors(torch.int8, m, n, k) a, b = make_rand_tensors(torch.int8, m, n, k)
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32) scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
@@ -72,155 +59,132 @@ def bench_int8(dtype: torch.dtype, m: int, k: int, n: int, label: str,
azp = torch.zeros((m, ), device="cuda", dtype=torch.int32) azp = torch.zeros((m, ), device="cuda", dtype=torch.int32)
azp_adj = torch.zeros((n, ), device="cuda", dtype=torch.int32) azp_adj = torch.zeros((n, ), device="cuda", dtype=torch.int32)
bench_fns = {
"pytorch_bf16_bf16_bf16_matmul-no-scales":
lambda: torch.mm(a.to(dtype=torch.bfloat16), b.to(dtype=torch.bfloat16)
),
"pytorch_fp16_fp16_fp16_matmul-no-scales":
lambda: torch.mm(a.to(dtype=torch.float16), b.to(dtype=torch.float16)),
"cutlass_i8_i8_bf16_scaled_mm":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16),
"cutlass_i8_i8_bf16_scaled_mm_bias":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16,
bias),
"cutlass_i8_i8_bf16_scaled_mm_azp":
lambda: ops.cutlass_scaled_mm_azp(a, b, scale_a, scale_b, torch.
bfloat16, azp_adj),
"cutlass_i8_i8_bf16_scaled_mm_azp_bias":
lambda: ops.cutlass_scaled_mm_azp(a, b, scale_a, scale_b, torch.
bfloat16, azp_adj, None, bias),
"cutlass_i8_i8_bf16_scaled_mm_azp_pt":
lambda: ops.cutlass_scaled_mm_azp(a, b, scale_a, scale_b, torch.
bfloat16, azp_adj, azp),
"cutlass_i8_i8_bf16_scaled_mm_azp_pt_bias":
lambda: ops.cutlass_scaled_mm_azp(a, b, scale_a, scale_b, torch.
bfloat16, azp_adj, azp, bias),
}
timers = [] timers = []
# pytorch impl - bfloat16 for name, fn in bench_fns.items():
timers.append( # If bench_kernels is None, run all. Otherwise, run only exact matches.
bench_fn(label, sub_label, "pytorch_bf16_bf16_bf16_matmul-no-scales", if bench_kernels is None or name in bench_kernels:
torch.mm, a.to(dtype=torch.bfloat16), print(f"Running {name}")
b.to(dtype=torch.bfloat16))) timers.append(bench_fn(label, sub_label, name, fn))
# pytorch impl - float16
timers.append(
bench_fn(label, sub_label,
"pytorch_fp16_fp16_fp16_matmul-no-scales", torch.mm,
a.to(dtype=torch.float16), b.to(dtype=torch.float16)))
# cutlass impl
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b,
torch.bfloat16))
# cutlass with bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_bias",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.bfloat16,
bias))
# cutlass with azp per-tensor
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_azp",
ops.cutlass_scaled_mm_azp, a, b, scale_a, scale_b,
torch.bfloat16, azp_adj))
# cutlass with azp per-tensor + bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_azp_bias",
ops.cutlass_scaled_mm_azp, a, b, scale_a, scale_b,
torch.bfloat16, azp_adj, None, bias))
# cutlass with azp per-token
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_azp_pt",
ops.cutlass_scaled_mm_azp, a, b, scale_a, scale_b,
torch.bfloat16, azp_adj, azp))
# cutlass with azp per-token + bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_azp_pt_bias",
ops.cutlass_scaled_mm_azp, a, b, scale_a, scale_b,
torch.bfloat16, azp_adj, azp, bias))
return timers return timers
def bench_fp8(dtype: torch.dtype, m: int, k: int, n: int, label: str, def bench_fp8(
sub_label: str) -> Iterable[TMeasurement]: dtype: torch.dtype,
m: int,
k: int,
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[List[str]] = None) -> Iterable[TMeasurement]:
"""Benchmark FP8-based kernels."""
assert dtype == torch.float8_e4m3fn assert dtype == torch.float8_e4m3fn
a, b = make_rand_tensors(torch.float8_e4m3fn, m, n, k) a, b = make_rand_tensors(torch.float8_e4m3fn, m, n, k)
a_cont = a.contiguous()
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32) scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
scale_b = torch.tensor(1.0, device="cuda", dtype=torch.float32) scale_b = torch.tensor(1.0, device="cuda", dtype=torch.float32)
block_scale_a = torch.rand((m, k // 128),
device="cuda",
dtype=torch.float32)
block_scale_b = torch.rand((k // 128, n // 128),
device="cuda",
dtype=torch.float32)
block_scale_a_M_major = block_scale_a.t().contiguous().t()
block_scale_b_K_major = block_scale_b.t().contiguous().t()
bias = torch.zeros((n, ), device="cuda", dtype=torch.bfloat16) bias = torch.zeros((n, ), device="cuda", dtype=torch.bfloat16)
print(m, k, n)
bench_fns = {
"pytorch_bf16_bf16_bf16_matmul-no-scales":
lambda: torch.mm(a.to(dtype=torch.bfloat16), b.to(dtype=torch.bfloat16)
),
"pytorch_fp16_fp16_fp16_matmul-no-scales":
lambda: torch.mm(a.to(dtype=torch.float16), b.to(dtype=torch.float16)),
"pytorch_fp8_fp8_fp16_scaled_mm":
lambda: torch._scaled_mm(
a, b, scale_a, scale_b, out_dtype=torch.float16),
"pytorch_fp8_fp8_fp16_scaled_mm_fast_accum":
lambda: torch._scaled_mm(a,
b,
scale_a,
scale_b,
out_dtype=torch.float16,
use_fast_accum=True),
"pytorch_fp8_fp8_bf16_scaled_mm":
lambda: torch._scaled_mm(
a, b, scale_a, scale_b, out_dtype=torch.bfloat16),
"pytorch_fp8_fp8_bf16_scaled_mm_fast_accum":
lambda: torch._scaled_mm(a,
b,
scale_a,
scale_b,
out_dtype=torch.bfloat16,
use_fast_accum=True),
"cutlass_fp8_fp8_bf16_scaled_mm":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16),
"cutlass_fp8_fp8_fp16_scaled_mm":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.float16),
"cutlass_fp8_fp8_bf16_scaled_mm_bias":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16,
bias),
"cutlass_fp8_fp8_fp16_scaled_mm_bias":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.float16,
bias.to(dtype=torch.float16)),
"triton_fp8_fp8_fp16_scaled_mm_blockwise":
lambda: w8a8_block_fp8_matmul(a_cont, b.t(), block_scale_a,
block_scale_b.t(), (128, 128)),
"cutlass_fp8_fp8_fp16_scaled_mm_blockwise":
lambda: ops.cutlass_scaled_mm(a, b, block_scale_a_M_major,
block_scale_b_K_major, torch.float16),
}
timers = [] timers = []
for name, fn in bench_fns.items():
# pytorch impl w. bf16 # If bench_kernels is None, run all. Otherwise, run only exact matches.
timers.append( if bench_kernels is None or name in bench_kernels:
bench_fn(label, sub_label, "pytorch_bf16_bf16_bf16_matmul-no-scales", print(f"Running {name}")
torch.mm, a.to(dtype=torch.bfloat16, device="cuda"), timers.append(bench_fn(label, sub_label, name, fn))
b.to(dtype=torch.bfloat16, device="cuda")))
# pytorch impl: bf16 output, without fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_bf16_scaled_mm",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.bfloat16))
# pytorch impl: bf16 output, with fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_bf16_scaled_mm_fast_accum",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.bfloat16,
use_fast_accum=True))
# pytorch impl: fp16 output, without fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_fp16_scaled_mm",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.float16))
# pytorch impl: fp16 output, with fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_fp16_scaled_mm_fast_accum",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.float16,
use_fast_accum=True))
# cutlass impl: bf16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_bf16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b,
torch.bfloat16))
# cutlass impl: fp16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_fp16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.float16))
# cutlass impl: bf16 output, with bias
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_bf16_scaled_mm_bias",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.bfloat16,
bias))
# cutlass impl: fp16 output, with bias
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_fp16_scaled_mm_bias",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.float16,
bias.to(dtype=torch.float16)))
return timers return timers
def bench(dtype: torch.dtype, m: int, k: int, n: int, label: str, def bench(dtype: torch.dtype,
sub_label: str) -> Iterable[TMeasurement]: m: int,
k: int,
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[List[str]] = None) -> Iterable[TMeasurement]:
if dtype == torch.int8: if dtype == torch.int8:
return bench_int8(dtype, m, k, n, label, sub_label) return bench_int8(dtype, m, k, n, label, sub_label, bench_kernels)
if dtype == torch.float8_e4m3fn: if dtype == torch.float8_e4m3fn:
return bench_fp8(dtype, m, k, n, label, sub_label) return bench_fp8(dtype, m, k, n, label, sub_label, bench_kernels)
raise ValueError("unsupported type") raise ValueError("unsupported type")
@@ -231,18 +195,22 @@ def print_timers(timers: Iterable[TMeasurement]):
def run(dtype: torch.dtype, def run(dtype: torch.dtype,
MKNs: Iterable[Tuple[int, int, int]]) -> Iterable[TMeasurement]: MKNs: Iterable[Tuple[int, int, int]],
bench_kernels: Optional[List[str]] = None) -> Iterable[TMeasurement]:
results = [] results = []
for m, k, n in MKNs: for m, k, n in MKNs:
timers = bench(dtype, m, k, n, f"scaled-{dtype}-gemm", timers = bench(dtype,
f"MKN=({m}x{k}x{n})") m,
k,
n,
f"scaled-{dtype}-gemm",
f"MKN=({m}x{k}x{n})",
bench_kernels=bench_kernels)
print_timers(timers) print_timers(timers)
results.extend(timers) results.extend(timers)
return results return results
# output makers
def make_output(data: Iterable[TMeasurement], def make_output(data: Iterable[TMeasurement],
MKNs: Iterable[Tuple[int, int, int]], MKNs: Iterable[Tuple[int, int, int]],
base_description: str, base_description: str,
@@ -256,15 +224,11 @@ def make_output(data: Iterable[TMeasurement],
pkl.dump(data, f) pkl.dump(data, f)
# argparse runners
def run_square_bench(args): def run_square_bench(args):
dim_sizes = list( dim_sizes = list(
range(args.dim_start, args.dim_end + 1, args.dim_increment)) range(args.dim_start, args.dim_end + 1, args.dim_increment))
MKNs = list(zip(dim_sizes, dim_sizes, dim_sizes)) MKNs = list(zip(dim_sizes, dim_sizes, dim_sizes))
data = run(args.dtype, MKNs) data = run(args.dtype, MKNs, bench_kernels=args.kernels)
make_output(data, MKNs, f"square_bench-{args.dtype}") make_output(data, MKNs, f"square_bench-{args.dtype}")
@@ -275,8 +239,7 @@ def run_range_bench(args):
Ks = [args.k_constant] * n if args.k_constant is not None else dim_sizes Ks = [args.k_constant] * n if args.k_constant is not None else dim_sizes
Ns = [args.n_constant] * n if args.n_constant is not None else dim_sizes Ns = [args.n_constant] * n if args.n_constant is not None else dim_sizes
MKNs = list(zip(Ms, Ks, Ns)) MKNs = list(zip(Ms, Ks, Ns))
data = run(args.dtype, MKNs) data = run(args.dtype, MKNs, bench_kernels=args.kernels)
make_output(data, MKNs, f"range_bench-{args.dtype}") make_output(data, MKNs, f"range_bench-{args.dtype}")
@@ -302,7 +265,7 @@ def run_model_bench(args):
for k, n in KNs: for k, n in KNs:
MKNs.append((m, k, n)) MKNs.append((m, k, n))
data = run(args.dtype, MKNs) data = run(args.dtype, MKNs, bench_kernels=args.kernels)
model_bench_data.append(data) model_bench_data.append(data)
# Print all results # Print all results
@@ -352,6 +315,15 @@ Benchmark Cutlass GEMM.
type=to_torch_dtype, type=to_torch_dtype,
required=True, required=True,
help="Available options are ['int8', 'fp8']") help="Available options are ['int8', 'fp8']")
parser.add_argument(
"--kernels",
nargs="+",
type=str,
default=None,
help=
"Exact names of the kernels to benchmark. If not set, runs all kernels."
)
subparsers = parser.add_subparsers(dest="cmd") subparsers = parser.add_subparsers(dest="cmd")
square_parser = subparsers.add_parser("square_bench") square_parser = subparsers.add_parser("square_bench")

View File

@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# Weight Shapes are in the format # Weight Shapes are in the format
# ([K, N], TP_SPLIT_DIM) # ([K, N], TP_SPLIT_DIM)
# Example: # Example:

View File

@@ -0,0 +1,145 @@
#!/bin/bash
# benchmark the overhead of disaggregated prefill.
# methodology:
# - send all request to prefill vLLM instance. It will buffer KV cache.
# - then send all request to decode instance.
# - The TTFT of decode instance is the overhead.
set -ex
kill_gpu_processes() {
# kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9
sleep 10
# remove vllm config file
rm -rf ~/.config/vllm
# Print the GPU memory usage
# so that we know if all GPU processes are killed.
gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
# The memory usage should be 0 MB.
echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}
wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
local port=$1
timeout 1200 bash -c "
until curl -s localhost:${port}/v1/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}
benchmark() {
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
# compare chunked prefill with disaggregated prefill
results_folder="./results"
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
dataset_name="sonnet"
dataset_path="../sonnet_4x.txt"
num_prompts=10
qps=$1
prefix_len=50
input_len=2048
output_len=$2
CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
--max-model-len 10000 \
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
--max-model-len 10000 \
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
wait_for_server 8100
wait_for_server 8200
# let the prefill instance finish prefill
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8100 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1.json \
--request-rate "inf"
# send the request to decode.
# The TTFT of this command will be the overhead of disagg prefill impl.
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8200 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1_overhead.json \
--request-rate "$qps"
kill_gpu_processes
}
main() {
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get -y install jq)
(which socat) || (apt-get -y install socat)
pip install quart httpx datasets
cd "$(dirname "$0")"
cd ..
# create sonnet-4x.txt
echo "" > sonnet_4x.txt
for _ in {1..4}
do
cat sonnet.txt >> sonnet_4x.txt
done
cd disagg_benchmarks
rm -rf results
mkdir results
default_qps=1
default_output_len=1
benchmark $default_qps $default_output_len
}
main "$@"

View File

@@ -0,0 +1,163 @@
#!/bin/bash
# Requirement: 2x GPUs.
# Model: meta-llama/Meta-Llama-3.1-8B-Instruct
# Query: 1024 input tokens, 6 output tokens, QPS 2/4/6/8, 100 requests
# Resource: 2x GPU
# Approaches:
# 2. Chunked prefill: 2 vllm instance with tp=4, equivalent to 1 tp=4 instance with QPS 4
# 3. Disaggregated prefill: 1 prefilling instance and 1 decoding instance
# Prefilling instance: max_output_token=1
# Decoding instance: force the input tokens be the same across requests to bypass prefilling
set -ex
kill_gpu_processes() {
# kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9
for port in 8000 8100 8200; do lsof -t -i:$port | xargs -r kill -9; done
sleep 1
}
wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
local port=$1
timeout 1200 bash -c "
until curl -s localhost:${port}/v1/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}
launch_chunked_prefill() {
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
# disagg prefill
CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
--max-model-len 10000 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.6 &
CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
--max-model-len 10000 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.6 &
wait_for_server 8100
wait_for_server 8200
python3 round_robin_proxy.py &
sleep 1
}
launch_disagg_prefill() {
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
# disagg prefill
CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
--max-model-len 10000 \
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
--max-model-len 10000 \
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
wait_for_server 8100
wait_for_server 8200
python3 disagg_prefill_proxy_server.py &
sleep 1
}
benchmark() {
results_folder="./results"
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
dataset_name="sonnet"
dataset_path="../sonnet_4x.txt"
num_prompts=100
qps=$1
prefix_len=50
input_len=1024
output_len=$2
tag=$3
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8000 \
--save-result \
--result-dir $results_folder \
--result-filename "$tag"-qps-"$qps".json \
--request-rate "$qps"
sleep 2
}
main() {
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get -y install jq)
(which socat) || (apt-get -y install socat)
(which lsof) || (apt-get -y install lsof)
pip install quart httpx matplotlib aiohttp datasets
cd "$(dirname "$0")"
cd ..
# create sonnet-4x.txt so that we can sample 2048 tokens for input
echo "" > sonnet_4x.txt
for _ in {1..4}
do
cat sonnet.txt >> sonnet_4x.txt
done
cd disagg_benchmarks
rm -rf results
mkdir results
default_output_len=6
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
launch_chunked_prefill
for qps in 2 4 6 8; do
benchmark $qps $default_output_len chunked_prefill
done
kill_gpu_processes
launch_disagg_prefill
for qps in 2 4 6 8; do
benchmark $qps $default_output_len disagg_prefill
done
kill_gpu_processes
python3 visualize_benchmark_results.py
}
main "$@"

View File

@@ -0,0 +1,63 @@
# SPDX-License-Identifier: Apache-2.0
import os
import aiohttp
from quart import Quart, make_response, request
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)
app = Quart(__name__)
async def forward_request(url, data):
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
headers = {
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}
async with session.post(url=url, json=data,
headers=headers) as response:
if response.status == 200:
# if response.headers.get('Transfer-Encoding') == 'chunked':
if True:
async for chunk_bytes in response.content.iter_chunked(
1024):
yield chunk_bytes
else:
content = await response.read()
yield content
@app.route('/v1/completions', methods=['POST'])
async def handle_request():
try:
original_request_data = await request.get_json()
prefill_request = original_request_data.copy()
# change max_tokens = 1 to let it only do prefill
prefill_request['max_tokens'] = 1
# finish prefill
async for _ in forward_request('http://localhost:8100/v1/completions',
prefill_request):
continue
# return decode
generator = forward_request('http://localhost:8200/v1/completions',
original_request_data)
response = await make_response(generator)
response.timeout = None
return response
except Exception as e:
import sys
import traceback
exc_info = sys.exc_info()
print("Error occurred in disagg prefill proxy server")
print(e)
print("".join(traceback.format_exception(*exc_info)))
if __name__ == '__main__':
app.run(port=8000)

View File

@@ -0,0 +1,62 @@
# SPDX-License-Identifier: Apache-2.0
import asyncio
import itertools
import aiohttp
from aiohttp import web
class RoundRobinProxy:
def __init__(self, target_ports):
self.target_ports = target_ports
self.port_cycle = itertools.cycle(self.target_ports)
async def handle_request(self, request):
target_port = next(self.port_cycle)
target_url = f"http://localhost:{target_port}{request.path_qs}"
async with aiohttp.ClientSession() as session:
try:
# Forward the request
async with session.request(
method=request.method,
url=target_url,
headers=request.headers,
data=request.content,
) as response:
# Start sending the response
resp = web.StreamResponse(status=response.status,
headers=response.headers)
await resp.prepare(request)
# Stream the response content
async for chunk in response.content.iter_any():
await resp.write(chunk)
await resp.write_eof()
return resp
except Exception as e:
return web.Response(text=f"Error: {str(e)}", status=500)
async def main():
proxy = RoundRobinProxy([8100, 8200])
app = web.Application()
app.router.add_route('*', '/{path:.*}', proxy.handle_request)
runner = web.AppRunner(app)
await runner.setup()
site = web.TCPSite(runner, 'localhost', 8000)
await site.start()
print("Proxy server started on http://localhost:8000")
# Keep the server running
await asyncio.Event().wait()
if __name__ == '__main__':
asyncio.run(main())

Some files were not shown because too many files have changed in this diff Show More