Simon Mo
fd47e57f4b
[Docs] Remove PDF build from Readthedocs ( #9347 )
2024-10-14 11:57:47 -07:00
Daniele
203ab8f80f
[CI/Build] setuptools-scm fixes ( #8900 )
2024-10-14 11:34:47 -07:00
Kunshang Ji
4141608c6a
[Hardware][Intel GPU] add async output process for xpu ( #8897 )
2024-10-14 12:23:33 -06:00
Reza Salehi
dfe43a2071
[Model] Molmo vLLM Integration ( #9016 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-14 07:56:24 -07:00
Tyler Michael Smith
16b24e7dcd
[Bugfix] Bandaid fix for speculative decoding tests ( #9327 )
2024-10-13 23:02:11 +00:00
Lily Liu
f519902c52
[CI] Fix merge conflict ( #9317 )
2024-10-13 06:41:23 +00:00
Jee Jee Li
250e26a63e
[Bugfix]Fix MiniCPM's LoRA bug ( #9286 )
2024-10-12 09:36:47 -07:00
Yunmeng
2b184ddd4f
[Misc][Installation] Improve source installation script and doc ( #9309 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-12 09:36:40 -07:00
Xiang Xu
00298e092c
[Bugfix] Fix bug of xformer prefill for encoder-decoder ( #9026 )
2024-10-12 15:00:43 +08:00
Lily Liu
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
Maximilien de Bayser
ec10cb8511
[BugFix] Fix tool call finish reason in streaming case ( #9209 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-10-11 18:24:26 -07:00
Prashant Gupta
d11b46f3a5
[bugfix] fix f-string for error ( #9295 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-11 17:03:48 -07:00
Allen Wang
c6cf9295e1
[Bugfix] Sets is_first_step_output for TPUModelRunner ( #9202 )
2024-10-11 13:28:10 -07:00
Lucas Wilkinson
de9fb4bef8
[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected ( #9254 )
2024-10-11 15:57:39 -04:00
Wallas Henrique
8baf85e4e9
[Doc] Compatibility matrix for mutual exclusive features ( #8512 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-11 11:18:50 -07:00
homeffjy
1a1823871d
[Doc] Remove outdated comment to avoid misunderstanding ( #9287 )
2024-10-11 18:02:03 +00:00
sixgod
6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 ( #9242 )
2024-10-11 17:36:13 +00:00
Burkhard Ringlein
f710090d8e
[Kernel] adding fused moe kernel config for L40S TP4 ( #9245 )
2024-10-11 08:54:22 -07:00
Tyler Michael Smith
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
Sebastian Schoennenbeck
df3dcdf49d
[Bugfix] Fix priority in multiprocessing engine ( #9277 )
2024-10-11 15:35:35 +00:00
Jee Jee Li
36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format ( #9275 )
2024-10-11 12:31:21 +00:00
Cyrus Leung
e808156f30
[Misc] Collect model support info in a single process per model ( #9233 )
2024-10-11 11:08:11 +00:00
youkaichao
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-10-10 21:30:44 -07:00
Andy Dai
94bf9ae4e9
[Misc] Fix sampling from sonnet for long context case ( #9235 )
2024-10-11 00:33:16 +00:00
omrishiv
f990bab2a4
[Doc][Neuron] add note to neuron documentation about resolving triton issue ( #9257 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-10-10 23:36:32 +00:00
youkaichao
e00c094f15
[torch.compile] generic decorators ( #9258 )
2024-10-10 15:54:23 -07:00
Kevin H. Luu
a78c6ba7c8
[ci/build] Add placeholder command for custom models test ( #9262 )
2024-10-10 15:45:09 -07:00
dependabot[bot]
fb870fd491
Bump actions/setup-python from 3 to 5 ( #9195 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:46 -07:00
dependabot[bot]
270953bafb
Bump actions/checkout from 3 to 4 ( #9196 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:35 -07:00
dependabot[bot]
9cc811c4ff
Bump actions/github-script from 6 to 7 ( #9197 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:24 -07:00
youkaichao
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
Simon Mo
78c0b4166c
Suggest codeowners for the core components ( #9210 )
2024-10-10 12:29:24 -07:00
jordanyono
21efb603f5
[CI/Build] Make the Dockerfile.cpu file's PIP_EXTRA_INDEX_URL Configurable as a Build Argument ( #9252 )
2024-10-10 18:18:18 +00:00
Rafael Vasquez
055f3270d4
[Doc] Improve debugging documentation ( #9204 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-10 10:48:51 -07:00
Lucas Wilkinson
18511aeda6
[Bugfix] Fix Machete unittests failing with NotImplementedError ( #9218 )
2024-10-10 17:39:56 +00:00
Ilya Lavrenov
83ea5c72b9
[OpenVINO] Use torch 2.4.0 and newer optimum version ( #9121 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-10 11:18:58 -06:00
whyiug
04de9057ab
[Model] support input image embedding for minicpmv ( #9237 )
2024-10-10 15:00:47 +00:00
Isotr0py
07c11cf4d4
[Bugfix] Fix lm_head weights tying with lora for llama ( #9227 )
2024-10-10 21:11:56 +08:00
sroy745
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
Lucas Wilkinson
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
Michael Goin
ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models ( #9213 )
2024-10-10 14:15:40 +08:00
youkaichao
de895f1697
[misc] improve model support check in another process ( #9208 )
2024-10-09 21:58:27 -07:00
Russell Bryant
cf25b93bdd
[Core] Fix invalid args to _process_request ( #9201 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-10 12:10:09 +08:00
Michael Goin
d5fbb8706d
[CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 ( #9130 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 12:51:47 -06:00
Russell Bryant
cdca8994bd
[CI/Build] mypy: check vllm/entrypoints ( #9194 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-09 17:15:28 +00:00
Li, Jiang
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
Ewout ter Hoeven
7dea289066
Add Dependabot configuration for GitHub Actions updates ( #1217 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 08:16:26 -07:00
Cyrus Leung
cfaa6008e6
[Bugfix] Access get_vocab instead of vocab in tool parsers ( #9188 )
2024-10-09 08:59:57 -06:00
Ahmad Fahadh Ilyas
21906a6f50
[Bugfix] Fix lora loading for Compressed Tensors in #9120 ( #9179 )
2024-10-09 12:10:44 +00:00
Jiangtao Hu
dc4aea677a
[Doc] Fix VLM prompt placeholder sample bug ( #9170 )
2024-10-09 08:59:42 +00:00
youkaichao
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
Cyrus Leung
8bfaa4e31e
[Bugfix] fix composite weight loading and EAGLE weight loading ( #9160 )
2024-10-09 00:36:55 -07:00
AlpinDale
0b5b5d767e
[Frontend] Log the maximum supported concurrency ( #8831 )
2024-10-09 00:03:14 -07:00
Hui Liu
cdc72e3c80
[Model] Remap FP8 kv_scale in CommandR and DBRX ( #9174 )
2024-10-09 06:43:06 +00:00
Joe Rowell
7627172bf4
[Bugfix][Doc] Report neuron error in output ( #9159 )
2024-10-08 22:43:34 -07:00
Travis Johnson
480b7f40cf
[Misc] Improve validation errors around best_of and n ( #9167 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-10-09 04:54:48 +00:00
Yuan Tang
acce7630c1
Update link to KServe deployment guide ( #9173 )
2024-10-09 03:58:49 +00:00
Yuan Tang
ffc4b27ea8
Add classifiers in setup.py ( #9171 )
2024-10-08 19:30:48 -07:00
chenqianfzh
2f4117c38e
support bitsandbytes quantization with more models ( #9148 )
2024-10-08 19:52:19 -06:00
Michael Goin
9ba0bd6aa6
Add lm-eval directly to requirements-test.txt ( #9161 )
2024-10-08 18:22:31 -07:00
Russell Bryant
2a131965a8
mypy: check additional directories ( #9162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-08 22:08:22 +00:00
bnellnm
bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch ( #9086 )
2024-10-08 14:28:12 -07:00
Rafael Vasquez
de24046fcd
[Doc] Improve contributing and installation documentation ( #9132 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-08 20:22:08 +00:00
Sayak Paul
1874c6a1b0
[Doc] Update vlm.rst to include an example on videos ( #9155 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-08 18:12:29 +00:00
Daniele
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing ( #8537 )
2024-10-08 09:38:40 -07:00
Peter Pan
cfba685bd4
[CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models ( #8758 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-10-08 09:37:34 -07:00
Alex Brooks
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser ( #9151 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:31:26 +00:00
Alex Brooks
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:12:56 +00:00
Brendan Wong
8c746226c9
[Frontend] API support for beam search for MQLLMEngine ( #9117 )
2024-10-08 05:51:43 +00:00
youkaichao
e1faa2a598
[misc] improve ux on readme ( #9147 )
2024-10-07 22:26:25 -07:00
Kunshang Ji
80b57f00d5
[Intel GPU] Fix xpu decode input ( #9145 )
2024-10-08 03:51:14 +00:00
youkaichao
04c12f8157
[misc] update utils to support comparing multiple settings ( #9140 )
2024-10-08 02:51:49 +00:00
Simon Mo
8eeb857084
Add Slack to README ( #9137 )
2024-10-07 17:06:21 -07:00
youkaichao
fa45513a51
[misc] fix comment and variable name ( #9139 )
2024-10-07 16:07:05 -07:00
Kuntai Du
c0d9a98d0c
[Doc] Include performance benchmark in README ( #9135 )
2024-10-07 15:04:06 -07:00
Russell Bryant
e0dbdb013d
[CI/Build] Add linting for github actions workflows ( #7876 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-07 21:18:10 +00:00
TimWang
93cf74a8a7
[Doc]: Add deploying_with_k8s guide ( #8451 )
2024-10-07 13:31:45 -07:00
Cyrus Leung
151ef4efd2
[Model] Support NVLM-D and fix QK Norm in InternViT ( #9045 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2024-10-07 11:55:12 +00:00
Isotr0py
f19da64871
[Core] Refactor GGUF parameters packing and forwarding ( #8859 )
2024-10-07 10:01:46 +00:00
Isotr0py
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
Cyrus Leung
8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models ( #9108 )
2024-10-07 06:10:35 +00:00
youkaichao
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
sroy745
c8f26bb636
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None ( #9103 )
2024-10-07 03:52:42 +00:00
Isotr0py
487678d046
[Bugfix][Hardware][CPU] Fix CPU model input for decode ( #9044 )
2024-10-06 19:14:27 -07:00
Varun Sundar Rabindranath
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-06 12:48:11 -07:00
Yanyi Liu
fdf59d30ea
[Bugfix] fix tool_parser error handling when serve a model not support it ( #8709 )
2024-10-06 12:51:08 +00:00
Cyrus Leung
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-06 16:35:27 +08:00
Cyrus Leung
f22619fe96
[Misc] Remove user-facing error for removed VLM args ( #9104 )
2024-10-06 01:33:52 -07:00
Brendan Wong
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-05 23:39:03 -07:00
TJian
23fea8714a
[Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model ( #9101 )
2024-10-06 13:00:04 +08:00
youkaichao
f4dd830e09
[core] use forward context for flash infer ( #9097 )
2024-10-05 19:37:31 -07:00
Andy Dai
5df1834895
[Bugfix] Fix order of arguments matters in config.yaml ( #8960 )
2024-10-05 17:35:11 +00:00
Chen Zhang
cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface ( #9083 )
2024-10-05 21:56:40 +08:00
Xin Yang
15986f598c
[Model] Support Gemma2 embedding model ( #9004 )
2024-10-05 06:57:05 +00:00
hhzhang16
53b3a33027
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs ( #8979 )
2024-10-04 22:05:37 -07:00
Chen Zhang
dac914b0d6
[Bugfix] use blockmanagerv1 for encoder-decoder ( #9084 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-05 04:45:38 +00:00
Zhuohan Li
a95354a36e
[Doc] Update README.md with Ray summit slides ( #9088 )
2024-10-05 02:54:45 +00:00
youkaichao
663874e048
[torch.compile] improve allreduce registration ( #9061 )
2024-10-04 16:43:50 -07:00
Chongming Ni
cc90419e89
[Hardware][Neuron] Add on-device sampling support for Neuron ( #8746 )
...
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
2024-10-04 16:42:20 -07:00
Cody Yu
27302dd584
[Misc] Fix CI lint ( #9085 )
2024-10-04 16:07:54 -07:00
Andy Dai
0cc566ca8f
[Misc] Add random seed for prefix cache benchmark ( #9081 )
2024-10-04 21:58:57 +00:00
Andy Dai
05c531be47
[Misc] Improved prefix cache example ( #9077 )
2024-10-04 21:38:42 +00:00
Kuntai Du
fbb74420e7
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang ( #7412 )
2024-10-04 14:01:44 -07:00
ElizaWszola
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika Sikka <ds3822@columbia.edu >
2024-10-04 12:34:44 -06:00
Flávia Béo
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation ( #8999 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-10-04 18:31:40 +00:00
Roger Wang
26aa325f4f
[Core][VLM] Test registration for OOT multimodal models ( #8717 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:38:25 -07:00
Varad Ahirwadkar
e5dc713c23
[Hardware][PowerPC] Make oneDNN dependency optional for Power ( #9039 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-04 17:24:42 +00:00
Simon Mo
36eecfbddb
Remove AMD Ray Summit Banner ( #9075 )
2024-10-04 10:17:16 -07:00
Prashant Gupta
9ade8bbc8d
[Model] add a bunch of supported lora modules for mixtral ( #9008 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-04 16:24:40 +00:00
Lucas Wilkinson
22482e495e
[Bugfix] Flash attention arches not getting set properly ( #9062 )
2024-10-04 09:43:15 -06:00
whyiug
3d826d2c52
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL ( #9071 )
2024-10-04 14:34:58 +00:00
Cyrus Leung
0e36fd4909
[Misc] Move registry to its own file ( #9064 )
2024-10-04 10:01:37 +00:00
Murali Andoorveedu
0f6d7a9a34
[Models] Add remaining model PP support ( #7168 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:56:58 +08:00
Michael Goin
303d44790a
[Misc] Enable multi-step output streaming by default ( #9047 )
2024-10-03 22:55:42 -04:00
Lucas Wilkinson
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) ( #8845 )
2024-10-03 22:55:25 -04:00
代君
3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model ( #8405 )
2024-10-04 10:36:39 +08:00
Domen Vreš
2838d6b38e
[Bugfix] Weight loading fix for OPT model ( #9042 )
...
Co-authored-by: dvres <dvres@fri.uni-lj.si >
2024-10-03 19:53:29 -04:00
sroy745
91add85ec4
Fix failing spec decode test ( #9054 )
2024-10-03 23:07:29 +00:00
youkaichao
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
xendo
63e39937f9
[Frontend] [Neuron] Parse literals out of override-neuron-config ( #8959 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-03 18:02:07 +00:00
sroy745
f5d72b2fc6
[Core] Make BlockSpaceManagerV2 the default BlockManager to use. ( #8678 )
2024-10-03 09:44:21 -07:00
Guillaume Calmettes
83caf35e08
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser ( #9020 )
2024-10-03 16:44:52 +08:00
Divakar Verma
01843c89b8
[Misc] log when using default MoE config ( #8971 )
2024-10-03 04:31:07 +00:00
Travis Johnson
19a4dd0990
[Bugfix] example template should not add parallel_tool_prompt if tools is none ( #9007 )
2024-10-03 03:04:17 +00:00
Nick Hill
18c2e30c57
[Doc] Update Granite model docs ( #9025 )
2024-10-03 02:42:24 +00:00
Shawn Tan
19f0d25796
[Model] Adding Granite MoE. ( #8206 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-03 09:33:57 +08:00
Sergey Shlyapnikov
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
Varun Sundar Rabindranath
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill ( #8645 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-02 19:44:39 +00:00
Alex Brooks
7f60520deb
[Misc] Update Default Image Mapper Error Log ( #8977 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-02 11:44:38 +00:00
afeldman-nm
563649aafe
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching ( #8804 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
2024-10-02 07:52:20 +00:00
Lily Liu
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
vlsav
22f5851b80
Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows ( #8997 )
2024-10-01 11:07:06 -07:00
Cyrus Leung
4f341bd4bf
[Doc] Update list of supported models ( #8987 )
2024-10-02 00:35:39 +08:00
Sebastian Schoennenbeck
35bd215168
[Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API ( #8965 )
2024-10-01 09:58:06 +00:00
Alex Brooks
1fe0a4264a
[Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders ( #8991 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-01 09:52:44 +00:00
Isotr0py
bc4eb65b54
[Bugfix] Fix Fuyu tensor parallel inference ( #8986 )
2024-10-01 17:51:41 +08:00
Divakar Verma
82f3937e59
[Misc] add process_weights_after_loading for DummyLoader ( #8969 )
2024-10-01 03:46:41 +00:00
youkaichao
7da2487591
[torch.compile] fix tensor alias ( #8982 )
2024-10-01 03:40:48 +00:00
Kevin H. Luu
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering ( #8976 )
...
Signed-off-by: kevin <kevin@anyscale.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-01 03:33:12 +00:00
Joe Runde
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params ( #8252 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-01 09:34:25 +08:00
Lily Liu
bce324487a
[CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. ( #8975 )
2024-10-01 00:51:40 +00:00
Kevin H. Luu
1425a1bcf9
[ci] Add CODEOWNERS for test directories ( #8795 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-01 00:47:08 +00:00
Jee Jee Li
1cabfcefb6
[Misc] Adjust max_position_embeddings for LoRA compatibility ( #8957 )
2024-09-30 12:57:39 +00:00
Sebastian Schoennenbeck
be76e5aabf
[Core] Make scheduling policy settable via EngineArgs ( #8956 )
2024-09-30 12:28:44 +00:00
Isotr0py
2ae25f79cf
[Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg ( #8946 )
2024-09-30 13:01:20 +08:00
Jee Jee Li
8e60afa15e
[Model][LoRA]LoRA support added for MiniCPMV2.6 ( #8943 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-30 04:31:55 +00:00
Roger Wang
b6d7392579
[Misc][CI/Build] Include cv2 via mistral_common[opencv] ( #8951 )
2024-09-30 04:28:26 +00:00
whyiug
e01ab595d8
[Model] support input embeddings for qwen2vl ( #8856 )
2024-09-30 03:16:10 +00:00
Mor Zusman
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model ( #8533 )
2024-09-29 17:35:58 -04:00
danieljannai21
6c9ba48fde
[Frontend] Added support for HF's new continue_final_message parameter ( #8942 )
2024-09-29 17:59:47 +00:00
juncheoll
1fb9c1b0bf
[Misc] Fix typo in BlockSpaceManagerV1 ( #8944 )
2024-09-29 15:05:54 +00:00
Nick Hill
31f46a0d35
[BugFix] Fix seeded random sampling with encoder-decoder models ( #8870 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-29 09:43:14 +00:00
Jee Jee Li
3d49776bbb
[Model][LoRA]LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00
Zilin Zhu
bc2ef1f77c
[Model] Support Qwen2.5-Math-RM-72B ( #8896 )
2024-09-28 21:19:39 -07:00
Tyler Michael Smith
2e7fe7e79f
[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching ( #8930 )
2024-09-29 03:13:01 +00:00
Cyrus Leung
26a68d5d7e
[CI/Build] Add test decorator for minimum GPU memory ( #8925 )
2024-09-29 02:50:51 +00:00
ElizaWszola
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False ( #8741 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-28 18:19:40 -07:00
sroy745
5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots ( #8824 )
2024-09-29 09:17:45 +08:00
Russell Bryant
d1537039ce
[Core] Improve choice of Python multiprocessing method ( #8823 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-29 09:17:07 +08:00
youkaichao
cc276443b5
[doc] organize installation doc and expose per-commit docker ( #8931 )
2024-09-28 17:48:41 -07:00
Chen Zhang
e585b583a9
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 ( #8891 )
2024-09-28 18:51:22 +00:00
Edouard B.
090e945e36
[Frontend] Make beam search emulator temperature modifiable ( #8928 )
...
Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr >
2024-09-28 11:30:21 -07:00
Cyrus Leung
e1a3f5e831
[CI/Build] Update models tests & examples ( #8874 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-28 09:54:35 -07:00
Varun Sundar Rabindranath
19d02ff938
[Bugfix] Fix PP for Multi-Step ( #8887 )
2024-09-28 08:52:46 -07:00
tastelikefeet
39d3f8d94f
[Bugfix] Fix code for downloading models from modelscope ( #8443 )
2024-09-28 08:24:12 -07:00
Cyrus Leung
b0298aa8cc
[Misc] Remove vLLM patch of BaichuanTokenizer ( #8921 )
2024-09-28 08:11:25 +00:00
Tyler Titsworth
260024a374
[Bugfix][Intel] Fix XPU Dockerfile Build ( #7824 )
...
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-27 23:45:50 -07:00
youkaichao
d86f6b2afb
[misc] fix wheel name ( #8919 )
2024-09-27 22:10:44 -07:00
Sebastian Schoennenbeck
bd429f2b75
[Core] Priority-based scheduling in async engine ( #8850 )
2024-09-27 15:07:10 -07:00
youkaichao
18e60d7d13
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag ( #8911 )
2024-09-27 14:27:56 -07:00
Varun Sundar Rabindranath
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-09-27 13:32:07 -07:00
Lucas Wilkinson
c5d55356f9
[Bugfix] fix for deepseek w4a16 ( #8906 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-27 13:12:34 -06:00
Luka Govedič
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method ( #7271 )
2024-09-27 14:25:10 -04:00
youkaichao
a9b15c606f
[torch.compile] use empty tensor instead of None for profiling ( #8875 )
2024-09-27 08:11:32 -07:00
Brittany
8df2dc3c88
[TPU] Update pallas.py to support trillium ( #8871 )
2024-09-27 01:16:55 -07:00
Isotr0py
6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 ( #8892 )
2024-09-27 01:15:58 -07:00
Peter Pan
0e088750af
[MISC] Fix invalid escape sequence '\' ( #8830 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-09-27 01:13:25 -07:00
youkaichao
dc4e3df5c2
[misc] fix collect env ( #8894 )
2024-09-27 00:26:38 -07:00
Cyrus Leung
3b00b9c26c
[Core] Rename PromptInputs and inputs ( #8876 )
2024-09-26 20:35:15 -07:00
Maximilien de Bayser
344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use ( #8343 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-09-26 17:01:42 -07:00
Cyrus Leung
1b49148e47
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility ( #8764 )
2024-09-26 16:54:09 -07:00
Nick Hill
4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade ( #8829 )
2024-09-26 16:46:43 -07:00
Tyler Michael Smith
71d21c73ab
[Bugfix] Fixup advance_step.cu warning ( #8815 )
2024-09-26 16:23:45 -07:00
Chirag Jain
ee2da3e9ef
fix validation: Only set tool_choice auto if at least one tool is provided ( #8568 )
2024-09-26 16:23:17 -07:00
Tyler Michael Smith
e2f6f26e86
[Bugfix] Fix print_warning_once's line info ( #8867 )
2024-09-26 16:18:26 -07:00
Michael Goin
b28d2104de
[Misc] Change dummy profiling and BOS fallback warns to log once ( #8820 )
2024-09-26 16:18:14 -07:00
Pernekhan Utemuratov
93d364da34
[Bugfix] Include encoder prompts len to non-stream api usage response ( #8861 )
2024-09-26 15:47:00 -07:00
Kevin H. Luu
d9cfbc891e
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM ( #8872 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-26 15:02:16 -07:00
youkaichao
70de39f6b4
[misc][installation] build from source without compilation ( #8818 )
2024-09-26 13:19:04 -07:00
fyuan1316
68988d4e0d
[CI/Build] Fix missing ci dependencies ( #8834 )
2024-09-26 11:04:39 -07:00
Michael Goin
520db4dbc1
[Docs] Add README to the build docker image ( #8825 )
2024-09-26 11:02:52 -07:00
Tyler Michael Smith
f70bccac75
[Build/CI] Upgrade to gcc 10 in the base build Docker image ( #8814 )
2024-09-26 10:07:18 -07:00
Roger Wang
4bb98f2190
[Misc] Update config loading for Qwen2-VL and remove Granite ( #8837 )
2024-09-26 07:45:30 -07:00
Michael Goin
7193774b1f
[Misc] Support quantization of MllamaForCausalLM ( #8822 )
2024-09-25 14:46:22 -07:00
Roger Wang
e2c6e0a829
[Doc] Update doc for Transformers 4.45 ( #8817 )
2024-09-25 13:29:48 -07:00
Chen Zhang
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model ( #8811 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-25 13:29:32 -07:00
Simon Mo
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility ( #8760 )" ( #8810 )
2024-09-25 10:36:26 -07:00
Michael Goin
873edda6cf
[Misc] Support FP8 MoE for compressed-tensors ( #8588 )
2024-09-25 09:43:36 -07:00
科英
64840dfae4
[Frontend] MQLLMEngine supports profiling. ( #8761 )
2024-09-25 09:37:41 -07:00
Cyrus Leung
28e1299e60
rename PromptInputs and inputs with backward compatibility ( #8760 )
2024-09-25 09:36:47 -07:00
DefTruth
0c4d2ad5e6
[VLM][Bugfix] internvl with num_scheduler_steps > 1 ( #8614 )
2024-09-25 09:35:53 -07:00
Jee Jee Li
c6f2485c82
[Misc] Add extra deps for openai server image ( #8792 )
2024-09-25 09:35:23 -07:00
bnellnm
300da09177
[Kernel] Fullgraph and opcheck tests ( #8479 )
2024-09-25 08:35:52 -06:00
Hongxia Yang
1c046447a6
[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade ( #8777 )
2024-09-25 22:26:37 +08:00
Woo-Yeon Lee
8fae5ed7f6
[Misc] Fix minor typo in scheduler ( #8765 )
2024-09-25 00:53:03 -07:00
David Newman
3368c3ab36
[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node ( #8767 )
...
Signed-off-by: darthhexx <darthhexx@gmail.com >
2024-09-25 00:52:26 -07:00
Adam Tilghman
1ac3de09cd
[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer ( #8672 )
2024-09-25 07:49:26 +00:00
sohamparikh
3e073e66f1
[Bugfix] load fc bias from config for eagle ( #8790 )
2024-09-24 23:16:30 -07:00
Isotr0py
c23953675f
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend ( #8770 )
2024-09-24 23:16:11 -07:00
zifeitong
e3dd0692fa
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv ( #8250 )
2024-09-25 05:53:43 +00:00
sroy745
fc3afc20df
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 ( #8752 )
2024-09-24 21:26:36 -07:00
sasha0552
b4522474a3
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal ( #8776 )
2024-09-24 21:26:33 -07:00
sroy745
ee777d9c30
Fix test_schedule_swapped_simple in test_scheduler.py ( #8780 )
2024-09-24 21:26:18 -07:00
Joe Runde
6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks ( #8583 )
2024-09-24 20:37:38 -07:00
Archit Patke
6da1ab6b41
[Core] Adding Priority Scheduling ( #5958 )
2024-09-24 19:50:50 -07:00
Travis Johnson
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding ( #8047 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-09-24 17:29:56 -07:00
Jee Jee Li
13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 ( #8768 )
2024-09-24 17:08:55 -07:00
youkaichao
1e7d5c01f5
[misc] soft drop beam search ( #8763 )
2024-09-24 15:48:39 -07:00
Daniele
2467b642dd
[CI/Build] fix setuptools-scm usage ( #8771 )
2024-09-24 12:38:12 -07:00
Lucas Wilkinson
72fc97a0f1
[Bugfix] Fix torch dynamo fixes caused by replace_parameters ( #8748 )
2024-09-24 14:33:21 -04:00
Andy
2529d09b5a
[Frontend] Batch inference for llm.chat() API ( #8648 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-09-24 09:44:11 -07:00
ElizaWszola
a928ded995
[Kernel] Split Marlin MoE kernels into multiple files ( #8661 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-24 09:31:42 -07:00
Hanzhi Zhou
cc4325b66a
[Bugfix] Fix potentially unsafe custom allreduce synchronization ( #8558 )
2024-09-24 01:08:14 -07:00
Alex Brooks
8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg ( #8658 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-24 07:36:46 +00:00
Peter Salas
3f06bae907
[Core][Model] Support loading weights by ID within models ( #7931 )
2024-09-24 07:14:15 +00:00
Cody Yu
b8747e8a7c
[MISC] Skip dumping inputs when unpicklable ( #8744 )
2024-09-24 06:10:03 +00:00
Simon Mo
3185fb0cca
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" ( #8750 )
2024-09-24 05:45:20 +00:00
youkaichao
0250dd68c5
re-implement beam search on top of vllm core ( #8726 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-09-23 22:08:12 -07:00
sroy745
88577ac928
Fix tests in test_scheduler.py that fail with BlockManager V2 ( #8728 )
2024-09-24 04:43:13 +00:00
Hongxia Yang
530821d00c
[Hardware][AMD] ROCm6.2 upgrade ( #8674 )
2024-09-23 18:52:39 -07:00
Alexander Matveev
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse ( #8335 )
2024-09-23 15:38:04 -07:00
jiqing-feng
5f7bb58427
Fix typical acceptance sampler with correct recovered token ids ( #8562 )
2024-09-23 12:32:27 -07:00
Russell Bryant
b05f5c9238
[Core] Allow IPv6 in VLLM_HOST_IP with zmq ( #8575 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-23 12:15:41 -07:00
Jee Jee Li
9b0e3ec970
[Kernel][LoRA] Add assertion for punica sgmv kernels ( #7585 )
2024-09-23 18:57:42 +00:00
Lucas Wilkinson
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-23 13:46:26 -04:00
Daniele
ee5f34b1c2
[CI/Build] use setuptools-scm to set __version__ ( #4738 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-23 09:44:26 -07:00
Jani Monoses
f2bd246c17
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size ( #8707 )
2024-09-23 14:43:09 +00:00
Yanyi Liu
a79e522984
[Model] Support pp for qwen2-vl ( #8696 )
2024-09-23 13:46:59 +00:00
Li, Jiang
3e83c12b5c
[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner ( #8733 )
2024-09-23 13:15:16 +00:00
Isotr0py
e551ca1555
[Hardware][CPU] Refactor CPU model runner ( #8729 )
2024-09-23 20:12:20 +08:00
Alex Brooks
9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs ( #8657 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-23 07:44:48 +00:00
Yan Ma
d23679eb99
[Bugfix] fix docker build for xpu ( #8652 )
2024-09-22 22:54:18 -07:00
Luka Govedič
57a0702e63
[Bugfix] Fix CPU CMake build ( #8723 )
...
Co-authored-by: Yuan <yuan.zhou@intel.com >
2024-09-22 20:40:46 -07:00
Tyler Michael Smith
3dda7c2250
[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building ( #8702 )
2024-09-22 22:24:59 -04:00
youkaichao
92ba7e7477
[misc] upgrade mistral-common ( #8715 )
2024-09-22 15:41:59 -07:00
youkaichao
d4a2ac8302
[build] enable existing pytorch (for GH200, aarch64, nightly) ( #8713 )
2024-09-22 12:47:54 -07:00
Lily Liu
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. ( #8701 )
2024-09-22 12:34:14 -07:00
litianjian
5b59532760
[Model][VLM] Add LLaVA-Onevision model support ( #8486 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-22 10:51:44 -07:00
Huazhong Ji
ca2b628b3c
[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler ( #8703 )
2024-09-22 10:44:09 -07:00
Alex Brooks
8ca5051b9a
[Misc] Use NamedTuple in Multi-image example ( #8705 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-22 20:56:20 +08:00
Cyrus Leung
06ed2815e2
[Model] Refactor BLIP/BLIP-2 to support composite model loading ( #8407 )
2024-09-22 12:24:21 +00:00
youkaichao
0e40ac9b7b
[ci][build] fix vllm-flash-attn ( #8699 )
2024-09-21 23:24:58 -07:00
Isotr0py
13d88d4137
[Bugfix] Refactor composite weight loading logic ( #8656 )
2024-09-22 04:33:27 +00:00
Tyler Michael Smith
d66ac62854
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu ( #8643 )
2024-09-21 23:45:02 +00:00
Divakar Verma
9dc7c6c7f3
[dbrx] refactor dbrx experts to extend FusedMoe class ( #8518 )
2024-09-21 15:09:39 -06:00
rasmith
ec4aaad812
[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 ( #8646 )
2024-09-21 09:20:54 +00:00
Andy Dai
4dfdf43196
[Doc] Fix typo in AMD installation guide ( #8689 )
2024-09-21 00:24:12 -07:00
Cyrus Leung
5e85f4f82a
[VLM] Use SequenceData.from_token_counts to create dummy data ( #8687 )
2024-09-20 23:28:56 -07:00
Luka Govedič
71c60491f2
[Kernel] Build flash-attn from source ( #8245 )
2024-09-20 23:27:10 -07:00
youkaichao
0faab90eb0
[beam search] add output for manually checking the correctness ( #8684 )
2024-09-20 19:55:33 -07:00
Cyrus Leung
0455c46ed4
[Core] Factor out common code in SequenceData and Sequence ( #8675 )
2024-09-21 02:30:39 +00:00
Kunshang Ji
d4bf085ad0
[MISC] add support custom_op check ( #8557 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-20 19:03:55 -07:00
Cyrus Leung
0057894ef7
[Core] Rename PromptInputs and inputs ( #8673 )
2024-09-20 19:00:54 -07:00
zyddnys
0f961b3ce9
[Bugfix] Fix incorrect llava next feature size calculation ( #8496 )
2024-09-20 22:48:32 +00:00
omrishiv
7f9c8902e3
[Hardware][AWS] update neuron to 2.20 ( #8676 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:19:44 -07:00
omrishiv
7c8566aa4f
[Doc] neuron documentation update ( #8671 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:04:37 -07:00
Patrick von Platen
b4e4eda92e
[Bugfix][Core] Fix tekken edge case for mistral tokenizer ( #8640 )
2024-09-20 14:33:03 -07:00
Pastel!
2874bac618
[Bugfix] Config got an unexpected keyword argument 'engine' ( #8556 )
2024-09-20 14:00:45 -07:00
Cyrus Leung
035fa895ec
[Misc] Show AMD GPU topology in collect_env.py ( #8649 )
2024-09-20 13:52:19 -07:00
saumya-saran
b28298f2f4
[Bugfix] Validate SamplingParam n is an int ( #8548 )
2024-09-20 12:46:02 -07:00
Alexey Kondratiev(AMD)
2940afa04e
[CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build ( #8670 )
2024-09-20 10:27:44 -07:00
Niklas Muennighoff
3b63de9353
[Model] Add OLMoE ( #7922 )
2024-09-20 09:31:41 -07:00
Jiaxin Shan
260d40b5ea
[Core] Support Lora lineage and base model metadata management ( #6315 )
2024-09-20 06:20:56 +00:00
William Lin
9e5ec35b1f
[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata ( #8474 )
2024-09-19 20:49:54 -07:00
Amit Garg
18ae428a0d
[Bugfix] Fix Phi3.5 mini and MoE LoRA inference ( #8571 )
2024-09-20 08:54:02 +08:00
bnellnm
de6f90a13d
[Misc] guard against change in cuda library name ( #8609 )
2024-09-20 06:36:30 +08:00
Alexey Kondratiev(AMD)
6cb748e190
[CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail ( #8551 )
2024-09-19 13:06:32 -07:00
Simon Mo
9e99407e3c
Create SECURITY.md ( #8642 )
2024-09-19 12:16:28 -07:00
Isotr0py
ea4647b7d7
[Doc] Add documentation for GGUF quantization ( #8618 )
2024-09-19 13:15:55 -06:00
盏一
e42c634acb
[Core] simplify logits resort in _apply_top_k_top_p ( #8619 )
2024-09-19 18:28:25 +00:00
Charlie Fu
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention ( #8577 )
2024-09-19 17:37:57 +00:00
Nick Hill
76515f303b
[Frontend] Use MQLLMEngine for embeddings models too ( #8584 )
2024-09-19 12:51:06 -04:00
Kunshang Ji
855c8ae2c9
[MISC] remove engine_use_ray in benchmark_throughput.py ( #8615 )
2024-09-18 22:33:20 -07:00
Kuntai Du
c52ec5f034
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py ( #8616 )
2024-09-19 05:24:24 +00:00
Roger Wang
02c9afa2d0
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" ( #8593 )
2024-09-19 04:14:28 +00:00
sroy745
3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. ( #8545 )
2024-09-19 02:24:15 +00:00
Tyler Michael Smith
4c34ce8916
[Kernel] Remove marlin moe templating on thread_m_blocks ( #8573 )
...
Co-authored-by: lwilkinson@neuralmagic.com
2024-09-19 01:42:49 +00:00
Joe Runde
0d47bf3bf4
[Bugfix] add dead_error property to engine client ( #8574 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-18 22:10:01 +00:00
Nick Hill
d9cd78eb71
[BugFix] Nonzero exit code if MQLLMEngine startup fails ( #8572 )
2024-09-18 20:17:55 +00:00
Tyler Michael Smith
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching ( #8039 )
2024-09-18 20:05:06 +00:00
Gregory Shtrasberg
b3195bc9e4
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call ( #8380 )
...
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 10:41:08 -07:00
Geun, Lim
e18749ff09
[Model] Support Solar Model ( #8386 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 11:04:00 -06:00
Russell Bryant
d65798f78c
[Core] zmq: bind only to 127.0.0.1 for local-only usage ( #8543 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-18 16:10:27 +00:00
afeldman-nm
a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step ( #8199 )
2024-09-18 08:38:43 -07:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-18 13:56:58 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
Jiaxin Shan
e351572900
[Misc] Add argument to disable FastAPI docs ( #8554 )
2024-09-18 09:51:59 +00:00
Daniele
95965d31b6
[CI/Build] fix Dockerfile.cpu on podman ( #8540 )
2024-09-18 10:49:53 +08:00
Tyler Michael Smith
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching ( #8012 )
2024-09-17 23:44:27 +00:00
Alexey Kondratiev(AMD)
09deb4721f
[CI/Build] Excluding kernels/test_gguf.py from ROCm ( #8520 )
2024-09-17 16:40:29 -07:00
youkaichao
fa0c114fad
[doc] improve installation doc ( #8550 )
...
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com >
2024-09-17 16:24:06 -07:00
Joe Runde
98f9713399
[Bugfix] Fix TP > 1 for new granite ( #8544 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 23:17:08 +00:00
Nick Hill
56c3de018c
[Misc] Don't dump contents of kvcache tensors on errors ( #8527 )
2024-09-17 12:24:29 -07:00
Patrick von Platen
a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format ( #8515 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-17 17:50:37 +00:00
chenqianfzh
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization ( #8434 )
2024-09-17 08:09:12 -07:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
Isotr0py
1b6de8352b
[Benchmark] Support sample from HF datasets and image input for benchmark_serving ( #8495 )
2024-09-17 07:34:27 +00:00
Rui Qiao
cbdb252259
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change ( #8509 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-09-17 00:06:26 -07:00
youkaichao
99aa4eddaf
[torch.compile] register allreduce operations as custom ops ( #8526 )
2024-09-16 22:57:57 -07:00
Roger Wang
ee2bceaaa6
[Misc][Bugfix] Disable guided decoding for mistral tokenizer ( #8521 )
2024-09-16 22:22:45 -07:00
Alex Brooks
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing ( #8525 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-17 04:17:32 +00:00
Simon Mo
546034b466
[refactor] remove triton based sampler ( #8524 )
2024-09-16 20:04:48 -07:00
Joe Runde
cca61642e0
[Bugfix] Fix 3.12 builds on main ( #8510 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 00:01:45 +00:00
Simon Mo
5ce45eb54d
[misc] small qol fixes for release process ( #8517 )
2024-09-16 15:11:27 -07:00
Simon Mo
5478c4b41f
[perf bench] set timeout to debug hanging ( #8516 )
2024-09-16 14:30:02 -07:00
Kevin Lin
47f5e03b5b
[Bugfix] Bind api server port before starting engine ( #8491 )
2024-09-16 13:56:28 -07:00
youkaichao
2759a43a26
[doc] update doc on testing and debugging ( #8514 )
2024-09-16 12:10:23 -07:00
Luka Govedič
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00
sasha0552
781e3b9a42
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel ( #8506 )
2024-09-16 12:15:57 -06:00
Nick Hill
acd5511b6d
[BugFix] Fix clean shutdown issues ( #8492 )
2024-09-16 09:33:46 -07:00
lewtun
837c1968f9
[Frontend] Expose revision arg in OpenAI server ( #8501 )
2024-09-16 15:55:26 +00:00
ElizaWszola
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE ( #8032 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
2024-09-16 09:47:19 -06:00
Isotr0py
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel ( #8357 )
2024-09-15 16:51:44 -06:00
Chris
3724d5f6b5
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations ( #8490 )
2024-09-15 04:20:05 +00:00
Woosuk Kwon
50e9ec41fc
[TPU] Implement multi-step scheduling ( #8489 )
2024-09-14 16:58:31 -07:00
youkaichao
47790f3e32
[torch.compile] add a flag to disable custom op ( #8488 )
2024-09-14 13:07:16 -07:00
youkaichao
a36e070dad
[torch.compile] fix functionalization ( #8480 )
2024-09-14 09:46:04 -07:00
ywfang
8a0cf1ddc3
[Model] support minicpm3 ( #8297 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-14 14:50:26 +00:00
Charlie Fu
1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm ( #8310 )
2024-09-13 17:01:11 -07:00
Kunshang Ji
851725202a
[Hardware][intel GPU] bump up ipex version to 2.3 ( #8365 )
...
Co-authored-by: Yan Ma <yan.ma@intel.com >
2024-09-13 16:54:34 -07:00
Simon Mo
9ba0817ff1
bump version to v0.6.1.post2 ( #8473 )
2024-09-13 11:35:00 -07:00
Nick Hill
18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming ( #8468 )
2024-09-13 11:31:12 -07:00
Isotr0py
f57092c00b
[Doc] Add oneDNN installation to CPU backend documentation ( #8467 )
2024-09-13 18:06:30 +00:00
Cyrus Leung
a84e598e21
[CI/Build] Reorganize models tests ( #7820 )
2024-09-13 10:20:06 -07:00
youkaichao
0a4806f0a9
[plugin][torch.compile] allow to add custom compile backend ( #8445 )
2024-09-13 09:32:42 -07:00
Cyrus Leung
ecd7a1d5b6
[Installation] Gate FastAPI version for Python 3.8 ( #8456 )
2024-09-13 09:02:26 -07:00
youkaichao
a2469127db
[misc][ci] fix quant test ( #8449 )
2024-09-13 17:20:14 +08:00
Jee Jee Li
06311e2956
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 ( #8442 )
2024-09-13 07:58:28 +00:00
youkaichao
cab69a15e4
[doc] recommend pip instead of conda ( #8446 )
2024-09-12 23:52:41 -07:00
Isotr0py
9b4a3b235e
[CI/Build] Enable InternVL2 PP test only on single node ( #8437 )
2024-09-13 06:35:20 +00:00
Simon Mo
acda0b35d0
bump version to v0.6.1.post1 ( #8440 )
2024-09-12 21:39:49 -07:00
William Lin
ba77527955
[bugfix] torch profiler bug for single gpu with GPUExecutor ( #8354 )
2024-09-12 21:30:00 -07:00
Alexander Matveev
6821020109
[Bugfix] Fix async log stats ( #8417 )
2024-09-12 20:48:59 -07:00
Cyrus Leung
8427550488
[CI/Build] Update pixtral tests to use JSON ( #8436 )
2024-09-13 03:47:52 +00:00
Cyrus Leung
3f79bc3d1a
[Bugfix] Bump fastapi and pydantic version ( #8435 )
2024-09-13 03:21:42 +00:00
shangmingc
40c396533d
[Bugfix] Mapping physical device indices for e2e test utils ( #8290 )
2024-09-13 11:06:28 +08:00
Cyrus Leung
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class ( #7329 )
2024-09-13 02:56:13 +00:00
Dipika Sikka
8f44a92d85
[BugFix] fix group_topk ( #8430 )
2024-09-13 09:23:42 +08:00
Roger Wang
360ddbd37e
[Misc] Update Pixtral example ( #8431 )
2024-09-12 17:31:18 -07:00
Wenxiang
a480939e8e
[Bugfix] Fix weight loading issue by rename variable. ( #8293 )
2024-09-12 19:25:00 -04:00
Patrick von Platen
d31174a4e1
[Hotfix][Pixtral] Fix multiple images bugs ( #8415 )
2024-09-12 15:21:51 -07:00
Roger Wang
b61bd98f90
[CI/Build] Disable multi-node test for InternVL2 ( #8428 )
2024-09-12 15:05:35 -07:00
Roger Wang
c16369455f
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models ( #8425 )
2024-09-12 14:06:51 -07:00
Alexander Matveev
019877253b
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible ( #8427 )
2024-09-12 21:01:50 +00:00
Nick Hill
551ce01078
[Core] Add engine option to return only deltas or final output ( #7381 )
2024-09-12 12:02:00 -07:00
William Lin
a6c0f3658d
[multi-step] add flashinfer backend ( #7928 )
2024-09-12 11:16:22 -07:00
Joe Runde
f2e263b801
[Bugfix] Offline mode fix ( #8376 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-12 11:11:57 -07:00
Luis Vega
1f0c75afa9
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser ( #8423 )
2024-09-12 11:10:11 -07:00
WANGWEI
8a23e93302
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance ( #8403 )
2024-09-12 10:47:42 -07:00
Alex Brooks
c6202daeed
[Model] Support multiple images for qwen-vl ( #8247 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:54 -07:00
Isotr0py
e56bf27741
[Bugfix] Fix InternVL2 inference with various num_patches ( #8375 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:35 -07:00
Roger Wang
520ca380ae
[Hotfix][VLM] Fixing max position embeddings for Pixtral ( #8399 )
2024-09-12 09:28:37 -07:00
youkaichao
7de49aa86c
[torch.compile] hide slicing under custom op for inductor ( #8384 )
2024-09-12 00:11:55 -07:00
Woosuk Kwon
42ffba11ad
[Misc] Use RoPE cache for MRoPE ( #8396 )
2024-09-11 23:13:14 -07:00
Kevin Lin
295c4730a8
[Misc] Raise error when using encoder/decoder model with cpu backend ( #8355 )
2024-09-12 05:45:24 +00:00
Blueyo0
1bf2dd9df0
[Gemma2] add bitsandbytes support for Gemma2 ( #8338 )
2024-09-11 21:53:12 -07:00
tomeras91
5a60699c45
[Bugfix]: Fix the logic for deciding if tool parsing is used ( #8366 )
2024-09-12 03:55:30 +00:00
Michael Goin
b6c75e1cf2
Fix the AMD weight loading tests ( #8390 )
2024-09-11 20:35:33 -07:00
Woosuk Kwon
b71c956deb
[TPU] Use Ray for default distributed backend ( #8389 )
2024-09-11 20:31:51 -07:00
youkaichao
f842a7aff1
[misc] remove engine_use_ray ( #8126 )
2024-09-11 18:23:36 -07:00
Cody Yu
a65cb16067
[MISC] Dump model runner inputs when crashing ( #8305 )
2024-09-12 01:12:25 +00:00
Simon Mo
3fd2b0d21c
Bump version to v0.6.1 ( #8379 )
2024-09-11 14:42:11 -07:00
Patrick von Platen
d394787e52
Pixtral ( #8377 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-11 14:41:55 -07:00
Lily Liu
775f00f81e
[Speculative Decoding] Test refactor ( #8317 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-11 14:07:34 -07:00
Aarni Koskela
8baa454937
[Misc] Move device options to a single place ( #8322 )
2024-09-11 13:25:58 -07:00
bnellnm
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2024-09-11 12:52:19 -07:00
Cyrus Leung
7015417fd4
[Bugfix] Add missing attributes in mistral tokenizer ( #8364 )
2024-09-11 11:36:54 -07:00
Alexey Kondratiev(AMD)
aea02f30de
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation ( #8373 )
2024-09-11 18:31:41 +00:00
Li, Jiang
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend ( #7257 )
2024-09-11 09:46:46 -07:00
Yang Fan
3b7fea770f
[Model][VLM] Add Qwen2-VL model support ( #7905 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-11 09:31:19 -07:00
Pooya Davoodi
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch ( #8347 )
2024-09-11 05:30:11 +00:00
Yangshen⚡Deng
6a512a00df
[model] Support for Llava-Next-Video model ( #7559 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-10 22:21:36 -07:00
Pavani Majety
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
Isotr0py
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel ( #8299 )
2024-09-11 10:11:01 +08:00
Jee Jee Li
e497b8aeff
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models ( #8329 )
2024-09-10 20:59:19 -04:00
Tyler Michael Smith
94144e726c
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag ( #8043 )
2024-09-10 23:51:58 +00:00
William Lin
1d5e397aa4
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers ( #8172 )
2024-09-10 23:46:08 +00:00
Alexander Matveev
22f3a4bc6c
[Bugfix] lookahead block table with cuda graph max capture ( #8340 )
...
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340 )
2024-09-10 16:00:35 -07:00
Cody Yu
b1f3e18958
[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled ( #8342 )
2024-09-10 22:28:28 +00:00
Prashant Gupta
04e7c4e771
[Misc] remove peft as dependency for prompt models ( #8162 )
2024-09-10 17:21:56 -04:00
Kevin Lin
5faedf1b62
[Spec Decode] Move ops.advance_step to flash attn advance_step ( #8224 )
2024-09-10 13:18:14 -07:00
sumitd2
02751a7a42
Fix ppc64le buildkite job ( #8309 )
2024-09-10 12:58:34 -07:00
Alexey Kondratiev(AMD)
f421f3cefb
[CI/Build] Enabling kernels tests for AMD, ignoring some of then that fail ( #8130 )
2024-09-10 11:51:15 -07:00
Cyrus Leung
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer ( #8314 )
2024-09-10 16:49:11 +00:00
Daniele
6234385f4a
[CI/Build] enable ccache/scccache for HIP builds ( #8327 )
2024-09-10 08:55:08 -07:00
Cyrus Leung
da1a844e61
[Bugfix] Fix missing post_layernorm in CLIP ( #8155 )
2024-09-10 08:22:50 +00:00
Simon Mo
a1d874224d
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info ( #8319 )
2024-09-09 23:21:00 -07:00
Dipika Sikka
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ ( #8217 )
2024-09-09 23:02:52 -04:00
Kyle Sayers
c7cb5c3335
[Misc] GPTQ Activation Ordering ( #8135 )
2024-09-09 16:27:26 -04:00
Vladislav Kruglikov
f9b4a2d415
[Bugfix] Correct adapter usage for cohere and jamba ( #8292 )
2024-09-09 11:20:46 -07:00
Adam Lugowski
58fcc8545a
[Frontend] Add progress reporting to run_batch.py ( #8060 )
...
Co-authored-by: Adam Lugowski <adam.lugowski@parasail.io >
2024-09-09 11:16:37 -07:00
Kyle Mistele
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility ( #8272 )
2024-09-09 10:45:11 -04:00
Alexander Matveev
4ef41b8476
[Bugfix] Fix async postprocessor in case of preemption ( #8267 )
2024-09-07 21:01:51 -07:00
Joe Runde
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image ( #8133 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-07 13:03:16 -07:00
sumitd2
b962ee1470
ppc64le: Dockerfile fixed, and a script for buildkite ( #8026 )
2024-09-07 11:18:40 -07:00
Isotr0py
36bf8150cc
[Model][VLM] Decouple weight loading logic for Paligemma ( #8269 )
2024-09-07 17:45:44 +00:00
Isotr0py
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models ( #8201 )
2024-09-07 16:38:23 +08:00
Cyrus Leung
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
youkaichao
ce2702a923
[tpu][misc] fix typo ( #8260 )
2024-09-06 22:40:46 -07:00
Wei-Sheng Chin
795b662cff
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) ( #8241 )
2024-09-06 20:18:16 -07:00
Cyrus Leung
2f707fcb35
[Model] Multi-input support for LLaVA ( #8238 )
2024-09-07 02:57:24 +00:00
Kyle Mistele
41e95c5247
[Bugfix] Fix Hermes tool call chat template bug ( #8256 )
...
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-07 10:49:01 +08:00
William Lin
12dd715807
[misc] [doc] [frontend] LLM torch profiler support ( #7943 )
2024-09-06 17:48:48 -07:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-06 17:02:05 -06:00
Dipika Sikka
23f322297f
[Misc] Remove SqueezeLLM ( #8220 )
2024-09-06 16:29:03 -06:00
rasmith
9db52eab3d
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput ( #8248 )
2024-09-06 16:26:09 -06:00
Alexey Kondratiev(AMD)
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests ( #8203 )
2024-09-06 11:51:03 -07:00
Rui Qiao
de80783b69
[Misc] Use ray[adag] dependency instead of cuda ( #7938 )
2024-09-06 09:18:35 -07:00
afeldman-nm
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py ( #8191 )
2024-09-06 09:01:14 -07:00
Nick Hill
baa5467547
[BugFix] Fix Granite model configuration ( #8216 )
2024-09-06 11:39:29 +08:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-09-05 18:10:33 -07:00
sroy745
2febcf2777
[Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM ( #7962 )
2024-09-05 16:25:29 -04:00
Michael Goin
2ee45281a5
Move verify_marlin_supported to GPTQMarlinLinearMethod ( #8165 )
2024-09-05 11:09:46 -04:00
Alex Brooks
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) ( #8029 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-05 12:48:10 +00:00
manikandan.tm@zucisystems.com
8685ba1a1e
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) ( #7860 )
2024-09-05 11:33:37 +00:00
Cyrus Leung
288a938872
[Doc] Indicate more information about supported modalities ( #8181 )
2024-09-05 10:51:53 +00:00
Elfie Guo
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. ( #8173 )
2024-09-05 05:12:26 +00:00
Kevin H. Luu
ba262c4e5a
[ci] Mark LoRA test as soft-fail ( #8160 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-04 20:33:12 -07:00
Woosuk Kwon
4624d98dbd
[Misc] Clean up RoPE forward_native ( #8076 )
2024-09-04 20:31:48 -07:00
William Lin
1afc931987
[bugfix] >1.43 constraint for openai ( #8169 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 17:35:36 -07:00
Maureen McElaney
e01c2beb7d
[Doc] [Misc] Create CODE_OF_CONDUCT.md ( #8161 )
2024-09-04 16:50:13 -07:00
Simon Mo
32e7db2536
Bump version to v0.6.0 ( #8166 )
2024-09-04 16:34:27 -07:00
Harsha vardhan manoj Bikki
008cf886c9
[Neuron] Adding support for adding/ overriding neuron configuration a… ( #8062 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-09-04 16:33:43 -07:00
Cody Yu
77d9e514a2
[MISC] Replace input token throughput with total token throughput ( #8164 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 20:23:22 +00:00
Kyle Mistele
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models ( #5649 )
...
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com >
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-04 13:18:13 -07:00
Woosuk Kwon
561d6f8077
[CI] Change test input in Gemma LoRA test ( #8163 )
2024-09-04 13:05:50 -07:00
alexeykondrat
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm ( #7369 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-04 11:57:54 -07:00
Cody Yu
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests ( #8131 )
2024-09-04 18:53:25 +00:00
wnma
d3311562fb
[Bugfix] remove post_layernorm in siglip ( #8106 )
2024-09-04 18:55:37 +08:00
TimWang
ccd7207191
chore: Update check-wheel-size.py to read MAX_SIZE_MB from env ( #8103 )
2024-09-03 23:17:05 -07:00
Cyrus Leung
855c262a6b
[Frontend] Multimodal support in offline chat ( #8098 )
2024-09-04 05:22:17 +00:00
Peter Salas
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks ( #7963 )
2024-09-04 04:38:21 +00:00
Dipika Sikka
e16fa99a6a
[Misc] Update fbgemmfp8 to use vLLMParameters ( #7972 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-03 20:12:41 -06:00
Woosuk Kwon
61f4a93d14
[TPU][Bugfix] Use XLA rank for persistent cache path ( #8137 )
2024-09-03 18:35:33 -07:00
Nick Hill
d4db9f53c8
[Benchmark] Add --async-engine option to benchmark_throughput.py ( #7964 )
2024-09-03 20:57:41 -04:00
Dipika Sikka
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters ( #7976 )
2024-09-03 17:21:44 -04:00
Simon Mo
dc0b6066ab
[CI] Change PR remainder to avoid at-mentions ( #8134 )
2024-09-03 14:11:42 -07:00
Woosuk Kwon
0af3abe3d3
[TPU][Bugfix] Fix next_token_ids shape ( #8128 )
2024-09-03 13:29:24 -07:00
Kevin H. Luu
f1575dc99f
[ci] Fix GHA workflow ( #8129 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 13:25:09 -07:00
tomeras91
c02638efb3
[CI/Build] make pip install vllm work in macos (for import only) ( #8118 )
2024-09-03 12:37:08 -07:00
Antoni Baum
652c83b697
[Misc] Raise a more informative exception in add/remove_logger ( #7750 )
2024-09-03 12:28:25 -07:00
Alexander Matveev
6d646d08a2
[Core] Optimize Async + Multi-step ( #8050 )
2024-09-03 18:50:29 +00:00
Kevin H. Luu
95a178f861
[CI] Only PR reviewers/committers can trigger CI on PR ( #8124 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 11:32:27 -07:00
Cody Yu
bd852f2a8b
[Performance] Enable chunked prefill and prefix caching together ( #8120 )
...
Co-authored-by: Tao He <sighingnow@gmail.com >
Co-authored-by: Juelianqvq <Juelianqvq@noreply.github.com >
2024-09-03 10:49:18 -07:00
Isotr0py
ec266536b7
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend ( #8061 )
2024-09-03 21:37:52 +08:00
Woosuk Kwon
0fbc6696c2
[Bugfix] Fix single output condition in output processor ( #7881 )
2024-09-02 20:35:42 -07:00
wang.yuqi
6e36f4fa6c
improve chunked prefill performance
...
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874 )
2024-09-02 14:20:12 -07:00
Isotr0py
dd2a6a82e3
[Bugfix] Fix internlm2 tensor parallel inference ( #8055 )
2024-09-02 23:48:56 +08:00
Isotr0py
4ca65a9763
[Core][Bugfix] Accept GGUF model without .gguf extension ( #8056 )
2024-09-02 08:43:26 -04:00
Woosuk Kwon
e2b2aa5a0f
[TPU] Align worker index with node boundary ( #7932 )
2024-09-01 23:09:46 -07:00
Lily Liu
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling ( #7244 )
2024-09-01 21:23:29 -07:00
Shawn Tan
f8d60145b4
[Model] Add Granite model ( #7436 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-09-01 18:37:18 -07:00
Roger Wang
5b86b19954
[Misc] Optional installation of audio related packages ( #8063 )
2024-09-01 14:46:57 -07:00
Roger Wang
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items ( #8049 )
2024-08-31 16:35:53 -07:00
Robert Shaw
8423aef4c8
[BugFix][Core] Multistep Fix Crash on Request Cancellation ( #8059 )
2024-08-31 19:44:03 +00:00
Nicolò Lucchesi
4f5d8446ed
[Bugfix] Fix ModelScope models in v0.5.5 ( #8037 )
2024-08-31 00:27:58 -07:00
Cyrus Leung
d05f0a9db2
[Bugfix] Fix import error in Phi-3.5-MoE ( #8052 )
2024-08-30 22:26:55 -07:00
Pavani Majety
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. ( #8013 )
2024-08-30 22:18:50 -07:00
Wenxiang
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE ( #7729 )
...
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Zeqi Lin <zelin@microsoft.com >
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com >
2024-08-30 13:42:57 -06:00
Woosuk Kwon
2684efc467
[TPU][Bugfix] Fix tpu type api ( #8035 )
2024-08-30 09:01:26 -07:00
Kaunil Dhruv
058344f89a
[Frontend]-config-cli-args ( #7737 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com >
2024-08-30 08:21:02 -07:00
Cyrus Leung
98cef6a227
[Core] Increase default max_num_batched_tokens for multimodal models ( #8028 )
2024-08-30 08:20:34 -07:00
Jungho Christopher Cho
f97be32d1d
[VLM][Model] TP support for ViTs ( #7186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-30 08:19:27 -07:00
Cyrus Leung
afd39a4511
[Bugfix] Fix import error in Exaone model ( #8034 )
2024-08-30 08:03:28 -07:00
Richard Liu
2148441fd3
[TPU] Support single and multi-host TPUs on GKE ( #7613 )
2024-08-30 00:27:40 -07:00
Yohan Na
dc13e99348
[MODEL] add Exaone model support ( #7819 )
2024-08-29 23:34:20 -07:00
Avshalom Manevich
34a0e96d46
[Kernel] changing fused moe kernel chunk size default to 32k ( #7995 )
2024-08-30 04:11:39 +00:00
Woosuk Kwon
80c7b089b1
[TPU] Async output processing for TPU ( #8011 )
2024-08-29 19:35:29 -07:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
Cyrus Leung
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models ( #7998 )
2024-08-29 17:49:04 -07:00
Wei-Sheng Chin
0c785d344d
Add more percentiles and latencies ( #7759 )
2024-08-29 16:48:11 -07:00
chenqianfzh
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models ( #7445 )
2024-08-29 19:09:08 -04:00
Harsha vardhan manoj Bikki
257afc37c5
[Neuron] Adding support for context-length, token-gen buckets. ( #7885 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-08-29 13:58:14 -07:00
Dipika Sikka
86a677de42
[misc] update tpu int8 to use new vLLM Parameters ( #7973 )
2024-08-29 16:46:55 -04:00
Isotr0py
d78789ac16
[Bugfix] Fix incorrect vocab embedding shards for GGUF model in tensor parallelism ( #7954 )
2024-08-29 15:54:49 -04:00
kushanam
c334b1898b
extend cuda graph size for H200 ( #7894 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-29 12:15:04 -07:00
Pavani Majety
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto ( #7985 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-29 14:53:11 -04:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step ( #7921 )
2024-08-29 11:18:26 -07:00
Jonas M. Kübler
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding ( #7899 )
2024-08-28 22:18:13 -07:00
youkaichao
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." ( #7982 )
2024-08-28 21:27:06 -07:00
Peter Salas
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors ( #7974 )
2024-08-29 03:24:31 +00:00
youkaichao
a7f65c2be9
[torch.compile] remove reset ( #7975 )
2024-08-28 17:32:26 -07:00
Nick Hill
4289cad37f
[Frontend] Minor optimizations to zmq decoupled front-end ( #7957 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-08-28 17:22:43 -07:00
Michael Goin
af59df0a10
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test ( #7961 )
2024-08-28 19:19:17 -04:00
youkaichao
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead ( #7898 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-28 16:10:12 -07:00
bnellnm
3cdfe1f38b
[Bugfix] Make torch registration of punica ops optional ( #7970 )
2024-08-28 16:11:49 -06:00
Mor Zusman
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM ( #7651 )
2024-08-28 15:06:52 -07:00
Stas Bekman
8c56e57def
[Doc] fix 404 link ( #7966 )
2024-08-28 13:54:23 -07:00
Woosuk Kwon
eeffde1ac0
[TPU] Upgrade PyTorch XLA nightly ( #7967 )
2024-08-28 13:10:21 -07:00
rasmith
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ ( #7386 )
2024-08-28 15:37:47 -04:00
Pavani Majety
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. ( #7798 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-28 10:01:22 -07:00
Cyrus Leung
ef9baee3c5
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 ( #7948 )
2024-08-28 08:11:18 -07:00
Stas Bekman
98c12cffe5
[Doc] fix the autoAWQ example ( #7937 )
2024-08-28 12:12:32 +00:00
youkaichao
f52a43a8b9
[ci][test] fix pp test failure ( #7945 )
2024-08-28 01:27:07 -07:00
Cody Yu
e3580537a4
[Performance] Enable chunked prefill and prefix caching together ( #7753 )
2024-08-28 00:36:31 -07:00
Alexander Matveev
f508e03e7f
[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) ( #7911 )
2024-08-28 00:02:30 -07:00
Cyrus Leung
51f86bf487
[mypy][CI/Build] Fix mypy errors ( #7929 )
2024-08-27 23:47:44 -07:00
bnellnm
c166e7e43e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. ( #7886 )
2024-08-27 23:13:45 -04:00
youkaichao
bc6e42a9b1
[hardware][rocm] allow rocm to override default env var ( #7926 )
2024-08-27 19:50:06 -07:00
Peter Salas
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt ( #7902 )
2024-08-28 01:53:56 +00:00
Jonathan Berkhahn
9c71c97ae2
[mypy] Enable mypy type checking for vllm/core ( #7229 )
2024-08-28 07:11:14 +08:00
zifeitong
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference ( #7230 )
2024-08-28 07:09:02 +08:00
Philipp Schmid
345be0e244
[benchmark] Update TGI version ( #7917 )
2024-08-27 15:07:53 -07:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7766 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-27 15:07:09 -07:00
youkaichao
ed6f002d33
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES ( #7924 )
2024-08-27 12:06:11 -07:00
Isotr0py
b09c755be8
[Bugfix] Fix phi3v incorrect image_idx when using async engine ( #7916 )
2024-08-27 17:36:09 +00:00
alexeykondrat
42e932c7d4
[CI/Build][ROCm] Enabling tensorizer tests for ROCm ( #7237 )
2024-08-27 10:09:13 -07:00
Kunshang Ji
076169f603
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. ( #7810 )
2024-08-27 10:07:02 -07:00
Isotr0py
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test ( #7897 )
2024-08-27 15:28:30 +00:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding ( #7739 )
2024-08-27 12:40:02 +00:00
Cody Yu
9606c7197d
Revert #7509 ( #7887 )
2024-08-27 00:16:31 -07:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling ( #7796 )
2024-08-26 21:33:58 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client ( #7565 )
2024-08-26 21:33:17 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor ( #7049 )
...
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com >
2024-08-26 20:53:20 -07:00
Dipika Sikka
015e6cc252
[Misc] Update compressed tensors lifecycle to remove prefix from create_weights ( #7825 )
2024-08-26 18:09:34 -06:00
omrishiv
760e9f71a8
[Bugfix] neuron: enable tensor parallelism ( #7562 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-08-26 15:13:13 -07:00
youkaichao
05826c887b
[misc] fix custom allreduce p2p cache file generation ( #7853 )
2024-08-26 15:02:25 -07:00
Dipika Sikka
dd9857f5fa
[Misc] Update gptq_marlin_24 to use vLLMParameters ( #7762 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-26 17:44:54 -04:00
Dipika Sikka
665304092d
[Misc] Update qqq to use vLLMParameters ( #7805 )
2024-08-26 13:16:15 -06:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule ( #7822 )
2024-08-26 11:24:53 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer ( #7836 )
2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding ( #7855 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-08-26 03:16:38 +00:00
Nick Hill
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation ( #7851 )
2024-08-25 15:45:14 -07:00
youkaichao
70c094ade6
[misc][cuda] improve pynvml warning ( #7852 )
2024-08-25 14:30:09 -07:00
Isotr0py
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test ( #7835 )
2024-08-25 15:53:09 +00:00
Isotr0py
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models ( #7783 )
2024-08-25 11:51:20 +00:00
zifeitong
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes ( #7840 )
2024-08-24 18:16:24 -07:00
youkaichao
aab0fcdb63
[ci][test] fix RemoteOpenAIServer ( #7838 )
2024-08-24 17:31:28 +00:00
youkaichao
ea9fa160e3
[ci][test] exclude model download time in server start time ( #7834 )
2024-08-24 01:03:27 -07:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines ( #7831 )
2024-08-24 00:51:38 -07:00
Tyler Rockwood
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol ( #7654 )
2024-08-23 23:07:24 -07:00
Pooya Davoodi
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API ( #7641 )
2024-08-23 23:04:22 -07:00
Pooya Davoodi
6885fde317
[Bugfix] Fix run_batch logger ( #7640 )
2024-08-23 13:58:26 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine ( #7789 )
2024-08-23 12:45:53 -07:00
Simon Mo
09c7792610
Bump version to v0.5.5 ( #7823 )
2024-08-23 11:35:33 -07:00
Dipika Sikka
f1df5dbfd6
[Misc] Update marlin to use vLLMParameters ( #7803 )
2024-08-23 14:30:52 -04:00
youkaichao
35ee2ad6b9
[github][misc] promote asking llm first ( #7809 )
2024-08-23 09:38:50 -07:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt ( #7746 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-08-23 13:12:44 +00:00
Jie Fu (傅杰)
faeddb565d
[misc] Add Torch profiler support for CPU-only devices ( #7806 )
2024-08-23 05:46:25 +00:00
Kunshang Ji
fc5ebbd1d3
[Hardware][Intel GPU] refactor xpu_model_runner for tp ( #7712 )
2024-08-22 20:06:54 -07:00
SangBin Cho
c01a6cb231
[Ray backend] Better error when pg topology is bad. ( #7584 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-22 17:44:25 -07:00
Joe Runde
b903e1ba7f
[Frontend] error suppression cleanup ( #7786 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-22 21:50:21 +00:00
Siyuan Liu
a152246428
[Misc] fix typo in triton import warning ( #7794 )
2024-08-22 13:51:23 -07:00
Kevin H. Luu
666ad0aa16
[ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args ( #7705 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-22 20:10:55 +00:00
Michael Goin
15310b5101
[Bugfix] Use LoadFormat values for vllm serve --load-format ( #7784 )
2024-08-22 11:37:08 -07:00
Peter Salas
57792ed469
[Doc] Fix incorrect docs from #7615 ( #7788 )
2024-08-22 10:02:06 -07:00
Jiaxin Shan
d3b5b98021
[Misc] Enhance prefix-caching benchmark tool ( #6568 )
2024-08-22 09:32:02 -07:00
Travis Johnson
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output ( #7232 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-22 09:33:48 -04:00
Dipika Sikka
955b5191c9
[Misc] update fp8 to use vLLMParameter ( #7437 )
2024-08-22 08:36:18 -04:00
Lucas Wilkinson
55d63b1211
[Bugfix] Don't build machete on cuda <12.0 ( #7757 )
2024-08-22 08:28:52 -04:00
Flex Wang
4f419c00a6
Fix ShardedStateLoader for vllm fp8 quantization ( #7708 )
2024-08-22 08:25:04 -04:00
Abhinav Goyal
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer ( #6830 )
2024-08-22 02:42:24 -07:00
Woosuk Kwon
b3856bef7d
[Misc] Use torch.compile for GemmaRMSNorm ( #7642 )
2024-08-22 01:14:13 -07:00
youkaichao
8c6f694a79
[ci] refine dependency for distributed tests ( #7776 )
2024-08-22 00:54:15 -07:00
Woosuk Kwon
eeee1c3b1a
[TPU] Avoid initializing TPU runtime in is_tpu ( #7763 )
2024-08-21 21:31:49 -07:00
Michael Goin
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )" ( #7764 )
2024-08-22 03:42:14 +00:00
Joe Runde
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness ( #7443 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-22 02:18:11 +00:00
zifeitong
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue ( #7710 )
2024-08-22 09:36:24 +08:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce ( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-21 20:18:00 -04:00
Gregory Shtrasberg
9984605412
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility ( #7477 )
...
Co-authored-by: Charlie Fu <Charlie.Fu@amd.com >
2024-08-21 16:47:36 -07:00
youkaichao
7eebe8ccaa
[distributed][misc] error on same VLLM_HOST_IP setting ( #7756 )
2024-08-21 16:25:34 -07:00
Dipika Sikka
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-21 16:17:10 -07:00
William Lin
5844017285
[ci] [multi-step] narrow multi-step test dependency paths ( #7760 )
2024-08-21 15:52:40 -07:00
Peter Salas
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig ( #7615 )
2024-08-21 22:49:39 +00:00
William Lin
dd53c4b023
[misc] Add Torch profiler support ( #7451 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-21 15:39:26 -07:00
Robert Shaw
970dfdc01d
[Frontend] Improve Startup Failure UX ( #7716 )
2024-08-21 19:53:01 +00:00
William Lin
91f4522cbf
[multi-step] Raise error if not using async engine ( #7703 )
2024-08-21 11:49:19 -07:00
sasha0552
1b32e02648
[Bugfix] Pass PYTHONPATH from setup.py to CMake ( #7730 )
2024-08-21 11:17:48 -07:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend ( #7394 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-21 13:34:14 -04:00
Brian Li
d3c002eadc
[Bugfix] chat method add_generation_prompt param ( #7734 )
2024-08-21 17:33:35 +00:00
Nick Hill
9b73a2f498
[Spec Decoding] Use target model max length as default for draft model ( #7706 )
2024-08-22 00:23:22 +08:00
Isotr0py
6925cdbeea
[Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend ( #7735 )
2024-08-21 16:23:03 +00:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] ( #7509 )
2024-08-21 08:54:31 -07:00
Nick Hill
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations ( #7698 )
2024-08-21 11:45:55 -04:00
sasha0552
dd3fa0e430
[Bugfix] Mirror jinja2 in pyproject.toml ( #7723 )
2024-08-21 13:41:17 +00:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Fei <dfdfcai4@gmail.com >
2024-08-20 23:28:21 -07:00
Roger Wang
4506641212
[Doc] Section for Multimodal Language Models ( #7719 )
2024-08-20 23:24:01 -07:00
Isotr0py
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model ( #7187 )
2024-08-20 23:18:57 -07:00
youkaichao
b74a125800
[ci] try to log process using the port to debug the port usage ( #7711 )
2024-08-20 17:41:12 -07:00
Antoni Baum
66a9e713a7
[Core] Pipe worker_class_fn argument in Executor ( #7707 )
2024-08-21 00:37:39 +00:00
youkaichao
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test ( #7709 )
2024-08-20 17:12:44 -07:00
Kunshang Ji
6e4658c7aa
[Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) ( #7685 )
2024-08-20 12:01:09 -07:00
Antoni Baum
3b682179dd
[Core] Add AttentionState abstraction ( #7663 )
2024-08-20 18:50:45 +00:00
Lucas Wilkinson
c6af027a35
[Misc] Add jinja2 as an explicit build requirement ( #7695 )
2024-08-20 17:17:47 +00:00
Ronen Schaffer
2aa00d59ad
[CI/Build] Pin OpenTelemetry versions and make errors clearer ( #7266 )
...
[CI/Build] Pin OpenTelemetry versions and make availability errors clearer (#7266 )
2024-08-20 10:02:21 -07:00
Kunshang Ji
c42590f97a
[Hardware] [Intel GPU] refactor xpu worker/executor ( #7686 )
2024-08-20 09:54:10 -07:00
Isotr0py
aae6927be0
[VLM][Model] Add test for InternViT vision encoder ( #7409 )
2024-08-20 23:10:20 +08:00
Ilya Lavrenov
398521ad19
[OpenVINO] Updated documentation ( #7687 )
2024-08-20 07:33:56 -06:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
Kunshang Ji
b6f99a6ffe
[Core] Refactor executor classes for easier inheritance ( #7673 )
...
[Core] Refactor executor classes to make it easier to inherit GPUExecutor (#7673 )
2024-08-20 00:56:50 -07:00
youkaichao
ad28a74beb
[misc][cuda] add warning for pynvml user ( #7675 )
2024-08-20 00:35:09 -07:00
jianyizh
e6d811dd13
[XPU] fallback to native implementation for xpu custom op ( #7670 )
2024-08-20 00:26:09 -07:00
youkaichao
c4be16e1a7
[misc] add nvidia related library in collect env ( #7674 )
2024-08-19 23:22:49 -07:00
Kuntai Du
3d8a5f063d
[CI] Organizing performance benchmark files ( #7616 )
2024-08-19 22:43:54 -07:00
Zijian Hu
f4fc7337bf
[Bugfix] support tie_word_embeddings for all models ( #5724 )
2024-08-19 20:00:04 -07:00
Kevin H. Luu
0df7ec0b2d
[ci] Install Buildkite test suite analysis ( #7667 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-19 19:55:04 -07:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion ( #7508 )
2024-08-19 17:58:14 -07:00
youkaichao
e54ebc2f8f
[doc] fix doc build error caused by msgspec ( #7659 )
2024-08-19 17:50:59 -07:00
Travis Johnson
67e02fa8a4
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding ( #7665 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-20 00:43:09 +00:00
Woosuk Kwon
43735bf5e1
[TPU] Remove redundant input tensor cloning ( #7660 )
2024-08-19 15:55:04 -07:00
Andrew Song
da115230fd
[Bugfix] Don't disable existing loggers ( #7664 )
2024-08-19 15:11:58 -07:00
Isotr0py
7601cb044d
[Core] Support tensor parallelism for GGUF quantization ( #7520 )
2024-08-19 17:30:14 -04:00
William Lin
47b65a5508
[core] Multi Step Scheduling ( #7000 )
...
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
2024-08-19 13:52:13 -07:00
Ali Panahi
dad961ef5c
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 ( #5428 )
2024-08-19 20:47:00 +00:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics ( #7606 )
2024-08-19 11:52:07 -07:00
Woosuk Kwon
df845b2b46
[Misc] Remove Gemma RoPE ( #7638 )
2024-08-19 09:29:31 -07:00
Kunshang Ji
1a36287b89
[Bugfix] Fix xpu build ( #7644 )
2024-08-18 22:00:09 -07:00
Peng Guanwen
f710fb5265
[Core] Use flashinfer sampling kernel when available ( #7137 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-19 03:24:03 +00:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization ( #7109 )
2024-08-18 17:57:20 -07:00
Woosuk Kwon
200a2ffa6b
[Misc] Refactor Llama3 RoPE initialization ( #7637 )
2024-08-18 17:18:12 -07:00
Alex Brooks
40e1360bb6
[CI/Build] Add text-only test for Qwen models ( #7475 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-08-19 07:43:46 +08:00
Robert Shaw
e3b318216d
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend ( #7279 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-18 20:19:48 +00:00
Woosuk Kwon
ab7165f2c7
[TPU] Optimize RoPE forward_native2 ( #7636 )
2024-08-18 01:15:10 -07:00
Woosuk Kwon
0c2fa50b84
[TPU] Use mark_dynamic only for dummy run ( #7634 )
2024-08-18 00:18:53 -07:00
Woosuk Kwon
ce143353c6
[TPU] Skip creating empty tensor ( #7630 )
2024-08-17 14:22:46 -07:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling ( #7530 )
2024-08-17 13:30:55 -07:00
Jee Jee Li
1ef13cf92f
[Misc]Fix BitAndBytes exception messages ( #7626 )
2024-08-17 12:02:14 -07:00
youkaichao
832163b875
[ci][test] allow longer wait time for api server ( #7629 )
2024-08-17 11:26:38 -07:00
Besher Alkurdi
e73f76eec6
[Model] Pipeline parallel support for JAIS ( #7603 )
2024-08-17 11:11:09 -07:00
youkaichao
d95cc0a55c
[core][misc] update libcudart finding ( #7620 )
...
Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-08-16 23:01:35 -07:00
youkaichao
5bf45db7df
[ci][test] fix engine/logger test ( #7621 )
2024-08-16 23:00:59 -07:00
youkaichao
eed020f673
[misc] use nvml to get consistent device name ( #7582 )
2024-08-16 21:15:13 -07:00
Xander Johnson
7c0b7ea214
[Bugfix] add >= 1.0 constraint for openai dependency ( #7612 )
2024-08-16 20:56:01 -07:00
SangBin Cho
4706eb628e
[aDAG] Unflake aDAG + PP tests ( #7600 )
2024-08-16 20:49:30 -07:00
Rui Qiao
bae888cb8e
[Bugfix] Clear engine reference in AsyncEngineRPCServer ( #7618 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-16 20:44:05 -07:00
Alexei-V-Ivanov-AMD
6bd19551b0
[Build/CI] Enabling passing AMD tests. ( #7610 )
2024-08-16 20:25:32 -07:00
bnellnm
e680349994
[Bugfix] Fix custom_ar support check ( #7617 )
2024-08-16 19:05:49 -07:00
Michael Goin
44f26a9466
[Model] Align nemotron config with final HF state and fix lm-eval-small ( #7611 )
2024-08-16 15:56:34 -07:00
bnellnm
37fd47e780
[Kernel] fix types used in aqlm and ggml kernels to support dynamo ( #7596 )
2024-08-16 14:00:11 -07:00
bnellnm
7759ae958f
[Kernel][Misc] dynamo support for ScalarType ( #7594 )
2024-08-16 13:59:49 -07:00
bnellnm
9f69856356
[Kernel] register punica functions as torch ops ( #7591 )
2024-08-16 13:59:38 -07:00
Michael Goin
d4f0f17b02
[Doc] Update quantization supported hardware table ( #7595 )
2024-08-16 13:59:27 -07:00
Michael Goin
b3f4e17935
[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints ( #7444 )
2024-08-16 13:59:16 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 ( #7440 )
...
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440 )
2024-08-16 13:46:01 -07:00
William Lin
f366f6339b
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend ( #7571 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-16 11:41:56 -07:00
Michael Goin
855866caa9
[Kernel] Add tuned triton configs for ExpertsInt8 ( #7601 )
2024-08-16 11:37:01 -07:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE ( #7415 )
2024-08-16 10:06:51 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
fzyzcjy
ec724a725e
support tqdm in notebooks ( #7510 )
2024-08-16 09:17:50 -07:00
Gordon Wong
0e39a33c6d
[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method ( #7513 )
2024-08-16 10:05:18 -06:00
Kuntai Du
6fc5b0f249
[CI] Fix crashes of performance benchmark ( #7500 )
2024-08-16 08:08:45 -07:00
Nick Hill
9587b050fb
[Core] Use uvloop with zmq-decoupled front-end ( #7570 )
2024-08-15 22:48:07 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops ( #7536 )
2024-08-15 22:38:56 -07:00
jon-chuang
50b8d08dbd
[Misc/Testing] Use torch.testing.assert_close ( #7324 )
2024-08-16 04:24:04 +00:00
Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck ( #7574 )
2024-08-15 21:16:20 -07:00
nunjunj
3b19e39dc5
Chat method for offline llm ( #5049 )
...
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-08-15 19:41:34 -07:00
youkaichao
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail ( #7572 )
2024-08-15 19:39:04 -07:00
Grant Pinkert
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 ( #7453 )
2024-08-16 02:38:08 +00:00
shangmingc
b67ae00cdb
[Misc] Add quantization config support for speculative model. ( #7343 )
2024-08-15 19:34:28 -07:00
Michael Goin
9c8e2d1161
[Bugfix][Harmless] Fix float16 dtype for model_is_embedding ( #7566 )
2024-08-15 18:26:19 -07:00
Michael Goin
21313e09e3
[Bugfix] Fix default weight loading for scalars ( #7534 )
2024-08-15 13:10:22 -07:00
PHILO-HE
f4da5f7b6d
[Misc] Update dockerfile for CPU to cover protobuf installation ( #7182 )
2024-08-15 10:03:01 -07:00
omrishiv
9c1f78d5d6
[Bugfix] update neuron for version > 0.5.0 ( #7175 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-15 09:44:14 -07:00
Woosuk Kwon
fc93e56143
[Bugfix][TPU] Correct env variable for XLA cache path ( #7544 )
2024-08-15 00:02:29 -07:00
Kameshwara Pavan Kumar Mantha
22b39e11f2
llama_index serving integration documentation ( #6973 )
...
Co-authored-by: pavanmantha <pavan.mantha@thevaslabs.io >
2024-08-14 15:38:37 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert compressed-tensors code reuse ( #7521 )
2024-08-14 15:07:37 -07:00
Woosuk Kwon
951fdd66d3
[TPU] Set per-rank XLA cache ( #7533 )
2024-08-14 14:47:51 -07:00
William Lin
2ecf7b1757
[core] [3/N] multi-step args and sequence.py ( #7452 )
2024-08-14 12:32:45 -07:00
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt ( #7126 )
2024-08-14 17:55:42 +00:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray ( #7424 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-14 09:44:27 -07:00
jack
67d115db08
[Bugfix][Frontend] Disable embedding API for chat models ( #7504 )
...
Co-authored-by: jack <jack@alex>
2024-08-14 09:15:19 -07:00
youkaichao
d3d9cb6e4b
[ci] fix model tests ( #7507 )
2024-08-14 01:01:43 -07:00
Chang Su
c134a46402
Fix empty output when temp is too low ( #2937 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-08-14 05:31:44 +00:00
youkaichao
199adbb7cf
[doc] update test script to include cudagraph ( #7501 )
2024-08-13 21:52:58 -07:00
Cyrus Leung
dd164d72f3
[Bugfix][Docs] Update list of mock imports ( #7493 )
2024-08-13 20:37:30 -07:00
youkaichao
ea49e6a3c8
[misc][ci] fix cpu test with plugins ( #7489 )
2024-08-13 19:27:46 -07:00
Jee Jee Li
97992802f3
[CI/Build]Reduce the time consumption for LoRA tests ( #7396 )
2024-08-13 17:27:29 -07:00
Woosuk Kwon
59edd0f134
[Bugfix][CI] Import ray under guard ( #7486 )
2024-08-13 17:12:58 -07:00
Woosuk Kwon
a08df8322e
[TPU] Support multi-host inference ( #7457 )
2024-08-13 16:31:20 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation ( #7426 )
2024-08-13 16:24:17 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse ( #7277 )
2024-08-13 19:05:15 -04:00
youkaichao
33e5d7e6b6
[frontend] spawn engine process from api server process ( #7484 )
2024-08-13 15:40:17 -07:00
Simon Mo
c5c7768264
Announce NVIDIA Meetup ( #7483 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-13 14:28:36 -07:00
Dipika Sikka
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters ( #7422 )
2024-08-13 17:08:20 -04:00
Dipika Sikka
d3bdfd3ab9
[Misc] Update Fused MoE weight loading ( #7334 )
2024-08-13 14:57:45 -04:00
Dipika Sikka
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters ( #7281 )
2024-08-13 14:30:11 -04:00
Dipika Sikka
181abbc27d
[Misc] Update LM Eval Tolerance ( #7473 )
2024-08-13 14:28:14 -04:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models ( #7446 )
2024-08-13 17:39:33 +00:00
Woosuk Kwon
e20233d361
Revert "[Doc] Update supported_hardware.rst ( #7276 )" ( #7467 )
2024-08-13 01:37:08 -07:00
Woosuk Kwon
d6e634f3d7
[TPU] Suppress import custom_ops warning ( #7458 )
2024-08-13 00:30:30 -07:00
youkaichao
4d2dc5072b
[hardware] unify usage of is_tpu to current_platform.is_tpu() ( #7102 )
2024-08-13 00:16:42 -07:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 ( #7410 )
2024-08-13 05:33:41 +00:00
Kevin H. Luu
5469146bcc
[ci] Remove fast check cancel workflow ( #7455 )
2024-08-12 21:19:51 -07:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message ( #7435 )
2024-08-13 02:29:34 +00:00
Cyrus Leung
9ba85bc152
[mypy] Misc. typing improvements ( #7417 )
2024-08-13 09:20:20 +08:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit ( #7224 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-12 17:57:16 -07:00
Daniele
774cd1d3bf
[CI/Build] bump minimum cmake version ( #6999 )
2024-08-12 16:29:20 -07:00
sasha0552
91294d56e1
[Bugfix] Handle PackageNotFoundError when checking for xpu version ( #7398 )
2024-08-12 16:07:20 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel ( #7208 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-12 22:47:41 +00:00
Cyrus Leung
4ddc4743d7
[Core] Consolidate GB constant and enable float GB arguments ( #7416 )
2024-08-12 14:14:14 -07:00
Lucas Wilkinson
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin kernels ( #7323 )
2024-08-12 14:40:13 -04:00
Kevin H. Luu
1137f343aa
[ci] Cancel fastcheck when PR is ready ( #7433 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:59:14 -07:00
Kevin H. Luu
9b3e2edd30
[ci] Cancel fastcheck run when PR is marked ready ( #7427 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:56:52 -07:00
Kevin H. Luu
65950e8f58
[ci] Entrypoints run upon changes in vllm/ ( #7423 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:18:03 -07:00
Woosuk Kwon
cfba4def5d
[Bugfix] Fix logit soft cap in flash-attn backend ( #7425 )
2024-08-12 09:58:28 -07:00
Daniele
d2bc4510a4
[CI/Build] bump Dockerfile.neuron image base, use public ECR ( #6832 )
2024-08-12 09:53:35 -07:00
Cyrus Leung
24154f8618
[Frontend] Disallow passing model as both argument and option ( #7347 )
2024-08-12 12:58:34 +00:00
Roger Wang
e6e42e4b17
[Core][VLM] Support image embeddings as input ( #6613 )
2024-08-12 16:16:06 +08:00
Lily Liu
ec2affa8ae
[Kernel] Flashinfer correctness fix for v0.1.3 ( #7319 )
2024-08-12 07:59:17 +00:00
Roger Wang
86ab567bae
[CI/Build] Minor refactoring for vLLM assets ( #7407 )
2024-08-12 02:41:52 +00:00
Simon Mo
f020a6297e
[Docs] Update readme ( #7316 )
2024-08-11 17:13:37 -07:00
youkaichao
6c8e595710
[misc] add commit id in collect env ( #7405 )
2024-08-11 15:40:48 -07:00
tomeras91
02b1988b9f
[Doc] building vLLM with VLLM_TARGET_DEVICE=empty ( #7403 )
2024-08-11 14:38:17 -07:00
tomeras91
386087970a
[CI/Build] build on empty device for better dev experience ( #4773 )
2024-08-11 13:09:44 -07:00
William Lin
c08e2b3086
[core] [2/N] refactor worker_base input preparation for multi-step ( #7387 )
2024-08-11 08:50:08 -07:00
Noam Gat
4fb7b52a2c
Updating LM Format Enforcer version to v0.10.6 ( #7189 )
2024-08-11 08:11:50 -04:00
Woosuk Kwon
90bab18f24
[TPU] Use mark_dynamic to reduce compilation time ( #7340 )
2024-08-10 18:12:22 -07:00
Isotr0py
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio ( #7392 )
2024-08-10 16:19:33 +00:00
Cade Daniel
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 ( #7380 )
2024-08-09 23:48:49 +00:00
Antoni Baum
999ef0b917
[Misc] Add numpy implementation of compute_slot_mapping ( #7377 )
2024-08-09 22:52:29 +00:00
Dipika Sikka
5c6c54d67a
[Bugfix] Fix PerTensorScaleParameter weight loading for fused models ( #7376 )
2024-08-09 21:23:46 +00:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time ( #7089 )
2024-08-09 13:55:13 -07:00
Roger Wang
70d268a399
[Bugfix] Fix ITL recording in serving benchmark ( #7372 )
2024-08-09 10:00:00 -07:00
Pooya Davoodi
249b88228d
[Frontend] Support embeddings in the run_batch API ( #7132 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-09 09:48:21 -07:00
Alexander Matveev
74af2bbd90
[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder ( #7360 )
2024-08-09 16:35:49 +00:00
Alexander Matveev
fc7b8d1eef
[Performance] e2e overheads reduction: Small followup diff ( #7364 )
2024-08-09 15:49:36 +00:00
Isotr0py
67abdbb42f
[VLM][Doc] Add stop_token_ids to InternVL example ( #7354 )
2024-08-09 14:51:04 +00:00
Mor Zusman
07ab160741
[Model][Jamba] Mamba cache single buffer ( #6739 )
...
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-08-09 10:07:06 -04:00
Nick Hill
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine ( #7336 )
2024-08-09 07:06:36 +00:00
William Lin
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace ( #6971 )
2024-08-09 05:42:45 +00:00
Travis Johnson
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary ( #7218 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-08 22:08:46 -07:00
Alexander Matveev
e02ac55617
[Performance] Optimize e2e overheads: Reduce python allocations ( #7162 )
2024-08-08 21:34:28 -07:00
Woosuk Kwon
73388c07a4
[TPU] Fix dockerfile.tpu ( #7331 )
2024-08-08 20:24:58 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
Siyuan Liu
0fa14907da
[TPU] Add Load-time W8A16 quantization for TPU Backend ( #7005 )
2024-08-08 18:35:49 -07:00
Simon Mo
5923532e15
Add Skywork AI as Sponsor ( #7314 )
2024-08-08 13:59:57 -07:00
Jee Jee Li
a049b107e2
[Misc] Temporarily resolve the error of BitAndBytes ( #7308 )
2024-08-08 13:42:58 -07:00
Isotr0py
8334c39f37
[Bugfix] Fix new Llama3.1 GGUF model loading ( #7269 )
2024-08-08 13:42:44 -07:00
Daniele
e904576743
[CI/Build] Dockerfile.cpu improvements ( #7298 )
2024-08-08 15:24:52 -04:00
Michael Goin
e14fb22e59
[Doc] Put collect_env issue output in a <detail> block ( #7310 )
2024-08-08 11:22:49 -07:00
Zach Zheng
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic ( #6849 )
2024-08-08 10:43:30 -07:00
Joe Runde
21b9c49aa3
[Frontend] Kill the server on engine death ( #6594 )
...
Signed-off-by: Joe Runde <joe@joerun.de >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-08 09:47:48 -07:00
Luka Govedič
5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests ( #7305 )
2024-08-08 12:16:13 -04:00
Jee Jee Li
757ac70a64
[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 ( #7273 )
2024-08-08 14:02:41 +00:00
Murali Andoorveedu
6dffa4b0a6
[Bugfix] Fix LoRA with PP ( #7292 )
2024-08-08 00:02:27 -07:00
Cherilyn Buren
48abee9e54
[Frontend] remove max_num_batched_tokens limit for lora ( #7288 )
2024-08-08 06:17:29 +00:00
Rui Qiao
746709642c
[Misc] Fix typos in scheduler.py ( #7285 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-07 17:06:01 -07:00
Lily Liu
e53dfd3eaf
[Kernel] Fix Flashinfer Correctness ( #7284 )
2024-08-07 16:26:52 -07:00
Michael Goin
6d94420246
[Doc] Update supported_hardware.rst ( #7276 )
2024-08-07 14:21:50 -07:00
Nick Hill
fc1493a01e
[FrontEnd] Make merge_async_iterators is_cancelled arg optional ( #7282 )
2024-08-07 13:35:14 -07:00
Lucas Wilkinson
311f743831
[Bugfix] Fix gptq failure on T4s ( #7264 )
2024-08-07 20:05:37 +00:00
Kevin H. Luu
469b3bc538
[ci] Make building wheels per commit optional ( #7278 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-07 11:34:25 -07:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization ( #7219 )
2024-08-07 11:23:12 -07:00
Maximilien de Bayser
fde47d3bc2
[BugFix] Fix frontend multiprocessing hang ( #7217 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-08-07 18:09:36 +00:00
Stas Bekman
0e12cd67a8
[Doc] add online speculative decoding example ( #7243 )
2024-08-07 09:58:02 -07:00
Ilya Lavrenov
80cbe10c59
[OpenVINO] migrate to latest dependencies versions ( #7251 )
2024-08-07 09:49:10 -07:00
Isotr0py
b764547616
[Bugfix] Fix input processor for InternVL2 model ( #7164 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-07 09:32:07 -07:00
Rafael Vasquez
ab0f5e2823
Fixes typo in function name ( #7275 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-08-07 09:29:27 -07:00
Robert Shaw
564985729a
[ BugFix ] Move zmq frontend to IPC instead of TCP ( #7222 )
2024-08-07 16:24:56 +00:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 ( #5874 )
2024-08-07 09:17:58 -07:00
youkaichao
639159b2a6
[distributed][misc] add specialized method for cuda platform ( #7249 )
2024-08-07 08:54:52 -07:00
Cyrus Leung
66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure ( #7238 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-07 09:12:05 +00:00
Atilla Akkuş
7b261092de
[BUGFIX]: top_k is expected to be an integer. ( #7227 )
2024-08-07 00:32:16 -07:00
Roger Wang
2385c8f374
[Doc] Mock new dependencies for documentation ( #7245 )
2024-08-07 06:43:03 +00:00
Nick Hill
9a3f49ae07
[BugFix] Overhaul async request cancellation ( #7111 )
2024-08-07 13:21:41 +08:00
Michael Goin
f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading ( #7225 )
2024-08-06 18:34:26 -07:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) ( #4942 )
...
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-06 16:51:47 -04:00
xiaobochen123
660470e5a3
[Core] Optimize evictor-v2 performance ( #7193 )
2024-08-06 12:34:25 -07:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues ( #5941 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-08-06 18:17:08 +00:00
Lily Liu
5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests ( #7183 )
2024-08-06 10:40:32 -07:00
Katarzyna Papis
00afc78590
[Bugfix] add gguf dependency ( #7198 )
...
Co-authored-by: katarzyna.papis <kpapis@kpapis-u20.sclab.intel.com >
2024-08-06 10:08:35 -07:00
Robert Shaw
541c1852d3
[ BugFix ] Fix ZMQ when VLLM_PORT is set ( #7205 )
2024-08-06 09:26:26 -07:00
Dipika Sikka
a3bbbfa1d8
[BugFix] Fix DeepSeek remote code ( #7178 )
2024-08-06 08:16:53 -07:00
Cyrus Leung
1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models ( #7153 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-08-06 16:55:31 +08:00
Jee Jee Li
9118217f58
[LoRA] Relax LoRA condition ( #7146 )
2024-08-06 01:57:25 +00:00
Simon Mo
e3c664bfcb
[Build] Add initial conditional testing spec ( #6841 )
2024-08-05 17:39:22 -07:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-05 17:54:23 -06:00
Cody Yu
ef527be06c
[MISC] Use non-blocking transfer in prepare_input ( #7172 )
2024-08-05 23:41:27 +00:00
Jacob Schein
89b8db6bb2
[Bugfix] Specify device when loading LoRA and embedding tensors ( #7129 )
...
Co-authored-by: Jacob Schein <jacobschein@Jacobs-MacBook-Pro-2.local >
2024-08-05 16:35:47 -07:00
Thomas Parnell
789937af2e
[Doc] [SpecDecode] Update MLPSpeculator documentation ( #7100 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-08-05 23:29:43 +00:00
youkaichao
dfb1a15dcb
[ci][frontend] deduplicate tests ( #7101 )
2024-08-05 15:59:22 -07:00
Simon Mo
4db5176d97
bump version to v0.5.4 ( #7139 )
2024-08-05 14:39:48 -07:00
Tyler Michael Smith
4cf1dc39be
[Bugfix][CI/Build] Fix CUTLASS FetchContent ( #7171 )
2024-08-05 14:22:57 -07:00
Tyler Michael Smith
6e4852ce28
[CI/Build] Suppress divide-by-zero and missing return statement warnings ( #7001 )
2024-08-05 16:00:01 -04:00
Tyler Michael Smith
8571ac4672
[Kernel] Update CUTLASS to 3.5.1 ( #7085 )
2024-08-05 15:13:43 -04:00
Rui Qiao
997cf78308
[Misc] Fix typo in GroupCoordinator.recv() ( #7167 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-05 11:10:16 -07:00
Aditya Paliwal
57f560aa23
[BugFix] Use args.trust_remote_code ( #7121 )
2024-08-05 09:26:14 -07:00
Nick Hill
003f8ee128
[BugFix] Use IP4 localhost form for zmq bind ( #7163 )
2024-08-05 08:41:03 -07:00
Bongwon Jang
e9630458c7
[SpecDecode] Support FlashInfer in DraftModelRunner ( #6926 )
2024-08-05 08:05:05 -07:00
Cade Daniel
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification ( #6963 )
2024-08-05 08:46:44 +00:00
Jungho Christopher Cho
c0d8f1636c
[Model] SiglipVisionModel ported from transformers ( #6942 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-05 06:22:12 +00:00
Cyrus Leung
cc08fc7225
[Frontend] Reapply "Factor out code for running uvicorn" ( #7095 )
2024-08-04 20:40:51 -07:00
Alphi
7b86e7c9cd
[Model] Add multi-image support for minicpmv ( #7122 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-05 09:23:17 +08:00
Jee Jee Li
f80ab3521c
Clean up remaining Punica C information ( #7027 )
2024-08-04 15:37:08 -07:00
youkaichao
16a1cc9bb2
[misc][distributed] improve libcudart.so finding ( #7127 )
2024-08-04 11:31:51 -07:00
Thomas Parnell
b1c9aa3daa
[Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator ( #7105 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-08-04 07:13:18 -07:00
Jee Jee Li
179a6a36f2
[Model]Refactor MiniCPMV ( #7020 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-04 08:12:41 +00:00
youkaichao
83c644fe7e
[core][misc] simply output processing with shortcut code path ( #7117 )
2024-08-04 00:22:19 -07:00
youkaichao
9fadc7b7a0
[misc] add zmq in collect env ( #7119 )
2024-08-03 22:03:46 -07:00
Yihuan Bu
654bc5ca49
Support for guided decoding for offline LLM ( #6878 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-04 03:12:09 +00:00
Jeff Fialho
825b044863
[Frontend] Warn if user max_model_len is greater than derived max_model_len ( #7080 )
...
Signed-off-by: Jefferson Fialho <jfialho@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-03 16:01:38 -07:00
youkaichao
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test ( #7114 )
2024-08-03 10:44:53 -07:00
Kuntai Du
67d745cc68
[CI] Temporarily turn off H100 performance benchmark ( #7104 )
2024-08-02 23:52:44 -07:00
Jee Jee Li
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA ( #7081 )
2024-08-02 22:40:19 -07:00
Zach Zheng
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits ( #7018 )
2024-08-02 22:38:15 -07:00
Isotr0py
0c25435daa
[Model] Refactor and decouple weight loading logic for InternVL2 model ( #7067 )
2024-08-02 22:36:14 -07:00
youkaichao
a0d164567c
[ci][distributed] disable ray dag tests ( #7099 )
2024-08-02 22:32:04 -07:00
youkaichao
04e5583425
[ci][distributed] merge distributed test commands ( #7097 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-02 21:33:53 -07:00
Cyrus Leung
8c025fa703
[Frontend] Factor out chat message parsing ( #7055 )
2024-08-02 21:31:27 -07:00
youkaichao
69ea15e5cc
[ci][distributed] shorten wait time if server hangs ( #7098 )
2024-08-02 21:05:16 -07:00
Robert Shaw
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq ( #6883 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <joe@joerun.de >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-02 18:27:28 -07:00
youkaichao
708989341e
[misc] add a flag to enable compile ( #7092 )
2024-08-02 16:18:45 -07:00
Rui Qiao
22e718ff1a
[Misc] Revive to use loopback address for driver IP ( #7091 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-02 15:50:00 -07:00
Rui Qiao
05308891e2
[Core] Pipeline parallel with Ray ADAG ( #6837 )
...
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-02 13:55:40 -07:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
Michael Goin
b482b9a5b1
[CI/Build] Add support for Python 3.12 ( #7035 )
2024-08-02 13:51:22 -07:00
youkaichao
806949514a
[ci] set timeout for test_oot_registration.py ( #7082 )
2024-08-02 10:03:24 -07:00
Jie Fu (傅杰)
c16eaac500
[Hardware][Intel CPU] Update torch 2.4.0 for CPU backend ( #6931 )
2024-08-02 08:55:58 -07:00
Peng Guanwen
db35186391
[Core] Comment out unused code in sampler ( #7023 )
2024-08-02 00:58:26 -07:00
youkaichao
660dea1235
[cuda][misc] remove error_on_invalid_device_count_status ( #7069 )
2024-08-02 00:14:21 -07:00
Bongwon Jang
cf2a1a4d9d
Fix tracing.py ( #7065 )
2024-08-01 23:28:00 -07:00
youkaichao
252357793d
[ci][distributed] try to fix pp test ( #7054 )
2024-08-01 22:03:12 -07:00
Cyrus Leung
3bb4b1e4cd
[mypy] Speed up mypy checking ( #7056 )
2024-08-01 19:49:43 -07:00
Lily Liu
954f7305a1
[Kernel] Fix input for flashinfer prefill wrapper. ( #7008 )
2024-08-01 18:44:16 -07:00
Woosuk Kwon
6ce01f3066
[Performance] Optimize get_seqs ( #7051 )
2024-08-01 18:29:52 -07:00
Tyler Michael Smith
6a11fdfbb8
[CI/Build][Bugfix] Fix CUTLASS header-only line ( #7034 )
2024-08-01 13:51:15 -07:00
Woosuk Kwon
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn ( #7022 )
2024-08-01 13:14:37 -07:00
omkar kakarparthi
562e580abc
Update run-amd-test.sh ( #7044 )
2024-08-01 13:12:37 -07:00
Murali Andoorveedu
fc912e0886
[Models] Support Qwen model with PP ( #6974 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-08-01 12:40:43 -07:00
Michael Goin
f4fd390f5d
[Bugfix] Lower gemma's unloaded_params exception to warning ( #7002 )
2024-08-01 12:01:07 -07:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing ( #7037 )
2024-08-01 12:00:51 -07:00
Isotr0py
2dd34371a6
[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm ( #6992 )
2024-08-01 12:00:28 -07:00
Sage Moore
7e0861bd0b
[CI/Build] Update PyTorch to 2.4.0 ( #6951 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-01 11:11:24 -07:00
Alexei-V-Ivanov-AMD
a72a424b3e
[Build/CI] Fixing Docker Hub quota issue. ( #7043 )
2024-08-01 11:07:37 -07:00
youkaichao
c8a7e93273
[core][scheduler] simplify and improve scheduler ( #6867 )
2024-07-31 23:51:09 -07:00
zifeitong
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user ( #6954 )
2024-07-31 21:13:34 -07:00
Aurick Qiao
0437492ea9
PP comm optimization: replace send with partial send + allgather ( #6695 )
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
2024-07-31 20:15:42 -07:00
Travis Johnson
630dd9e0ae
[Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings ( #6758 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-31 19:49:11 -07:00
Woosuk Kwon
23993a7997
[Bugfix][TPU] Do not use torch.Generator for TPUs ( #6981 )
2024-07-31 18:50:28 -07:00
xuyi
1d2e7fb73f
[Model] Pipeline parallel support for Qwen2 ( #6924 )
2024-07-31 18:49:51 -07:00
Jee Jee Li
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton ( #5036 )
2024-07-31 17:12:24 -07:00
Simon Mo
7eb0cb4a14
Revert "[Frontend] Factor out code for running uvicorn" ( #7012 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-07-31 16:34:26 -07:00
Michael Goin
a0dce9383a
[Misc] Add compressed-tensors to optimized quant list ( #7006 )
2024-07-31 14:40:44 -07:00
Varun Sundar Rabindranath
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) ( #6996 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-31 14:40:32 -07:00
Varun Sundar Rabindranath
93548eb37e
[Kernel] Enable FP8 Cutlass for Ada Lovelace ( #6950 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-31 14:40:22 -07:00
Michael Goin
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization ( #6960 )
2024-07-31 12:47:46 -07:00
Cody Yu
bd70013407
[MISC] Introduce pipeline parallelism partition strategies ( #6920 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-07-31 12:02:17 -07:00
Avshalom Manevich
2ee8d3ba55
[Model] use FusedMoE layer in Jamba ( #6935 )
2024-07-31 12:00:24 -07:00
Cyrus Leung
daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT ( #6982 )
2024-07-31 23:46:17 +08:00
Alphi
2f4e108f75
[Bugfix] Clean up MiniCPM-V ( #6939 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-31 14:39:19 +00:00
HandH1998
6512937de1
Support W4A8 quantization for vllm ( #5218 )
2024-07-31 07:55:21 -06:00
Fei
c0644cf9ce
[Bugfix] fix logit processor excceed vocab size issue ( #6927 )
2024-07-31 16:16:01 +08:00
Woosuk Kwon
533d1932d2
[Bugfix][TPU] Set readonly=True for non-root devices ( #6980 )
2024-07-31 00:19:28 -07:00
Cyrus Leung
9f0e69b653
[CI/Build] Fix mypy errors ( #6968 )
2024-07-30 19:49:48 -07:00
Cyrus Leung
f230cc2ca6
[Bugfix] Fix broadcasting logic for multi_modal_kwargs ( #6836 )
2024-07-31 10:38:45 +08:00
Cyrus Leung
da1f7cc12a
[mypy] Enable following imports for some directories ( #6681 )
2024-07-31 10:38:03 +08:00
Cade Daniel
c32ab8be1a
[Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding ( #6964 )
2024-07-31 00:53:21 +00:00
Cade Daniel
fb4f530bf5
[CI] [nightly benchmark] Do not re-download sharegpt dataset if exists ( #6706 )
2024-07-30 16:28:49 -07:00
Cade Daniel
79319cedfa
[Nightly benchmarking suite] Remove pkill python from run benchmark suite ( #6965 )
2024-07-30 16:28:05 -07:00
Simon Mo
40c27a7cbb
[Build] Temporarily Disable Kernels and LoRA tests ( #6961 )
2024-07-30 14:59:48 -07:00
youkaichao
6ca8031e71
[core][misc] improve free_finished_seq_groups ( #6865 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-30 14:32:12 -07:00
Tyler Michael Smith
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun ( #6842 )
2024-07-30 16:37:01 -04:00
Sanger Steel
052b6f8ca4
[Bugfix] Fix tensorizer memory profiling bug during testing ( #6881 )
2024-07-30 11:48:50 -07:00
Ilya Lavrenov
5895b24677
[OpenVINO] Updated OpenVINO requirements and build docs ( #6948 )
2024-07-30 11:33:01 -07:00
Tyler Michael Smith
cbbc904470
[Kernel] Squash a few more warnings ( #6914 )
2024-07-30 13:50:42 -04:00
Nick Hill
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel ( #6698 )
2024-07-30 10:40:08 -07:00
fzyzcjy
f058403683
[Doc] Super tiny fix doc typo ( #6949 )
2024-07-30 09:14:03 -07:00
Roger Wang
c66c7f86ac
[Bugfix] Fix PaliGemma MMP ( #6930 )
2024-07-30 02:20:57 -07:00
Woosuk Kwon
6e063ea35b
[TPU] Fix greedy decoding ( #6933 )
2024-07-30 02:06:29 -07:00
Varun Sundar Rabindranath
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace ( #6848 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-29 20:24:58 -06:00
Tyler Michael Smith
61a97c32f6
[Kernel] Fix marlin divide-by-zero warnings ( #6904 )
2024-07-30 01:26:07 +00:00
Kevin H. Luu
4fbf4aa128
[ci] GHA workflow to remove ready label upon "/notready" comment ( #6921 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-29 17:03:45 -07:00
Tyler Michael Smith
aae6d36f7e
[Kernel] Remove unused variables in awq/gemm_kernels.cu ( #6908 )
2024-07-29 18:01:17 -06:00
Nick Hill
9f69d8245a
[Frontend] New allowed_token_ids decoding request parameter ( #6753 )
2024-07-29 23:37:27 +00:00
Thomas Parnell
9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. ( #6786 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-29 14:51:27 -07:00
Earthwalker
7f8d612d24
[TPU] Support tensor parallelism in async llm engine ( #6891 )
2024-07-29 12:42:21 -07:00
Tyler Michael Smith
60d1c6e584
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel ( #6901 )
2024-07-29 09:59:02 -07:00
Peng Guanwen
db9e5708a9
[Core] Reduce unnecessary compute when logprobs=None ( #6532 )
2024-07-29 16:47:31 +00:00
Varun Sundar Rabindranath
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace ( #6677 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-29 09:42:35 -06:00
Isotr0py
7cbd9ec7a9
[Model] Initialize support for InternVL2 series models ( #6514 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-29 10:16:30 +00:00
Elsa Granger
3eeb148f46
[Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 ( #6871 )
2024-07-28 11:13:49 -04:00
Michael Goin
b1366a9534
Add Nemotron to PP_SUPPORTED_MODELS ( #6863 )
2024-07-27 15:05:17 -07:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
Woosuk Kwon
fad5576c58
[TPU] Reduce compilation time & Upgrade PyTorch XLA version ( #6856 )
2024-07-27 10:28:33 -07:00
Chenggang Wu
f954d0715c
[Docs] Add RunLLM chat widget ( #6857 )
2024-07-27 09:24:46 -07:00
Cyrus Leung
1ad86acf17
[Model] Initial support for BLIP-2 ( #5920 )
...
Co-authored-by: ywang96 <ywang@roblox.com >
2024-07-27 11:53:07 +00:00
Roger Wang
ecb33a28cb
[CI/Build][Doc] Update CI and Doc for VLM example changes ( #6860 )
2024-07-27 09:54:14 +00:00
Wang Ran (汪然)
a57d75821c
[bugfix] make args.stream work ( #6831 )
2024-07-27 09:07:02 +00:00
Roger Wang
925de97e05
[Bugfix] Fix VLM example typo ( #6859 )
2024-07-27 14:24:08 +08:00
Roger Wang
aa46953a20
[Misc][VLM][Doc] Consolidate offline examples for vision language models ( #6858 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-07-26 22:44:13 -07:00
Travis Johnson
593e79e733
[Bugfix] torch.set_num_threads() in multiproc_gpu_executor ( #6802 )
...
[Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor (#6802 )
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-26 22:15:20 -07:00
Harry Mellor
c53041ae3b
[Doc] Add missing mock import to docs conf.py ( #6834 )
2024-07-27 04:47:33 +00:00
Woosuk Kwon
52f07e3dec
[Hardware][TPU] Implement tensor parallelism with Ray ( #5871 )
2024-07-26 20:54:27 -07:00
Joe
14dbd5a767
[Model] H2O Danube3-4b ( #6451 )
2024-07-26 20:47:50 -07:00
tomeras91
ed94e4f427
[Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba ( #6784 )
2024-07-26 20:45:31 -07:00
omrishiv
3c3012398e
[Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron ( #6844 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-07-26 20:20:16 -07:00
Woosuk Kwon
ced36cd89b
[ROCm] Upgrade PyTorch nightly version ( #6845 )
2024-07-26 20:16:13 -07:00
Sanger Steel
969d032265
[Bugfix]: Fix Tensorizer test failures ( #6835 )
2024-07-26 20:02:25 -07:00
Lucas Wilkinson
55712941e5
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b ( #6852 )
2024-07-27 02:27:44 +00:00
Cyrus Leung
981b0d5673
[Frontend] Factor out code for running uvicorn ( #6828 )
2024-07-27 09:58:25 +08:00
Woosuk Kwon
d09b94ca58
[TPU] Support collective communications in XLA devices ( #6813 )
2024-07-27 01:45:57 +00:00
chenqianfzh
bb5494676f
enforce eager mode with bnb quantization temporarily ( #6846 )
2024-07-27 01:32:20 +00:00
Gurpreet Singh Dhami
b5f49ee55b
Update README.md ( #6847 )
2024-07-27 00:26:45 +00:00
Zhanghao Wu
150a1ffbfd
[Doc] Update SkyPilot doc for wrong indents and instructions for update service ( #4283 )
2024-07-26 14:39:10 -07:00
Michael Goin
281977bd6e
[Doc] Add Nemotron to supported model docs ( #6843 )
2024-07-26 17:32:44 -04:00
Li, Jiang
3bbb4936dc
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation ( #6125 )
2024-07-26 13:50:10 -07:00
Woosuk Kwon
aa4867791e
[Misc][TPU] Support TPU in initialize_ray_cluster ( #6812 )
2024-07-26 19:39:49 +00:00
Woosuk Kwon
71734f1bf2
[Build/CI][ROCm] Minor simplification to Dockerfile.rocm ( #6811 )
2024-07-26 12:28:32 -07:00
Tyler Michael Smith
50704f52c4
[Bugfix][Kernel] Promote another index to int64_t ( #6838 )
2024-07-26 18:41:04 +00:00
Michael Goin
07278c37dd
[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) ( #6611 )
2024-07-26 14:33:42 -04:00
youkaichao
85ad7e2d01
[doc][debugging] add known issues for hangs ( #6816 )
2024-07-25 21:48:05 -07:00
Peng Guanwen
89a84b0bb7
[Core] Use array to speedup padding ( #6779 )
2024-07-25 21:31:31 -07:00
Anthony Platanios
084a01fd35
[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. ( #6770 )
2024-07-25 21:25:35 -07:00
QQSong
062a1d0fab
Fix ReplicatedLinear weight loading ( #6793 )
2024-07-25 19:24:58 -07:00
Kevin H. Luu
2eb9f4ff26
[ci] Mark tensorizer as soft fail and separate from grouped test ( #6810 )
...
[ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check (#6810 )
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-25 18:08:33 -07:00
youkaichao
443c7cf4cf
[ci][distributed] fix flaky tests ( #6806 )
2024-07-25 17:44:09 -07:00
SangBin Cho
1adddb14bf
[Core] Fix ray forward_dag error mssg ( #6792 )
2024-07-25 16:53:25 -07:00
Woosuk Kwon
b7215de2c5
[Docs] Publish 5th meetup slides ( #6799 )
2024-07-25 16:47:55 -07:00
youkaichao
f3ff63c3f4
[doc][distributed] improve multinode serving doc ( #6804 )
2024-07-25 15:38:32 -07:00
Lucas Wilkinson
cd7edc4e87
[Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors ( #6798 )
2024-07-25 15:05:09 -07:00
Kuntai Du
6a1e25b151
[Doc] Add documentations for nightly benchmarks ( #6412 )
2024-07-25 11:57:16 -07:00
Tyler Michael Smith
95db75de64
[Bugfix] Add synchronize to prevent possible data race ( #6788 )
...
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-07-25 10:40:01 -07:00
Michael Goin
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints ( #6761 )
2024-07-25 09:46:15 -07:00
Robert Shaw
889da130e7
[ Misc ] fp8-marlin channelwise via compressed-tensors ( #6524 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-07-25 09:46:04 -07:00
Alphi
b75e314fff
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V ( #6787 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-25 09:42:49 -07:00
Chang Su
316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py ( #6755 )
2024-07-24 22:48:07 -07:00
Alexander Matveev
0310029a2f
[Bugfix] Fix awq_marlin and gptq_marlin flags ( #6745 )
2024-07-24 22:34:11 -07:00
Cody Yu
309aaef825
[Bugfix] Fix decode tokens w. CUDA graph ( #6757 )
2024-07-24 22:33:56 -07:00
Alphi
9e169a4c61
[Model] Adding support for MiniCPM-V ( #4087 )
2024-07-24 20:59:30 -07:00
Evan Z. Liu
5689e256ba
[Frontend] Represent tokens with identifiable strings ( #6626 )
2024-07-25 09:51:00 +08:00
youkaichao
740374d456
[core][distributed] fix zmq hang ( #6759 )
2024-07-24 17:37:12 -07:00
Hongxia Yang
d88c458f44
[Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users ( #6754 )
2024-07-24 14:32:57 -07:00
Michael Goin
421e218b37
[Bugfix] Bump transformers to 4.43.2 ( #6752 )
2024-07-24 13:22:16 -07:00
Antoni Baum
5448f67635
[Core] Tweaks to model runner/input builder developer APIs ( #6712 )
2024-07-24 12:17:12 -07:00
Antoni Baum
0e63494cf3
Add fp8 support to reshape_and_cache_flash ( #6667 )
2024-07-24 18:36:52 +00:00
Daniele
ee812580f7
[Frontend] split run_server into build_server and run_server ( #6740 )
2024-07-24 10:36:04 -07:00
Allen.Dou
40468b13fa
[Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. ( #6686 )
2024-07-24 08:58:42 -07:00
Nick Hill
2cf0df3381
[Bugfix] Fix speculative decode seeded test ( #6743 )
2024-07-24 08:58:31 -07:00
LF Marques
545146349c
Adding f-string to validation error which is missing ( #6748 )
2024-07-24 08:55:53 -07:00
liuyhwangyh
f4f8a9d892
[Bugfix]fix modelscope compatible issue ( #6730 )
2024-07-24 05:04:46 -07:00
Alexei-V-Ivanov-AMD
b570811706
[Build/CI] Update run-amd-test.sh. Enable Docker Hub login. ( #6711 )
2024-07-24 05:01:14 -07:00
Woosuk Kwon
ccc4a73257
[Docs][ROCm] Detailed instructions to build from source ( #6680 )
2024-07-24 01:07:23 -07:00
Roger Wang
0a740a11ba
[Bugfix] Fix token padding for chameleon ( #6724 )
2024-07-24 01:05:09 -07:00
Nick Hill
c882a7f5b3
[SpecDecoding] Update MLPSpeculator CI tests to use smaller model ( #6714 )
2024-07-24 07:34:22 +00:00
William Lin
5e8ca973eb
[Bugfix] fix flashinfer cudagraph capture for PP ( #6708 )
2024-07-24 01:49:44 +00:00
dongmao zhang
87525fab92
[bitsandbytes]: support read bnb pre-quantized model ( #5753 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-23 23:45:09 +00:00
Thomas Parnell
2f808e69ab
[Bugfix] StatLoggers: cache spec decode metrics when they get collected. ( #6645 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-23 23:05:05 +00:00
Michael Goin
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization ( #6702 )
2024-07-23 22:45:12 +00:00
youkaichao
72fc704803
[build] relax wheel size limit ( #6704 )
2024-07-23 14:03:49 -07:00
Roger Wang
1bedf210e3
Bump transformers version for Llama 3.1 hotfix and patch Chameleon ( #6690 )
2024-07-23 13:47:48 -07:00
Travis Johnson
507ef787d8
[Model] Pipeline Parallel Support for DeepSeek v2 ( #6519 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-23 12:22:09 -07:00
Yehoshua Cohen
58f53034ad
[Frontend] Add Usage data in each chunk for chat_serving. #6540 ( #6652 )
2024-07-23 11:41:55 -07:00
Michael Goin
0eb0757bef
[Misc] Add ignored layers for fp8 quantization ( #6657 )
2024-07-23 14:04:04 -04:00
Simon Mo
38c4b7e863
Bump version to 0.5.3.post1 ( #6696 )
2024-07-23 10:08:59 -07:00
Woosuk Kwon
a112a84aad
[BugFix] Fix RoPE error in Llama 3.1 ( #6693 )
2024-07-23 09:46:05 -07:00
Woosuk Kwon
461089a21a
[Bugfix] Fix a log error in chunked prefill ( #6694 )
2024-07-23 09:27:58 -07:00
youkaichao
71950af726
[doc][distributed] fix doc argument order ( #6691 )
2024-07-23 08:55:33 -07:00
Woosuk Kwon
cb1362a889
[Docs] Announce llama3.1 support ( #6688 )
2024-07-23 08:18:15 -07:00
Simon Mo
bb2fc08072
Bump version to v0.5.3 ( #6674 )
2024-07-23 00:00:08 -07:00
Simon Mo
3eda4ec780
support ignore patterns in model loader ( #6673 )
2024-07-22 23:59:42 -07:00
Roger Wang
22fa2e35cb
[VLM][Model] Support image input for Chameleon ( #6633 )
2024-07-22 23:50:48 -07:00
youkaichao
c5201240a4
[misc] only tqdm for first rank ( #6672 )
2024-07-22 21:57:27 -07:00
Cyrus Leung
97234be0ec
[Misc] Manage HTTP connections in one place ( #6600 )
2024-07-22 21:32:02 -07:00
youkaichao
c051bfe4eb
[doc][distributed] doc for setting up multi-node environment ( #6529 )
...
[doc][distributed] add more doc for setting up multi-node environment (#6529 )
2024-07-22 21:22:09 -07:00
Michael Goin
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors ( #6528 )
2024-07-23 04:11:50 +00:00
zhaotyer
e519ae097a
add tqdm when loading checkpoint shards ( #6569 )
...
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-07-22 20:48:01 -07:00
youkaichao
7c2749a4fd
[misc] add start loading models for users information ( #6670 )
2024-07-22 20:08:02 -07:00
Woosuk Kwon
729171ae58
[Misc] Enable chunked prefill by default for long context models ( #6666 )
2024-07-22 20:03:13 -07:00
Cheng Li
c5e8330997
[Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization ( #6665 )
2024-07-22 19:25:05 -07:00
Cody Yu
e0c15758b8
[Core] Modulize prepare input and attention metadata builder ( #6596 )
2024-07-23 00:45:24 +00:00
Woosuk Kwon
bdf5fd1386
[Misc] Remove deprecation warning for beam search ( #6659 )
2024-07-23 00:21:58 +00:00
youkaichao
5a96ee52a3
[ci][build] add back vim in docker ( #6661 )
2024-07-22 16:26:29 -07:00
Jiaxin Shan
42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace ( #6234 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-22 15:42:40 -07:00
Kevin H. Luu
69d5ae38dc
[ci] Use different sccache bucket for CUDA 11.8 wheel build ( #6656 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-22 14:20:41 -07:00
Tyler Michael Smith
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels ( #6649 )
2024-07-22 14:08:30 -06:00
Cyrus Leung
739b61a348
[Frontend] Refactor prompt processing ( #4028 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-22 10:13:53 -07:00
Jae-Won Chung
89c1c6a196
[Bugfix] Fix vocab_size field access in llava_next.py ( #6624 )
2024-07-22 05:02:51 +00:00
Woosuk Kwon
42de2cefcb
[Misc] Add a wrapper for torch.inference_mode ( #6618 )
2024-07-21 18:43:11 -07:00
Roger Wang
c9eef37f32
[Model] Initial Support for Chameleon ( #5770 )
2024-07-21 17:37:51 -07:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
Isotr0py
25e778aa16
[Model] Refactor and decouple phi3v image embedding ( #6621 )
2024-07-21 16:07:58 -07:00
Woosuk Kwon
b6df37f943
[Misc] Remove abused noqa ( #6619 )
2024-07-21 23:47:04 +08:00
sroy745
14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. ( #6485 )
2024-07-20 23:58:58 -07:00
Cyrus Leung
d7f4178dd9
[Frontend] Move chat utils ( #6602 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-21 08:38:17 +08:00
Robert Shaw
082ecd80d5
[ Bugfix ] Fix AutoFP8 fp8 marlin ( #6609 )
2024-07-20 17:25:56 -06:00
Michael Goin
f952bbc8ff
[Misc] Fix input_scale typing in w8a8_utils.py ( #6579 )
2024-07-20 23:11:13 +00:00
Robert Shaw
9364f74eee
[ Kernel ] Enable fp8-marlin for fbgemm-fp8 models ( #6606 )
2024-07-20 18:50:10 +00:00
Matt Wong
06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes ( #6543 )
2024-07-20 09:39:07 -07:00
Robert Shaw
683e3cb9c4
[ Misc ] fbgemm checkpoints ( #6559 )
2024-07-20 09:36:57 -07:00
Cyrus Leung
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors ( #6541 )
2024-07-20 04:17:24 +00:00
Travis Johnson
3f8d42c81f
Pipeline Parallel: Guard for KeyErrors at request abort ( #6587 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-19 19:18:19 -07:00
Antoni Baum
7bd82002ae
[Core] Allow specifying custom Executor ( #6557 )
2024-07-20 01:25:06 +00:00
Varun Sundar Rabindranath
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com >
2024-07-19 18:15:26 -07:00
youkaichao
e81522e879
[build] add ib in image for out-of-the-box infiniband support ( #6599 )
...
[build] add ib so that multi-node support with infiniband can be supported out-of-the-box (#6599 )
2024-07-19 17:16:57 -07:00
Murali Andoorveedu
45ceb85a0c
[Docs] Update PP docs ( #6598 )
2024-07-19 16:38:21 -07:00
Robert Shaw
4cc24f01b1
[ Kernel ] Enable Dynamic Per Token fp8 ( #6547 )
2024-07-19 23:08:15 +00:00
youkaichao
07eb6f19f3
[bugfix][distributed] fix multi-node bug for shared memory ( #6597 )
2024-07-19 15:34:34 -07:00
Thomas Parnell
f0bbfaf917
[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection ( #6578 )
2024-07-19 14:01:03 -07:00
Simon Mo
30efe41532
[Docs] Update docs for wheel location ( #6580 )
2024-07-19 12:14:11 -07:00
Antoni Baum
9ed82e7074
[Misc] Small perf improvements ( #6520 )
2024-07-19 12:10:56 -07:00
Daniele
51f8aa90ad
[Bugfix][Frontend] remove duplicate init logger ( #6581 )
2024-07-19 10:16:27 -07:00
Thomas Parnell
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply ( #6327 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-19 07:15:22 -06:00
Woo-Yeon Lee
a921e86392
[BUGFIX] Raise an error for no draft token case when draft_tp>1 ( #6369 )
2024-07-19 06:01:09 -07:00
Cyrus Leung
6366efc67b
[Bugfix][Frontend] Fix missing /metrics endpoint ( #6463 )
2024-07-19 03:55:13 +00:00
Robert Shaw
dbe5588554
[ Misc ] non-uniform quantization via compressed-tensors for Llama ( #6515 )
2024-07-18 22:39:18 -04:00
Thomas Parnell
d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. ( #6034 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-07-18 19:22:08 -07:00
Nick Hill
b5672a112c
[Core] Multiprocessing Pipeline Parallel support ( #6130 )
...
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-18 19:15:52 -07:00
Simon Mo
c5df56f88b
Add support for a rope extension method ( #6553 )
2024-07-19 01:53:03 +00:00
Tyler Michael Smith
1689219ebf
[CI/Build] Build on Ubuntu 20.04 instead of 22.04 ( #6517 )
2024-07-18 17:29:25 -07:00
Tyler Michael Smith
4ffffccb7e
[Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm ( #6552 )
2024-07-18 23:52:22 +00:00
youkaichao
f53b8f0d05
[ci][test] add correctness test for cpu offloading ( #6549 )
2024-07-18 23:41:06 +00:00
Kevin H. Luu
2d4733ba2d
Fix PR comment bot ( #6554 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-18 14:48:29 -07:00
Michael Goin
15c6a079b1
[Model] Support Mistral-Nemo ( #6548 )
2024-07-18 20:31:50 +00:00
Kevin H. Luu
ecdb462c24
[ci] Reword Github bot comment ( #6534 )
2024-07-18 08:01:45 -07:00
Robert Shaw
58ca663224
[ Misc ] Improve Min Capability Checking in compressed-tensors ( #6522 )
2024-07-18 14:39:12 +00:00
Woosuk Kwon
4634c8728b
[TPU] Refactor TPU worker & model runner ( #6506 )
2024-07-18 01:34:16 -07:00
Noam Gat
c8a7d51c49
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash ( #6501 )
2024-07-18 07:47:13 +00:00
Nick Hill
e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs ( #6227 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-18 15:13:30 +08:00
Cody Yu
8a74c68bd1
[Misc] Minor patch for draft model runner ( #6523 )
2024-07-18 06:06:21 +00:00
Rui Qiao
61e592747c
[Core] Introduce SPMD worker execution using Ray accelerated DAG ( #6032 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu >
2024-07-17 22:27:09 -07:00
Nick Hill
d25877dd9b
[BugFix] Avoid secondary error in ShmRingBuffer destructor ( #6530 )
2024-07-17 22:24:43 -07:00
youkaichao
1c27d25fb5
[core][model] yet another cpu offload implementation ( #6496 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-17 20:54:35 -07:00
Robert Shaw
18fecc3559
[ Kernel ] Fp8 Channelwise Weight Support ( #6487 )
2024-07-18 03:18:13 +00:00
Cody Yu
b5af8c223c
[Model] Pipeline parallel support for Mixtral ( #6516 )
2024-07-17 19:26:04 -07:00
Varun Sundar Rabindranath
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-18 01:38:35 +00:00
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step ( #6338 )
2024-07-17 14:30:28 -07:00
Antoni Baum
5f0b9933e6
[Bugfix] Fix Ray Metrics API usage ( #6354 )
2024-07-17 19:40:10 +00:00
milo157
a38524f338
[DOC] - Add docker image to Cerebrium Integration ( #6510 )
2024-07-17 10:22:53 -07:00
Cody Yu
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 ( #6164 )
2024-07-17 09:37:16 -07:00
Woosuk Kwon
a9a2e74d21
[Misc] Use torch.Tensor for type annotation ( #6505 )
2024-07-17 13:01:10 +00:00
Woosuk Kwon
e09ce759aa
[TPU] Remove multi-modal args in TPU backend ( #6504 )
2024-07-17 04:02:53 -07:00
Murali Andoorveedu
5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP ( #6495 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-17 08:25:10 +00:00
Cyrus Leung
5bf35a91e4
[Doc][CI/Build] Update docs and tests to use vllm serve ( #6431 )
2024-07-17 07:43:21 +00:00
shangmingc
a19e8d3726
[Misc][Speculative decoding] Typos and typing fixes ( #6467 )
...
Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com >
2024-07-17 07:17:07 +00:00
Hongxia Yang
10383887e0
[ROCm] Cleanup Dockerfile and remove outdated patch ( #6482 )
2024-07-16 22:47:02 -07:00
Wushi Dong
1d094fd7c0
[Distributed][PP] only create embedding & lm head when necessary ( #6455 )
...
original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization
2024-07-16 19:20:26 -07:00
youkaichao
ce37be7ba0
[misc][distributed] add seed to dummy weights ( #6491 )
2024-07-16 19:16:34 -07:00
youkaichao
7f62077af5
[misc][distributed] improve tests ( #6488 )
2024-07-16 17:35:52 -07:00
youkaichao
09c2eb85dd
[ci][distributed] add pipeline parallel correctness test ( #6410 )
2024-07-16 15:44:22 -07:00
Michael Goin
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
Cody Yu
160e1d8c99
[Misc] Log spec decode metrics ( #6454 )
2024-07-16 20:37:10 +00:00
Jiaxin Shan
94162beb9f
[Doc] Fix the lora adapter path in server startup script ( #6230 )
2024-07-16 10:11:04 -07:00
Woosuk Kwon
c467dff24f
[Hardware][TPU] Support MoE with Pallas GMM kernel ( #6457 )
2024-07-16 09:56:28 -07:00
youkaichao
9f4ccec761
[doc][misc] remind to cancel debugging environment variables ( #6481 )
...
[doc][misc] remind users to cancel debugging environment variables after debugging (#6481 )
2024-07-16 09:45:30 -07:00
Cyrus Leung
38ef94888a
[CI/Build] Remove "boardwalk" image asset ( #6460 )
2024-07-16 08:59:36 -07:00
Peng Guanwen
2bb0489cb3
[Core] Use numpy to speed up padded token processing ( #6442 )
2024-07-16 08:13:25 -07:00
Thomas Parnell
7508a3dc34
[Misc] Fix typos in spec. decode metrics logging. ( #6470 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-16 13:55:15 +00:00
sasha0552
7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint ( #5923 )
2024-07-16 20:18:09 +08:00
Cyrus Leung
d97011512e
[CI/Build] vLLM cache directory for images ( #6444 )
2024-07-15 23:12:25 -07:00
Woosuk Kwon
37d776606f
[Docs] Announce 5th meetup ( #6458 )
2024-07-15 21:04:58 -07:00
Joe
d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests ( #6419 )
2024-07-15 18:54:15 -07:00
Mor Zusman
9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug ( #6425 )
...
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-07-16 01:32:55 +00:00
Kevin H. Luu
d6f3b3d5c4
Pin sphinx-argparse version ( #6453 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-16 01:26:11 +00:00
Woosuk Kwon
4552e37b55
[CI/Build][TPU] Add TPU CI test ( #6277 )
...
Co-authored-by: kevin <kevin@anyscale.com >
2024-07-15 14:31:16 -07:00
Woosuk Kwon
ec9933f4a5
[Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod ( #6289 )
2024-07-15 19:02:14 +00:00
Woosuk Kwon
3dee97b05f
[Docs] Add Google Cloud to sponsor list ( #6450 )
2024-07-15 11:58:10 -07:00
youkaichao
4cf256ae7f
[misc][distributed] fix pp missing layer condition ( #6446 )
2024-07-15 10:32:35 -07:00
Simon Mo
64fdc08c72
bump version to v0.5.2 ( #6433 )
2024-07-15 17:27:40 +00:00
Thomas Parnell
4ef95b0f06
[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF ( #6409 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-15 13:14:49 -04:00
Thomas Parnell
eaec4b9153
[Bugfix] Add custom Triton cache manager to resolve MoE MP issue ( #6140 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com >
2024-07-15 10:12:47 -07:00
Pernekhan Utemuratov
a63a4c6341
[Misc] Use 0.0.9 version for flashinfer ( #6447 )
...
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
2024-07-15 10:10:26 -07:00
Tyler Michael Smith
c8fd97f26d
[Kernel] Use CUTLASS kernels for the FP8 layers with Bias ( #6270 )
2024-07-15 13:05:52 -04:00
youkaichao
94b82e8c18
[doc][distributed] add suggestion for distributed inference ( #6418 )
2024-07-15 09:45:51 -07:00
Roger Wang
6ae1597ddf
[VLM] Minor space optimization for ClipVisionModel ( #6436 )
2024-07-15 17:29:51 +08:00
youkaichao
22e79ee8f3
[doc][misc] doc update ( #6439 )
2024-07-14 23:33:25 -07:00
Cyrus Leung
de19916314
[Bugfix] Convert image to RGB by default ( #6430 )
2024-07-15 05:39:15 +00:00
youkaichao
69672f116c
[core][distributed] simplify code to support pipeline parallel ( #6406 )
2024-07-14 21:20:51 -07:00
DefTruth
44874a0bf9
[Doc] add env docs for flashinfer backend ( #6437 )
2024-07-14 21:16:51 -07:00
zifeitong
b47008b4d2
[BugFix] BatchResponseData body should be optional ( #6345 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-15 04:06:09 +00:00
Simon Mo
9bfece89fd
Add FUNDING.yml ( #6435 )
2024-07-14 20:36:16 -07:00
Simon Mo
32c9d7f765
Report usage for beam search ( #6404 )
2024-07-14 19:37:35 -07:00
Fish
ccb20db8bd
[Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' ( #6428 )
2024-07-14 19:27:01 -07:00
Robert Shaw
a754dc2cb9
[CI/Build] Cross python wheel ( #6394 )
2024-07-14 18:54:46 -07:00
Robert Cohn
61e85dbad8
[Doc] xpu backend requires running setvars.sh ( #6393 )
2024-07-14 17:10:11 -07:00
Ethan Xu
dbfe254eda
[Feature] vLLM CLI ( #5090 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-07-14 15:36:43 -07:00
Robert Shaw
73030b7dae
[ Misc ] Enable Quantizing All Layers of DeekSeekv2 ( #6423 )
2024-07-14 21:38:42 +00:00
youkaichao
ccd3c04571
[ci][build] fix commit id ( #6420 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-07-14 22:16:21 +08:00
Tyler Michael Smith
9dad5cc859
[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace ( #6384 )
2024-07-14 13:37:19 +00:00
Yuan Tang
6ef3bf912c
Remove unnecessary trailing period in spec_decode.rst ( #6405 )
2024-07-14 07:58:09 +00:00
Isotr0py
540c0368b1
[Model] Initialize Fuyu-8B support ( #3924 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-14 05:27:14 +00:00
Robert Shaw
fb6af8bc08
[ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 ( #6417 )
2024-07-13 20:03:58 -07:00
Woosuk Kwon
eeceadaecc
[Misc] Add deprecation warning for beam search ( #6402 )
2024-07-13 11:52:22 -07:00
Robert Shaw
babf52dade
[ Misc ] More Cleanup of Marlin ( #6359 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-07-13 10:21:37 +00:00
Noam Gat
9da4aad44b
Updating LM Format Enforcer version to v10.3 ( #6411 )
2024-07-13 10:09:12 +00:00
youkaichao
41708e5034
[ci] try to add multi-node tests ( #6280 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-12 21:51:48 -07:00
Woosuk Kwon
d80aef3776
[Docs] Clean up latest news ( #6401 )
2024-07-12 19:36:53 -07:00
Thomas Parnell
e1684a766a
[Bugfix] Fix hard-coded value of x in context_attention_fwd ( #6373 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-12 18:30:54 -07:00
Saliya Ekanayake
a27f87da34
[Doc] Fix Typo in Doc ( #6392 )
...
Co-authored-by: Saliya Ekanayake <esaliya@d-matrix.ai >
2024-07-13 00:48:23 +00:00
Kevin H. Luu
16ff6bd58c
[ci] Fix wording for GH bot ( #6398 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 16:34:37 -07:00
Woosuk Kwon
f8f9ff57ee
[Bugfix][TPU] Fix megacore setting for v5e-litepod ( #6397 )
2024-07-12 15:59:47 -07:00
Simon Mo
6bc9710f6e
Fix release pipeline's dir permission ( #6391 )
2024-07-12 15:52:43 -07:00
Michael Goin
111fc6e7ec
[Misc] Add generated git commit hash as vllm.__commit__ ( #6386 )
2024-07-12 22:52:15 +00:00
Cody Yu
75f64d8b94
[Bugfix] Fix illegal memory access in FP8 MoE kernel ( #6382 )
2024-07-12 21:33:33 +00:00
Simon Mo
21b2dcedab
Fix release pipeline's -e flag ( #6390 )
2024-07-12 14:08:04 -07:00
Simon Mo
07b35af86d
Fix interpolation in release pipeline ( #6389 )
2024-07-12 14:03:39 -07:00
Simon Mo
bb1a784b05
Fix release-pipeline.yaml ( #6388 )
2024-07-12 14:00:57 -07:00
Simon Mo
d719ba24c5
Build some nightly wheels by default ( #6380 )
2024-07-12 13:56:59 -07:00
Cody Yu
aa48e502fb
[MISC] Upgrade dependency to PyTorch 2.3.1 ( #5327 )
2024-07-12 12:04:26 -07:00
Kevin H. Luu
4dbebd03cc
[ci] Add GHA workflows to enable full CI run ( #6381 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 11:36:26 -07:00
Kevin H. Luu
b75bce1008
[ci] Add grouped tests & mark tests to run by default for fastcheck pipeline ( #6365 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 09:58:38 -07:00
Yihuan Bu
b039cbbce3
[Misc] add fixture to guided processor tests ( #6341 )
2024-07-12 09:55:39 -07:00
Alexei-V-Ivanov-AMD
f9d25c2519
[Build/CI] Checking/Waiting for the GPU's clean state ( #6379 )
2024-07-12 09:42:24 -07:00
Cyrus Leung
024ad87cdc
[Bugfix] Fix dtype mismatch in PaliGemma ( #6367 )
2024-07-12 08:22:18 -07:00
Robert Shaw
aea19f0989
[ Misc ] Support Models With Bias in compressed-tensors integration ( #6356 )
2024-07-12 11:11:29 -04:00
Roger Wang
f7160d946a
[Misc][Bugfix] Update transformers for tokenizer issue ( #6364 )
2024-07-12 08:40:07 +00:00
Robert Shaw
6047187cd8
[ Misc ] Remove separate bias add ( #6353 )
2024-07-12 05:06:09 +00:00
Hongxia Yang
b6c16cf8ff
[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm ( #6352 )
2024-07-11 21:30:46 -07:00
adityagoel14
d26a8b3f1f
[CI/Build] (2/2) Switching AMD CI to store images in Docker Hub ( #6350 )
2024-07-11 21:26:26 -07:00
Michael Goin
d59eb98489
[Model][Phi3-Small] Remove scipy from blocksparse_attention ( #6343 )
2024-07-12 10:47:17 +08:00
Helena Kloosterman
adf32e0a0f
[Bugfix] Fix usage stats logging exception warning with OpenVINO ( #6349 )
2024-07-12 10:47:00 +08:00
youkaichao
2b0fb53481
[distributed][misc] be consistent with pytorch for libcudart.so ( #6346 )
...
[distributed][misc] keep consistent with how pytorch finds libcudart.so (#6346 )
2024-07-11 19:35:17 -07:00
Lily Liu
d6ab528997
[Misc] Remove flashinfer warning, add flashinfer tests to CI ( #6351 )
2024-07-12 01:32:06 +00:00
Robert Shaw
7ed6a4f0e1
[ BugFix ] Prompt Logprobs Detokenization ( #6223 )
...
Co-authored-by: Zifei Tong <zifeitong@gmail.com >
2024-07-11 22:02:29 +00:00
Kuntai Du
a4feba929b
[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy ( #5362 )
2024-07-11 13:28:38 -07:00
youkaichao
2d23b42d92
[doc] update pipeline parallel in readme ( #6347 )
2024-07-11 11:38:40 -07:00
xwjiang2010
1df43de9bb
[bug fix] Fix llava next feature size calculation. ( #6339 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-07-11 17:21:10 +00:00
Simon Mo
52b7fcb35a
Benchmark: add H100 suite ( #6047 )
2024-07-11 09:17:07 -07:00
Robert Shaw
b675069d74
[ Misc ] Refactor Marlin Python Utilities ( #6082 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-07-11 15:40:11 +00:00
Mor Zusman
55f692b46e
[BugFix] get_and_reset only when scheduler outputs are not empty ( #6266 )
2024-07-11 07:40:20 -07:00
Thomas Parnell
8a1415cf77
[Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. ( #6326 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-11 07:05:59 -07:00
pushan
546b101fa0
[BugFix]: fix engine timeout due to request abort ( #6255 )
...
Signed-off-by: yatta zhang <ytzhang01@foxmail.com >
Signed-off-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com >
Co-authored-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com >
2024-07-11 06:46:31 -07:00
aniaan
3963a5335b
[Misc] refactor(config): clean up unused code ( #6320 )
2024-07-11 09:39:07 +00:00
Roger Wang
c4774eb841
[Bugfix] Fix snapshot download in serving benchmark ( #6318 )
2024-07-11 07:04:05 +00:00
Lim Xiang Yang
fc17110bbe
[BugFix]: set outlines pkg version ( #6262 )
2024-07-11 04:37:11 +00:00
Jie Fu (傅杰)
439c84581a
[Doc] Update description of vLLM support for CPUs ( #6003 )
2024-07-10 21:15:29 -07:00
daquexian
99ded1e1c4
[Doc] Remove comments incorrectly copied from another project ( #6286 )
2024-07-10 17:05:26 -07:00
Woosuk Kwon
997df46a32
[Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor ( #6313 )
2024-07-10 16:39:02 -07:00
sroy745
ae151d73be
[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models ( #5765 )
2024-07-10 16:02:47 -07:00
sangjune.park
44cc76610d
[Bugfix] Fix OpenVINOExecutor abstractmethod error ( #6296 )
...
Signed-off-by: sangjune.park <sangjune.park@navercorp.com >
2024-07-10 10:03:32 -07:00
Benjamin Muskalla
b422d4961a
[CI/Build] Enable mypy typing for remaining folders ( #6268 )
2024-07-10 22:15:55 +08:00
Thomas Parnell
c38eba3046
[Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. ( #6303 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-10 09:04:07 -04:00
Woosuk Kwon
e72ae80b06
[Bugfix] Support 2D input shape in MoE layer ( #6287 )
2024-07-10 09:03:16 -04:00
Cyrus Leung
8a924d2248
[Doc] Guide for adding multi-modal plugins ( #6205 )
2024-07-10 14:55:34 +08:00
Woosuk Kwon
5ed3505d82
[Bugfix][TPU] Add prompt adapter methods to TPUExecutor ( #6279 )
2024-07-09 19:30:56 -07:00
youkaichao
da78caecfa
[core][distributed] zmq fallback for broadcasting large objects ( #6183 )
...
[core][distributed] add zmq fallback for broadcasting large objects (#6183 )
2024-07-09 18:49:11 -07:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer ( #4978 )
2024-07-09 18:34:02 -07:00
Baoyuan Qi
d3a245138a
[Bugfix]fix and needs_scalar_to_array logic check ( #6238 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-07-09 23:43:24 +00:00
Murali Andoorveedu
673dd4cae9
[Docs] Docs update for Pipeline Parallel ( #6222 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-09 16:24:58 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts ( #4645 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com >
Co-authored-by: Joe G <joseph.granados@h2o.ai >
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-09 13:26:36 -07:00
Kevin H. Luu
a0550cbc80
Add support for multi-node on CI ( #5955 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-09 12:56:56 -07:00
Woosuk Kwon
08c5bdecae
[Bugfix][TPU] Fix outlines installation in TPU Dockerfile ( #6256 )
2024-07-09 02:56:06 -07:00
Woosuk Kwon
5d5b4c5fe5
[Bugfix][TPU] Add missing None to model input ( #6245 )
2024-07-09 00:21:37 -07:00
youkaichao
70c232f85a
[core][distributed] fix ray worker rank assignment ( #6235 )
2024-07-08 21:31:44 -07:00
youkaichao
a3c9435d93
[hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability ( #6216 )
2024-07-08 20:02:15 -07:00
Simon Mo
4f0e0ea131
Add FlashInfer to default Dockerfile ( #6172 )
2024-07-08 13:38:03 -07:00
tomeras91
ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding ( #6214 )
2024-07-08 11:25:51 -07:00
Eric
185ad31f37
[Bugfix] use diskcache in outlines _get_guide #5436 ( #6203 )
2024-07-08 11:23:24 -07:00
afeldman-nm
543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) ( #4888 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-08 17:12:15 +00:00
Avshalom Manevich
f7a8fa39d8
[Kernel] reloading fused_moe config on the last chunk ( #6210 )
2024-07-08 08:00:38 -07:00
Haichuan
717f4bcea0
Feature/add benchmark testing ( #5947 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-08 07:52:06 +00:00
kczimm
16620f439d
do not exclude object field in CompletionStreamResponse ( #6196 )
2024-07-08 10:32:57 +08:00
youkaichao
3b08fe2b13
[misc][frontend] log all available endpoints ( #6195 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-07-07 15:11:12 -07:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Haichuan
333306a252
add benchmark for fix length input and output ( #5857 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-07 07:42:13 +00:00
Roger Wang
6206dcb29e
[Model] Add PaliGemma ( #5189 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-07 09:25:50 +08:00
Cyrus Leung
9389380015
[Doc] Move guide for multimodal model and other improvements ( #6168 )
2024-07-06 17:18:59 +08:00
Roger Wang
175c43eca4
[Doc] Reorganize Supported Models by Type ( #6167 )
2024-07-06 05:59:36 +00:00
Simon Mo
bc96d5c330
Move release wheel env var to Dockerfile instead ( #6163 )
2024-07-05 17:19:53 -07:00
Simon Mo
f0250620dd
Fix release wheel build env var ( #6162 )
2024-07-05 16:24:31 -07:00
Simon Mo
2de490d60f
Update wheel builds to strip debug ( #6161 )
2024-07-05 14:51:25 -07:00
Simon Mo
79d406e918
[Docs] Fix readthedocs for tag build ( #6158 )
2024-07-05 12:44:40 -07:00
Simon Mo
abad5746a7
bump version to v0.5.1 ( #6157 )
2024-07-05 12:04:51 -07:00
JGSweets
e58294ddf2
[Bugfix] Add verbose error if scipy is missing for blocksparse attention ( #5695 )
2024-07-05 10:41:01 -07:00
jvlunteren
f1e15da6fe
[Frontend] Continuous usage stats in OpenAI completion API ( #5742 )
2024-07-05 10:37:09 -07:00
Christian Rohmann
0097bb1829
[Bugfix] Use templated datasource in grafana.json to allow automatic imports ( #6136 )
...
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de >
2024-07-05 09:49:47 -07:00
Cyrus Leung
ea4b570483
[VLM] Cleanup validation and update docs ( #6149 )
2024-07-05 05:49:38 +00:00
Roger Wang
a41357e941
[VLM] Improve consistency between feature size calculation and dummy data for profiling ( #6146 )
2024-07-05 09:29:47 +08:00
Cyrus Leung
ae96ef8fbd
[VLM] Calculate maximum number of multi-modal tokens by model ( #6121 )
2024-07-04 16:37:23 -07:00
Lily Liu
69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer ( #6051 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-04 16:35:51 -07:00
Yuan
81d7a50f24
[Hardware][Intel CPU] Adding intel openmp tunings in Docker file ( #6008 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-07-04 15:22:12 -07:00
youkaichao
27902d42be
[misc][doc] try to add warning for latest html ( #5979 )
2024-07-04 09:57:09 -07:00
Gregory Shtrasberg
56b325e977
[ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention ( #6043 )
...
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
2024-07-03 22:19:38 -07:00
Cyrus Leung
3dd507083f
[CI/Build] Cleanup VLM tests ( #6107 )
2024-07-03 18:58:18 -07:00
Murali Andoorveedu
0ed646b7aa
[Distributed][Core] Support Py39 and Py38 for PP ( #6120 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-03 17:52:29 -07:00
Travis Johnson
1dab9bc8a9
[Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing ( #6109 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-07-03 16:56:59 -07:00
youkaichao
3de6e6a30e
[core][distributed] support n layers % pp size != 0 ( #6115 )
2024-07-03 16:40:31 -07:00
youkaichao
966fe72141
[doc][misc] bump up py version in installation doc ( #6119 )
2024-07-03 15:52:04 -07:00
Robert Shaw
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 ( #6113 )
2024-07-03 22:50:08 +00:00
xwjiang2010
d9e98f42e4
[vlm] Remove vision language config. ( #6089 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-03 22:14:16 +00:00
youkaichao
3c6325f0fc
[core][distributed] custom allreduce when pp size > 1 ( #6117 )
2024-07-03 14:41:32 -07:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
Roger Wang
7cd2ebb025
[Bugfix] Fix compute_logits in Jamba ( #6093 )
2024-07-03 00:32:35 -07:00
Roger Wang
f1c78138aa
[Doc] Fix Mock Import ( #6094 )
2024-07-03 00:13:56 -07:00
Roger Wang
3a86b54fb0
[VLM][Frontend] Proper Image Prompt Formatting from OpenAI API ( #6091 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-02 23:41:23 -07:00
youkaichao
f666207161
[misc][distributed] error on invalid state ( #6092 )
2024-07-02 23:37:29 -07:00
Nick Hill
d830656a97
[BugFix] Avoid unnecessary Ray import warnings ( #6079 )
2024-07-03 14:09:40 +08:00
SangBin Cho
d18bab3587
[CI] Fix base url doesn't strip "/" ( #6087 )
2024-07-02 21:31:25 -07:00
Cyrus Leung
9831aec49f
[Core] Dynamic image size support for VLMs ( #5276 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: ywang96 <ywang@roblox.com >
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-07-02 20:34:00 -07:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
Mor Zusman
9d6a8daa87
[Model] Jamba support ( #4115 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Erez Schwartz <erezs@ai21.com >
Co-authored-by: Mor Zusman <morz@ai21.com >
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: Tomer Asida <tomera@ai21.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 23:11:29 +00:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: ZX <zx@lbx.dev >
2024-07-02 22:25:17 +00:00
Robert Shaw
7c008c51a9
[ Misc ] Refactor MoE to isolate Fp8 From Mixtral ( #5970 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-02 21:54:35 +00:00
Robert Shaw
4d26d806e1
Update conftest.py ( #6076 )
2024-07-02 20:14:22 +00:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 10:58:08 -07:00
Sirej Dua
15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) ( #6050 )
...
Co-authored-by: Sirej Dua <sirej.dua@databricks.com >
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
Cyrus Leung
31354e563f
[Doc] Reinstate doc dependencies ( #6061 )
2024-07-02 10:53:16 +00:00
xwjiang2010
98d6682cd1
[VLM] Remove image_input_type from VLM config ( #5852 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-02 07:57:09 +00:00
danieljannai21
2c37540aa6
[Frontend] Add template related params to request ( #5709 )
2024-07-01 23:01:57 -07:00
Alexander Matveev
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) ( #5602 )
2024-07-01 20:10:37 -07:00
Thomas Parnell
54600709b6
[Model] Changes to MLPSpeculator to support tie_weights and input_scale ( #5965 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Joshua Rosenkranz <jmrosenk@us.ibm.com >
2024-07-01 16:40:02 -07:00
James Whedbee
e373853e12
[Frontend] Relax api url assertion for openai benchmarking ( #6046 )
2024-07-01 23:39:10 +00:00
Nick Hill
c87ebc3ef9
[BugFix] Ensure worker model loop is always stopped at the right time ( #5987 )
2024-07-01 16:17:58 -07:00
Antoni Baum
c4059ea54f
[Bugfix] Add explicit end_forward calls to flashinfer ( #6044 )
2024-07-01 23:08:58 +00:00
Roger Wang
8e0817c262
[Bugfix][Doc] Fix Doc Formatting ( #6048 )
2024-07-01 15:09:11 -07:00
ning.zhang
83bdcb6ac3
add FAQ doc under 'serving' ( #5946 )
2024-07-01 14:11:36 -07:00
Avshalom Manevich
12a59959ed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs ( #6029 )
2024-07-01 21:08:29 +00:00
Antoni Baum
dec6fc6f3b
[Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool ( #6039 )
2024-07-01 20:12:40 +00:00
youkaichao
8893130b63
[doc][misc] further lower visibility of simple api server ( #6041 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-01 10:50:56 -07:00
zhyncs
bb60326836
[Misc] update benchmark backend for scalellm ( #6018 )
2024-07-01 10:20:33 -07:00
youkaichao
4050d646e5
[doc][misc] remove deprecated api server in doc ( #6037 )
2024-07-01 12:52:43 -04:00
Robert Shaw
d76084c12f
[ CI ] Re-enable Large Model LM Eval ( #6031 )
2024-07-01 12:40:45 -04:00
sroy745
80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker ( #5348 )
2024-07-01 00:33:05 -07:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
Robert Shaw
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
Dipika Sikka
7836fdcc11
[Misc] Fix get_min_capability ( #5971 )
2024-06-30 20:15:16 +00:00
Robert Shaw
deacb7ec44
[ CI ] Temporarily Disable Large LM-Eval Tests ( #6005 )
...
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>
2024-06-30 11:56:56 -07:00
SangBin Cho
f5e73c9f1b
[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. ( #5909 )
...
Co-authored-by: sang <sangcho@anyscale.com >
2024-06-30 17:11:15 +00:00
llmpros
c6c240aa0a
[Frontend]: Support base64 embedding ( #5935 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-06-30 23:53:00 +08:00
youkaichao
2be6955a3f
[ci][distributed] fix device count call
...
[ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991 )
2024-06-30 08:06:13 +00:00
Cyrus Leung
9d47f64eb6
[CI/Build] [3/3] Reorganize entrypoints tests ( #5966 )
2024-06-30 12:58:49 +08:00
Cyrus Leung
cff6a1fec1
[CI/Build] Reuse code for checking output consistency ( #5988 )
2024-06-30 11:44:25 +08:00
Roger Wang
bcc6a09b63
[CI/Build] Temporarily Remove Phi3-Vision from TP Test ( #5989 )
2024-06-30 09:18:31 +08:00
Matt Wong
9def10664e
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests ( #5949 )
2024-06-29 12:47:58 -07:00
Robert Shaw
75aa1442db
[ CI/Build ] LM Eval Harness Based CI Testing ( #5838 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 13:04:30 -04:00
Cyrus Leung
99397da534
[CI/Build] Add TP test for vision models ( #5892 )
2024-06-29 15:45:54 +00:00
Robert Shaw
8dbfcd35bf
[ CI/Build ] Added E2E Test For Compressed Tensors ( #5839 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 21:12:58 +08:00
Cody Yu
f7dac83d95
[Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k ( #5939 )
2024-06-29 21:04:20 +08:00
Antoni Baum
7c01f70641
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum ( #5974 )
2024-06-29 12:47:53 +00:00
Cyrus Leung
51e971d39e
[Bugfix] Support eos_token_id from config.json ( #5954 )
2024-06-29 11:19:02 +00:00
Roger Wang
329df38f1a
[Misc] Update Phi-3-Vision Example ( #5981 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-06-29 14:34:29 +08:00
Woosuk Kwon
580353da93
[Bugfix] Fix precisions in Gemma 1 ( #5913 )
2024-06-29 03:10:21 +00:00
Joe Runde
ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b ( #5930 )
...
Signed-off-by: Joe Runde <joe@joerun.de >
2024-06-29 10:48:25 +08:00
William Lin
906a19cdb0
[Misc] Extend vLLM Metrics logging API ( #5925 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-29 10:36:06 +08:00
mcalman
c4bca740e8
[Bugfix] fix missing last itl in openai completions benchmark ( #5926 )
2024-06-29 10:34:42 +08:00
Woosuk Kwon
7f83f40dee
[Bugfix][TPU] Fix pad slot id ( #5977 )
2024-06-28 18:55:17 -07:00
Woosuk Kwon
54814fd85b
[Bugfix][TPU] Fix TPU sampler output ( #5978 )
2024-06-28 18:14:16 -07:00
Lily Liu
7041de4384
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode ( #4628 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >
Co-authored-by: bong-furiosa <bongwon.jang@furiosa.ai >
2024-06-28 15:28:49 -07:00
Robert Shaw
6a62cb82cc
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError ( #5963 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 17:46:30 -04:00
Tyler Michael Smith
5d2a1a9cf0
Unmark more files as executable ( #5962 )
2024-06-28 17:34:56 -04:00
Michael Goin
4bf35ed9ae
[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled ( #5936 )
2024-06-28 21:12:40 +00:00
wangding zeng
be0b3af9e0
Support Deepseek-V2 ( #4650 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
2024-06-28 13:24:57 -07:00
Robert Shaw
2cd402e169
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 ( #5921 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 18:43:49 +00:00
Robert Shaw
b185230744
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) ( #5928 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 13:49:57 -04:00
Tyler Michael Smith
6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues ( #5931 )
2024-06-28 17:10:34 +00:00
Cody Yu
b2c620230a
[Spec Decode] Introduce DraftModelRunner ( #5799 )
2024-06-28 09:17:51 -07:00
xwjiang2010
b90d8cd832
[Distributed] Make it clear that % should not be in tensor dict keys. ( #5927 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-06-28 15:20:22 +00:00
Cyrus Leung
3b752a6555
[CI/Build] [2/3] Reorganize entrypoints tests ( #5904 )
2024-06-28 07:59:18 -07:00
Thomas Parnell
ec1ad0046c
[Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high ( #5894 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-28 07:42:17 -07:00
Ilya Lavrenov
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend ( #5379 )
2024-06-28 13:50:16 +00:00
Tyler Michael Smith
5932634409
Unmark fused_moe config json file as executable ( #5960 )
2024-06-28 06:36:12 -07:00
Cyrus Leung
5cbe8d155c
[Core] Registry for processing model inputs ( #5214 )
...
Co-authored-by: ywang96 <ywang@roblox.com >
2024-06-28 12:09:56 +00:00
Isotr0py
0d0e3a42ac
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner ( #5956 )
2024-06-28 12:03:41 +00:00
xwjiang2010
74d55c065b
[VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. ( #5905 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-28 07:29:13 +00:00
Woosuk Kwon
f136da15e1
[Hardware][TPU] Optimize KV cache swapping ( #5878 )
2024-06-27 21:12:13 -07:00
Divakar Verma
c3dde367f1
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X ( #5932 )
2024-06-27 13:41:08 -07:00
youkaichao
64e8d2a783
[core][misc] remove logical block ( #5882 )
2024-06-27 13:34:55 -07:00
Woosuk Kwon
79c92c7c8a
[Model] Add Gemma 2 ( #5908 )
2024-06-27 13:33:56 -07:00
Roger Wang
736ed38849
[CI/Build] Fix Args for _get_logits_warper in Sampler Test ( #5922 )
2024-06-27 11:43:04 -07:00
Nick Hill
365791ff81
[BugFix] Fix min_tokens behaviour for multiple eos tokens ( #5849 )
2024-06-27 11:31:11 -07:00
Nick Hill
691e29ecf3
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens ( #5876 )
2024-06-27 10:59:33 -07:00
youkaichao
3fd02bda51
[doc][misc] add note for Kubernetes users ( #5916 )
2024-06-27 10:07:07 -07:00
Cyrus Leung
98cf2ed678
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision ( #5896 )
2024-06-27 09:08:10 -07:00
Cyrus Leung
e9d32d077d
[CI/Build] [1/3] Reorganize entrypoints tests ( #5526 )
2024-06-27 12:43:17 +00:00
Roger Wang
2061f0b8a7
[Bugfix] Fix img_sizes Parsing in Phi3-Vision ( #5888 )
2024-06-27 08:29:24 +00:00
Cyrus Leung
96354d6a29
[Model] Add base class for LoRA-supported models ( #5018 )
2024-06-27 16:03:04 +08:00
xwjiang2010
d12af207d2
[VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly ( #5880 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-06-27 15:15:24 +08:00
Cyrus Leung
6eabc6cb0e
[Doc] Add note about context length in Phi-3-Vision example ( #5887 )
2024-06-26 23:20:01 -07:00
Nick Hill
2110557dab
[BugFix] Fix cuda graph for MLPSpeculator ( #5875 )
...
Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com >
2024-06-27 04:12:10 +00:00
Roger Wang
b9e84259e9
[Misc] Add example for LLaVA-NeXT ( #5879 )
2024-06-26 17:57:16 -07:00
youkaichao
294104c3f9
[doc] update usage of env var to avoid conflict ( #5873 )
2024-06-26 17:57:12 -04:00
Chip Kerchner
38a1674abb
Support CPU inference with VSX PowerPC ISA ( #5652 )
2024-06-26 21:53:04 +00:00
Woosuk Kwon
f5c8628fdc
[Bugfix][TPU] Fix CPU cache allocation ( #5869 )
2024-06-26 13:42:40 -07:00
Woosuk Kwon
cbc53b6b8d
[Hardware][TPU] Support parallel sampling & Swapping ( #5855 )
2024-06-26 11:07:49 -07:00
sasha0552
c54269d967
[Frontend] Add tokenize/detokenize endpoints ( #5054 )
2024-06-26 16:54:22 +00:00
Luka Govedič
5bfd1bbc98
[Kernel] Adding bias epilogue support for cutlass_scaled_mm ( #5560 )
...
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-06-26 15:16:00 +00:00
Cyrus Leung
6984c02a27
[CI/Build] Refactor image test assets ( #5821 )
2024-06-26 01:02:34 -07:00
Woosuk Kwon
3439c5a8e3
[Bugfix][TPU] Fix KV cache size calculation ( #5860 )
2024-06-26 00:58:23 -07:00
Woosuk Kwon
6806998bf9
[Bugfix] Fix embedding to support 2D inputs ( #5829 )
2024-06-26 00:15:22 -07:00
youkaichao
515080ad2f
[bugfix][distributed] fix shm broadcast when the queue size is full ( #5801 )
2024-06-25 21:56:02 -07:00
Roger Wang
3aa7b6cf66
[Misc][Doc] Add Example of using OpenAI Server with VLM ( #5832 )
2024-06-25 20:34:25 -07:00
Stephanie Wang
dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication ( #5408 )
...
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu >
Signed-off-by: Stephanie <swang@anyscale.com >
Co-authored-by: Stephanie <swang@anyscale.com >
2024-06-25 20:30:03 -07:00
aws-patlange
82079729cc
[Bugfix] Fix assertion in NeuronExecutor ( #5841 )
2024-06-25 19:52:10 -07:00
Thomas Parnell
c2a8ac75e0
[CI/Build] Add E2E tests for MLPSpeculator ( #5791 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-26 00:04:08 +00:00
Woosuk Kwon
f178e56c68
[Hardware][TPU] Raise errors for unsupported sampling params ( #5850 )
2024-06-25 16:58:23 -07:00
Matt Wong
dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes ( #5422 )
2024-06-25 15:56:15 -07:00
Woosuk Kwon
bc34937d68
[Hardware][TPU] Refactor TPU backend ( #5831 )
2024-06-25 15:25:52 -07:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
Michael Goin
d9b34baedd
[CI/Build] Add unit testing for FlexibleArgumentParser ( #5798 )
2024-06-25 12:18:03 -07:00
youkaichao
c18ebfdd71
[doc][distributed] add both gloo and nccl tests ( #5834 )
2024-06-25 15:10:28 -04:00
Antoni Baum
67882dbb44
[Core] Add fault tolerance for RayTokenizerGroupPool ( #5748 )
2024-06-25 10:15:10 -07:00
Jie Fu (傅杰)
7b99314301
[Misc] Remove useless code in cpu_worker ( #5824 )
2024-06-25 09:41:36 -07:00
Woo-Yeon Lee
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model ( #5414 )
2024-06-25 09:56:06 +00:00
Cyrus Leung
f23871e9ee
[Doc] Add notice about breaking changes to VLMs ( #5818 )
2024-06-25 01:25:03 -07:00
Kevin H. Luu
e9de9dd551
[ci] Remove aws template ( #5757 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-24 21:09:02 -07:00
Chang Su
ba991d5c84
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args ( #5795 )
2024-06-24 17:01:19 -06:00
Michael Goin
1744cc99ba
[Doc] Add Phi-3-medium to list of supported models ( #5788 )
2024-06-24 10:48:55 -07:00
Michael Goin
e72dc6cb35
[Doc] Add "Suggest edit" button to doc pages ( #5789 )
2024-06-24 10:26:17 -07:00
youkaichao
c246212952
[doc][faq] add warning to download models for every nodes ( #5783 )
2024-06-24 15:37:42 +08:00
Isotr0py
edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement ( #5772 )
2024-06-24 12:11:53 +08:00
Murali Andoorveedu
5d4d90536f
[Distributed] Add send and recv helpers ( #5719 )
2024-06-23 14:42:28 -07:00
Varun Sundar Rabindranath
6c916ac8a8
[BugFix] [Kernel] Add Cutlass2x fallback kernels ( #5744 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-23 21:07:11 +00:00
youkaichao
832ea88fcb
[core][distributed] improve shared memory broadcast ( #5754 )
2024-06-22 10:00:43 -07:00
Woosuk Kwon
8c00f9c15d
[Docs][TPU] Add installation tip for TPU ( #5761 )
2024-06-21 23:09:40 -07:00
Woosuk Kwon
0cbc1d2b4f
[Bugfix] Fix pin_lora error in TPU executor ( #5760 )
2024-06-21 22:25:14 -07:00
zifeitong
ff9ddbceee
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py ( #5756 )
2024-06-22 03:33:12 +00:00
Jie Fu (傅杰)
9c62db07ed
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs ( #5710 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-22 02:07:08 +00:00
Kunshang Ji
cf90ae0123
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline ( #5616 )
2024-06-21 17:09:34 -07:00
rohithkrn
f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache ( #5603 )
2024-06-21 15:42:46 -07:00
youkaichao
7187507301
[ci][test] fix ca test in main ( #5746 )
2024-06-21 14:04:26 -07:00
zhyncs
f1e72cc19a
[BugFix] exclude version 1.15.0 for modelscope ( #5668 )
2024-06-21 13:15:48 -06:00
Michael Goin
5b15bde539
[Doc] Documentation on supported hardware for quantization methods ( #5745 )
2024-06-21 12:44:29 -04:00
Roger Wang
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU ( #5717 )
2024-06-21 06:39:40 +00:00
youkaichao
d9a252bc8e
[Core][Distributed] add shm broadcast ( #5399 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-06-21 05:12:35 +00:00
Jee Li
67005a07bc
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora ( #5665 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-21 04:46:28 +00:00
Chang Su
c35e4a3dd7
[BugFix] Fix test_phi3v.py ( #5725 )
2024-06-21 04:45:34 +00:00
Jinzhen Lin
1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA ( #5441 )
2024-06-20 17:55:41 -07:00
Joshua Rosenkranz
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com >
2024-06-20 20:23:12 -04:00
youkaichao
6c5b7af152
[distributed][misc] use fork by default for mp ( #5669 )
2024-06-20 17:06:34 -07:00
Michael Goin
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names ( #5718 )
2024-06-20 17:00:13 -06:00
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
Varun Sundar Rabindranath
a7dcc62086
[Kernel] Update Cutlass int8 kernel configs for SM80 ( #5275 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-20 13:33:21 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
Varun Sundar Rabindranath
111af1fa2c
[Kernel] Update Cutlass int8 kernel configs for SM90 ( #5514 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-20 06:37:08 +00:00
Roger Wang
1b2eaac316
[Bugfix][Doc] FIx Duplicate Explicit Target Name Errors ( #5703 )
2024-06-19 23:10:47 -07:00
Cyrus Leung
3730a1c832
[Misc] Improve conftest ( #5681 )
2024-06-19 19:09:21 -07:00
Kevin H. Luu
949e49a685
[ci] Limit num gpus if specified for A100 ( #5694 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-19 16:30:03 -07:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
Rafael Vasquez
e83db9e7e3
[Doc] Update docker references ( #5614 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-06-19 15:01:45 -07:00
zifeitong
78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run ( #5654 )
2024-06-19 13:57:12 -07:00
youkaichao
d571ca0108
[ci][distributed] add tests for custom allreduce ( #5689 )
2024-06-19 20:16:04 +00:00
Michael Goin
afed90a034
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py ( #5688 )
2024-06-19 14:41:42 -04:00
Kevin H. Luu
3ee5c4bca5
[ci] Add A100 queue into AWS CI template ( #5648 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-19 08:42:13 -06:00
Cyrus Leung
e9c2732b97
[CI/Build] Add tqdm to dependencies ( #5680 )
2024-06-19 08:37:33 -06:00
DearPlanet
d8714530d1
[Misc]Add param max-model-len in benchmark_latency.py ( #5629 )
2024-06-19 18:19:08 +08:00
Isotr0py
7d46c8d378
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example ( #5684 )
2024-06-19 17:58:32 +08:00
Michael Goin
da971ec7a5
[Model] Add FP8 kv cache for Qwen2 ( #5656 )
2024-06-19 09:38:26 +00:00
youkaichao
3eea74889f
[misc][distributed] use 127.0.0.1 for single-node ( #5619 )
2024-06-19 08:05:00 +00:00
Hongxia Yang
f758aed0e8
[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices ( #5641 )
2024-06-18 23:21:29 -07:00
Thomas Parnell
e5150f2c28
[Bugfix] Added test for sampling repetition penalty bug. ( #5659 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-19 06:03:55 +00:00
Shukant Pal
59a1eb59c9
[Bugfix] Fix Phi-3 Long RoPE scaling implementation ( #5628 )
2024-06-19 01:46:38 +00:00
Tyler Michael Smith
6820724e51
[Bugfix] Fix w8a8 benchmarks for int8 case ( #5643 )
2024-06-19 00:33:25 +00:00
Tyler Michael Smith
b23ce92032
[Bugfix] Fix CUDA version check for mma warning suppression ( #5642 )
2024-06-18 23:48:49 +00:00
milo157
2bd231a7b7
[Doc] Added cerebrium as Integration option ( #5553 )
2024-06-18 15:56:59 -07:00
Thomas Parnell
8a173382c8
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties ( #5639 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-18 14:18:37 -07:00
sergey-tinkoff
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
Kevin H. Luu
19091efc44
[ci] Setup Release pipeline and build release wheels with cache ( #5610 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-18 11:00:36 -07:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
Ronen Schaffer
7879f24dcc
[Misc] Add OpenTelemetry support ( #4687 )
...
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.
I've also added a markdown with print-screens to guide users how to use this feature. You can find it here
2024-06-19 01:17:03 +09:00
Kevin H. Luu
13db4369d9
[ci] Deprecate original CI template ( #5624 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-18 14:26:20 +00:00
Roger Wang
4ad7b53e59
[CI/Build][Misc] Update Pytest Marker for VLMs ( #5623 )
2024-06-18 13:10:04 +00:00
Chang Su
f0cc0e68e3
[Misc] Remove import from transformers logging ( #5625 )
2024-06-18 12:12:19 +00:00
youkaichao
db5ec52ad7
[bugfix][distributed] improve p2p capability test ( #5612 )
...
[bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612 )
2024-06-18 07:21:05 +00:00
Kuntai Du
114d7270ff
[CI] Avoid naming different metrics with the same name in performance benchmark ( #5615 )
2024-06-17 21:37:18 -07:00
Cyrus Leung
32c86e494a
[Misc] Fix typo ( #5618 )
2024-06-17 20:58:30 -07:00
youkaichao
8eadcf0b90
[misc][typo] fix typo ( #5620 )
2024-06-17 20:54:57 -07:00
Joe Runde
5002175e80
[Kernel] Add punica dimensions for Granite 13b ( #5559 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-06-18 03:54:11 +00:00
Isotr0py
daef218b55
[Model] Initialize Phi-3-vision support ( #4986 )
2024-06-17 19:34:33 -07:00
sroy745
fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier ( #5131 )
2024-06-17 21:29:09 -05:00
zifeitong
26e1188e51
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py ( #5606 )
2024-06-17 23:16:10 +00:00
Bruce Fontaine
a3e8a05d4c
[Bugfix] Fix KV head calculation for MPT models when using GQA ( #5142 )
2024-06-17 15:26:41 -07:00
youkaichao
e441bad674
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids ( #5584 )
2024-06-17 22:08:05 +00:00
youkaichao
1b44aaf4e3
[bugfix][distributed] fix 16 gpus local rank arrangement ( #5604 )
2024-06-17 21:35:04 +00:00
Kuntai Du
9e4e6fe207
[CI] the readability of benchmarking and prepare for dashboard ( #5571 )
...
[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (#5571 )
2024-06-17 11:41:08 -07:00
Jie Fu (傅杰)
ab66536dbf
[CI/BUILD] Support non-AVX512 vLLM building and testing ( #5574 )
2024-06-17 14:36:10 -04:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend ( #3814 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com >
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com >
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com >
2024-06-17 11:01:25 -07:00
zhyncs
1f12122b17
[Misc] use AutoTokenizer for benchmark serving when vLLM not installed ( #5588 )
2024-06-17 09:40:35 -07:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support ( #5435 )
2024-06-17 12:32:48 -04:00
Charles Riggins
9e74d9d003
Correct alignment in the seq_len diagram. ( #5592 )
...
Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai >
2024-06-17 12:05:33 -04:00
Amit Garg
9333fb8eb9
[Model] Rename Phi3 rope scaling type ( #5595 )
2024-06-17 12:04:14 -04:00
Cody Yu
e2b85cf86a
Fix w8a8 benchmark and add Llama-3-8B ( #5562 )
2024-06-17 06:48:06 +00:00
youkaichao
845a3f26f9
[Doc] add debugging tips for crash and multi-node debugging ( #5581 )
2024-06-17 10:08:01 +08:00
youkaichao
f07d513320
[build][misc] limit numpy version ( #5582 )
2024-06-16 16:07:01 -07:00
Michael Goin
4a6769053a
[CI][BugFix] Flip is_quant_method_supported condition ( #5577 )
2024-06-16 14:07:34 +00:00
Antoni Baum
f31c1f90e3
Add basic correctness 2 GPU tests to 4 GPU pipeline ( #5518 )
2024-06-16 07:48:02 +00:00
zifeitong
3ce2c050dd
[Fix] Correct OpenAI batch response format ( #5554 )
2024-06-15 16:57:54 -07:00
Nick Hill
1c0afa13c5
[BugFix] Don't start a Ray cluster when not using Ray ( #5570 )
2024-06-15 16:30:51 -07:00
Alexander Matveev
d919ecc771
add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 ( #5145 )
2024-06-15 13:38:16 -04:00
SangBin Cho
e691918e3b
[misc] Do not allow to use lora with chunked prefill. ( #5538 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-06-15 14:59:36 +00:00
Cyrus Leung
81fbb3655f
[CI/Build] Test both text and token IDs in batched OpenAI Completions API ( #5568 )
2024-06-15 07:29:42 -04:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
leiwen83
1b8a0d71cf
[Core][Bugfix]: fix prefix caching for blockv2 ( #5364 )
...
Signed-off-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-06-14 17:23:56 -07:00
Simon Mo
bd7efe95d0
Add ccache to amd ( #5555 )
2024-06-14 17:18:22 -07:00
youkaichao
f5bb85b435
[Core][Distributed] improve p2p cache generation ( #5528 )
2024-06-14 14:47:45 -07:00
Woosuk Kwon
28c145eb57
[Bugfix] Fix typo in Pallas backend ( #5558 )
2024-06-14 14:40:09 -07:00
Thomas Parnell
e2afb03c92
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models ( #5460 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-14 20:28:11 +00:00
Sanger Steel
6e2527a7cb
[Doc] Update documentation on Tensorizer ( #5471 )
2024-06-14 11:27:57 -07:00
Simon Mo
cdab68dcdb
[Docs] Add ZhenFund as a Sponsor ( #5548 )
2024-06-14 11:17:21 -07:00
youkaichao
d1c3d7d139
[misc][distributed] fix benign error in is_in_the_same_node ( #5512 )
2024-06-14 10:59:28 -07:00
Cyrus Leung
77490c6f2f
[Core] Remove duplicate processing in async engine ( #5525 )
2024-06-14 10:04:42 -07:00
youkaichao
48f589e18b
[mis] fix flaky test of test_cuda_device_count_stateless ( #5546 )
2024-06-14 10:02:23 -07:00
Tyler Michael Smith
348616ac4b
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later ( #5401 )
2024-06-14 10:02:00 -07:00
Robert Shaw
15985680e2
[ Misc ] Rs/compressed tensors cleanup ( #5432 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2024-06-14 10:01:46 -07:00
Allen.Dou
d74674bbd9
[Misc] Fix arg names ( #5524 )
2024-06-14 09:47:44 -07:00
Tyler Michael Smith
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue ( #5516 )
2024-06-14 09:30:15 -07:00
Cyrus Leung
d47af2bc02
[CI/Build] Disable LLaVA-NeXT CPU test ( #5529 )
2024-06-14 09:27:30 -07:00
Kuntai Du
319ad7f1d3
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label ( #5073 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-06-13 22:36:20 -07:00
Simon Mo
0f0d8bc065
bump version to v0.5.0.post1 ( #5522 )
2024-06-13 19:42:06 -07:00
Allen.Dou
55d6361b13
[Misc] Fix arg names in quantizer script ( #5507 )
2024-06-13 19:02:53 -07:00
Jie Fu (傅杰)
cd9c0d65d9
[Hardware][Intel] Support CPU inference with AVX2 ISA ( #5452 )
2024-06-13 17:22:24 -06:00
Antoni Baum
50eed24d25
Add cuda_device_count_stateless ( #5473 )
2024-06-13 16:06:49 -07:00
Tyler Michael Smith
e38042d4af
[Kernel] Disable CUTLASS kernels for fp8 ( #5505 )
2024-06-13 13:38:05 -07:00
Tyler Michael Smith
33e3b37242
[CI/Build] Disable test_fp8.py ( #5508 )
2024-06-13 13:37:48 -07:00
youkaichao
1696efe6c9
[misc] fix format.sh ( #5511 )
2024-06-13 12:09:16 -07:00
Antoni Baum
6b0511a57b
Revert "[Core] Remove unnecessary copies in flash attn backend" ( #5478 )
2024-06-13 11:22:50 -07:00
Antoni Baum
a8fda4f661
Separate dev requirements into lint and test ( #5474 )
2024-06-13 11:22:41 -07:00
Cody Yu
30299a41fa
[MISC] Remove FP8 warning ( #5472 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
2024-06-13 11:22:30 -07:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: zifeitong <zifei.tong@parasail.io >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-13 11:22:19 -07:00
Cyrus Leung
0ce7b952f8
[Doc] Update LLaVA docs ( #5437 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-13 11:22:07 -07:00
Cyrus Leung
39873476f8
[CI/Build] Simplify OpenAI server setup in tests ( #5100 )
2024-06-13 11:21:53 -07:00
Cyrus Leung
03dccc886e
[Misc] Add vLLM version getter to utils ( #5098 )
2024-06-13 11:21:39 -07:00
Woosuk Kwon
a65634d3ae
[Docs] Add 4th meetup slides ( #5509 )
2024-06-13 10:18:26 -07:00
Li, Jiang
80aa7e91fc
[Hardware][Intel] Optimize CPU backend and add more performance tips ( #4971 )
...
Co-authored-by: Jianan Gu <jianan.gu@intel.com >
2024-06-13 09:33:14 -07:00
wenyujin333
bd43973522
[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 ( #5497 )
...
Tune Qwen2-57B-A14B configs based on #4921
Throughput performance (A100 GPU)
command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2
benchmark | no config                           | w/ PR
tp=2      | 10.53 requests/s, 11058.17 tokens/s | 12.47 requests/s, 13088.57 tokens/s
tp=4      | 17.77 requests/s, 18662.95 tokens/s | 20.20 requests/s, 21212.32 tokens/s
2024-06-13 09:01:10 -07:00
Michael Goin
23ec72fa03
[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations ( #5466 )
2024-06-13 15:18:08 +00:00
Dipika Sikka
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-13 10:19:56 -04:00
Wang, Yi
88407532e7
[Bugfix]if the content is started with ":"(response of ping), client should i… ( #5303 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-12 20:16:41 -07:00
Kevin H. Luu
916d219d62
[ci] Use sccache to build images ( #5419 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-12 17:58:12 -07:00
youkaichao
ea3890a5f0
[Core][Distributed] code deduplication in tp&pp with coordinator( #5293 )
...
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293 )
2024-06-12 17:27:08 -07:00
Isotr0py
2135cacb45
[Bugfix] Fix wrong multi_modal_input format for CPU runner ( #5451 )
2024-06-12 16:20:18 -07:00
Michael Goin
7d19de2e9c
[Frontend] Add "input speed" to tqdm postfix alongside output speed ( #5425 )
2024-06-12 18:42:12 -04:00
Michael Goin
94a07bbdd8
[Bugfix] Fix typo in scheduler.py (requeset -> request) ( #5470 )
2024-06-12 21:59:44 +00:00
Cyrus Leung
b8d4dfff9c
[Doc] Update debug docs ( #5438 )
2024-06-12 14:49:31 -07:00
youkaichao
622d45128c
[misc] add hint for AttributeError ( #5462 )
2024-06-12 21:46:35 +00:00
Travis Johnson
51602eefd3
[Frontend] [Core] Support for sharded tensorized models ( #4990 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Sanger Steel <sangersteel@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-12 14:13:52 -07:00
Arthur Kim
5cc50a531f
[Bugfix] TYPE_CHECKING for MultiModalData ( #5444 )
2024-06-12 14:08:52 -07:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146 , this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmarks show the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).
In detail, we applied 3 optimizations:
- Use the inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use vectorized loads/stores of 4 elements to transfer data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
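The inverted-scale optimization above can be illustrated in plain Python (a conceptual sketch only — the real change lives in vLLM's CUDA kernel; the function names and the e4m3 clamp bound here are illustrative):

```python
# Conceptual sketch of the "inverted scale" optimization: replace a per-element
# division with a single reciprocal followed by cheap multiplications.
# Plain Python for clarity; the actual work happens in a CUDA kernel.

FP8_E4M3_MAX = 448.0  # max magnitude representable in float8 e4m3

def quantize_naive(values, scale):
    # One division per element -- divisions are expensive on GPUs.
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

def quantize_inverted(values, scale):
    # Compute the reciprocal once, then multiply.
    inv_scale = 1.0 / scale
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v * inv_scale)) for v in values]

print(quantize_inverted([1.0, -2.0, 1000.0], 0.5))  # [2.0, -4.0, 448.0]
```

The other two optimizations (unrolling and 4-wide vectorized transfers) have no Python analogue; they amortize loop overhead and widen each memory transaction in the kernel itself.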
Kevin H. Luu
8b82a89997
[ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests ( #5464 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-12 14:00:18 -07:00
Li, Jiang
c3c2903e72
[Bugfix] Add device assertion to TorchSDPA ( #5402 )
2024-06-12 12:58:53 -07:00
Woosuk Kwon
1a8bfd92d5
[Hardware] Initial TPU integration ( #5292 )
2024-06-12 11:53:03 -07:00
SangBin Cho
847cdcca1c
[CI] Upgrade codespell version. ( #5381 )
2024-06-12 10:06:14 -07:00
Simon Mo
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
Michael Goin
3dd6853bc8
[CI/Build] Add is_quant_method_supported to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
youkaichao
8f89d72090
[Doc] add common case for long waiting time ( #5430 )
2024-06-11 11:12:13 -07:00
Nick Hill
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case ( #5230 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-11 11:10:41 -07:00
youkaichao
c4bd03c7c5
[Core][Distributed] add same-node detection ( #5369 )
2024-06-11 10:53:59 -07:00
sasha0552
dcbf4286af
[Frontend] Customizable RoPE theta ( #5197 )
2024-06-11 10:42:26 -07:00
Ali Panahi
00e6a2dc53
[Bugfix] fix lora_dtype value type in arg_utils.py ( #5398 )
2024-06-11 10:40:23 -07:00
Junichi Sato
2e02311a1b
[Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 ( #5254 )
2024-06-11 10:38:07 -07:00
Cade Daniel
89ec06c33b
[Docs] [Spec decode] Fix docs error in code example ( #5427 )
2024-06-11 10:31:56 -07:00
Kuntai Du
9fde251bf0
[Doc] Add an automatic prefix caching section in vllm documentation ( #5324 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-06-11 10:24:59 -07:00
Cade Daniel
4c2ffb28ff
[Speculative decoding] Initial spec decode docs ( #5400 )
2024-06-11 10:15:40 -07:00
SangBin Cho
246598a6b1
[CI] docfix ( #5410 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: ywang96 <ywang@roblox.com >
2024-06-11 01:28:50 -07:00
Woosuk Kwon
8bab4959be
[Misc] Remove VLLM_BUILD_WITH_NEURON env variable ( #5389 )
2024-06-11 00:37:56 -07:00
Roger Wang
3c4cebf751
[Doc][Typo] Fixing Missing Comma ( #5403 )
2024-06-11 00:20:28 -07:00
youkaichao
d8f31f2f8b
[Doc] add debugging tips ( #5409 )
2024-06-10 23:21:43 -07:00
Cyrus Leung
640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" ( #5026 )
2024-06-10 22:36:46 -07:00
maor-ps
351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs ( #5312 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-06-11 10:30:31 +08:00
Nick Hill
a008629807
[Misc] Various simplifications and typing fixes ( #5368 )
2024-06-11 10:29:02 +08:00
Kevin H. Luu
76477a93b7
[ci] Fix Buildkite agent path ( #5392 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 18:58:07 -07:00
Michael Goin
77c87beb06
[Doc] Add documentation for FP8 W8A8 ( #5388 )
2024-06-10 18:55:12 -06:00
Simon Mo
114332b88e
Bump version to v0.5.0 ( #5384 )
2024-06-10 15:56:06 -07:00
Woosuk Kwon
cb77ad836f
[Docs] Alphabetically sort sponsors ( #5386 )
2024-06-10 15:17:19 -05:00
Roger Wang
856c990041
[Docs] Add Docs on Limitations of VLM Support ( #5383 )
2024-06-10 09:53:50 -07:00
Kevin H. Luu
c5602f0baa
[ci] Mount buildkite agent on Docker container to upload benchmark results ( #5330 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 09:22:34 -07:00
Kevin H. Luu
f7f9c5f97b
[ci] Use small_cpu_queue for doc build ( #5331 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 09:21:11 -07:00
Cyrus Leung
2c0d933594
[Bugfix] Fix LLaVA-NeXT ( #5380 )
2024-06-10 15:38:47 +00:00
Itay Etelis
774d1035e4
[Feature][Frontend]: Continued stream_options implementation also in CompletionRequest ( #5319 )
2024-06-10 14:22:09 +00:00
Cyrus Leung
6b29d6fe70
[Model] Initial support for LLaVA-NeXT ( #4199 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-10 12:47:15 +00:00
Cyrus Leung
0bfa1c4f13
[Misc] Improve error message when LoRA parsing fails ( #5194 )
2024-06-10 19:38:49 +08:00
youkaichao
c81da5f56d
[misc][typo] fix typo ( #5372 )
2024-06-10 09:51:02 +00:00
Roger Wang
68bc81703e
[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server ( #5374 )
2024-06-10 09:13:39 +00:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-06-10 03:49:46 +00:00
Bla_ckB
45f92c00cf
[Bugfix] Fix KeyError: 1 When Using LoRA adapters ( #5164 )
2024-06-09 16:23:14 -07:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
youkaichao
5d7e3d0176
[mis][ci/test] fix flaky test in test_sharded_state_loader.py ( #5361 )
...
[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361 )
2024-06-09 03:50:14 +00:00
youkaichao
0373e1837e
[Core][CUDA Graph] add output buffer for cudagraph ( #5074 )
...
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074 )
2024-06-08 19:14:43 -07:00
Michael Goin
c09dade2a2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale ( #5353 )
2024-06-08 13:54:05 -04:00
youkaichao
8ea5e44a43
[CI/Test] improve robustness of test (vllm_runner) ( #5357 )
...
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357 )
2024-06-08 08:59:20 +00:00
youkaichao
9fb900f90c
[CI/Test] improve robustness of test (hf_runner) ( #5347 )
...
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347 )
2024-06-07 22:31:32 -07:00
Hongxia Yang
c96fc06747
[ROCm][AMD] Use pytorch sdpa math backend to do naive attention ( #4965 )
2024-06-07 19:13:12 -07:00
Benjamin Kitor
b3376e5c76
[Misc] Add args for selecting distributed executor to benchmarks ( #5335 )
2024-06-08 09:20:16 +08:00
Cheng Li
e69ded7d1c
[Bug Fix] Fix the support check for FP8 CUTLASS ( #5352 )
...
Bug description:
With torch 2.4.0.dev20240603+cu121,
cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112)
This PR fixes the support check for FP8 CUTLASS ( cutlass_fp8_supported) which was introduced in https://github.com/vllm-project/vllm/pull/5183 .
2024-06-08 00:42:05 +00:00
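A malformed comparable like the (90, 11111111112) in the report above is a classic symptom of flattening a version string into a single number; comparing versions as integer tuples avoids the whole class of bug. A minimal sketch (illustrative only, not the actual fix in #5352):

```python
# Hedged sketch: compare toolkit/CUDA versions as integer tuples instead of
# gluing digits into one number, which breaks ordering for multi-digit parts.

def parse_version(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

assert parse_version("12.1") > parse_version("11.8")
# A plain string (or digit-concatenation) comparison would order this wrongly:
assert parse_version("12.10") > parse_version("12.9")
```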
Calvinn Ng
767c727a81
fix DbrxFusedNormAttention missing cache_config ( #5340 )
...
Co-authored-by: team <calvinn.ng@ahrefs.com >
2024-06-07 14:10:21 -07:00
Jie Fu (傅杰)
6840a71610
[Misc] Remove unused cuda_utils.h in CPU backend ( #5345 )
2024-06-07 14:09:13 -07:00
Roger Wang
7a9cb294ae
[Frontend] Add OpenAI Vision API Support ( #5237 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-06-07 11:23:32 -07:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-07 09:36:26 -07:00
limingshu
dc49fb892c
Add missing ignored_seq_groups in _schedule_chunked_prefill ( #5296 )
2024-06-07 13:35:42 +00:00
Antoni Baum
18a277b52d
Remove Ray health check ( #4693 )
2024-06-07 10:01:56 +00:00
Tyler Michael Smith
8d75fe48ca
[Kernel] Switch fp8 layers to use the CUTLASS kernels ( #5183 )
...
Switching from torch._scaled_mm to vLLM's cutlass fp8 kernels when supported as we are seeing 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8
see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
2024-06-07 08:42:35 +00:00
youkaichao
388596c914
[Misc][Utils] allow get_open_port to be called for multiple times ( #5333 )
2024-06-06 22:15:11 -07:00
Itay Etelis
baa15a9ec3
[Feature][Frontend]: Add support for stream_options in ChatCompletionRequest ( #5135 )
2024-06-07 03:29:24 +00:00
Jie Fu (傅杰)
15063741e3
[Misc] Missing error message for custom ops import ( #5282 )
2024-06-06 20:17:21 -07:00
Antoni Baum
ccdc490dda
[Core] Change LoRA embedding sharding to support loading methods ( #5038 )
2024-06-06 19:07:57 -07:00
Antoni Baum
a31cab7556
[Core] Avoid copying prompt/output tokens if no penalties are used ( #5289 )
2024-06-06 18:12:00 -07:00
Matthew Goldey
828da0d44e
[Frontend] enable passing multiple LoRA adapters at once to generate() ( #5300 )
2024-06-06 15:48:13 -05:00
Philipp Moritz
abe855d637
[Kernel] Retune Mixtral 8x22b configs for FP8 on H100 ( #5294 )
2024-06-06 09:29:29 -07:00
liuyhwangyh
4efff036f0
Bugfix: fix broken download of models from modelscope ( #5233 )
...
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com >
2024-06-06 09:28:10 -07:00
Cyrus Leung
89c920785f
[CI/Build] Update vision tests ( #5307 )
2024-06-06 05:17:18 -05:00
Breno Faria
7b0a0dfb22
[Frontend][Core] Update Outlines Integration from FSM to Guide ( #4109 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Breno Faria <breno.faria@intrafind.com >
2024-06-05 16:49:12 -07:00
Simon Mo
3a6ae1d33c
[CI] Disable flash_attn backend for spec decode ( #5286 )
2024-06-05 15:49:27 -07:00
Simon Mo
8f1729b829
[Docs] Add Ray Summit CFP ( #5295 )
2024-06-05 15:25:18 -07:00
Woosuk Kwon
6a7c7711a2
[Misc] Skip for logits_scale == 1.0 ( #5291 )
2024-06-05 15:19:02 -07:00
Alex Wu
0f83ddd4d7
[Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. ( #5290 )
2024-06-05 15:18:12 -07:00
Michael Goin
065aff6c16
[Bugfix] Make EngineArgs use named arguments for config construction ( #5285 )
2024-06-05 15:16:56 -07:00
Nick Hill
3d33e372a1
[BugFix] Fix log message about default max model length ( #5284 )
2024-06-05 14:53:16 -07:00
Nick Hill
faf71bcd4b
[Speculative Decoding] Add ProposerWorkerBase abstract class ( #5252 )
2024-06-05 14:53:05 -07:00
Simon Mo
f270a39537
[Docs] Add Sequoia as sponsors ( #5287 )
2024-06-05 18:02:56 +00:00
Philipp Moritz
51a08e7d8f
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 ( #5238 )
2024-06-05 10:59:14 -07:00
DriverSong
eb8fcd2666
[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM ( #5207 )
...
Co-authored-by: qiujiawei9 <qiujiawei9@jd.com >
2024-06-05 10:59:02 -07:00
Cody Yu
5563a4dea8
[Model] Correct Mixtral FP8 checkpoint loading ( #5231 )
2024-06-05 10:58:50 -07:00
Tyler Michael Smith
ccd4f129e8
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size ( #5157 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-06-05 10:44:15 -07:00
Tyler Michael Smith
02cc3b51a7
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results ( #5263 )
2024-06-05 10:17:51 -07:00
Simon Mo
d5b1eb081e
[CI] Add nightly benchmarks ( #5260 )
2024-06-05 09:42:08 -07:00
tomeras91
f0a500545f
[Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) ( #5278 )
2024-06-05 09:32:58 -07:00
Woosuk Kwon
c65146e75e
[Misc] Fix docstring of get_attn_backend ( #5271 )
2024-06-05 09:18:59 -07:00
Woosuk Kwon
41ca62cf03
[Misc] Add CustomOp interface for device portability ( #5255 )
2024-06-05 09:18:19 -07:00
zifeitong
974fc9b845
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True ( #5226 )
2024-06-04 19:37:28 -07:00
youkaichao
fee4dcc33a
[Misc] update collect env ( #5261 )
2024-06-04 17:29:09 -05:00
Michael Goin
650a4cc55e
[Misc] Add transformers version to collect_env.py ( #5259 )
2024-06-04 12:52:28 -07:00
Simon Mo
9ca62d8668
[CI] mark AMD test as softfail to prevent blockage ( #5256 )
2024-06-04 11:34:53 -07:00
Li, Jiang
45c35f0d58
[CI/Build] Reducing CPU CI execution time ( #5241 )
2024-06-04 10:26:40 -07:00
Cyrus Leung
9ba093b4f4
[CI/Build] Simplify model loading for HfRunner ( #5251 )
2024-06-04 10:09:19 -07:00
Woosuk Kwon
27208be66e
[Kernel] Add back batch size 1536 and 3072 to MoE tuning ( #5242 )
2024-06-04 09:58:47 -07:00
Jie Fu (傅杰)
87d5abef75
[Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend ( #5249 )
2024-06-04 09:57:51 -07:00
Cyrus Leung
ec784b2526
[CI/Build] Add inputs tests ( #5215 )
2024-06-03 21:01:46 -07:00
zifeitong
a58f24e590
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor ( #5229 )
2024-06-03 20:55:50 -07:00
afeldman-nm
f42a006b15
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend ( #5210 )
2024-06-03 20:32:57 -07:00
Woosuk Kwon
3a434b07ed
[Kernel] Enhance MoE benchmarking & tuning script ( #4921 )
2024-06-03 20:06:59 -07:00
Zhuohan Li
bd0e7802e0
[Bugfix] Add warmup for prefix caching example ( #5235 )
2024-06-03 19:36:41 -07:00
Toshiki Kataoka
06b2550cbb
[Bugfix] Support prompt_logprobs==0 ( #5217 )
2024-06-03 17:59:30 -07:00
Breno Faria
f775a07e30
[FRONTEND] OpenAI tools support named functions ( #5032 )
2024-06-03 18:25:29 -05:00
Kevin H. Luu
4f0d17c05c
New CI template on AWS stack ( #5110 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-03 16:16:43 -07:00
Kaiyang Chen
10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 ( #3834 )
2024-06-03 13:37:11 -07:00
Yuan
cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests ( #4113 )
2024-06-03 10:39:50 -07:00
Tyler Michael Smith
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
Antoni Baum
0ab278ca31
[Core] Remove unnecessary copies in flash attn backend ( #5138 )
2024-06-03 09:39:31 -07:00
Cyrus Leung
7a64d24aad
[Core] Support image processor ( #4197 )
2024-06-02 22:56:41 -07:00
Cyrus Leung
dfbe60dc62
[Misc] Simplify code and fix type annotations in conftest.py ( #5118 )
2024-06-02 16:05:50 -07:00
Divakar Verma
a66cf40b20
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer ( #4927 )
...
This PR enables the fused topk_softmax kernel used in moe layer for HIP
2024-06-02 14:13:26 -07:00
Avinash Raj
f790ad3c50
[Frontend][OpenAI] Support for returning max_model_len on /v1/models response ( #4643 )
2024-06-02 08:06:13 +00:00
Simon Mo
ed59a7ed23
Update test_ignore_eos ( #4898 )
2024-06-02 02:21:53 +00:00
Robert Shaw
044793d8df
[BugFix] Prevent LLM.encode for non-generation Models ( #5184 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-06-01 23:35:41 +00:00
Daniil Arapov
c2d6d2f960
[Bugfix]: Fix issues related to prefix caching example ( #5177 ) ( #5180 )
2024-06-01 15:53:52 -07:00
Zhuohan Li
8279078e21
[Bugfix] Remove deprecated @abstractproperty ( #5174 )
2024-06-01 22:40:25 +00:00
chenqianfzh
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
Nadav Shmayovits
37464a0f74
[Bugfix] Fix call to init_logger in openai server ( #4765 )
2024-06-01 17:18:50 +00:00
Ye Cao
c354072828
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py ( #5151 )
...
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com >
2024-06-01 17:11:22 +00:00
Varun Sundar Rabindranath
f081c3ce4b
[Kernel] Update Cutlass fp8 configs ( #5144 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-01 08:46:07 +00:00
Tyler Michael Smith
260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU ( #5137 )
2024-06-01 06:45:32 +00:00
Daniele
a360ff80bb
[CI/Build] CMakeLists: build all extensions' cmake targets at the same time ( #5034 )
2024-05-31 22:06:45 -06:00
Tyler Michael Smith
1197e02141
[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels ( #5168 )
2024-05-31 17:21:38 -07:00
Nick Hill
657579113f
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support ( #5171 )
2024-05-31 17:20:19 -07:00
Cody Yu
e9899fb7a4
[Model] Enable FP8 QKV in MoE and refine kernel tuning script ( #5039 )
2024-05-31 14:29:19 -07:00
functionxu123
a377f0bd5e
[Misc]: optimize eager mode host time ( #4196 )
...
Co-authored-by: xuhao <xuhao@cambricon.com >
2024-05-31 13:14:50 +08:00
Simon Mo
e9d3aa04f6
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" ( #5149 )
2024-05-30 22:00:26 -07:00
SnowDist
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-05-30 19:24:41 -07:00
simon-mo
533c217792
Fix cutlass sm_90a version in CMakeList
2024-05-31 02:13:01 +00:00
Alexander Matveev
6d21fa1cad
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) ( #5136 )
2024-05-30 21:02:11 -05:00
Robert Shaw
b35be5403f
[Bugfix] Avoid Warnings in SparseML Activation Quantization ( #5120 )
2024-05-30 17:04:37 -07:00
Simon Mo
45a1a69b98
[Build] Disable sm_90a in cu11 ( #5141 )
2024-05-30 14:37:16 -07:00
Simon Mo
87a658c812
Bump version to v0.4.3 ( #5046 )
2024-05-30 11:13:46 -07:00
Chansung Park
429d89720e
add doc about serving option on dstack ( #3074 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-30 10:11:07 -07:00
Cyrus Leung
a9bcc7afb2
[Doc] Use intersphinx and update entrypoints docs ( #5125 )
2024-05-30 09:59:23 -07:00
Hyunsung Lee
d79d9eaaff
[Misc] remove duplicate definition of seq_lens_tensor in model_runner.py ( #5129 )
2024-05-30 06:56:19 -07:00
youkaichao
f758505c73
[CI/Build] increase wheel size limit to 200 MB ( #5130 )
2024-05-30 06:29:48 -07:00
Robert Shaw
d910816c73
[Bugfix] Automatically Detect SparseML models ( #5119 )
2024-05-30 12:58:37 +00:00
Breno Faria
87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs ( #5029 )
...
Co-authored-by: Breno Faria <breno.faria@intrafind.com >
2024-05-30 02:52:14 -07:00
omkar kakarparthi
e07aff9e52
[CI/Build] Docker cleanup functionality for amd servers ( #5112 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com >
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Co-authored-by: omkarkakarparthi <okakarpa>
2024-05-30 03:27:39 +00:00
Alexander Matveev
5bf185a1c4
[Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter ( #5108 )
2024-05-30 00:30:18 +00:00
youkaichao
4fbcb0f27e
[Doc][Build] update after removing vllm-nccl ( #5103 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-05-29 23:51:18 +00:00
Itay Etelis
7c3604fb68
[Bugfix] logprobs is not compatible with the OpenAI spec #4795 ( #5031 )
2024-05-29 16:13:22 -07:00
Cyrus Leung
b1c255630d
[Core] Avoid the need to pass None values to Sequence.inputs ( #5099 )
2024-05-29 16:05:01 -07:00
Cyrus Leung
eb6c50cdc2
[Bugfix][CI/Build] Fix codespell failing to skip files in git diff ( #5097 )
2024-05-29 16:02:54 -07:00
Cyrus Leung
eecd864388
[Bugfix][CI/Build] Fix test and improve code for merge_async_iterators ( #5096 )
2024-05-29 16:02:25 -07:00
Ronen Schaffer
ae495c74ea
[Doc]Replace deprecated flag in readme ( #4526 )
2024-05-29 22:26:33 +00:00
afeldman-nm
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) ( #4837 )
2024-05-29 16:09:13 +00:00
youkaichao
594392d27a
[Core][Distributed] improve p2p access check ( #4992 )
2024-05-29 11:29:07 +00:00
Cyrus Leung
18c1f16d86
[Bugfix] Fix arguments passed to Sequence in stop checker test ( #5092 )
2024-05-29 07:16:41 +00:00
youkaichao
5bd3c65072
[Core][Optimization] remove vllm-nccl ( #5091 )
2024-05-29 05:13:52 +00:00
Marut Pandya
616e600e0b
[Misc] add gpu_memory_utilization arg ( #5079 )
...
Signed-off-by: pandyamarut <pandyamarut@gmail.com >
2024-05-28 17:16:18 -07:00
Junichi Sato
dfba529b40
[Bugfix] Remove the last EOS token unless explicitly specified ( #5077 )
2024-05-28 17:15:35 -07:00
Cyrus Leung
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-28 13:29:31 -07:00
Simon Mo
290f4ada2b
[Docs] Add Dropbox as sponsors ( #5089 )
2024-05-28 10:29:09 -07:00
Divakar Verma
dd8de11f0a
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X ( #4951 )
...
This PR adds Triton kernel configs for the MoE kernel for MI300X
2024-05-28 16:03:23 +00:00
Robert Shaw
9ba415588a
[BugFix] Fix Embedding Models with TP>1 ( #5075 )
2024-05-28 08:32:42 -07:00
Michał Moskal
d4f3985907
[Core] Sliding window for block manager v2 ( #4545 )
...
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local >
2024-05-28 11:07:07 +09:00
Isotr0py
890aa93d27
[Model] Add support for falcon-11B ( #5069 )
2024-05-27 16:41:43 -07:00
sasha0552
fbdb7b3ee2
[Core] Allow AQLM on Pascal ( #5058 )
2024-05-27 15:26:14 -07:00
Zhuohan Li
1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) ( #4846 )
...
Co-authored-by: rsnm2 <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-27 15:18:17 -07:00
Roger Wang
f17a1a8f96
[Misc] Make Serving Benchmark More User-friendly ( #5044 )
2024-05-25 17:28:16 +00:00
Lily Liu
d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding ( #5000 )
2024-05-25 10:00:14 -07:00
youkaichao
325c119961
[Misc] add logging level env var ( #5045 )
2024-05-24 23:49:49 -07:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com >
Co-authored-by: bapatra <bapatra@microsoft.com >
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-24 22:00:52 -07:00
leiwen83
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 ( #4764 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-05-24 10:07:09 -07:00
Robert Shaw
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading ( #5005 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-24 12:28:27 +00:00
youkaichao
6a50f4cafa
[Doc] add ccache guide in doc ( #5012 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-23 23:21:54 +00:00
Elisei Smirnov
e3470f8753
[Core]: Option To Use Prompt Token Ids Inside Logits Processor ( #4985 )
...
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university >
2024-05-23 22:04:24 +00:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-05-23 21:29:18 +00:00
Murali Andoorveedu
5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups ( #4988 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-05-23 09:54:48 -07:00
Letian Li
2ba80bed27
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined ( #5009 )
2024-05-23 09:08:58 -07:00
Alexander Matveev
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00
Cody Yu
ee3eea0a1b
[Misc] Take user preference in attention selector ( #4960 )
2024-05-23 07:55:56 +09:00
Philipp Moritz
a36de682d4
[Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig ( #4991 )
2024-05-22 22:26:56 +00:00
Nick Hill
eb6d3c264d
[Core] Eliminate parallel worker per-step task scheduling overhead ( #4894 )
2024-05-23 06:17:27 +09:00
raywanb
97b030005c
[Model] LoRA gptbigcode implementation ( #3949 )
2024-05-22 13:58:59 -07:00
Cody Yu
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints ( #4893 )
...
The 2nd PR for #4532 .
This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
Tyler Michael Smith
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs ( #4954 )
...
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
2024-05-22 14:10:43 +00:00
SangBin Cho
c74c913bfb
[misc] remove comments that were supposed to be removed ( #4977 )
2024-05-22 09:02:58 -04:00
Michael Goin
5f6d10c14c
[CI/Build] Enforce style for C++ and CUDA code with clang-format ( #4722 )
2024-05-22 07:18:41 +00:00
sasha0552
9b9a10d6cb
[Frontend] Dynamic RoPE scaling ( #4638 )
2024-05-22 01:32:35 -04:00
Isotr0py
99eff67ba9
[Bugfix][Kernel] Add head size check for attention backend selection ( #4944 )
2024-05-21 15:33:25 -04:00
Kante Yin
14772eeb8e
[Bugfix] Fix flag name for max_seq_len_to_capture ( #4935 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com >
2024-05-21 09:30:52 -07:00
Michael Goin
757b62c495
[CI/Build] Codespell ignore build/ directory ( #4945 )
2024-05-21 09:06:10 -07:00
Simon Mo
e941f88584
[Docs] Add acknowledgment for sponsors ( #4925 )
2024-05-21 00:17:25 -07:00
Isotr0py
f12c3b5b3d
[Model] Add Phi-2 LoRA support ( #4886 )
2024-05-21 14:24:17 +09:00
HUANG Fei
d130b573a0
[Model] add rope_scaling support for qwen2 ( #4930 )
2024-05-21 05:22:22 +00:00
Antoni Baum
65ae8c2c8f
[Core] Fix scheduler considering "no LoRA" as "LoRA" ( #4897 )
2024-05-20 17:48:32 -07:00
Kuntai Du
c3af44722c
[Doc]Add documentation to benchmarking script when running TGI ( #4920 )
2024-05-20 20:16:57 +00:00
Aurick Qiao
1937e29848
[Core] Sharded State Loader download from HF ( #4889 )
2024-05-20 11:46:12 -07:00
Mor Zusman
f0eecee610
[Bugfix] Fix dummy weight for fp8 ( #4916 )
...
Allow the dummy load format for fp8, since torch.uniform_ doesn't support FP8 at the moment.
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-05-20 18:44:25 +00:00
Alexei-V-Ivanov-AMD
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test ( #4834 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com >
2024-05-20 11:29:28 -07:00
Wenwei Zhang
546a97ef69
[Misc]: allow user to specify port in distributed setting ( #4914 )
2024-05-20 17:45:06 +00:00
Alexander Matveev
da5a0b539d
Remove marlin warning ( #4918 )
2024-05-20 14:55:34 +00:00
Cyrus Leung
6287537a0c
[Model] LLaVA model refactor ( #4910 )
2024-05-20 08:11:25 +00:00
Woosuk Kwon
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
Alexander Matveev
27ce85476e
[Kernel] Add marlin_24 unit tests ( #4901 )
2024-05-19 11:37:34 -04:00
Cyrus Leung
f68470e803
[Bugfix][Model] Add base class for vision-language models ( #4809 )
2024-05-19 00:13:33 -07:00
SangBin Cho
2e9a2227ec
[Lora] Support long context lora ( #4787 )
...
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through.
It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
alexeykondrat
c0724fc915
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used ( #4658 )
2024-05-18 05:09:11 +00:00
Michael Goin
86b45ae065
[Bugfix] Relax tiktoken to >= 0.6.0 ( #4890 )
2024-05-17 12:58:52 -06:00
Antoni Baum
c5711ef985
[Doc] Update Ray Data distributed offline inference example ( #4871 )
2024-05-17 10:52:11 -07:00
eigenLiu
48d5985a08
Sync huggingface modifications of qwen Moe model ( #4774 )
2024-05-17 09:43:19 -07:00
Jinzhen Lin
33e0823de5
[Bugfix] fix rope error when load models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests ( #4797 )
2024-05-16 20:58:25 -07:00
bofeng huang
0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding ( #4688 )
2024-05-16 18:47:22 -07:00
Kante Yin
8e7fb5d43a
Support to serve vLLM on Kubernetes with LWS ( #4829 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com >
2024-05-16 16:37:29 -07:00
Woosuk Kwon
9a31a817a8
[Bugfix] Fix FP8 KV cache support ( #4869 )
2024-05-16 22:42:29 +00:00
Tyler Michael Smith
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
Silencio
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net >
2024-05-16 11:16:09 -07:00
youkaichao
10fa9eea21
[Misc] remove old comments ( #4866 )
2024-05-16 11:07:41 -07:00
youkaichao
e08188081b
[Core][Distributed] remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
Hongxia Yang
b5853f9963
[ROCm][AMD][Bugfix] adding a missing triton autotune config ( #4845 )
2024-05-16 10:46:52 -07:00
Simon Mo
f09edd8a25
Add JSON output support for benchmark_latency and benchmark_throughput ( #4848 )
2024-05-16 10:02:56 -07:00
Alexander Matveev
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-05-16 12:56:15 -04:00
Pierre Dulac
9216b9cc38
[Bugfix] Bypass authorization API token for preflight requests ( #4862 )
2024-05-16 09:42:21 -07:00
Alex Wu
5e0391c040
[Frontend] Separate OpenAI Batch Runner usage from API Server ( #4851 )
2024-05-17 00:42:41 +09:00
Alex Wu
dbc0754ddf
[docs] Fix typo in examples filename openi -> openai ( #4864 )
2024-05-17 00:42:17 +09:00
Jinzhen Lin
99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
alexm-nm
5c342570d7
Add marlin unit tests and marlin benchmark script ( #4815 )
2024-05-16 09:36:49 -04:00
Cody Yu
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
Co-authored-by: Cade Daniel <cade@anyscale.com >
2024-05-16 00:53:51 -07:00
Aurick Qiao
30e754390c
[Core] Implement sharded state loader ( #4690 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-05-15 22:11:54 -07:00
Alex Wu
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-15 19:13:36 -04:00
Cyrus Leung
fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API ( #4758 )
2024-05-15 14:58:46 -07:00
Zhuohan Li
361c461a12
[Doc] Highlight the fourth meetup in the README ( #4842 )
2024-05-15 11:38:49 -07:00
zifeitong
a5675d348b
[Bugfix] Properly set distributed_executor_backend in ParallelConfig ( #4816 )
2024-05-15 07:22:09 -07:00
Cyrus Leung
e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests ( #4166 )
2024-05-14 23:38:40 -07:00
SangBin Cho
65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API ( #4681 )
...
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill/decode into a single class and allows slicing it when running the attention backend.
It also refactors subquery_start_loc, which was not refactored in the previous PR.
2024-05-15 14:00:10 +09:00
SangBin Cho
8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding ( #3648 )" ( #4820 )
...
Lora 3 & 4 test seems to have illegal memory access failure after this commit;
[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df5 .
2024-05-15 11:52:45 +09:00
Simon Mo
29bc01bf3b
Add 4th meetup announcement to readme ( #4817 )
2024-05-14 18:33:06 -04:00
Nick Hill
676a99982f
[Core] Add MultiprocessingGPUExecutor ( #4539 )
...
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com >
2024-05-14 10:38:59 -07:00
Cyrus Leung
dc72402b57
[Bugfix][Doc] Fix CI failure in docs ( #4804 )
...
This PR fixes the CI failure introduced by #4798 .
The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion.
I have also changed the format of the links to be more distinct from each other.
2024-05-15 01:57:08 +09:00
Kuntai Du
ccb63a8245
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies ( #4696 )
2024-05-14 21:34:33 +09:00
Zhuohan Li
c579b750a0
[Doc] Add meetups to the doc ( #4798 )
2024-05-13 18:48:00 -07:00
Cyrus Leung
4bfa7e7f75
[Doc] Add API reference for offline inference ( #4710 )
2024-05-13 17:47:42 -07:00
Zhuohan Li
ac1fbf7fd2
[Doc] Shorten README by removing supported model list ( #4796 )
2024-05-13 16:23:54 -07:00
Philipp Moritz
33d3914b1e
[Bugfix] Fix dynamic FP8 quantization for Mixtral ( #4793 )
2024-05-13 19:00:27 -04:00
Stephen Krider
1356df53bd
[Kernel] Use flash-attn for decoding ( #3648 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-05-13 15:50:33 -07:00
Cody Yu
ce532ff45c
[Speculative decoding] Improve n-gram efficiency ( #4724 )
2024-05-13 15:00:13 -07:00
Sanger Steel
8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 ( #4208 )
2024-05-13 14:57:07 -07:00
Woosuk Kwon
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
SangBin Cho
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-13 23:50:44 +09:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.)
Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported via relative imports.
2024-05-13 23:50:09 +09:00
youkaichao
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
Swapnil Parekh
a7be4d0072
[CORE] Improvement in ranks code ( #4718 )
2024-05-12 17:47:47 -07:00
Robert Shaw
a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues ( #4713 )
2024-05-12 17:46:31 -07:00
Yikang Shen
6eaccb7353
[Model] Add support for IBM Granite Code models ( #4636 )
2024-05-11 21:27:24 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
youkaichao
4e12131089
[Core][Test] fix function name typo in custom allreduce ( #4750 )
2024-05-10 15:14:40 -07:00
Robert Shaw
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
heeju-kim2
2e7796f2cf
[Speculative decoding] CUDA graph support ( #4295 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-10 17:36:25 +00:00
Allen.Dou
706588a77d
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4729 )
2024-05-11 00:00:56 +09:00
SangBin Cho
6a0f617210
[Core] Fix circular reference which leaked llm instance in local dev env ( #4737 )
...
Storing the exception frame is extremely prone to circular references because it holds references to many objects.
When tensorizer is not installed, the llm instance is leaked because the error frame references various modules, which causes a circular reference problem.
I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.
2024-05-10 23:54:32 +09:00
Steve Grubb
dac6a3f6ed
[Misc] Apply a couple g++ cleanups ( #4719 )
2024-05-10 13:37:05 +00:00
Kunshang Ji
64b77dfd7e
[Core]fix type annotation for swap_blocks ( #4726 )
2024-05-10 21:52:48 +09:00
Simon Mo
51d4094fda
chunked-prefill-doc-syntax ( #4603 )
...
Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html
Co-authored-by: sang <rkooo567@gmail.com >
2024-05-10 14:13:23 +09:00
Allen.Dou
e965d46184
[Misc] Keep only one implementation of the create_dummy_prompt function. ( #4716 )
2024-05-09 21:42:38 -07:00
youkaichao
208b71bcc1
[Core][Distributed] refactor pynccl ( #4591 )
...
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591 )
2024-05-09 19:48:43 -07:00
Cody Yu
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
Philipp Moritz
379da6dcb5
[Kernel] [FP8] Improve FP8 linear layer performance ( #4691 )
...
This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)).
We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4), all FP8 measurements are for dynamic quantization:
qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
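The enlarging trick described above can be sketched as follows — a minimal NumPy illustration of zero-padding the first (batch) dimension so the GEMM library can pick a faster algorithm; the helper name and the threshold of 16 are taken from the description, everything else is an assumption, not the actual vLLM implementation:

```python
import numpy as np

def pad_for_cublaslt(x: np.ndarray, min_m: int = 16) -> np.ndarray:
    """Zero-pad the first dimension of an activation matrix up to
    min_m rows; the extra rows are sliced off after the GEMM."""
    m = x.shape[0]
    if m >= min_m:
        return x
    pad = np.zeros((min_m - m, *x.shape[1:]), dtype=x.dtype)
    return np.concatenate([x, pad], axis=0)

a = np.ones((3, 8), dtype=np.float32)
padded = pad_for_cublaslt(a)
print(padded.shape)  # (16, 8)
```

Since the padding rows are zero, they contribute nothing to the product; only the first `m` rows of the output are kept.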
2024-05-09 16:38:07 -07:00
Hao Zhang
ebce310b74
[Model] Snowflake arctic model implementation ( #4652 )
...
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com >
Co-authored-by: Aurick Qiao <qiao@aurick.net >
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-09 22:37:14 +00:00
Michael Goin
be0c5180ac
[Bugfix] Add logs for all model dtype casting ( #4717 )
2024-05-09 18:36:25 +00:00
Robert Shaw
cea64430f6
[Bugfix] Update grafana.json ( #4711 )
2024-05-09 10:10:13 -07:00
Cyrus Leung
a3c124570a
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4709 )
2024-05-09 09:53:14 -07:00
kliuae
ff5abcd746
[ROCm] Add support for Punica kernels on AMD GPUs ( #3140 )
...
Co-authored-by: miloice <jeffaw99@hotmail.com >
2024-05-09 09:19:50 -07:00
Woosuk Kwon
0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner ( #4705 )
2024-05-09 09:04:59 -07:00
Woosuk Kwon
190bc838e1
[Misc] Remove unnecessary ModelRunner imports ( #4703 )
2024-05-09 00:17:17 -07:00
Cyrus Leung
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
Mahmoud Ashraf
16bc0a098f
[Frontend] add tok/s speed metric to llm class when using tqdm ( #4400 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-08 22:02:31 -07:00
alexm-nm
e288df0632
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin ( #4626 )
2024-05-08 17:14:31 -07:00
Cade Daniel
8b9241be3a
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs ( #4672 )
2024-05-08 23:24:46 +00:00
Cody Yu
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size ( #4592 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-08 21:44:00 +00:00
Woosuk Kwon
89579a201f
[Misc] Use vllm-flash-attn instead of flash-attn ( #4686 )
2024-05-08 13:15:34 -07:00
youkaichao
230c4b38c1
[CI/Test] fix swap test for multi gpu ( #4689 )
2024-05-08 13:14:02 -07:00
youkaichao
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
Antoni Baum
ad932a221d
[Core] Faster startup for LoRA enabled models ( #4634 )
2024-05-08 10:33:18 -07:00
Woosuk Kwon
5510cf0e8a
[Misc] Add get_name method to attention backends ( #4685 )
2024-05-08 09:59:31 -07:00
DefTruth
0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi ( #4573 )
2024-05-08 09:19:58 -07:00
SangBin Cho
f6a593093a
[CI] Make mistral tests pass ( #4596 )
2024-05-08 08:44:35 -07:00
SangBin Cho
d7740ea4dc
[Core] Optimize sampler get_logprobs ( #4594 )
2024-05-08 08:42:28 -07:00
youkaichao
cc466a3290
[Core][Distributed] support cpu&device in broadcast tensor dict ( #4660 )
...
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660 )
2024-05-07 19:34:47 -07:00
leiwen83
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly ( #4551 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Cade Daniel <edacih@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-07 11:40:18 -07:00
youkaichao
469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list ( #4648 )
2024-05-07 11:06:32 -07:00
Austin Veselka
10760da800
[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora ( #4609 )
2024-05-07 10:59:07 -07:00
Alexei-V-Ivanov-AMD
478aed5827
[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. ( #4642 )
2024-05-07 09:23:17 -07:00
youkaichao
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
Philipp Moritz
a98187cf72
[Kernel] Make static FP8 scaling more robust ( #4570 )
...
Previously, FP8 static scaling only worked if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case even if the scales were calibrated very carefully. For example, with the activations in my checkpoint
https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale
(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k ), I'm getting the following mostly random performance on MMLU:
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.2295|± |0.0035|
| - humanities |N/A |none | 5|acc |0.2421|± |0.0062|
| - other |N/A |none | 5|acc |0.2398|± |0.0076|
| - social_sciences|N/A |none | 5|acc |0.2171|± |0.0074|
| - stem |N/A |none | 5|acc |0.2125|± |0.0073|
With the fix in this PR where the scaled activations are clamped between [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7008|± |0.0036|
| - humanities |N/A |none | 5|acc |0.6453|± |0.0065|
| - other |N/A |none | 5|acc |0.7692|± |0.0072|
| - social_sciences|N/A |none | 5|acc |0.8083|± |0.0070|
| - stem |N/A |none | 5|acc |0.6115|± |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
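The clamping fix described above can be sketched in NumPy — the e4m3 max value (448) matches `std::numeric_limits<c10::Float8_e4m3fn>::max()` mentioned in the description, while the function name is hypothetical and this is only an illustration of the idea, not the kernel code:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max finite value of float8_e4m3fn

def scale_and_clamp(x: np.ndarray, scale: float) -> np.ndarray:
    """Scale activations for FP8 quantization, clamping into the
    representable e4m3 range so an underestimated static scale
    cannot overflow into NaN/Inf."""
    return np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
```

Without the clamp, any activation exceeding `scale * 448` would overflow the FP8 range; with it, the value saturates instead, which is what restores the MMLU accuracy in the table above.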
2024-05-06 17:39:28 -07:00
Noam Gat
bd99d22629
Update lm-format-enforcer to 0.10.1 ( #4631 )
2024-05-06 23:51:59 +00:00
Cade Daniel
19cb4716ee
[CI] Add retry for agent lost ( #4633 )
2024-05-06 23:18:57 +00:00
Simon Mo
e186d37cb1
[CI] use ccache actions properly in release workflow ( #4629 )
2024-05-06 22:23:36 +00:00
Cyrus Leung
323f27b904
[Bugfix] Fix asyncio.Task not being subscriptable ( #4623 )
2024-05-06 09:31:05 -07:00
zhaoyang-star
0650e5935b
Disable cuda version check in vllm-openai image ( #4530 )
2024-05-05 16:58:55 -07:00
Simon Mo
c7f2cf2b7f
[CI] Reduce wheel size by not shipping debug symbols ( #4602 )
Create Release / Create Release (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.10, 2.3.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.11, 2.3.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.8, 2.3.0) (push) Has been cancelled
Create Release / Build Wheel (11.8, ubuntu-20.04, 3.9, 2.3.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.10, 2.3.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.11, 2.3.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.8, 2.3.0) (push) Has been cancelled
Create Release / Build Wheel (12.1, ubuntu-20.04, 3.9, 2.3.0) (push) Has been cancelled
2024-05-04 21:28:58 -07:00
Simon Mo
8d8357c8ed
bump version to v0.4.2 ( #4600 )
2024-05-04 17:09:49 -07:00
DearPlanet
4302987069
[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics ( #3937 )
2024-05-04 15:39:34 -07:00
Simon Mo
021b1a2ab7
[CI] check size of the wheels ( #4319 )
2024-05-04 20:44:36 +00:00
Michael Goin
2a052011ca
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) ( #4527 )
...
Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436 .
This PR enables the following checkpoint loading features for Mixtral:
Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
Supports static or dynamic activation quantization with static weight quantization (all per tensor)
Supports different scales for each expert weight
Supports Fp8 in QKV layer
Notes:
The Expert Gate/Router always runs at half / full precision for now.
If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
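The re-quantization step described above can be sketched as follows — a simplified NumPy illustration of collapsing per-tensor scales onto the shared max scale so the fused QKV projection runs as a single GEMM; the helper name and integer-weight representation are assumptions for illustration only:

```python
import numpy as np

def requantize_to_shared_scale(qweights, scales):
    """Given per-tensor-quantized q/k/v weights with distinct scales,
    re-express all of them under the single max scale so the fused
    QKV projection can run as one GEMM."""
    shared = max(scales)
    out = []
    for w, s in zip(qweights, scales):
        deq = w.astype(np.float32) * s       # dequantize with original scale
        out.append(np.round(deq / shared))   # requantize with the shared scale
    return out, shared
```

Using the max scale guarantees no value overflows after re-quantization, at the cost of slightly coarser resolution for the tensors that had smaller scales.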
2024-05-04 11:45:16 -07:00
SangBin Cho
36fb68f947
[Doc] Chunked Prefill Documentation ( #4580 )
2024-05-04 00:18:00 -07:00
Cody Yu
bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData ( #4540 )
2024-05-03 17:47:07 -07:00
youkaichao
344bf7cd2d
[Misc] add installation time env vars ( #4574 )
2024-05-03 15:55:56 -07:00
Cade Daniel
ab50275111
[Speculative decoding] Support target-model logprobs ( #4378 )
2024-05-03 15:52:01 -07:00
Lily Liu
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >
2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck
f8e7adda21
Fix/async chat serving ( #2727 )
2024-05-03 11:04:14 -07:00
Michael Goin
7e65477e5e
[Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None ( #4586 )
2024-05-03 10:32:21 -07:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
youkaichao
2d7bce9cd5
[Doc] add env vars to the doc ( #4572 )
2024-05-03 05:13:49 +00:00
DefTruth
ce3f1eedf8
[Misc] remove chunk detected debug logs ( #4571 )
2024-05-03 04:48:08 +00:00
Yang, Bo
808632d3b4
[BugFix] Prevent the task of _force_log from being garbage collected ( #4567 )
2024-05-03 01:35:18 +00:00
youkaichao
344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups ( #4566 )
2024-05-02 17:32:33 -07:00
SangBin Cho
0f8a91401c
[Core] Ignore infeasible swap requests. ( #4557 )
2024-05-02 14:31:20 -07:00
Alexei-V-Ivanov-AMD
9b5c9f9484
[CI/Build] AMD CI pipeline with extended set of tests. ( #4267 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-05-02 12:29:07 -07:00
Michał Moskal
32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel ( #4405 )
...
Co-authored-by: SangBin Cho <rkooo567@gmail.com >
2024-05-02 11:23:37 -07:00
youkaichao
5b8a7c1cb0
[Misc] centralize all usage of environment variables ( #4548 )
2024-05-02 11:13:25 -07:00
Mark McLoughlin
1ff0c73a79
[BugFix] Include target-device specific requirements.txt in sdist ( #4559 )
2024-05-02 10:52:51 -07:00
Hu Dong
5ad60b0cbd
[Misc] Exclude the tests directory from being packaged ( #4552 )
2024-05-02 10:50:25 -07:00
SangBin Cho
fb087af52e
[mypy][7/N] Cover all directories ( #4555 )
2024-05-02 10:47:41 -07:00
alexm-nm
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
youkaichao
2a85f93007
[Core][Distributed] enable multiple tp group ( #4512 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-05-02 04:28:21 +00:00
SangBin Cho
cf8cac8c70
[mypy][6/N] Fix all the core subdirectory typing ( #4450 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-02 03:01:00 +00:00
Ronen Schaffer
5e401bce17
[CI]Add regression tests to ensure the async engine generates metrics ( #4524 )
2024-05-01 19:57:12 -07:00
SangBin Cho
0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption ( #4451 )
2024-05-01 19:24:13 -07:00
Danny Guinther
b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided ( #4273 )
2024-05-01 17:34:40 -07:00
Woosuk Kwon
826b82a260
[Misc] Fix expert_ids shape in MoE ( #4517 )
2024-05-01 23:47:59 +00:00
Philipp Moritz
c9d852d601
[Misc] Remove Mixtral device="cuda" declarations ( #4543 )
...
Remove the device="cuda" declarations in mixtral as promised in #4343
2024-05-01 16:30:52 -07:00
youkaichao
6ef09b08f8
[Core][Distributed] fix pynccl del error ( #4508 )
2024-05-01 15:23:06 -07:00
Roy
3a922c1e7e
[Bugfix][Core] Fix and refactor logging stats ( #4336 )
2024-05-01 20:08:14 +00:00
sasha0552
c47ba4aaa9
[Bugfix] Add validation for seed ( #4529 )
2024-05-01 19:31:22 +00:00
Philipp Moritz
24bb4fe432
[Kernel] Update fused_moe tuning script for FP8 ( #4457 )
...
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that for the configuration I removed num_warps and num_stages for small batch sizes since that improved performance and brought the benchmarks on par with the numbers before in that regime to make sure this is a strict improvement over the status quo.
All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.
Before this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency
After this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
2024-05-01 11:47:38 -07:00
Nick Hill
a657bfc48a
[Core] Add multiproc_worker_utils for multiprocessing-based workers ( #4357 )
2024-05-01 18:41:59 +00:00
leiwen83
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled ( #4142 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Sage Moore <sagemoore@utexas.edu >
2024-05-01 11:20:32 -07:00
leiwen83
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding ( #4237 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-05-01 11:13:03 -07:00
Travis Johnson
8b798eec75
[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation ( #4534 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-05-01 18:01:50 +00:00
sasha0552
69909126a7
[Bugfix] Use random seed if seed is -1 ( #4531 )
2024-05-01 10:41:17 -07:00
Frαnçois
e491c7e053
[Doc] update(example model): for OpenAI compatible serving ( #4503 )
2024-05-01 10:14:16 -07:00
Robert Shaw
4dc8026d86
[Bugfix] Fix 307 Redirect for /metrics ( #4523 )
2024-05-01 09:14:13 -07:00
AnyISalIn
a88bb9b032
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. ( #4173 )
...
Signed-off-by: AnyISalIn <anyisalin@gmail.com >
2024-05-01 09:11:03 -07:00
SangBin Cho
6f1df80436
[Test] Add ignore_eos test ( #4519 )
2024-05-01 08:45:42 -04:00
Jee Li
d6f4bd7cdd
[Misc]Add customized information for models ( #4132 )
2024-04-30 21:18:14 -07:00
Robert Caulk
c3845d82dc
Allow user to define whitespace pattern for outlines ( #4305 )
2024-04-30 20:48:39 -07:00
Pastel!
a822eb3413
[Misc] fix typo in block manager ( #4453 )
2024-04-30 20:41:32 -07:00
harrywu
f458112e8a
[Misc][Typo] type annotation fix ( #4495 )
2024-04-30 20:21:39 -07:00
Nick Hill
2e240c69a9
[Core] Centralize GPU Worker construction ( #4419 )
2024-05-01 01:06:34 +00:00
fuchen.ljl
ee37328da0
Unable to find Punica extension issue during source code installation ( #4494 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-05-01 00:42:09 +00:00
fuchen.ljl
6ad58f42c5
fix_tokenizer_snapshot_download_bug ( #4493 )
2024-04-30 16:38:50 -07:00
Li, Jiang
dd1a50a8bc
[Bugfix][Minor] Make ignore_eos effective ( #4468 )
2024-04-30 16:33:33 -07:00
Alpay Ariyak
715c2d854d
[Frontend] [Core] Tensorizer: support dynamic num_readers, update version ( #4467 )
2024-04-30 16:32:13 -07:00
Florian Greinacher
a494140433
[Frontend] Support complex message content for chat completions endpoint ( #3467 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-04-30 16:28:46 -07:00
Robert Shaw
111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) ( #4332 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-04-30 21:46:12 +00:00
Prashant Gupta
b31a1fb63c
[Doc] add visualization for multi-stage dockerfile ( #4456 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-30 17:41:59 +00:00
leiwen83
4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor ( #4165 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-04-30 10:12:59 -07:00
Kunshang Ji
26f2fb5113
[Core]Refactor gptq_marlin ops ( #4466 )
2024-04-30 08:14:47 -04:00
Woosuk Kwon
fa32207842
[Bugfix][Kernel] Fix compute_type for MoE kernel ( #4463 )
2024-04-29 22:05:40 -07:00
Michael Goin
d627a3d837
[Misc] Upgrade to torch==2.3.0 ( #4454 )
2024-04-29 20:05:47 -04:00
youkaichao
f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu ( #4444 )
2024-04-29 13:52:22 -07:00
Simon Mo
ac5ccf0156
[CI] hotfix: soft fail neuron test ( #4458 )
2024-04-29 19:50:01 +00:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-04-29 09:35:34 -07:00
SangBin Cho
df29793dc7
[mypy][5/N] Support all typing on model executor ( #4427 )
2024-04-28 19:01:26 -07:00
Simon Mo
03dd7d52bf
[CI] clean docker cache for neuron ( #4441 )
2024-04-28 23:32:07 +00:00
Ronen Schaffer
bf480c5302
Add more Prometheus metrics ( #2764 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-04-28 15:59:33 -07:00
DefTruth
9c7306ac11
[Misc] fix typo in llm_engine init logging ( #4428 )
2024-04-28 18:58:30 +08:00
Robert Shaw
4ea1f9678d
[BugFix] Resolved Issues For LinearMethod --> QuantConfig ( #4418 )
2024-04-27 18:35:33 +00:00
Nick Hill
ba4be44c32
[BugFix] Fix return type of executor execute_model methods ( #4402 )
2024-04-27 11:17:45 -07:00
Prashant Gupta
d6e520e170
[Core] Support offline use of local cache for models ( #4374 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com >
2024-04-27 09:59:55 -07:00
Nick Hill
81661da7b2
[BugFix] Fix min_tokens when eos_token_id is None ( #4389 )
...
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com >
2024-04-27 09:52:46 -07:00
Ruoyu Qin
dfea173148
[Bugfix] Abort requests when the connection to /v1/completions is interrupted ( #4363 )
2024-04-27 09:48:37 -07:00
Roy
7134303cbb
[Bugfix][Core] Fix get decoding config from ray ( #4335 )
2024-04-27 11:30:08 +00:00
Caio Mendes
3da24c2df7
[Model] Phi-3 4k sliding window temp. fix ( #4380 )
2024-04-27 18:08:15 +08:00
Austin Veselka
eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers ( #3524 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-04-27 00:03:48 -07:00
Hongxia Yang
18d23f642a
[ROCm][Hardware][AMD] Enable group query attention for triton FA ( #4406 )
2024-04-26 23:37:40 -07:00
Roy
87f545ba6f
[Misc] Fix logger format typo ( #4396 )
2024-04-27 13:45:02 +08:00
Cyrus Leung
8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API ( #4355 )
2024-04-27 05:08:24 +00:00
Philipp Moritz
12628d3c78
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales ( #4343 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-27 04:49:59 +00:00
Nick Hill
258a2c58d0
[Core] Introduce DistributedGPUExecutor abstract class ( #4348 )
2024-04-27 04:14:26 +00:00
youkaichao
aba47be3fe
[Misc] add RFC issue template ( #4401 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-26 15:47:45 -07:00
Cody Yu
a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method ( #4373 )
2024-04-26 16:41:14 -04:00
SangBin Cho
603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill ( #4309 )
2024-04-26 13:02:02 +00:00
SangBin Cho
a88081bf76
[CI] Disable non-lazy string operation on logging ( #4326 )
...
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com >
2024-04-26 00:16:58 -07:00
Norman Mu
2f30e7c72f
[Frontend] Add --log-level option to api server ( #4377 )
2024-04-26 05:36:01 +00:00
Cyrus Leung
a74dee9b62
[Bugfix] Fix parameter name in get_tokenizer ( #4107 )
2024-04-25 19:10:48 -07:00
Hongxia Yang
cf29b7eda4
[ROCm][Hardware][AMD][Doc] Documentation update for ROCm ( #4376 )
...
Co-authored-by: WoosukKwon <woosuk.kwon@berkeley.edu >
2024-04-25 18:12:25 -07:00
Nick Hill
efffb63f58
[Core] Move function tracing setup to util function ( #4352 )
2024-04-25 16:45:12 -07:00
Nick Hill
15e7c675b0
[Core] Add shutdown() method to ExecutorBase ( #4349 )
2024-04-25 16:32:48 -07:00
Roy
b6dcb4d442
[Misc] Fix flash attention backend log ( #4368 )
2024-04-25 12:43:32 -07:00
SangBin Cho
b5b4a398a7
[Mypy] Typing lora folder ( #4337 )
2024-04-25 19:13:50 +00:00
Kunshang Ji
f4bc4de1b1
[Core]refactor aqlm quant ops ( #4351 )
2024-04-25 15:03:56 -04:00
Caio Mendes
bd7a8eef25
[Doc] README Phi-3 name fix. ( #4372 )
...
Co-authored-by: Caio Mendes <caiocesart@microsoft.com >
2024-04-25 10:32:00 -07:00
Alexei-V-Ivanov-AMD
7ee82bef1e
[CI/Build] Adding functionality to reset the node's GPUs before processing. ( #4213 )
2024-04-25 09:37:20 -07:00
Isotr0py
fbf152d976
[Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 ( #4324 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-25 09:35:56 -07:00
Nick Hill
479d69fad0
[Core] Move ray_utils.py from engine to executor package ( #4347 )
2024-04-25 06:52:22 +00:00
Caio Mendes
96e90fdeb3
[Model] Adds Phi-3 support ( #4298 )
2024-04-25 03:06:57 +00:00
zifeitong
a395a638c2
[Misc] Use public API in benchmark_throughput ( #4300 )
2024-04-24 21:10:24 +00:00
youkaichao
2768884ac4
[Doc] Add note for docker user ( #4340 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-24 21:09:44 +00:00
alexm-nm
aae08249ac
[Bugfix] Fix marlin kernel crash on H100 ( #4218 )
...
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.
2024-04-24 10:35:01 -07:00
Roger Wang
7923dcad12
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark ( #4279 )
2024-04-24 09:49:13 -07:00
youkaichao
3cd9b5bb2d
[Core][Distributed] use existing torch.cuda.device ( #4318 )
...
[Core][Distributed] use existing torch.cuda.device context manager (#4318 )
2024-04-24 09:00:20 -07:00
Woosuk Kwon
468d761b32
[Misc] Reduce supported Punica dtypes ( #4304 )
2024-04-23 18:54:33 -07:00
youkaichao
e4bf860a54
[CI][Build] change pynvml to nvidia-ml-py ( #4302 )
2024-04-23 18:33:12 -07:00
youkaichao
91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl ( #4248 )
2024-04-23 18:32:19 -07:00
Robert Shaw
79a268c4ab
[BUG] fixed fp8 conflict with aqlm ( #4307 )
...
Fixes the fp8 interface, which broke in the AQLM merge.
2024-04-23 18:26:33 -07:00
Philipp Moritz
eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral ( #4244 )
...
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208
It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users do not need to compute activation scales on a calibration dataset, nor convert their model checkpoints; it is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
**Performance**: For this PR, the focus is on keeping the code clean while still achieving reasonable performance. There are a number of optimizations that we will submit as a follow-up PR, which significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954). With this PR, the results are as follows:
(Performance screenshot: https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03)
**Accuracy**: The MMLU accuracy of `mistralai/Mixtral-8x7B-v0.1` with this PR is as follows:
```
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7018|± |0.0036|
| - humanities |N/A |none | 5|acc |0.6472|± |0.0065|
| - other |N/A |none | 5|acc |0.7673|± |0.0072|
| - social_sciences|N/A |none | 5|acc |0.8099|± |0.0070|
| - stem |N/A |none | 5|acc |0.6131|± |0.0083|
```
This compares favorably with the fp16 results, which are:
```
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7020|± |0.1313|
| - humanities |N/A |none | 5|acc |0.6425|± |0.1349|
| - other |N/A |none | 5|acc |0.7744|± |0.1038|
| - social_sciences|N/A |none | 5|acc |0.8131|± |0.0695|
| - stem |N/A |none | 5|acc |0.6108|± |0.1383|
```
Happy hacking!
2024-04-24 01:18:23 +00:00
Cyrus Leung
1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened ( #4292 )
2024-04-23 18:19:03 +00:00
James Fleming
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-04-23 13:59:33 -04:00
Simon Mo
62b5166bd4
[CI] Add ccache for wheel builds job ( #4281 )
2024-04-23 09:51:41 -07:00
youkaichao
d86285a4a4
[Core][Logging] Add last frame information for better debugging ( #4278 )
2024-04-23 09:45:52 -07:00
DefTruth
d87f39e9a9
[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper ( #4286 )
2024-04-23 09:28:35 -07:00
Jack Gordley
d3c8180ac4
[Bugfix] Fixing max token error message for openai compatible server ( #4016 )
2024-04-23 19:06:29 +08:00
Cade Daniel
62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. ( #3951 )
2024-04-23 08:02:36 +00:00
SangBin Cho
050f285ff6
[Core] Scheduling optimization 2 ( #4280 )
2024-04-23 08:02:11 +00:00
Nick Hill
8f2ea22bde
[Core] Some simplification of WorkerWrapper changes ( #4183 )
2024-04-23 07:49:08 +00:00
SangBin Cho
0ae11f78ab
[Mypy] Part 3 fix typing for nested directories for most of directory ( #4161 )
2024-04-22 21:32:44 -07:00
Harry Mellor
34128a697e
Fix autodoc directives ( #4272 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com >
2024-04-23 01:53:01 +00:00
youkaichao
c1b4e4157c
[Core][Distributed] use absolute path for library file ( #4271 )
2024-04-22 17:21:48 -07:00
Zhanghao Wu
ceaf4ed003
[Doc] Update the SkyPilot doc with serving and Llama-3 ( #4276 )
2024-04-22 15:34:31 -07:00
SangBin Cho
ad8d696a99
[Core] Scheduler perf fix ( #4270 )
2024-04-22 21:11:06 +00:00
Harry Mellor
3d925165f2
Add example scripts to documentation ( #4225 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com >
2024-04-22 16:36:54 +00:00
alexm-nm
1543680691
[Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter ( #4217 )
2024-04-22 09:10:48 -07:00
Tao He
077f0a2e8a
[Frontend] Enable support for CPU backend in AsyncLLMEngine. ( #3993 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
2024-04-22 09:19:51 +00:00
Woosuk Kwon
e73ed0f1c6
[Bugfix] Fix type annotations in CPU model runner ( #4256 )
2024-04-22 00:54:16 -07:00
Isotr0py
296cdf8ac7
[Misc] Add vision language model support to CPU backend ( #3968 )
2024-04-22 00:44:16 -07:00
youkaichao
747b1a7147
[Core][Distributed] fix _is_full_nvlink detection ( #4233 )
2024-04-21 23:04:16 -07:00
Hongxia Yang
95e5b087cf
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring ( #4129 )
2024-04-21 21:57:24 -07:00
GeauxEric
a37d815b83
Make initialization of tokenizer and detokenizer optional ( #3748 )
...
Co-authored-by: Yun Ding <yunding@nvidia.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-21 22:06:46 +00:00
xiaoji
7f2593b164
[Doc]: Update the doc of adding new models ( #4236 )
2024-04-21 09:57:08 -07:00
Harry Mellor
fe7d648fe5
Don't show default value for flags in EngineArgs ( #4223 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com >
2024-04-21 09:15:28 -07:00
Noam Gat
cc74b2b232
Updating lm-format-enforcer version and adding links to decoding libraries in docs ( #4222 )
2024-04-20 08:33:16 +00:00
nunjunj
91528575ec
[Frontend] multiple sampling params support ( #3570 )
2024-04-20 00:11:57 -07:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
Initial Results:
Currently tested with Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
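The per-tensor scaling algorithm above can be sketched in a few lines of plain Python. This is a minimal, framework-free illustration of the idea, not vLLM's `Fp8LinearMethod` implementation; the constant `FP8_E4M3_MAX` and the helper names are assumptions made for the example (real fp8 casting also truncates mantissa bits, which the clamp-only version below omits).

```python
# Illustrative sketch of dynamic per-tensor scaling, as described in the
# commit message above. Names and constants are assumptions for the example.

FP8_E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def per_tensor_scale(values):
    """Compute a single scaling factor for the whole tensor."""
    max_abs = max(abs(v) for v in values)
    return max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0

def quantize(values, scale):
    """Scale values into the fp8-representable range (clamping only;
    a real fp8 cast would also reduce mantissa precision)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

def dequantize(q_values, scale):
    return [q * scale for q in q_values]

# Weights: the scale is computed once at load time and stored.
weights = [0.5, -2.0, 3.5, -0.25]
w_scale = per_tensor_scale(weights)
w_q = quantize(weights, w_scale)

# Activations: the scale is recomputed in every forward pass.
activations = [1.0, -4.0, 2.0]
a_scale = per_tensor_scale(activations)
a_q = quantize(activations, a_scale)

# Round-trip check (exact here because we only clamp, not truncate).
print(dequantize(w_q, w_scale))  # -> [0.5, -2.0, 3.5, -0.25]
```

The weight/activation split mirrors the description: weight scales are static after loading, while activation scales stay dynamic per forward pass.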
2024-04-20 04:28:57 +00:00
Harry Mellor
682789d402
Fix missing docs and out of sync EngineArgs ( #4219 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com >
2024-04-19 20:51:33 -07:00
Ayush Rautwar
138485a82d
[Bugfix] Add fix for JSON whitespace ( #4189 )
...
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal >
2024-04-19 20:49:22 -07:00
Chirag Jain
bc9df1571b
Pass tokenizer_revision when getting tokenizer in openai serving ( #4214 )
2024-04-19 17:13:56 -07:00
youkaichao
15b86408a8
[Misc] add nccl in collect env ( #4211 )
2024-04-19 19:44:51 +00:00
Ronen Schaffer
7be4f5628f
[Bugfix][Core] Restore logging of stats in the async engine ( #4150 )
2024-04-19 08:08:26 -07:00
Uranus
8f20fc04bf
[Misc] fix docstrings ( #4191 )
...
Co-authored-by: Zhong Wang <wangzhong@infini-ai.com >
2024-04-19 08:18:33 +00:00
Simon Mo
221d93ecbf
Bump version of 0.4.1 ( #4177 )
2024-04-19 01:00:22 -07:00
Jee Li
d17c8477f1
[Bugfix] Fix LoRA loading check ( #4138 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-04-19 00:59:54 -07:00
Simon Mo
a134ef6f5e
Support eos_token_id from generation_config.json ( #4182 )
2024-04-19 04:13:36 +00:00
youkaichao
8a7a3e4436
[Core] add an option to log every function call to for debugging hang/crash in distributed inference ( #4079 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-18 16:15:12 -07:00
Adam Tilghman
8f9c28fd40
[Bugfix] Fix CustomAllreduce nvlink topology detection ( #3974 )
...
[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974 ) (#4159 )
2024-04-18 15:32:47 -07:00
Liangfu Chen
cd2f63fb36
[CI/CD] add neuron docker and ci test scripts ( #3571 )
2024-04-18 15:26:01 -07:00
Nick Hill
87fa80c91f
[Misc] Bump transformers to latest version ( #4176 )
2024-04-18 14:36:39 -07:00
James Whedbee
e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields ( #4149 )
2024-04-18 21:12:55 +00:00
Simon Mo
705578ae14
[Docs] document that Meta Llama 3 is supported ( #4175 )
2024-04-18 10:55:48 -07:00
Michał Moskal
e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill ( #4128 )
2024-04-18 00:51:28 -07:00
Michael Goin
53b018edcb
[Bugfix] Get available quantization methods from quantization registry ( #4098 )
2024-04-18 00:21:55 -07:00
Harry Mellor
66ded03067
Allow model to be served under multiple names ( #2894 )
...
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai >
2024-04-18 00:16:26 -07:00
youkaichao
6dc1fc9cfe
[Core] nccl integrity check and test ( #4155 )
...
[Core] Add integrity check during initialization; add test for it (#4155 )
2024-04-17 22:28:52 -07:00
SangBin Cho
533d2a1f39
[Typing] Mypy typing part 2 ( #4043 )
...
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local >
2024-04-17 17:28:43 -07:00
Shoichi Uchinami
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA ( #4134 )
2024-04-17 10:02:45 -07:00
Elinx
fe3b5bbc23
[Bugfix] fix output parsing error for trtllm backend ( #4137 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-17 11:07:23 +00:00
youkaichao
8438e0569e
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication ( #4024 )
...
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024 )
2024-04-17 08:34:33 +00:00
Cade Daniel
11d652bd4f
[CI] Move CPU/AMD tests to after wait ( #4123 )
2024-04-16 22:53:26 -07:00
Cade Daniel
d150e4f89f
[Misc] [CI] Fix CI failure caught after merge ( #4126 )
2024-04-16 17:56:01 -07:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine ( #3894 )
2024-04-16 13:09:21 -07:00
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
Noam Gat
05434764cd
LM Format Enforcer Guided Decoding Support ( #3868 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-16 05:54:57 +00:00
SangBin Cho
4e7ee664e2
[Core] Fix engine-use-ray broken ( #4105 )
2024-04-16 05:24:53 +00:00
SangBin Cho
37e84a403d
[Typing] Fix Sequence type GenericAlias only available after Python 3.9. ( #4092 )
2024-04-15 14:47:31 -07:00
Ricky Xu
4695397dcf
[Bugfix] Fix ray workers profiling with nsight ( #4095 )
2024-04-15 14:24:45 -07:00
Sanger Steel
d619ae2d19
[Doc] Add better clarity for tensorizer usage ( #4090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-04-15 13:28:25 -07:00
Nick Hill
eb46fbfda2
[Core] Simplifications to executor classes ( #4071 )
2024-04-15 13:05:09 -07:00
Li, Jiang
0003e9154b
[Misc][Minor] Fix CPU block num log in CPUExecutor. ( #4088 )
2024-04-15 08:35:55 -07:00
Zhuohan Li
e11e200736
[Bugfix] Fix filelock version requirement ( #4075 )
2024-04-14 21:50:08 -07:00
Roy
8db1bf32f8
[Misc] Upgrade triton to 2.2.0 ( #4061 )
2024-04-14 17:43:54 -07:00
Simon Mo
aceb17cf2d
[Docs] document that mixtral 8x22b is supported ( #4073 )
2024-04-14 14:35:55 -07:00
Nick Hill
563c54f760
[BugFix] Fix tensorizer extra in setup.py ( #4072 )
2024-04-14 14:12:42 -07:00
youkaichao
2cd6b4f362
[Core] avoid too many cuda context by caching p2p test ( #4021 )
2024-04-13 23:40:21 -07:00
Sanger Steel
711a000255
[Frontend] [Core] feat: Add model loading using tensorizer ( #3476 )
2024-04-13 17:13:01 -07:00
Jee Li
989ae2538d
[Kernel] Add punica dimension for Baichuan-13B ( #4053 )
2024-04-13 07:55:05 -07:00
zspo
0a430b4ae2
[Bugfix] fix_small_bug_in_neuron_executor ( #4051 )
2024-04-13 07:54:03 -07:00
zspo
ec8e3c695f
[Bugfix] fix_log_time_in_metrics ( #4050 )
2024-04-13 07:52:36 -07:00
youkaichao
98afde19fc
[Core][Distributed] improve logging for init dist ( #4042 )
2024-04-13 07:12:53 -07:00
Dylan Hawk
5c2e66e487
[Bugfix] More type hint fixes for py 3.8 ( #4039 )
2024-04-12 21:07:04 -07:00
youkaichao
546e721168
[CI/Test] expand ruff and yapf for all supported python version ( #4037 )
2024-04-13 01:43:37 +00:00
Jee Li
b8aacac31a
[Bugfix] Fix LoRA bug ( #4032 )
2024-04-12 16:56:37 -07:00
Bellk17
d04973ad54
Fix triton compilation issue ( #3984 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-12 16:41:26 -07:00
youkaichao
fbb9d9eef4
[Core] fix custom allreduce default value ( #4040 )
2024-04-12 16:40:39 -07:00
SangBin Cho
09473ee41c
[mypy] Add mypy type annotation part 1 ( #4006 )
2024-04-12 14:35:50 -07:00
Zhuohan Li
d4ec9ffb95
[Misc] Fix typo in scheduler.py ( #4022 )
2024-04-12 13:56:04 -07:00
youkaichao
96b6a6d790
[Bugfix] fix type hint for py 3.8 ( #4036 )
2024-04-12 19:35:44 +00:00
SangBin Cho
36729bac13
[Test] Test multiple attn backend for chunked prefill. ( #4023 )
2024-04-12 09:56:57 -07:00
Cyrus Leung
7fd3949a0b
[Frontend][Core] Move merge_async_iterators to utils ( #4026 )
2024-04-12 05:30:54 +00:00
Jee Li
1096717ae9
[Core] Support LoRA on quantized models ( #4012 )
2024-04-11 21:02:44 -07:00
Michael Feil
c2b4a1bce9
[Doc] Add typing hints / mypy types cleanup ( #3816 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-04-11 17:17:21 -07:00
Nick Hill
e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids ( #3672 )
2024-04-11 15:34:12 -07:00
Antoni Baum
1e96c3341a
Add extra punica sizes to support bigger vocabs ( #4015 )
2024-04-11 22:18:57 +00:00
Dylan Hawk
95e7d4a97c
Fix echo/logprob OpenAI completion bug ( #3441 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com >
2024-04-11 22:15:50 +00:00
youkaichao
559eb852f8
[Core] init_distributed_environment align with init_process_group( #4014 )
...
[Core][Distributed] make init_distributed_environment compatible with init_process_group (#4014 )
2024-04-11 14:00:48 -07:00
Antoni Baum
a10d3056da
[Core] Set linear_weights directly on the layer ( #3977 )
2024-04-11 16:35:51 -04:00
bigPYJ1151
8afca50889
[Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance ( #3824 )
2024-04-11 11:56:49 -07:00
fuchen.ljl
08ccee1e83
punica fix-bgmv-kernel-640 ( #4007 )
2024-04-11 08:59:26 -07:00
Roger Wang
c1dc547129
[Kernel] Fused MoE Config for Mixtral 8x22 ( #4002 )
2024-04-11 07:50:00 -07:00
youkaichao
f3d0bf7589
[Doc][Installation] delete python setup.py develop ( #3989 )
2024-04-11 03:33:02 +00:00
Kunshang Ji
e9da5a40c6
[Misc] Add indirection layer for custom ops ( #3913 )
2024-04-10 20:26:07 -07:00
SangBin Cho
e42df7227d
[Test] Add xformer and flash attn tests ( #3961 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-11 03:09:50 +00:00
youkaichao
caada5e50a
[Core][Model] torch.compile for layernorm in commandr ( #3985 )
...
[Core][Model] Use torch.compile to accelerate layernorm in commandr (#3985 )
2024-04-11 01:48:26 +00:00
SangBin Cho
67b4221a61
[Core][5/N] Fully working chunked prefill e2e ( #3884 )
2024-04-10 17:56:48 -07:00
youkaichao
63e7176f26
[Core][Refactor] move parallel_utils into vllm/distributed ( #3950 )
...
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950 )
2024-04-10 15:33:30 -07:00
Travis Johnson
934d3662f7
[Bugfix] handle hf_config with architectures == None ( #3982 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-10 22:28:25 +00:00
Frαnçois
92cd2e2f21
[Doc] Fix getting stared to use publicly available model ( #3963 )
2024-04-10 18:05:52 +00:00
Daniel E Marasco
e4c4072c94
[Bugfix] Remove key sorting for guided_json parameter in OpenAi compatible Server ( #3945 )
2024-04-10 10:15:51 -07:00
youkaichao
e35397468f
[Doc] Add doc to state our model support policy ( #3948 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-04-10 17:03:02 +00:00
James Whedbee
8b317c6dd0
[Model][AMD] ROCm support for 256 head dims for Gemma ( #3972 )
2024-04-10 08:12:00 -07:00
Woosuk Kwon
bd3c144e0b
[Bugfix][ROCm] Add numba to Dockerfile.rocm ( #3962 )
2024-04-10 07:37:17 -07:00
Travis Johnson
0258b7a94b
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty ( #3876 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-04-10 01:39:56 -07:00
胡译文
b3104b2a10
[Bugfix] Fix logits processor when prompt_logprobs is not None ( #3899 )
2024-04-10 00:09:36 -07:00
zhaotyer
c2e00af523
[Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable ( #3955 )
...
Co-authored-by: tianyi_zhao <tianyi.zhao@transwarp.io >
2024-04-10 04:49:11 +00:00
Zedong Peng
c013d32c75
[Benchmark] Add cpu options to bench scripts ( #3915 )
2024-04-09 21:30:03 -07:00
Jee Li
11dd6ebb89
[Misc] Avoid loading incorrect LoRA config ( #3777 )
2024-04-09 19:47:15 -07:00
Juan Villamizar
6c0b04515f
[ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm ( #3643 )
...
Co-authored-by: jpvillam <jpvillam@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-09 15:10:47 -07:00
Junichi Sato
e23a43aef8
[Bugfix] Fix KeyError on loading GPT-NeoX ( #3925 )
2024-04-09 12:11:31 -07:00
Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" ( #3837 )
2024-04-09 11:44:15 -07:00
youkaichao
6d592eb430
[Core] separate distributed_init from worker ( #3904 )
2024-04-09 08:49:02 +00:00
Roy
d036198e23
[BugFix][Model] Fix commandr RoPE max_position_embeddings ( #3919 )
2024-04-09 06:17:21 +08:00
Matt Wong
59a6abf3c9
[Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations ( #3782 )
2024-04-08 14:31:02 -07:00
Kiran R
bc0c0192d1
[Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration ( #3767 )
...
Co-authored-by: roy <jasonailu87@gmail.com >
2024-04-08 19:42:35 +00:00
egortolmachev
f46864d68d
[Bugfix] Added Command-R GPTQ support ( #3849 )
...
Co-authored-by: Egor Tolmachev <t333ga@gmail.com >
2024-04-08 14:59:38 +00:00
ywfang
b4543c8f6b
[Model] add minicpm ( #3893 )
2024-04-08 18:28:36 +08:00
Isotr0py
0ce0539d47
[Bugfix] Fix Llava inference with Tensor Parallelism. ( #3883 )
2024-04-07 22:54:13 +08:00
youkaichao
2f19283549
[Core] latency optimization ( #3890 )
2024-04-06 19:14:06 -07:00
youkaichao
95baec828f
[Core] enable out-of-tree model register ( #3871 )
2024-04-06 17:11:41 -07:00
youkaichao
e4be7d70bb
[CI/Benchmark] add more iteration and use median for robust latency benchmark ( #3889 )
2024-04-06 21:32:30 +00:00
Isotr0py
54951ac4bf
[Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism ( #3869 )
2024-04-05 12:02:09 -07:00
SangBin Cho
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. ( #3853 )
2024-04-05 10:17:58 -07:00
Thomas Parnell
1d7c940d74
Add option to completion API to truncate prompt tokens ( #3144 )
2024-04-05 10:15:42 -07:00
Woosuk Kwon
cfaf49a167
[Misc] Define common requirements ( #3841 )
2024-04-05 00:39:17 -07:00
Noam Gat
9edec652e2
[Bugfix] Fixing requirements.txt ( #3865 )
2024-04-04 23:46:01 -07:00
Cade Daniel
e0dd4d3589
[Misc] Fix linter issues in examples/fp8/quantizer/quantize.py ( #3864 )
2024-04-04 21:57:33 -07:00
Cade Daniel
e5043a3e75
[Misc] Add pytest marker to opt-out of global test cleanup ( #3863 )
2024-04-04 21:54:16 -07:00
youkaichao
d03d64fd2e
[CI/Build] refactor dockerfile & fix pip cache
...
[CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels (#3859 )
2024-04-04 21:53:16 -07:00
Sean Gallen
78107fa091
[Doc]Add asynchronous engine arguments to documentation. ( #3810 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-04-04 21:52:01 -07:00
youkaichao
c391e4b68e
[Core] improve robustness of pynccl ( #3860 )
2024-04-04 16:52:12 -07:00
Saurabh Dash
9117f892f0
[Model] Cohere CommandR+ ( #3829 )
2024-04-04 13:31:49 -07:00
Michael Goin
db2a6a41e2
[Hardware][CPU] Update cpu torch to match default of 2.2.1 ( #3854 )
2024-04-04 19:49:49 +00:00
youkaichao
ca81ff5196
[Core] manage nccl via a pypi package & upgrade to pt 2.2.1 ( #3805 )
2024-04-04 10:26:19 -07:00
TianYu GUO
b7782002e1
[Benchmark] Refactor sample_requests in benchmark_throughput ( #3613 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-04 09:56:22 +00:00
Chang Su
819a309c0f
[Bugfix] Fix args in benchmark_serving ( #3836 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-04 07:41:05 +00:00
Matthias Gerstgrasser
aabe8f40f2
[Core] [Frontend] Make detokenization optional ( #3749 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-04-03 21:52:18 -07:00
Woosuk Kwon
498eb5cfa3
[Bugfix] Add kv_scale input parameter to CPU backend ( #3840 )
2024-04-04 04:33:08 +00:00
Michael Feil
537ee25f43
[Core] Enable hf_transfer by default if available ( #3817 )
2024-04-04 04:02:43 +00:00
Tao He
294f8f6665
[BugFix] Pass tokenizer_config to local_tokenizer_group ( #3754 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
2024-04-03 20:31:46 -07:00
Woosuk Kwon
b95047f2da
[Misc] Publish 3rd meetup slides ( #3835 )
2024-04-03 15:46:10 -07:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: HaiShaw <hixiao@gmail.com >
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu >
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com >
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com >
Co-authored-by: guofangze <guofangze@kuaishou.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-03 14:15:55 -07:00
SangBin Cho
3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling ( #3550 )
2024-04-03 14:13:49 -07:00
Michael Feil
c64cf38673
[Doc] Update contribution guidelines for better onboarding ( #3819 )
2024-04-03 07:31:43 +00:00
Robert Shaw
76b889bf1d
[Doc] Update README.md ( #3806 )
2024-04-02 23:11:10 -07:00
Nick Hill
c9b506dad4
[BugFix] Use different mechanism to get vllm version in is_cpu() ( #3804 )
2024-04-02 23:06:25 -07:00
Cade Daniel
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding ( #3706 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com >
2024-04-03 00:40:57 +00:00
youkaichao
a3c226e7eb
[CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary ( #3803 )
2024-04-02 12:57:04 -07:00
Michael Goin
b321d4881b
[Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ ( #3798 )
2024-04-02 12:35:31 -07:00
leiwen83
ad6eca408b
Fix early CUDA init via get_architecture_class_name import ( #3770 )
...
Signed-off-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-04-02 11:56:26 -07:00
youkaichao
205b94942e
[CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build ( #3801 )
2024-04-02 11:54:33 -07:00
Roger Wang
3bec41f41a
[Doc] Fix vLLMEngine Doc Page ( #3791 )
2024-04-02 09:49:37 -07:00
A-Mahla
0739b1947f
[Frontend][Bugfix] allow using the default middleware with a root path ( #3788 )
...
Co-authored-by: A-Mahla <>
2024-04-02 01:20:28 -07:00
bigPYJ1151
77a6572aa5
[HotFix] [CI/Build] Minor fix for CPU backend CI ( #3787 )
2024-04-01 22:50:53 -07:00
bigPYJ1151
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend ( #3634 )
...
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com >
2024-04-01 22:07:30 -07:00
Cade Daniel
eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup ( #3783 )
2024-04-02 00:49:51 +00:00
Qubitium
7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format ( #3689 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-04-01 19:32:01 -04:00
Cade Daniel
93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding ( #3250 )
2024-04-01 22:55:24 +00:00
Roger Wang
ccb58b23e6
[Misc] Fix Benchmark TTFT Calculation for Chat Completions ( #3768 )
2024-04-01 15:24:30 -07:00
Nick Hill
49782fcb76
[Misc] Some minor simplifications to detokenization logic ( #3670 )
...
Some simplifications made for clarity.
Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
2024-04-01 13:22:06 -07:00
Woosuk Kwon
f03cc667a0
[Misc] Minor fixes in requirements.txt ( #3769 )
2024-04-01 10:15:48 +00:00
Robert Shaw
563c1d7ec5
[CI/Build] Make Marlin Tests Green ( #3753 )
2024-03-30 19:18:34 -07:00
youkaichao
9c82a1bec3
[Doc] Update installation doc ( #3746 )
...
[Doc] Update installation doc for build from source and explain the dependency on torch/cuda version (#3746 )
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-30 16:34:38 -07:00
mawong-amd
b6d103542c
[Kernel] Layernorm performance optimization ( #3662 )
2024-03-30 14:26:38 -07:00
Simon Mo
51c31bc10c
CMake build ELF without PTX ( #3739 )
2024-03-30 01:53:08 +00:00
bnellnm
3ad438c66f
Fix build when nvtools is missing ( #3698 )
2024-03-29 18:52:39 -07:00
youkaichao
203d4f82ac
[Core][Bugfix] cache len of tokenizer ( #3741 )
2024-03-29 18:46:39 -07:00
Nick Hill
991143cfcd
[BugFix] Use consistent logger everywhere ( #3738 )
2024-03-29 23:26:44 +00:00
Simon Mo
8b2d3cbc1b
usage lib get version another way ( #3735 )
2024-03-29 15:57:08 -07:00
Hongxia Yang
9765b5c406
[ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic ( #3699 )
2024-03-29 14:52:36 -07:00
Simon Mo
430530fc18
bump version to v0.4.0 ( #3712 )
2024-03-29 12:28:33 -07:00
Roger Wang
97356f3c7e
[Bugfix] Command-R Max Model Length ( #3727 )
2024-03-29 12:27:51 -07:00
Roy
f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error ( #3731 )
2024-03-29 09:38:21 -07:00
Roy
6110c39dc8
[BugFix] Fix tokenizer out of vocab size ( #3685 )
2024-03-29 08:18:59 -07:00
yhu422
d8658c8cc1
Usage Stats Collection ( #2852 )
2024-03-28 22:16:12 -07:00
Simon Mo
7bc94a0fdd
add ccache to docker build image ( #3704 )
2024-03-28 22:14:24 -07:00
youkaichao
756b30a5f3
[Core][Test] move local_rank to the last arg with default value ( #3711 )
...
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711 )
2024-03-28 21:19:45 -07:00
Woosuk Kwon
395aa823ea
[Misc] Minor type annotation fix ( #3716 )
2024-03-28 21:12:24 -07:00
SangBin Cho
26422e477b
[Test] Make model tests run again and remove --forked from pytest ( #3631 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-03-28 21:06:40 -07:00
youkaichao
f342153b48
Revert "bump version to v0.4.0" ( #3708 )
2024-03-28 18:49:42 -07:00
Simon Mo
27a57cad52
bump version to v0.4.0 ( #3705 )
2024-03-28 18:26:51 -07:00
Yile (Michael) Gu
98a42e7078
[Benchmark] Change mii to use persistent deployment and support tensor parallel ( #3628 )
2024-03-28 17:33:52 -07:00
youkaichao
0267fef52a
[Core] fix del of communicator ( #3702 )
2024-03-29 00:24:58 +00:00
Simon Mo
4716a32dd4
fix logging msg for block manager ( #3701 )
2024-03-28 23:29:55 +00:00
Woosuk Kwon
c0935c96d3
[Bugfix] Set enable_prefix_caching=True in prefix caching example ( #3703 )
2024-03-28 16:26:30 -07:00
Woosuk Kwon
cb40b3ab6b
[Kernel] Add MoE Triton kernel configs for A100 40GB ( #3700 )
2024-03-28 15:26:24 -07:00
Roy
515386ef3c
[Core] Support multi-node inference(eager and cuda graph) ( #3686 )
2024-03-28 15:01:55 -07:00
Simon Mo
a4075cba4d
[CI] Add test case to run examples scripts ( #3638 )
2024-03-28 14:36:10 -07:00
Simon Mo
96aa014d1e
fix benchmark format reporting in buildkite ( #3693 )
2024-03-28 14:35:16 -07:00
Adam Boeglin
1715056fef
[Bugfix] Update neuron_executor.py to add optional vision_language_config ( #3695 )
2024-03-28 10:43:34 -07:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update ( #3538 )
2024-03-28 10:06:01 -07:00
Roger Wang
ce567a2926
[Kernel] DBRX Triton MoE kernel H100 ( #3692 )
2024-03-28 10:05:34 -07:00
wenyujin333
d6ea427f04
[Model] Add support for Qwen2MoeModel ( #3346 )
2024-03-28 15:19:59 +00:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability ( #3492 )
2024-03-27 23:59:28 -07:00
Woosuk Kwon
8267b06c30
[Kernel] Add Triton MoE kernel configs for DBRX on A100 ( #3679 )
2024-03-27 22:22:25 -07:00
youkaichao
3492859b68
[CI/Build] update default number of jobs and nvcc threads to avoid overloading the system ( #3675 )
2024-03-28 00:18:54 -04:00
hxer7963
098e1776ba
[Model] Add support for xverse ( #3610 )
...
Co-authored-by: willhe <hexin@xverse.cn >
Co-authored-by: root <root@localhost.localdomain >
2024-03-27 18:12:54 -07:00
Roy
10e6322283
[Model] Fix and clean commandr ( #3671 )
2024-03-28 00:20:00 +00:00
Woosuk Kwon
6d9aa00fc4
[Docs] Add Command-R to supported models ( #3669 )
2024-03-27 15:20:00 -07:00
zeppombal
1182607e18
Add support for Cohere's Command-R model ( #3433 )
...
Co-authored-by: José Maria Pombal <jose.pombal@unbabel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-03-27 14:19:32 -07:00
Roger Wang
45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark ( #3277 )
2024-03-27 13:39:26 -07:00
AmadeusChan
1956931436
[Misc] add the "download-dir" option to the latency/throughput benchmarks ( #3621 )
2024-03-27 13:39:05 -07:00
Megha Agarwal
e24336b5a7
[Model] Add support for DBRX ( #3660 )
2024-03-27 13:01:46 -07:00
youkaichao
d18f4e73f3
[Bugfix] [Hotfix] fix nccl library name ( #3661 )
2024-03-27 17:23:54 +00:00
Woosuk Kwon
82c540bebf
[Bugfix] More faithful implementation of Gemma ( #3653 )
2024-03-27 09:37:18 -07:00
youkaichao
8f44facddd
[Core] remove cupy dependency ( #3625 )
2024-03-27 00:33:26 -07:00
Woosuk Kwon
e66b629c04
[Misc] Minor fix in KVCache type ( #3652 )
2024-03-26 23:14:06 -07:00
Jee Li
76879342a3
[Doc]add lora support ( #3649 )
2024-03-27 02:06:46 +00:00
Jee Li
566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels ( #3636 )
2024-03-27 00:37:42 +00:00
Nick Hill
0dc72273b8
[BugFix] Fix ipv4 address parsing regression ( #3645 )
2024-03-26 14:39:44 -07:00
liiliiliil
a979d9771e
[Bugfix] Fix ipv6 address parsing bug ( #3641 )
2024-03-26 11:58:20 -07:00
Jee Li
8af890a865
Enable more models to run inference with LoRA ( #3382 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-03-25 18:09:31 -07:00
Nick Hill
dfeb2ecc3a
[Misc] Include matched stop string/token in responses ( #2976 )
...
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com >
2024-03-25 17:31:32 -07:00
Antoni Baum
3a243095e5
Optimize _get_ranks in Sampler ( #3623 )
2024-03-25 16:03:02 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
Simon Mo
f408d05c52
hotfix isort on logprobs ranks pr ( #3622 )
2024-03-25 11:55:46 -07:00
Dylan Hawk
0b4997e05c
[Bugfix] API stream returning two stops ( #3450 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com >
2024-03-25 10:14:34 -07:00
Travis Johnson
c13ad1b7bd
feat: implement the min_tokens sampling parameter ( #3124 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-03-25 10:14:26 -07:00
Swapnil Parekh
819924e749
[Core] Adding token ranks along with logprobs ( #3516 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com >
2024-03-25 10:13:10 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
TianYu GUO
e67c295b0c
[Bugfix] fix automatic prefix args and add log info ( #3608 )
2024-03-25 05:35:22 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00
少年
b0dfa91dd7
[Model] Add starcoder2 awq support ( #3569 )
2024-03-24 21:07:36 -07:00
Woosuk Kwon
56a8652f33
Revert "[Bugfix] store lock file in tmp directory ( #3578 )" ( #3599 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-03-24 20:06:50 -07:00
Kunshang Ji
6d93d35308
[BugFix] tensor.get_device() -> tensor.device ( #3604 )
2024-03-24 19:01:13 -07:00
youkaichao
837e185142
[CI/Build] fix flaky test ( #3602 )
2024-03-24 17:43:05 -07:00
youkaichao
42bc386129
[CI/Build] respect the common environment variable MAX_JOBS ( #3600 )
2024-03-24 17:04:00 -07:00
youkaichao
8b268a46a7
[CI] typo fix: is_hip --> is_hip() ( #3595 )
2024-03-24 16:03:06 -07:00
Nick Hill
41deac4a3d
[BugFix] 1D query fix for MoE models ( #3597 )
2024-03-24 16:00:16 -07:00
Woosuk Kwon
af9e53496f
[BugFix] Fix Falcon tied embeddings ( #3590 )
...
Co-authored-by: 44670 <44670@users.noreply.github.com >
2024-03-24 06:34:01 -07:00
Roger Wang
f8a12ecc7f
[Misc] Bump transformers version ( #3592 )
2024-03-24 06:32:45 -07:00
Woosuk Kwon
3c5ab9b811
[Misc] Fix BLOOM copyright notice ( #3591 )
2024-03-23 23:30:56 -07:00
kota-iizuka
743a0b7402
[Bugfix] use SoftLockFile instead of LockFile ( #3578 )
2024-03-23 11:43:11 -07:00
Antoni Baum
bfdb1ba5c3
[Core] Improve detokenization performance for prefill ( #3469 )
...
Co-authored-by: MeloYang <meloyang05@gmail.com >
2024-03-22 13:44:12 -07:00
Thomas Parnell
cf2f084d56
Dynamic scheduler delay to improve ITL performance ( #3279 )
...
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com >
2024-03-22 12:28:14 -07:00
Hanzhi Zhou
f721096d48
[BugFix] Some fixes for custom allreduce kernels ( #2760 )
2024-03-21 23:02:58 -07:00
Zhuohan Li
e90fc21f2e
[Hardware][Neuron] Refactor neuron support ( #3471 )
2024-03-22 01:22:17 +00:00
Roy
ea5f14e6ff
[Bugfix][Model] Fix Qwen2 ( #3554 )
2024-03-22 00:18:58 +00:00
Taemin Lee
b7050ca7df
[BugFix] gemma loading after quantization or LoRA. ( #3553 )
2024-03-21 13:16:57 -07:00
Woosuk Kwon
c188ecb080
[Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config ( #3551 )
...
Co-authored-by: Roy <jasonailu87@gmail.com >
Co-authored-by: Roger Meier <r.meier@siemens.com >
2024-03-21 07:58:12 -07:00
Roy
865732342b
[Misc][Log] Add log for tokenizer length not equal to vocabulary size ( #3500 )
2024-03-21 18:07:48 +08:00
Lalit Pradhan
4c07dd28c0
[ 🚀 Ready to be merged] Added support for Jais models ( #3183 )
2024-03-21 09:45:24 +00:00
SangBin Cho
3bbff9e5ab
Fix 1D query issue from _prune_hidden_states ( #3539 )
2024-03-21 08:49:06 +00:00
ElizaWszola
6ebd02bdef
[PREFIX CACHING FOLLOW UP] OrderedDict-based evictor ( #3431 )
...
Co-authored-by: rsnm2 <rshaw@neuralmagic.com >
Co-authored-by: Luka <luka@paperspace>
2024-03-20 23:20:04 -07:00
Zhuohan Li
523e30ea0c
[BugFix] Hot fix in setup.py for neuron build ( #3537 )
2024-03-20 17:59:52 -07:00
Roy
f1c0fc3919
Migrate logits computation and gather to model_runner ( #3233 )
2024-03-20 23:25:01 +00:00
SangBin Cho
6e435de766
[1/n][Chunked Prefill] Refactor input query shapes ( #3236 )
2024-03-20 14:46:05 -07:00
Antoni Baum
426ec4ec67
[1/n] Triton sampling kernel ( #3186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-03-20 14:45:08 -07:00
James Whedbee
80e254834d
[Bugfix] Fix ROCm support in CMakeLists.txt ( #3534 )
2024-03-20 21:05:03 +00:00
bnellnm
ba8ae1d84f
Check for _is_cuda() in compute_num_jobs ( #3481 )
2024-03-20 10:06:56 -07:00
Allen.Dou
84eaa68425
Abort when nvcc command is not found in the PATH ( #3527 )
2024-03-20 09:28:29 -07:00
Woosuk Kwon
5ee14494e4
[Misc] Remove cache stream and cache events ( #3461 )
2024-03-20 00:38:53 -07:00
Nick Hill
4ad521d8b5
[Core] Add generic typing to LRUCache ( #3511 )
2024-03-20 00:36:09 -07:00
ElizaWszola
9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled ( #3357 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-20 00:11:11 -07:00
Simon Mo
20478c4d3a
Use lru_cache for some environment detection utils ( #3508 )
2024-03-19 21:34:15 +00:00
Jim Burtoft
63e8b28a99
[Doc] minor fix of spelling in amd-installation.rst ( #3506 )
2024-03-19 20:32:30 +00:00
Simon Mo
cc63d03fbb
Revert "[Core] Cache some utils" ( #3507 )
2024-03-19 13:22:58 -07:00
Jim Burtoft
2a60c9bd17
[Doc] minor fix to neuron-installation.rst ( #3505 )
2024-03-19 13:21:35 -07:00
ifsheldon
c614cfee58
Update dockerfile with ModelScope support ( #3429 )
2024-03-19 10:54:59 -07:00
Nick Hill
7341c77d69
[BugFix] Avoid initializing CUDA too early ( #3487 )
2024-03-18 23:05:20 -07:00
Simon Mo
ef65dcfa6f
[Doc] Add docs about OpenAI compatible server ( #3288 )
2024-03-18 22:05:34 -07:00
youkaichao
6a9c583e73
[Core] print error before deadlock ( #3459 )
2024-03-19 04:06:23 +00:00
Antoni Baum
b37cdce2b1
[Core] Cache some utils ( #3474 )
2024-03-18 17:14:26 -07:00
Zhuohan Li
b30880a762
[Misc] Update README for the Third vLLM Meetup ( #3479 )
2024-03-18 15:58:38 -07:00
Antoni Baum
49eedea373
[Core] Zero-copy asdict for InputMetadata ( #3475 )
2024-03-18 22:56:40 +00:00
bnellnm
9fdf3de346
Cmake based build system ( #2830 )
2024-03-18 15:38:33 -07:00
Zhuohan Li
c0c17d4896
[Misc] Fix PR Template ( #3478 )
2024-03-18 15:00:31 -07:00
Robert Shaw
097aa0ea22
[CI/Build] Fix Bad Import In Test ( #3473 )
2024-03-18 20:28:00 +00:00
Cade Daniel
482b0adf1b
[Testing] Add test_config.py to CI ( #3437 )
2024-03-18 12:48:45 -07:00
Simon Mo
8c654c045f
CI: Add ROCm Docker Build ( #2886 )
2024-03-18 19:33:47 +00:00
Woosuk Kwon
9101d832e6
[Bugfix] Make moe_align_block_size AMD-compatible ( #3470 )
2024-03-18 11:26:24 -07:00
Simon Mo
93348d9458
[CI] Shard tests for LoRA and Kernels to speed up ( #3445 )
2024-03-17 14:56:30 -07:00
Woosuk Kwon
abfc4f3387
[Misc] Use dataclass for InputMetadata ( #3452 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-03-17 10:02:46 +00:00
Simon Mo
6b78837b29
Fix setup.py neuron-ls issue ( #2671 )
2024-03-16 16:00:25 -07:00
Simon Mo
120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar ( #3211 )
2024-03-16 13:35:27 -07:00
Simon Mo
8e67598aa6
[Misc] fix line length for entire codebase ( #3444 )
2024-03-16 00:36:29 -07:00
simon-mo
ad50bf4b25
fix lint
2024-03-15 22:23:38 -07:00
Dinghow Yang
cf6ff18246
Fix Baichuan chat template ( #3340 )
2024-03-15 21:02:12 -07:00
Ronen Schaffer
14e3f9a1b2
Replace lstrip() with removeprefix() to fix Ruff linter warning ( #2958 )
2024-03-15 21:01:30 -07:00
Tao He
3123f15138
Fixes the incorrect argument in the prefix-prefill test cases ( #3246 )
2024-03-15 20:58:10 -07:00
youkaichao
413366e9a2
[Misc] PR templates ( #3413 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-15 18:25:51 -07:00
Robert Shaw
10585e035e
Removed Extraneous Print Message From OAI Server ( #3440 )
2024-03-16 00:35:36 +00:00
Antoni Baum
fb96c1e98c
Asynchronous tokenization ( #2879 )
2024-03-15 23:37:01 +00:00
laneeee
8fa7357f2d
fix document error for value and v_vec illustration ( #3421 )
2024-03-15 16:06:09 -07:00
Harry Mellor
a7af4538ca
Fix issue templates ( #3436 )
2024-03-15 21:26:00 +00:00
youkaichao
604f235937
[Misc] add error message in non linux platform ( #3438 )
2024-03-15 21:21:37 +00:00
Tao He
14b8ae02e7
Fixes the misuse/mixuse of time.time()/time.monotonic() ( #3220 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-03-15 18:25:43 +00:00
Dan Clark
03d37f2441
[Fix] Add args for mTLS support ( #3430 )
...
Co-authored-by: declark1 <daniel.clark@ibm.com >
2024-03-15 09:56:13 -07:00
Yang Fan
a7c871680e
Fix tie_word_embeddings for Qwen2. ( #3344 )
2024-03-15 09:36:53 -07:00
Junda Chen
429284dc37
Fix dist.broadcast stall without group argument ( #3408 )
2024-03-14 23:25:05 -07:00
Dinghow Yang
253a98078a
Add chat templates for ChatGLM ( #3418 )
2024-03-14 23:19:22 -07:00
Dinghow Yang
21539e6856
Add chat templates for Falcon ( #3420 )
2024-03-14 23:19:02 -07:00
youkaichao
b522c4476f
[Misc] add HOST_IP env var ( #3419 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-03-14 21:32:52 -07:00
akhoroshev
78b6c4845a
Dynamically configure shared memory size for moe_align_block_size_kernel ( #3376 )
2024-03-14 18:18:07 -07:00
Enrique Shockwave
b983ba35bd
fix marlin config repr ( #3414 )
2024-03-14 16:26:19 -07:00
陈序
54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled ( #3373 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-03-14 13:56:57 -07:00
youkaichao
dfc77408bd
[issue templates] add some issue templates ( #3412 )
2024-03-14 13:16:00 -07:00
Dan Clark
c17ca8ef18
Add args for mTLS support ( #3410 )
...
Co-authored-by: Daniel Clark <daniel.clark@ibm.com >
2024-03-14 13:11:45 -07:00
Thomas Parnell
06ec486794
Install flash_attn in Docker image ( #3396 )
2024-03-14 10:55:54 -07:00
youkaichao
8fe8386591
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 ( #3389 )
2024-03-14 08:11:48 +00:00
Allen.Dou
a37415c31b
allow user to choose which vLLM metrics to display in Grafana ( #3393 )
2024-03-14 06:35:13 +00:00
Simon Mo
81653d9688
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion ( #3383 )
2024-03-13 17:02:21 -07:00
Zhuohan Li
eeab52a4ff
[FIX] Simpler fix for async engine running on ray ( #3371 )
2024-03-13 14:18:40 -07:00
Antoni Baum
c33afd89f5
Fix lint ( #3388 )
2024-03-13 13:56:49 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
Or Sharir
ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. ( #3350 )
2024-03-13 12:18:25 -07:00
陈序
739c350c19
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build ( #3256 )
2024-03-13 09:43:24 -07:00
Hui Liu
ba8dc958a3
[Minor] Fix bias in if to remove ambiguity ( #3259 )
2024-03-13 09:16:55 -07:00
Ronan McGovern
e221910e77
add hf_transfer to requirements.txt ( #3031 )
2024-03-12 23:33:43 -07:00
Bo-Wen Wang
b167109ba1
[Fix] Fix quantization="gptq" when using Marlin ( #3319 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-03-12 22:51:42 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
Breno Faria
49a3c8662b
Fixes #1556 double free ( #3347 )
2024-03-13 00:30:08 +00:00
Sherlock Xu
b0925b3878
docs: Add BentoML deployment doc ( #3336 )
...
Signed-off-by: Sherlock113 <sherlockxu07@gmail.com >
2024-03-12 10:34:30 -07:00
DAIZHENWEI
654865e21d
Support Mistral Model Inference with transformers-neuronx ( #3153 )
2024-03-11 13:19:51 -07:00
kliuae
c9415c19d3
[ROCm] Fix warp and lane calculation in blockReduceSum ( #3321 )
2024-03-11 13:14:07 -07:00
Zhuohan Li
4c922709b6
Add distributed model executor abstraction ( #3191 )
2024-03-11 11:03:45 -07:00
Philipp Moritz
657061fdce
[docs] Add LoRA support information for models ( #3299 )
2024-03-11 00:54:51 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Nick Hill
4b59f00e91
[Fix] Fix best_of behavior when n=1 ( #3298 )
2024-03-10 19:17:46 -07:00
Roy
9e8744a545
[BugFix] Fix get tokenizer when using ray ( #3301 )
2024-03-10 19:17:16 -07:00
Douglas Lehr
e4a28e5316
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA ( #3262 )
2024-03-10 15:27:45 -07:00
Terry
0bba88df03
Enhance lora tests with more layer and rank variations ( #3243 )
2024-03-09 17:14:16 -08:00
Cade Daniel
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling ( #3103 )
2024-03-08 23:32:46 -08:00
Zhuohan Li
f48c6791b7
[FIX] Fix prefix test error on main ( #3286 )
2024-03-08 17:16:14 -08:00
Michael Goin
c2c5e0909a
Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir ( #3241 )
2024-03-08 13:33:10 -08:00
Woosuk Kwon
1cb0cc2975
[FIX] Make flash_attn optional ( #3269 )
2024-03-08 10:52:20 -08:00
Roger Wang
99c3cfb83c
[Docs] Fix Unmocked Imports ( #3275 )
2024-03-08 09:58:01 -08:00
TianYu GUO
1ece1ae829
[Minor Fix] Fix comments in benchmark_serving ( #3252 )
2024-03-07 22:22:59 -08:00
whyiug
c59e120c55
Feature add lora support for Qwen2 ( #3177 )
2024-03-07 21:58:24 -08:00
Nick Hill
d2339d6840
Connect engine healthcheck to openai server ( #3260 )
2024-03-07 16:38:12 -08:00
ElizaWszola
b35cc93420
Fix auto prefix bug ( #3239 )
2024-03-07 16:37:28 -08:00
jacobthebanana
8cbba4622c
Possible fix for conflict between Automated Prefix Caching ( #2762 ) and multi-LoRA support ( #1804 ) ( #3263 )
2024-03-07 23:03:22 +00:00
Michael Goin
385da2dae2
Measure model memory usage ( #3120 )
2024-03-07 11:42:42 -08:00
Woosuk Kwon
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
Chen Wang
cbf4c05b15
Update requirements-dev.txt to include package for benchmarking scripts. ( #3181 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-07 08:39:28 +00:00
TechxGenus
d3c04b6a39
Add GPTQ support for Gemma ( #3200 )
2024-03-07 08:19:14 +08:00
Chujie Zheng
4cb3b924cd
Add tqdm dynamic_ncols=True ( #3242 )
2024-03-06 22:41:42 +00:00
Cade Daniel
a33ce60c66
[Testing] Fix core tests ( #3224 )
2024-03-06 01:04:23 -08:00
SangBin Cho
24aecf421a
[Tests] Add block manager and scheduler tests ( #3108 )
2024-03-05 18:23:34 -08:00
Nick Hill
2efce05dc3
[Fix] Avoid pickling entire LLMEngine for Ray workers ( #3207 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-03-06 00:17:20 +00:00
Nick Hill
8999ec3c16
Store eos_token_id in Sequence for easy access ( #3166 )
2024-03-05 15:35:43 -08:00
Hongxia Yang
05af6da8d9
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs ( #3123 )
...
Co-authored-by: lcskrishna <lollachaitanya@gmail.com >
2024-03-04 18:14:53 -08:00
Chen Wang
9a4548bae7
Fix the openai benchmarking requests to work with latest OpenAI apis ( #2992 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-03-04 15:51:56 -08:00
Antoni Baum
ff578cae54
Add health check, make async Engine more robust ( #3015 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-04 22:01:40 +00:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com >
2024-03-04 19:54:06 +00:00
ttbachyinsda
76e8a70476
[Minor fix] The domain dns.google may cause a socket.gaierror exception ( #3176 )
...
Co-authored-by: guofangze <guofangze@kuaishou.com >
2024-03-04 19:17:12 +00:00
Allen.Dou
9cbc7e5f3b
enable --gpu-memory-utilization in benchmark_throughput.py ( #3175 )
...
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2024-03-04 10:37:58 -08:00
Jialun Lyu
27a7b070db
Add document for vllm paged attention kernel. ( #2978 )
2024-03-04 09:23:34 -08:00
TianYu GUO
901cf4c52b
[Minor Fix] Remove unused code in benchmark_prefix_caching.py ( #3171 )
2024-03-03 22:48:27 -08:00
Liangfu Chen
d0fae88114
[DOC] add setup document to support neuron backend ( #2777 )
2024-03-04 01:03:51 +00:00
Philipp Moritz
17c3103c56
Make it easy to profile workers with nsight ( #3162 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-03-03 16:19:13 -08:00
Zhuohan Li
996d095c54
[FIX] Fix styles in automatic prefix caching & add a automatic prefix caching benchmark ( #3158 )
2024-03-03 14:37:18 -08:00
Jason Cox
d65fac2738
Add vLLM version info to logs and openai API server ( #3161 )
2024-03-02 21:00:29 -08:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-03-02 00:50:01 -08:00
cloudhan
baee28c46c
Reorder kv dtype check to avoid nvcc not found error on AMD platform ( #3104 )
2024-03-02 14:34:48 +08:00
Allen.Dou
29e70e3e88
allow user to choose log level via --log-level instead of fixed 'info'. ( #3109 )
...
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-03-01 23:28:41 +00:00
Woosuk Kwon
82091b864a
Bump up to v0.3.3 ( #3129 )
2024-03-01 12:58:06 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com >
Co-authored-by: alexm <alexm@neuralmagic.com >
2024-03-01 12:47:51 -08:00
Huarong
90fbf12540
fix relative import path of protocol.py ( #3134 )
...
Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com >
2024-03-01 19:42:06 +00:00
Yuan Tang
49d849b3ab
docs: Add tutorial on deploying vLLM model with KServe ( #2586 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-03-01 11:04:14 -08:00
Seonghyeon
27ca23dc00
Remove exclude_unset in streaming response ( #3143 )
2024-03-01 09:59:06 -08:00
Sherry
54d3544784
Fix: Output text is always truncated in some models ( #3016 )
2024-03-01 07:52:22 +00:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server ( #2819 )
...
Co-authored-by: br3no <breno@veltefaria.de >
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-02-29 22:13:08 +00:00
Nick Hill
29a8d6a554
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams ( #3099 )
2024-02-29 19:20:42 +00:00
Billy Cao
2c08ff23c0
Fix building from source on WSL ( #3112 )
2024-02-29 11:13:58 -08:00
Seonghyeon
bfdcfa6a05
Support starcoder2 architecture ( #3089 )
2024-02-29 00:51:48 -08:00
Allen.Dou
9289e577ec
add cache_config's info to prometheus metrics. ( #3100 )
2024-02-29 06:15:18 +00:00
Jae-Won Chung
a6d471c759
Fix: AttributeError in OpenAI-compatible server ( #3018 )
2024-02-28 22:04:07 -08:00
CHU Tianxiang
01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models ( #2330 )
2024-02-28 21:52:23 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
Allen.Dou
e46fa5d52e
Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs ( #3070 )
2024-02-28 05:38:26 +00:00
Ganesh Jagadeesan
a8683102cc
multi-lora documentation fix ( #3064 )
2024-02-27 21:26:15 -08:00
Tao He
71bcaf99e2
Enable GQA support in the prefix prefill kernels ( #3007 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
2024-02-27 01:14:31 -08:00
Woosuk Kwon
8b430d7dea
[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM ( #3046 )
2024-02-26 20:23:50 -08:00
Dylan Hawk
e0ade06d63
Support logit bias for OpenAI API ( #3027 )
2024-02-27 11:51:53 +08:00
Woosuk Kwon
4bd18ec0c7
[Minor] Fix type annotation in fused moe ( #3045 )
2024-02-26 19:44:29 -08:00
Jingru
2410e320b3
fix get_ip error in pure ipv6 environment ( #2931 )
2024-02-26 19:22:16 -08:00
张大成
48a8f4a7fd
Support Orion model ( #2539 )
...
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-02-26 19:17:06 -08:00
Roy
4dd6416faf
Fix stablelm ( #3038 )
2024-02-26 18:31:10 -08:00
Roy
c1c0d00b88
Don't use cupy when enforce_eager=True ( #3037 )
2024-02-26 17:33:38 -08:00
Roy
d9f726c4d0
[Minor] Remove unused config files ( #3039 )
2024-02-26 17:25:22 -08:00
Woosuk Kwon
d6e4a130b0
[Minor] Remove gather_cached_kv kernel ( #3043 )
2024-02-26 15:00:54 -08:00
Philipp Moritz
cfc15a1031
Optimize Triton MoE Kernel ( #2979 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-02-26 13:48:56 -08:00
Jared Moore
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI ( #2918 )
2024-02-26 10:39:34 +08:00
Harry Mellor
ef978fe411
Port metrics from aioprometheus to prometheus_client ( #2730 )
2024-02-25 11:54:00 -08:00
Woosuk Kwon
f7c1234990
[Fix] Fix assertion on YaRN model len ( #2984 )
2024-02-23 12:57:48 -08:00
zhaoyang-star
57f044945f
Fix nvcc not found in vllm-openai image ( #2781 )
2024-02-22 14:25:07 -08:00
Ronen Schaffer
4caf7044e0
Include tokens from prompt phase in counter_generation_tokens ( #2802 )
2024-02-22 14:00:12 -08:00
Woosuk Kwon
6f32cddf1c
Remove Flash Attention in test env ( #2982 )
2024-02-22 09:58:29 -08:00
44670
c530e2cfe3
[FIX] Fix a bug in initializing Yarn RoPE ( #2983 )
2024-02-22 01:40:05 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870
chore(vllm): codespell for spell checking ( #2820 )
2024-02-21 18:56:01 -08:00
Woosuk Kwon
95529e3253
Use Llama RMSNorm custom op for Gemma ( #2974 )
2024-02-21 18:28:23 -08:00
Roy
344020c926
Migrate MistralForCausalLM to LlamaForCausalLM ( #2868 )
2024-02-21 18:25:05 -08:00
Mustafa Eyceoz
5574081c49
Added early stopping to completion APIs ( #2939 )
2024-02-21 18:24:01 -08:00
Ronen Schaffer
d7f396486e
Update comment ( #2934 )
2024-02-21 18:18:37 -08:00
Zhuohan Li
8fbd84bf78
Bump up version to v0.3.2 ( #2968 )
...
This version adds more model support: Gemma models (#2964 ) and OLMo models (#2832 ).
2024-02-21 11:47:25 -08:00
Nick Hill
7d2dcce175
Support per-request seed ( #2514 )
2024-02-21 11:47:00 -08:00
Woosuk Kwon
dc903e70ac
[ROCm] Upgrade transformers to v4.38.0 ( #2967 )
2024-02-21 09:46:57 -08:00
Zhuohan Li
a9c8212895
[FIX] Add Gemma model to the doc ( #2966 )
2024-02-21 09:46:15 -08:00
Woosuk Kwon
c20ecb6a51
Upgrade transformers to v4.38.0 ( #2965 )
2024-02-21 09:38:03 -08:00
Xiang Xu
5253edaacb
Add Gemma model ( #2964 )
2024-02-21 09:34:30 -08:00
Antoni Baum
017d9f1515
Add metrics to RequestOutput ( #2876 )
2024-02-20 21:55:57 -08:00
Antoni Baum
181b27d881
Make vLLM logging formatting optional ( #2877 )
2024-02-20 14:38:55 -08:00
Zhuohan Li
63e2a6419d
[FIX] Fix beam search test ( #2930 )
2024-02-20 14:37:39 -08:00
James Whedbee
264017a2bf
[ROCm] include gfx908 as supported ( #2792 )
2024-02-19 17:58:59 -08:00
Ronen Schaffer
e433c115bc
Fix vllm:prompt_tokens_total metric calculation ( #2869 )
2024-02-18 23:55:41 -08:00
Simon Mo
86fd8bb0ac
Add warning to prevent changes to benchmark api server ( #2858 )
2024-02-18 21:36:19 -08:00
Isotr0py
ab3a5a8259
Support OLMo models. ( #2832 )
2024-02-18 21:05:15 -08:00
Zhuohan Li
a61f0521b8
[Test] Add basic correctness test ( #2908 )
2024-02-18 16:44:50 -08:00
Zhuohan Li
537c9755a7
[Minor] Small fix to make distributed init logic in worker looks cleaner ( #2905 )
2024-02-18 14:39:00 -08:00
Mark Mozolewski
786b7f18a5
Add code-revision config argument for Hugging Face Hub ( #2892 )
2024-02-17 22:36:53 -08:00
jvmncs
8f36444c4f
multi-LoRA as extra models in OpenAI server ( #2775 )
...
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py )):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server will list 3 separate models if the user queries `/models`: one for the base served model, and one for each of the specified LoRA modules. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.
No work has been done here to scope client permissions to specific models.
2024-02-17 12:00:48 -08:00
Nick Hill
185b2c29e2
Defensively copy sampling_params ( #2881 )
...
If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.
Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
2024-02-17 11:18:04 -08:00
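The defensive-copy pattern this commit describes can be sketched as follows. This is a minimal illustration, not vLLM's actual code: `Engine`, `pending`, and this stripped-down `SamplingParams` stand-in are hypothetical names chosen for the example.

```python
import copy
from dataclasses import dataclass, field

# Hypothetical stand-in for vLLM's SamplingParams (illustrative only).
@dataclass
class SamplingParams:
    temperature: float = 1.0
    stop: list = field(default_factory=list)

class Engine:
    """Toy engine illustrating why add_request copies its params."""

    def __init__(self):
        self.pending = {}

    def add_request(self, request_id: str, params: SamplingParams) -> None:
        # Defensive copy: if the caller mutates `params` after this
        # returns, the in-flight (possibly async) request is unaffected.
        self.pending[request_id] = copy.deepcopy(params)

engine = Engine()
params = SamplingParams(temperature=0.7)
engine.add_request("req-1", params)
params.temperature = 0.0  # caller mutates after add_request returns
print(engine.pending["req-1"].temperature)  # → 0.7
```

Without the `deepcopy`, the stored object would alias the caller's, and the late mutation would silently change sampling for the already-submitted request.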
Woosuk Kwon
5f08050d8d
Bump up to v0.3.1 ( #2887 )
2024-02-16 15:05:18 -08:00
shiyi.c_98
64da65b322
Prefix Caching - fix T4 triton error ( #2517 )
2024-02-16 14:17:55 -08:00
Hongxia Yang
5255d99dc5
[ROCm] Dockerfile fix for flash-attention build ( #2885 )
2024-02-15 10:22:39 -08:00
Philipp Moritz
4f2ad11135
Fix DeciLM ( #2883 )
2024-02-14 22:29:57 -08:00
Woosuk Kwon
d7afab6d3a
[BugFix] Fix GC bug for LLM class ( #2882 )
2024-02-14 22:17:44 -08:00
Philipp Moritz
31348dff03
Align LoRA code between Mistral and Mixtral ( fixes #2875 ) ( #2880 )
...
* Fix AttributeError: MixtralModel object has no attribute org_vocab_size.
* Make LoRA logic for Mistral and Mixtral the same
---------
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
2024-02-15 01:00:43 +01:00
Woosuk Kwon
25e86b6a61
Don't use cupy NCCL for AMD backends ( #2855 )
2024-02-14 12:30:44 -08:00
Roy
4efbac6d35
Migrate AquilaForCausalLM to LlamaForCausalLM ( #2867 )
2024-02-14 12:30:24 -08:00
Nikola Borisov
87069ccf68
Fix docker python version ( #2845 )
2024-02-14 10:17:57 -08:00
Woosuk Kwon
7e45107f51
[Fix] Fix memory profiling when GPU is used by multiple processes ( #2863 )
2024-02-13 19:52:34 -08:00
Philipp Moritz
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 ( #2861 )
2024-02-13 18:01:15 -08:00
Philipp Moritz
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM ( #2860 )
...
Co-authored-by: Roy <jasonailu87@gmail.com >
2024-02-13 17:12:05 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
...
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
Philipp Moritz
317b29de0f
Remove Yi model definition, please use LlamaForCausalLM instead ( #2854 )
...
Co-authored-by: Roy <jasonailu87@gmail.com >
2024-02-13 14:22:22 -08:00
Woosuk Kwon
a463c333dd
Use CuPy for CUDA graphs ( #2811 )
2024-02-13 11:32:06 -08:00
Philipp Moritz
ea356004d4
Revert "Refactor llama family models ( #2637 )" ( #2851 )
...
This reverts commit 5c976a7e1a .
2024-02-13 09:24:59 -08:00
Roy
5c976a7e1a
Refactor llama family models ( #2637 )
2024-02-13 00:09:23 -08:00
Simon Mo
f964493274
[CI] Ensure documentation build is checked in CI ( #2842 )
2024-02-12 22:53:07 -08:00
Roger Wang
a4211a4dc3
Serving Benchmark Refactoring ( #2433 )
2024-02-12 22:53:00 -08:00
Rex
563836496a
Refactor 2 awq gemm kernels into m16nXk32 ( #2723 )
...
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net >
2024-02-12 11:02:17 -08:00
Philipp Moritz
4ca2c358b1
Add documentation section about LoRA ( #2834 )
2024-02-12 17:24:45 +01:00
Hongxia Yang
0580aab02f
[ROCm] support Radeon™ 7900 series (gfx1100) without using flash-attention ( #2768 )
2024-02-10 23:14:37 -08:00
Woosuk Kwon
3711811b1d
Disable custom all reduce by default ( #2808 )
2024-02-08 09:58:03 -08:00
SangBin Cho
65b89d16ee
[Ray] Integration compiled DAG off by default ( #2471 )
2024-02-08 09:57:25 -08:00
Philipp Moritz
931746bc6d
Add documentation on how to do incremental builds ( #2796 )
2024-02-07 14:42:02 -08:00
Hongxia Yang
c81dddb45c
[ROCm] Fix build problem resulted from previous commit related to FP8 kv-cache support ( #2790 )
2024-02-06 22:36:59 -08:00
Lily Liu
fe6d09ae61
[Minor] More fix of test_cache.py CI test failure ( #2750 )
2024-02-06 11:38:38 -08:00
liuyhwangyh
ed70c70ea3
modelscope: fix issue when model parameter is not a model id but path of the model. ( #2489 )
2024-02-06 09:57:15 -08:00
Woosuk Kwon
f0d4e14557
Add fused top-K softmax kernel for MoE ( #2769 )
2024-02-05 17:38:02 -08:00
Douglas Lehr
2ccee3def6
[ROCm] Fixup arch checks for ROCM ( #2627 )
2024-02-05 14:59:09 -08:00
Lukas
b92adec8e8
Set local logging level via env variable ( #2774 )
2024-02-05 14:26:50 -08:00
Hongxia Yang
56f738ae9b
[ROCm] Fix some kernels failed unit tests ( #2498 )
2024-02-05 14:25:36 -08:00
Woosuk Kwon
72d3a30c63
[Minor] Fix benchmark_latency script ( #2765 )
2024-02-05 12:45:37 -08:00
whyiug
c9b45adeeb
Require triton >= 2.1.0 ( #2746 )
...
Co-authored-by: yangrui1 <yangrui@lanjingren.com >
2024-02-04 23:07:36 -08:00
Rex
5a6c81b051
Remove eos tokens from output by default ( #2611 )
2024-02-04 14:32:42 -08:00
dancingpipi
51cd22ce56
set&get llm internal tokenizer instead of the TokenizerGroup ( #2741 )
...
Co-authored-by: shujunhua1 <shujunhua1@jd.com >
2024-02-04 14:25:36 -08:00
Massimiliano Pronesti
5ed704ec8c
docs: fix langchain ( #2736 )
2024-02-03 18:17:55 -08:00
Cheng Su
4abf6336ec
Add one example to run batch inference distributed on Ray ( #2696 )
2024-02-02 15:41:42 -08:00
zspo
0e163fce18
Fix default length_penalty to 1.0 ( #2667 )
2024-02-01 15:59:39 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded device="cuda" to support more devices ( #2503 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2024-02-01 15:46:39 -08:00
Pernekhan Utemuratov
c410f5d020
Use revision when downloading the quantization config file ( #2697 )
...
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
2024-02-01 15:41:58 -08:00
Simon Mo
bb8c697ee0
Update README for meetup slides ( #2718 )
2024-02-01 14:56:53 -08:00
Simon Mo
b9e96b17de
fix python 3.8 syntax ( #2716 )
2024-02-01 14:00:58 -08:00
zhaoyang-star
923797fea4
Fix compile error when using rocm ( #2648 )
2024-02-01 09:35:09 -08:00
Fengzhe Zhou
cd9e60c76c
Add Internlm2 ( #2666 )
2024-02-01 09:27:40 -08:00
Robert Shaw
93b38bea5d
Refactor Prometheus and Add Request Level Metrics ( #2316 )
2024-01-31 14:58:07 -08:00
Philipp Moritz
d0d93b92b1
Add unit test for Mixtral MoE layer ( #2677 )
2024-01-31 14:34:17 -08:00
Philipp Moritz
89efcf1ce5
[Minor] Fix test_cache.py CI test failure ( #2684 )
2024-01-31 10:12:11 -08:00
zspo
c664b0e683
fix some bugs ( #2689 )
2024-01-31 10:09:23 -08:00
Tao He
d69ff0cbbb
Fixes assertion failure in prefix caching: the lora index mapping should respect prefix_len ( #2688 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
2024-01-31 18:00:13 +01:00
Zhuohan Li
1af090b57d
Bump up version to v0.3.0 ( #2656 )
2024-01-31 00:07:07 -08:00
Woosuk Kwon
3dad944485
Add quantized mixtral support ( #2673 )
2024-01-30 16:34:10 -08:00
Woosuk Kwon
105a40f53a
[Minor] Fix false warning when TP=1 ( #2674 )
2024-01-30 14:39:40 -08:00
Philipp Moritz
bbe9bd9684
[Minor] Fix a small typo ( #2672 )
2024-01-30 13:40:37 -08:00
Vladimir
4f65af0e25
Add swap_blocks unit tests ( #2616 )
2024-01-30 09:30:50 -08:00
Wen Sun
d79ced3292
Fix 'Actor methods cannot be called directly' when using --engine-use-ray ( #2664 )
...
* fix: engine-use-ray complaint
* fix: typo
2024-01-30 17:17:05 +01:00
Philipp Moritz
ab40644669
Fused MOE for Mixtral ( #2542 )
...
Co-authored-by: chen shen <scv119@gmail.com >
2024-01-29 22:43:37 -08:00
wangding zeng
5d60def02c
DeepseekMoE support with Fused MoE kernel ( #2453 )
...
Co-authored-by: roy <jasonailu87@gmail.com >
2024-01-29 21:19:48 -08:00
Rasmus Larsen
ea8489fce2
ROCm: Allow setting compilation target ( #2581 )
2024-01-29 10:52:31 -08:00
Hanzhi Zhou
1b20639a43
No repeated IPC open ( #2642 )
2024-01-29 10:46:29 -08:00
zhaoyang-star
b72af8f1ed
Fix error when tp > 1 ( #2644 )
...
Co-authored-by: zhaoyang-star <zhao.yang16@zte.com.cn >
2024-01-28 22:47:39 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-01-28 16:43:54 -08:00
Simon Mo
7d648418b8
Update Ray version requirements ( #2636 )
2024-01-28 14:27:22 -08:00
Murali Andoorveedu
89be30fa7d
Small async_llm_engine refactor ( #2618 )
2024-01-27 23:28:37 -08:00
Woosuk Kwon
f8ecb84c02
Speed up Punica compilation ( #2632 )
2024-01-27 17:46:56 -08:00
Woosuk Kwon
5f036d2bcc
[Minor] Fix warning on Ray dependencies ( #2630 )
2024-01-27 15:43:40 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Xiang Xu
220a47627b
Use head_dim in config if exists ( #2622 )
2024-01-27 10:30:49 -08:00
Casper
beb89f68b4
AWQ: Up to 2.66x higher throughput ( #2566 )
2024-01-26 23:53:17 -08:00
Philipp Moritz
390b495ff3
Don't build punica kernels by default ( #2605 )
2024-01-26 15:19:19 -08:00
dakotamahan-stability
3a0e1fc070
Support for Stable LM 2 ( #2598 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-01-26 12:45:19 -08:00
Hongxia Yang
6b7de1a030
[ROCm] add support to ROCm 6.0 and MI300 ( #2274 )
2024-01-26 12:41:10 -08:00
Vladimir
5265631d15
use a correct device when creating OptionalCUDAGuard ( #2583 )
2024-01-25 23:48:17 -08:00
Junyang Lin
2832e7b9f9
fix names and license for Qwen2 ( #2589 )
2024-01-24 22:37:51 -08:00
Simon Mo
3a7dd7e367
Support Batch Completion in Server ( #2529 )
2024-01-24 17:11:07 -08:00
LastWhisper
223c19224b
Fix the syntax error in the doc of supported_models ( #2584 )
2024-01-24 11:22:51 -08:00
Federico Galatolo
f1f6cc10c7
Added include_stop_str_in_output and length_penalty parameters to OpenAI API ( #2562 )
2024-01-24 10:21:56 -08:00
Nikola Borisov
3209b49033
[Bugfix] fix crash if max_tokens=None ( #2570 )
2024-01-23 22:38:55 -08:00
Simon Mo
1e4277d2d1
lint: format all python file instead of just source code ( #2567 )
2024-01-23 15:53:06 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com >
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com >
Co-authored-by: Avnish Narayan <avnish@anyscale.com >
2024-01-23 15:26:37 -08:00
Erfan Al-Hossami
9c1352eb57
[Feature] Simple API token authentication and pluggable middlewares ( #1106 )
2024-01-23 15:13:00 -08:00
Jason Zhu
7a0b011dd5
Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py ( #2553 )
2024-01-22 14:47:25 -08:00
Harry Mellor
63e835cbcc
Fix progress bar and allow HTTPS in benchmark_serving.py ( #2552 )
2024-01-22 14:40:31 -08:00
Junyang Lin
94b5edeb53
Add qwen2 ( #2495 )
2024-01-22 14:34:21 -08:00
Philipp Moritz
ab7e6006d6
Fix https://github.com/vllm-project/vllm/issues/2540 ( #2545 )
2024-01-22 19:02:38 +01:00
Cade Daniel
18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model ( #2424 )
2024-01-21 16:31:47 -08:00
Jannis Schönleber
71d63ed72e
migrate pydantic from v1 to v2 ( #2531 )
2024-01-21 16:05:56 -08:00
Nick Hill
d75c40734a
[Fix] Keep scheduler.running as deque ( #2523 )
2024-01-20 22:36:09 -08:00
Junda Chen
5b23c3f26f
Add group as an argument in broadcast ops ( #2522 )
2024-01-20 16:00:26 -08:00
Simon Mo
00efdc84ba
Add benchmark serving to CI ( #2505 )
2024-01-19 20:20:19 -08:00
Roy
91a61da9b1
[Bugfix] fix load local safetensors model ( #2512 )
2024-01-19 16:26:16 -08:00
Zhuohan Li
ef9b636e2d
Simplify broadcast logic for control messages ( #2501 )
2024-01-19 11:23:30 -08:00
Harry Mellor
2709c0009a
Support OpenAI API server in benchmark_serving.py ( #2172 )
2024-01-18 20:34:08 -08:00
Simon Mo
dd7e8f5f64
refactor completion api for readability ( #2499 )
2024-01-18 16:45:14 -08:00
ljss
d2a68364c4
[BugFix] Fix abort_seq_group ( #2463 )
2024-01-18 15:10:42 -08:00
Nikola Borisov
7e1081139d
Don't download both safetensor and bin files. ( #2480 )
2024-01-18 11:05:53 -08:00
Liangfu Chen
18473cf498
[Neuron] Add an option to build with neuron ( #2065 )
2024-01-18 10:58:50 -08:00
zspo
4df417d059
fix: fix some args desc ( #2487 )
2024-01-18 09:41:44 -08:00
Jason Zhu
5d80a9178b
Minor fix in prefill cache example ( #2494 )
2024-01-18 09:40:34 -08:00
YingchaoX
8a25d3a71a
fix stablelm.py tensor-parallel-size bug ( #2482 )
2024-01-18 09:39:46 -08:00
shiyi.c_98
d10f8e1d43
[Experimental] Prefix Caching Support ( #1669 )
...
Co-authored-by: DouHappy <2278958187@qq.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-01-17 16:32:10 -08:00
FlorianJoncour
14cc317ba4
OpenAI Server refactoring ( #2360 )
2024-01-16 21:33:14 -08:00
Hyunsung Lee
e1957c6ebd
Add StableLM3B model ( #2372 )
2024-01-16 20:32:40 -08:00
Simon Mo
8cd5a992bf
ci: retry on build failure as well ( #2457 )
2024-01-16 12:51:04 -08:00
Simon Mo
947f0b23cc
CI: make sure benchmark script exit on error ( #2449 )
2024-01-16 09:50:13 -08:00
Chenhui Zhang
f780504d12
fix weight loading for GQA with TP ( #2379 )
2024-01-15 15:43:59 -08:00
Simon Mo
bfc072addf
Allow buildkite to retry build on agent lost ( #2446 )
2024-01-15 15:43:15 -08:00
Woosuk Kwon
2a18da257c
Announce the second vLLM meetup ( #2444 )
2024-01-15 14:11:59 -08:00
Simon Mo
6e01e8c1c8
[CI] Add Buildkite ( #2355 )
2024-01-14 12:37:58 -08:00
Roy
9f659bf07f
[Minor] Optimize cuda graph memory usage ( #2437 )
2024-01-14 18:40:51 +01:00
Woosuk Kwon
35c4bc20d9
[Minor] Fix err msg ( #2431 )
2024-01-12 14:02:52 -08:00
陈序
218dc2ccda
Aligning top_p and top_k Sampling ( #1885 )
...
* Align top_p and top_k with huggingface
* remove _get_prompt_and_output_tokens
* rename _apply_top_p_top_k
* compare top_p top_k with hf
* fix test errors
2024-01-12 22:51:03 +01:00
Simon
827cbcd37c
Update quickstart.rst ( #2369 )
2024-01-12 12:56:18 -08:00
Ben
cb7a1c1cbf
Suggest using dtype=half when OOM.
2024-01-12 12:33:29 -08:00
Gary Hui
7878958c0d
Address Phi modeling update 2 ( #2428 )
2024-01-12 12:16:49 -08:00
Chirag Jain
ce036244c9
Allow setting fastapi root_path argument ( #2341 )
2024-01-12 10:59:59 -08:00
陈序
48cf1e413c
fix: deque mutated during iteration in abort_seq_group ( #2371 )
2024-01-12 17:44:18 +01:00
arkohut
97460585d9
Add gradio chatbot for openai webserver ( #2307 )
2024-01-11 19:45:56 -08:00
Zhuohan Li
f745847ef7
[Minor] Fix the format in quick start guide related to Model Scope ( #2425 )
2024-01-11 19:44:01 -08:00
Jiaxiang
6549aef245
[DOC] Add additional comments for LLMEngine and AsyncLLMEngine ( #1011 )
2024-01-11 19:26:49 -08:00
Woosuk Kwon
50376faa7b
Rename phi_1_5 -> phi ( #2385 )
2024-01-11 16:23:43 -08:00
Yunfeng Bai
4b61c6b669
get_ip(): Fix ipv4 ipv6 dualstack (#2408 )
2024-01-10 11:39:58 -08:00
Cade Daniel
79d64c4954
[Speculative decoding 1/9] Optimized rejection sampler ( #2336 )
2024-01-09 15:38:41 -08:00
KKY
74cd5abdd1
Add baichuan chat template jinja file ( #2390 )
2024-01-09 09:13:02 -08:00
Woosuk Kwon
28c3f12104
[Minor] Remove unused code in attention ( #2384 )
2024-01-08 13:13:08 -08:00
Woosuk Kwon
c884819135
Fix eager mode performance ( #2377 )
2024-01-08 10:11:06 -08:00
Nadav Shmayovits
05921a9a7a
Changed scheduler to use deques instead of lists ( #2290 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-01-07 09:48:07 -08:00
Iskren Ivov Chernev
d0215a58e7
Ensure metrics are logged regardless of requests ( #2347 )
2024-01-05 05:24:42 -08:00
Alexandre Payot
937e7b7d7c
Build docker image with shared objects from "build" step ( #2237 )
2024-01-04 09:35:18 -08:00
ljss
aee8ef661a
Minor fix of type hint ( #2340 )
2024-01-03 21:27:56 -08:00
Woosuk Kwon
2e0b6e7757
Bump up to v0.2.7 ( #2337 )
2024-01-03 17:35:56 -08:00
Woosuk Kwon
941767127c
Revert the changes in test_cache ( #2335 )
2024-01-03 17:32:05 -08:00
Ronen Schaffer
74d8d77626
Remove unused const TIMEOUT_TO_PREVENT_DEADLOCK ( #2321 )
2024-01-03 15:49:07 -08:00
Zhuohan Li
fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead ( #2221 )
2024-01-03 11:30:22 -08:00
Ronen Schaffer
1066cbd152
Remove deprecated parameter: concurrency_count ( #2315 )
2024-01-03 09:56:21 -08:00
Woosuk Kwon
6ef00b03a2
Enable CUDA graph for GPTQ & SqueezeLLM ( #2318 )
2024-01-03 09:52:29 -08:00
Roy
9140561059
[Minor] Fix typo and remove unused code ( #2305 )
2024-01-02 19:23:15 -08:00
Jee Li
77af974b40
[FIX] Support non-zero CUDA devices in custom kernels ( #1959 )
2024-01-02 19:09:59 -08:00
Jong-hun Shin
4934d49274
Support GPT-NeoX Models without attention biases ( #2301 )
2023-12-30 11:42:04 -05:00
Zhuohan Li
358c328d69
[BUGFIX] Fix communication test ( #2285 )
2023-12-27 17:18:11 -05:00
Zhuohan Li
4aaafdd289
[BUGFIX] Fix the path of test prompts ( #2273 )
2023-12-26 10:37:21 -08:00
Zhuohan Li
66b108d142
[BUGFIX] Fix API server test ( #2270 )
2023-12-26 10:37:06 -08:00
Zhuohan Li
e0ff920001
[BUGFIX] Do not return ignored sentences twice in async llm engine ( #2258 )
2023-12-26 13:41:09 +08:00
blueceiling
face83c7ec
[Docs] Add "About" Heading to README.md ( #2260 )
2023-12-25 16:37:07 -08:00
Shivam Thakkar
1db83e31a2
[Docs] Update installation instructions to include CUDA 11.8 xFormers ( #2246 )
2023-12-22 23:20:02 -08:00
Woosuk Kwon
a1b9cb2a34
[BugFix] Fix recovery logic for sequence group ( #2186 )
2023-12-20 21:52:37 -08:00
Woosuk Kwon
3a4fd5ca59
Disable Ray usage stats collection ( #2206 )
2023-12-20 21:52:08 -08:00
Ronen Schaffer
c17daa9f89
[Docs] Fix broken links ( #2222 )
2023-12-20 12:43:42 -08:00
Antoni Baum
bd29cf3d3a
Remove Sampler copy stream ( #2209 )
2023-12-20 00:04:33 -08:00
Hanzhi Zhou
31bff69151
Make _prepare_sample non-blocking and use pinned memory for input buffers ( #2207 )
2023-12-19 16:52:46 -08:00
Woosuk Kwon
ba4f826738
[BugFix] Fix weight loading for Mixtral with TP ( #2208 )
2023-12-19 16:16:11 -08:00
avideci
de60a3fb93
Added DeciLM-7b and DeciLM-7b-instruct ( #2062 )
2023-12-19 02:29:33 -08:00
Woosuk Kwon
21d5daa4ac
Add warning on CUDA graph memory usage ( #2182 )
2023-12-18 18:16:17 -08:00
Suhong Moon
290e015c6c
Update Help Text for --gpu-memory-utilization Argument ( #2183 )
2023-12-18 11:33:24 -08:00
kliuae
1b7c791d60
[ROCm] Fixes for GPTQ on ROCm ( #2180 )
2023-12-18 10:41:04 -08:00
JohnSaxon
bbe4466fd9
[Minor] Fix typo ( #2166 )
...
Co-authored-by: John-Saxon <zhang.xiangxuan@oushu.com >
2023-12-17 23:28:49 -08:00
Harry Mellor
08133c4d1a
Add SSL arguments to API servers ( #2109 )
2023-12-18 10:56:23 +08:00
Woosuk Kwon
76a7983b23
[BugFix] Fix RoPE kernel on long sequences ( #2164 )
2023-12-17 17:09:10 -08:00
Woosuk Kwon
8041b7305e
[BugFix] Raise error when max_model_len is larger than KV cache ( #2163 )
2023-12-17 17:08:23 -08:00
Suhong Moon
3ec8c25cd0
[Docs] Update documentation for gpu-memory-utilization option ( #2162 )
2023-12-17 10:51:57 -08:00
Woosuk Kwon
671af2b1c0
Bump up to v0.2.6 ( #2157 )
2023-12-17 10:34:56 -08:00
Woosuk Kwon
6f41f0e377
Disable CUDA graph for SqueezeLLM ( #2161 )
2023-12-17 10:24:25 -08:00
Woosuk Kwon
2c9b638065
[Minor] Fix a typo in .pt weight support ( #2160 )
2023-12-17 10:12:44 -08:00
Antoni Baum
a7347d9a6d
Make sampler less blocking ( #1889 )
2023-12-17 23:03:49 +08:00
Woosuk Kwon
f8c688d746
[Minor] Add Phi 2 to supported models ( #2159 )
2023-12-17 02:54:57 -08:00
Woosuk Kwon
c9fadda543
[Minor] Fix xformers version ( #2158 )
2023-12-17 02:28:02 -08:00
Woosuk Kwon
30fb0956df
[Minor] Add more detailed explanation on quantization argument ( #2145 )
2023-12-17 01:56:16 -08:00
Woosuk Kwon
3a765bd5e1
Temporarily enforce eager mode for GPTQ models ( #2154 )
2023-12-17 01:51:12 -08:00
Woosuk Kwon
26c52a5ea6
[Docs] Add CUDA graph support to docs ( #2148 )
2023-12-17 01:49:20 -08:00
Woosuk Kwon
c3372e87be
Remove dependency on CuPy ( #2152 )
2023-12-17 01:49:07 -08:00
Woosuk Kwon
b0a1d667b0
Pin PyTorch & xformers versions ( #2155 )
2023-12-17 01:46:54 -08:00
Woosuk Kwon
e1d5402238
Fix all-reduce memory usage ( #2151 )
2023-12-17 01:44:45 -08:00
Woosuk Kwon
3d1cfbfc74
[Minor] Delete Llama tokenizer warnings ( #2146 )
2023-12-16 22:05:18 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
...
Co-authored-by: Chen Shen <scv119@gmail.com >
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2023-12-16 21:12:08 -08:00
Roy
eed74a558f
Simplify weight loading logic ( #2133 )
2023-12-16 12:41:23 -08:00
Woosuk Kwon
2acd76f346
[ROCm] Temporarily remove GPTQ ROCm support ( #2138 )
2023-12-15 17:13:58 -08:00
Woosuk Kwon
b81a6a6bb3
[Docs] Add supported quantization methods to docs ( #2135 )
2023-12-15 13:29:22 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Yunfeng Bai
c06170cc8e
Add a flag to include stop string in output text ( #1976 )
2023-12-15 00:45:58 -08:00
Mingcan Xiang
614856da25
Avoid multiple redefinition ( #1817 )
2023-12-14 09:35:58 -08:00
TJian
05bdf4eaf3
Fix Dockerfile.rocm ( #2101 )
...
Co-authored-by: miloice <jeffaw99@hotmail.com >
2023-12-14 00:45:58 -08:00
mezuzza
6774bd50b0
Fix typing in AsyncLLMEngine & add toml to requirements-dev ( #2100 )
2023-12-14 00:19:41 -08:00