youkaichao
bc1bdecebf
[core][distributed] exact ray placement control ( #12732 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 02:03:19 +08:00
Aviv Keshet
b3a0d01e45
[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS ( #12368 )
...
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com >
2025-02-04 18:46:26 -08:00
Lucas Wilkinson
75e94309e8
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) ( #12676 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-02-04 18:22:24 -08:00
Yang Chen
95460fc513
[Kernel] port sgl moe_align_block_size kernels ( #12574 )
...
sgl_moe_align_block_size is based on:
ded9fcd09a
moe_align_block_size is based on:
ba5112ff69
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-03 13:09:50 +08:00
Russell Bryant
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files ( #12628 )
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files as
recommended to the project by the Linux Foundation. These headers
provide a concise way that is both human and machine readable for
communicating license information for each source file. It helps avoid
any ambiguity about the license of the code and can also be easily used
by tools to help manage license compliance.
The Linux Foundation runs license scans against the codebase to help
ensure we are in compliance with the licenses of the code we use,
including dependencies. Having these headers in place helps that tool do
its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
Signed-off-by: Russell Bryant <rbryant@redhat.com >
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 11:58:18 -08:00
Lucas Wilkinson
baeded2569
[Attention] Deepseek v3 MLA support with FP8 compute ( #12601 )
...
This PR implements Deepseek V3 support by performing matrix absorption on the fp8 weights
---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
2025-01-31 21:52:51 -08:00
Lucas Wilkinson
cabaf4eff3
[Attention] MLA decode optimizations ( #12528 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 23:49:37 -08:00
Gregory Shtrasberg
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation ( #11906 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2025-01-23 18:04:03 +00:00
Lucas Wilkinson
978b45f399
[Kernel] Flash Attention 3 Support ( #12093 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-23 06:45:48 -08:00
Nick Hill
aea94362c9
[Frontend][V1] Online serving performance improvements ( #12287 )
2025-01-22 22:22:12 +00:00
Cody Yu
7206ce4ce1
[Core] Support reset_prefix_cache ( #12284 )
2025-01-22 18:52:27 +00:00
Rui Qiao
f218f9c24d
[core] Turn off GPU communication overlap for Ray executor ( #12051 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-01-15 05:19:55 +00:00
Woosuk Kwon
371d04d39b
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling ( #11394 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-27 09:32:38 +09:00
youkaichao
88a412ed3d
[torch.compile] fast inductor ( #11108 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 16:15:22 -08:00
Russell Bryant
4863e5fba5
[Core] V1: Use multiprocessing by default ( #11074 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-13 16:27:32 -08:00
Rui Qiao
72ff3a9686
[core] Bump ray to use _overlap_gpu_communication in compiled graph tests ( #10410 )
...
Signed-off-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
2024-12-11 11:36:35 -08:00
youkaichao
75f89dc44c
[torch.compile] add a flag to track batchsize statistics ( #11059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-10 12:40:52 -08:00
youkaichao
1b62745b1d
[core][executor] simplify instance id ( #10976 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 09:33:45 -08:00
Daniele
e4c34c23de
[CI/Build] improve python-only dev setup ( #9621 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-04 21:48:13 +00:00
youkaichao
6e9ff050c8
[misc] do not read HOST_IP ( #10644 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 17:04:50 -08:00
Cyrus Leung
3430857b64
[Misc] Increase default video fetch timeout ( #10495 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 23:06:42 -08:00
youkaichao
803f37eaaa
[6/N] torch.compile rollout to users ( #10437 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:09:03 -08:00
youkaichao
4fd9375028
[2/N][torch.compile] make compilation cfg part of vllm cfg ( #10383 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 18:02:14 -08:00
Robert Shaw
6ace6fba2c
[V1] AsyncLLM Implementation ( #9826 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-11 23:05:38 +00:00
Luka Govedič
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-11-08 21:20:08 +00:00
litianjian
28b2877d30
Online video support for VLMs ( #10020 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 20:25:59 +00:00
youkaichao
ff5ed6e1bc
[torch.compile] rework compile control with piecewise cudagraph ( #9715 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 23:03:49 -07:00
Flex Wang
29061ed9df
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepended to all logging messages ( #9590 )
2024-10-23 11:17:28 +08:00
Woosuk Kwon
6c5af09b39
[V1] Implement vLLM V1 [1/N] ( #9289 )
2024-10-22 01:24:07 -07:00
Thomas Parnell
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores ( #9497 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-18 17:55:48 -07:00
Luka Govedič
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism ( #9300 )
2024-10-17 18:36:37 +00:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
...
Removes block manager v1. This is the first piece of a prefix-caching-centric design: to get there, the code path must be simplified so that only the v2 block manager is used (it performs much better on prefix caching).
2024-10-17 11:38:15 -05:00
Lucas Wilkinson
e312e52b44
[Kernel] Add Exllama as a backend for compressed-tensors ( #9395 )
2024-10-17 09:48:26 -04:00
youkaichao
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
sroy745
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
youkaichao
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
youkaichao
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
Sergey Shlyapnikov
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
youkaichao
18e60d7d13
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag ( #8911 )
2024-09-27 14:27:56 -07:00
youkaichao
1e7d5c01f5
[misc] soft drop beam search ( #8763 )
2024-09-24 15:48:39 -07:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-18 13:56:58 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
youkaichao
47790f3e32
[torch.compile] add a flag to disable custom op ( #8488 )
2024-09-14 13:07:16 -07:00
youkaichao
f842a7aff1
[misc] remove engine_use_ray ( #8126 )
2024-09-11 18:23:36 -07:00
bnellnm
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2024-09-11 12:52:19 -07:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-09-05 18:10:33 -07:00
Lily Liu
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling ( #7244 )
2024-09-01 21:23:29 -07:00
Avshalom Manevich
34a0e96d46
[Kernel] changing fused moe kernel chunk size default to 32k ( #7995 )
2024-08-30 04:11:39 +00:00
youkaichao
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead ( #7898 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-28 16:10:12 -07:00
rasmith
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ ( #7386 )
2024-08-28 15:37:47 -04:00