Compare commits
377 Commits
v0.11.1rc1
...
v0.11.1rc4
| Author | SHA1 | Date | |
|---|---|---|---|
| | f257544709 | | |
| | 0b51c9bd8b | | |
| | d3ab240f39 | | |
| | 94666612a9 | | |
| | 4fe5895361 | | |
| | 111faf1118 | | |
| | 6afc28a9ba | | |
| | 141e6a0505 | | |
| | 130aa8cbcf | | |
| | e3d8186666 | | |
| | f5710ef02a | | |
| | a8c02fb5bf | | |
| | 02af36df36 | | |
| | e88bdd60d9 | | |
| | 05e034f085 | | |
| | 936643a868 | | |
| | b186149e8e | | |
| | 2abbd351ef | | |
| | 446912d1cb | | |
| | a00d6254e9 | | |
| | 05181cc57f | | |
| | 259504e147 | | |
| | 0484b64248 | | |
| | f58d9b6404 | | |
| | 44b5ce956d | | |
| | 7a865f2325 | | |
| | 2fa90bda27 | | |
| | 0291fbf65c | | |
| | b46e4a06f1 | | |
| | d34f5fe939 | | |
| | bdb01a38fe | | |
| | 5b3c35a68e | | |
| | 61fbfe5274 | | |
| | 255e34ca50 | | |
| | a8d2e326ec | | |
| | 53a56e658b | | |
| | 69f064062b | | |
| | 921e78f4bb | | |
| | 6ebffafbb6 | | |
| | 3b96f85c36 | | |
| | 23ad820553 | | |
| | 5d3be3ba4c | | |
| | 4f882be4a0 | | |
| | 9273754222 | | |
| | f4e8154076 | | |
| | a663f6ae64 | | |
| | a4fc21895e | | |
| | a3e8611da5 | | |
| | 7c2bdb83dc | | |
| | 9932ed6a83 | | |
| | 2d631d28c6 | | |
| | b368382964 | | |
| | a806c14cc7 | | |
| | 181bf5bbde | | |
| | cbd5e07a51 | | |
| | 63b22e0dbb | | |
| | 5980604c44 | | |
| | 361a7463d3 | | |
| | 720af6ab79 | | |
| | 55cba4a05c | | |
| | c7abff2990 | | |
| | 71b1c8b667 | | |
| | 8fb7b2fab9 | | |
| | be7b55a83d | | |
| | 315b860abe | | |
| | 87c41c26ad | | |
| | 65d2cf9511 | | |
| | d63cd9ff10 | | |
| | 66a168a197 | | |
| | a99564ac5b | | |
| | 4c5f632165 | | |
| | b853540388 | | |
| | 56ed7609a9 | | |
| | 29c9cb8007 | | |
| | 83f478bb19 | | |
| | 269c4db0a4 | | |
| | 52efc34ebf | | |
| | d95d0f4b98 | | |
| | 0402428200 | | |
| | 17af6aa0da | | |
| | fc168c33f3 | | |
| | acc78aeb88 | | |
| | 0f67d4d962 | | |
| | 7e1d697b56 | | |
| | 699d62e6cf | | |
| | cd390b609d | | |
| | 2080b05099 | | |
| | 6454afec90 | | |
| | 41a62564a7 | | |
| | 284cc92275 | | |
| | 435be10db9 | | |
| | b7030d962b | | |
| | 3567816932 | | |
| | e0ef8a2920 | | |
| | 42efe609ba | | |
| | 88d3141ec6 | | |
| | 09a6a49eaf | | |
| | 074475541a | | |
| | d4c574c39f | | |
| | c528b9006a | | |
| | 85fee74b33 | | |
| | 8dbe0c527f | | |
| | 5cc6bddb6e | | |
| | 1f9460c4c1 | | |
| | 70022ffc00 | | |
| | f417746ad7 | | |
| | 0552cfb195 | | |
| | 51dd14ac2b | | |
| | dbfbf9f324 | | |
| | ca76486a16 | | |
| | a9f55dc588 | | |
| | 81d5bb765a | | |
| | 0825197bee | | |
| | 9ef3d5b875 | | |
| | 295c7f0267 | | |
| | 3fa2c12185 | | |
| | fe2016de2d | | |
| | 237cf6d32a | | |
| | faee3ccdc2 | | |
| | 570c3e1cd4 | | |
| | 3a4255c7c4 | | |
| | 61089465a6 | | |
| | 88afa11010 | | |
| | d00ce29d89 | | |
| | 3b7bdf983b | | |
| | 50b788a17a | | |
| | fc059c7061 | | |
| | bfb240cc49 | | |
| | e255d92990 | | |
| | 3729ed00ba | | |
| | 6644796bf4 | | |
| | ff93cc8c84 | | |
| | 243ed7d32e | | |
| | 7e0941055f | | |
| | 6738e4a093 | | |
| | 2566dca2a9 | | |
| | b4fda58a2d | | |
| | a0003b56b0 | | |
| | 5beacce2ea | | |
| | 8669c69afa | | |
| | 1651003c35 | | |
| | 1cb8c6c5fe | | |
| | e05a6754a8 | | |
| | 084a9dae80 | | |
| | c9461e05a4 | | |
| | 4dfdb821c8 | | |
| | 58fab50d82 | | |
| | db6f28d898 | | |
| | 14e2f1231e | | |
| | 7c4767f1eb | | |
| | 9771e0b432 | | |
| | 980de31ca0 | | |
| | 1c160841ea | | |
| | 4ca13a8667 | | |
| | 675aa2ec64 | | |
| | 3ae082c373 | | |
| | 49c00fe304 | | |
| | 141d3b9fc5 | | |
| | abf3db40ef | | |
| | 8e4ca4d14e | | |
| | 1a0f4defb7 | | |
| | 843af7f7fc | | |
| | 1f633b8632 | | |
| | a4c29e6e82 | | |
| | 8f18feb191 | | |
| | ed540d6d4c | | |
| | f6027b2855 | | |
| | ab3e80042e | | |
| | ceacedc1f9 | | |
| | bfa59be8f1 | | |
| | 265ecb05fb | | |
| | 09a7e6f617 | | |
| | 6c2eef5a5d | | |
| | 19748806f0 | | |
| | 4a8a567e16 | | |
| | 344a0017c0 | | |
| | becb7de40b | | |
| | 250fb1b8ea | | |
| | 647214f3d5 | | |
| | ddeec11ba9 | | |
| | 86ed77022d | | |
| | aa1356ec53 | | |
| | ecc3c0940a | | |
| | ba09652de2 | | |
| | bd66b8529b | | |
| | 6c728f7771 | | |
| | 80e9452984 | | |
| | c3a2c6ac5f | | |
| | 72f431e709 | | |
| | be4445072c | | |
| | f381cf2302 | | |
| | 5ff5d94e77 | | |
| | f95da13c3d | | |
| | aef368aa08 | | |
| | 5f6cbf60d6 | | |
| | 3ada34f9cb | | |
| | 0eb8f2b880 | | |
| | 163965d183 | | |
| | a03cf9bc70 | | |
| | 352c0c8a28 | | |
| | bfe0b4bd2a | | |
| | 58fbbcb2f5 | | |
| | 87778d5f00 | | |
| | f9e7ad5400 | | |
| | 4d0f266113 | | |
| | e93ff6c8b9 | | |
| | 1c691f4a71 | | |
| | 9fce7bee74 | | |
| | b63f2143f8 | | |
| | f32bf7582e | | |
| | 8a81d776ce | | |
| | f6fdacd82c | | |
| | d31f7844f8 | | |
| | 7a6c8c3fa1 | | |
| | 221bf72577 | | |
| | b3aba04e5a | | |
| | 8a297115e2 | | |
| | 191eed0bb9 | | |
| | fb860670da | | |
| | 83e760c57d | | |
| | c2bba69065 | | |
| | e133d6d218 | | |
| | a1946c9f61 | | |
| | 9f020f4f31 | | |
| | 3b45075206 | | |
| | 168e578efc | | |
| | 6ac5e06f7c | | |
| | 5c2acb270a | | |
| | b26b70bec4 | | |
| | ab4be40fc5 | | |
| | 245e4f2c01 | | |
| | 1d165d6d85 | | |
| | 83004020fd | | |
| | 12e21701e7 | | |
| | 30a33b92ee | | |
| | 7c572544e4 | | |
| | c312320764 | | |
| | c981f0ea78 | | |
| | 6367bde739 | | |
| | f50cc221ea | | |
| | acedc74b1a | | |
| | d29483b58a | | |
| | 950cf9e58e | | |
| | 3125d79950 | | |
| | e33ee23ee3 | | |
| | b10c64c834 | | |
| | 0925b28a8e | | |
| | 99722d5f0e | | |
| | 4c91a28e30 | | |
| | b038d9c40c | | |
| | 2ba60ec7fe | | |
| | bd7157a071 | | |
| | be429d0cfd | | |
| | c253745eb8 | | |
| | daec4d2624 | | |
| | 6c9fdbf725 | | |
| | 483ea64611 | | |
| | e20eba753b | | |
| | bbc1b29665 | | |
| | acb1bfa601 | | |
| | 75c7ad9918 | | |
| | 5550ff9c25 | | |
| | 3aeb19a39e | | |
| | 8c017b3490 | | |
| | 9c2c2287a0 | | |
| | fec2b341ad | | |
| | 87bc0c492f | | |
| | fe3b9372ad | | |
| | bde9e2272a | | |
| | 08405609cc | | |
| | ab81379ea6 | | |
| | 4ffd6e8942 | | |
| | 965c5f4914 | | |
| | 4d055ef465 | | |
| | 17c540a993 | | |
| | 4d4d6bad19 | | |
| | 11ae016bd7 | | |
| | 41d3071918 | | |
| | fb5e10d3fb | | |
| | b2f78cbad4 | | |
| | 23583ee28c | | |
| | 01c977e96d | | |
| | b3dda72c23 | | |
| | fb0571b077 | | |
| | 2ed8b6b3d0 | | |
| | 013abde6ef | | |
| | a5464dcf92 | | |
| | ac3ed5a815 | | |
| | e6ba2000ae | | |
| | aa255ff55a | | |
| | 7bb736d00e | | |
| | 9f4e30904b | | |
| | 5afd3276df | | |
| | 43721bc67f | | |
| | 02d709a6f1 | | |
| | 4a510ab487 | | |
| | 314fa8abbf | | |
| | 334535b6fb | | |
| | dcbb3f1871 | | |
| | 00417f4e44 | | |
| | ed344f4116 | | |
| | e51928793e | | |
| | d2740fafbf | | |
| | 17838e50ef | | |
| | 44c8555621 | | |
| | f7d318de2b | | |
| | 76f0d05bc6 | | |
| | 7d8975de84 | | |
| | 785d8b6410 | | |
| | f6cdc9a02f | | |
| | 509cdc0370 | | |
| | 9b6504c307 | | |
| | e19b16dde6 | | |
| | 582f2c6be7 | | |
| | f8a0acbdbe | | |
| | 1317034379 | | |
| | 0ecc553ee6 | | |
| | f96bc3649c | | |
| | 938c43ea7f | | |
| | 0a9ef0cfce | | |
| | e5b438a247 | | |
| | 0b99f5d302 | | |
| | 1f491aa0c8 | | |
| | de92d916fe | | |
| | a1063628a4 | | |
| | d796375258 | | |
| | 14f8456344 | | |
| | 4794c2bd92 | | |
| | d3cbaa08dc | | |
| | 828523ad8e | | |
| | 136a17fe6e | | |
| | f57438338d | | |
| | 5d598680e3 | | |
| | 8f4b313c37 | | |
| | f93e348010 | | |
| | f54f85129e | | |
| | d4d1a6024f | | |
| | db1764e4e0 | | |
| | 7f83b4ee8e | | |
| | 5c3bae1a6a | | |
| | 5210dc3940 | | |
| | 650b51f9f9 | | |
| | 6256697997 | | |
| | 71557a5f7c | | |
| | f3c378ffa7 | | |
| | f5ed68ef63 | | |
| | efdef57b1f | | |
| | b8a4572157 | | |
| | 302ef403a2 | | |
| | 8865da157b | | |
| | f0862eae43 | | |
| | 8c851f6d04 | | |
| | 7cfa420f49 | | |
| | a27b288e4a | | |
| | e471d7ca7e | | |
| | c43ca8259e | | |
| | 85a65e7f51 | | |
| | a2986b3e33 | | |
| | 96b9aa5aa0 | | |
| | e66d787bce | | |
| | bfad142e25 | | |
| | 9354660036 | | |
| | 07ca70af8d | | |
| | 2dcd12d357 | | |
| | 579d2e5458 | | |
| | 0512c04aee | | |
| | 7e0ef4084a | | |
| | 4aed506b65 | | |
| | a86b4c58e8 | | |
| | ff4810ba73 | | |
| | 9d6964926e | | |
| | 0e65818910 | | |
| | 380f17527c | | |
| | b92ab3deda | | |
| | acaa2c0a4a | | |
| | 82af928c41 | | |
| | 87efc681db | | |
12
.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
Normal file
@@ -0,0 +1,12 @@
|
|||||||
|
# For vllm script, with -t option (tensor parallel size).
|
||||||
|
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
|
||||||
|
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
|
||||||
|
tasks:
|
||||||
|
- name: "gsm8k"
|
||||||
|
metrics:
|
||||||
|
- name: "exact_match,strict-match"
|
||||||
|
value: 0.419
|
||||||
|
- name: "exact_match,flexible-extract"
|
||||||
|
value: 0.416
|
||||||
|
limit: 1000
|
||||||
|
num_fewshot: 5
|
||||||
@@ -0,0 +1,12 @@
|
|||||||
|
# For vllm script, with -t option (tensor parallel size).
|
||||||
|
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 100 -t 8
|
||||||
|
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
|
||||||
|
backend: "vllm-vlm"
|
||||||
|
tasks:
|
||||||
|
- name: "chartqa"
|
||||||
|
metrics:
|
||||||
|
- name: "relaxed_accuracy,none"
|
||||||
|
# TODO(zhewenl): model card is 0.90, but the actual score is 0.80.
|
||||||
|
value: 0.80
|
||||||
|
limit: 100
|
||||||
|
num_fewshot: 0
|
||||||
@@ -0,0 +1,10 @@
|
|||||||
|
# For vllm script, with -t option (tensor parallel size).
|
||||||
|
# bash .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 250 -t 8 -f 5
|
||||||
|
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
|
||||||
|
tasks:
|
||||||
|
- name: "mmlu_pro"
|
||||||
|
metrics:
|
||||||
|
- name: "exact_match,custom-extract"
|
||||||
|
value: 0.80
|
||||||
|
limit: 250 # will run on 250 * 14 subjects = 3500 samples
|
||||||
|
num_fewshot: 5
|
||||||
@@ -1,4 +1,5 @@
|
|||||||
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
|
# For vllm script, with -t option (tensor parallel size)
|
||||||
|
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -l 1319 -t 1
|
||||||
model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
|
model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
|
||||||
tasks:
|
tasks:
|
||||||
- name: "gsm8k"
|
- name: "gsm8k"
|
||||||
|
|||||||
@@ -0,0 +1,12 @@
|
|||||||
|
# For vllm script, with -t option (tensor parallel size).
|
||||||
|
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m Qwen/Qwen2.5-VL-7B-Instruct -l 2500 -t 1
|
||||||
|
|
||||||
|
model_name: "Qwen/Qwen2.5-VL-7B-Instruct"
|
||||||
|
backend: "vllm-vlm"
|
||||||
|
tasks:
|
||||||
|
- name: "chartqa"
|
||||||
|
metrics:
|
||||||
|
- name: "relaxed_accuracy,none"
|
||||||
|
value: 0.855
|
||||||
|
limit: 2500
|
||||||
|
num_fewshot: 0
|
||||||
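The configs above pin an expected value per metric, and the test harness touched later in this diff carries `RTOL = 0.08` in its hunk context. The comparison itself is not shown here, so the following is only a minimal sketch of how such a check might work, assuming a relative-tolerance comparison; `check_metrics` and the `measured` layout are hypothetical names for illustration.

```python
import math

RTOL = 0.08  # value visible in the harness hunk context of this diff

# Expected value copied from the Qwen2.5-VL-7B-Instruct config above.
expected = {"chartqa": {"relaxed_accuracy,none": 0.855}}


def check_metrics(expected, measured, rtol=RTOL):
    """Return (task, metric, expected, measured) tuples that fall outside rtol.

    Assumption: a symmetric relative tolerance via math.isclose; the real
    harness may use a different comparison.
    """
    failures = []
    for task, metrics in expected.items():
        for metric, want in metrics.items():
            got = measured[task][metric]
            if not math.isclose(got, want, rel_tol=rtol):
                failures.append((task, metric, want, got))
    return failures


# A measured score of 0.80 is within 8% of 0.855; 0.70 is not.
print(check_metrics(expected, {"chartqa": {"relaxed_accuracy,none": 0.80}}))
print(check_metrics(expected, {"chartqa": {"relaxed_accuracy,none": 0.70}}))
```

A tolerance this wide presumably absorbs run-to-run noise from sampling with a `limit`, at the cost of missing small regressions.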
1
.buildkite/lm-eval-harness/configs/models-large-h100.txt
Normal file
@@ -0,0 +1 @@
|
|||||||
|
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml
|
||||||
@@ -0,0 +1 @@
|
|||||||
|
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8-MM.yaml
|
||||||
1
.buildkite/lm-eval-harness/configs/models-mm-small.txt
Normal file
@@ -0,0 +1 @@
|
|||||||
|
Qwen2.5-VL-7B-Instruct.yaml
|
||||||
44
.buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh
Executable file
@@ -0,0 +1,44 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# We can use this script to compute baseline accuracy on chartqa for vllm.
|
||||||
|
#
|
||||||
|
# Make sure you have lm-eval-harness installed:
|
||||||
|
# pip install lm-eval==0.4.9
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
echo``
|
||||||
|
echo "Runs lm eval harness on ChartQA using multimodal vllm."
|
||||||
|
echo "This pathway is intended to be used to create baselines for "
|
||||||
|
echo "our correctness tests in vllm's CI."
|
||||||
|
echo
|
||||||
|
echo "usage: ${0} <options>"
|
||||||
|
echo
|
||||||
|
echo " -m - huggingface stub or local directory of the model"
|
||||||
|
echo " -l - limit number of samples to run"
|
||||||
|
echo " -t - tensor parallel size to run at"
|
||||||
|
echo
|
||||||
|
}
|
||||||
|
|
||||||
|
while getopts "m:l:t:" OPT; do
|
||||||
|
case ${OPT} in
|
||||||
|
m )
|
||||||
|
MODEL="$OPTARG"
|
||||||
|
;;
|
||||||
|
l )
|
||||||
|
LIMIT="$OPTARG"
|
||||||
|
;;
|
||||||
|
t )
|
||||||
|
TP_SIZE="$OPTARG"
|
||||||
|
;;
|
||||||
|
\? )
|
||||||
|
usage
|
||||||
|
exit 1
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
lm_eval --model vllm-vlm \
|
||||||
|
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE" \
|
||||||
|
--tasks chartqa \
|
||||||
|
--batch_size auto \
|
||||||
|
--apply_chat_template \
|
||||||
|
--limit $LIMIT
|
||||||
0
.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
Normal file → Executable file
@@ -0,0 +1,50 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# We can use this script to compute baseline accuracy on MMLUPRO for vllm.
|
||||||
|
# We use this for fp8, which HF does not support.
|
||||||
|
#
|
||||||
|
# Make sure you have lm-eval-harness installed:
|
||||||
|
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
echo``
|
||||||
|
echo "Runs lm eval harness on MMLU Pro using vllm."
|
||||||
|
echo "This pathway is intended to be used to create baselines for "
|
||||||
|
echo "our automated nm-test-accuracy workflow"
|
||||||
|
echo
|
||||||
|
echo "usage: ${0} <options>"
|
||||||
|
echo
|
||||||
|
echo " -m - huggingface stub or local directory of the model"
|
||||||
|
echo " -l - limit number of samples to run"
|
||||||
|
echo " -f - number of fewshot samples to use"
|
||||||
|
echo " -t - tensor parallel size to run at"
|
||||||
|
echo
|
||||||
|
}
|
||||||
|
|
||||||
|
while getopts "m:b:l:f:t:" OPT; do
|
||||||
|
case ${OPT} in
|
||||||
|
m )
|
||||||
|
MODEL="$OPTARG"
|
||||||
|
;;
|
||||||
|
b )
|
||||||
|
BATCH_SIZE="$OPTARG"
|
||||||
|
;;
|
||||||
|
l )
|
||||||
|
LIMIT="$OPTARG"
|
||||||
|
;;
|
||||||
|
f )
|
||||||
|
FEWSHOT="$OPTARG"
|
||||||
|
;;
|
||||||
|
t )
|
||||||
|
TP_SIZE="$OPTARG"
|
||||||
|
;;
|
||||||
|
\? )
|
||||||
|
usage
|
||||||
|
exit 1
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
lm_eval --model vllm \
|
||||||
|
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,add_bos_token=true,trust_remote_code=true,max_model_len=4096" \
|
||||||
|
--tasks mmlu_pro --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
|
||||||
|
--batch_size auto
|
||||||
@@ -19,21 +19,27 @@ RTOL = 0.08
|
|||||||
def launch_lm_eval(eval_config, tp_size):
|
def launch_lm_eval(eval_config, tp_size):
|
||||||
trust_remote_code = eval_config.get("trust_remote_code", False)
|
trust_remote_code = eval_config.get("trust_remote_code", False)
|
||||||
max_model_len = eval_config.get("max_model_len", 4096)
|
max_model_len = eval_config.get("max_model_len", 4096)
|
||||||
|
batch_size = eval_config.get("batch_size", "auto")
|
||||||
|
backend = eval_config.get("backend", "vllm")
|
||||||
model_args = (
|
model_args = (
|
||||||
f"pretrained={eval_config['model_name']},"
|
f"pretrained={eval_config['model_name']},"
|
||||||
f"tensor_parallel_size={tp_size},"
|
f"tensor_parallel_size={tp_size},"
|
||||||
f"enforce_eager=true,"
|
f"enforce_eager=true,"
|
||||||
f"add_bos_token=true,"
|
f"add_bos_token=true,"
|
||||||
f"trust_remote_code={trust_remote_code},"
|
f"trust_remote_code={trust_remote_code},"
|
||||||
f"max_model_len={max_model_len}"
|
f"max_model_len={max_model_len},"
|
||||||
)
|
)
|
||||||
results = lm_eval.simple_evaluate(
|
results = lm_eval.simple_evaluate(
|
||||||
model="vllm",
|
model=backend,
|
||||||
model_args=model_args,
|
model_args=model_args,
|
||||||
tasks=[task["name"] for task in eval_config["tasks"]],
|
tasks=[task["name"] for task in eval_config["tasks"]],
|
||||||
num_fewshot=eval_config["num_fewshot"],
|
num_fewshot=eval_config["num_fewshot"],
|
||||||
limit=eval_config["limit"],
|
limit=eval_config["limit"],
|
||||||
batch_size="auto",
|
# TODO(yeq): using chat template w/ fewshot_as_multiturn is supposed to help
|
||||||
|
# text models. however, this is regressing measured strict-match for
|
||||||
|
# existing text models in CI, so only apply it for mm.
|
||||||
|
apply_chat_template=backend == "vllm-vlm",
|
||||||
|
batch_size=batch_size,
|
||||||
)
|
)
|
||||||
return results
|
return results
|
||||||
|
|
||||||
|
|||||||
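The hunk above makes the backend and batch size configurable and turns on the chat template only for the multimodal backend. A minimal sketch of that selection logic, without calling `lm_eval` itself, might look like this; `build_eval_kwargs` is a hypothetical helper, and the returned dict only mirrors the keyword arguments visible in the diff.

```python
def build_eval_kwargs(eval_config, tp_size):
    """Assemble the keyword arguments that the hunk passes to simple_evaluate.

    Hypothetical helper for illustration; it does not import or call lm_eval.
    """
    backend = eval_config.get("backend", "vllm")
    batch_size = eval_config.get("batch_size", "auto")
    model_args = ",".join(
        [
            f"pretrained={eval_config['model_name']}",
            f"tensor_parallel_size={tp_size}",
            "enforce_eager=true",
            "add_bos_token=true",
            f"trust_remote_code={eval_config.get('trust_remote_code', False)}",
            f"max_model_len={eval_config.get('max_model_len', 4096)}",
        ]
    )
    return {
        "model": backend,
        "model_args": model_args,
        "tasks": [task["name"] for task in eval_config["tasks"]],
        "num_fewshot": eval_config["num_fewshot"],
        "limit": eval_config["limit"],
        # Chat template only for the multimodal backend, as in the diff.
        "apply_chat_template": backend == "vllm-vlm",
        "batch_size": batch_size,
    }


mm_config = {
    "model_name": "Qwen/Qwen2.5-VL-7B-Instruct",
    "backend": "vllm-vlm",
    "tasks": [{"name": "chartqa"}],
    "num_fewshot": 0,
    "limit": 2500,
}
print(build_eval_kwargs(mm_config, tp_size=1)["apply_chat_template"])
```

Configs that omit `backend` keep the previous behavior: the `vllm` backend with no chat template.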
@@ -7,6 +7,7 @@ from importlib import util
|
|||||||
|
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
|
|
||||||
|
pd.options.display.float_format = "{:.2f}".format
|
||||||
plotly_found = util.find_spec("plotly.express") is not None
|
plotly_found = util.find_spec("plotly.express") is not None
|
||||||
|
|
||||||
|
|
||||||
@@ -109,7 +110,10 @@ def compare_data_columns(
|
|||||||
if len(compare_frames) >= 2:
|
if len(compare_frames) >= 2:
|
||||||
base = compare_frames[0]
|
base = compare_frames[0]
|
||||||
current = compare_frames[-1]
|
current = compare_frames[-1]
|
||||||
ratio = current / base
|
if "P99" in data_column or "Median" in data_column:
|
||||||
|
ratio = base / current # for latency
|
||||||
|
else:
|
||||||
|
ratio = current / base
|
||||||
ratio = ratio.mask(base == 0) # avoid inf when baseline is 0
|
ratio = ratio.mask(base == 0) # avoid inf when baseline is 0
|
||||||
ratio.name = f"Ratio 1 vs {len(compare_frames)}"
|
ratio.name = f"Ratio 1 vs {len(compare_frames)}"
|
||||||
frames.append(ratio)
|
frames.append(ratio)
|
||||||
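The change above flips the ratio for latency columns so that a value above 1 always reads as "current is better". A scalar sketch of the same rule, assuming plain floats instead of pandas Series; `improvement_ratio` is a hypothetical name, and the extra guard for a zero `current` is an addition that is not in the diff.

```python
def improvement_ratio(data_column, base, current):
    """Return a ratio where > 1 means the current run improved on the baseline.

    Latency columns ("P99", "Median") are lower-is-better, so the ratio is
    inverted; throughput columns are higher-is-better.
    """
    if base == 0:
        return None  # mirrors ratio.mask(base == 0) in the diff
    if "P99" in data_column or "Median" in data_column:
        if current == 0:
            return None  # extra guard, not in the diff
        return base / current
    return current / base


print(improvement_ratio("Output Tput (tok/s)", 100.0, 120.0))  # throughput went up
print(improvement_ratio("P99 TTFT (ms)", 200.0, 100.0))        # latency went down
```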
@@ -199,6 +203,71 @@ def split_json_by_tp_pp(
|
|||||||
return saved_paths
|
return saved_paths
|
||||||
|
|
||||||
|
|
||||||
|
def _add_limit_line(fig, y_value, label):
|
||||||
|
# Visible dashed line + annotation
|
||||||
|
fig.add_hline(
|
||||||
|
y=y_value,
|
||||||
|
line_dash="dash",
|
||||||
|
line_color="red" if "ttft" in label.lower() else "blue",
|
||||||
|
annotation_text=f"{label}: {y_value} ms",
|
||||||
|
annotation_position="top left",
|
||||||
|
)
|
||||||
|
# Optional: add a legend item (as a transparent helper trace)
|
||||||
|
if plot and plotly_found:
|
||||||
|
import plotly.graph_objects as go
|
||||||
|
|
||||||
|
fig.add_trace(
|
||||||
|
go.Scatter(
|
||||||
|
x=[None],
|
||||||
|
y=[None],
|
||||||
|
mode="lines",
|
||||||
|
line=dict(
|
||||||
|
dash="dash", color="red" if "ttft" in label.lower() else "blue"
|
||||||
|
),
|
||||||
|
name=f"{label}",
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _find_concurrency_col(df: pd.DataFrame) -> str:
|
||||||
|
for c in [
|
||||||
|
"# of max concurrency.",
|
||||||
|
"# of max concurrency",
|
||||||
|
"Max Concurrency",
|
||||||
|
"max_concurrency",
|
||||||
|
"Concurrency",
|
||||||
|
]:
|
||||||
|
if c in df.columns:
|
||||||
|
return c
|
||||||
|
# Fallback: guess an integer-like column (harmless if unused)
|
||||||
|
for c in df.columns:
|
||||||
|
if df[c].dtype.kind in "iu" and df[c].nunique() > 1 and df[c].min() >= 1:
|
||||||
|
return c
|
||||||
|
return "# of max concurrency."
|
||||||
|
|
||||||
|
|
||||||
|
def _highlight_threshold(
|
||||||
|
df: pd.DataFrame, threshold: float
|
||||||
|
) -> "pd.io.formats.style.Styler":
|
||||||
|
"""Highlight numeric per-configuration columns with value <= threshold."""
|
||||||
|
conc_col = _find_concurrency_col(df)
|
||||||
|
key_cols = [
|
||||||
|
c
|
||||||
|
for c in ["Model", "Dataset Name", "Input Len", "Output Len", conc_col]
|
||||||
|
if c in df.columns
|
||||||
|
]
|
||||||
|
conf_cols = [
|
||||||
|
c for c in df.columns if c not in key_cols and not str(c).startswith("Ratio")
|
||||||
|
]
|
||||||
|
conf_cols = [c for c in conf_cols if pd.api.types.is_numeric_dtype(df[c])]
|
||||||
|
return df.style.map(
|
||||||
|
lambda v: "background-color:#e6ffe6;font-weight:bold;"
|
||||||
|
if pd.notna(v) and v <= threshold
|
||||||
|
else "",
|
||||||
|
subset=conf_cols,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
parser = argparse.ArgumentParser()
|
parser = argparse.ArgumentParser()
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
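`_highlight_threshold` above applies a per-cell rule through the pandas Styler: cells at or below the limit get a green background. The rule itself can be sketched without pandas, assuming plain numeric cells; `cell_style` and `HIGHLIGHT` are hypothetical names, and the CSS string is copied from the diff.

```python
import math

HIGHLIGHT = "background-color:#e6ffe6;font-weight:bold;"


def cell_style(value, threshold):
    """Return the CSS for one cell: highlighted when value <= threshold."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""  # missing values are never highlighted
    return HIGHLIGHT if value <= threshold else ""


# With the default --ttft-max-ms of 3000.0 from this diff:
print(repr(cell_style(2500.0, 3000.0)))
print(repr(cell_style(3500.0, 3000.0)))
```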
@@ -220,6 +289,26 @@ if __name__ == "__main__":
|
|||||||
default="# of max concurrency.",
|
default="# of max concurrency.",
|
||||||
help="column name to use as X Axis in comparison graph",
|
help="column name to use as X Axis in comparison graph",
|
||||||
)
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-l",
|
||||||
|
"--latency",
|
||||||
|
type=str,
|
||||||
|
default="p99",
|
||||||
|
help="take median|p99 for latency like TTFT/TPOT",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--ttft-max-ms",
|
||||||
|
type=float,
|
||||||
|
default=3000.0,
|
||||||
|
help="Reference limit for TTFT plots (ms)",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--tpot-max-ms",
|
||||||
|
type=float,
|
||||||
|
default=100.0,
|
||||||
|
help="Reference limit for TPOT plots (ms)",
|
||||||
|
)
|
||||||
|
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
drop_column = "P99"
|
drop_column = "P99"
|
||||||
@@ -234,12 +323,22 @@ if __name__ == "__main__":
|
|||||||
"# of max concurrency.",
|
"# of max concurrency.",
|
||||||
"qps",
|
"qps",
|
||||||
]
|
]
|
||||||
data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
|
|
||||||
html_msgs_for_data_cols = [
|
if "median" in args.latency:
|
||||||
"Compare Output Tokens /n",
|
data_cols_to_compare = ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
|
||||||
"Median TTFT /n",
|
html_msgs_for_data_cols = [
|
||||||
"Median TPOT /n",
|
"Compare Output Tokens /n",
|
||||||
]
|
"Median TTFT /n",
|
||||||
|
"Median TPOT /n",
|
||||||
|
]
|
||||||
|
drop_column = "P99"
|
||||||
|
elif "p99" in args.latency:
|
||||||
|
data_cols_to_compare = ["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"]
|
||||||
|
html_msgs_for_data_cols = [
|
||||||
|
"Compare Output Tokens /n",
|
||||||
|
"P99 TTFT /n",
|
||||||
|
"P99 TPOT /n",
|
||||||
|
]
|
||||||
|
|
||||||
if len(args.file) == 1:
|
if len(args.file) == 1:
|
||||||
files = split_json_by_tp_pp(args.file[0], output_root="splits")
|
files = split_json_by_tp_pp(args.file[0], output_root="splits")
|
||||||
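The new `--latency` option above picks between median and P99 columns. In the diff, a value matching neither branch leaves `data_cols_to_compare` unset; the sketch below shows the same selection with an explicit error for that case. `select_latency_columns` is a hypothetical helper, and raising `ValueError` is an assumption about the desired behavior, not what the script does.

```python
def select_latency_columns(latency):
    """Map the --latency choice to the data columns to compare."""
    if "median" in latency:
        return ["Output Tput (tok/s)", "Median TTFT (ms)", "Median"]
    if "p99" in latency:
        return ["Output Tput (tok/s)", "P99 TTFT (ms)", "P99"]
    # Assumption: fail early instead of hitting an undefined name later.
    raise ValueError(f"unsupported latency statistic: {latency!r}")


print(select_latency_columns("p99"))
```

Passing `choices=["median", "p99"]` to `add_argument` would be another way to reject unsupported values at parse time.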
@@ -275,33 +374,83 @@ if __name__ == "__main__":
|
|||||||
f"Expected subset: {filtered_info_cols}, "
|
f"Expected subset: {filtered_info_cols}, "
|
||||||
f"but DataFrame has: {list(output_df.columns)}"
|
f"but DataFrame has: {list(output_df.columns)}"
|
||||||
)
|
)
|
||||||
output_df_sorted = output_df.sort_values(by=existing_group_cols)
|
# output_df_sorted = output_df.sort_values(by=existing_group_cols)
|
||||||
|
output_df_sorted = output_df.sort_values(by=args.xaxis)
|
||||||
output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
|
output_groups = output_df_sorted.groupby(existing_group_cols, dropna=False)
|
||||||
for name, group in output_groups:
|
for name, group in output_groups:
|
||||||
html = group.to_html()
|
group_name = (
|
||||||
|
",".join(map(str, name)).replace(",", "_").replace("/", "-")
|
||||||
|
)
|
||||||
|
group_html_name = "perf_comparison_" + group_name + ".html"
|
||||||
|
|
||||||
|
metric_name = str(data_cols_to_compare[i]).lower()
|
||||||
|
if "tok/s" in metric_name:
|
||||||
|
html = group.to_html()
|
||||||
|
elif "ttft" in metric_name:
|
||||||
|
styler = _highlight_threshold(group, args.ttft_max_ms).format(
|
||||||
|
{c: "{:.2f}" for c in group.select_dtypes("number").columns},
|
||||||
|
na_rep="—",
|
||||||
|
)
|
||||||
|
html = styler.to_html(
|
||||||
|
table_attributes='border="1" class="dataframe"'
|
||||||
|
)
|
||||||
|
elif (
|
||||||
|
"tpot" in metric_name
|
||||||
|
or "median" in metric_name
|
||||||
|
or "p99" in metric_name
|
||||||
|
):
|
||||||
|
styler = _highlight_threshold(group, args.tpot_max_ms).format(
|
||||||
|
{c: "{:.2f}" for c in group.select_dtypes("number").columns},
|
||||||
|
na_rep="—",
|
||||||
|
)
|
||||||
|
html = styler.to_html(
|
||||||
|
table_attributes='border="1" class="dataframe"'
|
||||||
|
)
|
||||||
|
|
||||||
text_file.write(html_msgs_for_data_cols[i])
|
text_file.write(html_msgs_for_data_cols[i])
|
||||||
text_file.write(html)
|
text_file.write(html)
|
||||||
|
with open(group_html_name, "a+") as sub_text_file:
|
||||||
|
sub_text_file.write(html_msgs_for_data_cols[i])
|
||||||
|
sub_text_file.write(html)
|
||||||
|
|
||||||
if plot and plotly_found:
|
if plot and plotly_found:
|
||||||
import plotly.express as px
|
import plotly.express as px
|
||||||
|
|
||||||
df = group[raw_data_cols]
|
df = group[raw_data_cols]
|
||||||
df_sorted = df.sort_values(by=info_cols[y_axis_index])
|
df_sorted = df.sort_values(by=info_cols[y_axis_index])
|
||||||
# Melt DataFrame for plotting
|
# Melt DataFrame for plotting
|
||||||
df_melted = df_sorted.melt(
|
df_melted = df_sorted.melt(
|
||||||
id_vars=info_cols[y_axis_index],
|
id_vars=info_cols[y_axis_index],
|
||||||
var_name="Configuration",
|
var_name="Configuration",
|
||||||
value_name=data_cols_to_compare[i],
|
value_name=data_cols_to_compare[i],
|
||||||
)
|
)
|
||||||
title = data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
|
title = (
|
||||||
# Create Plotly line chart
|
data_cols_to_compare[i] + " vs " + info_cols[y_axis_index]
|
||||||
fig = px.line(
|
)
|
||||||
df_melted,
|
# Create Plotly line chart
|
||||||
x=info_cols[y_axis_index],
|
fig = px.line(
|
||||||
y=data_cols_to_compare[i],
|
df_melted,
|
||||||
color="Configuration",
|
x=info_cols[y_axis_index],
|
||||||
title=title,
|
y=data_cols_to_compare[i],
|
||||||
markers=True,
|
color="Configuration",
|
||||||
)
|
title=title,
|
||||||
# Export to HTML
|
markers=True,
|
||||||
text_file.write(fig.to_html(full_html=True, include_plotlyjs="cdn"))
|
)
|
||||||
|
|
||||||
|
# ---- Add threshold lines based on metric name ----
|
||||||
|
if "ttft" in metric_name:
|
||||||
|
_add_limit_line(fig, args.ttft_max_ms, "TTFT limit")
|
||||||
|
elif (
|
||||||
|
"tpot" in metric_name
|
||||||
|
or "median" in metric_name
|
||||||
|
or "p99" in metric_name
|
||||||
|
):
|
||||||
|
_add_limit_line(fig, args.tpot_max_ms, "TPOT limit")
|
||||||
|
|
||||||
|
# Export to HTML
|
||||||
|
text_file.write(
|
||||||
|
fig.to_html(full_html=True, include_plotlyjs="cdn")
|
||||||
|
)
|
||||||
|
sub_text_file.write(
|
||||||
|
fig.to_html(full_html=True, include_plotlyjs="cdn")
|
||||||
|
)
|
||||||
|
|||||||
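The hunk above also writes one HTML file per group, deriving the file name from the group key. A small sketch of that derivation, using the same string operations as the diff; `group_html_name` as a function and the sample key are hypothetical.

```python
def group_html_name(name):
    """Build the per-group report file name from a groupby key tuple."""
    group_name = ",".join(map(str, name)).replace(",", "_").replace("/", "-")
    return "perf_comparison_" + group_name + ".html"


# Hypothetical group key: (model, dataset, input length).
print(group_html_name(("meta-llama/Llama-3.1-8B-Instruct", "sharegpt", 128)))
```

Because the key is joined with commas first and every comma is then replaced, a comma inside a value is also turned into an underscore.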
@@ -63,9 +63,11 @@ serving_column_mapping = {
|
|||||||
"mean_ttft_ms": "Mean TTFT (ms)",
|
"mean_ttft_ms": "Mean TTFT (ms)",
|
||||||
"median_ttft_ms": "Median TTFT (ms)",
|
"median_ttft_ms": "Median TTFT (ms)",
|
||||||
"p99_ttft_ms": "P99 TTFT (ms)",
|
"p99_ttft_ms": "P99 TTFT (ms)",
|
||||||
|
"std_ttft_ms": "STD TTFT (ms)",
|
||||||
"mean_tpot_ms": "Mean TPOT (ms)",
|
"mean_tpot_ms": "Mean TPOT (ms)",
|
||||||
"median_tpot_ms": "Median",
|
"median_tpot_ms": "Median",
|
||||||
"p99_tpot_ms": "P99",
|
"p99_tpot_ms": "P99",
|
||||||
|
"std_tpot_ms": "STD TPOT (ms)",
|
||||||
"mean_itl_ms": "Mean ITL (ms)",
|
"mean_itl_ms": "Mean ITL (ms)",
|
||||||
"median_itl_ms": "Median ITL (ms)",
|
"median_itl_ms": "Median ITL (ms)",
|
||||||
"p99_itl_ms": "P99 ITL (ms)",
|
"p99_itl_ms": "P99 ITL (ms)",
|
||||||
@@ -368,7 +370,7 @@ if __name__ == "__main__":
|
|||||||
# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
|
# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
|
||||||
# we want to turn it into "8xGPUTYPE"
|
# we want to turn it into "8xGPUTYPE"
|
||||||
df["GPU"] = df["GPU"].apply(
|
df["GPU"] = df["GPU"].apply(
|
||||||
lambda x: f"{len(x.splitlines())}x{x.splitlines()[0]}"
|
lambda x: "{}x{}".format(len(x.split("\n")), x.split("\n")[0])
|
||||||
)
|
)
|
||||||
|
|
||||||
# get markdown tables
|
# get markdown tables
|
||||||
|
|||||||
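The change above swaps `splitlines()` for `split("\n")` when collapsing a repeated GPU list into a count. The two differ when the input ends with a newline, which the sketch below illustrates; `summarize_gpus` is a hypothetical wrapper around the expression from the diff.

```python
def summarize_gpus(raw):
    """Turn "GPUTYPE\\nGPUTYPE\\n..." into "NxGPUTYPE", as in the diff."""
    parts = raw.split("\n")
    return "{}x{}".format(len(parts), parts[0])


print(summarize_gpus("\n".join(["H100"] * 8)))

# A trailing newline adds an empty final element under split("\n"),
# so the count is one higher than with splitlines().
print(len("A100\nA100\n".split("\n")), len("A100\nA100\n".splitlines()))
```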
@@ -471,6 +471,11 @@ main() {
|
|||||||
mkdir -p $RESULTS_FOLDER
|
mkdir -p $RESULTS_FOLDER
|
||||||
QUICK_BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
|
QUICK_BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
|
||||||
|
|
||||||
|
# dump vllm info via vllm collect-env
|
||||||
|
env_output=$(vllm collect-env)
|
||||||
|
|
||||||
|
echo "$env_output" >"$RESULTS_FOLDER/vllm_env.txt"
|
||||||
|
|
||||||
# benchmarking
|
# benchmarking
|
||||||
run_serving_tests $QUICK_BENCHMARK_ROOT/tests/"${SERVING_JSON:-serving-tests$ARCH.json}"
|
run_serving_tests $QUICK_BENCHMARK_ROOT/tests/"${SERVING_JSON:-serving-tests$ARCH.json}"
|
||||||
run_latency_tests $QUICK_BENCHMARK_ROOT/tests/"${LATENCY_JSON:-latency-tests$ARCH.json}"
|
run_latency_tests $QUICK_BENCHMARK_ROOT/tests/"${LATENCY_JSON:-latency-tests$ARCH.json}"
|
||||||
|
|||||||
@@ -1,28 +1,24 @@
|
|||||||
[
|
[
|
||||||
{
|
{
|
||||||
"test_name": "latency_llama8B_tp1",
|
"test_name": "latency_llama8B_tp2",
|
||||||
"environment_variables": {
|
"environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
},
|
},
|
||||||
"parameters": {
|
"parameters": {
|
||||||
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
||||||
"tensor_parallel_size": 1,
|
"tensor_parallel_size": 2,
|
||||||
"load_format": "dummy",
|
"dtype": "bfloat16",
|
||||||
"num_iters_warmup": 5,
|
"distributed_executor_backend": "mp",
|
||||||
"num_iters": 15
|
"block_size": 128,
|
||||||
}
|
"trust_remote_code": "",
|
||||||
},
|
"disable_log_stats": "",
|
||||||
{
|
"enforce_eager": "",
|
||||||
"test_name": "latency_llama8B_tp4",
|
"max_num_batched_tokens": 2048,
|
||||||
"environment_variables": {
|
"max_num_seqs": 256,
|
||||||
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
|
||||||
"VLLM_CPU_KVCACHE_SPACE": 40
|
|
||||||
},
|
|
||||||
"parameters": {
|
|
||||||
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
|
||||||
"tensor_parallel_size": 4,
|
|
||||||
"load_format": "dummy",
|
|
||||||
"num_iters_warmup": 5,
|
"num_iters_warmup": 5,
|
||||||
"num_iters": 15
|
"num_iters": 15
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -95,6 +95,38 @@
|
|||||||
"num_prompts": 200
|
"num_prompts": 200
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_bf16_tp4_sharegpt",
|
||||||
|
"qps_list": ["inf"],
|
||||||
|
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 4,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "sharegpt",
|
||||||
|
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
|
||||||
|
"num_prompts": 200
|
||||||
|
}
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"test_name": "serving_llama8B_bf16_tp2pp3_sharegpt",
|
"test_name": "serving_llama8B_bf16_tp2pp3_sharegpt",
|
||||||
"qps_list": ["inf"],
|
"qps_list": ["inf"],
|
||||||
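Each serving test entry above lists a `qps_list`, a `max_concurrency_list`, and parameter maps in which an empty string marks a bare flag. How the benchmark runner consumes these is not part of this diff; the sketch below assumes one run per (qps, concurrency) pair and a `--key value` mapping with underscores turned into dashes. `to_cli_args` and `expand_runs` are hypothetical helpers.

```python
from itertools import product


def to_cli_args(params):
    """Convert a parameter map into CLI arguments (assumed convention)."""
    args = []
    for key, value in params.items():
        args.append("--" + key.replace("_", "-"))
        if value != "":  # empty string is treated as a bare flag
            args.append(str(value))
    return args


def expand_runs(test):
    """List the (qps, max_concurrency) pairs a test entry would cover."""
    return list(product(test["qps_list"], test["max_concurrency_list"]))


test = {
    "qps_list": ["inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
}
print(to_cli_args({"tensor_parallel_size": 4, "enforce_eager": ""}))
print(len(expand_runs(test)))
```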
@@ -233,6 +265,41 @@
|
|||||||
"num_prompts": 1000
|
"num_prompts": 1000
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
|
{
|
||||||
|
"test_name": "serving_llama8B_bf16_tp4_random_128_128",
|
||||||
|
"qps_list": ["inf"],
|
||||||
|
"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
|
||||||
|
"server_environment_variables": {
|
||||||
|
"VLLM_RPC_TIMEOUT": 100000,
|
||||||
|
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
|
||||||
|
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
|
||||||
|
"VLLM_CPU_SGL_KERNEL": 1,
|
||||||
|
"VLLM_CPU_KVCACHE_SPACE": 40
|
||||||
|
},
|
||||||
|
"server_parameters": {
|
||||||
|
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
||||||
|
"tensor_parallel_size": 4,
|
||||||
|
"dtype": "bfloat16",
|
||||||
|
"distributed_executor_backend": "mp",
|
||||||
|
"block_size": 128,
|
||||||
|
"trust_remote_code": "",
|
||||||
|
"enable_chunked_prefill": "",
|
||||||
|
"disable_log_stats": "",
|
||||||
|
"enforce_eager": "",
|
||||||
|
"max_num_batched_tokens": 2048,
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"load_format": "dummy"
|
||||||
|
},
|
||||||
|
"client_parameters": {
|
||||||
|
"model": "meta-llama/Llama-3.1-8B-Instruct",
|
||||||
|
"backend": "vllm",
|
||||||
|
"dataset_name": "random",
|
||||||
|
"random-input-len": 128,
|
||||||
|
"random-output-len": 128,
|
||||||
|
"ignore-eos": "",
|
||||||
|
"num_prompts": 1000
|
||||||
|
}
|
||||||
|
},
|
||||||
{
|
{
|
||||||
"test_name": "serving_llama8B_bf16_tp2pp3_random_128_128",
|
"test_name": "serving_llama8B_bf16_tp2pp3_random_128_128",
|
||||||
"qps_list": ["inf"],
|
"qps_list": ["inf"],
|
@@ -365,6 +432,38 @@
 "num_prompts": 200
 }
 },
+{
+"test_name": "serving_llama8B_int8_tp4_sharegpt",
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
+"server_environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
+"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
+"VLLM_CPU_KVCACHE_SPACE": 40
+},
+"server_parameters": {
+"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
+"tensor_parallel_size": 4,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
+"load_format": "dummy"
+},
+"client_parameters": {
+"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
+"backend": "vllm",
+"dataset_name": "sharegpt",
+"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
+"num_prompts": 200
+}
+},
 {
 "test_name": "serving_llama8B_int8_tp2pp3_sharegpt",
 "qps_list": ["inf"],
@@ -503,6 +602,41 @@
 "num_prompts": 1000
 }
 },
+{
+"test_name": "serving_llama8B_int8_tp4_random_128_128",
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
+"server_environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
+"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
+"VLLM_CPU_KVCACHE_SPACE": 40
+},
+"server_parameters": {
+"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
+"tensor_parallel_size": 4,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"enable_chunked_prefill": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
+"load_format": "dummy"
+},
+"client_parameters": {
+"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8",
+"backend": "vllm",
+"dataset_name": "random",
+"random-input-len": 128,
+"random-output-len": 128,
+"ignore-eos": "",
+"num_prompts": 1000
+}
+},
 {
 "test_name": "serving_llama8B_int8_tp2pp3_random_128_128",
 "qps_list": ["inf"],
@@ -638,6 +772,39 @@
 "num_prompts": 200
 }
 },
+{
+"test_name": "serving_llama8B_int4_tp4_sharegpt",
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
+"server_environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
+"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
+"VLLM_CPU_KVCACHE_SPACE": 40
+},
+"server_parameters": {
+"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
+"quantization": "awq",
+"tensor_parallel_size": 4,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
+"load_format": "dummy"
+},
+"client_parameters": {
+"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
+"backend": "vllm",
+"dataset_name": "sharegpt",
+"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
+"num_prompts": 200
+}
+},
 {
 "test_name": "serving_llama8B_int4_tp2pp3_sharegpt",
 "qps_list": ["inf"],
@@ -780,6 +947,42 @@
 "num_prompts": 1000
 }
 },
+{
+"test_name": "serving_llama8B_int4_tp4_random_128_128",
+"qps_list": ["inf"],
+"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200, 1000],
+"server_environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
+"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
+"VLLM_CPU_KVCACHE_SPACE": 40
+},
+"server_parameters": {
+"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
+"quantization": "awq",
+"tensor_parallel_size": 4,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"enable_chunked_prefill": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
+"load_format": "dummy"
+},
+"client_parameters": {
+"model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
+"backend": "vllm",
+"dataset_name": "random",
+"random-input-len": 128,
+"random-output-len": 128,
+"ignore-eos": "",
+"num_prompts": 1000
+}
+},
 {
 "test_name": "serving_llama8B_int4_tp2pp3_random_128_128",
 "qps_list": ["inf"],
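
Reading note (not part of the diff): the `server_parameters` objects in this file use empty strings for keys such as `trust_remote_code` and `enforce_eager`. The sketch below shows one plausible way such a mapping could become command-line arguments. It assumes that underscores turn into dashes and that an empty string marks a value-less switch; the helper name `params_to_cli_args` is hypothetical, and the actual benchmark harness may do this differently.

```python
def params_to_cli_args(params: dict) -> list[str]:
    """Flatten a parameters mapping into a list of CLI arguments.

    Assumptions (not confirmed by the diff): keys map to --dashed-flags,
    and an empty-string value means the flag takes no argument.
    """
    args = []
    for key, value in params.items():
        args.append("--" + key.replace("_", "-"))
        if value != "":
            args.append(str(value))
    return args


# A shortened copy of one server_parameters object from the diff.
server_parameters = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "tensor_parallel_size": 4,
    "dtype": "bfloat16",
    "enforce_eager": "",
    "max_num_seqs": 256,
}

cli = params_to_cli_args(server_parameters)
print(" ".join(cli))
```

Under these assumptions, `enforce_eager` contributes only `--enforce-eager`, while `tensor_parallel_size` contributes `--tensor-parallel-size 4`.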

@@ -2,7 +2,7 @@
 {
 "test_name": "serving_llama8B_tp1_sharegpt",
 "qps_list": [1, 4, 16, "inf"],
-"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
+"max_concurrency_list": [32],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -28,13 +28,13 @@
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"num_prompts": 200
+"num_prompts": 32
 }
 },
 {
 "test_name": "serving_llama8B_tp2_sharegpt",
 "qps_list": [1, 4, 16, "inf"],
-"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
+"max_concurrency_list": [32],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -60,13 +60,13 @@
 "backend": "vllm",
 "dataset_name": "sharegpt",
 "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"num_prompts": 200
+"num_prompts": 32
 }
 },
 {
-"test_name": "serving_llama8B_tp4_sharegpt",
+"test_name": "serving_llama8B_tp1_random_128_128",
 "qps_list": [1, 4, 16, "inf"],
-"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
+"max_concurrency_list": [32],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -76,39 +76,7 @@
 },
 "server_parameters": {
 "model": "meta-llama/Llama-3.1-8B-Instruct",
-"tensor_parallel_size": 4,
-"dtype": "bfloat16",
-"distributed_executor_backend": "mp",
-"block_size": 128,
-"trust_remote_code": "",
-"disable_log_stats": "",
-"enforce_eager": "",
-"max_num_batched_tokens": 2048,
-"max_num_seqs": 256,
-"load_format": "dummy"
-},
-"client_parameters": {
-"model": "meta-llama/Llama-3.1-8B-Instruct",
-"backend": "vllm",
-"dataset_name": "sharegpt",
-"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"num_prompts": 200
-}
-},
-{
-"test_name": "serving_llama8B_tp4_random_1024_128",
-"qps_list": [1, 4, 16, "inf"],
-"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
-"server_environment_variables": {
-"VLLM_RPC_TIMEOUT": 100000,
-"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
-"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
-"VLLM_CPU_SGL_KERNEL": 1,
-"VLLM_CPU_KVCACHE_SPACE": 40
-},
-"server_parameters": {
-"model": "meta-llama/Llama-3.1-8B-Instruct",
-"tensor_parallel_size": 4,
+"tensor_parallel_size": 1,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
 "block_size": 128,
@@ -124,16 +92,16 @@
 "model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
-"random-input-len": 1024,
+"random-input-len": 128,
 "random-output-len": 128,
 "ignore-eos": "",
-"num_prompts": 100
+"num_prompts": 32
 }
 },
 {
-"test_name": "serving_llama8B_pp6_random_1024_128",
+"test_name": "serving_llama8B_tp2_random_128_128",
 "qps_list": [1, 4, 16, "inf"],
-"max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
+"max_concurrency_list": [32],
 "server_environment_variables": {
 "VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
@@ -143,7 +111,7 @@
 },
 "server_parameters": {
 "model": "meta-llama/Llama-3.1-8B-Instruct",
-"pipeline_parallel_size": 6,
+"tensor_parallel_size": 2,
 "dtype": "bfloat16",
 "distributed_executor_backend": "mp",
 "block_size": 128,
@@ -159,10 +127,150 @@
 "model": "meta-llama/Llama-3.1-8B-Instruct",
 "backend": "vllm",
 "dataset_name": "random",
-"random-input-len": 1024,
+"random-input-len": 128,
 "random-output-len": 128,
 "ignore-eos": "",
-"num_prompts": 100
+"num_prompts": 32
+}
+},
+{
+"test_name": "serving_llama8B_tp1_random_128_2048",
+"qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [32],
+"server_environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
+"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
+"VLLM_CPU_KVCACHE_SPACE": 40
+},
+"server_parameters": {
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"tensor_parallel_size": 1,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"enable_chunked_prefill": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
+"load_format": "dummy"
+},
+"client_parameters": {
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"backend": "vllm",
+"dataset_name": "random",
+"random-input-len": 128,
+"random-output-len": 2048,
+"ignore-eos": "",
+"num_prompts": 32
+}
+},
+{
+"test_name": "serving_llama8B_tp2_random_128_2048",
+"qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [32],
+"server_environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
+"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
+"VLLM_CPU_KVCACHE_SPACE": 40
+},
+"server_parameters": {
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"tensor_parallel_size": 2,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"enable_chunked_prefill": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
+"load_format": "dummy"
+},
+"client_parameters": {
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"backend": "vllm",
+"dataset_name": "random",
+"random-input-len": 128,
+"random-output-len": 2048,
+"ignore-eos": "",
+"num_prompts": 32
+}
+},
+{
+"test_name": "serving_llama8B_tp1_random_2048_128",
+"qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [32],
+"server_environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
+"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
+"VLLM_CPU_KVCACHE_SPACE": 40
+},
+"server_parameters": {
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"tensor_parallel_size": 1,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"enable_chunked_prefill": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
+"load_format": "dummy"
+},
+"client_parameters": {
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"backend": "vllm",
+"dataset_name": "random",
+"random-input-len": 2048,
+"random-output-len": 128,
+"ignore-eos": "",
+"num_prompts": 32
+}
+},
+{
+"test_name": "serving_llama8B_tp2_random_2048_128",
+"qps_list": [1, 4, 16, "inf"],
+"max_concurrency_list": [32],
+"server_environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
+"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
+"VLLM_CPU_KVCACHE_SPACE": 40
+},
+"server_parameters": {
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"tensor_parallel_size": 2,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"enable_chunked_prefill": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
+"load_format": "dummy"
+},
+"client_parameters": {
+"model": "meta-llama/Llama-3.1-8B-Instruct",
+"backend": "vllm",
+"dataset_name": "random",
+"random-input-len": 2048,
+"random-output-len": 128,
+"ignore-eos": "",
+"num_prompts": 32
 }
 }
 ]
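
Reading note (not part of the diff): this file narrows `max_concurrency_list` from seven values to `[32]` while keeping four `qps_list` values. The sketch below estimates what that does to the number of runs per test, assuming the harness executes one benchmark for every (QPS, concurrency) pair. That pairing rule and the helper name `expand_runs` are assumptions for illustration only.

```python
from itertools import product


def expand_runs(test: dict) -> list[tuple]:
    """List one (test_name, qps, concurrency) tuple per assumed benchmark run."""
    return [
        (test["test_name"], qps, concurrency)
        for qps, concurrency in product(
            test["qps_list"], test["max_concurrency_list"]
        )
    ]


# Old and new versions of the same entry, trimmed to the relevant keys.
old_entry = {
    "test_name": "serving_llama8B_tp1_sharegpt",
    "qps_list": [1, 4, 16, "inf"],
    "max_concurrency_list": [12, 16, 24, 32, 64, 128, 200],
}
new_entry = dict(old_entry, max_concurrency_list=[32])

old_runs = expand_runs(old_entry)
new_runs = expand_runs(new_entry)
print(len(old_runs), len(new_runs))  # 28 4
```

If the pairing assumption holds, each test drops from 28 runs to 4, which is consistent with `num_prompts` also shrinking to 32 in the same hunks.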

@@ -1,29 +1,24 @@
 [
 {
-"test_name": "throughput_llama8B_tp1",
+"test_name": "throughput_llama8B_tp2",
 "environment_variables": {
+"VLLM_RPC_TIMEOUT": 100000,
 "VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
+"VLLM_ENGINE_ITERATION_TIMEOUT_S": 120,
+"VLLM_CPU_SGL_KERNEL": 1,
 "VLLM_CPU_KVCACHE_SPACE": 40
 },
 "parameters": {
 "model": "meta-llama/Llama-3.1-8B-Instruct",
-"tensor_parallel_size": 1,
-"load_format": "dummy",
-"dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
-"num_prompts": 200,
-"backend": "vllm"
-}
-},
-{
-"test_name": "throughput_llama8B_tp4",
-"environment_variables": {
-"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
-"VLLM_CPU_KVCACHE_SPACE": 40
-},
-"parameters": {
-"model": "meta-llama/Llama-3.1-8B-Instruct",
-"tensor_parallel_size": 4,
-"load_format": "dummy",
+"tensor_parallel_size": 2,
+"dtype": "bfloat16",
+"distributed_executor_backend": "mp",
+"block_size": 128,
+"trust_remote_code": "",
+"disable_log_stats": "",
+"enforce_eager": "",
+"max_num_batched_tokens": 2048,
+"max_num_seqs": 256,
 "dataset": "./ShareGPT_V3_unfiltered_cleaned_split.json",
 "num_prompts": 200,
 "backend": "vllm"

@@ -1,5 +1,5 @@
 steps:
-# aarch64 + CUDA builds. PyTorch 2.8 aarch64 + CUDA wheel is only available on CUDA 12.9
+# aarch64 + CUDA builds
 - label: "Build arm64 wheel - CUDA 12.9"
 depends_on: ~
 id: build-wheel-arm64-cuda-12-9
@@ -15,6 +15,21 @@ steps:
 env:
 DOCKER_BUILDKIT: "1"
 
+# aarch64 build
+- label: "Build arm64 CPU wheel"
+depends_on: ~
+id: build-wheel-arm64-cpu
+agents:
+queue: arm64_cpu_queue_postmerge
+commands:
+- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --build-arg VLLM_BUILD_ACL=ON --tag vllm-ci:build-image --target vllm-build --progress plain -f docker/Dockerfile.cpu ."
+- "mkdir artifacts"
+- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
+- "bash .buildkite/scripts/upload-wheels.sh"
+env:
+DOCKER_BUILDKIT: "1"
+
+# x86 + CUDA builds
 - label: "Build wheel - CUDA 12.8"
 depends_on: ~
 id: build-wheel-cuda-12-8
@@ -28,20 +43,6 @@ steps:
 env:
 DOCKER_BUILDKIT: "1"
 
-- label: "Build wheel - CUDA 12.6"
-depends_on: ~
-id: build-wheel-cuda-12-6
-agents:
-queue: cpu_queue_postmerge
-commands:
-- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.6.3 --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0+PTX' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
-- "mkdir artifacts"
-- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
-- "bash .buildkite/scripts/upload-wheels.sh"
-env:
-DOCKER_BUILDKIT: "1"
-
-# x86 + CUDA builds
 - label: "Build wheel - CUDA 12.9"
 depends_on: ~
 id: build-wheel-cuda-12-9
@@ -55,6 +56,20 @@ steps:
 env:
 DOCKER_BUILDKIT: "1"
 
+- label: "Build wheel - CUDA 13.0"
+depends_on: ~
+id: build-wheel-cuda-13-0
+agents:
+queue: cpu_queue_postmerge
+commands:
+- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.1 --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.1-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
+- "mkdir artifacts"
+- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
+- "bash .buildkite/scripts/upload-wheels.sh"
+env:
+DOCKER_BUILDKIT: "1"
+
+# Build release images (12.9)
 - label: "Build release image (x86)"
 depends_on: ~
 id: build-release-image-x86
@@ -62,13 +77,12 @@ steps:
 queue: cpu_queue_postmerge
 commands:
 - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
-- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.8.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
+- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
 - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"
 # re-tag to default image tag and push, just in case arm64 build fails
 - "docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
 - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
 
-# PyTorch 2.8 aarch64 + CUDA wheel is only available on CUDA 12.9
 - label: "Build release image (arm64)"
 depends_on: ~
 id: build-release-image-arm64
@@ -142,6 +156,22 @@ steps:
 env:
 DOCKER_BUILDKIT: "1"
 
+- block: "Build arm64 CPU release image"
+key: block-arm64-cpu-release-image-build
+depends_on: ~
+
+- label: "Build and publish arm64 CPU release image"
+depends_on: block-arm64-cpu-release-image-build
+agents:
+queue: arm64_cpu_queue_postmerge
+commands:
+- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
+- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
+- "docker push public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:latest"
+- "docker push public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
+env:
+DOCKER_BUILDKIT: "1"
+
 - label: "Build and publish nightly multi-arch image to DockerHub"
 depends_on:
 - create-multi-arch-manifest

@@ -70,7 +70,7 @@ function cpu_tests() {
 docker exec cpu-test-"$NUMA_NODE" bash -c "
 set -e
 pytest -x -s -v \
-tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs[False-10-32-neuralmagic/Llama-3.2-1B-quantized.w8a8]"
+tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_logprobs"
 
 # Note: disable it until supports V1
 # Run AWQ test

@@ -58,33 +58,25 @@ python3 .buildkite/generate_index.py --wheel "$normal_wheel"
 aws s3 cp "$wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
 aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
 
-if [[ $normal_wheel == *"cu126"* ]]; then
-# if $normal_wheel matches cu126, do not upload the index.html
-echo "Skipping index files for cu126 wheels"
-elif [[ $normal_wheel == *"cu128"* ]]; then
-# if $normal_wheel matches cu128, do not upload the index.html
-echo "Skipping index files for cu128 wheels"
-else
+if [[ $normal_wheel == *"cu129"* ]]; then
 # only upload index.html for cu129 wheels (default wheels) as it
 # is available on both x86 and arm64
 aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
 aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
+else
+echo "Skipping index files for non-cu129 wheels"
 fi
 
 # generate index for nightly
 aws s3 cp "$wheel" "s3://vllm-wheels/nightly/"
 aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"
 
-if [[ $normal_wheel == *"cu126"* ]]; then
-# if $normal_wheel matches cu126, do not upload the index.html
-echo "Skipping index files for cu126 wheels"
-elif [[ $normal_wheel == *"cu128"* ]]; then
-# if $normal_wheel matches cu128, do not upload the index.html
-echo "Skipping index files for cu128 wheels"
-else
+if [[ $normal_wheel == *"cu129"* ]]; then
 # only upload index.html for cu129 wheels (default wheels) as it
 # is available on both x86 and arm64
 aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
+else
+echo "Skipping index files for non-cu129 wheels"
 fi
 
 aws s3 cp "$wheel" "s3://vllm-wheels/$version/"
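
Reading note (not part of the diff): the hunk above replaces a deny-list (skip `cu126` and `cu128`, upload otherwise) with an allow-list (upload only for `cu129`). The sketch below restates both rules as Python predicates to show where they diverge. The function names and the sample wheel filenames are hypothetical, chosen only for illustration; the substring test mirrors the `*"cu129"*` glob in the script.

```python
def uploads_index_old(normal_wheel: str) -> bool:
    """Old rule: upload the index unless the name contains cu126 or cu128."""
    return not ("cu126" in normal_wheel or "cu128" in normal_wheel)


def uploads_index_new(normal_wheel: str) -> bool:
    """New rule: upload the index only when the name contains cu129."""
    return "cu129" in normal_wheel


# Hypothetical filenames; real wheel names may be formatted differently.
samples = [
    "vllm-1.0.0+cu128-example.whl",
    "vllm-1.0.0+cu129-example.whl",
    "vllm-1.0.0+cu130-example.whl",
    "vllm-1.0.0+cpu-example.whl",
]

for name in samples:
    print(name, uploads_index_old(name), uploads_index_new(name))
```

The two rules agree for `cu128` and `cu129` names and differ for anything else. Under the old rule a wheel with a new tag would have uploaded index files by default, which may be why the check was inverted alongside the new CUDA 13.0 and arm64 CPU wheel builds; the diff itself does not state the motivation.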

@@ -50,7 +50,7 @@ steps:
 
 - label: Async Engine, Inputs, Utils, Worker Test # 36min
 timeout_in_minutes: 50
-mirror_hardwares: [amdexperimental]
+mirror_hardwares: [amdexperimental, amdproduction]
 agent_pool: mi325_1
 # grade: Blocking
 source_file_dependencies:
@@ -63,7 +63,7 @@ steps:
 
 - label: Async Engine, Inputs, Utils, Worker Test (CPU) # 4 mins
 timeout_in_minutes: 10
-mirror_hardwares: [amdexperimental]
+mirror_hardwares: [amdexperimental, amdproduction]
 agent_pool: mi325_1
 # grade: Blocking
 source_file_dependencies:
||||||
@@ -353,7 +353,7 @@ steps:
|
|||||||
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
|
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
|
||||||
|
|
||||||
- label: V1 Test others (CPU) # 5 mins
|
- label: V1 Test others (CPU) # 5 mins
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_1
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
@@ -395,7 +395,9 @@ steps:
|
|||||||
- python3 offline_inference/basic/embed.py
|
- python3 offline_inference/basic/embed.py
|
||||||
- python3 offline_inference/basic/score.py
|
- python3 offline_inference/basic/score.py
|
||||||
- python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
|
- python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
|
||||||
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
|
# https://github.com/vllm-project/vllm/pull/26682 uses slightly more memory in PyTorch 2.9+ causing this test to OOM in 1xL4 GPU
|
||||||
|
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 1536
|
||||||
|
#- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
|
||||||
|
|
||||||
- label: Platform Tests (CUDA) # 4min
|
- label: Platform Tests (CUDA) # 4min
|
||||||
timeout_in_minutes: 15
|
timeout_in_minutes: 15
|
||||||
@@ -436,7 +438,11 @@ steps:
|
|||||||
--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
|
--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
|
||||||
--ignore=lora/test_chatglm3_tp.py \
|
--ignore=lora/test_chatglm3_tp.py \
|
||||||
--ignore=lora/test_llama_tp.py \
|
--ignore=lora/test_llama_tp.py \
|
||||||
--ignore=lora/test_llm_with_multi_loras.py
|
--ignore=lora/test_llm_with_multi_loras.py \
|
||||||
|
--ignore=lora/test_olmoe_tp.py \
|
||||||
|
--ignore=lora/test_deepseekv2_tp.py \
|
||||||
|
--ignore=lora/test_gptoss.py \
|
||||||
|
--ignore=lora/test_qwen3moe_tp.py
|
||||||
parallelism: 4
|
parallelism: 4
|
||||||
|
|
||||||
- label: PyTorch Compilation Unit Tests # 15min
|
- label: PyTorch Compilation Unit Tests # 15min
|
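Reviewer note: the hunk above extends a backslash-continued `--ignore` list, so the former last entry needed a trailing `\` added before new entries could follow it. As a hedged alternative sketch (not what the pipeline does), the flags could be generated from a plain list of file names, which avoids continuation characters altogether. The helper name `build_ignore_flags` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn a list of test file names into pytest
# --ignore flags under lora/, joined by single spaces.
build_ignore_flags() {
  flags=""
  for f in "$@"; do
    flags="$flags --ignore=lora/$f"
  done
  # Drop the single leading space left by the first iteration.
  echo "${flags# }"
}

# The four files newly ignored on the right-hand side of the hunk.
build_ignore_flags test_olmoe_tp.py test_deepseekv2_tp.py test_gptoss.py test_qwen3moe_tp.py
```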
||||||
@@ -454,11 +460,12 @@ steps:
|
|||||||
- pytest -v -s compile/test_fusion_attn.py
|
- pytest -v -s compile/test_fusion_attn.py
|
||||||
- pytest -v -s compile/test_functionalization.py
|
- pytest -v -s compile/test_functionalization.py
|
||||||
- pytest -v -s compile/test_silu_mul_quant_fusion.py
|
- pytest -v -s compile/test_silu_mul_quant_fusion.py
|
||||||
- pytest -v -s compile/test_sequence_parallelism.py
|
# - pytest -v -s compile/test_sequence_parallelism.py
|
||||||
- pytest -v -s compile/test_async_tp.py
|
# - pytest -v -s compile/test_async_tp.py
|
||||||
- pytest -v -s compile/test_fusion_all_reduce.py
|
- pytest -v -s compile/test_fusion_all_reduce.py
|
||||||
- pytest -v -s compile/test_decorator.py
|
- pytest -v -s compile/test_decorator.py
|
||||||
- pytest -v -s compile/test_noop_elimination.py
|
- pytest -v -s compile/test_noop_elimination.py
|
||||||
|
- pytest -v -s compile/test_aot_compile.py
|
||||||
|
|
||||||
- label: PyTorch Fullgraph Smoke Test # 15min
|
- label: PyTorch Fullgraph Smoke Test # 15min
|
||||||
timeout_in_minutes: 30
|
timeout_in_minutes: 30
|
||||||
@@ -473,8 +480,8 @@ steps:
|
|||||||
- pytest -v -s compile/test_basic_correctness.py
|
- pytest -v -s compile/test_basic_correctness.py
|
||||||
- pytest -v -s compile/piecewise/
|
- pytest -v -s compile/piecewise/
|
||||||
|
|
||||||
- label: PyTorch Fullgraph Test # 20min
|
- label: PyTorch Fullgraph Test # 22min
|
||||||
timeout_in_minutes: 30
|
timeout_in_minutes: 35
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_1
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
@@ -484,17 +491,19 @@ steps:
|
|||||||
- tests/compile
|
- tests/compile
|
||||||
commands:
|
commands:
|
||||||
- pytest -v -s compile/test_full_graph.py
|
- pytest -v -s compile/test_full_graph.py
|
||||||
|
- pytest -v -s compile/test_fusions_e2e.py
|
||||||
|
|
||||||
- label: Kernels Core Operation Test # 48min
|
- label: Kernels Core Operation Test # 48min
|
||||||
timeout_in_minutes: 75
|
timeout_in_minutes: 75
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_1
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- csrc/
|
- csrc/
|
||||||
- tests/kernels/core
|
- tests/kernels/core
|
||||||
|
- tests/kernels/test_top_k_per_row.py
|
||||||
commands:
|
commands:
|
||||||
- pytest -v -s kernels/core
|
- pytest -v -s kernels/core kernels/test_top_k_per_row.py
|
||||||
|
|
||||||
- label: Kernels Attention Test %N # 23min
|
- label: Kernels Attention Test %N # 23min
|
||||||
timeout_in_minutes: 35
|
timeout_in_minutes: 35
|
||||||
@@ -552,7 +561,7 @@ steps:
|
|||||||
|
|
||||||
- label: Model Executor Test # 23min
|
- label: Model Executor Test # 23min
|
||||||
timeout_in_minutes: 35
|
timeout_in_minutes: 35
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_1
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
@@ -603,8 +612,9 @@ steps:
|
|||||||
# since torchao nightly is only compatible with torch nightly currently
|
# since torchao nightly is only compatible with torch nightly currently
|
||||||
# https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now
|
# https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now
|
||||||
# we can only upgrade after this is resolved
|
# we can only upgrade after this is resolved
|
||||||
- pip install --pre torchao==0.13.0.dev20250814 --index-url https://download.pytorch.org/whl/nightly/cu128
|
# TODO(jerryzh168): resolve the above comment
|
||||||
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/
|
- uv pip install --system torchao==0.13.0
|
||||||
|
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py
|
||||||
|
|
||||||
- label: LM Eval Small Models # 53min
|
- label: LM Eval Small Models # 53min
|
||||||
timeout_in_minutes: 75
|
timeout_in_minutes: 75
|
||||||
@@ -631,7 +641,7 @@ steps:
|
|||||||
|
|
||||||
- label: OpenAI-Compatible Tool Use # 23 min
|
- label: OpenAI-Compatible Tool Use # 23 min
|
||||||
timeout_in_minutes: 35
|
timeout_in_minutes: 35
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_1
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
fast_check: false
|
fast_check: false
|
||||||
@@ -779,8 +789,10 @@ steps:
|
|||||||
- vllm/
|
- vllm/
|
||||||
- tests/models/language/generation
|
- tests/models/language/generation
|
||||||
commands:
|
commands:
|
||||||
# Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
|
# Install fast path packages for testing against transformers
|
||||||
- pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
|
# Note: also needed to run plamo2 model in vLLM
|
||||||
|
- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
|
||||||
|
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
|
||||||
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
|
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
|
||||||
|
|
||||||
- label: Language Models Test (PPL)
|
- label: Language Models Test (PPL)
|
||||||
@@ -846,6 +858,18 @@ steps:
|
|||||||
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
|
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
|
||||||
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
|
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
|
||||||
|
|
||||||
|
- label: Multi-Modal Accuracy Eval (Small Models) # 50min
|
||||||
|
mirror_hardwares: [amdexperimental]
|
||||||
|
agent_pool: mi325_1
|
||||||
|
timeout_in_minutes: 70
|
||||||
|
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/multimodal/
|
||||||
|
- vllm/inputs/
|
||||||
|
- vllm/v1/core/
|
||||||
|
commands:
|
||||||
|
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1
|
||||||
|
|
||||||
- label: Multi-Modal Models Test (Extended) 1
|
- label: Multi-Modal Models Test (Extended) 1
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental]
|
||||||
agent_pool: mi325_1
|
agent_pool: mi325_1
|
||||||
@@ -921,8 +945,8 @@ steps:
|
|||||||
# Whisper needs spawn method to avoid deadlock
|
# Whisper needs spawn method to avoid deadlock
|
||||||
- VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper
|
- VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper
|
||||||
|
|
||||||
- label: Blackwell Test # 38 min
|
- label: Blackwell Test # 21 min
|
||||||
timeout_in_minutes: 60
|
timeout_in_minutes: 30
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
gpu: b200
|
gpu: b200
|
||||||
# optional: true
|
# optional: true
|
||||||
@@ -935,8 +959,6 @@ steps:
|
|||||||
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
|
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
- vllm/v1/attention/backends/flashinfer.py
|
||||||
- vllm/compilation/fusion.py
|
|
||||||
- vllm/compilation/fusion_attn.py
|
|
||||||
commands:
|
commands:
|
||||||
- nvidia-smi
|
- nvidia-smi
|
||||||
- python3 examples/offline_inference/basic/chat.py
|
- python3 examples/offline_inference/basic/chat.py
|
||||||
@@ -953,13 +975,32 @@ steps:
|
|||||||
- pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
|
- pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
|
||||||
- pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
|
- pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
|
||||||
- pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
|
- pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
|
||||||
|
- pytest -v -s tests/kernels/quantization/test_nvfp4_qutlass.py
|
||||||
|
- pytest -v -s tests/kernels/quantization/test_mxfp4_qutlass.py
|
||||||
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
|
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
|
||||||
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
|
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
|
||||||
# Fusion
|
|
||||||
- pytest -v -s tests/compile/test_fusion_all_reduce.py
|
|
||||||
- pytest -v -s tests/compile/test_fusion_attn.py::test_attention_quant_pattern
|
|
||||||
- pytest -v -s tests/kernels/moe/test_flashinfer.py
|
- pytest -v -s tests/kernels/moe/test_flashinfer.py
|
||||||
|
|
||||||
|
- label: Blackwell Fusion Tests # 30 min
|
||||||
|
timeout_in_minutes: 40
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
gpu: b200
|
||||||
|
source_file_dependencies:
|
||||||
|
- csrc/quantization/fp4/
|
||||||
|
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
||||||
|
- vllm/v1/attention/backends/flashinfer.py
|
||||||
|
- vllm/compilation/
|
||||||
|
# can affect pattern matching
|
||||||
|
- vllm/model_executor/layers/layernorm.py
|
||||||
|
- vllm/model_executor/layers/activation.py
|
||||||
|
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
- pytest -v -s tests/compile/test_fusion_attn.py
|
||||||
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
||||||
|
# this runner has 2 GPUs available even though num_gpus=2 is not set
|
||||||
|
- pytest -v -s tests/compile/test_fusion_all_reduce.py
|
||||||
|
- pytest -v -s tests/compile/test_fusions_e2e.py
|
||||||
|
|
||||||
- label: Blackwell GPT-OSS Eval
|
- label: Blackwell GPT-OSS Eval
|
||||||
timeout_in_minutes: 60
|
timeout_in_minutes: 60
|
||||||
@@ -1079,6 +1120,7 @@ steps:
|
|||||||
- pytest -v -s ./compile/test_basic_correctness.py
|
- pytest -v -s ./compile/test_basic_correctness.py
|
||||||
- pytest -v -s ./compile/test_wrapper.py
|
- pytest -v -s ./compile/test_wrapper.py
|
||||||
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
||||||
|
- VLLM_TEST_SAME_HOST=1 VLLM_TEST_WITH_DEFAULT_DEVICE_SET=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
||||||
- pytest -v -s distributed/test_sequence_parallel.py
|
- pytest -v -s distributed/test_sequence_parallel.py
|
||||||
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
|
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
|
||||||
- pytest -v -s v1/worker/test_worker_memory_snapshot.py
|
- pytest -v -s v1/worker/test_worker_memory_snapshot.py
|
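Reviewer note: the `torchrun ... | grep 'Same node test passed'` steps in the hunk above rely on the pipeline's exit status being that of `grep`, so a step passes only when the marker string appears in the output. The sketch below shows this behaviour with two stand-in functions. `fake_test_ok` and `fake_test_bad` are hypothetical and replace the real `torchrun` invocation, and `-q` is added only to keep the demo quiet.

```shell
#!/bin/sh
# Stand-ins for the real torchrun command (hypothetical).
fake_test_ok()  { echo "rank 0: Same node test passed"; }
fake_test_bad() { echo "rank 0: unexpected output"; }

# Without pipefail, the pipeline's status is grep's status:
# 0 if the marker is found, non-zero otherwise.
if fake_test_ok | grep -q 'Same node test passed'; then
  echo "step passes"
fi

if ! fake_test_bad | grep -q 'Same node test passed'; then
  echo "step fails"
fi
```

Because only `grep`'s status counts here, a run that prints the marker but then exits non-zero would still be reported as passing unless `set -o pipefail` is enabled in a shell that supports it.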
||||||
@@ -1126,6 +1168,11 @@ steps:
|
|||||||
- pytest -v -s plugins_tests/test_io_processor_plugins.py
|
- pytest -v -s plugins_tests/test_io_processor_plugins.py
|
||||||
- pip uninstall prithvi_io_processor_plugin -y
|
- pip uninstall prithvi_io_processor_plugin -y
|
||||||
# end io_processor plugins test
|
# end io_processor plugins test
|
||||||
|
# begin stat_logger plugins test
|
||||||
|
- pip install -e ./plugins/vllm_add_dummy_stat_logger
|
||||||
|
- pytest -v -s plugins_tests/test_stats_logger_plugins.py
|
||||||
|
- pip uninstall dummy_stat_logger -y
|
||||||
|
# end stat_logger plugins test
|
||||||
# other tests continue here:
|
# other tests continue here:
|
||||||
- pytest -v -s plugins_tests/test_scheduler_plugins.py
|
- pytest -v -s plugins_tests/test_scheduler_plugins.py
|
||||||
- pip install -e ./plugins/vllm_add_dummy_model
|
- pip install -e ./plugins/vllm_add_dummy_model
|
||||||
@@ -1169,7 +1216,7 @@ steps:
|
|||||||
- pytest -v -s -x lora/test_chatglm3_tp.py
|
- pytest -v -s -x lora/test_chatglm3_tp.py
|
||||||
- pytest -v -s -x lora/test_llama_tp.py
|
- pytest -v -s -x lora/test_llama_tp.py
|
||||||
- pytest -v -s -x lora/test_llm_with_multi_loras.py
|
- pytest -v -s -x lora/test_llm_with_multi_loras.py
|
||||||
|
- pytest -v -s -x lora/test_olmoe_tp.py
|
||||||
|
|
||||||
- label: Weight Loading Multiple GPU Test # 33min
|
- label: Weight Loading Multiple GPU Test # 33min
|
||||||
timeout_in_minutes: 45
|
timeout_in_minutes: 45
|
||||||
@@ -1199,6 +1246,18 @@ steps:
|
|||||||
commands:
|
commands:
|
||||||
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
|
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
|
||||||
|
|
||||||
|
- label: NixlConnector PD accuracy tests (Distributed) # 30min
|
||||||
|
mirror_hardwares: [amdexperimental]
|
||||||
|
agent_pool: mi325_4
|
||||||
|
timeout_in_minutes: 30
|
||||||
|
working_dir: "/vllm-workspace/tests"
|
||||||
|
num_gpus: 4
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
|
||||||
|
- tests/v1/kv_connector/nixl_integration/
|
||||||
|
commands:
|
||||||
|
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
|
||||||
|
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
|
||||||
|
|
||||||
##### multi gpus test #####
|
##### multi gpus test #####
|
||||||
##### A100 test #####
|
##### A100 test #####
|
||||||
@@ -1230,12 +1289,16 @@ steps:
|
|||||||
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4
|
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4
|
||||||
|
|
||||||
##### H200 test #####
|
##### H200 test #####
|
||||||
- label: Distrubted Tests (H200) # optional
|
- label: Distributed Tests (H200) # optional
|
||||||
gpu: h200
|
gpu: h200
|
||||||
optional: true
|
optional: true
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
num_gpus: 2
|
num_gpus: 2
|
||||||
commands:
|
commands:
|
||||||
|
- pytest -v -s tests/compile/test_async_tp.py
|
||||||
|
- pytest -v -s tests/compile/test_sequence_parallelism.py
|
||||||
|
- pytest -v -s tests/compile/test_fusion_all_reduce.py
|
||||||
|
- pytest -v -s tests/compile/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm
|
||||||
- pytest -v -s tests/distributed/test_context_parallel.py
|
- pytest -v -s tests/distributed/test_context_parallel.py
|
||||||
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
|
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
|
||||||
|
|
||||||
|
|||||||
@@ -172,6 +172,8 @@ steps:
|
|||||||
- tests/v1/engine/test_engine_core_client.py
|
- tests/v1/engine/test_engine_core_client.py
|
||||||
- tests/distributed/test_symm_mem_allreduce.py
|
- tests/distributed/test_symm_mem_allreduce.py
|
||||||
commands:
|
commands:
|
||||||
|
# https://github.com/NVIDIA/nccl/issues/1838
|
||||||
|
- export NCCL_CUMEM_HOST_ENABLE=0
|
||||||
# test with torchrun tp=2 and external_dp=2
|
# test with torchrun tp=2 and external_dp=2
|
||||||
- torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
|
- torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
|
||||||
# test with torchrun tp=2 and pp=2
|
# test with torchrun tp=2 and pp=2
|
||||||
@@ -311,6 +313,15 @@ steps:
|
|||||||
- pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api
|
- pip install -U git+https://github.com/robertgshaw2-redhat/lm-evaluation-harness.git@streaming-api
|
||||||
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
|
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
|
||||||
|
|
||||||
|
- label: V1 Test attention (H100) # 10min
|
||||||
|
timeout_in_minutes: 30
|
||||||
|
gpu: h100
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/v1/attention
|
||||||
|
- tests/v1/attention
|
||||||
|
commands:
|
||||||
|
- pytest -v -s v1/attention
|
||||||
|
|
||||||
- label: V1 Test others (CPU) # 5 mins
|
- label: V1 Test others (CPU) # 5 mins
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
@@ -349,7 +360,8 @@ steps:
|
|||||||
- python3 offline_inference/basic/embed.py
|
- python3 offline_inference/basic/embed.py
|
||||||
- python3 offline_inference/basic/score.py
|
- python3 offline_inference/basic/score.py
|
||||||
- python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
|
- python3 offline_inference/spec_decode.py --test --method eagle --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
|
||||||
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 2048
|
# https://github.com/vllm-project/vllm/pull/26682 uses slightly more memory in PyTorch 2.9+ causing this test to OOM in 1xL4 GPU
|
||||||
|
- python3 offline_inference/spec_decode.py --test --method eagle3 --num_spec_tokens 3 --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --temp 0 --top-p 1.0 --top-k -1 --tp 1 --enable-chunked-prefill --max-model-len 1536
|
||||||
|
|
||||||
- label: Platform Tests (CUDA) # 4min
|
- label: Platform Tests (CUDA) # 4min
|
||||||
timeout_in_minutes: 15
|
timeout_in_minutes: 15
|
||||||
@@ -384,7 +396,12 @@ steps:
|
|||||||
--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
|
--num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT \
|
||||||
--ignore=lora/test_chatglm3_tp.py \
|
--ignore=lora/test_chatglm3_tp.py \
|
||||||
--ignore=lora/test_llama_tp.py \
|
--ignore=lora/test_llama_tp.py \
|
||||||
--ignore=lora/test_llm_with_multi_loras.py
|
--ignore=lora/test_llm_with_multi_loras.py \
|
||||||
|
--ignore=lora/test_olmoe_tp.py \
|
||||||
|
--ignore=lora/test_deepseekv2_tp.py \
|
||||||
|
--ignore=lora/test_gptoss.py \
|
||||||
|
--ignore=lora/test_qwen3moe_tp.py
|
||||||
|
|
||||||
parallelism: 4
|
parallelism: 4
|
||||||
|
|
||||||
- label: PyTorch Compilation Unit Tests # 15min
|
- label: PyTorch Compilation Unit Tests # 15min
|
||||||
@@ -416,8 +433,8 @@ steps:
|
|||||||
- pytest -v -s compile/test_basic_correctness.py
|
- pytest -v -s compile/test_basic_correctness.py
|
||||||
- pytest -v -s compile/piecewise/
|
- pytest -v -s compile/piecewise/
|
||||||
|
|
||||||
- label: PyTorch Fullgraph Test # 20min
|
- label: PyTorch Fullgraph Test # 22min
|
||||||
timeout_in_minutes: 30
|
timeout_in_minutes: 35
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental]
|
||||||
torch_nightly: true
|
torch_nightly: true
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
@@ -425,6 +442,19 @@ steps:
|
|||||||
- tests/compile
|
- tests/compile
|
||||||
commands:
|
commands:
|
||||||
- pytest -v -s compile/test_full_graph.py
|
- pytest -v -s compile/test_full_graph.py
|
||||||
|
- pytest -v -s compile/test_fusions_e2e.py
|
||||||
|
|
||||||
|
- label: Cudagraph test
|
||||||
|
timeout_in_minutes: 20
|
||||||
|
mirror_hardwares: [amdexperimental]
|
||||||
|
source_file_dependencies:
|
||||||
|
- tests/v1/cudagraph
|
||||||
|
- vllm/v1/cudagraph_dispatcher.py
|
||||||
|
- vllm/config/compilation.py
|
||||||
|
- vllm/compilation
|
||||||
|
commands:
|
||||||
|
- pytest -v -s v1/cudagraph/test_cudagraph_dispatch.py
|
||||||
|
- pytest -v -s v1/cudagraph/test_cudagraph_mode.py
|
||||||
|
|
||||||
- label: Kernels Core Operation Test # 48min
|
- label: Kernels Core Operation Test # 48min
|
||||||
timeout_in_minutes: 75
|
timeout_in_minutes: 75
|
||||||
@@ -527,8 +557,9 @@ steps:
|
|||||||
# since torchao nightly is only compatible with torch nightly currently
|
# since torchao nightly is only compatible with torch nightly currently
|
||||||
# https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now
|
# https://github.com/pytorch/ao/issues/2919, we'll have to skip new torchao tests for now
|
||||||
# we can only upgrade after this is resolved
|
# we can only upgrade after this is resolved
|
||||||
- pip install --pre torchao==0.13.0.dev20250814 --index-url https://download.pytorch.org/whl/nightly/cu128
|
# TODO(jerryzh168): resolve the above comment
|
||||||
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/
|
- uv pip install --system torchao==0.13.0 --index-url https://download.pytorch.org/whl/cu129
|
||||||
|
- VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py
|
||||||
|
|
||||||
- label: LM Eval Small Models # 53min
|
- label: LM Eval Small Models # 53min
|
||||||
timeout_in_minutes: 75
|
timeout_in_minutes: 75
|
||||||
@@ -677,8 +708,10 @@ steps:
|
|||||||
- vllm/
|
- vllm/
|
||||||
- tests/models/language/generation
|
- tests/models/language/generation
|
||||||
commands:
|
commands:
|
||||||
# Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
|
# Install fast path packages for testing against transformers
|
||||||
- pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
|
# Note: also needed to run plamo2 model in vLLM
|
||||||
|
- uv pip install --system --no-build-isolation 'git+https://github.com/state-spaces/mamba@v2.2.5'
|
||||||
|
- uv pip install --system --no-build-isolation 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.2'
|
||||||
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
|
- pytest -v -s models/language/generation -m '(not core_model) and (not hybrid_model)'
|
||||||
|
|
||||||
- label: Language Models Test (PPL)
|
- label: Language Models Test (PPL)
|
||||||
@@ -733,6 +766,16 @@ steps:
|
|||||||
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
|
- pytest -v -s models/multimodal -m core_model --ignore models/multimodal/generation/test_whisper.py --ignore models/multimodal/processing
|
||||||
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
|
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
|
||||||
|
|
||||||
|
- label: Multi-Modal Accuracy Eval (Small Models) # 50min
|
||||||
|
timeout_in_minutes: 70
|
||||||
|
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/multimodal/
|
||||||
|
- vllm/inputs/
|
||||||
|
- vllm/v1/core/
|
||||||
|
commands:
|
||||||
|
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-mm-small.txt --tp-size=1
|
||||||
|
|
||||||
- label: Multi-Modal Models Test (Extended) 1
|
- label: Multi-Modal Models Test (Extended) 1
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental]
|
||||||
optional: true
|
optional: true
|
||||||
@@ -796,8 +839,8 @@ steps:
|
|||||||
# Whisper needs spawn method to avoid deadlock
|
# Whisper needs spawn method to avoid deadlock
|
||||||
- VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper
|
- VLLM_WORKER_MULTIPROC_METHOD=spawn python3 examples/offline_inference/audio_language.py --model-type whisper
|
||||||
|
|
||||||
- label: Blackwell Test # 38 min
|
- label: Blackwell Test # 21 min
|
||||||
timeout_in_minutes: 60
|
timeout_in_minutes: 30
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
gpu: b200
|
gpu: b200
|
||||||
# optional: true
|
# optional: true
|
||||||
@@ -810,8 +853,6 @@ steps:
|
|||||||
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
|
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
- vllm/v1/attention/backends/flashinfer.py
|
||||||
- vllm/compilation/fusion.py
|
|
||||||
- vllm/compilation/fusion_attn.py
|
|
||||||
commands:
|
commands:
|
||||||
- nvidia-smi
|
- nvidia-smi
|
||||||
- python3 examples/offline_inference/basic/chat.py
|
- python3 examples/offline_inference/basic/chat.py
|
||||||
@@ -828,15 +869,32 @@ steps:
|
|||||||
- pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
|
- pytest -v -s tests/kernels/quantization/test_nvfp4_scaled_mm.py
|
||||||
- pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
|
- pytest -v -s tests/kernels/quantization/test_flashinfer_scaled_mm.py
|
||||||
- pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
|
- pytest -v -s tests/kernels/quantization/test_flashinfer_nvfp4_scaled_mm.py
|
||||||
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
|
|
||||||
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
|
|
||||||
# Fusion
|
|
||||||
- pytest -v -s tests/compile/test_fusion_all_reduce.py
|
|
||||||
- pytest -v -s tests/compile/test_fusion_attn.py::test_attention_quant_pattern
|
|
||||||
- pytest -v -s tests/kernels/moe/test_flashinfer.py
|
|
||||||
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
|
||||||
- pytest -v -s tests/kernels/quantization/test_nvfp4_qutlass.py
|
- pytest -v -s tests/kernels/quantization/test_nvfp4_qutlass.py
|
||||||
- pytest -v -s tests/kernels/quantization/test_mxfp4_qutlass.py
|
- pytest -v -s tests/kernels/quantization/test_mxfp4_qutlass.py
|
||||||
|
- pytest -v -s tests/kernels/moe/test_nvfp4_moe.py
|
||||||
|
- pytest -v -s tests/kernels/moe/test_ocp_mx_moe.py
|
||||||
|
- pytest -v -s tests/kernels/moe/test_flashinfer.py
|
||||||
|
|
||||||
|
- label: Blackwell Fusion Tests # 30 min
|
||||||
|
timeout_in_minutes: 40
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
gpu: b200
|
||||||
|
source_file_dependencies:
|
||||||
|
- csrc/quantization/fp4/
|
||||||
|
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
||||||
|
- vllm/v1/attention/backends/flashinfer.py
|
||||||
|
- vllm/compilation/
|
||||||
|
# can affect pattern matching
|
||||||
|
- vllm/model_executor/layers/layernorm.py
|
||||||
|
- vllm/model_executor/layers/activation.py
|
||||||
|
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
- pytest -v -s tests/compile/test_fusion_attn.py
|
||||||
|
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
||||||
|
# this runner has 2 GPUs available even though num_gpus=2 is not set
|
||||||
|
- pytest -v -s tests/compile/test_fusion_all_reduce.py
|
||||||
|
- pytest -v -s tests/compile/test_fusions_e2e.py
|
||||||
|
|
||||||
- label: Blackwell GPT-OSS Eval
|
- label: Blackwell GPT-OSS Eval
|
||||||
timeout_in_minutes: 60
|
timeout_in_minutes: 60
|
||||||
@@ -943,6 +1001,8 @@ steps:
|
|||||||
- tests/v1/shutdown
|
- tests/v1/shutdown
|
||||||
- tests/v1/worker/test_worker_memory_snapshot.py
|
- tests/v1/worker/test_worker_memory_snapshot.py
|
||||||
commands:
|
commands:
|
||||||
|
# https://github.com/NVIDIA/nccl/issues/1838
|
||||||
|
- export NCCL_CUMEM_HOST_ENABLE=0
|
||||||
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
|
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_async_llm_dp.py
|
||||||
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
|
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/distributed/test_external_lb_dp.py
|
||||||
- DP_SIZE=2 pytest -v -s v1/entrypoints/openai/test_multi_api_servers.py
|
- DP_SIZE=2 pytest -v -s v1/entrypoints/openai/test_multi_api_servers.py
|
||||||
@@ -950,6 +1010,7 @@ steps:
|
|||||||
- pytest -v -s ./compile/test_basic_correctness.py
|
- pytest -v -s ./compile/test_basic_correctness.py
|
||||||
- pytest -v -s ./compile/test_wrapper.py
|
- pytest -v -s ./compile/test_wrapper.py
|
||||||
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
||||||
|
- VLLM_TEST_SAME_HOST=1 VLLM_TEST_WITH_DEFAULT_DEVICE_SET=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
||||||
- pytest -v -s distributed/test_sequence_parallel.py
|
- pytest -v -s distributed/test_sequence_parallel.py
|
||||||
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
|
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
|
||||||
- pytest -v -s v1/worker/test_worker_memory_snapshot.py
|
- pytest -v -s v1/worker/test_worker_memory_snapshot.py
|
||||||
@@ -993,6 +1054,11 @@ steps:
|
|||||||
- pytest -v -s plugins_tests/test_io_processor_plugins.py
|
- pytest -v -s plugins_tests/test_io_processor_plugins.py
|
||||||
- pip uninstall prithvi_io_processor_plugin -y
|
- pip uninstall prithvi_io_processor_plugin -y
|
||||||
# end io_processor plugins test
|
# end io_processor plugins test
|
||||||
|
# begin stat_logger plugins test
|
||||||
|
- pip install -e ./plugins/vllm_add_dummy_stat_logger
|
||||||
|
- pytest -v -s plugins_tests/test_stats_logger_plugins.py
|
||||||
|
- pip uninstall dummy_stat_logger -y
|
||||||
|
# end stat_logger plugins test
|
||||||
# other tests continue here:
|
# other tests continue here:
|
||||||
- pytest -v -s plugins_tests/test_scheduler_plugins.py
|
- pytest -v -s plugins_tests/test_scheduler_plugins.py
|
||||||
- pip install -e ./plugins/vllm_add_dummy_model
|
- pip install -e ./plugins/vllm_add_dummy_model
|
||||||
@@ -1032,6 +1098,7 @@ steps:
|
|||||||
- pytest -v -s -x lora/test_chatglm3_tp.py
|
- pytest -v -s -x lora/test_chatglm3_tp.py
|
||||||
- pytest -v -s -x lora/test_llama_tp.py
|
- pytest -v -s -x lora/test_llama_tp.py
|
||||||
- pytest -v -s -x lora/test_llm_with_multi_loras.py
|
- pytest -v -s -x lora/test_llm_with_multi_loras.py
|
||||||
|
- pytest -v -s -x lora/test_olmoe_tp.py
|
||||||
|
|
||||||
|
|
||||||
- label: Weight Loading Multiple GPU Test # 33min
|
- label: Weight Loading Multiple GPU Test # 33min
|
||||||
@@ -1058,6 +1125,17 @@ steps:
|
|||||||
commands:
|
commands:
|
||||||
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
|
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
|
||||||
|
|
||||||
|
- label: NixlConnector PD accuracy tests (Distributed) # 30min
|
||||||
|
timeout_in_minutes: 30
|
||||||
|
working_dir: "/vllm-workspace/tests"
|
||||||
|
num_gpus: 4
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
|
||||||
|
- tests/v1/kv_connector/nixl_integration/
|
||||||
|
commands:
|
||||||
|
- uv pip install --system -r /vllm-workspace/requirements/kv_connectors.txt
|
||||||
|
- bash v1/kv_connector/nixl_integration/tp_config_sweep_accuracy_test.sh
|
||||||
|
|
||||||
|
|
||||||
##### multi gpus test #####
|
##### multi gpus test #####
|
||||||
##### A100 test #####
|
##### A100 test #####
|
||||||
@@ -1089,7 +1167,7 @@ steps:
|
|||||||
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4
|
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4
|
||||||
|
|
||||||
##### H200 test #####
|
##### H200 test #####
|
||||||
- label: Distrubted Tests (H200) # optional
|
- label: Distributed Tests (H200) # optional
|
||||||
gpu: h200
|
gpu: h200
|
||||||
optional: true
|
optional: true
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
@@ -1097,6 +1175,8 @@ steps:
|
|||||||
commands:
|
commands:
|
||||||
- pytest -v -s tests/compile/test_async_tp.py
|
- pytest -v -s tests/compile/test_async_tp.py
|
||||||
- pytest -v -s tests/compile/test_sequence_parallelism.py
|
- pytest -v -s tests/compile/test_sequence_parallelism.py
|
||||||
|
- pytest -v -s tests/compile/test_fusion_all_reduce.py
|
||||||
|
- pytest -v -s tests/compile/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm
|
||||||
- pytest -v -s tests/distributed/test_context_parallel.py
|
- pytest -v -s tests/distributed/test_context_parallel.py
|
||||||
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
|
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
|
||||||
|
|
||||||
|
|||||||
14
.github/CODEOWNERS
vendored
@@ -5,10 +5,8 @@
|
|||||||
/vllm/attention @LucasWilkinson
|
/vllm/attention @LucasWilkinson
|
||||||
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
|
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
|
||||||
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @22quinn
|
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @22quinn
|
||||||
/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @22quinn
|
/vllm/model_executor/layers/fused_moe @mgoin @pavanimajety
|
||||||
/vllm/model_executor/layers/fused_moe @mgoin
|
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @pavanimajety
|
||||||
/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill @NickLucche
|
|
||||||
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256
|
|
||||||
/vllm/model_executor/layers/mamba @tdoublep
|
/vllm/model_executor/layers/mamba @tdoublep
|
||||||
/vllm/model_executor/model_loader @22quinn
|
/vllm/model_executor/model_loader @22quinn
|
||||||
/vllm/multimodal @DarkLight1337 @ywang96 @NickLucche
|
/vllm/multimodal @DarkLight1337 @ywang96 @NickLucche
|
||||||
@@ -26,9 +24,9 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
|
|||||||
/vllm/config/cache.py @simon-mo @WoosukKwon @youkaichao @robertgshaw2-redhat @mgoin @tlrmchlsmth @houseroad @hmellor @yewentao256 @ProExpertProg @heheda12345
|
/vllm/config/cache.py @simon-mo @WoosukKwon @youkaichao @robertgshaw2-redhat @mgoin @tlrmchlsmth @houseroad @hmellor @yewentao256 @ProExpertProg @heheda12345
|
||||||
|
|
||||||
# vLLM V1
|
# vLLM V1
|
||||||
/vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat
|
|
||||||
/vllm/v1/attention @LucasWilkinson
|
/vllm/v1/attention @LucasWilkinson
|
||||||
/vllm/v1/attention/backends/flashinfer.py @mgoin
|
/vllm/v1/attention/backends/mla @pavanimajety
|
||||||
|
/vllm/v1/attention/backends/flashinfer.py @mgoin @pavanimajety
|
||||||
/vllm/v1/attention/backends/triton_attn.py @tdoublep
|
/vllm/v1/attention/backends/triton_attn.py @tdoublep
|
||||||
/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat @heheda12345 @ApostaC
|
/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat @heheda12345 @ApostaC
|
||||||
/vllm/v1/sample @22quinn @houseroad @njhill
|
/vllm/v1/sample @22quinn @houseroad @njhill
|
||||||
@@ -47,7 +45,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
|
|||||||
/tests/kernels @mgoin @tlrmchlsmth @WoosukKwon @yewentao256
|
/tests/kernels @mgoin @tlrmchlsmth @WoosukKwon @yewentao256
|
||||||
/tests/models @DarkLight1337 @ywang96
|
/tests/models @DarkLight1337 @ywang96
|
||||||
/tests/multimodal @DarkLight1337 @ywang96 @NickLucche
|
/tests/multimodal @DarkLight1337 @ywang96 @NickLucche
|
||||||
/tests/quantization @mgoin @robertgshaw2-redhat @yewentao256
|
/tests/quantization @mgoin @robertgshaw2-redhat @yewentao256 @pavanimajety
|
||||||
/tests/test_inputs.py @DarkLight1337 @ywang96
|
/tests/test_inputs.py @DarkLight1337 @ywang96
|
||||||
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
|
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
|
||||||
/tests/v1/structured_output @mgoin @russellb @aarnphm
|
/tests/v1/structured_output @mgoin @russellb @aarnphm
|
||||||
@@ -60,7 +58,7 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
|
|||||||
/tests/v1/offloading @ApostaC
|
/tests/v1/offloading @ApostaC
|
||||||
|
|
||||||
# Transformers backend
|
# Transformers backend
|
||||||
/vllm/model_executor/models/transformers.py @hmellor
|
/vllm/model_executor/models/transformers @hmellor
|
||||||
/tests/models/test_transformers.py @hmellor
|
/tests/models/test_transformers.py @hmellor
|
||||||
|
|
||||||
# Docs
|
# Docs
|
||||||
|
|||||||
3
.gitignore
vendored
@@ -94,6 +94,9 @@ ipython_config.py
|
|||||||
# generated files
|
# generated files
|
||||||
**/generated/**
|
**/generated/**
|
||||||
|
|
||||||
|
# uv
|
||||||
|
uv.lock
|
||||||
|
|
||||||
# pyenv
|
# pyenv
|
||||||
# For a library or package, you might want to ignore these files since the code is
|
# For a library or package, you might want to ignore these files since the code is
|
||||||
# intended to run in multiple environments; otherwise, check them in:
|
# intended to run in multiple environments; otherwise, check them in:
|
||||||
|
|||||||
@@ -4,7 +4,6 @@ MD013: false
|
|||||||
MD024:
|
MD024:
|
||||||
siblings_only: true
|
siblings_only: true
|
||||||
MD033: false
|
MD033: false
|
||||||
MD042: false
|
|
||||||
MD045: false
|
MD045: false
|
||||||
MD046: false
|
MD046: false
|
||||||
MD051: false
|
MD051: false
|
||||||
|
|||||||
@@ -38,7 +38,7 @@ repos:
|
|||||||
rev: 0.9.1
|
rev: 0.9.1
|
||||||
hooks:
|
hooks:
|
||||||
- id: pip-compile
|
- id: pip-compile
|
||||||
args: [requirements/test.in, -o, requirements/test.txt, --index-strategy, unsafe-best-match, --torch-backend, cu128, --python-platform, x86_64-manylinux_2_28]
|
args: [requirements/test.in, -o, requirements/test.txt, --index-strategy, unsafe-best-match, --torch-backend, cu129, --python-platform, x86_64-manylinux_2_28]
|
||||||
files: ^requirements/test\.(in|txt)$
|
files: ^requirements/test\.(in|txt)$
|
||||||
- repo: local
|
- repo: local
|
||||||
hooks:
|
hooks:
|
||||||
@@ -48,8 +48,8 @@ repos:
|
|||||||
entry: python tools/generate_nightly_torch_test.py
|
entry: python tools/generate_nightly_torch_test.py
|
||||||
files: ^requirements/test\.(in|txt)$
|
files: ^requirements/test\.(in|txt)$
|
||||||
- id: mypy-local
|
- id: mypy-local
|
||||||
name: Run mypy for local Python installation
|
name: Run mypy locally for lowest supported Python version
|
||||||
entry: python tools/pre_commit/mypy.py 0 "local"
|
entry: python tools/pre_commit/mypy.py 0 "3.10"
|
||||||
stages: [pre-commit] # Don't run in CI
|
stages: [pre-commit] # Don't run in CI
|
||||||
<<: &mypy_common
|
<<: &mypy_common
|
||||||
language: python
|
language: python
|
||||||
|
|||||||
@@ -49,8 +49,8 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1
|
|||||||
# requirements.txt files and should be kept consistent. The ROCm torch
|
# requirements.txt files and should be kept consistent. The ROCm torch
|
||||||
# versions are derived from docker/Dockerfile.rocm
|
# versions are derived from docker/Dockerfile.rocm
|
||||||
#
|
#
|
||||||
set(TORCH_SUPPORTED_VERSION_CUDA "2.8.0")
|
set(TORCH_SUPPORTED_VERSION_CUDA "2.9.0")
|
||||||
set(TORCH_SUPPORTED_VERSION_ROCM "2.8.0")
|
set(TORCH_SUPPORTED_VERSION_ROCM "2.9.0")
|
||||||
|
|
||||||
#
|
#
|
||||||
# Try to find python package with an executable that exactly matches
|
# Try to find python package with an executable that exactly matches
|
||||||
@@ -883,6 +883,7 @@ target_compile_definitions(_C PRIVATE CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL=1)
|
|||||||
set(VLLM_MOE_EXT_SRC
|
set(VLLM_MOE_EXT_SRC
|
||||||
"csrc/moe/torch_bindings.cpp"
|
"csrc/moe/torch_bindings.cpp"
|
||||||
"csrc/moe/moe_align_sum_kernels.cu"
|
"csrc/moe/moe_align_sum_kernels.cu"
|
||||||
|
"csrc/moe/moe_lora_align_sum_kernels.cu"
|
||||||
"csrc/moe/topk_softmax_kernels.cu")
|
"csrc/moe/topk_softmax_kernels.cu")
|
||||||
|
|
||||||
if(VLLM_GPU_LANG STREQUAL "CUDA")
|
if(VLLM_GPU_LANG STREQUAL "CUDA")
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ import gc
|
|||||||
from benchmark_utils import TimeCollector
|
from benchmark_utils import TimeCollector
|
||||||
from tabulate import tabulate
|
from tabulate import tabulate
|
||||||
|
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
from vllm.v1.core.block_pool import BlockPool
|
from vllm.v1.core.block_pool import BlockPool
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
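Note on the import change above: this comparison moves `FlexibleArgumentParser` from `vllm.utils` to `vllm.utils.argparse_utils`. A script that has to run against both layouts could try the new path first and fall back, much like the `try`/`except ImportError` block one of the benchmark scripts in this diff already uses. The following is only a sketch: the helper name `load_parser_class` is hypothetical and not part of vLLM, and the final fallback to the standard-library `ArgumentParser` is an assumption borrowed from that benchmark script.

```python
import importlib
from argparse import ArgumentParser


def load_parser_class():
    """Return FlexibleArgumentParser from whichever module path exists.

    Hypothetical helper (not part of vLLM): tries the new location first,
    then the old one, and finally falls back to argparse.ArgumentParser.
    """
    candidates = ("vllm.utils.argparse_utils", "vllm.utils")
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, "FlexibleArgumentParser")
        except (ImportError, AttributeError):
            # Module missing, or present without the class: try the next one.
            continue
    return ArgumentParser


FlexibleArgumentParser = load_parser_class()

parser = FlexibleArgumentParser(description="Example benchmark")
parser.add_argument("--batch-size", type=int, default=8)
args = parser.parse_args(["--batch-size", "16"])
print(args.batch_size)
```

The explicit candidate list keeps the lookup order visible in one place, so dropping the old path later is a one-line change.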
@@ -46,7 +46,7 @@ import time
|
|||||||
|
|
||||||
from vllm import LLM, SamplingParams
|
from vllm import LLM, SamplingParams
|
||||||
from vllm.engine.arg_utils import EngineArgs
|
from vllm.engine.arg_utils import EngineArgs
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
|
|
||||||
def test_long_document_qa(llm=None, sampling_params=None, prompts=None):
|
def test_long_document_qa(llm=None, sampling_params=None, prompts=None):
|
||||||
|
|||||||
@@ -19,7 +19,7 @@ from vllm.config import (
|
|||||||
VllmConfig,
|
VllmConfig,
|
||||||
)
|
)
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
from vllm.v1.spec_decode.ngram_proposer import NgramProposer
|
from vllm.v1.spec_decode.ngram_proposer import NgramProposer
|
||||||
from vllm.v1.worker.gpu_input_batch import InputBatch
|
from vllm.v1.worker.gpu_input_batch import InputBatch
|
||||||
from vllm.v1.worker.gpu_model_runner import GPUModelRunner
|
from vllm.v1.worker.gpu_model_runner import GPUModelRunner
|
||||||
|
|||||||
@@ -37,7 +37,7 @@ from transformers import PreTrainedTokenizerBase
|
|||||||
|
|
||||||
from vllm import LLM, SamplingParams
|
from vllm import LLM, SamplingParams
|
||||||
from vllm.engine.arg_utils import EngineArgs
|
from vllm.engine.arg_utils import EngineArgs
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
try:
|
try:
|
||||||
from vllm.transformers_utils.tokenizer import get_tokenizer
|
from vllm.transformers_utils.tokenizer import get_tokenizer
|
||||||
|
|||||||
@@ -11,7 +11,7 @@ import time
|
|||||||
from transformers import AutoTokenizer, PreTrainedTokenizerBase
|
from transformers import AutoTokenizer, PreTrainedTokenizerBase
|
||||||
|
|
||||||
from vllm.engine.arg_utils import EngineArgs
|
from vllm.engine.arg_utils import EngineArgs
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
|
|
||||||
# Select a equi-probable random priority
|
# Select a equi-probable random priority
|
||||||
|
|||||||
@@ -31,6 +31,7 @@ import time
|
|||||||
import uuid
|
import uuid
|
||||||
import warnings
|
import warnings
|
||||||
from collections.abc import AsyncGenerator
|
from collections.abc import AsyncGenerator
|
||||||
|
from contextlib import nullcontext
|
||||||
from dataclasses import dataclass
|
from dataclasses import dataclass
|
||||||
|
|
||||||
import datasets
|
import datasets
|
||||||
@@ -50,7 +51,7 @@ except ImportError:
|
|||||||
from backend_request_func import get_tokenizer
|
from backend_request_func import get_tokenizer
|
||||||
|
|
||||||
try:
|
try:
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
except ImportError:
|
except ImportError:
|
||||||
from argparse import ArgumentParser as FlexibleArgumentParser
|
from argparse import ArgumentParser as FlexibleArgumentParser
|
||||||
|
|
||||||
@@ -501,15 +502,9 @@ async def benchmark(
|
|||||||
|
|
||||||
pbar = None if disable_tqdm else tqdm(total=len(input_requests))
|
pbar = None if disable_tqdm else tqdm(total=len(input_requests))
|
||||||
|
|
||||||
# This can be used once the minimum Python version is 3.10 or higher,
|
semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else nullcontext()
|
||||||
# and it will simplify the code in limited_request_func.
|
|
||||||
# semaphore = (asyncio.Semaphore(max_concurrency)
|
|
||||||
# if max_concurrency else contextlib.nullcontext())
|
|
||||||
semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency else None
|
|
||||||
|
|
||||||
async def limited_request_func(request_func_input, pbar):
|
async def limited_request_func(request_func_input, pbar):
|
||||||
if semaphore is None:
|
|
||||||
return await request_func(request_func_input=request_func_input, pbar=pbar)
|
|
||||||
async with semaphore:
|
async with semaphore:
|
||||||
return await request_func(request_func_input=request_func_input, pbar=pbar)
|
return await request_func(request_func_input=request_func_input, pbar=pbar)
|
||||||
|
|
||||||
|
|||||||
@@ -15,7 +15,7 @@ from utils import make_rand_sparse_tensors
|
|||||||
from weight_shapes import WEIGHT_SHAPES
|
from weight_shapes import WEIGHT_SHAPES
|
||||||
|
|
||||||
from vllm import _custom_ops as ops
|
from vllm import _custom_ops as ops
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
|
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
|
||||||
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
|
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
|
||||||
|
|||||||
@@ -18,7 +18,8 @@ from vllm import _custom_ops as ops
|
|||||||
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
|
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
|
||||||
w8a8_triton_block_scaled_mm,
|
w8a8_triton_block_scaled_mm,
|
||||||
)
|
)
|
||||||
from vllm.utils import FlexibleArgumentParser, cdiv
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.math_utils import cdiv
|
||||||
|
|
||||||
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
|
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
|
||||||
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
|
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
|
||||||
|
|||||||
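Note on the `cdiv` import above: the hunk only shows that `cdiv` now comes from `vllm.utils.math_utils`, not how it is implemented. For readers unfamiliar with the name, ceiling division can be written with integer arithmetic alone. The function below is a stand-in written for illustration and is not copied from vLLM.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: the smallest integer q with q * b >= a (for b > 0).

    Illustrative stand-in, not the vLLM implementation. Negating the
    divisor makes floor division round toward the ceiling, and the
    outer negation restores the sign.
    """
    return -(a // -b)


# Number of 128-wide blocks needed to cover 1000 elements.
num_blocks = cdiv(1000, 128)
print(num_blocks)
```

Staying in integer arithmetic avoids the float rounding that `math.ceil(a / b)` can introduce for very large operands.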
@@ -10,7 +10,8 @@ import torch
|
|||||||
from vllm.model_executor.layers.quantization.input_quant_fp8 import QuantFP8
|
from vllm.model_executor.layers.quantization.input_quant_fp8 import QuantFP8
|
||||||
from vllm.model_executor.layers.quantization.utils.quant_utils import GroupShape
|
from vllm.model_executor.layers.quantization.utils.quant_utils import GroupShape
|
||||||
from vllm.triton_utils import triton
|
from vllm.triton_utils import triton
|
||||||
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
|
||||||
|
|
||||||
|
|
||||||
def with_triton_mode(fn):
|
def with_triton_mode(fn):
|
||||||
|
|||||||
@@ -10,7 +10,8 @@ import vllm.model_executor.layers.activation # noqa F401
|
|||||||
from vllm.model_executor.custom_op import CustomOp
|
from vllm.model_executor.custom_op import CustomOp
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.triton_utils import triton
|
from vllm.triton_utils import triton
|
||||||
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
|
||||||
|
|
||||||
batch_size_range = [1, 16, 32, 64, 128]
|
batch_size_range = [1, 16, 32, 64, 128]
|
||||||
seq_len_range = [1, 16, 64, 128, 256, 512, 1024, 2048, 4096]
|
seq_len_range = [1, 16, 64, 128, 256, 512, 1024, 2048, 4096]
|
||||||
|
|||||||
@@ -28,7 +28,7 @@ except ImportError as e:
|
|||||||
|
|
||||||
from bitblas import Matmul, MatmulConfig, auto_detect_nvidia_target
|
from bitblas import Matmul, MatmulConfig, auto_detect_nvidia_target
|
||||||
|
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
parser = FlexibleArgumentParser(
|
parser = FlexibleArgumentParser(
|
||||||
description="Benchmark BitBLAS int4 on a specific target."
|
description="Benchmark BitBLAS int4 on a specific target."
|
||||||
|
|||||||
@@ -20,7 +20,7 @@ from vllm.model_executor.layers.fused_moe.config import (
|
|||||||
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp4
|
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp4
|
||||||
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
|
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
|
||||||
from vllm.scalar_type import scalar_types
|
from vllm.scalar_type import scalar_types
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
WEIGHT_SHAPES_MOE = {
|
WEIGHT_SHAPES_MOE = {
|
||||||
"nvidia/DeepSeek-R1-FP4": [
|
"nvidia/DeepSeek-R1-FP4": [
|
||||||
|
|||||||
@@ -14,7 +14,7 @@ from vllm.model_executor.layers.fused_moe.config import fp8_w8a8_moe_quant_confi
|
|||||||
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8
|
from vllm.model_executor.layers.fused_moe.cutlass_moe import cutlass_moe_fp8
|
||||||
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
|
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
# Weight shapes for different models: [num_experts, topk, hidden_size,
|
# Weight shapes for different models: [num_experts, topk, hidden_size,
|
||||||
# intermediate_size]
|
# intermediate_size]
|
||||||
|
|||||||
@@ -39,7 +39,7 @@ from vllm.distributed.device_communicators.pynccl_allocator import (
|
|||||||
)
|
)
|
||||||
from vllm.distributed.device_communicators.symm_mem import SymmMemCommunicator
|
from vllm.distributed.device_communicators.symm_mem import SymmMemCommunicator
|
||||||
from vllm.logger import init_logger
|
from vllm.logger import init_logger
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
logger = init_logger(__name__)
|
logger = init_logger(__name__)
|
||||||
|
|
||||||
|
|||||||
@@ -13,7 +13,7 @@ from vllm.model_executor.layers.fused_moe.fused_moe import (
|
|||||||
fused_experts,
|
fused_experts,
|
||||||
fused_topk,
|
fused_topk,
|
||||||
)
|
)
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
DEFAULT_MODELS = [
|
DEFAULT_MODELS = [
|
||||||
"nm-testing/Mixtral-8x7B-Instruct-v0.1",
|
"nm-testing/Mixtral-8x7B-Instruct-v0.1",
|
||||||
|
|||||||
@@ -7,7 +7,8 @@ import torch
|
|||||||
|
|
||||||
from vllm.model_executor.layers.layernorm import RMSNorm
|
from vllm.model_executor.layers.layernorm import RMSNorm
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
|
||||||
|
|
||||||
|
|
||||||
@torch.inference_mode()
|
@torch.inference_mode()
|
||||||
|
|||||||
@@ -25,7 +25,7 @@ if HAS_TRITON:
|
|||||||
from vllm.lora.ops.triton_ops import LoRAKernelMeta, lora_expand, lora_shrink
|
from vllm.lora.ops.triton_ops import LoRAKernelMeta, lora_expand, lora_shrink
|
||||||
from vllm.lora.ops.triton_ops.utils import _LORA_A_PTR_DICT, _LORA_B_PTR_DICT
|
from vllm.lora.ops.triton_ops.utils import _LORA_A_PTR_DICT, _LORA_B_PTR_DICT
|
||||||
|
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
|
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
|
||||||
DEFAULT_TP_SIZES = [1]
|
DEFAULT_TP_SIZES = [1]
|
||||||
|
|||||||
@@ -33,7 +33,7 @@ from vllm.model_executor.layers.quantization.utils.quant_utils import (
|
|||||||
quantize_weights,
|
quantize_weights,
|
||||||
)
|
)
|
||||||
from vllm.scalar_type import ScalarType, scalar_types
|
from vllm.scalar_type import ScalarType, scalar_types
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
DEFAULT_MODELS = ["meta-llama/Llama-3-8b", "meta-llama/Llama-2-70b-hf"]
|
DEFAULT_MODELS = ["meta-llama/Llama-3-8b", "meta-llama/Llama-2-70b-hf"]
|
||||||
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512, 1024]
|
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512, 1024]
|
||||||
|
|||||||
@@ -44,7 +44,7 @@ from vllm.model_executor.layers.quantization.utils.quant_utils import (
|
|||||||
sort_weights,
|
sort_weights,
|
||||||
)
|
)
|
||||||
from vllm.scalar_type import ScalarType, scalar_types
|
from vllm.scalar_type import ScalarType, scalar_types
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
DEFAULT_MODELS = ["meta-llama/Llama-2-7b-hf/TP1"]
|
DEFAULT_MODELS = ["meta-llama/Llama-2-7b-hf/TP1"]
|
||||||
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
|
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
|
||||||
|
|||||||
@@ -22,7 +22,7 @@ from vllm.model_executor.layers.fused_moe.fused_moe import *
|
|||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.transformers_utils.config import get_config
|
from vllm.transformers_utils.config import get_config
|
||||||
from vllm.triton_utils import triton
|
from vllm.triton_utils import triton
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
FP8_DTYPE = current_platform.fp8_dtype()
|
FP8_DTYPE = current_platform.fp8_dtype()
|
||||||
|
|
||||||
@@ -631,7 +631,7 @@ def main(args: argparse.Namespace):
|
|||||||
else:
|
else:
|
||||||
ensure_divisibility(intermediate_size, args.tp_size, "intermediate_size")
|
ensure_divisibility(intermediate_size, args.tp_size, "intermediate_size")
|
||||||
shard_intermediate_size = 2 * intermediate_size // args.tp_size
|
shard_intermediate_size = 2 * intermediate_size // args.tp_size
|
||||||
dtype = torch.float16 if current_platform.is_rocm() else config.torch_dtype
|
dtype = torch.float16 if current_platform.is_rocm() else config.dtype
|
||||||
use_fp8_w8a8 = args.dtype == "fp8_w8a8"
|
use_fp8_w8a8 = args.dtype == "fp8_w8a8"
|
||||||
use_int8_w8a16 = args.dtype == "int8_w8a16"
|
use_int8_w8a16 = args.dtype == "int8_w8a16"
|
||||||
block_quant_shape = get_weight_block_size_safety(config)
|
block_quant_shape = get_weight_block_size_safety(config)
|
||||||
|
|||||||
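Note on the `config.torch_dtype` to `config.dtype` change above: the diff does not say which library versions expose which attribute. If a benchmark script had to tolerate both names, one option is a small lookup helper. The sketch below assumes the config object exposes at least one of the two attributes; the helper name `resolve_config_dtype` is hypothetical, and `SimpleNamespace` objects stand in for real model configs.

```python
from types import SimpleNamespace


def resolve_config_dtype(config, default=None):
    """Hypothetical helper: prefer `dtype`, fall back to `torch_dtype`."""
    for name in ("dtype", "torch_dtype"):
        value = getattr(config, name, None)
        if value is not None:
            return value
    return default


# Stand-ins for configs that expose the new or the old attribute name.
new_style = SimpleNamespace(dtype="bfloat16")
old_style = SimpleNamespace(torch_dtype="float16")
empty = SimpleNamespace()

print(resolve_config_dtype(new_style))
print(resolve_config_dtype(old_style))
print(resolve_config_dtype(empty, default="float32"))
```

Checking `dtype` first means the old attribute is only touched when the new one is absent.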
@@ -17,7 +17,7 @@ from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
|
|||||||
)
|
)
|
||||||
from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
|
from vllm.model_executor.layers.fused_moe.utils import _fp8_quantize
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
FP8_DTYPE = current_platform.fp8_dtype()
|
FP8_DTYPE = current_platform.fp8_dtype()
|
||||||
|
|
||||||
@@ -344,7 +344,7 @@ def main(args: argparse.Namespace):
|
|||||||
topk = config.num_experts_per_tok
|
topk = config.num_experts_per_tok
|
||||||
|
|
||||||
hidden_size = config.hidden_size
|
hidden_size = config.hidden_size
|
||||||
dtype = torch.float16 if current_platform.is_rocm() else config.torch_dtype
|
dtype = torch.float16 if current_platform.is_rocm() else config.dtype
|
||||||
use_fp8_w8a8 = args.dtype == "fp8_w8a8"
|
use_fp8_w8a8 = args.dtype == "fp8_w8a8"
|
||||||
use_int8_w8a16 = args.dtype == "int8_w8a16"
|
use_int8_w8a16 = args.dtype == "int8_w8a16"
|
||||||
use_customized_permute = args.use_customized_permute
|
use_customized_permute = args.use_customized_permute
|
||||||
|
|||||||
@@ -39,7 +39,7 @@ import torch
|
|||||||
from vllm.model_executor.layers.rotary_embedding import get_rope
|
from vllm.model_executor.layers.rotary_embedding import get_rope
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.transformers_utils.config import get_config
|
from vllm.transformers_utils.config import get_config
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
||||||
|
|
||||||
|
|||||||
@@ -9,9 +9,9 @@ import torch
|
|||||||
from vllm import _custom_ops as ops
|
from vllm import _custom_ops as ops
|
||||||
from vllm.logger import init_logger
|
from vllm.logger import init_logger
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import (
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.torch_utils import (
|
||||||
STR_DTYPE_TO_TORCH_DTYPE,
|
STR_DTYPE_TO_TORCH_DTYPE,
|
||||||
FlexibleArgumentParser,
|
|
||||||
create_kv_caches_with_random,
|
create_kv_caches_with_random,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -1,155 +0,0 @@
|
|||||||
# SPDX-License-Identifier: Apache-2.0
|
|
||||||
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
|
||||||
|
|
||||||
import itertools
|
|
||||||
|
|
||||||
import torch
|
|
||||||
|
|
||||||
from vllm import _custom_ops as vllm_ops
|
|
||||||
from vllm.triton_utils import triton
|
|
||||||
|
|
||||||
|
|
||||||
def polynorm_naive(
|
|
||||||
x: torch.Tensor,
|
|
||||||
weight: torch.Tensor,
|
|
||||||
bias: torch.Tensor,
|
|
||||||
eps: float = 1e-6,
|
|
||||||
):
|
|
||||||
orig_shape = x.shape
|
|
||||||
x = x.view(-1, x.shape[-1])
|
|
||||||
|
|
||||||
def norm(x, eps: float):
|
|
||||||
return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)
|
|
||||||
|
|
||||||
x = x.float()
|
|
||||||
return (
|
|
||||||
(
|
|
||||||
weight[0] * norm(x**3, eps)
|
|
||||||
+ weight[1] * norm(x**2, eps)
|
|
||||||
+ weight[2] * norm(x, eps)
|
|
||||||
+ bias
|
|
||||||
)
|
|
||||||
.to(weight.dtype)
|
|
||||||
.view(orig_shape)
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def polynorm_vllm(
|
|
||||||
x: torch.Tensor,
|
|
||||||
weight: torch.Tensor,
|
|
||||||
bias: torch.Tensor,
|
|
||||||
eps: float = 1e-6,
|
|
||||||
):
|
|
||||||
orig_shape = x.shape
|
|
||||||
x = x.view(-1, x.shape[-1])
|
|
||||||
|
|
||||||
out = torch.empty_like(x)
|
|
||||||
vllm_ops.poly_norm(out, x, weight, bias, eps)
|
|
||||||
output = out
|
|
||||||
|
|
||||||
output = output.view(orig_shape)
|
|
||||||
return output
|
|
||||||
|
|
||||||
|
|
||||||
def calculate_diff(batch_size, seq_len, hidden_dim):
|
|
||||||
dtype = torch.bfloat16
|
|
||||||
x = torch.randn(batch_size, seq_len, hidden_dim, dtype=dtype, device="cuda")
|
|
||||||
weight = torch.ones(3, dtype=dtype, device="cuda")
|
|
||||||
bias = torch.ones(1, dtype=dtype, device="cuda")
|
|
||||||
|
|
||||||
output_naive = polynorm_naive(x, weight, bias)
|
|
||||||
output_vllm = polynorm_vllm(x, weight, bias)
|
|
||||||
|
|
||||||
if torch.allclose(output_naive, output_vllm, atol=1e-2, rtol=1e-2):
|
|
||||||
print("✅ All implementations match")
|
|
||||||
else:
|
|
||||||
print("❌ Implementations differ")
|
|
||||||
|
|
||||||
|
|
||||||
batch_size_range = [2**i for i in range(0, 7, 2)]
|
|
||||||
seq_length_range = [2**i for i in range(6, 11, 1)]
|
|
||||||
dim_range = [2048, 4096]
|
|
||||||
configs = list(itertools.product(dim_range, batch_size_range, seq_length_range))
|
|
||||||
|
|
||||||
|
|
||||||
def get_benchmark():
|
|
||||||
@triton.testing.perf_report(
|
|
||||||
triton.testing.Benchmark(
|
|
||||||
x_names=["dim", "batch_size", "seq_len"],
|
|
||||||
x_vals=[list(_) for _ in configs],
|
|
||||||
line_arg="provider",
|
|
||||||
line_vals=["naive", "vllm"],
|
|
||||||
line_names=["Naive", "vLLM"],
|
|
||||||
styles=[("blue", "-"), ("red", "-")],
|
|
||||||
ylabel="us",
|
|
||||||
plot_name="polynorm-perf",
|
|
||||||
args={},
|
|
||||||
)
|
|
||||||
)
|
|
||||||
def benchmark(dim, batch_size, seq_len, provider):
|
|
||||||
dtype = torch.bfloat16
|
|
||||||
hidden_dim = dim * 4
|
|
||||||
|
|
||||||
x = torch.randn(batch_size, seq_len, hidden_dim, dtype=dtype, device="cuda")
|
|
||||||
weight = torch.ones(3, dtype=dtype, device="cuda")
|
|
||||||
bias = torch.ones(1, dtype=dtype, device="cuda")
|
|
||||||
|
|
||||||
quantiles = [0.5, 0.2, 0.8]
|
|
||||||
|
|
||||||
if provider == "naive":
|
|
||||||
ms, min_ms, max_ms = triton.testing.do_bench(
|
|
||||||
lambda: polynorm_naive(x, weight, bias),
|
|
||||||
quantiles=quantiles,
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
ms, min_ms, max_ms = triton.testing.do_bench(
|
|
||||||
lambda: polynorm_vllm(x, weight, bias),
|
|
||||||
quantiles=quantiles,
|
|
||||||
)
|
|
||||||
|
|
||||||
return 1000 * ms, 1000 * max_ms, 1000 * min_ms
|
|
||||||
|
|
||||||
return benchmark
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
import argparse
|
|
||||||
|
|
||||||
parser = argparse.ArgumentParser()
|
|
||||||
parser.add_argument(
|
|
||||||
"--batch-size",
|
|
||||||
type=int,
|
|
||||||
default=4,
|
|
||||||
help="Batch size",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--seq-len",
|
|
||||||
type=int,
|
|
||||||
default=128,
|
|
||||||
help="Sequence length",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--hidden-dim",
|
|
||||||
type=int,
|
|
||||||
default=8192,
|
|
||||||
help="Intermediate size of MLP",
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--save-path",
|
|
||||||
type=str,
|
|
||||||
default="./configs/polnorm/",
|
|
||||||
help="Path to save polnorm benchmark results",
|
|
||||||
)
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
# Run correctness test
|
|
||||||
calculate_diff(
|
|
||||||
batch_size=args.batch_size,
|
|
||||||
seq_len=args.seq_len,
|
|
||||||
hidden_dim=args.hidden_dim,
|
|
||||||
)
|
|
||||||
|
|
||||||
benchmark = get_benchmark()
|
|
||||||
# Run performance benchmark
|
|
||||||
benchmark.run(print_data=True, save_path=args.save_path)
|
|
||||||
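Note on the removed benchmark above: `polynorm_naive` computes a weighted sum of three RMS-normalized powers of the input plus a bias. The arithmetic can be restated for a single row without torch. This is a sketch only; `rms_normalize` and `polynorm_row` are hypothetical names, and the bias is a plain float here rather than a one-element tensor.

```python
import math


def rms_normalize(values, eps):
    """Divide each value by sqrt(mean(values^2) + eps)."""
    mean_sq = sum(v * v for v in values) / len(values)
    return [v / math.sqrt(mean_sq + eps) for v in values]


def polynorm_row(x, weight, bias, eps=1e-6):
    """weight[0]*norm(x^3) + weight[1]*norm(x^2) + weight[2]*norm(x) + bias."""
    n3 = rms_normalize([v ** 3 for v in x], eps)
    n2 = rms_normalize([v ** 2 for v in x], eps)
    n1 = rms_normalize(x, eps)
    return [
        weight[0] * a + weight[1] * b + weight[2] * c + bias
        for a, b, c in zip(n3, n2, n1)
    ]


# With all-ones input, unit weights, unit bias and eps=0, every normalized
# term is 1, so each output element is 1 + 1 + 1 + 1.
print(polynorm_row([1.0] * 4, [1.0, 1.0, 1.0], 1.0, eps=0.0))
```

Normalizing `x**3` divides by the root of the mean of `x**6`, which is why a fused kernel for this operation only needs the sums of the second, fourth and sixth powers.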
@@ -7,7 +7,8 @@ import torch
|
|||||||
|
|
||||||
from vllm import _custom_ops as ops
|
from vllm import _custom_ops as ops
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
|
||||||
|
|
||||||
|
|
||||||
@torch.inference_mode()
|
@torch.inference_mode()
|
||||||
|
|||||||
@@ -9,9 +9,9 @@ from tabulate import tabulate
|
|||||||
from vllm import _custom_ops as ops
|
from vllm import _custom_ops as ops
|
||||||
from vllm.logger import init_logger
|
from vllm.logger import init_logger
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import (
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.torch_utils import (
|
||||||
STR_DTYPE_TO_TORCH_DTYPE,
|
STR_DTYPE_TO_TORCH_DTYPE,
|
||||||
FlexibleArgumentParser,
|
|
||||||
create_kv_caches_with_random,
|
create_kv_caches_with_random,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -12,9 +12,9 @@ from vllm.attention.ops.triton_reshape_and_cache_flash import (
|
|||||||
)
|
)
|
||||||
from vllm.logger import init_logger
|
from vllm.logger import init_logger
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import (
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.torch_utils import (
|
||||||
STR_DTYPE_TO_TORCH_DTYPE,
|
STR_DTYPE_TO_TORCH_DTYPE,
|
||||||
FlexibleArgumentParser,
|
|
||||||
create_kv_caches_with_random_flash,
|
create_kv_caches_with_random_flash,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ import torch
|
|||||||
|
|
||||||
from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding, get_rope
|
from vllm.model_executor.layers.rotary_embedding import RotaryEmbedding, get_rope
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
|
|
||||||
def benchmark_rope_kernels_multi_lora(
|
def benchmark_rope_kernels_multi_lora(
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ from datetime import datetime
|
|||||||
import flashinfer
|
import flashinfer
|
||||||
import torch
|
import torch
|
||||||
|
|
||||||
from vllm.utils import round_up
|
from vllm.utils.math_utils import round_up
|
||||||
|
|
||||||
FLOAT32_BYTES = torch.finfo(torch.float).bits // 8
|
FLOAT32_BYTES = torch.finfo(torch.float).bits // 8
|
||||||
FP8_DTYPE = torch.float8_e4m3fn
|
FP8_DTYPE = torch.float8_e4m3fn
|
||||||
|
|||||||
@@ -8,7 +8,7 @@ from datetime import datetime
|
|||||||
import flashinfer
|
import flashinfer
|
||||||
import torch
|
import torch
|
||||||
|
|
||||||
from vllm.utils import round_up
|
from vllm.utils.math_utils import round_up
|
||||||
|
|
||||||
FLOAT32_BYTES = torch.finfo(torch.float).bits // 8
|
FLOAT32_BYTES = torch.finfo(torch.float).bits // 8
|
||||||
FP8_DTYPE = torch.float8_e4m3fn
|
FP8_DTYPE = torch.float8_e4m3fn
|
||||||
|
|||||||
@@ -18,7 +18,7 @@ from vllm.model_executor.layers.quantization.utils.fp8_utils import (
|
|||||||
)
|
)
|
||||||
from vllm.platforms import current_platform
|
from vllm.platforms import current_platform
|
||||||
from vllm.triton_utils import triton
|
from vllm.triton_utils import triton
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
mp.set_start_method("spawn", force=True)
|
mp.set_start_method("spawn", force=True)
|
||||||
|
|
||||||
|
|||||||
@@ -11,7 +11,7 @@ import regex as re
|
|||||||
import seaborn as sns
|
import seaborn as sns
|
||||||
from torch.utils.benchmark import Measurement as TMeasurement
|
from torch.utils.benchmark import Measurement as TMeasurement
|
||||||
|
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
parser = FlexibleArgumentParser(
|
parser = FlexibleArgumentParser(
|
||||||
|
|||||||
@@ -1251,7 +1251,7 @@ async def main() -> None:
|
|||||||
default=None,
|
default=None,
|
||||||
help="The model name used in the API. "
|
help="The model name used in the API. "
|
||||||
"If not specified, the model name will be the "
|
"If not specified, the model name will be the "
|
||||||
"same as the ``--model`` argument. ",
|
"same as the `--model` argument. ",
|
||||||
)
|
)
|
||||||
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ import cProfile
|
|||||||
import pstats
|
import pstats
|
||||||
|
|
||||||
from vllm import LLM, SamplingParams
|
from vllm import LLM, SamplingParams
|
||||||
from vllm.utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
|
||||||
# A very long prompt, total number of tokens is about 15k.
|
# A very long prompt, total number of tokens is about 15k.
|
||||||
LONG_PROMPT = ["You are an expert in large language models, aren't you?"] * 1000
|
LONG_PROMPT = ["You are an expert in large language models, aren't you?"] * 1000
|
||||||
|
|||||||
@@ -188,16 +188,60 @@ else()
|
|||||||
message(FATAL_ERROR "vLLM CPU backend requires AVX512, AVX2, Power9+ ISA, S390X ISA, ARMv8 or RISC-V support.")
|
message(FATAL_ERROR "vLLM CPU backend requires AVX512, AVX2, Power9+ ISA, S390X ISA, ARMv8 or RISC-V support.")
|
||||||
endif()
|
endif()
|
||||||
|
|
||||||
#
|
|
||||||
# Build oneDNN for W8A8 GEMM kernels (only for x86-AVX512 /ARM platforms)
|
|
||||||
# Flag to enable ACL kernels for AARCH64 platforms
|
|
||||||
if (VLLM_BUILD_ACL STREQUAL "ON")
|
|
||||||
set(USE_ACL ON)
|
|
||||||
else()
|
|
||||||
set(USE_ACL OFF)
|
|
||||||
endif()
|
|
||||||
|
|
||||||
|
# Build oneDNN for GEMM kernels (only for x86-AVX512 /ARM platforms)
|
||||||
if ((AVX512_FOUND AND NOT AVX512_DISABLED) OR (ASIMD_FOUND AND NOT APPLE_SILICON_FOUND) OR POWER9_FOUND OR POWER10_FOUND OR POWER11_FOUND)
|
if ((AVX512_FOUND AND NOT AVX512_DISABLED) OR (ASIMD_FOUND AND NOT APPLE_SILICON_FOUND) OR POWER9_FOUND OR POWER10_FOUND OR POWER11_FOUND)
|
||||||
|
# Fetch and build Arm Compute Library (ACL) as oneDNN's backend for AArch64
|
||||||
|
# TODO [fadara01]: remove this once ACL can be fetched and built automatically as a dependency of oneDNN
|
||||||
|
if(ASIMD_FOUND)
|
||||||
|
if(DEFINED ENV{ACL_ROOT_DIR} AND IS_DIRECTORY "$ENV{ACL_ROOT_DIR}")
|
||||||
|
message(STATUS "Using ACL from specified source directory: $ENV{ACL_ROOT_DIR}")
|
||||||
|
else()
|
||||||
|
message(STATUS "Downloading Arm Compute Library (ACL) from GitHub")
|
||||||
|
FetchContent_Populate(arm_compute
|
||||||
|
SUBBUILD_DIR "${FETCHCONTENT_BASE_DIR}/arm_compute-subbuild"
|
||||||
|
SOURCE_DIR "${FETCHCONTENT_BASE_DIR}/arm_compute-src"
|
||||||
|
GIT_REPOSITORY https://github.com/ARM-software/ComputeLibrary.git
|
||||||
|
GIT_TAG v52.2.0
|
||||||
|
GIT_SHALLOW TRUE
|
||||||
|
GIT_PROGRESS TRUE
|
||||||
|
)
|
||||||
|
set(ENV{ACL_ROOT_DIR} "${arm_compute_SOURCE_DIR}")
|
||||||
|
endif()
|
||||||
|
|
||||||
|
# Build ACL with scons
|
||||||
|
include(ProcessorCount)
|
||||||
|
ProcessorCount(_NPROC)
|
||||||
|
set(_scons_cmd
|
||||||
|
scons -j${_NPROC}
|
||||||
|
Werror=0 debug=0 neon=1 examples=0 embed_kernels=0 os=linux
|
||||||
|
arch=armv8.2-a build=native benchmark_examples=0 fixed_format_kernels=1
|
||||||
|
multi_isa=1 openmp=1 cppthreads=0
|
||||||
|
)
|
||||||
|
|
||||||
|
# locate PyTorch's libgomp (e.g. site-packages/torch.libs/libgomp-947d5fa1.so.1.0.0)
|
||||||
|
# and create a local shim dir with it
|
||||||
|
include("${CMAKE_CURRENT_LIST_DIR}/utils.cmake")
|
||||||
|
vllm_prepare_torch_gomp_shim(VLLM_TORCH_GOMP_SHIM_DIR)
|
||||||
|
|
||||||
|
if(NOT VLLM_TORCH_GOMP_SHIM_DIR STREQUAL "")
|
||||||
|
list(APPEND _scons_cmd extra_link_flags=-L${VLLM_TORCH_GOMP_SHIM_DIR})
|
||||||
|
endif()
|
||||||
|
|
||||||
|
execute_process(
|
||||||
|
COMMAND ${_scons_cmd}
|
||||||
|
WORKING_DIRECTORY "$ENV{ACL_ROOT_DIR}"
|
||||||
|
RESULT_VARIABLE _acl_rc
|
||||||
|
)
|
||||||
|
if(NOT _acl_rc EQUAL 0)
|
||||||
|
message(FATAL_ERROR "ACL SCons build failed (exit ${_acl_rc}).")
|
||||||
|
endif()
|
||||||
|
|
||||||
|
set(ONEDNN_AARCH64_USE_ACL "ON")
|
||||||
|
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wl,-rpath,$ENV{ACL_ROOT_DIR}/build/")
|
||||||
|
add_compile_definitions(VLLM_USE_ACL)
|
||||||
|
endif()
|
||||||
|
|
||||||
set(FETCHCONTENT_SOURCE_DIR_ONEDNN "$ENV{FETCHCONTENT_SOURCE_DIR_ONEDNN}" CACHE PATH "Path to a local oneDNN source directory.")
|
set(FETCHCONTENT_SOURCE_DIR_ONEDNN "$ENV{FETCHCONTENT_SOURCE_DIR_ONEDNN}" CACHE PATH "Path to a local oneDNN source directory.")
|
||||||
|
|
||||||
if(FETCHCONTENT_SOURCE_DIR_ONEDNN)
|
if(FETCHCONTENT_SOURCE_DIR_ONEDNN)
|
||||||
@@ -217,16 +261,6 @@ if ((AVX512_FOUND AND NOT AVX512_DISABLED) OR (ASIMD_FOUND AND NOT APPLE_SILICON
|
|||||||
)
|
)
|
||||||
endif()
|
endif()
|
||||||
|
|
||||||
if(USE_ACL)
|
|
||||||
find_library(ARM_COMPUTE_LIBRARY NAMES arm_compute PATHS $ENV{ACL_ROOT_DIR}/build/)
|
|
||||||
if(NOT ARM_COMPUTE_LIBRARY)
|
|
||||||
message(FATAL_ERROR "Could not find ARM Compute Library: please set ACL_ROOT_DIR")
|
|
||||||
endif()
|
|
||||||
set(ONEDNN_AARCH64_USE_ACL "ON")
|
|
||||||
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wl,-rpath,$ENV{ACL_ROOT_DIR}/build/")
|
|
||||||
add_compile_definitions(VLLM_USE_ACL)
|
|
||||||
endif()
|
|
||||||
|
|
||||||
set(ONEDNN_LIBRARY_TYPE "STATIC")
|
set(ONEDNN_LIBRARY_TYPE "STATIC")
|
||||||
set(ONEDNN_BUILD_DOC "OFF")
|
set(ONEDNN_BUILD_DOC "OFF")
|
||||||
set(ONEDNN_BUILD_EXAMPLES "OFF")
|
set(ONEDNN_BUILD_EXAMPLES "OFF")
|
||||||
|
|||||||
@@ -19,7 +19,7 @@ else()
|
|||||||
FetchContent_Declare(
|
FetchContent_Declare(
|
||||||
flashmla
|
flashmla
|
||||||
GIT_REPOSITORY https://github.com/vllm-project/FlashMLA
|
GIT_REPOSITORY https://github.com/vllm-project/FlashMLA
|
||||||
GIT_TAG 5f65b85703c7ed75fda01e06495077caad207c3f
|
GIT_TAG 46d64a8ebef03fa50b4ae74937276a5c940e3f95
|
||||||
GIT_PROGRESS TRUE
|
GIT_PROGRESS TRUE
|
||||||
CONFIGURE_COMMAND ""
|
CONFIGURE_COMMAND ""
|
||||||
BUILD_COMMAND ""
|
BUILD_COMMAND ""
|
||||||
@@ -66,6 +66,7 @@ if(FLASH_MLA_ARCHS)
|
|||||||
${flashmla_SOURCE_DIR}/csrc/extension/torch_api.cpp
|
${flashmla_SOURCE_DIR}/csrc/extension/torch_api.cpp
|
||||||
${flashmla_SOURCE_DIR}/csrc/extension/sm90/dense_fp8/pybind.cpp
|
${flashmla_SOURCE_DIR}/csrc/extension/sm90/dense_fp8/pybind.cpp
|
||||||
${flashmla_SOURCE_DIR}/csrc/extension/sm90/dense_fp8/flash_fwd_mla_fp8_sm90.cu
|
${flashmla_SOURCE_DIR}/csrc/extension/sm90/dense_fp8/flash_fwd_mla_fp8_sm90.cu
|
||||||
|
${flashmla_SOURCE_DIR}/csrc/extension/sm90/dense_fp8/flash_fwd_mla_metadata.cu
|
||||||
)
|
)
|
||||||
|
|
||||||
set(FlashMLA_INCLUDES
|
set(FlashMLA_INCLUDES
|
||||||
|
|||||||
@@ -22,10 +22,10 @@ else()
|
|||||||
CONFIGURE_COMMAND ""
|
CONFIGURE_COMMAND ""
|
||||||
BUILD_COMMAND ""
|
BUILD_COMMAND ""
|
||||||
)
|
)
|
||||||
FetchContent_Populate(qutlass)
|
|
||||||
set(qutlass_SOURCE_DIR "${qutlass_SOURCE_DIR}")
|
|
||||||
endif()
|
endif()
|
||||||
|
|
||||||
|
FetchContent_Populate(qutlass)
|
||||||
|
|
||||||
if(NOT qutlass_SOURCE_DIR)
|
if(NOT qutlass_SOURCE_DIR)
|
||||||
message(FATAL_ERROR "[QUTLASS] source directory could not be resolved.")
|
message(FATAL_ERROR "[QUTLASS] source directory could not be resolved.")
|
||||||
endif()
|
endif()
|
||||||
|
|||||||
@@ -38,7 +38,7 @@ else()
|
|||||||
FetchContent_Declare(
|
FetchContent_Declare(
|
||||||
vllm-flash-attn
|
vllm-flash-attn
|
||||||
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
|
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
|
||||||
GIT_TAG 8f468e7da54a8e2f98abfa7c38636aac91c0cba1
|
GIT_TAG a893712401d70362fbb299cd9c4b3476e8e9ed54
|
||||||
GIT_PROGRESS TRUE
|
GIT_PROGRESS TRUE
|
||||||
# Don't share the vllm-flash-attn build between build types
|
# Don't share the vllm-flash-attn build between build types
|
||||||
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
|
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
|
||||||
|
|||||||
@@ -129,6 +129,44 @@ function (get_torch_gpu_compiler_flags OUT_GPU_FLAGS GPU_LANG)
|
|||||||
set(${OUT_GPU_FLAGS} ${GPU_FLAGS} PARENT_SCOPE)
|
set(${OUT_GPU_FLAGS} ${GPU_FLAGS} PARENT_SCOPE)
|
||||||
endfunction()
|
endfunction()
|
||||||
|
|
||||||
|
# Find libgomp that gets shipped with PyTorch wheel and create a shim dir with:
|
||||||
|
# libgomp.so -> libgomp-<hash>.so...
|
||||||
|
# libgomp.so.1 -> libgomp-<hash>.so...
|
||||||
|
# OUTPUT: TORCH_GOMP_SHIM_DIR ("" if not found)
|
||||||
|
function(vllm_prepare_torch_gomp_shim TORCH_GOMP_SHIM_DIR)
|
||||||
|
set(${TORCH_GOMP_SHIM_DIR} "" PARENT_SCOPE)
|
||||||
|
|
||||||
|
# Use run_python to locate vendored libgomp; never throw on failure.
|
||||||
|
run_python(_VLLM_TORCH_GOMP_PATH
|
||||||
|
"
|
||||||
|
import os, glob
|
||||||
|
try:
|
||||||
|
import torch
|
||||||
|
torch_pkg = os.path.dirname(torch.__file__)
|
||||||
|
site_root = os.path.dirname(torch_pkg)
|
||||||
|
torch_libs = os.path.join(site_root, 'torch.libs')
|
||||||
|
print(glob.glob(os.path.join(torch_libs, 'libgomp-*.so*'))[0])
|
||||||
|
except:
|
||||||
|
print('')
|
||||||
|
"
|
||||||
|
"failed to probe torch.libs for libgomp")
|
||||||
|
|
||||||
|
if(_VLLM_TORCH_GOMP_PATH STREQUAL "" OR NOT EXISTS "${_VLLM_TORCH_GOMP_PATH}")
|
||||||
|
return()
|
||||||
|
endif()
|
||||||
|
|
||||||
|
# Create shim under the build tree
|
||||||
|
set(_shim "${CMAKE_BINARY_DIR}/gomp_shim")
|
||||||
|
file(MAKE_DIRECTORY "${_shim}")
|
||||||
|
|
||||||
|
execute_process(COMMAND ${CMAKE_COMMAND} -E rm -f "${_shim}/libgomp.so")
|
||||||
|
execute_process(COMMAND ${CMAKE_COMMAND} -E rm -f "${_shim}/libgomp.so.1")
|
||||||
|
execute_process(COMMAND ${CMAKE_COMMAND} -E create_symlink "${_VLLM_TORCH_GOMP_PATH}" "${_shim}/libgomp.so")
|
||||||
|
execute_process(COMMAND ${CMAKE_COMMAND} -E create_symlink "${_VLLM_TORCH_GOMP_PATH}" "${_shim}/libgomp.so.1")
|
||||||
|
|
||||||
|
set(${TORCH_GOMP_SHIM_DIR} "${_shim}" PARENT_SCOPE)
|
||||||
|
endfunction()
|
||||||
|
|
||||||
# Macro for converting a `gencode` version number to a cmake version number.
|
# Macro for converting a `gencode` version number to a cmake version number.
|
||||||
macro(string_to_ver OUT_VER IN_STR)
|
macro(string_to_ver OUT_VER IN_STR)
|
||||||
string(REGEX REPLACE "\([0-9]+\)\([0-9]\)" "\\1.\\2" ${OUT_VER} ${IN_STR})
|
string(REGEX REPLACE "\([0-9]+\)\([0-9]\)" "\\1.\\2" ${OUT_VER} ${IN_STR})
|
||||||
|
|||||||
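Note on `vllm_prepare_torch_gomp_shim` above: the function creates a directory holding two symlinks, `libgomp.so` and `libgomp.so.1`, that both point at the hashed libgomp shipped with the PyTorch wheel, and yields an empty string when the library cannot be found. The same steps can be sketched in Python. `make_gomp_shim` is a hypothetical name, and the sketch takes the library path as an argument instead of probing `torch.libs`.

```python
import os


def make_gomp_shim(gomp_path, shim_dir):
    """Create libgomp.so and libgomp.so.1 symlinks in shim_dir.

    Returns shim_dir on success, or "" when gomp_path is empty or missing,
    mirroring the "never throw on failure" behavior of the CMake function.
    """
    if not gomp_path or not os.path.exists(gomp_path):
        return ""
    os.makedirs(shim_dir, exist_ok=True)
    for name in ("libgomp.so", "libgomp.so.1"):
        link = os.path.join(shim_dir, name)
        # Remove a stale link first so reruns are idempotent.
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(gomp_path, link)
    return shim_dir
```

Passing the resulting directory to the linker with `-L` lets a generic `-lgomp` resolve to the same OpenMP runtime that PyTorch loads.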
@@ -125,32 +125,37 @@ public:
|
|||||||
}
|
}
|
||||||
|
|
||||||
static void set_split_kv (KernelArguments& args) {
|
static void set_split_kv (KernelArguments& args) {
|
||||||
// printf("set_split_kv start");
|
|
||||||
if (args.split_kv >= 1) return;
|
if (args.split_kv >= 1) return;
|
||||||
auto [H, K, D, B] = args.problem_shape;
|
auto [H, K, D, B] = args.problem_shape;
|
||||||
// std::cout << H << " " << K << " " << D << " " << B << "\n";
|
|
||||||
int sm_count = args.hw_info.sm_count;
|
int sm_count = args.hw_info.sm_count;
|
||||||
// printf(" sm_count = %d\n", sm_count);
|
float seq_length_k = static_cast<float>(K) / 1024.0f;
|
||||||
int max_splits = ceil_div(K, 128);
|
int max_splits = 1;
|
||||||
max_splits = min(16, max_splits);
|
|
||||||
|
|
||||||
// TODO: This avoids a hang when the batch size larger than 1 and
|
if (B <= 4 && seq_length_k >= 16) {
|
||||||
// there is more than 1 kv_splits.
|
max_splits = 16;
|
||||||
// Discuss with NVIDIA how this can be fixed.
|
}
|
||||||
if (B > 1) {
|
else if (B <= 8 && seq_length_k >= 4) {
|
||||||
max_splits = min(1, max_splits);
|
max_splits = 8;
|
||||||
|
}
|
||||||
|
else if ((B <= 16 && seq_length_k >= 8) ||
|
||||||
|
(B == 48 && seq_length_k >= 32)) {
|
||||||
|
max_splits = 4;
|
||||||
|
}
|
||||||
|
else if ((B <= 32 && seq_length_k >= 16) ||
|
||||||
|
(B == 96 && seq_length_k >= 16)) {
|
||||||
|
max_splits = 2;
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
max_splits = 1;
|
||||||
}
|
}
|
||||||
|
|
||||||
// printf(" max_splits = %d\n", max_splits);
|
// Wave-aware scheduling: ensure integer number of waves in K dimension
|
||||||
int sms_per_batch = max(1, sm_count / B);
|
int sms_per_batch = max(1, sm_count / B);
|
||||||
// printf(" sms_per_batch = %d\n", sms_per_batch);
|
|
||||||
int split_heur = min(max_splits, sms_per_batch);
|
int split_heur = min(max_splits, sms_per_batch);
|
||||||
int waves = ceil_div(B * split_heur, sm_count);
|
int waves = ceil_div(B * split_heur, sm_count);
|
||||||
int k_waves = ceil_div(max_splits, split_heur);
|
int k_waves = ceil_div(max_splits, split_heur);
|
||||||
int split_wave_aware = ceil_div(max_splits, k_waves);
|
int split_wave_aware = ceil_div(max_splits, k_waves);
|
||||||
args.split_kv = split_wave_aware;
|
args.split_kv = split_wave_aware;
|
||||||
// printf(" args.split_kv = %d\n", args.split_kv);
|
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
/// Determines whether the GEMM can execute the given problem.
|
/// Determines whether the GEMM can execute the given problem.
|
||||||
|
|||||||
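Note on the `set_split_kv` change above: the new side replaces the `ceil_div(K, 128)` cap with a table keyed on batch size `B` and on the key length in units of 1024 tokens, then applies the unchanged wave-aware rounding. A Python transcription of the new branch makes the thresholds easier to try out. It is a sketch: `pick_split_kv` is a hypothetical name, and the early return for an already-set `args.split_kv` is omitted.

```python
def ceil_div(a, b):
    """Integer ceiling division for positive operands."""
    return -(-a // b)


def pick_split_kv(K, B, sm_count):
    """Mirror of the new-side heuristic for the number of KV splits."""
    seq_length_k = K / 1024.0
    if B <= 4 and seq_length_k >= 16:
        max_splits = 16
    elif B <= 8 and seq_length_k >= 4:
        max_splits = 8
    elif (B <= 16 and seq_length_k >= 8) or (B == 48 and seq_length_k >= 32):
        max_splits = 4
    elif (B <= 32 and seq_length_k >= 16) or (B == 96 and seq_length_k >= 16):
        max_splits = 2
    else:
        max_splits = 1

    # Wave-aware scheduling: keep an integer number of waves along K.
    sms_per_batch = max(1, sm_count // B)
    split_heur = min(max_splits, sms_per_batch)
    k_waves = ceil_div(max_splits, split_heur)
    return ceil_div(max_splits, k_waves)


print(pick_split_kv(K=32768, B=1, sm_count=132))   # long context, single request
print(pick_split_kv(K=1024, B=64, sm_count=132))   # short context, large batch
```

Unlike the old side, which forced a single split whenever `B > 1`, this table allows more than one split for batched requests once the key length is large enough.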
@@ -5,11 +5,11 @@
|
|||||||
|
|
||||||
namespace vllm {
|
namespace vllm {
|
||||||
|
|
||||||
// vllm_kernel_override_batch_invariant(); returns true
|
// vllm_is_batch_invariant(); returns true
|
||||||
// if env VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT=1
|
// if env VLLM_BATCH_INVARIANT=1
|
||||||
inline bool vllm_kernel_override_batch_invariant() {
|
inline bool vllm_is_batch_invariant() {
|
||||||
static bool cached = []() {
|
static bool cached = []() {
|
||||||
std::string env_key = "VLLM_KERNEL_OVERRIDE_BATCH_INVARIANT";
|
std::string env_key = "VLLM_BATCH_INVARIANT";
|
||||||
const char* val = std::getenv(env_key.c_str());
|
const char* val = std::getenv(env_key.c_str());
|
||||||
return (val && std::atoi(val) != 0) ? 1 : 0;
|
return (val && std::atoi(val) != 0) ? 1 : 0;
|
||||||
}();
|
}();
|
||||||
|
|||||||
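Note on the rename above: the helper reads the environment variable once, caches the result, and treats any nonzero integer as enabled. The variable is now `VLLM_BATCH_INVARIANT`. A Python analogue of that pattern is sketched below; `parse_flag` and `is_batch_invariant` are hypothetical names. One difference is assumed and worth stating: `std::atoi` accepts a numeric prefix such as `1abc`, while this sketch treats any non-integer string as disabled.

```python
import os
from functools import lru_cache


def parse_flag(value):
    """Return True only for strings that parse as a nonzero integer."""
    if value is None:
        return False
    try:
        return int(value) != 0
    except ValueError:
        return False


@lru_cache(maxsize=None)
def is_batch_invariant():
    """Read VLLM_BATCH_INVARIANT once and cache the answer."""
    return parse_flag(os.environ.get("VLLM_BATCH_INVARIANT"))
```

Because the value is cached on first use, changing the variable later in the same process has no effect, which matches the `static bool cached` in the C++ helper.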
@@ -187,7 +187,8 @@ template <>
|
|||||||
struct hash<MatMulPrimitiveHandler::ClassMatmulCacheKey> {
|
struct hash<MatMulPrimitiveHandler::ClassMatmulCacheKey> {
|
||||||
size_t operator()(
|
size_t operator()(
|
||||||
const MatMulPrimitiveHandler::ClassMatmulCacheKey& val) const {
|
const MatMulPrimitiveHandler::ClassMatmulCacheKey& val) const {
|
||||||
return hash<dnnl_dim_t>()(val.b_n_size) ^ hash<dnnl_dim_t>()(val.b_k_size);
|
return hash<dnnl_dim_t>()(val.b_n_size) ^ hash<dnnl_dim_t>()(val.b_k_size) ^
|
||||||
|
hash<int>()(static_cast<int>(val.b_type));
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
@@ -216,7 +217,8 @@ bool operator==(const W8A8MatMulPrimitiveHandler::MSizeCacheKey& l,
|
|||||||
|
|
||||||
bool operator==(const MatMulPrimitiveHandler::ClassMatmulCacheKey& l,
|
bool operator==(const MatMulPrimitiveHandler::ClassMatmulCacheKey& l,
|
||||||
const MatMulPrimitiveHandler::ClassMatmulCacheKey& r) {
|
const MatMulPrimitiveHandler::ClassMatmulCacheKey& r) {
|
||||||
return l.b_n_size == r.b_n_size && l.b_k_size == r.b_k_size;
|
return l.b_n_size == r.b_n_size && l.b_k_size == r.b_k_size &&
|
||||||
|
l.b_type == r.b_type;
|
||||||
}
|
}
|
||||||
|
|
||||||
bool operator==(const MatMulPrimitiveHandler::MSizeCacheKey& l,
|
bool operator==(const MatMulPrimitiveHandler::MSizeCacheKey& l,
|
||||||
@@ -493,8 +495,10 @@ void MatMulPrimitiveHandler::execute(ExecArgs& args) {
|
|||||||
dnnl::matmul MatMulPrimitiveHandler::get_matmul_cache(
|
dnnl::matmul MatMulPrimitiveHandler::get_matmul_cache(
|
||||||
const MSizeCacheKey& key) {
|
const MSizeCacheKey& key) {
|
||||||
if (m_size_cache_.get() == nullptr) {
|
if (m_size_cache_.get() == nullptr) {
|
||||||
ClassMatmulCacheKey key = {.b_n_size = b_n_size_, .b_k_size = b_k_size_};
|
ClassMatmulCacheKey class_key = {
|
||||||
m_size_cache_ = get_matul_class_primitive_cache(key, primitive_cache_size_);
|
.b_n_size = b_n_size_, .b_k_size = b_k_size_, .b_type = b_type_};
|
||||||
|
m_size_cache_ =
|
||||||
|
get_matul_class_primitive_cache(class_key, primitive_cache_size_);
|
||||||
}
|
}
|
||||||
return m_size_cache_->get_or_create(key, [&]() {
|
return m_size_cache_->get_or_create(key, [&]() {
|
||||||
dnnl::matmul::primitive_desc desc = this->create_primitive_desc(key, false);
|
dnnl::matmul::primitive_desc desc = this->create_primitive_desc(key, false);
|
||||||
|
|||||||
@@ -199,6 +199,7 @@ class MatMulPrimitiveHandler : public DNNLMatMulPrimitiveHandler {
|
|||||||
struct ClassMatmulCacheKey {
|
struct ClassMatmulCacheKey {
|
||||||
dnnl_dim_t b_n_size;
|
dnnl_dim_t b_n_size;
|
||||||
dnnl_dim_t b_k_size;
|
dnnl_dim_t b_k_size;
|
||||||
|
dnnl::memory::data_type b_type;
|
||||||
|
|
||||||
friend bool operator==(const ClassMatmulCacheKey& l,
|
friend bool operator==(const ClassMatmulCacheKey& l,
|
||||||
const ClassMatmulCacheKey& r);
|
const ClassMatmulCacheKey& r);
|
||||||
|
|||||||
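Note on the `ClassMatmulCacheKey` change above: adding `b_type` to the struct, its hash and its `operator==` keeps two weight matrices of the same shape but different data types from sharing one cache entry. The effect can be shown with a frozen dataclass as a dictionary key. This is an illustrative sketch: the string type tags and the `get_or_create` helper are assumptions, not the oneDNN or vLLM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassMatmulCacheKey:
    b_n_size: int
    b_k_size: int
    b_type: str  # stand-in for dnnl::memory::data_type


def get_or_create(cache, key, factory):
    """Return the cached value for key, building it on first use."""
    if key not in cache:
        cache[key] = factory()
    return cache[key]


cache = {}
bf16 = get_or_create(cache, ClassMatmulCacheKey(4096, 4096, "bf16"), object)
f32 = get_or_create(cache, ClassMatmulCacheKey(4096, 4096, "f32"), object)
again = get_or_create(cache, ClassMatmulCacheKey(4096, 4096, "bf16"), object)

# Same shape, different type: separate entries. Same key: shared entry.
print(bf16 is f32, bf16 is again)
```

A frozen dataclass derives both `__eq__` and `__hash__` from all fields, so the two stay consistent automatically; in the C++ code they have to be updated together, as the hunks above do.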
@@ -2,6 +2,7 @@
|
|||||||
#include "dispatch_utils.h"
|
#include "dispatch_utils.h"
|
||||||
#include "cub_helpers.h"
|
#include "cub_helpers.h"
|
||||||
#include "core/batch_invariant.hpp"
|
#include "core/batch_invariant.hpp"
|
||||||
|
#include "quantization/vectorization_utils.cuh"
|
||||||
|
|
||||||
#include <torch/cuda.h>
|
#include <torch/cuda.h>
|
||||||
#include <c10/cuda/CUDAGuard.h>
|
#include <c10/cuda/CUDAGuard.h>
|
||||||
@@ -18,11 +19,22 @@ __global__ void rms_norm_kernel(
|
|||||||
const float epsilon, const int num_tokens, const int hidden_size) {
|
const float epsilon, const int num_tokens, const int hidden_size) {
|
||||||
__shared__ float s_variance;
|
__shared__ float s_variance;
|
||||||
float variance = 0.0f;
|
float variance = 0.0f;
|
||||||
|
const scalar_t* input_row = input + blockIdx.x * input_stride;
|
||||||
|
|
||||||
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
|
constexpr int VEC_SIZE = 8;
|
||||||
const float x = (float)input[blockIdx.x * input_stride + idx];
|
auto vec_op = [&variance](const vec_n_t<scalar_t, VEC_SIZE>& vec) {
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < VEC_SIZE; ++i) {
|
||||||
|
float x = static_cast<float>(vec.val[i]);
|
||||||
|
variance += x * x;
|
||||||
|
}
|
||||||
|
};
|
||||||
|
auto scalar_op = [&variance](const scalar_t& val) {
|
||||||
|
float x = static_cast<float>(val);
|
||||||
variance += x * x;
|
variance += x * x;
|
||||||
}
|
};
|
||||||
|
vllm::vectorize_read_with_alignment<VEC_SIZE>(
|
||||||
|
input_row, hidden_size, threadIdx.x, blockDim.x, vec_op, scalar_op);
|
||||||
|
|
||||||
using BlockReduce = cub::BlockReduce<float, 1024>;
|
using BlockReduce = cub::BlockReduce<float, 1024>;
|
||||||
__shared__ typename BlockReduce::TempStorage reduceStore;
|
__shared__ typename BlockReduce::TempStorage reduceStore;
|
||||||
@@ -136,211 +148,6 @@ fused_add_rms_norm_kernel(
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Function specialization in the case of FP16/BF16 tensors.
|
|
||||||
Additional optimizations we can make in this case are
|
|
||||||
packed and vectorized operations, which help with the
|
|
||||||
memory latency bottleneck.
|
|
||||||
|
|
||||||
_f16VecPN struct extends _f16Vec to add operations specifically required for
|
|
||||||
polynomial normalization (poly norm).
|
|
||||||
The original _f16Vec does not include the sum-of-powers computation or
|
|
||||||
in-place polynomial normalization logic. */
|
|
||||||
template <typename scalar_t, int width>
|
|
||||||
struct alignas(16) _f16VecPN : _f16Vec<scalar_t, width> {
|
|
||||||
using Base = _f16Vec<scalar_t, width>;
|
|
||||||
using Converter = typename Base::Converter;
|
|
||||||
using T1 = typename Base::T1;
|
|
||||||
using T2 = typename Base::T2;
|
|
||||||
using Base::data;
|
|
||||||
|
|
||||||
__device__ auto sum_pows() const {
|
|
||||||
float s2 = 0.0f, s4 = 0.0f, s6 = 0.0f;
|
|
||||||
|
|
||||||
#pragma unroll
|
|
||||||
for (int i = 0; i < width; i += 2) {
|
|
||||||
float2 z = Converter::convert(T2{data[i], data[i + 1]});
|
|
||||||
float x2 = z.x * z.x;
|
|
||||||
float x4 = x2 * x2;
|
|
||||||
float x6 = x4 * x2;
|
|
||||||
|
|
||||||
float y2 = z.y * z.y;
|
|
||||||
float y4 = y2 * y2;
|
|
||||||
float y6 = y4 * y2;
|
|
||||||
|
|
||||||
s2 += x2 + y2;
|
|
||||||
s4 += x4 + y4;
|
|
||||||
s6 += x6 + y6;
|
|
||||||
}
|
|
||||||
return std::make_tuple(s2, s4, s6);
|
|
||||||
}
|
|
||||||
|
|
||||||
__device__ void poly_norm_inplace(const float w2_inv_std,
|
|
||||||
const float w1_inv_std2,
|
|
||||||
const float w0_inv_std3, const float bias) {
|
|
||||||
#pragma unroll
|
|
||||||
for (int i = 0; i < width; i += 2) {
|
|
||||||
float2 z = Converter::convert(T2{data[i], data[i + 1]});
|
|
||||||
|
|
||||||
float x2 = z.x * z.x;
|
|
||||||
float x3 = x2 * z.x;
|
|
||||||
z.x = w2_inv_std * z.x + w1_inv_std2 * x2 + w0_inv_std3 * x3 + bias;
|
|
||||||
|
|
||||||
float y2 = z.y * z.y;
|
|
||||||
float y3 = y2 * z.y;
|
|
||||||
z.y = w2_inv_std * z.y + w1_inv_std2 * y2 + w0_inv_std3 * y3 + bias;
|
|
||||||
|
|
||||||
auto out = Converter::convert(z);
|
|
||||||
data[i] = out.x;
|
|
||||||
data[i + 1] = out.y;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
template <typename scalar_t, int width>
|
|
||||||
__global__ std::enable_if_t<(width > 0) && _typeConvert<scalar_t>::exists>
|
|
||||||
poly_norm_kernel(scalar_t* __restrict__ out, // [..., hidden_size]
|
|
||||||
const scalar_t* __restrict__ input, // [..., hidden_size]
|
|
||||||
const scalar_t* __restrict__ weight, // [3]
|
|
||||||
const scalar_t* __restrict__ bias, // [1]
|
|
||||||
const float epsilon, const int hidden_size) {
|
|
||||||
// Sanity checks on our vector struct and type-punned pointer arithmetic
|
|
||||||
static_assert(std::is_pod_v<_f16VecPN<scalar_t, width>>);
|
|
||||||
static_assert(sizeof(_f16VecPN<scalar_t, width>) == sizeof(scalar_t) * width);
|
|
||||||
|
|
||||||
/* These and the argument pointers are all declared `restrict` as they are
|
|
||||||
not aliased in practice. Argument pointers should not be dereferenced
|
|
||||||
in this kernel as that would be undefined behavior */
|
|
||||||
auto* __restrict__ input_v =
|
|
||||||
reinterpret_cast<const _f16VecPN<scalar_t, width>*>(input);
|
|
||||||
const int vec_hidden_size = hidden_size / width;
|
|
||||||
float variance = 0.0f;
|
|
||||||
float variance2 = 0.0f;
|
|
||||||
float variance3 = 0.0f;
|
|
||||||
|
|
||||||
for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) {
|
|
||||||
int id = blockIdx.x * vec_hidden_size + idx;
|
|
||||||
_f16VecPN<scalar_t, width> temp = input_v[id];
|
|
||||||
auto [x2, x4, x6] = temp.sum_pows();
|
|
||||||
|
|
||||||
variance += x2;
|
|
||||||
variance2 += x4;
|
|
||||||
variance3 += x6;
|
|
||||||
}
|
|
||||||
|
|
||||||
float3 thread_variances = make_float3(variance, variance2, variance3);
|
|
||||||
|
|
||||||
struct SumOp {
|
|
||||||
__device__ float3 operator()(const float3& a, const float3& b) const {
|
|
||||||
return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
using BlockReduce = cub::BlockReduce<float3, 1024>;
|
|
||||||
__shared__ typename BlockReduce::TempStorage reduceStore;
|
|
||||||
float3 block_variances =
|
|
||||||
BlockReduce(reduceStore).Reduce(thread_variances, SumOp{}, blockDim.x);
|
|
||||||
|
|
||||||
variance = block_variances.x;
|
|
||||||
variance2 = block_variances.y;
|
|
||||||
variance3 = block_variances.z;
|
|
||||||
|
|
||||||
__shared__ float s_w2_inv_std;
|
|
||||||
__shared__ float s_w1_inv_std2;
|
|
||||||
__shared__ float s_w0_inv_std3;
|
|
||||||
__shared__ float s_bias;
|
|
||||||
|
|
||||||
if (threadIdx.x == 0) {
|
|
||||||
float w0 = (float)weight[0];
|
|
||||||
float w1 = (float)weight[1];
|
|
||||||
float w2 = (float)weight[2];
|
|
||||||
s_bias = (float)bias[0];
|
|
||||||
|
|
||||||
s_w2_inv_std = w2 * rsqrtf(variance / hidden_size + epsilon);
|
|
||||||
s_w1_inv_std2 = w1 * rsqrtf(variance2 / hidden_size + epsilon);
|
|
||||||
s_w0_inv_std3 = w0 * rsqrtf(variance3 / hidden_size + epsilon);
|
|
||||||
}
|
|
||||||
__syncthreads();
|
|
||||||
|
|
||||||
auto* __restrict__ out_v = reinterpret_cast<_f16VecPN<scalar_t, width>*>(out);
|
|
||||||
|
|
||||||
for (int idx = threadIdx.x; idx < vec_hidden_size; idx += blockDim.x) {
|
|
||||||
int id = blockIdx.x * vec_hidden_size + idx;
|
|
||||||
_f16VecPN<scalar_t, width> temp = input_v[id];
|
|
||||||
temp.poly_norm_inplace(s_w2_inv_std, s_w1_inv_std2, s_w0_inv_std3, s_bias);
|
|
||||||
out_v[id] = temp;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Generic poly_norm_kernel
|
|
||||||
The width field is not used here but necessary for other specializations.
|
|
||||||
*/
|
|
||||||
template <typename scalar_t, int width>
|
|
||||||
__global__ std::enable_if_t<(width == 0) || !_typeConvert<scalar_t>::exists>
|
|
||||||
poly_norm_kernel(scalar_t* __restrict__ out, // [..., hidden_size]
|
|
||||||
const scalar_t* __restrict__ input, // [..., hidden_size]
|
|
||||||
const scalar_t* __restrict__ weight, // [3]
|
|
||||||
const scalar_t* __restrict__ bias, // [1]
|
|
||||||
const float epsilon, const int hidden_size) {
|
|
||||||
float variance = 0.0f;
|
|
||||||
float variance2 = 0.0f;
|
|
||||||
float variance3 = 0.0f;
|
|
||||||
|
|
||||||
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
|
|
||||||
float x = (float)input[blockIdx.x * hidden_size + idx];
|
|
||||||
float x2 = x * x;
|
|
||||||
float x4 = x2 * x2;
|
|
||||||
float x6 = x4 * x2;
|
|
||||||
|
|
||||||
variance += x2;
|
|
||||||
variance2 += x4;
|
|
||||||
variance3 += x6;
|
|
||||||
}
|
|
||||||
|
|
||||||
float3 thread_variances = make_float3(variance, variance2, variance3);
|
|
||||||
|
|
||||||
struct SumOp {
|
|
||||||
__device__ float3 operator()(const float3& a, const float3& b) const {
|
|
||||||
return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
using BlockReduce = cub::BlockReduce<float3, 1024>;
|
|
||||||
__shared__ typename BlockReduce::TempStorage reduceStore;
|
|
||||||
float3 block_variances =
|
|
||||||
BlockReduce(reduceStore).Reduce(thread_variances, SumOp{}, blockDim.x);
|
|
||||||
|
|
||||||
variance = block_variances.x;
|
|
||||||
variance2 = block_variances.y;
|
|
||||||
variance3 = block_variances.z;
|
|
||||||
|
|
||||||
__shared__ float s_w2_inv_std;
|
|
||||||
__shared__ float s_w1_inv_std2;
|
|
||||||
__shared__ float s_w0_inv_std3;
|
|
||||||
__shared__ float s_bias;
|
|
||||||
|
|
||||||
if (threadIdx.x == 0) {
|
|
||||||
float w0 = (float)weight[0];
|
|
||||||
float w1 = (float)weight[1];
|
|
||||||
float w2 = (float)weight[2];
|
|
||||||
s_bias = (float)bias[0];
|
|
||||||
|
|
||||||
s_w2_inv_std = w2 * rsqrtf(variance / hidden_size + epsilon);
|
|
||||||
s_w1_inv_std2 = w1 * rsqrtf(variance2 / hidden_size + epsilon);
|
|
||||||
s_w0_inv_std3 = w0 * rsqrtf(variance3 / hidden_size + epsilon);
|
|
||||||
}
|
|
||||||
__syncthreads();
|
|
||||||
|
|
||||||
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
|
|
||||||
float x = (float)input[blockIdx.x * hidden_size + idx];
|
|
||||||
float x2 = x * x;
|
|
||||||
float x3 = x2 * x;
|
|
||||||
|
|
||||||
out[blockIdx.x * hidden_size + idx] =
|
|
||||||
(scalar_t)(x * s_w2_inv_std + x2 * s_w1_inv_std2 + x3 * s_w0_inv_std3 +
|
|
||||||
s_bias);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
} // namespace vllm
|
} // namespace vllm
|
||||||
|
|
||||||
void rms_norm(torch::Tensor& out, // [..., hidden_size]
|
void rms_norm(torch::Tensor& out, // [..., hidden_size]
|
||||||
@@ -352,18 +159,26 @@ void rms_norm(torch::Tensor& out, // [..., hidden_size]
|
|||||||
TORCH_CHECK(weight.is_contiguous());
|
TORCH_CHECK(weight.is_contiguous());
|
||||||
|
|
||||||
int hidden_size = input.size(-1);
|
int hidden_size = input.size(-1);
|
||||||
int num_tokens = input.numel() / hidden_size;
|
|
||||||
int64_t input_stride = input.stride(-2);
|
// We cannot just use `input.stride(-2)` if the tensor is not row-major.
|
||||||
|
// Instead, we use a 2d view to get the second-innermost stride.
|
||||||
|
// That way the dimensions (except the last one) can be arbitrarily permuted.
|
||||||
|
torch::Tensor input_view = input.view({-1, hidden_size});
|
||||||
|
|
||||||
|
int num_tokens = input_view.numel() / hidden_size;
|
||||||
|
int64_t input_stride = input_view.stride(-2);
|
||||||
|
|
||||||
dim3 grid(num_tokens);
|
dim3 grid(num_tokens);
|
||||||
dim3 block(std::min(hidden_size, 1024));
|
dim3 block(std::min(hidden_size, 1024));
|
||||||
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
|
const at::cuda::OptionalCUDAGuard device_guard(device_of(input_view));
|
||||||
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "rms_norm_kernel", [&] {
|
VLLM_DISPATCH_FLOATING_TYPES(
|
||||||
vllm::rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>(
|
input_view.scalar_type(), "rms_norm_kernel", [&] {
|
||||||
out.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), input_stride,
|
vllm::rms_norm_kernel<scalar_t><<<grid, block, 0, stream>>>(
|
||||||
weight.data_ptr<scalar_t>(), epsilon, num_tokens, hidden_size);
|
out.data_ptr<scalar_t>(), input_view.data_ptr<scalar_t>(),
|
||||||
});
|
input_stride, weight.data_ptr<scalar_t>(), epsilon, num_tokens,
|
||||||
|
hidden_size);
|
||||||
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
#define LAUNCH_FUSED_ADD_RMS_NORM(width) \
|
#define LAUNCH_FUSED_ADD_RMS_NORM(width) \
|
||||||
@@ -380,6 +195,8 @@ void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size]
|
|||||||
torch::Tensor& residual, // [..., hidden_size]
|
torch::Tensor& residual, // [..., hidden_size]
|
||||||
torch::Tensor& weight, // [hidden_size]
|
torch::Tensor& weight, // [hidden_size]
|
||||||
double epsilon) {
|
double epsilon) {
|
||||||
|
TORCH_CHECK(weight.scalar_type() == input.scalar_type());
|
||||||
|
TORCH_CHECK(input.scalar_type() == residual.scalar_type());
|
||||||
TORCH_CHECK(residual.is_contiguous());
|
TORCH_CHECK(residual.is_contiguous());
|
||||||
TORCH_CHECK(weight.is_contiguous());
|
TORCH_CHECK(weight.is_contiguous());
|
||||||
int hidden_size = input.size(-1);
|
int hidden_size = input.size(-1);
|
||||||
@@ -414,7 +231,7 @@ void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size]
|
|||||||
wt_ptr % req_alignment_bytes == 0;
|
wt_ptr % req_alignment_bytes == 0;
|
||||||
bool offsets_are_multiple_of_vector_width =
|
bool offsets_are_multiple_of_vector_width =
|
||||||
hidden_size % vector_width == 0 && input_stride % vector_width == 0;
|
hidden_size % vector_width == 0 && input_stride % vector_width == 0;
|
||||||
bool batch_invariant_launch = vllm::vllm_kernel_override_batch_invariant();
|
bool batch_invariant_launch = vllm::vllm_is_batch_invariant();
|
||||||
if (ptrs_are_aligned && offsets_are_multiple_of_vector_width &&
|
if (ptrs_are_aligned && offsets_are_multiple_of_vector_width &&
|
||||||
!batch_invariant_launch) {
|
!batch_invariant_launch) {
|
||||||
LAUNCH_FUSED_ADD_RMS_NORM(8);
|
LAUNCH_FUSED_ADD_RMS_NORM(8);
|
||||||
@@ -422,50 +239,3 @@ void fused_add_rms_norm(torch::Tensor& input, // [..., hidden_size]
|
|||||||
LAUNCH_FUSED_ADD_RMS_NORM(0);
|
LAUNCH_FUSED_ADD_RMS_NORM(0);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
#define LAUNCH_FUSED_POLY_NORM(width) \
|
|
||||||
VLLM_DISPATCH_FLOATING_TYPES(input.scalar_type(), "poly_norm_kernel", [&] { \
|
|
||||||
vllm::poly_norm_kernel<scalar_t, width><<<grid, block, 0, stream>>>( \
|
|
||||||
out.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(), \
|
|
||||||
weight.data_ptr<scalar_t>(), bias.data_ptr<scalar_t>(), epsilon, \
|
|
||||||
hidden_size); \
|
|
||||||
});
|
|
||||||
|
|
||||||
void poly_norm(torch::Tensor& out, // [..., hidden_size]
|
|
||||||
torch::Tensor& input, // [..., hidden_size]
|
|
||||||
torch::Tensor& weight, // [3]
|
|
||||||
torch::Tensor& bias, // [1]
|
|
||||||
double epsilon) {
|
|
||||||
TORCH_CHECK(out.is_contiguous());
|
|
||||||
TORCH_CHECK(input.is_contiguous());
|
|
||||||
TORCH_CHECK(out.data_ptr() != input.data_ptr());
|
|
||||||
|
|
||||||
int hidden_size = input.size(-1);
|
|
||||||
int num_tokens = input.numel() / hidden_size;
|
|
||||||
|
|
||||||
dim3 grid(num_tokens);
|
|
||||||
/* This kernel is memory-latency bound in many scenarios.
|
|
||||||
When num_tokens is large, a smaller block size allows
|
|
||||||
for increased block occupancy on CUs and better latency
|
|
||||||
hiding on global mem ops. */
|
|
||||||
const int max_block_size = (num_tokens < 256) ? 1024 : 256;
|
|
||||||
dim3 block(std::min(hidden_size, max_block_size));
|
|
||||||
const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
|
|
||||||
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
|
||||||
/*If the tensor types are FP16/BF16, try to use the optimized kernel
|
|
||||||
with packed + vectorized ops.
|
|
||||||
Max optimization is achieved with a width-8 vector of FP16/BF16s
|
|
||||||
since we can load at most 128 bits at once in a global memory op.
|
|
||||||
However, this requires each tensor's data to be aligned to 16
|
|
||||||
bytes.
|
|
||||||
*/
|
|
||||||
auto inp_ptr = reinterpret_cast<std::uintptr_t>(input.data_ptr());
|
|
||||||
auto out_ptr = reinterpret_cast<std::uintptr_t>(out.data_ptr());
|
|
||||||
bool ptrs_are_aligned = inp_ptr % 16 == 0 && out_ptr % 16 == 0;
|
|
||||||
bool batch_invariant_launch = vllm::vllm_kernel_override_batch_invariant();
|
|
||||||
if (ptrs_are_aligned && hidden_size % 8 == 0 && !batch_invariant_launch) {
|
|
||||||
LAUNCH_FUSED_POLY_NORM(8);
|
|
||||||
} else {
|
|
||||||
LAUNCH_FUSED_POLY_NORM(0);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
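The removed `poly_norm` kernels above compute, per row, three root-mean-square scales (over x^2, x^4 and x^6) and then combine the linear, quadratic and cubic terms with `weight[2]`, `weight[1]`, `weight[0]` and the bias. A minimal single-threaded host sketch of that arithmetic, useful as a reference when checking a kernel like this, might look as follows. The function name `poly_norm_row_reference` is hypothetical and not part of vLLM; it works in `float` only and ignores the FP16/BF16 packing of the vectorized path.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for ONE row of the poly_norm kernel.
// weight = {w0, w1, w2} and a scalar bias, matching the [3] and [1] shapes
// noted in the kernel signature.
std::vector<float> poly_norm_row_reference(const std::vector<float>& x,
                                           const float weight[3], float bias,
                                           float epsilon) {
  if (x.empty()) return {};
  const float n = static_cast<float>(x.size());

  // First pass: accumulate sum(x^2), sum(x^4), sum(x^6), as the kernel does
  // in `variance`, `variance2` and `variance3`.
  float sum_x2 = 0.0f, sum_x4 = 0.0f, sum_x6 = 0.0f;
  for (float v : x) {
    const float v2 = v * v;
    const float v4 = v2 * v2;
    sum_x2 += v2;
    sum_x4 += v4;
    sum_x6 += v4 * v2;
  }

  // Fold each weight into its inverse RMS, mirroring s_w2_inv_std,
  // s_w1_inv_std2 and s_w0_inv_std3.
  const float w2_inv_std = weight[2] / std::sqrt(sum_x2 / n + epsilon);
  const float w1_inv_std2 = weight[1] / std::sqrt(sum_x4 / n + epsilon);
  const float w0_inv_std3 = weight[0] / std::sqrt(sum_x6 / n + epsilon);

  // Second pass: linear + quadratic + cubic term + bias.
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float v = x[i];
    const float v2 = v * v;
    const float v3 = v2 * v;
    out[i] = v * w2_inv_std + v2 * w1_inv_std2 + v3 * w0_inv_std3 + bias;
  }
  return out;
}
```

For a constant row every normalized term has magnitude 1, so the output collapses to `w0 + w1 + w2 + bias`, which makes a convenient sanity check.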
|
|||||||
@@ -10,6 +10,7 @@
|
|||||||
#include "dispatch_utils.h"
|
#include "dispatch_utils.h"
|
||||||
#include "cub_helpers.h"
|
#include "cub_helpers.h"
|
||||||
#include "core/batch_invariant.hpp"
|
#include "core/batch_invariant.hpp"
|
||||||
|
#include "quantization/vectorization_utils.cuh"
|
||||||
|
|
||||||
#include <torch/cuda.h>
|
#include <torch/cuda.h>
|
||||||
#include <c10/cuda/CUDAGuard.h>
|
#include <c10/cuda/CUDAGuard.h>
|
||||||
@@ -28,10 +29,22 @@ __global__ void rms_norm_static_fp8_quant_kernel(
|
|||||||
__shared__ float s_variance;
|
__shared__ float s_variance;
|
||||||
float variance = 0.0f;
|
float variance = 0.0f;
|
||||||
|
|
||||||
for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
|
const scalar_t* input_row = input + blockIdx.x * input_stride;
|
||||||
const float x = (float)input[blockIdx.x * input_stride + idx];
|
|
||||||
|
constexpr int VEC_SIZE = 8;
|
||||||
|
auto vec_op = [&variance](const vec_n_t<scalar_t, VEC_SIZE>& vec) {
|
||||||
|
#pragma unroll
|
||||||
|
for (int i = 0; i < VEC_SIZE; ++i) {
|
||||||
|
float x = static_cast<float>(vec.val[i]);
|
||||||
|
variance += x * x;
|
||||||
|
}
|
||||||
|
};
|
||||||
|
auto scalar_op = [&variance](const scalar_t& val) {
|
||||||
|
float x = static_cast<float>(val);
|
||||||
variance += x * x;
|
variance += x * x;
|
||||||
}
|
};
|
||||||
|
vllm::vectorize_read_with_alignment<VEC_SIZE>(
|
||||||
|
input_row, hidden_size, threadIdx.x, blockDim.x, vec_op, scalar_op);
|
||||||
|
|
||||||
using BlockReduce = cub::BlockReduce<float, 1024>;
|
using BlockReduce = cub::BlockReduce<float, 1024>;
|
||||||
__shared__ typename BlockReduce::TempStorage reduceStore;
|
__shared__ typename BlockReduce::TempStorage reduceStore;
|
||||||
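The hunk above replaces the scalar strided loop with a pair of callbacks, `vec_op` for 8-element vectors and `scalar_op` for leftover elements, handed to `vllm::vectorize_read_with_alignment`. The sketch below is a single-threaded host analog of that callback pattern only. The helper `for_each_chunk_then_tail` is hypothetical, is not the vLLM helper, and deliberately ignores pointer alignment and the per-thread striding that the real call receives through `threadIdx.x` and `blockDim.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kVecSize = 8;  // same width as VEC_SIZE in the hunk

// Hypothetical helper: call vec_op on each full chunk of VEC elements,
// then scalar_op on whatever does not fill a whole chunk.
template <int VEC, typename VecOp, typename ScalarOp>
void for_each_chunk_then_tail(const float* data, std::size_t n, VecOp vec_op,
                              ScalarOp scalar_op) {
  const std::size_t full = n / VEC * VEC;
  for (std::size_t i = 0; i < full; i += VEC) vec_op(data + i);
  for (std::size_t i = full; i < n; ++i) scalar_op(data[i]);
}

// Sum of squares of one row, accumulated through the two callbacks.
float sum_of_squares(const std::vector<float>& row) {
  float variance = 0.0f;
  auto vec_op = [&variance](const float* chunk) {
    for (int i = 0; i < kVecSize; ++i) variance += chunk[i] * chunk[i];
  };
  auto scalar_op = [&variance](float v) { variance += v * v; };
  for_each_chunk_then_tail<kVecSize>(row.data(), row.size(), vec_op,
                                     scalar_op);
  return variance;
}
```

Both callbacks accumulate into the same captured `variance`, so the result does not depend on how the row is split between the vector and scalar paths.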
@@ -216,6 +229,8 @@ void fused_add_rms_norm_static_fp8_quant(
|
|||||||
double epsilon) {
|
double epsilon) {
|
||||||
TORCH_CHECK(out.is_contiguous());
|
TORCH_CHECK(out.is_contiguous());
|
||||||
TORCH_CHECK(residual.is_contiguous());
|
TORCH_CHECK(residual.is_contiguous());
|
||||||
|
TORCH_CHECK(residual.scalar_type() == input.scalar_type());
|
||||||
|
TORCH_CHECK(weight.scalar_type() == input.scalar_type());
|
||||||
int hidden_size = input.size(-1);
|
int hidden_size = input.size(-1);
|
||||||
int input_stride = input.stride(-2);
|
int input_stride = input.stride(-2);
|
||||||
int num_tokens = input.numel() / hidden_size;
|
int num_tokens = input.numel() / hidden_size;
|
||||||
@@ -241,7 +256,7 @@ void fused_add_rms_norm_static_fp8_quant(
|
|||||||
auto wt_ptr = reinterpret_cast<std::uintptr_t>(weight.data_ptr());
|
auto wt_ptr = reinterpret_cast<std::uintptr_t>(weight.data_ptr());
|
||||||
bool ptrs_are_aligned =
|
bool ptrs_are_aligned =
|
||||||
inp_ptr % 16 == 0 && res_ptr % 16 == 0 && wt_ptr % 16 == 0;
|
inp_ptr % 16 == 0 && res_ptr % 16 == 0 && wt_ptr % 16 == 0;
|
||||||
bool batch_invariant_launch = vllm::vllm_kernel_override_batch_invariant();
|
bool batch_invariant_launch = vllm::vllm_is_batch_invariant();
|
||||||
if (ptrs_are_aligned && hidden_size % 8 == 0 && input_stride % 8 == 0 &&
|
if (ptrs_are_aligned && hidden_size % 8 == 0 && input_stride % 8 == 0 &&
|
||||||
!batch_invariant_launch) {
|
!batch_invariant_launch) {
|
||||||
LAUNCH_FUSED_ADD_RMS_NORM(8);
|
LAUNCH_FUSED_ADD_RMS_NORM(8);
|
||||||
|
|||||||
@@ -8,12 +8,77 @@
|
|||||||
|
|
||||||
#include "../cuda_compat.h"
|
#include "../cuda_compat.h"
|
||||||
#include "../dispatch_utils.h"
|
#include "../dispatch_utils.h"
|
||||||
|
#include "core/math.hpp"
|
||||||
|
|
||||||
#define CEILDIV(x, y) (((x) + (y) - 1) / (y))
|
#define CEILDIV(x, y) (((x) + (y) - 1) / (y))
|
||||||
|
|
||||||
namespace vllm {
|
namespace vllm {
|
||||||
namespace moe {
|
namespace moe {
|
||||||
|
|
||||||
|
namespace batched_moe_align_block_size {
|
||||||
|
|
||||||
|
// Note num_threads needs to be 1024 for BlockScan Reduction in the kernel.
|
||||||
|
static constexpr int32_t num_threads = 1024;
|
||||||
|
static constexpr int32_t num_blocks = 1;
|
||||||
|
__global__ void batched_moe_align_block_size_kernel(
|
||||||
|
int32_t const num_batches, int32_t const max_tokens_per_batch,
|
||||||
|
int32_t const block_size, int32_t const* __restrict__ batch_num_tokens,
|
||||||
|
int32_t* __restrict__ sorted_ids, int32_t* __restrict__ block_ids,
|
||||||
|
int32_t* __restrict__ num_tokens_post_pad) {
|
||||||
|
// TODO(varun): This is a naive implementation. Could be optimized.
|
||||||
|
|
||||||
|
size_t const batch_id = threadIdx.x;
|
||||||
|
size_t const stride = blockDim.x * gridDim.x;
|
||||||
|
int32_t const num_blocks_per_batch =
|
||||||
|
CEILDIV(max_tokens_per_batch, block_size);
|
||||||
|
int32_t const sorted_ids_size =
|
||||||
|
num_blocks_per_batch * num_batches * block_size;
|
||||||
|
int32_t const block_ids_size = sorted_ids_size / block_size;
|
||||||
|
int32_t const SENTINEL =
|
||||||
|
num_batches * max_tokens_per_batch; // To denote invalid entries.
|
||||||
|
// Initialize sorted_ids
|
||||||
|
for (size_t i = threadIdx.x; i < sorted_ids_size; i += stride) {
|
||||||
|
sorted_ids[i] = SENTINEL;
|
||||||
|
}
|
||||||
|
// Initialize block_ids with -1
|
||||||
|
for (size_t i = threadIdx.x; i < block_ids_size; i += stride) {
|
||||||
|
block_ids[i] = -1;
|
||||||
|
}
|
||||||
|
|
||||||
|
int32_t b_num_tokens = 0;
|
||||||
|
if (batch_id < num_batches) {
|
||||||
|
b_num_tokens = batch_num_tokens[batch_id];
|
||||||
|
}
|
||||||
|
int32_t const ceil_b_num_tokens =
|
||||||
|
CEILDIV(b_num_tokens, block_size) * block_size;
|
||||||
|
|
||||||
|
// Compute prefix sum over token counts per expert
|
||||||
|
using BlockScan = cub::BlockScan<int32_t, 1024>;
|
||||||
|
__shared__ typename BlockScan::TempStorage temp_storage;
|
||||||
|
int cumsum_val;
|
||||||
|
BlockScan(temp_storage).ExclusiveSum(ceil_b_num_tokens, cumsum_val);
|
||||||
|
__syncthreads();
|
||||||
|
|
||||||
|
bool const is_last_batch = batch_id == (num_batches - 1);
|
||||||
|
if (is_last_batch) {
|
||||||
|
*num_tokens_post_pad = cumsum_val + ceil_b_num_tokens;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (batch_id < num_batches) {
|
||||||
|
int32_t const batch_offset = batch_id * max_tokens_per_batch;
|
||||||
|
for (size_t i = 0; i < b_num_tokens; ++i) {
|
||||||
|
sorted_ids[cumsum_val + i] = batch_offset + i;
|
||||||
|
}
|
||||||
|
|
||||||
|
int32_t const block_start = cumsum_val / block_size;
|
||||||
|
int32_t const num_blocks = ceil_b_num_tokens / block_size;
|
||||||
|
for (size_t i = 0; i < num_blocks; ++i) {
|
||||||
|
block_ids[block_start + i] = batch_id;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} // namespace batched_moe_align_block_size
|
||||||
|
|
||||||
template <typename scalar_t>
|
template <typename scalar_t>
|
||||||
__global__ void moe_align_block_size_kernel(
|
__global__ void moe_align_block_size_kernel(
|
||||||
const scalar_t* __restrict__ topk_ids,
|
const scalar_t* __restrict__ topk_ids,
|
||||||
@@ -280,6 +345,33 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
|
|||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
|
void batched_moe_align_block_size(int64_t max_tokens_per_batch,
|
||||||
|
int64_t block_size,
|
||||||
|
torch::Tensor const& batch_num_tokens,
|
||||||
|
torch::Tensor sorted_ids,
|
||||||
|
torch::Tensor batch_ids,
|
||||||
|
torch::Tensor num_tokens_post_pad) {
|
||||||
|
namespace batched_kernel = vllm::moe::batched_moe_align_block_size;
|
||||||
|
|
||||||
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
|
int32_t const B = batch_num_tokens.size(0);
|
||||||
|
int32_t const num_blocks_per_batch =
|
||||||
|
round_to_next_multiple_of(max_tokens_per_batch, block_size) / block_size;
|
||||||
|
int32_t const num_blocks = num_blocks_per_batch * B;
|
||||||
|
int64_t const sorted_ids_size = num_blocks * block_size;
|
||||||
|
|
||||||
|
TORCH_CHECK(sorted_ids.size(0) == sorted_ids_size);
|
||||||
|
TORCH_CHECK(batch_ids.size(0) == sorted_ids_size / block_size);
|
||||||
|
TORCH_CHECK(num_tokens_post_pad.size(0) == 1);
|
||||||
|
TORCH_CHECK(B <= batched_kernel::num_threads);
|
||||||
|
|
||||||
|
batched_kernel::batched_moe_align_block_size_kernel<<<
|
||||||
|
batched_kernel::num_blocks, batched_kernel::num_threads, 0, stream>>>(
|
||||||
|
B, max_tokens_per_batch, block_size, batch_num_tokens.data_ptr<int32_t>(),
|
||||||
|
sorted_ids.data_ptr<int32_t>(), batch_ids.data_ptr<int32_t>(),
|
||||||
|
num_tokens_post_pad.data_ptr<int32_t>());
|
||||||
|
}
|
||||||
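The new `batched_moe_align_block_size` kernel pads each batch's token count up to a multiple of `block_size`, takes an exclusive prefix sum of the padded counts, and then writes token ids and owning batch ids at those offsets. A sequential host sketch of the same layout is shown below. The names `BatchedAlignResult` and `batched_align_reference` are hypothetical; the loop replaces the `cub::BlockScan` used on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BatchedAlignResult {
  std::vector<int32_t> sorted_ids;   // token ids, SENTINEL where padded
  std::vector<int32_t> block_ids;    // owning batch per block, -1 if unused
  int32_t num_tokens_post_pad = 0;   // total padded token count
};

// Hypothetical sequential reference for the kernel's output layout.
BatchedAlignResult batched_align_reference(
    const std::vector<int32_t>& batch_num_tokens, int32_t max_tokens_per_batch,
    int32_t block_size) {
  const int32_t num_batches = static_cast<int32_t>(batch_num_tokens.size());
  const int32_t blocks_per_batch =
      (max_tokens_per_batch + block_size - 1) / block_size;
  const int32_t sentinel = num_batches * max_tokens_per_batch;

  BatchedAlignResult r;
  r.sorted_ids.assign(
      static_cast<std::size_t>(blocks_per_batch) * num_batches * block_size,
      sentinel);
  r.block_ids.assign(static_cast<std::size_t>(blocks_per_batch) * num_batches,
                     -1);

  int32_t offset = 0;  // exclusive prefix sum of the padded counts
  for (int32_t b = 0; b < num_batches; ++b) {
    const int32_t n = batch_num_tokens[b];
    const int32_t padded = (n + block_size - 1) / block_size * block_size;
    for (int32_t i = 0; i < n; ++i) {
      r.sorted_ids[offset + i] = b * max_tokens_per_batch + i;
    }
    for (int32_t i = 0; i < padded / block_size; ++i) {
      r.block_ids[offset / block_size + i] = b;
    }
    offset += padded;
  }
  r.num_tokens_post_pad = offset;
  return r;
}
```

With `batch_num_tokens = {3, 0, 5}`, `max_tokens_per_batch = 6` and `block_size = 4`, batch 0 occupies one block, batch 1 none, and batch 2 two blocks, so the padded total is 12 and the remaining slots keep the sentinel value 18.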
|
|
||||||
void moe_sum(torch::Tensor& input, // [num_tokens, topk, hidden_size]
|
void moe_sum(torch::Tensor& input, // [num_tokens, topk, hidden_size]
|
||||||
torch::Tensor& output) // [num_tokens, hidden_size]
|
torch::Tensor& output) // [num_tokens, hidden_size]
|
||||||
{
|
{
|
||||||
|
|||||||
169
csrc/moe/moe_lora_align_sum_kernels.cu
Normal file
@@ -0,0 +1,169 @@
|
|||||||
|
#include <stdio.h>
|
||||||
|
#include <stdlib.h>
|
||||||
|
#include <time.h>
|
||||||
|
#include <torch/all.h>
|
||||||
|
#include <ATen/cuda/CUDAContext.h>
|
||||||
|
#include <c10/cuda/CUDAGuard.h>
|
||||||
|
|
||||||
|
#include <ATen/ATen.h>
|
||||||
|
#include <ATen/cuda/Atomic.cuh>
|
||||||
|
|
||||||
|
#include "../cuda_compat.h"
|
||||||
|
#include "../dispatch_utils.h"
|
||||||
|
#include "core/math.hpp"
|
||||||
|
|
||||||
|
namespace {
|
||||||
|
|
||||||
|
__device__ __forceinline__ int32_t index(int32_t total_col, int32_t row,
|
||||||
|
int32_t col) {
|
||||||
|
return row * total_col + col;
|
||||||
|
}
|
||||||
|
|
||||||
|
} // namespace
|
||||||
|
|
||||||
|
// TODO: Refactor common parts with moe_align_sum_kernels
|
||||||
|
template <typename scalar_t, typename token_cnts_t>
|
||||||
|
__global__ void moe_lora_align_sum_kernel(
|
||||||
|
scalar_t* __restrict__ topk_ids, int32_t* token_lora_mapping,
|
||||||
|
int64_t block_size, int num_experts, int max_loras, size_t numel,
|
||||||
|
int max_num_tokens_padded, int max_num_m_blocks,
|
||||||
|
int32_t* __restrict__ sorted_token_ids, int32_t* __restrict__ expert_ids,
|
||||||
|
int topk_num, int32_t* total_tokens_post_pad) {
|
||||||
|
const size_t tokens_per_thread = div_ceil(numel, blockDim.x);
|
||||||
|
const size_t start_idx = threadIdx.x * tokens_per_thread;
|
||||||
|
|
||||||
|
int lora_id = blockIdx.x;
|
||||||
|
extern __shared__ int32_t shared_mem[];
|
||||||
|
int32_t* cumsum = shared_mem;
|
||||||
|
token_cnts_t* tokens_cnts = (token_cnts_t*)(shared_mem + num_experts + 1);
|
||||||
|
|
||||||
|
// Initialize sorted_token_ids with numel
|
||||||
|
for (size_t it = threadIdx.x; it < max_num_tokens_padded; it += blockDim.x) {
|
||||||
|
sorted_token_ids[lora_id * max_num_tokens_padded + it] = numel;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Initialize expert_ids with -1
|
||||||
|
for (size_t it = threadIdx.x; it < max_num_m_blocks; it += blockDim.x) {
|
||||||
|
expert_ids[lora_id * max_num_m_blocks + it] = -1;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Initialize total_tokens_post_pad with 0
|
||||||
|
if (threadIdx.x == 0) {
|
||||||
|
total_tokens_post_pad[lora_id] = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
for (int i = 0; i < num_experts; ++i) {
|
||||||
|
tokens_cnts[index(num_experts, threadIdx.x + 1, i)] = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
for (int i = start_idx; i < numel && i < start_idx + tokens_per_thread; ++i) {
|
||||||
|
int mask = token_lora_mapping[i / topk_num] == lora_id;
|
||||||
|
int idx = index(num_experts, threadIdx.x + 1, topk_ids[i]);
|
||||||
|
tokens_cnts[idx] += mask;
|
||||||
|
}
|
||||||
|
|
||||||
|
__syncthreads();
|
||||||
|
|
||||||
|
// For each expert we accumulate the token counts from the different threads.
|
||||||
|
if (threadIdx.x < num_experts) {
|
||||||
|
tokens_cnts[index(num_experts, 0, threadIdx.x)] = 0;
|
||||||
|
for (int i = 1; i <= blockDim.x; ++i) {
|
||||||
|
tokens_cnts[index(num_experts, i, threadIdx.x)] +=
|
||||||
|
tokens_cnts[index(num_experts, i - 1, threadIdx.x)];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
__syncthreads();
|
||||||
|
|
||||||
|
// We accumulate the token counts of all experts in thread 0.
|
||||||
|
if (threadIdx.x == 0) {
|
||||||
|
cumsum[0] = 0;
|
||||||
|
for (int i = 1; i <= num_experts; ++i) {
|
||||||
|
cumsum[i] = cumsum[i - 1] +
|
||||||
|
div_ceil(tokens_cnts[index(num_experts, blockDim.x, i - 1)],
|
||||||
|
block_size) *
|
||||||
|
block_size;
|
||||||
|
}
|
||||||
|
total_tokens_post_pad[lora_id] = static_cast<int32_t>(cumsum[num_experts]);
|
||||||
|
}
|
||||||
|
|
||||||
|
__syncthreads();
|
||||||
|
|
||||||
|
/**
|
||||||
|
* For each expert, each thread processes the tokens of the corresponding
|
||||||
|
* blocks and stores the corresponding expert_id for each block.
|
||||||
|
*/
|
||||||
|
if (threadIdx.x < num_experts) {
|
||||||
|
for (int i = cumsum[threadIdx.x]; i < cumsum[threadIdx.x + 1];
|
||||||
|
i += block_size) {
|
||||||
|
expert_ids[index(max_num_m_blocks, lora_id, i / block_size)] =
|
||||||
|
threadIdx.x;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
for (int i = start_idx; i < numel && i < start_idx + tokens_per_thread; ++i) {
|
||||||
|
int32_t expert_id = topk_ids[i];
|
||||||
|
/** The cumsum[expert_id] stores the starting index of the tokens that the
|
||||||
|
* expert with expert_id needs to process, and
|
||||||
|
* tokens_cnts[threadIdx.x][expert_id] stores the indices of the tokens
|
||||||
|
* processed by the expert with expert_id within the current thread's token
|
||||||
|
* shard.
|
||||||
|
*/
|
||||||
|
int32_t rank_post_pad =
|
||||||
|
tokens_cnts[index(num_experts, threadIdx.x, expert_id)] +
|
||||||
|
cumsum[expert_id];
|
||||||
|
|
||||||
|
int mask = (int)token_lora_mapping[i / topk_num] == lora_id;
|
||||||
|
atomicAdd(
|
||||||
|
&sorted_token_ids[index(max_num_tokens_padded, lora_id, rank_post_pad)],
|
||||||
|
(i - numel) * mask);
|
||||||
|
tokens_cnts[index(num_experts, threadIdx.x, expert_id)] += mask;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
void moe_lora_align_block_size(torch::Tensor topk_ids,
|
||||||
|
torch::Tensor token_lora_mapping,
|
||||||
|
int64_t num_experts, int64_t block_size,
|
||||||
|
int64_t max_loras, int64_t max_num_tokens_padded,
|
||||||
|
int64_t max_num_m_blocks,
|
||||||
|
torch::Tensor sorted_token_ids,
|
||||||
|
torch::Tensor expert_ids,
|
||||||
|
torch::Tensor num_tokens_post_pad) {
|
||||||
|
const int topk_num = topk_ids.size(1);
|
||||||
|
|
||||||
|
TORCH_CHECK(block_size > 0, "block_size should be greater than 0. ");
|
||||||
|
|
||||||
|
int device_max_shared_mem;
|
||||||
|
auto dev = topk_ids.get_device();
|
||||||
|
cudaDeviceGetAttribute(&device_max_shared_mem,
|
||||||
|
cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);
|
||||||
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
|
|
||||||
|
const int32_t num_thread = max((int32_t)num_experts, 128); // WARP_SIZE,
|
||||||
|
TORCH_CHECK(num_thread <= 1024,
|
||||||
|
"num_thread must be at most 1024, "
|
||||||
|
"and fallback is not implemented yet.");
|
||||||
|
const int32_t shared_mem = (num_thread + 1) * num_experts * sizeof(int32_t) +
|
||||||
|
(num_experts + 1) * sizeof(int32_t);
|
||||||
|
|
||||||
|
if (shared_mem > device_max_shared_mem) {
|
||||||
|
TORCH_CHECK(false,
|
||||||
|
"Shared memory usage exceeds device limit, and global memory "
|
||||||
|
"fallback is not implemented yet.");
|
||||||
|
}
|
||||||
|
|
||||||
|
VLLM_DISPATCH_INTEGRAL_TYPES(
|
||||||
|
topk_ids.scalar_type(), "moe_lora_align_sum_kernel", [&] {
|
||||||
|
dim3 blockDim(num_thread);
|
||||||
|
auto kernel = moe_lora_align_sum_kernel<scalar_t, int32_t>;
|
||||||
|
AT_CUDA_CHECK(VLLM_DevFuncAttribute_SET_MaxDynamicSharedMemorySize(
|
||||||
|
(void*)kernel, shared_mem));
|
||||||
|
kernel<<<max_loras, blockDim, shared_mem, stream>>>(
|
||||||
|
topk_ids.data_ptr<scalar_t>(),
|
||||||
|
token_lora_mapping.data_ptr<int32_t>(), block_size, num_experts,
|
||||||
|
max_loras, topk_ids.numel(), max_num_tokens_padded,
|
||||||
|
max_num_m_blocks, sorted_token_ids.data_ptr<int32_t>(),
|
||||||
|
expert_ids.data_ptr<int32_t>(), topk_num,
|
||||||
|
num_tokens_post_pad.data_ptr<int32_t>());
|
||||||
|
});
|
||||||
|
}
|
||||||
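The new `moe_lora_align_sum_kernel` runs one thread block per LoRA id. Each block counts, per expert, only the tokens mapped to its LoRA, pads those counts to `block_size`, and scatters token indices into that LoRA's row of `sorted_token_ids`. A sequential host sketch of what one block produces follows. The names `LoraAlignResult` and `lora_align_reference` are hypothetical, and the per-thread `tokens_cnts` table of the kernel is replaced by a plain per-expert counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct LoraAlignResult {
  std::vector<int32_t> sorted_token_ids;  // one LoRA's row, numel where unused
  std::vector<int32_t> expert_ids;        // expert per block, -1 where unused
  int32_t total_tokens_post_pad = 0;
};

// Hypothetical sequential reference for a single lora_id (one CUDA block).
LoraAlignResult lora_align_reference(
    const std::vector<int32_t>& topk_ids,
    const std::vector<int32_t>& token_lora_mapping, int32_t lora_id,
    int32_t topk_num, int32_t num_experts, int32_t block_size,
    int32_t max_num_tokens_padded, int32_t max_num_m_blocks) {
  const int32_t numel = static_cast<int32_t>(topk_ids.size());
  LoraAlignResult r;
  r.sorted_token_ids.assign(static_cast<std::size_t>(max_num_tokens_padded),
                            numel);
  r.expert_ids.assign(static_cast<std::size_t>(max_num_m_blocks), -1);

  // Count tokens per expert, keeping only tokens that belong to this LoRA.
  std::vector<int32_t> counts(num_experts, 0);
  for (int32_t i = 0; i < numel; ++i) {
    if (token_lora_mapping[i / topk_num] == lora_id) ++counts[topk_ids[i]];
  }

  // Prefix sum of the counts, each padded up to a multiple of block_size.
  std::vector<int32_t> cumsum(num_experts + 1, 0);
  for (int32_t e = 0; e < num_experts; ++e) {
    cumsum[e + 1] =
        cumsum[e] + (counts[e] + block_size - 1) / block_size * block_size;
  }
  r.total_tokens_post_pad = cumsum[num_experts];

  // Label every block in an expert's range with that expert's id.
  for (int32_t e = 0; e < num_experts; ++e) {
    for (int32_t i = cumsum[e]; i < cumsum[e + 1]; i += block_size) {
      r.expert_ids[i / block_size] = e;
    }
  }

  // Scatter matching token indices, in order, into each expert's range.
  std::vector<int32_t> next(cumsum.begin(), cumsum.end() - 1);
  for (int32_t i = 0; i < numel; ++i) {
    if (token_lora_mapping[i / topk_num] == lora_id) {
      r.sorted_token_ids[next[topk_ids[i]]++] = i;
    }
  }
  return r;
}
```

The kernel reaches the same final values differently: it pre-fills each slot with `numel` and then issues `atomicAdd` with `(i - numel) * mask`, which turns the slot into `i` for matching tokens and leaves it unchanged otherwise.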
@@ -4,7 +4,7 @@
|
|||||||
|
|
||||||
void topk_softmax(torch::Tensor& topk_weights, torch::Tensor& topk_indices,
|
void topk_softmax(torch::Tensor& topk_weights, torch::Tensor& topk_indices,
|
||||||
torch::Tensor& token_expert_indices,
|
torch::Tensor& token_expert_indices,
|
||||||
torch::Tensor& gating_output);
|
torch::Tensor& gating_output, bool renormalize);
|
||||||
|
|
||||||
void moe_sum(torch::Tensor& input, torch::Tensor& output);
|
void moe_sum(torch::Tensor& input, torch::Tensor& output);
|
||||||
|
|
||||||
@@ -12,6 +12,22 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
|
|||||||
int64_t block_size, torch::Tensor sorted_token_ids,
|
int64_t block_size, torch::Tensor sorted_token_ids,
|
||||||
torch::Tensor experts_ids,
|
torch::Tensor experts_ids,
|
||||||
torch::Tensor num_tokens_post_pad);
|
torch::Tensor num_tokens_post_pad);
|
||||||
|
|
||||||
|
void batched_moe_align_block_size(int64_t max_tokens_per_batch,
|
||||||
|
int64_t block_size,
|
||||||
|
torch::Tensor const& expert_num_tokens,
|
||||||
|
torch::Tensor sorted_ids,
|
||||||
|
torch::Tensor expert_ids,
|
||||||
|
torch::Tensor num_tokens_post_pad);
|
||||||
|
|
||||||
|
void moe_lora_align_block_size(torch::Tensor topk_ids,
|
||||||
|
torch::Tensor token_lora_mapping,
|
||||||
|
int64_t num_experts, int64_t block_size,
|
||||||
|
int64_t max_loras, int64_t max_num_tokens_padded,
|
||||||
|
int64_t max_num_m_blocks,
|
||||||
|
torch::Tensor sorted_token_ids,
|
||||||
|
torch::Tensor expert_ids,
|
||||||
|
torch::Tensor num_tokens_post_pad);
|
||||||
#ifndef USE_ROCM
|
#ifndef USE_ROCM
|
||||||
torch::Tensor moe_wna16_gemm(torch::Tensor input, torch::Tensor output,
|
torch::Tensor moe_wna16_gemm(torch::Tensor input, torch::Tensor output,
|
||||||
torch::Tensor b_qweight, torch::Tensor b_scales,
|
torch::Tensor b_qweight, torch::Tensor b_scales,
|
||||||
|
|||||||
@@ -16,12 +16,23 @@
|
|||||||
* See the License for the specific language governing permissions and
|
* See the License for the specific language governing permissions and
|
||||||
* limitations under the License.
|
* limitations under the License.
|
||||||
*/
|
*/
|
||||||
|
#include <type_traits>
|
||||||
#include <torch/all.h>
|
#include <torch/all.h>
|
||||||
#include <ATen/cuda/CUDAContext.h>
|
#include <ATen/cuda/CUDAContext.h>
|
||||||
#include <c10/cuda/CUDAGuard.h>
|
#include <c10/cuda/CUDAGuard.h>
|
||||||
#include "../cuda_compat.h"
|
#include "../cuda_compat.h"
|
||||||
#include "../cub_helpers.h"
|
#include "../cub_helpers.h"
|
||||||
|
|
||||||
|
#ifndef USE_ROCM
|
||||||
|
#include <cuda_bf16.h>
|
||||||
|
#include <cuda_fp16.h>
|
||||||
|
#else
|
||||||
|
#include <hip/hip_bf16.h>
|
||||||
|
#include <hip/hip_fp16.h>
|
||||||
|
typedef __hip_bfloat16 __nv_bfloat16;
|
||||||
|
typedef __hip_bfloat162 __nv_bfloat162;
|
||||||
|
#endif
|
||||||
|
|
||||||
#define MAX(a, b) ((a) > (b) ? (a) : (b))
|
#define MAX(a, b) ((a) > (b) ? (a) : (b))
|
||||||
#define MIN(a, b) ((a) < (b) ? (a) : (b))
|
#define MIN(a, b) ((a) < (b) ? (a) : (b))
|
||||||
|
|
||||||
@@ -36,16 +47,27 @@ template <
|
|||||||
/// Alignment requirement in bytes
|
/// Alignment requirement in bytes
|
||||||
int Alignment = sizeof(T) * N
|
int Alignment = sizeof(T) * N
|
||||||
>
|
>
|
||||||
class alignas(Alignment) AlignedArray {
|
struct alignas(Alignment) AlignedArray {
|
||||||
float data[N];
|
T data[N];
|
||||||
};
|
};
|
||||||
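The hunk above changes `AlignedArray` from a `class` holding `float data[N]` to a `struct` holding `T data[N]`. The effect can be checked on the host with a stand-in type: once the element type follows `T`, a width-8 array of 16-bit values occupies exactly 16 bytes, and `data` is publicly accessible. The name `AlignedArraySketch` is hypothetical, the template parameter list is assumed from the visible default argument, and `std::uint16_t` only stands in for a 16-bit float type such as `__half`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in mirroring the post-change definition.
template <typename T, int N, int Alignment = sizeof(T) * N>
struct alignas(Alignment) AlignedArraySketch {
  T data[N];
};

// With T-typed storage the size matches the intended vector width.
static_assert(sizeof(AlignedArraySketch<float, 4>) == 16, "4 x 32-bit");
static_assert(sizeof(AlignedArraySketch<std::uint16_t, 8>) == 16,
              "8 x 16-bit");
static_assert(alignof(AlignedArraySketch<std::uint16_t, 8>) == 16,
              "aligned for a 128-bit access");

// `struct` makes `data` public, so callers can read lanes directly.
inline float first_lane(const AlignedArraySketch<float, 4>& v) {
  return v.data[0];
}
```

With the earlier `float data[N]` member, the same 8-wide instantiation for a 16-bit type would have been 32 bytes rather than 16.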
|
|
||||||
|
template <typename T>
|
||||||
|
__device__ __forceinline__ float toFloat(T value) {
|
||||||
|
if constexpr (std::is_same_v<T, float>) {
|
||||||
|
return value;
|
||||||
|
} else if constexpr (std::is_same_v<T, __nv_bfloat16>) {
|
||||||
|
return __bfloat162float(value);
|
||||||
|
} else if constexpr (std::is_same_v<T, __half>) {
|
||||||
|
return __half2float(value);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// ====================== Softmax things ===============================
|
// ====================== Softmax things ===============================
|
||||||
// We have our own implementation of softmax here so we can support transposing the output
|
// We have our own implementation of softmax here so we can support transposing the output
|
||||||
// in the softmax kernel when we extend this module to support expert-choice routing.
|
// in the softmax kernel when we extend this module to support expert-choice routing.
|
||||||
template <int TPB>
|
template <int TPB, typename InputType>
|
||||||
__launch_bounds__(TPB) __global__
|
__launch_bounds__(TPB) __global__
|
||||||
void moeSoftmax(const float* input, const bool* finished, float* output, const int num_cols)
|
void moeSoftmax(const InputType* input, const bool* finished, float* output, const int num_cols)
|
||||||
{
|
{
|
||||||
using BlockReduce = cub::BlockReduce<float, TPB>;
|
using BlockReduce = cub::BlockReduce<float, TPB>;
|
||||||
__shared__ typename BlockReduce::TempStorage tmpStorage;
|
__shared__ typename BlockReduce::TempStorage tmpStorage;
|
||||||
@@ -66,7 +88,8 @@ __launch_bounds__(TPB) __global__
|
|||||||
for (int ii = threadIdx.x; ii < num_cols; ii += TPB)
|
for (int ii = threadIdx.x; ii < num_cols; ii += TPB)
|
||||||
{
|
{
|
||||||
const int idx = thread_row_offset + ii;
|
const int idx = thread_row_offset + ii;
|
||||||
threadData = max(static_cast<float>(input[idx]), threadData);
|
const float val = toFloat(input[idx]);
|
||||||
|
threadData = max(val, threadData);
|
||||||
}
|
}
|
||||||
|
|
||||||
const float maxElem = BlockReduce(tmpStorage).Reduce(threadData, CubMaxOp());
|
const float maxElem = BlockReduce(tmpStorage).Reduce(threadData, CubMaxOp());
|
||||||
@@ -81,7 +104,8 @@ __launch_bounds__(TPB) __global__
|
|||||||
for (int ii = threadIdx.x; ii < num_cols; ii += TPB)
|
for (int ii = threadIdx.x; ii < num_cols; ii += TPB)
|
||||||
{
|
{
|
||||||
const int idx = thread_row_offset + ii;
|
const int idx = thread_row_offset + ii;
|
||||||
threadData += exp((static_cast<float>(input[idx]) - float_max));
|
const float val = toFloat(input[idx]);
|
||||||
|
threadData += expf(val - float_max);
|
||||||
}
|
}
|
||||||
|
|
||||||
const auto Z = BlockReduce(tmpStorage).Reduce(threadData, CubAddOp());
|
const auto Z = BlockReduce(tmpStorage).Reduce(threadData, CubAddOp());
|
||||||
@@ -95,8 +119,9 @@ __launch_bounds__(TPB) __global__
|
|||||||
for (int ii = threadIdx.x; ii < num_cols; ii += TPB)
|
for (int ii = threadIdx.x; ii < num_cols; ii += TPB)
|
||||||
{
|
{
|
||||||
const int idx = thread_row_offset + ii;
|
const int idx = thread_row_offset + ii;
|
||||||
const float val = exp((static_cast<float>(input[idx]) - float_max)) * normalizing_factor;
|
const float val = toFloat(input[idx]);
|
||||||
output[idx] = val;
|
const float softmax_val = expf(val - float_max) * normalizing_factor;
|
||||||
|
output[idx] = softmax_val;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
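The `moeSoftmax` hunks above show three strided passes over a row: a max reduction, a sum of shifted exponentials, and the final normalized write. A minimal single-row, host-side sketch of that structure follows. It assumes `normalizing_factor` is `1 / Z`, since the lines that define it fall outside the hunks shown. The name `softmax_row` is hypothetical, and the block-wide `cub::BlockReduce` steps are replaced by plain loops.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch of the three passes for one row.
// The kernel performs passes 1 and 2 as block-wide reductions instead.
inline std::vector<float> softmax_row(const std::vector<float>& logits) {
  std::vector<float> out(logits.size());
  if (logits.empty()) return out;

  // Pass 1: row maximum, subtracted below so the exponentials cannot overflow.
  const float row_max = *std::max_element(logits.begin(), logits.end());

  // Pass 2: sum of shifted exponentials (the normalizer Z).
  float z = 0.f;
  for (float v : logits) z += std::exp(v - row_max);

  // Pass 3: write the normalized probabilities (assumes the factor is 1 / Z).
  const float inv_z = 1.f / z;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    out[i] = std::exp(logits[i] - row_max) * inv_z;
  }
  return out;
}
```

Converting each input element to `float` once per pass, as the new side of the diff does with `toFloat`, keeps all of this arithmetic in single precision regardless of the input type.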
@@ -110,7 +135,8 @@ __launch_bounds__(TPB) __global__ void moeTopK(
|
|||||||
const int num_experts,
|
const int num_experts,
|
||||||
const int k,
|
const int k,
|
||||||
const int start_expert,
|
const int start_expert,
|
||||||
const int end_expert)
|
const int end_expert,
|
||||||
|
const bool renormalize)
|
||||||
{
|
{
|
||||||
|
|
||||||
using cub_kvp = cub::KeyValuePair<int, float>;
|
using cub_kvp = cub::KeyValuePair<int, float>;
|
||||||
@@ -125,6 +151,7 @@ __launch_bounds__(TPB) __global__ void moeTopK(
|
|||||||
|
|
||||||
const bool row_is_active = finished ? !finished[block_row] : true;
|
const bool row_is_active = finished ? !finished[block_row] : true;
|
||||||
const int thread_read_offset = blockIdx.x * num_experts;
|
const int thread_read_offset = blockIdx.x * num_experts;
|
||||||
|
float selected_sum = 0.f;
|
||||||
for (int k_idx = 0; k_idx < k; ++k_idx)
|
for (int k_idx = 0; k_idx < k; ++k_idx)
|
||||||
{
|
{
|
||||||
thread_kvp.key = 0;
|
thread_kvp.key = 0;
|
||||||
@@ -163,9 +190,23 @@ __launch_bounds__(TPB) __global__ void moeTopK(
|
|||||||
indices[idx] = should_process_row ? (expert - start_expert) : num_experts;
|
indices[idx] = should_process_row ? (expert - start_expert) : num_experts;
|
||||||
assert(indices[idx] >= 0);
|
assert(indices[idx] >= 0);
|
||||||
source_rows[idx] = k_idx * num_rows + block_row;
|
source_rows[idx] = k_idx * num_rows + block_row;
|
||||||
|
if (renormalize) {
|
||||||
|
selected_sum += result_kvp.value;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
__syncthreads();
|
__syncthreads();
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Renormalize the k weights for this row to sum to 1, if requested.
|
||||||
|
if (renormalize) {
|
||||||
|
if (threadIdx.x == 0) {
|
||||||
|
const float denom = selected_sum > 0.f ? selected_sum : 1.f;
|
||||||
|
for (int k_idx = 0; k_idx < k; ++k_idx) {
|
||||||
|
const int idx = k * block_row + k_idx;
|
||||||
|
output[idx] = output[idx] / denom;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
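The renormalization step added to `moeTopK` above divides the k selected weights of a row by their sum, falling back to a denominator of 1 when that sum is not positive. A small host-side sketch of the same arithmetic for one row is given below. The function name `renormalize_topk` is hypothetical, and the sketch recomputes the sum in one place rather than accumulating it during selection as the kernel does.

```cpp
#include <vector>

// Hypothetical host-side helper: rescale the k selected weights of one row
// so they sum to 1. Mirrors the guard shown in the diff: a non-positive sum
// falls back to a denominator of 1, which leaves the weights unchanged.
inline void renormalize_topk(std::vector<float>& weights) {
  float selected_sum = 0.f;
  for (float w : weights) selected_sum += w;

  const float denom = selected_sum > 0.f ? selected_sum : 1.f;
  for (float& w : weights) w /= denom;
}
```

The guard avoids a division by zero for rows whose selected weights are all zero.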
// ====================== TopK softmax things ===============================
|
// ====================== TopK softmax things ===============================
|
||||||
@@ -184,21 +225,30 @@ __launch_bounds__(TPB) __global__ void moeTopK(
|
|||||||
2) This implementation assumes k is small, but will work for any k.
|
2) This implementation assumes k is small, but will work for any k.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
template <int VPT, int NUM_EXPERTS, int WARPS_PER_CTA, int BYTES_PER_LDG, int WARP_SIZE_PARAM, typename IndType>
|
template <int VPT, int NUM_EXPERTS, int WARPS_PER_CTA, int BYTES_PER_LDG, int WARP_SIZE_PARAM, typename IndType, typename InputType = float>
|
||||||
__launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
|
__launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
|
||||||
void topkGatingSoftmax(const float* input, const bool* finished, float* output, const int num_rows, IndType* indices,
|
void topkGatingSoftmax(const InputType* input, const bool* finished, float* output, const int num_rows, IndType* indices,
|
||||||
int* source_rows, const int k, const int start_expert, const int end_expert)
|
int* source_rows, const int k, const int start_expert, const int end_expert, const bool renormalize)
|
||||||
{
|
{
|
||||||
|
static_assert(std::is_same_v<InputType, float> || std::is_same_v<InputType, __nv_bfloat16> ||
|
||||||
|
std::is_same_v<InputType, __half>,
|
||||||
|
"InputType must be float, __nv_bfloat16, or __half");
|
||||||
|
|
||||||
// We begin by enforcing compile time assertions and setting up compile time constants.
|
// We begin by enforcing compile time assertions and setting up compile time constants.
|
||||||
static_assert(BYTES_PER_LDG == (BYTES_PER_LDG & -BYTES_PER_LDG), "BYTES_PER_LDG must be power of 2");
|
static_assert(BYTES_PER_LDG == (BYTES_PER_LDG & -BYTES_PER_LDG), "BYTES_PER_LDG must be power of 2");
|
||||||
static_assert(BYTES_PER_LDG <= 16, "BYTES_PER_LDG must be leq 16");
|
static_assert(BYTES_PER_LDG <= 16, "BYTES_PER_LDG must be leq 16");
|
||||||
|
|
||||||
// Number of bytes each thread pulls in per load
|
// Number of bytes each thread pulls in per load
|
||||||
static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(float);
|
static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(InputType);
|
||||||
static constexpr int ELTS_PER_ROW = NUM_EXPERTS;
|
static constexpr int ELTS_PER_ROW = NUM_EXPERTS;
|
||||||
static constexpr int THREADS_PER_ROW = ELTS_PER_ROW / VPT;
|
static constexpr int THREADS_PER_ROW = ELTS_PER_ROW / VPT;
|
||||||
static constexpr int LDG_PER_THREAD = VPT / ELTS_PER_LDG;
|
static constexpr int LDG_PER_THREAD = VPT / ELTS_PER_LDG;
|
||||||
|
|
||||||
|
if constexpr (std::is_same_v<InputType, __nv_bfloat16> || std::is_same_v<InputType, __half>) {
|
||||||
|
static_assert(ELTS_PER_LDG == 1 || ELTS_PER_LDG % 2 == 0,
|
||||||
|
"ELTS_PER_LDG must be 1 or even for 16-bit conversion");
|
||||||
|
}
|
||||||
|
|
||||||
// Restrictions based on previous section.
|
// Restrictions based on previous section.
|
||||||
static_assert(VPT % ELTS_PER_LDG == 0, "The elements per thread must be a multiple of the elements per ldg");
|
static_assert(VPT % ELTS_PER_LDG == 0, "The elements per thread must be a multiple of the elements per ldg");
|
||||||
static_assert(WARP_SIZE_PARAM % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp");
|
static_assert(WARP_SIZE_PARAM % THREADS_PER_ROW == 0, "The threads per row must cleanly divide the threads per warp");
|
||||||
@@ -236,27 +286,71 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
|
|||||||
|
|
||||||
// We finally start setting up the read pointers for each thread. First, each thread jumps to the start of the
|
// We finally start setting up the read pointers for each thread. First, each thread jumps to the start of the
|
||||||
// row it will read.
|
// row it will read.
|
||||||
const float* thread_row_ptr = input + thread_row * ELTS_PER_ROW;
|
const InputType* thread_row_ptr = input + thread_row * ELTS_PER_ROW;
|
||||||
|
|
||||||
// Now, we compute the group each thread belong to in order to determine the first column to start loads.
|
// Now, we compute the group each thread belong to in order to determine the first column to start loads.
|
||||||
const int thread_group_idx = threadIdx.x % THREADS_PER_ROW;
|
const int thread_group_idx = threadIdx.x % THREADS_PER_ROW;
|
||||||
const int first_elt_read_by_thread = thread_group_idx * ELTS_PER_LDG;
|
const int first_elt_read_by_thread = thread_group_idx * ELTS_PER_LDG;
|
||||||
const float* thread_read_ptr = thread_row_ptr + first_elt_read_by_thread;
|
const InputType* thread_read_ptr = thread_row_ptr + first_elt_read_by_thread;
|
||||||
|
|
||||||
// Determine the pointer type to use to read in the data depending on the BYTES_PER_LDG template param. In theory,
|
|
||||||
// this can support all powers of 2 up to 16.
|
|
||||||
// NOTE(woosuk): The original implementation uses CUTLASS aligned array here.
|
|
||||||
// We defined our own aligned array and use it here to avoid the dependency on CUTLASS.
|
|
||||||
using AccessType = AlignedArray<float, ELTS_PER_LDG>;
|
|
||||||
|
|
||||||
// Finally, we pull in the data from global mem
|
// Finally, we pull in the data from global mem
|
||||||
float row_chunk[VPT];
|
float row_chunk[VPT];
|
||||||
AccessType* row_chunk_vec_ptr = reinterpret_cast<AccessType*>(&row_chunk);
|
|
||||||
const AccessType* vec_thread_read_ptr = reinterpret_cast<const AccessType*>(thread_read_ptr);
|
// NOTE(zhuhaoran): dispatch different input types loading, BF16/FP16 convert to float
|
||||||
|
if constexpr (std::is_same_v<InputType, float>) {
|
||||||
|
using VecType = AlignedArray<float, ELTS_PER_LDG>;
|
||||||
|
VecType* row_chunk_vec_ptr = reinterpret_cast<VecType*>(&row_chunk);
|
||||||
|
const VecType* vec_thread_read_ptr = reinterpret_cast<const VecType*>(thread_read_ptr);
|
||||||
#pragma unroll
|
#pragma unroll
|
||||||
for (int ii = 0; ii < LDG_PER_THREAD; ++ii)
|
for (int ii = 0; ii < LDG_PER_THREAD; ++ii) {
|
||||||
{
|
row_chunk_vec_ptr[ii] = vec_thread_read_ptr[ii * THREADS_PER_ROW];
|
||||||
row_chunk_vec_ptr[ii] = vec_thread_read_ptr[ii * THREADS_PER_ROW];
|
}
|
||||||
|
} else if constexpr (std::is_same_v<InputType, __nv_bfloat16>) {
|
||||||
|
if constexpr (ELTS_PER_LDG >= 2) {
|
||||||
|
using VecType = AlignedArray<__nv_bfloat16, ELTS_PER_LDG>;
|
||||||
|
float2* row_chunk_f2 = reinterpret_cast<float2*>(row_chunk);
|
||||||
|
const VecType* vec_thread_read_ptr = reinterpret_cast<const VecType*>(thread_read_ptr);
|
||||||
|
#pragma unroll
|
||||||
|
for (int ii = 0; ii < LDG_PER_THREAD; ++ii) {
|
||||||
|
VecType vec = vec_thread_read_ptr[ii * THREADS_PER_ROW];
|
||||||
|
int base_idx_f2 = ii * ELTS_PER_LDG / 2;
|
||||||
|
#pragma unroll
|
||||||
|
for (int jj = 0; jj < ELTS_PER_LDG / 2; ++jj) {
|
||||||
|
row_chunk_f2[base_idx_f2 + jj] = __bfloat1622float2(
|
||||||
|
*reinterpret_cast<const __nv_bfloat162*>(vec.data + jj * 2)
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else { // ELTS_PER_LDG == 1
|
||||||
|
#pragma unroll
|
||||||
|
for (int ii = 0; ii < LDG_PER_THREAD; ++ii) {
|
||||||
|
const __nv_bfloat16* scalar_ptr = thread_read_ptr + ii * THREADS_PER_ROW;
|
||||||
|
row_chunk[ii] = __bfloat162float(*scalar_ptr);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else if constexpr (std::is_same_v<InputType, __half>) {
|
||||||
|
if constexpr (ELTS_PER_LDG >= 2) {
|
||||||
|
using VecType = AlignedArray<__half, ELTS_PER_LDG>;
|
||||||
|
float2* row_chunk_f2 = reinterpret_cast<float2*>(row_chunk);
|
||||||
|
const VecType* vec_thread_read_ptr = reinterpret_cast<const VecType*>(thread_read_ptr);
|
||||||
|
#pragma unroll
|
||||||
|
for (int ii = 0; ii < LDG_PER_THREAD; ++ii) {
|
||||||
|
VecType vec = vec_thread_read_ptr[ii * THREADS_PER_ROW];
|
||||||
|
int base_idx_f2 = ii * ELTS_PER_LDG / 2;
|
||||||
|
#pragma unroll
|
||||||
|
for (int jj = 0; jj < ELTS_PER_LDG / 2; ++jj) {
|
||||||
|
row_chunk_f2[base_idx_f2 + jj] = __half22float2(
|
||||||
|
*reinterpret_cast<const __half2*>(vec.data + jj * 2)
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else { // ELTS_PER_LDG == 1
|
||||||
|
#pragma unroll
|
||||||
|
for (int ii = 0; ii < LDG_PER_THREAD; ++ii) {
|
||||||
|
const __half* scalar_ptr = thread_read_ptr + ii * THREADS_PER_ROW;
|
||||||
|
row_chunk[ii] = __half2float(*scalar_ptr);
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// First, we perform a max reduce within the thread. We can do the max in fp16 safely (I think) and just
|
// First, we perform a max reduce within the thread. We can do the max in fp16 safely (I think) and just
|
||||||
@@ -310,6 +404,7 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
|
|||||||
int start_col = first_elt_read_by_thread;
|
int start_col = first_elt_read_by_thread;
|
||||||
static constexpr int COLS_PER_GROUP_LDG = ELTS_PER_LDG * THREADS_PER_ROW;
|
static constexpr int COLS_PER_GROUP_LDG = ELTS_PER_LDG * THREADS_PER_ROW;
|
||||||
|
|
||||||
|
float selected_sum = 0.f;
|
||||||
for (int k_idx = 0; k_idx < k; ++k_idx)
|
for (int k_idx = 0; k_idx < k; ++k_idx)
|
||||||
{
|
{
|
||||||
// First, each thread does the local argmax
|
// First, each thread does the local argmax
|
||||||
@@ -363,6 +458,9 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
|
|||||||
output[idx] = max_val;
|
output[idx] = max_val;
|
||||||
indices[idx] = should_process_row ? (expert - start_expert) : NUM_EXPERTS;
|
indices[idx] = should_process_row ? (expert - start_expert) : NUM_EXPERTS;
|
||||||
source_rows[idx] = k_idx * num_rows + thread_row;
|
source_rows[idx] = k_idx * num_rows + thread_row;
|
||||||
|
if (renormalize) {
|
||||||
|
selected_sum += max_val;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// Finally, we clear the value in the thread with the current max if there is another iteration to run.
|
// Finally, we clear the value in the thread with the current max if there is another iteration to run.
|
||||||
@@ -380,15 +478,28 @@ __launch_bounds__(WARPS_PER_CTA* WARP_SIZE_PARAM) __global__
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Renormalize the k weights for this row to sum to 1, if requested.
|
||||||
|
if (renormalize) {
|
||||||
|
if (thread_group_idx == 0)
|
||||||
|
{
|
||||||
|
const float denom = selected_sum > 0.f ? selected_sum : 1.f;
|
||||||
|
for (int k_idx = 0; k_idx < k; ++k_idx)
|
||||||
|
{
|
||||||
|
const int idx = k * thread_row + k_idx;
|
||||||
|
output[idx] = output[idx] / denom;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
namespace detail
|
namespace detail
|
||||||
{
|
{
|
||||||
// Constructs some constants needed to partition the work across threads at compile time.
|
// Constructs some constants needed to partition the work across threads at compile time.
|
||||||
template <int EXPERTS, int BYTES_PER_LDG, int WARP_SIZE_PARAM>
|
template <int EXPERTS, int BYTES_PER_LDG, int WARP_SIZE_PARAM, typename InputType>
|
||||||
struct TopkConstants
|
struct TopkConstants
|
||||||
{
|
{
|
||||||
static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(float);
|
static constexpr int ELTS_PER_LDG = BYTES_PER_LDG / sizeof(InputType);
|
||||||
static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0, "");
|
static_assert(EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0 || EXPERTS % (ELTS_PER_LDG * WARP_SIZE_PARAM) == 0, "");
|
||||||
static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM));
|
static constexpr int VECs_PER_THREAD = MAX(1, EXPERTS / (ELTS_PER_LDG * WARP_SIZE_PARAM));
|
||||||
static constexpr int VPT = VECs_PER_THREAD * ELTS_PER_LDG;
|
static constexpr int VPT = VECs_PER_THREAD * ELTS_PER_LDG;
|
||||||
@@ -397,20 +508,21 @@ struct TopkConstants
|
|||||||
};
|
};
|
||||||
} // namespace detail
|
} // namespace detail
|
||||||
|
|
||||||
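Switching `ELTS_PER_LDG` from `sizeof(float)` to `sizeof(InputType)` changes how many columns one warp-wide load covers, which in turn affects the divisibility check in `TopkConstants`. The sketch below restates only the arithmetic visible in the diff. `SketchConstants`, `COLS_PER_WARP_LDG`, the `valid` flag (used in place of the `static_assert` so both outcomes can be inspected), the `Half16` stand-in type, and the expert count 192 are all illustrative assumptions.

```cpp
#include <cstdint>

// Illustrative restatement of the compile-time arithmetic shown in the diff.
// SketchConstants is a hypothetical name, not the real TopkConstants.
template <int EXPERTS, int BYTES_PER_LDG, int WARP_SIZE_PARAM, typename InputType>
struct SketchConstants {
  static_assert(BYTES_PER_LDG == (BYTES_PER_LDG & -BYTES_PER_LDG),
                "BYTES_PER_LDG must be power of 2");
  static constexpr int ELTS_PER_LDG =
      BYTES_PER_LDG / static_cast<int>(sizeof(InputType));
  // Columns covered when every thread of a warp issues one load.
  static constexpr int COLS_PER_WARP_LDG = ELTS_PER_LDG * WARP_SIZE_PARAM;
  // Same condition as the static_assert in the diff, exposed as a flag.
  static constexpr bool valid = EXPERTS / COLS_PER_WARP_LDG == 0 ||
                                EXPERTS % COLS_PER_WARP_LDG == 0;
  static constexpr int VECs_PER_THREAD =
      (EXPERTS / COLS_PER_WARP_LDG) > 1 ? EXPERTS / COLS_PER_WARP_LDG : 1;
  static constexpr int VPT = VECs_PER_THREAD * ELTS_PER_LDG;
};

using Half16 = std::uint16_t;  // stand-in for a 16-bit input type

// 192 experts, 8-byte loads: fine for float (2 elements * 32 lanes = 64).
static_assert(SketchConstants<192, 8, 32, float>::valid, "");
// The same 8-byte load holds 4 16-bit elements (128 columns), and 192 is
// neither below 128 nor a multiple of it.
static_assert(!SketchConstants<192, 8, 32, Half16>::valid, "");
// Dropping to 4-byte loads restores 64 columns per warp-wide load.
static_assert(SketchConstants<192, 4, 32, Half16>::valid, "");
```

This is consistent with the diff choosing 4 bytes for `BYTES_PER_LDG_MULTIPLE_64` when the input is `__nv_bfloat16` or `__half`, and 8 bytes otherwise.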
template <int EXPERTS, int WARPS_PER_TB, int WARP_SIZE_PARAM, int MAX_BYTES_PER_LDG, typename IndType>
|
template <int EXPERTS, int WARPS_PER_TB, int WARP_SIZE_PARAM, int MAX_BYTES_PER_LDG, typename IndType, typename InputType>
|
||||||
void topkGatingSoftmaxLauncherHelper(const float* input, const bool* finished, float* output, IndType* indices,
|
void topkGatingSoftmaxLauncherHelper(const InputType* input, const bool* finished, float* output, IndType* indices,
|
||||||
int* source_row, const int num_rows, const int k, const int start_expert, const int end_expert, cudaStream_t stream)
|
int* source_row, const int num_rows, const int k, const int start_expert, const int end_expert, const bool renormalize,
|
||||||
|
cudaStream_t stream)
|
||||||
{
|
{
|
||||||
static constexpr int BYTES_PER_LDG = MIN(MAX_BYTES_PER_LDG, sizeof(float) * EXPERTS);
|
static constexpr int BYTES_PER_LDG = MIN(MAX_BYTES_PER_LDG, sizeof(InputType) * EXPERTS);
|
||||||
using Constants = detail::TopkConstants<EXPERTS, BYTES_PER_LDG, WARP_SIZE_PARAM>;
|
using Constants = detail::TopkConstants<EXPERTS, BYTES_PER_LDG, WARP_SIZE_PARAM, InputType>;
|
||||||
static constexpr int VPT = Constants::VPT;
|
static constexpr int VPT = Constants::VPT;
|
||||||
static constexpr int ROWS_PER_WARP = Constants::ROWS_PER_WARP;
|
static constexpr int ROWS_PER_WARP = Constants::ROWS_PER_WARP;
|
||||||
const int num_warps = (num_rows + ROWS_PER_WARP - 1) / ROWS_PER_WARP;
|
const int num_warps = (num_rows + ROWS_PER_WARP - 1) / ROWS_PER_WARP;
|
||||||
const int num_blocks = (num_warps + WARPS_PER_TB - 1) / WARPS_PER_TB;
|
const int num_blocks = (num_warps + WARPS_PER_TB - 1) / WARPS_PER_TB;
|
||||||
|
|
||||||
dim3 block_dim(WARP_SIZE_PARAM, WARPS_PER_TB);
|
dim3 block_dim(WARP_SIZE_PARAM, WARPS_PER_TB);
|
||||||
topkGatingSoftmax<VPT, EXPERTS, WARPS_PER_TB, BYTES_PER_LDG, WARP_SIZE_PARAM><<<num_blocks, block_dim, 0, stream>>>(
|
topkGatingSoftmax<VPT, EXPERTS, WARPS_PER_TB, BYTES_PER_LDG, WARP_SIZE_PARAM, IndType, InputType><<<num_blocks, block_dim, 0, stream>>>(
|
||||||
input, finished, output, num_rows, indices, source_row, k, start_expert, end_expert);
|
input, finished, output, num_rows, indices, source_row, k, start_expert, end_expert, renormalize);
|
||||||
}
|
}
|
||||||
|
|
||||||
#ifndef USE_ROCM
|
#ifndef USE_ROCM
|
||||||
@@ -418,26 +530,26 @@ void topkGatingSoftmaxLauncherHelper(const float* input, const bool* finished, f
|
|||||||
static_assert(WARP_SIZE == 32, \
|
static_assert(WARP_SIZE == 32, \
|
||||||
"Unsupported warp size. Only 32 is supported for CUDA"); \
|
"Unsupported warp size. Only 32 is supported for CUDA"); \
|
||||||
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, WARP_SIZE, MAX_BYTES>( \
|
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, WARP_SIZE, MAX_BYTES>( \
|
||||||
gating_output, nullptr, topk_weights, topk_indices, \
|
gating_output, nullptr, topk_weights, topk_indices, token_expert_indices, \
|
||||||
token_expert_indices, num_tokens, topk, 0, num_experts, stream);
|
num_tokens, topk, 0, num_experts, renormalize, stream);
|
||||||
#else
|
#else
|
||||||
#define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB, MAX_BYTES) \
|
#define LAUNCH_SOFTMAX(NUM_EXPERTS, WARPS_PER_TB, MAX_BYTES) \
|
||||||
if (WARP_SIZE == 64) { \
|
if (WARP_SIZE == 64) { \
|
||||||
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 64, MAX_BYTES>( \
|
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 64, MAX_BYTES>( \
|
||||||
gating_output, nullptr, topk_weights, topk_indices, \
|
gating_output, nullptr, topk_weights, topk_indices, token_expert_indices, \
|
||||||
token_expert_indices, num_tokens, topk, 0, num_experts, stream); \
|
num_tokens, topk, 0, num_experts, renormalize, stream); \
|
||||||
} else if (WARP_SIZE == 32) { \
|
} else if (WARP_SIZE == 32) { \
|
||||||
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 32, MAX_BYTES>( \
|
topkGatingSoftmaxLauncherHelper<NUM_EXPERTS, WARPS_PER_TB, 32, MAX_BYTES>( \
|
||||||
gating_output, nullptr, topk_weights, topk_indices, \
|
gating_output, nullptr, topk_weights, topk_indices, token_expert_indices, \
|
||||||
token_expert_indices, num_tokens, topk, 0, num_experts, stream); \
|
num_tokens, topk, 0, num_experts, renormalize, stream); \
|
||||||
} else { \
|
} else { \
|
||||||
assert(false && "Unsupported warp size. Only 32 and 64 are supported for ROCm"); \
|
assert(false && "Unsupported warp size. Only 32 and 64 are supported for ROCm"); \
|
||||||
}
|
}
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
template <typename IndType>
|
template <typename IndType, typename InputType>
|
||||||
void topkGatingSoftmaxKernelLauncher(
|
void topkGatingSoftmaxKernelLauncher(
|
||||||
const float* gating_output,
|
const InputType* gating_output,
|
||||||
float* topk_weights,
|
float* topk_weights,
|
||||||
IndType* topk_indices,
|
IndType* topk_indices,
|
||||||
int* token_expert_indices,
|
int* token_expert_indices,
|
||||||
@@ -445,11 +557,15 @@ void topkGatingSoftmaxKernelLauncher(
|
|||||||
const int num_tokens,
|
const int num_tokens,
|
||||||
const int num_experts,
|
const int num_experts,
|
||||||
const int topk,
|
const int topk,
|
||||||
|
const bool renormalize,
|
||||||
cudaStream_t stream) {
|
cudaStream_t stream) {
|
||||||
static constexpr int WARPS_PER_TB = 4;
|
static constexpr int WARPS_PER_TB = 4;
|
||||||
static constexpr int BYTES_PER_LDG_POWER_OF_2 = 16;
|
static constexpr int BYTES_PER_LDG_POWER_OF_2 = 16;
|
||||||
#ifndef USE_ROCM
|
#ifndef USE_ROCM
|
||||||
static constexpr int BYTES_PER_LDG_MULTIPLE_64 = 8;
|
// for bfloat16 dtype, we need 4 bytes loading to make sure num_experts
|
||||||
|
// elements can be loaded by a warp
|
||||||
|
static constexpr int BYTES_PER_LDG_MULTIPLE_64 =
|
||||||
|
(std::is_same_v<InputType, __nv_bfloat16> || std::is_same_v<InputType, __half>) ? 4 : 8;
|
||||||
#endif
|
#endif
|
||||||
switch (num_experts) {
|
switch (num_experts) {
|
||||||
case 1:
|
case 1:
|
||||||
@@ -506,11 +622,11 @@ void topkGatingSoftmaxKernelLauncher(
|
|||||||
TORCH_CHECK(softmax_workspace != nullptr,
|
TORCH_CHECK(softmax_workspace != nullptr,
|
||||||
"softmax_workspace must be provided for num_experts that are not a power of 2 or multiple of 64.");
|
"softmax_workspace must be provided for num_experts that are not a power of 2 or multiple of 64.");
|
||||||
static constexpr int TPB = 256;
|
static constexpr int TPB = 256;
|
||||||
moeSoftmax<TPB><<<num_tokens, TPB, 0, stream>>>(
|
moeSoftmax<TPB, InputType><<<num_tokens, TPB, 0, stream>>>(
|
||||||
gating_output, nullptr, softmax_workspace, num_experts);
|
gating_output, nullptr, softmax_workspace, num_experts);
|
||||||
moeTopK<TPB><<<num_tokens, TPB, 0, stream>>>(
|
moeTopK<TPB><<<num_tokens, TPB, 0, stream>>>(
|
||||||
softmax_workspace, nullptr, topk_weights, topk_indices, token_expert_indices,
|
softmax_workspace, nullptr, topk_weights, topk_indices, token_expert_indices,
|
||||||
num_experts, topk, 0, num_experts);
|
num_experts, topk, 0, num_experts, renormalize);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -518,11 +634,50 @@ void topkGatingSoftmaxKernelLauncher(
|
|||||||
} // namespace moe
|
} // namespace moe
|
||||||
} // namespace vllm
|
} // namespace vllm
|
||||||
|
|
||||||
|
|
||||||
|
template<typename ComputeType>
|
||||||
|
void dispatch_topk_softmax_launch(
|
||||||
|
torch::Tensor& gating_output,
|
||||||
|
torch::Tensor& topk_weights,
|
||||||
|
torch::Tensor& topk_indices,
|
||||||
|
torch::Tensor& token_expert_indices,
|
||||||
|
torch::Tensor& softmax_workspace,
|
||||||
|
int num_tokens, int num_experts, int topk, bool renormalize, cudaStream_t stream)
|
||||||
|
{
|
||||||
|
if (topk_indices.scalar_type() == at::ScalarType::Int) {
|
||||||
|
vllm::moe::topkGatingSoftmaxKernelLauncher<int, ComputeType>(
|
||||||
|
reinterpret_cast<const ComputeType*>(gating_output.data_ptr()),
|
||||||
|
topk_weights.data_ptr<float>(),
|
||||||
|
topk_indices.data_ptr<int>(),
|
||||||
|
token_expert_indices.data_ptr<int>(),
|
||||||
|
softmax_workspace.data_ptr<float>(),
|
||||||
|
num_tokens, num_experts, topk, renormalize, stream);
|
||||||
|
} else if (topk_indices.scalar_type() == at::ScalarType::UInt32) {
|
||||||
|
vllm::moe::topkGatingSoftmaxKernelLauncher<uint32_t, ComputeType>(
|
||||||
|
reinterpret_cast<const ComputeType*>(gating_output.data_ptr()),
|
||||||
|
topk_weights.data_ptr<float>(),
|
||||||
|
topk_indices.data_ptr<uint32_t>(),
|
||||||
|
token_expert_indices.data_ptr<int>(),
|
||||||
|
softmax_workspace.data_ptr<float>(),
|
||||||
|
num_tokens, num_experts, topk, renormalize, stream);
|
||||||
|
} else {
|
||||||
|
TORCH_CHECK(topk_indices.scalar_type() == at::ScalarType::Long);
|
||||||
|
vllm::moe::topkGatingSoftmaxKernelLauncher<int64_t, ComputeType>(
|
||||||
|
reinterpret_cast<const ComputeType*>(gating_output.data_ptr()),
|
||||||
|
topk_weights.data_ptr<float>(),
|
||||||
|
topk_indices.data_ptr<int64_t>(),
|
||||||
|
token_expert_indices.data_ptr<int>(),
|
||||||
|
softmax_workspace.data_ptr<float>(),
|
||||||
|
num_tokens, num_experts, topk, renormalize, stream);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
void topk_softmax(
|
void topk_softmax(
|
||||||
torch::Tensor& topk_weights, // [num_tokens, topk]
|
torch::Tensor& topk_weights, // [num_tokens, topk]
|
||||||
torch::Tensor& topk_indices, // [num_tokens, topk]
|
torch::Tensor& topk_indices, // [num_tokens, topk]
|
||||||
torch::Tensor& token_expert_indices, // [num_tokens, topk]
|
torch::Tensor& token_expert_indices, // [num_tokens, topk]
|
||||||
torch::Tensor& gating_output) // [num_tokens, num_experts]
|
torch::Tensor& gating_output, // [num_tokens, num_experts]
|
||||||
|
bool renormalize)
|
||||||
{
|
{
|
||||||
const int num_experts = gating_output.size(-1);
|
const int num_experts = gating_output.size(-1);
|
||||||
const auto num_tokens = gating_output.numel() / num_experts;
|
const auto num_tokens = gating_output.numel() / num_experts;
|
||||||
@@ -534,45 +689,19 @@ void topk_softmax(
|
|||||||
|
|
||||||
const at::cuda::OptionalCUDAGuard device_guard(device_of(gating_output));
|
const at::cuda::OptionalCUDAGuard device_guard(device_of(gating_output));
|
||||||
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
torch::Tensor softmax_workspace = torch::empty({workspace_size}, gating_output.options());
|
const auto workspace_options = gating_output.options().dtype(at::ScalarType::Float);
|
||||||
|
torch::Tensor softmax_workspace = torch::empty({workspace_size}, workspace_options);
|
||||||
|
|
||||||
if(topk_indices.scalar_type() == at::ScalarType::Int)
|
if (gating_output.scalar_type() == at::ScalarType::Float) {
|
||||||
{
|
dispatch_topk_softmax_launch<float>(gating_output, topk_weights, topk_indices,
|
||||||
vllm::moe::topkGatingSoftmaxKernelLauncher(
|
token_expert_indices, softmax_workspace, num_tokens, num_experts, topk, renormalize, stream);
|
||||||
gating_output.data_ptr<float>(),
|
} else if (gating_output.scalar_type() == at::ScalarType::Half) {
|
||||||
topk_weights.data_ptr<float>(),
|
dispatch_topk_softmax_launch<__half>(gating_output, topk_weights, topk_indices,
|
||||||
topk_indices.data_ptr<int>(),
|
token_expert_indices, softmax_workspace, num_tokens, num_experts, topk, renormalize, stream);
|
||||||
token_expert_indices.data_ptr<int>(),
|
} else if (gating_output.scalar_type() == at::ScalarType::BFloat16) {
|
||||||
softmax_workspace.data_ptr<float>(),
|
dispatch_topk_softmax_launch<__nv_bfloat16>(gating_output, topk_weights, topk_indices,
|
||||||
num_tokens,
|
token_expert_indices, softmax_workspace, num_tokens, num_experts, topk, renormalize, stream);
|
||||||
num_experts,
|
} else {
|
||||||
topk,
|
TORCH_CHECK(false, "Unsupported gating_output data type: ", gating_output.scalar_type());
|
||||||
stream);
|
|
||||||
}
|
|
||||||
else if (topk_indices.scalar_type() == at::ScalarType::UInt32)
|
|
||||||
{
|
|
||||||
vllm::moe::topkGatingSoftmaxKernelLauncher(
|
|
||||||
gating_output.data_ptr<float>(),
|
|
||||||
topk_weights.data_ptr<float>(),
|
|
||||||
topk_indices.data_ptr<uint32_t>(),
|
|
||||||
token_expert_indices.data_ptr<int>(),
|
|
||||||
softmax_workspace.data_ptr<float>(),
|
|
||||||
num_tokens,
|
|
||||||
num_experts,
|
|
||||||
topk,
|
|
||||||
stream);
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
TORCH_CHECK(topk_indices.scalar_type() == at::ScalarType::Long);
|
|
||||||
vllm::moe::topkGatingSoftmaxKernelLauncher(
|
|
||||||
gating_output.data_ptr<float>(),
|
|
||||||
topk_weights.data_ptr<float>(),
|
|
||||||
topk_indices.data_ptr<int64_t>(),
|
|
||||||
token_expert_indices.data_ptr<int>(),
|
|
||||||
softmax_workspace.data_ptr<float>(),
|
|
||||||
num_tokens,
|
|
||||||
num_experts,
|
|
||||||
topk,
|
|
||||||
stream);
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
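The rewritten `topk_softmax` entry point dispatches in two stages: first on the dtype of `gating_output`, then, inside `dispatch_topk_softmax_launch`, on the dtype of `topk_indices`. The sketch below shows that runtime-to-template dispatch pattern in plain C++. The enums, the `HalfBits` and `BF16Bits` structs, and every function name are hypothetical stand-ins for `at::ScalarType`, the CUDA 16-bit types, and the real launchers. The sketch returns a string describing the chosen instantiation instead of launching a kernel.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for at::ScalarType values (not the PyTorch enum).
enum class InDType { Float, Half, BFloat16, Double };
enum class IdxDType { Int, UInt32, Long };

struct HalfBits { std::uint16_t v; };  // stand-in for __half
struct BF16Bits { std::uint16_t v; };  // stand-in for __nv_bfloat16

// Stand-in for the templated launcher: reports which instantiation ran.
template <typename IndType, typename InputType>
std::string launch_sketch() {
  return "in" + std::to_string(sizeof(InputType)) + "/idx" +
         std::to_string(sizeof(IndType));
}

// Second stage: pick the index type for an already fixed input type.
template <typename InputType>
std::string dispatch_index(IdxDType idx) {
  switch (idx) {
    case IdxDType::Int:    return launch_sketch<std::int32_t, InputType>();
    case IdxDType::UInt32: return launch_sketch<std::uint32_t, InputType>();
    case IdxDType::Long:   return launch_sketch<std::int64_t, InputType>();
  }
  throw std::invalid_argument("unsupported index dtype");
}

// First stage: pick the input type, rejecting anything unsupported.
inline std::string dispatch_input(InDType in, IdxDType idx) {
  switch (in) {
    case InDType::Float:    return dispatch_index<float>(idx);
    case InDType::Half:     return dispatch_index<HalfBits>(idx);
    case InDType::BFloat16: return dispatch_index<BF16Bits>(idx);
    default: break;
  }
  throw std::invalid_argument("unsupported gating_output data type");
}
```

Factoring the index-type branch into a template parameterized on the input type keeps the number of hand-written branches at 3 + 3 rather than 3 × 3.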
|
|||||||
@@ -5,7 +5,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
|
|||||||
// Apply topk softmax to the gating outputs.
|
// Apply topk softmax to the gating outputs.
|
||||||
m.def(
|
m.def(
|
||||||
"topk_softmax(Tensor! topk_weights, Tensor! topk_indices, Tensor! "
|
"topk_softmax(Tensor! topk_weights, Tensor! topk_indices, Tensor! "
|
||||||
"token_expert_indices, Tensor gating_output) -> ()");
|
"token_expert_indices, Tensor gating_output, bool renormalize) -> ()");
|
||||||
m.impl("topk_softmax", torch::kCUDA, &topk_softmax);
|
m.impl("topk_softmax", torch::kCUDA, &topk_softmax);
|
||||||
|
|
||||||
// Calculate the result of moe by summing up the partial results
|
// Calculate the result of moe by summing up the partial results
|
||||||
@@ -22,6 +22,31 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
|
|||||||
" Tensor! num_tokens_post_pad) -> ()");
|
" Tensor! num_tokens_post_pad) -> ()");
|
||||||
m.impl("moe_align_block_size", torch::kCUDA, &moe_align_block_size);
|
m.impl("moe_align_block_size", torch::kCUDA, &moe_align_block_size);
|
||||||
|
|
||||||
|
// Aligning the number of tokens to be processed by each expert such
|
||||||
|
// that it is divisible by the block size, but for the batched case.
|
||||||
|
m.def(
|
||||||
|
"batched_moe_align_block_size(int max_tokens_per_batch,"
|
||||||
|
" int block_size, Tensor expert_num_tokens,"
|
||||||
|
" Tensor! sorted_token_ids,"
|
||||||
|
" Tensor! experts_ids,"
|
||||||
|
" Tensor! num_tokens_post_pad) -> ()");
|
||||||
|
m.impl("batched_moe_align_block_size", torch::kCUDA,
|
||||||
|
&batched_moe_align_block_size);
|
||||||
|
|
||||||
|
// Aligning the number of tokens to be processed by each expert such
|
||||||
|
// that it is divisible by the block size.
|
||||||
|
m.def(
|
||||||
|
"moe_lora_align_block_size(Tensor topk_ids,"
|
||||||
|
" Tensor token_lora_mapping,"
|
||||||
|
" int num_experts,"
|
||||||
|
" int block_size, int max_loras, "
|
||||||
|
" int max_num_tokens_padded, "
|
||||||
|
" int max_num_m_blocks, "
|
||||||
|
" Tensor !sorted_token_ids,"
|
||||||
|
" Tensor !experts_ids,"
|
||||||
|
" Tensor !num_tokens_post_pad) -> () ");
|
||||||
|
m.impl("moe_lora_align_block_size", torch::kCUDA, &moe_lora_align_block_size);
|
||||||
|
|
||||||
#ifndef USE_ROCM
|
#ifndef USE_ROCM
|
||||||
m.def(
|
m.def(
|
||||||
"moe_wna16_gemm(Tensor input, Tensor! output, Tensor b_qweight, "
|
"moe_wna16_gemm(Tensor input, Tensor! output, Tensor b_qweight, "
|
||||||
|
|||||||
12
csrc/ops.h
@@ -92,9 +92,6 @@ void rms_norm(torch::Tensor& out, torch::Tensor& input, torch::Tensor& weight,
|
|||||||
void fused_add_rms_norm(torch::Tensor& input, torch::Tensor& residual,
|
void fused_add_rms_norm(torch::Tensor& input, torch::Tensor& residual,
|
||||||
torch::Tensor& weight, double epsilon);
|
torch::Tensor& weight, double epsilon);
|
||||||
|
|
||||||
void poly_norm(torch::Tensor& out, torch::Tensor& input, torch::Tensor& weight,
|
|
||||||
torch::Tensor& bias, double epsilon);
|
|
||||||
|
|
||||||
void apply_repetition_penalties_(torch::Tensor& logits,
|
void apply_repetition_penalties_(torch::Tensor& logits,
|
||||||
const torch::Tensor& prompt_mask,
|
const torch::Tensor& prompt_mask,
|
||||||
const torch::Tensor& output_mask,
|
const torch::Tensor& output_mask,
|
||||||
@@ -102,8 +99,11 @@ void apply_repetition_penalties_(torch::Tensor& logits,
|
|||||||
|
|
||||||
void top_k_per_row(const torch::Tensor& logits, const torch::Tensor& rowStarts,
|
void top_k_per_row(const torch::Tensor& logits, const torch::Tensor& rowStarts,
|
||||||
const torch::Tensor& rowEnds, torch::Tensor& indices,
|
const torch::Tensor& rowEnds, torch::Tensor& indices,
|
||||||
torch::Tensor& values, int64_t numRows, int64_t stride0,
|
int64_t numRows, int64_t stride0, int64_t stride1);
|
||||||
int64_t stride1);
|
|
||||||
|
void top_k_per_row_decode(const torch::Tensor& logits, int64_t next_n,
|
||||||
|
const torch::Tensor& seq_lens, torch::Tensor& indices,
|
||||||
|
int64_t numRows, int64_t stride0, int64_t stride1);
|
||||||
|
|
||||||
void rms_norm_static_fp8_quant(torch::Tensor& out, torch::Tensor& input,
|
void rms_norm_static_fp8_quant(torch::Tensor& out, torch::Tensor& input,
|
||||||
torch::Tensor& weight, torch::Tensor& scale,
|
torch::Tensor& weight, torch::Tensor& scale,
|
||||||
@@ -307,7 +307,7 @@ void dynamic_scaled_int8_quant(torch::Tensor& out, torch::Tensor const& input,
|
|||||||
torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
|
torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
|
||||||
torch::Tensor b_gptq_qzeros,
|
torch::Tensor b_gptq_qzeros,
|
||||||
torch::Tensor b_gptq_scales, torch::Tensor b_g_idx,
|
torch::Tensor b_gptq_scales, torch::Tensor b_g_idx,
|
||||||
bool use_exllama, int64_t bit);
|
bool use_exllama, bool use_v2_format, int64_t bit);
|
||||||
|
|
||||||
void gptq_shuffle(torch::Tensor q_weight, torch::Tensor q_perm, int64_t bit);
|
void gptq_shuffle(torch::Tensor q_weight, torch::Tensor q_perm, int64_t bit);
|
||||||
|
|
||||||
|
|||||||
@@ -145,7 +145,11 @@ void rms_norm_dynamic_per_token_quant(
|
|||||||
if (scale_ub.has_value()) {
|
if (scale_ub.has_value()) {
|
||||||
TORCH_CHECK(out.dtype() == kFp8Type);
|
TORCH_CHECK(out.dtype() == kFp8Type);
|
||||||
}
|
}
|
||||||
|
TORCH_CHECK(weight.dtype() == input.dtype());
|
||||||
TORCH_CHECK(scales.dtype() == torch::kFloat32);
|
TORCH_CHECK(scales.dtype() == torch::kFloat32);
|
||||||
|
if (residual) {
|
||||||
|
TORCH_CHECK(residual->scalar_type() == input.scalar_type());
|
||||||
|
}
|
||||||
|
|
||||||
VLLM_DISPATCH_FLOATING_TYPES(
|
VLLM_DISPATCH_FLOATING_TYPES(
|
||||||
input.scalar_type(), "rms_norm_dynamic_per_token_quant_dispatch", [&] {
|
input.scalar_type(), "rms_norm_dynamic_per_token_quant_dispatch", [&] {
|
||||||
|
|||||||
@@ -185,7 +185,7 @@ typedef void (*fp_gemm_half_q_half_gptq_kernel)(const half*, const uint32_t*,
|
|||||||
const uint32_t*, const half*,
|
const uint32_t*, const half*,
|
||||||
half*, const int, const int,
|
half*, const int, const int,
|
||||||
const int, const int,
|
const int, const int,
|
||||||
const int*);
|
const bool, const int*);
|
||||||
|
|
||||||
template <bool first_block, int m_count>
|
template <bool first_block, int m_count>
|
||||||
__global__ void gemm_half_q_half_gptq_4bit_kernel(
|
__global__ void gemm_half_q_half_gptq_4bit_kernel(
|
||||||
@@ -193,12 +193,15 @@ __global__ void gemm_half_q_half_gptq_4bit_kernel(
|
|||||||
const uint32_t* __restrict__ b_gptq_qzeros,
|
const uint32_t* __restrict__ b_gptq_qzeros,
|
||||||
const half* __restrict__ b_gptq_scales, half* __restrict__ c,
|
const half* __restrict__ b_gptq_scales, half* __restrict__ c,
|
||||||
const int size_m, const int size_n, const int size_k, const int groups,
|
const int size_m, const int size_n, const int size_k, const int groups,
|
||||||
const int* __restrict__ b_q_perm) {
|
const bool use_v2_format, const int* __restrict__ b_q_perm) {
|
||||||
MatrixView_half a_(a, size_m, size_k);
|
MatrixView_half a_(a, size_m, size_k);
|
||||||
MatrixView_half_rw c_(c, size_m, size_n);
|
MatrixView_half_rw c_(c, size_m, size_n);
|
||||||
MatrixView_q4_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
MatrixView_q4_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
||||||
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
auto t = threadIdx.x;
|
auto t = threadIdx.x;
|
||||||
|
|
||||||
// Block
|
// Block
|
||||||
@@ -256,10 +259,10 @@ __global__ void gemm_half_q_half_gptq_4bit_kernel(
|
|||||||
half2 y1y16[4][2];
|
half2 y1y16[4][2];
|
||||||
b_gptq_qzeros_.item4(zeros, group, n);
|
b_gptq_qzeros_.item4(zeros, group, n);
|
||||||
b_gptq_scales_.item4_f(scales, group, n);
|
b_gptq_scales_.item4_f(scales, group, n);
|
||||||
dequant_4bit_8_prep_zero(zeros[0] + 1, z1z16[0], y1y16[0]);
|
dequant_4bit_8_prep_zero(zeros[0] + zero_offset, z1z16[0], y1y16[0]);
|
||||||
dequant_4bit_8_prep_zero(zeros[1] + 1, z1z16[1], y1y16[1]);
|
dequant_4bit_8_prep_zero(zeros[1] + zero_offset, z1z16[1], y1y16[1]);
|
||||||
dequant_4bit_8_prep_zero(zeros[2] + 1, z1z16[2], y1y16[2]);
|
dequant_4bit_8_prep_zero(zeros[2] + zero_offset, z1z16[2], y1y16[2]);
|
||||||
dequant_4bit_8_prep_zero(zeros[3] + 1, z1z16[3], y1y16[3]);
|
dequant_4bit_8_prep_zero(zeros[3] + zero_offset, z1z16[3], y1y16[3]);
|
||||||
|
|
||||||
// Column result
|
// Column result
|
||||||
float block_c[m_count][4] = {};
|
float block_c[m_count][4] = {};
|
||||||
@@ -272,10 +275,10 @@ __global__ void gemm_half_q_half_gptq_4bit_kernel(
|
|||||||
nextgroup += groupsize;
|
nextgroup += groupsize;
|
||||||
b_gptq_qzeros_.item4(zeros, group, n);
|
b_gptq_qzeros_.item4(zeros, group, n);
|
||||||
b_gptq_scales_.item4_f(scales, group, n);
|
b_gptq_scales_.item4_f(scales, group, n);
|
||||||
dequant_4bit_8_prep_zero(zeros[0] + 1, z1z16[0], y1y16[0]);
|
dequant_4bit_8_prep_zero(zeros[0] + zero_offset, z1z16[0], y1y16[0]);
|
||||||
dequant_4bit_8_prep_zero(zeros[1] + 1, z1z16[1], y1y16[1]);
|
dequant_4bit_8_prep_zero(zeros[1] + zero_offset, z1z16[1], y1y16[1]);
|
||||||
dequant_4bit_8_prep_zero(zeros[2] + 1, z1z16[2], y1y16[2]);
|
dequant_4bit_8_prep_zero(zeros[2] + zero_offset, z1z16[2], y1y16[2]);
|
||||||
dequant_4bit_8_prep_zero(zeros[3] + 1, z1z16[3], y1y16[3]);
|
dequant_4bit_8_prep_zero(zeros[3] + zero_offset, z1z16[3], y1y16[3]);
|
||||||
}
|
}
|
||||||
|
|
||||||
#pragma unroll
|
#pragma unroll
|
||||||
@@ -329,12 +332,15 @@ __global__ void gemm_half_q_half_gptq_2bit_kernel(
|
|||||||
const uint32_t* __restrict__ b_gptq_qzeros,
|
const uint32_t* __restrict__ b_gptq_qzeros,
|
||||||
const half* __restrict__ b_gptq_scales, half* __restrict__ c,
|
const half* __restrict__ b_gptq_scales, half* __restrict__ c,
|
||||||
const int size_m, const int size_n, const int size_k, const int groups,
|
const int size_m, const int size_n, const int size_k, const int groups,
|
||||||
const int* __restrict__ b_q_perm) {
|
const bool use_v2_format, const int* __restrict__ b_q_perm) {
|
||||||
MatrixView_half a_(a, size_m, size_k);
|
MatrixView_half a_(a, size_m, size_k);
|
||||||
MatrixView_half_rw c_(c, size_m, size_n);
|
MatrixView_half_rw c_(c, size_m, size_n);
|
||||||
MatrixView_q2_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
MatrixView_q2_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
||||||
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
auto t = threadIdx.x;
|
auto t = threadIdx.x;
|
||||||
|
|
||||||
// Block
|
// Block
|
||||||
@@ -409,10 +415,10 @@ __global__ void gemm_half_q_half_gptq_2bit_kernel(
|
|||||||
int4 load_int4 = *b_ptr4;
|
int4 load_int4 = *b_ptr4;
|
||||||
|
|
||||||
half2 dq[4][8];
|
half2 dq[4][8];
|
||||||
dequant_2bit_16(load_int4.x, dq[0], size_n, zeros[0] + 1);
|
dequant_2bit_16(load_int4.x, dq[0], size_n, zeros[0] + zero_offset);
|
||||||
dequant_2bit_16(load_int4.y, dq[1], size_n, zeros[1] + 1);
|
dequant_2bit_16(load_int4.y, dq[1], size_n, zeros[1] + zero_offset);
|
||||||
dequant_2bit_16(load_int4.z, dq[2], size_n, zeros[2] + 1);
|
dequant_2bit_16(load_int4.z, dq[2], size_n, zeros[2] + zero_offset);
|
||||||
dequant_2bit_16(load_int4.w, dq[3], size_n, zeros[3] + 1);
|
dequant_2bit_16(load_int4.w, dq[3], size_n, zeros[3] + zero_offset);
|
||||||
|
|
||||||
#pragma unroll
|
#pragma unroll
|
||||||
for (int m = 0; m < m_count; m++) {
|
for (int m = 0; m < m_count; m++) {
|
||||||
@@ -448,12 +454,15 @@ __global__ void gemm_half_q_half_gptq_3bit_kernel(
|
|||||||
const uint32_t* __restrict__ b_gptq_qzeros,
|
const uint32_t* __restrict__ b_gptq_qzeros,
|
||||||
const half* __restrict__ b_gptq_scales, half* __restrict__ c,
|
const half* __restrict__ b_gptq_scales, half* __restrict__ c,
|
||||||
const int size_m, const int size_n, const int size_k, const int groups,
|
const int size_m, const int size_n, const int size_k, const int groups,
|
||||||
const int* __restrict__ b_q_perm) {
|
const bool use_v2_format, const int* __restrict__ b_q_perm) {
|
||||||
MatrixView_half a_(a, size_m, size_k);
|
MatrixView_half a_(a, size_m, size_k);
|
||||||
MatrixView_half_rw c_(c, size_m, size_n);
|
MatrixView_half_rw c_(c, size_m, size_n);
|
||||||
MatrixView_q3_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
MatrixView_q3_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
||||||
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
auto t = threadIdx.x;
|
auto t = threadIdx.x;
|
||||||
|
|
||||||
// Block
|
// Block
|
||||||
@@ -534,13 +543,13 @@ __global__ void gemm_half_q_half_gptq_3bit_kernel(
|
|||||||
|
|
||||||
half2 dq[4][16];
|
half2 dq[4][16];
|
||||||
dequant_3bit_32(load_int4[0].x, load_int4[1].x, load_int4[2].x, dq[0],
|
dequant_3bit_32(load_int4[0].x, load_int4[1].x, load_int4[2].x, dq[0],
|
||||||
size_n, zeros[0] + 1);
|
size_n, zeros[0] + zero_offset);
|
||||||
dequant_3bit_32(load_int4[0].y, load_int4[1].y, load_int4[2].y, dq[1],
|
dequant_3bit_32(load_int4[0].y, load_int4[1].y, load_int4[2].y, dq[1],
|
||||||
size_n, zeros[1] + 1);
|
size_n, zeros[1] + zero_offset);
|
||||||
dequant_3bit_32(load_int4[0].z, load_int4[1].z, load_int4[2].z, dq[2],
|
dequant_3bit_32(load_int4[0].z, load_int4[1].z, load_int4[2].z, dq[2],
|
||||||
size_n, zeros[2] + 1);
|
size_n, zeros[2] + zero_offset);
|
||||||
dequant_3bit_32(load_int4[0].w, load_int4[1].w, load_int4[2].w, dq[3],
|
dequant_3bit_32(load_int4[0].w, load_int4[1].w, load_int4[2].w, dq[3],
|
||||||
size_n, zeros[3] + 1);
|
size_n, zeros[3] + zero_offset);
|
||||||
|
|
||||||
#pragma unroll
|
#pragma unroll
|
||||||
for (int m = 0; m < m_count; m++) {
|
for (int m = 0; m < m_count; m++) {
|
||||||
@@ -574,12 +583,15 @@ __global__ void gemm_half_q_half_gptq_8bit_kernel(
|
|||||||
const uint32_t* __restrict__ b_gptq_qzeros,
|
const uint32_t* __restrict__ b_gptq_qzeros,
|
||||||
const half* __restrict__ b_gptq_scales, half* __restrict__ c,
|
const half* __restrict__ b_gptq_scales, half* __restrict__ c,
|
||||||
const int size_m, const int size_n, const int size_k, const int groups,
|
const int size_m, const int size_n, const int size_k, const int groups,
|
||||||
const int* __restrict__ b_q_perm) {
|
const bool use_v2_format, const int* __restrict__ b_q_perm) {
|
||||||
MatrixView_half a_(a, size_m, size_k);
|
MatrixView_half a_(a, size_m, size_k);
|
||||||
MatrixView_half_rw c_(c, size_m, size_n);
|
MatrixView_half_rw c_(c, size_m, size_n);
|
||||||
MatrixView_q8_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
MatrixView_q8_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
||||||
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
auto t = threadIdx.x;
|
auto t = threadIdx.x;
|
||||||
|
|
||||||
// Block
|
// Block
|
||||||
@@ -658,13 +670,13 @@ __global__ void gemm_half_q_half_gptq_8bit_kernel(
|
|||||||
|
|
||||||
half2 dq[4][4];
|
half2 dq[4][4];
|
||||||
dequant_8bit_8(load_int4[0].x, load_int4[1].x, dq[0], size_n,
|
dequant_8bit_8(load_int4[0].x, load_int4[1].x, dq[0], size_n,
|
||||||
zeros[0] + 1);
|
zeros[0] + zero_offset);
|
||||||
dequant_8bit_8(load_int4[0].y, load_int4[1].y, dq[1], size_n,
|
dequant_8bit_8(load_int4[0].y, load_int4[1].y, dq[1], size_n,
|
||||||
zeros[1] + 1);
|
zeros[1] + zero_offset);
|
||||||
dequant_8bit_8(load_int4[0].z, load_int4[1].z, dq[2], size_n,
|
dequant_8bit_8(load_int4[0].z, load_int4[1].z, dq[2], size_n,
|
||||||
zeros[2] + 1);
|
zeros[2] + zero_offset);
|
||||||
dequant_8bit_8(load_int4[0].w, load_int4[1].w, dq[3], size_n,
|
dequant_8bit_8(load_int4[0].w, load_int4[1].w, dq[3], size_n,
|
||||||
zeros[3] + 1);
|
zeros[3] + zero_offset);
|
||||||
|
|
||||||
for (int m = 0; m < m_count; m++) {
|
for (int m = 0; m < m_count; m++) {
|
||||||
block_c[m][0] =
|
block_c[m][0] =
|
||||||
@@ -730,7 +742,8 @@ void gemm_half_q_half_cuda_part(const half* a, const uint32_t* b_q_weight,
|
|||||||
const uint32_t* b_gptq_qzeros,
|
const uint32_t* b_gptq_qzeros,
|
||||||
const half* b_gptq_scales, const int* b_q_perm,
|
const half* b_gptq_scales, const int* b_q_perm,
|
||||||
half* c, int size_m, int size_n, int size_k,
|
half* c, int size_m, int size_n, int size_k,
|
||||||
int m_count, int groups, int bit) {
|
int m_count, int groups, bool use_v2_format,
|
||||||
|
int bit) {
|
||||||
dim3 blockDim, gridDim;
|
dim3 blockDim, gridDim;
|
||||||
blockDim.x = BLOCK_KN_SIZE;
|
blockDim.x = BLOCK_KN_SIZE;
|
||||||
blockDim.y = 1;
|
blockDim.y = 1;
|
||||||
@@ -743,20 +756,23 @@ void gemm_half_q_half_cuda_part(const half* a, const uint32_t* b_q_weight,
|
|||||||
pick_gemm_half_q_half_gptq_kernel(true, m_count, bit);
|
pick_gemm_half_q_half_gptq_kernel(true, m_count, bit);
|
||||||
|
|
||||||
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
kernel<<<gridDim, blockDim, 0, stream>>>(a, b_q_weight, b_gptq_qzeros,
|
kernel<<<gridDim, blockDim, 0, stream>>>(
|
||||||
b_gptq_scales, c, size_m, size_n,
|
a, b_q_weight, b_gptq_qzeros, b_gptq_scales, c, size_m, size_n, size_k,
|
||||||
size_k, groups, b_q_perm);
|
groups, use_v2_format, b_q_perm);
|
||||||
}
|
}
|
||||||
|
|
||||||
__global__ void reconstruct_exllama_8bit_kernel(
|
__global__ void reconstruct_exllama_8bit_kernel(
|
||||||
const uint32_t* __restrict__ b_q_weight, const int* __restrict__ b_q_perm,
|
const uint32_t* __restrict__ b_q_weight, const int* __restrict__ b_q_perm,
|
||||||
const uint32_t* __restrict__ b_gptq_qzeros,
|
const uint32_t* __restrict__ b_gptq_qzeros,
|
||||||
const half* __restrict__ b_gptq_scales, const int size_k, const int size_n,
|
const half* __restrict__ b_gptq_scales, const int size_k, const int size_n,
|
||||||
const int groups, half* __restrict__ b) {
|
const int groups, const bool use_v2_format, half* __restrict__ b) {
|
||||||
MatrixView_half_rw b_(b, size_k, size_n);
|
MatrixView_half_rw b_(b, size_k, size_n);
|
||||||
MatrixView_q8_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
MatrixView_q8_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
||||||
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
auto offset_k = BLOCK_KN_SIZE * blockIdx.y;
|
auto offset_k = BLOCK_KN_SIZE * blockIdx.y;
|
||||||
auto offset_n = BLOCK_KN_SIZE * blockIdx.x * 4;
|
auto offset_n = BLOCK_KN_SIZE * blockIdx.x * 4;
|
||||||
|
|
||||||
@@ -812,13 +828,13 @@ __global__ void reconstruct_exllama_8bit_kernel(
|
|||||||
|
|
||||||
half2 dq[4][4];
|
half2 dq[4][4];
|
||||||
dequant_8bit_8(load_int4[0].x, load_int4[1].x, dq[0], size_n,
|
dequant_8bit_8(load_int4[0].x, load_int4[1].x, dq[0], size_n,
|
||||||
zeros[0] + 1);
|
zeros[0] + zero_offset);
|
||||||
dequant_8bit_8(load_int4[0].y, load_int4[1].y, dq[1], size_n,
|
dequant_8bit_8(load_int4[0].y, load_int4[1].y, dq[1], size_n,
|
||||||
zeros[1] + 1);
|
zeros[1] + zero_offset);
|
||||||
dequant_8bit_8(load_int4[0].z, load_int4[1].z, dq[2], size_n,
|
dequant_8bit_8(load_int4[0].z, load_int4[1].z, dq[2], size_n,
|
||||||
zeros[2] + 1);
|
zeros[2] + zero_offset);
|
||||||
dequant_8bit_8(load_int4[0].w, load_int4[1].w, dq[3], size_n,
|
dequant_8bit_8(load_int4[0].w, load_int4[1].w, dq[3], size_n,
|
||||||
zeros[3] + 1);
|
zeros[3] + zero_offset);
|
||||||
|
|
||||||
// half* dqh = (half*)dq;
|
// half* dqh = (half*)dq;
|
||||||
if (b_q_perm) {
|
if (b_q_perm) {
|
||||||
@@ -849,11 +865,14 @@ __global__ void reconstruct_exllama_4bit_kernel(
|
|||||||
const uint32_t* __restrict__ b_q_weight, const int* __restrict__ b_q_perm,
|
const uint32_t* __restrict__ b_q_weight, const int* __restrict__ b_q_perm,
|
||||||
const uint32_t* __restrict__ b_gptq_qzeros,
|
const uint32_t* __restrict__ b_gptq_qzeros,
|
||||||
const half* __restrict__ b_gptq_scales, const int size_k, const int size_n,
|
const half* __restrict__ b_gptq_scales, const int size_k, const int size_n,
|
||||||
const int groups, half* __restrict__ b) {
|
const int groups, const bool use_v2_format, half* __restrict__ b) {
|
||||||
MatrixView_half_rw b_(b, size_k, size_n);
|
MatrixView_half_rw b_(b, size_k, size_n);
|
||||||
MatrixView_q4_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
MatrixView_q4_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
||||||
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
auto offset_k = BLOCK_KN_SIZE * blockIdx.y;
|
auto offset_k = BLOCK_KN_SIZE * blockIdx.y;
|
||||||
auto offset_n = BLOCK_KN_SIZE * blockIdx.x * 4;
|
auto offset_n = BLOCK_KN_SIZE * blockIdx.x * 4;
|
||||||
|
|
||||||
@@ -888,10 +907,10 @@ __global__ void reconstruct_exllama_4bit_kernel(
|
|||||||
half2 y1y16[4][2];
|
half2 y1y16[4][2];
|
||||||
b_gptq_qzeros_.item4(zeros, group, n);
|
b_gptq_qzeros_.item4(zeros, group, n);
|
||||||
b_gptq_scales_.item4_h2(scales, group, n);
|
b_gptq_scales_.item4_h2(scales, group, n);
|
||||||
dequant_4bit_8_prep_zero(zeros[0] + 1, z1z16[0], y1y16[0]);
|
dequant_4bit_8_prep_zero(zeros[0] + zero_offset, z1z16[0], y1y16[0]);
|
||||||
dequant_4bit_8_prep_zero(zeros[1] + 1, z1z16[1], y1y16[1]);
|
dequant_4bit_8_prep_zero(zeros[1] + zero_offset, z1z16[1], y1y16[1]);
|
||||||
dequant_4bit_8_prep_zero(zeros[2] + 1, z1z16[2], y1y16[2]);
|
dequant_4bit_8_prep_zero(zeros[2] + zero_offset, z1z16[2], y1y16[2]);
|
||||||
dequant_4bit_8_prep_zero(zeros[3] + 1, z1z16[3], y1y16[3]);
|
dequant_4bit_8_prep_zero(zeros[3] + zero_offset, z1z16[3], y1y16[3]);
|
||||||
|
|
||||||
__syncthreads();
|
__syncthreads();
|
||||||
|
|
||||||
@@ -904,10 +923,10 @@ __global__ void reconstruct_exllama_4bit_kernel(
|
|||||||
nextgroup += groupsize;
|
nextgroup += groupsize;
|
||||||
b_gptq_qzeros_.item4(zeros, group, n);
|
b_gptq_qzeros_.item4(zeros, group, n);
|
||||||
b_gptq_scales_.item4_h2(scales, group, n);
|
b_gptq_scales_.item4_h2(scales, group, n);
|
||||||
dequant_4bit_8_prep_zero(zeros[0] + 1, z1z16[0], y1y16[0]);
|
dequant_4bit_8_prep_zero(zeros[0] + zero_offset, z1z16[0], y1y16[0]);
|
||||||
dequant_4bit_8_prep_zero(zeros[1] + 1, z1z16[1], y1y16[1]);
|
dequant_4bit_8_prep_zero(zeros[1] + zero_offset, z1z16[1], y1y16[1]);
|
||||||
dequant_4bit_8_prep_zero(zeros[2] + 1, z1z16[2], y1y16[2]);
|
dequant_4bit_8_prep_zero(zeros[2] + zero_offset, z1z16[2], y1y16[2]);
|
||||||
dequant_4bit_8_prep_zero(zeros[3] + 1, z1z16[3], y1y16[3]);
|
dequant_4bit_8_prep_zero(zeros[3] + zero_offset, z1z16[3], y1y16[3]);
|
||||||
}
|
}
|
||||||
|
|
||||||
for (int p = 0; p < 4; p++) {
|
for (int p = 0; p < 4; p++) {
|
||||||
@@ -954,11 +973,14 @@ __global__ void reconstruct_exllama_3bit_kernel(
|
|||||||
const uint32_t* __restrict__ b_q_weight, const int* __restrict__ b_q_perm,
|
const uint32_t* __restrict__ b_q_weight, const int* __restrict__ b_q_perm,
|
||||||
const uint32_t* __restrict__ b_gptq_qzeros,
|
const uint32_t* __restrict__ b_gptq_qzeros,
|
||||||
const half* __restrict__ b_gptq_scales, const int size_k, const int size_n,
|
const half* __restrict__ b_gptq_scales, const int size_k, const int size_n,
|
||||||
const int groups, half* __restrict__ b) {
|
const int groups, const bool use_v2_format, half* __restrict__ b) {
|
||||||
MatrixView_half_rw b_(b, size_k, size_n);
|
MatrixView_half_rw b_(b, size_k, size_n);
|
||||||
MatrixView_q3_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
MatrixView_q3_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
||||||
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
auto offset_k = BLOCK_KN_SIZE * blockIdx.y;
|
auto offset_k = BLOCK_KN_SIZE * blockIdx.y;
|
||||||
auto offset_n = BLOCK_KN_SIZE * blockIdx.x * 4;
|
auto offset_n = BLOCK_KN_SIZE * blockIdx.x * 4;
|
||||||
|
|
||||||
@@ -1016,13 +1038,13 @@ __global__ void reconstruct_exllama_3bit_kernel(
|
|||||||
|
|
||||||
half2 dq[4][16];
|
half2 dq[4][16];
|
||||||
dequant_3bit_32(load_int4[0].x, load_int4[1].x, load_int4[2].x, dq[0],
|
dequant_3bit_32(load_int4[0].x, load_int4[1].x, load_int4[2].x, dq[0],
|
||||||
size_n, zeros[0] + 1);
|
size_n, zeros[0] + zero_offset);
|
||||||
dequant_3bit_32(load_int4[0].y, load_int4[1].y, load_int4[2].y, dq[1],
|
dequant_3bit_32(load_int4[0].y, load_int4[1].y, load_int4[2].y, dq[1],
|
||||||
size_n, zeros[1] + 1);
|
size_n, zeros[1] + zero_offset);
|
||||||
dequant_3bit_32(load_int4[0].z, load_int4[1].z, load_int4[2].z, dq[2],
|
dequant_3bit_32(load_int4[0].z, load_int4[1].z, load_int4[2].z, dq[2],
|
||||||
size_n, zeros[2] + 1);
|
size_n, zeros[2] + zero_offset);
|
||||||
dequant_3bit_32(load_int4[0].w, load_int4[1].w, load_int4[2].w, dq[3],
|
dequant_3bit_32(load_int4[0].w, load_int4[1].w, load_int4[2].w, dq[3],
|
||||||
size_n, zeros[3] + 1);
|
size_n, zeros[3] + zero_offset);
|
||||||
|
|
||||||
if (b_q_perm) {
|
if (b_q_perm) {
|
||||||
for (int j = 0; j < 16; j++) {
|
for (int j = 0; j < 16; j++) {
|
||||||
@@ -1052,11 +1074,14 @@ __global__ void reconstruct_exllama_2bit_kernel(
|
|||||||
const uint32_t* __restrict__ b_q_weight, const int* __restrict__ b_q_perm,
|
const uint32_t* __restrict__ b_q_weight, const int* __restrict__ b_q_perm,
|
||||||
const uint32_t* __restrict__ b_gptq_qzeros,
|
const uint32_t* __restrict__ b_gptq_qzeros,
|
||||||
const half* __restrict__ b_gptq_scales, const int size_k, const int size_n,
|
const half* __restrict__ b_gptq_scales, const int size_k, const int size_n,
|
||||||
const int groups, half* __restrict__ b) {
|
const int groups, const bool use_v2_format, half* __restrict__ b) {
|
||||||
MatrixView_half_rw b_(b, size_k, size_n);
|
MatrixView_half_rw b_(b, size_k, size_n);
|
||||||
MatrixView_q2_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
MatrixView_q2_row b_gptq_qzeros_(b_gptq_qzeros, groups, size_n);
|
||||||
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
MatrixView_half b_gptq_scales_(b_gptq_scales, groups, size_n);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
auto offset_k = BLOCK_KN_SIZE * blockIdx.y;
|
auto offset_k = BLOCK_KN_SIZE * blockIdx.y;
|
||||||
auto offset_n = BLOCK_KN_SIZE * blockIdx.x * 4;
|
auto offset_n = BLOCK_KN_SIZE * blockIdx.x * 4;
|
||||||
|
|
||||||
@@ -1108,10 +1133,10 @@ __global__ void reconstruct_exllama_2bit_kernel(
|
|||||||
int4 load_int4 = *b_ptr4;
|
int4 load_int4 = *b_ptr4;
|
||||||
|
|
||||||
half2 dq[4][8];
|
half2 dq[4][8];
|
||||||
dequant_2bit_16(load_int4.x, dq[0], size_n, zeros[0] + 1);
|
dequant_2bit_16(load_int4.x, dq[0], size_n, zeros[0] + zero_offset);
|
||||||
dequant_2bit_16(load_int4.y, dq[1], size_n, zeros[1] + 1);
|
dequant_2bit_16(load_int4.y, dq[1], size_n, zeros[1] + zero_offset);
|
||||||
dequant_2bit_16(load_int4.z, dq[2], size_n, zeros[2] + 1);
|
dequant_2bit_16(load_int4.z, dq[2], size_n, zeros[2] + zero_offset);
|
||||||
dequant_2bit_16(load_int4.w, dq[3], size_n, zeros[3] + 1);
|
dequant_2bit_16(load_int4.w, dq[3], size_n, zeros[3] + zero_offset);
|
||||||
|
|
||||||
b_ptr += size_n;
|
b_ptr += size_n;
|
||||||
// half* dqh = (half*)dq;
|
// half* dqh = (half*)dq;
|
||||||
@@ -1143,7 +1168,7 @@ void reconstruct_exllama(const uint32_t* b_q_weight,
|
|||||||
const uint32_t* b_gptq_qzeros,
|
const uint32_t* b_gptq_qzeros,
|
||||||
const half* b_gptq_scales, const int* b_q_perm,
|
const half* b_gptq_scales, const int* b_q_perm,
|
||||||
half* out, int height, int width, int groups,
|
half* out, int height, int width, int groups,
|
||||||
int bit) {
|
bool use_v2_format, int bit) {
|
||||||
dim3 blockDim, gridDim;
|
dim3 blockDim, gridDim;
|
||||||
blockDim.x = BLOCK_KN_SIZE;
|
blockDim.x = BLOCK_KN_SIZE;
|
||||||
blockDim.y = 1;
|
blockDim.y = 1;
|
||||||
@@ -1162,14 +1187,14 @@ void reconstruct_exllama(const uint32_t* b_q_weight,
|
|||||||
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
reconstruct_exllama_kernel<<<gridDim, blockDim, 0, stream>>>(
|
reconstruct_exllama_kernel<<<gridDim, blockDim, 0, stream>>>(
|
||||||
b_q_weight, b_q_perm, b_gptq_qzeros, b_gptq_scales, height, width, groups,
|
b_q_weight, b_q_perm, b_gptq_qzeros, b_gptq_scales, height, width, groups,
|
||||||
out);
|
use_v2_format, out);
|
||||||
}
|
}
|
||||||
|
|
||||||
__global__ void gemm_half_q_half_alt_4bit_kernel(
|
__global__ void gemm_half_q_half_alt_4bit_kernel(
|
||||||
const half2* __restrict__ vec, const uint32_t* __restrict__ mat,
|
const half2* __restrict__ vec, const uint32_t* __restrict__ mat,
|
||||||
half* __restrict__ mul, const half* __restrict__ scales,
|
half* __restrict__ mul, const half* __restrict__ scales,
|
||||||
const uint32_t* __restrict__ zeros, const int* __restrict__ g_idx,
|
const uint32_t* __restrict__ zeros, const int* __restrict__ g_idx,
|
||||||
int batch, int height, int width) {
|
int batch, int height, int width, bool use_v2_format) {
|
||||||
int zero_width = width / 8;
|
int zero_width = width / 8;
|
||||||
int vec_height = height * 4;
|
int vec_height = height * 4;
|
||||||
const int blockwidth2 = BLOCK_KN_SIZE / 2;
|
const int blockwidth2 = BLOCK_KN_SIZE / 2;
|
||||||
@@ -1179,6 +1204,9 @@ __global__ void gemm_half_q_half_alt_4bit_kernel(
|
|||||||
int h_end = min(BLOCK_KN_SIZE / 8, height - h) * 4;
|
int h_end = min(BLOCK_KN_SIZE / 8, height - h) * 4;
|
||||||
auto w = BLOCK_KN_SIZE * blockIdx.x + threadIdx.x;
|
auto w = BLOCK_KN_SIZE * blockIdx.x + threadIdx.x;
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
__shared__ half2 blockvec[BLOCK_M_SIZE_MAX][blockwidth2];
|
__shared__ half2 blockvec[BLOCK_M_SIZE_MAX][blockwidth2];
|
||||||
if (threadIdx.x < h_end) {
|
if (threadIdx.x < h_end) {
|
||||||
for (int m = 0; m < b_end; ++m) {
|
for (int m = 0; m < b_end; ++m) {
|
||||||
@@ -1223,10 +1251,11 @@ __global__ void gemm_half_q_half_alt_4bit_kernel(
|
|||||||
half2 zero = __halves2half2(
|
half2 zero = __halves2half2(
|
||||||
__hmul(scale_f,
|
__hmul(scale_f,
|
||||||
__int2half_rn(-((zeros[g * zero_width + z_w] >> z_mod) & 0xF) -
|
__int2half_rn(-((zeros[g * zero_width + z_w] >> z_mod) & 0xF) -
|
||||||
1)),
|
zero_offset)),
|
||||||
__hmul(scale_f2,
|
__hmul(
|
||||||
__int2half_rn(
|
scale_f2,
|
||||||
-((zeros[g2 * zero_width + z_w] >> z_mod) & 0xF) - 1)));
|
__int2half_rn(-((zeros[g2 * zero_width + z_w] >> z_mod) & 0xF) -
|
||||||
|
zero_offset)));
|
||||||
scales_tmp[tmp_k] = scale;
|
scales_tmp[tmp_k] = scale;
|
||||||
zeros_tmp[tmp_k] = zero;
|
zeros_tmp[tmp_k] = zero;
|
||||||
}
|
}
|
||||||
@@ -1268,7 +1297,7 @@ __global__ void gemm_half_q_half_alt_8bit_kernel(
|
|||||||
const half2* __restrict__ vec, const uint32_t* __restrict__ mat,
|
const half2* __restrict__ vec, const uint32_t* __restrict__ mat,
|
||||||
half* __restrict__ mul, const half* __restrict__ scales,
|
half* __restrict__ mul, const half* __restrict__ scales,
|
||||||
const uint32_t* __restrict__ zeros, const int* __restrict__ g_idx,
|
const uint32_t* __restrict__ zeros, const int* __restrict__ g_idx,
|
||||||
int batch, int height, int width) {
|
int batch, int height, int width, bool use_v2_format) {
|
||||||
int zero_width = width / 4;
|
int zero_width = width / 4;
|
||||||
int vec_height = height * 2;
|
int vec_height = height * 2;
|
||||||
const int blockwidth2 = BLOCK_KN_SIZE / 2;
|
const int blockwidth2 = BLOCK_KN_SIZE / 2;
|
||||||
@@ -1278,6 +1307,9 @@ __global__ void gemm_half_q_half_alt_8bit_kernel(
|
|||||||
int h_end = min(BLOCK_KN_SIZE / 4, height - h) * 2;
|
int h_end = min(BLOCK_KN_SIZE / 4, height - h) * 2;
|
||||||
auto w = BLOCK_KN_SIZE * blockIdx.x + threadIdx.x;
|
auto w = BLOCK_KN_SIZE * blockIdx.x + threadIdx.x;
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handle zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
__shared__ half2 blockvec[BLOCK_M_SIZE_MAX][blockwidth2];
|
__shared__ half2 blockvec[BLOCK_M_SIZE_MAX][blockwidth2];
|
||||||
if (threadIdx.x < h_end) {
|
if (threadIdx.x < h_end) {
|
||||||
for (int m = 0; m < b_end; ++m) {
|
for (int m = 0; m < b_end; ++m) {
|
||||||
@@ -1312,12 +1344,13 @@ __global__ void gemm_half_q_half_alt_8bit_kernel(
|
|||||||
half scale_f2 = scales[g2 * width + w];
|
half scale_f2 = scales[g2 * width + w];
|
||||||
half2 scale = __halves2half2(scale_f, scale_f2);
|
half2 scale = __halves2half2(scale_f, scale_f2);
|
||||||
half2 zero = __halves2half2(
|
half2 zero = __halves2half2(
|
||||||
__hmul(scale_f,
|
__hmul(scale_f, __int2half_rn(
|
||||||
__int2half_rn(
|
-((zeros[g * zero_width + z_w] >> z_mod) & 0xff) -
|
||||||
-((zeros[g * zero_width + z_w] >> z_mod) & 0xff) - 1)),
|
zero_offset)),
|
||||||
__hmul(scale_f2,
|
__hmul(
|
||||||
__int2half_rn(
|
scale_f2,
|
||||||
-((zeros[g2 * zero_width + z_w] >> z_mod) & 0xff) - 1)));
|
__int2half_rn(-((zeros[g2 * zero_width + z_w] >> z_mod) & 0xff) -
|
||||||
|
zero_offset)));
|
||||||
scales_tmp[tmp_k] = scale;
|
scales_tmp[tmp_k] = scale;
|
||||||
zeros_tmp[tmp_k] = zero;
|
zeros_tmp[tmp_k] = zero;
|
||||||
}
|
}
|
||||||
@@ -1355,7 +1388,7 @@ void gemm_half_q_half_alt(const half* a, const uint32_t* b_q_weight,
|
|||||||
const uint32_t* b_gptq_qzeros,
|
const uint32_t* b_gptq_qzeros,
|
||||||
const half* b_gptq_scales, const int* b_g_idx,
|
const half* b_gptq_scales, const int* b_g_idx,
|
||||||
half* c, int size_m, int size_n, int size_k,
|
half* c, int size_m, int size_n, int size_k,
|
||||||
int bit) {
|
bool use_v2_format, int bit) {
|
||||||
dim3 blockDim, gridDim;
|
dim3 blockDim, gridDim;
|
||||||
blockDim.x = BLOCK_KN_SIZE;
|
blockDim.x = BLOCK_KN_SIZE;
|
||||||
blockDim.y = 1;
|
blockDim.y = 1;
|
||||||
@@ -1372,17 +1405,15 @@ void gemm_half_q_half_alt(const half* a, const uint32_t* b_q_weight,
|
|||||||
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
kernel<<<gridDim, blockDim, 0, stream>>>(
|
kernel<<<gridDim, blockDim, 0, stream>>>(
|
||||||
(const half2*)a, b_q_weight, c, b_gptq_scales, b_gptq_qzeros, b_g_idx,
|
(const half2*)a, b_q_weight, c, b_gptq_scales, b_gptq_qzeros, b_g_idx,
|
||||||
size_m, size_k / 32 * bit, size_n);
|
size_m, size_k / 32 * bit, size_n, use_v2_format);
|
||||||
}
|
}
|
||||||
|
|
||||||
template <class T, int bit>
|
template <class T, int bit>
|
||||||
__global__ void reconstruct_gptq_kernel(const uint32_t* __restrict__ w,
|
__global__ void reconstruct_gptq_kernel(
|
||||||
const half* __restrict__ w_scales,
|
const uint32_t* __restrict__ w, const half* __restrict__ w_scales,
|
||||||
const uint32_t* __restrict__ w_zeros,
|
const uint32_t* __restrict__ w_zeros, const int* __restrict__ g_idx,
|
||||||
const int* __restrict__ g_idx,
|
const int height, const int width, const int group,
|
||||||
const int height, const int width,
|
const bool use_v2_format, half* __restrict__ out) {
|
||||||
const int group,
|
|
||||||
half* __restrict__ out) {
|
|
||||||
// Start of block
|
// Start of block
|
||||||
|
|
||||||
auto column = BLOCK_KN_SIZE * blockIdx.x + threadIdx.x;
|
auto column = BLOCK_KN_SIZE * blockIdx.x + threadIdx.x;
|
||||||
@@ -1395,6 +1426,9 @@ __global__ void reconstruct_gptq_kernel(const uint32_t* __restrict__ w,
|
|||||||
MatrixView_half w_scales_(w_scales, group, width);
|
MatrixView_half w_scales_(w_scales, group, width);
|
||||||
T w_zeros_(w_zeros, group, width);
|
T w_zeros_(w_zeros, group, width);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handles zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
uint32_t w_read = w[blockIdx.y * width + column];
|
uint32_t w_read = w[blockIdx.y * width + column];
|
||||||
half* out_ptr = out_.item_ptr(row, column);
|
half* out_ptr = out_.item_ptr(row, column);
|
||||||
|
|
||||||
@@ -1402,7 +1436,7 @@ __global__ void reconstruct_gptq_kernel(const uint32_t* __restrict__ w,
|
|||||||
for (int s = 0; s < 32; s += bit) {
|
for (int s = 0; s < 32; s += bit) {
|
||||||
int group = g_idx[row + s / bit];
|
int group = g_idx[row + s / bit];
|
||||||
half w_scale = w_scales_.item(group, column);
|
half w_scale = w_scales_.item(group, column);
|
||||||
uint32_t w_zero = w_zeros_.item(group, column) + 1;
|
uint32_t w_zero = w_zeros_.item(group, column) + zero_offset;
|
||||||
half w_item =
|
half w_item =
|
||||||
__hmul(__int2half_rn((int)((w_read >> s) & ((1 << bit) - 1)) - w_zero),
|
__hmul(__int2half_rn((int)((w_read >> s) & ((1 << bit) - 1)) - w_zero),
|
||||||
w_scale);
|
w_scale);
|
||||||
@@ -1415,7 +1449,7 @@ __global__ void reconstruct_gptq_3bit_kernel(
|
|||||||
const uint32_t* __restrict__ w, const half* __restrict__ w_scales,
|
const uint32_t* __restrict__ w, const half* __restrict__ w_scales,
|
||||||
const uint32_t* __restrict__ w_zeros, const int* __restrict__ g_idx,
|
const uint32_t* __restrict__ w_zeros, const int* __restrict__ g_idx,
|
||||||
const int height, const int width, const int group,
|
const int height, const int width, const int group,
|
||||||
half* __restrict__ out) {
|
const bool use_v2_format, half* __restrict__ out) {
|
||||||
// Start of block
|
// Start of block
|
||||||
auto column = BLOCK_KN_SIZE * blockIdx.x + threadIdx.x;
|
auto column = BLOCK_KN_SIZE * blockIdx.x + threadIdx.x;
|
||||||
auto row = blockIdx.y * 32;
|
auto row = blockIdx.y * 32;
|
||||||
@@ -1427,6 +1461,9 @@ __global__ void reconstruct_gptq_3bit_kernel(
|
|||||||
MatrixView_half w_scales_(w_scales, group, width);
|
MatrixView_half w_scales_(w_scales, group, width);
|
||||||
MatrixView_q3_row w_zeros_(w_zeros, group, width);
|
MatrixView_q3_row w_zeros_(w_zeros, group, width);
|
||||||
|
|
||||||
|
// GPTQv2 and GPTQv1 handles zero points differently
|
||||||
|
int zero_offset = use_v2_format ? 0 : 1;
|
||||||
|
|
||||||
uint32_t w1 = w[(blockIdx.y * 3) * width + column];
|
uint32_t w1 = w[(blockIdx.y * 3) * width + column];
|
||||||
uint32_t w2 = w[(blockIdx.y * 3 + 1) * width + column];
|
uint32_t w2 = w[(blockIdx.y * 3 + 1) * width + column];
|
||||||
uint32_t w3 = w[(blockIdx.y * 3 + 2) * width + column];
|
uint32_t w3 = w[(blockIdx.y * 3 + 2) * width + column];
|
||||||
@@ -1436,7 +1473,7 @@ __global__ void reconstruct_gptq_3bit_kernel(
|
|||||||
for (int i = 0; i < 32; i += 1) {
|
for (int i = 0; i < 32; i += 1) {
|
||||||
int group = g_idx[row + i];
|
int group = g_idx[row + i];
|
||||||
half w_scale = w_scales_.item(group, column);
|
half w_scale = w_scales_.item(group, column);
|
||||||
uint32_t w_zero = w_zeros_.item(group, column) + 1;
|
uint32_t w_zero = w_zeros_.item(group, column) + zero_offset;
|
||||||
int w_item;
|
int w_item;
|
||||||
if (i == 10) {
|
if (i == 10) {
|
||||||
w_item = (w1 >> 30) | ((w2 << 2) & 0x4);
|
w_item = (w1 >> 30) | ((w2 << 2) & 0x4);
|
||||||
@@ -1456,7 +1493,8 @@ __global__ void reconstruct_gptq_3bit_kernel(
|
|||||||
|
|
||||||
void reconstruct_gptq(const uint32_t* b_q_weight, const uint32_t* b_gptq_qzeros,
|
void reconstruct_gptq(const uint32_t* b_q_weight, const uint32_t* b_gptq_qzeros,
|
||||||
const half* b_gptq_scales, const int* b_g_idx, half* out,
|
const half* b_gptq_scales, const int* b_g_idx, half* out,
|
||||||
int height, int width, int groups, int bit) {
|
int height, int width, int groups, bool use_v2_format,
|
||||||
|
int bit) {
|
||||||
dim3 blockDim, gridDim;
|
dim3 blockDim, gridDim;
|
||||||
blockDim.x = BLOCK_KN_SIZE;
|
blockDim.x = BLOCK_KN_SIZE;
|
||||||
blockDim.y = 1;
|
blockDim.y = 1;
|
||||||
@@ -1476,7 +1514,7 @@ void reconstruct_gptq(const uint32_t* b_q_weight, const uint32_t* b_gptq_qzeros,
|
|||||||
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
kernel<<<gridDim, blockDim, 0, stream>>>(b_q_weight, b_gptq_scales,
|
kernel<<<gridDim, blockDim, 0, stream>>>(b_q_weight, b_gptq_scales,
|
||||||
b_gptq_qzeros, b_g_idx, height,
|
b_gptq_qzeros, b_g_idx, height,
|
||||||
width, groups, out);
|
width, groups, use_v2_format, out);
|
||||||
}
|
}
|
||||||
|
|
||||||
void gemm_half_q_half_cuda(cublasHandle_t cublas_handle, const half* a,
|
void gemm_half_q_half_cuda(cublasHandle_t cublas_handle, const half* a,
|
||||||
@@ -1484,7 +1522,8 @@ void gemm_half_q_half_cuda(cublasHandle_t cublas_handle, const half* a,
|
|||||||
const uint32_t* b_gptq_qzeros,
|
const uint32_t* b_gptq_qzeros,
|
||||||
const half* b_gptq_scales, const int* b_g_idx,
|
const half* b_gptq_scales, const int* b_g_idx,
|
||||||
half* c, half* temp_dq, int size_m, int size_n,
|
half* c, half* temp_dq, int size_m, int size_n,
|
||||||
int size_k, int groups, bool use_exllama, int bit) {
|
int size_k, int groups, bool use_exllama,
|
||||||
|
bool use_v2_format, int bit) {
|
||||||
bool use_reconstruct;
|
bool use_reconstruct;
|
||||||
if (use_exllama) {
|
if (use_exllama) {
|
||||||
use_reconstruct = ((bit == 8 && size_m > MAX_Q_GEMM_ROWS_8BIT) ||
|
use_reconstruct = ((bit == 8 && size_m > MAX_Q_GEMM_ROWS_8BIT) ||
|
||||||
@@ -1498,10 +1537,10 @@ void gemm_half_q_half_cuda(cublasHandle_t cublas_handle, const half* a,
|
|||||||
// Reconstruct FP16 matrix, then cuBLAS
|
// Reconstruct FP16 matrix, then cuBLAS
|
||||||
if (use_exllama) {
|
if (use_exllama) {
|
||||||
reconstruct_exllama(b_q_weight, b_gptq_qzeros, b_gptq_scales, b_g_idx,
|
reconstruct_exllama(b_q_weight, b_gptq_qzeros, b_gptq_scales, b_g_idx,
|
||||||
temp_dq, size_k, size_n, groups, bit);
|
temp_dq, size_k, size_n, groups, use_v2_format, bit);
|
||||||
} else {
|
} else {
|
||||||
reconstruct_gptq(b_q_weight, b_gptq_qzeros, b_gptq_scales, b_g_idx,
|
reconstruct_gptq(b_q_weight, b_gptq_qzeros, b_gptq_scales, b_g_idx,
|
||||||
temp_dq, size_k, size_n, groups, bit);
|
temp_dq, size_k, size_n, groups, use_v2_format, bit);
|
||||||
}
|
}
|
||||||
|
|
||||||
const half alpha = __float2half(1.0f);
|
const half alpha = __float2half(1.0f);
|
||||||
@@ -1517,18 +1556,18 @@ void gemm_half_q_half_cuda(cublasHandle_t cublas_handle, const half* a,
|
|||||||
if (max_chunks) {
|
if (max_chunks) {
|
||||||
gemm_half_q_half_cuda_part(a, b_q_weight, b_gptq_qzeros, b_gptq_scales,
|
gemm_half_q_half_cuda_part(a, b_q_weight, b_gptq_qzeros, b_gptq_scales,
|
||||||
b_g_idx, c, last_chunk, size_n, size_k,
|
b_g_idx, c, last_chunk, size_n, size_k,
|
||||||
BLOCK_M_SIZE_MAX, groups, bit);
|
BLOCK_M_SIZE_MAX, groups, use_v2_format, bit);
|
||||||
}
|
}
|
||||||
|
|
||||||
if (last_chunk_size) {
|
if (last_chunk_size) {
|
||||||
gemm_half_q_half_cuda_part(a + last_chunk * size_k, b_q_weight,
|
gemm_half_q_half_cuda_part(
|
||||||
b_gptq_qzeros, b_gptq_scales, b_g_idx,
|
a + last_chunk * size_k, b_q_weight, b_gptq_qzeros, b_gptq_scales,
|
||||||
c + last_chunk * size_n, last_chunk_size,
|
b_g_idx, c + last_chunk * size_n, last_chunk_size, size_n, size_k,
|
||||||
size_n, size_k, last_chunk_size, groups, bit);
|
last_chunk_size, groups, use_v2_format, bit);
|
||||||
}
|
}
|
||||||
} else {
|
} else {
|
||||||
gemm_half_q_half_alt(a, b_q_weight, b_gptq_qzeros, b_gptq_scales, b_g_idx,
|
gemm_half_q_half_alt(a, b_q_weight, b_gptq_qzeros, b_gptq_scales, b_g_idx,
|
||||||
c, size_m, size_n, size_k, bit);
|
c, size_m, size_n, size_k, use_v2_format, bit);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -1815,7 +1854,7 @@ void shuffle_exllama_weight(uint32_t* q_weight, int* q_perm, int height,
|
|||||||
torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
|
torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
|
||||||
torch::Tensor b_gptq_qzeros,
|
torch::Tensor b_gptq_qzeros,
|
||||||
torch::Tensor b_gptq_scales, torch::Tensor b_g_idx,
|
torch::Tensor b_gptq_scales, torch::Tensor b_g_idx,
|
||||||
bool use_exllama, int64_t bit) {
|
bool use_exllama, bool use_v2_format, int64_t bit) {
|
||||||
const at::cuda::OptionalCUDAGuard device_guard(device_of(a));
|
const at::cuda::OptionalCUDAGuard device_guard(device_of(a));
|
||||||
auto options = torch::TensorOptions().dtype(a.dtype()).device(a.device());
|
auto options = torch::TensorOptions().dtype(a.dtype()).device(a.device());
|
||||||
at::Tensor c = torch::empty({a.size(0), b_q_weight.size(1)}, options);
|
at::Tensor c = torch::empty({a.size(0), b_q_weight.size(1)}, options);
|
||||||
@@ -1833,7 +1872,7 @@ torch::Tensor gptq_gemm(torch::Tensor a, torch::Tensor b_q_weight,
|
|||||||
c.size(1), // n
|
c.size(1), // n
|
||||||
a.size(1), // k
|
a.size(1), // k
|
||||||
b_gptq_qzeros.size(0), // group number
|
b_gptq_qzeros.size(0), // group number
|
||||||
use_exllama, bit);
|
use_exllama, use_v2_format, bit);
|
||||||
return c;
|
return c;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
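The `use_v2_format` flag threaded through the GPTQ kernels above only changes the offset added to the stored zero point (`zero_offset = use_v2_format ? 0 : 1`). Below is a minimal host-side sketch of that dequantization step, assuming plain `float` in place of `half` so it runs on the CPU. The helper name `dequantize_one` and its parameter names are hypothetical and do not appear in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical CPU mirror of the per-element math in reconstruct_gptq_kernel.
// Not part of the diff; float replaces half for portability.
float dequantize_one(uint32_t packed, int shift, int bit, uint32_t stored_zero,
                     float scale, bool use_v2_format) {
  assert(bit > 0 && bit < 32);
  // Same selection as in the kernels: the v1 path adds 1 to the stored zero
  // point, the v2 path uses it unchanged.
  int zero_offset = use_v2_format ? 0 : 1;
  int w_zero = static_cast<int>(stored_zero) + zero_offset;
  // Extract one `bit`-wide quantized value from the packed 32-bit word.
  int q = static_cast<int>((packed >> shift) & ((1u << bit) - 1));
  return static_cast<float>(q - w_zero) * scale;
}
```

With a 4-bit value of 5, a stored zero point of 7 and a scale of 0.5, the v1 path gives (5 - 8) * 0.5 and the v2 path gives (5 - 7) * 0.5. Loading a checkpoint with the wrong flag therefore shifts every weight by one scale step.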
107
csrc/sampler.cu
@@ -54,15 +54,10 @@ static inline __device__ uint16_t extractBinIdx(float x) {
|
|||||||
return 511 - (tmp.u16 >> 7);
|
return 511 - (tmp.u16 >> 7);
|
||||||
}
|
}
|
||||||
|
|
||||||
template <int kNumThreadsPerBlock = 512>
|
template <int kNumThreadsPerBlock = 512, int kNumBins = 512, int kTopK = 2048>
|
||||||
static __global__ void topKPerRow(const float* logits, const int* rowStarts,
|
__device__ void topKPerRowJob(const float* logits, const int rowStart,
|
||||||
const int* rowEnds, int* outIndices,
|
const int rowEnd, const int rowIdx,
|
||||||
float* outLogits, int stride0, int stride1) {
|
int* outIndices, int stride0, int stride1) {
|
||||||
// The number of bins in the histogram.
|
|
||||||
static constexpr int kNumBins = 512;
|
|
||||||
|
|
||||||
// The top-k width.
|
|
||||||
static constexpr int kTopK = 2048;
|
|
||||||
// The number of elements per thread for the final top-k sort.
|
// The number of elements per thread for the final top-k sort.
|
||||||
static constexpr int kNumTopKItemsPerThread = kTopK / kNumThreadsPerBlock;
|
static constexpr int kNumTopKItemsPerThread = kTopK / kNumThreadsPerBlock;
|
||||||
// The class to sort the elements during the final top-k sort.
|
// The class to sort the elements during the final top-k sort.
|
||||||
@@ -103,17 +98,11 @@ static __global__ void topKPerRow(const float* logits, const int* rowStarts,
|
|||||||
__shared__ int smemHistogram[kNumBins];
|
__shared__ int smemHistogram[kNumBins];
|
||||||
// Shared memory to store the selected indices.
|
// Shared memory to store the selected indices.
|
||||||
__shared__ int smemIndices[kTopK];
|
__shared__ int smemIndices[kTopK];
|
||||||
// Shared memory to store the selected logits.
|
|
||||||
__shared__ float smemLogits[kTopK];
|
|
||||||
// Shared memory to store the threshold bin.
|
// Shared memory to store the threshold bin.
|
||||||
__shared__ int smemThresholdBinIdx[1];
|
__shared__ int smemThresholdBinIdx[1];
|
||||||
// Shared memory counter to register the candidates for the final phase.
|
// Shared memory counter to register the candidates for the final phase.
|
||||||
__shared__ int smemFinalDstIdx[1];
|
__shared__ int smemFinalDstIdx[1];
|
||||||
|
|
||||||
// The row computed by this block.
|
|
||||||
int rowIdx = blockIdx.x;
|
|
||||||
// The range of logits within the row.
|
|
||||||
int rowStart = rowStarts[rowIdx], rowEnd = rowEnds[rowIdx];
|
|
||||||
// The length of the row.
|
// The length of the row.
|
||||||
int rowLen = rowEnd - rowStart;
|
int rowLen = rowEnd - rowStart;
|
||||||
|
|
||||||
@@ -124,13 +113,10 @@ static __global__ void topKPerRow(const float* logits, const int* rowStarts,
|
|||||||
rowIt += kNumThreadsPerBlock) {
|
rowIt += kNumThreadsPerBlock) {
|
||||||
int idx = rowStart + rowIt;
|
int idx = rowStart + rowIt;
|
||||||
outIndices[rowIdx * kTopK + rowIt] = idx - rowStart;
|
outIndices[rowIdx * kTopK + rowIt] = idx - rowStart;
|
||||||
outLogits[rowIdx * kTopK + rowIt] =
|
|
||||||
logits[rowIdx * stride0 + idx * stride1];
|
|
||||||
}
|
}
|
||||||
for (int rowIt = rowLen + threadIdx.x; rowIt < kTopK;
|
for (int rowIt = rowLen + threadIdx.x; rowIt < kTopK;
|
||||||
rowIt += kNumThreadsPerBlock) {
|
rowIt += kNumThreadsPerBlock) {
|
||||||
outIndices[rowIdx * kTopK + rowIt] = -1;
|
outIndices[rowIdx * kTopK + rowIt] = -1;
|
||||||
outLogits[rowIdx * kTopK + rowIt] = -FLT_MAX;
|
|
||||||
}
|
}
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
@@ -201,7 +187,6 @@ static __global__ void topKPerRow(const float* logits, const int* rowStarts,
|
|||||||
uint16_t idx = extractBinIdx(logit);
|
uint16_t idx = extractBinIdx(logit);
|
||||||
if (idx < thresholdBinIdx) {
|
if (idx < thresholdBinIdx) {
|
||||||
int dstIdx = atomicAdd(&smemHistogram[idx], 1);
|
int dstIdx = atomicAdd(&smemHistogram[idx], 1);
|
||||||
smemLogits[dstIdx] = logit;
|
|
||||||
smemIndices[dstIdx] = rowIt;
|
smemIndices[dstIdx] = rowIt;
|
||||||
} else if (idx == thresholdBinIdx) {
|
} else if (idx == thresholdBinIdx) {
|
||||||
int dstIdx = atomicAdd(&smemFinalDstIdx[0], 1);
|
int dstIdx = atomicAdd(&smemFinalDstIdx[0], 1);
|
||||||
@@ -250,7 +235,6 @@ static __global__ void topKPerRow(const float* logits, const int* rowStarts,
|
|||||||
int srcIdx = ii * kNumThreadsPerBlock + threadIdx.x;
|
int srcIdx = ii * kNumThreadsPerBlock + threadIdx.x;
|
||||||
int dstIdx = baseIdx + srcIdx;
|
int dstIdx = baseIdx + srcIdx;
|
||||||
if (dstIdx < kTopK) {
|
if (dstIdx < kTopK) {
|
||||||
smemLogits[dstIdx] = finalLogits[ii];
|
|
||||||
smemIndices[dstIdx] = finalIndices[ii];
|
smemIndices[dstIdx] = finalIndices[ii];
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -258,31 +242,58 @@ static __global__ void topKPerRow(const float* logits, const int* rowStarts,
|
|||||||
// Make sure the data is in shared memory.
|
// Make sure the data is in shared memory.
|
||||||
__syncthreads();
|
__syncthreads();
|
||||||
|
|
||||||
// The topK logits.
|
|
||||||
float topKLogits[kNumTopKItemsPerThread];
|
|
||||||
// The topK indices.
|
|
||||||
int topKIndices[kNumTopKItemsPerThread];
|
|
||||||
|
|
||||||
// Load from shared memory.
|
|
||||||
#pragma unroll
|
|
||||||
for (int ii = 0; ii < kNumTopKItemsPerThread; ++ii) {
|
|
||||||
topKLogits[ii] = smemLogits[ii * kNumThreadsPerBlock + threadIdx.x];
|
|
||||||
topKIndices[ii] = smemIndices[ii * kNumThreadsPerBlock + threadIdx.x];
|
|
||||||
}
|
|
||||||
|
|
||||||
// Sort the elements.
|
|
||||||
TopKSort(smemFinal.topKSort)
|
|
||||||
.SortDescendingBlockedToStriped(topKLogits, topKIndices);
|
|
||||||
|
|
||||||
// Store to global memory.
|
// Store to global memory.
|
||||||
#pragma unroll
|
#pragma unroll
|
||||||
for (int ii = 0; ii < kNumTopKItemsPerThread; ++ii) {
|
for (int ii = 0; ii < kNumTopKItemsPerThread; ++ii) {
|
||||||
int offset = rowIdx * kTopK + ii * kNumThreadsPerBlock + threadIdx.x;
|
int offset = rowIdx * kTopK + ii * kNumThreadsPerBlock + threadIdx.x;
|
||||||
outIndices[offset] = topKIndices[ii] - rowStart;
|
outIndices[offset] =
|
||||||
outLogits[offset] = topKLogits[ii];
|
smemIndices[ii * kNumThreadsPerBlock + threadIdx.x] - rowStart;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
template <int kNumThreadsPerBlock = 512>
|
||||||
|
static __global__ void topKPerRow(const float* logits, const int* rowStarts,
|
||||||
|
const int* rowEnds, int* outIndices,
|
||||||
|
int stride0, int stride1) {
|
||||||
|
// The number of bins in the histogram.
|
||||||
|
static constexpr int kNumBins = 512;
|
||||||
|
|
||||||
|
// The top-k width.
|
||||||
|
static constexpr int kTopK = 2048;
|
||||||
|
|
||||||
|
// The row computed by this block.
|
||||||
|
int rowIdx = blockIdx.x;
|
||||||
|
|
||||||
|
// The range of logits within the row.
|
||||||
|
int rowStart = rowStarts[rowIdx];
|
||||||
|
int rowEnd = rowEnds[rowIdx];
|
||||||
|
|
||||||
|
topKPerRowJob<kNumThreadsPerBlock, kNumBins, kTopK>(
|
||||||
|
logits, rowStart, rowEnd, rowIdx, outIndices, stride0, stride1);
|
||||||
|
}
|
||||||
|
|
||||||
|
template <int kNumThreadsPerBlock = 512>
|
||||||
|
static __global__ void topKPerRowDecode(const float* logits, const int* seqLens,
|
||||||
|
int* outIndices, int stride0,
|
||||||
|
int stride1, int next_n) {
|
||||||
|
// The number of bins in the histogram.
|
||||||
|
static constexpr int kNumBins = 512;
|
||||||
|
|
||||||
|
// The top-k width.
|
||||||
|
static constexpr int kTopK = 2048;
|
||||||
|
|
||||||
|
// The row computed by this block.
|
||||||
|
int rowIdx = blockIdx.x;
|
||||||
|
|
||||||
|
// The range of logits within the row.
|
||||||
|
int rowStart = 0;
|
||||||
|
int seq_len = seqLens[rowIdx / next_n];
|
||||||
|
int rowEnd = seq_len - next_n + (rowIdx % next_n) + 1;
|
||||||
|
|
||||||
|
topKPerRowJob<kNumThreadsPerBlock, kNumBins, kTopK>(
|
||||||
|
logits, rowStart, rowEnd, rowIdx, outIndices, stride0, stride1);
|
||||||
|
}
|
||||||
|
|
||||||
} // namespace vllm
|
} // namespace vllm
|
||||||
|
|
||||||
void apply_repetition_penalties_(
|
void apply_repetition_penalties_(
|
||||||
@@ -326,10 +337,23 @@ void apply_repetition_penalties_(
|
|||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|
||||||
|
void top_k_per_row_decode(const torch::Tensor& logits, int64_t next_n,
|
||||||
|
const torch::Tensor& seqLens, torch::Tensor& indices,
|
||||||
|
int64_t numRows, int64_t stride0, int64_t stride1) {
|
||||||
|
// Compute the results on the device.
|
||||||
|
constexpr int kNumThreadsPerBlock = 512;
|
||||||
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
|
|
||||||
|
vllm::topKPerRowDecode<kNumThreadsPerBlock>
|
||||||
|
<<<numRows, kNumThreadsPerBlock, 0, stream>>>(
|
||||||
|
logits.data_ptr<float>(), seqLens.data_ptr<int>(),
|
||||||
|
indices.data_ptr<int>(), static_cast<int>(stride0),
|
||||||
|
static_cast<int>(stride1), static_cast<int>(next_n));
|
||||||
|
}
|
||||||
|
|
||||||
void top_k_per_row(const torch::Tensor& logits, const torch::Tensor& rowStarts,
|
void top_k_per_row(const torch::Tensor& logits, const torch::Tensor& rowStarts,
|
||||||
const torch::Tensor& rowEnds, torch::Tensor& indices,
|
const torch::Tensor& rowEnds, torch::Tensor& indices,
|
||||||
torch::Tensor& values, int64_t numRows, int64_t stride0,
|
int64_t numRows, int64_t stride0, int64_t stride1) {
|
||||||
int64_t stride1) {
|
|
||||||
// Compute the results on the device.
|
// Compute the results on the device.
|
||||||
constexpr int kNumThreadsPerBlock = 512;
|
constexpr int kNumThreadsPerBlock = 512;
|
||||||
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
|
||||||
@@ -338,6 +362,5 @@ void top_k_per_row(const torch::Tensor& logits, const torch::Tensor& rowStarts,
|
|||||||
<<<numRows, kNumThreadsPerBlock, 0, stream>>>(
|
<<<numRows, kNumThreadsPerBlock, 0, stream>>>(
|
||||||
logits.data_ptr<float>(), rowStarts.data_ptr<int>(),
|
logits.data_ptr<float>(), rowStarts.data_ptr<int>(),
|
||||||
rowEnds.data_ptr<int>(), indices.data_ptr<int>(),
|
rowEnds.data_ptr<int>(), indices.data_ptr<int>(),
|
||||||
values.data_ptr<float>(), static_cast<int>(stride0),
|
static_cast<int>(stride0), static_cast<int>(stride1));
|
||||||
static_cast<int>(stride1));
|
|
||||||
}
|
}
|
||||||
|
|||||||
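The new `topKPerRowDecode` kernel above derives each row's logit range from `seqLens` and `next_n` instead of reading explicit `rowStarts`/`rowEnds` arrays. Below is a minimal host-side sketch of that index arithmetic. The `RowRange` struct and `decode_row_range` function are hypothetical names introduced for illustration. Reading `next_n` as the number of consecutive rows per request is an assumption drawn from the `rowIdx / next_n` indexing.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container for the half-open range [start, end) of a row.
struct RowRange {
  int start;
  int end;
};

// Hypothetical CPU mirror of the range computed at the top of
// topKPerRowDecode; not part of the diff.
RowRange decode_row_range(const std::vector<int>& seqLens, int rowIdx,
                          int next_n) {
  assert(next_n > 0);
  // Rows are grouped in blocks of next_n, one block per entry in seqLens.
  int seq_len = seqLens[rowIdx / next_n];
  // Each successive row in a block ends one position later; the last row
  // of the block ends at seq_len.
  int rowEnd = seq_len - next_n + (rowIdx % next_n) + 1;
  return RowRange{0, rowEnd};
}
```

For `seqLens = {10, 7}` and `next_n = 2`, rows 0 and 1 cover `[0, 9)` and `[0, 10)`, and rows 2 and 3 cover `[0, 6)` and `[0, 7)`.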
@@ -175,12 +175,6 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
|
|||||||
"float epsilon) -> ()");
|
"float epsilon) -> ()");
|
||||||
ops.impl("fused_add_rms_norm", torch::kCUDA, &fused_add_rms_norm);
|
ops.impl("fused_add_rms_norm", torch::kCUDA, &fused_add_rms_norm);
|
||||||
|
|
||||||
// Polynomial Normalization.
|
|
||||||
ops.def(
|
|
||||||
"poly_norm(Tensor! out, Tensor input, Tensor weight, Tensor bias, float "
|
|
||||||
"epsilon) -> ()");
|
|
||||||
ops.impl("poly_norm", torch::kCUDA, &poly_norm);
|
|
||||||
|
|
||||||
// Apply repetition penalties to logits in-place
|
// Apply repetition penalties to logits in-place
|
||||||
ops.def(
|
ops.def(
|
||||||
"apply_repetition_penalties_(Tensor! logits, Tensor prompt_mask, "
|
"apply_repetition_penalties_(Tensor! logits, Tensor prompt_mask, "
|
||||||
@@ -191,10 +185,16 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
|
|||||||
// Optimized top-k per row operation
|
// Optimized top-k per row operation
|
||||||
ops.def(
|
ops.def(
|
||||||
"top_k_per_row(Tensor logits, Tensor rowStarts, Tensor rowEnds, "
|
"top_k_per_row(Tensor logits, Tensor rowStarts, Tensor rowEnds, "
|
||||||
"Tensor! indices, Tensor! values, int numRows, int stride0, "
|
"Tensor! indices, int numRows, int stride0, "
|
||||||
"int stride1) -> ()");
|
"int stride1) -> ()");
|
||||||
ops.impl("top_k_per_row", torch::kCUDA, &top_k_per_row);
|
ops.impl("top_k_per_row", torch::kCUDA, &top_k_per_row);
|
||||||
|
|
||||||
|
ops.def(
|
||||||
|
"top_k_per_row_decode(Tensor logits, int next_n, "
|
||||||
|
"Tensor seq_lens, Tensor! indices, int numRows, "
|
||||||
|
"int stride0, int stride1) -> ()");
|
||||||
|
ops.impl("top_k_per_row_decode", torch::kCUDA, &top_k_per_row_decode);
|
||||||
|
|
||||||
// Layernorm-quant
|
// Layernorm-quant
|
||||||
// Apply Root Mean Square (RMS) Normalization to the input tensor.
|
// Apply Root Mean Square (RMS) Normalization to the input tensor.
|
||||||
ops.def(
|
ops.def(
|
||||||
@@ -557,7 +557,8 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
|
|||||||
// to prevent the meta function registry.
|
// to prevent the meta function registry.
|
||||||
ops.def(
|
ops.def(
|
||||||
"gptq_gemm(Tensor a, Tensor b_q_weight, Tensor b_gptq_qzeros, "
|
"gptq_gemm(Tensor a, Tensor b_q_weight, Tensor b_gptq_qzeros, "
|
||||||
"Tensor b_gptq_scales, Tensor b_g_idx, bool use_exllama, int bit) "
|
"Tensor b_gptq_scales, Tensor b_g_idx, bool use_exllama, bool "
|
||||||
|
"use_v2_format, int bit) "
|
||||||
"-> Tensor",
|
"-> Tensor",
|
||||||
{stride_tag});
|
{stride_tag});
|
||||||
ops.impl("gptq_gemm", torch::kCUDA, &gptq_gemm);
|
ops.impl("gptq_gemm", torch::kCUDA, &gptq_gemm);
|
||||||
|
|||||||
@@ -5,7 +5,7 @@
|
|||||||
# docs/contributing/dockerfile/dockerfile.md and
|
# docs/contributing/dockerfile/dockerfile.md and
|
||||||
# docs/assets/contributing/dockerfile-stages-dependency.png
|
# docs/assets/contributing/dockerfile-stages-dependency.png
|
||||||
|
|
||||||
ARG CUDA_VERSION=12.8.1
|
ARG CUDA_VERSION=12.9.1
|
||||||
ARG PYTHON_VERSION=3.12
|
ARG PYTHON_VERSION=3.12
|
||||||
|
|
||||||
# By parameterizing the base images, we allow third-party to use their own
|
# By parameterizing the base images, we allow third-party to use their own
|
||||||
@@ -132,7 +132,9 @@ WORKDIR /workspace
|
|||||||
COPY requirements/common.txt requirements/common.txt
|
COPY requirements/common.txt requirements/common.txt
|
||||||
COPY requirements/cuda.txt requirements/cuda.txt
|
COPY requirements/cuda.txt requirements/cuda.txt
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
uv pip install --python /opt/venv/bin/python3 -r requirements/cuda.txt \
|
# TODO: remove apache-tvm-ffi once FlashInfer is fixed https://github.com/flashinfer-ai/flashinfer/issues/1962
|
||||||
|
uv pip install --python /opt/venv/bin/python3 --pre apache-tvm-ffi==0.1.0b15 \
|
||||||
|
&& uv pip install --python /opt/venv/bin/python3 -r requirements/cuda.txt \
|
||||||
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
|
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
|
||||||
|
|
||||||
# cuda arch list used by torch
|
# cuda arch list used by torch
|
||||||
@@ -273,6 +275,7 @@ WORKDIR /vllm-workspace
|
|||||||
ENV DEBIAN_FRONTEND=noninteractive
|
ENV DEBIAN_FRONTEND=noninteractive
|
||||||
ARG TARGETPLATFORM
|
ARG TARGETPLATFORM
|
||||||
|
|
||||||
|
# TODO (huydhn): There is no prebuilt gdrcopy package on 12.9 at the moment
|
||||||
ARG GDRCOPY_CUDA_VERSION=12.8
|
ARG GDRCOPY_CUDA_VERSION=12.8
|
||||||
# Keep in line with FINAL_BASE_IMAGE
|
# Keep in line with FINAL_BASE_IMAGE
|
||||||
ARG GDRCOPY_OS_VERSION=Ubuntu22_04
|
ARG GDRCOPY_OS_VERSION=Ubuntu22_04
|
||||||
@@ -353,14 +356,16 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
# Install vllm wheel first, so that torch etc will be installed.
|
# Install vllm wheel first, so that torch etc will be installed.
|
||||||
RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
|
RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
|
||||||
--mount=type=cache,target=/root/.cache/uv \
|
--mount=type=cache,target=/root/.cache/uv \
|
||||||
uv pip install --system dist/*.whl --verbose \
|
# TODO: remove apache-tvm-ffi once FlashInfer is fixed https://github.com/flashinfer-ai/flashinfer/issues/1962
|
||||||
|
uv pip install --system --pre apache-tvm-ffi==0.1.0b15 \
|
||||||
|
&& uv pip install --system dist/*.whl --verbose \
|
||||||
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
|
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
|
||||||
|
|
||||||
# Install FlashInfer pre-compiled kernel cache and binaries
|
# Install FlashInfer pre-compiled kernel cache and binaries
|
||||||
# https://docs.flashinfer.ai/installation.html
|
# https://docs.flashinfer.ai/installation.html
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
uv pip install --system flashinfer-cubin==0.4.0 \
|
uv pip install --system flashinfer-cubin==0.4.1 \
|
||||||
&& uv pip install --system flashinfer-jit-cache==0.4.0 \
|
&& uv pip install --system flashinfer-jit-cache==0.4.1 \
|
||||||
--extra-index-url https://flashinfer.ai/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.') \
|
--extra-index-url https://flashinfer.ai/whl/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.') \
|
||||||
&& flashinfer show-config
|
&& flashinfer show-config
|
||||||
|
|
||||||
@@ -422,6 +427,7 @@ ARG PYTHON_VERSION
|
|||||||
|
|
||||||
ARG PIP_INDEX_URL UV_INDEX_URL
|
ARG PIP_INDEX_URL UV_INDEX_URL
|
||||||
ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
|
ARG PIP_EXTRA_INDEX_URL UV_EXTRA_INDEX_URL
|
||||||
|
ARG PYTORCH_CUDA_INDEX_BASE_URL
|
||||||
|
|
||||||
# This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out
|
# This timeout (in seconds) is necessary when installing some dependencies via uv since it's likely to time out
|
||||||
# Reference: https://github.com/astral-sh/uv/pull/1694
|
# Reference: https://github.com/astral-sh/uv/pull/1694
|
||||||
@@ -434,7 +440,8 @@ ENV UV_LINK_MODE=copy
|
|||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
|
CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
|
||||||
if [ "$CUDA_MAJOR" -ge 12 ]; then \
|
if [ "$CUDA_MAJOR" -ge 12 ]; then \
|
||||||
uv pip install --system -r requirements/dev.txt; \
|
uv pip install --system -r requirements/dev.txt \
|
||||||
|
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.'); \
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# install development dependencies (for testing)
|
# install development dependencies (for testing)
|
||||||
|
|||||||
@@ -31,7 +31,7 @@ ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
|
|||||||
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
|
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
|
||||||
--mount=type=cache,target=/var/lib/apt,sharing=locked \
|
--mount=type=cache,target=/var/lib/apt,sharing=locked \
|
||||||
apt-get update -y \
|
apt-get update -y \
|
||||||
&& apt-get install -y --no-install-recommends ccache git curl wget ca-certificates \
|
&& apt-get install -y --no-install-recommends sudo ccache git curl wget ca-certificates \
|
||||||
gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof \
|
gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof \
|
||||||
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 \
|
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 \
|
||||||
&& curl -LsSf https://astral.sh/uv/install.sh | sh
|
&& curl -LsSf https://astral.sh/uv/install.sh | sh
|
||||||
@@ -79,6 +79,9 @@ RUN echo 'ulimit -c 0' >> ~/.bashrc
|
|||||||
######################### BUILD IMAGE #########################
|
######################### BUILD IMAGE #########################
|
||||||
FROM base AS vllm-build
|
FROM base AS vllm-build
|
||||||
|
|
||||||
|
ARG max_jobs=32
|
||||||
|
ENV MAX_JOBS=${max_jobs}
|
||||||
|
|
||||||
ARG GIT_REPO_CHECK=0
|
ARG GIT_REPO_CHECK=0
|
||||||
# Support for building with non-AVX512 vLLM: docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" ...
|
# Support for building with non-AVX512 vLLM: docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" ...
|
||||||
ARG VLLM_CPU_DISABLE_AVX512=0
|
ARG VLLM_CPU_DISABLE_AVX512=0
|
||||||
@@ -104,16 +107,20 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
--mount=type=cache,target=/root/.cache/ccache \
|
--mount=type=cache,target=/root/.cache/ccache \
|
||||||
--mount=type=cache,target=/workspace/vllm/.deps,sharing=locked \
|
--mount=type=cache,target=/workspace/vllm/.deps,sharing=locked \
|
||||||
--mount=type=bind,source=.git,target=.git \
|
--mount=type=bind,source=.git,target=.git \
|
||||||
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel
|
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38
|
||||||
|
|
||||||
######################### TEST DEPS #########################
|
######################### TEST DEPS #########################
|
||||||
FROM base AS vllm-test-deps
|
FROM base AS vllm-test-deps
|
||||||
|
|
||||||
WORKDIR /workspace/vllm
|
WORKDIR /workspace/vllm
|
||||||
|
|
||||||
|
# TODO: Update to 2.9.0 when there is a new build for intel_extension_for_pytorch for that version
|
||||||
RUN --mount=type=bind,src=requirements/test.in,target=requirements/test.in \
|
RUN --mount=type=bind,src=requirements/test.in,target=requirements/test.in \
|
||||||
cp requirements/test.in requirements/cpu-test.in && \
|
cp requirements/test.in requirements/cpu-test.in && \
|
||||||
sed -i '/mamba_ssm/d' requirements/cpu-test.in && \
|
sed -i '/mamba_ssm/d' requirements/cpu-test.in && \
|
||||||
|
sed -i 's/^torch==.*/torch==2.8.0/g' requirements/cpu-test.in && \
|
||||||
|
sed -i 's/torchaudio.*/torchaudio/g' requirements/cpu-test.in && \
|
||||||
|
sed -i 's/torchvision.*/torchvision/g' requirements/cpu-test.in && \
|
||||||
uv pip compile requirements/cpu-test.in -o requirements/cpu-test.txt --index-strategy unsafe-best-match --torch-backend cpu
|
uv pip compile requirements/cpu-test.in -o requirements/cpu-test.txt --index-strategy unsafe-best-match --torch-backend cpu
|
||||||
|
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
|
|||||||
@@ -246,7 +246,7 @@ RUN pip install setuptools==75.6.0 packaging==23.2 ninja==1.11.1.3 build==1.2.2.
|
|||||||
|
|
||||||
|
|
||||||
# build flashinfer for torch nightly from source around 10 mins
|
# build flashinfer for torch nightly from source around 10 mins
|
||||||
# release version: v0.4.0
|
# release version: v0.4.1
|
||||||
# todo(elainewy): cache flashinfer build result for faster build
|
# todo(elainewy): cache flashinfer build result for faster build
|
||||||
ENV CCACHE_DIR=/root/.cache/ccache
|
ENV CCACHE_DIR=/root/.cache/ccache
|
||||||
RUN --mount=type=cache,target=/root/.cache/ccache \
|
RUN --mount=type=cache,target=/root/.cache/ccache \
|
||||||
@@ -254,7 +254,7 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
|
|||||||
echo "git clone flashinfer..." \
|
echo "git clone flashinfer..." \
|
||||||
&& git clone --recursive https://github.com/flashinfer-ai/flashinfer.git \
|
&& git clone --recursive https://github.com/flashinfer-ai/flashinfer.git \
|
||||||
&& cd flashinfer \
|
&& cd flashinfer \
|
||||||
&& git checkout v0.4.0 \
|
&& git checkout v0.4.1\
|
||||||
&& git submodule update --init --recursive \
|
&& git submodule update --init --recursive \
|
||||||
&& echo "finish git clone flashinfer..." \
|
&& echo "finish git clone flashinfer..." \
|
||||||
&& rm -rf build \
|
&& rm -rf build \
|
||||||
|
|||||||
@@ -1,13 +1,13 @@
|
|||||||
ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.0-complete
|
ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.0-complete
|
||||||
ARG TRITON_BRANCH="f9e5bf54"
|
ARG TRITON_BRANCH="57c693b6"
|
||||||
ARG TRITON_REPO="https://github.com/ROCm/triton.git"
|
ARG TRITON_REPO="https://github.com/ROCm/triton.git"
|
||||||
ARG PYTORCH_BRANCH="b2fb6885"
|
ARG PYTORCH_BRANCH="1c57644d"
|
||||||
ARG PYTORCH_VISION_BRANCH="v0.23.0"
|
ARG PYTORCH_VISION_BRANCH="v0.23.0"
|
||||||
ARG PYTORCH_REPO="https://github.com/ROCm/pytorch.git"
|
ARG PYTORCH_REPO="https://github.com/ROCm/pytorch.git"
|
||||||
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
|
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
|
||||||
ARG FA_BRANCH="0e60e394"
|
ARG FA_BRANCH="0e60e394"
|
||||||
ARG FA_REPO="https://github.com/Dao-AILab/flash-attention.git"
|
ARG FA_REPO="https://github.com/Dao-AILab/flash-attention.git"
|
||||||
ARG AITER_BRANCH="2ab9f4cd"
|
ARG AITER_BRANCH="9716b1b8"
|
||||||
ARG AITER_REPO="https://github.com/ROCm/aiter.git"
|
ARG AITER_REPO="https://github.com/ROCm/aiter.git"
|
||||||
|
|
||||||
FROM ${BASE_IMAGE} AS base
|
FROM ${BASE_IMAGE} AS base
|
||||||
|
|||||||
@@ -20,8 +20,6 @@ API documentation for vLLM's configuration classes.
|
|||||||
- [vllm.config.CompilationConfig][]
|
- [vllm.config.CompilationConfig][]
|
||||||
- [vllm.config.VllmConfig][]
|
- [vllm.config.VllmConfig][]
|
||||||
|
|
||||||
[](){ #offline-inference-api }
|
|
||||||
|
|
||||||
## Offline Inference
|
## Offline Inference
|
||||||
|
|
||||||
LLM Class.
|
LLM Class.
|
||||||
@@ -45,18 +43,14 @@ Engine classes for offline and online inference.
|
|||||||
|
|
||||||
Inference parameters for vLLM APIs.
|
Inference parameters for vLLM APIs.
|
||||||
|
|
||||||
[](){ #sampling-params }
|
|
||||||
|
|
||||||
- [vllm.SamplingParams][]
|
- [vllm.SamplingParams][]
|
||||||
- [vllm.PoolingParams][]
|
- [vllm.PoolingParams][]
|
||||||
|
|
||||||
[](){ #multi-modality }
|
|
||||||
|
|
||||||
## Multi-Modality
|
## Multi-Modality
|
||||||
|
|
||||||
vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
|
vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
|
||||||
|
|
||||||
Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
|
Multi-modal inputs can be passed alongside text and token prompts to [supported models](../models/supported_models.md#list-of-multimodal-language-models)
|
||||||
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
|
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
|
||||||
|
|
||||||
Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).
|
Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).
|
||||||
|
|||||||
Binary file not shown. Size: 119 KiB before, 119 KiB after.
BIN docs/assets/contributing/load-pattern-examples.png (normal file)
Binary file not shown. Size: 577 KiB after.
@@ -4,6 +4,6 @@ This section lists the most common options for running vLLM.
|
|||||||
|
|
||||||
There are three main levels of configuration, from highest priority to lowest priority:
|
There are three main levels of configuration, from highest priority to lowest priority:
|
||||||
|
|
||||||
- [Request parameters][completions-api] and [input arguments][sampling-params]
|
- [Request parameters](../serving/openai_compatible_server.md#completions-api) and [input arguments](../api/README.md#inference-parameters)
|
||||||
- [Engine arguments](./engine_args.md)
|
- [Engine arguments](./engine_args.md)
|
||||||
- [Environment variables](./env_vars.md)
|
- [Environment variables](./env_vars.md)
|
||||||
|
|||||||
@@ -23,7 +23,7 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct", tensor_parallel_size=2)
|
|||||||
!!! note
|
!!! note
|
||||||
With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
|
With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
|
||||||
|
|
||||||
You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
|
You can convert the model checkpoint to a sharded checkpoint using [examples/offline_inference/save_sharded_state.py](../../examples/offline_inference/save_sharded_state.py). The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
|
||||||
|
|
||||||
## Quantization
|
## Quantization
|
||||||
|
|
||||||
@@ -58,12 +58,12 @@ You can adjust `compilation_config` to achieve a better balance between inferenc
|
|||||||
|
|
||||||
```python
|
```python
|
||||||
from vllm import LLM
|
from vllm import LLM
|
||||||
from vllm.config import CompilationConfig, CompilationLevel
|
from vllm.config import CompilationConfig, CompilationMode
|
||||||
|
|
||||||
llm = LLM(
|
llm = LLM(
|
||||||
model="meta-llama/Llama-3.1-8B-Instruct",
|
model="meta-llama/Llama-3.1-8B-Instruct",
|
||||||
compilation_config=CompilationConfig(
|
compilation_config=CompilationConfig(
|
||||||
level=CompilationLevel.PIECEWISE,
|
mode=CompilationMode.VLLM_COMPILE,
|
||||||
# By default, it goes up to max_num_seqs
|
# By default, it goes up to max_num_seqs
|
||||||
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
|
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
|
||||||
),
|
),
|
||||||
|
|||||||
@@ -27,8 +27,6 @@ You can monitor the number of preemption requests through Prometheus metrics exp
|
|||||||
|
|
||||||
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
|
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
|
||||||
|
|
||||||
[](){ #chunked-prefill }
|
|
||||||
|
|
||||||
## Chunked Prefill
|
## Chunked Prefill
|
||||||
|
|
||||||
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
|
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
|
||||||
@@ -174,14 +172,14 @@ Regardless, you need to set `mm_encoder_tp_mode="data"` in engine arguments to u
|
|||||||
|
|
||||||
Known supported models (with corresponding benchmarks):
|
Known supported models (with corresponding benchmarks):
|
||||||
|
|
||||||
- dots_ocr (<gh-pr:25466>)
|
- dots_ocr (<https://github.com/vllm-project/vllm/pull/25466>)
|
||||||
- GLM-4.1V or above (<gh-pr:23168>)
|
- GLM-4.1V or above (<https://github.com/vllm-project/vllm/pull/23168>)
|
||||||
- InternVL (<gh-pr:23909>)
|
- InternVL (<https://github.com/vllm-project/vllm/pull/23909>)
|
||||||
- Kimi-VL (<gh-pr:23817>)
|
- Kimi-VL (<https://github.com/vllm-project/vllm/pull/23817>)
|
||||||
- Llama4 (<gh-pr:18368>)
|
- Llama4 (<https://github.com/vllm-project/vllm/pull/18368>)
|
||||||
- MiniCPM-V-2.5 or above (<gh-pr:23327>, <gh-pr:23948>)
|
- MiniCPM-V-2.5 or above (<https://github.com/vllm-project/vllm/pull/23327>, <https://github.com/vllm-project/vllm/pull/23948>)
|
||||||
- Qwen2-VL or above (<gh-pr:22742>, <gh-pr:24955>, <gh-pr:25445>)
|
- Qwen2-VL or above (<https://github.com/vllm-project/vllm/pull/22742>, <https://github.com/vllm-project/vllm/pull/24955>, <https://github.com/vllm-project/vllm/pull/25445>)
|
||||||
- Step3 (<gh-pr:22697>)
|
- Step3 (<https://github.com/vllm-project/vllm/pull/22697>)
|
||||||
|
|
||||||
## Input Processing
|
## Input Processing
|
||||||
|
|
||||||
|
|||||||
@@ -96,7 +96,7 @@ Although it’s common to do this with GPUs, don't try to fragment 2 or 8 differ
|
|||||||
|
|
||||||
### Tune your workloads
|
### Tune your workloads
|
||||||
|
|
||||||
Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](gh-file:benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
|
Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case.
|
||||||
|
|
||||||
### Future Topics We'll Cover
|
### Future Topics We'll Cover
|
||||||
|
|
||||||
|
|||||||
@@ -16,13 +16,13 @@ Finally, one of the most impactful ways to support us is by raising awareness ab
|
|||||||
Unsure on where to start? Check out the following links for tasks to work on:
|
Unsure on where to start? Check out the following links for tasks to work on:
|
||||||
|
|
||||||
- [Good first issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22)
|
- [Good first issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22)
|
||||||
- [Selected onboarding tasks](gh-project:6)
|
- [Selected onboarding tasks](https://github.com/orgs/vllm-project/projects/6)
|
||||||
- [New model requests](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22new-model%22)
|
- [New model requests](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22new-model%22)
|
||||||
- [Models with multi-modal capabilities](gh-project:10)
|
- [Models with multi-modal capabilities](https://github.com/orgs/vllm-project/projects/10)
|
||||||
|
|
||||||
## License
|
## License
|
||||||
|
|
||||||
See <gh-file:LICENSE>.
|
See [LICENSE](../../LICENSE).
|
||||||
|
|
||||||
## Developing
|
## Developing
|
||||||
|
|
||||||
@@ -54,7 +54,7 @@ For more details about installing from source and installing for other hardware,
|
|||||||
For an optimized workflow when iterating on C++/CUDA kernels, see the [Incremental Compilation Workflow](./incremental_build.md) for recommendations.
|
For an optimized workflow when iterating on C++/CUDA kernels, see the [Incremental Compilation Workflow](./incremental_build.md) for recommendations.
|
||||||
|
|
||||||
!!! tip
|
!!! tip
|
||||||
vLLM is compatible with Python versions 3.10 to 3.13. However, vLLM's default [Dockerfile](gh-file:docker/Dockerfile) ships with Python 3.12 and tests in CI (except `mypy`) are run with Python 3.12.
|
vLLM is compatible with Python versions 3.10 to 3.13. However, vLLM's default [Dockerfile](../../docker/Dockerfile) ships with Python 3.12 and tests in CI (except `mypy`) are run with Python 3.12.
|
||||||
|
|
||||||
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
|
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
|
||||||
|
|
||||||
@@ -88,7 +88,7 @@ vLLM's `pre-commit` hooks will now run automatically every time you commit.
|
|||||||
|
|
||||||
### Documentation
|
### Documentation
|
||||||
|
|
||||||
MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, <gh-file:mkdocs.yaml>.
|
MkDocs is a fast, simple and downright gorgeous static site generator that's geared towards building project documentation. Documentation source files are written in Markdown, and configured with a single YAML configuration file, [mkdocs.yaml](../../mkdocs.yaml).
|
||||||
|
|
||||||
Get started with:
|
Get started with:
|
||||||
|
|
||||||
@@ -152,7 +152,7 @@ pytest -s -v tests/test_logger.py
|
|||||||
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
|
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
|
||||||
|
|
||||||
!!! important
|
!!! important
|
||||||
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
|
If you discover a security vulnerability, please follow the instructions [here](../../SECURITY.md).
|
||||||
|
|
||||||
## Pull Requests & Code Reviews
|
## Pull Requests & Code Reviews
|
||||||
|
|
||||||
@@ -162,7 +162,7 @@ code quality and improve the efficiency of the review process.
|
|||||||
|
|
||||||
### DCO and Signed-off-by
|
### DCO and Signed-off-by
|
||||||
|
|
||||||
When contributing changes to this project, you must agree to the <gh-file:DCO>.
|
When contributing changes to this project, you must agree to the [DCO](../../DCO).
|
||||||
Commits must include a `Signed-off-by:` header which certifies agreement with
|
Commits must include a `Signed-off-by:` header which certifies agreement with
|
||||||
the terms of the DCO.
|
the terms of the DCO.
|
||||||
|
|
||||||
|
|||||||
@@ -6,9 +6,10 @@ toc_depth: 4
|
|||||||
|
|
||||||
vLLM provides comprehensive benchmarking tools for performance testing and evaluation:
|
vLLM provides comprehensive benchmarking tools for performance testing and evaluation:
|
||||||
|
|
||||||
- **[Benchmark CLI]**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing
|
- **[Benchmark CLI](#benchmark-cli)**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing
|
||||||
- **[Performance benchmarks][performance-benchmarks]**: Automated CI benchmarks for development
|
- **[Parameter sweeps](#parameter-sweeps)**: Automate `vllm bench` runs for multiple configurations
|
||||||
- **[Nightly benchmarks][nightly-benchmarks]**: Comparative benchmarks against alternatives
|
- **[Performance benchmarks](#performance-benchmarks)**: Automated CI benchmarks for development
|
||||||
|
- **[Nightly benchmarks](#nightly-benchmarks)**: Comparative benchmarks against alternatives
|
||||||
|
|
||||||
[Benchmark CLI]: #benchmark-cli
|
[Benchmark CLI]: #benchmark-cli
|
||||||
|
|
||||||
@@ -29,7 +30,7 @@ th {
|
|||||||
| Dataset | Online | Offline | Data Path |
|
| Dataset | Online | Offline | Data Path |
|
||||||
|---------|--------|---------|-----------|
|
|---------|--------|---------|-----------|
|
||||||
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
|
| ShareGPT | ✅ | ✅ | `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
|
||||||
| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
|
| ShareGPT4V (Image) | ✅ | ✅ | `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
|
||||||
| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
|
| ShareGPT4Video (Video) | ✅ | ✅ | `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
|
||||||
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
|
| BurstGPT | ✅ | ✅ | `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
|
||||||
| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
|
| Sonnet (deprecated) | ✅ | ✅ | Local file: `benchmarks/sonnet.txt` |
|
||||||
@@ -320,6 +321,73 @@ The following arguments can be used to control the ramp-up:
|
|||||||
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
|
- `--ramp-up-start-rps`: The request rate at the beginning of the benchmark.
|
||||||
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
|
- `--ramp-up-end-rps`: The request rate at the end of the benchmark.
|
||||||
|
|
||||||
|
##### Load Pattern Configuration
|
||||||
|
|
||||||
|
vLLM's benchmark serving script simulates different load patterns through three parameters that control request generation and concurrency:
|
||||||
|
|
||||||
|
###### Load Pattern Control Parameters
|
||||||
|
|
||||||
|
- `--request-rate`: Controls the target request generation rate (requests per second). Set to `inf` for maximum throughput testing or finite values for controlled load simulation.
|
||||||
|
- `--burstiness`: Controls traffic variability using a Gamma distribution (range: > 0). Lower values create bursty traffic, higher values create uniform traffic.
|
||||||
|
- `--max-concurrency`: Limits concurrent outstanding requests. If this argument is not provided, concurrency is unlimited. Set a value to simulate backpressure.
|
||||||
|
|
||||||
|
These parameters work together to shape the load pattern. `--request-rate` defaults to `inf`, which sends all requests immediately for maximum throughput testing. With a finite rate, request inter-arrival times follow a Gamma distribution, which reduces to a Poisson process at the default `--burstiness=1.0`. `--burstiness` only takes effect when `--request-rate` is finite: a value of 1.0 gives Poisson traffic, lower values (0.1-0.5) give bursty patterns, and higher values (2.0-5.0) give more evenly spaced requests. `--max-concurrency` defaults to `None` (unlimited) and can be set to simulate a load balancer or API gateway that limits concurrent connections. Combined, these parameters cover everything from unrestricted stress testing (`--request-rate=inf`) to production-like scenarios with realistic arrival patterns and resource constraints.
|
||||||
|
|
||||||
|
The `--burstiness` parameter mathematically controls request arrival patterns using a Gamma distribution where:
|
||||||
|
|
||||||
|
- Shape parameter: `burstiness` value
|
||||||
|
- Coefficient of Variation (CV): $\frac{1}{\sqrt{burstiness}}$
|
||||||
|
- Traffic characteristics:
|
||||||
|
- `burstiness = 0.1`: Highly bursty traffic (CV ≈ 3.16) - stress testing
|
||||||
|
- `burstiness = 1.0`: Natural Poisson traffic (CV = 1.0) - realistic simulation
|
||||||
|
- `burstiness = 5.0`: Uniform traffic (CV ≈ 0.45) - controlled load testing
|
||||||
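The relationship between `--burstiness` and traffic variability can be checked numerically. The following is a minimal sketch, not the benchmark script's implementation: it assumes inter-arrival times are drawn from a Gamma distribution with shape `burstiness` and scale `1 / (request_rate * burstiness)`, a scale chosen here so that the mean rate stays at `request_rate`. The function names are illustrative.

```python
import math
import random
import statistics


def theoretical_cv(burstiness: float) -> float:
    """Coefficient of variation of a Gamma distribution with the given shape."""
    if burstiness <= 0:
        raise ValueError("burstiness must be > 0")
    return 1.0 / math.sqrt(burstiness)


def sample_intervals(
    request_rate: float, burstiness: float, n: int, seed: int = 0
) -> list[float]:
    """Draw n inter-arrival times (seconds) with mean 1 / request_rate.

    Assumption: shape = burstiness, scale = 1 / (request_rate * burstiness).
    """
    rng = random.Random(seed)
    scale = 1.0 / (request_rate * burstiness)
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]


if __name__ == "__main__":
    for b in (0.1, 1.0, 5.0):
        intervals = sample_intervals(request_rate=10.0, burstiness=b, n=50_000)
        mean = statistics.fmean(intervals)
        cv = statistics.pstdev(intervals) / mean
        print(
            f"burstiness={b}: mean interval={mean:.4f}s, "
            f"empirical CV={cv:.2f}, theoretical CV={theoretical_cv(b):.2f}"
        )
```

Under these assumptions the mean interval stays close to `1 / request_rate` for every burstiness value, while the spread around that mean shrinks as burstiness grows.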
|
|
||||||
|

|
||||||
|
|
||||||
|
*Figure: Load pattern examples for each use case. Top row: Request arrival timelines showing cumulative requests over time. Bottom row: Inter-arrival time distributions showing traffic variability patterns. Each column represents a different use case with its specific parameter settings and resulting traffic characteristics.*
|
||||||
|
|
||||||
|
Load Pattern Recommendations by Use Case:
|
||||||
|
|
||||||
|
| Use Case | Burstiness | Request Rate | Max Concurrency | Description |
|
||||||
|
| --- | --- | --- | --- | --- |
|
||||||
|
| Maximum Throughput | N/A | Infinite | Limited | **Most common**: Simulates load balancer/gateway limits with unlimited user demand |
|
||||||
|
| Realistic Testing | 1.0 | Moderate (5-20) | Unlimited | Natural Poisson traffic patterns for baseline performance |
|
||||||
|
| Stress Testing | 0.1-0.5 | High (20-100) | Unlimited | Challenging burst patterns to test resilience |
|
||||||
|
| Latency Profiling | 2.0-5.0 | Low (1-10) | Unlimited | Uniform load for consistent timing analysis |
|
||||||
|
| Capacity Planning | 1.0 | Variable | Limited | Test resource limits with realistic constraints |
|
||||||
|
| SLA Validation | 1.0 | Target rate | SLA limit | Production-like constraints for compliance testing |
|
||||||
|
|
||||||
|
These load patterns help evaluate different aspects of your vLLM deployment, from basic performance characteristics to resilience under challenging traffic conditions.
|
||||||
|
|
||||||
|
The **Maximum Throughput** pattern (`--request-rate=inf --max-concurrency=<limit>`) is the most commonly used configuration for production benchmarking. This simulates real-world deployment architectures where:
|
||||||
|
|
||||||
|
- Users send requests as fast as they can (infinite rate)
|
||||||
|
- A load balancer or API gateway controls the maximum concurrent connections
|
||||||
|
- The system operates at its concurrency limit, revealing true throughput capacity
|
||||||
|
- `--burstiness` has no effect since request timing is not controlled when rate is infinite
|
||||||
|
|
||||||
|
This pattern helps determine optimal concurrency settings for your production load balancer configuration.
|
||||||
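One way to apply the Maximum Throughput pattern is to sweep `--max-concurrency` while keeping `--request-rate=inf`. The sketch below only builds and prints the command lines; it does not run them. The helper name and the concurrency values are illustrative assumptions, and a real run would also need the dataset and backend options used elsewhere in this document.

```python
import shlex


def max_throughput_commands(model: str, concurrency_levels: list[int]) -> list[str]:
    """Build one `vllm bench serve` command line per concurrency limit."""
    commands = []
    for limit in concurrency_levels:
        if limit < 1:
            raise ValueError("concurrency limit must be a positive integer")
        args = [
            "vllm", "bench", "serve",
            "--model", model,
            "--request-rate=inf",          # requests are sent as fast as possible
            f"--max-concurrency={limit}",  # the limit is the only throttle
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Hypothetical concurrency levels; choose values that bracket your gateway limit.
    for cmd in max_throughput_commands("meta-llama/Llama-2-7b-chat-hf", [16, 32, 64, 128]):
        print(cmd)
```

Comparing throughput and latency across the printed runs shows where adding concurrency stops paying off.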
|
|
||||||
|
To effectively configure load patterns, especially for **Capacity Planning** and **SLA Validation** use cases, you need to understand your system's resource limits. During startup, vLLM reports KV cache configuration that directly impacts your load testing parameters:
|
||||||
|
|
||||||
|
```text
|
||||||
|
GPU KV cache size: 15,728,640 tokens
|
||||||
|
Maximum concurrency for 8,192 tokens per request: 1920
|
||||||
|
```
|
||||||
|
|
||||||
|
Where:
|
||||||
|
|
||||||
|
- GPU KV cache size: Total tokens that can be cached across all concurrent requests
|
||||||
|
- Maximum concurrency: Theoretical maximum concurrent requests for the given `max_model_len`
|
||||||
|
- Calculation: `max_concurrency = kv_cache_size / max_model_len`
|
||||||
|
|
||||||
|
Using KV cache metrics for load pattern configuration:
|
||||||
|
|
||||||
|
- For Capacity Planning: Set `--max-concurrency` to 80-90% of the reported maximum to test realistic resource constraints
|
||||||
|
- For SLA Validation: Use the reported maximum as your SLA limit to ensure compliance testing matches production capacity
|
||||||
|
- For Realistic Testing: Monitor memory usage when approaching theoretical limits to understand sustainable request rates
|
||||||
|
- Request rate guidance: Use the KV cache size to estimate sustainable request rates for your specific workload and sequence lengths
|
||||||
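The arithmetic behind these numbers is small enough to script. This sketch reproduces the calculation `max_concurrency = kv_cache_size / max_model_len` for the startup log shown above and derives the 80-90% range suggested for Capacity Planning. The function names are illustrative, and rounding down to whole requests is an assumption.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> int:
    """Theoretical maximum concurrent requests: kv_cache_size / max_model_len."""
    if kv_cache_tokens < 0 or max_model_len <= 0:
        raise ValueError("kv_cache_tokens must be >= 0 and max_model_len > 0")
    # Assumption: partial requests do not count, so round down.
    return kv_cache_tokens // max_model_len


def capacity_planning_range(limit: int) -> tuple[int, int]:
    """Return 80% and 90% of the reported maximum, using integer arithmetic."""
    return limit * 80 // 100, limit * 90 // 100


if __name__ == "__main__":
    limit = max_concurrency(kv_cache_tokens=15_728_640, max_model_len=8_192)
    low, high = capacity_planning_range(limit)
    print(f"reported maximum concurrency: {limit}")
    print(f"suggested --max-concurrency range for capacity planning: {low}-{high}")
```

For the log above this gives a maximum of 1920 and a Capacity Planning range of 1536 to 1728.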
|
|
||||||
</details>
|
</details>
|
||||||
|
|
||||||
#### 📈 Offline Throughput Benchmark
|
#### 📈 Offline Throughput Benchmark
|
||||||
@@ -714,7 +782,7 @@ Generate synthetic image inputs alongside random text prompts to stress-test vis
|
|||||||
|
|
||||||
Notes:
|
Notes:
|
||||||
|
|
||||||
- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
|
- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
|
||||||
- Video sampling is not yet implemented.
|
- Video sampling is not yet implemented.
|
||||||
|
|
||||||
Start the server (example):
|
Start the server (example):
|
||||||
@@ -822,7 +890,7 @@ you should set `--endpoint /v1/embeddings` to use the Embeddings API. The backen
|
|||||||
- CLIP: `--backend openai-embeddings-clip`
|
- CLIP: `--backend openai-embeddings-clip`
|
||||||
- VLM2Vec: `--backend openai-embeddings-vlm2vec`
|
- VLM2Vec: `--backend openai-embeddings-vlm2vec`
|
||||||
|
|
||||||
For other models, please add your own implementation inside <gh-file:vllm/benchmarks/lib/endpoint_request_func.py> to match the expected instruction format.
|
For other models, please add your own implementation inside [vllm/benchmarks/lib/endpoint_request_func.py](../../vllm/benchmarks/lib/endpoint_request_func.py) to match the expected instruction format.
|
||||||
|
|
||||||
You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
|
You can use any text or multi-modal dataset to benchmark the model, as long as the model supports it.
|
||||||
For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
|
For example, you can use ShareGPT and VisionArena to benchmark vision-language embeddings.
|
||||||
@@ -924,7 +992,162 @@ throughput numbers correctly is also adjusted.
|
|||||||
|
|
||||||
</details>
|
</details>
|
||||||
|
|
||||||
[](){ #performance-benchmarks }
|
## Parameter Sweeps
|
||||||
|
|
||||||
|
### Online Benchmark
|
||||||
|
|
||||||
|
[`vllm/benchmarks/sweep/serve.py`](../../vllm/benchmarks/sweep/serve.py) automatically starts `vllm serve` and runs `vllm bench serve` to evaluate vLLM over multiple configurations.
|
||||||
|
|
||||||
|
Follow these steps to run the script:
|
||||||
|
|
||||||
|
1. Construct the base `vllm serve` command, and pass it to the `--serve-cmd` option.
|
||||||
|
2. Construct the base `vllm bench serve` command, and pass it to the `--bench-cmd` option.
|
||||||
|
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.
|
||||||
|
|
||||||
|
- Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:
|
||||||
|
|
||||||
|
```json
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"max_num_seqs": 32,
|
||||||
|
"max_num_batched_tokens": 1024
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"max_num_seqs": 64,
|
||||||
|
"max_num_batched_tokens": 1024
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"max_num_seqs": 64,
|
||||||
|
"max_num_batched_tokens": 2048
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"max_num_seqs": 128,
|
||||||
|
"max_num_batched_tokens": 2048
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"max_num_seqs": 128,
|
||||||
|
"max_num_batched_tokens": 4096
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"max_num_seqs": 256,
|
||||||
|
"max_num_batched_tokens": 4096
|
||||||
|
}
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.
|
||||||
|
|
||||||
|
- Example: Using different input/output lengths for random dataset:
|
||||||
|
|
||||||
|
```json
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"random_input_len": 128,
|
||||||
|
"random_output_len": 32
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"random_input_len": 256,
|
||||||
|
"random_output_len": 64
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"random_input_len": 512,
|
||||||
|
"random_output_len": 128
|
||||||
|
}
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
5. Determine where you want to save the results, and pass that to `--output-dir`.
|
||||||
|
|
||||||
|
Example command:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m vllm.benchmarks.sweep.serve \
|
||||||
|
--serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
|
||||||
|
--bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
|
||||||
|
--serve-params benchmarks/serve_hparams.json \
|
||||||
|
--bench-params benchmarks/bench_hparams.json \
|
||||||
|
-o benchmarks/results
|
||||||
|
```
|
||||||
|
|
||||||
|
!!! important
|
||||||
|
If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
|
||||||
|
You can use `--dry-run` to preview the commands to be run.
|
||||||
|
|
||||||
|
We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
|
||||||
|
Between each benchmark run, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to get a clean slate for the next run.
|
||||||
|
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
|
||||||
|
|
||||||
|
!!! note
|
||||||
|
By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.
|
||||||
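Before launching a sweep, it can help to estimate how many benchmark runs it will trigger. The sketch below is not the sweep script's code: it enumerates the Cartesian product of the example `--serve-params` and `--bench-params` files shown above and multiplies by the default of 3 runs per combination. Treating a missing parameter file as a single empty combination is an assumption made here for illustration.

```python
import itertools

# Parameter combinations copied from the example JSON files above.
serve_params = [
    {"max_num_seqs": 32, "max_num_batched_tokens": 1024},
    {"max_num_seqs": 64, "max_num_batched_tokens": 1024},
    {"max_num_seqs": 64, "max_num_batched_tokens": 2048},
    {"max_num_seqs": 128, "max_num_batched_tokens": 2048},
    {"max_num_seqs": 128, "max_num_batched_tokens": 4096},
    {"max_num_seqs": 256, "max_num_batched_tokens": 4096},
]
bench_params = [
    {"random_input_len": 128, "random_output_len": 32},
    {"random_input_len": 256, "random_output_len": 64},
    {"random_input_len": 512, "random_output_len": 128},
]


def sweep_combinations(serve: list[dict], bench: list[dict]) -> list[tuple[dict, dict]]:
    """Pair every serve setting with every bench setting."""
    # Assumption: an empty list means "use the base command unchanged".
    return list(itertools.product(serve or [{}], bench or [{}]))


def total_runs(serve: list[dict], bench: list[dict], num_runs: int = 3) -> int:
    """Number of benchmark invocations for the whole sweep."""
    return len(sweep_combinations(serve, bench)) * num_runs


if __name__ == "__main__":
    combos = sweep_combinations(serve_params, bench_params)
    print(f"{len(combos)} combinations, {total_runs(serve_params, bench_params)} runs")
```

With the example files, 6 serve settings and 3 bench settings give 18 combinations, or 54 benchmark runs at the default `--num-runs`.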
|
|
||||||
|
!!! tip
|
||||||
|
You can use the `--resume` option to continue the parameter sweep if one of the runs failed.
|
||||||
|
|
||||||
|
### SLA Auto-Tuner
|
||||||
|
|
||||||
|
[`vllm/benchmarks/sweep/serve_sla.py`](../../vllm/benchmarks/sweep/serve_sla.py) is a wrapper over [`vllm/benchmarks/sweep/serve.py`](../../vllm/benchmarks/sweep/serve.py) that tunes either the request rate or concurrency (choose using `--sla-variable`) in order to satisfy the SLA constraints given by `--sla-params`.
|
||||||
|
|
||||||
|
For example, to ensure E2E latency within different target values for 99% of requests:
|
||||||
|
|
||||||
|
```json
|
||||||
|
[
|
||||||
|
{
|
||||||
|
"p99_e2el_ms": "<=200"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"p99_e2el_ms": "<=500"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"p99_e2el_ms": "<=1000"
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"p99_e2el_ms": "<=2000"
|
||||||
|
}
|
||||||
|
]
|
||||||
|
```
|
||||||
|
|
||||||
|
Example command:
|
||||||
|
|
||||||
|
```bash
python -m vllm.benchmarks.sweep.serve_sla \
    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
    --serve-params benchmarks/serve_hparams.json \
    --bench-params benchmarks/bench_hparams.json \
    --sla-params benchmarks/sla_hparams.json \
    --sla-variable max_concurrency \
    -o benchmarks/results
```
|
||||||
|
|
||||||
|
The algorithm for adjusting the SLA variable is as follows:
|
||||||
|
|
||||||
|
1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
    - For example, the initial request rate is set to the concurrency under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
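Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the implementation in `serve_sla.py`: `find_max_value`, `satisfies_sla`, and the `max_value` cap are hypothetical names, and the sketch assumes integer values and an SLA that, once violated, stays violated for all larger values. The behavior when the initial value already violates the SLA (searching below it) is also an assumption.

```python
def find_max_value(satisfies_sla, init_value, max_value=1 << 20):
    """Largest integer in [1, max_value] for which satisfies_sla(value) holds, else 0.

    satisfies_sla is a hypothetical callback that runs the benchmark at the
    given value and reports whether the SLA is met.
    """
    lo = 0                   # largest value known to satisfy the SLA (0 = none yet)
    hi = max(1, init_value)  # candidate to try next

    # Step 2: keep doubling while the SLA is still satisfied.
    while hi <= max_value and satisfies_sla(hi):
        lo = hi
        hi *= 2
    if hi > max_value:
        return lo            # the SLA was never violated within the cap

    # Step 3: hi violates the SLA and lo satisfies it (or is 0), so bisect.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if satisfies_sla(mid):
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Toy SLA: satisfied up to a concurrency of 37.
    print(find_max_value(lambda value: value <= 37, init_value=4))
```

Doubling needs only a logarithmic number of benchmark runs to bracket the threshold, and the binary search then narrows that window with a similar number of runs.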
|
||||||
|
|
||||||
|
!!! important
    SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.
|
||||||
|
|
||||||
|
For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
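The sharing of results across `--sla-params` can be pictured as a cache keyed by the SLA variable value. The sketch below is an assumption-laden illustration rather than the actual code: `CachedBenchmark`, `satisfies`, and `run_benchmark` are hypothetical names, and only the `<=` form of a constraint appears in the example above, so the other operators are guesses at the format.

```python
import operator

# Only "<=" is attested in the example SLA file; the rest are assumed.
_OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}


def satisfies(sla, metrics):
    """Check every constraint such as {"p99_e2el_ms": "<=200"} against metrics."""
    for key, constraint in sla.items():
        for symbol in ("<=", ">=", "<", ">"):  # two-character operators first
            if constraint.startswith(symbol):
                if not _OPS[symbol](metrics[key], float(constraint[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"Unrecognized constraint: {constraint!r}")
    return True


class CachedBenchmark:
    """Runs the (hypothetical) benchmark at most once per SLA variable value."""

    def __init__(self, run_benchmark):
        self._run_benchmark = run_benchmark
        self._results = {}
        self.num_runs = 0

    def metrics_for(self, value):
        if value not in self._results:
            self._results[value] = self._run_benchmark(value)
            self.num_runs += 1
        return self._results[value]


if __name__ == "__main__":
    bench = CachedBenchmark(lambda value: {"p99_e2el_ms": 10.0 * value})
    for limit in (200, 500, 1000, 2000):
        print(limit, satisfies({"p99_e2el_ms": f"<={limit}"}, bench.metrics_for(16)))
    print("benchmark runs:", bench.num_runs)
```

Because the metrics depend only on the serve parameters, bench parameters, and the SLA variable value, each SLA target can be evaluated against the same cached run.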
|
||||||
|
|
||||||
|
### Visualizer
|
||||||
|
|
||||||
|
[`vllm/benchmarks/sweep/plot.py`](../../vllm/benchmarks/sweep/plot.py) can be used to plot performance curves from parameter sweep results.
|
||||||
|
|
||||||
|
Example command:
|
||||||
|
|
||||||
|
```bash
python -m vllm.benchmarks.sweep.plot benchmarks/results/<timestamp> \
    --var-x max_concurrency \
    --row-by random_input_len \
    --col-by random_output_len \
    --curve-by api_server_count,max_num_batched_tokens \
    --filter-by 'max_concurrency<=1024'
```
|
||||||
|
|
||||||
|
!!! tip
    You can use `--dry-run` to preview the figures to be plotted.
|
||||||
|
|
||||||
## Performance Benchmarks
|
## Performance Benchmarks
|
||||||
|
|
||||||
@@ -962,7 +1185,7 @@ For more results visualization, check the [visualizing the results](https://gith
|
|||||||
|
|
||||||
The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
|
The latest performance results are hosted on the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
|
||||||
|
|
||||||
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
|
More information on the performance benchmarks and their parameters can be found in [Benchmark README](https://github.com/intel-ai-tce/vllm/blob/more_cpu_models/.buildkite/nightly-benchmarks/README.md) and [performance benchmark description](../../.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
|
||||||
|
|
||||||
### Continuous Benchmarking
|
### Continuous Benchmarking
|
||||||
|
|
||||||
@@ -988,12 +1211,10 @@ The benchmarking currently runs on a predefined set of models configured in the
|
|||||||
|
|
||||||
All continuous benchmarking results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
|
All continuous benchmarking results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
|
||||||
|
|
||||||
[](){ #nightly-benchmarks }
|
|
||||||
|
|
||||||
## Nightly Benchmarks
|
## Nightly Benchmarks
|
||||||
|
|
||||||
These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lmdeploy`) when there are major updates of vLLM (e.g., bumping up to a new version). They are primarily intended for consumers to evaluate when to choose vLLM over other options and are triggered on every commit with both the `perf-benchmarks` and `nightly-benchmarks` labels.
|
These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lmdeploy`) when there are major updates of vLLM (e.g., bumping up to a new version). They are primarily intended for consumers to evaluate when to choose vLLM over other options and are triggered on every commit with both the `perf-benchmarks` and `nightly-benchmarks` labels.
|
||||||
|
|
||||||
The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html).
|
The latest nightly benchmark results are shared in major release blog posts such as [vLLM v0.6.0](https://blog.vllm.ai/2024/09/05/perf-update.html).
|
||||||
|
|
||||||
More information on the nightly benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/nightly-descriptions.md).
|
More information on the nightly benchmarks and their parameters can be found [here](../../.buildkite/nightly-benchmarks/nightly-descriptions.md).
|
||||||
|
|||||||
@@ -64,7 +64,7 @@ Download the full log file from Buildkite locally.
|
|||||||
|
|
||||||
Strip timestamps and colorization:
|
Strip timestamps and colorization:
|
||||||
|
|
||||||
<gh-file:.buildkite/scripts/ci-clean-log.sh>
|
[.buildkite/scripts/ci-clean-log.sh](../../../.buildkite/scripts/ci-clean-log.sh)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
./ci-clean-log.sh ci.log
|
./ci-clean-log.sh ci.log
|
||||||
@@ -87,7 +87,7 @@ tail -525 ci_build.log | wl-copy
|
|||||||
|
|
||||||
CI test failures may be flaky. Use a bash loop to run repeatedly:
|
CI test failures may be flaky. Use a bash loop to run repeatedly:
|
||||||
|
|
||||||
<gh-file:.buildkite/scripts/rerun-test.sh>
|
[.buildkite/scripts/rerun-test.sh](../../../.buildkite/scripts/rerun-test.sh)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
./rerun-test.sh tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]
|
./rerun-test.sh tests/v1/engine/test_engine_core_client.py::test_kv_cache_events[True-tcp]
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ release in CI/CD. It is standard practice to submit a PR to update the
|
|||||||
PyTorch version as early as possible when a new [PyTorch stable
|
PyTorch version as early as possible when a new [PyTorch stable
|
||||||
release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
|
release](https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-cadence) becomes available.
|
||||||
This process is non-trivial due to the gap between PyTorch
|
This process is non-trivial due to the gap between PyTorch
|
||||||
releases. Using <gh-pr:16859> as an example, this document outlines common steps to achieve this
|
releases. Using <https://github.com/vllm-project/vllm/pull/16859> as an example, this document outlines common steps to achieve this
|
||||||
update along with a list of potential issues and how to address them.
|
update along with a list of potential issues and how to address them.
|
||||||
|
|
||||||
## Test PyTorch release candidates (RCs)
|
## Test PyTorch release candidates (RCs)
|
||||||
@@ -85,9 +85,9 @@ and timeout. Additionally, since vLLM's fastcheck pipeline runs in read-only mod
|
|||||||
it doesn't populate the cache, so re-running it to warm up the cache
|
it doesn't populate the cache, so re-running it to warm up the cache
|
||||||
is ineffective.
|
is ineffective.
|
||||||
|
|
||||||
While ongoing efforts like [#17419](gh-issue:17419)
|
While ongoing efforts like <https://github.com/vllm-project/vllm/issues/17419>
|
||||||
address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH`
|
address the long build time at its source, the current workaround is to set `VLLM_CI_BRANCH`
|
||||||
to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/use_postmerge_q`)
|
to a custom branch provided by @khluu (`VLLM_CI_BRANCH=khluu/long_build`)
|
||||||
when manually triggering a build on Buildkite. This branch accomplishes two things:
|
when manually triggering a build on Buildkite. This branch accomplishes two things:
|
||||||
|
|
||||||
1. Increase the timeout limit to 10 hours so that the build doesn't time out.
|
1. Increase the timeout limit to 10 hours so that the build doesn't time out.
|
||||||
@@ -100,35 +100,17 @@ to warm it up so that future builds are faster.
|
|||||||
|
|
||||||
## Update dependencies
|
## Update dependencies
|
||||||
|
|
||||||
Several vLLM dependencies, such as FlashInfer, also depend on PyTorch and need
|
Several vLLM dependencies like xFormers depend on PyTorch and need
|
||||||
to be updated accordingly. Rather than waiting for all of them to publish new
|
to be updated accordingly. Rather than waiting for all of them to publish new
|
||||||
releases (which would take too much time), they can be built from
|
releases (which would take too much time), they can be built from
|
||||||
source to unblock the update process.
|
source to unblock the update process.
|
||||||
|
|
||||||
### FlashInfer
|
|
||||||
|
|
||||||
Here is how to build and install it from source with `torch2.7.0+cu128` in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
|
|
||||||
|
|
||||||
```bash
|
|
||||||
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
|
|
||||||
export FLASHINFER_ENABLE_SM90=1
|
|
||||||
uv pip install --system \
|
|
||||||
--no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.6.post1"
|
|
||||||
```
|
|
||||||
|
|
||||||
One caveat is that building FlashInfer from source adds approximately 30
|
|
||||||
minutes to the vLLM build time. Therefore, it's preferable to cache the wheel in a
|
|
||||||
public location for immediate installation, such as [this FlashInfer wheel link](https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl). For future releases, contact the PyTorch release
|
|
||||||
team if you want to get the package published there.
|
|
||||||
|
|
||||||
### xFormers
|
### xFormers
|
||||||
|
|
||||||
Similar to FlashInfer, here is how to build and install xFormers from source:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
|
export TORCH_CUDA_ARCH_LIST='7.5 8.0+PTX 9.0a'
|
||||||
MAX_JOBS=16 uv pip install --system \
|
MAX_JOBS=16 uv pip install --system \
|
||||||
--no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.30"
|
--no-build-isolation "git+https://github.com/facebookresearch/xformers@v0.0.32.post2"
|
||||||
```
|
```
|
||||||
|
|
||||||
## Update all the different vLLM platforms
|
## Update all the different vLLM platforms
|
||||||
@@ -138,5 +120,5 @@ to handle some platforms separately. The separation of requirements and Dockerfi
|
|||||||
for different platforms in vLLM CI/CD allows us to selectively choose
|
for different platforms in vLLM CI/CD allows us to selectively choose
|
||||||
which platforms to update. For instance, updating XPU requires the corresponding
|
which platforms to update. For instance, updating XPU requires the corresponding
|
||||||
release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) by Intel.
|
release from [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) by Intel.
|
||||||
While <gh-pr:16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
|
While <https://github.com/vllm-project/vllm/pull/16859> updated vLLM to PyTorch 2.7.0 on CPU, CUDA, and ROCm,
|
||||||
<gh-pr:17444> completed the update for XPU.
|
<https://github.com/vllm-project/vllm/pull/17444> completed the update for XPU.
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
# Dockerfile
|
# Dockerfile
|
||||||
|
|
||||||
We provide a <gh-file:docker/Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
|
We provide a [docker/Dockerfile](../../../docker/Dockerfile) to construct the image for running an OpenAI compatible server with vLLM.
|
||||||
More information about deploying with Docker can be found [here](../../deployment/docker.md).
|
More information about deploying with Docker can be found [here](../../deployment/docker.md).
|
||||||
|
|
||||||
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
|
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
|
||||||
|
|||||||