Compare commits
v0.15.0rc1...v0.15.2rc0 (297 commits)
| Author | SHA1 | Date |
|---|---|---|
| | bbe0574d8e | |
| | 4d9513537d | |
| | 439afa4eea | |
| | fa4e0fb028 | |
| | ce498a6d61 | |
| | 9f14c9224d | |
| | 535de06cb1 | |
| | 4292c90a2a | |
| | 6e98f6d8b6 | |
| | 2f6d17cb2f | |
| | 192ad4648b | |
| | 0e92298622 | |
| | 87d9a26166 | |
| | 80f921ba4b | |
| | 711edaf0d0 | |
| | 1d367a738e | |
| | 32a02c7ca2 | |
| | f67ee8b859 | |
| | e57ef99b40 | |
| | f8516a1ab9 | |
| | 824058076c | |
| | 8e32690869 | |
| | a208439537 | |
| | bcd2f74c0d | |
| | f79f777803 | |
| | 4c8d1bf361 | |
| | 061da6bcf7 | |
| | 4403e3ed4c | |
| | 08e094997e | |
| | d88a1df699 | |
| | 90d74ebaa4 | |
| | 45f8fd6f97 | |
| | 5e1e0a0fbd | |
| | eb5ed20743 | |
| | 2647163674 | |
| | 9fb27dd3b3 | |
| | 4dffc5e044 | |
| | e1bf04b6c2 | |
| | 02080179a3 | |
| | 1b8fe6f7c4 | |
| | 52ee21021a | |
| | 655efb3e69 | |
| | bd8da29a66 | |
| | 2a99c5a6c8 | |
| | 3f7662d650 | |
| | a372f3f40a | |
| | 61e632aea1 | |
| | b1bb18de8d | |
| | 2267cb1cfd | |
| | 0d6ccf68fa | |
| | 18e7cbbb15 | |
| | f0d5251715 | |
| | 5c4f2dd6ef | |
| | f3d8a34671 | |
| | 4bc913aeec | |
| | fbb3cf6981 | |
| | 2df2b3499d | |
| | 2a8d84e66d | |
| | a3acfa1071 | |
| | be8168ff88 | |
| | f6af34626d | |
| | ceab70c89d | |
| | 52683ccbe1 | |
| | e346e2d056 | |
| | 83449a5ff0 | |
| | dad2d6a590 | |
| | 32e84fa1ff | |
| | fd9c83d0e0 | |
| | b95cc5014d | |
| | 61397891ce | |
| | ef248ff740 | |
| | e10604480b | |
| | bf001da4bf | |
| | a0a984ac2e | |
| | f1cb9b5544 | |
| | 4c4b6f7a97 | |
| | 10546f925a | |
| | e69c990c21 | |
| | 5eac9a1b34 | |
| | 1b60b45d0d | |
| | 4b3803d180 | |
| | 5019c59dd2 | |
| | 089cd4f002 | |
| | 0130223bd9 | |
| | 5d1aef3004 | |
| | ffe1fc7a28 | |
| | 8b7346d5f1 | |
| | 6141ebe0dd | |
| | 199e3cb476 | |
| | 9f8cb81b44 | |
| | d7e17aaacd | |
| | 528e9b1490 | |
| | d95b4be47a | |
| | 4061dcf4c5 | |
| | 0aca8b8c62 | |
| | 9eb58f8cf1 | |
| | b10d05b8a8 | |
| | b398e5c819 | |
| | 78061ef584 | |
| | 528b3076af | |
| | a502831d36 | |
| | ba871fb788 | |
| | ab374786c7 | |
| | 808dd87b30 | |
| | beb8899482 | |
| | ce88756b96 | |
| | a3154a6092 | |
| | 7c036432fc | |
| | 318b120766 | |
| | c3b40dc3e7 | |
| | a01ef3fa51 | |
| | 7320ca3942 | |
| | cf0a99f84d | |
| | e535d90deb | |
| | 0b225fb7b2 | |
| | 46b4a02794 | |
| | 8869cd8ec1 | |
| | cd86fff38f | |
| | b5f8c3092d | |
| | 21997f45b1 | |
| | 672023877b | |
| | 754a8ca942 | |
| | 302ecf64ff | |
| | b6bb2842cf | |
| | 79b6ec6aab | |
| | d6416fdde9 | |
| | 0fb3157267 | |
| | a358e4dffe | |
| | 079781177a | |
| | 63c0889416 | |
| | 1e86c802d4 | |
| | fedf64332e | |
| | 2238a12c13 | |
| | ce0afe2451 | |
| | 88c3e114d8 | |
| | 92924b2ddd | |
| | 27cb2f678f | |
| | 22d9a056d5 | |
| | 13b842f271 | |
| | 15f40b20aa | |
| | 793af538a3 | |
| | 6f5e7cda57 | |
| | 68feb76a6f | |
| | 4cb59dea6a | |
| | 608b556507 | |
| | f0a1c8453a | |
| | 8980001c93 | |
| | 527bcd14d4 | |
| | f68e3ea4e1 | |
| | d5c41db35b | |
| | 1618e25492 | |
| | f3888aca83 | |
| | f0bca83ee4 | |
| | 73419abfae | |
| | e77f162cf5 | |
| | 8ecd213c0b | |
| | 5b55c0bea7 | |
| | 15e0bb9c42 | |
| | 6c64c41b4a | |
| | a2ef06e1b3 | |
| | 0a3c71e7e5 | |
| | 29fba76781 | |
| | 9df152bbf6 | |
| | 876a16f4fb | |
| | aaa901ad55 | |
| | 010ec0c30e | |
| | 64a40a7ab4 | |
| | 31aedfe7d6 | |
| | 67ebaff528 | |
| | 2b465570e6 | |
| | 9ca66ecc10 | |
| | c3a9752b0c | |
| | f451b4558b | |
| | 3f96fcf646 | |
| | 6c1f9e4c18 | |
| | 67239c4c42 | |
| | 8ece60768f | |
| | fd0e377244 | |
| | f857a03f6b | |
| | 74898a7015 | |
| | 8f5d51203b | |
| | ae5b7aff2b | |
| | a11bc12d53 | |
| | 58cb55e4de | |
| | cf896ae0e3 | |
| | c5113f60f2 | |
| | 174f16700b | |
| | 8e2ad97ad0 | |
| | 10152d2194 | |
| | 1a7894dbdf | |
| | c87eac18f7 | |
| | f45870b53f | |
| | ba45bedfd1 | |
| | 9432ed8c7e | |
| | 726d89720c | |
| | d334dd26c4 | |
| | 070c811d6f | |
| | 8bfc8d5600 | |
| | ec51831a22 | |
| | 80b918f2bd | |
| | c46b0cd0af | |
| | 133765760b | |
| | bfb9bdaf3f | |
| | 2284461d02 | |
| | 8e2a469b3b | |
| | 23591e631e | |
| | 0493d897c4 | |
| | 8c8ebeb941 | |
| | 831453fcef | |
| | 5a66c9cc76 | |
| | 5e73e4900c | |
| | c6e7404cc5 | |
| | 17b17c0684 | |
| | 8bb6271c77 | |
| | 8b3f0a99dd | |
| | 8311f083bd | |
| | 40c35038d2 | |
| | a5aa4d5c0f | |
| | 615e8033e5 | |
| | d09135fbd0 | |
| | 8688c3d460 | |
| | 5400014d55 | |
| | 3a92c6f3b5 | |
| | e01ff5c070 | |
| | fb946a7f89 | |
| | a650ad1588 | |
| | d697581a7c | |
| | 5eeba80c74 | |
| | 08b1195e62 | |
| | 3bba2edb0f | |
| | 53fc166402 | |
| | 31b25f6516 | |
| | abb34ac43a | |
| | 2515bbd027 | |
| | c487a8eef4 | |
| | 9e138cb01d | |
| | f9d03599ef | |
| | 39037d258e | |
| | 51550179fc | |
| | 07ea184f00 | |
| | a663b218ae | |
| | 1bd47d6e5a | |
| | 141cd43967 | |
| | 6bf3b46d78 | |
| | 77c4f45c6c | |
| | ca1969186d | |
| | ab597c869a | |
| | 4197168ea5 | |
| | 59bcc5b6f2 | |
| | 3e440786af | |
| | 8bdd3979d8 | |
| | c4e744dbd4 | |
| | 8ebf372e9d | |
| | f210f0b7b1 | |
| | 392c5af4fe | |
| | af9b69f977 | |
| | 8e5e40daf4 | |
| | 2e8de86777 | |
| | 247d1a32ea | |
| | ecb4f82209 | |
| | 5914090765 | |
| | f1acbd68c5 | |
| | 9581185d51 | |
| | 2dd359f953 | |
| | 22ad649501 | |
| | 36d450e3b8 | |
| | a2b877df6c | |
| | 35fb0b8613 | |
| | 2eb673a088 | |
| | a97b5e206d | |
| | 911b51b69f | |
| | 604e3b87e8 | |
| | 706f123b23 | |
| | fb7abfc1d0 | |
| | 5d3d6e44e8 | |
| | 46ec6d71c7 | |
| | e82fa448c4 | |
| | d9aa39a3bb | |
| | 3a6d5cbefd | |
| | f5d7049cc1 | |
| | 3c3c547ce0 | |
| | 1cbccb6dba | |
| | bd92089d33 | |
| | a6760f1525 | |
| | 66e601ef79 | |
| | 0cd259b2d8 | |
| | 83fb2d09e8 | |
| | f3a5ee705f | |
| | 7cbbca9aaa | |
| | 5ec44056f7 | |
| | 492a7983dd | |
| | a608b4c6c2 | |
| | 1f3a2c2944 | |
| | 7227d06156 | |
| | 14385c80fc | |
| | 76139d0801 | |
| | da8d0c441a | |
@@ -1,6 +1,7 @@
|
|||||||
group: Hardware
|
group: Hardware
|
||||||
steps:
|
steps:
|
||||||
- label: "AMD: :docker: build image"
|
- label: "AMD: :docker: build image"
|
||||||
|
depends_on: []
|
||||||
device: amd_cpu
|
device: amd_cpu
|
||||||
no_plugin: true
|
no_plugin: true
|
||||||
commands:
|
commands:
|
||||||
|
|||||||
@@ -16,6 +16,7 @@ steps:
|
|||||||
- bash .buildkite/scripts/hardware_ci/run-hpu-test.sh
|
- bash .buildkite/scripts/hardware_ci/run-hpu-test.sh
|
||||||
|
|
||||||
- label: "Intel GPU Test"
|
- label: "Intel GPU Test"
|
||||||
|
depends_on: []
|
||||||
soft_fail: true
|
soft_fail: true
|
||||||
device: intel_gpu
|
device: intel_gpu
|
||||||
no_plugin: true
|
no_plugin: true
|
||||||
|
|||||||
@@ -143,7 +143,7 @@ resolve_parent_commit() {
|
|||||||
print_bake_config() {
|
print_bake_config() {
|
||||||
echo "--- :page_facing_up: Resolved bake configuration"
|
echo "--- :page_facing_up: Resolved bake configuration"
|
||||||
BAKE_CONFIG_FILE="bake-config-build-${BUILDKITE_BUILD_NUMBER:-local}.json"
|
BAKE_CONFIG_FILE="bake-config-build-${BUILDKITE_BUILD_NUMBER:-local}.json"
|
||||||
docker buildx bake -f "${VLLM_BAKE_FILE}" -f "${CI_HCL_PATH}" --print "${TARGET}" | tee "${BAKE_CONFIG_FILE}" || true
|
docker buildx bake -f "${VLLM_BAKE_FILE_PATH}" -f "${CI_HCL_PATH}" --print "${TARGET}" | tee "${BAKE_CONFIG_FILE}" || true
|
||||||
echo "Saved bake config to ${BAKE_CONFIG_FILE}"
|
echo "Saved bake config to ${BAKE_CONFIG_FILE}"
|
||||||
echo "--- :arrow_down: Uploading bake config to Buildkite"
|
echo "--- :arrow_down: Uploading bake config to Buildkite"
|
||||||
buildkite-agent artifact upload "${BAKE_CONFIG_FILE}"
|
buildkite-agent artifact upload "${BAKE_CONFIG_FILE}"
|
||||||
@@ -170,9 +170,9 @@ IMAGE_TAG_LATEST=${8:-} # only used for main branch, optional
|
|||||||
|
|
||||||
# build config
|
# build config
|
||||||
TARGET="test-ci"
|
TARGET="test-ci"
|
||||||
CI_HCL_URL="${CI_HCL_URL:-https://raw.githubusercontent.com/vllm-project/ci-infra/main/docker/ci.hcl}"
|
VLLM_BAKE_FILE_PATH="${VLLM_BAKE_FILE_PATH:-docker/docker-bake.hcl}"
|
||||||
VLLM_BAKE_FILE="${VLLM_BAKE_FILE:-docker/docker-bake.hcl}"
|
|
||||||
BUILDER_NAME="${BUILDER_NAME:-vllm-builder}"
|
BUILDER_NAME="${BUILDER_NAME:-vllm-builder}"
|
||||||
|
CI_HCL_URL="${CI_HCL_URL:-https://raw.githubusercontent.com/vllm-project/ci-infra/main/docker/ci.hcl}"
|
||||||
CI_HCL_PATH="/tmp/ci.hcl"
|
CI_HCL_PATH="/tmp/ci.hcl"
|
||||||
BUILDKIT_SOCKET="/run/buildkit/buildkitd.sock"
|
BUILDKIT_SOCKET="/run/buildkit/buildkitd.sock"
|
||||||
|
|
||||||
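The defaults in the hunk above use Bash's `${VAR:-default}` expansion, which lets each setting be overridden from the environment. The following is a minimal sketch of that behavior, not code from image_build.sh; the function name `resolve_bake_file` and the path `custom/bake.hcl` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in image_build.sh): mirrors the
# VLLM_BAKE_FILE_PATH default shown in the hunk above.
resolve_bake_file() {
  # ${VAR:-default} yields the default when VAR is unset or empty.
  echo "${VLLM_BAKE_FILE_PATH:-docker/docker-bake.hcl}"
}

# With nothing set, the default path is used.
unset VLLM_BAKE_FILE_PATH
default_path="$(resolve_bake_file)"

# With the variable set, the caller's value wins.
VLLM_BAKE_FILE_PATH="custom/bake.hcl"
override_path="$(resolve_bake_file)"

echo "default:  ${default_path}"
echo "override: ${override_path}"
```

Because the rename changes the variable name, an environment that still exports `VLLM_BAKE_FILE` would silently fall back to the default under this pattern.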
@@ -180,9 +180,8 @@ prepare_cache_tags
|
|||||||
ecr_login
|
ecr_login
|
||||||
|
|
||||||
# Environment info (for docs and human readers)
|
# Environment info (for docs and human readers)
|
||||||
# CI_HCL_URL - URL to ci.hcl (default: from ci-infra main branch)
|
|
||||||
# VLLM_CI_BRANCH - ci-infra branch to use (default: main)
|
# VLLM_CI_BRANCH - ci-infra branch to use (default: main)
|
||||||
# VLLM_BAKE_FILE - Path to vLLM's bake file (default: docker/docker-bake.hcl)
|
# VLLM_BAKE_FILE_PATH - Path to vLLM's bake file (default: docker/docker-bake.hcl)
|
||||||
# BUILDER_NAME - Name for buildx builder (default: vllm-builder)
|
# BUILDER_NAME - Name for buildx builder (default: vllm-builder)
|
||||||
#
|
#
|
||||||
# Build configuration (exported as environment variables for bake):
|
# Build configuration (exported as environment variables for bake):
|
||||||
@@ -211,10 +210,9 @@ echo "IMAGE_TAG_LATEST: ${IMAGE_TAG_LATEST}"
|
|||||||
# print build configuration
|
# print build configuration
|
||||||
echo "--- :mag: Build configuration"
|
echo "--- :mag: Build configuration"
|
||||||
echo "TARGET: ${TARGET}"
|
echo "TARGET: ${TARGET}"
|
||||||
echo "CI HCL URL: ${CI_HCL_URL}"
|
echo "vLLM bake file: ${VLLM_BAKE_FILE_PATH}"
|
||||||
echo "vLLM bake file: ${VLLM_BAKE_FILE}"
|
|
||||||
echo "BUILDER_NAME: ${BUILDER_NAME}"
|
echo "BUILDER_NAME: ${BUILDER_NAME}"
|
||||||
echo "CI_HCL_PATH: ${CI_HCL_PATH}"
|
echo "CI_HCL_URL: ${CI_HCL_URL}"
|
||||||
echo "BUILDKIT_SOCKET: ${BUILDKIT_SOCKET}"
|
echo "BUILDKIT_SOCKET: ${BUILDKIT_SOCKET}"
|
||||||
|
|
||||||
echo "--- :mag: Cache tags"
|
echo "--- :mag: Cache tags"
|
||||||
@@ -227,11 +225,11 @@ check_and_skip_if_image_exists
|
|||||||
|
|
||||||
echo "--- :docker: Setting up Docker buildx bake"
|
echo "--- :docker: Setting up Docker buildx bake"
|
||||||
echo "Target: ${TARGET}"
|
echo "Target: ${TARGET}"
|
||||||
echo "CI HCL URL: ${CI_HCL_URL}"
|
echo "vLLM bake file: ${VLLM_BAKE_FILE_PATH}"
|
||||||
echo "vLLM bake file: ${VLLM_BAKE_FILE}"
|
echo "CI HCL path: ${CI_HCL_PATH}"
|
||||||
|
|
||||||
if [[ ! -f "${VLLM_BAKE_FILE}" ]]; then
|
if [[ ! -f "${VLLM_BAKE_FILE_PATH}" ]]; then
|
||||||
echo "Error: vLLM bake file not found at ${VLLM_BAKE_FILE}"
|
echo "Error: vLLM bake file not found at ${VLLM_BAKE_FILE_PATH}"
|
||||||
echo "Make sure you're running from the vLLM repository root"
|
echo "Make sure you're running from the vLLM repository root"
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
@@ -240,15 +238,19 @@ echo "--- :arrow_down: Downloading ci.hcl"
|
|||||||
curl -sSfL -o "${CI_HCL_PATH}" "${CI_HCL_URL}"
|
curl -sSfL -o "${CI_HCL_PATH}" "${CI_HCL_URL}"
|
||||||
echo "Downloaded to ${CI_HCL_PATH}"
|
echo "Downloaded to ${CI_HCL_PATH}"
|
||||||
|
|
||||||
|
if [[ ! -f "${CI_HCL_PATH}" ]]; then
|
||||||
|
echo "Error: ci.hcl not found at ${CI_HCL_PATH}"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
setup_buildx_builder
|
setup_buildx_builder
|
||||||
|
|
||||||
# Compute parent commit for cache fallback (if not already set)
|
|
||||||
resolve_parent_commit
|
resolve_parent_commit
|
||||||
export PARENT_COMMIT
|
export PARENT_COMMIT
|
||||||
|
|
||||||
print_bake_config
|
print_bake_config
|
||||||
|
|
||||||
echo "--- :docker: Building ${TARGET}"
|
echo "--- :docker: Building ${TARGET}"
|
||||||
docker --debug buildx bake -f "${VLLM_BAKE_FILE}" -f "${CI_HCL_PATH}" --progress plain "${TARGET}"
|
docker --debug buildx bake -f "${VLLM_BAKE_FILE_PATH}" -f "${CI_HCL_PATH}" --progress plain "${TARGET}"
|
||||||
|
|
||||||
echo "--- :white_check_mark: Build complete"
|
echo "--- :white_check_mark: Build complete"
|
||||||
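The hunk above adds an existence check for the downloaded ci.hcl, next to the existing check for the vLLM bake file. As a rough sketch, the two checks could share one guard; `require_file` is a hypothetical helper, not part of the script, and it returns a status instead of calling `exit` so that the caller decides how to fail.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in image_build.sh): report a missing file
# on stderr and return non-zero, leaving the exit decision to the caller.
require_file() {
  local path="$1" label="$2"
  if [[ ! -f "${path}" ]]; then
    echo "Error: ${label} not found at ${path}" >&2
    return 1
  fi
}

# Demonstrate both outcomes with a throwaway temp file.
tmp_file="$(mktemp)"
require_file "${tmp_file}" "ci.hcl" && present_status=0 || present_status=$?
rm -f "${tmp_file}"
require_file "${tmp_file}" "ci.hcl" 2>/dev/null && missing_status=0 || missing_status=$?

echo "present=${present_status} missing=${missing_status}"
```

In the script itself the equivalent call would be followed by `|| exit 1`, which matches the behavior of the inline checks in the diff.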
|
|||||||
@@ -5,7 +5,7 @@ steps:
|
|||||||
depends_on: []
|
depends_on: []
|
||||||
commands:
|
commands:
|
||||||
- if [[ "$BUILDKITE_BRANCH" != "main" ]]; then .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $VLLM_USE_PRECOMPILED $VLLM_MERGE_BASE_COMMIT $IMAGE_TAG; fi
|
- if [[ "$BUILDKITE_BRANCH" != "main" ]]; then .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $VLLM_USE_PRECOMPILED $VLLM_MERGE_BASE_COMMIT $IMAGE_TAG; fi
|
||||||
- if [[ "$BUILDKITE_BRANCH" == "main" ]]; then .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $VLLM_USE_PRECOMPILED $VLLM_MERGE_BASE_COMMIT $IMAGE_TAG_LATEST; fi
|
- if [[ "$BUILDKITE_BRANCH" == "main" ]]; then .buildkite/image_build/image_build.sh $REGISTRY $REPO $BUILDKITE_COMMIT $BRANCH $VLLM_USE_PRECOMPILED $VLLM_MERGE_BASE_COMMIT $IMAGE_TAG $IMAGE_TAG_LATEST; fi
|
||||||
retry:
|
retry:
|
||||||
automatic:
|
automatic:
|
||||||
- exit_status: -1 # Agent was lost
|
- exit_status: -1 # Agent was lost
|
||||||
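The change above passes both `$IMAGE_TAG` and `$IMAGE_TAG_LATEST` on main. The earlier hunk header shows that the script reads `IMAGE_TAG_LATEST` from `${8:-}`, so with only seven arguments that value stays empty. The sketch below illustrates this with a hypothetical `parse_tags` function and placeholder argument values; that position 7 holds `IMAGE_TAG` is inferred from the call order and is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the tail of the script's positional arguments.
# Assumption: argument 7 is IMAGE_TAG; argument 8 is IMAGE_TAG_LATEST.
parse_tags() {
  local image_tag="${7:-}"
  local image_tag_latest="${8:-}"   # optional, expands to "" when absent
  echo "tag=${image_tag} latest=${image_tag_latest}"
}

# Seven arguments: the "latest" slot is empty.
seven_args="$(parse_tags registry repo sha branch 0 base img:sha)"
# Eight arguments: both tags are populated.
eight_args="$(parse_tags registry repo sha branch 0 base img:sha img:latest)"

echo "${seven_args}"
echo "${eight_args}"
```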
|
|||||||
@@ -0,0 +1,15 @@
|
|||||||
|
model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
|
||||||
|
tasks:
|
||||||
|
- name: "gsm8k"
|
||||||
|
metrics:
|
||||||
|
- name: "exact_match,strict-match"
|
||||||
|
value: 0.695
|
||||||
|
- name: "exact_match,flexible-extract"
|
||||||
|
value: 0.447
|
||||||
|
limit: 1319
|
||||||
|
num_fewshot: 5
|
||||||
|
max_model_len: 262144
|
||||||
|
enforce_eager: false
|
||||||
|
apply_chat_template: true
|
||||||
|
fewshot_as_multiturn: true
|
||||||
|
trust_remote_code: true
|
||||||
@@ -0,0 +1,19 @@
|
|||||||
|
model_name: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"
|
||||||
|
tasks:
|
||||||
|
- name: "gsm8k"
|
||||||
|
metrics:
|
||||||
|
- name: "exact_match,strict-match"
|
||||||
|
value: 0.7142
|
||||||
|
- name: "exact_match,flexible-extract"
|
||||||
|
value: 0.4579
|
||||||
|
env_vars:
|
||||||
|
VLLM_USE_FLASHINFER_MOE_FP8: "1"
|
||||||
|
VLLM_FLASHINFER_MOE_BACKEND: "throughput"
|
||||||
|
limit: 1319
|
||||||
|
num_fewshot: 5
|
||||||
|
max_model_len: 262144
|
||||||
|
kv_cache_dtype: fp8
|
||||||
|
enforce_eager: false
|
||||||
|
apply_chat_template: true
|
||||||
|
fewshot_as_multiturn: true
|
||||||
|
trust_remote_code: true
|
||||||
@@ -1 +1,2 @@
|
|||||||
Qwen3-235B-A22B-Instruct-2507-FP8.yaml
|
Qwen3-235B-A22B-Instruct-2507-FP8.yaml
|
||||||
|
NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.yaml
|
||||||
|
|||||||
@@ -3,3 +3,4 @@ Meta-Llama-3-70B-Instruct.yaml
|
|||||||
Mixtral-8x7B-Instruct-v0.1.yaml
|
Mixtral-8x7B-Instruct-v0.1.yaml
|
||||||
Qwen2-57B-A14-Instruct.yaml
|
Qwen2-57B-A14-Instruct.yaml
|
||||||
DeepSeek-V2-Lite-Chat.yaml
|
DeepSeek-V2-Lite-Chat.yaml
|
||||||
|
NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.yaml
|
||||||
|
|||||||
@@ -393,7 +393,7 @@ if __name__ == "__main__":
|
|||||||
with open(results_folder / md_file, "w") as f:
|
with open(results_folder / md_file, "w") as f:
|
||||||
results = read_markdown(
|
results = read_markdown(
|
||||||
"../.buildkite/performance-benchmarks/"
|
"../.buildkite/performance-benchmarks/"
|
||||||
+ "performance-benchmarks-descriptions.md"
|
"performance-benchmarks-descriptions.md"
|
||||||
)
|
)
|
||||||
results = results.format(
|
results = results.format(
|
||||||
latency_tests_markdown_table=latency_md_table,
|
latency_tests_markdown_table=latency_md_table,
|
||||||
|
|||||||
@@ -25,9 +25,9 @@ check_gpus() {
|
|||||||
echo "Need at least 1 GPU to run benchmarking."
|
echo "Need at least 1 GPU to run benchmarking."
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
|
|
||||||
declare -g arch_suffix=''
|
declare -g arch_suffix=''
|
||||||
|
|
||||||
if command -v nvidia-smi; then
|
if command -v nvidia-smi; then
|
||||||
declare -g gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
|
declare -g gpu_type=$(nvidia-smi --query-gpu=name --format=csv,noheader | awk '{print $2}')
|
||||||
elif command -v amd-smi; then
|
elif command -v amd-smi; then
|
||||||
@@ -181,19 +181,20 @@ upload_to_buildkite() {
|
|||||||
$BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
|
$BUILDKITE_AGENT_COMMAND artifact upload "$RESULTS_FOLDER/*"
|
||||||
}
|
}
|
||||||
|
|
||||||
run_latency_tests() {
|
run_benchmark_tests() {
|
||||||
# run latency tests using `vllm bench latency` command
|
# run benchmark tests using `vllm bench <test_type>` command
|
||||||
# $1: a json file specifying latency test cases
|
# $1: test type (latency or throughput)
|
||||||
|
# $2: a json file specifying test cases
|
||||||
|
|
||||||
local latency_test_file
|
local test_type=$1
|
||||||
latency_test_file=$1
|
local test_file=$2
|
||||||
|
|
||||||
# Iterate over latency tests
|
# Iterate over tests
|
||||||
jq -c '.[]' "$latency_test_file" | while read -r params; do
|
jq -c '.[]' "$test_file" | while read -r params; do
|
||||||
# get the test name, and append the GPU type back to it.
|
# get the test name, and append the GPU type back to it.
|
||||||
test_name=$(echo "$params" | jq -r '.test_name')
|
test_name=$(echo "$params" | jq -r '.test_name')
|
||||||
if [[ ! "$test_name" =~ ^latency_ ]]; then
|
if [[ ! "$test_name" =~ ^${test_type}_ ]]; then
|
||||||
echo "In latency-test.json, test_name must start with \"latency_\"."
|
echo "In ${test_type}-test.json, test_name must start with \"${test_type}_\"."
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
|
|
||||||
@@ -204,15 +205,15 @@ run_latency_tests() {
|
|||||||
fi
|
fi
|
||||||
|
|
||||||
# get arguments
|
# get arguments
|
||||||
latency_params=$(echo "$params" | jq -r '.parameters')
|
bench_params=$(echo "$params" | jq -r '.parameters')
|
||||||
latency_args=$(json2args "$latency_params")
|
bench_args=$(json2args "$bench_params")
|
||||||
latency_environment_variables=$(echo "$params" | jq -r '.environment_variables')
|
bench_environment_variables=$(echo "$params" | jq -r '.environment_variables')
|
||||||
latency_envs=$(json2envs "$latency_environment_variables")
|
bench_envs=$(json2envs "$bench_environment_variables")
|
||||||
|
|
||||||
# check if there is enough GPU to run the test
|
# check if there is enough GPU to run the test
|
||||||
tp=$(echo "$latency_params" | jq -r '.tensor_parallel_size')
|
tp=$(echo "$bench_params" | jq -r '.tensor_parallel_size')
|
||||||
if [[ "$ON_CPU" == "1" ]]; then
|
if [[ "$ON_CPU" == "1" ]]; then
|
||||||
pp=$(echo "$latency_params" | jq -r '.pipeline_parallel_size // 1')
|
pp=$(echo "$bench_params" | jq -r '.pipeline_parallel_size // 1')
|
||||||
world_size=$(($tp*$pp))
|
world_size=$(($tp*$pp))
|
||||||
if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
|
if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
|
||||||
echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
|
echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
|
||||||
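On CPU, the hunk above skips a test case when the NUMA node count is below the world size, computed as tensor-parallel size times pipeline-parallel size, with the pipeline size defaulting to 1. The following sketch reproduces only that arithmetic; the helper names are hypothetical, and the real check also requires `REMOTE_HOST` to be empty.

```bash
#!/usr/bin/env bash
# Hypothetical helper: world size = tp * pp, where pp falls back to 1
# (the script gets the same fallback from jq's `// 1`).
compute_world_size() {
  local tp="$1" pp="${2:-1}"
  echo $(( tp * pp ))
}

# Hypothetical helper: succeeds when enough NUMA nodes are available.
has_enough_nodes() {
  local available="$1" required="$2"
  [[ "${available}" -ge "${required}" ]]
}

ws_default="$(compute_world_size 4)"      # pp omitted, treated as 1
ws_pipeline="$(compute_world_size 2 2)"   # tp=2, pp=2

if ! has_enough_nodes 2 "${ws_pipeline}"; then
  echo "Required world-size ${ws_pipeline} but only 2 NUMA nodes found."
fi
```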
@@ -225,97 +226,42 @@ run_latency_tests() {
|
|||||||
fi
|
fi
|
||||||
fi
|
fi
|
||||||
|
|
||||||
latency_command=" $latency_envs vllm bench latency \
|
bench_command=" $bench_envs vllm bench $test_type \
|
||||||
--output-json $RESULTS_FOLDER/${test_name}.json \
|
--output-json $RESULTS_FOLDER/${test_name}.json \
|
||||||
$latency_args"
|
$bench_args"
|
||||||
|
|
||||||
echo "Running test case $test_name"
|
echo "Running test case $test_name"
|
||||||
echo "Latency command: $latency_command"
|
echo "${test_type^} command: $bench_command"
|
||||||
|
|
||||||
# recoding benchmarking command ang GPU command
|
# recording benchmarking command and GPU command
|
||||||
jq_output=$(jq -n \
|
jq_output=$(jq -n \
|
||||||
--arg latency "$latency_command" \
|
--arg command "$bench_command" \
|
||||||
--arg gpu "$gpu_type" \
|
--arg gpu "$gpu_type" \
|
||||||
|
--arg test_type "$test_type" \
|
||||||
'{
|
'{
|
||||||
latency_command: $latency,
|
($test_type + "_command"): $command,
|
||||||
gpu_type: $gpu
|
gpu_type: $gpu
|
||||||
}')
|
}')
|
||||||
echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"
|
echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"
|
||||||
|
|
||||||
# run the benchmark
|
# run the benchmark
|
||||||
eval "$latency_command"
|
eval "$bench_command"
|
||||||
|
|
||||||
kill_gpu_processes
|
kill_gpu_processes
|
||||||
|
|
||||||
done
|
done
|
||||||
}
|
}
|
||||||
|
|
||||||
|
run_latency_tests() {
|
||||||
|
run_benchmark_tests "latency" "$1"
|
||||||
|
}
|
||||||
|
|
||||||
|
run_startup_tests() {
|
||||||
|
run_benchmark_tests "startup" "$1"
|
||||||
|
}
|
||||||
|
|
||||||
run_throughput_tests() {
|
run_throughput_tests() {
|
||||||
# run throughput tests using `vllm bench throughput`
|
run_benchmark_tests "throughput" "$1"
|
||||||
# $1: a json file specifying throughput test cases
|
|
||||||
|
|
||||||
local throughput_test_file
|
|
||||||
throughput_test_file=$1
|
|
||||||
|
|
||||||
# Iterate over throughput tests
|
|
||||||
jq -c '.[]' "$throughput_test_file" | while read -r params; do
|
|
||||||
# get the test name, and append the GPU type back to it.
|
|
||||||
test_name=$(echo "$params" | jq -r '.test_name')
|
|
||||||
if [[ ! "$test_name" =~ ^throughput_ ]]; then
|
|
||||||
echo "In throughput-test.json, test_name must start with \"throughput_\"."
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
|
|
||||||
# if TEST_SELECTOR is set, only run the test cases that match the selector
|
|
||||||
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
|
|
||||||
echo "Skip test case $test_name."
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
|
|
||||||
# get arguments
|
|
||||||
throughput_params=$(echo "$params" | jq -r '.parameters')
|
|
||||||
throughput_args=$(json2args "$throughput_params")
|
|
||||||
throughput_environment_variables=$(echo "$params" | jq -r '.environment_variables')
|
|
||||||
throughput_envs=$(json2envs "$throughput_environment_variables")
|
|
||||||
|
|
||||||
# check if there is enough GPU to run the test
|
|
||||||
tp=$(echo "$throughput_params" | jq -r '.tensor_parallel_size')
|
|
||||||
if [[ "$ON_CPU" == "1" ]]; then
|
|
||||||
pp=$(echo "$throughput_params" | jq -r '.pipeline_parallel_size // 1')
|
|
||||||
world_size=$(($tp*$pp))
|
|
||||||
if [[ $numa_count -lt $world_size && -z "${REMOTE_HOST}" ]]; then
|
|
||||||
echo "Required world-size $world_size but only $numa_count NUMA nodes found. Skip testcase $test_name."
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
else
|
|
||||||
if [[ $gpu_count -lt $tp ]]; then
|
|
||||||
echo "Required tensor-parallel-size $tp but only $gpu_count GPU found. Skip testcase $test_name."
|
|
||||||
continue
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
throughput_command=" $throughput_envs vllm bench throughput \
|
|
||||||
--output-json $RESULTS_FOLDER/${test_name}.json \
|
|
||||||
$throughput_args"
|
|
||||||
|
|
||||||
echo "Running test case $test_name"
|
|
||||||
echo "Throughput command: $throughput_command"
|
|
||||||
# recoding benchmarking command ang GPU command
|
|
||||||
jq_output=$(jq -n \
|
|
||||||
--arg command "$throughput_command" \
|
|
||||||
--arg gpu "$gpu_type" \
|
|
||||||
'{
|
|
||||||
throughput_command: $command,
|
|
||||||
gpu_type: $gpu
|
|
||||||
}')
|
|
||||||
echo "$jq_output" >"$RESULTS_FOLDER/$test_name.commands"
|
|
||||||
|
|
||||||
# run the benchmark
|
|
||||||
eval "$throughput_command"
|
|
||||||
|
|
||||||
kill_gpu_processes
|
|
||||||
|
|
||||||
done
|
|
||||||
}
|
}
|
||||||
|
|
||||||
run_serving_tests() {
|
run_serving_tests() {
|
||||||
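The refactored `run_benchmark_tests` in the hunk above depends on two Bash features: an unquoted variable inside `[[ ... =~ ... ]]` is expanded into the regular expression, and `${var^}` upper-cases the first character (Bash 4 or later). The sketch below isolates both; the helper names and the test names passed to them are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the test_name prefix check.
has_type_prefix() {
  local test_type="$1" test_name="$2"
  # The unquoted ${test_type} becomes part of the regex: ^latency_, ^startup_, ...
  [[ "${test_name}" =~ ^${test_type}_ ]]
}

# Hypothetical helper mirroring the log label built from the test type.
label_for() {
  local test_type="$1"
  # ${var^} upper-cases only the first character; requires Bash 4+.
  echo "${test_type^} command"
}

if has_type_prefix "latency" "latency_example_tp1"; then
  echo "$(label_for latency): name accepted"
fi
if ! has_type_prefix "startup" "latency_example_tp1"; then
  echo "$(label_for startup): name rejected"
fi
```

Since the type is interpolated into a regex, this pattern assumes the type string contains no regex metacharacters, which holds for `latency`, `startup`, and `throughput`.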
@@ -447,6 +393,11 @@ run_serving_tests() {
|
|||||||
fi
|
fi
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
# save the compilation mode and optimization level on the serving results
|
||||||
|
# whenever they are set
|
||||||
|
compilation_config_mode=$(echo "$server_params" | jq -r '."compilation_config.mode" // empty')
|
||||||
|
optimization_level=$(echo "$server_params" | jq -r '.optimization_level // empty')
|
||||||
|
|
||||||
# iterate over different QPS
|
# iterate over different QPS
|
||||||
for qps in $qps_list; do
|
for qps in $qps_list; do
|
||||||
# remove the surrounding single quote from qps
|
# remove the surrounding single quote from qps
|
||||||
@@ -460,15 +411,15 @@ run_serving_tests() {
|
|||||||
for max_concurrency in $max_concurrency_list; do
|
for max_concurrency in $max_concurrency_list; do
|
||||||
new_test_name=$test_name"_qps_"$qps"_concurrency_"$max_concurrency
|
new_test_name=$test_name"_qps_"$qps"_concurrency_"$max_concurrency
|
||||||
echo " new test name $new_test_name"
|
echo " new test name $new_test_name"
|
||||||
# pass the tensor parallel size to the client so that it can be displayed
|
# pass the tensor parallel size, the compilation mode, and the optimization
|
||||||
# on the benchmark dashboard
|
# level to the client so that they can be used on the benchmark dashboard
|
||||||
client_command="vllm bench serve \
|
client_command="vllm bench serve \
|
||||||
--save-result \
|
--save-result \
|
||||||
--result-dir $RESULTS_FOLDER \
|
--result-dir $RESULTS_FOLDER \
|
||||||
--result-filename ${new_test_name}.json \
|
--result-filename ${new_test_name}.json \
|
||||||
--request-rate $qps \
|
--request-rate $qps \
|
||||||
--max-concurrency $max_concurrency \
|
--max-concurrency $max_concurrency \
|
||||||
--metadata "tensor_parallel_size=$tp" \
|
--metadata tensor_parallel_size=$tp compilation_config.mode=$compilation_config_mode optimization_level=$optimization_level \
|
||||||
$client_args $client_remote_args "
|
$client_args $client_remote_args "
|
||||||
|
|
||||||
echo "Running test case $test_name with qps $qps"
|
echo "Running test case $test_name with qps $qps"
|
||||||
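In the hunk above, `compilation_config_mode` and `optimization_level` come from jq expressions ending in `// empty`, so either can be an empty string, in which case the metadata entry is passed as a bare `key=`. One possible variation, which is not what the script does, is to drop pairs with empty values before building the command. `build_metadata` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print space-separated key=value pairs,
# skipping any pair whose value (text after the first '=') is empty.
build_metadata() {
  local pair out=()
  for pair in "$@"; do
    if [[ -n "${pair#*=}" ]]; then
      out+=("${pair}")
    fi
  done
  echo "${out[*]}"
}

tp=2
compilation_config_mode=""   # unset in the server parameters
optimization_level=3

metadata="$(build_metadata \
  "tensor_parallel_size=${tp}" \
  "compilation_config.mode=${compilation_config_mode}" \
  "optimization_level=${optimization_level}")"

echo "--metadata ${metadata}"
```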
@@ -534,6 +485,7 @@ main() {
|
|||||||
# benchmarking
|
# benchmarking
|
||||||
run_serving_tests $QUICK_BENCHMARK_ROOT/tests/"${SERVING_JSON:-serving-tests$ARCH.json}"
|
run_serving_tests $QUICK_BENCHMARK_ROOT/tests/"${SERVING_JSON:-serving-tests$ARCH.json}"
|
||||||
run_latency_tests $QUICK_BENCHMARK_ROOT/tests/"${LATENCY_JSON:-latency-tests$ARCH.json}"
|
run_latency_tests $QUICK_BENCHMARK_ROOT/tests/"${LATENCY_JSON:-latency-tests$ARCH.json}"
|
||||||
|
run_startup_tests $QUICK_BENCHMARK_ROOT/tests/"${STARTUP_JSON:-startup-tests$ARCH.json}"
|
||||||
run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/"${THROUGHPUT_JSON:-throughput-tests$ARCH.json}"
|
run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/"${THROUGHPUT_JSON:-throughput-tests$ARCH.json}"
|
||||||
|
|
||||||
# postprocess benchmarking results
|
# postprocess benchmarking results
|
||||||
|
|||||||
@@ -176,23 +176,6 @@ steps:
|
|||||||
env:
|
env:
|
||||||
DOCKER_BUILDKIT: "1"
|
DOCKER_BUILDKIT: "1"
|
||||||
|
|
||||||
- block: "Build release image for x86_64 ROCm"
|
|
||||||
key: block-rocm-release-image-build
|
|
||||||
depends_on: ~
|
|
||||||
|
|
||||||
- label: "Build release image - x86_64 - ROCm"
|
|
||||||
depends_on: block-rocm-release-image-build
|
|
||||||
id: build-release-image-rocm
|
|
||||||
agents:
|
|
||||||
queue: cpu_queue_postmerge
|
|
||||||
commands:
|
|
||||||
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
|
|
||||||
# Build base image first
|
|
||||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --tag rocm/vllm-dev:base-$BUILDKITE_COMMIT --target final --progress plain -f docker/Dockerfile.rocm_base ."
|
|
||||||
# Build vLLM ROCm image using the base
|
|
||||||
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg BASE_IMAGE=rocm/vllm-dev:base-$BUILDKITE_COMMIT --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-rocm --target vllm-openai --progress plain -f docker/Dockerfile.rocm ."
|
|
||||||
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-rocm"
|
|
||||||
|
|
||||||
- group: "Publish release images"
|
- group: "Publish release images"
|
||||||
key: "publish-release-images"
|
key: "publish-release-images"
|
||||||
steps:
|
steps:
|
||||||
@@ -274,14 +257,14 @@ steps:
|
|||||||
- input-release-version
|
- input-release-version
|
||||||
- build-wheels
|
- build-wheels
|
||||||
|
|
||||||
- label: "Upload release wheels to PyPI and GitHub"
|
- label: "Upload release wheels to PyPI"
|
||||||
depends_on:
|
depends_on:
|
||||||
- block-upload-release-wheels
|
- block-upload-release-wheels
|
||||||
id: upload-release-wheels
|
id: upload-release-wheels
|
||||||
agents:
|
agents:
|
||||||
queue: small_cpu_queue_postmerge
|
queue: small_cpu_queue_postmerge
|
||||||
commands:
|
commands:
|
||||||
- "bash .buildkite/scripts/upload-release-wheels.sh"
|
- "bash .buildkite/scripts/upload-release-wheels-pypi.sh"
|
||||||
|
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
# ROCm Release Pipeline (x86_64 only)
|
# ROCm Release Pipeline (x86_64 only)
|
||||||
@@ -476,7 +459,7 @@ steps:
|
|||||||
S3_BUCKET: "vllm-wheels"
|
S3_BUCKET: "vllm-wheels"
|
||||||
|
|
||||||
# ROCm Job 2: Build vLLM ROCm Wheel
|
# ROCm Job 2: Build vLLM ROCm Wheel
|
||||||
- label: ":python: Build vLLM ROCm Wheel"
|
- label: ":python: Build vLLM ROCm Wheel - x86_64"
|
||||||
id: build-rocm-vllm-wheel
|
id: build-rocm-vllm-wheel
|
||||||
depends_on:
|
depends_on:
|
||||||
- step: build-rocm-base-wheels
|
- step: build-rocm-base-wheels
|
||||||
@@ -638,9 +621,93 @@ steps:
|
|||||||
depends_on:
|
depends_on:
|
||||||
- step: upload-rocm-wheels
|
- step: upload-rocm-wheels
|
||||||
allow_failure: true
|
allow_failure: true
|
||||||
|
- step: input-release-version
|
||||||
|
allow_failure: true
|
||||||
agents:
|
agents:
|
||||||
queue: cpu_queue_postmerge
|
queue: cpu_queue_postmerge
|
||||||
commands:
|
commands:
|
||||||
- "bash .buildkite/scripts/annotate-rocm-release.sh"
|
- "bash .buildkite/scripts/annotate-rocm-release.sh"
|
||||||
env:
|
env:
|
||||||
S3_BUCKET: "vllm-wheels"
|
S3_BUCKET: "vllm-wheels"
|
||||||
|
|
||||||
|
# ROCm Job 5: Generate Root Index for ROCm Wheels (for release only)
|
||||||
|
# This is the job to create https://wheels.vllm.ai/rocm/ index allowing
|
||||||
|
# users to install with `uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/`
|
||||||
|
- block: "Generate Root Index for ROCm Wheels for Release"
|
||||||
|
key: block-generate-root-index-rocm-wheels
|
||||||
|
depends_on: upload-rocm-wheels
|
||||||
|
|
||||||
|
- label: ":package: Generate Root Index for ROCm Wheels for Release"
|
||||||
|
depends_on: block-generate-root-index-rocm-wheels
|
||||||
|
id: generate-root-index-rocm-wheels
|
||||||
|
agents:
|
||||||
|
queue: cpu_queue_postmerge
|
||||||
|
commands:
|
||||||
|
- "bash tools/vllm-rocm/generate-rocm-wheels-root-index.sh"
|
||||||
|
env:
|
||||||
|
S3_BUCKET: "vllm-wheels"
|
||||||
|
VARIANT: "rocm700"
|
||||||
|
|
||||||
|
# ROCm Job 5: Build ROCm Release Docker Image
|
||||||
|
- label: ":docker: Build release image - x86_64 - ROCm"
|
||||||
|
id: build-rocm-release-image
|
||||||
|
depends_on:
|
||||||
|
- step: build-rocm-base-wheels
|
||||||
|
allow_failure: false
|
||||||
|
agents:
|
||||||
|
queue: cpu_queue_postmerge
|
||||||
|
timeout_in_minutes: 60
|
||||||
|
commands:
|
||||||
|
- |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Login to ECR
|
||||||
|
aws ecr-public get-login-password --region us-east-1 | \
|
||||||
|
docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7
|
||||||
|
|
||||||
|
# Download Docker image from S3 (set by build-rocm-base-wheels)
|
||||||
|
DOCKER_IMAGE_S3_PATH="$$(buildkite-agent meta-data get rocm-docker-image-s3-path 2>/dev/null || echo '')"
|
||||||
|
if [ -z "$${DOCKER_IMAGE_S3_PATH}" ]; then
|
||||||
|
echo "ERROR: rocm-docker-image-s3-path metadata not found"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo "Downloading base image from $${DOCKER_IMAGE_S3_PATH}"
|
||||||
|
mkdir -p artifacts/rocm-docker-image
|
||||||
|
aws s3 cp "$${DOCKER_IMAGE_S3_PATH}" artifacts/rocm-docker-image/rocm-base-image.tar.gz
|
||||||
|
|
||||||
|
# Load base Docker image
|
||||||
|
echo "Loading base Docker image..."
|
||||||
|
LOAD_OUTPUT=$$(gunzip -c artifacts/rocm-docker-image/rocm-base-image.tar.gz | docker load)
|
||||||
|
BASE_IMAGE_TAG=$$(echo "$${LOAD_OUTPUT}" | grep "Loaded image:" | sed 's/Loaded image: //')
|
||||||
|
echo "Loaded base image: $${BASE_IMAGE_TAG}"
|
||||||
|
|
||||||
|
# Tag and push the base image to ECR
|
||||||
|
docker tag "$${BASE_IMAGE_TAG}" public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base
|
||||||
|
docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base
|
||||||
|
echo "Pushed base image: public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm-base"
|
||||||
|
|
||||||
|
# Get GPU architectures from meta-data
|
||||||
|
PYTORCH_ROCM_ARCH="$$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo '')"
|
||||||
|
PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151}"
|
||||||
|
|
||||||
|
# Build vLLM ROCm release image using cached base
|
||||||
|
DOCKER_BUILDKIT=1 docker build \
|
||||||
|
--build-arg max_jobs=16 \
|
||||||
|
--build-arg BASE_IMAGE="$${BASE_IMAGE_TAG}" \
|
||||||
|
--build-arg ARG_PYTORCH_ROCM_ARCH="$${PYTORCH_ROCM_ARCH}" \
|
||||||
|
--build-arg USE_SCCACHE=1 \
|
||||||
|
--build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
|
||||||
|
--build-arg SCCACHE_REGION_NAME=us-west-2 \
|
||||||
|
--build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
|
||||||
|
--tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm \
|
||||||
|
--target vllm-openai \
|
||||||
|
--progress plain \
|
||||||
|
-f docker/Dockerfile.rocm .
|
||||||
|
|
||||||
|
# Push to ECR
|
||||||
|
docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm
|
||||||
|
echo "Pushed: public.ecr.aws/q9t5s3a7/vllm-release-repo:$${BUILDKITE_COMMIT}-rocm"
|
||||||
|
env:
|
||||||
|
DOCKER_BUILDKIT: "1"
|
||||||
|
S3_BUCKET: "vllm-wheels"
|
||||||
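The "Load base Docker image" step above derives `BASE_IMAGE_TAG` by grepping the output of `docker load`. A minimal sketch of that parsing as a standalone function follows, assuming the output contains lines of the form `Loaded image: <ref>`. The name `parse_loaded_image` is hypothetical and not part of the pipeline. Unlike the inline version, it keeps only the first match and prints an explicit error when no match is found.

```bash
#!/usr/bin/env bash

# Hypothetical helper: extract the first image reference from `docker load` output.
# Returns non-zero (with a message on stderr) if no "Loaded image:" line is present.
parse_loaded_image() {
  local load_output="$1"
  local tag
  # -n with the p flag prints only lines where the substitution happened.
  tag="$(printf '%s\n' "${load_output}" | sed -n 's/^Loaded image: //p' | head -n 1)"
  if [ -z "${tag}" ]; then
    echo "ERROR: no 'Loaded image:' line in docker load output" >&2
    return 1
  fi
  printf '%s\n' "${tag}"
}

# Example with a canned string instead of a real `docker load` call.
sample_output='Loaded image: rocm/vllm-dev:base'
parse_loaded_image "${sample_output}"
```

In the pipeline step this could be used as `BASE_IMAGE_TAG="$(parse_loaded_image "${LOAD_OUTPUT}")"`, so that a missing tag fails with a readable message rather than a silent `set -e` exit.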
|
|||||||
@@ -11,51 +11,102 @@ fi
|
|||||||
buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
|
buildkite-agent annotate --style 'info' --context 'release-workflow' << EOF
|
||||||
To download the wheel (by commit):
|
To download the wheel (by commit):
|
||||||
\`\`\`
|
\`\`\`
|
||||||
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
|
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux_2_31_x86_64.whl .
|
||||||
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .
|
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux_2_31_aarch64.whl .
|
||||||
|
|
||||||
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
|
(Optional) For CUDA 13.0:
|
||||||
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
|
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux_2_35_x86_64.whl .
|
||||||
|
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux_2_35_aarch64.whl .
|
||||||
|
|
||||||
|
(Optional) For CPU:
|
||||||
|
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-abi3-manylinux_2_35_x86_64.whl .
|
||||||
|
aws s3 cp s3://vllm-wheels/${BUILDKITE_COMMIT}/vllm-${RELEASE_VERSION}+cpu-cp38-abi3-manylinux_2_35_aarch64.whl .
|
||||||
\`\`\`
|
\`\`\`
|
||||||
|
|
||||||
To download the wheel (by version):
|
|
||||||
\`\`\`
|
|
||||||
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux1_x86_64.whl .
|
|
||||||
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}/vllm-${RELEASE_VERSION}-cp38-abi3-manylinux2014_aarch64.whl .
|
|
||||||
|
|
||||||
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu129/vllm-${RELEASE_VERSION}+cu129-cp38-abi3-manylinux1_x86_64.whl .
|
|
||||||
aws s3 cp s3://vllm-wheels/${RELEASE_VERSION}+cu130/vllm-${RELEASE_VERSION}+cu130-cp38-abi3-manylinux1_x86_64.whl .
|
|
||||||
\`\`\`
|
|
||||||
|
|
||||||
To download and upload the image:
|
To download and upload the image:
|
||||||
|
|
||||||
\`\`\`
|
\`\`\`
|
||||||
|
Download images:
|
||||||
|
|
||||||
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
|
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64
|
||||||
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
|
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64
|
||||||
|
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu130
|
||||||
|
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu130
|
||||||
|
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base
|
||||||
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
|
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
|
||||||
|
|
||||||
|
Tag and push images:
|
||||||
|
|
||||||
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
|
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64 vllm/vllm-openai:x86_64
|
||||||
docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
|
docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:latest-x86_64
|
||||||
docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
|
docker tag vllm/vllm-openai:x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
|
||||||
docker push vllm/vllm-openai:latest-x86_64
|
docker push vllm/vllm-openai:latest-x86_64
|
||||||
docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
|
docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64
|
||||||
|
|
||||||
|
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-x86_64-cu130 vllm/vllm-openai:x86_64-cu130
|
||||||
|
docker tag vllm/vllm-openai:x86_64-cu130 vllm/vllm-openai:latest-x86_64-cu130
|
||||||
|
docker tag vllm/vllm-openai:x86_64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130
|
||||||
|
docker push vllm/vllm-openai:latest-x86_64-cu130
|
||||||
|
docker push vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130
|
||||||
|
|
||||||
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 vllm/vllm-openai:aarch64
|
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64 vllm/vllm-openai:aarch64
|
||||||
docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:latest-aarch64
|
docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:latest-aarch64
|
||||||
docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
|
docker tag vllm/vllm-openai:aarch64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
|
||||||
docker push vllm/vllm-openai:latest-aarch64
|
docker push vllm/vllm-openai:latest-aarch64
|
||||||
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
|
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
|
||||||
|
|
||||||
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai:rocm
|
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-aarch64-cu130 vllm/vllm-openai:aarch64-cu130
|
||||||
docker tag vllm/vllm-openai:rocm vllm/vllm-openai:latest-rocm
|
docker tag vllm/vllm-openai:aarch64-cu130 vllm/vllm-openai:latest-aarch64-cu130
|
||||||
docker tag vllm/vllm-openai:rocm vllm/vllm-openai:v${RELEASE_VERSION}-rocm
|
docker tag vllm/vllm-openai:aarch64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130
|
||||||
docker push vllm/vllm-openai:latest-rocm
|
docker push vllm/vllm-openai:latest-aarch64-cu130
|
||||||
docker push vllm/vllm-openai:v${RELEASE_VERSION}-rocm
|
docker push vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130
|
||||||
|
|
||||||
|
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-rocm
|
||||||
|
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:latest
|
||||||
|
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:v${RELEASE_VERSION}-rocm
|
||||||
|
docker push vllm/vllm-openai-rocm:latest
|
||||||
|
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-rocm
|
||||||
|
|
||||||
|
Create multi-arch manifest:
|
||||||
|
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base
|
||||||
|
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:latest-base
|
||||||
|
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
|
||||||
|
docker push vllm/vllm-openai-rocm:latest-base
|
||||||
|
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
|
||||||
|
|
||||||
docker manifest rm vllm/vllm-openai:latest
|
docker manifest rm vllm/vllm-openai:latest
|
||||||
docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64
|
docker manifest create vllm/vllm-openai:latest vllm/vllm-openai:latest-x86_64 vllm/vllm-openai:latest-aarch64
|
||||||
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
|
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION} vllm/vllm-openai:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64
|
||||||
docker manifest push vllm/vllm-openai:latest
|
docker manifest push vllm/vllm-openai:latest
|
||||||
docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}
|
docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}
|
||||||
|
|
||||||
|
docker manifest rm vllm/vllm-openai:latest-cu130
|
||||||
|
docker manifest create vllm/vllm-openai:latest-cu130 vllm/vllm-openai:latest-x86_64-cu130 vllm/vllm-openai:latest-aarch64-cu130
|
||||||
|
docker manifest create vllm/vllm-openai:v${RELEASE_VERSION}-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-x86_64-cu130 vllm/vllm-openai:v${RELEASE_VERSION}-aarch64-cu130
|
||||||
|
docker manifest push vllm/vllm-openai:latest-cu130
|
||||||
|
docker manifest push vllm/vllm-openai:v${RELEASE_VERSION}-cu130
|
||||||
|
|
||||||
|
# CPU images (vllm/vllm-openai-cpu)
|
||||||
|
docker pull public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v${RELEASE_VERSION}
|
||||||
|
docker pull public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${RELEASE_VERSION}
|
||||||
|
|
||||||
|
docker tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v${RELEASE_VERSION} vllm/vllm-openai-cpu:x86_64
|
||||||
|
docker tag vllm/vllm-openai-cpu:x86_64 vllm/vllm-openai-cpu:latest-x86_64
|
||||||
|
docker tag vllm/vllm-openai-cpu:x86_64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64
|
||||||
|
docker push vllm/vllm-openai-cpu:latest-x86_64
|
||||||
|
docker push vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64
|
||||||
|
|
||||||
|
docker tag public.ecr.aws/q9t5s3a7/vllm-arm64-cpu-release-repo:v${RELEASE_VERSION} vllm/vllm-openai-cpu:arm64
|
||||||
|
docker tag vllm/vllm-openai-cpu:arm64 vllm/vllm-openai-cpu:latest-arm64
|
||||||
|
docker tag vllm/vllm-openai-cpu:arm64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
|
||||||
|
docker push vllm/vllm-openai-cpu:latest-arm64
|
||||||
|
docker push vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
|
||||||
|
|
||||||
|
docker manifest rm vllm/vllm-openai-cpu:latest || true
|
||||||
|
docker manifest create vllm/vllm-openai-cpu:latest vllm/vllm-openai-cpu:latest-x86_64 vllm/vllm-openai-cpu:latest-arm64
|
||||||
|
docker manifest create vllm/vllm-openai-cpu:v${RELEASE_VERSION} vllm/vllm-openai-cpu:v${RELEASE_VERSION}-x86_64 vllm/vllm-openai-cpu:v${RELEASE_VERSION}-arm64
|
||||||
|
docker manifest push vllm/vllm-openai-cpu:latest
|
||||||
|
docker manifest push vllm/vllm-openai-cpu:v${RELEASE_VERSION}
|
||||||
\`\`\`
|
\`\`\`
|
||||||
EOF
|
EOF
|
||||||
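The tag-and-push instructions in the annotation above repeat the same five commands for each image suffix (`x86_64`, `aarch64`, `x86_64-cu130`, `aarch64-cu130`). A minimal sketch of generating them in a loop follows. The function name `print_retag_commands` and the sample commit and version values are hypothetical. The function only prints the commands and never calls `docker`.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print (not run) the tag and push commands for one image suffix.
print_retag_commands() {
  local commit="$1" version="$2" suffix="$3"
  local src="public.ecr.aws/q9t5s3a7/vllm-release-repo:${commit}-${suffix}"
  local dst="vllm/vllm-openai"
  echo "docker tag ${src} ${dst}:${suffix}"
  echo "docker tag ${dst}:${suffix} ${dst}:latest-${suffix}"
  echo "docker tag ${dst}:${suffix} ${dst}:v${version}-${suffix}"
  echo "docker push ${dst}:latest-${suffix}"
  echo "docker push ${dst}:v${version}-${suffix}"
}

# Placeholder commit and version, for illustration only.
for suffix in x86_64 aarch64 x86_64-cu130 aarch64-cu130; do
  print_retag_commands "abc1234" "0.0.0" "${suffix}"
  echo
done
```

Generating the block this way would keep the per-architecture sections from drifting apart when a new suffix is added. The ROCm and CPU images use different repositories and tag layouts, so this sketch does not cover them.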
|
|||||||
@@ -3,25 +3,32 @@
|
|||||||
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||||
#
|
#
|
||||||
# Generate Buildkite annotation for ROCm wheel release
|
# Generate Buildkite annotation for ROCm wheel release
|
||||||
|
|
||||||
set -ex
|
set -ex
|
||||||
|
|
||||||
# Get build configuration from meta-data
|
# Get build configuration from meta-data
|
||||||
# Extract ROCm version dynamically from Dockerfile.rocm_base
|
# Extract ROCm version dynamically from Dockerfile.rocm_base
|
||||||
# BASE_IMAGE format: rocm/dev-ubuntu-22.04:7.1-complete -> extracts "7.1"
|
# BASE_IMAGE format: rocm/dev-ubuntu-22.04:7.0-complete -> extracts "7.0"
|
||||||
ROCM_VERSION=$(grep -E '^ARG BASE_IMAGE=' docker/Dockerfile.rocm_base | sed -E 's/.*:([0-9]+\.[0-9]+).*/\1/' || echo "unknown")
|
ROCM_VERSION=$(grep -E '^ARG BASE_IMAGE=' docker/Dockerfile.rocm_base | sed -E 's/.*:([0-9]+\.[0-9]+).*/\1/' || echo "unknown")
|
||||||
PYTHON_VERSION=$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo "3.12")
|
PYTHON_VERSION=$(buildkite-agent meta-data get rocm-python-version 2>/dev/null || echo "3.12")
|
||||||
PYTORCH_ROCM_ARCH=$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151")
|
PYTORCH_ROCM_ARCH=$(buildkite-agent meta-data get rocm-pytorch-rocm-arch 2>/dev/null || echo "gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151")
|
||||||
|
|
||||||
|
# TODO: Enable the nightly build for ROCm
|
||||||
|
# Get release version, default to 1.0.0.dev for nightly/per-commit builds
|
||||||
|
RELEASE_VERSION=$(buildkite-agent meta-data get release-version 2>/dev/null || echo "")
|
||||||
|
if [ -z "${RELEASE_VERSION}" ]; then
|
||||||
|
RELEASE_VERSION="1.0.0.dev"
|
||||||
|
fi
|
||||||
|
|
||||||
# S3 URLs
|
# S3 URLs
|
||||||
S3_BUCKET="${S3_BUCKET:-vllm-wheels}"
|
S3_BUCKET="${S3_BUCKET:-vllm-wheels}"
|
||||||
S3_REGION="${AWS_DEFAULT_REGION:-us-west-2}"
|
S3_REGION="${AWS_DEFAULT_REGION:-us-west-2}"
|
||||||
S3_URL="https://${S3_BUCKET}.s3.${S3_REGION}.amazonaws.com"
|
S3_URL="http://${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com"
|
||||||
ROCM_PATH="rocm/${BUILDKITE_COMMIT}"
|
|
||||||
|
|
||||||
|
# Format ROCm version for path (e.g., "7.0" -> "rocm700"); the trailing 0 assumes patch level 0
|
||||||
|
ROCM_VERSION_PATH="rocm$(echo "${ROCM_VERSION}" | tr -d '.')0"
|
||||||
|
ROCM_PATH="rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}"
|
||||||
buildkite-agent annotate --style 'success' --context 'rocm-release-workflow' << EOF
|
buildkite-agent annotate --style 'success' --context 'rocm-release-workflow' << EOF
|
||||||
## :rocm: ROCm Wheel Release
|
## ROCm Wheel and Docker Image Releases
|
||||||
|
|
||||||
### Build Configuration
|
### Build Configuration
|
||||||
| Setting | Value |
|
| Setting | Value |
|
||||||
|---------|-------|
|
|---------|-------|
|
||||||
@@ -34,41 +41,72 @@ buildkite-agent annotate --style 'success' --context 'rocm-release-workflow' <<
|
|||||||
### :package: Installation
|
### :package: Installation
|
||||||
|
|
||||||
**Install from this build (by commit):**
|
**Install from this build (by commit):**
|
||||||
\`\`\`bash
|
|
||||||
uv pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/{rocm_variant}/
|
|
||||||
|
|
||||||
# Example:
|
\`\`\`bash
|
||||||
uv pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/rocm700/
|
pip install vllm --extra-index-url ${S3_URL}/${ROCM_PATH}/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com
|
||||||
|
|
||||||
|
# Example for ROCm ${ROCM_VERSION}:
|
||||||
|
pip install vllm --extra-index-url ${S3_URL}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com
|
||||||
\`\`\`
|
\`\`\`
|
||||||
|
|
||||||
**Install from nightly (if published):**
|
**Install from nightly (if published):**
|
||||||
|
|
||||||
\`\`\`bash
|
\`\`\`bash
|
||||||
uv pip install vllm --extra-index-url ${S3_URL}/rocm/nightly/
|
pip install vllm --extra-index-url ${S3_URL}/rocm/nightly/ --trusted-host ${S3_BUCKET}.s3-website-${S3_REGION}.amazonaws.com
|
||||||
\`\`\`
|
\`\`\`
|
||||||
|
|
||||||
### :floppy_disk: Download Wheels Directly
|
### :floppy_disk: Download Wheels Directly
|
||||||
|
|
||||||
\`\`\`bash
|
\`\`\`bash
|
||||||
# List all ROCm wheels
|
# List all ROCm wheels
|
||||||
aws s3 ls s3://${S3_BUCKET}/${ROCM_PATH}/
|
aws s3 ls s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/
|
||||||
|
|
||||||
# Download specific wheels
|
# Download specific wheels
|
||||||
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/vllm-*.whl .
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/vllm-*.whl .
|
||||||
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/torch-*.whl .
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torch-*.whl .
|
||||||
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/triton_rocm-*.whl .
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/triton-*.whl .
|
||||||
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/torchvision-*.whl .
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/triton-kernels-*.whl .
|
||||||
aws s3 cp s3://${S3_BUCKET}/${ROCM_PATH}/amdsmi-*.whl .
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torchvision-*.whl .
|
||||||
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/torchaudio-*.whl .
|
||||||
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/amdsmi-*.whl .
|
||||||
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/aiter-*.whl .
|
||||||
|
aws s3 cp s3://${S3_BUCKET}/rocm/${BUILDKITE_COMMIT}/${ROCM_VERSION_PATH}/flash-attn-*.whl .
|
||||||
\`\`\`
|
\`\`\`
|
||||||
|
|
||||||
### :gear: Included Packages
|
### :gear: Included Packages
|
||||||
- **vllm**: vLLM with ROCm support
|
- **vllm**: vLLM with ROCm support
|
||||||
- **torch**: PyTorch built for ROCm ${ROCM_VERSION}
|
- **torch**: PyTorch built for ROCm ${ROCM_VERSION}
|
||||||
- **triton_rocm**: Triton built for ROCm
|
- **triton**: Triton
|
||||||
|
- **triton-kernels**: Triton kernels
|
||||||
- **torchvision**: TorchVision for ROCm PyTorch
|
- **torchvision**: TorchVision for ROCm PyTorch
|
||||||
|
- **torchaudio**: Torchaudio for ROCm PyTorch
|
||||||
- **amdsmi**: AMD SMI Python bindings
|
- **amdsmi**: AMD SMI Python bindings
|
||||||
|
- **aiter**: Aiter for ROCm
|
||||||
|
- **flash-attn**: Flash Attention for ROCm
|
||||||
|
|
||||||
### :warning: Notes
|
### :warning: Notes
|
||||||
- These wheels are built for **ROCm ${ROCM_VERSION}** and will NOT work with CUDA GPUs
|
- These wheels are built for **ROCm ${ROCM_VERSION}** and will NOT work with CUDA GPUs
|
||||||
- Supported GPU architectures: ${PYTORCH_ROCM_ARCH}
|
- Supported GPU architectures: ${PYTORCH_ROCM_ARCH}
|
||||||
- Platform: Linux x86_64 only
|
- Platform: Linux x86_64 only
|
||||||
|
|
||||||
|
### :package: Docker Image Release
|
||||||
|
|
||||||
|
To download and upload the image:
|
||||||
|
|
||||||
|
\`\`\`
|
||||||
|
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base
|
||||||
|
docker pull public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm
|
||||||
|
|
||||||
|
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm-base vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base
|
||||||
|
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:latest-base
|
||||||
|
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}-base vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
|
||||||
|
docker push vllm/vllm-openai-rocm:latest-base
|
||||||
|
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}-base
|
||||||
|
|
||||||
|
docker tag public.ecr.aws/q9t5s3a7/vllm-release-repo:${BUILDKITE_COMMIT}-rocm vllm/vllm-openai-rocm:${BUILDKITE_COMMIT}
|
||||||
|
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:latest
|
||||||
|
docker tag vllm/vllm-openai-rocm:${BUILDKITE_COMMIT} vllm/vllm-openai-rocm:v${RELEASE_VERSION}
|
||||||
|
docker push vllm/vllm-openai-rocm:latest
|
||||||
|
docker push vllm/vllm-openai-rocm:v${RELEASE_VERSION}
|
||||||
|
\`\`\`
|
||||||
|
|
||||||
EOF
|
EOF
|
||||||
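The script above extracts the ROCm version from the `ARG BASE_IMAGE=` line and turns it into a path component. A minimal sketch of both steps as testable functions follows. The function names are hypothetical. The mapping to `rocm700` assumes the variant directory encodes a patch level of 0, as the `VARIANT: "rocm700"` setting in the pipeline suggests. Note that without `pipefail`, a `grep ... | sed ... || echo "unknown"` pipeline takes its exit status from `sed`, so the fallback does not fire when `grep` finds nothing; the sketch handles the empty case explicitly.

```bash
#!/usr/bin/env bash

# Hypothetical helper: pull "MAJOR.MINOR" out of an "ARG BASE_IMAGE=...:<version>-..." line.
# Prints "unknown" when the line does not contain a version tag.
rocm_version_from_line() {
  local line="$1" version
  version="$(printf '%s\n' "${line}" | sed -nE 's/.*:([0-9]+\.[0-9]+).*/\1/p')"
  printf '%s\n' "${version:-unknown}"
}

# Hypothetical helper: map "7.0" to "rocm700".
# Assumption: the variant directory name ends in a patch level of 0.
rocm_variant_dir() {
  local version="$1"
  printf 'rocm%s0\n' "$(printf '%s' "${version}" | tr -d '.')"
}

line='ARG BASE_IMAGE=rocm/dev-ubuntu-22.04:7.0-complete'
version="$(rocm_version_from_line "${line}")"
echo "ROCm ${version} -> $(rocm_variant_dir "${version}")"
```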
|
|||||||
@@ -112,7 +112,7 @@ def parse_from_filename(file: str) -> WheelFileInfo:
|
|||||||
|
|
||||||
def generate_project_list(subdir_names: list[str], comment: str = "") -> str:
|
def generate_project_list(subdir_names: list[str], comment: str = "") -> str:
|
||||||
"""
|
"""
|
||||||
Generate project list HTML content linking to each project & variant sub-directory.
|
Generate project list HTML content linking to each project & variant subdirectory.
|
||||||
"""
|
"""
|
||||||
href_tags = []
|
href_tags = []
|
||||||
for name in sorted(subdir_names):
|
for name in sorted(subdir_names):
|
||||||
@@ -168,23 +168,23 @@ def generate_index_and_metadata(
|
|||||||
comment (str | None): Optional comment to include in the generated HTML files.
|
comment (str | None): Optional comment to include in the generated HTML files.
|
||||||
|
|
||||||
First, parse all wheel files to extract metadata.
|
First, parse all wheel files to extract metadata.
|
||||||
We need to collect all wheel files for each variant, and generate an index for it (in a sub-directory).
|
We need to collect all wheel files for each variant, and generate an index for it (in a subdirectory).
|
||||||
The index for the default variant (if any) is generated in the root index directory.
|
The index for the default variant (if any) is generated in the root index directory.
|
||||||
|
|
||||||
If `default_variant` is provided, all wheels must have variant suffixes, and the default variant index
|
If `default_variant` is provided, all wheels must have variant suffixes, and the default variant index
|
||||||
is purely a copy of the corresponding variant index, with only the links adjusted.
|
is purely a copy of the corresponding variant index, with only the links adjusted.
|
||||||
Otherwise, all wheels without variant suffixes are treated as the default variant.
|
Otherwise, all wheels without variant suffixes are treated as the default variant.
|
||||||
|
|
||||||
If `alias_to_default` is provided, an additional alias sub-directory is created, it has the same content
|
If `alias_to_default` is provided, an additional alias subdirectory is created; it has the same content
|
||||||
as the default variant index, but the links are adjusted accordingly.
|
as the default variant index, but the links are adjusted accordingly.
|
||||||
|
|
||||||
Index directory structure:
|
Index directory structure:
|
||||||
index_base_dir/ (hosted at wheels.vllm.ai/{nightly,$commit,$version}/)
|
index_base_dir/ (hosted at wheels.vllm.ai/{nightly,$commit,$version}/)
|
||||||
index.html # project list, linking to "vllm/" and other packages, and all variant sub-directories
|
index.html # project list, linking to "vllm/" and other packages, and all variant subdirectories
|
||||||
vllm/
|
vllm/
|
||||||
index.html # package index, pointing to actual files in wheel_base_dir (relative path)
|
index.html # package index, pointing to actual files in wheel_base_dir (relative path)
|
||||||
metadata.json # machine-readable metadata for all wheels in this package
|
metadata.json # machine-readable metadata for all wheels in this package
|
||||||
cpu/ # cpu variant sub-directory
|
cpu/ # cpu variant subdirectory
|
||||||
index.html
|
index.html
|
||||||
vllm/
|
vllm/
|
||||||
index.html
|
index.html
|
||||||
@@ -194,7 +194,7 @@ def generate_index_and_metadata(
|
|||||||
vllm/
|
vllm/
|
||||||
index.html
|
index.html
|
||||||
metadata.json
|
metadata.json
|
||||||
cu130/ # cu130 variant sub-directory
|
cu130/ # cu130 variant subdirectory
|
||||||
index.html
|
index.html
|
||||||
vllm/
|
vllm/
|
||||||
index.html
|
index.html
|
||||||
|
|||||||
@@ -44,6 +44,17 @@ cleanup_docker() {
|
|||||||
fi
|
fi
|
||||||
}
|
}
|
||||||
|
|
||||||
|
cleanup_network() {
|
||||||
|
for node in $(seq 0 $((NUM_NODES-1))); do
|
||||||
|
if docker ps -a -q -f name="node${node}" | grep -q .; then
|
||||||
|
docker stop "node${node}"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
if docker network ls | grep docker-net; then
|
||||||
|
docker network rm docker-net
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
# Call the cleanup docker function
|
# Call the cleanup docker function
|
||||||
cleanup_docker
|
cleanup_docker
|
||||||
|
|
||||||
@@ -76,7 +87,7 @@ mkdir -p "${HF_CACHE}"
|
|||||||
HF_MOUNT="/root/.cache/huggingface"
|
HF_MOUNT="/root/.cache/huggingface"
|
||||||
|
|
||||||
commands=$@
|
commands=$@
|
||||||
echo "Commands:$commands"
|
echo "Raw commands: $commands"
|
||||||
|
|
||||||
commands=${commands//"pytest -v -s basic_correctness/test_basic_correctness.py"/"pytest -v -s basic_correctness/test_basic_correctness.py"}
|
commands=${commands//"pytest -v -s basic_correctness/test_basic_correctness.py"/"pytest -v -s basic_correctness/test_basic_correctness.py"}
|
||||||
|
|
||||||
@@ -158,6 +169,9 @@ if [[ $commands == *" entrypoints/llm "* ]]; then
|
|||||||
--ignore=entrypoints/llm/test_prompt_validation.py "}
|
--ignore=entrypoints/llm/test_prompt_validation.py "}
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
commands=$(echo "$commands" | sed 's/ \\ / /g')
|
||||||
|
echo "Final commands: $commands"
|
||||||
|
|
||||||
# --ignore=entrypoints/openai/test_encoder_decoder.py \
|
# --ignore=entrypoints/openai/test_encoder_decoder.py \
|
||||||
# --ignore=entrypoints/openai/test_embedding.py \
|
# --ignore=entrypoints/openai/test_embedding.py \
|
||||||
# --ignore=entrypoints/openai/test_oot_registration.py
|
# --ignore=entrypoints/openai/test_oot_registration.py
|
||||||
@@ -165,7 +179,6 @@ fi
|
|||||||
# --ignore=entrypoints/openai/test_models.py <= Fails on MI250 but passes on MI300 as of 2025-03-13
|
# --ignore=entrypoints/openai/test_models.py <= Fails on MI250 but passes on MI300 as of 2025-03-13
|
||||||
|
|
||||||
|
|
||||||
PARALLEL_JOB_COUNT=8
|
|
||||||
MYPYTHONPATH=".."
|
MYPYTHONPATH=".."
|
||||||
|
|
||||||
# Test that we're launching on the machine that has
|
# Test that we're launching on the machine that has
|
||||||
@@ -176,53 +189,33 @@ if [[ -z "$render_gid" ]]; then
|
|||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# Check if the command contains a shard flag; we will run all shards in parallel because the host has 8 GPUs.
|
if [[ $commands == *"VLLM_TEST_GROUP_NAME=mi325_4-2-node-tests-4-gpus-in-total"* ]]; then
|
||||||
if [[ $commands == *"--shard-id="* ]]; then
|
|
||||||
# assign job count as the number of shards used
|
export DCKR_VER=$(docker --version | sed 's/Docker version \(.*\), build .*/\1/')
|
||||||
commands=$(echo "$commands" | sed -E "s/--num-shards[[:blank:]]*=[[:blank:]]*[0-9]*/--num-shards=${PARALLEL_JOB_COUNT} /g" | sed 's/ \\ / /g')
|
|
||||||
for GPU in $(seq 0 $(($PARALLEL_JOB_COUNT-1))); do
|
if [[ "$commands" =~ ^(.*)"["(.*)"] && ["(.*)"]"$ ]]; then
|
||||||
# assign shard-id for each shard
|
prefix=$( echo "${BASH_REMATCH[1]}" | sed 's/;//g')
|
||||||
commands_gpu=$(echo "$commands" | sed -E "s/--shard-id[[:blank:]]*=[[:blank:]]*[0-9]*/--shard-id=${GPU} /g" | sed 's/ \\ / /g')
|
echo "PREFIX: ${prefix}"
|
||||||
echo "Shard ${GPU} commands:$commands_gpu"
|
export composite_command="(command rocm-smi || true)"
|
||||||
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
|
myIFS=$IFS
|
||||||
docker run \
|
IFS=','
|
||||||
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
|
read -ra node0 <<< ${BASH_REMATCH[2]}
|
||||||
--network=host \
|
read -ra node1 <<< ${BASH_REMATCH[3]}
|
||||||
--shm-size=16gb \
|
IFS=$myIFS
|
||||||
--group-add "$render_gid" \
|
for i in "${!node0[@]}";do
|
||||||
--rm \
|
command_node_0=$(echo ${node0[i]} | sed 's/\"//g')
|
||||||
-e HIP_VISIBLE_DEVICES="${GPU}" \
|
command_node_1=$(echo ${node1[i]} | sed 's/\"//g')
|
||||||
-e HF_TOKEN \
|
|
||||||
-e AWS_ACCESS_KEY_ID \
|
export commands="./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 ${image_name} '${command_node_0}' '${command_node_1}'"
|
||||||
-e AWS_SECRET_ACCESS_KEY \
|
echo "COMMANDS: ${commands}"
|
||||||
-v "${HF_CACHE}:${HF_MOUNT}" \
|
composite_command=$(echo "${composite_command} && ${commands}")
|
||||||
-e "HF_HOME=${HF_MOUNT}" \
|
done
|
||||||
-e "PYTHONPATH=${MYPYTHONPATH}" \
|
/bin/bash -c "${composite_command}"
|
||||||
--name "${container_name}_${GPU}" \
|
cleanup_network
|
||||||
"${image_name}" \
|
else
|
||||||
/bin/bash -c "${commands_gpu}" \
|
echo "Failed to parse node commands! Exiting."
|
||||||
|& while read -r line; do echo ">>Shard $GPU: $line"; done &
|
cleanup_network
|
||||||
PIDS+=($!)
|
exit 111
|
||||||
done
|
|
||||||
#wait for all processes to finish and collect exit codes
|
|
||||||
for pid in "${PIDS[@]}"; do
|
|
||||||
wait "${pid}"
|
|
||||||
STATUS+=($?)
|
|
||||||
done
|
|
||||||
at_least_one_shard_with_tests=0
|
|
||||||
for st in "${STATUS[@]}"; do
|
|
||||||
if [[ ${st} -ne 0 ]] && [[ ${st} -ne 5 ]]; then
|
|
||||||
echo "One of the processes failed with $st"
|
|
||||||
exit "${st}"
|
|
||||||
elif [[ ${st} -eq 5 ]]; then
|
|
||||||
echo "Shard exited with status 5 (no tests collected) - treating as success"
|
|
||||||
else # This means st is 0
|
|
||||||
at_least_one_shard_with_tests=1
|
|
||||||
fi
|
|
||||||
done
|
|
||||||
if [[ ${#STATUS[@]} -gt 0 && ${at_least_one_shard_with_tests} -eq 0 ]]; then
|
|
||||||
echo "All shards reported no tests collected. Failing the build."
|
|
||||||
exit 1
|
|
||||||
fi
|
fi
|
||||||
else
|
else
|
||||||
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
|
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
|
||||||
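The multi-node branch above expects `commands` in the form `PREFIX [cmd, cmd] && [cmd, cmd]` and splits it with a `[[ =~ ]]` match plus `read -ra`. A minimal sketch of that parsing in isolation follows. The function names and the sample command string are hypothetical. Scoping `IFS=','` to each `read` avoids having to save and restore the global `IFS`.

```bash
#!/usr/bin/env bash

# Hypothetical helper: split 'PREFIX [a, b] && [c, d]' into per-node command lists.
# On success, sets PREFIX and fills the global arrays NODE0_CMDS and NODE1_CMDS.
split_node_commands() {
  local input="$1"
  # Quoted parts of the pattern are matched literally; the groups capture the rest.
  if [[ "${input}" =~ ^(.*)"["(.*)"] && ["(.*)"]"$ ]]; then
    PREFIX="${BASH_REMATCH[1]}"
    IFS=',' read -ra NODE0_CMDS <<< "${BASH_REMATCH[2]}"
    IFS=',' read -ra NODE1_CMDS <<< "${BASH_REMATCH[3]}"
    return 0
  fi
  return 1
}

# Hypothetical helper: drop double quotes and leading whitespace from one entry.
clean_command() {
  local cmd="${1//\"/}"
  cmd="${cmd#"${cmd%%[![:space:]]*}"}"
  printf '%s\n' "${cmd}"
}

spec='VLLM_X=1; ["pytest -v a.py", "pytest -v b.py"] && ["pytest -v c.py", "pytest -v d.py"]'
if split_node_commands "${spec}"; then
  for i in "${!NODE0_CMDS[@]}"; do
    echo "pair ${i}: node0='$(clean_command "${NODE0_CMDS[i]}")' node1='$(clean_command "${NODE1_CMDS[i]}")'"
  done
else
  echo "Failed to parse node commands" >&2
fi
```

Because the list separator is a plain comma, this format cannot carry commands that themselves contain commas. That limitation applies equally to the script above.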
|
|||||||
@@ -5,7 +5,9 @@
|
|||||||
set -exuo pipefail
|
set -exuo pipefail
|
||||||
|
|
||||||
# Try building the docker image
|
# Try building the docker image
|
||||||
cat <<EOF | docker build -t hpu-plugin-v1-test-env -f - .
|
image_name="hpu/upstream-vllm-ci:${BUILDKITE_COMMIT}"
|
||||||
|
container_name="hpu-upstream-vllm-ci-${BUILDKITE_COMMIT}-container"
|
||||||
|
cat <<EOF | docker build -t ${image_name} -f - .
|
||||||
FROM gaudi-base-image:latest
|
FROM gaudi-base-image:latest
|
||||||
|
|
||||||
COPY ./ /workspace/vllm
|
COPY ./ /workspace/vllm
|
||||||
@@ -15,7 +17,8 @@ WORKDIR /workspace/vllm
|
|||||||
ENV no_proxy=localhost,127.0.0.1
|
ENV no_proxy=localhost,127.0.0.1
|
||||||
ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true
|
ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true
|
||||||
|
|
||||||
RUN VLLM_TARGET_DEVICE=empty pip install .
|
RUN bash -c 'pip install -r <(sed "/^torch/d" requirements/build.txt)'
|
||||||
|
RUN VLLM_TARGET_DEVICE=empty pip install --no-build-isolation -e .
|
||||||
RUN pip install git+https://github.com/vllm-project/vllm-gaudi.git
|
RUN pip install git+https://github.com/vllm-project/vllm-gaudi.git
|
||||||
|
|
||||||
# install development dependencies (for testing)
|
# install development dependencies (for testing)
|
||||||
@@ -36,15 +39,20 @@ EOF
|
|||||||
# functions, while other platforms only need one remove_docker_container
|
# functions, while other platforms only need one remove_docker_container
|
||||||
# function.
|
# function.
|
||||||
EXITCODE=1
|
EXITCODE=1
|
||||||
remove_docker_containers() { docker rm -f hpu-plugin-v1-test || true; }
|
remove_docker_containers() { docker rm -f ${container_name} || true; }
|
||||||
trap 'remove_docker_containers; exit $EXITCODE;' EXIT
|
trap 'remove_docker_containers; exit $EXITCODE;' EXIT
|
||||||
remove_docker_containers
|
remove_docker_containers
|
||||||
|
|
||||||
echo "Running HPU plugin v1 test"
|
echo "Running HPU plugin v1 test"
|
||||||
docker run --rm --runtime=habana --name=hpu-plugin-v1-test --network=host \
|
docker run --rm --runtime=habana --name=${container_name} --network=host \
|
||||||
-e HABANA_VISIBLE_DEVICES=all \
|
-e HABANA_VISIBLE_DEVICES=all \
|
||||||
hpu-plugin-v1-test-env \
|
-e VLLM_SKIP_WARMUP=true \
|
||||||
/bin/bash "/workspace/vllm-gaudi/tests/upstream_tests/ci_tests.sh"
|
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
|
||||||
|
-e PT_HPU_LAZY_MODE=1 \
|
||||||
|
"${image_name}" \
|
||||||
|
/bin/bash -c '
|
||||||
|
cd vllm; timeout 120s python -u examples/offline_inference/basic/generate.py --model facebook/opt-125m
|
||||||
|
'
|
||||||
|
|
||||||
EXITCODE=$?
|
EXITCODE=$?
|
||||||
if [ $EXITCODE -eq 0 ]; then
|
if [ $EXITCODE -eq 0 ]; then
|
||||||
|
|||||||
@@ -38,15 +38,16 @@ docker run \
|
|||||||
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 -O3 -cc.cudagraph_mode=NONE
|
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 -O3 -cc.cudagraph_mode=NONE
|
||||||
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend ray
|
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend ray
|
||||||
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend mp
|
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend mp
|
||||||
python3 examples/offline_inference/basic/generate.py --model Intel/Qwen2.5-0.5B-W4A16-G128-AutoRound-LLMC-TEST-ONLY --enforce-eager
|
|
||||||
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager --attention-backend=TRITON_ATTN
|
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager --attention-backend=TRITON_ATTN
|
||||||
|
python3 examples/offline_inference/basic/generate.py --model ibm-research/PowerMoE-3b --block-size 64 --enforce-eager -tp 2
|
||||||
|
python3 examples/offline_inference/basic/generate.py --model ibm-research/PowerMoE-3b --block-size 64 --enforce-eager -tp 2 --enable-expert-parallel
|
||||||
cd tests
|
cd tests
|
||||||
pytest -v -s v1/core
|
pytest -v -s v1/core --ignore=v1/core/test_reset_prefix_cache_e2e.py
|
||||||
pytest -v -s v1/engine
|
pytest -v -s v1/engine
|
||||||
pytest -v -s v1/sample --ignore=v1/sample/test_logprobs.py --ignore=v1/sample/test_logprobs_e2e.py
|
pytest -v -s v1/sample --ignore=v1/sample/test_logprobs.py --ignore=v1/sample/test_logprobs_e2e.py
|
||||||
pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py
|
pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py
|
||||||
pytest -v -s v1/structured_output
|
pytest -v -s v1/structured_output
|
||||||
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py --ignore=v1/spec_decode/test_speculators_eagle3.py
|
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py --ignore=v1/spec_decode/test_speculators_eagle3.py --ignore=v1/spec_decode/test_acceptance_length.py
|
||||||
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_example_connector.py --ignore=v1/kv_connector/unit/test_lmcache_integration.py
|
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_example_connector.py --ignore=v1/kv_connector/unit/test_lmcache_integration.py
|
||||||
pytest -v -s v1/test_serial_utils.py
|
pytest -v -s v1/test_serial_utils.py
|
||||||
'
|
'
|
||||||
|
|||||||
@@ -43,7 +43,6 @@ trap cleanup EXIT
|
|||||||
|
|
||||||
for BACK in "${BACKENDS[@]}"; do
|
for BACK in "${BACKENDS[@]}"; do
|
||||||
VLLM_DEEP_GEMM_WARMUP=skip \
|
VLLM_DEEP_GEMM_WARMUP=skip \
|
||||||
VLLM_ALL2ALL_BACKEND=$BACK \
|
|
||||||
vllm serve "$MODEL" \
|
vllm serve "$MODEL" \
|
||||||
--enforce-eager \
|
--enforce-eager \
|
||||||
--tensor-parallel-size 2 \
|
--tensor-parallel-size 2 \
|
||||||
@@ -52,6 +51,7 @@ for BACK in "${BACKENDS[@]}"; do
|
|||||||
--enable-eplb \
|
--enable-eplb \
|
||||||
--trust-remote-code \
|
--trust-remote-code \
|
||||||
--max-model-len 2048 \
|
--max-model-len 2048 \
|
||||||
|
--all2all-backend $BACK \
|
||||||
--port $PORT &
|
--port $PORT &
|
||||||
SERVER_PID=$!
|
SERVER_PID=$!
|
||||||
wait_for_server $PORT
|
wait_for_server $PORT
|
||||||
|
|||||||
@@ -7,17 +7,19 @@ SUBPATH=$BUILDKITE_COMMIT
|
|||||||
S3_COMMIT_PREFIX="s3://$BUCKET/$SUBPATH/"
|
S3_COMMIT_PREFIX="s3://$BUCKET/$SUBPATH/"
|
||||||
|
|
||||||
RELEASE_VERSION=$(buildkite-agent meta-data get release-version)
|
RELEASE_VERSION=$(buildkite-agent meta-data get release-version)
|
||||||
echo "Release version from Buildkite: $RELEASE_VERSION"
|
|
||||||
GIT_VERSION=$(git describe --exact-match --tags $BUILDKITE_COMMIT 2>/dev/null)
|
GIT_VERSION=$(git describe --exact-match --tags $BUILDKITE_COMMIT 2>/dev/null)
|
||||||
if [ -z "$GIT_VERSION" ]; then
|
|
||||||
|
echo "Release version from Buildkite: $RELEASE_VERSION"
|
||||||
|
|
||||||
|
if [[ -z "$GIT_VERSION" ]]; then
|
||||||
echo "[FATAL] Not on a git tag, cannot create release."
|
echo "[FATAL] Not on a git tag, cannot create release."
|
||||||
exit 1
|
exit 1
|
||||||
else
|
else
|
||||||
echo "Git version for commit $BUILDKITE_COMMIT: $GIT_VERSION"
|
echo "Git version for commit $BUILDKITE_COMMIT: $GIT_VERSION"
|
||||||
fi
|
fi
|
||||||
# sanity check for version mismatch
|
# sanity check for version mismatch
|
||||||
if [ "$RELEASE_VERSION" != "$GIT_VERSION" ]; then
|
if [[ "$RELEASE_VERSION" != "$GIT_VERSION" ]]; then
|
||||||
if [ "$FORCE_RELEASE_IGNORE_VERSION_MISMATCH" == "true" ]; then
|
if [[ "$FORCE_RELEASE_IGNORE_VERSION_MISMATCH" == "true" ]]; then
|
||||||
echo "[WARNING] Force release and ignore version mismatch"
|
echo "[WARNING] Force release and ignore version mismatch"
|
||||||
else
|
else
|
||||||
echo "[FATAL] Release version from Buildkite does not match Git version."
|
echo "[FATAL] Release version from Buildkite does not match Git version."
|
||||||
@@ -27,7 +29,7 @@ fi
|
|||||||
PURE_VERSION=${RELEASE_VERSION#v} # remove leading 'v'
|
PURE_VERSION=${RELEASE_VERSION#v} # remove leading 'v'
|
||||||
|
|
||||||
# check pypi token
|
# check pypi token
|
||||||
if [ -z "$PYPI_TOKEN" ]; then
|
if [[ -z "$PYPI_TOKEN" ]]; then
|
||||||
echo "[FATAL] PYPI_TOKEN is not set."
|
echo "[FATAL] PYPI_TOKEN is not set."
|
||||||
exit 1
|
exit 1
|
||||||
else
|
else
|
||||||
@@ -35,41 +37,8 @@ else
|
|||||||
export TWINE_PASSWORD="$PYPI_TOKEN"
|
export TWINE_PASSWORD="$PYPI_TOKEN"
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# check github token
|
|
||||||
if [ -z "$GITHUB_TOKEN" ]; then
|
|
||||||
echo "[FATAL] GITHUB_TOKEN is not set."
|
|
||||||
exit 1
|
|
||||||
else
|
|
||||||
export GH_TOKEN="$GITHUB_TOKEN"
|
|
||||||
fi
|
|
||||||
|
|
||||||
set -x # avoid printing secrets above
|
set -x # avoid printing secrets above
|
||||||
|
|
||||||
# download gh CLI from github
|
|
||||||
# Get latest gh CLI version from GitHub API
|
|
||||||
GH_VERSION=$(curl -s https://api.github.com/repos/cli/cli/releases/latest | grep '"tag_name":' | sed -E 's/.*"([^"]+)".*/\1/' | sed 's/^v//')
|
|
||||||
if [ -z "$GH_VERSION" ]; then
|
|
||||||
echo "[FATAL] Failed to get latest gh CLI version from GitHub"
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
echo "Downloading gh CLI version: $GH_VERSION"
|
|
||||||
GH_TARBALL="gh_${GH_VERSION}_linux_amd64.tar.gz"
|
|
||||||
GH_URL="https://github.com/cli/cli/releases/download/v${GH_VERSION}/${GH_TARBALL}"
|
|
||||||
GH_INSTALL_DIR="/tmp/gh-install"
|
|
||||||
mkdir -p "$GH_INSTALL_DIR"
|
|
||||||
pushd "$GH_INSTALL_DIR"
|
|
||||||
curl -L -o "$GH_TARBALL" "$GH_URL"
|
|
||||||
tar -xzf "$GH_TARBALL"
|
|
||||||
GH_BIN=$(realpath $(find . -name "gh" -type f -executable | head -n 1))
|
|
||||||
if [ -z "$GH_BIN" ]; then
|
|
||||||
echo "[FATAL] Failed to find gh CLI executable"
|
|
||||||
exit 1
|
|
||||||
fi
|
|
||||||
echo "gh CLI downloaded successfully, version: $($GH_BIN --version)"
|
|
||||||
echo "Last 5 releases on GitHub:" # as a sanity check of gh and GH_TOKEN
|
|
||||||
command "$GH_BIN" release list --limit 5
|
|
||||||
popd
|
|
||||||
|
|
||||||
# install twine from pypi
|
# install twine from pypi
|
||||||
python3 -m venv /tmp/vllm-release-env
|
python3 -m venv /tmp/vllm-release-env
|
||||||
source /tmp/vllm-release-env/bin/activate
|
source /tmp/vllm-release-env/bin/activate
|
||||||
@@ -89,16 +58,13 @@ echo "Wheels copied to local directory"
|
|||||||
git archive --format=tar.gz --output="$DIST_DIR/vllm-${PURE_VERSION}.tar.gz" $BUILDKITE_COMMIT
|
git archive --format=tar.gz --output="$DIST_DIR/vllm-${PURE_VERSION}.tar.gz" $BUILDKITE_COMMIT
|
||||||
ls -la $DIST_DIR
|
ls -la $DIST_DIR
|
||||||
|
|
||||||
|
|
||||||
# upload wheels to PyPI (only default variant, i.e. files without '+' in the name)
|
# upload wheels to PyPI (only default variant, i.e. files without '+' in the name)
|
||||||
PYPI_WHEEL_FILES=$(find $DIST_DIR -name "vllm-${PURE_VERSION}*.whl" -not -name "*+*")
|
PYPI_WHEEL_FILES=$(find $DIST_DIR -name "vllm-${PURE_VERSION}*.whl" -not -name "*+*")
|
||||||
if [ -z "$PYPI_WHEEL_FILES" ]; then
|
if [[ -z "$PYPI_WHEEL_FILES" ]]; then
|
||||||
echo "No default variant wheels found, quitting..."
|
echo "No default variant wheels found, quitting..."
|
||||||
exit 1
|
exit 1
|
||||||
fi
|
fi
|
||||||
python3 -m twine check $PYPI_WHEEL_FILES
|
|
||||||
python3 -m twine --non-interactive --verbose upload $PYPI_WHEEL_FILES
|
|
||||||
echo "Wheels uploaded to PyPI"
|
|
||||||
|
|
||||||
# create release on GitHub with the release version and all wheels
|
python3 -m twine check $PYPI_WHEEL_FILES
|
||||||
command "$GH_BIN" release create $GIT_VERSION -d --latest --notes-from-tag --verify-tag $DIST_DIR/*.whl
|
python3 -m twine upload --non-interactive --verbose $PYPI_WHEEL_FILES
|
||||||
|
echo "Wheels uploaded to PyPI"
|
||||||
@@ -542,7 +542,7 @@ steps:
|
|||||||
- label: LoRA Test %N # 20min each
|
- label: LoRA Test %N # 20min each
|
||||||
timeout_in_minutes: 30
|
timeout_in_minutes: 30
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/lora
|
- vllm/lora
|
||||||
@@ -604,9 +604,11 @@ steps:
|
|||||||
- tests/compile
|
- tests/compile
|
||||||
commands:
|
commands:
|
||||||
- pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
|
- pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
|
||||||
# Limit to no custom ops to reduce running time
|
# # Limit to no custom ops to reduce running time
|
||||||
# Wrap with quotes to escape yaml and avoid starting -k string with a -
|
# # Wrap with quotes to escape yaml and avoid starting -k string with a -
|
||||||
- "pytest -v -s compile/distributed/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'"
|
# - "pytest -v -s compile/distributed/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'"
|
||||||
|
# Old E2E tests were removed in https://github.com/vllm-project/vllm/pull/33293
|
||||||
|
# in favor of new tests in fusions_e2e. We avoid replicating the new jobs in this file as it's deprecated.
|
||||||
|
|
||||||
- label: Cudagraph test
|
- label: Cudagraph test
|
||||||
timeout_in_minutes: 20
|
timeout_in_minutes: 20
|
||||||
@@ -636,12 +638,13 @@ steps:
|
|||||||
- label: Kernels Attention Test %N # 23min
|
- label: Kernels Attention Test %N # 23min
|
||||||
timeout_in_minutes: 35
|
timeout_in_minutes: 35
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- csrc/attention/
|
- csrc/attention/
|
||||||
- vllm/attention
|
|
||||||
- vllm/v1/attention
|
- vllm/v1/attention
|
||||||
|
# TODO: remove this dependency (https://github.com/vllm-project/vllm/issues/32267)
|
||||||
|
- vllm/model_executor/layers/attention
|
||||||
- tests/kernels/attention
|
- tests/kernels/attention
|
||||||
commands:
|
commands:
|
||||||
- pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
|
- pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
|
||||||
@@ -650,7 +653,7 @@ steps:
|
|||||||
- label: Kernels Quantization Test %N # 64min
|
- label: Kernels Quantization Test %N # 64min
|
||||||
timeout_in_minutes: 90
|
timeout_in_minutes: 90
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- csrc/quantization/
|
- csrc/quantization/
|
||||||
@@ -663,7 +666,7 @@ steps:
|
|||||||
- label: Kernels MoE Test %N # 40min
|
- label: Kernels MoE Test %N # 40min
|
||||||
timeout_in_minutes: 60
|
timeout_in_minutes: 60
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- csrc/quantization/cutlass_w8a8/moe/
|
- csrc/quantization/cutlass_w8a8/moe/
|
||||||
@@ -741,7 +744,7 @@ steps:
|
|||||||
- label: Benchmarks # 11min
|
- label: Benchmarks # 11min
|
||||||
timeout_in_minutes: 20
|
timeout_in_minutes: 20
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
working_dir: "/vllm-workspace/.buildkite"
|
working_dir: "/vllm-workspace/.buildkite"
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
@@ -752,7 +755,7 @@ steps:
|
|||||||
- label: Benchmarks CLI Test # 7min
|
- label: Benchmarks CLI Test # 7min
|
||||||
timeout_in_minutes: 20
|
timeout_in_minutes: 20
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
@@ -826,7 +829,7 @@ steps:
|
|||||||
- label: Basic Models Tests (Extra Initialization) %N
|
- label: Basic Models Tests (Extra Initialization) %N
|
||||||
timeout_in_minutes: 45
|
timeout_in_minutes: 45
|
||||||
mirror_hardwares: [amdexperimental, amdproduction]
|
mirror_hardwares: [amdexperimental, amdproduction]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
torch_nightly: true
|
torch_nightly: true
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
@@ -887,7 +890,7 @@ steps:
|
|||||||
- label: Language Models Tests (Extra Standard) %N
|
- label: Language Models Tests (Extra Standard) %N
|
||||||
timeout_in_minutes: 45
|
timeout_in_minutes: 45
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
torch_nightly: true
|
torch_nightly: true
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
@@ -908,7 +911,7 @@ steps:
|
|||||||
- label: Language Models Tests (Hybrid) %N
|
- label: Language Models Tests (Hybrid) %N
|
||||||
timeout_in_minutes: 75
|
timeout_in_minutes: 75
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental]
|
||||||
agent_pool: mi325_8
|
agent_pool: mi325_1
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
torch_nightly: true
|
torch_nightly: true
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
@@ -1180,7 +1183,6 @@ steps:
|
|||||||
- tests/compile/test_fusion_attn.py
|
- tests/compile/test_fusion_attn.py
|
||||||
- tests/compile/test_silu_mul_quant_fusion.py
|
- tests/compile/test_silu_mul_quant_fusion.py
|
||||||
- tests/compile/distributed/test_fusion_all_reduce.py
|
- tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
- tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
- tests/compile/fullgraph/test_full_graph.py
|
- tests/compile/fullgraph/test_full_graph.py
|
||||||
commands:
|
commands:
|
||||||
- nvidia-smi
|
- nvidia-smi
|
||||||
@@ -1188,33 +1190,16 @@ steps:
|
|||||||
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
||||||
# this runner has 2 GPUs available even though num_gpus=2 is not set
|
# this runner has 2 GPUs available even though num_gpus=2 is not set
|
||||||
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
# Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
|
|
||||||
# Wrap with quotes to escape yaml
|
# # Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
|
||||||
- "pytest -v -s tests/compile/distributed/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and not +quant_fp8 and not +rms_norm'"
|
# # Wrap with quotes to escape yaml
|
||||||
|
# - "pytest -v -s tests/compile/distributed/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and not +quant_fp8 and not +rms_norm'"
|
||||||
|
# Old E2E tests were removed in https://github.com/vllm-project/vllm/pull/33293
|
||||||
|
# in favor of new tests in fusions_e2e. We avoid replicating the new jobs in this file as it's deprecated.
|
||||||
|
|
||||||
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
|
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
|
||||||
- pytest -v -s tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile
|
- pytest -v -s tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile
|
||||||
|
|
||||||
- label: Blackwell Fusion E2E Tests # 30 min
|
|
||||||
timeout_in_minutes: 40
|
|
||||||
working_dir: "/vllm-workspace/"
|
|
||||||
gpu: b200
|
|
||||||
optional: true
|
|
||||||
num_gpus: 2
|
|
||||||
source_file_dependencies:
|
|
||||||
- csrc/quantization/fp4/
|
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
|
||||||
- vllm/compilation/
|
|
||||||
# can affect pattern matching
|
|
||||||
- vllm/model_executor/layers/layernorm.py
|
|
||||||
- vllm/model_executor/layers/activation.py
|
|
||||||
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
|
||||||
- tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
commands:
|
|
||||||
- nvidia-smi
|
|
||||||
# Run all e2e fusion tests
|
|
||||||
- pytest -v -s tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
|
|
||||||
- label: Blackwell GPT-OSS Eval
|
- label: Blackwell GPT-OSS Eval
|
||||||
timeout_in_minutes: 60
|
timeout_in_minutes: 60
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
@@ -1277,7 +1262,7 @@ steps:
|
|||||||
|
|
||||||
- label: 2 Node Tests (4 GPUs in total) # 16min
|
- label: 2 Node Tests (4 GPUs in total) # 16min
|
||||||
timeout_in_minutes: 30
|
timeout_in_minutes: 30
|
||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental, amdmultinode]
|
||||||
agent_pool: mi325_4
|
agent_pool: mi325_4
|
||||||
# grade: Blocking
|
# grade: Blocking
|
||||||
working_dir: "/vllm-workspace/tests"
|
working_dir: "/vllm-workspace/tests"
|
||||||
@@ -1291,15 +1276,15 @@ steps:
|
|||||||
- tests/distributed/
|
- tests/distributed/
|
||||||
- tests/examples/offline_inference/data_parallel.py
|
- tests/examples/offline_inference/data_parallel.py
|
||||||
commands:
|
commands:
|
||||||
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
|
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up) | grep 'Same node test passed' | grep 'Node count test passed'
|
||||||
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
|
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py
|
||||||
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
|
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py
|
||||||
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
|
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
|
||||||
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
|
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
|
||||||
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
|
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
|
||||||
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
|
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
|
||||||
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
|
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py
|
||||||
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed'
|
- NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py
|
||||||
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
|
- python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code
|
||||||
|
|
||||||
- label: Distributed Tests (2 GPUs) # 68min
|
- label: Distributed Tests (2 GPUs) # 68min
|
||||||
@@ -1508,6 +1493,9 @@ steps:
|
|||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
commands:
|
commands:
|
||||||
|
# Work around HIP bug tracked here: https://github.com/ROCm/hip/issues/3876
|
||||||
|
# TODO: Remove when the bug is fixed in a future ROCm release
|
||||||
|
- export TORCH_NCCL_BLOCKING_WAIT=1
|
||||||
# NOTE: don't test llama model here, it seems hf implementation is buggy
|
# NOTE: don't test llama model here, it seems hf implementation is buggy
|
||||||
# see https://github.com/vllm-project/vllm/pull/5689 for details
|
# see https://github.com/vllm-project/vllm/pull/5689 for details
|
||||||
- pytest -v -s distributed/test_custom_all_reduce.py
|
- pytest -v -s distributed/test_custom_all_reduce.py
|
||||||
@@ -1562,7 +1550,10 @@ steps:
|
|||||||
- pytest -v -s tests/compile/distributed/test_sequence_parallelism.py
|
- pytest -v -s tests/compile/distributed/test_sequence_parallelism.py
|
||||||
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
#- pytest -v -s tests/compile/distributed/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm
|
#- pytest -v -s tests/compile/distributed/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm
|
||||||
- "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'"
|
# - "VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'"
|
||||||
|
# Old E2E tests were removed in https://github.com/vllm-project/vllm/pull/33293
|
||||||
|
# in favor of new tests in fusions_e2e. We avoid replicating the new jobs in this file as it's deprecated.
|
||||||
|
|
||||||
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py
|
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/distributed/test_sequence_parallel.py
|
||||||
- pytest -v -s tests/distributed/test_context_parallel.py
|
- pytest -v -s tests/distributed/test_context_parallel.py
|
||||||
- HIP_VISIBLE_DEVICES=0,1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=allgather_reducescatter --disable-nccl-for-dp-synchronization
|
- HIP_VISIBLE_DEVICES=0,1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=allgather_reducescatter --disable-nccl-for-dp-synchronization
|
||||||
|
|||||||
@@ -362,7 +362,7 @@ steps:
|
|||||||
- pytest -v -s v1/sample
|
- pytest -v -s v1/sample
|
||||||
- pytest -v -s v1/logits_processors
|
- pytest -v -s v1/logits_processors
|
||||||
- pytest -v -s v1/worker
|
- pytest -v -s v1/worker
|
||||||
- pytest -v -s v1/spec_decode
|
- pytest -v -s -m 'not slow_test' v1/spec_decode
|
||||||
- pytest -v -s -m 'not cpu_test' v1/kv_connector/unit
|
- pytest -v -s -m 'not cpu_test' v1/kv_connector/unit
|
||||||
- pytest -v -s -m 'not cpu_test' v1/metrics
|
- pytest -v -s -m 'not cpu_test' v1/metrics
|
||||||
- pytest -v -s v1/test_oracle.py
|
- pytest -v -s v1/test_oracle.py
|
||||||
@@ -537,9 +537,11 @@ steps:
|
|||||||
commands:
|
commands:
|
||||||
# fp8 kv scales not supported on sm89, tested on Blackwell instead
|
# fp8 kv scales not supported on sm89, tested on Blackwell instead
|
||||||
- pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
|
- pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
|
||||||
# Limit to no custom ops to reduce running time
|
# # Limit to no custom ops to reduce running time
|
||||||
# Wrap with quotes to escape yaml and avoid starting -k string with a -
|
# # Wrap with quotes to escape yaml and avoid starting -k string with a -
|
||||||
- "pytest -v -s compile/distributed/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'"
|
# - "pytest -v -s compile/distributed/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'"
|
||||||
|
# Old E2E tests were removed in https://github.com/vllm-project/vllm/pull/33293
|
||||||
|
# in favor of new tests in fusions_e2e. We avoid replicating the new jobs in this file as it's deprecated.
|
||||||
|
|
||||||
- label: Cudagraph test
|
- label: Cudagraph test
|
||||||
timeout_in_minutes: 20
|
timeout_in_minutes: 20
|
||||||
@@ -568,8 +570,9 @@ steps:
|
|||||||
mirror_hardwares: [amdexperimental]
|
mirror_hardwares: [amdexperimental]
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- csrc/attention/
|
- csrc/attention/
|
||||||
- vllm/attention
|
|
||||||
- vllm/v1/attention
|
- vllm/v1/attention
|
||||||
|
# TODO: remove this dependency (https://github.com/vllm-project/vllm/issues/32267)
|
||||||
|
- vllm/model_executor/layers/attention
|
||||||
- tests/kernels/attention
|
- tests/kernels/attention
|
||||||
commands:
|
commands:
|
||||||
- pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
|
- pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
|
||||||
@@ -1068,7 +1071,6 @@ steps:
|
|||||||
- tests/compile/test_fusion_attn.py
|
- tests/compile/test_fusion_attn.py
|
||||||
- tests/compile/test_silu_mul_quant_fusion.py
|
- tests/compile/test_silu_mul_quant_fusion.py
|
||||||
- tests/compile/distributed/test_fusion_all_reduce.py
|
- tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
- tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
- tests/compile/fullgraph/test_full_graph.py
|
- tests/compile/fullgraph/test_full_graph.py
|
||||||
commands:
|
commands:
|
||||||
- nvidia-smi
|
- nvidia-smi
|
||||||
@@ -1076,75 +1078,15 @@ steps:
|
|||||||
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
||||||
# this runner has 2 GPUs available even though num_gpus=2 is not set
|
# this runner has 2 GPUs available even though num_gpus=2 is not set
|
||||||
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
# Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
|
# # Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
|
||||||
# Wrap with quotes to escape yaml
|
# # Wrap with quotes to escape yaml
|
||||||
- "pytest -v -s tests/compile/distributed/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and not +quant_fp8 and not +rms_norm'"
|
# - "pytest -v -s tests/compile/distributed/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and not +quant_fp8 and not +rms_norm'"
|
||||||
|
# Old E2E tests were removed in https://github.com/vllm-project/vllm/pull/33293
|
||||||
|
# in favor of new tests in fusions_e2e. We avoid replicating the new jobs in this file as it's deprecated.
|
||||||
|
|
||||||
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
|
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
|
||||||
- pytest -v -s tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile
|
- pytest -v -s tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile
|
||||||
|
|
||||||
- label: Blackwell Fusion E2E Tests # 30 min
|
|
||||||
timeout_in_minutes: 40
|
|
||||||
working_dir: "/vllm-workspace/"
|
|
||||||
gpu: b200
|
|
||||||
optional: true
|
|
||||||
num_gpus: 2
|
|
||||||
source_file_dependencies:
|
|
||||||
- csrc/quantization/fp4/
|
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
|
||||||
- vllm/compilation/
|
|
||||||
# can affect pattern matching
|
|
||||||
- vllm/model_executor/layers/layernorm.py
|
|
||||||
- vllm/model_executor/layers/activation.py
|
|
||||||
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
|
||||||
- tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
commands:
|
|
||||||
- nvidia-smi
|
|
||||||
# Run all e2e fusion tests
|
|
||||||
- pytest -v -s tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
|
|
||||||
- label: Hopper Fusion E2E Tests (H100) # 10min
|
|
||||||
timeout_in_minutes: 70
|
|
||||||
working_dir: "/vllm-workspace/"
|
|
||||||
gpu: h100
|
|
||||||
optional: true
|
|
||||||
source_file_dependencies:
|
|
||||||
- csrc/quantization/fp4/
|
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
|
||||||
- vllm/compilation/
|
|
||||||
# can affect pattern matching
|
|
||||||
- vllm/model_executor/layers/layernorm.py
|
|
||||||
- vllm/model_executor/layers/activation.py
|
|
||||||
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
|
||||||
- tests/compile/test_fusion_attn.py
|
|
||||||
commands:
|
|
||||||
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
|
|
||||||
# skip Llama-4 since it does not fit on this device
|
|
||||||
- pytest -v -s tests/compile/test_fusion_attn.py -k 'not Llama-4'
|
|
||||||
|
|
||||||
- label: Hopper Fusion Distributed E2E Tests (2xH100) # 70min
|
|
||||||
timeout_in_minutes: 70
|
|
||||||
working_dir: "/vllm-workspace/"
|
|
||||||
gpu: h100
|
|
||||||
optional: true
|
|
||||||
num_gpus: 2
|
|
||||||
source_file_dependencies:
|
|
||||||
- csrc/quantization/fp4/
|
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
|
||||||
- vllm/compilation/
|
|
||||||
# can affect pattern matching
|
|
||||||
- vllm/model_executor/layers/layernorm.py
|
|
||||||
- vllm/model_executor/layers/activation.py
|
|
||||||
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
|
||||||
- tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
commands:
|
|
||||||
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
|
|
||||||
# Run all e2e fusion tests
|
|
||||||
- pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'
|
|
||||||
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
|
||||||
|
|
||||||
- label: Blackwell GPT-OSS Eval
|
- label: Blackwell GPT-OSS Eval
|
||||||
timeout_in_minutes: 60
|
timeout_in_minutes: 60
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
@@ -1419,6 +1361,20 @@ steps:
|
|||||||
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
|
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
|
||||||
- pytest -v -s -x lora/test_mixtral.py
|
- pytest -v -s -x lora/test_mixtral.py
|
||||||
|
|
||||||
|
- label: Acceptance Length Test (Large Models) # optional
|
||||||
|
timeout_in_minutes: 120
|
||||||
|
gpu: h100
|
||||||
|
optional: true
|
||||||
|
num_gpus: 1
|
||||||
|
working_dir: "/vllm-workspace/tests"
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/v1/spec_decode/
|
||||||
|
- vllm/model_executor/models/mlp_speculator.py
|
||||||
|
- tests/v1/spec_decode/test_acceptance_length.py
|
||||||
|
commands:
|
||||||
|
- export VLLM_ALLOW_INSECURE_SERIALIZATION=1
|
||||||
|
- pytest -v -s v1/spec_decode/test_acceptance_length.py -m slow_test
|
||||||
|
|
||||||
- label: LM Eval Large Models # optional
|
- label: LM Eval Large Models # optional
|
||||||
gpu: a100
|
gpu: a100
|
||||||
optional: true
|
optional: true
|
||||||
|
|||||||
@@ -2,56 +2,196 @@ group: Compile
|
|||||||
depends_on:
|
depends_on:
|
||||||
- image-build
|
- image-build
|
||||||
steps:
|
steps:
|
||||||
- label: Fusion and Compile Tests (B200)
|
- label: Sequence Parallel Tests (2 GPUs)
|
||||||
|
timeout_in_minutes: 50
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
num_devices: 2
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/model_executor/layers/
|
||||||
|
- vllm/compilation/
|
||||||
|
- vllm/v1/worker/
|
||||||
|
- vllm/v1/cudagraph_dispatcher.py
|
||||||
|
- tests/distributed/test_sequence_parallel.py
|
||||||
|
commands:
|
||||||
|
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
|
||||||
|
- pytest -v -s tests/distributed/test_sequence_parallel.py
|
||||||
|
|
||||||
|
- label: Sequence Parallel Tests (2xH100)
|
||||||
|
timeout_in_minutes: 50
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
device: h100
|
||||||
|
optional: true
|
||||||
|
num_devices: 2
|
||||||
|
commands:
|
||||||
|
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
|
||||||
|
- pytest -v -s tests/distributed/test_sequence_parallel.py
|
||||||
|
|
||||||
|
- label: Distributed Compile Unit Tests (2xH100)
|
||||||
timeout_in_minutes: 40
|
timeout_in_minutes: 40
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
|
device: h100
|
||||||
|
num_devices: 2
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/compilation/
|
||||||
|
- vllm/model_executor/layers
|
||||||
|
- tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
|
- tests/compile/distributed/test_sequence_parallelism.py
|
||||||
|
- tests/compile/distributed/test_async_tp.py
|
||||||
|
commands:
|
||||||
|
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
|
||||||
|
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
|
- pytest -v -s tests/compile/distributed/test_sequence_parallelism.py
|
||||||
|
- pytest -v -s tests/compile/distributed/test_async_tp.py
|
||||||
|
|
||||||
|
- label: Fusion and Compile Unit Tests (B200)
|
||||||
|
timeout_in_minutes: 20
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
device: b200
|
device: b200
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- csrc/quantization/fp4/
|
- csrc/quantization/fp4/
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
- vllm/model_executor/layers/quantization/
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
|
||||||
- vllm/v1/worker/
|
|
||||||
- vllm/v1/cudagraph_dispatcher.py
|
|
||||||
- vllm/compilation/
|
|
||||||
# can affect pattern matching
|
|
||||||
- vllm/model_executor/layers/layernorm.py
|
- vllm/model_executor/layers/layernorm.py
|
||||||
- vllm/model_executor/layers/activation.py
|
- vllm/model_executor/layers/activation.py
|
||||||
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
- vllm/model_executor/layers/attention/attention.py
|
||||||
|
- vllm/v1/attention/backends/flashinfer.py
|
||||||
|
- vllm/compilation/ # TODO(luka) limit to vllm/compilation/passes
|
||||||
- tests/compile/test_fusion_attn.py
|
- tests/compile/test_fusion_attn.py
|
||||||
- tests/compile/test_silu_mul_quant_fusion.py
|
- tests/compile/test_silu_mul_quant_fusion.py
|
||||||
- tests/compile/distributed/test_fusion_all_reduce.py
|
- tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
- tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
- tests/compile/fullgraph/test_full_graph.py
|
- tests/compile/fullgraph/test_full_graph.py
|
||||||
commands:
|
commands:
|
||||||
|
# b200 runners are limited, so we limit the tests to the minimum set only supported on Blackwell
|
||||||
- nvidia-smi
|
- nvidia-smi
|
||||||
- pytest -v -s tests/compile/test_fusion_attn.py
|
- pytest -v -s tests/compile/test_fusion_attn.py -k FLASHINFER
|
||||||
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
- pytest -v -s tests/compile/test_silu_mul_quant_fusion.py
|
||||||
# this runner has 2 GPUs available even though num_devices=2 is not set
|
# this runner has 2 GPUs available even though num_devices=2 is not set
|
||||||
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
||||||
# Limit to Inductor partition, no custom ops, and allreduce & attn fusion to reduce running time
|
|
||||||
# Wrap with quotes to escape yaml
|
|
||||||
- "pytest -v -s tests/compile/distributed/test_fusions_e2e.py::test_tp2_attn_quant_allreduce_rmsnorm -k 'True and not +quant_fp8 and not +rms_norm'"
|
|
||||||
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
|
# test_fp8_kv_scale_compile requires FlashAttention (not supported on default L4/L40)
|
||||||
|
# TODO(luka) move to H100 once pass tests run on H100
|
||||||
- pytest -v -s tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile
|
- pytest -v -s tests/compile/fullgraph/test_full_graph.py::test_fp8_kv_scale_compile
|
||||||
|
|
||||||
- label: Fusion E2E (2 GPUs)(B200)
|
- label: Fusion E2E Quick (H100)
|
||||||
timeout_in_minutes: 40
|
timeout_in_minutes: 15
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
device: b200
|
device: h100
|
||||||
optional: true
|
num_devices: 1
|
||||||
num_devices: 2
|
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- csrc/quantization/fp4/
|
- csrc/quantization/
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
- vllm/model_executor/
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
- vllm/v1/attention/
|
||||||
- vllm/compilation/
|
- vllm/compilation/
|
||||||
# can affect pattern matching
|
- tests/compile/fusions_e2e/
|
||||||
- vllm/model_executor/layers/layernorm.py
|
|
||||||
- vllm/model_executor/layers/activation.py
|
|
||||||
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
|
||||||
- tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
commands:
|
commands:
|
||||||
- nvidia-smi
|
- nvidia-smi
|
||||||
# Run all e2e fusion tests
|
# Run all models and attn backends but only Inductor partition and native custom ops
|
||||||
- pytest -v -s tests/compile/distributed/test_fusions_e2e.py
|
- pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "inductor_partition and not +rms_norm and not +quant_fp8"
|
||||||
|
# Qwen requires +quant_fp8 as -quant_fp8 rms+quant fusion is not supported
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "inductor_partition and not +rms_norm and +quant_fp8 and qwen3"
|
||||||
|
|
||||||
|
- label: Fusion E2E Config Sweep (H100)
|
||||||
|
timeout_in_minutes: 30
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
device: h100
|
||||||
|
num_devices: 1
|
||||||
|
source_file_dependencies:
|
||||||
|
- csrc/quantization/
|
||||||
|
- vllm/compilation/
|
||||||
|
# can affect pattern matching
|
||||||
|
- vllm/model_executor/layers/layernorm.py
|
||||||
|
- vllm/model_executor/layers/activation.py
|
||||||
|
- vllm/model_executor/layers/attention/attention.py
|
||||||
|
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
||||||
|
- tests/compile/fusions_e2e/
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
# Run just llama3 (fp8) for all config combinations
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "llama-3"
|
||||||
|
|
||||||
|
- label: Fusion E2E Config Sweep (B200)
|
||||||
|
timeout_in_minutes: 30
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
device: b200
|
||||||
|
num_devices: 1
|
||||||
|
optional: true
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
# Run all models and attn backends but only Inductor partition and native custom ops
|
||||||
|
# -k "inductor_partition and not +rms_norm and not +quant_fp8"
|
||||||
|
# Qwen requires +quant_fp8 as -quant_fp8 rms+quant fusion is not supported
|
||||||
|
# -k "inductor_partition and not +rms_norm and +quant_fp8 and qwen3"
|
||||||
|
# Run just llama3 (fp8 & fp4) for all config combinations
|
||||||
|
# -k "llama-3"
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp1_quant.py -k "inductor_partition and not +rms_norm and not +quant_fp8" -k "inductor_partition and not +rms_norm and +quant_fp8 and qwen3" -k "llama-3"
|
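Review note on the command above: it passes `-k` three times. As far as I know, pytest declares `-k` as a single-valued option, so when the flag is repeated only the last expression (`"llama-3"`) would take effect. If that holds, the three filters listed in the comments could be OR-ed into one expression instead. The sketch below is an illustration, not the project's tooling. It uses plain `argparse` to stand in for a single-valued option, and `last_wins` and `combine_keywords` are hypothetical helper names.

```python
import argparse


def last_wins(argv):
    """Parse argv with a single-valued -k option (assumed to mirror pytest)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-k", action="store", default="")
    return parser.parse_args(argv).k


def combine_keywords(exprs):
    """OR several -k expressions together, parenthesising each part."""
    return " or ".join(f"({e})" for e in exprs)


# The three filters listed in the step's comments.
exprs = [
    "inductor_partition and not +rms_norm and not +quant_fp8",
    "inductor_partition and not +rms_norm and +quant_fp8 and qwen3",
    "llama-3",
]

# Repeating the flag: only the final value is kept by a 'store' option.
argv = [arg for e in exprs for arg in ("-k", e)]
print(last_wins(argv))

# A single expression that selects the union of all three filters.
print(combine_keywords(exprs))
```

The combined string would then be passed as one quoted `-k` argument. The same consideration applies to any other step that repeats `-k` on a single pytest invocation.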
||||||
|
|
||||||
|
- label: Fusion E2E TP2 Quick (H100)
|
||||||
|
timeout_in_minutes: 20
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
device: h100
|
||||||
|
num_devices: 2
|
||||||
|
source_file_dependencies:
|
||||||
|
- csrc/quantization/
|
||||||
|
- vllm/model_executor/
|
||||||
|
- vllm/v1/attention/
|
||||||
|
- vllm/compilation/
|
||||||
|
- tests/compile/fusions_e2e/
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
# Run all models and attn backends but only Inductor partition and native custom ops
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp2_ar_rms.py -k "inductor_partition and not +rms_norm and not +quant_fp8"
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp2_async_tp.py -k "inductor_partition and not +rms_norm and not +quant_fp8"
|
||||||
|
|
||||||
|
- label: Fusion E2E TP2 AR-RMS Config Sweep (H100)
|
||||||
|
timeout_in_minutes: 40
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
device: h100
|
||||||
|
num_devices: 2
|
||||||
|
source_file_dependencies:
|
||||||
|
- csrc/quantization/
|
||||||
|
- vllm/compilation/
|
||||||
|
# can affect pattern matching
|
||||||
|
- vllm/model_executor/layers/layernorm.py
|
||||||
|
- vllm/model_executor/layers/activation.py
|
||||||
|
- vllm/model_executor/layers/attention/attention.py
|
||||||
|
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
||||||
|
- tests/compile/fusions_e2e/
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
# Run just llama3 (fp4 & fp8 & bf16) for all config combinations
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp2_ar_rms.py -k "llama-3"
|
||||||
|
|
||||||
|
- label: Fusion E2E TP2 AsyncTP Config Sweep (H100)
|
||||||
|
timeout_in_minutes: 40
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
device: h100
|
||||||
|
num_devices: 2
|
||||||
|
source_file_dependencies:
|
||||||
|
- csrc/quantization/
|
||||||
|
- vllm/compilation/
|
||||||
|
# can affect pattern matching
|
||||||
|
- vllm/model_executor/layers/layernorm.py
|
||||||
|
- vllm/model_executor/layers/activation.py
|
||||||
|
- vllm/model_executor/layers/attention/attention.py
|
||||||
|
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
||||||
|
- tests/compile/fusions_e2e/
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
# Run just llama3 (fp8 & bf16) for all config combinations
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp2_async_tp.py -k "llama-3"
|
||||||
|
|
||||||
|
- label: Fusion E2E TP2 (B200)
|
||||||
|
timeout_in_minutes: 20
|
||||||
|
working_dir: "/vllm-workspace/"
|
||||||
|
device: b200
|
||||||
|
num_devices: 2
|
||||||
|
source_file_dependencies:
|
||||||
|
- csrc/quantization/
|
||||||
|
- vllm/model_executor/
|
||||||
|
- vllm/v1/attention/
|
||||||
|
- vllm/compilation/
|
||||||
|
- tests/compile/fusions_e2e/
|
||||||
|
commands:
|
||||||
|
- nvidia-smi
|
||||||
|
# Run all models and attn backends but only Inductor partition and native custom ops
|
||||||
|
# for ar-rms-quant-fp4, also sweep llama3
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp2_ar_rms.py -k "inductor_partition and not +rms_norm and not +quant_fp8" -k "Llama-3.1-8B-Instruct-FP4"
|
||||||
|
- pytest -v -s tests/compile/fusions_e2e/test_tp2_async_tp.py -k "inductor_partition and not +rms_norm and not +quant_fp8"
|
||||||
|
|||||||
@@ -16,7 +16,7 @@ steps:
|
|||||||
- pytest -v -s distributed/test_shm_storage.py
|
- pytest -v -s distributed/test_shm_storage.py
|
||||||
|
|
||||||
- label: Distributed (2 GPUs)
|
- label: Distributed (2 GPUs)
|
||||||
timeout_in_minutes: 90
|
timeout_in_minutes: 60
|
||||||
working_dir: "/vllm-workspace/tests"
|
working_dir: "/vllm-workspace/tests"
|
||||||
num_devices: 2
|
num_devices: 2
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
@@ -47,7 +47,6 @@ steps:
|
|||||||
- pytest -v -s ./compile/test_wrapper.py
|
- pytest -v -s ./compile/test_wrapper.py
|
||||||
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
||||||
- VLLM_TEST_SAME_HOST=1 VLLM_TEST_WITH_DEFAULT_DEVICE_SET=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
- VLLM_TEST_SAME_HOST=1 VLLM_TEST_WITH_DEFAULT_DEVICE_SET=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
|
||||||
- pytest -v -s distributed/test_sequence_parallel.py
|
|
||||||
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
|
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
|
||||||
- pytest -v -s v1/worker/test_worker_memory_snapshot.py
|
- pytest -v -s v1/worker/test_worker_memory_snapshot.py
|
||||||
|
|
||||||
@@ -133,25 +132,13 @@ steps:
|
|||||||
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
|
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
|
||||||
- pytest -v -s -x lora/test_mixtral.py
|
- pytest -v -s -x lora/test_mixtral.py
|
||||||
|
|
||||||
- label: Sequence Parallel Tests (H100)
|
|
||||||
timeout_in_minutes: 60
|
|
||||||
working_dir: "/vllm-workspace/"
|
|
||||||
device: h100
|
|
||||||
optional: true
|
|
||||||
num_devices: 2
|
|
||||||
commands:
|
|
||||||
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
|
|
||||||
# Run sequence parallel tests
|
|
||||||
- pytest -v -s tests/distributed/test_sequence_parallel.py
|
|
||||||
- pytest -v -s tests/compile/distributed/test_sequence_parallelism.py
|
|
||||||
|
|
||||||
- label: Distributed Tests (2 GPUs)(H100)
|
- label: Distributed Tests (2 GPUs)(H100)
|
||||||
|
timeout_in_minutes: 15
|
||||||
device: h100
|
device: h100
|
||||||
optional: true
|
optional: true
|
||||||
working_dir: "/vllm-workspace/"
|
working_dir: "/vllm-workspace/"
|
||||||
num_devices: 2
|
num_devices: 2
|
||||||
commands:
|
commands:
|
||||||
- VLLM_TEST_CLEAN_GPU_MEMORY=1 pytest -v -s tests/compile/distributed/test_async_tp.py
|
|
||||||
- pytest -v -s tests/distributed/test_context_parallel.py
|
- pytest -v -s tests/distributed/test_context_parallel.py
|
||||||
- VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
|
- VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 examples/offline_inference/data_parallel.py --model=Qwen/Qwen1.5-MoE-A2.7B -tp=1 -dp=2 --max-model-len=2048 --all2all-backend=deepep_high_throughput
|
||||||
- pytest -v -s tests/v1/distributed/test_dbo.py
|
- pytest -v -s tests/v1/distributed/test_dbo.py
|
||||||
@@ -180,7 +167,7 @@ steps:
|
|||||||
- tests/distributed/
|
- tests/distributed/
|
||||||
- tests/examples/offline_inference/data_parallel.py
|
- tests/examples/offline_inference/data_parallel.py
|
||||||
commands:
|
commands:
|
||||||
- ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0bec63fa317e1fbd62e19b0fc31c43c81bf89077 "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code"
|
- ./.buildkite/scripts/run-multi-node-test.sh /vllm-workspace/tests 2 2 $IMAGE_TAG "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=0 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py && VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py" "VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed' && NUM_NODES=2 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_node_count.py | grep 'Node count test passed' && python3 ../examples/offline_inference/data_parallel.py -dp=2 -tp=1 --dp-num-nodes=2 --dp-node-rank=1 --dp-master-addr=192.168.10.10 --dp-master-port=12345 --enforce-eager --trust-remote-code"
|
||||||
|
|
||||||
- label: Distributed NixlConnector PD accuracy (4 GPUs)
|
- label: Distributed NixlConnector PD accuracy (4 GPUs)
|
||||||
timeout_in_minutes: 30
|
timeout_in_minutes: 30
|
||||||
@@ -217,45 +204,3 @@ steps:
|
|||||||
commands:
|
commands:
|
||||||
- pytest -v -s distributed/test_pp_cudagraph.py
|
- pytest -v -s distributed/test_pp_cudagraph.py
|
||||||
- pytest -v -s distributed/test_pipeline_parallel.py
|
- pytest -v -s distributed/test_pipeline_parallel.py
|
||||||
|
|
||||||
- label: Hopper Fusion E2E Tests (H100)
|
|
||||||
timeout_in_minutes: 70
|
|
||||||
working_dir: "/vllm-workspace/"
|
|
||||||
device: h100
|
|
||||||
optional: true
|
|
||||||
source_file_dependencies:
|
|
||||||
- csrc/quantization/fp4/
|
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
|
||||||
- vllm/compilation/
|
|
||||||
# can affect pattern matching
|
|
||||||
- vllm/model_executor/layers/layernorm.py
|
|
||||||
- vllm/model_executor/layers/activation.py
|
|
||||||
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
|
||||||
- tests/compile/test_fusion_attn.py
|
|
||||||
commands:
|
|
||||||
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
|
|
||||||
# skip Llama-4 since it does not fit on this device
|
|
||||||
- pytest -v -s tests/compile/test_fusion_attn.py -k 'not Llama-4'
|
|
||||||
|
|
||||||
- label: Hopper Fusion Distributed E2E Tests (2xH100)
|
|
||||||
timeout_in_minutes: 70
|
|
||||||
working_dir: "/vllm-workspace/"
|
|
||||||
device: h100
|
|
||||||
optional: true
|
|
||||||
num_devices: 2
|
|
||||||
source_file_dependencies:
|
|
||||||
- csrc/quantization/fp4/
|
|
||||||
- vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
|
||||||
- vllm/v1/attention/backends/flashinfer.py
|
|
||||||
- vllm/compilation/
|
|
||||||
# can affect pattern matching
|
|
||||||
- vllm/model_executor/layers/layernorm.py
|
|
||||||
- vllm/model_executor/layers/activation.py
|
|
||||||
- vllm/model_executor/layers/quantization/input_quant_fp8.py
|
|
||||||
- tests/compile/distributed/test_fusions_e2e.py
|
|
||||||
commands:
|
|
||||||
- export VLLM_TEST_CLEAN_GPU_MEMORY=1
|
|
||||||
# Run all e2e fusion tests
|
|
||||||
- pytest -v -s tests/compile/distributed/test_fusions_e2e.py -k 'not Llama-4'
|
|
||||||
- pytest -v -s tests/compile/distributed/test_fusion_all_reduce.py
|
|
||||||
|
|||||||
@@ -15,8 +15,9 @@ steps:
|
|||||||
timeout_in_minutes: 35
|
timeout_in_minutes: 35
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- csrc/attention/
|
- csrc/attention/
|
||||||
- vllm/attention
|
|
||||||
- vllm/v1/attention
|
- vllm/v1/attention
|
||||||
|
# TODO: remove this dependency (https://github.com/vllm-project/vllm/issues/32267)
|
||||||
|
- vllm/model_executor/layers/attention
|
||||||
- tests/kernels/attention
|
- tests/kernels/attention
|
||||||
commands:
|
commands:
|
||||||
- pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
|
- pytest -v -s kernels/attention --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
|
||||||
|
|||||||
@@ -61,7 +61,7 @@ steps:
|
|||||||
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt
|
- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=evals/gsm8k/configs/moe-refactor/config-h100.txt
|
||||||
|
|
||||||
- label: MoE Refactor Integration Test (B200 - TEMPORARY)
|
- label: MoE Refactor Integration Test (B200 - TEMPORARY)
|
||||||
gpu: b200
|
device: b200
|
||||||
optional: true
|
optional: true
|
||||||
num_devices: 2
|
num_devices: 2
|
||||||
commands:
|
commands:
|
||||||
|
|||||||
@@ -16,7 +16,7 @@ steps:
|
|||||||
- pytest -v -s v1/sample
|
- pytest -v -s v1/sample
|
||||||
- pytest -v -s v1/logits_processors
|
- pytest -v -s v1/logits_processors
|
||||||
- pytest -v -s v1/worker
|
- pytest -v -s v1/worker
|
||||||
- pytest -v -s v1/spec_decode
|
- pytest -v -s -m 'not slow_test' v1/spec_decode
|
||||||
- pytest -v -s -m 'not cpu_test' v1/kv_connector/unit
|
- pytest -v -s -m 'not cpu_test' v1/kv_connector/unit
|
||||||
- pytest -v -s -m 'not cpu_test' v1/metrics
|
- pytest -v -s -m 'not cpu_test' v1/metrics
|
||||||
- pytest -v -s v1/test_oracle.py
|
- pytest -v -s v1/test_oracle.py
|
||||||
@@ -27,7 +27,8 @@ steps:
|
|||||||
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
|
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
|
||||||
|
|
||||||
- label: V1 Others (CPU)
|
- label: V1 Others (CPU)
|
||||||
depends_on: ~
|
depends_on:
|
||||||
|
- image-build-cpu
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
- tests/v1
|
- tests/v1
|
||||||
@@ -114,7 +115,8 @@ steps:
|
|||||||
- pytest -v -s utils_
|
- pytest -v -s utils_
|
||||||
|
|
||||||
- label: Async Engine, Inputs, Utils, Worker, Config (CPU)
|
- label: Async Engine, Inputs, Utils, Worker, Config (CPU)
|
||||||
depends_on: ~
|
depends_on:
|
||||||
|
- image-build-cpu
|
||||||
timeout_in_minutes: 30
|
timeout_in_minutes: 30
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
@@ -164,4 +166,18 @@ steps:
|
|||||||
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
|
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
|
||||||
- pip install pytest-timeout pytest-forked
|
- pip install pytest-timeout pytest-forked
|
||||||
- pytest -v -s v1/determinism/test_batch_invariance.py
|
- pytest -v -s v1/determinism/test_batch_invariance.py
|
||||||
- pytest -v -s v1/determinism/test_rms_norm_batch_invariant.py
|
- pytest -v -s v1/determinism/test_rms_norm_batch_invariant.py
|
||||||
|
|
||||||
|
- label: Acceptance Length Test (Large Models) # optional
|
||||||
|
timeout_in_minutes: 25
|
||||||
|
gpu: h100
|
||||||
|
optional: true
|
||||||
|
num_gpus: 1
|
||||||
|
working_dir: "/vllm-workspace/tests"
|
||||||
|
source_file_dependencies:
|
||||||
|
- vllm/v1/spec_decode/
|
||||||
|
- vllm/model_executor/models/mlp_speculator.py
|
||||||
|
- tests/v1/spec_decode/test_acceptance_length.py
|
||||||
|
commands:
|
||||||
|
- export VLLM_ALLOW_INSECURE_SERIALIZATION=1
|
||||||
|
- pytest -v -s v1/spec_decode/test_acceptance_length.py -m slow_test
|
||||||
|
|||||||
@@ -39,6 +39,8 @@ steps:
|
|||||||
- pytest -v -s models/test_transformers.py models/test_registry.py
|
- pytest -v -s models/test_transformers.py models/test_registry.py
|
||||||
|
|
||||||
- label: Basic Models Test (Other CPU) # 5min
|
- label: Basic Models Test (Other CPU) # 5min
|
||||||
|
depends_on:
|
||||||
|
- image-build-cpu
|
||||||
timeout_in_minutes: 10
|
timeout_in_minutes: 10
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
|
|||||||
@@ -14,6 +14,8 @@ steps:
|
|||||||
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
|
- cd .. && VLLM_WORKER_MULTIPROC_METHOD=spawn pytest -v -s tests/models/multimodal/generation/test_whisper.py -m core_model # Otherwise, mp_method="spawn" doesn't work
|
||||||
|
|
||||||
- label: Multi-Modal Processor Test (CPU)
|
- label: Multi-Modal Processor Test (CPU)
|
||||||
|
depends_on:
|
||||||
|
- image-build-cpu
|
||||||
timeout_in_minutes: 60
|
timeout_in_minutes: 60
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
|
|||||||
@@ -18,7 +18,7 @@ steps:
|
|||||||
- "find compile/ -maxdepth 1 -name 'test_*.py' -print0 | xargs -0 -n1 -I{} pytest -s -v '{}'"
|
- "find compile/ -maxdepth 1 -name 'test_*.py' -print0 | xargs -0 -n1 -I{} pytest -s -v '{}'"
|
||||||
|
|
||||||
- label: PyTorch Fullgraph Smoke Test
|
- label: PyTorch Fullgraph Smoke Test
|
||||||
timeout_in_minutes: 30
|
timeout_in_minutes: 35
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
- tests/compile
|
- tests/compile
|
||||||
@@ -30,16 +30,13 @@ steps:
|
|||||||
- "find compile/fullgraph/ -name 'test_*.py' -not -name 'test_full_graph.py' -exec pytest -s -v {} \\;"
|
- "find compile/fullgraph/ -name 'test_*.py' -not -name 'test_full_graph.py' -exec pytest -s -v {} \\;"
|
||||||
|
|
||||||
- label: PyTorch Fullgraph
|
- label: PyTorch Fullgraph
|
||||||
timeout_in_minutes: 40
|
timeout_in_minutes: 30
|
||||||
source_file_dependencies:
|
source_file_dependencies:
|
||||||
- vllm/
|
- vllm/
|
||||||
- tests/compile
|
- tests/compile
|
||||||
commands:
|
commands:
|
||||||
# fp8 kv scales not supported on sm89, tested on Blackwell instead
|
# fp8 kv scales not supported on sm89, tested on Blackwell instead
|
||||||
- pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
|
- pytest -v -s compile/fullgraph/test_full_graph.py -k 'not test_fp8_kv_scale_compile'
|
||||||
# Limit to no custom ops to reduce running time
|
|
||||||
# Wrap with quotes to escape yaml and avoid starting -k string with a -
|
|
||||||
- "pytest -v -s compile/distributed/test_fusions_e2e.py -k 'TRITON and not +quant_fp8 and not Llama-4'"
|
|
||||||
|
|
||||||
- label: Pytorch Nightly Dependency Override Check # 2min
|
- label: Pytorch Nightly Dependency Override Check # 2min
|
||||||
# if this test fails, it means the nightly torch version is not compatible with some
|
# if this test fails, it means the nightly torch version is not compatible with some
|
||||||
|
|||||||
16
.github/CODEOWNERS
vendored
@@ -2,8 +2,8 @@
|
|||||||
# for more info about CODEOWNERS file
|
# for more info about CODEOWNERS file
|
||||||
|
|
||||||
# This lists cover the "core" components of vLLM that require careful review
|
# This lists cover the "core" components of vLLM that require careful review
|
||||||
/vllm/attention @LucasWilkinson
|
|
||||||
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @njhill @22quinn
|
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @njhill @22quinn
|
||||||
|
/vllm/model_executor/layers/attention @LucasWilkinson
|
||||||
/vllm/model_executor/layers/fused_moe @mgoin @pavanimajety
|
/vllm/model_executor/layers/fused_moe @mgoin @pavanimajety
|
||||||
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @pavanimajety
|
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth @yewentao256 @pavanimajety
|
||||||
/vllm/model_executor/layers/mamba @tdoublep
|
/vllm/model_executor/layers/mamba @tdoublep
|
||||||
@@ -16,7 +16,7 @@
|
|||||||
/vllm/entrypoints @aarnphm @chaunceyjiang
|
/vllm/entrypoints @aarnphm @chaunceyjiang
|
||||||
/vllm/tool_parsers @aarnphm @chaunceyjiang
|
/vllm/tool_parsers @aarnphm @chaunceyjiang
|
||||||
/vllm/compilation @zou3519 @youkaichao @ProExpertProg
|
/vllm/compilation @zou3519 @youkaichao @ProExpertProg
|
||||||
/vllm/distributed/kv_transfer @NickLucche @ApostaC
|
/vllm/distributed/kv_transfer @NickLucche @ApostaC @orozery
|
||||||
CMakeLists.txt @tlrmchlsmth @LucasWilkinson
|
CMakeLists.txt @tlrmchlsmth @LucasWilkinson
|
||||||
|
|
||||||
# Any change to the VllmConfig changes can have a large user-facing impact,
|
# Any change to the VllmConfig changes can have a large user-facing impact,
|
||||||
@@ -30,12 +30,14 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
|
|||||||
/vllm/v1/attention/backends/mla @pavanimajety
|
/vllm/v1/attention/backends/mla @pavanimajety
|
||||||
/vllm/v1/attention/backends/flashinfer.py @mgoin @pavanimajety
|
/vllm/v1/attention/backends/flashinfer.py @mgoin @pavanimajety
|
||||||
/vllm/v1/attention/backends/triton_attn.py @tdoublep
|
/vllm/v1/attention/backends/triton_attn.py @tdoublep
|
||||||
/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC
|
/vllm/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC @orozery
|
||||||
/vllm/v1/sample @22quinn @houseroad @njhill
|
/vllm/v1/sample @22quinn @houseroad @njhill
|
||||||
/vllm/v1/spec_decode @benchislett @luccafong
|
/vllm/v1/spec_decode @benchislett @luccafong
|
||||||
/vllm/v1/structured_output @mgoin @russellb @aarnphm @benchislett
|
/vllm/v1/structured_output @mgoin @russellb @aarnphm @benchislett
|
||||||
/vllm/v1/kv_cache_interface.py @heheda12345
|
/vllm/v1/kv_cache_interface.py @heheda12345
|
||||||
/vllm/v1/offloading @ApostaC
|
/vllm/v1/kv_offload @ApostaC @orozery
|
||||||
|
/vllm/v1/worker/gpu/kv_connector.py @orozery
|
||||||
|
/vllm/v1/worker/kv_connector_model_runner_mixin.py @orozery
|
||||||
|
|
||||||
# Model runner V2
|
# Model runner V2
|
||||||
/vllm/v1/worker/gpu @WoosukKwon
|
/vllm/v1/worker/gpu @WoosukKwon
|
||||||
@@ -54,13 +56,13 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
|
|||||||
/tests/test_inputs.py @DarkLight1337 @ywang96
|
/tests/test_inputs.py @DarkLight1337 @ywang96
|
||||||
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
|
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
|
||||||
/tests/v1/structured_output @mgoin @russellb @aarnphm
|
/tests/v1/structured_output @mgoin @russellb @aarnphm
|
||||||
/tests/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC
|
/tests/v1/core @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @alexm-redhat @heheda12345 @ApostaC @orozery
|
||||||
/tests/weight_loading @mgoin @youkaichao @yewentao256
|
/tests/weight_loading @mgoin @youkaichao @yewentao256
|
||||||
/tests/lora @jeejeelee
|
/tests/lora @jeejeelee
|
||||||
/tests/models/language/generation/test_hybrid.py @tdoublep
|
/tests/models/language/generation/test_hybrid.py @tdoublep
|
||||||
/tests/v1/kv_connector/nixl_integration @NickLucche
|
/tests/v1/kv_connector/nixl_integration @NickLucche
|
||||||
/tests/v1/kv_connector @ApostaC
|
/tests/v1/kv_connector @ApostaC @orozery
|
||||||
/tests/v1/offloading @ApostaC
|
/tests/v1/kv_offload @ApostaC @orozery
|
||||||
/tests/v1/determinism @yewentao256
|
/tests/v1/determinism @yewentao256
|
||||||
|
|
||||||
# Transformers modeling backend
|
# Transformers modeling backend
|
||||||
|
|||||||
@@ -154,6 +154,10 @@ repos:
|
|||||||
files: ^docker/(Dockerfile|versions\.json)$
|
files: ^docker/(Dockerfile|versions\.json)$
|
||||||
pass_filenames: false
|
pass_filenames: false
|
||||||
additional_dependencies: [dockerfile-parse]
|
additional_dependencies: [dockerfile-parse]
|
||||||
|
- id: attention-backend-docs
|
||||||
|
name: Check attention backend documentation is up to date
|
||||||
|
entry: python tools/pre_commit/generate_attention_backend_docs.py --check
|
||||||
|
language: python
|
||||||
# Keep `suggestion` last
|
# Keep `suggestion` last
|
||||||
- id: suggestion
|
- id: suggestion
|
||||||
name: Suggestion
|
name: Suggestion
|
||||||
|
|||||||
@@ -458,7 +458,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
|
|||||||
endif()
|
endif()
|
||||||
|
|
||||||
set(MARLIN_SRCS
|
set(MARLIN_SRCS
|
||||||
"csrc/quantization/marlin/sparse/marlin_24_cuda_kernel.cu"
|
|
||||||
"csrc/quantization/marlin/marlin.cu"
|
"csrc/quantization/marlin/marlin.cu"
|
||||||
"csrc/quantization/marlin/marlin_int4_fp8_preprocess.cu"
|
"csrc/quantization/marlin/marlin_int4_fp8_preprocess.cu"
|
||||||
"csrc/quantization/marlin/gptq_marlin_repack.cu"
|
"csrc/quantization/marlin/gptq_marlin_repack.cu"
|
||||||
|
|||||||
266
benchmarks/attention_benchmarks/README.md
Normal file
@@ -0,0 +1,266 @@
|
|||||||
|
# vLLM Attention Benchmarking Suite
|
||||||
|
|
||||||
|
Fast, flexible benchmarking for vLLM attention and MLA backends with an extended batch specification grammar.
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd benchmarks/attention_benchmarks
|
||||||
|
|
||||||
|
# Run a pre-configured benchmark
|
||||||
|
python benchmark.py --config configs/mla_decode.yaml
|
||||||
|
python benchmark.py --config configs/mla_mixed_batch.yaml
|
||||||
|
python benchmark.py --config configs/speculative_decode.yaml
|
||||||
|
python benchmark.py --config configs/standard_attention.yaml
|
||||||
|
python benchmark.py --config configs/reorder_threshold.yaml
|
||||||
|
|
||||||
|
# Or run custom benchmarks
|
||||||
|
python benchmark.py \
|
||||||
|
--backends flash flashinfer \
|
||||||
|
--batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
|
||||||
|
--output-csv results.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
## Simplified Batch Specification Grammar
|
||||||
|
|
||||||
|
Express workloads concisely using query length and sequence length:
|
||||||
|
|
||||||
|
```python
|
||||||
|
"q2k" # 2048-token prefill (q_len=2048, seq_len=2048)
|
||||||
|
"q1s1k" # Decode: 1 token with 1K sequence
|
||||||
|
"8q1s1k" # 8 decode requests
|
||||||
|
"q4s1k" # 4-token extend (e.g., spec decode)
|
||||||
|
"2q2k_32q1s1k" # Mixed: 2 prefills + 32 decodes
|
||||||
|
"16q4s1k" # 16 spec decode (4 tokens each)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Grammar Rule
|
||||||
|
|
||||||
|
```text
|
||||||
|
Format: (<count>?) q<q_len>(k?) (s<seq_len>(k?))?
|
||||||
|
|
||||||
|
- count: Number of identical requests (optional, default=1)
|
||||||
|
- q_len: Query length (number of new tokens)
|
||||||
|
- seq_len: Total sequence length (optional, defaults to q_len for prefill)
|
||||||
|
- 'k': Multiplies value by 1024
|
||||||
|
|
||||||
|
Mixed batches: Use _ to combine (e.g., "2q2k_32q1s1k")
|
||||||
|
```
|
||||||
|
|
||||||
|
**Note**: Decode, prefill, and spec decode are just different query lengths - no special syntax needed!
|
||||||
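The grammar above can be sketched as a small stand-alone parser. This is a minimal illustration that uses only the standard-library `re` module; the names `expand_spec`, `_size`, and `_SEGMENT` are hypothetical and are not part of the suite, whose own parser lives in `batch_spec.py`.

```python
import re

# Hypothetical stand-alone parser; it mirrors the documented format only:
# (<count>?) q<q_len>(k?) (s<seq_len>(k?))?
_SEGMENT = re.compile(r"^(\d+)?q(\d+)(k?)(?:s(\d+)(k?))?$")


def _size(digits: str, k_suffix: str) -> int:
    """Convert a number with an optional 'k' suffix (x1024) to an int."""
    value = int(digits)
    return value * 1024 if k_suffix == "k" else value


def expand_spec(spec: str) -> list[tuple[int, int]]:
    """Expand a batch spec into (q_len, seq_len) pairs, one per request."""
    pairs = []
    for segment in spec.split("_"):  # "_" combines segments into a mixed batch
        m = _SEGMENT.match(segment)
        if m is None:
            raise ValueError(f"Invalid batch spec segment: {segment!r}")
        count = int(m.group(1)) if m.group(1) else 1
        q_len = _size(m.group(2), m.group(3))
        # seq_len defaults to q_len, which is the pure-prefill case
        seq_len = _size(m.group(4), m.group(5)) if m.group(4) else q_len
        pairs.extend([(q_len, seq_len)] * count)
    return pairs


print(expand_spec("q2k"))                # [(2048, 2048)]
print(len(expand_spec("2q2k_32q1s1k")))  # 34 requests: 2 prefills + 32 decodes
```

Because every request type is just a `(q_len, seq_len)` pair, the same function covers prefill, decode, and spec decode without special cases.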
|
|
||||||
|
## Pre-configured Benchmarks
|
||||||
|
|
||||||
|
The suite includes several pre-configured YAML benchmark configurations:
|
||||||
|
|
||||||
|
### MLA Decode Benchmark
|
||||||
|
|
||||||
|
Tests pure decode performance across MLA backends with varying batch sizes and sequence lengths.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python benchmark.py --config configs/mla_decode.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
### MLA Mixed Batch Benchmark
|
||||||
|
|
||||||
|
Tests chunked prefill performance with mixed prefill + decode batches.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python benchmark.py --config configs/mla_mixed_batch.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Speculative Decoding Benchmark
|
||||||
|
|
||||||
|
Tests speculative decode scenarios (K-token verification) and reorder_batch_threshold optimization.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python benchmark.py --config configs/speculative_decode.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Standard Attention Benchmark
|
||||||
|
|
||||||
|
Tests standard attention backends (Flash/Triton/FlashInfer) with pure prefill, decode, and mixed batches.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python benchmark.py --config configs/standard_attention.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Reorder Threshold Study
|
||||||
|
|
||||||
|
**Question:** At what query length does the prefill pipeline become faster than the decode pipeline?
|
||||||
|
|
||||||
|
Tests query lengths from 1 to 1024 across 9 batch sizes to find the crossover point. Uses `decode_vs_prefill` mode to compare both pipelines for each query length.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python benchmark.py --config configs/reorder_threshold.yaml
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Universal Benchmark
|
||||||
|
|
||||||
|
The `benchmark.py` script handles **all** backends - both standard attention and MLA.
|
||||||
|
|
||||||
|
### Standard Attention (Flash/Triton/FlashInfer)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python benchmark.py \
|
||||||
|
--backends flash triton flashinfer \
|
||||||
|
--batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
|
||||||
|
--num-layers 10 \
|
||||||
|
--repeats 5 \
|
||||||
|
--output-csv results.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
### MLA Backends
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Compare all MLA backends
|
||||||
|
python benchmark.py \
|
||||||
|
--backends cutlass_mla flashinfer_mla flashattn_mla flashmla \
|
||||||
|
--batch-specs "64q1s1k" "64q1s4k" \
|
||||||
|
--output-csv mla_results.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
### Parameter Sweeps
|
||||||
|
|
||||||
|
Use `--sweep-param` and `--sweep-values` to run parameter sweeps from the CLI:
|
||||||
|
|
||||||
|
#### CUTLASS MLA num-splits Optimization
|
||||||
|
|
||||||
|
**Question:** What is the optimal `num_kv_splits` for CUTLASS MLA?
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python benchmark.py \
|
||||||
|
--backend cutlass_mla \
|
||||||
|
--batch-specs "64q1s1k" "64q1s4k" "64q1s16k" \
|
||||||
|
--sweep-param num_kv_splits \
|
||||||
|
--sweep-values 1 2 4 8 16 \
|
||||||
|
--output-json optimal_splits.json
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Reorder Batch Threshold Optimization
|
||||||
|
|
||||||
|
**Question:** What's the optimal `reorder_batch_threshold` for speculative decoding?
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python benchmark.py \
|
||||||
|
--backend flashmla \
|
||||||
|
--batch-specs "q4s1k" "q8s2k" \
|
||||||
|
--sweep-param reorder_batch_threshold \
|
||||||
|
--sweep-values 1 4 16 64 256 512 \
|
||||||
|
--output-csv threshold_sweep.csv
|
||||||
|
```
|
||||||
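To estimate how many runs a sweep launches, it can help to think of it as a Cartesian product of backends, batch specs, and sweep values. The sketch below only illustrates that counting; `SweepCase` and `expand_sweep` are hypothetical names, and how the suite actually forwards a swept parameter to a backend is not shown here.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class SweepCase:  # hypothetical stand-in, not the suite's BenchmarkConfig
    backend: str
    batch_spec: str
    num_kv_splits: int = 1


def expand_sweep(backends, batch_specs, param, values):
    """Return one SweepCase per (backend, batch_spec, value) combination."""
    cases = []
    for backend, spec, value in product(backends, batch_specs, values):
        base = SweepCase(backend=backend, batch_spec=spec)
        # dataclasses.replace copies the case with the swept field overridden
        cases.append(replace(base, **{param: value}))
    return cases


cases = expand_sweep(
    ["cutlass_mla"],
    ["64q1s1k", "64q1s4k", "64q1s16k"],
    "num_kv_splits",
    [1, 2, 4, 8, 16],
)
print(len(cases))  # 1 backend x 3 specs x 5 values = 15
```

Each case is then multiplied by the warmup and repeat counts, so wide sweeps are best tried first with a small `--repeats` value.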
|
|
||||||
|
### All Command-Line Options
|
||||||
|
|
||||||
|
```text
|
||||||
|
--config CONFIG # Path to YAML config file (overrides other args)
|
||||||
|
--backends BACKEND [BACKEND ...] # flash, triton, flashinfer, cutlass_mla,
|
||||||
|
# flashinfer_mla, flashattn_mla, flashmla
|
||||||
|
--backend BACKEND # Single backend (alternative to --backends)
|
||||||
|
--batch-specs SPEC [SPEC ...] # Batch specifications using extended grammar
|
||||||
|
|
||||||
|
# Model configuration
|
||||||
|
--num-layers N # Number of layers
|
||||||
|
--head-dim N # Head dimension
|
||||||
|
--num-q-heads N # Query heads
|
||||||
|
--num-kv-heads N # KV heads
|
||||||
|
--block-size N # Block size
|
||||||
|
|
||||||
|
# Benchmark settings
|
||||||
|
--device DEVICE # Device (default: cuda:0)
|
||||||
|
--repeats N # Repetitions
|
||||||
|
--warmup-iters N # Warmup iterations
|
||||||
|
--profile-memory # Profile memory usage
|
||||||
|
|
||||||
|
# Parameter sweeps
|
||||||
|
--sweep-param PARAM # Parameter name to sweep (e.g., num_kv_splits,
|
||||||
|
# reorder_batch_threshold)
|
||||||
|
--sweep-values N [N ...] # Values to sweep for the parameter
|
||||||
|
|
||||||
|
# Output
|
||||||
|
--output-csv FILE # Save to CSV
|
||||||
|
--output-json FILE # Save to JSON
|
||||||
|
```
|
||||||
|
|
||||||
|
## Hardware Requirements
|
||||||
|
|
||||||
|
| Backend | Hardware |
|
||||||
|
|---------|----------|
|
||||||
|
| Flash/Triton/FlashInfer | Any CUDA GPU |
|
||||||
|
| CUTLASS MLA | Blackwell (SM100+) |
|
||||||
|
| FlashAttn MLA | Hopper (SM90+) |
|
||||||
|
| FlashMLA | Hopper (SM90+) |
|
||||||
|
| FlashInfer-MLA | Any CUDA GPU |
|
||||||
|
|
||||||
|
## Using MLA Runner Directly
|
||||||
|
|
||||||
|
All MLA backends are available through `mla_runner.run_mla_benchmark()`:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from mla_runner import run_mla_benchmark
|
||||||
|
from common import BenchmarkConfig
|
||||||
|
|
||||||
|
config = BenchmarkConfig(
|
||||||
|
backend="cutlass_mla",
|
||||||
|
batch_spec="64q1s4k",
|
||||||
|
num_layers=10,
|
||||||
|
head_dim=576,
|
||||||
|
num_q_heads=128,
|
||||||
|
num_kv_heads=1,
|
||||||
|
block_size=128,
|
||||||
|
device="cuda:0",
|
||||||
|
repeats=5,
|
||||||
|
warmup_iters=3,
|
||||||
|
)
|
||||||
|
|
||||||
|
# CUTLASS MLA with specific num_kv_splits
|
||||||
|
result = run_mla_benchmark("cutlass_mla", config, num_kv_splits=4)
|
||||||
|
print(f"Time: {result.mean_time:.6f}s")
|
||||||
|
|
||||||
|
# FlashInfer-MLA
|
||||||
|
result = run_mla_benchmark("flashinfer_mla", config)
|
||||||
|
|
||||||
|
# FlashAttn MLA (Hopper SM90+)
|
||||||
|
result = run_mla_benchmark("flashattn_mla", config, reorder_batch_threshold=64)
|
||||||
|
|
||||||
|
# FlashMLA (Hopper SM90+)
|
||||||
|
result = run_mla_benchmark("flashmla", config, reorder_batch_threshold=64)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Python API
|
||||||
|
|
||||||
|
```python
|
||||||
|
from batch_spec import parse_batch_spec, format_batch_spec, get_batch_stats
|
||||||
|
from common import BenchmarkConfig, BenchmarkResult, ResultsFormatter
|
||||||
|
|
||||||
|
# Parse batch specs
|
||||||
|
requests = parse_batch_spec("2q2k_q4s1k_32q1s1k")
|
||||||
|
print(format_batch_spec(requests))
|
||||||
|
# "2 prefill (2x2k), 1 extend (1xq4kv1k), 32 decode (32x1k)"
|
||||||
|
|
||||||
|
# Get batch statistics
|
||||||
|
stats = get_batch_stats(requests)
|
||||||
|
print(f"Total tokens: {stats['total_tokens']}")
|
||||||
|
print(f"Num decode: {stats['num_decode']}, Num prefill: {stats['num_prefill']}")
|
||||||
|
|
||||||
|
# Format results
|
||||||
|
formatter = ResultsFormatter()
|
||||||
|
formatter.save_csv(results, "output.csv")
|
||||||
|
formatter.save_json(results, "output.json")
|
||||||
|
```
|
||||||
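As a worked check of the statistics above, the spec `"2q2k_q4s1k_32q1s1k"` can be expanded by hand into `(q_len, kv_len)` pairs and summed. This is a sketch in plain Python that does not import the suite; the helper name `summarize` is hypothetical, and the classification rules follow the README definitions (decode is `q_len == 1`, prefill is `q_len == kv_len`, extend is `1 < q_len < kv_len`).

```python
# (q_len, kv_len) pairs for "2q2k_q4s1k_32q1s1k", written out by hand
pairs = [(2048, 2048)] * 2 + [(4, 1024)] + [(1, 1024)] * 32


def summarize(pairs):
    """Hypothetical helper: count request kinds and sum token totals."""
    return {
        "total_requests": len(pairs),
        "num_decode": sum(1 for q, _ in pairs if q == 1),
        "num_prefill": sum(1 for q, kv in pairs if q == kv and q != 1),
        "num_extend": sum(1 for q, kv in pairs if 1 < q < kv),
        "total_tokens": sum(q for q, _ in pairs),     # new tokens in the batch
        "total_kv_cache": sum(kv for _, kv in pairs),  # KV entries attended to
    }


stats = summarize(pairs)
print(stats["total_tokens"])    # 2*2048 + 4 + 32*1 = 4132
print(stats["total_kv_cache"])  # 2*2048 + 1024 + 32*1024 = 37888
```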
|
|
||||||
|
## Tips
|
||||||
|
|
||||||
|
**1. Warmup matters** - Use `--warmup-iters 10` for stable results
|
||||||
|
|
||||||
|
**2. Multiple repeats** - Use `--repeats 20` for low variance
|
||||||
|
|
||||||
|
**3. Save results** - Always use `--output-csv` or `--output-json`
|
||||||
|
|
||||||
|
**4. Test incrementally** - Start with `--num-layers 1 --repeats 1`
|
||||||
|
|
||||||
|
**5. Extended grammar** - Leverage spec decode, chunked prefill patterns
|
||||||
|
|
||||||
|
**6. Parameter sweeps** - Use `--sweep-param` and `--sweep-values` to find optimal values
|
||||||
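Tips 1 and 2 boil down to a common measurement pattern: discard a few warmup runs, then aggregate several timed repeats. The sketch below shows that pattern with the standard library only; `time_callable` is a hypothetical helper, not the suite's timing code, and it measures host wall-clock time. GPU kernels launch asynchronously, so timing real attention kernels additionally requires device synchronization or CUDA events, which this sketch omits.

```python
import statistics
import time


def time_callable(fn, warmup_iters=10, repeats=20):
    """Hypothetical helper: time fn() after warmup and summarize the samples."""
    for _ in range(warmup_iters):
        fn()  # warmup runs are executed but never recorded

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(samples),
        # stdev needs at least two samples
        "std": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


result = time_callable(lambda: sum(range(10_000)), warmup_iters=3, repeats=5)
print(f"mean={result['mean']:.6f}s std={result['std']:.6f}s")
```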
44
benchmarks/attention_benchmarks/__init__.py
Normal file
@@ -0,0 +1,44 @@
|
|||||||
|
# SPDX-License-Identifier: Apache-2.0
|
||||||
|
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||||
|
|
||||||
|
"""vLLM Attention Benchmarking Suite."""
|
||||||
|
|
||||||
|
from .batch_spec import (
|
||||||
|
BatchRequest,
|
||||||
|
format_batch_spec,
|
||||||
|
get_batch_stats,
|
||||||
|
parse_batch_spec,
|
||||||
|
reorder_for_flashinfer,
|
||||||
|
split_by_type,
|
||||||
|
)
|
||||||
|
from .common import (
|
||||||
|
BenchmarkConfig,
|
||||||
|
BenchmarkResult,
|
||||||
|
MockLayer,
|
||||||
|
MockModelConfig,
|
||||||
|
ResultsFormatter,
|
||||||
|
get_attention_scale,
|
||||||
|
is_mla_backend,
|
||||||
|
setup_mla_dims,
|
||||||
|
)
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
# Batch specification
|
||||||
|
"BatchRequest",
|
||||||
|
"parse_batch_spec",
|
||||||
|
"format_batch_spec",
|
||||||
|
"reorder_for_flashinfer",
|
||||||
|
"split_by_type",
|
||||||
|
"get_batch_stats",
|
||||||
|
# Benchmarking infrastructure
|
||||||
|
"BenchmarkConfig",
|
||||||
|
"BenchmarkResult",
|
||||||
|
"ResultsFormatter",
|
||||||
|
# Mock objects
|
||||||
|
"MockLayer",
|
||||||
|
"MockModelConfig",
|
||||||
|
# Utilities
|
||||||
|
"setup_mla_dims",
|
||||||
|
"get_attention_scale",
|
||||||
|
"is_mla_backend",
|
||||||
|
]
|
||||||
231
benchmarks/attention_benchmarks/batch_spec.py
Normal file
@@ -0,0 +1,231 @@
|
|||||||
|
# SPDX-License-Identifier: Apache-2.0
|
||||||
|
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||||
|
|
||||||
|
"""
|
||||||
|
Simplified batch specification grammar for attention benchmarks.
|
||||||
|
|
||||||
|
Grammar (underscore-separated segments):
|
||||||
|
Format: (<count>?) q<q_len>(k?) (s<seq_len>(k?))?
|
||||||
|
|
||||||
|
- count: Number of identical requests (optional, default=1)
|
||||||
|
- q_len: Query length (number of new tokens)
|
||||||
|
- seq_len: Total sequence length (optional, defaults to q_len for prefill)
|
||||||
|
- 'k' suffix: Multiplies value by 1024
|
||||||
|
|
||||||
|
Common patterns:
|
||||||
|
- Prefill: q_len == seq_len (e.g., "q2k" → 2048 new tokens, 2048 seq)
|
||||||
|
- Decode: q_len == 1 (e.g., "q1s1k" → 1 token, 1024 seq length)
|
||||||
|
- Extend: q_len < seq_len (e.g., "q4s1k" → 4 tokens, 1024 seq length)
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
q2k -> [(2048, 2048)] # Prefill: 2048 tokens
|
||||||
|
q1s1k -> [(1, 1024)] # Decode: 1 token, 1K sequence
|
||||||
|
8q1s1k -> [(1, 1024)] * 8 # 8 decode requests
|
||||||
|
q4s1k -> [(4, 1024)] # 4-token extend (spec decode)
|
||||||
|
2q1k_32q1s1k -> [(1024, 1024)] * 2 + [(1, 1024)] * 32 # Mixed batch
|
||||||
|
16q4s1k -> [(4, 1024)] * 16 # 16 spec decode requests
|
||||||
|
"""
|
||||||
|
|
||||||
|
from collections import Counter
|
||||||
|
from dataclasses import dataclass
|
||||||
|
|
||||||
|
import regex as re
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class BatchRequest:
|
||||||
|
"""Represents a single request in a batch."""
|
||||||
|
|
||||||
|
q_len: int # Query length (number of new tokens)
|
||||||
|
kv_len: int # Total KV cache length
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_decode(self) -> bool:
|
||||||
|
"""True if this is a decode request (q_len == 1)."""
|
||||||
|
return self.q_len == 1
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_prefill(self) -> bool:
|
||||||
|
"""True if this is a pure prefill (q_len == kv_len)."""
|
||||||
|
return self.q_len == self.kv_len
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_extend(self) -> bool:
|
||||||
|
"""True if this is context extension (q_len > 1, kv_len > q_len)."""
|
||||||
|
return self.q_len > 1 and self.kv_len > self.q_len
|
||||||
|
|
||||||
|
@property
|
||||||
|
def context_len(self) -> int:
|
||||||
|
"""Context length (KV cache - query)."""
|
||||||
|
return self.kv_len - self.q_len
|
||||||
|
|
||||||
|
def as_tuple(self) -> tuple[int, int]:
|
||||||
|
"""Return as (q_len, kv_len) tuple for compatibility."""
|
||||||
|
return (self.q_len, self.kv_len)
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_size(size_str: str, k_suffix: str) -> int:
|
||||||
|
"""Parse size string with optional 'k' suffix."""
|
||||||
|
size = int(size_str)
|
||||||
|
return size * 1024 if k_suffix == "k" else size
|
||||||
|
|
||||||
|
|
||||||
|
def parse_batch_spec(spec: str) -> list[BatchRequest]:
|
||||||
|
"""
|
||||||
|
Parse batch specification string into list of BatchRequest objects.
|
||||||
|
|
||||||
|
Grammar: (<count>?) q<q_len>(k?) (s<seq_len>(k?))?
|
||||||
|
|
||||||
|
Args:
|
||||||
|
spec: Batch specification string (see module docstring for grammar)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of BatchRequest objects
|
||||||
|
|
||||||
|
Raises:
|
||||||
|
ValueError: If spec format is invalid
|
||||||
|
"""
|
||||||
|
requests = []
|
||||||
|
|
||||||
|
for seg in spec.split("_"):
|
||||||
|
# Unified pattern: (<count>?) q<q_len>(k?) (s<seq_len>(k?))?
|
||||||
|
m = re.match(r"^(?:(\d+))?q(\d+)(k?)(?:s(\d+)(k?))?$", seg)
|
||||||
|
if m:
|
||||||
|
cnt = int(m.group(1)) if m.group(1) else 1
|
||||||
|
q_len = _parse_size(m.group(2), m.group(3))
|
||||||
|
kv_len = _parse_size(m.group(4), m.group(5)) if m.group(4) else q_len
|
||||||
|
requests.extend([BatchRequest(q_len=q_len, kv_len=kv_len)] * cnt)
|
||||||
|
continue
|
||||||
|
|
||||||
|
raise ValueError(f"Invalid batch spec segment: '{seg}'")
|
||||||
|
|
||||||
|
return requests
|
||||||
|
|
||||||
|
|
||||||
|
def format_batch_spec(requests: list[BatchRequest]) -> str:
|
||||||
|
"""
|
||||||
|
Format list of BatchRequest into human-readable string.
|
||||||
|
|
||||||
|
Groups requests by type and provides counts and sizes.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
requests: List of BatchRequest objects
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Formatted string describing the batch
|
||||||
|
"""
|
||||||
|
kinds = {
|
||||||
|
"prefill": [],
|
||||||
|
"extend": [],
|
||||||
|
"decode": [],
|
||||||
|
}
|
||||||
|
|
||||||
|
for req in requests:
|
||||||
|
tup = (req.q_len, req.kv_len)
|
||||||
|
if req.is_prefill:
|
||||||
|
kinds["prefill"].append(tup)
|
||||||
|
elif req.is_extend:
|
||||||
|
kinds["extend"].append(tup)
|
||||||
|
elif req.is_decode:
|
||||||
|
kinds["decode"].append(tup)
|
||||||
|
|
||||||
|
parts = []
|
||||||
|
for kind in ["prefill", "extend", "decode"]:
|
||||||
|
lst = kinds[kind]
|
||||||
|
if not lst:
|
||||||
|
continue
|
||||||
|
|
||||||
|
cnt_total = len(lst)
|
||||||
|
ctr = Counter(lst)
|
||||||
|
inner = []
|
||||||
|
|
||||||
|
for (q, kv), cnt in ctr.items():
|
||||||
|
if kind == "prefill":
|
||||||
|
size = f"{q // 1024}k" if q % 1024 == 0 else str(q)
|
||||||
|
inner.append(f"{cnt}x{size}")
|
||||||
|
elif kind == "decode":
|
||||||
|
size = f"{kv // 1024}k" if kv % 1024 == 0 else str(kv)
|
||||||
|
inner.append(f"{cnt}x{size}")
|
||||||
|
else: # extend
|
||||||
|
qstr = f"{q // 1024}k" if q % 1024 == 0 else str(q)
|
||||||
|
kstr = f"{kv // 1024}k" if kv % 1024 == 0 else str(kv)
|
||||||
|
inner.append(f"{cnt}xq{qstr}kv{kstr}")
|
||||||
|
|
||||||
|
parts.append(f"{cnt_total} {kind} ({', '.join(inner)})")
|
||||||
|
|
||||||
|
return ", ".join(parts)
|
||||||
|
|
||||||
|
|
||||||
|
def reorder_for_flashinfer(requests: list[BatchRequest]) -> list[BatchRequest]:
|
||||||
|
"""
|
||||||
|
Reorder requests for FlashInfer: decode first, then prefill.
|
||||||
|
|
||||||
|
FlashInfer expects decode requests before prefill requests for
|
||||||
|
optimal performance.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
requests: Original list of BatchRequest
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Reordered list with decode requests first
|
||||||
|
"""
|
||||||
|
decodes = [r for r in requests if r.is_decode]
|
||||||
|
non_decodes = [r for r in requests if not r.is_decode]
|
||||||
|
return decodes + non_decodes
|
||||||
|
|
||||||
|
|
||||||
|
def split_by_type(
|
||||||
|
requests: list[BatchRequest],
|
||||||
|
) -> dict[str, list[BatchRequest]]:
|
||||||
|
"""
|
||||||
|
Split requests by type for analysis.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
requests: List of BatchRequest
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict with keys: 'decode', 'prefill', 'extend'
|
||||||
|
"""
|
||||||
|
result = {
|
||||||
|
"decode": [],
|
||||||
|
"prefill": [],
|
||||||
|
"extend": [],
|
||||||
|
}
|
||||||
|
|
||||||
|
for req in requests:
|
||||||
|
if req.is_decode:
|
||||||
|
result["decode"].append(req)
|
||||||
|
elif req.is_prefill:
|
||||||
|
result["prefill"].append(req)
|
||||||
|
elif req.is_extend:
|
||||||
|
result["extend"].append(req)
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
def get_batch_stats(requests: list[BatchRequest]) -> dict:
|
||||||
|
"""
|
||||||
|
Compute statistics about a batch.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
requests: List of BatchRequest
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict with batch statistics
|
||||||
|
"""
|
||||||
|
by_type = split_by_type(requests)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"total_requests": len(requests),
|
||||||
|
"num_decode": len(by_type["decode"]),
|
||||||
|
"num_prefill": len(by_type["prefill"]),
|
||||||
|
"num_extend": len(by_type["extend"]),
|
||||||
|
"total_tokens": sum(r.q_len for r in requests),
|
||||||
|
"total_kv_cache": sum(r.kv_len for r in requests),
|
||||||
|
"max_q_len": max((r.q_len for r in requests), default=0),
|
||||||
|
"max_kv_len": max((r.kv_len for r in requests), default=0),
|
||||||
|
"avg_q_len": sum(r.q_len for r in requests) / len(requests) if requests else 0,
|
||||||
|
"avg_kv_len": (
|
||||||
|
sum(r.kv_len for r in requests) / len(requests) if requests else 0
|
||||||
|
),
|
||||||
|
}
|
||||||
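Review note on `batch_spec.py`: the three predicates on `BatchRequest` are not mutually exclusive or exhaustive, and the two grouping functions test them in different orders (`format_batch_spec` checks prefill first, `split_by_type` checks decode first). The sketch below reproduces only the predicates in a hypothetical `Req` class to show two edge cases; it is an illustration of the logic in this diff, not a proposed patch.

```python
from dataclasses import dataclass


@dataclass
class Req:  # hypothetical copy of the BatchRequest predicates
    q_len: int
    kv_len: int

    @property
    def is_decode(self) -> bool:
        return self.q_len == 1

    @property
    def is_prefill(self) -> bool:
        return self.q_len == self.kv_len

    @property
    def is_extend(self) -> bool:
        return self.q_len > 1 and self.kv_len > self.q_len


def kind_decode_first(r: Req):
    """Same test order as split_by_type."""
    if r.is_decode:
        return "decode"
    if r.is_prefill:
        return "prefill"
    return "extend" if r.is_extend else None


def kind_prefill_first(r: Req):
    """Same test order as format_batch_spec."""
    if r.is_prefill:
        return "prefill"
    if r.is_extend:
        return "extend"
    return "decode" if r.is_decode else None


one_token = Req(q_len=1, kv_len=1)  # what the spec "q1" parses to
print(kind_decode_first(one_token), kind_prefill_first(one_token))

inverted = Req(q_len=4, kv_len=1)   # what the spec "q4s1" parses to
print(kind_decode_first(inverted), kind_prefill_first(inverted))
```

With these definitions, a `q1` request is counted as decode by one function and prefill by the other, and a spec with `q_len > seq_len` matches the regex but falls into no category, so it is included in `total_requests` while missing from every per-type count. Separately, `[BatchRequest(...)] * cnt` repeats one shared object; that is harmless while requests are treated as read-only, since the dataclass is not frozen.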
886
benchmarks/attention_benchmarks/benchmark.py
Normal file
@@ -0,0 +1,886 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
# SPDX-License-Identifier: Apache-2.0
|
||||||
|
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||||
|
|
||||||
|
"""
|
||||||
|
Universal vLLM Attention Benchmark
|
||||||
|
|
||||||
|
Benchmark any attention backend with the extended grammar.
|
||||||
|
Supports standard attention (Flash/Triton/FlashInfer) and MLA backends.
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
# Standard attention
|
||||||
|
python benchmark.py --backends flash flashinfer --batch-specs "q2k" "8q1s1k"
|
||||||
|
|
||||||
|
# MLA backends
|
||||||
|
python benchmark.py --backends cutlass_mla flashinfer_mla --batch-specs "64q1s1k"
|
||||||
|
|
||||||
|
# Parameter sweep (CLI)
|
||||||
|
python benchmark.py --backend cutlass_mla \
|
||||||
|
--batch-specs "64q1s1k" \
|
||||||
|
--sweep-param num_kv_splits \
|
||||||
|
--sweep-values 1 4 8 16
|
||||||
|
|
||||||
|
# Parameter sweep (YAML config - recommended)
|
||||||
|
python benchmark.py --config configs/cutlass_numsplits.yaml
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import sys
|
||||||
|
from dataclasses import replace
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import yaml
|
||||||
|
from rich.console import Console
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
sys.path.insert(0, str(Path(__file__).parent.parent.parent))
|
||||||
|
|
||||||
|
from batch_spec import parse_batch_spec
|
||||||
|
from common import (
|
||||||
|
BenchmarkConfig,
|
||||||
|
BenchmarkResult,
|
||||||
|
ModelParameterSweep,
|
||||||
|
ParameterSweep,
|
||||||
|
ResultsFormatter,
|
||||||
|
is_mla_backend,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def run_standard_attention_benchmark(config: BenchmarkConfig) -> BenchmarkResult:
|
||||||
|
"""Run standard attention benchmark (Flash/Triton/FlashInfer)."""
|
||||||
|
from runner import run_attention_benchmark
|
||||||
|
|
||||||
|
return run_attention_benchmark(config)
|
||||||
|
|
||||||
|
|
||||||
|
def run_mla_benchmark(config: BenchmarkConfig, **kwargs) -> BenchmarkResult:
|
||||||
|
"""Run MLA benchmark with appropriate backend."""
|
||||||
|
from mla_runner import run_mla_benchmark as run_mla
|
||||||
|
|
||||||
|
return run_mla(config.backend, config, **kwargs)
|
||||||
|
|
||||||
|
|
||||||
|
def run_benchmark(config: BenchmarkConfig, **kwargs) -> BenchmarkResult:
|
||||||
|
"""
|
||||||
|
Run a single benchmark with proper backend selection.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config: BenchmarkConfig with backend, batch_spec, and model params
|
||||||
|
**kwargs: Additional arguments passed to MLA benchmarks
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
BenchmarkResult (may have error field set on failure)
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
if is_mla_backend(config.backend):
|
||||||
|
return run_mla_benchmark(config, **kwargs)
|
||||||
|
else:
|
||||||
|
return run_standard_attention_benchmark(config)
|
||||||
|
except Exception as e:
|
||||||
|
return BenchmarkResult(
|
||||||
|
config=config,
|
||||||
|
mean_time=float("inf"),
|
||||||
|
std_time=0,
|
||||||
|
min_time=float("inf"),
|
||||||
|
max_time=float("inf"),
|
||||||
|
error=str(e),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def run_model_parameter_sweep(
|
||||||
|
backends: list[str],
|
||||||
|
batch_specs: list[str],
|
||||||
|
base_config_args: dict,
|
||||||
|
sweep: ModelParameterSweep,
|
||||||
|
console: Console,
|
||||||
|
) -> list[BenchmarkResult]:
|
||||||
|
"""
|
||||||
|
Run model parameter sweep for given backends and batch specs.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
backends: List of backend names
|
||||||
|
batch_specs: List of batch specifications
|
||||||
|
base_config_args: Base configuration arguments (num_layers, head_dim, etc.)
|
||||||
|
sweep: ModelParameterSweep configuration
|
||||||
|
console: Rich console for output
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of BenchmarkResult objects
|
||||||
|
"""
|
||||||
|
all_results = []
|
||||||
|
|
||||||
|
console.print(
|
||||||
|
f"[yellow]Model sweep mode: testing {sweep.param_name} = {sweep.values}[/]"
|
||||||
|
)
|
||||||
|
|
||||||
|
total = len(backends) * len(batch_specs) * len(sweep.values)
|
||||||
|
|
||||||
|
with tqdm(total=total, desc="Benchmarking") as pbar:
|
||||||
|
for backend in backends:
|
||||||
|
for spec in batch_specs:
|
||||||
|
for value in sweep.values:
|
||||||
|
# Create config with modified model parameter
|
||||||
|
config_args = base_config_args.copy()
|
||||||
|
config_args[sweep.param_name] = value
|
||||||
|
|
||||||
|
# Create config with original backend for running
|
||||||
|
clean_config = BenchmarkConfig(
|
||||||
|
backend=backend, batch_spec=spec, **config_args
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run benchmark
|
||||||
|
result = run_benchmark(clean_config)
|
||||||
|
|
||||||
|
# Replace backend with labeled version for display
|
||||||
|
backend_label = sweep.get_label(backend, value)
|
||||||
|
labeled_config = replace(result.config, backend=backend_label)
|
||||||
|
result = replace(result, config=labeled_config)
|
||||||
|
all_results.append(result)
|
||||||
|
|
||||||
|
if not result.success:
|
||||||
|
console.print(
|
||||||
|
f"[red]Error {backend} {spec} {sweep.param_name}="
|
||||||
|
f"{value}: {result.error}[/]"
|
||||||
|
)
|
||||||
|
|
||||||
|
pbar.update(1)
|
||||||
|
|
||||||
|
# Display sweep results - create separate table for each parameter value
|
||||||
|
console.print("\n[bold green]Model Parameter Sweep Results:[/]")
|
||||||
|
formatter = ResultsFormatter(console)
|
||||||
|
|
||||||
|
# Group results by parameter value and extract backend mapping
|
||||||
|
by_param_value = {}
|
||||||
|
backend_mapping = {} # Maps labeled backend -> original backend
|
||||||
|
|
||||||
|
for r in all_results:
|
||||||
|
# Extract original backend and param value from labeled backend
|
||||||
|
# The label format is: {backend}_{param_name}_{value}
|
||||||
|
# We need to reverse engineer this
|
||||||
|
labeled_backend = r.config.backend
|
||||||
|
|
||||||
|
# Try each backend to find which one this result belongs to
|
||||||
|
for backend in backends:
|
||||||
|
for value in sweep.values:
|
||||||
|
expected_label = sweep.get_label(backend, value)
|
||||||
|
if labeled_backend == expected_label:
|
||||||
|
backend_mapping[labeled_backend] = backend
|
||||||
|
param_value = str(value)
|
||||||
|
|
||||||
|
if param_value not in by_param_value:
|
||||||
|
by_param_value[param_value] = []
|
||||||
|
by_param_value[param_value].append(r)
|
||||||
|
break
|
||||||
|
|
||||||
|
    # Create a table for each parameter value
    sorted_param_values = sorted(
        by_param_value.keys(), key=lambda x: int(x) if x.isdigit() else x
    )

    for param_value in sorted_param_values:
        console.print(f"\n[bold cyan]{sweep.param_name} = {param_value}[/]")
        param_results = by_param_value[param_value]

        # Create modified results with original backend names
        modified_results = []
        for r in param_results:
            # Get the original backend name from our mapping
            original_backend = backend_mapping[r.config.backend]
            modified_config = replace(r.config, backend=original_backend)
            modified_result = replace(r, config=modified_config)
            modified_results.append(modified_result)

        # Print table with original backend names
        formatter.print_table(modified_results, backends, compare_to_fastest=True)

    # Show optimal backend for each (param_value, batch_spec) combination
    console.print(
        f"\n[bold cyan]Optimal backend for each ({sweep.param_name}, batch_spec):[/]"
    )

    # Group by (param_value, batch_spec)
    by_param_and_spec = {}
    for r in all_results:
        if r.success:
            # Find which (backend, value) this result corresponds to
            labeled_backend = r.config.backend
            for backend in backends:
                for value in sweep.values:
                    expected_label = sweep.get_label(backend, value)
                    if labeled_backend == expected_label:
                        param_value = str(value)
                        spec = r.config.batch_spec
                        key = (param_value, spec)

                        if key not in by_param_and_spec:
                            by_param_and_spec[key] = []
                        by_param_and_spec[key].append(r)
                        break

    # Sort by param value then spec
    sorted_keys = sorted(
        by_param_and_spec.keys(),
        key=lambda x: (int(x[0]) if x[0].isdigit() else x[0], x[1]),
    )

    current_param_value = None
    for param_value, spec in sorted_keys:
        # Print header when param value changes
        if param_value != current_param_value:
            console.print(f"\n [bold]{sweep.param_name}={param_value}:[/]")
            current_param_value = param_value

        results = by_param_and_spec[(param_value, spec)]
        best = min(results, key=lambda r: r.mean_time)

        # Extract original backend name using the mapping
        backend_name = backend_mapping[best.config.backend]

        # Show all backends' times for comparison
        times_str = " | ".join(
            [
                f"{backend_mapping[r.config.backend]}: {r.mean_time:.6f}s"
                for r in sorted(results, key=lambda r: r.mean_time)
            ]
        )

        console.print(
            f" {spec:12s} -> [bold green]{backend_name:15s}[/] ({times_str})"
        )

    return all_results


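The display code above recovers the original backend and parameter value by regenerating every label and comparing it with the stored one. A minimal sketch of the same lookup built once as a dictionary follows. It assumes labels use the default `{backend}_{param_name}_{value}` format, and `make_label` and `build_label_index` are hypothetical helpers that are not part of the benchmark.

```python
from itertools import product


def make_label(backend: str, param_name: str, value) -> str:
    # Assumption: the default label format from the sweep configuration.
    return f"{backend}_{param_name}_{value}"


def build_label_index(backends, param_name, values):
    """Map each generated label back to its (backend, value) pair."""
    return {
        make_label(backend, param_name, value): (backend, value)
        for backend, value in product(backends, values)
    }


index = build_label_index(["flash", "triton"], "num_kv_splits", [1, 4])
print(index["triton_num_kv_splits_4"])
```

With such an index, each result needs one dictionary lookup instead of a nested loop over backends and values. Whether that matters depends on the sweep size.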
def run_parameter_sweep(
    backends: list[str],
    batch_specs: list[str],
    base_config_args: dict,
    sweep: ParameterSweep,
    console: Console,
) -> list[BenchmarkResult]:
    """
    Run parameter sweep for given backends and batch specs.

    Args:
        backends: List of backend names
        batch_specs: List of batch specifications
        base_config_args: Base configuration arguments (num_layers, head_dim, etc.)
        sweep: ParameterSweep configuration
        console: Rich console for output

    Returns:
        List of BenchmarkResult objects
    """
    all_results = []

    # Build list of values to sweep (including auto if requested)
    sweep_values = list(sweep.values)
    if sweep.include_auto:
        sweep_values.append("auto")

    console.print(f"[yellow]Sweep mode: testing {sweep.param_name} = {sweep_values}[/]")

    total = len(backends) * len(batch_specs) * len(sweep_values)

    with tqdm(total=total, desc="Benchmarking") as pbar:
        for backend in backends:
            for spec in batch_specs:
                for value in sweep_values:
                    # Create config with original backend for running
                    config = BenchmarkConfig(
                        backend=backend, batch_spec=spec, **base_config_args
                    )

                    # Prepare kwargs for benchmark runner
                    kwargs = {}
                    if value != "auto":
                        kwargs[sweep.param_name] = value

                    # Run benchmark
                    result = run_benchmark(config, **kwargs)

                    # Replace backend with labeled version for display
                    backend_label = sweep.get_label(backend, value)
                    labeled_config = replace(result.config, backend=backend_label)
                    result = replace(result, config=labeled_config)
                    all_results.append(result)

                    if not result.success:
                        console.print(
                            f"[red]Error {backend} {spec} {sweep.param_name}="
                            f"{value}: {result.error}[/]"
                        )

                    pbar.update(1)

    # Display sweep results
    console.print("\n[bold green]Sweep Results:[/]")
    backend_labels = [sweep.get_label(b, v) for b in backends for v in sweep_values]
    formatter = ResultsFormatter(console)
    formatter.print_table(all_results, backend_labels)

    # Show optimal values
    console.print(f"\n[bold cyan]Optimal {sweep.param_name} per batch spec:[/]")
    by_spec = {}
    for r in all_results:
        if r.success:
            spec = r.config.batch_spec
            if spec not in by_spec:
                by_spec[spec] = []
            by_spec[spec].append(r)

    for spec in sorted(by_spec.keys()):
        results = by_spec[spec]
        best = min(results, key=lambda r: r.mean_time)
        console.print(
            f" {spec}: [bold green]{best.config.backend}[/] ({best.mean_time:.6f}s)"
        )

    return all_results


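The "optimal value per batch spec" step at the end of `run_parameter_sweep` groups the successful results by batch spec and keeps the one with the lowest mean time. A self-contained sketch of that selection follows. `Timing` is a hypothetical stand-in for the benchmark's result type, and the timings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    # Hypothetical stand-in for a benchmark result row.
    label: str
    batch_spec: str
    mean_time: float
    success: bool = True


def best_per_spec(rows):
    """Return {batch_spec: fastest successful row}."""
    best = {}
    for row in rows:
        if not row.success:
            continue  # failed runs never compete for "best"
        current = best.get(row.batch_spec)
        if current is None or row.mean_time < current.mean_time:
            best[row.batch_spec] = row
    return best


# Illustrative numbers only, not measured results.
rows = [
    Timing("flash_num_kv_splits_1", "q2k", 0.0021),
    Timing("flash_num_kv_splits_4", "q2k", 0.0017),
    Timing("flash_num_kv_splits_1", "8q1s1k", 0.0009),
    Timing("flash_num_kv_splits_4", "8q1s1k", float("inf"), success=False),
]
winners = best_per_spec(rows)
for spec in sorted(winners):
    print(spec, winners[spec].label)
```

Skipping failed rows before comparing keeps a placeholder time from a failed run out of the ranking, which mirrors the `if r.success` filter in the function above.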
def load_config_from_yaml(config_path: str) -> dict:
    """Load configuration from YAML file."""
    with open(config_path) as f:
        return yaml.safe_load(f)


def generate_batch_specs_from_ranges(ranges: list[dict]) -> list[str]:
    """
    Generate batch specs from range specifications.

    Args:
        ranges: List of range specifications, each containing:
            - template: Batch spec template (e.g., "q{q_len}kv1k")
            - q_len: Dict with start, stop, step, end_inclusive (optional)
            - Other parameters can also be ranges

    Returns:
        List of generated batch spec strings

    Example:
        ranges = [
            {
                "template": "q{q_len}kv1k",
                "q_len": {
                    "start": 1,
                    "stop": 16,
                    "step": 1,
                    "end_inclusive": True  # Optional, defaults to True
                }
            }
        ]
        Returns: ["q1kv1k", "q2kv1k", ..., "q16kv1k"]
    """
    all_specs = []

    for range_spec in ranges:
        template = range_spec.get("template")
        if not template:
            raise ValueError("Range specification must include 'template'")

        # Extract all range parameters from the spec
        range_params = {}
        for key, value in range_spec.items():
            if key == "template":
                continue
            if isinstance(value, dict) and "start" in value:
                # This is a range specification
                start = value["start"]
                stop = value["stop"]
                step = value.get("step", 1)
                # Check if end should be inclusive (default: True)
                end_inclusive = value.get("end_inclusive", True)

                # Adjust stop based on end_inclusive
                if end_inclusive:
                    range_params[key] = list(range(start, stop + 1, step))
                else:
                    range_params[key] = list(range(start, stop, step))
            else:
                # This is a fixed value
                range_params[key] = [value]

        # Generate all combinations (Cartesian product)
        if range_params:
            import itertools

            param_names = list(range_params.keys())
            param_values = [range_params[name] for name in param_names]

            for values in itertools.product(*param_values):
                params = dict(zip(param_names, values))
                spec = template.format(**params)
                all_specs.append(spec)
        else:
            # No parameters, just use template as-is
            all_specs.append(template)

    return all_specs


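The Cartesian-product expansion used by `generate_batch_specs_from_ranges` can be tried in isolation. The sketch below is a reduced stand-alone version, not the function from the diff. `expand` is a hypothetical helper, and the two-parameter template `q{q_len}kv{kv_len}` is an assumption chosen to show more than one varying field.

```python
import itertools


def expand(template: str, **ranges) -> list[str]:
    """Format `template` once per combination of the given value lists."""
    names = list(ranges)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in itertools.product(*(ranges[n] for n in names))
    ]


# Inclusive range 1..4 for q_len (hence the +1), two fixed choices for kv_len.
specs = expand("q{q_len}kv{kv_len}", q_len=range(1, 4 + 1), kv_len=["1k", "2k"])
print(specs)
```

The rightmost parameter varies fastest, so the output is ordered by `q_len` first. The benchmark function yields the same ordering, because it passes the parameters to `itertools.product` in the order they appear in the range specification.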
def main():
    parser = argparse.ArgumentParser(
        description="Universal vLLM attention benchmark",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=__doc__,
    )

    # Config file
    parser.add_argument(
        "--config",
        help="Path to YAML config file (overrides other args)",
    )

    # Backend selection
    parser.add_argument(
        "--backends",
        nargs="+",
        help="Backends to benchmark (flash, triton, flashinfer, cutlass_mla, "
        "flashinfer_mla, flashattn_mla, flashmla)",
    )
    parser.add_argument(
        "--backend",
        help="Single backend (alternative to --backends)",
    )

    # Batch specifications
    parser.add_argument(
        "--batch-specs",
        nargs="+",
        default=["q2k", "8q1s1k"],
        help="Batch specifications using extended grammar",
    )

    # Model config
    parser.add_argument("--num-layers", type=int, default=10, help="Number of layers")
    parser.add_argument("--head-dim", type=int, default=128, help="Head dimension")
    parser.add_argument("--num-q-heads", type=int, default=32, help="Query heads")
    parser.add_argument("--num-kv-heads", type=int, default=8, help="KV heads")
    parser.add_argument("--block-size", type=int, default=16, help="Block size")

    # Benchmark settings
    parser.add_argument("--device", default="cuda:0", help="Device")
    parser.add_argument("--repeats", type=int, default=1, help="Repetitions")
    parser.add_argument("--warmup-iters", type=int, default=3, help="Warmup iterations")
    parser.add_argument("--profile-memory", action="store_true", help="Profile memory")

    # Parameter sweep (use YAML config for advanced sweeps)
    parser.add_argument(
        "--sweep-param",
        help="Parameter name to sweep (e.g., num_kv_splits, reorder_batch_threshold)",
    )
    parser.add_argument(
        "--sweep-values",
        type=int,
        nargs="+",
        help="Values to sweep for the parameter",
    )

    # Output
    parser.add_argument("--output-csv", help="Save to CSV")
    parser.add_argument("--output-json", help="Save to JSON")

    args = parser.parse_args()

    console = Console()
    console.print("[bold cyan]vLLM Attention Benchmark[/]")

    # Load config from YAML if provided
    if args.config:
        console.print(f"[yellow]Loading config from: {args.config}[/]")
        yaml_config = load_config_from_yaml(args.config)

        # Show description if available
        if "description" in yaml_config:
            console.print(f"[dim]{yaml_config['description']}[/]")

        # Override args with YAML values
        # (YAML values replace CLI values and defaults; only the output paths
        # handled further down keep an explicitly set CLI value)
        # Backend(s)
        if "backend" in yaml_config:
            args.backend = yaml_config["backend"]
            args.backends = None
        elif "backends" in yaml_config:
            args.backends = yaml_config["backends"]
            args.backend = None

        # Check for special modes
        if "mode" in yaml_config:
            args.mode = yaml_config["mode"]
        else:
            args.mode = None

        # Batch specs and sizes
        # Support both explicit batch_specs and generated batch_spec_ranges
        if "batch_spec_ranges" in yaml_config:
            # Generate batch specs from ranges
            generated_specs = generate_batch_specs_from_ranges(
                yaml_config["batch_spec_ranges"]
            )
            # Combine with any explicit batch_specs
            if "batch_specs" in yaml_config:
                args.batch_specs = yaml_config["batch_specs"] + generated_specs
            else:
                args.batch_specs = generated_specs
            console.print(
                f"[dim]Generated {len(generated_specs)} batch specs from ranges[/]"
            )
        elif "batch_specs" in yaml_config:
            args.batch_specs = yaml_config["batch_specs"]

        if "batch_sizes" in yaml_config:
            args.batch_sizes = yaml_config["batch_sizes"]
        else:
            args.batch_sizes = None

        # Model config
        if "model" in yaml_config:
            model = yaml_config["model"]
            args.num_layers = model.get("num_layers", args.num_layers)
            args.head_dim = model.get("head_dim", args.head_dim)
            args.num_q_heads = model.get("num_q_heads", args.num_q_heads)
            args.num_kv_heads = model.get("num_kv_heads", args.num_kv_heads)
            args.block_size = model.get("block_size", args.block_size)

        # Benchmark settings
        if "benchmark" in yaml_config:
            bench = yaml_config["benchmark"]
            args.device = bench.get("device", args.device)
            args.repeats = bench.get("repeats", args.repeats)
            args.warmup_iters = bench.get("warmup_iters", args.warmup_iters)
            args.profile_memory = bench.get("profile_memory", args.profile_memory)

        # Parameter sweep configuration
        if "parameter_sweep" in yaml_config:
            sweep_config = yaml_config["parameter_sweep"]
            args.parameter_sweep = ParameterSweep(
                param_name=sweep_config["param_name"],
                values=sweep_config["values"],
                include_auto=sweep_config.get("include_auto", False),
                label_format=sweep_config.get(
                    "label_format", "{backend}_{param_name}_{value}"
                ),
            )
        else:
            args.parameter_sweep = None

        # Model parameter sweep configuration
        if "model_parameter_sweep" in yaml_config:
            sweep_config = yaml_config["model_parameter_sweep"]
            args.model_parameter_sweep = ModelParameterSweep(
                param_name=sweep_config["param_name"],
                values=sweep_config["values"],
                label_format=sweep_config.get(
                    "label_format", "{backend}_{param_name}_{value}"
                ),
            )
        else:
            args.model_parameter_sweep = None

        # Output
        if "output" in yaml_config:
            output = yaml_config["output"]
            if "csv" in output and not args.output_csv:
                args.output_csv = output["csv"]
            if "json" in output and not args.output_json:
                args.output_json = output["json"]

        console.print()

    # Handle CLI-based parameter sweep (if not from YAML)
    if (
        (not hasattr(args, "parameter_sweep") or args.parameter_sweep is None)
        and args.sweep_param
        and args.sweep_values
    ):
        args.parameter_sweep = ParameterSweep(
            param_name=args.sweep_param,
            values=args.sweep_values,
            include_auto=False,
            label_format="{backend}_{param_name}_{value}",
        )

    # Determine backends
    backends = args.backends or ([args.backend] if args.backend else ["flash"])
    console.print(f"Backends: {', '.join(backends)}")
    console.print(f"Batch specs: {', '.join(args.batch_specs)}")
    console.print()

    # Run benchmarks
    all_results = []

    # Handle special mode: decode_vs_prefill comparison
    if hasattr(args, "mode") and args.mode == "decode_vs_prefill":
        console.print("[yellow]Mode: Decode vs Prefill pipeline comparison[/]")
        console.print(
            "[dim]For each query length, testing both decode and prefill pipelines[/]"
        )
        console.print("[dim]Using batched execution for optimal performance[/]")

        # Extract batch sizes from config
        # (args.batch_sizes is None when the YAML file omits batch_sizes,
        # so fall back to a single batch of size 1 in that case as well)
        batch_sizes = getattr(args, "batch_sizes", None) or [1]
        backend = backends[0]  # Use first backend (should only be one)

        # Calculate total benchmarks
        total = len(batch_sizes)

        with tqdm(total=total, desc="Benchmarking") as pbar:
            for batch_size in batch_sizes:
                # Prepare all configs for this batch size
                configs_with_thresholds = []

                for spec in args.batch_specs:
                    # Parse the batch spec to get query length
                    requests = parse_batch_spec(spec)
                    if not requests:
                        console.print(
                            f"[red]Error: Could not parse batch spec '{spec}'[/]"
                        )
                        continue

                    # Get query length from first request
                    query_length = requests[0].q_len

                    # Create batch spec for this batch size
                    # For batch_size > 1, we need to prepend the count
                    batch_spec = f"{batch_size}{spec}" if batch_size > 1 else spec

                    # Create base config (without backend name)
                    base_config = BenchmarkConfig(
                        backend=backend,  # Will be overridden later
                        batch_spec=batch_spec,
                        num_layers=args.num_layers,
                        head_dim=args.head_dim,
                        num_q_heads=args.num_q_heads,
                        num_kv_heads=args.num_kv_heads,
                        block_size=args.block_size,
                        device=args.device,
                        repeats=args.repeats,
                        warmup_iters=args.warmup_iters,
                        profile_memory=args.profile_memory,
                    )

                    # Add decode pipeline config
                    decode_threshold = query_length
                    config_decode = replace(
                        base_config,
                        backend=f"{backend}_decode_qlen{query_length}_bs{batch_size}",
                    )
                    configs_with_thresholds.append((config_decode, decode_threshold))

                    # Add prefill pipeline config if query_length > 1
                    if query_length > 1:
                        prefill_threshold = query_length - 1
                        config_prefill = replace(
                            base_config,
                            backend=f"{backend}_prefill_qlen{query_length}"
                            f"_bs{batch_size}",
                        )
                        configs_with_thresholds.append(
                            (config_prefill, prefill_threshold)
                        )

                # Run all benchmarks for this batch size in one go (batched mode)
                try:
                    from mla_runner import run_mla_benchmark as run_mla

                    # Use batched API: pass list of (config, threshold) tuples
                    timing_results = run_mla(backend, configs_with_thresholds)

                    # Create BenchmarkResult objects from timing results
                    for (config, _), timing in zip(
                        configs_with_thresholds, timing_results
                    ):
                        result = BenchmarkResult(
                            config=config,
                            mean_time=timing["mean"],
                            std_time=timing["std"],
                            min_time=timing["min"],
                            max_time=timing["max"],
                            throughput_tokens_per_sec=timing.get("throughput", None),
                        )
                        all_results.append(result)

                except Exception as e:
                    import traceback

                    console.print(
                        f"[red]Error running batched benchmarks for "
                        f"batch_size={batch_size}: {e}[/]"
                    )
                    console.print("[red]Traceback:[/]")
                    traceback.print_exc()
                    # Add error results for all configs
                    for config, _ in configs_with_thresholds:
                        result = BenchmarkResult(
                            config=config,
                            mean_time=float("inf"),
                            std_time=0,
                            min_time=float("inf"),
                            max_time=float("inf"),
                            error=str(e),
                        )
                        all_results.append(result)

                pbar.update(1)

        # Display decode vs prefill results
        console.print("\n[bold green]Decode vs Prefill Results:[/]")

        # Group by batch size
        by_batch_size = {}
        for r in all_results:
            if r.success:
                # Extract batch size from backend name
                parts = r.config.backend.split("_")
                bs_part = [p for p in parts if p.startswith("bs")]
                if bs_part:
                    bs = int(bs_part[0][2:])
                    if bs not in by_batch_size:
                        by_batch_size[bs] = []
                    by_batch_size[bs].append(r)

        # For each batch size, analyze crossover point
        for bs in sorted(by_batch_size.keys()):
            console.print(f"\n[bold cyan]Batch size: {bs}[/]")
            results = by_batch_size[bs]

            # Group by query length
            by_qlen = {}
            for r in results:
                parts = r.config.backend.split("_")
                qlen_part = [p for p in parts if p.startswith("qlen")]
                if qlen_part:
                    qlen = int(qlen_part[0][4:])
                    if qlen not in by_qlen:
                        by_qlen[qlen] = {}

                    pipeline = "decode" if "decode" in r.config.backend else "prefill"
                    by_qlen[qlen][pipeline] = r

            # Find crossover point
            last_decode_faster = None
            for qlen in sorted(by_qlen.keys()):
                pipelines = by_qlen[qlen]
                if "decode" in pipelines and "prefill" in pipelines:
                    decode_time = pipelines["decode"].mean_time
                    prefill_time = pipelines["prefill"].mean_time
                    faster = "decode" if decode_time < prefill_time else "prefill"

                    speedup = (
                        prefill_time / decode_time
                        if decode_time < prefill_time
                        else decode_time / prefill_time
                    )

                    console.print(
                        f" qlen={qlen:3d}: decode={decode_time:.6f}s, "
                        f"prefill={prefill_time:.6f}s -> "
                        f"[bold]{faster}[/] ({speedup:.2f}x)"
                    )

                    if faster == "decode":
                        last_decode_faster = qlen

            if last_decode_faster is not None:
                optimal_threshold = last_decode_faster
                console.print(
                    f"\n [bold green]Optimal threshold for batch_size={bs}: "
                    f"{optimal_threshold}[/]"
                )
                console.print(
                    f" [dim](Use decode pipeline for query_length <= "
                    f"{optimal_threshold})[/]"
                )
            else:
                console.print(
                    f"\n [yellow]Prefill always faster for batch_size={bs}[/]"
                )

    # Handle model parameter sweep mode
    elif hasattr(args, "model_parameter_sweep") and args.model_parameter_sweep:
        # Model parameter sweep
        base_config_args = {
            "num_layers": args.num_layers,
            "head_dim": args.head_dim,
            "num_q_heads": args.num_q_heads,
            "num_kv_heads": args.num_kv_heads,
            "block_size": args.block_size,
            "device": args.device,
            "repeats": args.repeats,
            "warmup_iters": args.warmup_iters,
            "profile_memory": args.profile_memory,
        }
        all_results = run_model_parameter_sweep(
            backends,
            args.batch_specs,
            base_config_args,
            args.model_parameter_sweep,
            console,
        )

    # Handle parameter sweep mode (unified)
    elif hasattr(args, "parameter_sweep") and args.parameter_sweep:
        # Unified parameter sweep
        base_config_args = {
            "num_layers": args.num_layers,
            "head_dim": args.head_dim,
            "num_q_heads": args.num_q_heads,
            "num_kv_heads": args.num_kv_heads,
            "block_size": args.block_size,
            "device": args.device,
            "repeats": args.repeats,
            "warmup_iters": args.warmup_iters,
            "profile_memory": args.profile_memory,
        }
        all_results = run_parameter_sweep(
            backends, args.batch_specs, base_config_args, args.parameter_sweep, console
        )

    else:
        # Normal mode: compare backends
        total = len(backends) * len(args.batch_specs)

        with tqdm(total=total, desc="Benchmarking") as pbar:
            for spec in args.batch_specs:
                for backend in backends:
                    config = BenchmarkConfig(
                        backend=backend,
                        batch_spec=spec,
                        num_layers=args.num_layers,
                        head_dim=args.head_dim,
                        num_q_heads=args.num_q_heads,
                        num_kv_heads=args.num_kv_heads,
                        block_size=args.block_size,
                        device=args.device,
                        repeats=args.repeats,
                        warmup_iters=args.warmup_iters,
                        profile_memory=args.profile_memory,
                    )

                    result = run_benchmark(config)
                    all_results.append(result)

                    if not result.success:
                        console.print(f"[red]Error {backend} {spec}: {result.error}[/]")

                    pbar.update(1)

        # Display results
        console.print("\n[bold green]Results:[/]")
        formatter = ResultsFormatter(console)
        formatter.print_table(all_results, backends)

    # Save results
    if all_results:
        formatter = ResultsFormatter(console)
        if args.output_csv:
            formatter.save_csv(all_results, args.output_csv)
        if args.output_json:
            formatter.save_json(all_results, args.output_json)


if __name__ == "__main__":
    main()
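The `decode_vs_prefill` mode in `main()` reports, per batch size, the largest query length at which the decode pipeline was still faster than prefill. A stand-alone sketch of that rule follows. `decode_threshold` is a hypothetical helper, and the timings are invented for illustration rather than measured.

```python
from __future__ import annotations


def decode_threshold(timings: dict[int, tuple[float, float]]) -> int | None:
    """Return the largest query length at which decode beat prefill.

    `timings` maps query length -> (decode_seconds, prefill_seconds).
    Returns None when prefill was at least as fast at every length.
    """
    last_decode_faster = None
    for qlen in sorted(timings):
        decode_time, prefill_time = timings[qlen]
        if decode_time < prefill_time:
            last_decode_faster = qlen
    return last_decode_faster


# Made-up timings for illustration only.
sample = {2: (1.0e-4, 1.4e-4), 4: (1.3e-4, 1.5e-4), 8: (2.1e-4, 1.8e-4)}
print(decode_threshold(sample))
```

As in the benchmark, the rule keeps the last length where decode wins even if the measurements are not monotonic, so noisy timings can push the reported threshold higher than a strict first-crossover rule would.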
503
benchmarks/attention_benchmarks/common.py
Normal file
@@ -0,0 +1,503 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""Common utilities for attention benchmarking."""

import csv
import json
import math
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any

import numpy as np
import torch
from rich.console import Console
from rich.table import Table

# Mock classes for vLLM attention infrastructure


class MockHfConfig:
    """Mock HuggingFace config that satisfies vLLM's requirements."""

    def __init__(self, mla_dims: dict):
        self.num_attention_heads = mla_dims["num_q_heads"]
        self.num_key_value_heads = mla_dims["num_kv_heads"]
        self.hidden_size = mla_dims["head_dim"] * mla_dims["num_q_heads"]
        self.model_type = "deepseek_v2"
        self.is_encoder_decoder = False
        self.kv_lora_rank = mla_dims["kv_lora_rank"]
        self.qk_nope_head_dim = mla_dims["qk_nope_head_dim"]
        self.qk_rope_head_dim = mla_dims["qk_rope_head_dim"]
        self.v_head_dim = mla_dims["v_head_dim"]
        self.qk_head_dim = mla_dims["qk_nope_head_dim"] + mla_dims["qk_rope_head_dim"]

    def get_text_config(self):
        return self


# Import AttentionLayerBase at module level to avoid circular dependencies
try:
    from vllm.model_executor.layers.attention_layer_base import AttentionLayerBase

    _HAS_ATTENTION_LAYER_BASE = True
except ImportError:
    _HAS_ATTENTION_LAYER_BASE = False
    AttentionLayerBase = object  # Fallback


class MockKVBProj:
    """Mock KV projection layer for MLA prefill mode.

    Mimics ColumnParallelLinear behavior for kv_b_proj in MLA backends.
    Projects kv_c_normed to [qk_nope_head_dim + v_head_dim] per head.
    """

    def __init__(self, num_heads: int, qk_nope_head_dim: int, v_head_dim: int):
        self.num_heads = num_heads
        self.qk_nope_head_dim = qk_nope_head_dim
        self.v_head_dim = v_head_dim
        self.out_dim = qk_nope_head_dim + v_head_dim

    def __call__(self, x: torch.Tensor) -> tuple[torch.Tensor]:
        """
        Project kv_c_normed to output space.

        Args:
            x: Input tensor [num_tokens, kv_lora_rank]

        Returns:
            Tuple containing output tensor
            [num_tokens, num_heads, qk_nope_head_dim + v_head_dim]
        """
        num_tokens = x.shape[0]
        result = torch.randn(
            num_tokens,
            self.num_heads,
            self.out_dim,
            device=x.device,
            dtype=x.dtype,
        )
        return (result,)  # Return as tuple to match ColumnParallelLinear API


class MockLayer(AttentionLayerBase):
    """Mock attention layer with scale parameters and impl.

    Inherits from AttentionLayerBase so it passes isinstance checks
    in get_layers_from_vllm_config when FlashInfer prefill is enabled.
    """

    def __init__(self, device: torch.device, impl=None, kv_cache_spec=None):
        # Don't call super().__init__() as AttentionLayerBase doesn't have __init__
        self._k_scale = torch.tensor(1.0, device=device)
        self._v_scale = torch.tensor(1.0, device=device)
        self._q_scale = torch.tensor(1.0, device=device)
        # Scalar floats for kernels that need them
        self._k_scale_float = float(self._k_scale.item())
        self._v_scale_float = float(self._v_scale.item())
        self._q_scale_float = float(self._q_scale.item())
        # AttentionImpl for metadata builders to query
        self.impl = impl
        # KV cache spec for get_kv_cache_spec
        self._kv_cache_spec = kv_cache_spec

    def get_attn_backend(self):
        """Get the attention backend class (required by AttentionLayerBase)."""
        # Return None as this is just a mock layer for benchmarking
        return None

    def get_kv_cache_spec(self):
        """Get the KV cache spec (required by AttentionLayerBase)."""
        return self._kv_cache_spec


class MockModelConfig:
    """Mock model configuration."""

    def __init__(
        self,
        num_q_heads: int,
        num_kv_heads: int,
        head_dim: int,
        dtype: torch.dtype = torch.float16,
        max_model_len: int = 32768,
    ):
        self._n_q = num_q_heads
        self._n_kv = num_kv_heads
        self._d = head_dim
        self.dtype = dtype
        self.max_model_len = max_model_len

    def get_num_attention_heads(self, _=None) -> int:
        return self._n_q

    def get_num_kv_heads(self, _=None) -> int:
        return self._n_kv

    def get_head_size(self) -> int:
        return self._d

    def get_num_layers(self) -> int:
        """Mock method for layer count queries."""
        return 1

    def get_sliding_window_for_layer(self, _layer_idx: int):
        """Mock method for sliding window queries."""
        return None

    def get_logits_soft_cap_for_layer(self, _layer_idx: int):
        """Mock method for logits soft cap queries."""
        return None

    def get_sm_scale_for_layer(self, _layer_idx: int) -> float:
        """Mock method for SM scale queries."""
        return 1.0 / (self.get_head_size() ** 0.5)


class MockParallelConfig:
|
||||||
|
"""Mock parallel configuration."""
|
||||||
|
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class MockCompilationConfig:
|
||||||
|
"""Mock compilation configuration."""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.full_cuda_graph = False
|
||||||
|
self.static_forward_context = {}
|
||||||
|
|
||||||
|
|
||||||
|
class MockVLLMConfig:
|
||||||
|
"""Mock VLLM configuration."""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.compilation_config = MockCompilationConfig()
|
||||||
|
|
||||||
|
|
||||||
|
class MockRunner:
|
||||||
|
"""Mock GPU runner for metadata builders."""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
seq_lens: np.ndarray,
|
||||||
|
query_start_locs: np.ndarray,
|
||||||
|
device: torch.device,
|
||||||
|
num_q_heads: int,
|
||||||
|
num_kv_heads: int,
|
||||||
|
head_dim: int,
|
||||||
|
dtype: torch.dtype,
|
||||||
|
):
|
||||||
|
self.model_config = MockModelConfig(num_q_heads, num_kv_heads, head_dim, dtype)
|
||||||
|
self.parallel_config = MockParallelConfig()
|
||||||
|
self.vllm_config = MockVLLMConfig()
|
||||||
|
self.seq_lens_np = seq_lens
|
||||||
|
self.query_start_loc_np = query_start_locs
|
||||||
|
self.device = device
|
||||||
|
self.attention_chunk_size = None
|
||||||
|
self.num_query_heads = num_q_heads
|
||||||
|
self.num_kv_heads = num_kv_heads
|
||||||
|
self.dtype = dtype
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ParameterSweep:
|
||||||
|
"""Configuration for sweeping a backend parameter."""
|
||||||
|
|
||||||
|
param_name: str # Name of the backend parameter to sweep
|
||||||
|
values: list[Any] # List of values to test
|
||||||
|
include_auto: bool = False # Also test with param unset (auto mode)
|
||||||
|
label_format: str = "{backend}_{param_name}_{value}" # Result label template
|
||||||
|
|
||||||
|
def get_label(self, backend: str, value: Any) -> str:
|
||||||
|
"""Generate a label for a specific parameter value."""
|
||||||
|
return self.label_format.format(
|
||||||
|
backend=backend, param_name=self.param_name, value=value
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ModelParameterSweep:
|
||||||
|
"""Configuration for sweeping a model configuration parameter."""
|
||||||
|
|
||||||
|
param_name: str # Name of the model config parameter to sweep (e.g., "num_q_heads")
|
||||||
|
values: list[Any] # List of values to test
|
||||||
|
label_format: str = "{backend}_{param_name}_{value}" # Result label template
|
||||||
|
|
||||||
|
def get_label(self, backend: str, value: Any) -> str:
|
||||||
|
"""Generate a label for a specific parameter value."""
|
||||||
|
return self.label_format.format(
|
||||||
|
backend=backend, param_name=self.param_name, value=value
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class BenchmarkConfig:
|
||||||
|
"""Configuration for a single benchmark run."""
|
||||||
|
|
||||||
|
backend: str
|
||||||
|
batch_spec: str
|
||||||
|
num_layers: int
|
||||||
|
head_dim: int
|
||||||
|
num_q_heads: int
|
||||||
|
num_kv_heads: int
|
||||||
|
block_size: int
|
||||||
|
device: str
|
||||||
|
dtype: torch.dtype = torch.float16
|
||||||
|
repeats: int = 1
|
||||||
|
warmup_iters: int = 3
|
||||||
|
profile_memory: bool = False
|
||||||
|
use_cuda_graphs: bool = False
|
||||||
|
|
||||||
|
# MLA-specific
|
||||||
|
kv_lora_rank: int | None = None
|
||||||
|
qk_nope_head_dim: int | None = None
|
||||||
|
qk_rope_head_dim: int | None = None
|
||||||
|
v_head_dim: int | None = None
|
||||||
|
|
||||||
|
# Backend-specific tuning
|
||||||
|
num_kv_splits: int | None = None # CUTLASS MLA
|
||||||
|
reorder_batch_threshold: int | None = None # FlashAttn MLA, FlashMLA
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class BenchmarkResult:
|
||||||
|
"""Results from a single benchmark run."""
|
||||||
|
|
||||||
|
config: BenchmarkConfig
|
||||||
|
mean_time: float # seconds
|
||||||
|
std_time: float # seconds
|
||||||
|
min_time: float # seconds
|
||||||
|
max_time: float # seconds
|
||||||
|
throughput_tokens_per_sec: float | None = None
|
||||||
|
memory_allocated_mb: float | None = None
|
||||||
|
memory_reserved_mb: float | None = None
|
||||||
|
error: str | None = None
|
||||||
|
|
||||||
|
@property
|
||||||
|
def success(self) -> bool:
|
||||||
|
"""Whether benchmark completed successfully."""
|
||||||
|
return self.error is None
|
||||||
|
|
||||||
|
def to_dict(self) -> dict[str, Any]:
|
||||||
|
"""Convert to dictionary for serialization."""
|
||||||
|
return {
|
||||||
|
"config": asdict(self.config),
|
||||||
|
"mean_time": self.mean_time,
|
||||||
|
"std_time": self.std_time,
|
||||||
|
"min_time": self.min_time,
|
||||||
|
"max_time": self.max_time,
|
||||||
|
"throughput_tokens_per_sec": self.throughput_tokens_per_sec,
|
||||||
|
"memory_allocated_mb": self.memory_allocated_mb,
|
||||||
|
"memory_reserved_mb": self.memory_reserved_mb,
|
||||||
|
"error": self.error,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class ResultsFormatter:
|
||||||
|
"""Format and display benchmark results."""
|
||||||
|
|
||||||
|
def __init__(self, console: Console | None = None):
|
||||||
|
self.console = console or Console()
|
||||||
|
|
||||||
|
def print_table(
|
||||||
|
self,
|
||||||
|
results: list[BenchmarkResult],
|
||||||
|
backends: list[str],
|
||||||
|
compare_to_fastest: bool = True,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Print results as a rich table.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
results: List of BenchmarkResult
|
||||||
|
backends: List of backend names being compared
|
||||||
|
compare_to_fastest: Show percentage comparison to fastest
|
||||||
|
"""
|
||||||
|
# Group by batch spec
|
||||||
|
by_spec = {}
|
||||||
|
for r in results:
|
||||||
|
spec = r.config.batch_spec
|
||||||
|
if spec not in by_spec:
|
||||||
|
by_spec[spec] = {}
|
||||||
|
by_spec[spec][r.config.backend] = r
|
||||||
|
|
||||||
|
# Create shortened backend names for display
|
||||||
|
def shorten_backend_name(name: str) -> str:
|
||||||
|
"""Shorten long backend names for table display."""
|
||||||
|
# Abbreviate common backend names
|
||||||
|
name = name.replace("flashattn_mla", "famla")
|
||||||
|
name = name.replace("flashinfer_mla", "fimla")
|
||||||
|
name = name.replace("flashmla", "fmla")
|
||||||
|
name = name.replace("cutlass_mla", "cmla")
|
||||||
|
name = name.replace("numsplits", "ns")
|
||||||
|
return name
|
||||||
|
|
||||||
|
table = Table(title="Attention Benchmark Results")
|
||||||
|
table.add_column("Batch\nSpec", no_wrap=True)
|
||||||
|
|
||||||
|
multi = len(backends) > 1
|
||||||
|
for backend in backends:
|
||||||
|
short_name = shorten_backend_name(backend)
|
||||||
|
# Time column
|
||||||
|
col_time = f"{short_name}\nTime (s)"
|
||||||
|
table.add_column(col_time, justify="right", no_wrap=False)
|
||||||
|
if multi and compare_to_fastest:
|
||||||
|
# Relative performance column
|
||||||
|
col_rel = f"{short_name}\nvs Best"
|
||||||
|
table.add_column(col_rel, justify="right", no_wrap=False)
|
||||||
|
|
||||||
|
# Add rows
|
||||||
|
for spec in sorted(by_spec.keys()):
|
||||||
|
spec_results = by_spec[spec]
|
||||||
|
times = {b: r.mean_time for b, r in spec_results.items() if r.success}
|
||||||
|
best_time = min(times.values()) if times else 0.0
|
||||||
|
|
||||||
|
row = [spec]
|
||||||
|
for backend in backends:
|
||||||
|
if backend in spec_results:
|
||||||
|
r = spec_results[backend]
|
||||||
|
if r.success:
|
||||||
|
row.append(f"{r.mean_time:.6f}")
|
||||||
|
if multi and compare_to_fastest:
|
||||||
|
pct = (
|
||||||
|
(r.mean_time / best_time * 100) if best_time > 0 else 0
|
||||||
|
)
|
||||||
|
pct_str = f"{pct:.1f}%"
|
||||||
|
if r.mean_time == best_time:
|
||||||
|
pct_str = f"[bold green]{pct_str}[/]"
|
||||||
|
row.append(pct_str)
|
||||||
|
else:
|
||||||
|
row.append("[red]ERROR[/]")
|
||||||
|
if multi and compare_to_fastest:
|
||||||
|
row.append("-")
|
||||||
|
else:
|
||||||
|
row.append("-")
|
||||||
|
if multi and compare_to_fastest:
|
||||||
|
row.append("-")
|
||||||
|
|
||||||
|
table.add_row(*row)
|
||||||
|
|
||||||
|
self.console.print(table)
|
||||||
|
|
||||||
|
def save_csv(self, results: list[BenchmarkResult], path: str):
|
||||||
|
"""Save results to CSV file."""
|
||||||
|
if not results:
|
||||||
|
return
|
||||||
|
|
||||||
|
path_obj = Path(path)
|
||||||
|
path_obj.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
with open(path, "w", newline="") as f:
|
||||||
|
writer = csv.DictWriter(
|
||||||
|
f,
|
||||||
|
fieldnames=[
|
||||||
|
"backend",
|
||||||
|
"batch_spec",
|
||||||
|
"num_layers",
|
||||||
|
"mean_time",
|
||||||
|
"std_time",
|
||||||
|
"throughput",
|
||||||
|
"memory_mb",
|
||||||
|
],
|
||||||
|
)
|
||||||
|
writer.writeheader()
|
||||||
|
for r in results:
|
||||||
|
writer.writerow(
|
||||||
|
{
|
||||||
|
"backend": r.config.backend,
|
||||||
|
"batch_spec": r.config.batch_spec,
|
||||||
|
"num_layers": r.config.num_layers,
|
||||||
|
"mean_time": r.mean_time,
|
||||||
|
"std_time": r.std_time,
|
||||||
|
"throughput": r.throughput_tokens_per_sec or 0,
|
||||||
|
"memory_mb": r.memory_allocated_mb or 0,
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
self.console.print(f"[green]Saved CSV results to {path}[/]")
|
||||||
|
|
||||||
|
def save_json(self, results: list[BenchmarkResult], path: str):
|
||||||
|
"""Save results to JSON file."""
|
||||||
|
path_obj = Path(path)
|
||||||
|
path_obj.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
data = [r.to_dict() for r in results]
|
||||||
|
with open(path, "w") as f:
|
||||||
|
json.dump(data, f, indent=2, default=str)
|
||||||
|
|
||||||
|
self.console.print(f"[green]Saved JSON results to {path}[/]")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_mla_dims(model_name: str = "deepseek-v3") -> dict:
|
||||||
|
"""
|
||||||
|
Get MLA dimensions for known models.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_name: Model identifier
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict with MLA dimension configuration
|
||||||
|
"""
|
||||||
|
configs = {
|
||||||
|
"deepseek-v2": {
|
||||||
|
"kv_lora_rank": 512,
|
||||||
|
"qk_nope_head_dim": 128,
|
||||||
|
"qk_rope_head_dim": 64,
|
||||||
|
"v_head_dim": 128,
|
||||||
|
"num_q_heads": 128,
|
||||||
|
"num_kv_heads": 1,
|
||||||
|
"head_dim": 576,
|
||||||
|
},
|
||||||
|
"deepseek-v3": {
|
||||||
|
"kv_lora_rank": 512,
|
||||||
|
"qk_nope_head_dim": 128,
|
||||||
|
"qk_rope_head_dim": 64,
|
||||||
|
"v_head_dim": 128,
|
||||||
|
"num_q_heads": 128,
|
||||||
|
"num_kv_heads": 1,
|
||||||
|
"head_dim": 576,
|
||||||
|
},
|
||||||
|
"deepseek-v2-lite": {
|
||||||
|
"kv_lora_rank": 512,
|
||||||
|
"qk_nope_head_dim": 128,
|
||||||
|
"qk_rope_head_dim": 64,
|
||||||
|
"v_head_dim": 128,
|
||||||
|
"num_q_heads": 16,
|
||||||
|
"num_kv_heads": 1,
|
||||||
|
"head_dim": 576,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
if model_name not in configs:
|
||||||
|
raise ValueError(
|
||||||
|
f"Unknown model '{model_name}'. Known models: {list(configs.keys())}"
|
||||||
|
)
|
||||||
|
|
||||||
|
return configs[model_name]
|
||||||
|
|
||||||
|
|
||||||
|
def get_attention_scale(head_dim: int) -> float:
|
||||||
|
"""Compute attention scale factor (1/sqrt(d))."""
|
||||||
|
return 1.0 / math.sqrt(head_dim)
|
||||||
|
|
||||||
|
|
||||||
|
def is_mla_backend(backend: str) -> bool:
|
||||||
|
"""
|
||||||
|
Check if backend is an MLA backend using the backend class's is_mla() method.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
backend: Backend name (e.g., "CUTLASS_MLA", "FLASHINFER_MLA")
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
True if the backend is an MLA backend, False otherwise
|
||||||
|
"""
|
||||||
|
from vllm.v1.attention.backends.registry import AttentionBackendEnum
|
||||||
|
|
||||||
|
try:
|
||||||
|
backend_class = AttentionBackendEnum[backend.upper()].get_class()
|
||||||
|
return backend_class.is_mla()
|
||||||
|
except (KeyError, ValueError, ImportError):
|
||||||
|
return False
|
||||||
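The "vs Best" column that `ResultsFormatter.print_table` builds in `common.py` can be illustrated in isolation. The following is a minimal sketch, not the benchmark's own code: `relative_to_best` is a hypothetical helper name, and the timings are invented for illustration only.

```python
def relative_to_best(mean_times: dict[str, float]) -> dict[str, float]:
    """Return each backend's mean time as a percentage of the fastest one."""
    if not mean_times:
        return {}
    best = min(mean_times.values())
    # Guard against a zero best time, mirroring the check in print_table.
    return {
        name: (t / best * 100.0) if best > 0 else 0.0
        for name, t in mean_times.items()
    }


# Hypothetical timings in seconds, for illustration only (not measured data).
example = relative_to_best({"cutlass_mla": 0.002, "flashmla": 0.003})
print({name: f"{pct:.1f}%" for name, pct in example.items()})
```

The fastest backend always lands at 100%, and larger percentages mean slower runs, which is why the table highlights the 100% entry.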
61
benchmarks/attention_benchmarks/configs/mla_decode.yaml
Normal file
@@ -0,0 +1,61 @@
|
|||||||
|
# MLA decode-only benchmark configuration
|
||||||
|
|
||||||
|
model:
|
||||||
|
name: "deepseek-v3"
|
||||||
|
num_layers: 60
|
||||||
|
num_q_heads: 128
|
||||||
|
num_kv_heads: 1 # MLA uses single latent KV
|
||||||
|
head_dim: 576
|
||||||
|
kv_lora_rank: 512
|
||||||
|
qk_nope_head_dim: 128
|
||||||
|
qk_rope_head_dim: 64
|
||||||
|
v_head_dim: 128
|
||||||
|
block_size: 128 # CUTLASS MLA and FlashAttn MLA use 128
|
||||||
|
|
||||||
|
batch_specs:
|
||||||
|
# Small batches, varying sequence lengths
|
||||||
|
- "16q1s512" # 16 requests, 512 KV cache
|
||||||
|
- "16q1s1k" # 16 requests, 1k KV cache
|
||||||
|
- "16q1s2k" # 16 requests, 2k KV cache
|
||||||
|
- "16q1s4k" # 16 requests, 4k KV cache
|
||||||
|
|
||||||
|
# Medium batches
|
||||||
|
- "32q1s1k" # 32 requests, 1k KV cache
|
||||||
|
- "32q1s2k" # 32 requests, 2k KV cache
|
||||||
|
- "32q1s4k" # 32 requests, 4k KV cache
|
||||||
|
- "32q1s8k" # 32 requests, 8k KV cache
|
||||||
|
|
||||||
|
# Large batches
|
||||||
|
- "64q1s1k" # 64 requests, 1k KV cache
|
||||||
|
- "64q1s2k" # 64 requests, 2k KV cache
|
||||||
|
- "64q1s4k" # 64 requests, 4k KV cache
|
||||||
|
- "64q1s8k" # 64 requests, 8k KV cache
|
||||||
|
|
||||||
|
# Very large batches
|
||||||
|
- "128q1s1k" # 128 requests, 1k KV cache
|
||||||
|
- "128q1s2k" # 128 requests, 2k KV cache
|
||||||
|
|
||||||
|
# Long context
|
||||||
|
- "32q1s16k" # 32 requests, 16k KV cache
|
||||||
|
- "32q1s32k" # 32 requests, 32k KV cache
|
||||||
|
|
||||||
|
backends:
|
||||||
|
- cutlass_mla
|
||||||
|
- flashinfer_mla
|
||||||
|
- flashattn_mla # Hopper only
|
||||||
|
- flashmla # Hopper only
|
||||||
|
|
||||||
|
device: "cuda:0"
|
||||||
|
repeats: 5
|
||||||
|
warmup_iters: 3
|
||||||
|
profile_memory: true
|
||||||
|
|
||||||
|
# Backend-specific tuning
|
||||||
|
cutlass_mla:
|
||||||
|
num_kv_splits: auto # or specific value like 4, 8, 16
|
||||||
|
|
||||||
|
flashattn_mla:
|
||||||
|
reorder_batch_threshold: 512
|
||||||
|
|
||||||
|
flashmla:
|
||||||
|
reorder_batch_threshold: 1
|
||||||
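The batch spec strings above follow a compact grammar that the comments only describe by example. The sketch below is an assumption-laden reading of that grammar (`[count]q<len>[k][s<len>[k]]`, segments joined by `_`, `k` meaning 1024); it is not the repository's `parse_batch_spec`, and `sketch_parse` is a hypothetical name. It deliberately does not handle the `kv` form used in some mixed-batch specs.

```python
import re

# Assumed grammar, inferred from the YAML comments: [count]q<len>[k][s<len>[k]]
_SEGMENT = re.compile(r"^(\d+)?q(\d+)(k)?(?:s(\d+)(k)?)?$")


def _tokens(number: str, has_k: bool) -> int:
    """Convert e.g. ("2", True) to 2048; "k" is taken to mean 1024."""
    return int(number) * (1024 if has_k else 1)


def sketch_parse(spec: str) -> list[tuple[int, int, int]]:
    """Return (num_requests, q_len, seq_len) for each '_'-separated segment."""
    out = []
    for segment in spec.split("_"):
        m = _SEGMENT.match(segment)
        if m is None:
            raise ValueError(f"Unsupported segment: {segment!r}")
        count = int(m.group(1)) if m.group(1) else 1
        q_len = _tokens(m.group(2), m.group(3) is not None)
        # Without an explicit s<len>, assume the sequence is just the query.
        if m.group(4):
            seq_len = _tokens(m.group(4), m.group(5) is not None)
        else:
            seq_len = q_len
        out.append((count, q_len, seq_len))
    return out


decode_only = sketch_parse("16q1s512")
mixed = sketch_parse("2q2k_8q1s1k")
```

Under these assumptions, `"16q1s512"` reads as 16 requests with one query token and 512 tokens of KV cache each, matching the comment next to that entry.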
60
benchmarks/attention_benchmarks/configs/mla_mixed_batch.yaml
Normal file
@@ -0,0 +1,60 @@
|
|||||||
|
# MLA mixed batch benchmark (prefill + decode)
|
||||||
|
# Tests chunked prefill performance
|
||||||
|
|
||||||
|
model:
|
||||||
|
name: "deepseek-v3"
|
||||||
|
num_layers: 60
|
||||||
|
num_q_heads: 128
|
||||||
|
num_kv_heads: 1
|
||||||
|
head_dim: 576
|
||||||
|
kv_lora_rank: 512
|
||||||
|
qk_nope_head_dim: 128
|
||||||
|
qk_rope_head_dim: 64
|
||||||
|
v_head_dim: 128
|
||||||
|
block_size: 128
|
||||||
|
|
||||||
|
batch_specs:
|
||||||
|
# Small prefill + decode
|
||||||
|
- "1q1k_8q1s1k" # 1 prefill + 8 decode
|
||||||
|
- "2q2k_16q1s1k" # 2 prefill + 16 decode
|
||||||
|
- "4q1k_32q1s2k" # 4 prefill + 32 decode
|
||||||
|
|
||||||
|
# Medium prefill + decode
|
||||||
|
- "2q4k_32q1s2k" # 2 medium prefill + 32 decode
|
||||||
|
- "4q4k_64q1s2k" # 4 medium prefill + 64 decode
|
||||||
|
- "8q2k_64q1s4k" # 8 prefill + 64 decode
|
||||||
|
|
||||||
|
# Large prefill + decode (chunked prefill stress test)
|
||||||
|
- "2q8k_32q1s1k" # 2 large prefill + 32 decode
|
||||||
|
- "1q16k_16q1s2k" # 1 very large prefill + 16 decode
|
||||||
|
- "2q16k_32q1s4k" # 2 very large prefill + 32 decode
|
||||||
|
|
||||||
|
# Context extension + decode
|
||||||
|
- "2q1kkv2k_16q1s1k" # 2 extend + 16 decode
|
||||||
|
- "4q2kkv4k_32q1s2k" # 4 extend + 32 decode
|
||||||
|
- "2q1kkv8k_32q1s2k" # 2 large extend + 32 decode
|
||||||
|
|
||||||
|
# Explicitly chunked prefill
|
||||||
|
- "q8k" # 8k prefill with chunking hint
|
||||||
|
- "q16k" # 16k prefill with chunking hint
|
||||||
|
- "2q8k_32q1s2k" # 2 chunked prefill + 32 decode
|
||||||
|
|
||||||
|
# High decode ratio (realistic serving)
|
||||||
|
- "1q2k_63q1s1k" # 1 prefill + 63 decode
|
||||||
|
- "2q2k_62q1s2k" # 2 prefill + 62 decode
|
||||||
|
- "4q4k_60q1s4k" # 4 prefill + 60 decode
|
||||||
|
|
||||||
|
backends:
|
||||||
|
- cutlass_mla
|
||||||
|
- flashinfer_mla
|
||||||
|
- flashattn_mla # Hopper only
|
||||||
|
- flashmla # Hopper only
|
||||||
|
|
||||||
|
device: "cuda:0"
|
||||||
|
repeats: 5
|
||||||
|
warmup_iters: 3
|
||||||
|
profile_memory: true
|
||||||
|
|
||||||
|
# Analyze chunked prefill workspace size impact
|
||||||
|
chunked_prefill:
|
||||||
|
test_workspace_sizes: [4096, 8192, 16384, 32768, 65536]
|
||||||
@@ -0,0 +1,88 @@
|
|||||||
|
# Study 4: What is the optimal reorder_batch_threshold for MLA backends supporting query length > 1?
|
||||||
|
# Question: At what query length does the prefill pipeline become faster than the decode pipeline?
|
||||||
|
# Methodology: For each query length, compare decode vs prefill performance to find the crossover point
|
||||||
|
# Applies to: FlashAttn MLA, FlashMLA
|
||||||
|
|
||||||
|
description: "Decode vs Prefill pipeline crossover analysis"
|
||||||
|
|
||||||
|
# Test FlashAttn MLA
|
||||||
|
backend: flashattn_mla
|
||||||
|
|
||||||
|
# Mode: decode_vs_prefill comparison (special sweep mode)
|
||||||
|
# For each batch spec, we'll test both decode and prefill pipelines
|
||||||
|
mode: "decode_vs_prefill"
|
||||||
|
|
||||||
|
# Query lengths to test (from old benchmark_mla_threshold.py methodology)
|
||||||
|
# Each query length will be tested with BOTH decode and prefill pipelines:
|
||||||
|
# - decode: threshold >= query_length (forces decode pipeline)
|
||||||
|
# - prefill: threshold < query_length (forces prefill pipeline)
|
||||||
|
#
|
||||||
|
# We use q<N>s1k format which creates q_len=N, seq_len=1024 requests
|
||||||
|
# This tests different query lengths with fixed sequence length context
|
||||||
|
#
|
||||||
|
# Using batch_spec_ranges for automatic generation:
|
||||||
|
batch_spec_ranges:
|
||||||
|
- template: "q{q_len}s1k"
|
||||||
|
q_len:
|
||||||
|
start: 1
|
||||||
|
stop: 16
|
||||||
|
step: 1
|
||||||
|
end_inclusive: false
|
||||||
|
- template: "q{q_len}s1k"
|
||||||
|
q_len:
|
||||||
|
start: 16
|
||||||
|
stop: 64
|
||||||
|
step: 2
|
||||||
|
end_inclusive: false
|
||||||
|
- template: "q{q_len}s1k"
|
||||||
|
q_len:
|
||||||
|
start: 64
|
||||||
|
stop: 1024
|
||||||
|
step: 4
|
||||||
|
end_inclusive: true
|
||||||
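The three `batch_spec_ranges` entries above can be expanded mechanically. This is a minimal sketch of one plausible expansion, assuming `end_inclusive` simply decides whether `stop` itself is generated; `expand_range` is a hypothetical helper, not a function from the benchmark.

```python
def expand_range(
    template: str, start: int, stop: int, step: int, end_inclusive: bool
) -> list[str]:
    """Fill the template once per q_len in the (optionally inclusive) range."""
    last = stop + 1 if end_inclusive else stop
    return [template.format(q_len=q) for q in range(start, last, step)]


# The three ranges as written in the config above.
ranges = [
    dict(start=1, stop=16, step=1, end_inclusive=False),
    dict(start=16, stop=64, step=2, end_inclusive=False),
    dict(start=64, stop=1024, step=4, end_inclusive=True),
]
specs = [s for r in ranges for s in expand_range("q{q_len}s1k", **r)]
print(len(specs), specs[0], specs[-1])
```

With this reading the ranges do not overlap: each exclusive `stop` is picked up as the `start` of the next range, so the sampling gets coarser as the query length grows.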
|
|
||||||
|
# Batch sizes to test (from old script)
|
||||||
|
batch_sizes:
|
||||||
|
- 1
|
||||||
|
- 2
|
||||||
|
- 4
|
||||||
|
- 8
|
||||||
|
- 16
|
||||||
|
- 32
|
||||||
|
- 64
|
||||||
|
- 128
|
||||||
|
- 256
|
||||||
|
|
||||||
|
# Model configuration (DeepSeek V2/V3 defaults)
|
||||||
|
model:
|
||||||
|
num_layers: 10
|
||||||
|
head_dim: 576
|
||||||
|
num_q_heads: 128
|
||||||
|
num_kv_heads: 1
|
||||||
|
block_size: 128
|
||||||
|
|
||||||
|
# Benchmark settings
|
||||||
|
benchmark:
|
||||||
|
device: "cuda:0"
|
||||||
|
repeats: 15 # More repeats for spec decode variance
|
||||||
|
warmup_iters: 5
|
||||||
|
profile_memory: false
|
||||||
|
|
||||||
|
# Output
|
||||||
|
output:
|
||||||
|
csv: "reorder_threshold_results.csv"
|
||||||
|
json: "reorder_threshold_results.json"
|
||||||
|
|
||||||
|
# Expected outcome (reproduces old benchmark_mla_threshold.py study):
|
||||||
|
# - For each batch size, find the crossover point where prefill becomes faster than decode
|
||||||
|
# - Show decode vs prefill performance across all query lengths
|
||||||
|
# - Determine optimal reorder_batch_threshold based on last query length where decode is faster
|
||||||
|
# - Understand how crossover point varies with batch size
|
||||||
|
# - Provide data-driven guidance for default threshold value
|
||||||
|
#
|
||||||
|
# Methodology (from old script):
|
||||||
|
# - Each query length tested with BOTH pipelines:
|
||||||
|
# * decode: threshold >= query_length (forces decode pipeline)
|
||||||
|
# * prefill: threshold < query_length (forces prefill pipeline)
|
||||||
|
# - Compare which is faster to find crossover point
|
||||||
|
#
|
||||||
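The "last query length where decode is faster" rule described in the comments above can be sketched as follows. `suggest_threshold` is a hypothetical helper, and the timings are invented purely for illustration; they are not results from this study.

```python
from typing import Optional


def suggest_threshold(
    decode_s: dict[int, float], prefill_s: dict[int, float]
) -> Optional[int]:
    """Return the largest query length at which decode is still faster."""
    wins = [
        q
        for q in sorted(decode_s)
        if q in prefill_s and decode_s[q] < prefill_s[q]
    ]
    return max(wins) if wins else None


# Made-up timings in seconds, keyed by query length (NOT measured data).
decode = {1: 1.0e-4, 2: 1.1e-4, 4: 1.5e-4, 8: 2.6e-4}
prefill = {1: 2.0e-4, 2: 2.0e-4, 4: 2.1e-4, 8: 2.3e-4}
threshold = suggest_threshold(decode, prefill)
print(threshold)
```

In practice this would be run once per batch size, since the study expects the crossover point to move with batch size.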
@@ -0,0 +1,62 @@
|
|||||||
|
# Speculative decoding benchmark configuration
|
||||||
|
# Tests reorder_batch_threshold optimization
|
||||||
|
|
||||||
|
model:
|
||||||
|
name: "deepseek-v3"
|
||||||
|
num_layers: 60
|
||||||
|
num_q_heads: 128
|
||||||
|
num_kv_heads: 1
|
||||||
|
head_dim: 576
|
||||||
|
kv_lora_rank: 512
|
||||||
|
qk_nope_head_dim: 128
|
||||||
|
qk_rope_head_dim: 64
|
||||||
|
v_head_dim: 128
|
||||||
|
|
||||||
|
batch_specs:
|
||||||
|
# Pure speculative decode (K-token verification)
|
||||||
|
- "q2s1k" # 2-token spec, 1k KV
|
||||||
|
- "q4s1k" # 4-token spec, 1k KV
|
||||||
|
- "q8s1k" # 8-token spec, 1k KV
|
||||||
|
- "q16s1k" # 16-token spec, 1k KV
|
||||||
|
|
||||||
|
# Speculative with different context lengths
|
||||||
|
- "q4s2k" # 4-token spec, 2k KV
|
||||||
|
- "q4s4k" # 4-token spec, 4k KV
|
||||||
|
- "q8s2k" # 8-token spec, 2k KV
|
||||||
|
- "q8s4k" # 8-token spec, 4k KV
|
||||||
|
|
||||||
|
# Mixed: speculative + regular decode
|
||||||
|
- "32q4s1k" # 32 spec requests
|
||||||
|
- "16q4s1k_16q1s1k" # 16 spec + 16 regular
|
||||||
|
- "8q8s2k_24q1s2k" # 8 spec (8-tok) + 24 regular
|
||||||
|
|
||||||
|
# Mixed: speculative + prefill + decode
|
||||||
|
- "2q1k_16q4s1k_16q1s1k" # 2 prefill + 16 spec + 16 decode
|
||||||
|
- "4q2k_32q4s2k_32q1s2k" # 4 prefill + 32 spec + 32 decode
|
||||||
|
|
||||||
|
# Large batches with speculation
|
||||||
|
- "64q4s1k" # 64 spec requests
|
||||||
|
- "32q8s2k" # 32 spec (8-token)
|
||||||
|
- "16q16s4k" # 16 spec (16-token)
|
||||||
|
|
||||||
|
# Backends that support query length > 1
|
||||||
|
backends:
|
||||||
|
- flashattn_mla # reorder_batch_threshold = 512
|
||||||
|
- flashmla # reorder_batch_threshold = 1 (tunable)
|
||||||
|
|
||||||
|
# FlashInfer-MLA also supports uniform spec-as-decode but with a different mechanism
|
||||||
|
# - flashinfer_mla
|
||||||
|
|
||||||
|
# Benchmark settings
|
||||||
|
benchmark:
|
||||||
|
device: "cuda:0"
|
||||||
|
repeats: 10 # More repeats for statistical significance
|
||||||
|
warmup_iters: 5
|
||||||
|
profile_memory: false
|
||||||
|
|
||||||
|
# Test these threshold values for optimization
|
||||||
|
parameter_sweep:
|
||||||
|
param_name: "reorder_batch_threshold"
|
||||||
|
values: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
|
||||||
|
include_auto: false
|
||||||
|
label_format: "{backend}_threshold_{value}"
|
||||||
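The `label_format` above is filled in the same way as `ParameterSweep.get_label` in `common.py`. The sketch below uses `SweepLabel`, a hypothetical stand-in for that dataclass, and shortens the value list for brevity.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SweepLabel:  # hypothetical stand-in for ParameterSweep
    param_name: str
    values: list[Any]
    label_format: str = "{backend}_{param_name}_{value}"

    def get_label(self, backend: str, value: Any) -> str:
        # str.format ignores keyword arguments the template does not use,
        # so a template without {param_name} is still valid.
        return self.label_format.format(
            backend=backend, param_name=self.param_name, value=value
        )


sweep = SweepLabel(
    "reorder_batch_threshold",
    [1, 2, 4, 8],
    label_format="{backend}_threshold_{value}",
)
labels = [sweep.get_label("flashmla", v) for v in sweep.values]
print(labels)
```

Each label then identifies one backend and threshold combination in the results table.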
@@ -0,0 +1,40 @@
|
|||||||
|
# Standard attention backend benchmark configuration
|
||||||
|
|
||||||
|
model:
|
||||||
|
num_layers: 32
|
||||||
|
num_q_heads: 32
|
||||||
|
num_kv_heads: 8 # GQA with 4:1 ratio
|
||||||
|
head_dim: 128
|
||||||
|
block_size: 16
|
||||||
|
|
||||||
|
batch_specs:
|
||||||
|
# Pure prefill
|
||||||
|
- "q512" # Small prefill (512 tokens)
|
||||||
|
- "q2k" # Medium prefill (2048 tokens)
|
||||||
|
- "q4k" # Large prefill (4096 tokens)
|
||||||
|
- "q8k" # Very large prefill (8192 tokens)
|
||||||
|
|
||||||
|
# Pure decode
|
||||||
|
- "8q1s1k" # 8 requests, 1k KV cache each
|
||||||
|
- "16q1s2k" # 16 requests, 2k KV cache each
|
||||||
|
- "32q1s1k" # 32 requests, 1k KV cache each
|
||||||
|
- "64q1s4k" # 64 requests, 4k KV cache each
|
||||||
|
|
||||||
|
# Mixed prefill/decode
|
||||||
|
- "2q2k_8q1s1k" # 2 prefill + 8 decode
|
||||||
|
- "4q1k_16q1s2k" # 4 prefill + 16 decode
|
||||||
|
- "2q4k_32q1s1k" # 2 large prefill + 32 decode
|
||||||
|
|
||||||
|
# Context extension
|
||||||
|
- "q1ks2k" # 1k query, 2k sequence (chunked prefill)
|
||||||
|
- "2q1ks4k" # 2 requests: 1k query, 4k sequence
|
||||||
|
|
||||||
|
backends:
|
||||||
|
- flash
|
||||||
|
- triton
|
||||||
|
- flashinfer
|
||||||
|
|
||||||
|
device: "cuda:0"
|
||||||
|
repeats: 5
|
||||||
|
warmup_iters: 3
|
||||||
|
profile_memory: false
|
||||||
836
benchmarks/attention_benchmarks/mla_runner.py
Normal file
@@ -0,0 +1,836 @@
|
|||||||
|
# SPDX-License-Identifier: Apache-2.0
|
||||||
|
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||||
|
|
||||||
|
"""
|
||||||
|
MLA benchmark runner - shared utilities for MLA benchmarks.
|
||||||
|
|
||||||
|
This module provides helpers for running MLA backends without
|
||||||
|
needing full VllmConfig integration.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import importlib
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
from batch_spec import parse_batch_spec
|
||||||
|
from common import (
|
||||||
|
BenchmarkResult,
|
||||||
|
MockHfConfig,
|
||||||
|
MockKVBProj,
|
||||||
|
MockLayer,
|
||||||
|
setup_mla_dims,
|
||||||
|
)
|
||||||
|
|
||||||
|
from vllm.config import (
|
||||||
|
CacheConfig,
|
||||||
|
CompilationConfig,
|
||||||
|
ModelConfig,
|
||||||
|
ParallelConfig,
|
||||||
|
SchedulerConfig,
|
||||||
|
VllmConfig,
|
||||||
|
set_current_vllm_config,
|
||||||
|
)
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# VllmConfig Creation
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
def _add_mock_methods_to_model_config(model_config: ModelConfig) -> None:
|
||||||
|
"""
|
||||||
|
Add mock methods for layer-specific queries to ModelConfig.
|
||||||
|
|
||||||
|
These methods are needed by metadata builders but aren't normally
|
||||||
|
present on ModelConfig when used in benchmark contexts.
|
||||||
|
"""
|
||||||
|
import types
|
||||||
|
|
||||||
|
model_config.get_num_layers = types.MethodType(lambda self: 1, model_config)
|
||||||
|
model_config.get_sliding_window_for_layer = types.MethodType(
|
||||||
|
lambda self, _i: None, model_config
|
||||||
|
)
|
||||||
|
model_config.get_logits_soft_cap_for_layer = types.MethodType(
|
||||||
|
lambda self, _i: None, model_config
|
||||||
|
)
|
||||||
|
model_config.get_sm_scale_for_layer = types.MethodType(
|
||||||
|
lambda self, _i: 1.0 / self.get_head_size() ** 0.5, model_config
|
||||||
|
)
|
||||||
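The helper above attaches methods to a single instance with `types.MethodType`. A minimal standalone sketch of that technique follows, using `Config` as a hypothetical stand-in for `ModelConfig` and 576 as an example head size taken from the configs in this change.

```python
import types


class Config:  # hypothetical stand-in for ModelConfig
    def get_head_size(self) -> int:
        return 576


cfg = Config()
# Bind the lambdas to this one instance; the class itself is left untouched.
cfg.get_num_layers = types.MethodType(lambda self: 1, cfg)
cfg.get_sm_scale_for_layer = types.MethodType(
    lambda self, _i: 1.0 / self.get_head_size() ** 0.5, cfg
)

print(cfg.get_num_layers(), cfg.get_sm_scale_for_layer(0))
```

Because the binding is per instance, other `Config` objects do not gain these methods, which keeps the mock behaviour local to the benchmark.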
|
|
||||||
|
|
||||||
|
def create_minimal_vllm_config(
|
||||||
|
model_name: str = "deepseek-v3",
|
||||||
|
block_size: int = 128,
|
||||||
|
max_num_seqs: int = 256,
|
||||||
|
mla_dims: dict | None = None,
|
||||||
|
) -> VllmConfig:
|
||||||
|
"""
|
||||||
|
Create minimal VllmConfig for MLA benchmarks.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
model_name: Model name (deepseek-v2, deepseek-v3, etc.) - used if mla_dims not
|
||||||
|
provided
|
||||||
|
block_size: KV cache block size
|
||||||
|
max_num_seqs: Maximum number of sequences
|
||||||
|
mla_dims: Optional custom MLA dimensions dict. If not provided, uses
|
||||||
|
setup_mla_dims(model_name)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
VllmConfig for benchmarking
|
||||||
|
"""
|
||||||
|
# Get MLA dimensions - use provided or load from model name
|
||||||
|
if mla_dims is None:
|
||||||
|
mla_dims = setup_mla_dims(model_name)
|
||||||
|
|
||||||
|
# Create mock HF config first (avoids downloading from HuggingFace)
|
||||||
|
mock_hf_config = MockHfConfig(mla_dims)
|
||||||
|
|
||||||
|
# Create a temporary minimal config.json to avoid HF downloads
|
||||||
|
# This ensures consistent ModelConfig construction without network access
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import shutil
|
||||||
|
import tempfile
|
||||||
|
|
||||||
|
minimal_config = {
|
||||||
|
"architectures": ["DeepseekV2ForCausalLM"],
|
||||||
|
"model_type": "deepseek_v2",
|
||||||
|
"num_attention_heads": mla_dims["num_q_heads"],
|
||||||
|
"num_key_value_heads": mla_dims["num_kv_heads"],
|
||||||
|
"hidden_size": mla_dims["head_dim"] * mla_dims["num_q_heads"],
|
||||||
|
"torch_dtype": "bfloat16",
|
||||||
|
"max_position_embeddings": 163840, # DeepSeek V3 default
|
||||||
|
"rope_theta": 10000.0,
|
||||||
|
"vocab_size": 128256,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Create temporary directory with config.json
|
||||||
|
temp_dir = tempfile.mkdtemp(prefix="vllm_bench_")
|
||||||
|
config_path = os.path.join(temp_dir, "config.json")
|
||||||
|
with open(config_path, "w") as f:
|
||||||
|
json.dump(minimal_config, f)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Create model config using local path - no HF downloads
|
||||||
|
model_config = ModelConfig(
|
||||||
|
model=temp_dir, # Use local temp directory
|
||||||
|
tokenizer=None,
|
||||||
|
tokenizer_mode="auto",
|
||||||
|
trust_remote_code=True,
|
||||||
|
dtype="bfloat16",
|
||||||
|
seed=0,
|
||||||
|
max_model_len=32768,
|
||||||
|
quantization=None,
|
||||||
|
quantization_param_path=None,
|
||||||
|
enforce_eager=False,
|
||||||
|
max_context_len_to_capture=None,
|
||||||
|
max_seq_len_to_capture=8192,
|
||||||
|
max_logprobs=20,
|
||||||
|
disable_sliding_window=False,
|
||||||
|
skip_tokenizer_init=True,
|
||||||
|
served_model_name=None,
|
||||||
|
limit_mm_per_prompt=None,
|
||||||
|
use_async_output_proc=True,
|
||||||
|
config_format="auto",
|
||||||
|
)
|
||||||
|
finally:
|
||||||
|
# Clean up temporary directory
|
||||||
|
shutil.rmtree(temp_dir, ignore_errors=True)
|
||||||
|
|
||||||
|
# Override with our mock config
|
||||||
|
model_config.hf_config = mock_hf_config
|
||||||
|
model_config.hf_text_config = mock_hf_config
|
||||||
|
|
||||||
|
# Add mock methods for layer-specific queries
|
||||||
|
_add_mock_methods_to_model_config(model_config)
|
||||||
|
|
||||||
|
# Create sub-configs
|
||||||
|
cache_config = CacheConfig(
|
||||||
|
block_size=block_size,
|
||||||
|
gpu_memory_utilization=0.9,
|
||||||
|
swap_space=0,
|
||||||
|
cache_dtype="auto",
|
||||||
|
enable_prefix_caching=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
scheduler_config = SchedulerConfig(
|
||||||
|
max_num_seqs=max_num_seqs,
|
||||||
|
max_num_batched_tokens=8192,
|
||||||
|
max_model_len=32768,
|
||||||
|
is_encoder_decoder=False,
|
||||||
|
enable_chunked_prefill=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
parallel_config = ParallelConfig(
|
||||||
|
tensor_parallel_size=1,
|
||||||
|
)
|
||||||
|
|
||||||
|
compilation_config = CompilationConfig()
|
||||||
|
|
||||||
|
return VllmConfig(
|
||||||
|
model_config=model_config,
|
||||||
|
cache_config=cache_config,
|
||||||
|
parallel_config=parallel_config,
|
||||||
|
scheduler_config=scheduler_config,
|
||||||
|
compilation_config=compilation_config,
|
||||||
|
)
|
||||||
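The temporary `config.json` approach used in `create_minimal_vllm_config` can be shown without vLLM installed. In this sketch, reading the file back stands in for the `ModelConfig` constructor, and `with_temp_config` is a hypothetical name.

```python
import json
import os
import shutil
import tempfile


def with_temp_config(config: dict) -> tuple[dict, str]:
    """Write config.json to a temp dir, consume it, then always clean up."""
    temp_dir = tempfile.mkdtemp(prefix="vllm_bench_")
    try:
        path = os.path.join(temp_dir, "config.json")
        with open(path, "w") as f:
            json.dump(config, f)
        # Stand-in for the real consumer: read the file back from disk.
        with open(path) as f:
            loaded = json.load(f)
    finally:
        # Runs on success and on failure, so the directory never leaks.
        shutil.rmtree(temp_dir, ignore_errors=True)
    return loaded, temp_dir


loaded, used_dir = with_temp_config({"model_type": "deepseek_v2"})
print(loaded, os.path.exists(used_dir))
```

One design difference from the code above: the sketch writes the file inside the `try` block, so the directory is also removed if the write itself fails.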
|
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# Backend Configuration
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
# Backend name to class name prefix mapping
|
||||||
|
_BACKEND_NAME_MAP = {
|
||||||
|
"flashattn_mla": "FlashAttnMLA",
|
||||||
|
"flashmla": "FlashMLA",
|
||||||
|
"flashinfer_mla": "FlashInferMLA",
|
||||||
|
"cutlass_mla": "CutlassMLA",
|
||||||
|
}
|
||||||
|
|
||||||
|
# Special properties that differ from defaults
|
||||||
|
_BACKEND_PROPERTIES = {
|
||||||
|
"flashmla": {
|
||||||
|
"query_format": "concat", # Single concatenated tensor (vs tuple)
|
||||||
|
"block_size": 64, # FlashMLA uses fixed block size
|
||||||
|
},
|
||||||
|
"flashinfer_mla": {
|
||||||
|
"block_size": 64, # FlashInfer MLA only supports 32 or 64
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _get_backend_config(backend: str) -> dict:
|
||||||
|
"""
|
||||||
|
Get backend configuration using naming conventions.
|
||||||
|
|
||||||
|
All MLA backends follow the pattern:
|
||||||
|
- Module: vllm.v1.attention.backends.mla.{backend}
|
||||||
|
- Impl: {Name}Impl
|
||||||
|
- Metadata: {Name}Metadata (or MLACommonMetadata)
|
||||||
|
- DecodeMetadata: {Name}DecodeMetadata (or MLACommonDecodeMetadata)
|
||||||
|
- MetadataBuilder: {Name}MetadataBuilder
|
||||||
|
"""
|
||||||
|
if backend not in _BACKEND_NAME_MAP:
|
||||||
|
raise ValueError(f"Unknown backend: {backend}")
|
||||||
|
|
||||||
|
name = _BACKEND_NAME_MAP[backend]
|
||||||
|
props = _BACKEND_PROPERTIES.get(backend, {})
|
||||||
|
|
||||||
|
# Check if backend uses common metadata (FlashInfer, CUTLASS)
|
||||||
|
uses_common = backend in ("flashinfer_mla", "cutlass_mla")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"module": f"vllm.v1.attention.backends.mla.{backend}",
|
||||||
|
"impl_class": f"{name}Impl",
|
||||||
|
"metadata_class": "MLACommonMetadata" if uses_common else f"{name}Metadata",
|
||||||
|
"decode_metadata_class": "MLACommonDecodeMetadata"
|
||||||
|
if uses_common
|
||||||
|
else f"{name}DecodeMetadata",
|
||||||
|
"builder_class": f"{name}MetadataBuilder",
|
||||||
|
"query_format": props.get("query_format", "tuple"),
|
||||||
|
"block_size": props.get("block_size", None),
|
||||||
|
}
|
||||||
|
|
||||||
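The naming convention used by `_get_backend_config` can be restated in isolation. The sketch below is a simplified illustration rather than the benchmark's own code: `derive_class_names`, `NAME_MAP`, and `USES_COMMON_METADATA` are hypothetical names, and it only builds strings, so it does not check that the classes exist in vLLM.

```python
# Hypothetical restatement of the naming convention; it only formats strings
# and never imports vLLM.
NAME_MAP = {
    "flashattn_mla": "FlashAttnMLA",
    "flashmla": "FlashMLA",
    "flashinfer_mla": "FlashInferMLA",
    "cutlass_mla": "CutlassMLA",
}
# Backends that reuse the shared MLACommon* metadata classes.
USES_COMMON_METADATA = {"flashinfer_mla", "cutlass_mla"}


def derive_class_names(backend: str) -> dict[str, str]:
    """Map a backend key to the module path and class names it implies."""
    if backend not in NAME_MAP:
        raise ValueError(f"Unknown backend: {backend}")
    name = NAME_MAP[backend]
    common = backend in USES_COMMON_METADATA
    return {
        "module": f"vllm.v1.attention.backends.mla.{backend}",
        "impl_class": f"{name}Impl",
        "metadata_class": "MLACommonMetadata" if common else f"{name}Metadata",
        "builder_class": f"{name}MetadataBuilder",
    }
```

Keeping the exceptions (shared metadata classes, fixed block sizes) in small lookup tables means a new backend that follows the convention needs only one new map entry.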
|
|
||||||
|
# ============================================================================
|
||||||
|
# Metadata Building Helpers
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
def _build_attention_metadata(
|
||||||
|
requests: list,
|
||||||
|
block_size: int,
|
||||||
|
device: torch.device,
|
||||||
|
builder_instance,
|
||||||
|
) -> tuple:
|
||||||
|
"""
|
||||||
|
Build attention metadata from batch requests.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
requests: List of BatchRequest objects
|
||||||
|
block_size: KV cache block size
|
||||||
|
device: Target device
|
||||||
|
builder_instance: Metadata builder instance
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple of (metadata, kv_cache_num_blocks)
|
||||||
|
"""
|
||||||
|
q_lens = [r.q_len for r in requests]
|
||||||
|
kv_lens = [r.kv_len for r in requests]
|
||||||
|
total_q = sum(q_lens)
|
||||||
|
max_kv = max(kv_lens)
|
||||||
|
|
||||||
|
# Build query start locations
|
||||||
|
q_start_cpu = torch.tensor(
|
||||||
|
[0] + [sum(q_lens[: i + 1]) for i in range(len(q_lens))],
|
||||||
|
dtype=torch.int32,
|
||||||
|
)
|
||||||
|
q_start_gpu = q_start_cpu.to(device)
|
||||||
|
|
||||||
|
# Build sequence lengths
|
||||||
|
seq_lens_cpu = torch.tensor(kv_lens, dtype=torch.int32)
|
||||||
|
seq_lens_gpu = seq_lens_cpu.to(device)
|
||||||
|
|
||||||
|
# Build num_computed_tokens (context length for each request)
|
||||||
|
context_lens = [kv_len - q_len for q_len, kv_len in zip(q_lens, kv_lens)]
|
||||||
|
num_computed_tokens_cpu = torch.tensor(context_lens, dtype=torch.int32)
|
||||||
|
|
||||||
|
# Build block table
|
||||||
|
num_blocks_per_req = [(kv + block_size - 1) // block_size for kv in kv_lens]
|
||||||
|
max_num_blocks = max(num_blocks_per_req)
|
||||||
|
|
||||||
|
block_table_cpu = np.zeros((len(requests), max_num_blocks), dtype=np.int32)
|
||||||
|
current_block = 0
|
||||||
|
for i, num_blocks in enumerate(num_blocks_per_req):
|
||||||
|
for j in range(num_blocks):
|
||||||
|
block_table_cpu[i, j] = current_block
|
||||||
|
current_block += 1
|
||||||
|
|
||||||
|
block_table_gpu = torch.from_numpy(block_table_cpu).to(device)
|
||||||
|
|
||||||
|
# Build slot mapping
|
||||||
|
slot_mapping_list = []
|
||||||
|
for i, (q_len, kv_len, num_blocks) in enumerate(
|
||||||
|
zip(q_lens, kv_lens, num_blocks_per_req)
|
||||||
|
):
|
||||||
|
context_len = kv_len - q_len
|
||||||
|
for j in range(q_len):
|
||||||
|
token_kv_idx = context_len + j
|
||||||
|
block_idx = token_kv_idx // block_size
|
||||||
|
offset_in_block = token_kv_idx % block_size
|
||||||
|
global_block_id = block_table_cpu[i, block_idx]
|
||||||
|
slot_id = global_block_id * block_size + offset_in_block
|
||||||
|
slot_mapping_list.append(slot_id)
|
||||||
|
|
||||||
|
slot_mapping = torch.tensor(slot_mapping_list, dtype=torch.int64, device=device)
|
||||||
|
|
||||||
|
# Create CommonAttentionMetadata
|
||||||
|
from vllm.v1.attention.backends.utils import CommonAttentionMetadata
|
||||||
|
|
||||||
|
common_attn_metadata = CommonAttentionMetadata(
|
||||||
|
num_reqs=len(requests),
|
||||||
|
max_query_len=max(q_lens),
|
||||||
|
max_seq_len=max_kv,
|
||||||
|
num_actual_tokens=total_q,
|
||||||
|
query_start_loc=q_start_gpu,
|
||||||
|
query_start_loc_cpu=q_start_cpu,
|
||||||
|
seq_lens=seq_lens_gpu,
|
||||||
|
_seq_lens_cpu=seq_lens_cpu,
|
||||||
|
_num_computed_tokens_cpu=num_computed_tokens_cpu,
|
||||||
|
slot_mapping=slot_mapping,
|
||||||
|
block_table_tensor=block_table_gpu,
|
||||||
|
dcp_local_seq_lens=None,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Use the production build() method
|
||||||
|
metadata = builder_instance.build(
|
||||||
|
common_prefix_len=0,
|
||||||
|
common_attn_metadata=common_attn_metadata,
|
||||||
|
fast_build=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
return metadata, current_block
|
||||||
|
|
||||||
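The block-table and slot-mapping arithmetic in `_build_attention_metadata` is easier to follow on plain lists. The sketch below mirrors that logic under simplifying assumptions: `build_block_table_and_slots` is a hypothetical helper, it uses ragged Python lists instead of the zero-padded NumPy array, and it does not touch torch or vLLM.

```python
def build_block_table_and_slots(q_lens, kv_lens, block_size):
    """Assign consecutive KV-cache blocks per request and map new tokens to slots.

    Returns (block_table, slot_mapping, total_blocks).
    """
    # Ceil-divide each KV length by the block size.
    num_blocks_per_req = [(kv + block_size - 1) // block_size for kv in kv_lens]

    # Hand out globally unique, consecutive block ids, request by request.
    block_table = []
    next_block = 0
    for n in num_blocks_per_req:
        block_table.append(list(range(next_block, next_block + n)))
        next_block += n

    # Only the q_len newest tokens of each request are written this step;
    # they sit after the already-computed context of length kv_len - q_len.
    slot_mapping = []
    for blocks, q_len, kv_len in zip(block_table, q_lens, kv_lens):
        context_len = kv_len - q_len
        for j in range(q_len):
            pos = context_len + j
            slot_mapping.append(blocks[pos // block_size] * block_size + pos % block_size)

    return block_table, slot_mapping, next_block
```

For example, with `q_lens=[1, 3]`, `kv_lens=[5, 3]`, and `block_size=4`, the first request owns blocks 0 and 1 and writes its single new token at slot 4, while the second owns block 2 and writes slots 8, 9, and 10.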
|
|
||||||
|
def _create_input_tensors(
|
||||||
|
total_q: int,
|
||||||
|
mla_dims: dict,
|
||||||
|
query_format: str,
|
||||||
|
device: torch.device,
|
||||||
|
dtype: torch.dtype,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Create input tensors for both decode and prefill modes.
|
||||||
|
|
||||||
|
MLA requires different tensor formats for decode vs prefill:
|
||||||
|
- Decode: Uses kv_lora_rank (512) dimension
|
||||||
|
- Prefill: Uses qk_nope_head_dim (128) to stay under FlashAttention's 256 limit
|
||||||
|
|
||||||
|
Args:
|
||||||
|
total_q: Total number of query tokens
|
||||||
|
mla_dims: MLA dimension configuration
|
||||||
|
query_format: Either "tuple" or "concat"
|
||||||
|
device: Target device
|
||||||
|
dtype: Tensor dtype
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple of (decode_inputs, prefill_inputs)
|
||||||
|
- decode_inputs: Query tensor(s) for decode mode
|
||||||
|
- prefill_inputs: Dict with 'q', 'k_c_normed', 'k_pe', 'k_scale' for prefill
|
||||||
|
"""
|
||||||
|
if query_format == "tuple":
|
||||||
|
# Decode mode format: (q_nope, q_pe) where q_nope has kv_lora_rank dim
|
||||||
|
q_nope_decode = torch.randn(
|
||||||
|
total_q,
|
||||||
|
mla_dims["num_q_heads"],
|
||||||
|
mla_dims["kv_lora_rank"],
|
||||||
|
device=device,
|
||||||
|
dtype=dtype,
|
||||||
|
)
|
||||||
|
q_pe = torch.randn(
|
||||||
|
total_q,
|
||||||
|
mla_dims["num_q_heads"],
|
||||||
|
mla_dims["qk_rope_head_dim"],
|
||||||
|
device=device,
|
||||||
|
dtype=dtype,
|
||||||
|
)
|
||||||
|
decode_inputs = (q_nope_decode, q_pe)
|
||||||
|
|
||||||
|
# For prefill, we need q with qk_nope_head_dim instead of kv_lora_rank
|
||||||
|
q_nope_prefill = torch.randn(
|
||||||
|
total_q,
|
||||||
|
mla_dims["num_q_heads"],
|
||||||
|
mla_dims["qk_nope_head_dim"],
|
||||||
|
device=device,
|
||||||
|
dtype=dtype,
|
||||||
|
)
|
||||||
|
prefill_q = torch.cat([q_nope_prefill, q_pe], dim=-1)
|
||||||
|
else: # concat
|
||||||
|
decode_inputs = torch.randn(
|
||||||
|
total_q,
|
||||||
|
mla_dims["num_q_heads"],
|
||||||
|
mla_dims["kv_lora_rank"] + mla_dims["qk_rope_head_dim"],
|
||||||
|
device=device,
|
||||||
|
dtype=dtype,
|
||||||
|
)
|
||||||
|
# For prefill with concat format
|
||||||
|
prefill_q = torch.randn(
|
||||||
|
total_q,
|
||||||
|
mla_dims["num_q_heads"],
|
||||||
|
mla_dims["qk_nope_head_dim"] + mla_dims["qk_rope_head_dim"],
|
||||||
|
device=device,
|
||||||
|
dtype=dtype,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Create additional inputs needed for prefill forward
|
||||||
|
k_c_normed = torch.randn(
|
||||||
|
total_q,
|
||||||
|
mla_dims["kv_lora_rank"],
|
||||||
|
device=device,
|
||||||
|
dtype=dtype,
|
||||||
|
)
|
||||||
|
k_pe = torch.randn(
|
||||||
|
total_q,
|
||||||
|
1, # Single head for MLA
|
||||||
|
mla_dims["qk_rope_head_dim"],
|
||||||
|
device=device,
|
||||||
|
dtype=dtype,
|
||||||
|
)
|
||||||
|
k_scale = torch.ones(1, device=device, dtype=torch.float32)
|
||||||
|
|
||||||
|
output = torch.zeros(
|
||||||
|
total_q,
|
||||||
|
mla_dims["num_q_heads"] * mla_dims["v_head_dim"],
|
||||||
|
device=device,
|
||||||
|
dtype=dtype,
|
||||||
|
)
|
||||||
|
|
||||||
|
prefill_inputs = {
|
||||||
|
"q": prefill_q,
|
||||||
|
"k_c_normed": k_c_normed,
|
||||||
|
"k_pe": k_pe,
|
||||||
|
"k_scale": k_scale,
|
||||||
|
"output": output,
|
||||||
|
}
|
||||||
|
|
||||||
|
return decode_inputs, prefill_inputs
|
||||||
|
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# Backend Initialization
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
def _create_backend_impl(
|
||||||
|
backend_cfg: dict,
|
||||||
|
mla_dims: dict,
|
||||||
|
vllm_config: VllmConfig,
|
||||||
|
device: torch.device,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Create backend implementation instance.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
backend_cfg: Backend configuration dict
|
||||||
|
mla_dims: MLA dimension configuration
|
||||||
|
vllm_config: VllmConfig instance
|
||||||
|
device: Target device
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple of (impl, layer, builder_instance)
|
||||||
|
"""
|
||||||
|
# Import backend classes
|
||||||
|
backend_module = importlib.import_module(backend_cfg["module"])
|
||||||
|
impl_class = getattr(backend_module, backend_cfg["impl_class"])
|
||||||
|
|
||||||
|
# Calculate scale
|
||||||
|
scale = 1.0 / np.sqrt(mla_dims["qk_nope_head_dim"] + mla_dims["qk_rope_head_dim"])
|
||||||
|
|
||||||
|
# Create mock kv_b_proj layer for prefill mode
|
||||||
|
mock_kv_b_proj = MockKVBProj(
|
||||||
|
num_heads=mla_dims["num_q_heads"],
|
||||||
|
qk_nope_head_dim=mla_dims["qk_nope_head_dim"],
|
||||||
|
v_head_dim=mla_dims["v_head_dim"],
|
||||||
|
)
|
||||||
|
|
||||||
|
# Create impl
|
||||||
|
impl = impl_class(
|
||||||
|
num_heads=mla_dims["num_q_heads"],
|
||||||
|
head_size=mla_dims["head_dim"],
|
||||||
|
scale=scale,
|
||||||
|
num_kv_heads=mla_dims["num_kv_heads"],
|
||||||
|
alibi_slopes=None,
|
||||||
|
sliding_window=None,
|
||||||
|
kv_cache_dtype="auto",
|
||||||
|
logits_soft_cap=None,
|
||||||
|
attn_type="decoder",
|
||||||
|
kv_sharing_target_layer_name=None,
|
||||||
|
q_lora_rank=None,
|
||||||
|
kv_lora_rank=mla_dims["kv_lora_rank"],
|
||||||
|
qk_nope_head_dim=mla_dims["qk_nope_head_dim"],
|
||||||
|
qk_rope_head_dim=mla_dims["qk_rope_head_dim"],
|
||||||
|
qk_head_dim=mla_dims["qk_nope_head_dim"] + mla_dims["qk_rope_head_dim"],
|
||||||
|
v_head_dim=mla_dims["v_head_dim"],
|
||||||
|
kv_b_proj=mock_kv_b_proj,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Initialize DCP attributes
|
||||||
|
if not hasattr(impl, "dcp_world_size") or impl.dcp_world_size in (None, -1):
|
||||||
|
impl.dcp_world_size = 1
|
||||||
|
impl.dcp_rank = 0
|
||||||
|
|
||||||
|
# Create KV cache spec for MockLayer
|
||||||
|
from vllm.v1.kv_cache_interface import FullAttentionSpec
|
||||||
|
|
||||||
|
kv_cache_spec = FullAttentionSpec(
|
||||||
|
block_size=backend_cfg["block_size"] or vllm_config.cache_config.block_size,
|
||||||
|
num_kv_heads=1, # MLA uses 1 KV head
|
||||||
|
head_size=576, # MLA head dim: kv_lora_rank (512) + qk_rope_head_dim (64)
|
||||||
|
dtype=torch.bfloat16,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Create mock layer
|
||||||
|
layer = MockLayer(device, impl=impl, kv_cache_spec=kv_cache_spec)
|
||||||
|
|
||||||
|
# Create builder instance if needed
|
||||||
|
builder_instance = None
|
||||||
|
if backend_cfg["builder_class"]:
|
||||||
|
builder_class = getattr(backend_module, backend_cfg["builder_class"])
|
||||||
|
|
||||||
|
# Populate static_forward_context so builder can find the layer
|
||||||
|
# MockLayer inherits from AttentionLayerBase, so isinstance checks pass
|
||||||
|
vllm_config.compilation_config.static_forward_context = {"placeholder": layer}
|
||||||
|
|
||||||
|
builder_instance = builder_class(
|
||||||
|
kv_cache_spec=kv_cache_spec,
|
||||||
|
layer_names=["placeholder"],
|
||||||
|
vllm_config=vllm_config,
|
||||||
|
device=device,
|
||||||
|
)
|
||||||
|
|
||||||
|
return impl, layer, builder_instance
|
||||||
|
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# Config Helpers
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
def _extract_mla_dims_from_config(config) -> dict | None:
|
||||||
|
"""
|
||||||
|
Extract MLA dimensions from BenchmarkConfig if all required fields are present.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config: BenchmarkConfig instance
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict with MLA dimensions if all fields are provided, None otherwise
|
||||||
|
"""
|
||||||
|
# Check if all MLA-specific fields are provided
|
||||||
|
if all(
|
||||||
|
[
|
||||||
|
config.kv_lora_rank is not None,
|
||||||
|
config.qk_nope_head_dim is not None,
|
||||||
|
config.qk_rope_head_dim is not None,
|
||||||
|
config.v_head_dim is not None,
|
||||||
|
]
|
||||||
|
):
|
||||||
|
return {
|
||||||
|
"kv_lora_rank": config.kv_lora_rank,
|
||||||
|
"qk_nope_head_dim": config.qk_nope_head_dim,
|
||||||
|
"qk_rope_head_dim": config.qk_rope_head_dim,
|
||||||
|
"v_head_dim": config.v_head_dim,
|
||||||
|
"num_q_heads": config.num_q_heads,
|
||||||
|
"num_kv_heads": config.num_kv_heads,
|
||||||
|
"head_dim": config.head_dim,
|
||||||
|
}
|
||||||
|
# Fallback: if MLA fields not fully specified, try to construct from basic fields
|
||||||
|
elif config.head_dim == 576:
|
||||||
|
# This looks like a DeepSeek MLA config, use standard dimensions with custom
|
||||||
|
# head count
|
||||||
|
return {
|
||||||
|
"kv_lora_rank": 512,
|
||||||
|
"qk_nope_head_dim": 128,
|
||||||
|
"qk_rope_head_dim": 64,
|
||||||
|
"v_head_dim": 128,
|
||||||
|
"num_q_heads": config.num_q_heads,
|
||||||
|
"num_kv_heads": config.num_kv_heads,
|
||||||
|
"head_dim": config.head_dim,
|
||||||
|
}
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# Benchmark Execution
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
def _run_single_benchmark(
|
||||||
|
config,
|
||||||
|
impl,
|
||||||
|
layer,
|
||||||
|
builder_instance,
|
||||||
|
backend_cfg: dict,
|
||||||
|
mla_dims: dict,
|
||||||
|
device: torch.device,
|
||||||
|
) -> BenchmarkResult:
|
||||||
|
"""
|
||||||
|
Run a single benchmark configuration (warmup, then timed repeats).
|
||||||
|
|
||||||
|
Args:
|
||||||
|
config: BenchmarkConfig instance
|
||||||
|
impl: Backend implementation instance
|
||||||
|
layer: MockLayer instance
|
||||||
|
builder_instance: Metadata builder instance
|
||||||
|
backend_cfg: Backend configuration dict
|
||||||
|
mla_dims: MLA dimension configuration
|
||||||
|
device: Target device
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
BenchmarkResult with timing statistics
|
||||||
|
"""
|
||||||
|
# Parse batch spec
|
||||||
|
requests = parse_batch_spec(config.batch_spec)
|
||||||
|
q_lens = [r.q_len for r in requests]
|
||||||
|
total_q = sum(q_lens)
|
||||||
|
|
||||||
|
# Determine block size
|
||||||
|
block_size = backend_cfg["block_size"] or config.block_size
|
||||||
|
|
||||||
|
# Build metadata
|
||||||
|
metadata, num_blocks = _build_attention_metadata(
|
||||||
|
requests, block_size, device, builder_instance
|
||||||
|
)
|
||||||
|
|
||||||
|
# Create KV cache
|
||||||
|
kv_cache = torch.zeros(
|
||||||
|
num_blocks,
|
||||||
|
block_size,
|
||||||
|
mla_dims["kv_lora_rank"] + mla_dims["qk_rope_head_dim"],
|
||||||
|
device=device,
|
||||||
|
dtype=torch.bfloat16,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Create input tensors for both decode and prefill modes
|
||||||
|
decode_inputs, prefill_inputs = _create_input_tensors(
|
||||||
|
total_q,
|
||||||
|
mla_dims,
|
||||||
|
backend_cfg["query_format"],
|
||||||
|
device,
|
||||||
|
torch.bfloat16,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Determine which forward method to use based on metadata
|
||||||
|
if metadata.decode is not None:
|
||||||
|
forward_fn = lambda: impl._forward_decode(
|
||||||
|
decode_inputs, kv_cache, metadata, layer
|
||||||
|
)
|
||||||
|
elif metadata.prefill is not None:
|
||||||
|
forward_fn = lambda: impl._forward_prefill(
|
||||||
|
prefill_inputs["q"],
|
||||||
|
prefill_inputs["k_c_normed"],
|
||||||
|
prefill_inputs["k_pe"],
|
||||||
|
kv_cache,
|
||||||
|
metadata,
|
||||||
|
prefill_inputs["k_scale"],
|
||||||
|
prefill_inputs["output"],
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
raise RuntimeError("Metadata has neither decode nor prefill metadata")
|
||||||
|
|
||||||
|
# Warmup
|
||||||
|
for _ in range(config.warmup_iters):
|
||||||
|
forward_fn()
|
||||||
|
torch.cuda.synchronize()
|
||||||
|
|
||||||
|
# Benchmark
|
||||||
|
times = []
|
||||||
|
for _ in range(config.repeats):
|
||||||
|
start = torch.cuda.Event(enable_timing=True)
|
||||||
|
end = torch.cuda.Event(enable_timing=True)
|
||||||
|
|
||||||
|
start.record()
|
||||||
|
for _ in range(config.num_layers):
|
||||||
|
forward_fn()
|
||||||
|
end.record()
|
||||||
|
|
||||||
|
torch.cuda.synchronize()
|
||||||
|
elapsed_ms = start.elapsed_time(end)
|
||||||
|
times.append(elapsed_ms / 1000.0 / config.num_layers)
|
||||||
|
|
||||||
|
mean_time = float(np.mean(times))
|
||||||
|
return BenchmarkResult(
|
||||||
|
config=config,
|
||||||
|
mean_time=mean_time,
|
||||||
|
std_time=float(np.std(times)),
|
||||||
|
min_time=float(np.min(times)),
|
||||||
|
max_time=float(np.max(times)),
|
||||||
|
throughput_tokens_per_sec=total_q / mean_time if mean_time > 0 else 0,
|
||||||
|
)
|
||||||
|
|
||||||
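The timing bookkeeping in `_run_single_benchmark` (milliseconds per repeat, divided by the layer count, then reduced to summary statistics) can be checked without a GPU. This is a minimal sketch, assuming the elapsed times have already been measured; `summarize_timings` is a hypothetical helper, and `statistics.pstdev` is used because `np.std` defaults to the population standard deviation.

```python
import statistics


def summarize_timings(elapsed_ms_per_repeat, num_layers, total_q_tokens):
    """Convert per-repeat elapsed milliseconds into per-layer seconds and stats."""
    # Each repeat ran the forward pass num_layers times, so divide it out.
    per_layer_s = [ms / 1000.0 / num_layers for ms in elapsed_ms_per_repeat]
    mean_time = statistics.fmean(per_layer_s)
    return {
        "mean_time": mean_time,
        "std_time": statistics.pstdev(per_layer_s),
        "min_time": min(per_layer_s),
        "max_time": max(per_layer_s),
        # Tokens processed per second of single-layer attention time.
        "throughput_tokens_per_sec": total_q_tokens / mean_time if mean_time > 0 else 0,
    }
```

Note that the throughput figure is per layer: it divides the query token count by the mean time of one forward call, not by the time for the whole layer stack.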
|
|
||||||
|
def _run_mla_benchmark_batched(
|
||||||
|
backend: str,
|
||||||
|
configs_with_params: list[tuple], # [(config, threshold, num_splits), ...]
|
||||||
|
) -> list[BenchmarkResult]:
|
||||||
|
"""
|
||||||
|
Unified batched MLA benchmark runner for all backends.
|
||||||
|
|
||||||
|
Works for: flashattn_mla, flashmla, flashinfer_mla, cutlass_mla
|
||||||
|
|
||||||
|
This function reuses backend initialization across multiple benchmarks
|
||||||
|
to avoid setup/teardown overhead.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
backend: Backend name
|
||||||
|
configs_with_params: List of (config, threshold, num_splits) tuples
|
||||||
|
- threshold: reorder_batch_threshold (FlashAttn/FlashMLA only)
|
||||||
|
- num_splits: num_kv_splits (CUTLASS only)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of BenchmarkResult objects
|
||||||
|
"""
|
||||||
|
if not configs_with_params:
|
||||||
|
return []
|
||||||
|
|
||||||
|
backend_cfg = _get_backend_config(backend)
|
||||||
|
device = torch.device(configs_with_params[0][0].device)
|
||||||
|
torch.cuda.set_device(device)
|
||||||
|
|
||||||
|
# Determine block size
|
||||||
|
config_block_size = configs_with_params[0][0].block_size
|
||||||
|
block_size = backend_cfg["block_size"] or config_block_size
|
||||||
|
|
||||||
|
# Extract MLA dimensions from the first config
|
||||||
|
first_config = configs_with_params[0][0]
|
||||||
|
mla_dims = _extract_mla_dims_from_config(first_config)
|
||||||
|
|
||||||
|
# If config didn't provide MLA dims, fall back to default model
|
||||||
|
if mla_dims is None:
|
||||||
|
mla_dims = setup_mla_dims("deepseek-v3")
|
||||||
|
|
||||||
|
# Create and set vLLM config for MLA (reused across all benchmarks)
|
||||||
|
vllm_config = create_minimal_vllm_config(
|
||||||
|
model_name="deepseek-v3", # Used only for model path
|
||||||
|
block_size=block_size,
|
||||||
|
mla_dims=mla_dims, # Use custom dims from config or default
|
||||||
|
)
|
||||||
|
|
||||||
|
results = []
|
||||||
|
|
||||||
|
with set_current_vllm_config(vllm_config):
|
||||||
|
# Create backend impl, layer, and builder (reused across benchmarks)
|
||||||
|
impl, layer, builder_instance = _create_backend_impl(
|
||||||
|
backend_cfg, mla_dims, vllm_config, device
|
||||||
|
)
|
||||||
|
|
||||||
|
# Run each benchmark with the shared impl
|
||||||
|
for config, threshold, num_splits in configs_with_params:
|
||||||
|
# Set threshold for this benchmark (FlashAttn/FlashMLA only)
|
||||||
|
original_threshold = None
|
||||||
|
if threshold is not None and builder_instance:
|
||||||
|
original_threshold = builder_instance.reorder_batch_threshold
|
||||||
|
builder_instance.reorder_batch_threshold = threshold
|
||||||
|
|
||||||
|
# Set num_splits for CUTLASS
|
||||||
|
original_num_splits = None
|
||||||
|
if num_splits is not None and hasattr(impl, "_num_kv_splits"):
|
||||||
|
original_num_splits = impl._num_kv_splits
|
||||||
|
impl._num_kv_splits = num_splits
|
||||||
|
|
||||||
|
try:
|
||||||
|
result = _run_single_benchmark(
|
||||||
|
config,
|
||||||
|
impl,
|
||||||
|
layer,
|
||||||
|
builder_instance,
|
||||||
|
backend_cfg,
|
||||||
|
mla_dims,
|
||||||
|
device,
|
||||||
|
)
|
||||||
|
results.append(result)
|
||||||
|
|
||||||
|
finally:
|
||||||
|
# Restore original threshold
|
||||||
|
if original_threshold is not None:
|
||||||
|
builder_instance.reorder_batch_threshold = original_threshold
|
||||||
|
|
||||||
|
# Restore original num_splits
|
||||||
|
if original_num_splits is not None:
|
||||||
|
impl._num_kv_splits = original_num_splits
|
||||||
|
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# Public API
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
def run_mla_benchmark(
|
||||||
|
backend: str,
|
||||||
|
config,
|
||||||
|
reorder_batch_threshold: int | None = None,
|
||||||
|
num_kv_splits: int | None = None,
|
||||||
|
) -> BenchmarkResult | list[BenchmarkResult]:
|
||||||
|
"""
|
||||||
|
Unified MLA benchmark runner for all backends.
|
||||||
|
|
||||||
|
Works for: flashattn_mla, flashmla, flashinfer_mla, cutlass_mla
|
||||||
|
|
||||||
|
Always uses batched execution internally for optimal performance.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
backend: Backend name (flashattn_mla, flashmla, flashinfer_mla, cutlass_mla)
|
||||||
|
config: BenchmarkConfig, list of BenchmarkConfig, or list of (BenchmarkConfig, param) tuples
|
||||||
|
reorder_batch_threshold: Threshold override for FlashAttn/FlashMLA
|
||||||
|
(single config mode only)
|
||||||
|
num_kv_splits: Number of KV splits for CUTLASS (single config mode only)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
BenchmarkResult (single mode) or list of BenchmarkResult (batched mode)
|
||||||
|
"""
|
||||||
|
# Normalize to batched mode: (config, threshold, num_splits)
|
||||||
|
if isinstance(config, list):
|
||||||
|
# Already in batched format
|
||||||
|
if len(config) > 0 and isinstance(config[0], tuple):
|
||||||
|
# Format: [(cfg, param), ...] where param is threshold or num_splits
|
||||||
|
if backend in ("flashattn_mla", "flashmla"):
|
||||||
|
configs_with_params = [(cfg, param, None) for cfg, param in config]
|
||||||
|
else: # cutlass_mla or flashinfer_mla
|
||||||
|
configs_with_params = [(cfg, None, param) for cfg, param in config]
|
||||||
|
else:
|
||||||
|
# Format: [cfg, ...] - just configs
|
||||||
|
configs_with_params = [(cfg, None, None) for cfg in config]
|
||||||
|
return_single = False
|
||||||
|
else:
|
||||||
|
# Single config: convert to batched format
|
||||||
|
configs_with_params = [(config, reorder_batch_threshold, num_kv_splits)]
|
||||||
|
return_single = True
|
||||||
|
|
||||||
|
# Use unified batched execution
|
||||||
|
results = _run_mla_benchmark_batched(backend, configs_with_params)
|
||||||
|
|
||||||
|
# Return single result or list based on input
|
||||||
|
return results[0] if return_single else results
|
||||||
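The input normalization performed by `run_mla_benchmark` can be sketched on its own. The helper below is a hypothetical restatement (`normalize_configs` and `THRESHOLD_BACKENDS` are not names from the source), and it treats configs as opaque objects so that it runs without vLLM.

```python
# Backends whose per-config parameter is a reorder_batch_threshold;
# for the others the parameter is treated as num_kv_splits.
THRESHOLD_BACKENDS = ("flashattn_mla", "flashmla")


def normalize_configs(backend, config, threshold=None, num_splits=None):
    """Return (triples, return_single) with triples of (config, threshold, num_splits)."""
    if not isinstance(config, list):
        # Single config: the keyword overrides apply directly.
        return [(config, threshold, num_splits)], True
    if config and isinstance(config[0], tuple):
        # [(cfg, param), ...]: the meaning of param depends on the backend.
        if backend in THRESHOLD_BACKENDS:
            return [(cfg, param, None) for cfg, param in config], False
        return [(cfg, None, param) for cfg, param in config], False
    # [cfg, ...]: plain configs with no per-config parameter.
    return [(cfg, None, None) for cfg in config], False
```

One consequence of this design is that the keyword overrides are ignored whenever a list is passed, which matches the "single config mode only" note in the docstring.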
481
benchmarks/attention_benchmarks/runner.py
Normal file
@@ -0,0 +1,481 @@
|
|||||||
|
# SPDX-License-Identifier: Apache-2.0
|
||||||
|
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||||
|
|
||||||
|
"""
|
||||||
|
Standard attention benchmark runner - shared utilities for non-MLA benchmarks.
|
||||||
|
|
||||||
|
This module provides helpers for running standard attention backends
|
||||||
|
(FlashAttention, Triton, FlashInfer) with real vLLM integration.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import types
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
from batch_spec import parse_batch_spec, reorder_for_flashinfer
|
||||||
|
from common import BenchmarkConfig, BenchmarkResult, MockLayer, get_attention_scale
|
||||||
|
|
||||||
|
from vllm.config import (
|
||||||
|
CacheConfig,
|
||||||
|
CompilationConfig,
|
||||||
|
DeviceConfig,
|
||||||
|
LoadConfig,
|
||||||
|
ModelConfig,
|
||||||
|
ParallelConfig,
|
||||||
|
SchedulerConfig,
|
||||||
|
VllmConfig,
|
||||||
|
)
|
||||||
|
from vllm.v1.attention.backends.utils import CommonAttentionMetadata
|
||||||
|
from vllm.v1.kv_cache_interface import FullAttentionSpec
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# Backend Configuration
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
_BACKEND_CONFIG = {
|
||||||
|
"flash": {
|
||||||
|
"module": "vllm.v1.attention.backends.flash_attn",
|
||||||
|
"backend_class": "FlashAttentionBackend",
|
||||||
|
"dtype": torch.float16,
|
||||||
|
"cache_layout": "standard",
|
||||||
|
# ^ [2, num_blocks, block_size, num_kv_heads, head_dim]
|
||||||
|
},
|
||||||
|
"triton": {
|
||||||
|
"module": "vllm.v1.attention.backends.triton_attn",
|
||||||
|
"backend_class": "TritonAttentionBackend",
|
||||||
|
"dtype": torch.float32,
|
||||||
|
"cache_layout": "standard",
|
||||||
|
},
|
||||||
|
"flashinfer": {
|
||||||
|
"module": "vllm.v1.attention.backends.flashinfer",
|
||||||
|
"backend_class": "FlashInferBackend",
|
||||||
|
"dtype": torch.float16,
|
||||||
|
"cache_layout": "flashinfer",
|
||||||
|
# ^ [num_blocks, 2, block_size, num_kv_heads, head_dim]
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def _get_backend_config(backend: str) -> dict:
|
||||||
|
if backend not in _BACKEND_CONFIG:
|
||||||
|
raise ValueError(
|
||||||
|
f"Unknown backend: {backend}. "
|
||||||
|
f"Available: {', '.join(_BACKEND_CONFIG.keys())}"
|
||||||
|
)
|
||||||
|
return _BACKEND_CONFIG[backend]
|
||||||
|
|
||||||
|
|
||||||
|
# ============================================================================
|
||||||
|
# Metadata Building Helpers
|
||||||
|
# ============================================================================
|
||||||
|
|
||||||
|
|
||||||
|
def _build_common_attn_metadata(
|
||||||
|
q_lens: list[int],
|
||||||
|
kv_lens: list[int],
|
||||||
|
block_size: int,
|
||||||
|
device: torch.device,
|
||||||
|
) -> CommonAttentionMetadata:
|
||||||
|
"""Build CommonAttentionMetadata from query/kv lengths."""
|
||||||
|
batch_size = len(q_lens)
|
||||||
|
total_tokens = sum(q_lens)
|
||||||
|
|
||||||
|
query_start_loc = torch.zeros(batch_size + 1, dtype=torch.int32, device=device)
|
||||||
|
query_start_loc[1:] = torch.tensor(q_lens, dtype=torch.int32, device=device).cumsum(
|
||||||
|
0
|
||||||
|
)
|
||||||
|
query_start_loc_cpu = query_start_loc.cpu()
|
||||||
|
|
||||||
|
seq_lens = torch.tensor(kv_lens, dtype=torch.int32, device=device)
|
||||||
|
seq_lens_cpu = seq_lens.cpu()
|
||||||
|
max_seq_len = int(seq_lens_cpu.max())
|
||||||
|
|
||||||
|
context_lens = [kv - q for kv, q in zip(kv_lens, q_lens)]
|
||||||
|
num_computed_tokens_cpu = torch.tensor(context_lens, dtype=torch.int32)
|
||||||
|
|
||||||
|
max_blocks = (max(kv_lens) + block_size - 1) // block_size
|
||||||
|
num_blocks = batch_size * max_blocks
|
||||||
|
block_table_tensor = torch.arange(
|
||||||
|
num_blocks, dtype=torch.int32, device=device
|
||||||
|
).view(batch_size, max_blocks)
|
||||||
|
slot_mapping = torch.arange(total_tokens, dtype=torch.int64, device=device)
|
||||||
|
|
||||||
|
max_query_len = max(q_lens)
|
||||||
|
|
||||||
|
return CommonAttentionMetadata(
|
||||||
|
query_start_loc=query_start_loc,
|
||||||
|
query_start_loc_cpu=query_start_loc_cpu,
|
||||||
|
seq_lens=seq_lens,
|
||||||
|
seq_lens_cpu=seq_lens_cpu,
|
||||||
|
num_computed_tokens_cpu=num_computed_tokens_cpu,
|
||||||
|
num_reqs=batch_size,
|
||||||
|
num_actual_tokens=total_tokens,
|
||||||
|
max_query_len=max_query_len,
|
||||||
|
max_seq_len=max_seq_len,
|
||||||
|
block_table_tensor=block_table_tensor,
|
||||||
|
slot_mapping=slot_mapping,
|
||||||
|
causal=True,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def _create_vllm_config(
|
||||||
|
config: BenchmarkConfig,
|
||||||
|
dtype: torch.dtype,
|
||||||
|
max_num_blocks: int,
|
||||||
|
) -> VllmConfig:
|
||||||
|
"""Create a VllmConfig for benchmarking with mock model methods."""
|
||||||
|
model_config = ModelConfig(
|
||||||
|
model="meta-llama/Meta-Llama-3-8B",
|
||||||
|
tokenizer="meta-llama/Meta-Llama-3-8B",
|
||||||
|
trust_remote_code=False,
|
||||||
|
dtype=dtype,
|
||||||
|
seed=0,
|
||||||
|
max_model_len=1024,
|
||||||
|
)
|
||||||
|
|
||||||
|
cache_config = CacheConfig(
|
||||||
|
block_size=config.block_size,
|
||||||
|
cache_dtype="auto",
|
||||||
|
swap_space=0,
|
||||||
|
)
|
||||||
|
cache_config.num_gpu_blocks = max_num_blocks
|
||||||
|
cache_config.num_cpu_blocks = 0
|
||||||
|
|
||||||
|
parallel_config = ParallelConfig(tensor_parallel_size=1)
|
||||||
|
scheduler_config = SchedulerConfig(
|
||||||
|
max_num_seqs=256,
|
||||||
|
max_num_batched_tokens=8192,
|
||||||
|
max_model_len=8192,
|
||||||
|
is_encoder_decoder=False,
|
||||||
|
enable_chunked_prefill=True,
|
||||||
|
)
|
||||||
|
device_config = DeviceConfig()
|
||||||
|
load_config = LoadConfig()
|
||||||
|
compilation_config = CompilationConfig()
|
||||||
|
|
||||||
|
# Add mock methods for benchmark config values
|
||||||
|
model_config.get_num_layers = types.MethodType(
|
||||||
|
lambda self: config.num_layers, model_config
|
||||||
|
)
|
||||||
|
model_config.get_sliding_window_for_layer = types.MethodType(
|
||||||
|
lambda self, i: None, model_config
|
||||||
|
)
|
||||||
|
model_config.get_logits_soft_cap_for_layer = types.MethodType(
|
||||||
|
lambda self, i: 0.0, model_config
|
||||||
|
)
|
||||||
|
model_config.get_sm_scale_for_layer = types.MethodType(
|
||||||
|
lambda self, i: 1.0 / config.head_dim**0.5, model_config
|
||||||
|
)
|
||||||
|
model_config.get_num_attention_heads = types.MethodType(
|
||||||
|
lambda self, parallel_config=None: config.num_q_heads, model_config
|
||||||
|
)
|
||||||
|
model_config.get_num_kv_heads = types.MethodType(
|
||||||
|
lambda self, parallel_config=None: config.num_kv_heads, model_config
|
||||||
|
)
|
||||||
|
model_config.get_head_size = types.MethodType(
|
||||||
|
lambda self: config.head_dim, model_config
|
||||||
|
)
|
||||||
|
model_config.get_sliding_window = types.MethodType(lambda self: None, model_config)
|
||||||
|
|
||||||
|
return VllmConfig(
|
||||||
|
model_config=model_config,
|
||||||
|
cache_config=cache_config,
|
||||||
|
parallel_config=parallel_config,
|
||||||
|
scheduler_config=scheduler_config,
|
||||||
|
device_config=device_config,
|
||||||
|
load_config=load_config,
|
||||||
|
compilation_config=compilation_config,
|
||||||
|
)
|
||||||
|
|
||||||
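`_create_vllm_config` overrides accessor methods on a single config instance with `types.MethodType`. The sketch below shows the same technique on a stand-in class; `ModelConfigStub` and `override_methods` are hypothetical and are not part of vLLM.

```python
import types


class ModelConfigStub:
    """Hypothetical stand-in for a config object; not vllm.config.ModelConfig."""

    def get_head_size(self) -> int:
        return 128


def override_methods(obj, head_dim, num_layers):
    """Bind benchmark-specific accessors onto one instance only."""
    # MethodType binds the lambda to obj, so `self` is supplied on each call.
    obj.get_head_size = types.MethodType(lambda self: head_dim, obj)
    obj.get_num_layers = types.MethodType(lambda self: num_layers, obj)
    obj.get_sm_scale_for_layer = types.MethodType(
        lambda self, i: 1.0 / head_dim**0.5, obj
    )
    return obj
```

Because the bound methods are stored on the instance, other instances of the class keep their original behavior, which is what lets a benchmark substitute its own head count and head size without loading a matching checkpoint.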
|
|
||||||
|
+# ============================================================================
+# Backend Initialization
+# ============================================================================
+
+
+def _create_backend_impl(
+    backend_cfg: dict,
+    config: BenchmarkConfig,
+    device: torch.device,
+):
+    """Create backend implementation instance."""
+    import importlib
+
+    backend_module = importlib.import_module(backend_cfg["module"])
+    backend_class = getattr(backend_module, backend_cfg["backend_class"])
+
+    scale = get_attention_scale(config.head_dim)
+    dtype = backend_cfg["dtype"]
+
+    impl = backend_class.get_impl_cls()(
+        num_heads=config.num_q_heads,
+        head_size=config.head_dim,
+        scale=scale,
+        num_kv_heads=config.num_kv_heads,
+        alibi_slopes=None,
+        sliding_window=None,
+        kv_cache_dtype="auto",
+    )
+
+    kv_cache_spec = FullAttentionSpec(
+        block_size=config.block_size,
+        num_kv_heads=config.num_kv_heads,
+        head_size=config.head_dim,
+        dtype=dtype,
+    )
+
+    layer = MockLayer(device, kv_cache_spec=kv_cache_spec)
+
+    return backend_class, impl, layer, dtype
+
+
+def _create_metadata_builder(
+    backend_class,
+    kv_cache_spec: FullAttentionSpec,
+    vllm_config: VllmConfig,
+    device: torch.device,
+):
+    """Create metadata builder instance."""
+    return backend_class.get_builder_cls()(
+        kv_cache_spec=kv_cache_spec,
+        layer_names=["layer_0"],
+        vllm_config=vllm_config,
+        device=device,
+    )
+
+
+# ============================================================================
+# Tensor Creation Helpers
+# ============================================================================
+
+
+def _create_input_tensors(
+    config: BenchmarkConfig,
+    total_q: int,
+    device: torch.device,
+    dtype: torch.dtype,
+) -> tuple:
+    """Create Q, K, V input tensors for all layers."""
+    q_list = [
+        torch.randn(
+            total_q, config.num_q_heads, config.head_dim, device=device, dtype=dtype
+        )
+        for _ in range(config.num_layers)
+    ]
+    k_list = [
+        torch.randn(
+            total_q, config.num_kv_heads, config.head_dim, device=device, dtype=dtype
+        )
+        for _ in range(config.num_layers)
+    ]
+    v_list = [
+        torch.randn(
+            total_q, config.num_kv_heads, config.head_dim, device=device, dtype=dtype
+        )
+        for _ in range(config.num_layers)
+    ]
+    return q_list, k_list, v_list
+
+
+def _create_kv_cache(
+    config: BenchmarkConfig,
+    max_num_blocks: int,
+    cache_layout: str,
+    device: torch.device,
+    dtype: torch.dtype,
+) -> list:
+    """Create KV cache tensors for all layers."""
+    if cache_layout == "flashinfer":
+        # FlashInfer layout: [num_blocks, 2, block_size, num_kv_heads, head_dim]
+        cache_list = [
+            torch.zeros(
+                max_num_blocks,
+                2,
+                config.block_size,
+                config.num_kv_heads,
+                config.head_dim,
+                device=device,
+                dtype=dtype,
+            )
+            for _ in range(config.num_layers)
+        ]
+    else:
+        # Standard layout: [2, num_blocks, block_size, num_kv_heads, head_dim]
+        cache_list = [
+            torch.zeros(
+                2,
+                max_num_blocks,
+                config.block_size,
+                config.num_kv_heads,
+                config.head_dim,
+                device=device,
+                dtype=dtype,
+            )
+            for _ in range(config.num_layers)
+        ]
+    return cache_list
+
+
+# ============================================================================
+# Benchmark Execution
+# ============================================================================
+
+
+def _run_single_benchmark(
+    config: BenchmarkConfig,
+    impl,
+    layer,
+    q_list: list,
+    k_list: list,
+    v_list: list,
+    cache_list: list,
+    attn_metadata,
+    device: torch.device,
+    dtype: torch.dtype,
+) -> tuple:
+    """Run single benchmark iteration with warmup and timing loop."""
+    total_q = q_list[0].shape[0]
+    out = torch.empty(
+        total_q, config.num_q_heads, config.head_dim, device=device, dtype=dtype
+    )
+
+    # Warmup
+    for _ in range(config.warmup_iters):
+        for i in range(config.num_layers):
+            impl.forward(
+                layer,
+                q_list[i],
+                k_list[i],
+                v_list[i],
+                cache_list[i],
+                attn_metadata,
+                output=out,
+            )
+        torch.cuda.synchronize()
+
+    # Benchmark
+    times = []
+    for _ in range(config.repeats):
+        start = torch.cuda.Event(enable_timing=True)
+        end = torch.cuda.Event(enable_timing=True)
+
+        start.record()
+        for i in range(config.num_layers):
+            impl.forward(
+                layer,
+                q_list[i],
+                k_list[i],
+                v_list[i],
+                cache_list[i],
+                attn_metadata,
+                output=out,
+            )
+        end.record()
+
+        torch.cuda.synchronize()
+        elapsed_ms = start.elapsed_time(end)
+        times.append(elapsed_ms / 1000.0 / config.num_layers)  # seconds per layer
+
+    mem_stats = {}
+    if config.profile_memory:
+        mem_stats = {
+            "allocated_mb": torch.cuda.memory_allocated(device) / 1024**2,
+            "reserved_mb": torch.cuda.memory_reserved(device) / 1024**2,
+        }
+
+    return times, mem_stats
+
+
+# ============================================================================
+# Public API
+# ============================================================================
+
+
+def run_attention_benchmark(config: BenchmarkConfig) -> BenchmarkResult:
+    """
+    Run standard attention benchmark with real kernels.
+
+    Supports: flash, triton, flashinfer
+
+    Args:
+        config: Benchmark configuration
+
+    Returns:
+        BenchmarkResult with timing and memory statistics
+    """
+    device = torch.device(config.device)
+    torch.cuda.set_device(device)
+
+    backend_cfg = _get_backend_config(config.backend)
+
+    requests = parse_batch_spec(config.batch_spec)
+
+    if config.backend == "flashinfer":
+        requests = reorder_for_flashinfer(requests)
+
+    q_lens = [r.q_len for r in requests]
+    kv_lens = [r.kv_len for r in requests]
+    total_q = sum(q_lens)
+    max_kv = max(kv_lens)
+
+    max_num_blocks = (max_kv + config.block_size - 1) // config.block_size
+
+    backend_class, impl, layer, dtype = _create_backend_impl(
+        backend_cfg, config, device
+    )
+
+    common_metadata = _build_common_attn_metadata(
+        q_lens, kv_lens, config.block_size, device
+    )
+
+    kv_cache_spec = FullAttentionSpec(
+        block_size=config.block_size,
+        num_kv_heads=config.num_kv_heads,
+        head_size=config.head_dim,
+        dtype=dtype,
+    )
+
+    vllm_config = _create_vllm_config(config, dtype, max_num_blocks)
+
+    builder = _create_metadata_builder(
+        backend_class, kv_cache_spec, vllm_config, device
+    )
+
+    attn_metadata = builder.build(
+        common_prefix_len=0,
+        common_attn_metadata=common_metadata,
+    )
+
+    q_list, k_list, v_list = _create_input_tensors(config, total_q, device, dtype)
+
+    cache_list = _create_kv_cache(
+        config, max_num_blocks, backend_cfg["cache_layout"], device, dtype
+    )
+
+    times, mem_stats = _run_single_benchmark(
+        config,
+        impl,
+        layer,
+        q_list,
+        k_list,
+        v_list,
+        cache_list,
+        attn_metadata,
+        device,
+        dtype,
+    )
+
+    mean_time = np.mean(times)
+    throughput = total_q / mean_time if mean_time > 0 else 0
+
+    return BenchmarkResult(
+        config=config,
+        mean_time=mean_time,
+        std_time=np.std(times),
+        min_time=np.min(times),
+        max_time=np.max(times),
+        throughput_tokens_per_sec=throughput,
+        memory_allocated_mb=mem_stats.get("allocated_mb"),
+        memory_reserved_mb=mem_stats.get("reserved_mb"),
+    )
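Review note: the arithmetic in `run_attention_benchmark` can be checked without a GPU. The sketch below isolates the ceiling division used for `max_num_blocks` and the conversion from per-repeat event times to tokens per second. The helper names `num_blocks_for` and `summarize` are hypothetical and not part of this change, and `statistics.fmean` stands in for `np.mean`.

```python
from statistics import fmean


def num_blocks_for(max_kv: int, block_size: int) -> int:
    """Ceiling division, mirroring (max_kv + block_size - 1) // block_size."""
    return (max_kv + block_size - 1) // block_size


def summarize(elapsed_ms: list[float], num_layers: int, total_q: int) -> dict:
    """Turn per-repeat elapsed times (ms) into per-layer seconds and throughput."""
    times = [ms / 1000.0 / num_layers for ms in elapsed_ms]
    mean_time = fmean(times)
    throughput = total_q / mean_time if mean_time > 0 else 0
    return {"mean_time": mean_time, "throughput_tokens_per_sec": throughput}


# Illustrative inputs only, not measurements.
blocks = num_blocks_for(max_kv=1000, block_size=16)  # 62 full blocks + 1 partial
stats = summarize([20.0, 20.0], num_layers=10, total_q=512)
```

Because each recorded time is divided by `num_layers`, the reported throughput is tokens per second for a single layer's attention call, not for the whole stack.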
@@ -1,244 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
-# Copyright (c) Microsoft Corporation.
-# Licensed under the MIT License.
-
-from packaging import version
-
-from vllm.model_executor.layers.quantization.utils.bitblas_utils import (
-    MINIMUM_BITBLAS_VERSION,
-)
-
-try:
-    import bitblas
-
-    if version.parse(bitblas.__version__) < version.parse(MINIMUM_BITBLAS_VERSION):
-        raise ImportError(
-            "bitblas version is wrong. Please "
-            f"install bitblas>={MINIMUM_BITBLAS_VERSION}"
-        )
-except ImportError as e:
-    bitblas_import_exception = e
-    raise ValueError(
-        "Trying to use the bitblas backend, but could not import"
-        f"with the following error: {bitblas_import_exception}. "
-        "Please install bitblas through the following command: "
-        f"`pip install bitblas>={MINIMUM_BITBLAS_VERSION}`"
-    ) from bitblas_import_exception
-
-from bitblas import Matmul, MatmulConfig, auto_detect_nvidia_target
-
-from vllm.utils.argparse_utils import FlexibleArgumentParser
-
-parser = FlexibleArgumentParser(
-    description="Benchmark BitBLAS int4 on a specific target."
-)
-
-# Add arguments to the parser
-parser.add_argument(
-    "--target",
-    type=str,
-    default=auto_detect_nvidia_target(),
-    help="Specify the target device for benchmarking.",
-)
-parser.add_argument(
-    "--group_size", type=int, default=None, help="Group size for grouped quantization."
-)
-parser.add_argument(
-    "--A_dtype",
-    type=str,
-    default="float16",
-    choices=["float16", "float32", "float64", "int32", "int8"],
-    help="Data type of activation A.",
-)
-parser.add_argument(
-    "--W_dtype",
-    type=str,
-    default="int4",
-    choices=[
-        "float16",
-        "float32",
-        "float64",
-        "int32",
-        "int8",
-        "int4",
-        "int2",
-        "int1",
-        "nf4",
-        "fp4_e2m1",
-    ],
-    help="Data type of weight W.",
-)
-parser.add_argument(
-    "--accum_dtype",
-    type=str,
-    default="float16",
-    choices=["float16", "int32"],
-    help="Data type for accumulation.",
-)
-parser.add_argument(
-    "--out_dtype",
-    type=str,
-    default="float16",
-    choices=["float16", "float32", "int32", "int8"],
-    help="Data type for output.",
-)
-parser.add_argument(
-    "--layout",
-    type=str,
-    default="nt",
-    choices=["nt", "nn"],
-    help="Matrix layout, 'nt' for non-transpose A and transpose W.",
-)
-parser.add_argument(
-    "--with_bias", action="store_true", help="Include bias in the benchmark."
-)
-parser.add_argument(
-    "--with_scaling",
-    action="store_true",
-    help="Include scaling factor in the quantization.",
-)
-parser.add_argument(
-    "--with_zeros", action="store_true", help="Include zeros in the quantization."
-)
-parser.add_argument(
-    "--zeros_mode",
-    type=str,
-    default=None,
-    choices=["original", "rescale", "quantized"],
-    help="Specify the mode for calculating zeros.",
-)
-
-# Parse the arguments
-args = parser.parse_args()
-
-# Assign arguments to variables
-target = args.target
-A_dtype = args.A_dtype
-W_dtype = args.W_dtype
-accum_dtype = args.accum_dtype
-out_dtype = args.out_dtype
-layout = args.layout
-with_bias = args.with_bias
-group_size = args.group_size
-with_scaling = args.with_scaling
-with_zeros = args.with_zeros
-zeros_mode = args.zeros_mode
-
-# Define a list of shared arguments that repeat in every config
-shared_args = [
-    A_dtype,
-    W_dtype,
-    out_dtype,
-    accum_dtype,
-    layout,
-    with_bias,
-    group_size,
-    with_scaling,
-    with_zeros,
-    zeros_mode,
-]
-
-# Define just the (M, K, N) shapes in a more compact list
-shapes = [
-    # square test
-    (1, 16384, 16384),
-    # BLOOM-176B
-    (1, 43008, 14336),
-    (1, 14336, 14336),
-    (1, 57344, 14336),
-    (1, 14336, 57344),
-    # OPT-65B
-    (1, 9216, 9216),
-    (1, 36864, 9216),
-    (1, 9216, 36864),
-    (1, 22016, 8192),
-    # LLAMA-70B/65B
-    (1, 8192, 22016),
-    (1, 8192, 8192),
-    (1, 28672, 8192),
-    (1, 8192, 28672),
-    # square test
-    (16384, 16384, 16384),
-    # BLOOM-176B
-    (8192, 43008, 14336),
-    (8192, 14336, 14336),
-    (8192, 57344, 14336),
-    (8192, 14336, 57344),
-    # OPT-65B
-    (8192, 9216, 9216),
-    (8192, 36864, 9216),
-    (8192, 9216, 36864),
-    (8192, 22016, 8192),
-    # LLAMA-70B/65B
-    (8192, 8192, 22016),
-    (8192, 8192, 8192),
-    (8192, 28672, 8192),
-    (8192, 8192, 28672),
-]
-
-# Build test shapes with all the shared arguments
-test_shapes = [(MatmulConfig, Matmul, (*shape, *shared_args)) for shape in shapes]
-
-benchmark_sets = []
-benchmark_sets.extend(test_shapes)
-
-benchmark_results = {}
-for config_class, operator, input_args in benchmark_sets:
-    config = config_class(*input_args)
-    matmul = operator(config, target=target, enable_tuning=True)
-    kernel_latency = matmul.profile_latency()
-
-    print("Time cost is: {:.3f} ms".format(kernel_latency))
-
-    profile_config = {
-        f"{operator.__name__}-{'-'.join([str(i) for i in input_args])}": {
-            "BitBLAS_top20_latency": kernel_latency,
-        }
-    }
-
-    benchmark_results.update(profile_config)
-
-# Define headers for the table
-headers = [
-    "PrimFunc",
-    "Input Arguments",
-    "BitBLAS Top20 Latency",
-]
-
-# Calculate column widths for pretty printing
-col_widths = [0, 0, 0]
-for config_key, values in benchmark_results.items():
-    args_split = config_key.split("-")
-    func_name = args_split[0]
-    input_args_str = "-".join(args_split[1:])
-    col_widths[0] = max(col_widths[0], len(func_name) + 2, len(headers[0]) + 2)
-    col_widths[1] = max(col_widths[1], len(input_args_str) + 2, len(headers[1]) + 2)
-    col_widths[2] = max(
-        col_widths[2],
-        len(f"{values['BitBLAS_top20_latency']:.3f} ms") + 2,
-        len(headers[2]) + 2,
-    )
-    # break only if you want to measure widths from a single example;
-    # otherwise, let it loop over all items.
-
-# Print header
-for i, header in enumerate(headers):
-    headers[i] = header.ljust(col_widths[i])
-print("".join(headers))
-print("-" * sum(col_widths))
-
-# Print rows
-for config_key, values in benchmark_results.items():
-    args_split = config_key.split("-")
-    func_name = args_split[0]
-    input_args_str = "-".join(args_split[1:])
-    row = [
-        func_name,
-        input_args_str,
-        f"{values['BitBLAS_top20_latency']:.3f} ms",
-    ]
-    row_str = "".join(
-        [str(cell).ljust(col_widths[idx]) for idx, cell in enumerate(row)]
-    )
-    print(row_str)
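Review note: the table printing at the end of the removed script pads each column to its widest cell plus two characters. If that output format is still wanted elsewhere, the same idea fits in one small function. This is a minimal sketch; `format_table` is a hypothetical helper, and the latency value is a placeholder, not a measurement.

```python
def format_table(headers: list[str], rows: list[list[str]], pad: int = 2) -> list[str]:
    """Left-justify every cell to the widest entry in its column plus padding."""
    widths = [
        max([len(headers[i])] + [len(row[i]) for row in rows]) + pad
        for i in range(len(headers))
    ]
    lines = ["".join(h.ljust(w) for h, w in zip(headers, widths))]
    lines.append("-" * sum(widths))
    for row in rows:
        lines.append("".join(cell.ljust(w) for cell, w in zip(row, widths)))
    return lines


# Placeholder latency, not a measurement.
table = format_table(
    ["PrimFunc", "Input Arguments", "Latency"],
    [["Matmul", "1-16384-16384", f"{0.123:.3f} ms"]],
)
print("\n".join(table))
```

Computing the widths in a single pass over headers and rows avoids the two separate loops over `benchmark_results` that the removed script used.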
@@ -842,6 +842,7 @@ class BenchmarkTensors:
             "sorted_token_ids": sorted_token_ids,
             "expert_ids": expert_ids,
             "num_tokens_post_padded": num_tokens_post_padded,
+            "token_lora_mapping": self.lora_kernel_meta.token_lora_mapping,
             "top_k_num": ctx.top_k_num,
             "device": self.input.device,
             "N": lora_rank,
@@ -915,6 +916,7 @@ class BenchmarkTensors:
             "sorted_token_ids": sorted_token_ids,
             "expert_ids": expert_ids,
             "num_tokens_post_padded": num_tokens_post_padded,
+            "token_lora_mapping": self.lora_kernel_meta.token_lora_mapping,
             "top_k_num": ctx.top_k_num,
             "device": self.input.device,
             "N": lora_rank,
@@ -6,12 +6,6 @@ import torch.utils.benchmark as benchmark
 from benchmark_shapes import WEIGHT_SHAPES

 from vllm import _custom_ops as ops
-from vllm.model_executor.layers.quantization.gptq_marlin_24 import (
-    GPTQ_MARLIN_24_MAX_PARALLEL,
-    GPTQ_MARLIN_24_MIN_THREAD_N,
-    GPTQ_MARLIN_24_SUPPORTED_GROUP_SIZES,
-    GPTQ_MARLIN_24_SUPPORTED_QUANT_TYPES,
-)
 from vllm.model_executor.layers.quantization.utils.allspark_utils import (
     ALLSPARK_AMPERE_M_CUBLAS_THRESHOLD,
     ALLSPARK_SUPPORTED_QUANT_TYPES,
@@ -34,9 +28,6 @@ from vllm.model_executor.layers.quantization.utils.marlin_utils_test import (
     awq_marlin_quantize,
     marlin_quantize,
 )
-from vllm.model_executor.layers.quantization.utils.marlin_utils_test_24 import (
-    marlin_24_quantize,
-)
 from vllm.model_executor.layers.quantization.utils.quant_utils import (
     gptq_pack,
     gptq_quantize_weights,
@@ -78,14 +69,7 @@ def bench_run(
     if size_k % group_size != 0:
         return

-    marlin_24_supported = (
-        quant_type in GPTQ_MARLIN_24_SUPPORTED_QUANT_TYPES
-        and group_size in GPTQ_MARLIN_24_SUPPORTED_GROUP_SIZES
-    )
-    repack_supported = (
-        quant_type in GPTQ_MARLIN_24_SUPPORTED_QUANT_TYPES
-        and group_size in MARLIN_SUPPORTED_GROUP_SIZES
-    )
+    repack_supported = group_size in MARLIN_SUPPORTED_GROUP_SIZES
     allspark_supported = (
         quant_type in ALLSPARK_SUPPORTED_QUANT_TYPES
         and group_size == -1
@@ -126,14 +110,6 @@ def bench_run(
             marlin_sort_indices,
         )

-    def gen_marlin_24_params():
-        marlin_24_w_ref = marlin_24_q_w_comp = marlin_24_meta = marlin_24_s = None
-        if marlin_24_supported:
-            (marlin_24_w_ref, marlin_24_q_w_comp, marlin_24_meta, marlin_24_s) = (
-                marlin_24_quantize(b, quant_type, group_size)
-            )
-        return (marlin_24_w_ref, marlin_24_q_w_comp, marlin_24_meta, marlin_24_s)
-
     def gen_repack_params():
         q_w_gptq = None
         repack_sort_indices = None
@@ -188,9 +164,6 @@ def bench_run(
         marlin_g_idx,
         marlin_sort_indices,
     ) = gen_marlin_params()
-    marlin_24_w_ref, marlin_24_q_w_comp, marlin_24_meta, marlin_24_s = (
-        gen_marlin_24_params()
-    )
     q_w_gptq, repack_sort_indices = gen_repack_params()
     qw_reorder, s_reorder, zp_reorder, sm_count, sm_version, CUBLAS_M_THRESHOLD = (
         gen_allspark_params()
@@ -200,9 +173,6 @@ def bench_run(
     marlin_workspace = MarlinWorkspace(
         size_n, GPTQ_MARLIN_MIN_THREAD_N, GPTQ_MARLIN_MAX_PARALLEL
     )
-    marlin_24_workspace = MarlinWorkspace(
-        size_n, GPTQ_MARLIN_24_MIN_THREAD_N, GPTQ_MARLIN_24_MAX_PARALLEL
-    )

     globals = {
         # Gen params
@@ -222,12 +192,6 @@ def bench_run(
         "marlin_sort_indices": marlin_sort_indices,
         "marlin_workspace": marlin_workspace,
         "is_k_full": is_k_full,
-        # Marlin_24 params
-        "marlin_24_w_ref": marlin_24_w_ref,
-        "marlin_24_q_w_comp": marlin_24_q_w_comp,
-        "marlin_24_meta": marlin_24_meta,
-        "marlin_24_s": marlin_24_s,
-        "marlin_24_workspace": marlin_24_workspace,
         # GPTQ params
         "q_w_gptq": q_w_gptq,
         "repack_sort_indices": repack_sort_indices,
@@ -240,7 +204,6 @@ def bench_run(
         "CUBLAS_M_THRESHOLD": CUBLAS_M_THRESHOLD,
         # Kernels
         "marlin_gemm": ops.marlin_gemm,
-        "gptq_marlin_24_gemm": ops.gptq_marlin_24_gemm,
         "gptq_marlin_repack": ops.gptq_marlin_repack,
         "allspark_w8a16_gemm": ops.allspark_w8a16_gemm,
     }
@@ -281,17 +244,6 @@ def bench_run(
         ).blocked_autorange(min_run_time=min_run_time)
     )

-    if marlin_24_supported:
-        results.append(
-            benchmark.Timer(
-                stmt="output = gptq_marlin_24_gemm(a, marlin_24_q_w_comp, marlin_24_meta, marlin_24_s, marlin_24_workspace.scratch, quant_type, size_m, size_n, size_k)",  # noqa: E501
-                globals=globals,
-                label=label,
-                sub_label=sub_label,
-                description="gptq_marlin_24_gemm",
-            ).blocked_autorange(min_run_time=min_run_time)
-        )
-
     if repack_supported:
         results.append(
             benchmark.Timer(
@@ -27,7 +27,6 @@ from vllm.model_executor.layers.fused_moe.fused_moe import *
 from vllm.model_executor.layers.fused_moe.triton_deep_gemm_moe import (
     TritonOrDeepGemmExperts,
 )
-from vllm.platforms import current_platform
 from vllm.transformers_utils.config import get_config
 from vllm.triton_utils import triton
 from vllm.utils.argparse_utils import FlexibleArgumentParser
@@ -482,6 +481,8 @@ class BenchmarkWorker:
         block_quant_shape: list[int] = None,
         use_deep_gemm: bool = False,
     ) -> tuple[dict[str, int], float]:
+        # local import to allow serialization by ray
+
         set_random_seed(self.seed)
         dtype_str = _get_config_dtype_str(
             dtype, use_int8_w8a16=use_int8_w8a16, use_fp8_w8a8=use_fp8_w8a8
@@ -535,6 +536,9 @@ class BenchmarkWorker:
         block_quant_shape: list[int],
         use_deep_gemm: bool,
     ) -> dict[str, int]:
+        # local import to allow serialization by ray
+        from vllm.platforms import current_platform
+
         best_config = None
         best_time = float("inf")
         if current_platform.is_rocm():
@@ -646,20 +650,28 @@ def save_configs(
         f.write("\n")


+def get_compressed_tensors_block_structure(config, default_value=None):
+    config_groups = config.get("config_groups", {})
+    if len(config_groups) != 1:
+        return default_value
+    group = next(iter(config_groups.values()))
+    weights = group.get("weights", {})
+    block_structure = weights.get("block_structure", default_value)
+    return block_structure
+
+
 def get_weight_block_size_safety(config, default_value=None):
     quantization_config = getattr(config, "quantization_config", {})
     if isinstance(quantization_config, dict):
-        return quantization_config.get("weight_block_size", default_value)
+        if "weight_block_size" in quantization_config:
+            return quantization_config["weight_block_size"]
+        return get_compressed_tensors_block_structure(
+            quantization_config, default_value
+        )
     return default_value


-def main(args: argparse.Namespace):
-    print(args)
-
-    config = get_config(model=args.model, trust_remote_code=args.trust_remote_code)
-    if args.model_prefix:
-        config = getattr(config, args.model_prefix)
-
+def get_model_params(config):
     if config.architectures[0] == "DbrxForCausalLM":
         E = config.ffn_config.moe_num_experts
         topk = config.ffn_config.moe_top_k
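Review note: the block-size lookup in this hunk now has two sources. It prefers a top-level `weight_block_size` key and otherwise falls back to `block_structure` inside a single-entry `config_groups` mapping. The sketch below copies the two functions from the hunk and runs them on made-up configs. The `SimpleNamespace` objects, the group names, and the `[128, 128]` values are assumptions for illustration, not taken from any real checkpoint.

```python
from types import SimpleNamespace


def get_compressed_tensors_block_structure(config, default_value=None):
    config_groups = config.get("config_groups", {})
    if len(config_groups) != 1:
        return default_value  # zero or several groups: no single answer
    group = next(iter(config_groups.values()))
    weights = group.get("weights", {})
    return weights.get("block_structure", default_value)


def get_weight_block_size_safety(config, default_value=None):
    quantization_config = getattr(config, "quantization_config", {})
    if isinstance(quantization_config, dict):
        if "weight_block_size" in quantization_config:
            return quantization_config["weight_block_size"]
        return get_compressed_tensors_block_structure(
            quantization_config, default_value
        )
    return default_value


# Hypothetical configs for illustration only.
native = SimpleNamespace(quantization_config={"weight_block_size": [128, 128]})
compressed = SimpleNamespace(
    quantization_config={
        "config_groups": {"group_0": {"weights": {"block_structure": [128, 128]}}}
    }
)
two_groups = SimpleNamespace(quantization_config={"config_groups": {"a": {}, "b": {}}})
unquantized = SimpleNamespace()

results = [
    get_weight_block_size_safety(c)
    for c in (native, compressed, two_groups, unquantized)
]
```

With more than one config group the helper returns the default rather than picking a group, so mixed-scheme checkpoints fall back to the caller's default.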
@@ -677,6 +689,7 @@ def main(args: argparse.Namespace):
         "Glm4MoeForCausalLM",
         "Glm4MoeLiteForCausalLM",
         "NemotronHForCausalLM",
+        "MistralLarge3ForCausalLM",
     ):
         E = config.n_routed_experts
         topk = config.num_experts_per_tok
@@ -697,16 +710,20 @@ def main(args: argparse.Namespace):
         topk = text_config.num_experts_per_tok
         intermediate_size = text_config.moe_intermediate_size
         hidden_size = text_config.hidden_size
-    elif config.architectures[0] in ("HunYuanMoEV1ForCausalLM"):
+    elif config.architectures[0] == "HunYuanMoEV1ForCausalLM":
         E = config.num_experts
         topk = config.moe_topk[0]
         intermediate_size = config.moe_intermediate_size[0]
         hidden_size = config.hidden_size
-    elif config.architectures[0] in ["Qwen3OmniMoeForConditionalGeneration"]:
+    elif config.architectures[0] == "Qwen3OmniMoeForConditionalGeneration":
         E = config.thinker_config.text_config.num_experts
         topk = config.thinker_config.text_config.num_experts_per_tok
         intermediate_size = config.thinker_config.text_config.moe_intermediate_size
         hidden_size = config.thinker_config.text_config.hidden_size
+    elif config.architectures[0] == "PixtralForConditionalGeneration":
+        # Pixtral can contain different LLM architectures,
+        # recurse to get their parameters
+        return get_model_params(config.get_text_config())
     else:
         # Support for llama4
         config = config.get_text_config()
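Review note: the switch from `in ("HunYuanMoEV1ForCausalLM")` to `==` in this hunk changes behavior as well as style. Parentheses without a trailing comma do not create a tuple, so the old test was a substring check against a string. The list form used for Qwen3OmniMoe was already a true membership test, so that line changes only in style. A small sketch of the difference, with variable names chosen for illustration:

```python
arch = "HunYuanMoEV1ForCausalLM"

not_a_tuple = ("HunYuanMoEV1ForCausalLM")   # parentheses only: still a str
real_tuple = ("HunYuanMoEV1ForCausalLM",)   # trailing comma makes a tuple

substring_hit = "HunYuan" in not_a_tuple    # substring test on a str
tuple_hit = "HunYuan" in real_tuple         # element membership on a tuple
exact_hit = arch == "HunYuanMoEV1ForCausalLM"
```

Under the old form, any architecture name that is a substring of the literal would have matched this branch.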
@@ -715,6 +732,16 @@ def main(args: argparse.Namespace):
|
|||||||
topk = config.num_experts_per_tok
|
topk = config.num_experts_per_tok
|
||||||
intermediate_size = config.intermediate_size
|
intermediate_size = config.intermediate_size
|
||||||
hidden_size = config.hidden_size
|
hidden_size = config.hidden_size
|
||||||
|
return E, topk, intermediate_size, hidden_size
|
||||||
|
|
||||||
|
|
||||||
|
def main(args: argparse.Namespace):
|
||||||
|
print(args)
|
||||||
|
|
||||||
|
config = get_config(model=args.model, trust_remote_code=args.trust_remote_code)
|
||||||
|
if args.model_prefix:
|
||||||
|
config = getattr(config, args.model_prefix)
|
||||||
|
E, topk, intermediate_size, hidden_size = get_model_params(config)
|
||||||
enable_ep = bool(args.enable_expert_parallel)
|
enable_ep = bool(args.enable_expert_parallel)
|
||||||
if enable_ep:
|
if enable_ep:
|
||||||
ensure_divisibility(E, args.tp_size, "Number of experts")
|
ensure_divisibility(E, args.tp_size, "Number of experts")
|
||||||
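Note on the `get_model_params` change above: the new Pixtral branch returns the result of calling the same function on `config.get_text_config()`, so a wrapper config resolves to whatever MoE parameters its inner language model defines. The sketch below only illustrates that recursion pattern. `WrapperConfig`, `get_moe_params`, the `num_local_experts` attribute, and the numeric values are hypothetical stand-ins, not names taken from the commit.

```python
from types import SimpleNamespace


class WrapperConfig:
    """Hypothetical stand-in for a multimodal config wrapping a text config."""

    architectures = ["PixtralForConditionalGeneration"]

    def __init__(self, text_config):
        self._text_config = text_config

    def get_text_config(self):
        return self._text_config


def get_moe_params(config):
    """Return (E, topk, intermediate_size, hidden_size) for a config."""
    if config.architectures[0] == "PixtralForConditionalGeneration":
        # The wrapper can contain different LLM architectures,
        # so recurse into the text config to read its parameters.
        return get_moe_params(config.get_text_config())
    # Assumed attribute names for the inner model (illustrative only).
    return (
        config.num_local_experts,
        config.num_experts_per_tok,
        config.intermediate_size,
        config.hidden_size,
    )


inner = SimpleNamespace(
    architectures=["SomeMoEForCausalLM"],  # hypothetical inner architecture
    num_local_experts=8,
    num_experts_per_tok=2,
    intermediate_size=14336,
    hidden_size=4096,
)
params = get_moe_params(WrapperConfig(inner))
```

The recursion keeps the wrapper branch free of any per-architecture attribute lookups: supporting a new inner model only requires a branch for that model, not for every wrapper around it.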
|
|||||||
@@ -22,8 +22,8 @@ from vllm.utils.argparse_utils import FlexibleArgumentParser
|
|||||||
|
|
||||||
mp.set_start_method("spawn", force=True)
|
mp.set_start_method("spawn", force=True)
|
||||||
|
|
||||||
assert current_platform.is_cuda(), (
|
assert current_platform.is_cuda() or current_platform.is_rocm(), (
|
||||||
"Only support tune w8a8 block fp8 kernel on CUDA device."
|
"Only support tune w8a8 block fp8 kernel on CUDA/ROCm device."
|
||||||
)
|
)
|
||||||
|
|
||||||
DTYPE_MAP = {
|
DTYPE_MAP = {
|
||||||
|
|||||||
@@ -14,7 +14,7 @@ from vllm._custom_ops import (
|
|||||||
)
|
)
|
||||||
from vllm.platforms import CpuArchEnum, current_platform
|
from vllm.platforms import CpuArchEnum, current_platform
|
||||||
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE
|
from vllm.utils.torch_utils import STR_DTYPE_TO_TORCH_DTYPE, set_random_seed
|
||||||
from vllm.v1.attention.backends.cpu_attn import CPUAttentionBackend, _get_attn_isa
|
from vllm.v1.attention.backends.cpu_attn import CPUAttentionBackend, _get_attn_isa
|
||||||
|
|
||||||
|
|
||||||
@@ -58,7 +58,7 @@ def main(
|
|||||||
seed: int = 0,
|
seed: int = 0,
|
||||||
iters: int = 20,
|
iters: int = 20,
|
||||||
) -> None:
|
) -> None:
|
||||||
current_platform.seed_everything(seed)
|
set_random_seed(seed)
|
||||||
num_seqs = len(seq_lens)
|
num_seqs = len(seq_lens)
|
||||||
query_lens = [x[0] for x in seq_lens]
|
query_lens = [x[0] for x in seq_lens]
|
||||||
kv_lens = [x[1] for x in seq_lens]
|
kv_lens = [x[1] for x in seq_lens]
|
||||||
|
|||||||
@@ -7,8 +7,8 @@ import time
|
|||||||
import numpy as np
|
import numpy as np
|
||||||
import torch
|
import torch
|
||||||
|
|
||||||
from vllm.platforms import current_platform
|
|
||||||
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
from vllm.utils.argparse_utils import FlexibleArgumentParser
|
||||||
|
from vllm.utils.torch_utils import set_random_seed
|
||||||
|
|
||||||
# Check if CPU MoE operations are available
|
# Check if CPU MoE operations are available
|
||||||
try:
|
try:
|
||||||
@@ -41,7 +41,7 @@ def main(
|
|||||||
seed: int = 0,
|
seed: int = 0,
|
||||||
iters: int = 20,
|
iters: int = 20,
|
||||||
) -> None:
|
) -> None:
|
||||||
current_platform.seed_everything(seed)
|
set_random_seed(seed)
|
||||||
# up_dim = 2 * intermediate_size for gate + up projection
|
# up_dim = 2 * intermediate_size for gate + up projection
|
||||||
up_dim = 2 * intermediate_size
|
up_dim = 2 * intermediate_size
|
||||||
|
|
||||||
|
|||||||
@@ -359,6 +359,19 @@ else()
|
|||||||
add_compile_definitions(-DVLLM_NUMA_DISABLED)
|
add_compile_definitions(-DVLLM_NUMA_DISABLED)
|
||||||
endif()
|
endif()
|
||||||
|
|
||||||
|
#
|
||||||
|
# Generate CPU attention dispatch header
|
||||||
|
#
|
||||||
|
message(STATUS "Generating CPU attention dispatch header")
|
||||||
|
execute_process(
|
||||||
|
COMMAND ${Python_EXECUTABLE} ${CMAKE_SOURCE_DIR}/csrc/cpu/generate_cpu_attn_dispatch.py
|
||||||
|
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}/csrc/cpu
|
||||||
|
RESULT_VARIABLE GEN_RESULT
|
||||||
|
)
|
||||||
|
if(NOT GEN_RESULT EQUAL 0)
|
||||||
|
message(FATAL_ERROR "Failed to generate CPU attention dispatch header")
|
||||||
|
endif()
|
||||||
|
|
||||||
#
|
#
|
||||||
# _C extension
|
# _C extension
|
||||||
#
|
#
|
||||||
|
|||||||
@@ -38,7 +38,7 @@ else()
|
|||||||
FetchContent_Declare(
|
FetchContent_Declare(
|
||||||
vllm-flash-attn
|
vllm-flash-attn
|
||||||
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
|
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
|
||||||
GIT_TAG 188be16520ceefdc625fdf71365585d2ee348fe2
|
GIT_TAG 2adfc8c2177c5b0e8ddeedfd5a8990d80eb496ff
|
||||||
GIT_PROGRESS TRUE
|
GIT_PROGRESS TRUE
|
||||||
# Don't share the vllm-flash-attn build between build types
|
# Don't share the vllm-flash-attn build between build types
|
||||||
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
|
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
|
||||||
|
|||||||
@@ -1,79 +1,4 @@
|
|||||||
#include "cpu_attn_vec.hpp"
|
#include "cpu_attn_dispatch_generated.h"
|
||||||
#include "cpu_attn_vec16.hpp"
|
|
||||||
|
|
||||||
#ifdef CPU_CAPABILITY_AMXBF16
|
|
||||||
#include "cpu_attn_amx.hpp"
|
|
||||||
#define AMX_DISPATCH(...) \
|
|
||||||
case cpu_attention::ISA::AMX: { \
|
|
||||||
using attn_impl = cpu_attention::AttentionImpl<cpu_attention::ISA::AMX, \
|
|
||||||
scalar_t, head_dim>; \
|
|
||||||
return __VA_ARGS__(); \
|
|
||||||
}
|
|
||||||
#else
|
|
||||||
#define AMX_DISPATCH(...) case cpu_attention::ISA::AMX:
|
|
||||||
#endif
|
|
||||||
|
|
||||||
#ifdef __aarch64__
|
|
||||||
#include "cpu_attn_neon.hpp"
|
|
||||||
// NEON requires head_dim to be a multiple of 32
|
|
||||||
#define NEON_DISPATCH(...) \
|
|
||||||
case cpu_attention::ISA::NEON: { \
|
|
||||||
using attn_impl = cpu_attention::AttentionImpl<cpu_attention::ISA::NEON, \
|
|
||||||
scalar_t, head_dim>; \
|
|
||||||
return __VA_ARGS__(); \
|
|
||||||
}
|
|
||||||
#else
|
|
||||||
#define NEON_DISPATCH(...) case cpu_attention::ISA::NEON:
|
|
||||||
#endif // #ifdef __aarch64__
|
|
||||||
|
|
||||||
#define CPU_ATTN_DISPATCH_CASE(HEAD_DIM, ...) \
|
|
||||||
case HEAD_DIM: { \
|
|
||||||
constexpr size_t head_dim = HEAD_DIM; \
|
|
||||||
return __VA_ARGS__(); \
|
|
||||||
}
|
|
||||||
|
|
||||||
#define CPU_ATTN_DISPATCH_CASE_HEADDIM(HEAD_DIM, ...) \
|
|
||||||
[&] { \
|
|
||||||
switch (HEAD_DIM) { \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(32, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(64, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(80, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(96, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(112, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(128, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(160, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(192, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(224, __VA_ARGS__) \
|
|
||||||
CPU_ATTN_DISPATCH_CASE(256, __VA_ARGS__) \
|
|
||||||
default: { \
|
|
||||||
TORCH_CHECK(false, "Invalid CPU attention head_dim: " + \
|
|
||||||
std::to_string(HEAD_DIM)); \
|
|
||||||
} \
|
|
||||||
} \
|
|
||||||
}()
|
|
||||||
|
|
||||||
#define CPU_ATTN_DISPATCH_IMPL(ISA_TYPE, ...) \
|
|
||||||
[&] { \
|
|
||||||
switch (ISA_TYPE) { \
|
|
||||||
AMX_DISPATCH(__VA_ARGS__) \
|
|
||||||
NEON_DISPATCH(__VA_ARGS__) \
|
|
||||||
case cpu_attention::ISA::VEC: { \
|
|
||||||
using attn_impl = \
|
|
||||||
cpu_attention::AttentionImpl<cpu_attention::ISA::VEC, scalar_t, \
|
|
||||||
head_dim>; \
|
|
||||||
return __VA_ARGS__(); \
|
|
||||||
} \
|
|
||||||
case cpu_attention::ISA::VEC16: { \
|
|
||||||
using attn_impl = \
|
|
||||||
cpu_attention::AttentionImpl<cpu_attention::ISA::VEC16, scalar_t, \
|
|
||||||
head_dim>; \
|
|
||||||
return __VA_ARGS__(); \
|
|
||||||
} \
|
|
||||||
default: { \
|
|
||||||
TORCH_CHECK(false, "Invalid CPU attention ISA type."); \
|
|
||||||
} \
|
|
||||||
} \
|
|
||||||
}()
|
|
||||||
|
|
||||||
torch::Tensor get_scheduler_metadata(
|
torch::Tensor get_scheduler_metadata(
|
||||||
const int64_t num_req, const int64_t num_heads_q,
|
const int64_t num_req, const int64_t num_heads_q,
|
||||||
@@ -122,16 +47,14 @@ torch::Tensor get_scheduler_metadata(
|
|||||||
input.enable_kv_split = enable_kv_split;
|
input.enable_kv_split = enable_kv_split;
|
||||||
|
|
||||||
VLLM_DISPATCH_FLOATING_TYPES(dtype, "get_scheduler_metadata", [&]() {
|
VLLM_DISPATCH_FLOATING_TYPES(dtype, "get_scheduler_metadata", [&]() {
|
||||||
CPU_ATTN_DISPATCH_CASE_HEADDIM(head_dim, [&] {
|
CPU_ATTN_DISPATCH(head_dim, isa, [&]() {
|
||||||
CPU_ATTN_DISPATCH_IMPL(isa, [&]() {
|
input.elem_size = sizeof(scalar_t);
|
||||||
input.elem_size = sizeof(scalar_t);
|
input.q_buffer_elem_size = sizeof(attn_impl::q_buffer_t);
|
||||||
input.q_buffer_elem_size = sizeof(attn_impl::q_buffer_t);
|
input.logits_buffer_elem_size = sizeof(attn_impl::logits_buffer_t);
|
||||||
input.logits_buffer_elem_size = sizeof(attn_impl::logits_buffer_t);
|
input.output_buffer_elem_size =
|
||||||
input.output_buffer_elem_size =
|
sizeof(attn_impl::partial_output_buffer_t);
|
||||||
sizeof(attn_impl::partial_output_buffer_t);
|
input.max_num_q_per_iter = attn_impl::MaxQHeadNumPerIteration;
|
||||||
input.max_num_q_per_iter = attn_impl::MaxQHeadNumPerIteration;
|
input.kv_block_alignment = attn_impl::BlockSizeAlignment;
|
||||||
input.kv_block_alignment = attn_impl::BlockSizeAlignment;
|
|
||||||
});
|
|
||||||
});
|
});
|
||||||
});
|
});
|
||||||
|
|
||||||
@@ -184,18 +107,14 @@ void cpu_attn_reshape_and_cache(
|
|||||||
|
|
||||||
VLLM_DISPATCH_FLOATING_TYPES(
|
VLLM_DISPATCH_FLOATING_TYPES(
|
||||||
key.scalar_type(), "cpu_attn_reshape_and_cache", [&]() {
|
key.scalar_type(), "cpu_attn_reshape_and_cache", [&]() {
|
||||||
CPU_ATTN_DISPATCH_CASE_HEADDIM(head_dim, [&] {
|
CPU_ATTN_DISPATCH(head_dim, isa_tag, [&]() {
|
||||||
CPU_ATTN_DISPATCH_IMPL(isa_tag, [&]() {
|
attn_impl::reshape_and_cache(
|
||||||
attn_impl::reshape_and_cache(
|
key.data_ptr<scalar_t>(), value.data_ptr<scalar_t>(),
|
||||||
key.data_ptr<scalar_t>(), value.data_ptr<scalar_t>(),
|
key_cache.data_ptr<scalar_t>(), value_cache.data_ptr<scalar_t>(),
|
||||||
key_cache.data_ptr<scalar_t>(),
|
slot_mapping.data_ptr<int64_t>(), token_num, key_token_num_stride,
|
||||||
value_cache.data_ptr<scalar_t>(),
|
value_token_num_stride, head_num, key_head_num_stride,
|
||||||
slot_mapping.data_ptr<int64_t>(), token_num,
|
value_head_num_stride, num_blocks, num_blocks_stride,
|
||||||
key_token_num_stride, value_token_num_stride, head_num,
|
cache_head_num_stride, block_size, block_size_stride);
|
||||||
key_head_num_stride, value_head_num_stride, num_blocks,
|
|
||||||
num_blocks_stride, cache_head_num_stride, block_size,
|
|
||||||
block_size_stride);
|
|
||||||
});
|
|
||||||
});
|
});
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
@@ -257,12 +176,10 @@ void cpu_attention_with_kv_cache(
|
|||||||
|
|
||||||
VLLM_DISPATCH_FLOATING_TYPES(
|
VLLM_DISPATCH_FLOATING_TYPES(
|
||||||
query.scalar_type(), "cpu_attention_with_kv_cache", [&]() {
|
query.scalar_type(), "cpu_attention_with_kv_cache", [&]() {
|
||||||
CPU_ATTN_DISPATCH_CASE_HEADDIM(query.size(2), [&] {
|
CPU_ATTN_DISPATCH(query.size(2), input.metadata->isa, [&]() {
|
||||||
CPU_ATTN_DISPATCH_IMPL(input.metadata->isa, [&]() {
|
TORCH_CHECK_EQ(input.block_size % attn_impl::BlockSizeAlignment, 0);
|
||||||
TORCH_CHECK_EQ(input.block_size % attn_impl::BlockSizeAlignment, 0);
|
cpu_attention::AttentionMainLoop<attn_impl> mainloop;
|
||||||
cpu_attention::AttentionMainLoop<attn_impl> mainloop;
|
mainloop(&input);
|
||||||
mainloop(&input);
|
|
||||||
});
|
|
||||||
});
|
});
|
||||||
});
|
});
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -377,7 +377,7 @@ class AttentionImpl<ISA::AMX, scalar_t, head_dim> {
|
|||||||
const int32_t q_heads_per_kv, const int64_t q_num_stride,
|
const int32_t q_heads_per_kv, const int64_t q_num_stride,
|
||||||
const int64_t q_head_stride, const float scale) {
|
const int64_t q_head_stride, const float scale) {
|
||||||
constexpr int64_t bytes_per_head = head_dim * sizeof(scalar_t);
|
constexpr int64_t bytes_per_head = head_dim * sizeof(scalar_t);
|
||||||
// static_assert(bytes_per_head % AMX_TILE_ROW_BYTES == 0);
|
static_assert(bytes_per_head % AMX_TILE_ROW_BYTES == 0);
|
||||||
constexpr int64_t head_size_block_num = bytes_per_head / AMX_TILE_ROW_BYTES;
|
constexpr int64_t head_size_block_num = bytes_per_head / AMX_TILE_ROW_BYTES;
|
||||||
constexpr int64_t head_elem_num_pre_block =
|
constexpr int64_t head_elem_num_pre_block =
|
||||||
AMX_TILE_ROW_BYTES / sizeof(scalar_t);
|
AMX_TILE_ROW_BYTES / sizeof(scalar_t);
|
||||||
|
|||||||
@@ -816,14 +816,10 @@ struct VecTypeTrait<float> {
|
|||||||
using vec_t = vec_op::FP32Vec16;
|
using vec_t = vec_op::FP32Vec16;
|
||||||
};
|
};
|
||||||
|
|
||||||
// ARM only supports BF16 with ARMv8.6-A extension
|
|
||||||
#if (defined(__aarch64__) && !defined(ARM_BF16_SUPPORT))
|
|
||||||
#else
|
|
||||||
template <>
|
template <>
|
||||||
struct VecTypeTrait<c10::BFloat16> {
|
struct VecTypeTrait<c10::BFloat16> {
|
||||||
using vec_t = vec_op::BF16Vec16;
|
using vec_t = vec_op::BF16Vec16;
|
||||||
};
|
};
|
||||||
#endif
|
|
||||||
|
|
||||||
#if !defined(__powerpc__) && !defined(__s390x__)
|
#if !defined(__powerpc__) && !defined(__s390x__)
|
||||||
template <>
|
template <>
|
||||||
@@ -1585,17 +1581,10 @@ class AttentionMainLoop {
|
|||||||
|
|
||||||
if (use_sink) {
|
if (use_sink) {
|
||||||
alignas(64) float s_aux_fp32[16];
|
alignas(64) float s_aux_fp32[16];
|
||||||
#if defined(__aarch64__) && !defined(ARM_BF16_SUPPORT)
|
|
||||||
// ARM without native BF16 support: manual conversion
|
|
||||||
for (int i = 0; i < 16; ++i) {
|
|
||||||
s_aux_fp32[i] = static_cast<float>(curr_s_aux[i]);
|
|
||||||
}
|
|
||||||
#else
|
|
||||||
// All other platforms have BF16Vec16 available
|
// All other platforms have BF16Vec16 available
|
||||||
vec_op::BF16Vec16 vec_bf16(curr_s_aux);
|
vec_op::BF16Vec16 vec_bf16(curr_s_aux);
|
||||||
vec_op::FP32Vec16 vec_fp32(vec_bf16);
|
vec_op::FP32Vec16 vec_fp32(vec_bf16);
|
||||||
vec_fp32.save(s_aux_fp32);
|
vec_fp32.save(s_aux_fp32);
|
||||||
#endif
|
|
||||||
|
|
||||||
float* __restrict__ curr_sum_buffer = sum_buffer;
|
float* __restrict__ curr_sum_buffer = sum_buffer;
|
||||||
float* __restrict__ curr_max_buffer = max_buffer;
|
float* __restrict__ curr_max_buffer = max_buffer;
|
||||||
|
|||||||
@@ -264,7 +264,7 @@ class AttentionImpl<ISA::NEON, scalar_t, head_dim> {
|
|||||||
constexpr static ISA ISAType = ISA::NEON;
|
constexpr static ISA ISAType = ISA::NEON;
|
||||||
constexpr static bool scale_on_logits = false; // apply scale on q_buffer
|
constexpr static bool scale_on_logits = false; // apply scale on q_buffer
|
||||||
|
|
||||||
// static_assert(HeadDim % HeadDimAlignment == 0);
|
static_assert(HeadDim % HeadDimAlignment == 0);
|
||||||
// the gemm micro kernel is Mx8
|
// the gemm micro kernel is Mx8
|
||||||
static_assert(HeadDimAlignment % 8 == 0);
|
static_assert(HeadDimAlignment % 8 == 0);
|
||||||
static_assert(BlockSizeAlignment % 8 == 0);
|
static_assert(BlockSizeAlignment % 8 == 0);
|
||||||
|
|||||||
File diff suppressed because it is too large
@@ -116,7 +116,7 @@ class Dequantizer4b {
|
|||||||
scalar_vec_t output_vec_0(wb_0);
|
scalar_vec_t output_vec_0(wb_0);
|
||||||
scalar_vec_t output_vec_1(wb_1);
|
scalar_vec_t output_vec_1(wb_1);
|
||||||
|
|
||||||
// AMX needs to interlave K elements to pack as 32 bits
|
// AMX needs to interleave K elements to pack as 32 bits
|
||||||
if constexpr (isa == ISA::AMX) {
|
if constexpr (isa == ISA::AMX) {
|
||||||
vec_op::interleave_save(output_vec_0, output_vec_1, curr_weight);
|
vec_op::interleave_save(output_vec_0, output_vec_1, curr_weight);
|
||||||
} else {
|
} else {
|
||||||
|
|||||||
@@ -14,13 +14,11 @@ struct KernelVecType<float> {
|
|||||||
using cvt_vec_type = vec_op::FP32Vec16;
|
using cvt_vec_type = vec_op::FP32Vec16;
|
||||||
};
|
};
|
||||||
|
|
||||||
#if !defined(__aarch64__) || defined(ARM_BF16_SUPPORT)
|
|
||||||
template <>
|
template <>
|
||||||
struct KernelVecType<c10::BFloat16> {
|
struct KernelVecType<c10::BFloat16> {
|
||||||
using load_vec_type = vec_op::BF16Vec16;
|
using load_vec_type = vec_op::BF16Vec16;
|
||||||
using cvt_vec_type = vec_op::FP32Vec16;
|
using cvt_vec_type = vec_op::FP32Vec16;
|
||||||
};
|
};
|
||||||
#endif
|
|
||||||
|
|
||||||
template <>
|
template <>
|
||||||
struct KernelVecType<c10::Half> {
|
struct KernelVecType<c10::Half> {
|
||||||
|
|||||||
203
csrc/cpu/generate_cpu_attn_dispatch.py
Normal file
@@ -0,0 +1,203 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
# SPDX-License-Identifier: Apache-2.0
|
||||||
|
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
|
||||||
|
"""
|
||||||
|
Generate CPU attention dispatch switch cases and kernel instantiations.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import os
|
||||||
|
|
||||||
|
# Head dimensions divisible by 32 (support all ISAs)
|
||||||
|
HEAD_DIMS_32 = [32, 64, 96, 128, 160, 192, 224, 256]
|
||||||
|
|
||||||
|
# Head dimensions divisible by 16 but not 32 (VEC16 only)
|
||||||
|
HEAD_DIMS_16 = [80, 112]
|
||||||
|
|
||||||
|
# ISA types
|
||||||
|
ISA_TYPES = {
|
||||||
|
"AMX": 0,
|
||||||
|
"VEC": 1,
|
||||||
|
"VEC16": 2,
|
||||||
|
"NEON": 3,
|
||||||
|
}
|
||||||
|
|
||||||
|
# ISAs supported for head_dims divisible by 32
|
||||||
|
ISA_FOR_32 = ["AMX", "NEON", "VEC", "VEC16"]
|
||||||
|
|
||||||
|
# ISAs supported for head_dims divisible by 16 only
|
||||||
|
ISA_FOR_16 = ["VEC16"]
|
||||||
|
|
||||||
|
|
||||||
|
def encode_params(head_dim: int, isa_type: str) -> int:
|
||||||
|
"""Encode head_dim and ISA type into a single int64_t."""
|
||||||
|
isa_val = ISA_TYPES[isa_type]
|
||||||
|
# Encoding: (head_dim << 8) | isa_type
|
||||||
|
# This allows head_dim up to 2^56 - 1 and 256 ISA types
|
||||||
|
return (head_dim << 8) | isa_val
|
||||||
|
|
||||||
|
|
||||||
|
def generate_cases_for_isa_group(isa_list: list[str]) -> str:
|
||||||
|
"""Generate switch cases for a specific ISA group."""
|
||||||
|
cases = []
|
||||||
|
|
||||||
|
# Generate cases for head_dims divisible by 32
|
||||||
|
for head_dim in HEAD_DIMS_32:
|
||||||
|
for isa in isa_list:
|
||||||
|
if isa not in ISA_FOR_32:
|
||||||
|
continue
|
||||||
|
encoded = encode_params(head_dim, isa)
|
||||||
|
case_str = (
|
||||||
|
f""" case {encoded}LL: {{ """
|
||||||
|
f"""/* head_dim={head_dim}, isa={isa} */ \\"""
|
||||||
|
f"""
|
||||||
|
constexpr size_t head_dim = {head_dim}; \\"""
|
||||||
|
f"""
|
||||||
|
using attn_impl = cpu_attention::AttentionImpl<"""
|
||||||
|
f"""cpu_attention::ISA::{isa}, \\"""
|
||||||
|
f"""
|
||||||
|
"""
|
||||||
|
f"""scalar_t, head_dim>; \\"""
|
||||||
|
f"""
|
||||||
|
return __VA_ARGS__(); \\"""
|
||||||
|
f"""
|
||||||
|
}} \\"""
|
||||||
|
)
|
||||||
|
cases.append(case_str)
|
||||||
|
|
||||||
|
# Generate cases for head_dims divisible by 16 only
|
||||||
|
for head_dim in HEAD_DIMS_16:
|
||||||
|
for isa in isa_list:
|
||||||
|
encoded = encode_params(head_dim, isa)
|
||||||
|
case_str = (
|
||||||
|
f""" case {encoded}LL: {{ """
|
||||||
|
f"""/* head_dim={head_dim}, isa={isa} """
|
||||||
|
f"""(using VEC16) */ \\"""
|
||||||
|
f"""
|
||||||
|
constexpr size_t head_dim = {head_dim}; \\"""
|
||||||
|
f"""
|
||||||
|
using attn_impl = cpu_attention::AttentionImpl<"""
|
||||||
|
f"""cpu_attention::ISA::VEC16, \\"""
|
||||||
|
f"""
|
||||||
|
"""
|
||||||
|
f"""scalar_t, head_dim>; \\"""
|
||||||
|
f"""
|
||||||
|
return __VA_ARGS__(); \\"""
|
||||||
|
f"""
|
||||||
|
}} \\"""
|
||||||
|
)
|
||||||
|
cases.append(case_str)
|
||||||
|
|
||||||
|
return "\n".join(cases)
|
||||||
|
|
||||||
|
|
||||||
|
def generate_helper_function() -> str:
|
||||||
|
"""Generate helper function to encode parameters."""
|
||||||
|
return """
|
||||||
|
inline int64_t encode_cpu_attn_params(int64_t head_dim, cpu_attention::ISA isa) {
|
||||||
|
return (head_dim << 8) | static_cast<int64_t>(isa);
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
def generate_header_file() -> str:
|
||||||
|
"""Generate the complete header file content."""
|
||||||
|
header = """// auto generated by generate_cpu_attn_dispatch.py
|
||||||
|
// clang-format off
|
||||||
|
|
||||||
|
#ifndef CPU_ATTN_DISPATCH_GENERATED_H
|
||||||
|
#define CPU_ATTN_DISPATCH_GENERATED_H
|
||||||
|
|
||||||
|
#include "cpu_attn_vec.hpp"
|
||||||
|
#include "cpu_attn_vec16.hpp"
|
||||||
|
|
||||||
|
#ifdef CPU_CAPABILITY_AMXBF16
|
||||||
|
#include "cpu_attn_amx.hpp"
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#ifdef __aarch64__
|
||||||
|
#include "cpu_attn_neon.hpp"
|
||||||
|
#endif
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
header += generate_helper_function()
|
||||||
|
|
||||||
|
# Generate dispatch macro with conditional compilation for different ISA sets
|
||||||
|
header += """
|
||||||
|
// Dispatch macro using encoded parameters
|
||||||
|
"""
|
||||||
|
|
||||||
|
# x86_64 with AMX
|
||||||
|
header += """#if defined(CPU_CAPABILITY_AMXBF16)
|
||||||
|
#define CPU_ATTN_DISPATCH(HEAD_DIM, ISA_TYPE, ...) \\
|
||||||
|
[&] { \\
|
||||||
|
int64_t encoded_params = encode_cpu_attn_params(HEAD_DIM, ISA_TYPE); \\
|
||||||
|
switch (encoded_params) { \\
|
||||||
|
"""
|
||||||
|
header += generate_cases_for_isa_group(["AMX", "VEC", "VEC16"])
|
||||||
|
header += """
|
||||||
|
default: { \\
|
||||||
|
TORCH_CHECK(false, "Unsupported CPU attention configuration: head_dim=" + \\
|
||||||
|
std::to_string(HEAD_DIM) + " isa=" + \\
|
||||||
|
std::to_string(static_cast<int>(ISA_TYPE))); \\
|
||||||
|
} \\
|
||||||
|
} \\
|
||||||
|
}()
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
# ARM64 with NEON
|
||||||
|
header += """#elif defined(__aarch64__)
|
||||||
|
#define CPU_ATTN_DISPATCH(HEAD_DIM, ISA_TYPE, ...) \\
|
||||||
|
[&] { \\
|
||||||
|
int64_t encoded_params = encode_cpu_attn_params(HEAD_DIM, ISA_TYPE); \\
|
||||||
|
switch (encoded_params) { \\
|
||||||
|
"""
|
||||||
|
header += generate_cases_for_isa_group(["NEON", "VEC", "VEC16"])
|
||||||
|
header += """
|
||||||
|
default: { \\
|
||||||
|
TORCH_CHECK(false, "Unsupported CPU attention configuration: head_dim=" + \\
|
||||||
|
std::to_string(HEAD_DIM) + " isa=" + \\
|
||||||
|
std::to_string(static_cast<int>(ISA_TYPE))); \\
|
||||||
|
} \\
|
||||||
|
} \\
|
||||||
|
}()
|
||||||
|
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Fallback: VEC and VEC16 only
|
||||||
|
header += """#else
|
||||||
|
#define CPU_ATTN_DISPATCH(HEAD_DIM, ISA_TYPE, ...) \\
|
||||||
|
[&] { \\
|
||||||
|
int64_t encoded_params = encode_cpu_attn_params(HEAD_DIM, ISA_TYPE); \\
|
||||||
|
switch (encoded_params) { \\
|
||||||
|
"""
|
||||||
|
header += generate_cases_for_isa_group(["VEC", "VEC16"])
|
||||||
|
header += """
|
||||||
|
default: { \\
|
||||||
|
TORCH_CHECK(false, "Unsupported CPU attention configuration: head_dim=" + \\
|
||||||
|
std::to_string(HEAD_DIM) + " isa=" + \\
|
||||||
|
std::to_string(static_cast<int>(ISA_TYPE))); \\
|
||||||
|
} \\
|
||||||
|
} \\
|
||||||
|
}()
|
||||||
|
|
||||||
|
#endif /* CPU_CAPABILITY_AMXBF16 / __aarch64__ */
|
||||||
|
|
||||||
|
#endif // CPU_ATTN_DISPATCH_GENERATED_H
|
||||||
|
"""
|
||||||
|
|
||||||
|
return header
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
output_path = os.path.join(
|
||||||
|
os.path.dirname(__file__), "cpu_attn_dispatch_generated.h"
|
||||||
|
)
|
||||||
|
|
||||||
|
with open(output_path, "w") as f:
|
||||||
|
f.write(generate_header_file())
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
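Note on the generator above: each `case` label in the generated header is a single integer that packs `head_dim` into the high bits and the ISA tag into the low 8 bits, which is what lets one `switch` replace the two nested macros removed from the `.cpp` file. The sketch below reproduces that packing with the `ISA_TYPES` values from the script and checks that the labels are unique. `decode_params` and `case_labels` are hypothetical helpers added for illustration and are not part of the commit.

```python
# ISA tags copied from ISA_TYPES in generate_cpu_attn_dispatch.py.
ISA_TYPES = {"AMX": 0, "VEC": 1, "VEC16": 2, "NEON": 3}
ISA_NAMES = {value: name for name, value in ISA_TYPES.items()}


def encode_params(head_dim: int, isa_type: str) -> int:
    """Pack head_dim and the ISA tag into one integer: (head_dim << 8) | isa."""
    return (head_dim << 8) | ISA_TYPES[isa_type]


def decode_params(encoded: int) -> tuple[int, str]:
    """Hypothetical inverse of encode_params, useful when reading case labels."""
    return encoded >> 8, ISA_NAMES[encoded & 0xFF]


def case_labels(head_dims: list[int], isa_list: list[str]) -> dict[int, tuple[int, str]]:
    """Map every encoded case label to the (head_dim, isa) pair it stands for."""
    return {
        encode_params(head_dim, isa): (head_dim, isa)
        for head_dim in head_dims
        for isa in isa_list
    }


labels = case_labels([32, 64, 80], ["VEC", "VEC16"])
amx_128 = encode_params(128, "AMX")
vec16_80 = encode_params(80, "VEC16")
```

Because the ISA tag occupies only the low byte, two distinct `(head_dim, isa)` pairs can never collide as long as every ISA value stays below 256, so the C++ `switch` never sees duplicate case labels.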
@@ -38,9 +38,7 @@ struct KernelVecType<c10::BFloat16> {
|
|||||||
using qk_vec_type = vec_op::BF16Vec32;
|
using qk_vec_type = vec_op::BF16Vec32;
|
||||||
using v_load_vec_type = vec_op::BF16Vec16;
|
using v_load_vec_type = vec_op::BF16Vec16;
|
||||||
};
|
};
|
||||||
#elif defined(__aarch64__) && !defined(ARM_BF16_SUPPORT)
|
#elif defined(__aarch64__)
|
||||||
// pass
|
|
||||||
#else
|
|
||||||
template <>
|
template <>
|
||||||
struct KernelVecType<c10::BFloat16> {
|
struct KernelVecType<c10::BFloat16> {
|
||||||
using qk_load_vec_type = vec_op::BF16Vec16;
|
using qk_load_vec_type = vec_op::BF16Vec16;
|
||||||
|
|||||||
@@ -265,7 +265,7 @@ void tinygemm_kernel(
|
|||||||
// mb_size = 4
|
// mb_size = 4
|
||||||
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN(4, 32); break;
|
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN(4, 32); break;
|
||||||
case 0x44: LAUNCH_TINYGEMM_KERNEL_NN(4, 64); break;
|
case 0x44: LAUNCH_TINYGEMM_KERNEL_NN(4, 64); break;
|
||||||
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", "nb_size");
|
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", nb_size);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -324,7 +324,7 @@ void tinygemm_kernel(
|
|||||||
case 0x22: LAUNCH_TINYGEMM_KERNEL_NN(2, 32); break;
|
case 0x22: LAUNCH_TINYGEMM_KERNEL_NN(2, 32); break;
|
||||||
case 0x32: LAUNCH_TINYGEMM_KERNEL_NN(3, 32); break;
|
case 0x32: LAUNCH_TINYGEMM_KERNEL_NN(3, 32); break;
|
||||||
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN(4, 32); break;
|
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN(4, 32); break;
|
||||||
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", "nb_size");
|
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", nb_size);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -180,7 +180,7 @@ void tinygemm_kernel(
|
|||||||
// mb_size = 4
|
// mb_size = 4
|
||||||
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN(4, 32); break;
|
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN(4, 32); break;
|
||||||
case 0x44: LAUNCH_TINYGEMM_KERNEL_NN(4, 64); break;
|
case 0x44: LAUNCH_TINYGEMM_KERNEL_NN(4, 64); break;
|
||||||
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", "nb_size");
|
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", nb_size);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -398,7 +398,7 @@ void tinygemm_kernel(
|
|||||||
case 0x32: LAUNCH_TINYGEMM_KERNEL_NN(3, 32); break;
|
case 0x32: LAUNCH_TINYGEMM_KERNEL_NN(3, 32); break;
|
||||||
// mb_size = 4
|
// mb_size = 4
|
||||||
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN(4, 32); break;
|
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN(4, 32); break;
|
||||||
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", "nb_size");
|
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", nb_size);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -511,7 +511,7 @@ void tinygemm_kernel(
|
|||||||
case 0x32: LAUNCH_TINYGEMM_KERNEL_NN2(3, 32); break;
|
case 0x32: LAUNCH_TINYGEMM_KERNEL_NN2(3, 32); break;
|
||||||
// mb_size = 4
|
// mb_size = 4
|
||||||
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN2(4, 32); break;
|
case 0x42: LAUNCH_TINYGEMM_KERNEL_NN2(4, 32); break;
|
||||||
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", "nb_size");
|
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", nb_size);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -271,7 +271,7 @@ void tinygemm_kernel(
|
|||||||
case 0x22: LAUNCH_TINYGEMM_KERNEL_VNNI(2, 32); break;
|
case 0x22: LAUNCH_TINYGEMM_KERNEL_VNNI(2, 32); break;
|
||||||
case 0x32: LAUNCH_TINYGEMM_KERNEL_VNNI(3, 32); break;
|
case 0x32: LAUNCH_TINYGEMM_KERNEL_VNNI(3, 32); break;
|
||||||
case 0x42: LAUNCH_TINYGEMM_KERNEL_VNNI(4, 32); break;
|
case 0x42: LAUNCH_TINYGEMM_KERNEL_VNNI(4, 32); break;
|
||||||
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", "nb_size");
|
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", nb_size);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -401,7 +401,7 @@ void tinygemm_kernel(
|
|||||||
case 0x22: LAUNCH_TINYGEMM_KERNEL_VNNI2(2, 32); break;
|
case 0x22: LAUNCH_TINYGEMM_KERNEL_VNNI2(2, 32); break;
|
||||||
case 0x32: LAUNCH_TINYGEMM_KERNEL_VNNI2(3, 32); break;
|
case 0x32: LAUNCH_TINYGEMM_KERNEL_VNNI2(3, 32); break;
|
||||||
case 0x42: LAUNCH_TINYGEMM_KERNEL_VNNI2(4, 32); break;
|
case 0x42: LAUNCH_TINYGEMM_KERNEL_VNNI2(4, 32); break;
|
||||||
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", "nb_size");
|
default: TORCH_CHECK(false, "Unexpected block size, ", mb_size, "x", nb_size);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -237,10 +237,10 @@ struct ThreadSHMContext {
|
|||||||
class SHMManager {
|
class SHMManager {
|
||||||
public:
|
public:
|
||||||
explicit SHMManager(const std::string& name, const int rank,
|
explicit SHMManager(const std::string& name, const int rank,
|
||||||
const int group_size)
|
const int group_size, const int thread_num)
|
||||||
: _rank(rank),
|
: _rank(rank),
|
||||||
_group_size(group_size),
|
_group_size(group_size),
|
||||||
_thread_num(omp_get_max_threads()),
|
_thread_num(thread_num),
|
||||||
_shm_names({""}),
|
_shm_names({""}),
|
||||||
_shared_mem_ptrs({nullptr}),
|
_shared_mem_ptrs({nullptr}),
|
||||||
_shm_ctx(nullptr) {
|
_shm_ctx(nullptr) {
|
||||||
@@ -282,11 +282,11 @@ class SHMManager {
|
|||||||
}
|
}
|
||||||
|
|
||||||
static int64_t create_singleton_instance(const std::string& name,
|
static int64_t create_singleton_instance(const std::string& name,
|
||||||
const int group_size,
|
const int group_size, const int rank,
|
||||||
const int rank) {
|
const int thread_num) {
|
||||||
std::lock_guard<std::mutex> guard(SingletonInstancesLock);
|
std::lock_guard<std::mutex> guard(SingletonInstancesLock);
|
||||||
SingletonInstances.emplace_back(
|
SingletonInstances.emplace_back(
|
||||||
std::make_unique<SHMManager>(name, rank, group_size));
|
std::make_unique<SHMManager>(name, rank, group_size, thread_num));
|
||||||
return static_cast<int64_t>(SingletonInstances.size() - 1);
|
return static_cast<int64_t>(SingletonInstances.size() - 1);
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -854,8 +854,9 @@ std::vector<torch::Tensor> shm_recv_tensor_list(int64_t handle, int64_t src) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
int64_t init_shm_manager(const std::string& name, const int64_t group_size,
|
int64_t init_shm_manager(const std::string& name, const int64_t group_size,
|
||||||
const int64_t rank) {
|
const int64_t rank, const int64_t thread_num) {
|
||||||
return SHMManager::create_singleton_instance(name, group_size, rank);
|
return SHMManager::create_singleton_instance(name, group_size, rank,
|
||||||
|
thread_num);
|
||||||
}
|
}
|
||||||
|
|
||||||
std::string join_shm_manager(int64_t handle, const std::string& name) {
|
std::string join_shm_manager(int64_t handle, const std::string& name) {
|
||||||
|
|||||||
@@ -35,7 +35,7 @@ void mla_decode_kvcache(torch::Tensor& out, torch::Tensor& query,
|
|||||||
torch::Tensor& block_tables, torch::Tensor& seq_lens);
|
torch::Tensor& block_tables, torch::Tensor& seq_lens);
|
||||||
|
|
||||||
int64_t init_shm_manager(const std::string& name, const int64_t group_size,
|
int64_t init_shm_manager(const std::string& name, const int64_t group_size,
|
||||||
const int64_t rank);
|
const int64_t rank, const int64_t thread_num);
|
||||||
|
|
||||||
std::string join_shm_manager(int64_t handle, const std::string& name);
|
std::string join_shm_manager(int64_t handle, const std::string& name);
|
||||||
|
|
||||||
@@ -232,8 +232,10 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
|
|||||||
|
|
||||||
// SHM CCL
|
// SHM CCL
|
||||||
#if defined(__AVX512F__) || (defined(__aarch64__) && !defined(__APPLE__))
|
#if defined(__AVX512F__) || (defined(__aarch64__) && !defined(__APPLE__))
|
||||||
ops.def("init_shm_manager(str name, int group_size, int rank) -> int",
|
ops.def(
|
||||||
&init_shm_manager);
|
"init_shm_manager(str name, int group_size, int rank, int thread_num) -> "
|
||||||
|
"int",
|
||||||
|
&init_shm_manager);
|
||||||
ops.def("join_shm_manager(int handle, str name) -> str", &join_shm_manager);
|
ops.def("join_shm_manager(int handle, str name) -> str", &join_shm_manager);
|
||||||
ops.def("shm_allreduce(int handle, Tensor! data) -> ()");
|
ops.def("shm_allreduce(int handle, Tensor! data) -> ()");
|
||||||
ops.impl("shm_allreduce", torch::kCPU, &shm_allreduce);
|
ops.impl("shm_allreduce", torch::kCPU, &shm_allreduce);
|
||||||
@@ -292,7 +294,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
|
|||||||
"value_cache, Tensor(a3!) output, Tensor query_start_loc, Tensor "
|
"value_cache, Tensor(a3!) output, Tensor query_start_loc, Tensor "
|
||||||
"seq_lens, float scale, bool causal, Tensor? alibi_slopes, SymInt "
|
"seq_lens, float scale, bool causal, Tensor? alibi_slopes, SymInt "
|
||||||
"sliding_window_left, SymInt sliding_window_right, Tensor block_table, "
|
"sliding_window_left, SymInt sliding_window_right, Tensor block_table, "
|
||||||
"float softcap, Tensor sheduler_metadata, Tensor? s_aux) -> ()",
|
"float softcap, Tensor scheduler_metadata, Tensor? s_aux) -> ()",
|
||||||
&cpu_attention_with_kv_cache);
|
&cpu_attention_with_kv_cache);
|
||||||
|
|
||||||
// placeholders
|
// placeholders
|
||||||
|
|||||||
@@ -30,12 +30,10 @@ struct VecTypeTrait<float> {
|
|||||||
using vec_t = vec_op::FP32Vec16;
|
using vec_t = vec_op::FP32Vec16;
|
||||||
};
|
};
|
||||||
|
|
||||||
#if !defined(__aarch64__) || defined(ARM_BF16_SUPPORT)
|
|
||||||
template <>
|
template <>
|
||||||
struct VecTypeTrait<c10::BFloat16> {
|
struct VecTypeTrait<c10::BFloat16> {
|
||||||
using vec_t = vec_op::BF16Vec16;
|
using vec_t = vec_op::BF16Vec16;
|
||||||
};
|
};
|
||||||
#endif
|
|
||||||
|
|
||||||
#if !defined(__powerpc__)
|
#if !defined(__powerpc__)
|
||||||
template <>
|
template <>
|
||||||
|
|||||||
@@ -115,11 +115,28 @@ void create_and_map(unsigned long long device, ssize_t size, CUdeviceptr d_mem,
|
|||||||
if (flag) { // support GPUDirect RDMA if possible
|
if (flag) { // support GPUDirect RDMA if possible
|
||||||
prop.allocFlags.gpuDirectRDMACapable = 1;
|
prop.allocFlags.gpuDirectRDMACapable = 1;
|
||||||
}
|
}
|
||||||
|
int fab_flag = 0;
|
||||||
|
CUDA_CHECK(cuDeviceGetAttribute(
|
||||||
|
&fab_flag, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, device));
|
||||||
|
if (fab_flag) { // support fabric handle if possible
|
||||||
|
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
|
||||||
|
}
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
#ifndef USE_ROCM
|
#ifndef USE_ROCM
|
||||||
// Allocate memory using cuMemCreate
|
// Allocate memory using cuMemCreate
|
||||||
CUDA_CHECK(cuMemCreate(p_memHandle, size, &prop, 0));
|
CUresult ret = (CUresult)cuMemCreate(p_memHandle, size, &prop, 0);
|
||||||
|
if (ret) {
|
||||||
|
if (fab_flag &&
|
||||||
|
(ret == CUDA_ERROR_NOT_PERMITTED || ret == CUDA_ERROR_NOT_SUPPORTED)) {
|
||||||
|
// Fabric allocation may fail without multi-node nvlink,
|
||||||
|
// fallback to POSIX file descriptor
|
||||||
|
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
|
||||||
|
CUDA_CHECK(cuMemCreate(p_memHandle, size, &prop, 0));
|
||||||
|
} else {
|
||||||
|
CUDA_CHECK(ret);
|
||||||
|
}
|
||||||
|
}
|
||||||
if (error_code != 0) {
|
if (error_code != 0) {
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
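Note on the hunk above: the new code requests a fabric handle when the device reports support, and retries with a POSIX file descriptor handle only if `cuMemCreate` fails with `CUDA_ERROR_NOT_PERMITTED` or `CUDA_ERROR_NOT_SUPPORTED`. The sketch below reproduces that control flow without the CUDA driver. `HandleType`, `Status`, and `create_with_fallback` are hypothetical stand-ins, not driver or vLLM names, and it assumes the non-fabric default is the POSIX file descriptor type, which the hunk does not show.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the driver's handle types and result codes.
enum class HandleType { Fabric, PosixFd };
enum class Status { Ok, NotPermitted, NotSupported, OutOfMemory };

// Try the preferred handle type first. Retry with the POSIX file descriptor
// type only when fabric was requested and the failure is one of the two
// "fabric unavailable" codes. Any other failure is returned unchanged so the
// caller can report it, as the CUDA_CHECK(ret) branch does in the diff.
inline Status create_with_fallback(
    bool fabric_supported,
    const std::function<Status(HandleType)>& create,
    HandleType* used) {
  assert(used != nullptr);
  *used = fabric_supported ? HandleType::Fabric : HandleType::PosixFd;
  Status st = create(*used);
  if (st == Status::Ok) {
    return st;
  }
  if (fabric_supported &&
      (st == Status::NotPermitted || st == Status::NotSupported)) {
    *used = HandleType::PosixFd;
    return create(*used);
  }
  return st;
}
```

The retry is deliberately narrow: an out-of-memory failure, for example, would not be helped by switching handle types, so it is surfaced as-is.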
|
|||||||
@@ -3,7 +3,8 @@
|
|||||||
#include "cutlass/cutlass.h"
|
#include "cutlass/cutlass.h"
|
||||||
#include <climits>
|
#include <climits>
|
||||||
#include "cuda_runtime.h"
|
#include "cuda_runtime.h"
|
||||||
#include <iostream>
|
#include <cstdio>
|
||||||
|
#include <cstdlib>
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Helper function for checking CUTLASS errors
|
* Helper function for checking CUTLASS errors
|
||||||
@@ -31,12 +32,63 @@ int32_t get_sm_version_num();
|
|||||||
* __CUDA_ARCH__ is not defined in host code, so this lets us smuggle the ifdef
|
* __CUDA_ARCH__ is not defined in host code, so this lets us smuggle the ifdef
|
||||||
* into code that will be executed on the device where it is defined.
|
* into code that will be executed on the device where it is defined.
|
||||||
*/
|
*/
|
||||||
|
|
||||||
|
template <typename Kernel>
|
||||||
|
struct enable_sm75_to_sm80 : Kernel {
|
||||||
|
template <typename... Args>
|
||||||
|
CUTLASS_DEVICE static void invoke(Args&&... args) {
|
||||||
|
#if defined __CUDA_ARCH__
|
||||||
|
#if __CUDA_ARCH__ >= 750 && __CUDA_ARCH__ < 800
|
||||||
|
Kernel::invoke(std::forward<Args>(args)...);
|
||||||
|
#else
|
||||||
|
printf("This kernel only supports sm[75, 80).\n");
|
||||||
|
asm("trap;");
|
||||||
|
#endif
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
template <typename Kernel>
|
||||||
|
struct enable_sm80_to_sm89 : Kernel {
|
||||||
|
template <typename... Args>
|
||||||
|
CUTLASS_DEVICE static void invoke(Args&&... args) {
|
||||||
|
#if defined __CUDA_ARCH__
|
||||||
|
#if __CUDA_ARCH__ >= 800 && __CUDA_ARCH__ < 890
|
||||||
|
Kernel::invoke(std::forward<Args>(args)...);
|
||||||
|
#else
|
||||||
|
printf("This kernel only supports sm[80, 89).\n");
|
||||||
|
asm("trap;");
|
||||||
|
#endif
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
template <typename Kernel>
|
||||||
|
struct enable_sm89_to_sm90 : Kernel {
|
||||||
|
template <typename... Args>
|
||||||
|
CUTLASS_DEVICE static void invoke(Args&&... args) {
|
||||||
|
#if defined __CUDA_ARCH__
|
||||||
|
#if __CUDA_ARCH__ >= 890 && __CUDA_ARCH__ < 900
|
||||||
|
Kernel::invoke(std::forward<Args>(args)...);
|
||||||
|
#else
|
||||||
|
printf("This kernel only supports sm[89, 90).\n");
|
||||||
|
asm("trap;");
|
||||||
|
#endif
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
template <typename Kernel>
|
template <typename Kernel>
|
||||||
struct enable_sm90_or_later : Kernel {
|
struct enable_sm90_or_later : Kernel {
|
||||||
template <typename... Args>
|
template <typename... Args>
|
||||||
CUTLASS_DEVICE void operator()(Args&&... args) {
|
CUTLASS_DEVICE void operator()(Args&&... args) {
|
||||||
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ >= 900
|
#if defined __CUDA_ARCH__
|
||||||
|
#if __CUDA_ARCH__ >= 900
|
||||||
Kernel::operator()(std::forward<Args>(args)...);
|
Kernel::operator()(std::forward<Args>(args)...);
|
||||||
|
#else
|
||||||
|
printf("This kernel only supports sm >= 90.\n");
|
||||||
|
asm("trap;");
|
||||||
|
#endif
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
@@ -45,18 +97,43 @@ template <typename Kernel>
|
|||||||
struct enable_sm90_only : Kernel {
|
struct enable_sm90_only : Kernel {
|
||||||
template <typename... Args>
|
template <typename... Args>
|
||||||
CUTLASS_DEVICE void operator()(Args&&... args) {
|
CUTLASS_DEVICE void operator()(Args&&... args) {
|
||||||
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ == 900
|
#if defined __CUDA_ARCH__
|
||||||
|
#if __CUDA_ARCH__ == 900
|
||||||
Kernel::operator()(std::forward<Args>(args)...);
|
Kernel::operator()(std::forward<Args>(args)...);
|
||||||
|
#else
|
||||||
|
printf("This kernel only supports sm90.\n");
|
||||||
|
asm("trap;");
|
||||||
|
#endif
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
template <typename Kernel>
|
template <typename Kernel>
|
||||||
struct enable_sm100_only : Kernel {
|
struct enable_sm100f_only : Kernel {
|
||||||
template <typename... Args>
|
template <typename... Args>
|
||||||
CUTLASS_DEVICE void operator()(Args&&... args) {
|
CUTLASS_DEVICE void operator()(Args&&... args) {
|
||||||
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ == 1000
|
#if defined __CUDA_ARCH__
|
||||||
|
#if __CUDA_ARCH__ == 1000 || __CUDA_ARCH__ == 1030
|
||||||
Kernel::operator()(std::forward<Args>(args)...);
|
Kernel::operator()(std::forward<Args>(args)...);
|
||||||
|
#else
|
||||||
|
printf("This kernel only supports sm100f.\n");
|
||||||
|
asm("trap;");
|
||||||
|
#endif
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
template <typename Kernel>
|
||||||
|
struct enable_sm100a_only : Kernel {
|
||||||
|
template <typename... Args>
|
||||||
|
CUTLASS_DEVICE void operator()(Args&&... args) {
|
||||||
|
#if defined __CUDA_ARCH__
|
||||||
|
#if __CUDA_ARCH__ == 1000
|
||||||
|
Kernel::operator()(std::forward<Args>(args)...);
|
||||||
|
#else
|
||||||
|
printf("This kernel only supports sm100a.\n");
|
||||||
|
asm("trap;");
|
||||||
|
#endif
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
@@ -65,8 +142,13 @@ template <typename Kernel>
|
|||||||
struct enable_sm120_only : Kernel {
|
struct enable_sm120_only : Kernel {
|
||||||
template <typename... Args>
|
template <typename... Args>
|
||||||
CUTLASS_DEVICE void operator()(Args&&... args) {
|
CUTLASS_DEVICE void operator()(Args&&... args) {
|
||||||
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ == 1200
|
#if defined __CUDA_ARCH__
|
||||||
|
#if __CUDA_ARCH__ == 1200
|
||||||
Kernel::operator()(std::forward<Args>(args)...);
|
Kernel::operator()(std::forward<Args>(args)...);
|
||||||
|
#else
|
||||||
|
printf("This kernel only supports sm120.\n");
|
||||||
|
asm("trap;");
|
||||||
|
#endif
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
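Note on the wrappers above: each one compiles the kernel body only for a specific `__CUDA_ARCH__` range and otherwise prints a message and traps. The ranges can be restated as host-side predicates, which makes the difference between `enable_sm100f_only` (1000 or 1030) and `enable_sm100a_only` (1000 only) easy to see. This is an illustrative sketch; the function names are hypothetical and do not exist in the header.

```cpp
#include <cassert>

// Host-side mirrors of the preprocessor checks in the wrappers above.
// `arch` uses the same encoding as __CUDA_ARCH__ (890 for sm89, 1000 for
// sm100, and so on). All names here are hypothetical.
constexpr bool sm75_to_sm80(int arch) { return arch >= 750 && arch < 800; }
constexpr bool sm80_to_sm89(int arch) { return arch >= 800 && arch < 890; }
constexpr bool sm89_to_sm90(int arch) { return arch >= 890 && arch < 900; }
constexpr bool sm90_or_later(int arch) { return arch >= 900; }
constexpr bool sm90_only(int arch) { return arch == 900; }
constexpr bool sm100f_only(int arch) { return arch == 1000 || arch == 1030; }
constexpr bool sm100a_only(int arch) { return arch == 1000; }
constexpr bool sm120_only(int arch) { return arch == 1200; }

// Compile-time spot checks of the half-open ranges.
static_assert(sm80_to_sm89(860), "sm86 falls in [80, 89)");
static_assert(!sm80_to_sm89(890), "sm89 is excluded from [80, 89)");
static_assert(sm100f_only(1030) && !sm100a_only(1030),
              "1030 is accepted by the sm100f check only");

// Rough host analogue of the printf + trap path: fails loudly in debug
// builds when a kernel is used on an unsupported architecture.
inline void require_arch(bool supported) { assert(supported); }
```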
|
|||||||
@@ -770,7 +770,7 @@ torch::Tensor moe_wna16_marlin_gemm(
|
|||||||
b_bias = b_bias_or_none.value();
|
b_bias = b_bias_or_none.value();
|
||||||
TORCH_CHECK(b_bias.device().is_cuda(), "b_bias is not on GPU");
|
TORCH_CHECK(b_bias.device().is_cuda(), "b_bias is not on GPU");
|
||||||
TORCH_CHECK(b_bias.is_contiguous(), "b_bias is not contiguous");
|
TORCH_CHECK(b_bias.is_contiguous(), "b_bias is not contiguous");
|
||||||
TORCH_CHECK(b_bias.size(1) == size_n, "b_bias.size(0) != size_n");
|
TORCH_CHECK(b_bias.size(1) == size_n, "b_bias.size(1) != size_n");
|
||||||
TORCH_CHECK(b_bias.stride(1) == 1, "b_bias.stride(1) != 1");
|
TORCH_CHECK(b_bias.stride(1) == 1, "b_bias.stride(1) != 1");
|
||||||
} else {
|
} else {
|
||||||
b_bias = torch::empty({0}, options);
|
b_bias = torch::empty({0}, options);
|
||||||
|
|||||||
@@ -73,25 +73,40 @@ void moe_permute(
|
|||||||
get_ptr<int64_t>(expert_first_token_offset), n_token, n_expert,
|
get_ptr<int64_t>(expert_first_token_offset), n_token, n_expert,
|
||||||
n_local_expert, topk, sorter, get_ptr<int>(sort_workspace), stream);
|
n_local_expert, topk, sorter, get_ptr<int>(sort_workspace), stream);
|
||||||
|
|
||||||
|
// DeepGEMM: use getMIndices kernel to compute
|
||||||
|
// 1) align_expert_first_token_offset (aligned prefix offsets)
|
||||||
|
// 2) m_indices (expert id for each aligned row)
|
||||||
|
// eg. expert0: 3, expert1: 5, expert2: 2 tokens respectively
|
||||||
|
// expert_first_token_offset = [0, 3, 8, 10], align_block_size = 4
|
||||||
|
// expert0: 3->4, expert1: 5->8, expert2: 2->4
|
||||||
|
// align_expert_first_token_offset = [0, 4, 12, 16]
|
||||||
|
// so m_indices = [0,0,0,0, 1,1,1,1,1,1,1,1, 2,2,2,2]
|
||||||
|
torch::Tensor align_expert_first_token_offset;
|
||||||
|
const int64_t* aligned_expert_first_token_offset_ptr = nullptr;
|
||||||
|
if (align_block_size.has_value()) {
|
||||||
|
align_expert_first_token_offset =
|
||||||
|
torch::zeros_like(expert_first_token_offset);
|
||||||
|
getMIndices(get_ptr<int64_t>(expert_first_token_offset),
|
||||||
|
get_ptr<int64_t>(align_expert_first_token_offset),
|
||||||
|
get_ptr<int>(m_indices), n_local_expert, align_block_size_value,
|
||||||
|
stream);
|
||||||
|
aligned_expert_first_token_offset_ptr =
|
||||||
|
get_ptr<int64_t>(align_expert_first_token_offset);
|
||||||
|
}
|
||||||
|
|
||||||
// dispatch expandInputRowsKernelLauncher
|
// dispatch expandInputRowsKernelLauncher
|
||||||
MOE_DISPATCH(input.scalar_type(), [&] {
|
MOE_DISPATCH(input.scalar_type(), [&] {
|
||||||
expandInputRowsKernelLauncher<scalar_t>(
|
expandInputRowsKernelLauncher<scalar_t>(
|
||||||
get_ptr<scalar_t>(input), get_ptr<scalar_t>(permuted_input),
|
get_ptr<scalar_t>(input), get_ptr<scalar_t>(permuted_input),
|
||||||
get_ptr<int>(permuted_experts_id), get_ptr<int>(sorted_row_idx),
|
get_ptr<int>(permuted_experts_id), get_ptr<int>(sorted_row_idx),
|
||||||
get_ptr<int>(inv_permuted_idx), get_ptr<int>(permuted_idx),
|
get_ptr<int>(inv_permuted_idx), get_ptr<int>(permuted_idx),
|
||||||
get_ptr<int64_t>(expert_first_token_offset), n_token, valid_num_ptr,
|
get_ptr<int64_t>(expert_first_token_offset),
|
||||||
n_hidden, topk, n_local_expert, align_block_size_value, stream);
|
aligned_expert_first_token_offset_ptr, n_token, valid_num_ptr, n_hidden,
|
||||||
|
topk, n_local_expert, align_block_size_value, stream);
|
||||||
});
|
});
|
||||||
|
|
||||||
// get m_indices and update expert_first_token_offset with align block
|
|
||||||
// this is only required for DeepGemm and not required for CUTLASS group gemm
|
// this is only required for DeepGemm and not required for CUTLASS group gemm
|
||||||
if (align_block_size.has_value()) {
|
if (align_block_size.has_value()) {
|
||||||
auto align_expert_first_token_offset =
|
|
||||||
torch::zeros_like(expert_first_token_offset);
|
|
||||||
getMIndices(get_ptr<int64_t>(expert_first_token_offset),
|
|
||||||
get_ptr<int64_t>(align_expert_first_token_offset),
|
|
||||||
get_ptr<int>(m_indices), n_local_expert, align_block_size_value,
|
|
||||||
stream);
|
|
||||||
expert_first_token_offset.copy_(align_expert_first_token_offset);
|
expert_first_token_offset.copy_(align_expert_first_token_offset);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
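Note on the hunk above: the new comment walks through how `getMIndices` pads each expert's row count up to a multiple of `align_block_size` and records the expert id for every aligned row. The host-side sketch below reproduces that arithmetic for the comment's example. `AlignedLayout` and `align_offsets` are hypothetical names used only for illustration; the real work happens in a CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the two outputs described in the comment.
struct AlignedLayout {
  std::vector<int64_t> aligned_offset;  // aligned prefix offsets, size E + 1
  std::vector<int> m_indices;           // expert id for each aligned row
};

// offset[e] is the first row of expert e and offset.back() is the total row
// count. Each expert's row count is rounded up to a multiple of `block`.
inline AlignedLayout align_offsets(const std::vector<int64_t>& offset,
                                   int64_t block) {
  assert(block > 0 && !offset.empty());
  AlignedLayout out;
  out.aligned_offset.assign(offset.size(), 0);
  for (std::size_t e = 0; e + 1 < offset.size(); ++e) {
    const int64_t n = offset[e + 1] - offset[e];
    const int64_t padded = (n + block - 1) / block * block;
    out.aligned_offset[e + 1] = out.aligned_offset[e] + padded;
    // Every aligned row of expert e, padding included, carries id e.
    out.m_indices.insert(out.m_indices.end(),
                         static_cast<std::size_t>(padded),
                         static_cast<int>(e));
  }
  return out;
}
```

With `offset = {0, 3, 8, 10}` and `block = 4`, the padded counts are 4, 8, and 4, which gives the aligned offsets and `m_indices` listed in the comment.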
|
|||||||
@@ -60,7 +60,8 @@ void expandInputRowsKernelLauncher(
|
|||||||
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
|
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
|
||||||
int const* expanded_dest_row_to_expanded_source_row,
|
int const* expanded_dest_row_to_expanded_source_row,
|
||||||
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
|
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
|
||||||
int64_t* expert_first_token_offset, int64_t const num_rows,
|
int64_t const* expert_first_token_offset,
|
||||||
|
int64_t const* aligned_expert_first_token_offset, int64_t const num_rows,
|
||||||
int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
|
int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
|
||||||
int num_local_experts, const int& align_block_size, cudaStream_t stream);
|
int num_local_experts, const int& align_block_size, cudaStream_t stream);
|
||||||
|
|
||||||
|
|||||||
@@ -5,7 +5,8 @@ __global__ void expandInputRowsKernel(
|
|||||||
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
|
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
|
||||||
int const* expanded_dest_row_to_expanded_source_row,
|
int const* expanded_dest_row_to_expanded_source_row,
|
||||||
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
|
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
|
||||||
int64_t* expert_first_token_offset, int64_t const num_rows,
|
int64_t const* expert_first_token_offset,
|
||||||
|
int64_t const* aligned_expert_first_token_offset, int64_t const num_rows,
|
||||||
int64_t const* num_dest_rows, int64_t const cols, int64_t k,
|
int64_t const* num_dest_rows, int64_t const cols, int64_t k,
|
||||||
int num_local_experts, int align_block_size) {
|
int num_local_experts, int align_block_size) {
|
||||||
// Reverse permutation map.
|
// Reverse permutation map.
|
||||||
@@ -18,35 +19,22 @@ __global__ void expandInputRowsKernel(
|
|||||||
expanded_dest_row_to_expanded_source_row[expanded_dest_row];
|
expanded_dest_row_to_expanded_source_row[expanded_dest_row];
|
||||||
int expert_id = sorted_experts[expanded_dest_row];
|
int expert_id = sorted_experts[expanded_dest_row];
|
||||||
|
|
||||||
extern __shared__ int64_t smem_expert_first_token_offset[];
|
|
||||||
if constexpr (ALIGN_BLOCK_SIZE) {
|
if constexpr (ALIGN_BLOCK_SIZE) {
|
||||||
// load g2s
|
// convert (unaligned) expanded_dest_row -> aligned expanded_dest_row.
|
||||||
for (int idx = threadIdx.x; idx < num_local_experts + 1;
|
// aligned_expert_first_token_offset[e] provides the aligned prefix start
|
||||||
idx += blockDim.x) {
|
// for expert e. For non-local experts we map to the end (total aligned M).
|
||||||
smem_expert_first_token_offset[idx] =
|
int64_t aligned_base = 0;
|
||||||
__ldg(expert_first_token_offset + idx);
|
int64_t token_offset_in_expert = 0;
|
||||||
|
if (expert_id >= num_local_experts) {
|
||||||
|
aligned_base =
|
||||||
|
__ldg(aligned_expert_first_token_offset + num_local_experts);
|
||||||
|
token_offset_in_expert = 0;
|
||||||
|
} else {
|
||||||
|
aligned_base = __ldg(aligned_expert_first_token_offset + expert_id);
|
||||||
|
token_offset_in_expert =
|
||||||
|
expanded_dest_row - __ldg(expert_first_token_offset + expert_id);
|
||||||
}
|
}
|
||||||
__syncthreads();
|
expanded_dest_row = aligned_base + token_offset_in_expert;
|
||||||
int lane_idx = threadIdx.x & 31;
|
|
||||||
|
|
||||||
if (lane_idx == 0) {
|
|
||||||
// set token_offset_in_expert = 0 if this expert is not local expert
|
|
||||||
int token_offset_in_expert =
|
|
||||||
expert_id >= num_local_experts
|
|
||||||
? 0
|
|
||||||
: expanded_dest_row - smem_expert_first_token_offset[expert_id];
|
|
||||||
int64_t accumulate_align_offset = 0;
|
|
||||||
#pragma unroll 1
|
|
||||||
for (int eidx = 1; eidx <= min(expert_id, num_local_experts); eidx++) {
|
|
||||||
auto n_token_in_expert = smem_expert_first_token_offset[eidx] -
|
|
||||||
smem_expert_first_token_offset[eidx - 1];
|
|
||||||
accumulate_align_offset += (n_token_in_expert + align_block_size - 1) /
|
|
||||||
align_block_size * align_block_size;
|
|
||||||
}
|
|
||||||
expanded_dest_row = accumulate_align_offset + token_offset_in_expert;
|
|
||||||
}
|
|
||||||
// lane0 shuffle broadcast align_expanded_dest_row
|
|
||||||
expanded_dest_row = __shfl_sync(0xffffffff, expanded_dest_row, 0);
|
|
||||||
}
|
}
|
||||||
|
|
||||||
if (threadIdx.x == 0) {
|
if (threadIdx.x == 0) {
|
||||||
@@ -88,7 +76,8 @@ void expandInputRowsKernelLauncher(
|
|||||||
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
|
T const* unpermuted_input, T* permuted_output, int* sorted_experts,
|
||||||
int const* expanded_dest_row_to_expanded_source_row,
|
int const* expanded_dest_row_to_expanded_source_row,
|
||||||
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
|
int* expanded_source_row_to_expanded_dest_row, int* permuted_idx,
|
||||||
int64_t* expert_first_token_offset, int64_t const num_rows,
|
int64_t const* expert_first_token_offset,
|
||||||
|
int64_t const* aligned_expert_first_token_offset, int64_t const num_rows,
|
||||||
int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
|
int64_t const* num_valid_tokens_ptr, int64_t const cols, int const k,
|
||||||
int num_local_experts, const int& align_block_size, cudaStream_t stream) {
|
int num_local_experts, const int& align_block_size, cudaStream_t stream) {
|
||||||
int64_t const blocks = num_rows * k;
|
int64_t const blocks = num_rows * k;
|
||||||
@@ -104,14 +93,12 @@ void expandInputRowsKernelLauncher(
|
|||||||
bool is_align_block_size = align_block_size != -1;
|
bool is_align_block_size = align_block_size != -1;
|
||||||
auto func = func_map[is_check_skip][is_align_block_size];
|
auto func = func_map[is_check_skip][is_align_block_size];
|
||||||
|
|
||||||
int64_t smem_size = sizeof(int64_t) * (num_local_experts + 1);
|
func<<<blocks, threads, 0, stream>>>(
|
||||||
|
|
||||||
func<<<blocks, threads, smem_size, stream>>>(
|
|
||||||
unpermuted_input, permuted_output, sorted_experts,
|
unpermuted_input, permuted_output, sorted_experts,
|
||||||
expanded_dest_row_to_expanded_source_row,
|
expanded_dest_row_to_expanded_source_row,
|
||||||
expanded_source_row_to_expanded_dest_row, permuted_idx,
|
expanded_source_row_to_expanded_dest_row, permuted_idx,
|
||||||
expert_first_token_offset, num_rows, num_valid_tokens_ptr, cols, k,
|
expert_first_token_offset, aligned_expert_first_token_offset, num_rows,
|
||||||
num_local_experts, align_block_size);
|
num_valid_tokens_ptr, cols, k, num_local_experts, align_block_size);
|
||||||
}
|
}
|
||||||
|
|
||||||
template <class T, class U>
|
template <class T, class U>
|
||||||
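Note on the kernel change above: instead of having lane 0 loop over experts to accumulate padded sizes, each thread now looks up the precomputed aligned prefix offset and adds the row's position within its expert. The mapping can be sketched on the host as follows. `aligned_dest_row` is a hypothetical helper, and the vectors stand in for the device arrays `expert_first_token_offset` and `aligned_expert_first_token_offset`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Map an unaligned destination row to its aligned destination row.
// offset[e] and aligned_offset[e] are the unaligned and aligned first rows
// of expert e; both vectors have num_local_experts + 1 entries.
inline int64_t aligned_dest_row(int64_t dest_row, int expert_id,
                                const std::vector<int64_t>& offset,
                                const std::vector<int64_t>& aligned_offset) {
  assert(offset.size() == aligned_offset.size() && !offset.empty());
  assert(expert_id >= 0);
  const int num_local_experts = static_cast<int>(offset.size()) - 1;
  if (expert_id >= num_local_experts) {
    // Non-local experts map to the end (total aligned row count).
    return aligned_offset[num_local_experts];
  }
  const int64_t token_offset_in_expert = dest_row - offset[expert_id];
  return aligned_offset[expert_id] + token_offset_in_expert;
}
```

Using the example from the `moe_permute` comment (offsets `[0, 3, 8, 10]`, aligned offsets `[0, 4, 12, 16]`), unaligned row 4 belongs to expert 1 at position 1, so it lands on aligned row 5.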
|
|||||||
@@ -288,8 +288,8 @@ def generate_sch_sig(schedule_config: ScheduleConfig) -> str:
|
|||||||
)
|
)
|
||||||
cluster_shape = (
|
cluster_shape = (
|
||||||
f"{schedule_config.cluster_shape_mnk[0]}"
|
f"{schedule_config.cluster_shape_mnk[0]}"
|
||||||
+ f"x{schedule_config.cluster_shape_mnk[1]}"
|
f"x{schedule_config.cluster_shape_mnk[1]}"
|
||||||
+ f"x{schedule_config.cluster_shape_mnk[2]}"
|
f"x{schedule_config.cluster_shape_mnk[2]}"
|
||||||
)
|
)
|
||||||
kernel_schedule = VLLMKernelScheduleTag[schedule_config.kernel_schedule].split(
|
kernel_schedule = VLLMKernelScheduleTag[schedule_config.kernel_schedule].split(
|
||||||
"::"
|
"::"
|
||||||
@@ -301,7 +301,7 @@ def generate_sch_sig(schedule_config: ScheduleConfig) -> str:
|
|||||||
|
|
||||||
return (
|
return (
|
||||||
f"{tile_shape}_{cluster_shape}_{kernel_schedule}"
|
f"{tile_shape}_{cluster_shape}_{kernel_schedule}"
|
||||||
+ f"_{epilogue_schedule}_{tile_scheduler}"
|
f"_{epilogue_schedule}_{tile_scheduler}"
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,203 +0,0 @@
|
|||||||
Contains code from https://github.com/IST-DASLab/Sparse-Marlin/
|
|
||||||
|
|
||||||
Apache License
|
|
||||||
Version 2.0, January 2004
|
|
||||||
http://www.apache.org/licenses/
|
|
||||||
|
|
||||||
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
|
|
||||||
|
|
||||||
1. Definitions.
|
|
||||||
|
|
||||||
"License" shall mean the terms and conditions for use, reproduction,
|
|
||||||
and distribution as defined by Sections 1 through 9 of this document.
|
|
||||||
|
|
||||||
"Licensor" shall mean the copyright owner or entity authorized by
|
|
||||||
the copyright owner that is granting the License.
|
|
||||||
|
|
||||||
"Legal Entity" shall mean the union of the acting entity and all
|
|
||||||
other entities that control, are controlled by, or are under common
|
|
||||||
control with that entity. For the purposes of this definition,
|
|
||||||
"control" means (i) the power, direct or indirect, to cause the
|
|
||||||
direction or management of such entity, whether by contract or
|
|
||||||
otherwise, or (ii) ownership of fifty percent (50%) or more of the
|
|
||||||
outstanding shares, or (iii) beneficial ownership of such entity.
|
|
||||||
|
|
||||||
"You" (or "Your") shall mean an individual or Legal Entity
|
|
||||||
exercising permissions granted by this License.
|
|
||||||
|
|
||||||
"Source" form shall mean the preferred form for making modifications,
|
|
||||||
including but not limited to software source code, documentation
|
|
||||||
source, and configuration files.
|
|
||||||
|
|
||||||
"Object" form shall mean any form resulting from mechanical
|
|
||||||
transformation or translation of a Source form, including but
|
|
||||||
not limited to compiled object code, generated documentation,
|
|
||||||
and conversions to other media types.
|
|
||||||
|
|
||||||
"Work" shall mean the work of authorship, whether in Source or
|
|
||||||
Object form, made available under the License, as indicated by a
|
|
||||||
copyright notice that is included in or attached to the work
|
|
||||||
(an example is provided in the Appendix below).
|
|
||||||
|
|
||||||
"Derivative Works" shall mean any work, whether in Source or Object
|
|
||||||
form, that is based on (or derived from) the Work and for which the
|
|
||||||
editorial revisions, annotations, elaborations, or other modifications
|
|
||||||
represent, as a whole, an original work of authorship. For the purposes
|
|
||||||
of this License, Derivative Works shall not include works that remain
|
|
||||||
separable from, or merely link (or bind by name) to the interfaces of,
|
|
||||||
the Work and Derivative Works thereof.
|
|
||||||
|
|
||||||
"Contribution" shall mean any work of authorship, including
|
|
||||||
the original version of the Work and any modifications or additions
|
|
||||||
to that Work or Derivative Works thereof, that is intentionally
|
|
||||||
submitted to Licensor for inclusion in the Work by the copyright owner
|
|
||||||
or by an individual or Legal Entity authorized to submit on behalf of
|
|
||||||
the copyright owner. For the purposes of this definition, "submitted"
|
|
||||||
means any form of electronic, verbal, or written communication sent
|
|
||||||
to the Licensor or its representatives, including but not limited to
|
|
||||||
communication on electronic mailing lists, source code control systems,
|
|
||||||
and issue tracking systems that are managed by, or on behalf of, the
|
|
||||||
Licensor for the purpose of discussing and improving the Work, but
|
|
||||||
excluding communication that is conspicuously marked or otherwise
|
|
||||||
designated in writing by the copyright owner as "Not a Contribution."
|
|
||||||
|
|
||||||
"Contributor" shall mean Licensor and any individual or Legal Entity
|
|
||||||
on behalf of whom a Contribution has been received by Licensor and
|
|
||||||
subsequently incorporated within the Work.
|
|
||||||
|
|
||||||
2. Grant of Copyright License. Subject to the terms and conditions of
|
|
||||||
this License, each Contributor hereby grants to You a perpetual,
|
|
||||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
|
||||||
copyright license to reproduce, prepare Derivative Works of,
|
|
||||||
publicly display, publicly perform, sublicense, and distribute the
|
|
||||||
Work and such Derivative Works in Source or Object form.
|
|
||||||
|
|
||||||
3. Grant of Patent License. Subject to the terms and conditions of
|
|
||||||
this License, each Contributor hereby grants to You a perpetual,
|
|
||||||
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
|
|
||||||
(except as stated in this section) patent license to make, have made,
|
|
||||||
use, offer to sell, sell, import, and otherwise transfer the Work,
|
|
||||||
where such license applies only to those patent claims licensable
|
|
||||||
by such Contributor that are necessarily infringed by their
|
|
||||||
Contribution(s) alone or by combination of their Contribution(s)
|
|
||||||
with the Work to which such Contribution(s) was submitted. If You
|
|
||||||
institute patent litigation against any entity (including a
|
|
||||||
cross-claim or counterclaim in a lawsuit) alleging that the Work
|
|
||||||
or a Contribution incorporated within the Work constitutes direct
|
|
||||||
or contributory patent infringement, then any patent licenses
|
|
||||||
granted to You under this License for that Work shall terminate
|
|
||||||
as of the date such litigation is filed.
|
|
||||||
|
|
||||||
4. Redistribution. You may reproduce and distribute copies of the
|
|
||||||
Work or Derivative Works thereof in any medium, with or without
|
|
||||||
modifications, and in Source or Object form, provided that You
|
|
||||||
meet the following conditions:
|
|
||||||
|
|
||||||
(a) You must give any other recipients of the Work or
|
|
||||||
Derivative Works a copy of this License; and
|
|
||||||
|
|
||||||
(b) You must cause any modified files to carry prominent notices
|
|
||||||
stating that You changed the files; and
|
|
||||||
|
|
||||||
(c) You must retain, in the Source form of any Derivative Works
|
|
||||||
that You distribute, all copyright, patent, trademark, and
|
|
||||||
attribution notices from the Source form of the Work,
|
|
||||||
excluding those notices that do not pertain to any part of
|
|
||||||
the Derivative Works; and
|
|
||||||
|
|
||||||
(d) If the Work includes a "NOTICE" text file as part of its
|
|
||||||
distribution, then any Derivative Works that You distribute must
|
|
||||||
include a readable copy of the attribution notices contained
|
|
||||||
within such NOTICE file, excluding those notices that do not
|
|
||||||
pertain to any part of the Derivative Works, in at least one
|
|
||||||
of the following places: within a NOTICE text file distributed
|
|
||||||
as part of the Derivative Works; within the Source form or
|
|
||||||
documentation, if provided along with the Derivative Works; or,
|
|
||||||
within a display generated by the Derivative Works, if and
|
|
||||||
wherever such third-party notices normally appear. The contents
|
|
||||||
of the NOTICE file are for informational purposes only and
|
|
||||||
do not modify the License. You may add Your own attribution
|
|
||||||
notices within Derivative Works that You distribute, alongside
|
|
||||||
or as an addendum to the NOTICE text from the Work, provided
|
|
||||||
that such additional attribution notices cannot be construed
|
|
||||||
as modifying the License.
|
|
||||||
|
|
||||||
You may add Your own copyright statement to Your modifications and
|
|
||||||
may provide additional or different license terms and conditions
|
|
||||||
for use, reproduction, or distribution of Your modifications, or
|
|
||||||
for any such Derivative Works as a whole, provided Your use,
|
|
||||||
reproduction, and distribution of the Work otherwise complies with
|
|
||||||
the conditions stated in this License.
|
|
||||||
|
|
||||||
5. Submission of Contributions. Unless You explicitly state otherwise,
|
|
||||||
any Contribution intentionally submitted for inclusion in the Work
|
|
||||||
by You to the Licensor shall be under the terms and conditions of
|
|
||||||
this License, without any additional terms or conditions.
|
|
||||||
Notwithstanding the above, nothing herein shall supersede or modify
|
|
||||||
the terms of any separate license agreement you may have executed
|
|
||||||
with Licensor regarding such Contributions.
|
|
||||||
|
|
||||||
6. Trademarks. This License does not grant permission to use the trade
|
|
||||||
names, trademarks, service marks, or product names of the Licensor,
|
|
||||||
except as required for reasonable and customary use in describing the
|
|
||||||
origin of the Work and reproducing the content of the NOTICE file.
|
|
||||||
|
|
||||||
7. Disclaimer of Warranty. Unless required by applicable law or
|
|
||||||
agreed to in writing, Licensor provides the Work (and each
|
|
||||||
Contributor provides its Contributions) on an "AS IS" BASIS,
|
|
||||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
|
|
||||||
implied, including, without limitation, any warranties or conditions
|
|
||||||
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
|
|
||||||
PARTICULAR PURPOSE. You are solely responsible for determining the
|
|
||||||
appropriateness of using or redistributing the Work and assume any
|
|
||||||
risks associated with Your exercise of permissions under this License.
|
|
||||||
|
|
||||||
8. Limitation of Liability. In no event and under no legal theory,
|
|
||||||
whether in tort (including negligence), contract, or otherwise,
|
|
||||||
unless required by applicable law (such as deliberate and grossly
|
|
||||||
negligent acts) or agreed to in writing, shall any Contributor be
|
|
||||||
liable to You for damages, including any direct, indirect, special,
|
|
||||||
incidental, or consequential damages of any character arising as a
|
|
||||||
result of this License or out of the use or inability to use the
|
|
||||||
Work (including but not limited to damages for loss of goodwill,
|
|
||||||
work stoppage, computer failure or malfunction, or any and all
|
|
||||||
other commercial damages or losses), even if such Contributor
|
|
||||||
has been advised of the possibility of such damages.
|
|
||||||
|
|
||||||
9. Accepting Warranty or Additional Liability. While redistributing
|
|
||||||
the Work or Derivative Works thereof, You may choose to offer,
|
|
||||||
and charge a fee for, acceptance of support, warranty, indemnity,
|
|
||||||
or other liability obligations and/or rights consistent with this
|
|
||||||
License. However, in accepting such obligations, You may act only
|
|
||||||
on Your own behalf and on Your sole responsibility, not on behalf
|
|
||||||
of any other Contributor, and only if You agree to indemnify,
|
|
||||||
defend, and hold each Contributor harmless for any liability
|
|
||||||
incurred by, or claims asserted against, such Contributor by reason
|
|
||||||
of your accepting any such warranty or additional liability.
|
|
||||||
|
|
||||||
END OF TERMS AND CONDITIONS
|
|
||||||
|
|
||||||
APPENDIX: How to apply the Apache License to your work.
|
|
||||||
|
|
||||||
To apply the Apache License to your work, attach the following
|
|
||||||
boilerplate notice, with the fields enclosed by brackets "[]"
|
|
||||||
replaced with your own identifying information. (Don't include
|
|
||||||
the brackets!) The text should be enclosed in the appropriate
|
|
||||||
comment syntax for the file format. We also recommend that a
|
|
||||||
file or class name and description of purpose be included on the
|
|
||||||
same "printed page" as the copyright notice for easier
|
|
||||||
identification within third-party archives.
|
|
||||||
|
|
||||||
Copyright [yyyy] [name of copyright owner]
|
|
||||||
|
|
||||||
Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
you may not use this file except in compliance with the License.
|
|
||||||
You may obtain a copy of the License at
|
|
||||||
|
|
||||||
http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
|
|
||||||
Unless required by applicable law or agreed to in writing, software
|
|
||||||
distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
See the License for the specific language governing permissions and
|
|
||||||
limitations under the License.
|
|
||||||
@@ -1,51 +0,0 @@
|
|||||||
/*
|
|
||||||
* Copyright (C) 2024 Roberto Lopez Castro (roberto.lopez.castro@udc.es). All
|
|
||||||
* Rights Reserved.
|
|
||||||
*
|
|
||||||
* Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
* you may not use this file except in compliance with the License.
|
|
||||||
* You may obtain a copy of the License at
|
|
||||||
*
|
|
||||||
* http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
*
|
|
||||||
* Unless required by applicable law or agreed to in writing, software
|
|
||||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
* See the License for the specific language governing permissions and
|
|
||||||
* limitations under the License.
|
|
||||||
*/
|
|
||||||
|
|
||||||
#pragma once
|
|
||||||
|
|
||||||
namespace marlin_24 {
|
|
||||||
|
|
||||||
constexpr int ceildiv(int a, int b) { return (a + b - 1) / b; }
|
|
||||||
|
|
||||||
// Instances of `Vec` are used to organize groups of >>registers<<, as needed
|
|
||||||
// for instance as inputs to tensor core operations. Consequently, all
|
|
||||||
// corresponding index accesses must be compile-time constants, which is why we
|
|
||||||
// extensively use `#pragma unroll` throughout the kernel code to guarantee
|
|
||||||
// this.
|
|
||||||
template <typename T, int n>
|
|
||||||
struct Vec {
|
|
||||||
T elems[n];
|
|
||||||
__device__ T& operator[](int i) { return elems[i]; }
|
|
||||||
};
|
|
||||||
|
|
||||||
template <int M_, int N_, int K_>
|
|
||||||
struct ShapeBase {
|
|
||||||
static constexpr int M = M_, N = N_, K = K_;
|
|
||||||
};
|
|
||||||
|
|
||||||
using I4 = Vec<int, 4>;
|
|
||||||
|
|
||||||
// Matrix fragments for tensor core instructions; their precise layout is
|
|
||||||
// documented here:
|
|
||||||
// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m16n8k16-with-floating-point-type
|
|
||||||
using FragA = Vec<half2, 4>;
|
|
||||||
using FragB = Vec<half2, 2>;
|
|
||||||
using FragM = Vec<uint, 1>;
|
|
||||||
using FragC = Vec<float, 4>;
|
|
||||||
using FragS = Vec<half2, 1>; // quantization scales
|
|
||||||
|
|
||||||
} // namespace marlin_24
|
|
||||||
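As a small illustration of the `ceildiv` helper in the removed header, the sketch below counts how many fixed-size tiles are needed to cover a dimension. The helper name `tiles_needed` and the tile size of 16 are assumptions chosen for the example; they are not taken from the kernel. The formula assumes a non-negative `a` and a positive `b`.

```cpp
#include <cassert>

// Same definition as in the removed header.
constexpr int ceildiv(int a, int b) { return (a + b - 1) / b; }

// Hypothetical helper: how many `tile`-sized blocks cover `extent` elements.
constexpr int tiles_needed(int extent, int tile) { return ceildiv(extent, tile); }

static_assert(tiles_needed(32, 16) == 2, "exact multiple needs no extra tile");
static_assert(tiles_needed(33, 16) == 3, "a remainder rounds up");

// Runtime check of the same property for a range of small extents.
inline void check_ceildiv() {
  for (int extent = 1; extent <= 64; ++extent) {
    int t = tiles_needed(extent, 16);
    assert(t * 16 >= extent);       // the tiles cover the extent
    assert((t - 1) * 16 < extent);  // and one tile fewer would not
  }
}
```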
@@ -1,136 +0,0 @@
|
|||||||
/*
|
|
||||||
* Copyright (C) 2024 Roberto Lopez Castro (roberto.lopez.castro@udc.es). All
|
|
||||||
* Rights Reserved.
|
|
||||||
*
|
|
||||||
* Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
* you may not use this file except in compliance with the License.
|
|
||||||
* You may obtain a copy of the License at
|
|
||||||
*
|
|
||||||
* http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
*
|
|
||||||
* Unless required by applicable law or agreed to in writing, software
|
|
||||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
* See the License for the specific language governing permissions and
|
|
||||||
* limitations under the License.
|
|
||||||
*/
|
|
||||||
|
|
||||||
#pragma once
|
|
||||||
#include "base.h"
|
|
||||||
|
|
||||||
namespace marlin_24 {
|
|
||||||
// Predicated asynchronous global->shared copy; used for inputs A where we apply
|
|
||||||
// predication to handle batchsizes that are not multiples of 16.
|
|
||||||
__device__ inline void cp_async4_pred_zfill(void* smem_ptr,
|
|
||||||
const void* glob_ptr,
|
|
||||||
bool pred = true,
|
|
||||||
const bool zfill = false) {
|
|
||||||
const int BYTES = 16;
|
|
||||||
int src_in_bytes = (zfill ? 0 : BYTES);
|
|
||||||
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
|
|
||||||
asm volatile(
|
|
||||||
"{\n"
|
|
||||||
" .reg .pred p;\n"
|
|
||||||
" setp.ne.b32 p, %0, 0;\n"
|
|
||||||
" @p cp.async.cg.shared.global [%1], [%2], %3;\n"
|
|
||||||
"}\n" ::"r"((int)pred),
|
|
||||||
"r"(smem), "l"(glob_ptr), "n"(BYTES), "r"(src_in_bytes));
|
|
||||||
}
|
|
||||||
|
|
||||||
__device__ inline void cp_async4_pred(void* smem_ptr, const void* glob_ptr,
|
|
||||||
bool pred = true) {
|
|
||||||
const int BYTES = 16;
|
|
||||||
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
|
|
||||||
asm volatile(
|
|
||||||
"{\n"
|
|
||||||
" .reg .pred p;\n"
|
|
||||||
" setp.ne.b32 p, %0, 0;\n"
|
|
||||||
" @p cp.async.cg.shared.global [%1], [%2], %3;\n"
|
|
||||||
"}\n" ::"r"((int)pred),
|
|
||||||
"r"(smem), "l"(glob_ptr), "n"(BYTES));
|
|
||||||
}
|
|
||||||
|
|
||||||
// Asynchronous global->shared copy
|
|
||||||
__device__ inline void cp_async4(void* smem_ptr, const void* glob_ptr) {
|
|
||||||
const int BYTES = 16;
|
|
||||||
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
|
|
||||||
asm volatile(
|
|
||||||
"{\n"
|
|
||||||
" cp.async.cg.shared.global [%0], [%1], %2;\n"
|
|
||||||
"}\n" ::"r"(smem),
|
|
||||||
"l"(glob_ptr), "n"(BYTES));
|
|
||||||
}
|
|
||||||
|
|
||||||
// Async copy fence.
|
|
||||||
__device__ inline void cp_async_fence() {
|
|
||||||
asm volatile("cp.async.commit_group;\n" ::);
|
|
||||||
}
|
|
||||||
|
|
||||||
// Wait until at most `n` async copy stages are still pending.
|
|
||||||
template <int n>
|
|
||||||
__device__ inline void cp_async_wait() {
|
|
||||||
asm volatile("cp.async.wait_group %0;\n" ::"n"(n));
|
|
||||||
}
|
|
||||||
|
|
||||||
// Instruction for loading a full 16x16 matrix fragment of operand A from shared
|
|
||||||
// memory, directly in tensor core layout.
|
|
||||||
__device__ inline void ldsm4(FragA& frag_a, const void* smem_ptr) {
|
|
||||||
uint32_t* a = reinterpret_cast<uint32_t*>(&frag_a);
|
|
||||||
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
|
|
||||||
asm volatile("ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
|
|
||||||
: "=r"(a[0]), "=r"(a[1]), "=r"(a[2]), "=r"(a[3])
|
|
||||||
: "r"(smem));
|
|
||||||
}
|
|
||||||
|
|
||||||
__device__ inline void ldsm4_m(FragM& frag_m, const void* smem_ptr) {
|
|
||||||
uint32_t* a = reinterpret_cast<uint32_t*>(&frag_m);
|
|
||||||
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
|
|
||||||
asm volatile("ldmatrix.sync.aligned.m8n8.x2.shared.b16 {%0,%1}, [%2];\n"
|
|
||||||
: "=r"(a[0]), "=r"(a[1])
|
|
||||||
: "r"(smem));
|
|
||||||
}
|
|
||||||
|
|
||||||
// Instruction for loading a full 16x16 matrix fragment of operand A from shared
|
|
||||||
// memory, directly in tensor core layout.
|
|
||||||
__device__ inline void ldsm4_t(FragA& frag_a, const void* smem_ptr) {
|
|
||||||
uint32_t* a = reinterpret_cast<uint32_t*>(&frag_a);
|
|
||||||
uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
|
|
||||||
asm volatile(
|
|
||||||
"ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16 {%0,%1,%2,%3}, [%4];\n"
|
|
||||||
: "=r"(a[0]), "=r"(a[1]), "=r"(a[2]), "=r"(a[3])
|
|
||||||
: "r"(smem));
|
|
||||||
}
|
|
||||||
|
|
||||||
// Wait until barrier reaches `count`, then lock for current threadblock.
|
|
||||||
__device__ inline void barrier_acquire(int* lock, int count) {
|
|
||||||
if (threadIdx.x == 0) {
|
|
||||||
int state = -1;
|
|
||||||
do
|
|
||||||
// Guarantee that subsequent writes by this threadblock will be visible
|
|
||||||
// globally.
|
|
||||||
asm volatile("ld.global.acquire.gpu.b32 %0, [%1];\n"
|
|
||||||
: "=r"(state)
|
|
||||||
: "l"(lock));
|
|
||||||
while (state != count);
|
|
||||||
}
|
|
||||||
__syncthreads();
|
|
||||||
}
|
|
||||||
|
|
||||||
// Release barrier and increment visitation count.
|
|
||||||
__device__ inline void barrier_release(int* lock, bool reset = false) {
|
|
||||||
__syncthreads();
|
|
||||||
if (threadIdx.x == 0) {
|
|
||||||
if (reset) {
|
|
||||||
lock[0] = 0;
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
int val = 1;
|
|
||||||
// Make sure that all writes since acquiring this barrier are visible
|
|
||||||
// globally, while releasing the barrier.
|
|
||||||
asm volatile("fence.acq_rel.gpu;\n");
|
|
||||||
asm volatile("red.relaxed.gpu.global.add.s32 [%0], %1;\n"
|
|
||||||
:
|
|
||||||
: "l"(lock), "r"(val));
|
|
||||||
}
|
|
||||||
}
|
|
||||||
} // namespace marlin_24
|
|
||||||
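The counting protocol behind `barrier_acquire` and `barrier_release` can be modelled on the host. The sketch below is a single-threaded simulation under stated assumptions: `LockModel` and `run_in_order` are hypothetical names, blocks are "polled" in a fixed order rather than running concurrently, and the memory-ordering side (the acquire load, the fence and the relaxed reduction) is not modelled at all. It only shows that a block waiting for `count` passes once the lock word equals `count`, so blocks pass in increasing order however they are polled.

```cpp
#include <cassert>
#include <vector>

// Single-threaded model of the lock word; not a concurrent implementation.
struct LockModel {
  int lock = 0;
  // Mirrors the spin condition `state != count`: entry is allowed only when
  // the lock word equals the count the block is waiting for.
  bool try_acquire(int count) const { return lock == count; }
  // Mirrors barrier_release: either reset to 0 or increment the count.
  void release(bool reset = false) { lock = reset ? 0 : lock + 1; }
};

// Hypothetical driver. `poll_order` must be a permutation of 0..n-1; block k
// waits for count k. Returns the order in which blocks passed the barrier.
inline std::vector<int> run_in_order(const std::vector<int>& poll_order) {
  LockModel m;
  std::vector<int> entered;
  while (entered.size() < poll_order.size()) {
    for (int block : poll_order) {
      if (m.try_acquire(block)) {
        entered.push_back(block);  // the "critical section"
        m.release();               // lets block + 1 proceed
      }
    }
  }
  return entered;
}

inline void check_barrier_model() {
  std::vector<int> order = run_in_order({2, 0, 3, 1});
  for (int i = 0; i < 4; ++i) assert(order[i] == i);
}
```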
@@ -1,191 +0,0 @@
|
|||||||
/*
|
|
||||||
* Copyright (C) 2024 Roberto Lopez Castro (roberto.lopez.castro@udc.es). All
|
|
||||||
* Rights Reserved.
|
|
||||||
*
|
|
||||||
* Licensed under the Apache License, Version 2.0 (the "License");
|
|
||||||
* you may not use this file except in compliance with the License.
|
|
||||||
* You may obtain a copy of the License at
|
|
||||||
*
|
|
||||||
* http://www.apache.org/licenses/LICENSE-2.0
|
|
||||||
*
|
|
||||||
* Unless required by applicable law or agreed to in writing, software
|
|
||||||
* distributed under the License is distributed on an "AS IS" BASIS,
|
|
||||||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
||||||
* See the License for the specific language governing permissions and
|
|
||||||
* limitations under the License.
|
|
||||||
*/
|
|
||||||
|
|
||||||
#pragma once
|
|
||||||
#include "base.h"
|
|
||||||
#include <cudaTypedefs.h>
|
|
||||||
|
|
||||||
namespace marlin_24 {
|
|
||||||
|
|
||||||
// On CUDA earlier than 12.5, the ordered_metadata version of this instruction
|
|
||||||
// is not supported. On later versions of CUDA the version without ordered
|
|
||||||
// metadata results in the following warning:
|
|
||||||
// | Advisory: Modifier ‘.sp::ordered_metadata’ should be used on instruction
|
|
||||||
// | ‘mma’ instead of modifier ‘.sp’ as it is expected to have substantially
|
|
||||||
// | reduced performance on some future architectures
|
|
||||||
#if defined CUDA_VERSION && CUDA_VERSION >= 12050
|
|
||||||
#define MMA_SP_INST \
|
|
||||||
"mma.sp::ordered_metadata.sync.aligned.m16n8k32.row.col.f32.f16.f16.f32 "
|
|
||||||
#else
|
|
||||||
#define MMA_SP_INST "mma.sp.sync.aligned.m16n8k32.row.col.f32.f16.f16.f32 "
|
|
||||||
#endif
|
|
||||||
|
|
||||||
// m16n8k32 sparse tensor core mma instruction with fp16 inputs and fp32
|
|
||||||
// output/accumulation.
|
|
||||||
__device__ inline void mma_sp(const FragB& a_frag0, const FragB& a_frag1,
|
|
||||||
const FragA& frag_b, FragC& frag_c, FragM& frag_m,
|
|
||||||
const int psel) {
|
|
||||||
const uint32_t* a0 = reinterpret_cast<const uint32_t*>(&a_frag0);
|
|
||||||
const uint32_t* a1 = reinterpret_cast<const uint32_t*>(&a_frag1);
|
|
||||||
const uint32_t* b = reinterpret_cast<const uint32_t*>(&frag_b);
|
|
||||||
const uint32_t* e = reinterpret_cast<const uint32_t*>(&frag_m);
|
|
||||||
|
|
||||||
float* c = reinterpret_cast<float*>(&frag_c);
|
|
||||||
if (psel == 0) {
|
|
||||||
asm volatile(MMA_SP_INST
|
|
||||||
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9, %10,%11}, "
|
|
||||||
"{%12,%13,%14,%15}, %16, 0x0;\n"
|
|
||||||
: "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
|
|
||||||
: "r"(a0[0]), "r"(a1[0]), "r"(a0[1]), "r"(a1[1]), "r"(b[0]),
|
|
||||||
"r"(b[2]), "r"(b[4]), "r"(b[6]), "f"(c[0]), "f"(c[1]),
|
|
||||||
"f"(c[2]), "f"(c[3]), "r"(e[0]));
|
|
||||||
asm volatile(MMA_SP_INST
|
|
||||||
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9, %10,%11}, "
|
|
||||||
"{%12,%13,%14,%15}, %16, 0x0;\n"
|
|
||||||
: "=f"(c[4]), "=f"(c[5]), "=f"(c[6]), "=f"(c[7])
|
|
||||||
: "r"(a0[0]), "r"(a1[0]), "r"(a0[1]), "r"(a1[1]), "r"(b[1]),
|
|
||||||
"r"(b[3]), "r"(b[5]), "r"(b[7]), "f"(c[4]), "f"(c[5]),
|
|
||||||
"f"(c[6]), "f"(c[7]), "r"(e[0]));
|
|
||||||
} else {
|
|
||||||
asm volatile(MMA_SP_INST
|
|
||||||
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9, %10,%11}, "
|
|
||||||
"{%12,%13,%14,%15}, %16, 0x1;\n"
|
|
||||||
: "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
|
|
||||||
: "r"(a0[0]), "r"(a1[0]), "r"(a0[1]), "r"(a1[1]), "r"(b[0]),
|
|
||||||
"r"(b[2]), "r"(b[4]), "r"(b[6]), "f"(c[0]), "f"(c[1]),
|
|
||||||
"f"(c[2]), "f"(c[3]), "r"(e[0]));
|
|
||||||
asm volatile(MMA_SP_INST
|
|
||||||
"{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9, %10,%11}, "
|
|
||||||
"{%12,%13,%14,%15}, %16, 0x1;\n"
|
|
||||||
: "=f"(c[4]), "=f"(c[5]), "=f"(c[6]), "=f"(c[7])
|
|
||||||
: "r"(a0[0]), "r"(a1[0]), "r"(a0[1]), "r"(a1[1]), "r"(b[1]),
|
|
||||||
"r"(b[3]), "r"(b[5]), "r"(b[7]), "f"(c[4]), "f"(c[5]),
|
|
||||||
"f"(c[6]), "f"(c[7]), "r"(e[0]));
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
// Lookup-table based 3-input logical operation; explicitly used for
|
|
||||||
// dequantization as the compiler does not seem to automatically recognize it in
|
|
||||||
// all cases.
|
|
||||||
template <int lut>
|
|
||||||
__device__ inline int lop3(int a, int b, int c) {
|
|
||||||
int res;
|
|
||||||
asm volatile("lop3.b32 %0, %1, %2, %3, %4;\n"
|
|
||||||
: "=r"(res)
|
|
||||||
: "r"(a), "r"(b), "r"(c), "n"(lut));
|
|
||||||
return res;
|
|
||||||
}
|
|
||||||
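To make the `lut` template argument less opaque, here is a host-side model of `lop3.b32`. The function name `lop3_model` is an assumption for illustration. For every bit position, the three input bits form an index into the 8-bit table, which is why a table can be written by applying the desired expression to the constants `0xf0`, `0xcc` and `0xaa`.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of lop3.b32: per bit, index = (a << 2) | (b << 1) | c selects
// one bit of the 8-bit lookup table `lut`.
inline uint32_t lop3_model(uint32_t a, uint32_t b, uint32_t c, uint32_t lut) {
  uint32_t res = 0;
  for (int bit = 0; bit < 32; ++bit) {
    uint32_t idx = (((a >> bit) & 1u) << 2) | (((b >> bit) & 1u) << 1) |
                   ((c >> bit) & 1u);
    res |= ((lut >> idx) & 1u) << bit;
  }
  return res;
}

// The table for `(a & b) | c`, built the same way as in the diff.
constexpr uint32_t kAndOrLut = (0xf0 & 0xcc) | 0xaa;  // evaluates to 0xea

inline void check_lop3_model() {
  const uint32_t q = 0x12345678u, LO = 0x000f000fu, EX = 0x64006400u;
  assert(lop3_model(q, LO, EX, kAndOrLut) == ((q & LO) | EX));
}
```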
|
|
||||||
__device__ __forceinline__ uint2 to_half4(float c0, float c1, float c2,
|
|
||||||
float c3) {
|
|
||||||
uint2 r;
|
|
||||||
asm("{\n\t"
|
|
||||||
".reg .f16 a, b, c, d; \n\t"
|
|
||||||
"cvt.rn.f16.f32 a, %2; \n\t"
|
|
||||||
"cvt.rn.f16.f32 b, %3; \n\t"
|
|
||||||
"cvt.rn.f16.f32 c, %4; \n\t"
|
|
||||||
"cvt.rn.f16.f32 d, %5; \n\t"
|
|
||||||
"mov.b32 %0, {a, b}; \n\t"
|
|
||||||
"mov.b32 %1, {c, d}; \n\t"
|
|
||||||
"}"
|
|
||||||
: "=r"(r.x), "=r"(r.y)
|
|
||||||
: "f"(c0), "f"(c1), "f"(c2), "f"(c3));
|
|
||||||
return r;
|
|
||||||
}
|
|
||||||
|
|
||||||
// Constructs destination register by taking bytes from 2 sources (based on
|
|
||||||
// mask)
|
|
||||||
template <int start_byte, int mask>
|
|
||||||
__device__ inline uint32_t prmt(uint32_t a) {
|
|
||||||
uint32_t res;
|
|
||||||
asm volatile("prmt.b32 %0, %1, %2, %3;\n"
|
|
||||||
: "=r"(res)
|
|
||||||
: "r"(a), "n"(start_byte), "n"(mask));
|
|
||||||
return res;
|
|
||||||
}
|
|
||||||
|
|
||||||
// Efficiently dequantize an int32 value into a full B-fragment of 4 fp16
|
|
||||||
// values. We mostly follow the strategy in the link below, with some small
|
|
||||||
// changes:
|
|
||||||
// https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions/interleaved_numeric_conversion.h
|
|
||||||
__device__ inline FragB dequant_4bit(int q) {
|
|
||||||
const int LO = 0x000f000f;
|
|
||||||
const int HI = 0x00f000f0;
|
|
||||||
const int EX = 0x64006400;
|
|
||||||
// Guarantee that the `(a & b) | c` operations are LOP3s.
|
|
||||||
int lo = lop3<(0xf0 & 0xcc) | 0xaa>(q, LO, EX);
|
|
||||||
int hi = lop3<(0xf0 & 0xcc) | 0xaa>(q, HI, EX);
|
|
||||||
// We want signed int4 outputs, hence we fuse the `-8` symmetric zero point
|
|
||||||
// directly into `SUB` and `ADD`.
|
|
||||||
const int SUB = 0x64086408;
|
|
||||||
const int MUL = 0x2c002c00;
|
|
||||||
const int ADD = 0xd480d480;
|
|
||||||
|
|
||||||
FragB frag_b;
|
|
||||||
frag_b[0] = __hsub2(*reinterpret_cast<half2*>(&lo),
|
|
||||||
*reinterpret_cast<const half2*>(&SUB));
|
|
||||||
frag_b[1] = __hfma2(*reinterpret_cast<half2*>(&hi),
|
|
||||||
*reinterpret_cast<const half2*>(&MUL),
|
|
||||||
*reinterpret_cast<const half2*>(&ADD));
|
|
||||||
return frag_b;
|
|
||||||
}
|
|
||||||
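The constants in `dequant_4bit` can be checked by decoding them as fp16 bit patterns. The sketch below does this on the host in fp32; the function names are assumptions for illustration, and the arithmetic is a per-element model rather than the packed `half2` code in the diff. It shows why both the low-nibble path (`__hsub2` with `SUB`) and the high-nibble path (`__hfma2` with `MUL` and `ADD`) yield `x - 8` for a 4-bit value `x`.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode IEEE 754 binary16 bits to float (zero, subnormal and normal values
// only; inf/NaN are not needed for this example).
inline float half_bits_to_float(uint16_t h) {
  int sign = (h >> 15) & 1;
  int exp = (h >> 10) & 0x1f;
  int man = h & 0x3ff;
  float mag = (exp == 0) ? std::ldexp(static_cast<float>(man), -24)
                         : std::ldexp(1.0f + man / 1024.0f, exp - 15);
  return sign ? -mag : mag;
}

// Low nibble: 0x6400 | x decodes to 1024 + x; subtracting 0x6408 (1032)
// leaves x - 8.
inline float dequant_lo_model(int x) {
  uint16_t bits = static_cast<uint16_t>(0x6400 | (x & 0xf));
  return half_bits_to_float(bits) - half_bits_to_float(0x6408);
}

// High nibble: 0x6400 | (x << 4) decodes to 1024 + 16 * x; multiplying by
// 0x2c00 (1/16) and adding 0xd480 (-72) gives 64 + x - 72 = x - 8.
inline float dequant_hi_model(int x) {
  uint16_t bits = static_cast<uint16_t>(0x6400 | ((x & 0xf) << 4));
  return half_bits_to_float(bits) * half_bits_to_float(0x2c00) +
         half_bits_to_float(0xd480);
}

inline void check_dequant_4bit_model() {
  for (int x = 0; x < 16; ++x) {
    assert(dequant_lo_model(x) == x - 8);
    assert(dequant_hi_model(x) == x - 8);
  }
}
```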
|
|
||||||
// Efficiently dequantize an int32 value into a full B-fragment of 4 fp16
|
|
||||||
// values. We mostly follow the strategy in the link below, with some small
|
|
||||||
// changes:
|
|
||||||
// https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions/interleaved_numeric_conversion.h
|
|
||||||
__device__ inline FragB dequant_8bit(int q) {
|
|
||||||
static constexpr uint32_t mask_for_elt_01 = 0x5250;
|
|
||||||
static constexpr uint32_t mask_for_elt_23 = 0x5351;
|
|
||||||
static constexpr uint32_t start_byte_for_fp16 = 0x64646464;
|
|
||||||
|
|
||||||
uint32_t lo = prmt<start_byte_for_fp16, mask_for_elt_01>(q);
|
|
||||||
uint32_t hi = prmt<start_byte_for_fp16, mask_for_elt_23>(q);
|
|
||||||
|
|
||||||
static constexpr uint32_t I8s_TO_F16s_MAGIC_NUM = 0x64806480;
|
|
||||||
|
|
||||||
FragB frag_b;
|
|
||||||
frag_b[0] = __hsub2(*reinterpret_cast<half2*>(&lo),
|
|
||||||
*reinterpret_cast<const half2*>(&I8s_TO_F16s_MAGIC_NUM));
|
|
||||||
frag_b[1] = __hsub2(*reinterpret_cast<half2*>(&hi),
|
|
||||||
*reinterpret_cast<const half2*>(&I8s_TO_F16s_MAGIC_NUM));
|
|
||||||
return frag_b;
|
|
||||||
}
|
|
||||||
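The byte shuffling in `dequant_8bit` can be followed with a host-side model of `prmt.b32`. The names `prmt_model` and `dequant_byte_model` are assumptions for illustration, and the model ignores the sign-replication bit of each selector nibble, which the selectors in the diff do not set. Each output half becomes `0x64XX`, which as fp16 is `1024 + XX`; subtracting `0x6480` (1152) therefore maps a stored byte `XX` to `XX - 128`.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of prmt.b32 in its default mode: each selector nibble picks
// one byte from the 8-byte pool {b, a} (a holds bytes 0-3, b holds bytes 4-7).
inline uint32_t prmt_model(uint32_t a, uint32_t b, uint32_t selector) {
  uint64_t pool = (static_cast<uint64_t>(b) << 32) | a;
  uint32_t res = 0;
  for (int i = 0; i < 4; ++i) {
    uint32_t sel = (selector >> (4 * i)) & 0x7u;  // sign bit of nibble ignored
    uint32_t byte = static_cast<uint32_t>((pool >> (8 * sel)) & 0xffu);
    res |= byte << (8 * i);
  }
  return res;
}

// Per-byte arithmetic: fp16 0x64XX is 1024 + XX, and 0x6480 is 1152.
inline float dequant_byte_model(uint8_t stored) {
  return static_cast<float>(1024 + stored) - 1152.0f;
}

inline void check_dequant_8bit_model() {
  const uint32_t q = 0x80ff007fu;  // bytes, low to high: 7f 00 ff 80
  assert(prmt_model(q, 0x64646464u, 0x5250u) == 0x64ff647fu);  // bytes 0 and 2
  assert(prmt_model(q, 0x64646464u, 0x5351u) == 0x64806400u);  // bytes 1 and 3
  assert(dequant_byte_model(0x7f) == -1.0f);
  assert(dequant_byte_model(0x80) == 0.0f);
}
```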
|
|
||||||
// Multiply dequantized values by the corresponding quantization scale; used
|
|
||||||
// only for grouped quantization.
|
|
||||||
__device__ inline void scale(FragB& frag_b, FragS& frag_s, int i) {
|
|
||||||
half2 s = __half2half2(reinterpret_cast<__half*>(&frag_s)[i]);
|
|
||||||
frag_b[0] = __hmul2(frag_b[0], s);
|
|
||||||
frag_b[1] = __hmul2(frag_b[1], s);
|
|
||||||
}
|
|
||||||
|
|
||||||
__device__ inline void scale_floats(float* c0, float* c1, float* c2, float* c3,
|
|
||||||
FragS& s0, float* c4, float* c5, float* c6,
|
|
||||||
float* c7, FragS& s1) {
|
|
||||||
*c0 = __fmul_rn(*c0, __half2float(s0[0].x));
|
|
||||||
*c1 = __fmul_rn(*c1, __half2float(s0[0].y));
|
|
||||||
*c2 = __fmul_rn(*c2, __half2float(s0[1].x));
|
|
||||||
*c3 = __fmul_rn(*c3, __half2float(s0[1].y));
|
|
||||||
|
|
||||||
*c4 = __fmul_rn(*c4, __half2float(s1[0].x));
|
|
||||||
*c5 = __fmul_rn(*c5, __half2float(s1[0].y));
|
|
||||||
*c6 = __fmul_rn(*c6, __half2float(s1[1].x));
|
|
||||||
*c7 = __fmul_rn(*c7, __half2float(s1[1].y));
|
|
||||||
}
|
|
||||||
|
|
||||||
} // namespace marlin_24
|
|
||||||
File diff suppressed because it is too large
@@ -141,8 +141,8 @@ struct cutlass_3x_gemm_sm100 {
|
|||||||
sizeof(typename CollectiveEpilogue::SharedStorage))>,
|
sizeof(typename CollectiveEpilogue::SharedStorage))>,
|
||||||
KernelSchedule>::CollectiveOp;
|
KernelSchedule>::CollectiveOp;
|
||||||
|
|
||||||
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
|
using GemmKernel = enable_sm100f_only<cutlass::gemm::kernel::GemmUniversal<
|
||||||
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, void>;
|
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, void>>;
|
||||||
};
|
};
|
||||||
|
|
||||||
template <typename ElementAB_, typename ElementD_,
|
template <typename ElementAB_, typename ElementD_,
|
||||||
@@ -202,8 +202,8 @@ struct cutlass_3x_gemm_sm120 {
|
|||||||
sizeof(typename CollectiveEpilogue::SharedStorage))>,
|
sizeof(typename CollectiveEpilogue::SharedStorage))>,
|
||||||
KernelSchedule>::CollectiveOp;
|
KernelSchedule>::CollectiveOp;
|
||||||
|
|
||||||
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
|
using GemmKernel = enable_sm120_only<cutlass::gemm::kernel::GemmUniversal<
|
||||||
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, void>;
|
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, void>>;
|
||||||
};
|
};
|
||||||
|
|
||||||
} // namespace vllm
|
} // namespace vllm
|
||||||
|
|||||||
@@ -123,7 +123,7 @@ struct cutlass_3x_gemm_fp8_blockwise {
|
|||||||
MainloopScheduler
|
MainloopScheduler
|
||||||
>::CollectiveOp>;
|
>::CollectiveOp>;
|
||||||
|
|
||||||
using KernelType = enable_sm100_only<cutlass::gemm::kernel::GemmUniversal<
|
using KernelType = enable_sm100f_only<cutlass::gemm::kernel::GemmUniversal<
|
||||||
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue>>;
|
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue>>;
|
||||||
|
|
||||||
struct GemmKernel : public KernelType {};
|
struct GemmKernel : public KernelType {};
|
||||||
|
|||||||
@@ -90,8 +90,8 @@ struct cutlass_3x_gemm_sm100_fp8 {
|
|||||||
// -----------------------------------------------------------
|
// -----------------------------------------------------------
|
||||||
// Kernel definition
|
// Kernel definition
|
||||||
// -----------------------------------------------------------
|
// -----------------------------------------------------------
|
||||||
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
|
using GemmKernel = enable_sm100f_only<cutlass::gemm::kernel::GemmUniversal<
|
||||||
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, void>;
|
Shape<int, int, int, int>, CollectiveMainloop, CollectiveEpilogue, void>>;
|
||||||
};
|
};
|
||||||
|
|
||||||
template <typename InType, typename OutType, bool EnableBias>
|
template <typename InType, typename OutType, bool EnableBias>
|
||||||
|
|||||||
@@ -36,41 +36,6 @@ using namespace cute;
|
|||||||
*/
|
*/
|
||||||
|
|
||||||
namespace vllm {
|
namespace vllm {
|
||||||
|
|
||||||
// Wrappers for the GEMM kernel that is used to guard against compilation on
|
|
||||||
// architectures that will never use the kernel. The purpose of this is to
|
|
||||||
// reduce the size of the compiled binary.
|
|
||||||
// __CUDA_ARCH__ is not defined in host code, so this lets us smuggle the ifdef
|
|
||||||
// into code that will be executed on the device where it is defined.
|
|
||||||
template <typename Kernel>
|
|
||||||
struct enable_sm75_to_sm80 : Kernel {
|
|
||||||
template <typename... Args>
|
|
||||||
CUTLASS_DEVICE static void invoke(Args&&... args) {
|
|
||||||
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ >= 750 && __CUDA_ARCH__ < 800
|
|
||||||
Kernel::invoke(std::forward<Args>(args)...);
|
|
||||||
#endif
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
template <typename Kernel>
|
|
||||||
struct enable_sm80_to_sm89 : Kernel {
|
|
||||||
template <typename... Args>
|
|
||||||
CUTLASS_DEVICE static void invoke(Args&&... args) {
|
|
||||||
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ >= 800 && __CUDA_ARCH__ < 890
|
|
||||||
Kernel::invoke(std::forward<Args>(args)...);
|
|
||||||
#endif
|
|
||||||
}
|
|
||||||
};
|
|
||||||
|
|
||||||
template <typename Kernel>
|
|
||||||
struct enable_sm89_to_sm90 : Kernel {
|
|
||||||
template <typename... Args>
|
|
||||||
CUTLASS_DEVICE static void invoke(Args&&... args) {
|
|
||||||
#if defined __CUDA_ARCH__ && __CUDA_ARCH__ >= 890 && __CUDA_ARCH__ < 900
|
|
||||||
Kernel::invoke(std::forward<Args>(args)...);
|
|
||||||
#endif
|
|
||||||
}
|
|
||||||
};
|
|
||||||
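The guard pattern described in the removed comment can be tried out on the host. The sketch below is a model under stated assumptions: `FAKE_CUDA_ARCH` stands in for `__CUDA_ARCH__` (which is only defined during device compilation), `CUTLASS_DEVICE` is dropped, and `DemoKernel`, `guard_sm75_to_sm80` and `guard_sm89_to_sm90` are hypothetical names. The wrapper always exists, but its body is only compiled for the matching architecture range, so other targets get an empty function.

```cpp
#include <cassert>
#include <utility>

// Stand-in for __CUDA_ARCH__; 890 pretends we are compiling for sm_89.
#ifndef FAKE_CUDA_ARCH
#define FAKE_CUDA_ARCH 890
#endif

// Hypothetical kernel type with a static entry point, as the guards expect.
struct DemoKernel {
  static void invoke(int& calls) { ++calls; }
};

// Body compiled only when the (fake) architecture is in [750, 800).
template <typename Kernel>
struct guard_sm75_to_sm80 : Kernel {
  template <typename... Args>
  static void invoke(Args&&... args) {
#if FAKE_CUDA_ARCH >= 750 && FAKE_CUDA_ARCH < 800
    Kernel::invoke(std::forward<Args>(args)...);
#endif
  }
};

// Body compiled only when the (fake) architecture is in [890, 900).
template <typename Kernel>
struct guard_sm89_to_sm90 : Kernel {
  template <typename... Args>
  static void invoke(Args&&... args) {
#if FAKE_CUDA_ARCH >= 890 && FAKE_CUDA_ARCH < 900
    Kernel::invoke(std::forward<Args>(args)...);
#endif
  }
};

inline void check_arch_guards() {
  int calls = 0;
  guard_sm75_to_sm80<DemoKernel>::invoke(calls);  // compiled out: no call
  assert(calls == 0);
  guard_sm89_to_sm90<DemoKernel>::invoke(calls);  // in range: forwards
  assert(calls == 1);
}
```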
template <typename Arch, template <typename> typename ArchGuard,
|
template <typename Arch, template <typename> typename ArchGuard,
|
||||||
typename ElementAB_, typename ElementD_,
|
typename ElementAB_, typename ElementD_,
|
||||||
template <typename, typename> typename Epilogue_, typename TileShape,
|
template <typename, typename> typename Epilogue_, typename TileShape,
|
||||||
|
|||||||
@@ -50,7 +50,7 @@ struct sm89_fp8_config_default {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<128, 128, 64>;
|
using TileShape = typename cutlass::gemm::GemmShape<128, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5, FP8MathOperator>,
|
InstructionShape, 5, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -58,7 +58,7 @@ struct sm89_fp8_config_default {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<256, 128, 64>;
|
using TileShape = typename cutlass::gemm::GemmShape<256, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3, FP8MathOperator>,
|
InstructionShape, 3, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -67,7 +67,7 @@ struct sm89_fp8_config_default {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<128, 128, 64>;
|
using TileShape = typename cutlass::gemm::GemmShape<128, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5, FP8MathOperator>,
|
InstructionShape, 5, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -100,7 +100,7 @@ struct sm89_fp8_config_M256 {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<64, 128, 128>;
|
using TileShape = typename cutlass::gemm::GemmShape<64, 128, 128>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3, FP8MathOperator>,
|
InstructionShape, 3, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -108,7 +108,7 @@ struct sm89_fp8_config_M256 {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<128, 128, 64>;
|
using TileShape = typename cutlass::gemm::GemmShape<128, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5, FP8MathOperator>,
|
InstructionShape, 5, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -141,7 +141,7 @@ struct sm89_fp8_config_M128 {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<64, 128, 128>;
|
using TileShape = typename cutlass::gemm::GemmShape<64, 128, 128>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3, FP8MathOperator>,
|
InstructionShape, 3, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -150,7 +150,7 @@ struct sm89_fp8_config_M128 {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<128, 128, 64>;
|
using TileShape = typename cutlass::gemm::GemmShape<128, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5, FP8MathOperator>,
|
InstructionShape, 5, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -158,7 +158,7 @@ struct sm89_fp8_config_M128 {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<128, 64, 128>;
|
using TileShape = typename cutlass::gemm::GemmShape<128, 64, 128>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3, FP8MathOperator>,
|
InstructionShape, 3, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -191,7 +191,7 @@ struct sm89_fp8_config_M64 {
|
|||||||
using FP8MathOperator = typename cutlass::arch::OpMultiplyAdd;
|
using FP8MathOperator = typename cutlass::arch::OpMultiplyAdd;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5, FP8MathOperator>,
|
InstructionShape, 5, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -201,7 +201,7 @@ struct sm89_fp8_config_M64 {
|
|||||||
using FP8MathOperator = typename cutlass::arch::OpMultiplyAddFastAccum;
|
using FP8MathOperator = typename cutlass::arch::OpMultiplyAddFastAccum;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3, FP8MathOperator>,
|
InstructionShape, 3, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -211,7 +211,7 @@ struct sm89_fp8_config_M64 {
|
|||||||
using FP8MathOperator = typename cutlass::arch::OpMultiplyAdd;
|
using FP8MathOperator = typename cutlass::arch::OpMultiplyAdd;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5, FP8MathOperator>,
|
InstructionShape, 5, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -244,7 +244,7 @@ struct sm89_fp8_config_M32 {
|
|||||||
using WarpShape = typename cutlass::gemm::GemmShape<16, 64, 64>;
|
using WarpShape = typename cutlass::gemm::GemmShape<16, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5, FP8MathOperator>,
|
InstructionShape, 5, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -253,7 +253,7 @@ struct sm89_fp8_config_M32 {
|
|||||||
using WarpShape = typename cutlass::gemm::GemmShape<32, 64, 64>;
|
using WarpShape = typename cutlass::gemm::GemmShape<32, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 4, FP8MathOperator>,
|
InstructionShape, 4, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -262,7 +262,7 @@ struct sm89_fp8_config_M32 {
|
|||||||
using WarpShape = typename cutlass::gemm::GemmShape<16, 64, 64>;
|
using WarpShape = typename cutlass::gemm::GemmShape<16, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5, FP8MathOperator>,
|
InstructionShape, 5, FP8MathOperator>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -296,7 +296,7 @@ struct sm89_fp8_config_M16 {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<16, 64, 128>;
|
using TileShape = typename cutlass::gemm::GemmShape<16, 64, 128>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, MainLoopStages,
|
InstructionShape, MainLoopStages,
|
||||||
FP8MathOperator>,
|
FP8MathOperator>,
|
||||||
@@ -305,7 +305,7 @@ struct sm89_fp8_config_M16 {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<16, 128, 64>;
|
using TileShape = typename cutlass::gemm::GemmShape<16, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, MainLoopStages,
|
InstructionShape, MainLoopStages,
|
||||||
FP8MathOperator>,
|
FP8MathOperator>,
|
||||||
@@ -314,7 +314,7 @@ struct sm89_fp8_config_M16 {
|
|||||||
using TileShape = typename cutlass::gemm::GemmShape<32, 64, 128>;
|
using TileShape = typename cutlass::gemm::GemmShape<32, 64, 128>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, MainLoopStages,
|
InstructionShape, MainLoopStages,
|
||||||
FP8MathOperator>,
|
FP8MathOperator>,
|
||||||
|
|||||||
@@ -48,7 +48,7 @@ struct sm89_int8_config_default {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<128, 128, 64>;
|
using TileShape = cutlass::gemm::GemmShape<128, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -56,7 +56,7 @@ struct sm89_int8_config_default {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<256, 128, 64>;
|
using TileShape = cutlass::gemm::GemmShape<256, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3>,
|
InstructionShape, 3>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -64,7 +64,7 @@ struct sm89_int8_config_default {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<128, 128, 64>;
|
using TileShape = cutlass::gemm::GemmShape<128, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -72,7 +72,7 @@ struct sm89_int8_config_default {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<256, 128, 64>;
|
using TileShape = cutlass::gemm::GemmShape<256, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3>,
|
InstructionShape, 3>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -104,7 +104,7 @@ struct sm89_int8_config_M256 {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<64, 128, 128>;
|
using TileShape = cutlass::gemm::GemmShape<64, 128, 128>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3>,
|
InstructionShape, 3>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -112,7 +112,7 @@ struct sm89_int8_config_M256 {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<128, 128, 64>;
|
using TileShape = cutlass::gemm::GemmShape<128, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -120,7 +120,7 @@ struct sm89_int8_config_M256 {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<256, 128, 64>;
|
using TileShape = cutlass::gemm::GemmShape<256, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3>,
|
InstructionShape, 3>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -128,7 +128,7 @@ struct sm89_int8_config_M256 {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<128, 128, 64>;
|
using TileShape = cutlass::gemm::GemmShape<128, 128, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -160,7 +160,7 @@ struct sm89_int8_config_M128 {
|
|||||||
using WarpShape = cutlass::gemm::GemmShape<64, 64, 64>;
|
using WarpShape = cutlass::gemm::GemmShape<64, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3>,
|
InstructionShape, 3>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -169,7 +169,7 @@ struct sm89_int8_config_M128 {
|
|||||||
using WarpShape = cutlass::gemm::GemmShape<64, 64, 64>;
|
using WarpShape = cutlass::gemm::GemmShape<64, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -178,7 +178,7 @@ struct sm89_int8_config_M128 {
|
|||||||
using WarpShape = cutlass::gemm::GemmShape<32, 64, 64>;
|
using WarpShape = cutlass::gemm::GemmShape<32, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -210,7 +210,7 @@ struct sm89_int8_config_M64 {
|
|||||||
using WarpShape = cutlass::gemm::GemmShape<32, 64, 64>;
|
using WarpShape = cutlass::gemm::GemmShape<32, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -219,7 +219,7 @@ struct sm89_int8_config_M64 {
|
|||||||
using WarpShape = cutlass::gemm::GemmShape<64, 64, 64>;
|
using WarpShape = cutlass::gemm::GemmShape<64, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 3>,
|
InstructionShape, 3>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -251,7 +251,7 @@ struct sm89_int8_config_M32 {
|
|||||||
using WarpShape = cutlass::gemm::GemmShape<16, 64, 64>;
|
using WarpShape = cutlass::gemm::GemmShape<16, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -260,7 +260,7 @@ struct sm89_int8_config_M32 {
|
|||||||
using WarpShape = cutlass::gemm::GemmShape<32, 64, 64>;
|
using WarpShape = cutlass::gemm::GemmShape<32, 64, 64>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 4>,
|
InstructionShape, 4>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -292,7 +292,7 @@ struct sm89_int8_config_M16 {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<16, 64, 128>;
|
using TileShape = cutlass::gemm::GemmShape<16, 64, 128>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 5>,
|
InstructionShape, 5>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
@@ -300,7 +300,7 @@ struct sm89_int8_config_M16 {
|
|||||||
using TileShape = cutlass::gemm::GemmShape<16, 128, 128>;
|
using TileShape = cutlass::gemm::GemmShape<16, 128, 128>;
|
||||||
|
|
||||||
return vllm::fallback_cutlass_gemm_caller<
|
return vllm::fallback_cutlass_gemm_caller<
|
||||||
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, vllm::enable_sm89_to_sm90,
|
vllm::cutlass_2x_gemm<cutlass::arch::Sm89, enable_sm89_to_sm90,
|
||||||
InType, OutType, Epilogue, TileShape, WarpShape,
|
InType, OutType, Epilogue, TileShape, WarpShape,
|
||||||
InstructionShape, 4>,
|
InstructionShape, 4>,
|
||||||
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
FallbackGemm>(out, a, b, std::forward<EpilogueArgs>(args)...);
|
||||||
|
|||||||
@@ -259,14 +259,6 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
|
|||||||
// custom types:
|
// custom types:
|
||||||
// https://docs.google.com/document/d/18fBMPuOJ0fY5ZQ6YyrHUppw9FA332CpNtgB6SOIgyuA
|
// https://docs.google.com/document/d/18fBMPuOJ0fY5ZQ6YyrHUppw9FA332CpNtgB6SOIgyuA
|
||||||
|
|
||||||
// Marlin_24 (Sparse) Optimized Quantized GEMM for GPTQ.
|
|
||||||
ops.def(
|
|
||||||
"gptq_marlin_24_gemm(Tensor a, Tensor b_q_weight, Tensor b_meta, "
|
|
||||||
"Tensor b_scales, Tensor workspace, "
|
|
||||||
"int b_q_type, "
|
|
||||||
"SymInt size_m, SymInt size_n, SymInt size_k) -> Tensor");
|
|
||||||
// conditionally compiled so impl in source file
|
|
||||||
|
|
||||||
// Machete (Dense) Optimized Mixed Precision GEMM for Hopper.
|
// Machete (Dense) Optimized Mixed Precision GEMM for Hopper.
|
||||||
ops.def(
|
ops.def(
|
||||||
"machete_supported_schedules("
|
"machete_supported_schedules("
|
||||||
|
|||||||
@@ -97,9 +97,7 @@ ARG PYTHON_VERSION
|
|||||||
ENV DEBIAN_FRONTEND=noninteractive
|
ENV DEBIAN_FRONTEND=noninteractive
|
||||||
|
|
||||||
# Install system dependencies including build tools
|
# Install system dependencies including build tools
|
||||||
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
|
RUN apt-get update -y \
|
||||||
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
|
|
||||||
&& apt-get update -y \
|
|
||||||
&& apt-get install -y --no-install-recommends \
|
&& apt-get install -y --no-install-recommends \
|
||||||
ccache \
|
ccache \
|
||||||
software-properties-common \
|
software-properties-common \
|
||||||
@@ -135,7 +133,7 @@ ENV UV_LINK_MODE=copy
|
|||||||
RUN gcc --version
|
RUN gcc --version
|
||||||
|
|
||||||
# Ensure CUDA compatibility library is loaded
|
# Ensure CUDA compatibility library is loaded
|
||||||
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/00-cuda-compat.conf && ldconfig
|
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig
|
||||||
|
|
||||||
# ============================================================
|
# ============================================================
|
||||||
# SLOW-CHANGING DEPENDENCIES BELOW
|
# SLOW-CHANGING DEPENDENCIES BELOW
|
||||||
@@ -502,9 +500,7 @@ RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \
|
|||||||
echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment
|
echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment
|
||||||
|
|
||||||
# Install Python and system dependencies
|
# Install Python and system dependencies
|
||||||
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
|
RUN apt-get update -y \
|
||||||
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
|
|
||||||
&& apt-get update -y \
|
|
||||||
&& apt-get install -y --no-install-recommends \
|
&& apt-get install -y --no-install-recommends \
|
||||||
software-properties-common \
|
software-properties-common \
|
||||||
curl \
|
curl \
|
||||||
@@ -565,7 +561,7 @@ ENV UV_INDEX_STRATEGY="unsafe-best-match"
|
|||||||
ENV UV_LINK_MODE=copy
|
ENV UV_LINK_MODE=copy
|
||||||
|
|
||||||
# Ensure CUDA compatibility library is loaded
|
# Ensure CUDA compatibility library is loaded
|
||||||
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/00-cuda-compat.conf && ldconfig
|
RUN echo "/usr/local/cuda-$(echo "$CUDA_VERSION" | cut -d. -f1,2)/compat/" > /etc/ld.so.conf.d/cuda-compat.conf && ldconfig
|
||||||
|
|
||||||
# ============================================================
|
# ============================================================
|
||||||
# SLOW-CHANGING DEPENDENCIES BELOW
|
# SLOW-CHANGING DEPENDENCIES BELOW
|
||||||
@@ -586,7 +582,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
# This is ~1.1GB and only changes when FlashInfer version bumps
|
# This is ~1.1GB and only changes when FlashInfer version bumps
|
||||||
# https://docs.flashinfer.ai/installation.html
|
# https://docs.flashinfer.ai/installation.html
|
||||||
# From versions.json: .flashinfer.version
|
# From versions.json: .flashinfer.version
|
||||||
ARG FLASHINFER_VERSION=0.6.1
|
ARG FLASHINFER_VERSION=0.6.2
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
uv pip install --system flashinfer-cubin==${FLASHINFER_VERSION} \
|
uv pip install --system flashinfer-cubin==${FLASHINFER_VERSION} \
|
||||||
&& uv pip install --system flashinfer-jit-cache==${FLASHINFER_VERSION} \
|
&& uv pip install --system flashinfer-jit-cache==${FLASHINFER_VERSION} \
|
||||||
@@ -713,9 +709,7 @@ ENV UV_INDEX_STRATEGY="unsafe-best-match"
|
|||||||
# Use copy mode to avoid hardlink failures with Docker cache mounts
|
# Use copy mode to avoid hardlink failures with Docker cache mounts
|
||||||
ENV UV_LINK_MODE=copy
|
ENV UV_LINK_MODE=copy
|
||||||
|
|
||||||
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
|
RUN apt-get update -y \
|
||||||
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
|
|
||||||
&& apt-get update -y \
|
|
||||||
&& apt-get install -y git
|
&& apt-get install -y git
|
||||||
|
|
||||||
# We can specify the standard or nightly build of PyTorch
|
# We can specify the standard or nightly build of PyTorch
|
||||||
|
|||||||
@@ -25,7 +25,7 @@
|
|||||||
######################### COMMON BASE IMAGE #########################
|
######################### COMMON BASE IMAGE #########################
|
||||||
FROM ubuntu:22.04 AS base-common
|
FROM ubuntu:22.04 AS base-common
|
||||||
|
|
||||||
WORKDIR /workspace/
|
WORKDIR /workspace
|
||||||
|
|
||||||
ARG PYTHON_VERSION=3.12
|
ARG PYTHON_VERSION=3.12
|
||||||
ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
|
ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
|
||||||
@@ -35,7 +35,7 @@ RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
|
|||||||
--mount=type=cache,target=/var/lib/apt,sharing=locked \
|
--mount=type=cache,target=/var/lib/apt,sharing=locked \
|
||||||
apt-get update -y \
|
apt-get update -y \
|
||||||
&& apt-get install -y --no-install-recommends sudo ccache git curl wget ca-certificates \
|
&& apt-get install -y --no-install-recommends sudo ccache git curl wget ca-certificates \
|
||||||
gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof \
|
gcc-12 g++-12 libtcmalloc-minimal4 libnuma-dev ffmpeg libsm6 libxext6 libgl1 jq lsof \
|
||||||
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 \
|
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12 \
|
||||||
&& curl -LsSf https://astral.sh/uv/install.sh | sh
|
&& curl -LsSf https://astral.sh/uv/install.sh | sh
|
||||||
|
|
||||||
@@ -109,7 +109,7 @@ ENV VLLM_CPU_AVX512VNNI=${VLLM_CPU_AVX512VNNI}
|
|||||||
ARG VLLM_CPU_AMXBF16=1
|
ARG VLLM_CPU_AMXBF16=1
|
||||||
ENV VLLM_CPU_AMXBF16=${VLLM_CPU_AMXBF16}
|
ENV VLLM_CPU_AMXBF16=${VLLM_CPU_AMXBF16}
|
||||||
|
|
||||||
WORKDIR /workspace/vllm
|
WORKDIR /vllm-workspace
|
||||||
|
|
||||||
# Copy build requirements
|
# Copy build requirements
|
||||||
COPY requirements/cpu-build.txt requirements/build.txt
|
COPY requirements/cpu-build.txt requirements/build.txt
|
||||||
@@ -123,13 +123,13 @@ RUN if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
|
|||||||
|
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
--mount=type=cache,target=/root/.cache/ccache \
|
--mount=type=cache,target=/root/.cache/ccache \
|
||||||
--mount=type=cache,target=/workspace/vllm/.deps,sharing=locked \
|
--mount=type=cache,target=/vllm-workspace/.deps,sharing=locked \
|
||||||
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38
|
VLLM_TARGET_DEVICE=cpu python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38
|
||||||
|
|
||||||
######################### TEST DEPS #########################
|
######################### TEST DEPS #########################
|
||||||
FROM base AS vllm-test-deps
|
FROM base AS vllm-test-deps
|
||||||
|
|
||||||
WORKDIR /workspace/vllm
|
WORKDIR /vllm-workspace
|
||||||
|
|
||||||
# Copy test requirements
|
# Copy test requirements
|
||||||
COPY requirements/test.in requirements/cpu-test.in
|
COPY requirements/test.in requirements/cpu-test.in
|
||||||
@@ -138,12 +138,12 @@ COPY requirements/test.in requirements/cpu-test.in
|
|||||||
RUN \
|
RUN \
|
||||||
sed -i '/mamba_ssm/d' requirements/cpu-test.in && \
|
sed -i '/mamba_ssm/d' requirements/cpu-test.in && \
|
||||||
remove_packages_not_supported_on_aarch64() { \
|
remove_packages_not_supported_on_aarch64() { \
|
||||||
case "$(uname -m)" in \
|
case "$(uname -m)" in \
|
||||||
aarch64|arm64) \
|
aarch64|arm64) \
|
||||||
sed -i '/decord/d' requirements/cpu-test.in; \
|
sed -i '/decord/d' requirements/cpu-test.in; \
|
||||||
sed -i '/terratorch/d' requirements/cpu-test.in; \
|
sed -i '/terratorch/d' requirements/cpu-test.in; \
|
||||||
;; \
|
;; \
|
||||||
esac; \
|
esac; \
|
||||||
}; \
|
}; \
|
||||||
remove_packages_not_supported_on_aarch64 && \
|
remove_packages_not_supported_on_aarch64 && \
|
||||||
sed -i 's/^torch==.*/torch==2.10.0/g' requirements/cpu-test.in && \
|
sed -i 's/^torch==.*/torch==2.10.0/g' requirements/cpu-test.in && \
|
||||||
@@ -157,7 +157,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
######################### DEV IMAGE #########################
|
######################### DEV IMAGE #########################
|
||||||
FROM vllm-build AS vllm-dev
|
FROM vllm-build AS vllm-dev
|
||||||
|
|
||||||
WORKDIR /workspace/vllm
|
WORKDIR /vllm-workspace
|
||||||
|
|
||||||
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
|
RUN --mount=type=cache,target=/var/cache/apt,sharing=locked \
|
||||||
--mount=type=cache,target=/var/lib/apt,sharing=locked \
|
--mount=type=cache,target=/var/lib/apt,sharing=locked \
|
||||||
@@ -174,7 +174,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
--mount=type=bind,source=.git,target=.git \
|
--mount=type=bind,source=.git,target=.git \
|
||||||
VLLM_TARGET_DEVICE=cpu python3 setup.py develop
|
VLLM_TARGET_DEVICE=cpu python3 setup.py develop
|
||||||
|
|
||||||
COPY --from=vllm-test-deps /workspace/vllm/requirements/cpu-test.txt requirements/test.txt
|
COPY --from=vllm-test-deps /vllm-workspace/requirements/cpu-test.txt requirements/test.txt
|
||||||
|
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
uv pip install -r requirements/dev.txt && \
|
uv pip install -r requirements/dev.txt && \
|
||||||
@@ -185,10 +185,10 @@ ENTRYPOINT ["bash"]
|
|||||||
######################### TEST IMAGE #########################
|
######################### TEST IMAGE #########################
|
||||||
FROM vllm-test-deps AS vllm-test
|
FROM vllm-test-deps AS vllm-test
|
||||||
|
|
||||||
WORKDIR /workspace/
|
WORKDIR /vllm-workspace
|
||||||
|
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
--mount=type=bind,from=vllm-build,src=/workspace/vllm/dist,target=dist \
|
--mount=type=bind,from=vllm-build,src=/vllm-workspace/dist,target=dist \
|
||||||
uv pip install dist/*.whl
|
uv pip install dist/*.whl
|
||||||
|
|
||||||
ADD ./tests/ ./tests/
|
ADD ./tests/ ./tests/
|
||||||
@@ -197,9 +197,6 @@ ADD ./benchmarks/ ./benchmarks/
|
|||||||
ADD ./vllm/collect_env.py .
|
ADD ./vllm/collect_env.py .
|
||||||
ADD ./.buildkite/ ./.buildkite/
|
ADD ./.buildkite/ ./.buildkite/
|
||||||
|
|
||||||
# Create symlink for vllm-workspace to maintain CI compatibility
|
|
||||||
RUN ln -sf /workspace /vllm-workspace
|
|
||||||
|
|
||||||
# install development dependencies (for testing)
|
# install development dependencies (for testing)
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
uv pip install -e tests/vllm_test_utils
|
uv pip install -e tests/vllm_test_utils
|
||||||
@@ -207,11 +204,11 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
######################### RELEASE IMAGE #########################
|
######################### RELEASE IMAGE #########################
|
||||||
FROM base AS vllm-openai
|
FROM base AS vllm-openai
|
||||||
|
|
||||||
WORKDIR /workspace/
|
WORKDIR /vllm-workspace
|
||||||
|
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
--mount=type=cache,target=/root/.cache/ccache \
|
--mount=type=cache,target=/root/.cache/ccache \
|
||||||
--mount=type=bind,from=vllm-build,src=/workspace/vllm/dist,target=dist \
|
--mount=type=bind,from=vllm-build,src=/vllm-workspace/dist,target=dist \
|
||||||
uv pip install dist/*.whl
|
uv pip install dist/*.whl
|
||||||
|
|
||||||
# Add labels to document build configuration
|
# Add labels to document build configuration
|
||||||
|
|||||||
@@ -20,9 +20,7 @@ ARG PYTHON_VERSION=3.12
|
|||||||
ARG TARGETPLATFORM
|
ARG TARGETPLATFORM
|
||||||
ENV DEBIAN_FRONTEND=noninteractive
|
ENV DEBIAN_FRONTEND=noninteractive
|
||||||
# Install Python and other dependencies
|
# Install Python and other dependencies
|
||||||
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
|
RUN apt-get update -y \
|
||||||
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
|
|
||||||
&& apt-get update -y \
|
|
||||||
&& apt-get install -y ccache software-properties-common git curl sudo \
|
&& apt-get install -y ccache software-properties-common git curl sudo \
|
||||||
&& for i in 1 2 3; do \
|
&& for i in 1 2 3; do \
|
||||||
add-apt-repository -y ppa:deadsnakes/ppa && break || \
|
add-apt-repository -y ppa:deadsnakes/ppa && break || \
|
||||||
@@ -172,9 +170,7 @@ RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \
|
|||||||
echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment
|
echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment
|
||||||
|
|
||||||
# Install Python and other dependencies
|
# Install Python and other dependencies
|
||||||
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
|
RUN apt-get update -y \
|
||||||
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
|
|
||||||
&& apt-get update -y \
|
|
||||||
&& apt-get install -y ccache software-properties-common git curl wget sudo vim python3-pip \
|
&& apt-get install -y ccache software-properties-common git curl wget sudo vim python3-pip \
|
||||||
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
|
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
|
||||||
&& for i in 1 2 3; do \
|
&& for i in 1 2 3; do \
|
||||||
@@ -221,13 +217,13 @@ RUN pip install setuptools==75.6.0 packaging==23.2 ninja==1.11.1.3 build==1.2.2.
|
|||||||
|
|
||||||
|
|
||||||
# build flashinfer for torch nightly from source around 10 mins
|
# build flashinfer for torch nightly from source around 10 mins
|
||||||
# release version: v0.6.1
|
# release version: v0.6.2
|
||||||
# todo(elainewy): cache flashinfer build result for faster build
|
# todo(elainewy): cache flashinfer build result for faster build
|
||||||
ENV CCACHE_DIR=/root/.cache/ccache
|
ENV CCACHE_DIR=/root/.cache/ccache
|
||||||
RUN --mount=type=cache,target=/root/.cache/ccache \
|
RUN --mount=type=cache,target=/root/.cache/ccache \
|
||||||
--mount=type=cache,target=/root/.cache/uv \
|
--mount=type=cache,target=/root/.cache/uv \
|
||||||
echo "git clone flashinfer..." \
|
echo "git clone flashinfer..." \
|
||||||
&& git clone --depth 1 --branch v0.6.1 --recursive https://github.com/flashinfer-ai/flashinfer.git \
|
&& git clone --depth 1 --branch v0.6.2 --recursive https://github.com/flashinfer-ai/flashinfer.git \
|
||||||
&& cd flashinfer \
|
&& cd flashinfer \
|
||||||
&& git submodule update --init --recursive \
|
&& git submodule update --init --recursive \
|
||||||
&& echo "finish git clone flashinfer..." \
|
&& echo "finish git clone flashinfer..." \
|
||||||
|
|||||||
@@ -15,8 +15,6 @@ FROM ${BASE_IMAGE} AS base
|
|||||||
|
|
||||||
ARG ARG_PYTORCH_ROCM_ARCH
|
ARG ARG_PYTORCH_ROCM_ARCH
|
||||||
ENV PYTORCH_ROCM_ARCH=${ARG_PYTORCH_ROCM_ARCH:-${PYTORCH_ROCM_ARCH}}
|
ENV PYTORCH_ROCM_ARCH=${ARG_PYTORCH_ROCM_ARCH:-${PYTORCH_ROCM_ARCH}}
|
||||||
ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
|
|
||||||
ENV RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
|
|
||||||
|
|
||||||
# Install some basic utilities
|
# Install some basic utilities
|
||||||
RUN apt-get update -q -y && apt-get install -q -y \
|
RUN apt-get update -q -y && apt-get install -q -y \
|
||||||
@@ -227,7 +225,7 @@ RUN if [ "$GIT_REPO_CHECK" != "0" ]; then \
|
|||||||
# This ensures setuptools_scm sees clean repo state for version detection
|
# This ensures setuptools_scm sees clean repo state for version detection
|
||||||
RUN --mount=type=bind,source=.git,target=vllm/.git \
|
RUN --mount=type=bind,source=.git,target=vllm/.git \
|
||||||
cd vllm \
|
cd vllm \
|
||||||
&& pip install setuptools_scm \
|
&& pip install setuptools_scm regex \
|
||||||
&& VLLM_VERSION=$(python3 -c "import setuptools_scm; print(setuptools_scm.get_version())") \
|
&& VLLM_VERSION=$(python3 -c "import setuptools_scm; print(setuptools_scm.get_version())") \
|
||||||
&& echo "Detected vLLM version: ${VLLM_VERSION}" \
|
&& echo "Detected vLLM version: ${VLLM_VERSION}" \
|
||||||
&& echo "${VLLM_VERSION}" > /tmp/vllm_version.txt
|
&& echo "${VLLM_VERSION}" > /tmp/vllm_version.txt
|
||||||
@@ -342,6 +340,19 @@ RUN mkdir src && mv vllm src/vllm
|
|||||||
FROM base AS final
|
FROM base AS final
|
||||||
|
|
||||||
RUN python3 -m pip install --upgrade pip && rm -rf /var/lib/apt/lists/*
|
RUN python3 -m pip install --upgrade pip && rm -rf /var/lib/apt/lists/*
|
||||||
|
|
||||||
|
# Clean up sccache from release image (not needed at runtime)
|
||||||
|
# This removes the binary and wrappers that may have been installed during build
|
||||||
|
RUN rm -f /usr/bin/sccache || true \
|
||||||
|
&& rm -rf /opt/sccache-wrappers || true
|
||||||
|
|
||||||
|
# Unset sccache environment variables for the release image
|
||||||
|
# This prevents S3 bucket config from leaking into production images
|
||||||
|
ENV SCCACHE_BUCKET=
|
||||||
|
ENV SCCACHE_REGION=
|
||||||
|
ENV SCCACHE_S3_NO_CREDENTIALS=
|
||||||
|
ENV SCCACHE_IDLE_TIMEOUT=
|
||||||
|
|
||||||
# Error related to odd state for numpy 1.20.3 where there is no METADATA etc, but an extra LICENSES_bundled.txt.
|
# Error related to odd state for numpy 1.20.3 where there is no METADATA etc, but an extra LICENSES_bundled.txt.
|
||||||
# Manually remove it so that later steps of numpy upgrade can continue
|
# Manually remove it so that later steps of numpy upgrade can continue
|
||||||
RUN case "$(which python3)" in \
|
RUN case "$(which python3)" in \
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
# Base UBI image for s390x architecture
|
# Base UBI image for s390x architecture
|
||||||
ARG BASE_UBI_IMAGE_TAG=9.5-1736404155
|
ARG BASE_UBI_IMAGE_TAG=9.6
|
||||||
ARG PYTHON_VERSION=3.12
|
ARG PYTHON_VERSION=3.12
|
||||||
FROM registry.access.redhat.com/ubi9/ubi-minimal:${BASE_UBI_IMAGE_TAG} AS base
|
FROM registry.access.redhat.com/ubi9/ubi-minimal:${BASE_UBI_IMAGE_TAG} AS base
|
||||||
|
|
||||||
@@ -14,12 +14,18 @@ ENV LANG=C.UTF-8 \
|
|||||||
|
|
||||||
# Install development utilities
|
# Install development utilities
|
||||||
RUN microdnf install -y \
|
RUN microdnf install -y \
|
||||||
which procps findutils tar vim git gcc-toolset-14 gcc-toolset-14-libatomic-devel patch zlib-devel \
|
which procps findutils tar vim git gcc-toolset-14 gcc-toolset-14-binutils gcc-toolset-14-libatomic-devel patch zlib-devel \
|
||||||
libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
|
libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
|
||||||
openssl-devel openblas openblas-devel autoconf automake libtool cmake numpy libsndfile \
|
openssl-devel openblas openblas-devel autoconf automake libtool cmake numpy libsndfile \
|
||||||
clang llvm-devel llvm-static clang-devel && \
|
clang llvm-devel llvm-static clang-devel && \
|
||||||
microdnf clean all
|
microdnf clean all
|
||||||
|
|
||||||
|
ENV GCC_TOOLSET_ROOT=/opt/rh/gcc-toolset-14/root \
|
||||||
|
PATH=/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/bin:/usr/bin:/bin \
|
||||||
|
LD_LIBRARY_PATH=/opt/rh/gcc-toolset-14/root/usr/lib64:/usr/local/lib:/usr/lib64 \
|
||||||
|
LIBRARY_PATH=/opt/rh/gcc-toolset-14/root/usr/lib64 \
|
||||||
|
PKG_CONFIG_PATH=/opt/rh/gcc-toolset-14/root/usr/lib64/pkgconfig
|
||||||
|
|
||||||
# Python Installation
|
# Python Installation
|
||||||
FROM base AS python-install
|
FROM base AS python-install
|
||||||
ARG PYTHON_VERSION
|
ARG PYTHON_VERSION
|
||||||
@@ -87,13 +93,13 @@ RUN curl https://sh.rustup.rs -sSf | sh -s -- -y && \
|
|||||||
|
|
||||||
FROM python-install AS torch-vision
|
FROM python-install AS torch-vision
|
||||||
# Install torchvision
|
# Install torchvision
|
||||||
ARG TORCH_VISION_VERSION=v0.23.0
|
ARG TORCH_VISION_VERSION=v0.25.0
|
||||||
WORKDIR /tmp
|
WORKDIR /tmp
|
||||||
RUN --mount=type=cache,target=/root/.cache/uv \
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
git clone https://github.com/pytorch/vision.git && \
|
git clone https://github.com/pytorch/vision.git && \
|
||||||
cd vision && \
|
cd vision && \
|
||||||
git checkout $TORCH_VISION_VERSION && \
|
git checkout $TORCH_VISION_VERSION && \
|
||||||
uv pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cpu && \
|
uv pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cpu && \
|
||||||
python setup.py bdist_wheel
|
python setup.py bdist_wheel
|
||||||
|
|
||||||
FROM python-install AS hf-xet-builder
|
FROM python-install AS hf-xet-builder
|
||||||
@@ -174,7 +180,19 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
if ! grep '#include "dynamic_annotations.h"' numba/_dispatcher.cpp; then \
|
if ! grep '#include "dynamic_annotations.h"' numba/_dispatcher.cpp; then \
|
||||||
sed -i '/#include "internal\/pycore_atomic.h"/i\#include "dynamic_annotations.h"' numba/_dispatcher.cpp; \
|
sed -i '/#include "internal\/pycore_atomic.h"/i\#include "dynamic_annotations.h"' numba/_dispatcher.cpp; \
|
||||||
fi && python setup.py bdist_wheel
|
fi && python setup.py bdist_wheel
|
||||||
|
|
||||||
|
# Build OpenCV from source for s390x
|
||||||
|
FROM python-install AS opencv-builder
|
||||||
|
WORKDIR /tmp
|
||||||
|
ARG MAX_JOBS
|
||||||
|
ARG OPENCV_VERSION=90
|
||||||
|
ARG ENABLE_HEADLESS=1
|
||||||
|
RUN --mount=type=cache,target=/root/.cache/uv \
|
||||||
|
uv pip install numpy setuptools wheel scikit_build build && \
|
||||||
|
git clone --recursive https://github.com/opencv/opencv-python.git -b ${OPENCV_VERSION} && \
|
||||||
|
cd opencv-python && \
|
||||||
|
python -m build --wheel --installer=uv --outdir /tmp/opencv-python/dist
|
||||||
|
|
||||||
# Build Outlines Core
|
# Build Outlines Core
|
||||||
FROM python-install AS outlines-core-builder
|
FROM python-install AS outlines-core-builder
|
||||||
WORKDIR /tmp
|
WORKDIR /tmp
|
||||||
@@ -198,7 +216,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
# Final build stage
|
# Final build stage
|
||||||
FROM python-install AS vllm-cpu
|
FROM python-install AS vllm-cpu
|
||||||
ARG PYTHON_VERSION
|
ARG PYTHON_VERSION
|
||||||
|
ARG PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
|
||||||
# Set correct library path for torch and numactl
|
# Set correct library path for torch and numactl
|
||||||
ENV LD_LIBRARY_PATH="/opt/vllm/lib64/python${PYTHON_VERSION}/site-packages/torch/lib:/usr/local/lib:/opt/rh/gcc-toolset-14/root/usr/lib64:$LD_LIBRARY_PATH"
|
ENV LD_LIBRARY_PATH="/opt/vllm/lib64/python${PYTHON_VERSION}/site-packages/torch/lib:/usr/local/lib:/opt/rh/gcc-toolset-14/root/usr/lib64:$LD_LIBRARY_PATH"
|
||||||
ENV C_INCLUDE_PATH="/usr/local/include:$C_INCLUDE_PATH"
|
ENV C_INCLUDE_PATH="/usr/local/include:$C_INCLUDE_PATH"
|
||||||
@@ -209,7 +227,8 @@ ENV GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=1
|
|||||||
ENV PCP_DIR=/opt/rh/gcc-toolset-14/root
|
ENV PCP_DIR=/opt/rh/gcc-toolset-14/root
|
||||||
ENV PKG_CONFIG_PATH="/opt/rh/gcc-toolset-14/root/usr/lib64/pkgconfig:/usr/local/lib/pkgconfig/"
|
ENV PKG_CONFIG_PATH="/opt/rh/gcc-toolset-14/root/usr/lib64/pkgconfig:/usr/local/lib/pkgconfig/"
|
||||||
ENV PATH="${VIRTUAL_ENV:+${VIRTUAL_ENV}/bin}:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/bin:$CARGO_HOME/bin:$RUSTUP_HOME/bin:$PATH"
|
ENV PATH="${VIRTUAL_ENV:+${VIRTUAL_ENV}/bin}:/opt/rh/gcc-toolset-14/root/usr/bin:/usr/local/bin:$CARGO_HOME/bin:$RUSTUP_HOME/bin:$PATH"
|
||||||
|
ENV PIP_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
|
||||||
|
ENV UV_EXTRA_INDEX_URL=${PIP_EXTRA_INDEX_URL}
|
||||||
COPY . /workspace/vllm
|
COPY . /workspace/vllm
|
||||||
WORKDIR /workspace/vllm
|
WORKDIR /workspace/vllm
|
||||||
|
|
||||||
@@ -225,23 +244,25 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
--mount=type=bind,from=hf-xet-builder,source=/tmp/hf-xet/dist,target=/tmp/hf-xet-wheels/ \
|
--mount=type=bind,from=hf-xet-builder,source=/tmp/hf-xet/dist,target=/tmp/hf-xet-wheels/ \
|
||||||
--mount=type=bind,from=numba-builder,source=/tmp/llvmlite/dist,target=/tmp/llvmlite-wheels/ \
|
--mount=type=bind,from=numba-builder,source=/tmp/llvmlite/dist,target=/tmp/llvmlite-wheels/ \
|
||||||
--mount=type=bind,from=numba-builder,source=/tmp/numba/dist,target=/tmp/numba-wheels/ \
|
--mount=type=bind,from=numba-builder,source=/tmp/numba/dist,target=/tmp/numba-wheels/ \
|
||||||
|
--mount=type=bind,from=opencv-builder,source=/tmp/opencv-python/dist,target=/tmp/opencv-wheels/ \
|
||||||
--mount=type=bind,from=outlines-core-builder,source=/tmp/outlines-core/dist,target=/tmp/outlines-core/dist/ \
|
--mount=type=bind,from=outlines-core-builder,source=/tmp/outlines-core/dist,target=/tmp/outlines-core/dist/ \
|
||||||
sed -i '/^torch/d' requirements/build.txt && \
|
|
||||||
ARROW_WHL_FILE=$(ls /tmp/arrow-wheels/pyarrow-*.whl) && \
|
ARROW_WHL_FILE=$(ls /tmp/arrow-wheels/pyarrow-*.whl) && \
|
||||||
VISION_WHL_FILE=$(ls /tmp/vision-wheels/*.whl) && \
|
VISION_WHL_FILE=$(ls /tmp/vision-wheels/*.whl) && \
|
||||||
HF_XET_WHL_FILE=$(ls /tmp/hf-xet-wheels/*.whl) && \
|
HF_XET_WHL_FILE=$(ls /tmp/hf-xet-wheels/*.whl) && \
|
||||||
LLVM_WHL_FILE=$(ls /tmp/llvmlite-wheels/*.whl) && \
|
LLVM_WHL_FILE=$(ls /tmp/llvmlite-wheels/*.whl) && \
|
||||||
NUMBA_WHL_FILE=$(ls /tmp/numba-wheels/*.whl) && \
|
NUMBA_WHL_FILE=$(ls /tmp/numba-wheels/*.whl) && \
|
||||||
|
OPENCV_WHL_FILE=$(ls /tmp/opencv-wheels/*.whl) && \
|
||||||
OUTLINES_CORE_WHL_FILE=$(ls /tmp/outlines-core/dist/*.whl) && \
|
OUTLINES_CORE_WHL_FILE=$(ls /tmp/outlines-core/dist/*.whl) && \
|
||||||
uv pip install -v \
|
uv pip install -v \
|
||||||
$ARROW_WHL_FILE \
|
$ARROW_WHL_FILE \
|
||||||
$VISION_WHL_FILE \
|
$VISION_WHL_FILE \
|
||||||
$HF_XET_WHL_FILE \
|
$HF_XET_WHL_FILE \
|
||||||
$LLVM_WHL_FILE \
|
$LLVM_WHL_FILE \
|
||||||
$NUMBA_WHL_FILE \
|
$NUMBA_WHL_FILE \
|
||||||
|
$OPENCV_WHL_FILE \
|
||||||
$OUTLINES_CORE_WHL_FILE \
|
$OUTLINES_CORE_WHL_FILE \
|
||||||
--index-strategy unsafe-best-match \
|
--index-strategy unsafe-best-match \
|
||||||
-r requirements/build.txt \
|
-r requirements/cpu-build.txt \
|
||||||
-r requirements/cpu.txt
|
-r requirements/cpu.txt
|
||||||
|
|
||||||
|
|
||||||
@@ -252,7 +273,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
|
|||||||
|
|
||||||
# setup non-root user for vllm
|
# setup non-root user for vllm
|
||||||
RUN umask 002 && \
|
RUN umask 002 && \
|
||||||
useradd --uid 2000 --gid 0 vllm && \
|
/usr/sbin/useradd --uid 2000 --gid 0 vllm && \
|
||||||
mkdir -p /home/vllm && \
|
mkdir -p /home/vllm && \
|
||||||
chmod g+rwx /home/vllm
|
chmod g+rwx /home/vllm
|
||||||
|
|
||||||
|
|||||||
@@ -1,8 +1,8 @@
|
|||||||
FROM intel/deep-learning-essentials:2025.2.2-0-devel-ubuntu24.04 AS vllm-base
|
FROM intel/deep-learning-essentials:2025.3.2-0-devel-ubuntu24.04 AS vllm-base
|
||||||
|
|
||||||
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && \
|
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && \
|
||||||
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && \
|
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && \
|
||||||
add-apt-repository -y ppa:kobuk-team/intel-graphics-staging
|
add-apt-repository -y ppa:kobuk-team/intel-graphics
|
||||||
|
|
||||||
RUN apt clean && apt-get update -y && \
|
RUN apt clean && apt-get update -y && \
|
||||||
apt-get install -y --no-install-recommends --fix-missing \
|
apt-get install -y --no-install-recommends --fix-missing \
|
||||||
@@ -25,10 +25,13 @@ RUN apt clean && apt-get update -y && \
|
|||||||
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 1
|
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 1
|
||||||
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.12 1
|
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.12 1
|
||||||
|
|
||||||
RUN apt install -y libze1 libze-dev libze-intel-gpu1 intel-opencl-icd libze-intel-gpu-raytracing intel-ocloc
|
RUN apt update && apt upgrade -y && \
|
||||||
|
apt install -y libze1 libze-dev libze-intel-gpu1 intel-opencl-icd libze-intel-gpu-raytracing intel-ocloc && \
|
||||||
|
apt install -y intel-oneapi-compiler-dpcpp-cpp-2025.3
|
||||||
|
|
||||||
|
|
||||||
# This oneccl contains the BMG support which is not the case for default version of oneapi 2025.2.
|
# This oneccl contains the BMG support which is not the case for default version of oneapi 2025.2.
|
||||||
ARG ONECCL_INSTALLER="intel-oneccl-2021.15.7.6_offline.sh"
|
ARG ONECCL_INSTALLER="intel-oneccl-2021.15.7.8_offline.sh"
|
||||||
RUN wget "https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.7/${ONECCL_INSTALLER}" && \
|
RUN wget "https://github.com/uxlfoundation/oneCCL/releases/download/2021.15.7/${ONECCL_INSTALLER}" && \
|
||||||
bash "${ONECCL_INSTALLER}" -a --silent --eula accept && \
|
bash "${ONECCL_INSTALLER}" -a --silent --eula accept && \
|
||||||
rm "${ONECCL_INSTALLER}" && \
|
rm "${ONECCL_INSTALLER}" && \
|
||||||
@@ -85,6 +88,9 @@ RUN python3 -m pip install -e tests/vllm_test_utils
|
|||||||
ENV NIXL_VERSION=0.7.0
|
ENV NIXL_VERSION=0.7.0
|
||||||
RUN python3 /workspace/vllm/tools/install_nixl_from_source_ubuntu.py
|
RUN python3 /workspace/vllm/tools/install_nixl_from_source_ubuntu.py
|
||||||
|
|
||||||
|
# FIX triton
|
||||||
|
RUN --mount=type=cache,target=/root/.cache/pip pip uninstall triton triton-xpu -y && pip install triton-xpu==3.6.0 --extra-index-url=https://download.pytorch.org/whl/xpu
|
||||||
|
|
||||||
# PyJWT-2.7.0 will influence some wheel behaviors, remove its dist-info to avoid conflicts
|
# PyJWT-2.7.0 will influence some wheel behaviors, remove its dist-info to avoid conflicts
|
||||||
RUN rm /usr/lib/python3/dist-packages/PyJWT-2.7.0.dist-info/ -rf
|
RUN rm /usr/lib/python3/dist-packages/PyJWT-2.7.0.dist-info/ -rf
|
||||||
|
|
||||||
|
|||||||
@@ -68,7 +68,7 @@
|
|||||||
"default": "true"
|
"default": "true"
|
||||||
},
|
},
|
||||||
"FLASHINFER_VERSION": {
|
"FLASHINFER_VERSION": {
|
||||||
"default": "0.6.1"
|
"default": "0.6.2"
|
||||||
},
|
},
|
||||||
"GDRCOPY_CUDA_VERSION": {
|
"GDRCOPY_CUDA_VERSION": {
|
||||||
"default": "12.8"
|
"default": "12.8"
|
||||||
|
|||||||
@@ -32,6 +32,7 @@ th {
|
|||||||
| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
|
| HuggingFace-Blazedit | ✅ | ✅ | `vdaita/edit_5k_char`, `vdaita/edit_10k_char` |
|
||||||
| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
|
| Spec Bench | ✅ | ✅ | `wget https://raw.githubusercontent.com/hemingkx/Spec-Bench/refs/heads/main/data/spec_bench/question.jsonl` |
|
||||||
| Custom | ✅ | ✅ | Local file: `data.jsonl` |
|
| Custom | ✅ | ✅ | Local file: `data.jsonl` |
|
||||||
|
| Custom MM | ✅ | ✅ | Local file: `mm_data.jsonl` |
|
||||||
|
|
||||||
Legend:
|
Legend:
|
||||||
|
|
||||||
@@ -133,6 +134,33 @@ vllm bench serve --port 9001 --save-result --save-detailed \
|
|||||||
|
|
||||||
You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`.
|
You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`.
|
||||||
|
|
||||||
|
#### Custom multimodal dataset
|
||||||
|
|
||||||
|
If the multimodal dataset you want to benchmark is not yet supported in vLLM, you can benchmark it using `CustomMMDataset`. Your data must be in `.jsonl` format, and each entry must have "prompt" and "image_files" fields, e.g., `mm_data.jsonl`:
|
||||||
|
|
||||||
|
```json
|
||||||
|
{"prompt": "How many animals are present in the given image?", "image_files": ["/path/to/image/folder/horsepony.jpg"]}
|
||||||
|
{"prompt": "What colour is the bird shown in the image?", "image_files": ["/path/to/image/folder/flycatcher.jpeg"]}
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# need a model with vision capability here
|
||||||
|
vllm serve Qwen/Qwen2-VL-7B-Instruct
|
||||||
|
```
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# run benchmarking script
|
||||||
|
vllm bench serve --save-result --save-detailed \
|
||||||
|
--backend openai-chat \
|
||||||
|
--model Qwen/Qwen2-VL-7B-Instruct \
|
||||||
|
--endpoint /v1/chat/completions \
|
||||||
|
--dataset-name custom_mm \
|
||||||
|
--dataset-path <path-to-your-mm-data-jsonl> \
|
||||||
|
--allowed-local-media-path /path/to/image/folder
|
||||||
|
```
|
||||||
|
|
||||||
|
Note that we need to use the `openai-chat` backend and `/v1/chat/completions` endpoint for multimodal inputs.
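The `mm_data.jsonl` layout described above (one JSON object per line, each with "prompt" and "image_files") can be produced with a few lines of Python. This is a minimal sketch, not part of vLLM: the helper name `write_mm_jsonl` and its arguments are hypothetical, and it only writes the file without checking that the image paths exist.

```python
import json
from pathlib import Path


def write_mm_jsonl(entries, out_path):
    """Write (prompt, image_files) pairs as one JSON object per line.

    `write_mm_jsonl` is an illustrative helper, not a vLLM API.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for prompt, image_files in entries:
            # Field names follow the format described in the text above.
            record = {"prompt": prompt, "image_files": list(image_files)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path
```

For example, `write_mm_jsonl([("What colour is the bird shown in the image?", ["/path/to/image/folder/flycatcher.jpeg"])], "mm_data.jsonl")` writes a single-entry file that can be passed to `--dataset-path`.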
|
||||||
|
|
||||||
#### VisionArena Benchmark for Vision Language Models
|
#### VisionArena Benchmark for Vision Language Models
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|||||||
@@ -82,7 +82,7 @@ vllm bench sweep serve \
|
|||||||
You can use `--dry-run` to preview the commands to be run.
|
You can use `--dry-run` to preview the commands to be run.
|
||||||
|
|
||||||
We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
|
We only start the server once for each `--serve-params`, and keep it running for multiple `--bench-params`.
|
||||||
Between each benchmark run, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to get a clean slate for the next run.
|
Between each benchmark run, we call all `/reset_*_cache` endpoints to get a clean slate for the next run.
|
||||||
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
|
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
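To illustrate what resetting state between runs amounts to, the sketch below builds and sends requests to two reset endpoints. It uses `/reset_prefix_cache` and `/reset_mm_cache`, the two names given in the earlier wording of this paragraph; the use of HTTP POST, the base URL, and the helper names are assumptions for illustration, and the sweep tool's actual implementation may differ.

```python
import urllib.request

# Endpoint names from the earlier wording of the paragraph; the list is an
# assumption and may not cover every `/reset_*_cache` endpoint.
RESET_ENDPOINTS = ("/reset_prefix_cache", "/reset_mm_cache")


def build_reset_requests(base_url, endpoints=RESET_ENDPOINTS):
    """Return one POST request per reset endpoint (nothing is sent yet)."""
    base = base_url.rstrip("/")
    return [
        urllib.request.Request(base + ep, data=b"", method="POST")
        for ep in endpoints
    ]


def reset_caches(base_url, timeout=30.0):
    """Send each reset request in turn; raises on HTTP or connection errors."""
    for req in build_reset_requests(base_url):
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
```

A custom `--after-bench-cmd` could wrap a script like this, calling `reset_caches("http://localhost:8000")` against whatever address the server listens on.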
|
||||||
|
|
||||||
!!! note
|
!!! note
|
||||||
|
|||||||
@@ -288,5 +288,11 @@ Based on the configuration, the content of the multi-modal caches on `P0` and `P
|
|||||||
| shm | Shared Memory Caching | K | N/A | V | `mm_processor_cache_gb * api_server_count` |
|
| shm | Shared Memory Caching | K | N/A | V | `mm_processor_cache_gb * api_server_count` |
|
||||||
| N/A | Disabled | N/A | N/A | N/A | `0` |
|
| N/A | Disabled | N/A | N/A | N/A | `0` |
|
||||||
|
|
||||||
K: Stores the hashes of multi-modal items
|
K: Stores the hashes of multi-modal items
|
||||||
V: Stores the processed tensor data of multi-modal items
|
V: Stores the processed tensor data of multi-modal items
|
||||||
|
|
||||||
|
## Attention Backend Selection
|
||||||
|
|
||||||
|
vLLM supports multiple attention backends optimized for different hardware and use cases. The backend is automatically selected based on your GPU architecture, model type, and configuration, but you can also manually specify one for optimal performance.
|
||||||
|
|
||||||
|
For detailed information on available backends, their feature support, and how to configure them, see the [Attention Backend Feature Support](../design/attention_backends.md) documentation.
|
||||||
|
|||||||
Some files were not shown because too many files have changed in this diff.