[Lora] Support long context lora (#4787)
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This change adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files
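To illustrate the batching idea described above, here is a minimal pure-Python sketch. It is not the kernel added by this commit. It assumes linear position scaling and a NeoX-style half-split rotation, and every name in it (`build_cache`, `build_batched_cache`, `rotate`, `batched_rope`) is hypothetical. The point it shows: the cos-sin caches for all scaling factors are concatenated into one table, and each token selects its row with a per-factor offset, so a single pass can serve tokens that belong to different LoRAs.

```python
import math


def build_cache(max_pos, rot_dim, scaling_factor, base=10000.0):
    """Return one (cos, sin) row per position, assuming linear position scaling."""
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    cache = []
    for pos in range(max_pos):
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles],
                      [math.sin(a) for a in angles]))
    return cache


def build_batched_cache(max_pos, rot_dim, scaling_factors):
    """Concatenate the per-factor caches and record where each one starts."""
    table, offsets = [], {}
    for factor in scaling_factors:
        offsets[factor] = len(table)
        # Assumption: a factor of k extends the usable context to k * max_pos.
        table.extend(build_cache(max_pos * factor, rot_dim, factor))
    return table, offsets


def rotate(vec, cos, sin):
    """NeoX-style rotation: pair element i with element i + half."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    out1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    out2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return out1 + out2


def batched_rope(vectors, positions, factors, table, offsets):
    """Rotate every token in one pass; each token picks its row via an offset."""
    return [rotate(v, *table[offsets[f] + p])
            for v, p, f in zip(vectors, positions, factors)]


# Two tokens in the same batch, using scaling factors 1 and 4 respectively.
table, offsets = build_batched_cache(max_pos=8, rot_dim=4, scaling_factors=[1, 4])
vectors = [[1.0, 0.0, 0.0, 1.0], [1.0, 2.0, 3.0, 4.0]]
rotated = batched_rope(vectors, positions=[3, 20], factors=[1, 4],
                       table=table, offsets=offsets)
```

A real kernel would do the same lookup on the GPU with a tensor of per-token offsets. The sketch only demonstrates that one concatenated table plus offsets gives the same result as calling a separate rotary embedding per scaling factor.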
@@ -119,9 +119,23 @@ steps:
 
 - label: LoRA Test %N
   #mirror_hardwares: [amd]
-  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
+  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py
   parallelism: 4
 
+- label: LoRA Long Context (Distributed)
+  #mirror_hardwares: [amd]
+  num_gpus: 4
+  # This test runs llama 13B, so it is required to run on 4 GPUs.
+  commands:
+    # Temporarily run this way because we cannot clean up GPU mem usage
+    # for multi GPU tests.
+    # TODO(sang): Fix it.
+    - pytest -v -s lora/test_long_context.py::test_rotary_emb_replaced
+    - pytest -v -s lora/test_long_context.py::test_batched_rope_kernel
+    - pytest -v -s lora/test_long_context.py::test_self_consistency
+    - pytest -v -s lora/test_long_context.py::test_quality
+    - pytest -v -s lora/test_long_context.py::test_max_len
+
 - label: Tensorizer Test
   #mirror_hardwares: [amd]
   command: apt-get install curl libsodium23 && pytest -v -s tensorizer_loader