[Lora] Support long context lora (#4787)

Currently the rotary embedding kernel has to be called once per LoRA, which makes it hard to serve multiple long-context LoRAs in the same batch. Add a batched rotary embedding kernel and pipe it through.

This replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
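To illustrate the idea, here is a minimal pure-Python sketch, not the code from this commit. It assumes linear position scaling and a NeoX-style rotation (first half of the vector paired with the second half). The names `build_cache`, `build_combined_cache`, `batched_rotary`, and `token_offsets` are hypothetical. The per-scaling-factor caches are concatenated into one table, and each token carries an offset into that table, so a single pass can rotate tokens that belong to LoRAs with different scaling factors.

```python
import math


def build_cache(rot_dim, max_pos, scaling_factor, base=10000.0):
    """Cos/sin rows for linearly scaled positions (assumed scaling scheme)."""
    inv_freq = [base ** (-2.0 * i / rot_dim) for i in range(rot_dim // 2)]
    rows = []
    # A scaling factor of k stretches the usable context to max_pos * k.
    for pos in range(int(max_pos * scaling_factor)):
        angles = [(pos / scaling_factor) * f for f in inv_freq]
        rows.append(([math.cos(a) for a in angles],
                     [math.sin(a) for a in angles]))
    return rows


def build_combined_cache(rot_dim, max_pos, scaling_factors):
    """Concatenate one cache per scaling factor; remember where each starts."""
    cache, offsets = [], {}
    for sf in scaling_factors:
        offsets[sf] = len(cache)
        cache.extend(build_cache(rot_dim, max_pos, sf))
    return cache, offsets


def batched_rotary(vectors, positions, token_offsets, cache):
    """Rotate every token in one pass, each using its own cache region."""
    out = []
    for vec, pos, off in zip(vectors, positions, token_offsets):
        cos, sin = cache[off + pos]
        half = len(vec) // 2
        x1, x2 = vec[:half], vec[half:]
        rotated = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        rotated += [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(rotated)
    return out


# Two tokens in one batch: one from a LoRA with scaling factor 1.0,
# one from a long-context LoRA with scaling factor 4.0.
cache, offsets = build_combined_cache(rot_dim=4, max_pos=8,
                                      scaling_factors=[1.0, 4.0])
vec = [1.0, 2.0, 3.0, 4.0]
result = batched_rotary([vec, vec], [1, 4],
                        [offsets[1.0], offsets[4.0]], cache)
```

In this sketch, position 4 under scaling factor 4.0 maps to the same angles as position 1 under scaling factor 1.0, so both tokens receive the same rotation even though they read from different regions of the combined table.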

Follow-up to https://github.com/vllm-project/vllm/pull/3095/files

SangBin Cho, 2024-05-18 16:05:23 +09:00, committed by GitHub

parent c0724fc915
commit 2e9a2227ec

25 changed files with 998 additions and 71 deletions

tests/lora/data/long_context_test_data.py (Normal file, 97 changes)

File diff suppressed because one or more lines are too long
