Yikang Shen
6eaccb7353
[Model] Add support for IBM Granite Code models ( #4636 )
2024-05-11 21:27:24 -07:00
Cody Yu
a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method ( #4373 )
2024-04-26 16:41:14 -04:00
Caio Mendes
96e90fdeb3
[Model] Adds Phi-3 support ( #4298 )
2024-04-25 03:06:57 +00:00
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
youkaichao
63e7176f26
[Core][Refactor] move parallel_utils into vllm/distributed ( #3950 )
...
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
2024-04-10 15:33:30 -07:00
Kiran R
bc0c0192d1
[Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration ( #3767 )
...
Co-authored-by: roy <jasonailu87@gmail.com>
2024-04-08 19:42:35 +00:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00
Roy
f1c0fc3919
Migrate logits computation and gather to model_runner ( #3233 )
2024-03-20 23:25:01 +00:00
Woosuk Kwon
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Roy
344020c926
Migrate MistralForCausalLM to LlamaForCausalLM ( #2868 )
2024-02-21 18:25:05 -08:00
Philipp Moritz
0c48b37c31
Fix internlm after https://github.com/vllm-project/vllm/pull/2860 ( #2861 )
2024-02-13 18:01:15 -08:00
Philipp Moritz
7eacffd951
Migrate InternLMForCausalLM to LlamaForCausalLM ( #2860 )
...
Co-authored-by: Roy <jasonailu87@gmail.com>
2024-02-13 17:12:05 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral ( #2831 )
...
* add mixtral lora support
* formatting
* fix incorrectly ported logic
* polish tests
* minor fixes and refactoring
* minor fixes
* formatting
* rename and remove redundant logic
* refactoring
* refactoring
* minor fix
* minor refactoring
* fix code smell
2024-02-14 00:55:45 +01:00
Philipp Moritz
ea356004d4
Revert "Refactor llama family models ( #2637 )" ( #2851 )
...
This reverts commit 5c976a7e1a.
2024-02-13 09:24:59 -08:00
Roy
5c976a7e1a
Refactor llama family models ( #2637 )
2024-02-13 00:09:23 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Zhuohan Li
fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead ( #2221 )
2024-01-03 11:30:22 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Woosuk Kwon
24cde76a15
[Minor] Add comment on skipping rope caches ( #2004 )
2023-12-10 10:04:12 -08:00
Jun Gao
3a8c2381f7
Fix for KeyError on Loading LLaMA ( #1978 )
2023-12-09 15:59:57 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata ( #1843 )
2023-11-29 22:16:37 -08:00
Woosuk Kwon
a9e4574261
Refactor Attention ( #1840 )
2023-11-29 15:37:31 -08:00
Woosuk Kwon
7c600440f7
Fix model docstrings ( #1764 )
2023-11-23 23:04:44 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff ( #1665 )
2023-11-20 11:58:01 -08:00
ljss
e1054247ba
[Optimization] Implement fused add rmsnorm ( #1667 )
2023-11-18 18:18:02 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
...
Refactor the tensor parallelism, quantization, and weight-loading codes.
Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
...
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic ( #1181 )
2023-10-02 15:36:09 -07:00
Zhuohan Li
a60b353005
support sharding llama2-70b on more than 8 GPUs ( #1209 )
...
Co-authored-by: JiCheng <247153481@qq.com>
2023-10-02 15:26:33 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling ( #555 )
...
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
Antoni Baum
3302f0aef3
rope_theta and max_position_embeddings from config ( #1096 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00
Woosuk Kwon
cc796b1358
Convert before transpose ( #1073 )
2023-09-18 11:51:48 -07:00
Woosuk Kwon
e3e79e9e8a
Implement AWQ quantization support for LLaMA ( #1032 )
...
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Jasmond L
ab019eea75
Add Model Revision Support ( #1014 )
...
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Zhuohan Li
c957c741d9
Enable safetensors loading for all models ( #974 )
2023-09-07 15:49:52 -07:00
Zhuohan Li
002800f081
Align vLLM's beam search implementation with HF generate ( #857 )
2023-09-04 17:29:42 -07:00
JFDuan
0d93f15694
Accelerate LLaMA model loading ( #234 )
2023-08-30 01:00:13 -07:00
Antoni Baum
4b6f069b6f
Add support for CodeLlama ( #854 )
2023-08-25 12:44:07 -07:00
Zhuohan Li
6fc2a38b11
Add support for LLaMA-2 ( #505 )
2023-07-20 11:38:27 -07:00
panda
7b6ae94059
add vocab padding for LLama(Support WizardLM) ( #411 )
2023-07-13 23:56:22 -04:00
Zhuohan Li
d6fa1be3a8
[Quality] Add code formatter and linter ( #326 )
2023-07-03 11:31:55 -07:00
Woosuk Kwon
0b98ba15c7
Change the name to vLLM ( #150 )
2023-06-17 03:07:40 -07:00