Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 (#3462)
2024-03-25 04:39:33 +00:00
Roy
f1c0fc3919
Migrate logits computation and gather to model_runner (#3233)
2024-03-20 23:25:01 +00:00
Woosuk Kwon
2daf23ab0c
Separate attention backends (#3005)
2024-03-07 01:45:50 -08:00
Zhuohan Li
fd4ea8ef5c
Use NCCL instead of Ray for control-plane communication to remove serialization overhead (#2221)
2024-01-03 11:30:22 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph (#1926)
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata (#1843)
2023-11-29 22:16:37 -08:00
Woosuk Kwon
7c600440f7
Fix model docstrings (#1764)
2023-11-23 23:04:44 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff (#1665)
2023-11-20 11:58:01 -08:00
Woosuk Kwon
8d17774f92
Add AWQ support for all models (#1714)
2023-11-18 17:56:47 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
...
Refactor the tensor parallelism, quantization, and weight-loading code.
Summary of the new features enabled by this PR:
- **All models** can now be quantized with AWQ and SqueezeLLM, with [GPTQ coming soon](https://github.com/vllm-project/vllm/pull/1580).
- The model loading code is now much simpler.
- Model parallelism is now supported for all MQA/GQA models, even when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181)
2023-10-02 15:36:09 -07:00
Jasmond L
ab019eea75
Add Model Revision Support (#1014)
...
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Zhuohan Li
c957c741d9
Enable safetensors loading for all models (#974)
2023-09-07 15:49:52 -07:00
Zhuohan Li
002800f081
Align vLLM's beam search implementation with HF generate (#857)
2023-09-04 17:29:42 -07:00
JFDuan
0d93f15694
Accelerate LLaMA model loading (#234)
2023-08-30 01:00:13 -07:00
Wen Sun
eedac9dba0
fix: revert code to avoid a "no attribute" error (#827)
2023-08-22 11:55:16 -07:00
zhaoyang-star
4f8584756d
Fix the case where MQA is false in gpt_bigcode (#806)
2023-08-21 22:22:06 -07:00
Zhuohan Li
7d5a155e4a
[Fix] Fix GPTBigCode for distributed execution (#503)
2023-07-24 18:36:33 -07:00
Zhuohan Li
96853af5a8
Optimize MQA Kernel (#452)
2023-07-14 20:06:40 -04:00
Zhuohan Li
d6fa1be3a8
[Quality] Add code formatter and linter (#326)
2023-07-03 11:31:55 -07:00
Zhuohan Li
598dc4b79a
[Fix] Weight loading for GPTBigCode (#313)
2023-06-29 22:14:17 -07:00
Michael Feil
298695b766
GPTBigCode (StarCoder, SantaCoder Support) (#209)
2023-06-23 01:49:27 +08:00