Jee Jee Li
15cc2a9f1a
[Misc]Further reduce BNB static variable ( #10597 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-26 22:54:12 -08:00
Chen Wu
cfea9c04ef
[Model] Fix Baichuan BNB online quantization ( #10572 )
...
Signed-off-by: Chen Wu <cntryroa@gmail.com >
2024-11-22 21:13:59 -08:00
youkaichao
eebad39f26
[torch.compile] support all attention backends ( #10558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 14:04:42 -08:00
Isotr0py
c4e464333e
[Misc] Add uninitialized params tracking for AutoWeightsLoader ( #10327 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 09:07:46 +08:00
Roger Wang
643ecf7b11
[V1] Refactor model executable interface for all text-only language models ( #10374 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-17 05:18:46 +00:00
youkaichao
f89d18ff74
[6/N] pass whole config to inner model ( #10205 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 06:41:46 +00:00
youkaichao
1a95f10ee7
[5/N] pass the whole config to model ( #9983 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 14:17:28 +08:00
Joe Runde
d58268c56a
[V1] Make v1 more testable ( #9888 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-06 11:57:35 -08:00
Aaron Pham
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2024-11-06 07:11:55 +00:00
Yongzao
9013e24f7b
[torch.compile] Adding torch compile annotations to some models ( #9614 )
2024-10-23 10:07:48 -07:00
youkaichao
dbdd3b5e5a
[misc] comment to avoid future confusion about baichuan ( #9620 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-23 09:14:44 -07:00
Murali Andoorveedu
0f6d7a9a34
[Models] Add remaining model PP support ( #7168 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:56:58 +08:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
Zijian Hu
f4fc7337bf
[Bugfix] support tie_word_embeddings for all models ( #5724 )
2024-08-19 20:00:04 -07:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 ( #7410 )
2024-08-13 05:33:41 +00:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: ZX <zx@lbx.dev >
2024-07-02 22:25:17 +00:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 10:58:08 -07:00
Cyrus Leung
98cf2ed678
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision ( #5896 )
2024-06-27 09:08:10 -07:00
Cyrus Leung
96354d6a29
[Model] Add base class for LoRA-supported models ( #5018 )
2024-06-27 16:03:04 +08:00
Cody Yu
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints ( #4893 )
...
The 2nd PR for #4532 .
This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
Woosuk Kwon
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
Cody Yu
a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method ( #4373 )
2024-04-26 16:41:14 -04:00
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
youkaichao
63e7176f26
[Core][Refactor] move parallel_utils into vllm/distributed ( #3950 )
...
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950 )
2024-04-10 15:33:30 -07:00
Jee Li
8af890a865
Enable more models to inference based on LoRA ( #3382 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-03-25 18:09:31 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00
Roy
f1c0fc3919
Migrate logits computation and gather to model_runner ( #3233 )
2024-03-20 23:25:01 +00:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Woosuk Kwon
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
Roy
d9f726c4d0
[Minor] Remove unused config files ( #3039 )
2024-02-26 17:25:22 -08:00
Philipp Moritz
ea356004d4
Revert "Refactor llama family models ( #2637 )" ( #2851 )
...
This reverts commit 5c976a7e1a .
2024-02-13 09:24:59 -08:00
Roy
5c976a7e1a
Refactor llama family models ( #2637 )
2024-02-13 00:09:23 -08:00
Zhuohan Li
fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead ( #2221 )
2024-01-03 11:30:22 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
...
Co-authored-by: Chen Shen <scv119@gmail.com >
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2023-12-16 21:12:08 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Woosuk Kwon
fe470ae5ad
[Minor] Fix code style for baichuan ( #2003 )
2023-12-09 19:24:29 -08:00
firebook
2b981012a6
Fix Baichuan2-7B-Chat ( #1987 )
2023-12-08 09:38:36 -08:00
Woosuk Kwon
e5452ddfd6
Normalize head weights for Baichuan 2 ( #1876 )
2023-11-30 20:03:58 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata ( #1843 )
2023-11-29 22:16:37 -08:00
Woosuk Kwon
a9e4574261
Refactor Attention ( #1840 )
2023-11-29 15:37:31 -08:00
Woosuk Kwon
7c600440f7
Fix model docstrings ( #1764 )
2023-11-23 23:04:44 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff ( #1665 )
2023-11-20 11:58:01 -08:00
ljss
e1054247ba
[Optimization] Implement fused add rmsnorm ( #1667 )
2023-11-18 18:18:02 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
...
Refactor the tensor parallelism, quantization, and weight-loading codes.
Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580 ).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic ( #1181 )
2023-10-02 15:36:09 -07:00
Antoni Baum
3302f0aef3
rope_theta and max_position_embeddings from config ( #1096 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: wnma3mz <wnma3mz@gmail.com >
2023-09-20 13:35:11 -07:00
Jasmond L
ab019eea75
Add Model Revision Support ( #1014 )
...
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2023-09-13 15:20:02 -07:00
Zhuohan Li
c957c741d9
Enable safetensors loading for all models ( #974 )
2023-09-07 15:49:52 -07:00
Zhuohan Li
002800f081
Align vLLM's beam search implementation with HF generate ( #857 )
2023-09-04 17:29:42 -07:00