Commit Graph

9 Commits

Author | SHA1 | Message | Date
Woosuk Kwon
37ca558103 Optimize model execution with CUDA graph (#1926)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
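
A minimal sketch of the capture/replay pattern behind this commit, using PyTorch's public CUDA graph API (illustrative model and shapes only, not vLLM's actual implementation):

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()
static_input = torch.zeros(8, 4096, device="cuda")

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Record one forward pass into a graph; tensors must keep fixed addresses.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: copy new data into the static buffer, then re-launch the
# captured kernels with near-zero CPU launch overhead.
static_input.copy_(torch.randn(8, 4096, device="cuda"))
graph.replay()
```

Replaying a pre-recorded graph avoids per-kernel launch overhead, which is why it pays off for fixed-shape decode steps.
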
CHU Tianxiang
0fbfc4b81b Add GPTQ support (#916) 2023-12-15 03:04:22 -08:00
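
For context on what GPTQ inference involves: weights are stored as low-bit integer codes plus per-group scales and zero points, which are dequantized on the fly during the matmul. A hedged sketch assuming already-unpacked 4-bit codes (hypothetical layout; the real kernels operate on packed int32 buffers):

```python
import torch

def dequantize_gptq(codes: torch.Tensor, scales: torch.Tensor,
                    zeros: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    # codes:  [in_features, out_features], integer values in [0, 15]
    # scales: [in_features // group_size, out_features]
    # zeros:  [in_features // group_size, out_features]
    group = torch.arange(codes.shape[0], device=codes.device) // group_size
    return (codes.float() - zeros[group].float()) * scales[group]
```
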
Woosuk Kwon
27feead2f8 Refactor Worker & InputMetadata (#1843) 2023-11-29 22:16:37 -08:00
Woosuk Kwon
a9e4574261 Refactor Attention (#1840) 2023-11-29 15:37:31 -08:00
Woosuk Kwon
a7b3e33078 [Fix] Fix RoPE in ChatGLM-32K (#1841) 2023-11-29 13:01:19 -08:00
Woosuk Kwon
7c600440f7 Fix model docstrings (#1764) 2023-11-23 23:04:44 -08:00
Simon Mo
5ffc0d13a2 Migrate linter from pylint to ruff (#1665) 2023-11-20 11:58:01 -08:00
Zhuohan Li
7076fa1c9f TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
Refactor the tensor parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:
- **All models** can now be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- The model-loading code is much simpler.
- Model parallelism is supported for all MQA/GQA models, even when the number of key/value heads is smaller than the tensor-parallel size (see the sketch below).
2023-11-15 22:50:41 -08:00
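
A hedged sketch of the head-partitioning rule in the last bullet above (the helper name is hypothetical, not vLLM's actual API): shard key/value heads evenly when there are enough of them, otherwise replicate each head across ranks.

```python
def kv_heads_per_rank(total_kv_heads: int, tp_size: int) -> int:
    """Number of key/value heads held by each tensor-parallel rank."""
    if total_kv_heads >= tp_size:
        # Usual case: shard the KV heads evenly across ranks.
        assert total_kv_heads % tp_size == 0
        return total_kv_heads // tp_size
    # MQA/GQA with fewer KV heads than ranks: replicate each head onto
    # tp_size // total_kv_heads ranks so every rank still holds one copy.
    assert tp_size % total_kv_heads == 0
    return 1
```
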
GoHomeToMacDonal
1a2bbc9301 ChatGLM Support (#1261) 2023-11-06 16:09:33 -08:00