Commit Graph

206 Commits

Author SHA1 Message Date
youkaichao
4d2dc5072b [hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102) 2024-08-13 00:16:42 -07:00
jon-chuang
a046f86397 [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Cyrus Leung
4ddc4743d7 [Core] Consolidate GB constant and enable float GB arguments (#7416) 2024-08-12 14:14:14 -07:00
Mahesh Keralapura
933790c209 [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) 2024-08-09 13:55:13 -07:00
Cyrus Leung
7eb4a51c5f [Core] Support serving encoder/decoder models (#7258) 2024-08-09 10:39:41 +08:00
Siyuan Liu
0fa14907da [TPU] Add Load-time W8A16 quantization for TPU Backend (#7005) 2024-08-08 18:35:49 -07:00
Jee Jee Li
a049b107e2 [Misc] Temporarily resolve the error of BitAndBytes (#7308) 2024-08-08 13:42:58 -07:00
Cherilyn Buren
48abee9e54 [Frontend] remove max_num_batched_tokens limit for lora (#7288) 2024-08-08 06:17:29 +00:00
afeldman-nm
fd95e026e0 [Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Jee Jee Li
9118217f58 [LoRA] Relax LoRA condition (#7146) 2024-08-06 01:57:25 +00:00
Isotr0py
360bd67cf0 [Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
Cade Daniel
82a1b1a82b [Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963) 2024-08-05 08:46:44 +00:00
Jee Jee Li
f80ab3521c Clean up remaining Punica C information (#7027) 2024-08-04 15:37:08 -07:00
Thomas Parnell
b1c9aa3daa [Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator (#7105)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-08-04 07:13:18 -07:00
Jeff Fialho
825b044863 [Frontend] Warn if user max_model_len is greater than derived max_model_len (#7080)
Signed-off-by: Jefferson Fialho <jfialho@ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-03 16:01:38 -07:00
Murali Andoorveedu
fc912e0886 [Models] Support Qwen model with PP (#6974)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-08-01 12:40:43 -07:00
xuyi
1d2e7fb73f [Model] Pipeline parallel support for Qwen2 (#6924) 2024-07-31 18:49:51 -07:00
Michael Goin
a0dce9383a [Misc] Add compressed-tensors to optimized quant list (#7006) 2024-07-31 14:40:44 -07:00
Cyrus Leung
da1f7cc12a [mypy] Enable following imports for some directories (#6681) 2024-07-31 10:38:03 +08:00
Michael Goin
b1366a9534 Add Nemotron to PP_SUPPORTED_MODELS (#6863) 2024-07-27 15:05:17 -07:00
chenqianfzh
bb5494676f enforce eager mode with bnb quantization temporarily (#6846) 2024-07-27 01:32:20 +00:00
dongmao zhang
87525fab92 [bitsandbytes]: support read bnb pre-quantized model (#5753)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-23 23:45:09 +00:00
Travis Johnson
507ef787d8 [Model] Pipeline Parallel Support for DeepSeek v2 (#6519)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-07-23 12:22:09 -07:00
Woosuk Kwon
a112a84aad [BugFix] Fix RoPE error in Llama 3.1 (#6693) 2024-07-23 09:46:05 -07:00
Woosuk Kwon
461089a21a [Bugfix] Fix a log error in chunked prefill (#6694) 2024-07-23 09:27:58 -07:00
Simon Mo
3eda4ec780 support ignore patterns in model loader (#6673) 2024-07-22 23:59:42 -07:00
Alexander Matveev
396d92d5e0 [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) 2024-07-21 19:41:42 -04:00
sroy745
14f91fe67c [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) 2024-07-20 23:58:58 -07:00
Robert Shaw
683e3cb9c4 [ Misc ] fbgemm checkpoints (#6559) 2024-07-20 09:36:57 -07:00
Antoni Baum
7bd82002ae [Core] Allow specifying custom Executor (#6557) 2024-07-20 01:25:06 +00:00
Nick Hill
b5672a112c [Core] Multiprocessing Pipeline Parallel support (#6130)
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-18 19:15:52 -07:00
Simon Mo
c5df56f88b Add support for a rope extension method (#6553) 2024-07-19 01:53:03 +00:00
youkaichao
1c27d25fb5 [core][model] yet another cpu offload implementation (#6496)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-17 20:54:35 -07:00
Robert Shaw
18fecc3559 [ Kernel ] Fp8 Channelwise Weight Support (#6487) 2024-07-18 03:18:13 +00:00
Cody Yu
b5af8c223c [Model] Pipeline parallel support for Mixtral (#6516) 2024-07-17 19:26:04 -07:00
Hongxia Yang
b6c16cf8ff [ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352) 2024-07-11 21:30:46 -07:00
aniaan
3963a5335b [Misc] refactor(config): clean up unused code (#6320) 2024-07-11 09:39:07 +00:00
Swapnil Parekh
4d6ada947c [CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
youkaichao
3de6e6a30e [core][distributed] support n layers % pp size != 0 (#6115) 2024-07-03 16:40:31 -07:00
xwjiang2010
d9e98f42e4 [vlm] Remove vision language config. (#6089)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
youkaichao
3c6325f0fc [core][distributed] custom allreduce when pp size > 1 (#6117) 2024-07-03 14:41:32 -07:00
Nick Hill
d830656a97 [BugFix] Avoid unnecessary Ray import warnings (#6079) 2024-07-03 14:09:40 +08:00
Cyrus Leung
9831aec49f [Core] Dynamic image size support for VLMs (#5276)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-07-02 20:34:00 -07:00
Mor Zusman
9d6a8daa87 [Model] Jamba support (#4115)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 23:11:29 +00:00
Murali Andoorveedu
c5832d2ae9 [Core] Pipeline Parallel Support (#4412)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
Sirej Dua
15aba081f3 [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050)
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
xwjiang2010
98d6682cd1 [VLM] Remove image_input_type from VLM config (#5852)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
sroy745
80ca1e6a3a [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) 2024-07-01 00:33:05 -07:00
wangding zeng
be0b3af9e0 Support Deepseek-V2 (#4650)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2024-06-28 13:24:57 -07:00
Thomas Parnell
ec1ad0046c [Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high (#5894)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-28 07:42:17 -07:00