Author | Commit | Message | Date
Nick Hill | 2e240c69a9 | [Core] Centralize GPU Worker construction (#4419) | 2024-05-01 01:06:34 +00:00
fuchen.ljl | ee37328da0 | Unable to find Punica extension issue during source code installation (#4494); Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-05-01 00:42:09 +00:00
fuchen.ljl | 6ad58f42c5 | fix_tokenizer_snapshot_download_bug (#4493) | 2024-04-30 16:38:50 -07:00
Li, Jiang | dd1a50a8bc | [Bugfix][Minor] Make ignore_eos effective (#4468) | 2024-04-30 16:33:33 -07:00
Alpay Ariyak | 715c2d854d | [Frontend] [Core] Tensorizer: support dynamic num_readers, update version (#4467) | 2024-04-30 16:32:13 -07:00
Florian Greinacher | a494140433 | [Frontend] Support complex message content for chat completions endpoint (#3467); Co-authored-by: Lily Liu <lilyliupku@gmail.com>; Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> | 2024-04-30 16:28:46 -07:00
Robert Shaw | 111815d482 | [Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332); Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>; Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>; Co-authored-by: mgoin <michael@neuralmagic.com>; Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>; Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> | 2024-04-30 21:46:12 +00:00
Prashant Gupta | b31a1fb63c | [Doc] add visualization for multi-stage dockerfile (#4456); Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>; Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-04-30 17:41:59 +00:00
leiwen83 | 4bb53e2dde | [BugFix] fix num_lookahead_slots missing in async executor (#4165); Co-authored-by: Lei Wen <wenlei03@qiyi.com> | 2024-04-30 10:12:59 -07:00
Kunshang Ji | 26f2fb5113 | [Core]Refactor gptq_marlin ops (#4466) | 2024-04-30 08:14:47 -04:00
Woosuk Kwon | fa32207842 | [Bugfix][Kernel] Fix compute_type for MoE kernel (#4463) | 2024-04-29 22:05:40 -07:00
Michael Goin | d627a3d837 | [Misc] Upgrade to torch==2.3.0 (#4454) | 2024-04-29 20:05:47 -04:00
youkaichao | f4f921b7f1 | [Core][Distributed] use cpu group to broadcast metadata in cpu (#4444) | 2024-04-29 13:52:22 -07:00
Simon Mo | ac5ccf0156 | [CI] hotfix: soft fail neuron test (#4458) | 2024-04-29 19:50:01 +00:00
Robert Shaw | 73c8d677e5 | [Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922); Co-authored-by: alexm <alexm@neuralmagic.com>; Co-authored-by: mgoin <michael@neuralmagic.com> | 2024-04-29 09:35:34 -07:00
SangBin Cho | df29793dc7 | [mypy][5/N] Support all typing on model executor (#4427) | 2024-04-28 19:01:26 -07:00
Simon Mo | 03dd7d52bf | [CI] clean docker cache for neuron (#4441) | 2024-04-28 23:32:07 +00:00
Ronen Schaffer | bf480c5302 | Add more Prometheus metrics (#2764); Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>; Co-authored-by: Robert Shaw <rshaw@neuralmagic.com> | 2024-04-28 15:59:33 -07:00
DefTruth | 9c7306ac11 | [Misc] fix typo in llm_engine init logging (#4428) | 2024-04-28 18:58:30 +08:00
Robert Shaw | 4ea1f9678d | [BugFix] Resolved Issues For LinearMethod --> QuantConfig (#4418) | 2024-04-27 18:35:33 +00:00
Nick Hill | ba4be44c32 | [BugFix] Fix return type of executor execute_model methods (#4402) | 2024-04-27 11:17:45 -07:00
Prashant Gupta | d6e520e170 | [Core] Support offline use of local cache for models (#4374); Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>; Co-authored-by: Travis Johnson <tjohnson31415@gmail.com> | 2024-04-27 09:59:55 -07:00
Nick Hill | 81661da7b2 | [BugFix] Fix min_tokens when eos_token_id is None (#4389); Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com> | 2024-04-27 09:52:46 -07:00
Ruoyu Qin | dfea173148 | [Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363) | 2024-04-27 09:48:37 -07:00
Roy | 7134303cbb | [Bugfix][Core] Fix get decoding config from ray (#4335) | 2024-04-27 11:30:08 +00:00
Caio Mendes | 3da24c2df7 | [Model] Phi-3 4k sliding window temp. fix (#4380) | 2024-04-27 18:08:15 +08:00
Austin Veselka | eefeb16464 | [Kernel] Full Tensor Parallelism for LoRA Layers (#3524); Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> | 2024-04-27 00:03:48 -07:00
Hongxia Yang | 18d23f642a | [ROCm][Hardware][AMD] Enable group query attention for triton FA (#4406) | 2024-04-26 23:37:40 -07:00
Roy | 87f545ba6f | [Misc] Fix logger format typo (#4396) | 2024-04-27 13:45:02 +08:00
Cyrus Leung | 8947bc3c15 | [Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355) | 2024-04-27 05:08:24 +00:00
Philipp Moritz | 12628d3c78 | [Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales (#4343); Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-04-27 04:49:59 +00:00
Nick Hill | 258a2c58d0 | [Core] Introduce DistributedGPUExecutor abstract class (#4348) | 2024-04-27 04:14:26 +00:00
youkaichao | aba47be3fe | [Misc] add RFC issue template (#4401); Co-authored-by: Simon Mo <simon.mo@hey.com> | 2024-04-26 15:47:45 -07:00
Cody Yu | a62aaf1df5 | [Misc][Refactor] Generalize linear_method to be quant_method (#4373) | 2024-04-26 16:41:14 -04:00
SangBin Cho | 603ad84815 | [Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309) | 2024-04-26 13:02:02 +00:00
SangBin Cho | a88081bf76 | [CI] Disable non-lazy string operation on logging (#4326); Co-authored-by: Danny Guinther <dguinther@neuralmagic.com> | 2024-04-26 00:16:58 -07:00
Norman Mu | 2f30e7c72f | [Frontend] Add --log-level option to api server (#4377) | 2024-04-26 05:36:01 +00:00
Cyrus Leung | a74dee9b62 | [Bugfix] Fix parameter name in get_tokenizer (#4107) | 2024-04-25 19:10:48 -07:00
Hongxia Yang | cf29b7eda4 | [ROCm][Hardware][AMD][Doc] Documentation update for ROCm (#4376); Co-authored-by: WoosukKwon <woosuk.kwon@berkeley.edu> | 2024-04-25 18:12:25 -07:00
Nick Hill | efffb63f58 | [Core] Move function tracing setup to util function (#4352) | 2024-04-25 16:45:12 -07:00
Nick Hill | 15e7c675b0 | [Core] Add shutdown() method to ExecutorBase (#4349) | 2024-04-25 16:32:48 -07:00
Roy | b6dcb4d442 | [Misc] Fix flash attention backend log (#4368) | 2024-04-25 12:43:32 -07:00
SangBin Cho | b5b4a398a7 | [Mypy] Typing lora folder (#4337) | 2024-04-25 19:13:50 +00:00
Kunshang Ji | f4bc4de1b1 | [Core]refactor aqlm quant ops (#4351) | 2024-04-25 15:03:56 -04:00
Caio Mendes | bd7a8eef25 | [Doc] README Phi-3 name fix. (#4372); Co-authored-by: Caio Mendes <caiocesart@microsoft.com> | 2024-04-25 10:32:00 -07:00
Alexei-V-Ivanov-AMD | 7ee82bef1e | [CI/Build] Adding functionality to reset the node's GPUs before processing. (#4213) | 2024-04-25 09:37:20 -07:00
Isotr0py | fbf152d976 | [Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 (#4324); Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-04-25 09:35:56 -07:00
Nick Hill | 479d69fad0 | [Core] Move ray_utils.py from engine to executor package (#4347) | 2024-04-25 06:52:22 +00:00
Caio Mendes | 96e90fdeb3 | [Model] Adds Phi-3 support (#4298) | 2024-04-25 03:06:57 +00:00
zifeitong | a395a638c2 | [Misc] Use public API in benchmark_throughput (#4300) | 2024-04-24 21:10:24 +00:00