Andrew Wang
|
97a6be95ba
|
[Misc] improve logits processors logging message (#7435)
|
2024-08-13 02:29:34 +00:00 |
|
Cyrus Leung
|
9ba85bc152
|
[mypy] Misc. typing improvements (#7417)
|
2024-08-13 09:20:20 +08:00 |
|
Rui Qiao
|
198d6a2898
|
[Core] Shut down aDAG workers with clean async llm engine exit (#7224)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
|
2024-08-12 17:57:16 -07:00 |
|
jon-chuang
|
a046f86397
|
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-08-12 22:47:41 +00:00 |
|
Roger Wang
|
e6e42e4b17
|
[Core][VLM] Support image embeddings as input (#6613)
|
2024-08-12 16:16:06 +08:00 |
|
Isotr0py
|
4c5d8e8ea9
|
[Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392)
|
2024-08-10 16:19:33 +00:00 |
|
Cade Daniel
|
baa240252e
|
[Core] Fix edge case in chunked prefill + block manager v2 (#7380)
|
2024-08-09 23:48:49 +00:00 |
|
Mahesh Keralapura
|
933790c209
|
[Core] Add span metrics for model_forward, scheduler and sampler time (#7089)
|
2024-08-09 13:55:13 -07:00 |
|
Pooya Davoodi
|
249b88228d
|
[Frontend] Support embeddings in the run_batch API (#7132)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-08-09 09:48:21 -07:00 |
|
Nick Hill
|
b4e9528f95
|
[Core] Streamline stream termination in AsyncLLMEngine (#7336)
|
2024-08-09 07:06:36 +00:00 |
|
William Lin
|
57b7be0e1c
|
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971)
|
2024-08-09 05:42:45 +00:00 |
|
Travis Johnson
|
99b4cf5f23
|
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
|
2024-08-08 22:08:46 -07:00 |
|
Cyrus Leung
|
7eb4a51c5f
|
[Core] Support serving encoder/decoder models (#7258)
|
2024-08-09 10:39:41 +08:00 |
|
Zach Zheng
|
782e53ab59
|
[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849)
|
2024-08-08 10:43:30 -07:00 |
|
Joe Runde
|
21b9c49aa3
|
[Frontend] Kill the server on engine death (#6594)
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2024-08-08 09:47:48 -07:00 |
|
Luka Govedič
|
5fb4a3f678
|
[Bugfix][Kernel] Increased atol to fix failing tests (#7305)
|
2024-08-08 12:16:13 -04:00 |
|
Michael Goin
|
5223199e03
|
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219)
|
2024-08-07 11:23:12 -07:00 |
|
Maximilien de Bayser
|
fde47d3bc2
|
[BugFix] Fix frontend multiprocessing hang (#7217)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
|
2024-08-07 18:09:36 +00:00 |
|
Isotr0py
|
b764547616
|
[Bugfix] Fix input processor for InternVL2 model (#7164)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-08-07 09:32:07 -07:00 |
|
Dipika Sikka
|
0f7052bc7e
|
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 (#5874)
|
2024-08-07 09:17:58 -07:00 |
|
Cyrus Leung
|
66d617e343
|
[Frontend] Gracefully handle missing chat template and fix CI failure (#7238)
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-08-07 09:12:05 +00:00 |
|
Nick Hill
|
9a3f49ae07
|
[BugFix] Overhaul async request cancellation (#7111)
|
2024-08-07 13:21:41 +08:00 |
|
Michael Goin
|
f9a5600649
|
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225)
|
2024-08-06 18:34:26 -07:00 |
|
afeldman-nm
|
fd95e026e0
|
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-08-06 16:51:47 -04:00 |
|
Luka Govedič
|
8d59dbb000
|
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2024-08-06 18:17:08 +00:00 |
|
Lily Liu
|
5c60c8c423
|
[SpecDecode] [Minor] Fix spec decode sampler tests (#7183)
|
2024-08-06 10:40:32 -07:00 |
|
Cyrus Leung
|
1f26efbb3a
|
[Model] Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
|
2024-08-06 16:55:31 +08:00 |
|
Jee Jee Li
|
9118217f58
|
[LoRA] Relax LoRA condition (#7146)
|
2024-08-06 01:57:25 +00:00 |
|
Isotr0py
|
360bd67cf0
|
[Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-08-05 17:54:23 -06:00 |
|
youkaichao
|
dfb1a15dcb
|
[ci][frontend] deduplicate tests (#7101)
|
2024-08-05 15:59:22 -07:00 |
|
Cade Daniel
|
82a1b1a82b
|
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963)
|
2024-08-05 08:46:44 +00:00 |
|
Alphi
|
7b86e7c9cd
|
[Model] Add multi-image support for minicpmv (#7122)
Co-authored-by: hezhihui <hzh7269@modelbest.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-08-05 09:23:17 +08:00 |
|
Yihuan Bu
|
654bc5ca49
|
Support for guided decoding for offline LLM (#6878)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-08-04 03:12:09 +00:00 |
|
youkaichao
|
44dcb52e39
|
[ci][test] finalize fork_new_process_for_each_test (#7114)
|
2024-08-03 10:44:53 -07:00 |
|
Jee Jee Li
|
99d7cabd7b
|
[LoRA] ReplicatedLinear support LoRA (#7081)
|
2024-08-02 22:40:19 -07:00 |
|
Zach Zheng
|
fb2c1c86c1
|
[Bugfix] Fix block table for seqs that have prefix cache hits (#7018)
|
2024-08-02 22:38:15 -07:00 |
|
youkaichao
|
a0d164567c
|
[ci][distributed] disable ray dag tests (#7099)
|
2024-08-02 22:32:04 -07:00 |
|
youkaichao
|
04e5583425
|
[ci][distributed] merge distributed test commands (#7097)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-08-02 21:33:53 -07:00 |
|
youkaichao
|
69ea15e5cc
|
[ci][distributed] shorten wait time if server hangs (#7098)
|
2024-08-02 21:05:16 -07:00 |
|
Robert Shaw
|
ed812a73fa
|
[ Frontend ] Multiprocessing for OpenAI Server with zeromq (#6883)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-08-02 18:27:28 -07:00 |
|
Rui Qiao
|
05308891e2
|
[Core] Pipeline parallel with Ray ADAG (#6837)
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
|
2024-08-02 13:55:40 -07:00 |
|
Lucas Wilkinson
|
a8d604ca2a
|
[Misc] Disambiguate quantized types via a new ScalarType (#6396)
|
2024-08-02 13:51:58 -07:00 |
|
youkaichao
|
806949514a
|
[ci] set timeout for test_oot_registration.py (#7082)
|
2024-08-02 10:03:24 -07:00 |
|
youkaichao
|
252357793d
|
[ci][distributed] try to fix pp test (#7054)
|
2024-08-01 22:03:12 -07:00 |
|
Woosuk Kwon
|
805a8a75f2
|
[Misc] Support attention logits soft-capping with flash-attn (#7022)
|
2024-08-01 13:14:37 -07:00 |
|
Michael Goin
|
fb3db61688
|
[CI/Build] Remove sparseml requirement from testing (#7037)
|
2024-08-01 12:00:51 -07:00 |
|
youkaichao
|
c8a7e93273
|
[core][scheduler] simplify and improve scheduler (#6867)
|
2024-07-31 23:51:09 -07:00 |
|
zifeitong
|
3c10591ef2
|
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954)
|
2024-07-31 21:13:34 -07:00 |
|
Jee Jee Li
|
7ecee34321
|
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036)
|
2024-07-31 17:12:24 -07:00 |
|
Michael Goin
|
460c1884e3
|
[Bugfix] Support cpu offloading with fp8 quantization (#6960)
|
2024-07-31 12:47:46 -07:00 |
|