Rafael Vasquez
|
32aa2059ad
|
[Docs] Convert rST to MyST (Markdown) (#11145)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
|
2024-12-23 22:35:38 +00:00 |
|
omer-dayan
|
995f56236b
|
[Core] Loading model from S3 using RunAI Model Streamer as optional loader (#10192)
Signed-off-by: OmerD <omer@run.ai>
|
2024-12-20 16:46:24 +00:00 |
|
Akash kaothalkar
|
48edab8041
|
[Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 (#11331)
Signed-off-by: Akash Kaothalkar <0052v2@linux.vnet.ibm.com>
|
2024-12-20 01:32:07 +00:00 |
|
Yanyi Liu
|
5aef49806d
|
[Feature] Add load generation config from model (#11164)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
Signed-off-by: Yanyi Liu <wolfsonliu@163.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2024-12-19 10:50:38 +00:00 |
|
Alexander Matveev
|
fdea8ec167
|
[V1] VLM - enable processor cache by default (#11305)
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
|
2024-12-18 18:54:46 -05:00 |
|
Konrad Zawora
|
866fa4550d
|
[Bugfix] Restore support for larger block sizes (#11259)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
|
2024-12-17 16:39:07 -08:00 |
|
Roger Wang
|
59c9b6ebeb
|
[V1][VLM] Proper memory profiling for image language models (#11210)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: ywang96 <ywang@example.com>
|
2024-12-16 22:10:57 -08:00 |
|
youkaichao
|
88a412ed3d
|
[torch.compile] fast inductor (#11108)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
|
2024-12-16 16:15:22 -08:00 |
|
shangmingc
|
d263bd9df7
|
[Core] Support disaggregated prefill with Mooncake Transfer Engine (#10884)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
|
2024-12-15 21:28:18 +00:00 |
|
Brad Hilton
|
9c3dadd1c9
|
[Frontend] Add logits_processors as an extra completion argument (#11150)
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com>
|
2024-12-14 16:46:42 +00:00 |
|
youkaichao
|
be39e3cd18
|
[core] clean up cudagraph batchsize padding logic (#10996)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-13 06:57:50 +00:00 |
|
Alexander Matveev
|
4e11683368
|
[V1] VLM preprocessor hashing (#11020)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-12-12 00:55:30 +00:00 |
|
youkaichao
|
91642db952
|
[torch.compile] use depyf to dump torch.compile internals (#10972)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-11 10:43:05 -08:00 |
|
Cyrus Leung
|
cad5c0a6ed
|
[Doc] Update docs to refer to pooling models (#11093)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-11 13:36:27 +00:00 |
|
Cyrus Leung
|
8f10d5e393
|
[Misc] Split up pooling tasks (#10820)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-11 01:28:00 -08:00 |
|
Mor Zusman
|
ffa48c9146
|
[Model] PP support for Mamba-like models (#10992)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
|
2024-12-10 21:53:37 -05:00 |
|
Aurick Qiao
|
d5c5154fcf
|
[Misc] LoRA + Chunked Prefill (#9057)
|
2024-12-11 10:09:20 +08:00 |
|
youkaichao
|
1a2f8fb828
|
[v1] fix use compile sizes (#11000)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-09 13:47:24 -08:00 |
|
wangxiyuan
|
aea2fc38c3
|
[Platform] Move async output check to platform (#10768)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2024-12-09 17:24:46 +00:00 |
|
youkaichao
|
46004e83a2
|
[misc] clean up and unify logging (#10999)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-08 17:28:27 -08:00 |
|
youkaichao
|
43b05fa314
|
[torch.compile][misc] fix comments (#10993)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-08 11:18:18 -08:00 |
|
youkaichao
|
fd57d2b534
|
[torch.compile] allow candidate compile sizes (#10984)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-08 11:05:21 +00:00 |
|
youkaichao
|
1b62745b1d
|
[core][executor] simplify instance id (#10976)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-07 09:33:45 -08:00 |
|
Cyrus Leung
|
bf0e382e16
|
[Model] Composite weight loading for multimodal Qwen2 (#10944)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-07 07:22:52 -07:00 |
|
youkaichao
|
c05cfb67da
|
[misc] fix typo (#10960)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-06 11:25:20 -08:00 |
|
youkaichao
|
b031a455a9
|
[torch.compile] add logging for compilation time (#10941)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-12-06 10:07:15 +00:00 |
|
Cyrus Leung
|
aa39a8e175
|
[Doc] Create a new "Usage" section (#10827)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-05 11:19:35 +08:00 |
|
wangxiyuan
|
b5b647b084
|
Drop ROCm load format check (#10767)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2024-12-04 04:32:21 +00:00 |
|
Aaron Pham
|
9323a3153b
|
[Core][Performance] Add XGrammar support for guided decoding and set it as default (#10785)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
|
2024-12-03 15:17:00 +08:00 |
|
youkaichao
|
dc5ce861bf
|
[torch.compile] remove compilation_context and simplify code (#10838)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-03 06:19:02 +00:00 |
|
youkaichao
|
a4c4daf364
|
[misc] use out argument for flash attention (#10822)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-12-02 10:50:10 +00:00 |
|
wangxiyuan
|
995a148575
|
[doc]Update config docstring (#10732)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2024-12-02 04:14:45 +00:00 |
|
Kuntai Du
|
0590ec3fd9
|
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
This PR provides initial support for single-node disaggregated prefill in 1P1D scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn>
|
2024-12-01 19:01:00 -06:00 |
|
Cyrus Leung
|
d2f058e76c
|
[Misc] Rename embedding classes to pooling (#10801)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-01 14:36:51 +08:00 |
|
Cyrus Leung
|
133707123e
|
[Model] Replace embedding models with pooling adapter (#10769)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-01 08:02:54 +08:00 |
|
wangxiyuan
|
661175bc82
|
[platform] Add verify_quantization in platform. (#10757)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2024-11-29 15:22:21 +00:00 |
|
youkaichao
|
c411def234
|
[torch.compile] fix shape specialization (#10722)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-27 10:16:10 -08:00 |
|
Chendi.Xue
|
0a71900bc9
|
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
|
2024-11-26 17:57:11 -08:00 |
|
Murali Andoorveedu
|
db66e018ea
|
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
|
2024-11-26 09:11:16 -08:00 |
|
Wallas Henrique
|
c27df94e1f
|
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-11-25 12:23:32 -05:00 |
|
Cyrus Leung
|
ed46f14321
|
[Model] Support is_causal HF config field for Qwen2 model (#10621)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-11-25 09:51:20 +00:00 |
|
youkaichao
|
05d1f8c9c6
|
[misc] move functions to config.py (#10624)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-25 09:27:30 +00:00 |
|
youkaichao
|
25d806e953
|
[misc] add torch.compile compatibility check (#10618)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-24 23:40:08 -08:00 |
|
Mengqing Cao
|
7ea3cd7c3e
|
[Refactor][MISC] del redundant code in ParallelConfig.postinit (#10614)
Signed-off-by: MengqingCao <cmq0113@163.com>
|
2024-11-25 05:14:56 +00:00 |
|
Maximilien de Bayser
|
214efc2c3c
|
Support Cross encoder models (#10400)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
|
2024-11-24 18:56:20 -08:00 |
|
kliuae
|
7c25fe45a6
|
[AMD] Add support for GGUF quantization on ROCm (#10254)
|
2024-11-22 21:14:49 -08:00 |
|
Michael Goin
|
02a43f82a9
|
Update default max_num_batch_tokens for chunked prefill to 2048 (#10544)
|
2024-11-22 21:14:19 -08:00 |
|
youkaichao
|
4aba6e3d1a
|
[core] gemma2 full context length support (#10584)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-22 20:13:54 -08:00 |
|
youkaichao
|
eebad39f26
|
[torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-22 14:04:42 -08:00 |
|
youkaichao
|
a111d0151f
|
[platforms] absorb worker cls difference into platforms folder (#10555)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
|
2024-11-21 21:00:32 -08:00 |
|