Kuntai Du
|
0590ec3fd9
|
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
This PR provides initial support for single-node disaggregated prefill in 1P1D scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn>
|
2024-12-01 19:01:00 -06:00 |
|
Cyrus Leung
|
d2f058e76c
|
[Misc] Rename embedding classes to pooling (#10801)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-01 14:36:51 +08:00 |
|
Cyrus Leung
|
133707123e
|
[Model] Replace embedding models with pooling adapter (#10769)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-12-01 08:02:54 +08:00 |
|
Cyrus Leung
|
fa6ecb9aa7
|
[Model] Clean up MiniCPMV (#10751)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-11-29 04:47:06 +00:00 |
|
sixgod
|
5fc5ce0fe4
|
[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2024-11-28 14:53:31 +00:00 |
|
Woosuk Kwon
|
a79b122400
|
[V1] Do not allocate beyond the max_model_len (#10730)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-11-28 00:13:15 -08:00 |
|
Ricky Xu
|
d9b4b3f069
|
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
Signed-off-by: rickyx <rickyx@anyscale.com>
|
2024-11-27 23:59:28 -08:00 |
|
tomeras91
|
395b1c7454
|
[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server (#10635)
Signed-off-by: Tomer Asida <tomera@ai21.com>
|
2024-11-27 13:21:10 -08:00 |
|
Mor Zusman
|
197b4484a3
|
[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
|
2024-11-27 19:02:27 +00:00 |
|
youkaichao
|
308cc5e21e
|
[ci] fix slow tests (#10698)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-27 09:26:14 -08:00 |
|
shunxing12345
|
1209261e93
|
[Model] Support telechat2 (#10311)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
|
2024-11-27 11:32:35 +00:00 |
|
jeongin601
|
1bf905ddaa
|
[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. (#10198)
Signed-off-by: jeongin601 <0200angela@gmail.com>
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com>
|
2024-11-27 05:07:30 +00:00 |
|
Chendi.Xue
|
0a71900bc9
|
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
|
2024-11-26 17:57:11 -08:00 |
|
Murali Andoorveedu
|
db66e018ea
|
[Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
|
2024-11-26 09:11:16 -08:00 |
|
youkaichao
|
334d64d1e8
|
[ci] add vllm_test_utils (#10659)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-26 00:20:04 -08:00 |
|
Sage Moore
|
9a88f89799
|
custom allreduce + torch.compile (#10121)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2024-11-25 22:00:16 -08:00 |
|
Ricky Xu
|
519e8e4182
|
[v1] EngineArgs for better config handling for v1 (#10382)
Signed-off-by: rickyx <rickyx@anyscale.com>
|
2024-11-25 21:09:43 -08:00 |
|
Shane A
|
9db713a1dc
|
[Model] Add OLMo November 2024 model (#10503)
|
2024-11-25 17:26:40 -05:00 |
|
zhou fan
|
b1d920531f
|
[Model]: Add support for Aria model (#10514)
Signed-off-by: xffxff <1247714429@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
|
2024-11-25 18:10:55 +00:00 |
|
Wallas Henrique
|
c27df94e1f
|
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-11-25 12:23:32 -05:00 |
|
Chauncey
|
d04b13a380
|
[Bug]: Authorization ignored when root_path is set (#10606)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
2024-11-25 16:21:41 +00:00 |
|
Cyrus Leung
|
ed46f14321
|
[Model] Support is_causal HF config field for Qwen2 model (#10621)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-11-25 09:51:20 +00:00 |
|
youkaichao
|
05d1f8c9c6
|
[misc] move functions to config.py (#10624)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-25 09:27:30 +00:00 |
|
youkaichao
|
25d806e953
|
[misc] add torch.compile compatibility check (#10618)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-24 23:40:08 -08:00 |
|
youkaichao
|
571841b7fc
|
[torch.compile] support encoder based models (#10613)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-25 05:24:33 +00:00 |
|
Maximilien de Bayser
|
214efc2c3c
|
Support Cross encoder models (#10400)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
|
2024-11-24 18:56:20 -08:00 |
|
Jee Jee Li
|
1700c543a5
|
[Bugfix] Fix LoRA weight sharding (#10450)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-11-23 17:23:17 -08:00 |
|
Cyrus Leung
|
c8acd80548
|
[2/N] handling placeholders in merged multi-modal processor (#10485)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-11-22 21:25:09 -08:00 |
|
Ricky Xu
|
4634a89d18
|
Prefix Cache Aware Scheduling [1/n] (#10128)
Signed-off-by: rickyx <rickyx@anyscale.com>
|
2024-11-22 21:15:55 -08:00 |
|
Varun Vinayak Shenoy
|
7d8ffb344f
|
[Bugfix] Internal Server Error when tool_choice is incorrect. (#10567)
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com>
|
2024-11-22 21:13:29 -08:00 |
|
youkaichao
|
4aba6e3d1a
|
[core] gemma2 full context length support (#10584)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-22 20:13:54 -08:00 |
|
Tyler Michael Smith
|
978b39744b
|
[Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432)
|
2024-11-22 22:14:03 -05:00 |
|
Travis Johnson
|
9195dbdbca
|
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use (#10164)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
|
2024-11-23 10:17:38 +08:00 |
|
Ricky Xu
|
97814fbf0f
|
[v1] Refactor KVCacheManager for more hash input than token ids (#10507)
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-11-22 23:27:25 +00:00 |
|
youkaichao
|
eebad39f26
|
[torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-22 14:04:42 -08:00 |
|
youkaichao
|
db100c5cde
|
[bugfix] fix full graph tests (#10581)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-22 10:02:14 -08:00 |
|
youkaichao
|
33e0a2540a
|
[9/N] torch.compile LLM usage (#10552)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-21 19:13:31 -08:00 |
|
youkaichao
|
7560ae5caf
|
[8/N] enable cli flag without a space (#10529)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-21 12:30:42 -08:00 |
|
Jee Jee Li
|
2385b60d83
|
[Kernel] Register punica ops directly (#10522)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2024-11-21 09:18:11 -08:00 |
|
Chauncey
|
da7e702c6f
|
[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored (#10180)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
2024-11-21 16:24:32 +00:00 |
|
Isotr0py
|
d5ec121f95
|
[Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models (#10518)
Signed-off-by: Isotr0py <2037008807@qq.com>
|
2024-11-21 14:20:08 +00:00 |
|
Luka Govedič
|
8b0fe06c89
|
[torch.compile] Inductor code caching fix (#10273)
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>
|
2024-11-20 21:44:57 -08:00 |
|
Pavani Majety
|
6c1208d083
|
[Core] Add Sliding Window Support with Flashinfer (#10462)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
|
2024-11-20 19:56:47 -08:00 |
|
youkaichao
|
388ee3de66
|
[torch.compile] limit inductor threads and lazy import quant (#10482)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-20 18:36:33 -08:00 |
|
Guillaume Calmettes
|
c68f7ede6a
|
[Bugfix]: allow extra fields in requests to openai compatible server (#10463)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
|
2024-11-20 16:42:21 -05:00 |
|
youkaichao
|
0cd3d9717e
|
[7/N] torch.compile, reduce compilation time (#10460)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-20 11:20:38 -08:00 |
|
Li, Jiang
|
63f1fde277
|
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2024-11-20 10:57:39 +00:00 |
|
Lucas Wilkinson
|
d200972e7f
|
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2024-11-19 19:40:33 -08:00 |
|
ElizaWszola
|
b00b33d77e
|
[Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
|
2024-11-19 13:31:12 -08:00 |
|
youkaichao
|
803f37eaaa
|
[6/N] torch.compile rollout to users (#10437)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2024-11-19 10:09:03 -08:00 |
|