Lucas Wilkinson
|
cabaf4eff3
|
[Attention] MLA decode optimizations (#12528)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
|
2025-01-30 23:49:37 -08:00 |
|
Lucas Wilkinson
|
9798b2fb00
|
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling (#11868)
|
2025-01-30 18:33:00 -08:00 |
|
Mark McLoughlin
|
f17f1d4608
|
[V1][Metrics] Add GPU cache usage % gauge (#12561)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-01-29 18:31:01 -08:00 |
|
Jinzhen Lin
|
27b78c73ca
|
[Kernel] add triton fused moe kernel for gptq/awq (#12185)
|
2025-01-29 09:07:09 -05:00 |
|
Yanyi Liu
|
ff7424f491
|
[Frontend] Support override generation config in args (#12409)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
|
2025-01-29 01:41:01 -08:00 |
|
Alphi
|
d93bf4da85
|
[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069)
Signed-off-by: hzh <hezhihui_thu@163.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Yikun <yikunkero@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2025-01-29 09:24:59 +00:00 |
|
Travis Johnson
|
036ca94c25
|
[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Wallas Santos <wallashss@ibm.com>
|
2025-01-29 08:54:35 +00:00 |
|
Mark McLoughlin
|
46fb056749
|
[V1][Metrics] Add TTFT and TPOT histograms (#12530)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-01-29 04:11:16 +00:00 |
|
Ce Gao
|
a7e3eba66f
|
[Frontend] Support reasoning content for deepseek r1 (#12473)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
|
2025-01-29 11:38:08 +08:00 |
|
Mark McLoughlin
|
c386c43ca3
|
[V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-01-28 22:07:22 +00:00 |
|
Mark McLoughlin
|
3fd1fb63ef
|
[V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-01-28 16:38:38 +00:00 |
|
Cyrus Leung
|
8f58a51358
|
[VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-28 16:25:05 +00:00 |
|
Gabriel Marinho
|
0f465ab533
|
[FEATURE] Enables offline /score for embedding models (#12021)
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>
|
2025-01-28 11:30:13 +08:00 |
|
Liangfu Chen
|
ddee88d0ff
|
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (#11277)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: Jiangfei Duan <jfduan@outlook.com>
|
2025-01-27 17:31:16 -08:00 |
|
Harry Mellor
|
823ab79633
|
Update pre-commit hooks (#12475)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-01-27 17:23:08 -07:00 |
|
Nicolò Lucchesi
|
6116ca8cd7
|
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
|
2025-01-27 13:38:35 -08:00 |
|
Bowen Wang
|
2bc3fbba0c
|
[FlashInfer] Upgrade to 0.2.0 (#11194)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
|
2025-01-27 18:19:24 +00:00 |
|
Woosuk Kwon
|
3f1fc7425a
|
[V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-01-27 09:40:04 -08:00 |
|
Mark McLoughlin
|
01ba927040
|
[V1][Metrics] Add initial Prometheus logger (#12416)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2025-01-27 12:26:28 -05:00 |
|
Pooya Davoodi
|
0cc6b383d7
|
[Frontend] Support scores endpoint in run_batch (#12430)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
|
2025-01-27 04:30:17 +00:00 |
|
Kyle Mistele
|
0034b09ceb
|
[Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
Signed-off-by: Kyle Mistele <kyle@mistele.com>
|
2025-01-26 19:58:45 -07:00 |
|
Tyler Michael Smith
|
72f4880425
|
[Bugfix/CI] Fix broken kernels/test_mha.py (#12450)
|
2025-01-26 10:39:03 -08:00 |
|
Tyler Michael Smith
|
aa2cd2c43d
|
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
|
2025-01-26 19:59:58 +08:00 |
|
Matthew Hendrey
|
9ddc35220b
|
[Frontend] generation_config.json for maximum tokens(#12242)
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
|
2025-01-26 19:59:25 +08:00 |
|
Isotr0py
|
f1fc0510df
|
[Misc] Add FA2 support to ViT MHA layer (#12355)
Signed-off-by: Isotr0py <2037008807@qq.com>
|
2025-01-25 15:07:35 +08:00 |
|
Cyrus Leung
|
df5dafaa5b
|
[Misc] Remove deprecated code (#12383)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-24 14:45:20 -05:00 |
|
Lucas Wilkinson
|
ab5bbf5ae3
|
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2025-01-24 15:27:59 +00:00 |
|
Nick Hill
|
24b0205f58
|
[V1][Frontend] Coalesce bunched RequestOutputs (#12298)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
|
2025-01-23 17:17:41 -08:00 |
|
Gregory Shtrasberg
|
e97f802b2d
|
[FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
|
2025-01-23 18:04:03 +00:00 |
|
Lucas Wilkinson
|
978b45f399
|
[Kernel] Flash Attention 3 Support (#12093)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
|
2025-01-23 06:45:48 -08:00 |
|
Cody Yu
|
f0ef37233e
|
[V1] Add uncache_blocks (#12333)
|
2025-01-23 04:19:21 +00:00 |
|
rasmith
|
68c4421b6d
|
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
|
2025-01-23 00:10:37 +00:00 |
|
Cody Yu
|
7206ce4ce1
|
[Core] Support reset_prefix_cache (#12284)
|
2025-01-22 18:52:27 +00:00 |
|
youkaichao
|
68ad4e3a8d
|
[Core] Support fully transparent sleep mode (#11743)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-01-22 14:39:32 +08:00 |
|
Kevin H. Luu
|
64ea24d0b3
|
[ci/lint] Add back default arg for pre-commit (#12279)
Signed-off-by: kevin <kevin@anyscale.com>
|
2025-01-22 01:15:27 +00:00 |
|
Cyrus Leung
|
df76e5af26
|
[VLM] Simplify post-processing of replacement info (#12269)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-21 16:48:13 -08:00 |
|
Adrian Cole
|
347eeebe3b
|
[Misc] Remove experimental dep from tracing.py (#12007)
Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
|
2025-01-21 11:51:55 -08:00 |
|
Andy Lo
|
18fd4a8331
|
[Bugfix] Multi-sequence broken (#11898)
Signed-off-by: Andy Lo <andy@mistral.ai>
|
2025-01-21 11:51:35 -08:00 |
|
Ricky Xu
|
132a132100
|
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
Signed-off-by: rickyx <rickyx@anyscale.com>
|
2025-01-21 11:51:13 -08:00 |
|
Nicolò Lucchesi
|
5fe6bf29d6
|
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-01-21 12:23:14 +08:00 |
|
Cyrus Leung
|
18572e3384
|
[Bugfix] Fix HfExampleModels.find_hf_info (#12223)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-20 15:35:36 +00:00 |
|
Cyrus Leung
|
b37d82791e
|
[Model] Upgrade Aria to transformers 4.48 (#12203)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-20 17:58:48 +08:00 |
|
Cyrus Leung
|
59a0192fb9
|
[Core] Interface for accessing model from VllmRunner (#10353)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-20 15:00:59 +08:00 |
|
Isotr0py
|
83609791d2
|
[Model] Add Qwen2 PRM model support (#12202)
Signed-off-by: Isotr0py <2037008807@qq.com>
|
2025-01-20 14:59:46 +08:00 |
|
Martin Gleize
|
bbe5f9de7d
|
[Model] Support for fairseq2 Llama (#11442)
Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
|
2025-01-19 10:40:40 -08:00 |
|
Roger Wang
|
81763c58a0
|
[V1] Add V1 support of Qwen2-VL (#12128)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-01-19 19:52:13 +08:00 |
|
yancong
|
32eb0da808
|
[Misc] Support register quantization method out-of-tree (#11969)
|
2025-01-18 16:13:16 -08:00 |
|
Isotr0py
|
02798ecabe
|
[Model] Port deepseek-vl2 processor, remove dependency (#12169)
Signed-off-by: Isotr0py <2037008807@qq.com>
|
2025-01-18 13:59:39 +08:00 |
|
youkaichao
|
da02cb4b27
|
[core] further polish memory profiling (#12126)
Signed-off-by: youkaichao <youkaichao@gmail.com>
|
2025-01-18 12:25:08 +08:00 |
|
Wallas Henrique
|
58fd57ff1d
|
[Bugfix] Fix score api for missing max_model_len validation (#12119)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
|
2025-01-17 16:24:22 +00:00 |
|