1030 Commits

Author SHA1 Message Date
Pavani Majety
1d353b6352 [Core] Always use tensor cores for Flashinfer Decode Wrapper (#23214)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-08-21 16:02:11 -04:00
Nick Hill
603fbbbce0 [Misc] Misc code cleanup/simplification (#23304)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-21 17:22:55 +00:00
Ming Yang
10f535c086 [Bugfix] Fix port conflict by obtaining a list of open ports upfront (#21894)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-08-21 10:22:18 -07:00
Roger Wang
79f05e4436 [Multimodal] Always enable hashing mm data (#23308)
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-21 07:23:28 -07:00
wang.yuqi
d70a16625d [Performance] V1 Pooling Models E2E Performance Optimization (#23162)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-08-21 13:26:09 +00:00
Cyrus Leung
0c6e40bbaa [Refactor] Simplify code for MM budget (#23310)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-21 08:00:16 +00:00
Paul Pak
2e2000f352 [Model] Add LFM2 architecture (#22845)
Signed-off-by: Paul Pak <paulpak58@gmail.com>
2025-08-21 09:35:07 +02:00
22quinn
f571ff8eb6 [Sampler] Support returning final logprobs (#22387)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-20 21:28:32 -07:00
Asaf Joseph Gardin
3663870c72 [V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support (#23035)
Signed-off-by: asafg <asafg@ai21.com>
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
Co-authored-by: asafg <asafg@ai21.com>
2025-08-20 20:08:51 -07:00
Woosuk Kwon
b029de9902 [Optimization] Make new_block_ids None if empty (#23262)
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
2025-08-20 18:25:56 -07:00
Matthew Bonanni
10cc12ba66 Feature/mla tests (#23195)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
2025-08-20 21:46:47 +00:00
Matthew Bonanni
a4fbb32fab Remove chunked_prefill_enabled flag in V1 MLA (#23183)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
2025-08-20 21:43:17 +00:00
rongfu.leng
4fbda0b20c [Feature] use --eplb_config to set eplb param (#20562)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-20 14:07:28 -07:00
JartX
3b11b26b50 [FIXBUG ] Allow disabling rocm_aiter_fa backend for ROCm GPUs not compatible with AITER (#22795)
Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-08-20 09:08:29 -07:00
Woosuk Kwon
d6d13bd49e [Misc] Add max_seq_len to CommonAttentionMetadata (#23216)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-20 09:05:29 -07:00
xyxinyang
7cd17e22d7 [Model][V1] Support Ernie MTP (#22169)
Signed-off-by: zhouchong <zhouchong03@baidu.com>
Co-authored-by: zhouchong <zhouchong03@baidu.com>
2025-08-20 20:41:55 +08:00
who who who
d983769c41 fix cuda graph (#22721)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
2025-08-20 06:24:37 +00:00
Nick Hill
8fd920924c [BugFix] Fix stuck stats/metrics after requests are aborted (#22995)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-20 13:50:29 +08:00
Zebing Lin
a634733f67 [Attention] Optimize make_local_attention_virtual_batches for Flash Attention (#23185)
Signed-off-by: linzebing <linzebing1995@gmail.com>
2025-08-20 02:57:47 +00:00
Chenheli Hua
e58c5a9768 [Core] Add torch profiler CPU traces for AsyncLLM. (#21794)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-08-20 02:32:47 +00:00
633WHU
0167efe20d [Core] Optimize scheduler request removal for single completions (#21917)
Signed-off-by: chiliu <chiliu@paypal.com>
Signed-off-by: chiliu <cliu_whu@yeah.net>
Co-authored-by: chiliu <chiliu@paypal.com>
2025-08-19 18:25:59 -07:00
Lucas Wilkinson
14e2b0730b [BugFix] fix CUTLASS MLA full cudagraph (#23200)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-08-19 22:17:08 +00:00
Woosuk Kwon
e61bac87ee [Misc] Minor refactoring for FlashInfer backend (#23147)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 13:11:51 -07:00
Woosuk Kwon
5b5f350d67 [Misc] Enable yapf for FlashInfer backend (#23193)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 10:33:47 -07:00
elvischenv
03752dba8f [NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21716)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-08-19 08:22:15 -04:00
Woosuk Kwon
40f26734b9 [Misc] Fix seq_lens for graph capture (#23175)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 03:58:16 -07:00
Woosuk Kwon
21bcc8263f [Misc] Avoid accessing req_ids inside a loop (#23159)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-19 09:39:38 +00:00
Wentao Ye
90bbe0a5ad [Log] Warning Once for Cutlass MLA (#23137)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-08-18 23:24:16 -07:00
Nikhil Suryawanshi
78dba404ad [Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes (#22725)
Signed-off-by: Nikhil Suryawanshi <suryawanshin74@gmail.com>
2025-08-19 04:40:37 +00:00
Chengji Yao
e9d6a3db69 [TPU] make ptxla not imported when using tpu_commons (#23081)
Signed-off-by: Chengji Yao <chengjiyao@gmail.com>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Chengji Yao <chengjiyao@gmail.com>
2025-08-19 11:46:42 +08:00
Woosuk Kwon
c9b38be8aa [Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT (#23041)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 17:20:38 -07:00
Woosuk Kwon
0dd3f4f5ab [Misc] Minor refactoring for prepare_inputs (#23116)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-18 16:58:05 -07:00
Cyrus Leung
27e8d1ea3e [Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs (#23053)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-18 09:52:00 +00:00
Ning Xie
08d5f7113a [Misc] refactor function name (#23029)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-17 22:16:21 -07:00
Ning Xie
7be3a59d8e [Misc] enhance static type hint (#23059)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-17 22:09:08 -07:00
Woosuk Kwon
8ea0c2753a [Misc] Minor code cleanup for _get_prompt_logprobs_dict (#23064)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-17 18:16:03 -07:00
Calvin Chen
21e39436c8 [XPU] fix xpu to set cudagraph batch sizes (#23044)
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
2025-08-17 21:45:42 +00:00
Woosuk Kwon
6d243efeda [Misc] Convert use_structured_output property into constant (#23060)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-08-17 12:41:38 -07:00
Ning Xie
87f48623a5 [Misc] method name typo fix (#23042)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-08-16 21:49:14 -07:00
Cyrus Leung
5c32143b9d [Refactor] Defer tensor data construction in MultiModalKwargs (#23030)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-16 21:05:50 -07:00
afeldman-nm
bf7f470b22 [V1] Logits processors extensibility (#19912)
Signed-off-by: Andrew Feldman <afeldman@redhat.com>
Signed-off-by: Andrew Feldman <afeld2012@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-08-16 12:59:17 -07:00
Michael Goin
000cceca8c [Bugfix gpt-oss] Fix float32 convert for flashinfer sink support (#23016)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-08-16 11:16:00 -07:00
Cyrus Leung
4dff91c93d [Refactor] Allow optional MultiModalKwargsItem in IPC (#23022)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-08-16 11:30:49 +00:00
Calvin Chen
e4e37ded56 [V1] support min_tokens for detokener (#22014)
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-08-16 02:28:10 +00:00
Nicolò Lucchesi
070da660c1 [Kernel] Simplify get_kv_cache_layout and cache use_trtllm_attention env-dependent bit (#22735)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-08-16 00:14:08 +00:00
Nick Hill
ad0297d113 [Misc] Support passing multiple request ids at once to AsyncLLM.abort() (#22944)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-08-15 17:00:36 -07:00
Yong Hoon Shin
3e2f7985a2 Support multiple attention groups for KV sharing (#22672)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-08-15 16:54:10 -07:00
Or Ozeri
c280066f9d [v1] Move block_hashes from KVCacheManager to Request.block_hashes (#19728)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
2025-08-15 16:52:52 -07:00
Nick Hill
b9dc9d2607 [BugFix] Handle case where async utility call is cancelled (#22996)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
2025-08-15 17:38:42 -06:00
rishitdholakia13
1fc375dc05 [Structured Outputs] [Bug] Fix misalignment in apply_grammar_bitmask causing unintended masking and NaN logits (#22963)
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
2025-08-15 23:25:05 +00:00