Pavani Majety
|
1d353b6352
|
[Core] Always use tensor cores for Flashinfer Decode Wrapper (#23214)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
|
2025-08-21 16:02:11 -04:00 |
|
Nick Hill
|
603fbbbce0
|
[Misc] Misc code cleanup/simplification (#23304)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-08-21 17:22:55 +00:00 |
|
Ming Yang
|
10f535c086
|
[Bugfix] Fix port conflict by obtaining a list of open ports upfront (#21894)
Signed-off-by: Ming Yang <minos.future@gmail.com>
|
2025-08-21 10:22:18 -07:00 |
|
Roger Wang
|
79f05e4436
|
[Multimodal] Always enable hashing mm data (#23308)
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-21 07:23:28 -07:00 |
|
wang.yuqi
|
d70a16625d
|
[Performance] V1 Pooling Models E2E Performance Optimization (#23162)
Signed-off-by: wang.yuqi <noooop@126.com>
|
2025-08-21 13:26:09 +00:00 |
|
Cyrus Leung
|
0c6e40bbaa
|
[Refactor] Simplify code for MM budget (#23310)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-21 08:00:16 +00:00 |
|
Paul Pak
|
2e2000f352
|
[Model] Add LFM2 architecture (#22845)
Signed-off-by: Paul Pak <paulpak58@gmail.com>
|
2025-08-21 09:35:07 +02:00 |
|
22quinn
|
f571ff8eb6
|
[Sampler] Support returning final logprobs (#22387)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-20 21:28:32 -07:00 |
|
Asaf Joseph Gardin
|
3663870c72
|
[V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support (#23035)
Signed-off-by: asafg <asafg@ai21.com>
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com>
Co-authored-by: asafg <asafg@ai21.com>
|
2025-08-20 20:08:51 -07:00 |
|
Woosuk Kwon
|
b029de9902
|
[Optimization] Make new_block_ids None if empty (#23262)
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai>
|
2025-08-20 18:25:56 -07:00 |
|
Matthew Bonanni
|
10cc12ba66
|
Feature/mla tests (#23195)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2025-08-20 21:46:47 +00:00 |
|
Matthew Bonanni
|
a4fbb32fab
|
Remove chunked_prefill_enabled flag in V1 MLA (#23183)
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
|
2025-08-20 21:43:17 +00:00 |
|
rongfu.leng
|
4fbda0b20c
|
[Feature] use --eplb_config to set eplb param (#20562)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-08-20 14:07:28 -07:00 |
|
JartX
|
3b11b26b50
|
[FIXBUG] Allow disabling rocm_aiter_fa backend for ROCm GPUs not compatible with AITER (#22795)
Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
|
2025-08-20 09:08:29 -07:00 |
|
Woosuk Kwon
|
d6d13bd49e
|
[Misc] Add max_seq_len to CommonAttentionMetadata (#23216)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-20 09:05:29 -07:00 |
|
xyxinyang
|
7cd17e22d7
|
[Model][V1] Support Ernie MTP (#22169)
Signed-off-by: zhouchong <zhouchong03@baidu.com>
Co-authored-by: zhouchong <zhouchong03@baidu.com>
|
2025-08-20 20:41:55 +08:00 |
|
who who who
|
d983769c41
|
fix cuda graph (#22721)
Signed-off-by: fsx950223 <fsx950223@outlook.com>
|
2025-08-20 06:24:37 +00:00 |
|
Nick Hill
|
8fd920924c
|
[BugFix] Fix stuck stats/metrics after requests are aborted (#22995)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-08-20 13:50:29 +08:00 |
|
Zebing Lin
|
a634733f67
|
[Attention] Optimize make_local_attention_virtual_batches for Flash Attention (#23185)
Signed-off-by: linzebing <linzebing1995@gmail.com>
|
2025-08-20 02:57:47 +00:00 |
|
Chenheli Hua
|
e58c5a9768
|
[Core] Add torch profiler CPU traces for AsyncLLM. (#21794)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
|
2025-08-20 02:32:47 +00:00 |
|
633WHU
|
0167efe20d
|
[Core] Optimize scheduler request removal for single completions (#21917)
Signed-off-by: chiliu <chiliu@paypal.com>
Signed-off-by: chiliu <cliu_whu@yeah.net>
Co-authored-by: chiliu <chiliu@paypal.com>
|
2025-08-19 18:25:59 -07:00 |
|
Lucas Wilkinson
|
14e2b0730b
|
[BugFix] fix CUTLASS MLA full cudagraph (#23200)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2025-08-19 22:17:08 +00:00 |
|
Woosuk Kwon
|
e61bac87ee
|
[Misc] Minor refactoring for FlashInfer backend (#23147)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-19 13:11:51 -07:00 |
|
Woosuk Kwon
|
5b5f350d67
|
[Misc] Enable yapf for FlashInfer backend (#23193)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-19 10:33:47 -07:00 |
|
elvischenv
|
03752dba8f
|
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21716)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2025-08-19 08:22:15 -04:00 |
|
Woosuk Kwon
|
40f26734b9
|
[Misc] Fix seq_lens for graph capture (#23175)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-19 03:58:16 -07:00 |
|
Woosuk Kwon
|
21bcc8263f
|
[Misc] Avoid accessing req_ids inside a loop (#23159)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-19 09:39:38 +00:00 |
|
Wentao Ye
|
90bbe0a5ad
|
[Log] Warning Once for Cutlass MLA (#23137)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-08-18 23:24:16 -07:00 |
|
Nikhil Suryawanshi
|
78dba404ad
|
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes (#22725)
Signed-off-by: Nikhil Suryawanshi <suryawanshin74@gmail.com>
|
2025-08-19 04:40:37 +00:00 |
|
Chengji Yao
|
e9d6a3db69
|
[TPU] make ptxla not imported when using tpu_commons (#23081)
Signed-off-by: Chengji Yao <chengjiyao@gmail.com>
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Co-authored-by: Chengji Yao <chengjiyao@gmail.com>
|
2025-08-19 11:46:42 +08:00 |
|
Woosuk Kwon
|
c9b38be8aa
|
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT (#23041)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-18 17:20:38 -07:00 |
|
Woosuk Kwon
|
0dd3f4f5ab
|
[Misc] Minor refactoring for prepare_inputs (#23116)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-18 16:58:05 -07:00 |
|
Cyrus Leung
|
27e8d1ea3e
|
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs (#23053)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-18 09:52:00 +00:00 |
|
Ning Xie
|
08d5f7113a
|
[Misc] refactor function name (#23029)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
|
2025-08-17 22:16:21 -07:00 |
|
Ning Xie
|
7be3a59d8e
|
[Misc] enhance static type hint (#23059)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
|
2025-08-17 22:09:08 -07:00 |
|
Woosuk Kwon
|
8ea0c2753a
|
[Misc] Minor code cleanup for _get_prompt_logprobs_dict (#23064)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-17 18:16:03 -07:00 |
|
Calvin Chen
|
21e39436c8
|
[XPU] fix xpu to set cudagraph batch sizes (#23044)
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
|
2025-08-17 21:45:42 +00:00 |
|
Woosuk Kwon
|
6d243efeda
|
[Misc] Convert use_structured_output property into constant (#23060)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-17 12:41:38 -07:00 |
|
Ning Xie
|
87f48623a5
|
[Misc] method name typo fix (#23042)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
|
2025-08-16 21:49:14 -07:00 |
|
Cyrus Leung
|
5c32143b9d
|
[Refactor] Defer tensor data construction in MultiModalKwargs (#23030)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-16 21:05:50 -07:00 |
|
afeldman-nm
|
bf7f470b22
|
[V1] Logits processors extensibility (#19912)
Signed-off-by: Andrew Feldman <afeldman@redhat.com>
Signed-off-by: Andrew Feldman <afeld2012@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-08-16 12:59:17 -07:00 |
|
Michael Goin
|
000cceca8c
|
[Bugfix gpt-oss] Fix float32 convert for flashinfer sink support (#23016)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-08-16 11:16:00 -07:00 |
|
Cyrus Leung
|
4dff91c93d
|
[Refactor] Allow optional MultiModalKwargsItem in IPC (#23022)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-16 11:30:49 +00:00 |
|
Calvin Chen
|
e4e37ded56
|
[V1] support min_tokens for detokenizer (#22014)
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
Co-authored-by: Nick Hill <nhill@redhat.com>
|
2025-08-16 02:28:10 +00:00 |
|
Nicolò Lucchesi
|
070da660c1
|
[Kernel] Simplify get_kv_cache_layout and cache use_trtllm_attention env-dependent bit (#22735)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2025-08-16 00:14:08 +00:00 |
|
Nick Hill
|
ad0297d113
|
[Misc] Support passing multiple request ids at once to AsyncLLM.abort() (#22944)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-08-15 17:00:36 -07:00 |
|
Yong Hoon Shin
|
3e2f7985a2
|
Support multiple attention groups for KV sharing (#22672)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
|
2025-08-15 16:54:10 -07:00 |
|
Or Ozeri
|
c280066f9d
|
[v1] Move block_hashes from KVCacheManager to Request.block_hashes (#19728)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
|
2025-08-15 16:52:52 -07:00 |
|
Nick Hill
|
b9dc9d2607
|
[BugFix] Handle case where async utility call is cancelled (#22996)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
|
2025-08-15 17:38:42 -06:00 |
|
rishitdholakia13
|
1fc375dc05
|
[Structured Outputs] [Bug] Fix misalignment in apply_grammar_bitmask causing unintended masking and NaN logits (#22963)
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com>
|
2025-08-15 23:25:05 +00:00 |
|