Nick Hill
|
8fd920924c
|
[BugFix] Fix stuck stats/metrics after requests are aborted (#22995)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-08-20 13:50:29 +08:00 |
|
Cyrus Leung
|
de7b67a023
|
[CI/Build] Sync multimodal tests (#23181)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-20 05:06:42 +00:00 |
|
Cyrus Leung
|
64ab3c7253
|
[Doc] Update V1 status of various pooling models (#23189)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-20 10:33:41 +08:00 |
|
Michael Goin
|
d46d417b58
|
[CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce.py (#23132)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-08-19 20:18:52 -06:00 |
|
Michael Goin
|
0f4f0191d8
|
[CI/Build] Replace lm-eval gsm8k tests with faster implementation (#23002)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-08-19 15:07:30 -07:00 |
|
amirkl94
|
a38b8af4c3
|
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend (#22357)
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
|
2025-08-19 18:01:53 -04:00 |
|
Marko Rosenmueller
|
80141bbf2f
|
fix: use cache_salt for gpt-oss (#23186)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
|
2025-08-19 18:12:25 +00:00 |
|
22quinn
|
f7cf5b512e
|
[Frontend] Add /collective_rpc API endpoint (#23075)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
|
2025-08-19 17:29:32 +00:00 |
|
Isotr0py
|
d6a1a20973
|
[CI/Build] Update transformers to v4.55.2 (#23093)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
|
2025-08-19 10:06:17 -07:00 |
|
Yuge Zhang
|
24f4d1a224
|
Add return_token_ids parameter to OpenAI API endpoints (#22587)
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2025-08-19 09:48:31 -07:00 |
|
TJian
|
1298c67795
|
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL (#22742)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2025-08-19 15:25:57 +00:00 |
|
myselvess
|
b87cb97a53
|
[Model] support new model ovis2.5 (#23084)
Signed-off-by: myselvess <244285088@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
|
2025-08-19 13:12:59 +00:00 |
|
wang.yuqi
|
f856c33ce9
|
[Model] Add multi_label_classification support (#23173)
Signed-off-by: wang.yuqi <noooop@126.com>
|
2025-08-19 12:54:30 +00:00 |
|
elvischenv
|
03752dba8f
|
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21716)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
|
2025-08-19 08:22:15 -04:00 |
|
Isotr0py
|
31fd3265c8
|
[Bugfix] Fix broken Minimax-01-VL model (#22116)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
|
2025-08-19 08:49:29 +00:00 |
|
Woosuk Kwon
|
14006840ea
|
[V0 Deprecation] Remove V0 FlashInfer attention backend (#22776)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-18 19:54:16 -07:00 |
|
Robert Shaw
|
6603288736
|
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests (#22871)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-18 17:39:01 -07:00 |
|
Woosuk Kwon
|
c9b38be8aa
|
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT (#23041)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2025-08-18 17:20:38 -07:00 |
|
杨朱 · Kiki
|
569aefd134
|
chore: remove unnecessary patch_padding_side for the chatglm model (#23090)
Signed-off-by: carlory <baofa.fan@daocloud.io>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
|
2025-08-18 12:32:13 +00:00 |
|
Ning Xie
|
5a30bd10d8
|
[Bugfix] fix IntermediateTensors equal method (#23027)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
|
2025-08-18 02:58:11 -07:00 |
|
Cyrus Leung
|
27e8d1ea3e
|
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs (#23053)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-18 09:52:00 +00:00 |
|
Jee Jee Li
|
4d4061b6e7
|
[Kernel] Add cuda kernel for gpt_oss activation (#22951)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
|
2025-08-17 05:03:24 +00:00 |
|
Cyrus Leung
|
5c32143b9d
|
[Refactor] Defer tensor data construction in MultiModalKwargs (#23030)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-16 21:05:50 -07:00 |
|
afeldman-nm
|
bf7f470b22
|
[V1] Logits processors extensibility (#19912)
Signed-off-by: Andrew Feldman <afeldman@redhat.com>
Signed-off-by: Andrew Feldman <afeld2012@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-08-16 12:59:17 -07:00 |
|
Michael Goin
|
4fc722eca4
|
[Kernel/Quant] Remove AQLM (#22943)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
|
2025-08-16 19:38:21 +00:00 |
|
Michael Goin
|
3253ae765e
|
[Flaky CI] Increase timeout tolerance for test_mp_crash_detection+test_default_mm_lora_chat_completions (#23028)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-08-16 18:33:08 +00:00 |
|
Woonggi Min
|
68373d3126
|
[Frontend] Added support for HermesToolParser for models without special tokens (#16890)
Signed-off-by: minpeter <kali2005611@gmail.com>
|
2025-08-16 17:38:42 +00:00 |
|
汪志鹏
|
829bbd7882
|
[New Model] mBART model (#22883)
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
|
2025-08-16 12:16:58 +00:00 |
|
Cyrus Leung
|
4dff91c93d
|
[Refactor] Allow optional MultiModalKwargsItem in IPC (#23022)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2025-08-16 11:30:49 +00:00 |
|
Isotr0py
|
2dbccce8a6
|
[CI][Bugfix] Skip Ovis2 generation test because of broken remote code (#22954)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
|
2025-08-16 09:44:19 +00:00 |
|
Isotr0py
|
cc826a202b
|
[Multimodal] Update Tensor schema test to cover arbitrary shape mm inputs (#22867)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
|
2025-08-16 00:44:50 -07:00 |
|
Andrew Sansom
|
78863f8c5c
|
[BugFix] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors (#22962)
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
|
2025-08-16 06:25:10 +00:00 |
|
Calvin Chen
|
e4e37ded56
|
[V1] support min_tokens for detokenizer (#22014)
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
Co-authored-by: Nick Hill <nhill@redhat.com>
|
2025-08-16 02:28:10 +00:00 |
|
Nick Hill
|
ad0297d113
|
[Misc] Support passing multiple request ids at once to AsyncLLM.abort() (#22944)
Signed-off-by: Nick Hill <nhill@redhat.com>
|
2025-08-15 17:00:36 -07:00 |
|
Yong Hoon Shin
|
3e2f7985a2
|
Support multiple attention groups for KV sharing (#22672)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
|
2025-08-15 16:54:10 -07:00 |
|
Or Ozeri
|
c280066f9d
|
[v1] Move block_hashes from KVCacheManager to Request.block_hashes (#19728)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
|
2025-08-15 16:52:52 -07:00 |
|
Nick Hill
|
b9dc9d2607
|
[BugFix] Handle case where async utility call is cancelled (#22996)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai>
|
2025-08-15 17:38:42 -06:00 |
|
eigen
|
1723ef1aae
|
minor: zero workspace buffer init for flashinfer trtllm-gen attn (#22603)
|
2025-08-15 21:38:10 +00:00 |
|
Michael Goin
|
8a87cd27d9
|
[CI] Speed up Whisper tests by reusing server (#22859)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-08-15 16:56:31 -04:00 |
|
bnellnm
|
8ad7285ea2
|
[Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remove extra arguments from modular kernel methods. (#22035)
Signed-off-by: Bill Nell <bnell@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
|
2025-08-15 14:46:00 -04:00 |
|
fhl2000
|
74f441f4b5
|
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer (#20059)
Signed-off-by: fhl <2410591650@qq.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
|
2025-08-15 10:01:39 -04:00 |
|
Harry Mellor
|
e8b40c7fa2
|
[CI] Remove duplicated docs build from buildkite (#22924)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2025-08-15 05:58:06 -07:00 |
|
Thomas Parnell
|
75531a6c13
|
[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba2, Mamba1, Minimax) (#22928)
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Daniel Afrimi <danielafrimi8@gmail.com>
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
|
2025-08-15 12:57:06 +00:00 |
|
Staszek Paśko
|
22341b996e
|
Improve multimodal hasher performance for re-used Image prompts (#22825)
Signed-off-by: Staszek Pasko <staszek@gmail.com>
|
2025-08-15 12:32:56 +00:00 |
|
Sayandip Dutta
|
aa300c438d
|
[Bugfix] Unquote file uri before reading image (#22912)
Signed-off-by: Sayandip Dutta <sayandip199309@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2025-08-15 09:28:00 +00:00 |
|
amirai21
|
fe91ce9591
|
[V1] - Split Prefill and Decode for Mamba1 models (#22653)
Signed-off-by: amirk <amirk@ai21.com>
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
Co-authored-by: Asaf Joseph Gardin <39553475+Josephasafg@users.noreply.github.com>
|
2025-08-15 08:59:52 +00:00 |
|
wang.yuqi
|
5406ebf5c9
|
[CI] Pooling models mteb test uses enforce_eager (#22878)
Signed-off-by: wang.yuqi <noooop@126.com>
|
2025-08-15 01:16:15 -07:00 |
|
Asaf Joseph Gardin
|
3d232dbd19
|
[Mamba] - refactor: Renamed mamba_attn to mamba2_attn (#22818)
Signed-off-by: asafg <asafg@ai21.com>
Co-authored-by: asafg <asafg@ai21.com>
|
2025-08-15 06:38:05 +00:00 |
|
Wentao Ye
|
5c3fbfe46b
|
[Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughput Improvement (#22763)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2025-08-15 06:27:30 +00:00 |
|
Michael Goin
|
0fe85087a9
|
[CI Perf] Prune tests in tests/kernels/attention/ (#22936)
Signed-off-by: mgoin <mgoin64@gmail.com>
|
2025-08-14 21:34:53 -06:00 |
|