Giancarlo Delfin
|
c32e97602d
|
[Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling (#38045)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
|
2026-03-26 13:38:12 -07:00 |
|
Woosuk Kwon
|
144030c84e
|
Relocate Encoder CUDA graph manager (#38116)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
|
2026-03-25 20:52:12 -07:00 |
|
Wentao Ye
|
d7e93e13fb
|
[Feature] EPLB Support for GPU Model Runner v2 (#37488)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-25 08:16:39 -07:00 |
|
Junhao
|
b73b5b0629
|
Make microbatch optimization (DBO) work with general models (#37926)
Signed-off-by: Junhao Li <junhao@ubicloud.com>
|
2026-03-24 14:40:08 -07:00 |
|
Woosuk Kwon
|
4b53740d7f
|
[MRV2] Fix for DS v3.2 (#38030)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-24 14:03:24 -07:00 |
|
Nick Hill
|
4e824d1c83
|
[Model Runner V2][Minor] Simplify PP logic (#38031)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
|
2026-03-24 13:57:17 -07:00 |
|
Ming Yang
|
c07e2ca6e0
|
Fix Mamba state corruption from referencing stale block table entries (#37728)
|
2026-03-24 10:29:59 -07:00 |
|
Sungjae Lee
|
4731884796
|
[Feature] limit thinking tokens (hard limit) (#20859)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>
Signed-off-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-03-24 09:53:07 -07:00 |
|
Li, Jiang
|
352b90c4a4
|
[Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU (#37987)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-03-24 07:00:20 -07:00 |
|
Wentao Ye
|
c59a132f96
|
[V0 Deprecation] Refactor kv cache from list to element (#37487)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-23 20:10:11 -07:00 |
|
Giancarlo Delfin
|
8f4824b664
|
[Model Runner V2] Gather multimodal embeddings before draft model postprocess (#37932)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
|
2026-03-23 18:14:13 -07:00 |
|
Matthew Bonanni
|
fafe76b4af
|
[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding (#32951)
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com>
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
|
2026-03-23 15:37:22 -04:00 |
|
Woosuk Kwon
|
ffb5b32b5f
|
[MRV2] Consider spec decoding in warmup (#37812)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
|
2026-03-23 17:45:43 +00:00 |
|
yanghui1-arch
|
7151ae6528
|
[Bugfix] RoBERTa position_id accumulation in CUDA graph padding region (#37873)
Signed-off-by: dass90 <3053034939@qq.com>
|
2026-03-23 14:59:21 +00:00 |
|
DorBernsohn
|
7938d12119
|
[Bugfix] Fix CPU backend crash in KV cache block zeroing (#37550)
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com>
|
2026-03-23 11:35:45 +00:00 |
|
Baorun (Lauren) Mu
|
f85e479e66
|
[Feature] ViT Full CUDA Graph (#35963)
Signed-off-by: Baorun Mu <bmu@nvidia.com>
|
2026-03-23 13:01:10 +08:00 |
|
zhanqiuhu
|
63f49b8bd4
|
[Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism (#35162)
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-22 20:48:25 +00:00 |
|
Woosuk Kwon
|
a5e9d511de
|
[MRV2] Use FP64 for Gumbel noise (#37798)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-22 12:28:10 -07:00 |
|
Woosuk Kwon
|
ce9b1d76cf
|
[MRV2] Skip hidden states allocation for PW CUDA graphs (#37818)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-22 11:47:21 -07:00 |
|
Giancarlo Delfin
|
b3e846017d
|
[Model Runner V2] Support multi-modal embeddings for spec decode model (#36097)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-22 02:48:43 -07:00 |
|
Andreas Karatzas
|
66f927f205
|
[Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing (#37775)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
2026-03-22 03:22:24 +00:00 |
|
Francesco Fusco
|
298e510848
|
[Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() (#37318)
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
|
2026-03-21 09:29:43 +00:00 |
|
Itay Alroy
|
c57d38d603
|
elastic_ep: Fix issues with repeated scale up/down cycles (#37131)
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com>
|
2026-03-20 23:13:02 +00:00 |
|
Santino Ramos
|
85f671b8e1
|
[Model Runner V2] Support Streaming Inputs (#37028)
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com>
|
2026-03-20 20:42:25 +00:00 |
|
Lucas Wilkinson
|
e1d85e5c24
|
[Attention] Support distinguishing between short extends and decodes (#37303)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-03-20 10:49:36 -07:00 |
|
Peter Pan
|
79eb9369c5
|
fix CUDAGraph memory being counted twice (#37426)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Peter Pan <peter.pan@daocloud.io>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-20 17:36:32 +00:00 |
|
Woosuk Kwon
|
e80cfe575d
|
[MRV2] Avoid recompilation of _gather_block_tables_kernel (#37645)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-20 10:31:45 -07:00 |
|
wang.yuqi
|
ed359c497a
|
[Model] Deprecate the score task (this will not affect users). (#37537)
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
|
2026-03-20 08:07:56 +00:00 |
|
Giancarlo Delfin
|
dcee9be95a
|
[Model Runner V2] Fix draft logits not populated during cudagraph replay (#37639)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
|
2026-03-20 07:43:47 +00:00 |
|
Giancarlo Delfin
|
39474513f6
|
[Model Runner V2] fix draft attention metadata generation (#37364)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
|
2026-03-19 21:05:15 -07:00 |
|
Fadi Arafeh
|
2890aecce5
|
[CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded (#37561)
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
|
2026-03-19 16:35:45 +00:00 |
|
Woosuk Kwon
|
40b8363b45
|
[MRV2] Use fp32 for draft logits (#37526)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-19 08:41:21 -07:00 |
|
Collin McCarthy
|
0b6d52629f
|
Support temporal compression for Nemotron-3-VL videos (#36808)
Signed-off-by: Collin McCarthy <cmccarthy@nvidia.com>
|
2026-03-19 08:02:19 +00:00 |
|
Ziming Huang
|
d3cc379567
|
[Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__ (#37425)
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com>
|
2026-03-19 15:43:48 +08:00 |
|
Wentao Ye
|
e37ff5b5c8
|
[Perf] Optimize token_embed for pooling models, 1.0% token throughput improvement (#37347)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-19 10:27:51 +08:00 |
|
Giancarlo Delfin
|
053f3b6309
|
[Model Runner V2] Spec decode rejection sampler logprobs support (#37237)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
|
2026-03-19 01:36:27 +00:00 |
|
Giancarlo Delfin
|
04244fd0e1
|
[Model Runner V2] Spec decode rejection sampler greedy support (#37238)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
|
2026-03-18 15:59:03 -07:00 |
|
Philip Ottesen
|
0091017188
|
fix(worker): optimize swap_states to copy only active token prefixes (#34733)
Signed-off-by: Philip Ottesen <phiott256@gmail.com>
|
2026-03-18 14:59:27 -07:00 |
|
Wentao Ye
|
0d81a1fe61
|
[V0 Deprecation] Deprecate virtual engine (#37195)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-18 14:30:14 -07:00 |
|
Wentao Ye
|
c373b5c00d
|
[Log] Reduce duplicate log (#37313)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-18 10:57:44 -04:00 |
|
Or Ozeri
|
fcf0687b27
|
[kv_offload+HMA][0/N]: Support block-level preemption handling (#34805)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
|
2026-03-18 08:49:53 +02:00 |
|
JartX
|
e8f9dbc369
|
[Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling (#36720)
Signed-off-by: JartX <sagformas@epdcenter.es>
|
2026-03-17 17:55:34 -04:00 |
|
Benjamin Chislett
|
f63ed7b5ac
|
[Bugfix] Fix DP MTP Dummy Run (#35243)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
|
2026-03-17 11:16:48 -04:00 |
|
Benjamin Chislett
|
8a680463fa
|
[Bugfix] Fix NemotronH MTP + Chunked Prefill (#35447)
|
2026-03-17 07:07:33 +01:00 |
|
Kunshang Ji
|
d157216093
|
[BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer (#37197)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
|
2026-03-16 21:39:56 +01:00 |
|
haosdent
|
ca1954d58c
|
[Bugfix] Disable cross-layer KV cache for MLA attention backends (#37090)
Signed-off-by: haosdent <haosdent@gmail.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
|
2026-03-16 19:03:10 +02:00 |
|
Fynn Schmitt-Ulms
|
04bf5a35fa
|
[Spec Decode] Update extract_hidden_states to use deferred kv_connector clear (#37013)
|
2026-03-16 14:53:45 +01:00 |
|
Kunshang Ji
|
747b068136
|
[Hardware] Replace memory related torch.cuda APIs (#37031)
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
|
2026-03-16 10:24:48 +00:00 |
|
Woosuk Kwon
|
96efb91480
|
[Model Runner V2] Fix processed logits in sample() (#37144)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-16 00:35:49 -07:00 |
|
Li, Jiang
|
7362b4450a
|
[Bugfix] Avoid LD_PRELOAD check on MacOS (#37145)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
|
2026-03-15 23:31:44 -07:00 |
|