Nick Hill
|
cd32d6f586
|
[Model Runner V2] Some code simplification (#36929)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
|
2026-03-13 00:59:23 +00:00 |
|
Matthew Bonanni
|
f444c05c32
|
[Attention] Use FA4 for MLA prefill (#34732)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-12 12:10:17 -04:00 |
|
grimulkan
|
a1257fd1ea
|
[Kernel] Add FP8 KV cache support to Triton MLA decode attention (#34597)
Signed-off-by: grimulkan <grimulkan@gmail.com>
|
2026-03-12 08:32:34 -07:00 |
|
Kunshang Ji
|
53ec16a705
|
[Hardware] Replace torch.cuda.device_count/current_device/set_device API (#36145)
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
|
2026-03-12 07:57:47 -07:00 |
|
Nick Hill
|
36735fd772
|
[BugFix] Fix multiple/duplicate stdout prefixes (#36822)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
|
2026-03-12 12:23:21 +08:00 |
|
Woosuk Kwon
|
2f8b4ce0c0
|
[Model Runner V2] Do not initialize sampler for non-last PP ranks (#36824)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-12 03:55:28 +00:00 |
|
Wentao Ye
|
c34ba6b961
|
[Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement (#36710)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-12 08:37:01 +08:00 |
|
Aaron Hao
|
d6b61e5166
|
[BUG] Fix async rlhf tests (#35811)
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
|
2026-03-11 18:06:10 -04:00 |
|
Woosuk Kwon
|
55eed6b7a5
|
[Model Runner V2] Add WhisperModelState [6/N] (#35790)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-11 14:20:38 -07:00 |
|
Giancarlo Delfin
|
c77181e534
|
[Model Runner V2] Add probabilistic rejection sampling for spec decoding (#35461)
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
|
2026-03-11 14:04:32 -07:00 |
|
Wentao Ye
|
35bdca5431
|
[Refactor] Remove dead code in KV connector (#36424)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-11 19:40:17 +00:00 |
|
Or Ozeri
|
a1a3523a56
|
[KVConnector] Support worker -> scheduler metadata (#31964)
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
|
2026-03-11 17:36:37 +00:00 |
|
Hongxin Xu
|
bea02cdf93
|
Fix routed experts capture for hybrid models (Mamba + Attention) (#35744)
Signed-off-by: arlenxu <arlenxu@tencent.com>
Signed-off-by: xhx1022 <1737006628@qq.com>
Co-authored-by: arlenxu <arlenxu@tencent.com>
|
2026-03-11 08:53:10 -07:00 |
|
Jhao-Ting Chen
|
5573894737
|
Kimi k2.5 MLA based eagle3 (#36361)
Signed-off-by: Izzy Putterman <iputterman@nvidia.com>
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com>
Co-authored-by: Izzy Putterman <iputterman@nvidia.com>
|
2026-03-11 11:36:11 -04:00 |
|
Woosuk Kwon
|
8ccbcda5c0
|
[Model Runner V2] Remove unused warmup_for_prefill method (#36762)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-11 08:02:44 -07:00 |
|
tvirolai-amd
|
a9e532afe2
|
[ROCm][Perf] Allow MTP lens > 1 in Sparse MLA (#36681)
Signed-off-by: Teemu Virolainen <teemu.virolainen@amd.com>
|
2026-03-11 14:43:03 +00:00 |
|
Wuxun Zhang
|
e584dce52b
|
Add XPU MLA Sparse backend for DeepSeek v3.2 (#33230)
Signed-off-by: Zhang, Wuxun <wuxun.zhang@intel.com>
|
2026-03-11 19:19:15 +08:00 |
|
YiSheng5
|
c910eeb125
|
[XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. (#36593)
Signed-off-by: yisheng <yi.sheng@intel.com>
|
2026-03-11 09:17:46 +00:00 |
|
Nicolò Lucchesi
|
098d844731
|
[NIXL][1/N] Refactor kernel_block_size detection (#35752)
Signed-off-by: NickLucche <nlucches@redhat.com>
|
2026-03-11 01:11:23 -07:00 |
|
JartX
|
a40ee486f2
|
[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 (#35923)
Signed-off-by: JartX <sagformas@epdcenter.es>
Co-authored-by: akaratza <akaratza@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
|
2026-03-11 07:45:57 +00:00 |
|
pschlan-amd
|
eac2dc2b41
|
AITER MLA backend: Avoid CPU sync in _build_decode (#35765)
Signed-off-by: Patrick Schlangen <pschlan@amd.com>
|
2026-03-11 07:25:00 +00:00 |
|
Sladyn
|
4aaaf8c8ce
|
feat(spec_decode): fuse EAGLE step slot mapping and metadata updates (#33503)
Signed-off-by: sladynnunes <snunes@usc.edu>
|
2026-03-11 04:35:33 +00:00 |
|
Hongbin Guo
|
4bf533623b
|
[Doc] Fix duplicate words in comments (#36713)
Signed-off-by: Hongbin10 <jdmjdm1998@163.com>
|
2026-03-10 21:28:31 -07:00 |
|
Matthew Bonanni
|
5f77ef15ae
|
[Misc][Attention] Clean up unused method in CPU_ATTN (#36673)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-10 21:27:22 -07:00 |
|
Wentao Ye
|
a8ff2cca92
|
[Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement (#35781)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
|
2026-03-10 21:25:30 -07:00 |
|
Benjamin Chislett
|
9040cd40af
|
[DSV3.2][MTP] Optimize Indexer MTP handling (#36723)
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
|
2026-03-11 12:16:56 +08:00 |
|
fangyuchu
|
fa0d353acf
|
[Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks (#35194)
Signed-off-by: fangyuchu <fangyuchu@qq.com>
|
2026-03-11 03:22:21 +00:00 |
|
Matthew Bonanni
|
8ab3d7427c
|
[Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling (#36691)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-11 03:01:07 +00:00 |
|
Woosuk Kwon
|
195d1ca3e8
|
[Minor] Enhance error message for TRTLLM decode uniformity check (#36609)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-10 15:38:45 -07:00 |
|
Nick Hill
|
65b2f405dc
|
[Core] Simplify core kv-cache blocks initialization logic (#36521)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
|
2026-03-10 20:20:02 +00:00 |
|
Woosuk Kwon
|
f088a831dd
|
[Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata (#36626)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-10 09:30:56 -07:00 |
|
Harry Mellor
|
f83b933b84
|
[CI] Bump mypy version to 1.19.1 (#36104)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
|
2026-03-10 09:18:28 -07:00 |
|
Pleaplusone
|
82f3f30e26
|
[ROCm][Perf] Enable sparse_mla's cudagraph on ROCm platform (#35719)
Signed-off-by: ganyi <ygan@amd.com>
|
2026-03-10 09:14:35 -07:00 |
|
Matthew Bonanni
|
9095cbbfb6
|
[Bugfix][Sparse MLA] report indexer CG support properly (#36519)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-10 09:14:31 -07:00 |
|
Srinivasoo7
|
106ff69c4e
|
feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency (#35342)
Signed-off-by: srinivas_oo7 <Sriusa4414@gmail.com>
Signed-off-by: Sriusa4414@gmail.com
Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com>
Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com>
Co-authored-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
|
2026-03-10 14:43:40 +00:00 |
|
SoluMilken
|
409c4e632d
|
[Misc] fix typo: homogenous-> homogeneous (2 lines change) (#36508)
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw>
|
2026-03-10 06:25:37 -07:00 |
|
Mark McLoughlin
|
234860399b
|
[Frontend][Core] Revert "Add shutdown timeout" (#34730 and #36270) (#36628)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
|
2026-03-10 06:20:41 -07:00 |
|
Vadim Gimpelson
|
4ff8c3c8f9
|
[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU (#35219)
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
|
2026-03-10 03:32:20 -07:00 |
|
Nick Hill
|
ddbb0d230a
|
[Model Runner V2] Fix mm input embeddings lookup (#36588)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
|
2026-03-10 00:24:58 -07:00 |
|
Nick Hill
|
9efc3bdcd6
|
[Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill (#36580)
Signed-off-by: Nick Hill <nickhill123@gmail.com>
|
2026-03-10 00:23:42 -07:00 |
|
Wentao Ye
|
7279374f91
|
[Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement (#36159)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
|
2026-03-09 20:55:58 -07:00 |
|
Woosuk Kwon
|
006aea17d7
|
[BugFix] Remove incorrect assert in split_decodes_and_prefills (#36553)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-09 20:02:02 -07:00 |
|
Woosuk Kwon
|
2a194ddd72
|
[Model Runner V2] Add model_state inputs to CUDA graph capture (#36544)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-09 15:14:51 -07:00 |
|
Lucas Wilkinson
|
483463f735
|
[MRV2] Extensible CG dispatch rework (#35959)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
|
2026-03-09 13:58:45 -07:00 |
|
Matthew Bonanni
|
4e571ce643
|
[MTP][Misc] Clean up dead code (#36507)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-09 14:43:06 -04:00 |
|
Woosuk Kwon
|
10a5f4d53d
|
[Model Runner V2] Use NamedTuple for execute_model_state (#35930)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-09 11:17:34 -07:00 |
|
Woosuk Kwon
|
6e956d9eca
|
[Model Runner V2] Add dummy profile_cudagraph_memory API (#36520)
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
|
2026-03-09 10:20:13 -07:00 |
|
Andreas Karatzas
|
c174d54f86
|
[ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks (#36292)
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
2026-03-09 12:02:41 -05:00 |
|
Roberto L. Castro
|
580864d81e
|
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 (#34917)
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com>
|
2026-03-09 09:50:36 -07:00 |
|
Matthew Bonanni
|
00c4cb5606
|
[Bugfix] Clear stale CG keys after memory profiling (#36416)
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
|
2026-03-09 11:56:00 -04:00 |
|