013b73e9b2
Fix managed KV cache: use __cuda_array_interface__ instead of UntypedStorage.from_blob
...
UntypedStorage.from_blob was removed in PyTorch 2.11+. Use the
standard __cuda_array_interface__ protocol to wrap cudaMallocManaged
pointers into PyTorch tensors; this works across all PyTorch versions.
Also removed the cudaMemAdvise calls: passing the cudaMemLocation struct
via ctypes is broken on ARM64 (returns EINVAL). The advise hints
are optional; pages still fault over to the GPU on demand regardless.
CPU memset (ctypes.memset) is still used instead of cudaMemset to
avoid forcing all pages into HBM during zeroing.
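A minimal sketch of the wrapping approach described above, not the code from this commit. `ManagedBuffer` and `as_torch_tensor` are hypothetical names; the dictionary keys follow the CUDA Array Interface, and the torch import is deferred so the descriptor can be inspected on a machine without PyTorch or a GPU.

```python
from typing import Any, Dict


class ManagedBuffer:
    """Hypothetical holder for a raw cudaMallocManaged pointer."""

    def __init__(self, ptr: int, nbytes: int) -> None:
        self.ptr = ptr        # integer address returned by the allocator
        self.nbytes = nbytes  # size of the allocation in bytes

    @property
    def __cuda_array_interface__(self) -> Dict[str, Any]:
        # Describe the allocation as a flat, writable array of bytes.
        return {
            "shape": (self.nbytes,),
            "typestr": "|u1",           # unsigned 8-bit, no byte order
            "data": (self.ptr, False),  # (pointer, read_only flag)
            "version": 2,
        }


def as_torch_tensor(buf: ManagedBuffer):
    """Wrap the buffer as a uint8 tensor; needs PyTorch built with CUDA."""
    import torch  # deferred so the class above has no hard dependency

    # torch.as_tensor consumes __cuda_array_interface__ and is expected
    # to share the underlying memory rather than copy it.
    return torch.as_tensor(buf, device="cuda")
```

On a CUDA machine the resulting uint8 tensor could then be reinterpreted with `.view(dtype)` and reshaped to the KV cache layout; that step is left out here because the layout is not stated in the commit message.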
2026-04-12 06:56:52 +00:00
c77342da87
KV cache: prefer CPU placement, zero via CPU not GPU
...
Two critical fixes for managed-memory KV cache allocation:
1. Set the preferred location to CPU (not GPU). The KV cache is too large
   for HBM (50-100+ GiB). Setting the preferred location to GPU causes the
   driver to try migrating the entire allocation to HBM → OOM. With
   CPU as the preferred location, pages stay in LPDDR/EGM and page-fault
   to the GPU on demand during attention ops.
2. Zero memory via CPU memset (not cudaMemset). cudaMemset runs on the
   device, forcing ALL pages to migrate to the GPU before zeroing, which
   is exactly what we're trying to avoid. CPU memset keeps pages in LPDDR.
Also added SetAccessedBy(GPU) so the GPU can access pages remotely
over C2C NVLink without triggering page migration to the GPU.
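To illustrate the CPU-side zeroing from point 2, here is a small sketch using `ctypes.memset` on an ordinary host buffer. The buffer is a stand-in: for a managed allocation the same call would presumably receive the pointer returned by cudaMallocManaged. `zero_on_cpu` is a hypothetical helper name.

```python
import ctypes


def zero_on_cpu(ptr: int, nbytes: int) -> None:
    """Zero nbytes starting at address ptr using the host CPU."""
    # ctypes.memset accepts a plain integer address as its destination.
    ctypes.memset(ptr, 0, nbytes)


# Stand-in for a managed allocation: a 16-byte host buffer filled with 0xFF.
host_buf = (ctypes.c_ubyte * 16)(*([0xFF] * 16))
zero_on_cpu(ctypes.addressof(host_buf), ctypes.sizeof(host_buf))
```

Because the write is issued by the CPU, no device kernel touches the pages, which is the property the commit relies on to keep them out of HBM.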
2026-04-12 03:44:16 +00:00
7f35bc4158
Targeted KV cache managed memory allocation
...
Instead of swapping the global CUDA allocator (which broke cuBLAS),
allocate KV cache via cudaMallocManaged directly in
_allocate_kv_cache_tensors(). Controlled by
VLLM_KV_CACHE_USE_MANAGED_MEMORY env var.
Model weights and compute intermediates stay in HBM via default
cudaMalloc. Only KV cache spills into EGM/LPDDR.
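The opt-in described above could be sketched as follows. Only the environment variable name comes from the commit message; the accepted truthy values and the helper names are assumptions, and the function returns a label rather than performing a real allocation.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "VLLM_KV_CACHE_USE_MANAGED_MEMORY"


def use_managed_kv_cache(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the env var is truthy (parsing rule is assumed)."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "0").strip().lower() in {"1", "true", "yes"}


def pick_kv_cache_allocator(environ: Optional[Mapping[str, str]] = None) -> str:
    """Name the allocator the KV cache would use under this setting."""
    # Only the KV cache opts in; weights and intermediates keep cudaMalloc.
    return "cudaMallocManaged" if use_managed_kv_cache(environ) else "cudaMalloc"
```

Scoping the switch to the KV cache path leaves the global allocator alone, which is the change this commit makes after the global swap broke cuBLAS.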
2026-04-11 02:14:34 +00:00
487dd34e04
Selective prefetch: only prefetch allocations <2 GiB to GPU
...
Model weights (small tensors) must be in HBM for cuBLAS GEMM ops
which can't page-fault into managed memory. KV cache blocks are
large and numerous; prefetching them all fills HBM and causes
OOM. The 2 GiB threshold is a size heuristic that separates compute data from cache data.
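The size rule above amounts to a single comparison. A sketch with hypothetical names, taking the strict "<2 GiB" bound from the commit title:

```python
PREFETCH_THRESHOLD_BYTES = 2 * 1024**3  # 2 GiB


def should_prefetch_to_gpu(nbytes: int) -> bool:
    """Prefetch only allocations strictly smaller than the threshold."""
    return nbytes < PREFETCH_THRESHOLD_BYTES
```

Under this rule a 512 MiB weight tensor would be prefetched to HBM, while a multi-GiB KV cache allocation would be left to migrate on demand.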
2026-04-10 14:58:57 +00:00
a15f86ecfa
Remove cudaMemPrefetchAsync from managed allocator
...
Eager prefetching was filling HBM+EGM, causing subsequent
cudaMallocManaged calls to fail after model loading. On GH200
with EGM, pages should migrate on-demand via hardware page faults
over C2C NVLink. The cudaMemAdviseSetPreferredLocation(GPU) hint
is sufficient to prefer GPU placement with LPDDR fallback.
2026-04-10 05:58:11 +00:00
Michael
2a69949bda
[Bugfix]: Fix Gemma4ToolParser.__init__() missing tools parameter ( #38847 )
...
Signed-off-by: Michael Hospedales <hospedales@me.com >
(cherry picked from commit bb39382b2b )
2026-04-02 16:45:38 -07:00
Luciano Martins
8adcf8c40a
feat(models): implement Google Gemma 4 architecture support (MoE, Multimodal, Reasoning, Tool-Use) ( #38826 )
...
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com >
Signed-off-by: Luciano Martins <lucianomartins@google.com >
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
(cherry picked from commit 08ed2b9688 )
2026-04-02 11:49:53 -07:00
khluu
cfad6a509c
Revert "[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang ( #38730 )"
...
This reverts commit c284a6671c .
2026-04-01 15:14:58 -07:00
Stefano Castagnetta
c284a6671c
[Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang ( #38730 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
(cherry picked from commit 6183cae1bd )
2026-04-01 12:11:03 -07:00
Chauncey
3a30a1a6a8
[Misc] Rename think_start_str/think_end_str to reasoning_start_str/reasoning_end_str ( #38242 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
(cherry picked from commit cbe7d18096 )
2026-04-01 12:10:53 -07:00
Juan Pérez de Algaba
29982d48b3
(security) Enforce frame limit in VideoMediaIO ( #38636 )
...
Signed-off-by: jperezde <jperezde@redhat.com >
(cherry picked from commit 58ee614221 )
2026-04-01 12:10:40 -07:00
Yifan Qiao
1dbbafd3f3
[Feat][v1] Simple yet General CPU KV Cache Offloading ( #37160 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai >
(cherry picked from commit 91e4521f9f )
2026-04-01 01:03:14 -07:00
Lucas Wilkinson
0ee3b7fc3d
[Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking ( #36178 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
(cherry picked from commit eb47454987 )
2026-04-01 01:02:58 -07:00
Matthew Bonanni
268bed9cf3
[Bugfix][Async] Fix async spec decoding with hybrid models ( #38556 )
...
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com >
(cherry picked from commit 757068dc65 )
2026-04-01 01:02:35 -07:00
Jiangyun Zhu
bcc0fdd0f3
[CI] fix LM Eval Qwen3.5 Models (B200) ( #38632 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
(cherry picked from commit ea7bfde6e4 )
2026-04-01 01:02:20 -07:00
wang.yuqi
69b8bd4b33
[CI Failure] pin colmodernvbert revision ( #38612 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
(cherry picked from commit 719735d6c5 )
2026-04-01 01:02:04 -07:00
Li, Jiang
12449f9492
[Bugfix][CPU] Skip set_num_threads after thread binding ( #38535 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
(cherry picked from commit 6557f4937f )
2026-03-30 23:01:42 -07:00
haosdent
b92312dfd7
[CI] Fix SPLADE pooler test broken by #38139 ( #38495 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
(cherry picked from commit a08b7733fd )
2026-03-30 21:52:13 -07:00
Jaewon
d816834c1a
[MoE] Add RoutingMethodType.Simulated to TRT-LLM FP8/NVFP4 kernel allowlists ( #38329 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-29 22:53:43 -07:00
Roger Wang
92f0db57a8
[Misc] Always use forward_mulmat for Conv3d on newer versions of torch. ( #38487 )
2026-03-30 05:39:41 +00:00
Andreas Karatzas
bea23536f6
[CI] Add temperature=0.0, reduce max_tokens, and add debug prints to audio_in_video tests ( #38492 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-30 05:36:45 +00:00
Jiangyun Zhu
c133f33746
Add @ZJY0516 to CODEOWNERS ( #38497 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-29 21:10:00 -07:00
Stanislav Kirillov
a6db99ba02
[Bugfix] Support multi-type params parsing for DeepSeek v3.2 ( #33703 )
...
Signed-off-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-30 04:07:28 +00:00
Andreas Karatzas
4f2ed5fddb
[ROCm][CI] Enable hybrid chunked prefill test ( #38317 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-30 10:30:26 +08:00
Kyle Sayers
d28d86e8a3
[QeRL] Fix online quantized reloading ( #38442 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-29 14:56:41 -06:00
Wentao Ye
995dea1354
[Perf] Remove redundant device copies for CPU-only pooling token IDs, 48.9% E2E throughput improvement ( #38139 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-29 18:12:50 +00:00
allgather
8c0b6267d7
[Transformers v5] fix missing pixtral/voxtral multimodal dispatch ( #38410 )
...
Signed-off-by: allgather <all2allops@gmail.com >
2026-03-29 09:59:06 +00:00
Andreas Karatzas
43cc5138e5
[ROCm][CI] Fix cross-attention dispatch for encoder-decoder models ( #38450 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-28 22:08:03 -07:00
Shubhra Pandit
5b8c30d62b
[Spec Decode, BugFix] Propagate norm_before_fc from Eagle3 speculator ( #38111 )
...
Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com >
2026-03-29 00:42:06 +00:00
haosdent
d39b8daf5f
[Feature] Add Qwen3-ForcedAligner support via token classification pooling ( #35367 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-29 00:27:52 +00:00
Walter Beller-Morales
fafca38adc
[BugFix][Frontend] apply task instruction as system prompt in cohere v2/embed ( #38362 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-28 18:30:54 +00:00
Kunshang Ji
aa4eb0db78
[CI]revert initialize_model context manager ( #38426 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-28 16:56:50 +00:00
Andreas Karatzas
af89140efc
[ROCm][CI] Fix UV install in Dockerfile.rocm to detect curl failures and retry ( #38415 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-29 00:47:42 +08:00
haosdent
b2bc736b12
[CI] Fix Ernie4.5-VL initialization test ( #38429 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-28 22:43:24 +08:00
whyiug
58c959a767
[Misc]: clean up non-core lint issues ( #37049 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
2026-03-28 10:28:16 -04:00
Bvicii
bda3eda82d
[Bugfix] Disallow renderer_num_workers > 1 with mm processor cache ( #38418 )
...
Signed-off-by: Bvicii <yizhanhuang2002@gmail.com >
2026-03-28 06:32:52 -07:00
Michael Goin
2bf5b70ae8
[CI Bugfix] Pre-download missing FlashInfer headers in Docker build ( #38391 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-28 06:09:00 -07:00
yzong-rh
6dad4c5722
[Test] Fix flaky race condition in test_abort_final_step ( #38414 )
...
Signed-off-by: Yifan <yzong@redhat.com >
2026-03-28 09:06:56 +00:00
Liwen
171775f306
Fix Device Index for ROCm Ray Workers in MoE Benchmark ( #38108 )
...
Signed-off-by: Liwen <53441624+li-liwen@users.noreply.github.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-28 08:27:11 +00:00
TJian
58a249bc61
[ROCm] [Release] Update ROCm variant from rocm700 to rocm721 ( #38413 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-28 06:07:03 +00:00
IriKa
148a5c1226
[Bugfix]fix output Nan/Inf in marlin if dtype=float16 ( #33972 )
...
Signed-off-by: IriKa Qiu <qiujie.jq@gmail.com >
2026-03-27 16:36:08 -07:00
Wei Zhao
b69bf2f0b1
[Perf] Use torch compile to fuse pack topk in trtllm moe ( #37695 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com >
2026-03-27 17:30:46 -06:00
rongfu.leng
88149b635e
Add nvidia h800 moe config ( #31201 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2026-03-27 16:28:48 -07:00
Hongxia Yang
83a4df049d
[ROCm][Documentation] update quickstart and installation to include rocm nightly docker tips ( #38367 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-27 23:20:19 +00:00
Gregory Shtrasberg
731285c939
[ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 ( #38252 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-03-27 18:03:12 -05:00
Johnny
97d19197bc
[NVIDIA] Fix DGX Spark logic ( #38126 )
...
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com >
Signed-off-by: guillaume_guy <guillaume.guy@airbnb.com >
Signed-off-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Andreas Karatzas <akaratza@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Sathish Sanjeevi <SKPsanjeevi@users.noreply.github.com >
Co-authored-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: guillaume_guy <guillaume.guy@airbnb.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-27 15:26:07 -07:00
Giancarlo Delfin
384e4d5f48
[Model Runner V2] Rebuild attention metadata before eagle decode full… ( #38311 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-27 13:46:42 -07:00
Nicolò Lucchesi
44a6528028
[CI] Skip failing test ( #38369 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-27 13:25:19 -07:00
Kyle Sayers
648edcf729
[QeRL] Compose online quantization with quantized reloading ( #38032 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-27 13:22:33 -07:00
Michael Goin
7ba425e916
Add short flag -sc for --speculative-config argument ( #38380 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-27 12:04:22 -07:00
Gregory Shtrasberg
b8665383df
[ROCm] Fix GPT-OSS import for triton 3.6 ( #37453 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-03-27 18:00:57 +00:00
Rohan Potdar
0e9358c11d
[ROCm]: gpt-oss fusion/padding fixes ( #38043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: Andreas Karatzas <akaratza@amd.com >
2026-03-27 12:19:15 -04:00
Harry Mellor
21d2b53f88
Remove need for explicit \n in docstring lists for --help formatting ( #38350 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 08:38:00 -07:00
Jonas M. Kübler
98e7f223b9
enable skipping of SW attention layers when using FP8 KV cache ( #33695 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2026-03-27 07:25:02 -06:00
Juan Pérez de Algaba
b111f8a61f
fix(security): Add VLLM_MAX_N_SEQUENCES environment variable and enforce limit ( #37952 )
...
Signed-off-by: jperezde <jperezde@redhat.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2026-03-27 09:02:10 -04:00
Sage Moore
497e234d38
[EPLB] Cleanup the transfer logic for the various eplb maps ( #34520 )
...
Signed-off-by: Sage Moore <sagmoore@redhat.com >
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-27 10:18:46 +01:00
dtc
6287e7fa20
[P/D] Mooncake: Add unit tests and minor fixes for mooncake connector ( #36946 )
...
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com >
2026-03-27 09:26:40 +01:00
Shengqi Chen
84e439a9cb
[CI/Build] Move nightly wheel index generation to a single post-build step ( #38322 )
...
Signed-off-by: Shengqi Chen <harry-chen@outlook.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2026-03-27 07:44:18 +00:00
Yuichiro Utsumi
a1746ff9ec
[Doc] Clarify Helm chart location in deployment guide ( #38328 )
...
Signed-off-by: Yuichiro Utsumi <utsumi.yuichiro@fujitsu.com >
Signed-off-by: Yuichiro Utsumi <81412151+utsumi-fj@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-27 15:43:02 +08:00
Flora Feng
aee4c14689
[Bugfix] Fix Hermes tool parser when stream interval > 1 ( #38168 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-27 14:42:26 +08:00
Bowen Bao
0ae89f18fd
[Refactor] Move FusedMoE hidden_size roundup to quant_method ( #34285 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-03-26 23:38:26 -07:00
wenjun liu
c2b17d71af
[CI] Add xpu auto-label rule for Intel GPU/XPU PRs ( #38320 )
...
Signed-off-by: wendyliu235 <wenjun.liu@intel.com >
2026-03-27 14:22:38 +08:00
Li, Jiang
becaed6ec8
[CPU] Support CT W4A16 on CPU MP kernel ( #38219 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-27 14:15:28 +08:00
Xiaoshuang Wang
a8eab8f30d
[Model] Extract GatedDeltaNetAttention into shared layer for Qwen3Next and Qwen3.5 ( #37975 )
...
Signed-off-by: wxsIcey <1790571317@qq.com >
Signed-off-by: Icey <1790571317@qq.com >
2026-03-27 14:13:21 +08:00
cjackal
2babac0bed
[frontend] dump openai responses type by alias ( #38262 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-03-27 05:58:20 +00:00
Or Ozeri
7cc302dd87
[kv_offload+HMA][7/N]: Support register_kv_caches for hybrid models ( #37853 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-27 08:38:33 +03:00
Bvicii
999dfc1622
[Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop ( #34789 )
...
Signed-off-by: Bvicii <yizhanhuang2002@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-26 22:17:00 -07:00
wenjun liu
d86060122a
[CI/Build] enable Intel XPU test flow with prebuilt image ( #37447 )
...
Signed-off-by: wendyliu235 <wenjun.liu@intel.com >
2026-03-26 18:16:04 -07:00
Harry Mellor
f73bcb1c51
Various Transformers v5 config fixes ( #38247 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-26 23:06:59 +00:00
yzong-rh
28048bd6b0
[Bugfix] Add missing f-string prefix in xgrammar choices error message ( #38162 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-03-26 21:43:03 +00:00
Giancarlo Delfin
c32e97602d
[Model Runner V2] Enable forcing a specific acceptance rate during rejection sampling ( #38045 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-26 13:38:12 -07:00
Wei Zhao
0904b6550d
Fix multi-node allreduce fusion ( #38136 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: root <root@theia0053.lyris.clusters.nvidia.com >
2026-03-26 20:24:36 +00:00
Stig-Arne Grönroos
f26fcdfb9e
[Bugfix][ROCm] Fix lru_cache on paged_mqa_logits_module ( #37547 )
...
Signed-off-by: Stig-Arne Grönroos <stig-arne.gronroos@amd.com >
2026-03-26 19:01:05 +00:00
TJian
bc9c6fbbe6
[ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline ( #38263 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-26 18:47:10 +00:00
Andreas Karatzas
bff9a1c266
[ROCm][CI] Override PYTORCH_ROCM_ARCH with detected GPU arch in test containers ( #38165 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 18:33:45 +00:00
Andreas Karatzas
db01535e2b
[ROCm][CI] Add uv pip compile workflow for rocm-test.txt lockfile ( #37930 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 12:44:01 -05:00
jennyyyyzhen
a4cf9b22ba
[ROCM][Bugfix] Use correct stride in cp_mha_gather_cache_kernel for hybrid model ( #37228 ) ( #37228 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
Co-authored-by: yZhen <yZhen@fb.com >
2026-03-26 10:33:39 -07:00
Andreas Karatzas
9c3ae04bfe
[ROCm][CI] Add LM Eval Qwen3.5 Models test for MI355 ( #38155 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 16:51:18 +00:00
Andreas Karatzas
a8e48a7b85
[CI] Fix conch kernel crash on 3D input by reshaping to 2D before GEMM ( #38178 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 11:46:03 -05:00
Divakar Verma
b9dbc5c4ab
[Mamba][APC] Add test case to compare apc outputs ( #34977 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-26 16:40:35 +00:00
TJian
60af7b967b
[Releases] [ROCm] Enable Nightly Docker Image and Wheel Releases for ROCm ( #37283 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-26 16:32:25 +00:00
Andreas Karatzas
bdc1719eb9
[ROCm][CI] Fix AITER state leak in shared_fused_moe_routed_transform test ( #38137 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 09:26:46 -07:00
haosdent
0aac2048bf
[Bugfix] Restore CUDA graph persistent buffers for FP8 FlashMLA decode ( #35175 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-26 16:13:39 +00:00
Chuan (Richard) Li
cb2263218e
[Bugfix][Minor] Fix potential NameError in mamba backend selector and misc typos ( #35886 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-26 11:59:24 -04:00
Wentao Ye
e054f152fa
[CI] Add batch invariant test for b200 ( #38014 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-26 11:54:54 -04:00
zhang-prog
0f5b526040
[Fix] Remove unused packing_position_embedding from PaddleOCRVL for better checkpoint compatibility ( #38232 )
...
Signed-off-by: zhangyue66 <zhangyue66@baidu.com >
2026-03-26 15:34:49 +00:00
Zhewen Li
be1a85b7a2
Revert "[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration" ( #38050 ) ( #38169 )
...
Co-authored-by: Zhewen Li <zhewenli@inferact.ai >
2026-03-26 07:59:09 -07:00
Cyrus Leung
2e225f7bd2
[Renderer] Consolidate factory methods ( #38218 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 12:19:22 +00:00
Jared Wen
757eafcf37
[bug-fix] GLM OCR Patch Merger context_dim ( #37962 )
...
Signed-off-by: JaredforReal <w13431838023@gmail.com >
2026-03-26 05:11:21 -07:00
wang.yuqi
dcdc145893
[CI] Reorganize scoring tests ( #38207 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-26 12:07:01 +00:00
Andreas Karatzas
f2d16207c7
[ROCm][CI] Fix flaky GPTQ compile correctness test ( #38161 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 19:57:00 +08:00
Andreas Karatzas
37a83007fe
[ROCm][CI] Fix wvSplitKrc mock argument order in test_rocm_unquantized_gemm ( #38167 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 19:54:59 +08:00
Wentao Ye
bf5eec638d
[Refactor] Remove unused utils ( #38153 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-26 17:08:19 +08:00
Mateusz Sokół
b1cb1d3d2c
DOC: Documentation pages fixes ( #38125 )
...
Signed-off-by: Mateusz Sokół <mat646@gmail.com >
2026-03-26 16:55:42 +08:00
Kunshang Ji
6ae8bbd0c2
[XPU] Disable xpu graph by default ( #38193 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-26 01:53:45 -07:00
Cyrus Leung
a9213c0ffe
[Doc] Fix outdated reference to CUDAGraphManager ( #38209 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 01:52:38 -07:00
Cyrus Leung
502c41a8f6
[Model] Use helper function to run MM processors with token inputs (where applicable) ( #38018 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-26 16:44:04 +08:00
Vadim Gimpelson
52069012fe
[Bugfix] Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell ( #38083 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-26 01:21:47 -07:00
Fadi Arafeh
71161e8b63
[cpu][ci] remove soft-fail for Arm CI and add quant model tests ( #37691 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-26 07:03:31 +00:00
Terry Gao
38de822310
[Model] Add torch.compile support for InternVL vision encoder ( #38049 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
2026-03-25 23:52:29 -07:00
Jee Jee Li
2bfbdca23c
[Bugfix] Fix benchmark_fused_collective.py ( #38082 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-25 23:51:00 -07:00
Matej Rojec
2908094567
Add /v1/chat/completions/batch endpoint for batched chat completions ( #38011 )
...
Signed-off-by: Matej Rojec <64556640+MatejRojec@users.noreply.github.com >
2026-03-26 12:13:33 +08:00
BadrBasowid
e6bf9f15ec
[Bugfix][CI] Fix Marlin FP8 Linear Kernel for Compressed Tensors Format ( #38092 )
...
Signed-off-by: BadrBasowid <Badr.Basowid@gmail.com >
Signed-off-by: BadrBasowid <61441185+BadrBasowid@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-25 21:11:43 -07:00
Woosuk Kwon
144030c84e
Relocate Encoder CUDA graph manager ( #38116 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-25 20:52:12 -07:00
Flora Feng
e2db2b4234
[Tool Parser][1/3] Pass tools to ToolParser constructor ( #38029 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-26 10:29:06 +08:00
Chauncey
87f05d6880
[Revert] Remove DeepGEMM availability check in DeepseekV32IndexerMetadataBuilder ( #38076 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-26 01:43:51 +00:00
Andreas Karatzas
36f6aede23
[Misc] Optimized check to encapsulate both CUDA and ROCm platforms ( #34549 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-26 09:43:07 +08:00
Xin Yang
9704a5c310
Disable dual stream execution of input projection for Qwen3 ( #38152 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-26 01:20:39 +00:00
Wei Zhao
74056039b7
Fix minimax m2.5 nvfp4 kv scales weight loading ( #37214 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-26 00:48:06 +00:00
Jacob Platin
d7d51a7ee5
[Bugfix] Fix Qwen3.5-FP8 Weight Loading Error on TPU ( #37348 )
...
Signed-off-by: Jacob Platin <jacobplatin@google.com >
2026-03-26 00:46:01 +00:00
Harry Mellor
3c3c084240
Various Transformers v5 fixes ( #38127 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-26 00:10:08 +00:00
Ekagra Ranjan
7b54f60db0
[Cohere] Enable Cohere-Transcribe ( #38120 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-25 16:13:51 -07:00
Rohan Potdar
a0e8c74005
[ROCm]: Update rope+kvcache fusion conditions and disable custom op by default ( #36716 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-25 20:58:44 +00:00
Guillaume Guy
70a2152830
[MultiModal] add support for numpy array embeddings ( #38119 )
...
Signed-off-by: guillaume_guy <guillaume.guy@airbnb.com >
Signed-off-by: Guillaume Guy <guillaume.c.guy@gmail.com >
Co-authored-by: guillaume_guy <guillaume.guy@airbnb.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-25 20:13:04 +00:00
Sathish Sanjeevi
978fc18bf0
[ROCm] Utilize persistent MLA kernel from AITER ( #36574 )
...
Signed-off-by: Sathish Sanjeevi <sathish.krishnan.p.s@gmail.com >
2026-03-26 03:00:42 +08:00
Andreas Karatzas
7d6917bef5
[ROCm] Fix MoE kernel test failures on gfx950 ( #37833 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2026-03-25 13:46:40 -05:00
Mark McLoughlin
e38817fadb
[Core][KV Connector] Remove use of num_cached_tokens in error handling ( #38096 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-25 18:20:48 +00:00
Nick Hill
72cad44d3c
[Frontend] Move APIServerProcessManager target server fn ( #38115 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-25 18:14:41 +00:00
Cyrus Leung
ba2f0acc2d
[Misc] Reorganize inputs ( #35182 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-25 10:22:54 -07:00
Yongye Zhu
678b3c99e8
[MoE Kernel] Flashinfer nvfp4 cutedsl moe kernel integration ( #38050 )
2026-03-25 10:16:40 -07:00
mikaylagawarecki
bf4cc9ed2d
[2/n] Migrate per_token_group_quant to torch stable ABI ( #36058 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-25 10:15:13 -07:00
Ben Browning
1ac2ef2e53
[CI/Docs] Improve aarch64/DGX Spark support for dev setup ( #38057 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 09:24:42 -07:00
Richard Zou
6e37c46b35
[compile] Add some more startup tests for top models ( #38046 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-25 12:02:22 -04:00
Wentao Ye
1bf2ddd0ee
[Refactor] Rename WAITING_FOR_FSM to WAITING_FOR_STRUCTURED_OUTPUT_GRAMMAR ( #38048 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-25 11:41:44 -04:00
Necofish
e7221180e1
[Kernel] Optimize SM120 CUTLASS blockwise FP8 GEMM ( #37970 )
...
Signed-off-by: Necofish <liuxiangyang@mail.ustc.edu.cn >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-25 08:20:04 -07:00
RobTand
4a76ad12e0
[Bugfix] Preserve CUDA arch suffix (a/f) for SM12x — fixes NVFP4 NaN on desktop Blackwell ( #37725 )
...
Signed-off-by: Rob Tand <robert.tand@icloud.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-03-25 08:18:25 -07:00
Wentao Ye
d7e93e13fb
[Feature] EPLB Support for GPU Model Runner v2 ( #37488 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-25 08:16:39 -07:00
Andrii Skliar
cd7643015e
[Feature] Support per-draft-model MoE backend via --speculative-config ( #37880 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Signed-off-by: [Andrii Skliar] <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-03-25 14:31:52 +00:00
Ben Browning
a1a2566447
[Docs] Add guide for editing agent instruction files ( #37819 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-25 13:54:09 +00:00
yjz
b745e8b5d3
[KVTransfer][Mooncake] Add heterogeneous TP support for disaggregated P/D in MooncakeConnector ( #36869 )
...
Signed-off-by: JianDan0212 <zhangyj0212@gmail.com >
2026-03-25 14:24:07 +01:00
Harry Mellor
d215d1efca
[Mypy] Better fixes for the mypy issues in vllm/config ( #37902 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 06:14:43 -07:00
Fadi Arafeh
34d317dcec
[CPU][UX][Perf] Enable tcmalloc by default ( #37607 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-25 20:39:57 +08:00
grYe99
7ac48fd357
[Model] Add AutoWeightsLoader support for jais ( #38074 )
...
Signed-off-by: grYe99 <guorongye99@gmail.com >
Co-authored-by: grYe99 <guorongye99@gmail.com >
2026-03-25 12:38:40 +00:00
Harry Mellor
d6bb2a9d9a
Fix Plamo 2/3 & LFM2 for Transformers v5 ( #38090 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 12:29:49 +00:00
Harry Mellor
1e673a43ce
Better weight tying check for multimodal models ( #38035 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 12:07:23 +00:00
Andreas Karatzas
04417ecd5f
[ROCm][CI] Rename filepath test to point to correct file ( #38102 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 20:05:46 +08:00
R0CKSTAR
242c93f744
[Docs] Adds vllm-musa to custom_op.md ( #37840 )
...
Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com >
2026-03-25 11:54:36 +00:00
Matthias Gehre
a889b7f584
[Bugfix] Pass drafter quant_config to ParallelLMHead in Eagle3 ( #37280 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-25 11:42:58 +00:00
Harry Mellor
ba2910f73a
Fix offline mode test for Transformers v5 ( #38095 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-25 11:39:48 +00:00
Andreas Karatzas
f262a62aa1
[ROCm][CI] Fix flaky Cohere/OpenAI embedding parity test ( #37616 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 10:55:51 +00:00
Andreas Karatzas
9ac2fcafbb
[CI] Fix realtime WebSocket timeout deadlock and unhandled model validation errors ( #37483 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 11:24:33 +01:00
Kunshang Ji
e9ae3f8077
[Hardware][XPU] Align memory usage with cuda on xpu ( #37029 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-25 18:14:29 +08:00
Andreas Karatzas
04cec4f927
[ROCm][CI] Increase OpenAPI schema test timeouts ( #38088 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 18:06:58 +08:00
Kunshang Ji
14771f7150
[XPU] support MLA model on Intel GPU ( #37143 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-25 17:43:42 +08:00
Gregory Shtrasberg
189ddefbfd
[ROCm] Attention selector reordering ( #36702 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2026-03-25 17:42:56 +08:00
Chauncey
09c3dc9186
[Revert] Remove CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function ( #37968 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-25 06:19:37 +00:00
vllmellm
42e9547976
[ROCm][Test] Fix ROCM_AITER_UNIFIED_ATTN attn+quant fusion test ( #37640 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-25 05:06:15 +00:00
Chauncey
a32783bb35
[Bugfix] Fix IndexError when accessing prev_tool_call_arr in OpenAIToolParser ( #37958 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-25 12:06:21 +08:00
Baorun (Lauren) Mu
9d0351c91d
[Docs] Add Encoder (ViT) CUDA Graphs section to CUDA Graphs design doc ( #37914 )
...
Signed-off-by: Baorun Mu <bmu@nvidia.com >
2026-03-24 19:53:24 -07:00
Artem Perevedentsev
a93a53f8a1
[Performance] Auto-enable prefetch on NFS with RAM guard ( #37673 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-24 17:31:14 -07:00
Andreas Karatzas
679c6a3ecc
[Bugfix][ROCm][MoE] Fix mxfp4 oracle regressions from #37128 ( #37787 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 08:17:33 +08:00
Andreas Karatzas
8bbb7c7f20
[ROCm][CI][PD] Add Hybrid SSM integration tests to CI ( #37924 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-25 07:58:39 +08:00
Kevin H. Luu
af945615b5
[release] Move the rest of release jobs to release queue ( #38044 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-24 16:40:58 -07:00
Terry Gao
82580b10ac
[Perf] Disable inductor runtime asserts by default for serving perfor… ( #37485 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
Co-authored-by: Tianren Gao <tianren@fb.com >
2026-03-24 19:37:51 -04:00
Netanel Haber
a0d487b2e1
nano_nemotron_vl: suppress readonly torch.from_numpy() warning in image and video resize paths ( #37903 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-24 23:25:56 +00:00
Junhao
b73b5b0629
Make microbatch optimization (DBO) work with general models ( #37926 )
...
Signed-off-by: Junhao Li <junhao@ubicloud.com >
2026-03-24 14:40:08 -07:00
Michael Goin
0f0e03890e
[UX] Add flashinfer-cubin as CUDA default dep ( #37233 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-24 14:13:08 -07:00
Woosuk Kwon
4b53740d7f
[MRV2] Fix for DS v3.2 ( #38030 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-24 14:03:24 -07:00
Nick Hill
4e824d1c83
[Model Runner V2][Minor] Simplify PP logic ( #38031 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-24 13:57:17 -07:00
amey asgaonkar
0c1809c806
Add Ubuntu 24.04 support for Docker builds ( #35386 )
...
Signed-off-by: aasgaonkar <aasgaonkar@nvidia.com >
2026-03-24 13:34:44 -07:00
liangel-02
8c47fdfdb1
[FlexAttention] allow custom mask mod ( #37692 )
...
Signed-off-by: Angel Li <liangel@meta.com >
2026-03-24 16:03:24 -04:00
Javier De Jesus
54b0578ada
[Bugfix] Pass hf_token through config loading paths for gated model support ( #37920 )
...
Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com >
2026-03-24 15:22:05 -04:00
Richard Zou
89f572dbc0
[BugFix] fix VLLM_USE_STANDALONE_COMPILE=0 ( #38015 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-24 19:08:26 +00:00
Richard Zou
71a4a2fbd0
[BugFix] Fix order of compile logging ( #38012 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-24 18:58:18 +00:00
Nick Cao
935c46dd9b
[Model] Add Granite 4.0 1B speech to supported models ( #38019 )
...
Signed-off-by: Nick Cao <ncao@redhat.com >
2026-03-24 18:23:41 +00:00
Willy Hardy
057fc94cbd
[Bugfix] Fix structured output crash on CPU due to pin_memory=True ( #37706 )
...
Signed-off-by: Willy Hardy <whardy@redhat.com >
Signed-off-by: Will Hardy <whardy@redhat.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-24 17:44:17 +00:00
Vineeta Tiwari
b58c5f28aa
docs: fix broken offline inference paths in documentation ( #37998 )
...
Signed-off-by: Vineeta Tiwari <vineeta.tiwari2@ibm.com >
Signed-off-by: Vineeta Tiwari <vineetatiwari2000@gmail.com >
Co-authored-by: Vineeta Tiwari <vineeta.tiwari2@ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-24 17:35:14 +00:00
Ming Yang
c07e2ca6e0
Fix Mamba state corruption from referencing stale block table entries ( #37728 )
2026-03-24 10:29:59 -07:00
Dhruv Singal
4df5fa7439
[Bugfix] Force continuous usage stats when CLI override is enabled ( #37923 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: OpenCode <noreply@openai.com >
2026-03-24 10:29:50 -07:00
sihao_li
a5416bc52e
[XPU] Support Intel XPU hardware information collection in usage stats ( #37964 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-03-24 10:29:17 -07:00
Harry Mellor
b3601da6e7
[Mypy] Fix mypy for vllm/model_executor (except vllm/model_executor/layers) ( #37904 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 17:14:01 +00:00
Dan Blanaru
dc78c2c933
[Core] add option to schedule requests based on full ISL ( #37307 )
...
Signed-off-by: Dan Blanaru <48605845+DanBlanaru@users.noreply.github.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-24 13:01:12 -04:00
Sungjae Lee
4731884796
[Feature] limit thinking tokens (hard limit) ( #20859 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com >
Signed-off-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 09:53:07 -07:00
Harry Mellor
8de5261e69
Update new contributor message ( #37999 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 16:01:41 +00:00
wang.yuqi
1b6cb920e6
[Deprecate] Deprecate pooling multi task support. ( #37956 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-24 14:07:47 +00:00
Li, Jiang
352b90c4a4
[Bugfix] Add replacement of _compute_slot_mapping_kernel on CPU ( #37987 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-24 07:00:20 -07:00
Sage
1c0aabdeb0
[Bugfix] Suppress spurious CPU KV cache warning in launch render ( #37911 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-24 12:36:18 +00:00
Ilya Markov
14acf429ac
[EPLB] Remove main waits in case of slow EPLB ( #36271 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-03-24 11:50:44 +00:00
Harry Mellor
ce57fd5557
[Docs] Fix build ( #37991 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-24 03:20:49 -07:00
Flora Feng
2e67fa756d
Fix tool_parser_cls type annotation from Callable to type[ToolParser] ( #37957 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-23 22:58:27 -07:00
Ronen Schaffer
e3c6c10cad
[KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package ( #37874 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-03-24 07:02:51 +02:00
jetxa
16a664df24
[Frontend][Bugfix] Pass default_chat_template_kwargs to AnthropicServingMessages ( #37899 )
...
Signed-off-by: jetxa <jetxzhang@outlook.com >
2026-03-24 05:00:12 +00:00
Kevin H. Luu
7281199a8c
[release] Move agent queue to Release cluster queues ( #37783 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-23 20:36:47 -07:00
Kevin H. Luu
b2dd75eb48
Downsize CPU jobs to use small queue ( #37913 )
...
Signed-off-by: khluu <khluu000@gmail.com >
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-23 20:36:37 -07:00
Wentao Ye
c59a132f96
[V0 Deprecation] Refactor kv cache from list to element ( #37487 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 20:10:11 -07:00
Andreas Karatzas
de99d91ece
[ROCm][CI] Split Entrypoints Integration (API Server 1) into 3 jobs ( #37906 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-24 09:48:37 +08:00
Wentao Ye
83c9d525b6
[CI] Add batch invariant test: Block FP8 + small MOE ( #37895 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 21:16:14 -04:00
Giancarlo Delfin
8f4824b664
[Model Runner V2] Gather multimodal embeddings before draft model postprocess ( #37932 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-23 18:14:13 -07:00
roikoren755
56777b5c89
[Test] E2E Nemotron-3-Super tests ( #36803 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-23 17:49:56 -07:00
Kevin H. Luu
2488a82f89
[CI] Split V1 Others into 3 separate jobs ( #37016 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-24 06:44:38 +08:00
Ranran
dc6908ac6a
[Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning ( #35007 )
...
Signed-off-by: Ranran <1012869439@qq.com >
Signed-off-by: Ranran <hzz5361@psu.edu >
Signed-off-by: ran <hzz5361@psu.edu >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-03-23 18:31:14 -04:00
yzong-rh
e85f8f0932
[Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts ( #36728 )
...
Signed-off-by: Yifan Zong <yzong@redhat.com >
2026-03-23 17:02:57 -04:00
Robert Shaw
5bf3c42d4c
[Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision ( #36725 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-23 20:19:06 +00:00
Kyle Sayers
38364a7e32
[Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels ( #36799 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-03-23 16:03:29 -04:00
Matthew Bonanni
fafe76b4af
[Async][Spec Decoding] Zero-bubble async scheduling + spec decoding ( #32951 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
2026-03-23 15:37:22 -04:00
Woosuk Kwon
ffb5b32b5f
[MRV2] Consider spec decoding in warmup ( #37812 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-23 17:45:43 +00:00
Kunshang Ji
91fd695b75
[CI] split Entrypoints Integration (API Server 1) into 3 jobs ( #37882 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-23 10:37:56 -07:00
Nicolò Lucchesi
1cbbcfe8a3
[CI][PD] Add Hybrid SSM integration tests to CI ( #37657 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-23 23:58:19 +08:00
Angela Yi
aceadb5ee1
Use lazy graph module during split_module to defer recompile() ( #37609 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-03-23 11:21:29 -04:00
Yufeng He
ec2280611a
[Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding ( #37884 )
2026-03-23 15:15:12 +00:00
yanghui1-arch
7151ae6528
[Bugfix] RoBERTa position_id accumulation in CUDA graph padding region ( #37873 )
...
Signed-off-by: dass90 <3053034939@qq.com >
2026-03-23 14:59:21 +00:00
Wentao Ye
45bd5c8e75
[Mypy] Fix mypy for vllm/config ( #37808 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-23 14:33:59 +00:00
Zhaodong Bing
10a1018c12
[ROCm] fix sleep mode not releasing GPU memory problem on ROCm ( #37533 )
...
Signed-off-by: bingzhaodong <aaab8b@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-03-23 06:07:19 -07:00
Jee Jee Li
aec2dc6c0d
[Bugfix][LoRA] Fix incorrect LoRA Log ( #37877 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-23 11:42:52 +00:00
DorBernsohn
7938d12119
[Bugfix] Fix CPU backend crash in KV cache block zeroing ( #37550 )
...
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com >
2026-03-23 11:35:45 +00:00
Kunshang Ji
debd6e768c
[XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle ( #37784 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-23 11:10:41 +00:00
Andrew Xia
9ace378a63
[Frontend][Responses API] Fix arrival_time recording for TTFT on initial request ( #37498 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-23 09:58:08 +00:00
Kunshang Ji
27d5ee3e6f
[FP8]add FP8 WoQ kernel abstraction. ( #32929 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-03-23 09:47:47 +00:00
wangxiyuan
35141a7eed
[Misc]Update gitignore ( #37863 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-03-23 01:14:10 -07:00
Chuan (Richard) Li
e99fb98867
[ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs ( #36100 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-23 15:48:31 +08:00
Artem Perevedentsev
a16133a0f1
[Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 ( #37338 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-23 00:37:58 -07:00
Hojin Yang
54ab804e87
[Bugfix] Store Qwen3Next A_log in fp32 ( #37810 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-23 15:36:57 +08:00
r266-tech
02e6efe56d
[Bugfix] JAIS: Only apply ALiBi when position_embedding_type='alibi' ( #37820 )
...
Co-authored-by: r266-tech <r266-tech@users.noreply.github.com >
2026-03-23 07:36:34 +00:00
Matthias Gehre
410d300893
[ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel ( #36505 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-23 15:36:08 +08:00
Yan Ma
d3fe857135
update doc for online fp8 quantization ( #37851 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-23 05:19:03 +00:00
Baorun (Lauren) Mu
f85e479e66
[Feature] ViT Full CUDA Graph ( #35963 )
...
Signed-off-by: Baorun Mu <bmu@nvidia.com >
2026-03-23 13:01:10 +08:00
Jee Jee Li
1f0d210641
[CI/Build][LoRA] Update Qwen35 LoRA testing ( #37816 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-23 12:55:49 +08:00
Ben Browning
3bbe2e1e6e
[Test] Consolidate tool parser unit tests to tests/tool_parsers ( #37834 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-23 04:24:25 +00:00
Augusto Yao
6e04e79326
always use embed&token_classify for bge-m3 ( #37632 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-23 03:10:57 +00:00
Lasha Koroshinadze
e7767eccae
Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling ( #37643 )
...
Signed-off-by: Lasha <26011196+lashahub@users.noreply.github.com >
2026-03-23 10:29:07 +08:00
Woosuk Kwon
43877a620b
[MRV2] Enable PP CUDA graph test ( #37830 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 16:30:25 -07:00
zhanqiuhu
63f49b8bd4
[Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism ( #35162 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 20:48:25 +00:00
Woosuk Kwon
a5e9d511de
[MRV2] Use FP64 for Gumbel noise ( #37798 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 12:28:10 -07:00
Yongye Zhu
c058ff44d4
[Bugfix] Fix LoRA test by passing padded size back to the layer ( #37811 )
2026-03-22 13:20:13 -06:00
Woosuk Kwon
ce9b1d76cf
[MRV2] Skip hidden states allocation for PW CUDA graphs ( #37818 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 11:47:21 -07:00
Netanel Haber
e74c17e153
Enable NemotronHPuzzle + NemotronHMTP ( #37803 )
2026-03-22 15:13:58 +00:00
Wentao Ye
eaf4978621
[Test] Only Run MLA model when user explicitly set for batch invariance ( #37719 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-22 09:09:12 -04:00
Wentao Ye
77d24c4bfe
[Bug] Fix fp8 deepgemm batch invariant ( #37718 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-22 08:57:20 -04:00
Giancarlo Delfin
b3e846017d
[Model Runner V2] Support multi-modal embeddings for spec decode model ( #36097 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-22 02:48:43 -07:00
Andreas Karatzas
cd1242d82a
[ROCm][CI] Stabilize ROCm speech-to-text translation test with lower min acc threshold ( #37723 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 17:32:08 +08:00
Robert Shaw
4383f1532e
[MoE] Move PF Methods to Folder ( #35927 )
2026-03-22 02:42:59 -06:00
Andreas Karatzas
6eedec6e36
[ROCm][CI] Make some duplicated tests optional so that they are only evaluated in our nightly ( #37780 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:03:18 +08:00
Andreas Karatzas
ffc8531524
[ROCm][CI] Added missing resampy dependency for MM audio tests ( #37778 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:41 +08:00
Andreas Karatzas
6ecba840d7
[ROCm][CI] get_cu_count was renamed to num_compute_units in #35042 ( #37764 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:21 +08:00
Andreas Karatzas
3b06c55c78
[ROCm][CI] Fix MEGA_AOT_ARTIFACT fallback when PyTorch < 2.10.0 lacks AOT support ( #37763 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 16:02:03 +08:00
Yang Liu
b050700462
[Perf] Optimize glm4.xv VIT ( #37779 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
2026-03-22 06:12:34 +00:00
Andreas Karatzas
5dac719b2b
[Bugfix] Handle libsndfile sf_error(NULL) race condition in audio fallback ( #37782 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 13:37:29 +08:00
Andreas Karatzas
c862481c02
[CI] Skip ISAAC multimodal tests due to broken upstream HF model weights ( #37781 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 13:23:32 +08:00
Andreas Karatzas
c86b17cfe6
[ROCm][CI] Add large_gpu_mark to test_max_tokens_none for ROCm ( #37717 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 12:25:16 +08:00
Andreas Karatzas
66f927f205
[Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing ( #37775 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 03:22:24 +00:00
Andreas Karatzas
e78bc74268
[ROCm][CI] close missing quote in kernels/moe block in run-amd-test.sh ( #37774 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-22 09:42:34 +08:00
Robert Shaw
6b2fa3a762
[MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ ( #37759 )
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-03-21 19:15:16 -04:00
Robert Shaw
eeee5b262d
[Quantization][Deprecation] Remove PTPC FP8 ( #32700 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-21 22:10:16 +00:00
Robert Shaw
5ad0446572
Revert "Consolidate AWQ quantization into single awq_marlin.py file" ( #37768 )
2026-03-21 17:20:41 -04:00
Robert Shaw
8cc700dd6a
Consolidate AWQ quantization into single awq_marlin.py file
...
Merge awq.py and awq_marlin.py into a single file, eliminating the
circular import between them. awq.py becomes a backward-compat shim.
Follows the same structure as gptq_marlin.py.
Co-authored-by: Claude
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-03-21 17:09:17 -04:00
Brandon Pelfrey
80b70884eb
Add tensor IPC transfer mechanism for multimodal data ( #32104 )
...
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com >
Signed-off-by: Brandon Pelfrey <brandonpelfrey@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-21 20:10:20 +00:00
Mohammad Miadh Angkad
61e381dcf0
[Perf] Add SM 10.3 (B300/GB300) all-reduce communicator tuning ( #37756 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-03-21 19:43:47 +00:00
Mohammad Miadh Angkad
88f1b374f5
[Core] Enable allreduce fusion by default for SM 10.3 (B300/GB300) ( #37755 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-03-21 19:40:37 +00:00
Francesco Fusco
298e510848
[Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() ( #37318 )
...
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com >
2026-03-21 09:29:43 +00:00
Chaitanya Sri Krishna Lolla
3982bc2cd0
[ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs. ( #34692 )
...
Signed-off-by: Tej Kiran <vpolamre@amd.com >
Co-authored-by: Tej Kiran <vpolamre@amd.com >
2026-03-21 00:32:31 -07:00
Andreas Karatzas
02eec7ecbe
[ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list (MI355) ( #37721 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 15:27:12 +08:00
Bongwoo Bak
17ee641c45
[Responses API] Add kv_transfer_params for PD disaggregation ( #37424 )
...
Signed-off-by: bongwoobak <bongwoobak@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-21 13:48:54 +08:00
Andreas Karatzas
0d50fa1db6
[ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 ( #37610 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 12:57:25 +08:00
Simon Mo
1fa1e53a73
Revert "[compile] Initialize passes at VllmBackend init" ( #37733 )
2026-03-20 21:35:49 -07:00
Andreas Karatzas
3ffa52009f
[ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds ( #37617 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-21 11:58:58 +08:00
Yongye Zhu
87bd91892f
[MoE Refactor] Mxfp4 oracle rebased ( #37128 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-21 03:37:04 +00:00
Isotr0py
c7f98b4d0a
[Frontend] Remove librosa from audio dependency ( #37058 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-21 11:36:15 +08:00
tmm77
1c472f8fe1
Add get_device_uuid for rocm ( #37694 )
...
Signed-off-by: Tiffany Mintz <Tiffany.Mintz@amd.com >
2026-03-21 11:33:16 +08:00
Itay Alroy
c57d38d603
elastic_ep: Fix issues with repeated scale up/down cycles ( #37131 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-03-20 23:13:02 +00:00
Kaihang Jiang
e5ed6c6c13
[BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks ( #37475 )
...
Signed-off-by: Kaihang Jiang <kaihangj@nvidia.com >
2026-03-20 16:14:55 -06:00
Wentao Ye
b3d0b37908
[Refactor] Remove unused dead code ( #36171 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 16:12:51 -06:00
Santino Ramos
85f671b8e1
[Model Runner V2] Support Streaming Inputs ( #37028 )
...
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com >
2026-03-20 20:42:25 +00:00
Andreas Karatzas
8bc6b5cdb0
[ROCm][CI] Setting some mi325_4 tests back to optional (in parity with upstream) ( #37711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 12:25:08 -07:00
Vadim Gimpelson
4f16ebbbd3
[Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing ( #37591 ) ( #37605 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-20 12:19:26 -07:00
Angela Yi
12fd17eb51
[compile] Initialize passes at VllmBackend init ( #35216 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-03-20 11:40:33 -07:00
Cyrus Leung
37aadf6237
[Model] Update Kimi-K25 and Isaac processors to fit HF-style ( #37693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-20 18:30:22 +00:00
Le Yang
d7d2b5e405
[Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… ( #37565 )
...
Signed-off-by: Young-Leo <562593859@qq.com >
2026-03-20 18:28:34 +00:00
SherryC41
6ec5e9fd37
refactor: abstract deepgemm support into platform ( #37519 )
...
Co-authored-by: sherryC41 <sherry.c.c41@gmail.com >
2026-03-20 17:54:08 +00:00
Lucas Wilkinson
e1d85e5c24
[Attention] Support distinguishing between short extends and decodes ( #37303 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-03-20 10:49:36 -07:00
Peter Pan
79eb9369c5
fix CUDAGraph memory being counted twice ( #37426 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-20 17:36:32 +00:00
Woosuk Kwon
e80cfe575d
[MRV2] Avoid recompilation of _gather_block_tables_kernel ( #37645 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-20 10:31:45 -07:00
Xin Yang
d0532bf38d
[Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels ( #37683 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-20 11:28:41 -06:00
Andreas Karatzas
fb4e8bf442
[ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests ( #37613 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 10:16:59 -07:00
Harry Mellor
6ade4bc5a5
Fix various config related issues for Transformers v5 ( #37681 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-20 16:30:12 +00:00
Zhengxu Chen
2e089b96a8
[compile] Add compiled artifact counter for VLLM_USE_MEGA_AOT_ARTIFACT=1. ( #37589 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-20 16:22:46 +00:00
Martin Hickey
880be2b1b8
[Metrics] Some small refactoring for better maintainability ( #33898 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-20 16:11:34 +00:00
Zhengxu Chen
c0f5fae601
[compile] Fix aot test failures with torch 2.12. ( #37604 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-20 16:06:29 +00:00
Rémi Delacourt
aa84e43ccb
[Pixtral] Enable Pixtral language model support Eagle3 ( #37182 )
...
Signed-off-by: remi <remi@mistral.ai >
2026-03-20 15:50:15 +00:00
Matthias Gehre
5e806bcf54
[Bugfix] Fix ConchLinearKernel channelwise quantization (group_size=-1) ( #37329 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-20 10:32:21 -05:00
Matthias Gehre
56a62c310c
[Bugfix] Reject channelwise quantization (group_size <= 0) in ExllamaLinearKernel ( #37331 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-20 10:31:57 -05:00
L.B.R.
1779c09898
[ROCm] Enable wvSplitK skinny GEMM kernel for RDNA4/gfx1x decode ( #34709 )
...
Signed-off-by: L.B.R. <lbr@mmonad.com >
Co-authored-by: L.B.R. <lbr@mmonad.com >
2026-03-20 10:11:23 -05:00
xuebwang-amd
44eea10f68
[ROCm][Quantization] make quark ocp mx dtype parser robust for weight-only quantization ( #36232 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
2026-03-20 10:10:03 -05:00
Ilya Boytsov
8b6c6b9505
[Model] Add LFM2-ColBERT-350M support ( #37528 )
...
Signed-off-by: Ilya Boytsov <ilyaboytsov1805@gmail.com >
2026-03-20 14:57:57 +00:00
Harry Mellor
9f6d9dd371
Fix attribute error in isaac_patch_hf_runner ( #37685 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-20 14:49:40 +00:00
Jee Jee Li
dd20ee4e3e
[UX] Enable torch_profiler_with_stack ( #37571 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-20 11:17:26 +00:00
Chauncey
0523449c9c
[Misc] Use logger.info_once for auto tool choice log message ( #37661 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-20 10:40:36 +00:00
Flora Feng
b4c1aef21c
[Refactor] Relocate tests from tests/v1/entrypoints/ to tests/entrypoints/ ( #37500 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:50:34 -07:00
Flora Feng
6050b93bed
[Refactor] Move serve entrypoint tests under tests/entrypoints/serve/ ( #37595 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:10:47 -07:00
Andreas Karatzas
5a4a179591
[ROCm][CI] Fix granite_speech test for gfx90a by selecting compatible attention backend ( #37611 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:07:26 +08:00
Andreas Karatzas
37cd9fc107
[ROCm][CI] Remove deepep DBO tests on gfx90a ( #37614 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:07:07 +08:00
Andreas Karatzas
9cfd4ebb5e
[ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list ( #37619 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 17:06:53 +08:00
wang.yuqi
ed359c497a
[Model] Deprecate the score task (this will not affect users). ( #37537 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-20 08:07:56 +00:00
Giancarlo Delfin
dcee9be95a
[Model Runner V2] Fix draft logits not populated during cudagraph replay ( #37639 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-20 07:43:47 +00:00
Andreas Karatzas
bd8c4c0752
[CI] Removing deprecated rlhf examples reference ( #37585 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-20 15:20:33 +08:00
Wei Zhao
0140eafb15
[Bug] Fix FlashInfer allreduce fusion workspace uninitialized error ( #37461 )
...
Signed-off-by: root <root@prenyx0169.a51.clusters.nvidia.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: <>
Co-authored-by: root <root@prenyx0169.a51.clusters.nvidia.com >
Co-authored-by: root <root@prenyx0042.a51.clusters.nvidia.com >
2026-03-20 03:09:21 -04:00
Kunshang Ji
bdf6a0a57b
[XPU] bump vllm-xpu-kernels to v0.1.4 ( #37641 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-20 15:04:38 +08:00
Wangbei25
0674d1fee7
[PluggableLayer][MM] Add PluggableLayer for CustomQwen2Decoder ( #37293 )
...
Signed-off-by: Wangbei25 <wangbei41@huawie.com >
Signed-off-by: Wangbei25 <wangbei41@huawei.com >
Co-authored-by: Wangbei25 <wangbei41@huawie.com >
2026-03-20 06:24:07 +00:00
Cyrus Leung
30108fc8b0
[Model] Refactor Step3-VL processor to HF style ( #37579 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-20 06:05:08 +00:00
Flora Feng
e2d1c8b5e8
[Refactor] Relocate entrypoint tests to match serving code structure ( #37593 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 05:31:23 +00:00
Huanxing
6951fcd44f
[XPU] Automatically detect target platform as XPU in build. ( #37634 )
...
Signed-off-by: huanxing <huanxing.shen@intel.com >
2026-03-20 13:30:15 +08:00
Giancarlo Delfin
39474513f6
[Model Runner V2] fix draft attention metadata generation ( #37364 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-19 21:05:15 -07:00
Yuxiang Liang
638a872d77
fix(xpu): Re-compute compile ranges after platform-specific config updates ( #37523 )
...
Signed-off-by: Yuxiang Liang <yuxiang.liang@intel.com >
Signed-off-by: Yuxiang Liang <yuliang@habana.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-20 03:52:35 +00:00
Flora Feng
9040151fe1
[V0 Deprecation] Deprecate --disable-frontend-multiprocessing ( #37612 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 11:31:43 +08:00
Jee Jee Li
8fbe3f303f
[Bugfix][LoRA] Fix Qwen35 LoRA ( #36976 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-20 11:09:32 +08:00
Xiao
ea2c148fa7
[compile][graph_partition]Add tensor size handling ( #36038 )
...
Signed-off-by: Xiao Fu <xiaofu@meta.com >
2026-03-19 19:55:25 -07:00
Tianmu Li
47b7af0d87
[Feat] Enable CompressedTensorW4A8Int for XPU ( #37207 )
...
Signed-off-by: Li, Tianmu <tianmu.li@intel.com >
2026-03-20 02:34:28 +00:00
tianshu-Michael-yu
269bf46d99
fix: disambiguate multimodal prefix cache keys ( #36708 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-03-20 10:33:20 +08:00
Flora Feng
e5a77a5015
[CI] Update mergify tool-calling label paths ( #37478 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-20 02:22:23 +00:00
Itay Alroy
ca1ac1a4b4
Fix DP coordinator ZMQ port TOCTOU ( #37452 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-20 00:58:31 +00:00
Divakar Verma
4ca3fa6bb4
[ROCm][Bugfix] fix cache block size mismatch for aiter unified attention ( #37606 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-20 00:00:08 +00:00
Flora Feng
be12afd284
[Bugfix] Fix Deepseekv32 tool parser when stream interval > 1 ( #36056 )
2026-03-19 19:51:25 -04:00
Wentao Ye
df3c0291a3
[Bug] Fix EmbedIOprocessor "classify" <-> "embed" ( #37573 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 07:40:10 +08:00
Wentao Ye
2be1a0f74b
[Refactor] Remove dead code in pooling model ( #37572 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-20 07:39:43 +08:00
Jim Smith
4120a05ff1
Fix AttributeError in Qwen3.5 GDN layers with quantized models ( #37448 )
...
Signed-off-by: Jim Smith <jim@joshua8.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
2026-03-19 19:21:14 -04:00
rasmith
98ff042917
[CI][BugFix][AMD] Don't set VLLM_ROCM_USE_AITER anymore in test_rocm_aiter_topk since it's not necessary ( #36996 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-20 07:12:45 +08:00
Artem Perevedentsev
b55156eae9
[Performance] Enable Triton autotuning disk cache by default ( #37188 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-19 17:36:28 -04:00
Laith Sakka
112944fab9
test Qwen/Qwen3-4B-Instruct-2507 for unbacked ( #36064 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-03-19 17:28:45 -04:00
bnellnm
91be5f9be3
[MoE Refactor] Rename "naive" all2all backend ( #36294 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-03-19 15:50:34 -04:00
Aaron Hao
4ee847e400
Comment fix for async rl example ( #35244 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-19 19:46:07 +00:00
Andreas Karatzas
040a505ff5
[ROCm][CI] Cleaning and restructuring amd-ci legacy pipeline ( #34839 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-19 14:30:58 -05:00
bnellnm
9279c59a0e
[MoE Refactor] DefaultMoERunner simplification ( #33049 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-03-19 15:07:44 -04:00
Wentao Ye
7454096199
[Log] Log once in local node by default ( #37568 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-19 12:04:59 -07:00
Andreas Karatzas
fb8b5e05fc
[CI] Add retry with 4x backoff to HTTP fetches for transient failures ( #37218 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-19 19:00:20 +00:00
Harry Mellor
e5d96dc8fc
Fix SpeculatorsConfig now that PreTrainedConfig is a dataclass in Transformers ( #37574 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 18:04:40 +00:00
EdalatiAli
daa05bf340
[Bugfix] Fix AttributeError when serving MXFP8 models with DeepGEMM installed ( #37358 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-19 17:58:33 +00:00
Lucas Kabela
7769b58307
[torch.compile][BE][Multimodal] Remove requirement to set_model_tag to avoid cache conflict ( #37345 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-19 17:26:12 +00:00
Chauncey
2f9f946b22
[P/D] AnthropicMessages add kv_transfer_params for PD disaggregation ( #37535 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-19 16:41:20 +00:00
Fadi Arafeh
2890aecce5
[CPU][UX] Do not crash when tcmalloc/libiomp are not ldpreloaded ( #37561 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2026-03-19 16:35:45 +00:00
Harry Mellor
34f093b417
[CI] Gate pre-commit on ready label or number of contributions ( #37544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 16:21:57 +00:00
Harry Mellor
4dce8321a9
Run MacOS smoke test on daily cron job instead of every commit ( #37567 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 16:19:50 +00:00
Cyrus Leung
657855ab41
[Misc] Cleanup more configs and processors ( #37560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 15:45:23 +00:00
Wei Zhao
e27b8ba3d1
[Bug] Fix fp8 trtllm MoE modular kernel supported routing methods ( #37346 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-19 11:43:06 -04:00
Woosuk Kwon
40b8363b45
[MRV2] Use fp32 for draft logits ( #37526 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-19 08:41:21 -07:00
mikaylagawarecki
8b10e4fb31
[1/n] Migrate permute_cols to libtorch stable ABI ( #31509 )
...
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com >
2026-03-19 11:27:26 -04:00
Ifta khairul Alam Adil
104605cbf2
Remove deprecated reasoning_content message field(part-2) ( #37480 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com >
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Philip Ottesen <phiott256@gmail.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: Andy Lo <andy@mistral.ai >
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com >
Signed-off-by: sihao.li <sihao.li@intel.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Philip Ottesen <phiott256@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com >
Co-authored-by: Andy Lo <andy@mistral.ai >
Co-authored-by: Thillai Chithambaram <79466435+thillai-c@users.noreply.github.com >
Co-authored-by: sihao_li <165983188+1643661061leo@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 15:20:08 +00:00
Jee Jee Li
96266f119b
[LoRA] Minor improvements to LoRA log ( #37557 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-19 15:18:06 +00:00
Sage Moore
7c0cf3bcd0
Cap the number of API servers to 1 when using Elastic EP. ( #37466 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-19 10:42:57 -04:00
Harry Mellor
572b432913
Stop bench CLI from recursively casting all configs to dict ( #37559 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 14:04:03 +00:00
Cyrus Leung
9515c20868
[Misc] Clean up processing logic ( #37541 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 13:30:20 +00:00
DorBernsohn
c63ca2b2e6
[Bugfix] Add Kimi-K2.5 reasoning/tool parser aliases and tool_call_id support ( #37438 )
...
Signed-off-by: DorBernsohn <dor.bernsohn@gmail.com >
2026-03-19 21:08:00 +08:00
Harry Mellor
a32eaf5bb2
[CI] Merge cleanup_pr_body.yml and reminder_comment.yml ( #37552 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 12:55:07 +00:00
XueLiang Yang
e390742c59
Fix KV Offloading + MLA AssertionError by using num_kv_heads=1 in cpu… ( #37536 )
...
Signed-off-by: xueliangyang-oeuler <yxl546827391@gmail.com >
Co-authored-by: xueliangyang-oeuler <yxl546827391@gmail.com >
2026-03-19 12:05:07 +00:00
Cyrus Leung
7a6ebcbfcf
[Model] Remove unnecessary get_language_model ( #37545 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 20:00:36 +08:00
Cyrus Leung
c7bc12c20f
[CI/Build] Split out MM pooling tests ( #37542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 11:36:11 +00:00
wang.yuqi
f9e2a38386
[Docs] Reorganize pooling docs. ( #35592 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 11:25:47 +00:00
Harry Mellor
4426447bba
Don't log exc_info when vLLM tries to download a file that doesn't exist ( #37458 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-19 10:38:29 +00:00
Li, Jiang
3322e26420
[Bugfix] Avoid more OpenMP thread reallocation in CPU torch compile ( #37538 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-19 10:24:39 +00:00
Cyrus Leung
765e461065
[Bugfix] Fix Nemotron Parse loading ( #37407 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-19 09:55:29 +00:00
Duyi-Wang
6a9cceb219
[Bugfix][ROCm] Fix MoRI + AITER FP8 dispatch compatibility for defer_input_quant ( #37418 )
...
Signed-off-by: Duyi-Wang <duyi.wang@amd.com >
2026-03-19 09:49:27 +00:00
yassha
199f914183
fix(cpu): add null check for aligned_alloc in ScratchPadManager ( #37369 )
...
Signed-off-by: yassha <50112520+yassha@users.noreply.github.com >
2026-03-19 17:45:06 +08:00
Kunshang Ji
ca21483bf9
[MISC] fix pin_memory=torch.cuda.is_available(), use is_pin_memory_available ( #37415 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-19 09:23:24 +00:00
TJian
da70c87e81
[CI] Fix wrong path test file, missing rlhf_async_new_apis.py ( #37532 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-19 02:21:55 -07:00
Collin McCarthy
0b6d52629f
Support temporal compression for Nemotron-3-VL videos ( #36808 )
...
Signed-off-by: Collin McCarthy <cmccarthy@nvidia.com >
2026-03-19 08:02:19 +00:00
Ziming Huang
d3cc379567
[Perf] Fix slow hasattr in CUDAGraphWrapper.__getattr__ ( #37425 )
...
Signed-off-by: 智鸣 <hzm414167@alibaba-inc.com >
2026-03-19 15:43:48 +08:00
cdpath
354cd580d5
fix(anthropic): remove non-standard 'data: [DONE]' from Anthropic streaming ( #37510 )
...
Signed-off-by: cdpath <cdpath@outlook.com >
2026-03-19 07:23:35 +00:00
zhanqiuhu
d49f273144
[SSM/Mamba] Follow-up: N-1 prefill for P/D disaggregation ( #37310 )
2026-03-19 08:22:00 +01:00
Flora Feng
b21d384304
[Refactor] Relocate endpoint tests to mirror serving code directory structure ( #37504 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-19 07:19:36 +00:00
Hongxia Yang
e3126cd107
[ROCm] issue management - request information for bug issues on ROCm ( #37009 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-19 03:51:29 +00:00
Wentao Ye
e37ff5b5c8
[Perf] Optimize token_embed for pooling models, 1.0% token throughput improvement ( #37347 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-19 10:27:51 +08:00
Aaron Hao
6accb21f2a
[bug] Fix deadlock with pause resume and collective_rpc ( #37024 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-19 01:49:02 +00:00
Giancarlo Delfin
053f3b6309
[Model Runner V2] Spec decode rejection sampler logprobs support ( #37237 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-19 01:36:27 +00:00
Aaron Hao
5f82706a21
[BUG] Exclude SKIP_TENSORS from get_layer_size() + new weight sync example for dpep ( #37334 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-19 00:45:10 +00:00
Sage Moore
c32a58cc2a
[EPLB] Simplify EPLB rearrange by only returning one map ( #36267 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-18 20:34:00 -04:00
Elvir Crnčević
ef2c4f778d
[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding ( #37442 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-19 00:28:37 +00:00
sihao_li
9dade5da3a
[XPU]Unify xpu test dependencies in dockerfile.xpu ( #36477 )
...
Signed-off-by: sihao.li <sihao.li@intel.com >
2026-03-19 08:12:07 +08:00
Thillai Chithambaram
828f862acb
[Bugfix] Expand quantization method support in perf metrics ( #37231 )
...
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com >
2026-03-18 23:54:19 +00:00
Andy Lo
577df69b26
[Bugfix] Fix KV scales inconsistency in fp8 MLA & FlashInfer kv_cache_dtype "auto" leading to gibberish ( #37054 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-18 23:07:29 +00:00
Giancarlo Delfin
04244fd0e1
[Model Runner V2] Spec decode rejection sampler greedy support ( #37238 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-18 15:59:03 -07:00
Michael Goin
9482b0b085
[Bugfix] Remove assertion for NVFP4 scale dynamic range ( #37465 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-03-18 15:37:49 -07:00
Woosuk Kwon
5bc1da147f
[LoRA][BugFix] Fix skipped LoRA adapters for Mistral3 ( #36928 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-18 22:34:19 +00:00
Philip Ottesen
0091017188
fix(worker): optimize swap_states to copy only active token prefixes ( #34733 )
...
Signed-off-by: Philip Ottesen <phiott256@gmail.com >
2026-03-18 14:59:27 -07:00
Wentao Ye
0d81a1fe61
[V0 Deprecation] Deprecate virtual engine ( #37195 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 14:30:14 -07:00
Netanel Haber
6ae4c8d6fc
chunk parakeet into 30s clips to prevent OOMs on long audios ( #36671 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-18 14:22:24 -07:00
JartX
a913b612d8
[Bugfix] Fix ROCm crash in qwen3_next multi-stream events ( #36795 ) ( #37427 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-03-18 16:06:31 -04:00
Harry Mellor
5ce2d10e4a
Fix models which use layer_type_validation for Transformers v5 ( #37398 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 18:41:51 +00:00
Chengyu Fang
738d0a281f
[Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation ( #37439 )
...
Signed-off-by: chengyufang <cnyvfang@outlook.com >
2026-03-18 11:36:34 -07:00
youkaichao
70b81c4f3d
[bugfix][async scheduling] fix extra cuda context in device 0 with EP/DP ( #37449 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-18 18:32:30 +00:00
Cyrus Leung
7476d148db
[Model] Remove unnecessary processor definition for Nemotron Parse ( #37456 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 18:25:13 +00:00
Cyrus Leung
f3732bd931
[Misc] Clean up model registry ( #37457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 18:24:44 +00:00
Wentao Ye
0ef7f79054
[Perf] Add tuned triton moe config for Qwen3.5 H200, 9.9% E2E throughput improvement ( #37340 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 14:18:34 -04:00
Or Ozeri
5dd8df0701
[kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec ( #36642 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-18 19:26:40 +02:00
Harry Mellor
39bfb57b7c
Add API docs link if the CLI arg is a config class ( #37432 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 17:19:35 +00:00
RonaldBXu
c9d838fc33
Adding deterministic lora benchmarking to vLLM Bench ( #36057 )
...
Signed-off-by: Ubuntu <ubuntu@ip-172-31-43-201.ap-northeast-1.compute.internal >
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2026-03-18 16:02:03 +00:00
Xin Yang
b1169d7be8
[Kernel] Add gpt-oss Router GEMM kernel ( #37205 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-18 08:15:56 -07:00
XLiu-2000
17808394bc
standardize load_weights using AutoWeightsLoader for kimi_linear and minimax_text_01 ( #37371 )
...
Signed-off-by: XuLiu <xuliu40@gmail.com >
Co-authored-by: XuLiu <xuliu40@gmail.com >
2026-03-18 15:05:37 +00:00
elvischenv
296839a1b0
[Perf] Eliminate padding and slicing op for GPT-OSS with Flashinfer MXFP4 MXFP8 MoE ( #30647 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-03-18 15:01:26 +00:00
Wentao Ye
c373b5c00d
[Log] Reduce duplicate log ( #37313 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-18 10:57:44 -04:00
Itay Alroy
de1a86b7de
elastic_ep: Fix stateless group port races ( #36330 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-18 14:36:18 +00:00
Cyrus Leung
99267c23ca
[2/3] Refactor InternVL-based processors ( #37324 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-18 22:22:19 +08:00
Or Ozeri
525f2eeb0b
[kv_offload+HMA][6/N]: Split offloading_connector.py ( #37405 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-18 14:42:46 +01:00
Yufeng He
918b7890a1
[Bugfix] Fix base64 JPEG video frames returning empty metadata ( #37301 )
...
Signed-off-by: Yufeng He <40085740+universeplayer@users.noreply.github.com >
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Yufeng He <40085740+universeplayer@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-18 13:40:03 +00:00
Andy Lo
98b09ddc27
[NIXL][Bugfix] metrics & testing minor bug ( #36051 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-18 14:39:14 +01:00
Shwetha Poojary
cef1f302d2
[Model] Enable LoRA support for tower and connector in H2OVL ( #31696 )
...
Signed-off-by: shwetha-s-poojary <shwetha.s-poojary@ibm.com >
2026-03-18 13:26:47 +00:00
Elvir Crnčević
17c47fb869
[Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy ( #37322 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-18 18:30:29 +08:00
Chauncey
b322b197f1
[Build] Bump python openai version ( #32316 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-18 18:20:10 +08:00
Andreas Karatzas
eaf7c9b976
[CI] Fix PaddleOCR-VL HF test failure due to create_causal_mask API rename ( #37328 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 09:44:12 +00:00
Aaron Hao
47a1f11bff
[docs] Add docs for new RL flows ( #36188 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-18 09:04:26 +00:00
Karan Bansal
fad09e8a1f
fix(glm47): improve tool call parsing and content normalization ( #37386 )
...
Signed-off-by: karanb192 <karan@example.com >
Co-authored-by: karanb192 <karan@example.com >
2026-03-18 08:12:21 +00:00
Jee Jee Li
8c31f47c63
[LoRA] Make LoRA respect language_model_only ( #37375 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-18 07:53:34 +00:00
Li, Jiang
261801242f
[Bugfix] Avoid OpenMP thread reallocation in CPU torch compile ( #37391 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-18 07:51:39 +00:00
Or Ozeri
fcf0687b27
[kv_offload+HMA][0/N]: Support block-level preemption handling ( #34805 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-18 08:49:53 +02:00
liuzhenwei
86b7e3c95a
[XPU] skip unsupported ut and update test_nixl_connector ( #37179 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-18 13:32:59 +08:00
Andrew Xia
0e95916155
[responsesAPI] parser.extract_response_outputs can take in token IDs ( #37130 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-18 05:31:31 +00:00
Andreas Karatzas
ce2ef42fd3
[CI] Stabilize test_cpu_offloading by waiting for async offload before cache reset ( #37335 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 05:26:20 +00:00
Andreas Karatzas
8b6325758c
[ROCm][CI] Add ROCM_EXTRA_ARGS to audio_in_video test server fixture ( #37349 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 04:55:40 +00:00
gxd3
a0dd1995c7
[Hardware][TPU] Add supports_async_scheduling() method to Executor interface so that it can be extended for Executor implementations. ( #36924 )
...
Signed-off-by: Guangxiang Du <gxd@google.com >
2026-03-18 12:53:28 +08:00
Xin Yang
f1740006e4
[Perf] Enable dual stream execution of input projection for Qwen3 ( #36795 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-18 11:13:27 +08:00
Andreas Karatzas
58cde5c026
[ROCm][CI] Skip trtllm kvfp8 dequant tests on ROCm ( #37330 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-18 11:12:26 +08:00
Roy Wang
761e0aa7a0
[Performance] Add --enable-ep-weight-filter CLI option ( #37351 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-03-18 09:36:55 +08:00
Yanan Cao
ff9fbc9aff
[Kernel][Helion] [16/N] Refactor register_kernel API to be more Dynamo-friendly ( #36705 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-18 01:23:35 +00:00
Divakar Verma
e6c4797704
[ROCm][Quantization] add fp8xfp8 attn support for rocm_aiter_unified_attn ( #36927 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-18 08:49:32 +08:00
Michael Goin
09e4576f65
[Kernel] Add non-gated support for NVFP4 CUTLASS MoE ( #37320 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-17 18:12:04 -04:00
Andreas Karatzas
3ed7b1e6e0
[ROCm] Validate block_size for explicitly selected attention backends ( #36846 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 17:04:40 -05:00
JartX
e8f9dbc369
[Bugfix][ROCm] Fix worker startup OOM on ROCm by skipping unreliable cudagraph memory profiling ( #36720 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2026-03-17 17:55:34 -04:00
Yong Hoon Shin
de35c06c66
Make KV connector metadata build overridable via plugin ( #37336 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2026-03-17 21:29:06 +00:00
Athrael Soju
c0745a851a
[Model] Add ColQwen3.5 4.5B support ( #36887 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-17 21:17:02 +00:00
Ekagra Ranjan
b5ca9c3557
[Models] Cohere ASR ( #35809 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-17 21:04:17 +00:00
Chao-Ju Chen
245758992e
[Bugfix] Rescale NVFP4 weight scales to fix BF16 dequant underflow ( #34577 )
...
Signed-off-by: ricky-chaoju <ricky.chen@infinirc.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-17 20:48:42 +00:00
Dimitrios Bariamis
1204cf0a9d
[Bugfix] Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 ( #37158 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-17 20:13:06 +00:00
Wei Zhao
b36adfa349
[Perf] Set Flashinfer sparse MLA as default backend for FP8 kv cache ( #37252 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-17 20:09:20 +00:00
Michael Goin
e78821b438
[Deprecation] Deprecate --calculate-kv-scales option ( #37201 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-03-17 19:57:24 +00:00
Cyrus Leung
51f0acda79
[Model] Remove unused handle_oov_mm_token ( #37321 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 19:44:52 +00:00
Brian Dellabetta
fa75204b16
bump compressed-tensors version to 0.14.0.1 ( #36988 )
...
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2026-03-17 15:36:19 -04:00
Wentao Ye
bdb903bb5f
[Bug] Fix FlashInfer MNNVL socket collisions under concurrent vLLM jobs ( #36674 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-17 15:19:52 -04:00
Andrey Talman
68f783a727
[Torch 2.11] Guard torch._C._cpu attribute checks for forward compatibility ( #35673 )
...
Signed-off-by: atalman <atalman@fb.com >
2026-03-17 18:47:59 +00:00
Avinash Singh
c5030c439d
[CI] Split Distributed Tests (4 GPUs) and Kernel MoE tests ( #37100 )
...
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com >
Signed-off-by: Avinash Singh <107198269+avinashsingh77@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-17 11:44:55 -07:00
Michael Goin
51b2333be1
[Perf] Optimize top-k search in apply_top_k_top_p_triton sampler ( #37225 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-03-17 11:35:17 -07:00
Andreas Karatzas
4ed51308c8
[CI] Fix GPU memory leak when RemoteOpenAIServer fails to start in __init__ ( #37230 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 09:08:08 -07:00
Cyrus Leung
c781fbbab3
[Bugfix] Standardize custom HF Processor init ( #37289 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 15:38:55 +00:00
Richard Zou
979ff44cea
[BugFix] PyTorch Compilation Tests should error if any test fails ( #37300 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-17 15:26:38 +00:00
Benjamin Chislett
f63ed7b5ac
[Bugfix] Fix DP MTP Dummy Run ( #35243 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-17 11:16:48 -04:00
Ning Xie
c9e5096256
[openapi] remove redundant exception stack trace[4/N] ( #37157 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-17 15:06:25 +00:00
Anton Vlasjuk
2ff0ad9694
[UltraVox] Fix output type ( #37224 )
...
Signed-off-by: vasqu <antonprogamer@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:51:17 +00:00
Isotr0py
a836524d20
[Chore] Replace all base64 usages with faster pybase64 package ( #37290 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-17 14:44:19 +00:00
Bhoomit
3717a4dd47
[Misc][LoRA] Add --lora-target-modules to restrict LoRA to specific modules ( #34984 )
...
Signed-off-by: Bhoomit Vasani <bhoomit.2010@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:36:41 +00:00
Harry Mellor
ecfcdd2ce4
Fix Phi3 test that fails with Transformers v5 ( #37298 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-17 14:29:24 +00:00
Siew's Capital Jarvis
c25dbc2d27
[Bugfix] Fix unclean shutdown crash with AllReduce Fusion workspace ( #36955 )
...
Signed-off-by: Jarvis <brayden.stanley.0127@gmail.com >
2026-03-17 14:22:09 +00:00
Jonas M. Kübler
77d2a5f17b
pick up tuned prefill configs for FP8 FA3 ( #36265 )
...
Signed-off-by: Jonas M. Kübler <44084297+jmkuebler@users.noreply.github.com >
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2026-03-17 07:00:26 -07:00
Sage
59192dfd39
[Frontend] Complete OpenAI render delegation ( #37287 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-17 13:53:55 +00:00
Umut Polat
56cb1baa66
[Misc] Use VLLMValidationError in batch, pooling, and tokenize protocol validators ( #36256 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-03-17 13:52:30 +00:00
Cyrus Leung
f340324335
[1/2] Move InternVL-based processors ( #37260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-17 21:50:56 +08:00
sfbemerk
2660b9289c
Bugfix for offloading+prefetch for GLM-4.7-FP8 ( #37178 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2026-03-17 21:22:09 +08:00
Viacheslav
293f036e6d
Add gigachat 3.1 tool parser + fix gigachat3 tool parser ( #36664 )
...
Signed-off-by: Viacheslav Barinov <viacheslav.teh@gmail.com >
2026-03-17 12:03:20 +00:00
youkaichao
0fb142a454
[perf][connector] optimize build_connector_meta when host buffer transfer is not used ( #37165 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-17 11:59:35 +00:00
Sage
00f8e0d211
[Frontend] Delegate tokenization serving preprocessing to OpenAIServingRender ( #37266 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-17 11:22:54 +00:00
zhao, zhenhui
4af9ed21cb
[Bugfix](xpu): prevent “selected index k out of range” in TP decode path ( #37259 )
...
Signed-off-by: zhenzhao <zhenzhao@habana.ai >
2026-03-17 11:14:07 +00:00
Augusto Yao
9c7cab5ebb
[Feature]: Support for multiple embedding types in a single inference call ( #35829 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-03-17 17:05:42 +08:00
Chauncey
132bfd45b6
[Bugfix][ResponsesAPI] Fix crash when tool_choice=required exceeds max_output_tokens ( #37258 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-17 08:54:52 +00:00
xiao-llm
24b4272a8c
Fix infinite recursive search issue in quark.py ( #32779 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Signed-off-by: Xiao Yu <xiao.yu.dc@outlook.com >
Signed-off-by: kimheesu <wlskaka4@gmail.com >
Co-authored-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Kim Hee Su <wlskaka4@gmail.com >
2026-03-17 07:19:15 +00:00
Benjamin Chislett
8a680463fa
[Bugfix] Fix NemotronH MTP + Chunked Prefill ( #35447 )
2026-03-17 07:07:33 +01:00
Nick Cao
20b14095a4
[Bugfix] Fix loading Music Flamingo ( #35535 )
...
Signed-off-by: Nick Cao <ncao@redhat.com >
2026-03-17 05:24:40 +00:00
PatchyTIS
17c1bdf371
[Bugfix] dtype mismatch in ngram gpu propose ( #37246 )
...
Signed-off-by: PatchouliTaisa <patchychen@tencent.com >
Co-authored-by: PatchouliTaisa <patchychen@tencent.com >
2026-03-17 05:19:55 +00:00
Flora Feng
3e3d320c1b
[Refactor] Relocate responses API tests ( #37241 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 05:14:52 +00:00
Andreas Karatzas
54a62a79f7
[ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch ( #37219 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-17 11:34:49 +08:00
Flora Feng
384dc7f77b
[Refactor] Relocate completion and chat completion tests ( #37125 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 11:31:23 +08:00
Flora Feng
f04d5226f8
[CI] Fix flaky tool_use chat completion tests with deterministic seed ( #37027 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-17 03:24:34 +00:00
Kyuyeun Kim
0a0a1a198b
Add ability to replace oot ops when using lora ( #37181 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2026-03-16 18:04:15 -07:00
Vadim Gimpelson
6c1cfbad32
Support non-contiguous KV cache in TRTLLM fp8 dequant kernel ( #36867 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
Co-authored-by: Pavani Majety <pavanimajety@gmail.com >
2026-03-16 17:48:42 -07:00
Harry Huang
45f526d652
[BugFix] Correct max memory usage for multiple KV-cache groups ( #36030 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-03-17 00:38:52 +00:00
Julien Denize
5db91f0aaf
Fix some Mistral parser issues ( #37209 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-17 00:08:56 +00:00
Walter Beller-Morales
061980c36a
[Feature][Frontend] add support for Cohere Embed v2 API ( #37074 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-16 19:55:53 -04:00
Ben Browning
7a49742b88
[CI/Build] Add common tool call parser test suite ( #27599 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2026-03-16 19:46:20 -04:00
Terry Gao
3e6a1e1686
[Custom Ops] Add functional + out variant for scaled_fp4_quant ( #34389 )
...
Signed-off-by: tianrengao <terrygao87@gmail.com >
2026-03-16 18:51:46 -04:00
Julien Denize
7961486a9b
Fix EagleMistralLarge3Model initialization ( #37232 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-16 15:41:00 -07:00
Andreas Karatzas
4f9b14c21c
[CI] Stabilize multinode DP internal LB completion tests ( #36356 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 15:40:23 -07:00
Yuchen Fama
31a458c091
[Doc] Clarify schema enforcement behavior for tool_choice modes ( #37064 )
...
Signed-off-by: yfama <yuchengu@gmail.com >
2026-03-16 22:27:42 +00:00
Wei Zhao
a3a51d20e7
[Benchmark] Improvements to attention benchmark script ( #37115 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-03-16 22:22:40 +00:00
EdalatiAli
e5b807607c
[Quant][Feature] Support online MXFP8 quantization for MoE and dense models ( #35448 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
2026-03-16 18:07:39 -04:00
Elvir Crnčević
fd4d96302a
Fix eplb nvfp4 experts hook ( #37217 )
...
Signed-off-by: Elvir Crncevic <elvircrn@gmail.com >
Signed-off-by: Elvir Crncevic <elvir@anthropic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-16 22:03:54 +00:00
Krish Gupta
c0f011918d
[Bugfix] opcheck false mutation error in rms_norm_per_block_quant ( #36688 ) ( #36779 )
...
Signed-off-by: Krish Gupta <krishom70@gmail.com >
2026-03-16 21:11:33 +00:00
Zhengxu Chen
e6ae4b1be1
[compile] Enable mega aot artifact for torch 2.12+. ( #37198 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-16 21:05:51 +00:00
zhanqiuhu
2dccb38f73
[Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors ( #36549 )
2026-03-16 20:51:04 +00:00
Kunshang Ji
d157216093
[BUGFIX][Mamba] Use uint64 for address in KVBlockZeroer ( #37197 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-16 21:39:56 +01:00
Matthew Bonanni
93f3c8e531
[Misc] Add float16 to CacheDType ( #37199 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-16 13:24:48 -07:00
rasmith
2cc26c3a99
[CI][BugFix][MORI][AMD] Add transfer_id to kv transfer params for test ( #37213 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-16 13:22:57 -07:00
Flora Feng
dfa8852db2
[Refactor] Consolidate GPT-OSS reasoning parser tests ( #36915 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Flora Feng <4florafeng@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-16 15:53:07 -04:00
Lucas Kabela
714c6e0eab
[torch.compile][BE] Modify cudagraph callable to check for is_forward_context_set ( #36288 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-16 19:42:34 +00:00
Sage
0fefd00e6c
[Bugfix] Fix render server crash for quantized models on CPU-only hosts ( #37215 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-16 18:59:01 +00:00
Nicolò Lucchesi
f5c081d432
[PD][Nixl] Add support for hybrid SSM-FA models ( #36687 )
2026-03-16 19:58:06 +01:00
Matthew Bonanni
c88ea8338b
[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible ( #36982 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-16 13:51:21 -04:00
Max de Bayser
9f9ecff4cd
Add simple granite4 tool parser ( #36827 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2026-03-16 10:49:09 -07:00
haosdent
ca1954d58c
[Bugfix] Disable cross-layer KV cache for MLA attention backends ( #37090 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-16 19:03:10 +02:00
Raushan Turganbay
55e6d3d5c0
[Bugfix] Make siglip/clip compatible with transformers v5 ( #37200 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-03-16 16:48:18 +00:00
Chauncey
6682c231fa
[Bugfix] Add error handling for FINISHED_ERROR in OpenAIServing ( #37148 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-16 16:27:47 +00:00
Itay Etelis
5ae685c1c8
[Bugfix] Relax TRTLLM KV cache contiguity assertion for cross-layer layout ( #34158 )
...
Signed-off-by: Itay Etelis <itay.etelis@ibm.com >
Co-authored-by: Itay Etelis <itay.etelis@ibm.com >
2026-03-16 11:20:51 -04:00
Wentao Ye
ce8cf9161d
[Compile] Fix compile warning st256_cs in cuda_vec_utils.cuh ( #36693 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-16 11:12:15 -04:00
xjx
18be11fd59
[BUGFIX]fix CUDA OOM ERROR : invalid argument at cumem_allocator.cpp:119 ( #35594 )
...
Signed-off-by: xjx <493337577@qq.com >
2026-03-16 15:10:42 +00:00
Yuanheng Zhao
8d8855fdae
[Bugfix] Add safety check and fallback for null scaling factor ( #36106 )
...
Signed-off-by: Yuanheng Zhao <jonathan.zhaoyh@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 14:27:29 +00:00
Wentao Ye
e855d380fa
[Compile] Fix compile warning in moe_permute ( #36529 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-16 10:16:14 -04:00
Benjamin Bartels
0e5a9382af
[Bugfix] accept redacted thinking blocks in Anthropic messages ( #36992 )
...
Signed-off-by: Benjamin Bartels <benjaminba@tiglab-ubuntu.ilab.local >
Signed-off-by: bbartels <benjamin@bartels.dev >
Co-authored-by: Benjamin Bartels <benjaminba@tiglab-ubuntu.ilab.local >
2026-03-16 22:01:57 +08:00
Fynn Schmitt-Ulms
04bf5a35fa
[Spec Decode] Update extract_hidden_states to use deferred kv_connector clear ( #37013 )
2026-03-16 14:53:45 +01:00
Tianyu Guo
43a73f853b
Remove unused EVS functions in qwen3_vl.py ( #37183 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
2026-03-16 13:09:09 +00:00
Julien Denize
ffbc2e5bdb
Patch Mistral config ( #37104 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-03-16 12:22:18 +00:00
Lukas Geiger
f9e6db3034
[Models][Qwen3 ViT] Keep max_seqlen on CPU to prevent D2H sync ( #37139 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-16 12:11:59 +00:00
elvischenv
d61d2b08e9
[Build] Fix API rate limit exceeded when using VLLM_USE_PRECOMPILED=1 ( #36229 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 12:09:27 +00:00
Artem Perevedentsev
f5e59ee7a6
[Performance] Add prefetch for checkpoints to OS page cache ( #36012 )
...
Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com >
2026-03-16 11:32:02 +00:00
Harry Mellor
9b005edc48
[Docs] Make the link to hardware plugins clearer ( #37174 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 04:12:58 -07:00
Robin Nabel
bf9a185395
GLM4 tool parser: fix streaming mode ( #35208 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-16 18:48:52 +08:00
Harry Mellor
ad041c79db
Fix text only inputs for MRoPE models with the Transformers modelling backend ( #37055 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:31:16 +00:00
Kunshang Ji
747b068136
[Hardware] Replace memory related torch.cuda APIs ( #37031 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-03-16 10:24:48 +00:00
Harry Mellor
122f75d939
Fix pipeline parallel with multimodal models with the Transformers modelling backend ( #37057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:20:37 +00:00
SoluMilken
d8f8a7aad2
[Misc] Sync pre-commit to 4.5.1 in workflows and docs ( #36675 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-16 10:03:21 +00:00
Roy Wang
0115e957d4
[Frontend][Misc] Remove unused log in /is_sleeping ( #37093 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-03-16 17:46:28 +08:00
haosdent
116ed130f4
[Bugfix] Fix GDN attention crash with mixed decode/spec-decode batches ( #34871 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-16 10:30:23 +01:00
Vadim Gimpelson
8374387bd8
[FlashInfer] Revert block_size 16 + head_size 256 workaround on Blackwell ( #36987 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-16 09:04:29 +00:00
Isotr0py
912fbe9555
[Bugfix] Fix Qwen2.5-Omni/Qwen3-Omni use_audio_in_video with multi-video inputs ( #37147 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-16 08:56:06 +00:00
Laith Sakka
52131f88d9
use skip_all_guards_unsafe to drop global_state and torch_function_mode_stack guards instead of previous hacks ( #36204 )
...
Signed-off-by: Laith Sakka <lsakka@meta.com >
2026-03-16 08:52:31 +00:00
Roy Wang
821eb80c0d
[Performance][Model Loader] Skip non-local expert weights during EP model loading ( #37136 )
...
Signed-off-by: esmeetu <jasonailu87@gmail.com >
2026-03-16 01:33:36 -07:00
Andreas Karatzas
a2956a0f8e
[ROCm][CI] Retrying in case of batch variance effects and reducing flakiness ( #36442 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 16:08:51 +08:00
Andreas Karatzas
911355e216
[ROCm] Fix KV copy methods and auto-select attention backend for ROCm ( #36845 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 16:07:27 +08:00
Chauncey
8d3f8f485e
[Bugfix] fix Qwen3.5 tool calling bug ( #36774 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-16 15:38:42 +08:00
Woosuk Kwon
96efb91480
[Model Runner V2] Fix processed logits in sample() ( #37144 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-16 00:35:49 -07:00
leo-cf-tian
2754231ba3
[Kernel] Add FlashInfer MoE A2A Kernel ( #36022 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Signed-off-by: Leo Tian <lctian@nvidia.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Stefano Castagnetta <scastagnetta@nvidia.com >
Co-authored-by: root <root@lyris0267.lyris.clusters.nvidia.com >
2026-03-15 23:45:32 -07:00
bigshanedogg
2390d44209
[Model] Add HyperCLOVAX-SEED-Think-14B language model support ( #37107 )
...
Signed-off-by: bigshanedogg <bigshane319@gmail.com >
2026-03-16 06:40:05 +00:00
Li, Jiang
7362b4450a
[Bugfix] Avoid LD_PRELOAD check on MacOS ( #37145 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-15 23:31:44 -07:00
Andreas Karatzas
57a314d155
[CI][Bugfix] Fix 500 errors from priority overflow and TemplateError subclasses in schema fuzz tests ( #37127 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 05:27:21 +00:00
Andreas Karatzas
d4c57863f7
[ROCm][CI] Fix engine teardown and text normalization to stabilize voxtral test ( #37138 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-16 04:49:31 +00:00
Wang, Yiting
68e1b711f1
[XPU] Add deepseek_scaling_rope fused kernel ( #36612 )
...
Signed-off-by: yitingw1 <yiting.wang@intel.com >
2026-03-16 12:35:08 +08:00
rasmith
0024f39a32
[ROCm][P/D][MORI][BugFix] Add transfer_id for moriio_connector to restore P/D functionality ( #34907 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-03-16 10:36:51 +08:00
Andrew Xia
e9163b536e
[responsesAPI][ez] add a unit test for SimpleContext logprobs ( #37126 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2026-03-15 17:12:26 -07:00
Lalithnarayan C
7acaea634c
In-Tree AMD Zen CPU Backend via zentorch [1/N] ( #35970 )
...
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Chinmay-Kulkarni-AMD <Chinmay.Kulkarni@amd.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-03-15 23:35:35 +00:00
Jiangyun Zhu
697e4ff352
[GDN] add a config for gdn kernel selection ( #36647 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-16 00:40:17 +08:00
Hari
a3e2e250f0
[Feature] Add Azure Blob Storage support for RunAI Model Streamer ( #34614 )
...
Signed-off-by: hasethuraman <hsethuraman@microsoft.com >
2026-03-15 19:38:21 +08:00
Isotr0py
143e4dccdf
[Misc] Add online audio_in_video test ( #36775 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-15 00:14:11 -07:00
Isotr0py
6590a3ecda
[Frontend] Remove torchcodec from audio dependency ( #37061 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-15 05:15:59 +00:00
Russell Bryant
b3debb7e77
[Build] Upgrade xgrammar to get a security fix ( #36168 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-15 03:13:48 +00:00
Nick Hill
458c1a4b2d
[Frontend] Reduce chat template warmup logging levels ( #37062 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-14 13:48:59 -07:00
Karan Bansal
821fde2df4
[Bugfix] Fix xgrammar dtype mismatch on macOS CPU inference ( #32384 )
...
Signed-off-by: Karan Bansal <karanb192@gmail.com >
Co-authored-by: Inokinoki <inoki@inoki.cc >
2026-03-14 17:29:06 +00:00
arlo
8c29042bb9
[Feature] Add InstantTensor weight loader ( #36139 )
2026-03-14 18:05:23 +01:00
Cyrus Leung
5467d137b3
[Frontend] Avoid startup error log for models without chat template ( #37040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-14 09:36:11 -07:00
Santino Ramos
3ed46f374b
[Model Runner V2] Add Support for XD-RoPE ( #36817 )
...
Signed-off-by: Santino Ramos <elsantinoramos@gmail.com >
2026-03-14 09:27:55 -07:00
seanmamasde
84868e4793
[Bugfix][Frontend] Fix audio transcription for MP4, M4A, and WebM formats ( #35109 )
...
Signed-off-by: seanmamasde <seanmamasde@gmail.com >
2026-03-14 08:44:03 -07:00
Isotr0py
a8e8d62dd8
[Misc] Clean up Kimi-audio whisper encoder loading ( #36903 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-14 23:37:52 +08:00
Julien Denize
e42b49bd69
Mistral common v10 ( #36971 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: root <root@h200-bar-196-227.slurm-bar-compute.tenant-slurm.svc.cluster.local >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-14 07:26:43 -07:00
Sergey Zinchenko
4a718e770d
[Bug] Fix Failure in /v1/chat/completions/render for Multimodal Requests ( https://github.com/vllm-project/vllm/issues/35665 ) ( #35684 )
2026-03-14 14:10:11 +00:00
Kevin H. Luu
600a039f57
[CI] Shard Multi-Modal Models (Standard) into 4 parallel jobs ( #37014 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-14 08:26:54 +00:00
Harry Mellor
ffa5d74f15
Enable loading of fused expert weights in the Transformers modelling backend ( #36997 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-14 07:01:06 +00:00
Kevin H. Luu
74fe80ee95
[CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs ( #37015 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-14 12:21:13 +08:00
Flora Feng
bcfdadb1bc
[Refactor] Relocate chat completion and anthropic tests ( #36919 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-14 12:16:16 +08:00
Yanan Cao
236de72e49
[CI] Pin helion version ( #37012 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 23:25:29 -04:00
sbeurnier
a116f96930
[V1] Remove pin_memory() in async_copy_to_gpu to fix sporadic stalls ( #37006 )
...
Signed-off-by: Sebastien Beurnier <sbeurnier@together.ai >
2026-03-14 01:37:32 +00:00
Li, Jiang
092ace9e3a
[UX] Improve UX of CPU backend ( #36968 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-14 09:27:29 +08:00
Andrew Xia
f680dc1b39
[responsesAPI] prioritize content over summary in reasoning item input ( #36516 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <mitandrewxia@gmail.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Andrew Xia <axia@fb.com >
2026-03-14 09:20:30 +08:00
Giulio Leone
b41aa264f9
fix: resolve chat template names before kwargs detection ( #36937 )
...
Co-authored-by: giulio-leone <giulio.leone@users.noreply.github.com >
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com >
2026-03-14 00:20:16 +00:00
Dimitrios Bariamis
367cf5cd3e
[Feat][Bugfix] Enable additional dimension for Flashinfer MLA and fix routing dtype ( #36931 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-13 16:41:16 -07:00
haosdent
6d53efd2a5
[Bugfix] Fix MLA attention crash with AWQ/GPTQ quantized models ( #34695 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-13 23:25:41 +00:00
Benjamin Chislett
8b346309a5
[Refactor] Consolidate SupportsEagle ( #36063 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-13 23:22:40 +00:00
Nick Hill
54a6db827f
[BugFix] Fix "DP Coordinator receives unexpected..." messages ( #37008 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 23:18:05 +00:00
Matthew Bonanni
9efc4db965
[Bugfix] Fix DeepSeek-V3.2 tokenizer stripping spaces ( #37004 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-13 22:55:36 +00:00
Kevin H. Luu
f1816fb192
[CI] Split V1 e2e + engine (1 GPU) into separate jobs ( #36945 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-13 14:16:02 -07:00
Harry Mellor
0005d2a3c9
Use Transformers v5 WeightRenaming for Transformers modeling backend ( #31545 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-13 20:49:08 +00:00
Ekagra Ranjan
d0b402974f
[Bugfix][Spec Decode] Avoid double call of Ngram CPU ( #36952 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2026-03-13 20:33:19 +00:00
Divakar Verma
6341d43043
[ROCm][Quantization] add quark w4a8 mxfp4_fp8 for LinearLayer ( #35316 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2026-03-13 19:44:24 +00:00
Mark McLoughlin
7afe0faab1
[Frontend][Core] Re-add shutdown timeout - allowing in-flight requests to finish ( #36666 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 12:10:06 -07:00
Harry Mellor
5a3f1eb62f
[Misc] Set default kv_buffer_device in a better way ( #36862 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-13 19:07:33 +00:00
yugong333
b3ce711b93
Fp8 lora dense kernel ( #35242 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
2026-03-13 19:05:08 +00:00
Isotr0py
abf61aaa8e
[Bugfix] Fix Qwen2.5-omni/Qwen3-omni mm_processor cache for audio_in_video request ( #36800 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-13 18:16:05 +00:00
bigmoyan
4508532fbd
[Bugfix] fix paddleocr crash on some image shape ( #36959 )
...
Signed-off-by: wangzhengtao <wangzhengtao@msh.team >
Signed-off-by: bigmoyan <moyan_work@foxmail.com >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-13 13:46:55 +00:00
Itay Alroy
d5af196c18
[2/N] Elastic EP Milestone 2: Integrating NIXL-EP ( #35627 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Co-authored-by: Yongji Wu <wuyongji317@gmail.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-03-13 09:25:33 -04:00
Chaojun Zhang
82f836d976
[XPU] Support LoRA via torch.compile on XPU platform ( #36962 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2026-03-13 10:34:59 +00:00
Andreas Karatzas
4fccd30f19
[ROCm][CI] Upgrading orchestrator to handle python pipeline markers and options ( #36181 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-13 02:04:22 -07:00
Or Ozeri
cfaf4668f7
[kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec ( #36610 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-13 08:04:21 +00:00
Andreas Karatzas
99a57bdf74
[ROCm][CI] Corrected the GPT-OSS test root path ( #36711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-13 15:53:43 +08:00
Sage
a2268617cf
[Frontend] Delegate preprocessing to OpenAIServingRender ( #36483 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-13 00:39:43 -07:00
Rohan Potdar
a4ad9db541
Enable RoPE+KV cache fusion for ROCm AITER FA (non-shuffle layout) ( #35786 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-13 07:33:22 +00:00
Nick Hill
b373b5102a
[Tests] Shutdown test RemoteVLLMServer cleanly ( #36950 )
...
Recent PR #33949 changed the teardown logic of the RemoteVLLMServer test utility class to
send SIGTERM to all vllm (sub)processes at once, which breaks the clean/coordinated
shutdown logic that assumes only the top-level process will receive a signal (for example
when running in a container that's shut down).
This caused a bunch of errors and stacktraces in some test logs, even though those tests
still pass. We should still attempt a normal shutdown and only kill other procs if they are
still running after a few seconds.
Example: tests/v1/distributed/test_external_lb_dp.py::test_external_lb_completion_streaming
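A minimal sketch of the shutdown order described above, not the actual RemoteVLLMServer code: the function name, the `stragglers` argument, and the default grace period are hypothetical, and every process is assumed to be a `subprocess.Popen` handle held by the caller. Only the top-level process is signalled; anything still alive after the grace period is killed.

```python
import subprocess
import sys
import time


def shutdown_top_level_first(
    proc: subprocess.Popen,
    stragglers: list[subprocess.Popen],
    grace_seconds: float = 5.0,
) -> None:
    """Terminate only the top-level process, then kill whatever is left.

    Hypothetical helper: `proc` stands in for the top-level server process
    and `stragglers` for its (sub)processes.
    """
    # Step 1: ask the top-level process alone to shut down, so that its own
    # coordinated shutdown logic can run.
    proc.terminate()

    # Step 2: give every process until a shared deadline to exit by itself.
    deadline = time.monotonic() + grace_seconds
    for p in [proc, *stragglers]:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            # Step 3: still running after the grace period, so force-kill it.
            p.kill()
            p.wait()


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    top = subprocess.Popen(sleeper)
    other = subprocess.Popen(sleeper)
    shutdown_top_level_first(top, [other], grace_seconds=1.0)
    print(top.poll() is not None, other.poll() is not None)
```

In this sketch the second process never receives a termination request, so it is only reaped by the fallback kill once the deadline passes; in a real server the top-level process would be expected to stop its own children first.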
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 07:32:55 +00:00
Thomas Parnell
f296a1966d
[Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs ( #36876 )
2026-03-13 07:09:39 +01:00
Csrayz
bc2c0c86ef
[Frontend] Fix usage incorrectly returned with empty stream_options ( #36379 )
...
Signed-off-by: Csrayz <33659823+Csrayz@users.noreply.github.com >
2026-03-13 03:33:04 +00:00
jaime campos salas
891c60dcd5
fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 ( #36684 )
...
Signed-off-by: Jaime Campos Salas <jaime.campos.salas@gmail.com >
2026-03-12 23:28:27 -04:00
whyiug
1ce13cf992
[Model] Add support for BERT-like Chinese ERNIE pooling models ( #36385 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-13 03:23:53 +00:00
Nikita
10f08dedfa
[Model] Add ColPali late interaction model for multi-modal retrieval ( #36818 )
...
Signed-off-by: Nikita Sukharev <kaonael@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-13 02:18:57 +00:00
Aaron Hao
5e1a373d2e
[BUG] Fix rank calculation in NCCLWeightTransferEngine ( #36940 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-03-13 01:56:51 +00:00
Simo Lin
572c776bfb
build: update smg-grpc-servicer to use vllm extra ( #36938 )
...
Signed-off-by: Simo Lin <linsimo.mark@gmail.com >
2026-03-13 01:31:36 +00:00
Yifan Qiao
55d8073d06
[Bugfix] ep_scatter kernel store-load race condition ( #34991 )
...
Signed-off-by: Yifan Qiao <yifanqiao@berkeley.edu >
2026-03-13 01:07:59 +00:00
Nick Hill
cd32d6f586
[Model Runner V2] Some code simplification ( #36929 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-13 00:59:23 +00:00
Jaewon
aaa3092f51
[MoE] Add routing simulation override for MXFP4 quantized MoE ( #33595 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-13 00:30:44 +00:00
Shubhra Pandit
87985077a4
[Speculative Decoding] Add norm_before_fc for gpt-oss draft models ( #36545 )
...
Signed-off-by: Shubhra Pandit <shubhra.pandit@gmail.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-12 23:03:32 +00:00
Ryan Rock
a79c1c2c80
[AMD][Build] Add DeepEP to ROCm Dockerfile ( #36086 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-03-12 21:33:32 +00:00
Andreas Karatzas
cc8f1f4764
[ROCm][CI] Preparing gfx90a mirroring ( #36210 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-12 13:42:25 -07:00
Michael Goin
05b9e8ab5b
Revise environment setup in AGENTS.md ( #36909 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-12 19:21:11 +00:00
Xinan Miao
2cdf92228c
[Feature]: Remove Chunking From FusedMoE ( #34086 )
...
Signed-off-by: SouthWest7 <am1ao@qq.com >
Signed-off-by: Southwest <1403572259@qq.com >
Signed-off-by: southwest <am1ao@qq.com >
Signed-off-by: Xinan Miao <1403572259@qq.com >
Co-authored-by: SouthWest7 <am1ao@qq.com >
2026-03-12 14:24:38 -04:00
Marc Sun
c973ecdead
[bnb] Skip moe + bnb test ( #36896 )
...
Signed-off-by: Marc Sun <marc@huggingface.co >
2026-03-12 18:03:25 +00:00
Harry Mellor
e39257a552
Add AGENTS.md ( #36877 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 10:20:50 -07:00
Dimitrios Bariamis
cc16b24b17
Update Flashinfer to 0.6.6 ( #36768 )
...
Signed-off-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
Co-authored-by: Dimitrios Bariamis <12195802+dbari@users.noreply.github.com >
2026-03-12 13:19:19 -04:00
Eunkwang Jeon
bdc2343454
[Bugfix] Fix KeyError in parse_response_input for reasoning items with optional content ( #34499 )
...
Signed-off-by: jeonsworld <jeonsworld@gmail.com >
2026-03-13 00:13:36 +08:00
Matthew Bonanni
f444c05c32
[Attention] Use FA4 for MLA prefill ( #34732 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-12 12:10:17 -04:00
SoluMilken
85199f9681
[Bugfix] fix main branch pre-commit error (1 line change) ( #36897 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-12 09:08:37 -07:00
grimulkan
a1257fd1ea
[Kernel] Add FP8 KV cache support to Triton MLA decode attention ( #34597 )
...
Signed-off-by: grimulkan <grimulkan@gmail.com >
2026-03-12 08:32:34 -07:00
Thomas Parnell
abcffbba8c
[CI] Fix mypy pre-commit errors on main ( #36882 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-12 08:22:29 -07:00
Kunshang Ji
53ec16a705
[Hardware] Replace torch.cuda.device_count/current_device/set_device API ( #36145 )
...
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-12 07:57:47 -07:00
Wei Zhao
2e693f48e7
[Perf] Add TRTLLM FP8 MoE Modular Kernel ( #36307 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-12 07:32:31 -07:00
Martin Hickey
7f1f36bf91
[CI] Fix mypy for vllm/reasoning ( #35742 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-12 12:21:33 +00:00
Mark McLoughlin
5282c7d4d0
[docs] Add lightweight AI assisted contribution policy ( #30947 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-12 11:46:13 +00:00
caozuoba
9e19f8338b
[Perf] add packed recurrent fast path for decode ( #36596 )
...
Signed-off-by: hdj <1293066020@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-12 04:01:57 -07:00
Sage
06e0bc21d2
[Frontend] Split OpenAIServingModels into OpenAIModelRegistry + OpenAIServingModels ( #36536 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-12 03:29:37 -07:00
Chauncey
5a71cdd76e
[Bugfix] Fix crash when tool_choice=required exceeds max_tokens ( #36841 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-12 03:28:45 -07:00
Shanshan Shen
f0d3658c0f
[MM][OOT] Support CPU seq_lens for OOT MMEncoderAttention kernels ( #36605 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-12 03:28:23 -07:00
Michael Goin
57431d8231
[UX] Only show FP4 Marlin fallback warning for w4a4 models ( #36806 )
...
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-12 05:19:35 -04:00
Xu Jinyang
3e64fe4a18
[Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling ( #36599 )
...
Signed-off-by: AuYang <459461160@qq.com >
2026-03-12 00:51:09 -07:00
sfeiqiang
8cb24d3aed
[KV Connector] Support using FlexKV as KV Cache Offloading option. ( #34328 )
...
Signed-off-by: phaedonsun <phaedonsun@tencent.com >
Co-authored-by: phaedonsun <phaedonsun@tencent.com >
2026-03-12 00:46:20 -07:00
István Ketykó
00726c74c9
[Bugfix][Model] Fix DeepSeek-OCR TensorSchema crash on empty images_crop ( #36670 )
...
Signed-off-by: István Ketykó <istvan.ketyko@gmail.com >
2026-03-12 15:35:54 +08:00
Chauncey
9fe404ed04
[Frontend] OpenAI Responses API supports Tool/Function calling with streaming ( #29947 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-12 15:03:50 +08:00
Sage
802f306cd1
[Tests] Skip model weight download for render-only test server ( #36813 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-12 06:24:42 +00:00
Yan Ma
894843eb25
replace `with torch.cuda.device` with `with torch.accelerator.device_index` ( #36144 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-11 23:12:57 -07:00
Yanan Cao
584a3f56de
[Kernel][Helion][13/N] Force static_shapes=False in helion register ( #36677 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-12 05:35:29 +00:00
Nick Hill
36735fd772
[BugFix] Fix multiple/duplicate stdout prefixes ( #36822 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-12 12:23:21 +08:00
wang.yuqi
6ecabe4936
[CI Failure] Fix Language Models Test (Extended Pooling) daily CI Failure ( #36761 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-12 12:22:05 +08:00
Woosuk Kwon
2f8b4ce0c0
[Model Runner V2] Do not initialize sampler for non-last PP ranks ( #36824 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-12 03:55:28 +00:00
Yuwei An
2ef69456f5
[LMCache] Fault Tolerance Mechanism ( #36586 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-03-12 03:54:39 +00:00
Louie Tsai
17852aa503
more models for vLLM Benchmark Suite ( #35086 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-03-12 11:36:51 +08:00
Flora Feng
8647c6cf51
[Bugfix] Fix minimax_m2 tool parser when stream interval > 1 ( #35895 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-03-12 10:25:14 +08:00
Kunshang Ji
513949f95f
[XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu ( #36831 )
...
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-03-12 01:46:02 +00:00
Nick Hill
262b76a09f
[Frontend] Exclude anthropic billing header to avoid prefix cache miss ( #36829 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-12 01:20:34 +00:00
Wentao Ye
c34ba6b961
[Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement ( #36710 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-12 08:37:01 +08:00
Matthias Gehre
24062b704f
[ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures ( #36499 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-03-11 23:14:40 +00:00
Aaron Hao
d6b61e5166
[BUG] Fix async rlhf tests ( #35811 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-11 18:06:10 -04:00
Yanan Cao
cf632499ee
[Kernel] [Helion] [15/N] Split config files into per-platform files ( #36698 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:29 -04:00
Yanan Cao
a3774a8198
[Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation ( #36563 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:16 -04:00
Yanan Cao
0ce21c46a0
[Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning ( #36683 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 17:25:04 -04:00
Woosuk Kwon
55eed6b7a5
[Model Runner V2] Add WhisperModelState [6/N] ( #35790 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-11 14:20:38 -07:00
Giancarlo Delfin
c77181e534
[Model Runner V2] Add probabilistic rejection sampling for spec decoding ( #35461 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-11 14:04:32 -07:00
maobaolong
12001f2ebc
[LMCache] Pass TP size in lookup for MLA multi-reader locking ( #36129 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu >
2026-03-11 20:45:20 +00:00
Or Ozeri
7ee5d5093b
[BugFix][kv_offload] Fix offloading decodes with async scheduling ( #33881 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-11 20:43:40 +00:00
jennyyyyzhen
428bc718bd
[Bugfix][ROCm] Strip block_size before attention backend validation ( #36274 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-11 13:37:31 -07:00
汪志鹏
ff1e3d9c63
[BugFix]: add bagel to MM_PREFIX_LM_MODELS ( #36316 )
...
Signed-off-by: princepride <wangzhipeng628@gmail.com >
2026-03-11 19:55:59 +00:00
Wentao Ye
35bdca5431
[Refactor] Remove dead code in KV connector ( #36424 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-11 19:40:17 +00:00
Amanzhol Salykov
8a24842765
[ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 ( #35093 )
...
Signed-off-by: salykova <amsalykov@gmail.com >
Signed-off-by: amd-asalykov <asalykov@amd.com >
2026-03-11 19:00:08 +00:00
Harry Mellor
65986db6ba
Make Gemma and Gemma 2 accept inputs_embeds like Gemma 3 ( #36787 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 18:12:43 +00:00
Luka Govedič
9556af87d5
[torch.compile] Add support for non-contiguous fused RMSNorm + group quant ( #36551 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
2026-03-11 10:56:55 -07:00
Or Ozeri
a1a3523a56
[KVConnector] Support worker -> scheduler metadata ( #31964 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-11 17:36:37 +00:00
tianshu-Michael-yu
741f4e046b
fix: align lfm2 thumbnail token counting with HF ( #36707 )
2026-03-11 10:28:38 -07:00
Julien Denize
a5d06dc557
Add 320 dimension size support to MLA ( #36161 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2026-03-11 10:21:22 -07:00
Harry Mellor
5efa206a8c
Fix ExaoneMoeMTP test that never ran in Transformers v4 ( #36792 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 17:10:23 +00:00
Cyrus Leung
196802dfa6
[Misc] Clean up renderers ( #36770 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-11 16:39:29 +00:00
Isotr0py
c84b519cf3
[Bugfix] Fix negative max_tokens when input prompt is too long ( #36789 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 16:30:51 +00:00
Flora Feng
741ecf0630
[CI] Add bfcl tool call correctness eval ( #36560 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-11 12:27:36 -04:00
Robert Shaw
b7e5a588d8
[Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels ( #36061 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-11 16:07:14 +00:00
Richard Zou
822e250ab7
[torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation ( #36093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-11 16:07:09 +00:00
Hongxin Xu
bea02cdf93
Fix routed experts capture for hybrid models (Mamba + Attention) ( #35744 )
...
Signed-off-by: arlenxu <arlenxu@tencent.com >
Signed-off-by: xhx1022 <1737006628@qq.com >
Co-authored-by: arlenxu <arlenxu@tencent.com >
2026-03-11 08:53:10 -07:00
Julien Denize
a3ea760ea5
Add 'none' reasoning effort to ChatCompletionRequest ( #36238 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2026-03-11 15:45:34 +00:00
Harry Mellor
35db669f1d
Correct link to supported hardware on vllm.ai ( #36798 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 08:43:28 -07:00
Julien Denize
afebeffbfb
Add support to Mistral large 3 eagle with dense layers ( #36163 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-11 15:42:56 +00:00
Jhao-Ting Chen
5573894737
Kimi k2.5 MLA based eagle3 ( #36361 )
...
Signed-off-by: Izzy Putterman <iputterman@nvidia.com >
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Izzy Putterman <iputterman@nvidia.com >
2026-03-11 11:36:11 -04:00
Harry Mellor
d5816c8c2f
Fix tied weights in weight mapping test for Transformers v5 ( #36788 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 15:10:26 +00:00
Woosuk Kwon
8ccbcda5c0
[Model Runner V2] Remove unused warmup_for_prefill method ( #36762 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-11 08:02:44 -07:00
tvirolai-amd
a9e532afe2
[ROCm][Perf] Allow MTP lens > 1 in Sparse MLA ( #36681 )
...
Signed-off-by: Teemu Virolainen <teemu.virolainen@amd.com >
2026-03-11 14:43:03 +00:00
Harry Mellor
f3163bba67
Disable docs build skipping until a better solution is found ( #36790 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 13:53:23 +00:00
Martin Hickey
700a1ddc65
[Misc] Use envs module to get VLLM_DISABLED_KERNELS ( #35776 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-11 13:37:46 +00:00
Silvia Colabrese
f33251ffc8
[Bugfix] Fix Mistral-small --format ( #36782 )
...
Signed-off-by: 12010486 <silvia.colabrese@intel.com >
2026-03-11 04:47:52 -07:00
Wuxun Zhang
e584dce52b
Add XPU MLA Sparse backend for DeepSeek v3.2 ( #33230 )
...
Signed-off-by: Zhang, Wuxun <wuxun.zhang@intel.com >
2026-03-11 19:19:15 +08:00
Ning Xie
40c0461f24
[openapi] refactor render related openapi [3/N] ( #36749 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-11 03:14:34 -07:00
Weiguang Li
724759684c
[Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps ( #36136 )
...
Signed-off-by: OiPunk <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 03:13:06 -07:00
Michael Goin
9c34e9d24f
Disable cascade attention by default ( #36318 )
2026-03-11 03:12:23 -07:00
Richard Zou
09b6f99852
[compile] aot_compile should respect VLLM_DISABLE_COMPILE_CACHE ( #36358 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-11 03:12:03 -07:00
Ethan T.
c87fb515ed
fix(lora): use replaced_module_name in pooling model name check ( #36402 )
...
Signed-off-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-11 03:11:27 -07:00
Itay Alroy
5353c9b016
platforms: Fix Ray DP startup crash ( #36665 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-11 03:08:55 -07:00
Angela Yi
13e79fc811
[ci] Update rtol for test_classification ( #36556 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
2026-03-11 03:08:16 -07:00
Rahul Tuli
9d07a3d6e4
Add: Eagle3 support for Qwen3.5 ( #36658 )
...
Signed-off-by: Rahul-Tuli <rtuli@redhat.com >
2026-03-11 03:07:42 -07:00
Cyrus Leung
646b85544b
[Refactor] Remove Molmo2 processor wrapper ( #36667 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-11 03:07:20 -07:00
tc-mb
4286cc5ec2
fix(minicpmv): fix audio inference by handling meta device in init_re… ( #36751 )
...
Signed-off-by: caitianchi <caitianchi@modelbest.cn >
2026-03-11 03:06:28 -07:00
LoganJane
545d18d81b
[Bugfix] Support other quantization methods in glm41v ( #36321 )
...
Signed-off-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com >
Co-authored-by: g00887675/loganJane <g00887675/loganJane73@hotmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 09:48:05 +00:00
roikoren755
e661b9ee83
[NemotronH] Small fix reasoning parser ( #36635 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-03-11 02:44:41 -07:00
YiSheng5
c910eeb125
[XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. ( #36593 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2026-03-11 09:17:46 +00:00
Harry Mellor
f4ae58b38b
Remove unused config field from Gemma2 ( #36672 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-11 01:51:19 -07:00
Isotr0py
e568cf88bc
[UX] Infer dtype for local checkpoint ( #36218 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-11 08:50:04 +00:00
Nicolò Lucchesi
098d844731
[NIXL][1/N] Refactor kernel_block_size detection ( #35752 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-11 01:11:23 -07:00
JartX
a40ee486f2
[Bugfix] Add Multiple of 16 block_size to triton fallback on rocm Attention to support qwen3_5 ( #35923 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: akaratza <akaratza@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-03-11 07:45:57 +00:00
pschlan-amd
eac2dc2b41
AITER MLA backend: Avoid CPU sync in _build_decode ( #35765 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-03-11 07:25:00 +00:00
Flora Feng
d5080aeaa4
[Refactor] Remove deadcode in Responses API serving ( #36726 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-11 07:11:41 +00:00
liuzhenwei
f22d6e0267
[Hardware][NIXL] set default kv buffer type for different platform ( #36438 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-11 05:19:28 +00:00
Kunshang Ji
76c6e6da08
[XPU] Support block fp8 moe by fallback to TritonExpert on XPU ( #36458 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-10 21:54:09 -07:00
typer-J
4184653775
feat: add RISC-V support for CPU backend (v2) ( #36578 )
...
Signed-off-by: typer-J <2236066784@qq.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-03-10 21:51:39 -07:00
Sladyn
4aaaf8c8ce
feat(spec_decode): fuse EAGLE step slot mapping and metadata updates ( #33503 )
...
Signed-off-by: sladynnunes <snunes@usc.edu >
2026-03-11 04:35:33 +00:00
Hongbin Guo
4bf533623b
[Doc] Fix duplicate words in comments ( #36713 )
...
Signed-off-by: Hongbin10 <jdmjdm1998@163.com >
2026-03-10 21:28:31 -07:00
Matthew Bonanni
5f77ef15ae
[Misc][Attention] Clean up unused method in CPU_ATTN ( #36673 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-10 21:27:22 -07:00
elvischenv
7d6abdd022
[Fix] Use torch.empty for output in attention+quant fusion ( #31785 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-03-10 21:26:14 -07:00
Wentao Ye
a8ff2cca92
[Perf] Optimize scheduler overhead for PD disaggregation, around 5% E2E perf improvement ( #35781 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-10 21:25:30 -07:00
tunglinwood
42fadebecb
[Model] Add support for moonshotai/Kimi-Audio-7B-Instruct ( #36127 )
...
Signed-off-by: tunglinwood <tunglinwood@gmail.com >
Signed-off-by: tunglinwood <tomwu.tunglin@gmail.com >
Signed-off-by: tunglinwood <113751333+tunglinwood@users.noreply.github.com >
2026-03-10 21:24:48 -07:00
tianshu-Michael-yu
a197eda9c3
Add tuned H100 MoE configs for LFM2 8B and 24B ( #36699 )
2026-03-10 21:22:02 -07:00
Kevin H. Luu
82b110d50e
[ci] Bound nvidia-cudnn-frontend version ( #36719 )
...
Signed-off-by: khluu <khluu000@gmail.com >
2026-03-11 12:17:35 +08:00
Benjamin Chislett
9040cd40af
[DSV3.2][MTP] Optimize Indexer MTP handling ( #36723 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-11 12:16:56 +08:00
fangyuchu
fa0d353acf
[Bugfix] Surface exceptions from non-blocking execute_model in UniProcExecutor to avoid DP deadlocks ( #35194 )
...
Signed-off-by: fangyuchu <fangyuchu@qq.com >
2026-03-11 03:22:21 +00:00
Augusto Yao
b386bb3d7c
fix bugs when token_classify & classify run concurrently ( #36614 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
2026-03-10 20:16:34 -07:00
Ning Xie
fe714dd507
[openapi server] log exception in exception handler(2/N) ( #36201 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-10 20:16:30 -07:00
Matthew Bonanni
8ab3d7427c
[Bugfix] Fix DeepSeek V3.2 OOM during CG memory profiling ( #36691 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-11 03:01:07 +00:00
Wei Zhao
84e436ed1c
[Bug] Fix TRTLLM Block FP8 MoE Monolithic ( #36296 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-10 22:04:47 -04:00
Andreas Karatzas
81939e7733
[ROCm][CI] Making some tests optional to reduce workload ( #36090 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-10 16:45:27 -07:00
Woosuk Kwon
195d1ca3e8
[Minor] Enhance error message for TRTLLM decode uniformity check ( #36609 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-10 15:38:45 -07:00
Nick Hill
8d983d7cd6
[Model Runner V2] Add initial CI tests ( #36041 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 14:55:21 -07:00
Nick Hill
65b2f405dc
[Core] Simplify core kv-cache blocks initialization logic ( #36521 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 20:20:02 +00:00
Nick Hill
2a68464c5b
[Test] test_async_scheduling.py improvements ( #36340 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 11:17:26 -07:00
Zhengxu Chen
bdd8981dab
[compile] Apply stored functorch config while finalizing loaded artifacts. ( #36582 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-10 09:34:35 -07:00
Woosuk Kwon
f088a831dd
[Model Runner V2] Use unpadded num_tokens for PW CUDA graph attn metadata ( #36626 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-10 09:30:56 -07:00
Harry Mellor
f83b933b84
[CI] Bump mypy version to 1.19.1 ( #36104 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 09:18:28 -07:00
Pleaplusone
82f3f30e26
[ROCm][Perf] Enable sparse_mla's cudagraph on ROCm platform ( #35719 )
...
Signed-off-by: ganyi <ygan@amd.com >
2026-03-10 09:14:35 -07:00
Matthew Bonanni
9095cbbfb6
[Bugfix][Sparse MLA] report indexer CG support properly ( #36519 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-10 09:14:31 -07:00
Hashem Hashemi
721ae79f50
Improvements to wvSplitKrc skinny GEMM solution ( #34304 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-03-10 09:14:27 -07:00
AllenDou
aefc59f088
FunASR model bugfix ( #36633 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-03-10 08:14:21 -07:00
Harry Mellor
d88f28da05
Fix hf_override_fn when it modifies model_type ( #35200 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 15:03:18 +00:00
Srinivasoo7
106ff69c4e
feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency ( #35342 )
...
Signed-off-by: srinivas_oo7 <Sriusa4414@gmail.com >
Signed-off-by: Sriusa4414@gmail.com
Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com >
Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com >
Co-authored-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com >
Co-authored-by: Or Ozeri <oro@il.ibm.com >
2026-03-10 14:43:40 +00:00
Jiangyun Zhu
ca5fb4bbd8
[Bugfix] Avoid merging empty-only partitions into splitting-op subgraphs ( #36595 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-10 07:39:01 -07:00
Alvin Tang
cf88b23749
fix: check HTTP status in batch read_file to prevent silent failures ( #36397 )
...
Signed-off-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: gambletan <ethanchang32@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-10 07:22:40 -07:00
wang.yuqi
a3189a08b0
[Model] Consolidate score logic by introduce score_type ( #36479 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-10 13:32:25 +00:00
SoluMilken
409c4e632d
[Misc] fix typo: homogenous-> homogeneous (2 lines change) ( #36508 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-10 06:25:37 -07:00
Raushan Turganbay
8850738b70
[Bugfix] Fix processor signature ( #36630 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 06:20:47 -07:00
Mark McLoughlin
234860399b
[Frontend][Core] Revert "Add shutdown timeout" ( #34730 and #36270 ) ( #36628 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-03-10 06:20:41 -07:00
Harry Mellor
c88510083b
Fix Qwen2.5-VL test for Transformers v5 ( #36532 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-10 12:05:34 +00:00
Vadim Gimpelson
4ff8c3c8f9
[BUGFIX][Mamba][Qwen3.5] Zero freed SSM cache blocks on GPU ( #35219 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-03-10 03:32:20 -07:00
Chang Su
507ddbe992
feat(grpc): extract gRPC servicer into smg-grpc-servicer package, add --grpc flag to vllm serve ( #36169 )
...
Signed-off-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-03-10 03:29:59 -07:00
Nick Hill
ddbb0d230a
[Model Runner V2] Fix mm input embeddings lookup ( #36588 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 00:24:58 -07:00
Nick Hill
9efc3bdcd6
[Model Runner V2] Fix _compute_slot_mappings_kernel for chunked prefill ( #36580 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-10 00:23:42 -07:00
amirkl94
156e33553c
Fix: Re-Enable EP for trtllm MoE FP8 backend ( #36494 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2026-03-09 23:11:27 -07:00
hallerite
d0cd736caa
[Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. ( #36557 )
...
Signed-off-by: hallerite <hallerite@users.noreply.github.com >
Signed-off-by: hallerite <git@hallerite.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-09 22:30:51 -07:00
Harry Mellor
195c997203
Fix LFM2 MoE test for Transformers v5 ( #36534 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-09 22:29:17 -07:00
Zhuohan Li
04b67d8f62
Remove unused disable_fallback field ( #36546 )
2026-03-09 20:56:54 -07:00
Wentao Ye
7279374f91
[Perf] Compute maxsim in worker side, reducing redundant copies, 2.7% E2E throughput improvement ( #36159 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 20:55:58 -07:00
Woosuk Kwon
006aea17d7
[BugFix] Remove incorrect assert in split_decodes_and_prefills ( #36553 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 20:02:02 -07:00
Hojin Yang
0836be3b03
[Model] Add HyperCLOVAX-SEED-Think-32B vision-language model support ( #31471 )
...
Signed-off-by: effortprogrammer <yhjhoward7@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-10 10:59:19 +08:00
Ajay Anubolu
4e95ec111c
[Bugfix] Fix Qwen3-Next in_proj_ba weight sharding with TP > 1 ( #36242 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
2026-03-09 19:16:26 -07:00
Andreas Karatzas
179547d62c
[ROCm][CI] Fix ROCm GPT-OSS Eval test group ( #36179 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 17:55:20 -07:00
youkaichao
f85b4eda3a
[bugfix] fix nvlink for nixl/ucx ( #36475 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-03-10 07:49:47 +08:00
Woosuk Kwon
2a194ddd72
[Model Runner V2] Add model_state inputs to CUDA graph capture ( #36544 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 15:14:51 -07:00
Shaun Kotek
203a7f27da
add nemotron v3 reasoning parser ( #36393 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
Co-authored-by: root <root@gpu-259.slurm-workers-slurm.slurm.svc.cluster.local >
2026-03-09 15:11:41 -07:00
Lucas Wilkinson
483463f735
[MRV2] Extensible CG dispatch rework ( #35959 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-03-09 13:58:45 -07:00
Matthew Bonanni
4e571ce643
[MTP][Misc] Clean up dead code ( #36507 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 14:43:06 -04:00
Micah Williamson
4ff9b045fe
[ROCm][CI] Prep Tests For Change To ROCM_ATTN As New Default Backend On ROCm ( #36025 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-09 13:27:55 -05:00
Lucas Kabela
3fd03f1ec2
[BE] Rename should_torch_compile_mm_vit to should_torch_compile_mm_encoder ( #36281 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-03-09 18:22:05 +00:00
Woosuk Kwon
10a5f4d53d
[Model Runner V2] Use NamedTuple for execute_model_state ( #35930 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 11:17:34 -07:00
Simon Mo
fe0c085c28
[Docs] Remove the reo beacon ( #36528 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-03-09 11:16:50 -07:00
Taneem Ibrahim
8d6b3d5dda
[Misc] Refactored 5 duplicate helper functions that were copied-pasted across multiple parsers ( #36436 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-09 14:14:11 -04:00
Copilot
4b87ffbefb
[torch.compile] Rename compile_ranges_split_points to compile_ranges_endpoints ( #36027 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-09 18:04:40 +00:00
Shaun Kotek
fa028207aa
Fix/resupport nongated fused moe triton ( #36412 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com >
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: liweiguang <codingpunk@gmail.com >
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Alex Brooks <albrooks@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
Signed-off-by: Tushar Shetty <tushar.shetty@abbyy.com >
Signed-off-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: nvnbagrov <nbagrov@nvidia.com >
Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com >
Co-authored-by: danisereb <daserebrenik@nvidia.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Weiguang Li <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
Co-authored-by: Alex Brooks <albrooks@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: cong-or <conchubhar.gannon@gmail.com >
Co-authored-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
Co-authored-by: liuzhenwei <zhenwei.liu@intel.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 11:01:18 -07:00
Russell Bryant
d460a18fc6
[Docs] Expand --allowed-media-domains security guidance with threat details ( #36506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-09 17:43:42 +00:00
Woosuk Kwon
6e956d9eca
[Model Runner V2] Add dummy profile_cudagraph_memory API ( #36520 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-09 10:20:13 -07:00
Andreas Karatzas
1e0f917b34
[ROCm][CI] Fix logprob divergence for TitanML/tiny-mixtral under AITER rms_norm ( #36101 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 12:07:44 -05:00
Andreas Karatzas
c174d54f86
[ROCm][CI] Fix ROCm attention backend validation for head sizes, block sizes, and compute capability checks ( #36292 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-09 12:02:41 -05:00
SoluMilken
55d27cca55
[Misc] fix typo: dependant -> dependent (2 lines change) ( #36511 )
...
Signed-off-by: SoluMilken <ypiheyn.imm02g@g2.nctu.edu.tw >
2026-03-09 10:00:12 -07:00
Roberto L. Castro
580864d81e
[Attention][Perf][Kernel] Replace torch.cat with vectorized CUDA kernel MLA query concat - DeepSeek-V3.2 ( #34917 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
2026-03-09 09:50:36 -07:00
Roberto L. Castro
2b28b9b269
[Attention][Perf] Optimize cp_gather_and_upconvert_fp8_kv_cache - DeepSeek-v3.2 ( #35290 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
2026-03-09 09:46:57 -07:00
Taoyu Zhu
70485a11bd
[ROCM] Optimize the fused_topk_bias to use aiter instead of fallback torch ops. ( #36253 )
...
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com >
2026-03-09 11:30:35 -05:00
Harry Mellor
74a9f54cdb
[CI] Fix edge case that could lead to broken docs builds on main ( #36515 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-09 09:06:19 -07:00
Matthew Bonanni
00c4cb5606
[Bugfix] Clear stale CG keys after memory profiling ( #36416 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 11:56:00 -04:00
Wentao Ye
941e52c298
[Refactor] Simplify chat_completion_full_generator for tool parsers ( #35634 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 23:33:46 +08:00
Wentao Ye
be292b7c14
[Bug] Fix pooling model benchmark script ( #36300 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-03-09 11:17:45 -04:00
Matthew Bonanni
77a73458e3
Reapply [Attention] Refactor check_and_update_config ( #35122 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-09 07:17:14 -07:00
Tianyu Guo
5578f2a4d3
Support online use_audio_in_video ( #36319 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 07:16:44 -07:00
Cyrus Leung
3ec2115015
[Frontend] Move warmup into Renderer ( #36482 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 06:03:21 -07:00
Isotr0py
b0906d8b02
[MM Encoder] Default to use TORCH_SDPA backend for ViT on Volta/Turing GPU ( #36472 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-09 03:43:44 -07:00
Kevin H. Luu
aaf5fa9abf
[ci] Bound openai dependency to 2.24.0 ( #36471 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-03-09 03:43:26 -07:00
Cyrus Leung
f96c3ab08c
[Deprecation][1/2] Remove items deprecated in v0.18 ( #36470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 03:43:23 -07:00
Xin Yang
dc6b578466
[Kernel] Add fused_sigmoid_gating_delta_rule_update kernel for Qwen3 Next ( #35777 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-08 23:41:01 -07:00
liuzhenwei
1bc9c77f6d
[XPU] Add test script of PD disaggregation ( #36434 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
2026-03-09 05:50:27 +00:00
Alex Brooks
65a4da1504
[Frontend] Add Support for MM Encoder/Decoder Beam Search (Online Transcriptions) ( #36160 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-03-09 05:46:23 +00:00
Li, Jiang
217f27598d
[Bugfix] Avoid to replace non-tensor members in cpu model runner ( #36430 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-09 13:06:28 +08:00
wang.yuqi
fff3711a24
[Frontend][2/n] Improve pooling entrypoints | embed. ( #36110 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
2026-03-09 11:42:19 +08:00
Tushar Shetty
c4d859c274
[Bugfix] Skip out-of-stage layers in get_layers_from_vllm_config for pipeline parallel ( #36243 )
...
Signed-off-by: Tushar Shetty <tushar.shetty@abbyy.com >
Signed-off-by: Tushar Shetty <54362365+tusharshetty61@users.noreply.github.com >
2026-03-08 20:40:16 -07:00
cong-or
747431044d
feat(attention): extract KV-cache update from FlexAttention backend ( #36263 )
...
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
2026-03-08 20:40:12 -07:00
Cyrus Leung
d62856b928
[Misc] Move processors to transformers_utils ( #35953 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-09 11:31:39 +08:00
Alex Brooks
bd2659a566
Increase Flexibility for OOV Multimodal Token Handling ( #34858 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-03-08 20:30:49 -07:00
Shaun Kotek
90512b2e8b
fix: Use iterator as not to store all the file loads in memory at once ( #36149 )
...
Signed-off-by: Shaun Kotek - Nvidia <skotek@nvidia.com >
2026-03-08 20:25:21 -07:00
wang.yuqi
dcf8862fd4
[Examples][1/n] Resettle basic examples. ( #35579 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-08 20:22:53 -07:00
Weiguang Li
43aa389231
[Bugfix] Fix CPU OMP autobind assertion to use local_world_size ( #35815 )
...
Signed-off-by: liweiguang <codingpunk@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-03-08 20:07:29 -07:00
Wentao Ye
384425f84e
[Dependency] Remove default ray dependency ( #36170 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-08 20:06:22 -07:00
Harry Mellor
a0f44bb616
Allow markdownlint to run locally ( #36398 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-08 20:05:24 -07:00
Kunshang Ji
fde4771bbd
[XPU][Doc] update xpu document about triton dependency/conflict issue. ( #36301 )
...
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
2026-03-09 02:09:22 +00:00
Jiangyun Zhu
e5ff140216
[cudagraph] fix cudagraph warning in deepseekv32 ( #28044 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-08 20:27:41 -04:00
danisereb
0a6a3a1290
Add support for ModelOpt MXFP8 MoE models ( #35986 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-03-08 13:00:05 -07:00
Sage
4497431df6
[Frontend] Add GPU-less render serving path (vllm launch render) ( #36166 )
2026-03-08 16:35:09 +01:00
nvnbagrov
b7332b058c
[Model] Nano Nemotron VL - fast media preprocessing ( #35657 )
...
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com >
2026-03-08 03:04:05 -07:00
Andreas Karatzas
40077ea3de
[CI] fix flaky empty responses and add diagnostic assertions in vision chat tests ( #36341 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-08 14:42:24 +08:00
Samuel Shen
5d6aae4577
[LMCache MP Patch]: Race Condition + Duplicated Block Ids ( #35831 )
2026-03-07 13:52:48 -08:00
Roy Huang
63298ee173
[Bugfix][LMCache][KVConnector] fix potential memory leak in LMCache multiprocess mode ( #35931 )
2026-03-07 13:52:35 -08:00
Richard Zou
2dde535df1
[compile] Split compile/warmup monitoring ( #36098 )
2026-03-07 13:52:11 -08:00
Wei Zhao
379689d533
[Perf] Support FP8 KV cache for Flashinfer MLA Sparse ( #35891 )
2026-03-07 13:51:54 -08:00
PatchyTIS
a6be75dbd2
[Core] NGram GPU Implementation compatible with Async Scheduler ( #29184 )
2026-03-07 13:51:37 -08:00
Micah Williamson
ee54f9cdb9
[ROCm][CI] Accept Different But Valid Output for test_olmoe_tp ( #35224 )
2026-03-07 13:50:52 -08:00
Micah Williamson
fc4657756f
[ROCm][CI] Enable AITER for failing test_gpt_oss test case on MI355 ( #36174 )
2026-03-07 13:50:17 -08:00
qli88
eebd14651f
[CI] Enable Crosslayer KV layout tests for ROCm platforms ( #35416 )
2026-03-07 13:49:56 -08:00
Matthew Bonanni
ebb9cc5f2b
[UX][Startup] Account for CUDA graphs during memory profiling ( #30515 )
2026-03-07 13:49:23 -08:00
rahul-sarvam
85f50eb41f
Adding support to Sarvam's MoE models ( #33942 )
...
Signed-off-by: rahul-sarvam <140298821+rahul-sarvam@users.noreply.github.com >
2026-03-08 01:16:24 +08:00
Taneem Ibrahim
5261223c2d
[Misc] Remove duplicate parser registration ( #36303 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-07 09:37:01 -05:00
lif
00b814ba5a
[V0 Deprecation] Remove unused swap_space parameter ( #36216 )
...
Signed-off-by: majiayu000 <1835304752@qq.com >
Co-authored-by: mcelrath
2026-03-07 22:09:55 +08:00
vllmellm
ee8a29511f
[Bugfix] Fix compressed-tensors quantization failure for DeepSeek-R1 on MI300x ( #36247 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-07 09:26:59 +00:00
milesial
755356b3d1
feat: expose media_io_kwargs at runtime ( #34778 )
...
Signed-off-by: Alexandre Milesi <milesial@users.noreply.github.com >
2026-03-07 04:27:04 +00:00
Andreas Karatzas
58928475e4
[ROCm][CI] Making entrypoints more deterministic on ROCm ( #36293 )
2026-03-06 19:04:40 -08:00
Mengtao (Martin) Yuan
1a9718085c
Fix CUDA graph decode capture crash in AITER FlashAttention ( #36042 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-03-06 18:12:07 -08:00
Kunshang Ji
7eb524e64c
refine vllm bench throughput --backend hf ( #35971 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-07 02:10:33 +00:00
Nick Hill
c7f32e08c2
[BugFix] Avoid ignored trust_remote_code warnings ( #36290 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-07 01:24:18 +00:00
Nick Hill
b354686524
[Model Runner V2] Fix warmup for pipeline parallel ( #36280 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-06 16:58:51 -08:00
Nick Hill
6a18d8789b
[Core] Fix benign error log during normal shutdown ( #36270 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2026-03-07 00:39:21 +00:00
Itay Alroy
24a03915f5
mla: don't update kv cache on dummy forwards ( #36282 )
...
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
2026-03-07 00:36:00 +00:00
Andreas Karatzas
b5e34e1fca
[ROCm][CI] Fixing yaml file for external amd-ci signal ( #36284 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 18:30:39 -06:00
Copilot
ce8546a12b
[docs][torch.compile] Add fusions.md — kernel/operator fusion reference page ( #35538 )
...
Signed-off-by: ProExpertProg <luka.govedic@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: ProExpertProg <luka.govedic@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-06 23:55:06 +00:00
Chuan (Richard) Li
c188749bcd
[ROCm] Support MLA with nhead<16 and FP8 KV cache for TP=8 (Kimi K2.5/Linear) ( #35850 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-06 20:24:03 +00:00
Alexei-V-Ivanov-AMD
225d1090a0
Enabling some B200-specific tests on MI355 ( #35253 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Signed-off-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
2026-03-06 19:27:20 +00:00
eellison
f3c6c9c9d7
[CustomOp] CustomOp FusedRMSNormGated ( #35877 )
...
Signed-off-by: Elias Ellison <elias.ellison@gmail.com >
Signed-off-by: eellison <elias.ellison@gmail.com >
2026-03-06 10:53:37 -08:00
Nick Hill
26bd43b52d
Revert "[BugFix] Fix engine hanging after KV cache initialization fai… ( #36262 )
2026-03-06 08:28:09 -08:00
Travis Johnson
6b625a8807
[Bugfix] Quickfix followups to busy loop removal in #28053 ( #36068 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-06 08:13:05 -08:00
Richard Zou
54756b6109
[compile] Stop unconditionally patching constrain_to_fx_strides ( #36152 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-06 10:17:27 -05:00
Raphaël Rialland
39f9ea0da4
[Bugfix] Fix cudagraph_mode:FULL dispatch (This does not impact FULL_AND_PIECEWISE (default)) ( #36165 )
2026-03-06 09:15:31 -05:00
Isotr0py
e4ae148a78
[Refactor] Modular video loader backend refactoring ( #35202 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-06 06:06:59 -08:00
Isotr0py
1d0c0d209c
[Misc] Lazy import registered processors ( #36024 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-06 06:06:45 -08:00
Chenguang Zheng
fcb73f306c
[bugfix] add api process rank in default multimodal request ( #36150 )
...
Signed-off-by: fake0fan <645327136@qq.com >
Signed-off-by: Chenguang ZHENG <645327136@qq.com >
2026-03-06 12:00:09 +00:00
Harry Mellor
e2090bf3af
[CI] Fix startup error test ( #36230 )
...
A change in engine startup error messages in #35478 caused this test failure.
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-06 11:50:28 +00:00
Andreas Karatzas
2a00d3241f
[CI][MM] Gate vision encoder attention mask to MiniCPM only, fixing Aria regression ( #36206 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 01:17:08 -08:00
Alex Brooks
10f4db4dbe
[Frontend] Add Support for MM Encoder/Decoder Beam Search (Offline) ( #36153 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-06 01:16:56 -08:00
Nicolò Lucchesi
5b3ba94ab4
[Core][KVConnector] Support HMA+NixlConnector ( #35758 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-06 08:51:21 +01:00
zhanqiuhu
90f3c01fa4
[Spec Decode][KV Connector] Fix KV transfer in PD + speculative decoding ( #35158 )
...
Signed-off-by: Claude <noreply@anthropic.com >
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-06 08:50:44 +01:00
Andreas Karatzas
807d680337
[ROCm][CI] Fix tool use test stability - disable skinny GEMM, prefix caching, eliminate batch variance ( #35553 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 15:15:12 +08:00
Tyler Michael Smith
5afb387bd4
Change "following fields were present in the request but ignored" log from warn to debug ( #36173 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-03-05 22:15:46 -08:00
Walter Beller-Morales
43e77e59ab
[BugFix] avoid infinite loop with VLLM_PORT and get_open_ports_list ( #36191 )
...
Signed-off-by: walterbm <walter.beller.morales@gmail.com >
2026-03-05 22:15:29 -08:00
Russell Bryant
00bd08edee
[Security] Respect user trust_remote_code setting in NemotronVL and KimiK25 ( #36192 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-05 22:15:19 -08:00
Ajay Anubolu
43f10573c9
[Bugfix] Fix misleading context length error messages ( #36197 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-05 22:15:12 -08:00
Yongye Zhu
86e1060b17
[Bugfix] Fix inner_dp_world initialization order for multi-node TP ( #35892 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-03-05 22:04:44 -08:00
Mark McLoughlin
27066d1b2b
[Frontend][Core] Add shutdown timeout - allowing in-flight requests to finish ( #34730 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-03-05 22:04:31 -08:00
cong-or
57c84ff129
perf: add __slots__ to KVCacheBlock ( #36164 )
...
Signed-off-by: cong-or <conchubhar.gannon@gmail.com >
2026-03-05 22:04:09 -08:00
Xiang Shi
e68de8adc0
docs: fix wrong cc in int8.md ( #36209 )
...
Signed-off-by: Xiang Shi <realkevin@tutanota.com >
2026-03-06 06:01:02 +00:00
Andreas Karatzas
a1ffa56a1e
[CI] Fix bge-m3 similarity reference values after *Defination* typo fix ( #36208 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 05:07:29 +00:00
Shiyan Deng
0a208d1f54
[BugFix] Fix engine hanging after KV cache initialization failure ( #35478 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:58:09 -08:00
Shiyan Deng
03a49bb8f0
[Feature] Add --distributed-timeout-seconds CLI option ( #36047 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:57:51 -08:00
Shiyan Deng
8e87cc57f1
[Bug] Fix a corner case in _process_simple_streaming_events ( #34754 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-03-05 20:57:32 -08:00
Cyrus Leung
6dd302653f
[Misc] Rename group_mm_kwargs_by_modality -> group_and_batch_mm_kwargs ( #36158 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-06 12:32:48 +08:00
Cyrus Leung
de00ebeac4
[Bugfix] Fix simple Mistral-Small example ( #36156 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-05 20:25:11 -08:00
Andreas Karatzas
639680d220
[ROCm][CI] Adding missing dependencies for Multi-modal models tests ( #36177 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-06 12:23:10 +08:00
Rohan Potdar
c5362c739f
Reenable features for ROCm attention backends ( #36185 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-05 20:21:06 -08:00
Nikhil Gupta
0a49676fb0
cpu: aarch64: Upgrade OneDNN for aarch64 to add support for int8 matmul ( #36147 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2026-03-06 03:48:59 +00:00
Jeffrey Wang
c012a8c477
Don't fire ray compatibility webhook when PR or branch is not provided ( #36088 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-03-06 00:42:21 +00:00
Dor Huri
ebed80a7c8
[Performance] Extract KV-cache update from TreeAttention backend ( #35384 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
2026-03-06 00:22:43 +00:00
Nick Hill
a73af584fe
[Model Runner V2] Fix warmup for very small kvcache and/or blocksizes ( #36176 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-05 14:48:10 -08:00
Zhengxu Chen
a97954b6a8
[compile] Consistent compiler config for saved/loaded vllm backends. ( #35810 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 15:08:12 -05:00
Yanhong Li
a911f4dd20
[Model] Add support for OLMo Hybrid ( #32550 )
2026-03-05 14:51:06 -05:00
Russell Bryant
5395471d29
[CI] Add explicit permissions to macOS smoke test workflow ( #35775 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-05 19:08:48 +00:00
Frank Wang
a57c877f18
[BugFix] Fallback from FA4->FA2 for Batch Invariance ( #36059 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2026-03-05 14:05:56 -05:00
Xin Yang
f917020983
[Perf] Optimize FusedMoEModularKernel output tensor using torch.empty ( #35794 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-03-05 13:47:53 -05:00
tomeras91
86483ca774
[Bugfix] Disable FlashInfer TRTLLM BF16 path for non-gated MoE ( #36146 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-05 09:49:05 -08:00
Netanel Haber
b93a9e6f6d
ParakeetProjection.norm = RMSNorm instead of nn.LayerNorm ( #36133 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
2026-03-05 17:29:30 +00:00
Xinyu Chen
d8839ef7d9
[XPU] Enable ModelRunnerV2 on XPU ( #36078 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-03-05 17:19:18 +00:00
Avery Miao
e998fa76b9
[BUGFIX]Fix Qwen-Omni models audio max_token_per_item estimation error leading to encoder_cache_size is 0 ( #35994 )
...
Signed-off-by: Miao, Avery <avery.miao@intel.com >
2026-03-05 09:16:29 -08:00
Jiayi Yan
6a895197fa
[Bugfix][CI] fix typos ( #34934 )
...
Signed-off-by: 1195343015 <1195343015@qq.com >
Signed-off-by: Jiayi Yan <66017932+1195343015@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 17:05:46 +00:00
Sage Moore
8c760b6ab6
[ROCm] Refactor ROCm attention backend selection logic ( #35246 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-03-05 10:51:26 -06:00
AllenDou
3ee68590c7
refactor funasr model. ( #36108 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-05 08:07:37 -08:00
Cyrus Leung
7196348157
[Bugfix] Fix Qwen-VL tokenizer implementation ( #36140 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-05 08:07:19 -08:00
Ning Xie
176c799f4c
[openai api] log exception in exception handler (1/N) ( #31164 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2026-03-05 16:00:12 +00:00
Or Ozeri
612e7729c2
[KVConnector] Scheduler: Fix num_computed_tokens after async KV load ( #34616 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-03-05 14:25:15 +00:00
Harry Mellor
ecde7af9c4
Fix import that was moved in Transformers 5.2.0 ( #36120 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 13:59:44 +00:00
Harry Mellor
8df523351f
[Docs] Only build docs if documentation or ready labels are present ( #36135 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 13:58:16 +00:00
Andreas Karatzas
b03ff6a96b
[CI] Stabilize test_no_args_tool_call and add ROCm-specific server args ( #36107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-05 21:52:49 +08:00
Ajay Anubolu
ed81d5edd1
[Bugfix] Fix RunAI streamer crash with S3-hosted model paths ( #35976 )
...
Signed-off-by: AjAnubolu <anuboluajay@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-05 12:14:20 +00:00
Shiyan Deng
3c23ac840e
[Bugfix] Fix mypy errors in hermes_tool_parser.py ( #36114 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-03-05 11:37:47 +00:00
cjackal
a708ef5944
[Misc] Fix SyntaxWarning - invalid escape sequence '\e' ( #36020 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2026-03-05 10:55:31 +00:00
Kunshang Ji
66a2209645
[Hardware] Replace torch.cuda.synchronize() api with torch.accelerator.synchronize ( #36085 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-05 10:36:39 +00:00
Doug Smith
0bfa229bf1
[Release] Include source distribution (sdist) in PyPI uploads ( #35136 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
Co-authored-by: Daniele Trifirò <dtrifiro@redhat.com >
2026-03-05 01:43:50 -08:00
Paco Xu
7493c51c55
[Docs] add Dynamo/aibrix integration and kubeai/aks link ( #32767 )
...
Signed-off-by: Paco Xu <paco.xu@daocloud.io >
2026-03-05 17:39:50 +08:00
Reagan Lee
ac773bbe80
[Docs] Update docs to include mm processor + encoder benchmarks ( #34083 )
...
Signed-off-by: Reagan <reaganjlee@gmail.com >
2026-03-05 01:38:25 -08:00
Christian Munley
48e376a007
qwen3coder tool parser fix anyOf double encoded parameters ( #36032 )
...
Signed-off-by: Christian Munley <cmunley@nvidia.com >
2026-03-05 09:06:57 +00:00
Isotr0py
21eb2c3372
[Chore] Correct MTP models test registry ordering ( #36115 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-05 08:55:04 +00:00
Seiji Eicher
e2b31243c0
[Docs] Update CacheConfig block_size docstring to remove inaccurate limit when using CUDA ( #35632 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-03-05 06:24:08 +00:00
Martin Hickey
c3598d02fa
[Misc] Remove deprecated items that are due for removal ( #36006 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-03-05 06:14:50 +00:00
Benjamin Chislett
57c629e9c1
[Bugfix] Fix block_size for hybrid model MTP ( #36036 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-03-05 06:10:54 +00:00
zihaoanllm
d106bf39f5
[Doc] Add Parallel Draft Models ( #35973 )
...
Signed-off-by: <zihaoan2@amd.com >
Signed-off-by: zihaoanllm <zihaoan2@amd.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 05:44:07 +00:00
Yanan Cao
b0651021e5
[Kernel] [Helion] [11/N] Retune configs for silu_mul_fp8 ( #36062 )
2026-03-04 21:25:59 -08:00
Hanjun Cho
f600d5192e
[Bugfix] Fix score layer quantization for sequence classification models - Qwen3 (VL) Reranker ( #35849 )
...
Signed-off-by: Hanjun Cho <gkswns0531@gmail.com >
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-03-04 20:57:20 -08:00
Tianmu Li
8e7820131e
[Perf] Use dummy M for weight prepacking on x86 ( #35890 )
...
Signed-off-by: Li, Tianmu <tianmu.li@intel.com >
2026-03-05 04:56:49 +00:00
Andrii Skliar
0a12cea25f
Order config.py in Lexicographical order ( #35866 )
...
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Co-authored-by: Andrii Skliar <askliar@nvidia.com >
2026-03-04 20:56:47 -08:00
Zhengxu Chen
dd6dbd93f8
[compile] Fix extra cache save on warm start. ( #35921 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 12:56:30 +08:00
Harry Mellor
26366009c5
[CI] Don't leave docs preview comment on closed PRs ( #36087 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-05 04:51:46 +00:00
Nick Hill
16c472abe7
[Core] Move ray-specific WorkerWrapperBase methods to RayWorkerWrapper ( #35328 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-05 12:11:59 +08:00
daje0601
3b23d57c96
[Model] Add LoRA support for Whisper models ( #29856 )
...
Signed-off-by: daje0601 <englishmt4118@gmail.com >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
2026-03-05 10:38:25 +08:00
Wentao Ye
2f4226fe52
[CI] Fix pre-commit mypy issue in main ( #36049 )
2026-03-04 18:13:12 -08:00
nkm-meta
792cbd64ca
Add platform method to enable custom collective ops registration ( #34760 )
...
Signed-off-by: Naina Kuruballi Mahesh <nainakm@meta.com >
2026-03-05 00:50:32 +00:00
Zhengxu Chen
2ed4722e26
[compile] Reduce log spam from compile. ( #36044 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-03-05 00:48:36 +00:00
Nick Hill
a3299c3d1d
[Model Runner V2] Misc code simplification ( #35941 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 15:26:35 -08:00
Andreas Karatzas
6c21a0c2d7
[ROCm][CI] Added MI325 mirrors (stage C) ( #35239 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 14:48:46 -08:00
Shanshan Shen
562339abc3
[Misc] Support OOT linear method registering ( #35981 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-03-04 22:25:56 +00:00
amitz-nv
d7adcadb9b
[Bugfix] Fix passing of activation_type to trtllm fused MoE NVFP4 and FP8 ( #36017 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-03-04 22:23:51 +00:00
Simon Mo
f678c3f61a
[RL] [Weight Sync] Guard IPC update-info pickle deserialization behind insecure serialization flag ( #35928 )
...
Co-authored-by: Cursor Agent <cursoragent@cursor.com >
2026-03-04 17:05:32 -05:00
Thomas Parnell
be0a3f7570
[Bugfix] Fix race in non-blocking num_accepted_tokens GPU->CPU copy ( #36013 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-04 13:52:44 -08:00
Harry Mellor
17dc9c7fc9
[CI] Bump mypy version ( #34950 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 20:55:11 +00:00
fenypatel99
7eca859110
Add PyTorch profiler schedule support with warmup/active iterations ( #35240 )
2026-03-04 12:53:38 -08:00
Russell Bryant
636ee223ac
[Docs] Document security risks of GPT-OSS Python tool ( #35139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-04 20:27:31 +00:00
Robert Shaw
b7d59ffce2
[UX] Remove NoOpOffloader log ( #35678 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-04 12:13:40 -08:00
Richard Zou
5569f5218d
[torch.compile] Stop lazily compiling ( #35472 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-04 12:13:17 -08:00
Davina Zaman
138d891d7f
[Docs] Clarify structured outputs configuration for Qwen3 reasoning mode ( #32441 )
...
Signed-off-by: Davina Zaman <davzaman@users.noreply.github.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 11:44:39 -08:00
Stefano Castagnetta
d7166e74c1
[CI] Add Blackwell AsyncTP correctness test ( #35871 )
...
Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com >
2026-03-04 19:41:21 +00:00
Nick Hill
417fd28fb1
[Model Runner V2] Fix pooling ( #36019 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 10:53:17 -08:00
tomeras91
7faba503c4
[Kernel][Mamba] Optimize Mamba2 SSD prefill Triton kernels ( #35397 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2026-03-04 19:47:17 +01:00
Hyunkyun Moon
bc6be89d16
[Frontend] Add vllm launch command for GPU-less preprocessing serving ( #34551 )
...
Signed-off-by: HyunKyun Moon <mhg5303@gmail.com >
2026-03-04 18:41:52 +00:00
Maxime Grenu
32224f568a
docs: update CPU Docker images to reference Docker Hub instead of AWS ECR ( #34882 )
...
Signed-off-by: Maxime Grenu <69890511+cluster2600@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2026-03-04 10:31:35 -08:00
Abhishek Mathukiya
f3dc292e9f
docs: add version requirement note for --profiler-config flag ( #32454 )
...
Signed-off-by: abhishkh <mathukiya.a@northeastern.edu >
2026-03-04 18:13:54 +00:00
Chen
138c5fa186
[Docs] Add RunPod GPU deployment guide for vLLM ( #34531 )
...
Signed-off-by: lisperz <zhuchen200245@163.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:11:34 -08:00
Russell Bryant
2f2c1d73a7
[Docs] Upgrade dynamic LoRA warning to admonition block ( #35218 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2026-03-04 10:01:42 -08:00
Bhuminjay Soni
fb3e78ab09
[Feature][CI]: compare func & no_func outputs in test_functionalization.py ( #35481 )
...
Signed-off-by: Bhuminjay <bhuminjaysoni@gmail.com >
Signed-off-by: Bhuminjay Soni <Soni5Happy@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-04 18:01:16 +00:00
Michael Yao
fd3bfe74c9
[Docs] Update design/multiprocessing.md ( #30677 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2026-03-04 17:58:59 +00:00
tc-mb
bfdb512f11
fix minicpmo4.5: fix attn_mask in vit attn && fix resampler pos_emb i… ( #34127 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Co-authored-by: hezhihui <hezhihui@modelbest.cn >
2026-03-04 17:46:17 +00:00
Sage
d25c1ec3c9
docs(cpu): Clarify pre-built wheels requirement for CPU Python-only build ( #35090 )
...
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com >
2026-03-04 17:45:35 +00:00
Xing Liu
7cc6058ac6
[Doc] Add MTP docs and update speculative decoding guidance ( #35197 )
...
Signed-off-by: liuxing <945764858@qq.com >
2026-03-04 17:23:34 +00:00
Manrique Vargas
28028dff2f
fix(docs): use static rdzv backend in multi-node troubleshooting script ( #34784 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-03-04 17:15:35 +00:00
Dr Alex Mitre
3417ba5648
docs: add README for logits_processor examples ( #35933 )
2026-03-04 17:09:19 +00:00
Yan Ma
58cfe0dc44
Fix phi4-mm and remove cuda binding ( #35964 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-03-05 01:08:05 +08:00
simone-dotolo
e86221deb6
[Doc] Fix GPU Worker count in Process Count Summary ( #36000 )
...
Signed-off-by: simone-dotolo <simonedotolo@libero.it>
Signed-off-by: simone-dotolo <84937474+simone-dotolo@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 17:03:14 +00:00
Netanel Haber
289fc48ab7
Use MMEncoderAttention (=use FlashAttention) instead of torch.sdpa in radio.py ( #35653 )
2026-03-04 08:43:13 -08:00
Christian Pinto
2f2212e6cc
Split generic IO Processor plugins tests from Terratorch specific ones ( #35756 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-03-05 00:01:03 +08:00
Nicolò Lucchesi
18e01a0a10
[Misc] Add --attention-backend auto option ( #35738 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-04 15:12:27 +00:00
sungsoo ha
6cb901093f
[Core] Add All-to-All communication backend for DCP ( #34883 )
...
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com >
Signed-off-by: sungsoo ha <hasungsoo@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 10:01:57 -05:00
Cyrus Leung
ead7bde1ab
[Bugfix] Make kaldi_native_fbank optional ( #35996 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-04 06:47:32 -08:00
Qi Wang
6aa6ad8992
[BugFix] Fix implicit and incorrect assumption on ECConnector is_producer ( #34783 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-03-04 15:01:30 +01:00
Raghavan
c8c3935b70
[Bugfix][Model] Fix FP8 k_scale/v_scale not loaded for Qwen3-MoE ( #35656 )
...
Signed-off-by: raghavan <oneraghavan@gmail.com >
2026-03-04 13:15:38 +00:00
Ronen Schaffer
bb6888b8b1
[Bugfix][CPUOffloadingManager] Prevent eviction of already-stored blocks in LRU/ARC prepare_store() ( #35846 )
...
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com >
2026-03-04 14:25:33 +02:00
Taneem Ibrahim
1aaec59d79
[MISC] fixed tool_parser mypy errors ( #35640 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 12:23:12 +00:00
pougetat
1659b2e058
[Feature] Add basic metrics for /realtime endpoint ( #35500 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com>
Signed-off-by: pougetat <thomas.pougetabadie@gmail.com>
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-04 19:56:32 +08:00
haosdent
d6e04f4c43
[Bugfix] Cap FULL decode cudagraph sizes for Mamba/hybrid models ( #34094 ) ( #34571 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: zjy0516 <riverclouds.zhu@qq.com >
2026-03-04 11:56:22 +01:00
Kunshang Ji
a8f66cbde8
[XPU] bump vllm-xpu-kernels to v0.1.3 ( #35984 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-04 18:23:31 +08:00
Kunshang Ji
16d2ad1d38
[Hardware] Replace torch.cuda.empty_cache with torch.accelerator.empty_cache ( #30681 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-04 09:49:47 +00:00
Chuan (Richard) Li
5dc3538736
[ROCm][Bugfix] Fall back from CK MXFP4 MoE when GEMM dimensions are unsupported ( #35893 )
...
Signed-off-by: Li <chuali@amd.com >
2026-03-04 08:30:54 +00:00
Nathan Price
36bf213181
[Bugfix] Add missing dynamic_arg_dims for Qwen3-ASR torch.compile ( #35869 )
...
Signed-off-by: Nathan Price <nathan@abridge.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-04 08:29:01 +00:00
Joe Runde
6f0dd93801
[Core] Remove busy loop from idle buffer readers ( #28053 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-03-04 07:44:20 +00:00
Andrii Skliar
5d199ac8f2
Support Audio Extraction from MP4 Video for Nemotron Nano VL ( #35539 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Signed-off-by: Andrii Skliar <askliar@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Andrii <askliar@nvidia.com >
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Andrii Skliar <askliar@oci-nrt-cs-001-vscode-01.cm.cluster >
Co-authored-by: Andrii <askliar@nvidia.com >
Co-authored-by: root <root@pool0-03748.cm.cluster >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: root <root@pool0-02416.cm.cluster >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: root <root@pool0-04880.cm.cluster >
2026-03-03 23:20:33 -08:00
Komal Kumar Teru
9e0f44bec4
[cohere][fix][spec-decode]: fix crash when allowed_token_ids is set without penalties ( #35654 )
...
Signed-off-by: kkt-cohere <komal@cohere.com >
2026-03-03 23:20:15 -08:00
lailoo
097eb544e9
[Bugfix] Improve engine ready timeout error message ( #35616 )
...
Signed-off-by: damaozi <1811866786@qq.com >
2026-03-04 05:54:32 +00:00
ShiJie Zhong
7cdba98edf
[BugFix] Support tool_choice=none in the Anthropic API ( #35835 )
...
Signed-off-by: ZhongsJie <zhongsjie@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-03-04 05:24:46 +00:00
Charlie Fu
3c85cd9d74
[Rocm][CI] Fix ROCm LM Eval Large Models (8 Card) ( #35913 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-03-04 04:50:13 +00:00
Andreas Karatzas
edba15045a
[Bugfix] Guard mm_token_type_ids kwarg in get_mrope_input_positions ( #35711 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 04:12:51 +00:00
Cyrus Leung
e379396167
[Refactor] Clean up processor kwargs extraction ( #35872 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-03 19:53:53 -08:00
Isotr0py
6e9f21e8a2
[Chore] Remove debug code in model implementation ( #35883 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 19:50:58 -08:00
AllenDou
c1d963403c
[model] support FireRedASR2 ( #35727 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 19:41:30 -08:00
Shanshan Shen
77e6dcbbfa
[PluggableLayer][MM] Add PluggableLayer for RelPosAttention ( #33753 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2026-03-03 19:41:27 -08:00
William Zhang
70c73df69e
[Bugfix] Fix EVS implementation for Qwen3 VL ( #33607 )
...
Signed-off-by: 2ez4bz <133824995+2ez4bz@users.noreply.github.com >
2026-03-04 02:18:11 +00:00
xjx
9a9d442464
Enable bnb for multiple indices weight ( #35838 )
...
Signed-off-by: xjx <493337577@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-04 01:46:47 +00:00
Andreas Karatzas
f7da9cdffc
[ROCm][CI] Support async weight transfer example with platform-aware determinism ( #35710 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-04 09:44:14 +08:00
Jaewon
f22ff2958c
[Bugfix] Fix coord_socket assertion in DPEngineCoreProc for offline DP mode ( #35916 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
2026-03-04 00:10:11 +00:00
Nick Hill
d15c3b90fc
[Core] Move save_tensorized_model logic to Worker ( #35825 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-03-03 15:31:59 -08:00
zhrrr
97286a20ed
[Model Runner V2] support dp & ep for spec decoding ( #35294 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai >
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: Giancarlo Delfin <gdelfin@inferact.ai >
2026-03-03 15:19:45 -08:00
Amr Mahdi
12b38c0f45
[CI/Build] Allow mounting AWS credentials for sccache S3 auth ( #35912 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-03-03 14:30:47 -08:00
Woosuk Kwon
467886a0c4
[Model Runner V2] Fix inputs_embeds=None bug for MM models ( #35917 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-03 13:47:45 -08:00
bnellnm
a9b8b13e5c
[Bugfix] Fix misnamed parameter in compressed_tensors_moe.py ( #35813 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-03 16:29:57 -05:00
Micah Williamson
e7213003cb
[ROCm][CI] Fix TP size issue for test_gpt_oss ( #35887 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-03 20:57:34 +00:00
Rohan Potdar
3a8eef5869
[ROCm][Bugfix]: Disable AITER Triton ROPE by default ( #35601 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-03-03 13:43:56 -06:00
Robert Shaw
97995f6376
[MoE Refactor] Create MK for TRTLLM Kernels ( #32564 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2026-03-03 10:39:50 -08:00
Robert Shaw
881a6b011b
[CI] Temporarily Disable Llama4 MoE Refactor Test ( #35870 )
...
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-03-03 10:36:15 -08:00
Matthew Bonanni
8e1fd5baf0
[CI] Bump num_speculative_tokens to 3 in nightly DeepSeek tests ( #35882 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-03 09:26:44 -08:00
JasonCohere
ae88468bcc
fix: Ensure invalid audio files return 400 error ( #34715 )
...
Signed-off-by: Jason Ozuzu <jasonozuzu@cohere.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-03-03 08:47:39 -08:00
ojhaanshika
e05cb3b93e
TRTLLM gen-full attn Test Coverage ( #34986 )
...
Signed-off-by: Anshika Ojha <anshikao@nvidia.com >
Co-authored-by: Anshika Ojha <anshikao@gb-nvl-059-compute09.nvidia.com >
2026-03-03 11:35:34 -05:00
Lucas Wilkinson
28ef9ba399
[BugFix] Add support for MTP num_speculative_tokens > 1 with sparse MLA ( #34552 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-03-03 07:21:57 -08:00
TJian
fb7fdc49c4
[ROCm] [CI] Add new fusion test cases that are relevant to vLLM IR Ops ( #34307 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2026-03-03 06:24:21 -08:00
wang.yuqi
ea463978bb
[Frontend][1/n] Improve pooling entrypoints | classify. ( #35604 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-03-03 06:05:36 -08:00
Li, Jiang
440f0e7dc6
[Bugfix] Avoid src/dst as None in irecv/isend_tensor_dict ( #35754 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-03-03 05:56:08 -08:00
wang.yuqi
fd4a90f337
[CI] And PPL test for Qwen3.5. ( #35853 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-03 13:15:51 +00:00
Thomas Parnell
ad9d09e2b8
[Perf] [Hybrid] Copy num_accepted_tokens in non-blocking way when not using prefix caching ( #35442 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-03-03 04:15:43 -08:00
Szymon Reginis
4beebfd146
[CI/Build][Intel] Add new performance benchmarks for Intel Gaudi 3 ( #31025 )
...
Signed-off-by: Szymon Reginis <sreginis@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-03 19:48:24 +08:00
hallerite
b8401cde0e
add regression test ( #35834 )
...
Signed-off-by: hallerite <git@hallerite.com >
2026-03-03 07:32:15 +00:00
TJian
5dfc5abe94
[ROCm] [Release] Change the package from aiter to amd-aiter ( #35198 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-03-02 23:13:39 -08:00
lin-shh
8fa68a8ce4
Fix TYPE_CHECKING stub defaults in envs.py to match actual runtime defaults ( #35645 )
2026-03-02 21:59:43 -08:00
lin-shh
35a6f0bfe2
[Misc] Fix typos in comments: explict→explicit, paramaters→parameters ( #35648 )
2026-03-02 21:59:14 -08:00
Taneem Ibrahim
3a6cbf16e2
[MISC] Removed unused function find_all_indices() from tool_parsers/utils.py ( #35683 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-03 13:58:42 +08:00
Lucas Wilkinson
f44d1ddc8c
[BugFix] Fix cmake based incremental install (wrong vllm install dir) ( #35773 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-03-02 21:58:16 -08:00
Cyrus Leung
48a54c1e0d
[CI/Build] Trigger processor tests on registry update ( #35824 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-03 13:55:57 +08:00
Micah Williamson
8b9e8b7454
[ROCm][CI] Fix Assertion Logic For test_gpt_oss ( #35806 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-03-03 05:08:04 +00:00
Wentao Ye
c21d0039ec
[Refactor] Fix maxsim cuda platform and add cli to control it ( #35427 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-03-03 12:48:31 +08:00
Isotr0py
7d8bbe6f42
[CI/Build] Automatically patch video metadata for multimodal processor test ( #35822 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-03 04:27:45 +00:00
aykoppol
25e02647c2
[Core] Add optional flags to check for repetitive token patterns in engine output ( #35451 )
...
Signed-off-by: aykoppol <aykoppol+git@gmail.com >
2026-03-03 12:23:25 +08:00
Woosuk Kwon
a0a5178ab4
[Model Runner V2] Use ModelState.prepare_attn() for cuda graph capture [5/N] ( #35774 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-02 20:06:27 -08:00
Isotr0py
8ea8ba275e
[V0 deprecation] Remove Swin model ( #35821 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-02 20:03:41 -08:00
Woosuk Kwon
4f85bae9d6
[Docs][Model Runner V2] Add Design Docs ( #35819 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-02 19:58:14 -08:00
Andy Lo
0a7165fd71
[ModelRunnerV2] Rename sampler functions and variables for clarity ( #35459 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-03-02 19:48:56 -08:00
Robert Shaw
6521ccf286
[CI] Temporarily Disable Nightly Failures ( #35770 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-03 01:49:13 +00:00
Martin Vit
8ebd872f50
[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode ( #35615 )
...
Signed-off-by: Martin Vit <martin@voipmonitor.org >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-03-03 09:40:37 +08:00
zhrrr
168ee03e1c
[Model Runner V2][Perf] align dummy_run tokens to uniform decode for dp cudagraph ( #35376 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-03-02 17:10:47 -08:00
liuzhenwei
9dd656f0ea
[XPU][NIXL] Add GPUDirect RDMA support for XPU ( #35270 )
...
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-03 08:42:49 +08:00
Jakub Zakrzewski
c8b678e53e
[Model] Add support for nvidia/llama-nemotron-rerank-vl-1b-v2 ( #35735 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-03-03 08:32:14 +08:00
Andreas Karatzas
18c29c746b
[ROCm][CI] Fix backslash-continuation in pytest marker re-quoting and treat exit code 5 as success ( #35798 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 16:07:51 -08:00
Hanjie Qiu
96fc09503a
[All Reduce] Change default backend of Flashinfer All Reduce to trtllm ( #35793 )
...
Signed-off-by: hjjq <hanjieq@nvidia.com >
2026-03-02 18:57:38 -05:00
Roger Wang
1b82b433fc
[Bugfix] Fix MM processor test for Qwen3.5 ( #35797 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-03-02 23:05:08 +00:00
Robert Shaw
9319044ee9
[MoE][Perf] Wrap DSV3 QKVAProj GEMM in custom op for torch.compile ( #35751 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-03-02 23:03:49 +00:00
Boyuan Feng
c42dc402c1
clean unused cudagraph_batch_sizes ( #35552 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2026-03-02 22:00:16 +00:00
Ye (Charlotte) Qi
fa6a6be519
[Bugfix] Fix missing sequence_lengths in qwen3_omni_moe_thinker ( #35741 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2026-03-02 21:11:56 +00:00
Aaron Hao
cad21918e3
[BUG] Fix rlhf_async example ( #35788 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-03-02 20:36:40 +00:00
Jeffrey Wang
53700bf49b
[ci] Add Ray compatibility check informational CI job ( #34672 )
...
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com >
2026-03-02 12:06:16 -08:00
Yashwant Bezawada
a13d8c03c9
[KVConnector] Auto-downgrade to PIECEWISE cudagraph mode for layerwise async ops ( #31057 )
...
Signed-off-by: Yashwant Bezawada <yashwant_b@me.com >
2026-03-02 15:04:47 -05:00
Fynn Schmitt-Ulms
9433acb8df
[Spec Decode] Add hidden states extraction system ( #33736 )
...
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com >
2026-03-02 14:29:09 -05:00
Richard Zou
d1a6e96d9e
[torch.compile] Improve cold and warm start compile tests ( #35709 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-02 19:27:06 +00:00
CSWYF3634076
2a9e3347e9
[BugFix][Model]Fix the garbled code in Ernie4.5-VL caused by fast_moe_cold_start ( #35587 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2026-03-02 18:56:33 +00:00
Isotr0py
cc0d565f40
[CI/Build] Enable Qwen3.5 tests on CI ( #35763 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-03-02 17:43:53 +00:00
Patryk Wolsza
358e4d5ba7
[CI][HPU] Pin vllm commit compatible with vllm-gaudi - HPU tests ( #35307 )
...
Signed-off-by: PatrykWo <patryk.wolsza@intel.com >
2026-03-02 17:02:26 +00:00
Cyrus Leung
792a74b973
[Doc] Improve UX of --enable-log-requests ( #35723 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-03-02 08:24:09 -08:00
Turner Jabbour
4034c3d32e
[Core] Move test utility to test file ( #35672 )
...
Signed-off-by: Turner Jabbour <doubleujabbour@gmail.com >
2026-03-02 10:56:03 -05:00
Martin Hickey
7560d674c9
[CI] Fix mypy for vllm/device allocator ( #35518 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 15:53:18 +00:00
ElizaWszola
d9c7730877
[Performance] Extract kv update ops from MLA attention backends ( #34627 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Di Wu <dw2761@nyu.edu >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-03-02 10:43:19 -05:00
Runkai Tao
ada4f4fadd
[Fix Bug]num_active_loras always equals to zero ( #34119 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-02 23:17:46 +08:00
Harry Mellor
7e9149d9a9
[Docs] Add breadcrumbs for better UX ( #35749 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 14:31:54 +00:00
Martin Hickey
87c98b0236
[MyPy][BugFix] Check profiler is assigned before calling start() on it ( #35505 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-03-02 13:23:42 +00:00
Tyler Michael Smith
de7dd634b9
Fix unresolved-import errors when using Astral's ty by removing src.root ( #35681 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-03-02 10:26:47 +00:00
Chauncey
9a87b0578f
[Feat] Supports Anthropic Messages count_tokens API ( #35588 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-03-02 09:48:54 +00:00
wangxiyuan
510bc9e1df
[Misc] Cleanup useless current_platform import ( #35715 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2026-03-02 09:36:54 +00:00
Charles Ashby
cbd361fd46
[CPU][Distributed] Fix Enable _CPUSHMDistributed only when TP/PP ranks share the same SHM group name ( #34169 )
...
Signed-off-by: Charles Ashby <charlesa.l@hotmail.com >
2026-03-02 09:34:35 +00:00
Nicolò Lucchesi
c212202d93
[Misc] Bound NIXL upper bound version ( #35495 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-03-02 16:57:07 +08:00
Andreas Karatzas
ec27b36b4b
[CI] Defining extended V1 e2e + engine tests ( #35580 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 08:10:54 +00:00
Charlie Fu
3fd1d4ec2c
[Rocm][CI] Fix LM Eval Large Models (H100) test group ( #34750 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-03-02 07:43:38 +00:00
EdalatiAli
cb21972a97
[Kernel] Integrate SM100 MXFP8 blockscaled grouped MM and quant kernels ( #34448 )
...
Signed-off-by: EdalatiAli <aliedalati@cohere.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-03-01 23:31:19 -08:00
Andreas Karatzas
c34963f138
[ROCm][CI] Disable skinny GEMMs in language model standard tests to fix non-determinism ( #35152 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-03-02 15:04:18 +08:00
Hongxia Yang
f26650d649
[ROCm] add amd-quark package in requirements for rocm to use quantized models ( #35658 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-03-02 06:02:43 +00:00
Kunshang Ji
92f5d0f070
[XPU] fix mxfp4 activation type ( #35691 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-03-02 11:48:39 +08:00
Jesse Cai
a60985b07e
Fix deprecated v1 config tests ( #35327 )
...
Signed-off-by: Jesse Cai <jessecai@fb.com >
2026-03-01 20:32:03 -05:00
Lucas Wilkinson
8b5014d3dd
[Attention] FA4 integration ( #32974 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-03-01 23:44:57 +00:00
zhanqiuhu
57a96e26c9
Revert "[Bugfix] Disable TRTLLM attention with KV transfer enabled ( #33192 )" ( #34832 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-03-01 22:32:37 +00:00
Richard Zou
e82fbeec7b
[torch.compile] Undo the fast_moe_cold_start hack in torch>=2.11 ( #35475 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-03-01 21:44:22 +00:00
haosdent
6290470843
[Bugfix] Fix dtype mismatch in RMSNormGated.forward_native() during torch.compile ( #35256 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-03-01 15:14:46 -05:00
Woosuk Kwon
72f4d16262
[Model Runner V2] Use block table apis for capture inputs ( #35671 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-01 10:31:13 -08:00
Seungho Yoon
5a435507d8
fix(mxfp4): return is_monolithic=False when LoRA is enabled for Triton backend ( #35382 )
...
Signed-off-by: Seungho Yoon <yoonsnowdev@gmail.com >
2026-03-01 09:59:30 -05:00
Taneem Ibrahim
59d7af9c6c
[MISC] Fixing a null reference by removing parallel_utils from mypy EXCLUDE ( #35630 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-03-01 09:26:44 -05:00
Asaf Gardin
bbf81f9a92
[Mamba1] - Kernel Level Chunk Alignment for Prefix Caching ( #34798 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-03-01 20:40:23 +08:00
Woosuk Kwon
da543d1abe
[Model Runner V2] Minor refactoring for EncoderRunner ( #35628 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-03-01 00:15:39 -08:00
Ryan Rock
87d319c52f
[AMD][CI] Support Triton attention with ExampleConnector ( #34931 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-03-01 09:58:07 +02:00
lin-shh
a9ec392c86
Fix typo: implictly -> implicitly in isaac.py docstring ( #35646 )
2026-02-28 23:34:37 -08:00
lailoo
afd089f231
[Bugfix][Model] Fix Qwen3.5/Qwen3Next ignoring --dtype flag on older GPUs ( #35617 )
2026-03-01 03:27:37 +00:00
gnovack
3ecd0bf9fc
Add TMA support to fused_moe_lora kernel ( #32195 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-03-01 10:55:25 +08:00
Woosuk Kwon
e3eb146f7a
[Model Runner V2] Add ModelStateInterface [4/N] ( #35621 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-28 13:19:45 -08:00
Martin Vit
95a395dbec
[Bugfix] Fix Anthropic API base64 image handling in Messages endpoint ( #35557 )
...
Signed-off-by: Martin Vit <martin@voipmonitor.org >
2026-02-28 20:57:08 +00:00
Isotr0py
e94b263bd6
[Chore] Cleanup BNB utilization dead code ( #35620 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-28 19:22:41 +00:00
Wentao Ye
e113a30113
[Deprecation] Deprecate code in 0.17 as scheduled ( #35441 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-28 17:32:37 +00:00
Cyrus Leung
1dafb29f91
[Benchmark] Avoid unnecessary video download in MMVU ( #35618 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-28 09:07:02 -08:00
emricksini-h
49b9ae32e9
[Fix] Avoid sending image input to other PP ranks ( #35405 )
...
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-03-01 00:14:29 +08:00
cwazai
63d7972f13
Fix Qwen3_5MTP packed_modules_mapping for gate_up_proj ( #35581 )
2026-02-28 14:50:55 +00:00
flutist
c68e69f144
custom dataset img support base64 ( #35280 )
...
Signed-off-by: xjx <493337577@qq.com >
2026-02-28 11:49:52 +00:00
Chauncey
7e08c22b8c
[Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function ( #35271 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 10:12:00 +00:00
Augusto Yao
8e75d88554
add io_process_plugin for sparse embedding ( #34214 )
...
Signed-off-by: augusto.yjh <augusto.yjh@antgroup.com >
Signed-off-by: Augusto Yao <augusto.yjh@antgroup.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-28 09:16:37 +00:00
Mario Hong
0892d1ab1f
[Feature]Supports Anthropic Thinking Block ( #33671 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: zetaohong <i-hongzetao@stepfun.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-28 09:02:33 +00:00
Hashem Hashemi
7600642eae
Add padding support to wvSplitK solution for skinny GEMMs ( #33762 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-28 09:02:05 +00:00
Andreas Karatzas
1e69c04887
[ROCm][CI] Parametrize vision score tests across attention backends with per-backend tolerances ( #35571 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 08:59:26 +00:00
Cyrus Leung
4292e3b807
[Benchmark] Improve UX of sweep scripts ( #35600 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-28 00:36:02 -08:00
Cyrus Leung
24d6ea8afd
[Benchmark] Rename SLA Finder to Workload Explorer ( #35586 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-27 23:31:55 -08:00
Chauncey
57c86c0741
[Misc] Change logging level from info to debug for tool parser import ( #35575 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 14:51:35 +08:00
Chauncey
06254d4cbb
[CI] add trainer_send_weights for MockWeightTransferEngine ( #35589 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2026-02-28 06:47:43 +00:00
Andreas Karatzas
f5d1281c9d
[ROCm][CI] Expose tests to AMD production CI and fix amdsmi heap corruption ( #35071 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:57:31 +08:00
Andreas Karatzas
94029ffaf0
[ROCm] Derive device capability from GCN arch string without CUDA init ( #35069 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:55:28 +08:00
Andreas Karatzas
88e8525f2e
[ROCm][CI] Adding infiniband mappings for moriio tests ( #35170 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-28 13:53:28 +08:00
Ilya Markov
b2d8b422b2
[EPLB] Enforce sync eplb for NCCL-based all2all backend ( #35212 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-28 05:47:12 +00:00
Umut Polat
1d5ab5d603
[Bugfix] Move chat completion response_format validation to Pydantic model_validator ( #35510 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-27 21:26:19 -08:00
Huy Do
7b346ba8ed
[Bugfix] Propagate compilation_time from workers to main process for TP>1 ( #35503 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2026-02-28 05:03:22 +00:00
Itay Alroy
dea268336f
[1/N] Elastic EP Milestone 2 ( #34861 )
...
Signed-off-by: Yongji Wu <wuyongji317@gmail.com >
Signed-off-by: Itay Alroy <ialroy@nvidia.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: Ron Tourgeman <rtourgeman@nvidia.com >
Co-authored-by: Yongji Wu <wuyongji317@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Ron Tourgeman <rtourgeman@nvidia.com >
2026-02-28 04:46:42 +00:00
Ma Jian
90805ff464
[CI/Build] CPU release supports both of AVX2 and AVX512 ( #35466 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: jiang1.li <jiang1.li@intel.com >
2026-02-28 04:35:21 +00:00
Matthew Bonanni
2562e0271e
[MTP] Validate that MTP weights are actually loaded ( #35548 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-28 12:27:40 +08:00
Cyrus Leung
fd68cd132b
[Bugfix] Fixes for SLA finder ( #35537 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-27 20:20:55 -08:00
Micah Williamson
0edf101d2b
[ROCm] Add stablelm Head Size 80 To Supported Head Sizes For ROCM_ATTN ( #35527 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-28 12:16:34 +08:00
Douglas Lehr
d5b6f3ba36
[ROCm][Quantization] Add Composable Kernel (CK) backend support for M… ( #34301 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
2026-02-28 03:37:01 +00:00
Woosuk Kwon
1a014a0a93
[Model Runner V2] Move MM encoder to Model States [3/N] ( #35564 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-27 18:32:38 -08:00
Woosuk Kwon
86ac7bcf84
[Model Runner V2] Support pooling models ( #35120 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-27 18:03:01 -08:00
Umut Polat
405f28d38d
[Misc] Clean up ResponsesRequest model validators ( #35531 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-28 01:19:21 +00:00
youkaichao
5323672bc2
[misc] cleanup one level of error stack when nixl fails to initialize ( #35517 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2026-02-28 08:42:37 +08:00
Roberto L. Castro
a201ad72d8
[Refactor][Kernel] Add global helper to deduplicate vectorized memory ops ( #35105 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
2026-02-27 16:28:17 -08:00
Rohan Potdar
e3691988d0
[ROCm]: fix aiter rope functionalization ( #35533 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-27 22:42:30 +00:00
Gregory Shtrasberg
9fa6c68fa6
[ROCm] Enabling encoder and encoder-decoder on ROCm and AITER unified backends ( #35334 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-27 21:32:55 +00:00
Aaron Hao
2ce6f3cf67
[Feat][RL][2/2] Native Weight Syncing API: IPC ( #34171 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-27 13:45:21 -07:00
Jakub Zakrzewski
1f3dbd95fd
[Bugfix][Model] Fix gpt-oss batch invariance ( #35404 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-02-27 20:41:24 +00:00
Lucas Wilkinson
1d532f9d8f
[DP] Only use DP padding when cudagraphs are actually used ( #34102 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-27 15:14:31 -05:00
Lucas Kabela
234a65b781
[Bugfix] Add monkeypatch to prevent race condition from writing ( #35420 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2026-02-27 14:51:36 -05:00
SteadfastAsArt
2decec9856
[Transformers backend] Ignore MTP weights when num_nextn_predict_layers=0 ( #34888 )
...
Signed-off-by: SteadfastAsArt <695488173@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-27 19:39:23 +00:00
Zhengxu Chen
29b35477b0
[compile] Fix caching error over pytree slice node. ( #35308 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-27 19:34:16 +00:00
Nick Hill
b1d9f5372d
[Model Runner V2] Warmup kernels ( #35172 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 10:43:30 -08:00
Raushan Turganbay
fd6de37fca
[BugFix] Fix 3D rope in transformers backend ( #35097 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-27 18:34:49 +00:00
Netanel Haber
c8aca0c9e1
Support parakeet as audio encoder for nemotron-nano-vl ( #35100 )
...
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 11:07:38 -07:00
Martin Hickey
b602e4f299
[Doc] Fix link to Llama chat template for usability ( #35525 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-27 17:51:09 +00:00
Huamin Li
157722da75
[perf] Use pinned memory for async H2D transfer in do_mamba_copy_block ( #35480 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2026-02-28 01:50:37 +08:00
Nick Hill
1d897ff04f
[Misc] Fill in some v1 CODEOWNERS gaps ( #35524 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 09:34:37 -08:00
fort726
905d76b51d
[Model] Add huggingface skt/A.X-K1 model ( #32407 )
...
Signed-off-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com >
Signed-off-by: fort726 <38447663+fort726@users.noreply.github.com >
Co-authored-by: Sungwan(Alex) Kim <sw0726.kim@sktelecom.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-27 09:26:02 -08:00
Yanan Cao
9098ce690c
[Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching ( #34390 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-27 09:21:35 -08:00
Nick Hill
876312f0b5
[Core] Fix gpu_worker.py pre-commit errors ( #35312 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-27 07:54:24 -08:00
Boyuan Feng
5de98abc12
Add @BoyuanFeng to CODEOWNERS ( #35317 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2026-02-27 15:53:47 +00:00
Koushik Dutta
9251ed5c4f
[Bugfix] Handle case when kimi ends reasoning with a tool call ( #33646 )
...
Signed-off-by: Koushik Dutta <koushd@gmail.com >
Co-authored-by: mondaylord <20212010046@fudan.edu.cn >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 14:58:28 +00:00
Yueqian Lin
e8249378e4
[Bugfix] Fix check_interleaved_audio_video false positive for batched non-interleaved requests ( #35487 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-27 06:48:25 -08:00
haosdent
6d4f9d3ad5
[Bugfix] Fix DCP + FA3 crash due to missing num_splits in _forward_with_dcp ( #35082 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-27 22:27:06 +08:00
Harry Mellor
fbe3f0120a
Revert "Add GlmOcrConfig for GLM-OCR model type recognition" ( #35512 )
2026-02-27 06:13:27 -08:00
Jason Li
66c1751d13
[compile] Cleanup: Remove unnecessary +rms_norm forcing for sequence parallelism ( #35410 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
2026-02-27 08:36:37 -05:00
Tib
6467b635b6
[Bugfix] Add missing activation attr to RMSNormGated ( #35423 )
...
Signed-off-by: tibG <naps@qubes.milou >
Co-authored-by: tibG <naps@qubes.milou >
2026-02-27 12:53:35 +00:00
Max Hu
9c3fe9936b
Flashinfer cuDNN backend for Qwen3 VL ViT attention ( #34580 )
...
Signed-off-by: Max Hu <maxhu@nvidia.com >
Signed-off-by: Max Hu <hyoung2991@gmail.com >
Co-authored-by: Max Hu <maxhu@nvidia.com >
Co-authored-by: Shang Wang <shangw@nvidia.com >
2026-02-27 20:20:23 +08:00
Umut Polat
b66a74649e
[Bugfix] Replace assert with ValueError for response_format validation in completions endpoint ( #35456 )
...
Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com >
2026-02-27 08:01:06 +00:00
Wang Xingran
07bdabef03
[Bugfix] Use 'sum' reduction instead of 'avg' in Async TP reduce-scatter ( #33088 )
...
Signed-off-by: Xingran Wang <wangxingran123456@outlook.com >
Signed-off-by: Hongjian Zhang <hirokenovo@gmail.com >
Co-authored-by: Hongjian Zhang <hirokenovo@gmail.com >
2026-02-27 07:06:08 +00:00
Chengyi Nie
a572baff5e
[Model Performance] Add Qwen3MoE tuned MoE configs for H200 ( #35457 )
...
Signed-off-by: Chengyi Nie <cnie@roblox.com >
Co-authored-by: Chengyi Nie <cnie@roblox.com >
2026-02-27 13:51:14 +08:00
zofia
516cf26698
[Bug] correct out dtype of rms_norm_gated native path ( #35369 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-27 05:19:51 +00:00
Jiangyun Zhu
487e5c51f7
[Bugfix] disable allreduce_rms_fusion by default when pp size > 1 ( #35424 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-27 04:18:52 +00:00
Daniel Huang
1a8c71674e
[BugFix] Repo utils debug print patch ( #35434 )
...
Signed-off-by: Daniel Huang <daniel1.huang@intel.com >
2026-02-27 03:50:56 +00:00
Wentao Ye
062b789632
[Bug] Fix outdated links in source code ( #35314 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-27 03:50:46 +00:00
gnovack
a532c83849
use 'max_active_experts' for moe lora input size ( #33197 )
...
Signed-off-by: gnovack <gnovack@amazon.com >
2026-02-27 03:50:43 +00:00
Jee Jee Li
1e5ad9b74f
[Bugfix] Fix Qwen3NextForCausalLM packed_modules_mapping ( #35413 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-26 19:46:30 -08:00
Nicolò Lucchesi
cabdaa7619
[Misc] Move GPUModelRunner.prepare_kernel_block_sizes to utils ( #35400 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-27 11:42:51 +08:00
Chenyaaang
06be53563b
[Core]Extract is_last_rank in Ray for tpu to override ( #33012 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2026-02-27 03:18:52 +00:00
Angela Yi
c29ee9c326
[compile] Invalidate cache for cpu flags ( #35119 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2026-02-27 02:54:11 +00:00
daniel-salib
d43048ce05
[Bugfix] Emit reasoning_part events in simple streaming path for Resp… ( #35184 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2026-02-27 09:49:06 +08:00
Michael Goin
4fec53cfcb
[CI] Actually run tests/kernels/quantization/test_block_fp8.py in CI ( #34274 )
2026-02-26 17:58:03 -07:00
roikoren755
38c498b8e3
[Performance] Cublas Bf16 Gate with Fp32 Output ( #35121 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-26 16:51:28 -08:00
Andrii Skliar
56a6371706
[Update] Use FlashInfer fast_decode_plan directly instead of replication ( #34687 )
...
Signed-off-by: Andrii <askliar@nvidia.com >
Co-authored-by: Andrii <askliar@nvidia.com >
2026-02-26 16:31:43 -08:00
Pavani Majety
6283021142
[Bugfix] Fix KV Scale loading for MLA Models ( #35430 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-26 23:38:19 +00:00
Aleksandr Malyshev
01923eec70
[ROCm][Quantization] GPT OSS Upstream MoE wmxfp4_afp8 with static scales ( #30357 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2026-02-26 16:50:16 -06:00
pkousha
31fb6f43da
[Kernel][perf] optimize NCCL symm_mem vs custom_AR selection thresholds ( #33839 )
...
Signed-off-by: pkousha <43781676+pkousha@users.noreply.github.com >
Co-authored-by: Pouya Kousha <pkousha@login-eos01.eos.clusters.nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-26 14:35:58 -08:00
Tyler Michael Smith
eb19955c37
[WideEP] Remove pplx all2all backend ( #33724 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-26 14:30:10 -08:00
Lucia Fang
0f2f24c8b2
[Bugfix] Fix MessageQueue connect_ip for cross-node data parallelism ( #35429 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-26 22:08:16 +00:00
sychen52
d0105b84f0
add mixed precision support for modelopt ( #35047 )
...
Signed-off-by: Shiyang Chen <shiychen@nvidia.com >
2026-02-26 21:56:24 +00:00
danielafrimi
832a780f3a
Nemotron: use per-layer config in NemotronHMLPDecoderLayer for heterogeneous models ( #35396 )
...
Signed-off-by: dafrimi <dafrimi@nvidia.com >
2026-02-26 16:55:19 -05:00
ElizaWszola
98217b09f9
[Performance] Extract KV cache update op from flashinfer forward ( #35422 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2026-02-26 21:29:01 +00:00
不做了睡大觉
967572dd5f
fix(reasoning): Qwen3ReasoningParser returns truncated output as reasoning ( #35230 )
...
Signed-off-by: stakeswky <stakeswky@users.noreply.github.com >
Co-authored-by: stakeswky <stakeswky@users.noreply.github.com >
2026-02-26 20:30:45 +00:00
Woosuk Kwon
3d66502e1b
[Model Runner V2] Prepare attn metadata in ModelState [2/N] ( #35383 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-26 11:47:02 -08:00
Woosuk Kwon
c66aa48e99
[Model Runner V2] Add model states [1/N] ( #35350 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-26 11:20:35 -08:00
Nick Hill
b6d5a17298
[Model Runner V2] Fix error-handling ( #35063 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-26 11:00:19 -08:00
Lucas Wilkinson
5e58bdc711
[Bugfix] Remove erroneous lower bound on LoRA vocab size constraint ( #35354 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-26 18:44:50 +00:00
Runkai Tao
a1f53addb1
[BugFix] Align fused MoE-LoRA kernel config with actual weight shapes ( #34396 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-26 18:03:10 +00:00
Wentao Ye
05970c772c
[Refactor] Remove dead code for attention benchmark script ( #35418 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 09:53:46 -08:00
Yiliu Dong
d940607629
[Core] Support min_tokens with speculative decoding ( #32642 )
...
Signed-off-by: qianlihuang <yiliu.dong@qq.com >
Co-authored-by: qianlihuang <yiliu.dong@qq.com >
2026-02-26 12:31:28 -05:00
Wentao Ye
99c7892c5b
[Perf] Optimize maxsim scores computation for pooling models, 13.9% E2E throughput improvement ( #35330 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 17:14:54 +00:00
hujia177
ec8f943db1
Add GlmOcrConfig for GLM-OCR model type recognition ( #34982 )
2026-02-26 17:04:42 +00:00
Or Ozeri
f2ad952f40
[BugFix][kv_offload]: Fix kernel block size detection ( #35125 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2026-02-26 16:29:34 +00:00
Sage Moore
9e2cabdf9c
[ROCm] Update the torch version in rocm_build.txt to use the official 2.10 release ( #34387 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2026-02-26 16:28:45 +00:00
Douglas Lehr
ec8ab9d254
[ROCm] Add dynamic mxfp4 quantization for DeepSeek V2 projection layers ( #34157 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
2026-02-26 10:00:49 -06:00
Wentao Ye
05972ea7e5
[Refactor] Remove dead or duplicate func utils or variables ( #35318 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-26 10:57:56 -05:00
Jakub Zakrzewski
111d869069
[Model] Add nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding model ( #35297 )
...
Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com >
2026-02-26 14:17:17 +00:00
stingoChen
7fea7250a4
[Bug] Fix missing <think> tag after tool call in MiniMax 2.1 ( #35352 )
...
Signed-off-by: 冬马 <chenxinke@cai-inc.com >
Co-authored-by: 冬马 <chenxinke@cai-inc.com >
2026-02-26 22:11:07 +08:00
Cyrus Leung
845ee348ef
[Misc] Standardize handling of mm_processor_kwargs.size ( #35284 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-26 13:05:46 +00:00
Asaf Gardin
ec13e549d3
[Bugfix] Fix uint32 overflow in Mamba selective scan state pointer arithmetic ( #35275 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-26 12:22:06 +00:00
Li-Yongwen
c6ca51598a
[Bugfix] fix device_name for routing replay ( #34336 )
...
Signed-off-by: liyongwen <1310439159@qq.com >
2026-02-26 12:18:38 +00:00
Yueqian Lin
c0615a296d
[Bugfix] Fix Qwen2.5-Omni and Qwen3-Omni mixed-modality embed regression ( #35368 )
...
Signed-off-by: linyueqian <linyueqian@outlook.com >
2026-02-26 11:58:23 +00:00
Harry Mellor
01914445b0
Remove bc-lint ( #35274 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-26 03:01:01 -08:00
Kunshang Ji
5281713e11
[XPU] use fixed UMD version in dockerfile.xpu ( #35392 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 18:54:55 +08:00
HZY
32693db8ce
[Bugfix] [Qwen3.5]Fix Qwen3.5 FP8 quantization: tuple shard_id weight loading ( #35289 )
...
Signed-off-by: daowu.hzy <daowu.hzy@alibaba-inc.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 18:26:15 +08:00
Akash kaothalkar
e03ddcfbd4
[Hardware][Powerpc]Enable prefix caching and chunked prefill for ppc64le ( #35081 )
...
Signed-off-by: Akash kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash kaothalkar <akash.kaothalkar@ibm.com >
2026-02-26 10:21:24 +00:00
Sophie du Couédic
02acd16861
[Benchmarks] Plot benchmark timeline and requests statistics ( #35220 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-26 02:17:43 -08:00
Jiangyun Zhu
ab87f85231
[Model] Ring 2.5 ( #35102 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-26 02:17:11 -08:00
Krish Gupta
3827c8c55a
[Test] Add tests for n parameter in chat completions API ( #35283 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-26 09:14:07 +00:00
Kevin McKay
ade81f17fe
[Bugfix][Hardware][AMD] Gate FP4 ops on gfx950 to prevent MI300X crash ( #35250 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
2026-02-26 16:11:07 +08:00
Gregory Shtrasberg
6042e66cd5
[ROCm] Add extra step in config initialization to populate custom ops before compilation config init ( #34848 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-26 16:05:40 +08:00
Chaojun Zhang
9f9a675b23
[XPU][8/N] Fix kernel bugs in XPU LoRA and MOE LORA ( #34115 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 15:46:44 +08:00
Ofir Zafrir
a07c4c5939
[BugFix][XPU] Fix speculative decoding on Intel XPU due to bug with IGC_ForceOCLSIMDWidth=16 ( #35298 )
...
Signed-off-by: Ofir Zafrir <ofir.zafrir@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-26 07:15:16 +00:00
Cyrus Leung
d3a51da92a
[Benchmark] Simplify SLA scan ( #35306 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-25 22:35:41 -08:00
Flora Feng
186ea22efe
[Misc][Harmony] Move Responses API only harmony utils to responses/harmony.py ( #35339 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-26 14:35:16 +08:00
Daniele
4a9c07a0a2
[BugFix] anthropic/serving_messages: fix tool call arguments streaming ( #34887 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-26 05:39:48 +00:00
Jason Li
9d37941017
[torch.compile] Sequence Parallelism threshold compile ranges ( #28672 )
...
Signed-off-by: jasonlizhengjian <jasonlizhengjian@gmail.com >
Signed-off-by: Jason Li <jasonlizhengjian@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-26 05:00:12 +00:00
Fadi Arafeh
4171ff6dd9
[CPU][Feat] Enable KleidiAI INT8_W4A8 for all input dtypes ( #34890 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-26 05:00:10 +00:00
Woosuk Kwon
13025e71e8
[Model Runner V2] Add coding style guide ( #35325 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-25 20:42:40 -08:00
Hanjie Qiu
71dfce6aa6
[Kernel] Refactor FlashInfer allreduce for mnnvl backend ( #34109 )
...
Signed-off-by: hjjq <50634613+hjjq@users.noreply.github.com >
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com >
2026-02-26 03:17:20 +00:00
hujiaxin0
2aa4140402
openpangu-vl support video input ( #34134 )
...
Signed-off-by: hujiaxin <524446785@qq.com >
Signed-off-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Emilie1001 <79921183+Emilie1001@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-26 03:08:09 +00:00
Roberto L. Castro
86c3b5a808
[BugFix] Fix fp4 quant kernel on CUDA 12.8 ( #35210 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
2026-02-25 18:32:50 -08:00
Seungmin Kim
160424a937
[Bugfix] Fix CUDA compatibility path setting for both datacenter and consumer NVIDIA GPUs ( #33992 )
...
Signed-off-by: Seungmin Kim <8457324+ehfd@users.noreply.github.com >
Signed-off-by: Andrew Mello <19512127+88plug@users.noreply.github.com >
Co-authored-by: 88plug <19512127+88plug@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 18:15:51 -08:00
Lucas Wilkinson
9511a3f8ee
[Bugfix] Fix AttributeError in SMControlContextManager ( #35338 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-25 18:01:10 -08:00
Michael Goin
de527e1cec
[UX] Add --moe-backend arg for explicit kernel selection ( #33807 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-25 17:44:44 -08:00
Yongye Zhu
1976356ee6
[MoE Refactor] MXFP4 Cutlass Experts to MK ( #34542 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2026-02-25 17:32:39 -08:00
Michael Goin
cbf8f7028c
[UX] Add --performance-mode {balanced,interactivity,throughput} ( #34936 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-25 17:28:31 -08:00
Ming Yang
6831650c40
[offloader] v2: Hide weight onloading latency via prefetching ( #29941 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 17:20:59 -08:00
Andreas Karatzas
ed42507f6d
[ROCm][CI] Amending deletion of AMD mirror ( #35322 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:17:56 -08:00
Andreas Karatzas
9571e99945
[ROCm][CI] Extending attention backend coverage for Eagle spec decode tests ( #35265 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 14:16:18 -08:00
Elizabeth Thomas
c97234c08b
fix(mxfp4): Disable monolithic path for TRITON backend with EP ( #34270 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 13:33:42 -08:00
rasmith
b188bab441
[CI][AMD][BugFix] Add torch.cuda.set_device to test_punica_ops so punica kernels execute on same device as tensor ( #34985 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-25 19:18:00 +00:00
Lucas Wilkinson
15d76f74e2
Revert "[Misc] Enable weights loading tracking for quantized models" ( #35309 )
2026-02-25 09:20:15 -08:00
Andreas Karatzas
8fd6975479
[ROCm][CI] Disable skinny GEMMs in multimodal tests to fix non-deterministic results ( #35049 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 16:48:37 +00:00
pushkar
5d18bf8b32
[Bugfix] Fix Harmony preamble visibility in Responses API ( #32114 )
...
Signed-off-by: Pushkar Patel <git@thepushkarp.com >
Signed-off-by: pupa <pupa@users.noreply.github.com >
2026-02-25 08:08:16 -08:00
haosdent
0788ff0a15
[Bugfix] Gracefully disable AllReduceFusionPass on GPUs without multicast support ( #35085 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-25 07:31:45 -08:00
Chendi.Xue
d72b0be33c
[XPU]Fix for Qwen-OMNI crash ( #35249 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2026-02-25 07:31:07 -08:00
Bhoomit
42489e43c2
[Misc][LoRA] Increase max vocab size limit to 258048 in logits processor ( #34773 )
...
Signed-off-by: Bhoomit Vasani <vbhoomit@amazon.com >
2026-02-25 23:30:55 +08:00
Mario Hong
af5e6afa0a
[Bugfix] Fix step3p5 reasoning with interleaved thinking ( #34211 )
...
Signed-off-by: mariohong <mariohong128@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-25 15:13:01 +00:00
Benjamin Chislett
ee59a7c615
[Tests] Add GSM8k check to SpecDec E2E tests ( #34772 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-25 07:51:14 -05:00
Joao Gante
709eadbb0b
Doc link typo ( #35281 )
...
Signed-off-by: Joao Gante <joaofranciscocardosogante@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-25 03:00:31 -08:00
Harry Mellor
90fc7f9109
Fix custom processors that use deleted behaviour for Transformers v5 ( #35107 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 02:36:21 -08:00
Yanwen Lin
675ec59aa9
[Bugfix][CPU] Fix basic unit tests failing in CPU platforms ( #34677 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:36:15 +00:00
Yanwen Lin
80e60a6133
[Doc] Suggest "--managed-python" flag when installing python using uv ( #33069 )
...
Signed-off-by: Yanwen Lin <lyw1124278064@gmail.com >
2026-02-25 08:19:43 +00:00
jonoillar
26e722f906
[DOC][BugFix] Specify build dependency installation ( #34513 )
...
Signed-off-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
Co-authored-by: Jon OILLARBURU <jon.oillarburu@multiversecomputing.com >
2026-02-25 08:04:06 +00:00
lichuang
2c619e5e3f
[Docs]Fix documentation formatting in architecture overview ( #34679 )
...
Signed-off-by: codedump <lichuang1982@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-25 08:00:15 +00:00
Simon Mo
8a685be8d9
docs: document committer proposal process in governance ( #35225 )
...
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-25 07:58:48 +00:00
Laura Wang
2465071510
[Perf] Add opt-in SM100 Oink RMSNorm custom-op path ( #31828 )
...
Signed-off-by: Laura Wang <3700467+Laurawly@users.noreply.github.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-24 23:01:53 -08:00
wenshuai
cd43673668
[Perf] Optimize FP8 gemm of sm120. ( #34424 )
...
Signed-off-by: wenshuai <wenshuai@xiaomi.com >
2026-02-24 22:25:24 -08:00
Xinyu Chen
35d44b4557
[XPU]Support CUDAGraph on XPU Platform ( #34482 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: zhenwei-intel <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:52 -08:00
Kunshang Ji
8ad54a991b
[Platform] Add current_platform.num_compute_units interface ( #35042 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Kunshang Ji <jikunshang95@gmail.com >
2026-02-24 22:22:49 -08:00
Kunshang Ji
92510edc32
remove cuda check in top_k_top_p_triton kernel ( #35011 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-24 22:22:31 -08:00
Isotr0py
a6c137521c
[Misc] Add shard_id validation for MergedColumnLinear ( #35055 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:12:28 -08:00
Isotr0py
4572a06afe
[Misc] Enable weights loading tracking for quantized models ( #35074 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 22:11:03 -08:00
Zhengxu Chen
5cc29cfb8b
[compile] Improve error message during artifacts load failure. ( #35115 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 22:01:09 -08:00
Chen Zhang
8fae54faff
[Linear Attention] fix bug for linear attention + prefix caching + reset_prefix_cache ( #35157 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-24 22:00:19 -08:00
Harry Mellor
f7967577f5
Remove requirement to use --hf-overrides for DeepseekVLV2ForCausalLM ( #35203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 22:00:06 -08:00
pks
af770b8e7b
[Bugfix] Fix AttributeError when passing StructuredOutputsParams to CompletionRequest ( #35237 )
...
Signed-off-by: Patrick Simianer <patrick@lilt.com >
2026-02-24 22:00:03 -08:00
Andreas Karatzas
2ff3e436ad
[Responses][CI] Filter negative token IDs in schema fuzz test to avoid 500 errors ( #35231 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-25 05:52:44 +00:00
Jhao-Ting Chen
c2c4c4611a
[FIX] fused moe with lora shared expert dual stream (1.07x otps) ( #34933 )
...
Signed-off-by: Jhao-Ting Chen <jhaotingc@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-25 04:40:45 +00:00
Rohan Potdar
f38f8c9742
[ROCm]: Enable customop and rope+kvcache fusion for AITER RoPE ( #35180 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-25 04:36:40 +00:00
Flora Feng
ec1d30c0f6
[Responses] Decouple SSE event helpers from Harmony context ( #35148 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-24 20:05:25 -08:00
Pooya Davoodi
e3b2324ec4
[Frontend] Use init_app_state and FrontendArgs in run_batch ( #32967 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-24 19:40:39 -08:00
Nick Hill
dbf0da817a
[Core] Cleanup engine pause/sleep logic ( #34528 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-24 19:33:34 -08:00
Xin Yang
3bbb2046ff
[Bugfix] Fix expert_ids padding values in moe_align_block_size kernel ( #35161 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-24 17:14:24 -08:00
yugong333
576fe50333
Adding Nemotron fp8 Triton MoE Config ( #34674 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-24 15:56:38 -08:00
Hashem Hashemi
a0e50a4260
Convert wvSplitKQ to 16x16 MFMA in prep for mi4xx. ( #34100 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-24 23:35:21 +00:00
Benjamin Chislett
9fa5b25a23
[Bug][DSV3.2] Always prepare metadata for DeepGEMM Sparse Attention ( #35075 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-24 14:55:22 -08:00
Robert Shaw
ea97750414
[CI] Fix Distributed Tests ( #35236 )
...
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-24 22:31:56 +00:00
Andreas Karatzas
067c5d9ad1
[ROCm][CI] Added MI325 mirrors ( #34923 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-24 13:37:15 -08:00
Benjamin Chislett
f5972a872f
[Model][Spec Decode] Nemotron-H MTP and Mamba Speculative Decoding Support ( #33726 )
...
Signed-off-by: Shahar Mor <smor@nvidia.com >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Shahar Mor <smor@nvidia.com >
Co-authored-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-24 09:49:56 -08:00
Matthew Bonanni
a9e15e040d
Add @MatthewBonanni to CODEOWNERS ( #35207 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-24 10:45:10 -07:00
Lucas Wilkinson
542ca66357
Revert "[CI/Build] Remove redundant OpenTelemetry pip install from CI configs" ( #35211 )
2026-02-24 09:26:42 -08:00
Cyrus Leung
fc8456c336
[CI/Build] Fix kernels test location ( #35205 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 09:20:34 -08:00
Wentao Ye
9ce8fad2a9
[Perf] Optimize Python Slice for Structured Output using islice instead of [:] ( #33593 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-24 09:02:36 -08:00
Harry Mellor
c38b8d5a31
Remove padding_index from models that don't use it for better Transformers v5 compatibility ( #35189 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-24 08:04:46 -08:00
Robert Shaw
60da0e1544
[CI] Remove Duplicated Tests ( #35199 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-24 23:53:30 +08:00
danisereb
9609b1f18d
Integrate flashinfer mm_mxfp8 in ModelOpt MXFP8 ( #35053 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 08:45:13 -07:00
danisereb
a0c7081695
Fix fallback to default tactic (flashinfer autotuner) with trtllm_fp4_block_scale_moe ( #35088 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-24 07:25:44 -08:00
R3hankhan
34ce0ffd1f
[CPU][Perf] Accelerate Attention head for s390x using vector intrinsics ( #34434 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-24 07:25:39 -08:00
Robin Nabel
0de5333989
Fix GLM4 parser tests ( #34905 )
...
Signed-off-by: Robin Nabel <opensource@nabel.co >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2026-02-24 22:27:42 +08:00
Eldar Kurtić
a87cc50859
[Attn,KV-cache] Use per-head scales in the attention selector ( #34281 )
...
Signed-off-by: Your Name <you@example.com >
Signed-off-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Eldar Kurtic <research@neuralmagic.com >
Co-authored-by: Your Name <you@example.com >
2026-02-24 09:02:43 -05:00
Cyrus Leung
761e63e541
[Frontend] Always pass supported_tasks to validation ( #35186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-24 04:16:33 -08:00
Isotr0py
d12d201409
[Bugfix] Fix failing FunASR processor test ( #35111 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-24 04:13:45 -08:00
eustlb
b3ad37c5db
[glm-asr] change defaults dummy audio size ( #35108 )
...
Signed-off-by: Eustache Le Bihan <eulebihan@gmail.com >
2026-02-24 04:13:33 -08:00
Wentao Ye
14561fabfd
[Perf] Optimize pooling model redundant copy, 1.8% throughput improvement ( #35127 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-24 04:13:11 -08:00
Zhengxu Chen
c77f3e1207
[compile] Save aot compile artifacts atomically. ( #35117 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-24 04:11:01 -08:00
Dor Huri
012dee9233
[Feature] Add LoRA tower/connector support for Llama 4 Vision (mllama4) ( #35147 )
...
Signed-off-by: dorhuri123 <dor.huri1@live.biu.ac.il >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-24 04:10:32 -08:00
Tugsbayasgalan Manlaibaatar
f1c664545b
Make voxtral compile friendly ( #33959 )
...
Signed-off-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-24 09:33:35 +01:00
Xin Yang
c870eb9e0f
[LoRA] Update LoRA expand kernel block_n calculation ( #32621 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 23:17:53 -08:00
BadrBasowid
6af03f2394
[Refactor] [1/N] Reorganize kernel abstraction directory ( #34055 )
...
Signed-off-by: BadrBasowid <badr.basowid@gmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-24 06:47:22 +00:00
Vlad Tiberiu Mihailescu
1a6cf39dec
[CI/Build] Remove redundant OpenTelemetry pip install from CI configs ( #35032 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-23 22:24:11 -08:00
Nicolò Lucchesi
f91808ae0d
[MM] Allow audio chunking for offline LLM ( #34628 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 21:04:28 -08:00
Vadim Gimpelson
33a0d43c71
[BUGFIX][Qwen3.5] Hardcode mlp.gate as not quantizable ( #35156 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-23 19:42:24 -08:00
pschlan-amd
80d93fd6da
gpu_model_runner: Cache is_encoder_decoder from model config ( #35099 )
...
Signed-off-by: Patrick Schlangen <pschlan@amd.com >
2026-02-23 19:08:34 -08:00
Jia Guo
ec85340531
[Quantization] Support FP8 MoE bias for models like GPT-OSS ( #34906 )
...
Signed-off-by: jasperjiaguo <jasperg662@gmail.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-23 19:07:47 -08:00
Rohan Potdar
2ff4e51152
[ROCm] AITER fused RoPE+KVCache ( #33443 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
Signed-off-by: charlifu <charlifu@amd.com >
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com >
Co-authored-by: charlifu <charlifu@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com >
2026-02-23 19:06:00 -08:00
Asaf Gardin
95642441d0
[Mamba1] - Change supports_update_block_table to True ( #35054 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-23 19:05:57 -08:00
Xin Yang
a7c9f7b7ec
[Bugfix] Fix lora_ids in FusedMoE LoRA test ( #35135 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-23 21:49:25 -05:00
Michael Goin
a4bd661fb3
[Perf] Enable FlashInfer DeepGEMM swapAB on SM90 by default ( #34924 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 17:34:41 -08:00
Michael Goin
3ef9fd0f98
[Bugfix] Fix DSV3 kernels breaking _C and _moe_C on unsupported arches ( #35123 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-23 17:11:27 -08:00
Michael Goin
22a97e6613
[Perf] Improve default triton fused moe configs ( #34846 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-23 16:01:28 -08:00
Aaron Hao
596ed1f02e
[RL] Validation for pause_mode='keep' ( #34992 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-23 16:30:56 -05:00
Nicolò Lucchesi
b8d8b7e934
[Misc] Monitor interface changes ( #35113 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-23 17:14:51 +00:00
Harry Mellor
28c5e69ba0
Enforce that model is the first positional arg when --served-model-name is used ( #34973 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:05 -08:00
Harry Mellor
864167d376
Fix custom processors that use deleted import for Transformers v5 ( #35101 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:38:00 -08:00
haosdent
a2ba6a5244
[Bugfix] Fix prefix caching for Mamba 'all' mode (Nemotron models) ( #34874 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 17:31:51 +01:00
Harry Mellor
c4f38696f7
Use Xet high performance mode for Transformers v5 ( #35098 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 08:19:30 -08:00
haosdent
a7f341c323
[Bugfix] Fix MRotaryEmbedding missing truncate attr with YaRN scaling ( #35080 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-23 16:05:52 +00:00
Robert Shaw
d13ece38d7
[CI] Skip Responses API ( #34990 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 07:46:45 -08:00
Mark McLoughlin
5cc7c4452e
[Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU) ( #30950 )
...
Export the existing Model FLOPs Utilization (MFU) metrics via Prometheus.
`--enable-mfu-metrics` is required for these to be exposed.
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-23 15:01:07 +00:00
Eldar Kurtić
b95bb6927f
[kv-cache, ct] Use compressed-tensors as a source of ground-truth for quant strategies ( #34254 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-23 07:37:55 -07:00
Cyrus Leung
392645454b
[Refactor] Decouple TimingContext from InputProcessingContext ( #35083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-23 14:15:50 +00:00
Eldar Kurtić
1e8438a89a
[Llama4,CI] Bring back Llama-4 bug fixes, and also fix Maverick tests ( #35033 )
...
Signed-off-by: Eldar Kurtic <you@example.com >
Co-authored-by: Eldar Kurtic <you@example.com >
2026-02-23 09:04:34 -05:00
Robert Shaw
8435b2e049
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) ( #34302 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-23 14:02:26 +00:00
Yan Ma
b1b5e045df
[XPU] allow TORCH_SDPA/TRITON_ATTN as XPU vit Backend ( #35010 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2026-02-23 05:06:44 -08:00
Andreas Karatzas
5f68464f92
[ROCm][CI] Fix spec decode profile assertion and logprob test determinism ( #35043 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-23 05:05:54 -08:00
Vincent Gimenes
aa08a30fc9
[CLEANING] Remove unused disable_by_batch_size from SpeculativeConfig ( #35060 )
...
Signed-off-by: Vincent Gimenes <vincent.gimenes@gmail.com >
2026-02-23 05:05:36 -08:00
Wentao Ye
7f40e9e516
[Refactor] Remove dead private func _fp8_perm and _extract_mask_for_item ( #35068 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-23 05:05:20 -08:00
Harry Mellor
103e614b14
Fix pipeline parallel with embed scaling in the Transformers modelling backend ( #35094 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 05:04:47 -08:00
Neil Schemenauer
54e2f83d0a
[Feature] Lazy import for the "mistral" tokenizer module. ( #34651 )
...
Signed-off-by: Neil Schemenauer <nas@arctrix.com >
2026-02-23 00:43:01 -08:00
Gabe Goodhart
e631f8e78e
fix: Apply embedding_multiplier to inputs_embeds ( #34813 )
...
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-23 00:42:46 -08:00
Martin Hickey
e97c46a92d
[BugFix]: Fix local mypy issues ( #34739 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-23 00:40:29 -08:00
Jee Jee Li
7291d1b288
[Bugfix] Fix kernel benchmark ( #33752 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-22 21:18:08 -08:00
Cyrus Leung
987506bca6
[Refactor] Simplify dummy data generation ( #35025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-22 20:55:27 -08:00
Woosuk Kwon
c645e9a214
[Model Runner V2] Remove propose_draft method ( #35070 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 18:27:12 -08:00
Nick Hill
944ffb5968
[Model Runner V2][Minor] Remove redundant do_spec_decode field ( #35039 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-22 16:18:04 -08:00
qizixi
2bcf71b9c0
[Spec Decode] Reduce TP communication for speculative decoding draft token generation ( #34049 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 14:59:16 -08:00
tacos8me
b7892a3bef
[Model] Add NVFP4 quantization support for Step3.5-Flash ( #34478 )
...
Signed-off-by: tacos8me <ian@cloudhabit.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-22 12:30:46 -07:00
Benjamin Chislett
682566b18e
[Bug] Refactor max_num_batched_tokens to account for drafting ( #34898 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-22 11:18:46 -05:00
qizixi
b9c2a565cc
[Spec Decode] Defer clearing KV connector metadata for EAGLE3 speculative decode + prefill / decode disagg setup ( #34529 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-22 08:08:32 -08:00
Andreas Karatzas
dd8c3a7fb2
[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays ( #35052 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 10:07:18 +00:00
Andreas Karatzas
a8a47c17b6
[ROCm][CI] Fix flaky embedding chat test by using tolerance-based comparison ( #35050 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 09:03:44 +00:00
Roger Wang
40f88d8318
[Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser ( #34779 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-21 23:15:35 -08:00
Woosuk Kwon
2cbf9656ce
[Model Runner V2] Enable CUDA graph for Eagle3 ( #35040 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 21:42:50 -08:00
Xiao Li
30132cd144
Fix apply_top_k_top_p_triton called by non-cuda logits Tensor ( #35030 )
...
Signed-off-by: Xiao Li <ilx@meta.com >
2026-02-21 21:11:54 -08:00
Cyrus Leung
cbd95a2dd1
[Benchmark] Use sns.relplot for plotting ( #35027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 20:26:48 -08:00
Athrael Soju
970861ac0c
[New Model] Add ColModernVBERT ( #34558 )
...
Signed-off-by: Athrael Soju <athrael.soju@gmail.com >
Signed-off-by: athrael-soju <athrael-soju@users.noreply.github.com >
2026-02-22 12:23:41 +08:00
Wentao Ye
d24bdd7c4b
[CI] Bump mteb version to mteb[bm25s]>=2, <3 for pooling model unit tests ( #34961 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-21 20:23:24 -08:00
Andreas Karatzas
d403c1da1c
[CI] Stabilizing ROCm amd-ci signal and minor name fix in upstream ( #35008 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-22 04:01:10 +00:00
Woosuk Kwon
b71fbd06e2
[Model Runner V2] Support attention group ( #35036 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 16:42:53 -08:00
Vadim Gimpelson
74d90b1ce4
[Model Bash][DSR1] Add selective dynamic shape marking for CustomOp ( #34900 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-21 19:28:01 -05:00
Woosuk Kwon
a4047d4ea9
[Model Runner V2] Support Eagle3 (no CUDA graph) ( #35029 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-21 12:55:24 -08:00
Cyrus Leung
965fe45935
[CI/Build] Fix gRPC version mismatch ( #35013 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 12:14:41 -07:00
Roman
98b0205c3c
[Frontend] Add automatic language detection for Whisper transcription ( #34342 )
...
Signed-off-by: space_check <roman.vuskov@rwth-aachen.de >
Signed-off-by: Roman <45857014+spacecheck@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-21 04:49:41 -08:00
Huy Do
272b535ab3
[Bugfix] Gate 256-bit instructions to CUDA 12.9+ ( #34791 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 04:48:14 -08:00
Cyrus Leung
f74f1572ca
[Benchmark] Improve benchmarks ( #35012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-21 10:31:58 +00:00
petrpechman
bebfe55b1c
[Doc] Fix example of eagle3 ( #34960 )
...
Signed-off-by: Petr Pechman <petr.pechman@firma.seznam.cz >
Co-authored-by: Petr Pechman <petr.pechman@firma.seznam.cz >
2026-02-21 09:57:53 +00:00
Nick Hill
820d7815eb
[Core] Minor structured-output related scheduler optimization ( #34765 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:38:28 -08:00
Nicolò Lucchesi
ab6f3487a6
[PD] Change kv_load_failure_policy Default from "recompute" to "fail" ( #34896 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-21 01:34:57 -08:00
BADAOUI Abdennacer
8dc8a99b56
[ROCm] Enable bitsandbytes quantization support on ROCm ( #34688 )
...
Signed-off-by: badaoui <abdennacerbadaoui0@gmail.com >
2026-02-21 00:34:55 -08:00
jennyyyyzhen
2aab2bb543
[ROCM] Optimize ROCM_AITER_FA spec decode eagle performance ( #34541 )
...
Signed-off-by: jennyyyyzhen <yzhen@hmc.edu >
2026-02-20 20:32:05 -08:00
Andreas Karatzas
54254f7a61
[ROCm][CI] Fix spec decode logprobs flakiness and parametrize tree attention backends ( #34599 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:23 -08:00
Andreas Karatzas
cf93c1a128
[ROCm][AITER] Fix aiter paged_attention_v1 decode for sliding window and head_size < 64 ( #34570 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:25:07 -08:00
Andreas Karatzas
89358f0d35
[CI] Fix ColBERT HF comparison tests on AMD CI + refactor ( #34567 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:12:05 -08:00
zhongdaor-nv
a0fe7ea2f0
[feat] Add per-block extra_keys to KV events ( #33304 )
...
Signed-off-by: zhongdaor-nv <zhongdaor@nvidia.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 20:11:40 -08:00
Andreas Karatzas
991d6bff38
[CI][MCP][Harmony] Heavy refactoring Harmony & MCP response tests and stabilizing with deterministic test infrastructure ( #33949 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-20 20:03:32 -08:00
Kata Coder
5719a4e4e6
[Frontend] Support multimodal inputs for late-interaction scoring (ColQwen3) + NewModel: nvidia/nemotron-colembed ( #34574 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
2026-02-20 20:01:40 -08:00
pougetat
11be2c74dc
[Realtime] Add Qwen3-ASR realtime streaming support ( #34613 )
...
Signed-off-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Thomas Pouget-Abadie <thomaspou@microsoft.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-20 19:59:42 -08:00
Xin Yang
7a5adad480
[Kernel] Optimize sample_recovered_tokens_kernel ( #34974 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 19:59:06 -08:00
Li
59c6233297
Support prompt_embeds for pooling requests in output processor ( #34904 )
...
Signed-off-by: Li Zhang <lzhanga@amazon.com >
Co-authored-by: Li Zhang <lzhanga@amazon.com >
2026-02-20 19:57:38 -08:00
Taneem Ibrahim
d38cd3dde5
[Misc] Fix mypy errors in vllm/profiler and remove from exclude list ( #34959 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2026-02-20 19:56:33 -08:00
Rohan Potdar
ded333fb9b
[ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion ( #34636 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-20 19:56:16 -08:00
Yanan Cao
9d7577b2bd
[Kernel] [Helion] [9/N] Canonicalize GPU variant names to base model names ( #34928 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-20 19:55:51 -08:00
Vlad Tiberiu Mihailescu
e739c29ea4
[CI/Build] Add opentelemetry libs in default vllm build (requirements/common.txt) ( #34466 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2026-02-20 19:54:55 -08:00
yugong333
a55caf6ae9
[LoRA] Support Quantized Adapters ( #30286 )
...
Signed-off-by: Yu Gong <yu3.gong@gmail.com >
Signed-off-by: wz1qqx <ziqi.wang@novita.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: wz1qqx <55830058+wz1qqx@users.noreply.github.com >
Co-authored-by: wz1qqx <ziqi.wang@novita.ai >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 19:54:35 -08:00
Lucas Wilkinson
0e22cd618b
Revert "[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers" ( #34997 )
2026-02-20 17:19:19 -08:00
Wei Zhao
ea5f903f80
Bump Flashinfer Version and Re-enable DeepSeek NVFP4 AR+Norm Fusion ( #34899 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 13:37:31 -08:00
Ryan Rock
0632ed8778
[AMD][CI] Fix test_custom_allreduce for A100 testgroup ( #34735 )
...
Signed-off-by: Ryan Rock <ryan.rock@amd.com >
2026-02-20 21:33:04 +00:00
Lucas Wilkinson
aaefc58ee0
[CI] Revert PRs 34818 and 33600 ( #34979 )
2026-02-20 13:25:50 -08:00
Wei Zhao
f24b2de3d3
[Test] Add FP8 KV Cache Testing for MLA Backends ( #34473 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-20 18:51:58 +00:00
Michael Goin
fac1507f03
[CI] Remove failing prime-rl integration test ( #34843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-20 10:17:42 -08:00
Zhengxu Chen
f863994084
[compile] Fix torch.compile time discrepancy in logging. ( #34912 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 08:47:14 -08:00
Zhengxu Chen
e4a5d8c653
[compile] Move torch_aot_compile directory under torch_compile_cache ( #34831 )
...
Signed-off-by: zhxchen17 <zhxchen17@fb.com >
2026-02-20 08:46:45 -08:00
Yanan Cao
a6d0299c75
[Kernel] [Helion] [6/N] Add num_tokens dimension to silu_mul autotuning and dispatching ( #34185 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-20 08:36:51 -08:00
Harry Mellor
6ce80f7071
Ensure that MkDocs v2 does not get installed ( #34958 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-20 15:38:11 +00:00
Huamin Li
1fe462168c
[perf] Avoid dtype promotion sync in mamba_get_block_table_tensor ( #34870 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:21:56 -08:00
Flora Feng
ed31a020ee
[Refactor] Extract Harmony streaming SSE event builders into streaming_events.py ( #34909 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-20 06:20:46 -08:00
Cyrus Leung
f9ac19204f
[V0 Deprecation] Remove unused MM placeholders in request output ( #34944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-20 06:19:23 -08:00
Vadim Gimpelson
59965affbd
[BUGFIX] Fix _dummy_run missing prepare_inputs_event synchronization ( #34866 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-20 05:54:27 -08:00
Xin Yang
b1c4f0b265
[Kernel] Optimize grouped topk kernel ( #34206 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2026-02-20 01:34:45 -08:00
Kevin McKay
8de7c636cc
[Bugfix][Hardware][AMD] Fix ROCM_AITER_FA speculative decoding support ( #32877 )
...
Signed-off-by: c0de128 <kevin.mckay@outlook.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-19 22:25:46 -08:00
Frank Wang
059779231f
[Minor] Add logging when using MXFP4 MXFP8 TRTLLM backend ( #34916 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-19 22:07:57 -08:00
tianshu-Michael-yu
ea37530b47
[Models] LFM2: Support LoRA ( #34921 )
...
Co-authored-by: Piotr Mazurek <piotr635@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-19 22:07:23 -08:00
Micah Williamson
f5432e35a3
[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout ( #34922 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-20 05:37:49 +00:00
杨朱 · Kiki
07cab212f0
[Misc] Add deprecated environment variable utilities ( #33677 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-19 21:33:25 -08:00
rasmith
0c1dc42748
[CI][AMD][BugFix][P/D] Add default_vllm_config to test_moriio_connector.py so tests pass ( #33739 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-19 21:32:40 -08:00
Varun Chawla
676f82ae81
Add validation to reject non-text content in system messages ( #34072 )
...
Signed-off-by: Varun Chawla <varun_6april@hotmail.com >
2026-02-19 21:30:33 -08:00
Elizabeth Thomas
81bfc21a6a
[Model Bash]: Improve FP8 Oracle for Config Specific Kernel Selection ( #34260 )
...
Signed-off-by: Elizabeth Thomas <email2eliza@gmail.com >
Signed-off-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Signed-off-by: Robert Shaw <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <robertgshaw2-redhat@h100-02.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Robert Shaw <robertgshaw2@gmail.com >
2026-02-19 21:29:08 -08:00
Matthias Gehre
4e2c7caf2d
[Bugfix] Add regression test for MoE quant_config under torch.compile ( #34335 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-20 13:27:26 +08:00
Bowen Bao
d9e62c03eb
[Quark] Fix MoE fp8 activation scale handling on mi300 ( #34386 )
...
Signed-off-by: Bowen Bao <bowenbao@amd.com >
2026-02-19 21:27:14 -08:00
Kevin H. Luu
a1a2d79442
[ci] Use the right tag for CPU arm64 image ( #34915 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 19:59:15 -08:00
Cyrus Leung
ac900c89bb
[Refactor] Implement output type check in LLM ( #34794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:57:55 -08:00
Mark McLoughlin
76df6072ff
[Core] Fix state names in pause_scheduler() ( #34840 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2026-02-19 17:21:46 -08:00
Michael Goin
16f24e8797
[CI] Add GPT-OSS Eval job for H100 ( #34359 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 17:14:54 -08:00
Nick Hill
40b2f1c3d9
[Model Runner V2] Minor CPU optimizations ( #34856 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-19 16:05:37 -08:00
Mayank Ketkar
648951a9c3
[Bugfix] Fix benchmark_fused_collective crash on CustomOp init ( #34665 )
...
Signed-off-by: Mayank Ketkar <mketkar@zoox.com >
Signed-off-by: Mayank Ketkar <mayket04@gmail.com >
Co-authored-by: Mayank Ketkar <mketkar@zoox.com >
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com >
2026-02-19 19:01:00 -05:00
Michael Goin
f72061a19a
[UX] More descriptive reasons in is_supported_config for MoE ( #34908 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-19 15:20:52 -08:00
Matthew Bonanni
662205d34e
[Bugfix] Fix Basic Models Test ( #34818 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 14:49:07 -08:00
Roger Wang
4fb8beefaa
[Bugfix] Fix cutlass fp8 kernel on hopper for Qwen3.5 ( #34914 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-19 13:34:55 -08:00
Alexei-V-Ivanov-AMD
304319c4ed
Change targets for AMD build in the "CI" pipeline ( #34918 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-19 21:26:53 +00:00
Wentao Ye
c683d11c94
[Refactor] Deprecate head_first for chunk_gated_delta_rule ( #34263 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-19 13:23:49 -05:00
roikoren755
3eff45d793
Revert "[NemotronH] Do not force router to run in fp32 ( #34582 )" ( #34808 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-19 09:47:05 -08:00
Robert Shaw
4685a630a2
[Model Bash][DeepSeekR1] Remove Shared Expert Clone ( #34344 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-19 07:56:14 -08:00
Eldar Kurtić
ee1d25f199
[Llama4,Quantization] Simplify and generalize logic for Q/K permutations in quantized self-attn layers ( #34471 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-19 07:55:41 -08:00
Linda
6fff24f30f
[Bugfix] Qwen3.5 kv-scale weight remapping ( #34719 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
2026-02-19 04:13:37 -08:00
Cyrus Leung
23210a911e
[CI/Build] Try to make beam search test less flaky ( #34885 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 19:16:58 +08:00
Cyrus Leung
1391378861
[Bugfix] Fix edge case in UUID data parsing ( #34884 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-19 02:24:30 -08:00
Andreas Karatzas
f6220f9877
[ROCm][Test] Fix beam search determinism failures from batch-size-dependent FP divergence and removed wrong marker ( #34878 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 08:25:26 +00:00
Andreas Karatzas
2df2bb27b0
[ROCm][CI] Removing all blocking labels from MI355 until stable infra ( #34879 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-19 07:53:08 +00:00
Tal Nir
f75b61a9e9
[Voxtral Realtime] Fix engine crash on empty multimodal embeddings ( #34862 )
...
Signed-off-by: Tal Nir <tal@nervexneurotech.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-18 23:21:47 -08:00
Wei Zhao
7f51e93864
[Bug] Fix DeepSeek V3 weight loading caused by incorrect prefix ( #34876 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-18 23:20:30 -08:00
Alex Brooks
4611af1663
[Bugfix] Add Quant Config to Llava Next Projector ( #34847 )
...
Signed-off-by: Alex Brooks <albrooks@redhat.com >
2026-02-18 23:18:23 -08:00
Manrique Vargas
ad5aa6bd9f
fix(docs): fix typos in comments and docstrings ( #34836 )
...
Signed-off-by: machov <mv1742@nyu.edu >
2026-02-18 23:17:41 -08:00
Jaeyeon Kim(김재연)
9681068cf9
[Frontend] Fix reasoning_tokens for text-based parsers in Responses API ( #33513 )
...
Signed-off-by: Jaeyeon Kim <anencore94@gmail.com >
2026-02-18 23:16:41 -08:00
Kevin H. Luu
b6101d384d
Deprecate test-pipeline.yaml ( #34864 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-19 02:15:27 +00:00
Woosuk Kwon
5fcb0cdd68
[Model Runner V2] Use FP32 for Gumbel Noise ( #34854 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 17:07:37 -08:00
Woosuk Kwon
c878b43b64
[Model Runner V2] Remove unnecessary copies in PW CUDA graph capture ( #34849 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 15:52:50 -08:00
rasmith
2b84ac669c
[CI][AMD][BugFix] Use torch.testing.assert_close instead of assert torch.allclose in test_rocm_skinny_gemms.py ( #34181 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 23:10:19 +00:00
zhrrr
11d3976b88
[Model Runner V2] support piecewise & mixed cudagraph ( #32771 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
2026-02-18 15:03:17 -08:00
Yongye Zhu
40da9625a1
[MoE Refactor] Convert mxfp4 marlin into modular kernel format ( #34588 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 14:37:14 -08:00
Flora Feng
8d9babd4de
Fix empty tool_call_id in Anthropic messages API tool result conversion ( #34745 )
...
Signed-off-by: <>
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Co-authored-by: Flora Feng <sfeng33@h100-01.nemg-001.lab.rdu2.dc.redhat.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-18 14:31:59 -08:00
Aaron Hao
e99ba957ec
[BUG] Fixing Weight Sync unit test ( #34841 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2026-02-18 17:20:10 -05:00
Kyle Sayers
64ac1395e8
[Docs] Clean up speculators docs ( #34065 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2026-02-18 13:48:11 -08:00
Cyrus Leung
61cf087680
[Bugfix] Fix lora tests ( #34834 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-18 13:22:31 -08:00
Wenlong Wang
847a57cd12
[Bugfix][MoE Kernel] Fix incorrect routing selection for models without expert groups (e.g., MiniMax-M2.1) ( #34673 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-18 13:03:24 -08:00
rasmith
fcd6ac97ed
[CI][AMD][BugFix] Skip tests in test_unquantized_backend_selection that should not run on ROCm ( #34655 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2026-02-18 15:00:40 -05:00
Woosuk Kwon
95be2a7f22
[Model Runner V2] Minor simplification for DCP ( #34786 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-18 11:04:53 -08:00
Jaden Mathias
0e60c925cf
[Bugfix] Remove assert causing hipErrorStreamCaptureUnsupported ( #34455 )
...
Signed-off-by: Jaden Mathias <jaden.mathias@amd.com >
2026-02-18 18:54:54 +00:00
Teng Ma
d7ff22204a
[Misc] Add mooncake-transfer-engine to kv_connectors requirements ( #34826 )
...
Signed-off-by: Teng Ma <teng-ma@linux.alibaba.com >
2026-02-18 18:26:24 +00:00
Isotr0py
c0bd8b13da
[Bugfix] Redo Qwen3.5/Qwen3-Next GDN projector fusion ( #34697 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
2026-02-18 09:46:53 -08:00
Michael Goin
caeb887bf6
[Bugfix] Fix NVFP4 TRTLLM MoE non-gated support; add gsm8k for Nemotron-3-Nano FP8+NVFP4 ( #34725 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-18 09:39:22 -08:00
Ilya Markov
6b3166a7c7
[CI][Bugfix] Fix multinode test script ( #34820 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-18 11:45:10 -05:00
Robert Shaw
25e2e136ef
[CI] temporarily disable multi-node tests ( #34825 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 11:32:44 -05:00
Robert Shaw
6874638bc4
[Model Bash] DeepSeek R1 BF16 Min Latency QKV A GEMM (0.5% E2E Speedup) ( #34758 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-18 07:42:36 -08:00
Burkhard Ringlein
e24663c5a9
Add unit tests for fp8 output fusion of triton_attn ( #34228 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-18 06:22:49 -05:00
Nick Hill
c50e105a88
[Model Runner V2] Avoid prepare prefill kernel launch overhead ( #34780 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-18 00:49:21 -08:00
Cyrus Leung
a766b30349
[Renderer] Deprecate code paths for old input processing ( #34775 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 00:35:04 -08:00
Asaf Joseph Gardin
1faa8cb73c
[Quantization] - Added uses_meta_device_weights to quant config ( #34645 )
...
Signed-off-by: Josephasafg <ajgard7@gmail.com >
2026-02-17 23:43:44 -08:00
Marek Michalowski
e89a91d927
[Bugfix] fix activation in cpu_fused_moe_torch call ( #34696 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-17 23:39:46 -08:00
Michael Goin
909b147197
[Bugfix] Fix prefix creation for Qwen3.5 ( #34723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-17 23:39:15 -08:00
ElizaWszola
a88b3be7c4
[Bugfix] Fix quant RMS norm fusion for quantization with TMA-aligned scales ( #33255 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2026-02-17 23:35:04 -08:00
Nick Hill
a49ea5a58f
[Model Runner V2] A bit more PP simplification ( #34766 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 21:39:07 -08:00
Cyrus Leung
30ebe0dc3c
[CI/Build] Remove use of skip_v1 ( #34699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-18 12:19:11 +08:00
Andreas Karatzas
cef65f0715
[ROCm][CI] Removed hard-coded attn backend requirement for Qwen VL ( #34753 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-18 03:59:53 +00:00
Russell Bryant
6f3b2047ab
[Core] Fix SSRF bypass via backslash-@ URL parsing inconsistency ( #34743 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: isotr0py <2037008807@qq.com >
2026-02-18 03:53:35 +00:00
Luka Govedič
02e8f26cea
[torch.compile] Turn on silu+fp4 quant fusion by default for O1+ ( #34718 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-18 03:29:15 +00:00
Hongxia Yang
4a00a511bb
[BugFix] [Build] fix string literals comparison in indexer_k_quant_and_cache calling site ( #34653 )
...
Signed-off-by: Hongxia Yang <hongxiay.yang@amd.com >
Co-authored-by: Hongxia Yang <hongxiay.yang@amd.com >
2026-02-17 19:19:41 -08:00
Cyrus Leung
a0d8d944e2
[Renderer] Move MM Hash parsing into Renderer ( #34711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 19:18:55 -08:00
Amr Mahdi
df3f537a66
[CI] Remove unused precompiled wheel args from image build ( #34767 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 18:58:18 -08:00
Matthew Bonanni
7743152957
[Attention] Refactor check_and_update_config ( #33600 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-17 17:06:54 -08:00
Wentao Ye
ab33d2a629
[Feature] Decode Context Parallel support for GPU model runner v2 ( #34179 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2026-02-17 16:27:15 -08:00
Woosuk Kwon
be3af2d29e
[Model Runner V2] Further simplification for PP ( #34724 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-17 15:18:18 -08:00
Jongseok Park
c656ba3b4d
[Kernel] Triton-based Top-k and Top-p sampler kernels ( #33538 )
...
Signed-off-by: js_park <cakeng@naver.com >
Signed-off-by: Jongseok Park <37990712+cakeng@users.noreply.github.com >
Signed-off-by: Sunga Kim <sunga.kim@berkeley.edu >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Sunga Kim <sunga.kim@berkeley.edu >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-17 23:14:30 +00:00
Matthew Bonanni
dc5fa77a4e
[Bugfix][MTP][Sparse MLA] Allow sparse MLA with MTP to run with FULL cudagraphs ( #34457 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-17 14:01:27 -05:00
Flora Feng
1e4a084c8e
[CI] Fix flaky test_parsable_context ( #34717 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2026-02-17 18:42:52 +00:00
Richard Zou
7967e854da
[BugFix] Fix sp tests ( #34716 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-17 17:07:56 +00:00
almayne
6bd6d0c3c1
Fixed whisper CPU test that does not spawn properly. ( #34324 )
...
Signed-off-by: Anna Mayne <anna.mayne@arm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-17 06:46:23 -08:00
Nicolò Lucchesi
8e962fef5f
[CI][Nixl] Add CrossLayer KV layout tests ( #34615 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-17 21:35:40 +08:00
Cyrus Leung
574fe75245
[Renderer] Move InputPreprocessor into Renderer (2/2) ( #34560 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-17 05:29:01 -08:00
junuxyz
c61a98f529
[CI][BugFix] ShellCheck cleanup to remove baseline and preserve runtime behavior ( #34514 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-17 12:22:56 +00:00
Harry Mellor
28bffe9466
Fix docs build warning ( #34686 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 02:31:40 -08:00
ChenqianCao
ad65177a19
[Bugfix] Fix 'remove_instance_endpoint' method logic in disagg_proxy_demo ( #32922 )
...
Signed-off-by: ChenqianCao <39755070+ChenqianCao@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-17 10:06:53 +00:00
Tim Dettmers
d44a5b6c47
Remove dead bitsandbytes CxB code from 8-bit inference path ( #34633 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-17 01:49:14 -08:00
Jiangyun Zhu
1d65283e95
Revert "[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj" ( #34683 )
2026-02-17 01:29:27 -08:00
kourosh hakhamaneshi
c464b57374
[Ray] Propagate third-party env vars to Ray workers via prefix matching ( #34383 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-17 01:08:42 -08:00
Amr Mahdi
c5c38e152a
[CI] Fix bake config artifact path for AMI rebuild pipeline ( #34656 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-17 06:39:44 +00:00
Woosuk Kwon
d00df624f3
[Model Runner V2] Minor refactoring for penalties ( #34662 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:43:00 -08:00
Woosuk Kwon
9752da9d9c
[Model Runner V2] Minor simplification for BadWordsState ( #34669 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 21:27:24 -08:00
Woosuk Kwon
04925b2202
[Model Runner V2] Minor cleanup for PP ( #34666 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:15:31 -08:00
Woosuk Kwon
d74278fb67
[Model Runner V2] Fix unintended CPU-GPU sync in make_dummy ( #34667 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 19:00:29 -08:00
haosdent
b68fd899d1
[Bugfix] Fix fused MoE int32 overflow in stride*offset without perf regression ( #34507 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-16 17:58:49 -08:00
Aneesh Puttur
0b5f9b7204
[CI] Enable mypy import following for vllm/v1/kv_offload ( #34639 )
...
Signed-off-by: Aneesh Puttur <aneeshputtur@gmail.com >
2026-02-17 09:58:15 +08:00
zhanqiuhu
9a8853f781
[Core] Pipeline Parallel support for Model Runner V2 ( #33960 )
...
Signed-off-by: Zhanqiu Hu <zh338@cornell.edu >
2026-02-16 17:48:16 -08:00
zhrrr
387a1898d9
[Model Runner V2] support bad_words sampling param ( #33433 )
...
Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-16 16:36:06 -08:00
roikoren755
3b30e61507
[NemotronH] Do not force router to run in fp32 ( #34582 )
...
Signed-off-by: Roi Koren <roik@nvidia.com >
2026-02-16 10:15:32 -08:00
Alexei-V-Ivanov-AMD
824f9e8f3c
Targeting the MI355 agent pool with all existing tests ( #34629 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2026-02-16 17:02:27 +00:00
Nicolò Lucchesi
6cc403e67d
[Bugfix][CI] Fix flaky entrypoints/openai/test_response_api_with_harmony.py::test_function_calling[openai/gpt-oss-20b] ( #34624 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-16 16:11:07 +00:00
Almog Tavor
72d5951d02
[Bugfix] Treat generation_config max_tokens as default not ceiling ( #34063 )
...
Signed-off-by: almogtavor <almogtavor@gmail.com >
2026-02-16 07:58:24 -08:00
Lucas Kabela
a3205beffb
[CI] Enable mypy coverage for individual excluded files ( #34292 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 07:34:29 -08:00
Christian Pinto
6930becd45
(bugfix): Fixed encode in LLM entrypoint for IOProcessr plugin prompts ( #34618 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2026-02-16 07:33:55 -08:00
Andreas Karatzas
03a8770a6d
[ROCm][CI] Fix plugins test group; updating terratorch and dependencies ( #34589 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 07:33:42 -08:00
Yiqi Xue
bc56a1d56e
[Bugfix] Fix ARC touch KeyError for non-ready T1 blocks in kv offload ( #34576 )
...
Signed-off-by: Yiqi Xue <xuey666@gmail.com >
2026-02-16 07:33:19 -08:00
danisereb
ec7d9e6745
Fix call to moe_mk in modelopt MoE modules (required for LoRA) ( #34575 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-16 07:33:09 -08:00
Isotr0py
3bb4e4311c
[Models] Fuse Qwen3.5 GDN's qkvz_proj and ba_proj ( #34492 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-16 07:32:51 -08:00
Amr Mahdi
08f8c198ae
[CI] Disable precompiled wheel path in CI image builds ( #34606 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-16 15:14:43 +00:00
Harry Mellor
a21cedf4ff
Bump lm-eval version for Transformers v5 compatibility ( #33994 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-16 05:24:35 -08:00
emricksini-h
3ef74cde5d
[CI][Tracing] Fix race condition by adding server readiness check ( #34364 )
...
Attempt to resolve #34284: "Metrics Tracing (2GPU)" fails with a segmentation fault.
Signed-off-by: emricksini-h <emrick.birivoutin@hcompany.ai >
2026-02-16 12:57:39 +00:00
Ekagra Ranjan
cd81cdb399
[Scheduler][ASR] Fix CrossAttn blocks per-request for Variable length encoder inputs ( #31058 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
2026-02-16 11:08:44 +00:00
Andreas Karatzas
1e828573b4
[CI][Metrics] Stabilize tests with polling and subprocess guards ( #34566 )
...
test_abort_metrics_reset is flaky due to hardware-dependent fixed sleeps: replace fixed sleeps with polling.
test_metrics_exist_run_batch passes even when the engine crashes on startup (false positive): add subprocess lifecycle guards.
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-16 10:52:02 +00:00
Samu Tamminen
a5ccc85c8c
[Bugfix] Fix Dynamo unexpected keyword argument ( #34320 )
...
Signed-off-by: Samu Tamminen <stammine@amd.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-16 01:32:30 -08:00
Roger Wang
b5475d0534
Revert "[Misc] fix qwen3.5 config" ( #34610 )
2026-02-16 01:06:05 -08:00
JJJYmmm
9521002f0a
[Misc] fix qwen3.5 config ( #34604 )
2026-02-16 00:25:38 -08:00
Cyrus Leung
ec17bdd894
[Renderer] Move InputPreprocessor into Renderer (1.5/2) ( #34598 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-15 23:46:33 -08:00
Amr Mahdi
bb59c90248
[CI] Write bake config to temp directory instead of repo root ( #34569 )
...
Signed-off-by: Amr Mahdi <amrmahdi@meta.com >
2026-02-15 22:15:47 -08:00
bnellnm
5bff999d12
[Bugfix] Add method to swap quant_method on FusedMoE to fix LoRA issues ( #34453 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-15 20:10:50 -08:00
Lucas Wilkinson
bb85929aa6
[BugFix] Fix Python 3.13 FlashMLA import error ( #34548 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2026-02-15 20:09:18 -08:00
Parth Bansal
5653021094
[Doc] Add Mistral-7b-v0.3 model to the batch invariance validated model ( #34584 )
...
Signed-off-by: Parth Bansal <parthbansal127@gmail.com >
2026-02-16 12:09:00 +08:00
Andreas Karatzas
974d829b05
[CI][Frontend] Return 422 instead of 500 for invalid Anthropic tool_choice ( #34590 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-15 20:06:48 -08:00
Isotr0py
91ac5d9bfd
[CI/Build] Enable tests for recent day-0 new models ( #34585 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 18:17:04 -08:00
Luka Govedič
23d825aba1
[torch.compile] Disable ar-rms fusion for ds3-fp4 & DP, fix CI test ( #34392 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-15 06:33:57 -08:00
Maryam Tahhan
f07a128413
[CPU][ARM] Add ARM BF16 cross-compilation support and improve documen… ( #33079 )
...
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2026-02-15 06:33:08 -08:00
Isotr0py
71cd89264f
[MM Encoder] Add Triton ViT attention backend ( #32183 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 06:32:47 -08:00
Isotr0py
19fab44152
[Doc] Update Encoder-Decoder models support doc with Florence-2 ( #34581 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-15 04:18:57 -08:00
Seiji Eicher
79c7e09235
[KV Connector] Add temporary, off-by-default VLLM_DISABLE_REQUEST_ID_RANDOMIZATION workaround ( #34415 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2026-02-14 23:26:10 -08:00
haosdent
79f3fab05a
[Bugfix] Handle num_expert_group=None in flashinfer block-scale FP8 MoE ( #34494 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-14 23:25:46 -08:00
Vadim Gimpelson
604b9eaec5
[BUGFIX] Fix accuracy regression for NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 with TP>1 ( #34476 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-14 23:25:17 -08:00
Stanislav Kirillov
50dbd6c9e6
[bugfix] Fix critical bug when reporting for all paths where handler.create_error_response is used ( #34516 )
...
Signed-off-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Stanislav Kirillov <stas@nebius.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-14 23:24:25 -08:00
Andreas Karatzas
98bcc6ca59
[CI][Entrypoints] Validate detokenize token IDs to prevent int64 overflow causing 500 ( #34468 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 23:08:38 -08:00
Andreas Karatzas
f13e86d8dd
[Kernels] Fix Helion GPU utils to use platform-agnostic device name API ( #34537 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 20:29:23 -08:00
Woosuk Kwon
9ca768c740
[Model Runner V2] Minor cleanup for Sampler ( #34563 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-14 18:29:03 -08:00
Thomas Parnell
d5fe3f702c
[Hybrid] Enable mamba prefix cache "align" mode with async scheduling ( #33997 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2026-02-14 13:15:56 -08:00
Cyrus Leung
73391a1baa
[Renderer] Move InputPreprocessor into Renderer (1/2) ( #34510 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-14 10:14:21 -08:00
Andreas Karatzas
b3c14229b0
[ROCm][CI] Guard sparse MLA backend imports for ROCm compatibility in tests ( #34538 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-14 07:32:09 -08:00
Roger Wang
2f186635cb
[Bugfix] Fix Qwen3.5 config loading ( #34554 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-14 03:56:11 -08:00
Christian Pinto
342a7cda2d
[Misc] Update tests and examples for Prithvi/Terratorch models ( #34416 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 23:03:51 -08:00
Kata Coder
d1ea65d0a1
[new model] add COLQwen3 code & Inference ( #34398 )
...
Signed-off-by: craftsangjae <craftsangjae@gmail.com >
Signed-off-by: katacoder <craftsangjae@gmail.com >
2026-02-14 12:15:19 +08:00
Andreas Karatzas
de42abb366
[CI] Heavy refactoring of Voxtral multimodal audio model tests ( #34294 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:04:29 -08:00
Julien Denize
60ca7981bc
Add explicit validation error for tool calls. ( #34438 )
...
Signed-off-by: juliendenize <julien.denize@mistral.ai >
2026-02-13 20:04:01 -08:00
Christian S. Perone
0ef5b9147b
fix: use __annotations__ instead of get_type_hints() for dynamic kwargs detection ( #34527 )
...
Signed-off-by: Christian S. Perone <christian.perone@gmail.com >
Signed-off-by: Christian S. Perone <perone@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2026-02-13 20:03:37 -08:00
Shiyan Deng
ed242652d7
[bug] Make sure get_modality_with_max_tokens is deterministic ( #34533 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2026-02-13 20:02:59 -08:00
Wei Zhao
b37b679770
[Feature][Perf] Support Selective CPU Weight Offloading ( #34535 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-13 20:02:24 -08:00
Andreas Karatzas
a0638d052d
[Bugfix] Fix ROCm UVA CPU weight offloading broken by #32993 ( #34543 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 20:01:42 -08:00
Harry Huang
c027541eaf
[Hybrid] Enable spec decoding in mamba cache align mode ( #33705 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 13:02:28 -08:00
Ben Browning
fd267bc7b7
[Bugfix]: Fix structured output in multi-turn gpt-oss ( #34454 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-13 11:12:48 -08:00
Michael Goin
bfaa559305
Revert "[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides" ( #34530 )
2026-02-13 10:35:29 -08:00
Richard Zou
87789c8364
[Misc] vLLM's --enforce-eager should turn off compile and cudagraphs only ( #34523 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-13 09:52:20 -08:00
Pushpinder Singh
bcd65c1f6a
[Bugfix] Replace c10::optional with std::optional in topk kernel ( #34467 )
...
Signed-off-by: Pushpinder Singh <pushpindersingh135@gmail.com >
2026-02-13 08:30:23 -08:00
Wei Zhao
59d53066d8
[Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation ( #32993 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-13 08:11:26 -08:00
LoganJane
4a9952ec1b
[Bugfix] Add quant_config in ViT of Kimi-K2.5 ( #34501 )
...
Signed-off-by: LoganJane <LoganJane73@hotmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-13 16:05:34 +00:00
Roger Wang
1dae7b7843
[Bugfix] Exclude language_model_only key from MM AOT compile hash but include in model one ( #34508 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 13:59:00 +00:00
Roger Wang
5885e330ef
[Misc] Port Qwen3.5 Configs ( #34512 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-13 05:24:25 -08:00
Ilya Boytsov
071d863e20
Extend ColBERT support to non-standard BERT backbones ( #34170 )
...
Signed-off-by: Ilya Boytsov <ilya.boytsov@aleph-alpha.com >
2026-02-13 09:53:09 +00:00
Woosuk Kwon
0916e7960b
[GDN] Use CPU tensors to build GDN metadata ( #34498 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-13 01:24:45 -08:00
Wentao Ye
3d2a026fd0
[Feature] Pipeline Parallel Async send/recv, 2.9% E2E throughput improvement ( #33368 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-13 16:38:16 +08:00
Aaron Hao
dddbff4624
[Core] Move pause and resume functions into engine ( #34125 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: Aaron Hao <ahao@anyscale.com >
Signed-off-by: hao-aaron <ahao@anyscale.com >
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-13 00:15:10 -08:00
Martin Hickey
47e9b63e1a
[KVConnector] Clean up redundant code in KV connectors ( #34147 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2026-02-13 00:14:30 -08:00
Matthias Gehre
934acddef9
[Perf] fused_moe: add int4_w4a16 benchmark support and tuning config ( #34130 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-13 00:14:27 -08:00
Marek Michalowski
742d214d6e
[Bugfix] fix the import path in moe test utils.py ( #34245 )
...
Signed-off-by: Marek Michalowski <marek.michalowski@arm.com >
2026-02-13 00:13:45 -08:00
haosdent
4137c5dfa7
[Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode ( #34418 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-13 00:13:22 -08:00
Harry Huang
7a8a46ddcb
[BugFix] Fix and optimize max_num_blocks_per_req calculation for MambaSpec ( #34440 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-13 00:13:14 -08:00
myselvess
bcf0731aa0
[New Model] support new model ovis2.6 ( #34426 )
...
Signed-off-by: myselvess <23743269+myselvess@users.noreply.github.com >
2026-02-13 00:12:45 -08:00
Cyrus Leung
ec090c2429
[Refactor] Call renderer for online IO processor request ( #34490 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 22:48:45 -08:00
Roger Wang
eea3024f43
[Bugfix] Fix mamba state dtype setting for Qwen3-Next and Qwen3.5 ( #34489 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-12 22:48:42 -08:00
Cyrus Leung
2f308214c0
[Refactor] Pass full VllmConfig to Renderer ( #34485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 22:48:38 -08:00
Cyrus Leung
1b4e8e53f8
[CI/Build] Fix CUDA re-initialization error in distributed model tests ( #34491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-13 06:43:53 +00:00
haosdent
dcf6ee8592
[Bugfix] Fix encoder cache underestimation for GLM-4V/GLM-OCR single image ( #34483 )
...
Signed-off-by: haosdent <haosdent@gmail.com >
2026-02-12 21:04:06 -08:00
Cyrus Leung
372b2e762a
[Bugfix] Standardize getting number of image patches/tokens ( #34358 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:47:01 -08:00
Andreas Karatzas
6afa587d31
[ROCm][CI] Fix serving tokens test failures ( #34047 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-13 11:27:53 +08:00
Cyrus Leung
94ed6cf6ea
Add new sections to CODEOWNERS ( #34309 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:39:28 -08:00
Harry Huang
bf37812ca7
[Hybrid] Fix and optimize block-aligned splitting in mamba cache align mode ( #33706 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:21:52 -08:00
Frank Wang
b86bf4417e
[Bugfix] Fix Random Dataset Prefix Length Inaccuracy ( #33907 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-12 18:21:19 -08:00
Yanan Cao
de13dd781f
[Kernel] [Helion] [5/N] Add Helion Autotuning infrastructure ( #34025 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:21:05 -08:00
LoganJane
62788f99a4
[Bugfix] Delete unused redundant code in Kimi-K2.5 ( #34427 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 18:18:42 -08:00
Cyrus Leung
ea5ff3a1f6
[Refactor] Simplify BOS/EOS token handling ( #34435 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:18:24 -08:00
bnellnm
04ea31baab
[Bugfix] Remove assert that's no longer valid ( #34443 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-12 18:18:15 -08:00
Harry Huang
6f019e6e0a
[BugFix] Add block_size validation for mamba cache align mode ( #34445 )
...
Signed-off-by: huanghaoyan.hhy <huanghaoyan.hhy@alibaba-inc.com >
2026-02-12 18:18:07 -08:00
Zhuohan Li
d707678dfb
Fix num_logprobs parameter description in sampler.py ( #34451 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2026-02-12 18:18:03 -08:00
Cyrus Leung
fc22cae4ac
[CI/Build] Update video URLs for testing ( #34446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 18:15:36 -08:00
Yanan Cao
96161fe978
[Kernel] [Helion] [4/N] Add silu_mul_fp8 Helion kernel ( #33373 )
...
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com >
2026-02-12 18:13:12 -08:00
Jaewon
4453ba8d9e
[Core] Profiler improvements and lazy initialization ( #33198 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:38 -08:00
Jaewon
aa181c923b
[Core] Add sleep level 0 mode with enqueue/wait pattern ( #33195 )
...
Signed-off-by: Jaewon Lee <jaewon@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2026-02-12 16:16:25 -08:00
Alec S
be7370daf3
[Frontend] Enable generic structured_outputs for responses API ( #33709 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
2026-02-12 16:15:48 -08:00
Mengtao (Martin) Yuan
9ea1f598ce
Use paged_attention_v1 for sliding window decode in rocm_aiter_fa ( #34378 )
...
Signed-off-by: Martin Yuan <myuan@meta.com >
Co-authored-by: Martin Yuan <myuan@meta.com >
2026-02-12 16:14:43 -08:00
amitz-nv
f120bd42d3
[Kernel] Support Flashinfer trtllm fused MoE non gated FP8 & NVFP4 ( #33506 )
...
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com >
2026-02-12 13:06:58 -08:00
Hashem Hashemi
fac4e96940
small adjustment to wvSplitKrc ( #34410 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-12 20:26:36 +00:00
Michael Goin
6d4e27ce29
[Bugfix] Enforce DeepGEMM when using sparse_attn_indexer on CUDA ( #34374 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 12:08:06 -08:00
Andreas Karatzas
4c078fa546
[ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility ( #34447 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-12 18:47:34 +00:00
Patrick von Platen
6c0baee610
[Voxtral Realtime] Refactor & Improve buffering logic ( #34428 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-12 09:46:43 -08:00
Patrick von Platen
1100a97621
[Voxstral Realtime] Enable tests ( #33803 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2026-02-12 09:43:24 -08:00
xuebwang-amd
766e167821
[ROCm][quantization] improve OCP weight quant parser robust ( #34431 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 09:40:19 -08:00
Isotr0py
becbe24808
[Bugfix] Remove broken raw url GGUF model loading support ( #34433 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-12 09:40:01 -08:00
Harry Mellor
679ca5d8d3
Fix MoE for the Transformers modelling backend ( #34436 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-12 09:29:42 -08:00
Matthew Bonanni
f2c47886fd
[Attention] Add FlashInfer Sparse MLA backend ( #33451 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2026-02-12 17:21:54 +00:00
Nicolò Lucchesi
334c715e0f
[Docs] Spec decoding docs warning removal ( #34439 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2026-02-12 09:01:51 -08:00
Aaron Hao
7b5a8b4a9d
[BUG] Reset running requests when clearing cache for pause/resume ( #34382 )
...
Signed-off-by: hao-aaron <ahao@anyscale.com >
2026-02-12 16:19:13 +00:00
danisereb
dea63512bb
Add config file for fused MoE for Nemotron (TP4, B200) ( #34411 )
...
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com >
2026-02-12 06:09:55 -08:00
Douglas Lehr
8a798be929
[ROCm] Enable MXFP4 MoE weight pre-shuffling on gfx950 and update aiter ( #34192 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2026-02-12 05:06:33 -08:00
Cyrus Leung
fb455ed547
[V0 Deprecation] Remove code related to per-request logits processors ( #34400 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-12 20:44:28 +08:00
baonudesifeizhai
f5897613fb
Fix Mistral config remap to accept compressed-tensors quantization #34028 ( #34104 )
...
Signed-off-by: baonudesifeizhai <baonudesifeizhai@gmail.com >
2026-02-12 08:22:06 +00:00
Louie Tsai
55a1a9563a
Vllm CPU benchmark suite improvement ( #34128 )
...
Signed-off-by: louie-tsai <louie.tsai@intel.com >
2026-02-12 16:04:44 +08:00
AllenDou
386bfe5d08
[bugfix] refactor FunASR's _get_data_parser ( #34397 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-12 07:26:49 +00:00
Kyle Sayers
e9cd691132
[Bugfix] Fix Sparse24 Compressed Tensors models ( #33446 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 23:15:16 -08:00
Yichuan Wang
80f2ba6ea6
Fix DeepSeek-OCR tensor validation for all size variants ( #34085 )
...
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-11 22:50:23 -08:00
Lucas Wilkinson
136b0bfa59
[BugFix] Fix DP chunking ( #34379 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Bill Nell <bnell@redhat.com >
2026-02-12 06:44:03 +00:00
Cyrus Leung
b96f7314b4
[Refactor] Pass Renderer to Input Processor ( #34329 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:38:11 -08:00
Cyrus Leung
ced2a92f40
[Refactor] Move validation to params definitions ( #34362 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 19:33:15 -08:00
Runkai Tao
e1d97c38f8
[Bug Fix] Fix naive_block_assignment always defaulting to False due to arg misalignment ( #33848 )
...
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu >
2026-02-12 11:30:57 +08:00
Michael Goin
ec12d39d44
[Bugfix] Fix MTP accuracy for GLM-5 ( #34385 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-12 11:08:19 +08:00
Michael Goin
ff1f83b056
[Refactor] Replace activation: str with MoEActivation enum ( #33843 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2026-02-11 17:29:32 -08:00
Kevin H. Luu
83b47f67b1
[ci] Integrate AMD tests into CI ( #33626 )
...
Signed-off-by: Kevin H. Luu <khluu000@gmail.com >
Signed-off-by: khluu <khluu000@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
2026-02-12 08:54:17 +08:00
Micah Williamson
fb7b30c716
[ROCm][CI] Revert Test Groups From mi325_8 to mi325_1 Agent Pool In AMD CI ( #34384 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 15:52:34 -08:00
bnellnm
31d992d215
[Bugfix] Fix some issues with MoERunner PR #32344 ( #34371 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-11 14:33:14 -08:00
Wei Zhao
5aff2699bd
Fix CI failure - Flashinfer Kernel tests ( #34316 )
...
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com >
2026-02-11 14:17:16 -08:00
Raushan Turganbay
527ca32197
[Bugfix] Fix more multimodal tests for transformers V5 ( #34334 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2026-02-11 22:02:05 +01:00
Junseo Park
5458eb835d
[Bugfix] send None sentinel on final commit so server properly sends transcription.done ( #33963 )
...
Signed-off-by: pjs102793 <pjs102793@naver.com >
Co-authored-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 21:01:53 +00:00
Tomas Ruiz
144d9b7cc8
[Benchmarks] Reduce ready checker log verbosity ( #34349 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2026-02-11 20:57:57 +00:00
elvischenv
83e26c834e
[GPT-OSS] Remove unnecessary contiguous ( #34337 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2026-02-11 15:29:29 -05:00
TJian
5001211369
[ROCm] [CI] fix test_unrecognized_env ( #34350 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2026-02-11 18:50:44 +00:00
Eldar Kurtić
11c7ace340
[Bugfix] Enable attn quantization of Llama-4 by correctly permuting scales for rope (int8, fp8) ( #34243 )
...
Signed-off-by: Your Name <you@example.com >
Co-authored-by: Your Name <you@example.com >
2026-02-11 13:24:22 -05:00
Xinyu Dong
be7f3d5d20
[Bugfix] fix default is_neox_style is True for deepseek ( #34353 )
...
Signed-off-by: dongxinyu03 <dongxinyu03@baidu.com >
2026-02-11 18:20:45 +00:00
Isotr0py
0ab06100f4
[Multimodal] Expose mm_processor_kwargs for DummyInputsBuilder ( #34330 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2026-02-11 09:37:40 -08:00
Xinyu Chen
ffb3d553cc
[Model Runner V2] Init cuda graph pool when necessary ( #33217 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
2026-02-11 09:12:13 -08:00
junuxyz
fa7e0bfacf
[CI][BugFix] Fix silent failure in shellcheck hook and baseline exist… ( #32458 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-11 17:03:48 +00:00
SorenDreano
48134a2c22
[Docs] Fix typo ("defult") and double spacing ( #34348 )
...
Signed-off-by: SorenDreano <71752785+SorenDreano@users.noreply.github.com >
Co-authored-by: Soren Dreano <soren@numind.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-11 09:02:27 -08:00
kliuae
64f570ab56
[ROCm] [aiter] Split KV cache update for AiterFlashAttention ( #33681 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2026-02-11 16:26:44 +00:00
Rohan Potdar
fd618871b4
[Bugfix]: Fix ROCm fusion attn test; use AttentionBackend utils to create kv cache ( #33948 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2026-02-11 11:12:05 -05:00
Harry Mellor
67a42b5a44
Don't try and run GLM-ASR with remote code ( #34352 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 08:09:40 -08:00
Lucas Wilkinson
c7914d30f9
Reapply [Attention][FA3] Update FA3 to include new swizzle optimization ( #34043 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-11 07:07:56 -08:00
Adam Binford
1b8756562e
Responses harmony system message structured ( #34268 )
...
Signed-off-by: Adam Binford <adamq43@gmail.com >
2026-02-11 05:14:28 -08:00
Linda
275e0d2a99
[NVIDIA][test] Tests for flashinfer TRTLLM BF16 MoE ( #33715 )
...
Signed-off-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com >
Co-authored-by: Pavani Majety <pmajety@nvidia.com >
2026-02-11 12:38:11 +00:00
Harry Mellor
0f5e55e7a8
Make JAIS compatible with Transformers v5 ( #34264 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 12:30:37 +00:00
Harry Mellor
1e9204bff3
Make Qwen3VL compatible with Transformers v5 ( #34262 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-11 04:13:23 -08:00
Li, Jiang
05339a7b20
[Bugfix][CPU] Fix llama4 inference on CPU ( #34321 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2026-02-11 19:07:23 +08:00
Harry Mellor
40b8f55358
[Docs] Reduce time spent generating API docs ( #34255 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-11 02:56:02 -08:00
Seiji Eicher
5045d5c983
Patch protobuf for CVE-2026-0994 ( #34253 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Kevin H. Luu <khluu000@gmail.com >
2026-02-11 02:25:04 -08:00
Nick Hill
e09546cf05
[Frontend] Exploit tokenizers "new stream" in FastIncrementalDetokenizer ( #34217 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 11:03:24 +01:00
Tianqi Ren
786806dd44
[Doc] Update Marlin support matrix for Turing ( #34319 )
...
Signed-off-by: Tianqi Ren <tianqi.r@outlook.com >
2026-02-11 09:03:41 +00:00
Nick Hill
79504027ef
[Misc] Bump fastsafetensors version for latest fixes ( #34273 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-11 00:30:09 -08:00
Luka Govedič
addac0e653
[torch.compile] Enable AR+rms fusion by default available for -O2 ( #34299 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-11 00:30:00 -08:00
Cyrus Leung
675a22ed66
[Chore] Move BaseRenderer to base.py ( #34308 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 00:29:51 -08:00
Kunshang Ji
cb9574eb85
[XPU][9/N] clean up existing ipex code/doc ( #34111 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2026-02-11 00:27:15 -08:00
AllenDou
21dfb842d7
[model] support FunASR model ( #33247 )
...
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2026-02-11 07:37:09 +00:00
R3hankhan
d1b837f0ae
[CPU] Enable FP16 (Half dtype) support for s390x ( #34116 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2026-02-11 14:41:42 +08:00
Roger Wang
0b20469c62
[Bugfix] Fix weight naming in Qwen3.5 ( #34313 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 21:37:14 -08:00
Tyler Michael Smith
d7982daff5
[Bugfix] Fix fused MoE IMA (sans chunking) by using int64 for strides ( #34279 )
...
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-11 05:15:52 +00:00
Robert Shaw
9b17c57460
[ModelBash][DSR1 NVFp4] Removed Bf16 Bias Cast ( #34298 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2026-02-11 05:00:00 +00:00
Hashem Hashemi
1b3540e6c6
Threshold fix wvSplitk for occasional CI fails ( #34013 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
2026-02-11 03:59:14 +00:00
Matthias Gehre
7a048ee65f
[Bugfix] Fix benchmark_moe.py inplace assertion with torch >= 2.9 ( #34149 )
...
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com >
2026-02-11 03:58:56 +00:00
Cyrus Leung
c9a1923bb4
[Plugin] Simplify IO Processor Plugin interface ( #34236 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:47:39 -08:00
zofia
b482f71e9f
[XPU][7/N] enable xpu fp8 moe ( #34202 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-11 03:33:59 +00:00
Дзержи́нский
1485396abb
[Kernel] Apply 256bit LDG/STG To Activation Kernels ( #33022 )
...
Signed-off-by: Dzerzhinsky <256908701+AstroVoyager7@users.noreply.github.com >
Signed-off-by: Дзержи́нский <256908701+AstroVoyager7@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2026-02-10 19:31:51 -08:00
Kebe
5ee5c86eeb
[Bugfix][DeepSeek-V3.2] fix fp8 kvcache type cast ( #33884 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2026-02-10 19:31:36 -08:00
Cyrus Leung
b5dcb372e4
[Misc] Clean up validation logic in input processor ( #34144 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 19:29:29 -08:00
Tyler Michael Smith
066c6da6a0
[WideEP] Fix nvfp4 DeepEP High Throughput All2All backend ( #33738 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 19:15:43 -08:00
Richard Zou
e30cedd44b
[torch.compile] Stop doing unnecessary FakeTensorProp in PiecewiseCompileInterpreter ( #34093 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 19:15:40 -08:00
Cyrus Leung
3bcd494ef4
[Redo] Add --trust-remote-code to dataset bench args ( #34251 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-11 11:10:12 +08:00
tianshu-Michael-yu
0e725a7d22
[Bugfix] Fix Worker.load_model context-manager composition for sleep mode ( #34021 )
...
Signed-off-by: tianshu.yu <tianshuyu.formal@gmail.com >
2026-02-11 11:07:51 +08:00
Lucas Wilkinson
ba0511fd80
[Misc] Add run one batch script that supports profiling ( #32968 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2026-02-10 18:29:49 -08:00
Micah Williamson
4a1550d22d
[ROCm][CI] Fix test_sequence_parallel.py location in AMD CI pipeline ( #34280 )
...
Signed-off-by: Micah Williamson <micah.williamson@amd.com >
2026-02-11 01:08:11 +00:00
bnellnm
d1481ba783
[MoE Refactor] Introduce MoERunner abstraction and move execution logic from FusedMoE to DefaultMoERunner ( #32344 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2026-02-10 19:51:07 -05:00
7. Sun
dc6de33c3d
[CI] Add pip caching to cleanup_pr_body workflow ( #32979 )
...
Signed-off-by: 7. Sun <jhao.sun@gmail.com >
2026-02-11 00:45:28 +00:00
Tyler Michael Smith
c4b9e6778f
[Misc] Add pre-commit hook to catch boolean ops in with-statements ( #34271 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com >
2026-02-10 15:13:20 -08:00
Richard Zou
341eed3d30
[torch.compile] Disable recursive pre_grad_passes ( #34092 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2026-02-10 18:02:31 -05:00
Zhengkai Zhang
6f2f59f2b3
[Misc][Spec Decode] support different load config for draft model ( #34022 )
...
Signed-off-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
Co-authored-by: zzhengkai <zzhengkai@devgpu049.ldc1.facebook.com >
2026-02-10 14:52:43 -08:00
Ilya Markov
bb2fc8b5e7
[BugFix] Fix async EPLB hang with DeepEP LL all2all backend ( #32860 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
2026-02-10 22:34:47 +00:00
Ilya Markov
67132945bb
[Perf] Move eplb rebalance algo to async thread ( #30888 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2026-02-10 22:19:10 +00:00
Gregory Shtrasberg
f0ca0671c7
[Feature] Warn about unrecognized environment variables ( #33581 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-10 15:45:38 -06:00
Pavani Majety
578977bb5e
[SM100] Resubmit FMHA FP8 prefill for MLA ( #31195 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2026-02-10 16:18:43 -05:00
Roger Wang
9615575afc
[Bugfix] Fix mamba cache dtype for Qwen3.5 ( #34200 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:12:31 -08:00
Matthew Bonanni
4293c00b84
[Benchmarks] Fix attention benchmark smoke test ( #34269 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-10 16:04:07 -05:00
J Seppänen
506ad7d7c1
[Bugfix] Fix weights offloading for sleep mode ( #32947 )
...
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2026-02-10 20:38:17 +00:00
Reagan Lee
fdd6f2ad58
Convert online APIs to use Renderer ( #34084 )
...
Signed-off-by: Reagan Lee <reaganjlee@gmail.com >
Co-authored-by: Reagan Lee <reaganjlee@gmail.com >
2026-02-10 19:44:31 +00:00
Qi Wang
33bcd3dc3b
[Misc] Introduce ec_both role EC (encoder cache) connector ( #34182 )
...
Signed-off-by: Qi Wang <qiwa@nvidia.com >
2026-02-10 18:55:35 +00:00
Michael Goin
1f5febb4b8
[UX nit] Fix non-default api_server_count message ( #34152 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2026-02-10 10:35:58 -08:00
Andy Lo
ae871ca923
Minor cleanup for Voxtral ( #34247 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2026-02-10 18:18:30 +00:00
Woosuk Kwon
a2443de5fa
[Model Runner V2] Use pinned memory for write_contents ( #34222 )
...
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai >
2026-02-10 08:55:22 -08:00
Harry Mellor
f84a2a8f31
[Docs] Speed up build environment set-up ( #34240 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 16:34:43 +00:00
Vadim Gimpelson
000214c4bb
[BUGFIX] Fix accuracy bugs in Qwen3-Next MTP ( #34077 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2026-02-10 10:57:11 -05:00
junuxyz
c5a66d1697
[Core][BugFix] Fix PP KV cache sharding memory validation ( #33698 )
...
Signed-off-by: junuxyz <216036880+junuxyz@users.noreply.github.com >
2026-02-10 10:46:24 -05:00
Roberto L. Castro
afdce12c89
[Perf][Kernel] Add faster topKperRow decode kernel for DeepSeek-V3.2 sparse attention ( #33680 )
...
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com >
Signed-off-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com >
2026-02-10 10:29:52 -05:00
Zhengxu Chen
82e11973cc
[compile] Enable AOT compile with 2.10 in trunk. ( #34155 )
...
Signed-off-by: Zhengxu Chen <zhxchen17@meta.com >
2026-02-10 23:24:42 +08:00
xuebwang-amd
b129136c7a
[ROCm][Quantization] GPT_OSS in amd-quark format model loading and emulations ( #29008 )
...
Signed-off-by: xuebwang-amd <xuebwang@amd.com >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2026-02-10 10:08:05 -05:00
mgazz
599e4335a4
Support benchmarking of Geospatial models ( #33922 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
2026-02-10 07:04:16 -08:00
Fan Yang
a1946570d8
add --insecure arg to the vllm bench to skip TLS ( #34026 )
...
Signed-off-by: Fan Yang <yan9fan@meta.com >
Co-authored-by: Fan Yang <yan9fan@meta.com >
2026-02-10 22:23:52 +08:00
Harry Mellor
d0bc520569
Bump mamba-ssm version in CI for Transformers v5 compatibility ( #34233 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 14:46:01 +01:00
Krish Gupta
748625cdaf
[V1][BugFix] Fix EAGLE3 encoder cache miss with disable_chunked_mm_input ( #34220 )
...
Signed-off-by: KrxGu <krishom70@gmail.com >
2026-02-10 13:05:32 +00:00
Harry Mellor
61413973e8
Stop testing for slow tokenizers as they will not exist soon ( #34235 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2026-02-10 12:08:20 +00:00
Phúc H. Lê Khắc
94de871546
[Misc] allow specify is_mm_prefix_lm in hf_config ( #34215 )
2026-02-10 11:16:21 +00:00
tc-mb
e042d7e685
Add flagos in MiniCPM-o ( #34126 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Signed-off-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
Co-authored-by: Vincent-Xiao <vincent.xiao.me@gmail.com >
2026-02-10 02:51:48 -08:00
Roger Wang
ae4e280602
[Bugfix] Fix FI kernel chunk_gated_delta_rule output shape for Qwen3.5 ( #34219 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 10:41:24 +00:00
zzaebok
cbea11c9f0
[Docs] Fix format error in KV load failure recovery doc ( #34137 )
...
Signed-off-by: Jaebok Lee <jaebok9541@naver.com >
2026-02-10 02:16:26 -08:00
Cyrus Leung
2c32558a3c
[Bugfix] Fix --trust-remote-code conflict ( #34218 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 00:29:10 -08:00
Zetong Li
5f970120f0
[Bugfix] Fix memory inconsistency in cross-process shared memory ( #32022 )
...
Signed-off-by: Zetong Li <slippersss@126.com >
2026-02-10 08:22:03 +00:00
Cyrus Leung
998e2d91f8
Revert #34208 ( #34216 )
2026-02-09 23:59:04 -08:00
Wentao Ye
e1060a71a1
[Perf] Optimize detokenizer python logic ( #32975 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2026-02-09 23:54:41 -08:00
Chen Zhang
97fa8f6590
[BugFix] Avoid prefix cache hit in the same schedule step for mamba layers ( #29387 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2026-02-10 07:41:16 +00:00
wang.yuqi
dab1de9f38
[Frontend][CI] Consolidate instrumentator entrypoints ( #34123 )
...
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io >
2026-02-10 07:30:19 +00:00
Balaxxe
8d48d0a9d9
[Bugfix] Sort hf_weights_files in fastsafetensors_weights_iterator to match #33491 ( #34190 )
...
Signed-off-by: Balaxxe <136368465+jaim12005@users.noreply.github.com >
2026-02-09 23:06:30 -08:00
Andrew Xia
9608844f96
[responsesAPI] fix simpleContext streaming output_messages ( #34188 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2026-02-09 22:53:07 -08:00
Cyrus Leung
f69b903b4c
[Bugfix] Add --trust-remote-code to dataset bench args ( #34208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 22:37:50 -08:00
Lucas Wilkinson
81e217fe6b
[Bugfix] Fix DP Attention Padding in Dummy Run ( #34187 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com >
2026-02-10 05:29:39 +00:00
Cyrus Leung
ab97bcf662
[CI/Build] Relax test_mcp_tool_call ( #34204 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-10 05:18:57 +00:00
Cyrus Leung
25e48a3aae
[Doc] Update usage of --limit-mm-per-prompt ( #34148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2026-02-09 21:12:13 -08:00
Roger Wang
8a5e0e2b2b
[Bugfix][Core] Fix CPU memory leak from Request reference cycle in prefix caching ( #34183 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 13:03:32 +08:00
Andreas Karatzas
4cde2e0159
[ROCm][Bugfix] Resolve Dynamo tracing crash from amdsmi calls in on_gfx* arch detection ( #34108 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 20:50:20 -08:00
Roger Wang
047a457fa4
[Bugfix] Adopt ChunkGatedDeltaRule for Qwen3.5 ( #34198 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-10 03:47:54 +00:00
Yuwei An
e94ec59733
[LMCache] Token Base IPC API ( #34175 )
...
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com >
2026-02-10 01:18:42 +00:00
Ning Xie
13397841ab
[structured output] validate unsupported json features first ( #33233 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2026-02-09 23:49:09 +00:00
Gregory Shtrasberg
c60f8e3b49
[Bugfix][ROCm][GPT-OSS] Use old triton_kernels implementation on ROCm if the new API is not available ( #34153 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2026-02-09 17:38:54 -06:00
Michael Goin
5e75a14a66
[Doc] Add DCP support to attention backend doc ( #33936 )
2026-02-09 18:33:43 -05:00
Nick Hill
e7e52781ff
[ModelRunner V2][BugFix] Fix max_query_len calculation ( #34167 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
2026-02-09 21:47:17 +00:00
Charlie Fu
bb9f97308d
[torch.compile][Fusion] Fix attention fusion pass removing kv_update op. ( #33945 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2026-02-09 16:15:43 -05:00
Hongxia Yang
4d39650961
[ROCm] update triton branch to support gpt-oss models for gfx11xx devices ( #34032 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2026-02-09 19:36:30 +00:00
Artus Krohn-Grimberghe
8fd31f6245
[Bugfix] Voxtral prompt/audio placeholder alignment ( #34140 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:30:38 +00:00
Artus Krohn-Grimberghe
eadb4e868b
[Bugfix] Avoid duplicate k-proj weight emission in helper ( #34142 )
...
Signed-off-by: Artus KG <artuskg@gmail.com >
2026-02-09 19:17:44 +00:00
Jiangyun Zhu
285bab4752
[Kernel] use flashinfer for gdn prefill ( #32846 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2026-02-09 12:17:25 -05:00
TomerBN-Nvidia
995bbf38f1
[Bugfix] Fix shared expert input for latent MoE in EP+DP (Nemotron-H) ( #34087 )
...
Signed-off-by: Tomer Natan <tbarnatan@nvidia.com >
Co-authored-by: Cursor <cursoragent@cursor.com >
2026-02-09 16:44:18 +00:00
Mohammad Miadh Angkad
d4f123cc48
[Kernel] FlashInfer: switch allreduce fusion to unified API ( #33985 )
...
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com >
2026-02-09 15:43:24 +00:00
ZhengHongming888
cb62e86f83
Add NUMA Core binding in nixl_connector for CPU xPyD ( #32365 )
...
Signed-off-by: Hongming Zheng <hongming.zheng@intel.com >
Signed-off-by: ZhengHongming888 <hongming.zheng@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-09 15:39:12 +00:00
Luka Govedič
781ddf7868
[CI][torch.compile] Fix incorrect filtering for E2E fusion tests on B200 ( #34031 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2026-02-09 10:05:14 -05:00
Roger Wang
64a9c2528b
[UX] Add --language-model-only for hybrid models ( #34120 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2026-02-09 14:57:33 +00:00
Lucas Wilkinson
d0d97e2974
[Misc] Fix up attention benchmarks ( #33810 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2026-02-09 09:42:03 -05:00
JJJYmmm
9562912cea
[MODEL] Adding Support for Qwen3.5 Models ( #34110 )
...
Signed-off-by: JJJYmmm <1650675829@qq.com >
Signed-off-by: JJJYmmm <92386084+JJJYmmm@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: wulipc <wulipc@users.noreply.github.com >
Co-authored-by: ywang96 <ywang96@users.noreply.github.com >
Co-authored-by: Isotr0py <Isotr0py@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2026-02-09 21:12:58 +08:00
zofia
9bdb06b436
[XPU][6/N] add xpu scaled_mm kernel ( #34117 )
...
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com >
2026-02-09 20:17:35 +08:00
Nikhil Gupta
caad9f1e01
[Fix] [CPU Backend] : Prepack weights for w8a8 oneDNN matmul ( #33901 )
...
Signed-off-by: nikhil-arm <nikhil.gupta2@arm.com >
2026-02-09 18:04:41 +08:00
Ekagra Ranjan
1d5922fade
[ASR] Fix audio benchmark and add RTFx metric ( #32300 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2026-02-09 10:02:37 +00:00
Andreas Karatzas
3025b3cebb
[CI] Remove empty image_size_factors for fuyu, glm4_1v, glm_ocr ( #34107 )
...
Signed-off-by: Andreas Karatzas <akaratza@amd.com >
2026-02-09 17:37:04 +08:00
Jee Jee Li
978a37c823
[Model] GLM adaptation ( #34124 )
2026-02-09 17:32:52 +08:00
ihb2032
5a5c43511a
fix(cpu): fix mla_decode compilation on x86 without AVX512 ( #34052 )
...
Signed-off-by: ihb2032 <hebome@foxmail.com >
Co-authored-by: root <root@LAPTOP-FKNHV411.localdomain >
2026-02-09 08:55:41 +00:00
Nick Hill
d9bede0314
[BugFix] Fix fastsafetensors TP all procs using all GPUs ( #34070 )
...
Signed-off-by: Nick Hill <nickhill123@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2026-02-09 15:15:46 +08:00