Commit Graph

  • d2f4a71cd5 [Bugfix] Kimi-K2 grouped_topk usage for Flashinfer monolithic kernels. (#33858) Pavani Majety 2026-02-05 01:32:10 -08:00
  • 2abd97592f [KV Connector][Metrics] Do not count local prefix cache hits in connector queries (#30522) Mark McLoughlin 2026-02-05 07:57:27 +00:00
  • 6abb0454ad [Perf] Optimize the performance of structured output + reasoning (#33557) Chauncey 2026-02-05 15:45:29 +08:00
  • db6f71d4c9 [CI/Build] Fix CPU CI test case title (#33870) Li, Jiang 2026-02-05 15:07:14 +08:00
  • fd03538bf9 [CPU][BugFix] Allow w8a8 oneDNN quantized matmul to support 3D inputs (#33727) Fadi Arafeh 2026-02-05 06:26:09 +00:00
  • 1f70313e59 [Bugfix] Fix ScoreMultiModalParam multi-document scoring returning single result (#33837) Andreas Karatzas 2026-02-05 00:17:00 -06:00
  • 07daee132b [CI/Build] Parallelize CPU CI tests (#33778) Li, Jiang 2026-02-05 13:53:48 +08:00
  • 9595afda18 [2/N] move responses/serving _make_response_output_items logic to parser (#33281) Andrew Xia 2026-02-05 00:46:15 -05:00
  • c1395f72cd [CI][AMD][BugFix] Ensure VLLM_ROCM_USE_AITER is set so test_rocm_aiter_topk.py can run correctly (#33840) rasmith 2026-02-04 23:05:48 -06:00
  • 007b183d74 [docs] fix unintentional misspellings (#33863) rinbaro 2026-02-04 20:50:59 -08:00
  • add9f1fbd9 [Minor] Include StreamingInput in inputs package (#33856) Nick Hill 2026-02-04 20:38:20 -08:00
  • e3bf79ffa0 Revert "[Attention][FA3] Update FA3 to include new swizzle optimization" (#33841) Luka Govedič 2026-02-04 22:54:27 -05:00
  • fb1270f1f8 [CI][Bugfix]: return McpCall for built-in MCP tools in non-streaming mode (#32762) Andreas Karatzas 2026-02-04 21:14:06 -06:00
  • 72bb24e2db [release] Minor fixes to release annotation (#33849) Kevin H. Luu 2026-02-04 18:07:35 -08:00
  • a7be77beef [Bugfix] fix DeepSeek R1 with CUTLASS MLA Broken on B200 (#33637) Chauncey 2026-02-05 09:28:36 +08:00
  • bbe0574d8e [Bugfix] Disable TRTLLM attention when KV transfer is enabled (#33192) (tag: v0.15.2rc0) zhanqiuhu 2026-02-04 19:49:18 -05:00
  • 4d9513537d [CI][torch.compile] Reduce e2e fusion test time (#33293) Luka Govedič 2026-02-04 19:09:03 -05:00
  • 439afa4eea feat: Add ColBERT late interaction model support (#33686) Ilya Boytsov 2026-02-05 01:05:13 +01:00
  • fa4e0fb028 [Core] Don't schedule spec tokens with prefill chunks (#33652) Nick Hill 2026-02-04 15:40:22 -08:00
  • ce498a6d61 Change the type signature of MixtureOfExperts.expert_weights to MutableSequence[Sequence[Tensor]] (#33573) Sage Moore 2026-02-04 14:02:46 -08:00
  • 9f14c9224d Revert "[torch.compile] Significantly speed up cold start times" (#33820) Richard Zou 2026-02-04 13:59:59 -08:00
  • 535de06cb1 [Model] Add transcription support for Qwen3-Omni (#29828) Muhammad Hashmi 2026-02-04 13:17:47 -08:00
  • 4292c90a2a [Bugfix] Support RotaryEmbedding CustomOp for gpt-oss (#33800) Simon Danielsson 2026-02-04 21:17:41 +01:00
  • 6e98f6d8b6 Implement zero-copy GQA for multimodal and CPU (#33732) Taeksang Kim 2026-02-05 05:11:39 +09:00
  • 2f6d17cb2f [rocm][ray] Fix: Unify Ray device visibility handling across CUDA and ROCm (#33308) kourosh hakhamaneshi 2026-02-04 10:09:14 -08:00
  • 192ad4648b [Bugfix] Fix interns1-pro initialization and PP (#33793) Isotr0py 2026-02-05 01:54:45 +08:00
  • 0e92298622 [Misc] Delay deprecation of CommonAttentionMetadata properties (#33801) Lucas Wilkinson 2026-02-04 09:41:57 -07:00
  • 87d9a26166 [Bugfix] Fix ubatch wrapper num_tokens calculate (#33694) jiangkuaixue123 2026-02-05 00:41:45 +08:00
  • 80f921ba4b [Bugfix] Fix normalize still being passed to PoolerConfig (#33794) Cyrus Leung 2026-02-04 23:56:02 +08:00
  • 711edaf0d0 [Perf] Optimize spec decoding + async scheduling, 1.5% Throughput improvement (#33612) Wentao Ye 2026-02-04 09:34:32 -05:00
  • 1d367a738e [Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching (#33713) Micah Williamson 2026-02-04 07:36:29 -06:00
  • 32a02c7ca2 Apply #33621 to main (#33758) Cyrus Leung 2026-02-04 21:35:39 +08:00
  • f67ee8b859 [Perf] Optimize chat completion streaming performance (#33782) Chauncey 2026-02-04 20:30:36 +08:00
  • e57ef99b40 [Model] Apply #32631 for recent models (#33785) Cyrus Leung 2026-02-04 20:23:01 +08:00
  • f8516a1ab9 [Bugfix][Model] Fix audio-in-video support for Qwen2.5-Omni and Qwen3-Omni (#33605) Yueqian Lin 2026-02-04 07:15:29 -05:00
  • 824058076c [PERF] Change GDN Attention State Layout from [N, HV, K, V] to [N, HV, V, K] (#33291) Vadim Gimpelson 2026-02-04 15:20:52 +04:00
  • 8e32690869 [KV Connector][BugFix] scheduler: Delay freeing blocks of aborted async loads (#32255) Or Ozeri 2026-02-04 13:16:34 +02:00
  • a208439537 [compile] Remove runner type from ignored caching factor list. (#33712) Zhengxu Chen 2026-02-04 05:56:45 -05:00
  • bcd2f74c0d [compile] Clean up AOT compile bypass on evaluate_guards. (#33578) Zhengxu Chen 2026-02-04 05:12:53 -05:00
  • f79f777803 [XPU][2/N] add support unquantized moe support for xpu (#33659) Kunshang Ji 2026-02-04 18:12:25 +08:00
  • 4c8d1bf361 use ORJSONResponse when available to improve the efficiency of request process (#33548) Augusto Yao 2026-02-04 18:04:11 +08:00
  • 061da6bcf7 [XPU] remove common path warning log (#33769) Kunshang Ji 2026-02-04 16:40:17 +08:00
  • 4403e3ed4c [Metrics] Add labeled prompt token metrics for P/D disaggregation (#33290) zhanqiuhu 2026-02-04 02:46:48 -05:00
  • 08e094997e [Hardware][AMD][CI] Refactor AMD tests to properly use BuildKite parallelism (#32745) Matt 2026-02-04 00:51:33 -06:00
  • d88a1df699 [Deprecation] Deprecate profiling envs (#33722) Wentao Ye 2026-02-04 00:58:21 -05:00
  • 90d74ebaa4 [Deprecation] Remove _get_data_parser in MM processor (#33757) Cyrus Leung 2026-02-04 13:51:52 +08:00
  • 45f8fd6f97 [Feature] Enable TRITON_ATTN for Batch Invariance (#33688) Frank Wang 2026-02-03 21:27:34 -08:00
  • 5e1e0a0fbd [Refactor] Remove unused dead code (#33718) Wentao Ye 2026-02-04 00:25:11 -05:00
  • eb5ed20743 [Bugfix] Define router_logits_dtype for remaining MoE models (#33737) Michael Goin 2026-02-04 00:24:14 -05:00
  • 2647163674 Save startup benchmark results as a list of values (#33629) Huy Do 2026-02-03 20:37:51 -08:00
  • 9fb27dd3b3 [MM] Align the prefix of MMEncoderAttention with Attention (#33750) Shanshan Shen 2026-02-04 12:07:30 +08:00
  • 4dffc5e044 [CPU] Split attention dispatch by head_dim alignment (#32161) R3hankhan 2026-02-04 09:07:15 +05:30
  • e1bf04b6c2 [1/N] Initial Implementation of Parser for ResponsesAPI (#32712) Andrew Xia 2026-02-03 21:59:03 -05:00
  • 02080179a3 [Bugfix] Fix torchrun PP broadcast deadlock with async scheduling (#33701) Isotr0py 2026-02-04 10:17:37 +08:00
  • 1b8fe6f7c4 [Frontend][4/n] Make pooling entrypoints request schema consensus | ScoreRequest (#33060) wang.yuqi 2026-02-04 09:48:40 +08:00
  • 1892993bc1 [BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729) (tag: v0.15.1rc1, tag: v0.15.1) Nick Hill 2026-02-03 15:34:41 -08:00
  • 7d98f09b1c cherry pick Michael Goin 2026-02-03 16:26:51 -05:00
  • daa2784bb9 [Bugfix] Disable RoutingMethodType.[Renormalize,RenormalizeNaive] TRTLLM per-tensor FP8 MoE (#33620) Michael Goin 2026-02-03 05:37:15 -05:00
  • 52ee21021a [BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729) Nick Hill 2026-02-03 15:34:41 -08:00
  • 655efb3e69 [Dependency] Remove comments of ray in dependency files (#33351) Wentao Ye 2026-02-03 18:30:47 -05:00
  • bd8da29a66 [Bugfix] Fix sparse MLA metadata building (#33579) Matthew Bonanni 2026-02-03 18:29:48 -05:00
  • 2a99c5a6c8 [Bugfix] Disable TRTLLM FP8 MoE if router_logits_dtype==float32 and routing_method!=DeepSeekV3 (#33613) Michael Goin 2026-02-03 16:26:51 -05:00
  • 3f7662d650 [Voxtral Realtime] Change name (#33716) Patrick von Platen 2026-02-03 22:03:28 +01:00
  • a372f3f40a [MISC] Fix Tensor Parallelism for Quantized Mamba Models with n_groups=1 (#33257) Vadim Gimpelson 2026-02-04 00:10:31 +04:00
  • 61e632aea1 Turn @config into a dataclass_transform (#31541) Harry Mellor 2026-02-03 17:40:59 +00:00
  • b1bb18de8d [torch.compile] Significantly speed up cold start times (#33641) Richard Zou 2026-02-03 09:12:11 -08:00
  • 2267cb1cfd [Attention][FA3] Update FA3 to include new swizzle optimization (#23465) Lucas Wilkinson 2026-02-03 09:08:47 -07:00
  • 0d6ccf68fa [P/D] rework mooncake connector and introduce its bootstrap server (#31034) dtc 2026-02-04 00:08:25 +08:00
  • 18e7cbbb15 [Bugfix] Fix startup hang for Granite Speech (#33699) Cyrus Leung 2026-02-03 23:57:56 +08:00
  • f0d5251715 [Voxtral models] Skip warm-up to skip confusing error message in warm-up (#33576) Patrick von Platen 2026-02-03 16:22:34 +01:00
  • 5c4f2dd6ef [MM] Pass prefix parameter to MMEncoderAttention (#33674) Shanshan Shen 2026-02-03 22:47:41 +08:00
  • f3d8a34671 [Bugfix] Do not add extra \n for image-only cases when constructing multimodal text prompts. (#33647) wang.yuqi 2026-02-03 22:43:47 +08:00
  • 4bc913aeec Feat/add nemotron nano v3 tests (#33345) shaharmor98 2026-02-03 15:52:49 +02:00
  • fbb3cf6981 [Bugfix][Async][Connector] avoid vllm-side double free during async scheduling + request abort + async KV cache transfer (#33377) Kuntai Du 2026-02-03 21:50:15 +08:00
  • 2df2b3499d Document NixlConnector backend selection via kv_connector_extra_config (#33552) Krish Gupta 2026-02-03 19:19:59 +05:30
  • 2a8d84e66d Fix Gemma3n audio encoder for Transformers v5 (#33673) Harry Mellor 2026-02-03 13:49:49 +00:00
  • a3acfa1071 [Models] Intern-S1-Pro (#33636) zxy 2026-02-03 21:49:45 +08:00
  • be8168ff88 Fix Gemma3 GGUF for Transformers v5 (#33683) Harry Mellor 2026-02-03 12:36:53 +00:00
  • f6af34626d Fix offline test for Transformers v5 (#33682) Harry Mellor 2026-02-03 12:07:24 +00:00
  • ceab70c89d [Bugfix] fix qwen3-asr response error (#33644) Song Zhixin 2026-02-03 19:33:56 +08:00
  • 52683ccbe1 [Misc] Update default image format of encode_base64 (#33656) Cyrus Leung 2026-02-03 19:13:16 +08:00
  • e346e2d056 [Bugfix] Disable RoutingMethodType.[Renormalize,RenormalizeNaive] TRTLLM per-tensor FP8 MoE (#33620) Michael Goin 2026-02-03 05:37:15 -05:00
  • 83449a5ff0 [Refactor] Clean up pooling serial utils (#33665) Cyrus Leung 2026-02-03 18:29:18 +08:00
  • e4bf6ed90d [torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding (#33624) Richard Zou 2026-02-02 19:38:49 -08:00
  • dad2d6a590 [Bugfix][Model] Fix DeepSeek-OCR-2 chat template to include BOS token (#33642) Lucas Hänke de Cansino 2026-02-03 09:35:58 +01:00
  • 611b18757e [torch.compile] Speed up MOE handling in forward_context (#33184) Richard Zou 2026-01-27 18:17:54 -05:00
  • 9cd2cce17d [torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding (#33624) (tag: v0.15.1rc0) Richard Zou 2026-02-02 19:38:49 -08:00
  • eec3546bba [Misc][Build] Lazy load cv2 in nemotron_parse.py (#33189) Kiersten Stokes 2026-01-29 00:55:50 -06:00
  • 7c023baf58 Patch Protobuf for CVE 2026-0994 (#33619) zaristei2 2026-02-03 00:03:14 -08:00
  • 099a787ee2 Patch aiohttp for CVE-2025-69223 (#33621) zaristei2 2026-02-03 00:02:39 -08:00
  • 32e84fa1ff [CI/Build] Investigate torchrun distributed tests hanging issue (#33650) Isotr0py 2026-02-03 15:49:17 +08:00
  • fd9c83d0e0 [torch.compile] Document the workaround to standalone_compile failing (#33571) Richard Zou 2026-02-02 23:16:55 -08:00
  • b95cc5014d [Misc] Remove deprecated VLLM_ALL2ALL_BACKEND environment variable (#33535) 杨朱 · Kiki 2026-02-03 15:01:59 +08:00
  • 61397891ce [Minor] Some code simplification in scheduler.py (#33597) Nick Hill 2026-02-02 23:00:00 -08:00
  • ef248ff740 [Misc] Remove deprecated profiler environment variables (#33536) 杨朱 · Kiki 2026-02-03 14:58:44 +08:00
  • e10604480b [XPU][1/N] Deprecate ipex and switch to vllm-xpu-kernels for xpu platform (#33379) Kunshang Ji 2026-02-03 14:46:10 +08:00
  • bf001da4bf [Bugfix] Interleaved thinking keeps compatibility with reasoning_content (#33635) Chauncey 2026-02-03 14:46:05 +08:00
  • a0a984ac2e [CI/Build] Remove hardcoded America/Los_Angeles timezone from Dockerfiles (#33553) 杨朱 · Kiki 2026-02-03 14:32:39 +08:00
  • f1cb9b5544 Fix quantized Falcon-H1 model loading issues (#32728) Shengliang Xu 2026-02-02 22:31:27 -08:00
  • 4c4b6f7a97 [Frontend] Add sampling parameters to Responses API (#32609) Daniel Mescheder 2026-02-03 06:51:10 +01:00