Commit Graph

  • 9df152bbf6 [Misc] Algin Qwen3-VL-embedding image example outputs with HF repo example (#33419) Isotr0py 2026-01-31 11:36:56 +08:00
  • 876a16f4fb [ModelRunner V2] Fix spec decoding + logprobs (#33391) Nick Hill 2026-01-30 19:33:26 -08:00
  • aaa901ad55 [Attention] Move MLA forward from backend to layer (#33284) Matthew Bonanni 2026-01-30 22:30:00 -05:00
  • 010ec0c30e [Deprecation] Deprecate seed_everything and scatter_mm_placeholders in v0.15 (#33362) Wentao Ye 2026-01-30 21:54:16 -05:00
  • 64a40a7ab4 [Bugfix] Fix typo in read_offset variable name (#33426) Alberto Ferrer 2026-01-30 19:26:15 -06:00
  • 31aedfe7d6 [Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 (#33366) Gregory Shtrasberg 2026-01-30 19:05:23 -06:00
  • 67ebaff528 Refactor NVFP4 Linear utils for ModelOpt and CT (#33201) Michael Goin 2026-01-30 19:37:42 -05:00
  • 2b465570e6 [CI][HPU]accelerate hpu test by skip python re-install and clean container name (#33286) Chendi.Xue 2026-01-30 15:36:29 -06:00
  • 9ca66ecc10 Indicate compile mode in the benchmark results (#32990) Huy Do 2026-01-30 12:34:36 -08:00
  • c3a9752b0c [Hardware][SM100] Add TRTLLM Kernel for INT4 W4A16 Kernel. (#32437) Pavani Majety 2026-01-30 10:30:46 -08:00
  • f451b4558b [Quantization][ROCm] Fix MoE weight loading to be robust (Qwen3_MoE/Qwen3_next as example models) (#33173) xuebwang-amd 2026-01-31 01:50:23 +08:00
  • 3f96fcf646 fix QERL attention import path (#33432) Vasiliy Kuznetsov 2026-01-30 12:29:09 -05:00
  • 6c1f9e4c18 [Kernel] [Helion] [1/N] Add Helion ConfigManager (#32740) Yanan Cao 2026-01-30 09:19:19 -08:00
  • 67239c4c42 Fix encoder-decoder model disabling mm processor cache (#33236) Harry Mellor 2026-01-30 16:30:10 +00:00
  • 8ece60768f [CI] Qwen3-ASR transcriptios tests (#33414) Nicolò Lucchesi 2026-01-30 17:17:56 +01:00
  • fd0e377244 Support FP8 block quant for CompressedTensorsW8A16Fp8 (#33280) Michael Goin 2026-01-30 11:15:20 -05:00
  • f857a03f6b [QeRL] Layerwise Reloading (#32133) Kyle Sayers 2026-01-30 10:50:05 -05:00
  • 74898a7015 [BugFix][LoRA] TritonExperts is ModularMoEPath for FP8 models (#33393) Danielle Robinson 2026-01-30 07:27:42 -08:00
  • 8f5d51203b Disable Cascade Attention for Batch Invariance (#32561) Frank Wang 2026-01-30 07:00:46 -08:00
  • ae5b7aff2b Improve Mistral format checks. (#33253) Julien Denize 2026-01-30 15:23:33 +01:00
  • a11bc12d53 Fix test_moe.py for Transformers v5 (#33413) Harry Mellor 2026-01-30 14:03:25 +00:00
  • 58cb55e4de [Doc] Enhance documentation around CPU container images (#32286) Nathan Weinberg 2026-01-30 08:36:20 -05:00
  • cf896ae0e3 [Misc] Clean up HIDDEN_DEPRECATED_METRICS after metric removal (#33323) 杨朱 · Kiki 2026-01-30 21:31:17 +08:00
  • c5113f60f2 Remove deprecated reasoning_content message field (#33402) Harry Mellor 2026-01-30 11:48:15 +00:00
  • 174f16700b [Doc] [ROCm] Update Documentation to reflect v0.15.0 release (#33388) vllmellm 2026-01-30 19:06:08 +08:00
  • 8e2ad97ad0 [BUGFIX] Pixtral cannot be loaded with --limit-mm-per-prompt 0 (#33406) Julien Denize 2026-01-30 11:52:02 +01:00
  • 10152d2194 [Realtime API] Adds minimal realtime API based on websockets (#33187) Patrick von Platen 2026-01-30 11:41:29 +01:00
  • 1a7894dbdf [Misc] Replace Optional[X] with X | None syntax (#33332) 杨朱 · Kiki 2026-01-30 17:56:59 +08:00
  • c87eac18f7 [Refactor] Move MM item count validation outside of processor (#33396) Cyrus Leung 2026-01-30 17:27:31 +08:00
  • f45870b53f fix: allow LFM2 MoE prefix caching (align) (#33376) tianshu-Michael-yu 2026-01-30 00:23:14 -08:00
  • ba45bedfd1 [model] Add support for openPangu7B-VL (#32449) hujiaxin0 2026-01-30 15:54:27 +08:00
  • 9432ed8c7e Explicitly set return_dict for apply_chat_template (#33372) Harry Mellor 2026-01-30 07:27:04 +00:00
  • 726d89720c [CI] Enable mypy import following for vllm/spec_decode (#33282) Lucas Kabela 2026-01-29 22:43:32 -08:00
  • d334dd26c4 Move decode context parallel validationn to ParallelConfig (#33239) Harry Mellor 2026-01-30 06:18:41 +00:00
  • 070c811d6f [CI][AMD] Skip 4 GPUs testgroup ray tests (#33305) Ryan Rock 2026-01-29 23:39:53 -06:00
  • 8bfc8d5600 [Models] Refactor Kimi-K2.5 weight loading (#33346) Isotr0py 2026-01-30 13:31:20 +08:00
  • ec51831a22 [BugFix] Disable async scheduling for Mamba prefix caching (#33352) Harry Huang 2026-01-30 12:40:19 +08:00
  • 80b918f2bd Fix tie_word_embeddings for multimodal models in Transformers v5 (#33359) Harry Mellor 2026-01-30 03:37:39 +00:00
  • c46b0cd0af [Model][Multimodal] Add explicit MusicFlamingo adapter (#32696) Wang Haoyu 2026-01-30 11:01:29 +08:00
  • 133765760b [Docs] Adding links and intro to Speculators and LLM Compressor (#32849) v0.16.0rc0 Aidan Reilly 2026-01-29 22:12:35 +00:00
  • bfb9bdaf3f [Bugfix] Enable Triton MoE for FP8 per-tensor dynamic (#33300) Michael Goin 2026-01-29 15:15:17 -05:00
  • 2284461d02 [release] Minor fixes to release annotation and wheel upload (#33129) Kevin H. Luu 2026-01-29 12:09:35 -08:00
  • 8e2a469b3b Add Triton fused MoE config for B200 (Nemotron Nano) (#32804) danisereb 2026-01-29 21:21:33 +02:00
  • 23591e631e [Bugfix][Kernel] Fix negative memory offset in GDN Triton kernel (#33326) CarstyYou 2026-01-30 02:40:11 +08:00
  • 0493d897c4 [NVIDIA] [feat] Integrate flashinfer Trtllmgen bf16 moe (#32954) Linda 2026-01-29 19:00:13 +01:00
  • 8c8ebeb941 [BUGFIX][XPU] fix memory check after XPU reuse GPU_worker (#33358) Chendi.Xue 2026-01-29 11:56:30 -06:00
  • 831453fcef [Chore] Move MediaConnector to vllm.multimodal.media (#33324) Cyrus Leung 2026-01-30 00:54:31 +08:00
  • 5a66c9cc76 [ez] Delete torch25_custom_graph_pass (#33287) Angela Yi 2026-01-29 08:47:05 -08:00
  • 5e73e4900c [Bugfix] Fix broken GLM-OCR initialization (#33350) Isotr0py 2026-01-29 23:56:05 +08:00
  • c6e7404cc5 [Multimodal] Simplify MM input definitions (#33331) Cyrus Leung 2026-01-29 21:32:04 +08:00
  • 17b17c0684 [Backport] [Kimi-K2.5] Replace torch.cuda with current_platform for d… (#33320) sthWrong 2026-01-29 20:29:17 +08:00
  • 8bb6271c77 [Intel GPU] refine xpu worker (#32894) Kunshang Ji 2026-01-29 20:26:52 +08:00
  • 8b3f0a99dd [Models] Qwen3-ASR (#33312) Roger Wang 2026-01-29 03:27:15 -08:00
  • 8311f083bd [Bugfix][CPU] Fix thread num for shared memory communication (#33317) Li, Jiang 2026-01-29 19:26:58 +08:00
  • 40c35038d2 [Voxtral] Streaming example (#33042) Patrick von Platen 2026-01-29 12:22:49 +01:00
  • a5aa4d5c0f [Quantization][Refactor] use platform dict to choose kernel (#33130) zofia 2026-01-29 18:44:58 +08:00
  • 615e8033e5 [Bug Fix] Handle variable-length tensors in MultiModalFlatField batching (#31751) andrii.pasternak 2026-01-29 10:42:59 +00:00
  • d09135fbd0 [BugFix] Async Eplb fix potential race condition (#32881) Ilya Markov 2026-01-29 11:31:40 +01:00
  • 8688c3d460 [fix] tesdt mcp_tool_calling_streaming with a more complex math question (#32769) daniel-salib 2026-01-29 02:25:58 -08:00
  • 5400014d55 [Chore] Remove use_data_parallel kwargs from ViT implementation (#33310) Isotr0py 2026-01-29 18:20:52 +08:00
  • 3a92c6f3b5 [Misc] Cleanup Kimi-K2.5's vision chunk modality entrypoints (#33157) Isotr0py 2026-01-29 17:46:02 +08:00
  • e01ff5c070 Bugfix: Pass router logits dtype in nemotron shared experts (#32669) amirkl94 2026-01-29 01:36:34 -08:00
  • fb946a7f89 Make mypy opt-out instead of opt-in (#33205) Harry Mellor 2026-01-29 09:12:26 +00:00
  • a650ad1588 [Misc] Remove missed pad_for_cudagraph (#33283) Lucas Wilkinson 2026-01-29 02:12:05 -07:00
  • d697581a7c [Doc] Update outdated link to Ray documentation (#32660) graftim 2026-01-29 09:56:06 +01:00
  • 5eeba80c74 Adding optional speculator tests for larger models (#32943) shanjiaz 2026-01-29 03:54:02 -05:00
  • 08b1195e62 [PluggableLayer][2/N] Apply PluggableLayer to linear layers (#33152) whx 2026-01-29 16:53:15 +08:00
  • 3bba2edb0f support returning tokenids in responses api (#33212) cmunley1 2026-01-29 00:52:39 -08:00
  • 53fc166402 [BugFix] Fix EPLB fail for MoeFP4 model with Marlin backend (#33262) Ilya Markov 2026-01-29 09:52:11 +01:00
  • 31b25f6516 [Doc]: fixing multiple typos in diverse files (#33256) Didier Durand 2026-01-29 09:52:03 +01:00
  • abb34ac43a [Bugfix] Fix Qwen3-VL-Reranker load. (#33298) wang.yuqi 2026-01-29 16:42:53 +08:00
  • 2515bbd027 [CI/Build][BugFix] fix cuda/compat loading order issue in docker build (#33116) Pengchao Wang 2026-01-29 00:19:05 -08:00
  • c487a8eef4 [Release] [ROCm] Remove old build step (#33316) TJian 2026-01-29 15:35:51 +08:00
  • 9e138cb01d [Misc][Build] Lazy load cv2 in nemotron_parse.py (#33189) Kiersten Stokes 2026-01-29 00:55:50 -06:00
  • f176443446 [Release] [CI] Optim release pipeline (#33156) v0.15.0 TJian 2026-01-29 14:45:42 +08:00
  • f9d03599ef [Release] [CI] Optim release pipeline (#33156) TJian 2026-01-29 14:45:42 +08:00
  • 39037d258e Fix tool call indexing double-counting (#33141) wangln19 2026-01-29 13:57:09 +08:00
  • 51550179fc [Refactor] Define MM data parser in processing info instead of processor itself (#33260) Cyrus Leung 2026-01-29 13:55:17 +08:00
  • 07ea184f00 [ez] Delete more torch version checks <= 2.8 (#33288) Angela Yi 2026-01-28 21:28:46 -08:00
  • a663b218ae [Misc] Add orozery to CODEOWNERS (core, kv_transfer, kv_offload) (#33227) Or Ozeri 2026-01-29 06:24:20 +02:00
  • 1bd47d6e5a [Bugfix] Register fp8 cutlass_group_gemm as supported for only SM90+SM100 (#33285) Michael Goin 2026-01-28 21:40:59 -05:00
  • 141cd43967 [UX] Remove noisy CT UnquantizedLinearMethod warn (#33273) Michael Goin 2026-01-28 19:09:30 -05:00
  • 6bf3b46d78 [ModelRunner V2] Misc code simplification and cleanup (#33266) Nick Hill 2026-01-28 14:41:23 -08:00
  • 77c4f45c6c [7/N][Attention][Docs] Add documentation for attention backends (#32477) Matthew Bonanni 2026-01-28 17:20:22 -05:00
  • ca1969186d [UX] Enable nested configs in config yaml files (#33193) Michael Goin 2026-01-28 16:54:25 -05:00
  • ab597c869a [Bugfix] Add missing encoder only guard for do_kv_cache_update (#33269) Gregory Shtrasberg 2026-01-28 15:25:07 -06:00
  • 4197168ea5 [ez] Remove checks for torch version <= 2.8 (#33209) Angela Yi 2026-01-28 13:03:56 -08:00
  • 59bcc5b6f2 Use aiter triton fused_add_rmsnorm_pad for gpt-oss (#30976) Rohan Potdar 2026-01-28 14:47:47 -06:00
  • 3e440786af [Feature] Fully support for async scheduling + PP, 30.8% E2E throughput improvement, 31.8% TPOT improvement (#32618) Wentao Ye 2026-01-28 15:30:32 -05:00
  • fe18ce4d3f Revert "Enable Cross layers KV cache layout at NIXL Connector (#30207)" (#33241) v0.15.0rc3 Or Ozeri 2026-01-28 14:36:00 +02:00
  • 8bdd3979d8 [CI] Change GPU key to device key for B200 test (#33275) Kevin H. Luu 2026-01-28 11:14:29 -08:00
  • c4e744dbd4 [Perf] Optimize moe_permute for CUTLASS FP8 (#32892) Wentao Ye 2026-01-28 13:15:24 -05:00
  • 8ebf372e9d [CI] Whisper tests enforce_eager=False (#33098) Nicolò Lucchesi 2026-01-28 18:36:56 +01:00
  • f210f0b7b1 [lora/moe] Avoid extra intermediate buffer & Python slicing in expand phase when split_k == 1 (#32774) cwazai 2026-01-29 00:22:45 +08:00
  • 392c5af4fe [Benchmark] Add startup benchmarking to buildkite run (#33183) Bin Bao 2026-01-28 11:03:07 -05:00
  • af9b69f977 [Quantization][Deprecation] Remove Marlin 24 (#32688) Robert Shaw 2026-01-28 07:54:59 -08:00
  • 8e5e40daf4 [Misc] Provide a DeepSeek ReasoningParser with thinking enabled by default (#33221) Chauncey 2026-01-28 21:16:53 +08:00
  • 2e8de86777 Revert "Enable Cross layers KV cache layout at NIXL Connector (#30207)" (#33241) Or Ozeri 2026-01-28 14:36:00 +02:00
  • 247d1a32ea [Quantization][Deprecation] Remove BitBlas (#32683) Robert Shaw 2026-01-28 03:06:22 -08:00
  • 5f7f9ea884 Relax protobuf library version constraints (#33202) v0.15.0rc2 Jeffrey Wang 2026-01-27 20:15:53 -08:00