Commit Graph

  • f296a1966d [Bugfix] Fix FlashInfer GDN warmup ValueError on SM90 GPUs (#36876) Thomas Parnell 2026-03-13 07:09:39 +01:00
  • bc2c0c86ef [Frontend] Fix usage incorrectly returned with empty stream_options (#36379) Csrayz 2026-03-13 11:33:04 +08:00
  • 891c60dcd5 fix(kv-cache): increase hybrid attention grouping threshold from 1.25 to 1.5 (#36684) jaime campos salas 2026-03-12 23:28:27 -04:00
  • 1ce13cf992 [Model] Add support for BERT-like Chinese ERNIE pooling models (#36385) whyiug 2026-03-13 11:23:53 +08:00
  • 10f08dedfa [Model] Add ColPali late interaction model for multi-modal retrieval (#36818) Nikita 2026-03-13 03:18:57 +01:00
  • 5e1a373d2e [BUG] Fix rank calculation in NCCLWeightTransferEngine (#36940) Aaron Hao 2026-03-12 18:56:51 -07:00
  • 572c776bfb build: update smg-grpc-servicer to use vllm extra (#36938) Simo Lin 2026-03-12 18:31:36 -07:00
  • 55d8073d06 [Bugfix] ep_scatter kernel store-load race condition (#34991) Yifan Qiao 2026-03-12 18:07:59 -07:00
  • cd32d6f586 [Model Runner V2] Some code simplification (#36929) Nick Hill 2026-03-12 17:59:23 -07:00
  • aaa3092f51 [MoE] Add routing simulation override for MXFP4 quantized MoE (#33595) Jaewon 2026-03-12 17:30:44 -07:00
  • 87985077a4 [Speculative Decoding] Add norm_before_fc for gpt-oss draft models (#36545) Shubhra Pandit 2026-03-12 19:03:32 -04:00
  • a79c1c2c80 [AMD][Build] Add DeepEP to ROCm Dockerfile (#36086) Ryan Rock 2026-03-12 16:33:32 -05:00
  • cc8f1f4764 [ROCm][CI] Preparing gfx90a mirroring (#36210) Andreas Karatzas 2026-03-12 15:42:25 -05:00
  • 05b9e8ab5b Revise environment setup in AGENTS.md (#36909) Michael Goin 2026-03-12 20:21:11 +01:00
  • 2cdf92228c [Feature]: Remove Chunking From FusedMoE (#34086) Xinan Miao 2026-03-13 02:24:38 +08:00
  • c973ecdead [bnb] Skip moe + bnb test (#36896) Marc Sun 2026-03-12 19:03:25 +01:00
  • e39257a552 Add AGENTS.md (#36877) Harry Mellor 2026-03-12 17:20:50 +00:00
  • cc16b24b17 Update Flashinfer to 0.6.6 (#36768) Dimitrios Bariamis 2026-03-12 18:19:19 +01:00
  • bdc2343454 [Bugfix] Fix KeyError in parse_response_input for reasoning items with optional content (#34499) Eunkwang Jeon 2026-03-13 01:13:36 +09:00
  • f444c05c32 [Attention] Use FA4 for MLA prefill (#34732) Matthew Bonanni 2026-03-12 12:10:17 -04:00
  • 85199f9681 [Bugfix] fix main branch pre-commit error (1 line change) (#36897) SoluMilken 2026-03-13 00:08:37 +08:00
  • a1257fd1ea [Kernel] Add FP8 KV cache support to Triton MLA decode attention (#34597) grimulkan 2026-03-12 10:32:34 -05:00
  • abcffbba8c [CI] Fix mypy pre-commit errors on main (#36882) Thomas Parnell 2026-03-12 16:22:29 +01:00
  • 53ec16a705 [Hardware] Replace torch.cuda.device_count/current_device/set_device API (#36145) Kunshang Ji 2026-03-12 22:57:47 +08:00
  • 2e693f48e7 [Perf] Add TRTLLM FP8 MoE Modular Kernel (#36307) Wei Zhao 2026-03-12 10:32:31 -04:00
  • 7f1f36bf91 [CI] Fix mypy for vllm/reasoning (#35742) Martin Hickey 2026-03-12 12:21:33 +00:00
  • 5282c7d4d0 [docs] Add lightweight AI assisted contribution policy (#30947) Mark McLoughlin 2026-03-12 11:46:13 +00:00
  • 9e19f8338b [Perf] add packed recurrent fast path for decode (#36596) caozuoba 2026-03-12 19:01:57 +08:00
  • 06e0bc21d2 [Frontend] Split OpenAIServingModels into OpenAIModelRegistry + OpenAIServingModels (#36536) Sage 2026-03-12 12:29:37 +02:00
  • 5a71cdd76e [Bugfix] Fix crash when tool_choice=required exceeds max_tokens (#36841) Chauncey 2026-03-12 18:28:45 +08:00
  • f0d3658c0f [MM][OOT] Support CPU seq_lens for OOT MMEncoderAttention kernels (#36605) Shanshan Shen 2026-03-12 18:28:23 +08:00
  • 57431d8231 [UX] Only show FP4 Marlin fallback warning for w4a4 models (#36806) Michael Goin 2026-03-12 10:19:35 +01:00
  • 3e64fe4a18 [Bugfix] Warm up Triton autotuner for GDN layers during V1 profiling (#36599) Xu Jinyang 2026-03-12 15:51:09 +08:00
  • 8cb24d3aed [KV Connector] Support using FlexKV as KV Cache Offloading option. (#34328) sfeiqiang 2026-03-12 15:46:20 +08:00
  • 00726c74c9 [Bugfix][Model] Fix DeepSeek-OCR TensorSchema crash on empty images_crop (#36670) István Ketykó 2026-03-12 08:35:54 +01:00
  • 9fe404ed04 [Frontend] OpenAI Responses API supports Tool/Function calling with streaming (#29947) Chauncey 2026-03-12 15:03:50 +08:00
  • 802f306cd1 [Tests] Skip model weight download for render-only test server (#36813) Sage 2026-03-12 08:24:42 +02:00
  • 894843eb25 replace with torch.cuda.device with with torch.accelerator.device_index (#36144) Yan Ma 2026-03-12 14:12:57 +08:00
  • 584a3f56de [Kernel][Helion][13/N] Force static_shapes=False in helion register (#36677) Yanan Cao 2026-03-11 22:35:29 -07:00
  • 36735fd772 [BugFix] Fix multiple/duplicate stdout prefixes (#36822) Nick Hill 2026-03-11 21:23:21 -07:00
  • 6ecabe4936 [CI Failure] Fix Language Models Test (Extended Pooling) daily CI Failure (#36761) wang.yuqi 2026-03-12 12:22:05 +08:00
  • 2f8b4ce0c0 [Model Runner V2] Do not initialize sampler for non-last PP ranks (#36824) Woosuk Kwon 2026-03-11 20:55:28 -07:00
  • 2ef69456f5 [LMCache] Fault Tolerance Mechanism (#36586) Yuwei An 2026-03-11 20:54:39 -07:00
  • 17852aa503 more models for vLLM Benchmark Suite (#35086) Louie Tsai 2026-03-11 20:36:51 -07:00
  • 8647c6cf51 [Bugfix] Fix minimax_m2 tool parser when stream interval > 1 (#35895) Flora Feng 2026-03-11 22:25:14 -04:00
  • 513949f95f [XPU][Doc] Remove manual OneAPI install step, now handled by torch-xpu (#36831) Kunshang Ji 2026-03-12 09:46:02 +08:00
  • 262b76a09f [Frontend] Exclude anthropic billing header to avoid prefix cache miss (#36829) Nick Hill 2026-03-11 18:20:34 -07:00
  • c34ba6b961 [Perf] Optimize compute maxsim using batched version, 3.2% E2E throughput improvement (#36710) Wentao Ye 2026-03-11 20:37:01 -04:00
  • 24062b704f [ROCm][CI/Build] Add gfx1152/gfx1153 (Krackan) to HIP supported architectures (#36499) Matthias Gehre 2026-03-12 00:14:40 +01:00
  • d6b61e5166 [BUG] Fix async rlhf tests (#35811) Aaron Hao 2026-03-11 15:06:10 -07:00
  • cf632499ee [Kernel] [Helion] [15/N] Split config files into per-platform files (#36698) Yanan Cao 2026-03-11 14:25:29 -07:00
  • a3774a8198 [Kernel] [Helion] [12/N] Use FakeTensorMode to avoid GPU allocation during config key computation (#36563) Yanan Cao 2026-03-11 14:25:16 -07:00
  • 0ce21c46a0 [Kernel] [Helion] [14/N] Set autotune_ignore_errors=True during autotuning (#36683) Yanan Cao 2026-03-11 14:25:04 -07:00
  • 55eed6b7a5 [Model Runner V2] Add WhisperModelState [6/N] (#35790) Woosuk Kwon 2026-03-11 14:20:38 -07:00
  • c77181e534 [Model Runner V2] Add probabilistic rejection sampling for spec decoding (#35461) Giancarlo Delfin 2026-03-11 14:04:32 -07:00
  • 12001f2ebc [LMCache] Pass TP size in lookup for MLA multi-reader locking (#36129) maobaolong 2026-03-12 04:45:20 +08:00
  • 7ee5d5093b [BugFix][kv_offload] Fix offloading decodes with async scheduling (#33881) Or Ozeri 2026-03-11 22:43:40 +02:00
  • 428bc718bd [Bugfix][ROCm] Strip block_size before attention backend validation (#36274) jennyyyyzhen 2026-03-11 13:37:31 -07:00
  • ff1e3d9c63 [BugFix]: add bagel to MM_PREFIX_LM_MODELS (#36316) 汪志鹏 2026-03-12 03:55:59 +08:00
  • 35bdca5431 [Refactor] Remove dead code in KV connector (#36424) Wentao Ye 2026-03-11 15:40:17 -04:00
  • 8a24842765 [ROCm] add tuned moe_wna16_triton kernel configs for CDNA4 (#35093) Amanzhol Salykov 2026-03-11 20:00:08 +01:00
  • 65986db6ba Make Gemma and Gemma 2 accept inputs_embeds like Gemma 3 (#36787) Harry Mellor 2026-03-11 18:12:43 +00:00
  • 9556af87d5 [torch.compile] Add support for non-contiguous fused RMSNorm + group quant (#36551) Luka Govedič 2026-03-11 13:56:55 -04:00
  • a1a3523a56 [KVConnector] Support worker -> scheduler metadata (#31964) Or Ozeri 2026-03-11 19:36:37 +02:00
  • 741f4e046b fix: align lfm2 thumbnail token counting with HF (#36707) tianshu-Michael-yu 2026-03-11 10:28:38 -07:00
  • a5d06dc557 Add 320 dimension size support to MLA (#36161) Julien Denize 2026-03-11 18:21:22 +01:00
  • 5efa206a8c Fix ExaoneMoeMTP test that never ran in Transformers v4 (#36792) Harry Mellor 2026-03-11 17:10:23 +00:00
  • 196802dfa6 [Misc] Clean up renderers (#36770) Cyrus Leung 2026-03-12 00:39:29 +08:00
  • c84b519cf3 [Bugfix] Fix negative max_tokens when input prompt is too long (#36789) Isotr0py 2026-03-12 00:30:51 +08:00
  • 741ecf0630 [CI] Add bfcl tool call correctness eval (#36560) Flora Feng 2026-03-11 12:27:36 -04:00
  • b7e5a588d8 [Bugfix] Fix DP/EP Shared Expert With Monolithic Kernels (#36061) Robert Shaw 2026-03-11 12:07:14 -04:00
  • 822e250ab7 [torch.compile] Use FakeTensors instead of real GPU tensors for single-size compilation (#36093) Richard Zou 2026-03-11 12:07:09 -04:00
  • bea02cdf93 Fix routed experts capture for hybrid models (Mamba + Attention) (#35744) Hongxin Xu 2026-03-11 23:53:10 +08:00
  • a3ea760ea5 Add 'none' reasoning effort to ChatCompletionRequest (#36238) Julien Denize 2026-03-11 16:45:34 +01:00
  • 35db669f1d Correct link to supported hardware on vllm.ai (#36798) Harry Mellor 2026-03-11 15:43:28 +00:00
  • afebeffbfb Add support to Mistral large 3 eagle with dense layers (#36163) Julien Denize 2026-03-11 16:42:56 +01:00
  • 5573894737 Kimi k2.5 MLA based eagle3 (#36361) Jhao-Ting Chen 2026-03-11 08:36:11 -07:00
  • d5816c8c2f Fix tied weights in weight mapping test for Transformers v5 (#36788) Harry Mellor 2026-03-11 15:10:26 +00:00
  • 8ccbcda5c0 [Model Runner V2] Remove unused warmup_for_prefill method (#36762) Woosuk Kwon 2026-03-11 08:02:44 -07:00
  • a9e532afe2 [ROCm][Perf] Allow MTP lens > 1 in Sparse MLA (#36681) tvirolai-amd 2026-03-11 16:43:03 +02:00
  • f3163bba67 Disable docs build skipping until a better solution is found (#36790) Harry Mellor 2026-03-11 13:53:23 +00:00
  • 700a1ddc65 [Misc] Use envs module to get VLLM_DISABLED_KERNELS (#35776) Martin Hickey 2026-03-11 13:37:46 +00:00
  • f33251ffc8 [Bugfix] Fix Mistral-small --format (#36782) Silvia Colabrese 2026-03-11 12:47:52 +01:00
  • e584dce52b Add XPU MLA Sparse backend for DeepSeek v3.2 (#33230) Wuxun Zhang 2026-03-11 19:19:15 +08:00
  • 40c0461f24 [openapi] refactor render related openapi [3/N] (#36749) Ning Xie 2026-03-11 18:14:34 +08:00
  • 724759684c [Bugfix] Fix Qwen3-VL timestamp mismatch when using num_frames without fps (#36136) Weiguang Li 2026-03-11 18:13:06 +08:00
  • 9c34e9d24f Disable cascade attention by default (#36318) Michael Goin 2026-03-11 11:12:23 +01:00
  • 09b6f99852 [compile] aot_compile should respect VLLM_DISABLE_COMPILE_CACHE (#36358) Richard Zou 2026-03-11 06:12:03 -04:00
  • c87fb515ed fix(lora): use replaced_module_name in pooling model name check (#36402) Ethan T. 2026-03-11 18:11:27 +08:00
  • 5353c9b016 platforms: Fix Ray DP startup crash (#36665) Itay Alroy 2026-03-11 12:08:55 +02:00
  • 13e79fc811 [ci] Update rtol for test_classification (#36556) Angela Yi 2026-03-11 03:08:16 -07:00
  • 9d07a3d6e4 Add: Eagle3 support for Qwen3.5 (#36658) Rahul Tuli 2026-03-11 15:37:42 +05:30
  • 646b85544b [Refactor] Remove Molmo2 processor wrapper (#36667) Cyrus Leung 2026-03-11 18:07:20 +08:00
  • 4286cc5ec2 fix(minicpmv): fix audio inference by handling meta device in init_re… (#36751) tc-mb 2026-03-11 18:06:28 +08:00
  • 95c0f928cd [NemotronH] Small fix reasoning parser (#36635) (tag: v0.17.1) roikoren755 2026-03-11 11:44:41 +02:00
  • c9b1e977dc add nemotron v3 reasoning parser (#36393) Shaun Kotek 2026-03-10 00:11:41 +02:00
  • 545d18d81b [Bugfix] Support other quantization methods in glm41v (#36321) LoganJane 2026-03-11 17:48:05 +08:00
  • e661b9ee83 [NemotronH] Small fix reasoning parser (#36635) roikoren755 2026-03-11 11:44:41 +02:00
  • c910eeb125 [XPU]Bug fix for some unexpected error when use AgRs backend on XPU device. (#36593) YiSheng5 2026-03-11 17:17:46 +08:00
  • f4ae58b38b Remove unused config field from Gemma2 (#36672) Harry Mellor 2026-03-11 08:51:19 +00:00