Commit Graph

  • 59bd5f6a71 [Feat] Enable eplb with default all2all backend (#30559) Wentao Ye 2025-12-16 10:33:52 -05:00
  • 00a8d7628c [BugFix] Fix memory spike in workspace allocation (#30744) Lucas Wilkinson 2025-12-16 09:46:22 -05:00
  • 4de08ad698 [CI/Build] Skip broken ViT backend functionality test temporarily (#30782) Isotr0py 2025-12-16 22:45:25 +08:00
  • 75eb302a2e [Bugfix] Whisper fix number of allocated CrossAttn blocks per-request (#30772) Nicolò Lucchesi 2025-12-16 15:20:19 +01:00
  • 9dbbc59b15 [ROCm][MTP] Support MTP for AITER MLA backend (#28624) Pleaplusone 2025-12-16 22:10:26 +08:00
  • 104003dc77 update piecewise cudagraph warning when splitting_ops=[] (#30728) Boyuan Feng 2025-12-16 06:09:34 -08:00
  • d0fb572929 [ROCm] [AITER] [DOC] Add usage description about check functions in _aiter_ops (#30586) TJian 2025-12-16 21:50:47 +08:00
  • 6f15ac5de7 Don't assume position_embedding_type will be present for BERT and RoBERTa models (#30770) Harry Mellor 2025-12-16 13:40:26 +00:00
  • 676db55eec [Bugfix] Fix prefix_repetition routing in bench throughput (#29663) Junru Shen 2025-12-16 17:37:15 +08:00
  • 0e391e7570 [Bugfix] Fix RequestOutput miss lora_request (#30636) Jee Jee Li 2025-12-16 17:36:35 +08:00
  • 0d0c929f23 [responsesAPI][8] input/output messages for ResponsesParser (#30158) Andrew Xia 2025-12-16 13:54:59 +08:00
  • e94384bbad [Bugfix] Fix broken ViT attention selection for Blackwell device (#30731) Isotr0py 2025-12-16 13:24:32 +08:00
  • b9ff4f2a8d [feature] extend DBO to XBO (#30120) jiangkuaixue123 2025-12-16 13:04:01 +08:00
  • c881db364e improve lazy import test (#30733) Boyuan Feng 2025-12-15 19:12:05 -08:00
  • 3bd9c49158 [CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic (#29873) Shanshan Shen 2025-12-16 11:08:16 +08:00
  • ff21a0fc85 [docker] Restructure Dockerfile for more efficient and cache-friendly builds (#30626) Amr Mahdi 2025-12-16 04:52:19 +02:00
  • bbd850e597 [Bugfix] fix streaming final output for non harmony (#30237) penfree 2025-12-16 09:03:11 +08:00
  • 511e81e7c9 [BUILD] use sm_100f when compiling flashmla to fix support on sm103 (#30705) Shengqi Chen 2025-12-16 06:48:01 +08:00
  • a182be4308 [UX][Attention] Add attention_config argument to LLM() (#30710) Matthew Bonanni 2025-12-15 17:29:09 -05:00
  • c01d589813 [Benchmarks] auto_tune.sh: Use hostname variable for server requests (#30529) Kevin Musgrave 2025-12-15 17:00:29 -05:00
  • 60dbf7d8f1 Update batch invariant to use attention config (#30704) Matthew Bonanni 2025-12-15 15:24:16 -05:00
  • a450c64a30 [Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args (#30708) Michael Goin 2025-12-15 15:18:02 -05:00
  • b2191abdca [docs][fix] Update Arm CPU vLLM wheel installation docs (#30594) Fadi Arafeh 2025-12-15 19:46:25 +00:00
  • 51e5b3e3c4 [Bugfix] Fix ViT with FlashAttention on ROCm (#30703) Matthew Bonanni 2025-12-15 14:45:21 -05:00
  • ec154c36ee [Platform] Refactor Platform attention backend selection to avoid breakpoint for OOT platform (#30212) Isotr0py 2025-12-16 01:36:07 +08:00
  • 970713d4a4 Remove SkipValidation from ModelConfig (#30695) Harry Mellor 2025-12-15 17:34:08 +00:00
  • 17fec3af09 [Bugfix] Fix missing first token in tool calls during reasoning-to-tool transition (#30671) mondaylord 2025-12-16 00:13:37 +08:00
  • 855b101d75 [Frontend] add tools for dsv32 developer role (#30040) yjc9696 2025-12-15 23:08:47 +08:00
  • d0502b4928 [MoE][Refactor 1/N] Separate Online Quantization (#30627) Robert Shaw 2025-12-15 09:54:53 -05:00
  • 3f175f18a2 [Bugfix] Fix multimodal configuration for Qwen3VL MOE model (#30670) Max Hu 2025-12-15 22:06:01 +08:00
  • ed586e7724 [Refactor] [3/N] Move tool parser tests and run on CPU (#30693) Cyrus Leung 2025-12-15 21:45:36 +08:00
  • 2a1776b7ac [Refactor] [2/N] Move tool parsers into the vLLM main directory (#30675) Chauncey 2025-12-15 20:54:52 +08:00
  • 185c22bf2f [Misc][Hybrid allocator + kv connector] Optionally enable hybrid allocator + KV cache connector (#29805) Nicolò Lucchesi 2025-12-15 12:17:58 +01:00
  • e4806d973a [BugFix] Add embed_input_ids method to make QWenLMHeadModel a vllm model (#30674) duke 2025-12-15 18:38:29 +08:00
  • 4429d934de [Model] Automatic conversion of TokenClassification model (#30666) wang.yuqi 2025-12-15 16:13:00 +08:00
  • 33278073d6 typing: Add type hints to TurnMetrics class in context.py (#30552) ゆり 2025-12-15 16:00:39 +09:00
  • 1adeb3b84c [New Model] BAGEL support (AR only) (#28439) 汪志鹏 2025-12-15 14:58:23 +08:00
  • e3a1cd1c59 [XPU] fix Dockerfile.xpu, avoid wheel conflicts (#30662) Kunshang Ji 2025-12-15 13:32:06 +08:00
  • 3778673ea8 [Feat] Refactor for parallel_config in FusedMoEModularKernel (#30282) Wentao Ye 2025-12-14 23:21:36 -05:00
  • b337647aa0 [Bugfix] Drop empty tool_calls lists to keep assistant replies in chat template (#30648) Seokhyun An 2025-12-15 13:21:12 +09:00
  • a524d1ba0a [Bugfix] Fix deepseek_v32 tokenizer_mode (#30658) Jee Jee Li 2025-12-15 12:20:31 +08:00
  • 87b4d1557d [CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. (#30125) Shanshan Shen 2025-12-15 11:13:32 +08:00
  • 84e23d103d additional protection for CVE-2025-62164 (#30649) Wenqi Glantz 2025-12-14 22:07:10 -05:00
  • 738648fb81 [CustomOp] Support object-level enable for CustomOp (#30547) Shanshan Shen 2025-12-15 11:02:09 +08:00
  • 917fdae5b2 [Log] Skip piecewise cudagraph warn when using full cudagraph (#30657) Boyuan Feng 2025-12-14 18:49:45 -08:00
  • e2ed238885 Revert "[Fix]Load kv-cache dtype from hf_quant_config.json automatically" (#30653) Robert Shaw 2025-12-14 19:33:41 -05:00
  • 174e39ead7 CPU KV Offloading: Use more CUDA streams (#29013) Or Ozeri 2025-12-15 01:50:45 +02:00
  • 9ccbf6b692 [responsesAPI]add extra body parameters (#30532) RioS 2025-12-15 04:25:45 +09:00
  • ae2e503dda [NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665 (#30420) Chendi.Xue 2025-12-14 09:38:28 -06:00
  • 9e33a1a75b [Model][Quantization] Override HF defaults to GGUF ones (incl. Qwen3 MoE) (#30118) Tsukasa OI 2025-12-15 00:01:42 +09:00
  • add4b0ca44 [Bugfix][benchmarks] Fix input token calculation for rerank benchmark metrics (#30596) Vensen 2025-12-14 22:57:15 +08:00
  • ae88aada38 [Feature]Add EVS (Efficient Video Sampling) Support for Qwen3-VL (#29752) ZiTian Zhao 2025-12-14 21:24:56 +08:00
  • 5ccf0efa84 [Bugfix] Improve error messages in ModelConfig validation (#30213) yifant-code 2025-12-14 08:23:37 -05:00
  • 994acec0cc [Bugfix] Fix fusion for VL models (#30244) ElizaWszola 2025-12-14 14:22:37 +01:00
  • 48b8456ff9 [Bugfix] Revert Qwen2-VL part of change in #28271 (#30542) zifeitong 2025-12-14 05:20:08 -08:00
  • 5b64ac21f9 [Bugfix] Update get_processor_data to use get_all method (#30583) Drew Botwinick 2025-12-14 07:19:20 -06:00
  • a8ec486592 [Misc] Add a script to benchmark compilation time (#29919) Bin Bao 2025-12-14 08:02:39 -05:00
  • 6ecc1e411b [Bugfix] fix _get_quant_method of FusedMoE for deepseekV3.2 on non-NV… (#30057) tjp_zju 2025-12-14 18:20:51 +08:00
  • 0bb0bae436 Nvidia ModelOpt workaround for issue 28072 (#30164) Shengliang Xu 2025-12-14 02:18:31 -08:00
  • 060893654d fix: Update json features supported by xGrammar (#30390) Johannes F 2025-12-14 11:16:06 +01:00
  • e9add129ad [Bugfix] awq_gemm: fix argument order swap (#30364) Matthias Gehre 2025-12-14 11:15:37 +01:00
  • 3224ea9915 [torch.compile] Add encoder tag for compilation (#30489) Ilya Markov 2025-12-14 11:15:11 +01:00
  • 3a20450d31 Add AudioFlamingo3 model support (#30539) Lasha Koroshinadze 2025-12-14 05:14:55 -05:00
  • 1a55cfafcb [Doc]: fixing typos in various files (#30540) Didier Durand 2025-12-14 11:14:37 +01:00
  • add1b9d3de [main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#30632) drslark 2025-12-14 17:32:16 +08:00
  • dcb31196da [Chore] Remove redundant RequestPrompt (#30612) Cyrus Leung 2025-12-14 17:22:37 +08:00
  • f569c654e1 enable unbacked with aot_compile (#30462) Laith Sakka 2025-12-14 11:14:06 +03:00
  • 97f2f160fd [ROCm][CI] Add "Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test" Back Into AMD CI (#30590) Micah Williamson 2025-12-14 00:56:26 -06:00
  • 29f7d97715 Improve parse_raw_prompt test cases for invalid input .v2 (#30512) Kayvan Mivehnejad 2025-12-13 22:18:41 -05:00
  • dc7fb5bebe [Bug][KVConnector][Metrics] Remove a vacuous assertion breaking external-launcher (#30577) Qier Li 2025-12-13 20:23:08 -05:00
  • 24429d5924 [Doc] Add instructions for building docker image on GB300 with CUDA13 (#30414) Qidong Su 2025-12-13 16:56:53 -05:00
  • 6e78ed6ba7 [Logs] Optimize startup logs 4 (#29903) Wentao Ye 2025-12-13 16:12:53 -05:00
  • 7c16f3fbcc [Doc] Add documents for multi-node distributed serving with MP backend (#30509) Isotr0py 2025-12-14 02:02:29 +08:00
  • ddbfbe5278 [Docs] Clarify Expert Parallel behavior for attention and MoE layers (#30615) lif 2025-12-14 01:37:59 +08:00
  • 763963aa73 set assume_32bit_indexing and pass unbacked hints (#30459) Laith Sakka 2025-12-13 18:36:53 +03:00
  • 39cefbdf17 [Refactor] TokenizerRegistry only uses lazy imports (#30609) Cyrus Leung 2025-12-13 23:16:22 +08:00
  • ace34e3783 [Bugfix] Qwen3-next with --hf-overrides \{\"num_hidden_layers\":8\} (#30433) Chen Zhang 2025-12-13 06:12:45 -08:00
  • e5db3e2774 [CI/Build] Fix broken mm processor test Mistral-3-large (#30597) Isotr0py 2025-12-13 20:43:01 +08:00
  • 64251f48df [Chore] Adjust tokenizer import to avoid circular imports (#30601) Cyrus Leung 2025-12-13 20:42:39 +08:00
  • 1cec5b7ea9 [Scheduler] Simplify stop checking for pooling models (#30591) Nick Hill 2025-12-13 01:45:26 -08:00
  • b09806e28f [Bugfix] Dictionary MM embeddings for online chat (#30507) Cyrus Leung 2025-12-13 15:48:56 +08:00
  • fdc135d768 [Misc][Quantization] Clarify the intent of GGUF FusedMoE weight materialization (#30310) Tsukasa OI 2025-12-13 14:55:14 +09:00
  • 4fa7ce46f3 [Feature] Add SM103 (Blackwell Ultra) Support to vLLM (#30484) Roberto L. Castro 2025-12-13 04:34:23 +01:00
  • 57e9bf1864 [CI] Whisper logprobs tests (#30504) Nicolò Lucchesi 2025-12-13 03:49:11 +01:00
  • 2f32a68d75 [CI] Update several models in registry that are available online now (#30514) Michael Goin 2025-12-12 21:28:13 -05:00
  • f5dfbbd8e9 [Docs] Remove references to VLLM_ATTENTION_BACKEND (#30564) Matthew Bonanni 2025-12-12 21:20:15 -05:00
  • fc0119425c Add IBM and Red Hat to compute resources sponsors (#30581) Michael Goin 2025-12-12 20:34:23 -05:00
  • 86a3261525 [Bugfix] Pass FA version in MultiHeadAttention (#30575) Matthew Bonanni 2025-12-12 19:02:11 -05:00
  • 08f8a5627e [CI/Build][Kernel][BugFix][AMD] Fix per_token_group_quant_fp8 to use correct fp8 min/max values and update atol/rtol in test_quantfp8_group_functionality (#30292) rasmith 2025-12-12 17:41:56 -06:00
  • b4039c08b5 [ci] Mark PrimeRL integration test as soft fail (#30578) Kevin H. Luu 2025-12-12 14:13:09 -08:00
  • 1e6b115300 [Refactor] Reduce duplicate code in per_token_group_quant cuda kernels (#30496) Wentao Ye 2025-12-12 16:45:23 -05:00
  • 13618626df [MoE-FP8-modelopt] Add FlashInfer alignment padding for intermediate dimensions (#29748) danielafrimi 2025-12-12 22:42:32 +02:00
  • 6ec0d8dbe4 [Fix]Load kv-cache dtype from hf_quant_config.json automatically (#29980) danielafrimi 2025-12-12 21:27:47 +02:00
  • 9693dd0fe3 [CI/Build] Add x86 CPU wheel release pipeline (#28848) Li, Jiang 2025-12-13 03:21:35 +08:00
  • 1f19d8f899 [Perf] Set split_k to 1 for triton_kernels (#30528) Xin Yang 2025-12-12 11:07:57 -08:00
  • cd7740ac5c [ROCm] Enable Triton ScaledMM fallback + kernel selection fix (#26668) shivampr 2025-12-12 10:28:20 -08:00
  • 02a5880394 [CI] Fix mypy for vllm/v1/executor (#30517) Wentao Ye 2025-12-12 13:05:34 -05:00
  • d2c919dcc2 [bugfix] fix bug when top_logprobs=0 with spec decoding (#30059) realliujiaxu 2025-12-13 01:03:35 +08:00
  • f3237f3f6b [Frontend] Fixes anthropic streaming message_start usage nesting (#30266) Benjamin Bartels 2025-12-12 16:28:54 +00:00
  • 9c0ee995a8 [Kernel] Support CUDA Graphs in 3D Triton Attention Kernel (#28306) jvlunteren 2025-12-12 16:55:40 +01:00