Commit Graph

  • 3ecabd06ee Fix tpu-inference platform path (#29554) Johnny Yang 2025-11-26 23:25:21 -08:00
  • c069086b9c [Bugfix] Fix getting device for MoE LoRA (#29475) Jee Jee Li 2025-11-27 15:16:07 +08:00
  • 11ea5ec1ff [Model Runner V2] Refactor CudaGraphManager (#29583) Woosuk Kwon 2025-11-26 21:37:59 -08:00
  • ecb1952378 [cpu][fix] Fix Arm CI tests (#29552) Fadi Arafeh 2025-11-27 05:09:41 +00:00
  • da8e1a1bf9 [DOC] Add vLLM Bangkok Meetup info (#29561) TJian 2025-11-27 12:42:50 +08:00
  • ee80aee1ca [Model Runner V2] Minor cleanup for build_attn_metadata (#29576) Woosuk Kwon 2025-11-26 20:10:12 -08:00
  • 0aeb698b77 [Model Runner V2] Minor code cleanup (#29570) Woosuk Kwon 2025-11-26 19:47:17 -08:00
  • 9bb33c8919 add xpu supported model and model id for cpu (#29380) Louie Tsai 2025-11-26 19:30:50 -08:00
  • a67dec7cba [Bugfix] fix IMA issue in certain cases of the moe marlin kernel (#28619) Jinzhen Lin 2025-11-27 11:02:21 +08:00
  • 77740191de [Attention][Async] Eliminate seq_lens_cpu in FlashAttention metadata building with DCP > 1 (#29449) Matthew Bonanni 2025-11-26 21:48:43 -05:00
  • df01eda4dc [Bugfix] Make compressed-tensors MoEs respect ignored layers (#28878) HDCharles 2025-11-26 21:35:13 -05:00
  • ba1fcd84a7 [TPU] add tpu_inference (#27277) Johnny Yang 2025-11-26 14:46:36 -08:00
  • 56539cddac [Core] Refactor padding logic and pad for CUDA graphs before attention metadata building (#28579) Lucas Wilkinson 2025-11-26 14:07:13 -05:00
  • 430dd4d9eb [Attention] Remove imports from vllm/attention/__init__.py (#29342) Matthew Bonanni 2025-11-26 12:53:15 -05:00
  • c4c0354eec [CI/Build] allow user modify pplx and deepep ref by ENV or command line (#29131) Alec 2025-11-26 09:41:16 -08:00
  • e603129505 [refactor] CTConfig methods to static/class methods (#28870) HDCharles 2025-11-26 12:21:58 -05:00
  • 0b0aa874e8 [Perf] Optimize batch invariant BMM, 18.1% Throughput improvement, 10.7% TTFT improvement (#29345) Wentao Ye 2025-11-26 11:38:52 -05:00
  • 70d5953f82 Revert "[Bugfix] Fix GPT-OSS AR+NORM fusion (#28841)" (#29483) Huamin Li 2025-11-26 06:27:26 -08:00
  • 3650a74ed8 Optimize the wording of the document and unify the terminology and th… (#29491) yxt 2025-11-26 21:16:12 +08:00
  • bb706d6048 Fix TeleChatForCausalLM not register issue (#29473) Yejing Lai 2025-11-26 21:15:00 +08:00
  • e30859dff3 [Bugfix] Fix handling of image embeds in models (#29480) Cyrus Leung 2025-11-26 21:00:15 +08:00
  • 452a7c9f7c [Misc] Allow LM only loading for Pixtral (#29451) Roger Wang 2025-11-26 05:00:00 -08:00
  • d9d342d214 [Performance][MLA][ROCm] Remove redundant D2D copy in deepseek (#27457) Pleaplusone 2025-11-26 12:45:28 +08:00
  • 53d7f1f601 [Kernel] Use pre-allocated output buffer for triton kernel fused_experts (#29219) Xin Yang 2025-11-25 18:21:00 -08:00
  • c5ee430328 Bump actions/checkout from 4 to 6 (#29293) dependabot[bot] 2025-11-26 01:57:08 +00:00
  • 8d6a89dffd [UX] Suppress gloo log spam (#29250) Michael Goin 2025-11-25 20:19:35 -05:00
  • 56531b79cc [Misc] Add backup hash algorithm for FIPS constrained environments (#28795) George D. Torres 2025-11-25 18:50:22 -06:00
  • 12866af748 dummy run corner case (#29433) Xieyang Xu 2025-11-25 16:20:35 -08:00
  • d8819c88eb fix assertion for single world use case (uni) (#29429) Lucia Fang 2025-11-25 16:14:23 -08:00
  • de75b0bb70 [BugFix] Fix initialization of draft model. (#29319) Andrey Khalyavin 2025-11-26 02:45:58 +03:00
  • 7df0289782 Change warning logs to debug for unimplemented MXFP4 Linear/Attention (#29441) Michael Goin 2025-11-25 17:52:31 -05:00
  • 0abc79482a [caching] Add enable_prompt_embeds and cpu_offload_gb to compile hashes. (#29435) Zhengxu Chen 2025-11-25 16:46:41 -05:00
  • 4e57c6587f [Core] Support logprobs with spec decode + async scheduling (#29223) Nick Hill 2025-11-25 12:55:24 -08:00
  • e7d776273d [Compile] Refactor. Move PostGradPassManager out of Compilation config (#29340) Ilya Markov 2025-11-25 20:58:56 +01:00
  • c32a18cbe7 Attempt to fix GPU OOM in a spec-decoding test (#29419) Eldar Kurtić 2025-11-25 20:23:36 +01:00
  • b07555d26f [responsesAPI][2] parse ResponseFunctionToolCallOutputItem (#29383) Andrew Xia 2025-11-25 10:27:26 -08:00
  • 0353d2e162 Fix RoPE related failures in Transformers nightly tests (#29333) Harry Mellor 2025-11-25 16:23:45 +00:00
  • a1f2676879 Scheduled removal of override_pooler_config and disable_log_requests (#29402) Harry Mellor 2025-11-25 16:08:57 +00:00
  • 48ddb02b79 [Hybrid Allocator] Support KV cache groups with different block_size (#29143) Yifan Qiao 2025-11-25 07:30:57 -08:00
  • e502098643 [Kernel] Add NVFP4 MoE CUTLASS support for SM120 (#29242) Michael Goin 2025-11-25 09:59:07 -05:00
  • dbc3d9991a [UX] Put CUDA attention backend selection log into one line (#29337) Michael Goin 2025-11-25 09:46:18 -05:00
  • 794029f012 [Feature]: Improve GGUF loading from HuggingFace user experience like repo_id:quant_type (#29137) Injae Ryou 2025-11-25 23:28:53 +09:00
  • 0231ce836a Revert back to torch.equal over torch.allclose from #28819 (#29086) Eldar Kurtić 2025-11-25 15:23:38 +01:00
  • 516c3f7847 [Bugfix] Fix logic for choosing default prefix caching setting (#29393) Thomas Parnell 2025-11-25 15:05:10 +01:00
  • 51fc9e017a Scheduled removal of CompilationConfig.use_inductor (#29323) Harry Mellor 2025-11-25 12:55:42 +00:00
  • bf0c75cd4f Make Transformers Nightly tests soft-fail and enable all tests (#29401) Harry Mellor 2025-11-25 12:41:15 +00:00
  • c2c661af9b [Bugfix] Fix overallocation in MM profiling (#29386) Roger Wang 2025-11-25 04:38:36 -08:00
  • 798e87db5c [Core] Generalize Encoder-Decoder seq_lens computation to avoid Whisper hardcoded logic (#29268) Nicolò Lucchesi 2025-11-25 12:32:11 +01:00
  • de6889946b [Misc] Suppress log outputs when constructing the default vllm config. (#29291) wang.yuqi 2025-11-25 19:00:44 +08:00
  • 7a80b01889 [CI] Resettle pooling entrypoints tests. (#29370) wang.yuqi 2025-11-25 18:39:10 +08:00
  • e1dd706cd1 [Frontend] Respect Chat Completion parallel_tool_calls param (#26233) Ben Browning 2025-11-25 04:56:15 -05:00
  • a685b47c57 [responsesAPI] refactor construct_input_messages (#29359) Andrew Xia 2025-11-25 01:47:10 -08:00
  • 32c40b95e0 [BugFix] bad_words filtering ineffective when n > 1 (#29313) Avishek Goswami 2025-11-25 15:06:34 +05:30
  • db2906108a [Misc] Streamline unique id generation (#29375) Nick Hill 2025-11-25 00:30:11 -08:00
  • 67fc16cd8c [Bugfix] If chunked_prefill is disabled, end the scheduling early. (#28911) wang.yuqi 2025-11-25 16:06:09 +08:00
  • 6330f9477d [Bugfix] Fix GPT-OSS AR+NORM fusion (#28841) elvischenv 2025-11-25 15:59:40 +08:00
  • ef1f7030f0 [ROCm][CI] Fix test_cudagraph_mode failure in AMD CI (#29367) Micah Williamson 2025-11-25 01:55:09 -06:00
  • 12c007e288 EAGLE Support DP>1 (#26086) Rémi Delacourt 2025-11-25 08:32:21 +01:00
  • f242cfcdd5 [Perf] use cpu all reduce to avoid sync when async_scheduling & dp > 1 (#29311) zhrrr 2025-11-25 15:31:07 +08:00
  • 888152bf87 Allow oot custom compiler extension via CompilerInterface (#28623) Icey 2025-11-25 15:25:15 +08:00
  • fe3a4f5b34 [CI/Build] Pin torchgeo dependency for AMD (#29353) Ryan Rock 2025-11-25 01:14:59 -06:00
  • 98caeadd54 [fix][cpu] Use a SwigluOAI impl which supports interleaved gate-up wei (#29273) Fadi Arafeh 2025-11-25 07:11:11 +00:00
  • 64deead719 [Bugfix] [ROCm] [UX]: revert Flex attention backend (#29371) vllmellm 2025-11-25 14:56:06 +08:00
  • 7992324f23 [BugFix] Use unique ids for different transcription prompts (#29372) Nick Hill 2025-11-24 22:55:16 -08:00
  • 40a6f53f6c Display warning only when ROCm version is less than Pytorch required version (#29200) Inoki 2025-11-25 07:40:06 +01:00
  • ce58fdc1c3 Fix PoolingParams.skip_reading_prefix_cache type (#29364) kflu 2025-11-24 22:39:29 -08:00
  • a21256c463 Add TP CLI argument to multimodal inference examples (#29301) Fanli Lin 2025-11-25 14:03:20 +08:00
  • 316c8492bf Scheduled removal of guided_* config fields (#29326) Harry Mellor 2025-11-25 05:24:05 +00:00
  • 2d9ee28cab [CI/Test Fix] Fix CP tests on Blackwell (#29338) Lucas Wilkinson 2025-11-24 23:55:57 -05:00
  • 81db702ed2 [Attention] add _cudagraph_support for linear attention (#28934) Jiangyun Zhu 2025-11-25 12:25:20 +08:00
  • 92effb07a4 [Model] Add HunyuanOCR support (#29327) Isotr0py 2025-11-25 11:28:51 +08:00
  • 87185c88d5 [Bugfix] Make deprecated --task embedding consistent with `--runner… (#29312) Maryam Tahhan 2025-11-25 03:19:52 +00:00
  • 9cf4edae6e [Metrics] Scheduled removal of deprecated metrics (#29330) Mark McLoughlin 2025-11-25 03:15:13 +00:00
  • 7012d8b45e [Docker] Optimize Dockerfile: consolidate apt-get and reduce image size by ~200MB (#29060) 汪志鹏 2025-11-25 10:54:00 +08:00
  • 22b42b5402 [CI][ROCm] Install arctic-inference on ROCm tests (#29344) Divakar Verma 2025-11-24 20:15:39 -06:00
  • cb7214d8ea [ROCm][MLA] enable fp8 MLA decode on ROCm (#28032) gbyu-amd 2025-11-25 10:15:02 +08:00
  • 77e10c9cab [Perf][Deepseek] optimize gather_and_maybe_dequant_cache kernel's perf for extremely long sequence (#28029) Pleaplusone 2025-11-25 10:05:46 +08:00
  • 6f1355a1b7 [Perf] Disable DeepGEMM MoE by default when TP=8 is used (#29346) Michael Goin 2025-11-24 21:01:40 -05:00
  • a4ad43ad5a Scheduled removal of ParallelConfig's direct child EPLB fields (#29324) Harry Mellor 2025-11-25 01:58:58 +00:00
  • a178a0b40b [BugFix] Fix duplicate id tool-call race condition (#29355) Nick Hill 2025-11-24 17:54:26 -08:00
  • b8328b49fb [XPU] upgrade torch & ipex 2.9 on XPU platform (#29307) Kunshang Ji 2025-11-25 09:34:47 +08:00
  • 5f9679a43b [Spec Decode] Add support for EAGLE3 heads that do not use_aux_hidden_states (#27688) Hanjie Qiu 2025-11-24 20:13:12 -05:00
  • 699bca76c0 [UX] Raise error for attn backend of batch invariant (#29348) Wentao Ye 2025-11-24 19:49:01 -05:00
  • c17610e2ba [Bugfix] Only use triton_kernels for MXFP4 on SM90 and SM100 (#29339) Michael Goin 2025-11-24 18:22:46 -05:00
  • 71df2a57ef [Hybrid Allocator] Better layer padding strategy for gpt-oss eagle (#29303) Chen Zhang 2025-11-24 14:28:32 -08:00
  • 4dd42db566 Remove VLLM_SKIP_WARMUP tip (#29331) Tyler Michael Smith 2025-11-24 17:16:05 -05:00
  • 84371daf75 [Tests] Verify gpt_oss package is installed in harmony tests (#29336) Nick Hill 2025-11-24 14:04:31 -08:00
  • f32c7d6f54 [Model Runner V2] Simplify Eagle bookkeeping with num_rejected (#29347) Woosuk Kwon 2025-11-24 13:54:59 -08:00
  • 3cfa63ad99 [XPU]fix Kimi-VL-A3B-thinking on xpu (#29309) Yan Ma 2025-11-25 05:02:21 +08:00
  • 4d6afcaddc [CI/Build] Moves to cuda-base runtime image while retaining minimal JIT dependencies (#29270) Benjamin Bartels 2025-11-24 19:40:54 +00:00
  • 97588c4d12 [Model Runner V2] Add minor clarification comments for Eagle (#29332) Woosuk Kwon 2025-11-24 11:28:56 -08:00
  • 839c6b7b72 [Multimodal][Qwen3 Omni] Make Qwen3 Omni work with audio-in-video inputs in V1 engine. (#27721) Chenheli Hua 2025-11-24 11:24:37 -08:00
  • 8f066146c3 [MoE][Refactor] Make select_experts a non-static method (#29067) bnellnm 2025-11-24 13:38:04 -05:00
  • cec418b5df [Model Runner V2] Change Numba AoT to JIT (#29328) Woosuk Kwon 2025-11-24 09:34:37 -08:00
  • cc313cb73d [Model Runner V2] Implement Single-step Eagle 1 (#29300) Woosuk Kwon 2025-11-24 09:32:27 -08:00
  • 26a465584a [NIXL] Use config to enable telemetry + NIXL version bump (#29305) Nicolò Lucchesi 2025-11-24 18:18:04 +01:00
  • e924bbb4f4 [Build/CI][DP/EP] Add QWen/Qwen3-30B-A3B-FP8 + EPLB tests to Nightly H100 and B200 (#29195) Varun Sundar Rabindranath 2025-11-24 11:06:17 -05:00
  • 656516c315 [Bugfix] properly handle nested json with llama3 tool parser (#27701) Aydin Abiar 2025-11-24 07:28:51 -08:00
  • e48b2e6848 [Bugfix] [ROCm] [UX] Reorganize ROCm Backend Selection Logic (#26980) vllmellm 2025-11-24 22:24:49 +07:00
  • 7a228b5305 Add option to use unbacked, and backed size obl dynamic shapes for more sound compilation. (#26199) Laith Sakka 2025-11-24 07:12:41 -08:00