Commit Graph

  • 432870829d [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe (#20509) Lucia Fang 2025-07-06 12:08:30 +08:00
  • f73d02aadc [BUG] Fix #20484. Support empty sequence in cuda penalty kernel (#20491) Vadim Gimpelson 2025-07-06 06:38:02 +04:00
  • c5ebe040ac test_attention compat with coming xformers change (#20487) Jeremy Reizenstein 2025-07-06 03:37:59 +01:00
  • 8d763cb891 [Misc] remove unused import (#20517) Reid 2025-07-06 10:17:06 +08:00
  • cf4cd53982 [Misc] Add logger.exception for TPU information collection failures (#20510) Reid 2025-07-05 22:24:32 +08:00
  • 32c9be2200 [v1] Re-add fp32 support to v1 engine through FlexAttention (#19754) Isotr0py 2025-07-05 17:41:10 +08:00
  • 8aeaa910a2 Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod (#20507) Lucia Fang 2025-07-05 14:03:20 +08:00
  • 906e05d840 [Misc] Remove the unused LoRA test code (#20494) Jee Jee Li 2025-07-05 13:48:16 +08:00
  • ef9a2990ae [doc] small fix (#20506) Reid 2025-07-05 11:56:39 +08:00
  • 7e90870491 [Misc] Add security warning for development mode endpoints (#20508) Reid 2025-07-05 11:52:13 +08:00
  • d3f05c9248 [Doc] fix mutltimodal_inputs.md gh examples link (#20497) Guy Stone 2025-07-05 00:41:35 +01:00
  • c108781c85 [CI Bugfix] Fix pre-commit failures on main (#20502) Michael Goin 2025-07-05 06:17:30 +09:00
  • 3d184b95b8 [feat]: CUTLASS block scaled group gemm for SM100 (#19757) Duncan Moss 2025-07-04 11:58:04 -07:00
  • 2f35a022e6 Enable V1 for Hybrid SSM/Attention Models (#20016) Thomas Parnell 2025-07-04 19:46:53 +02:00
  • ffe00ef77a [Misc] Small: Remove global media connector. Each test should have its own test connector object. (#20395) Chenheli Hua 2025-07-04 08:15:03 -07:00
  • 5561681d04 [CI] add kvcache-connector dependency definition and add into CI build (#18193) Peter Pan 2025-07-04 21:49:18 +08:00
  • fbd62d8750 [Doc] Fix classification table in list of supported models (#20489) Cyrus Leung 2025-07-04 21:08:02 +08:00
  • 2e26f9156a [Model][3/N] Automatic conversion of CrossEncoding model (#20168) wang.yuqi 2025-07-04 20:47:39 +08:00
  • 9e5452ee34 [Bug][Frontend] Fix structure of transcription's decoder_prompt (#18809) sangbumlikeagod 2025-07-04 20:28:07 +09:00
  • 0e3fe896e2 Support Llama 4 for fused_marlin_moe (#20457) Michael Goin 2025-07-04 16:55:10 +09:00
  • 1caca5a589 [Misc] Add SPDX-FileCopyrightText (#20428) Jee Jee Li 2025-07-04 15:40:42 +08:00
  • 783921d889 [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels (#20331) Wentao Ye 2025-07-04 03:06:24 -04:00
  • 4a98edff1f [Structured Outputs][V1] Skipping with models doesn't contain tokenizers (#20365) Aaron Pham 2025-07-04 03:05:49 -04:00
  • a7bab0c9e5 [Misc] small update (#20462) Reid 2025-07-04 11:33:44 +08:00
  • 25950dca9b Add ignore consolidated file in mistral example code (#20420) 汪志鹏 2025-07-04 10:55:07 +08:00
  • a4113b035c [Platform] Add custom default max tokens (#18557) Gabriel Marinho 2025-07-03 23:50:17 -03:00
  • 7e1665b089 [Misc] Change warn_for_unimplemented_methods to debug (#20455) Michael Goin 2025-07-04 11:35:08 +09:00
  • 8d1096e7db [Bugfix] Register reducer even if transformers_modules not available (#19510) Seiji Eicher 2025-07-03 15:08:12 -07:00
  • 8d775dd30a [Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning (#20400) Nicolò Lucchesi 2025-07-03 23:56:09 +02:00
  • 78fe77534b [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. (#18864) bnellnm 2025-07-03 17:55:40 -04:00
  • 2f2fcb31b8 [Misc] Remove _maybe_ignore_quant_config from GLM4.1v (#20432) (tag: v0.9.2rc1) Yuxuan Zhang 2025-07-04 05:41:13 +08:00
  • 1dba2c4ebe [Misc] adjust for ipv6 for mookcacke url parse (#20107) Ning Xie 2025-07-04 04:27:17 +08:00
  • 71d6de3a26 [Misc] Clean up InternVL family config registration (#19992) Isotr0py 2025-07-04 04:01:47 +08:00
  • 536fd33003 [CI] Trimming some failing test groups from AMDPRODUCTION. (#20390) Alexei-V-Ivanov-AMD 2025-07-03 10:21:31 -05:00
  • 619b9f5c7e [Frontend] fix duplicate output for bench subcmd (#20446) Reid 2025-07-03 23:02:06 +08:00
  • d1b689c445 [Bugfix] Fix flaky test_streaming_response test (#20363) Nicolò Lucchesi 2025-07-03 16:46:24 +02:00
  • 9854dc9040 [Frontend] improve vllm bench <bench_type> --help display (#20430) Reid 2025-07-03 22:22:16 +08:00
  • ff5c60fad8 [Misc] Automatically tag PRs to add new models (#20222) Isotr0py 2025-07-03 22:11:03 +08:00
  • 6f1229f91d [Model][2/N] Automatic conversion of CrossEncoding model (#19978) wang.yuqi 2025-07-03 21:59:23 +08:00
  • 1819fbda63 [Quantization] Bump to use latest bitsandbytes (#20424) Jee Jee Li 2025-07-03 21:58:46 +08:00
  • 7f0367109e [CI/Build][CPU] Enable cross compilation in CPU release pipeline (#20423) Li, Jiang 2025-07-03 20:26:12 +08:00
  • fb14d53cf6 [Kernel] refactor cpu worker v0 cache dtype (#20080) Ning Xie 2025-07-03 16:39:14 +08:00
  • b024a42e93 [Core] Move multimodal placeholder from chat utils to model definition (#20355) Cyrus Leung 2025-07-03 16:18:30 +08:00
  • cb97f2bfc5 [Docs] Replace two list with tables in intel_gaudi.md (#20414) Michael Yao 2025-07-03 15:48:25 +08:00
  • 359200f6ac [doc] fix link (#20417) Reid 2025-07-03 15:21:57 +08:00
  • 220aee902a [Misc] Add rules to label Speculative Decoding Related PRs (#20406) Lifans 2025-07-02 23:56:49 -07:00
  • 67d25eca05 [Tests] Update online DP tests to verify that requests are balanced (#20157) Nick Hill 2025-07-03 07:49:13 +01:00
  • 363528de27 [Feature] Support MiniMax-M1 function calls features (#20297) qscqesze 2025-07-03 14:48:27 +08:00
  • 4ff61ababa [TPU] Add a case to cover RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 (#20385) QiliangCui 2025-07-02 23:46:41 -07:00
  • 0ec3779df7 [Bugfix][CI/CD][CPU] Fix CPU CI tests (#20383) Li, Jiang 2025-07-03 11:11:36 +08:00
  • b616f6a53d [Misc] Small: Fix video loader return type annotations. (#20389) Chenheli Hua 2025-07-02 20:10:39 -07:00
  • 2e25bb12a8 [Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py (#20381) bnellnm 2025-07-02 22:07:43 -04:00
  • 9965c47d0d Enable CPU nightly performance benchmark and its Markdown report (#18444) Louie Tsai 2025-07-02 18:50:25 -06:00
  • 059d4cdb49 [BugFix] Fix DP headless mode arg validation (#20398) Nick Hill 2025-07-03 01:15:32 +01:00
  • bdb84e26b0 [Bugfix] Fixes for FlashInfer's TORCH_CUDA_ARCH_LIST (#20136) Tyler Michael Smith 2025-07-02 20:15:11 -04:00
  • 3dd359147d [Docs] Update EAGLE example (#20375) Nicolò Lucchesi 2025-07-03 02:13:51 +02:00
  • 657f2f301a [DP] Support external DP Load Balancer mode (#19790) Nick Hill 2025-07-02 18:21:52 +01:00
  • a1aafc827a [ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) (#20254) vllmellm 2025-07-03 00:25:46 +08:00
  • 139508a418 [Misc] add handler HF_TOKEN is emptry string (#20369) rongfu.leng 2025-07-03 00:14:31 +08:00
  • d265414dbc [Minor] Clean up incorrect comment in test (#20382) Nick Hill 2025-07-02 17:13:37 +01:00
  • 48fb076cbc [V1] LogitsProcessor programming model (#16728) afeldman-nm 2025-07-02 12:10:42 -04:00
  • c1909e7e8c [Kernels] MoE refactor (#19636) bnellnm 2025-07-02 09:08:27 -04:00
  • b95877509b Documentation update tool_calling: mapping back to function from response (#20373) cronoik-inceptionai 2025-07-02 16:55:49 +04:00
  • 706ff13224 [Model] Adds support for SlimMoE models Phi-tiny-MoE-instruct (#20286) zichongli5 2025-07-02 05:54:12 -07:00
  • ccbfb1d1c9 [Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models (#20322) WangHuaqiang 2025-07-02 20:53:36 +08:00
  • 9e5552aa13 [NVIDIA] Support Cutlass w8a8 FP8 for Blackwell Geforce GPUs (sm120) (#17280) Joonchen Liau 2025-07-02 20:47:19 +08:00
  • 0c600b9ab6 [Build/CI] Automatically tag DeepSeek related PRs (#20370) Lu Fang 2025-07-02 20:02:43 +09:00
  • e303dcf523 [Model] Add Ernie4.5 and Ernie4.5MoE Model Support (#20220) CSWYF3634076 2025-07-02 18:37:01 +08:00
  • ae9c4d416f [Docs] Make TPU ref prettier in google_tpu.md (#20356) Michael Yao 2025-07-02 17:04:08 +08:00
  • d853520b3e [Docs] Fix indentations for 2-level items in deprecation_policy.md (#20352) Michael Yao 2025-07-02 14:50:31 +08:00
  • ba51aea65e [Bugfix] Keye-VL compatibility with tok_kwargs (#20058) (#20353) Cyrus Leung 2025-07-02 14:46:59 +08:00
  • 8452946c06 [Model][VLM] Support Keye-VL-8B-Preview (#20126) Kwai-Keye 2025-07-02 14:35:04 +08:00
  • 2e7cbf2d7d [Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. (#20105) Chenheli Hua 2025-07-01 23:34:03 -07:00
  • 7da296be04 [TPU] kv cache update kernel supports dynamic grid (#20235) Chengji Yao 2025-07-01 23:33:37 -07:00
  • b205e8467d [Doc][TPU] Add models and features supporting matrix. (#20230) QiliangCui 2025-07-01 23:33:20 -07:00
  • be0cfb2b68 fix[Docs]: link anchor is incorrect #20309 (#20315) yyzxw 2025-07-02 14:32:34 +08:00
  • 1a03dd496b [Bugfix] Fix dynamic rotary embedding (#20343) Cyrus Leung 2025-07-02 14:31:26 +08:00
  • 27b8017636 [FIX][Intel GPU]fix ipex flash_attn_varlen_func api missing parameter (#20348) Kunshang Ji 2025-07-02 13:26:40 +08:00
  • 9ec1e3065a [Misc][Doc] Add missing comment for LLM (#20285) Lifans 2025-07-01 19:04:24 -07:00
  • 9dae7d46bf [Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON (#20334) Wentao Ye 2025-07-01 22:03:43 -04:00
  • 7058d7dd5d [Refactor] Remove duplicate find_free_port (#20333) Wentao Ye 2025-07-01 22:03:07 -04:00
  • a0389e0554 [UT][intel GPU] use current_platform instead of device hardcode in v1 tests (#20169) Liangliang Ma 2025-07-02 09:06:04 +08:00
  • 3be8d312a2 [Kernel][Bugfix] Fixup some warnings in nvfp4_blockwise_moe when CUDA < 12.8 (#20324) Tyler Michael Smith 2025-07-01 21:05:47 -04:00
  • 3abfe22154 Enable group size 64 for Machete (#20290) czhu-cohere 2025-07-01 18:05:44 -07:00
  • e81fbefe8a [Refactor] Refactor import utils (#20269) Wentao Ye 2025-07-01 21:05:42 -04:00
  • 9290de5667 remove unused variables in marlin_template.h (#20236) 周周周 2025-07-02 08:51:52 +08:00
  • 7f280d69c9 [Optimization] Cache sampled token ids in model runner (#20291) Woosuk Kwon 2025-07-01 11:01:31 -07:00
  • 02cabff207 [V1] [ROCm] Enable EP with AITER Fused MoE (#20270) TJian 2025-07-01 09:48:30 -07:00
  • 3d19d47d91 [Frontend] Expand tools even if tool_choice="none" (#17177) Shintarou Okada 2025-07-02 01:47:38 +09:00
  • 8acb4badee [CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling (#20301) Woosuk Kwon 2025-07-01 09:07:36 -07:00
  • 314af8617c [Docs] Update transcriptions API to use openai client with stream=True (#20271) Nicolò Lucchesi 2025-07-01 17:47:13 +02:00
  • 0e96cc9b7e [Misc] Minor refactoring for scheduler (#20299) Woosuk Kwon 2025-07-01 07:55:32 -07:00
  • ecad851cbd [Model]Add Tencent HunYuanMoEV1 Model Support (#20114) aiyiwang2025 2025-07-01 22:28:13 +08:00
  • ed70f3c64f Add GLM4.1V model (Draft) (#19331) Yuxuan Zhang 2025-07-01 20:48:26 +08:00
  • 650d5dbd04 [Misc] Minor refactor of NIXL background handshake (#20068) Nicolò Lucchesi 2025-07-01 13:40:14 +02:00
  • 9025a9a705 [Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper (#20046) Kyle Sayers 2025-07-01 06:20:34 -04:00
  • c05596f1a3 [Perf] Validate @config in pre-commit instead of dynamically (#20200) Lionel Villard 2025-07-01 05:10:28 -04:00
  • 787b13389e [doc] fix the incorrect logo in dark mode (#20289) Reid 2025-07-01 16:18:09 +08:00
  • 96453cfa83 [BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine (#19067) TY-AMD 2025-07-01 16:12:19 +08:00
  • b1c1fe35a5 [Misc] remove redundant char (#20287) Kebe 2025-07-01 15:33:22 +08:00