Commit Graph

  • ab714131e4 [Doc] Update compatibility matrix for pooling and multimodal models (#21831) Cyrus Leung 2025-07-29 21:29:51 +08:00
  • 755fa8b657 [KVCache] Make KVCacheSpec hashable (#21791) Chen Zhang 2025-07-29 04:58:29 -07:00
  • 2470419119 [Docs] Fix the outdated URL for installing from vLLM binaries (#21523) Kay Yan 2025-07-29 19:56:27 +08:00
  • 61a6905ab0 [Model] Refactor JambaForCausalLM (#21394) Jee Jee Li 2025-07-29 18:25:07 +08:00
  • 37efc63b64 [V0 deprecation] Guided decoding (#21347) Reza Barazesh 2025-07-29 03:15:30 -07:00
  • a4528f0cac [Model]: Fused MoE for nomic-embed-text-v2-moe (#18321) Isotr0py 2025-07-29 18:13:27 +08:00
  • a2480251ec [Doc] Link to RFC for pooling optimizations (#21806) Cyrus Leung 2025-07-29 14:53:18 +08:00
  • 7234fe2685 [Misc] Rework process titles (#21780) Nick Hill 2025-07-29 06:14:47 +01:00
  • f1e2c095ec Migrate InternVLImageInputs and InternVLVideoInputs to TensorSchema (#21684) Benji Beck 2025-07-28 22:09:45 -07:00
  • 12a223ef9b [AMD][CI/Build][Bugfix] Guarding CUDA specific functions by ifndef ROCM (#21766) Gregory Shtrasberg 2025-07-28 23:35:37 -04:00
  • e18f085103 skip fusedmoe layer for start_load_kv (#21378) Calvin Chen 2025-07-29 09:59:44 +08:00
  • afa2607596 [CI] Parallelize Kernels MoE Test (#21764) Michael Goin 2025-07-28 21:56:24 -04:00
  • 48b763d6b5 [Refactor] Merge Compressed Tensor FP8 CompressedTensorsW8A8Fp8MoEMethod and CompressedTensorsW8A8Fp8MoECutlassMethod (#21775) Wentao Ye 2025-07-28 21:47:21 -04:00
  • 947e982ede [Docs] Minimize spacing for supported_hardware.md table (#21779) Michael Goin 2025-07-28 21:46:39 -04:00
  • c6c9122d50 [Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuning (#20396) lyrisz 2025-07-28 16:13:58 -07:00
  • 8aa1485fcf [Perf] Disable chunked local attention by default with llama4 (#21761) Lucas Wilkinson 2025-07-28 18:49:04 -04:00
  • 89ac266b26 [Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels (#17112) Nikhil Gupta 2025-07-28 21:55:15 +01:00
  • c6f36cfa26 [Bugfix] DeepGEMM is not enabled on B200 due to _lazy_init() (#21472) Clayton Coleman 2025-07-28 16:51:22 -04:00
  • b18b417fbf Revert "[V1] Exception Handling when Loading KV Cache from Remote Store" (#21778) Kuntai Du 2025-07-28 13:15:18 -07:00
  • 9ba1c88a93 [AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure (#21647) Lu Fang 2025-07-28 13:11:16 -07:00
  • e0e58f9729 [Bug] Enforce contiguous input for dynamic_scaled_fp8_quant and static_scaled_fp8_quant (#21773) Wentao Ye 2025-07-28 15:55:48 -04:00
  • b361f14e39 [AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes (1-4) due to torch.compile (#21350) rasmith 2025-07-28 14:38:20 -05:00
  • 01c753ed98 update flashinfer to v0.2.9rc2 (#21701) weiliang 2025-07-29 03:31:47 +08:00
  • 94b71ae106 Use metavar to list the choices for a CLI arg when custom values are also accepted (#21760) Harry Mellor 2025-07-28 20:31:10 +01:00
  • 7d44c691b0 [P/D] Log warnings related to prefill KV expiry (#21753) Nick Hill 2025-07-28 19:40:53 +01:00
  • e17a4d3bf9 [Bugfix] Fix granite speech shape validation (#21762) Cyrus Leung 2025-07-29 02:19:21 +08:00
  • ec261b0291 [XPU] IPEX-optimized Punica Wrapper on XPU (#21703) Chaojun Zhang 2025-07-29 00:43:37 +08:00
  • 04fe61aa3d [CI/Build] Fix plugin tests (#21758) Cyrus Leung 2025-07-28 23:08:05 +08:00
  • 25708d317a [Bugfix] Mistral crashes on tool with no description (#21167) Michard Hugo 2025-07-28 17:03:35 +02:00
  • 0e18a5d058 [Misc] Reduce logs for model resolution (#21765) Cyrus Leung 2025-07-28 22:59:56 +08:00
  • 34a20c49b3 [Logs] Change flashinfer sampler logs to once (#21759) Michael Goin 2025-07-28 09:59:51 -04:00
  • 31084b3b1f [Bugfix][CI/Build] Update peft version in test requirement (#21729) Isotr0py 2025-07-28 21:17:43 +08:00
  • bccc43c033 [Bugfix]check health for engine core process exiting unexpectedly (#21728) wuhang 2025-07-28 21:17:31 +08:00
  • 1395dd9c28 [Docs] Add revision date to rendered docs (#21752) Harry Mellor 2025-07-28 14:12:46 +01:00
  • 9ace2eaf35 [Bugfix] Improve JSON extraction in LlamaToolParser (#19024) Keyang Ru 2025-07-28 05:36:58 -07:00
  • 656c24f1b5 [Ernie 4.5] Name Change for Base 0.3B Model (#21735) Anton Vlasjuk 2025-07-28 14:22:32 +02:00
  • 63fe3a700f [PD] let p2p nccl toy proxy handle /chat/completions (#21734) Chauncey 2025-07-28 19:45:50 +08:00
  • 0ae970ed15 [Bugfix] Fix glm4.1v video_grid_thw tensor shape scheme (#21744) Isotr0py 2025-07-28 19:26:49 +08:00
  • 65e8466c37 [Bugfix] Fix environment variable setting in CPU Dockerfile (#21730) Li, Jiang 2025-07-28 19:02:39 +08:00
  • 1b769dccf3 [Bugfix] Fix Ernie4_5_MoeForCausalLM shared experts (#21717) Jee Jee Li 2025-07-28 19:02:25 +08:00
  • 2cc571199b [feature] add log non default args in LLM (#21680) rongfu.leng 2025-07-28 17:21:22 +08:00
  • a4ed731546 [Model] Prioritize Transformers fallback over suffix matching (#21719) Cyrus Leung 2025-07-28 17:15:31 +08:00
  • d128d0d554 Migrate KeyeImageInputs and KeyeVideoInputs to TensorSchema (#21686) Benji Beck 2025-07-28 01:16:35 -07:00
  • a6c050286a [v1][mamba] Added mamba_type into MambaSpec (#21715) Asaf Joseph Gardin 2025-07-28 11:15:55 +03:00
  • 139a7f07bd [BugFix] Fix ChunkedLocalAttention when the hybrid kv-cache is disabled (#21707) Lucas Wilkinson 2025-07-28 03:18:47 -04:00
  • 150d9e6337 [Bugfix] fix max-file-size type from str to int (#21675) Ning Xie 2025-07-28 15:06:52 +08:00
  • 139a97ec56 [Bugfix] Fix shape checking for Fuyu (#21709) Cyrus Leung 2025-07-28 15:05:56 +08:00
  • 18cc33dd60 [bugfix] fix profile impact benchmark results (#21507) rongfu.leng 2025-07-28 13:44:24 +08:00
  • 7656cf4cf3 [Bugfix] [issue-21565] Fix the incompatibility issue with stream and named function calling when Thinking is disabled (#21573) Hongsheng Liu 2025-07-28 13:43:50 +08:00
  • 3ea57a56d9 Migrate Idefics3ImagePixelInputs and Idefics3ImageEmbeddingInputs to … (#21683) Benji Beck 2025-07-27 22:37:23 -07:00
  • 75856bc2cb Migrate GraniteSpeechAudioInputs to TensorSchema (#21682) Benji Beck 2025-07-27 22:37:20 -07:00
  • 304dcdf575 Migrate GLMVImagePixelInputs to TensorSchema (#21679) Benji Beck 2025-07-27 22:36:11 -07:00
  • 88e46c7c8d Migrate Glm4vImageInputs, Glm4vVideoInputs to TensorSchema (#21678) Benji Beck 2025-07-27 22:36:08 -07:00
  • d8937de4c8 Migrate Gemma3ImagePixelInputs to TensorSchema (#21676) Benji Beck 2025-07-27 22:36:05 -07:00
  • e626d286f5 [FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel (#21242) TJian 2025-07-27 22:07:06 -07:00
  • c7ffe93d9c [Model] Support TP/PP/mamba2 kernel for PLaMo2 (#19674) Shinichi Hemmi 2025-07-28 14:00:47 +09:00
  • 15a72ac478 [V1] Exception Handling when Loading KV Cache from Remote Store (#21534) Adeline 2025-07-28 11:34:17 +08:00
  • 04ff4be310 [Misc] Add fused_moe configs for Qwen3-Coder-480B-A35B-Instruct-FP8 (#21700) Jee Jee Li 2025-07-28 11:12:18 +08:00
  • 93269bb43e Fix GLM tool parser (#21668) Yuxuan Zhang 2025-07-28 10:46:38 +08:00
  • 82acf2184d Fix typo for limit-mm-per-prompt in docs (#21697) Joachim Studnia 2025-07-27 19:45:37 -07:00
  • 86ae693f20 [Deprecation][2/N] Replace --task with --runner and --convert (#21470) Cyrus Leung 2025-07-28 10:42:40 +08:00
  • 8f605ee309 [Attention] Make CutlassMLA the default backend for SM100 (blackwell) (#21626) Alexander Matveev 2025-07-27 16:13:00 -04:00
  • a9b2a1d704 [Misc] Refactor vllm config str (#21666) Ning Xie 2025-07-28 00:51:44 +08:00
  • 57c22e57f9 Fix CUDA permute/unpermute for use with DeepGemm Moe (#17934) Caleb_Du 2025-07-27 22:08:00 +08:00
  • bda9d0535f [Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor (#21631) Wentao Ye 2025-07-27 08:25:21 -04:00
  • 3d847a3125 [VLM] Add video support for Intern-S1 (#21671) Isotr0py 2025-07-27 19:49:43 +08:00
  • 5f8c9a425e Migrate Florence2ImagePixelInputs to TensorSchema (#21663) Benji Beck 2025-07-27 02:43:02 -07:00
  • 1cbf951ba2 [Misc] add default value for file pattern arg (#21659) Ning Xie 2025-07-27 13:14:51 +08:00
  • a8936e5193 Refactor: Remove numpy dependency from LoggingStatLogger (#20529) ZiTian.Zhao 2025-07-27 12:06:21 +08:00
  • 01a395e9e7 [CI/Build][Doc] Clean up more docs that point to old bench scripts (#21667) Ye (Charlotte) Qi 2025-07-26 21:02:12 -07:00
  • 971948b846 Handle non-serializable objects in vllm bench (#21665) Huy Do 2025-07-26 20:35:22 -07:00
  • eed2f463b2 [VLM] Support HF format Phi-4-MM model (#17121) Isotr0py 2025-07-27 11:07:57 +08:00
  • 20950b29fb Migrate ChameleonImagePixelInputs to TensorSchema (#21657) Benji Beck 2025-07-26 19:34:25 -07:00
  • 3339cba3ff Migrate FuyuImagePatchInputs to TensorSchema (#21662) Benji Beck 2025-07-26 19:34:14 -07:00
  • 0b8caf9095 Migrate DeepseekVL2ImageInputs to TensorSchema (#21658) Benji Beck 2025-07-26 19:34:11 -07:00
  • ccf27cc4d4 Migrate Blip2ImagePixelInputs and Blip2ImageEmbeddingInputs to TensorSchema (#21656) Benji Beck 2025-07-26 19:33:52 -07:00
  • c657369841 support torch.compile for bailing moe (#21664) Jinzhen Lin 2025-07-27 07:54:32 +08:00
  • 6c66f28fa5 Remove xformers requirement for Mistral-format Pixtral and Mistral3 (#21154) Wenchen Lo 2025-07-26 16:20:29 -07:00
  • de509ae8eb [NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels (#21411) Kaixi Hou 2025-07-26 07:10:36 -07:00
  • e7c4f9ee86 [CI/Build][Doc] Move existing benchmark scripts in CI/document/example to vllm bench CLI (#21355) Ye (Charlotte) Qi 2025-07-26 07:10:14 -07:00
  • 9094d11c5d [Bugfix][Apple Silicon] fix missing symbols when build from source on Mac with Apple Silicon (#21380) Yeju Zhou 2025-07-26 22:09:57 +08:00
  • 56e544f24b [Refactor] Remove moe_align_block_size_triton (#21335) Wentao Ye 2025-07-26 10:08:29 -04:00
  • 97d6c30cc9 [BugFix] Fix shared storage connector load kv only load attention layer (#21428) WeiQing Chen 2025-07-26 22:07:40 +08:00
  • a40a8506df [Misc] Improve memory profiling debug message (#21429) Ye (Charlotte) Qi 2025-07-26 07:07:21 -07:00
  • c215f5c877 [Bug] Fix has_flashinfer_moe Import Error when it is not installed (#21634) Wentao Ye 2025-07-26 10:06:14 -04:00
  • 1cd6eaba54 Support encoder-only models without KV-Cache (#21270) Maximilien de Bayser 2025-07-26 10:09:52 -03:00
  • f27fdfc3ed [Bugfix] Investigate Qwen2-VL failing test (#21527) Isotr0py 2025-07-26 21:09:29 +08:00
  • de10ff0b7c Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation (#21622) Benji Beck 2025-07-26 06:08:18 -07:00
  • 9d197280fa Migrate AriaImagePixelInputs to TensorSchema for shape validation (#21620) Benji Beck 2025-07-26 06:08:15 -07:00
  • e98def439c [Take 2] Correctly kill vLLM processes after benchmarks (#21646) Huy Do 2025-07-26 06:06:05 -07:00
  • 05c1126f29 [Misc] remove unused try-except in pooling config check (#21618) Reid 2025-07-26 20:20:03 +08:00
  • 875af38e01 Support Intern-S1 (#21628) Lyu Han 2025-07-26 19:14:04 +08:00
  • 7728dd77bb [TPU][Test] Divide TPU v1 Test into 2 parts. (#21431) QiliangCui 2025-07-25 23:20:30 -07:00
  • 2f6e6b33fb [Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison (#21612) Alexandre JUAN 2025-07-26 05:11:10 +02:00
  • a55c95096b Correctly kill vLLM processes after finishing serving benchmarks (#21641) Huy Do 2025-07-25 19:06:21 -07:00
  • 97349fe2bc [Docs] add offline serving multi-modal video input example Qwen2.5-VL (#21530) WeiQing Chen 2025-07-26 09:37:32 +08:00
  • 62965de5fe [Model] Ultravox: Support Llama 4 and Gemma 3 backends (#17818) Farzad Abdolhosseini 2025-07-26 04:12:31 +03:00
  • 7ae75fa6d0 [Feature] Add support for MoE models in the calibration-free RTN-based quantization (#20766) Alex Kogan 2025-07-25 21:09:34 -04:00
  • f1b286b2fb [TPU] Update ptxla nightly version to 20250724 (#21555) Chengji Yao 2025-07-25 17:09:00 -07:00
  • c7742d6113 [Bugfix] Always set RAY_ADDRESS for Ray actor before spawn (#21540) Rui Qiao 2025-07-25 17:08:30 -07:00