Commit Graph

  • 6d0cf239c6 [CI/Build] Add Transformers nightly tests in CI (#20924) Isotr0py 2025-07-15 00:33:17 +08:00
  • 3fc964433a [Misc] Clean up Aimv2 config registration in Ovis config (#20921) Isotr0py 2025-07-14 23:36:43 +08:00
  • 0caf61c08a [CI] Update codeowner for compilation code (#20929) Lu Fang 2025-07-14 08:33:19 -07:00
  • 667624659b [CI] cc folks on changes to vllm/compilation (#20925) Richard Zou 2025-07-14 10:52:17 -04:00
  • 38efa28278 [Model] Add Ling implementation (#20680) ant-yy 2025-07-14 22:10:32 +08:00
  • e8cc53af5e [Misc] Log the reason for falling back to FlexAttention (#20699) Cyrus Leung 2025-07-14 19:16:51 +08:00
  • a4851cfe68 [Bugfix]: Fix messy code when using logprobs (#20910) Chauncey 2025-07-14 19:06:45 +08:00
  • 9887e8ec50 [Misc] Remove unused function (#20909) Reid 2025-07-14 18:48:55 +08:00
  • f326ab9c88 [Bugfix] Bump up mistral_common to support v13 tokenizer (#20905) 22quinn 2025-07-14 03:45:03 -07:00
  • dcf2a5e208 [CI/Build] Fix OOM issue in Jina-VL test (#20907) Cyrus Leung 2025-07-14 18:32:35 +08:00
  • 1e9438e0b0 [MISC] Move bind_kv_cache to worker module (#20900) wangxiyuan 2025-07-14 17:40:00 +08:00
  • 697ef765ee [Refactor][V1] Move outlines utils for V1 imports (#20878) Aaron Pham 2025-07-14 03:58:35 -04:00
  • a99b9f7dee [Quantization] add BNB for MixtralForCausalLM (#20893) Jee Jee Li 2025-07-14 15:34:34 +08:00
  • c488b928a7 [ROCm] [Bugfix] [Critical]: Fix mamba compilation bug (#20883) TJian 2025-07-14 00:23:28 -07:00
  • 2c7fa47161 Fix: Add missing EOFError handling in CLI complete command (#20896) Reid 2025-07-14 15:09:57 +08:00
  • 88fc8a97e3 Removing redundant python version check (#20888) Daniel song 2025-07-14 02:15:05 -04:00
  • 66f6fbd393 [Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) (#20511) Maroon Ayoub 2025-07-14 05:45:31 +03:00
  • 8632e831ba [Core] Add update_config RPC method (#20095) 22quinn 2025-07-13 17:49:18 -07:00
  • 4bbfc36b16 [V1] Hybrid allocator without prefix caching (#20661) nopperl 2025-07-14 01:55:14 +09:00
  • 80d38b8ac8 [V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs (#20880) TJian 2025-07-13 08:19:32 -07:00
  • 211b6a6113 [Bugfix] fix define of RerankDocument (#20877) Liuchenlong 2025-07-13 22:32:40 +08:00
  • 247102f07f [Bugfix] Fix: add patch_rope_scaling after hf override (#20857) Wang Siyuan 2025-07-13 15:13:25 +08:00
  • bd4c1e6fdb Support for LlamaForSequenceClassification (#20807) Minkyu Kim 2025-07-13 16:09:34 +09:00
  • 99b4f080d8 Renable google/gemma-3-1b-it accuracy test. (#20866) QiliangCui 2025-07-12 21:48:56 -07:00
  • 020f58abcd [Core] Support multiple tasks per model (#20771) Nicolò Lucchesi 2025-07-13 04:40:11 +02:00
  • c1acd6d7d4 [Refactor] Change the way of import triton (#20774) Wentao Ye 2025-07-12 22:39:55 -04:00
  • 3b3b778d4a [Bugfix] Fix a couple PPLX+CUTLASS MoE bugs (#20825) ElizaWszola 2025-07-13 04:39:14 +02:00
  • 42d440c22b [Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant (#20841) Wentao Ye 2025-07-12 22:38:45 -04:00
  • f45a332886 [Sched] Enhance the logic to remove stopped requests from queues (#20739) Woosuk Kwon 2025-07-12 15:33:13 -07:00
  • 6e2c176e1f [Bugfix] Restrict Machete to only run on Hopper (#20830) Michael Goin 2025-07-13 02:34:40 +09:00
  • a86754a12b [docs] convert supported configs to table (#20858) Reid 2025-07-12 21:54:50 +08:00
  • c2a2f19aba [Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models (#20843) Alex Brooks 2025-07-12 07:11:30 -06:00
  • 2c11a738b3 [Model] New model support for microsoft/Phi-4-mini-flash-reasoning (#20702) Congcong Chen 2025-07-12 06:02:10 -07:00
  • b639327ad9 Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694" (#20853) Michael Goin 2025-07-12 15:07:35 +09:00
  • 4afe687a82 Enable ModelOpt Llama4 fp8 checkpoint deployment (#20419) Zhiyu 2025-07-11 23:07:16 -07:00
  • 5de8d9f111 Remove extra tensor on CPU (#20693) Maximilien de Bayser 2025-07-12 03:06:34 -03:00
  • c1c8ca57ff [cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile (#20790) Boyuan Feng 2025-07-11 23:06:13 -07:00
  • a3a5a47e48 [Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 (#20823) Richard Zou 2025-07-12 02:06:04 -04:00
  • fb25e95688 [Docs] Update basic.md (#20846) Lucia Fang 2025-07-12 14:05:32 +08:00
  • 0d4891cd03 [Bug] Fix DeepGemm for EP low latency case (#20833) Wentao Ye 2025-07-12 02:05:12 -04:00
  • f56d2996ca [Misc] Respect no_use_tqdm_on_load flag while capturing CUDA graph (#20834) lkchen 2025-07-11 23:04:45 -07:00
  • 147afb448b [Bugfix] Replace unavailable video url in multimodal test (#20854) Isotr0py 2025-07-12 13:25:39 +08:00
  • 3c7d942da8 [Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models (#20637) Nicolò Lucchesi 2025-07-12 06:33:26 +02:00
  • 890323dc1b [Bugfix] : Fix typo - logger.warn_once -> logger.warning_once (#20852) Varun Sundar Rabindranath 2025-07-12 07:56:24 +04:00
  • 01cae37713 [CI/Build] Ensure compatability with Transformers v4.53 (#20541) Isotr0py 2025-07-12 11:53:07 +08:00
  • 11c0198615 [Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading (#20682) yurhett 2025-07-12 11:52:43 +08:00
  • b1235c3e10 [Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices (#20822) Li, Jiang 2025-07-12 11:52:05 +08:00
  • 44d02f54db [Misc] Restrict deep_gemm's log output (#20827) Jee Jee Li 2025-07-12 11:50:42 +08:00
  • a8593237c0 Add pynccl all-gatherv and reducescatterv (#20154) Trevor Morris 2025-07-11 18:59:23 -07:00
  • fc0f41d10a Integration SM100 FlashInfer fused allreduce RMSNorm (#20691) Ilya Markov 2025-07-12 03:58:15 +02:00
  • 7b828e30d5 [CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' (#20845) Wentao Ye 2025-07-11 21:57:24 -04:00
  • 5f0af36af5 Update kimi-k2 tool calling docs, enable unit tests (#20821) bigmoyan 2025-07-12 04:16:14 +08:00
  • 0d21b2664c [Bugfix] Fix OOM in language generation test (#20814) Isotr0py 2025-07-12 02:21:52 +08:00
  • 9907fc4494 [Docs] Data Parallel deployment documentation (#20768) Nick Hill 2025-07-11 17:42:10 +01:00
  • d47661f0cd [Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM (#20646) Michael Goin 2025-07-12 01:05:33 +09:00
  • 53fa457391 [Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility (#20449) Varun Sundar Rabindranath 2025-07-11 10:51:46 -04:00
  • 6fb162447b [doc] fix ordered list issue (#20819) Reid 2025-07-11 21:49:46 +08:00
  • 66177189c5 [Bugfix] Add missing field to TritonLanguagePlaceholder (#20812) Li, Jiang 2025-07-11 20:25:11 +08:00
  • b4f0b5f9aa Temporarily suspend google/gemma-3-1b-it. (#20722) QiliangCui 2025-07-11 04:21:26 -07:00
  • cbd14ed561 [Bugfix] Refactor /invocations to be task-agnostic (#20764) Cyrus Leung 2025-07-11 18:20:54 +08:00
  • 7bd4c37ae7 [Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). (#19825) Pavani Majety 2025-07-11 02:23:23 -07:00
  • 8020e98c9f [Quantization][1/N] MoE support BNB-Inflight Quantization (#20061) Jee Jee Li 2025-07-11 16:01:13 +08:00
  • 762be26a8e [Bugfix] Upgrade depyf to 0.19 and streamline custom pass logging (#20777) Luka Govedič 2025-07-11 03:15:22 -04:00
  • 6a9e6b2abf [doc] fold long code block (#20795) Reid 2025-07-11 14:16:41 +08:00
  • 5d09152ff1 [V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine (#20660) nopperl 2025-07-11 14:53:31 +09:00
  • 31d5c1797f [Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf (#19830) Luka Govedič 2025-07-11 00:56:28 -04:00
  • 35514b682a [XPU] XCCL support enabled in torch 2.8.0.dev nightly builds (#20705) Ratnam Parikh 2025-07-10 20:39:52 -07:00
  • e2de455c34 [Feature] Integrate SM100 DeepGEMM support (#20087) Wentao Ye 2025-07-10 23:18:05 -04:00
  • 5b032352cc [Attention] MLA - Flashinfer Ragged Prefill (#20034) Alexander Matveev 2025-07-10 23:17:47 -04:00
  • 922f316441 [Model] Support HF format of minimax (#20211) Michael Goin 2025-07-11 11:55:21 +09:00
  • 5923ab9524 [fix]: disable cutlass block scaled group gemm for EP (#20781) Duncan Moss 2025-07-10 19:39:18 -07:00
  • 0cf893cae1 Add kimi-k2 tool parser (#20789) bigmoyan 2025-07-11 10:36:23 +08:00
  • cf75cd2098 [CI Bugfix] Specify same TORCH_CUDA_ARCH_LIST for flashinfer aot and install (#20772) Michael Goin 2025-07-11 10:16:01 +09:00
  • b854321ffe [Docs] Lazy import gguf (#20785) Simon Mo 2025-07-10 16:06:37 -07:00
  • 5b6fe23d05 [Bugfix][Benchmark] Make sure the output length > 0 when testing prefill workload. (#20786) Kuntai Du 2025-07-10 14:52:46 -07:00
  • f0c98cae27 [Misc] MoE ModularKernel : Introduce TopKWeightAndReduce (#20648) Varun Sundar Rabindranath 2025-07-10 17:40:38 -04:00
  • 574ad60db9 [KVConnector] Always call connector clear_metadata() at end of step (#20756) Nick Hill 2025-07-10 22:37:27 +01:00
  • fdadb6f43a [Bugfix] Fused MoE Modular Kernel chunking loop (#20392) Varun Sundar Rabindranath 2025-07-10 16:31:10 -04:00
  • 41060c6e08 [Core] Add Support for Default Modality Specific LoRAs [generate / chat completions] (#19126) Alex Brooks 2025-07-10 14:09:37 -06:00
  • 3de2ed767f [Bugfix] Remove assertion of expert_map being None (#20714) Ming Yang 2025-07-10 12:55:22 -07:00
  • 299252ea82 [CI] Fix pre commit issue (#20782) Wentao Ye 2025-07-10 15:48:13 -04:00
  • d6902ce79f [V0][V1][Core] Add outlines integration for V1, and update V0 integration. (#15975) Nathan Hoos 2025-07-10 14:30:26 -05:00
  • 5e53c89a74 [Bugfix] [CI] Fix Tensorizer LoRA test (#20760) Sanger Steel 2025-07-10 15:07:06 -04:00
  • c66e38ea4c [Test] Remove docker build from test. (#20542) QiliangCui 2025-07-10 11:21:58 -07:00
  • 251595368f Fix DeepSeek-R1-0528 chat template (#20717) sfbemerk 2025-07-10 19:47:36 +02:00
  • 4bed167768 [Model][VLM] Support JinaVL Reranker (#20260) shineran96 2025-07-11 01:43:43 +08:00
  • b140416abf [Model] Add reason parser for Hunyuan A13B Model. (#20625) Asher 2025-07-11 00:33:26 +08:00
  • 5b8366b61a [ROCm][Regression] Remove tensor creation that harms performance on ROCm (#20741) Gregory Shtrasberg 2025-07-10 12:22:23 -04:00
  • c7753a9809 [Hardware][CPU] Vllm int8 quantization enablement for ARM CPU (#14129) nishith-fujitsu 2025-07-10 21:29:04 +05:30
  • 4b9a9435bb Update Dockerfile FlashInfer to v0.2.8rc1 (#20718) Michael Goin 2025-07-11 00:09:02 +09:00
  • 3482fd7e4e [Doc] Add engine args back in to the docs (#20674) Harry Mellor 2025-07-10 16:02:40 +01:00
  • 77f77a951e [Misc] Clean up mark to fork process in BNB tests (#20692) Isotr0py 2025-07-10 21:59:40 +08:00
  • 1a4f35e2ea Normalize lm-eval command between baseline and correctness test (#18560) Michael Goin 2025-07-10 22:27:32 +09:00
  • be1e128dfb [CI Bugfix] Skip failing Tensorizer+LoRA test (#20724) Michael Goin 2025-07-10 21:15:03 +09:00
  • 65393ee064 [doc] fix ordered list (#20749) Reid 2025-07-10 18:13:52 +08:00
  • dc221ad72d [Bugfix][Build][Non-CUDA] Only referencing CMAKE_CUDA_COMPILER_VERSION on CUDA where it is defined (#20738) Gregory Shtrasberg 2025-07-10 05:58:11 -04:00
  • 7571a4a7e5 [CI/Build] Fix Basic Models Test (#20728) Jee Jee Li 2025-07-10 17:57:19 +08:00
  • f67d986dd1 [Misc] loose new-model tagger conditions (#20747) Isotr0py 2025-07-10 17:54:47 +08:00
  • cc876d0f29 [KVConnector] Aggregate finished requests on the scheduler (#19555) Or Ozeri 2025-07-10 11:22:18 +03:00
  • fdfd409f8f [TPU][Core]Make load weight exceed hbm error more instructive for customers (#20644) Chenyaaang 2025-07-10 00:01:17 -07:00