Commit Graph

  • 14acf429ac [EPLB] Remove main waits in case of slow EPLB (#36271) Ilya Markov 2026-03-24 12:50:44 +01:00
  • ce57fd5557 [Docs] Fix build (#37991) Harry Mellor 2026-03-24 10:20:49 +00:00
  • 2e67fa756d Fix tool_parser_cls type annotation from Callable to type[ToolParser] (#37957) Flora Feng 2026-03-24 01:58:27 -04:00
  • e3c6c10cad [KV Offload] Refactor CPU offloading: pluggable CachePolicy, remove Backend abstraction, restructure into cpu/ package (#37874) Ronen Schaffer 2026-03-24 07:02:51 +02:00
  • 16a664df24 [Frontend][Bugfix] Pass default_chat_template_kwargs to AnthropicServingMessages (#37899) jetxa 2026-03-24 13:00:12 +08:00
  • 7281199a8c [release] Move agent queue to Release cluster queues (#37783) Kevin H. Luu 2026-03-23 20:36:47 -07:00
  • b2dd75eb48 Downsize CPU jobs to use small queue (#37913) Kevin H. Luu 2026-03-23 20:36:37 -07:00
  • c59a132f96 [V0 Deprecation] Refactor kv cache from list to element (#37487) Wentao Ye 2026-03-23 23:10:11 -04:00
  • de99d91ece [ROCm][CI] Split Entrypoints Integration (API Server 1) into 3 jobs (#37906) Andreas Karatzas 2026-03-23 20:48:37 -05:00
  • 83c9d525b6 [CI] Add batch invariant test: Block FP8 + small MOE (#37895) Wentao Ye 2026-03-23 21:16:14 -04:00
  • 8f4824b664 [Model Runner V2] Gather multimodal embeddings before draft model postprocess (#37932) Giancarlo Delfin 2026-03-23 18:14:13 -07:00
  • 56777b5c89 [Test] E2E Nemotron-3-Super tests (#36803) roikoren755 2026-03-24 02:49:56 +02:00
  • 2488a82f89 [CI] Split V1 Others into 3 separate jobs (#37016) Kevin H. Luu 2026-03-23 15:44:38 -07:00
  • dc6908ac6a [Bugfix] Register VLLM_BATCH_INVARIANT in envs.py to fix spurious unknown env var warning (#35007) Ranran 2026-03-23 17:31:14 -05:00
  • e85f8f0932 [Bug][MoE] Strengthen _supports_current_device() checks in the TRTLLM FP8, NVFP4, and FlashInfer CuteDSL MoE experts (#36728) yzong-rh 2026-03-23 17:02:57 -04:00
  • 5bf3c42d4c [Bug][MoE] Fix TRTLLM NVFP4 Routing Kernel Precision (#36725) Robert Shaw 2026-03-23 21:19:06 +01:00
  • 38364a7e32 [Sparse24] [Deprecation] Remove Sparse24 CT integration and kernels (#36799) Kyle Sayers 2026-03-23 16:03:29 -04:00
  • fafe76b4af [Async][Spec Decoding] Zero-bubble async scheduling + spec decoding (#32951) Matthew Bonanni 2026-03-23 15:37:22 -04:00
  • ffb5b32b5f [MRV2] Consider spec decoding in warmup (#37812) Woosuk Kwon 2026-03-23 10:45:43 -07:00
  • 91fd695b75 [CI] split Entrypoints Integration (API Server 1) into 3 jobs (#37882) Kunshang Ji 2026-03-24 01:37:56 +08:00
  • 1cbbcfe8a3 [CI][PD] Add Hybrid SSM integration tests to CI (#37657) Nicolò Lucchesi 2026-03-23 16:58:19 +01:00
  • aceadb5ee1 Use lazy graph module during split_module to defer recompile() (#37609) Angela Yi 2026-03-23 08:21:29 -07:00
  • ec2280611a [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding (#37884) Yufeng He 2026-03-23 23:15:12 +08:00
  • 7151ae6528 [Bugfix] RoBERTa position_id accumulation in CUDA graph padding region (#37873) yanghui1-arch 2026-03-23 22:59:21 +08:00
  • 45bd5c8e75 [Mypy] Fix mypy for vllm/config (#37808) Wentao Ye 2026-03-23 10:33:59 -04:00
  • 10a1018c12 [ROCm] fix sleep mode not releasing GPU memory problem on ROCm (#37533) Zhaodong Bing 2026-03-23 21:07:19 +08:00
  • aec2dc6c0d [Bugfix][LoRA] Fix incorrect LoRA Log (#37877) Jee Jee Li 2026-03-23 19:42:52 +08:00
  • 7938d12119 [Bugfix] Fix CPU backend crash in KV cache block zeroing (#37550) DorBernsohn 2026-03-23 13:35:45 +02:00
  • debd6e768c [XPU][MoE Refactor] Refactor xpu mxfp4 support into oracle (#37784) Kunshang Ji 2026-03-23 19:10:41 +08:00
  • 9ace378a63 [Frontend][Responses API] Fix arrival_time recording for TTFT on initial request (#37498) Andrew Xia 2026-03-23 02:58:08 -07:00
  • 27d5ee3e6f [FP8]add FP8 WoQ kernel abstraction. (#32929) Kunshang Ji 2026-03-23 17:47:47 +08:00
  • 35141a7eed [Misc]Update gitignore (#37863) wangxiyuan 2026-03-23 16:14:10 +08:00
  • e99fb98867 [ROCm] Fix fused_moe_fake signature mismatch and other AITER bugs (#36100) Chuan (Richard) Li 2026-03-23 00:48:31 -07:00
  • a16133a0f1 [Perf] [Bugfix] Fix Triton autotuning in inference for Qwen3.5 (#37338) Artem Perevedentsev 2026-03-23 09:37:58 +02:00
  • 54ab804e87 [Bugfix] Store Qwen3Next A_log in fp32 (#37810) Hojin Yang 2026-03-23 16:36:57 +09:00
  • 02e6efe56d [Bugfix] JAIS: Only apply ALiBi when position_embedding_type='alibi' (#37820) r266-tech 2026-03-23 15:36:34 +08:00
  • 410d300893 [ROCm][Refactor] Enable AWQMarlinConfig on ROCm to use choose_mp_linear_kernel (#36505) Matthias Gehre 2026-03-23 08:36:08 +01:00
  • d3fe857135 update doc for online fp8 quantization (#37851) Yan Ma 2026-03-23 13:19:03 +08:00
  • f85e479e66 [Feature] ViT Full CUDA Graph (#35963) Baorun (Lauren) Mu 2026-03-23 01:01:10 -04:00
  • 1f0d210641 [CI/Build][LoRA] Update Qwen35 LoRA testing (#37816) Jee Jee Li 2026-03-23 12:55:49 +08:00
  • 3bbe2e1e6e [Test] Consolidate tool parser unit tests to tests/tool_parsers (#37834) Ben Browning 2026-03-23 00:24:25 -04:00
  • 6e04e79326 always use embed&token_classify for bge-m3 (#37632) Augusto Yao 2026-03-23 11:10:57 +08:00
  • e7767eccae Fix AudioFlamingo3/MusicFlamingo HF parity and RoTE handling (#37643) Lasha Koroshinadze 2026-03-22 22:29:07 -04:00
  • 43877a620b [MRV2] Enable PP CUDA graph test (#37830) Woosuk Kwon 2026-03-22 16:30:25 -07:00
  • 63f49b8bd4 [Model Runner V2] Enable piecewise CUDA graphs for pipeline parallelism (#35162) zhanqiuhu 2026-03-22 16:48:25 -04:00
  • a5e9d511de [MRV2] Use FP64 for Gumbel noise (#37798) Woosuk Kwon 2026-03-22 12:28:10 -07:00
  • c058ff44d4 [Bigfix]fix lora test by pass padded size back to the layer (#37811) Yongye Zhu 2026-03-22 15:20:13 -04:00
  • ce9b1d76cf [MRV2] Skip hidden states allocation for PW CUDA graphs (#37818) Woosuk Kwon 2026-03-22 11:47:21 -07:00
  • e74c17e153 Enable NemotronHPuzzle + NemotronHMTP (#37803) Netanel Haber 2026-03-22 17:13:58 +02:00
  • eaf4978621 [Test] Only Run MLA model when user explicitly set for batch invariance (#37719) Wentao Ye 2026-03-22 09:09:12 -04:00
  • 77d24c4bfe [Bug] Fix fp8 deepgemm batch invariant (#37718) Wentao Ye 2026-03-22 08:57:20 -04:00
  • b3e846017d [Model Runner V2] Support multi-modal embeddings for spec decode model (#36097) Giancarlo Delfin 2026-03-22 02:48:43 -07:00
  • cd1242d82a [ROCm][CI] Stabilize ROCm speech-to-text translation test with lower min acc threshold (#37723) Andreas Karatzas 2026-03-22 04:32:08 -05:00
  • 4383f1532e [MoE] Move PF Methods to Folder (#35927) Robert Shaw 2026-03-22 04:42:59 -04:00
  • 6eedec6e36 [ROCm][CI] Make some duplicated tests optional so that they are only evaluated in our nightly (#37780) Andreas Karatzas 2026-03-22 03:03:18 -05:00
  • ffc8531524 [ROCm][CI] Added missing resampy dependency for MM audio tests (#37778) Andreas Karatzas 2026-03-22 03:02:41 -05:00
  • 6ecba840d7 [ROCm][CI] get_cu_count was renamed to num_compute_units in #35042 (#37764) Andreas Karatzas 2026-03-22 03:02:21 -05:00
  • 3b06c55c78 [ROCm][CI] Fix MEGA_AOT_ARTIFACT fallback when PyTorch < 2.10.0 lacks AOT support (#37763) Andreas Karatzas 2026-03-22 03:02:03 -05:00
  • b050700462 [Perf] Optimize glm4.xv VIT (#37779) Yang Liu 2026-03-22 02:12:34 -04:00
  • 5dac719b2b [Bugfix] Handle libsndfile sf_error(NULL) race condition in audio fallback (#37782) Andreas Karatzas 2026-03-22 00:37:29 -05:00
  • c862481c02 [CI] Skip ISAAC multimodal tests due to broken upstream HF model weights (#37781) Andreas Karatzas 2026-03-22 00:23:32 -05:00
  • c86b17cfe6 [ROCm][CI] Add large_gpu_mark to test_max_tokens_none for ROCm (#37717) Andreas Karatzas 2026-03-21 23:25:16 -05:00
  • 66f927f205 [Bugfix] Fix pooling non-determinism from pinned prompt_lens aliasing (#37775) Andreas Karatzas 2026-03-21 22:22:24 -05:00
  • e78bc74268 [ROCm][CI] close missing quote in kernels/moe block in run-amd-test.sh (#37774) Andreas Karatzas 2026-03-21 20:42:34 -05:00
  • 6b2fa3a762 [MoE] Move FlashInfer CuteDSL experts into fused_moe/experts/ (#37759) Robert Shaw 2026-03-21 19:15:16 -04:00
  • eeee5b262d [Quantization][Deprecation] Remove PTPC FP8 (#32700) Robert Shaw 2026-03-21 18:10:16 -04:00
  • 5ad0446572 Revert "Consolidate AWQ quantization into single awq_marlin.py file" (#37768) Robert Shaw 2026-03-21 17:20:41 -04:00
  • 8cc700dd6a Consolidate AWQ quantization into single awq_marlin.py file Robert Shaw 2026-03-21 17:09:17 -04:00
  • 80b70884eb Add tensor IPC transfer mechanism for multimodal data (#32104) Brandon Pelfrey 2026-03-21 13:10:20 -07:00
  • 61e381dcf0 [Perf] Add SM 10.3 (B300/GB300) all-reduce communicator tuning (#37756) Mohammad Miadh Angkad 2026-03-22 03:43:47 +08:00
  • 88f1b374f5 [Core] Enable allreduce fusion by default for SM 10.3 (B300/GB300) (#37755) Mohammad Miadh Angkad 2026-03-22 03:40:37 +08:00
  • 298e510848 [Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() (#37318) (tag: v0.18.1rc0) Francesco Fusco 2026-03-21 10:29:43 +01:00
  • 3982bc2cd0 [ROCm] Enable DeepEP ROCm as all2allbackend for AMD GPUs. (#34692) Chaitanya Sri Krishna Lolla 2026-03-21 13:02:31 +05:30
  • 02eec7ecbe [ROCm][CI] Update GSM8K eval config to use fp8-and-mixed models list (MI355) (#37721) Andreas Karatzas 2026-03-21 02:27:12 -05:00
  • 17ee641c45 [Responses API] Add kv_transfer_params for PD disaggregation (#37424) Bongwoo Bak 2026-03-21 14:48:54 +09:00
  • 0d50fa1db6 [ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 (#37610) Andreas Karatzas 2026-03-20 23:57:25 -05:00
  • 1fa1e53a73 Revert "[compile] Initialize passes at VllmBackend init" (#37733) Simon Mo 2026-03-20 21:35:49 -07:00
  • 3ffa52009f [ROCm][CI] Guard CudaPlatform/RocmPlatform imports to fix test collection on cross-platform builds (#37617) Andreas Karatzas 2026-03-20 22:58:58 -05:00
  • 87bd91892f [MoE Refactor] Mxfp4 oracle rebased (#37128) Yongye Zhu 2026-03-20 22:37:04 -05:00
  • c7f98b4d0a [Frontend] Remove librosa from audio dependency (#37058) Isotr0py 2026-03-21 11:36:15 +08:00
  • 1c472f8fe1 Add get_device_uuid for rocm (#37694) tmm77 2026-03-20 23:33:16 -04:00
  • c57d38d603 elastic_ep: Fix issues with repeated scale up/down cycles (#37131) Itay Alroy 2026-03-21 01:13:02 +02:00
  • e5ed6c6c13 [BugFix] Allow qk_nope_head_dim=192 in FlashInfer MLA backend checks (#37475) Kaihang Jiang 2026-03-20 18:14:55 -04:00
  • b3d0b37908 [Refactor] Remove unused dead code (#36171) Wentao Ye 2026-03-20 18:12:51 -04:00
  • 85f671b8e1 [Model Runner V2] Support Streaming Inputs (#37028) Santino Ramos 2026-03-20 13:42:25 -07:00
  • 8bc6b5cdb0 [ROCm][CI] Setting some mi325_4 tests back to optional (in parity with upstream) (#37711) Andreas Karatzas 2026-03-20 14:25:08 -05:00
  • 4f16ebbbd3 [Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591) (#37605) Vadim Gimpelson 2026-03-20 23:19:26 +04:00
  • 12fd17eb51 [compile] Initialize passes at VllmBackend init (#35216) Angela Yi 2026-03-20 11:40:33 -07:00
  • 37aadf6237 [Model] Update Kimi-K25 and Isaac processors to fit HF-style (#37693) Cyrus Leung 2026-03-21 02:30:22 +08:00
  • d7d2b5e405 [Bugfix] Disable --calculate-kv-scales for hybrid GDN/Mamba+Attention… (#37565) Le Yang 2026-03-21 02:28:34 +08:00
  • 6ec5e9fd37 refactor: abstract deepgemm support into platform (#37519) SherryC41 2026-03-21 01:54:08 +08:00
  • e1d85e5c24 [Attention] Support distinguishing between short extends and decodes (#37303) Lucas Wilkinson 2026-03-20 10:49:36 -07:00
  • 79eb9369c5 fix CUDAGraph memory being counted twice (#37426) Peter Pan 2026-03-21 01:36:32 +08:00
  • e80cfe575d [MRV2] Avoid recompilation of _gather_block_tables_kernel (#37645) Woosuk Kwon 2026-03-20 10:31:45 -07:00
  • d0532bf38d [Perf] Eliminate redundant SparseMatrix creation in gpt_oss_triton_kernels (#37683) Xin Yang 2026-03-20 10:28:41 -07:00
  • fb4e8bf442 [ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests (#37613) Andreas Karatzas 2026-03-20 12:16:59 -05:00
  • 6ade4bc5a5 Fix various config related issues for Transformers v5 (#37681) Harry Mellor 2026-03-20 16:30:12 +00:00
  • 2e089b96a8 [compile] Add compiled artifact counter for VLLM_USE_MEGA_AOT_ARTIFACT=1. (#37589) Zhengxu Chen 2026-03-20 12:22:46 -04:00
  • 880be2b1b8 [Metrics] Some small refactoring for better maintainability (#33898) Martin Hickey 2026-03-20 16:11:34 +00:00
  • c0f5fae601 [compile] Fix aot test failures with torch 2.12. (#37604) Zhengxu Chen 2026-03-20 12:06:29 -04:00