Commit Graph

  • 048bb59728 AMD CI Test - unskip moe_sum test and moe_align_block_size tests (#32039) Hongxia Yang 2026-01-14 02:25:10 -05:00
  • 7933638051 [misc] Remove is_torch_equal_or_newer(2.4) cases (#32296) Angela Yi 2026-01-13 23:22:07 -08:00
  • 6b176095e3 [Build] Relax anthropic version pin from ==0.71.0 to >=0.71.0 (#32289) David 2026-01-14 02:21:39 -05:00
  • 9d0d7f48d5 [ROCm][CI] Handle missing vision_config in Isaac model attention patch (#32281) Andreas Karatzas 2026-01-14 01:21:26 -06:00
  • 50632adc58 Consolidate Intel Quantization Toolkit Integration in vLLM (#31716) Yi Liu 2026-01-14 15:11:30 +08:00
  • 6fa6e7ef0c [ROCm][CI] Disable Async Scheduling For Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test (#32275) Micah Williamson 2026-01-13 23:29:42 -06:00
  • 90c0836902 [Model Runner V2] Refactor Sampler (#32245) Woosuk Kwon 2026-01-13 17:58:12 -08:00
  • 8ef50d9a6b [Kernel][Performance] Enable smaller Scaling Factor tiling for NVFP4 small-batch decoding (#30885) Roberto L. Castro 2026-01-14 00:22:53 +01:00
  • 2a60ac91d0 [Improvement] Persist CUDA compat libraries paths to prevent reset on apt-get (#30784) emricksini-h 2026-01-13 23:35:05 +01:00
  • 9e65bb4ef4 Add mergify label job for "bug" in PR titles (#31980) Michael Goin 2026-01-13 17:28:19 -05:00
  • 0db574b185 [Build] Add scripts for cherry-picking and trigger build (#32282) Simon Mo 2026-01-13 13:21:05 -08:00
  • 2f4a71daf2 [Misc] Add In-Container restart capability through supervisord for sagemaker entrypoint (#28502) HappyAmazonian 2026-01-13 13:06:10 -08:00
  • 69f8a0ea37 fix(rocm): Use refresh_env_variables() for rocm_aiter_ops in test_moe (#31711) Rabi Mishra 2026-01-14 00:41:54 +05:30
  • f28125d87b [Perf] Optimize grouped topk kernel, 1.2%~2% E2E Throughput improvement (#32058) Wentao Ye 2026-01-13 13:58:18 -05:00
  • 2c24bc6996 [BugFix] [KVConnector] Fix KV events for LMCache connector (#32169) Martin Hickey 2026-01-13 15:50:34 +00:00
  • 0aa8c40552 [Bugfix] Replace PoolingParams.normalize with use_activation (#32243) Cyrus Leung 2026-01-13 18:45:42 +08:00
  • 46f8c6b725 Fix CUDA 13 wheel installation doc (#32276) Dmitry Tokarev 2026-01-13 13:48:37 -05:00
  • af54d2e2d0 [responseAPI] support partial message generation (#32100) Andrew Xia 2026-01-13 13:41:26 -05:00
  • 6beef12b9b [EPLB][Cleanup] Remove is_async_enabled from EplbModelState (#32050) Sage Moore 2026-01-13 10:19:03 -08:00
  • ab74b2a27a [Trivial] Remove duplicate enable_mfu_metrics (#32246) Mark McLoughlin 2026-01-13 17:09:23 +00:00
  • 2263d44b68 [4/N][Attention] Move MLA common to model_executor (#32060) Matthew Bonanni 2026-01-13 12:08:45 -05:00
  • 4f3676e726 nixl_connector: export UCX_MEM_MMAP_HOOK_MODE=none to avoid a UCX memory leak (#32181) Mathis Felardos 2026-01-13 17:21:10 +01:00
  • 510265472c [BugFix] [KVConnector] Fix KV events for LMCache connector (#32169) Martin Hickey 2026-01-13 15:50:34 +00:00
  • 4f02cb2eac [Refactor] [7/N] to simplify the vLLM lora serving architecture (#32251) Chauncey 2026-01-13 23:37:34 +08:00
  • 252c011012 [Refactor] Remove MultiModalProfiler (#32254) Cyrus Leung 2026-01-13 23:10:20 +08:00
  • 98f60e5acb [6/N][Attention] Move utils to more appropriate locations (#32215) Matthew Bonanni 2026-01-13 08:38:52 -05:00
  • fefce49807 [Refactor] [6/N] to simplify the vLLM openai chat_completion serving architecture (#32240) Chauncey 2026-01-13 21:01:39 +08:00
  • a5bbbd2f24 [Quantization] fix: overflow with static per-tensor scaling (#29867) Mickaël Seznec 2026-01-13 13:56:01 +01:00
  • 8c8653b672 [Docs] Nixl Usage recommend fail kv_load_failure_policy (#32198) Nicolò Lucchesi 2026-01-13 13:51:57 +01:00
  • 232214b2ae [Bugfix] Replace PoolingParams.normalize with use_activation (#32243) Cyrus Leung 2026-01-13 18:45:42 +08:00
  • eb28e8068d [Refactor] Remove get_encoder_dummy_data (#32241) Cyrus Leung 2026-01-13 17:21:23 +08:00
  • 542a4059b2 [Model] Use mm_position to compute mrope positions for Qwen2-VL/2.5-VL (#32126) YunzhuLu 2026-01-13 17:04:29 +08:00
  • df7e12715f [ROCm][CI] Fix engine core client tests for ROCm spawn multiprocessing (#32061) Andreas Karatzas 2026-01-13 01:14:30 -06:00
  • 44c34f22d9 [Doc] Update installation from source command (#32239) Roy Wang 2026-01-13 15:10:27 +08:00
  • 80221e1884 [BugFix]Fix eagle draft_model_config and add tests (#31753) Xingyu Liu 2026-01-12 23:09:36 -08:00
  • 5e714f7ff4 [ROCm][CI] Fix HuggingFace flash_attention_2 accuracy issue in Isaac vision encoder (#32233) Andreas Karatzas 2026-01-13 00:33:59 -06:00
  • 11b6af5280 [ROCm][Bugfix] Fix Mamba batched decode producing incorrect output (#32099) (tag: v0.14.0rc1) Andreas Karatzas 2026-01-12 23:46:53 -06:00
  • 2a719e0865 [Perf] Optimize requests abort (#32211) Wentao Ye 2026-01-12 23:11:37 -05:00
  • f243abc92d Fix various typos found in docs (#32212) Andrew Bennett 2026-01-12 21:41:47 -06:00
  • 60b77e1463 [Frontend] Add reasoning_effort to OpenAIServing._preprocess_chat() (#31956) Sanghoon Yoon 2026-01-13 12:21:49 +09:00
  • 15b33ff064 [Misc] improve warning/assert messages (#32226) cjackal 2026-01-13 12:11:23 +09:00
  • c6bb5b5603 [BugFix] Fix engine crash caused by chat tools + response_format (#32127) Nick Hill 2026-01-12 18:33:14 -08:00
  • 9273a427b5 [Misc] Allow enabling NCCL for DP sync when async scheduling (#32197) Nick Hill 2026-01-12 18:03:08 -08:00
  • 78d13ea9de [Model] Handle trust_remote_code for transformers backend (#32194) Cyrus Leung 2026-01-13 09:30:12 +08:00
  • a307ac0734 [responsesAPI] add unit test for optional function tool call id (#32036) Andrew Xia 2026-01-12 19:14:54 -05:00
  • a28d9f4470 [ROCm][CI] Handle pytest status code 5 when a shard isn't allocated any tests (#32040) Divakar Verma 2026-01-12 16:35:49 -06:00
  • 629584bfc9 [Kernel][MoE] fix computation order of MoE weight multiplication and improve flow (#31962) xuebwang-amd 2026-01-13 06:17:30 +08:00
  • 0a7dd23754 [Model Runner V2] Add support for M-RoPE (#32143) Woosuk Kwon 2026-01-12 13:37:43 -08:00
  • dec28688c5 [Model Runner V2] Minor refactor for logit_bias (#32209) Woosuk Kwon 2026-01-12 13:08:30 -08:00
  • 9f430c94bd [BUGFIX] Add missed remaping of the names of fp8 kv-scale (#32199) Vadim Gimpelson 2026-01-13 00:42:06 +04:00
  • f8bd8394e3 [NIXL][Bugfix] Failure logging overhaul + early metadata free on failure (#32031) Nicolò Lucchesi 2026-01-12 21:38:49 +01:00
  • ca81811bfe [Model Runner V2] Support logit_bias, allowed_token_ids, min_tokens (#32163) Woosuk Kwon 2026-01-12 11:31:10 -08:00
  • ad8818bb5e [Misc][BE] Type coverage for vllm/compilation [3/3] (#31748) Lucas Kabela 2026-01-12 11:24:38 -08:00
  • 08e8e99ce7 [Misc] Change log level for batch queue log (#32192) Nicolò Lucchesi 2026-01-12 19:59:31 +01:00
  • 2be765b68a [BugFix] scheduler: Fix ordering preserving of skipped requests (#32173) Or Ozeri 2026-01-12 20:39:38 +02:00
  • 16abe6b85a [Misc] Set default torch num threads for input processing (#31879) Roger Wang 2026-01-12 10:28:16 -08:00
  • 1eb61ab34b [Refactor] EPLB rebalance algo to NumPy (#30697) Ilya Markov 2026-01-12 19:13:23 +01:00
  • 3d962d72ab [BugFix] fix FusedMoE.make_expert_params_mapping in EXAONE-MoE (#32196) Kyungmin Lee 2026-01-13 03:00:45 +09:00
  • 20228cb851 [3/N][Attention] Move AttentionMetadata-related code from utils.py to backend.py (#32054) Matthew Bonanni 2026-01-12 12:13:56 -05:00
  • 7c0d3c5152 [Benchmark] Share data between SLA runs (#32184) Cyrus Leung 2026-01-13 01:12:22 +08:00
  • 5b68107411 [Misc][PD] Fix get_attn_backend usage in transfer connectors (#31988) Nicolò Lucchesi 2026-01-12 18:10:05 +01:00
  • 8fb2c135be [Bugfix] Fix stale SSM state for new Mamba requests scheduled as decode (#32118) Asaf Joseph Gardin 2026-01-12 19:02:38 +02:00
  • 8863c2b25c [Model] Standardize pooling heads (#32148) Cyrus Leung 2026-01-13 01:01:49 +08:00
  • 3f72639d36 [FIX] Add NO_MUL activation support for modular kernel path (#31528) danielafrimi 2026-01-12 18:55:49 +02:00
  • 6bc9c8473e [MODEL] New model support for kakaocorp/kanana-1.5-v-3b-instruct (#29384) Jaehyun An 2026-01-13 01:39:02 +09:00
  • 63ed2409e8 Add K-EXAONE-236B-A23B (#31621) Kyungmin Lee 2026-01-13 01:30:50 +09:00
  • 95e53d907c doc: Update model references in supported_models.md (#32188) Andy Zhang 2026-01-13 00:15:28 +08:00
  • 0346396e94 [ROCm] [Bugfix] Fix order of mori build in Dockerfile.rocm_base (#32179) TJian 2026-01-12 23:33:21 +08:00
  • e68b0dad8b doc: Update model name for Qwen3-Coder in documentation (#32185) Andy Zhang 2026-01-12 23:10:50 +08:00
  • 9cddbdba6d OffloadingConnector: Add cpu_bytes_to_use configuration (#24498) Or Ozeri 2026-01-12 17:00:43 +02:00
  • 49e6b86c91 [Feature] Support recording expert indices for rollout router replay (#28284) Hongxin Xu 2026-01-12 22:23:04 +08:00
  • 0565f1fdec [P/D] Refactor mooncake connector sender thread using async coroutines (#31573) dtc 2026-01-12 20:35:35 +08:00
  • 9dbe1fe960 [Bugfix] Fix missing scale passing for encoder Triton Attention implementation (#32149) Isotr0py 2026-01-12 19:13:41 +08:00
  • a5f89ae296 [Doc] Add documentation for offline API docs feature (#32134) RickyChen / 陳昭儒 2026-01-12 18:33:48 +08:00
  • 05e8981234 [Doc] Improve LoRA docs (#32159) Jee Jee Li 2026-01-12 18:19:17 +08:00
  • 899541bdb1 [doc] fix broken links (#32158) XlKsyt 2026-01-12 18:18:38 +08:00
  • d7b2e57097 [Frontend] Fix Flaky MCP Streaming Test (#32153) daniel-salib 2026-01-12 02:03:32 -08:00
  • 5e034f2e3d [cpu][bench] Add Fused MoE Micro Benchmark for CPU Backend (#32092) Andika Rachman 2026-01-12 17:03:28 +07:00
  • 22970c1626 [Misc] Disable default --ready-check-timeout-sec extra call in vllm bench (#30975) Nicolò Lucchesi 2026-01-12 10:58:21 +01:00
  • 600aaab8d6 [Model] Remove incorrect SupportsPP from MTP models (#32150) Cyrus Leung 2026-01-12 17:19:30 +08:00
  • 60446cd684 [Model] Improve multimodal pooling examples (#32085) wang.yuqi 2026-01-12 15:54:09 +08:00
  • 9101dc756c [Model] Avoid hardcoding pooling type (#32119) Cyrus Leung 2026-01-12 13:28:12 +08:00
  • 025a32f9ed [Model Runner V2] Remove async barrier (#32083) Woosuk Kwon 2026-01-11 20:24:30 -08:00
  • 19504ac07f [Model Runner V2] Skip building deprecated fields in attn metadata (#32132) Woosuk Kwon 2026-01-11 14:31:04 -08:00
  • 3df619ac94 [CI] fix test_concat_and_cache_mla_rope_fused (#32117) Jiangyun Zhu 2026-01-11 23:11:11 +08:00
  • d74132ca3b fix offline inference chat response prompt (#32088) Ning Xie 2026-01-11 22:01:18 +08:00
  • a34abc49b7 [FixBug] Improve exception string in tensorizer.py (#31680) maang 2026-01-11 21:01:53 +08:00
  • d70249e2e9 [Misc] fix this log format not space (#32112) rongfu.leng 2026-01-11 21:01:16 +08:00
  • a374532111 [CI/Build] Separate out flaky responses API tests (#32110) Cyrus Leung 2026-01-11 21:01:12 +08:00
  • cee7436a26 [Misc] Make scipy as optional audio/benchmark dependency (#32096) Isotr0py 2026-01-11 16:18:57 +08:00
  • 4c16ba617f [KVConnector] OffloadingConnector: Fix bug in handling of preemptions (#29870) Or Ozeri 2026-01-11 10:05:36 +02:00
  • bde57ab2ed [Hardware][AMD][CI][Bugfix] Fix AMD Quantization test group (#31713) Matt 2026-01-11 01:19:46 -06:00
  • 9103ed1696 [CPU][BugFix] Disable AOT Compile for CPU (#32037) Fadi Arafeh 2026-01-11 07:15:49 +00:00
  • 46eb30f519 make assume_32_bit_indexing configurable (#32044) Laith Sakka 2026-01-10 23:15:46 -08:00
  • 0dd63639be [MTP][GLM][Bugfix] Fixed .weight_scale loading logic that dropped MTP prediction accuracy with fp8+mtp (#32101) Andy Liu 2026-01-10 23:14:54 -08:00
  • ef96fa3f1f [Benchmark][2/2] Use spline interpolation to tune SLA variables (#32095) Cyrus Leung 2026-01-11 12:27:27 +08:00
  • 2a4dbe24ea [BugFix] Wait for compute before offloading KV to CPU (#31341) Or Ozeri 2026-01-11 00:25:08 +02:00
  • 8020a60402 [Bugfix] Fix Qwen3-VL-Reranker model loading for sequence classification (#32089) RickyChen / 陳昭儒 2026-01-11 04:40:09 +08:00
  • e15a5ff07b [MISC] Add strict contiguity check for FlashInfer attention tensors (#32008) Vadim Gimpelson 2026-01-11 00:40:05 +04:00
  • 6ea001cfb7 [Bugfix][Quantization] Ensure input contiguity in per_token_quant_int8 (#31637) Vensen 2026-01-11 04:40:02 +08:00