Commit Graph

  • fe36bf5e80 [Model] Remove the unnecessary dtype conversion in MiniCPM (#32523) Canlin Guo 2026-01-18 16:07:28 +08:00
  • 963dc0b865 [Model Runner V2] Minor optimization for eagle input processing (#32535) Woosuk Kwon 2026-01-17 21:55:17 -08:00
  • 8cc26acd8b [Performance] Improve Triton prefill attention kernel's performance (#32403) Isotr0py 2026-01-18 12:19:59 +08:00
  • 4a6af8813f [MoE Refactor] Move Test Impl into Test Dirs (#32129) Robert Shaw 2026-01-17 23:16:59 -05:00
  • 4147910f1e [Model Runner V2] Move mrope_positions buffer to MRopeState (#32532) Woosuk Kwon 2026-01-17 20:09:48 -08:00
  • 3055232ba0 [Feature] Add FIPS 140-3 compliant hash algorithm option for multimodal hashing (#32386) Karan Bansal 2026-01-18 08:32:01 +05:30
  • d68209402d [build] fix cu130 related release pipeline steps and publish as nightly image (#32522) Shengqi Chen 2026-01-18 10:36:11 +08:00
  • 965765aef9 [build] fix cu130 related release pipeline steps and publish as nightly image (#32522) Shengqi Chen 2026-01-18 10:36:11 +08:00
  • 9e078d0582 [CI/Build][Docker] Add centralized version manifest for Docker builds (#31492) Mritunjay Kumar Sharma 2026-01-17 19:15:30 +05:30
  • 2b99f210f5 [Misc] Fix typo: seperator -> separator in flashmla_sparse.py (#32411) Guofang.Tang 2026-01-17 20:18:30 +08:00
  • 1646fea672 [Model] Molmo2: Enable quantized weight mapping for vision backbone (#32385) Kim Hee Su 2026-01-17 18:33:05 +09:00
  • d3317bbba4 [Models] Lfm2Moe: minor name changes for resolving lora conflicts (#29063) Paul Pak 2026-01-16 23:12:55 -07:00
  • b17039bccc [CI] Implement uploading to PyPI and GitHub in the release pipeline, enable release image building for CUDA 13.0 (#31032) v0.14.0 Shengqi Chen 2026-01-17 12:52:33 +08:00
  • 8e61425ee6 [CI] Implement uploading to PyPI and GitHub in the release pipeline, enable release image building for CUDA 13.0 (#31032) Shengqi Chen 2026-01-17 12:52:33 +08:00
  • 2e7c89e708 Revert "[Attention][MLA] Make FLASHINFER_MLA the default MLA backen… (#32484) Matthew Bonanni 2026-01-16 23:42:39 -05:00
  • 037a6487af apply _validate_input to MistralTokenizer token-id chat prompts (#32448) vanshil shah 2026-01-16 19:23:45 -08:00
  • 5a3050a089 [Docs][Governance] Add @robertshaw2-redhat to lead maintainers group (#32498) Simon Mo 2026-01-16 18:35:49 -08:00
  • 484e22bc18 [TPU][Core] Enable Pipeline Parallelism on TPU backend (#28506) Chenyaaang 2026-01-16 15:29:20 -08:00
  • ca21288080 [CI] Fix OOM in Hopper Fusion E2E Tests (H100) (#32489) Lucas Wilkinson 2026-01-16 14:27:16 -07:00
  • 4c82b6fac7 [responsesAPI] allow tuning include_stop_str_in_output (#32383) Andrew Xia 2026-01-16 16:14:40 -05:00
  • a884bc62d6 [LoRA] Update LoRA expand kernel heuristic (#32425) Xin Yang 2026-01-16 10:38:07 -08:00
  • 7a1030431a Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. (#29843) Hashem Hashemi 2026-01-16 09:45:04 -08:00
  • 9fd918e510 [CI] Update deepgemm to newer version (#32479) Wentao Ye 2026-01-16 12:18:05 -05:00
  • c9a533079c [EPLB][BugFix]Possible deadlock fix (#32418) Ilya Markov 2026-01-16 15:11:01 +01:00
  • 48b67ba75f [Frontend] Standardize use of create_error_response (#32319) Cyrus Leung 2026-01-14 19:22:26 +08:00
  • 6ca4f400d8 [CI][AMD] Skip test_permute_cols since the kernel is not used and not built for ROCm (#32444) rasmith 2026-01-16 02:22:53 -06:00
  • 180e981d56 [Chore] Replace swish with silu (#32459) Cyrus Leung 2026-01-16 16:22:45 +08:00
  • b84c426a8c [ROCm][CI] Skip Qwen3-30B-A3B-MXFP4A16 Eval Test On Non-CUDA Platforms (#32460) Micah Williamson 2026-01-16 02:17:44 -06:00
  • b66b0d6abb fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROCm (#32244) Rabi Mishra 2026-01-16 13:01:10 +05:30
  • 03da3b52ef [Bugfix] Refactor to support DP parallel in R3 (#32306) Hongxin Xu 2026-01-16 15:13:58 +08:00
  • 14ce524249 [CI] Breakup h200 tests (#30499) Lucas Wilkinson 2026-01-15 23:23:22 -07:00
  • 4ae77dfd42 [Frontend][1/n] Make pooling entrypoints request schema consensus | CompletionRequest (#32395) wang.yuqi 2026-01-16 14:17:04 +08:00
  • 73f635a75f [Bug] Add TPU backend option (#32438) XiongfeiWei 2026-01-15 21:17:12 -08:00
  • 35bf5d08e8 [bugfix] Fix online serving crash when text type response_format is received (#26822) cjackal 2026-01-16 13:23:54 +09:00
  • 5de6dd0662 [Bugfix] [DeepSeek-V3.2] fix sparse_attn_indexer padding (#32175) Kebe 2026-01-16 12:21:55 +09:00
  • 709502558c [Model] Add Step3vl 10b (#32329) ltd0924 2026-01-16 11:04:16 +08:00
  • 09f4264a55 [Bugfix] Fix ROCm dockerfiles (#32447) TJian 2026-01-16 10:50:00 +08:00
  • 7f42dc20bb [CI] Fix LM Eval Large Models (H100) (#32423) v0.14.0rc2 Matthew Bonanni 2026-01-15 19:52:49 -05:00
  • c2a37a3cf8 Cherry pick [ROCm] [CI] [Release] Rocm wheel pipeline with sccache #32264 TJian 2026-01-16 02:56:18 +08:00
  • 0e31fc7996 [UX] Use kv_offloading_backend=native by default (#32421) Michael Goin 2026-01-15 13:55:11 -05:00
  • 6ac0fcf416 [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm (#32413) Pleaplusone 2026-01-16 00:35:47 +08:00
  • b62249725c [ROCM] Add ROCm image build to release pipeline (#31995) Douglas Lehr 2026-01-15 05:01:40 -06:00
  • 1b57275207 [Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten (#32336) vllmellm 2026-01-15 03:32:48 +08:00
  • 46f8a982b1 [ROCm][CI] Enable AITER Unified Attention On ROCm For gpt-oss Test (#32431) Micah Williamson 2026-01-15 18:55:57 -06:00
  • bcf2333cd6 [CI] Fix LM Eval Large Models (H100) (#32423) Matthew Bonanni 2026-01-15 19:52:49 -05:00
  • 83239ff19a Add thread_n=64 support to Marlin MoE (#32360) Michael Goin 2026-01-15 19:45:44 -05:00
  • c277fbdf31 [Feat] Support non-gated MoE with Marlin, NVFP4 CUTLASS, FP8, INT8, compressed-tensors (#32257) TomerBN-Nvidia 2026-01-16 02:15:05 +02:00
  • aca5c51487 [Refactor] Remove unused file (#32422) Wentao Ye 2026-01-15 17:59:38 -05:00
  • 31c29257c8 [MoE Refactor][17/N] Apply Refactor to Bf16 (#31827) Yongye Zhu 2026-01-15 12:53:40 -08:00
  • 8c11001ba2 [ROCM] DSfp4 mla projection gemms weight dynamic quantization (#32238) Aleksandr Malyshev 2026-01-15 12:13:08 -08:00
  • bd292be0c0 [BugFix] Python file source reading can fail on UnicodeDecodeError (#32416) Richard Zou 2026-01-15 15:01:41 -05:00
  • 41c544f78a [ROCm] [CI] [Release] Rocm wheel pipeline with sccache (#32264) TJian 2026-01-16 02:56:18 +08:00
  • 1be5a73571 [UX] Use kv_offloading_backend=native by default (#32421) Michael Goin 2026-01-15 13:55:11 -05:00
  • c36ba69bda [BugFix] Fix assert x_s.shape[-1] == x_q.shape[-1] // group_shape[1] in Blackwell Quantized MoE Test (#32362) Lucas Wilkinson 2026-01-15 11:19:12 -07:00
  • 047413375c [Attention][AMD] Make flash-attn optional (#30361) Matthias Gehre 2026-01-15 18:18:24 +01:00
  • 74e4bb1c5a fixing podman build issue (#32131) smit kadvani 2026-01-15 09:07:08 -08:00
  • b34474bf2c [Feature] Support async scheduling + PP (#32359) Wentao Ye 2026-01-15 12:06:23 -05:00
  • 6218034dd7 [Model Runner V2] Support FlashInfer backend & Fix CUDA Graph bug [1/2] (#32348) Woosuk Kwon 2026-01-15 08:59:23 -08:00
  • 77c16df31d [ROCm][Bugfix] Disable hip sampler to fix deepseek's accuracy issue on ROCm (#32413) Pleaplusone 2026-01-16 00:35:47 +08:00
  • 130d6c9514 [ROCm][Perf] Enable shuffle kv cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887) Pleaplusone 2026-01-15 23:29:53 +08:00
  • 361dfdc9d8 [Quant] Support MXFP4 W4A16 for compressed-tensors MoE models (#32285) Dipika Sikka 2026-01-15 10:25:55 -05:00
  • 8ebfacaa75 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill (#32339) Matthew Bonanni 2026-01-15 09:49:57 -05:00
  • b89275d018 [ROCm] Improve error handling while loading quantized model on gfx120… (#31715) brian033 2026-01-15 20:16:00 +08:00
  • 28459785ff [3/N] Group together media-related code (#32406) Cyrus Leung 2026-01-15 19:52:12 +08:00
  • 8853a50af2 [CI][BugFix][AMD][FP8] Fix test_rms_norm so it runs correctly on ROCm (#32372) rasmith 2026-01-15 05:05:54 -06:00
  • c5891b5430 [ROCM] Add ROCm image build to release pipeline (#31995) Douglas Lehr 2026-01-15 05:01:40 -06:00
  • 707b44cc28 [Refactor] [11/N] to simplify the mcp architecture (#32396) Chauncey 2026-01-15 18:49:31 +08:00
  • 3a4e10c847 [Benchmark] [Feature] add vllm bench sweep startup command (#32337) rongfu.leng 2026-01-15 17:25:46 +08:00
  • cbbae38f93 [2/N] Move cache factories to MM registry (#32382) Cyrus Leung 2026-01-15 17:02:30 +08:00
  • cdba4c74b3 [Model] Avoid token selection in SigLIP pooling head (#32389) Cyrus Leung 2026-01-15 17:01:59 +08:00
  • a52d1396a7 fix: avoid crash on zero-arg tool calls in glm4 parser (#32321) seeksky 2026-01-15 16:45:59 +08:00
  • 1e584823f8 [Bugfix] Strengthen the check of X-data-parallel-rank in Hybrid LB mode (#32314) dtc 2026-01-15 16:31:16 +08:00
  • 4c1c501a7e [Refactor] [10/N] to simplify the vLLM openai completion serving architecture (#32369) Chauncey 2026-01-15 15:41:34 +08:00
  • ae1eba6a9a [ROCm][CI] Pin transformers 4.57.3 to fix jina test failures (#32350) Andreas Karatzas 2026-01-15 01:19:34 -06:00
  • e9ec2a72d8 [Bugfix] Fix stale common_attn_metadata.max_seq_len in speculative decoding with Eagle (#32312) Ofir Zafrir 2026-01-15 08:39:37 +02:00
  • 2c9b4cf5bf [BugFix] Fix DeepSeek-V3.1 + DeepGEMM incompatible scale shapes (#32361) Lucas Wilkinson 2026-01-14 23:32:22 -07:00
  • 9d7ae3fcdb [code clean] remove duplicate check (#32376) Ning Xie 2026-01-15 13:29:34 +08:00
  • 3c2685645e [CI][AMD][Quantization][BugFix] Fix fp8 max in quant_utils.py and update test_fp8_quant.::test_static_fp8_quant_group_2d to use correct fp8 dtype and adjust atol/rtol (#32201) rasmith 2026-01-14 23:04:34 -06:00
  • 773d7073ae [ROCm][CI] Disable async scheduling on ROCm for test_structured_output[meta-llama/Meta-Llama-3.1-8B-Instruct-xgrammar-auto-speculative_config9] (#32355) Micah Williamson 2026-01-14 22:53:43 -06:00
  • edadca109c [Bugfix] Add CpuCommunicator.dispatch and combine to fix DP+MoE inference (#31867) kzwrime 2026-01-15 12:50:48 +08:00
  • d86fc23bdd [Misc] Remove redundant line (#32366) Li Wang 2026-01-15 12:29:56 +08:00
  • 375e5984fe Support configure skip_special_tokens in openai response api (#32345) Shiyan Deng 2026-01-14 20:07:26 -08:00
  • 19b251fe3d Fix optional parameter parsing in MiniMax M2 tool parser #32278 (#32342) baonudesifeizhai 2026-01-14 23:05:48 -05:00
  • 15422ed3f7 [CI/Build][Hardware][AMD] Fix v1/shutdown (#31997) Ryan Rock 2026-01-14 22:01:42 -06:00
  • 8471b27df9 [compile] raise on compile_size implicit padding (#32343) dolpm 2026-01-14 12:46:56 -08:00
  • 66652e8082 [BugFix] Assign page_size_padded when unifying kv cache spec. (#32283) Lumosis 2026-01-14 12:10:01 -08:00
  • e27078ea80 [Bugfix][ROCm][performance] Resolve the performance regression issue of the Qwen3-Next-80B-A3B-Thinking under rocm_atten (#32336) vllmellm 2026-01-15 03:32:48 +08:00
  • d084e9fca7 [MODEL] Fix handling of multiple channels for gpt-oss with speculative decoding (#26291) Aleksandr Samarin 2026-01-14 21:20:52 +03:00
  • 3a612322eb [CI] Move rixl/ucx from Dockerfile.rocm_base to Dockerfile.rocm (#32295) qli88 2026-01-14 10:53:36 -06:00
  • 9ea07b41da [1/N] Reorganize multimodal processing code (#32327) Cyrus Leung 2026-01-14 23:25:31 +08:00
  • 552b262936 rename tokenize serving api request id prefix to tokenize (#32328) Ning Xie 2026-01-14 22:52:20 +08:00
  • 00e6402d56 [Frontend] track responsesAPI server_load (#32323) Chauncey 2026-01-14 20:00:37 +08:00
  • ce0946249d [Misc] Make mem utils can be reused by other platforms (#32322) Shanshan Shen 2026-01-14 19:46:01 +08:00
  • 3f28174c6a [Frontend] Standardize use of create_error_response (#32319) Cyrus Leung 2026-01-14 19:22:26 +08:00
  • 769d0629e1 [Refactor] [9/N] to simplify the vLLM openai translations serving architecture (#32313) Chauncey 2026-01-14 18:20:58 +08:00
  • 90db5b31e4 [Refactor] Move top-level dummy data generation to registry (#32310) Cyrus Leung 2026-01-14 18:17:46 +08:00
  • b8199f6049 [Model] Re-implement Qwen3Omni Audio Encoder (#32167) Roger Wang 2026-01-13 23:40:30 -08:00
  • 7e6f123810 Add Molmo2 multimodal model support (#30997) sangho.lee 2026-01-13 23:33:09 -08:00
  • 9312a6c03a [Refactor] [8/N] to simplify the vLLM openai responsesapi_serving architecture (#32260) Chauncey 2026-01-14 15:26:24 +08:00
  • 6388b50058 [Docs] Add docs about OOT Quantization Plugins (#32035) Michael Goin 2026-01-14 02:25:45 -05:00