Commit Graph

  • 73a484caa1 [Model][Quantization] Fix / Add GGUF support for Qwen2 MoE models (#30307) Tsukasa OI 2025-12-10 04:13:10 +09:00
  • b37bf51e75 [CI/Test] Fix FP8 per-tensor quant test reference scale shape (#30352) Lucas Wilkinson 2025-12-09 13:52:20 -05:00
  • 95501a70ec [BugFix] Fix DeepSeek-R1 hang with DP and MTP (#30119) Lucas Wilkinson 2025-12-09 13:51:19 -05:00
  • e858bfe051 [Cleanup] Refactor profiling env vars into a CLI config (#29912) Benjamin Chislett 2025-12-09 13:29:33 -05:00
  • d471b2aff0 [Model Runner V2] Support num NaNs in logits (#30187) Woosuk Kwon 2025-12-09 10:00:49 -08:00
  • 9e6562a3f6 [Model Runner V2] Fix Triton warning on tl.where (#30355) Woosuk Kwon 2025-12-09 09:59:54 -08:00
  • 0b6a8a304c [BugFix] Fix non detected failing tests (#30277) Ilya Markov 2025-12-09 18:57:55 +01:00
  • 804e3468c0 Update AMD test definitions (2025-12-08) (#30298) Alexei-V-Ivanov-AMD 2025-12-09 11:31:30 -06:00
  • 83319b44c2 [Compile] Fix torch warning TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled (#29897) Wentao Ye 2025-12-09 10:40:37 -05:00
  • 56037dfa2f [BugFix] Fix assert batch_descriptor.num_tokens == num_tokens_padded (#30173) Lucas Wilkinson 2025-12-09 10:36:12 -05:00
  • 5dcd593baf [Feature] Batch-Invariant Support for FA2 and LoRA (#30018) quanliu 2025-12-09 23:01:38 +08:00
  • 5c213d2899 [BUGFIX] Mistral tool call parser v11+ (#30332) Julien Denize 2025-12-09 15:55:38 +01:00
  • ee14644ba9 [ROCm] Aiter Quant Kernels (#25552) vllmellm 2025-12-09 22:27:37 +08:00
  • 1166c31cc7 [Bugfix]: Fix glm46 awq marlin moe wna16 compatibility (#30210) Dongjie Zou 2025-12-09 07:20:21 -05:00
  • 03416eada6 [bugfix][quantization] Fix fp8 per_tensor scale shape (#30257) haoyangli-amd 2025-12-09 19:28:50 +08:00
  • c72ea10723 [Structured Output][Reasoning] Improves decoding throughput for models using single-token reasoning endings. (#30056) Hubert de La Jonquiere 2025-12-09 11:54:08 +01:00
  • 67475a6e81 [DCP][Bugfix][CI] Fix accuracy issue of DCP when using FLASH_ATTN_MLA (#30309) Jaya Yuan 2025-12-09 16:22:14 +08:00
  • 9c32df6101 [Bugfix] Qwen 3 VL Embedding loading (#30303) wang.yuqi 2025-12-09 16:04:02 +08:00
  • aeb82b1930 [CI] Fix Flaky test_eagle_max_len Test (#30306) Micah Williamson 2025-12-09 01:33:34 -06:00
  • aed846917f [Attention] Make split_decodes_and_prefills(..., require_uniform=True) support padding (#29644) Lucas Wilkinson 2025-12-09 02:24:01 -05:00
  • e4605d225e [Misc] Fix safetensors import for safe_open (#30300) Yongtao Huang 2025-12-09 14:50:06 +08:00
  • 58d5b3f514 [Model][Quantization] Restore MoE + GGUF models support (incl. Qwen3 MoE) by allowing Sideload Parameters (#30116) Tsukasa OI 2025-12-09 14:30:05 +09:00
  • c2e1987a6e [Doc] update Intel GPU MM status in Feature x Hardware matrix (#30294) Fanli Lin 2025-12-09 13:16:44 +08:00
  • e130845984 [CPU][CI] Enable fused MoE tests in Arm CI (#30132) Fadi Arafeh 2025-12-09 04:55:39 +00:00
  • 4b03b50211 update torchao safetensors impl (#30155) liangel-02 2025-12-08 23:46:35 -05:00
  • 4c6fd25880 kv_transfer: Rename the shared storage connectors (#30201) Or Ozeri 2025-12-09 06:46:09 +02:00
  • 03b91f7262 [Bugfix] Fix compressed-tensors models failing to load with transformers backend (#30287) Michael Goin 2025-12-08 23:44:28 -05:00
  • f6227c22ab [Kernel]Support W4A8 Grouped GEMM on Hopper (#29691) czhu-cohere 2025-12-08 22:29:06 -05:00
  • ea657f2078 Lora MoE Align Improvements (#29257) gnovack 2025-12-08 18:35:16 -08:00
  • db14f61f2d [ci] Refactor CI file structure (#29343) Kevin H. Luu 2025-12-08 18:25:43 -08:00
  • 78c7503364 [ROCm][CI] Skip NVIDIA-Only Prime-RL Test in AMD CI (#29420) Micah Williamson 2025-12-08 20:14:02 -06:00
  • e41312a2f5 [Bugfix] Skip generation config fallback for GGUF to prevent multi-process hang (#30209) Christina Norman 2025-12-08 19:52:43 -06:00
  • 7b35011ad1 Mark qwen2_5_vl as xfail (#30283) Yanan Cao 2025-12-08 17:14:10 -08:00
  • ae339b1a67 [Bugfix] Fix DeepGEMM after #29546 (#30267) Zhewen Li 2025-12-08 17:05:27 -08:00
  • 0ee6416f67 [Perf] Optimize group_topk kernel, 1.9% Throughput improvement, 2.1% TPOT improvement (#30159) Wentao Ye 2025-12-08 19:44:01 -05:00
  • d9417096d1 [Feature] Batch invariant: Enable TRITON_MLA without prefix-caching (#29125) Wentao Ye 2025-12-08 19:31:57 -05:00
  • 9d6235ca9a [moe] Allow disabling DP chunking (#29936) Ming Yang 2025-12-08 16:29:36 -08:00
  • f1599ca55d feat(metrics): Add prefill KV compute metric excluding cached tokens (#30189) Victor Ziliang Peng 2025-12-08 16:08:48 -08:00
  • 60d17251c9 [Disagg] Support large batch size in proxy server and update NixlConnector doc for DP (#28782) Ming Yang 2025-12-08 16:01:08 -08:00
  • 1fb632fdb6 [Perf] Improve fp8 quant in mla; replace ReduceSum with ReduceScatterSum (#29795) Lain 2025-12-08 15:02:34 -08:00
  • 6af70e11a0 [ROCm][CI] Fix test_max_len.py for Rocm (#29916) Charlie Fu 2025-12-08 15:58:30 -06:00
  • ae0f69b16a Add SpecDec support to selective_state_update (#29488) roikoren755 2025-12-08 23:45:18 +02:00
  • 799804d140 Bump nvshmem to 3.3.24 and fix CUDA 13 installation (#30149) Dmitry Tokarev 2025-12-08 15:24:34 -05:00
  • 0d402d2600 online fp8 quant with streaming weight post-processing (#29196) Vasiliy Kuznetsov 2025-12-08 15:15:10 -05:00
  • d1b5e7afbf [TPU] Bump tpu-inference to 0.12.0 (#30221) Johnny Yang 2025-12-08 12:10:10 -08:00
  • fcd5306f65 Add latent MoE support (#30203) shaharmor98 2025-12-08 19:35:01 +02:00
  • 398a596ed2 [MP executor] fix get device count for multi node of mp executor feature (#30042) weiguihua2 2025-12-09 01:33:48 +08:00
  • 67312cad11 [Misc] Split the LoRA code (#30253) Jee Jee Li 2025-12-09 00:59:31 +08:00
  • 87aee9ed2b Add evaluate_guards option to DynamicShapesConfig (#27432) Laith Sakka 2025-12-08 07:46:15 -08:00
  • 184076c3fe [DeepSeek v3.2] Make top-k work for any logit values. (#27568) Daniel Cámpora 2025-12-08 15:55:58 +01:00
  • eb1051fb95 [ROCm] Guard group quant RMS norm fusion patterns (#30239) Ye (Charlotte) Qi 2025-12-08 06:44:48 -08:00
  • 80433e225e [LoRA] Reduce the loading time of MoE LoRA (#30243) Jee Jee Li 2025-12-08 21:29:47 +08:00
  • 5c2433a6f3 Add tip for mypy and markdownlint to the pre-commit comment (#30259) Harry Mellor 2025-12-08 13:11:51 +00:00
  • 77072e93b3 [docs] governance documents (#24801) Simon Mo 2025-12-08 03:06:20 -09:00
  • 2e660c2434 [Frontend] Binary embedding response does not return metadata by setting encoding_format to bytes_only. (#30249) wang.yuqi 2025-12-08 20:01:21 +08:00
  • 408cf42f67 [CI] Prevents triggering of an inactive issue/PR check for forked repository. (#29654) Shiming Zhang 2025-12-08 18:29:14 +08:00
  • 9e77ffca3f [Model][7/N] Improve all pooling task | Deprecation as_reward_model. Extract hidden states prefer using new multi-vector retrieval API (#26686) wang.yuqi 2025-12-08 16:10:09 +08:00
  • bcb6f5947f [Perf] Remove sync point in vit torch sdpa attn backend (#30232) Dazhi Jiang 2025-12-08 15:12:42 +08:00
  • cd00c443d2 [Misc] Rename TensorRT Model Optimizer to Model Optimizer (#30091) Zhiyu 2025-12-07 23:05:27 -08:00
  • d143271234 [Bugfix] fix fuse_allreduce_rms when tp =1 (#30178) Jiangyun Zhu 2025-12-08 14:43:47 +08:00
  • c6df05ebb4 [ROCm] [Fused Moe EP] Use binary expert mask for aiter fused moe kernel (#29773) Zhiwei 2025-12-08 13:23:46 +08:00
  • d726a7b0ed [BugFix] Unblock use of LoRA with data parallel mode (#30220) Nick Hill 2025-12-07 20:21:05 -08:00
  • 344b50d525 Address comment to mergify.yml in #30117 (#30219) Zhijian Jiang 2025-12-07 19:26:25 -08:00
  • 735284ed86 [responsesAPI][7] Browser, Container MCP tools for non harmony models (#29989) Andrew Xia 2025-12-07 18:04:03 -08:00
  • 444f0e3f33 [Frontend] Add MCP type support infrastructure to Responses API (#30054) daniel-salib 2025-12-07 18:02:52 -08:00
  • af0444bf40 [Performance] Fused blockwise quant RMS norm (#27883) ElizaWszola 2025-12-07 17:38:04 +01:00
  • 0044c4038c [BugFix][DeepSeek-V3.2] Fix backend selection logic for Blackwell (#30195) Lucas Wilkinson 2025-12-07 10:53:51 -05:00
  • b952f4d3c3 [v1] Add PrefixLM support to FlexAttention backend (#27938) Isotr0py 2025-12-07 23:51:36 +08:00
  • 541a2ef892 [Perf] Deepgemm fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. (#29546) Wentao Ye 2025-12-07 07:31:14 -05:00
  • b0f4866a77 [CI/Build]Temporary workaround for test_default_mm_loras timeout (#30202) Jee Jee Li 2025-12-07 20:27:11 +08:00
  • 879ddb09c3 [Kernel][MoE] optimize moe_align_block_size (#29642) Jinzhen Lin 2025-12-07 17:58:47 +08:00
  • 1b0482b9d1 [Misc][Core] Remove unused req_index increment in scheduler (#30176) Yifan Qiao 2025-12-07 00:39:21 -08:00
  • e83b7e379c Revert "[Renderer] Separate out RendererConfig from ModelConfig (#30145)" (#30199) Cyrus Leung 2025-12-07 16:00:22 +08:00
  • 27f4c2fd46 [Renderer] Separate out RendererConfig from ModelConfig (#30145) Cyrus Leung 2025-12-07 15:15:42 +08:00
  • a49d813fa8 Lazy loading to avoid importing all files (#29716) Luke 2025-12-06 23:13:14 -08:00
  • 17eb25e327 [Perf] Enable cuda graph for deepepHT, 5.3% throughput improvement, 4.4% TTFT improvement (#29558) Wentao Ye 2025-12-06 23:44:50 -05:00
  • dce6d229f7 Support multiple image/audio embeddings per requests (#29988) jeremyteboul 2025-12-06 20:34:24 -08:00
  • cbedb703cc [Frontend] Remove confusing -O.xx flag error (#30169) Yanan Cao 2025-12-06 18:53:42 -08:00
  • 8d3da4c79d [MISC]: change NIXL compatibility hash logging level to debug (#30182) AuruTus 2025-12-07 08:21:03 +08:00
  • 421125d03a [ez] move harmony utils to parser folder (#30117) Andrew Xia 2025-12-06 14:34:34 -08:00
  • 671427efbf [Model] Move multimodal_cpu_fields definition to field config (#30181) Cyrus Leung 2025-12-06 21:40:02 +08:00
  • 21bb323542 Gigachat 3 tool parser and tests (#29905) Viacheslav 2025-12-06 15:04:14 +03:00
  • 17a9abec2b simplify requires_files list creation (#29656) Chukwuma Nwaugha 2025-12-06 09:42:41 +00:00
  • 92c35abb24 [Misc] Fix circular import in vllm.transformers_utils.config (#30179) Ye (Charlotte) Qi 2025-12-06 01:24:03 -08:00
  • 43e7593031 Support tokenization_kwargs override (#29794) Yu Jiaqi 2025-12-06 17:12:53 +08:00
  • c46b932df2 [Chore] Deprecate SupportsMultiModal.merge_by_field_config (#30170) Cyrus Leung 2025-12-06 15:57:28 +08:00
  • 6476382384 prefix caching design doc sha256 now default (#29261) redwrasse 2025-12-05 23:39:56 -08:00
  • d6aeaddf4a [bugfix] fix type[AttentionBackend] bug in kv_connector_base_v1 (#30051) kx 2025-12-06 15:11:31 +08:00
  • a238cbd89d [Model Runner V2] Support min-p sampling (#30171) Woosuk Kwon 2025-12-05 21:42:47 -08:00
  • 4026ae31e9 [Misc] Move disable_nccl_for_dp_synchronization init logic into VllmConfig (#30161) Nick Hill 2025-12-05 20:59:04 -08:00
  • b12f4a9830 [CI/Build][AMD] Use ROCM_ATTN instead of FLASH_ATTN test for test_register_kv_caches for ROCm and update test for TRITON_ATTN (#29985) rasmith 2025-12-05 22:57:38 -06:00
  • 40a046cd82 [Bugfix]: Fix TokenizerLike interface (#30009) Rohan Potdar 2025-12-05 22:56:40 -06:00
  • e858bc4d14 [Model] Add support for transformer-based Ultravox v0.7 projector (#30089) Peter Salas 2025-12-05 20:55:43 -08:00
  • e3fbb6f152 fix#30092 Kimi-Linear model loading failure with missing indexer_rotary_emb (#30093) Dongjie Zou 2025-12-05 23:55:09 -05:00
  • c4d62618ca Fix AWQ MoE marlin check issue in marlin_utils.py for AMD backend (#30102) yuttian1 2025-12-06 12:54:38 +08:00
  • 62079d8600 [CI/Build][AMD] Skip marlin, machete, and hadacore tests since these require _C functions not defined for ROCm (#30109) rasmith 2025-12-05 22:54:17 -06:00
  • bf4a901af9 Better error when world size is larger than node and distributed_executor_backend is not set (#30140) Harry Mellor 2025-12-06 04:53:52 +00:00
  • 7e31c3a3f6 [CI]: Remove unnecessary imports from test_lmache_integration (#30157) Samuel Shen 2025-12-05 20:53:34 -08:00
  • dc839ad03d [CI/Build][AMD][Quantization] Fix test_int8_kernel.py by updating int8_utils to use hip.libdevice.round (#30151) rasmith 2025-12-05 22:52:11 -06:00
  • 02a4169193 [Tests] Tool call tests for openai/gpt-oss-20b (#26237) Deboleina 2025-12-05 22:03:29 -05:00