Commit Graph

  • 7779de34da [BugFix] Fix P/D with non-MoE DP (#33037) Nick Hill 2026-01-27 08:03:47 -08:00
  • 0d8ce320a2 [Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 (#33090) Nicolò Lucchesi 2026-01-27 16:03:20 +01:00
  • d51e1f8b62 [Bugfix] Disable CG for Whisper+FA2 (#33164) Nicolò Lucchesi 2026-01-27 14:46:51 +01:00
  • 5042815ab6 [Models] Kimi-K2.5 (#33131) Roger Wang 2026-01-26 22:50:31 -08:00
  • afb390ab02 [CI] Fix AssertionError: MCP tool call not found in output_messages (#33093) Chauncey 2026-01-26 23:19:57 +08:00
  • ecb4f82209 [CI] Update job dependency syntax for Intel and AMD jobs (#33240) Kevin H. Luu 2026-01-28 01:33:59 -08:00
  • 5914090765 [CI] Update job dependency for hardware and CPU jobs (#33237) Kevin H. Luu 2026-01-28 01:10:05 -08:00
  • f1acbd68c5 [CI] Enable mypy import following for vllm/compilation (#33199) Harry Mellor 2026-01-28 08:59:54 +00:00
  • 9581185d51 [XPU]disable test_acceptance_length UT (#33226) Yan Ma 2026-01-28 15:24:13 +08:00
  • 2dd359f953 [Docs] Simplify CPU x86 Docker build documentation (#33071) Maryam Tahhan 2026-01-28 06:37:09 +00:00
  • 22ad649501 [ROCm] Enabling forward_includes_kv_cache on ROCm MHA backends (#33106) Gregory Shtrasberg 2026-01-28 00:36:14 -06:00
  • 36d450e3b8 Adds FunAudioChat multimodal audio model support (#2) (#33058) ramos 2026-01-28 13:18:09 +08:00
  • a2b877df6c [Bugfix] Lazy import NgramProposer in GPU model runner (#32821) 22quinn 2026-01-27 21:07:16 -08:00
  • 35fb0b8613 Don't use min_pixels/max_pixels from Qwen2VL's processor (#33208) Harry Mellor 2026-01-28 05:02:08 +00:00
  • 2eb673a088 Add flake8-implicit-str-concat rules to Ruff (#33191) Harry Mellor 2026-01-28 04:56:10 +00:00
  • a97b5e206d Relax protobuf library version constraints (#33202) Jeffrey Wang 2026-01-27 20:15:53 -08:00
  • 911b51b69f [ROCm][CI] Add TORCH_NCCL_BLOCKING_WAIT For Distributed Tests (A100) (#32891) Micah Williamson 2026-01-27 21:32:31 -06:00
  • 604e3b87e8 [Feature]: Container image WORKDIR consistency (#33159) Xinan Miao 2026-01-28 11:06:48 +08:00
  • 706f123b23 [Docs] Use definition lists for CLI reference docs (#33186) Harry Mellor 2026-01-28 02:22:48 +00:00
  • fb7abfc1d0 [docs] Improve tlparse section (#33211) Angela Yi 2026-01-27 18:07:37 -08:00
  • 5d3d6e44e8 [CI] minor fixes to pipeline generator and tests (#33151) Kevin H. Luu 2026-01-27 17:04:02 -08:00
  • 46ec6d71c7 [Model Runner V2] Use a different stream for grammar bitmask h2d copy (#33059) Woosuk Kwon 2026-01-27 16:37:43 -08:00
  • e82fa448c4 Add attention benchmarking tools (#26835) Matthew Bonanni 2026-01-27 19:09:20 -05:00
  • d9aa39a3bb [torch.compile] Speed up MOE handling in forward_context (#33184) Richard Zou 2026-01-27 18:17:54 -05:00
  • 3a6d5cbefd [Perf] Optimize dcp allocate tensor (#33102) Wentao Ye 2026-01-27 17:24:41 -05:00
  • f5d7049cc1 [Bugfix] Fix display error (inconsistent with context) (#33020) linhaifeng 2026-01-28 04:33:29 +08:00
  • 3c3c547ce0 Enabling "2 node" distributed tests in the AMD CI pipeline. (#32719) Alexei-V-Ivanov-AMD 2026-01-27 13:13:21 -06:00
  • 1cbccb6dba [Attention] Use has_flashinfer helper (#33177) Matthew Bonanni 2026-01-27 13:33:17 -05:00
  • bd92089d33 feature: support eagle3 for HunyuanVL & Hunyuan (#33035) Iris 2026-01-28 01:55:48 +08:00
  • a6760f1525 [Doc] Improve serve parameter documentation with meaningful defaults (#33082) Karan Bansal 2026-01-27 22:49:37 +05:30
  • 66e601ef79 Support compress-tensors with nvfp4 or fp8 weights and modelopt with nvfp4 weights on Turing (#33076) IriKa 2026-01-28 00:04:05 +08:00
  • 0cd259b2d8 [BugFix] Fix P/D with non-MoE DP (#33037) Nick Hill 2026-01-27 08:03:47 -08:00
  • 83fb2d09e8 Support heterogeneous NemotronHPuzzle model (#32549) danielafrimi 2026-01-27 17:55:54 +02:00
  • f3a5ee705f [LoRA][Spec Decode] Support LoRA for Nemotron-H MTP models (#32265) danisereb 2026-01-27 17:53:26 +02:00
  • 7cbbca9aaa [Frontend] Cleanup api server (#33158) wang.yuqi 2026-01-27 23:18:10 +08:00
  • 5ec44056f7 [Metrics][MFU] Fix UnembedMetrics FLOP overcounting for prefill (#33045) (#33045) omkhalil 2026-01-27 10:16:49 -05:00
  • 492a7983dd [Bugfix] Fix DeepseekV32 AssertionError: num_kv_heads == 1 (#33090) Nicolò Lucchesi 2026-01-27 16:03:20 +01:00
  • a608b4c6c2 [5/N][Attention] Finish eliminating vllm/attention folder (#32064) Matthew Bonanni 2026-01-27 10:02:51 -05:00
  • 1f3a2c2944 [Bugfix] Disable CG for Whisper+FA2 (#33164) Nicolò Lucchesi 2026-01-27 14:46:51 +01:00
  • 7227d06156 [Metrics] [KVConnector] Add Offloading Connector metrics (#27942) omerpaz95 2026-01-27 15:34:49 +02:00
  • 14385c80fc Fix weight mapping test for Transfomers v5 (#33162) Harry Mellor 2026-01-27 12:30:14 +00:00
  • 76139d0801 [Frontend] Frontend will only attach supported tasks corresponding entrypoints. (#33139) wang.yuqi 2026-01-27 20:15:43 +08:00
  • da8d0c441a [AMD][QWEN3-NEXT] FP8 Tunings (#32042) Lifan Shen 2026-01-27 01:34:13 -08:00
  • 58996f3589 [AMD][Kernel][BugFix] Use correct scale in concat_and_cache_ds_mla_kernel when on gfx942 (#32976) (tag: v0.15.0rc1) rasmith 2026-01-27 01:16:43 -06:00
  • b539f988e1 [Models] Kimi-K2.5 (#33131) Roger Wang 2026-01-26 22:50:31 -08:00
  • 6c00645712 [CI][Pooling] Stabilize ModernBERT test (#32909) Andreas Karatzas 2026-01-26 23:26:48 -06:00
  • b781eeaa15 [code clean] remove duplicate code (#33135) Ning Xie 2026-01-27 12:57:16 +08:00
  • e0b005d9cf [Frontend] Cleanup serving engine (#33103) Cyrus Leung 2026-01-27 12:47:26 +08:00
  • 3b8f0fe59e [torch.compile] Stop assuming 32 bit indexing (#33113) Richard Zou 2026-01-26 23:25:02 -05:00
  • c831911be2 [Frontend] Reduce mixin usage in serving pooling (#33101) Cyrus Leung 2026-01-27 11:50:37 +08:00
  • 157caf511b [Perf] avoid duplicate mem_get_info() call in get_current_memory_usage (#33064) Paco Xu 2026-01-27 11:45:45 +08:00
  • 0b53bec60b [DOC]: Add warning about max_num_batched_tokens and max_model_len when chunked prefill is disabled (#33109) Vincent Gimenes 2026-01-27 04:05:02 +01:00
  • c568581ff3 Fix IndexError with encoder-decoder models when using Custom Paged Attention (#33112) Strahinja Stamenkovic 2026-01-27 03:33:37 +01:00
  • 2d7053438a fix: preserve native tool call ID in multi-turn tool calling (#32768) wangln19 2026-01-27 10:22:35 +08:00
  • 5a93b9162b [MoE Refactor] Integrate Naive Prepare Finalize into MK (#32567) Robert Shaw 2026-01-26 20:28:02 -05:00
  • 6d86fde09c [Model Runner V2] Remove UvaBufferPool for cpu->gpu copy (#33055) Woosuk Kwon 2026-01-26 16:47:35 -08:00
  • 510ed1e8d3 [Bugfix][TPU] Return a Default fp8 MoE Backend (#32908) XiongfeiWei 2026-01-26 15:46:11 -08:00
  • 8caffd92df [Bugfix][MXFP4] Call trtllm_fp4_block_scale_moe with kwargs (#33104) Pengchao Wang 2026-01-26 15:13:18 -08:00
  • 58a05b0ca1 [fix] CPUDNNLGEMMHandler pointer baked into inductor artifact (#32913) dolpm 2026-01-26 13:59:44 -08:00
  • 6ee7f18f33 [Logging] add --disable-access-log-for-endpoints CLI option (#30011) Jared Wen 2026-01-27 05:49:03 +08:00
  • 8f987883cb [Refactor] Remove unused _moe_permute function (#33108) Wentao Ye 2026-01-26 16:06:45 -05:00
  • cf1167e50b [Bugfix] Fix Dtypes for Pynccl Wrapper (#33030) (tag: v0.15.0rc0) Robert Shaw 2026-01-26 15:09:32 -05:00
  • ebe0ba91db [ci] Sync test areas with test-pipeline.yaml and enable new pipeline generator (#33080) Kevin H. Luu 2026-01-26 12:28:20 -08:00
  • 43a013c3a2 [Bugfix] Fix Dtypes for Pynccl Wrapper (#33030) Robert Shaw 2026-01-26 15:09:32 -05:00
  • c25dbee40d [Model] Bump transformers version for test registry (#33100) Cyrus Leung 2026-01-27 02:53:22 +08:00
  • 19ab0f7ce5 [Bugfix] Fix Voxtral streaming slot_mapping (#33073) Nicolò Lucchesi 2026-01-26 19:40:40 +01:00
  • 67fe677c53 [FIX] Always support TP > 4 for FP4 Gemm (#31099) danielafrimi 2026-01-26 20:04:20 +02:00
  • d56afd45fd Remove unused logic in models/mistral.py (#33095) Andy Lo 2026-01-26 17:01:52 +00:00
  • a2393ed496 [CI] Fix AssertionError: MCP tool call not found in output_messages (#33093) Chauncey 2026-01-26 23:19:57 +08:00
  • be6931ee27 [ROCm][Bugfix] Fix ptpc scale load issue for fused shared expert path in deepseek mtp (#33018) Pleaplusone 2026-01-26 23:19:04 +08:00
  • 9ef3b718d9 [Bugfix] Fix Can't instantiate abstract class DeepseekV32IndexerBackend (#33052) Chauncey 2026-01-26 22:44:02 +08:00
  • bb17e8f11c [GLM-OCR] GLM-OCR with MTP Support (#33005) Yuxuan Zhang 2026-01-26 22:24:43 +08:00
  • dcd80206b7 [Chore] Update type annotation of input_ids in model forward (#33063) Cyrus Leung 2026-01-26 22:02:10 +08:00
  • f4a0921c9c [Performance] Tune Mamba selective scan kernel for B200 (#32873) danisereb 2026-01-26 15:56:54 +02:00
  • 208c56256f [Feature] Add LoRA support for Gemma3 vision components (#32764) VihaanThat 2026-01-26 19:26:40 +05:30
  • 9ac818a551 [Misc] HF Hub LoRA Resolver (#20320) Alex Brooks 2026-01-26 06:56:32 -07:00
  • 6ca2c91b96 [Model] Use mm_position to compute mrope positions for Qwen3-Omni (#33010) Itay Etelis 2026-01-26 15:48:07 +02:00
  • e33192b269 [lora/moe] Improve fused MoE‑LoRA kernel indexing and memory access (#32770) cwazai 2026-01-26 20:56:34 +08:00
  • 61274bdef5 [Doc] Further update multi-modal impl doc (#33065) Cyrus Leung 2026-01-26 18:54:20 +08:00
  • b40db4dfec [StepVL] add step vl offline example (#33054) ltd0924 2026-01-26 17:00:32 +08:00
  • 11b556878b [Refactor] Use data parser for matching data items to multi-modal UUIDs (#32955) Cyrus Leung 2026-01-26 15:00:28 +08:00
  • ee484b3f4b Set splitk=1 for fused-moe-lora expand kernel (#32882) Danielle Robinson 2026-01-25 22:52:34 -08:00
  • a9b53dd435 [Model Runner V2] Add LoRAState to consolidate lora logic (#33062) Woosuk Kwon 2026-01-25 22:21:12 -08:00
  • 254db42ede [Tests] Remove Duplicates (#33032) Robert Shaw 2026-01-26 00:23:54 -05:00
  • 105d104576 [StepVL] support close img patch (#32923) ltd0924 2026-01-26 12:56:39 +08:00
  • 566cdb6cfb [CI] Fix MHA attention test failure (AttributeError when model_config is None in ViT attention backend) (#33033) Lucas Wilkinson 2026-01-25 20:49:53 -07:00
  • 2f0d3ba745 [Model Runner V2] Minor simplification for finish_requests (#33048) Woosuk Kwon 2026-01-25 18:35:02 -08:00
  • edf927bc9f [Model Runner V2] Fix slot_mapping after #25954 (#33046) Woosuk Kwon 2026-01-25 18:29:49 -08:00
  • 22aeb43007 [Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-VL profiling (#32969) Andreas Karatzas 2026-01-25 18:34:05 -06:00
  • a698e8e7ad [Model] Use mm_position to compute mrope positions for Qwen2.5-Omni (#32772) Itay Etelis 2026-01-25 14:15:53 +02:00
  • 151e5451c2 [Doc] Add Qwen2.5 models to batch invariance tested models (#33016) zhanqiuhu 2026-01-25 04:20:46 -05:00
  • 73b243463b [BugFix] Add env variable to control PDL in LoRA (#32836) Jee Jee Li 2026-01-25 16:32:30 +08:00
  • 7e67df5570 [Bugfix] fix encoder cache hang in Qwen3VL (#32684) JJJYmmm 2026-01-25 13:17:31 +08:00
  • ff6c1da4e6 [Docs] Fix Apple silicon include path in CPU installation docs (#32977) 7. Sun 2026-01-25 01:51:49 +00:00
  • fcb9df99bd [Perf][Kernel] Optimize FP4 quantization kernels (SM100F) (#32520) Roberto L. Castro 2026-01-25 02:45:27 +01:00
  • 1ebdff412a [DOC] [ROCm] Update doc for v0.14.1 (#32998) TJian 2026-01-25 09:13:21 +08:00
  • 91601ff478 [Feature] add session based streaming input support to v1 (#28973) Joshua Deng 2026-01-24 13:06:28 -07:00
  • d4dbb7af63 Using max_loras + 1 to construct grid in fused_moe_lora (#32277) yugong333 2026-01-24 09:39:30 -08:00
  • 203d0bc0c2 [CPU] Improve CPU Docker build (#30953) Maryam Tahhan 2026-01-24 17:08:24 +00:00
  • 17ab54de81 [CPU Backend][BugFix] Fix failing Darwin pipelines (#33002) Fadi Arafeh 2026-01-24 17:02:22 +00:00