Commit Graph

  • 10546f925a [Bugfix] Fix mm budget setting for Qwen Omni models (#33634) Roger Wang 2026-02-02 20:56:25 -08:00
  • e69c990c21 [Feature][CPU Backend]: Optimize ARM vectorization backend (#30329) Radu Salavat 2026-02-03 04:17:56 +00:00
  • 5eac9a1b34 [torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding (#33624) Richard Zou 2026-02-02 19:38:49 -08:00
  • 1b60b45d0d [CI/Build] add directions for CPU image upload to Docker Hub (#32032) Nathan Weinberg 2026-02-02 21:48:06 -05:00
  • 4b3803d180 [BugFix] DPMetadata raises assert error for dense model (#32739) Dezhan 2026-02-02 16:56:44 -08:00
  • 31a64c63a8 [Release] Fix format and cherry-pick (#33618) Zhewen Li 2026-02-02 16:19:05 -08:00
  • 57eae2f891 [Release] patch step3p5 attention class in v0.15.1 release (#33602) Zhewen Li 2026-02-02 14:54:08 -08:00
  • 5019c59dd2 [Voxtral Realtime] Introduce global log mel max (#33574) Patrick von Platen 2026-02-02 23:01:47 +01:00
  • 089cd4f002 fix cutlass_3x_gemm_fp8_blockwise on sm103a (#32224) Lain 2026-02-02 11:47:46 -08:00
  • 0130223bd9 fix memory for online fp8 quantization with streaming weight load (#31914) Vasiliy Kuznetsov 2026-02-02 14:17:42 -05:00
  • 5d1aef3004 [UX] Format attention backend log line (#33570) Matthew Bonanni 2026-02-02 13:57:12 -05:00
  • f0d005864a [Fix] prefix cache hit rate == 0 bug with gpt-oss style models (#33524) Yifan Qiao 2026-02-01 17:59:58 -08:00
  • ffe1fc7a28 Reduce the kernel overhead when num of active loras is smaller than max loras. Multiple cuda graphs are captured for each num of active-loras. (#32005) yugong333 2026-02-02 09:30:06 -08:00
  • 8b7346d5f1 Update huggingface-hub again (#33567) Harry Mellor 2026-02-02 17:20:54 +00:00
  • 6141ebe0dd Remove incorrect tokenizer info test (#33565) Harry Mellor 2026-02-02 17:11:44 +00:00
  • 199e3cb476 [Model] Use mm_position to compute mrope positions for GLM-4.xV (#33039) Yang Liu 2026-02-02 08:55:48 -08:00
  • 9f8cb81b44 [CI] Add DeepSeek V3.2 nightly eval (#33566) Matthew Bonanni 2026-02-02 11:10:02 -05:00
  • d7e17aaacd [Refactor] Move profiling methods to MM budget (#33559) Cyrus Leung 2026-02-02 23:27:00 +08:00
  • 528e9b1490 [Feature][Core] Support Fabric detection to adapt the MNNVL protocol for the GB series (#33540) Kebe 2026-02-02 23:55:46 +09:00
  • d95b4be47a move spec decode slow test to test_areas.yaml (#33365) shanjiaz 2026-02-02 09:28:36 -05:00
  • 4061dcf4c5 [Bugfix] Enable Kimi k25 processor test (#33562) Isotr0py 2026-02-02 22:25:25 +08:00
  • 0aca8b8c62 [MoE] Enable Shared/Routed Overlap For Latent MoE (Nemotron-H) (#32790) danielafrimi 2026-02-02 16:18:50 +02:00
  • 9eb58f8cf1 fix[ROCm]: Remove unconditional aiter import (#32902) Rabi Mishra 2026-02-02 19:40:02 +05:30
  • b10d05b8a8 [Model] Use explicit types in get_generation_prompt (#33551) Cyrus Leung 2026-02-02 20:38:49 +08:00
  • b398e5c819 Update get_expert_mapping to include self parameter (#33525) Borushiki 2026-02-02 13:29:07 +01:00
  • 78061ef584 Fix accessing hidden_act from model config (#32686) Grzegorz K. Karch 2026-02-02 12:11:33 +01:00
  • 528b3076af [CI][Bugfix] Fix flaky tests/v1/kv_connector/unit/test_multi_connector.py::test_multi_example_connector_consistency (#33555) Nicolò Lucchesi 2026-02-02 12:01:29 +01:00
  • a502831d36 [Chore] Remove redundant input parsing methods (#33542) Cyrus Leung 2026-02-02 18:50:47 +08:00
  • 94cbe0a328 [Nightly CI] Remove CT Model (#33530) Robert Shaw 2026-02-01 22:09:09 -05:00
  • 8b45c58fe9 [Models] Step-3.5-Flash (#33523) csy0225 2026-02-02 10:21:18 +08:00
  • ba871fb788 [Misc] support arbitrary MM datasets in spec dec bench (#33486) Komal Kumar Teru 2026-02-02 14:19:48 +05:30
  • c7039a80b8 pin LMCache to v0.3.9 or greater with vLLM v0.15.0 (#33440) Greg Pereira 2026-01-31 19:50:38 -08:00
  • 15ebd0cedf fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels (#33417) René Honig 2026-01-31 23:06:42 +01:00
  • 2915268369 [fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops (#33441) Luka Govedič 2026-01-31 09:48:34 -05:00
  • d984d664cc [BugFix] Fix whisper FA2 + full cudagraphs (#33360) Lucas Wilkinson 2026-01-30 21:15:06 -07:00
  • 5f45b0b7e0 [Bugfix][ROCm] Fixing the skinny gemm dispatch logic from #32831 (#33366) Gregory Shtrasberg 2026-01-30 19:05:23 -06:00
  • a2dba556db [release] Minor fixes to release annotation and wheel upload (#33129) Kevin H. Luu 2026-01-29 12:09:35 -08:00
  • 6ff16b77f8 [Bugfix] Enable Triton MoE for FP8 per-tensor dynamic (#33300) Michael Goin 2026-01-29 15:15:17 -05:00
  • 1ed963d43a [Bugfix] Fix Qwen3-VL-Reranker load. (#33298) wang.yuqi 2026-01-29 16:42:53 +08:00
  • 39e8b49378 [Bugfix] Register fp8 cutlass_group_gemm as supported for only SM90+SM100 (#33285) Michael Goin 2026-01-28 21:40:59 -05:00
  • ab374786c7 [CPU][IBM Z][Dockerfile] Fix IBM Z builds (#33243) R3hankhan 2026-02-02 13:11:29 +05:30
  • 808dd87b30 [Model] Support DeepSeek-OCR-2 (#33165) RED 2026-02-02 14:24:10 +08:00
  • beb8899482 Fix mistral sliding window parsing (#33521) Andy Lo 2026-02-02 05:08:04 +00:00
  • ce88756b96 [Doc]: update paths for Offline/Online/Others example sections (#33494) Sawyer Bowerman 2026-02-01 22:56:53 -05:00
  • a3154a6092 [Doc] add missing model entries in supported_models.md (#33220) Paco Xu 2026-02-02 11:37:25 +08:00
  • 7c036432fc [Bugfix] GLM-4 tool parser: incremental string streaming (#33218) jack 2026-02-02 11:13:31 +08:00
  • 318b120766 [Nightly CI] Remove CT Model (#33530) Robert Shaw 2026-02-01 22:09:09 -05:00
  • c3b40dc3e7 [Models] Step-3.5-Flash (#33523) csy0225 2026-02-02 10:21:18 +08:00
  • a01ef3fa51 [Fix] prefix cache hit rate == 0 bug with gpt-oss style models (#33524) Yifan Qiao 2026-02-01 17:59:58 -08:00
  • 7320ca3942 Add unpermute-aware fused MoE LoRA path (#32655) Runkai Tao 2026-02-01 20:46:09 -05:00
  • cf0a99f84d [ModelRunner V2] Support spec decode with structured outputs (#33374) Nick Hill 2026-02-01 16:19:59 -08:00
  • e535d90deb [ModelRunner V2] Misc minor simplifications and optimizations (#33467) Nick Hill 2026-02-01 14:17:14 -08:00
  • 0b225fb7b2 [Misc] skip target model mm emb in draft proposal step when draft is text-only (#33437) Komal Kumar Teru 2026-02-02 02:43:35 +05:30
  • 46b4a02794 Fix DeepSeek V2 RoPE initialization error (#33501) will b. 2026-02-01 15:00:56 -06:00
  • 8869cd8ec1 Add MoE config for Super B200 TP2 (#33510) shaharmor98 2026-02-01 20:48:37 +02:00
  • cd86fff38f [BUGFIX] Fix hipErrorIllegalState in Qwen3-Omni during startup profiling allow inference Omni on ROCM (#33077) JartX 2026-02-01 14:36:25 +01:00
  • b5f8c3092d [W8A8 Block Linear Refactor][1/N] Keep all quantization types into QuantFP8 class. (#33047) Maral 2026-02-01 17:28:01 +08:00
  • 21997f45b1 [Redo] #33110 with threading limit (#33502) Cyrus Leung 2026-02-01 17:18:11 +08:00
  • 672023877b Change defaults for vllm bench startup (#33489) Luka Govedič 2026-02-01 02:46:01 -05:00
  • 754a8ca942 fix: only include Authorization header when OPENAI_API_KEY is set (#33488) Zack Yu 2026-01-31 23:35:09 -08:00
  • 302ecf64ff [Models]: lfm2_siglip2 return intermediate encoder layers (#33370) Eduardo Salinas 2026-02-01 01:17:49 -05:00
  • b6bb2842cf [Critical] Revert #33110 (#33500) Cyrus Leung 2026-02-01 13:06:42 +08:00
  • 79b6ec6aab [Bugfix] Fix inconsistent handling of cache reset (#33481) Cyrus Leung 2026-02-01 12:23:41 +08:00
  • d6416fdde9 pin LMCache to v0.3.9 or greater with vLLM v0.15.0 (#33440) Greg Pereira 2026-01-31 19:50:38 -08:00
  • 0fb3157267 [ROCm][CI] Update huggingface-hub pin (#33492) Andreas Karatzas 2026-01-31 20:51:54 -06:00
  • a358e4dffe [Refactor] Make Renderer an abstract class (#33479) Cyrus Leung 2026-02-01 10:36:30 +08:00
  • 079781177a fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels (#33417) René Honig 2026-01-31 23:06:42 +01:00
  • 63c0889416 [Misc] Fix flashinfer related tests (#33462) Roy Wang 2026-02-01 05:10:24 +08:00
  • 1e86c802d4 Fix grammar (#33121) smashyalts 2026-01-31 18:59:34 +01:00
  • fedf64332e [Bugfix]: Fix display errors in TORCH_CHECK messages (#32942) linhaifeng 2026-02-01 01:48:48 +08:00
  • 2238a12c13 [Misc] support collect_env for endpoint /server_info (#33246) Xiao Yang 2026-02-01 01:42:59 +08:00
  • ce0afe2451 Update huggingface-hub pin for the last time before Transformers v5 (#33473) Harry Mellor 2026-01-31 17:14:24 +00:00
  • 88c3e114d8 [Refactor] Move MM data parsing outside processor (#33408) Cyrus Leung 2026-02-01 00:46:14 +08:00
  • 92924b2ddd [Deprecation] Remove deprecated items related to pooling (#33477) Cyrus Leung 2026-02-01 00:44:40 +08:00
  • 27cb2f678f [Bugfix] Early-reject requests with MM data longer than encode cache capacity (#33110) YunzhuLu 2026-02-01 00:41:13 +08:00
  • 22d9a056d5 Support clear mm and encoder cache (#33452) jma99_2333 2026-01-31 07:22:25 -08:00
  • 13b842f271 [BugFix][Router Replay] Capture Logical Experts with EPLB (#33013) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2026-01-31 17:12:17 +02:00
  • 15f40b20aa [fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops (#33441) Luka Govedič 2026-01-31 09:48:34 -05:00
  • 793af538a3 [Doc] Update plugin deprecation notices (#33476) Cyrus Leung 2026-01-31 22:48:28 +08:00
  • 6f5e7cda57 support return prompt token ids in responses (#33378) cmunley1 2026-01-31 06:04:20 -08:00
  • 68feb76a6f [Misc] Replace deprecated interface seed_everything (#33474) Roy Wang 2026-01-31 21:38:39 +08:00
  • 4cb59dea6a [Bugfix] Fix incompatibility between #33372 and #32863 (#33475) Cyrus Leung 2026-01-31 21:21:32 +08:00
  • 608b556507 [ez] Add structured torch.compile logs (#33213) Angela Yi 2026-01-31 05:00:54 -08:00
  • f0a1c8453a [Frontend] Use new Renderer for Completions and Tokenize API (#32863) Cyrus Leung 2026-01-31 20:51:15 +08:00
  • 8980001c93 [perf] v1/spec_decode: skip softmax for all-greedy rejection sampling (#32852) caozuoba 2026-01-31 17:51:26 +08:00
  • 527bcd14d4 [ROCM] Enable aiter attn backend for qwen3-next model (#32492) jennyyyyzhen 2026-01-31 01:03:57 -08:00
  • f68e3ea4e1 [BugFix] Add synchronize in CutlassW4A8LinearKernel to ensure data is ready for use. (#33078) Jinwu 2026-01-31 00:14:54 -08:00
  • d5c41db35b [Kernel] [Helion] [3/N] Helion kernel registry (#33203) Yanan Cao 2026-01-30 23:38:46 -08:00
  • 1618e25492 [CPU][Feat] Enable KleidiAI accelerated int4 dynamic quant with BF16 activations on Arm CPUs (#33122) Fadi Arafeh 2026-01-31 07:16:22 +00:00
  • f3888aca83 Add EAGLE3 support for AFMoE (#33111) AutumnAurelium 2026-01-30 22:53:08 -08:00
  • f0bca83ee4 Add support for Mistral Large 3 inference with Flashinfer MoE (#33174) Dimitrios Bariamis 2026-01-31 07:48:27 +01:00
  • 73419abfae [Bugfix] Handle Asym W4A16 (ConchLinearKernel) for CT (#33200) Matthias Gehre 2026-01-31 07:21:51 +01:00
  • e77f162cf5 [Bugfix] Fix Qwen3ASR language asr tag in output (#33410) Nicolò Lucchesi 2026-01-31 06:24:49 +01:00
  • 8ecd213c0b [Kernel] [Helion] [2/N] Helion kernel wrapper (#32964) Yanan Cao 2026-01-30 20:53:01 -08:00
  • 5b55c0bea7 [Attention] Clarify comment explaining attn_logits +1 dimension (#33427) Francesco Fusco 2026-01-31 05:50:30 +01:00
  • 15e0bb9c42 [Streaming -> Realtime] Rename all voxtral related classes, fn, files (#33415) Patrick von Platen 2026-01-31 05:49:00 +01:00
  • 6c64c41b4a [ROCm][CI] Force max_num_seqs=1 on ROCm In test_sharded_state_loader to reduce flakiness (#33277) Micah Williamson 2026-01-30 22:28:29 -06:00
  • a2ef06e1b3 [Misc] offest -> offset in comments and variable names (#33444) Russell Bryant 2026-01-30 23:19:22 -05:00
  • 0a3c71e7e5 [BugFix] Fix whisper FA2 + full cudagraphs (#33360) Lucas Wilkinson 2026-01-30 21:15:06 -07:00
  • 29fba76781 [UX] Use gguf repo_id:quant_type syntax for examples and docs (#33371) Michael Goin 2026-01-30 23:14:54 -05:00