Commit Graph

  • 49d9653852 [ROCm][CI] fix get_valid_backends (#32787) Divakar Verma 2026-01-21 22:27:47 -06:00
  • a1d82466ea [Docs] Remove outdated async_scheduling limitation with speculative decoding (#32775) Ifta khairul Alam Adil 2026-01-22 05:19:25 +01:00
  • 24a163ed77 Cleanup some huggingface_hub-related stuff (#32788) Lucain 2026-01-22 04:38:17 +01:00
  • 378385b90c [EC Connector] Optimize remote cache check in scheduler (#32585) knlnguyen1802 2026-01-22 11:30:59 +08:00
  • c5487e2b96 [Bugfix] Fix potential EAGLE spec decode segfault during graph capture (#32818) Matt 2026-01-21 21:11:55 -06:00
  • 6437ff1fb9 [Deprecation] Remove deprecated environment variables (#32812) Wentao Ye 2026-01-21 21:25:16 -05:00
  • 5e00b561cd [Model Runner V2] Do not error on attention backends (#32820) Woosuk Kwon 2026-01-21 17:02:48 -08:00
  • 408195ec59 [Model Runner V2] Refactor Prompt Logprobs (#32811) Woosuk Kwon 2026-01-21 15:12:20 -08:00
  • 63227accf5 [Kernel] Add topk_sigmoid kernel (#31246) Xin Yang 2026-01-21 14:49:51 -08:00
  • e675dda67b [Misc] Add Helion version check to collect_env (#32797) Yanan Cao 2026-01-21 13:54:46 -08:00
  • 24dc30f7ff [ModelRunner V2] Don't pin reused flashinfer tensors (#32799) Nick Hill 2026-01-21 13:17:43 -08:00
  • 180fba653e [ROCm] fix import for on_gfx9 (#32783) Divakar Verma 2026-01-21 12:41:11 -06:00
  • f999539869 Add missing import of fused_topk to benchmark_moe (#32784) danisereb 2026-01-21 20:30:10 +02:00
  • e1da249c93 [Model Runner V2] Minor refactor for compute_slot_mappings (#32794) Woosuk Kwon 2026-01-21 10:24:35 -08:00
  • 9b693d023c [Misc] Omit "disable NCCL for DP sync" startup log when not applicable (#32707) Nick Hill 2026-01-21 09:03:39 -08:00
  • 808d6fd7b9 Bump Flashinfer to v0.6.1 (#30993) elvischenv 2026-01-22 00:49:50 +08:00
  • 1861ae8aae [PluggableLayer][1/N] Define PluggableLayer (Fix ci) (#32744) whx 2026-01-22 00:38:04 +08:00
  • 4e31b7f228 [Quantization][Deprecation] Remove RTN (#32697) Robert Shaw 2026-01-21 11:34:42 -05:00
  • 6c20e89c02 [ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp (#29287) Pleaplusone 2026-01-21 23:16:30 +08:00
  • 85f55c943c [Quantization][Deprecation] Deprecate HQQ (#32681) Robert Shaw 2026-01-21 09:32:40 -05:00
  • cea3c754c4 [Quantization][Deprecation] Remove DeepSpeedFp8 (#32679) Robert Shaw 2026-01-21 09:32:12 -05:00
  • 42135d6898 [MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414) Robert Shaw 2026-01-21 08:22:33 -05:00
  • e14467be43 [bugfix] Aria model (#32727) Divakar Verma 2026-01-21 07:11:31 -06:00
  • 7727ce35c2 [Model] Add Eagle2.5-8B Vision-Language Model support (#32456) Kim Hee Su 2026-01-21 18:39:53 +09:00
  • 6bb2bc71e2 [Bugfix] Force using spawn multiprocess method when it's the WSL platform (#32749) Yanwen Lin 2026-01-21 01:35:55 -08:00
  • c80f92c14d [Documentation] Fix typo in docs/design/torch_compile_multimodal.md (#32741) Lucas Kabela 2026-01-20 23:54:20 -08:00
  • f23fb5a7c1 [Bugfix] Support HF sharded weights for Mistral3/Pixtral models (#32673) RickyChen / 陳昭儒 2026-01-21 15:27:30 +08:00
  • 360aa93f8f [Docs] Fix GitHub handle in governance process (#32582) Paco Xu 2026-01-21 15:07:50 +08:00
  • 27ca95b3c9 [Bugfix] Fix Nemotron-Nano-v2-vlm static resolution (#32682) Netanel Haber 2026-01-21 08:28:21 +02:00
  • b4f64e5b02 Update FlashMLA (#32491) Lucas Wilkinson 2026-01-20 22:03:37 -07:00
  • 7ab80a8e37 Added qwen3 vision language moe support for speculative decoding (#32048) shanjiaz 2026-01-20 22:24:05 -05:00
  • 0900cedb3f Enable Eagle3 speculative decoding for Pixtral (LlavaForConditionalGeneration) (#32542) gopalsarda 2026-01-20 19:18:05 -08:00
  • 6f067b1fb7 [Cleanup] Remove unused KVConnectorModelRunnerMixin methods (#32077) Nick Hill 2026-01-20 19:16:37 -08:00
  • 27b81e010d [Bugfix] Fix Granite Vision / Don't use Siglip Pooling Head Nested Models by Default (#32299) Alex Brooks 2026-01-20 20:11:52 -07:00
  • 7013e9ac8f OffloadingConnector: Prevent redundant loads (#29087) Or Ozeri 2026-01-21 03:15:42 +02:00
  • c78ee240b3 Revert "[PluggableLayer][1/N] Define PluggableLayer" (#32725) Robert Shaw 2026-01-20 19:21:06 -05:00
  • d2389c1262 fp8 online quant: split out Fp8OnlineLinearMethod (#32189) Vasiliy Kuznetsov 2026-01-20 18:13:22 -05:00
  • 22375f8d13 [ROCm][CI] Remove DS async eplb accuracy test from AMD CI (#32717) Micah Williamson 2026-01-20 15:40:48 -06:00
  • 9b67338b78 [Bugfix] Suppress log on non-ROCm platform (#32703) TJian 2026-01-21 05:38:20 +08:00
  • 2261340806 [Misc] Remove pad_for_cudagraphs from config (#30143) Lucas Wilkinson 2026-01-20 13:05:48 -07:00
  • 86c69dc54c [Bugfix] Fix byte fallback handling when using outlines (#31391) Shinichi Hemmi 2026-01-21 04:48:08 +09:00
  • 7c5dedc247 [AOT compilation] support torch.compile inductor artifacts in VllmCompiledFunction (#25205) dolpm 2026-01-20 11:45:59 -08:00
  • 193069d129 [5/N] Initialize MM components in context managers (Q-Z) (#32695) Cyrus Leung 2026-01-21 03:10:23 +08:00
  • f0feb1cf81 Test: added acceptance length tests (#32030) Rahul Tuli 2026-01-21 00:25:15 +05:30
  • 09194b90a5 [Doc] Update docs for MM model development with context usage (#32691) Cyrus Leung 2026-01-21 02:37:35 +08:00
  • 9ab4388cd3 [Model Runner V2] Support FLASHINFER_MLA backend (#32709) Woosuk Kwon 2026-01-20 10:26:17 -08:00
  • 04a9e064db [Bugfix] fix the ima issue of qwen-vit (#32687) JJJYmmm 2026-01-21 01:21:25 +08:00
  • c025263ddd [Doc] [ROCm] Update ROCm getting started doc (#32580) TJian 2026-01-21 01:20:08 +08:00
  • 6c97b9b9b6 [Perf] Only clone when needed for moe_permute (#32273) Wentao Ye 2026-01-20 11:34:39 -05:00
  • 4ca62a0dbd [PluggableLayer][1/N] Define PluggableLayer (#32331) whx 2026-01-21 00:19:21 +08:00
  • 7901109ea5 [Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculation (#32603) linhaifeng 2026-01-21 00:13:39 +08:00
  • 13f6630a9e [XPU]Support AgRsAll2AllManager on XPU device (#32654) YiSheng5 2026-01-20 22:27:24 +08:00
  • fda3f03eb2 [4/N] Initialize MM components in context managers (M-P) (#32663) Cyrus Leung 2026-01-20 22:06:32 +08:00
  • bb9172030e [Metrics] Complete removal of deprecated vllm:time_per_output_token_seconds metric (#32661) 杨朱 · Kiki 2026-01-20 20:28:41 +08:00
  • c4e5bdf61b [Bugfix] Fix the fp8_mqa_logits dim mismatch (#32652) Chauncey 2026-01-20 18:48:07 +08:00
  • 7f1bcd18ff [3/N] Initialize MM components in context managers (I-L) (#32650) Cyrus Leung 2026-01-20 18:21:56 +08:00
  • 8be263c3fb [Core] Cleanup shm based object store on engine shutdown (#32429) Walter Beller-Morales 2026-01-20 03:53:37 -05:00
  • e1a34c3a5d [2/N] Initialize MM components in context managers (E-H) (#32641) Cyrus Leung 2026-01-20 16:12:56 +08:00
  • 148117ea2e [Refactor] Make FP8 Linear Ops use kernel abstraction (#27814) vllmellm 2026-01-20 14:48:20 +08:00
  • e9c83cdc51 [Model Runner V2] Skip kernel launch for penalties & logit_bias (#32634) Woosuk Kwon 2026-01-19 22:20:19 -08:00
  • b75e85dede [1/N] Initialize MM components in context managers (A-D) (#32632) Cyrus Leung 2026-01-20 14:12:42 +08:00
  • 4753f3bf69 [Model] Use context managers for encoder- and LM-only mode (#32605) Cyrus Leung 2026-01-20 11:43:38 +08:00
  • 6c01ffb897 [Model Runner V2] Decouple temperature from penalties (#32629) Woosuk Kwon 2026-01-19 19:13:24 -08:00
  • 7b7cdce968 [Model Runner V2] Refactor get_cudagraph_and_dp_padding (#32625) Woosuk Kwon 2026-01-19 18:25:02 -08:00
  • 12dab78f49 [Feat] allow inplace loading lora (#31326) Jackmin801 2026-01-19 18:15:20 -08:00
  • 05dc4bfab6 [Model Runner V2] Initialized communication buffer for DP (#32624) Woosuk Kwon 2026-01-19 17:27:06 -08:00
  • 1a1fc3bbc0 [Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Blackwell, and TRTLLM the default prefill (#32615) Matthew Bonanni 2026-01-19 18:41:34 -05:00
  • 43fada5360 [Model Runner V2] Refactor dummy_run (#32533) Woosuk Kwon 2026-01-19 14:50:59 -08:00
  • 4a5299c93f feat: spec decode with draft models (#24322) Tomas Ruiz 2026-01-19 15:05:46 -06:00
  • 73f2a81c75 docs: prefix caching seems quite outdated (#28784) lon 2026-01-19 16:49:52 -03:00
  • 7350331718 [BugFix] Fix TRT-LLM NVFP4 DP/EP (#32349) jiahanc 2026-01-19 11:32:24 -08:00
  • 9d1e611f0e [CI] Add Helion as an optional dependency (#32482) Yanan Cao 2026-01-19 11:09:56 -08:00
  • 0727cc9ecf [BUGFIX] Fix test_mla_backends.py. Scale MLA projection weights to prevent numerical instability (#32529) Vadim Gimpelson 2026-01-19 22:49:29 +04:00
  • a0490be8f1 [CI][amd] Revert NIXL connector change to avoid crash (#32570) qli88 2026-01-19 12:39:16 -06:00
  • cd3ac5b797 support dynamic resolution image encoding for Nemotron Nano VL (#32121) Netanel Haber 2026-01-19 20:15:58 +02:00
  • 2636d76257 [Misc] Remove unused ModelKeys (#32608) Jee Jee Li 2026-01-20 01:34:59 +08:00
  • aa7f37ccfa Add support for LoRA adapters in Nemotron-H models (#30802) danisereb 2026-01-19 16:30:44 +02:00
  • c88860d759 [Frontend] Score entrypoint support data_1 & data_2 and queries & documents as inputs (#32577) wang.yuqi 2026-01-19 22:07:46 +08:00
  • 758df5afe7 [NIXL][Metrics] Track nixl_num_kv_expired_reqs metric in Prometheus (#32340) Nicolò Lucchesi 2026-01-19 13:28:27 +01:00
  • cdd03d25d3 [CI/Build] Fix dependency conflict between model-hosting-container-standards and starlette (#32560) Daniel Mescheder 2026-01-19 12:27:08 +01:00
  • 74c583bc50 [Core] Whisper support torch.compile (#30385) Nicolò Lucchesi 2026-01-19 11:02:31 +01:00
  • c0a350ca73 [ROCm][CI] Add ROCm attention backend support for EAGLE DP tests (#32363) Andreas Karatzas 2026-01-19 03:57:54 -06:00
  • 71832ba71e [GLM-4.7] GLM Model support for GLM-Lite (#31386) Yuxuan Zhang 2026-01-19 17:18:38 +08:00
  • 11bbf86f6a [CI][Hardware][AMD] Fix test_rotary_embedding_mla_cache_fused (#32408) Matt 2026-01-19 02:25:47 -06:00
  • 3c8740aacb [Frontend] Add render endpoints for prompt preprocessing (#32473) Hyunkyun Moon 2026-01-19 13:21:46 +09:00
  • 7518a3dc65 [CI/Build] Use Common Event Map Fixture in Harmony / MCP Server Tests (#32531) Alex Brooks 2026-01-18 21:05:51 -07:00
  • 976af2f314 [BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGeneration (#32462) honglyua 2026-01-19 11:06:02 +08:00
  • 9a1f16da1e [Model Runner V2] Refactor update_states (#32562) Woosuk Kwon 2026-01-18 17:32:42 -08:00
  • bb1848cd62 [Model Runner V2] Support VLM (#32546) Woosuk Kwon 2026-01-18 16:58:51 -08:00
  • 6101a26dc9 [BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInfer backend. Fixes issue #32353 (#32417) Vadim Gimpelson 2026-01-19 04:57:32 +04:00
  • f5d1740030 [Bugfix] Add OOT backend option (#32471) Iryna Boiko 2026-01-18 23:20:39 +01:00
  • eebc58df0c [Refactor] Remove unused cutlass moe problem size function (#32047) Wentao Ye 2026-01-18 15:46:59 -05:00
  • 16de822c71 [Refactor] Remove unused file pallas_kv_cache_update.py (#32433) Wentao Ye 2026-01-18 15:46:39 -05:00
  • 5480c6b1fa [Doc] Correct comment for _jobs dict in OffloadingConnectorWorker (#32556) Deming 2026-01-19 04:46:00 +08:00
  • ba29ab441e Use the same memory for workspace13 and fused_output. (#31531) Andrey Khalyavin 2026-01-18 22:14:22 +03:00
  • afc3622602 [CI] Move Distributed Tests from H200 -> H100 (#32555) Robert Shaw 2026-01-18 13:25:23 -05:00
  • 327a02d8db [MoE Refactor] Separate Router into OO Classes (#30623) bnellnm 2026-01-18 11:40:49 -05:00
  • 2f03035a61 "refactor: refactor_repeated_interfaces" (#32486) tjp_zju 2026-01-18 22:07:01 +08:00
  • 38bf2ffb21 [Bugfix] Fix GLM-ASR audio encoder RoPE dim (#32540) Isotr0py 2026-01-18 19:17:59 +08:00
  • c826c72a96 [Model] Support Step1 Model (#32511) Li Xie 2026-01-18 18:20:46 +08:00