Commit Graph

  • 127ded0a9e [Ultravox] Use wrapped_model_config to instantiate inner model (#24679) Peter Salas 2025-09-11 11:52:24 -07:00
  • bb2b5126da [VLM] Migrate remain DP-supported ViT models to use disable_tp (#24363) Isotr0py 2025-09-12 02:30:41 +08:00
  • 361ae27f8a [Docs] Fix formatting of transcription doc (#24676) Harry Mellor 2025-09-11 19:18:06 +01:00
  • e26fef8397 fix some typos (#24616) co63oc 2025-09-12 01:48:46 +08:00
  • c1eda615ba Fix model name included in responses (#24663) Harry Mellor 2025-09-11 18:47:51 +01:00
  • 4aa23892d6 [Bugfix] Fix platform-specific routing in CustomOp implementations (#24444) Konrad Zawora 2025-09-11 19:15:01 +02:00
  • 1fdd5c42d7 [Kernels] Enable Torch Symmetric Memory All-Reduce By Default (#24111) Ilya Markov 2025-09-11 18:45:31 +02:00
  • bcbe2a4d9e [VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames (#24161) Isotr0py 2025-09-12 00:44:34 +08:00
  • 51d41265ad [Docs] Fix typos in EP deployment doc (#24669) Harry Mellor 2025-09-11 17:07:23 +01:00
  • 4984a291d5 [Doc] Fix Markdown Pre-commit Error (#24670) Wentao Ye 2025-09-11 12:05:59 -04:00
  • 404c85ca72 [Docs] Add transcription support to model (#24664) Nicolò Lucchesi 2025-09-11 16:39:01 +02:00
  • 817beef7f3 [Bugifx] Fix qwen-next packed_modules_mapping (#24656) Jee Jee Li 2025-09-11 22:26:17 +08:00
  • 4f6593b058 [HybridKVCache][Platform] Add support_hybrid_kv_cache for platform (#24646) Mengqing Cao 2025-09-11 21:47:58 +08:00
  • 94e6b2d55f Allow users to specify kv cache memory size (#21489) Boyuan Feng 2025-09-11 06:41:07 -07:00
  • fd1ce98cdd [CI] Split mteb test from Language Models Test (#24634) wang.yuqi 2025-09-11 21:37:51 +08:00
  • d11ec124a0 [Bench] Add qwen-next in benchmark_moe.py (#24661) Jee Jee Li 2025-09-11 21:29:43 +08:00
  • f510715882 [build] add torch to tool.uv no-build-isolation-package (#24303) youkaichao 2025-09-11 21:19:44 +08:00
  • f946197473 [Docs] Fixes a typo in the qwen3next model name. (#24654) Tao He 2025-09-11 19:35:14 +08:00
  • 0cd72a7b72 [XPU] add missing dependency tblib for XPU CI (#24639) Fanli Lin 2025-09-11 19:22:33 +08:00
  • 5f5271f1ee Move LoRAConfig from config/__init__.py to config/lora.py (#24644) Harry Mellor 2025-09-11 12:01:38 +01:00
  • d6249d0699 Fix typing for safetensors_load_strategy (#24641) Harry Mellor 2025-09-11 11:41:39 +01:00
  • 25bb9e8c65 [CI Failure] fix models/language/pooling/test_auto_prefix_cache_support.py (#24636) wang.yuqi 2025-09-11 18:31:23 +08:00
  • a1213fae5f [Misc] Add @NickLucche to codeowners (#24647) Nicolò Lucchesi 2025-09-11 11:18:09 +02:00
  • a8b0361c92 [CI] Split pooling from entrypoints Test (#24632) wang.yuqi 2025-09-11 16:53:09 +08:00
  • ed5ae4aace [Bugfix] Fix _synced_weight_loader (#24565) Kyuyeun Kim 2025-09-11 01:52:33 -07:00
  • 0fc36463e0 [CI]Add transformers_utils to Async Engine, Inputs, Utils, Worker Test (#24615) Xingyu Liu 2025-09-11 01:52:10 -07:00
  • d14c4ebf08 [Docs] Use 1-2-3 list for deploy steps in deployment/frameworks/ (#24633) Michael Yao 2025-09-11 16:50:12 +08:00
  • ba6011027d [Docs] Update V1 doc to reflect whisper support (#24606) Russell Bryant 2025-09-11 04:50:08 -04:00
  • 85df8afdae [Docs] Revise frameworks/anything-llm.md (#24489) Michael Yao 2025-09-11 16:50:05 +08:00
  • 6aeb1dab4a [Bugfix] Fix incorrect import of CacheConfig (#24631) Cyrus Leung 2025-09-11 16:48:25 +08:00
  • e93f4cc9e3 Add the support for the qwen3 next model (a hybrid attention model). (#24526) Tao He 2025-09-11 15:32:09 +08:00
  • 2048c4e379 [torchao] Support quantization configs using module swap (#21982) Jerry Zhang 2025-09-10 23:53:24 -07:00
  • d13360183a Remove redundant all gather + split (#23441) Chenxi Yang 2025-09-10 23:45:07 -07:00
  • 9bd831f501 [Model] New model support for Motif-1-Tiny (#23414) TaehyunKim 2025-09-11 15:29:40 +09:00
  • e2b1f863aa [Doc]: fixing doc typos (#24635) Didier Durand 2025-09-11 08:19:28 +02:00
  • 41329a0ff9 [Core] feat: Add --safetensors-load-strategy flag for faster safetensors loading from Lustre (#24469) shengshiqi-google 2025-09-11 06:10:01 +00:00
  • ee0bc5e1b4 Enable --profile in 'vllm bench throughput' (#24575) Tomas Ruiz 2025-09-11 08:06:19 +02:00
  • 3d1393f6fc Kimi K2 Fused MoE kernels Optimization configs (#24597) Saman A. Pour 2025-09-10 23:06:16 -07:00
  • 8a894084d2 [Engine][Chore] use local variable and remove output var assignment (#24554) Guy Stone 2025-09-11 02:05:42 -04:00
  • e2d8c27f68 [BugFix] Fix pipeline parallel (#24621) Nick Hill 2025-09-10 23:05:30 -07:00
  • 29799ddacc [Bugfix] Add missing VIT backend dispatch on CPU (#24623) Li, Jiang 2025-09-11 13:28:41 +08:00
  • f17a6aa4ec [Ultravox] Fix Gemma instantiation, support quantization via --hf-overrides (#24131) Peter Salas 2025-09-10 22:25:34 -07:00
  • 6c8deacd72 [Bug] [Spec Decode] Fix model_initialization test and mismatch in aux_hidden_layers (#24613) Wenlong Wang 2025-09-10 21:23:18 -07:00
  • 55b823ba0f Add @chaunceyjiang to codeowner for reasoning Reasoning and Tool parser (#24406) Chauncey 2025-09-11 12:23:04 +08:00
  • 8c5a747246 [distributed] update known issues (#24624) youkaichao 2025-09-11 11:09:38 +08:00
  • 5931b7e5d9 [Models][Quantization] Add quantization configuration update in Voxtral model (#24122) Alexandre Marques 2025-09-10 22:13:56 -04:00
  • cc99baf14d [Misc] Make timeout passable in init_distributed_environment (#24522) Jonathan Berkhahn 2025-09-10 15:41:12 -07:00
  • dcb28a332b [Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration (#21078) Hanjie Qiu 2025-09-10 18:31:10 -04:00
  • fba7856581 [Perf] Warmup FlashInfer attention during startup (#23439) Michael Goin 2025-09-10 18:03:17 -04:00
  • b5e383cd8b [gpt-oss] raise error for flashinfer backend without trtllm (#24482) Chen Zhang 2025-09-10 14:33:13 -07:00
  • 9a161307f5 [torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends (#19767) Gregory Shtrasberg 2025-09-10 16:59:55 -04:00
  • 37e8182bfe [v1] Add Whisper model support (encoder-decoder) (#21088) Russell Bryant 2025-09-10 16:53:35 -04:00
  • 4db4426404 [CI] Fail subprocess tests with root-cause error (#23795) Nick Hill 2025-09-10 13:53:21 -07:00
  • a0933c3bd6 [Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs (#24577) Thien Tran 2025-09-11 03:33:41 +08:00
  • 09e68bce34 [Misc] update log level debug to warning when process port is used by (#24226) rongfu.leng 2025-09-11 02:32:57 +08:00
  • 9fb74c27a7 [Core] Support configuration parsing plugin (#24277) Xingyu Liu 2025-09-10 11:32:43 -07:00
  • 4032949630 [Bugfix] Fix DeepEP config for DP4TP4 (#23619) Ming Yang 2025-09-10 10:37:56 -07:00
  • 08abfa78ec [Bugfix] fix modelopt exclude_modules name mapping (#24178) tomeras91 2025-09-10 20:20:46 +03:00
  • 2bef2d1405 [Logging] allow config logging stream (#24336) Shiyan Deng 2025-09-10 08:02:01 -07:00
  • 36cacd0958 [Doc] Add documentation for GLM-4.5 series models: tool-calling and reasoning parser (#24589) Robin 2025-09-10 22:50:55 +08:00
  • bb3eb80d92 [Core] Split LoRA layers (#24574) Jee Jee Li 2025-09-10 22:47:51 +08:00
  • fcc0a3130a [CI] Fix tensorizer test assertion (#24545) pwschuurman 2025-09-10 06:57:36 -07:00
  • 736569da8d [Platform] Custom ops support for LMhead and LogitsProcessor (#23564) zzhxxx 2025-09-10 21:26:31 +08:00
  • 2eb9986a2d [BugFix] python collect_env.py and vllm collect-env compatibility with uv venv (#24066) Kay Yan 2025-09-10 21:25:33 +08:00
  • ccee371e86 [Docs] Fix warnings in mkdocs build (continued) (#24092) Hyogeun Oh (오효근) 2025-09-10 22:23:28 +09:00
  • c0bd6a684a Fix Auto_Round Quatization Loading on SM75 and Lower GPUs (#24217) RoadToNowhereX 2025-09-10 23:22:31 +10:00
  • 3144d90217 fix some typos (#24167) co63oc 2025-09-10 21:21:23 +08:00
  • 2f5e5c18de [CI/Build] bump timm dependency (#24189) Daniele 2025-09-10 15:20:59 +02:00
  • bd98842c8a [CI] Add PPL test for generation models (#24485) wang.yuqi 2025-09-10 21:16:39 +08:00
  • d6069887c6 [rocm] enable torchao quantization for rocm (#24400) Lifans 2025-09-10 06:16:21 -07:00
  • 492196ed0e [CI/Build] split true unit tests to Entrypoints Unit Tests (#24418) Ye (Charlotte) Qi 2025-09-10 06:16:07 -07:00
  • f4f1a8df22 [BugFix] Ensure integrity of reused CPU tensors during async scheduling (#24527) Nick Hill 2025-09-10 06:15:14 -07:00
  • 0b9a612fa3 [BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat (#24549) lacora 2025-09-10 06:14:55 -07:00
  • 4c04eef706 [BugFix][Multi Modal] Fix TensorSchema shape mismatch in Molmo (#24559) Wenlong Wang 2025-09-10 06:14:27 -07:00
  • f36355abfd Move LoadConfig from config/__init__.py to config/load.py (#24566) Harry Mellor 2025-09-10 14:14:18 +01:00
  • 9e3c3a7df2 [LoRA]: Add LoRA support to Mistral's Voxtral models (#24517) Yash Pratap Singh 2025-09-10 18:42:03 +05:30
  • 6cbd41909e Feature/vit attention unification# 23880 (#23978) baonudesifeizhai 2025-09-10 09:10:14 -04:00
  • 72d30108a0 Support for NemotronH Nano VLM (#23644) danielafrimi 2025-09-10 16:10:06 +03:00
  • 8b83b93739 [Docs] Document the extra memory footprint overhead when using EPLB (#24537) Tyler Michael Smith 2025-09-10 09:09:49 -04:00
  • 9dbefd88e9 [Docs] Improve organisation of API Reference nav (#24569) Harry Mellor 2025-09-10 14:08:21 +01:00
  • 7c195d43da [ROCm][Bugfix] Fix Aiter RMSNorm (#23412) vllmellm 2025-09-10 21:08:03 +08:00
  • 0ae43dbf8c [Attention] add DCP support for FLASH_ATTN_MLA backend (#24453) Lucas Wilkinson 2025-09-10 05:19:26 -04:00
  • 267c80d31f [Model] Limit CPU threads for image transformations in InternVL to reduce cpu contention. (#24519) li-jinpeng 2025-09-10 16:45:44 +08:00
  • 77f62613f9 Consolidate rendering parameters into RenderConfig dataclass (#24543) Flora Feng 2025-09-10 01:44:47 -07:00
  • feaf202e93 [Bugfix] Guard _may_reorder_batch for encoder-only models on CPU (#24319) (#24348) Remy 2025-09-10 15:24:42 +09:00
  • 91130ae376 [docs] promo pytorch conf and ray summit (#24562) Simon Mo 2025-09-09 23:24:20 -07:00
  • e40827280b [Docs] Enable relative links in examples to function when rendered in the docs (#24041) Harry Mellor 2025-09-10 05:40:45 +01:00
  • 4377b1ae3b [Bugfix] Update Run:AI Model Streamer Loading Integration (#23845) pwschuurman 2025-09-09 21:37:17 -07:00
  • 009d689b0c [Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing. (#24271) Chenheli Hua 2025-09-09 21:36:09 -07:00
  • 0efdb5c3ba [gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading (#24154) Wei 2025-09-09 21:27:53 -07:00
  • 53b42f4102 [BugFix][Spec Decode] Fix out-of-range index triggered by eagle3; re-enable test for LlamaForCausalLMEagle3 (#24392) Wenlong Wang 2025-09-09 21:24:23 -07:00
  • 309d7aa401 [P/D] MultiConnector supports shutdown (#24425) Chauncey 2025-09-10 12:24:11 +08:00
  • b4a01aaf95 [KV Connector] More async support for get_num_new_matched_tokens (#23620) Yihua Cheng 2025-09-09 21:23:37 -07:00
  • 83dd28aae4 [CI] Adjust threshold for flaky ngram spec decoding test (#24528) Nick Hill 2025-09-09 21:07:33 -07:00
  • f88e84016f [BugFix] Fix async core engine client finalizer (#24540) Nick Hill 2025-09-09 21:07:13 -07:00
  • 3c2156b3af [Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later) (#24129) Ignacio Sica 2025-09-10 00:50:21 -03:00
  • 7e7db04310 [CI] Retry flaky fp8 cutlass mla tests (#24536) Nick Hill 2025-09-09 20:33:10 -07:00
  • 41f160b974 Add @heheda12345 to CODEOWNERS of KVCacheManager related code (#24546) Chen Zhang 2025-09-09 20:30:32 -07:00
  • dc625ea6b8 [Perf] Convert np array to torch tensor to index into block table for attn chunking (#24474) Yong Hoon Shin 2025-09-09 20:01:06 -07:00
  • b23fb78623 [Bugfix] Fix for 24530. Fix naive all2all shared expert overlap. (#24538) bnellnm 2025-09-09 20:53:53 -04:00