Commit Graph

  • f716a15372 Update KServe guide link in documentation (#29258) Yuan Tang 2025-11-24 09:40:05 -05:00
  • 2601f18a82 [EPLB] Optimize EPLB for Async Rearrange Experts (#22179) WeiQing Chen 2025-11-24 22:08:29 +08:00
  • 4de87866a8 [CPU][IBM Z] Fix BF16 support and vectorize math operations for s390x (#28926) R3hankhan 2025-11-24 17:38:09 +05:30
  • eca7a8fb59 [Doc]: fix typos in various files (#29230) Didier Durand 2025-11-24 12:10:48 +01:00
  • 8005e606bf [Bugfix][Rocm] Fix shared expert weight loading failure in DeepSeek-MTP (#27563) 杰兮 2025-11-24 18:16:52 +08:00
  • 68dfe28eae [Feature][Benchmark] add --link-vars can filter when serve_param equal bench_param (#28909) rongfu.leng 2025-11-24 18:02:28 +08:00
  • ed40d85929 [BugFix] Fix R-VL model loading error (#29299) Fanli Lin 2025-11-24 14:48:45 +08:00
  • 0ff70821c9 [Core] Deprecate xformers (#29262) Roger Wang 2025-11-23 20:18:55 -08:00
  • 5253f4276f [ROCm] Support for Whisper v1 with Aiter Unified Attention and Aiter Flash Attention (#28376) tongqiu 2025-11-24 11:26:00 +08:00
  • 30854783ad [Model] Add OpenCUA-7B support (#29068) Zero 2025-11-24 11:27:55 +09:00
  • 1073ba68b0 [LoRA] Optimize 3D MoE logic (#29222) Jee Jee Li 2025-11-24 10:27:23 +08:00
  • c309bb5245 [Bugfix] Update Gradio OpenAI Chatbot Webserver example to new Gradio message history format (#29249) Josh Moore 2025-11-23 19:47:54 -05:00
  • 3e1ad40655 [Model Runner V2] Add apply_temperature option to gumbel_sample (#29276) Woosuk Kwon 2025-11-23 14:13:00 -08:00
  • 62d54ba46d [Model Runner V2] Optimize CUDA graph capture time (#29275) Woosuk Kwon 2025-11-23 11:15:32 -08:00
  • b004c00418 [Model Runner V2] Support spec decoding [1/N] (#29274) Woosuk Kwon 2025-11-23 10:09:06 -08:00
  • 7f12c82fa6 [Model Runner V2] Change bookkeeping logic in preparation for spec decoding (#29194) Woosuk Kwon 2025-11-23 09:42:52 -08:00
  • 6fb0215eee [Bugfix] Use lazy string reference for DeepseekV3Config in config registry (#28958) Luke 2025-11-23 06:43:21 -05:00
  • 55c21c8836 [ROCm][CI] Fix "Cannot re-initialize CUDA in forked subprocess" in test_pynccl.py (#29119) Micah Williamson 2025-11-22 23:05:00 -06:00
  • 3999442f1c [CI/Build][AMD] Add check for flash_att_varlen_func to test_tree_attention.py (#29252) rasmith 2025-11-22 22:45:08 -06:00
  • 71362ffab4 [CI/Build][AMD] Skip test_multi_shared_storage_connector_consistency in test_multi_connector.py due to hipErrorLaunchFailure when calling .cpu() (#29253) rasmith 2025-11-22 22:42:49 -06:00
  • 20ee418adc [Model Runner V2] Minor fix for cudagraph_utils (#29256) Woosuk Kwon 2025-11-22 20:12:50 -08:00
  • 389aa1b2eb [Doc] Update more docs with respect to V1 (#29188) Cyrus Leung 2025-11-23 10:58:48 +08:00
  • 3ed767ec06 docs: fixes distributed executor backend config for multi-node vllm (#29173) Michael Act 2025-11-23 09:58:28 +07:00
  • 5f96c00c55 [Fix] Add SM check to flashinfer MOE backend (#29144) jiahanc 2025-11-22 16:39:30 -08:00
  • 4587063267 Patch DeepEP when building docker image with CUDA 13 (#29154) Qidong Su 2025-11-22 18:25:13 -05:00
  • 472fdee974 [Chore] Update batch invariant code owner (#29246) Wentao Ye 2025-11-22 16:50:02 -05:00
  • df78aeef08 Refactor: Move CUDA graph dispatch logic earlier (#27382) Yizhou 2025-11-23 05:10:31 +08:00
  • 7df331c66b [BugFix] Fix chunked prompt logprobs + preemption (#29071) Nick Hill 2025-11-22 13:07:18 -08:00
  • eb5352a770 [CI/build] Removes source compilation from runtime image (#26966) Benjamin Bartels 2025-11-22 18:23:09 +00:00
  • d1cf8214e5 [Bugfix] Use HF config fields as fallback when loading Mistral config (#29239) Cyrus Leung 2025-11-23 02:22:48 +08:00
  • 730bd35378 [perf][cpu] Accelerate paged attention GEMMs (QK, PV) on Arm CPUs with NEON (#29193) Fadi Arafeh 2025-11-22 17:04:36 +00:00
  • f55c76c2b3 chore: add RTX_PRO_6000 GLM4.6-FP8 kernel tuning (#29240) Federico 2025-11-22 17:42:48 +01:00
  • d84d8f4429 Fix EVS crash when using video_embeds inputs in Qwen2.5-VL (#29232) ZiTian Zhao 2025-11-22 22:48:59 +08:00
  • ae66818379 [Misc] Fix pre-commit (#29238) Cyrus Leung 2025-11-22 22:48:01 +08:00
  • d44a63c6d6 [BugFix] Fix returned logprobs with spec decode + prefill chunking (#29216) Nick Hill 2025-11-22 06:41:25 -08:00
  • 066209a045 [Attention] Refactor FA block_size limitations to hybrid models only (#29084) Nicolò Lucchesi 2025-11-22 15:38:44 +01:00
  • 5f7209a793 [tiny] Remove unsupported TRITON_MLA backend from batch invariance (#28832) Bram Wasti 2025-11-22 08:00:50 -05:00
  • 2d4978a57e fix: clean up function never use in setup.py (#29061) yihong 2025-11-22 21:00:04 +08:00
  • 6965a392a4 Fix: Resolve circular import in model_loader/utils.py (#29189) Nandan Vallamdasu 2025-11-22 18:28:22 +05:30
  • 5a4802588e [Misc] Further clean up chunked prefill and prefix caching init (#29186) Cyrus Leung 2025-11-22 19:34:15 +08:00
  • 8e22da1d7f [CI/Build Don't add FLASHINFER backend in test_cpu_offloading.py (#29229) rasmith 2025-11-22 05:00:54 -06:00
  • a4fdf2405c [CI/Build] Skip tests that require libcudart in test_lmcache_integration.py (#29228) rasmith 2025-11-22 04:59:39 -06:00
  • e6309acdba Simplify from_blob usage in get_cuda_view_from_cpu_tensor (#29027) Jane (Yuan) Xu 2025-11-22 05:35:32 -05:00
  • 988ee66b0d Handle triton kernel import exception (#29062) jinghanhu 2025-11-22 18:07:50 +08:00
  • ea38474ac5 [Frontend][Responses API] Multi-turn (with type: "output_text") support for non-harmony requests (#29175) Mads Kildegård 2025-11-22 10:58:22 +01:00
  • 742e9ff6b3 [responsesAPI] parse reasoning item input (#28248) Andrew Xia 2025-11-21 23:42:11 -08:00
  • e9056056fb [Model Runner V2] Limit cudagraph size to max decode batch size (#29221) Woosuk Kwon 2025-11-21 20:21:35 -08:00
  • 1489902b53 [LoRA] Cleanup FusedMoEWithLoRA (#29187) Jee Jee Li 2025-11-22 12:01:30 +08:00
  • 933f67ecd8 [Bugfix]Fix a conditional to not check zero value (#28754) Yanan Cao 2025-11-21 19:59:07 -08:00
  • fd65015a14 [CI/Build] Only use supported types and features on ROCm in MoE kernel tests (#29149) rasmith 2025-11-21 21:34:33 -06:00
  • 77e1c035d0 [chore][LMCache connector] Remove useless logs from lmcache connector (#29069) Yihua Cheng 2025-11-21 19:18:00 -08:00
  • 6f403501a0 [CI/Build][AMD] Enable Entrypoints Integration Test (Pooling) to run without error on ROCm (#29212) rasmith 2025-11-21 20:13:18 -06:00
  • 052950e5b3 Add fused MoE config for H200 E160 N192 fp8 (#29182) FlintyLemming 2025-11-22 09:37:51 +08:00
  • 1ef9c9e294 [CI/Build] Disable test_gptoss_tp.py in 'LoRA TP Test' group for ROCm platform (#29204) qli88 2025-11-21 19:36:19 -06:00
  • 5c8f2adf50 [Bugfix] Fix block size in block_table with PCP (#29094) Jie Luo 2025-11-22 09:34:28 +08:00
  • ed8e6843cc [CI/Build] Add terratorch for AMD (#29205) Ryan Rock 2025-11-21 19:31:22 -06:00
  • d045e22dfe [Model][Qwen3VL] Tune Triton w8a8 block fp8 kernel for L40s (#29217) Lukas Geiger 2025-11-22 01:30:55 +00:00
  • 1d34eb11e0 [CI] Bug: Fix triton import issue (#29202) Wentao Ye 2025-11-21 20:14:49 -05:00
  • 9a3101b2ba [Rocm][CI] Fix DeekSeek V2-Lite Accuracy CI (#29135) Charlie Fu 2025-11-21 19:11:02 -06:00
  • d5dbdbfcb2 [docs] Fix cudagraph mode config (#29170) Angela Yi 2025-11-21 17:10:27 -08:00
  • 30d6466238 [BugFix] Fix Eagle IndexError: list index out of range for even num_speculative_tokens (#29102) Lucas Wilkinson 2025-11-21 19:47:05 -05:00
  • e9af6ba62a [Model Runner V2] Optimize Gumbel Sampling Kernel (#29210) Woosuk Kwon 2025-11-21 15:52:28 -08:00
  • c6fa3895e9 [KV Connector] Fix async connector prefix cache metrics (#28585) Mark McLoughlin 2025-11-21 22:45:00 +00:00
  • 3137991f55 [BugFix] EPLB + B200 + DeepGEMM : Handle column-major scales tensor (#29162) Varun Sundar Rabindranath 2025-11-21 17:28:17 -05:00
  • 57430fc95c Default model load/config/tokenizer to mistral format if relevant files exist (#28659) Julien Denize 2025-11-21 22:58:59 +01:00
  • c68c7b403d [BugFix] Fix missing symbol triggering FA2 fallback on Hopper (#29107) Lucas Wilkinson 2025-11-21 16:58:32 -05:00
  • 53a1ba6ec5 [log] add weights loading time log to sharded_state loader (#28628) Ning Xie 2025-11-22 05:06:09 +08:00
  • 1840c5cb18 [BugFix] Make sure to allocate worst case MoE workspace during profile run in the DP + EP case (#27426) Lucas Wilkinson 2025-11-21 14:41:52 -05:00
  • 1bed891f72 [Chore] Fix pre-commit error after #25266 (#29190) Woosuk Kwon 2025-11-21 10:21:40 -08:00
  • ceca060501 [Deprecation] Deprecate seed=None (#29185) Cyrus Leung 2025-11-22 02:19:25 +08:00
  • 75648b16dd [ROCm][CI] Fix config/test_config_generation.py (#29142) Charlie Fu 2025-11-21 11:12:16 -06:00
  • 460d02a417 [NIXL] Fix after virtual block_size for host_buffer with heter kv_layout (#29122) Chendi.Xue 2025-11-21 10:55:27 -06:00
  • b4c8fbaae2 Add TRTLLM MoE NVFP4 kernel to CompressedTensorsW4A4MoeMethod (#28892) Mingyuan Ma 2025-11-21 08:54:11 -08:00
  • e99e467384 [CI/Build][Kernel][AMD] Move extra dim to after load in _fwd_kv_parallel in lighting_attn.py (#29132) rasmith 2025-11-21 10:53:09 -06:00
  • a42ab317ac [Log] Optimize startup log (#28948) Wentao Ye 2025-11-21 11:46:20 -05:00
  • b7f1f490a6 Upstream triton fp4 weight preshuffle (#28888) Aleksandr Malyshev 2025-11-21 08:34:46 -08:00
  • 30b44a1598 GPU Model Runner V2 (#25266) Woosuk Kwon 2025-11-21 08:20:55 -08:00
  • 1f400c58b8 [CI] Add batch invariant test to ci (#27842) Wentao Ye 2025-11-21 11:20:33 -05:00
  • 711241c13c [CI/Build] Fix illegal memory access and unsupported test in kernels/attention/test_cache.py (#29118) rasmith 2025-11-21 09:58:38 -06:00
  • d7219bcda3 [Misc] Move dynamic seed initialization to EngineArgs (#29165) Cyrus Leung 2025-11-21 23:27:44 +08:00
  • 4050bae417 [Doc] Update plugin doc (#28532) wangxiyuan 2025-11-21 22:57:26 +08:00
  • f1805db1a6 [Perf] These changes enhance the NUMA functionality of vllm for systems with more than one NUMA nodes per socket (#25559) skaraban3807 2025-11-21 19:43:52 +05:30
  • 434f3d3eb8 Fix mistral config (#29172) Julien Denize 2025-11-21 15:01:20 +01:00
  • 2092ce8c39 Tool Call Parser logs should not contain user input / model output except on DEBUG (#29160) sfbemerk 2025-11-21 13:57:19 +01:00
  • fc9f821d20 fix cross attention (#28346) who who who 2025-11-21 20:55:43 +08:00
  • 9452863088 Revert "Revert #28875 (#29159)" (#29179) Cyrus Leung 2025-11-21 20:27:43 +08:00
  • 2b1b3dfa4b Update Dockerfile to use gcc-toolset-14 and fix test case failures on power (ppc64le) (#28957) Bhagyashri 2025-11-21 17:54:09 +05:30
  • cca2d2cdbe [Core] Align whisper closer to other multimodal models (#27292) Russell Bryant 2025-11-21 07:01:54 -05:00
  • aab0102a26 [V0 deprecation] Remove more V0 references (#29088) Cyrus Leung 2025-11-21 19:56:59 +08:00
  • b34129bf8e [Misc] remove useless v1 env (#29164) WeiQing Chen 2025-11-21 17:41:20 +08:00
  • 4d7231e774 Revert #28875 (#29159) Cyrus Leung 2025-11-21 17:40:17 +08:00
  • 8ac3a41487 [CI Failure] Fix Gemma3 RoPE configuration for sliding attention layers (#29111) Huamin Li 2025-11-20 23:53:30 -08:00
  • 7d6da483b0 [Minor][Clean] Remove the legacy assertion in video (#29150) Canlin Guo 2025-11-21 15:52:34 +08:00
  • e4c3182c68 [Small] Capture AttributeError when checking ray dependency. (#29024) Chenheli Hua 2025-11-20 22:54:10 -08:00
  • b4734b9550 [Bugfix] Fix default MM LoRA alignment for single str prompts (#29140) Alex Brooks 2025-11-20 22:32:30 -07:00
  • 30b9c67743 Revert "[Redo] #26368 (#28771)" (#29121) Jialin Ouyang 2025-11-20 21:27:45 -08:00
  • 11857a00b0 [Attention] Add ROCM_AITER_MLA_SPARSE to attention backend registry (#29103) Matthew Bonanni 2025-11-20 23:24:43 -05:00
  • 8c25f9cfb6 [BugFix] skip combo kernel on cpu (#29129) Boyuan Feng 2025-11-20 19:50:59 -08:00
  • 56e96b37e4 [V0 Deprecation] Remove best_of (#29090) Cyrus Leung 2025-11-21 11:40:40 +08:00
  • 698024ecce [Doc] update installation guide regarding aarch64+cuda pytorch build (#28875) Qidong Su 2025-11-20 22:40:25 -05:00