Commit Graph

  • 326e7c3105 [Doc] Add Sophgo TPU Support (#30949) wzyrrr 2025-12-19 00:29:33 +08:00
  • 0db5439ded [Bugfix][torch2.10] Fix test_qwen2_5_vl_compilation with 2.10 RC (#30822) Lucas Kabela 2025-12-18 08:23:31 -08:00
  • 28d15ab56b adds jais 2 support (#30188) sarathc-cerebras 2025-12-18 21:16:58 +05:30
  • 6628758233 [Bug] Fix batch invariant in torch 2.10 (#30907) Wentao Ye 2025-12-18 10:27:51 -05:00
  • eee600c34f [Misc] support nsys profile for bench latency (#29776) zhrrr 2025-12-18 22:52:20 +08:00
  • 100f93d2be Filter safetensors files to download if .safetensors.index.json exists (#30537) Michael Goin 2025-12-18 09:51:17 -05:00
  • 96bf50a2c0 [ROCm] Serving Fails on Radeon Due to AITER Dtype Import (#30952) vllmellm 2025-12-18 19:47:46 +08:00
  • f90d3636e2 [Bugfix][CPU] Fix Mac CPU build (#30955) Li, Jiang 2025-12-18 17:38:22 +08:00
  • 8372be2828 [moe] Use enable_chunking func (to support disabling chunking) (#29935) Ming Yang 2025-12-18 01:02:38 -08:00
  • 8da6ae49c3 [ROCm][Bugfix] Fix fa_version argument error in flash_attn_maxseqlen_wrapper for ROCm without aiter (#30909) Andreas Karatzas 2025-12-18 02:45:51 -06:00
  • 2c0ee0fde8 [BugFix] Partial revert of #29558 (DeepEP HT + PIECEWISE CG support) (#30910) Lucas Wilkinson 2025-12-18 02:50:15 -05:00
  • 30bb19a760 [BugFix] Partial revert of #29558 (DeepEP HT + PIECEWISE CG support) (#30910) Lucas Wilkinson 2025-12-18 02:50:15 -05:00
  • aa7e836055 [Bugfix] Fix Unicode issues in GLM-4 tool calling (#30920) Chauncey 2025-12-18 15:12:17 +08:00
  • be2ad5f920 [ROCm][Bugfix] fix(structured_output): Skip guidance backend for schemas with patternProperties (#30730) Andreas Karatzas 2025-12-18 01:04:57 -06:00
  • a85724bd6e [Platform] Let EPD work with non-cuda platform (#30225) wangxiyuan 2025-12-18 14:45:29 +08:00
  • 11a89cf95c [Fix][FlexAttention] return max logical block index to handle reused blocks (#30915) Yifan Qiao 2025-12-17 22:42:21 -08:00
  • e3ab93c896 [CPU] Refactor CPU fused MOE (#30531) Li, Jiang 2025-12-18 14:36:49 +08:00
  • fc2ae6d617 fix: add warmup for audio preprocessing (#30706) Nathan Price 2025-12-18 00:12:29 -06:00
  • ec965569d9 [KV connector][LMCache] Only record the cuda event when there are request to store/load (#30814) Yihua Cheng 2025-12-17 21:31:34 -08:00
  • 82dc338ad6 [AMD][CI] fix lm eval ci arg (#30911) Divakar Verma 2025-12-17 23:18:26 -06:00
  • 717ac33d9c [PERF] Qwen3-next. Add fp8 cutlass MoE tuned configs. chmod -x *MI308X.json (#29553) Vadim Gimpelson 2025-12-18 09:16:04 +04:00
  • cfb7e55515 [Doc][CPU] Update CPU doc (#30765) Li, Jiang 2025-12-18 12:59:09 +08:00
  • b166ef20e1 [refactor] Add prefix support to embed_tokens in DeepSeek MTP (#30788) zzhxxx 2025-12-18 12:45:56 +08:00
  • 5f2f3fba1d [compile] Fix CI for test_gpt2_cache_hit (#30902) Zhengxu Chen 2025-12-17 23:22:23 -05:00
  • 4a8412f773 [UX] Reduce DeepGEMM warmup log output to single progress bar (#30903) Matthew Bonanni 2025-12-17 23:21:51 -05:00
  • 0c738b58bc [Quantization] Support Quark int4-fp8 w4a8 for MoE (#30071) Bowen Bao 2025-12-17 20:20:42 -08:00
  • 55f1fc1b1b [v1] Add PrefixLM support to TritonAttention backend (#30386) (tag: v0.13.0rc4) Isotr0py 2025-12-18 08:05:24 +08:00
  • 17f3988094 [BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM (#30899) Varun Sundar Rabindranath 2025-12-17 18:00:59 -05:00
  • 682c38583c [CI][Bugfix] Fix flaky tests/entrypoints/openai/test_audio.py::test_chat_streaming_audio (#30878) Nicolò Lucchesi 2025-12-17 18:49:56 +01:00
  • 5a3adf581e fused_moe_lora PDL improvements (#30716) gnovack 2025-12-17 19:55:00 -08:00
  • 6fe5887652 [Chore] Remove v0 dead code for Qwen2.5-omni (#30883) Isotr0py 2025-12-18 11:54:39 +08:00
  • bc3700e0cd [NIXL] Support P tensor-parallel-size > D tensor-parallel-size (#27274) Nicolò Lucchesi 2025-12-18 04:53:30 +01:00
  • fd8afdf38d [ROCm][CI] Reduce Flakiness For test_async_scheduling Using ROCM_ATTN With FP32 (#30811) Micah Williamson 2025-12-17 20:27:37 -06:00
  • a0b782f9cc [Metrics] Model FLOPs Utilization estimation (#30738) SungMinCho 2025-12-17 17:40:51 -08:00
  • ed2897f336 [CI][Feature] Adds auto-rebase PR rule (#30875) Rafael Vasquez 2025-12-17 19:46:44 -05:00
  • 74a1ac38b0 [v1] Add PrefixLM support to TritonAttention backend (#30386) Isotr0py 2025-12-18 08:05:24 +08:00
  • 05a83dc6ee feat(api): Eager chat template warmup to eliminate first-request latency (#30700) Nathan Price 2025-12-17 18:01:29 -06:00
  • e3fc374a9a [BugFix] Workspace allocation during profile run : DeepEPHighThroughput + DeepGEMM (#30899) Varun Sundar Rabindranath 2025-12-17 18:00:59 -05:00
  • e06d0bf0aa 2.9.1 PyTorch release update (#28495) Andrey Talman 2025-12-17 15:20:22 -05:00
  • e3a0f21e6c [docs]: add ecosystem projects sr in docs/governance (#30844) Xunzhuo 2025-12-18 02:45:56 +08:00
  • 7eb6cb6c18 [Attention] Update tests to remove deprecated env vars (#30563) Matthew Bonanni 2025-12-17 12:49:59 -05:00
  • 9ca8cb38fd [CI][Bugfix] Fix flaky tests/entrypoints/openai/test_audio.py::test_chat_streaming_audio (#30878) Nicolò Lucchesi 2025-12-17 18:49:56 +01:00
  • 2497228ad4 [Chore] Factor out logic for requesting initial memory (#30868) Cyrus Leung 2025-12-17 23:32:17 +08:00
  • 196cdc3224 [Model] Gemma3: Support untied word embeddings (#30827) KimHyemin 2025-12-18 00:11:18 +09:00
  • b7b6a60aca Adapt the old parameter enable_thinking in chat_template_kwargs (#30852) 高鑫崧 2025-12-17 23:10:59 +08:00
  • 9e67c4ce98 [Docs] fix function name (#30748) rongfu.leng 2025-12-17 20:14:45 +08:00
  • 6e9dbcc50e [Fix] uniform decode batch check (#30747) Jialin Ouyang 2025-12-17 03:58:43 -08:00
  • 6482e3895b chores: adjust the attn register param order (#30688) Hank_ 2025-12-17 19:58:16 +08:00
  • fb980eb2fd Fix lazy import (#30858) Harry Mellor 2025-12-17 11:33:50 +00:00
  • 84896fda22 [Bugfix] deepseek-V3.2 self.weights_proj has no bias (#30841) baoqian426 2025-12-17 19:32:34 +08:00
  • 4bf6c23668 [ci] Sync test areas yaml file with test-pipeline (#30862) Kevin H. Luu 2025-12-17 02:30:56 -08:00
  • 9ad5b21710 [Refactor] [4/N] Move VLLM_SERVER_DEV endpoints into the serve directory (#30749) Chauncey 2025-12-17 18:27:30 +08:00
  • f284d7bd0c [Bug] Fix AttributeError: 'ColumnParallelLinear' object has no attribute weight_scale_inv (#30823) Wentao Ye 2025-12-17 05:00:35 -05:00
  • 53cd7f868b [compile] Recompile graph module during Dynamo cache loading. (#30743) Zhengxu Chen 2025-12-17 05:00:12 -05:00
  • 7b966ae2ba [Fix]Load kv-cache dtype from hf_quant_config.json automatically (fix for reverted PR) (#30785) danielafrimi 2025-12-17 11:56:38 +02:00
  • 9db1db5949 [compile] Ignore VLLM_FORCE_AOT_LOAD from cache factors (#30809) Zhengxu Chen 2025-12-17 04:56:24 -05:00
  • 177c391db2 [compile] Disable aot when eager backend is used. (#30810) Zhengxu Chen 2025-12-17 04:55:56 -05:00
  • 519ef9a911 [UX] Make vllm bench serve discover model by default and use --input-len (#30816) Michael Goin 2025-12-17 04:55:30 -05:00
  • a100152288 [Kernels][FI] Skip trtllm attention when num_kv_heads=1 (#30842) Ye (Charlotte) Qi 2025-12-17 01:54:21 -08:00
  • 4c054d89aa [Doc][ResponsesAPI] add documentation (#30840) Andrew Xia 2025-12-17 17:53:02 +08:00
  • f4e884f222 [NIXL][Bugfix] Fix NIXL/RDMA registration failure over CuMemAllocator (#29569) Sheng Lin 2025-12-17 17:52:58 +08:00
  • 3b1d440ede CustomOp: grouped topk (#29575) Xinyu Chen 2025-12-17 17:43:00 +08:00
  • a9e15c21ef [Mamba] Removed disable cascade attn in MambaModelConfig (#30712) Asaf Joseph Gardin 2025-12-17 10:48:53 +02:00
  • 20fda43151 [Bugfix][Frontend] Prevent IndexError in MiniMax M2 tool parser during streaming extraction (#30555) Robin 2025-12-17 16:37:57 +08:00
  • f124b56786 [XPU] fix broken fp8 online quantization for XPU platform (#30831) (tag: v0.13.0rc3) Yan Ma 2025-12-17 16:28:13 +08:00
  • 4f735babb7 [XPU] fix broken fp8 online quantization for XPU platform (#30831) Yan Ma 2025-12-17 16:28:13 +08:00
  • d78e128b8b [Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models (#30829) Li, Jiang 2025-12-17 15:25:12 +08:00
  • 761b730dcb [BugFix] Fix memory spike in workspace allocation (#30744) Lucas Wilkinson 2025-12-16 09:46:22 -05:00
  • 0cd5353644 [Bugfix][CPU] Fix CPU backend ROPE dispatch for VL models (#30829) Li, Jiang 2025-12-17 15:25:12 +08:00
  • d4d2751732 Update note comment for flashinfer attention warmup (#30711) Michael Goin 2025-12-17 00:29:03 -05:00
  • 009a773828 bump up compressed tensors version to 0.13.0 (#30799) shanjiaz 2025-12-17 00:01:04 -05:00
  • 44d3b1df3d [CI/Build] Fix compatibility between #30244 and #30396 (#30787) Cyrus Leung 2025-12-17 12:21:19 +08:00
  • bb5ac1fe38 [CPU] Add action to automatically label CPU related PRs (#30678) Fadi Arafeh 2025-12-17 04:21:07 +00:00
  • 811cdf5197 Update model-hosting-container-standards to 0.1.10 (#30815) Michael Goin 2025-12-16 20:52:14 -05:00
  • f34eca5f01 [ROCm] [Bugfix] Fix torch sdpa hallucination (#30789) (tag: v0.13.0rc2) TJian 2025-12-17 07:32:43 +08:00
  • 4cd332f3cf [CI] Skip ci failure test (#30804) Wentao Ye 2025-12-16 17:47:53 -05:00
  • 16484d394c [Core][MM] Optimize encoder cache manager by operating with embeddings only (#30475) Roger Wang 2025-12-16 14:18:17 -08:00
  • e397bd6592 [CI/Build] Skip broken ViT backend functionality test temporarily (#30782) Isotr0py 2025-12-16 22:45:25 +08:00
  • 6a88d590bb [Bugfix] Fix broken ViT attention selection for Blackwell device (#30731) Isotr0py 2025-12-16 13:24:32 +08:00
  • ad8c073131 [CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic (#29873) Shanshan Shen 2025-12-16 11:08:16 +08:00
  • f5db6385a1 Fix nemotron_nas intermediate_size computation (#30795) Grzegorz K. Karch 2025-12-17 02:06:28 +01:00
  • c0a88df7f7 [docker] Allow kv_connectors install to fail on arm64 (#30806) Amr Mahdi 2025-12-17 02:41:57 +02:00
  • e087fbc393 [MM] Pass FA version in ViT Attn (#30756) Nicolò Lucchesi 2025-12-17 00:54:45 +01:00
  • e80455ca8b Replace deprecated enable_fusion with fuse_norm_quant in test_rms_group_quant (#30817) Michael Goin 2025-12-16 18:40:47 -05:00
  • 2410132bb1 [ROCm] [Bugfix] Fix torch sdpa hallucination (#30789) TJian 2025-12-17 07:32:43 +08:00
  • 0a1ab1e565 [Perf][Kernels] Vectorize csrc/activations_kernels.cu (#29512) Michael Goin 2025-12-16 17:56:02 -05:00
  • b6ec077e05 [CI] Skip ci failure test (#30804) Wentao Ye 2025-12-16 17:47:53 -05:00
  • ce96857fdd [Kernel][Quantization][MoE] add marlin kernel support for turing (sm75) (#29901) Jinzhen Lin 2025-12-17 06:35:28 +08:00
  • eaa82a709a [Bugfix][DSV32] Fix overflow in topk. (#30754) Daniel Cámpora 2025-12-16 23:21:17 +01:00
  • f5f51e5931 [Core][MM] Optimize encoder cache manager by operating with embeddings only (#30475) Roger Wang 2025-12-16 14:18:17 -08:00
  • 9fec0e13d5 [Attention] Cache attention metadata builds across hybrid KV-cache groups (#29627) Lucas Wilkinson 2025-12-16 17:10:16 -05:00
  • 254a7f8fd6 [Perf] Do FP4 quant before All gather on flashinfer trtllmgen MOE (#30014) jiahanc 2025-12-16 13:01:48 -08:00
  • f21f5ea38c [Refactor] Small refactor for group topk (#30562) Wentao Ye 2025-12-16 14:50:59 -05:00
  • ca702a14dc [Frontend] Add max-completion-token option to transcription/translation endpoints (#30769) Nicolò Lucchesi 2025-12-16 20:36:49 +01:00
  • 10ee1c64cf [CI] Generalize gsm8k test args and add Qwen3-Next MTP B200 test (#30723) Michael Goin 2025-12-16 14:28:34 -05:00
  • 66c3537e5d [Docs][API] Remove warning about LoRARequest being internal-only (#30774) Mark McLoughlin 2025-12-16 16:35:46 +00:00
  • e1625498f4 Update where bytes_to_unicode is imported from (#30771) Harry Mellor 2025-12-16 16:05:01 +00:00
  • 0b0acc758e Remove head_mask from Ultravox and Swin (#30764) Harry Mellor 2025-12-16 16:02:41 +00:00
  • af506fd76a Fix instantiation of HfHubHTTPError in LoRA test (#30768) Harry Mellor 2025-12-16 16:02:24 +00:00
  • ce12b407f2 [TRTLLM] Remove the MoE GEMM weight name change (#30713) Ming Yang 2025-12-16 08:01:38 -08:00