Commit Graph

  • 904063907c [Misc] fix openai version (#22485) rongfu.leng 2025-08-08 16:12:54 +08:00
  • 43c4f3d77c [Misc] Begin deprecation of get_tensor_model_*_group (#22494) Cyrus Leung 2025-08-08 16:11:54 +08:00
  • 1712543df6 [CI/Build] Fix multimodal tests (#22491) Cyrus Leung 2025-08-08 15:31:19 +08:00
  • 808a7b69df [bench] Fix benchmark/serve.py to ignore unavailable results (#22382) lkchen 2025-08-07 23:15:50 -07:00
  • 099c046463 [Doc] Sleep mode documentation (#22310) iAmir97 2025-08-08 11:25:18 +07:00
  • af473f0a85 [bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10 (#22426) Po-Han Huang (NVIDIA) 2025-08-08 11:25:01 +08:00
  • 157f9c1368 Fix pre-commit (#22487) Cyrus Leung 2025-08-08 11:21:54 +08:00
  • 6f287915d8 Optimize MiniCPMO mask creation with vectorized implementation (#22464) ZiTian Zhao 2025-08-08 11:18:50 +08:00
  • c152e2a8a0 not tie_word_embeddings for glm-4.5 and glm-4.5v (#22460) Yuxuan Zhang 2025-08-08 10:37:23 +08:00
  • 17eaaef595 [Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match (#22065) Chauncey 2025-08-08 10:20:21 +08:00
  • 3303f134e0 [Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO 6000) (#22131) Junhao Li 2025-08-07 22:18:28 -04:00
  • b2c8ce57c6 Fix Flashinfer CUTLASS MOE Allgather (#21963) Shu Wang 2025-08-07 21:18:25 -05:00
  • a3b9c17b56 Support Tensorrt-LLM MoE fp4 for low-latency (#21331) Shu Wang 2025-08-07 21:18:22 -05:00
  • d57dc2364e Add ModelOpt Qwen3 nvfp4 support (#20101) Zhiyu 2025-08-07 19:18:19 -07:00
  • e2c8f1edec [PERF] Use pybase64 to more quickly decode prompt embeddings (#22469) Andrew Sansom 2025-08-07 21:15:32 -05:00
  • 1ee5ead5f8 [ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine (#21496) TJian 2025-08-07 19:13:17 -07:00
  • acf8aeb79e [Misc] normalize multiprocessing Queue usage (#22371) Ning Xie 2025-08-08 09:57:27 +08:00
  • eacd50d31b add comments back yewentao256 2025-08-07 15:24:36 -07:00
  • f07e10e9bc refactor quant folder yewentao256 2025-08-07 15:05:05 -07:00
  • 7e3a8dc906 Remove from_dict from SpeculativeConfig (#22451) Harry Mellor 2025-08-07 18:13:04 +01:00
  • 139d155781 [Frontend] Use engine argument to control MM cache size (#22441) Cyrus Leung 2025-08-08 00:47:10 +08:00
  • 8c9da6be22 [Core] Simplify mm processing cache (#22457) Cyrus Leung 2025-08-08 00:47:07 +08:00
  • 399d2a10e2 Fix pre-commit error in main (#22462) Woosuk Kwon 2025-08-07 08:54:39 -07:00
  • 4815b00f54 [gpt-oss] Generate ResponseOutputItem from Harmony Message (#22410) Chen Zhang 2025-08-07 08:33:25 -07:00
  • 4da8bf20d0 [Tool] Fix auto tool call (#22434) Chen Zhang 2025-08-07 07:03:38 -07:00
  • 7e0b121812 [Bugfix] Add missing packed_modules_mapping to DeepseekV2ForCausalLM (#22352) fxmarty-amd 2025-08-07 15:30:48 +02:00
  • 766bc8162c [Core] Store only the keys for multi-modal data in P0 (#22198) Cyrus Leung 2025-08-07 16:45:04 +08:00
  • 289b18e670 [Docs] Update features/disagg_prefill, add v1 examples and development (#22165) WeiQing Chen 2025-08-07 15:59:23 +08:00
  • 35171b1172 [Doc] update docs for nightly benchmarks (#12022) Andrew Chan 2025-08-07 00:29:45 -07:00
  • a2c6696bfe [Docs] Factor out troubleshooting to its own guide; add section for Ray Observability (#21578) Ricardo Decal 2025-08-07 00:29:13 -07:00
  • 5e8398805e [Doc] Fix link to prefix caching design (#22384) Yong Hoon Shin 2025-08-07 00:28:15 -07:00
  • 136825de75 [Misc] Enhance code formatting in mxfp4.py (#22423) Woosuk Kwon 2025-08-07 00:26:24 -07:00
  • c2dba2dba8 Add H20-3e fused MoE kernel tuning configs for GLM-4.5 (#22433) JaceyShao 2025-08-07 15:24:47 +08:00
  • 434d2f3f7a [Docs] Add missing dependency for docs build (#22435) Harry Mellor 2025-08-07 08:22:07 +01:00
  • 8e8e0b6af1 feat: Add --enable-log-outputs flag for logging model generations (#20707) Adrián García García 2025-08-07 10:10:13 +04:00
  • 82216dc21f [Misc] Support routing logic simulation (#21990) Ming Yang 2025-08-06 23:06:20 -07:00
  • 370661856b [Frontend] Update OpenAI error response to upstream format (#22099) Moritz Sanft 2025-08-07 08:06:00 +02:00
  • cbc8457b26 [Model] Switch to Fused RMS norm in Qwen2.5_VL model. (#22184) vllmellm 2025-08-07 14:05:24 +08:00
  • 4d4297e8fe [Bench] Split serve.py:main into async/async versions (#22405) lkchen 2025-08-06 23:05:07 -07:00
  • 2a4c825523 [CI] Skip the pooling models that do not support transformers v4.55 (#22411) wang.yuqi 2025-08-07 14:05:03 +08:00
  • 4be02a3776 [Bugfix] EPLB load statistics problem (#22167) WeiQing Chen 2025-08-07 12:07:54 +08:00
  • f6278b6243 [gpt-oss] Convert user input to harmony format (#22402) Chen Zhang 2025-08-06 20:56:02 -07:00
  • ad6c655dde preload heavy modules when mp method is forkserver (#22214) Lionel Villard 2025-08-06 23:33:24 -04:00
  • 14bcf93a6a Optimize logger init performance by using module-level constants (#22373) ZiTian.Zhao 2025-08-07 11:32:19 +08:00
  • ecbea55ca2 Update hf_xet pin to resolve hangs (#22356) Harry Mellor 2025-08-07 04:31:41 +01:00
  • 609b533cb6 [Bugfix] Add proper comparison for package versions (#22314) Syed Muhammad Bin Asif 2025-08-07 11:31:03 +08:00
  • 5e9455ae8f [Bugfix]: Fix the streaming output for function calls in the minimax (#22015) qscqesze 2025-08-07 11:30:27 +08:00
  • a00d8b236f Use float32 for test_completion.py (#22385) Michael Goin 2025-08-06 23:07:47 -04:00
  • 04cf435d95 [Bugfix] Fix wrong method name in Intern-S1 image processor (#22417) Cyrus Leung 2025-08-07 11:05:20 +08:00
  • 7377131a2c [Qwen3] Enable dual-chunk-attention support for Qwen3 models. (#21924) Tao He 2025-08-07 10:58:08 +08:00
  • 6b47ef24de [XPU]Fix flash_attn_varlen_func interface on xpu (#22350) Kunshang Ji 2025-08-07 10:28:11 +08:00
  • 1dc8a70b6d [Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix (#21588) Lucas Wilkinson 2025-08-06 21:40:52 -04:00
  • f825c6bd22 Support encoder_only attention for FlexAttention (#22273) Maximilien de Bayser 2025-08-06 22:37:14 -03:00
  • 41b67f4263 [model] Support MiniCPM-V 4.0 (#22166) tc-mb 2025-08-07 09:35:46 +08:00
  • e8961e963a Update flashinfer-python==0.2.10 (#22389) Michael Goin 2025-08-06 21:10:24 -04:00
  • 9a3835aaa9 Fix trtllm-gen attention env and add attention sink (#22378) Lain 2025-08-06 18:07:41 -07:00
  • 5c7cc33f4d [gpt-oss] fix model config with hf_config (#22401) Yongye Zhu 2025-08-06 18:04:04 -07:00
  • 19c9365aa4 [gpt-oss] add demo tool server (#22393) Chen Zhang 2025-08-06 17:47:14 -07:00
  • eec890c1c1 [Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue (#22399) Wentao Ye 2025-08-06 20:03:53 -04:00
  • 46a13949d5 [v1] - Mamba1 Attention Metadata (#21249) Asaf Joseph Gardin 2025-08-07 03:03:42 +03:00
  • 31f09c615f [gpt-oss] flashinfer mxfp4 (#22339) Yongye Zhu 2025-08-06 12:37:27 -07:00
  • 31f5dc5b2a [gpt-oss] Enhance error msg on attention sink init (#22335) Yongye Zhu 2025-08-06 11:41:42 -07:00
  • ec7cb19224 [gpt-oss] Add loop for built-in tool call (#22374) Woosuk Kwon 2025-08-06 10:32:21 -07:00
  • 2435ea7ed5 [Bugfix] Make condition in triton kernel constexpr (#22370) Gregory Shtrasberg 2025-08-06 13:00:58 -04:00
  • 4a6b72c2ab [BugFix] Fix triton compile error in kernel_unified_attention_2/3d caused by attention sinks (#22368) Lucas Wilkinson 2025-08-06 12:47:38 -04:00
  • b4b9813b5e add the codes to check AMD Instinct GPU number (#22367) Zhang Jason 2025-08-06 23:58:38 +08:00
  • 2cb6ef8996 [BugFix] Fix FA2 RuntimeError when sinks is provided (#22365) Lucas Wilkinson 2025-08-06 11:03:03 -04:00
  • 9edd1db02b [Minor] Fix type (#22347) Woosuk Kwon 2025-08-06 02:22:03 -07:00
  • f263a4b53f [gpt-oss] Support chat completion api (#22342) Woosuk Kwon 2025-08-06 01:57:39 -07:00
  • 54991c548a [gpt-oss] add model to supported models doc (#22336) Roger Wang 2025-08-06 01:49:44 -07:00
  • 178d03fbd6 [gpt-oss] Add Tool/ConversationContext classes and harmony_utils (#22340) Woosuk Kwon 2025-08-06 01:08:49 -07:00
  • fa00c5d75b [Misc] Clean up duplicated hf overrides (#22311) Isotr0py 2025-08-06 15:50:25 +08:00
  • 134a8ee8fd [gpt-oss] Add openai-harmony as default dependency (#22332) Woosuk Kwon 2025-08-06 00:10:14 -07:00
  • 90ec006937 [gpt-oss] flashinfer attention sink init (#22330) Yongye Zhu 2025-08-05 23:48:19 -07:00
  • a47e6ffe93 [GptOss] Add GptOss reasoning parser to support structure output (#22322) Chen Zhang 2025-08-05 23:39:13 -07:00
  • 98a3a81024 [ROCm] Add attention sink to use_rocm_custom_paged_attention (#22329) Woosuk Kwon 2025-08-05 23:30:38 -07:00
  • de98252f49 Add GPT-OSS model code and config [1/N] (#22327) Woosuk Kwon 2025-08-05 23:26:00 -07:00
  • 796bae07c5 Update transformers to v4.55 (#21931) Harry Mellor 2025-08-06 06:56:14 +01:00
  • 6e20924350 Add attention sink in attention backends (#22320) Woosuk Kwon 2025-08-05 22:37:21 -07:00
  • dd16bdc798 Increase openai-python version (#22316) Woosuk Kwon 2025-08-05 21:43:21 -07:00
  • e3c876dca3 Upgrade FA3 for attention sink (#22313) Woosuk Kwon 2025-08-05 21:36:21 -07:00
  • 5d5d419ca6 [Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm (#22264) Gregory Shtrasberg 2025-08-05 23:39:32 -04:00
  • 302962e806 [Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation (#22275) Rui Qiao 2025-08-05 20:35:32 -07:00
  • 7e6544c797 [Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding (#21862) Benjamin Chislett 2025-08-05 22:57:49 -04:00
  • 8e6c7e873f [Bugfix] Fix MoE BNB version (#22260) Jee Jee Li 2025-08-06 10:56:22 +08:00
  • 6a51530437 [Bugfix] Fix 3D input passed into cutlass_scaled_mm (#22278) Michael Goin 2025-08-05 22:35:20 -04:00
  • 35509fc5be [Bugfix] Remove faulty test for oot attention backend (#22286) Michael Goin 2025-08-05 20:05:40 -04:00
  • 4b29d2784b [CI][TPU] Fix docker clean up (#22271) Siyuan Liu 2025-08-05 16:54:56 -07:00
  • 59a0b8554b [bugfix] fix blackwell deepep installation (#22255) youkaichao 2025-08-06 01:26:09 +08:00
  • 469b3ffaaa [V1] port xformers backend to v1 (#21342) Giancarlo Delfin 2025-08-05 10:04:46 -07:00
  • ae87ddd040 [Refactor] Remove Unused Environment Variable VLLM_NO_DEPRECATION_WARNING (#22199) Wentao Ye 2025-08-05 12:40:23 -04:00
  • a7cb6101ca [CI/Build] Update flashinfer to 0.2.9 (#22233) Michael Goin 2025-08-05 12:39:38 -04:00
  • c494f96fbc Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail (#22128) Michael Goin 2025-08-05 09:57:10 -04:00
  • 0c275ad5ad [V0 Deprecation][TPU] Remove V1 flag check from tests (#22248) Nicolò Lucchesi 2025-08-05 15:53:23 +02:00
  • 74333ae2f6 [Misc] correct static type check for GroupCoordinator (#21946) Ning Xie 2025-08-05 18:17:46 +08:00
  • 83156c7b89 [NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095) elvischenv 2025-08-05 17:45:34 +08:00
  • 4771df7b2b [Feature] Non-contiguous Support for FP8 Quantization (#21961) Wentao Ye 2025-08-05 05:36:43 -04:00
  • 05fae02175 Migrate KimiVLImagePixelInputs to TensorSchema (#21769) Benji Beck 2025-08-05 02:36:18 -07:00
  • d1bf1b9711 [Docs][TPU] Highlight TPU Software version selection (#22242) Nicolò Lucchesi 2025-08-05 11:33:46 +02:00
  • 586f286789 [Model] Pooling model activation supports per request control by PoolingParams (#20538) wang.yuqi 2025-08-05 15:37:00 +08:00