Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

904063907c [Misc] fix openai version (#22485) rongfu.leng 2025-08-08 16:12:54 +08:00
43c4f3d77c [Misc] Begin deprecation of get_tensor_model_*_group (#22494) Cyrus Leung 2025-08-08 16:11:54 +08:00
1712543df6 [CI/Build] Fix multimodal tests (#22491) Cyrus Leung 2025-08-08 15:31:19 +08:00
808a7b69df [bench] Fix benchmark/serve.py to ignore unavailable results (#22382) lkchen 2025-08-07 23:15:50 -07:00
099c046463 [Doc] Sleep mode documentation (#22310) iAmir97 2025-08-08 11:25:18 +07:00
af473f0a85 [bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10 (#22426) Po-Han Huang (NVIDIA) 2025-08-08 11:25:01 +08:00
157f9c1368 Fix pre-commit (#22487) Cyrus Leung 2025-08-08 11:21:54 +08:00
6f287915d8 Optimize MiniCPMO mask creation with vectorized implementation (#22464) ZiTian Zhao 2025-08-08 11:18:50 +08:00
c152e2a8a0 not tie_word_embeddings for glm-4.5 and glm-4.5v (#22460) Yuxuan Zhang 2025-08-08 10:37:23 +08:00
17eaaef595 [Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match (#22065) Chauncey 2025-08-08 10:20:21 +08:00
3303f134e0 [Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO 6000) (#22131) Junhao Li 2025-08-07 22:18:28 -04:00
b2c8ce57c6 Fix Flashinfer CUTLASS MOE Allgather (#21963) Shu Wang 2025-08-07 21:18:25 -05:00
a3b9c17b56 Support Tensorrt-LLM MoE fp4 for low-latency (#21331) Shu Wang 2025-08-07 21:18:22 -05:00
d57dc2364e Add ModelOpt Qwen3 nvfp4 support (#20101) Zhiyu 2025-08-07 19:18:19 -07:00
e2c8f1edec [PERF] Use pybase64 to more quickly decode prompt embeddings (#22469) Andrew Sansom 2025-08-07 21:15:32 -05:00
1ee5ead5f8 [ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine (#21496) TJian 2025-08-07 19:13:17 -07:00
acf8aeb79e [Misc] normalize multiprocessing Queue usage (#22371) Ning Xie 2025-08-08 09:57:27 +08:00
eacd50d31b add comments back yewentao256 2025-08-07 15:24:36 -07:00
f07e10e9bc refactor quant folder yewentao256 2025-08-07 15:05:05 -07:00
7e3a8dc906 Remove from_dict from SpeculativeConfig (#22451) Harry Mellor 2025-08-07 18:13:04 +01:00
139d155781 [Frontend] Use engine argument to control MM cache size (#22441) Cyrus Leung 2025-08-08 00:47:10 +08:00
8c9da6be22 [Core] Simplify mm processing cache (#22457) Cyrus Leung 2025-08-08 00:47:07 +08:00
399d2a10e2 Fix pre-commit error in main (#22462) Woosuk Kwon 2025-08-07 08:54:39 -07:00
4815b00f54 [gpt-oss] Generate ResponseOutputItem from Harmony Message (#22410) Chen Zhang 2025-08-07 08:33:25 -07:00
4da8bf20d0 [Tool] Fix auto tool call (#22434) Chen Zhang 2025-08-07 07:03:38 -07:00
7e0b121812 [Bugfix] Add missing packed_modules_mapping to DeepseekV2ForCausalLM (#22352) fxmarty-amd 2025-08-07 15:30:48 +02:00
766bc8162c [Core] Store only the keys for multi-modal data in P0 (#22198) Cyrus Leung 2025-08-07 16:45:04 +08:00
289b18e670 [Docs] Update features/disagg_prefill, add v1 examples and development (#22165) WeiQing Chen 2025-08-07 15:59:23 +08:00
35171b1172 [Doc] update docs for nightly benchmarks (#12022) Andrew Chan 2025-08-07 00:29:45 -07:00
a2c6696bfe [Docs] Factor out troubleshooting to its own guide; add section for Ray Observability (#21578) Ricardo Decal 2025-08-07 00:29:13 -07:00
5e8398805e [Doc] Fix link to prefix caching design (#22384) Yong Hoon Shin 2025-08-07 00:28:15 -07:00
136825de75 [Misc] Enhance code formatting in mxfp4.py (#22423) Woosuk Kwon 2025-08-07 00:26:24 -07:00
c2dba2dba8 Add H20-3e fused MoE kernel tuning configs for GLM-4.5 (#22433) JaceyShao 2025-08-07 15:24:47 +08:00
434d2f3f7a [Docs] Add missing dependency for docs build (#22435) Harry Mellor 2025-08-07 08:22:07 +01:00
8e8e0b6af1 feat: Add --enable-log-outputs flag for logging model generations (#20707) Adrián García García 2025-08-07 10:10:13 +04:00
82216dc21f [Misc] Support routing logic simulation (#21990) Ming Yang 2025-08-06 23:06:20 -07:00
370661856b [Frontend] Update OpenAI error response to upstream format (#22099) Moritz Sanft 2025-08-07 08:06:00 +02:00
cbc8457b26 [Model] Switch to Fused RMS norm in Qwen2.5_VL model. (#22184) vllmellm 2025-08-07 14:05:24 +08:00
4d4297e8fe [Bench] Split serve.py:main into async/async versions (#22405) lkchen 2025-08-06 23:05:07 -07:00
2a4c825523 [CI] Skip the pooling models that do not support transformers v4.55 (#22411) wang.yuqi 2025-08-07 14:05:03 +08:00
4be02a3776 [Bugfix] EPLB load statistics problem (#22167) WeiQing Chen 2025-08-07 12:07:54 +08:00
f6278b6243 [gpt-oss] Convert user input to harmony format (#22402) Chen Zhang 2025-08-06 20:56:02 -07:00
ad6c655dde preload heavy modules when mp method is forkserver (#22214) Lionel Villard 2025-08-06 23:33:24 -04:00
14bcf93a6a Optimize logger init performance by using module-level constants (#22373) ZiTian.Zhao 2025-08-07 11:32:19 +08:00
ecbea55ca2 Update hf_xet pin to resolve hangs (#22356) Harry Mellor 2025-08-07 04:31:41 +01:00
609b533cb6 [Bugfix] Add proper comparison for package versions (#22314) Syed Muhammad Bin Asif 2025-08-07 11:31:03 +08:00
5e9455ae8f [Bugfix]: Fix the streaming output for function calls in the minimax (#22015) qscqesze 2025-08-07 11:30:27 +08:00
a00d8b236f Use float32 for test_completion.py (#22385) Michael Goin 2025-08-06 23:07:47 -04:00
04cf435d95 [Bugfix] Fix wrong method name in Intern-S1 image processor (#22417) Cyrus Leung 2025-08-07 11:05:20 +08:00
7377131a2c [Qwen3] Enable dual-chunk-attention support for Qwen3 models. (#21924) Tao He 2025-08-07 10:58:08 +08:00
6b47ef24de [XPU]Fix flash_attn_varlen_func interface on xpu (#22350) Kunshang Ji 2025-08-07 10:28:11 +08:00
1dc8a70b6d [Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix (#21588) Lucas Wilkinson 2025-08-06 21:40:52 -04:00
f825c6bd22 Support encoder_only attention for FlexAttention (#22273) Maximilien de Bayser 2025-08-06 22:37:14 -03:00
41b67f4263 [model] Support MiniCPM-V 4.0 (#22166) tc-mb 2025-08-07 09:35:46 +08:00
e8961e963a Update flashinfer-python==0.2.10 (#22389) Michael Goin 2025-08-06 21:10:24 -04:00
9a3835aaa9 Fix trtllm-gen attention env and add attention sink (#22378) Lain 2025-08-06 18:07:41 -07:00
5c7cc33f4d [gpt-oss] fix model config with hf_config (#22401) Yongye Zhu 2025-08-06 18:04:04 -07:00
19c9365aa4 [gpt-oss] add demo tool server (#22393) Chen Zhang 2025-08-06 17:47:14 -07:00
eec890c1c1 [Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue (#22399) Wentao Ye 2025-08-06 20:03:53 -04:00
46a13949d5 [v1] - Mamba1 Attention Metadata (#21249) Asaf Joseph Gardin 2025-08-07 03:03:42 +03:00
31f09c615f [gpt-oss] flashinfer mxfp4 (#22339) Yongye Zhu 2025-08-06 12:37:27 -07:00
31f5dc5b2a [gpt-oss] Enhance error msg on attention sink init (#22335) Yongye Zhu 2025-08-06 11:41:42 -07:00
ec7cb19224 [gpt-oss] Add loop for built-in tool call (#22374) Woosuk Kwon 2025-08-06 10:32:21 -07:00
2435ea7ed5 [Bugfix] Make condition in triton kernel constexpr (#22370) Gregory Shtrasberg 2025-08-06 13:00:58 -04:00
4a6b72c2ab [BugFix] Fix triton compile error in kernel_unified_attention_2/3d caused by attention sinks (#22368) Lucas Wilkinson 2025-08-06 12:47:38 -04:00
b4b9813b5e add the codes to check AMD Instinct GPU number (#22367) Zhang Jason 2025-08-06 23:58:38 +08:00
2cb6ef8996 [BugFix] Fix FA2 RuntimeError when sinks is provided (#22365) Lucas Wilkinson 2025-08-06 11:03:03 -04:00
9edd1db02b [Minor] Fix type (#22347) Woosuk Kwon 2025-08-06 02:22:03 -07:00
f263a4b53f [gpt-oss] Support chat completion api (#22342) Woosuk Kwon 2025-08-06 01:57:39 -07:00
54991c548a [gpt-oss] add model to supported models doc (#22336) Roger Wang 2025-08-06 01:49:44 -07:00
178d03fbd6 [gpt-oss] Add Tool/ConversationContext classes and harmony_utils (#22340) Woosuk Kwon 2025-08-06 01:08:49 -07:00
fa00c5d75b [Misc] Clean up duplicated hf overrides (#22311) Isotr0py 2025-08-06 15:50:25 +08:00
134a8ee8fd [gpt-oss] Add openai-harmony as default dependency (#22332) Woosuk Kwon 2025-08-06 00:10:14 -07:00
90ec006937 [gpt-oss] flashinfer attention sink init (#22330) Yongye Zhu 2025-08-05 23:48:19 -07:00
a47e6ffe93 [GptOss] Add GptOss reasoning parser to support structure output (#22322) Chen Zhang 2025-08-05 23:39:13 -07:00
98a3a81024 [ROCm] Add attention sink to use_rocm_custom_paged_attention (#22329) Woosuk Kwon 2025-08-05 23:30:38 -07:00
de98252f49 Add GPT-OSS model code and config [1/N] (#22327) Woosuk Kwon 2025-08-05 23:26:00 -07:00
796bae07c5 Update transformers to v4.55 (#21931) Harry Mellor 2025-08-06 06:56:14 +01:00
6e20924350 Add attention sink in attention backends (#22320) Woosuk Kwon 2025-08-05 22:37:21 -07:00
dd16bdc798 Increase openai-python version (#22316) Woosuk Kwon 2025-08-05 21:43:21 -07:00
e3c876dca3 Upgrade FA3 for attention sink (#22313) Woosuk Kwon 2025-08-05 21:36:21 -07:00
5d5d419ca6 [Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm (#22264) Gregory Shtrasberg 2025-08-05 23:39:32 -04:00
302962e806 [Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation (#22275) Rui Qiao 2025-08-05 20:35:32 -07:00
7e6544c797 [Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding (#21862) Benjamin Chislett 2025-08-05 22:57:49 -04:00
8e6c7e873f [Bugfix] Fix MoE BNB version (#22260) Jee Jee Li 2025-08-06 10:56:22 +08:00
6a51530437 [Bugfix] Fix 3D input passed into cutlass_scaled_mm (#22278) Michael Goin 2025-08-05 22:35:20 -04:00
35509fc5be [Bugfix] Remove faulty test for oot attention backend (#22286) Michael Goin 2025-08-05 20:05:40 -04:00
4b29d2784b [CI][TPU] Fix docker clean up (#22271) Siyuan Liu 2025-08-05 16:54:56 -07:00
59a0b8554b [bugfix] fix blackwell deepep installation (#22255) youkaichao 2025-08-06 01:26:09 +08:00
469b3ffaaa [V1] port xformers backend to v1 (#21342) Giancarlo Delfin 2025-08-05 10:04:46 -07:00
ae87ddd040 [Refactor] Remove Unused Environment Variable VLLM_NO_DEPRECATION_WARNING (#22199) Wentao Ye 2025-08-05 12:40:23 -04:00
a7cb6101ca [CI/Build] Update flashinfer to 0.2.9 (#22233) Michael Goin 2025-08-05 12:39:38 -04:00
c494f96fbc Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail (#22128) Michael Goin 2025-08-05 09:57:10 -04:00
0c275ad5ad [V0 Deprecation][TPU] Remove V1 flag check from tests (#22248) Nicolò Lucchesi 2025-08-05 15:53:23 +02:00
74333ae2f6 [Misc] correct static type check for GroupCoordinator (#21946) Ning Xie 2025-08-05 18:17:46 +08:00
83156c7b89 [NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel (#22095) elvischenv 2025-08-05 17:45:34 +08:00
4771df7b2b [Feature] Non-contiguous Support for FP8 Quantization (#21961) Wentao Ye 2025-08-05 05:36:43 -04:00
05fae02175 Migrate KimiVLImagePixelInputs to TensorSchema (#21769) Benji Beck 2025-08-05 02:36:18 -07:00
d1bf1b9711 [Docs][TPU] Highlight TPU Software version selection (#22242) Nicolò Lucchesi 2025-08-05 11:33:46 +02:00
586f286789 [Model] Pooling model activation supports per request control by PoolingParams (#20538) wang.yuqi 2025-08-05 15:37:00 +08:00
