Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

59bd5f6a71 [Feat] Enable eplb with default all2all backend (#30559) Wentao Ye 2025-12-16 10:33:52 -05:00
00a8d7628c [BugFix] Fix memory spike in workspace allocation (#30744) Lucas Wilkinson 2025-12-16 09:46:22 -05:00
4de08ad698 [CI/Build] Skip broken ViT backend functionality test tempoarily (#30782) Isotr0py 2025-12-16 22:45:25 +08:00
75eb302a2e [Bugfix] Whisper fix number of allocated CrossAttn blocks per-request (#30772) Nicolò Lucchesi 2025-12-16 15:20:19 +01:00
9dbbc59b15 [ROCm][MTP] Support MTP for AITER MLA backend (#28624) Pleaplusone 2025-12-16 22:10:26 +08:00
104003dc77 update piecewise cudagraph warning when splitting_ops=[] (#30728) Boyuan Feng 2025-12-16 06:09:34 -08:00
d0fb572929 [ROCm] [AITER] [DOC] Add usage description about check functions in _aiter_ops (#30586) TJian 2025-12-16 21:50:47 +08:00
6f15ac5de7 Don'e assume position_embedding_type will be present for BERT and RoBERTa models (#30770) Harry Mellor 2025-12-16 13:40:26 +00:00
676db55eec [Bugfix] Fix prefix_repetition routing in bench throughput (#29663) Junru Shen 2025-12-16 17:37:15 +08:00
0e391e7570 [Bugfix] Fix RequestOutput miss lora_request (#30636) Jee Jee Li 2025-12-16 17:36:35 +08:00
0d0c929f23 [responsesAPI][8] input/output messages for ResponsesParser (#30158) Andrew Xia 2025-12-16 13:54:59 +08:00
e94384bbad [Bugfix] Fix broken ViT attention selection for Blackwell device (#30731) Isotr0py 2025-12-16 13:24:32 +08:00
b9ff4f2a8d [feature] extend DBO to XBO (#30120) jiangkuaixue123 2025-12-16 13:04:01 +08:00
c881db364e improve lazy import test (#30733) Boyuan Feng 2025-12-15 19:12:05 -08:00
3bd9c49158 [CustomOp] Extract ApplyRotaryEmb as CustomOp and unify the dispatch logic (#29873) Shanshan Shen 2025-12-16 11:08:16 +08:00
ff21a0fc85 [docker] Restructure Dockerfile for more efficient and cache-friendly builds (#30626) Amr Mahdi 2025-12-16 04:52:19 +02:00
bbd850e597 [Bugfix] fix streaming final output for non harmony (#30237) penfree 2025-12-16 09:03:11 +08:00
511e81e7c9 [BUILD] use sm_100f when compiling flashmla to fix support on sm103 (#30705) Shengqi Chen 2025-12-16 06:48:01 +08:00
a182be4308 [UX][Attention] Add attention_config argument to LLM() (#30710) Matthew Bonanni 2025-12-15 17:29:09 -05:00
c01d589813 [Benchmarks] auto_tune.sh: Use hostname variable for server requests (#30529) Kevin Musgrave 2025-12-15 17:00:29 -05:00
60dbf7d8f1 Update batch invariant to use attention config (#30704) Matthew Bonanni 2025-12-15 15:24:16 -05:00
a450c64a30 [Bugfix] Fail instead of ignoring when CompilationConfig gets invalid args (#30708) Michael Goin 2025-12-15 15:18:02 -05:00
b2191abdca [docs][fix] Update Arm CPU vLLM wheel installation docs (#30594) Fadi Arafeh 2025-12-15 19:46:25 +00:00
51e5b3e3c4 [Bugfix] Fix ViT with FlashAttention on ROCm (#30703) Matthew Bonanni 2025-12-15 14:45:21 -05:00
ec154c36ee [Platform] Refactor Platform attention backend selection to avoid breakpoint for OOT platform (#30212) Isotr0py 2025-12-16 01:36:07 +08:00
970713d4a4 Remove SkipValidation from ModelConfig (#30695) Harry Mellor 2025-12-15 17:34:08 +00:00
17fec3af09 [Bugfix] Fix missing first token in tool calls during reasoning-to-tool transition (#30671) mondaylord 2025-12-16 00:13:37 +08:00
855b101d75 [Frontend] add tools for dsv32 developer role (#30040) yjc9696 2025-12-15 23:08:47 +08:00
d0502b4928 [MoE][Refactor 1/N] Separate Online Quantization (#30627) Robert Shaw 2025-12-15 09:54:53 -05:00
3f175f18a2 [Bugfix] Fix multimodal configuration for Qwen3VL MOE model (#30670) Max Hu 2025-12-15 22:06:01 +08:00
ed586e7724 [Refactor] [3/N] Move tool parser tests and run on CPU (#30693) Cyrus Leung 2025-12-15 21:45:36 +08:00
2a1776b7ac [Refactor] [2/N] Move tool parsers into the vLLM main directory (#30675) Chauncey 2025-12-15 20:54:52 +08:00
185c22bf2f [Misc][Hybrid allocator + kv connector] Optionally enable hybrid allocator + KV cache connector (#29805) Nicolò Lucchesi 2025-12-15 12:17:58 +01:00
e4806d973a [BugFix] Add embed_input_ids method to make QWenLMHeadModel a vllm model (#30674) duke 2025-12-15 18:38:29 +08:00
4429d934de [Model] Automatic conversion of TokenClassification model (#30666) wang.yuqi 2025-12-15 16:13:00 +08:00
33278073d6 typing: Add type hints to TurnMetrics class in context.py (#30552) ゆり 2025-12-15 16:00:39 +09:00
1adeb3b84c [New Model] BAGEL support (AR only) (#28439) 汪志鹏 2025-12-15 14:58:23 +08:00
e3a1cd1c59 [XPU] fix Dockerfile.xpu, avoid wheel conflicts (#30662) Kunshang Ji 2025-12-15 13:32:06 +08:00
3778673ea8 [Feat] Refactor for parallel_config in FusedMoEModularKernel (#30282) Wentao Ye 2025-12-14 23:21:36 -05:00
b337647aa0 [Bugfix] Drop empty tool_calls lists to keep assistant replies in chat template (#30648) Seokhyun An 2025-12-15 13:21:12 +09:00
a524d1ba0a [Bugfix] Fix deepseek_v32 tokenizer_mode (#30658) Jee Jee Li 2025-12-15 12:20:31 +08:00
87b4d1557d [CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. (#30125) Shanshan Shen 2025-12-15 11:13:32 +08:00
84e23d103d additional protection for CVE-2025-62164 (#30649) Wenqi Glantz 2025-12-14 22:07:10 -05:00
738648fb81 [CustomOp] Support object-level enable for CustomOp (#30547) Shanshan Shen 2025-12-15 11:02:09 +08:00
917fdae5b2 [Log] Skip piecewise cudagraph warn when using full cudagraph (#30657) Boyuan Feng 2025-12-14 18:49:45 -08:00
e2ed238885 Revert "[Fix]Load kv-cache dtype from hf_quant_config.json automatically" (#30653) Robert Shaw 2025-12-14 19:33:41 -05:00
174e39ead7 CPU KV Offloading: Use more CUDA streams (#29013) Or Ozeri 2025-12-15 01:50:45 +02:00
9ccbf6b692 [responsesAPI]add extra body parameters (#30532) RioS 2025-12-15 04:25:45 +09:00
ae2e503dda [NIXL][BUG FIX] Fix a bug for PD with host_buffer after merging 29665 (#30420) Chendi.Xue 2025-12-14 09:38:28 -06:00
9e33a1a75b [Model][Quantization] Override HF defaults to GGUF ones (incl. Qwen3 MoE) (#30118) Tsukasa OI 2025-12-15 00:01:42 +09:00
add4b0ca44 [Bugfix][benchmarks] Fix input token calculation for rerank benchmark metrics (#30596) Vensen 2025-12-14 22:57:15 +08:00
ae88aada38 [Feature]Add EVS (Efficient Video Sampling) Support for Qwen3-VL (#29752) ZiTian Zhao 2025-12-14 21:24:56 +08:00
5ccf0efa84 [Bugfix] Improve error messages in ModelConfig validation (#30213) yifant-code 2025-12-14 08:23:37 -05:00
994acec0cc [Bugfix] Fix fusion for VL models (#30244) ElizaWszola 2025-12-14 14:22:37 +01:00
48b8456ff9 [Bugfix] Revert Qwen2-VL part of change in #28271 (#30542) zifeitong 2025-12-14 05:20:08 -08:00
5b64ac21f9 [Bugfix] Update get_processor_data to use get_all method (#30583) Drew Botwinick 2025-12-14 07:19:20 -06:00
a8ec486592 [Misc] Add a script to benchmark compilation time (#29919) Bin Bao 2025-12-14 08:02:39 -05:00
6ecc1e411b [Bugfix] fix _get_quant_method of FusedMoE for deepseekV3.2 on non-NV… (#30057) tjp_zju 2025-12-14 18:20:51 +08:00
0bb0bae436 Nvidia ModelOpt workaround for issue 28072 (#30164) Shengliang Xu 2025-12-14 02:18:31 -08:00
060893654d fix: Update json features supported by xGrammar (#30390) Johannes F 2025-12-14 11:16:06 +01:00
e9add129ad [Bugfix] awq_gemm: fix argument order swap (#30364) Matthias Gehre 2025-12-14 11:15:37 +01:00
3224ea9915 [torch.compile] Add encoder tag for compilation (#30489) Ilya Markov 2025-12-14 11:15:11 +01:00
3a20450d31 Add AudioFlamingo3 model support (#30539) Lasha Koroshinadze 2025-12-14 05:14:55 -05:00
1a55cfafcb [Doc]: fixing typos in various files (#30540) Didier Durand 2025-12-14 11:14:37 +01:00
add1b9d3de [main][BugFix] Fixed an accuracy bug of Qwen3-next-MTP when batched inferring (#30632) drslark 2025-12-14 17:32:16 +08:00
dcb31196da [Chore] Remove redundant RequestPrompt (#30612) Cyrus Leung 2025-12-14 17:22:37 +08:00
f569c654e1 enable unbacked with aot_compile (#30462) Laith Sakka 2025-12-14 11:14:06 +03:00
97f2f160fd [ROCm][CI] Add "Qwen3-Next-80B-A3B-Instruct MTP Async EPLB Accuracy Test" Back Into AMD CI (#30590) Micah Williamson 2025-12-14 00:56:26 -06:00
29f7d97715 Improve parse_raw_prompt test cases for invalid input .v2 (#30512) Kayvan Mivehnejad 2025-12-13 22:18:41 -05:00
dc7fb5bebe [Bug][KVConnector][Metrics] Remove a vacuous assertion breaking external-launcher (#30577) Qier Li 2025-12-13 20:23:08 -05:00
24429d5924 [Doc] Add instructions for building docker image on GB300 with CUDA13 (#30414) Qidong Su 2025-12-13 16:56:53 -05:00
6e78ed6ba7 [Logs] Optimize startup logs 4 (#29903) Wentao Ye 2025-12-13 16:12:53 -05:00
7c16f3fbcc [Doc] Add documents for multi-node distributed serving with MP backend (#30509) Isotr0py 2025-12-14 02:02:29 +08:00
ddbfbe5278 [Docs] Clarify Expert Parallel behavior for attention and MoE layers (#30615) lif 2025-12-14 01:37:59 +08:00
763963aa73 set assume_32bit_indexing and pass unbacked hints (#30459) Laith Sakka 2025-12-13 18:36:53 +03:00
39cefbdf17 [Refactor] TokenizerRegistry only uses lazy imports (#30609) Cyrus Leung 2025-12-13 23:16:22 +08:00
ace34e3783 [Bugfix] Qwen3-next with --hf-overrides \{\"num_hidden_layers\":8\} (#30433) Chen Zhang 2025-12-13 06:12:45 -08:00
e5db3e2774 [CI/Build] Fix broken mm processor test Mistral-3-large (#30597) Isotr0py 2025-12-13 20:43:01 +08:00
64251f48df [Chore] Adjust tokenizer import to avoid circular imports (#30601) Cyrus Leung 2025-12-13 20:42:39 +08:00
1cec5b7ea9 [Scheduer] Simplify stop checking for pooling models (#30591) Nick Hill 2025-12-13 01:45:26 -08:00
b09806e28f [Bugfix] Dictionary MM embeddings for online chat (#30507) Cyrus Leung 2025-12-13 15:48:56 +08:00
fdc135d768 [Misc][Quantization] Clarify the intent of GGUF FusedMoE weight materialization (#30310) Tsukasa OI 2025-12-13 14:55:14 +09:00
4fa7ce46f3 [Feature] Add SM103 (Blackwell Ultra) Support to vLLM (#30484) Roberto L. Castro 2025-12-13 04:34:23 +01:00
57e9bf1864 [CI] Whisper logprobs tests (#30504) Nicolò Lucchesi 2025-12-13 03:49:11 +01:00
2f32a68d75 [CI] Update several models in registry that are available online now (#30514) Michael Goin 2025-12-12 21:28:13 -05:00
f5dfbbd8e9 [Docs] Remove references to VLLM_ATTENTION_BACKEND (#30564) Matthew Bonanni 2025-12-12 21:20:15 -05:00
fc0119425c Add IBM and Red Hat to compute resources sponsors (#30581) Michael Goin 2025-12-12 20:34:23 -05:00
86a3261525 [Bugfix] Pass FA version in MultiHeadAttention (#30575) Matthew Bonanni 2025-12-12 19:02:11 -05:00
08f8a5627e [CI/Build][Kernel][BugFix][AMD] Fix per_token_group_quant_fp8 to use correct fp8 min/max values and update atol/rtol in test_quantfp8_group_functionality (#30292) rasmith 2025-12-12 17:41:56 -06:00
b4039c08b5 [ci] Mark PrimeRL integration test as soft fail (#30578) Kevin H. Luu 2025-12-12 14:13:09 -08:00
1e6b115300 [Refactor] Reduce duplicate code in per_token_group_quant cuda kernels (#30496) Wentao Ye 2025-12-12 16:45:23 -05:00
13618626df [MoE-FP8-modelopt] Add FlashInfer alignment padding for intermediate dimensions (#29748) danielafrimi 2025-12-12 22:42:32 +02:00
6ec0d8dbe4 [Fix]Load kv-cache dtype from hf_quant_config.json automatically (#29980) danielafrimi 2025-12-12 21:27:47 +02:00
9693dd0fe3 [CI/Build] Add x86 CPU wheel release pipeline (#28848) Li, Jiang 2025-12-13 03:21:35 +08:00
1f19d8f899 [Perf] Set split_k to 1 for triton_kernels (#30528) Xin Yang 2025-12-12 11:07:57 -08:00
cd7740ac5c [ROCm] Enable Triton ScaledMM fallback + kernel selection fix (#26668) shivampr 2025-12-12 10:28:20 -08:00
02a5880394 [CI] Fix mypy for vllm/v1/executor (#30517) Wentao Ye 2025-12-12 13:05:34 -05:00
d2c919dcc2 [bugfix] fix bug when top_logprobs=0 with spec decoding (#30059) realliujiaxu 2025-12-13 01:03:35 +08:00
f3237f3f6b [Frontend] Fixes anthropic streaming message_start usage nesting (#30266) Benjamin Bartels 2025-12-12 16:28:54 +00:00
9c0ee995a8 [Kernel] Support CUDA Graphs in 3D Triton Attention Kernel (#28306) jvlunteren 2025-12-12 16:55:40 +01:00
