bf214ca226
[Misc] Fix examples openai_pooling_client.py (#24853 )
wang.yuqi
2025-09-15 19:57:30 +08:00
2e41f5abca
[XPU] Set consistent default KV cache layout (#24745 )
Nicolò Lucchesi
2025-09-15 12:09:34 +02:00
bc0f6059a2
[UT] enhance free kv cache block queue popleft_n (#24220 )
Ning Xie
2025-09-15 18:04:37 +08:00
8de261b04a
[P/D]kv_output_aggregator support P TP > D TP (#23917 )
Chao Lei
2025-09-15 17:36:06 +08:00
a0d8b9738d
[Misc] Own KVConnectors installation (#24867 )
Nicolò Lucchesi
2025-09-15 11:21:09 +02:00
59e17dd4a0
[Misc] rename interval to max_recent_requests (#24229 )
Ning Xie
2025-09-15 17:18:42 +08:00
4979eb79da
[Doc]: fix typos in various files (#24821 )
Didier Durand
2025-09-15 10:08:52 +02:00
a8c0f59973
[Bugfix] MiDashengLM model contact error under concurrent testing (#24738 )
bingchen-mi
2025-09-15 14:38:12 +08:00
f4a948f33f
[Frontend] Skip stop in reasoning content (#14550 )
Ce Gao
2025-09-15 14:04:55 +08:00
3f3313981c
[kv cache] update num_free_blocks in the end (#24228 )
Ning Xie
2025-09-15 13:15:12 +08:00
78818dd1b0
[Docs] Have a try to improve frameworks/streamlit.md (#24841 )
Michael Yao
2025-09-15 12:50:36 +08:00
8e5cdcda4e
[Hybrid Allocator] Support Pipeline Parallel (#23974 )
Chen Zhang
2025-09-14 15:55:17 -07:00
90f3f7d73e
[Spec Decoding]Support Spec Decoding Metrics in DP Mode (#24049 )
wuhang
2025-09-15 05:11:09 +08:00
6dc8da5dc1
[Chore] Remove ipex_ops warning (#24835 )
Robert Shaw
2025-09-14 15:41:53 -04:00
79cbcab871
Force use C++17 globally to avoid compilation error (#24823 )
FengjinChen
2025-09-15 03:30:10 +08:00
ff68035932
[Benchmarks] Throw usage error when using dataset-name random and dataset-path together (#24819 )
Ye (Charlotte) Qi
2025-09-14 10:50:01 -07:00
1177dd53e9
fix type of sampling rate for encode_base64 (#24826 )
co63oc
2025-09-15 00:17:16 +08:00
fc2dbcda8b
[Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement (#24783 )
Wentao Ye
2025-09-14 11:20:17 -04:00
fec347dee1
[Misc] Improve s3_utils type hints with BaseClient (#24825 )
Hyogeun Oh (오효근)
2025-09-14 21:11:14 +09:00
cc3173ae98
[Multi Modal][Performance] Fused Q,K's apply_rope into one (#24511 )
Wenlong Wang
2025-09-14 01:10:21 -07:00
3e903b6cb4
[Chore] Minor simplification for non-PP path (#24810 )
Woosuk Kwon
2025-09-13 17:41:36 -07:00
973c9d01da
[Minor] Simplify duplicative device check for cuda (#24793 )
Victor Ziliang Peng
2025-09-13 11:28:38 -07:00
26b999c71a
[CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe (#24750 )
Michael Goin
2025-09-13 03:29:19 -04:00
e925187f6d
Merge branch 'main' into wye-refactor-quant-folder
ci/build/22474
yewentao256
2025-09-13 07:38:47 -07:00
15b8fef453
Remove redundant assignment in xfer_buffers, This is a little fix (#24732 )
TaoYu Chen
2025-09-13 16:11:59 +08:00
cfa3234a5b
[CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again (#24771 )
Wenlong Wang
2025-09-13 00:45:11 -07:00
41ae4a1eab
[Doc]: fix typos in various files (#24798 )
Didier Durand
2025-09-13 09:43:33 +02:00
4dad72f0d9
[Misc] Correct an outdated comment. (#24765 )
Russell Bryant
2025-09-13 03:34:53 -04:00
59d7ffc17f
[CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe (#24750 )
Michael Goin
2025-09-13 03:29:19 -04:00
1da0f1441d
[Core][Multimodal] Cache supports_kw (#24773 )
Lukas Geiger
2025-09-13 08:27:04 +01:00
98229db244
[Kernels][DP/EP] Optimize Silu Kernel for R1 (#24054 )
Elvir Crnčević
2025-09-13 09:17:27 +02:00
dbeee3844c
[Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization (#24757 )
elvischenv
2025-09-13 15:16:24 +08:00
30498f2a65
[Doc]: Remove 404 hyperlinks (#24785 )
Rakesh Asapanna
2025-09-13 12:45:41 +05:30
abc7989adc
[Docs] Remove Neuron install doc as backend no longer exists (#24396 )
Harry Mellor
2025-09-13 08:15:03 +01:00
9a8966bcc2
[Docs] Fix warnings in mkdocs build (continued) (#24791 )
Hyogeun Oh (오효근)
2025-09-13 16:13:44 +09:00
5febdc8750
[Chore] Remove unused batched RoPE op & kernel (#24789 )
Woosuk Kwon
2025-09-13 00:08:20 -07:00
99bfef841f
[Bugfix] Fix GPUModelRunner has no attribute lora_manager (#24762 )
Jee Jee Li
2025-09-13 14:55:14 +08:00
da3fa78dc9
[Compilation Bug] Fix Inductor Graph Output with Shape Issue (#24772 )
v0.10.2rc3
Wentao Ye
2025-09-12 17:23:05 -04:00
bbb70036cb
Enable conversion of multimodal models to pooling tasks (#24451 )
Maximilien de Bayser
2025-09-12 00:30:41 -03:00
89da8d9d09
[Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660 ) (#24667 )
Tao He
2025-09-13 06:31:32 +08:00
01085b134d
[Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP (#24739 )
Elvir Crnčević
2025-09-12 16:54:04 +02:00
66160a9943
[BugFix] Fix Qwen3-Next PP (#24709 )
Nick Hill
2025-09-11 23:35:04 -07:00
eaca762c18
[Qwen3-Next] MoE configs for H20 TP=1,2,4,8 (#24707 )
Jee Jee Li
2025-09-12 10:06:26 +08:00
89e08d6d18
[Model] Add Olmo3 model implementation (#24534 )
Shane A
2025-09-12 20:26:21 -07:00
7f2ea7074e
[Frontend][Multimodal] Allow skipping media data when UUIDs are provided. (#23950 )
Chenheli Hua
2025-09-12 19:16:06 -07:00
4fdd6f5cbf
[Core] Support async scheduling with uniproc executor (#24219 )
Nick Hill
2025-09-12 16:34:28 -07:00
8226dd56bf
[Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660 ) (#24667 )
Tao He
2025-09-13 06:31:32 +08:00
5fe643fc26
Add FLASHINFER_MLA to backend selector test (#24753 )
Matthew Bonanni
2025-09-12 18:30:07 -04:00
7ba32aa60b
[Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode (#24705 )
Matthew Bonanni
2025-09-12 17:45:53 -04:00
c89ed8de43
Invert pattern order to make sure that out_proj layers are identified (#24781 )
Alexandre Marques
2025-09-12 17:45:29 -04:00
3beadc2f25
[Compilation Bug] Fix Inductor Graph Output with Shape Issue (#24772 )
Wentao Ye
2025-09-12 17:23:05 -04:00
bc636f21a6
[Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints (#23937 )
Clayton Coleman
2025-09-12 16:57:53 -04:00
017354c0ef
[CI] Trigger BC Linter when labels are added/removed (#24767 )
Zhewen Li
2025-09-12 11:44:36 -07:00
1e3e56abfc
Merge branch 'main' into wye-refactor-quant-folder
Wentao Ye
2025-09-12 14:17:56 -04:00
010acc6e1e
[Bugfix] Fix incompatibility between #20452 and #24548 (#24754 )
Cyrus Leung
2025-09-13 02:17:29 +08:00
c8c42597ab
[CI] Speed up model unit tests in CI (#24253 )
afeldman-nm
2025-09-12 13:36:50 -04:00
9d2a44606d
[UX] Remove AsyncLLM torch profiler disabled log (#24609 )
Michael Goin
2025-09-12 13:08:44 -04:00
f17c075884
[Model] Switch to Fused RMSNorm in GLM-4.1V model (#24733 )
Samit
2025-09-13 00:12:23 +08:00
b0d1213ac3
[Models] Prevent CUDA sync in Qwen2.5-VL (#24741 )
Lukas Geiger
2025-09-12 17:03:55 +01:00
57f94e88ea
[Models] Optimise and simplify _validate_and_reshape_mm_tensor (#24742 )
Lukas Geiger
2025-09-12 16:37:37 +01:00
684b6870e1
[Bugfix][Frontend] Fix --enable-log-outputs does not match the documentation (#24626 )
Kebe
2025-09-13 00:01:24 +09:00
1facf77094
Merge branch 'main' into wye-refactor-quant-folder
yewentao256
2025-09-12 08:00:41 -07:00
a5b84f1cbf
[Core] Shared memory based object store for Multimodal data caching and IPC (#20452 )
dongluw
2025-09-12 10:54:17 -04:00
9f04d9d55f
[Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP (#24739 )
Elvir Crnčević
2025-09-12 16:54:04 +02:00
4d7c1d531b
[Bugfix] Fix MRoPE dispatch on XPU (#24724 )
Yan Ma
2025-09-12 21:43:56 +08:00
41f17bf290
[Docs] Fix warnings in mkdocs build (continued) (#24740 )
Hyogeun Oh (오효근)
2025-09-12 22:43:15 +09:00
bcb06d7baf
[Doc]: fix typos in various files (#24726 )
Didier Durand
2025-09-12 15:43:12 +02:00
0377802c20
[Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec (#24548 )
Flora Feng
2025-09-12 06:42:23 -07:00
72fc8aa412
[Multi Modal] Add FA3 in VIT (#24347 )
Wenlong Wang
2025-09-12 06:27:24 -07:00
fdb09c77d6
[sleep mode] save memory for on-the-fly quantization (#24731 )
youkaichao
2025-09-12 19:25:19 +08:00
7a1c4025f1
[Kernel] [CPU] refactor cpu_attn.py:_run_sdpa_forward for better memory access (#24701 )
Ignacio Sica
2025-09-12 08:23:07 -03:00
60a0951924
[Bugfix] Fix BNB name match (#24735 )
Jee Jee Li
2025-09-12 19:12:01 +08:00
64d90c3e4f
[Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call (#24717 )
Chen Zhang
2025-09-12 03:57:07 -07:00
59d5d2c736
[CI/Build] Skip prompt embeddings tests on V1-only CPU backend (#24721 )
Li, Jiang
2025-09-12 18:51:01 +08:00
d21a36f5f9
[CI] Add ci_envs for convenient local testing (#24630 )
wang.yuqi
2025-09-12 16:52:25 +08:00
561a0baee0
[CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order (#24640 )
Chen Zhang
2025-09-12 00:49:09 -07:00
f592b3174b
[BugFix] Fix Qwen3-Next PP (#24709 )
Nick Hill
2025-09-11 23:35:04 -07:00
7920de0a2a
[Bugfix] Fix MRoPE dispatch on CPU (#24712 )
Li, Jiang
2025-09-12 12:56:31 +08:00
ddcec289c7
Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds (#24686 )
Andrew Sansom
2025-09-11 23:35:48 -05:00
e090b7b45b
Enable conversion of multimodal models to pooling tasks (#24451 )
Maximilien de Bayser
2025-09-12 00:30:41 -03:00
6a50eaa0d3
[DOCs] Update ROCm installation docs section (#24691 )
Gregory Shtrasberg
2025-09-11 23:02:53 -04:00
12a8414d81
[Qwen3-Next] MoE configs for H20 TP=1,2,4,8 (#24707 )
Jee Jee Li
2025-09-12 10:06:26 +08:00
880c741bb6
[Bugfix] fixes the causal_conv1d_update kernel update non-speculative decoding cases (#24680 )
v0.10.2rc2
Tao He
2025-09-12 09:16:43 +08:00
40b6c9122b
[V1] feat:add engine v1 tracing (#20372 )
RichardoMu
2025-09-12 08:10:39 +08:00
2e6bc46821
[Startup] Make DeepGEMM warmup scale with max-num-batched-tokens (#24693 )
Lucas Wilkinson
2025-09-11 20:10:19 -04:00
fcba05c435
[Bug] Fix Layer weight_block_size Assertion Issue (#24674 )
Wentao Ye
2025-09-11 19:47:59 -04:00
7a30fa8708
[Doc] Clarify cudagraph capture size logic and default behavior in scheduler (#18698 )
Zazzle516
2025-09-12 07:18:09 +08:00
f82f7a8990
[Qwen3-Next] MOE configs for H100 TP4 (#24699 )
Chen Zhang
2025-09-11 15:45:52 -07:00
c3aea10dc8
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel (#23280 )
Michael Goin
2025-09-11 18:43:14 -04:00
d4fd2768ef
[Bugfix][Attention] Fix FlashInfer MLA block size logic (#24692 )
Matthew Bonanni
2025-09-11 18:39:42 -04:00
7a70a71892
[Qwen3-Next] Add B200 MoE configs for Qwen3-next (#24698 )
Vadim Gimpelson
2025-09-12 02:34:58 +04:00
7d4651997a
[CI/Build] Add bc-linter to vLLM CI (#21234 )
Zhewen Li
2025-09-11 15:34:36 -07:00
569bf1c9c0
[Qwen3-Next] MoE configs for H200 TP=1,2,4 (#24695 )
Woosuk Kwon
2025-09-11 14:38:16 -07:00
1ec20355f5
[Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False (#24696 )
Wentao Ye
2025-09-11 17:32:27 -04:00
e42af78b18
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention (#24197 )
Xiaozhu Meng
2025-09-11 14:20:09 -07:00
074854b24f
[Kernel][B200] mxfp4 fused cutlass moe (#23696 )
Duncan Moss
2025-09-11 14:04:56 -07:00
79ac59f32e
Update Spec Decode metrics to include drafted and accepted token throughput (#24127 )
Andrew Xia
2025-09-11 12:58:43 -07:00
b971f91504
[BugFix] Fix tokenize asyncio task leak (#24677 )
Nick Hill
2025-09-11 12:44:04 -07:00
c733bd5e87
[Qwen3-Next] Add MoE Config for H200 (#24688 )
Woosuk Kwon
2025-09-11 12:40:15 -07:00
a892b259b4
[Doc] Remove Useless Comments (#24687 )
Wentao Ye
2025-09-11 15:25:47 -04:00