Commit Graph

  • bf214ca226 [Misc] Fix examples openai_pooling_client.py (#24853) wang.yuqi 2025-09-15 19:57:30 +08:00
  • 2e41f5abca [XPU] Set consistent default KV cache layout (#24745) Nicolò Lucchesi 2025-09-15 12:09:34 +02:00
  • bc0f6059a2 [UT] enhance free kv cache block queue popleft_n (#24220) Ning Xie 2025-09-15 18:04:37 +08:00
  • 8de261b04a [P/D]kv_output_aggregator support P TP > D TP (#23917) Chao Lei 2025-09-15 17:36:06 +08:00
  • a0d8b9738d [Misc] Own KVConnectors installation (#24867) Nicolò Lucchesi 2025-09-15 11:21:09 +02:00
  • 59e17dd4a0 [Misc] rename interval to max_recent_requests (#24229) Ning Xie 2025-09-15 17:18:42 +08:00
  • 4979eb79da [Doc]: fix typos in various files (#24821) Didier Durand 2025-09-15 10:08:52 +02:00
  • a8c0f59973 [Bugfix] MiDashengLM model contact error under concurrent testing (#24738) bingchen-mi 2025-09-15 14:38:12 +08:00
  • f4a948f33f [Frontend] Skip stop in reasoning content (#14550) Ce Gao 2025-09-15 14:04:55 +08:00
  • 3f3313981c [kv cache] update num_free_blocks in the end (#24228) Ning Xie 2025-09-15 13:15:12 +08:00
  • 78818dd1b0 [Docs] Have a try to improve frameworks/streamlit.md (#24841) Michael Yao 2025-09-15 12:50:36 +08:00
  • 8e5cdcda4e [Hybrid Allocator] Support Pipeline Parallel (#23974) Chen Zhang 2025-09-14 15:55:17 -07:00
  • 90f3f7d73e [Spec Decoding]Support Spec Decoding Metrics in DP Mode (#24049) wuhang 2025-09-15 05:11:09 +08:00
  • 6dc8da5dc1 [Chore] Remove ipex_ops warning (#24835) Robert Shaw 2025-09-14 15:41:53 -04:00
  • 79cbcab871 Force use C++17 globally to avoid compilation error (#24823) FengjinChen 2025-09-15 03:30:10 +08:00
  • ff68035932 [Benchmarks] Throw usage error when using dataset-name random and dataset-path together (#24819) Ye (Charlotte) Qi 2025-09-14 10:50:01 -07:00
  • 1177dd53e9 fix type of sampling rate for encode_base64 (#24826) co63oc 2025-09-15 00:17:16 +08:00
  • fc2dbcda8b [Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement (#24783) Wentao Ye 2025-09-14 11:20:17 -04:00
  • fec347dee1 [Misc] Improve s3_utils type hints with BaseClient (#24825) Hyogeun Oh (오효근) 2025-09-14 21:11:14 +09:00
  • cc3173ae98 [Multi Modal][Performance] Fused Q,K's apply_rope into one (#24511) Wenlong Wang 2025-09-14 01:10:21 -07:00
  • 3e903b6cb4 [Chore] Minor simplification for non-PP path (#24810) Woosuk Kwon 2025-09-13 17:41:36 -07:00
  • 973c9d01da [Minor] Simplify duplicative device check for cuda (#24793) Victor Ziliang Peng 2025-09-13 11:28:38 -07:00
  • 26b999c71a [CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe (#24750) Michael Goin 2025-09-13 03:29:19 -04:00
  • e925187f6d Merge branch 'main' into wye-refactor-quant-folder ci/build/22474 yewentao256 2025-09-13 07:38:47 -07:00
  • 15b8fef453 Remove redundant assignment in xfer_buffers, This is a little fix (#24732) TaoYu Chen 2025-09-13 16:11:59 +08:00
  • cfa3234a5b [CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again (#24771) Wenlong Wang 2025-09-13 00:45:11 -07:00
  • 41ae4a1eab [Doc]: fix typos in various files (#24798) Didier Durand 2025-09-13 09:43:33 +02:00
  • 4dad72f0d9 [Misc] Correct an outdated comment. (#24765) Russell Bryant 2025-09-13 03:34:53 -04:00
  • 59d7ffc17f [CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe (#24750) Michael Goin 2025-09-13 03:29:19 -04:00
  • 1da0f1441d [Core][Multimodal] Cache supports_kw (#24773) Lukas Geiger 2025-09-13 08:27:04 +01:00
  • 98229db244 [Kernels][DP/EP] Optimize Silu Kernel for R1 (#24054) Elvir Crnčević 2025-09-13 09:17:27 +02:00
  • dbeee3844c [Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization (#24757) elvischenv 2025-09-13 15:16:24 +08:00
  • 30498f2a65 [Doc]: Remove 404 hyperlinks (#24785) Rakesh Asapanna 2025-09-13 12:45:41 +05:30
  • abc7989adc [Docs] Remove Neuron install doc as backend no longer exists (#24396) Harry Mellor 2025-09-13 08:15:03 +01:00
  • 9a8966bcc2 [Docs] Fix warnings in mkdocs build (continued) (#24791) Hyogeun Oh (오효근) 2025-09-13 16:13:44 +09:00
  • 5febdc8750 [Chore] Remove unused batched RoPE op & kernel (#24789) Woosuk Kwon 2025-09-13 00:08:20 -07:00
  • 99bfef841f [Bugfix] Fix GPUModelRunner has no attribute lora_manager (#24762) Jee Jee Li 2025-09-13 14:55:14 +08:00
  • da3fa78dc9 [Compilation Bug] Fix Inductor Graph Output with Shape Issue (#24772) v0.10.2rc3 Wentao Ye 2025-09-12 17:23:05 -04:00
  • bbb70036cb Enable conversion of multimodal models to pooling tasks (#24451) Maximilien de Bayser 2025-09-12 00:30:41 -03:00
  • 89da8d9d09 [Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) (#24667) Tao He 2025-09-13 06:31:32 +08:00
  • 01085b134d [Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP (#24739) Elvir Crnčević 2025-09-12 16:54:04 +02:00
  • 66160a9943 [BugFix] Fix Qwen3-Next PP (#24709) Nick Hill 2025-09-11 23:35:04 -07:00
  • eaca762c18 [Qwen3-Next] MoE configs for H20 TP=1,2,4,8 (#24707) Jee Jee Li 2025-09-12 10:06:26 +08:00
  • 89e08d6d18 [Model] Add Olmo3 model implementation (#24534) Shane A 2025-09-12 20:26:21 -07:00
  • 7f2ea7074e [Frontend][Multimodal] Allow skipping media data when UUIDs are provided. (#23950) Chenheli Hua 2025-09-12 19:16:06 -07:00
  • 4fdd6f5cbf [Core] Support async scheduling with uniproc executor (#24219) Nick Hill 2025-09-12 16:34:28 -07:00
  • 8226dd56bf [Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) (#24667) Tao He 2025-09-13 06:31:32 +08:00
  • 5fe643fc26 Add FLASHINFER_MLA to backend selector test (#24753) Matthew Bonanni 2025-09-12 18:30:07 -04:00
  • 7ba32aa60b [Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode (#24705) Matthew Bonanni 2025-09-12 17:45:53 -04:00
  • c89ed8de43 Invert pattern order to make sure that out_proj layers are identified (#24781) Alexandre Marques 2025-09-12 17:45:29 -04:00
  • 3beadc2f25 [Compilation Bug] Fix Inductor Graph Output with Shape Issue (#24772) Wentao Ye 2025-09-12 17:23:05 -04:00
  • bc636f21a6 [Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints (#23937) Clayton Coleman 2025-09-12 16:57:53 -04:00
  • 017354c0ef [CI] Trigger BC Linter when labels are added/removed (#24767) Zhewen Li 2025-09-12 11:44:36 -07:00
  • 1e3e56abfc Merge branch 'main' into wye-refactor-quant-folder Wentao Ye 2025-09-12 14:17:56 -04:00
  • 010acc6e1e [Bugfix] Fix incompatibility between #20452 and #24548 (#24754) Cyrus Leung 2025-09-13 02:17:29 +08:00
  • c8c42597ab [CI] Speed up model unit tests in CI (#24253) afeldman-nm 2025-09-12 13:36:50 -04:00
  • 9d2a44606d [UX] Remove AsyncLLM torch profiler disabled log (#24609) Michael Goin 2025-09-12 13:08:44 -04:00
  • f17c075884 [Model] Switch to Fused RMSNorm in GLM-4.1V model (#24733) Samit 2025-09-13 00:12:23 +08:00
  • b0d1213ac3 [Models] Prevent CUDA sync in Qwen2.5-VL (#24741) Lukas Geiger 2025-09-12 17:03:55 +01:00
  • 57f94e88ea [Models] Optimise and simplify _validate_and_reshape_mm_tensor (#24742) Lukas Geiger 2025-09-12 16:37:37 +01:00
  • 684b6870e1 [Bugfix][Frontend] Fix --enable-log-outputs does not match the documentation (#24626) Kebe 2025-09-13 00:01:24 +09:00
  • 1facf77094 Merge branch 'main' into wye-refactor-quant-folder yewentao256 2025-09-12 08:00:41 -07:00
  • a5b84f1cbf [Core] Shared memory based object store for Multimodal data caching and IPC (#20452) dongluw 2025-09-12 10:54:17 -04:00
  • 9f04d9d55f [Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP (#24739) Elvir Crnčević 2025-09-12 16:54:04 +02:00
  • 4d7c1d531b [Bugfix] Fix MRoPE dispatch on XPU (#24724) Yan Ma 2025-09-12 21:43:56 +08:00
  • 41f17bf290 [Docs] Fix warnings in mkdocs build (continued) (#24740) Hyogeun Oh (오효근) 2025-09-12 22:43:15 +09:00
  • bcb06d7baf [Doc]: fix typos in various files (#24726) Didier Durand 2025-09-12 15:43:12 +02:00
  • 0377802c20 [Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec (#24548) Flora Feng 2025-09-12 06:42:23 -07:00
  • 72fc8aa412 [Multi Modal] Add FA3 in VIT (#24347) Wenlong Wang 2025-09-12 06:27:24 -07:00
  • fdb09c77d6 [sleep mode] save memory for on-the-fly quantization (#24731) youkaichao 2025-09-12 19:25:19 +08:00
  • 7a1c4025f1 [Kernel] [CPU] refactor cpu_attn.py:_run_sdpa_forward for better memory access (#24701) Ignacio Sica 2025-09-12 08:23:07 -03:00
  • 60a0951924 [Bugfix] Fix BNB name match (#24735) Jee Jee Li 2025-09-12 19:12:01 +08:00
  • 64d90c3e4f [Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call (#24717) Chen Zhang 2025-09-12 03:57:07 -07:00
  • 59d5d2c736 [CI/Build] Skip prompt embeddings tests on V1-only CPU backend (#24721) Li, Jiang 2025-09-12 18:51:01 +08:00
  • d21a36f5f9 [CI] Add ci_envs for convenient local testing (#24630) wang.yuqi 2025-09-12 16:52:25 +08:00
  • 561a0baee0 [CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order (#24640) Chen Zhang 2025-09-12 00:49:09 -07:00
  • f592b3174b [BugFix] Fix Qwen3-Next PP (#24709) Nick Hill 2025-09-11 23:35:04 -07:00
  • 7920de0a2a [Bugfix] Fix MRoPE dispatch on CPU (#24712) Li, Jiang 2025-09-12 12:56:31 +08:00
  • ddcec289c7 Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds (#24686) Andrew Sansom 2025-09-11 23:35:48 -05:00
  • e090b7b45b Enable conversion of multimodal models to pooling tasks (#24451) Maximilien de Bayser 2025-09-12 00:30:41 -03:00
  • 6a50eaa0d3 [DOCs] Update ROCm installation docs section (#24691) Gregory Shtrasberg 2025-09-11 23:02:53 -04:00
  • 12a8414d81 [Qwen3-Next] MoE configs for H20 TP=1,2,4,8 (#24707) Jee Jee Li 2025-09-12 10:06:26 +08:00
  • 880c741bb6 [Bugfix] fixes the causal_conv1d_update kernel update non-speculative decoding cases (#24680) v0.10.2rc2 Tao He 2025-09-12 09:16:43 +08:00
  • 40b6c9122b [V1] feat:add engine v1 tracing (#20372) RichardoMu 2025-09-12 08:10:39 +08:00
  • 2e6bc46821 [Startup] Make DeepGEMM warmup scale with max-num-batched-tokens (#24693) Lucas Wilkinson 2025-09-11 20:10:19 -04:00
  • fcba05c435 [Bug] Fix Layer weight_block_size Assertion Issue (#24674) Wentao Ye 2025-09-11 19:47:59 -04:00
  • 7a30fa8708 [Doc] Clarify cudagraph capture size logic and default behavior in scheduler (#18698) Zazzle516 2025-09-12 07:18:09 +08:00
  • f82f7a8990 [Qwen3-Next] MOE configs for H100 TP4 (#24699) Chen Zhang 2025-09-11 15:45:52 -07:00
  • c3aea10dc8 [Perf] Use upstream CUTLASS for SM90 Block FP8 kernel (#23280) Michael Goin 2025-09-11 18:43:14 -04:00
  • d4fd2768ef [Bugfix][Attention] Fix FlashInfer MLA block size logic (#24692) Matthew Bonanni 2025-09-11 18:39:42 -04:00
  • 7a70a71892 [Qwen3-Next] Add B200 MoE configs for Qwen3-next (#24698) Vadim Gimpelson 2025-09-12 02:34:58 +04:00
  • 7d4651997a [CI/Build] Add bc-linter to vLLM CI (#21234) Zhewen Li 2025-09-11 15:34:36 -07:00
  • 569bf1c9c0 [Qwen3-Next] MoE configs for H200 TP=1,2,4 (#24695) Woosuk Kwon 2025-09-11 14:38:16 -07:00
  • 1ec20355f5 [Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False (#24696) Wentao Ye 2025-09-11 17:32:27 -04:00
  • e42af78b18 [flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention (#24197) Xiaozhu Meng 2025-09-11 14:20:09 -07:00
  • 074854b24f [Kernel][B200] mxfp4 fused cutlass moe (#23696) Duncan Moss 2025-09-11 14:04:56 -07:00
  • 79ac59f32e Update Spec Decode metrics to include drafted and accepted token throughput (#24127) Andrew Xia 2025-09-11 12:58:43 -07:00
  • b971f91504 [BugFix] Fix tokenize asyncio task leak (#24677) Nick Hill 2025-09-11 12:44:04 -07:00
  • c733bd5e87 [Qwen3-Next] Add MoE Config for H200 (#24688) Woosuk Kwon 2025-09-11 12:40:15 -07:00
  • a892b259b4 [Doc] Remove Useless Comments (#24687) Wentao Ye 2025-09-11 15:25:47 -04:00