Commit Graph

  • 1838cd4860 Revert "Add batch invariant kernel override for FlashInfer backend [2/n]" (#26220) Cyrus Leung 2025-10-04 17:45:08 +08:00
  • 7d6b03381e [CI Failure] fix_test_auto_prefix_cache_support (#26053) Huamin Li 2025-10-04 02:44:49 -07:00
  • 7c2e91c4e0 [Misc] Remove unused executor.apply_model (#26215) Cyrus Leung 2025-10-04 16:45:53 +08:00
  • 736fbf4c89 [Misc] Require merge_by_field_config argument (#26214) Cyrus Leung 2025-10-04 16:40:14 +08:00
  • 44ea85137a [Model] Support nested structures for TensorSchema (#26212) Cyrus Leung 2025-10-04 16:20:32 +08:00
  • d3d649efec Support expert parallel in Transformers backend (#26162) Harry Mellor 2025-10-04 05:35:04 +01:00
  • ea507c3a93 [V1] [Hybrid] Mamba2 Automatic Prefix Caching (#25752) Stan Wozniak 2025-10-04 06:34:22 +02:00
  • 9705fba7b7 [cpu][perf] Accelerate unquantized-linear for AArch64 through oneDNN/ACL and weight prepack (#25948) Fadi Arafeh 2025-10-04 05:16:38 +01:00
  • 2f7dbc9b42 Add batch invariant kernel override for FlashInfer backend [2/n] (#25769) Bram Wasti 2025-10-03 21:49:30 -05:00
  • ea25a76c05 [BugFix] Use async Mistral Tokenizer in Chat Completions (#26134) Ben Browning 2025-10-03 21:42:08 -04:00
  • 67bc0c003e [Bugfix] Fix qwen3 vl dummy data generation with overrides (#26193) Roger Wang 2025-10-03 18:40:20 -07:00
  • 5a05f26603 Fix issue of using only the part of video frame [Nemotron Nano] (#26186) Eugene Khvedchenya 2025-10-04 03:21:00 +03:00
  • 7ef40bb983 [GPTOSS][DP/EP][Marlin] Enable GPTOSS DP/EP using Marlin kernels (#25488) Varun Sundar Rabindranath 2025-10-03 20:13:13 -04:00
  • 767cbb011d [CI] Fix Pre-commit Mypy Error (#26181) Wentao Ye 2025-10-03 19:08:03 -04:00
  • 7cfa4b24bf [BugFix] Fix de-functionalization pass for rotary_embedding (#23953) Angela Yi 2025-10-03 15:44:18 -07:00
  • b71fcd4905 [Misc] Add penalties sampling parameters to serve tool (#25974) Sergei Skvortsov 2025-10-03 23:43:14 +01:00
  • 75003f34e8 [CI] Push multiarch manifests as nightly builds (#25764) Sahithi Chigurupati 2025-10-03 15:42:55 -07:00
  • 78b8015a4d [Bugfix] Relax tokenizer regex for mixtral to include 'tokenizer.model' (#25964) Bowen Bao 2025-10-03 15:31:59 -07:00
  • 831b124151 [responsesAPI] add better error messaging for long prompts (#25724) Andrew Xia 2025-10-03 14:33:13 -07:00
  • c1ffcb55da [Refactor] Optimize FP8 MOE Backend Choice and Log (#26044) Wentao Ye 2025-10-03 17:23:42 -04:00
  • 0879736aab [Perf] Remove hardcoded num_warps=1 (#26183) Corey Lowman 2025-10-03 16:38:50 -04:00
  • a26917332f [Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix K/V scale loading for MLA Attn (#25968) Pavani Majety 2025-10-03 12:35:06 -07:00
  • cd9e5b8340 Fix V1 engine serialization error with Ray distributed executor (#26148) Nikhil G 2025-10-03 11:39:45 -07:00
  • 300a59c4c3 Avoid division by zero in cache DS MLA kernel (#26174) Matthew Bonanni 2025-10-03 13:35:17 -04:00
  • d76541a6c5 Stop mergify from keeping stale PRs alive (#26169) Harry Mellor 2025-10-03 17:42:34 +01:00
  • dd96465fd7 [BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduced by #24642 (#26123) Chendi.Xue 2025-10-03 10:52:26 -05:00
  • 4f8f47e87e Fix undefined symbol: cutlass_moe_mm_sm100 (#26098) Jun Jiang 2025-10-03 23:48:32 +08:00
  • d78fda7cda [Renderer] Move Processor out of LLMEngine (#26165) Cyrus Leung 2025-10-03 23:08:22 +08:00
  • 73a99cc2a5 [Model] Fixed stream generator for gpt-oss + spec-decoding (#26027) Aleksandr Samarin 2025-10-03 15:43:41 +02:00
  • adae0c1f43 [CI/Build] do not enforce precompilation on tpu ci tests (#25992) Xiang Si 2025-10-03 06:38:42 -07:00
  • cbf9221992 [Model] Supplement to PR 24862: Pass param prefix to LLMHead (#25805) whx 2025-10-03 21:34:53 +08:00
  • 5f42fc53b6 [backends][short_conv] CUDA graph piecewise edits (#24215) Paul Pak 2025-10-03 21:59:48 +09:00
  • 8ee846c27c [Bugfix] Re-enable prefill of max model length (#24446) Yannick Schnider 2025-10-03 14:13:34 +02:00
  • 812b7f54a8 [Renderer] Move Processor out of AsyncLLM (#24138) Yang Liu 2025-10-03 04:29:45 -07:00
  • 5f2cacdb1e Quick fix for IMA with the Prefix Prefill kernel during graph capture (#25983) Sage Moore 2025-10-03 04:28:22 -07:00
  • aa5053e3fe [Doc] Fixed shape description for fused_batched_moe.py (#25668) Egor 2025-10-03 13:00:23 +02:00
  • 79aa244678 [Multi Modal] Configurable MM Profiling (#25631) Wenlong Wang 2025-10-03 03:59:10 -07:00
  • 2ed3f20dba [openai] Fix missing tool usage check (system message) (#24768) kyt 2025-10-03 19:55:44 +09:00
  • 48f309029a [NIXL][Misc] Expose metrics from NIXL for logging to CLI (#25388) Nicolò Lucchesi 2025-10-03 12:47:59 +02:00
  • 0e93ac0b3a [CI] Fix distributed hybrid tests in CI (#26155) Thomas Parnell 2025-10-03 11:14:18 +02:00
  • 5446ad1d24 [test utils] correct wrong typing (#26159) Yannick Schnider 2025-10-03 11:11:49 +02:00
  • f9a8084e48 [Model] Use merge_by_field_config for MM models (InternVL family) (#26153) Cyrus Leung 2025-10-03 16:59:06 +08:00
  • 3e70e3d4d5 add(v1): RequestStatesStats to RequestOutput (#24947) HUIJONG JEONG 2025-10-03 17:56:25 +09:00
  • eb0fa43868 [Perf] Optimize reshape_and_cache CUDA Kernel (#25955) Jiangyun Zhu 2025-10-03 16:33:46 +08:00
  • 0ad9951c41 [Input] Remove unused prompt field (#26097) Cyrus Leung 2025-10-03 15:23:21 +08:00
  • 8c9117181d [Misc] Remove typing.List (#26150) Varun Sundar Rabindranath 2025-10-03 03:00:33 -04:00
  • c4b48d3c0f [BUG] Reorder model config creation (#26124) ahao-anyscale 2025-10-02 23:59:36 -07:00
  • 10d765482d FusedMoE support for the Transformers backend (#22650) Harry Mellor 2025-10-03 07:12:15 +01:00
  • 39b643dc1a [Model] Use merge_by_field_config for MM models (G) (#26117) Cyrus Leung 2025-10-03 13:38:29 +08:00
  • 711f485643 [Bugfix] Fix import gemm_afp4wfp4 failure on AMD (#26068) Zhewen Li 2025-10-02 22:37:25 -07:00
  • 9c5ee91b2a [ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (#26104) TJian 2025-10-02 22:34:53 -07:00
  • f71952c1c4 [Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (#26103) (tag: v0.11.0rc6) Tyler Michael Smith 2025-10-03 01:21:01 -04:00
  • d1007767c5 [Bugfix] Disable cascade attention with FlashInfer (#26130) Michael Goin 2025-10-02 19:30:37 -04:00
  • 27edd2aeb4 [Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (#26103) Tyler Michael Smith 2025-10-03 01:21:01 -04:00
  • e5017cd6d6 [gpt-oss] disable tool server initialization if no tool in request (#25790) Andrew Xia 2025-10-02 22:08:35 -07:00
  • 6a7796e871 [Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (#26144) Benjamin Chislett 2025-10-03 00:00:20 -04:00
  • 47b9339546 [DeepSeek] Improve performance of DS MLA cache kernel (#26132) Matthew Bonanni 2025-10-02 23:35:47 -04:00
  • 5d5146eee3 [CI/Build] Conditionally register cutlass_fp4_group_mm to fix building on Hopper (#26138) Michael Goin 2025-10-02 23:32:38 -04:00
  • 2aaa423842 [Attention] Move Backend enum into registry (#25893) Matthew Bonanni 2025-10-02 23:32:24 -04:00
  • ad2d788016 [Bug][Benchmark] Fix duplicate req in oversampling (#26140) Ekagra Ranjan 2025-10-02 22:55:24 -04:00
  • 36ce76c632 [Log] Optimize DeepGEMM Missing Log (#26106) Wentao Ye 2025-10-02 22:02:26 -04:00
  • f1fc2107a3 [Bugfix] Disable cascade attention with FlashInfer (#26130) Michael Goin 2025-10-02 19:30:37 -04:00
  • 13cdc02173 Fix MTP with deepep_low_latency (#25904) Matthew Bonanni 2025-10-02 17:29:49 -04:00
  • 502640c3f9 [Perf] Fix and reapply move apply w8a8 block fp8 linear to class (#25696) ElizaWszola 2025-10-02 21:35:13 +02:00
  • 3d5f1c8640 [Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MTP (#25119) Chen Zhang 2025-10-02 11:48:31 -07:00
  • 1cab2f9cad EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench (#25916) Ekagra Ranjan 2025-10-02 14:29:35 -04:00
  • c75c2e70d6 [Deepseek v3.2] Support indexer prefill chunking (#25999) (tag: v0.11.0rc5) Chen Zhang 2025-10-02 10:29:12 -07:00
  • 9d9a2b77f1 [Small] Prevent bypassing media domain restriction via HTTP redirects (#26035) Chenheli Hua 2025-10-02 10:27:10 -07:00
  • 6040e0b6c0 [BugFix] Fix FI accuracy issue when used for MLA prefill (#26063) Lucas Wilkinson 2025-10-02 13:18:13 -04:00
  • 05bf0c52a1 Update base image to 22.04 (jammy) (#26065) Huy Do 2025-10-02 05:48:04 -07:00
  • c536881a7c [BugFix] ChunkedLocalAttention is currently not CG compatible (#26034) Lucas Wilkinson 2025-10-01 19:28:00 -04:00
  • ebce361c07 [BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026) Lucas Wilkinson 2025-10-01 15:30:00 -04:00
  • 1e50f1be70 [Deepseek v3.2] Support indexer prefill chunking (#25999) Chen Zhang 2025-10-02 10:29:12 -07:00
  • ad87ba927a [Small] Prevent bypassing media domain restriction via HTTP redirects (#26035) Chenheli Hua 2025-10-02 10:27:10 -07:00
  • decf7f794b [BugFix] Fix FI accuracy issue when used for MLA prefill (#26063) Lucas Wilkinson 2025-10-02 13:18:13 -04:00
  • d00d652998 [CI/Build] Replace vllm.entrypoints.openai.api_server entrypoint with vllm serve command (#25967) Cyrus Leung 2025-10-03 01:04:57 +08:00
  • 3b279a84be [CI] Add Blackwell DeepSeek FP8 FlashInfer MoE tests (#26040) Michael Goin 2025-10-02 12:07:19 -04:00
  • 5e4a8223c6 [Qwen][ROCm] Flash Attention Rotary Embeddings (#24642) vllmellm 2025-10-02 23:26:08 +08:00
  • e51de388a2 [Platform][CI] Added OOT platform interface e2e test that running on Ascend NPU (#25470) leo-pony 2025-10-02 23:19:22 +08:00
  • cc253b73d3 [Model] Use merge_by_field_config for MM models (D-F) (#26076) Cyrus Leung 2025-10-02 23:17:35 +08:00
  • 7d6fb905d9 [Model] Use merge_by_field_config for MM models (A-C) (#26073) Cyrus Leung 2025-10-02 23:17:31 +08:00
  • 418d111f8c [FA/Chore] Bump vllm-flash-attention (#25537) Lucas Wilkinson 2025-10-02 11:06:14 -04:00
  • be8921fbba Change size of single CUDA graph for CI to 4 (#26089) Thomas Parnell 2025-10-02 16:14:28 +02:00
  • d4e7a1152d Update base image to 22.04 (jammy) (#26065) Huy Do 2025-10-02 05:48:04 -07:00
  • be22bb6f3d Run:ai model streamer add GCS package support (#24909) pwschuurman 2025-10-01 20:59:13 -07:00
  • 169313b9f8 [Misc] Make handling of SamplingParams clearer in n>1 case (#26032) Nick Hill 2025-10-01 19:31:39 -07:00
  • 0b018d8baf [ROCm][Bugfix] Add missing parameter to ROCm backend (#26029) Gregory Shtrasberg 2025-10-01 22:23:14 -04:00
  • c31246800c Support RL online quantization with torchao (#23014) Jerry Zhang 2025-10-01 16:39:29 -07:00
  • 4134312b35 [BugFix] ChunkedLocalAttention is currently not CG compatible (#26034) Lucas Wilkinson 2025-10-01 19:28:00 -04:00
  • da554f932e [Bug] Fix Negative Cuda Memory Usage (#25683) Wentao Ye 2025-10-01 18:16:26 -04:00
  • aac622e0cd [ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (#25908) Hosang 2025-10-01 17:39:49 -04:00
  • 1726e93ef1 [BugFix][DP/EP] Fix CUTLASS MLA hang under load (#26026) Lucas Wilkinson 2025-10-01 15:30:00 -04:00
  • ee04c0cd04 [CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (#26030) Michael Goin 2025-10-01 15:02:17 -04:00
  • c36f0aa300 Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_indices_offsets (#25995) Huamin Li 2025-10-01 11:18:36 -07:00
  • 5234dc7451 [NVIDIA] Blackwell Family (#24673) Johnny 2025-10-01 19:50:54 +02:00
  • 3b7c20a6b5 [Bugfix] Apply same sampling parameters for both n=1 and n>1 (#26005) Kenichi Maehashi 2025-10-01 23:37:35 +09:00
  • f9e714813a [Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type (#26007) Nathan Scott 2025-10-01 22:41:57 +10:00
  • 2518230d3e [MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes < 4 (#25829) billishyahao 2025-10-01 20:39:45 +08:00
  • a332b84578 [CI] Only capture a single CUDA graph size in CI by default (#25951) Harry Mellor 2025-10-01 10:03:44 +01:00
  • 1405f0c7ba [Misc] Factor out common _apply_feature_select_strategy (#26003) Cyrus Leung 2025-10-01 16:31:03 +08:00