Commit Graph

  • 17af6aa0da [Document] Add ms-swift library to rlhf.md (#27469) jinghanhu 2025-10-25 04:31:50 +08:00
  • fc168c33f3 [CI/Build] Fix test_torch_utils in AMD CI (#27317) Zhewen Li 2025-10-24 12:26:00 -07:00
  • acc78aeb88 [Bugfix] Fix interns1-vit qk norm code path (#27480) Isotr0py 2025-10-25 01:43:45 +08:00
  • 0f67d4d962 [Attention] Add MLA prefill backend: trtllm_ragged_attention_deepseek (#26397) Ming Yang 2025-10-24 10:24:08 -07:00
  • 7e1d697b56 [Bugfix] Fix MultiConnector stats reconstruction across process boundaries (#27366) kourosh hakhamaneshi 2025-10-24 10:08:05 -07:00
  • 699d62e6cf [NIXL][BUGFIX] delay done_recving queue cleanup to bottom of get_finished (#27297) Chendi.Xue 2025-10-24 12:01:41 -05:00
  • cd390b609d [compile] Turn standalone_compile back on (#27460) Richard Zou 2025-10-24 09:30:27 -07:00
  • 2080b05099 [cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (#27472) Fadi Arafeh 2025-10-24 16:57:48 +01:00
  • 6454afec90 [Doc] Fix minor issues in docs/design/metrics.md (#27436) Lifans 2025-10-24 05:40:54 -07:00
  • 41a62564a7 Fix test named tool use (#27458) Chauncey 2025-10-24 20:27:45 +08:00
  • 284cc92275 [MISC] cudagraph_capture_sizes related improvements (#26016) fhl2000 2025-10-24 20:11:05 +08:00
  • 435be10db9 Fix AArch64 CPU Docker pipeline (#27331) ioana ghiban 2025-10-24 14:11:01 +02:00
  • b7030d962b [Benchmark] Enable benchmark to run with encoding_format="bytes" (#27467) Cyrus Leung 2025-10-24 19:16:50 +08:00
  • 3567816932 [Refactor] move tool parsing logic from protocol.py to the tool parser (#27383) Chauncey 2025-10-24 17:53:23 +08:00
  • e0ef8a2920 [BugFix] Fix torchrun DP with LLM class (#27395) 22quinn 2025-10-24 01:11:37 -07:00
  • 42efe609ba [MM][Bugfix] Replace PatchEmbed's conv3d to linear layer (#27418) Isotr0py 2025-10-24 15:32:47 +08:00
  • 88d3141ec6 [Docs] remove v1 column for embedding models (#27446) Yu Jiaqi 2025-10-24 14:55:03 +08:00
  • 09a6a49eaf [Misc] Avoid "PyTorch non-writable tensors" warning in RayPPCommunicator (#27443) Rui Qiao 2025-10-23 23:53:09 -07:00
  • 074475541a [Bugfix] Fix Pydantic union resolution for ResponseFunctionToolCall in Responses API (#26706) strinczer 2025-10-24 06:53:42 +01:00
  • d4c574c39f [Chore] remove structural tags logging lines (#27451) Aaron Pham 2025-10-24 01:35:45 -04:00
  • c528b9006a Fix EventPublisherFactory logic for disabled KV cache events (#27419) usberkeley 2025-10-24 13:00:01 +08:00
  • 85fee74b33 [Bugfix][CI] Move resolving cudagraph_mode before initializing attn_metadata_builder (#27427) fhl2000 2025-10-24 11:31:14 +08:00
  • 8dbe0c527f [Misc] Add TPU usage report when using tpu_inference. (#27423) hfan 2025-10-23 23:29:37 -04:00
  • 5cc6bddb6e [Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm (#26092) Xiangyu Li 2025-10-24 11:26:13 +08:00
  • 1f9460c4c1 Fix pooling adapters for Transformers backend (#27338) Harry Mellor 2025-10-24 04:23:55 +01:00
  • 70022ffc00 Granite 4.0 quark quantization support (#26944) xiao-llm 2025-10-23 22:14:03 -04:00
  • f417746ad7 [Hardware][POWERPC] Disable oneDNN path in vllm/model_executor/layers/utils.py for Powerpc (#27422) Akash kaothalkar 2025-10-24 02:51:36 +05:30
  • 0552cfb195 [Model] Siglip Embedding Support (#27324) Yu Jiaqi 2025-10-24 04:19:48 +08:00
  • 51dd14ac2b [Bugfix][DP] Fix creating too many DP Placement Groups (#26880) Kebe 2025-10-24 05:16:51 +09:00
  • dbfbf9f324 [Attention] Fix FlashMLA metadata builder arguments for q_len > 1 (#27368) Matthew Bonanni 2025-10-23 15:58:15 -04:00
  • ca76486a16 [Chore] Separate out vllm.utils.platform_utils.py (#27374) Jonathan Chen 2025-10-23 15:08:06 -04:00
  • a9f55dc588 [Misc] Add triton_kernels dependency (#27370) Varun Sundar Rabindranath 2025-10-23 15:04:14 -04:00
  • 81d5bb765a [Bugfix] Fix AWQ marlin layer skipping (#27416) Isotr0py 2025-10-24 02:30:28 +08:00
  • 0825197bee [Bugfix][ROCm][DeepSeek] Fix for forward_hip in rope for DeepSeek (#27373) Gregory Shtrasberg 2025-10-23 13:43:53 -04:00
  • 9ef3d5b875 [Bugfix] Fix dp_chunking enablement logic in FusedMoE layer (#27220) Alexander Matveev 2025-10-23 12:03:14 -04:00
  • 295c7f0267 Mirroring the test definitions (2025-10-22) (#27362) Alexei-V-Ivanov-AMD 2025-10-23 11:02:26 -05:00
  • 3fa2c12185 [Frontend][4/N] Improve all pooling task | Add plugin pooling task (#26973) wang.yuqi 2025-10-23 22:46:18 +08:00
  • fe2016de2d [CI/Build] Remove unnecessary flags from test registry (#27353) Cyrus Leung 2025-10-23 22:42:40 +08:00
  • 237cf6d32a [Misc] Remove use of CUDA_VISIBLE_DEVICES for device selection (fix DP slow startup time &c) (#26709) Ilya Markov 2025-10-23 14:58:39 +02:00
  • faee3ccdc2 [Feature] Pydantic validation for speculative.py (#27156) Navya Srivastava 2025-10-23 05:19:33 -07:00
  • 570c3e1cd4 [Bugfix] Honor --mm_encoder_attn_backend when used (#27124) Bradley D 2025-10-23 05:09:52 -07:00
  • 3a4255c7c4 Run mypy on the lowest supported Python version instead of system Python (#27048) Harry Mellor 2025-10-23 13:07:44 +01:00
  • 61089465a6 [Model] Add MoE support for NemotronH (#25863) tomeras91 2025-10-23 13:27:23 +03:00
  • 88afa11010 [Metrics] [KVConnector] Add connector prefix cache hit rate stats (#26245) Tova Movshovitz 2025-10-23 13:21:08 +03:00
  • d00ce29d89 [CI] Reorganize entrypoints tests (#27403) Chauncey 2025-10-23 18:10:06 +08:00
  • 3b7bdf983b add SLA information into comparison graph for vLLM Benchmark Suite (#25525) Louie Tsai 2025-10-23 01:04:59 -07:00
  • 50b788a17a [CI/Build] Fix AMD CI: test_cpu_gpu.py (#27388) Zhewen Li 2025-10-23 00:55:00 -07:00
  • fc059c7061 [Bugfix] Fix args settings for guided decoding args (#27375) Lucia Fang 2025-10-23 00:34:06 -07:00
  • bfb240cc49 [CI/Build] Fix Prithvi plugin test (#27393) Cyrus Leung 2025-10-23 15:30:44 +08:00
  • e255d92990 [Chore] Remove duplicate has_ functions in vllm.utils (#27372) Jonathan Chen 2025-10-23 02:11:59 -04:00
  • 3729ed00ba [Model] Add num_cached_tokens for PoolingRequestOutput (#27378) wang.yuqi 2025-10-23 14:03:42 +08:00
  • 6644796bf4 [V1][spec decode] return logprobs for spec decoding (#26060) Giancarlo Delfin 2025-10-22 22:59:59 -07:00
  • ff93cc8c84 [CORE] Support Prefix Caching with Prompt Embeds (#27219) Andrew Sansom 2025-10-23 00:18:07 -05:00
  • 243ed7d32e [Bugfix][Core] running queue index leakage exception (#26754) PiteXChen 2025-10-23 12:40:12 +08:00
  • 7e0941055f [Bugfix] Fix incorrect kv cache metrics in grafana.json (#27133) fangpings 2025-10-22 20:58:36 -07:00
  • 6738e4a093 [Bugfix] Fix SLA tuner initialization (#27355) Cyrus Leung 2025-10-23 11:43:04 +08:00
  • 2566dca2a9 [Bugfix] Fix deepseek-ocr multi-image inference and add merge_by_field_config=True with tensor schema support (#27361) Isotr0py 2025-10-23 08:15:38 +08:00
  • b4fda58a2d [MLA] Bump FlashMLA (#27354) Matthew Bonanni 2025-10-22 18:48:37 -04:00
  • a0003b56b0 [Chore] Separate out system utilities from vllm.utils (#27201) dongbo910220 2025-10-23 04:25:25 +08:00
  • 5beacce2ea [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (#27128) Daisy-Ma-coder 2025-10-22 12:36:39 -07:00
  • 8669c69afa [Feature] publisher default set zmq in kv_event config (#26915) rongfu.leng 2025-10-23 03:19:33 +08:00
  • 1651003c35 [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (#27211) Sage 2025-10-22 21:13:03 +03:00
  • 1cb8c6c5fe [Doc] Fix numbering sequence in prefix caching (#27357) William Song 2025-10-23 02:35:47 +09:00
  • e05a6754a8 [Model] Revert PR #26715: Restore custom PaliGemma and Gemma3-MM impl… (#27309) Luciano Martins 2025-10-22 14:05:34 -03:00
  • 084a9dae80 [Bugfix] Disable FlexAttention direct block mask building for encoder-only models (#27344) Isotr0py 2025-10-23 00:39:08 +08:00
  • c9461e05a4 Support Anthropic API /v1/messages Endpoint (#22627) (tag: v0.11.1rc2) RED 2025-10-23 00:13:18 +08:00
  • 4dfdb821c8 [P/D] Dynamic kv_output_aggregator collect size (#26734) Nicolò Lucchesi 2025-10-22 18:07:58 +02:00
  • 58fab50d82 [Frontend] Require flag for loading text and image embeds (#27204) Russell Bryant 2025-10-22 11:52:02 -04:00
  • db6f28d898 [Bugfix] Fix HF format InternVL large variants video processing (#27330) Isotr0py 2025-10-22 23:39:23 +08:00
  • 14e2f1231e [Bugfix] Make get_mrope_input_positions instance methods (#27342) Cyrus Leung 2025-10-22 23:38:34 +08:00
  • 7c4767f1eb [NIXL] use Host buffer to support TP_ratio > 1 for XPU (#27140) Chendi.Xue 2025-10-22 10:28:13 -05:00
  • 9771e0b432 [Bugfix] Add missing 'is_internal_router' attribute to FusedMoEWithLoRA (#27351) Jee Jee Li 2025-10-22 23:19:12 +08:00
  • 980de31ca0 [bugfix] remove unused parameters to reduce unnecessary vram usage (#26789) Reinforce-II 2025-10-22 23:16:09 +08:00
  • 1c160841ea [Bug] Fix DeepSeek-V2.5-1210-FP8 issue (#27267) Wentao Ye 2025-10-22 11:00:10 -04:00
  • 4ca13a8667 [NIXL] Terminate handshake listener thread in shutdown (#26404) Mark McLoughlin 2025-10-22 15:59:53 +01:00
  • 675aa2ec64 [Model] Upstream Deepseek-OCR model (#27247) Isotr0py 2025-10-22 22:59:15 +08:00
  • 3ae082c373 [Chore] Separate out optional dependency checks from vllm.utils (#27207) dongbo910220 2025-10-22 22:44:21 +08:00
  • 49c00fe304 Mirroring changes in test-pipeline.yaml into test-amd.yaml (#27242) Alexei-V-Ivanov-AMD 2025-10-22 08:59:45 -05:00
  • 141d3b9fc5 [docs] Update v1 metrics design doc (#27332) Mark McLoughlin 2025-10-22 14:29:15 +01:00
  • abf3db40ef [Core] Handle MoE LoRA edge cases (#27335) Jee Jee Li 2025-10-22 21:14:33 +08:00
  • 8e4ca4d14e Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size' (#27311) gnovack 2025-10-22 05:23:57 -07:00
  • 1a0f4defb7 [Log] Add Warning for LLM(data_parallel_size=k) single-process DP Usage (#27282) Wentao Ye 2025-10-22 08:12:21 -04:00
  • 843af7f7fc [Bugfix][CPU] Disable dual stream execution for experts on CPU (#27320) Li, Jiang 2025-10-22 19:02:27 +08:00
  • 1f633b8632 [Frontend][3/N] Improve all pooling task | Support binary embedding response (#27066) wang.yuqi 2025-10-22 18:38:57 +08:00
  • a4c29e6e82 fixed reasoning streaming with tool_choice="required" (#24108) ExtReMLapin 2025-10-22 11:42:55 +02:00
  • 8f18feb191 Remove last level references not removed in #26355 (#27260) Harry Mellor 2025-10-22 10:18:17 +01:00
  • ed540d6d4c Update release pipeline for PyTorch 2.9.0 (#27303) Huy Do 2025-10-22 02:18:01 -07:00
  • f6027b2855 [1/N][Platform] Cleanup useless function (#26982) wangxiyuan 2025-10-22 17:04:57 +08:00
  • ab3e80042e [torch.compile] Enable silu_mul_fp8_quant fusion without custom ops enabled (#27146) Jiangyun Zhu 2025-10-22 12:22:39 +08:00
  • ceacedc1f9 [Benchmark] Add plot utility for parameter sweep (#27168) Cyrus Leung 2025-10-22 11:30:03 +08:00
  • bfa59be8f1 [CI] Nixl integration tests DP-EP (#27199) Nicolò Lucchesi 2025-10-22 05:17:48 +02:00
  • 265ecb05fb [DOC] [ROCm] Add ROCm quickstart guide (#26505) vllmellm 2025-10-22 11:10:48 +08:00
  • 09a7e6f617 [Deepseek v3.2] Remove extra logics in indexer (#26465) Lain 2025-10-21 16:34:03 -07:00
  • 6c2eef5a5d [P/D] KVConnector for decode benchmarking (#25986) Tyler Michael Smith 2025-10-21 19:30:47 -04:00
  • 19748806f0 [Bugfix] skip cuda graph for drafter when running with eager (#26821) Benjamin Chislett 2025-10-21 18:39:09 -04:00
  • 4a8a567e16 Updated xgrammar backend to not deny supported string formats (#27253) ExtReMLapin 2025-10-22 00:25:23 +02:00
  • 344a0017c0 [Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE (#26440) Alexander Matveev 2025-10-21 17:38:29 -04:00
  • becb7de40b Update PyTorch to 2.9.0+cu129 (#24994) Huy Do 2025-10-21 14:20:18 -07:00
  • 250fb1b8ea [Bugfix] fixes the decoding metadata of dense mla's fp8 kvcache. (#27144) Tao He 2025-10-22 02:27:03 +08:00
  • 647214f3d5 [V0 Deprecation] Remove V0 executors (#27142) Nick Hill 2025-10-21 11:09:37 -07:00