Commit Graph

  • 950751a987 [v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders (#17483) Chen Zhang 2025-05-11 07:12:04 +08:00
  • 4c31218f80 [Misc] remove --model from vllm serve usage (#17944) Reid 2025-05-10 21:23:31 +08:00
  • 68311891f5 Don't default construct ModelConfig when default constructing VllmConfig (#17943) Harry Mellor 2025-05-10 14:23:00 +01:00
  • fc4441a4ee Add missing content type headers to /ping and /health (#17036) (#17786) Ximo Guanter 2025-05-10 08:13:32 +02:00
  • 246e3e0a36 fix broken test vllm:test_kernels - test_attention_selector.py::test_flash_attn (#17873) tracelogfb 2025-05-09 19:46:54 -07:00
  • 7042cc96b0 [V1][Spec Decoding] Log accumulated metrics after system goes idle (#17913) Mark McLoughlin 2025-05-10 02:23:07 +01:00
  • 0c0fdae84f [Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model (#16362) Pavani Majety 2025-05-09 16:24:41 -07:00
  • 3b602cdea7 AMD conditional all test execution // new test groups (#17556) Alexei-V-Ivanov-AMD 2025-05-09 17:35:58 -05:00
  • 4b2ed7926a Improve configs - the rest! (#17562) Harry Mellor 2025-05-09 23:18:44 +01:00
  • 7e3571134f [V1][Spec Decoding] Include bonus tokens in mean acceptance length (#17908) Mark McLoughlin 2025-05-09 21:32:36 +01:00
  • ea2236bf95 Add option to use torch._inductor.standalone_compile (#17057) Richard Zou 2025-05-09 15:59:04 -04:00
  • 7d4aedae7c Handle error when str passed to /v1/audio/transcriptions (#17909) Harry Mellor 2025-05-09 20:23:59 +01:00
  • 22481fbfa3 Update CT WNA16MarlinMoE integration (#16666) Michael Goin 2025-05-09 11:19:45 -06:00
  • 5c4c08f6f1 [Misc] Auto fallback to float16 for pre-Ampere GPUs when detected bfloat16 config (#17265) Isotr0py 2025-05-10 01:16:12 +08:00
  • c44c384b1c [Misc] Add references in ray_serve_deepseek example (#17907) Rui Qiao 2025-05-09 09:59:36 -07:00
  • 85b72cb7b1 Revert "[BugFix][AMD] Compatible patch for latest AITER(05/07/2025)" (#17910) Michael Goin 2025-05-09 09:58:18 -06:00
  • 6e5595ca39 [CI/Build] Automatically retry flaky tests (#17856) Cyrus Leung 2025-05-09 23:55:17 +08:00
  • 200da9a517 [v1] Move block management logic from KVCacheManager to SpecializedManager (#17474) Chen Zhang 2025-05-09 23:25:34 +08:00
  • 9f64e93415 [BugFix][AMD] Compatible patch for latest AITER(05/07/2025) (#17864) qli88 2025-05-09 09:59:36 -05:00
  • ec61ea20a8 [Misc] add dify integration (#17895) Reid 2025-05-09 18:42:39 +08:00
  • c6798baa9c Change top_k to be disabled with 0 (still accept -1 for now) (#17773) Harry Mellor 2025-05-09 11:01:49 +01:00
  • 5b2dcbf0b8 Fix Whisper crash caused by invalid ``max_num_batched_tokens`` config (#17853) inkcherry 2025-05-09 17:16:26 +08:00
  • 6e4a93e3f7 [Bugfix][CPU] Fix broken AVX2 CPU TP support (#17252) Isotr0py 2025-05-09 16:55:14 +08:00
  • 217db4baa6 [Bugfix][ROCm] Fix AITER MLA V1 (#17880) vllmellm 2025-05-09 16:38:21 +08:00
  • ff8c400502 [Doc] remove visible token in doc (#17884) Yan Ma 2025-05-09 16:21:31 +08:00
  • 89a0315f4c [Doc] Update several links in reasoning_outputs.md (#17846) Michael Yao 2025-05-09 16:20:55 +08:00
  • 3d1e387652 [Docs] Add Slides from NYC Meetup (#17879) Simon Mo 2025-05-08 21:46:54 -07:00
  • d310e6de98 [BUGFIX]: return fast when request requires prompt logprobs (#17251) Ning Xie 2025-05-09 12:25:41 +08:00
  • 5e6f939484 [Attention] MLA move rotary embedding to cuda-graph region (#17668) Lucas Wilkinson 2025-05-08 23:14:42 -04:00
  • 760e3ecc8f [V1][Structured Output] Update llguidance (>= 0.7.11) to avoid AttributeError (no StructTag) (#17839) Shanshan Shen 2025-05-09 11:14:18 +08:00
  • 3c9396a64f [FEAT][ROCm]: Support AITER MLA on V1 Engine (#17523) vllmellm 2025-05-09 10:42:05 +08:00
  • 376786fac1 Add cutlass support for blackwell fp8 blockwise gemm (#14383) Shu Wang 2025-05-08 17:09:55 -05:00
  • 4f605a6de5 Fix noisy warning for uncalibrated q_scale/p_scale (#17414) Michael Goin 2025-05-08 15:56:59 -04:00
  • 8342e3abd1 [CI] Prune down lm-eval small tests (#17012) Michael Goin 2025-05-08 15:00:26 -04:00
  • a83a0f92b5 [Test] Attempt all TPU V1 tests, even if some of them fail. (#17334) yarongmu-google 2025-05-08 10:20:54 -07:00
  • 226a4272cf [V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging (#17860) Russell Bryant 2025-05-08 12:57:35 -04:00
  • ec54d73c31 [CI] Fix test_collective_rpc (#17858) Russell Bryant 2025-05-08 12:47:12 -04:00
  • a944f8ede7 [Misc] Delete LoRA-related redundancy code (#17841) Jee Jee Li 2025-05-08 21:02:21 +08:00
  • 015815fe01 [Bugfix] use_fast failing to be propagated to Qwen2-VL image processor (#17838) Cyrus Leung 2025-05-08 20:39:21 +08:00
  • e4ca6e3a99 Fix transient dependency error in docs build (#17848) Harry Mellor 2025-05-08 11:42:03 +01:00
  • 53d0cb7423 [Misc] add chatbox integration (#17828) Reid 2025-05-08 18:05:26 +08:00
  • f50dcb7c21 [Easy] Eliminate c10::optional usage in vllm/csrc (#17819) Lu Fang 2025-05-08 03:05:10 -07:00
  • a1e19b635d [Doc] Fix a typo in the file name (#17836) Cyrus Leung 2025-05-08 18:04:18 +08:00
  • bb239a730f [Bugfix] Fix quark fp8 format loading on AMD GPUs (#12612) fxmarty-amd 2025-05-08 11:53:53 +02:00
  • a463555dee [TPU] Fix the test_sampler (#17820) Jevin Jiang 2025-05-08 02:51:33 -07:00
  • ca04b97c93 [Bugfix] Fix tool call template validation for Mistral models (#17644) Rick Yuan 2025-05-08 17:47:19 +08:00
  • 0a9bbaa104 [Misc] support model prefix & add deepseek vl2 tiny fused moe config (#17763) xsank 2025-05-08 15:50:22 +08:00
  • 39956efb3f [Bugfix] Fix bad words for Mistral models (#17753) Qiong Zhou Huang 2025-05-07 23:32:10 -07:00
  • 597051e56f [Qwen3]add qwen3-235b-bf16 fused moe config on A100 (#17715) Ximingwang-09 2025-05-08 14:09:32 +08:00
  • 96722aa81d [Frontend] Chat template fallbacks for multimodal models (#17805) Cyrus Leung 2025-05-08 14:05:54 +08:00
  • 843b222723 [Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU (#17648) Agata Dobrzyniewicz 2025-05-08 07:37:03 +02:00
  • e515668edf [Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER (#17153) Akash kaothalkar 2025-05-08 11:05:03 +05:30
  • 5a499e70d5 [Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs (#17071) Hashem Hashemi 2025-05-07 22:34:49 -07:00
  • 6930a41116 [V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490) Russell Bryant 2025-05-08 01:34:02 -04:00
  • 998eea4a0e Only log non-default CLI args for online serving (#17803) Harry Mellor 2025-05-08 06:33:29 +01:00
  • c747d84576 [Installation] OpenTelemetry version update (#17771) Mikhail Podvitskii 2025-05-08 07:32:49 +02:00
  • b2da14a05a Improve exception reporting in MP engine (#17800) Vadim Markovtsev 2025-05-08 07:32:39 +02:00
  • 7ea2adb802 [Core] Support full cuda graph in v1 (#16072) Chanh Nguyen 2025-05-07 22:30:15 -07:00
  • 3d13ca0e24 [BugFix] Fix --disable-log-stats in V1 server mode (#17600) Nick Hill 2025-05-07 21:08:15 -07:00
  • 66ab3b13c9 Don't call the venv vllm (#17810) Harry Mellor 2025-05-08 05:06:39 +01:00
  • a8238bbdb0 [Chore][Doc] uses model id determined from OpenAI client (#17815) Aaron Pham 2025-05-07 21:48:57 -04:00
  • d43f914d42 [Core][Feature] Input metadata dump on crash (#13407) Wallas Henrique 2025-05-07 19:15:09 -03:00
  • ed5272cf21 [BugFix] Avoid secondary missing MultiprocExecutor.workers error (#17811) Nick Hill 2025-05-07 14:55:04 -07:00
  • c20ef40fd0 [Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend (#14238) Akshat Tripathi 2025-05-07 21:28:47 +01:00
  • db593aa67f [Quantization] Quark MXFP4 format loading (#16943) Bowen Bao 2025-05-07 12:05:05 -07:00
  • f98e307588 [Bugfix] Fix missing lora name mapping for lora without prefix (#17793) Isotr0py 2025-05-08 00:17:12 +08:00
  • 646a31e51e Fix and simplify deprecated=True CLI kwarg (#17781) Harry Mellor 2025-05-07 16:51:06 +01:00
  • be8ff88e66 [Bugfix] Fix Video IO error for short video (#17791) Isotr0py 2025-05-07 23:36:06 +08:00
  • 1a6af1453d Only depend on importlib-metadata for Python < 3.10 (#17776) Christian Heimes 2025-05-07 16:51:06 +02:00
  • 32aa74c09c [ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention (#17139) Gregory Shtrasberg 2025-05-07 10:12:35 -04:00
  • 7377dd0307 [doc] update the issue link (#17782) Reid 2025-05-07 20:29:05 +08:00
  • 98c89e16ff Make key optional for rotary embedding (#17566) Yong Hoon Shin 2025-05-07 00:11:46 -07:00
  • 324a3119b0 Fix test_memory_usage_no_spec (#17754) Yong Hoon Shin 2025-05-07 00:10:33 -07:00
  • 8a15c2603a [Frontend] Add missing chat templates for various MLLMs (#17758) Cyrus Leung 2025-05-07 15:10:01 +08:00
  • 043e4c4955 Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling (#16357) Satyajith Chilappagari 2025-05-07 00:07:30 -07:00
  • ba7703e659 [Misc] Remove qlora_adapter_name_or_path (#17699) Jee Jee Li 2025-05-07 14:10:37 +08:00
  • f80ae5bdcf [Kernel] Use fused rmsnorm for some models like qwen3 series (#17735) Wanrui Dai 2025-05-07 14:10:02 +08:00
  • 1a45a61387 [Kernel] GGUF MoeVec kernel (#16780) Szymon Ożóg 2025-05-07 14:07:23 +08:00
  • c3e9d5060e [Misc] Use apply_rotary_emb from vllm_flash_attn for Qwen2-VL vision RoPE (#17726) Isotr0py 2025-05-07 12:51:33 +08:00
  • 822de7fb94 [Misc] Split model loader (#17712) Jee Jee Li 2025-05-07 12:42:26 +08:00
  • 8d84d836d1 [BugFix][Spec Decode] Fix hidden size mismatch between target and eagle head (#17740) Woosuk Kwon 2025-05-06 19:51:26 -07:00
  • 950b71186f Replace lm-eval bash script with pytest and use enforce_eager for faster CI (#17717) Michael Goin 2025-05-06 21:00:10 -04:00
  • e50a1f1a9c [TPU] Add kernel test for moe_pallas (#17496) Michael Goin 2025-05-06 20:59:57 -04:00
  • a17cef70ea Removed unused marlin cuda code (#17684) Michael Goin 2025-05-06 20:59:47 -04:00
  • 18dd5e01f2 [Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels (#17146) Chih-Chieh Yang 2025-05-06 20:59:30 -04:00
  • 6de3e13413 Add logging for torch nightly version (#17669) Yang Wang 2025-05-06 17:45:51 -07:00
  • ed3a1d2106 [ROCm] fix num_stages for default moe config to avoid triton OutOfResource error (#17744) Hongxia Yang 2025-05-06 20:39:48 -04:00
  • 022afbeb4e Fix doc build performance (#17748) Harry Mellor 2025-05-07 01:36:41 +01:00
  • 2f925e5777 [Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode (#16828) Thomas Parnell 2025-05-06 18:21:48 -04:00
  • de906b95f9 [Bugfix] Fix for the condition to accept empty encoder inputs for mllama (#17732) Gregory Shtrasberg 2025-05-06 15:59:06 -04:00
  • d456aea71f [Misc] Add Next Edit Prediction (NEP) datasets support in benchmark_serving.py (#16839) d.transposed 2025-05-06 21:38:45 +02:00
  • 621ca2c0ab [TPU] Increase block size and reset block shapes (#16458) Jevin Jiang 2025-05-06 10:55:04 -07:00
  • 6115b11582 Make right sidebar more readable in "Supported Models" (#17723) Harry Mellor 2025-05-06 17:48:26 +01:00
  • 5b8c390747 [Bugfix] Fix modality limits in vision language example (#17721) Cyrus Leung 2025-05-07 00:12:28 +08:00
  • 7525d5f3d5 [doc] Add RAG Integration example (#17692) Reid 2025-05-07 00:10:23 +08:00
  • aabcd2cae3 [v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager (#17479) Chen Zhang 2025-05-06 23:50:34 +08:00
  • 0d115460a7 [Docs] Use gh-file to add links to tool_calling.md (#17709) Michael Yao 2025-05-06 23:27:19 +08:00
  • 175bda67a1 [Feat] Add deprecated=True to CLI args (#17426) Aaron Pham 2025-05-06 11:11:27 -04:00
  • cba31c47c4 [v1] AttentionMetadata for each layer (#17394) Chen Zhang 2025-05-06 22:58:37 +08:00
  • a6fed02068 [V1][PP] Support PP for MultiprocExecutor (#14219) Li, Jiang 2025-05-06 22:58:05 +08:00