Commit Graph

  • e7eea5a520 [V1][CI] Fix failed v1-test because of min_p (#13316) Woosuk Kwon 2025-02-14 17:29:51 -08:00
  • a12934d3ec [V1][Core] min_p sampling support (#13191) Aoyu 2025-02-15 07:50:05 +08:00
  • 3bcb8c75da [Core] Reduce TTFT with concurrent partial prefills (#10235) Joe Runde 2025-02-14 16:36:07 -07:00
  • 5e5c8e091e [Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (#13236) Michael Goin 2025-02-14 15:53:42 -05:00
  • c9e2d644e7 [Hardware][Gaudi][Bugfix] Fix error for guided decoding (#12317) Yu-Zhou 2025-02-14 20:36:49 +08:00
  • 7734e9a291 [Core] choice-based structured output with xgrammar (#12632) Russell Bryant 2025-02-14 07:36:05 -05:00
  • 6224a9f620 Support logit_bias in v1 Sampler (#13079) Lu Fang 2025-02-14 04:34:59 -08:00
  • 085b7b2d6c [V1] Simplify GPUModelRunner._update_states check (#13265) Nick Hill 2025-02-14 04:33:43 -08:00
  • 4da1f667e9 [VLM] Keep track of whether prompt replacements have been applied (#13215) Cyrus Leung 2025-02-14 20:20:46 +08:00
  • 556ef7f714 [Misc] Log time consumption of sleep and wake-up (#13115) Jun Duan 2025-02-14 07:10:21 -05:00
  • 83481ceb49 [Bugfix] Fix missing parentheses (#13263) Xu Song 2025-02-14 17:07:10 +08:00
  • 185cc19f92 [Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch (#12927) Pooya Davoodi 2025-02-14 00:22:42 -08:00
  • 45f90bcbba [WIP] TPU V1 Support Refactored (#13049) Alexander Matveev 2025-02-14 03:21:53 -05:00
  • b0ccfc565a [Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (#13126) Kero Liang 2025-02-14 14:39:20 +08:00
  • ba59b78a9c [ROCm][V1] Add intial ROCm support to V1 (#12790) Sage Moore 2025-02-13 22:21:50 -08:00
  • cbc40128eb [V1] LoRA - Enable Serving Usecase (#12883) Varun Sundar Rabindranath 2025-02-14 11:51:12 +05:30
  • f0b2da72a8 Expand MLA to support most types of quantization (#13181) Michael Goin 2025-02-14 01:19:22 -05:00
  • f2b20fe491 Consolidate Llama model usage in tests (#13094) Harry Mellor 2025-02-14 06:18:03 +00:00
  • 40932d7a05 [Misc] Remove redundant statements in scheduler.py (#13229) Wang Ran (汪然) 2025-02-14 14:07:25 +08:00
  • 84683fa271 [Bugfix] Offline example of disaggregated prefill (#13214) XiaobingZhang 2025-02-14 12:20:47 +08:00
  • 067678262a [Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config (#13237) Tyler Michael Smith 2025-02-13 23:19:43 -05:00
  • 09545c0a94 [Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250) Tyler Michael Smith 2025-02-13 23:19:25 -05:00
  • dd5ede4440 [V1] Consolidate MM cache size to vllm.envs (#13239) Roger Wang 2025-02-13 20:19:03 -08:00
  • 8c32b08a86 [Kernel] Fix awq error when n is not divisable by 128 (#13227) Jinzhen Lin 2025-02-14 12:07:05 +08:00
  • 410886950a [ROCm] Avoid using the default stream on ROCm (#13238) Gregory Shtrasberg 2025-02-13 20:29:26 -05:00
  • e38be640e6 Revert "Add label if pre-commit passes" (#13242) Harry Mellor 2025-02-14 00:12:32 +00:00
  • c1e37bf71b [Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198) Tyler Michael Smith 2025-02-13 19:01:14 -05:00
  • 2344192a55 Optimize moe_align_block_size for deepseek_v3 (#12850) Michael Goin 2025-02-13 18:43:37 -05:00
  • bffddd9a05 Add label if pre-commit passes (#12527) Harry Mellor 2025-02-13 20:51:30 +00:00
  • d84cef76eb [Frontend] Add /v1/audio/transcriptions OpenAI API endpoint (#12909) Nicolò Lucchesi 2025-02-13 16:23:45 +01:00
  • 37dfa60037 [Bugfix] Missing Content Type returns 500 Internal Server Error (#13193) Vaibhav Jain 2025-02-13 20:22:22 +05:30
  • 1bc3b5e71b [VLM] Separate text-only and vision variants of the same model architecture (#13157) Cyrus Leung 2025-02-13 22:19:15 +08:00
  • 02ed8a1fbe [Misc] Qwen2.5-VL Optimization (#13155) 2025-02-13 22:17:57 +08:00
  • 2092a6fa7d [V1][Core] Add worker_base for v1 worker (#12816) Aoyu 2025-02-13 20:35:18 +08:00
  • c9d3ecf016 [VLM] Merged multi-modal processor for Molmo (#12966) Cyrus Leung 2025-02-13 20:34:00 +08:00
  • fdcf64d3c6 [V1] Clarify input processing and multimodal feature caching logic (#13211) Roger Wang 2025-02-13 03:43:24 -08:00
  • 578087e56c [Frontend] Pass pre-created socket to uvicorn (#13113) Russell Bryant 2025-02-13 03:51:46 -05:00
  • fa253f1a70 [VLM] Remove input processor from clip and siglip (#13165) Isotr0py 2025-02-13 16:31:37 +08:00
  • 9605c1256e [V1][core] Implement pipeline parallel on Ray (#12996) Rui Qiao 2025-02-13 00:02:46 -08:00
  • 0ccd8769fb [CI/Build] Allow ruff to auto-fix some issues (#13180) Russell Bryant 2025-02-13 02:45:38 -05:00
  • cb944d5818 Allow Unsloth Dynamic 4bit BnB quants to work (#12974) Daniel Han 2025-02-12 23:13:08 -08:00
  • d46d490c27 [Frontend] Move CLI code into vllm.cmd package (#12971) Russell Bryant 2025-02-13 02:12:21 -05:00
  • 04f50ad9d1 [Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case (#13097) LikeSundayLikeRain 2025-02-13 02:11:26 -05:00
  • 60c68df6d1 [Build] Automatically use the wheel of the base commit with Python-only build (#13178) Cody Yu 2025-02-12 23:10:28 -08:00
  • 009439caeb Simplify logic of locating CUDART so file path (#13203) Lu Fang 2025-02-12 21:52:41 -08:00
  • bc55d13070 [VLM] Implement merged multimodal processor for Mllama (#11427) Isotr0py 2025-02-13 12:26:21 +08:00
  • d88c8666a1 [Bugfix][Example] Fix GCed profiling server for TPU (#12792) Michael Goin 2025-02-12 22:52:11 -05:00
  • 4fc5c23bb6 [NVIDIA] Support nvfp4 quantization (#12784) Kaixi Hou 2025-02-12 19:51:51 -08:00
  • 9f9704dca6 [perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance (#12706) Kevin H. Luu 2025-02-12 19:51:33 -08:00
  • 8eafe5eaea [CI/Build] Ignore ruff warning up007 (#13182) Russell Bryant 2025-02-12 22:48:31 -05:00
  • 4c0d93f4b2 [V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort (#13173) Murali Andoorveedu 2025-02-12 12:58:11 -08:00
  • 14b7899d10 [CI] Fix failing FP8 cpu offload test (#13170) Michael Goin 2025-02-12 14:16:06 -05:00
  • 09972e716c [Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity (#13119) Michael Goin 2025-02-12 12:19:53 -05:00
  • 36a08630e8 [CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control (#7086) Qubitium-ModelCloud 2025-02-13 01:19:43 +08:00
  • 2c2b560f48 [CI/Build] Use mypy matcher for pre-commit CI job (#13162) Russell Bryant 2025-02-12 12:12:22 -05:00
  • 042c3419fa Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path (#12998) Lu Fang 2025-02-12 09:06:13 -08:00
  • 82cabf53a3 [Misc] Delete unused LoRA modules (#13151) Jee Jee Li 2025-02-13 00:58:24 +08:00
  • 314cfade02 [Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral (#12332) Rafael Vasquez 2025-02-12 11:29:56 -05:00
  • 985b4a2b19 [Bugfix] Fix num video tokens calculation for Qwen2-VL (#13148) Cyrus Leung 2025-02-12 19:55:23 +08:00
  • f4d97e4fc2 [Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request (#13108) bnellnm 2025-02-12 05:39:16 -05:00
  • f1042e86f0 [Misc] AMD Build Improvements (#12923) Shiyan Deng 2025-02-12 02:36:10 -08:00
  • 7c4033acd4 Further reduce the HTTP calls to huggingface.co (#13107) Maximilien de Bayser 2025-02-12 07:34:09 -03:00
  • d59def4730 Bump actions/setup-python from 5.3.0 to 5.4.0 (#12672) dependabot[bot] 2025-02-12 16:41:22 +08:00
  • 0c7d9effce Bump helm/chart-testing-action from 2.6.1 to 2.7.0 (#12463) dependabot[bot] 2025-02-12 16:41:06 +08:00
  • dd3b4a01f8 Bump actions/stale from 9.0.0 to 9.1.0 (#12462) dependabot[bot] 2025-02-12 00:40:25 -08:00
  • a0597c6b75 Bump helm/kind-action from 1.10.0 to 1.12.0 (#11612) dependabot[bot] 2025-02-12 00:40:19 -08:00
  • e92694b6fe [Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921) Lingfan Yu 2025-02-11 21:12:37 -08:00
  • 842b0fd402 [ci] Add more source file dependencies for some tests (#13123) Kevin H. Luu 2025-02-11 20:38:10 -08:00
  • 974dfd4971 [Model] IBM/NASA Prithvi Geospatial model (#12830) Christian Pinto 2025-02-12 04:34:30 +00:00
  • 3ee696a63d [RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM (#12518) Keyun Tong 2025-02-11 20:25:58 -08:00
  • 72c2b68dc9 [Misc] Move pre-commit suggestion back to the end (#13114) Russell Bryant 2025-02-11 17:34:16 -05:00
  • 14ecab5be2 [Bugfix] Guided decoding falls back to outlines when fails to import xgrammar (#12976) Yuan Tang 2025-02-11 13:17:44 -05:00
  • deb6c1c6b4 [Doc] Improve OpenVINO installation doc (#13102) Harry Mellor 2025-02-11 18:02:46 +00:00
  • 565c1efa65 [CI/Build][Bugfix] Fix CPU backend default threads num (#13077) Li, Jiang 2025-02-12 00:55:56 +08:00
  • 2b25b7d2e1 Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 (#13023) Szymon Ożóg 2025-02-11 17:38:48 +01:00
  • 6c4dbe23eb [BugFix] Pop instead of del CUDA_VISIBLE_DEVICES (#12962) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-02-11 18:21:50 +02:00
  • 21f5d50fa5 [Bugfix] Do not use resource module on Windows (#12858) (#13029) MoonRide303 2025-02-11 17:21:18 +01:00
  • bf3e05215c [Misc] Fix typo at comments at metrics.py (#13024) Jewon Lee 2025-02-12 01:20:37 +09:00
  • ad9776353e Set torch_dtype in TransformersModel (#13088) Harry Mellor 2025-02-11 15:51:19 +00:00
  • 75e6e14516 [V1][Metrics] Add several request timing histograms (#12644) Mark McLoughlin 2025-02-11 15:14:00 +00:00
  • 110f59a33e [Bugfix] fix flaky test (#13089) மனோஜ்குமார் பழனிச்சாமி 2025-02-11 20:11:20 +05:30
  • 2e3b969ec0 [Platform] add pre_register_and_update function (#12432) wangxiyuan 2025-02-11 22:06:46 +08:00
  • da317197dd [Build] Fix cuda link target of cumem_allocator in CPU env (#12863) Yuhong Guo 2025-02-11 21:55:57 +08:00
  • 7539bbc6a6 [ROCm] Using a more precise memory profiling (#12624) Gregory Shtrasberg 2025-02-11 08:47:10 -05:00
  • 9cf4759493 [executor] init local_rank as device index (#13027) Mengqing Cao 2025-02-11 21:20:53 +08:00
  • 41c5dd45b9 [V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592) Cody Yu 2025-02-11 00:27:25 -08:00
  • fc6485d277 [Bugfix]: Reasoning output bug according to the chat template change (#13025) Ce Gao 2025-02-11 15:49:03 +08:00
  • 78a141d768 [Misc] LoRA - Refactor Punica ops tests (#12970) Varun Sundar Rabindranath 2025-02-11 12:56:03 +05:30
  • c320ca8edd [Core] Don't do platform detection at import time (#12933) Russell Bryant 2025-02-11 02:25:25 -05:00
  • 58047c6f04 [Benchmark] Add BurstGPT to benchmark_serving (#13063) Woosuk Kwon 2025-02-10 21:25:30 -08:00
  • cb080f32e3 [Bugfix] Support missing tool parameters in mistral tokenizer (#12884) Florian Greinacher 2025-02-11 04:33:33 +01:00
  • 2c0f58203c [Docs] Annouce Meta Meetup (#13065) Simon Mo 2025-02-10 18:24:29 -08:00
  • 2ff4857678 [V1][Minor] Move scheduler outputs to a separate file (#13062) Woosuk Kwon 2025-02-10 18:10:06 -08:00
  • 91e876750e [misc] Fix setup.py condition to avoid AMD from being mistaken with CPU (#13022) Kevin H. Luu 2025-02-10 18:06:16 -08:00
  • 08b2d845d6 [Model] Ultravox Model: Support v0.5 Release (#12912) Farzad Abdolhosseini 2025-02-10 14:02:48 -08:00
  • 2ae889052c Fix seed parameter behavior in vLLM (#13007) மனோஜ்குமார் பழனிச்சாமி 2025-02-10 20:56:50 +05:30
  • 51f0b5f7f6 [Bugfix] Clean up and fix multi-modal processors (#13012) Cyrus Leung 2025-02-10 18:45:21 +08:00
  • fde71262e0 [misc] Add retries with exponential backoff for HF file existence check (#13008) Kevin H. Luu 2025-02-10 01:15:02 -08:00
  • 243137143c [Doc] Add link to tool_choice tracking issue in tool_calling.md (#13003) Yuan Tang 2025-02-10 01:09:33 -05:00
  • b2496bb07f [core] fix sleep mode and pytorch checkpoint compatibility (#13001) youkaichao 2025-02-10 13:03:43 +08:00