Commit Graph

  • 879f69bed3 [Refactor] Remove duplicate ceil_div (#20023) Wentao Ye 2025-06-25 01:19:09 -04:00
  • 7108934142 [Frontend] speed up import time of vllm.config (#18036) David Xia 2025-06-25 00:41:11 -04:00
  • 3443aaf8dd Move to a faster base64 implementation (#19984) h-avsha 2025-06-25 06:33:51 +03:00
  • 2273ec322c Revert "Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor" (#20030) Isotr0py 2025-06-25 11:23:29 +08:00
  • a6c4b87fbc Revert "[Feature] Integrate new deepgemm (#19820)" (#20049) Wentao Ye 2025-06-24 22:45:22 -04:00
  • 1afa9948f5 [Llama4] Update attn_temperature_tuning (#19997) Brayden Zhong 2025-06-24 22:42:53 -04:00
  • 0d06b533a0 cmake: Update vllm_flash_attn for vllm_kernels (#20032) Eli Uriegas 2025-06-24 15:44:10 -07:00
  • c01d1c5aba use .dev for version comparison with pytorch nightly release (#20031) Boyuan Feng 2025-06-24 14:52:16 -07:00
  • ead369845d [Easy] Remove submodule added in #19463 (#20039) Brayden Zhong 2025-06-24 16:23:15 -04:00
  • c6e3bba8e6 [Feature] Integrate new deepgemm (#19820) Wentao Ye 2025-06-24 15:51:56 -04:00
  • 91f7d9d0b6 [P/D] Asynchronously do _nixl_handshake (#19836) lkchen 2025-06-24 12:46:10 -07:00
  • 8619e7158c [BugFix] Fix multi-node offline data parallel (#19937) Nick Hill 2025-06-24 12:45:20 -07:00
  • c635c5f744 [Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client. (#19423) d.transposed 2025-06-24 20:41:49 +02:00
  • a045b7e89a [Perf] Improve/Fix-regression for FA3 in High QPS regimes (#19463) Lucas Wilkinson 2025-06-24 13:09:01 -04:00
  • 981eeca41a [Fix][V1] Remove --scheduling-policy oracle (#20010) amit 2025-06-24 19:52:15 +03:00
  • 26d34eb67e refactor example - qwen3_reranker (#19847) Reid 2025-06-24 22:03:20 +08:00
  • 53da4cd397 [Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 (#20014) Li, Jiang 2025-06-24 21:20:04 +08:00
  • 9a3b88328f [PERF] Speedup of MRoPE prepare inputs (#19939) Vadim Gimpelson 2025-06-24 10:01:26 +04:00
  • 3014c920da add some examples for other benchmark scripts (#19893) Reid 2025-06-24 13:57:46 +08:00
  • 0eed516951 [doc] Fix broken link in the installation for CPU (#19980) Kay Yan 2025-06-24 12:04:11 +08:00
  • ee5ad8d2c5 [Misc][Tools][Benchmark] Add profile to autotune script (#19711) Chenyaaang 2025-06-23 17:59:41 -07:00
  • a738dbb2a1 Update test case parameter to have the throughput above 8.0 (#19994) QiliangCui 2025-06-23 17:18:10 -07:00
  • 33d5e29be9 [TPU] Fix tpu model runner test (#19995) Chenyaaang 2025-06-23 16:04:28 -07:00
  • 4671ac6e2a [Bugfix][Benchmark] Fix Marlin benchmark (#19929) 22quinn 2025-06-23 15:25:12 -07:00
  • dd2ccf8dde Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend (#19395) Jun-Howie 2025-06-24 06:23:28 +08:00
  • a3bc76e4b5 [CI/Build] Push latest tag for cpu and neuron docker image (#19897) 22quinn 2025-06-23 14:15:37 -07:00
  • e6327c9b3e [Feature] Support sequence parallelism for static fp8 quantization (#19181) cascade 2025-06-23 13:09:02 -07:00
  • d0132f025d [Misc] Add type alias ReqId and EngineId for better readability (#19880) lkchen 2025-06-23 12:57:57 -07:00
  • 61f4fc5dc6 [Bugfix][v1] Fix step pooler implementation and step pooling usage in v1 (#19956) Isotr0py 2025-06-24 02:38:06 +08:00
  • 68aaeb3749 [EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case (#19885) Tyler Michael Smith 2025-06-23 14:07:47 -04:00
  • c3649e4fee [Docs] Fix syntax highlighting of shell commands (#19870) Lukas Geiger 2025-06-23 18:59:09 +01:00
  • 53243e5c42 [doc] improve readability for long commands (#19920) Reid 2025-06-23 22:27:07 +08:00
  • a6e6604d32 [Bugfix] Fix CI bitsandbytes failure (#19969) Jee Jee Li 2025-06-23 21:30:55 +08:00
  • b82e0f82cb [doc] use MkDocs collapsible blocks - supplement (#19973) Reid 2025-06-23 18:54:16 +08:00
  • 5111642a6f [Doc] Update V1 status for decoder-only embedding models (#19952) Isotr0py 2025-06-23 17:31:06 +08:00
  • 1bcd15edc7 [BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when *all* transfer done (#19874) lkchen 2025-06-22 22:41:53 -07:00
  • 2ebff5b77c [P/D][NixlConnector] Support tp_size > num_kv_heads deployments (#19691) Nicolò Lucchesi 2025-06-23 07:41:50 +02:00
  • f17aec0d63 [doc] Fold long code blocks to improve readability (#19926) Reid 2025-06-23 13:24:23 +08:00
  • 493c275352 Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor (#19643) Vensen 2025-06-23 11:40:28 +08:00
  • f39ab2d4bd [Misc] Configurable timeout for execute_model RPC calls via env var (#19544) jinqinn 2025-06-23 11:36:26 +08:00
  • 4a0f7888a3 [Core] feat: Implement Priority Scheduling in V1 Engine (#19057) amit 2025-06-23 06:18:08 +03:00
  • c4cf260677 [Perf][CLI] Improve overall startup time (#19941) Aaron Pham 2025-06-22 19:11:22 -04:00
  • 33d51f599e [BugFix] Add an env to disable moe chunking to work around compile incompatibility (#19642) Ye (Charlotte) Qi 2025-06-22 15:17:49 -07:00
  • e91386cde1 [Chore] dedup logs (#19955) Aaron Pham 2025-06-22 15:43:07 -04:00
  • 2c11a29f0b [Misc] Simplify vllm bench cli subcommand implementation (#19948) Ye (Charlotte) Qi 2025-06-22 09:34:48 -07:00
  • c76a506bd6 [Misc] Update model-specific PR tagging (#19949) Roger Wang 2025-06-22 05:16:08 -07:00
  • ec0db6f51c [doc] use snippets for contact us (#19944) Reid 2025-06-22 18:26:13 +08:00
  • c305a2109d [CI/Build] Auto tag perf benchmarks related PRs (#19943) 22quinn 2025-06-22 01:46:21 -07:00
  • 202c5df935 [Benchmark] fix request loss if "ping" is returned (#19535) Wang, Yi 2025-06-22 15:21:04 +08:00
  • 2bb246b8f7 [MISC] add cpu_kvcache_space_bytes to CacheConfig (#19812) Ning Xie 2025-06-22 13:39:09 +08:00
  • 4c409cabc2 [Misc] add vllm_config in __init__ (#19866) Ning Xie 2025-06-22 11:10:46 +08:00
  • 3b1e4c6a23 [Docs] Add GPT2ForSequenceClassification to supported models in docs (#19932) Adrian 2025-06-21 22:57:19 +02:00
  • 2c5302fadd [Multimodal] Optimize Qwen2/2.5-VL startup time (#19756) Woosuk Kwon 2025-06-21 13:01:07 -07:00
  • caa680fd2e [doc] add contact us in community (#19922) Reid 2025-06-22 01:29:06 +08:00
  • c3bf9bad11 [New model support]Support Tarsier2 (#19887) 汪志鹏 2025-06-21 12:01:51 +08:00
  • 6f170f11dd [Bugfix] Fix bnb 8bit model weights loading (#19917) Isotr0py 2025-06-21 11:29:09 +08:00
  • 8ca81bb069 Fix: Check the type of params to be a Sequence not list. (#19910) Rabin Adhikari 2025-06-21 01:03:17 +02:00
  • e773a9e1c2 [Misc] Clean up useless code (#19889) wangxiyuan 2025-06-21 05:09:09 +08:00
  • 71baf85ae1 [Kernel] mark TorchSDPABackend swap_blocks NotImplementedError (#19749) Ning Xie 2025-06-21 02:18:11 +08:00
  • 79f2f1c2a1 [CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests (#19901) Li, Jiang 2025-06-20 23:30:36 +08:00
  • 2e3e3c86dc Export NaNs in logits to scheduler_stats if output is corrupted (#18777) Vlad Tiberiu Mihailescu 2025-06-20 07:47:16 -07:00
  • 7e8977fcd4 [custom_op][vllm-plugin] update custom_op class to use op_registry (#19164) Chendi.Xue 2025-06-20 09:44:56 -05:00
  • f1e840e842 [Model] GPT2ForSequenceClassification model (#19663) Adrian 2025-06-20 14:07:41 +02:00
  • 7771d1de88 [Fix] import regex instead of re (#19875) Thomas Parnell 2025-06-20 13:16:48 +02:00
  • 71d1219545 [Kernel] correct cpu worker function parameter type (#19745) Ning Xie 2025-06-20 18:50:13 +08:00
  • e384f2f108 [Misc] refactor example - openai_transcription_client (#19851) Reid 2025-06-20 16:02:21 +08:00
  • 089a306f19 [Misc] update cuda version (#19526) Reid 2025-06-20 15:25:15 +08:00
  • 5e666f72cd [Bugfix][Ray] Set the cuda context eagerly in the ray worker (#19583) kourosh hakhamaneshi 2025-06-19 22:01:16 -07:00
  • e3a3e4db46 [Bugfix] Enable PP with AITER+V1 (#19822) qli88 2025-06-19 23:43:20 -05:00
  • e41bf15cd0 [Chore]: qwen3-moe-type-hints-mistake (#19860) Xerxes 2025-06-20 12:43:07 +08:00
  • 5aa4a015ce [Benchmark] Fix Value of type "SampleRequest" is not indexable (#18032) Brayden Zhong 2025-06-20 00:28:55 -04:00
  • b6bad3d186 [CI][Neuron] Fail and exit on first error (#19622) Elaine Zhao 2025-06-19 21:27:51 -07:00
  • ee9a1531aa [CI/Build][Bugfix] Fix deadlock on v1 engine test CI (#19872) Isotr0py 2025-06-20 09:51:07 +08:00
  • 10d82f9ac5 [Benchmark][Bugfix] Fix Dataset Length Calculation (#19868) Robert Shaw 2025-06-19 21:30:41 -04:00
  • ea10dd9d9e [Frontend] early return chat format resolution when specified (#19735) xzbdmw 2025-06-20 02:49:59 +08:00
  • ead2110297 [Core][Bugfix] Fix Online MM Beam Search (#19688) Alex Brooks 2025-06-19 11:18:07 -06:00
  • 01220ce89a [CI][CPU] Improve dummy Triton interfaces and fix the CPU CI (#19838) Li, Jiang 2025-06-19 23:46:09 +08:00
  • 6f68c49220 [Doc] Update V1 user guide for embedding models (#19842) 22quinn 2025-06-19 02:43:27 -07:00
  • 4719460644 Fixing Chunked Prefill Test. (#19762) Alexei-V-Ivanov-AMD 2025-06-19 03:36:16 -05:00
  • 466166dcfd [Frontend] Add optional token-level progress bar to LLM.beam_search (#19301) NekoMimiUnagi 2025-06-19 02:21:41 -05:00
  • 1d0ae26c85 Add xLAM tool parser support (#17148) Zuxin 2025-06-18 23:26:41 -07:00
  • 6021999573 [Minor] Allow redirecting model path for HfRunner in test (#19795) Isotr0py 2025-06-19 14:04:10 +08:00
  • c7b370c603 raise exception for pin_lora (#19809) Ning Xie 2025-06-19 13:57:35 +08:00
  • aa20d10a91 [Misc] [ROCm] Prevent surplus tensor reshape (#19803) zsolt-borbely-htec 2025-06-19 07:57:16 +02:00
  • 2de12be428 [ROCm] [AITER] [Bugfix] Patch for AITER commit 648764942e552a8bb5fe16026703716a81f05374 (#18990) TJian 2025-06-18 22:56:31 -07:00
  • 83ca9ae47b Mark invariant normalizer in Gemma as non-persistent (#19788) Yu-Hang "Maxin" Tang 2025-06-18 22:56:03 -07:00
  • e2148dc5ea [Bugfix] Add check_health to v1 async client. (#19821) kourosh hakhamaneshi 2025-06-18 21:47:01 -07:00
  • b1098b4072 [Bugfix] Fix the linter (#19826) Lu Fang 2025-06-19 12:44:41 +08:00
  • 799397ee4f Support embedding models in V1 (#16188) Maximilien de Bayser 2025-06-19 01:36:33 -03:00
  • 4959915089 [Quantization] Modify the logic of BNB double quantization (#19742) Jee Jee Li 2025-06-19 11:52:09 +08:00
  • 8d1e89d946 [Misc][ROCm] Enforce no unused variable in ROCm C++ files (#19796) Lu Fang 2025-06-19 11:25:15 +08:00
  • 36239f79dd Fix FA2 fallback for Blackwell V1 (#19781) Michael Goin 2025-06-19 10:53:55 +09:00
  • dfada85eee [Frontend] Expose custom args in OpenAI APIs (#16862) afeldman-nm 2025-06-18 20:41:11 -04:00
  • ed33349738 [BugFix] Fix use_cudagraph=False (#19612) Richard Zou 2025-06-18 20:23:12 -04:00
  • d49adea1f9 [Multimodal] Use fast processor for Qwen2/2.5-VL (#19789) Woosuk Kwon 2025-06-18 15:49:40 -07:00
  • 14fdd21d39 [Core] More fixes to MultiModalEmbeddings type handling (#19715) Russell Bryant 2025-06-18 18:48:29 -04:00
  • 04fefe7c9a [TPU] Update torch-xla version to include paged attention tuned block change (#19813) QiliangCui 2025-06-18 15:41:13 -07:00
  • 3b523e38d9 [Core] Do not copy array during hashing (#19484) Lukas Geiger 2025-06-18 23:36:55 +01:00
  • 16c16301c8 Disable "Forbid direct 'import triton'" check for vllm/triton_utils/importing.py in an extensible way (#19783) afeldman-nm 2025-06-18 18:08:00 -04:00
  • 9206d0ff01 docs: fix Slack bulletpoint in README (#19811) Nathan Weinberg 2025-06-18 16:47:08 -04:00