Commit Graph

  • da31f6ad3d Revert precompile wheel changes (#22055) Simon Mo 2025-08-01 01:26:24 -07:00
  • 98df153abf [Frontend] Align tool_choice="required" behavior with OpenAI when tools is empty (#21052) Sungyoon Jeong 2025-08-01 16:54:17 +09:00
  • e0f63e4a35 [Core] Avoid repeated len(block_token_ids) check in hash_request_tokens (#21781) Zebing Lin 2025-08-01 03:23:29 -04:00
  • b4e081cb15 [Bugfix] Disable multi-modal preprocessor cache for DP (#21896) Cyrus Leung 2025-08-01 15:03:56 +08:00
  • 79731a79f0 [Doc] Fix a syntax error of example code in structured_outputs.md (#22045) Hongsheng Liu 2025-08-01 15:01:22 +08:00
  • 53d7c39271 Update sampling_metadata.py (#21937) Aviad Rossmann 2025-08-01 09:23:18 +03:00
  • 61dcc280fa [Doc] Add Voxtral to Supported Models page (#22059) Cyrus Leung 2025-08-01 14:10:56 +08:00
  • 0f46a780d4 [Model] [Quantization] Support quantization for Gemma3n (#21974) Kyle Sayers 2025-08-01 01:45:15 -04:00
  • e1a7fe4af5 [BugFix] fix: aot passes kvcache dtype information (#19750) Mickaël Seznec 2025-08-01 07:45:02 +02:00
  • 82de9b9d46 [Misc] Automatically resolve HF processor init kwargs (#22005) Cyrus Leung 2025-08-01 13:44:10 +08:00
  • ad57f23f6a [Bugfix] Fix: Fix multi loras with tp >=2 and LRU cache (#20873) Charent 2025-08-01 10:48:13 +08:00
  • 3700642013 [Refactor] Remove Duplicate per_block_cast_to_fp8, Remove Dependencies of DeepGEMM (#21787) Wentao Ye 2025-07-31 21:13:27 -04:00
  • 0bd409cf01 Move flashinfer-python to optional extra vllm[flashinfer] (#21959) Michael Goin 2025-07-31 21:02:11 -04:00
  • e360316ab9 Add DeepGEMM to Dockerfile in vllm-base image (#21533) Matthew Bonanni 2025-07-31 21:01:55 -04:00
  • c3e0e9337e [Feature] Add Flashinfer MoE Support for Compressed Tensor NVFP4 (#21639) Wentao Ye 2025-07-31 18:26:11 -04:00
  • 6e672daf62 Add FlashInfer allreduce RMSNorm Quant fusion (#21069) Ilya Markov 2025-07-31 22:58:38 +02:00
  • 2dff2e21d9 [Bugfix] Fix MTP weight loading (#21941) Benjamin Chislett 2025-07-31 16:33:53 -04:00
  • 71470bc4af [Misc] Add unit tests for chunked local attention (#21692) Yong Hoon Shin 2025-07-31 11:39:16 -07:00
  • 9e0726e5bf [Meta] Official Eagle mm support, first enablement on llama4 (#20788) zhiweiz 2025-07-31 10:35:07 -07:00
  • 53c21e492e Update torch_xla pin to 20250730 (#21956) XiongfeiWei 2025-07-31 10:26:43 -07:00
  • 0780bb5783 Removing amdproduction Tests (#22027) Alexei-V-Ivanov-AMD 2025-07-31 11:53:27 -05:00
  • 58bb902186 fix(setup): improve precompiled wheel setup for Docker builds (#22025) Doug Smith 2025-07-31 12:52:48 -04:00
  • 7349d5268b [ez] Remove a trailing space from compilation/decorators.py (#22028) Zhengxu Chen 2025-07-31 12:46:07 -04:00
  • 9484641616 [Model] Add step3 vl (#21998) Song 2025-07-31 23:19:06 +08:00
  • 207b750e19 [NVIDIA] Add SM100 Flashinfer MoE per tensor scale fp8 backend (#21458) amirkl94 2025-07-31 16:00:01 +03:00
  • 5daffe7cf6 [BugFix] Fix case where collective_rpc returns None (#22006) Nick Hill 2025-07-31 13:51:37 +01:00
  • 2836dd73f1 [Model][CI] Let more pooling models support v1 (#21747) wang.yuqi 2025-07-31 16:51:15 +08:00
  • d2aab336ad [CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES (#21599) Daniele 2025-07-31 09:00:08 +02:00
  • 9532a6d563 [Deprecation] Remove deprecated args and methods (#21907) Cyrus Leung 2025-07-31 14:46:38 +08:00
  • 3e36fcbee6 [Bugfix]: fix metadata file copy in test_sharded_state_loader (#21830) Ning Xie 2025-07-31 14:22:11 +08:00
  • 055bd3978e [CI Bugfix] Fix CI OOM for test_shared_storage_connector_hashes (#21973) Michael Goin 2025-07-30 23:45:29 -04:00
  • 0f7919fca0 [Misc] Expand SUPPORTED_HIDDEN_SIZES for DeepEP low-latency kernels (#21818) Jee Jee Li 2025-07-31 11:41:12 +08:00
  • 61445453df [UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA (#21966) Michael Goin 2025-07-30 23:40:34 -04:00
  • ec02e536df [Bugfix] Relax lang pin for voxtral (#21833) Sanchit Gandhi 2025-07-31 04:38:52 +01:00
  • 9cb497bfa3 [Example] Add async_llm_streaming.py example for AsyncLLM streaming in python (#21763) Michael Goin 2025-07-30 20:39:46 -04:00
  • ca9e2be3ed [Core] Move EngineCoreRequest to Request conversion out of EngineCore (#21627) Zebing Lin 2025-07-30 18:00:54 -04:00
  • 601f856d56 [Bugfix] Fix None value handling in trace span creation for cancelled requests (#20272) Bram 2025-07-30 14:44:02 -07:00
  • 287f527f54 [Feature] Add async tensor parallelism for scaled mm (#20155) cascade 2025-07-30 14:23:41 -07:00
  • f12d9256b3 [Misc] Use dracut on CentOS and skip clone if repo exists for EP kernel installation (#21635) Ming Yang 2025-07-30 13:15:06 -07:00
  • b9b753e7a7 For VLLM_USE_PRECOMPILED, only compiled .so files should be extracted (#21964) Doug Smith 2025-07-30 16:04:40 -04:00
  • 56bd537dde [Misc] Support more collective_rpc return types (#21845) Nick Hill 2025-07-30 18:20:20 +01:00
  • 8f0d516715 [TPU] Support Pathways in vLLM (#21417) wenxindongwork 2025-07-30 10:02:12 -07:00
  • f4135232b9 feat(distributed): add get_required_kvcache_layout class method to kv connector api (#20433) wxsm 2025-07-31 00:41:51 +08:00
  • 4904e53c32 [Bugfix] SharedStorage Connector for V1 PD multimodal (#21611) Chenguang Zheng 2025-07-31 00:18:37 +08:00
  • 004203e953 [CI/Build] Fix registry tests (#21934) Cyrus Leung 2025-07-31 00:10:41 +08:00
  • 5c765aec65 [Bugfix] Fix TypeError in scheduler when comparing mixed request_id types (#21816) 633WHU 2025-07-30 23:54:44 +08:00
  • ad510309ee Override attention metadata for fast prefill in some KV sharing setups (#21590) Yong Hoon Shin 2025-07-30 08:54:15 -07:00
  • 366f6b3a4d [Bugfix] Fix multi-api server not working for text models (#21933) Cyrus Leung 2025-07-30 23:42:05 +08:00
  • 6e599eebe8 [Bugfix] Fix OOM tests in initialization test (#21921) Isotr0py 2025-07-30 22:35:47 +08:00
  • 88edf5994c [Docs] Reduce the size of the built docs (#21920) Harry Mellor 2025-07-30 15:35:08 +01:00
  • ff08e51940 [NVIDIA] Fix Llama4 Scout FP4 functionality issues (#21499) Po-Han Huang (NVIDIA) 2025-07-30 22:33:40 +08:00
  • 8f4a1c9a04 [Misc] Improve code readability of KVCacheManager (#21673) Ruixiang Tan 2025-07-30 22:20:43 +08:00
  • 36ede45989 Reduce time wasted in GitHub Actions using concurrency (#21919) Harry Mellor 2025-07-30 15:18:02 +01:00
  • 0e40b26073 [CI/Build] Only run markdownlint in CI (#21892) Cyrus Leung 2025-07-30 22:17:14 +08:00
  • 0271c2ff2f [Test] Add Benchmark and Unit Test for per_token_group_quant (#21860) Wentao Ye 2025-07-30 10:15:02 -04:00
  • e91d3c9cda [misc] skip p2p check by default (#21904) youkaichao 2025-07-30 22:05:04 +08:00
  • bf668b5bf5 [Feature] Support multiple api keys in server (#18548) Yan Pashkovsky 2025-07-30 15:03:23 +01:00
  • da3e0bd6e5 [Bugfix] we should use metavar is not choices (#21902) rongfu.leng 2025-07-30 21:51:58 +08:00
  • fcfd1eb9c5 [Doc] Remove vLLM prefix and add citation for PagedAttention (#21910) Cyrus Leung 2025-07-30 21:36:34 +08:00
  • d979dd6beb [Feature][EPLB] Add eplb support for Qwen3 (#20815) aladerran 2025-07-30 21:27:57 +08:00
  • b876860c62 [Hardware][CPU] Build fix for ARM without BF16 (#21848) Eric Curtin 2025-07-30 14:22:00 +01:00
  • 13986365a9 Add @patrickvonplaten as maintainer of mistral's related files. (#21928) Patrick von Platen 2025-07-30 14:42:51 +02:00
  • 5c8fe389d6 [Docs] Fix the example code of streaming chat completions in reasoning (#21825) Hongsheng Liu 2025-07-30 20:11:58 +08:00
  • 5bbaf492a6 [Doc] Update partial support (#21916) Cyrus Leung 2025-07-30 16:32:39 +08:00
  • 533db0935d [benchmark] add max-concurrency in result table (#21095) Peter Pan 2025-07-30 16:15:43 +08:00
  • fc91da5499 [Model] Remove DSV2 unused code (#21903) Jee Jee Li 2025-07-30 15:55:03 +08:00
  • 547795232d [Tests] Fixing bug inside MultiModalProfiler. (#21842) Varun Vinayak Shenoy 2025-07-30 00:44:15 -07:00
  • 30ef30ed5a [CI] rollback lint-and-deploy pipeline using amd machine (#21912) Kebe 2025-07-30 15:37:59 +08:00
  • 02f82fe438 [Doc] Update Intern-S1 info (#21908) Jee Jee Li 2025-07-30 14:58:57 +08:00
  • 2ca5f82c2a [Misc] Remove redundant config definitions (#21891) Cyrus Leung 2025-07-30 14:54:18 +08:00
  • 6f8d261882 Update vLLM Benchmark Suite for Xeon based on 0.9.2 release (#21486) Louie Tsai 2025-07-29 22:57:03 -07:00
  • 4cd7fe6cea [Docs] Expand introduction to Ray in Multi-node deployment section (#21584) Ricardo Decal 2025-07-29 22:07:28 -07:00
  • 16f3250527 [CI/Build] Fix pre-commit failure in docs (#21897) Cyrus Leung 2025-07-30 12:53:08 +08:00
  • e3bc17ceea Add @sighingnow as maintainer of qwen's related files. (#21895) Tao He 2025-07-30 12:30:44 +08:00
  • 05cbbe20c5 [XPU] use ZE_AFFINITY_MASK for device select on xpu (#21815) Kunshang Ji 2025-07-30 11:56:14 +08:00
  • 65f311ce59 [Frontend] Add LLM.reward specific to reward models (#21720) wang.yuqi 2025-07-30 11:56:03 +08:00
  • 1b0a155534 [Perf] Using __nv_fp8_e4m3 instead of c10::e4m3 for per_token_group_quant (#21867) Wentao Ye 2025-07-29 23:50:46 -04:00
  • 44bc46da60 [Bugfix] Actually disable processing cache when API server is scaled out (#21839) Cyrus Leung 2025-07-30 11:36:04 +08:00
  • b7b23da4d2 [Bugfix] Fix comment typo of get_num_common_prefix_blocks() (#21827) MingzhenHan 2025-07-30 11:35:33 +08:00
  • fdde18229e [Bugfix] Fix shape mismatch assertion error when loading Gemma3n model with BitsAndBytes quantization (#21808) Areeb Syed 2025-07-30 09:05:21 +05:30
  • b917da442b Expose PyTorch profiler configuration to environment variables (#21803) Csrayz 2025-07-30 10:46:31 +08:00
  • fb58e3a651 [Docs] Update docker.md with HF_TOKEN, new model, and podman fix (#21856) Michael Goin 2025-07-29 22:45:41 -04:00
  • 76080cff79 [DOC] Fix path of v1 related figures (#21868) Chen Zhang 2025-07-29 19:45:18 -07:00
  • ba5c5e5404 [Docs] Switch to better markdown linting pre-commit hook (#21851) Harry Mellor 2025-07-30 03:45:08 +01:00
  • 555e7225bc [v1][attention] Support Hybrid Allocator + FlashInfer (#21412) Chen Zhang 2025-07-29 18:45:29 -07:00
  • 0e36abf993 [Bugfix] Correct max tokens for non-contiguous embeds (#21798) milesial 2025-07-29 18:16:25 -07:00
  • 452b2a3180 [ci] mark blackwell test optional for now (#21878) Simon Mo 2025-07-29 18:03:27 -07:00
  • 0d0cc9e150 [ci] add b200 test placeholder (#21866) Simon Mo 2025-07-29 17:11:50 -07:00
  • 9266d98048 [BugFix] Fix interleaved sliding window not set for Gemma3n (#21863) Yong Hoon Shin 2025-07-29 16:34:19 -07:00
  • 176bbce1db Revert "[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure (#21647)" (#21850) Gregory Shtrasberg 2025-07-29 17:56:29 -04:00
  • a1873db23d docker: docker-aware precompiled wheel support (#21127) Doug Smith 2025-07-29 17:45:19 -04:00
  • a33ea28b1b Add flashinfer_python to CUDA wheel requirements (#21389) Michael Goin 2025-07-29 15:51:58 -04:00
  • 7b49cb1c6b [Doc] update Contributing page's testing section (#18272) David Xia 2025-07-29 13:32:46 -04:00
  • f03e9cf2bb [Doc] Add FusedMoE Modular Kernel Documentation (#21623) Varun Sundar Rabindranath 2025-07-29 23:02:30 +05:30
  • 37f86d9048 [Docs] use uv in GPU installation docs (#20277) David Xia 2025-07-29 13:32:06 -04:00
  • 58b11b24a6 [Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend (#21525) elvischenv 2025-07-29 22:34:00 +08:00
  • ad341c5194 [Bugfix]fix mixed bits and visual language model quantization in AutoRound (#21802) Wenhua Cheng 2025-07-29 22:26:31 +08:00
  • 759b87ef3e [TPU] Add an optimization doc on TPU (#21155) Brittany 2025-07-29 07:23:19 -07:00
  • f693b067a2 [Docs] Merge design docs for a V1 only future (#21832) Harry Mellor 2025-07-29 15:22:50 +01:00
  • 04e38500ee [Bugfix] VLLM_V1 supports passing other compilation levels (#19340) Richard Zou 2025-07-29 09:35:58 -04:00