Commit Graph

  • f78c0be80a Fix benchmark_moe.py tuning for CUDA devices (#14164) Michael Goin 2025-03-04 00:11:03 -05:00
  • 66233af7b6 Use math.prod instead of np.prod for trivial ops (#14142) Zhanwen Chen 2025-03-04 00:09:22 -05:00
  • bf13d40972 [core] Pass all driver env vars to ray workers unless excluded (#14099) Rui Qiao 2025-03-03 19:44:17 -08:00
  • 989f4f430c [Misc] Remove lru_cache in NvmlCudaPlatform (#14156) Cody Yu 2025-03-03 19:09:34 -08:00
  • bb5b640359 [core] moe fp8 block quant tuning support (#14068) Divakar Verma 2025-03-03 19:30:23 -06:00
  • c060b71408 [Model] Add support for GraniteMoeShared models (#13313) Travis Johnson 2025-03-03 17:04:52 -07:00
  • 79e4937c65 [v1] Add comments to the new ragged paged attention Pallas kernel (#14155) iefgnoix 2025-03-03 15:00:55 -08:00
  • cd1d3c3df8 [Docs] Add GPTQModel (#14056) Qubitium-ModelCloud 2025-03-04 05:59:09 +08:00
  • 19d98e0c7d [Kernel] Optimize moe intermediate_cache usage (#13625) Michael Goin 2025-03-03 16:29:53 -05:00
  • 2b04c209ee [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (#14100) Michael Goin 2025-03-03 16:20:24 -05:00
  • ae122b1cbd [WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055) Mark McLoughlin 2025-03-03 19:04:45 +00:00
  • 872db2be0e [V1] Simplify stats logging (#14082) Nick Hill 2025-03-03 10:34:14 -08:00
  • 2dfdfed8a0 [V0][Metrics] Deprecate some KV/prefix cache metrics (#14136) Mark McLoughlin 2025-03-03 18:25:46 +00:00
  • c41d27156b [V0][Metrics] Remove unimplemented vllm:tokens_total (#14134) Mark McLoughlin 2025-03-03 17:50:22 +00:00
  • 91373a0d15 Fix head_dim not existing in all model configs (Transformers backend) (#14141) Harry Mellor 2025-03-03 17:48:11 +00:00
  • 848a6438ae [ROCm] Faster Custom Paged Attention kernels (#12348) TJian 2025-03-04 01:24:45 +08:00
  • 98175b2816 Improve the docs for TransformersModel (#14147) Harry Mellor 2025-03-03 17:03:05 +00:00
  • 4167252eaf [V1] Refactor parallel sampling support (#13774) Mark McLoughlin 2025-03-03 16:15:27 +00:00
  • f35f8e2242 [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921) Cody Yu 2025-03-03 00:43:14 -08:00
  • b87c21fc89 [Misc][Platform] Move use allgather to platform (#14010) Mengqing Cao 2025-03-03 15:40:04 +08:00
  • e584b85afd [Misc] duplicate code in deepseek_v2 (#14106) wang.yuqi 2025-03-03 14:10:11 +08:00
  • 09e56f9262 [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure (#14051) Sheng Yao 2025-03-03 09:35:01 +08:00
  • cf069aa8aa Update deprecated Python 3.8 typing (#13971) Harry Mellor 2025-03-03 01:34:51 +00:00
  • bf33700ecd [v0][structured output] Support reasoning output (#12955) Ce Gao 2025-03-03 03:49:42 +08:00
  • bc6ccb9878 [Doc] Source building add clone step (#14086) qux-bbb 2025-03-02 18:59:50 +08:00
  • 82fbeae92b [Misc] Accurately capture the time of loading weights (#14063) Jun Duan 2025-03-01 20:20:30 -05:00
  • cc5e8f6db8 [Model] Add LoRA support for TransformersModel (#13770) Jee Jee Li 2025-03-02 09:17:34 +08:00
  • d54990da47 [v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081) Chen Zhang 2025-03-02 04:46:02 +08:00
  • b9f1d4294e [v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073) Chen Zhang 2025-03-01 16:25:54 +08:00
  • b28246f6ff [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065) Sage Moore 2025-02-28 23:18:32 -08:00
  • 3b5567a209 [V1][Minor] Do not print attn backend twice (#13985) Woosuk Kwon 2025-02-28 23:09:14 -08:00
  • fdcc405346 [Doc] Consolidate whisper and florence2 examples (#14050) Isotr0py 2025-03-01 14:49:15 +08:00
  • 8994dabc22 [Documentation] Add more deployment guide for Kubernetes deployment (#13841) Kuntai Du 2025-02-28 22:44:24 -08:00
  • 02296f420d [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053) Li, Jiang 2025-03-01 14:31:01 +08:00
  • 6a92ff93e1 [Misc][Kernel]: Add GPTQAllSpark Quantization (#12931) YajieWang 2025-03-01 14:30:59 +08:00
  • 6a84164add [Bugfix] Add file lock for ModelScope download (#14060) Jee Jee Li 2025-03-01 14:10:28 +08:00
  • f64ffa8c25 [Docs] Add pipeline_parallel_size to optimization docs (#14059) Brayden Zhong 2025-03-01 00:43:54 -05:00
  • bd56c983d6 [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902) Luka Govedič 2025-02-28 18:20:11 -05:00
  • 084bbac8cc [core] Bump ray to 2.43 (#13994) Rui Qiao 2025-02-28 13:47:44 -08:00
  • 28943d36ce [v1] Move block pool operations to a separate class (#13973) Chen Zhang 2025-03-01 04:53:31 +08:00
  • b526ca6726 Add RELEASE.md (#13926) Andrey Talman 2025-02-28 20:25:50 +00:00
  • e7bd944e08 [v1] Cleanup the BlockTable in InputBatch (#13977) Chen Zhang 2025-03-01 03:03:16 +08:00
  • c3b6559a10 [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379) iefgnoix 2025-02-28 10:01:36 -08:00
  • 4be4b26cb7 Fix entrypoint tests for embedding models (#14052) Harry Mellor 2025-02-28 16:56:44 +00:00
  • 2aed2c9fa7 [Doc] Fix ROCm documentation (#14041) Brayden Zhong 2025-02-28 11:42:07 -05:00
  • 9b61dd41e7 [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series (#14031) Yang Liu 2025-02-28 23:36:08 +08:00
  • f7bee5c815 [VLM][Bugfix] Enable specifying prompt target via index (#14038) Cyrus Leung 2025-02-28 23:35:55 +08:00
  • e0734387fb [Bugfix] Fix MoeWNA16Method activation (#14024) Jee Jee Li 2025-02-28 23:22:42 +08:00
  • f58f8b5c96 Update AutoAWQ docs (#14042) Harry Mellor 2025-02-28 15:20:29 +00:00
  • b3f7aaccd0 [V1][Minor] Restore V1 compatibility with LLMEngine class (#13090) Thibault Schueller 2025-02-28 09:52:25 +01:00
  • b91660ddb8 [Hardware][Intel-Gaudi] Regional compilation support (#13213) Kacper Pietkun 2025-02-28 09:51:49 +01:00
  • 76c89fcadd Use smaller embedding model when not testing model specifically (#13891) Harry Mellor 2025-02-28 08:50:43 +00:00
  • b9e41734c5 [Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) (#13987) Mathis Felardos 2025-02-28 08:53:45 +01:00
  • 1088f06242 [Doc] Move multimodal Embedding API example to Online Serving page (#14017) Cyrus Leung 2025-02-28 15:12:04 +08:00
  • 73e0225ee9 [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (#13911) Travis Johnson 2025-02-27 21:00:45 -07:00
  • 6c85da3a18 [V1] SupportsV0Only protocol for model definitions (#13959) Roger Wang 2025-02-27 17:02:15 -08:00
  • 67fc426845 [Misc] Print FusedMoE detail info (#13974) Jee Jee Li 2025-02-28 07:53:13 +08:00
  • 9804145cac [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict (#13626) Benjamin Chislett 2025-02-27 18:28:08 -05:00
  • 2e94b9cfbb [Attention] Flash MLA for V1 (#13867) Lucas Wilkinson 2025-02-27 18:03:41 -05:00
  • 8294773e48 [core] Perf improvement for DSv3 on AMD GPUs (#13718) qli88 2025-02-27 16:14:30 -06:00
  • cd813c6d4d [V1][Minor] Minor cleanup for GPU Model Runner (#13983) Woosuk Kwon 2025-02-27 13:11:40 -08:00
  • 38acae6e97 [ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups (#13970) Sage Moore 2025-02-27 12:31:47 -08:00
  • a2dd48c386 [VLM] Deprecate legacy input mapper for OOT multimodal models (#13979) Cyrus Leung 2025-02-28 03:14:55 +08:00
  • 126f6beeb4 Bump azure/setup-helm from 4.2.0 to 4.3.0 (#13742) dependabot[bot] 2025-02-27 19:04:10 +00:00
  • 58d1b2aa77 [Attention] MLA support for V1 (#13789) Yang Chen 2025-02-27 10:14:17 -08:00
  • f1579b229d [VLM] Generalized prompt updates for multi-modal processor (#13964) Cyrus Leung 2025-02-28 01:44:25 +08:00
  • 7864875879 [Bugfix] Fix qwen2.5-vl overflow issue (#13968) Isotr0py 2025-02-28 01:30:39 +08:00
  • 1dd422b64a Update LMFE version to v0.10.11 to support new versions of transforme… (#13930) Noam Gat 2025-02-27 19:16:12 +02:00
  • 06c8f8d885 [bugfix] Fix profiling for RayDistributedExecutor (#13945) Rui Qiao 2025-02-27 09:01:21 -08:00
  • 5677c9bb3e Deduplicate .pre-commit-config.yaml's exclude (#13967) Harry Mellor 2025-02-27 16:27:47 +00:00
  • 512d77d582 Update quickstart.md (#13958) 王博伟 2025-02-28 00:05:11 +08:00
  • 7f0be2aa24 [Model] Deepseek GGUF support (#13167) Szymon Ożóg 2025-02-27 11:08:35 +01:00
  • edf309ebbe [VLM] Support multimodal inputs for Florence-2 models (#13320) Isotr0py 2025-02-27 18:06:41 +08:00
  • 788f284b53 Fix test_block_fp8.py test for MoE (#13915) Michael Goin 2025-02-27 05:00:00 -05:00
  • 4b1d141f49 [PP] Correct cache size check (#13873) Yang Zheng 2025-02-27 17:47:29 +08:00
  • 10c3b8c1cf [Misc] fixed 'required' is an invalid argument for positionals (#13948) Chauncey 2025-02-27 17:06:49 +08:00
  • a7f37314b7 [CI/Build] Add examples/ directory to be labelled by mergify (#13944) Brayden Zhong 2025-02-27 03:24:11 -05:00
  • cd711c48b2 [V1][Metrics] Handle preemptions (#13169) Mark McLoughlin 2025-02-27 04:04:59 +00:00
  • 378b3ef6f8 [ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding (#13922) Sage Moore 2025-02-26 20:04:12 -08:00
  • c9944acbf9 [misc] Rename Ray ADAG to Compiled Graph (#13928) Rui Qiao 2025-02-26 20:03:28 -08:00
  • ca377cf1b9 Use CUDA 12.4 as default for release and nightly wheels (#12098) Michael Goin 2025-02-26 22:06:37 -05:00
  • a31614e386 [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined (#13851) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-02-27 04:39:10 +02:00
  • f95903909f [Kernel] FlashMLA integration (#13747) Lucas Wilkinson 2025-02-26 21:35:08 -05:00
  • b382a7f28f [BugFix] Make FP8 Linear compatible with torch.compile (#13918) Woosuk Kwon 2025-02-26 13:48:55 -08:00
  • 4cb6fa0a9c [Bugfix] Backend option to disable xgrammar any_whitespace (#12744) Wallas Henrique 2025-02-26 15:52:34 -03:00
  • d08b285adf [Misc] fixed qwen_vl_utils parameter error (#13906) Chauncey 2025-02-27 00:31:53 +08:00
  • b27122acc2 [TPU] use torch2.6 with whl package (#13860) Chenyaaang 2025-02-26 05:18:54 -08:00
  • 934bb99c71 [Bugfix] Update expected token counts for Ultravox tests (#13895) Cyrus Leung 2025-02-26 20:56:50 +08:00
  • 3f808cc044 [Bugfix] Do not crash V0 engine on input errors (#13101) Joe Runde 2025-02-26 04:07:29 -07:00
  • ec8a5e5386 [Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor (#13736) Brayden Zhong 2025-02-26 06:06:47 -05:00
  • 215bf150a6 [Bugfix] Handle None parameters in Mistral function calls. (#13786) Florian Greinacher 2025-02-26 12:06:21 +01:00
  • 0ecdd98031 Add comments on accessing kv_cache and attn_metadata (#13887) Harry Mellor 2025-02-26 10:41:02 +00:00
  • 7b700ec8c8 [Bugfix] Add test example for Ultravox v0.5 (#13890) Cyrus Leung 2025-02-26 18:31:43 +08:00
  • 7ca1da020f [Misc] Fix input processing for Ultravox (#13871) Roger Wang 2025-02-25 23:56:34 -08:00
  • 5157338ed9 [Misc] Improve LoRA spelling (#13831) Jee Jee Li 2025-02-26 15:43:01 +08:00
  • e206b54331 [v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine (#13837) Seth Kimmel 2025-02-25 22:58:24 -08:00
  • 1d35662e6d [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms (#13844) Sage Moore 2025-02-25 22:56:58 -08:00
  • e656f638de [Doc] fix the incorrect module path of tensorize_vllm_model (#13863) Albert 2025-02-26 14:56:19 +08:00
  • 145944cb94 Improve pipeline partitioning (#13839) Harry Mellor 2025-02-26 02:53:56 +00:00
  • 094b7d9496 [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues (#13797) Henry Tsang 2025-02-25 18:52:03 -08:00