Commit Graph

  • 32d4b669d0 [BugFix][V1] Fix int32 token index overflow when preparing input ids (#16806) Yong Hoon Shin 2025-04-23 12:12:35 -07:00
  • 3cde34a4a4 [Frontend] Support guidance:no-additional-properties for compatibility with xgrammar (#15949) Travis Johnson 2025-04-23 12:34:41 -06:00
  • bdb3660312 Use @property and private field for data_parallel_rank_local (#17053) Harry Mellor 2025-04-23 16:50:08 +01:00
  • f3a21e9c68 CacheConfig.block_size should always be int when used (#17052) Harry Mellor 2025-04-23 16:50:05 +01:00
  • 8e630d680e Improve Transformers backend model loading QoL (#17039) Harry Mellor 2025-04-23 15:33:51 +01:00
  • af869f6dff [CI] Update structured-output label automation (#17055) Russell Bryant 2025-04-23 10:33:14 -04:00
  • 53c0fa1e25 Ensure that pid passed to kill_process_tree is int for mypy (#17051) Harry Mellor 2025-04-23 15:32:26 +01:00
  • f7912cba3d [Doc] Add top anchor and a note to quantization/bitblas.md (#17042) Michael Yao 2025-04-23 22:32:16 +08:00
  • 6317a5174a Categorize tests/kernels/ based on kernel type (#16799) Michael Goin 2025-04-23 07:21:07 -06:00
  • aa72d9a4ea Mistral-format support for compressed-tensors (#16803) Michael Goin 2025-04-23 06:46:23 -06:00
  • ce17db8085 [CI] Run v1/test_serial_utils.py in CI (#16996) Russell Bryant 2025-04-23 04:13:34 -04:00
  • 8c87a9ad46 [Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers (#16964) Chauncey 2025-04-23 15:24:09 +08:00
  • ec69124eb4 [Misc] Improve readability of get_open_port function. (#17024) huafeng 2025-04-23 14:16:53 +08:00
  • d0da99fb70 [BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) (#16998) Lucas Wilkinson 2025-04-23 00:49:24 -04:00
  • b2f195c429 [V1] Avoid socket errors during shutdown when requests are in-flight (#16807) Nick Hill 2025-04-22 21:36:29 -07:00
  • 047797ef90 [Bugfix] Triton FA function takes no keyword arguments (#16902) vllmellm 2025-04-23 12:35:24 +08:00
  • eb8ef4224d [doc] add download path tips (#17013) Reid 2025-04-23 12:06:30 +08:00
  • 56a735261c [INTEL-HPU][v0] Port delayed sampling to upstream (#16949) Chendi.Xue 2025-04-22 22:14:11 -05:00
  • e1cf90e099 [misc] tune some env vars for GB200 (#16992) youkaichao 2025-04-23 10:59:48 +08:00
  • 6bc1e30ef9 Revert "[Misc] Add S3 environment variables for better support of MinIO." (#17021) Chauncey 2025-04-23 10:22:29 +08:00
  • 7e081ba7ca [BugFix] Revert ROCm Custom Paged Attention Env Flag Check (#17022) vllmellm 2025-04-23 10:17:48 +08:00
  • 1e013fa388 [V1][DP] More robust DP/EP dummy request coordination (#16277) Nick Hill 2025-04-22 19:12:15 -07:00
  • bc7c4d206b [Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 (#13305) Aleksandr Malyshev 2025-04-22 19:11:56 -07:00
  • f67e9e9f22 add Dockerfile build vllm against torch nightly (#16936) Yang Wang 2025-04-22 19:08:27 -07:00
  • 36fe78769f [Bugfix] validate urls object for multimodal content parts (#16990) Guillaume Calmettes 2025-04-23 03:43:06 +02:00
  • 83d933718c [Core][V1][TPU] Enable structured decoding on TPU V1 (#16499) Chenyaaang 2025-04-22 17:05:23 -07:00
  • 5175b884f7 [BugFix] Remove default multiproc executor collective_rpc timeout (#17000) Nick Hill 2025-04-22 16:27:14 -07:00
  • 5536b30a4c Fencing Kernels Tests for enabling on AMD (#16929) Alexei-V-Ivanov-AMD 2025-04-22 11:32:40 -05:00
  • 7f58fb9718 Add assertion for no objects while hashing hf_config (#16930) Richard Zou 2025-04-22 12:32:22 -04:00
  • 30bc3e0f66 [FEAT][ROCm]: Support AITER MLA (#15893) vllmellm 2025-04-23 00:31:13 +08:00
  • f34410715f [frontend] enhance tool_calls type check (#16882) Reid 2025-04-22 23:40:24 +08:00
  • 68d4c33202 [Misc] Add S3 environment variables for better support of MinIO. (#16977) Chauncey 2025-04-22 22:27:36 +08:00
  • f961d7f6ef [BugFix] Pass in correct VLLM config in FlashInfer backend (#13207) (#16973) Zhengyuan Su (苏政渊) 2025-04-22 21:44:10 +08:00
  • d059110498 Improve configs - SpeculativeConfig (#16971) Harry Mellor 2025-04-22 13:55:36 +01:00
  • 571e8dd65e [Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni (#16974) Yang Fan 2025-04-22 20:23:17 +08:00
  • 4b91c927f6 [Misc] refactor example series (#16972) Reid 2025-04-22 19:44:21 +08:00
  • 0e237f0035 [FEAT][ROCm] Integrate Paged Attention Kernel from AITER (#15001) vllmellm 2025-04-22 17:46:28 +08:00
  • 8f7bace7c3 [Doc] Improve documentation for multimodal CLI args (#16960) Cyrus Leung 2025-04-22 16:35:35 +08:00
  • e4d6144232 [BugFix] Fix incremental detokenization perf issue (#16963) Nick Hill 2025-04-22 01:16:19 -07:00
  • 8d32dc603d [Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036) Lei Wang 2025-04-22 16:01:36 +08:00
  • c4ab9f3e71 [V1] Remove pre-allocation for KV cache (#16941) Woosuk Kwon 2025-04-22 00:52:18 -07:00
  • 2689d5c027 [Model] Use autoweightloader for mamba (#16950) Flora Feng 2025-04-22 00:48:15 -07:00
  • acba33a0f1 [Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams (#16767) Chauncey 2025-04-22 14:02:20 +08:00
  • a114bf20a3 [Perf] Optimize _update_states for GPU model runner (#16910) SnowCharm 2025-04-22 14:01:54 +08:00
  • 3097ce3a32 [Doc] Update ai_accelerator/hpu-gaudi.inc.md (#16956) Michael Yao 2025-04-22 13:33:27 +08:00
  • d6da9322c8 [Bugfix] Fix f-string for Python 3.9-3.11 (#16962) Cyrus Leung 2025-04-22 12:45:55 +08:00
  • 71ce44047f Support S3 Sharded loading with RunAI Model Streamer (#16317) omer-dayan 2025-04-22 07:21:49 +03:00
  • 188b7f9b8c [Performance][ROCm] Add skinny gemms for unquantized linear on ROCm (#15830) Charlie Fu 2025-04-21 22:46:22 -05:00
  • b9b4746950 [V1] Remove additional_config check (#16710) wangxiyuan 2025-04-22 11:45:27 +08:00
  • 7b8a2ab76f [Kernel] Add expert_map support to Cutlass FP8 MOE (#16861) Varun Sundar Rabindranath 2025-04-21 23:44:32 -04:00
  • c9acbf1141 [Misc] Remove the chunked prefill warning for LoRA (#16925) Jee Jee Li 2025-04-22 11:44:24 +08:00
  • 5b794cae8d [ROCm] Add aiter tkw1 kernel for Llama4 fp8 (#16727) kliuae 2025-04-22 11:42:34 +08:00
  • 0e4254492f [Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other (#16863) Jeffrey Li 2025-04-21 23:40:19 -04:00
  • 1311913f55 [BugFix][Spec Decode] No in-place update to draft probs (#16952) Woosuk Kwon 2025-04-21 19:54:19 -07:00
  • 29f395c97c [Doc] Remove unnecessary V1 flag (#16924) Cyrus Leung 2025-04-22 09:04:38 +08:00
  • fa3bba2a53 [TPU][V1] Enable Top-P (#16843) Nicolò Lucchesi 2025-04-22 02:46:07 +02:00
  • 986537f1c3 [V1] V1 FlashInfer Attention (#16684) Michael Goin 2025-04-21 18:38:41 -06:00
  • 210207525e [TPU][V1] Capture multimodal encoder during model compilation (#15051) Nicolò Lucchesi 2025-04-22 02:36:59 +02:00
  • 71eda0bb76 Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml (#16946) Michael Goin 2025-04-21 18:35:32 -06:00
  • 471fe65630 [TPU][V1] Implicitly adjust page size when there's SMEM OOM (#16871) Chengji Yao 2025-04-21 14:43:13 -07:00
  • 3a0fba5cf4 [V1][Spec Decode] Handle draft tokens beyond max_model_len (#16087) Woosuk Kwon 2025-04-21 12:38:50 -07:00
  • 299ebb62b2 [Core] Speed up decode by remove synchronizing operation in sampler (#16436) Chanh Nguyen 2025-04-21 11:18:22 -07:00
  • f728ab8e35 [Doc] mention how to install in CPU editable mode (#16923) David Xia 2025-04-21 13:45:51 -04:00
  • 63e26fff78 [doc] install required python3-dev apt package (#16888) David Xia 2025-04-21 12:15:18 -04:00
  • fe3462c774 [XPU][Bugfix] minor fix for XPU (#15591) Yan Ma 2025-04-22 00:02:57 +08:00
  • 3b34fd5273 Raise error for data-parallel with benchmark_throughput (#16737) Kartik Ramesh 2025-04-21 10:51:43 -05:00
  • 55d6d3fdb8 [Bugfix] Fix GLM rotary_dim issue and support v1 (#16912) Isotr0py 2025-04-21 22:26:34 +08:00
  • 7272bfae77 [Misc] Refactor platform to get device specific stream and event (#14411) Shanshan Shen 2025-04-21 21:25:49 +08:00
  • d9ac9e3dc5 [Misc] fix collect_env version parse (#15267) wangxiyuan 2025-04-21 20:29:40 +08:00
  • d41faaf9df Restore buffers when wake up from level 2 sleep (#16564) (#16889) Han Zhang 2025-04-21 20:18:28 +08:00
  • b34f33438a [Doc] Split dummy_processor_inputs() in Multimodal Docs (#16915) Alex Brooks 2025-04-21 05:10:01 -06:00
  • 26c0406555 [Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni (#16907) Yang Fan 2025-04-21 18:25:21 +08:00
  • 4c41278b77 [CI/CD][V1] Add spec decode tests to CI (#16900) Woosuk Kwon 2025-04-20 22:37:16 -07:00
  • bb3605db85 [Bugfix] Fix v1/spec_decode/test_ngram.py (#16895) qizixi 2025-04-20 20:54:29 -07:00
  • fe742aef5a [easy] Pass compile_fx only the config patches (#16845) Richard Zou 2025-04-20 00:25:19 -04:00
  • 4b07d36891 Improve configs - CacheConfig (#16835) Harry Mellor 2025-04-20 05:25:04 +01:00
  • 87aaadef73 Serialize tensors using int8 views (#16866) Staszek Paśko 2025-04-19 19:28:34 +02:00
  • 682e0b6d2f Log how much time loading a compiled artifact takes (#16848) Richard Zou 2025-04-19 12:50:46 -04:00
  • d6195a748b [doc] update hyperlink (#16877) Reid 2025-04-20 00:40:38 +08:00
  • 205d84aaa9 [VLM] Clean up models (#16873) Cyrus Leung 2025-04-19 20:13:06 +08:00
  • 5124f5bf51 [Model] Qwen2.5-Omni Cleanup (#16872) Roger Wang 2025-04-19 02:37:02 -07:00
  • 83f3c3bd91 [Model] Refactor Phi-4-multimodal to use merged processor and support V1 (#15477) Isotr0py 2025-04-19 17:26:11 +08:00
  • d9737ca1c6 [V1][Misc] stop update prefix cache stats when logs_stats is disabled (#16460) vie-serendipity 2025-04-19 17:25:19 +08:00
  • 9d4ca19d50 [Misc] Benchmarks for audio models (#16505) Nicolò Lucchesi 2025-04-19 11:24:14 +02:00
  • 2ef0dc53b8 [Frontend] Add sampling params to v1/audio/transcriptions endpoint (#16591) Nicolò Lucchesi 2025-04-19 09:03:54 +02:00
  • 1d4680fad2 [rocm][MI300] llama4 maverick fp8 moe config tp8 (#16847) Divakar Verma 2025-04-19 01:21:43 -05:00
  • 2c1bd848a6 [Model][VLM] Add Qwen2.5-Omni model support (thinker only) (#15130) Yang Fan 2025-04-19 14:14:36 +08:00
  • 5c9121203c [release] Publish neuron docker image (#16733) omrishiv 2025-04-19 01:11:25 +01:00
  • 490b1698a5 [Doc] Updated Llama section in tool calling docs to have llama 3.2 config info (#16857) Justin Ho 2025-04-18 19:28:53 -04:00
  • 5a5e29de88 [Misc] refactor examples series - Chat Completion Client With Tools (#16829) Reid 2025-04-19 07:24:42 +08:00
  • 3d3ab3689f [New Model]: Snowflake Arctic Embed (Family) (#16649) wang.yuqi 2025-04-18 23:11:57 +08:00
  • 686623c5e7 Fix nullable_kvs fallback (#16837) Harry Mellor 2025-04-18 13:58:39 +01:00
  • aadb656562 [Misc] Clean up Kimi-VL (#16833) Cyrus Leung 2025-04-18 20:15:09 +08:00
  • 87e067de41 [Model] use AutoWeightsLoader for BigCode, GPT-J (#16823) Jonghyun Choe 2025-04-18 19:42:41 +09:00
  • 26507f8973 [Docs] Fix a link and grammar issue in production-stack.md (#16809) Michael Yao 2025-04-18 14:42:58 +08:00
  • 9c1d5b456d [Doc] add podman setup instructions for official image (#16796) Nathan Weinberg 2025-04-18 02:10:49 -04:00
  • e31045f95c [Bugfix] fix pp for llama4 (#16746) Lucia Fang 2025-04-17 22:51:30 -07:00
  • aaec845f8e [ROCm] [Attention] Cleanup ROCm output passing (#16431) Luka Govedič 2025-04-18 01:46:45 -04:00
  • 7bdfd29a35 [Misc] add collect_env to cli and docker image (#16759) rongfu.leng 2025-04-18 13:13:35 +08:00
  • e78587a64c Improve-mm-and-pooler-and-decoding-configs (#16789) Harry Mellor 2025-04-18 06:13:32 +01:00