Commit Graph

  • 44607e07d3 Check if selected backend is None in get_attn_backend_cls() (#12975) Yuan Tang 2025-02-09 22:45:07 -05:00
  • 67c4637ccf [V1] Use msgpack for core request serialization (#12918) Nick Hill 2025-02-09 19:35:56 -08:00
  • aa0ca5ebb7 [core][rlhf] add colocate example for RLHF (#12984) youkaichao 2025-02-10 10:28:59 +08:00
  • 59fff4a01a [core] improve error handling when wake up from sleep mode (#12981) youkaichao 2025-02-10 09:38:57 +08:00
  • 29f1d47e73 [MISC] Always import version library first in the vllm package (#12979) Lu Fang 2025-02-09 02:56:40 -08:00
  • cf797aa856 [core] port pynvml into vllm codebase (#12963) youkaichao 2025-02-09 15:00:00 +08:00
  • 24700c346b [V1] Cache uses_mrope in GPUModelRunner (#12969) Woosuk Kwon 2025-02-08 15:32:32 -08:00
  • d366ccc4e3 [RFC] [Mistral] FP8 format (#10130) Patrick von Platen 2025-02-08 22:12:53 +01:00
  • 870c37481e [V1][Minor] Remove outdated comment (#12968) Woosuk Kwon 2025-02-08 12:48:30 -08:00
  • 86222a3dab [VLM] Merged multi-modal processor for GLM4V (#12449) Jee Jee Li 2025-02-09 04:32:16 +08:00
  • fe743b798d [bugfix] fix early import of flash attention (#12959) youkaichao 2025-02-09 00:06:56 +08:00
  • 913df14da3 [Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU (#12935) shangmingc 2025-02-08 22:46:19 +08:00
  • 8a69e0e20e [CI/Build] Auto-fix Markdown files (#12941) Cyrus Leung 2025-02-08 20:25:15 +08:00
  • 4c8dd12ef3 [Misc] Add qwen2.5-vl BNB support (#12944) Isotr0py 2025-02-08 20:24:47 +08:00
  • 256a2d29dc [Doc] Correct HF repository for TeleChat2 models (#12949) Jun Duan 2025-02-08 04:42:15 -05:00
  • c45d398e6f [CI] Resolve transformers-neuronx version conflict (#12925) Liangfu Chen 2025-02-08 01:41:35 -08:00
  • 011e612d92 [Misc] Log time consumption on weight downloading (#12926) Jun Duan 2025-02-08 04:16:42 -05:00
  • 7e1837676a [misc] Add LoRA to benchmark_serving (#12898) Varun Sundar Rabindranath 2025-02-08 14:45:44 +05:30
  • 2880e21e3d [Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi (#12812) Sanju C Sudhakaran 2025-02-08 14:45:30 +05:30
  • 407b5537db [Build] Make pypi install work on CPU platform (#12874) wangxiyuan 2025-02-08 17:15:15 +08:00
  • 4ea48fb35c [V1][Minor] Move cascade attn logic outside _prepare_inputs (#12943) Woosuk Kwon 2025-02-08 00:39:09 -08:00
  • e31498bdcb [Misc] Add offline test for disaggregated prefill (#12418) Shaoting 2025-02-08 02:38:20 -06:00
  • 91dd8f7aa6 [bugfix] respect distributed_executor_backend in world_size=1 (#12934) youkaichao 2025-02-08 16:17:08 +08:00
  • d01f66b039 [Bugfix] Fix multi-round chat error when mistral tokenizer is used (#12859) zifeitong 2025-02-07 23:04:34 -08:00
  • cc01223f3b [Misc] Fix typo in the example file (#12896) Ke Zhao 2025-02-08 14:56:43 +08:00
  • 306923da82 [Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping (#12905) Jee Jee Li 2025-02-08 13:02:53 +08:00
  • 3243158336 [V1] Move KV block hashes from Request to KVCacheManager (#12922) Woosuk Kwon 2025-02-07 19:14:10 -08:00
  • b21f0f9d17 [V1][Minor] Remove outdated comment (#12928) Woosuk Kwon 2025-02-07 19:07:37 -08:00
  • 45cbc4991d [Bugfix] Fix disagg hang caused by the prefill and decode communication issues (#12723) Lu Fang 2025-02-07 16:39:50 -08:00
  • 932c6b7461 [V1] LM Eval With Streaming Integration Tests (#11590) Robert Shaw 2025-02-07 18:07:03 -05:00
  • eaa92d4437 [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (#12501) TJian 2025-02-08 00:13:43 +08:00
  • 0630d4537a [V1] Logprobs and prompt logprobs support (#9880) afeldman-nm 2025-02-07 10:26:20 -05:00
  • 538fab93cd PR #12718 (#12718) Amit Garg 2025-02-07 06:22:37 -08:00
  • ce26b16268 [Misc] Remove unnecessary detokenization in multimodal processing (#12868) Cyrus Leung 2025-02-07 22:21:17 +08:00
  • 1918aa1b80 [MISC][EASY] Break check file names into entry and args in the pre-commit hooks (#12880) Lu Fang 2025-02-07 05:04:39 -08:00
  • 6e1fc61f0f Prevent unecessary requests to huggingface hub (#12837) Maximilien de Bayser 2025-02-07 02:37:41 -03:00
  • aa375dca9f [Bugfix] Missing quant_config in deepseek embedding layer (#12836) Szymon Ożóg 2025-02-07 06:35:09 +01:00
  • 433c4a4923 Make vllm compatible with verl (#12824) ZSL98 2025-02-07 11:54:20 +08:00
  • ef533d25fb [Bugfix] FA2 illegal memory access (#12848) Lucas Wilkinson 2025-02-06 22:54:07 -05:00
  • b260782357 [misc] Revert # 12833 (#12857) Kevin H. Luu 2025-02-06 16:29:12 -08:00
  • 741429a4cd [MISC] Check space in the file names in the pre commit checks (#12804) Lu Fang 2025-02-06 15:36:21 -08:00
  • aff404571b Add Bamba Model (#10909) Yu Chin Fabian Lim 2025-02-07 07:22:42 +08:00
  • 467a96a541 [V1] LoRA Support (#10957) Varun Sundar Rabindranath 2025-02-06 23:02:51 +05:30
  • 8108ac841d [Bugfix] Fix unsupported FA version check for Turing GPU (#12828) Isotr0py 2025-02-07 01:18:22 +08:00
  • afe74f7a96 [Doc] double quote cmake package in build.inc.md (#12840) Jitse Klomp 2025-02-06 18:17:55 +01:00
  • 09b95e36ab [torch.compile] PyTorch 2.6 and nightly compatibility (#12393) youkaichao 2025-02-07 01:09:07 +08:00
  • 85ac82d228 [Kernel] Make rotary_embedding ops more flexible with input shape (#12777) Isotr0py 2025-02-07 00:46:13 +08:00
  • 1e57b1ee63 [Misc] Remove unnecessary decode call (#12833) Cyrus Leung 2025-02-07 00:45:44 +08:00
  • e152f29502 [misc] Reduce number of config file requests to HuggingFace (#12797) Kevin H. Luu 2025-02-06 06:59:18 -08:00
  • c786e757fa [Attention] Use FA3 for MLA on Hopper (#12807) Lucas Wilkinson 2025-02-06 06:43:12 -05:00
  • cefd56ee35 [Docs] Add Google Cloud Slides (#12814) Simon Mo 2025-02-06 01:02:38 -08:00
  • 7ca9934fe7 [Misc] Update w2 scale loading for GPTQMarlinMoE (#12757) Dipika Sikka 2025-02-06 04:02:14 -05:00
  • 0408efc6d0 [Misc] Improve error message for incorrect pynvml (#12809) (tag: v0.7.2) youkaichao 2025-02-06 15:23:50 +08:00
  • 449d1bce02 [Misc] Remove duplicated DeepSeek V2/V3 model definition (#12793) Michael Goin 2025-02-06 02:16:20 -05:00
  • 1a6fcad4c9 Improve TransformersModel UX (#12785) Harry Mellor 2025-02-06 06:24:57 +00:00
  • 56534cd577 [Bugfix] Fix the test_ultravox.py's license (#12806) Lu Fang 2025-02-05 21:25:54 -08:00
  • d88506dda4 [Model] LoRA Support for Ultravox model (#11253) Sumit Vij 2025-02-05 19:54:13 -08:00
  • 9cdea30b4f [Misc][Easy] Remove the space from the file name Lu Fang 2025-02-05 19:23:35 -08:00
  • 76abd0c881 [Bugfix] Better FP8 supported defaults Lucas Wilkinson 2025-02-05 22:22:19 -05:00
  • 5b19b93082 [ROCm][Kernel] Using the correct warp_size value Gregory Shtrasberg 2025-02-05 22:15:08 -05:00
  • 75404d041b [VLM] Update compatibility with transformers 4.49 Cyrus Leung 2025-02-06 11:09:45 +08:00
  • bf3b79efb8 [VLM] Qwen2.5-VL Roger Wang 2025-02-05 13:31:38 -08:00
  • 9a5b1554b4 [Docs] Drop duplicate [source] links Russell Bryant 2025-02-05 16:30:50 -05:00
  • a4ce74c14a [VLM] Use shared field to pass token ids to model Cyrus Leung 2025-02-06 05:30:46 +08:00
  • 3b2005e1db Add: Support for Sparse24Bitmask Compressed Models Rahul Tuli 2025-02-05 15:30:43 -06:00
  • af8486de49 [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) Sanju C Sudhakaran 2025-02-06 02:59:45 +05:30
  • 4c3aac51e1 Merging PR #12536 Chen Zhang 2025-02-06 05:24:26 +08:00
  • bc1bdecebf [core][distributed] exact ray placement control (#12732) youkaichao 2025-02-06 02:03:19 +08:00
  • 022bcc701a [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (#12546) Akash kaothalkar 2025-02-05 12:41:02 +05:30
  • c53dc466b1 [Doc] Remove performance warning for auto_awq.md (#12743) Michael Goin 2025-02-05 01:43:11 -05:00
  • 3d09e592a8 [V1][Misc] Shorten FinishReason enum and use constant strings (#12760) Nick Hill 2025-02-04 22:43:02 -08:00
  • fcf2e3d7fc [Bugfix] Fix OpenVINO model runner (#12750) Harry Mellor 2025-02-05 06:42:46 +00:00
  • 58b218d7ae [Doc] Update PR Reminder with link to Developer Slack (#12748) Michael Goin 2025-02-05 01:42:09 -05:00
  • 7ff7a638b6 [Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634) Kyle Sayers 2025-02-05 00:32:06 -05:00
  • 686006a220 [Misc] Bump the compressed-tensors version (#12736) Dipika Sikka 2025-02-04 23:44:48 -05:00
  • 98fd089fc9 [VLM] Add MLA with pure RoPE support for deepseek-vl2 models (#12729) Isotr0py 2025-02-05 12:44:26 +08:00
  • 249824c3bf Refactor Linear handling in TransformersModel (#12727) Harry Mellor 2025-02-05 04:31:12 +00:00
  • 64862d106e [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713) Aleksandr Malyshev 2025-02-04 19:58:22 -08:00
  • b3a0d01e45 [Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS (#12368) Aviv Keshet 2025-02-04 18:46:26 -08:00
  • 75e94309e8 [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676) Lucas Wilkinson 2025-02-04 21:22:24 -05:00
  • 233df6f5c4 [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579) Mark McLoughlin 2025-02-05 00:46:54 +00:00
  • 18016a5e62 [Bugfix] Fix CI failures for InternVL and Mantis models (#12728) Cyrus Leung 2025-02-04 23:54:23 +08:00
  • 649550f27e [Build] update requirements of no-device for plugin usage (#12630) Sophie du Couédic 2025-02-04 14:19:12 +01:00
  • 62467a834a Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722) Kero Liang 2025-02-04 21:03:19 +08:00
  • 6469038b14 [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689) Michael Greenbaum 2025-02-04 14:58:48 +02:00
  • 815079de8e [VLM] merged multimodal processor and V1 support for idefics3 (#12660) Isotr0py 2025-02-04 20:00:51 +08:00
  • 18a88fcccc [V1] Remove scheduling constraint on partial requests (#12674) Woosuk Kwon 2025-02-04 02:43:58 -08:00
  • d1ca7df84d [VLM] Merged multi-modal processor for InternVL-based models (#12553) Cyrus Leung 2025-02-04 16:44:52 +08:00
  • 96b23621c1 [Misc] Add BNB quantization for Whisper (#12381) Jee Jee Li 2025-02-04 16:27:36 +08:00
  • c36ac98d01 [AMD][ROCm] Enable DeepSeek model on ROCm (#12662) Hongxia Yang 2025-02-04 03:24:11 -05:00
  • 4896d0c2dd [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) Kyle Sayers 2025-02-04 02:27:11 -05:00
  • bb392af434 [Doc] Replace ibm-fms with ibm-ai-platform (#12709) Thomas Parnell 2025-02-04 02:05:04 -05:00
  • 5d98d56089 Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710) Michael Goin 2025-02-03 22:55:46 -05:00
  • 73b35cca7f [Core] Improve hash collision avoidance in prefix caching (#12621) Russell Bryant 2025-02-03 19:28:20 -05:00
  • 5095e96606 [V1] Revert uncache_blocks and support recaching full blocks (#12415) Cody Yu 2025-02-03 15:04:53 -08:00
  • cf58b9c4ca [MISC] Remove model input dumping when exception (#12582) Cody Yu 2025-02-03 13:34:16 -08:00
  • 4797dad3ec [Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) kushanam 2025-02-03 13:30:39 -08:00
  • 6dd5e52823 Squelch MLA warning for Compressed-Tensors Models (#12704) Kyle Sayers 2025-02-03 16:29:56 -05:00
  • c11de33dad [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696) Tyler Michael Smith 2025-02-03 16:04:59 -05:00
  • 33e0602e59 [Misc] Fix improper placement of SPDX header in scripts (#12694) Russell Bryant 2025-02-03 14:16:59 -05:00