Commit Graph

  • a1a2aaadb9 [Model]: Add transformers backend support (#11330) Arthur 2025-02-03 14:30:38 +01:00
  • 1298a400e8 [ci/build] fix gh200 test (#12681) youkaichao 2025-02-03 15:59:49 +08:00
  • ad4a9dc817 [cuda] manually import the correct pynvml module (#12679) youkaichao 2025-02-03 15:58:21 +08:00
  • b9986454fe Fix for attention layers to remain unquantized during moe_wn16 quant (#12570) Srikanth Srinivas 2025-02-02 21:46:19 -08:00
  • c5932e5dac Properly check if all fused layers are in the list of targets (#12666) Eldar Kurtic 2025-02-03 06:42:18 +01:00
  • 20579c0fae make sure mistral_common not imported for non-mistral models (#12669) youkaichao 2025-02-03 13:40:25 +08:00
  • 95460fc513 [Kernel] port sgl moe_align_block_size kernels (#12574) Yang Chen 2025-02-02 21:09:50 -08:00
  • 326fcc8b9f [Doc] Deprecate Discord (#12668) Zhuohan Li 2025-02-02 19:19:56 -08:00
  • e64330910b [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667) youkaichao 2025-02-03 09:32:18 +08:00
  • e489ad7a21 [Misc] Add SPDX-License-Identifier headers to python source files (#12628) Russell Bryant 2025-02-02 14:58:18 -05:00
  • f256ebe4df [Hardware][Intel GPU] add XPU bf16 support (#12392) Kunshang Ji 2025-02-02 18:17:26 +08:00
  • f8ece6e17f [Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608) Shawn Du 2025-02-02 16:40:58 +08:00
  • abfcdcdf27 [V1][Minor] Avoid frequently creating ConstantList (#12653) Woosuk Kwon 2025-02-01 23:43:20 -08:00
  • e497f33491 [Core] Silence unnecessary deprecation warnings (#12620) Russell Bryant 2025-02-02 02:35:50 -05:00
  • baaa2b24da [Bugfix] fix moe_wna16 get_quant_method (#12648) Jinzhen Lin 2025-02-02 15:29:56 +08:00
  • b4e5c03306 doc: fixing minor typo in readme.md (#12643) Vicente Herrera 2025-02-01 18:17:29 +01:00
  • 3194039c0e Apply torch.compile to fused_moe/grouped_topk (#12637) Michael Goin 2025-02-01 11:16:19 -05:00
  • 4f4d427ac2 Disable chunked prefill and/or prefix caching when MLA is enabled (#12642) (tag: v0.7.1) Simon Mo 2025-01-31 23:46:57 -08:00
  • 1e3698393f [CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280) Russell Bryant 2025-02-01 02:13:10 -05:00
  • baeded2569 [Attention] Deepseek v3 MLA support with FP8 compute (#12601) Lucas Wilkinson 2025-02-01 00:52:51 -05:00
  • 3e1c76cf3a Fix: Respect sparsity_config.ignore in Cutlass Integration (#12517) Rahul Tuli 2025-01-31 23:41:59 -06:00
  • cfa134d247 [Bugfix/CI] Fixup benchmark_moe.py (#12562) Tyler Michael Smith 2025-02-01 00:41:35 -05:00
  • 35b7a05507 [ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599) Kevin H. Luu 2025-01-31 21:22:23 -08:00
  • 1867c258bd Fix target matching for fused layers with compressed-tensors (#12617) Eldar Kurtic 2025-02-01 06:07:46 +01:00
  • cb3e73e4c8 [BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161) fade_away 2025-02-01 12:52:07 +08:00
  • b1340f9d55 [V1] Bugfix: Validate Model Input Length (#12600) Robert Shaw 2025-01-31 21:32:04 -05:00
  • 44bbca78d7 [Doc] int4 w4a16 example (#12585) Brian Dellabetta 2025-01-31 17:38:48 -06:00
  • 60808bd4c7 [Doc] Improve installation signposting (#12575) Harry Mellor 2025-01-31 23:38:35 +00:00
  • fc542144c4 [Feature] Fix guided decoding blocking bitmask memcpy (#12563) Ryan Nguyen 2025-01-31 18:37:30 -05:00
  • eb5741ad42 [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587) Tyler Michael Smith 2025-01-31 18:29:11 -05:00
  • 145c2ff648 [Bugfix] Revert MoE Triton Config Default (#12629) Robert Shaw 2025-01-31 18:28:47 -05:00
  • 415f19474d [release] Add input step to ask for Release version (#12631) Kevin H. Luu 2025-01-31 13:39:36 -08:00
  • 89003c4082 [v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603) Chen Zhang 2025-02-01 05:13:04 +08:00
  • 60bcef000e [Docs][V1] Prefix caching design (#12598) Cody Yu 2025-01-31 12:30:46 -08:00
  • 847f883232 [Git] Automatically sign-off commits (#12595) Cody Yu 2025-01-31 12:30:33 -08:00
  • 325f679f32 [BugFix] Fix Torch.Compile For DeepSeek (#12594) Robert Shaw 2025-01-31 15:06:39 -05:00
  • e3f7ff65e7 Add favicon to docs (#12611) Harry Mellor 2025-01-31 17:20:34 +00:00
  • 7a8987dac5 [Bugfix] Gracefully handle huggingface hub http error (#12571) Roger Wang 2025-01-31 00:19:35 -08:00
  • cabaf4eff3 [Attention] MLA decode optimizations (#12528) Lucas Wilkinson 2025-01-31 02:49:37 -05:00
  • a1fc18c030 [ROCm][AMD][Model] llama 3.2 support upstreaming (#12421) Aleksandr Malyshev 2025-01-30 20:24:28 -08:00
  • 9798b2fb00 [Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling (#11868) Lucas Wilkinson 2025-01-30 21:33:00 -05:00
  • 4078052f09 [V1][Log] Add max request concurrency log to V1 (#12569) Michael Goin 2025-01-30 18:07:19 -05:00
  • bd2107e30a [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555) Nishidha 2025-01-31 02:59:39 +05:30
  • 9b0c4bab36 [Kernel] Triton Configs for Fp8 Block Quantization (#11589) Robert Shaw 2025-01-30 14:53:22 -05:00
  • 41bf5612f5 [Misc] fix typo: add missing space in lora adapter error message (#12564) Beim 2025-01-31 04:39:22 +13:00
  • a2769032ca Set ?device={device} when changing tab in installation guides (#12560) Harry Mellor 2025-01-30 08:05:42 +00:00
  • f17f1d4608 [V1][Metrics] Add GPU cache usage % gauge (#12561) Mark McLoughlin 2025-01-30 02:31:01 +00:00
  • 1c1bb0bbf2 [Misc][MoE] add Deepseek-V3 moe tuning support (#12558) Divakar Verma 2025-01-29 18:47:30 -06:00
  • e0cc5f259a [V1][BugFix] Free encoder cache for aborted requests (#12545) Woosuk Kwon 2025-01-29 13:47:33 -08:00
  • 73aa6cfdf7 Revert "[Build/CI] Fix libcuda.so linkage" (#12552) Tyler Michael Smith 2025-01-29 16:12:24 -05:00
  • 27b78c73ca [Kernel] add triton fused moe kernel for gptq/awq (#12185) Jinzhen Lin 2025-01-29 22:07:09 +08:00
  • b02fd288b2 [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. (#11787) Pavani Majety 2025-01-29 01:46:12 -08:00
  • ff7424f491 [Frontend] Support override generation config in args (#12409) Yanyi Liu 2025-01-29 17:41:01 +08:00
  • d93bf4da85 [Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069) Alphi 2025-01-29 17:24:59 +08:00
  • 036ca94c25 [Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347) Travis Johnson 2025-01-29 01:54:35 -07:00
  • ef001d98ef Fix the pydantic logging validator (#12420) Maximilien de Bayser 2025-01-29 04:53:13 -03:00
  • 5f671cb4c3 [V1] Improve Error Message for Unsupported Config (#12535) Robert Shaw 2025-01-28 23:56:56 -05:00
  • bd02164cf9 Bugfix for whisper quantization due to fake k_proj bias (#12524) Michael Goin 2025-01-28 23:49:03 -05:00
  • 46fb056749 [V1][Metrics] Add TTFT and TPOT histograms (#12530) Mark McLoughlin 2025-01-29 04:11:16 +00:00
  • dd6a3a02cb [Doc] Convert docs to use colon fences (#12471) Harry Mellor 2025-01-29 03:38:29 +00:00
  • a7e3eba66f [Frontend] Support reasoning content for deepseek r1 (#12473) Ce Gao 2025-01-29 11:38:08 +08:00
  • fbb5bd4cef [TPU] Add example for profiling TPU inference (#12531) Michael Goin 2025-01-28 22:16:47 -05:00
  • 80fcc3ed1c [Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels (#12482) fenghuizhang 2025-01-28 14:36:44 -08:00
  • c386c43ca3 [V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516) Mark McLoughlin 2025-01-28 22:07:22 +00:00
  • f26d790718 Do not run suggestion pre-commit hook multiple times (#12521) Harry Mellor 2025-01-28 20:05:27 +00:00
  • 0f657bdc52 Replace missed warning_once for rerank API (#12472) Michael Goin 2025-01-28 14:06:32 -05:00
  • 3fd1fb63ef [V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478) Mark McLoughlin 2025-01-28 16:38:38 +00:00
  • 925d2f1908 [Doc] Fix typo for x86 CPU installation (#12514) Jun Duan 2025-01-28 11:37:10 -05:00
  • 8f58a51358 [VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504) Cyrus Leung 2025-01-29 00:25:05 +08:00
  • 2079e43bee [Core] Make raw_request optional in ServingCompletion (#12503) Sebastian Schoennenbeck 2025-01-28 11:56:45 +01:00
  • e29d4358ef [V1] Include Engine Version in Logs (#12496) Robert Shaw 2025-01-28 03:27:41 -05:00
  • 8cbc424975 Update README.md with V1 alpha release (#12495) Roger Wang 2025-01-28 00:22:41 -08:00
  • dd66fd2b01 [CI] fix pre-commit error (#12494) Mengqing Cao 2025-01-28 14:11:05 +08:00
  • 0f465ab533 [FEATURE] Enables offline /score for embedding models (#12021) Gabriel Marinho 2025-01-28 00:30:13 -03:00
  • 23a7cbc88b [CI/Build] Fixed the xla nightly issue report in #12451 (#12453) Hossein Sarshar 2025-01-27 22:18:07 -05:00
  • 426a5c3625 Fix bad path in prometheus example (#12481) Michael Goin 2025-01-27 20:56:31 -05:00
  • ddee88d0ff [Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (#11277) Liangfu Chen 2025-01-27 17:31:16 -08:00
  • 823ab79633 Update pre-commit hooks (#12475) Harry Mellor 2025-01-28 00:23:08 +00:00
  • 6116ca8cd7 [Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132) Nicolò Lucchesi 2025-01-27 22:38:35 +01:00
  • 2bc3fbba0c [FlashInfer] Upgrade to 0.2.0 (#11194) Bowen Wang 2025-01-28 02:19:24 +08:00
  • 3f1fc7425a [V1][CI/Test] Do basic test for top-p & top-k sampling (#12469) Woosuk Kwon 2025-01-27 09:40:04 -08:00
  • 01ba927040 [V1][Metrics] Add initial Prometheus logger (#12416) Mark McLoughlin 2025-01-27 17:26:28 +00:00
  • 103bd17ac5 [Build] Only build 9.0a for scaled_mm and sparse kernels (#12339) Lucas Wilkinson 2025-01-27 10:40:00 -05:00
  • ce69f7f754 [Bugfix] Fix gpt2 GGUF inference (#12467) Isotr0py 2025-01-27 18:31:49 +08:00
  • 624a1e4711 [V1][Minor] Minor optimizations for update_from_output (#12454) Woosuk Kwon 2025-01-27 01:09:27 -08:00
  • 372bf0890b [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464) Isotr0py 2025-01-27 15:25:30 +08:00
  • 5204ff5c3f [Bugfix] Fix Granite 3.0 MoE model loading (#12446) (tag: v0.7.0) Cyrus Leung 2025-01-27 13:26:44 +08:00
  • 0cc6b383d7 [Frontend] Support scores endpoint in run_batch (#12430) Pooya Davoodi 2025-01-26 20:30:17 -08:00
  • 28e0750847 [V1] Avoid list creation in input preparation (#12457) Woosuk Kwon 2025-01-26 19:57:56 -08:00
  • 582cf78798 [DOC] Add link to vLLM blog (#12460) Yuan Tang 2025-01-26 21:46:19 -06:00
  • 0034b09ceb [Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376) Kyle Mistele 2025-01-26 20:58:45 -06:00
  • 72bac73067 [Build/CI] Fix libcuda.so linkage (#12424) Tyler Michael Smith 2025-01-26 16:18:19 -05:00
  • 68f11149d8 [Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434) Lucas Wilkinson 2025-01-26 14:09:34 -05:00
  • 72f4880425 [Bugfix/CI] Fix broken kernels/test_mha.py (#12450) Tyler Michael Smith 2025-01-26 13:39:03 -05:00
  • aa2cd2c43d [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417) Tyler Michael Smith 2025-01-26 06:59:58 -05:00
  • 9ddc35220b [Frontend] generation_config.json for maximum tokens (#12242) Matthew Hendrey 2025-01-26 06:59:25 -05:00
  • a5255270c3 [Misc] Revert FA on ViT #12355 and #12435 (#12445) Roger Wang 2025-01-26 03:56:34 -08:00
  • 0ee349b553 [V1][Bugfix] Fix assertion when mm hashing is turned off (#12439) Roger Wang 2025-01-26 00:47:42 -08:00
  • fa63e710c7 [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094) Keyun Tong 2025-01-26 00:42:37 -08:00
  • 2a0309a646 [Misc][Bugfix] FA3 support to ViT MHA layer (#12435) Roger Wang 2025-01-25 21:00:31 -08:00