Commit Graph

  • d0feea31c7 [Kernel] optimize performance of gptq marlin kernel when n is small (#14138) Jinzhen Lin 2025-03-08 00:53:38 +08:00
  • 58abe35455 [Benchmarks] Make detokenization optional in benchmark scripts (#11697) Jeremy Arnold 2025-03-07 10:09:00 -06:00
  • f7ebad2307 [Doc] Update prefix_caching.md to match the example image (#14420) York-RDWang 2025-03-07 23:29:00 +08:00
  • 80e9afb5bc [V1][Core] Support for Structured Outputs (#12388) Aaron Pham 2025-03-07 10:19:11 -05:00
  • 1e3598edeb Use the optimized block sizes after tuning the kernel. (#14329) iefgnoix 2025-03-07 05:25:13 -08:00
  • f7a6bd0fa1 Fix missing kv_caches and attn_metadata in OpenVINOCausalLM (#14271) Harry Mellor 2025-03-07 13:30:42 +01:00
  • 0ca3b8e01c [BUGFIX] Skip tokenization support for throughput benchmark (#12712) Aleksandr Malyshev 2025-03-07 02:51:47 -08:00
  • cc10281498 [Misc] Set default value of seed to None (#14274) மனோஜ்குமார் பழனிச்சாமி 2025-03-07 16:10:01 +05:30
  • 05fb6718f0 [Bugfix] Clean up multi-modal processors (#14417) Cyrus Leung 2025-03-07 18:33:38 +08:00
  • 12c29a881f [Bugfix] Further clean up LoRA test (#14422) Jee Jee Li 2025-03-07 18:30:55 +08:00
  • 70da0c0748 correct wrong markdown syntax (#14414) Peng Li 2025-03-07 16:01:18 +08:00
  • c1588a2c94 [GH] Auto-apply multi-modality label to relevant PRs (#14402) Cyrus Leung 2025-03-07 15:26:32 +08:00
  • 8ca7a71df7 OpenVINO: added CPU-like conditions (#14338) Ilya Lavrenov 2025-03-07 10:24:49 +04:00
  • 63137cd922 [Build] Add nightly wheel fallback when latest commit wheel unavailable (#14358) Isotr0py 2025-03-07 14:10:57 +08:00
  • ddd1ef66ec [Bugfix] Fix JambaForCausalLM LoRA (#14370) Jee Jee Li 2025-03-07 14:05:47 +08:00
  • e5e03c2c1b [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs (#14396) Lucas Wilkinson 2025-03-07 00:56:06 -05:00
  • e1744502c2 [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object (#14390) Luka Govedič 2025-03-07 00:20:16 -05:00
  • dae6896977 [Perf] Reduce MLA CPU overheads in V1 (#14384) Lucas Wilkinson 2025-03-06 22:59:14 -05:00
  • c34eeec58d [Bugfix] Correctly call cudaProfilerStop in benchmarks script (#14183) Brayden Zhong 2025-03-06 19:42:49 -05:00
  • ad60bbb2b2 [Doc] Fix a typo (#14385) Daniel Li 2025-03-06 16:31:52 -08:00
  • 0578e5a462 [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue (#14310) Chengji Yao 2025-03-06 15:31:05 -08:00
  • 04222984f8 [Docs] Add nsight guide to profiling docs (#14298) Michael Goin 2025-03-06 17:19:58 -05:00
  • 6832707e90 [V1][Bugfix] Standardize quantized kv cache rejection for attention backends (#14221) Michael Goin 2025-03-06 17:18:29 -05:00
  • 6b2ef5cd17 [Bug] Fix Attention when ignored in by quant_method (#14313) Michael Goin 2025-03-06 17:18:06 -05:00
  • 958adce478 [Bugfix] Fix use_direct_call condition in FusedMoE layer for (#14382) Tyler Michael Smith 2025-03-06 17:17:21 -05:00
  • 99b0915d3b [Kernel] Add needs_fixed_stride_order tag to most GEMMs (#14306) Tyler Michael Smith 2025-03-06 17:17:09 -05:00
  • 8ca2b21c98 [CI] Disable spawn when running V1 Test (#14345) Thomas Parnell 2025-03-06 22:52:46 +01:00
  • d9292786e1 [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa (#13569) Michael Goin 2025-03-06 16:08:36 -05:00
  • cc2f9b32c8 [Distributed] Add enable_expert_parallel arg (#14305) Tyler Michael Smith 2025-03-06 13:54:45 -05:00
  • cd579352bf [V1] Do not detokenize if sampling param detokenize is False (#14224) Himanshu Jaju 2025-03-06 19:40:24 +01:00
  • 9f1710f1ac Fix mla prefill context performance (#13897) Ying Zhong 2025-03-07 01:35:49 +08:00
  • e642ec962c Add authors to license header. (#14371) Thomas Parnell 2025-03-06 17:43:09 +01:00
  • ada19210a3 Adding cpu inference with VXE ISA for s390x architecture (#12613) Dilip Gowda Bhagavan 2025-03-06 22:10:53 +05:30
  • bf0560bda9 Reinstate best_of for V0 (#14356) Harry Mellor 2025-03-06 17:34:22 +01:00
  • 151b08e0fe [RLHF] use worker_extension_cls for compatibility with V0 and V1 (#14185) youkaichao 2025-03-07 00:32:46 +08:00
  • 81b2f4a45f [Doc] Fix date typo in README.md (#14366) Jitse Klomp 2025-03-06 17:29:57 +01:00
  • 82551ad616 [Core] Don't use cache during multi-modal profiling (#14336) Cyrus Leung 2025-03-07 00:03:31 +08:00
  • caac5c2e59 [Bugfix][Core] fix abort_seq_group and memory leak when n>1 (#14326) courage17340 2025-03-06 23:59:32 +08:00
  • 6bd1dd9d26 [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152) Thomas Parnell 2025-03-06 16:39:16 +01:00
  • 4f27044aab [Doc] Correct beam_search using in generative_models.md (#14363) Irina Yuryeva 2025-03-06 18:37:10 +03:00
  • 0ddc991f5c [Doc] Update reasoning with stream example to use OpenAI library (#14077) Yanyi Liu 2025-03-06 21:20:37 +08:00
  • fa82b93853 [Frontend][Docs] Transcription API streaming (#13301) Nicolò Lucchesi 2025-03-06 11:39:35 +01:00
  • 69ff99fdcd [Core] Optimizing cross-attention QKVParallelLinear computation (#12325) Nicolò Lucchesi 2025-03-06 10:37:26 +01:00
  • 5d802522a7 [V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 (#14275) lkchen 2025-03-06 00:58:41 -08:00
  • 1769928079 [Model] Update Paligemma multimodal processing with PromptUpdate (#14015) kYLe 2025-03-06 02:31:38 -06:00
  • ed6ea06577 [Hardware] Update the flash attn tag to support Blackwell (#14244) Pavani Majety 2025-03-05 22:01:37 -08:00
  • 5ee10e990d [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (#11301) Nicolò Lucchesi 2025-03-06 05:00:53 +01:00
  • 3dbd2d813a [V1] LoRA - Enable more V1 tests (#14315) Varun Sundar Rabindranath 2025-03-05 22:55:42 -05:00
  • f5f7f00cd9 [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 (#14114) Ce Gao 2025-03-06 11:49:20 +08:00
  • abcc61e0af [misc] Mention ray list nodes command to troubleshoot ray issues (#14318) Rui Qiao 2025-03-05 18:00:36 -08:00
  • f6bb18fd9a [BugFix] MLA + V1, illegal memory access and accuracy issues (#14253) Lucas Wilkinson 2025-03-05 20:10:13 -05:00
  • 71eaf8969b [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation (#13850) Yuan Tang 2025-03-05 20:09:29 -05:00
  • ca100c90fe Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917) Michael Goin 2025-03-05 20:08:51 -05:00
  • ffad94397d [CI/Build] Use spawn multiprocessing mode for V1 test pipeline (#14243) Russell Bryant 2025-03-05 20:08:02 -05:00
  • 4dacaa4a83 [BugFix] Fix prefix caching V0 MLA (#14255) Lucas Wilkinson 2025-03-05 20:07:42 -05:00
  • a7ea35aa67 [Bugfix] Remove num_tokens_across_dp (#14302) Tyler Michael Smith 2025-03-05 18:55:55 -05:00
  • 1e3e76b6cc [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch (#14237) pyc96 2025-03-05 14:22:40 -08:00
  • 53ea6ad830 [V1][Easy] Add empty allowed_token_ids in the v1 sampler test (#14308) Lu Fang 2025-03-05 13:41:18 -08:00
  • 1b7624bf5c [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env (#14267) Serena 2025-03-06 05:28:50 +08:00
  • ac60dc7fe1 [V1][BugFix] Fix for mixed top_k batch (#14301) Nick Hill 2025-03-05 12:43:04 -08:00
  • a4f1ee35d6 Deprecate best_of Sampling Parameter in anticipation for vLLM V1 (#13997) Vincent 2025-03-05 15:22:43 -05:00
  • a32c8669ca [V1][Minor] Remove obsolete FIXME comment (#14304) Nick Hill 2025-03-05 11:59:23 -08:00
  • ca2ca8de57 [Docs] Add Meta Slides (#14297) Simon Mo 2025-03-05 08:30:23 -08:00
  • f71b00a19e [Bugfix] Fix broken vision language example (#14292) Isotr0py 2025-03-05 23:57:10 +08:00
  • 8f808cf86e prefix_caching.md: Fixed typo (#14293) DaividFrank 2025-03-05 16:43:13 +01:00
  • 7bab4bb048 [Misc] Add Qwen2MoeForCausalLM moe tuning support (#14276) Jee Jee Li 2025-03-05 23:11:29 +08:00
  • e17e4488bd [LoRA] Remove linear hack outside transformers backend (#14177) Isotr0py 2025-03-05 23:06:28 +08:00
  • 257e200a25 [V1][Frontend] Add Testing For V1 Runtime Parameters (#14159) Robert Shaw 2025-03-05 14:18:55 +00:00
  • 47d4a7e004 Small update for external_launcher backend docs (#14288) Zhe Zhang 2025-03-05 05:30:00 -08:00
  • 7f89a594dd [Doc] [3/N] Refer code examples for common cases in dev multimodal processor (#14278) Cyrus Leung 2025-03-05 20:29:50 +08:00
  • 961644e6a8 [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (#14217) Iacopo Poli 2025-03-05 12:44:10 +01:00
  • 8d6cd32b7b [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (#14169) Lu Fang 2025-03-05 00:49:44 -08:00
  • ec79b67c77 [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing (#14256) Roger Wang 2025-03-04 23:37:16 -08:00
  • 32985bed7c [Frontend] Allow return_tokens_as_token_ids to be passed as a request param (#14066) Benjamin Chislett 2025-03-05 01:30:40 -05:00
  • dae9ec464c Temporarily disable test_awq_gemm_opcheck (#14251) Michael Goin 2025-03-05 01:10:35 -05:00
  • 6eaf93020d [platforms] improve rocm debugging info (#14257) youkaichao 2025-03-05 13:32:18 +08:00
  • 72c62eae5f [V1] EP/TP MoE + DP Attention (#13931) Tyler Michael Smith 2025-03-05 00:27:26 -05:00
  • 0a995d5434 [Model] New model support for Phi-4-multimodal-instruct (#14119) Congcong Chen 2025-03-04 20:57:01 -08:00
  • ade3f7d988 [V1][Bugfix] Do not reset prefix caching metrics (#14235) Cody Yu 2025-03-04 20:39:13 -08:00
  • 0df25101d6 [Bugfix] Fix gptq_marlin for deepseek-v3 (#13750) rainkert 2025-03-05 12:25:53 +08:00
  • e123aafdf0 Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (#14157) Michael Goin 2025-03-04 23:25:24 -05:00
  • 5b143d33be Moved numba from common requirements to cuda/rocm specific requirements (#14199) Nishidha 2025-03-05 09:55:00 +05:30
  • eb59b5a6cb [misc] announce china meetup (#14248) youkaichao 2025-03-05 10:33:50 +08:00
  • fbfc3ee37e [V1][TPU] TPU multimodal model support for ragged attention (#14158) Michael Goin 2025-03-04 19:58:48 -05:00
  • 3e1d223626 [ROCm] Disable a few more kernel tests that are broken on ROCm (#14145) Sage Moore 2025-03-04 15:37:55 -08:00
  • 4f5b059f14 Clean up unused padding_idx variables across many model definitions (#13240) Tyler Michael Smith 2025-03-04 16:27:00 -05:00
  • 288ca110f6 [Security] Serialize using safetensors instead of pickle in Mooncake Pipe (#14228) Kuntai Du 2025-03-04 15:10:32 -06:00
  • c2bd2196fc [v1][Metrics] Add design doc (#12745) Mark McLoughlin 2025-03-04 20:36:55 +00:00
  • 550c7ba3dc [Docs] Update Dockerfile dependency image (#14215) Michael Goin 2025-03-04 15:22:11 -05:00
  • e5b2f1601a [Frontend] Do prompt_logprobs clamping for chat as well as completions (#14225) Harry Mellor 2025-03-04 21:13:06 +01:00
  • 9badee53de Fix performance when --generation-config is not None (#14223) Harry Mellor 2025-03-04 20:59:22 +01:00
  • beebf4742a [TPU][Profiler] Support start_profile/stop_profile in TPU worker (#13988) Siyuan Liu 2025-03-04 11:40:06 -08:00
  • f89978ad7c add cutlass support for blackwell fp8 gemm (#13798) kushanam 2025-03-04 07:55:07 -08:00
  • b3cf368d79 [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py (#14161) lkchen 2025-03-04 07:43:59 -08:00
  • c8525f06fc [V0][Metrics] Deprecate some questionable request time metrics (#14135) Mark McLoughlin 2025-03-04 15:11:33 +00:00
  • 5db6b2c961 [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (#13869) Nick Hill 2025-03-04 07:06:47 -08:00
  • 6247bae6c6 [Bugfix] Restrict MacOS CPU detection (#14210) Michael Goin 2025-03-04 09:25:27 -05:00
  • 3610fb4930 [doc] add "Failed to infer device type" to faq (#14200) youkaichao 2025-03-04 20:47:06 +08:00
  • 71c4b40562 [sleep mode] error out with expandable_segments (#14189) youkaichao 2025-03-04 18:54:19 +08:00
  • ac65bc92df [platform] add debug logging during inferring the device type (#14195) youkaichao 2025-03-04 18:39:16 +08:00