Commit Graph

  • bca55b556f [Bugfix] fix adding bias twice in ipex GPTQ quantization (#18363) Random Fly 2025-05-20 15:54:33 +08:00
  • d981396778 [release] Change dockerhub username for TPU release (#18389) Kevin H. Luu 2025-05-19 23:49:23 -07:00
  • 9609327fa4 [Core] [Bugfix]: tensor parallel with prompt embeds (#18171) Nan Qin 2025-05-19 22:21:27 -05:00
  • f07a673eb2 [Misc] Allow AutoWeightsLoader to skip loading weights with specific substr in name (#18358) Isotr0py 2025-05-20 11:20:12 +08:00
  • d565e0976f [neuron] fix authorization issue (#18364) Liangfu Chen 2025-05-19 16:30:32 -07:00
  • 258bf621d5 fix CUDA_check redefinition in #17918 (#18287) Lucia Fang 2025-05-19 13:42:35 -07:00
  • dc1440cf9f Neuron up mistral (#18222) Satyajith Chilappagari 2025-05-19 09:54:47 -07:00
  • 8171221834 [Misc] Fix typo (#18330) Gong Shufan 2025-05-20 00:51:01 +08:00
  • 7937c2fd52 Add files via uploadAdd fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup (#18337) sunyicode0012 2025-05-20 00:49:57 +08:00
  • e2ee1e8e9e [Feature]Add support for models quantized with AutoRound (#17850) Wenhua Cheng 2025-05-20 00:38:53 +08:00
  • 20d8ce81eb [Frontend] add --quick option for vllm chat/complete (#18297) Reid 2025-05-20 00:36:13 +08:00
  • 84ab4feb7e [Doc] Fix typo (#18355) Elad Segal 2025-05-19 19:05:16 +03:00
  • 6781af5608 [Quantization] Pool model support bitsandbytes (#18087) Jee Jee Li 2025-05-20 00:03:43 +08:00
  • 1b15df2546 [BugFix] Fix handling of num_computed_tokens with connector (#18232) Nick Hill 2025-05-19 09:03:25 -07:00
  • 43b5f61dce [Doc] Move input-related docs to Features (#18353) Cyrus Leung 2025-05-19 23:08:39 +08:00
  • c5bb0ebdc6 [Doc] Fix prompt embedding examples (#18350) Li Wang 2025-05-19 21:48:16 +08:00
  • d637b96099 [BugFix] [Vul] Add missing usedforsecurity=False in MD5 hashing to enable FIPS (#18319) Shaoyu Yang 2025-05-19 16:31:23 +08:00
  • 275c5daeb0 fix: Add type specifications for CLI arguments in tensorizer options (#18314) CYJiang 2025-05-19 14:42:17 +08:00
  • 47fda6d089 [Build] Supports CUDA 12.6 and 11.8 after Blackwell Update (#18316) Simon Mo 2025-05-18 23:19:33 -07:00
  • 27d0952600 [Misc] extract parser.parse_args() (#18323) Reid 2025-05-19 12:06:26 +08:00
  • 221cfc2fea Feature/vllm/input embedding completion api (#17590) Nan Qin 2025-05-18 22:18:05 -05:00
  • 9da1095daf [Spec Decode][V0] Fix spec decode correctness test in V0 eagle/medusa (#18175) wwl2755 2025-05-18 21:49:46 -05:00
  • d1211f8794 [Doc] Add doc to explain the usage of Qwen3 thinking (#18291) Robin 2025-05-19 07:04:07 +08:00
  • b6a6e7a529 [Misc] add litellm integration (#18320) Reid 2025-05-18 23:32:30 +08:00
  • 4fb349f66a Fix copy-paste error in phi4mm image processing (#18315) Lifu Huang 2025-05-18 07:00:12 -07:00
  • 908733aca7 [Model] Use sigmoid for single-label classification (#18313) 22quinn 2025-05-18 07:00:09 -07:00
  • 1a8f68bb90 [doc] update reasoning doc (#18306) Reid 2025-05-18 21:59:14 +08:00
  • 9ab2c02ff8 Support sequence parallelism combined with pipeline parallelism (#18243) cascade 2025-05-17 15:47:25 -07:00
  • 66e63e86ec [MISC] fix typo (#18305) Ning Xie 2025-05-18 01:52:09 +08:00
  • 9214e60631 [Model] use AutoWeightsLoader for solar (#18113) rongfu.leng 2025-05-17 15:24:17 +08:00
  • f880d42582 Fixed build on ppc64le due to openssl conflicts (#18262) Nishidha 2025-05-17 12:53:46 +05:30
  • dcfe95234c Update Dockerfile to build for Blackwell (#18095) Michael Goin 2025-05-17 03:23:25 -04:00
  • 48ac2bed5b [Hardware][TPU] Optionally import for TPU backend (#18269) Siyuan Liu 2025-05-17 00:23:12 -07:00
  • 3e0d435027 [P/D][V1] Support dynamic loading of external KV connector implementations (#18142) David Ben-David 2025-05-17 09:40:39 +03:00
  • 4ee4826ede [BugFix] Correct max_model_len derivation from config.json for Mistral format (#17937) 汪志鹏 2025-05-16 21:20:13 -07:00
  • 60017dc841 [Misc] reformat the collect-env output (#18285) Reid 2025-05-17 10:46:18 +08:00
  • 55f1a468d9 Move cli args docs to its own page (#18228) (#18264) Trevor Royer 2025-05-16 19:43:45 -07:00
  • fd195b194e [V1][P/D] Local attention optimization for NIXL (#18170) Michael Goin 2025-05-16 21:16:33 -04:00
  • fabe89bbc4 [Spec Decode] Don't fall back to V0 when spec decoding is enabled (#18265) Woosuk Kwon 2025-05-16 16:10:27 -07:00
  • e73b7dfd69 [Bugfix] fix an illegal memory access was encountered of marlin kernel + act_order (#18245) Jinzhen Lin 2025-05-17 07:02:44 +08:00
  • 7fdfa01530 [Sampler] Adapt to FlashInfer 0.2.3 sampler API (#15777) Bowen Wang 2025-05-16 15:14:03 -07:00
  • aef94c6d07 [CI] Assign reviewer to mergify with changes to Tensorizer files (#18278) Sanger Steel 2025-05-16 15:04:14 -04:00
  • 0ceaebf87b [BugFix] Fix ordering of KVConnector finished send/rcv sets (#18211) Nick Hill 2025-05-16 09:20:54 -07:00
  • 1db4f47f81 [BugFix] Fix multi async save in MultiConnector (#18246) Nick Hill 2025-05-16 08:13:47 -07:00
  • d3d91b6f71 [Misc][MacOS] fix bfloat16 error (#18249) Reid 2025-05-16 23:05:59 +08:00
  • 87d871470d [Model] Use autoweightloader for dbrx (#18251) learner0810 2025-05-16 22:54:13 +08:00
  • a5f8c111c2 [Fix] Fix typo in resolve_hf_chat_template (#18259) fxmarty-amd 2025-05-16 16:52:41 +02:00
  • e23564cb70 use ceil_div in cutlass block scaling shape check (#17918) Lain 2025-05-16 03:02:58 -07:00
  • 390ec88905 [Misc] Consolidate Audio tests into multimodal common generation tests (#18214) Isotr0py 2025-05-16 17:18:08 +08:00
  • 541817670c [Misc] Add Ray Prometheus logger to V1 (#17925) Seiji Eicher 2025-05-16 03:02:42 -05:00
  • 67da5720d4 [PERF] Speed up Qwen2.5-VL model by speed up rotary position embedding (#17973) Vadim Gimpelson 2025-05-16 10:31:02 +04:00
  • 5c04bb8b86 [doc] fix multimodal example script (#18089) David Xia 2025-05-16 02:05:34 -04:00
  • 3d2779c29a [Feature] Support Pipeline Parallism in torchrun SPMD offline inference for V1 (#17827) Lucia Fang 2025-05-15 22:28:27 -07:00
  • 6b31c84aff Throw better error for when running into k8s service discovery issue (#18209) Will Eaton 2025-05-16 00:07:28 -04:00
  • b18201fe06 Allow users to pass arbitrary JSON keys from CLI (#18208) Harry Mellor 2025-05-16 05:05:34 +01:00
  • f4937a51c1 [Model] vLLM v1 supports Medusa (#17956) Sky Lee 2025-05-16 12:05:31 +08:00
  • ee659e3b60 [Bugfix][ROCm] Use chunked_prefill_paged_decode as fallback for V1 attention on ROCm (#18093) kliuae 2025-05-16 10:30:17 +08:00
  • 4e1c6a0264 [Bugfix] fix rotary embedding test for _get_padded_tensor_shape (#18229) Lucas Wilkinson 2025-05-15 21:32:45 -04:00
  • c7852a6d9b [Build] Allow shipping PTX on a per-file basis (#18155) Lucas Wilkinson 2025-05-15 19:41:55 -04:00
  • 8795eb9975 [Bugfix] Fix test_eagle test (#18223) Lucia Fang 2025-05-15 15:59:42 -07:00
  • 0b34593017 Adding "AMD: Tensorizer Test" to amdproduction. (#18216) Alexei-V-Ivanov-AMD 2025-05-15 13:01:25 -05:00
  • e3f3aee6f4 [Misc] Avoid cuda graph log when sizes still match (#18202) Nicolò Lucchesi 2025-05-15 18:59:38 +02:00
  • 92540529c0 [Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 (#18205) TJian 2025-05-16 00:53:18 +08:00
  • fadb8d5c2d [Bugfix]Change the exception thrown by call_hf_processor from RuntimeError to ValueError (#18181) Zhonghua Deng 2025-05-16 00:01:47 +08:00
  • 2aa5470ac5 [Frontend] Fix chat template content format detection (#18190) Sebastian Schoennenbeck 2025-05-15 18:00:21 +02:00
  • 51ff154639 Improve examples rendering in docs and GitHub (#18203) Harry Mellor 2025-05-15 16:57:49 +01:00
  • 566ec04c3d Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3" in AMD Pipeline (#18106) Alexei-V-Ivanov-AMD 2025-05-15 10:49:23 -05:00
  • 01c22335ba [Kernel] [V1] Fix performance regression for triton unified attention (#18161) Thomas Parnell 2025-05-15 15:39:00 +02:00
  • 451da4bcbd add tools into TokenizeChatRequest (#18187) hustxiayang 2025-05-15 07:01:49 -04:00
  • 07ad27121f Update deprecated type hinting in model_loader (#18130) Harry Mellor 2025-05-15 12:00:21 +01:00
  • a9944aabfa fix: typos (#18151) omahs 2025-05-15 11:16:15 +02:00
  • a8f5aec20a [V1] Update zmq socket creation in nixl connector (#18148) Russell Bryant 2025-05-15 02:17:57 -04:00
  • de71fec81b [CI] don't skip fixed test_kv_cache_events() (#18183) David Xia 2025-05-15 02:17:16 -04:00
  • 70f8b96724 [Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends (#18178) Mengqing Cao 2025-05-15 14:16:31 +08:00
  • dd2a94596a [Model] Allow the use of sliding window in Qwen2 (#17772) inkcherry 2025-05-15 13:29:38 +08:00
  • 420caf7557 [UT] Add ut for none hash (#17892) Ning Xie 2025-05-15 13:28:11 +08:00
  • 4f07a64075 Support custom implementations of VideoLoader backends. (#18091) Chenheli Hua 2025-05-14 22:26:49 -07:00
  • e6b8e65d2d [Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 (#18013) Thomas Parnell 2025-05-15 07:26:34 +02:00
  • 26d0419309 Update deprecated type hinting in models (#18132) Harry Mellor 2025-05-15 06:06:50 +01:00
  • 83f74c698f [Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm (#18154) Luka Govedič 2025-05-15 01:04:43 -04:00
  • 2dff093574 [Misc] add lobe-chat support (#18177) Reid 2025-05-15 13:02:23 +08:00
  • afe3236e90 [Chore] astral's ty (#18116) Aaron Pham 2025-05-15 01:00:43 -04:00
  • 65334ef3b9 [V1][Metrics] Remove unused code (#18158) Mark McLoughlin 2025-05-15 04:13:17 +01:00
  • e60f550b38 [v1] Support multiple KV cache groups in GPU model runner (#17945) Chen Zhang 2025-05-15 09:54:54 +08:00
  • f25e0d1125 [Bugfix]: make most of test_openai_schema.py pass (#17664) David Xia 2025-05-14 20:04:35 -04:00
  • 09f106a91e Upload vllm index for the rc builds (#18173) Andrey Talman 2025-05-14 16:35:56 -07:00
  • 2142035b51 [V1] Support multiple kv connectors (#17564) Michael Goin 2025-05-14 19:28:02 -04:00
  • 78aa341d12 [CI] Fix race condition in test_kv_cache_events test (#18169) Russell Bryant 2025-05-14 19:27:48 -04:00
  • 7974736740 Add support for loading torchao models with AOPerModuleConfig (#17826) Jerry Zhang 2025-05-14 16:24:59 -07:00
  • 2fc9075b82 [V1] Structured Outputs + Thinking compatibility (#16577) Aaron Pham 2025-05-14 18:45:24 -04:00
  • d93c976a0d [Kernel] Have rotary embeddings support tensors (#18046) Lucas Wilkinson 2025-05-14 18:43:55 -04:00
  • 749f792553 [Frontend] decrease import time of vllm.multimodal (#18031) David Xia 2025-05-14 18:43:32 -04:00
  • 856865008e [CI] Disable Failing Tests (#18165) Robert Shaw 2025-05-14 16:49:56 -04:00
  • f9c069c85e Modularize fused experts and integrate PPLX kernels (#15956) bnellnm 2025-05-14 16:11:54 -04:00
  • 418d2f8bfb [V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model (#17326) Ekagra Ranjan 2025-05-14 15:31:46 -04:00
  • 964472b966 [Doc] Update prefix cache metrics to counting tokens (#18138) Chen Zhang 2025-05-14 23:23:30 +08:00
  • 59dd311cf5 [KVConnector] Keep KVTransferParams as a dict (#18033) Nick Hill 2025-05-14 08:05:57 -07:00
  • d066e52013 [Bugfix] Fix chat utils tests (#18139) Cyrus Leung 2025-05-14 20:38:21 +08:00
  • c8ea982d9b Update deprecated type hinting in platform, plugins, triton_utils, vllm_flash_attn (#18129) Harry Mellor 2025-05-14 13:28:16 +01:00
  • dc372b9c8a Update deprecated type hinting in vllm/device_allocator and vllm/distributed (#18126) Harry Mellor 2025-05-14 12:07:57 +01:00