Commit Graph

  • a89209b78d [v1] Support mamba2 (#19327) Chen Zhang 2025-06-19 04:34:15 +08:00
  • ffacb222cb [Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc (#19808) Russell Bryant 2025-06-18 16:22:28 -04:00
  • 12575cfa7a [Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully (#19725) Chauncey 2025-06-19 01:26:16 +08:00
  • 8b6e1d639c [Hardware][AMD] integrate aiter chunked prefill into vllm (#18596) Zzz9990 2025-06-18 23:46:51 +08:00
  • 735a9de71f [Qwen] Add tagging rule for Qwen related PRs (#19799) Lu Fang 2025-06-18 22:26:43 +08:00
  • 257ab95439 [Platform] Allow platform use V1 Engine by default (#19792) wangxiyuan 2025-06-18 21:03:36 +08:00
  • cca91a7a10 [doc] fix the incorrect label (#19787) Reid 2025-06-18 18:30:58 +08:00
  • f04d604567 [Minor] Zero-initialize attn output buffer (#19784) Woosuk Kwon 2025-06-17 23:59:27 -07:00
  • 19a53b2783 [V1] Decouple GPU and TPU InputBatch (#19778) afeldman-nm 2025-06-18 02:38:13 -04:00
  • eccdc8318c [V1][P/D] An native implementation of xPyD based on P2P NCCL (#18242) Zhonghua Deng 2025-06-18 14:32:36 +08:00
  • 5f52a84685 [V1] Add API docs for EncoderCacheManager (#19294) Russell Bryant 2025-06-18 01:37:01 -04:00
  • d4629dc43f [Misc] Add __str__ for RequestStatus (#19780) lkchen 2025-06-17 20:03:01 -07:00
  • 6e9cc73f67 [MISC] correct DeviceConfig device field static type analysis (#19699) Ning Xie 2025-06-18 08:21:50 +08:00
  • c53711bd63 [MISC] correct copy_blocks src_to_dists param type (#19696) Ning Xie 2025-06-18 08:21:06 +08:00
  • dac8cc49f4 [TPU] Update torch version to include paged attention kernel change (#19706) Chenyaaang 2025-06-17 15:24:49 -07:00
  • a44b1c951d [Feature][ROCm] Add full graph capture support for TritonAttentionBackend (#19158) Charlie Fu 2025-06-17 16:03:06 -05:00
  • b447624ee3 [Bugfix] Fix faulty triton importing logic when using Ray for DP (#19734) Michael Goin 2025-06-18 05:59:29 +09:00
  • cda92307c1 [Misc] Update lmcache connector with the latest connector apis (#19441) Jiayi Yao 2025-06-17 12:57:54 -07:00
  • bf57ccc5c2 Remove sm120 arch from sm100 cutlass kernel arch list (#19716) Michael Goin 2025-06-18 03:49:39 +09:00
  • ffb2cd6b54 [Perf] Optimize moe_align_block_size CUDA kernel (#19572) Wentao Ye 2025-06-17 14:49:26 -04:00
  • ca94d7fa00 [Bugfix] Update multimodel models mapping to fit new checkpoint after Transformers v4.52 (#19151) Isotr0py 2025-06-17 23:58:38 +08:00
  • 5a1c2e15d8 [Mis] remove duplicate engine status checks (#19647) CYJiang 2025-06-17 23:17:38 +08:00
  • 4c8f64faa7 [V1][Kernel] Flashinfer HND KV cache layout (#19280) Nicolò Lucchesi 2025-06-17 15:09:22 +02:00
  • 93aee29fdb [doc] split "Other AI Accelerators" tabs (#19708) David Xia 2025-06-17 09:05:29 -04:00
  • 154d063b9f [doc][mkdocs] Add edit button to documentation (#19637) Reid 2025-06-17 19:10:31 +08:00
  • ccd7c05089 [Kernel] Add Split-KV Support to Unified Triton Attention Kernel (#19152) jvlunteren 2025-06-17 12:45:07 +02:00
  • c48c6c4008 Add a doc on how to update PyTorch version (#19705) Huy Do 2025-06-17 03:10:37 -07:00
  • aed8468642 [Doc] Add missing llava family multi-image examples (#19698) Isotr0py 2025-06-17 15:05:21 +08:00
  • 5c76b9cdaf [Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager (#19686) quanliu 2025-06-17 12:40:58 +08:00
  • ddfed314f9 Fixes IMA for TP w/ flex-attention (#19712) Driss Guessous 2025-06-16 21:01:50 -07:00
  • 5b3ad5ecf2 [DOC] fix doc typos (#19600) Di Liu 2025-06-17 11:34:53 +08:00
  • ede5c4ebdf [Frontend] add chunking audio for > 30s audio (#19597) nguyenhoangthuan99 2025-06-17 10:34:00 +07:00
  • 07334959d8 [Wheel Size] Only build FA2 8.0+PTX (#19336) Lucas Wilkinson 2025-06-16 23:32:49 -04:00
  • 119f683949 [doc] add project flag to gcloud TPU command (#19664) David Xia 2025-06-16 21:00:09 -04:00
  • 0860087aff [Fix] Fall back to Gloo when NCCL backend is unavailable (#19641) Conroy Cheers 2025-06-17 10:42:14 +10:00
  • 6bc7b57315 [Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 (#19563) Dipika Sikka 2025-06-16 17:33:51 -04:00
  • 90f9c2eb5c [V1] Change return type on get_multimodal_embeddings() (#19446) Russell Bryant 2025-06-16 13:32:15 -04:00
  • 387bdf0ab9 [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (#19677) qscqesze 2025-06-17 00:47:14 +08:00
  • 5e5baa91aa [Kernels] Use empty for modular MoE workspaces (#19667) bnellnm 2025-06-16 10:58:01 -04:00
  • 836d4ce140 [Bugfix] fix missing 'finish_reason': null in streaming chat (#19662) Chauncey 2025-06-16 22:10:39 +08:00
  • c3fec47bb7 [MISC] bump huggingface_hub pkg to 0.33.0 (#19547) Ning Xie 2025-06-16 20:22:28 +08:00
  • 1173804dca [Bugfix] Fix TP inference for Flex attention backend (#19657) Isotr0py 2025-06-16 19:21:37 +08:00
  • 4d5424029b [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. (#19652) Shawn Tan 2025-06-16 07:14:18 -04:00
  • 3e7506975c [DOC] Add reasoning capability to vLLM streamlit code (#19557) Navanit Dubey 2025-06-16 16:39:12 +05:30
  • ee35e96ac3 [BugFix] Don't catch BaseException when dumping execute_model errors (#19626) Nick Hill 2025-06-16 04:01:08 -07:00
  • dec66d253b [Kernel] GGUF MMVQ kernel for multiple input vectors (#18754) Szymon Ożóg 2025-06-16 11:33:26 +02:00
  • 8d120701fd [Docs] Move multiproc doc to v1 dir (#19651) Russell Bryant 2025-06-16 05:10:12 -04:00
  • f40f763f12 [CI] Add mteb testing for rerank models (#19344) wang.yuqi 2025-06-16 16:36:43 +08:00
  • 26bc46ef89 [MISC] typo fix (#19672) Ning Xie 2025-06-16 15:18:49 +08:00
  • a77aea59fd [TPU] support attention head dim smaller than 128 (#19620) Chengji Yao 2025-06-15 23:40:53 -07:00
  • b692e9cd07 [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (#19660) Ye (Charlotte) Qi 2025-06-15 23:30:29 -07:00
  • 367871a469 [Misc][Frontend] passthrough bad_words (#19564) Francesco Bertolotti 2025-06-16 07:05:13 +02:00
  • 92183b41f3 [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (#18957) quanliu 2025-06-16 12:56:37 +08:00
  • c6703d1e0d [MISC] Remove unused variableds in C++ (#19609) Lu Fang 2025-06-16 11:05:28 +08:00
  • a5e7242d5f [Misc] Remove duplicate multiproc method setting for CPU platform (#19649) Isotr0py 2025-06-16 10:26:58 +08:00
  • 91b2c17a55 [CI/Build] Fix torch nightly CI dependencies part 2 (#19589) Richard Zou 2025-06-15 08:01:10 -04:00
  • 055915e6ce Enable prefix caching with full cuda graphs (#19617) Woosuk Kwon 2025-06-15 01:05:05 -07:00
  • 3d330c4c09 [Benchmark] Refactor benchmark script for fp8 & int8 (#19627) Wentao Ye 2025-06-15 03:15:37 -04:00
  • 0b73736a0d [Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check (#19339) 22quinn 2025-06-14 22:43:48 -07:00
  • ee1531bc38 [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (#19644) Lu Fang 2025-06-15 12:15:41 +08:00
  • e13945f9dd [Perf] Further tunings for SM100 FP8 CUTLASS kernel (#19566) Ilya Markov 2025-06-15 02:25:10 +02:00
  • 08500011d3 [Fix] Convert kv_transfer_config from dict to KVTransferConfig (#19262) maobaolong 2025-06-15 03:32:07 +08:00
  • 861a0a0a39 [Bugfix] Don't attempt to use triton if no driver is active (#19561) Konrad Zawora 2025-06-14 21:30:54 +02:00
  • bc956b38d0 Only build CUTLASS MoE kernels on Hopper (#19648) Huy Do 2025-06-14 11:44:15 -07:00
  • 294fc1e2c9 [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (#19500) jiahanc 2025-06-14 09:34:28 -07:00
  • 2db9044ab6 [Bugfix] Fix auto dtype casting for BatchFeature (#19316) Isotr0py 2025-06-14 23:13:08 +08:00
  • 6fa718a460 [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (#19593) Reid 2025-06-14 16:54:52 +08:00
  • 06be858828 [Bugfix] Fix the speculative decoding test by setting the target dtype (#19633) Lu Fang 2025-06-14 11:57:32 +08:00
  • d1e34cc9ac [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (#18354) Saheli Bhattacharjee 2025-06-14 04:07:36 +01:00
  • bd517eb9fe [BugFix] Fix DP Coordinator incorrect debug log message (#19624) Nick Hill 2025-06-13 17:18:03 -07:00
  • d65668b4e8 Adding "AMD: Multi-step Tests" to amdproduction. (#19508) Concurrensee 2025-06-13 19:08:51 -05:00
  • aafbbd981f [torch.compile] Use custom ops when use_inductor=False (#19618) Woosuk Kwon 2025-06-13 15:05:54 -07:00
  • 0f0874515a [Doc] Add troubleshooting section to k8s deployment (#19377) Anna Pendleton 2025-06-13 14:47:51 -07:00
  • 3597b06a4f [CUDA] Enable full cudagraph for FlashMLA (#18581) Luka Govedič 2025-06-13 14:12:26 -04:00
  • 1015296b79 [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (#19606) Reid 2025-06-14 00:25:08 +08:00
  • ce9dc02c93 [Refactor] Remove unused variables in moe_permute_unpermute_kernel.inl (#19573) Wentao Ye 2025-06-13 09:12:15 -04:00
  • a24cb91600 [Model] Fix minimax model cache & lm_head precision (#19592) qscqesze 2025-06-13 20:08:20 +08:00
  • 7e8d97dd3f [BugFix] Honor enable_caching in connector-delayed kvcache load case (#19435) Nick Hill 2025-06-13 02:46:32 -07:00
  • d70bc7c029 [torch.compile] reorganize the cache directory to support compiling multiple models (#19064) youkaichao 2025-06-13 15:23:25 +08:00
  • ce688ad46e use base version for version comparison (#19587) Boyuan Feng 2025-06-13 00:09:34 -07:00
  • cefdb9962d [Fix] The zip function in Python 3.9 does not have the strict argument (#19549) 汪志鹏 2025-06-13 14:57:48 +08:00
  • ace5cdaff0 [Fix] bump mistral common to support magistral (#19533) 汪志鹏 2025-06-13 13:28:12 +08:00
  • 6458721108 [CPU] Refine default config for the CPU backend (#19539) Li, Jiang 2025-06-13 13:27:39 +08:00
  • bb4a0decef [Misc] Correct broken docs link (#19553) Hyogeun Oh (오효근) 2025-06-13 14:27:13 +09:00
  • c707cfc12e [doc] fix incorrect link (#19586) Reid 2025-06-13 12:26:09 +08:00
  • 7b3c9ff91d [Doc] uses absolute links for structured outputs (#19582) Aaron Pham 2025-06-12 23:35:17 -04:00
  • c68698b326 [Bugfix] Fix EAGLE vocab embedding for multimodal target model (#19570) qizixi 2025-06-12 20:09:19 -07:00
  • e3b12667d4 [BugFix] : Fix Batched DeepGemm Experts (#19515) Varun Sundar Rabindranath 2025-06-12 22:43:02 -04:00
  • e6aab5de29 Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" (#19378) kourosh hakhamaneshi 2025-06-12 17:26:40 -07:00
  • c57bb199b3 [V1] Resolve failed concurrent structured output requests (#19565) Russell Bryant 2025-06-12 19:30:09 -04:00
  • dba68f9159 [Doc] Unify structured outputs examples (#18196) Aaron Pham 2025-06-12 18:50:31 -04:00
  • a3319f4f04 [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (#19452) Michael Goin 2025-06-12 15:39:15 -04:00
  • 9d880f594d [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (#19506) Varun Sundar Rabindranath 2025-06-12 14:01:16 -04:00
  • 017ef648e9 [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (#18847) Ekagra Ranjan 2025-06-12 13:30:56 -04:00
  • 4b25ab14e2 [doc] Make top navigation sticky (#19540) Reid 2025-06-12 23:48:11 +08:00
  • f98548b9da [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (#16756) Luka Govedič 2025-06-12 11:31:04 -04:00
  • 96846bb360 Fix TorchAOConfig skip layers (#19265) mobicham 2025-06-12 16:22:53 +02:00
  • b6efafd9e4 [Perf] Vectorize static / dynamic INT8 quant kernels (#19233) Wentao Ye 2025-06-12 09:51:41 -04:00
  • 1129e2b1ab [V1][NixlConnector] Drop num_blocks check (#19532) Nicolò Lucchesi 2025-06-12 14:36:14 +02:00
  • c742438f8b [Doc] Add V1 column to supported models list (#19523) Cyrus Leung 2025-06-12 19:16:44 +08:00