Commit Graph

  • da63274d9f [Bugfix][NIXL] Fix Async Scheduler timeout issue (#25808) Nicolò Lucchesi 2025-09-27 21:17:35 +02:00
  • c216119d64 [Core] GC Debug callback (#24829) Jialin Ouyang 2025-09-27 10:53:31 -07:00
  • 5546acb463 [Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (#25766) Clayton Coleman 2025-09-27 13:36:28 -04:00
  • c0ec81836f [torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (#25651) Jiangyun Zhu 2025-09-28 00:09:00 +08:00
  • b65e56babe [Core] Refactor self.model() to call a helper for subclassing. (#25084) Patrick C. Toulme 2025-09-27 11:40:59 -04:00
  • 49996cd597 [env] default nixl side port conflicts with kv-event zmq port (#25056) Peter Pan 2025-09-27 23:02:40 +08:00
  • ecb37e276a [docs] transcriptions API audio upload (#25446) yyzxw 2025-09-27 23:00:35 +08:00
  • a5354b3ed2 [Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (#24982) Tyler Michael Smith 2025-09-27 10:22:28 -04:00
  • f9df8b4ad7 [Bugfix] Fix triton import precommit failure (#25803) Tyler Michael Smith 2025-09-27 10:13:11 -04:00
  • ec152c8748 Fix GPTQ model loading in Transformers backend (#25770) Harry Mellor 2025-09-27 13:18:20 +01:00
  • 7977e5027c Add filtering for chat template kwargs (#25794) Russell Bryant 2025-09-27 06:46:49 -04:00
  • 3f5d902d2a Validate API tokens in constant time (#25781) Russell Bryant 2025-09-27 06:09:26 -04:00
  • 27d7638b94 [Bugfix] Merge MM embeddings by index instead of token IDs (#16229) Cyrus Leung 2025-09-27 16:15:12 +08:00
  • 176173989a [Bugfix] Add missing image_size for phi4_multimodal (#25796) Xiaohan Zou 2025-09-27 03:59:22 -04:00
  • 23b8ee672d [Misc] Update openai client example file for multimodal (#25795) Roger Wang 2025-09-27 00:57:07 -07:00
  • 3939152069 [Misc] Fix codeowners override for v1 sample and attention (#25037) 22quinn 2025-09-27 00:47:29 -07:00
  • cd87bfbf37 [CI/Build] Reorganize root-level V1 tests (#25767) Cyrus Leung 2025-09-27 13:51:15 +08:00
  • b3613e3ace [CI/Build] Add timing to Model Executor Test (#25799) 22quinn 2025-09-26 21:57:27 -07:00
  • d346ec695e [CI/Build] Consolidate model loader tests and requirements (#25765) Cyrus Leung 2025-09-27 12:45:20 +08:00
  • c242c98031 [Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL (#25788) Wentao Ye 2025-09-26 23:44:52 -04:00
  • f1d53d150c [Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement on qwen2.5vl (#22872) WeiQing Chen 2025-09-27 11:35:47 +08:00
  • 92da847cf5 Add flashinfer-build.sh and register precompiled cu128 wheel in Dockerfile (#25782) Michael Goin 2025-09-26 21:54:09 -04:00
  • 3958b96bf5 Add option to restrict media domains (#25783) Russell Bryant 2025-09-26 21:23:52 -04:00
  • 8bf8f45822 [Core] Don't count preempted tokens in prefix cache hit rate (#25787) Zhuohan Li 2025-09-26 17:16:40 -07:00
  • 6f5c0931c1 [Spec decode] automatically disable mm for text-only draft models (#25667) Jonas M. Kübler 2025-09-27 02:10:21 +02:00
  • 4e33a7ea85 [Bugfix] Optimize CpuGpuBuffer initialization (#25447) Naman Lalit 2025-09-26 17:07:36 -07:00
  • dc48ba0c75 Kernel-override Determinism [1/n] (#25603) Bram Wasti 2025-09-26 19:59:09 -04:00
  • 4778b42660 Reduce the Cuda Graph memory footprint when running with DBO (#25779) Sage Moore 2025-09-26 15:29:56 -07:00
  • c70ac4b8ff [spec decode] Consolidate speculative decode method name for MTP (#25232) qizixi 2025-09-26 15:27:05 -07:00
  • cf89202855 [CI] Fix FlashInfer AOT in release docker image (#25730) Michael Goin 2025-09-26 17:11:40 -04:00
  • f075693da7 [V1] address post issues related to #20059 (part 1) (#23046) fhl2000 2025-09-27 03:58:19 +08:00
  • f708bd4904 [CI] Add E2E Blackwell Quantized MoE Test (#25723) Michael Goin 2025-09-26 15:23:00 -04:00
  • 0002b7f0d1 [Docs] Add Toronto Meetup (#25773) Michael Goin 2025-09-26 15:00:46 -04:00
  • 11aafd9886 [Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Condition (#25355) Frank Wang 2025-09-26 11:54:00 -07:00
  • b761df963c [Doc]: improve CPU(x86) build-wheel-from-source section (#25617) (tag: v0.11.1rc0, tag: v0.11.0rc1) Clouddude 2025-09-26 13:26:33 -04:00
  • 33f6aaf972 Eagle3 that supports the Minicpm3 model (#24243) 阿丹(adan) 2025-09-27 01:04:57 +08:00
  • 56aafa8c0b [Misc] fix unique_filepath (#25732) Jiangyun Zhu 2025-09-27 00:56:15 +08:00
  • 8d52f2b3a7 [ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility in Ray (#25439) Seiji Eicher 2025-09-26 09:43:30 -07:00
  • 984d18498a [BugFix] Fix using dbo_decode_token_threshold always (and ignoring dbo_prefill_token_threshold) (#25622) Lucas Wilkinson 2025-09-26 12:22:49 -04:00
  • d4d9899860 [Quantization] Add field to skip unquantized modules for GPTQ config (#25455) Isotr0py 2025-09-26 23:47:41 +08:00
  • db1e42f627 [CI/Build] Fix some V1 tests not being run (#25569) Cyrus Leung 2025-09-26 20:52:36 +08:00
  • bc9d7b5595 [CI/Build] Split up Distributed Tests (#25572) Cyrus Leung 2025-09-26 20:49:33 +08:00
  • fe6b19c314 [Bugfix] Properly abort pooling request. (#25734) wang.yuqi 2025-09-26 20:47:34 +08:00
  • 2827b3f4a3 [CI] Fix test_shared_storage_connector_hashes (#25748) Chauncey 2025-09-26 20:46:17 +08:00
  • 2b6b1d7809 [Model] Mamba2 varlen refactor (#21467) Chih-Chieh Yang 2025-09-26 07:31:14 -04:00
  • 633f943e30 [Doc] Update Batch-level DP docs (#25757) Cyrus Leung 2025-09-26 17:37:40 +08:00
  • b03b1b97f6 Support LongCat-Flash-Chat tool call (#24083) Xu Wenqing 2025-09-26 17:25:39 +08:00
  • dfb9af2014 [Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk (#25698) Sage Moore 2025-09-26 01:25:28 -07:00
  • 19f76ee68e [misc] refactor speculative config (#25657) yyzxw 2025-09-26 16:22:06 +08:00
  • dd70437a4f Remove cuda hard-code in compute_causal_conv1d_metadata (#25555) Icey 2025-09-26 16:19:20 +08:00
  • 99b3a504c5 [Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. (#25743) Tao He 2025-09-26 16:18:58 +08:00
  • 6e30010d2f fix: print outputt offline_inference/base/chat.py example (#25744) Iceber Gu 2025-09-26 16:18:24 +08:00
  • 52621c8f5c [Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300X (#25703) xaguilar-amd 2025-09-26 10:18:20 +02:00
  • d48f4d6daf perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds is enabled (#25739) Andrew Sansom 2025-09-26 03:18:09 -05:00
  • e84e0735c7 fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions (#25738) Andrew Sansom 2025-09-26 03:18:05 -05:00
  • 3edf87d25f [CI/Build] fix doc build warning: Failed to get 'name: description' pair (#25733) yitingdc 2025-09-26 16:18:02 +08:00
  • 392edee34a EVS Support (Video tokens pruning) (#22980) Eugene Khvedchenya 2025-09-26 06:54:54 +03:00
  • 983056e456 [Misc] Remove unnecessary memoryviews in shm_broadcast.py (#25721) Nick Hill 2025-09-25 20:11:44 -07:00
  • 13dd93c667 [Core] Force PIECEWISE CUDAGraph mode for encoder-decoder (#25701) Russell Bryant 2025-09-25 21:21:56 -04:00
  • 53a30845be Llamas 3.1 405B fp4 changes upstreaming from 355_wip (#25135) Aleksandr Malyshev 2025-09-25 18:16:53 -07:00
  • 8b77328ffe [Misc] Don't log shm dequeue delay warning on worker side (#25720) Nick Hill 2025-09-25 18:08:30 -07:00
  • 9fe4c2bdb9 [Refactor] Remove DeepGEMM OP Register (#25710) Wentao Ye 2025-09-25 20:13:41 -04:00
  • 081b5594a2 Fix routing_bias dtype (#25711) Shu Wang 2025-09-25 18:35:14 -05:00
  • 57329a8c01 [Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 (#25708) tomeras91 2025-09-26 02:10:29 +03:00
  • 8c435c9bce [Core] Enable command line logging for LLMEngine (#25610) Zhuohan Li 2025-09-25 15:31:17 -07:00
  • e71b8e210d [Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. (#24986) Ekagra Ranjan 2025-09-25 18:22:03 -04:00
  • 89fa54e6f7 [Optimization] Use a cheaper cache key in get_model_architecture (#25682) Cyrus Leung 2025-09-26 05:54:20 +08:00
  • 3d54bdcb73 [Optimization] Streamline InputPreprocessor (#25702) Cyrus Leung 2025-09-26 05:06:49 +08:00
  • 6b0fcbbf43 [Misc] Simplify test_argsort_mm_positions (#25690) Cyrus Leung 2025-09-26 02:23:01 +08:00
  • 0fa673af4c [V0 deprecation] Clean up LoRA (#25686) Jee Jee Li 2025-09-26 02:12:33 +08:00
  • 3468f17ebe [V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names (#25489) Matthew Bonanni 2025-09-25 13:37:50 -04:00
  • 71b25b0d48 [V0 deprecation] Clean up V0 fallback in compilation config (#25675) Isotr0py 2025-09-26 01:29:51 +08:00
  • 0ea80c87d9 [Model] Define merge_by_field_config MM interface (#25676) Cyrus Leung 2025-09-26 01:13:07 +08:00
  • b8d9e4a326 [Model] Add optional parameter to reasoning parser constructor (#25554) Tao Hui 2025-09-26 01:12:50 +08:00
  • 13cc7f5370 [BugFix] Fix DBO hang (#25625) Lucas Wilkinson 2025-09-25 13:04:48 -04:00
  • 916bd9204d Revert "[Bug] Dynamo Unsupported due to BasevLLMParameter.torch_function calling disabled super()" (#25681) Michael Goin 2025-09-25 12:45:06 -04:00
  • e04a1b6b21 [BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… (#24662) AlonKejzman 2025-09-25 18:40:14 +03:00
  • 2e5df88c92 [Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning (#25532) Tyler Michael Smith 2025-09-25 11:16:06 -04:00
  • 0754ac4c49 [Misc] Remove cruft file in repo (#25678) Nicolò Lucchesi 2025-09-25 17:05:12 +02:00
  • 03858e6d1c [Bugfix] Fix InternS1 video processing after Transformers v4.56 (#25644) Isotr0py 2025-09-25 22:46:04 +08:00
  • 532a6cfccb [ux] Switch a warning to debug about a pytorch fallback (#23750) Russell Bryant 2025-09-25 10:38:16 -04:00
  • eb32335e35 [CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata (#25652) Li, Jiang 2025-09-25 21:29:11 +08:00
  • 69a8c8e99a [torch.compile] Make Query Quantization Fusable (#24914) Jonas M. Kübler 2025-09-25 15:25:12 +02:00
  • 6c340da4df [misc] log info messages by default for hanging / busy / idle (#25627) youkaichao 2025-09-25 21:14:57 +08:00
  • 2f17117606 [mypy] Fix wrong type annotations related to tuple (#25660) Cyrus Leung 2025-09-25 21:00:45 +08:00
  • 1e9a77e037 [Hardware][RISC-V] Add riscv64 support for vLLM with scalar (#22112) chenlang 2025-09-25 20:46:11 +08:00
  • d2af67441d [XPU][Triton]add xpu config in triton_reshape_and_cache_flash (#25643) Kunshang Ji 2025-09-25 20:38:11 +08:00
  • 0bcc3a160d [CI/Build] Fix flaky entrypoints test (#25663) Cyrus Leung 2025-09-25 20:19:40 +08:00
  • 70fbdb26e9 Add backward compatibility for guided_... API (#25615) Harry Mellor 2025-09-25 12:45:25 +01:00
  • 7f570f1caa [V0 deprecation] Remove unreachable model_config.supported_tasks (#25642) wang.yuqi 2025-09-25 19:26:31 +08:00
  • eaeca3cd7f [Bugfix] Parse SpeculativeConfig Error (#25142) yyzxw 2025-09-25 19:09:39 +08:00
  • 12c1287d64 [mypy] Further improve MM type annotations (#25654) Cyrus Leung 2025-09-25 18:57:36 +08:00
  • 17b4c6685c [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling (#25648) Isotr0py 2025-09-25 18:36:01 +08:00
  • 3c2b2ccece [Bugfix] Add triton.language.tensor placeholder (#25649) Agata Dobrzyniewicz 2025-09-25 12:31:14 +02:00
  • 7be9ffcd9f [Misc] Fix Qwen3-VL video_grid_thw typing (#25646) Roger Wang 2025-09-25 03:16:45 -07:00
  • 393de22d2e [fix] Update torch version in cpu-build.txt for AArch64/ppc64le and Darwin (#25579) Fadi Arafeh 2025-09-25 10:39:18 +01:00
  • 1260180c67 Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… (#25607) Tyler Michael Smith 2025-09-25 04:05:21 -04:00
  • af4ee63e0e typo: remove duplicate is (#25641) Nicole LiHui 🥜 2025-09-25 15:46:22 +08:00
  • bc092ea873 Map CwmForCausalLM to llama and LlamaForCausalLM (#25611) Jacob Kahn 2025-09-25 09:37:03 +02:00
  • 755ed7b05b [Misc] Simplify PoolerOutput and move to v1/outputs (#25629) Cyrus Leung 2025-09-25 14:47:03 +08:00