Commit Graph

  • 4c23690f43 [Attention] FlashAttention ViT support, make default backend (#28763) Matthew Bonanni 2025-11-18 23:06:21 -05:00
  • 814843e021 Enable bitsandbytes quantization on AMD GPUs that use warp size 32 (#27307) Strahinja Stamenkovic 2025-11-19 04:12:31 +01:00
  • 20852c8f4c [CPU] Refactor CPU WNA16 (#28826) Li, Jiang 2025-11-19 10:32:00 +08:00
  • 40b6b38f2c [Core] Switch Flat logprob control from environment variable to SamplingParams (#28914) Jialin Ouyang 2025-11-18 18:10:02 -08:00
  • da94c7c0eb Move online quantization to model.load_weights (#26327) Jerry Zhang 2025-11-18 16:52:41 -08:00
  • 1395461f5f [Hybrid][torch.compile] Refactor mamba2 forward to avoid obscuring linear projections under custom op (#28587) tomeras91 2025-11-19 02:49:36 +02:00
  • 9912b8ccb8 [Build] Add OpenAI triton_kernels (#28788) Varun Sundar Rabindranath 2025-11-18 19:45:20 -05:00
  • 49ef847aa8 [NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 (#28938) Johnny 2025-11-19 01:44:27 +01:00
  • 67745d189f Supress verbose logs from model_hosting_container_standards (#28949) Michael Goin 2025-11-18 15:29:06 -05:00
  • 2a2d5d2780 Replace torch.cuda.Event with torch.Event for better hardware compatibility (#26985) Kunshang Ji 2025-11-19 03:34:36 +08:00
  • c3e2978620 [NIXL] fix cpu PD after physical <> logical block_size PR (#28904) Chendi.Xue 2025-11-18 13:03:23 -06:00
  • e4bb2684bc [Models] Replace all nn.Conv2d with vLLM's Conv2dLayer (#28842) Isotr0py 2025-11-19 02:56:04 +08:00
  • c64c0b78de [chore] Move the rest of wikimedia url to S3 (#28921) Kevin H. Luu 2025-11-18 09:44:18 -08:00
  • 0af3d4f0df [FEAT] [AITER] [ROCm] integrate aiter sampling ops (#26084) vllmellm 2025-11-19 01:28:34 +08:00
  • da8dadf68b [Minor] Rename ec_producer field to is_ec_producer (#28884) Nick Hill 2025-11-18 09:26:07 -08:00
  • f226a3f0c1 [CI][NIXL] Change default block_size for tests (#28927) Nicolò Lucchesi 2025-11-18 18:22:30 +01:00
  • c2612371ad [Model] Add Gemma3 GGUF multimodal support (#27772) Luciano Martins 2025-11-18 13:56:29 -03:00
  • 49a986ecd4 [Benchmark] multi_turn: Report warmup-inclusive runtime (#28937) Ido Segev 2025-11-18 18:38:22 +02:00
  • f6aa122698 [CI Sprint] Quantization CI Cleanup (#24130) Alex 2025-11-18 08:21:48 -06:00
  • 184b12fdc6 [Bugfix][NIXL] Fix block_size_ratio when logical !=physical blocks (#28925) Nicolò Lucchesi 2025-11-18 15:07:50 +01:00
  • b9489f51e1 [Model][Perf] Use cos and sin cache in QwenVL (#28798) Canlin Guo 2025-11-18 19:51:54 +08:00
  • 285eaa4285 [Bugfix] Safeguard against missing backend in AttentionBackendEnum (#28846) Song Zhixin 2025-11-18 18:53:44 +08:00
  • 439368496d [BugFix] Fix PP/async scheduling with pooling models (#28899) (tag: v0.11.1) Nick Hill 2025-11-18 00:20:45 -08:00
  • 896e41ae04 [CI/Build] Replace wikipedia url with local server ones (#28908) Isotr0py 2025-11-18 16:10:55 +08:00
  • 5bb1da5190 [MISC] Remove format.sh (#28906) Kuntai Du 2025-11-18 13:28:31 +08:00
  • 5bdd155277 [CI] Fix async scheduling + spec decoding test flake (#28902) Nick Hill 2025-11-17 21:26:32 -08:00
  • 0168f69e50 [Misc] Remove unnecessary parentheses from log statements (#28897) Ning Xie 2025-11-18 12:33:46 +08:00
  • 083cf326dc [Doc]: fix typos in various files (#28863) Didier Durand 2025-11-18 05:32:14 +01:00
  • bf9e1e8767 [Bugfix] Fix wrong CLI defaults for dynamic SchedulerConfig fields (#28872) Cyrus Leung 2025-11-18 12:30:29 +08:00
  • 3ddcf46011 [Refactor] Remove Unused Func in Batch Invariant (#28881) Wentao Ye 2025-11-17 23:29:29 -05:00
  • d0a73620cc [ROCm][Quantization] add apply_vllm_mapper in quark config for models like gpt-oss (#28638) xuebwang-amd 2025-11-18 11:16:45 +08:00
  • 88ab591f0b Run macos smoke test workflow on main commit (#28752) Michael Goin 2025-11-17 22:16:03 -05:00
  • b6e04390d3 [Bugfix] Fix Kimi-K2 tool parser concatenated tool calls parsing (#28831) Benjamin Bartels 2025-11-18 03:13:25 +00:00
  • 552cac95b5 [Misc] Fix wrong comment in scheduler (#28880) Zhuohan Li 2025-11-17 15:32:22 -08:00
  • 61485844fc [BugFix] Corner case that could cause out-of-sync with external launcher mode and dp >1 (#28774) Bangsheng Tang 2025-11-17 15:22:11 -08:00
  • f77bce001a [Model] Add Afmoe architecture implementation (#28332) Pranav 2025-11-17 15:11:20 -08:00
  • a289cc1dde [Test] Batch Invariant: Rename and organize tests (#27421) Wentao Ye 2025-11-17 18:09:47 -05:00
  • 95ae50b7d1 [Quantization] [Eagle] Add complete quantization support to the draft model in Eagle (#28435) Shreyas Kulkarni 2025-11-17 18:01:34 -05:00
  • 7765e5ba75 [BugFix] Fix PP performance and PP kv connector output regression (#28768) Nick Hill 2025-11-17 14:08:50 -08:00
  • d8874c61a5 [Core] Async Scheduling X Spec Decoding Compatibility (#24799) Ronald 2025-11-18 04:16:20 +08:00
  • f8b19c0ffd [Bugfix] Fix GPT-OSS on AMD after #28603 (#28816) Zhewen Li 2025-11-17 10:15:26 -08:00
  • e42bd8c2e3 Cast return value to int64_t for cache size (#28814) tiehexue 2025-11-18 00:02:32 +08:00
  • 7f064491f8 [Bugfix][Perf] Revert applying HF processor on text-only inputs for multimodal models (#28858) Roger Wang 2025-11-17 06:49:25 -08:00
  • 64e39d667c [BugFix] Temporary fix for IMA with MTP = 2 and full-cg (#28315) Lucas Wilkinson 2025-11-17 09:41:22 -05:00
  • 1b82fb0ad3 [XPU] work around for sp, avoid custom op import error (#28822) Kunshang Ji 2025-11-17 21:16:44 +08:00
  • d4acf518d0 [Metrics] Fix KV cache usage percent metric multiproc (#28792) Jae-Won Chung 2025-11-17 04:54:15 -05:00
  • ab01cd14e5 [BugFix] Fix glm4_moe_mtp load weights bug (#28805) wuyaoxuehun 2025-11-17 16:13:11 +07:00
  • 577bb34fff [CPU][Bugfix] Fix _to_list in CPU model runner (#28824) Li, Jiang 2025-11-17 15:47:24 +08:00
  • 3380ed5e11 [Doc] Add llama4 LoRA tag (#28825) Jee Jee Li 2025-11-17 14:08:48 +08:00
  • 6f37419244 [Bugfix][Model] Prevent special token leakage in KimiK2ToolParser streaming mode (#28543) Jay Caldwell 2025-11-16 23:54:46 -06:00
  • 60e089f0b9 [ROCm][Qwen3-32B] Fix AITER MHA accuracy issue cause by #25763 (#28670) Xiake Sun 2025-11-17 12:52:11 +08:00
  • d64429bb36 [NIXL][XPU] update install script of NIXL (#28778) liuzhenwei 2025-11-17 11:01:33 +08:00
  • 561253b37f [Performance][Fix] update nvfp4 code to support renorm routing (#28569) jiahanc 2025-11-16 18:02:42 -08:00
  • 80b6080ddc [BugFix] Fix async scheduling + chunked prefill + preemption (#28787) Nick Hill 2025-11-16 14:46:46 -08:00
  • 03ee48111d Feature: Support Relu2 in FusedMoE fp8 cutlass path (#27261) amirkl94 2025-11-16 20:39:44 +02:00
  • 5a87076d6e [Model][QwenVL] Optimize Qwen2_5_VisionAttention q,k preparation (#28769) Lukas Geiger 2025-11-16 17:37:15 +00:00
  • ac1daf3233 fix comment typo (#28802) Ning Xie 2025-11-17 01:03:21 +08:00
  • 63fed55506 [Doc]: fix typos in various files (#28811) Didier Durand 2025-11-16 15:30:06 +01:00
  • 8d259fad6c Fix gpt oss weight loading with EP + bf16 (#28765) Anna Shors 2025-11-16 05:12:45 -08:00
  • 3bc1175798 [Bugfix] Fix host and port join for ipv6 in bench serve (#28679) scottzh8 2025-11-16 02:20:57 -08:00
  • af02c40970 Fixed gpt-oss _load_weights_other() parameter position bug (#28715) Dezhan 2025-11-16 01:46:29 -08:00
  • b316ac6589 [V1] Support MP Executor for multi node distributed inference (#23691) Lucia Fang 2025-11-16 01:01:21 -08:00
  • a55b64635c [Model] Allow users to control skip reading cache per request. (#28194) wang.yuqi 2025-11-16 16:04:50 +08:00
  • d231876ce3 [Benchmark] Fix client seed synchronization in multi-turn benchmark (#28512) ai-jz 2025-11-15 23:04:32 -08:00
  • f67299f66d [compile] Enable sequence parallelism matching w/o custom ops enabled (#27126) (tag: v0.11.1rc7) Angela Yi 2025-11-15 03:46:12 -08:00
  • 5f6666fb5a LLaMA4 LoRA Adapter Enablement (#28602) Fardin Hoque 2025-11-14 10:27:56 -08:00
  • 66a62d73da [Bugfix][Nixl] Fix kernel physical<>logical block_size issue (#28677) Nicolò Lucchesi 2025-11-14 15:40:05 +01:00
  • c505dd6b61 [BugFix] Fix FA3 IMA with FULL_AND_PIECEWISE and cascade attention (default) (#28702) Lucas Wilkinson 2025-11-14 07:19:22 -05:00
  • f7adf64aac [BugFix] Fix multi-modal async scheduling race condition (#28706) Nick Hill 2025-11-14 01:11:13 -08:00
  • 240d6b1758 [Bugfix] fix dots.ocr pp support (#28705) Jiangyun Zhu 2025-11-14 17:01:26 +08:00
  • b315ba9052 [Misc] Update xformers to 0.33.0.post1 (#28678) Roger Wang 2025-11-13 21:52:53 -08:00
  • 9b24cf6f47 [bugfix] correct local_chunk_len for DCP in reorg_kvcache with long context (#28526) Qiu 2025-11-14 03:29:22 +08:00
  • facbc2c21e [BugFix] Ensure EngineArgs.create_engine_config is idempotent (#28515) Nick Hill 2025-11-13 09:14:08 -08:00
  • e2fd9a2edf [Misc] Turn off encoder torch compile by default (#28634) Roger Wang 2025-11-13 08:38:08 -08:00
  • 1326f17492 Use official xformers-0.0.33 built for PT 2.9 (#28600) Huy Do 2025-11-12 22:48:53 -08:00
  • caf412e593 Skip models that cannot currently init on Transformers v5 (#28471) Harry Mellor 2025-11-12 23:43:57 +00:00
  • a035b5cffb [CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken in current Transformers (#28559) Harry Mellor 2025-11-12 19:38:13 +00:00
  • 5b4dcecdd7 Remove deprecated fields from CompilationConfig (#27593) Harry Mellor 2025-11-12 16:10:28 +00:00
  • 609bb244bd [Performance] Cache loaded custom logitsprocs to avoid overheads (#28462) Isotr0py 2025-11-12 08:49:29 +08:00
  • 3a9ea77c35 [Bugfix] Fix max image size for PaddleOCR-VL (#28442) Roger Wang 2025-11-11 00:07:24 -08:00
  • 28a82bb5e6 [Bugfix] Fix Stream Sync for Shared Expert Overlap (#28430) Robert Shaw 2025-11-11 00:59:08 -05:00
  • 2a21f3e7c2 Only register rocm_aiter_ops if aiter is found (#28428) Michael Goin 2025-11-10 19:53:24 -07:00
  • ab625ba2fc [CI/Test Fix] Fix CP tests on Blackwell (#28404) Lucas Wilkinson 2025-11-10 20:36:29 -05:00
  • 324c8cbd79 [Feature] Refactor batch invariant fp8 DeepGEMM (#27606) Wentao Ye 2025-11-10 19:08:40 -05:00
  • 75ecaf48fe [Bugfix] Ensure calculated KV scales are applied in attention. (#27232) Adrian Abeyta 2025-11-10 17:42:37 -06:00
  • f849ee739c Adding a benchmark for batch invariance (#28161) Bram Wasti 2025-11-16 00:22:17 -05:00
  • be263f7645 [BugFix] Fix AssertionError: DCP not support reorder_batch_threshold > 1 now. (#28751) Lucas Wilkinson 2025-11-15 17:35:06 -05:00
  • 2bb4435cb7 [Doc]: fix typos in various files (#28567) Didier Durand 2025-11-15 20:27:50 +01:00
  • 07cadab27a [Model][Qwen3VL] Cache positional embedding indices (#28475) Lukas Geiger 2025-11-15 19:03:09 +00:00
  • 637f292196 [CI] Fix broken pipeline (#28781) Nick Hill 2025-11-15 08:44:14 -08:00
  • e439c784fa Add support for Eagle with separate lm-head and embed_tokens layers (#28549) Eldar Kurtić 2025-11-15 15:12:02 +01:00
  • 085a525332 [Model] Fix lmhead init bug of bailing_moe (#28777) hwhaokun 2025-11-15 21:44:12 +08:00
  • 89d3679221 [Doc] Fix failing doc build (#28772) Cyrus Leung 2025-11-15 21:33:27 +08:00
  • cb15ee28db Allow Gemma3 to take image embeddings (#28483) tingtinggithub 2025-11-15 04:18:08 -08:00
  • f36292dbee [compile] Enable sequence parallelism matching w/o custom ops enabled (#27126) Angela Yi 2025-11-15 03:46:12 -08:00
  • 173b356abf [PERF] Remove TRTLLM Gen attn kernel limitation max_seq_len <=131072 (#28755) Vadim Gimpelson 2025-11-15 14:13:41 +04:00
  • 638e4196d1 [Misc] Make SchedulerConfig.max_model_len init-only (#28733) Cyrus Leung 2025-11-15 17:59:31 +08:00
  • 1ec978c209 [Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 (#28709) Zhewen Li 2025-11-15 01:10:48 -08:00
  • 74b5267d3a Use narrow over indexing in hadacore_transform to prep for ABI stable (#28756) Jane (Yuan) Xu 2025-11-15 04:10:15 -05:00
  • dd6ac1c2bb [RL] [V1] Remove unused device argument from reset_kv_cache (#28766) Zhuohan Li 2025-11-14 23:59:42 -08:00