Commit Graph

  • b30372cbd0 [Perf] Move gc.freeze logic from EngineCoreProc to EngineCore for better coverage (#27896) Jialin Ouyang 2025-11-10 15:34:18 -08:00
  • d17ecc6b19 [PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds (#24248) Ilya Markov 2025-11-11 00:33:11 +01:00
  • 021143561f [ROCm] Add missing gemm_a8w8_blockscale import (#28378) Yong Hoon Shin 2025-11-10 13:13:36 -10:00
  • 30700b1cd7 [CI] Fix Plugin Tests Tests (#28413) v0.11.1rc6 Robert Shaw 2025-11-10 17:36:11 -05:00
  • 4b94ed8f92 [Frontend][2/n] remove empty content from _parse_tool_calls_from_content (#28331) Andrew Xia 2025-11-10 14:07:49 -08:00
  • 6dec9f6109 [BugFix] Fix DeepGEMM over-allocating workspace (#28254) Lucas Wilkinson 2025-11-10 17:01:17 -05:00
  • bf6a3d0ff5 [Misc] Add more scoping for improved trace (#28329) Wei Wei 2025-11-10 13:03:21 -08:00
  • 40d33264c6 [Bugfix][EPLB] Disabled shared expert overlap when EPLB is enabled (#28377) Sage Moore 2025-11-10 12:39:19 -08:00
  • 9c84ca8293 [FA/Chore] Bump FA version for FP8 two-level accumulation (#27889) Jonas M. Kübler 2025-11-10 21:06:04 +01:00
  • 6d54336ae5 [Bugfix] Fix llguidance backend, rollback when EOS was encountered (#25905) Rémi Delacourt 2025-11-10 20:53:32 +01:00
  • 34553b9d27 [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next (#27492) jiahanc 2025-11-10 09:34:57 -08:00
  • b039bfda8f [Bugfix] Fix persistent_masked_m_silu_mul_quant tests (#28366) Varun Sundar Rabindranath 2025-11-10 12:21:52 -05:00
  • d0e186c16f [V0 Deprecation] Remove unused context_len and seq_len from M-RoPE (#28395) Cyrus Leung 2025-11-11 00:30:06 +08:00
  • f080a83511 [RFC][ROCm][AITER] Keep all AITER kernels in _aiter_ops class like _custom_ops and _ipex_ops (#24490) vllmellm 2025-11-10 17:20:53 +01:00
  • 40e2eeeb92 [Kernel] Optimization of the mm_k operator. (#28280) caozuoba 2025-11-11 00:03:46 +08:00
  • b06b9470ca [Rocm][fused_moe][fp4] view weight to torch.float4_e2m1fn_x2 when running aiter fused moe for fp4 model (#27474) zejunchen-zejun 2025-11-10 23:38:56 +08:00
  • 4673e465ff Add @tjtanaa to codeowner for ROCm and multi-modal (#28360) TJian 2025-11-10 05:39:17 -08:00
  • 912744d066 [Fix] optimize visual token mask with caching and multi-token support (#28374) Ferrebo 2025-11-10 21:23:49 +08:00
  • 15be507c86 [bugfix] fix siglip batch text output error (#28365) Yu Jiaqi 2025-11-10 21:21:15 +08:00
  • 6f7de33bed [Metrics] Refactor LoRA state tracking (#26801) Mark McLoughlin 2025-11-10 08:34:36 +00:00
  • a98cc35c34 Restore PlaMo2 unit test as pfnet/plamo-2-1b now supports transformers >=4.56 (#28019) Shinichi Hemmi 2025-11-10 15:50:02 +09:00
  • e8697faf03 [V0 deprecation] Remove no longer used get_metadata_cls (#28370) Lucas Wilkinson 2025-11-10 01:32:09 -05:00
  • 03fa4d3fb3 [Hardware][AMD][Model] Add Triton MoE tuning support and optimized configs for Qwen3 omni for MI308X (#28373) Xiake Sun 2025-11-10 12:53:40 +08:00
  • 6b2b9fd934 [CI] lora/test_mixtral.py : Add additional expected outputs due to flakiness (#28322) Varun Sundar Rabindranath 2025-11-09 21:45:29 -05:00
  • c5f685b3ae [ROCm][Platform] Add RX7900XTX device id in _ROCM_DEVICE_ID_NAME_MAP (#28279) JartX 2025-11-10 00:09:36 +01:00
  • c4768dcf47 [Kernel] Fix fused_gdn_gating (#28343) Jiangyun Zhu 2025-11-10 05:26:35 +08:00
  • a65a934ebe [CI/Build] Temporary fix to LM Eval Small Models (#28324) Zhewen Li 2025-11-09 13:08:38 -08:00
  • 4a8d6bd168 Fix cu_num_generated_tokens slicing logic in LogprobsLists.slice() method (#28214) usberkeley 2025-11-10 03:11:46 +08:00
  • 636efd10a5 [Core] Separate out attention metadata building logic from prepare inputs (#26764) Lucas Wilkinson 2025-11-09 13:51:43 -05:00
  • 289eb6c537 [Core] Simplify async KV output aggregation (#28327) Nick Hill 2025-11-09 09:44:13 -08:00
  • 19d91ece4b [CI] Fix flaky test_eagle_correctness test (#28364) Nicolò Lucchesi 2025-11-09 17:04:59 +01:00
  • 7ae5a5fb11 [Misc] Add some comments in qwen3-next (#28267) Jiangyun Zhu 2025-11-09 15:59:24 +08:00
  • de2b78305f [ROCm] Add env to enable/disable aiter triton gemm (#28321) Yong Hoon Shin 2025-11-08 20:27:00 -10:00
  • e5e9067e61 [Misc] fix typo and add detailed log (#28178) Ning Xie 2025-11-09 13:33:46 +08:00
  • 3a7d580343 fix: close issue 28338 by fixed python version (#28339) yihong 2025-11-09 13:07:26 +08:00
  • 05f8d69077 [chore] Move some wikimedia images to S3 (#28351) Kevin H. Luu 2025-11-08 17:58:26 -08:00
  • 404d7a9d14 [Performance][gpt-oss] Revert gpt-oss max cudagraph size to 1024 (#28345) Mohammad Miadh Angkad 2025-11-09 06:50:10 +08:00
  • 171133f929 [Bugfix] Fix test fused quant layernorm tests (#27865) ElizaWszola 2025-11-08 23:31:33 +01:00
  • 32787d0644 Remove setuptools upper bound constraint (<80) (#28337) Cole Murray 2025-11-08 14:30:18 -08:00
  • 975676d174 [Feat] Drop-in Torch CUDA Profiler (#27841) Benjamin Chislett 2025-11-08 17:07:37 -05:00
  • 77d702a22b Enhance run_cluster.sh for multi-NIC support (#28328) Ev Lacey 2025-11-08 14:04:16 -08:00
  • 2108a571d7 [DCP] Support dcp kv_cache interleave size > 1 (#26696) zhangsicheng5 2025-11-09 03:45:27 +08:00
  • 47604137a2 [Bugfix] Spec decode + structured output + spec model max len edge case (#28298) Andy Lo 2025-11-08 19:44:25 +00:00
  • 26990d25dc [Bugfix] Update device name for H200 detection (#28349) Robert Shaw 2025-11-08 14:01:11 -05:00
  • d9ab1ad9d1 reasoning_content -> reasoning (#27752) Harry Mellor 2025-11-08 04:15:08 -08:00
  • 608bb14462 [Attention] Remove max cudagraph size limit of 992 (#27840) 22quinn 2025-11-07 22:33:27 -08:00
  • 4a36681f85 [flashinfer][fix] do not check nvcc availability when using pre-downloaded cubins (#27990) Xiaozhu Meng 2025-11-07 22:25:21 -08:00
  • d15afc1fd0 Refactor CPU/GPU extension targets for CMake build (#28026) Abolfazl Shahbazi 2025-11-07 22:17:35 -08:00
  • 934a9c3b79 [Model] Consolidate Deepseek-MoE implementation with DeepSeek-v2 (#28101) Isotr0py 2025-11-08 13:01:27 +08:00
  • 70af44fd10 [bugfix] support eagle with lora cudagraph specialization (#28318) gnovack 2025-11-07 19:25:45 -08:00
  • 781f5ebf52 Bump arctic-inference requirement (#28174) Aurick Qiao 2025-11-07 18:31:18 -08:00
  • 0852527647 [Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM (#28124) Michael Goin 2025-11-08 10:20:55 +08:00
  • 61d25dc44b Update gpu.rocm.inc.md to add support for AMD Ryzen AI MAX / AI 300 Series (gfx1151, gfx1150) (#28308) Hamid Mukhtar 2025-11-07 21:09:21 -05:00
  • d0c7792004 [Bugfix][LoRA][Spec Decode] Support LoRA with speculative decoding (#21068) Xiaohong (Sean) Chen 2025-11-07 20:58:22 -05:00
  • b158df2813 remove resolve_op_overloads and use splitting_ops directly (#28081) Boyuan Feng 2025-11-07 17:13:13 -08:00
  • 1aaecda078 [XPU] Enable Expert parallel for MoE models (#28263) Kunshang Ji 2025-11-08 08:33:11 +08:00
  • 811df41ee9 Update Flashinfer from v0.4.1 to v0.5.2 (#27952) Harry Mellor 2025-11-07 16:24:42 -08:00
  • 67a2da890e [PerfFix] Avoid separate thread for MP executor shm spin (take 2) (#28319) Nick Hill 2025-11-07 14:11:03 -08:00
  • da786e339e [Core] Rework handling of async scheduling config (#28250) Nick Hill 2025-11-07 12:01:23 -08:00
  • 18903216f5 [Bugfix] Fix and add tests for GptOss reasoning parser (#28000) Benjamin Chislett 2025-11-07 14:28:04 -05:00
  • d0ceb38ae8 [Build] Fix release pipeline failing annotation (#28272) Simon Mo 2025-11-07 10:06:45 -08:00
  • 155ad56d7b [doc] add guide about the provided PTX was compiled with an unsupported toolchain (#28305) youkaichao 2025-11-08 00:26:34 +08:00
  • 5fb4137c99 [README] Add Arm CPUs to the list of supported targets (#28290) Fadi Arafeh 2025-11-07 15:41:47 +00:00
  • 68a72a5cc1 Revert "[PerfFix] Avoid separate thread for MP executor shm spin (#28012)" (#28289) Nicolò Lucchesi 2025-11-07 16:07:01 +01:00
  • 0f872b7977 [Log] update shm wait time msg (#28255) Boyuan Feng 2025-11-07 06:43:30 -08:00
  • 4b1ff13221 [Feature] Default ignore_eos True for random dataset (#28227) Wentao Ye 2025-11-07 07:35:33 -05:00
  • e0d6b4a867 [CLI] add --max-tokens to vllm complete (#28109) Iceber Gu 2025-11-07 20:21:40 +08:00
  • 72b1c2ae2c [Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (#27439) Pavani Majety 2025-11-07 04:18:39 -08:00
  • e0919f331d [Core][MM] Add mechanism to configure multimodal fields which should stay on CPU (#28168) Lukas Geiger 2025-11-07 12:14:29 +00:00
  • 8e19d470af [fix] Revert "fixing mm placeholder replacement issue with gemma3" (#28285) Kevin H. Luu 2025-11-07 04:09:09 -08:00
  • 1958bda9b4 [Misc][Model][Refactor] Pass the prefix into Linear layers (#28259) Mengqing Cao 2025-11-07 19:38:38 +08:00
  • 7bdb42b2f2 [CPU]Avoid repeated random sample compile (#28260) Zhang Xiangze 2025-11-07 19:03:57 +08:00
  • 315068eb4a [FixBug]Aeala/ShareGPT_Vicuna_unfiltered marked as multimodal benchmark (#28265) 汪志鹏 2025-11-07 17:35:22 +08:00
  • ccd98b59c1 [Perf] Introduce FlattenLogprobs to store logprobs results to reduce GC overhead (#28171) Jialin Ouyang 2025-11-07 00:27:12 -08:00
  • 21b82f4ea2 [Kernel] LoRA triton kernels support PDL (#27402) Jee Jee Li 2025-11-07 16:05:48 +08:00
  • a736e5ff77 [CI] Reduce Blackwell Fusion test runtime by filtering tests and only run all tests in nightly (#28074) Copilot 2025-11-07 15:58:16 +08:00
  • 9da9208b20 [Bug] Fix missing token_ids for reasoning parser models in chat completions #28246 (#28256) baonudesifeizhai 2025-11-07 02:31:58 -05:00
  • 11fd69dd54 [amd][gptoss] Perf gain because of block alignment (#28024) smit kadvani 2025-11-06 21:27:42 -08:00
  • c0a4b95d64 Fix issues from #28242 (#28257) Harry Mellor 2025-11-06 20:23:17 -08:00
  • a47d94f18c Add runai model streamer e2e test for GCS (#28079) Alexis MacAskill 2025-11-06 19:07:54 -08:00
  • e70fbc599b [CI/Build] Loosen STT LoRA Translate Check (Flaky Test) (#28247) Alex Brooks 2025-11-06 19:51:27 -07:00
  • 4bf56c79cc [Multimodal][torch.compile] Add compilation config field for turning off ViT/MM compile (#28242) Lucas Kabela 2025-11-06 16:16:03 -08:00
  • 59b453eaa2 Speed up mm processor kwargs per request by spliting dynamic and static kwargs (#26483) Junhong Liu 2025-11-07 07:51:28 +08:00
  • 827e4237bc Fix failing test for CRadio (#27738) Eugene Khvedchenya 2025-11-07 01:32:25 +02:00
  • ca6f755d24 [BugFix] Fix FusedMoELoRA + ModularKernel Integration (#28237) Varun Sundar Rabindranath 2025-11-06 17:53:30 -05:00
  • ca90f50304 [Test] Add non-MoE DP test coverage (#28235) Matthew Bonanni 2025-11-06 15:59:57 -05:00
  • da855b42d2 [Doc]: Make extraInit containers fully configurable in helm chart (#27497) Fang Han 2025-11-06 12:27:16 -08:00
  • 449de9001a [ROCm] triton fp8 kernel (#27058) Aleksandr Malyshev 2025-11-06 11:46:44 -08:00
  • d4aa65c998 [Chore] eliminate duplicated and unconditional object serialization in anthropic messages api (#27792) Vico Chu 2025-11-07 03:09:19 +08:00
  • 7a8375f8a0 Add llama 4 scaling support (#28145) Julien Denize 2025-11-06 19:55:17 +01:00
  • 5e0c1fe69c [Structured outputs] Upgrade llguidance to 1.3.0 (#28039) Andy Lo 2025-11-06 18:24:47 +00:00
  • 4507a6dae4 CODEOWNERS: Add myself as reviewer on security docs (#28216) Russell Bryant 2025-11-06 12:39:42 -05:00
  • d1dd5f53e4 [Frontend] Fix logging format when enable response logging (#28049) Roy Wang 2025-11-07 00:25:39 +08:00
  • e52e4da971 [HARDWARE][CPU] Add Option for Disabling Binding to Specific CPU Cores (#27953) StanHatko 2025-11-06 10:47:11 -05:00
  • 2176778cd3 [Doc] Add Arm CPUs are on the list of supported targets in vLLM (#26018) Milos Puzovic 2025-11-06 15:30:26 +00:00
  • 0370679ce9 [Kernel][Model] Tune fused_moe Triton configs for MiniMax-M2 on H100 (#28200) Eric Yue 2025-11-06 23:29:46 +08:00
  • 8816e375d3 [Docs] Switch to directory style URLs (#28058) Harry Mellor 2025-11-06 07:06:33 -08:00
  • f32229293e Disable nm-testing models with issues in CI (#28206) Michael Goin 2025-11-06 22:19:07 +08:00
  • c757a15f0f [CPU]Improve cpu fused moe perf (#27244) xiangze-arm 2025-11-06 19:04:18 +08:00
  • 59a50afa08 [Frontend] OpenAI Responses API supports Tool/Function calling - non-harmony (#26874) Chauncey 2025-11-06 18:40:03 +08:00