Commit Graph

  • 1fb0209bbc [Bugfix][Hardware][AMD] Fix exception types in AITER MLA FP8 check (#31177) Kevin McKay 2026-01-06 00:10:59 -06:00
  • 81323ea221 [CI] Fix CPU MM PRocessor Test (#31764) Robert Shaw 2026-01-05 23:22:18 -05:00
  • e1cd7a5faf [Bugfix] Add init_workspace_manager to moe kernel benchmarks (#31042) Michael Goin 2026-01-05 22:14:33 -05:00
  • a68e703c32 [UX] Add -ep shorthand for --enable-expert-parallel (#30890) Michael Goin 2026-01-05 22:13:36 -05:00
  • cd1245a184 [Cleanup] Remove redundant decoder_layer_type assignment in Qwen2 (#31760) maang 2026-01-06 10:09:18 +08:00
  • ffec815422 [Perf] Optimize additional fill(0) in cutlass moe, 2.9% E2E throughput improvement, 10.8% TTFT improvement (#31754) Wentao Ye 2026-01-05 21:01:13 -05:00
  • d386ab1412 [Docs] Improve malformed exception caused by backslash line continuations (#31694) maang 2026-01-06 09:51:54 +08:00
  • ccb309a964 Revert "[CI Failure] Disable B200 tests while runner is broken" (#31750) Michael Goin 2026-01-05 20:26:33 -05:00
  • 2f4e6548ef [Bugfix] vLLM produces invalid UTF-8 tokens and “�” (#28874) John Calderon 2026-01-05 19:23:00 -05:00
  • 3c98c2d21b [CI/Build] Allow user to configure NVSHMEM version via ENV or command line (#30732) Seiji Eicher 2026-01-05 15:56:08 -08:00
  • 9513029898 [Bugfix] Properly apply v_scale for mimo_v2_flash (#31175) Michael Goin 2026-01-05 18:20:46 -05:00
  • f6c0009afa [Bugfix] Fix Broken ModelOpt NVFP4 MoE (#31742) Robert Shaw 2026-01-05 18:18:38 -05:00
  • 776ca1e187 [MoE Refactor] Aiter Experts for BF16 MoE (#31542) Yongye Zhu 2026-01-05 14:52:59 -08:00
  • af9a7ec255 [Bug] Revert torch warning fix (#31585) Wentao Ye 2026-01-05 17:31:21 -05:00
  • 276e03b92c [CI][DeepSeek] Add nightly DeepSeek R1 lm_eval tests on H200 (#30356) Matthew Bonanni 2026-01-05 17:17:59 -05:00
  • 32f4e4db00 [Cleanup] Remove deprecated fields from CachedRequestData class (#31734) Nick Hill 2026-01-05 13:07:14 -08:00
  • ee21291825 [Model] Nemotron Parse 1.1 Support (#30864) amitz-nv 2026-01-05 23:00:14 +02:00
  • af1b07b0c5 [docker] install cuda13 version of lmcache and nixl (#30913) Qidong Su 2026-01-05 15:50:39 -05:00
  • c77a993cc2 pin lora_b moe weights on cpu (#31317) gnovack 2026-01-05 12:15:40 -08:00
  • fdcc5176be [BugFix] Fix architecture flags to prevent issues on SM103 (#31150) Roberto L. Castro 2026-01-05 21:11:35 +01:00
  • 5708297e4e [Misc][Model][Refactor] Pass the prefix into Linear layers (#31669) Wang Kunpeng 2026-01-06 04:03:18 +08:00
  • 02dbb933cb Fix GLM-4.6v flash tool calling in transformers 5.x (#31622) baonudesifeizhai 2026-01-05 14:32:43 -05:00
  • 51e38a8e30 [Misc] Enable Paligemma's PrefixLM attention mask computation (#31725) Isotr0py 2026-01-06 03:31:49 +08:00
  • d8e38d4939 Triton Attention: Support cross-layers blocks (#30687) Or Ozeri 2026-01-05 21:29:16 +02:00
  • 21156ff199 [Bugfix] Add missing extra_tensors arg to DeviceCommunicatorBase.disp… (#31644) kzwrime 2026-01-06 01:26:09 +08:00
  • c455b771fd [Bugfix][CPU] Fix RotaryEmbedding fallback causing gibberish with --enforce-eager (#31643) RickyChen / 陳昭儒 2026-01-06 01:25:38 +08:00
  • eefa713a66 [CI Failure] Disable B200 tests while runner is broken (#31732) Michael Goin 2026-01-05 11:50:51 -05:00
  • 79ed460dd5 [Frontend] [Doc] Exclude log deltas feature (#30322) Kevin Šuc 2026-01-05 17:34:35 +01:00
  • 6aa5b18e1d [v1] Add encoder-only/cross attention support to Triton Attention backend (#31406) Isotr0py 2026-01-06 00:00:23 +08:00
  • 911d38ed99 [Model] Let more models to support the score template. (#31335) wang.yuqi 2026-01-05 19:54:26 +08:00
  • caaa482aca [platform] Support additional forward context for OOT (#31674) zzzzwwjj 2026-01-05 18:25:13 +08:00
  • b471aad41f [KVconnector][LMCache] remove the import of legacy LMCache code (#31704) Yihua Cheng 2026-01-05 02:11:01 -08:00
  • d5503ca7f9 [LoRA] LoRA PDL improvement (#31660) Jee Jee Li 2026-01-05 16:28:46 +08:00
  • a2ad15c070 [Model] Enable LoRA support for BLIP2 (#31620) Qiping Pan 2026-01-05 00:02:24 -08:00
  • 3133c192a3 [ROCM] Reorder arguments and rename parameters for rope_cached_thd_positions_2c_fwd_inplace (#29993) Tres 2026-01-05 08:37:57 +01:00
  • 76fd458aa7 [CI] Bump sentence-transformer from 3.2.1 to 5.2.0 (#31664) wang.yuqi 2026-01-05 13:45:01 +08:00
  • e2701cc525 [Frontend] [Bugfix] respect server-level default chat template kwargs in reasoning parser (#31581) cjackal 2026-01-05 14:42:47 +09:00
  • fe8a9fbd2e [Bugfix] Fix EPLB state logging error (#31455) Tyler Michael Smith 2026-01-04 23:06:28 -05:00
  • 98b8b3abaa [log] enable max_log_len trim only when needed (#31482) Ning Xie 2026-01-05 11:55:43 +08:00
  • 346e56455a Add chat prefix completion feature to DeepSeek v3.2 (#31147) CHENYUE 2026-01-05 11:20:25 +08:00
  • 8be6432bda [CI Failure] Fix NomicBert max_model_len validation (#31662) wang.yuqi 2026-01-05 11:06:52 +08:00
  • 43e3f8e4a9 [Misc] Various code simplifications (#31666) Nick Hill 2026-01-04 18:35:56 -08:00
  • bb4337b34c [Platform] Deprecate seed_everything (#31659) wangxiyuan 2026-01-05 10:34:04 +08:00
  • 367856de14 [CI/Build] Revive skipped reward models e2e test (#31665) Isotr0py 2026-01-05 10:33:46 +08:00
  • da436f868a [Minor] Small pooler output processing optimization (#31667) Nick Hill 2026-01-04 18:33:12 -08:00
  • f099cd557a [Bugfix] Fix AttributeError: 'Stream' object has no attribute 'dp_size' (#31663) Jee Jee Li 2026-01-05 10:31:31 +08:00
  • f2b6dfd237 [ROCm][CI] Fix language generation test accuracy by disabling HF flash_sdp and mem_efficient_sdp (#31597) Andreas Karatzas 2026-01-04 20:17:05 -06:00
  • 89f1f25310 [CI] Skip Phi-MoE test due to old API util (#31632) Andreas Karatzas 2026-01-04 18:52:07 -06:00
  • b53b89fdb3 [BugFix] Async scheduling: handle model forward errors more cleanly (#31611) Nick Hill 2026-01-04 11:04:37 -08:00
  • 6522721d17 [misc] Sort uvicorn log level description according to verbosity (#31137) Ning Xie 2026-01-05 02:45:37 +08:00
  • 0d4044edd8 fix no think of GLM-4.5 / GLM-4.7 (#31449) Yuxuan Zhang 2026-01-04 11:43:00 +08:00
  • 41ab179738 [Docs] Fix argparse include path for mm-processor benchmark (#31654) Reagan Lee 2026-01-03 19:31:29 -08:00
  • 268b1c55ad [MoE Refactor][13/N] Convert FI to Use PFNoEP (#31533) Robert Shaw 2026-01-03 15:26:36 -05:00
  • 4f9ce35afe [CI][Bugfix] Fix token counting in chunked prefill compl test (#31630) Andreas Karatzas 2026-01-03 00:28:49 -06:00
  • 97a01308e9 Improve HF qwen3_omni: preserve audio_sample_rate in kwargs restructuring (#29255) jeremyteboul 2026-01-02 20:31:09 -08:00
  • 0eee877f67 [Core] Parse vLLM engine required fields from hf_config to model_arch_config (#28454) Xingyu Liu 2026-01-02 16:13:15 -07:00
  • a0e9ee83c7 [Benchmark] Fix OOM during MoE kernel tuning for large models (#31604) Alfred 2026-01-03 06:24:51 +08:00
  • a3f2f40947 [MoE Refactor] Explicit construct mk for flashinfer bf16 kernel (#31504) Yongye Zhu 2026-01-02 13:54:50 -08:00
  • 5a468ff7c7 [MoE Refactor] Split invoke_fused_moe_kernel (#31050) Yongye Zhu 2026-01-02 13:47:15 -08:00
  • 6ef770df7c [MoE] Fix output_shape calculation in Attention layer to handle 3D query inputs (#31596) Andreas Karatzas 2026-01-02 09:46:23 -06:00
  • bd877162eb [BugFix] Support online dense model DP without overhead (#30739) Nick Hill 2026-01-02 07:36:38 -08:00
  • 08f425bad1 CustomOp: test forward dispatch for grouped_topk (#31530) Xinyu Chen 2026-01-02 23:04:01 +08:00
  • a01f2faedf Add multimodal input method in the documentation (#31601) labAxiaoming 2026-01-02 20:43:30 +08:00
  • cc410e8644 [Bugfix] Fix weight_loader v1 block scale (#31103) Kyuyeun Kim 2026-01-01 21:14:10 -08:00
  • 825c2dc133 [Bugfix][Hardware][AMD] Fix last_page_len calculation in AITER MLA decode (#31282) Kevin McKay 2026-01-01 23:14:00 -06:00
  • 1f43c121d5 Remove unused use_marlin variable in Mxfp4MoEMethod (#31549) Vaibhav Sourirajan 2026-01-02 00:13:36 -05:00
  • ca179d0f64 [Bugfix] Fix activation quantization for compressed-tensors W4A16 (#31572) Tmn07 2026-01-02 13:13:22 +08:00
  • 013b54088c [ROCm][CI] Fix ModernBERT token classification test (#31612) Andreas Karatzas 2026-01-01 22:19:08 -06:00
  • 5ac55eb30f [Model] Enable LoRA support for tower and connector in LLaVA (#31513) Jay Hemnani 2026-01-01 19:32:39 -08:00
  • ea53ca5e85 [Bugfix] Fix block size used in EAGLE slot mapping (#31540) Benjamin Chislett 2026-01-01 22:32:30 -05:00
  • 27864a851c feat: support LoRA for DeepSeek-OCR(Language Model part) (#31569) zhima771 2026-01-02 11:32:11 +08:00
  • 5cc4876630 [ROCm][CI] Fix failure in Language Models Tests (Extra Standard) by reducing agent pool size (#31553) Andreas Karatzas 2026-01-01 21:29:42 -06:00
  • 5fff44064b [Bugfix] Replace BaseException with specific exceptions in FLA utils (#31590) Kevin McKay 2026-01-01 21:27:54 -06:00
  • 1f5b7c41c3 Add Multimodal Processor Benchmark (#29105) Reagan Lee 2026-01-01 19:26:53 -08:00
  • adcf682fc7 [Audio] Improve Audio Inference Scripts (offline/online) (#29279) Ekagra Ranjan 2025-12-31 18:34:18 -05:00
  • 21de6d4b02 [CI][Bugfix] Fix token counting in chunked prefill streaming test (#31565) Andreas Karatzas 2025-12-31 17:05:14 -06:00
  • 6c2cfb62ff [BugFix] Fix async scheduling for pooling models (#31584) Nick Hill 2025-12-31 14:48:51 -08:00
  • d8da76f3b7 [Bugfix] Fix BAGEL online serving for text and image understanding (#31546) Fanjiang Ye 2025-12-31 16:46:10 -06:00
  • d722e9e614 Add GLM-ASR multimodal support (#31436) baonudesifeizhai 2025-12-31 10:12:24 -05:00
  • cf16342d43 [ROCm][CI] Update MiniCPM model test: MiniCPM3-4B to MiniCPM4.1-8B and simplify attention backend testing (#31551) Andreas Karatzas 2025-12-31 02:12:01 -06:00
  • 357d435c54 [Bug] Fix log issue with \n (#31390) Wentao Ye 2025-12-31 00:16:55 -05:00
  • 108a2728f7 Add get_expert_mapping to NemotronHModel (for LoRA support) (#31539) danisereb 2025-12-31 07:09:03 +02:00
  • 578c8f51f6 [CI] [Critical] [CUDA] Fix duplicated test name (#31562) TJian 2025-12-31 14:01:09 +09:00
  • b4bb5f312f [Core] Remove unused num_tokens parameter from _init_model_kwargs (#31517) maang-h 2025-12-31 12:47:23 +08:00
  • 70e1acefcd [BugFix] Fix NUMA node validation in CPU platform (#31520) SameerAsal 2025-12-30 20:06:49 -08:00
  • 84f6cd741b [Mics] add pcp basic support to MoE model (#31003) Qiu 2025-12-31 12:01:29 +08:00
  • ecd49ce7e6 [Fix] Align fused moe lora_b shape with peft (#31534) B-201 2025-12-31 09:44:59 +08:00
  • e1ee11b2a5 Add docker buildx bake configuration (#31477) Amr Mahdi 2025-12-30 17:08:54 -08:00
  • 04147dcfa7 [Bugfix]Fix pooling model always disabled due to incorrect PP rank check (#31505) vintipandey 2025-12-30 11:27:10 -08:00
  • 07728bf5cd [BugFix] add select_gemm_impl on CompressedTensorsWNA16MoEMethod to support LoRA (#31453) JartX 2025-12-30 20:20:15 +01:00
  • 3f52fa5aa2 [Model] Add support for openPangu moe model (#28775) yt0428 2025-12-31 00:11:38 +08:00
  • 7157596103 [CPU] Disable async schedule on CPU (#31525) Li, Jiang 2025-12-30 20:34:08 +08:00
  • ab1af6aa3e [CI][NIXL] Split DPEP tests (#31491) Nicolò Lucchesi 2025-12-30 13:26:12 +01:00
  • 1a834df2d4 [ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled (#31523) Pleaplusone 2025-12-30 17:21:49 +08:00
  • 51085c2aeb [Frontend] add continue_final_message parameter to /embeddings endpoint (#31497) Kevin 2025-12-29 23:21:13 -08:00
  • 3d973764ce [xpu] [bugfix] upgrade to latest oneccl in dockerfile (#31522) Roger Feng 2025-12-30 14:52:28 +08:00
  • 3b312fb792 [Minor] Various small code cleanups/simplifications (#31508) Nick Hill 2025-12-29 22:42:06 -08:00
  • f84bf7d79b Add Loraconfig parameter to get_punica_wrapper function (#31408) ZT-AIA 2025-12-30 14:27:31 +08:00
  • 99dcf5dcc5 Migrate meetups & sponsors [2/N] (#31500) Roy Wang 2025-12-30 12:26:15 +08:00
  • dc837bc23e feat(frontend): add --default-chat-template-kwargs CLI argument (#31343) Hojin Yang 2025-12-30 12:38:47 +09:00