Commit Graph

  • eac3b96ec0 [Models] Allow converting Qwen3-VL into Reranker model (#31890) Isotr0py 2026-01-08 16:10:15 +08:00
  • 573a1d1119 [ROCm]Skip test_torchao.py::test_pre_quantized_model on CDNA3 arch (#31905) Zhiwei 2026-01-08 15:47:44 +08:00
  • 33156f56e0 [docker] A follow-up patch to fix #30913: [docker] install cuda13 version of lmcache and nixl (#31775) Shang Wang 2026-01-08 02:47:02 -05:00
  • 107cf8e92f fix(rocm): Add get_supported_kernel_block_sizes() to ROCM_ATTN (#31712) Rabi Mishra 2026-01-08 13:16:07 +05:30
  • 63baa28cf5 [Model] Enable LoRA support for tower and connector in GLM4-V (#31652) Zyyeric 2026-01-08 01:45:53 -06:00
  • e5173d3bac [Bugfix] Remove the num_hidden_layers override for glm4_moe (#31745) Andy Liu 2026-01-07 23:45:10 -08:00
  • d3235cb503 [Fix] Enable mm_processor_cache with vision LoRA (#31927) prashanth058 2026-01-08 01:31:51 -06:00
  • 287b37cda4 [BugFix] Fix spec decoding edge case bugs (#31944) Nick Hill 2026-01-07 23:31:03 -08:00
  • 791b2fc30a [grpc] Support gRPC server entrypoint (#30190) Chang Su 2026-01-07 23:24:46 -08:00
  • be6a81f31b [chore] Update FA commit (#30460) Lucas Wilkinson 2026-01-08 02:24:18 -05:00
  • 2ab441befe [platform] add dp_metadata arg to set_additional_forward_context (#31942) Ronald 2026-01-08 14:56:44 +08:00
  • 9572f74f15 [Model] Enable LoRA support for tower and connector in DotsOCR (#31825) ShaanveerS 2026-01-08 07:50:16 +01:00
  • 5f2a473ff3 [ROCm][CI] v1 cpu offloading attention backend fix (#31833) Andreas Karatzas 2026-01-08 00:37:50 -06:00
  • 6b2a672e47 [Doc] Add Claude code usage example (#31188) Michael Goin 2026-01-08 00:50:23 -05:00
  • f1b1bea5c3 [CI][BugFix][AMD] Actually skip tests marked @pytest.mark.skip_v1 (#31873) rasmith 2026-01-07 23:06:09 -06:00
  • cddbc2b4b2 [ROCm][CI] Add rocm support for run-multi-node-test.sh (#31922) Charlie Fu 2026-01-07 22:36:39 -06:00
  • 087a138963 [ROCm][CI] Fix attention backend test flakiness from uninitialized KV cache memory (#31928) Andreas Karatzas 2026-01-07 22:35:25 -06:00
  • c4041f37a4 [ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling (#31931) Andreas Karatzas 2026-01-07 22:17:56 -06:00
  • a79079feef [BugFix] Fix flakiness in test_eagle_dp for PyTorch 2.10 (#31915) Richard Zou 2026-01-07 23:04:58 -05:00
  • 9f6dcb71ae [MoE Refactor][16/N] Apply Refactor to NVFP4 (#31692) Robert Shaw 2026-01-07 22:46:27 -05:00
  • 8dd2419fa9 [CI] Skip Qwen-VL in multimodal processing tests due to flaky external dependency (#31932) Andreas Karatzas 2026-01-07 20:58:01 -06:00
  • 39d82005f7 fix(rocm): add early return in get_flash_attn_version for ROCm (#31286) Rabi Mishra 2026-01-08 07:58:07 +05:30
  • 25eef3dc2e feat(moe): Add is_act_and_mul=False support for Triton MoE kernels (#31645) Rabi Mishra 2026-01-08 07:57:09 +05:30
  • 0d7667419f [0/N][Attention] Fix miscellaneous pre-commit issues (#31924) Matthew Bonanni 2026-01-07 20:15:17 -05:00
  • 5dcd7ef1f2 [MoE Refactor][15/N] Apply Refactor to Fp8 (#31415) Robert Shaw 2026-01-07 19:42:33 -05:00
  • ffc0a2798b Add back missing DeepEP LL params (#31911) Elvir Crnčević 2026-01-07 23:47:54 +01:00
  • 10ef65eded [BugFix] Fix bad words with speculative decoding (#31908) Nick Hill 2026-01-07 12:46:42 -08:00
  • 6170d47d22 [EPLB] Optimize EPLB with numpy (#29499) Ilya Markov 2026-01-07 21:21:35 +01:00
  • 0ada960a20 [Kernel] Support bias type in grouped_topk kernel (#31781) Xin Yang 2026-01-07 12:16:32 -08:00
  • c907d22158 [refactor] refactor memory constants usage (#31865) Ning Xie 2026-01-08 02:37:31 +08:00
  • f347ac6c34 [Perf] Fuse stride preparation for NVFP4 cutlass_moe (#31837) Michael Goin 2026-01-07 13:31:26 -05:00
  • 05f47bd8d2 [Doc] Fix: Correct vLLM announcing blog post link in docs (#31868) Festus Ayobami Owumi 2026-01-07 18:06:42 +00:00
  • bf184a6621 Enable quantized attention in NemotronH models (#31898) roikoren755 2026-01-07 19:37:19 +02:00
  • 30399cc725 UX: add vLLM env info in '/server_info' (#31899) Jee Jee Li 2026-01-08 01:13:02 +08:00
  • b89443b8d9 [KVConnector]: Enable Cross-layers KV cache layout for MultiConnector (#30761) Kfir Toledo 2026-01-07 18:59:43 +02:00
  • 1d9e9ae8a4 [Bugfix]: prevent leaking tokens in crash log (#30751) Marko Rosenmueller 2026-01-07 17:15:19 +01:00
  • b7036c87a1 [Refactor] Clean up pooler modules (#31897) Cyrus Leung 2026-01-08 00:07:43 +08:00
  • cc6dafaef2 [Perf][Kernels] Enable FlashInfer DeepGEMM swapAB on SM90 (for W8A8 Linear Op) (#29213) Kate Cheng 2026-01-07 07:53:54 -08:00
  • 1ab055efe6 [OpenAI] Extend VLLMValidationError to additional validation parameters (#31870) R3hankhan 2026-01-07 20:15:49 +05:30
  • b665bbc2d4 [Chore] Migrate V0 attention utils (#31891) Cyrus Leung 2026-01-07 21:44:36 +08:00
  • 974138751b [Refactor] GLM-ASR Modeling (#31779) Jared Wen 2026-01-07 21:08:29 +08:00
  • 41cfa50632 [ROCm][AITER] fix wrong argument passed to AITER flash_attn_varlen_func (#31880) vllmellm 2026-01-07 12:25:03 +01:00
  • d111bc53ad [Bugfix][MTP] Fix GLM4 MoE fp8 loading with MTP on (#31757) Andy Liu 2026-01-07 01:18:52 -08:00
  • 0790f07695 [Misc] Improve error messages for unsupported types and parameters (#30593) BlankR 2026-01-07 01:00:16 -08:00
  • 1f33e38e81 [Model] Cleanup: Remove redundant manual definition of make_empty_intermediate_tensors in GLM-4-MoE (#31869) maang 2026-01-07 16:18:28 +08:00
  • 59fe6f298e [XPU]fallback to TRITON_ATTN on xpu when use float32 dtype (#31762) sihao_li 2026-01-07 16:10:29 +08:00
  • e7596371a4 [Refactor][TPU] Remove torch_xla path and use tpu-inference (#30808) weiyu 2026-01-07 00:07:16 -08:00
  • 0dd5dee9b9 [Bugfix][Kernel] fix bias adding in triton kernel implemented fused moe (#31676) xuebwang-amd 2026-01-07 15:36:13 +08:00
  • 4614c5a539 [Bugfix][Hardware][AMD] Consolidate FP8 min/max values helper function (#31106) Kevin McKay 2026-01-07 00:55:03 -06:00
  • 482914849c [BugFix] LoRA: Support loading base_layer of experts (#31104) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2026-01-07 08:49:39 +02:00
  • efeaac92f2 [Bugfix] Fix race condition in async-scheduling for vlm model (#31841) tianshu-Michael-yu 2026-01-06 22:45:10 -08:00
  • 55caa6051d refactor: find_loaded_library (#31866) tjp_zju 2026-01-07 14:42:20 +08:00
  • c7a79d41a0 [Attention][3/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31850) Lucas Wilkinson 2026-01-07 00:31:34 -05:00
  • 6409004b26 [ROCm][AITER] bugfix accuracy regression in ROCM_AITER_TRITON_MLA backend (#31816) vllmellm 2026-01-07 06:04:53 +01:00
  • aafd4d2354 [Chore] Try remove init_cached_hf_modules (#31786) Cyrus Leung 2026-01-07 12:34:04 +08:00
  • 0a2c2dc3f1 fixed mypy warnings for files vllm/v1/attention with TEMPORARY workaround (#31465) Jack Yang 2026-01-06 23:08:47 -05:00
  • f09c5feb7c Change warning in get_current_vllm_config to report caller's line number (#31855) Tyler Michael Smith 2026-01-06 22:48:13 -05:00
  • 1b8af957f6 [Doc] Update release docs (#31799) Cyrus Leung 2026-01-07 11:27:40 +08:00
  • a051525e07 [Model] Enable LoRA support for PaliGemma (#31656) Ce Zhao 2026-01-06 21:09:32 -05:00
  • 5b833be49e [1/2][lmcache connector] clean up lmcache multi-process adapter (#31838) Yihua Cheng 2026-01-06 18:02:42 -08:00
  • 873480d133 [Misc][BE] Type coverage for vllm/compilation [1/3] (#31554) Lucas Kabela 2026-01-06 17:37:51 -08:00
  • 6f351548b2 [Frontend] Implement robust video frame recovery for corrupted videos (#29197) vSeamar 2026-01-06 17:13:24 -08:00
  • 364a8bc6dc [ROCm][CI] Fix plugin tests (2 GPUs) failures on ROCm and removing VLLM_FLOAT32_MATMUL_PRECISION from all ROCm tests (#31829) Andreas Karatzas 2026-01-06 19:12:23 -06:00
  • 9a1d20a89c [CI] Add warmup run in test_fusion_attn (#31183) Angela Yi 2026-01-06 16:31:52 -08:00
  • 309a8f66ee [Bugfix] Handle mistral tokenizer in get_hf_processor (#31817) Cyrus Leung 2026-01-07 07:46:56 +08:00
  • e5d427e93a [ROCm][CI] Pinning timm lib version to fix ImportError in Multi-Modal Tests (Nemotron) (#31835) Andreas Karatzas 2026-01-06 17:23:11 -06:00
  • 2a42ae790d [ROCm][CI] Fix ModernBERT token classification test numerical accuracy on ROCm (#31820) Andreas Karatzas 2026-01-06 17:21:15 -06:00
  • d49899732e [Spec Decode][UX] Add acceptance stats to vllm bench serve report (#31739) Matthew Bonanni 2026-01-06 16:21:42 -05:00
  • dba95378a6 Report error log after vllm bench serve (#31808) Elvir Crnčević 2026-01-06 21:24:19 +01:00
  • ada6f91d56 Fix RecursionError in MediaWithBytes unpickling (#31191) Nikhil G 2026-01-06 12:11:26 -08:00
  • 8becf146bd [Quantization][Refactor] Move CPU GPTQ kernel into MP linear (#31801) Li, Jiang 2026-01-07 03:10:18 +08:00
  • c07163663d [ROCm][CI] Fix tests/compile unit tests (#28895) Charlie Fu 2026-01-06 12:50:43 -06:00
  • f7008ce1c4 [Perf] Async Scheduling + Speculative Decoding + Structured Outputs (#29821) Benjamin Chislett 2026-01-06 13:50:37 -05:00
  • 4e67a8f616 [Bugfix] Fix GLM-4 MoE router logits dtype for data parallel chunking (#31055) Yakine Tahtah 2026-01-06 18:57:56 +01:00
  • 142c4d1738 make 500: InternalServerError more informative (#20610) Masataro Asai 2026-01-06 12:36:24 -05:00
  • 6f5e653383 [Log] add log about gpu worker init snapshot and requested memory (#29493) Ning Xie 2026-01-07 01:32:55 +08:00
  • 22dffca982 [PERF] Speed-up of GDN attention decode part (Qwen3-Next) (#31722) Vadim Gimpelson 2026-01-06 21:32:46 +04:00
  • 4c73be14e0 [Attention][2/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31774) Lucas Wilkinson 2026-01-06 12:32:14 -05:00
  • 2f4bdee61e [Quantization][MoE] remove unused ep logic from moe marlin (#31571) Jinzhen Lin 2026-01-07 01:07:19 +08:00
  • 28c94770ad [NemotronH] Use ReplicatedLinear for fc1_latent_proj (#31807) roikoren755 2026-01-06 18:00:40 +02:00
  • af8fd73051 [MoE Refactor][14/N] Clean Up FI Quant Config Smuggling (#31593) Robert Shaw 2026-01-06 10:47:04 -05:00
  • d3e477c013 [MoE Refactor] Add Temporary Integration Tests - H100/B200 (#31759) Robert Shaw 2026-01-06 10:34:17 -05:00
  • 02809af1e7 [Bugfix]: Fix cross attention backend selection for Turing GPU (#31806) Isotr0py 2026-01-06 23:15:56 +08:00
  • cbd4690a03 [LoRA]Disable linear LoRA kernel PDL (#31777) Jee Jee Li 2026-01-06 23:12:25 +08:00
  • 96860af655 [Model] rename use_pad_token to use_sep_token (#31784) wang.yuqi 2026-01-06 22:16:04 +08:00
  • 0202971a48 [Frontend] Support GLM-4.5 / GLM-4.7 with enable_thinking: false (#31788) Chauncey 2026-01-06 21:53:21 +08:00
  • 2c1a4f2488 [Bugfix]: avoid overriding audio/text kwargs (Qwen3-Omni) (#31790) Jzz1943 2026-01-06 20:59:17 +08:00
  • 6444824873 [Misc] Implement TokenizerLike.convert_tokens_to_ids (#31796) Cyrus Leung 2026-01-06 20:08:22 +08:00
  • bf0f3a4638 [Bugfix] Fix torch.compile error for DP + MoE on CPU Backend (#31650) kzwrime 2026-01-06 20:06:20 +08:00
  • e0327c9db2 [Attention][1/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties (#31773) Lucas Wilkinson 2026-01-06 07:05:17 -05:00
  • 14df02b4e1 [Chore] Cleanup mem_utils.py (#31793) Cyrus Leung 2026-01-06 19:55:59 +08:00
  • 6ebb66ccea [Doc] Fix format of multimodal_inputs.md (#31800) BlankR 2026-01-06 03:30:24 -08:00
  • 43d384bab4 [CI] Increase the MTEB_EMBED_TOL threshold to 5e-4. (#31797) wang.yuqi 2026-01-06 19:30:05 +08:00
  • db318326a5 [Misc] Use deprecated for seed_everything (#31780) Cyrus Leung 2026-01-06 19:29:55 +08:00
  • 799b5721f6 [cpu][bench] Add CPU paged attention benchmarks (#31720) Fadi Arafeh 2026-01-06 10:57:57 +00:00
  • 97ca4c3b60 [Chore] Remove more V0 dead code from sequence.py (#31783) Cyrus Leung 2026-01-06 18:25:14 +08:00
  • ee2e69d6cd [Bugfix][CI/Build] Fix failing pooling models test due to Triton kernel accuracy diff (#31776) Isotr0py 2026-01-06 16:44:22 +08:00
  • 7101e0851f [Models]: Use MMEncoderAttention for MoonViT (#31738) Isotr0py 2026-01-06 16:00:25 +08:00
  • e9717801bd [Bugfix][ROCm] Fix Unsupported attention metadata type for speculative decoding in eagle.py (#31714) vllmellm 2026-01-06 08:53:22 +01:00
  • da71d44410 [Doc] Show that use_audio_in_video is supported in docs (#30837) Cyrus Leung 2026-01-06 15:27:19 +08:00