Commit Graph

  • 27c6c2f98c [Bugfix] Fix MoE LoRA bin/pt loading (#31161) Jee Jee Li 2025-12-23 19:09:15 +08:00
  • 73cfb7a722 Correct position of docstring of class attributes (#31209) Weida Hong 2025-12-23 18:08:58 +08:00
  • f32cfd7d97 [ROCm][FEAT] Support AITER RMSNorm quantization fusion pass (#26575) vllmellm 2025-12-23 11:07:54 +01:00
  • 6b16fff01b [Bugfix] Fix Jais2ForCausalLM (#31198) Jee Jee Li 2025-12-23 15:44:01 +08:00
  • f1c2c20136 [XPU] decrease IGC_ForceOCLSIMDWidth for speculative decoding triton-xpu kernel compilation (#30538) Yan Ma 2025-12-23 13:22:15 +08:00
  • 8cef137689 [Chore] Update more locations to use attention_config.backend (#31153) Cyrus Leung 2025-12-23 11:19:50 +08:00
  • a37328fc5c [Feature] Batch invariant: Lora (#30097) quanliu 2025-12-23 10:32:47 +08:00
  • 3e10262356 Revert "[SM100] Enable fp8 compute for prefill MLA (#30746)" (#31197) Pavani Majety 2025-12-22 18:15:33 -08:00
  • 612d5ffdab [ci] Fix Pytorch compilation test oom in 2.10 (#31194) Angela Yi 2025-12-22 17:56:47 -08:00
  • 78e5e62bbf [AMD][CI] fix v1/engine test_preprocess_error_handling (#31192) Divakar Verma 2025-12-22 19:28:19 -06:00
  • b57b967386 [MoE Refactor][7/N] AITER MK (#31102) Robert Shaw 2025-12-22 18:42:58 -05:00
  • 6d518ffbaa [CI Failure] Disable mosaicml/mpt-7b and databricks/dbrx-instruct tests (#31182) Michael Goin 2025-12-22 18:40:35 -05:00
  • 85aff45e24 [Perf] Remove blocking copy in GDN Attention (#31167) Benjamin Chislett 2025-12-22 17:25:22 -05:00
  • 5312a7284e [Bug] Fix 'CutlassMLAImpl' object has no attribute '_workspace_buffer' (#31173) Wentao Ye 2025-12-22 17:24:27 -05:00
  • de71747655 [SpecDecode] Simplified alternative padded-speculation acceptance rate fix (#29845) Lucas Wilkinson 2025-12-22 16:06:10 -05:00
  • 9586354053 [Doc] Add vllm-metal to hardware plugin documentation (#31174) Michael Goin 2025-12-22 15:06:29 -05:00
  • b10f41c894 [SM100] Enable fp8 compute for prefill MLA (#30746) Pavani Majety 2025-12-22 11:15:57 -08:00
  • 7b926e8901 [MoE Refactor][9/N] Use modular kernel for unquantized Triton MoE (#31052) Yongye Zhu 2025-12-22 09:34:19 -08:00
  • ab3a85fd68 [ROCm][CI/Build] Fix triton version to one that has triton_kernels required for gpt-oss to run (#31159) Gregory Shtrasberg 2025-12-22 11:19:27 -06:00
  • 8dd0db687b [UX] improve profiler error message (#31125) Boyuan Feng 2025-12-22 08:45:59 -08:00
  • 022f3cea53 [ROCm] [Critical]: Remove unused variable (#31156) TJian 2025-12-23 01:28:22 +09:00
  • a5bc77c253 [AMD][CI] Add "V1 Test e2e + engine" to mi325_8 Agent Pool (#31040) Micah Williamson 2025-12-22 09:41:56 -06:00
  • b1c3f96ae3 [CI][Bugfix] Fix entrypoints/openai/test_audio.py (#31151) Nicolò Lucchesi 2025-12-22 16:21:40 +01:00
  • 8f8f469b1b [BugFix] skip language model in Encoder (#30242) dengyunyang 2025-12-22 21:25:59 +08:00
  • 2cf91c2ea4 [CI] add polling for precompiled wheel in python_only_compile.sh, fix index generation for releases (#30781) Shengqi Chen 2025-12-22 21:24:21 +08:00
  • bd6d5a7475 [gpt-oss] Fix harmony parser in streaming responses (#30205) AlonKejzman 2025-12-22 14:56:06 +02:00
  • 256a33ecb4 [Model] Fix bagel failed to run (#31132) Li Wang 2025-12-22 18:15:54 +08:00
  • c02a2705f9 Update MiniMax-M2 ToolCall and add MiniMax-M2.1 in Docs (#31083) Roger Young 2025-12-22 13:28:40 +08:00
  • cf8eed7bef [Bugfix][ROCm] Fix typo: is_linear_fp8_enaled -> is_linear_fp8_enabled (#31109) Kevin McKay 2025-12-21 23:14:58 -06:00
  • 44ae85f725 [Misc] Fix typo: 'occured' -> 'occurred' (#31120) Kevin McKay 2025-12-21 23:14:27 -06:00
  • 14c3e6ade3 [Misc] Fix spelling typos in model comments (#31117) Kevin McKay 2025-12-21 23:14:14 -06:00
  • 42b42824ae [Misc] Fix grammar errors in comments and messages (#31115) Kevin McKay 2025-12-21 23:14:02 -06:00
  • ec58c10ce1 [Misc] Fix quantization-related typos (#31116) Kevin McKay 2025-12-21 23:13:48 -06:00
  • 8c084de59d [Misc] Fix spelling typos in comments (#31114) Kevin McKay 2025-12-21 23:13:14 -06:00
  • 19cc9468fd [Feature]: Support NVIDIA ModelOpt HF FP8 variants FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO in vLLM (#30957) CedricHuang 2025-12-22 11:34:49 +08:00
  • 097978a15d [Kernel] Enable fused_qknorm_rope_kernel supports partial rope (#30821) Jee Jee Li 2025-12-22 10:39:22 +08:00
  • 7e065eba59 [CI] Fix "2 Node Tests (4 GPUs in total)" (#31090) Lucas Wilkinson 2025-12-21 21:32:40 -05:00
  • 9d701e90d8 [Doc] Clarify FP8 KV cache computation workflow (#31071) Steve Westerhouse 2025-12-21 18:41:37 -06:00
  • 06d490282f [NVFP4][Perf] Tune NVFP4 input quant kernel for small batch size (#30897) Michael Goin 2025-12-21 12:41:57 -05:00
  • b471092d3a [MoE Refactor][4/N] Marlin Fp8 Mk (#31036) Robert Shaw 2025-12-21 12:37:42 -05:00
  • 93cabc417c ci: add nvidia-smi warmup before Prime-RL integration test (#31093) Ameen Patel 2025-12-21 07:43:01 -08:00
  • bb80f69bc9 add aarnphm and chaunceyjiang to the new tool_parser directory (#31088) Chauncey 2025-12-21 11:24:34 +08:00
  • 3e92b2b7ac [BugFix]fix gpt-oss v1/completions response bug (#30608) 汪志鹏 2025-12-21 10:39:31 +08:00
  • 7c73ceb581 [Quantization] add marlin w4a8/w8a8 check (#31061) Jinzhen Lin 2025-12-21 05:58:11 +08:00
  • ae0770fa6b [CI] Fix H200 Distributed test (#31054) Lucas Wilkinson 2025-12-20 16:48:49 -05:00
  • ee52d9901d [Quantization] support logical_widths for fp8 marlin (#30962) Jinzhen Lin 2025-12-21 04:02:57 +08:00
  • 54c8924384 [MoE Refactor][5/N] Isolate zero expert to LongCatFlash (#28891) baonudesifeizhai 2025-12-20 13:22:04 -05:00
  • 560ae9638c [XPU] enable fp8 online streaming quantization (#30944) Yan Ma 2025-12-20 21:45:27 +08:00
  • 1501a4070e [Bugfix] Read truncate_prompt_tokens from pooling_params in AsyncLLM.encode() (#31013) Jeffrey Wang 2025-12-20 02:29:31 -08:00
  • ff2168bca3 [CI] FIx fixture 'siglip_attention_config' not found (#31053) Lucas Wilkinson 2025-12-19 22:46:15 -05:00
  • 0be149524c [ROCm][CI/Build] Update ROCm dockerfiles (#30991) Gregory Shtrasberg 2025-12-19 21:19:12 -06:00
  • d52c5096d7 [Bugfix] fix the alias bug of AttentionBackendEnum when register CUSTOM attention backend to vllm (#30869) zejunchen-zejun 2025-12-20 09:03:35 +08:00
  • 8a7a414374 GLM-4.7 Tool Parser and Doc Update (#30876) Yuxuan Zhang 2025-12-20 08:09:58 +08:00
  • 95befecc18 [MoE Refactor][2/N] Use Modular Kernels for Fp8 (#30825) Robert Shaw 2025-12-19 18:36:38 -05:00
  • 4cf9429897 [Bug] Fix error 'Dynamo failed to run FX node with fake tensors for Deepseek V3.2 (#31046) Wentao Ye 2025-12-19 18:31:31 -05:00
  • 83a317f650 [MoE Refactor][3/N] Deprecate cutlass block quant fp8 (b200) (#30990) Robert Shaw 2025-12-19 16:09:54 -05:00
  • 5f6477d1d0 [BugFix] Fix TypeError: unhashable type: 'dict' when serving deepseek32 (#30924) Lucas Wilkinson 2025-12-19 16:07:54 -05:00
  • 3bd8335bd0 [Refactor] Refactor for DeepGemmQuantScaleFMT using cache (#30898) Wentao Ye 2025-12-19 15:50:39 -05:00
  • 1ab5213531 Make engine core client handshake timeout configurable (#27444) Seiji Eicher 2025-12-19 12:38:30 -08:00
  • 969bbc7c61 [Model] Add MiMo-V2-Flash support (#30836) Zhonghua Deng 2025-12-20 01:17:03 +08:00
  • 268a972c62 Update Pytorch version update docs (#30982) Andrey Talman 2025-12-19 11:08:53 -05:00
  • 5fbfa8d9ef [Quantization] fix marlin w8a8 check (#30961) Jinzhen Lin 2025-12-19 23:33:22 +08:00
  • 23a1946e3b [CustomOp][Refactor] Extract common methods for ApplyRotaryEmb CustomOp (#31021) Shanshan Shen 2025-12-19 22:16:09 +08:00
  • b5545d9d5c [Bugfix] [Kernel] Triton attention kernels: mask out V blocks that fall outside sliding window (#30887) Thomas Parnell 2025-12-19 14:39:54 +01:00
  • bd2b52fc2d [CPU][Bugfix] Fix ppc64le CPU build (#30871) Nishidha Panpaliya 2025-12-19 17:56:35 +05:30
  • 420ba2dbb6 Enable aarch64 CPU performance benchmarks (#26494) Li, Jiang 2025-12-19 20:16:18 +08:00
  • 455949675d [Frontend][Bug] allow tool calls in analysis channel (#28139) Marko Rosenmueller 2025-12-19 11:47:44 +01:00
  • 086b96339f [Bugfix] Add validation for tool requests when tool_parser is unavailable (#30613) lif 2025-12-19 18:23:28 +08:00
  • 9187de9fac [Quantization] enable compressed-tensors marlin support for turing (2) (#31008) Jinzhen Lin 2025-12-19 16:56:35 +08:00
  • ac1c934276 [Bugfix] Fix incorrect tiles creation for mm prefix triton attention (#30974) Isotr0py 2025-12-19 16:00:33 +08:00
  • 4924ac582c Add hidden dimension validation for multimodal embedding inputs (#30968) Wenqi Glantz 2025-12-19 02:59:36 -05:00
  • 096b25c9ed [Doc][CPU] Fix index link for CPU regular release wheels (#31015) Li, Jiang 2025-12-19 15:29:52 +08:00
  • de08b8f61b [Quantization] enable compressed-tensors marlin support for turing (#31000) Jinzhen Lin 2025-12-19 12:29:48 +08:00
  • 2ac85a4544 [BugFix] Fix logprobs with spec decode and modified logits (#30846) Nick Hill 2025-12-18 19:58:28 -08:00
  • 7b43db210c [ROCm][CI][Bugfix] Multi-Modal Model Support Fixes and Attention Backend Improvements (#30270) Andreas Karatzas 2025-12-18 20:17:27 -06:00
  • 6a09612b2e [Bugfix] Fix tool_choice="none" being ignored by GPT-OSS/harmony models (#30867) (tag: v0.14.0rc0) PlatinumGod 2025-12-19 09:34:27 +08:00
  • 45c0526ac9 [BugFix] Handle errors when preprocessing added requests (#30895) Nick Hill 2025-12-18 17:29:11 -08:00
  • d6b3d39b6d [Cleanup] Refactor FlashInferMetadataBuilder (#29128) Benjamin Chislett 2025-12-18 17:45:30 -05:00
  • 6ca74bc11a [NIXL][BUG FIX] Fix both failing issue and accuracy issue with nixl + host_buffer on CUDA (#30419) Chendi.Xue 2025-12-18 16:10:02 -06:00
  • 72506c9834 Check for truthy rope_parameters not the existence of it (#30983) (tag: v0.13.0) Harry Mellor 2025-12-18 21:59:10 +00:00
  • b2eb84de77 [Bugfix] Remove tile_size=64 for mm_prefix triton attention (#30973) Isotr0py 2025-12-19 03:42:32 +08:00
  • ac43367ced adds jais 2 support (#30188) sarathc-cerebras 2025-12-18 21:16:58 +05:30
  • 30fe765e9f [Fix][FlexAttention] return max logical block index to handle reused blocks (#30915) Yifan Qiao 2025-12-17 22:42:21 -08:00
  • 19c583398a Check for truthy rope_parameters not the existence of it (#30983) Harry Mellor 2025-12-18 21:59:10 +00:00
  • b0b77c4655 [BugFix] Fix spec decode + structured outputs + preemption edge case (#30916) Nick Hill 2025-12-18 12:59:55 -08:00
  • 634a14bd7d Strengthen input validation and tests for 'parse_raw_prompts’. (#30652) Kayvan Mivehnejad 2025-12-18 14:51:58 -05:00
  • 24b65eff0d [BugFix] Spec decode with VLLM_ENABLE_V1_MULTIPROCESSING=0 (#30319) Chen Zhang 2025-12-18 11:47:56 -08:00
  • 41b6f9200f Remove all2all backend envvar (#30363) Elizabeth Thomas 2025-12-18 13:46:28 -06:00
  • 97000a2be7 [Bug] Fix compressed tensor not using deepgemm (#30820) Wentao Ye 2025-12-18 14:45:55 -05:00
  • d2dc5dfc6e [Bugfix] Remove tile_size=64 for mm_prefix triton attention (#30973) Isotr0py 2025-12-19 03:42:32 +08:00
  • b8c477c115 tuned fused configs for B300 (#30629) navmarri14 2025-12-18 11:41:59 -08:00
  • 53ad423f26 [Perf] enable flashinfer rotary_embedding custom ops in DeepSeek rotary (#30729) jiahanc 2025-12-18 11:31:18 -08:00
  • 889f8bb250 [BugFix]Reclaim resources to prevent memory leaks when use LMCacheMPConnector (#30745) wz1qqx 2025-12-19 03:09:51 +08:00
  • 058926d48c [XPU] allow custom workers (e.g. vllm-omni workers) to be used on XPU (#30935) Fanli Lin 2025-12-19 02:16:36 +08:00
  • 700a5ad6c6 [MM Encoder]: Migrate legacy ViT MultiHeadAttention to new MMEncoderAttention interface (#30684) Isotr0py 2025-12-19 02:04:19 +08:00
  • 62be3670cb [BugFix] Add sleep to fix tight loop and release GIL (#29476) Alec 2025-12-18 09:52:55 -08:00
  • 500f26e6d3 [Bugfix] fix DP-aware routing in OpenAI API requests (#29002) inkcherry 2025-12-19 01:50:42 +08:00
  • 686cbaac64 [Cleanup] Remove unused ModelRunner V1 InputBatch.num_tokens field (#30218) Nick Hill 2025-12-18 09:17:00 -08:00
  • f4ee2c3d90 fix fp8 online quantization streaming with tp > 1 (#30900) Vasiliy Kuznetsov 2025-12-18 11:45:15 -05:00
  • 9a5e96523b [LoRA] Set default MXFP4 LoRA backend to Marlin (#30598) Xin Yang 2025-12-18 08:42:22 -08:00