Commit Graph

  • 7c38ed0f1c [Frontend] split append tool output (#28333) Andrew Xia 2025-11-12 20:03:23 -08:00
  • a1d3866dda [n-gen] DO NOT repeatedly return finished child requests (#28591) Jialin Ouyang 2025-11-12 19:36:07 -08:00
  • 97d1c99302 Rename clashing method names for vLLM model protocol (#27583) Harry Mellor 2025-11-13 03:14:33 +00:00
  • 3226283461 [Docs] Add some details about what the MoE block needs for the Transformers backend (#28588) Harry Mellor 2025-11-13 03:12:14 +00:00
  • 8832fff972 [BugFix] Fix mm_encoder_attn_backend arg type checking (#28599) Nick Hill 2025-11-12 19:06:03 -08:00
  • a543e678b4 [Bugfix] Fix SM100 gpt-oss regression due to faulty attn sink support (#28561) Michael Goin 2025-11-12 21:40:59 -05:00
  • 2dacd57394 [platform] Move get_cu_count to utils (#27005) wangxiyuan 2025-11-13 08:48:47 +08:00
  • d75ad04818 [ROCm][Bugfix] Revert removing setuptools version restriction (#28592) Gregory Shtrasberg 2025-11-12 19:46:58 -05:00
  • 52eadcec9e [Docs] Update meetups.md description (#28583) Michael Goin 2025-11-12 19:00:23 -05:00
  • 51c599f0ec Skip models that cannot currently init on Transformers v5 (#28471) Harry Mellor 2025-11-12 23:43:57 +00:00
  • 69d0e90313 [MoE][Kernel][Perf] Improve Shared Expert Stream Overlap (#28406) Alexander Matveev 2025-11-12 18:37:24 -05:00
  • 4ca5cd5740 [Core][AMD] Migrate fully transparent sleep mode to ROCm platform (#12695) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-11-13 01:24:12 +02:00
  • 10f01d5a3a [Bugfix] Adjust Marlin CUDA arch selection to 8.0+PTX;9.0+PTX (#28294) Michael Goin 2025-11-12 18:14:13 -05:00
  • 3eb0c2673e [TPU] Support GCS path in VLLM_TORCH_PROFILER_DIR (#28487) QiliangCui 2025-11-12 14:31:14 -08:00
  • d8140b9833 [ROCM] Fix ROCm warnings, environment flag access, and GEMM kernel naming for consistency in _aiter_ops.py (#28464) vllmellm 2025-11-13 05:46:57 +08:00
  • 74a9a9faad [Performance][B200] Fix deepgemm prologue (#27897) Varun Sundar Rabindranath 2025-11-12 16:13:03 -05:00
  • 478ee511de [Misc]Fix typo in llm_engine.py (#28584) Wei Wei 2025-11-12 12:59:43 -08:00
  • 58ce8d12b7 [BugFix] Priority scheduling and spec tokens preemption (#28558) Andy Lo 2025-11-12 20:29:21 +00:00
  • 94a9ebcf31 [KV connector][WIP] KV cache proxy based on LMCache multi-process mode (#27902) Yihua Cheng 2025-11-12 12:25:43 -08:00
  • a39dd7bb06 [CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken in current Transformers (#28559) Harry Mellor 2025-11-12 19:38:13 +00:00
  • 64d57c3be7 [Model] [Config] Correctly identify granite-4.0-micro as non-hybrid model (#28563) Thomas Parnell 2025-11-12 19:17:55 +01:00
  • a1e7fa362a [EPLB][ROCm]: support EPBL for ROCm backend (#27731) PerryZhang01 2025-11-13 02:16:35 +08:00
  • bac904565f Implement ARC KV cache eviction policy for CPU offloader (#27039) alberto 2025-11-12 17:51:39 +00:00
  • 304419576a [Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (#28479) Benjamin Chislett 2025-11-12 11:56:40 -05:00
  • a742134cc5 Remove deprecated fields from CompilationConfig (#27593) Harry Mellor 2025-11-12 16:10:28 +00:00
  • 728a9eb70e [Misc] Refactor Attention kv transfer methods into decorator (#27816) Nicolò Lucchesi 2025-11-12 17:05:44 +01:00
  • bc5bd45c7d [Refactor] Remove redundant TP gather/split in split_qkv in QwenVL (#28271) Canlin Guo 2025-11-12 23:56:47 +08:00
  • f76e85c299 [Performance][Hopper] Avoid M dim padding to 4x for most cases (due to cuda graphs paddings) (#28492) Alexander Matveev 2025-11-12 10:51:43 -05:00
  • 54aecd9ed5 Fix pre-commit (and XPU) on main (#28556) Harry Mellor 2025-11-12 14:13:41 +00:00
  • 10138c92a5 [V0 deprecation] Deprecate use_v1 parameter (#28112) wangxiyuan 2025-11-12 22:03:52 +08:00
  • a9d18b5107 [Bugfix] Fix gpt_oss packed_modules_mapping (#28536) Jee Jee Li 2025-11-12 21:02:06 +08:00
  • edb59a9470 [ROCm] [Bugfix] Fix fused_qknorm_rope_kernel rocm compatibility (#28500) TJian 2025-11-12 05:01:14 -08:00
  • c5f10cc139 add cpu option for p/d in nixl_connector (#28356) ZhengHongming888 2025-11-12 03:53:08 -08:00
  • d143152308 [KVConnector] Enable get_block_ids_with_load_errors() in LMCache connector (#27978) ziruiliu 2025-11-12 18:44:58 +08:00
  • a4730c1b4f [XPU]Fix crash due to removed VLLM_USE_V1 attribute (#28520) Chaojun Zhang 2025-11-12 18:20:55 +08:00
  • d3ade61e42 [Model] fix glm4_moe_mtp load weights with GLM-4.6 checkpoint. (#27597) wuyaoxuehun 2025-11-12 17:14:00 +07:00
  • 1761dea1a8 [BugFix]: --enable-lora with model granite-4.0-micro crash (#27733) yyzxw 2025-11-12 17:03:56 +08:00
  • c748355e0d [CI] Introduce autorun_on_main feature (#27836) Huamin Li 2025-11-12 00:51:19 -08:00
  • 91864b79b3 [CI/Build] Fix crash due to removed VLLM_USE_V1 attribute in EPD (#28521) Chenguang Zheng 2025-11-12 15:09:33 +08:00
  • ac0bb2c307 [Core] Cache vllm_is_batch_invariant (#28304) Lukas Geiger 2025-11-12 05:03:01 +00:00
  • f31419ed8b [Benchmark] Add retry support to fix workload bias in multi-turn benchmark (#28493) ai-jz 2025-11-11 21:00:45 -08:00
  • b9ce9a3013 [BugFix] Add fallback path in apply_rotary_pos_emb_flashattn for non-cuda platforms (#28447) Fanli Lin 2025-11-12 11:13:21 +08:00
  • 4ccffe561f [Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#25233) Chenguang Zheng 2025-11-12 10:58:33 +08:00
  • cbb799e314 [Model][Qwen3VL] Simplify get_mrope_input_positions using numpy (#28302) Lukas Geiger 2025-11-12 02:55:10 +00:00
  • 9f0247cfa4 VLLM_USE_TRITON_FLASH_ATTN V0 variable deprecation (#27611) Andreas Karatzas 2025-11-11 20:34:36 -06:00
  • 7f829be7d3 [CPU] Refactor CPU attention backend (#27954) Li, Jiang 2025-11-12 09:43:06 +08:00
  • e1710393c4 [[V0 deprecation]]Remove VLLM_USE_V1 env (#28204) wangxiyuan 2025-11-12 09:22:16 +08:00
  • 3f770f4427 [Performance] Cache loaded custom logitsprocs to avoid overheads (#28462) Isotr0py 2025-11-12 08:49:29 +08:00
  • 48c879369f [Frontend] Change CompilationMode to a proper Enum (#28165) Yanan Cao 2025-11-11 16:46:18 -08:00
  • 1788aa1efb [BugFix] Graceful handling of torch symm mem errors. (#27671) Ilya Markov 2025-11-12 01:41:54 +01:00
  • d23539549a Use FLASHINFER MLA backend when testing fp8_kv_scale_compile (#28491) Adrian Abeyta 2025-11-11 18:34:58 -06:00
  • 412e153df5 [Feature] Allow configuring FlashInfer workspace size (#28269) Max Hu 2025-11-11 18:32:20 -05:00
  • e5f599d4d1 [Bugfix] Disable shared expert overlap if Marlin MoE is used (#28410) Michael Goin 2025-11-11 18:16:12 -05:00
  • 28534b92b9 Add Zurich vLLM Meetup (#28488) Michael Goin 2025-11-11 17:53:59 -05:00
  • d4902ba56d [Misc] Cleanup Executor interface (#28441) wangxiyuan 2025-11-12 06:28:07 +08:00
  • df4d3a44a8 [TPU] Rename path to tpu platform (#28452) Kyuyeun Kim 2025-11-11 11:16:47 -08:00
  • 9d1c474704 [LoRA][1/N]Remove LoRA extra vocab (#28382) Jee Jee Li 2025-11-12 03:06:21 +08:00
  • 8c32c6e4b4 [Misc] fix typo in DCP comment (#28389) Jie Luo 2025-11-12 02:59:16 +08:00
  • de120bc94f [V0 deprecation] Clean up num_prefill_tokens logic for V0 (#28203) Canlin Guo 2025-11-12 02:57:12 +08:00
  • 4228be7959 [Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhead (#28245) Jialin Ouyang 2025-11-11 10:28:47 -08:00
  • 76e4dcf225 [Misc] Remove unused attention prefix prefill ops functions (#26971) Lukas Geiger 2025-11-11 18:26:04 +00:00
  • d5edcb8678 [BugFix] Fix Siglip2Attention on XPU (#28448) Fanli Lin 2025-11-12 02:18:02 +08:00
  • 6c3c0f8235 [Kernel] Optimize rms_norm kernel (#27931) Xin Yang 2025-11-11 10:02:23 -08:00
  • 684f254585 Prefer FlashAttention MLA as default over FlashMLA (#27363) Matthew Bonanni 2025-11-11 11:13:51 -06:00
  • e553424919 [CI/Build] Refactor Attention backend for test_prefix_prefill from xformers to SDPA (#28424) Zhewen Li 2025-11-11 09:09:47 -08:00
  • 5a1271d83a [Quantization] fix attention quantization of gpt_oss model (#27334) xuebwang-amd 2025-11-12 01:06:00 +08:00
  • 05576df85c [ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model (#24239) xuebwang-amd 2025-11-12 01:05:22 +08:00
  • 68c09efc37 [Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (#27165) zhrrr 2025-11-12 01:00:31 +08:00
  • a7ef3eb0cd [NIXL] Generalize block-first backend layouts (FlashInfer-like) (#28282) Nicolò Lucchesi 2025-11-11 17:57:43 +01:00
  • f9a4087182 Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (#28431) Michael Goin 2025-11-11 09:46:04 -07:00
  • 287bbbeb06 [Doc] Fix typo in serving docs (#28474) the-codeboy 2025-11-11 17:45:49 +01:00
  • 3143eb23fc [BugFix] Add test_outputs.py to CI pipeline (#28466) usberkeley 2025-11-12 00:01:30 +08:00
  • b886068056 [BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU (#28444) Fanli Lin 2025-11-11 23:29:33 +08:00
  • a90ad7d838 Add @markmc to CODEOWNERS for Observability (#28457) Mark McLoughlin 2025-11-11 15:03:22 +00:00
  • 533b018f72 [BugFix] Fix Failing Ruff Check (#28469) jvlunteren 2025-11-11 15:41:43 +01:00
  • a1448b4b69 [Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (#28064) bnellnm 2025-11-11 09:29:02 -05:00
  • fa1970201d [Docs] Fix grammar in CPU installation guide (#28461) Maryam Tahhan 2025-11-11 14:01:11 +00:00
  • 3380543b20 Add request timeout override for multi-turn benchmarks (#28386) Ido Segev 2025-11-11 15:41:18 +02:00
  • afffd3cc8a [Model] Pass mm_features directly into get_mrope_input_positions (#28399) Cyrus Leung 2025-11-11 21:14:48 +08:00
  • 7dbe6d81d6 Fix Fused MoE LoRA Triton kernel bug (#28450) Chaojun Zhang 2025-11-11 20:46:47 +08:00
  • b30dfa03c5 [Attention] Refactor CUDA attention backend selection logic (#24794) Matthew Bonanni 2025-11-11 06:40:44 -06:00
  • 2e78150d24 [CI] Add mergify rules for nvidia label (#28417) Michael Goin 2025-11-11 05:28:28 -07:00
  • d381eb967f Multi turn benchmark progress bar for synthetic conversation generation (#28394) Ido Segev 2025-11-11 13:06:04 +02:00
  • 9973e6e04a [Model][Qwen3VL] Slighly speedup fast_pos_embed_interpolate (#28434) Lukas Geiger 2025-11-11 10:35:10 +00:00
  • c7991269dd [BugFix] 'DeepseekV2Config' object has no attribute 'use_mla'` (#28387) Fanli Lin 2025-11-11 16:45:38 +08:00
  • f0359fffa4 [Bugfix] fix qwen3-next crash (#28202) Jiangyun Zhu 2025-11-11 16:24:28 +08:00
  • 798c7bebca [EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB (#28369) Sage Moore 2025-11-11 00:19:51 -08:00
  • 4fd4b743a2 [Bugfix] Fix max image size for PaddleOCR-VL (#28442) Roger Wang 2025-11-11 00:07:24 -08:00
  • cc079763c5 [BugFix] Avoid calling KV connector layer APIs when metadata is unset (#28253) David Ben-David 2025-11-11 09:39:36 +02:00
  • a7adbc6c6b [Doc] Sleep mode documentation (#28357) iAmir97 2025-11-11 13:44:35 +07:00
  • e605e8e323 [Bugfix] Fix Stream Sync for Shared Expert Overlap (#28430) Robert Shaw 2025-11-11 00:59:08 -05:00
  • bca74e32b7 [Frontend] Add sagemaker_standards dynamic lora adapter and stateful session management decorators to vLLM OpenAI API server (#27892) Zuyi Zhao 2025-11-10 20:57:01 -08:00
  • 8d706cca90 [Misc] FlattenLogprobs -> FlatLogprobs (#28335) Zhuohan Li 2025-11-10 19:41:23 -08:00
  • 57201a6a4c Fix rotary embedding benchmark script (#28323) Xin Yang 2025-11-10 18:57:12 -08:00
  • f2d9ad0620 Only register rocm_aiter_ops if aiter is found (#28428) Michael Goin 2025-11-10 19:53:24 -07:00
  • de540c0354 [Feature] Add env var VLLM_MOE_USE_DEEP_GEMM (#28422) Wentao Ye 2025-11-10 21:29:48 -05:00
  • 39029d5192 [CI/Test Fix] Fix CP tests on Blackwell (#28404) Lucas Wilkinson 2025-11-10 20:36:29 -05:00
  • 35d801f13f [Feature] Refactor batch invariant fp8 DeepGEMM (#27606) Wentao Ye 2025-11-10 19:08:40 -05:00
  • 0bf29fadf5 [Test] Remove old non-varlen FA2 test (#28420) Matthew Bonanni 2025-11-10 17:57:41 -06:00
  • a5a790eea6 [Bugfix] Ensure calculated KV scales are applied in attention. (#27232) Adrian Abeyta 2025-11-10 17:42:37 -06:00