Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

7c38ed0f1c [Frontend] split append tool output (#28333) Andrew Xia 2025-11-12 20:03:23 -08:00
a1d3866dda [n-gen] DO NOT repeatedly return finished child requests (#28591) Jialin Ouyang 2025-11-12 19:36:07 -08:00
97d1c99302 Rename clashing method names for vLLM model protocol (#27583) Harry Mellor 2025-11-13 03:14:33 +00:00
3226283461 [Docs] Add some details about what the MoE block needs for the Transformers backend (#28588) Harry Mellor 2025-11-13 03:12:14 +00:00
8832fff972 [BugFix] Fix mm_encoder_attn_backend arg type checking (#28599) Nick Hill 2025-11-12 19:06:03 -08:00
a543e678b4 [Bugfix] Fix SM100 gpt-oss regression due to faulty attn sink support (#28561) Michael Goin 2025-11-12 21:40:59 -05:00
2dacd57394 [platform] Move get_cu_count to utils (#27005) wangxiyuan 2025-11-13 08:48:47 +08:00
d75ad04818 [ROCm][Bugfix] Revert removing setuptools version restriction (#28592) Gregory Shtrasberg 2025-11-12 19:46:58 -05:00
52eadcec9e [Docs] Update meetups.md description (#28583) Michael Goin 2025-11-12 19:00:23 -05:00
51c599f0ec Skip models that cannot currently init on Transformers v5 (#28471) Harry Mellor 2025-11-12 23:43:57 +00:00
69d0e90313 [MoE][Kernel][Perf] Improve Shared Expert Stream Overlap (#28406) Alexander Matveev 2025-11-12 18:37:24 -05:00
4ca5cd5740 [Core][AMD] Migrate fully transparent sleep mode to ROCm platform (#12695) ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 2025-11-13 01:24:12 +02:00
10f01d5a3a [Bugfix] Adjust Marlin CUDA arch selection to 8.0+PTX;9.0+PTX (#28294) Michael Goin 2025-11-12 18:14:13 -05:00
3eb0c2673e [TPU] Support GCS path in VLLM_TORCH_PROFILER_DIR (#28487) QiliangCui 2025-11-12 14:31:14 -08:00
d8140b9833 [ROCM] Fix ROCm warnings, environment flag access, and GEMM kernel naming for consistency in _aiter_ops.py (#28464) vllmellm 2025-11-13 05:46:57 +08:00
74a9a9faad [Performance][B200] Fix deepgemm prologue (#27897) Varun Sundar Rabindranath 2025-11-12 16:13:03 -05:00
478ee511de [Misc]Fix typo in llm_engine.py (#28584) Wei Wei 2025-11-12 12:59:43 -08:00
58ce8d12b7 [BugFix] Priority scheduling and spec tokens preemption (#28558) Andy Lo 2025-11-12 20:29:21 +00:00
94a9ebcf31 [KV connector][WIP] KV cache proxy based on LMCache multi-process mode (#27902) Yihua Cheng 2025-11-12 12:25:43 -08:00
a39dd7bb06 [CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken in current Transformers (#28559) Harry Mellor 2025-11-12 19:38:13 +00:00
64d57c3be7 [Model] [Config] Correctly identify granite-4.0-micro as non-hybrid model (#28563) Thomas Parnell 2025-11-12 19:17:55 +01:00
a1e7fa362a [EPLB][ROCm]: support EPBL for ROCm backend (#27731) PerryZhang01 2025-11-13 02:16:35 +08:00
bac904565f Implement ARC KV cache eviction policy for CPU offloader (#27039) alberto 2025-11-12 17:51:39 +00:00
304419576a [Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (#28479) Benjamin Chislett 2025-11-12 11:56:40 -05:00
a742134cc5 Remove deprecated fields from CompilationConfig (#27593) Harry Mellor 2025-11-12 16:10:28 +00:00
728a9eb70e [Misc] Refactor Attention kv transfer methods into decorator (#27816) Nicolò Lucchesi 2025-11-12 17:05:44 +01:00
bc5bd45c7d [Refactor] Remove redundant TP gather/split in split_qkv in QwenVL (#28271) Canlin Guo 2025-11-12 23:56:47 +08:00
f76e85c299 [Performance][Hopper] Avoid M dim padding to 4x for most cases (due to cuda graphs paddings) (#28492) Alexander Matveev 2025-11-12 10:51:43 -05:00
54aecd9ed5 Fix pre-commit (and XPU) on main (#28556) Harry Mellor 2025-11-12 14:13:41 +00:00
10138c92a5 [V0 deprecation] Deprecate use_v1 parameter (#28112) wangxiyuan 2025-11-12 22:03:52 +08:00
a9d18b5107 [Bugfix] Fix gpt_oss packed_modules_mapping (#28536) Jee Jee Li 2025-11-12 21:02:06 +08:00
edb59a9470 [ROCm] [Bugfix] Fix fused_qknorm_rope_kernel rocm compatibility (#28500) TJian 2025-11-12 05:01:14 -08:00
c5f10cc139 add cpu option for p/d in nixl_connector (#28356) ZhengHongming888 2025-11-12 03:53:08 -08:00
d143152308 [KVConnector] Enable get_block_ids_with_load_errors() in LMCache connector (#27978) ziruiliu 2025-11-12 18:44:58 +08:00
a4730c1b4f [XPU]Fix crash due to removed VLLM_USE_V1 attribute (#28520) Chaojun Zhang 2025-11-12 18:20:55 +08:00
d3ade61e42 [Model] fix glm4_moe_mtp load weights with GLM-4.6 checkpoint. (#27597) wuyaoxuehun 2025-11-12 17:14:00 +07:00
1761dea1a8 [BugFix]: --enable-lora with model granite-4.0-micro crash (#27733) yyzxw 2025-11-12 17:03:56 +08:00
c748355e0d [CI] Introduce autorun_on_main feature (#27836) Huamin Li 2025-11-12 00:51:19 -08:00
91864b79b3 [CI/Build] Fix crash due to removed VLLM_USE_V1 attribute in EPD (#28521) Chenguang Zheng 2025-11-12 15:09:33 +08:00
ac0bb2c307 [Core] Cache vllm_is_batch_invariant (#28304) Lukas Geiger 2025-11-12 05:03:01 +00:00
f31419ed8b [Benchmark] Add retry support to fix workload bias in multi-turn benchmark (#28493) ai-jz 2025-11-11 21:00:45 -08:00
b9ce9a3013 [BugFix] Add fallback path in apply_rotary_pos_emb_flashattn for non-cuda platforms (#28447) Fanli Lin 2025-11-12 11:13:21 +08:00
4ccffe561f [Core] Encoder separation for Encode-Prefill-Decode Disaggregation (#25233) Chenguang Zheng 2025-11-12 10:58:33 +08:00
cbb799e314 [Model][Qwen3VL] Simplify get_mrope_input_positions using numpy (#28302) Lukas Geiger 2025-11-12 02:55:10 +00:00
9f0247cfa4 VLLM_USE_TRITON_FLASH_ATTN V0 variable deprecation (#27611) Andreas Karatzas 2025-11-11 20:34:36 -06:00
7f829be7d3 [CPU] Refactor CPU attention backend (#27954) Li, Jiang 2025-11-12 09:43:06 +08:00
e1710393c4 [[V0 deprecation]]Remove VLLM_USE_V1 env (#28204) wangxiyuan 2025-11-12 09:22:16 +08:00
3f770f4427 [Performance] Cache loaded custom logitsprocs to avoid overheads (#28462) Isotr0py 2025-11-12 08:49:29 +08:00
48c879369f [Frontend] Change CompilationMode to a proper Enum (#28165) Yanan Cao 2025-11-11 16:46:18 -08:00
1788aa1efb [BugFix] Graceful handling of torch symm mem errors. (#27671) Ilya Markov 2025-11-12 01:41:54 +01:00
d23539549a Use FLASHINFER MLA backend when testing fp8_kv_scale_compile (#28491) Adrian Abeyta 2025-11-11 18:34:58 -06:00
412e153df5 [Feature] Allow configuring FlashInfer workspace size (#28269) Max Hu 2025-11-11 18:32:20 -05:00
e5f599d4d1 [Bugfix] Disable shared expert overlap if Marlin MoE is used (#28410) Michael Goin 2025-11-11 18:16:12 -05:00
28534b92b9 Add Zurich vLLM Meetup (#28488) Michael Goin 2025-11-11 17:53:59 -05:00
d4902ba56d [Misc] Cleanup Executor interface (#28441) wangxiyuan 2025-11-12 06:28:07 +08:00
df4d3a44a8 [TPU] Rename path to tpu platform (#28452) Kyuyeun Kim 2025-11-11 11:16:47 -08:00
9d1c474704 [LoRA][1/N]Remove LoRA extra vocab (#28382) Jee Jee Li 2025-11-12 03:06:21 +08:00
8c32c6e4b4 [Misc] fix typo in DCP comment (#28389) Jie Luo 2025-11-12 02:59:16 +08:00
de120bc94f [V0 deprecation] Clean up num_prefill_tokens logic for V0 (#28203) Canlin Guo 2025-11-12 02:57:12 +08:00
4228be7959 [Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhead (#28245) Jialin Ouyang 2025-11-11 10:28:47 -08:00
76e4dcf225 [Misc] Remove unused attention prefix prefill ops functions (#26971) Lukas Geiger 2025-11-11 18:26:04 +00:00
d5edcb8678 [BugFix] Fix Siglip2Attention on XPU (#28448) Fanli Lin 2025-11-12 02:18:02 +08:00
6c3c0f8235 [Kernel] Optimize rms_norm kernel (#27931) Xin Yang 2025-11-11 10:02:23 -08:00
684f254585 Prefer FlashAttention MLA as default over FlashMLA (#27363) Matthew Bonanni 2025-11-11 11:13:51 -06:00
e553424919 [CI/Build] Refactor Attention backend for test_prefix_prefill from xformers to SDPA (#28424) Zhewen Li 2025-11-11 09:09:47 -08:00
5a1271d83a [Quantization] fix attention quantization of gpt_oss model (#27334) xuebwang-amd 2025-11-12 01:06:00 +08:00
05576df85c [ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model (#24239) xuebwang-amd 2025-11-12 01:05:22 +08:00
68c09efc37 [Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (#27165) zhrrr 2025-11-12 01:00:31 +08:00
a7ef3eb0cd [NIXL] Generalize block-first backend layouts (FlashInfer-like) (#28282) Nicolò Lucchesi 2025-11-11 17:57:43 +01:00
f9a4087182 Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (#28431) Michael Goin 2025-11-11 09:46:04 -07:00
287bbbeb06 [Doc] Fix typo in serving docs (#28474) the-codeboy 2025-11-11 17:45:49 +01:00
3143eb23fc [BugFix] Add test_outputs.py to CI pipeline (#28466) usberkeley 2025-11-12 00:01:30 +08:00
b886068056 [BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU (#28444) Fanli Lin 2025-11-11 23:29:33 +08:00
a90ad7d838 Add @markmc to CODEOWNERS for Observability (#28457) Mark McLoughlin 2025-11-11 15:03:22 +00:00
533b018f72 [BugFix] Fix Failing Ruff Check (#28469) jvlunteren 2025-11-11 15:41:43 +01:00
a1448b4b69 [Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (#28064) bnellnm 2025-11-11 09:29:02 -05:00
fa1970201d [Docs] Fix grammar in CPU installation guide (#28461) Maryam Tahhan 2025-11-11 14:01:11 +00:00
3380543b20 Add request timeout override for multi-turn benchmarks (#28386) Ido Segev 2025-11-11 15:41:18 +02:00
afffd3cc8a [Model] Pass mm_features directly into get_mrope_input_positions (#28399) Cyrus Leung 2025-11-11 21:14:48 +08:00
7dbe6d81d6 Fix Fused MoE LoRA Triton kernel bug (#28450) Chaojun Zhang 2025-11-11 20:46:47 +08:00
b30dfa03c5 [Attention] Refactor CUDA attention backend selection logic (#24794) Matthew Bonanni 2025-11-11 06:40:44 -06:00
2e78150d24 [CI] Add mergify rules for nvidia label (#28417) Michael Goin 2025-11-11 05:28:28 -07:00
d381eb967f Multi turn benchmark progress bar for synthetic conversation generation (#28394) Ido Segev 2025-11-11 13:06:04 +02:00
9973e6e04a [Model][Qwen3VL] Slighly speedup fast_pos_embed_interpolate (#28434) Lukas Geiger 2025-11-11 10:35:10 +00:00
c7991269dd [BugFix] 'DeepseekV2Config' object has no attribute 'use_mla'` (#28387) Fanli Lin 2025-11-11 16:45:38 +08:00
f0359fffa4 [Bugfix] fix qwen3-next crash (#28202) Jiangyun Zhu 2025-11-11 16:24:28 +08:00
798c7bebca [EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB (#28369) Sage Moore 2025-11-11 00:19:51 -08:00
4fd4b743a2 [Bugfix] Fix max image size for PaddleOCR-VL (#28442) Roger Wang 2025-11-11 00:07:24 -08:00
cc079763c5 [BugFix] Avoid calling KV connector layer APIs when metadata is unset (#28253) David Ben-David 2025-11-11 09:39:36 +02:00
a7adbc6c6b [Doc] Sleep mode documentation (#28357) iAmir97 2025-11-11 13:44:35 +07:00
e605e8e323 [Bugfix] Fix Stream Sync for Shared Expert Overlap (#28430) Robert Shaw 2025-11-11 00:59:08 -05:00
bca74e32b7 [Frontend] Add sagemaker_standards dynamic lora adapter and stateful session management decorators to vLLM OpenAI API server (#27892) Zuyi Zhao 2025-11-10 20:57:01 -08:00
8d706cca90 [Misc] FlattenLogprobs -> FlatLogprobs (#28335) Zhuohan Li 2025-11-10 19:41:23 -08:00
57201a6a4c Fix rotary embedding benchmark script (#28323) Xin Yang 2025-11-10 18:57:12 -08:00
f2d9ad0620 Only register rocm_aiter_ops if aiter is found (#28428) Michael Goin 2025-11-10 19:53:24 -07:00
de540c0354 [Feature] Add env var VLLM_MOE_USE_DEEP_GEMM (#28422) Wentao Ye 2025-11-10 21:29:48 -05:00
39029d5192 [CI/Test Fix] Fix CP tests on Blackwell (#28404) Lucas Wilkinson 2025-11-10 20:36:29 -05:00
35d801f13f [Feature] Refactor batch invariant fp8 DeepGEMM (#27606) Wentao Ye 2025-11-10 19:08:40 -05:00
0bf29fadf5 [Test] Remove old non-varlen FA2 test (#28420) Matthew Bonanni 2025-11-10 17:57:41 -06:00
a5a790eea6 [Bugfix] Ensure calculated KV scales are applied in attention. (#27232) Adrian Abeyta 2025-11-10 17:42:37 -06:00