Commit Graph - vllm - Gitea: Git with a cup of tea

biondizzle/vllm

Commit Graph

Branches and tags

cmm

main

ci/build/22474

submission

v0.1.0

v0.1.1

v0.1.2

v0.1.3

v0.1.4

v0.1.5

v0.1.6

v0.1.7

v0.10.0

v0.10.0rc1

v0.10.0rc2

v0.10.1

v0.10.1.1

v0.10.1rc1

v0.10.2

v0.10.2rc1

v0.10.2rc2

v0.10.2rc3

v0.11.0

v0.11.0rc1

v0.11.0rc2

v0.11.0rc3

v0.11.0rc4

v0.11.0rc5

v0.11.0rc6

v0.11.1

v0.11.1rc0

v0.11.1rc1

v0.11.1rc2

v0.11.1rc3

v0.11.1rc4

v0.11.1rc5

v0.11.1rc6

v0.11.1rc7

v0.11.2

v0.12.0

v0.13.0

v0.13.0rc1

v0.13.0rc2

v0.13.0rc3

v0.13.0rc4

v0.14.0

v0.14.0rc0

v0.14.0rc1

v0.14.0rc2

v0.14.1

v0.15.0

v0.15.0rc0

v0.15.0rc1

v0.15.0rc2

v0.15.0rc3

v0.15.1

v0.15.1rc0

v0.15.1rc1

v0.15.2rc0

v0.16.0

v0.16.0rc0

v0.16.0rc1

v0.16.0rc2

v0.16.0rc3

v0.16.1rc0

v0.17.0

v0.17.0rc0

v0.17.0rc1

v0.17.1

v0.17.1rc0

v0.17.2rc0

v0.18.0

v0.18.0rc0

v0.18.0rc1

v0.18.0rc2

v0.18.1

v0.18.1rc0

v0.18.2rc0

v0.19.0

v0.19.0rc0

v0.19.0rc1

v0.19.1rc0

v0.2.0

v0.2.1

v0.2.1.post1

v0.2.2

v0.2.3

v0.2.4

v0.2.5

v0.2.6

v0.2.7

v0.3.0

v0.3.1

v0.3.2

v0.3.3

v0.4.0

v0.4.0.post1

v0.4.1

v0.4.2

v0.4.3

v0.5.0

v0.5.0.post1

v0.5.1

v0.5.2

v0.5.3

v0.5.3.post1

v0.5.4

v0.5.5

v0.6.0

v0.6.1

v0.6.1.post1

v0.6.1.post2

v0.6.2

v0.6.3

v0.6.3.post1

v0.6.4

v0.6.4.post1

v0.6.5

v0.6.6

v0.6.6.post1

v0.7.0

v0.7.1

v0.7.2

v0.7.3

v0.8.0

v0.8.0rc1

v0.8.0rc2

v0.8.1

v0.8.2

v0.8.3

v0.8.3rc1

v0.8.4

v0.8.5

v0.8.5.post1

v0.9.0

v0.9.0.1

v0.9.1

v0.9.1rc1

v0.9.1rc2

v0.9.2

v0.9.2rc1

v0.9.2rc2

a89209b78d [v1] Support mamba2 (#19327) Chen Zhang 2025-06-19 04:34:15 +08:00
ffacb222cb [Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc (#19808) Russell Bryant 2025-06-18 16:22:28 -04:00
12575cfa7a [Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully (#19725) Chauncey 2025-06-19 01:26:16 +08:00
8b6e1d639c [Hardware][AMD] integrate aiter chunked prefill into vllm (#18596) Zzz9990 2025-06-18 23:46:51 +08:00
735a9de71f [Qwen] Add tagging rule for Qwen related PRs (#19799) Lu Fang 2025-06-18 22:26:43 +08:00
257ab95439 [Platform] Allow platform use V1 Engine by default (#19792) wangxiyuan 2025-06-18 21:03:36 +08:00
cca91a7a10 [doc] fix the incorrect label (#19787) Reid 2025-06-18 18:30:58 +08:00
f04d604567 [Minor] Zero-initialize attn output buffer (#19784) Woosuk Kwon 2025-06-17 23:59:27 -07:00
19a53b2783 [V1] Decouple GPU and TPU InputBatch (#19778) afeldman-nm 2025-06-18 02:38:13 -04:00
eccdc8318c [V1][P/D] An native implementation of xPyD based on P2P NCCL (#18242) Zhonghua Deng 2025-06-18 14:32:36 +08:00
5f52a84685 [V1] Add API docs for EncoderCacheManager (#19294) Russell Bryant 2025-06-18 01:37:01 -04:00
d4629dc43f [Misc] Add __str__ for RequestStatus (#19780) lkchen 2025-06-17 20:03:01 -07:00
6e9cc73f67 [MISC] correct DeviceConfig device field static type analysis (#19699) Ning Xie 2025-06-18 08:21:50 +08:00
c53711bd63 [MISC] correct copy_blocks src_to_dists param type (#19696) Ning Xie 2025-06-18 08:21:06 +08:00
dac8cc49f4 [TPU] Update torch version to include paged attention kernel change (#19706) Chenyaaang 2025-06-17 15:24:49 -07:00
a44b1c951d [Feature][ROCm] Add full graph capture support for TritonAttentionBackend (#19158) Charlie Fu 2025-06-17 16:03:06 -05:00
b447624ee3 [Bugfix] Fix faulty triton importing logic when using Ray for DP (#19734) Michael Goin 2025-06-18 05:59:29 +09:00
cda92307c1 [Misc] Update lmcache connector with the latest connector apis (#19441) Jiayi Yao 2025-06-17 12:57:54 -07:00
bf57ccc5c2 Remove sm120 arch from sm100 cutlass kernel arch list (#19716) Michael Goin 2025-06-18 03:49:39 +09:00
ffb2cd6b54 [Perf] Optimize moe_align_block_size CUDA kernel (#19572) Wentao Ye 2025-06-17 14:49:26 -04:00
ca94d7fa00 [Bugfix] Update multimodel models mapping to fit new checkpoint after Transformers v4.52 (#19151) Isotr0py 2025-06-17 23:58:38 +08:00
5a1c2e15d8 [Mis] remove duplicate engine status checks (#19647) CYJiang 2025-06-17 23:17:38 +08:00
4c8f64faa7 [V1][Kernel] Flashinfer HND KV cache layout (#19280) Nicolò Lucchesi 2025-06-17 15:09:22 +02:00
93aee29fdb [doc] split "Other AI Accelerators" tabs (#19708) David Xia 2025-06-17 09:05:29 -04:00
154d063b9f [doc][mkdocs] Add edit button to documentation (#19637) Reid 2025-06-17 19:10:31 +08:00
ccd7c05089 [Kernel] Add Split-KV Support to Unified Triton Attention Kernel (#19152) jvlunteren 2025-06-17 12:45:07 +02:00
c48c6c4008 Add a doc on how to update PyTorch version (#19705) Huy Do 2025-06-17 03:10:37 -07:00
aed8468642 [Doc] Add missing llava family multi-image examples (#19698) Isotr0py 2025-06-17 15:05:21 +08:00
5c76b9cdaf [Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager (#19686) quanliu 2025-06-17 12:40:58 +08:00
ddfed314f9 Fixes IMA for TP w/ flex-attention (#19712) Driss Guessous 2025-06-16 21:01:50 -07:00
5b3ad5ecf2 [DOC] fix doc typos (#19600) Di Liu 2025-06-17 11:34:53 +08:00
ede5c4ebdf [Frontend] add chunking audio for > 30s audio (#19597) nguyenhoangthuan99 2025-06-17 10:34:00 +07:00
07334959d8 [Wheel Size] Only build FA2 8.0+PTX (#19336) Lucas Wilkinson 2025-06-16 23:32:49 -04:00
119f683949 [doc] add project flag to gcloud TPU command (#19664) David Xia 2025-06-16 21:00:09 -04:00
0860087aff [Fix] Fall back to Gloo when NCCL backend is unavailable (#19641) Conroy Cheers 2025-06-17 10:42:14 +10:00
6bc7b57315 [Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 (#19563) Dipika Sikka 2025-06-16 17:33:51 -04:00
90f9c2eb5c [V1] Change return type on get_multimodal_embeddings() (#19446) Russell Bryant 2025-06-16 13:32:15 -04:00
387bdf0ab9 [Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) (#19677) qscqesze 2025-06-17 00:47:14 +08:00
5e5baa91aa [Kernels] Use empty for modular MoE workspaces (#19667) bnellnm 2025-06-16 10:58:01 -04:00
836d4ce140 [Bugfix] fix missing 'finish_reason': null in streaming chat (#19662) Chauncey 2025-06-16 22:10:39 +08:00
c3fec47bb7 [MISC] bump huggingface_hub pkg to 0.33.0 (#19547) Ning Xie 2025-06-16 20:22:28 +08:00
1173804dca [Bugfix] Fix TP inference for Flex attention backend (#19657) Isotr0py 2025-06-16 19:21:37 +08:00
4d5424029b [Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. (#19652) Shawn Tan 2025-06-16 07:14:18 -04:00
3e7506975c [DOC] Add reasoning capability to vLLM streamlit code (#19557) Navanit Dubey 2025-06-16 16:39:12 +05:30
ee35e96ac3 [BugFix] Don't catch BaseException when dumping execute_model errors (#19626) Nick Hill 2025-06-16 04:01:08 -07:00
dec66d253b [Kernel] GGUF MMVQ kernel for multiple input vectors (#18754) Szymon Ożóg 2025-06-16 11:33:26 +02:00
8d120701fd [Docs] Move multiproc doc to v1 dir (#19651) Russell Bryant 2025-06-16 05:10:12 -04:00
f40f763f12 [CI] Add mteb testing for rerank models (#19344) wang.yuqi 2025-06-16 16:36:43 +08:00
26bc46ef89 [MISC] typo fix (#19672) Ning Xie 2025-06-16 15:18:49 +08:00
a77aea59fd [TPU] support attention head dim smaller than 128 (#19620) Chengji Yao 2025-06-15 23:40:53 -07:00
b692e9cd07 [Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config (#19660) Ye (Charlotte) Qi 2025-06-15 23:30:29 -07:00
367871a469 [Misc][Frontend] passthrough bad_words (#19564) Francesco Bertolotti 2025-06-16 07:05:13 +02:00
92183b41f3 [Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker (#18957) quanliu 2025-06-16 12:56:37 +08:00
c6703d1e0d [MISC] Remove unused variableds in C++ (#19609) Lu Fang 2025-06-16 11:05:28 +08:00
a5e7242d5f [Misc] Remove duplicate multiproc method setting for CPU platform (#19649) Isotr0py 2025-06-16 10:26:58 +08:00
91b2c17a55 [CI/Build] Fix torch nightly CI dependencies part 2 (#19589) Richard Zou 2025-06-15 08:01:10 -04:00
055915e6ce Enable prefix caching with full cuda graphs (#19617) Woosuk Kwon 2025-06-15 01:05:05 -07:00
3d330c4c09 [Benchmark] Refactor benchmark script for fp8 & int8 (#19627) Wentao Ye 2025-06-15 03:15:37 -04:00
0b73736a0d [Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check (#19339) 22quinn 2025-06-14 22:43:48 -07:00
ee1531bc38 [Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness (#19644) Lu Fang 2025-06-15 12:15:41 +08:00
e13945f9dd [Perf] Further tunings for SM100 FP8 CUTLASS kernel (#19566) Ilya Markov 2025-06-15 02:25:10 +02:00
08500011d3 [Fix] Convert kv_transfer_config from dict to KVTransferConfig (#19262) maobaolong 2025-06-15 03:32:07 +08:00
861a0a0a39 [Bugfix] Don't attempt to use triton if no driver is active (#19561) Konrad Zawora 2025-06-14 21:30:54 +02:00
bc956b38d0 Only build CUTLASS MoE kernels on Hopper (#19648) Huy Do 2025-06-14 11:44:15 -07:00
294fc1e2c9 [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization (#19500) jiahanc 2025-06-14 09:34:28 -07:00
2db9044ab6 [Bugfix] Fix auto dtype casting for BatchFeature (#19316) Isotr0py 2025-06-14 23:13:08 +08:00
6fa718a460 [Misc] Modularize CLI Argument Parsing in Benchmark Scripts (#19593) Reid 2025-06-14 16:54:52 +08:00
06be858828 [Bugfix] Fix the speculative decoding test by setting the target dtype (#19633) Lu Fang 2025-06-14 11:57:32 +08:00
d1e34cc9ac [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. (#18354) Saheli Bhattacharjee 2025-06-14 04:07:36 +01:00
bd517eb9fe [BugFix] Fix DP Coordinator incorrect debug log message (#19624) Nick Hill 2025-06-13 17:18:03 -07:00
d65668b4e8 Adding "AMD: Multi-step Tests" to amdproduction. (#19508) Concurrensee 2025-06-13 19:08:51 -05:00
aafbbd981f [torch.compile] Use custom ops when use_inductor=False (#19618) Woosuk Kwon 2025-06-13 15:05:54 -07:00
0f0874515a [Doc] Add troubleshooting section to k8s deployment (#19377) Anna Pendleton 2025-06-13 14:47:51 -07:00
3597b06a4f [CUDA] Enable full cudagraph for FlashMLA (#18581) Luka Govedič 2025-06-13 14:12:26 -04:00
1015296b79 [doc][mkdocs] fix the duplicate Supported features sections in GPU docs (#19606) Reid 2025-06-14 00:25:08 +08:00
ce9dc02c93 [Refactor] Remove unused variables in moe_permute_unpermute_kernel.inl (#19573) Wentao Ye 2025-06-13 09:12:15 -04:00
a24cb91600 [Model] Fix minimax model cache & lm_head precision (#19592) qscqesze 2025-06-13 20:08:20 +08:00
7e8d97dd3f [BugFix] Honor enable_caching in connector-delayed kvcache load case (#19435) Nick Hill 2025-06-13 02:46:32 -07:00
d70bc7c029 [torch.compile] reorganize the cache directory to support compiling multiple models (#19064) youkaichao 2025-06-13 15:23:25 +08:00
ce688ad46e use base version for version comparison (#19587) Boyuan Feng 2025-06-13 00:09:34 -07:00
cefdb9962d [Fix] The zip function in Python 3.9 does not have the strict argument (#19549) 汪志鹏 2025-06-13 14:57:48 +08:00
ace5cdaff0 [Fix] bump mistral common to support magistral (#19533) 汪志鹏 2025-06-13 13:28:12 +08:00
6458721108 [CPU] Refine default config for the CPU backend (#19539) Li, Jiang 2025-06-13 13:27:39 +08:00
bb4a0decef [Misc] Correct broken docs link (#19553) Hyogeun Oh (오효근) 2025-06-13 14:27:13 +09:00
c707cfc12e [doc] fix incorrect link (#19586) Reid 2025-06-13 12:26:09 +08:00
7b3c9ff91d [Doc] uses absolute links for structured outputs (#19582) Aaron Pham 2025-06-12 23:35:17 -04:00
c68698b326 [Bugfix] Fix EAGLE vocab embedding for multimodal target model (#19570) qizixi 2025-06-12 20:09:19 -07:00
e3b12667d4 [BugFix] : Fix Batched DeepGemm Experts (#19515) Varun Sundar Rabindranath 2025-06-12 22:43:02 -04:00
e6aab5de29 Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" (#19378) kourosh hakhamaneshi 2025-06-12 17:26:40 -07:00
c57bb199b3 [V1] Resolve failed concurrent structured output requests (#19565) Russell Bryant 2025-06-12 19:30:09 -04:00
dba68f9159 [Doc] Unify structured outputs examples (#18196) Aaron Pham 2025-06-12 18:50:31 -04:00
a3319f4f04 [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant (#19452) Michael Goin 2025-06-12 15:39:15 -04:00
9d880f594d [Misc] Turn MOE_DP_CHUNK_SIZE into an env var (#19506) Varun Sundar Rabindranath 2025-06-12 14:01:16 -04:00
017ef648e9 [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets (#18847) Ekagra Ranjan 2025-06-12 13:30:56 -04:00
4b25ab14e2 [doc] Make top navigation sticky (#19540) Reid 2025-06-12 23:48:11 +08:00
f98548b9da [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass (#16756) Luka Govedič 2025-06-12 11:31:04 -04:00
96846bb360 Fix TorchAOConfig skip layers (#19265) mobicham 2025-06-12 16:22:53 +02:00
b6efafd9e4 [Perf] Vectorize static / dynamic INT8 quant kernels (#19233) Wentao Ye 2025-06-12 09:51:41 -04:00
1129e2b1ab [V1][NixlConnector] Drop num_blocks check (#19532) Nicolò Lucchesi 2025-06-12 14:36:14 +02:00
c742438f8b [Doc] Add V1 column to supported models list (#19523) Cyrus Leung 2025-06-12 19:16:44 +08:00
